Medical Biostatistics, Fourth Edition
Abhaya Indrayan
Rajeev Kumar Malhotra
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-1-4987-9953-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Editor-in-Chief Shein-Chung Chow, Ph.D., Professor, Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina
Series Editors Byron Jones, Biometrical Fellow, Statistical Methodology, Integrated Information Sciences, Novartis Pharma AG, Basel, Switzerland
Jen-pei Liu, Professor, Division of Biometry, Department of Agronomy, National Taiwan University, Taipei, Taiwan
Karl E. Peace, Georgia Cancer Coalition, Distinguished Cancer Scholar, Senior Research Scientist and Professor of Biostatistics, Jiann-Ping Hsu College of Public Health, Georgia Southern University, Statesboro, Georgia
Bruce W. Turnbull, Professor, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, New York
Published Titles

Adaptive Design Methods in Clinical Trials, Second Edition, by Shein-Chung Chow and Mark Chang
Adaptive Designs for Sequential Treatment Allocation, by Alessandro Baldi Antognini and Alessandra Giovagnoli
Adaptive Design Theory and Implementation Using SAS and R, Second Edition, by Mark Chang
Advanced Bayesian Methods for Medical Test Accuracy, by Lyle D. Broemeling
Analyzing Longitudinal Clinical Trial Data: A Practical Guide, by Craig Mallinckrodt and Ilya Lipkovich
Applied Biclustering Methods for Big and High-Dimensional Data Using R, by Adetayo Kasim, Ziv Shkedy, Sebastian Kaiser, Sepp Hochreiter, and Willem Talloen
Applied Meta-Analysis with R, by Ding-Geng (Din) Chen and Karl E. Peace
Applied Surrogate Endpoint Evaluation Methods with SAS and R, by Ariel Alonso, Theophile Bigirumurame, Tomasz Burzykowski, Marc Buyse, Geert Molenberghs, Leacky Muchene, Nolen Joy Perualila, Ziv Shkedy, and Wim Van der Elst
Basic Statistics and Pharmaceutical Statistical Applications, Second Edition, by James E. De Muth
Bayesian Adaptive Methods for Clinical Trials, by Scott M. Berry, Bradley P. Carlin, J. Jack Lee, and Peter Muller
Bayesian Analysis Made Simple: An Excel GUI for WinBUGS, by Phil Woodward
Bayesian Designs for Phase I–II Clinical Trials, by Ying Yuan, Hoang Q. Nguyen, and Peter F. Thall
Bayesian Methods for Measures of Agreement, by Lyle D. Broemeling
Bayesian Methods for Repeated Measures, by Lyle D. Broemeling
Bayesian Methods in Epidemiology, by Lyle D. Broemeling
Bayesian Methods in Health Economics, by Gianluca Baio
Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation, by Ming T. Tan, Guo-Liang Tian, and Kai Wang Ng
Bayesian Modeling in Bioinformatics, by Dipak K. Dey, Samiran Ghosh, and Bani K. Mallick
Benefit-Risk Assessment in Pharmaceutical Research and Development, by Andreas Sashegyi, James Felli, and Rebecca Noel
Benefit-Risk Assessment Methods in Medical Product Development: Bridging Qualitative and Quantitative Assessments, by Qi Jiang and Weili He
Bioequivalence and Statistics in Clinical Pharmacology, Second Edition, by Scott Patterson and Byron Jones
Biosimilar Clinical Development: Scientific Considerations and New Methodologies, by Kerry B. Barker, Sandeep M. Menon, Ralph B. D’Agostino, Sr., Siyan Xu, and Bo Jin
Biosimilars: Design and Analysis of Follow-on Biologics, by Shein-Chung Chow
Biostatistics: A Computing Approach, by Stewart J. Anderson
Cancer Clinical Trials: Current and Controversial Issues in Design and Analysis, by Stephen L. George, Xiaofei Wang, and Herbert Pang
Causal Analysis in Biomedicine and Epidemiology: Based on Minimal Sufficient Causation, by Mikel Aickin
Clinical and Statistical Considerations in Personalized Medicine, by Claudio Carini, Sandeep Menon, and Mark Chang
Clinical Trial Data Analysis Using R, by Ding-Geng (Din) Chen and Karl E. Peace
Clinical Trial Data Analysis Using R and SAS, Second Edition, by Ding-Geng (Din) Chen, Karl E. Peace, and Pinggao Zhang
Clinical Trial Methodology, by Karl E. Peace and Ding-Geng (Din) Chen
Clinical Trial Optimization Using R, by Alex Dmitrienko and Erik Pulkstenis
Cluster Randomised Trials, Second Edition, by Richard J. Hayes and Lawrence H. Moulton
Computational Methods in Biomedical Research, by Ravindra Khattree and Dayanand N. Naik
Computational Pharmacokinetics, by Anders Källén
Confidence Intervals for Proportions and Related Measures of Effect Size, by Robert G. Newcombe
Controversial Statistical Issues in Clinical Trials, by Shein-Chung Chow
Data Analysis with Competing Risks and Intermediate States, by Ronald B. Geskus
Data and Safety Monitoring Committees in Clinical Trials, Second Edition, by Jay Herson
Design and Analysis of Animal Studies in Pharmaceutical Development, by Shein-Chung Chow and Jen-pei Liu
Design and Analysis of Bioavailability and Bioequivalence Studies, Third Edition, by Shein-Chung Chow and Jen-pei Liu
Design and Analysis of Bridging Studies, by Jen-pei Liu, Shein-Chung Chow, and Chin-Fu Hsiao
Design & Analysis of Clinical Trials for Economic Evaluation & Reimbursement: An Applied Approach Using SAS & STATA, by Iftekhar Khan
Design and Analysis of Clinical Trials for Predictive Medicine, by Shigeyuki Matsui, Marc Buyse, and Richard Simon
Design and Analysis of Clinical Trials with Time-to-Event Endpoints, by Karl E. Peace
Design and Analysis of Non-Inferiority Trials, by Mark D. Rothmann, Brian L. Wiens, and Ivan S. F. Chan
Difference Equations with Public Health Applications, by Lemuel A. Moyé and Asha Seth Kapadia
DNA Methylation Microarrays: Experimental Design and Statistical Analysis, by Sun-Chong Wang and Arturas Petronis
DNA Microarrays and Related Genomics Techniques: Design, Analysis, and Interpretation of Experiments, by David B. Allison, Grier P. Page, T. Mark Beasley, and Jode W. Edwards
Dose Finding by the Continual Reassessment Method, by Ying Kuen Cheung
Dynamical Biostatistical Models, by Daniel Commenges and Hélène Jacqmin-Gadda
Elementary Bayesian Biostatistics, by Lemuel A. Moyé
Emerging Non-Clinical Biostatistics in Biopharmaceutical Development and Manufacturing, by Harry Yang
Empirical Likelihood Method in Survival Analysis, by Mai Zhou
Essentials of a Successful Biostatistical Collaboration, by Arul Earnest
Exposure–Response Modeling: Methods and Practical Implementation, by Jixian Wang
Frailty Models in Survival Analysis, by Andreas Wienke
Fundamental Concepts for New Clinical Trialists, by Scott Evans and Naitee Ting
Generalized Linear Models: A Bayesian Perspective, by Dipak K. Dey, Sujit K. Ghosh, and Bani K. Mallick
Handbook of Regression and Modeling: Applications for the Clinical and Pharmaceutical Industries, by Daryl S. Paulson
Inference Principles for Biostatisticians, by Ian C. Marschner
Interval-Censored Time-to-Event Data: Methods and Applications, by Ding-Geng (Din) Chen, Jianguo Sun, and Karl E. Peace
Introductory Adaptive Trial Designs: A Practical Guide with R, by Mark Chang
Joint Models for Longitudinal and Time-to-Event Data: With Applications in R, by Dimitris Rizopoulos
Measures of Interobserver Agreement and Reliability, Second Edition, by Mohamed M. Shoukri
Medical Biostatistics, Fourth Edition, by A. Indrayan
Meta-Analysis in Medicine and Health Policy, by Dalene Stangl and Donald A. Berry
Methods in Comparative Effectiveness Research, by Constantine Gatsonis and Sally C. Morton
Mixed Effects Models for the Population Approach: Models, Tasks, Methods and Tools, by Marc Lavielle
Modeling to Inform Infectious Disease Control, by Niels G. Becker
Modern Adaptive Randomized Clinical Trials: Statistical and Practical Aspects, by Oleksandr Sverdlov
Monte Carlo Simulation for the Pharmaceutical Industry: Concepts, Algorithms, and Case Studies, by Mark Chang
Multiregional Clinical Trials for Simultaneous Global New Drug Development, by Joshua Chen and Hui Quan
Multiple Testing Problems in Pharmaceutical Statistics, by Alex Dmitrienko, Ajit C. Tamhane, and Frank Bretz
Noninferiority Testing in Clinical Trials: Issues and Challenges, by Tie-Hua Ng
Optimal Design for Nonlinear Response Models, by Valerii V. Fedorov and Sergei L. Leonov
Patient-Reported Outcomes: Measurement, Implementation and Interpretation, by Joseph C. Cappelleri, Kelly H. Zou, Andrew G. Bushmakin, Jose Ma. J. Alvir, Demissie Alemayehu, and Tara Symonds
Quantitative Evaluation of Safety in Drug Development: Design, Analysis and Reporting, by Qi Jiang and H. Amy Xia
Quantitative Methods for HIV/AIDS Research, by Cliburn Chan, Michael G. Hudgens, and Shein-Chung Chow
Quantitative Methods for Traditional Chinese Medicine Development, by Shein-Chung Chow
Randomized Clinical Trials of Nonpharmacological Treatments, by Isabelle Boutron, Philippe Ravaud, and David Moher
Randomized Phase II Cancer Clinical Trials, by Sin-Ho Jung
Repeated Measures Design with Generalized Linear Mixed Models for Randomized Controlled Trials, by Toshiro Tango
Sample Size Calculations for Clustered and Longitudinal Outcomes in Clinical Research, by Chul Ahn, Moonseong Heo, and Song Zhang
Sample Size Calculations in Clinical Research, Third Edition, by Shein-Chung Chow, Jun Shao, Hansheng Wang, and Yuliya Lokhnygina
Statistical Analysis of Human Growth and Development, by Yin Bun Cheung
Statistical Design and Analysis of Clinical Trials: Principles and Methods, by Weichung Joe Shih and Joseph Aisner
Statistical Design and Analysis of Stability Studies, by Shein-Chung Chow
Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis, by Kelly H. Zou, Aiyi Liu, Andriy Bandos, Lucila Ohno-Machado, and Howard Rockette
Statistical Methods for Clinical Trials, by Mark X. Norleans
Statistical Methods for Drug Safety, by Robert D. Gibbons and Anup K. Amatya
Statistical Methods for Healthcare Performance Monitoring, by Alex Bottle and Paul Aylin
Statistical Methods for Immunogenicity Assessment, by Harry Yang, Jianchun Zhang, Binbing Yu, and Wei Zhao
Statistical Methods in Drug Combination Studies, by Wei Zhao and Harry Yang
Statistical Testing Strategies in the Health Sciences, by Albert Vexler, Alan D. Hutson, and Xiwei Chen
Statistics in Drug Research: Methodologies and Recent Developments, by Shein-Chung Chow and Jun Shao
Statistics in the Pharmaceutical Industry, Third Edition, by Ralph Buncher and Jia-Yeong Tsay
Survival Analysis in Medicine and Genetics, by Jialiang Li and Shuangge Ma
Theory of Drug Development, by Eric B. Holmgren
Translational Medicine: Strategies and Statistical Methods, by Dennis Cosmatos and Shein-Chung Chow
Contents Summary Tables................................................................................................................................................................................... xxix Preface...................................................................................................................................................................................................xxxv Frequently Used Notations............................................................................................................................................................. xxxvii 1. Medical Uncertainties.......................................................................................................................................................................1 1.1 Uncertainties in Health and Disease.....................................................................................................................................2 1.1.1 Uncertainties due to Intrinsic Variation..................................................................................................................2 1.1.1.1 Biologic Variability......................................................................................................................................2 1.1.1.2 Genetic Variability......................................................................................................................................3 1.1.1.3 Variation in Behavior and Other Host Factors........................................................................................3 1.1.1.4 Environmental Variability.........................................................................................................................3 1.1.1.5 Sampling Fluctuations................................................................................................................................3 1.1.2 Natural Variation in Assessment..............................................................................................................................4 1.1.2.1 Observer Variability....................................................................................................................................4 1.1.2.2 Variability in Treatment Strategies...........................................................................................................4 1.1.2.3 Instrument and Laboratory Variability...................................................................................................4 1.1.2.4 Imperfect Tools............................................................................................................................................4 1.1.2.5 Incomplete Information on the Patient....................................................................................................5 1.1.2.6 Poor Compliance with the Regimen.........................................................................................................5 1.1.3 Knowledge Limitations..............................................................................................................................................5 1.1.3.1 Epistemic Uncertainties..............................................................................................................................5 1.1.3.2 Chance Variability.......................................................................................................................................6 1.1.3.3 Diagnostic, Therapeutic, and Prognostic 
Uncertainties........................................................................6 1.1.3.4 Predictive and Other Uncertainties..........................................................................................................6 1.2 Uncertainties in Medical Research........................................................................................................................................7 1.2.1 Empiricism in Medical Research..............................................................................................................................7 1.2.1.1 Laboratory Experiments.............................................................................................................................7 1.2.1.2 Clinical Trials...............................................................................................................................................7 1.2.1.3 Surgical Procedures....................................................................................................................................7 1.2.1.4 Epidemiological Research..........................................................................................................................8 1.2.2 Elements of Minimizing the Impact of Uncertainties on Research....................................................................8 1.2.2.1 Proper Design..............................................................................................................................................8 1.2.2.2 Improved Medical Methods......................................................................................................................8 1.2.2.3 Analysis and Synthesis...............................................................................................................................9 1.3 Uncertainties in Health Planning and Evaluation..............................................................................................................9 1.3.1 Health Situation Analysis..........................................................................................................................................9 1.3.1.1 Identification of the Specifics of the Problem........................................................................................10 1.3.1.2 Magnitude of the Problem.......................................................................................................................10 1.3.1.3 Health Infrastructure...............................................................................................................................10 1.3.1.4 Feasibility of Remedial Steps...................................................................................................................11 1.3.2 Evaluation of Health Programs..............................................................................................................................11 1.4 Management of Uncertainties: About This Book..............................................................................................................11 1.4.1 Contents of the Book................................................................................................................................................12 1.4.1.1 Chapters......................................................................................................................................................12 1.4.1.2 Limitations and 
Strengths.......................................................................................................................13 1.4.1.3 New in the Fourth Edition.......................................................................................................................14 1.4.1.4 Unique Contribution of This Book.........................................................................................................14
ix
x
Contents
1.4.2
Salient Features of the Text......................................................................................................................................15 1.4.2.1 System of Notations..................................................................................................................................15 1.4.2.2 Guide Chart of the Biostatistical Methods............................................................................................16 References..........................................................................................................................................................................................16 Exercises.............................................................................................................................................................................................16 2. Basics of Medical Studies..............................................................................................................................................................17 2.1 Study Protocol........................................................................................................................................................................17 2.1.1 Problem, Objectives, and Hypotheses...................................................................................................................17 2.1.1.1 Problem.......................................................................................................................................................17 2.1.1.2 Broad and Specific Objectives.................................................................................................................18 2.1.1.3 Hypotheses.................................................................................................................................................18 2.1.2 Protocol Content........................................................................................................................................................19 2.2 Types of Medical Studies......................................................................................................................................................21 2.2.1 Elements of a Study Design.....................................................................................................................................22 2.2.2 Basic Types of Study Design...................................................................................................................................22 2.2.2.1 Descriptive Studies...................................................................................................................................23 2.2.2.2 Analytical Studies and Their Basic Types.............................................................................................24 2.2.3 Choosing a Design....................................................................................................................................................24 2.2.3.1 Recommended Design for Particular Setups........................................................................................24 2.2.3.2 Choice of Design by Level of Evidence..................................................................................................25 2.3 Data Collection.......................................................................................................................................................................27 2.3.1 Nature 
of Data...........................................................................................................................................................27 2.3.1.1 Factual, Knowledge-Based, and Opinion-Based Data.........................................................................27 2.3.1.2 Method of Obtaining the Data................................................................................................................27 2.3.2 Tools of Data Collection...........................................................................................................................................28 2.3.2.1 Existing Records........................................................................................................................................28 2.3.2.2 Questionnaires and Schedules................................................................................................................28 2.3.2.3 Likert Scale.................................................................................................................................................29 2.3.2.4 Guttman Scale............................................................................................................................................30 2.3.3 Pretesting and Pilot Study.......................................................................................................................................30 2.4 Nonsampling Errors and Other Biases...............................................................................................................................31 2.4.1 Nonresponse..............................................................................................................................................................31 2.4.2 Variety of Biases to Guard Against........................................................................................................................31 2.4.2.1 List of Biases...............................................................................................................................................31 2.4.2.2 Steps for Minimizing Bias........................................................................................................................35 References..........................................................................................................................................................................................35 Exercises.............................................................................................................................................................................................36 3. 
Sampling Methods..........................................................................................................................................................................37 3.1 Sampling Concepts................................................................................................................................................................37 3.1.1 Advantages and Limitations of Sampling.............................................................................................................37 3.1.1.1 Sampling Fluctuations..............................................................................................................................37 3.1.1.2 Advantages of Sampling..........................................................................................................................38 3.1.1.3 Limitations of Sampling...........................................................................................................................38 3.1.2 Some Special Terms Used in Sampling.................................................................................................................38 3.1.2.1 Unit of Inquiry and Sampling Unit........................................................................................................38 3.1.2.2 Sampling Frame........................................................................................................................................39 3.1.2.3 Parameters and Statistics.........................................................................................................................39 3.1.2.4 Sample Size................................................................................................................................................39 3.1.2.5 Nonrandom and Random Sampling......................................................................................................39 3.1.2.6 Sampling Weight.......................................................................................................................................39
Contents
xi
3.2
Common Methods of Random Sampling...........................................................................................................................40 3.2.1 Simple Random Sampling.......................................................................................................................................40 3.2.2 Stratified Random Sampling...................................................................................................................................41 3.2.3 Multistage Random Sampling................................................................................................................................43 3.2.4 Cluster Random Sampling......................................................................................................................................43 3.2.5 Systematic Random Sampling................................................................................................................................45 3.2.6 Choice of the Method of Random Sampling........................................................................................................46 3.3 Some Other Methods of Sampling......................................................................................................................................46 3.3.1 Other Random Methods of Sampling....................................................................................................................46 3.3.1.1 Probability Proportional to Size Sampling............................................................................................47 3.3.1.2 Area Sampling...........................................................................................................................................47 3.3.1.3 Inverse Sampling.......................................................................................................................................47 3.3.1.4 Consecutive Subjects Attending a Clinic...............................................................................................48 3.3.1.5 Sequential Sampling.................................................................................................................................48 3.3.2 Nonrandom Methods of Sampling........................................................................................................................48 3.3.2.1 Convenience Sample.................................................................................................................................48 3.3.2.2 Other Types of Purposive Samples.........................................................................................................49 References..........................................................................................................................................................................................49 Exercises.............................................................................................................................................................................................49 4. 
Designs for Observational Studies..............................................................................................................................................51 4.1 Some Basic Concepts.............................................................................................................................................................51 4.1.1 Antecedent and Outcome........................................................................................................................................51 4.1.2 Confounders..............................................................................................................................................................52 4.1.3 Effect Size...................................................................................................................................................................53 4.1.4 Ecological Studies.....................................................................................................................................................53 4.2 Prospective Studies................................................................................................................................................................53 4.2.1 Variations of Prospective Studies...........................................................................................................................54 4.2.1.1 Cohort Study..............................................................................................................................................54 4.2.1.2 Longitudinal Study...................................................................................................................................54 4.2.1.3 Repeated Measures Study........................................................................................................................55 4.2.2 Selection of Subjects for a Prospective Study.......................................................................................................55 4.2.2.1 Comparison Group in a Prospective Study...........................................................................................55 4.2.3 Potential Biases in Prospective Studies.................................................................................................................56 4.2.3.1 Selection Bias.............................................................................................................................................56 4.2.3.2 Bias due to Loss in Follow-Up.................................................................................................................56 4.2.3.3 Assessment Bias and Errors....................................................................................................................56 4.2.3.4 Bias due to Change in the Status............................................................................................................57 4.2.3.5 Confounding Bias.....................................................................................................................................57 4.2.3.6 Post Hoc Bias..............................................................................................................................................57 4.2.4 Merits and Demerits of Prospective Studies.........................................................................................................57 4.2.4.1 Merits of Prospective 
Studies..................................................................................................................57 4.2.4.2 Demerits of Prospective Studies.............................................................................................................57 4.3 Retrospective Studies............................................................................................................................................................58 4.3.1 Case–Control Design...............................................................................................................................................58 4.3.1.1 Nested Case–Control Design..................................................................................................................59 4.3.2 Selection of Cases and Controls..............................................................................................................................59 4.3.2.1 Selection of Cases......................................................................................................................................60 4.3.2.2 Selection of Controls.................................................................................................................................60 4.3.2.3 Sampling Methods in Retrospective Studies........................................................................................60 4.3.2.4 Confounders and Matching.....................................................................................................................60 4.3.3 Merits and Demerits of Case–Control Studies.....................................................................................................61 4.3.3.1 Merits of Case–Control Studies..............................................................................................................61 4.3.3.2 Demerits of Case–Control Studies..........................................................................................................61
xii
Contents
4.4
Cross-Sectional Studies.........................................................................................................................................................62 4.4.1 Selection of Subjects for a Cross-Sectional Study................................................................................................62 4.4.2 Merits and Demerits of Cross-Sectional Studies..................................................................................................62 4.4.2.1 Demerits of Cross-Sectional Studies......................................................................................................62 4.4.2.2 Merits of Cross-Sectional Studies...........................................................................................................63 4.5 Comparative Performance of Prospective, Retrospective, and Cross-Sectional Studies............................................63 4.5.1 Comparative Features and Performance Comparison....................................................................................... 64 4.5.2 Reporting Results of Observational Studies: STROBE........................................................................................65 References..........................................................................................................................................................................................66 Exercises.............................................................................................................................................................................................66 5. Medical Experiments......................................................................................................................................................................67 5.1 Basic Features of Medical Experiments..............................................................................................................................67 5.1.1 Statistical Principles of Experimentation..............................................................................................................68 5.1.1.1 Control Group............................................................................................................................................68 5.1.1.2 Randomization..........................................................................................................................................68 5.1.1.3 Replication..................................................................................................................................................69 5.1.2 Advantages and Limitations of Experiments.......................................................................................................69 5.1.2.1 Advantages.................................................................................................................................................69 5.1.2.2 Limitations.................................................................................................................................................70 5.2 Design of Experiments..........................................................................................................................................................70 5.2.1 Classical Designs: One-Way, Two-Way, and Factorial.........................................................................................71 5.2.1.1 One-Way 
Design........................................................................................................................................71 5.2.1.2 Two-Way Design........................................................................................................................................71 5.2.1.3 Interaction..................................................................................................................................................72 5.2.1.4 K-Way and Factorial Experiments...........................................................................................................73 5.2.2 Some Common Unconventional Designs..............................................................................................................74 5.2.2.1 Repeated Measures Design......................................................................................................................74 5.2.2.2 Crossover Design......................................................................................................................................75 5.2.2.3 Other Complex Designs...........................................................................................................................76 5.3 Choice of Sampling of Units for Laboratory Experiments...............................................................................................76 5.3.1 Choice of Experimental Unit...................................................................................................................................77 5.3.2 Sampling Methods in Laboratory Experiments...................................................................................................77 5.3.3 Choosing a Design of Experiment.........................................................................................................................77 5.3.4 Pharmacokinetic Studies.........................................................................................................................................78 References..........................................................................................................................................................................................78 Exercises.............................................................................................................................................................................................79 6. 
Clinical Trials...................................................................................................................................................................................81 6.1 Therapeutic Trials..................................................................................................................................................................81 6.1.1 Phases of a Clinical Trial.........................................................................................................................................81 6.1.1.1 Phase I Trial................................................................................................................................................81 6.1.1.2 Phase II Trial..............................................................................................................................................82 6.1.1.3 Phase III Trial.............................................................................................................................................82 6.1.1.4 Phase IV: Postmarketing Surveillance...................................................................................................83 6.1.2 Randomized Controlled Trials: Selection of Subjects.........................................................................................83 6.1.2.1 Selection of Participants for RCT............................................................................................................83 6.1.2.2 Control Group in a Clinical Trial............................................................................................................84 6.1.3 Randomization and Matching................................................................................................................................85 6.1.3.1 Randomization..........................................................................................................................................86 6.1.3.2 Matching.....................................................................................................................................................86 6.1.4 Methods of Random Allocation.............................................................................................................................87 6.1.4.1 Allocation Out of a Large Number of Available Subjects...................................................................87 6.1.4.2 Random Allocation of Consecutive Patients Coming to a Clinic......................................................87 6.1.4.3 Block, Cluster, and Stratified Randomization.......................................................................................88
xiii
Contents
6.1.5
Blinding and Masking.............................................................................................................................................89 6.1.5.1 Blinding......................................................................................................................................................89 6.1.5.2 Concealment of Allocation.......................................................................................................................89 6.1.5.3 Masking......................................................................................................................................................90 6.2 Issues in Clinical Trials.........................................................................................................................................................90 6.2.1 Outcome Assessment...............................................................................................................................................90 6.2.1.1 Specification of End Points or Outcome.................................................................................................90 6.2.1.2 Causal Inference........................................................................................................................................91 6.2.1.3 Side Effects.................................................................................................................................................91 6.2.1.4 Effectiveness versus Efficacy...................................................................................................................92 6.2.1.5 Pragmatic Trials.........................................................................................................................................92 6.2.2 Various Equivalences in Clinical Trials.................................................................................................................92 6.2.2.1 Superiority, Equivalence, and Noninferiority Trials............................................................................92 6.2.2.2 Therapeutic Equivalence and Bioequivalence......................................................................................93 6.2.3 Designs for Clinical Trials.......................................................................................................................................94 6.2.3.1 n-of-1, Up-and-Down, and Sequential Designs....................................................................................94 6.2.3.2 Choosing a Design for a Clinical Trial...................................................................................................95 6.2.4 Designs with Interim Appraisals...........................................................................................................................95 6.2.4.1 Designs with Provision to Stop Early.....................................................................................................96 6.2.4.2 Adaptive Designs......................................................................................................................................96 6.2.5 Biostatistical Ethics for Clinical Trials...................................................................................................................97 6.2.5.1 Equipoise....................................................................................................................................................97 6.2.5.2 Ethical 
Cautions.........................................................................................................................................98 6.2.5.3 Statistical Considerations in a Multicentric Trial.................................................................................98 6.2.5.4 Multiple Treatments with Different Outcomes in the Same Trial......................................................98 6.2.5.5 Size of the Trial..........................................................................................................................................99 6.2.5.6 Compliance................................................................................................................................................99 6.2.6 Reporting the Results of a Clinical Trial...............................................................................................................99 6.2.6.1 CONSORT Statement................................................................................................................................99 6.2.6.2 Registration of Trials and Open Access...............................................................................................100 6.3 Trials Other than for Therapeutics....................................................................................................................................101 6.3.1 Clinical Trials for Diagnostic and Prophylactic Modalities.............................................................................101 6.3.1.1 Diagnostic Trials......................................................................................................................................101 6.3.1.2 Prophylactic Trials in Clinics.................................................................................................................102 6.3.2 Field Trials for Screening, Prophylaxis, and Vaccines.......................................................................................102 6.3.2.1 Screening Trials.......................................................................................................................................102 6.3.2.2 Prophylactic Trials in the Field..............................................................................................................102 6.3.2.3 Vaccine Trials...........................................................................................................................................103 6.3.3 Issues in Field Trials...............................................................................................................................................103 6.3.3.1 Randomization and Blinding in Field Trials.......................................................................................103 6.3.3.2 Designs for Field Trials...........................................................................................................................104 References........................................................................................................................................................................................104 Exercises...........................................................................................................................................................................................105 7. 
Numerical Methods for Representing Variation....................................................................................................................107 7.1 Types of Measurement........................................................................................................................................................107 7.1.1 Nominal, Metric, and Ordinal Scales..................................................................................................................107 7.1.1.1 Nominal Scale..........................................................................................................................................107 7.1.1.2 Metric Scale..............................................................................................................................................108 7.1.1.3 Ordinal Scale............................................................................................................................................108 7.1.1.4 Grouping of a Metric Scale (Categorizing Continuous Measurements).........................................109 7.1.2 Other Classifications of the Types of Measurement.......................................................................................... 110 7.1.2.1 Discrete and Continuous Variables...................................................................................................... 110 7.1.2.2 Qualitative and Quantitative Data....................................................................................................... 111 7.1.2.3 Stochastic and Deterministic Variables............................................................................................... 111
xiv
Contents
7.2
Tabular Presentation.....111
  7.2.1 Contingency Tables and Frequency Distribution.....112
    7.2.1.1 Empty Cells.....113
    7.2.1.2 Problems in Preparing a Contingency Table on Metric Data.....113
    7.2.1.3 Features of a Table.....113
  7.2.2 Other Types of Statistical Tables.....114
    7.2.2.1 Multiple Responses Tables.....114
    7.2.2.2 Statistical Tables.....115
    7.2.2.3 What Is a Good Statistical Table?.....115
7.3 Rates and Ratios.....115
  7.3.1 Proportion, Rate, and Ratio.....115
    7.3.1.1 Proportion.....116
    7.3.1.2 Rate.....116
    7.3.1.3 Ratio.....116
7.4 Central and Other Locations.....117
  7.4.1 Central Values: Mean, Median, and Mode.....117
    7.4.1.1 Understanding Mean, Median, and Mode.....118
    7.4.1.2 Calculation in the Case of Grouped Data.....118
    7.4.1.3 Which Central Value to Use?.....120
    7.4.1.4 Geometric Mean.....121
    7.4.1.5 Harmonic Mean.....121
  7.4.2 Other Locations: Quantiles.....122
    7.4.2.1 Quantiles in Ungrouped Data.....123
    7.4.2.2 Quantiles in Grouped Data.....123
    7.4.2.3 Interpretation of Quantiles.....124
7.5 Measuring Variability.....125
  7.5.1 Variance and Standard Deviation.....126
    7.5.1.1 Variance and Standard Deviation in Ungrouped Data.....126
    7.5.1.2 Variance and Standard Deviation in Grouped Data.....128
    7.5.1.3 Variance of Sum or Difference of Two Measurements.....128
    7.5.1.4 Measuring Variation in Skewed and Nominal Data: Interquartile Range and Variation Ratio.....128
  7.5.2 Coefficient of Variation.....129
References.....131
Exercises.....131

8. Presentation of Variation by Figures: Data Visualization.....133
8.1 Graphs for Frequency Distribution.....133
  8.1.1 Histogram and Its Variants.....134
    8.1.1.1 Histogram.....134
    8.1.1.2 Stem-and-Leaf Plot.....134
    8.1.1.3 Line Histogram and Dot Plot.....136
  8.1.2 Polygon and Its Variants.....136
    8.1.2.1 Frequency Polygon.....136
    8.1.2.2 Area Diagram.....136
  8.1.3 Frequency Curve.....136
8.2 Pie, Bar, and Line Diagrams.....136
  8.2.1 Pie Diagram.....137
    8.2.1.1 Useful Features of a Pie Diagram.....138
    8.2.1.2 Donut Diagram.....138
  8.2.2 Bar Diagram.....138
  8.2.3 Scatter and Line Diagrams.....140
    8.2.3.1 Scatter Diagram.....140
    8.2.3.2 Bubble Chart.....140
    8.2.3.3 Line Diagram.....142
    8.2.3.4 Complex Line Diagram.....142
  8.2.4 Choice and Cautions in Visual Display of Data.....143
  8.2.5 Mixed and Three-Dimensional Diagrams.....144
    8.2.5.1 Mixed Diagram.....144
    8.2.5.2 Box-and-Whiskers Plot.....144
    8.2.5.3 Three-Dimensional Diagram.....145
    8.2.5.4 Biplot.....146
    8.2.5.5 Nomogram.....146
8.3 Special Diagrams in Health and Medicine.....146
  8.3.1 Diagrams Used in Public Health.....147
    8.3.1.1 Epidemic Curve.....148
    8.3.1.2 Lexis Diagram.....148
  8.3.2 Diagrams Used in Individual Care and Research.....148
    8.3.2.1 Growth Chart.....148
    8.3.2.2 Partogram.....150
    8.3.2.3 Dendrogram.....150
    8.3.2.4 Radar Graph.....150
8.4 Charts and Maps.....152
  8.4.1 Charts.....152
    8.4.1.1 Schematic Chart.....152
    8.4.1.2 Health Infographics.....152
    8.4.1.3 Pedigree Chart.....153
  8.4.2 Maps.....154
    8.4.2.1 Spot Map.....154
    8.4.2.2 Thematic Choroplethic Map.....154
    8.4.2.3 Cartogram.....154
References.....156
Exercises.....156

9. Some Quantitative Aspects of Medicine.....159
9.1 Some Epidemiological Measures of Health and Disease.....159
  9.1.1 Epidemiological Indicators of Neonatal Health.....160
    9.1.1.1 Birth Weight.....160
    9.1.1.2 Apgar Score.....161
  9.1.2 Epidemiological Indicators of Growth in Children.....161
    9.1.2.1 Weight-for-Age, Height-for-Age, and Weight-for-Height.....161
    9.1.2.2 Z-Scores and Percent of Median.....162
    9.1.2.3 T-Score.....163
    9.1.2.4 Growth Velocity.....163
    9.1.2.5 Skinfold Thickness.....164
    9.1.2.6 Other Indicators of Growth.....164
  9.1.3 Epidemiological Indicators of Adolescent Health.....164
    9.1.3.1 Growth in Height and Weight in Adolescence.....164
    9.1.3.2 Sexual Maturity Rating.....165
  9.1.4 Epidemiological Indicators of Adult Health.....165
    9.1.4.1 Obesity.....165
    9.1.4.2 Smoking.....166
    9.1.4.3 Physiological Functions.....168
    9.1.4.4 Quality of Life.....168
  9.1.5 Epidemiological Indicators of Geriatric Health.....169
    9.1.5.1 Activities of Daily Living.....169
    9.1.5.2 Mental Health of the Elderly.....169
9.2 Reference Values.....169
  9.2.1 Gaussian and Other Distributions.....169
    9.2.1.1 Properties of a Gaussian Distribution.....170
    9.2.1.2 Other Distributions.....171
    9.2.1.3 Checking Gaussianity: Simple but Approximate Methods.....172
  9.2.2 Reference or Normal Values.....174
    9.2.2.1 Implications of Normal Values.....174
  9.2.3 Normal Range.....175
    9.2.3.1 Disease Threshold.....175
    9.2.3.2 Clinical Threshold.....175
    9.2.3.3 Statistical Threshold.....176
9.3 Measurement of Uncertainty: Probability.....177
  9.3.1 Elementary Laws of Probability.....177
    9.3.1.1 Law of Multiplication.....178
    9.3.1.2 Law of Addition.....178
  9.3.2 Probability in Clinical Assessments.....179
    9.3.2.1 Probabilities in Diagnosis.....179
    9.3.2.2 Forwarding Diagnosis.....180
    9.3.2.3 Assessment of Prognosis.....180
    9.3.2.4 Choice of Treatment.....181
  9.3.3 Further on Diagnosis: Bayes’ Rule.....181
    9.3.3.1 Bayes’ Rule.....181
    9.3.3.2 Extension of Bayes’ Rule.....182
9.4 Validity of Medical Tests.....183
  9.4.1 Sensitivity and Specificity.....184
    9.4.1.1 Features of Sensitivity and Specificity.....185
    9.4.1.2 Likelihood Ratio.....186
  9.4.2 Predictivities.....186
    9.4.2.1 Positive and Negative Predictivity.....186
    9.4.2.2 Predictivity and Prevalence.....187
    9.4.2.3 Meaning of Prevalence for Predictivity.....188
    9.4.2.4 Features of Positive and Negative Predictivities.....189
  9.4.3 Combination of Tests.....190
    9.4.3.1 Tests in Series.....190
    9.4.3.2 Tests in Parallel.....190
  9.4.4 Gains from a Test.....191
    9.4.4.1 When Can a Test Be Avoided?.....192
9.5 Search for the Best Threshold of a Continuous Test: ROC Curve.....192
  9.5.1 Sensitivity–Specificity-Based ROC Curve.....192
    9.5.1.1 Methods to Find the Optimal Threshold Point.....194
    9.5.1.2 Area under the ROC Curve.....195
  9.5.2 Predictivity-Based ROC Curve.....197
References.....198
Exercises.....199
10. Clinimetrics and Evidence-Based Medicine.....203
10.1 Indicators, Indices, and Scores.....203
  10.1.1 Indicators.....203
    10.1.1.1 Merits and Demerits of Indicators.....203
    10.1.1.2 Choice of Indicators.....204
  10.1.2 Indices.....204
  10.1.3 Scores.....204
    10.1.3.1 Scoring System for Diagnosis.....205
    10.1.3.2 Scoring for Gradation of Severity.....206
    10.1.3.3 APACHE Scores.....207
10.2 Clinimetrics.....208
  10.2.1 Method of Scoring.....208
    10.2.1.1 Method of Scoring for Graded Characteristics.....208
    10.2.1.2 Method of Scoring for Diagnosis.....209
    10.2.1.3 Regression Method for Scoring.....209
  10.2.2 Validity and Reliability of a Scoring System.....210
    10.2.2.1 Validity of a Scoring System.....210
    10.2.2.2 Reliability of a Scoring System.....211
10.3 Evidence-Based Medicine.....212
  10.3.1 Decision Analysis.....212
    10.3.1.1 Decision Tree.....212
  10.3.2 Other Statistical Tools for Evidence-Based Medicine.....213
    10.3.2.1 Etiology Diagram.....213
    10.3.2.2 Expert System.....214
References.....215
Exercises.....216

11. Measurement of Community Health.....219
11.1 Measures of Fertility and Medical Demography.....219
  11.1.1 Indicators of Fertility.....219
  11.1.2 Medical Demography.....221
    11.1.2.1 Population Pyramid.....221
    11.1.2.2 Demographic Cycle.....222
    11.1.2.3 Other Demographic Indicators.....223
    11.1.2.4 Stable and Stationary Population.....223
    11.1.2.5 Sex Ratio.....223
11.2 Indicators of Mortality.....224
  11.2.1 Crude and Standardized Death Rates.....224
    11.2.1.1 Crude Death Rate.....224
    11.2.1.2 Age-Specific Death Rate.....224
    11.2.1.3 Standardized Death Rate.....224
    11.2.1.4 Comparative Mortality Ratio.....227
  11.2.2 Specific Mortality Rates.....228
    11.2.2.1 Fetal Deaths and Mortality in Children.....228
    11.2.2.2 Maternal Mortality.....230
    11.2.2.3 Adult Mortality.....230
    11.2.2.4 Other Measures of Mortality.....231
  11.2.3 Death Spectrum.....231
11.3 Measures of Morbidity.....232
  11.3.1 Prevalence and Incidence.....232
    11.3.1.1 Point Prevalence.....232
    11.3.1.2 Period Prevalence.....233
    11.3.1.3 Prevalence Rate Ratio.....233
    11.3.1.4 Incidence.....233
    11.3.1.5 Concept of Person-Time.....234
    11.3.1.6 Capture–Recapture Methodology.....234
  11.3.2 Duration of Morbidity.....235
    11.3.2.1 Prevalence in Relation to Duration of Morbidity.....236
    11.3.2.2 Incidence from Prevalence.....236
    11.3.2.3 Epidemiologically Consistent Estimates.....237
  11.3.3 Morbidity Measures for Acute Conditions.....237
    11.3.3.1 Attack Rates.....238
    11.3.3.2 Disease Spectrum.....238
11.4 Indicators of Social and Mental Health.....240
  11.4.1 Indicators of Social Health.....240
    11.4.1.1 Education.....240
    11.4.1.2 Income.....241
    11.4.1.3 Occupation.....241
    11.4.1.4 Socioeconomic Status.....241
    11.4.1.5 Dependency Ratio.....242
    11.4.1.6 Dietary Assessment.....242
    11.4.1.7 Health Inequality.....242
  11.4.2 Indicators of Health Resources.....243
    11.4.2.1 Health Infrastructure.....243
    11.4.2.2 Health Expenditure.....244
  11.4.3 Indicators of Lack of Mental Health.....245
    11.4.3.1 Smoking and Other Addictions.....245
    11.4.3.2 Divorces.....245
    11.4.3.3 Vehicular Accidents and Crimes.....245
    11.4.3.4 Other Measures of Lack of Mental Health.....245
11.5 Composite Indices of Health.....246
  11.5.1 Indices of Status of Comprehensive Health.....246
    11.5.1.1 Human Development Index.....246
    11.5.1.2 Physical Quality of Life Index.....247
    11.5.1.3 Index of Happiness.....247
  11.5.2 Indices of (Physical) Health Gap.....248
    11.5.2.1 DALYs Lost.....248
    11.5.2.2 Human Poverty Index.....249
    11.5.2.3 Index of Need for Health Resources.....249
References.....249
Exercises.....250

12. Confidence Intervals, Principles of Tests of Significance, and Sample Size.....255
12.1 Sampling Distributions.....255
  12.1.1 Basic Concepts.....255
    12.1.1.1 Sampling Error.....256
    12.1.1.2 Point Estimate.....256
    12.1.1.3 Standard Error of p and x̄.....256
  12.1.2 Sampling Distribution of p and x̄.....258
    12.1.2.1 Gaussian Conditions.....258
  12.1.3 Obtaining Probabilities from a Gaussian Distribution.....259
    12.1.3.1 Gaussian Probability.....259
    12.1.3.2 Continuity Correction.....261
    12.1.3.3 Probabilities Relating to the Mean and the Proportion.....261
  12.1.4 Case of σ Not Known (t-Distribution).....262
12.2 Confidence Intervals.....262
  12.2.1 Confidence Interval for π, μ, and Median: Gaussian Conditions.....263
    12.2.1.1 Confidence Interval for Proportion π (Large n).....263
    12.2.1.2 Lower and Upper Bounds for π (Large n).....265
    12.2.1.3 Confidence Interval for Mean μ (Large n).....265
    12.2.1.4 Confidence Bounds for Mean μ (Large n).....267
    12.2.1.5 CI for Median (Gaussian Distribution).....268
  12.2.2 Confidence Interval for Differences (Large n).....269
    12.2.2.1 CI for the Difference in Two Independent Samples.....269
    12.2.2.2 Paired Samples.....270
  12.2.3 Confidence Interval for π, μ, and Median: Non-Gaussian Conditions.....271
    12.2.3.1 Confidence Interval for π (Small n).....272
    12.2.3.2 Confidence Bound for π When the Success or Failure Rate in the Sample Is 0%.....273
    12.2.3.3 Confidence Interval for Median: Non-Gaussian Conditions.....274
12.3 P-Values and Statistical Significance.....276
  12.3.1 What Is Statistical Significance?.....276
    12.3.1.1 Court Judgment.....277
    12.3.1.2 Errors in Diagnosis.....277
    12.3.1.3 Null Hypothesis.....277
    12.3.1.4 Philosophical Basis of Statistical Tests.....278
    12.3.1.5 Alternative Hypothesis.....278
    12.3.1.6 One-Sided Alternatives: Which Tail Is Wagging?.....278
  12.3.2 Errors, P-Values, and Power.....279
    12.3.2.1 Type I Error.....279
    12.3.2.2 Type II Error.....280
    12.3.2.3 Power.....280
  12.3.3 General Procedure to Obtain the P-Value.....281
    12.3.3.1 Steps to Obtain a P-Value.....281
    12.3.3.2 Subtleties of Statistical Significance.....283
12.4 Assessing Gaussian Pattern.....284
  12.4.1 Approximate Methods for Assessing Gaussianity.....284
  12.4.2 Significance Tests for Assessing Gaussianity.....285
    12.4.2.1 Statistical Tests.....285
    12.4.2.2 Transformations to Achieve Gaussianity.....285
12.5 Initial Debate on Statistical Significance.....286
  12.5.1 Confidence Interval versus Test of H0.....286
    12.5.1.1 Equivalence of CI with Test of H0.....286
    12.5.1.2 Valid Application of Test of Hypothesis.....287
  12.5.2 Medical Significance versus Statistical Significance.....287
12.6 Sample Size Determination in Some Cases.....289
  12.6.1 Sample Size Required in Estimation Setup.....289
    12.6.1.1 General Considerations for Sample Size in Estimation Setup.....289
    12.6.1.2 General Procedure for Determining the Sample Size for Estimation.....291
    12.6.1.3 Formulas for Sample Size Calculation for Estimation in Simple Situations.....292
  12.6.2 Sample Size for Testing a Hypothesis with Specified Power.....294
    12.6.2.1 General Considerations for Sample Size in a Testing of Hypothesis Setup.....294
    12.6.2.2 Power Calculations.....295
    12.6.2.3 Sample Size Formulas for Test of Hypothesis in Simple Situations.....295
    12.6.2.4 Sample Size in Some Other Popular Setups.....298
    12.6.2.5 Nomograms and Tables of Sample Size.....299
    12.6.2.6 Thumb Rules.....299
    12.6.2.7 Power Analysis.....300
  12.6.3 Sample Size in Adaptive Clinical Trials.....300
    12.6.3.1 Stopping Rules in Case of Early Evidence of Success or Failure: Lan–deMets Procedure.....301
    12.6.3.2 Sample Size Reestimation in Adaptive Designs.....302
References.....303
Exercises.....304

13. Inference from Proportions.....307
13.1 One Qualitative Variable.....307
  13.1.1 Dichotomous Categories: Binomial Distribution.....307
    13.1.1.1 Binomial Distribution.....308
    13.1.1.2 Large n: Gaussian Approximation to Binomial.....309
    13.1.1.3 Z-Test for Proportion in One Group.....310
  13.1.2 Poisson Distribution.....310
  13.1.3 Polytomous Categories (Large n): Goodness-of-Fit Test.....311
    13.1.3.1 Chi-Square and Its Explanation.....312
    13.1.3.2 Degrees of Freedom.....313
    13.1.3.3 Cautions in Using Chi-Square.....313
    13.1.3.4 Further Analysis: Partitioning of Tables.....314
  13.1.4 Goodness of Fit to Assess Gaussianity.....315
  13.1.5 Polytomous Categories (Small n): Exact Multinomial Test.....316
    13.1.5.1 Goodness of Fit in Small Samples.....316
    13.1.5.2 Data with Rare Outcomes: Negative Binomial Distribution.....317
13.2 Proportions in 2×2 Tables.....318
  13.2.1 Structure of 2×2 Table in Different Types of Study.....318
    13.2.1.1 Structure in Prospective Study.....318
    13.2.1.2 Structure in Retrospective Study.....318
    13.2.1.3 Structure in Cross-Sectional Study.....319
  13.2.2 Two Independent Samples (Large n): Chi-Square Test and Proportion Test in a 2×2 Table.....319
    13.2.2.1 Chi-Square Test for a 2×2 Table.....319
    13.2.2.2 Yates’ Correction for Continuity.....320
    13.2.2.3 Z-Test for Difference in Proportions in Two Independent Groups.....320
    13.2.2.4 Detecting a Medically Important Difference in Proportions.....321
    13.2.2.5 Crossover Design with Binary Response (Large n).....322
  13.2.3 Equivalence Tests.....323
    13.2.3.1 Superiority, Equivalence, and Noninferiority.....323
    13.2.3.2 Testing Equivalence.....324
    13.2.3.3 Determining Noninferiority Margin.....326
  13.2.4 Two Independent Samples (Small n): Fisher Exact Test.....326
    13.2.4.1 Fisher Exact Test.....326
    13.2.4.2 Crossover Design (Small n).....327
  13.2.5 Proportions in Matched Pairs: McNemar Test (Large n) and Exact Test (Small n).....328
    13.2.5.1 Large n: McNemar Test.....328
    13.2.5.2 Small n: Exact Test (Matched Pairs).....329
    13.2.5.3 Comparison of Two Tests for Sensitivity and Specificity: Paired Setup.....330
13.3 Analysis of R×C Tables (Large n).....331
  13.3.1 One Dichotomous and the Other Polytomous Variable (2×C Table).....331
    13.3.1.1 Test Criterion for Association in R×C Tables.....332
    13.3.1.2 Trend in Proportions in Ordinal Categories.....332
    13.3.1.3 Dichotomy in Repeated Measures: Cochran Q-Test (Large n).....334
  13.3.2 Two Polytomous Variables.....335
    13.3.2.1 Chi-Square Test for Large n.....336
    13.3.2.2 Matched Pairs: I×I Table and McNemar–Bowker Test.....337
13.4 Three-Way Tables.....337
  13.4.1 Assessment of Association in Three-Way Tables.....338
  13.4.2 Log-Linear Models.....340
    13.4.2.1 Log-Linear Model for Two-Way Tables.....341
    13.4.2.2 Log-Linear Model for Three-Way Tables.....341
References.....343
Exercises.....343

14. Relative Risk and Odds Ratio.....347
14.1 Relative and Attributable Risks (Large n).....347
  14.1.1 Risk, Hazard, and Odds.....347
    14.1.1.1 Risk.....347
    14.1.1.2 Hazard Rate.....348
    14.1.1.3 Odds.....348
    14.1.1.4 Ratios of Risks and Odds.....348
  14.1.2 Relative Risk.....348
    14.1.2.1 RR in Independent Samples.....348
    14.1.2.2 Confidence Interval for RR (Independent Samples).....351
    14.1.2.3 Test of Hypothesis on RR (Independent Samples).....352
    14.1.2.4 RR in the Case of Matched Pairs.....353
  14.1.3 Attributable Risk.....353
    14.1.3.1 AR in Independent Samples.....353
    14.1.3.2 AR in Matched Pairs.....354
    14.1.3.3 Number Needed to Treat.....355
14.1.3.4 Risk Reduction.....356 14.1.3.5 Population Attributable Risk.....357 14.2 Odds Ratio.....357 14.2.1 OR in Two Independent Samples.....358 14.2.1.1 Interpretation of OR.....358 14.2.1.2 CI for OR (Independent Samples).....360 14.2.1.3 Test of Hypothesis on OR (Independent Samples).....360 14.2.2 OR in Matched Pairs.....361 14.2.2.1 Confidence Interval for OR (Matched Pairs).....362 14.2.2.2 Test of Hypothesis on OR (Matched Pairs).....362 14.2.2.3 Multiple Controls.....363 14.3 Stratified Analysis, Sample Size, and Meta-Analysis.....364 14.3.1 Mantel–Haenszel Procedure.....364 14.3.1.1 Pooled Relative Risk.....364 14.3.1.2 Pooled Odds Ratio and Chi-Square.....365 14.3.2 Sample Size Requirement for Statistical Inference on RR and OR.....366 14.3.3 Meta-Analysis.....370 14.3.3.1 Forest Plot.....370 14.3.3.2 Validity of Meta-Analysis.....371 References.....372 Exercises.....372
15. Inference from Means.....377 15.1 Comparison of Means in One and Two Groups (Gaussian Conditions): Student t-Test.....378 15.1.1 Comparison with a Prespecified Mean.....378 15.1.1.1 Student t-Test for One Sample.....378 15.1.2 Difference in Means in Two Samples.....379 15.1.2.1 Paired Samples Setup.....380 15.1.2.2 Unpaired (Independent) Samples Setup.....380 15.1.2.3 Some Features of Student t.....382 15.1.2.4 Effect of Unequal n.....383 15.1.2.5 Difference-in-Differences Approach.....383 15.1.3 Analysis of Crossover Designs.....383 15.1.3.1 Test for Group Effect.....384 15.1.3.2 Test for Carryover Effect.....385 15.1.3.3 Test for Treatment Effect.....385 15.1.4 Analysis of Data of Up-and-Down Trials.....386 15.2 Comparison of Means in 3 or More Groups (Gaussian Conditions): ANOVA F-Test.....387 15.2.1 One-Way ANOVA.....387 15.2.1.1 Procedure to Test H0.....388 15.2.1.2 Checking the Validity of the Assumptions of ANOVA.....391 15.2.2 Two-Way ANOVA.....392 15.2.2.1 Two-Factor Design.....392 15.2.2.2 Hypotheses and Their Test in Two-Way ANOVA.....393 15.2.2.3 Main Effect and Interaction (Effect).....395 15.2.2.4 Type I, Type II, and Type III Sums of Squares.....396 15.2.3 Repeated Measures.....397 15.2.3.1 Random Effects versus Fixed Effects and Mixed Models.....397 15.2.3.2 Sphericity and Huynh–Feldt Correction.....397 15.2.4 Multiple Comparisons: Bonferroni, Tukey, and Dunnett Tests.....398 15.2.4.1 Bonferroni Procedure.....399 15.2.4.2 Tukey Test.....399 15.2.4.3 Dunnett Test.....400 15.2.4.4 Intricacies of Multiple Comparisons.....400
15.3 Non-Gaussian Conditions: Nonparametric Tests for Location.....401 15.3.1 Comparison of Two Groups: Wilcoxon Tests.....401 15.3.1.1 Case I: Paired Data—Sign Test and Wilcoxon Signed-Rank Test.....401 15.3.1.2 Case II: Independent Samples—Wilcoxon Rank-Sum Test.....404 15.3.2 Comparison of Three or More Groups: Kruskal–Wallis Test.....406 15.3.3 Two-Way Layout with n = 1: Friedman Test for Repeated Samples.....407 15.4 When Significant Is Not Significant.....410 15.4.1 Nature of Statistical Significance.....410 15.4.2 Testing for the Presence of a Medically Important Difference in Means.....413 15.4.2.1 Detecting Specified Difference in Mean.....414 15.4.2.2 Equivalence Tests for Means.....415 15.4.3 Power and Level of Significance.....415 15.4.3.1 Further Explanation of Statistical Power.....415 15.4.3.2 Balancing Type I and Type II Error.....417 References.....418 Exercises.....418
16. Relationships: Quantitative Outcome.....423 16.1 Some General Features of a Regression Setup.....424 16.1.1 Dependent and Independent Variables.....425 16.1.1.1 Simple, Multiple, and Multivariate Regressions.....425 16.1.2 Linear, Curvilinear, and Nonlinear Regressions.....425 16.1.2.1 Linear Regression.....425 16.1.2.2 Curvilinear Regression.....426 16.1.2.3 Nonlinear Regressions.....427 16.1.2.4 Regression through Origin.....428 16.1.3 Concept of Residuals.....428 16.1.4 General Method of Fitting a Regression.....429 16.1.5 Selection of Regressors.....430 16.1.5.1 Multicollinearity.....431 16.1.5.2 Statistical Significance: Stepwise Procedures.....431 16.1.5.3 Other Considerations.....432 16.2 Linear Regression Models.....432 16.2.1 Simple Linear Regression.....433 16.2.1.1 Meaning of Intercept and Slope in Simple Linear Regression.....434 16.2.1.2 Estimation of Parameters of Simple Linear Regression.....434 16.2.1.3 Confidence Intervals for the Parameters of Simple Linear Regression.....435 16.2.1.4 Tests of Hypothesis for the Parameters of Simple Linear Regression.....436 16.2.1.5 Confidence Band for Simple Linear Regression.....437 16.2.2 Multiple Linear Regression.....438
16.2.2.1 Elements of Multiple Linear Regression..............................................................................................438 16.2.2.2 Understanding Multiple Linear Regression........................................................................................439 16.2.2.3 CI and Tests in Multiple Linear Regression....................................................................................... 440 16.3 Adequacy of a Regression.................................................................................................................................................. 440 16.3.1 Goodness of Fit and η2.......................................................................................................................................... 440 16.3.2 Multiple Correlation in Multiple Linear Regression........................................................................................ 440 16.3.3 Statistical Significance of Individual Regression Coefficients.........................................................................441 16.3.4 Validity of Assumptions....................................................................................................................................... 442 16.3.5 Choice of the Form of Regression........................................................................................................................ 443 16.3.6 Outliers and Missing Values................................................................................................................................ 446 16.4 Some Issues in Linear Regression.................................................................................................................................... 446 16.4.1 Implications of Regression....................................................................................................................................447 16.4.1.1 Standardized Coefficients......................................................................................................................447 16.4.1.2 Other Implications of Regression Models...........................................................................................447
16.4.1.3 Equality of Two Regression Lines.....448 16.4.1.4 Difference-in-Differences Approach with Regression.....448 16.4.2 Some Variations of Regression.....449 16.4.2.1 Ridge Regression.....449 16.4.2.2 Multilevel Regression.....449 16.4.2.3 Regression Splines.....450 16.4.2.4 Analysis of Covariance.....451 16.4.2.5 Some Generalizations.....452 16.5 Measuring the Strength of Quantitative Relationship.....452 16.5.1 Product–Moment and Related Correlations.....452 16.5.1.1 Product–Moment Correlation.....452 16.5.1.2 Statistical Significance of r.....455 16.5.1.3 Comparison of Correlations in Two Independent Samples.....456 16.5.1.4 Serial Correlation.....456 16.5.1.5 Partial Correlation.....456 16.5.2 Rank Correlation.....457 16.5.2.1 Spearman Rho.....457 16.5.3 Intraclass Correlation.....458 16.5.3.1 Computation of Intraclass Correlation.....459 16.5.3.2 ANOVA Formulation and Testing the Statistical Significance of ICC.....459 16.6 Assessment of Quantitative Agreement.....460 16.6.1 Agreement in Quantitative Measurements.....460 16.6.1.1 Statistical Formulation of the Problem of Agreement.....460 16.6.1.2 Limits of Disagreement Approach.....461 16.6.1.3 Intraclass Correlation as a Measure of Agreement.....462 16.6.1.4 Relative Merits of the Two Methods.....462 16.6.2 Alternative Methods for Assessment of Agreement.....463 16.6.2.1 Alternative Simple Approach to Agreement Assessment.....463 16.6.2.2 Agreement Assessment for Different Measurements.....464 References.....464 Exercises.....465
17. Relationships: Qualitative Dependent.....469 17.1 Binary Dependent: Logistic Regression (Large n).....469 17.1.1 Meaning of a Logistic Model.....470 17.1.1.1 Logit and Logistic Coefficients.....470 17.1.1.2 Logistic versus Quantitative Regression.....470 17.1.1.3 Etiological Specification of a Logistic Model.....472 17.1.2 Assessing the Overall Adequacy of a Logistic Regression.....472 17.1.2.1 Log-Likelihood.....472 17.1.2.2 Classification Accuracy.....474 17.1.2.3 Hosmer–Lemeshow Test.....474 17.1.2.4 Other Methods of Assessing the Adequacy of a Logistic Regression.....475 17.2 Inference from Logistic Coefficients.....477 17.2.1 Interpretation of the Logistic Coefficients.....477 17.2.1.1 Dichotomous Regressors.....477 17.2.1.2 Polytomous Regressors.....478 17.2.1.3 Continuous Regressors and Linearity.....479 17.2.2 Confidence Interval and Test of Hypothesis on Logistic Coefficients.....480 17.3 Issues in Logistic Regression.....482 17.3.1 Conditional Logistic for Matched Data.....482 17.3.2 Polytomous Dependent.....483 17.3.2.1 Nominal Categories of the Dependent: Multinomial Logistic.....483 17.3.2.2 Ordinal Categories of the Dependent Variable.....483
17.4 Some Models for Qualitative Data and Generalizations.....484 17.4.1 Cox Regression for Hazards.....484 17.4.2 Classification and Regression Tree.....485 17.4.3 Further Generalizations.....486 17.4.3.1 Generalized Linear Models.....486 17.4.3.2 Generalized Estimating Equations.....486 17.4.4 Propensity Score Approach.....487 17.5 Strength of Relationship in Qualitative Variables.....488 17.5.1 Both Variables Qualitative.....488 17.5.1.1 Dichotomous Categories.....488 17.5.1.2 Polytomous Categories: Nominal.....490 17.5.1.3 Proportional Reduction in Error (PRE).....491 17.5.1.4 Polytomous Categories: Ordinal Association.....492 17.5.2 One Qualitative and the Other Quantitative Variable.....494 17.5.2.1 Coefficient of Determination as a Measure of the Degree of Relationship.....494 17.5.2.2 Biserial Correlation.....495 17.5.3 Agreement in Qualitative Measurements (Matched Pairs).....495 17.5.3.1 Meaning of Qualitative Agreement.....495 17.5.3.2 Cohen Kappa.....496 17.5.3.3 Agreement Charts.....497 References.....498 Exercises.....499
18. Survival Analysis.....503 18.1 Life Expectancy.....503 18.1.1 Life Table.....504 18.1.2 Other Forms of Life Expectancy.....507 18.1.2.1 Potential Years of Life Lost.....507 18.1.2.2 Healthy Life Expectancy.....507 18.1.2.3 Application to Other Setups.....507 18.2 Analysis of Survival Data.....508 18.2.1 Nature of Survival Data.....508 18.2.1.1 Types of Censoring.....508 18.2.1.2 Collection of Survival Time Data.....509 18.2.1.3 Statistical Measures of Survival.....510 18.2.2 Survival Observed in Time Intervals: Life Table Method.....510 18.2.2.1 Life Table Method.....510 18.2.2.2 Survival Function.....511 18.2.3 Continuous Observation of Survival Time: Kaplan–Meier Method.....513 18.2.3.1 Kaplan–Meier Method.....513 18.2.3.2 Using the Survival Curve for Some Estimations.....515 18.2.3.3 Standard Error of Survival Rate (K–M Method).....515 18.2.3.4 Hazard Function.....516 18.3 Issues in Survival Analysis.....517 18.3.1 Comparison of Survival in Two Groups.....517 18.3.1.1 Comparing Survival Rates.....518 18.3.1.2 Comparing Survival Experience: Log-Rank Test.....519 18.3.2 Factors Affecting the Chance of Survival: Cox Model.....521 18.3.2.1 Parametric Models.....521 18.3.2.2 Cox Model for Survival.....522 18.3.2.3 Proportional Hazards.....522 18.3.3 Sample Size for Hazard Ratios and Survival Studies.....524 References.....525 Exercises.....526
19. Simultaneous Consideration of Several Variables.....529 19.1 Scope of Multivariate Methods.....529 19.1.1 Essentials of a Multivariate Setup.....530 19.1.2 Statistical Limitation on the Number of Variables.....530 19.2 Dependent and Independent Sets of Variables.....531 19.2.1 Dependents and Independents Both Quantitative.....531 19.2.1.1 Canonical Correlation.....531 19.2.1.2 Multivariate Multiple Regression.....532 19.2.1.3 Path Analysis.....535 19.2.2 Quantitative Dependents and Qualitative Independents: Multivariate Analysis of Variance.....536 19.2.2.1 Regular MANOVA.....537 19.2.2.2 MANOVA for Repeated Measures.....539 19.2.3 Classification of Subjects into Known Groups: Discriminant Analysis.....539 19.2.3.1 Discriminant Functions.....539 19.2.3.2 Classification Rule.....540 19.2.3.3 Classification Accuracy.....540 19.3 Identification of Structure in the Observations.....543 19.3.1 Identification of Clusters of Subjects: Cluster Analysis.....543 19.3.1.1 Measures of Similarity.....543 19.3.1.2 Hierarchical Agglomerative Algorithm.....544 19.3.1.3 Deciding on the Number of Natural Clusters.....545 19.3.2 Identification of Unobservable Underlying Factors: Factor Analysis.....546 19.3.2.1 Factor Analysis.....547 19.3.2.2 Steps for Factor Analysis.....548 19.3.2.3 Features of a Successful Factor Analysis.....549 19.3.2.4 Factor Scores.....550 References.....550 Exercises.....551
20. Quality Considerations.....555 20.1 Statistical Quality Control in Medical Care.....555 20.1.1 Statistical Control of Medical Care Errors.....556 20.1.1.1 Adverse Patient Outcomes.....556 20.1.1.2 Monitoring Fatality.....557 20.1.1.3 Limits of Tolerance.....557 20.1.2 Quality of Lots.....558 20.1.2.1 Lot Quality Method.....558 20.1.2.2 LQAS in Health Assessment.....558 20.1.3 Quality Control in a Medical Laboratory.....559 20.1.3.1 Control Chart.....559 20.1.3.2 Cusum Chart.....560 20.1.3.3 Other Errors in a Medical Laboratory.....561 20.1.3.4 Six Sigma Methodology.....561 20.1.3.5 Nonstatistical Issues.....561 20.2 Quality of Measurement Instruments.....562 20.2.1 Validity of Instruments.....562 20.2.1.1 Types of Validity.....562 20.2.2 Reliability of Instruments.....563 20.2.2.1 Internal Consistency.....563 20.2.2.2 Cronbach Alpha.....564 20.2.2.3 Test–Retest Reliability.....565 20.3 Quality of Statistical Models: Robustness.....566 20.3.1 Limitations of Statistical Models.....566
20.3.2 Validation of the Models.....567 20.3.2.1 Internal Validation.....567 20.3.2.2 External Validation.....568 20.3.3 Sensitivity Analysis and Uncertainty Analysis.....568 20.3.3.1 Sensitivity Analysis.....568 20.3.3.2 Uncertainty Analysis.....569 20.3.4 Resampling.....570 20.3.4.1 Bootstrapping.....570 20.3.4.2 Jackknife Resampling.....571 20.3.4.3 Optimistic Index.....571 20.4 Quality of Data.....572 20.4.1 Errors in Measurement.....572 20.4.1.1 Lack of Standardization in Definitions.....572 20.4.1.2 Lack of Care in Obtaining or Recording Information.....572 20.4.1.3 Inability of the Observer to Secure Confidence of the Respondent.....573 20.4.1.4 Bias of the Observer.....573 20.4.1.5 Variable Competence of the Observers.....573 20.4.2 Missing Values.....573 20.4.2.1 Approaches for Missing Values.....574 20.4.2.2 Handling Nonresponse.....575 20.4.2.3 Imputations.....576 20.4.2.4 Intention-to-Treat (ITT) Analysis.....576 20.4.3 Lack of Standardization in Values.....578 20.4.3.1 Standardization Methods Already Described.....578 20.4.3.2 Standardization for Calculating Adjusted Rates.....578 20.4.3.3 Standardized Mortality Ratio.....579 References.....580 Exercises.....581
21. Statistical Fallacies.....583 21.1 Problems with the Sample.....583 21.1.1 Biased Sample.....583 21.1.1.1 Survivors.....584 21.1.1.2 Volunteers.....584 21.1.1.3 Clinic Subjects.....584 21.1.1.4 Publication Bias.....585 21.1.1.5 Inadequate Specification of the Sampling Method.....585 21.1.1.6 Abrupt Series.....585 21.1.2 Inadequate Size of Sample.....585 21.1.2.1 Size of Sample Not Adequate.....585 21.1.2.2 Problems with Calculation of Sample Size.....586 21.1.3 Incomparable Groups.....586 21.1.3.1 Differential in Group Composition.....587 21.1.3.2 Differential Compliance.....588 21.1.3.3 Variable Periods of Exposure.....588 21.1.3.4 Improper Denominator.....589 21.1.4 Mixing of Distinct Groups.....590 21.1.4.1 Effect on Regression.....590 21.1.4.2 Effect on Shape of the Distribution.....591 21.1.4.3 Lack of Intragroup Homogeneity.....591 21.2 Inadequate Analysis.....592 21.2.1 Ignoring Reality.....592 21.2.1.1 Looking for Linearity.....592 21.2.1.2 Overlooking Assumptions.....592 21.2.1.3 Selection of Inappropriate Variables.....593
21.2.1.4 Area under the (Concentration) Curve.....593 21.2.1.5 Further Problems with Statistical Analysis.....594 21.2.1.6 Anomalous Person-Years.....594 21.2.1.7 Problems with Intention-to-Treat Analysis and Equivalence.....595 21.2.2 Choice of Analysis.....595 21.2.2.1 Mean or Proportion?.....595 21.2.2.2 Forgetting Baseline Values.....596 21.2.3 Misuse of Statistical Packages.....596 21.2.3.1 Overanalysis.....597 21.2.3.2 Data Dredging.....597 21.2.3.3 Quantitative Analysis of Codes.....597 21.2.3.4 Soft Data versus Hard Data.....597 21.3 Errors in Presentation of Findings.....597 21.3.1 Misuse of Percentages and Means.....598 21.3.1.1 Misuse of Percentages.....598 21.3.1.2 Misuse of Means.....599 21.3.1.3 Unnecessary Decimals.....599 21.3.2 Problems in Reporting.....600 21.3.2.1 Incomplete Reporting.....600 21.3.2.2 Overreporting.....601 21.3.2.3 Selective Reporting.....601 21.3.2.4 Self-Reporting versus Objective Measurement.....601 21.3.2.5 Misuse of Graphs.....601
21.4 Misinterpretation.....602 21.4.1 Misuse of P-Values.....602 21.4.1.1 Magic Threshold of 0.05.....602 21.4.1.2 One-Tailed or Two-Tailed P-Values.....603 21.4.1.3 Multiple Comparisons.....603 21.4.1.4 Dramatic P-Values.....603 21.4.1.5 P-Values for Nonrandom Sample.....603 21.4.1.6 Assessment of "Normal" Condition Involving Several Parameters.....604 21.4.1.7 Absence of Evidence Is Not Evidence of Absence.....604 21.4.2 Correlation versus Cause–Effect Relationship.....604 21.4.2.1 Criteria for Cause–Effect.....605 21.4.2.2 Other Considerations.....606 21.4.3 Sundry Issues.....606 21.4.3.1 Diagnostic Test Is Only an Additional Adjunct.....606 21.4.3.2 Medical Significance versus Statistical Significance.....606 21.4.3.3 Interpretation of Standard Error of p.....606 21.4.3.4 Univariate Analysis but Multivariate Conclusions.....607 21.4.3.5 Limitation of Relative Risk.....607 21.4.3.6 Misinterpretation of Improvements.....607 21.4.4 Final Comments.....608 References.....609 Exercises.....610
Brief Solutions and Answers to Selected Exercises.....611
Appendix A: Statistical Software.....631
Appendix B: Some Statistical Tables.....637
Appendix C: Solution Illustrations Using R.....643
Index.....689
Summary Tables

Although this book covers statistical methods for a large variety of datasets and problems, some setups are not covered or mentioned in the following tables. Several others whose names and applicability are mentioned in the book without details, such as the Breslow–Day, Tarone–Ware, and Brown–Forsythe tests, are also not included in these tables.

TABLE S.1 Methods to Compute Some Confidence Intervals
(Parameter of interest; conditions → source of 95% CI)

Proportion (π)
  Large n, 0 < p < 1 → Equation 12.11
  Small n, any p → Figure 12.5
  Any n, p = 0 or 1 (confidence bound) → Table 12.4

Mean (μ)
  Large n, σ known, almost any underlying distribution → Equation 12.14
  Small n, σ known, underlying Gaussian → Equation 12.14
  Small n, σ known or unknown, underlying non-Gaussian → Table 12.5 (CI for median)
  Any n, σ unknown, underlying Gaussian → Equation 12.15
  Large n, σ unknown, underlying non-Gaussian → Equation 12.15

Median
  Gaussian distribution → Equation 12.18b
  Non-Gaussian conditions → Table 12.5

Difference in proportions (π1 − π2)
  Large n1, n2: independent samples → Equation 12.20
  Large n1, n2: paired samples → Equation 12.23

Difference in means (μ1 − μ2) (σ unknown)
  Independent samples, large n1, n2: almost any underlying distribution → Equation 12.21
  Independent samples, small n1, n2: underlying Gaussian → Equation 12.21
  Paired samples → Same as for one sample after taking the difference

Relative risk
  Large n1, n2: independent samples → Equation 14.4
  Large n1, n2: paired samples → Same as for odds ratio when prevalence is small (Equation 14.21)

Attributable risk
  Large n1, n2: independent samples → Same as for (π1 − π2)
  Large n1, n2: paired samples → Equation 14.12

Number needed to treat
  Large n1, n2: independent samples → Section 14.1.3.3

Odds ratio
  Large n1, n2: independent samples → Equation 14.18
  Large n1, n2: paired samples → Equation 14.21

Regression coefficient and intercept (simple linear)
  Large n (Gaussian residuals) → Equation 16.7

Regression line (confidence band)
  Large n (Gaussian residuals) → Section 16.2.1.5

Logistic coefficient
  Large n → Section 17.2.2

Median effective dose
  Up-and-down trial → Section 15.1.4

Note: CI, confidence interval.
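As a quick illustration of the large-n entries in Table S.1, the following minimal R sketch computes a 95% CI for a proportion and for a mean. The counts and measurements are invented, and base R's prop.test and t.test are standard implementations that may differ in detail from the book's numbered equations:

  # 95% CI for a proportion, large n (familiar Wald form)
  x <- 36; n <- 120                      # 36 positives out of 120 subjects
  p <- x / n
  se <- sqrt(p * (1 - p) / n)
  p + c(-1, 1) * qnorm(0.975) * se       # lower and upper limits
  prop.test(x, n)$conf.int               # built-in interval for comparison

  # 95% CI for a mean, sigma unknown, underlying Gaussian
  y <- c(12.1, 13.4, 11.8, 12.9, 13.0, 12.4, 11.6, 13.8)
  t.test(y)$conf.int                     # t-based interval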
TABLE S.2 Statistical Procedures for Test of Hypothesis on Proportions
(Setup; conditions → main criterion → equation/section)

One-Way and 2×2 Tables

One dichotomous variable (independent trials)
  Any n → Binomial → Equation 13.1
  Large n → Gaussian Z → Equation 13.3

One polytomous variable (independent trials)
  Large n → Goodness-of-fit chi-square → Equation 13.5
  Small n → Multinomial → Equation 13.6

Two dichotomous variables (2×2)
  Two independent samples, large n → Chi-square or Gaussian Z → Equation 13.8 or 13.9
  Two independent samples, small n → Fisher exact → Equation 13.11
  Detecting a medically important difference, large n → Gaussian Z → Equation 13.10
  Equivalence, superiority, and noninferiority tests → TOSTs (two one-sided tests) and others → Table 13.10
  Matched pairs, large n → McNemar → Equation 13.12
  Matched pairs, small n → Binomial → Equation 13.13
  Crossover design, large n → Chi-square → Section 13.2.2.5
  Crossover design, small n → Fisher exact → Equation 13.11

Bigger Tables, No Matching (large n required; the case of small n is not discussed in this book)

Association (nominal)
  2×C tables → Chi-square → Equation 13.15

Trend in proportions (ordinal)
  2×C tables → Chi-square for trend → Equation 13.16

Dichotomy in repeated measures
  Many related 2×2 tables → Cochran Q → Equation 13.18

Association
  R×C tables → Chi-square → Equation 13.15
  Three-way tables, test of full independence → Chi-square → Equation 13.19
  Three-way tables, test of other types of independence (log-linear models) → G2 → Three-way extension of Equation 13.22

I×I table
  Matched pairs → McNemar–Bowker → Section 13.3.2.2

Stratified
  Stratified into many 2×2 tables → Mantel–Haenszel chi-square → Equation 14.26
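Several of the 2×2 procedures in Table S.2 are available as single calls in base R. A minimal sketch with invented counts (the groups and outcome labels are hypothetical):

  # Two independent samples: chi-square for large n, Fisher exact for small n
  tab <- matrix(c(30, 20, 18, 32), nrow = 2, byrow = TRUE,
                dimnames = list(group = c("A", "B"),
                                outcome = c("improved", "not improved")))
  chisq.test(tab)
  fisher.test(tab)

  # Matched pairs: McNemar test on before/after classification of the same subjects
  paired <- matrix(c(40, 12, 5, 43), nrow = 2,
                   dimnames = list(before = c("+", "-"), after = c("+", "-")))
  mcnemar.test(paired)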
TABLE S.3 Procedures for Test of Hypothesis on Relative Risk and Odds Ratio
(Large n required; the case of small n is not discussed in this book. Conditions → main criterion → equation/section)

Relative (RR) and Attributable (AR) Risks
  ln(RR), two independent samples → Gaussian Z or chi-square → Equation 13.8 or 14.5
  RR, matched pairs → As for odds ratio (OR) (Gaussian Z or McNemar) → Equation 14.22 or 14.23
  AR, two independent samples → Chi-square or Gaussian Z → Equation 13.8 or 13.9
  AR, matched pairs → McNemar → Equation 13.12

Odds Ratio (OR)
  ln(OR), two independent samples → Chi-square → Equation 13.8
  OR, matched pairs → Gaussian Z or McNemar → Equation 14.22 or 14.23
  Stratified → Mantel–Haenszel chi-square → Equation 14.26
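For two independent samples, a large-sample test of OR = 1 can be worked out from first principles. The following minimal R sketch uses the standard Woolf form of the standard error of ln(OR); the 2×2 counts are invented, and this sketch is not necessarily identical to the book's numbered equations:

  a <- 40; b <- 25; c <- 20; d <- 45        # hypothetical 2x2 frequencies (book's a, b, c, d notation)
  or    <- (a * d) / (b * c)                # sample odds ratio
  se_ln <- sqrt(1/a + 1/b + 1/c + 1/d)      # Woolf SE of ln(OR)
  z     <- log(or) / se_ln                  # Gaussian Z statistic for H0: OR = 1
  ci    <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_ln)
  c(OR = or, Z = z, lower = ci[1], upper = ci[2])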
TABLE S.4 Statistical Procedures for Test of Hypothesis on Means or Locations
(Setup; conditions → main criterion → equation/section)

One sample (comparison with prespecified value)
  Gaussian, σ known → Gaussian Z → Section 15.1.1.1
  Gaussian, σ not known → Student t → Equation 15.1

Comparison of two groups
  Paired, Gaussian → Student t → Equation 15.3
  Paired, non-Gaussian, any n → Sign test → Equation 15.17a–c
  Paired, non-Gaussian, 5 ≤ n ≤ 19 → Wilcoxon signed-ranks WS → Equation 15.18a
  Paired, non-Gaussian, 20 ≤ n ≤ 29 → Standardized WS referred to Gaussian Z → Equation 15.18b
  Paired, non-Gaussian, n ≥ 30 → Student t → Equation 15.1
  Unpaired, Gaussian, equal variances → Student t → Equation 15.6a
  Unpaired, Gaussian, unequal variances → Welch → Equation 15.6b
  Unpaired, non-Gaussian, n1, n2 between (4, 9) → Wilcoxon rank-sum WR → Equation 15.19
  Unpaired, non-Gaussian, n1, n2 between (10, 29) → Standardized WR referred to Gaussian Z → Equation 15.20
  Unpaired, non-Gaussian, n1, n2 ≥ 30 → Student t → Equation 15.6a or 15.6b
  Crossover design, Gaussian conditions → Student t → Section 15.1.3
  Detecting a medically important difference, Gaussian conditions → Student t → Equation 15.23
  Equivalence tests, Gaussian conditions → Student t → Section 15.4.2.2

Comparison of three or more groups
  One-way layout, Gaussian → ANOVA F → Equation 15.8
  One-way layout, non-Gaussian, n ≤ 5 → Kruskal–Wallis H → Equation 15.21
  One-way layout, non-Gaussian, n ≥ 6 → H referred to chi-square → Equation 15.21
  Two-way layout, Gaussian → ANOVA F → Section 15.2.2
  Two-way layout, non-Gaussian (one observation per cell—repeated measures):
    J ≤ 13 and K = 3 → Friedman S → Equation 15.22a or 15.22b
    J ≤ 8 and K = 4 → Friedman S → Equation 15.22a or 15.22b
    J ≤ 5 and K = 5 → Friedman S → Equation 15.22a or 15.22b
    Larger J, K → S referred to chi-square → Equation 15.22a or 15.22b

Multiple comparisons (Gaussian conditions)
  All pairwise → Tukey D → Equation 15.15
  With control group → Dunnett → Equation 15.16
  Few comparisons → Bonferroni → Section 15.2.4.1

Repeated measures (Gaussian conditions)
  F-test with Huynh–Feldt correction → Section 15.2.3

Note: ANOVA, analysis of variance.
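Most of the two-group and multigroup comparisons in Table S.4 map directly to base R functions. A brief sketch on simulated (invented) data:

  set.seed(11)
  x <- rnorm(15, 100, 12); y <- rnorm(15, 108, 12)    # two invented groups
  t.test(x, y, var.equal = TRUE)    # Student t, equal variances
  t.test(x, y)                      # Welch t, unequal variances
  wilcox.test(x, y)                 # Wilcoxon rank-sum for non-Gaussian data

  g <- gl(3, 10, labels = c("A", "B", "C"))           # three groups of 10
  z <- rnorm(30, rep(c(100, 105, 112), each = 10), 10)
  fit <- aov(z ~ g)
  summary(fit)                      # one-way ANOVA F
  TukeyHSD(fit)                     # all pairwise multiple comparisons (Tukey)
  kruskal.test(z ~ g)               # Kruskal-Wallis for a non-Gaussian one-way layout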
TABLE S.5 Procedures for Test of Hypothesis on Some Other Parameters
(Parameter of interest and setup; conditions → main criterion → equation/section)

One Sample
  Product–moment correlation, Gaussian conditions → Student t → Equation 16.20
  Serial correlation, Gaussian conditions → Durbin–Watson → Section 16.3.4
  Intraclass correlation, Gaussian conditions → F → Section 16.5.3.2
  Sphericity (repeated measures), Gaussian conditions → Mauchly → Section 15.2.3.2
  Goodness of fit of whole model, large n → Hosmer–Lemeshow → Section 17.1.2.3

Two-Sample Comparison
  Comparison of two distributions (two independent samples):
    Very large n (mean and SD known) → Kolmogorov–Smirnov → Section 12.4.2.1
    Large n → Shapiro–Wilk → Section 12.4.2.1
    Moderate n → Anderson–Darling → Section 12.4.2.1
  Comparison of two correlations, Gaussian conditions → Fisher z-transformation → Section 16.5.1.3
  Comparison of two survival curves, any distribution → Log-rank → Section 18.3.1.2
  Comparison of two variances, Gaussian conditions → F or Levene → Section 15.1.2.2
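Several of these procedures also have direct equivalents in base R. A minimal sketch with invented samples:

  set.seed(3)
  x <- rnorm(40, 50, 6); y <- rnorm(40, 50, 9)   # two invented samples
  shapiro.test(x)      # Shapiro-Wilk check of Gaussianity for one sample
  ks.test(x, y)        # Kolmogorov-Smirnov comparison of two distributions
  var.test(x, y)       # F test for equality of two variances
  cor.test(x, y)       # Student t test for the product-moment correlation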
TABLE S.6 Methods for Studying the Nature of Relationship
(Dependent variable y; independent variables x → method → equation/section)

  y quantitative,* x qualitative → ANOVA → Section 15.2
  y quantitative, x quantitative → Quantitative regression → Chapter 16
  y quantitative, x a mixture of qualitative and quantitative → ANCOVA → Section 16.4.2.4
  y qualitative (dichotomous), x qualitative or quantitative or a mixture → Logistic → Section 17.1
  y qualitative (polytomous), x qualitative or quantitative or a mixture → Logistic, any two categories at a time → Section 17.3.2
  y qualitative (polytomous), x quantitative → Discriminant → Section 19.2.3
  y survival probability, groups with duration in time intervals → Life table → Equation 18.8
  y survival probability, groups with duration in continuous time → Kaplan–Meier → Equation 18.10
  y hazard ratio, x a mixture of qualitative and quantitative → Cox model → Section 18.3.2

Note: Large n required, particularly for tests of significance. Exact methods for small n are not discussed in this book. ANOVA, analysis of variance; ANCOVA, analysis of covariance.
* Quantitative variables are variables on the metric scale without any broad categories. Fine categories are admissible.
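The first few rows of Table S.6 correspond to standard R model-fitting functions. A sketch on invented data (the variable names and coefficients are hypothetical):

  set.seed(5)
  dat <- data.frame(age = runif(80, 30, 70),
                    sex = gl(2, 40, labels = c("F", "M")))
  dat$sbp     <- 100 + 0.6 * dat$age + rnorm(80, 0, 8)       # invented systolic BP
  dat$disease <- rbinom(80, 1, plogis(-6 + 0.1 * dat$age))   # invented binary outcome

  summary(aov(sbp ~ sex, data = dat))    # quantitative y, qualitative x: ANOVA
  summary(lm(sbp ~ age, data = dat))     # quantitative y and x: linear regression
  summary(glm(disease ~ age + sex,       # dichotomous y: logistic regression
              family = binomial, data = dat))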
TABLE S.7 Main Methods of Measurement of Strength of Relationship between Two Variables
(Types of variables → measure → equation/section)

Both Qualitative
  Binary categories → Odds ratio, Jaccard, Yule, and several others → Section 17.5.1.1
  Polytomous categories, nominal → Phi-coefficient → Equation 17.7a
  Polytomous categories, nominal → Tschuprow coefficient → Equation 17.7b
  Polytomous categories, nominal → Contingency coefficient → Equation 17.7c
  Polytomous categories, nominal → Cramer V → Equation 17.7d
  Polytomous categories, nominal → Proportional reduction in error → Equation 17.8
  Polytomous categories, ordinal → Kendall tau, Goodman–Kruskal gamma, Somer d → Section 17.5.1.4

Dependent qualitative and independent quantitative → Odds ratio → Section 17.1
Dependent quantitative and independent qualitative → R2 from ANOVA or η2 from regression → Equation 16.11 or 17.9

Both Quantitative
  Multiple linear (more than 2 variables) → R2 from regression → Equation 16.12
  Simple linear → Product–moment correlation (r) → Equation 16.19
  Monotonic relation → Spearman (rank) correlation (rS) → Equation 16.22
  Intraclass → Intraclass correlation (rI) → Equation 16.23
  With previous value → Serial (auto-)correlation → Section 16.5.1.4
  Keeping other variables fixed → Partial correlation → Equation 16.21

Two sets of quantitative variables → Canonical correlation → Section 19.2.1.1
One quantitative and the other dichotomous → Biserial → Section 17.5.2.2

Agreement
  Qualitative → Cohen kappa → Equation 17.10
  Quantitative → Limits of disagreement → Section 16.6.1.2
  Quantitative → Alternative methods → Section 16.6.2.1
  Quantitative → Intraclass correlation → Equation 16.23

Note: ANOVA, analysis of variance.
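Two of the most commonly needed measures in Table S.7 are sketched below in R with invented data; the kappa computation follows the usual observed-versus-chance-agreement definition, which may differ in presentation from the book's Equation 17.10:

  set.seed(9)
  x <- rnorm(50); y <- 0.7 * x + rnorm(50, 0, 0.7)   # invented paired measurements
  cor(x, y)                        # product-moment correlation r
  cor(x, y, method = "spearman")   # Spearman rank correlation for a monotonic relation

  # Cohen kappa for two raters from a 2x2 table of joint proportions (hypothetical counts)
  tab <- matrix(c(20, 5, 10, 15), nrow = 2) / 50   # joint classification proportions
  po  <- sum(diag(tab))                            # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab))          # agreement expected by chance
  (po - pe) / (1 - pe)                             # kappa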
TABLE S.8 Multivariate Methods in Different Situations (Large n Required)
(Nature of the variables; objective; types of variables → statistical method → section)

A dependent set and an independent set
  Relationship; both quantitative → Multivariate multiple regression → Section 19.2.1.2
  Equality of means of dependents; dependent quantitative and independent qualitative → MANOVA → Section 19.2.2.1

One is dependent (many groups)
  Classify subjects into known groups; independent quantitative → Discriminant analysis → Section 19.2.3.1

All variables are interrelated (none are dependent)
  Discover natural clusters of subjects; qualitative or quantitative or mixed → Cluster analysis → Section 19.3.1
  Identify underlying factors that explain the interrelations; quantitative → Factor analysis → Section 19.3.2

Note: MANOVA, multivariate analysis of variance.
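Base R covers several of the multivariate methods in Table S.8. A compact sketch on invented data (the subjects, measures, and grouping are hypothetical, and a real analysis would need careful checking of assumptions):

  set.seed(13)
  X <- matrix(rnorm(300), ncol = 6)        # 50 subjects, 6 invented measurements
  colnames(X) <- paste0("v", 1:6)
  g <- gl(2, 25, labels = c("case", "control"))

  summary(manova(X ~ g), test = "Wilks")   # MANOVA with the Wilks criterion
  kmeans(scale(X), centers = 3)$size       # cluster analysis (k-means)
  factanal(X, factors = 2)                 # maximum-likelihood factor analysis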
Preface

Biostatistical aspects are receiving increased emphasis in medical books, medical journals, and pharmaceutical literature, yet there is a lack of appreciation of biostatistical methods as a medical tool. This book arises from the desire to help biostatistics earn its rightful place as a medical, rather than a mathematical, subject. Medical and health professionals may then perceive biostatistics as their own instead of an alien discipline. A book that effectively provides a medical perspective to biostatistics is clearly needed. To describe its focus, this book is titled Medical Biostatistics, where medical not only precludes fish and plants that a purist might include under the genre of biostatistics but also helps us to emphasize that the medical + bio component is dominant in this book over the statistics component. The text fosters the thought that medicine has to be individualized yet participatory, and tries to develop pathways that can achieve this through biostatistical thinking.

Variation is an essential and perhaps the most enjoyable aspect of life, but the consequent uncertainties are profound. Thus, methods are needed to measure the magnitude of uncertainties and to minimize their impact on decisions. Biostatistics is the science of management of empirical uncertainties in health and medicine. Beginning with this premise, this book provides a new orientation to the subject, and this theme is kept alive throughout the text. We have tried to demonstrate that biostatistics is not just statistics applied to medicine and health sciences but goes two steps further: providing useful tools to manage some aspects of medical uncertainties. This orientation does not stop at aleatory uncertainties but goes beyond to epistemic uncertainties as well, so that all data-based uncertainties are comprehensively addressed. Among the unique contributions of the book toward medicalizing biostatistics are a full chapter on the sources of medical uncertainties, another on the quality of investigations, a chapter on statistical fallacies, and an extensive debate on statistical versus medical significance. There are several other original contributions, as listed in Chapter 1.

The primary target audience is students, researchers, and professionals of medicine and health. These include clinicians who deal with medical uncertainties in managing patients and want to practice evidence-based medicine; research workers who design and conduct empirical investigations to advance knowledge, including research workers in the pharmaceutical industry who search for new regimens that are safer and more effective yet less expensive and more convenient; and health administrators who are concerned with the epidemiological aspects of health and disease. After reading this book, this audience hopefully will not complain that biostatistical methods have not been explained in an understandable way and made available to meet their needs. Although the text is tilted to the viewpoint of medical and health professionals, the contents are of sufficient interest to practicing biostatisticians and students of biostatistics as well. They may find some sections very revealing, particularly the heuristic explanations provided for various statistical methods. However, the sequence of chapters may not look natural to statisticians because their thoughts follow a mathematical continuum. Medical and health professionals, whose biostatistics needs are problem solving, may find our sequence very natural.
The contents surely follow a distinct structural divergence from a conventional biostatistics book. The boundary between epidemiologic methods and biostatistics is thin, if it exists at all. This book does not limit itself to the conventional topics of confidence intervals and tests of significance. It discusses at length study designs, measurement of health and disease, clinimetrics, and quality control in medical setups. Emphasis is on the concepts and interpretation of the methods rather than on theory or intricacies. In fact, theoretical development is intentionally de-emphasized and applications increasingly emphasized. A large number of real-life examples are included that illustrate the methods and explain the medical meaning of the results. Many statistical concepts are repeatedly explained in different contexts, keeping the requirements of the target audience in mind.

In the process of projecting biostatistics as a medical discipline, it is imperative that less emphasis be placed on mathematical aspects, but the essential algebra needed to communicate and understand some statistical concepts is not ignored. In fact, the second half of the book makes liberal use of notations. An attempt is made to strike an even balance. Medical and health professionals, who are generally not well trained in mathematics, may find the language and presentation congenial. Equations and formulas are separately identified, and manual calculations are described for the fundamentals, but the emphasis is on the use of computers for advanced calculations. Software illustrations for intricate methods are provided in Appendix C of this book, which uses R in this edition in place of SPSS in earlier editions. The text is thoroughly revised to provide a more robust account of biostatistical applications that can yield more credible results.

The book is fairly comprehensive and incorporates a large number of statistical concepts used in medicine and health. The contents are more than an introduction and less than an advanced treatise. References have been provided for further reading. A medical or a health professional should be able to plan and carry out an investigation by oneself on the basis of this text and intelligently seek the help of an expert biostatistician when needed. Medical laboratory professionals, scientists in basic medical sciences, epidemiologists, public health specialists, nutritionists, and others in health-related disciplines may also find this volume useful. The text is expected to provide a good understanding of the statistical concepts required to critically examine the medical literature. The material is suitable for use in preparation for professional examinations, such as that for membership in the College of Physicians. The content is also broad enough to cover a two-semester biostatistics course for medical and health science students. Upon the demand of those who want to adopt this as a text for courses, this edition includes exercises at the end of each chapter and brief solutions of the selected exercises at the end of the book.

We are thankful to the reviewers worldwide who have examined the book microscopically and provided extremely useful suggestions for its improvement while also finding the first edition "probably the most complete book on biostatistics," the second edition "almost encyclopedic in breadth," and the third edition one that "seems to be an encyclopedia." This edition incorporates most of the suggestions provided by the reviewers. Some details left out earlier have been included to provide more intelligible reading. Yet, many important techniques continue to be sidetracked in this text. This reflects our escape from discussing complexities, as the book is designed primarily for medical professionals. For a similar explanation of a large number of biostatistics topics in alphabetical order, see Concise Encyclopedia of Biostatistics for Medical Professionals (CRC Press, 2016). We are confident that the book will be found to be the most comprehensive treatise on biostatistical methods. In the process, we realize that we are undertaking the risk involved in including elementary- and middle-level discussions in the same book. We would be happy to receive feedback from the readers.

Abhaya Indrayan
Rajeev Kumar Malhotra
Datasets in the examples in this text are available in Excel for ready download at http://MedicalBiostatistics.synthasite.com. These datasets can be used to rework some of the examples of interest to you and to do further analysis where needed.
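For readers who want to rework these examples in R (the software used in Appendix C), one possible way to load a downloaded dataset is sketched below. The file name is a placeholder, and the readxl package is just one of several options for reading Excel files:

  # Minimal sketch: read one of the downloaded Excel datasets into R
  library(readxl)                        # assumes the readxl package is installed
  dat <- read_excel("example_data.xlsx") # placeholder for whichever file you downloaded
  str(dat)                               # inspect the variables before reanalysis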
Frequently Used Notations

We have tried to restrict the mathematical expressions to a minimum, but notations have been used so that clarity and generalizability do not suffer. Some notations have been used for more than one quantity. The following list may help you to understand the text more easily. The list is not exhaustive, since some notations that have been sparingly used in specific contexts are not included.

α: Level of significance
1 − α: Confidence level
β: Probability of Type II error
βk: Regression coefficient in the population for the kth regressor
1 − β: Power
χ2: Chi-square
ε: Relative precision; error (in regression)
η2: Coefficient of determination
κ: Cohen kappa
λ: Logistic of π
Λ: Wilks criterion
λ(t): Hazard at time t
μ: Population mean
μj, μk: Population mean in the subscripted group (j = 1, 2, …, J; k = 1, 2, …, K)
ν: Degrees of freedom
π: Population proportion or probability
πrc, πk: π in the subscripted group (r = 1, 2, …, R; c = 1, 2, …, C; k = 1, 2, …, K)
ρ: Product–moment correlation coefficient in the population
ρs: Rank correlation coefficient in the population
σ: Population standard deviation
Σ: Sum
φ: Phi-coefficient
A: Antecedent characteristic
A, B, C, D: Frequencies in a 2×2 table—retrospective matched pairs
a, b, c, d: Frequencies in a 2×2 table, particularly in the case of relative risk and odds ratio
ak: Upper end point of the kth interval (k = 1, 2, …, K)
akm: Factor loading of mth factor on kth variable (m = 1, 2, …, M; k = 1, 2, …, K)
b: Sample regression coefficient in simple linear regression
Bar (−) over a variable: Mean of that variable, such as x̄ and ȳ
bk: Sample regression coefficient for the kth regressor
bmk: Factor score coefficient of mth factor for kth variable
C: Cumulative frequency; number of columns in a contingency table; contingency coefficient; complaints or symptoms complex; contribution to log-likelihood; number of controls per case
D: Discriminant function; disease
d: Difference; discriminant value; Euclidean distance
Dk: kth discriminant function
Dot (•): Sum over the corresponding group
e: Residual; Naperian base
Ek, Erc, Ercl: Expected frequency in the subscripted group
et: Expectation of life at age t in a life table
F: Analysis of variance criterion
f: Function
fk: Frequency in the kth group
Fm: mth factor (m = 1, 2, …, M)
H: Kruskal–Wallis criterion
H0: Null hypothesis
H1: Alternative hypothesis
Hat (∧) over a parameter or a variable: Estimated or predicted value of the parameter or of the variable
I: Sampling interval
J: Number of groups; number of independent variables
K: Number of groups; number of dependent variables
L: Number of layers in a contingency table; likelihood; precision; number of years lived (in life table)
L0: Likelihood under H0
L1: Likelihood under the model
M: Number of factors; number of observers; number of methods; number of items
Max: Maximum value
N: Number of subjects in a population
n: Size of sample
nk: Number of subjects in the kth group in a sample
O: Outcome
Ok, Orc, Orcl: Observed frequency in the subscripted group (k = 1, 2, …, K; r = 1, 2, …, R; c = 1, 2, …, C; l = 1, 2, …, L)
P: Probability, particularly of Type I error
p: Proportion in the sample; estimated probability
P(−): Negative predictivity
P(+): Positive predictivity
Q: Cochran statistic
Q1: First quartile
Q3: Third quartile
R: Number of rows in a contingency table
r: Product–moment correlation coefficient in a sample
rxy.z: Partial correlation between x and y adjusted for z
R2: Square of the coefficient of multiple correlation
rI: Intraclass correlation coefficient in a sample
Rij: Rank of Yij (in the case of nonparametric methods)
rs: Spearman's rank correlation coefficient in a sample
s: Sample standard deviation
S(−): Specificity
S(+): Sensitivity
S1, S2: Friedman criterion
sd: Standard deviation of difference in a sample
sp: Pooled estimate of the standard deviation
t: Student t
T: Test; number of time points (t = 1, 2, …, T); T-score
tv: t-value at v degrees of freedom
WR: Wilcoxon rank-sum criterion
WS: Wilcoxon signed-ranks criterion
x, y: Variable values
X[i]: Ordered value of x at ith rank
xi, xk: ith or kth observation; midpoint of an interval
xij, yij, yijk: ith observed value of the variables x or y in the jth or (j, k)th group (i = 1, 2, …, n; j = 1, 2, …, J; k = 1, 2, …, K)
Z: Z-score or a standardized Gaussian variable
z: Specific value of Z
zα: Value such that P(Z ≥ zα) = α
1 Medical Uncertainties

The human body has a mechanism to adjust itself to minor variations in the internal as well as external environment. Perspiration in hot weather, shivering in the cold, increased respiration during physical exercise, excretion of redundant nutrients, replacement of lost blood after hemorrhage, and decrease in the diameter of the pupil in bright light are examples of this mechanism. This is a continuous process that goes on all the time in our body and is referred to as homeostasis. Health could be defined as the dynamic balance of body, mind, and soul when homeostasis is going on perfectly well. The greater the capacity to maintain internal equilibrium, the better the health. Perhaps human efficiency is optimal in this condition. Sometimes infections, injury, nutritional imbalances, stress, and other such aberrations become too much for this process to handle, and external help is needed. Medicine can be defined as the intervention that tries to put the system back on track when aberrations occur. This can be broadened to include steps that strengthen the process of homeostasis and steps that modify or ameliorate the adverse sequelae occurring in some cases.

Health differs greatly from person to person and alters in the same person from time to time. The variations are so prominent that no two individuals are ever exactly alike. Differences in facial features and morphologic appearance help us to identify people uniquely, but more important for medicine are the profound variations in physiologic functions. We all know that measurements such as hemoglobin level, cholesterol level, and heart rate differ from person to person even in perfect health. In addition, diurnal variations in body temperature, blood pressure (BP), and blood glucose levels are normal. States such as shock, anger, and excitement temporarily affect most of us and also have the potential to produce long-term sequelae. In the presence of such large variation, it is not surprising that a response to a stimulus such as a drug can seldom be exactly reproduced, even in the same person. Uncertainties resulting from these variations are an essential feature of the practice of medicine and deserve recognition. If they do not prevent us from making decisions in daily life, why should they do so in an aspect as important as health? Absolute certainty in any aspect of life is an unachievable aspiration, more so in health and medicine, and doubts and suspicion remain despite far-reaching advances in terms of digital pills, credit card–sized lab-on-a-chip, smart bandages, and pocket-sized health monitors that were visualized a quarter century ago [1] and are now a reality.

Role of biostatistics: Medicine is not just a drug for ingestion. It involves close interactions with the patient. More often than not, a large number of steps are taken before arriving at a treatment regimen. The patient's history is reviewed; measurements such as weight, BP, and heart rate are recorded; a physical examination is carried out; and investigations such as an electrocardiogram (ECG), x-ray studies, blood glucose measurements, and stool examination are done. In passing through these steps, the patient sometimes encounters many observers and many instruments. Variations among them contribute their share to the uncertainties in clinical practice. In fact, discerning natural variability from induced aberration has been a major scientific challenge. The assessments of diagnosis, treatment, and prognosis can all go wrong.

To highlight the large magnitude of these uncertainties, details of various contributing factors are provided in Section 1.1. These details show how profound the uncertainties are and how important it is to delineate them and contain their effect. The role of statistics is precisely this. Statistics is the science of management of uncertainties—a tool to measure them and minimize their impact on decisions when based on data. Biostatistics comprises statistical methods that are used to manage empirical uncertainties in the field of medicine and health. This definition highlights the most pragmatic edges of this fascinating science, although it leaves out the esoterics of the scientific method that resonate in its mathematical content. Although the bio part of biostatistics should stand for all biological sciences, it has become the convention to apply the term biostatistics to statistical applications in only medical and health sciences. Medical biostatistics directly applies to people at a personal level, compared with other applications, such as economic statistics, that are concerned with society. We continuously remind you of this distinction throughout the book.

Focusing on uncertainties may give an impression that we are excluding, for example, medical data mining activities. Indeed, biostatistics in the sense described here is distinct from data mining. After examining hospital data for thousands of patients undergoing gastrointestinal (GI) surgery, you may find elective GI surgery most common in females aged 50–59 years compared with other age–sex groups in that hospital, which might help the hospital target a group or maximize profit, but it has little scientific value. This finding has no uncertainty component unless it is sought to be extended to future patients in other hospitals. This text takes a holistic view and discusses data presentation at length, which includes parts of data mining as well. The mother of all this is now popularly called data science (considered by some as the "sexiest" profession), which exploits skillful application of computers and effective visualizations of the databases for meaningful conclusions.

Since the uncertainties are glaring, one wonders how medicine has been successful, sometimes very successful, in giving succor to mankind. The silver lining is that a trend can still be detected among these variations, and following this trend yields results within clinical tolerance in most cases. The term clinical tolerance signifies that the medical intervention may not necessarily restore the system to its homeostatic level but tends to bring it closer to that level so that the patient feels better, almost cured. Also note the emphasis on most cases. Positive results are not obtained in all cases, nor is this expected. But a large percentage of cases respond to medical intervention. Thus, the statement is doubly probabilistic. As we proceed, we hope to demonstrate how statistical medical practice is and what can be done to delineate and minimize the role of uncertainties, and thus increase the efficiency of medical decisions.

The explanation of statistics would not be complete without describing two usages of this term. The meaning given in the preceding paragraphs is valid when the term is used in the singular. A more common use, however, is in the plural: numerical information is called statistics. It is in this sense that the media use this term when talking about football statistics, income statistics, or even health statistics.

This chapter: This chapter attempts to highlight uncertainties present in all setups of health and disease. Details of uncertainties in day-to-day clinical practice are described in Section 1.1. However, it is in the medical research setup that many uncertainties requiring statistical subtleties prominently emerge. Some of these are described in Section 1.2. Biostatistics is often associated with community health and epidemiology—and this association is indeed strong. Although an epidemiological perspective will be visible throughout this book, some uncertainties in health planning and evaluation are specifically discussed in Section 1.3. Section 1.4 provides an outline of the methods discussed in various chapters of the book for managing these uncertainties.
1.1 Uncertainties in Health and Disease

The state of health is the result of an intricate interaction of a large number of factors and is an extremely complex phenomenon. Many aspects of this complexity are not fully understood, and most of what is understood seems beyond human control. The most common source of uncertainty in medicine is the natural biologic variability between and within individuals. Variations between laboratories, instruments, observers, and so forth further accentuate the level of uncertainty. All these variations together cause what are called aleatory uncertainties because they are intrinsic and natural. The others arise due to knowledge limitation, called epistemic uncertainties, and chance. Details of these are as follows.

1.1.1 Uncertainties due to Intrinsic Variation

Body temperature and plasma glucose level are everyday examples of medical parameters that are evaluated against their normal values. The need to define and use such normal values arises from the realization that variations do exist, and it is perfectly normal for them to occur in healthy subjects. Such variations can occur due to a number of factors. The following is a list of sources of commonly occurring intrinsic variability, although it is restricted to those that have a profound effect. The sources listed are not necessarily exclusive of one another, and the overlap can be substantial in practical applications.

1.1.1.1 Biologic Variability

Age, gender, birth order, height, and weight are among the biological factors that occur naturally in a health setup. Health parameters of children are quite different from those of adults. Almost all kinds of measurements—anatomical, physiological, or biochemical—differ from age to age. For example, levels of BP seen in subjects of age, say, 20 years can hardly be applied to subjects of age 60 years. Similarly, assessment of the health of males based on, say, hemoglobin level or a lung function is not on the same scale as that of females. Biological variability is seen not only between subjects but also within subjects. Examples of diurnal variation in body temperature, BP, and blood glucose levels have already been cited. Menstrual cycles in women are accompanied by many other physiologic changes that are periodic in nature. A person may respond exceedingly well at one time but fail desperately at another time. Fear and anxiety can cause marked alterations in physiological functions. All these variations contribute significantly to the spectrum of uncertainty in health and medicine.

1.1.1.2 Genetic Variability

African blacks, Chinese Mongoloids, Indian Aryans, and European Caucasians differ not only in morphological features and anatomical structure but also in physiological functions. They may vary with regard to vital capacity and blood group composition, and in aberrations such as thalassemia, Down's syndrome, and color blindness that are entirely genetic. Sickle-cell anemia, muscular dystrophy, and hemophilia A are also genetic in origin. Many diseases with multifactorial etiologies, such as hypertension and diabetes, have a genetic component. Because of genetic variabilities in the populations, the clinician has to be wary of the possibility of a genetic influence on signs and symptoms, on the one hand, and on the rate of recovery, on the other.

1.1.1.3 Variation in Behavior and Other Host Factors

Whereas our anatomy and physiology are traceable mostly to hereditary factors, pathology is caused mostly by our own behavior and the environment. Environmental influences are discussed in the next section; a brief account of behavioral factors is as follows. The emergence of human immunodeficiency virus (HIV) has brought sexual behavior into focus. Almost all sexually transmitted infections (STIs) originate from aberrant sexual relationships. Smoking is seen as an important factor in several types of carcinomas and in many conditions affecting the heart and lung. Heavy drinking for many years can affect the bones, kidneys, and liver. A sedentary lifestyle can cause spondylitis and coronary diseases. Nutrition is probably the most dominant factor that controls the body's defense mechanism. This is determined by the awareness of this requirement and consumption of the right food. Enormous variation in this awareness impacts prevention, cause, and treatment of many diseases in an unpredictable way. In addition, scabies and dental caries thrive on lack of personal hygiene, and socioeconomic status affects health and disease in a variety of ways, both distally and proximally. Some personality traits can affect the risk of hypertension and other cardiovascular diseases. A positive attitude helps ward off some ailments and assists in early recovery when one is struck, and stress compromises one's ability to respond to even an established treatment. All these individual factors vary greatly from person to person. Susceptibility and response are the result of a large number of interacting factors, and the nature of this interaction also varies widely between individuals, contributing to the spectrum of uncertainty.

1.1.1.4 Environmental Variability

Climatic factors sometimes determine the type and virulence of pathogens. Pollution and global warming are now acquiring center stage in the health scenario, with a predilection for grim consequences. Flies, mosquitoes, and rodents are the carriers of many deadly diseases. Many GI disorders are waterborne. The relationship between insanitation and disease is evident. An environment of tension and stress may substantially alter one's ability to cope with, say, infections. Love, affection, and prayers of the family and others sometimes do wonders in the recovery of a patient, perhaps by providing innate strength to fight the disease. The availability of appropriate and timely medical help also has a tremendous impact on the outcome. Thus, health infrastructure plays an important role. The health culture of people with regard to utilizing the services also varies from population to population and affects health in an unpredictable way.

All these environmental factors need due consideration while dealing with health or disease at the individual as well as the community level. Their effects on different people are not uniform, as some subjects tend to be affected more than others due to variabilities in host–environment interactions.

1.1.1.5 Sampling Fluctuations

Much of what is known in medicine today has been learned through accumulated experience. This empiricism is basic to most medical research, as discussed in Section 1.2.1. Beware, though, that experience is always gained by observing a fraction of subjects; all subjects are never studied. The knowledge that chills, fever, and splenomegaly are common in malaria is based on what has been observed in several series of cases over a period, but these series do not comprise all cases that occurred in the world. Only a fraction of cases, called a sample, have been studied. Similarly, saying that the normal albumin level is 56%–75% of the total serum proteins is based on the levels seen in some healthy subjects. One feature of samples is that they tend to provide a different picture with repeated sampling. This is called sampling fluctuation or sampling error. As explained in a later chapter, this error is not a mistake but indicates only a variation.
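Sampling fluctuation is easy to demonstrate by simulation. The following minimal R sketch (the population and its values are invented purely for illustration) draws five random samples of 50 subjects each from the same large population; each sample yields a somewhat different mean:

  # Five samples from the same population give five different means
  set.seed(7)                                       # for a reproducible illustration
  population <- rnorm(100000, mean = 14, sd = 1.5)  # hypothetical hemoglobin-like values
  sample_means <- replicate(5, mean(sample(population, size = 50)))
  round(sample_means, 2)                            # differs from sample to sample

None of these sample means is a mistake; each is a legitimate estimate of the same population mean, and the spread among them is the sampling fluctuation.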
The objective of mentioning all this here is to point out that sampling fluctuations themselves are a source of uncertainty. Sampling fluctuation is one of the many reasons that necessitate repeat investigations in some cases. Above all, even if the entire "population" is included in the study, future cases can never be included; thus, sampling itself remains a source of uncertainty.

1.1.2 Natural Variation in Assessment

Some variation in repeat measurements occurs despite all precautions. If this can happen with fixed parameters, such as the length of a dining table, it can definitely happen with biological parameters that are pliable anyway. A purist may call them errors rather than natural variation. These variations can be minimized by exercising more care but cannot possibly be eliminated.

1.1.2.1 Observer Variability

Barring some clear-cut cases, clinicians tend to differ in their assessment of the same subject. Interpretation of x-ray films is particularly notorious in this respect. Disagreement exists concerning such simple tools as a chart for assessing the growth of children (see, e.g., [2]). Physicians tend to differ in grading a spleen enlargement. One physician may consider a fasting blood glucose level of 156 mg/dL in a male of age 60 years sufficient to warrant active intervention, but another might opt to just monitor. Some clinicians are more skillful than others in extracting correct information from the patient and in collating pieces of information into solid diagnostic evidence. An additional factor is the patient–doctor equation. Because of the confidence of a patient in a particular clinician, the concerned clinician is able to secure much better information. Such variability on the part of the observer, researcher, or investigator is a fact of life and cannot be wished away. It is inherent in humans and represents a healthy feature rather than anything to be decried. All these continue to contribute to the spectrum of uncertainty.

1.1.2.2 Variability in Treatment Strategies

A physician is basically a healer. Variability in the treatment strategies of different physicians is wide, and outcomes are accordingly affected. Physicians evaluate various medical parameters in different ways and come to different conclusions. For example, some emphasize lifestyle changes for treating hypertension, whereas others depend primarily on drugs. For a condition such as acute cystitis of the urinary tract in women, the management strategies differ widely from physician to physician, and there is no consensus. Such variation can have special significance when conclusions are drawn based on subjects from different hospitals or from a cross-section of a population.

1.1.2.3 Instrument and Laboratory Variability

The BP of an individual measured by a mercury sphygmomanometer is often found to be different from that obtained with an electronic instrument. The weight of children on a beam balance and on a spring balance may differ. Apart from such simple cases, laboratories too tend to differ in their results for splits of the same sample. Send aliquots of the same blood sample to two different laboratories for a hemogram, and be prepared to receive different reports. The difference could be genuine due to sampling fluctuation or could be due to differences in chemicals, reagents, techniques, and so forth in the two laboratories. Above all, the human element in the two laboratories may be very different: expertise may differ, and the care and attentiveness may vary. Differences occur despite standardization, although even the standard may differ between laboratories in some cases.

1.1.2.4 Imperfect Tools

Clinicians use a variety of tools during the course of their practice. Examples are the signs–symptoms complex, physical measurements, laboratory and radiological investigations, and intervention in the form of medical treatment or surgery. Besides their skills in optimally using what is available, the efficiency of clinicians depends on the validity and reliability of the tools they use. Validity refers to the ability of a tool to measure correctly what it is supposed to measure, and reliability means consistency in repeated use. Indicators such as sensitivity, specificity, and predictivities are calculated to assess validity. Reliability is evaluated in terms of measures such as the Cohen kappa and Cronbach alpha. Details of most of these measures are given in this book. In practice, no medical tool is 100% perfect: even a computed tomography (CT) scan can give a false-negative or false-positive result in some cases. A negative histologic result for a specimen is no guarantee that proliferation is absent, although in this case, positive predictivity is nearly 100%. The values of measurements such as the creatinine level, platelet count, and total lung capacity are indicative rather than absolute—that is, they mostly estimate the likelihood of a disease. Signs and symptoms seldom provide infallible evidence. Because all these tools are imperfect, decisions based on them are also necessarily probabilistic rather than definitive.

1.1.2.5 Incomplete Information on the Patient

When a patient arrives in a coma at the casualty department of a hospital, the first steps for management are often taken without considering the medical history of the patient or without waiting for laboratory investigations. An angiography may be highly recommended for a cardiac patient, but treatment decisions are taken in its absence if the facility is not available in that particular health center. Even while interviewing a healthy person, it cannot be ensured that the person is not forgetting or intentionally suppressing some information. Suppression can easily happen in the case of sexually transmitted diseases (STDs) and injuries with medicolegal implications. An uneducated subject may even fail to understand the questions or may misinterpret them. Some investigations, such as CT and magnetic resonance imaging (MRI), are expensive, and lack of funds may sometimes lead to proceeding without these investigations even when they are highly recommended. Thus, the information remains incomplete in many cases despite best efforts. Clinicians are often required to make a decision about treatment based on such incomplete information.

1.1.2.6 Poor Compliance with the Regimen

Medical ethics requires that the patient's consent be obtained before a procedure is used. Excision of a tumor may be in the interest of a patient, but this can be done only after informed consent of the patient is obtained, and he or she may not agree. When a drug treatment is prescribed, the patient may or may not follow it in its entirety. Noncompliance can be due to circumstances beyond the control of the patient, due to carelessness, or even intentional. In a community setup, if immunization coverage to the extent of 90% is required to control a disease such as polio, best efforts may fail if the public is not cooperative due to some misgivings. A cold chain always remains a source of worry in a domiciliary drive against polio. Health care providers seldom have control over compliance, and this adds to uncertainty about the outcome.

1.1.3 Knowledge Limitations

Notwithstanding claims of far-reaching advances in medical sciences, many features of the human body and mind, and their interaction with the environment, are not sufficiently well known. How the mind controls physiological and biochemical mechanisms is an area of current research. What specific psychosomatic factors cause women to live longer than men is still shrouded in mystery. Nobody yet knows how to reverse hypertension in a way that would obviate the dependence on drugs. Cancers are treated by radiotherapy or excision because a procedure to regenerate aberrant cells is not known. Treatment for urinary tract infections in patients with impaired renal function is not known. Such gaps in knowledge naturally add to the spectrum of uncertainty. Further details are as follows.

1.1.3.1 Epistemic Uncertainties

Knowledge gaps are wider than generally perceived. One paradigm says that what we do not know is more than what we know, and this unfamiliarity breeds uncertainty. These are called epistemic uncertainties. Besides incomplete knowledge, they also include (a) ignorance, for example, how to choose one treatment strategy when two or more are equally good or equally bad, such as between amoxicillin and cotrimoxazole in nonresponsive pneumonia, or how to restore the health of cancerous cells; (b) parameter uncertainty regarding the factors causing or contributing to a particular outcome, such as the etiological factors of vaginal and vulvar cancer; (c) speculation about unobserved values, such as the effect of unusually high levels of NO2 (100 mg/m3) in the atmosphere; (d) lack of knowledge about the exact quantitative effect of various factors, such as diet, exercise, obesity, and stress, on raising the blood glucose level; and (e) confusion about the definition of various health conditions, such as hypertension—whether the BP cutoff for treatment should be 140/90 or 160/95 mmHg.

Another kind of epistemic uncertainty arises from the nonavailability of a proper instrument. How do you measure blood loss during a surgical operation? Swabs that are used to suck blood are not standardized, and blood can even spill onto the floor in some surgeries. Even a simple parameter, such as pain, is difficult to measure. The visual analog scale (VAS) and other instruments are just approximations. Stress defies measurement, and behavior or opinion types of variables present stiff difficulties. If the measurement is tentative, naturally the conclusion too is tentative. Kelvin said that if we cannot measure, we do not know enough.
1.1.3.2 Chance Variability

Let us go a little deeper into the factors already listed. Aging is a natural process, but its effect is more severe in some than in others. When exposed to heavy smoking, some people develop lung cancer and others do not. Despite consuming the same water with deficient iodine, some people do not develop goiter, whereas some do—and that too of varying degrees. The incubation period differs greatly from person to person after the same exposure. Part of such variation can be traced to factors already mentioned, such as personality traits, lifestyle, nutritional status, and genetic predisposition, but these known factors fail to explain the entire variation. Two patients who are apparently similar, not just with regard to the disease condition but also for all other known factors, can respond differently to the same treatment regimen. Even susceptibility levels sometimes fail to account for all the variation. The unknown factors are collectively called chance. Sometimes the known factors that are too complex to comprehend or too many to be individually considered are also included in the chance syndrome. In some situations, chance factors can be very prominent contributors to uncertainties, and in other situations they can be minor, but they can hardly ever be completely ruled out.

1.1.3.3 Diagnostic, Therapeutic, and Prognostic Uncertainties

Diagnostic uncertainties arise because the tests or assessments used for diagnosis do not have 100% predictivity. The perfect test is not known. None of the procedures, for example, fine-needle aspiration cytology, ultrasonography, and mammogram, is perfect for identifying or excluding breast cancer. An ECG can be false negative or false positive. Quantitative diagnostic tests, such as blood glucose and creatinine level, carry the burden of false results no matter what threshold is used. No therapy has ever been fully effective in all cases. Therapeutic uncertainties are particularly visible in the surgical treatment of asymptomatic gland-confined prostate cancer and in the medical treatment of benign prostatic hyperplasia. Many such examples can be cited. In addition are substances such as the combined oral pill, where long-term use may increase the risk of breast, cervical, or liver cancer but reduce the risk of ovarian, endometrial, and colorectal cancer. Realizing that failure to confront these uncertainties may damage the patients, the United Kingdom has prepared the Database of Uncertainties about the Effects of Treatments [3]. Prognostic uncertainties due to lack of knowledge exist in sudden severe illness. Such illness can occur due to a variety of conditions, and its cause and outcome are difficult to identify. Nobody can predict the occurrence or nonoccurrence of irreversible brain damage after an ischemic stroke. The prognosis of terminally ill patients is also uncertain. The method of care for women undergoing hysterectomy is not standardized.

1.1.3.4 Predictive and Other Uncertainties

Medicine is largely a science of prediction: prediction of diagnosis, prediction of treatment outcome, and prediction of prognosis. In addition, there are many other types of medical predictions for which knowledge barriers do not allow certainty. Look at the following examples:

• Gender of a child immediately after conception
• Survival duration after onset of an end-stage serious disease
• Number of hepatitis B cases that would come up in the following year in a country with endemic affliction
• Number of people to die of various causes in the future
• Age at death and cause of death of a person
• Whether a person exposed to a certain risk factor will develop cancer

These are examples of universal inadequacies. In addition, the inadequate knowledge of an individual physician, nurse, or pharmacist is also in operation. First, it is difficult for a health care provider to recollect everything known at the time of meeting with a patient. Second, some caregivers just do not know how to handle a situation, although others do it well. All this also contributes to the spectrum of uncertainties.

The objective of describing various sources of uncertainty in such detail is to sensitize the reader to their unfailing presence in practically all medical situations. Sometimes they become so profound that medicine transgresses from a science to an art. Many clinicians deal with these uncertainties in their own subjective ways, and some are very successful. But most are not as skillful. To restore a semblance of science, methods are needed to measure these uncertainties, to evaluate their impact, and of course, to keep their impact under control. All these aspects are primarily attributed to the domain of biostatistics and are the subject matter of different chapters in this book.
1.2 Uncertainties in Medical Research The discussion in Section 1.1 is restricted mostly to the uncertainties present in day-to-day clinical problems. Biostatistics is not merely the measurement of uncertainties, but is also concerned with the control of their impact. Such a need is more conspicuous in a research setup than in everyday practice. All scientific results are susceptible to error, but uncertainty is an integral part of the medical framework. The realization of the enormity of uncertainty in medicine may be recent, but the fact is age old. Also, our knowledge about biological processes is still extremely limited. These two aspects—variation and limitation of knowledge—throw an apparently indomitable challenge into making a decision. Yet, medical science not only has survived but also is ticking with full vigor. The silver lining is the ability of some experts to learn quickly from their own and others’ experience and to discern signals from noise, waves from turbulence, and trends from chaos. It is due to this expertise that death rates have steeply declined in the past 50 years and life expectancy is showing a relentless rise in almost all nations around the world. The burden of disease is steadily but surely declining in most countries. The backbone of such research is empiricism. 1.2.1 Empiricism in Medical Research Empiricism can be roughly equated with experience. An essential ingredient in almost all primary medical research is observation of what goes on naturally or before and after a deliberate intervention. Because of the various sources of uncertainties listed earlier, such observations seldom provide infallible evidence. This can be briefly explained in the context of different types of medical research as follows. The details are given in the subsequent chapters. 1.2.1.1 Laboratory Experiments Laboratory experiments in medicine are often performed on animals but are sometimes performed on biological specimens. Experiments help us understand the mechanisms of various biological functions and of the response to a stimulus. The laboratory provides an environment where the conditions can be standardized so that the influence of extraneous factors can be nearly ruled out. To minimize the role of interindividual variation, homogeneous units (animals and biological specimens) are chosen and are randomly allocated to the groups receiving a specific stimulus. Because of the controlled conditions, an experiment can provide clear answers even when performed on a small number of subjects. Nonetheless, experiments are often replicated to get more experience and thus to strengthen confidence in the results. 1.2.1.2 Clinical Trials Clinical trials are experiments on humans, and they are mostly done to investigate new modes of therapy. Research on new diagnostic procedures also falls in this category. Clinical trials are carried out meticulously involving heavy investment. Since variation between and within subjects occurs due to a large number of factors, it is quite often a challenge to take full care of all of them. Epistemic uncertainty also plays a role. It is imperative in this situation that the rules of empiricism are rigorously followed. This means that the trial should be conducted in controlled conditions so that the influence of extraneous factors is minimized, if not ruled out. Also, the trials should be conducted with a sufficient number of patients so that a trend, if any, can be successfully detected. 
Clinical trials and other experiments have a tendency to present results as facts, and many believe them, forgetting that the complete truth is rarely brought out by such research. Results should always be presented in terms of probability so that the uncertainty remains an integral part. 1.2.1.3 Surgical Procedures A new surgical procedure is extensively studied regarding its appropriateness before initiating its trial. Abundant precautions are taken in this kind of research, and each case is intensively investigated for days before a surgery is undertaken. Because of such care, an operation found successful in one patient is likely to be successful in another similar patient when
8
Medical Biostatistics
the same precautions are taken. Surgeons can afford to be extra cautious because the new procedure will seldom be used on a large number of cases until sufficient experience is gained. They generally have the opportunity to study each case in depth before performing a new type of surgery. Perhaps the response of tissues to a surgery is less variable than that of physiological functions to a medication. Despite all this, failures do occur, as in the case of organ transplantation, and uncertainties remain prominent in this kind of research also. Nonetheless, there are surgical trials that are comparable with medical trials. Evaluation of transurethral incision against transurethral resection of the prostate in benign prostatic hyperplasia can be done in a manner similar to that of a drug trial. The same kinds of uncertainties exist in this setup, and strategies such as randomization and blindness can be used to minimize their impact. A trial for comparing methods of suturing can also be done just as a regular clinical trial. 1.2.1.4 Epidemiological Research In epidemiological research, association or cause–effect relationships between etiologic factors and outcomes are investigated. Research on factors causing cancers, coronary artery disease, and infections occurring in some but not in others is classified as epidemiological. This can also help us understand the mechanism involved. A very substantial part of modern medical research is epidemiological in nature. The relationship under investigation is influenced even more by various sources of uncertainty in this kind of research, and a conclusion requires experience gained from a large number of cases. Statistical methods are again needed to separate clear signals from chance fluctuations. 1.2.2 Elements of Minimizing the Impact of Uncertainties on Research The sources of intrinsic variation listed in Section 1.1.1 are mostly beyond control, but their impact can still be managed. Other sources of uncertainty also contribute, but investigations can be designed such that their influence on decisions is minimized. These designs are discussed in later chapters in detail, but the elementary concepts are given in the following paragraphs as illustration. The quality of decisions in the long run can also be enhanced by devising and using improved medical methods. Both require substantial statistical inputs. The real challenge in research is cast by epistemic uncertainties that arise from inadequate medical knowledge. Statements are sometimes made without realizing that they are assumptions. A gastric ulcer was thought to be caused by acidity until it was established that the culprit is Helicobacter pylori in many cases. Thus, even fully established facts should be continuously evaluated and replaced by new ones where needed. 1.2.2.1 Proper Design A clinical trial aims to evaluate the efficacy of one or more treatment procedures—generally different drugs or different dosages of the same drug—relative to one another. The “another” could be “no treatment” or “existing treatment” and is called the control. Among the precautions sometimes taken is the baseline matching of the subjects in various groups so that the known sources of variability have less influence on the outcome. Another very effective strategy is randomization, which equalizes the chance of the presence of different sources of uncertainty in various groups, including the unknown sources. 
The techniques of observation and measurement are standardized and uniformly implemented to minimize the diverse influence of these techniques on the outcome. If identifiable sources of uncertainty still remain uncontrolled, they are taken care of at the time of analysis by suitable adjustments. Appropriate statistical methods help in arriving at a conclusion that has only a small likelihood of being wrong. These preliminaries are stated in the context of clinical trials, but other medical investigations, be they in a community, in a clinic, or in a laboratory, have the same basic structure and require similar statistical inputs.

1.2.2.2 Improved Medical Methods

Although the health of each individual is important and clinical practice must use the best available methods, research endeavors especially focus on improved methods that are more accurate and more exact. This makes medical research an expensive proposition. Compromise on methods can substantially affect the quality of research. If such improved methods are not available, research may have to be redirected to devise such methods. To fill the gaps in medical knowledge, research into the more exact delineation of factors responsible for specific conditions of ill health, and of their mechanisms, is required. All this will help devise strategies to minimize the uncertain space. Some epistemic uncertainties can be minimized by using an appropriate scoring system. Inadequacies in medical tools, such as diagnostic tests, can be minimized only through research on newer, more valid, and more reliable tools. Compliance with prescribed regimens can be improved by devising regimens that are simple to implement, less toxic, and more effective. Instrument and observer variability can be controlled by adhering to strict standards and thorough training. Thus, improved methods can minimize the uncertainties arising from these deficiencies. Research into these requires scientific investigations so that the conclusions arrived at are valid as well as reliable. Proper design of the investigation helps to achieve this aim.

1.2.2.3 Analysis and Synthesis

Because of the uncertainties involved at every stage of a medical investigation, the conclusion can seldom be drawn in a straightforward manner. In almost all cases, the data obtained are carefully examined to find the answers to the questions initially proposed. For this, it is generally necessary that the data are collated in the form of tables, charts, or diagrams. Some summary measures, such as mean and percentage, are also chosen and computed to draw inferences. Because of the inherent variations in the data, and because only a sample of the subjects is investigated rather than the entire target population, some special methods are required to draw valid conclusions. These methods are collectively called techniques of statistical inference. These techniques depend on the type of questions asked, the design of the study, the kind of measurements used, the number of groups investigated, the number of subjects studied in each group, and so forth. They are the primary focus of this book and are discussed in detail in various chapters. All data processing activities, beginning with data exploration and ending with drawing inferences, are generally collectively called statistical analysis. The role of this analysis is to help draw valid and reliable conclusions. The term analysis probably comes from the fact that the total variability in the data is broken into its various components, thus helping to filter clear signals or trends from noise-like fluctuations.

Although statistical analysis is acknowledged as an essential step in empirical research, the importance of synthesis is sometimes overlooked. Synthesis is the process of combining and reconciling varied and sometimes conflicting evidence. The findings of one investigation often do not match those of another. Diabetes, smoking habits, and BP levels were found to be significant factors for mortality in Italy in one study but not in other studies in the same country [4]. The prevalence of hypertension in Delhi was found to range widely from 13% to 67% in a general population of older adults [5]. These differences occur for a variety of reasons, such as genuine population differences; sampling fluctuation; differences in definitions, methodology, and instruments; and differences in the statistical methods used. A major scientific activity is to synthesize these varying results and arrive at a consensus. The discussion part of most articles published in medical journals tries to do such a synthesis. The objective of most review articles is basically to present a holistic view after reconciling the varying results of different studies. In addition, techniques such as meta-analysis seek to combine evidence from different studies. These synthesis methods too are primarily statistical in nature and are important for medical research.
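As a glimpse of how such a synthesis can work numerically, the following R sketch pools three hypothetical study estimates by inverse-variance weighting, the simplest fixed-effect form of meta-analysis; all the numbers are illustrative only:

    # Hypothetical estimates (say, mean reductions in BP) from three studies
    estimate <- c(1.8, 2.4, 1.2)     # study-specific estimates
    se       <- c(0.6, 0.9, 0.5)     # their standard errors
    w        <- 1 / se^2             # inverse-variance weights
    pooled    <- sum(w * estimate) / sum(w)  # fixed-effect pooled estimate
    pooled_se <- sqrt(1 / sum(w))            # SE of the pooled estimate
    round(c(pooled = pooled, se = pooled_se), 3)

More precise studies receive more weight, so the pooled value leans toward the estimates with smaller standard errors. Meta-analysis proper, including heterogeneity between studies, is taken up later in the book.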
1.3 Uncertainties in Health Planning and Evaluation

Most of the discussion so far has focused on clinical and research aspects. But medical care is just one component of the health care spectrum. Prevention of disease and promotion of health are equally important. At the individual level, prevention takes the form of steps such as immunization, changes in lifestyle, use of fertility control methods, and improved personal hygiene. Health education is basic to all prevention and is much more efficiently done at the community level than at the individual level. All these are geared to meet the specific needs of the population, and these needs vary widely from population to population, area to area, and time to time, depending on perception, level of infrastructure, urgency, and so forth. A predominantly pediatric population with a high prevalence of infectious diseases requires entirely different services than an aging population with mostly chronic ailments. These variations compel the use of statistical methods in health planning. Health situation analysis is the first step in this planning. When a health program is implemented, the administrator always wants to know how well it is running, and to take midcourse corrective steps to put the program back on track in the case of deficiencies. Then, the final outcome is measured in terms of the impact the program has had on the community. All these exercises have a substantial biostatistical component.

1.3.1 Health Situation Analysis

The quality and type of health services required for a community depend on the size of the community, its age–sex structure, and the prevalence of various conditions of health and ill health in different sections of the community. The health services also depend on culture, traditions, perception, socioeconomic status of the population, existing infrastructure, and so forth. Variations and uncertainties are prominent in these aspects too. All these need to be properly assessed to prepare an adequate plan. This assessment is called health situation analysis and provides the baseline information. Since the situation can quickly change because of either natural growth of the population or interventions, a time perspective is always kept in view in this analysis.

1.3.1.1 Identification of the Specifics of the Problem

Generally, the broad problem requiring health action is already known before embarking on the health situation analysis, and only the specifics are to be identified. It seems ideal to talk about both good and bad aspects of health, but in practice, a health plan is drawn up to meet the needs as perceived by the population. These needs are obviously related to adverse aspects of health rather than to positive aspects. They are identified from the complaints received by various social and medical functionaries, such as mass media (newspapers, magazines, and television), political organizations, voluntary agencies, and medical practitioners. Sometimes a survey is required, and sometimes an expert group is set up to identify the specifics of the problem.

Specifics of the problem are contextualized by the size of the target population. An assessment of this size is necessary for two reasons. First, the magnitude of the services to be provided depends on this size, and second, population is the denominator for many health assessment indicators, such as incidence and prevalence rates, and birth and death rates. For the infant mortality rate, the denominator is the population of live births, and for the general fertility rate, the denominator is the population of women of reproductive age. Such rates are basic to health situation analysis because they delineate the magnitude and help in measuring the impact of a program. A correct denominator is as essential for accurate assessment as a correct numerator, and obtaining them accurately can be very difficult in some situations. The census, carried out periodically in most countries, is the most important source for the count of the general population. Distribution by age, gender, rural or urban area, education level, and so forth is also available in most census reports. Great care is generally taken to ensure that the enumeration is complete and accurate. But the needs of health situation analysis are diverse, and the census has a limited role, although the denominators of the rates are better obtained from census data.

1.3.1.2 Magnitude of the Problem

Assessment of the magnitude of a health problem is the core of health situation analysis. Simply stated, all it requires is the count of persons with different kinds or different grades of the problem, along with their background information. This background helps in dividing subjects into relevant groups that may be etiologically important. Grades of the problem are obtained either by direct numerical measurement, such as birth weight in grams and plasma triglyceride level in milligrams per deciliter, or in terms of categories, such as none, mild, moderate, and severe, as in the case of disabilities. Such gradation is fraught with uncertainties and consequent risks; thus, caution is always advisable.
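To fix ideas about the numerators and denominators of the rates mentioned above, here is a small R sketch with entirely hypothetical counts:

    # Hypothetical counts for a community survey
    cases      <- 184    # persons found with the condition
    population <- 9200   # persons examined: denominator for prevalence
    1000 * cases / population           # prevalence: 20 per 1000 persons

    infant_deaths <- 46    # deaths under one year of age in the period
    live_births   <- 1150  # denominator for the infant mortality rate
    1000 * infant_deaths / live_births  # IMR: 40 per 1000 live births

The arithmetic is trivial; the real difficulty, as just noted, lies in obtaining accurate counts for both the numerator and the denominator.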
The number of subjects with various grades of the problem can provide the information for various indicators that measure morbidity, mortality, and fertility. A composite index, covering multiple aspects, can also be calculated to assess the holistic magnitude of the problem. The most common measures of the magnitude of different health problems in a community are morbidity and premature mortality. Assessment of both may be required to develop a plan to combat the problem. A series of tools, such as personal interview, physical examination, laboratory investigation, and imaging, may be required to assign morbidity to its correct diagnostic category. Lack of perfection in these tools can be monitored by using the concepts of validity and reliability, and steps can be taken to keep imperfections in check. Incomplete response or nonresponse can introduce substantial bias in the findings and aggravate uncertainties.

Health is not merely the absence of morbidity and premature mortality. There are other aspects, such as fertility and nutrition. Situation analysis for planning a program on fertility control requires the count of births of different birth orders and the ages of women at which the various orders of birth take place. It may also require investigation of age at menarche, marriage, and menopause. Breast-feeding and postpartum amenorrhea provide additional dimensions of the problem. All these aspects come under neither morbidity nor mortality, yet they are important for assessing reproductive health. Thus, morbidity and mortality will not necessarily cover all aspects of health. In any case, they rarely cover social and mental aspects, and separate assessment may be needed for these.

1.3.1.3 Health Infrastructure

Assessment of the functionally available health infrastructure is an integral part of health situation analysis. This includes facilities such as hospitals, beds, and health centers; staff in terms of doctors, nurses, technicians, and so forth; supplies such as equipment, drugs, chemicals, and vehicles; and, most of all, their timely availability at the functional level so that they can be effectively used. Other components of the health infrastructure are software, such as training facilities in newer techniques; a well-defined duty schedule so that the staff knows what to do and when; motivational inputs so that the staff is not found lacking in eagerness to provide service; and so forth. All these vary from situation to situation. These soft aspects sometimes play a greater role in the success or failure of a health program than the physical facilities. Creating facilities is one thing; their adequate utilization is another, and the gap between the two breeds an uncertain environment. A sample survey may be required, and biostatistical methods would be needed, to come up with an adequate assessment of the utilization pattern by different segments of the population.

1.3.1.4 Feasibility of Remedial Steps

Health situation analysis is also expected to provide clues to deal with the problem. The magnitude of the problem in different segments of the population can provide an epidemiological clue based simply on excess occurrence in specific groups. Clues would also be available from the literature or from the knowledge and experience of experts in that type of situation. Not all clues necessarily lead to remedial steps, since their feasibility as an effective tool can be a limitation. Consider, for example, maternal complications that arise more frequently in the third and subsequent births. Avoiding those births could certainly be a solution, but many segments of the society in which such births are common may not accept this suggestion. Thus, other alternatives, such as increased spacing, better nutrition, and early checkups, have to be explored. Although early and frequent checkups in the antenatal period are highly desirable, poor availability of facilities may put them out of reach for most people. Assessment of all such aspects is a part of the exercise of health situation analysis. Uncertainty is an integral part of all these assessments.

1.3.2 Evaluation of Health Programs

Evaluation of a health program has two distinct components. The first is assessing the extent to which the objectives of the program have been achieved. This requires that the objectives be stated in a measurable format. For example, if the objective is to control blindness, it is always desirable to state the magnitude present at the time the program begins and the level of reduction expected at the end of the program or in each year of its implementation. If this level is mentioned for each age, gender, urban and rural area, and socioeconomic group, the task of the evaluators becomes simpler. A second and more important component of evaluation is to identify the factors responsible for that kind of achievement—good or bad—and to measure the relative importance of these factors. Thus, the evaluation might reveal that the objectives were unrealistic considering the prevalent health situation and the program inputs. Supplies may be adequate, but the population may lack the capacity to absorb them because of cultural and economic barriers. The target beneficiaries may not be fully aware of the contents and benefits of the program. Or the program may fall short of the health needs as perceived by the people.
Among other factors that could contribute to partial failure are errors in identification of the target beneficiaries, inadequate or ineffective supervision, and lack of expertise or motivation. Thus, evaluation is an appraisal of the impact as well as of the process. Like any other investigation, evaluation passes through various stages, each of which contains an element of uncertainty. Because of inter- and intraindividual variations in receivers as well as in providers, some segments could benefit more than others despite equal allocation of resources. Statistical methods are required to manage these uncertainties.
1.4 Management of Uncertainties: About This Book

The discussion in the previous sections may have convinced you that uncertainties are present in practically all medical situations and that they need to be properly managed. Management in any sphere is a complex process, more so when it concerns phenomena such as uncertainties. Management of uncertainty requires a science that understands randomness, instability, and variation. Biostatistics is the subject that deals specifically with these aspects. Instead of being tied to conventional statistical methods, such as tests of hypotheses and regression, biostatistical methods are presented in this book as a medical necessity for solving health problems. In doing so, we deliberately avoid mathematical intricacies. The attempt is to keep the text light and enjoyable for the medical fraternity so that biostatistical methods are perceived as a delightful experience and not a burden. Statistical jargon is avoided, and a human face is projected. For better communication, the text is sometimes in an interactive mode, and sometimes the same concept, such as different types of bias, is explained several times in different contexts in various chapters to bring the point home. For example, crossover trials are discussed for experiments, clinical trials, analysis of proportions, and analysis of means in four different chapters. Such repetition is deliberate. It is desirable too from the viewpoint of medical and health professionals, who are the target audience, although statisticians might not appreciate such repetitive presentation. Admittedly, some may find our writing format cynical, and some may find it innovative.

A new science establishes itself by demonstrating its ability to resolve issues that are intractable within the perimeters of present knowledge. The effort in this book is to establish biostatistics as a science that helps manage some aspects of medical uncertainties in a manner that no other science can. As already stated, uncertainties in most medical setups are generally intrinsic and can seldom be eliminated. But their impact on medical decisions can certainly be minimized, and it can be fairly ensured that the likelihood of a correct decision is high and that of a wrong decision is under control. Section 1.2.2 briefly described some elementary methods to minimize the impact of uncertainties on the conclusions of a research investigation. This minimization is indeed a challenge in all medical setups—clinical practice, community health care, and research. Measurement and the consequent quantitation are a definite help in this context. This can lead to mathematics, perhaps intricate calculations, but the effort of this book is to describe the methods without using calculus. We also try to avoid complex algebra and provide heuristic explanations instead. Even the high school algebra–based material is gradually introduced after the first few chapters. This may be friendlier to medical and health students and professionals, who generally have less rigorous training in mathematics. Real-life examples and exercises, many of which are from the literature, should provide further help in appreciating the medical significance of these methods. The methods and their implications are explained fully in narrative form, which should plug many of the gaps in the existing biostatistics literature. For example, a detailed discussion on statistical significance demystifies many statistical processes. Readers may find this kind of friendly explanation to be another distinctive feature of this book.

1.4.1 Contents of the Book

The process of keeping a check on the impact of uncertainties on decisions begins at the stage of conceptualizing or encountering a problem. Identification of the characteristics that need to be assessed; the definitions to be used; the methods of observation, investigation, and measurement to be adopted; and the methods of analysis and interpretation are all important. This book devotes many of the subsequent chapters to these aspects.

1.4.1.1 Chapters

Because of the very different orientation of this book, the organization of the chapters is also different from what you conventionally see in biostatistics books. A statistician may find the text disjointed because of the new sequence, but medical and health professionals, who are the target audience, may find a smooth flow. The organization of chapters is as follows:

- Basics of research, such as types of medical studies and tools of data collection (Chapter 2).
- Designs for medical investigations so that the conclusions remain focused on the questions proposed to be investigated, including sampling (Chapter 3), observational studies (Chapter 4), medical experiments (Chapter 5), and clinical trials (Chapter 6).
- Numerical and graphical methods for describing variation in data (Chapters 7 and 8). These methods help in understanding the salient features of the data and in assessing the magnitude of variation. The study of variation helps generate awareness about the underlying uncertainties.
- Methods of measurement of various aspects of health and disease in children, adolescents, adults, and the aged, including in a clinical setup (Chapter 9). These help to achieve quantitation and, consequently, some sort of exactitude. They include the nature of the reference values so commonly used in medical practice; measurement of uncertainty in terms of probability, particularly in the context of diagnosis, prognosis, and treatment; and assessment of the validity of medical tests in terms of sensitivity–specificity and predictivities.
- Further quantitative aspects of medicine under the rubrics of clinimetrics and evidence-based medicine. These include various indices and scoring systems (Chapter 10) that help introduce exactitude to largely qualitative characteristics.
- Indicators used for measuring the level of health of a community (Chapter 11). This includes a discussion of some composite indices, such as disability-adjusted life years (DALYs).
- The need for and rationale of confidence intervals and tests of statistical significance in view of sampling fluctuations (Chapter 12). These methods assign probabilities to various types of right and wrong decisions based on samples. Sample size calculations and the concept of statistical power are also discussed in this chapter.
- Methods for making decisions despite the presence of uncertainties, particularly with regard to assessing whether a difference in the proportions of cases in two or more groups is real or has arisen due to chance (Chapter 13), and assessment of the magnitude of difference in terms of relative risk and odds ratio (Chapter 14).
- Methods for testing statistical significance regarding the difference in means of two or more groups, including analysis of variance (ANOVA) (Chapter 15).
- Whether a relationship between two or more variables really exists and, if so, the nature of the relationship and the measurement of its magnitude. These include the usual quantitative regression (Chapter 16) and logistic regression (Chapter 17).
- Duration of survival and how it is influenced by antecedent factors (Chapter 18).
- Multivariate methods, such as multivariate analysis of variance (MANOVA), discriminant functions, cluster analysis, and factor analysis (Chapter 19). Only the basic features of these methods are discussed, with the objective of describing medical situations in which such methods can be advantageously used. The intricacies of these methods are not fully described. Thus, this chapter does not provide the skills to use these methods but provides knowledge about the situations in which they can and should be used.
- Statistical methods for assessing the quality of medical care and medical tools (Chapter 20). This includes measures of validity and reliability of instruments, as well as assessment of the robustness of results through methods such as sensitivity analysis and uncertainty analysis.
- Fallacies that commonly occur in statistical applications to health and medicine. We wrap up the book with a quite extensive discussion of such fallacies and the corresponding remedies (Chapter 21).

1.4.1.2 Limitations and Strengths

Biostatistics these days is a highly developed science, and it is not possible to include all that is known, or even all that is important, in one book. Our attempt in this text is to cover most of what is commonly used in medicine and health. Computers have radically changed the scenario in that even very advanced and complex techniques can be readily used when demanded by the nature of the problem and the data. Yet, there are some basics whose understanding can be considered crucial for appropriate application. Within the constraints set for this book, the discussion is restricted mostly to basics and sometimes to intermediate-level methods. Among the topics left out are infectious disease modeling [6], analysis of time series [7], and bioassays [8]. We have also not included aspects such as operations research [9] and qualitative research [10]. Many other topics are not included. Nevertheless, the book covers most topics that are included in undergraduate and graduate biostatistics courses for medical and health students. It also includes the statistical material generally required for the membership examinations of many learned bodies, such as the College of Physicians. Above all, the book should be a useful reference for medical and health professionals for acquiring enough knowledge to plan and conduct different kinds of investigational studies on groups of subjects.
It incorporates methods that are considered essential armament for medical research. It will also help medical researchers to critically interpret the medical literature, which is becoming increasingly statistical. Previous editions of this book have been acclaimed as “probably the most complete book on biostatistics we have seen” and “encyclopedic in breadth,” and the endeavor in this edition is to make it even more complete. A comprehensive index of terms is provided at the end, which will make it easier for the reader to locate the text of his or her choice.

As a medical or health professional, you may or may not do any research yourself, but you will always be required to interpret the research of others in this fast-growing science. Thus, the discussion in this book could be doubly relevant—first for appreciating and managing uncertainties in medical practice, and second for correctly interpreting the mind-boggling medical research now going on. Just how much help does knowledge of biostatistical methods provide in understanding the articles published in reputed journals? The results of one investigation are surprising. A reader knowing mean, standard deviation, and so forth has statistical access to 58% of the articles. Adding t-tests and contingency tables increases comprehension to 73% [11]. The contents of the book should also be useful to those who believe in evidence-based medicine. They search the available evidence and interpret it for relevance in managing a patient. Biostatistical methods not only contribute to probabilistic thinking but also encourage critical and logical thinking. An avid reader will find, sprinkled through the text of this book, discussions on how logic is used for interpreting evidence to improve clinical practice.
The thrust throughout this book is to try not to lose focus on the medical and health aspects of the methods. The chapters often start with a real-life medical example, identify its statistical needs, and then present the methods. This is done to provide motivation for the methods discussed. Chapter titles, too, are mostly problem oriented rather than method oriented. We have been amidst medical professionals for a long time and are aware of their math phobia. We also appreciate the emphasis of some medical professionals on diagnosis and treatment as an art rather than a science, because it depends very much on clinical acumen and the individual equation between the patient and the doctor. This text incorporates some of these concerns also.

1.4.1.3 New in the Fourth Edition

The previous editions of this book have been microscopically examined by reviewers across the world, and a large number of suggestions have emerged. The fourth edition incorporates many of those suggestions. This edition contains thoroughly revised and enlarged chapters on clinical trials (Chapter 6), ordinary quantitative regression (Chapter 16), and logistic regression (Chapter 17). The chapter on clinical trials now includes a section on concealment of allocation, a brief on personalized safety and efficacy, and several other updates. The two regression chapters now include more detailed guidelines on the selection of regressors, a more detailed exposition of simple linear regression that lays the foundation for the regression methods, a brief on curvilinear regression, more details on spline regression, a new section on partial correlation, a revised and updated section on intraclass correlation, further notes on assessing the adequacy of a logistic regression, the propensity score approach, and agreement charts. Material added in other chapters includes (a) skewness and kurtosis, (b) confidence interval for binomial π, (c) two-phase sampling, (d) Box–Cox power transformation, (e) bubble charts, (f) sample size for estimating sensitivity and specificity, (g) T-score, (h) partial correlation, (i) sampling weight, and (j) Guttman scale. A new full section has been added on demography and measures of fertility in the chapter on measures of community health. R codes have been provided for intricate problems in Appendix C in view of the popularity of this software, and SPSS solutions are on the book’s website (MedicalBiostatistics.synthasite.com). Many of these solutions go beyond the details provided in the text.

The most important addition, made in response to popular demand from those who use this book as a textbook, is the set of exercises at the end of each chapter. The exercises are intentionally arranged in random order within each chapter so that the sequence does not provide any clue to the solution. In the process, however, some easy questions may appear later in the sequence. The data for some exercises are available on the book’s website for ready adoption. Brief solutions to selected exercises appear at the end of the book. As the solutions will indicate, these exercises are intended to enhance the grasp of the applications and the correct interpretation of the results. The book covers the curriculum of a two-semester biostatistics course for medical and health professionals, with Chapters 1–11 for the first semester and Chapters 12–21 for the second semester.
Chapters 1 and 21 can be optional, although they discuss important issues of practical relevance and have been praised by reviewers for their immaculate discussion of the real statistical issues faced by many medical professionals. The first course could be called Biostatistics in Medical Research Methodology, and the second Biostatistics for Medical Data Analysis. Instructors can select the sections considered appropriate so that the course is manageable. Some unimportant portions of the text have been deleted to keep the size of the book within manageable limits. The entire text has been revised and updated.

1.4.1.4 Unique Contribution of This Book

There are many topics in this text that are not covered at all, or not covered adequately, by other existing books; for example:

1. The book tries to establish medical biostatistics as a subject of its own, distinct from the usual biostatistics, and tries to integrate the subject with the medical sciences, as opposed to the mathematical sciences [12].
2. This text recognizes that medical uncertainties are abundant and that the science of biostatistics can adequately deal with most of the data-based uncertainties.
3. We lay great emphasis on epistemic uncertainties that arise from unknown factors. Many of us tend to ignore this component, although it can dominate in certain medical setups.
4. The normal distribution is not normal, particularly for sick subjects, and the name is a misnomer—it should be called Gaussian.
5. Many new concepts have been introduced in this book, such as (a) positive health; (b) the death spectrum, whereby individuals may exercise their choice regarding increasing the chances of death in old age by the cause they desire; (c) a comprehensive smoking index; (d) indicators of mental health; (e) a predictivity-based receiver operating characteristic (ROC) curve; (f) a nomogram for sample size in cluster sampling; (g) the clustering of values based on consensus among various clustering methods for, say, choropleth mapping of health indicators; and (h) an alternative simple method for assessing agreement. All these are our original contributions.
6. There is an extensive debate on statistical versus medical significance, conceding the essential role of P-values for ruling out sampling fluctuations but maintaining the primacy of biological relevance for meaningful conclusions. We also emphasize that statistical significance is not just about the presence or absence of an effect, but also about whether the effect has reached a medically important threshold.
7. There is a full chapter on the quality of data, quality of inference, and quality control of errors. This deals with issues that are rarely, if at all, discussed in a biostatistics book despite their huge relevance to medical investigations.
8. There is a full chapter on statistical fallacies, with a large number of practical examples. This happens to be the most appreciated chapter in the reviews. Medical professionals may find it very illuminating.
9. This is perhaps the most comprehensive text on medical biostatistics, with a discussion and explanation of a large number of concepts and methods in one place.
10. The index at the end of any book makes it easier to locate the topic of interest, but here we provide a very comprehensive list, which some readers consider a great asset.
1.4.2 Salient Features of the Text

Important terms and topics mostly appear in section and subsection titles. If not, they appear in boldface in the text at the point where they are defined or explained. This may help you to easily spot the text on the topic of interest to you. The terms or phrases that we want to emphasize in the text appear in italics. Whenever feasible, the name of a formula is given on the line before, or on the same line as, the formula so that it can be immediately identified.

While the general statistical procedures for different situations and their illustrations are described in the text, numbered paragraphs are also included at many places. These contain comments on the applicability of the procedures, their merits and demerits, their limitations and extensions, and so forth. These paragraphs are important inputs for an avid reader and should not be ignored if the intention is to acquire a critical understanding of the procedures. We have tried to arrange them in a crisp, brief format, separately for each point of comment. A large number of examples illustrate the procedures discussed in each chapter. These are given with an indent and a title in bold italics. Many times these examples explain important aspects of the procedure that might not be adequately discussed in the text. These examples are part of the learning material and should not be ignored.

1.4.2.1 System of Notations

Although the notations used in this text are explained upon their first use, an explanation of our system of notation may provide considerable help in quickly understanding the contents. A measurement or an observed value of a quantitative variable is denoted by x or y. The number of subjects studied is denoted by n. The indexing subscript for this is the letter i. Thus, xi is the measurement obtained on the ith subject. If the heart rate of the third patient in a trial is 52/min, then x3 = 52. The values x1, x2, …, xn are sometimes referred to as n observations. However, x1, x2, …, xK is the notation for K regressors in a regression equation. The indexing subscript for these is k. If the spectrum of values is divided into groups, such as systolic BP into 100–119, 120–139, 140–149, 150–159, 160–179, 180–199, and 200+ mmHg, then the number of groups so formed is denoted by K. In this example, K = 7. Sometimes the notation J is also used for the number of groups, particularly when grouping is on two different factors. The subscripts for these are lowercase k and j, respectively. The groups are not necessarily quantitative; they can be nominal, such as male and female (K = 2), or ordinal, such as mild, moderate, and severe pain (K = 3). Thus, yijk is the ith observation in the jth group of factor 1 and the kth group of factor 2.

The probability of any event is denoted by P, but the primary use of this notation in the latter half of the book is for the probability of a Type I error. In the case of categorical data, πk is the probability of an observation falling into the kth group (k = 1, 2, …, K). The actual number of subjects in the kth group of a sample is denoted by nk or by Ok•. The proportion of subjects with a specified characteristic out of the total sample is denoted by p, and in the kth group by pk. Note that lowercase p is used in this book for the sample proportion, and capital P for probability. Many books and journals use p for probability, such as in p-value, and the same notation for the sample proportion. We have tried to make a distinction, as these are very different values.

When the subjects are cross-classified by two factors, the tabular results generally have rows depicting classification on factor 1 and columns on factor 2. The number of groups formed for factor 1 is the same as the number of rows and is denoted by R. The number of columns formed for factor 2 is denoted by C. The indexing subscripts for them are r and c, respectively. Note that this text denotes the number of groups by uppercase J, K, R, and C, and the indexing subscript is the corresponding lowercase j, k, r, and c. The only exception is the subscript i for the subject number, which goes up to n (and not I). This is the convention followed by most books, and this book retains it to avoid confusion. The observed frequency in the rth row and cth column of a contingency table is denoted by Orc. The summation of values or of frequencies is indicated by Σ, and the sum so obtained is sometimes denoted by a dot (•) for the corresponding subscript. Thus, O•c = ΣrOrc and Or• = ΣcOrc. In the first case, the sum is over rows, and in the second, over columns. The subscript of Σ is the summing index and is mentioned wherever needed for clarity. Multiplication in notation is denoted by * following computer-language convention, but by the conventional × while working with numbers. We find it convenient to retain the notations a, b, c, d used by many authors for the number of subjects in the four cells of a 2×2 table in the context of relative risk and odds ratio. Otherwise, the notation is Orc.

It is customary in statistical texts to denote population parameters by Greek letters and the corresponding observed values in the sample by Roman letters. Thus, the population mean, standard deviation, and probability are denoted by μ, σ, and π, and their sample values by x̄, s, and p, respectively. All notations are in italics, including the notational subscripts, except Greek letters. These are all listed as Frequently Used Notations in the beginning of the book.

1.4.2.2 Guide Chart of the Biostatistical Methods

For easy reference, a chart is provided at the beginning of the book that gives a summary of the methods presented in this text. The chart is divided into eight tables, Tables S.1 through S.8, and refers to the equation or expression number, or the section, where each method is described. The chart also refers to the method applicable for different types of data. Thus, you can go directly to the place in the book where the method for the dataset in your hand is described. This may be a very useful guide and can even be used to develop a statistical expert system.
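Returning to the contingency table notation of Section 1.4.2.1, here is a small R sketch with a hypothetical 2×3 table that makes the Orc and dot notation concrete; the group labels are purely illustrative:

    # Hypothetical 2 x 3 table of observed frequencies O[r, c]
    O <- matrix(c(12, 18, 30,
                  20, 25, 15),
                nrow = 2, byrow = TRUE,
                dimnames = list(factor1 = c("Group 1", "Group 2"),
                                factor2 = c("Mild", "Moderate", "Severe")))
    rowSums(O)   # Or. : total of each row, summing over columns
    colSums(O)   # O.c : total of each column, summing over rows
    sum(O)       # grand total, the number of subjects n

The dot simply marks the subscript that has been summed out.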
References

1. Indrayan A. Health monitor in your pocket. CSI Commun 1996; 20(1):8–9.
2. Vidal E, Carlin E, Driul D, Tomat M, Tenore A. A comparison study of the prevalence of overweight and obese Italian preschool children using different reference standards. Eur J Pediatr 2006; 165:696–700.
3. National Health Service. UK Database of uncertainties of effects of treatments. http://www.library.nhs.uk/duets/ (accessed 17 May 2016).
4. Menotti A, Seccareccia F. Cardiovascular risk factors predicting all causes of death in an occupational population sample. Int J Epidemiol 1988; 17:773–778.
5. Goswami AK, Gupta SK, Kalaivani M, Nangkyrsih B, Pandav CS. Burden of hypertension and diabetes among urban populations aged ≥60 years in South Delhi: A community based study. J Clin Diagn Res 2016; 10:LC01–LC05.
6. Castillo-Chavez C, Blower S, van den Driessche P, Kirschner D, Yakubu A (Eds.). Mathematical Approaches for Emerging and Reemerging Infectious Diseases: Models, Methods and Theory. Springer, 2002.
7. Diggle PJ, Liang KY, Zeger SL. Analysis of Longitudinal Data, 2nd ed. Clarendon Press, 2013.
8. Govindarajulu Z. Statistical Techniques in Bioassay, 2nd ed. S. Karger, 2001.
9. Taha HA. Operations Research: An Introduction, 9th ed. Pearson, 2010.
10. Merriam SB, Tisdell EJ. Qualitative Research: A Guide to Design and Implementation, 4th ed. Jossey-Bass, 2015.
11. Emerson JD, Colditz GA. Use of statistical analysis in the New England Journal of Medicine. N Engl J Med 1983; 309:709–713.
12. Indrayan A. Statistical medicine: An emerging medical specialty. J Post Grad Med 2017; 63:252–256.
Exercises
1. Illustrate three aleatory and three epistemic uncertainties with the help of examples, along with the reasons why they are classified as aleatory and epistemic, respectively.
2. Consider a study on the relation of liver functions with age and dietary intake. List three sources of uncertainty in this setup due to (i) intrinsic variation, (ii) natural variation in assessment, and (iii) knowledge limitation.
3. Select a large epidemiological study of your choice from the recent literature and discuss how the authors have controlled or not controlled various components of uncertainty in this study. What are your suggestions for possibly improving the control of uncertainties in this study?
2 Basics of Medical Studies

You may have realized from Chapter 1 that empiricism is the backbone of medical knowledge. Studies in various forms are constantly carried out to better understand the interactions of the various factors affecting health. Omnipresent uncertainties in this setup require that a trend be identified amidst chaos-like situations. For this, generally a group of subjects with some commonality is studied, and a preplanned series of steps is followed to collect evidence and draw a conclusion. This chapter focuses on the statistical aspects of such research studies rather than on clinical approaches. If you are not into research, you may still like to know these basics for understanding the full implications of the studies reported in the medical literature, which you may be periodically consulting to update your knowledge.

This chapter provides only an outline of the methodology followed in medical studies; details are spread out over subsequent chapters. Methodology is the backbone on which research stands upright. All empirical results are necessarily subject to uncertainties, but methodological rigor provides robustness against them. The basis, however, remains empiricism, which requires the study of a group of subjects. A study essentially consists of the preparation of a protocol; the collection of observations; their collation, analysis, and interpretation; and the drawing of conclusions. Features of the study protocol are described in Section 2.1. The major component of a protocol is the design, which contains the details of how the various sources of uncertainty are proposed to be controlled; an outline is presented in Section 2.2. This discusses how the broad objectives of the study, the choice of strategy, and the methods to implement that choice finally determine the design of the study. The broad objective of the study could be to obtain descriptive data, for example, the prevalence rate of a disease, or analytical data with the aim of investigating cause–effect types of relationships, but sampling is a necessary ingredient of all empirical studies. In view of its importance, sampling is discussed separately in Chapter 3, but the methods of data collection and the intricacies involved are discussed here in Section 2.3. A common strand in all strategies for medical studies is the control of various kinds of biases. These biases are identified and listed in Section 2.4. Methods for their control are also outlined in that section, but the details are explained in subsequent chapters.
2.1 Study Protocol

A protocol is a concise statement of the background of the problem, the objectives of the study, and the methodology to be followed. This backbone supports a study in all steps of its execution. Thus, sufficient thought must be given to its preparation. On many occasions, it gradually evolves as more information becomes available and is progressively examined for its adequacy. The most important aspect of a study protocol is the statement of the problem, the objectives, and the hypotheses. These deserve a separate discussion up front in view of their importance. Details of the other contents of the protocol and the designs are presented later in this section.

2.1.1 Problem, Objectives, and Hypotheses

It is often said that research is half done when the problem is clearly visualized. There is some truth in this assertion. Thus, do not shy away from devoting time in the beginning to identify the problem, to thoroughly understand its various aspects, and to choose the specifics you would like to investigate.

2.1.1.1 Problem

A problem is a perceived difficulty, a feeling of discomfort about the way things are, the presence of a discrepancy between the existing situation and what it should be, a question about why a discrepancy is present, or the existence of two or more plausible answers to the same question [1]. Among the countless problems you may notice, identifying one suitable for study is not always easy. Researchability, of course, is a prime consideration, but rationale and feasibility are also important. Once these are established, the next important step is to determine the focus of the research. This can be done by reviewing existing information to establish the parameters of the problem and using empirical knowledge to refine the focus. Specify exactly what new knowledge the world is likely to gain through this research. The title of the research by itself is not the statement of the problem. Instead, the problem is a comprehensive statement regarding the basis for its selection, the details of the related lacunae in existing knowledge, a reflection on its importance, and comments on its applicability and relevance. The focus should be sharp so that it is clearly visible to the reader. For example, if the problem area is the role of diet in cancers, the focus may be on how consumption of meat affects the occurrence of pancreatic cancer in males residing in a particular area. To sharpen the focus further, the study may be restricted to nonsmoking males so as to eliminate the effect of smoking. To add depth, meat can be specified as red or white, and the amount and duration of consumption of each can be included. The role of other correlates that promote or inhibit the effect of meat in causing cancer can also be studied. The actual depth of the study would depend on the availability of relevant subjects on the one hand, and the availability of time, resources, and expertise on the other. Such a sharp focus is very helpful in specifying the objectives and hypotheses, in developing an appropriate research design, and in conducting the investigation.

2.1.1.2 Broad and Specific Objectives

The focus of the study is further refined by stating the objectives. These are generally divided into broad and specific. Any primary medical research (Section 2.2) can have two types of broad objectives. One is to describe the features of a condition, such as the clinical profile of a disease, its prevalence in various segments of the population, and the levels of medical parameters seen in different types of cases. This covers the distribution part of the epidemiology of a disease or a health condition, such as what is common and what is rare, and what the trend is. It helps to assess the types of diseases prevalent in various groups and their load in a community. However, a descriptive study does not seek explanations or causes, nor does it try to find which group is superior to another. Evaluation of the level of β2 microglobulin in cases of HIV/AIDS is an example of a descriptive study. A study on the growth parameters of children, or one estimating the prevalence of blindness in cataract cases, is also descriptive. The second type of broad objective could be to investigate a cause–effect type of relationship between an antecedent and an outcome. Whereas cause–effect would be difficult to ascertain unless certain stringent conditions are met (Chapter 20), an association or correlation can be easily established. Studies that aim to investigate such an association or cause–effect are called analytical. A broad objective would generally encompass several dimensions of the problem, which are then spelled out in the specific objectives. For example, the broad objective may be to assess whether a new diagnostic modality is better than an existing one.
The specific objectives in this case could be separately stated as (a) positive and negative predictivity; (b) safety in the case of an invasive procedure; (c) feasibility under a variety of settings, such as field, clinic, and hospital; (d) acceptability by the medical community and the patients; and (e) cost-effectiveness. Another specific objective could be to evaluate its efficacy in different age–sex or disease severity groups so that the kinds of cases where the procedure works well are identified. Specific objectives relate to the specific activities, and they identify the key indicators of interest. They are stated in a measurable format. They may not have atomic clock precision but should be specific enough to keep the study on its course. An acronym quite in vogue these days for objectives is SMART: specific, measurable, achievable, relevant, and timely, which adequately states how objectives should be framed. Keep the specific objectives as few and focused as possible. Do not try to answer too many questions with a single study, especially if its size is small. Too many objectives can render the study difficult to manage. Whatever objectives are set, stick to them all through the study as much as possible, since changing them midway or at the time of report writing signals that not enough thinking was done at the time of protocol development; you may also later discover that adequate data are not available for the changed objectives.

2.1.1.3 Hypotheses

A hypothesis is a precise expression of the expected results regarding the state of a phenomenon in the target population. Research is about replacing existing hypotheses with new ones that are more plausible. In a medical study, hypotheses could purport to explain the etiology of diseases; prevention strategies; screening and diagnostic modalities; the distribution of occurrence in different segments of the population; the strategies to treat or manage a disease, prevent its recurrence, or prevent the occurrence of adverse sequelae; and so forth. Consider which of these types of hypotheses can be investigated by the proposed study. Hypotheses are not guesses but reflect the depth of knowledge of the topic of research. They must be stated in a manner that can be tested by collecting evidence. The hypothesis that dietary pattern affects the occurrence of cancer is not testable unless the specifics of the diet and the type of cancer are specified. Antecedents and outcome variables, or other correlates, should be exactly specified in a hypothesis. Generate a separate hypothesis for each major expected relationship. The hypotheses must correspond to the broad and specific objectives of the study. Whereas objectives define the key variables of interest, hypotheses are a guide to the strategies for analyzing the data.

2.1.2 Protocol Content

A protocol is the focal document for any medical research. It is a comprehensive yet concise statement regarding the proposal. Protocols are generally prepared in a structured format with an introduction, containing background that exposes the gaps needing research; a review of literature, with details of the various views and findings of others on the issue, including those that are in conflict; a clearly worded set of objectives and the hypotheses under test; a methodology for the collection of valid and reliable observations and a statement about the methods of data analysis; and the process of drawing conclusions. It tries to identify the uncertainty gaps and proposes methods to plug those gaps. Administrative aspects, such as the sharing of responsibilities, should also be mentioned. In any case, a protocol would contain the name of the investigator, his or her academic qualifications and institutional affiliation, and the hierarchy of advisers in the case of master’s and doctoral work. The place of the study, such as the department and institution, should also be mentioned, as should the year the proposal is framed. All these appear up front on the title page itself. An appendix at the end includes the pro forma of data collection, the consent form, and other such material. The main body of a protocol must address the following questions with convincing justification.

Title
What exactly is intended to be studied, and how? Is the study sufficiently specific?

Introduction
How did the problem arise? In what context? What is the need for the study—what new knowledge is expected that is not available so far? Is it worth investigating? Is the study exploratory in nature, or are definitive conclusions expected? To what segment of the population or to what type of cases is the problem addressed?

Review of Literature
What is the status of present knowledge? What are the lacunae? Are there any conflicting reports? How has the problem been approached so far? With what results?

Objectives and Hypotheses
What are the broad and specific objectives, and what are the specific questions or hypotheses to be addressed by the study? Are these clearly defined, realistic, and evaluative? The objectives should follow the SMART format mentioned earlier.

Methodology
What exactly is the intervention, if any—its duration, dosage, frequency, and so forth? What instructions and material are to be given to the subjects, and at what time? What are the possible confounders? How are these and other possible sources of bias to be handled? What is the method of allocation of subjects to different groups? If there is any blinding, how will it be implemented? Is there any matching? On what variables and why? On what characteristics will the subjects be assessed—what are the antecedents and outcomes of interest? When will these assessments be made? Who will assess them? Are these assessments necessary and sufficient to answer the proposed questions? What exactly is the design of the study? Is the study descriptive or analytical?
If analytical, is it observational or experimental? An observational study could be prospective, retrospective, or cross-sectional. An experiment could be carried out on biological material, on animals, or on humans, in which last case it is called a trial. If experimental, is the design one-way, two-way, factorial, crossover, or something else? What is the operational definition of the various assessments? What methods of assessment are to be used—are they sufficiently valid and reliable? What information will be obtained by inspecting records, by interview, by laboratory and radiological investigations, and by physical examination? Is there any system of continuous monitoring in place? What mechanism is to be adopted for quality control of measurements? What is the set of instructions to be given to the assessor? What form is to be used for eliciting and recording the data? (Attach it as an appendix.) Will it be structured or not, close-ended or open-ended? Who will record? Will it contain the necessary instructions? What is to be done in the case of contingencies, such as dropout of subjects, nonavailability of the kit or regimen, or development of complications in some subjects? What safeguards are provided to protect the health of the participants? Also, when should the study be stopped if a conclusion emerges before the full course of the sample? What is the period of the study, and what is the timeline?

Study Subjects
What are the subjects, what is the target population, what is the source of subjects, how are they going to be selected, how many will be in each group, and what is the justification, particularly with regard to the reliability of the results and statistical power? What are the inclusion and exclusion criteria? Is there any possibility of selection bias, and how is this proposed to be handled? Is there any comparison group? Why is it needed, and how will it be chosen? How will it provide a valid comparison?

Data Analysis
What estimations, comparisons, and trend assessments are to be done at the time of data analysis? Will the quality and quantity of the available data be adequate for these estimations, comparisons, and trend assessments? What statistical indices are to be used to summarize the data—are these indices sufficiently valid and reliable? How is the data analysis to be done—what statistical methods will be used, and are these methods really appropriate for the type of data and for providing correct answers to the questions? What level of significance or confidence is to be used? How are missing data, noncompliance, and nonresponse to be handled?

Validation of Results
What is the expected reliability of the conclusions? What are the limitations of the study, if any, with regard to generalizability or applicability? What exercises are proposed to be undertaken to confirm the internal and external validity of the results?

Administration
What resources are required, and how are they to be arranged? How are responsibilities to be shared between the investigators, supporting units (e.g., pathology, radiology, and biostatistics), hospital administration, the funding agency, and any others?

In short, the protocol should be able to convince the reader that the topic is important, that the data collected will be reliable and valid for that topic, and that contradictions, if any, will be satisfactorily resolved. Present it before a critical but positive audience and get their feedback. You may be creative and may be in a position to argue with conviction, but skepticism in science is regularly practiced—in fact, it is welcome. The methods and results will be continuously scrutinized for possible errors. The protocol is the most important document by which funding agencies, as well as accepting agencies, evaluate the scientific merit of a study proposal. Peer evaluation is the rule rather than the exception in scientific pursuits. Good research is robust to such reviews. A protocol should consist of full details with no shortcuts, yet it should be concise. It should be to the point and coherent. The reader, who may not be fully familiar with the topic, should be able to get clear answers about the why, what, and how of the proposed research.
To the extent possible, it should embody the interests of the sponsor, the investigator, the patients, and society. The protocol is also a reference source for the members of the research team whenever needed. It should be complete and easy to implement. The protocol is also a big help at the time of writing the report or a paper. The introduction and methods sections remain much the same as in the protocol, although in a more elaborate format. The objectives, as stated in the protocol, help to retain the focus in the report. Much of the literature review done at the time of protocol writing also proves handy at the time of report writing.
Whereas all other aspects may be clear by themselves or will become clear as we go along in this book, special emphasis should be placed on the impartiality of the literature review. Do not be selective and include only those pieces of literature that support your hypotheses; include also those that are inconsistent with, or in opposition to, them. Justify the rationale of your research with reasons that effectively counter the opposite or indifferent view. Research is a step in the relentless search for truth, and it must pass the litmus test put forward by conflicting or competing facts. The protocol must provide evidence that the proposed research would stand up to this demand and would help minimize the present uncertainties regarding the phenomenon under study.
2.2 Types of Medical Studies
Medical studies encompass a whole gamut of endeavors that ultimately help to improve the health of people. Functionally, medical studies can be divided into basic and applied types. Basic research, also termed pure research, involves advancing the knowledge base without any specific focus on its application. The results of such research are utilized somewhere in the future when that new knowledge is required. Applied research, on the other hand, is oriented to an existing problem. In medicine, basic research is generally done at the cellular level for studying various biological processes. Applied medical research could be on diagnostic and therapeutic modalities, agent–host–environment interactions, or health assessments.
[Figure: Medical research divides into pure (basic) research, generally done at the cellular level, and applied (problem-oriented) research. Applied research divides into primary research (descriptive surveys, observational studies, medical experiments, and clinical trials; this is the coverage of the methods of this book) and secondary research (decision analysis, operations research, health systems, economic analysis, qualitative research, and research synthesis).]
FIGURE 2.1 Types of medical studies.
We would like to classify applied medical studies into two major categories. The first category can be called primary research; it includes analytical studies such as cohort studies, case–control studies, and clinical trials. It also includes descriptive studies, such as surveys, case series, and census. The second category is secondary research and is quite common these days; it includes decision analysis (risk analysis and decision theory), operations research (prioritization, optimization, simulation, etc.), evaluation of health systems (assessment of achievements and shortcomings), economic analysis (cost–benefit, cost-effectiveness, etc.), qualitative research (focus group discussion), and research synthesis (Figure 2.1). This book contains biostatistical methods applicable to primary research, which still forms the bulk of modern medical research, and excludes much of the methods applicable to secondary research. The only aspect of secondary research we incorporate is the meta-analysis, which is a part of research synthesis.
Any research on diagnostic, prophylactic, and therapeutic modalities or on risk assessment is empirical. Experience on one or two patients can help in special cases, but generally an investigation of a large group of subjects is needed to come to a definitive conclusion. To achieve this, study design is of paramount importance.

2.2.1 Elements of a Study Design
Since medicine is an empirical science, the decisions are based on evidence. The conclusions must also stand up to reasoning. Hunches or personal preferences have no role. The nature of evidence is important in itself, but credence to this is acquired by the soundness of the methodology adopted for collecting such evidence. Design is the pattern, scheme, or plan to collect evidence. It is the tool by which the credibility of research findings is assessed. The function of a design is to permit a valid conclusion that is justified and unbiased. It should take care of confounding problems that could complicate the interpretation. Thus, it should be able to provide correct answers to the research questions. The objective of a design is to get the best out of the efforts. The various elements of design are as follows:
1. Definition of the target population: Inclusion and exclusion criteria, the area to which the subjects would belong, and their background information
2. Specification of the various groups to be included, with their relevance
3. Source and number of subjects to be included in each group, with justification including statistical power or precision considerations, as applicable
4. Method of selection of subjects
5. Strategy for eliciting data: Experiment (animal or human) or observation (prospective, retrospective, or cross-sectional)
6. Method of allocation of subjects to different groups, if applicable, or matching criteria with justification
7. Method of blinding, if applicable, and other strategies to reduce bias
8. Specification of intervention, if any
9. Definition of the antecedent, outcome, and other characteristics to be assessed, along with their validity for the study objectives
10. Identification of various confounders and the method proposed for their control
11. Method of administering various data-collecting devices, such as questionnaire, laboratory investigation, and clinical assessment, and the method of various qualitative and quantitative measurements
12. Validity and reliability of different devices and measurements
13. Time sequence of collecting observations and their frequency (once a day, once a month, etc.), duration of follow-up, and duration of study, with justification
14. Methods for assessment of compliance and strategy to tackle ethical problems
15. Strategies to handle nonresponse and other missing data

The primary feature of the design is the type of study in terms of descriptive or analytical. These have been described earlier but require some more explanation, although details appear in the subsequent chapters.

2.2.2 Basic Types of Study Design
A complete chart of various relevant study formats is shown in Figure 2.2. This chart gives the divisions of descriptive and analytical studies.
[Figure: A chart classifying study designs along several axes. By objective: descriptive studies (sample survey, case series, census) and analytical studies. By strategy: observational or experimental (intervention). By method of subject selection: random (SRS, SyRS, StRS, CRS/area, MRS, PPS, or mixed in stages) or nonrandom/purposive (volunteers, snowball, convenience, quota, referred, consecutive, haphazard). By type: prospective (follow-up, including longitudinal and cohort studies; a cohort can be historical, i.e., retrospective, or concurrent), retrospective (case–control or nested case–control; cases and controls can be recruited retrospectively or prospectively), or cross-sectional. Experiments comprise laboratory experiments (on cells, chemicals, or biological specimens), animal experiments, clinical trials (therapeutic, prophylactic, diagnostic, or screening), and field trials (prophylactic, screening, or other); these can be with or without control, randomized (RCT if a trial) or nonrandomized, and open or blind (single, double, or triple). Layouts for experiments/trials include crossover and repeated measures, and one-way, two-way, factorial, etc.]
FIGURE 2.2 Various types of study designs.
2.2.2.1 Descriptive Studies
Studies that seek to assess the current status of a condition in a group of people are called descriptive. These are also sometimes called prevalence studies. Consider the power of the following results from the 2010 National Health Interview Survey in the United States, which was a descriptive study:
1. Eighteen percent of U.S. children aged under 17 years had less than good health.
2. Fourteen percent of children had ever been diagnosed with asthma.
3. Eight percent of children aged 3–17 years had a learning disability, and 8% had attention deficit hyperactivity disorder.
The importance of descriptive studies is not fully realized, but it is through such studies that the magnitude of problems is assessed. They also help to assess what is normally seen in a population. Unfortunately, even body temperature among healthy subjects is not known with precision for many populations. Thus, there is considerable scope for carrying out descriptive studies. Such studies can provide baseline data to launch programs, such as breast cancer control or rehabilitation of elderly people, and can measure the progress made. A descriptive study can also generate hypotheses regarding the etiology of a disease when the disease is found to be more common in one group than in another.
A case study, which generally describes the features of a new disease entity, is also descriptive, although it is anecdotal in nature. A story can have a definite impact, since it is easily understood. But be wary of anecdotal evidence. It can be interesting and can lead to a plausible hypothesis, but anecdotes are sometimes built around an exciting event and used by the media to sensationalize the reporting. You must verify anecdotes even for putting forward a hypothesis. A series of such cases forms a case series. A case series summarizes the common features of the cases or may highlight the variation. This too can lead to a hypothesis. The initial case series of HIV-positive persons in the San Francisco area, almost exclusively among homosexual men, led to the suspicion that sexual behavior could be a cause. Surveys too are descriptive studies, although this term is generally used for community-based investigations. When repeatedly undertaken, they can reveal time trends. Complete enumeration, such as a population census, is also descriptive. A descriptive study generally has only one group, since no comparison group is needed for this kind of study. Its design is mainly in terms of a sampling plan, as is discussed in Chapter 3. Those who are aware of statistical errors realize that no Type I or Type II error arises in descriptive studies unless we test a hypothesis, for example, for the presence of a correlation.

2.2.2.2 Analytical Studies and Their Basic Types
The other kind of study is an analytical study, which tries to investigate etiology or cause–effect types of relationships. The determinants or risk factors of a disease or health condition are obtained by this kind of study. The differences between two or more groups are also evaluated by analytical studies. Although the conclusions are associational, the overtones are cause–effect. A properly designed analytical study can indeed provide a conclusion regarding a cause–effect relationship. Two strategies are available for such studies: observation and experiment. Observational studies are based on naturally occurring events without any human intervention. Record-based studies are also placed in this group. Experimental studies require deliberate human intervention to change the course of events. Although observational studies can be descriptive, the term is generally used for particular types of analytical studies. A study of smoking and lung cancer is a classic example of an observational study. Nobody would ever advocate or try to intentionally expose a group of persons to smoking to study its effect. Thus, experimentation in this case is not an option, at least for human beings. Observation of what is happening, has happened, or will happen in groups of people is the only approach. On the other hand, the anesthetic effect of a eutectic mixture used for local anesthesia can be safely studied in human beings because this cream has practically no toxic effect. The rule of thumb is that human experiments can be carried out only for potentially beneficial modalities. Observational studies can be carried out for hazardous as well as beneficial modalities. They can be prospective, retrospective (case–control), or cross-sectional, depending on the methodology adopted to collect the data (see Chapter 4 for details). Other than observation, a perhaps more convincing and scientifically more valid but ethically questionable strategy is to intentionally introduce the intervention whose effect is proposed to be evaluated.
Thus, there are human efforts to change the course of events. These are experiments, but in medicine, this term is generally used when the units are animals or biological material. Customarily, experiments on humans are called trials. Laboratory and animal experiments can be done under far more controlled conditions. These experiments are described in Chapter 5. Trials can be conducted in a clinic or in a community. Since humans are involved, a large number of ethical and safety issues crop up in clinical trials. Far more care is required in this setup, and strategies such as randomization and blinding are advocated. The details are provided in Chapter 6.

2.2.3 Choosing a Design
In the face of the many designs shown in Figure 2.2, it can be a difficult task to choose an appropriate design for a particular setup. The issues that determine the choice become clear when the various designs are discussed in detail. A summary of our recommendations is as follows.

2.2.3.1 Recommended Design for Particular Setups
A good research strategy provides conclusions with minimal error within the constraints of funds, time, personnel, and equipment. In some situations, no choice is available. For investigating the relationship between pesticides and oxidative stress, intervention in terms of deliberately exposing some people to pesticides is not an option.
Observation of those who have already been exposed is the only choice. For determining the efficacy and safety of a drug, intervention in terms of administering the drug under controlled conditions is a must; an observational strategy is not a good option. As just indicated, the role of potentially harmful factors is generally studied by observations, but potentially beneficial factors can be studied by either strategy. Experiments have the edge in providing convincing results. Also, they can be carried out under controlled conditions; thus, a relatively small sample could be enough. However, experiments raise questions of ethics and feasibility. Guidelines for choosing a strategy for analytical studies are as follows:
1. Should be ethically sound, causing the least interference in the routine life of the subjects
2. Should be generally consistent with the approach of other workers in the field, and if not, the new approach should be fully justified
3. Should clearly isolate the effect of the factor under investigation from the effect of other concomitant factors in operation
4. Should be easy to implement and acceptable to the system within which the research is being planned
5. Should confirm that the subjects would sufficiently cooperate during the entire course of the study and would provide correct responses
6. Should be sustainable so that it can be replicated if required
As may be clear by now, the choice of design depends on the type of question for which an answer is sought by the study. Appropriate designs for specific research questions are given in Table 2.1. The details of these designs appear in subsequent chapters.

TABLE 2.1 Recommended Design for Different Types of Research Questions
1. What is the prevalence or distribution of a disease or measurement, or what is the pathological, microbiological, and clinical profile of a certain type of cases? Recommended design: sample survey.
2. Are two or more factors related to one another (without implication of a cause–effect)? Recommended design: cross-sectional study.
3. Do two or more methods agree with one another? Recommended design: cross-sectional study.
4. What is the inherent goodness of a test in correctly detecting the presence or absence of a disease (sensitivity and specificity)? Recommended design: case–control study.
5. What are the risk factors for a given outcome, or what is their relative importance? Recommended design: case–control study.
6. How good is a test or a procedure in predicting a disease or any other outcome? Recommended design: prospective study.
7. What is the incidence of a disease, or what is its risk in a specified group, or what is the relative risk? Recommended design: prospective study.
8. What are the sequelae of a pathological condition? Recommended design: prospective study. And how much or how does a factor affect the outcome? If the factor under study is potentially harmful: prospective study in humans; in rare situations, animal experiment. If the factor under study is potentially beneficial: RCT for humans, and experiment for animals or biological material.
9. Is an intervention really effective or more effective than another? Recommended design: RCT for humans, and experiment for animals or biological material.

2.2.3.2 Choice of Design by Level of Evidence
As a health professional, you are regularly required to update your knowledge of the fast-developing science and interpret the findings reported in the literature. You will want to assess the credibility of the evidence provided in different kinds of studies before accepting such results. If you are a researcher, the level of evidence produced in your work will be a crucial consideration.
TABLE 2.2 Hierarchy of Study Designs by Level of Evidence for Cause–Effect Relationship

Level 1 (best): RCT, double blind or crossover
Advantages: able to establish cause–effect and efficacy; internally valid results.
Disadvantages: assumes ideal conditions; can be done only for a potentially beneficial regimen; difficult to implement; expensive.

Level 2: Trial with no control, not randomized, or not blinded
Advantages: can indicate cause–effect when biases are under control; effectiveness under practical conditions can be evaluated.
Disadvantages: cause–effect is only indicated but not established; can be done only for a potentially beneficial regimen; outcome assessment can be blurred.

Level 2: Prospective observational study
Advantages: relatively easy to do; establishes the sequence of events; antecedents are adequately assessed; yields incidence.
Disadvantages: requires a big sample; limited to one antecedent; follow-up can be expensive.

Level 3: Retrospective observational study (case–control)
Advantages: the outcome is predefined, so there is no ambiguity; quick results are obtained; a small sample can be sufficient.
Disadvantages: antecedent assessment can be blurred; the sequence of events is not established; high likelihood of survival and recall bias.

Level 4: Cross-sectional study
Advantages: appropriate when the distinction between outcome and antecedent is not clear; when done on a representative sample, the same study can evaluate both sensitivity/specificity and predictivity.
Disadvantages: no indication of cause–effect, only of relationship.

Level 5: Ecological study
Advantages: easy to do.
Disadvantages: fallacies are common.

Level 6 (worst): Case series and case reports
Advantages: help in formulating a hypothesis.
Disadvantages: many biases can occur; do not reflect cause–effect.

Note: Laboratory experiments excluded.
The ultimate objective of all analytical studies is to obtain evidence for a cause–effect type of relationship. This depends on a large number of factors, such as the representativeness of the sample of subjects, their number in each group, and the control of confounders, but guidelines can still be laid down for the choice of design. The various types of studies according to the level of evidence are listed in Table 2.2, but the inverted pyramid in Figure 2.3 may be more convincing. The comparison in this table and the figure is valid when the other parameters of the study are the same. The pyramid also shows that there is no design with no chance of bias. Further details are given in subsequent chapters.
[Figure: An inverted pyramid ranking study types by chance of bias, from the highest (7) to none (0): 7, expert opinions and case reports; 6, ecologic studies and case series; 5, cross-sectional studies; 4, prospective/retrospective studies; 3, controlled trial (nonrandomized); 2, RCT, nonblind; 1, RCT, double blind; 0, none.]
FIGURE 2.3 Chance of bias in different types of studies.
2.3 Data Collection
After deciding the design of the study, the next step is the collection of data. This is done by "measuring" the past or present of the subjects by different methods. The measurement is not necessarily quantitative; it can be qualitative also. The details of such statistical types of measurements are given in Chapter 7, but note now that the purpose of all measurements is to get a handle on uncertainties. Measurements help to provide exactitude concerning the state of the characteristics of the subjects under study that cannot be attained in any other way.
2.3.1 Nature of Data
The nature of data varies from investigation to investigation and, within an investigation, from item to item. Data can be factual, knowledge based, or a mere opinion. Data can be obtained by observation, interview, or examination. This section discusses these aspects first before moving on to the various tools of data collection used in medical studies.
2.3.1.1 Factual, Knowledge-Based, and Opinion-Based Data
Data are factual when objectively measured. A subject's gender is factual and can seldom be wrong. Age and income are factual measurements but may not be known or may be wrongly reported. Extra care is needed when obtaining data on such characteristics. Height, weight, blood pressure, serum glucose level, and so forth can be factually recorded. Disease signs are factual, but their correct observation depends on the acumen of the attending clinician. Generally speaking, most, but not all, aspects of the current physical state of a subject can be factually obtained. However, symptoms such as severity of pain are rarely factual because they are mostly based on the perception of the patient, which can change quickly even when the intensity of pain does not change. A patient's history with regard to complaints in the past can sometimes be obtained from records; otherwise, it depends on the awareness and alertness of the subject. The response to an interview largely depends on the knowledge and cooperation of the subject, but note that it is not easy to assess the level of knowledge on a factual basis. Data of the third kind are based on opinion. Hopes, aspirations, and fears also fall into this category. These generally reflect attitudes. Fertility control activities in many developing countries are proceeding at a slow pace because a favorable opinion of the populace, and sometimes even of the government, is lacking. Consent of the patient for a new procedure also depends on this attitude. Many other examples can be cited. It is easy to advocate that data should be factual for quality evidence, but obtaining such data is not always easy. Aspects such as knowledge and opinion are important ingredients for some facets of health, particularly for the prevention and control of diseases such as sexually transmitted infections. In the case of treatment, the patient's knowledge, right or wrong, about, for example, the side effects or complications of a surgery is an important consideration in determining the outcome. Also, opinion can sometimes be important for the faithful implementation of a regimen.
2.3.1.2 Method of Obtaining the Data
Medical data are mostly obtained by observation, interview, or examination. Observations may be important for behavioral studies, say in the case of psychosomatic disorders, but are otherwise rarely used. The primary methods in a clinic are interview and examination. An interview can be based on a preset structured form (questionnaire or schedule; see the next section) or be unstructured, where the interviewer has the liberty to frame the questions as considered appropriate. Most researchers these days use the structured form, since an unstructured one can introduce unsuspected interviewer bias. An interview may or may not reveal the full truth, whereas the data obtained by examination are the most reliable in the sense that they can be largely believed to be correct. This includes laboratory and imaging investigations, which form the core of diagnostic tools these days. Some of these methods are very expensive to adopt or just not available in a particular setup, and the clinician may have to resort to interview and physical examination, or to a test of lower cost, to obtain workable data. Although factual data based on examination are often given more weight, the knowledge, opinions, and complaints revealed by interview and observation, particularly for symptoms, have an important place in the practice of medicine. All efforts should be made to obtain valid as well as reliable data when these methods are used. A detailed discussion of these concepts is presented in Chapter 20 in the context of data quality issues.
2.3.2 Tools of Data Collection
The tools of data collection include the different instruments used to collect data. These can be existing records, questionnaires and schedules, interviews and examinations, laboratory and radiological investigations, and so forth. For example, a patient with liver disease can be identified by asking the individual whether the disease has already been diagnosed, by looking at the health records, by clinical examination, or by carrying out some biochemical tests. Visual acuity in a subject can be assessed just by noting the power of the spectacles being worn, if any, or by a proper visual examination by an optometrist. Each method of eliciting information has merits and demerits in terms of validity and reliability, on the one hand, and cost and time, on the other. These factors have to be assessed in the context of the problem at hand and of the resources available. This section contains some guidelines.

2.3.2.1 Existing Records
Almost all hospitals and clinics maintain fairly good records of the patients served by them. A civil registration system may have records of births and deaths in a community, along with the cause of death in some countries, which, when combined with information on age, sex, occupation, and so forth, can be a useful resource. Some countries have a system of health centers where a comprehensive record of each family is maintained. Then, there are records of ad hoc surveys done by different agencies for specific purposes. Many individuals also keep a record of their health parameters based on periodic examinations and investigations. The use of existing records for medical decisions has some demerits:
1. Entry in records is rarely done with the same keenness as focused data collection for a specific topic. Records tend to be incomplete, and conditions for which no physician was consulted may never appear. Records may not contain information about items that look vital for the study at hand but were not perceived to be so when the recording was done.
2. If the records are handwritten, as in some hospitals in some parts of the world, they may even be difficult to decipher.
3. Hospital records seldom represent the target population. Some hospitals cater to a specific ethnic, economic, or geographical group. Quite often, only severe cases go to the hospital. If the study objective demands that mild cases also be included, such records may not be adequate.
4. If only published records are accessible, publication may be late, making the information obsolete. Sometimes only selective records are published, or only summaries are available, which may not be adequate for your study.
5. Records maintained by different hospitals, other agencies, or patients are seldom in a uniform format. They may lack comparability, which makes pooling difficult.
Despite all these problems, records are and should be used for medical studies wherever feasible because they are the cheapest source of data. They preexist and so are not likely to be biased. In fact, the first attempt in all studies should be to explore the available records, published or not. These records can be compared with a road that is full of bumps and holes but is still passable by a careful driver [2].
By evaluating aspects such as the extent of underreporting and the population that the records can reasonably represent, and by linkage of the various records of the same person at different places, it may still be possible to put together a usable picture for specific segments of the target population. In cases such as suicides and accidental deaths, records may be the only source from which to construct a health profile of the victim. Properly maintained records may be vital in a study of trends in the causes of early mortality.

2.3.2.2 Questionnaires and Schedules
A questionnaire contains a series of questions to be answered by the respondents. It can be self-administered or administered by an interviewer. In the case of the former, the education and attitude of the respondent toward the survey can substantially influence the response, and in the case of the latter, the skill of the interviewer can make a material difference. The interviewer may have to be trained on how to approach a subject; how to obtain cooperation; when to prompt, probe, pause, or interject; and so forth. Some details of these aspects are described in simple language by Hepburn and Lutz [3]. The term schedule is used when the form contains a list of items (and not questions) on which information is collected. The information can be obtained by observation, by interview, or by examination. Whether a questionnaire or a schedule is used, sufficient space is always provided to record the response. A list of possible or expected responses can be given to choose from against the questions or items.
Then the form is called close-ended. In the list of choices, the last item can be "others (specify)" to make the list exhaustive. If needed, carry out a pilot study and ensure that all common responses are listed so that responses under "others" do not exceed, say, 10% of the total responses. If they exceed that, examine the "specify" component of "others" and make a new category out of the frequently cited responses. The list of responses should be designed in such a manner that no choice is forced on the respondent. The choices should also preferably be mutually exclusive so that one and only one response is applicable in each case. Depending on the question or item, one of the responses could be "not applicable." For example, the question of the results of a laboratory investigation does not arise in cases where the laboratory investigation was not required. This should be distinguished from cases in which the investigation was required but could not be done for some reason. The degree of accuracy of the measurements must be specified to help the surveyor record accordingly. For example, age can be recorded as on the last birthday, weight to the nearest kilogram, and plasma glucose level to the nearest 5 mg/dL. The responses are sometimes precoded for easy entry into the computer.

When the response is to be recorded verbatim, the question or item is called open-ended. This can present difficulty during the interpretation and analysis of data but can provide additional information unforeseen earlier. Also, such open-ended responses can provide information regarding the quality of the response. Open-ended questions can provide results different from those of close-ended questions. Suppose a question for parents is, "What is the most important thing for the health of your children?" More than 60% might say nutrition when this alternative is offered on a list, yet fewer than 10% may provide this answer when no list is presented.

Framing questions is a difficult exercise because the structure of the sentence and the choice of words become important. Some words may have different meanings for different people. In any case, it is evident that the questions must be unambiguously worded in simple language. They must make sense to the respondent, and the sequence must be logical. The length of the questionnaire should be carefully decided so that all the questions can be answered in one sitting without any feeling of boredom or burden. It is always helpful for the interviewee, as well as for the interviewer, to divide the questionnaire into sections. The use of features such as italics, boldface, underlining, and capitals can help to clarify the theme of a question. Respondents tend to be careless toward the tail end of the questionnaire if it is unduly long and not interesting. A large number of questions may be a source of attrition for the subjects and can increase nonresponse. Thus, nonessential questions must be avoided. Ask yourself how much you lose if a particular question is not asked. All the questions should be short, simple, nonoffending, and corroborative. They should be such that the respondent is able to answer them.

One aspect of some questions is the rating scale used for providing the answer. For a question such as "How fit are you after 1 week of a surgery?" many may rate themselves +3 if the scale is from −5 to +5, but not 8 if the scale is from 0 to 10.
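The two ratings are, in fact, numerically the same response: shifting the 0 to 10 scale down by 5 units maps it onto the −5 to +5 scale,

    y = x − 5, so that x = 8 gives y = 8 − 5 = +3.

The discrepancy in how respondents actually use the two scales is therefore a property of the scale's presentation, not of the respondent's condition.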
Using ranks 0, 1, 2, 3, 4, and so forth as a numerical score implies that 3 is three times 1, and that the difference between ranks 4 and 2 is the same as that between ranks 2 and 0. Check that this is really so before using these as scores. However, qualitative comparators, such as which is better, and frequencies, such as never, sometimes, often, and always, can be used as ordinal categories without assigning scores. Sometimes methods such as Rasch analysis are used to develop and evaluate a questionnaire or schedule. Rasch analysis models the chance of a correct response to an item in terms of the ability of the person and the difficulty of the item: a more able person should have a higher chance than a less able person of getting any item correct, and any person should have a lower chance of getting a more difficult item correct. However, the procedure has limitations, as it depends on who and how many respondents attempt the different items. Also, it generally assumes that all items have equal correlation with the latent trait intended to be assessed. For further details, see Bond and Fox [4].

The questionnaire or schedule must always be accompanied by a statement of the objectives of the survey so that the respondent becomes aware of them. Easy-to-follow instructions for recording responses, and explanatory notes where needed, are always helpful. Special care may have to be taken concerning possible memory lapse if a question or item requires recall of an event. Whereas serious events such as accidents and myocardial infarctions are easy to recall even after several years, mild events such as episodes of fever or of diarrhea may be difficult to recall after a lapse of just 1 month. There is also an effect of the attitude of the patient toward certain ailments. An older person may perceive poor vision or infirmity as natural and may not report it at all. A smoker might similarly ignore coughing. On the other hand, mild conditions may be exaggeratedly reported, depending on the disturbance they create in the vocational pursuits of the patient. A vocalist may be worried more about the vocal cords than about a fractured hand. Thus, interview responses measure the person's perception of the problem rather than the problem itself. Some conditions with a social stigma, such as venereal diseases and impotence, may not be reported despite their perception as important.

2.3.2.3 Likert Scale
For psychometric assessment of opinion, belief, and attitude, a specific kind of tool, called a Likert scale, is popular. For a particular construct, such as satisfaction with hospital services, a set of items is developed so that all aspects of that construct can be assessed.
For hospital services, these items could be on the competence of doctors, humanitarian nursing services, proper diagnostic facilities, and so forth. A declarative statement such as "I am happy with the nursing services in this hospital" is made in each item, and the respondent is asked to specify his or her level of agreement, generally on a 5-point scale, such as strongly disagree, disagree, indifferent (neutral), agree, and strongly agree. This is called a Likert scale. The response for each item is graded on the same scale, not necessarily 5-point; it could be 7-point, 9-point, or any other. An attempt is made to equally space the options from strongly disagree to strongly agree so that they can be legitimately assigned scores of 0–4, 0–6, or 0–8, depending on the number of options. Instead, these scores can be −2 to +2, −3 to +3, or −4 to +4. Various scoring patterns can give different results, and the choice is yours so long as you can justify it. The options for some items are reversed so that "strongly agree" is the first option and "strongly disagree" the last, or some items are negatively framed. This helps in eliciting a well-thought-out response instead of the respondent consistently selecting the same response, like "agree," for most items. At the time of analysis, such reversed items and the corresponding scores are rearranged in the proper order.

Analysis is generally done for the total score. Add the scores of all the items to get a total score for each respondent. This is called the Likert score of that respondent for the construct under study. This converts individual qualitative ordinal ratings to a quantity. For a 10-item questionnaire with each item on a 5-point scale (0–4), the total score for each respondent will range from 0 to 40. This can be analyzed just as any other numerical measurement. Second, the analysis can be focused on each item. If item 3 is on satisfaction with diagnostic facilities, and you have n = 50 responses on a 0–4 scale for this item, you can calculate the median score to get an assessment of satisfaction with diagnostic facilities, or to compare it with, say, nursing services. You can thus find which component of the construct is adequate and which needs strengthening. (A computational sketch of this scoring appears after Section 2.3.3.)

2.3.2.4 Guttman Scale
A binary (yes/no) set of questions or items is said to be on the Guttman scale when the items follow a hierarchy such that yes to any item implies that the answer is yes to all the subsequent (or preceding) items. For this, the characteristic under assessment is arranged in a continuum from minimum to maximum (or vice versa). For example, visual acuity is assessed in stages, and if a person with a cataractous eye can read a font size of 24 from a distance of 14 inches, this implies that he or she can also read a font size of 48. Thus, a threshold can be obtained, and one response can predict all the other responses lower in the hierarchy. This scale is mostly used to develop a short questionnaire that can still discriminate, such as for assessing disease severity or the level of satisfaction with hospital services. Sometimes the study is in reverse mode, where an attempt is made to derive a Guttman scale of questions from the answers. Such recovery of a good Guttman scale from noisy data is challenging. In this setup, the potential items are placed in a random order and analyzed for difficulty on the basis of the number of correct scores. Then they are placed in a hierarchy. This can be understood as the rank-ordering of the items.
In this case, errors in the hierarchy (responses that violate the expected hierarchical pattern) are used to calculate what is called the coefficient of reproducibility. For this and other details, see Indrayan and Holt [5].

2.3.3 Pretesting and Pilot Study
It is generally considered essential that all questionnaires, schedules, laboratory procedures, and other tools be tested for their efficacy before they are finally used for the main study. This is called pretesting. Many unforeseen problems or lacunae can be detected by such an exercise, and the tools can be accordingly adjusted and improved. Pretesting can reveal whether the items of information are adequate, feasible, and clear; the space provided for recording is adequate; the length of the interview is within limits; the instructions are adequate; and so forth. It also serves as a rehearsal of the actual data collection process and helps to train the investigators. Sometimes pretesting is repeated to standardize the methodology for eliciting correct and valid information.

A study on a small number of subjects before the actual study is called a pilot study. This simulates the actual study. It provides a preliminary estimate of the parameters under investigation. This preliminary estimate may be required for the calculation of sample size as per the details given later in this book. If the phenomenon under study has never been investigated earlier, a pilot study is the only way to get a preliminary estimate of the parameter. A pilot study may also provide information on the size of the clusters and the sampling units at various stages, which could be important in planning cluster or multistage sampling for large-scale surveys. Remember that a pilot study is not done to investigate the statistical significance of the result. It only provides a preliminary estimate of the relevant factors for designing a full study where prior data are not available.
In a rare case, when the effect size is really large and the pilot study is carried out with a proper selection of subjects, the results may even be conclusive, although many would question a conclusion based on a pilot study.
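The Likert scoring of Section 2.3.2.3 and the Guttman coefficient of reproducibility mentioned in Section 2.3.2.4 are easy to compute. Below is a minimal Python sketch with hypothetical data; the simple error count used for reproducibility is only one of several conventions, and the book refers the reader to Indrayan and Holt [5] for the details.

```python
from statistics import median

# Hypothetical responses of four subjects to a 5-item Likert questionnaire,
# each item scored 0-4 (strongly disagree ... strongly agree).
responses = [
    [4, 1, 3, 2, 4],
    [3, 0, 4, 3, 3],
    [2, 2, 2, 1, 2],
    [4, 1, 4, 2, 3],
]
REVERSED_ITEMS = {1}  # suppose item 2 (index 1) was negatively framed
MAX_SCORE = 4

def rectify(row):
    # Put reversed items back in the proper order before scoring
    return [MAX_SCORE - v if j in REVERSED_ITEMS else v for j, v in enumerate(row)]

rectified = [rectify(r) for r in responses]
likert_totals = [sum(r) for r in rectified]              # per-respondent score, 0-20 here
item_medians = [median(col) for col in zip(*rectified)]  # item-wise assessment

def reproducibility(binary_rows):
    """Guttman coefficient of reproducibility = 1 - errors/total responses,
    counting as an error every 'yes' (1) that follows a 'no' (0) when the
    items are arranged from easiest to hardest."""
    errors = total = 0
    for row in binary_rows:
        total += len(row)
        seen_no = False
        for v in row:
            if v == 0:
                seen_no = True
            elif seen_no:
                errors += 1
    return 1 - errors / total

print(likert_totals, item_medians)
print(reproducibility([[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0]]))  # 1 error in 12: ~0.92
```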
2.4 Nonsampling Errors and Other Biases
Different samples contain different sets of individuals, and they can provide different results. This is called sampling error, although it is not an error; the better term is sampling fluctuation. These fluctuations are random and cannot be avoided. They affect reliability, which can be improved by increasing the sample size. Statistical methods are designed to provide dependable results despite this error. Trickier are nonsampling errors, which occur due to various types of biases in concepts, sample selection, data collection, analysis, and interpretation. These indeed are errors, and they affect validity, since the results cannot be believed when such biases are present. The most common nonsampling error is nonresponse. The other biases listed later in this section are also nonsampling errors.

2.4.1 Nonresponse
The inability to elicit full or partial information from a subject after inclusion in a study is termed nonresponse. This can happen due to the subject turning noncooperative, relocation, injury, death, or any such reason. The opportunity for nonresponse is particularly present in follow-up studies, but even in the case of a one-time evaluation, the subject may refuse to answer certain questions or may not agree to submit to a particular investigation or examination even when prior consent has been obtained.

Nonresponse has two types of adverse impacts on the results. The first is that the ultimate sample size available from which to draw conclusions is reduced, and this affects the reliability of the results. This deficiency can be remedied by increasing the sample size corresponding to the anticipated nonresponse, as sketched below. The second is more serious. Suppose you select a sample of 3000 out of 1 million. If only 1950 respond out of 3000, the survey results could be severely biased: these responders could be conformers or those with strong views. If not biased, a sample of 1950 is not too bad for providing a valid estimate or for testing a hypothesis in most situations. Mostly, the nonresponding subjects are not a random segment but are of a specific type, such as seriously ill cases who do not want to continue in the study, very mild cases who opt out after feeling better, or some such special segment. Their exclusion can severely bias the results. A way out is to take a subsample of the nonrespondents and undertake intensive efforts for their full participation. Assess how these subsample subjects differ from the regular respondents and adjust the results accordingly. A provision for such extra efforts to elicit responses from some nonrespondents should be made at the time of planning the study when high nonresponse is anticipated.

Experience suggests that some researchers fail to distinguish between nonresponse and a zero value or the absence of a characteristic. Take care that this does not happen in the data you are examining as evidence for practice or in the data you are recording for research. Nonresponse should be identified by entering, say, NR in the database, and 0 should be entered for a zero value or absence. Similarly, not applicable should be entered as, say, NA. Nothing should be left blank in any case. This book later discusses methods such as imputation for missing values and intention-to-treat analysis that can partially address the problem arising from nonresponse. But no analysis, howsoever immaculate, can replace the actual observation. Thus, all efforts should be made to ensure that nonresponse is minimal, if not altogether absent.
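The usual planning arithmetic for the first problem, restoring the sample size, is to divide the required number by the expected response rate. A minimal sketch (the 65% rate mirrors the 1950-of-3000 illustration above; the NR/NA/0 coding follows the convention just described):

```python
import math

def inflate_for_nonresponse(n_required: int, expected_response_rate: float) -> int:
    """Recruit enough subjects so that about n_required respond.
    This restores only reliability (sample size); it does nothing about
    the bias that arises when nonrespondents differ from respondents."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("expected_response_rate must be in (0, 1]")
    return math.ceil(n_required / expected_response_rate)

# With a 65% response rate, about 4616 subjects must be recruited
# to end up with roughly 3000 responses.
print(inflate_for_nonresponse(3000, 0.65))

# Coding convention from the text: distinguish nonresponse ("NR"),
# not applicable ("NA"), and a genuine zero; leave nothing blank.
record = {"episodes_last_month": 0,   # a real zero, not a nonresponse
          "lab_result": "NA",         # investigation not required
          "income": "NR"}             # subject refused to answer
```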
Strategies for this should be devised at the time of planning the study, and all necessary steps should be taken.

2.4.2 Variety of Biases to Guard Against
Medical study results often become clouded because some bias is detected after the results are available. No analysis can give valid results if the data are not correct, since statistical methods assume that whatever is recorded and submitted for analysis is flawless. Therefore, it is important that all sources of bias be considered at the time of planning a study and that all efforts be made to control them.

2.4.2.1 List of Biases
The various sources of bias are as follows. They are not mutually exclusive, and the overlap is substantial in some cases. Some of the biases in this list are collections of many biases of a similar type; if all of these were stated separately, the list would become unmanageable.
The biases are described in brief here, and the details are provided later in the contexts where they predominantly occur.
1. Bias in concepts: This occurs due to a lack of clarity about the concepts to be used in the proposed study. This gives an undesirable opportunity to the investigators to use subjective interpretation, which can vary from person to person.
2. Definition bias: The study subjects should be sharply defined so that there is no room for ambiguity. For example, if the cases are of tuberculosis, specify whether these should be sputum positive, Mantoux positive, radiologically established, or some combination. A blurred definition gives unnecessary room to the assessor to use his or her own interpretation, which can affect the validity of the study.
3. Bias in design: This bias occurs when the case group and the control group are not equivalent at baseline, and differentials in prognostic factors are not properly accounted for at the time of analysis. Design bias also depends on the structure of the study: an ecological study has more chance of bias than a double-blind randomized controlled trial (RCT).
4. Bias in selection of subjects: The subjects included in the study may not truly represent the target population. This can happen either because the sampling was not random or because the sample size was too small to represent the entire spectrum of subjects in the target population. Studies on volunteers always have this kind of bias. Selection bias can also occur because the serious cases have already died and are not available with the same frequency as the mild cases (survival bias). Bias due to self-selection of cases and due to the inclusion of volunteers is obvious.
5. Bias due to concomitant medication or concurrent disease: Selected patients may suffer from other, apparently unrelated conditions, but their response might differ either because of the condition itself or because of medication given concurrently for that condition.
6. Instruction bias: When there are no instructions or unclear instructions are prepared, the investigators use discretion, and this can vary from person to person and from time to time.
7. Length bias: A case–control study is generally based on prevalent cases rather than incident cases. Prevalence is dominated by those who survive for a longer duration, and these patients are qualitatively different from those who die early. Thus, the sample may include disproportionately more of those who are healthier and survive longer.
8. Bias in detection of cases: Errors can occur in diagnostic or screening criteria. For example, a laboratory investigation done properly in a hospital setting is less error-prone than one carried out in the field setting where the study is actually done. Detection bias also occurs when cases with mild disease do not report or are difficult to detect. If this is inadvertent, the results will be biased without your knowing about such a bias.
9. Lead time bias: All cases are not detected at the same stage of the disease. With regard to cancers, some may be detected at the time of screening, for example, by Pap smear, and some may be detected when the disease starts clinically manifesting. For some others, detection occurs when the disease is already in the advanced stage, and the follow-up is from the time of detection. This difference in "lead time" can cause systematic error in the results.
10. Bias due to confounders: This bias occurs due to failure in taking care of the confounders. In that case, any difference or association cannot be fully ascribed to the antecedent factors under study.
11. Bias due to epistemic factors: Efforts can be made to control only those factors that are known. But there may be many unknown factors that can affect the results. These epistemic factors can bias the results in unpredictable ways.
12. Contamination in controls: Control subjects are generally those that receive a placebo or the regular therapy. If these subjects are in their homes, it is difficult to know whether they have received some other therapy that can affect their status as controls. In a field situation, contamination of a control group can occur if the control group is in close proximity to an unblinded test group and learns from the experience of the latter. The neighboring area may not be the test area of the research, but some other program may be going on there that has a spillover effect on the control area.
13. Berkson bias: Hospital cases, when compared with hospital controls, can be biased if the exposure increases the chance of admission. Cases in a hospital will then have a disproportionately higher number of subjects with that exposure. Cases of injury in motor vehicle accidents have this kind of bias.
14. Bias in ascertainment or assessment: Once the subjects are identified, it is possible that more care is exercised by the investigators for cases than for controls. This can also occur when subjects belonging to a particular social group have records, but others have to depend on recall. Sometimes this is also called information bias.
15. Interviewer bias or observer bias: Interviewer bias occurs when one is able to elicit better responses from one group of patients (say, those who are educated) than from another kind (such as illiterates). Observer bias occurs when the observer unwittingly (or even intentionally) exercises more care with one type of response or measurement, such as those supporting a particular hypothesis, than with those opposing the hypothesis. Observer bias can also occur if, for example, the observer is not fully alert when listening to Korotkoff sounds while measuring blood pressure or is not able to properly rotate the endoscope to get an all-round view of, say, the duodenum in a suspected case of peptic ulcer.
16. Instrument bias: This occurs when the measuring instrument is not properly calibrated. A scale may be biased to give a reading higher or lower than the actual, such as the mercury column of a sphygmomanometer not being empty in the resting position. A second possibility is the inadequacy of an instrument to provide a complete picture, for example, an endoscope not reaching the site of interest, thereby giving false information. In a Likert scale assessment, +3 may be more frequent on a −5 to +5 scale than 8 on a 0 to 10 scale, although both are the same. A third kind of instrument bias occurs when an instrument is considered the gold standard because it is acknowledged as the best, forgetting that it too is most likely imperfect. A fourth occurs when the predictivity of an instrument is used in a new setup without considering that the different prevalence of the condition in the new setup would affect the predictivity.
17. Hawthorne effect: If subjects know that they are being observed or investigated, their behavior and responses can change. In fact, this is the basis for including a placebo group in a trial. The usual responses of subjects are not the same as when under a scanner.
18. Recall bias: There are two types of recall bias. One arises from better recall of recent events than of those that occurred a long time ago; also, serious episodes are easier to recall than mild episodes. The second type arises when cases suffering from a disease are able to recall events much more easily than the controls, if the controls are apparently healthy subjects.
19. Response bias: Cases with serious illness are likely to give more correct responses regarding history and current ailments than are controls. This is not just because of recall but also because the former keep records meticulously. Some patients, such as those suffering from sexually transmitted diseases (STDs), may intentionally suppress sexual history and other information because of the stigma attached to these diseases. Injury history may be distorted to avoid legal consequences. An unexpected illness, a death in the family, or any such drastic event may produce an extreme response. Response bias also comes under information bias.
20. Bias due to protocol violation: It is not uncommon in a clinical trial that some subjects do not receive the full intervention or the correct intervention, or some ineligible subjects are randomly allocated in error. This can bias the results.
21. Repeat testing bias: In a pretest–posttest situation, the subjects tend to remember some of the previous questions, and they may remove previous errors in the posttest, thus doing better without any effect of the intervention. The observer may acquire expertise the second or third time in eliciting the correct response. Conversely, fatigue may set in with repeat testing, which could alter the response. It is widely believed that most biological measurements have a strong tendency toward the mean: extremely high scorers tend to score lower in subsequent testing, and extremely low scorers tend to do better in a subsequent test.
22. Midcourse bias: Sometimes the subjects, after enrollment, have to be excluded if they develop an unrelated condition, such as an injury, or their condition becomes so serious that their continuation in the trial is no longer in the interest of the patient. If a new facility such as a health center is started, or an existing one is closed, for the population being observed for a study, the response may be altered. If two independent trials are going on in the same population, one may contaminate the other. An unexpected intervention, such as a disease outbreak, can alter the response of those who are not affected.
23. Self-improvement effect: Many diseases are self-limiting. Improvement over time occurs irrespective of the intervention, and it may be partially or fully, and unnecessarily, ascribed to the intervention. Diseases such as arthritis and asthma have natural periods of remission that may look like the effect of therapy.
24. Digit preference: It is well known that almost all of us have a special love for the digits zero and five. Measurements are more frequently recorded ending with these digits. A person aged 69 or 71 years is likely to report his or her age as 70 years in place of the exact age. Another manifestation of digit preference is in forming intervals for quantitative data, as discussed later in Chapter 7.
25. Bias due to nonresponse: As already discussed in detail, some subjects refuse to cooperate, suffer an injury, die, or become untraceable. In a prospective study, there might be some dropouts for various reasons.
Nonrespondents have two types of effects on the responses. First, they are generally different from those who respond, and their exclusion can lead to biased results. Second, nonresponse reduces the sample size, which can decrease the power of the study to detect a specified effect.
26. Attrition bias: The pattern of nonresponse can differ from one group to another in the sense that in one group, more severe cases drop out, whereas in another group, mostly mild cases drop out. In a rheumatoid arthritis data bank study, attrition during follow-up was high in patients of young age, who were less educated and were of non-Caucasian race [6].
27. Bias in handling outliers: No objective rule is available to label a value as an outlier, except the guideline that the value must be far away from the mainstream values. If the duration of hospital stay after a particular surgery is mostly between 6 and 10 days, some researchers would call 18 days an outlier and exclude it on the suspicion of its being a wrong recording, and some would consider it right and include it in their calculation. Some would not exclude any outlier, however different it might be. Thus, the results would vary.
28. Recording bias: Two types of errors can occur in recording. The first arises due to the inability to properly decipher the writing on case sheets. Physicians are notorious for illegible writing. This can happen particularly with similar-looking digits, such as 1 and 7, and 3 and 5. Thus, the entry of data may be in error. The second arises due to the carelessness of the investigator. A diastolic level of 87 can be wrongly recorded as 78, or a code 4 entered as 6 when memory is relied on, which can fail to recall the correct code. Sometimes an age of 60 years may be entered as 6 years, and this would not be detected if children are also included in the study. Wrongly pressing adjacent keys on the computer keyboard is not uncommon either.
29. Data handling errors: Coding errors are quite common. Also, information from the respondents is generally collected by filling out a form, and then this is transferred to a worksheet. Entry errors can occur in this process, particularly when the study is large scale. Sometimes the entries require calculation, such as of body mass index, and there might be errors in this also.
30. Bias in analysis: This again can be of two types. The first occurs when the analysis is geared to support a particular hypothesis. While comparing pre- and postvalues, for example, hemoglobin (Hb) levels before and after weekly supplementation of iron, the average increase may be small enough that it will not be detected by a comparison of means but may be detected when evaluated as the proportion of subjects with levels below a specified cutoff.

P > 0.10, because 1.66 is less than the critical value of 1.812 for two-tailed P = 0.10 (corresponding to one-tailed P = 0.05 for 10 df in this table). Since this P-value is large, the null hypothesis of equality of means cannot be rejected. The evidence is not strong enough to conclude that the mean albumin level after the treatment is any different from the mean before the treatment when these two groups are independent.
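This tabled comparison is easy to verify with software. A small sketch using scipy (only the t statistic of 1.66 and the 10 degrees of freedom are taken from the text; the code itself is illustrative):

```python
from scipy import stats

t_stat, df = 1.66, 10
t_crit = stats.t.ppf(0.95, df)             # one-tailed 0.05 point = two-tailed 0.10 point
p_two_tailed = 2 * stats.t.sf(t_stat, df)  # sf(x) = 1 - cdf(x), the upper-tail area
print(round(t_crit, 3))        # 1.812, the tabled critical value
print(round(p_two_tailed, 3))  # about 0.13, so P > 0.10: do not reject H0
```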
15.1.2.3 Some Features of Student t

Examples 15.2 and 15.3 very aptly illustrate the following features of Student t:

1. The t-test is based on the magnitude of the difference and its variance but ignores the fact that five of the six patients in the paired setup in Example 15.2 showed some rise. If the interest is in the proportion of subjects showing a rise, and not in the magnitude of the rise, use the methods of Chapter 13.

2. The same difference is statistically significant in the paired setup in Example 15.2 but not significant in the unpaired setup in Example 15.3. This occurred because the difference in the paired setup is fairly consistent, ranging from −0.1 to +0.9 g/dL. Each patient served as his or her own control. In the unpaired setup, the interindividual variation is large. This shows that a paired setup may be a good strategy in some situations. The earlier advice of matching controls with cases is for this reason. Such matching simulates pairing and reduces the effect of at least one major source of uncertainty. However, in the case of before–after studies, the deficiencies mentioned in a previous chapter remain.

3. The size of the sample is 6 in the paired setup and a total of 12 in the unpaired setup. Both are small. Recall that Student t is valid only when means follow a Gaussian pattern. When n is large, this pattern is nearly always Gaussian due to the central limit theorem, whether or not the underlying distribution of the individual measurements is Gaussian. When n is small (say, less than 30), the t-test is valid only if the underlying distribution is Gaussian. This is the assumption made in these examples. If the underlying distribution is far from Gaussian and n is small, other methods, such as nonparametric (e.g., Wilcoxon) tests, are used. Some of these methods are presented in Section 15.3.

4. In Example 15.3, one should examine whether the assumption σ1² = σ2² is violated. The ratio s1²/s2² in this example is 0.5797²/0.3209² = 3.26. For samples of size 6 each, this may be within the tolerance limit. Many statistical software packages automatically test this and, in the case of violation, obtain the P-value by using the Welch t given in Equation 15.6b in place of the pooled variance t given in Equation 15.6a. You may wish to calculate this and examine whether a different P-value is obtained; some statistical software would do this automatically or would provide P-values based on both types of t (a sketch comparing the two appears after this list). However, caution is required when using a separate variance estimate. When the variances are different, it is clear that the two populations are different, and equality of means may not be of much consequence. Even if the means are equal, the distribution patterns are different. If it is known, for example, that BMI is much more variable in women than in men, equality of their means would rarely help. Thus, use of a separate variance estimate in a two-sample t is rare.
5. As mentioned in Chapter 8, in some cases, for example in the estimation of antibody titers, the conventional (arithmetic) mean is not applicable and the geometric mean (GM) is used to measure the central value. The logarithm of the GM has the same features as the usual arithmetic mean. Thus, the Student t-test and other means-based tests can be carried out on GMs after taking the logarithm of the values. However, in this case, the conclusions apply to the log values and not to the values themselves. You should carefully examine whether these results can be extended to the original values. In many cases, such extension does not cause any problem.

6. All statistical tests give valid conclusions for groups, not for individuals. The t-test is for average values in the groups. Individuals can behave in an unpredictable way irrespective of statistical significance; thus, caution is required in applying results to individual subjects. The clinical features of a person may completely override the means-based conclusions.

7. It is often mentioned in the examples in this text that the samples are random. Statistical inference is not valid for nonrandom samples.

8. The fundamental requirement for the Student t-test is that the sample values are independent of each other. In Example 15.3, the albumin level in one patient is not going to affect the level in any other patient; thus, the values are independent. In Example 15.2, there is no such independence because of the paired setup, but once the differences are obtained, these differences are independent. A paired t-test is based on the differences. Consider BP measured for two or more subjects belonging to the same family. Familial aggregation is well known for BP, and thus values belonging to members of the same family are not independent. The Student t-test cannot be used for such values unless the family effect is first removed. Ingenious methods may be needed for removing this effect. Most practical situations do not have this constraint, and Student t can be safely used.
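Feature 4 above can be checked directly in software. The following is a minimal sketch in Python, using hypothetical values (the raw data of Example 15.3 are not reproduced here); in SciPy, the pooled variance t and the Welch t are selected through the equal_var argument of scipy.stats.ttest_ind.

```python
from scipy import stats

# Hypothetical albumin-like values (g/dL) for two independent groups
group1 = [3.2, 3.6, 3.1, 3.8, 3.4, 3.5]
group2 = [3.7, 4.1, 3.6, 4.4, 3.9, 4.0]

# Levene test for equality of variances can guide the choice of test
print(stats.levene(group1, group2))

# Pooled-variance two-sample t (assumes equal variances, Equation 15.6a)
print(stats.ttest_ind(group2, group1, equal_var=True))

# Welch t with separate variance estimates (Equation 15.6b)
print(stats.ttest_ind(group2, group1, equal_var=False))
```

When the two sample variances are similar, the two versions give nearly the same P-value; they diverge as the variance ratio grows.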
15.1.2.4 Effect of Unequal n

The Student t-test for two independent samples does not have any restriction on n1 and n2: they can be equal or unequal. However, equal ns are preferred for two reasons: (a) when a total of 2n subjects are available, their equal division between the groups maximizes the power to detect a specified difference, and (b) the two-sample t is not robust to σ1² ≠ σ2² unless n1 = n2. If the smaller sample has the larger variance, the problem is aggravated. Although equal ns are desirable, this may not be a prudent allocation in many medical situations. In clinical trials, controls are often easy to investigate, and more than one control per case could be a good strategy, as discussed in an earlier chapter.

15.1.2.5 Difference-in-Differences Approach

A method popular with social scientists, and now making its way into the medical sciences, is to test the significance of the difference in differences. It is used when paired observations are available from two independent groups. Suppose the test group is measured before and after the treatment, and a control group before and after a placebo. If the corresponding population means are μ1T (before treatment), μ2T (after treatment), μ1C (before placebo in the control group), and μ2C (after placebo in the control group), then the actual treatment effect is
Difference in differences: (μ2T − μ1T) − (μ2C − μ1C),
assuming that the after values are higher. The statistical significance of this quantity can be tested by an unpaired t after obtaining the within-subject differences in the two groups, as sketched below. If the difference in differences is found significant, the estimate of the treatment effect is obtained by substituting the corresponding sample means.
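A minimal sketch of this difference-in-differences test follows, with hypothetical before-and-after values; the within-subject differences are computed first, and an unpaired t compares them across the two groups.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after measurements (any units) in two groups
treat_before = np.array([7.1, 6.8, 7.4, 7.0, 6.9, 7.3])
treat_after  = np.array([8.0, 7.5, 8.3, 7.9, 7.6, 8.1])
ctrl_before  = np.array([7.0, 7.2, 6.9, 7.1, 7.3, 6.8])
ctrl_after   = np.array([7.2, 7.4, 7.0, 7.3, 7.4, 7.0])

# Within-subject differences in each group
d_treat = treat_after - treat_before
d_ctrl = ctrl_after - ctrl_before

# Unpaired t on the differences tests the difference in differences
t_stat, p_value = stats.ttest_ind(d_treat, d_ctrl)

# If significant, the treatment effect is estimated by the sample DiD
effect = d_treat.mean() - d_ctrl.mean()
print(t_stat, p_value, effect)
```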
15.1.3 Analysis of Crossover Designs

As mentioned in Chapter 5, crossover can be a very efficient strategy for trials of a regimen that provides temporary relief. Hypertension, thalassemia, migraine, and kidney disease requiring repeated dialysis are examples of conditions that are relieved only temporarily by the currently available regimens. The crossover design economizes on subjects because the same subject is used for trials of two regimens. However, each subject must be available for twice as long as in a routine trial. Comparison is within subjects, and therefore more precise.

In a crossover design, one group receives regimen A and then regimen B (AB sequence), and the other group receives regimen B and then regimen A (BA sequence). The sequence itself can cause a differential effect. This can happen particularly when the intervening period is more favorable to one regimen than to the other. For example, when lisinopril is given first and then losartan after a 2-week washout gap to nondiabetic hypertensive patients for insulin sensitivity, the first drug may still be in the system; this may not happen when losartan is given first. The crossover design is not appropriate for regimens that have significant sequence effects.

The primary objective of a crossover trial is to test the significance of the difference in the effects of the regimens. The method described in Chapter 13 is for qualitative (binary) data. This section presents the method for quantitative data. The main difficulty in analyzing data from crossover designs arises from a duplicity of factors. If there are three patients, two periods, and two treatments, a crossover yields only six observations instead of the 3 × 2 × 2 = 12 of a conventional design. Any two factors determine the third: if patient 1 gets the AB sequence, the observation for this patient in period 2 must be under treatment B (trB). All this is explained with the help of an example in the following paragraphs.

Consider a trial on n = 16 chronic obstructive pulmonary disease (COPD) patients who were randomly divided into two equal groups of size 8. The first group received treatment A (trA) and then trB, and the second group received trB and then trA. An adequate washout period was provided before switching the treatment so that there was no carryover effect. The response variable is forced expiratory volume in 1 s (FEV1). The data obtained are in Table 15.1.

TABLE 15.1
FEV1 (L/s) in COPD Patients in a Crossover Trial

Group 1: AB sequence
Subject no.        1     2     3     4     5     6     7     8
Period 1 (trA)   1.28  1.26  1.60  1.45  1.32  1.20  1.18  1.31
Period 2 (trB)   1.25  1.27  1.47  1.38  1.31  1.18  1.20  1.27

Group 2: BA sequence
Subject no.        9    10    11    12    13    14    15    16
Period 1 (trB)   1.27  1.49  1.05  1.38  1.43  1.31  1.25  1.20
Period 2 (trA)   1.30  1.57  1.17  1.36  1.49  1.38  1.45  1.20
15.1.3.1 Test for Group Effect

In crossover trials, the groups identify the sequence, and the group effect is the same as the sequence effect. If the sequence does not affect the values, the mean difference between trA and trB should be the same in the AB group as in the BA group. A significant effect means that trA has a different effect when given in period 1 than when given in period 2. Thus, this is also called the treatment–period interaction. Note that for a crossover, group effect = sequence effect = treatment*period interaction. In this example, the trA − trB differences are as follows:

Group 1 (AB): +0.03  −0.01  +0.13  +0.07  +0.01  +0.02  −0.02  +0.04
Group 2 (BA): +0.03  +0.08  +0.12  −0.02  +0.06  +0.07  +0.20   0.00
Since the patients in group 1 are different from the patients in group 2, equality of means in these groups can be tested by the two-sample t-test. For these differences,

x̄1 = 0.03375, x̄2 = 0.06750, s1² = 0.0023125, s2² = 0.0048786, sp² = 0.0035955.
Thus,

t14 = (0.03375 − 0.06750) / √[0.0035955 × (1/8 + 1/8)] = −1.126.
This is not statistically significant (P > 0.05). Thus, the evidence is not enough for the presence of a sequence effect, although this can be asserted only when the sample is large and the power is high. If a sequence effect is present, the reasons should be ascertained and the trial done again after eliminating those causes. In most practical situations where a crossover trial is used, the sequence of administering the drugs does not make much difference. The real possibility is that of a carryover effect, which can also make a dent in the sequence effect.

15.1.3.2 Test for Carryover Effect

If a positive carryover effect were present for any regimen, the performance of the other regimen in period 2 would be better than its performance in period 1. Thus, the presence of a carryover effect can be assessed by comparing the performance of each regimen in the two periods. In the preceding example, this is done by comparing trA values in period 1 with trA values in period 2, and similarly for trB. It is possible that only one of the regimens has a long-term effect, so that carryover is present for that regimen and not for the other. Two two-sample t-tests would decide whether one or both have a carryover effect. In this example, these are as follows:

trA: period 1 mean = 1.325 (SD = 0.1386); period 2 mean = 1.365 (SD = 0.1387); pooled variance = 0.019224;
t14 = (1.365 − 1.325) / √[0.019224 × (1/8 + 1/8)] = 0.577.

trB: period 1 mean = 1.298 (SD = 0.1391); period 2 mean = 1.291 (SD = 0.0952); pooled variance = 0.014206;
t14 = (1.291 − 1.298) / √[0.014206 × (1/8 + 1/8)] = −0.117.
The P-values associated with these values of t show that the carryover effect is not statistically significant for either treatment in this example.

15.1.3.3 Test for Treatment Effect

The two tests just mentioned are preliminaries. The primary purpose of the trial, of course, is to find whether one treatment is better than the other. This can be done only when the sequence (or group) effect is not significant.

1. If there is no carryover effect, the procedure is as follows. Consider the two groups together as one, because the sequence is not important and there is no carryover effect. Calculate the differences trA − trB and use the paired t-test given in Equation 15.3 on the joint sample. In this example, these differences are

+0.03  −0.01  +0.13  +0.07  +0.01  +0.02  −0.02  +0.04
+0.03  +0.08  +0.12  −0.02  +0.06  +0.07  +0.20   0.00
These give mean difference = 0.0506 and sd = 0.06049. Thus,

t15 = 0.0506 / (0.06049/√16) = 3.346.
From Appendix Table B.2, for 15 df, this gives P < 0.01. Thus, statistically, the treatment difference on average is highly significant.

2. If a carryover effect is present, crossover is not a good strategy. You may then increase the washout period and ensure that no carryover effect is present. If the data from a crossover trial are already available and a carryover effect is found to be significant, analyze as usual by Student t after ignoring the second period. Half of the data (and effort) then become redundant, and the advantages of the crossover design are lost.

Remember the following for crossover designs (a computational sketch of the three tests follows this list):
1. It is easy to say that a washout period will eliminate a carryover effect; in fact, a carryover effect can rarely be dismissed on a priori grounds. A psychological effect may persist even when the subjects are blinded. Thus, a crossover design should be used only after there is fair assurance that a carryover effect is practically absent.

2. The test for a carryover effect is based on variation between subjects and thus has less power. A small but real effect may not be detected unless a big trial with a large number of subjects is done. Then the advantage of the economy of subjects in a crossover design may be lost.

3. Details are given later, but carrying out so many statistical tests on the same set of data increases the total chance of a Type I error. To keep this under control at, say, less than 5%, you should carry out the three t-tests at the 2% level each.

4. The preceding is an easy method based on the Student t-test, but it is not very elegant. Quantitative data from crossover trials can be analyzed more meticulously by using the ANOVA method discussed later (although it is not discussed there for crossover trials). For an ANOVA-based analysis of crossover trials, see Everitt [2]. Neither methodology is fully standardized.
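The three t-tests of this section can be reproduced from the data of Table 15.1 with a few lines of Python; scipy.stats.ttest_ind uses the pooled variance by default, matching the hand calculations above.

```python
import numpy as np
from scipy import stats

# FEV1 (L/s) from Table 15.1
g1_trA = np.array([1.28, 1.26, 1.60, 1.45, 1.32, 1.20, 1.18, 1.31])  # AB, period 1
g1_trB = np.array([1.25, 1.27, 1.47, 1.38, 1.31, 1.18, 1.20, 1.27])  # AB, period 2
g2_trB = np.array([1.27, 1.49, 1.05, 1.38, 1.43, 1.31, 1.25, 1.20])  # BA, period 1
g2_trA = np.array([1.30, 1.57, 1.17, 1.36, 1.49, 1.38, 1.45, 1.20])  # BA, period 2

# 1. Sequence (group) effect: two-sample t on the trA - trB differences
d1, d2 = g1_trA - g1_trB, g2_trA - g2_trB
print(stats.ttest_ind(d1, d2))            # t14 = -1.13, not significant

# 2. Carryover effect: each treatment compared across the two periods
print(stats.ttest_ind(g2_trA, g1_trA))    # trA, period 2 vs period 1: t14 = 0.58
print(stats.ttest_ind(g1_trB, g2_trB))    # trB, period 2 vs period 1: t14 = -0.12

# 3. Treatment effect: paired t = one-sample t on all differences vs 0
print(stats.ttest_1samp(np.concatenate([d1, d2]), 0.0))  # t15 = 3.35, P < 0.01
```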
15.1.4 Analysis of Data of Up-and-Down Trials

The details of the up-and-down method of clinical trials are given in Chapter 6. This method can be used when the response is of the yes-or-no type. For this kind of trial, fix a dose step that you think would be an appropriate escalation or reduction, and also guess a median dose from your clinical acumen or past experience. Give this anticipated median dose to an eligible patient and assess whether it is effective. If effective, step the dose down by the prefixed step for the second patient; if not effective, step the dose up, and so on. This assumes that the regimen is such that a higher dose is likely to give a better response. Generally, an up-and-down trial on 10–15 patients is considered adequate. An essential prerequisite for this kind of trial is that such variation in dose should not be unethical.

The analysis of data from an up-and-down trial consists of estimating the least effective dose and obtaining a confidence interval (CI). Generally, the least effective dose is defined to be the same as the median effective dose. One procedure for obtaining the median effective dose is as follows (a sketch of these steps appears below):

Step 1. Calculate the logarithm of all the doses you tried, from minimum to maximum. Find the mean of the differences of successive log doses. Denote this mean difference by d.

Step 2. Separately list the doses that were effective in different patients, excluding the ineffective ones. Calculate the mean of the logarithms of these effective doses and denote it by ȳ.

Step 3. Median effective dose = exp(ȳ − d/2).

Calculation of the CI requires a constant G, which is obtained from a figure given by Dixon and Mood [3]. See this reference for the method of obtaining the CI.
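The three steps above are easy to code. A minimal sketch follows, using a hypothetical dose sequence and responses (the constant G and the CI are not computed here; see Dixon and Mood [3]).

```python
import numpy as np

# Hypothetical up-and-down sequence: dose given (mg) and whether effective.
# An effective dose is followed by a step down; an ineffective one by a step up.
doses     = np.array([12.0, 11.0, 12.0, 11.0, 12.0, 11.0, 10.0, 11.0, 10.0, 11.0])
effective = np.array([True, False, True, False, True, True, False, True, False, True])

# Step 1: mean difference d of successive log doses over the doses tried
d = np.mean(np.diff(np.log(np.unique(doses))))

# Step 2: mean y of the log doses that were effective
y = np.mean(np.log(doses[effective]))

# Step 3: median effective dose
median_dose = np.exp(y - d / 2)
print(median_dose)
```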
Example 15.4: Up-and-Down Method for Estimating the Minimum Effective Dose of a Local Anesthetic Agent

There is considerable uncertainty regarding the minimum dose requirement of local anesthetics administered intraspinally for surgery. Sell et al. [4] conducted a study to estimate the minimum effective dose of levobupivacaine and ropivacaine in hip replacement surgery. Forty-one patients were randomly allocated to one of the two local anesthetic groups in a double-blind manner. The authors used the up-and-down strategy, basing the initial dose on previous experience. The minimum effective dose was defined as the median effective dose. The up-and-down method found the minimum effective dose of levobupivacaine to be 11.7 mg (95% CI 11.1–12.4) and that of ropivacaine to be 12.8 mg (95% CI 12.2–13.4). The authors concluded that these doses are smaller than those reported earlier for single-shot anesthesia.
As mentioned, we have described one of the several methods used to analyze the data from an up-and-down trial. Statistical properties of the estimate become intractable because (a) the assigned doses are not independent—each dose is dependent on the response of the preceding dose, and (b) usually this kind of experiment is done on a small sample—thus large-sample Gaussian approximation cannot be used [5]. The method we have described is nonparametric and can be used for small samples. The article by Pace and Stylianou [5] is a good reference to understand the intricacies of the up-and-down method.
15.2 Comparison of Means in 3 or More Groups (Gaussian Conditions): ANOVA F-Test

Statistical methods for evaluating the significance of differences in means across three or more groups remain conceptually simple but become mathematically complex. As always in this text, we avoid complex mathematical expressions and concentrate on explanations that may help in understanding the basic concepts, in being more judicious in choosing an appropriate method for a particular set of data, in realizing the limitations of the methods, and in interpreting the results properly.

The Student t-test of the previous section is valid for almost any underlying distribution if n is large but requires a Gaussian pattern if n is small. A similar condition applies to the methods for comparing means in three or more groups. The test criterion now used is called F. As in the case of Student t, this test also requires that the variance in the different groups be nearly the same; this property is called homoscedasticity. The third, and more important, prerequisite for the validity of F is the independence of observations. Serial measurements, taken over a period of time on the same unit, generally lack independence because a measurement depends on its value on previous occasions. A different set of methods, called hierarchical or repeated measures and akin to a paired t, is generally applied for analyzing such data. Random sampling is required, as always, for the validity of conclusions from the F-test.

The generic method used for comparing means in three or more groups is called ANOVA. The name comes from the fact that the total variance in all the groups combined is partitioned into components, such as within-group variance and between-group variance. Between-group variance is the systematic variation occurring due to group differentials. For example, ANOVA may reveal that 60% of the variation in P3 amplitude in healthy adults is due to genetic differentials, 10% due to age differentials, and the remaining 30% due to other factors. The residual left after the extraction of the effects of all known factors is considered a random component arising from intrinsic biological variability between individuals. This is the within-group variance. If genuine group differentials are present, the between-group variance should be large relative to the within-group variance. Thus, the ratio of these two components of variance can be used as a criterion to find whether the group means are different. Details of common setups are given next.

15.2.1 One-Way ANOVA

Consider a study in which the plasma amino acid (PAA) ratio for lysine is calculated in healthy children and in children with undernourishment of grades I, II, and III. This ratio is the difference in PAA concentration in blood before and after a meal, expressed as a percentage of the amino acid requirement. There are four groups in this study, including the healthy children. The setup is called one-way because no further classification of subjects, say by age or gender, is sought in this case. The groups define the factor: in this case, the grade of undernourishment. The response is a quantitative variable. It is a ratio in this example, but it can still be considered to follow a Gaussian pattern within each group. When other factors are properly controlled, the differences in PAA ratio among subjects would be due to either the degree of undernourishment or
the intrinsic interindividual variation among the subjects in the different groups. The former is the between-group variation, and the latter the within-group variation. These variations are illustrated in Figure 15.1 for a variable observed in four different groups. Group means are denoted by ȳ•j (j = 1, 2, 3, 4) and are represented by circles in the figure. Within-group variation is the difference between the individual values and their respective group mean, summed over the groups. The overall mean is denoted by ȳ•• and is represented by a horizontal line in the figure. Between-group variation is based on the differences between the group means and the overall mean.
FIGURE 15.1 Graphic display of within-group and between-group variances.
Note that part of the within-group variation in the preceding example can be due to other factors, such as the heredity, age, gender, height, and weight of the children. But these are assumed to be under control and are disregarded in this setup. All within-group variation is considered intrinsic and random. In an ANOVA setup, this is generally called residual or error variance; it is also called the mean square error (MSE). The term error does not connote any mistake but stands only for the random component, as with sampling error. If group differences were not really present, the between-group variance (call it the mean square between groups [MSB]) and the within-group variance would both arise from intrinsic variation alone and would be nearly equal. The ratio of these two, with the between-group variance in the numerator, is the criterion F (for this reason, it is also called the variance ratio). A value of F substantially more than 1 implies that the between-group variation is large relative to the within-group variation and indicates that the groups are indeed different with respect to the mean of the variable under study.

15.2.1.1 Procedure to Test H0

It was assumed earlier that all the groups have the same variance. This is a prerequisite for the validity of the usual ANOVA procedure. It is also stipulated that the pattern of the distribution of the response variable is the same in all groups. This can be of almost any form if the number of subjects in each group is large. One possibility is shown in Figure 15.2. The distribution in all four groups in this figure is identical except for a shift in location. Groups B and C are close to one another, but groups C and D are far apart. Only the means differ; the other features of the distribution are exactly the same. If n is small, the ANOVA procedure is valid only when this distribution is Gaussian. Thus, skewed distributions of the type shown in Figure 15.2 are admissible for ANOVA only when n is large.
FIGURE 15.2 Distribution in four groups differing only in mean.
The null hypothesis in the case of one-way ANOVA is

H0: μ1 = μ2 = ⋯ = μJ.   (15.7)
This says that there are J groups and the means in all the groups are the same. This hypothesis, in conjunction with the conditions mentioned in the preceding paragraph, implies that the response variable has the same distribution in the different groups. The alternative hypothesis is that at least one mean is different. When the H0 stated in Equation 15.7 is true, MSB is also an estimate of the population variance σ². MSE is an estimate of σ² whether or not H0 is true. The ratio

F = MSB/MSE   (15.8)
is expected to be 1 under H0. When F ≤ 1, it is a sure indication that the group means can be equal. If the group means are different, MSB would be large and F ≫ 1 (substantially more than 1). Just like χ² and t, the distribution of F under H0 is known. In place of a single df, the exact shape of F depends on a pair of df, namely, (J − 1) and J(n − 1) in the case of one-way ANOVA. The first corresponds to the numerator of the criterion given in Equation 15.8, and the second to the denominator. In general, these df are denoted by (ν1, ν2), respectively. The probability P of wrongly rejecting H0 can be obtained corresponding to the value of F calculated from the data. If P < 0.05, the evidence can be considered sufficient to reject H0. Standard statistical packages provide the exact P-value so that a decision can be made immediately. Cutoff values of F for α = 0.05 and different df are given in Appendix Table B.4. These can be used if you do not have access to a statistical package. The following comments provide further information on one-way ANOVA:
1. The primary purpose of ANOVA is to test the null hypothesis of equality of means. However, as explained shortly, an estimate of the effect size of the groups can also be obtained as a by-product of this procedure.

2. ANOVA was basically developed for evaluating data arising from experiments. As explained in Chapters 5 and 6, this requires random allocation of the subjects to the different groups and then exposing these groups to the various interventions. In the PAA example, the groups are preexisting and there is no intervention. The design is prospective because the undernourishment groups define the antecedent and the PAA ratio is investigated as the outcome. The subjects are not randomly allocated to the different undernourishment groups, so it is important to ensure that the subjects in each group adequately represent their group and that no factor other than the undernourishment grade separates the groups. Such precautions in the selection of subjects are necessary for a valid conclusion from ANOVA.

3. Although the ANOVA method has been described here for comparing means in three or more groups, the method is equally valid for J = 2 groups. In fact, there is a mathematical relationship that says that the two-tailed t² for two independent samples of size n is the same as F with (1, 2n − 2) df. Note that the df of t for two independent samples, each of size n, is also 2n − 2. Both methods lead to the same conclusion, but F is a two-tailed procedure, whereas t can also be used for a one-sided alternative. This flexibility is not readily available with the F-test.

Example 15.5: One-Way ANOVA for the Effect of Various Drugs on REM Sleep Time in Rats

Tokunaga et al. [6] conducted a study of the effects of some h(1)-antagonists on the sleep–wake cycle in sleep-disturbed rats. Among the response variables was rapid eye movement (REM) sleep time. Take a similar example of a sample of 20 rats, homogeneous for genetic stock, age, gender, and so forth, randomly divided into J = 4 groups of n = 5 rats each. Let one group be the control and the others receive drug A (diphenhydramine), drug B (chlorpheniramine), and drug C (cyproheptadine). The REM sleep time was recorded for each rat from 10:00 to 16:00 h. Suppose the data shown in Table 15.2 are obtained.

TABLE 15.2
REM Sleep Time in Sleep-Disturbed Rats with Different Drugs

Drug           REM Sleep Time (min)            Mean (min)
0 (control)    88.6  73.2  91.4  68.0  75.2    79.28
A              63.0  53.9  69.2  50.1  71.5    61.54
B              44.9  59.5  40.2  56.3  38.7    47.92
C              31.0  39.6  45.3  25.2  22.7    32.76
These data show a large difference in the mean sleep time across groups. Another experiment on a new sample of rats may or may not give similar results. The likelihood of getting nearly equal means in the long run is extremely remote, as will be clear shortly. To avoid the burden of calculations, the following values obtained from a statistical package for these data are given:

SST = 7369.8, SSB = 5882.4, and SSE = 1487.4,   (15.9)

where SST is the total sum of squares, SSB is the between-group sum of squares, and SSE is the error (within-group) sum of squares. Mathematical expressions for these are avoided because of their complexity. However, note that SST = SSB + SSE in a one-way setup. The df for SSB are 4 − 1 = 3, and for SSE, 4(5 − 1) = 16. Mean squares are obtained by dividing these sums of squares by the corresponding df. These values, as well as F, are conventionally shown in the form of a table, popularly called an ANOVA table; this is given in Table 15.3 for this example. For these data,

F = 1960.8/93.0 = 21.08.

The null hypothesis is that there is no effect of the drug on REM sleep time. This is the same as saying that the means in all four groups are equal.

TABLE 15.3
ANOVA Table for the Data in Table 15.2

Source of Variation    df    Sum of Squares    Mean Squares    F
Drug                    3           5882.4          1960.8     21.08
Error                  16           1487.4            93.0
Total                  19           7369.8
The df of this F are (3, 16). A statistical package gives P < 0.001 under H0 for F = 21.08 or higher. That is, there is less than a 1 in 1000 chance that equal means in the groups would give a value of F this large. Therefore, these data must have come from populations with unequal means. In other words, another experiment on the same kind of animals is extremely unlikely to give equal means. The evidence against the null is overwhelming, and it is rejected. The conclusion is that the drugs affect REM sleep time differentially. A computational sketch of this ANOVA follows the side note below.

SIDE NOTE: In this example, mean REM sleep time under the different drugs not only differs but also follows a trend. Had the groups represented increasing dose levels, this would have implied a decline in sleep time as the dose level increased. The conventional ANOVA just illustrated allows the conclusion of different means in different groups, but not of any trend. Evaluation of trends in means is done by regression, as discussed in Chapter 16.
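The F-test of Example 15.5 can be reproduced in one call; scipy.stats.f_oneway carries out a one-way ANOVA on the raw group values.

```python
from scipy import stats

# REM sleep time (min) from Table 15.2
control = [88.6, 73.2, 91.4, 68.0, 75.2]
drug_a  = [63.0, 53.9, 69.2, 50.1, 71.5]
drug_b  = [44.9, 59.5, 40.2, 56.3, 38.7]
drug_c  = [31.0, 39.6, 45.3, 25.2, 22.7]

f_stat, p_value = stats.f_oneway(control, drug_a, drug_b, drug_c)
print(f_stat, p_value)   # F = 21.08 with (3, 16) df, P < 0.001
```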
In Example 15.5, there are only five rats in each group. Statistically, this is an extremely small sample. The reason that such a small sample can still provide a reliable result is that laboratory conditions can be standardized and most factors contributing to uncertainty can be controlled. The rats can be chosen to be homogeneous, as in this experiment, so that intrinsic factors, such as genetic makeup, age, and gender, do not influence the outcome. Random allocation tends to average out the effects of any other factors not considered in choosing the animals. The influence of body weight, if any, is taken care of by adjusting the dose for body weight. Thus, interindividual variation within groups is minimal. On the other hand, the variation between groups is very large in this case, as is evident from the large differences between the means. This provided clinching evidence in favor of the alternative hypothesis.

Among the cautions in using ANOVA are the following: (a) ANOVA is based on means, and any means-based procedure is severely disturbed when outliers are present. Ensure before using ANOVA that there are no outliers in your data; if there are, examine whether they can be excluded without affecting the validity of the conclusion. (b) A problem in the comparison of three or more groups by the criterion F is that its significance indicates only that a difference exists; it does not tell exactly which group or groups are different. Further analysis, called multiple comparisons, is required to
identify the groups that have different means. This is discussed later in this chapter. (c) The small numbers of subjects in the different groups in Example 15.5 should put you on alert regarding the pattern of the distribution of the measurements and of their means: it should be Gaussian. The validity of the other assumptions that underlie an ANOVA F-test should also be checked, according to the guidelines given next. (d) When no significant difference is found across groups, there is a tendency among researchers to try to locate a group, or even a subgroup, that exhibits benefit. This post hoc analysis is all right as long as it is exploratory in nature. For a firm conclusion, a new study should be conducted on that group or subgroup.

15.2.1.2 Checking the Validity of the Assumptions of ANOVA

When the assumptions are violated, the P-values are suspect. The important assumptions for the validity of ANOVA are (a) Gaussian pattern, (b) homoscedasticity, and (c) independence. These are actually checked for the residuals and not for the original observations. Residuals are the remainders left after the factor effects are subtracted from the observed values; they are explained in detail in Chapter 16 in the context of regression. (A sketch of these checks in software appears at the end of this subsection.)

Of the three assumptions, the Gaussian pattern is not a strong requirement. The ANOVA F-test is quite robust to minor departures from a Gaussian pattern. A gross violation, if present, can be detected by the methods given in Section 9.2.1 of Chapter 9. A crude but easy method is to confirm that neither (maximum − mean)/SD nor (mean − minimum)/SD is less than 2 or more than 4. The normal practice is to continue to use F unless there is extraneous evidence that the observations follow a non-Gaussian pattern. When the pattern is really far from Gaussian and the sample size is small, it is advisable to use the nonparametric methods of Section 15.3. However, for the F-test, always check for outliers: these tend to throw the entire results out of gear. You should routinely check the data and decide whether to keep an outlier. Genuine outliers cannot be disregarded.

Three methods are available to check whether any value is an outlier in the sense of unusually affecting the results of ANOVA. The first is called the leverage, which is the proportion of the SST contributed by the subject under suspicion. For this, calculate SST with and without the suspected value and find how much it is contributing. If each observation contributed equally, this would be SST/n, and a substantial difference indicates an unusual contribution. The second is called the Studentized residual, obtained as r/√MSE, where r is the concerned residual. If this exceeds, say, 5, you can conclude with confidence that the value is an outlier. The third is the

Cook distance: Di* = Σ(i=1 to n) (ŷi − ŷi(i*))² / (K × MSE),

where ŷi is the predicted value of yi, ŷi(i*) is the predicted value when the i*th observation is excluded, and K is the number of parameters in the model. This distance, in a way, combines the first two and is commonly used to assess whether the i*th observation is an outlier, since it measures the scaled change in the predicted values. In an example in Appendix C, we show how this is used. Most statistical software packages have a provision to calculate the Cook distance.

If the distributions are non-Gaussian but similar in shape, such as positively skewed for all groups, ANOVA may still be a valid method for large n.

Homoscedasticity is the equality of variances in the different groups. This requirement should be fairly well met. Check it graphically by box plots (Figure 8.9b in Chapter 8) for the different groups: varying heights of the boxes indicate different variances. Statistically, the variation must be substantial for a violation of homoscedasticity; generally, the largest variance should be no more than four times the smallest. The conventional statistical test for checking homoscedasticity is the Bartlett test. This is heavily dependent on a Gaussian pattern of the distribution of y. For this reason, many software packages now use the Levene test, as mentioned for the two-sample t-test. This is based on the median and thus is more robust to departures from Gaussianity.

Transformation of the data, such as logarithm (ln y), square (y²), or square root (√y), is tried in cases where violation occurs. Experience suggests that a transformation can usually be found that converts a grossly non-Gaussian distribution to an approximately Gaussian pattern and, at the same time, stabilizes the variance across groups. If the distribution pattern is already Gaussian and the test reveals that the variances in the different groups are significantly different, then the ANOVA F-test should not be used. In fact, as stated earlier, there may not be much reason for testing the equality of means when the variances are found to be different. It is rare that the variance of a variable differs from group to group while the mean is the same. The nature of the variable and the measuring techniques should ensure
roughly comparable variances. A difference in variances is itself evidence that the populations are different. However, in the rare case in which interest persists in the equality of means despite different variances, try a transformation of the data, as suggested before; this should not disturb the Gaussian pattern too much. Also, remember that equal n tends to partially neutralize the adverse effect of unequal variances. Except when the data are highly skewed or when most group sizes are less than 10, the Welch test performs well for testing the equality of means under unequal variances and unequal n. Also called the Welch–Satterthwaite test, this is an extension of the Welch test for two groups described earlier. Group sizes between 6 and 10 are your call. When group sizes are really small, such as n < 6 for most groups, with unequal ns and unequal variances, use the Brown–Forsythe test. This uses the median in its calculation instead of the mean and thus is less sensitive to departures from Gaussianity. The mathematical procedures for both are too complex for this book; the names are given so that you can make an appropriate choice while running a statistical package.

The assumption of independence of residuals is the most serious requirement for the validity of the ANOVA F-test. It is violated particularly in cases in which serial observations are taken and the value of an observation depends on what it was at the preceding time. This is called auto- or serial correlation. It happens in almost all repeated measures where, for example, a patient is measured repeatedly after a surgery, but such serial measurements are a class apart, as discussed later. The problem also arises when, for example, an infant mortality rate (IMR) series from 1980 to 2007 is studied for different socioeconomic groups, where successive residuals, after the time factor is properly accounted for, may be correlated because of other factors that also change but are not accounted for. The independence of residuals can be checked by the Durbin–Watson test [7]. Most statistical packages routinely perform this test and give a P-value. If P < 0.05, reanalyze after controlling the factors that may be causing the serial correlation; a strategy such as working with differences of successive values might be adopted in some cases. If P ≥ 0.05 for the Durbin–Watson test, the F-test can be used as usual to test differences among the means in the different groups.

Two more remarks on independence are in order. First, not just serial measurements but also measurements taken on clusters lose independence. If different family members of lower socioeconomic groups are asked about the duration of sickness, they are likely to report similar values, in contrast with the values reported by family members of an upper socioeconomic group. If electrical activity in different parts of the brain is measured in children, the values are likely to be correlated. Such instances violate the requirement of independence. If the left eye and the right eye form a group, for, say, intraocular pressure (IOP), they will not be independent, as the IOPs in the left and the right eye are related. Opposed to this, if the groups are males and females, the subjects are different and the values independent.
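The checks discussed in this subsection are all available in standard packages. A sketch follows for the REM sleep data of Table 15.2, using SciPy and statsmodels; the formula interface and the variable names are illustrative.

```python
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.stattools import durbin_watson

# REM sleep time (min) of Table 15.2, one row per rat
df = pd.DataFrame({
    "rem":  [88.6, 73.2, 91.4, 68.0, 75.2, 63.0, 53.9, 69.2, 50.1, 71.5,
             44.9, 59.5, 40.2, 56.3, 38.7, 31.0, 39.6, 45.3, 25.2, 22.7],
    "drug": ["0"] * 5 + ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
})
groups = [g["rem"].values for _, g in df.groupby("drug")]

# Homoscedasticity: Levene (median-based, robust) and Bartlett tests
print(stats.levene(*groups))
print(stats.bartlett(*groups))

fit = ols("rem ~ C(drug)", data=df).fit()

# Gaussian pattern of the residuals (Shapiro-Wilk)
print(stats.shapiro(fit.resid))

# Independence of the residuals (Durbin-Watson statistic)
print(durbin_watson(fit.resid))

# Cook distance of each observation, for outlier influence
print(fit.get_influence().cooks_distance[0])
```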
15.2.2 Two-Way ANOVA

The fundamentals of ANOVA may be clear from the details given in the previous section. Now proceed to a slightly more complex situation. Consider a clinical trial in which three doses (including a placebo) of a drug are given to groups of male and female subjects to assess the rise in hematocrit (Hct) level. It is suspected that the effective dose may be different for males than for females. This differential response is called interaction; in this example, the interaction is between drug dose and gender. In some cases, evaluation of such interaction between factors can be important for drawing valid conclusions. The details are given a little later in this section. Consider the design aspect first.

15.2.2.1 Two-Factor Design

There are two factors in the trial just mentioned: the dose of the drug and the gender of the subject. The response of interest is a quantitative variable, namely, the percentage rise in Hct level. The objective of the trial is to find the effect of dose, of gender, and of the interaction between the two on the response. Such a setup with two factors is called a two-way ANOVA situation. Note that there are three dose groups of male subjects in this trial and another three dose groups of female subjects. The researcher may wish to have n = 10 subjects in each of these six groups, making a total of 60 subjects. To minimize the role of other factors causing variation, these 60 subjects should be as homogeneous as possible with respect to all the characteristics that might influence the response; for example, they may be of the same age group and of normal build (say, BMI between 20 and 25). They should, of course, not be suffering from any ailment that may alter the response. Once 30 male and 30 female eligible subjects meeting the inclusion and exclusion criteria are identified, they need to be randomly allocated, 10 each, to the three dose levels. Such allocation increases the confidence in asserting that any difference that emerges is mostly, if not exclusively, due to the factors under study, namely, the dose of the drug and the gender of the subjects. A post hoc analysis can be done to check that other influencing factors, such as age and BMI, are indeed, at least on average, almost equally distributed across the six groups. Informed consent and other requirements of ethics must, in any case, be met. Blinding may be required to eliminate possible bias of the subjects, of the observers, and even of the data analyst.
15.2.2.2 Hypotheses and Their Test in Two-Way ANOVA

The null hypotheses that can be tested in a two-way ANOVA situation are as follows:

1. The levels of factor 1 have no effect on the mean response; that is, each level of factor 1 has the same response on average. This translates into

H0a: μ1i = μ2i = ⋯ = μJi,   (15.10)

where J is the number of levels of factor 1.

2. The levels of factor 2 have no effect on the mean response. That is,

H0b: μi1 = μi2 = ⋯ = μiK,   (15.11)

where K is the number of levels of factor 2.

3. There is no interaction between factor 1 and factor 2. This is explained with the help of an example in a short while.

Again, we do not want to burden you with the mathematical expressions. As with Equation 15.9, the total sum of squares in a two-way ANOVA is broken into the sums of squares due to factor 1, factor 2, the interaction, and the residual (also called error). These are divided by the respective df to get the mean squares. As in the case of one-way ANOVA, each of these mean squares is an independent estimate of the same variance σ² of the response y when the corresponding H0 is true. The mean square due to error is an estimate of σ² even when H0 is false. The other mean squares are compared with the MSE, and the criterion F, as in Equation 15.8, is calculated separately for each of the two factors and for their interaction. The P-value is obtained as usual, corresponding to the calculated value of F. The pairs of df for the different Fs in this case are

For factor 1: (J − 1), JK(n − 1);
For factor 2: (K − 1), JK(n − 1);
For the interaction between factors 1 and 2: (J − 1)(K − 1), JK(n − 1);

where n (≥2) is the number of subjects with the jth level of factor 1 and the kth level of factor 2 (j = 1, 2, …, J; k = 1, 2, …, K). This assumes the same number of subjects for each combination. The df for the interaction decrease if the design is not fully factorial. A separate decision is made for factor 1, factor 2, and the interaction regarding their statistical significance. When the interaction is not significant, the factors are called additive. If the interaction is significant, the inference for any factor cannot be drawn in isolation; it has to be in conjunction with the level of the other factor. (A sketch of a two-way ANOVA in software is given below.)

In the example of a drug for improving the Hct level, n could be 10 or any other number in each group, as in a one-way setup. However, a two-factor trial can also be done with only n = 1 subject in each group; in this case, however, the interaction cannot be easily evaluated. For simplicity, the preceding example had the same number of subjects in each group. If the situation so demands, you can plan a trial or an experiment with unequal n in the different groups. This is called an unbalanced design and is illustrated in Table 15.4 for the birth weight of children born to women with different amounts and durations of smoking. See Example 15.6 for more details. The analysis of an unbalanced design is slightly more complex, although the concepts remain the same. For example, the error df would be Σ(njk − 1), where njk (≥2) is the number of subjects for the jth level of factor 1 and the kth level of factor 2.

TABLE 15.4
Average Birth Weight (kg) of Children Born to Women with Different Amounts and Durations of Smoking
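A two-way ANOVA with interaction, of the kind described for the Hct trial, can be run as follows. The sketch uses simulated data (three doses by two genders, n = 4 per cell, with an assumed dose effect and an assumed dose-by-gender interaction); anova_lm prints the F and P-value for each factor and for the interaction.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Simulated percentage rise in Hct: 3 doses x 2 genders, n = 4 per cell
rng = np.random.default_rng(seed=1)
dose = np.repeat(["placebo", "low", "high"], 8)
sex = np.tile(np.repeat(["M", "F"], 4), 3)
dose_effect = {"placebo": 0.0, "low": 1.5, "high": 3.0}
rise = [dose_effect[d] + (0.8 if (d == "high" and s == "F") else 0.0)
        + rng.normal(0, 1.0) for d, s in zip(dose, sex)]
df = pd.DataFrame({"rise": rise, "dose": dose, "sex": sex})

# Two-way ANOVA with interaction; C() marks categorical factors
fit = ols("rise ~ C(dose) * C(sex)", data=df).fit()
print(anova_lm(fit))
```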
Sometimes the interest is not in the mere existence of a difference between two means but in detecting a difference of at least a specified magnitude μ0, so that the hypotheses are H0: μ1 − μ2 = μ0 and H1: μ1 − μ2 > μ0. The test criterion for independent samples (pooled variance setup) in this case is

t(n1+n2−2) = (x̄1 − x̄2 − μ0) / [sp √(1/n1 + 1/n2)].   (15.23)
This is the same as the criterion given in Equation 15.6a except for the extra μ0 in the numerator. Similarly, the criterion for a paired setup is the same as in Equation 15.3 except for the difference μ0 in the numerator. For a right-sided H1, find the P-value in the right tail: this is the probability that t exceeds the value obtained for the sample. Reject the null hypothesis when the P-value so obtained is less than the predetermined level of significance. The conditions for application of the criterion given in Equation 15.23 are the same as those for the two-sample t-test. Gaussian conditions are required; these include large n if the underlying distribution is non-Gaussian. If n is small and the underlying distribution is non-Gaussian, use nonparametric methods in place of Student t.

Example 15.14: Detecting Minimum Reduction in Homocysteine Level

It is now well known that an increased level of homocysteine is a risk factor for myocardial infarction and stroke. It has been observed in some populations that vitamin B converts homocysteine into glutathione, a beneficial substance, and thus reduces the homocysteine level. To check whether this also happens in nutritionally deficient subjects, a trial was conducted in 40 adults with Hb < 10 g/dL. They were given vitamin B tablets for 3 weeks, and their homocysteine levels before and after were measured. Vitamin B has almost no known side effects, but the cost of administering vitamin B to all nutritionally
deficient adults in a poor country with a great deal of undernourishment can be enormous. The program managers would consider this a good program only if the homocysteine level were reduced by more than 3 μmol on average. In this sample of 40 persons, the average reduction is 3.7 μmol, with SD = 1.1 μmol. Could the mean reduction in the target population still be 3 μmol or less?

This is a paired setup. In this case, the medically important difference is a minimum of 3 μmol. Thus, H0: μ1 − μ2 = 3 μmol and H1: μ1 − μ2 > 3 μmol:

t39 = (3.7 − 3) / (1.1/√40) = 4.02.
From Appendix Table B.2, P(t > 4.02) < 0.01, which is extremely small, and thus the null hypothesis is rejected. The conclusion reached is that the mean reduction in homocysteine level is more than 3 μmol. Evidence from this sample is sufficient to conclude that running this program is likely to provide the minimum targeted benefit.
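With only the summary figures of Example 15.14, the calculation can be scripted directly:

```python
import numpy as np
from scipy import stats

# Summary data of Example 15.14
n, mean_red, sd_red, mu0 = 40, 3.7, 1.1, 3.0

t = (mean_red - mu0) / (sd_red / np.sqrt(n))
p = stats.t.sf(t, df=n - 1)      # right-tailed P-value for H1: difference > 3
print(t, p)                       # t = 4.02, P well below 0.01
```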
15.4.2.2 Equivalence Tests for Means

You are now in a position to learn more about equivalence tests. The primary aim of these tests is to disprove a null hypothesis that two means, or any other summary measures, differ by a clinically important amount. This is the reverse of what has been discussed so far. Equivalence tests are designed to demonstrate that no important difference exists between a new regimen and the current regimen. They can also be used to demonstrate the stability of a regimen over time, the equivalence of two routes of dosage, and equipotency. Equivalence can be demonstrated in the form of either at least as good as the present standard, or neither better nor worse than the present standard. The former, in fact, is noninferiority, and the latter is true equivalence. Since only the outcome is compared, and not the course of the disease, this is clinical or therapeutic equivalence and not bioequivalence.

As explained for proportions in Chapter 13, the concern in equivalence tests is to investigate whether the means in two groups differ by no more than a prespecified, medically insignificant margin Δ. The null hypothesis in this case is that the difference is Δ or more, and the burden of providing evidence against this null is on the sample. If the evidence is not sufficient, the null is conceded and the conclusion is that the two groups are not equivalent on average. The procedure is essentially the same as described earlier for the equivalence of proportions, and is sketched in code below. Construct a 100(1 − 2α)% CI for the mean difference and see whether it is wholly contained in the interval −Δ to +Δ. Irrespective of statistical significance, if such a CI lies entirely within (−Δ, +Δ), the two groups are equivalent; otherwise, they are not. If needed, refresh your memory with Figure 13.1. If the CI overlaps the limits, with the lower or upper limit lying outside the prespecified zone of clinical indifference, the null hypothesis of nonequivalence cannot be rejected. In this case, examine whether increasing the sample size can help to decide one way or the other.
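A minimal sketch of this CI-based equivalence check follows, with hypothetical data and an assumed margin Δ; the function name and the values are illustrative.

```python
import numpy as np
from scipy import stats

def equivalent(x1, x2, delta, alpha=0.05):
    """Declare equivalence if the 100(1 - 2*alpha)% CI for the
    difference of means lies wholly inside (-delta, +delta)."""
    n1, n2 = len(x1), len(x2)
    diff = np.mean(x1) - np.mean(x2)
    sp2 = ((n1 - 1) * np.var(x1, ddof=1) +
           (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(1 - alpha, df=n1 + n2 - 2)
    lo, hi = diff - t_crit * se, diff + t_crit * se
    return (lo, hi), (-delta < lo) and (hi < delta)

# Hypothetical data: new regimen vs standard, indifference margin 0.5
rng = np.random.default_rng(seed=7)
new_reg = rng.normal(10.1, 1.0, 30)
standard = rng.normal(10.0, 1.0, 30)
print(equivalent(new_reg, standard, delta=0.5))
```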
15.4.3 Power and Level of Significance

The concepts of power and level of significance were introduced in Chapter 12, but we would like to go deeper into these concepts and explain them further now that the actual tests have been discussed. Statistical significance is intimately dependent on these two concepts.

15.4.3.1 Further Explanation of Statistical Power

Consider the following data on birth weight in a control group and in a group given an individual educational campaign regarding proper diet:

Control group (n1 = 75): average birth weight x̄1 = 3.38 kg, SD s1 = 0.19 kg
Campaign group (n2 = 75): average birth weight x̄2 = 3.58 kg, SD s2 = 0.17 kg

An individual-based educational campaign requires intensive inputs. Suppose it is considered worth the effort only if it can make a difference of at least 100 g in the average birth weight. What is the power for detecting this kind of difference when the significance level is α = 0.05?
There is no stipulation in this case that the average birth weight in the campaign group could be lower. Thus, the alternative hypothesis is one-sided, H1: μ1 − μ2 < 0. Note that label 1 here is for the control group and label 2 for the campaign group. At significance level α = 0.05, the null hypothesis would be rejected in favor of this H1 when

(x̄1 − x̄2) / SE(x̄1 − x̄2) < −1.645.   (15.24)
The left side of the inequality (15.24) is the usual statistic Z when H0: μ1 − μ2 = 0 is true. The right side is the value of Z from the Gaussian table for left-tailed α = 0.05. The Gaussian form is applicable in this case because n is sufficiently large. When σ is not known, the actual test applicable is the two-sample Student t, but for df = n1 + n2 − 2 = 148 (which is very large), the critical value would still be −1.645 since Appendix Table B.1 can also be used for such a large df. When H0: μ1 − μ2 = 0 is true, the probability of Equation 15.24 is α = 0.05. The actual Z, as is evident from Equation 12.7 (Chapter 12), is
Z = [(x̄1 − x̄2) − (μ1 − μ2)] / SE(x̄1 − x̄2),   (15.25)
and it is only under H0 that it reduces to the left side of the inequality (15.24). Power is the probability that Z in Equation 15.25 is less than −1.645 when (μ1 − μ2) < 0. But this probability can be obtained only if the difference (μ1 − μ2) is exactly specified. In this example, (μ1 − μ2) = −100 g (or −0.1 kg). This specifies the magnitude of the difference between the null and the alternative hypothesis. The power depends on this difference: the larger the difference, the greater the power. In this example,

sp = √[(74 × 0.19² + 74 × 0.17²)/148] = 0.18028.

Thus, SE(x̄1 − x̄2) = 0.18028 × √(1/75 + 1/75) = 0.0294. Now,

Power for (μ1 − μ2 = −0.1) = P[(x̄1 − x̄2)/SE(x̄1 − x̄2) < −1.645 | μ1 − μ2 = −0.1]
    = P[{(x̄1 − x̄2) − (μ1 − μ2)}/SE(x̄1 − x̄2) < −1.645 − (−0.1)/SE(x̄1 − x̄2)]
    = P[Z < −1.645 − (−0.1)/0.0294],

where Z is obtained from Equation 15.25. This gives

Power for (μ1 − μ2 = −0.1) = P(Z < 1.76) = 1 − P(Z ≥ 1.76) = 0.96 from Appendix Table B.1.

This probability is very high. A study with 75 subjects in each group is almost certain to detect a difference of 100 g in average birth weight if it is present. Now, for the purpose of illustration, calculate the power for detecting a difference of 50 g:

Power for (μ1 − μ2 = −0.05) = P[Z < −1.645 − (−0.05)/0.0294] = P(Z < 0.06) = 1 − P(Z ≥ 0.06) = 0.52 from Appendix Table B.1.

This probability is not so high. The chance of detecting an average difference of 50 g in birth weight in this case is nearly one-half. If the campaign really improves birth weight by an average of 50 g, this study with 75 subjects in each group is not very likely to detect it. In other words, such a difference may actually be present but can easily turn out to be statistically nonsignificant, because samples of this size do not have sufficient power to detect such a small difference.
If the significance level is relaxed to α = 0.10, then H0 would be rejected when

(x̄1 − x̄2) / SE(x̄1 − x̄2) < −1.28,

where the right-hand side is the value of Z corresponding to a left-tailed α = 0.10. For this α,

Power for (μ1 − μ2 = −50 g, or −0.05 kg) = P[Z < −1.28 − (−0.05)/0.0294] = P(Z < 0.42) = 1 − P(Z ≥ 0.42) = 0.66 from Appendix Table B.1.

Note how the power increases when the significance level is increased. To see the effect of sample size, increase n1 and n2 to 250 each. With these ns, sp = 0.18028 as before and SE = 0.01613. For α = 0.10,

Power for (μ1 − μ2 = −0.05 kg) = P[Z < −1.28 − (−0.05)/0.01613] = P(Z < 1.82) = 1 − P(Z ≥ 1.82) = 0.97 from Appendix Table B.1.
These results are summarized in Table 15.16.

TABLE 15.16
Illustration of Power Variation in Different Situations

μ1 − μ2 (g)     α      n1     n2    Power
−100           0.05    75     75    0.96
−50            0.05    75     75    0.52
−50            0.10    75     75    0.66
−50            0.10   250    250    0.97
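The entries of Table 15.16 can be reproduced with a short function; the pooled SD 0.18028 of the birth weight illustration is reused, and the one-sided power formula P(Z < z_alpha − diff/SE) derived above is applied.

```python
import numpy as np
from scipy import stats

def one_sided_power(diff, n, alpha, sp=0.18028):
    """Power to detect mean difference `diff` (kg) with n per group,
    pooled SD sp, one-sided test at significance level alpha."""
    se = sp * np.sqrt(2 / n)
    z_alpha = stats.norm.ppf(alpha)      # e.g., -1.645 for alpha = 0.05
    return stats.norm.cdf(z_alpha - diff / se)

for diff, alpha, n in [(-0.10, 0.05, 75), (-0.05, 0.05, 75),
                       (-0.05, 0.10, 75), (-0.05, 0.10, 250)]:
    print(diff, alpha, n, round(one_sided_power(diff, n, alpha), 2))
    # prints 0.96, 0.52, 0.66, 0.97, matching Table 15.16
```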
The following conclusions can now be drawn:

1. Power is directly related to the magnitude of the difference to be detected. Power decreases if the difference to be detected is small. In other words, a difference may actually be present, but it is more difficult to call a small difference statistically significant. This may also be appealing to your intuition.

2. Power increases as the sample size increases. A small difference is more likely to be statistically significant when n is large, a fact that has been emphasized over and over again.

3. Power increases if a higher probability of Type I error can be tolerated. Note that an increase in the level of significance is a negative feature, whereas an increase in power is a positive feature of a statistical test. These should be balanced.
15.4.3.2 Balancing Type I and Type II Error

Since power is the complement of Type II error, it should be clear that increasing one type of error decreases the other type, and vice versa. The two can rarely be simultaneously kept low for a fixed sample size. As explained earlier, Type I error is like misdiagnosis or like punishing an innocent person, and thus more serious than Type II error. Can Type II error be more hazardous than Type I error? Yes, in some rare situations. Suppose H0 is that a drug
is not too toxic, and H1 that it is too toxic. Type II error in this case can be serious. In the context of drug trials, Type I error corresponds to an ineffective drug allowed to be marketed and Type II error corresponds to an effective drug denied entry into the market. In this sense, Type II error may have more serious repercussions if the drug is for a dreaded disease, such as AIDS or cancer. If a drug can increase the chance of 5-year survival for such a disease by, say, 10%, it may be worth putting it on the market provided the side effects are not serious. Not much Type II error can be tolerated in such cases. This error can be decreased (or power increased) by increasing the sample size, but also by tolerating an increased Type I error. But if the side effects were serious, a rethinking would be necessary. Thus, the two errors should be balanced, keeping their consequences in mind.
References

1. Gamst G, Meyers LS, Guarino AJ. Analysis of Variance Designs: A Conceptual and Computational Approach with SPSS and SAS. Cambridge University Press, 2008.
2. Everitt BS. Statistical Method for Medical Investigation, 2nd ed. Edwin Arnold, 1994, pp. 77–91.
3. Dixon WJ, Mood AM. A method for obtaining and analyzing sensitivity data. J Am Stat Assoc 1948; 43:109–126.
4. Sell A, Olkkola KT, Jalonen J, Aantaa R. Minimum effective local anaesthetic dose of isobaric levobupivacaine and ropivacaine administered via a spinal catheter for hip replacement surgery. Br J Anaesth 2005; 94:239–242.
5. Pace NL, Stylianou MP. Advances and limitation of up-and-down methodology. Anesthesiology 2007; 107:144–152.
6. Tokunaga S, Takeda Y, Shinomiya K, Hirase M, Kamei C. Effect of some H(1)-antagonists on the sleep–wake cycle in sleep-disturbed rats. J Pharmacol Sci 2007; 103:201–206.
7. Neter J, Wasserman W, Kutner MH. Applied Linear Regression Models: Regression, Analysis of Variance, and Experimental Designs. Richard D. Irwin, 1983.
8. Wang X, Tager IB, van Vunakis H, Speizer FE, Hanrahan JP. Maternal smoking during pregnancy, urine cotinine concentrations, and birth outcomes: A prospective cohort study. Int J Epidemiol 1997; 26:978–988.
9. Doncaster P, Davey A. Analysis of Variance and Covariance: How to Choose and Construct Models for the Life Sciences. Cambridge University Press, 2007.
10. Kenward MG. Analysis of Repeated Measurements. Oxford University Press, 1998.
11. Klockars AJ, Sax G. Multiple Comparisons (Quantitative Applications in the Social Sciences). Sage Publications, 2005.
12. Conover WJ. Practical Nonparametric Statistics, 3rd ed. John Wiley & Sons, 1999.
13. Hollander M, Wolfe DA, Chicken E. Nonparametric Statistical Methods, 3rd ed. John Wiley & Sons, 2015.
14. Wilcoxon F, Wilcox RA. Some Rapid Approximate Statistical Procedures. Lederle Laboratories, 1964.
15. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc 1952; 47:583–621.
16. Cupples LA, Heeren T, Schatzkin A, Colton T. Multiple testing of hypotheses in comparing two groups. Ann Intern Med 1984; 100:122–129.
17. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 "negative" trials. N Engl J Med 1978; 299:690–694.
18. Dimmick JB, Diener-West M, Lipsett PA. Negative results of randomized clinical trials published in the surgical literature: Equivalency or error? Arch Surg 2001; 136:796–800.
19. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P-values, confidence intervals, and power: A guide to misinterpretations. Eur J Epidemiol 2016; 31:337–350.
Exercises

All the datasets in this book are readily available in Excel on the book's website: http://www.MedicalBiostatistics.com. Use statistical software where necessary.
1. For the data given in Example 15.2, calculate Welch t and test the hypothesis that the means of albumin level are equal before and after treatment when they are in separate groups. Why might Welch t be preferable in this case in place of the Student t-test?
2. Consider the following data on FEV1 values in a crossover trial on COPD patients.

Group 1: AB Sequence

Subject No.                  1     2     3     4     5     6     7     8     9     10
FEV1 (L/s), Period 1 (trA)  1.28  1.26  1.60  1.45  1.32  1.20  1.18  1.31  1.13  1.27
FEV1 (L/s), Period 2 (trB)  1.25  1.27  1.47  1.38  1.31  1.18  1.20  1.27  1.38  1.45

Group 2: BA Sequence

Subject No.                  11    12    13    14    15    16    17    18    19    20
FEV1 (L/s), Period 1 (trB)  1.27  1.49  1.05  1.38  1.43  1.31  1.25  1.20  1.46  1.39
FEV1 (L/s), Period 2 (trA)  1.30  1.57  1.17  1.36  1.49  1.38  1.45  1.20  1.03  1.06
Test the following: (i) Sequence effect is not present. (ii) Carryover effect is not present. (iii) Treatment effect is not present. State your conclusion.

3. Consider the 100 values in Table 3.1 (Chapter 3) as a random sample and test whether the mean triglyceride (TG) level in the three waist–hip ratio (WHR) categories ( 0.15 for exclusion of variables for all three stepwise methods. What difference do you observe in the results obtained by these three methods, and why?

6. Fit a simple linear regression of y on 1/x for the data in Example 16.4 on the GFR (y) and creatinine level (x) and plot the regression line with the scatter of the observed values. Does it improve the model in terms of R² compared with the one obtained in Example 16.4 for linear regression?

7. If the correlation coefficient between x and y is −0.83, and the SD of x is 45.71 and of y is 26.28, what are the following? (i) Regression coefficient of x on y and of y on x. (ii) Covariance between x and y.

8. Consider the data on 15 subjects on the GFR (%) and creatinine (mg/dL) in Example 16.4. Delete the last two subjects so that the data are now for 13 subjects. (i) Find the simple linear regression of the GFR on the creatinine level. (ii) Find the 95% CI for the intercept and slope parameters.

9. For the data on 15 subjects of the GFR (%) and creatinine (mg/dL) in Example 16.4, find the predicted GFR based on simple linear regression along with the 95% prediction interval for a person whose creatinine level is 10.7 mg/dL.

10. For the data in Example 16.2 on the APACHE score and duration of hospital stay (days) in 10 subjects, find and plot the 95% confidence band of the simple linear regression of duration on the APACHE score. Is the following regression

Duration = −1.7 + 0.67 × APACHE
within this band? What is your conclusion regarding the plausibility of this regression for this dataset?

11. Suppose the covariance between forced expiratory volume in 1 s (FEV1) and age is −3.618 and the variance of age is 150.746. (i) Find the simple regression equation of FEV1 on age when the mean FEV1 is 3.41 L/s and the mean age is 36.07 years. (ii) Determine the CI of the intercept and slope parameters when the MSE is 0.620 and n = 50. (iii) Test the statistical significance of the intercept and slope, considering the null hypothesis as zero for both.

12. For a male population, suppose the correlation between age and BMI is 0.35, between age and TC is 0.45, and between BMI and TC is 0.50. Determine the partial correlation between BMI and TC controlling for age. How would you interpret this correlation?

13. Two observers measured the portal flow (L/min) on MRI in 10 healthy subjects. The following values were obtained:
Subject No.   1     2     3     4     5     6     7     8     9     10
Observer 1   0.38  0.80  0.60  1.20  1.25  1.31  0.55  0.75  1.43  1.25
Observer 2   0.45  0.82  0.71  1.22  1.30  1.34  0.60  0.76  1.40  1.18
Determine the ICC between the observers using the ANOVA method as well as by the formula given in Equation 16.23.
14. Given the following: x̄ = 166, ȳ = 474, Σ(x − x̄)² = 1445, Σ(y − ȳ)² = 101108, and Σ(x − x̄)(y − ȳ) = 4207,
(i) Calculate the correlation coefficient between x and y. (ii) Find the simple linear regression equation of y on x.

15. Consider the following: (i) For the following data on TG level (mg/dL) by two methods in the same subjects, find the Pearson and Spearman rank correlation coefficients between the values obtained by the two methods. What difference do you observe between the correlation values obtained by the Pearson and Spearman methods, and why?

Subject No.   1    2    3    4    5    6    7    8    9    10   11   12
Method 1     122  130  148  139   93  115  124  152  136  127  106  120
Method 2     127  133  146  128  105  118  119  170  133  121   96  121
(ii) Since the same subjects have been assessed by two methods, suppose it is decided that the correlation must be more than 0.7 for it to be of any medical significance. Can we conclude with reasonable confidence (at a 5% level of significance) that the Pearson correlation is more than 0.7?
17 Relationships: Qualitative Dependent

The regression setup discussed in Chapter 16 requires that the dependent, and generally also the set of independents, be quantitative. This chapter considers the setup where the dependent is qualitative. The independents can be qualitative or quantitative, or a mixture of both. The interest is primarily in the form or nature of the relationship, and secondarily in its strength. As a reminder, qualitative variables can be nominal, ordinal, or metric divided into broad ordinalized categories. The focus of this chapter is on nominal dependent variables, particularly when these are dichotomous, although a brief explanation is given for polytomous as well as ordinal dependents. The dependent of interest is neither the value nor the mean, but the probability or the proportion of subjects with a particular characteristic. This probability necessarily lies between 0 and 1, and this restriction disqualifies the usual quantitative regression method of Chapter 16. The statistical method that meets this requirement is the popular logistic regression. The methods discussed in this chapter are basically distribution-free (nonparametric) in the sense that they do not require any specific distribution of the underlying variable values, although a large n is required.

This chapter: The basics of logistic regression, including its meaning, are explained in Section 17.1. Logistic coefficients, which are similar to the regression coefficients of Chapter 16, have an extremely useful interpretation in terms of relative risk (RR) or the odds ratio (OR). This is described in Section 17.2, which also includes the confidence interval (CI) and test of significance for the individual coefficients. Issues such as matched data that require conditional logistic regression, polytomous dependents, and ordinal categories are briefly addressed in Section 17.3. Section 17.4 describes the Cox hazard regression model, and classification and regression trees, which overlap with the quantitative regression method. The measures of strength of relationship among qualitative variables are presented in Section 17.5, which also includes methods of assessing agreement for qualitative data.

There is great similarity between the logistic regression of this chapter and the quantitative regression of Chapter 16. Many of the points discussed in the preceding chapter apply here too. We do not repeat them, except those that are typically needed for logistic regression.
17.1 Binary Dependent: Logistic Regression (Large n)

Logistic regression is meant for situations where the response or the dependent variable is dichotomously observed; that is, the response is binary. For example, the occurrence of prostate cancer is dichotomous (yes or no) and may be investigated to depend on, say, vasectomy status (yes or no) and multiple categories of dietary habits as the primary variables of interest. Possible confounders, such as smoking and age at vasectomy, can be the other regressor variables in the model. In another setup, hypothyroidism versus euthyroidism can be investigated for dependence on the presence or absence of signs or symptoms, such as lethargy, constipation, cold intolerance, or hoarseness of voice, and on quantitative serum levels of thyroid-stimulating hormone (TSH), triiodothyronine (T3), and thyroxine (T4). A binary variable is also obtained when a continuous variable is dichotomized, such as diastolic blood pressure (BP) at a clinical cutoff. For classification, a subject is predicted to be a case when the probability estimated by the model exceeds 0.5. Any other cutoff point can be used if that is more justified, such as when one group is a priori much larger than the other. The terms case and control are used here in a very general sense for the dichotomous groups—a case could be either a subject with disease or one with exposure. The probability p is calculated to several decimal places, and p = 0.5 is unlikely to occur. If it does, the subject is not classified into either group. The actual status of all the subjects is already known. The predicted grouping is compared with the observed grouping, and a table similar to Table 17.1 is obtained.

TABLE 17.1 Classification Table for the Subjects

                    Predicted by Logistic Model
Observed Group    Control   Case   Total   Percent Correct
Control             23        7     30     23/30 = 76.7
Case                 8       22     30     22/30 = 73.3
Overall             31       29     60     45/60 = 75.0
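Such a table is easy to produce from a fitted model. The following is a minimal sketch in R, assuming fit is a logistic model fitted with glm(..., family = binomial); the 0.5 cutoff is the default discussed above.

# Classification table like Table 17.1 from a fitted logistic model `fit`
p    <- fitted(fit)                         # predicted probability per subject
pred <- ifelse(p > 0.5, "case", "control")  # predicted group at cutoff 0.5
obs  <- ifelse(fit$y == 1, "case", "control")
tab  <- table(Observed = obs, Predicted = pred)
tab
100 * sum(diag(tab)) / sum(tab)             # percent correctly classified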
Consider Example 17.1 with 30 non-BPH and 30 BPH subjects, as shown in Table 17.1. When the probability (Equation 17.3) is estimated for these subjects using the logistic model, let us suppose 23 out of 30 controls have p < 0.5 and 22 out of 30 cases have p > 0.5. This information can be used to calculate both positive and negative predictivity types of measures. Mostly, the two are considered together. In Table 17.1, a total of 45 out of 60 subjects (75%) were correctly classified by the model. This is not high, and there is room for improvement. This can be done either by including more regressors in the model or by considering an alternative set of variables as regressors that can really predict BPH status. The higher the correct classification, the better the model. Generally, a correct classification between 80% and 89% is considered good, and 90% or more is considered excellent. Also, such internal validation always gives optimistic results—try the model on another set of data to get an idea of its actual worth.

One difficulty with this procedure is that a middling probability of 0.51 is treated the same way as a high probability of 0.97. Also, a subject with probability 0.49 goes to a different group from a subject with probability 0.51, despite the minimal difference. The classification is strictly with respect to the cutoff point 0.5 and ignores the magnitude of the difference from 0.5. Another difficulty arises when one group is extremely large and the other group small. Suppose you follow 600 pregnant women for eclampsia, where only 30 of them develop this condition by the end of the follow-up. The model may be able to correctly classify 560 of the 570 who did not develop eclampsia but only 2 out of the 30 who did. The overall correct classification is 562/600 = 93.7%. This may give the impression that the model is good, but it has correctly classified only 2 out of 30 (6.7%) with eclampsia, and this is crucial for the success of the model. Beware of such fallacies.

17.1.2.3 Hosmer–Lemeshow Test

Some statistical software provides an option to use the Hosmer–Lemeshow test as further evidence of the adequacy of a model (or the lack of it). This is obtained by dividing the subjects into categories and using the usual chi-square
goodness-of-fit test. For this, divide the predictor set of x-values into some rational-looking (but arbitrary) number of groups and find the observed frequencies of cases and controls in these groups. Corresponding to these groups of xs, find the expected frequencies based on the fitted logistic model. Compare the observed and expected frequencies by a slightly modified chi-square. This chi-square will have 2 df less than the number of groups you formed. Alternatively, divide the predicted probabilities into, say, 10 decile groups and compare the observed number of responses in these groups with the predicted number of responses in the cells defined by the groups. The predicted values are those obtained from the fitted logistic model. In many cases, visual comparison of the observed and expected frequencies will give you a fair idea of whether they correspond well. If you want to use a test, for 10 groups this chi-square will have 8 df. Software will do all this for you. The validity condition of chi-square applies, namely, that not many expected cell frequencies should be less than 5 and none should be close to 0. This implies that n should be really large. This test has been criticized lately for the arbitrariness that creeps in while defining the groups and for being insensitive to deviations in individual cases. In the Hosmer–Lemeshow test, the null hypothesis is that the model is an adequate fit, and the alternative is that it is not. When P > 0.05, the null is conceded, but that does not necessarily mean that the model is good—the only implication is that the evidence against it is not sufficient. It is customary to draw what is called a calibration plot with the Hosmer–Lemeshow test. If you have used 10 strata, there will be 10 points in the scatter. A line of equality is also drawn that helps to assess how far the points are from this line. In the case of a good fit, the predicted risk will be close to the observed risk, and the scatter will lie on the diagonal.
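The decile-group version of this test can be sketched in base R as follows. This is only an illustration of the computation just described, not a substitute for the routine in your statistical software; y is assumed to hold the observed 0/1 responses and p the fitted probabilities.

# Hosmer-Lemeshow-type chi-square on decile groups of predicted probability
hosmer_lemeshow <- function(y, p, g = 10) {
  brk <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp <- cut(p, breaks = brk, include.lowest = TRUE)
  obs <- tapply(y, grp, sum)       # observed responses per group
  n   <- tapply(y, grp, length)    # group sizes
  ex  <- tapply(p, grp, sum)       # expected responses per group
  chi2 <- sum((obs - ex)^2 / (ex * (1 - ex/n)))
  pchisq(chi2, df = length(obs) - 2, lower.tail = FALSE)  # (groups - 2) df
}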
17.1.2.4 Other Methods of Assessing the Adequacy of a Logistic Regression

Other methods to check the applicability of the model, such as the another-sample method and the bootstrap (see Chapter 20), should also be applied before the model is considered good for external use. A popular method to assess the adequacy of a logistic predictive model is to find the area under the receiver operating characteristic (ROC) curve based on the probabilities predicted by the model. The following steps are required to draw such an ROC curve:

1. Calculate the probability predicted by the model for each subject using the model equation.
2. For various cutoff points of probability, such as with an increment of 0.05 or 0.10, classify the subjects as diseased if their probability is equal to or greater than the selected cutoff, and as without disease otherwise.
3. Find the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) at each probability cutoff. True positives are those with disease that have been correctly classified, and false positives are the subjects without disease wrongly classified as diseased by the model. True negatives and false negatives are counted similarly.
4. Find the sensitivity and specificity, as explained in Chapter 9, at each selected cutoff point.
5. Draw a curve of sensitivity against 1 − specificity for the different probability cutoffs.

This procedure is illustrated in the following example.

Example 17.2: Adequacy of Logistic Regression through the ROC Curve for Predicting Hypothyroidism Based on Age, TSH Values, and Sex

In a cross-sectional study of 163 suspected cases of age 30–70 years, suppose only 55 were found to be hypothyroid and the remaining 108 euthyroid, as per the distribution of TSH values in Table 17.2 (Distribution of TSH Values in Euthyroid and Hypothyroid Cases).
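Steps 1 through 5 can be sketched in R as follows, assuming d holds the true disease status (0/1) and p the model-predicted probabilities; the 0.05 cutoff grid follows step 2.

# Sensitivity and 1 - specificity over a grid of probability cutoffs
roc_points <- function(d, p, cuts = seq(0, 1, by = 0.05)) {
  t(sapply(cuts, function(cc) {
    pred <- as.numeric(p >= cc)                  # step 2: classify at cutoff
    tp <- sum(pred == 1 & d == 1); fp <- sum(pred == 1 & d == 0)
    fn <- sum(pred == 0 & d == 1); tn <- sum(pred == 0 & d == 0)
    c(fpr = fp/(fp + tn), sens = tp/(tp + fn))   # steps 3 and 4
  }))
}
# plot(roc_points(d, p), type = "b")             # step 5: the ROC curve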
For two ordinal characteristics x and y, association is assessed through pairs of subjects. A pair is concordant when the subject higher on x is also higher on y, and discordant when x1 > x2 but y1 < y2. Also, pairs are tied for x if x1 = x2 irrespective of y, and tied for y if y1 = y2 irrespective of x. In ordinal data, ties are quite common. For n subjects, there are a total of n(n − 1)/2 pairs, since the pair [(x1, y1), (x2, y2)] is considered the same as [(x2, y2), (x1, y1)]. For a 3×2 table (Table 17.10), this can be explained as follows:

TABLE 17.10 Persons with Two Ordinal Characteristics

                        Characteristic 1 (x)
Characteristic 2 (y)    Low    Medium    High
Low                      a       b         c
High                     d       e         f
Total number of persons = a + b + c + d + e + f = n.
Total number of pairs = n(n − 1)/2 = T.
Concordant pairs = a(e + f) + bf = P.
Discordant pairs = c(d + e) + bd = Q.
Pairs tied on characteristic 1 (x) alone = ad + be + cf = X0.
Pairs tied on characteristic 2 (y) alone = a(b + c) + bc + d(e + f) + ef = Y0.
Pairs tied on both x and y = a(a − 1)/2 + b(b − 1)/2 + c(c − 1)/2 + d(d − 1)/2 + e(e − 1)/2 + f(f − 1)/2 = (XY)0.

(XY)0 does not contribute to the measures and is ignored except for calculating the total number of pairs. Now, various measures of ordinal association can be defined as follows. You may find varying definitions in the literature.
Kendall Tau-a: τa = (P − Q) / [n(n − 1)/2].
This is the surplus of concordant pairs over discordant pairs as a proportion of the total pairs. If the agreement in pairs is perfect, Q = 0 and tau-a = 1, assuming no ties. If all pairs are discordant, P = 0 and tau-a = −1. Thus, tau-a ranges from −1 to +1. If ties are present, use

Kendall Tau-b: τb = (P − Q) / √[(P + Q + X0)(P + Q + Y0)].
The denominator is now partially adjusted for ties, and P and Q are also automatically adjusted by definition. Tau-b works well for square tables, where the number of categories for one characteristic is the same as for the other (i.e., R = C). Tau-b = +1 if the table is diagonal, and tau-b = −1 if all diagonal elements are zero. If the table is not square, the corresponding adjustment for the size is

Kendall Tau-c: τc = 2(P − Q)R / [n(n − 1)(R − 1)],
where R is the smaller of the number of rows and columns. This is also called the Kendall–Stuart tau-c. Besides these variations of tau, there are two other popular measures of ordinal association, as follows:

Goodman–Kruskal gamma: γ = (P − Q) / (P + Q).
This completely excludes ties from the numerator as well as from the denominator. It also ranges from −1 to +1. If the number of discordant pairs is the same as the number of concordant pairs, γ = 0. All these measures are symmetric in the sense that it does not matter which characteristic is in the rows and which in the columns, as both are treated the same way. For a directional hypothesis such that x predicts y, use

Somers d = (P − Q) / (P + Q + Y0),
where Y0 is the number of pairs tied for y. Note that only the pairs tied for x are excluded from the denominator. For y predicting x, the formula changes accordingly.

Example 17.8: Association between Smoking and Drinking

Tai et al. [15] studied smoking and drinking as risk factors for esophageal cancer in Taiwanese women. They did not study the association between smoking and drinking, but suppose Table 17.11 was obtained for the cancer cases. For this table, n = 51, and T = 51 × 50/2 = 1275.

Concordant pairs = P = 35(2 + 3) + 4 × 3 = 187.
Discordant pairs = Q = 1(6 + 2) + 4 × 6 = 32.
TABLE 17.11 Smoking and Drinking in Cancer Cases

In the layout of Table 17.10, with smoking (three ordered categories, nonsmokers being the lowest) as characteristic x and drinking (low, high) as characteristic y, the cell counts are a = 35, b = 4, c = 1 in the low-drinking row and d = 6, e = 2, f = 3 in the high-drinking row.
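The pair counts and the coefficients defined above are simple to compute. A minimal sketch in R using these cell counts:

# Ordinal association for the 3x2 table of Example 17.8
a <- 35; b <- 4; c <- 1        # drinking low, by increasing smoking category
d <- 6;  e <- 2; f <- 3        # drinking high, by increasing smoking category
P  <- a*(e + f) + b*f          # concordant pairs: 187
Q  <- c*(d + e) + b*d          # discordant pairs: 32
X0 <- a*d + b*e + c*f          # pairs tied on x alone
Y0 <- a*(b + c) + b*c + d*(e + f) + e*f    # pairs tied on y alone
(P - Q)/(P + Q)                            # Goodman-Kruskal gamma
(P - Q)/sqrt((P + Q + X0)*(P + Q + Y0))    # Kendall tau-b
(P - Q)/(P + Q + Y0)                       # Somers d (x predicting y)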
TABLE 19.3 (CONTINUED) P-Values for MANOVA and ANOVA in Example 19.3

Univariate Results for FEV1
  Interaction between BMI and age categories    P > 0.60
  Differences between BMI categories            P > 0.20
  Differences between age categories            P > 0.10

Univariate Results for PEFR
  Interaction between BMI and age categories    P > 0.40
  Differences between BMI categories            P < 0.05
  Differences between age categories            P > 0.60

Univariate Results for TLC
  Interaction between BMI and age categories    P < 0.01
  Differences between BMI categories            P > 0.40
  Differences between age categories            P > 0.80
Note how the conclusions change in some cases from the univariate to the multivariate setup. In Example 19.3, if the conclusion is drawn for lung functions as a whole, the interaction between BMI and age categories is not significant (P > 0.10), and the main effects are also not significantly different. Each of these conclusions can be drawn at an overall Type I error not exceeding 5%. Univariate results show that the interaction between the BMI and age categories is significant for TLC, and that differences in PEFR between BMI categories are significant. Examination of the data indicates that TLC is low when BMI and age are both high. But this univariate conclusion is different for FVC, FEV1, and PEFR, and each has its own Type I error. Note also that the P-values differ from one lung function to the other, and that a joint P-value in a multivariate setup can be very different. The correct joint conclusion on the lung functions as a whole is obtained by the multivariate test.

The following remarks on MANOVA may be helpful in clarifying certain issues:

1. The underlying assumption for a MANOVA test is a multivariate Gaussian pattern of the observations. This is not easy to verify. However, each variable separately can be checked for a Gaussian pattern using the methods described in an earlier chapter. When each is not far from Gaussian, there is a great likelihood that they are jointly multivariate Gaussian.
2. The other requirement for a valid MANOVA test is homogeneity of the dispersion matrices of the ys in different groups. This matrix is the multivariate analog of the variance. Methods such as the Box M test are used for testing this homogeneity [7]. This test, however, depends heavily on multivariate normality of the data. You can depend on statistical software for performing this test. When the same set of variables is measured in two or more groups, the dispersion matrices should be comparable, and there would rarely be any need to worry on this account.
3. The test criterion used for the MANOVA test is generally either the Pillai trace or Wilks' Λ, as in the case of multivariate regression. Their distribution in most cases can be transformed, at least approximately, to the usual F, as applicable to ANOVA. Statistical software will do this and provide the P-value.
4. Empty cells due to missing observations or otherwise are a more serious handicap in a multivariate setup than in a univariate setup. If information on just one variable is missing, the entire record is generally deleted. In Example 19.3, only FVC and FEV1 could not be measured for one subject and TLC for another subject. Other information was available. But the analysis was done on 68 subjects after excluding these two subjects altogether.
5. These methods assume that all dependent variables have equal importance. If homocysteine and insulin levels both appear as dependents in the data for coronary artery disease (CAD) cases, both are given the same importance unless a weighting system is applied in the analysis. For example, it is possible to incorporate in the analysis that the homocysteine level is six times as important as the insulin level. This will modify the method to some extent, but the major problem for a clinician would be to determine these weights objectively.
6. Example 19.3 stated that P-values are more than or less than nice numbers. This is one way to report statistical results. The other, of course, is to state the exact P-values, as shown in Table 19.2.
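In R, such an analysis can be sketched as follows; the data frame lung and its column names are hypothetical stand-ins for the lung function data of Example 19.3.

# Two-way MANOVA on four lung functions (hypothetical data frame `lung`)
fit <- manova(cbind(FVC, FEV1, PEFR, TLC) ~ bmi_cat * age_cat, data = lung)
summary(fit, test = "Pillai")   # multivariate tests based on the Pillai trace
summary.aov(fit)                # follow-up univariate ANOVAs, one per response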
19.2.2.2 MANOVA for Repeated Measures

Another example where the dependent is a quantitative response is drug concentration in the blood at different points of time (repeated measures) after its administration to two or more groups of patients (e.g., experimental and control, or control, drug 1, and drug 2). The qualitative independent in this case is the group. The objective is to find whether the mean response at different points of time is different in the various groups, and not to explore the time trend of the response. The mean response in this case refers to the mean over the patients and not the mean over time. MANOVA simultaneously compares the several means, one at each time point, in one group with those in the other groups. Such simultaneous consideration obviates the need to examine univariate parameters, such as the time to reach the peak concentration (Tmax) and the peak concentration (Cmax) reached, unless they are otherwise needed for evaluating the pharmacological properties of the regimen. MANOVA in this case could be an alternative to the area under the (concentration) curve (AUC) used by many workers for this setup. As indicated later in Chapter 21, the AUC can lead to erroneous conclusions in some cases.

Repeated measures are naturally correlated and provide an apt situation for the use of MANOVA for a quantitative dependent. Univariate repeated measures ANOVA was discussed earlier in Chapter 15, but MANOVA is considered better when the group sizes are equal (balanced design) because MANOVA is quite robust to assumptions such as sphericity. Sphericity, as mentioned earlier, means that the contrasts (differences at various time points from the previous value) are independent (covariance = 0) and have the same variance (homogeneity); it is assessed by the Mauchly test. These are rather strong assumptions for univariate repeated measures ANOVA, but not as much for MANOVA. Univariate analysis is recommended for an unbalanced design (unequal group sizes). If you are using MANOVA for repeated measures because of a balanced design, care is needed in specifying the design regarding which factors are between subjects and which are within subjects. In repeated measures, time will always be within subjects, but there may be other factors as well. Also, for this MANOVA, prefer the Pillai trace as the criterion instead of Wilks' Λ for testing differences between groups, because the Pillai trace is generally more robust to assumption violation. Tests between subjects are based on the average over time—thus, interpret them accordingly. The interaction between groups and time will indicate whether different groups have the same time trend. Do not forget to test the homogeneity of the variance–covariance matrices between groups by the Box M test.

Software illustration: Appendix C contains an illustration of the use of R software for running MANOVA for repeated measures.

19.2.3 Classification of Subjects into Known Groups: Discriminant Analysis

Move now to a setup where the classification structure of n observations is known, and this information is used to assign other observations whose classification is not known. Suppose you have information on clinical features and laboratory investigations for 120 thyroid cases. On the basis of extensive information and the response to therapy, they are divided into three groups, namely, hyperthyroid, euthyroid, and hypothyroid patients. Assume that this division is almost infallible and there is practically no error.
This is feasible in this case because the response to therapy is also known. The problem is to classify a new subject into one of these three groups on the basis of clinical features alone. This exercise would be useful when the facility for evaluation of thyroid functions is expensive. Response to therapy in any case would be available only afterward. What would be the best clinical criteria for classifying a new case with the least likelihood of error? Note that this is also an exercise in the management of uncertainties. The statistical method used to derive such classification criteria is called discriminant analysis. This can be viewed as a problem of finding combinations of independent variables that best separate the groups. The groups in this case define the dependent variable, and they must be nonoverlapping.

19.2.3.1 Discriminant Functions

The procedure to find the combinations of variables that best separate the groups is discriminant analysis, and the combinations so obtained are called discriminant functions. They may not have any biological meaning. These functions are considered optimal when they minimize the probability of misclassification. If there are K groups, the number of discriminant functions required is (K − 1). The first is obtained in such a manner that the ratio of the between-group sum of squares to the within-group sum of squares is maximum. The second is obtained in such a manner that it is uncorrelated with the first and has the next largest ratio, and so on. If there are only two groups, only one discriminant function is needed. The nature of these functions is similar to that of the multiple regression equation, that is,
Discriminant function: Dk = b0k + b1k x1 + b2k x2 + ⋯ + bJk xJ;  k = 1, 2, …, (K − 1),  (19.1)
but the x variables on the right side are stochastic in this setup. The function in Equation 19.1 is linear, but other forms can also be tried. The x variables are standardized so that any particular x, or a few xs with large numerical values, do not get an unfair advantage. The method used to obtain the function represented in Equation 19.1 is complex and can be left to a software package. If J measurements (x1, x2, …, xJ) are available for each subject, it is not necessary to use all J of them. Simple discriminant functions with fewer variables are preferable, provided they have adequate discriminating power. Relevant variables can be selected by a stepwise procedure similar to the one explained for regression in Chapter 16. This procedure also helps to explore which variables are more useful for discriminating among groups. Sometimes, just one variable may be enough to distinguish the groups.

19.2.3.2 Classification Rule

When values of x1, x2, …, xJ are substituted into the function in Equation 19.1, the value obtained is called the discriminant score. These scores for various sets of x are used in Bayes' rule to classify the cases. The probability required to use this rule depends on the distribution form of the variables. A multivariate Gaussian distribution is generally assumed. The rule also requires specification of the prior probability of a subject belonging to the various groups. These are not necessarily equal—in some situations, it is known from experience that subjects come more frequently from one group than from the others. The prior probability is generally estimated by the proportion of cases in the different groups in the sample, provided there is no intervention that could distort sampling. If, in 120 consecutive thyroid cases coming to a clinic, 30 are found to be hyperthyroid, 70 euthyroid, and 20 hypothyroid, then the estimates of the prior probabilities are 30/120 = 0.25, 70/120 = 0.58, and 20/120 = 0.17, respectively. If the subjects are deliberately chosen to be in a certain proportion in your sample, such as equal numbers, then the prior probabilities are specified in accordance with the actual group prevalence in the target population. For example, you may wish to have 50 cases each of hyperthyroidism, euthyroidism, and hypothyroidism for your discriminant analysis, but the prior probabilities would continue to depend on their respective proportions in all the incoming cases that are planned to be the subjects for future classification. Many published reports seem to ignore this aspect and assume equal probabilities. The results can then be fallacious. Also, do not try to tamper with the prior probabilities to get a better classification. In the case of two groups, the threshold for classification is
d = (DI + DII)/2 + ln(prior probability of group II / prior probability of group I)/(DI − DII),  (19.2)
where DI is D in Equation 19.1 evaluated at the mean of the xs in group I, and DII is D evaluated at the mean of the xs in group II. The labeling is such that DI > DII. For a particular subject, if D > d, then that subject is assigned to group I; otherwise, he or she is assigned to group II. This is called the classification rule. If the prior probabilities are equal, the second term on the right-hand side of Equation 19.2 becomes zero.

19.2.3.3 Classification Accuracy

The exercise of classification is first used on the cases already available in our sample. For each case, a predicted class is obtained on the basis of the discriminant score for that case. Its actual class is already known. A cross-classification of cases by actual and predicted class thus obtained is called a classification table. This is used to find the percentage correctly classified, called the discriminating power. For discriminant analysis to be successful, this power should be high, say, exceeding 80%. If a set of discriminant functions cannot satisfactorily classify the cases on which it is based, it certainly cannot be expected to perform well on new cases. When the percentage correctly classified is high, it is still desirable to try the discriminant functions on another set of cases for which the correct classification is known. It is only after such external validation that the hope for satisfactory performance on new cases is high.

Example 19.4: Discriminant Function for Survival in Cervical Cancer Cases

Consider 5-year survival in cases of cervical cancer investigated as depending on age at detection (age) and parity. Suppose data are available for n = 53 cases, of whom 33 survived for 5 years or more and 20 died before this. Can age and parity be effectively used to predict whether the duration of survival is going to be at least 5 years? Suppose the data summary is as given in Table 19.4.
TABLE 19.4 Age and Parity in Cases of Cervical Cancer by 5-Year Survival

5-Year Survival   Number of Subjects   Age, Mean (SD)   Parity, Mean (SD)
No                       20             43.2 (6.55)        2.6 (1.43)
Yes                      33             42.6 (6.84)        1.4 (1.27)
The difference between the groups in mean age at detection of cancer is not significant (P > 0.50), but the difference in mean parity is (P < 0.05). The proportions of survivors and nonsurvivors in this sample are 33/53 = 0.623 and 20/53 = 0.377. Assuming that the same ratio will continue in the future, these can be used as the prior probabilities for running the discriminant analysis. Then, the following discriminant function is obtained by software:

D = −0.327 − 0.0251 (AGE) + 0.774 (PARITY).
For classification of subjects, you need the threshold d. This is obtained from Equation 19.2 as follows. Substitution of the group means into the discriminant function D gives DI = −0.327 − 0.0251 × 43.2 + 0.774 × 2.6 = 0.6011 for nonsurvivors and DII = −0.327 − 0.0251 × 42.6 + 0.774 × 1.4 = −0.3127 for survivors. Thus,

d = [0.6011 + (−0.3127)]/2 + ln(0.623/0.377)/[0.6011 − (−0.3127)] = 0.1442 + 0.5023/0.9138 = 0.69.
If D > 0.69 for a particular subject, then assign the subject to group I (surviving less than 5 years); if not, then assign to group II. The discriminant function in this case can be represented by a line, as shown in Figure 19.3. This shows that the chance of survival for at least 5 years is less if age at detection and parity are high. Some points in Figure 19.3 overlap.
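The rule of Example 19.4 can be computed directly from the quantities in the text; the following is a minimal sketch in R, with function names of our own choosing.

# Discriminant score and classification threshold of Example 19.4
D <- function(age, parity) -0.327 - 0.0251*age + 0.774*parity
DI  <- D(43.2, 2.6)      #  0.6011 at nonsurvivor means
DII <- D(42.6, 1.4)      # -0.3127 at survivor means
d <- (DI + DII)/2 + log(0.623/0.377)/(DI - DII)   # Equation 19.2: 0.69
classify <- function(age, parity)
  ifelse(D(age, parity) > d, "group I (survives less than 5 years)",
         "group II (survives 5 years or more)")
classify(50, 4)          # high age and parity: classified to group I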
FIGURE 19.3 Scatter and discriminant line for surviving less than 5 years and at least 5 years (parity against age at detection in years).
TABLE 19.5 Classification Table Based on Discriminant Function in Example 19.4

                        Predicted Group
Observed Group     Survivors(a)   Nonsurvivors   Total
Survivors(a)            28              5          33
Nonsurvivors             8             12          20
Total                   36             17          53

(a) For 5 years or more.
SIDE NOTE: Variations and uncertainties play their role, and many subjects are misclassified. The actual numbers are shown in Table 19.5. Only 40 out of 53 (75.5%) could be correctly classified. As many as 8 out of 20 (40%) nonsurvivors are classified as survivors by this discriminant function. When the search is restricted to linear functions, no better discrimination can be achieved. For better discrimination, either look for nonlinear functions (not included in this text) or include more variables in addition to age and parity. Perhaps a more plausible conclusion is that age and parity by themselves are not sufficient to predict 5-year survival in cervical cancer cases. You can take the view that correct classification in three-fourths of the cases on the basis of such mundane variables as age and parity is good enough for practical applications. If this view is accepted, the discriminant function must be externally validated before it is used on new cases.
Example 19.5: Discriminant Function to Determine Sex from Patellar Dimensions

Introna et al. [8] studied the right patella of 40 male and 40 female Italian skeletons with respect to seven measurements: maximum height, maximum width, thickness, and the height and width of the external and internal facies articularis. Through a stepwise procedure, they found that only two measurements, maximum width and thickness, could discriminate gender correctly in 83.3% of cases. It was concluded that gender can be predicted by patellar dimensions when no other suitable remains of a human skeleton are available for gender determination. The prior probabilities in this example are 0.5 each because the two genders have the same proportions in the population.
Note the following regarding discriminant functions:
1. Discriminant functions divide the universe of subjects into sectors corresponding to the different groups. Each sector defines the group to which a subject is most likely to belong.
2. You may have noticed earlier that logistic regression is used when the target variable is binary. Logistic regression for polytomous dependent variables exists, but it also generally considers two categories at a time. Discriminant analysis can be used for all categories together when the target variable is polytomous. The interpretation of a discriminant function, even when there are two groups, is not exactly the same as that of a logistic regression, but it serves the purpose of predicting the category, provided the distribution is multivariate Gaussian. The associated probability can also be calculated, although no details are provided in this text. For the calculation of these probabilities, see Everitt and Dunn [9]. Most statistical software calculates these probabilities.
3. Discriminant analysis is sometimes used to find whether a particular variable or a set of variables has sufficient discriminating power between groups. Wang et al. [10] studied a large number of genes for prognostic scoring systems that can predict survival in gastric ulcer. They identified a 53-gene signature that strongly predicted patients with either a poor or a good overall survival.
4. As stated earlier, the foregoing discriminant analysis assumes that the variables are jointly multivariate Gaussian. This necessarily implies that the distribution of each variable is Gaussian. Mild deviations do not do much harm. In Example 19.4, the parity distribution is not Gaussian, yet the result may still be valid for a large n. If one or more variables are binary, then the foregoing procedure is questionable unless n is really large. In that case, use a logistic discriminant function. This is briefly discussed by Everitt [11].
5. The other important requirement for discriminant analysis is the equality of the dispersion matrices in the various groups. This is checked by the Box M test. If the dispersion matrices are really very different, the linear discriminant function is not adequate for separating the groups. A more complex, quadratic discriminant function [11] may be helpful in this case.
19.3 Identification of Structure in the Observations

There were two groups of variables in the setup considered in the previous section: the dependent and the independent. Now consider a setup where all the variables have the same status and there is no such distinction. Suppose you obtain information on a large number of health variables for a set of individuals and want to divide these individuals into two or more groups such that the individuals within each group have similar health. The number of groups is not predetermined, and there is no dependent variable in this case. Dividing cases into diagnostic categories on the basis of various clinical features and laboratory investigations is a classic example of this type of exercise. It was observed in the past, for example, that acute onset of fever, hemorrhagic manifestations, thrombocytopenia, and rising hematocrit occur together in some cases, and this entity was called dengue hemorrhagic fever under certain conditions. Similarly, fever, splenomegaly, and hepatomegaly accompanied by anemia and weight loss were given a name, kala-azar, particularly in sandfly-infested areas. All diagnostic categories have evolved from the observation of such common features. Subjects with the same features are put together into one "cluster." In these examples, the clustering of cases is with respect to the clinical features or etiology. The structure is not known a priori, and there is no target variable.

The second type of structure discussed in this section concerns the constructs, if any, underlying a set of observations. These constructs are common to many observations and generally are unobservable or unmeasurable entities. They are popularly called factors underlying the observations. Such factors can be identified in some cases by suitable statistical methods. The term factor in this context has an entirely different meaning from its earlier use.

19.3.1 Identification of Clusters of Subjects: Cluster Analysis

The problem addressed now is the division of subjects or units into an unspecified number of affinity groups. The statistical method that discovers such natural groups is called cluster analysis. This was briefly described in Chapter 8 in the context of graphics. The grouping is done in such a manner that units similar to one another with respect to a set of variables belong to one group, and the dissimilar ones belong to another group. Thus, natural groupings in the data are detected. This is, for instance, unwittingly done in assigning grades to students in a course where the instructor looks for rational cutoff points. Sometimes, only two grades, A and B, are considered enough; sometimes they go up to E. In another setup, the method can also create taxonomic groups for future use. An example already cited is that of cases falling into various diagnostic groups on the basis of their clinical features. A name is subsequently assigned to these groups depending on their etiology or features. Although overlapping clusters can be conceived, so that one or more units belong simultaneously to two or more clusters, this chapter is restricted to exclusive clusters. Cluster analysis is a nonparametric procedure and does not require values to follow a Gaussian or any other pattern.

19.3.1.1 Measures of Similarity

The division of subjects into a few but unknown number of affinity or natural groups requires that proximity between subjects be objectively assessed. Those with high proximity go into one group, and those with low proximity are assigned to some other group. Thus, the subjects resembling one another are put together in one group. As many groups are formed as needed for internal homogeneity and external isolation of the groups (Figure 19.4). The points plotted in Figure 19.4b are the same as in Figure 19.4a, but now the affinity groups are shown.
FIGURE 19.4 Scatter plot of diastolic and systolic levels of BP: (a) affinity not shown and (b) one form of affinity groups.
TABLE 19.6 Matches and Mismatches in Variables in Two Subjects Measured on K Binary Variables

                        ith Subject
jth Subject      Yes          No           Total
Yes              k11          k12          k11 + k12
No               k21          k22          k21 + k22
Total            k11 + k21    k12 + k22    K
Similarity between two subjects can be measured in a large number of ways. The methods for qualitative variables differ from those for quantitative variables. In the case of binary qualitative variables, the affinity can be shown in the form of a 2×2 table, as in Table 19.6. There are a total of K variables, out of which matches between the ith and jth subjects occur in k11 + k22 variables. A popular measure of similarity in this case is
Simple matching coefficient: sij = (k11 + k22)/K;  i, j = 1, 2, …, n.  (19.3)
Note that sij is the ratio of the number of variables on which the ith and jth subjects match to the total number of variables. This can be calculated for each pair of subjects. Similar measures are available for polytomous variables, such as the phi coefficient and Cramer V (Section 17.5.1). Romesburg [12] discusses strategies for mixed sets (quantitative and qualitative) of variables. In the case of quantitative variables, the usual Pearson correlation can be used as a measure of similarity. If yi1, yi2, …, yiK are K quantitative measurements on the ith subject and yj1, yj2, …, yjK on the jth subject, the correlation between the two can be directly calculated. Note that this is being used here for assessing similarity between subjects, whereas the earlier use was to measure the strength of correlation between variables. A more acceptable method in this case is to compute a measure of dissimilarity instead of similarity. This, between the ith and jth subjects, can be measured by
Euclidean distance: dij = √[Σk (yik − yjk)²];  i, j = 1, 2, …, n.  (19.4)
This is calculated after standardization (Z-score) of the variables so that the scales do not affect the value. Otherwise, the variables with larger numerical values will mostly determine the distance. This distance can also be calculated for a setup with one variable (K = 1), in which case it reduces to a simple difference between the values. On the basis of such measurement of dissimilarity, the subjects are classified into various groups using one of several possible algorithms. A very popular one is described next.

19.3.1.2 Hierarchical Agglomerative Algorithm

With hierarchical algorithms, two or more units (or subjects) that are most similar (or least distant) are grouped together in the first step to form one group of these units. This group is now considered one entity. At this stage, other entities may also form from the merging of other units. Now the distance of these entities from other units or entities is compared with the other distances between various pairs of units or entities. Again, the closest are joined together. This hierarchical agglomerative process goes on in stages, reducing the number of entities by one or more each time. The process is continued until all units are clustered together as one big entity. Described later is a method to decide when to stop the agglomerative process so that natural clusters are obtained. This process is graphically depicted by a dendrogram of the type shown in Figure 8.15 in Chapter 8. Note that with this method, subsequent clusters completely contain previously formed clusters.

It may not be immediately clear how to compute the distance between two entities containing, say, n1 and n2 units, respectively. Several methods are available. A first method is to consider all units in an entity centered on their average. A second method is to compute the distance between the units that are farthest apart in the two entities. A third method is to base it on the nearest units. There are several others. Depending on how this distance is computed, names such as centroid, complete linkage, and single linkage are given to these methods. Different methods can give different results. Jain et al. [13] have studied the relative merits and demerits of some of these methods and commented that no specific guideline can be given, but a method called average linkage is found to perform better in many situations. This method uses the average of the distances between units belonging to different entities as the measure of distance between two entities.
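A minimal sketch of these proximity measures and the agglomerative process in R, on hypothetical data:

# Simple matching coefficient (Equation 19.3) for two subjects on K = 4
# binary variables
xi <- c(TRUE, TRUE, FALSE, TRUE)
xj <- c(TRUE, FALSE, FALSE, TRUE)
mean(xi == xj)                                # (k11 + k22)/K = 0.75

# Euclidean distances (Equation 19.4) and average linkage clustering
Y    <- scale(matrix(rnorm(100), nrow = 20))  # 20 units, 5 standardized variables
dmat <- dist(Y, method = "euclidean")         # all pairwise dij
hc   <- hclust(dmat, method = "average")      # agglomeration, average linkage
plot(hc)                                      # dendrogram as in Figure 19.5
cutree(hc, k = 4)                             # memberships for a 4-cluster cut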
19.3.1.3 Deciding on the Number of Natural Clusters

The most difficult decision in the hierarchical clustering process is regarding the number of clusters naturally present in the data. The decision is made with the help of a criterion such as pseudo-r or the cubic clustering criterion [14]. These values should be high compared with the adjacent stages of the clustering process. Another criterion could be the distance between the two units or entities that are being merged in different stages. If this shows a sudden jump, it is indicative of a very dissimilar unit joining the new entity. Thus, the stage where the entities are optimal in terms of internal homogeneity and external isolation can be identified. The entities at this stage are the required natural clusters. You can confirm this by running ANOVA with the clusters as groups. If the ANOVA F is not significant, the cluster means are not sufficiently different, and that would indicate that the cluster analysis has not been effective. The following comments regarding cluster analysis may be helpful:
1. The algorithm just described is a hierarchical agglomerative algorithm. You can instead use a hierarchical divisive algorithm, in which the beginning is from one big entity containing all the units, and divisions are made in subsequent stages. However, this is rarely favored because agglomeration is considered a natural clustering process.
2. The other algorithm is nonhierarchical. This can be used when the number of clusters is predetermined. We do not recommend this algorithm because it does not adequately meet the objective of discovering an unspecified number of natural clusters.
3. Cluster analysis methods have the annoying feature of "discovering" clusters when, in fact, none exist. A careful examination of the computer output for cluster analysis, particularly with regard to the criteria for deciding the number of clusters, should tell you whether natural clusters really exist.
4. Since there is no target variable, the clusters so discovered may or may not have any medical relevance.
5. Different clustering methods can give different clusters. One strategy to overcome this problem is to obtain clusters by several different methods and then look for consensus among them. Such consensus clusters are likely to be stable. The consensus may be difficult to identify in a multivariate setup. Indrayan and Kumar [15] have given a procedure to identify consensus clusters in the case of multivariate data.
6. Cluster analysis has developed into a full subject by itself. Details of the procedures just described and of several other cluster procedures are available in Everitt et al. [16].
7. The clustering just mentioned is different from the clustering of subjects with disease in population units such as households. For this, papers by Mantel [17] and Fraser [18] may be helpful.

Example 19.6: Clustering of Countries for Expectation of Life at Birth

Cluster analysis can also be done on just one variable. Figure 19.5 shows the dendrogram obtained for Pan-American countries when clustered on the basis of 2004 values of expectation of life at birth (ELB). The method followed is average linkage. The distance on the horizontal axis is rescaled in proportion to the actual distances between entities. The algorithm detected four clusters, shown as 1 to 4 in the first clustering in the figure. These are as follows:

Cluster   Expectation of Life at Birth (Years)
1         80–75
2         73–68
3         64
4         52
The fourth cluster has only one country, namely, Haiti, which has a very low ELB. Note the gap in ELB values between clusters, which makes these clusters distinct.

Example 19.7: Clusters of Pseudomonas aeruginosa by Whole-Cell Protein Analysis by SDS-PAGE

See Figure 8.15 (Chapter 8) for a dendrogram that shows the hierarchical clustering of 20 sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS–PAGE) groups on the basis of the whole-cell protein profile in strains of Pseudomonas aeruginosa [19]. The authors did not study the number of natural clusters in this case, but the affinity can be seen in the figure. For example, groups 5–8 have been found alike, and they are close to group 9 (note the small vertical distance between group 9 and groups 5–8; the vertical distance from group 4 is larger). Similarly, groups 18 and 19 are alike. There are many such affinity groups in this example.
FIGURE 19.5 Dendrogram of Pan-American countries for ELB (2004): average linkage method. The four clusters marked in the figure are: Canada, Cuba, USA, Chile, Costa Rica, Uruguay, Mexico, Panama, Argentina, Barbados, and Ecuador (ELB 75–80 years); Bolivia and Guyana (ELB 64 years); Guatemala, Honduras, Dominican Republic, Suriname, Colombia, Venezuela, Belize, Jamaica, Paraguay, Brazil, El Salvador, Peru, Trinidad and Tobago, Bahamas, and Nicaragua (ELB 68–73 years); and Haiti (ELB 52 years).
The dendrogram in Figure 19.5 needs some more explanation for it to make more sense. Note that at the first stage, entity 1 has 11 countries with ELB ranging from 75 to 80 years, entity 2 has 2 countries with an ELB of 64 years, entity 3 has all other countries except Haiti with ELB ranging from 68 to 73 years, and entity 4 has Haiti with an ELB of 52 years. In the second stage, entities 2 and 3 merge (call it entity 2,3), and at the next stage, entity 2,3 merges with entity 1. In the last stage, all these together merge with entity 4. Also note that the horizontal distance suddenly becomes huge when entity 4 is merged. This sudden jump can also be used as an indication that a very dissimilar entity is being merged. Thus, an idea of the optimal number of clusters can also be obtained from the dendrogram. In this particular example, this would indicate that only two clusters exist (Haiti in one and all the remaining countries in the other), but such a small number of clusters could defeat the purpose of cluster analysis. Thus, the first stage with four clusters has been considered the right clustering.

19.3.2 Identification of Unobservable Underlying Factors: Factor Analysis

How do you assess the health of a child? You will see his or her height, weight, skinfold, eyes, nails, and so forth. Health is a directly unobservable factor, and it is measured through a set of surrogates. Conversely, you may have a host of variables
and want to know what factors are underlying those variables. There are situations where the observed variables can be considered to be made up of a small number of unobservable, sometimes abstract, factors, better known as constructs. Health in its comprehensive form (physical, social, mental, and spiritual) can be measured for adults by variables such as the kind and severity of complaints, if any; obesity; lung functions; smoking; sexual behavior; marital relations; job satisfaction; unmet aspirations; income; and education. It is not apparent how much of, say, job satisfaction is ascribable to physical health, how much to each of the other components (social, mental, and spiritual) of health, and how much is the remainder that cannot be assigned to any of these components. For a variable such as obesity, the physical component may be dominant, and for a variable such as education, the social component may be dominant. The technique called exploratory factor analysis can sometimes be used to identify such underlying factors.

The theory of factor analysis presumes that the variables share some common factors that give rise to correlation among them. Actual interest in this analysis is in identifying relatively few underlying factors, and the variables actually measured are now better understood as surface attributes because they are considered manifestations of the factors. The latent underlying factors are the internal attributes. The primary objective of factor analysis is to determine the number and nature of the underlying internal attributes, and the pattern of their influence on the surface attributes. In a successful factor analysis, a few factors would adequately represent the relationships among a relatively large number of variables. These factors can then be used subsequently for other inferential purposes. An example described later should clarify the meaning and purpose of such factors.

19.3.2.1 Factor Analysis

How do you measure the physical health of a community? You may want to assess it by expectation of life at 1 year, absence of various diseases, disability-free life years, and so forth. All these are kinds of surrogates. You may want to combine all these and call it the index of physical health. Note that whereas the observed variables are surface attributes, the real interest is in an internal attribute—health in this example—and health cannot be directly measured. This is a factor in the terminology of factor analysis. Factors are unobservable underlying attributes and can be abstract. Both the latent factors and the observed values are continuous in this setup.

Now add other attributes such as beds and doctors per 1000 population, per capita health expenditure by the government and out of pocket, specialists available, smoking, divorces, crimes, and suicides, so that the spectrum enlarges to include other aspects of health. If a mix of these measurements is available, and you want to discover the unknown intrinsic attributes, you may find one set of measurements comprising those mostly concerned with health resources and a second set mostly with mental health, in addition to the set concerned mostly with physical health.

Factor analysis is easy to understand through principal components. Consider two variables—height and weight—whose essence is captured by BMI for many applications. This combined variable provides most of the information contained in two different but correlated variables.
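To anticipate the idea formally named just below, here is a minimal sketch (with simulated height and weight values, so all numbers are illustrative) showing how a single linear combination can carry most of the information in two correlated variables:

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # illustrative simulated data
height = rng.normal(165, 8, size=500)                         # cm
weight = 0.9 * (height - 100) + rng.normal(0, 5, size=500)    # kg, correlated with height

X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)                          # center each variable

# Eigen-decomposition of the covariance matrix yields the principal components
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigenvalues in ascending order

share_first = eigenvalues[-1] / eigenvalues.sum()
print(f"Variation captured by the first linear combination: {share_first:.0%}")
```

With strongly correlated variables, the first such combination typically captures most of the total variation, which is the sense in which little is lost by working with it alone.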
BMI is not a linear combination of height and weight, but there are situations where much of the information contained in two or more variables can be captured by one linear combination of them. This linear combination is called a principal component. If substantial information remains uncaptured, you can try to find a second principal component. This would be independent of the first and may capture much of this remainder. If the remainder is substantial even after the second, a third principal component can be obtained, and so on. When really successful, a few principal components would be able to capture most of the information in a large number of variables. Now you do not need to work with the original large set and can work with a few principal components without much loss. This is called data reduction—more appropriately, dimensionality reduction.

In the principal component method of factor analysis, the first linear combination is found that accounts for the maximum variation among the observed variables. A second linear combination is obtained from the remaining variation, and so on. The first principal component may be able to take care of 62% of the total variation, the second 23%, and the third only 8%. A strategy called rotation of axes is used that can modify the principal components such that the first principal component has high weight for only a few variables, the second for another set of a few variables, and so on. In our example, although the dataset comprises all the measurements, the first principal component may have a high weight for variables pertaining to physical health (and low weight for others), the second principal component a high weight for variables pertaining to health resources, and the third a high weight for variables pertaining to mental health. Some variation would still remain unaccounted for irrespective of rotation.

The statistical purpose of factor analysis is, in a sense, the reverse. It is to obtain each observed variable as a combination of a few unobservable factors, that is,
Observed value of a variable = Linear combination of factors + Error.
If the observed variables are x1, x2, …, xK, factor analysis seeks the following:

x1 = a11F1 + a12F2 + … + a1MFM + U1,
x2 = a21F1 + a22F2 + … + a2MFM + U2,
⋮
xK = aK1F1 + aK2F2 + … + aKMFM + UK,     (19.5)
where F1, F2, …, FM are the M unobservable factors common to the xk, and the Uk (k = 1, 2, …, K) are called unique factors. A schematic representation is in Figure 19.6, where the number of x variables is six but the number of factors is two. Unique factors represent the part that remains unexplained by the factors.
FIGURE 19.6 Schematic representation of factors.
The coefficients akm (k = 1, 2, …, K; m = 1, 2, …, M; M « K) are estimated by the factor analysis procedure. These coefficients are called loadings and measure the importance of the factor Fm in the variable xk on a scale of −1 to +1. When a loading is very small, say, less than 0.20 in absolute value, the corresponding factor is dropped from Equation 19.5. Thus, different variables may contain different sets of factors. Some factors would overlap and would be present in two or more variables. In Figure 19.6, factor F1 is common to x2, x3, x5, and x6, and F2 is common to x1, x3, and x4. Each x has its own specific U. These factors are also called constructs since they require a judgment.

Equations 19.5 differ from the usual multiple regression equations because the Fms are not single independent variables. Instead, they are labels for a combination of variables that characterize these constructs. They are obtained in such a manner that they are uncorrelated with one another. The statistical method of principal components is generally used for this purpose. Details of the method are complex and are beyond the scope of this book. You may want to see Kline [20] for details. Statistical software would easily provide an output, but it may ask you to specify different aspects of the methodology. You should not undertake this exercise yourself unless you understand the intricacies. Do not hesitate to obtain the help of a biostatistician when needed.

19.3.2.2 Steps for Factor Analysis

The basic steps for doing factor analysis are as follows:

Step 1. Obtain the correlation matrix of all x variables you have observed.
Step 2. Check that correlations are sufficiently high among subsets of variables. Generally, the correlation coefficients between most pairs of variables should exceed 0.3. You can use the Bartlett test to test the null hypothesis that the correlation matrix is an identity. An identity matrix means that the diagonal elements are 1 (the correlation of xk with xk is always 1) and the off-diagonal elements are 0. This null must be convincingly rejected for factor analysis to be successful.
Step 3. Check that the variables indeed share some common factors that are probably giving rise to these correlations. This can be checked as follows for linear correlations:
a. Multiple correlation of each xk with the other xs should be generally high.
b. The partial correlations, which are correlations between two variables after the linear effect of the others is adjusted, should be generally low. This is tested by the Kaiser–Meyer–Olkin (KMO) measure.
Step 4. Once you are convinced from steps 1–3 that you have an appropriate setup to try factor analysis, enter the correlation matrix into factor analysis software. The output will give you factor loadings and a host of other information.
Step 5. Apply appropriate rotation (such as varimax) to get a meaningful interpretation of the factor loadings, as described in a short while.

Depending on the correlation structure among variables, it is often possible to identify meaningful underlying constructs. For proper interpretation, it is necessary to group variables that have large loadings for the same factor. A suitable name can be accordingly assigned to these factors. Such an exercise can sometimes lead to a better understanding of the structure of the interaction among variables. However, the procedure based on correlations as just described assumes a linear relationship. Other forms of relationships may require another set of procedures that are not discussed in this chapter. The following example may clarify the process and meaning of factor analysis.

Example 19.8: Underlying Diet Patterns in Colon Cancer

Colon cancer is considered to be associated with several nutrients and foods. Detailed dietary intake data can be examined to find dietary patterns related to colon cancer. Slattery et al. [21] surveyed a large number of subjects for intake of more than 800 individual food items. These items were categorized into 35 food groups, such as processed meat, red meat, fish, eggs, butter, margarine, liquor, coffee, fresh fruits, dry fruits, fruit juice, green salads, grains, and desserts. These served as input variables for factor analysis. The underlying factors in this study are the dietary patterns. Six dietary patterns for males and seven for females were identified in control subjects after performing factor analysis. Depending on the patterns receiving higher loadings in specific food groups, they were given a name. For example, the pattern that received high loadings in processed meat, red meat, eggs, refined grains, added sugar, and so forth was called a "Western" diet. The pattern that received high loadings in fruit juice, potatoes, salads, vegetables, and whole grains was called a "prudent" diet. The pattern receiving high loadings in fresh liquor and wine was called a "drinker" diet.

SIDE NOTE: Evaluation of these dietary patterns with regard to the risk of colon cancer showed that the Western-type diet increased the risk of colon cancer in both men and women. The prudent diet pattern was associated with decreased risk.
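The steps above translate directly into code. The sketch below is a minimal illustration, not the analysis of Slattery et al. [21]: the data matrix is simulated, the Bartlett statistic is computed from its standard chi-squared formula, and the varimax-rotated factors come from scikit-learn's FactorAnalysis (the rotation argument assumes scikit-learn 0.24 or later):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: n subjects by K food-group scores driven by two latent patterns
rng = np.random.default_rng(seed=2)
n, K = 200, 8
F = rng.normal(size=(n, 2))                       # two latent "diet patterns"
A = rng.uniform(-1, 1, size=(2, K))               # loadings of patterns on food groups
X = F @ A + rng.normal(scale=0.6, size=(n, K))    # observed food-group scores

# Step 1: correlation matrix of the observed variables
R = np.corrcoef(X, rowvar=False)

# Step 2: Bartlett test of sphericity (H0: R is an identity matrix)
stat = -(n - 1 - (2 * K + 5) / 6) * np.log(np.linalg.det(R))
df = K * (K - 1) / 2
print(f"Bartlett chi-squared = {stat:.1f}, P = {chi2.sf(stat, df):.4f}")

# Steps 4 and 5: extract factors with varimax rotation and inspect the loadings
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
print(np.round(fa.components_, 2))   # rows = factors, columns = variables
```

The KMO measure of step 3 needs the partial correlations; packages such as factor_analyzer offer it directly, assuming such a package is acceptable in your environment.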
19.3.2.3 Features of a Successful Factor Analysis

The details are omitted, but there are some steps in factor analysis that are not considered fully scientific. Despite this, the technique is popular with social scientists, and it is now making inroads into the medical sciences as well. But the technique sometimes fails to identify meaningful factors. Its success is assessed on the basis of the following considerations:
1. One of the steps in factor analysis is breaking down the total variation among variables into variations accountable by different factors. The analysis is considered successful when a few factors are able to account for a large part of the total variation, say, more than 70%. In Example 19.8, the total variation explained by the six factors together in males was only 36.9%, and by seven factors in females only 34.3% [21]. This is low and indicates that there were factors other than the identified dietary patterns that were responsible for intake of a specific type of food. Also, in this example, the percentage of variation explained by individual factors ranged from only 4 to 10. This is also low. In many situations, you may find that the first factor alone explains something like 30% of the total variation and the others 15% or 10% each. This did not happen in this example.
2. A criterion for successful factor analysis is to be able to find some factors with very high loadings (close to ±1) and the others with very low loadings (close to 0). Also, the factors should be largely distinct or nonoverlapping. To achieve this, a technique called rotation of axes is sometimes adopted. One popular rotation is called varimax, as it aims to maximize the variance of the factor. Others are quartimax and equamax. This sometimes helps to extract factors with selectively high loadings, and thus to obtain interpretable factors.
3. In a successful factor analysis, the number of identified factors is very small relative to the number of variables. In Example 19.8, the number of factors is 6 or 7 and the number of variables is 35. Thus, the number of factors is indeed low.
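Considerations 1 and 3 are usually checked through the eigenvalues of the correlation matrix, each of which measures the variation accounted for by one factor. A minimal sketch, reusing the hypothetical correlation matrix R from the earlier snippet (the eigenvalue-greater-than-1 screen is discussed further below):

```python
import numpy as np

# R: the K x K correlation matrix of the observed variables (from the previous sketch)
eigenvalues = np.linalg.eigvalsh(R)[::-1]        # sorted in descending order
proportion = eigenvalues / eigenvalues.sum()     # share of the total variation

for i, (ev, p) in enumerate(zip(eigenvalues, proportion), start=1):
    print(f"Factor {i}: eigenvalue = {ev:.2f}, explains {p:.0%} of the variation")

print("Factors with eigenvalue > 1:", int((eigenvalues > 1.0).sum()))
```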
A basic requirement to achieve successful factor analysis is that the underlying common factors are really present. In Example 19.8, food intake was likely to be governed by the dietary patterns. It was thus possible to identify the patterns on the basis of the kind of food consumed. The presence of underlying factors would mean at least a fair degree of correlation between most variables. If correlations are not present, it is futile to try factor analysis. The Bartlett test [7] and the one discussed by Tabachnick and Fidell [3] are available to test the significance of the correlations. These correlations are stated in the form of a matrix in the case of a multivariate setup.

The success of factor analysis, as for all multivariate methods, also depends on the proper choice of variables to include in the analysis. In Example 19.8, the authors divided more than 800 food items and nutrients into 35 food groups. Some of these could be arbitrary. The results can be different if the grouping of food items is different or the number of groups is 50 instead of 35.

Perhaps the most important decision in factor analysis is the choice of the number of factors to be extracted. Theoretically, as many factors can be extracted as there are variables, but only a few factors would be important. This importance is generally assessed by the eigenvalues for the factors. Eigenvalues depend on the correlation structure among the variables. Factors with eigenvalues greater than 1.0 can be considered important because they explain more variance than is explained by a single variable. When common factors are indeed present, only a few factors are likely to have this property. If the number of factors so identified is more than you think it should be, the threshold of the eigenvalue can be raised from 1.0 to 1.5 or any other suitable number. In the example on dietary patterns, the authors used a threshold of 1.25. It is on the basis of this cutoff that they came up with six factors for males and seven for females.

The factors identified after this analysis are not necessarily exclusive. In the example on dietary patterns, the factors might correlate well with lifestyle (physical activity, smoking, use of drugs such as aspirin, etc.) and socioeconomic status. These are the confounders in this case. It cannot be immediately concluded that dietary patterns per se were responsible for differential risks of colon cancer. The lifestyle and socioeconomic status may also have contributed irrespective of the diet pattern.

Another term talked about in the context of factor analysis is communality. Note that the basic premise of factor analysis is that the various variables have one or more common factors. In accordance with its social meaning, the proportion of variance of the variable xk that is shared with the other xs is called its communality. The remainder is unique to xk. Generally, the value of R2 obtained when xk is regressed on the other xs is regarded as the communality of xk. Other methods also exist. A high communality of several xs indicates that the factor analysis is likely to be successful.

The method described so far is mostly for exploratory purposes. This method does not assume any preconceived structure of the factors. If you want to assess whether a set of variables conforms to a known structure of the factors, then confirmatory factor analysis is needed. One of the methods for this is path analysis, which was discussed earlier in this chapter.
For details, see Cattell [22].

19.3.2.4 Factor Scores
As a reverse process, it is possible to obtain factors as a linear combination of the variables, that is,

Fm = bm1x1 + bm2x2 + … + bmKxK,  m = 1, 2, …, M.     (19.6)
The coefficients bmk (m = 1, 2, …, M; k = 1, 2, …, K) are called factor score coefficients. When the values of (x1, x2, …, xK) for a subject are substituted in Equation 19.6, the quantity obtained is called the factor score for that subject. This measures the importance of the factor for that individual. If the factor score for the third factor (F3) is high for the 6th subject and relatively low for the 10th subject, it can be concluded that F3 is influencing the 6th subject more than the 10th subject. Such factor scores for each factor can be obtained for each subject. These scores can be used for a variety of purposes. Chandra Sekhar et al. [23] used these scores to develop an index of need for health resources in different states of India, and Riege et al. [24] used them to study the age dependence of brain glucose metabolism and memory functions. After the discussion of a wide variety of multivariate methods, it might be helpful to have a glance at all these methods. This is given in Table S.8 at the beginning of the book.
References
1. Boshuizen HC, Verkerk PH, Reerink JD, et al. Maternal smoking during lactation: Relation to growth during the first year of life in a Dutch birth cohort. Am J Epidemiol 1998; 147:117–126. 2. Hair JF, Black WC, Babin BJ, Anderson RE. Multivariate Data Analysis, 7th ed. Pearson, 2009.
3. Tabachnick BG, Fidell LS. Using Multivariate Statistics, 6th ed. Pearson, 2012. 4. Shi J, Tian J, Long Z, et al. The pattern element scale: A brief tool of traditional medical subtyping for dementia. Evid Based Complement Alternat Med 2013; 460562. 5. Loehlin JC. Latent Variable Models: An Introduction to Factor, Path and Structural Equation Analysis, 4th ed. Psychology Press, 2003. 6. Schenkman M, Cutson TM, Kuchibhatla M, et al. Exercise to improve spinal flexibility and function for people with Parkinson’s disease: A randomized controlled trial. J Am Geriatr Soc 1998; 46:1207–1216. 7. Indrayan A, Holt M. Concise Encyclopedia of Biostatistics for Medical Professionals. Chapman & Hall/CRC Press, 2017. 8. Introna F Jr, Di Vella G, Campobasso CP. Sex determination by discriminant analysis of patella measurements. Forensic Sci Int 1998; 95:39–45. 9. Everitt BS, Dunn G. Applied Multivariate Data Analysis, 2nd ed. Wiley, 2010. 10. Wang P, Wang Y, Hang B, Mao JH. A novel gene expression-based prognostic scoring system to predict survival in gastric cancer. Oncotarget 2016; 7:55343–55351. 11. Everitt BS. Statistical Methods in Medical Investigations, 2nd ed. Hodders, 1994. 12. Romesburg MR. Cluster Analysis for Researchers. Lulu.com, 2004. 13. Jain NC, Indrayan A, Goel LR. Monte Carlo comparison of six hierarchical clustering methods on random data. Pattern Recognit 1986; 19:95–99. 14. Milligan GW, Cooper MC. An examination of procedures for determining the number of clusters in a data set. Psychometrika 1985; 50:159–179. 15. Indrayan A, Kumar R. Statistical choropleth cartography in epidemiology. Int J Epidemiol 1996; 25:181–189. 16. Everitt B, Landau S, Lease M, Stahl D. Cluster Analysis, 5th ed. Wiley, 2011. 17. Mantel N. Re: Clustering of disease in population units: An exact test and its asymptotic version. Am J Epidemiol 1983; 118:628–629. 18. Fraser DW. Clustering of disease in population units: An exact test and its asymptotic version. Am J Epidemiol 1983; 118:732–739. 19. Khan FG, Rattan A, Khan IA, Kalia A. A preliminary study of fingerprinting of Pseudomonas aeruginosa by whole cell protein analysis by SDS-PAGE. Indian J Med Res 1996; 104:342–348. 20. Kline P. An Easy Guide to Factor Analysis. Routledge, 1994. 21. Slattery ML, Boucher KM, Caan BJ, Potter JD, Ma KN. Eating patterns and risk of colon cancer. Am J Epidemiol 1998; 148:4–16. 22. Cattell RB. The Scientific Use of Factor Analysis in Behavioral and Life Sciences. Springer, 1978. 23. Chandra Sekhar C, Indrayan A, Gupta SM. Development of an index of need for health resources for Indian states using factor analysis. Int J Epidemiol 1991; 20:246–250. 24. Riege WH, Metter EJ, Kuhl DE, Phelps ME. Brain glucose metabolism and memory functions: Age decrease in factor scores. J Gerontol 1985; 40:459–467.
Exercises

Use software where needed.
1. Delete the five top and five bottom subjects from Table 19.1, rerun the multivariate multiple regression analysis as in Example 19.1, and interpret the results.
2. Explain the difference between multivariable and multivariate regressions. What is the basic condition for multivariate analysis?
3. Name the appropriate multivariate analysis and the important requirement for the validity of this analysis under the following conditions: (i) The dependent has two or more variables and the independent has one or more variables—both measured on the metric scale—and the objective is to find the relationship between these two sets. (ii) Suppose five quantitative cancer markers (such as selenium level) are measured for patients with lung, prostate, and esophagus cancers. The objective is to find whether the average levels of these markers together differ in different cancers.
4. A study is conducted to determine gender on the basis of patella measurements. The two measurements found to be important in linear discriminant analysis are the maximum height and maximum width of the patella. The mean maximum height (MH) in males and females was 41.1 and 36.6 mm, respectively, whereas the mean maximum width (MW) was 43.5 and 39.1 mm, respectively. The discriminant equation obtained for the data was
D = −17.052 + 0.282 * MH + 0.141 * MW.
This equation had an approximate 85% correct classification rate in this study.
(i) Determine the threshold value d (see text) using the equal prior probability of 0.5.
(ii) Determine the threshold value d using the unequal prior probabilities of 0.6 for males and 0.4 for females (this ratio is unrealistic, but let us do it as an exercise).
5. A researcher wanted to validate the classification rate for the discriminant function given in question 4. She collected information on MH and MW from 12 males and 12 females. The values observed are as follows:

Males
MH (mm): 39.8  43.8  41.5  39.5  41.2  46.1  38.4  43.1  40.9  41.4  43.4  37.0
MW (mm): 43.1  46.4  44.7  45.2  43.6  43.7  40.2  46.8  41.1  43.7  44.2  38.8

Females
MH (mm): 36.8  37.9  37.7  38.1  35.1  36.5  35.5  37.3  38.4  39.7  34.4  41.0
MW (mm): 40.8  40.1  41.1  43.5  39.7  39.6  40.7  40.6  39.5  40.2  35.7  43.6
Apply the discriminant function in question 4 using the threshold values obtained in parts (i) and (ii), and calculate the correct classification rate in each case.
6. Delete the first 10 subjects from Table 19.1 so that now n = 60. Exclude those with age less than 20 years and those with age 50 years or more, and divide the remaining into age intervals of 20–29, 30–39, and 40–49 years. Calculate BMI and divide it as

Is BP > 140/90 mmHg the right definition for hypertension, or is BP > 160/95 more correct? Both definitions are used. (See Chambless et al. [7] and Dwyer et al. [8]; incidentally, these are nearly consecutive articles in the same journal.) This means that those with BP = 150/92 are considered hypertensive by one definition but not by the other. Is waist–hip ratio (WHR) an appropriate measure of obesity, or is body mass index (BMI) more appropriate? If it is BMI, should the threshold be 25 or 30 or somewhere in between, such as 27? How can one distinguish, without error, a hypothyroid condition from euthyroid goiter? How can one correctly ascertain the cause of death in an inaccessible rural area where a medical doctor is not available? How can age be assessed when a dependable record is not available? Can the diet content of an individual be adequately assessed by taking a 3-day history? How can physical activity be validly measured?

These examples illustrate how crucial validity is. Instruments of different hues are used, and their validity is never perfect. Validity is measured against a gold standard. Such a standard, in most cases, is elusive, and consequently validity suffers. Once an acceptable gold standard is available, indices such as sensitivity, specificity, and predictive values are calculated to measure validity—see Chapter 9 for these indices. In the context of a medical score, validity is assessed in terms of, for example, its ability to distinguish between a sick and a healthy subject or a very sick and a not so sick subject. This means that the scores received or the responses obtained from one group of subjects should be sufficiently different from those of another group when the two groups are known to be different with respect to the characteristic under assessment. Consider a questionnaire containing 20 items measuring the quality of life. If the response on one particular item, say, on physical independence, from the healthy subjects is nearly the same as that from the mentally retarded subjects, then this item is not valid for measuring the quality of life in those types of subjects.

20.2.1.1 Types of Validity

Validity is an intricate concept that changes according to the setup. Important types of validity for the purpose of medical applications are face, criterion, concurrent, content, and construct validity.

Face validity is an apparent correspondence between what is intended and what is actually obtained. Grossly speaking, ovarian cancer cannot occur in males, and prostate cancer cannot occur in females. Cleft lip, deafness, and dental caries cannot be causes of death. If any data show such inconsistency, these are not face valid. Face validity is also violated when a patient is shown as discharged from a hospital and a cause of death within the hospital is also recorded. Current age cannot be less than the age at first childbirth. Thus, face validity is achieved when the observations look just about right.
If a man gives his age as 20 years but looks like a 40-year-old, the response is not face valid. If a family reports consuming food amounting to 2800 cal per unit per day but the children are grossly undernourished, then again, the response or the assessment is not face valid.

An instrument or a method is called criterion valid if it gives nearly the same information as a criterion with established validity would. Carter et al. [9] reported the criterion validity of the Duke Activity Status Index with respect to standard pathologic work capacity indices in patients with chronic obstructive pulmonary disease. Whenever a new index or a score or any other method is developed, it is customary that its criterion validity is established by comparing its performance with a standard. A review of the literature suggests that the statistical method generally used for this purpose is correlation, whereas the right assessment of criterion validity is obtained by evaluating predictivity. If negative and positive predictivities are high, the tool can be considered criterion valid.

It is not uncommon that no validated standard is available for a particular medical condition. For example, no valid measure is available to assess physical balance in people with vestibular dysfunction. The Berg Balance Scale is often used, but it is not considered a fully valid measure. A relatively new tool is the Dynamic Gait Index. The two can be investigated for agreement. Again, this is different from correlation. If the agreement is good, the two methods can be called concurrently valid. This means that both are equally good (or equally bad!). This, however, does not establish the superiority of one over the other. The superiority can be inferred only when the two methods under evaluation are compared with a known gold standard.

The fourth type is content validity. This is based on the domain of the content of the measurement or the device. Would you consider kidney function tests such as urea clearance and diodrast clearance content valid for assessing the overall health of the kidneys? Such tests are restricted to specific aspects and do not provide a complete picture. Content validity is often established through qualitative expert reviews. Wynd and Schaefer [10] explain how the content validity of the osteoporosis risk assessment tool was established through a panel of experts.

The last is construct validity. This seeks the agreement of a device with its theoretical concept. BMI is construct valid for overall obesity but not for central obesity, whereas WHR is construct valid for central obesity and not for overall obesity. Construct validity is sometimes assessed by the statistical method of factor analysis that reveals "constructs." If these constructs correspond well with the ones that are otherwise theoretically expected, the tool is considered construct valid.

20.2.2 Reliability of Instruments

You know that a clinical thermometer is a reliable tool for measuring body temperature. It behaves the same way under a variety of conditions. The reliability of an instrument is its ability to be consistent across varying measurements in the sense of providing the same results when used by different observers or in varying conditions. The performance of a reliable tool is not affected much by the external environment in which the tool is used. Unlike validity, the concept of reliability is not related to any gold standard.
Two kinds of reliability, namely, interrater reliability and interlaboratory reliability, were discussed in an earlier chapter. The latter is also called reproducibility. Both can be measured by the extent of agreement. This is applicable to one item at a time. An instrument, such as a questionnaire, can have multiple items. How does one assess the reliability of such an instrument? The reliability of a multiple-item instrument, which is now called a "test" for convenience, has two components: (a) internal consistency and (b) stability across repeated use. The first concerns the consistency of responses across various items in the same test. The second is self-explanatory and is also called repeatability.

20.2.2.1 Internal Consistency

Consider a test in the form of a questionnaire containing several questions or items on, say, a scale of 0–5. An example is the assessment of ability to perform activities of daily living (ADL) by a geriatric population. Items such as ability to dress, bathe, walk around, and eat can be scored from 0 for complete inability (full dependence) to 5 for complete independence requiring no assistance. It is expected that the ability score on one item would correspond to the score on other items. If there were a difference, it would generally persist across subjects. In fact, all items are different facets of the same entity—the ADL in this case. That the underlying construct is uniform across all items of a test is an important prerequisite for measuring internal consistency. If that is so, the responses will be consistent with one another, provided the items are properly framed and the questions appropriately asked.

One way to measure internal consistency is to split the test into ostensible halves where feasible. This is generally possible for a questionnaire by randomly dividing questions into two parts. The other method is to put odd-numbered questions in one half and even-numbered questions in the other half. The product–moment correlation coefficient between total scores in the two parts across several subjects is a measure of internal consistency. This is called split-half consistency. Note that there is no one-to-one correspondence between one item in one-half of the test and another item in the other half of the
test—thus, intraclass correlation cannot be used in this case. The usual product–moment correlation between total scores is used to assess split-half consistency. Split half implies that the number of items available for calculating the total score is one-half of the items in the whole test. Such a reduction in the number of items adversely affects the reliability assessment. An adjustment, called the Spearman–Brown coefficient, can be used to estimate the reliability if the test were to consist of all the items (for two halves with split-half correlation r, this estimate is 2r/(1 + r)). For details, see Carmines and Zeller [11].

20.2.2.2 Cronbach Alpha

The second method for assessing internal consistency is
Cronbach alpha = Mr / [1 + (M − 1)r],     (20.3)
where M is the number of items in the test, and r is the average of M(M − 1)/2 correlations between each pair of M items. This version is called the standardized Cronbach alpha since product–moment correlation is being used in place of variances and covariances. As you can see, Cronbach alpha is applicable only when the response to various items is on a metric scale, such as the graded response on ADL in the previous example. A basic assumption is that the variances of the scores on different individual items are equal. If this is not the case, the measure in Equation 20.3 needs modification, as indicated by Carmines and Zeller [11]. Equation 20.3 shows that the value of the Cronbach alpha depends on the number of items in the test. When r = 0.3, alpha = 0.68 for M = 5, and alpha = 0.87 for M = 15. Thus, the Cronbach reliability of a test can be increased by simply increasing the number of items, provided that the average correlation does not deteriorate. The reason for such behavior is not too hard to find. The Cronbach alpha assumes that the items included in the test are a random sample from a universe of many possible items. The higher the number of items, the better the representation. The following comments give some more information regarding the Cronbach alpha:
1. The Cronbach alpha is based on all items of the test, and thus is a better indicator of reliability than the split-half coefficient. The latter is based on only half the items. Also, split half is based on the sum total of the scores on the items in the half test, whereas the Cronbach alpha considers each item separately and uses the average correlation.
2. The Cronbach alpha measures how well the items are focused on a single idea or an entity or a construct. All items are supposed to be focused on the same entity, such as disability in the ADL example. But it does not tell you how well the entity is covered. A high value of this alpha is possible even when the breadth of the construct is covered only partially.
3. The Cronbach alpha measures the internal consistency component of reliability of an instrument as a whole and not of individual items. However, correlations of individual items with one another can be examined to find whether any particular item is adequately related to the others.
4. The Cronbach alpha can also be viewed as the correlation between the test and all other possible tests containing the same number of items on the same entity or construct. In the ADL example, the questions can be on being able to climb short stairs instead of walking around. If the test is reliable, the alpha will not materially change when an item is replaced by another equivalent item.
5. All measures of reliability, including the Cronbach alpha, range from 0 to 1. Thus, these measures are easy to interpret. A value less than 0.4 indicates poor reliability, a value between 0.4 and 0.6 is a toss-up, a value between 0.6 and 0.8 is acceptable, and a value more than 0.8 is good. However, an exceedingly high value, such as >0.90, suggests some repetition and redundancy.
6. If the response to items is binary (yes or no, true or false, or present or absent type), then the Cronbach alpha reduces to what is called the Kuder–Richardson (K-R) coefficient. For details, see Carmines and Zeller [11].
7. All measures of internal consistency are affected by the length of the test, the homogeneity of the test items, the heterogeneity of the subjects in the sample to whom the test was administered, and the objectivity of the test items. In addition, the conditions of administration of the test, such as the time allowed for completing the test and observer–subject interaction, affect the reliability assessment.
Example 20.4: Internal Consistency of an AIDS Questionnaire

A questionnaire consisting of four items on knowledge of acquired immunodeficiency syndrome (AIDS) was administered to 80 students of grade VII. These are as follows:

Item 1: Knowledge about mode of spread of human immunodeficiency virus (HIV) infection
Item 2: Knowledge about its prevention
Item 3: Knowledge about social handling of AIDS patients
Item 4: Knowledge about symptoms of AIDS

Each item is graded on a 4-point scale from 0 for no or completely wrong knowledge to 3 for perfect knowledge. Suppose the correlations are as shown in Table 20.2. These give r = 0.563. Thus,

Cronbach alpha = (4 × 0.563) / (1 + 3 × 0.563) = 0.84.

The internal consistency is thus "good," despite the fact that the fourth item has a poor correlation with the other items. Low correlation of this item with the others is an indication that knowledge of symptoms is probably not part of the same entity that is being measured by the other three items.

TABLE 20.2
Correlation between Items of Knowledge about AIDS

Item    2       3       4
1       0.87    0.62    0.45
2               0.75    0.33
3                       0.36
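The arithmetic of Example 20.4 is easy to check in a few lines. This is a minimal sketch, assuming the correlations of Table 20.2 are entered by hand:

```python
import numpy as np

# Upper-triangle correlations from Table 20.2: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
pairwise_r = np.array([0.87, 0.62, 0.45, 0.75, 0.33, 0.36])
M = 4                                   # number of items

r_bar = pairwise_r.mean()               # average of the M(M - 1)/2 correlations
alpha = M * r_bar / (1 + (M - 1) * r_bar)
print(f"average r = {r_bar:.3f}, standardized Cronbach alpha = {alpha:.2f}")
# average r = 0.563, standardized Cronbach alpha = 0.84
```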
20.2.2.3 Test–Retest Reliability

The second component of reliability is repeatability. This is the stability of the response with repeated use of the instrument. The measurement of repeatability essentially involves administering the test at least twice to the same subjects. This is therefore known as test–retest reliability. It can be calculated for an instrument that does not provide any learning to the respondent, so that the second-time responses are not affected by the first-time responses. The scores or the responses obtained on the two occasions are checked for agreement. These can be checked item-wise, but generally the total score is compared. You now know that the extent of agreement is measured by intraclass correlation if the responses are quantitative and by Cohen kappa if they are qualitative.

Suppose you want to find out whether laboratory A is more reliable than laboratory B in measuring plasma glucose levels. One strategy for this could be to split each blood sample into six parts. Send three parts to laboratory A and three to laboratory B. Do this for, say, 25 subjects. Keep the laboratories blind regarding sending aliquots of the same sample. Calculate the intraclass correlation for each laboratory. If a laboratory is reliable, the measurements on the different parts of the same sample will give nearly the same values, yielding a high value of the intraclass correlation coefficient. The laboratory with the higher intraclass correlation is more reliable. Note that the reliability may be high even when there is a constant bias in the measurement. It is possible that a laboratory has an intraclass correlation coefficient of 0.90 but consistently reports values lower or higher than the actual value. Thus, reliability does not ensure validity.

Example 20.5: Repeatability of Vision in Keratoconus

Keratoconus is a disease in which the cornea bulges and becomes conical in shape. It is believed that these patients generally report variable vision on different occasions. Gordon et al. [12] measured visual acuity (VA) in 134 keratoconus patients on two visits separated by a median of 90 days. The examiner was the same in the two visits for some subjects but different for other subjects. High- and low-contrast VA was measured with the patient's habitual visual correction and with the best correction monocularly. Thus, four measurements were available for each patient for each visit. The intraclass correlation between VA on the two visits ranged from 0.76 to 0.85. The VA score was somewhat higher at the repeat visit than at the baseline visit when the examiners were different between visits. It was concluded that the VA in this sample was very repeatable, and repeatability was slightly poorer when different examiners tested VA at the baseline and repeat visits.
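The laboratory comparison described above rests on the intraclass correlation. A minimal sketch of that strategy, assuming a one-way random-effects ICC and hypothetical triplicate measurements for one laboratory:

```python
import numpy as np

# Hypothetical data: 25 blood samples, 3 aliquots each, measured by one laboratory
rng = np.random.default_rng(seed=3)
true_glucose = rng.normal(100, 15, size=25)                  # mg/dL, per sample
measurements = true_glucose[:, None] + rng.normal(0, 4, size=(25, 3))

n, k = measurements.shape
sample_means = measurements.mean(axis=1)
grand_mean = measurements.mean()

# One-way ANOVA mean squares: between samples and within samples
ms_between = k * ((sample_means - grand_mean) ** 2).sum() / (n - 1)
ms_within = ((measurements - sample_means[:, None]) ** 2).sum() / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)  # ICC(1,1)
print(f"Intraclass correlation: {icc:.2f}")
```

Running the same computation on the other laboratory's aliquots and comparing the two ICCs implements the comparison described in the text.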
Example 20.6: Reliability of SF-36 in Chronic Schizophrenia

Measurement of the changes in physical and role functioning in outpatients of mental health clinics has been a challenge. Russo et al. [13] tested a 36-item health survey form (SF-36) on 36 outpatients with chronic schizophrenia (incidentally, in this example, n = 36 and M = 36 are equal). The form was completed in writing as well as orally by each patient. The test–retest reliability as well as internal consistency showed that SF-36 is an appropriate outcome measure for schizophrenic outpatients.
20.3 Quality of Statistical Models: Robustness

Models by definition are relatively simple statements of complex processes. They are man-made manifestations of nature. By their very definition, models are imperfect representations of the real world and, at best, only approximations of the actual situation. Statistical models are no different. They can never be "true," although that is the assumption behind many model-based results. Yet, many statistical models provide remarkably useful approximations and are adopted in practice. Given the complex interrelated nature of health and disease, statistical models also tend to become complex. They go much beyond the simple ±2 SE thresholds. Efforts are made to keep parsimony intact so that the utility is not compromised. Because of this limitation, most statistical models used in medicine and health are fragile. Often, they work only in very specific situations, and a slight variation in the conditions can produce weird results.

The antonym of fragility is robustness. This means consistency, sustainability, and stability of results in less than ideal conditions. It refers to insensitivity of the overall conclusion to various limitations of data, limited knowledge, and varying analytical approaches. By showing the endurance of results to withstand the pressure of minor variations in the underlying conditions, robustness establishes the wider applicability of models. The results that the risk of coronary disease is higher in persons with higher BP and that the risk of lung cancer is higher in persons smoking heavily are robust. The presence or absence of other factors does not affect these results much. Also, this is seen in all population segments. On the other hand, the conclusion that urinary sodium excretion is dependent on salt intake is not robust since it is so easily affected by metabolism, creatinine excretion, and BMI. The result that the risk of lung cancer increases by 1% by smoking 100 additional cigarette-years is also not robust. This chance can easily be 2% or 0.5% depending on the person's nutrition, exposure to kitchen smoke, and such other factors.

Robustness is a relatively new concept and is not in regular practice yet in medicine. Only good researchers use this strategy to rule out challenges from critics. Robustness is also difficult to ascertain. The evaluation could be done for the set of underlying conditions as a whole, but is generally done for its components, such as choice of variables, type of instrumentation, and the methodology. Individual conditions are altered in turn to examine whether the broad conclusion still remains the same. If you are using a published result for your practice, examine whether the result is sufficiently robust to the altered conditions under which you intend to apply it. This, in a way, can also help to minimize the impact of uncertainties.

20.3.1 Limitations of Statistical Models

You know that most biological phenomena are very complex, and a large number of factors need to be included if these are to be expressed in terms of a mathematical expression. For example, a prerequisite for a regression model to be useful is that all relevant regressors are included. This may be hindered by lack of knowledge about the relevant regressors. For example, if it were not adequately known which factors influence the duration of survival after detection of malignancy, any model on survival would be tentative. Other limitations of statistical models are as follows:
1. Models may not help—even mislead—if the underlying processes are not fully specified. For this, the underlying process must be understood first.
2. A critical step is to decide whether a model is credible, plausible, and consistent with existing knowledge and actual observations. This step was ignored in earlier literature but is now almost ritually followed.
3. Models are accepted in the absence of a better alternative, despite the aleatory and epistemic uncertainties they contain.
4. Models are based on past experience but are used for predicting the unknown. Sometimes they are used for forecasting the future. This could be risky. Past trends may change in the future due to interventions such as improved strategies. Thus, guard against overconfidence.
5. Beware of the butterfly effect. As in the theory of chaos, a tiny error in the beginning can end up in a tornado. A forecast can go haywire if there is even a minor error in the input values.
6. Many models simulate the present well, yet disagree in the magnitude of changes they predict for the future.
7. Models are not unique, and different parameters can reasonably reproduce the same values. Thus, a model needs to be checked under varying conditions, as stipulated under sensitivity analysis. This method is discussed in a short while in this chapter.
8. It follows from the above that good agreement between the model and the observed data does not imply that the model assumptions accurately describe the underlying process—only that the model is one of the plausible explanations and is empirically adequate.
Models are used for prediction as well as for explanation of relationships. Both require judicious identification of what is signal and what is noise, but prediction is a relatively serious business, and noise could be a greater hindrance in this activity. Models cannot be simplistic without compromising their utility. We should be aware of the conceptual limits of what can and cannot be adequately predicted. No prediction is certain, and suspicion remains. The predictive performance of a model can be judged only when it is put to a real-life test. Sometimes a good model seems like it is not correctly predicting because some people change themselves in reaction to a prediction—either positively in favor of the prediction or negatively against it. These caveats do not imply that model building is a worthless exercise—only that the adequacy of a model for future prediction is difficult to assess, as the underlying uncertainties can be very unpredictable. Also, modeling is a continuous process—it evolves as new evidence emerges. Less effective components are replaced or modified, whereas successful components are retained.

Nevertheless, statistical models do approximate biological processes fairly well when prepared with sufficient care and when the missing links are not too many. But their adequacy must be fully assessed before they are actually used. For this, three types of approaches are available: (a) external validation, (b) sensitivity and uncertainty analysis, and (c) resampling. All these approaches answer questions regarding the population settings, outcome variables, and antecedents to which the model can be generalized. Opposed to this, as you might have noticed, the concern in most statistical methods discussed in this book is with internal validity. Internal validity primarily answers questions regarding the effect really being present rather than arising from extraneous considerations.

20.3.2 Validation of the Models

The foregoing discussion may have convinced you that models need to be validated before they are used. Two kinds of validation can be done: internal validation and external validation.

20.3.2.1 Internal Validation

This is the process whereby the existing data on which the model was developed are used for the validation. Among many, one method of internal validation is to divide your sample (if large) into, say, 10 equal parts. Use the first nine parts (90% of the data) to develop the model and test it on the remaining 10%. Then exclude the second 10%, develop the model on the other 90%, and test it on the excluded 10%—and so on. If the results are consistent, there is evidence of internal validation.

The second is the split-sample method. This also requires that a large number of eligible subjects be available for the study. Under this method, the information is collected on all subjects, but only a random half is used for statistical analysis. If the result fails statistical significance in this half sample, just draw lessons for another study and do not draw any substantive conclusion regarding the question under research. If significance is achieved, use the second half group of subjects and examine whether the results are valid. The first half sample is called the training dataset, and the second half the test dataset. If the test dataset supports the model derived from the training dataset, combine the two and obtain more reliable (due to increased n) estimates of the parameters of the model.
Suppose you want to develop a model to predict the gestation age on the basis of anthropometric measurements at birth in situations where the date of last menstrual period is not known, as can happen with less educated women. If a large sample, such as 1800, is available, a random 900 of them can be used to develop the model, and the other 900 to test it. If testing gives correct gestation, such as within ±1 week in at least 80% of cases, the model can be considered robust for application, especially when obtained on the combined data.
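The 10-part rotation described above is the familiar 10-fold cross-validation. A minimal sketch, assuming a feature matrix `X` of birth measurements and an outcome `y` of gestational ages in weeks are already assembled as NumPy arrays, and using scikit-learn's utilities:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# X: n x p array of anthropometric measurements at birth; y: gestational age in weeks
model = LinearRegression()
within_one_week = []

for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])       # develop the model on 90% of the data
    pred = model.predict(X[test_idx])           # test it on the held-out 10%
    within_one_week.append(np.mean(np.abs(pred - y[test_idx]) <= 1.0))

# The text's criterion: correct to within +/-1 week in at least 80% of cases
print(f"Proportion within 1 week, per fold: {np.round(within_one_week, 2)}")
print(f"Overall: {np.mean(within_one_week):.0%}")
```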
20.3.2.2 External Validation

A model may turn out to be valid when the datasets for developing it are nearly the same as those used to evaluate it. This is a kind of circular reasoning. External validation overcomes this drawback. Although statistical significance implies that the results are replicable, a trial of the results on another group of subjects is desirable to establish validity. Agreement in results on repeated application means that the model is more likely to give correct results in practice. To examine this kind of robustness, the first step is to test the results on another group from the same target population. If found valid, the next step is to test the results on a group of subjects from a different but similar population. Sometimes, the settings and measurement scales are varied to assess the external validity. A new sample of subjects for this purpose can be taken in a variety of ways, and that should not affect the validation exercise. Validation on another sample, as just mentioned, requires a new sample at another point in time, possibly from another setting. If this sample also gives the same result, confidence increases tremendously. In the gestational age example in the preceding section, the validation sample can be taken from some other hospital located in the same area or in some other area with the same socioeconomic milieu. Many researchers leave such validation to future researchers. The McGill pain score has been validated in a variety of settings by different researchers, yet its utility in a largely illiterate population remains questionable. Try to use properly validated results in your professional practice and research.

Example 20.7: Validation of an Automatic Scoring Procedure for Discriminating between Breast Cancer Cases and Controls

The micronucleus test (MNT) in human lymphocytes is frequently used to assess chromosomal damage. Varga et al. [14] developed the MNT on the basis of computerized image analysis. In a test sample of 73 persons (27 breast cancer cases and 46 controls) in Germany, the automated score gave an odds ratio (OR) of 16. This is a substantial improvement over OR = 4 for the conventional score based on visual counting. The improvement was confirmed in a validation sample of 41 persons (20 cases and 21 controls) that gave OR = 11.

SIDE NOTE: In this study, the variation in OR between the test and validation samples is high. This tends to reduce the confidence
in the result.
20.3.3 Sensitivity Analysis and Uncertainty Analysis

As stated earlier, models, particularly statistical models, are not unique. Different sets of parameters may reasonably reproduce the same sample values. Thus, the model needs to be checked under varying conditions. Sensitivity analysis refers to the study of the effect of changes in the basic premise, such as individual and societal preferences or assumptions made at the time of model development. These assumptions are made to plug knowledge gaps. Against this, uncertainty analysis is the process of measuring the impact on the result of changing the values of one or more key inputs about which there is uncertainty. This uncertainty mostly arises due to the use of sample estimates. The inputs are varied over a reasonable range that can practically occur. In essence, sensitivity analysis is primarily for epistemic uncertainties, whereas uncertainty analysis is mostly for aleatory uncertainties.

20.3.3.1 Sensitivity Analysis

In the older calculation of disability-adjusted life years (DALYs), the assumption that a death at a young age is much more important than death in childhood or old age is a value choice. In addition to such assumptions, the variables used for developing a model are also basic to the model. These also depend on the choice of the investigator. A simple example is modeling systolic BP in healthy subjects to depend on age, sex, and BMI. Another choice could be age, sex, and socioeconomic status. These choices are deterministic rather than stochastic. More accurate measurement or scientifically sound methodology does not help alleviate this uncertainty. Sensitivity analysis is varying these basic conditions and verifying that the broad conclusion still remains the same. Thus, this is also an exercise in external validation.

The risk of coronary disease can be modeled to depend on the presence or absence of diabetes, hypertension, and dyslipidemia. This model might be able to correctly predict the 10-year risk in 62% of cases. However, the addition of smoking and obesity can increase this to 70%. This addition of 8% is substantial. Thus, the predictivity of coronary disease is sensitive to the choice of risk factors. The first model is based on three risk factors, and the second on five factors. Had the contribution of smoking and obesity been only 2% or 3%, the conclusion would be that prediction of coronary disease is insensitive to smoking and obesity when diabetes, hypertension, and dyslipidemia are known. Then, robustness would have been established.

Sensitivity analysis investigates how the result is affected when the basic premise is altered. Unlike uncertainty analysis, this has nothing to do with variation in the input values. The input factors themselves are changed.
analysis deals with uncertainty in model structure, assumption, and specification. Thus, it pertains to what is called the model uncertainty. Example 20.8: Sensitivity Analysis of Shock for Predicting Prolonged Mechanical Ventilation in Intensive Care Unit Estenssoro et al. [15] compared Argentinean patients on mechanical ventilation in the intensive care unit (ICU) for more than 21 days (n = 79) with those with less ventilation (n = 110) for severity score, worst PaO2/FIO2 fraction, presence of shock on ICU admission day, length of stay in ICU, and length of stay in the hospital. Logistic regression identified shock on ICU admission day as the only significant predictor, with OR = 3.10. This analysis excluded patients who died early. It was not known whether this result would hold for patients dying early. To bridge this epistemic gap, the authors conducted a sensitivity analysis by including 130 patients who died early. Shock remained a powerful predictor. The conclusion is that the only prognostic factor for prolonged mechanical ventilation is shock on ICU admission day irrespective of early or late death (or survival). Perhaps, shock itself is a by-product of the severity of illness and hypoxemia.
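The mechanics of such a sensitivity re-run are easy to script. The following Python sketch is only an illustration on simulated data, not the data of Estenssoro et al.: the sample sizes, effect size, and variable names are our assumptions. A logistic model for prolonged ventilation on shock is fitted first to the primary analysis sample and then refitted after adding the patients excluded earlier, and the two ORs are compared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=3)

def simulate(n):
    # Hypothetical cohort: shock at ICU admission (0/1) and a binary
    # outcome (1 = ventilation > 21 days) with an assumed OR of about 3
    shock = rng.integers(0, 2, size=n)
    p = 1 / (1 + np.exp(-(np.log(3.0) * shock - 0.8)))
    return shock.reshape(-1, 1), rng.binomial(1, p)

X1, y1 = simulate(189)   # patients in the primary analysis
X2, y2 = simulate(130)   # patients excluded earlier (e.g., early deaths)

# Large C effectively removes sklearn's default regularization
primary = LogisticRegression(C=1e6).fit(X1, y1)
combined = LogisticRegression(C=1e6).fit(np.vstack([X1, X2]),
                                         np.concatenate([y1, y2]))

print(f"OR for shock, primary analysis:   {np.exp(primary.coef_[0, 0]):.2f}")
print(f"OR for shock, sensitivity re-run: {np.exp(combined.coef_[0, 0]):.2f}")
```

If the refitted OR stays in the neighborhood of the original, the conclusion is robust to the altered inclusion criterion, which is exactly the logic of Example 20.8.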
The purpose of sensitivity analysis is to examine whether the key result continues to point in the same direction when the underlying structure is altered within a plausible range. The process involves identifying the key outcome in the first place, and then the basic inputs that can affect this outcome are also identified. For example, in a clinical trial setup, the outcome could be mortality or the length of hospital stay. Both should generally lead to nearly the same conclusion. Thus, the outcome measure can also be changed to see if the results are still the same. Patients' inclusion and exclusion criteria can be relaxed, the method of assessment can be altered, and even the data analysis methods can be changed to see if this affects the final result. Intention-to-treat (ITT) analysis can be tried, in addition to the regular per protocol (PP) analysis that excludes the missing or distorted data. If the results do not materially change, the confidence in the results strengthens. 20.3.3.2 Uncertainty Analysis The study of the effect of varying the values of parameters included in the analysis is called uncertainty analysis. Thus, this pertains to parameter uncertainty, as opposed to the model uncertainty in sensitivity analysis. Examples of such parameters used in the calculation of DALYs are incidence, prevalence, and duration of disease and mortality rates for various health conditions. The value of DALYs will naturally change if any of these are changed. Correct estimation of these parameters is vital to the validity of the DALYs obtained. The uncertainty surrounding these parameters can be reduced through more accurate measurement and by adopting a scientifically sound methodology of estimation. This is not possible for the preferences mentioned for sensitivity analysis. In medicine, uncertainty analysis is generally done for a model that relates an outcome to its antecedents. For example, urinary excretion of creatinine can be predicted by age, dietary constituents, and lean body mass. If age is reported approximately in multiples of 5, such as 45 years instead of the exact 43, will the model still be able to predict nearly the correct value of urinary creatinine? If this example is not convincing, consider the effect of changes in diet from day to day. If the outcome prediction remains more or less the same, the model is considered robust. Uncertainty analysis incorporates three types of aleatory variations in the inputs:
1. Random measurement errors, such as the approximate age in the creatinine example. 2. Natural variation that can occur in the input parameters, such as dietary constituents changing from day to day. If the model is based on the average diet over a month, it may or may not be able to reflect the effect of daily changes. If it is based on the diet of the previous day, what happens if the diet changes the next day? 3. Variation in the multipliers used in the prediction. If ln(creatinine in mmol/day) is predicted as (0.012 × height in cm − 0.68) in children, what happens if the multiplier of height is 0.013 or 0.011? The multiplier 0.012 is an estimate, which is subject to sampling fluctuation. This multiplier can change depending on the subjects that happen to be in the sample. Example 20.9: Uncertainty Analysis of Estimates of Health-Adjusted Life Expectancy Calculation of health-adjusted life expectancy (HALE) for any population requires three inputs: (1) life expectancy at each age, (2) estimates of the prevalence of various nonhealthy conditions at each age, and (3) a method of valuing the nonhealthy period in comparison with full health. All three are estimates and have inbuilt sampling and other variation. The life expectancy of females at birth in Brazil was estimated as 72 years in 2016, but it could actually be 71 or 73 years. Life expectancy is relatively robust, but such variation is more prominent in the prevalence rates of various diseases. Many of these prevalences
are not known, and they are estimated by an indirect method. A statistical distribution can be imagined around all such estimates. Salomon et al. [16] have described a method to calculate the uncertainty interval around HALE for each member country of the World Health Organization. They used computer simulations to generate statistical distributions around the input values and thus propagated the uncertainty in the inputs into an uncertainty interval for HALE.
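The simulation strategy itself is easy to sketch. The following Python fragment is a minimal illustration, not the procedure of Salomon et al.: the input distributions, their parameters, and the simple HALE-like formula are all assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(seed=2018)
n_sim = 10_000

# Assumed input distributions (illustrative only): life expectancy at
# birth around 72 years, prevalence of a nonhealthy state around 10%,
# and an assumed disability weight of 0.3 for the nonhealthy period
life_exp = rng.normal(72.0, 0.5, size=n_sim)
prevalence = rng.beta(20, 180, size=n_sim)
hale = life_exp * (1 - 0.3 * prevalence)

# The middle 95% of the simulated values is the uncertainty interval
lo, hi = np.percentile(hale, [2.5, 97.5])
print(f"HALE: {hale.mean():.1f} years, "
      f"95% uncertainty interval ({lo:.1f}, {hi:.1f})")
```

The width of the resulting interval reflects the assumed spread of every input, not just the sampling error of the final estimate.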
An uncertainty interval is very different from the CI. In Example 20.9, the CI would be based exclusively on the SE of the HALE without considering the variation in input values. In comparison, the uncertainty interval would be much larger, as it incorporates the effect of variation in each of the input values. This variation depends on the repeatability of instruments, the number of measurements, and other such sources of variation. An uncertainty analysis is recommended when it is necessary to disclose the potential bias associated with models that use a single value of the parameters, particularly when your calculations indicate the need for further investigation before taking any action. Also, do uncertainty analysis if the proposed model has serious consequences. The preceding discussion is in the context of health outcomes, but the major application of uncertainty analysis has been in the context of costing. Health care cost can substantially vary depending on the quantity and quality of various inputs, and uncertainty analysis helps to delineate the range of costs for varying inputs. 20.3.4 Resampling When the opportunity for studying another sample does not exist, the results can still be validated by using what is called resampling. Under this method, subsamples are drawn from the existing sample, and it is checked whether the results replicate on such subsamples. Thus, this method assesses the reproducibility or reliability of the results. Two popular methods of resampling are bootstrap and jackknife. The following is a brief and introductory account. For details, see Good [17]. 20.3.4.1 Bootstrapping You probably know that a bootstrap is the loop at the back of a boot that is used to pull the boot on. In science, the term bootstrap is used for an algorithm with an initial instruction such that subsequent steps are automatically implemented. In statistics, bootstrapping means generating thousands of samples from the available sample. You know that the distance ±1.96 SE in a 95% CI comes from the Gaussian distribution. This can also be used for a CI for the mean from a non-Gaussian distribution when n is large because then the central limit theorem comes to the rescue. This does not happen for estimating nonadditive parameters such as the median (e.g., ED50) and quartiles, nor for the mean of a highly non-Gaussian distribution. For these situations, ±1.96 SE cannot be used. The bootstrap method was originally devised to find the CI in such situations. This is done by generating a pseudodistribution using thousands of samples drawn by computer with replacement from the available sample, thus creating a proxy universe. This universe is used for finding the CI. Note that the method is nonparametric. The actual bootstrap procedure is something like this. If you have a sample of size n = 18, select one value at random from this sample. Replace this value so that 18 values are available again. Select another from these 18. This is your second value in the generated pseudosample. Since the sampling is with replacement, the same value can be selected again. Do this 18 times and you have a pseudosample of size 18. In a rare case, the same value can occur 18 times in one pseudosample. You can repeat the process and build thousands of pseudosamples from one sample. Each of these samples can give the mean, median, SD, or whatever statistic is required. You will have thousands of these values and can generate a pseudodistribution of the sample median, sample quartile, and so forth.
Use this distribution to find, for example, a range such that 2.5% of the medians are less than the lower limit of this range and 2.5% are more than the upper limit. This is your 95% bootstrap CI for the median. This can be done for any parameter of interest. Note how the available data are used to tell more about the data. No extra help is needed—thus the name bootstrap. The difficulty is that if the original sample is bad (not representative), the bootstrap samples will also be bad, and you will have a false sense of security. However, the concern in this section is with the robustness of results. For this, bootstrapping involves taking several subsamples from the same group that was actually studied and examining whether the results replicate. This is slightly different from what was just described. For checking robustness, this method should be applied only when the available sample is large. If the result is based on a group of 200 subjects, you can take several random subsamples of, say, 150 subjects from this sample of 200. Of course, these are not exclusive samples but are drawn with replacement. Each subsample will have some different subjects, but most will be those that appear in the previous subsample. This overlap does not matter. Although such resampling can be done many times, generally three or four subsamples of this type would be enough to assess robustness. As opposed to this, for finding the CI, hundreds of bootstrap samples are required.
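A minimal Python sketch of the bootstrap CI for a median follows; the sample values are made up for illustration, and 10,000 pseudosamples are used.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical sample of n = 18 measurements
sample = np.array([4.1, 5.6, 3.9, 7.2, 5.0, 6.3, 4.8, 5.9, 6.7,
                   3.5, 8.1, 5.2, 4.4, 6.0, 5.5, 7.8, 4.9, 5.1])

n_boot = 10_000
medians = np.empty(n_boot)
for b in range(n_boot):
    # One pseudosample: n values drawn with replacement
    pseudo = rng.choice(sample, size=sample.size, replace=True)
    medians[b] = np.median(pseudo)

# Cut off 2.5% of the medians on each side of the pseudodistribution
lo, hi = np.percentile(medians, [2.5, 97.5])
print(f"sample median: {np.median(sample):.2f}")
print(f"95% bootstrap CI for the median: ({lo:.2f}, {hi:.2f})")
```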
20.3.4.2 Jackknife Resampling The term jackknife signifies a ready tool that can be opened and folded and used in a variety of situations. Statistically, the only difference between the bootstrap and the jackknife is that in the jackknife, one or more values are serially dropped at a time and the results recalculated. Thus, in jackknife resampling, the subsamples are not necessarily random. This method can be used even when the sample is relatively small, such as 10 or 15. For dropping two at a time, the procedure would be as follows. If you have a sample of n = 12 subjects, drop numbers 1 and 2 and recalculate the result on the basis of numbers 3–12. Then, drop numbers 2 and 3 and recalculate the result on the basis of numbers 1 and 4–12, and so on. In jackknifing, generally, all possible subsamples are studied. The method was originally devised to detect outliers that have a major effect on the results, and it is very effective for this purpose. It can also be used for assessing robustness. For robustness, there is no need to study all possible subsamples—possibly 5 or 6 or 10 subsamples are enough. If the results replicate in repeated subsamples, there is evidence that they are robust. Example 20.10 illustrates another use of jackknife resampling, which derives various validity measures of a new method of classification of breast cancer cases. Example 20.10: Jackknife Testing of Computer-Aided Classification of BI-RADS Category 3 Breast Lesions The Breast Imaging Reporting and Data System (BI-RADS) is a conventional method of mammographic interpretation. Buchbinder et al. [18] used a computer-aided classification (CAC) that automatically extracted lesion-characterizing quantitative features from digitized mammograms. This classification was evolved on the basis of 646 pathologically proven cases (323 malignant). The jackknife method (100 or more subsamples by computer) was used to calculate that the sensitivity of CAC is 94%, the specificity is 78%, the positive predictivity is 81%, and the area under the receiver operating characteristic (ROC) curve is 0.90. SIDE NOTE: The authors also used CAC on 42 proven malignant lesions that were classified by the conventional method
as probably benign and found that CAC correctly upgraded the category in 38 (90%) of them. The implication is that CAC is better than the conventional method for categorizing breast lesions.
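The delete-one version of the jackknife procedure described above takes only a few lines of Python. In this sketch the data are invented, with one deliberately planted outlier whose influence the jackknife exposes.

```python
import numpy as np

# Hypothetical measurements; 9.9 is a planted outlier
x = np.array([2.1, 2.4, 1.9, 2.6, 2.3, 2.0, 2.2, 9.9,
              2.5, 2.3, 1.8, 2.4])

full_mean = x.mean()
# Delete-1 jackknife: drop each value in turn and recompute the mean
for i in range(x.size):
    jack_mean = np.delete(x, i).mean()
    shift = jack_mean - full_mean
    print(f"drop x[{i:2d}] = {x[i]:4.1f}: mean = {jack_mean:.3f} "
          f"(shift {shift:+.3f})")
```

The subsample that omits the outlier shifts the mean far more than any other, which is exactly how jackknifing flags influential values.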
Resampling methods can be used for a variety of purposes and obviate the need to depend on the Gaussian distribution. Many enthusiasts believe that there would be a paradigm shift and statistical methods would be primarily based on resampling instead of the conventional methods. These methods show promise today as much as they did 25 years ago, but have not been able to make much headway so far. 20.3.4.3 Optimistic Index This measures the extent of overfit (optimistic fit) of a model to a dataset because it is based on the same data. A model based on any dataset is more likely to be a good fit because the calculations are based on the same data, but the model may not apply with the same predictivity to a new dataset. The optimistic index tries to correct this predictivity so that a realistic assessment is available regarding the practical utility of the model. Among several, the most common method for finding the optimistic index for a model involves repeatedly fitting the model to a bootstrap subsample from the original sample and calculating the predictivity each time. This would generally be lower than the predictivity originally obtained based on the entire sample, since now only subsamples are being used. The average predictivity over, say, 100 bootstrap samples would be considered the actual predictivity. The difference between the original predictivity and the average predictivity based on bootstrap samples would be the optimistic index. This measures how much the model is overpredicting. Suppose you fit a logistic model to predict the mortality in acute necrotizing pancreatitis cases with a database of 400 cases. Let the classification accuracy of the model for these data be 86%. That is, the model was able to correctly predict survival or death in 86% of cases, but was wrong in the other 14% of cases in the dataset used for developing the model. In these 14% of cases, the patient died when the model predicted survival or the patient survived when the model predicted death. Now fit the same model to a random subsample of, say, 300 subjects from the available 400 and suppose it is found that the predictivity is 82%. Take another random subsample of 300 cases and find that the predictivity is, say, 75%. Do this 100 times. Suppose the average predictivity in these 100 random subsamples of 300 patients is 79%. Since the original predictivity is 86%, the optimistic index is 86% – 79% = 7%. The model seems to be overpredicting to an extent of 7%. You can see that the index may depend on the size of the bootstrap samples. A smaller sample may have lower predictivity not because of sample size, but because the smaller sample may not represent the full spectrum of subjects.
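The calculation of the optimistic index is easy to script. The sketch below uses the standard bootstrap form of the optimism correction: the model is refitted on each resample, and its apparent accuracy on the resample is compared with its accuracy on the original sample. This is close in spirit to the subsample procedure described above; the dataset is simulated, and all numbers are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=1)

# Simulated stand-in for the pancreatitis example: 400 cases with
# three hypothetical predictors and a binary survival outcome
X = rng.normal(size=(400, 3))
noise = rng.normal(scale=1.5, size=400)
y = (0.8 * X[:, 0] - 0.6 * X[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(X, y)
apparent = model.score(X, y)   # predictivity on the same data: optimistic

optimism = []
for _ in range(100):
    idx = rng.choice(400, size=400, replace=True)
    m = LogisticRegression().fit(X[idx], y[idx])
    # apparent accuracy on the resample minus accuracy on the original
    optimism.append(m.score(X[idx], y[idx]) - m.score(X, y))

print(f"apparent predictivity:  {apparent:.3f}")
print(f"optimistic index:       {np.mean(optimism):.3f}")
print(f"corrected predictivity: {apparent - np.mean(optimism):.3f}")
```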
20.4 Quality of Data Some researchers have blind faith in their data without considering that they could be flawed. You may be familiar with the garbage-in garbage-out (GIGO) syndrome. Even the most immaculate statistical treatment of data cannot lead to correct conclusions if the data are basically wrong. Many times the results fail to reproduce and the fault is often laid at the door of statistical methods, whereas the problem actually could be with the data. Before starting any analysis, everything possible should be done to ensure that the information is correct. Realize that valid and reliable instruments can also give erroneous results when not used with sufficient care. In addition, some errors creep in inadvertently due to ignorance. Sometimes the data are deliberately manipulated to support a particular viewpoint. Also, remember what Tukey said: the availability of a set of data and the aching desire for an answer do not ensure that a reasonable answer will be extracted. Data quality is crucial. It is sometimes believed that bad data are better than none at all. This can be true if sufficient care is exercised in ensuring that the effect on the conclusion of bias in bad data has been minimized, if not eliminated. This is rarely possible if the sources of bias are too many. Also, care can be exercised only when the sources of bias are known or can be reasonably conjectured. Even the most meticulous statistical treatment of inherently bad data cannot lead to correct conclusions. 20.4.1 Errors in Measurement Errors in measurement can arise due to several factors, as listed in the following. 20.4.1.1 Lack of Standardization in Definitions If it is not decided beforehand that an eye will be called practically blind when VA < 3/60 or VA < 1/60, then different observers may use different definitions. When such inconsistent data are merged, an otherwise clear signal from the data may fail to emerge, leading to a wrong conclusion. An example was cited earlier of the variable definition of hypertension used by different workers. In addition, special attention is required for borderline values. Chambless et al. [7] use BP ≥ 140/90 for hypertension, whereas Wei et al. [19] use BP > 140/90: one considers BP = 140/90 hypertensive, and the other considers this normotensive. The other such example is age. This can be recorded in completed years (age at last birthday) or as on the nearest birthday. These two are not necessarily the same. A difficulty might arise in classifying an individual as anemic when the Hb level is low but the hematocrit is normal. Guidelines on such definitions should be very clear, particularly in a research setup. 20.4.1.2 Lack of Care in Obtaining or Recording Information There can be a lack of care in obtaining or recording information when, for example, sufficient attention is not paid to the appearance of Korotkoff sounds while measuring BP by sphygmomanometer, or to the waves appearing on a monitor for a patient in critical condition. This can also happen when responses from patients are accepted without probing, and some of them may not be consistent with the response obtained on other items. If reported gravidity in a woman does not equal the sum of parity, abortions, and stillbirths, then obviously some information is wrong. A person may say that he does not know anything about AIDS in the early part of an interview but states that sexual intercourse is the mode of transmission in the latter part of the interview. 
The observer or the interviewer has to exercise sufficient care so that such inconsistencies do not arise. Then there is the question of correct transfer from the datasheet to the spreadsheet. We have often detected errors in data entry upon checking the data after a suspicion arose. This we do only for studies where we are intimately involved; otherwise, the analysis is done on the submitted data. It is mostly for investigators themselves to carefully check the data. Here is a sample of what we could detect. In one instance, codes for males and females were reversed for some subjects. Such errors are difficult to detect unless you find a pregnant male! In another case, the pretest values were unwittingly swapped with the posttest values for some cases at the time of data entry. In yet another instance, the values for case number 27 were exactly the same as those for case number 83. This may have occurred due to repeat recording of the same subject by two different workers or by wrong renumbering of forms. To give you a few more practical examples, at the time of entering change from pre- to postvalues, a minus sign was inadvertently omitted. In another instance, the calculation of scores was based on the wrong columns (F, G, H, M, and N) in the spreadsheet, where the columns actually needed were F, G, H, N, and O, and the scores that turned out did not arouse suspicion.
Only when a third person examined the data for some other purpose was this error detected. All these are actual instances and can happen in the best of setups. Thus, proper scrutiny of data is a must for valid results. Some such errors may have crept into this book also, and we will take it that nothing serious has happened. Now some tips on how to check the data. For large datasets, double entry by two independent workers and matching may help in detecting errors. A second method is the range check. If the hemoglobin level is typed as 3.2 g/dL, birth weight as 6700 g, or age at menopause as 58 years, you know these are improbable values and need to be double-checked. Whether qualitative or quantitative, frequency tabulation can help in detecting such outliers. The difficulty is in detecting age 32 years typed as 23 years—when both are equally plausible. If the stakes are high, you may want to double-check all the entries with the forms (assuming that information on the forms is correct). If the entries are made directly online, one source of errors is eliminated, but the chance of detecting any real error in entry steeply declines. Do whatever you can, since you cannot disown such errors later on when they are detected. 20.4.1.3 Inability of the Observer to Secure Confidence of the Respondent This inability can be due to language or intellectual barriers if the subject and observer come from widely different backgrounds. They may then not understand each other and generate wrong data. In addition, in some cases, such as in sexually transmitted diseases (STDs), part of the information may be intentionally distorted because of the stigma or the inhibition attached to such diseases. An injury in a physical fight may be ascribed to something else to avoid legal wrangles. Some women hesitate to divulge their correct age. Some may refuse physical examination, forcing one to depend on less valid information. Correct information can be obtained only when the observer enjoys the full confidence of the respondent.
Many inadvertent errors can be avoided by imparting adequate training to the observers in the standard methodology proposed to be followed for the collection of data and by adhering to the protocol as outlined in the instruction sheet. The difficulty is that many investigations do not even prepare an instruction sheet, let alone address adherence. Intentional errors are, however, nearly impossible to handle and can remain unknown until they expose themselves. One of the possible approaches is to be vigilant regarding the possibility of such errors and deal sternly with them when they come to notice. Scientific journals can play a responsible role in this respect. If these errors are noticed before reaching the publication stage, steps can sometimes be taken to correct the data. If correction is not possible, the biased data may have to be excluded altogether from analysis and conclusion. 20.4.2 Missing Values Two kinds of missing values occur in a medical setup. First, some of the subjects included in the sample are not available for some reason right from the beginning, and for them no information is available at all. Hopefully, they are random and not introducing any bias. For these misses, the study protocol should provide for an unbiased mechanism to select other subjects from the same target population and use them as replacement under certain conditions. Second, the subjects drop out after initially consenting. For them, some initial or baseline information is available. They should not be replaced but need to be handled separately.
It is quite common, even with the best-designed and monitored studies, that part of the information is not available for some subjects. This can happen for reasons beyond the control of the investigator. (a) The subjects may cease to cooperate after giving consent earlier and may not show up, or may leave a hospital ward against medical advice. Some subjects may unexpectedly die in the midst of the investigation, and some may change their medical care provider. Some may develop side effects and drop out. Some subjects may refuse to divulge part of the information that they want to keep secret. Some may have urgent work at home or the office or may be ill, so that the appointment is rescheduled. (b) In some cases, a particular investigation in the laboratory is not done because the kit or the reagent is not available at that time, or the sample of blood, urine, or other material is not adequate or is ruined in storage or transportation. The equipment may not be functioning properly. If the information is to be obtained from records, these may be incomplete. Some laboratory results may be unbelievably wrong and discarded. The person responsible for collection of data may not be sufficiently careful, or he or she may goof up and not collect some of the information as per the protocol. Some reports may be misplaced. When the collection of data involves handwriting, it may not be legible in some cases. The opportunities for missing values due to the errors enumerated in (b) have substantially declined after the advent of automated and online systems, but missing values continue to occur, perhaps with increased frequency, for the personal reasons listed in (a). If the number of subjects investigated is 250, and complete and correct data are available only for 220, accounting for the missing 30 can be a challenge.
A few missing values may not affect the findings, and the available data can be analyzed without any adjustment. If the missing values are far too many, such as more than 50%, assess the utility of the available data because such deficient data can lead to distorted results. There is a view, though, that imputation or adjustment is better than throwing away the data. Three alternative approaches are available:
1. The simplest but undesirable approach is to ignore the missing values and pretend that nothing is lost. Only the available data are analyzed. This approach may work well in a random nonresponse situation. But, as you are now aware, nonresponse is seldom random. In addition, the sample size will be reduced in any case, which will compromise the reliability of the results. Thus, this approach is not advisable unless the missing values are really sparse, say, less than 5%. 2. The second approach is to compare the baseline information of the nonrespondents with that of the respondents and adjust the results accordingly if the baselines are found to be different. This approach is illustrated in Example 20.11 and works well if baseline characteristics make a difference in the outcome, and nothing else is that important. 3. The third approach is to make up the lost data by imputation, as discussed shortly. You should carefully examine the data regarding the pattern of missing values—whether some information is missing for some specific types of subjects or is dispersed; that is, does it look random, or does it fall into a pattern? If the same variable's values are missing for a large proportion of subjects, consider deleting this variable from the analysis.
Example 20.11: Adjustment in Incidence of Cancer Based on the Age Structure of Respondent and Nonrespondent Women In a study to evaluate the risk of breast or endometrial cancer in women treated with menopausal estrogen, a cohort of 5000 women was randomly chosen from those who received a prescription from pharmacies across a country. (The study would have an equivalent control without estrogen therapy, but we leave that outside the purview of this example.) At the end of the 10-year follow-up period, information on 1237 women was not available for a variety of reasons. The broad distribution of the 3763 women whose data were available is shown in Table 20.3. There is apparently some gradient of incidence rate with age at menopause in this study. The overall rate of 1.49% is unadjusted for the nonresponse.
TABLE 20.3 Incidence of Cancer in Menopausal Women Treated with Estrogen, by Age at Menopause (Years) …

… for x > a, a spline function is needed, as stated in Chapter 16. Without this, the results could be fallacious. Realize, however, that this could be used only if you know, or at least suspect, that the patterns could be different over different subranges of x. If this is not known, the fallacy can continue to occur for a long time. 21.2.1.2 Overlooking Assumptions A Gaussian form of distribution of various quantitative measurements is so ingrained in the minds of some workers that they take it for granted. Many examples are provided in this text of medical measurements that do not follow a Gaussian pattern. For large n, the central limit theorem can be invoked for inference on means, but nonparametric methods should be used when n is small and the distribution is far from Gaussian. At the same time, note also that most parametric methods, such as t and F, are quite robust to mild deviation from the Gaussian pattern. Their use in such cases does limited harm so long as the distribution has a single mode. Special attention is required if the distribution looks bimodal, or if it is J or U shaped. Transformations such as logarithm and square root sometimes help to "Gaussianize" a positively skewed distribution. But these also make interpretation difficult and unrealistic. If the means of the logarithm of the lipoprotein(a) level do not differ significantly in males and females, what sort of conclusion can be drawn for the lipoprotein(a) level itself? Despite this limitation, such transformations are in vogue and seem to lead to correct conclusions in many cases, particularly if n is not too small. Special statistical methods for many non-Gaussian distributions, such as exponential and gamma, can be developed, but experience suggests that the gains are not commensurate with the efforts. However, the need for such statistical methods cannot be denied. The assumptions of independence of observations and of homoscedasticity are more important than that of a Gaussian pattern. You now know that independence is threatened when the measurements are serial, longitudinal, or clustered. Homoscedasticity or uniformity of variance is lost when, for example, the SD varies with the mean. And this is not so uncommon. Systolic levels are
higher in postmenopausal females, and so are their SDs. Persons with a higher BMI also tend to exhibit greater variability in BMI than those with lower BMI. Thus, care is needed in using methods that require uniformity of variances. In the case of chi-square for proportions, the basic assumption is that the expected frequency in most cells is five or more. This restriction is sometimes overlooked. The Fisher exact test for a 2×2 table has become an integral part of most statistical packages of repute, but the multinomial test required for larger tables with many expected frequencies of less than 5 has not found a similar place. It is for you to locate a statistical package that is right for your problem. 21.2.1.3 Selection of Inappropriate Variables More than 100 signs–symptoms of hypothyroidism can be identified. These include puffy face, cold intolerance, thin hair, and brittle fingernails. This is probably true for many medical conditions. Two kinds of fallacy can arise in such cases. One is due to a priori (before modeling) selection of variables. If you cannot study all because of the limitation of sample size or otherwise, you must have sufficient reasons for including some and excluding others. Such reasons do not adequately exist in many situations, and subjective preference plays a role. This obviously will affect the results. If all known variables are entered into the model, the sample size must be enormously large. Sometimes, one variable at a time is considered and the statistically significant ones are included in a multivariable setup. To reiterate what was stated earlier, this ignores the interdependence—thus, the choice is statistically difficult. You should keep this limitation in mind while making an inference in such cases. Many times this is forgotten. The final wrinkle is the epistemic gap, about which remarks were made earlier in the context of regression models. Quite often, the factors affecting the outcome are not fully known. Obviously, only known or, at best, suspected factors can be studied. Large unexplained variation is one indicator of the inadequacy of a model, but unknown factors can be important even when the unexplained part is small. This can happen when an unknown factor is closely linked to one or more of the known factors. You know that all models are a necessarily simplified version of the reality. They many times present a false aura of precision due to mathematical formulation. The limitations just stated increase vagaries. For this reason, even statistically adequate models may fail to live up to the expectations in practice. 21.2.1.4 Area under the (Concentration) Curve
There is a practice in animal experiments and clinical trials of noting the intensity of response at different time points after an intervention. This could be in terms of the concentration of a drug after administration, intensity of pain after analgesia, physical stress during an exercise, or any other response. One popular method generally used to compare the pattern of response is the area under the (concentration) curve (AUC). This is routinely used for bioavailability, which in turn is used for bioequivalence, particularly for average bioequivalence. For example, Gaudreault et al. [11] evaluated truncated AUC as a measure of the relative extent of bioavailability. This area is different from the area under the receiver operating characteristic (ROC) curve. The difficulty is that the AUC has limited physical meaning. When used in isolation, it may fail to provide an adequate assessment of the difference in trend, even on average, because it is neither specific nor sensitive to changes in the patterns. Curves with markedly different trends can give the same area (Figure 21.4). Thus, an AUC is not a valid measure of bioequivalence, although it is frequently used. Also, as time passes, some patients drop out or get cured—thus, the average response at different points in time is based on different ns. This is sometimes forgotten while studying the AUC.
FIGURE 21.4 Markedly different concentration–time curves with the same area under the curve.
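To see the point numerically, consider the following Python sketch with two invented concentration–time profiles: their shapes, Cmax, and Tmax differ sharply, yet the trapezoidal AUCs are nearly equal (all values are hypothetical).

```python
import numpy as np

# Hypothetical sampling times (h) and concentrations for two drugs
t = np.array([0, 0.5, 1, 2, 4, 6, 8, 12])
c_a = np.array([0.0, 4.5, 6.0, 5.0, 3.0, 2.0, 1.2, 0.5])  # early, sharp peak
c_b = np.array([0.0, 1.0, 2.5, 4.0, 4.3, 3.2, 2.0, 0.6])  # late, flat peak

for name, c in (("A", c_a), ("B", c_b)):
    # Trapezoidal rule: mean of successive concentrations x time step
    auc = np.sum((c[1:] + c[:-1]) / 2 * np.diff(t))
    cmax, tmax = c.max(), t[c.argmax()]
    print(f"Drug {name}: AUC = {auc:.1f}, Cmax = {cmax:.1f}, Tmax = {tmax:g} h")
```

Both AUCs come out close to 30, although drug A peaks at 1 hour and drug B at 4 hours; the AUC alone would wrongly suggest that the two profiles are interchangeable.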
The AUC can give valid results when the response in one group is better than that in another at almost every time point. If they crisscross, then the conclusion can be very wrong. In this case, the investigator may have to be more specific on what exactly he or she is looking for. Merely stating that a trend or pattern is under investigation does not help. Depending on what is most relevant for the objective of the study, particularly in pharmacokinetic studies, it could be the time taken to reach the peak (Tmax), the time taken to return to the initial level, the peak level attained (Cmax), the response level after a specific time gap, and so forth. In many situations, the AUC needs to be considered in conjunction with other parameters, such as Tmax and Cmax, for a valid conclusion. In a study on treatment efficacy, the interest may be in the value at the end relative to the initial value. If the study is on the effectiveness of an analgesic, the interest may be in Tmax, Cmax, and possibly time to reach some specified critical level. No matter what parameter is used, it should always be decided beforehand on the basis of clinical utility and not after inspecting the data. 21.2.1.5 Further Problems with Statistical Analysis At the cost of repetition but in the interest of completeness, we remind readers of the following problems in analysis:
1. Categorizing a continuous variable causes loss of information and can result in bias, loss of power, and an increase in Type I error. Different categorizations can lead to different conclusions. In some cases, a clever investigator can choose categories after seeing the data that connive to reach a preconceived result. At the same time, it is also true that in some cases, the fitting of a continuous curve disturbs the actual relationship if it has kinks. 2. The mean and SD are inappropriate for variables with highly skewed distributions. For such variables, the median and interquartile range are more appropriate. You know that the interquartile range comprises the middle half of the subjects. The difference (median − Q1) when compared with (Q3 − median) gives an idea of the extent of skewness. The interquartile range also rules out “absurd”-looking statements, such as the mean lipoprotein(a) level is 8 ± 20, since this cannot be negative. The distribution of lipoprotein(a) is highly positively skewed, and the SD is very high. Most values are around 5 mg/dL, and mean ± SD can give a fallacious impression. 3. A real anomaly arises when, for example, too many zeros occur in a dataset (such as runs by batters in baseball games). In medicine, this happens when, for example, you are studying smoking among adolescents. If you record duration as zero for nonsmokers and actual duration on a continuous scale for smokers, neither mean nor median will work well because of the large number of zeros. If you categorize this as zero and more than zero, or duration in three or four categories, the analysis can be appropriately done by using proportions. 4. Many researchers take Gaussianity for granted, and use Gaussian-based methods even for small samples. Remember that small samples are not able to provide evidence against Gaussianity because of lack of power. Tests such as the Kolmogorov–Smirnov, Anderson–Darling, and Shapiro–Wilk work well with large samples and not small samples. For small samples, you should have some extraneous evidence (from experience or literature) that the distribution is Gaussian. If it suggests a non-Gaussian pattern, it might be safe to use nonparametric methods for small samples. On the other hand, some researchers are nonparametric enthusiasts and use these methods even when the sample size is large. This could result in loss of power. So far, fortunately, the number of such enthusiasts is not high. 5. In some paired comparisons, change, such as from pre to post, is a natural outcome of interest. Since baseline (pre) values can affect the amount of change, many times percent change is calculated. This works well provided (a) negative and positive changes are not canceling on averaging, and (b) the percent change is independent of the baseline. These limitations tend to be overlooked. 21.2.1.6 Anomalous Person-Years It is customary to calculate person-years of smoking and use this for various inference purposes. The calculation presumes that smoking 5 cigarettes a day for 20 years has the same implication as smoking 50 cigarettes a day for 2 years. Person-years of exposure is a valid epidemiological tool only when each year of exposure has the same risk. For many diseases, this obviously is not true. For example, smoking in the first 2 years may not cause as much harm as an additional 2 years would after smoking for 10 years when the damage has started. 
Similarly, calculation of mortality rate per thousand person-years after an episode of fracture in patients on dialysis [12] may be flawed since the risk of mortality in the 1st year after the fracture is not the same as in, say, the 10th year after fracture. If nothing else, aging will make an impact in this case. Thus, use the person-years tool with abundant caution. Most of the literature using person-years seems to be missing this point.
21.2.1.7 Problems with Intention-to-Treat Analysis and Equivalence Use of the intention-to-treat (ITT) strategy in superiority trials is preferred, but in equivalence or noninferiority trials, ITT is suspect. When the patients switch from one group to the other, ITT may show equivalence where none really exists. The other fallacy occurs when a failed superiority trial is touted as evidence of equivalence or noninferiority. A review of 88 superiority studies over a 5-year period found that the claim was not properly supported in 67% of the studies [13]. Failed superiority is not evidence of equivalence or noninferiority. A fresh trial is needed to confirm this. On the other hand, per protocol (PP) analysis is based on those patients who stick to the allocated regimen. If many patients have dropped out, the reduced sample size will affect the power to detect a clinically important difference. At the same time, smaller variability is often associated with PP analysis, which can compensate for the loss of power due to small n. Perhaps the best strategy is to do both types of analysis. If they do not conflict, you are done. Otherwise, there might be lessons on where the problem lies. You now know that the equivalence of a regimen can be concluded when its efficacy differs by not more than a specified margin. If an existing regimen has 79% efficacy and 3% difference is your tolerance, a regimen with 76% efficacy can be considered equivalent. This, however, means that the standard can slip over time. A newer regimen could then be evaluated against 76% efficacy in place of 79%. You may want to guard against such a fallacy. Also, make sure that equivalence or noninferiority is not due to mischievous execution of the study, such as a lot of dropouts and noncompliance with the regimen in either group, particularly the group with the standard regimen. 21.2.2 Choice of Analysis A large number of fallacies can be cited that arise due to an inappropriate choice of analysis, but this section contains two everyday examples that illustrate how varying conclusions can be reached by one kind of analysis as opposed to another kind. The aura of P-values is such that a novice could be happy using an inapplicable statistical method that happens to yield the lowest P-value. 21.2.2.1 Mean or Proportion? The blame lies mostly with statisticians who fancy quantity over quality. Although quantitation in many cases does help in achieving exactitude in thinking and in drawing conclusions, it can sometimes suppress important findings. Example 21.5a illustrates one such situation. This is similar to Example 15.13 in Chapter 15. Example 21.5a: Conclusion Based on Mean Can Be Different from the Conclusion Based on Proportion Given in the following is the rise or fall in Hb levels in a random sample of eight adolescent girls after iron supplementation for 2 weeks:
Rise in Hb level (g/dL): 0.4, 0.9, −0.9, 0.3, 0.1, 0.5, 0.7, 0.2

Mean rise = 0.275 g/dL, SD of rise = 0.542 g/dL

t7 = 0.275/(0.542/√8) = 1.44.
Use of the t-test is valid in this case (small n) if the rise follows a Gaussian pattern at least approximately. The null hypothesis is H0: μ = 0, where μ is the mean rise in the target population. Note in this case that the possibility of iron supplementation reducing the mean Hb level is ruled out, and the test is one-sided. The critical value of t for 7 degrees of freedom (df) at the 5% level is 1.895 for a one-tailed test. Since the calculated value of t is less than the critical value, infer that the mean rise is not statistically significant and conclude that iron supplementation for 2 weeks in adolescent girls is not sufficient to produce a rise on average. The fact also is that Hb levels in seven out of eight girls showed a rise. If the supplement is not effective, then the chance of rise is the same as that of a fall, assuming that the Hb level will not remain exactly at the same level. Thus, H0: π = 1/2 and
H1: π > 1/2, where π is the probability of a rise. Because of a small n, you need to use the exact binomial distribution to obtain the P-value. Under H0, the probability of seven or more "successes" is given by

P(x ≥ 7) = 8C7 (1/2)^7 (1/2)^1 + 8C8 (1/2)^8 (1/2)^0 = 0.035,
where x is the number of girls showing a rise in the Hb level. Thus, the chance of seven or more girls showing a rise out of eight when π = 1/2 is less than 0.05. Reject H0 and conclude that the supplement is effective. This conclusion is different from the one reached on the basis of the mean.
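This binomial arithmetic is easy to verify in a few lines of Python (a sketch; scipy.stats.binomtest gives the same one-sided P-value):

```python
from math import comb

# Exact one-sided P-value: P(x >= 7) for x ~ Binomial(n = 8, p = 0.5)
n, p = 8, 0.5
p_value = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in (7, 8))
print(f"P(x >= 7) = {p_value:.3f}")   # 0.035, below the 0.05 level
```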
Both analyses in Example 21.5a are right. If one wants to draw a conclusion about the rise in mean, then the answer is that it is not statistically significant. For the proportion of girls showing a rise, the conclusion is that a statistically significant number show a rise. The first is based on the quantity of rise, and the second disregards the exact quantity. Even a rise of 0.1 g/dL is a rise and a fall of 0.9 is a fall, irrespective of the quantity. Depending on the purpose of the investigation, the finding can be presented either way. It is for you to be on guard and check that the conclusion drawn indeed emerges from the data, and that it appropriately answers the intended question. 21.2.2.2 Forgetting Baseline Values Although the rise in Example 21.5a is calculated over the baseline values, a rise of 0.3 over 8.7 g/dL is considered the same in this example as the rise over 13.7 g/dL. In this sense, the example does not consider the actual Hb level in the subjects before the start of the supplementation. Now examine what can happen if the initial levels are also considered. Example 21.5b: Consideration of Baseline Values Can Alter the Conclusion Following are the Hb levels (g/dL) of the eight girls in Example 21.5a before the supplementation:

10.1, 8.5, 13.8, 9.7, 12.4, 9.0, 8.6, 12.1
On critical examination of the data, you will find that the girls with a higher level of Hb showed a smaller rise. In fact, a girl with an Hb level of 13.8 g/dL showed a fall. The relationship can be examined by running a simple linear regression of rise on the initial level. The regression equation is given by
Rise = 2.85 − 0.244 (initial Hb).
This gives R² = 0.81 and r = −0.90. The regression shows that the rise, on average, is higher when the initial values are lower and, in this case, becomes negative when the initial Hb level exceeds 11.7 g/dL. Since a fall can be ruled out in the case of iron supplementation, it might be concluded that iron supplementation in such girls is possibly not useful when the Hb level is already 11.7 g/dL or higher.
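The regression is easy to reproduce in Python. In the sketch below, the rises are paired with the baselines girl by girl (the girl with baseline 13.8 g/dL is the one who showed the fall, as stated above); small rounding differences from the coefficients quoted in the text are to be expected.

```python
import numpy as np

# Paired data from Examples 21.5a and 21.5b
initial = np.array([10.1, 8.5, 13.8, 9.7, 12.4, 9.0, 8.6, 12.1])
rise = np.array([0.4, 0.9, -0.9, 0.3, 0.1, 0.5, 0.7, 0.2])

slope, intercept = np.polyfit(initial, rise, deg=1)
r = np.corrcoef(initial, rise)[0, 1]
print(f"Rise = {intercept:.2f} {slope:+.3f} x (initial Hb)")  # about 2.86 - 0.245x
print(f"r = {r:.2f}, R^2 = {r*r:.2f}")                        # -0.90 and 0.81
print(f"predicted rise is zero at {-intercept/slope:.1f} g/dL")  # near 11.7
```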
The lesson from Example 21.5b is that baseline values can be important. In the case of comparison of two groups, if the baseline values differ, the analysis may have to be geared to adjust for this difference. In many cases, change from baseline, rather than the final outcome, would be more adequate in coming to a valid conclusion. Example 21.5b also illustrates the possible adverse effect of a small n on the conclusion. The finding that the Hb level substantially falls in some cases after iron supplementation is medically untenable and possibly indicates that other uncontrolled factors were playing spoilsport. In this example, only one girl exhibited a fall, and this is able to vitiate the regression because of the small n. Also, the objective of the study was not to find a threshold beyond which iron supplementation is not useful. To achieve this objective, a larger trial is required anyway. 21.2.3 Misuse of Statistical Packages Computers have revolutionized the use of statistical methods for empirical inferences. Methods requiring complex calculations are now done in seconds. This is a definite boon when appropriately used, but a bane in the hands of nonexperts. Understanding of statistical techniques has not kept pace with the spread of their use, particularly among medical and
health professionals. The danger arises from the misuse of sophisticated statistical packages for intricate analysis without fully appreciating the underlying principles. The following are some of the common misuses of statistical packages. 21.2.3.1 Overanalysis Popularly termed torturing the data until they confess, overanalysis of data sometimes occurs, particularly in the form of post hoc analysis. This is like chasing a small effect hidden in noisy data. A study may be designed to investigate the relationship between two specific measurements, but correlations between pairs of a large number of other variables, which happen to be available, are calculated and examined. This is easy these days because of the availability of computers. If each correlation is tested for statistical significance at α = 0.05, the total error rate increases enormously. Also, α = 0.05 implies that 1 in 20 correlations can be concluded to be significant when actually they are not. If measurements on 16 variables are available, the total number of pairwise correlations is 16 × 15/2 = 120. At the error rate of 5%, 6 of these 120 can turn out to be falsely significant. Overanalysis also carries the risk of interpreting noise as a signal, resulting from what is called P-hacking. Hofacker [14] has illustrated this problem with the help of randomly generated data. Any result from post hoc analysis should be considered indicative and not conclusive. Further study should be planned to confirm such results. This also applies to post hoc analysis of various subgroups that were not part of the original plan. There is always a tendency to try to find the age–sex or severity groups that benefited more from the treatment than the others. Numerous such analyses are sometimes done using a drop-down menu in the hope of finding some statistical significance somewhere. Again, such detailed analysis is fine for searching for grounds to plan a further study, but not for drawing a definitive conclusion. This is not to deny unplanned discoveries: the existence of the Americas is not in question merely because finding them was not part of Columbus's plan. That finding was not empirical anyway. In most empirical cases, though, it may be sufficient to acknowledge that the results are based on post hoc analysis and possibly need to be confirmed. 21.2.3.2 Data Dredging Because of the availability of software packages, it is now easy to reanalyze data after deleting some inconvenient observations. Valid reasons for this exercise, such as the presence of outliers, are sometimes present, but this can be misused by excluding some data that do not fit the hypothesis of the investigator. It is extremely difficult to get any evidence of this happening in a finished report. The integrity of the workers is not and cannot be suspected unless evidence to the contrary is available. Thus, data dredging can go unnoticed. 21.2.3.3 Quantitative Analysis of Codes Most computer programs, for the time being, do not have the capability to distinguish numeric codes from quantitative data unless you define the codes as categorical. If disease severity is coded as 0, 1, 2, 3, and 4 for none, mild, moderate, serious, and critical conditions, respectively, statistical calculations may treat these codes as the usual quantitative measurements. As cautioned in Chapter 9, this runs the risk of considering three mild cases as equal to one serious case, and so on. This means codes can be mistreated as scores. This can happen even with nominal categories, such as signs and symptoms, when they are coded as 1, 2, 3, and so forth.
Extra caution is needed in analyzing such data so that codes do not become quantities unless you want them to be. 21.2.3.4 Soft Data versus Hard Data The data to be analyzed statistically are mostly entered in terms of numerics in a computer. There may be some features of the health spectrum that are soft in the sense that they can only be understood and are difficult to put on paper. This applies particularly to psychological variables such as depression and frustration. Even if put on paper, they may defy coding, particularly if it is to be done before collection of data. A precoded pro forma is considered desirable these days because it makes computer entry so easy. But it should be ensured in this process that the medical sensibility of the information is not lost.
21.3 Errors in Presentation of Findings Whether out of ignorance or by deliberate intent, the presentation of data in medical reports sometimes lacks propriety. This can happen in a variety of ways.
21.3.1 Misuse of Percentages and Means Some fallacies occurring due to the misuse of proportions and means are described next. 21.3.1.1 Misuse of Percentages Percentages can mislead if calculations are (a) based on small n or (b) based on an inappropriate total. If two patients out of five respond to a therapy, is it correct to say that the response rate is 40%? In another group of five patients, if three respond, the rate jumps to 60%. A difference of 20% looks like a substantial gain, but in fact, the difference is just one patient. This difference can always occur due to sampling fluctuation. Thus, the percentage based on small n can mislead. It is preferable to have n ≥ 100 for valid percentages, but the following is our subjective guideline: 1. State only the number of subjects without percentages if n < 30. 2. For n ≥ 30, percentages can be given, but n should always be stated. The second misuse of percentages occurs when they are calculated on the basis of an inappropriate group. Example 21.6 illustrates this kind of misuse. Example 21.6: Correct Base for Calculating Percentages Consider a group of 134 cases of heart bypass surgery who are followed for postsurgical complications. The data obtained are presented in Table 21.5. This is similar to Table 7.5 of Chapter 7.

TABLE 21.5 Complications in Cases of Heart Bypass

Complication (Multiple Responses)   Number of Cases   Wrong Percentage (Out of 44)   Correct Percentage (Out of 127)
Excessive bleeding                          9                 20.5                          7.09
Chest wound infection                      15                 34.1                         11.81
Other infections                            8                 18.2                          6.30
Breathing problems                         10                 22.7                          7.87
Blood clot in the legs                     13                 29.5                         10.24
Others                                     11                 25.0                          8.66
Any complication                           44                100.00                        34.65
No significant complication                83                                              65.35
Total (data available)                    127                                             100.00    (94.78)
Data not available                          7                                                        (5.22)
Grand total                               134                                                      (100.00)
Information was not available for seven patients. Since 83 did not experience any significant complication, it would be wrong to calculate the complication rate on the basis of the 44 patients who experienced complications. It unnecessarily magnifies the problem. The correct base for the complication rate is 127. The fact is not that 20.5% had excessive bleeding, but that 7.09% of the patients had this problem. It would also be wrong to include seven patients in this calculation for whom the data are not available. For a nonresponse rate, however, the correct base is 134. This can be separately stated, as shown in parentheses in the table.
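The arithmetic of Example 21.6 can be verified in a few lines; this is our illustration, not part of the original example.

```python
# With multiple responses, percentages must use all patients with data (127)
# as the base, not the 44 patients who had any complication.
counts = {
    "Excessive bleeding": 9, "Chest wound infection": 15, "Other infections": 8,
    "Breathing problems": 10, "Blood clot in the legs": 13, "Others": 11,
}
with_data = 127   # 134 enrolled minus 7 with no data available
complicated = 44  # patients with at least one complication

for name, n in counts.items():
    print(f"{name}: wrong {100 * n / complicated:.1f}%, correct {100 * n / with_data:.2f}%")
# The correct percentages do not add to 100 because a patient can have
# two or more complications (multiple responses).
```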
Example 21.6 also illustrates the calculation of percentages in the case of multiple responses. A patient can have two or more complications. Thus, the percentages are not additive in this case. Another example of use of a wrong base is as follows. Suppose 300 subjects are asked about their preference for more convenient, less expensive, but less efficacious medical treatment versus less convenient, more expensive, but more efficacious surgical removal of a benign tumor. Only 60 (20%) reported preference for medical intervention, and 84 (28%) preferred surgery. The remaining 156 (52%) were noncommittal. If this finding is reported as 28% preferring surgery, it has the risk of being interpreted as saying that the remaining 72% preferred medication. Obviously, this is wrong. The nonrespondents or the neutrals should always be stated so that no bias occurs in the interpretation. For an example from everyday life, consider
the interpretation of the forecast of a 30% chance of rain. In fact, then, the chance of no rain is 70%, which is more than two times the chance of rain. Here, there is no neutral category.

21.3.1.2 Misuse of Means

A popular saying by detractors of statistics is, "Head in an oven, feet in a freezer, and the person is comfortable, on average!" There is no doubt that an inference based on the mean alone can sometimes be very misleading. It must always be accompanied by the SD so that an indication is available of the dispersion of the values on which the mean is based. This text has emphasized this from time to time. Sometimes the standard error (SE) is stated in place of the SD; this can mislead unless its implications in the context are fully explained. Also, n must always be stated when reporting a mean. These two, n and SD, should be considered together when drawing any conclusion based on a mean. Statistical procedures, such as confidence intervals (CIs) and tests of significance, have a built-in provision to take care of both. A mean based on large n naturally commands more confidence than one based on small n. Similarly, a smaller SD makes the mean more meaningful. The general practice is to state the mean and SD with a ± sign in between, as done by Park et al. [15]. Opinion is now building against the use of the ± sign because it tends to give a false impression. If mean serum bilirubin is reported as 1.1 ± 0.4 mg/dL, it gives the impression that the variation is from 0.7 to 1.5 mg/dL. In fact, the variation is much more, even more than the ±2 SD limits of 0.3–1.9 mg/dL when SD = 0.4 mg/dL. Thus, it is better to state clearly that the mean is 1.1 mg/dL and the SD is 0.4 mg/dL, without using a ± sign.

You should also evaluate whether the mean is an appropriate indicator for a particular dataset. Averages are not always what they seem. If in a group of 10 persons, 9 do not fall sick and 1 is sick for 40 days, how correct is it to say that the average duration of sickness in this group is 4 days per person? If extreme values or outliers are present, the mean is not a proper measure. Similarly, it would be fallacious to say that the average number of legs in humans is 1.997 because some have one and some have none. Either use the median or recalculate the mean after excluding the outliers; if exclusion is done, this must be clearly stated. Otherwise, consider whether proportions are more appropriate than the mean.

21.3.1.3 Unnecessary Decimals

There is a tendency to overuse decimals in the reporting of results, creating a false sense of accuracy. Perhaps a higher number of decimals is considered to give a scientific look. A large number of decimal places should be used for intermediary calculations (a computer will automatically do that), but the final result should have only an appropriate number of decimal places. A rule for percentages is to have only as many decimal places as are needed to retrieve the original number; as an extra precaution, one extra decimal can be used. For the percentage of subjects, this gives the following rules. In percentages, use:

One decimal place if n ≤ 99,
Two decimal places if 100 ≤ n ≤ 999,
Three decimal places if 1000 ≤ n ≤ 9999.

There is also the concept of significant digits, which applies particularly to the reporting of calculated values with an extremely large denominator. The leading zeros after a decimal are ignored while counting the significant digits.
The values 0.00720 and 0.458 both have three significant digits and the same accuracy. The value 0.069 has two significant digits, whereas 3.06 has three. All reporting should have the same number of significant digits. This concept is particularly useful when the reported values are meant to be used for further calculations. We have not followed this rule in this text. In expressing quantities other than percentages, such as the mean and SD, the following rule is generally adequate: report one decimal place more than in the original measurements from which the mean and SD are calculated.
If the last digit to be rounded off is 5, make the previous digit even; that is, 1.15 is rounded off as 1.2, and 3.45 as 3.4. Or, you can decide to stick to the odd digit in place of an even digit. The idea is that when 5 is the last digit after a decimal, it should go up half the time and down half the time because it is exactly midway. It is sometimes desirable to use the same number of decimal places uniformly throughout a report, even if n varies from one section of the report to another. The number of decimal places in this case would depend on the highest n. If there are 260 subjects belonging to 80 families, the percentages for subjects as well as for families can go up to two decimal places each.
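The round-half-to-even rule described above is directly available in Python's decimal module; a minimal sketch of ours:

```python
# Round-half-to-even ("banker's rounding"): when the dropped digit is exactly 5,
# round so that the preceding digit becomes even, going up half the time.
from decimal import Decimal, ROUND_HALF_EVEN

for value in ("1.15", "3.45"):
    rounded = Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)
    print(value, "->", rounded)  # 1.15 -> 1.2, 3.45 -> 3.4
# Decimal avoids binary floating-point surprises: the float literal 1.15 is
# actually stored as 1.1499999..., so round(1.15, 1) gives 1.1 rather than 1.2.
```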
The number of decimal places in a coefficient, such as in a regression equation, would depend on the numerical magnitude of the quantity with which it is multiplied. The coefficient 1.07 for birth weight in kilograms has the same accuracy as 0.00107 for birth weight in grams. Sometimes original measurements are also made with excess accuracy. It would be a waste of resources to measure survival time in cancer patients in days and report it as 3.7534 years, or insulin level in a subject to two decimal places. A minute difference in such readings does not affect the validity of the measurement. The same logic applies to percentages and means. It does not matter whether the mean diastolic blood pressure (BP) level of a group is 86.73 or 86.54 mmHg; both could be rounded off to 87 for medical interpretation. The decimal places in this case serve no useful purpose; instead, they complicate the presentation and interpretation. Many rates can be expressed as percentages, but for others a different multiplier is used for convenience. For example, the death rate due to cervical cancer is stated as 8 per 100,000 women. This is the same as 0.00008 per woman, or 0.008%. Siegrist [16], however, found that rates expressed as a frequency (8 per 100,000) are perceived differently than rates expressed as a probability (0.00008).

21.3.2 Problems in Reporting

Among the many problems that can occur in reporting, two requiring special attention are incomplete reporting and overreporting. Misuse of graphs is another.

21.3.2.1 Incomplete Reporting

All reports should state not only the truth, but the whole truth. There is a growing concern in the medical fraternity that part of the information in reports is sometimes intentionally suppressed and sometimes unknowingly missed. If so, the reader gets a biased picture. This is easily illustrated by the bias of many medical journals toward reporting positive findings and ignoring negative reports. While knowledge about the side effects of a therapeutic regimen is definitely important, knowledge about their absence cannot be ignored altogether. Both should be reported in a balanced manner. Similarly, properly designed studies that do not reject a particular null hypothesis deserve a respectable place in the literature. That, at present, is sadly lacking, although awareness of the need for a balanced approach is increasing. Pharma companies are often accused of highlighting, perhaps exaggerating, the potential benefits of their regimen while hiding evidence of damaging side effects. If you are scanning the literature to update your knowledge, examine whether you are getting balanced information. It is not an uncommon practice in the medical literature to omit full details of the methodology, the sources of subjects, or the data. For example, reports on trials sometimes say that randomization was done, but exactly how it was achieved is not stated. Sometimes the source of the subjects is not clearly stated. One such example, among many, is the study of infants and young children reported by Kirjavainen et al. [17] on the monitoring of respiration during sleep, where the source of the subjects is obscure. If the subjects are from a mix of heterogeneous sources, the results in some cases can be suspect. Often, the target population to which the results are intended to be extrapolated remains obscure.
There are examples of articles in reputed journals that report that data were statistically analyzed without stating the particular statistical methods that were used for different inferences. Some remain absolutely silent about statistical methods; see, for example, an article by Belgorosky et al. [18]. Another problem is with regard to the data. Although complete data are desirable, it may not be feasible to reproduce them in many cases. However, they can be placed on the web. In any case, the reports should contain enough details for a suspicious reader to verify the results. A serious problem is reporting only those parts of a study that support a particular hypothesis. The other parts are suppressed. An example is a series of studies on the carcinogenic effect of asbestos. According to an analysis of different studies provided by Lilienfeld [19], deliberate attempts were made by the industry to suppress information on the carcinogenicity of asbestos that affected millions of workers. If 12 frequent users of mobile phones in an area are found to develop cancer, how big is this warning sign? For media, this is a big catch. If there are 900,000 other users who did not develop cancer, can you say that mobile phones protect from cancer? Such reports deserve to be fully investigated to discover the truth before they are sensationalized. While discussing the critique of a medical report in Chapter 1, we advised that you ensure that the results do indeed emanate from the data. Inference must always be accompanied by evidence. Some reports do not live up to expectations in this respect.
21.3.2.2 Overreporting

The statement of results in a report should generally be confined to the aspects for which the study was originally designed. If there are any unanticipated findings, these should be reported with caution. They can be labeled as "interesting" or "worthy of further investigation" but not presented as conclusions. A new study, specifically designed to investigate these findings, should be conducted. If a drug is found to be marginally effective in improving vision in an isolated small-sized trial, it is likely to be blown up by the hasty media because such a drug is needed by the public. Thus, unproved treatments can get undue promotion. If one subgroup showed a large effect of a regimen, this can be undesirably highlighted through more detailed analysis and extensive discussion. Sometimes even the primary focus may change from the one stated in the protocol. Any violation of protocol leaves strong grounds for suspicion. Also, beware of the "other" and "unknown" categories of responses in a finished report. They may be hiding important information on cases adverse to the hypothesis of the investigator.

21.3.2.3 Selective Reporting

Researchers around the world tend to be selective in reporting the results of their work. Whereas some of this can be justified to save journal space and reduce complexity, at other times such filtering can introduce bias. If you started with 1000 subjects, all of them must be fully accounted for in your report. Statements such as CONSORT and STROBE require this in full measure. But filtering can be done, intentionally or unsuspectingly, at other levels too, for example:

(a) selecting the "favorable" end point for reporting;
(b) reporting the results for three treatments while the study had four treatments;
(c) truncating the results to, say, 6 months even though the follow-up was 2 years because they are to your liking until this point in time;
(d) studying 15 predictors but restricting the results to 6 convenient predictors;
(e) trying different cut points for your continuous variables and choosing for reporting those that provide favorable results;
(f) using categories such as none, mild, moderate, and serious (e.g., for hypertension) where actual levels (of BP in this case) should be used; and
(g) reporting two favorable results out of dozens of attempted analyses.

What is the way out? Do not compromise the integrity of your work. Fully disclose what variables you started with, what analyses were tried, and what part of the results you are reporting and why. Why other results are not included should also be expressly stated. This does not necessarily remove the bias, but it makes the reader aware of the limitations of the results.

21.3.2.4 Self-Reporting versus Objective Measurement

Self-perception of health may be very different from an assessment based on measurements. A person with an amputated leg may consider himself or herself absolutely healthy. Besides such discrepancies, it has been observed, for example, that people tend to report lower than actual weight and higher than actual height. This can make a substantial difference when BMI is calculated: the percentage of subjects with a BMI of ≥25 may become much lower than what would be obtained by actual measurement. Thus, only those characteristics that must be self-reported for the fulfillment of the objectives of the study should be collected this way. All others should be objectively measured.
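A quick computation shows how small self-reporting biases can shift BMI across the ≥25 cutoff. The sketch below is ours; the weight, height, and bias figures are assumed purely for illustration.

```python
# How small self-reporting biases shift BMI (the 2 kg and 2 cm biases below
# are assumed for the example, not taken from any study).
def bmi(weight_kg: float, height_m: float) -> float:
    return weight_kg / height_m ** 2

measured = bmi(78.0, 1.75)               # 25.5: above the >= 25 cutoff
reported = bmi(78.0 - 2.0, 1.75 + 0.02)  # under-report weight, over-report height
print(f"measured {measured:.1f}, self-reported {reported:.1f}")
# measured 25.5, self-reported 24.3: the subject slips below the cutoff,
# so prevalence of BMI >= 25 is understated under self-reporting.
```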
21.3.2.5 Misuse of Graphs

While describing graphs and diagrams in Chapter 8, we mentioned that some fallacies can occur due to an inadequate choice of scale. A steep slope can be represented as mild, and vice versa. Similarly, a wide scatter may be shown as compact. Variation may be shown as ±1 SD when it is actually much more. Also, means in different groups or means over time can be shown without corresponding SDs, and can be made to indicate a trend that really does not exist or is not statistically significant. One of the main sources of fallacies in graphs is their insensitivity to the size of n. A mean or a percentage based on n = 2 is represented in the same way as one based on n = 100. Unfortunately, the perception, and possibly the cognition, received from a graph is not affected even when n is explicitly stated. One such example is the box-and-whiskers plot drawn by Koblin et al. [20] for the time elapsed between cancer and AIDS diagnoses among homosexual men, with cancer diagnosed before or concurrently with AIDS, in San Francisco during 1978–1990. This is reproduced in Figure 21.5. Five lines (minimum, Q1, median, Q3, and maximum) are shown on the basis of only four cases of anal cancer. It was concluded on the basis of the figure that "Hodgkin's disease cases occurred relatively close to AIDS diagnoses." This may have merit, but it is stated without checking the statistical significance and in disregard of the small number of available cases.
[Figure 21.5 shows side-by-side box-and-whiskers plots of the time from cancer to AIDS diagnosis (months; scale 0 to −100) for Hodgkin's disease (N = 6), anal cancer (N = 4), and all other cancers (N = 20), with outliers marked by x.]

FIGURE 21.5 Time between cancer and AIDS diagnoses among homosexual men with cancer diagnosed before or concurrently with AIDS, San Francisco, 1978–1990. (From Koblin, B.A. et al., Am. J. Epidemiol., 144, 916–923, 1996. With permission of the Johns Hopkins University School of Hygiene and Public Health.)
21.4 Misinterpretation

Misinterpretation of statistical results mostly occurs due to failure to comprehend them in their totality and an inability to juxtapose them with the realities of the situation. This can happen either because many medical professionals have limited knowledge of statistical concepts [21] or because of inadequate understanding of medical issues by the statisticians associated with medical projects. If there is great interest or many teams are competing, a chase for a tantalizing statistical significance that may or may not exist is inevitable. One problem is the inappropriate use of statistical methods, but some of the others are as follows.

21.4.1 Misuse of P-Values

Statistical P-values seem to have gained acceptance as a gold standard for data-based conclusions, although serious questions have been asked lately. Biological plausibility should never be abandoned in favor of P-values. Inferences based on P-values alone can also produce a biased or incorrect result. You may want to revisit Sections 12.5 and 15.4 to understand the limitations of P-values. All P-values assume that the data are correct. If that is not so, the effect of incorrect data is confounded with the result and can even override the conclusion.

21.4.1.1 Magic Threshold of 0.05

A threshold of 0.05 for Type I error is customary in health and medicine. Beyond convention, there is no specific sanctity to this threshold, and there is certainly no cause for obsession with this cutoff point. A result with P = 0.051 is statistically almost as significant as one with P = 0.049, yet the conclusion reached would be very different if P = 0.05 is used as the threshold. Borderline values always need additional precaution. At the same time, a value close to the threshold, such as P = 0.06, can be interpreted both ways. An investigator interested in showing the presence of an effect might argue that this P approaches significance; an investigator not so interested can easily brush it aside as not indicating significance at α = 0.05. It is for the reader to be on guard and check that the interpretation of such borderline P-values is based on objective considerations and not driven by bias. Our advice is generally to interpret P = 0.06 or 0.07 as encouraging, although not quite good enough to conclude statistical significance. These can be called marginally significant results. If feasible, wait for some more data and retest the null hypothesis in this situation. However, if biological plausibility is convincing, a conclusion of the presence of the effect can be drawn despite a (marginally) higher P-value.
The second problem with the threshold of 0.05 is that it is sometimes used without flexibility in the context of its usage. In some instances, as in the case of a potentially hazardous regimen, more stringent control of Type I error may be needed; then α = 0.02, 0.01, or 0.001 may be more appropriate. It is not necessary to jump to α = 0.01 if a value less than 0.05 is required: α = 0.02 can also be used if it can be justified. In some other instances, as in concluding the presence of differences in social characteristics of the subjects, a relaxed threshold of α = 0.10 may be more appropriate. For most physiological and pathological conditions, however, the conventional α = 0.05 works fine, and that is why it has remained a standard for so long. A value of P around 0.10 can be considered weak evidence against the null hypothesis, and a very small P, say, less than 0.01, strong evidence. This text has used α = 0.05 in most of the examples. As stated in Chapter 12, the practice now generally followed is to state exact P-values so that readers can draw their own conclusions. At the same time, a threshold is necessary if a decision has to be a binary yes or no.

21.4.1.2 One-Tailed or Two-Tailed P-Values

Example 21.5a on the rise in Hb level is a situation where a one-tailed test is appropriate because biological knowledge and experience confirm that iron supplementation cannot reduce the Hb level. Use of a two-tailed test in this case is unnecessarily restrictive and makes rejection of H0 more difficult. However, in most medical situations, assertion of a one-sided alternative is difficult and a two-sided test is needed. Most statistical packages provide two-tailed P-values as a default, and many workers do not worry too much about this aspect. Scientifically, a conservative test does not do much harm, although some real differences may fail to be detected when a two-tailed test is used instead of a one-tailed test. Our advice is to use a one-tailed test only where a clear indication is available, and not otherwise.

21.4.1.3 Multiple Comparisons

This point has already been explained in the paragraph on overanalysis in Section 21.2, and there is no need to repeat the argument, but it needs to be mentioned here for completeness. Several tests of statistical significance, each at α = 0.05, make the total error much higher than acceptable. Specific statistical methods, such as the Tukey procedure, that control α at 0.05 or any other specified level should be used in such cases. A wiser approach is to limit the number of P-values to the original question. Do not give too many P-values that are difficult to handle. The concept of multiple comparisons and adjustment of P-values is complex. Many P-values may have been considered to arrive at a conclusion. The procedures just mentioned are valid for the stated situations, but in practice, many statistical tests may have been tried before reaching the final ones for the report. Hardly anybody will make an adjustment for such behind-the-scenes statistical tests, although they also affect the final P-value. There might be two or more publications on the same set of data with a different focus or with different outcome variables. Each publication may be complete within itself for multiple comparisons but oblivious to the comparisons made in the other papers. The argument can be extended to Type I errors committed by other workers in similar studies and possibly in a lifetime.
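The Tukey procedure mentioned above is specific to pairwise comparisons of means; for an arbitrary family of P-values, simple adjustments such as Bonferroni and Holm serve the same purpose of controlling the familywise error rate. The sketch below is ours, with made-up P-values for illustration.

```python
# Bonferroni and Holm adjustments for a family of m tests.
def bonferroni(pvals):
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):          # step-down: smallest P first
        adj = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, adj)   # keep adjusted P-values monotone
        adjusted[i] = running_max
    return adjusted

pvals = [0.001, 0.012, 0.041, 0.20]  # illustrative raw P-values
print([round(p, 3) for p in bonferroni(pvals)])  # [0.004, 0.048, 0.164, 0.8]
print([round(p, 3) for p in holm(pvals)])        # [0.004, 0.036, 0.082, 0.2]
# Holm is uniformly less conservative than Bonferroni yet still controls
# the familywise error rate at the chosen level.
```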
Ignoring the accumulation of Type I errors in this way is, for some researchers, one reason that statistically significant results fail to reproduce. Just be on guard against such fallacies.

21.4.1.4 Dramatic P-Values

Attempts are sometimes made to dramatize P-values. James et al. [22] stated P < 0.00000000001 for a difference in seroconversion rate against the Epstein–Barr virus between patients with lupus and controls. This is pulling a number out of a computer output without being careful about its implications. Such accuracy is redundant: it really does not matter whether P < 0.001 or P < 0.000001 as far as the practical implication is concerned. Many statistical software packages rightly stop at three decimal places by default and give P = 0.000 when it is exceedingly small. This only means that P < 0.0005, and that is enough to draw a conclusion. Further decimal places are rarely required. But do not interpret P = 0.000 as P = 0.

21.4.1.5 P-Values for a Nonrandom Sample

This text has repeatedly emphasized that all statistical methods of inference, such as CIs and tests of significance, require a random sample of subjects. Patients coming to a clinic during a particular period can be considered a random sample from the population of patients currently coming to that clinic. But this limited definition of the population is sometimes forgotten, and generalized conclusions are drawn on the basis of P-values. This is quite frequent in the medical literature and mostly accepted without question. An example is an investigation [20] of a New York City cohort of homosexual men who participated in two studies of hepatitis B virus infection in the late 1970s and a San Francisco City clinic cohort of homosexual men
recruited from the municipal sexually transmitted disease (STD) clinics from 1978 through 1990. Neither cohort is a random sample. Yet P-values for the statistical significance of the difference between the two cohorts have been calculated. These P-values are valid only when the cohorts are considered a random sample from a hyperpopulation of homosexual men in the two cities. This hyperpopulation exists in concept but not in reality and may include past and, more importantly, future homosexual men. The implication of the result, which in this case is the increased incidence of cancer among homosexual men, is a warning to future homosexual men more than to the present ones. Two more points need to be stressed in this context. First, a sample can rarely be fully representative of the target population in a true sense. Thus, the risk of a wrong conclusion at α = 0.05 is, in fact, slightly more in many cases than it is for truly random samples of large size. This probability can also be affected by, for example, a non-Gaussian distribution of the measurement under consideration. Second, P-values have no meaning if no extrapolation of findings is stipulated. Conclusions from a sample about the sample subjects themselves can be drawn without worrying about P-values.

21.4.1.6 Assessment of "Normal" Condition Involving Several Parameters

Consider individual persons instead of groups. As explained earlier, the reference range for most quantitative medical parameters is obtained as the mean ± 2 SD of healthy subjects, assuming a Gaussian distribution. These statistical limits carry a risk of excluding the 5% of healthy subjects who have levels in the two extremes. When such limits are applied to several parameters, it becomes very unlikely that all parameters in any person are within such statistical ranges, even if the person is fully healthy: with 20 independent parameters, the chance that all fall within their respective 95% ranges is only 0.95^20 ≈ 0.36. This anomaly is sometimes forgotten while evaluating the health parameters of a person or while devising inclusion and exclusion criteria for subjects in a research study. This fallacy is similar to that of a multivariate conclusion based on several univariate analyses, and to the one inherent in multiple comparisons.

21.4.1.7 Absence of Evidence Is Not Evidence of Absence

We have tried to make the distinction all through this book that a large P-value is not to be interpreted as evidence of absence of the effect. When P > 0.05 for a difference between two groups, the only valid conclusion is that the data are not sufficient to provide evidence of a difference being present. This does not imply that there is no difference. Yet, when the sample size is large enough to have sufficient power for detecting a specified unimportant difference, the conclusion can be that the difference is less than the specified value—thus, practically, there is no difference. This kind of argument is generally put forward to assert that two groups are not different at baseline. Remember, though, that the sample size must be sufficiently large to arrive at this conclusion.

21.4.2 Correlation versus Cause–Effect Relationship

Some professionals feel that an RCT is the only mode of providing scientific evidence of a cause–effect relationship. A moment's reflection will convince you that this is not so. Most accept that cigarette smoking causes lung cancer, yet no trial has ever been conducted.
Controlled experiments do provide direct evidence, but evidence from observational studies can be equally compelling when confounders are truly under control and the results replicate in a variety of settings. Keep in mind, though, that an association or a correlation in health measurements can arise from a large number of intervening factors. It is rarely a cause–effect type of relationship unless established by a carefully designed study that rules out a significant role for any confounding factor. McLaughlin et al. [23] report from epidemiological studies that women in the United States with cosmetic breast implants have a twofold to threefold risk of suicide compared with the general population. This is statistically significant. However, this by itself does not support a cause–effect relationship. They speculate that the psychopathological condition of the women may be a confounding factor, since this condition can contribute both to the desire for a breast implant and to suicide. They also remark that any study aiming to investigate this relationship for cause–effect should also rule out year of birth, family history of psychiatric admission, and so forth as factors. The more important threats to cause–effect inference are not the known confounders, but the unknown ones. A strong correlation between the heights of siblings in a family exists not because one is the cause of the other, but because both are affected by parental height. Similarly, the correlation between visual acuity and vital capacity in subjects of age 50 years and above is not of the cause–effect type but arises because both are products of the same degeneration process. No one expects vision to improve if vital capacity is improved by some therapy. The maternal mortality rate declined in Mexico between 1960 and 2010, while the proportional mortality from coronary diseases increased. These too have a strong negative correlation, but theirs is not a cause–effect type of relationship. Counterfactual information provides useful ammunition for refuting causality.
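The sibling-height mechanism is easy to mimic by simulation. In the sketch below (ours; the effect sizes are arbitrary), X and Y never influence each other, yet they correlate strongly because both depend on a shared factor Z.

```python
# Correlation without causation: X and Y do not affect each other, but both
# inherit from the shared factor Z (e.g., parental height), so corr(X, Y) is high.
import numpy as np

rng = np.random.default_rng(seed=7)
z = rng.normal(size=10_000)            # shared cause (confounder)
x = z + 0.5 * rng.normal(size=10_000)  # sibling 1
y = z + 0.5 * rng.normal(size=10_000)  # sibling 2

print(round(np.corrcoef(x, y)[0, 1], 2))  # about 0.8
# Within narrow strata of Z the correlation vanishes: adjusting for the
# shared cause removes the association, which is the counterfactual check.
```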
An unusual confounding factor may provide useful information in some cases. Persons with depressed moods may have an elevated risk of lung cancer because of a third intervening factor, namely, smoking [24]. Depressiveness seems to modify the effect of smoking on lung cancer either by a biological mechanism or by affecting smoking behavior. This example illustrates how a seemingly nonsense correlation can sometimes lead to a plausible hypothesis. Another example is the parental age gap influencing the sex ratio of firstborn children [25]. At the same time, our existing knowledge tells us that no drug can change the blood group; there is no need for empirical evidence to confirm this.

21.4.2.1 Criteria for Cause–Effect

For an association or a correlation to indicate cause and effect, the following points should be carefully examined:
1. The association should be present not only in the prevalence but also in the incidence. It should remain statistically significant when the effect of other confounding factors is eliminated by means of a method such as multiple regression. If feasible, confirm this by an experiment.

2. The association or correlation must be consistent across various groups, periods, and geographical areas. Wherever or whenever the postulated cause is present, the effect must also be present. If an association is found between lipoprotein(a) and the incidence of CAD in Canada, Singapore, England, and South Africa, and it has persisted since this protein was first discovered, then this is likely to be a cause of CAD, although it could be one of many causes.

3. There must be a dose–response kind of relationship in the sense that if the cause is present in a higher amount or greater intensity, the chance of the effect should also be higher. The more one smokes, the higher the risk of lung cancer.

4. The relationship should also be specific; that is, if the postulated cause is absent, the effect should also be absent. However, make a distinction between a necessary cause and a sufficient cause: the effect can occur for other reasons as well.

5. The relationship must be biologically plausible, which means an explanation linking the two in a causal way is available. For example, cigarette smoke contains cotinine, which easily affects lung cells and causes aberrant cell behavior. Thus, a biological explanation is available.

6. It is generally stated for a cause–effect relationship that the degree of association or correlation should be high. This may be so when the factor under investigation is a dominant cause. The correlation between parents' and children's levels of cholesterol is not high, but this does not exclude genetic influence as a cause of a raised cholesterol level. When a correlation, albeit low, is consistently present, a causal relationship can still be inferred, but as one of many causes. It should, however, be statistically significant so that sampling fluctuation is adequately ruled out as a likely explanation. In addition, note that the correlation coefficient can be small just because a small range of values has been observed.

7. If both the antecedent and the outcome are rare, their occurrence together is a strong indication of a cause–effect relationship.

The causal role of elevated serum cholesterol in atherosclerotic CAD, for example, fulfills all these criteria. A consistent association has been found in prevalence as well as in incidence in a large number of studies carried out in populations with different backgrounds. A positive gradient in the risk of the disease has been seen with rising cholesterol level. In controlled clinical trials, cholesterol-lowering drugs and diet therapy have been shown to result in a reduction of coronary events. Angiographic regression of coronary lesions has also been demonstrated. Biological plausibility has been established from animal and human studies that have demonstrated cholesterol deposition in atheromatous plaques.

A recent example is the Zika infection during pregnancy "causing" microcephaly in newborns in Brazil and Colombia [26]. (a) For it to be a causal relationship, the virus infection must precede the fetal microcephaly—the mother was infected early in pregnancy.
(b) For biological plausibility, although microcephaly can occur due to other pathogens, Zika virus was isolated from the brain tissues of fetuses and infants with microcephaly. (c) Both Zika infection and microcephaly are rare occurrences—their simultaneous presence is too much of a coincidence and points to cause–effect. (d) No alternative explanation has emerged that can deny the cause–effect relationship. Although some other criteria are not fulfilled, these four reasons are considered sufficient to suggest a cause–effect relationship between Zika infection and microcephaly in newborns.
21.4.2.2 Other Considerations

A distinction may be made between a necessary cause and a sufficient cause. Sexual intercourse is necessary for initiating pregnancy in its natural course, but it is not sufficient. In fact, the correlation between the number of intercourses and the number of pregnancies, even without barriers, is negligible. In this era of diseases with multifactorial etiology, one particular factor may not be sufficient to cause the disease; the presence or absence of one or more of the others may also be needed. Thus, a factor could be a contributing cause in the sense that it is a predisposing, enabling, precipitating, or reinforcing factor. For a detailed discussion of causal relationships in medicine, see Elwood [27]. The aforementioned discussion assumes that the roles of chance and bias have been minimized, if not eliminated. Chance due to sampling fluctuation is adequately ruled out by demonstrating statistical significance, but bias, as discussed in Chapter 2, can occur due to a host of factors such as the selection process, observational methods, differential recall of past events, and suppression of information. Measures such as standardization of methods, training, randomization, and matching have already been discussed as ways to minimize bias. The cause–effect hypothesis is strengthened when all alternative explanations are also studied and shown to be untenable. A larger sample size does help in increasing confidence, but its role in causal analysis beyond statistical significance is marginal.

21.4.3 Sundry Issues

The list of statistical fallacies mentioned so far is not complete. Several others are mentioned in various chapters. Some of these need to be reiterated.

21.4.3.1 Diagnostic Test Is Only an Additional Adjunct

Moons et al. [28] make a forceful plea, and rightly so, that a diagnostic test is hardly ever used in isolation. In almost every situation, the patient's history and physical examination findings are already available before a test is used. In fact, a test is advised only on the basis of the clinical picture. Thus, the utility of a test should be examined as an additional adjunct instead of solely on its own performance. They illustrate this by calculating the area under the ROC curve for 140 patients suspected of pulmonary embolism who had an inconclusive ventilation–perfusion lung scan. The tests they evaluated were partial pressure of oxygen in arterial blood (PaO2), x-ray film of the thorax, and leg ultrasound. The ROC curve area for history and physical examination was 0.75, which rose to 0.77 when only the PaO2 test was added and to 0.81 when only the thorax x-ray was added, and remained the same when ultrasound was added. There was no clear-cut advantage in choosing between the x-ray and ultrasound from this viewpoint; the x-ray is preferred because of convenience. Against this, single-test evaluations showed that ultrasound was the most informative. The authors rightly concluded that single-test evaluations can be very misleading. Also note in this example that the additional utility of the tests, once the clinical picture is available, is very limited: the gain in the AUC is only 0.02–0.06. Such a low gain over the clinical picture is often forgotten when the utility of tests is evaluated. Medical tests are often ordered to get objective evidence of what has been subjectively evaluated using the clinical picture. A clinician may already be confident, but corroborative evidence may be more convincing to the patient.
Also, tests are often considered essential to wriggle out of legal hassles. Statistically, though, tests in some situations do not add much to the probability already arrived at from the clinical picture. In these cases, the posttest probability of a disease is not much different from the pretest probability.

21.4.3.2 Medical Significance versus Statistical Significance

As emphasized in earlier chapters, there is a need to scrupulously maintain the distinction between statistical significance and medical significance. An average decline of 1 mmHg in systolic BP can be statistically significant if n is large but may not have any medical significance in terms of the condition of the patients or the management of the condition. This text has tried to make clear in many examples that statistical methods check statistical significance only; medical significance needs to be examined separately. For example, in a test for bioequivalence of two pharmaceutical preparations, if the difference is less than a specified amount, you can conclude that the preparations are essentially equivalent even if the difference is statistically significant.

21.4.3.3 Interpretation of the Standard Error of p

As explained in Chapter 12, SE(p) has two very different meanings depending on whether it is measured in an absolute sense or in a relative sense. If p = 0.03, then SE(p) = 0.022 is exceedingly high because the 95% CI for π is (0, 0.07) for large n,
which is wide and does not add much to our knowledge. If p = 0.40, even a doubled SE(p) = 0.044 is low because the 95% CI (0.31, 0.49) is still narrow relative to the value of p.

21.4.3.4 Univariate Analysis but Multivariate Conclusions

For simplicity in calculations and for easy understanding, statistical analysis is often done for one variable at a time. This is adequate as long as the conclusion too is univariate. But a multivariate conclusion on the basis of several univariate analyses can be wrong. Consider, for example, the relationship of weight, length, and head circumference of infants of age 10 months with feeding practices and antenatal smoking. If weight is affected but length and head circumference are not, no definite conclusion can be drawn for growth as a whole. Growth is a multidimensional variable that includes all three measurements, if not more. If development is also to be assessed, and one milestone is properly achieved while the others are not, the conclusion again can be drawn for that milestone by univariate analysis but not for development in general. The interest may be in the composite answer. If so, multivariate analysis becomes essential but is rarely done. There is another dimension to the problem. It is quite possible for three or four variables, when considered separately, to exhibit statistical significance but become nonsignificant when considered together in a multivariate setup. The opposite can also happen, although this is rare. Such instances occur because multivariate analysis gives due consideration to the interrelations of the variables, which a set of univariate analyses does not do. A classic example of a bivariate entity is hypertension. It requires simultaneous consideration of both systolic and diastolic BP; one of them in isolation is rarely enough to draw a valid conclusion. Similarly, the efficacy of a treatment regimen is assessed not just in terms of the recovery rate but also in terms of speed of relief, side effects, cost, convenience of administration, and so forth. The recovery itself may be multivariate in some situations, consisting of a symptomatic response, functional change, and laboratory results, and these may not be of equal importance. A practicing clinician may be interested in a composite answer about whether the regimen on the whole is beneficial to his or her patients. This conclusion is necessarily multivariate and requires simultaneous consideration of the factors involved. Realization of the importance of multivariate analysis is common, but its use in practice is rare, probably because of the intricacies involved in calculation and interpretation. The situation may change in the course of time.

21.4.3.5 Limitation of Relative Risk

No risk can exceed 1. If the 10-year risk of death in leukemia is 0.99 and in anemia only 0.02, the relative risk (RR) of death in leukemia is nearly 50. Opposed to this, if the comparison of leukemia is with breast cancer cases, where the 10-year risk of death is 0.60, the RR is only 0.99/0.60 = 1.65. This is nearly as high as it can get in this situation. If the risk at baseline is 0.5 or more, the RR cannot exceed 2.0. Note the ceiling imposed by a high risk in the reference group: the value of the RR is greatly influenced by the risk in the control group. This aspect is often forgotten while interpreting the value of RR.
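The ceiling on RR can be made explicit: since no risk can exceed 1, the largest attainable RR against a reference risk r is 1/r. A small sketch of ours:

```python
# Maximum attainable relative risk given the risk in the reference group:
# RR = risk_exposed / risk_reference <= 1 / risk_reference, since risks cap at 1.
for reference_risk in (0.02, 0.50, 0.60):
    print(f"reference risk {reference_risk}: max RR = {1 / reference_risk:.1f}")
# reference risk 0.02: max RR = 50.0  (the leukemia vs. anemia example)
# reference risk 0.50: max RR = 2.0
# reference risk 0.60: max RR = 1.7   (so RR = 0.99/0.60 = 1.65 is near the ceiling)
```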
21.4.3.6 Misinterpretation of Improvements

A gain of 2 years in survival duration after an intervention has a different meaning for a disease affecting old-age people, such as Alzheimer's, than for a condition affecting young people, such as motor vehicle injuries. A gain of 2 g/dL in Hb level is easier to achieve when the baseline is 6.2 g/dL than when it is 14.7 g/dL. An improvement of 1.5/1000 in the case–fatality rate has different implications for cases of stage IV cancer than for cases of peritonitis. Such gains may look statistically the same but have different social implications. You should be careful in interpreting such improvements: put a value on the outcome and interpret accordingly. Another kind of misinterpretation is cited by Vickers [29] in the context of the result of a trial of prostatectomies. This trial found that 53 (15.3%) of 347 patients with prostate cancer died in the prostatectomy group, of which 16 died from prostate cancer. In the control group of 348 patients, 62 (17.8%) died, of which 31 died from prostate cancer. The conclusion was that prostatectomy reduced deaths from prostate cancer but not overall deaths. How is this possible when the cases are all of prostate cancer and this is the primary cause of death? What other factors were operating in the prostatectomy group to cause nearly the same number of deaths? This example illustrates that statistical nonsignificance in overall deaths (15.3% vs. 17.8%, P > 0.10) should not be considered as no difference. It is possible that the overall mortality was different but the difference was not picked up by this study. A second possibility is that prostatectomy reduced prostate cancer deaths but increased deaths from other causes; this is not unlikely. A third possibility, as emerged later on, is statistical. In this case, the difference 17.8% − 15.3% = 2.5% in overall mortality is smaller than the difference 31/348 − 16/347 = 4.3% in prostate cancer–specific mortality. Statistical significance is also more difficult to achieve for the same difference between two large percentages than between two small percentages. That is, it is easier to detect a difference between, for example, rates of 8% and
4% in cancer-specific mortality than a similar difference between 28% and 24% in overall mortality. The difference of 4% in the former is one-half of the higher percentage, whereas it is one-seventh in the latter. How many of us really look at the data so critically?

21.4.4 Final Comments

No decision is more important than one concerning the life and health of people. The medical fraternity has a tall order: they are expected to prolong life and reduce the suffering that occurs as a consequence of complex and often poorly understood interactions among a large number of factors. Some of these factors are explicit, but many remain obscure, and some behave in a very unpredictable manner. Uncertainties in health and medicine are indeed profound. A main reason behind the success achieved so far on the health front is the ability of some professionals to learn quickly from their own experience or from the experience of others and to articulately collate the experience gained in laboratories, clinics, and the field. Successful empiricism requires proper discernment of trend from fluctuation, of focus from turbulence, and of order from chaos. Knowingly for some, but unknowingly for many, statistical methods play a vital role in all empirical inferences. Know the limitations also: nothing in medicine can be predicted with certainty, no matter how sophisticated the method. However, uncertainties can be controlled and estimated with reasonable confidence. That helps in some, but not all, cases. Some researchers (see, e.g., Vickers [29]) have passionately and convincingly pleaded that inappropriate use of statistical methods and the consequent wrong decisions can risk the lives and health of many people. A surgeon is trained for years, yet a mistake is dreaded, as it can cost the life of a patient. Statistical methods, by contrast, seem to belong to everyone: sufficiently trained or not, almost anybody can comment on statistical data and carry out statistical analysis. Vickers illustrates how a mistake in statistical analysis and interpretation may have resulted in 750 premature deaths due to prostate cancer in Scandinavia. This indeed is serious, and you should take all possible precautions to ensure that the right method and the right software are used and that the results are correctly interpreted. It should be clear that statistics can be a dangerous tool when used carelessly. Methodological considerations can put you in a statistical quagmire, and it may not be easy to come out of it with insufficient knowledge. The realization that statistics-based thinking must be cultivated in medical professionals is recent. We have tried to demonstrate in this book that biostatistics is about rational thinking, and it does not require fancy mathematics. It is necessary to use an appropriate method for the problem at hand, and for this, mathematics is not all that necessary. As with all experts, statisticians too sometimes differ on the method to be adopted in a particular setup, but there is wide agreement on the basic methods. The choice of a basic method is thus not much of a problem. Reputed statistical packages do not yet have the built-in expertise to decide the correct method, although they sometimes generate a warning message when the data are not adequate. The user of the package decides the method. If you are not sufficiently confident, do not hesitate to consult an expert biostatistician. This consultation is much more effective when done at the planning stage than at the data analysis stage.
Statistics seems to have become too important for some of us. Some statisticians may allow themselves to be manipulated when pressed to do so; Senn [30] narrates how this can happen in some cases. Biostatisticians should never sell their soul for pecuniary gain—they must let the data speak. All data and analyses may be put on the Internet for anybody to check. Medical journals too have a responsibility to ensure that results of dubious quality are not published. Statistical refereeing is a norm for some journals, but others are lax on this issue. There is also a need to realize that biostatisticians need some basic training in medicine or health before being admitted to this fraternity. Some statisticians win the label of biostatistician merely because they happen to work in a medical or health institution. Some universities require sufficient grounding in health or medicine before offering degrees in biostatistics, but most universities around the world are not so particular. Many statistical fallacies occur due to the inability of biostatisticians to comprehend the medical aspects of the problem; they are thus unable to provide sufficiently valid consultation in some cases. Our last advice is not to rely solely on statistical evidence. Statistical tools are good as an aid but rarely as a master. Do not allow them to dominate other ways of thinking. Statistical methods still do not incorporate skilled clinical judgment, which remains the hallmark of clinical practice. In addition, like a diagnostic test, a statistical test can be falsely positive or falsely negative. In diagnosis, decisions are sometimes made against the test results when the contrary evidence is overwhelming; this can be done against statistical tests too when sufficient evidence is otherwise available. You may want to distinguish between statistical evidence and scientific conclusion. Previous knowledge, medical evidence, and biological plausibility must remain the driving considerations in reaching a conclusion. The context cannot be divorced, and numbers by themselves seldom provide infallible evidence. The approach should be holistic rather than
isolated and graded instead of binary. Rely on your intuition more than science. If scientific results fail intuitional judgment, look for gaps. They will most likely lie with science rather than with intuition.
References

1. Ludwig EG, Collette JC. Some misuses of health statistics. JAMA 1971; 216:493–499.
2. Hill AB. A Short Textbook of Medical Statistics. English Language Book Society, 1977, p. 261.
3. Fernandez WG, Mehta SD, Coles T, Feldman JA, Mitchell P, Olshanker J. Self-reported safety belt use among emergency department patients in Boston, Massachusetts. BMC Public Health 2006; 6:111.
4. Vitolo MR, Canal Q, Campagnolo PD, Gama CM. Factors associated with risk of low folate intake among adolescents. J Pediatr (Rio J) 2006; 82:121–126.
5. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized control trial: A survey of 71 "negative" trials. N Engl J Med 1978; 299:690–694.
6. Dimick JB, Diener-West M, Lipsett PA. Negative results of randomized clinical trials published in the surgical literature: Equivalency or error? Arch Surg 2001; 136:796–800.
7. Schnitzer TJ, Kong SX, Mitchell JH, et al. An observational, retrospective cohort study of dosing patterns for rofecoxib and celecoxib in the treatment of arthritis. Clin Ther 2003; 25:3162–3172.
8. Barton MB, Morley DS, Moore S, et al. Decreasing women's anxieties after abnormal mammograms: A controlled trial. J Natl Cancer Inst 2004; 96:529–538.
9. Silber JH, Rosenbaum PR. A spurious correlation between hospital mortality and complication rates: The importance of severity adjustment. Med Care 1997; 35(10 Suppl):OS77–OS92.
10. Gorey KM, Trevisan M. Secular trends in the United States black/white hypertension prevalence ratio: Potential impact of diminishing response rates. Am J Epidemiol 1998; 147:95–99.
11. Gaudreault J, Potvin D, Lavigne J, Lalonde RL. Truncated area under the curve as a measure of relative extent of bioavailability: Evaluation using experimental data and Monte Carlo simulations. Pharm Res 1998; 15:1621–1629.
12. Danese MD, Kim J, Doan QV, Dylan M, Griffiths R, Chertow GM. PTH and the risks for hip, vertebral and pelvic fractures among patients on dialysis. Am J Kidney Dis 2006; 47:149–156.
13. Greene WL, Concato J, Feinstein AR. Claims of equivalence in medical research: Are they supported by the evidence? Ann Intern Med 2000; 132:715–722.
14. Hofacker CF. Abuse of statistical packages: The case of the general linear model. Am J Physiol 1983; 245:R299–R302.
15. Park SW, Kim JY, Lee SW, Park J, Yun YO, Lee WK. Estimation of smoking prevalence among adolescents in a community by design-based analysis. J Prev Med Public Health 2006; 39:317–324.
16. Siegrist M. Communicating low risk magnitudes: Incidence rates expressed as frequency versus rates expressed as probability. Risk Analysis 1997; 17:507–510.
17. Kirjavainen T, Cooper D, Polo O, Sullivan CE. The static-charge-sensitive bed in the monitoring of respiration during sleep in infants and young children. Acta Paediatr 1996; 85:1146–1152.
18. Belgorosky A, Chahin S, Rivarola MA. Elevation of serum luteinizing hormone levels during hydrocortisone treatment in infant girls with 21-hydroxylase deficiency. Acta Paediatr 1996; 85:1172–1175.
19. Lilienfeld DE. The silence: The asbestos industry and early occupational cancer research—A case study. Am J Public Health 1991; 81:791–800.
20. Koblin BA, Hessol NA, Zauber AG, et al. Increased incidence of cancer among homosexual men, New York City and San Francisco, 1978–1990. Am J Epidemiol 1996; 144:916–923.
21. Wulff HR, Andersen B, Brandenhoff P, Guttler F. What do doctors know about statistics? Stat Med 1987; 6:3–10.
22. James JA, Kaufman KM, Farris AD, Taylor-Albert E, Lehman TJA, Harley JB. An increased prevalence of Epstein–Barr virus infection in young patients suggests a possible etiology for systemic lupus erythematosus. J Clin Invest 1997; 100:3019–3026.
23. McLaughlin JK, Lipworth L, Tarone RE. Suicide among women with cosmetic breast implants: A review of the epidemiologic evidence. J Long Term Eff Med Implants 2003; 13:445–450.
24. Knekt P, Raitasalo R, Heliovaara M, et al. Elevated lung cancer risk among persons with depressed mood. Am J Epidemiol 1996; 144:1096–1103.
25. Manning JT, Anderton RH, Shutt M. Parental age gap skews child sex ratio. Nature 1997; 389:344.
26. Hassad RA. Zika virus and birth defects: Correlation or causation? Significance 2016; 13:140.
27. Elwood JM. Causal Relationship in Medicine: A Practical System for Critical Appraisal. Oxford University Press, 1992.
28. Moons KG, van Es GA, Michel BC, Buller HR, Habbema JD, Grobbee DE. Redundancy of single diagnostic test evaluation. Epidemiology 1999; 10:276–281.
29. Vickers A. Interpreting data from randomized trials: The Scandinavian prostatectomy study illustrates two common errors. Nat Clin Pract Urol 2005; 2:404–405.
30. Senn S. Sharp tongues and bitter pills. Significance 2006; 3:123–125.
Exercises
1. A new regimen was developed that claims to improve liver functions within 24 h in those with cirrhosis and fatty liver. In a case–control study, it showed improvement as follows:

                            n                          Improved within 24 h
Group        Cirrhosis  Fatty Liver  Total     Cirrhosis  Fatty Liver  Total
Cases            48          32        80          41          12        53
Controls         21          59        80          16          12        28
The total in the last column shows that the cases receiving the regimen show a much better rate of improvement. Without worrying about statistical significance, what conclusion do you arrive at when the cases and controls are divided into the cirrhosis and fatty liver groups?

2. In a study on the relation between total bilirubin and the alkaline phosphatase (ALP) level, the following data were obtained:

Subject Number            1      2      3      4      5      6      7
Total bilirubin (mg/dL)   0.87   0.65   1.20   0.95   1.07   3.18   0.40
ALP (U/L)                 47     53     89     41     32     148    88
The values for subject number 6 are outliers. Show that these outlying values unduly affect the correlation coefficient between the total bilirubin and ALP levels in these subjects.
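One way to see the effect numerically is to compute the Pearson correlation with and without subject 6. This is a minimal sketch in Python using the data from the table above; the helper name pearson_r is illustrative, not from the text:

```python
import numpy as np

# Total bilirubin (mg/dL) and ALP (U/L) for the 7 subjects in Exercise 2
bilirubin = np.array([0.87, 0.65, 1.20, 0.95, 1.07, 3.18, 0.40])
alp = np.array([47, 53, 89, 41, 32, 148, 88], dtype=float)

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

r_all = pearson_r(bilirubin, alp)                  # with the outlier (subject 6)
mask = np.arange(len(bilirubin)) != 5              # drop subject 6 (index 5)
r_trimmed = pearson_r(bilirubin[mask], alp[mask])  # without the outlier

print(f"r with subject 6:    {r_all:.3f}")
print(f"r without subject 6: {r_trimmed:.3f}")
```

The single extreme pair (3.18, 148) dominates the calculation, so the coefficient drops sharply once it is removed.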
3. A researcher identifies 13 quantitative variables that were statistically significant at the 5% level in affecting an outcome in a previous study. In a new study on these 13 variables, he tests the statistical significance of all correlations between these variables and the outcome, again at the 5% level. What fallacies can occur in this setup, and why?

4. A study reports, on the basis of measurements of children in grades I–V of a school, that as the size of the middle finger increases, IQ on the same test also increases in these children. What fallacy could have occurred in this conclusion?

5. A study on old-age (70+) persons reports that as age increases by 3 years, the incidence of kidney diseases also increases by 1% in a population. How could this result be fallacious and fail to reveal the correct age trend in incidence?

6. For a new hematinic, a study on n = 200 subjects finds that the average gain in Hb level after use for 1 week by anemic subjects (Hb < 11 g/dL) is 1.7 g/dL. Discuss the limitations of this result.

7. A study on subjects of age 60 years and above finds that visual acuity and lung functions are strongly correlated (r = 0.93), and concludes that if lung functions can be improved by exercise or otherwise, visual acuity will also improve. Do you agree with this conclusion? Why or why not?

8. Discuss two sources of bias in studies based on volunteers with the help of an example of your choice.

9. In a city, the average number of cigarettes smoked is only 0.43 per day per person, whereas smoking otherwise looks quite prevalent. How could this average be right yet fallacious?

10. Use the criteria listed in Section 21.4.2.1 to explain how the relationship between BMI and diabetes could indeed be of the cause–effect type in adults.
Brief Solutions and Answers to the Selected Exercises

These exercises are essential for learning the methods, and they supplement the text material. The following solutions are brief; actual solutions should be detailed. Some calculations in these solutions are approximate.
Chapter 1

2. Uncertainties due to intrinsic variation: Liver function variation from person to person of the same age, age differences from person to person that can affect liver functions, variation in calorie intake and the type of diet from day to day that can affect liver function parameters on any given day. Uncertainties due to natural variation in assessment: Variation in laboratory conditions that can give different values for the same person, variation in the method of measurement across laboratories, different observers using different methods to assess dietary intake despite a standard questionnaire. Knowledge limitations: Variation in liver functions due to unknown causes; different views on how to assess dietary intake (24 h recall, 7-day intake, or monthly food consumption); disagreement on which method should be used for assessing the relation between liver functions and age (correlation, curvilinear regression, or just cross-tabulation and chi-square after dividing each into categories) and then disagreement on what categories and how many would be adequate for this assessment.
Chapter 2
2. An observational study, as there is no intervention.

3. Standard conditions in the laboratory and homogeneous experimental material (mice, biological specimens, etc.) in most experiments.

4. The objective of a clinical trial is to show a certain effect of an intervention in a tightly specified type of cases defined by inclusion and exclusion criteria. The subjects in the treatment and control groups can be chosen to be equivalent at baseline so that any difference can be safely attributed to the regimen, more so if random allocation is done, which takes care of the equivalence of epistemic factors as well as of the statistical requirement. There is no need to have a random sample for this conclusion because of the baseline equivalence of the two groups and randomization.

5. d.

8. Advantages of a records-based study: (i) ready availability of data in records; thus, the study is quick and much less expensive; (ii) no investigator bias in the selection of cases or in the assessment, as the cases and their assessment already exist. Disadvantages: (i) records may be incomplete or sloppy; (ii) past cases may not represent the present or future cases to which the findings will apply.

11. (i) Bias. (ii) Random error. (iii) Bias.
Chapter 3

1. (i) Weighted calculation:

(0.20 × 250 + 0.10 × 400 + 0.30 × 100 + 0.40 × 50)/(250 + 400 + 100 + 50) = 0.175.

(ii) Unweighted calculation:

20/80 = 0.25.
The difference is due to the varying sizes of the strata. This can be eliminated by taking a sample from each stratum proportional to its size. Divide the sample of 80 in proportion to the stratum sizes, keep the same proportion injured in each stratum as now, do the unweighted calculation, and check that the two answers become the same. A numerical check appears after this solution set.

5. Name of the sampling method: Cluster sampling. Primary sampling unit: Day of the week (although not random). Demerits: (i) This is not a random sample, since the day is purposively decided. No statistical methods of estimation for all patients can be used unless the selected day is considered random. (ii) Patients on Mondays can be typical and not random, since a particular physician with specific expertise happens to be available on that day. The patients may have some common ailment for which the physician available on Mondays has expertise. This may produce a clustering effect and biased results. Merits: (i) This method is highly convenient, as the day is fixed and all eligible patients can be included. (ii) The sample size could be large without much difficulty, since the scheme can be extended to include as many Mondays as necessary.

6. Sampling unit: Family. Unit of inquiry: Child of age less than 5 years.
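To verify the weighted calculation of solution 1 above, here is a minimal sketch in Python; the stratum sizes and injury proportions are those used in the solution:

```python
import numpy as np

# Stratum sizes and observed proportions injured (from solution 1)
sizes = np.array([250, 400, 100, 50])
props = np.array([0.20, 0.10, 0.30, 0.40])

# (i) Weighted estimate: weight each stratum proportion by its stratum size
weighted = np.sum(props * sizes) / np.sum(sizes)   # 0.175

# (ii) Unweighted estimate from the sample of 80 (20 injured in all)
unweighted = 20 / 80                               # 0.25

print(weighted, unweighted)
```

With proportional allocation of the 80 sampled subjects across the strata, the two estimates coincide, which is the point of the exercise.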
Chapter 4
1. Cross-sectional, since there is no antecedent or outcome, all assessed at the same time. 2. Rare disease: Case–control so that you can start with enough cases. Rare exposure: Prospective so that you can start with enough exposed. 3. (i) The distinction between antecedent and outcome is blurred. (ii) It is possible to take a random sample such that the antecedent and outcome are proportionately represented. (iii) The sample size is large for adequate representation.
5. b.
Chapter 5
3. d. This design has the following factors: diabetes, vehicle, SITA, PDX, B420. With each as yes or no, there would be a total of 2⁵ = 32 possible combinations in a fully factorial experiment. But only five combinations were studied. Thus, the design is partially factorial.

4. Advantages of randomization in an experiment: (i) An equal chance for all the subjects to be in the case or control group reduces possible bias of the investigator; (ii) unknown factors affecting the outcome also tend to be equalized; (iii) it provides the basis to use statistical methods that require random observations.

5. Medical experiment: Human intervention (some patients on a new regimen and some others on the existing regimen). Natural experiment: Intervention by nature (some people exposed to nutrition deficiency and others not).
Chapter 6
2. Merits of historical controls: easily available (cost saver); reduced requirement of current cases. Demerits of historical controls: they may come from a different setting; lack of randomization affects the credibility of results.

4. Clinical equipoise: collective uncertainty regarding the upcoming results among physicians and researchers. Patient equipoise: it does not matter to the patient whether he or she gets the new regimen or the existing regimen because, a priori, they seem equivalent. Personal equipoise: the clinician is willing to randomize his or her patients to any group (in his or her opinion, the two regimens under trial may turn out to have nearly the same efficacy). All these reduce the possibility of bias and thus increase the validity.

5. Randomization when (i) enough patients are available; (ii) informed consent for randomization is available. Matching when (i) the size of the trial is small; (ii) subjects do not agree to randomization. Advantages of randomization: (i) an equal chance for all the subjects to be in the case or control group reduces possible bias of the investigator; (ii) unknown factors affecting the outcome also tend to be equalized; (iii) it provides the basis to use statistical methods that require random observations.

10. Essential features of a phase I trial: (i) small number of subjects; (ii) volunteers; (iii) dose escalation to study tolerability; (iv) no control group. Necessary to establish that the regimen is not harmful and has desirable pharmacological properties before pursuing it further.

13. c. The outcome comes afterward.

14. c. All others are not necessary for phase II.

15. b. The end point in a clinical trial is a clinical outcome.

16. b. A pragmatic trial does not stick to the planned allocation and is not strict about the regimen. It allows variation as seen in practical life, such as some people not complying with the prescription and some switching from one group to the other.

17. c. Random selection of subjects helps in enhancing generalization (external validity). The other options are for internal validity.

18. (i) False. Double blinding reduces specific biases in assessment, but other biases, such as in allocation, can still creep in. (ii) False. An active control is generally an existing treatment, but it may or may not be sufficiently effective. (iii) False. A control can be an existing active treatment. (iv) False. Nonrandomized and uncontrolled trials are also done. (v) False. Finding a matched control means looking for the right person and discarding others. This can increase the cost.

19. (i) True. In fact, this is the purpose of a placebo. (ii) False. In most trials, the primary end point is efficacy rather than side effects, and that does not mean the researchers know that no side effect will occur. (iii) False. There is not much difference between 5 and 8 dropouts out of 387 assigned to each group. (iv) True. The placebo group is the parallel control in this case. (v) False. In fact, excluding ineligible patients increases the validity of the trial and indicates that enough care was taken.

20. a. Crossover can be used only if there is no carryover effect and the disease recurs with the same intensity, which is not given here. This trial has two factors: dose level and age. Thus, a one-way design cannot be adopted. A control is necessary for proper comparison. A factorial design will have all combinations of doses and relevant ages under test. Thus, this is the most appropriate design for this trial. You can also assess interaction in this case to find which dose is most effective at what age.
Chapter 7
2. With original intervals in Table 7.6: Mean = 7.8, median = 7.0, mode = 6.7 days (see Example 7.3).
With new intervals: mean = 304/38 = 8.0, median = 4.5 + 4 × (19 − 4)/23 = 7.1, mode = 4.5 + 4 × (23 − 4)/(2 × 23 − 4 − 10) = 6.9. Different intervals can give different means, medians, and modes, and for that matter, any sample summary.

4. (i) Mean = 5.5, median = 3.0, mode = (2 + 3)/2 = 2.5. (ii) Because of an outlier (the value 41 days), the median of 3.0 days more correctly represents the central value of the duration of hospital stay in these patients. (The mean of 5.53 days is not representative, as 13 of the 15 values are less than this value.) (iii) Q1 = 2.0 days, Q3 = 3.25 days, IQR = 2.0 to 3.25 days (or 3.25 − 2.0 = 1.25 days). This says that at least 50% of the durations are between 2.0 and 3.25 days (both inclusive).

5. SD = 9.92 days; not a good representation of the variability, since the mean is a poor representative of these data and the SD is calculated from the mean. The IQR tells us that at least 50% of the durations are between 2.0 and 3.25 days, which indicates not much variation in most of these values.

7. (i) Mean = 3292.5/130 = 25.3, variance = 25.95, coefficient of variation = 100 × √25.95/25.3 = 20.1%. (ii) First decile = 18 + 5(13 − 2)/41 = 19.3, seventh decile = 23 + 4(91 − 43)/57 = 26.4. Ten percent of the subjects have a BMI less than or equal to 19.3 kg/m2, and 70% have a BMI less than or equal to 26.4 kg/m2.

8. (i) 90th percentile = 14 + 1(178.2 − 158)/36 = 14.56 kg, 95th percentile = 14 + 1(188.1 − 158)/36 = 14.83 kg. Ninety percent of 2-year-old boys have a weight less than or equal to 14.56 kg, and 95% have a weight less than or equal to 14.83 kg. (ii) Tertiles divide the subjects into three equal groups: first tertile = 12 + 1(66 − 15)/65 = 12.78, second tertile = 13 + 1(132 − 80)/78 = 13.67. One-third of the boys have a weight less than or equal to 12.78 kg, another one-third have a weight between 12.78 and 13.67 kg, and the remaining one-third have a weight of 13.67 kg or more.

9. GM = 3.24 days, HM = 2.59 days. Not useful, as they have no clear interpretation in this case.

10. a. The two values have very different units, and the CV is unit-free and hence comparable.

13. Sum total Hb for 300 subjects = 13.5 × 300 = 4050; sum total Hb of healthy subjects = 15.4 × 100 = 1540; sum total Hb of CAD subjects = 12.6 × 150 = 1890. Thus, sum total Hb of diabetic subjects = 4050 − 1540 − 1890 = 620, and the mean for diabetic subjects = 620/50 = 12.4 g/dL.
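The medians, deciles, and percentiles in these solutions all use the same linear-interpolation formula for grouped data, L + h(pN − C)/f. A minimal sketch in Python (the function name grouped_quantile is illustrative; the arguments shown reproduce the median of solution 2 with the new intervals):

```python
def grouped_quantile(L, h, f, C, N, p):
    """Linear-interpolation quantile for grouped data:
    L = lower boundary of the quantile interval, h = interval width,
    f = frequency in that interval, C = cumulative frequency below it,
    N = total frequency, p = desired proportion (0.5 for the median)."""
    return L + h * (p * N - C) / f

# Median of solution 2 (new intervals): L = 4.5, h = 4, f = 23, C = 4, N = 38
print(grouped_quantile(4.5, 4, 23, 4, 38, 0.5))  # about 7.1
```

The same call with p = 0.1 or p = 0.9 and the appropriate interval gives deciles and percentiles.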
Chapter 8

1. See the following figures. [Pie chart of BMI groups (16–18, 18–23, 23–27, 27–35, and 35–45 kg/m2; slice sizes 1%, 5%, 18%, 32%, and 44%, with the 23–27 kg/m2 group at 44%) and histogram of the number of subjects by BMI (kg/m2) in the intervals 16–17, 18–19, …, 44–45.]
Because of the unequal intervals, a histogram has to be made after finding frequencies in hypothetical intervals of 1 unit each, as shown. The area of each bar should equal the frequency in that interval. A histogram is difficult to draw for unequal intervals, but it is a more accurate representation. A pie chart ignores the fact that one interval is wider than another: a large frequency in a wide interval can produce an artifact in a pie chart that will not occur in a histogram. But the pie chart is able to show that slightly less than half of the subjects have a BMI between 23 and 27 kg/m2; a histogram cannot show this directly.

4. A bubble chart is useful to show the relationship between three quantitative variables, where one is a measure of size that can be represented by the size of the bubble. A scatter plot is just for the relationship between two quantitative variables. Examples are your own.

6. FVC has the highest variability in males of age …

Chapter 12

7. (i) SE of mean = 5/√16 = 1.25; P(19.0 < x̄ < 21.0) = P((19.0 − 22)/1.25 < Z < (21.0 − 22)/1.25) = P(−2.4 < Z < −0.8) = 0.20. (ii) P(x > 30) = P(Z > (30 − 22)/5) = P(Z > 1.6) = 0.05; only a 5% chance that the duration of surgery for the next random patient will exceed 30 min.

9. n = [zα√(2π̄(1 − π̄)) + zβ√(π1(1 − π1) + π2(1 − π2))]²/δ², with zα = 1.645 (one-tailed), zβ = 0.84, π1 = 0.65, π2 = 0.75, and δ = 0.10, gives n = 258.3 ≈ 259 in each group.
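A minimal sketch in Python of the two-proportion sample size formula used in solution 9 (the function name n_two_proportions is illustrative; the inputs are those of the solution):

```python
from math import sqrt, ceil

def n_two_proportions(z_alpha, z_beta, p1, p2):
    """Sample size per group for comparing two proportions:
    n = [z_a*sqrt(2*pbar*(1-pbar)) + z_b*sqrt(p1*(1-p1) + p2*(1-p2))]^2 / delta^2,
    where pbar is the average proportion and delta = |p2 - p1|."""
    pbar = (p1 + p2) / 2
    delta = abs(p2 - p1)
    num = z_alpha * sqrt(2 * pbar * (1 - pbar)) + \
          z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return num**2 / delta**2

n = n_two_proportions(1.645, 0.84, 0.65, 0.75)
print(round(n, 1), ceil(n))   # 258.3, so 259 per group
```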
11. n = [zα√(2π̄(1 − π̄)) + zβ/2√(π1(1 − π1) + π0(1 − π0))]²/(Δ − δ)², with zα = 1.645, zβ/2 = 1.645, π1 = 0.74, π0 = 0.68, δ = 0.06, and Δ = 0.03, gives n = 4941.8 ≈ 4942.

14. 95% CI for the mean difference: −2.4667 ± 2.110 × 2.0275, that is, (−6.74, 1.81). The flaws could be (i) a non-Gaussian distribution, since n is small, or (ii) unequal variances.

16. For sensitivity: n = (1.96)² × 0.78 × (1 − 0.78)/(0.03² × 0.12) = 6103.9 ≈ 6104.
For specificity: n = (1.96)² × 0.87 × (1 − 0.87)/(0.03² × (1 − 0.12)) = 548.6 ≈ 549.
Since both are to be estimated from the same study, the larger of the two is required. Thus, n = 6104.
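A minimal sketch in Python of the sensitivity/specificity sample sizes of solution 16 (the function names are illustrative; z = 1.96, precision d = 0.03, and prevalence 0.12 are the values used above):

```python
from math import ceil

def n_sensitivity(z, sens, d, prevalence):
    """n = z^2 * Se*(1 - Se) / (d^2 * prevalence); the prevalence divisor
    inflates n so that enough diseased subjects are available."""
    return z**2 * sens * (1 - sens) / (d**2 * prevalence)

def n_specificity(z, spec, d, prevalence):
    """n = z^2 * Sp*(1 - Sp) / (d^2 * (1 - prevalence))."""
    return z**2 * spec * (1 - spec) / (d**2 * (1 - prevalence))

n_se = n_sensitivity(1.96, 0.78, 0.03, 0.12)   # about 6104
n_sp = n_specificity(1.96, 0.87, 0.03, 0.12)   # about 549
print(ceil(n_se), ceil(n_sp), max(ceil(n_se), ceil(n_sp)))  # study needs the larger
```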
Chapter 13

1. (i) P(x ≥ 3 | n = 8, π = 0.25) = 0.32 by the binomial distribution (small n). (ii) Because of the large sample, use the Gaussian approximation:

P(x ≥ 30 | n = 80, π = 0.25) = P(Z ≥ (30 − 80 × 0.25)/√(80 × 0.25 × 0.75)) = P(Z ≥ 2.58) = 0.005;
a probability of 3 or more out of 8 is very different from the probability of 30 or more out of 80, despite the same π. 4. With a 2% margin, the regimen is noninferior if the lower limit of the 90% CI is more than 0.83. For p1 = 0.86, the lower limit of the 90% CI for large n (= 200) is 0.86 – 1.645 × √(0.86 × 0.14/200) = 0.8196, which is less than 0.83—thus, the new regimen can be inferior.
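Both calculations in solution 1 can be checked with scipy; a minimal sketch, using the exact binomial for the small sample and the Gaussian approximation for the large one:

```python
from math import sqrt
from scipy.stats import binom, norm

# Exact binomial: P(X >= 3) for n = 8, pi = 0.25
p_small = binom.sf(2, 8, 0.25)          # sf(k) = P(X > k), so this is P(X >= 3), about 0.32

# Gaussian approximation: P(X >= 30) for n = 80, pi = 0.25
n, pi = 80, 0.25
z = (30 - n * pi) / sqrt(n * pi * (1 - pi))
p_large = norm.sf(z)                    # P(Z >= 2.58), about 0.005

print(round(p_small, 3), round(z, 2), round(p_large, 3))
```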
9. (i) With continuity correction, McNemar chi-square = (|8 − 2| − 1)²/(8 + 2) = 2.5, which at 1 df gives P = 0.11. No significant difference between the two regimens.

(ii) Exact, from the binomial: P = Σ(x = 8 to 10) 10Cx (1/2)¹⁰ = 0.055, one-tailed.

(iii) McNemar gives a two-tailed probability, whereas the binomial gives a one-tailed probability.

14. P(x = 8 | n = 8, π = 0.86) = 0.2992. Nearly a 30% chance that all of the next eight will survive.
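The McNemar statistic and the exact binomial probability of solution 9 can be reproduced directly; a minimal sketch in Python with the discordant pair counts b = 8 and c = 2:

```python
from scipy.stats import binom, chi2

# Discordant pairs from solution 9
b, c = 8, 2

# (i) McNemar chi-square with continuity correction: (|b - c| - 1)^2 / (b + c)
stat = (abs(b - c) - 1)**2 / (b + c)      # 2.5
p_mcnemar = chi2.sf(stat, df=1)           # about 0.11, two-tailed

# (ii) Exact one-tailed binomial: P(X >= 8 | n = 10, pi = 0.5)
p_exact = binom.sf(b - 1, b + c, 0.5)     # about 0.055

print(stat, round(p_mcnemar, 2), round(p_exact, 3))
```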
17. μ = 47 mg/dL, σ = 7 mg/dL.

Upper End    f     Gaussian CumP    Gaussian Prob    Exp(f)      (f − Exp(f))²/Exp(f)
34           5     0.0316           0.0316           6.3290      0.2791
38           15    0.0993           0.0676           13.5252     0.1608
42           23    0.2375           0.1383           27.6508     0.7822
46           50    0.4432           0.2057           41.1352     1.9104
50           55    0.6659           0.2227           44.5362     2.4585
54           28    0.8413           0.1755           35.0925     1.4334
58           16    0.9420           0.1006           20.1227     0.8447
62           6     0.9839           0.0420           8.3959      0.6837
>62          2     1.0000           0.0161           3.2125      0.4576
Sum          200                    1.0000           199.9999    9.0104

Note: CumP, cumulative probability; Exp(f), expected frequency.
(i) Expected frequencies with mean = 47 and SD = 7 are in the fifth column. (ii) Chi-square (8 df) = 9.01, P = 0.34 (a large P indicates that the distribution can be considered Gaussian with mean = 47 mg/dL and SD = 7 mg/dL). (iii) Sample mean = 46.5 mg/dL, sample SD = 6.37 mg/dL, chi-square (6 df) = 4.62 (2 df less, since the mean and SD are estimated from the data), P = 0.57 (i.e., the distribution can be considered Gaussian with mean = 46.5 mg/dL and SD = 6.37 mg/dL). The result would differ slightly if the mean and SD were carried to several decimal places.
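The entire goodness-of-fit table of solution 17 can be generated from the observed frequencies and the hypothesized Gaussian; a minimal sketch in Python:

```python
import numpy as np
from scipy.stats import norm, chi2

# Observed frequencies and interval upper ends from solution 17 (last class is > 62)
upper = np.array([34, 38, 42, 46, 50, 54, 58, 62])
f_obs = np.array([5, 15, 23, 50, 55, 28, 16, 6, 2])
mu, sd, N = 47, 7, f_obs.sum()

# Expected frequencies under Gaussian(47, 7)
cum = np.append(norm.cdf(upper, mu, sd), 1.0)   # cumulative P up to each upper end
prob = np.diff(np.append(0.0, cum))             # class probabilities
f_exp = N * prob

stat = np.sum((f_obs - f_exp)**2 / f_exp)       # about 9.01
p = chi2.sf(stat, df=len(f_obs) - 1)            # 8 df, P about 0.34
print(round(stat, 2), round(p, 2))
```

For part (iii), the same code with mu = 46.5 and sd = 6.37 and df reduced by 2 gives the second test.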
20. Note the values of x, which arise because of grouping in the first interval. Overall chi-square = 31.44.

x      O1     n      nx     nx²     O1*x
1      15     60     60     60      15
3      20     45     135    405     60
4      30     50     200    800     120
5      35     45     225    1125    175
Sum    100    200    620    2390    370
(i) Chi-square for (linear) trend (by the formula in the text) = 30.77 (P < 0.001); thus, the proportion of cases with osteoporotic fractures does increase with the number of deranged factors. (ii) Chi-square for lack of trend = 31.44 − 30.77 = 0.67 (P >> 0.05); the linear trend by itself is sufficient to explain the relationship.
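A minimal sketch in Python of the trend chi-square of solution 20, using the sums from the table above (this is one common form of the trend formula; it reproduces the 30.77 reported here):

```python
import numpy as np

# Scores, cases (O1), and group sizes (n) from solution 20
x  = np.array([1, 3, 4, 5])
O1 = np.array([15, 20, 30, 35])
n  = np.array([60, 45, 50, 45])

N, T = n.sum(), O1.sum()
p_bar = T / N                                    # overall proportion of cases (0.5)

# Chi-square for linear trend
num = (np.sum(O1 * x) - T * np.sum(n * x) / N)**2
den = p_bar * (1 - p_bar) * (np.sum(n * x**2) - np.sum(n * x)**2 / N)
chi_trend = num / den                            # about 30.77
print(round(chi_trend, 2))
```

Subtracting this from the overall chi-square (31.44) gives the 0.67 for lack of trend.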
Chapter 14

1. (i) RRM = 1.32, ARM = 0.16, SE(ARM) = 0.082, approximate 95% CI on ARM: (0, 0.32); the lower limit is zero, since ARM cannot be negative. The chance of improvement on adding yoga is nearly one-third more than the chance of improvement with the drug alone; nearly 16% of the improvement can be attributed to yoga, but this can range from none to 32% in the corresponding population. (ii) McNemar chi-square = 2.72 with 1 df gives P > 0.05; thus, the improvement on adding yoga is not statistically significant.

2. 15 × 5 = 75. Thus, on average, 75 subjects need to be treated to get one (additional) survival in one year.

3. (i) The table has to be rearranged, since derangement of more MS factors is expected in fracture cases.
                                              95% CI for OR
Age Group (years)      OR      SE(lnOR)      From      To
40–50                  2.86    0.249         1.76      4.65
50–60                  2.14    0.200         1.45      3.16
60–70                  3.82    0.196         2.60      5.61
All ages combined      2.78    0.119         2.20      3.51
(ii) ORMH = 2.87. (iii) Note a relatively higher OR for the age group 60–70 years. Thus, age did influence the OR, but whether this influence is statistically significant is yet to be tested.
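The confidence limits in the table follow from the approximately Gaussian behavior of lnOR; a minimal sketch in Python that reproduces the From/To columns (the helper name or_ci is illustrative; last digits may differ slightly from the table because of rounding):

```python
from math import exp, log

def or_ci(or_hat, se_lnor, z=1.96):
    """95% CI for an odds ratio: set limits on the lnOR scale, then exponentiate."""
    return exp(log(or_hat) - z * se_lnor), exp(log(or_hat) + z * se_lnor)

# (OR, SE(lnOR)) pairs from the table above
rows = {"40-50": (2.86, 0.249), "50-60": (2.14, 0.200),
        "60-70": (3.82, 0.196), "All ages": (2.78, 0.119)}
for age, (or_hat, se) in rows.items():
    lo, hi = or_ci(or_hat, se)
    print(f"{age}: OR = {or_hat}, 95% CI ({lo:.2f}, {hi:.2f})")
```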
5. (i) π0′ = 0.20, OR = 1.5, and for this OR, π1′ = 0.2727; zα = 1.96, zβ = 0.84; n = 533.9 ≈ 534 in each group. (ii) zβ = 1.28; n ≈ 715 in each group. Increase = 715 − 534 = 181 per group. (iii) C = 2, π̄ = 0.2242, n = 394 in the study group and 394 × 2 = 788 in the control group. (iv) RR = 2, π0 = 0.1, π1 = 0.2, n = 199.

10. The 2×2 table:

HTN      MI     Non-MI
Yes      180    60
No       120    240
Total    300    300
OR = 6, SE(lnOR) = 0.1863, 95% CI for lnOR: (1.4265, 2.1570); 95% CI for OR: (4.16, 8.65). The OR is unlikely to be less than 4.16 or more than 8.65.

13. (i) RR = 0.15/0.05 = 3, AR = 0.15 − 0.05 = 0.10. (ii) Since 20% are anemic, the risk of low birth weight (LBW) in the population = 0.15 × 0.20 + 0.05 × 0.80 = 0.07; PAR = 0.07 − 0.05 = 0.02; that is, a 2% risk of LBW in the population is due to anemia. PAR fraction = [0.20(3 − 1)]/[0.20(3 − 1) + 1] = 0.2857 (note also that PAR is 2% out of 7%, and 0.02/0.07 = 0.2857); nearly 29% of the risk of LBW in the population is due to anemia.

15. RR = 1.8/(0.94 + 0.06 × 1.8) = 1.72; fairly close to the OR.
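The OR and its CI in solution 10 come directly from the 2×2 cell counts; a minimal sketch in Python:

```python
from math import exp, log, sqrt

# 2x2 table of solution 10: rows HTN yes/no, columns MI/non-MI
a, b = 180, 60     # HTN yes: MI, non-MI
c, d = 120, 240    # HTN no:  MI, non-MI

OR = (a * d) / (b * c)                    # 6.0
se = sqrt(1/a + 1/b + 1/c + 1/d)          # SE(lnOR), about 0.1863
ci = (exp(log(OR) - 1.96 * se), exp(log(OR) + 1.96 * se))
print(OR, round(se, 4), tuple(round(v, 2) for v in ci))   # (4.16, 8.65)
```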
16. (i) Since the study is retrospective, calculate the OR. (ii) ORM = 25/17 = 1.47, SE(lnOR) = √(1/25 + 1/17) = 0.3144, 95% CI for lnOR: (−0.230, 1.002), 95% CI for OR: (0.79, 2.72); the CI is symmetric for lnOR but not for OR because of the logarithm. (iii) McNemar chi-square = 1.167 with 1 df, P = 0.28; the relationship between preexisting BPH and cancer of the prostate is not statistically significant.

19. (i) RR = (15/500)/(6/500) = 2.5. Tobacco chewers have 2.5 times the risk of developing oral cancer relative to smokers. (ii) Same as H0: RR = 1 or lnRR = 0; Z = ln 2.5/√(1/15 − 1/500 + 1/6 − 1/500) = 1.91, which gives P = 0.056 (two-tailed), marginally significant. (iii) Yes: the risk in the two groups is given; AR = 15/500 − 6/500 = 0.018 (the excess risk of oral cancer in chewers over smokers).
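The RR, the Z test on lnRR, and the AR of solution 19 can be computed in a few lines; a minimal sketch in Python using the counts given above:

```python
from math import log, sqrt
from scipy.stats import norm

# Solution 19: 15/500 oral cancers among tobacco chewers, 6/500 among smokers
a, n1 = 15, 500
b, n2 = 6, 500

RR = (a / n1) / (b / n2)                  # 2.5
se_lnrr = sqrt(1/a - 1/n1 + 1/b - 1/n2)   # SE of ln(RR)
Z = log(RR) / se_lnrr                     # about 1.91
p = 2 * norm.sf(Z)                        # about 0.056, two-tailed

AR = a/n1 - b/n2                          # attributable risk, 0.018
print(round(RR, 2), round(Z, 2), round(p, 3), AR)
```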
Chapter 15
1. The Welch test is preferable, since the variances in the two groups are substantially different when these are considered separate groups, although the difference in variances is not statistically significant owing to the small sample size in this example: sample variance 1 = 0.3360 and sample variance 2 = 0.1030; Welch t df = 7.8 instead of 10, and P = 0.14 (the P-value is now even larger, as always happens with the Welch test); the means in the two groups are not significantly different by the Welch t-test.

3. Means of the TG level in different WHR groups are as follows (dependent variable: TG):

WHR    Mean    SD