
Robust Nonparametric Statistical Methods

Thomas P. Hettmansperger, Penn State University
Joseph W. McKean, Western Michigan University

Copyright © 1997, 2008, 2010 by Thomas P. Hettmansperger and Joseph W. McKean. All rights reserved.


Dedication: To Ann and to Marge

Contents

Preface

1 One Sample Problems
   1.1 Introduction
   1.2 Location Model
   1.3 Geometry and Inference in the Location Model
       1.3.1 Computation
   1.4 Examples
   1.5 Properties of Norm-Based Inference
       1.5.1 Basic Properties of the Power Function γS(θ)
       1.5.2 Asymptotic Linearity and Pitman Regularity
       1.5.3 Asymptotic Theory and Efficiency Results for θ̂
       1.5.4 Asymptotic Power and Efficiency Results for the Test Based on S(θ)
       1.5.5 Efficiency Results for Confidence Intervals Based on S(θ)
   1.6 Robustness Properties of Norm-Based Inference
       1.6.1 Robustness Properties of θ̂
       1.6.2 Breakdown Properties of Tests
   1.7 Inference and the Wilcoxon Signed-Rank Norm
       1.7.1 Null Distribution Theory of T(0)
       1.7.2 Statistical Properties
       1.7.3 Robustness Properties
   1.8 Inference Based on General Signed-Rank Norms
       1.8.1 Null Properties of the Test
       1.8.2 Efficiency and Robustness Properties
   1.9 Ranked Set Sampling
   1.10 Interpolated Confidence Intervals for the L1 Inference
   1.11 Two Sample Analysis
   1.12 Exercises

2 Two Sample Problems
   2.1 Introduction
   2.2 Geometric Motivation
       2.2.1 Least Squares (LS) Analysis
       2.2.2 Mann-Whitney-Wilcoxon (MWW) Analysis
       2.2.3 Computation
   2.3 Examples
   2.4 Inference Based on the Mann-Whitney-Wilcoxon
       2.4.1 Testing
       2.4.2 Confidence Intervals
       2.4.3 Statistical Properties of the Inference Based on the MWW
       2.4.4 Estimation of Δ
       2.4.5 Efficiency Results Based on Confidence Intervals
   2.5 General Rank Scores
       2.5.1 Statistical Methods
       2.5.2 Efficiency Results
       2.5.3 Connection between One and Two Sample Scores
   2.6 L1 Analyses
       2.6.1 Analysis Based on the L1 Pseudo Norm
       2.6.2 Analysis Based on the L1 Norm
   2.7 Robustness Properties
       2.7.1 Breakdown Properties
       2.7.2 Influence Functions
   2.8 Lehmann Alternatives and Proportional Hazards
       2.8.1 The Log Exponential and the Savage Statistic
       2.8.2 Efficiency Properties
   2.9 Two Sample Rank Set Sampling (RSS)
   2.10 Two Sample Scale Problem
       2.10.1 Optimal Rank-Based Tests
       2.10.2 Efficacy of the Traditional F-Test
   2.11 Behrens-Fisher Problem
       2.11.1 Behavior of the Usual MWW Test
       2.11.2 General Rank Tests
       2.11.3 Modified Mathisen's Test
       2.11.4 Modified MWW Test
       2.11.5 Efficiencies and Discussion
   2.12 Paired Designs
       2.12.1 Behavior under Alternatives
   2.13 Exercises

3 Linear Models
   3.1 Introduction
   3.2 Geometry of Estimation and Tests
       3.2.1 Estimation
       3.2.2 The Geometry of Testing
   3.3 Examples
   3.4 Assumptions for Asymptotic Theory
   3.5 Theory of Rank-Based Estimates
       3.5.1 R-Estimators of the Regression Coefficients
       3.5.2 R-Estimates of the Intercept
   3.6 Theory of Rank-Based Tests
       3.6.1 Null Theory of Rank-Based Tests
       3.6.2 Theory of Rank-Based Tests under Alternatives
       3.6.3 Further Remarks on the Dispersion Function
   3.7 Implementation of the R-Analysis
       3.7.1 Estimates of the Scale Parameter τϕ
       3.7.2 Algorithms for Computing the R-Analysis
       3.7.3 An Algorithm for a Linear Search
   3.8 L1-Analysis
   3.9 Diagnostics
       3.9.1 Properties of R-Residuals and Model Misspecification
       3.9.2 Standardization of R-Residuals
       3.9.3 Measures of Influential Cases
   3.10 Survival Analysis
   3.11 Correlation Model
       3.11.1 Huber's Condition for the Correlation Model
       3.11.2 Traditional Measure of Association and its Estimate
       3.11.3 Robust Measure of Association and its Estimate
       3.11.4 Properties of R-Coefficients of Multiple Determination
       3.11.5 Coefficients of Determination for Regression
   3.12 High Breakdown (HBR) Estimates
       3.12.1 Geometry of the HBR-Estimates
       3.12.2 Weights
       3.12.3 Asymptotic Normality of β̂HBR
       3.12.4 Robustness Properties of the HBR Estimates
       3.12.5 Discussion
       3.12.6 Implementation and Examples
       3.12.7 Studentized Residuals
       3.12.8 Examples
   3.13 Diagnostics for Differentiating between Fits
   3.14 Rank-Based Procedures for Nonlinear Models
       3.14.1 Implementation
   3.15 Exercises
   3.16 Exercises

4 Experimental Designs: Fixed Effects
   4.1 Introduction
   4.2 Oneway Design
       4.2.1 R-Fit of the Oneway Design
       4.2.2 Rank-Based Tests of H0: μ1 = ··· = μk
       4.2.3 Tests of General Contrasts
       4.2.4 More on Estimation of Contrasts and Location
       4.2.5 Pseudo-observations
   4.3 Multiple Comparison Procedures
       4.3.1 Discussion
   4.4 Twoway Crossed Factorial
   4.5 Analysis of Covariance
   4.6 Further Examples
   4.7 Rank Transform
       4.7.1 Monte Carlo Study
   4.8 Exercises

5 Models with Dependent Error Structure
   5.1 General Mixed Models
       5.1.1 Applications
   5.2 Simple Mixed Models
       5.2.1 Variance Component Estimators
       5.2.2 Studentized Residuals
       5.2.3 Example and Simulation Studies
       5.2.4 Simulation Studies of Validity
       5.2.5 Simulation Study of Other Score Functions
   5.3 Rank-Based Procedures Based on Arnold Transformations
       5.3.1 R Fit Based on Arnold Transformed Data
   5.4 General Estimating Equations (GEE)
       5.4.1 Asymptotic Theory
       5.4.2 Implementation and a Monte Carlo Study
       5.4.3 Example
   5.5 Time Series
   5.6 Exercises

6 Multivariate
   6.1 Multivariate Location Model
   6.2 Componentwise Methods
       6.2.1 Estimation
       6.2.2 Testing
       6.2.3 Componentwise Rank Methods
   6.3 Spatial Methods
       6.3.1 Spatial Sign Methods
       6.3.2 Spatial Rank Methods
   6.4 Affine Equivariant and Invariant Methods
       6.4.1 Blumen's Bivariate Sign Test
       6.4.2 Affine Invariant Sign Tests in the Multivariate Case
       6.4.3 The Oja Criterion Function
       6.4.4 Additional Remarks
   6.5 Robustness of Multivariate Estimates of Location
       6.5.1 Location and Scale Invariance: Componentwise Methods
       6.5.2 Rotation Invariance: Spatial Methods
       6.5.3 The Spatial Hodges-Lehmann Estimate
       6.5.4 Affine Equivariant Spatial Median
       6.5.5 Affine Equivariant Oja Median
   6.6 Linear Model
       6.6.1 Test for Regression Effect
       6.6.2 The Estimate of the Regression Effect
       6.6.3 Tests of General Hypotheses
   6.7 Experimental Designs
   6.8 Exercises

A Asymptotic Results
   A.1 Central Limit Theorems
   A.2 Simple Linear Rank Statistics
       A.2.1 Null Asymptotic Distribution Theory
       A.2.2 Local Asymptotic Distribution Theory
       A.2.3 Signed-Rank Statistics
   A.3 Results for Rank-Based Analysis of Linear Models
       A.3.1 Convex Functions
       A.3.2 Asymptotic Linearity and Quadraticity
       A.3.3 Asymptotic Distance Between β̂ and β̃
       A.3.4 Consistency of the Test Statistic Fϕ
       A.3.5 Proof of Lemma 3.5.1
   A.4 Asymptotic Linearity for the L1 Analysis
   A.5 Influence Functions
       A.5.1 Influence Function for Estimates Based on Signed-Rank Statistics
       A.5.2 Influence Functions for Chapter 3
       A.5.3 Influence Function of β̂HBR of Chapter 5
   A.6 Asymptotic Theory for Chapter 5

B Larger Data Sets

Preface

I don't believe I can really do without teaching. The reason is, I have to have something so that when I don't have any ideas and I'm not getting anywhere I can say to myself, "At least I'm living; at least I'm doing something; I'm making some contribution"; it's just psychological.
Richard Feynman

We are currently revising these notes. Any corrections and/or comments are welcome.

This book is based on the premise that nonparametric or rank-based statistical methods are a superior choice in many data-analytic situations. We cover location models, regression models including designed experiments, and multivariate models. Geometry provides a unifying theme throughout much of the development. We emphasize the similarity in interpretation with least squares methods. Basically, we replace the Euclidean norm with a weighted L1 norm. This results in rank-based methods or L1 methods, depending on the choice of weights. The rank-based methods proceed much like the traditional analysis. Using the norm, models are easily fitted. Diagnostic procedures can then be used to check the quality of fit (model criticism) and to locate outlying points and points of high influence. Upon satisfaction with the fit, rank-based inferential procedures can be used to conduct the statistical analysis. The benefits include significant gains in power and efficiency when the error distribution has tails heavier than those of a normal distribution, and superior robustness properties in general.

The main text concentrates on Wilcoxon and L1 methods. The theoretical development for general scores (weights) is contained in the Appendix. By restricting attention to Wilcoxon rank methods, we can recommend a unified approach to data analysis, beginning with the simple location models and extending through complex regression models and designed experiments. All major methodology is illustrated on real data. The examples are intended as guides for the application of the rank and L1 methods. Furthermore, all the data sets in this book can be obtained from the web site http://www.stat.wmich.edu/home.html.

Selected topics from the first four chapters provide a basic graduate course in rank-based methods. The prerequisites are an introductory course in mathematical statistics and some background in applied statistics. The first seven sections of Chapter 1 and the first four sections of Chapter 2 are fundamental for the development of Wilcoxon signed-rank and Mann-Whitney-Wilcoxon rank sum methods in the one- and two-sample location models. In Chapter 3, on the linear model, sections one through seven and section nine present the basic material for estimation, testing, and diagnostic procedures for model criticism. Sections two through four of Chapter 4 give an extensive development of methods for the one- and two-way layouts. Then, depending on individual tastes, there are several more exotic topics in each chapter to choose from.

Chapters 5 and 6 contain more advanced material. In Chapter 5 we extend rank-based methods for a linear model to bounded influence, high breakdown estimates and tests. In Chapter 6 we take up the concept of multidimensional rank. We then discuss various approaches to the development of rank-like procedures that satisfy various invariance/equivariance restrictions.

Computation of the procedures discussed in this book is very important. Minitab contains an undocumented RREG (rank regression) command. It contains various subcommands that allow for testing and estimation in the linear model. The reader can contact Minitab at (put email address or web page address here) and request a technical report that describes the RREG command. In many of the examples of this book the package rglm is used to obtain the rank-based analyses. The basic algorithms behind this package are described in Chapter 3. Information (including online rglm analyses of examples) can be obtained from the web site http://www.stat.wmich.edu/home.html. Students can also be encouraged to write their own S-plus functions for specific methods.

We are indebted to many of our students and colleagues for valuable discussions, stimulation, and motivation. In particular, the first author would like to express his sincere thanks for many stimulating hours of discussion with Steve Arnold, Bruce Brown, and Hannu Oja, while the second author wants to express his sincere thanks for discussions with John Kapenga, Joshua Naranjo, Jerry Sievers, and Tom Vidmar. We both would like to express our debt to Simon Sheather, our friend, colleague, and co-author on many papers. Finally, we would like to thank Jun Recta for assistance in creating several of the plots.

Tom Hettmansperger, State College, PA
Joe McKean, Kalamazoo, MI
July 2008

Chapter 1

One Sample Problems

1.1 Introduction

Traditional statistical procedures are widely used because they offer the user a unified methodology with which to attack a multitude of problems, from simple location problems to highly complex experimental designs. These procedures are based on least squares fitting. Once the problem has been cast into a model, least squares offers the user:

1. a way of fitting the model by minimizing the Euclidean normed distance between the responses and the conjectured model;
2. diagnostic techniques that check the adequacy of the fit of the model, explore the quality of fit, and detect outlying and/or influential cases;
3. inferential procedures, including confidence procedures, tests of hypotheses, and multiple comparison procedures;
4. computational feasibility.

Procedures based on least squares, though, are easily impaired by outlying observations. Indeed, one outlying observation is enough to spoil the least squares fit, its associated diagnostics, and its inference procedures. Even though traditional inference procedures are exact when the errors in the model follow a normal distribution, they can be quite inefficient when the distribution of the errors has longer tails than the normal distribution.

For simple location problems, nonparametric methods were proposed by Wilcoxon (1945). These methods consist of test statistics based on the ranks of the data and associated estimates and confidence intervals for location parameters. The test statistics are distribution free in the sense that their null distributions do not depend on the distribution of the errors. It was soon realized that these procedures are almost as efficient as the traditional methods when the errors follow a normal distribution and, furthermore, are often much more efficient relative to the traditional methods when the error distributions deviate from normality; see Hodges and Lehmann (1956). These procedures possess both robustness of validity and power. In recent years these nonparametric methods have been extended to linear and nonlinear models. In addition, from the perspective of modern robustness theory, contrary to least squares estimates, these rank-based procedures have bounded influence functions and positive breakdown points.

Often these nonparametric procedures are thought of as disjoint methods that differ from one problem to another. In this text, we intend to show that this is not the case. Instead, these procedures present a unified methodology analogous to the traditional methods. The four items cited above for the traditional analysis hold for these procedures too. Indeed, the only operational difference is that the Euclidean norm is replaced by another norm.

There are computational procedures available for the rank-based procedures discussed in this book. We offer the reader a collection of computational functions written in the software language R at the site http://www.stat.wmich.edu/mckean/Rfuncs/. We refer to these computational algorithms as rank-based R algorithms or RBR. We discuss these functions throughout the text and use them in many of the examples, simulation studies, and exercises. The programming language R (see Ihaka and Gentleman, 1996) is freeware and can run on all (PC, Mac, Linux) platforms. To download the R software and accompanying information, visit the site http://www.r-project.org/. The language R has intrinsic functions for computation of some of the procedures discussed in this and the next chapter.

1.2 Location Model

In this chapter we will consider the one sample location problem. This will allow us to explore some useful concepts such as distribution freeness and robustness in a simple setting. We will extend many of these concepts to more complicated situations in later chapters. We need to first define a location parameter. For a random variable $X$ we often subscript its distribution function by $X$ to avoid confusion.

Definition 1.2.1. Let $T(H)$ be a function defined on the set of distribution functions. We say $T(H)$ is a location functional if

1. if $G$ is stochastically larger than $F$ (i.e., $G(x) \le F(x)$ for all $x$), then $T(G) \ge T(F)$;
2. $T(H_{aX+b}) = aT(H_X) + b$, $a > 0$;
3. $T(H_{-X}) = -T(H_X)$.

Then we will call $\theta = T(H)$ a location parameter of $H$.

Note that if $X$ has location parameter $\theta$, it follows from the second item in the above definition that the random variable $e = X - \theta$ has location parameter 0. Suppose $X_1, \ldots, X_n$ is a random sample having the common distribution function $H(x)$ and $\theta = T(H)$ is a location parameter of interest. We express this by saying that $X_i$ follows the statistical location model,
$$ X_i = \theta + e_i , \quad i = 1, \ldots, n , \tag{1.2.1} $$

where $e_1, \ldots, e_n$ are independent and identically distributed random variables with distribution function $F(x)$, density function $f(x)$, and location $T(F) = 0$. It follows that $H(x) = F(x - \theta)$ and that $T(H) = \theta$.

We next discuss three examples of location parameters that we will use throughout this chapter. Other location parameters are discussed in Section 1.8. See Bickel and Lehmann (1975) for additional discussion of location functionals.

Example 1.2.1 (The Median Location Functional). First define the inverse of the cdf $H(x)$ by $H^{-1}(u) = \inf\{x : H(x) \ge u\}$. Generally we will suppose that $H(x)$ is strictly increasing on its support; this eliminates ambiguities in the selection of the parameter. Now define $\theta_1 = T_1(H) = H^{-1}(1/2)$. This is the median functional. Note that if $G(x) \le F(x)$ for all $x$, then $G^{-1}(u) \ge F^{-1}(u)$ for all $u$; in particular, $G^{-1}(1/2) \ge F^{-1}(1/2)$. Hence, $T_1(H)$ satisfies the first condition for a location functional. Next let $H^*(x) = P(aX + b \le x) = H[a^{-1}(x - b)]$. Then it follows at once that $H^{*-1}(u) = aH^{-1}(u) + b$, and the second condition is satisfied. The third condition follows with an argument similar to the one for the second condition.

Example 1.2.2 (The Mean Location Functional). For the mean functional let $\theta_2 = T_2(H) = \int x \, dH(x)$, when the mean exists. Note that $\int x \, dH(x) = \int H^{-1}(u) \, du$. Now if $G(x) \le F(x)$ for all $x$, then $x \le G^{-1}(F(x))$. Let $x = F^{-1}(u)$; then $F^{-1}(u) \le G^{-1}(F(F^{-1}(u))) \le G^{-1}(u)$. Hence, $T_2(G) = \int G^{-1}(u)\,du \ge \int F^{-1}(u)\,du = T_2(F)$, and the first condition is satisfied. The other two conditions follow easily from the definition of the integral.

Example 1.2.3 (The Pseudo-Median Location Functional). Assume that $X_1$ and $X_2$ are independent and identically distributed (iid) with distribution function $H(x)$. Let $Y = (X_1 + X_2)/2$. Then $Y$ has distribution function $H^*(y) = P(Y \le y) = \int H(2y - x)\, h(x)\,dx$. Let $\theta_3 = T_3(H) = H^{*-1}(1/2)$. To show that $T_3$ is a location functional, suppose $G(x) \le F(x)$ for all $x$. Then
$$ G^*(y) = \int G(2y - x) g(x)\,dx = \int\!\left[\int_{-\infty}^{2y-x} g(t)\,dt\right] g(x)\,dx \le \int\!\left[\int_{-\infty}^{2y-x} f(t)\,dt\right] g(x)\,dx = \int\!\left[\int_{-\infty}^{2y-t} g(x)\,dx\right] f(t)\,dt \le \int\!\left[\int_{-\infty}^{2y-t} f(x)\,dx\right] f(t)\,dt = F^*(y) ; $$
hence, as in Example 1.2.1, it follows that $G^{*-1}(u) \ge F^{*-1}(u)$ and, hence, that $T_3(G) \ge T_3(F)$.

For the second property, let $W = aX + b$ where $X$ has distribution function $H$ and $a > 0$. Then $W$ has distribution function $F_W(t) = H((t - b)/a)$. By the change of variable $z = (x - b)/a$, we have
$$ F_W^*(y) = \int H\!\left(\frac{2y - x - b}{a}\right) \frac{1}{a}\, h\!\left(\frac{x - b}{a}\right) dx = \int H\!\left(2\,\frac{y - b}{a} - z\right) h(z)\,dz . $$
Thus the defining equation for $T_3(F_W)$ is
$$ \frac{1}{2} = \int H\!\left(2\,\frac{T_3(F_W) - b}{a} - z\right) h(z)\,dz , $$
which is satisfied for $T_3(F_W) = aT_3(H) + b$.

For the third property, let $V = -X$ where $X$ has distribution function $H$. Then $V$ has distribution function $F_V(t) = 1 - H(-t)$. Hence, by the change of variable $z = -x$,
$$ F_V^*(y) = \int \left(1 - H(-2y + x)\right) h(-x)\,dx = 1 - \int H(-2y - z)\, h(z)\,dz . $$
Because the defining equation of $T_3(F_V)$ can be written as
$$ \frac{1}{2} = \int H\!\left(2(-T_3(F_V)) - z\right) h(z)\,dz , $$
it follows that $T_3(F_V) = -T_3(H)$. Therefore, $T_3$ is a location functional. It has been called the pseudo-median by Hoyland (1965) and is more appropriate for symmetric distributions.

The next theorem characterizes all the location functionals for a symmetric distribution.

Theorem 1.2.1. Suppose that the pdf $h(x)$ is symmetric about some point $a$. If $T(H)$ is a location functional, then $T(H) = a$.

Proof. Let the random variable $X$ have pdf $h(x)$ symmetric about $a$. Let $Y = X - a$; then $Y$ has pdf $g(y) = h(y + a)$, symmetric about 0. Hence $Y$ and $-Y$ have the same distribution. By the third property of location functionals, this means that $T(G_Y) = T(G_{-Y}) = -T(G_Y)$; i.e., $T(G_Y) = 0$. But by the second property, $0 = T(G_Y) = T(H) - a$; that is, $a = T(H)$.

This theorem means that when we sample from a symmetric distribution we can unambiguously define location as the center of symmetry. Then all location functionals that we may wish to study will specify the same location parameter.
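The three functionals have natural sample analogues, which the following R sketch (ours, not from the text; the sample x is illustrative) computes: the sample median for $T_1$, the sample mean for $T_2$, and, for the pseudo-median $T_3$, the median of the pairwise (Walsh) averages $(x_i + x_j)/2$, $i \le j$.

# Sample analogues of the location functionals of Examples 1.2.1-1.2.3.
# A sketch, not from the text; x is an illustrative sample.
x <- c(1.2, -0.4, 3.1, 0.8, 2.6, -1.5, 0.3, 1.9)
theta1 <- median(x)                         # median functional T1(H)
theta2 <- mean(x)                           # mean functional T2(H)
walsh  <- outer(x, x, "+") / 2              # all pairwise averages (x_i + x_j)/2
theta3 <- median(walsh[lower.tri(walsh, diag = TRUE)])  # pseudo-median T3(H)
c(theta1, theta2, theta3)

For a symmetric sample distribution all three values estimate the same center, in keeping with Theorem 1.2.1; for skewed data they generally differ.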

1.3 Geometry and Inference in the Location Model

Letting $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{e} = (e_1, \ldots, e_n)'$, we then write the statistical location model, (1.2.1), as
$$ \mathbf{X} = \mathbf{1}\theta + \mathbf{e} , \tag{1.3.1} $$
where $\mathbf{1}$ denotes the vector all of whose components are 1 and $T(F_e) = 0$. If $\Omega_F$ denotes the one-dimensional subspace spanned by $\mathbf{1}$, then we can express the model more compactly as $\mathbf{X} = \boldsymbol{\eta} + \mathbf{e}$, where $\boldsymbol{\eta} \in \Omega_F$. The subscript $F$ on $\Omega$ stands for full model in the context of hypothesis testing, as discussed below.

Let $\mathbf{x}$ be a realization of $\mathbf{X}$. Note that, except for random error, $\mathbf{x}$ would lie in $\Omega_F$. Hence an intuitive fitting criterion is to estimate $\theta$ by a value $\hat{\theta}$ such that the vector $\mathbf{1}\hat{\theta} \in \Omega_F$ lies "closest" to $\mathbf{x}$, where "closest" is defined in terms of a norm. Furthermore, a norm, as the following general discussion shows, provides a complete inference for the parameter $\theta$.

Recall that a norm is a nonnegative function, $\|\cdot\|$, defined on $R^n$ such that $\|\mathbf{y}\| \ge 0$ for all $\mathbf{y}$; $\|\mathbf{y}\| = 0$ if and only if $\mathbf{y} = \mathbf{0}$; $\|a\mathbf{y}\| = |a|\,\|\mathbf{y}\|$ for all real $a$; and $\|\mathbf{y} + \mathbf{z}\| \le \|\mathbf{y}\| + \|\mathbf{z}\|$. The distance between two vectors is $d(\mathbf{z}, \mathbf{y}) = \|\mathbf{z} - \mathbf{y}\|$.

Given a location model, (1.3.1), and a specified norm, $\|\cdot\|$, the estimate of $\theta$ induced by the norm is
$$ \hat{\theta} = \operatorname{argmin} \|\mathbf{x} - \mathbf{1}\theta\| , \tag{1.3.2} $$
i.e., the value which minimizes the distance between $\mathbf{x}$ and the space $\Omega_F$. As discussed in Exercise 1.12.1, a minimizing value always exists. The dispersion function induced by the norm is given by
$$ D(\theta) = \|\mathbf{x} - \mathbf{1}\theta\| . \tag{1.3.3} $$
The minimum distance between the vector of observations $\mathbf{x}$ and the space $\Omega_F$ is $D(\hat{\theta})$. As Exercise 1.12.3 shows, $D(\theta)$ is a convex, continuous function of $\theta$ which is differentiable almost everywhere. Actually the norms discussed in this book are differentiable at all but at most a finite number of points. We define the gradient process by the function
$$ S(\theta) = -\frac{d}{d\theta} D(\theta) . \tag{1.3.4} $$
As Exercise 1.12.3 shows, $S(\theta)$ is a nonincreasing function. Its discontinuities are the points where $D(\theta)$ is nondifferentiable. Furthermore, the minimizing value is a value where $S(\theta)$ is 0 or, due to a discontinuity, steps through 0. We express this by saying that $\hat{\theta}$ solves the equation
$$ S(\hat{\theta}) \doteq 0 . \tag{1.3.5} $$
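To make these definitions concrete, the following R sketch (ours, not from the text) evaluates the $L_1$ dispersion and gradient, anticipating Example 1.3.1: the minimizer of $D(\theta)$ is a point where the step function $S(\theta)$ passes through zero, i.e., a sample median.

# L1 dispersion D(theta) = sum |x_i - theta| and gradient S(theta) = sum sgn(x_i - theta).
# A sketch, not from the text, illustrating (1.3.2)-(1.3.5); x is an illustrative sample.
x <- c(1.2, -0.4, 3.1, 0.8, 2.6, -1.5, 0.3, 1.9)
D <- function(th) sum(abs(x - th))
S <- function(th) sum(sign(x - th))
grid <- seq(-2, 4, by = 0.001)
grid[which.min(sapply(grid, D))]  # a minimizer of D; for even n the minimizing
median(x)                         # set is an interval containing the median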

Suppose we can represent the above estimate by $\hat{\theta} = \hat{\theta}(\mathbf{x}) = \hat{\theta}(H_n)$, where $H_n$ denotes the empirical distribution function of the sample. The notation $\hat{\theta}(H_n)$ is suggestive of the functional notation used in the last section. This is as it should be, since it is easy to show that $\hat{\theta}$ satisfies the sample analogues of properties (2) and (3) of Definition 1.2.1. For property (2), consider the estimating equation of the translated sample $\mathbf{y} = a\mathbf{x} + \mathbf{1}b$, for $a > 0$, given by
$$ \hat{\theta}(\mathbf{y}) = \operatorname{argmin} \|\mathbf{y} - \mathbf{1}\theta\| = \operatorname{argmin}\, a \left\| \mathbf{x} - \mathbf{1}\,\frac{\theta - b}{a} \right\| . $$
From this we immediately have that $\hat{\theta}(\mathbf{y}) = a\hat{\theta}(\mathbf{x}) + b$. For property (3), the defining equation for the sample $\mathbf{y} = -\mathbf{x}$ is
$$ \hat{\theta}(\mathbf{y}) = \operatorname{argmin} \|\mathbf{y} - \mathbf{1}\theta\| = \operatorname{argmin} \|\mathbf{x} - \mathbf{1}(-\theta)\| , $$
from which we have $\hat{\theta}(\mathbf{y}) = -\hat{\theta}(\mathbf{x})$. Furthermore, for the norms considered in this book it is easy to check that $\hat{\theta}(H_n) \ge \hat{\theta}(G_n)$ when $H_n$ and $G_n$ are empirical cdfs for which $H_n$ is stochastically larger than $G_n$. Hence, the norms generate location functionals on the set of empirical cdfs. The $L_1$ norm provides an easy example. We can think of $\hat{\theta}(H_n) = H_n^{-1}(\frac{1}{2})$ as the restriction of $\theta(H) = H^{-1}(\frac{1}{2})$ to the class of discrete distributions which assign mass $1/n$ to $n$ points. Generally we can think of $\hat{\theta}(H_n)$ as the restriction of $\theta(H)$ or, conversely, we can think of $\theta(H)$ as the extension of $\hat{\theta}(H_n)$. We let the norm determine the location. This is especially simple in the symmetric location model where all location functionals are equal to the point of symmetry.

Next consider the hypotheses
$$ H_0 : \theta = \theta_0 \ \mbox{ versus } \ H_A : \theta \ne \theta_0 , \tag{1.3.6} $$
for a specified $\theta_0$. Because of the second property of location functionals in Definition 1.2.1, we can assume without loss of generality that $\theta_0 = 0$; otherwise we need only subtract $\theta_0$ from each $X_i$. Based on the data, the most acceptable value of $\theta$ is the value at which the gradient $S(\theta)$ is zero. Hence large values of $|S(0)|$ favor $H_A$. Formally, the level $\alpha$ gradient test or score test for the hypotheses (1.3.6) is given by
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{ if } |S(0)| \ge c , \tag{1.3.7} $$
where $c$ is such that $P_0[|S(0)| \ge c] = \alpha$. Typically, the null distribution of $S(0)$ is symmetric, so there is no loss in generality in considering symmetric critical regions.

A second formulation of a test statistic is based on the difference in minimizing dispersions, or the reduction in dispersion. Call Model (1.2.1) the full model. As noted above, the distance between $\mathbf{x}$ and the subspace $\Omega_F$ is $D(\hat{\theta})$. The reduced model is the full model subject to $H_0$. In this case the reduced model space is $\{\mathbf{0}\}$. Hence the distance between $\mathbf{x}$ and the reduced model space is $D(0)$. Under $H_0$, $\mathbf{x}$ should be close to this space; therefore, the reduction in dispersion test is given by
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{ if } RD = D(0) - D(\hat{\theta}) \ge m , \tag{1.3.8} $$
where $m$ is determined by the null distribution of $RD$. This test will be used in Chapter 3 and subsequent chapters.

A third formulation is based on the standardized estimate:
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{ if } \frac{|\hat{\theta}|}{\sqrt{\operatorname{Var}\hat{\theta}}} \ge \gamma , \tag{1.3.9} $$
where $\gamma$ is determined by the null distribution of $\hat{\theta}$. Tests based directly on the estimate are often referred to as Wald-type tests.

The following useful theorem allows us to shift between computing probabilities when $\theta = 0$ and for general $\theta$. Its proof is a straightforward application of a change of variables. See Theorem A.2.4 of the Appendix for a more general result.

Theorem 1.3.1. Suppose that we can write $S(\theta) = S(x_1 - \theta, \ldots, x_n - \theta)$. Then $P_\theta(S(0) \le t) = P_0(S(-\theta) \le t)$.

We now turn to the problem of the construction of a $(1-\alpha)100\%$ confidence interval for $\theta$ based on $S(\theta)$. Such an interval is easily obtained by inverting the acceptance region of the level $\alpha$ test given by (1.3.7). The acceptance region is $|S(0)| < c$. Define
$$ \hat{\theta}_L = \inf\{t : S(t) < c\} \quad \mbox{and} \quad \hat{\theta}_U = \sup\{t : S(t) > -c\} . \tag{1.3.10} $$
Then, because $S(\theta)$ is nonincreasing,
$$ \{\theta : |S(\theta)| < c\} = \{\theta : \hat{\theta}_L \le \theta \le \hat{\theta}_U\} . \tag{1.3.11} $$
Thus from Theorem 1.3.1,
$$ P_\theta(\hat{\theta}_L \le \theta \le \hat{\theta}_U) = P_\theta(|S(\theta)| < c) = P_0(|S(0)| < c) = 1 - \alpha . \tag{1.3.12} $$
Hence, inverting a size $\alpha$ test results in the $(1-\alpha)100\%$ confidence interval $(\hat{\theta}_L, \hat{\theta}_U)$.

Thus a norm not only provides a fitting criterion but also a complete inference. As with all statistical analyses, checks on the appropriateness of the model and the quality of fit are needed. Useful plots here include: stem-leaf plots and $q$-$q$ plots to check shape and distributional assumptions, boxplots and dotplots to check for outlying observations, and a plot of $X_i$ versus $i$ (or other appropriate variables) to check for dependence between observations. Some of these diagnostic checks are performed in the next section of numerical examples.

In the next three examples, we discuss the inference for the norms associated with the location functionals presented in the last section. We state the results of their associated inference, which we will derive in later sections.

Example 1.3.1 ($L_1$-Norm). Recall that the $L_1$ norm is defined as $\|\mathbf{x}\|_1 = \sum |x_i|$; hence the associated dispersion and negative gradient functions are given respectively by $D_1(\theta) = \sum |X_i - \theta|$ and $S_1(\theta) = \sum \operatorname{sgn}(X_i - \theta)$. Letting $H_n$ denote the empirical cdf, we can write the estimating equation as
$$ 0 = n^{-1} \sum \operatorname{sgn}(x_i - \theta) = \int \operatorname{sgn}(x - \theta)\, dH_n(x) . $$
The solution, of course, is $\hat{\theta}$, the median of the observations. If we replace the empirical cdf $H_n$ by the true underlying cdf $H$, then the estimating equation becomes the defining equation for the parameter $\theta = T(H)$. In this case, we have
$$ 0 = \int \operatorname{sgn}(x - T(H))\, dH(x) = -\int_{-\infty}^{T(H)} dH(x) + \int_{T(H)}^{\infty} dH(x) ; $$
hence, $H(T(H)) = 1/2$ and, solving for $T(H)$, we find $T(H) = H^{-1}(1/2)$, as expected.

As we show in Section 1.5,
$$ \hat{\theta} \mbox{ has an asymptotic } N(\theta, \tau_S^2/n) \mbox{ distribution} , \tag{1.3.13} $$
where $\tau_S = 1/(2h(\theta))$. Estimation of the standard deviation of $\hat{\theta}$ is discussed in Section 1.5.

Turning next to testing the hypotheses (1.3.6), the gradient test statistic is $S_1(0) = \sum \operatorname{sgn}(X_i)$. But we can write $S_1(0) = S_1^+ - S_1^- + S_1^0$, where $S_1^+ = \sum I(X_i > 0)$, $S_1^- = \sum I(X_i < 0)$, and $S_1^0 = \sum I(X_i = 0) = 0$ with probability one, since we are sampling from a continuous distribution; $I(\cdot)$ is the indicator function. In practice, we must deal with ties, and this is usually done by setting aside those observations that are equal to the hypothesized value and carrying out the test with a reduced sample size. Now note that $n = S_1^+ + S_1^-$, so that we can write $S_1 = 2S_1^+ - n$ and the test can be based on $S_1^+$. The null distribution of $S_1^+$ is binomial with parameters $n$ and $1/2$. Hence the level $\alpha$ sign test of the hypotheses (1.3.6) is
$$ \mbox{Reject } H_0 \mbox{ in favor of } H_A \mbox{ if } S_1^+ \le c_1 \mbox{ or } S_1^+ \ge n - c_1 , \tag{1.3.14} $$
and $c_1$ satisfies
$$ P[\operatorname{bin}(n, 1/2) \le c_1] = \alpha/2 , \tag{1.3.15} $$
where $\operatorname{bin}(n, 1/2)$ denotes a binomial random variable based on $n$ trials with probability of success $1/2$. Note that the critical value of the test can be determined without specifying the shape of $F$. In this sense, the test based on $S_1$ is distribution free or nonparametric. Using the asymptotic null distribution of $S_1^+$, $c_1$ can be approximated as $c_1 \doteq n/2 - n^{1/2} z_{\alpha/2}/2 - .5$, where $\Phi(-z_{\alpha/2}) = \alpha/2$, $\Phi(\cdot)$ is the standard normal cdf, and $.5$ is the continuity correction.

For the associated $(1-\alpha)100\%$ confidence interval, we follow the general development above, (1.3.12). Hence, we must find $\hat{\theta}_L = \inf\{t : S_1^+(t) < n - c_1\}$, where $c_1$ is given by (1.3.15). Note that $S_1^+(t) < n - c_1$ if and only if the number of $X_i$ greater than $t$ is less than $n - c_1$. But $\#\{i : X_i > X_{(c_1+1)}\} = n - c_1 - 1$ and $\#\{i : X_i > X_{(c_1+1)} - \epsilon\} \ge n - c_1$ for any $\epsilon > 0$. Hence, $\hat{\theta}_L = X_{(c_1+1)}$. A similar argument shows that $\hat{\theta}_U = X_{(n-c_1)}$. We can summarize this by saying that the $(1-\alpha)100\%$ $L_1$ confidence interval is the half open, half closed interval
$$ [X_{(c_1+1)}, X_{(n-c_1)}) , \quad \mbox{where } \alpha/2 = P(S_1^+(0) \le c_1) \mbox{ determines } c_1 . \tag{1.3.16} $$
The critical value $c_1$ can be determined from the $\operatorname{bin}(n, 1/2)$ distribution or from the normal approximation cited above. The interval developed here is a distribution-free confidence interval, since the confidence coefficient is determined from the binomial distribution without making any shape assumption on the underlying model distribution.
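The sign test and interval are easy to compute directly. The following R sketch (ours, not RBR code) implements (1.3.14)-(1.3.16); $c_1$ is taken conservatively as the largest integer with $P[\operatorname{bin}(n,1/2) \le c_1] \le \alpha/2$.

# Sign test of H0: theta = 0 and the (1 - alpha)100% L1 confidence interval
# [X(c1+1), X(n-c1)); a sketch of (1.3.14)-(1.3.16), not RBR code.
sign.inference <- function(x, alpha = 0.05) {
  x  <- x[x != 0]                        # set observations tied at 0 aside
  n  <- length(x)
  S1 <- sum(x > 0)                       # S1+
  c1 <- qbinom(alpha / 2, n, 0.5) - 1    # P(bin(n, 1/2) <= c1) <= alpha/2
  pval <- 2 * min(pbinom(S1, n, 0.5), 1 - pbinom(S1 - 1, n, 0.5))
  xo <- sort(x)
  list(S1.plus = S1, p.value = min(1, pval),
       conf.int = c(xo[c1 + 1], xo[n - c1]))
}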

Example 1.3.2 ($L_2$-Norm). Recall that the square of the $L_2$-norm is given by $\|\mathbf{x}\|_2^2 = \sum_{i=1}^n x_i^2$. As shown in Exercise 1.12.4, the estimate determined by this norm is the sample mean $\overline{X}$, and the functional parameter is $\mu = \int x h(x)\,dx$, provided it exists. Hence the $L_2$ norm is consistent for the mean location problem. The associated test statistic is equivalent to Student's $t$-test. The approximate distribution of $\overline{X}$ is $N(0, \sigma^2/n)$, provided the variance $\sigma^2 = \operatorname{Var} X_1$ exists. Hence, the test statistic is not distribution free. In practice, $\sigma$ is replaced by its estimate $s = \left(\sum (X_i - \overline{X})^2/(n-1)\right)^{1/2}$ and the test is based on the $t$-ratio, $t = \sqrt{n}\,\overline{X}/s$, which, under the null hypothesis, is asymptotically $N(0, 1)$. The usual confidence interval is $\overline{X} \pm t_{\alpha/2, n-1}\, s/\sqrt{n}$, where $t_{\alpha/2, n-1}$ is the $(1-\alpha/2)$-quantile of a $t$-distribution with $n-1$ degrees of freedom. This interval has the approximate confidence coefficient $(1-\alpha)100\%$, unless the errors, $e_i$, follow a normal distribution, in which case it has exact confidence.

Example 1.3.3 (Weighted $L_1$ Norm). Consider the function
$$ \|\mathbf{x}\|_3 = \sum_{i=1}^n R(|x_i|)\,|x_i| , \tag{1.3.17} $$

where $R(|x_i|)$ denotes the rank of $|x_i|$ among $|x_1|, \ldots, |x_n|$. As the next theorem shows, this function is a norm on $R^n$. See Section 1.8 for a general weighted $L_1$ norm.

Theorem 1.3.2. The function $\|\mathbf{x}\|_3 = \sum j\,|x|_{(j)} = \sum R(|x_j|)\,|x_j|$ is a norm, where $R(|x_j|)$ is the rank of $|x_j|$ among $|x_1|, \ldots, |x_n|$ and $|x|_{(1)} \le \cdots \le |x|_{(n)}$ are the ordered absolute values.

Proof. The equality relating $\|\mathbf{x}\|_3$ to the ranks is clear. To show that we have a norm, we first note that $\|\mathbf{x}\|_3 \ge 0$ and that $\|\mathbf{x}\|_3 = 0$ if and only if $\mathbf{x} = \mathbf{0}$. Also, clearly $\|a\mathbf{x}\|_3 = |a|\,\|\mathbf{x}\|_3$ for any real $a$. Hence, to finish the proof, we must verify the triangle inequality. Now,
$$ \|\mathbf{x} + \mathbf{y}\|_3 = \sum j\,|x + y|_{(j)} = \sum R(|x_i + y_i|)\,|x_i + y_i| \le \sum R(|x_i + y_i|)\,|x_i| + \sum R(|x_i + y_i|)\,|y_i| . \tag{1.3.18} $$
Consider the first term on the right side. By summing through another index we can write it as
$$ \sum R(|x_i + y_i|)\,|x_i| = \sum b_j\,|x|_{(j)} , $$
where $b_1, \ldots, b_n$ is a permutation of the integers $1, \ldots, n$. Suppose the $b_j$ are not in order; then there exist a $t$ and an $s$ such that $|x|_{(t)} \le |x|_{(s)}$ but $b_t > b_s$. Whence,
$$ [b_s |x|_{(t)} + b_t |x|_{(s)}] - [b_t |x|_{(t)} + b_s |x|_{(s)}] = (b_t - b_s)(|x|_{(s)} - |x|_{(t)}) \ge 0 , $$
so such an interchange never decreases the sum. This leads to the result
$$ \sum R(|x_i + y_i|)\,|x_i| \le \sum j\,|x|_{(j)} . $$
A similar result holds for the second term on the right side of (1.3.18). Therefore, $\|\mathbf{x} + \mathbf{y}\|_3 \le \sum j\,|x|_{(j)} + \sum j\,|y|_{(j)} = \|\mathbf{x}\|_3 + \|\mathbf{y}\|_3$, and this completes the proof.

The above argument is taken from Hardy, Littlewood, and Polya (1952).

A Psimilar result P holds for the second term on the right side of ( 1.3.18). Therefore, kx+yk3 ≤ j|x|(j) + j|y|(j) = kxk3 + kyk3, and, this completes the proof. The above argument is taken from Hardy, Littlewood, and Polya (1952).

10

CHAPTER 1. ONE SAMPLE PROBLEMS

We shall call this norm the weighted L1 Norm. In the next theorem, we offer an interesting identity satisfied by this norm. First, though, we need another representation of it. For a random sample X1 , . . . , Xn , define the anti-ranks to be the random variables D1 , . . . , Dn such that (1.3.19) Z1 = |XD1 | ≤ . . . ≤ Zn = |XDn | . For example, if D1 = 2 then |X2 | is the smallest absolute value and Z1 has rank 1. Note that the anti-rank function is just the inverse of the rank function. We can then write kxk3 =

n X

j|x|(j) =

i=j

n X j=1

j|xDj | .

(1.3.20)

Theorem 1.3.3. For any vector x, X X xi + xj X X xi − xj kxk3 = 2 + 2 .

(1.3.21)

i

onesampwil(diffs)

Results for the Wilcoxon-Signed-Rank procedure Test of theta = 0 versus theta not equal to 0 Test-Stat. is T 54 Standardized (z) Test-Stat. is Estimate 1.3 SE is 0.484031 95 % Confidence Interval is ( 0.9 , 2.7 ) Estimate of the scale parameter tau 1.530640

2.70113 p-vlaue 0.00691043

1.4. EXAMPLES >

15

onesampsgn(diffs)

Results for the Sign procedure Test of theta = 0 versus theta not equal to 0 Test stat. S is 9 Standardized (z) Test-Stat. 2.666667 p-vlaue 0.007660761 Estimate 1.3 SE is 0.4081708 95 % Confidence Interval is ( 0.8 , 2.4 ) Estimate of the scale parameter tau 1.290749 > temp=onesampt(diffs)

Results for the t-test procedure Test of theta = 0 versus theta not equal to 0 Test stat. Ave(x) - 0 is 1.58 Standardized (t) Test-Stat. 4.062128 p-vlaue 0.00283289 Estimate 1.58 SE is 0.3889587 95 % Confidence Interval is ( 0.7001142 , 2.459886 ) Estimate of the scale parameter sigma 1.229995 The confidence interval corresponding to the sign test is (0.8, 2.4) which is shifted above 0. Hence, there is strong support for the alternative hypothesis that the location of the difference distribution is not equal to zero. That is, we reject H0 : θ = 0 in favor of HA : θ 6= 0 at α = .05. All three tests support this conclusion. The estimates of location corresponding to the three tests are the median (1.3), the median of the Walsh averages (1.3), and the mean of the sample differences (1.58). Note that the outlier had an effect on the sample mean. In order to see how sensitive the test statistics are to outliers, we change the value of the outlier (difference in the 10th row of Table 1.4.1 and plot the value of the test statistic against the value of the difference in the 10th row of Table 1.4.1; see Panel C of Figure 1.4.1. Note that as the value of the 10th difference changes the t-test changes quite rapidly. In fact, the t-test can be pulled out of the rejection region by making the difference sufficiently small or large. However, the sign test , Panel D of Figure 1.4.1, stays constant until the difference crosses zero and then only changes by 2. This illustrates the high sensitivity of the t-test to outliers and the relative resistance of the sign test. A similar plot can be prepared for the Wilcoxon signed rank test; see Exercise 1.12.8. In addition, the corresponding pvalues can be plotted to see how sensitive the decision to reject the null hypothesis is to outliers. Sensitivity plots are similar to influence functions. We discuss influence functions for estimates in Section 1.6. Example 1.4.2. Shoshoni Rectangles.

16

CHAPTER 1. ONE SAMPLE PROBLEMS Table 1.4.2: Width to Length Ratios of Rectangles 0.553 0.570 0.576 0.601 0.606 0.606 0.609 0.611 0.615 0.628 0.654 0.662 0.668 0.670 0.672 0.690 0.693 0.749 0.844 0.933

The golden rectangle is a rectangle in which the ratio of the width to length is approximately 0.618. It can be characterized in various ways. For example, w/l = l/(w + l) characterizes the golden rectangle. It is considered to be an aesthetic standard in Western civilization and appears in art and architecture going back to the ancient Greeks. It now appears in such items as credit and business cards. In a cultural anthropology study, DuBois (1960) reports on a study of the Shoshoni beaded baskets. These baskets contain beaded rectangles and the question was whether the Shoshonis use the same aesthetic standard as the West. A sample of twenty width to length ratios from Shoshoni baskets is given in Table 1.4.2. Panel A of Figure 1.4.2 shows the notched boxplot containing the 95% L1 confidence interval for θ the median of the population of w/l ratios. It shows two outliers which are also apparent in the normal quantile plot, Panel B of Figure 1.4.2. We used the sign procedure to analyze the data, perfoming the computations with the RBR function onesampsgn. For Figure 1.4.2: Panel A: Boxplot of Width to Length Ratios of Shoshoni Rectangles; Panel B: Normal q−q plot. Panel A

Panel B

0.9

0.9

*

0.8 0.7

Width to length ratios

*

0.6

0.8 0.7 0.6

Width to length ratios

*

*

**

** *** * * * ****

−1.5 −1.0 −0.5

0.0

**

0.5

1.0

1.5

Normal quantiles

this problem, it is of interest to test H0 : θ = 0.618 (the golden rectangle). The display

1.5. PROPERTIES OF NORMED-BASED INFERENCE

17

below shows this evaluation for the sign test along with a 90% confidence interval for θ. > onesampsgn(x,theta0=.618,alpha=.10) Results for the Sign procedure Test of theta = 0.618 versus theta not equal to 0.618 Test stat. S is 2 Standardized (z) Test-Stat. 0.2236068 p-vlaue 0.8230633 Estimate 0.641 SE is 0.01854268 90 % Confidence Interval is ( 0.609 , 0.67 ) Estimate of the scale parameter tau 0.0829254 With a p-value of 0.823, there is no evidence to refute the null hypothesis. Further. we see that the golden rectangle 0.618 is contained in the confidence interval. This suggests that there is no evidence in this data that the Shoshonis are using a different standard. For comparison, the analysis based on the t-procedure is > onesampt(x,theta0=.618,alpha=.10) Results for the t-test procedure Test of theta = 0.618 versus theta not equal to 0.618 Test stat. Ave(x) - 0.618 is 0.0425 Standardized (t) Test-Stat. 2.054523 p-vlaue 0.05394133 Estimate 0.6605 SE is 0.02068606 90 % Confidence Interval is ( 0.624731 , 0.696269 ) Estimate of the scale parameter sigma 0.09251088 Based on the t-test with the p-value of 0.053, one might conclude that there is evidence that the Shoshonis are using a different standard. Further, the 90% t-interval does not contain the golden rectangle ratio. Based on the t-analysis, a researcher might conclude that there is evidence that the Shoshonis are using a different standard. Hence, the robust and traditional approaches lead to different practical conclusions for this problem. The outliers, of course impaired the t-analysis. For this data, we have more faith in the simple sign test.

1.5

Properties of Normed-Based Inference

In this section, we establish statistical properties of the inference described in Section 1.3 for the norm-fit of a location model. These properties describe the null and alternative distributions of the test, ( 1.3.7), and the asymptotic distribution of the estimate, (1.3.2). Furthermore, these properties allow us to derive relative efficiencies between competing procedures. While our discussion is general, we will illustrate the inference based on the L1 and

18

CHAPTER 1. ONE SAMPLE PROBLEMS

L2 norms as we proceed. The inference based on the signed-rank norm will be considered in Section 1.7 and that based on norms of general signed-rank scores in Section 1.8. We assume then that Model ( 1.2.1) holds for a random sample X1 , . . . , Xn with common distribution and density functions H(x) = F (x − θ) and h(x) = f (x − θ), respectively. Next a norm is specified to fit the model. We will assume that the induced functional is 0 at F , i.e., T (F ) = 0. Let S(θ) be the gradient function induced by the norm. We establish the properties of the inference by considering the null and alternative behavior of the gradient test. For convenience, we consider the one sided hypothesis, H0 : θ = 0 versus HA : θ > 0 .

(1.5.1)

Since S(θ) is nonincreasing, a level α test of these hypotheses based on S(0) is Reject H0 in favor of HA if S(0) ≥ c ,

(1.5.2)

where c is such that P0 [S(0) ≥ c] = α. The power function of this test is given by, γS (θ) = Pθ [S(0) ≥ c] = P0 [S(−θ) ≥ c] ,

(1.5.3)

where the last equality follows from Theorem 1.3.1. The power function forms a convenient summary of the test based on S(0). The probability of a Type I Error (level of the test) is given by γS (0). The probability of a Type II error at the alternative θ is βS (θ) = 1 − γS (θ). For a given test of hypotheses ( 1.5.1) we want the power function to be increasing in θ with an upper limit of one. In the first subsection below, we establish these properties for the test ( 1.5.2). We can also compare level α-tests of ( 1.5.1) by comparing their powers at alternative hypotheses. These are efficiency considerations and they are covered in later subsections.

1.5.1

Basic Properties of the Power Function γS (θ)

As a first step we show that γS (θ) is nondecreasing: Theorem 1.5.1. Suppose the test of H0 : θ = 0 versus HA : θ > 0 rejects when S(0) ≥ c. Then the power function is nondecreasing in θ. Proof. Recall that S(θ) is nonincreasing in θ since D(θ) is convex. By Theorem 1.3.1, γS (θ) = P0 [S(−θ) ≥ c]. Now, if θ1 ≤ θ2 then S(−θ1 ) ≤ S(−θ2 ) and , hence, S(−θ1 ) ≥ c implies that S(−θ2 ) ≥ c. It then follows that P0 (S(−θ1 ) ≥ c) ≤ P0 (S(−θ2 ) ≥ c) and the power function is monotone in θ as required. This theorem shows that the test of H0 : θ = 0 versus HA : θ > 0 based on S(0) is unbiased, that is, Pθ (S(0) ≥ c) ≥ α for positive θ, where α is the size of the test. At times it is convenient to consider the more general null hypothesis: H0∗ : θ ≤ 0 versus HA : θ > 0 .

(1.5.4)

1.5. PROPERTIES OF NORMED-BASED INFERENCE

19

A test of H0∗ versus HA with power function γS is said to have level α, if sup γS (θ) = α . θ≤0

The proof of Theorem 1.5.1 shows that γS (θ) is nondecreasing in all θ ∈ R. Since the gradient test has level α for H0 , it follows immediately that it has level α for H0∗ also. We next show that the power function of the gradient test converges to 1 as θ → ∞. We formally define this as: Definition 1.5.1. Consider a level α test for the hypotheses ( 1.5.1) which has power function γS (θ). We say the test is resolving, if γS (θ) → 1 as θ → ∞. Theorem 1.5.2. Suppose the test of H0 : θ = 0 versus HA : θ > 0 rejects when S(0) ≥ c. Further, let η = supθ S(θ) and suppose that η is attained for some finite value of θ. Then the test is resolving, that is, Pθ (S(0) ≥ c) → 1 as θ → ∞. Proof. Since S(θ) is nonincreasing, for any unbounded increasing sequence θm , S(θm ) ≥ S(θm+1 ). For fixed n and F , there is a real number a such that P0 (| Xi |≤ a, i = 1, . . . , n) > 1 − ǫ for any specified ǫ > 0. Let Aǫ denote the event {| Xi |≤ a, i = 1, . . . , n}. Now, Pθm (S(0) ≥ c) = P0 (S(−θm ) ≥ c) = 1 − P0 (S(−θm ) < c) = 1 − P0 ({S(−θm ) < c} ∩ Aǫ ) − P0 ({S(−θm ) < c} ∩ Acǫ ) . The hypothesis of the theorem implies that, for sufficiently large m, {S(−θm ) < c} ∩ Aǫ is empty. Further, P0 ({S(−θm ) < c} ∩ Acǫ ) ≤ P0 (Acǫ ) < c. Hence, for m sufficiently large, Pθm (S(0) ≥ c) ≥ 1 − ǫ and the proof is complete. The condition of boundedness imposed on S(θ) in the above theorem holds for almost all the nonparametric tests discussed in this book; hence, these nonparametric tests will be resolving. Thus they will be able to discern large alternative hypotheses with high power. What can be said at a fixed alternative? Recall the definition of a consistent test: Definition 1.5.2. We say that a test is consistent if the power tends to one for each fixed alternative as the sample size n increases. The alternatives consist in specific values of θ and a cdf F . Consistency implies that the test is behaving as expected when the sample size increases and the alternative hypothesis is true. To obtain consistency of the gradient test, we need to impose the following two assumptions on S(θ): first P

S(θ) = S(θ)/nγ → µ(θ) where µ(0) = 0 and

µ(0) < µ(θ) for all θ > 0,

(1.5.5)

20

CHAPTER 1. ONE SAMPLE PROBLEMS

for some γ > 0 and secondly, E0 S(0) = 0 and



D

n S(0) → N(0, σ 2 (0)) under H0 for all F ,

(1.5.6)

for some positive constant σ(0). The first assumption means that S(0) separates the null from the alternative hypothesis. Note, it is not crucial that µ(0) = 0, since this can always be achieved by recentering. It will be useful to have the following result concerning the asymptotic null distribution of S(0). Its proof follows readily from the definition of convergence in distribution. √ Theorem 1.5.3. Assume ( 1.5.6). The test defined by n S(0) ≥ zα σ(0) where zα is the upper α percentile from the standard normal cdf ie. 1 − Φ(zα ) = α is asymptotically size α. √ Hence, P0 ( n S(0)) ≥ zα σ(0)) → α. It follows that a gradient test is consistent; i.e., √ Theorem 1.5.4. Assume conditions ( 1.5.5) and ( 1.5.6). Then the gradient test n S(0) ≥ zα σ(0) is consistent, ie. the power at fixed alternatives tends to one as n increases. Proof. Fix θ∗ > 0 and F . For ǫ > 0 and for large n, we have n−1/2 zα σ(0) < µ(θ∗ ) − ǫ. This leads to the following string of inequalities: Pθ∗ ,F (S(0) ≥ n−1/2 zα σ(0)) ≥ Pθ∗ ,F (S(0) ≥ µ(θ∗ ) − ǫ) ≥ Pθ∗ ,F (| S(0) − µ(θ∗ ) |≤ ǫ) → 1 , which is the desired result. Example 1.5.1. The L1 Case Assume that the model cdf F has the unique median 0. Consider the L1 norm. The associated level α gradient test of ( 1.5.1) is equivalent to the sign test given by: Reject H0 in favor of HA if S1+ =

P

I(Xi > 0) ≥ c ,

where c is such that P [bin(n, 1/2) ≥ c] = α. The test is nonparametric, i.e., it does not depend on F . From the above discussion its power function is nondecreasing in θ. Since S1+ (θ) is bounded and attains its bound on a finite interval, the test is resolving. For consistency, take γ = 1 in expression ( 1.5.5). Then E[n−1 S1+ (0)] = P (X > 0) = 1 − F (−θ) = µ(θ). An application of the Weak Law of Large numbers shows that the limit in condition ( 1.5.5) holds. Further, µ(0) = 1/2 < µ(θ) for all θ > 0 and all F . Finally, apply the Central Limit Theorem to show that ( 1.5.6) holds. Hence, the sign test is consistent for location alternatives. Further, it is consistent for each pair θ, F such that P (X > 0) > 1/2. A discussion of these properties for the gradient test based on the L2 -norm can be found in Exercise 1.12.5.
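Continuing Example 1.5.1, the exact power of the one-sided sign test is a binomial tail probability, since $S_1^+ \sim \operatorname{bin}(n, 1 - F(-\theta))$ under a shift of $\theta$. The following R sketch is ours; the standard normal error model and the displayed $n$ and $\alpha$ are illustrative choices.

# Exact power of the level-alpha one-sided sign test under a normal shift model.
# A sketch: S1+ ~ bin(n, p(theta)) with p(theta) = 1 - F(-theta) = mu(theta).
sign.power <- function(theta, n = 20, alpha = 0.05) {
  crit <- qbinom(1 - alpha, n, 0.5) + 1   # smallest c with P0(S1+ >= c) <= alpha
  p <- 1 - pnorm(-theta)                  # F taken as the standard normal cdf
  1 - pbinom(crit - 1, n, p)              # P_theta(S1+ >= crit)
}
round(sign.power(seq(0, 2, by = 0.25)), 3)  # nondecreasing in theta, tending to 1

The computed power function is nondecreasing and tends to one, in line with Theorems 1.5.1, 1.5.2, and 1.5.4.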

1.5.2 Asymptotic Linearity and Pitman Regularity

In the last section we discussed some of the basic properties of the power function for a gradient test. Next we establish some general results that allow us to compare power functions for different level α tests. These results also lead to the asymptotic distributions of the location estimators θ̂ based on norm fits. We will also make use of them in later sections and chapters. Assume the setup found at the beginning of this section; i.e., we are considering the location model (1.3.1) and we have specified a norm with gradient function S(θ). We first define a Pitman Regular process:

Definition 1.5.3. We say an estimating function S(θ) is Pitman Regular if the following four conditions hold: first,

$$S(\theta) \text{ is nonincreasing in } \theta; \qquad (1.5.7)$$

second, letting $\bar S(\theta) = S(\theta)/n^{\gamma}$ for some γ > 0, there exists a function µ(θ) such that µ(0) = 0, µ′(θ) is continuous at 0, µ′(0) > 0, and either

$$\bar S(0) \xrightarrow{P_\theta} \mu(\theta) \quad\text{or}\quad E_\theta \bar S(0) = \mu(\theta); \qquad (1.5.8)$$

third,

$$\sup_{|b| \le B}\left|\sqrt{n}\,\bar S\!\left(\frac{b}{\sqrt n}\right) - \sqrt{n}\,\bar S(0) + \mu'(0)\,b\right| \xrightarrow{P} 0, \qquad (1.5.9)$$

for any B > 0; and fourth, there is a constant σ(0) such that

$$\sqrt{n}\left(\frac{\bar S(0)}{\sigma(0)}\right) \xrightarrow{D_0} N(0, 1). \qquad (1.5.10)$$

Further, the quantity

$$c = \mu'(0)/\sigma(0) \qquad (1.5.11)$$

is called the efficacy of S(θ).

Condition (1.5.9) is called the asymptotic linearity of the process S(θ). Often we can compute c when we have the mean under general θ and the variance under θ = 0. Thus

$$\mu'(0) = \frac{d}{d\theta}E_\theta[\bar S(0)]\Big|_{\theta=0} \quad\text{and}\quad \sigma^2(0) = \lim_{n\to\infty}\{n\operatorname{Var}_0(\bar S(0))\}. \qquad (1.5.12)$$

Hence, another way of expressing the asymptotic linearity of S(θ) is

$$\sqrt{n}\left(\frac{\bar S(b/\sqrt n)}{\sigma(0)}\right) = \sqrt{n}\left(\frac{\bar S(0)}{\sigma(0)}\right) - cb + o_p(1). \qquad (1.5.13)$$


If we replace b by $\sqrt n\,\theta_n$ where, of course, $|\sqrt n\,\theta_n| \le B$ for B > 0, then we can write

$$\sqrt{n}\left(\frac{\bar S(\theta_n)}{\sigma(0)}\right) = \sqrt{n}\left(\frac{\bar S(0)}{\sigma(0)}\right) - c\sqrt n\,\theta_n + o_p(1). \qquad (1.5.14)$$

We record one more result on limiting distributions, whose proof follows from Theorems 1.3.1 and 1.5.6.

Theorem 1.5.5. Suppose S(θ) is Pitman Regular. Then

$$\sqrt{n}\left(\frac{\bar S(b/\sqrt n)}{\sigma(0)}\right) \xrightarrow{D_0} Z - cb \qquad (1.5.15)$$

and

$$\sqrt{n}\left(\frac{\bar S(0)}{\sigma(0)}\right) \xrightarrow{D_{-b/\sqrt n}} Z - cb, \qquad (1.5.16)$$

where Z ∼ N(0, 1) and, so, Z − cb ∼ N(−cb, 1).

The second part of this theorem says that the limiting distribution of $\bar S(0)$, when standardized by σ(0) and computed along a sequence of alternatives $-b/n^{1/2}$, is still normal with the same variance of one but with a new mean, namely −cb. This result will be useful in approximating the power near the null hypothesis. We will find asymptotic linearity to be useful in establishing statistical properties. Our next result provides sufficient conditions for linearity.

Theorem 1.5.6. Let $\bar S(\theta) = (1/n^{\gamma})S(\theta)$ for some γ > 0 such that conditions (1.5.7), (1.5.8), and (1.5.10) of Definition 1.5.3 hold. Suppose, for any b ∈ R,

$$n\operatorname{Var}_0(\bar S(n^{-1/2}b) - \bar S(0)) \to 0, \text{ as } n \to \infty. \qquad (1.5.17)$$

Then

$$\sup_{|b| \le B}\left|\sqrt{n}\,\bar S\!\left(\frac{b}{\sqrt n}\right) - \sqrt{n}\,\bar S(0) + \mu'(0)\,b\right| \xrightarrow{P} 0, \qquad (1.5.18)$$

for any B > 0.

Proof. First consider $U_n(b) = [\bar S(n^{-1/2}b) - \bar S(0)]/(b/\sqrt n)$. By (1.5.8) we have

$$E_0(U_n(b)) = \frac{\sqrt n}{b}\,\mu\!\left(\frac{-b}{\sqrt n}\right) = -\frac{\sqrt n}{b}\cdot\frac{b}{\sqrt n}\,\mu'(\xi_n) \to -\mu'(0). \qquad (1.5.19)$$

Furthermore,

$$\operatorname{Var}_0 U_n(b) = \frac{n}{b^2}\operatorname{Var}_0\!\left(\bar S\!\left(\frac{b}{\sqrt n}\right) - \bar S(0)\right) \to 0. \qquad (1.5.20)$$

As Exercise 1.12.9 shows, ( 1.5.19) and ( 1.5.20) imply that Un (b) converges to −µ′ (0) in probability, pointwise in b, i.e., Un (b) = −µ′ (0) + op (1).


For the second part of the proof, let $W_n(b) = \sqrt n\,[\bar S(b/\sqrt n) - \bar S(0) + \mu'(0)b/\sqrt n]$. Further, let ǫ > 0 and γ > 0, and partition [−B, B] into −B = b0 < b1 < . . . < bm = B so that bi − bi−1 ≤ ǫ/(2|µ′(0)|) for all i. There exists N such that n ≥ N implies P[maxi |Wn(bi)| > ǫ/2] < γ. Now suppose that bi−1 ≤ b ≤ bi and that Wn(b) ≥ 0 (a similar argument can be given for Wn(b) < 0). Then, by the monotonicity of $\bar S$,

$$|W_n(b)| = \sqrt n\left[\bar S\!\left(\frac{b}{\sqrt n}\right) - \bar S(0)\right] + b\,\mu'(0) \le \sqrt n\left[\bar S\!\left(\frac{b_{i-1}}{\sqrt n}\right) - \bar S(0)\right] + b_{i-1}\mu'(0) + (b - b_{i-1})\mu'(0) \le |W_n(b_{i-1})| + (b - b_{i-1})|\mu'(0)| \le \max_i |W_n(b_i)| + \epsilon/2.$$

Hence,

$$P_0\left(\sup_{|b| \le B}|W_n(b)| > \epsilon\right) \le P_0\left(\max_i |W_n(b_i)| + \epsilon/2 > \epsilon\right) < \gamma,$$

and

$$\sup_{|b| \le B}|W_n(b)| \xrightarrow{P} 0.$$

In the next three subsections we use these tools to handle the issues of power and efficiency for a general norm-based inference, but first we show that the L1 gradient function is Pitman Regular.

Example 1.5.2. Pitman Regularity of the L1 Process. Assume that the model pdf satisfies f(0) > 0. Recall that the L1 gradient function is

$$S_1(\theta) = \sum_{i=1}^{n}\operatorname{sgn}(X_i - \theta).$$

Take γ = 1 in Theorem 1.5.6; hence, the average of interest is $\bar S_1(\theta) = n^{-1}S_1(\theta)$. This is nonincreasing, so condition (1.5.7) is satisfied. Next it is easy to check that $\mu(\theta) = E_\theta \bar S_1(0) = E_\theta\operatorname{sgn}(X_i) = E_0\operatorname{sgn}(X_i + \theta) = 1 - 2F(-\theta)$. Hence, µ′(0) = 2f(0) and condition (1.5.8) is satisfied. We now consider condition (1.5.17). Consider the case b > 0 (the case b < 0 is similar):

$$\bar S_1(b/\sqrt n) - \bar S_1(0) = n^{-1}\sum_{i=1}^{n}[\operatorname{sgn}(X_i - b/\sqrt n) - \operatorname{sgn}(X_i)] = -(2/n)\sum_{i=1}^{n}I(0 < X_i < b/n^{1/2}).$$

Because this is a sum of independent Bernoulli variables, we have

$$n\operatorname{Var}_0[\bar S_1(b/n^{1/2}) - \bar S_1(0)] \le 4P(0 < X_1 < b/\sqrt n) = 4[F(b/\sqrt n) - F(0)] \to 0.$$

The convergence to 0 occurs since F is continuous. Thus condition (1.5.17) is satisfied. Finally, note that σ(0) = 1, so $\sqrt n\,\bar S_1(0)$ converges in distribution to Z ∼ N(0, 1) by the Central


Limit Theorem. Therefore the L1 gradient process S1(θ) is Pitman Regular. It follows that the efficacy of the L1 norm is

$$c_{L_1} = 2f(0). \qquad (1.5.21)$$

For future reference, we state the asymptotic linearity result for the L1 process: if $|\sqrt n\,\theta_n| \le B$, then

$$\sqrt n\,\bar S_1(\theta_n) = \sqrt n\,\bar S_1(0) - 2f(0)\sqrt n\,\theta_n + o_p(1). \qquad (1.5.22)$$

Example 1.5.3. Pitman Regularity of the L2 Process. In Exercise 1.12.6 it is shown that, provided Xi has finite variance, the L2 gradient function is Pitman Regular and that the efficacy is simply $c_{L_2} = 1/\sigma_f$.

We are now in a position to investigate the efficiency and power properties of the statistical methods based on the L1 norm relative to those based on the L2 norm. As we will see in the next three subsections, these properties depend only on the efficacies.
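As a quick numerical check of the L1 efficacy, one can simulate the sampling distribution of the median at the standard normal, where the asymptotic variance of $\sqrt n\,\hat\theta$ should be $1/c_{L_1}^2 = 1/(2f(0))^2 = \pi/2$; a minimal sketch:

## Monte Carlo check of the asymptotic variance 1/(2 f(0))^2 of the
## L1 estimate (sample median) at the standard normal.
set.seed(123)
n <- 100; nsim <- 10000
meds <- replicate(nsim, median(rnorm(n)))
var(sqrt(n) * meds)     # simulated variance of sqrt(n)*median
1 / (2 * dnorm(0))^2    # asymptotic value: pi/2 = 1.5708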

1.5.3 Asymptotic Theory and Efficiency Results for θ̂

As at the beginning of this section, suppose we have the location model (1.2.1) and that we have chosen a norm to fit the model with gradient function S(θ). In this part we develop the asymptotic distribution of the estimate; the asymptotic variance provides the basis for efficiency comparisons. We will use the asymptotic linearity that accompanies Pitman Regularity. To do this, however, we first need to show that $\sqrt n\,\hat\theta$ is bounded in probability.

Lemma 1.5.1. If the gradient function S(θ) is Pitman Regular, then

$$\sqrt n(\hat\theta - \theta) = O_p(1).$$

Proof. Assume without loss of generality that θ = 0, and take t > 0. By the monotonicity of S(θ), if $S(t/\sqrt n) < 0$ then $\hat\theta \le t/\sqrt n$. Hence, $P_0(S(t/\sqrt n) < 0) \le P_0(\hat\theta \le t/\sqrt n)$. Theorem 1.5.5 implies that the first probability can be made as close to Φ(tc) as desired, which, in turn, can be made as close to 1 as desired. In a similar vein we note that if $S(-t/\sqrt n) > 0$, then $\hat\theta \ge -t/\sqrt n$ and $-\sqrt n\,\hat\theta \le t$. Again, the probability of this event can be made arbitrarily close to 1. Hence, $P_0(|\sqrt n\,\hat\theta| \le t)$ is arbitrarily close to 1 and we have boundedness in probability.

We are now in a position to exploit this boundedness in probability to determine the asymptotic distribution of the estimate.

Theorem 1.5.7. Suppose S(θ) is Pitman Regular with efficacy c. Then $\sqrt n(\hat\theta - \theta)$ converges in distribution to Z ∼ N(0, c−2).

Proof. As usual we assume, without loss of generality, that θ = 0. First recall that θ̂ is defined by $n^{-1/2}S(\hat\theta) \doteq 0$. From Lemma 1.5.1, we know that $\sqrt n\,\hat\theta$ is bounded in probability, so that we can apply (1.5.13) to deduce

$$\sqrt n\left(\frac{\bar S(\hat\theta)}{\sigma(0)}\right) = \sqrt n\left(\frac{\bar S(0)}{\sigma(0)}\right) - c\sqrt n\,\hat\theta + o_p(1).$$

Solving, we have

$$\sqrt n\,\hat\theta = c^{-1}\sqrt n\,\bar S(0)/\sigma(0) + o_p(1);$$

hence, the result follows because $\sqrt n\,\bar S(0)/\sigma(0)$ is asymptotically N(0, 1).

Definition 1.5.4. If we have two Pitman Regular estimates with efficacies c1 and c2, respectively, then the efficiency of θ̂1 with respect to θ̂2 is defined to be the reciprocal ratio of their asymptotic variances, namely, $e(\hat\theta_1, \hat\theta_2) = c_1^2/c_2^2$. The next example compares the L1 estimate to the L2 estimate.

Example 1.5.4. Relative Efficiency between the L1 and L2 Estimates. In this example we compare the L1 and L2 estimates, namely, the sample median and mean. We have seen that their respective efficacies are 2f(0) and $\sigma_f^{-1}$, and their asymptotic variances are $1/[4f^2(0)n]$ and $\sigma_f^2/n$, respectively. Hence, the relative efficiency of the median with respect to the mean is

$$e(\dot X, \bar X) = \operatorname{asyvar}(\sqrt n\,\bar X)/\operatorname{asyvar}(\sqrt n\,\dot X) = c_{\dot X}^2/c_{\bar X}^2 = 4f^2(0)\sigma_f^2, \qquad (1.5.23)$$

where Ẋ is the sample median and X̄ is the sample mean. The efficiency computation depends only on the Pitman efficacies. We illustrate the computation of the efficiency using the contaminated normal distribution. The pdf of the contaminated normal distribution consists of mixing the standard normal pdf with a normal pdf having mean zero and variance δ² > 1. For ǫ between 0 and 1, the pdf can be written

$$f_\epsilon(x) = (1 - \epsilon)\phi(x) + \epsilon\,\delta^{-1}\phi(\delta^{-1}x), \qquad (1.5.24)$$

with $\sigma_f^2 = 1 + \epsilon(\delta^2 - 1)$. This distribution has tails heavier than the standard normal distribution and can be used to model data contamination; see Tukey (1960) for more discussion. We can think of ǫ as the fraction of the data that is contaminated. In Table 1.5.1 we provide values of the efficiencies for various amounts of contamination with δ = 3. Note that when we have 10 percent contamination, the efficiency is 1; this indicates that, for this distribution, the median and mean are equally effective. Finally, this example exhibits a distribution for which the median is superior to the mean as an estimate of the center. See Exercise 1.12.12 for other examples.
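The entries of Table 1.5.1 follow directly from (1.5.23) and (1.5.24); the sketch below reproduces them up to rounding (the value 0.998 at ǫ = .10 appears as 1.000 in the table):

## Efficiency e(median, mean) = 4 f_eps(0)^2 sigma_f^2 for the
## contaminated normal (1.5.24) with delta = 3.
eff.cn <- function(eps, delta = 3) {
  f0   <- (1 - eps) * dnorm(0) + (eps / delta) * dnorm(0)  # f_eps(0)
  sig2 <- 1 + eps * (delta^2 - 1)                          # variance sigma_f^2
  4 * f0^2 * sig2
}
round(sapply(c(0, .03, .05, .10, .15), eff.cn), 3)
# 0.637 0.758 0.833 0.998 1.134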

1.5.4 Asymptotic Power and Efficiency Results for the Test Based on S(θ)

Consider the location model (1.2.1) and assume that we have chosen a norm to fit the model with gradient function S(θ). Consider the gradient test (1.5.2) of the hypotheses (1.5.1). In Section 1.5.1 we showed that the power function of this test is nondecreasing with upper limit one and that it is typically resolving. Further, we showed that for a fixed alternative the test is consistent; thus the power will tend to one as the sample size increases.

Table 1.5.1: Efficiencies of the median relative to the mean for contaminated normal models.

  ǫ     e(Ẋ, X̄)
 .00     .637
 .03     .758
 .05     .833
 .10    1.000
 .15    1.134

To offset this effect, we let the alternative converge to the null value at a rate that stabilizes the power away from one. This enables us to compare two tests along the same sequence of alternatives. Consider the null hypothesis H0 : θ = 0 versus HAn : θ = θn, where $\theta_n = \theta^*/\sqrt n$ and θ∗ > 0. Recall that the asymptotic size α test based on $\bar S(0)$ rejects H0 if $\sqrt n\,\bar S(0)/\sigma(0) \ge z_\alpha$, where 1 − Φ(zα) = α. The following theorem is called the asymptotic power lemma; its proof follows immediately from expression (1.5.13).

Theorem 1.5.8. Assume that S(0) is Pitman Regular with efficacy c. Then the asymptotic local power along the sequence $\theta_n = \theta^*/\sqrt n$ is

$$\gamma_S(\theta_n) = P_{\theta_n}\!\left(\sqrt n\,\bar S(0)/\sigma(0) \ge z_\alpha\right) = P_0\!\left(\sqrt n\,\bar S(-\theta_n)/\sigma(0) \ge z_\alpha\right) \to 1 - \Phi(z_\alpha - \theta^* c),$$

as n → ∞.

Note that larger values of the efficacy imply larger values of the asymptotic local power.

Definition 1.5.5. The Pitman asymptotic relative efficiency of one test relative to another is defined to be $e(S_1, S_2) = c_1^2/c_2^2$.

Note that this is the same formula as the efficiency of one estimate relative to another given in Definition 1.5.4. Therefore, the efficiency results discussed in Example 1.5.4 between the L1 and L2 estimates apply to the sign and t tests also. Hence, we have an example in which the simple sign test is asymptotically more powerful than the t test.

We can also develop a sample size interpretation for the asymptotic power. Suppose we specify a power γ < 1. Further, let zγ be defined by 1 − Φ(zγ) = γ. Then $1 - \Phi(z_\alpha - cn^{1/2}\theta_n) = 1 - \Phi(z_\gamma)$ and $z_\alpha - cn^{1/2}\theta_n = z_\gamma$. Solving for n yields

$$n \doteq (z_\alpha - z_\gamma)^2/(c^2\theta_n^2). \qquad (1.5.25)$$

Typically we take θn = knσ with kn small. Now if S1(0) and S2(0) are two Pitman Regular asymptotically size α tests, then the ratio of sample sizes required to achieve the same asymptotic power along the same sequence of alternatives is given by the approximation

$$n_2/n_1 \doteq c_1^2/c_2^2.$$

This provides additional motivation for the above definition of Pitman efficiency of two tests. The initial development of asymptotic efficiency was done by Pitman (1948) in an unpublished manuscript and later published by Noether (1955).
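Both the limit in Theorem 1.5.8 and the approximation (1.5.25) are one-liners in R; a sketch (the helper names asy.power and samp.size are ours, for illustration):

## Asymptotic local power (Theorem 1.5.8) and sample size (1.5.25)
## for a Pitman Regular test with efficacy c.
asy.power <- function(theta.star, c, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - theta.star * c)
}
samp.size <- function(alpha, gam, c, theta) {
  (qnorm(1 - alpha) - qnorm(1 - gam))^2 / (c^2 * theta^2)
}
asy.power(2, 2 * dnorm(0))            # sign test at the normal, theta* = 2
samp.size(0.05, 0.80, 2 * dnorm(0), 0.25)  # n for power .80 at theta = 0.25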

1.5.5 Efficiency Results for Confidence Intervals Based on S(θ)

In this part we consider the length of the confidence interval as a measure of its efficiency. Suppose that we specify γ = 1 − α for the confidence coefficient. Then let zα/2 be defined by 1 − Φ(zα/2) = α/2. Again we suppose throughout the discussion that the estimating functions are Pitman Regular. Then the endpoints of the 100γ percent confidence interval are given asymptotically by θ̂L and θ̂U such that

$$\sqrt n\,\frac{\bar S(\hat\theta_L)}{\sigma(0)} = z_{\alpha/2} \quad\text{and}\quad \sqrt n\,\frac{\bar S(\hat\theta_U)}{\sigma(0)} = -z_{\alpha/2}; \qquad (1.5.26)$$

see (1.3.10) for the exact versions of the endpoints. The next theorem provides the asymptotic behavior of the length of this interval and, further, shows that the standardized length of the confidence interval is a consistent estimate of the asymptotic standard deviation of $\sqrt n\,\hat\theta$.

Theorem 1.5.9. Suppose S(θ) is a Pitman Regular estimating function with efficacy c. Let L be the length of the corresponding confidence interval. Then

$$\frac{\sqrt n\,L}{2z_{\alpha/2}} \xrightarrow{P} \frac{1}{c}.$$

Proof: Using the same argument as in Lemma 1.5.1, we can show that θ̂L and θ̂U are bounded in probability when multiplied by $\sqrt n$. Hence, the above estimating equations can be linearized to obtain, for example,

$$z_{\alpha/2} = \sqrt n\,\bar S(\hat\theta_L)/\sigma(0) = \sqrt n\,\bar S(0)/\sigma(0) - c\sqrt n\,\hat\theta_L + o_P(1).$$

This can then be solved to find

$$\sqrt n\,\hat\theta_L = \sqrt n\,\bar S(0)/(c\,\sigma(0)) - z_{\alpha/2}/c + o_P(1).$$

When this is also done for θ̂U and the difference is taken, we have

$$n^{1/2}(\hat\theta_U - \hat\theta_L) = 2z_{\alpha/2}/c + o_P(1),$$

which concludes the argument.

From Theorem 1.5.7, θ̂ has an approximate normal distribution with variance c−2/n. So by Theorem 1.5.9, a consistent estimate of the standard error of θ̂ is

$$SE(\hat\theta) = \frac{1}{\sqrt n}\cdot\frac{\sqrt n\,L}{2z_{\alpha/2}} = \frac{L}{2z_{\alpha/2}}. \qquad (1.5.27)$$

If the ratio of squared asymptotic lengths is used as a measure of efficiency, then the efficiency of one confidence interval relative to another is again the ratio of the squares of the efficacies.


The discussion of the properties of estimation, testing, and confidence interval construction shows that, asymptotically at least, the relative merit of a procedure is measured by its efficacy. This measure is the slope of the linear approximation of the standardized estimating function that determines these procedures. In the comparison of L1 and L2 methods, we have seen that the efficiency $e(L_1, L_2) = 4\sigma_f^2 f^2(0)$. There are other types of asymptotic efficiency that have been studied in the literature, along with finite sample versions of these asymptotic efficiencies. The conclusions drawn from these other efficiencies, as well as from simulation studies, are consistent with the picture presented here. Hence, we will not discuss these other measures; see Section 2.6 of Hettmansperger (1984a) for further references.

Example 1.5.5. Estimation of the Standard Error of the Sample Median. Recall that the sample median, when properly standardized, has a limiting normal distribution. Suppose we have a sample of size n from H(x) = F(x − θ), where θ is the unknown median. From Theorem 1.5.7, we know that the approximating distribution for θ̂, the sample median, is normal with mean θ and variance 1/[4nh²(θ)]. We refer to this variance as the asymptotic variance. This normal distribution can be used to approximate probabilities concerning the sample median. When the underlying form of the distribution H is unknown, we must estimate this asymptotic variance. Theorem 1.5.9 provides one key to its estimation. The square root of the asymptotic variance is sometimes called the asymptotic standard error of the sample median; we discuss the estimation of this standard error rather than the asymptotic variance.

As a simple example, in expression (1.5.27) take α = .05, zα/2 ≈ 2, and k = n/2 − n^{1/2}; then we have the following consistent estimate of the asymptotic standard error of the median:

$$SE(\text{median}) \approx [X_{(n/2+n^{1/2})} - X_{(n/2-n^{1/2})}]/4. \qquad (1.5.28)$$

This simple estimate of the asymptotic standard error is based on the length of the 95% confidence interval for the median. Sheather (1987) shows that the estimate can be improved by using the interpolated confidence intervals discussed in Section 1.10. Of course, other confidence intervals with different confidence coefficients can be used also; we recommend using 90% or 95%. Again, see McKean and Schrader (1984) and Sheather (1987). This SE is computed by our R function onesampsgn for general α; the default value of α is set at 0.05.

There are other approaches to the estimation of this standard error. For example, we could estimate the density h(x) directly and then use $h_n(\hat\theta)$, where hn is the density estimate. Another possibility is to estimate the finite sample standard error of the sample median directly. Sheather (1987) surveys these approaches. We will discuss one further possibility here, namely the bootstrap.

The bootstrap has gained wide attention because of its versatility in estimation and testing in nonstandard situations. See Efron and Tibshirani (1993) for a very readable account of the bootstrap. If we know the underlying distribution H(x), then we could estimate the standard error of the median by repeatedly drawing samples with a computer from the distribution H. If we have B samples from H and have computed and stored the B values of the sample median, then our estimate of the standard error of the median is simply the sample standard deviation of these B values. When H is unknown we replace it by Hn, the empirical distribution function, and proceed with the simulation. Later in the chapter we will encounter an example where we want to compute a bootstrap p-value for a test; see Section ??. The bootstrap approach based on Hn is called the nonparametric bootstrap, since nothing is assumed about the form of the underlying distribution H. In another version, called the parametric bootstrap, we suppose that we know the form of the underlying distribution H but there are some unknown parameters, such as the mean and variance. We use the sample to estimate these unknown parameters, insert the values into H, and use this distribution to draw the B samples. In this book we are concerned mainly with the nonparametric bootstrap, and we use the generic term bootstrap to refer to this approach. In either case, ready access to high-speed computing makes this method appealing. The following example illustrates the computations.

Example 1.5.6. Generated Data. Using Minitab, the 30 data points in Table 1.5.2 were generated from a normal distribution with mean 0 and variance 1.

Table 1.5.2: Generated N(0, 1) variates (placed in order).

 -1.79756 -1.66132 -1.46531 -1.45333 -1.21163 -0.92866
 -0.86812 -0.84697 -0.81584 -0.78912 -0.68127 -0.37479
 -0.33046 -0.22897 -0.02502 -0.00186  0.09666  0.13316
  0.17747  0.31737  0.33125  0.80905  0.88860  0.90606
  0.99640  1.26032  1.46174  1.52549  1.60306  1.90116

Thus, we know that the asymptotic standard error should be about $1/[30^{1/2}\,2f(0)] = 0.23$. We will use this to check what happens if we try to estimate the standard error from the data. Using expression (1.3.16), the 95% confidence interval for the median is (−0.789, 0.331). Hence, the length-of-confidence-interval estimate, given in expression (1.5.28), is (0.331 + 0.789)/4 = 0.28. A simple R function was written to bootstrap the sample; see Exercise 1.12.7. Using this function, we obtained 1000 bootstrap samples, and the resulting standard deviation of the 1000 bootstrap medians was 0.27. For this instance, the bootstrap procedure essentially agrees with the length-of-confidence-interval estimate.

Note that, from the data, the sample mean is −0.03575 and the sample standard deviation is 1.04769. If we assume the underlying distribution H is normal with unknown mean and variance, we would use the parametric bootstrap. Hence, instead of sampling from the empirical distribution function, we sample from a normal distribution with mean −0.03575 and standard deviation 1.04769. Using R (see Exercise 1.12.7), we obtained 1000 parametric bootstrap samples. The sample standard deviation of the resulting medians was 0.23, just the value we would expect. You should not expect to get the precise value every time you bootstrap, either parametrically or nonparametrically. It is, however, a very
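A minimal sketch of the two bootstrap computations just described (the book's own function is the subject of Exercise 1.12.7); here x is assumed to hold the 30 values of Table 1.5.2, and the particular seed, and hence the resulting estimates, will vary:

## Nonparametric and parametric bootstrap SEs of the sample median.
x <- c(-1.79756, -1.66132, -1.46531, -1.45333, -1.21163, -0.92866,
       -0.86812, -0.84697, -0.81584, -0.78912, -0.68127, -0.37479,
       -0.33046, -0.22897, -0.02502, -0.00186,  0.09666,  0.13316,
        0.17747,  0.31737,  0.33125,  0.80905,  0.88860,  0.90606,
        0.99640,  1.26032,  1.46174,  1.52549,  1.60306,  1.90116)
set.seed(456)
B  <- 1000
np <- replicate(B, median(sample(x, replace = TRUE)))          # resample from H_n
sd(np)                                                         # nonparametric SE
p  <- replicate(B, median(rnorm(length(x), mean(x), sd(x))))   # resample from fitted normal
sd(p)                                                          # parametric SE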


versatile method for estimating such quantities as standard errors of estimates and p-values of tests.

An unusual aspect of this example is that the bootstrap distribution of the sample median can be found in closed form and does not have to be simulated as described above. The variance of the sample median computed from the bootstrap distribution can then be found; the result is another estimate of the variance of the sample median. This was discovered independently by Maritz and Jarrett (1978) and Efron (1979). We do not pursue this development here because in most cases we must simulate the bootstrap distribution, and that is where the real strength of the bootstrap approach lies. For an interesting comparison of the various estimates of the variance of the sample median, see McKean and Schrader (1984).

1.6 Robustness Properties of Norm-Based Inference

We have just considered the statistical properties of the inference procedures, looking at ideas such as efficiency and power. We now turn to stability or robustness properties: how the inference procedures are affected by outliers or corruption of portions of the data. Ideally, we would like procedures (tests and estimates) which do not respond too quickly to a single outlying value when it is introduced into the sample. Further, we would not like procedures that can be changed by arbitrary amounts by corrupting a small amount of the data. Response to outliers is measured by the influence curve, and response to data corruption is measured by the breakdown value. We will introduce finite sample versions of these concepts. They are easy to work with and, in the limit, they generally equal the more abstract versions based on the study of statistical functionals. We consider first the robustness properties of the estimates and secondly the tests. As in the last section, the discussion is general, but the L1 and L2 procedures are discussed as we proceed. The robustness properties of the procedures based on the weighted L1 norm are covered in Sections 1.7 and 1.8. See Section A.5 of the Appendix for a development based on functionals.

1.6.1 Robustness Properties of θ̂

We begin with the definition of breakdown for the estimator θ̂.

Definition 1.6.1. Estimation Breakdown. Let x = (x1, . . . , xn) represent a realization of a sample, and let

$$x^{(m)} = (x_1^*, \ldots, x_m^*, x_{m+1}, \ldots, x_n)'$$

represent the corruption of any m of the n observations. We define the bias of an estimator θ̂ to be

$$\text{bias}(m; \hat\theta, x) = \sup|\hat\theta(x^{(m)}) - \hat\theta(x)|,$$

where the sup is taken over all possible corrupted samples x(m). Note that we change only x1∗, . . . , xm∗, while xm+1, . . . , xn are fixed at their original values. If the bias is infinite, we say the estimate has broken down, and the finite sample breakdown value is given by

$$\epsilon_n^* = \min\{m/n : \text{bias}(m; \hat\theta, x) = \infty\}. \qquad (1.6.1)$$


This approach to breakdown is called replacement breakdown because observations are replaced by corrupted values; see Donoho and Huber (1983) for more discussion of this approach. Often there exists an integer m such that x(m) ≤ θ̂ ≤ x(n−m+1), and either θ̂ tends to −∞ as x(m) tends to −∞, or θ̂ tends to +∞ as x(n−m+1) tends to +∞. If m∗ is the smallest such integer, then ǫn∗ = m∗/n. Hodges (1967) was the first to introduce these ideas. To remove the effects of sample size, the limit, when it exists, can be computed; in this case we call lim ǫn∗ = ǫ∗ the asymptotic breakdown value.

Example 1.6.1. Breakdown Values for the L1 and L2 Estimates. The L1 estimate is the sample median. If the sample size is n = 2k, then it is easy to see that when x(k) tends to −∞, the median also tends to −∞. Hence, the breakdown value of the sample median is k/n, which tends to .5. By a similar argument, when the sample size is n = 2k + 1, the breakdown value is (k + 1)/n, and it also tends to .5 as the sample size increases. Hence, we say that the sample median is a 50% breakdown estimate. The L2 estimate is the sample mean. A similar analysis shows that its breakdown value is 1/n, which tends to zero. Hence, we say the sample mean is a zero breakdown estimate. This sharply contrasts the two estimates: the median is the most resistant estimate, while the sample mean is the least resistant. In Exercise 1.12.13, the reader is asked to show that the pseudo-median induced by the signed-rank norm, (1.3.25), has breakdown .29.

We have just considered the effect of corrupting some of the observations; the estimate breaks down if we can force it to change by an arbitrary amount by changing the observations over which we have control. Another important concept of stability entails measuring the effect of the introduction of a single outlier. An estimate is stable or resistant if it does not change by a large amount when the outlier is introduced. In particular, we want the change to be bounded no matter what the value of the outlier.

Suppose we have a sample of observations x1, . . . , xn from a distribution centered at 0 and an estimate θ̂n based on these observations. By Pitman Regularity, Definition 1.5.3, and Theorem 1.5.7, we have

$$n^{1/2}\hat\theta_n = c^{-1}n^{-1/2}S(0)/\sigma(0) + o_P(1), \qquad (1.6.2)$$

provided the true parameter is 0. Further, we often have a representation of S(0) as a sum of independent random variables (we may have to make a projection of S(0) to achieve this; see the next chapter for examples of projections). In any case, we then have the following representation:

$$c^{-1}n^{-1/2}S(0)/\sigma(0) = n^{-1/2}\sum_{i=1}^{n}\Omega(x_i) + o_P(1), \qquad (1.6.3)$$

where Ω(·) is the function needed in the representation. When we combine the above two


statements, we have

$$n^{1/2}\hat\theta_n = n^{-1/2}\sum_{i=1}^{n}\Omega(x_i) + o_P(1). \qquad (1.6.4)$$

Recall that the distribution that we are sampling is assumed to be centered at 0. The difference (θ̂n − 0) is approximated by the average of n independent and identically distributed random variables. Since Ω(xi) represents the effect of the ith observation on θ̂n, it is called the influence function. The influence function approximates the rate of change of the estimate when an outlier is introduced. Let xn+1 = x∗ represent a new, outlying, observation. Since θ̂n should be roughly 0, we have

$$(n+1)\hat\theta_{n+1} - (n+1)\hat\theta_n \doteq \Omega(x^*) \quad\text{and}\quad \frac{\hat\theta_{n+1} - \hat\theta_n}{1/(n+1)} \approx \Omega(x^*), \qquad (1.6.5)$$

and this reveals the differential character of the influence function. Hampel (1974) developed the influence function from the theory of von Mises differentiable functions. In Sections A.5 and A.5.2 of the Appendix, we use his formulation to derive several influence functions for later situations. Here, though, we identify influence functions for the estimates through the approximations described above. We now illustrate this approach.

Example 1.6.2. Influence Function for the L1 and L2 Estimates. We briefly describe the influence functions for the sample median and the sample mean, the L1 and L2 estimates. From Example 1.5.2 we have immediately that, for the sample median,

$$n^{1/2}\hat\theta \approx \frac{1}{\sqrt n}\sum_{i=1}^{n}\frac{\operatorname{sgn}(X_i)}{2f(0)}$$

and

$$\Omega(x) = \frac{\operatorname{sgn}(x)}{2f(0)}.$$

Note that the influence function is bounded but not continuous. Hence, outlying observations cannot have an arbitrarily large effect on the estimate. It is this feature, along with the 50% breakdown property, that makes the sample median the prototype of resistant estimates. The sample mean, on the other hand, has an unbounded influence function: it is easy to see that Ω(x) = x, linear and unbounded. Hence, a single large outlier is sufficient to carry the sample mean beyond any bound. The unbounded influence is connected to the 0 breakdown property. Hence, the L2 estimate is the prototype of an estimate highly efficient at a specified model, the normal model in this case, but not resistant. This means that quite


close to the model for which the estimate is optimal, the estimate may perform very poorly; recall Table 1.5.1.
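The differential character of (1.6.5) suggests a simple empirical check, the sensitivity curve (n + 1)(θ̂n+1 − θ̂n) as a function of the added point x∗; a sketch contrasting the mean and median (the generated sample x is arbitrary):

## Empirical influence (sensitivity curve): scaled change in an
## estimate when one outlier x* is appended to the sample.
set.seed(789)
x <- rnorm(20)
sens <- function(est, x, xstar) (length(x) + 1) * (est(c(x, xstar)) - est(x))
xstar <- seq(-10, 10, by = 0.5)
plot(xstar, sapply(xstar, function(s) sens(mean, x, s)), type = "l",
     ylab = "sensitivity curve")   # mean: roughly linear and unbounded in x*
lines(xstar, sapply(xstar, function(s) sens(median, x, s)), lty = 2)
                                   # median: bounded step, like sgn(x*)/(2 f(0))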

1.6.2 Breakdown Properties of Tests

We now turn to the issue of breakdown in testing hypotheses. The problems are a bit different in this case, since we typically want to move, by data corruption, a test statistic into or out of a critical region. It is not a matter of sending the statistic beyond any finite bound, as it is in estimation breakdown.

Definition 1.6.2. Suppose that V is a statistic for testing H0 : θ = 0 versus HA : θ > 0 and we reject the null hypothesis when V ≥ k, where P0(V ≥ k) = α determines k. The rejection breakdown of the test is defined by

$$\epsilon_n^*(\text{reject}) = \min\Big\{m/n : \inf_{x}\sup_{x^{(m)}}V \ge k\Big\}, \qquad (1.6.6)$$

where the sup is taken over all possible corruptions of m data points. Likewise, the acceptance breakdown is defined to be

$$\epsilon_n^*(\text{accept}) = \min\Big\{m/n : \sup_{x}\inf_{x^{(m)}}V < k\Big\}. \qquad (1.6.7)$$

Rejection breakdown is the smallest portion of the data that can be corrupted to guarantee that the test will reject. Acceptance breakdown is interpreted as the smallest portion of the data that must be corrupted to guarantee that the test statistic will not be in the critical region; i.e., the test is guaranteed to fail to reject the null hypothesis. We turn immediately to a comparison of the L1 and L2 tests.

Example 1.6.3. Rejection Breakdown of the L1 Test. We first consider the one-sided sign test for testing H0 : θ = 0 versus HA : θ > 0. The asymptotically size α test rejects the null hypothesis when $n^{-1/2}S_1(0) \ge z_\alpha$, the upper α quantile of the standard normal distribution. To see exactly what happens, it is easier to convert the test to $S_1^+(0) = \sum I(X_i > 0) \ge n/2 + (n^{1/2}z_\alpha)/2$. Now each observation made positive increases S1+(0) by one. Hence, if we wish to guarantee that the test rejects, we make m∗ observations positive, where $m^* = [n/2 + (n^{1/2}z_\alpha)/2] + 1$, with [·] the greatest integer function. Then the rejection breakdown is

$$\epsilon_n^*(\text{reject}) = m^*/n \doteq \frac{1}{2} + \frac{z_\alpha}{2n^{1/2}}.$$

Likewise,

$$\epsilon_n^*(\text{accept}) \doteq \frac{1}{2} - \frac{z_\alpha}{2n^{1/2}}.$$

Note that the rejection breakdown converges down to the estimation breakdown and the acceptance breakdown converges up to it.


We next turn to the one-sided Student's t-test. Acceptance breakdown for the t-test is simple: by making a single observation approach −∞, the t-statistic can be made negative; hence we can always guarantee acceptance with control of one observation. The rejection breakdown is more interesting. If we increase an observation, both the sample mean and the sample standard deviation increase; hence, it is not at all clear what happens to the t-statistic. In fact, it is not sufficient to increase a single observation in order to force the t-statistic to move into the critical region. We now show that the rejection breakdown for the t-statistic is

$$\epsilon_n^*(\text{reject}) = \frac{t_\alpha^2}{n - 1 + t_\alpha^2} \to 0, \text{ as } n \to \infty,$$

where tα is the upper α quantile of a t-distribution with n − 1 degrees of freedom. The infimum part of the definition suggests that we set all observations at −B < 0 and then change m observations to M > 0. The result is

$$\bar x = \frac{mM - (n - m)B}{n} \quad\text{and}\quad s^2 = \frac{m(n - m)(M + B)^2}{(n - 1)n}.$$

Putting these two quantities together, we have

$$\frac{n^{1/2}\bar x}{s} = \left[\frac{n - 1}{m(n - m)(1 + B/M)^2}\right]^{1/2}[m - (n - m)B/M] \to \left[\frac{m(n - 1)}{n - m}\right]^{1/2},$$

as M → ∞. We now equate the limit to tα and solve for m to get $m = nt_\alpha^2/(n - 1 + t_\alpha^2)$ (actually we would take the greatest integer and add one). Then the rejection breakdown is m divided by n, as stated. Table 1.6.1 compares rejection breakdown values for the sign and t-tests. We assume α = .05, and the sample sizes are chosen so that the size of the sign test is quite close to .05. For further discussion, see Ylvisaker (1977).

These definitions of breakdown assume a worst-case scenario: they assume that the test statistic is as far away from the critical region (for rejection breakdown) as possible. In practice, however, a test statistic may be quite near the edge of the critical region, so that only one observation is needed to change the decision from fail-to-reject to reject. An alternative form of breakdown considers the average number of observations that must be corrupted, conditional on the test statistic being in the acceptance region, to force a rejection. Let MR be the number of observations that must be corrupted to force a rejection; then MR is a random variable. The expected rejection breakdown is defined to be

$$\text{Exp}_n^*(\text{reject}) = E_{H_0}[M_R \mid M_R > 0]/n. \qquad (1.6.8)$$

Note that we condition on MR > 0, since MR = 0 is equivalent to a rejection. It is left as Exercise 1.12.14 to show that the expected breakdown can be computed with unconditional expectation as

$$\text{Exp}_n^*(\text{reject}) = E_{H_0}[M_R]/[n(1 - \alpha)]. \qquad (1.6.9)$$

In the following example we illustrate this computation on the sign test and show how it compares to the worst-case breakdown introduced earlier.


Table 1.6.1: Rejection breakdown values for size α = .05 tests.

   n    Sign     t
  10     .71    .27
  13     .70    .21
  18     .67    .15
  30     .63    .09
 100     .58    .03
   ∞     .50    0

Table 1.6.2: Comparison of expected breakdown and worst-case breakdown for the size α = .05 sign test.

   n    Expn∗(reject)   ǫn∗(reject)
  10        .27             .71
  13        .24             .70
  18        .20             .67
  30        .16             .63
 100        .08             .58
   ∞         0              .50

Example 1.6.4. Expected Rejection Breakdown of the Sign Test. Refer to Example 1.6.3. The one-sided sign test rejects when $\sum I(X_i > 0) \ge n/2 + (n^{1/2}z_\alpha)/2$. Hence, given that we fail to reject the null hypothesis, we need to change (corrupt) $n/2 + (n^{1/2}z_\alpha)/2 - \sum I(X_i > 0)$ negative observations into positive ones. This is precisely MR, and $E[M_R] = (n^{1/2}z_\alpha)/2$. It follows that

$$\text{Exp}_n^*(\text{reject}) = \frac{z_\alpha}{2n^{1/2}(1 - \alpha)} \to 0$$

as n → ∞, rather than the .5 which arises in the worst-case breakdown. Table 1.6.2 compares the two types of rejection breakdown. This simple calculation clearly shows that even highly resistant tests such as the sign test may break down quite easily, contrary to what the worst-case breakdown analysis would suggest. For additional reading on test breakdown, see Coakley and Hettmansperger (1992); He, Simpson and Portnoy (1990) discuss asymptotic test breakdown.
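The breakdown formulas above are simple to evaluate; the following sketch reproduces the t column of Table 1.6.1 and, to close approximation, the expected breakdowns of Table 1.6.2 (the sign column of Table 1.6.1 rests on exact binomial critical values, so the asymptotic formula below only approximates it):

## Rejection breakdowns for the size alpha = .05 sign and t tests.
alpha <- 0.05; za <- qnorm(1 - alpha)
n  <- c(10, 13, 18, 30, 100)
ta <- qt(1 - alpha, n - 1)
t.eps    <- ta^2 / (n - 1 + ta^2)             # t test, worst case
sign.eps <- 0.5 + za / (2 * sqrt(n))          # sign test, asymptotic approx.
sign.exp <- za / (2 * sqrt(n) * (1 - alpha))  # sign test, expected breakdown
round(cbind(n, sign.eps, t.eps, sign.exp), 2)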

1.7 Inference and the Wilcoxon Signed-Rank Norm

In this section we develop the statistical properties for the procedures based on the Wilcoxon signed-rank norm, ( 1.3.17), that was defined in Example 1.3.3 of Section 1.3. Recall that the norm and its associated gradient function are given in expressions ( 1.3.17) and ( 1.3.24), respectively. Recall for a sample X1 , . . . , Xn that the estimate of θ is the median of the


Walsh averages given by (1.3.25). As in Section 1.3, our hypotheses of interest are

$$H_0 : \theta = 0 \text{ versus } H_A : \theta \ne 0. \qquad (1.7.1)$$

The level α test associated with the signed-rank norm is:

Reject H0 in favor of HA if |T(0)| ≥ c, (1.7.2)

where c is such that P0[|T(0)| ≥ c] = α. To complete the test we need to determine the null distribution of T(0), which is given by Theorems 1.7.1 and 1.7.2. In order to develop the statistical properties, in addition to (1.2.1), we assume that

h(x) is symmetrically distributed about θ. (1.7.3)

We refer to this as the symmetric location model. Under symmetry, by Theorem 1.2.1, T(H) = θ for all location functionals T.

1.7.1 Null Distribution Theory of T(0)

In addition to expression (1.3.24), a third representation of T(0) will be helpful in establishing its null distribution. Recall the definition of the anti-ranks, D1, . . . , Dn, given in expression (1.3.19). Using these anti-ranks, we can write

$$T(0) = \sum R(|X_i|)\operatorname{sgn}(X_i) = \sum j\,\operatorname{sgn}(X_{D_j}) = \sum j\,W_j,$$

where $W_j = \operatorname{sgn}(X_{D_j})$.

Lemma 1.7.1. Under H0, |X1|, . . . , |Xn| are independent of sgn(X1), . . . , sgn(Xn).

Proof: Since X1, . . . , Xn is a random sample from H(x), it suffices to show that P[|Xi| ≤ x, sgn(Xi) = 1] = P[|Xi| ≤ x] P[sgn(Xi) = 1]. But due to H0 and the symmetry of h(x), this follows from the following string of equalities:

$$P[|X_i| \le x, \operatorname{sgn}(X_i) = 1] = P[0 < X_i \le x] = H(x) - \frac{1}{2} = \frac{1}{2}[2H(x) - 1] = P[|X_i| \le x]\,P[\operatorname{sgn}(X_i) = 1].$$

Based on this lemma, the vector of ranks and, hence, the vector of antiranks (D1 , . . . , Dn ), are independent of the vector (sgn(X1 ), . . . , sgn(Xn )). Based on these facts, we can obtain the distribution of (W1 , . . . , Wn ), which we summarize in the following lemma; see Exercise 1.12.15 for its proof. Lemma 1.7.2. Under H0 and the symmetry of h(x), W1 , . . . , Wn are iid random variables with P [Wi = 1] = P [Wi = −1] = 1/2.


We can now easily derive the null distribution theory of T(0), which we summarize in the following theorems. Details are given in Exercise 1.12.16.

Theorem 1.7.1. Under H0 and the symmetry of h(x),

T(0) is distribution free and its distribution is symmetric; (1.7.4)

$$E_0[T(0)] = 0; \qquad (1.7.5)$$

$$\operatorname{Var}_0(T(0)) = \frac{n(n + 1)(2n + 1)}{6}; \qquad (1.7.6)$$

$$\frac{T(0)}{\sqrt{\operatorname{Var}_0(T(0))}} \text{ has an asymptotically } N(0, 1) \text{ distribution.} \qquad (1.7.7)$$

The exact distribution of T(0) cannot be found in closed form. We do, however, have the following recursion formula; see Exercise 1.12.17.

Theorem 1.7.2. Consider the version of the signed-rank test statistic given by T+, (1.3.28). Let pn(k) = P[T+ = k] for k = 0, . . . , n(n + 1)/2. Then

$$p_n(k) = \frac{1}{2}[p_{n-1}(k) + p_{n-1}(k - n)], \qquad (1.7.8)$$

where p0(0) = 1, p0(k) = 0 for k ≠ 0, and pn(k) = 0 for k < 0.

Using this formula, algorithms can be developed which obtain the null distribution of the signed-rank test statistic. The moment generating function can also be inverted to find the null distribution; see Hettmansperger (1984a, Section 2.2). As discussed in Section 1.3.1, software is now available which computes critical values and p-values of the null distribution.

Theorem 1.7.1 justifies the confidence interval for θ given in display (1.3.30); i.e., the (1 − α)100% confidence interval given by $[W_{(k+1)}, W_{(n(n+1)/2 - k)})$, where W(i) denotes the ith ordered Walsh average and P(T+(0) ≤ k) = α/2. Based on (1.7.7), k can be approximated as $k \approx n(n+1)/4 - .5 - z_{\alpha/2}[n(n+1)(2n+1)/24]^{1/2}$. As noted in Section 1.3.1, the computation of the estimate and confidence interval can be obtained by our R function onesampwil or the R intrinsic function wilcox.test.
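The recursion (1.7.8) is easy to implement; the following sketch builds the entire null distribution of T+ and checks it against R's built-in dsignrank:

## Null distribution of T+ via the recursion (1.7.8):
## p_n(k) = (1/2)[p_{n-1}(k) + p_{n-1}(k - n)], with p_0(0) = 1.
psr <- function(n) {
  p <- 1                                 # p_0: all mass at k = 0
  for (m in 1:n) {
    p <- 0.5 * (c(p, rep(0, m)) +        # p_{m-1}(k), support 0..m(m+1)/2
                c(rep(0, m), p))         # p_{m-1}(k - m), shifted by m
  }
  p                                      # p[k + 1] = P(T+ = k)
}
all.equal(psr(10), dsignrank(0:55, 10))  # TRUE: agrees with dsignrank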

1.7.2 Statistical Properties

From our earlier analysis of the statistical properties of the L1 and L2 methods we see that Pitman Regularity is crucial. In particular, we need to compute the Pitman efficacy which determines the asymptotic variance of the estimate, the asymptotic local power of the test, and the asymptotic length of the confidence interval. In the following theorem we show that the weighted L1 gradient function is Pitman Regular and determine the efficacy. Then we make some preliminary efficiency comparisons with the L1 and L2 methods.


Theorem 1.7.3. Suppose that h is symmetric and that $\int h^2(x)\,dx < \infty$. Let

$$\bar T(\theta) = \frac{2}{n(n + 1)}\sum_{i \le j}\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right).$$

Then the conditions of Definition 1.5.3 are satisfied and, thus, T̄(θ) is Pitman Regular. Moreover, the Pitman efficacy is given by

$$c = \sqrt{12}\int_{-\infty}^{\infty}h^2(x)\,dx. \qquad (1.7.9)$$

Proof. Since we have the L1 norm applied to the Walsh averages, the estimating function is a nonincreasing step function with steps at the Walsh averages. Hence, (1.5.7) holds. Next note that h(x) = h(−x) and, hence,

$$\mu(\theta) = E_\theta \bar T(0) = \frac{2}{n + 1}E_\theta\operatorname{sgn}(X_1) + \frac{n - 1}{n + 1}E_\theta\operatorname{sgn}\!\left(\frac{X_1 + X_2}{2}\right).$$

Now

$$E_\theta\operatorname{sgn}(X_1) = \int\operatorname{sgn}(x + \theta)h(x)\,dx = 1 - 2H(-\theta),$$

and

$$E_\theta\operatorname{sgn}\big((X_1 + X_2)/2\big) = \int\!\!\int\operatorname{sgn}[(x + y)/2 + \theta]\,h(x)h(y)\,dx\,dy = \int[1 - 2H(-2\theta - y)]h(y)\,dy.$$

Differentiate with respect to θ and set θ = 0 to get

$$\mu'(0) = \frac{2}{n + 1}2h(0) + \frac{4(n - 1)}{n + 1}\int_{-\infty}^{\infty}h^2(y)\,dy \to 4\int_{-\infty}^{\infty}h^2(y)\,dy.$$

The finiteness of the integral is sufficient to ensure that the derivative can be passed through the integral; see Hodges and Lehmann (1961) or Olshen (1967). Hence, (1.5.8) also holds. We next establish condition (1.5.9). Since

$$\bar T(\theta) = \frac{2}{n(n + 1)}\sum_{i=1}^{n}\operatorname{sgn}(X_i - \theta) + \frac{2}{n(n + 1)}\sum_{i<j}\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - \theta\right),$$

for b > 0, let

$$V^* = \frac{2}{n(n + 1)}\sum_{i<j}\left[\operatorname{sgn}\!\left(\frac{X_i + X_j}{2} - n^{-1/2}b\right) - \operatorname{sgn}\!\left(\frac{X_i + X_j}{2}\right)\right]$$

for all ǫ > 0 and all B > 0. Finally, the fourth condition, (1.5.10), concerns the asymptotic null distribution, which was discussed above. The null variance of $T_{\varphi^+}(0)/\sqrt n$ is given by expression (1.8.12). Therefore the process $T_{\varphi^+}(\theta)$ is Pitman Regular with efficacy given by

$$c_{\varphi^+} = \frac{\int_{-\infty}^{\infty}\varphi^{+\prime}(2H(x) - 1)\,h^2(x)\,dx}{\sqrt{\int_0^1(\varphi^+(u))^2\,du}} = \frac{\int_0^1\varphi^+(u)\varphi_h^+(u)\,du}{\sqrt{\int_0^1(\varphi^+(u))^2\,du}}. \qquad (1.8.21)$$

As our first result, we obtain the asymptotic power lemma for the process $T_{\varphi^+}(\theta)$. This, of course, follows immediately from Theorem 1.5.8, so we state it as a corollary.

Corollary 1.8.1. Under the symmetric location model,

$$P_{\theta_n}\!\left(\frac{T_{\varphi^+}(0)}{\sqrt n\,\sigma_{\varphi^+}} \ge z_\alpha\right) \to 1 - \Phi(z_\alpha - \theta^*c_{\varphi^+}), \qquad (1.8.22)$$

for the sequence of hypotheses

$$H_0 : \theta = 0 \text{ versus } H_{An} : \theta = \theta_n = \frac{\theta^*}{\sqrt n}, \text{ for } \theta^* > 0.$$

Based on Pitman Regularity, the asymptotic distribution of the estimate $\hat\theta_{\varphi^+}$ is

$$\sqrt n(\hat\theta_{\varphi^+} - \theta) \xrightarrow{D} N(0, \tau_{\varphi^+}^2), \qquad (1.8.23)$$

where the scale parameter $\tau_{\varphi^+}$ is defined by the reciprocal of (1.8.21),

$$\tau_{\varphi^+} = c_{\varphi^+}^{-1} = \frac{\sigma_{\varphi^+}}{\int_0^1\varphi^+(u)\varphi_h^+(u)\,du}. \qquad (1.8.24)$$


Using the general result of Theorem 1.5.9, the length of the confidence interval for θ, (1.8.16), can be used to obtain a consistent estimate of $\tau_{\varphi^+}$. This in turn can be used to obtain a consistent estimate of the standard error of $\hat\theta_{\varphi^+}$; see Exercise ??.

The asymptotic relative efficiency between two estimates or two tests based on score functions $\varphi_1^+(u)$ and $\varphi_2^+(u)$ is the ratio

$$e(\varphi_1^+, \varphi_2^+) = \frac{c_{\varphi_1^+}^2}{c_{\varphi_2^+}^2} = \frac{\tau_{\varphi_2^+}^2}{\tau_{\varphi_1^+}^2}. \qquad (1.8.25)$$

This can be used to compare different tests. For a specific distribution we can determine the optimal scores. Such a score should make the scale parameter $\tau_{\varphi^+}$ as small as possible. This scale parameter can be written as

$$c_{\varphi^+} = \tau_{\varphi^+}^{-1} = \left\{\frac{\int_0^1\varphi^+(u)\varphi_h^+(u)\,du}{\sigma_{\varphi^+}\sqrt{\int_0^1(\varphi_h^+(u))^2\,du}}\right\}\sqrt{\int_0^1(\varphi_h^+(u))^2\,du}. \qquad (1.8.26)$$

The quantity in brackets is a correlation coefficient; hence, to minimize the scale parameter $\tau_{\varphi^+}$, we need to maximize the correlation coefficient, which can be accomplished by selecting the optimal score function given by $\varphi^+(u) = \varphi_h^+(u)$, where $\varphi_h^+(u)$ is given by expression (1.8.18). The quantity $\sqrt{\int_0^1(\varphi_h^+(u))^2\,du}$ is the square root of Fisher information; see Exercise 1.12.23. Therefore, for this choice of scores the estimate $\hat\theta_{\varphi^+}$ is asymptotically efficient. This is the reason for calling the score function $\varphi_h^+$ the optimal score function.

It is shown in Exercise 1.12.24 that the optimal scores are the normal scores if h(x) is a normal density, the Wilcoxon weighted L1 scores if h(x) is a logistic density, and the L1 scores if h(x) is a double exponential density. It is further shown that the scores generated by (1.8.3) are optimal for symmetric densities with a logistic center and exponential tails.

From Exercise 1.12.24, the efficiency of the normal scores methods relative to the least squares methods is

$$e(NS, LS) = \left[\int_{-\infty}^{\infty}\frac{f^2(x)}{\phi(\Phi^{-1}(F(x)))}\,dx\right]^2, \qquad (1.8.27)$$

where F ∈ FS, the family of symmetric distributions with positive, finite Fisher information, and φ = Φ′ is the N(0, 1) pdf. We now prove a result similar to Theorem 1.7.4: the normal scores methods always have efficiency at least equal to one relative to the LS methods, and it equals 1 only at the normal distribution. The result was first proved by Chernoff and Savage (1958); however, the proof presented below is due to Gastwirth and Wolff (1968).


Theorem 1.8.1. Let X1, . . . , Xn be a random sample from F ∈ FS. Then

$$\inf_{F_S}e(NS, LS) = 1, \qquad (1.8.28)$$

and is only equal to 1 at the normal distribution.

Proof: If σf² = ∞ then e(NS, LS) > 1; hence, we suppose that σf² = 1. Let e = e(NS, LS). Then from (1.8.27) we can write

$$\sqrt e = E\left[\frac{f(X)}{\phi(\Phi^{-1}(F(X)))}\right] = E\left[\frac{1}{\phi(\Phi^{-1}(F(X)))/f(X)}\right].$$

Applying Jensen's inequality to the convex function h(x) = 1/x, we have

$$\sqrt e \ge \frac{1}{E[\phi(\Phi^{-1}(F(X)))/f(X)]}.$$

Hence,

$$\frac{1}{\sqrt e} \le E\left[\frac{\phi(\Phi^{-1}(F(X)))}{f(X)}\right] = \int\phi\big(\Phi^{-1}(F(x))\big)\,dx.$$

We now integrate by parts, using $u = \phi(\Phi^{-1}(F(x)))$ and $du = \phi'(\Phi^{-1}(F(x)))f(x)\,dx/\phi(\Phi^{-1}(F(x))) = -\Phi^{-1}(F(x))f(x)\,dx$, since φ′(x)/φ(x) = −x. Hence, with dv = dx, we have

$$\int_{-\infty}^{\infty}\phi\big(\Phi^{-1}(F(x))\big)\,dx = x\,\phi\big(\Phi^{-1}(F(x))\big)\Big|_{-\infty}^{\infty} + \int_{-\infty}^{\infty}x\,\Phi^{-1}(F(x))\,f(x)\,dx. \qquad (1.8.29)$$

Now transform $x\,\phi(\Phi^{-1}(F(x)))$ into $F^{-1}(\Phi(w))\phi(w)$ by first letting t = F(x) and then w = Φ−1(t). The integral $\int F^{-1}(\Phi(w))\phi(w)\,dw = \int x f(x)\,dx < \infty$; hence the limit of the integrand must be 0 as x → ±∞. This implies that the first term on the right side of (1.8.29) is 0. Hence, applying the Cauchy-Schwarz inequality,

$$\frac{1}{\sqrt e} \le \int_{-\infty}^{\infty}x\,\Phi^{-1}(F(x))f(x)\,dx = \int_{-\infty}^{\infty}x\sqrt{f(x)}\;\Phi^{-1}(F(x))\sqrt{f(x)}\,dx \le \left(\int_{-\infty}^{\infty}x^2f(x)\,dx\right)^{1/2}\left(\int_{-\infty}^{\infty}\big(\Phi^{-1}(F(x))\big)^2f(x)\,dx\right)^{1/2} = 1,$$


since $\int x^2f(x)\,dx = 1$ and $\int x^2\phi(x)\,dx = 1$. Hence $e^{1/2} \ge 1$ and e ≥ 1, which completes the proof.

It should be noted that the inequality is strict except at the normal distribution. Hence the normal scores are strictly more efficient than the LS procedures except at the normal model, where the asymptotic relative efficiency is 1.

The influence function for $\hat\theta_{\varphi^+}$ is derived in Section A.5 of the Appendix. It is given by

$$\Omega(t, \hat\theta_{\varphi^+}) = \frac{\varphi^+(2H(t) - 1)}{4\int_0^{\infty}\varphi^{+\prime}(2H(x) - 1)h^2(x)\,dx}. \qquad (1.8.30)$$

Note, also, that $E[\Omega^2(X, \hat\theta_{\varphi^+})] = \tau_{\varphi^+}^2$, as a check on the asymptotic distribution of $\hat\theta_{\varphi^+}$. The influence function is bounded provided the score function is bounded. Thus the estimates based on the scores discussed in the last paragraph are all robust, except for the normal scores. In the case of the normal scores, when H(t) = Φ(t), the influence function is Ω(t) = Φ−1(t); see Exercise 1.12.25.

The asymptotic breakdown of the estimate $\hat\theta_{\varphi^+}$ is the value ǫ∗ given by

$$\int_0^{1-\epsilon^*}\varphi^+(u)\,du = \frac{1}{2}\int_0^1\varphi^+(u)\,du. \qquad (1.8.31)$$

We provide a heuristic argument for (1.8.31); for a rigorous development see Huber (1981). Recall Definition 1.6.1. The idea is to corrupt enough data so that the estimating equation, (1.8.5), no longer has a solution. Suppose that [ǫn] observations are corrupted, where [·] denotes the greatest integer function. Push the corrupted observations out towards +∞, so that

$$\sum_{i=[(1-\epsilon)n]+1}^{n}a^+(R(|X_i - \theta|))\operatorname{sgn}(X_i - \theta) = \sum_{i=[(1-\epsilon)n]+1}^{n}a^+(i).$$

This restrains the estimating function from crossing the horizontal axis provided

$$-\sum_{i=1}^{[(1-\epsilon)n]}a^+(i) + \sum_{i=[(1-\epsilon)n]+1}^{n}a^+(i) > 0.$$

Replacing the sums by integrals in the limit yields

$$\int_{1-\epsilon}^{1}\varphi^+(u)\,du > \int_0^{1-\epsilon}\varphi^+(u)\,du.$$

Now use the fact that

$$\int_0^{1-\epsilon}\varphi^+(u)\,du + \int_{1-\epsilon}^{1}\varphi^+(u)\,du = \int_0^1\varphi^+(u)\,du,$$

and that we want the smallest possible ǫ, to get (1.8.31).

Table 1.8.1: Empirical AREs based on n = 30 and 10,000 simulations.

 Estimators   Normal   Contaminated Normal
 NS, LS        0.983         1.035
 Wil, LS       0.948         1.007
 NS, WIL       1.037         1.028

Example 1.8.1. Breakdowns of Estimates Based on Wilcoxon and Normal Scores. For $\hat\theta = \text{med}(X_i + X_j)/2$, φ+(u) = u, and it follows at once that $\epsilon^* = 1 - (1/\sqrt 2) \doteq .293$. For the estimate based on the normal scores, where φ+(u) is given by (1.8.4), expression (1.8.31) becomes

$$\exp\left\{-\frac{1}{2}\left[\Phi^{-1}\left(1 - \frac{\epsilon^*}{2}\right)\right]^2\right\} = \frac{1}{2},$$

and $\epsilon^* = 2(1 - \Phi(\sqrt{\log 4})) \doteq .239$. Hence we have the unusual situation that the estimate based on the normal scores has positive breakdown but an unbounded influence curve.

Example 1.8.2. Small Sample Empirical AREs of the Estimator Based on Normal Scores. As discussed above, the ARE between the normal scores estimator and the sample mean is 1 at the normal distribution. This is an asymptotic result. To answer the question about this efficiency at small samples, we conducted a small simulation study. We set the sample size at n = 30 and ran 10,000 simulations from a normal distribution. We also selected the contaminated normal distribution with ǫ = 0.01 and σc = 3, which is a very mildly contaminated distribution. We considered three estimators: the rank-based estimator based on normal scores (NS), the rank-based estimator based on Wilcoxon scores (Wil), and the sample mean (LS). We used the RBR command onesampr(x,score=phinscp,grad=spnsc,maktable=F) to compute the normal scores estimator; see Exercise 1.12.29. As our empirical ARE we used the ratios of empirical mean square errors of the three estimators. Table 1.8.1 summarizes the results. The empirical AREs for the NS and Wil estimators, at the normal, are close to their asymptotic counterparts. Note that the NS estimator results in a loss of less than 2% efficiency over LS. For this small amount of contamination, the NS estimator dominates the LS estimator; it also dominates the Wilcoxon estimator. In Exercise 1.12.29, the reader is asked to extend this study to other situations.

Example 1.8.3. Shoshoni Rectangles, Continued. The next display shows the normal scores analysis of the Shoshoni Rectangles data; see Example 1.4.2. We conducted the same analysis as we did for the sign test and the traditional t-test discussed in Example 1.4.2. Note that the call to the RBR function onesampr with the values score=phinscp,grad=spnsc computes the normal scores analysis.

> onesampr(x,theta0=.618,alpha=.10,score=phinscp,grad=spnsc)
Test of Theta = 0.618  Alternative selected is 0
Test Stat. Tphi+ is 7.809417  Standardized (z) Test-Stat. 1.870514 and p-value 0.061
Estimate 0.6485  SE is 0.02502799
90 % Confidence Interval is ( 0.61975 , 0.7 )
Estimate of the scale parameter tau 0.1119286

While not as sensitive to the outliers as the traditional analysis, the normal scores analysis was still somewhat influenced by them. The normal scores test rejects the null hypothesis at level 0.06, while the 90% confidence interval just misses the value 0.618.
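The breakdown equation (1.8.31) can also be solved numerically for any square-integrable score function; a sketch using uniroot, which recovers the two values derived in Example 1.8.1:

## Asymptotic breakdown eps* from (1.8.31) for a given score phi+.
bdown <- function(phip) {
  g <- function(eps) {
    integrate(phip, 0, 1 - eps)$value - 0.5 * integrate(phip, 0, 1)$value
  }
  uniroot(g, c(1e-6, 1 - 1e-6))$root
}
bdown(function(u) u)                   # Wilcoxon scores: 0.293
bdown(function(u) qnorm((u + 1) / 2))  # normal scores:   0.239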

1.9 Ranked Set Sampling

In this section we discuss an alternative to simple random sampling (SRS) called ranked set sampling (RSS). This method of data collection is useful when measurements are destructive or expensive while ranking of the data is relatively easy. Johnson, Nussbaum, Patil and Ross (1996) give an interesting application to environmental sampling. As a simple example, consider the problem of estimating the mean volume of trees in a forest. To measure the volume, we must destroy the tree. On the other hand, an expert may well be able to rank the trees by volume in a small sample. The idea is to take a sample of k trees and ask the expert to pick the one with the smallest volume. This tree is cut down and its volume measured, and the other k − 1 trees are returned to the population for possible future selection. Then a new sample of size k is taken and the expert identifies the second smallest, which is then cut down and measured. This is repeated until we have k measurements, having looked at k² trees. This ends cycle 1. The measurements are represented as x(1)1 ≤ . . . ≤ x(k)1, where the number in parentheses indicates an order statistic and the second number indicates the cycle. We repeat the process for n cycles to get nk measurements:

 x(1)1, . . . , x(1)n    iid    h(1)(t)
 x(2)1, . . . , x(2)n    iid    h(2)(t)
    ...                  ...       ...
 x(k)1, . . . , x(k)n    iid    h(k)(t)

It is important to note that all nk measurements are independent but are identically distributed only within each row. The density function h(j)(t) represents the pdf of the jth order statistic from a sample of size k and is given by

$$h_{(j)}(t) = \frac{k!}{(j - 1)!(k - j)!}H^{j-1}(t)[1 - H(t)]^{k-j}h(t).$$

We suppose the measurements are distributed as H(x) = F(x − θ), and we wish to make a statistical inference concerning θ, such as an estimate, test, or confidence interval. We illustrate the ideas on the L1 methods, since they are simple to work with. We also wish to


compute the efficiency of the RSS L1 methods relative to the SRS L1 methods. We will see that there is a substantial increase in efficiency when using the RSS design. In particular, we will compare the RSS methods to SRS methods based on a sample of size nk. The RSS method was first applied by McIntyre (1952) in measuring mean pasture yields. See Hettmansperger (1995) for a development of the RSS L1 methods. The most convenient form of the RSS sign statistic is the number of positive measurements, given by

$$S_{RSS}^+ = \sum_{j=1}^{k}\sum_{i=1}^{n}I(X_{(j)i} > 0). \qquad (1.9.1)$$

Now note that $S_{RSS}^+$ can be written as $S_{RSS}^+ = \sum_j S_{(j)}^+$, where $S_{(j)}^+ = \sum_i I(X_{(j)i} > 0)$ has a binomial distribution with parameters n and 1 − H(j)(0). Further, $S_{(j)}^+$, j = 1, . . . , k, are stochastically independent. It follows at once that

$$E\,S_{RSS}^+ = n\sum_{j=1}^{k}(1 - H_{(j)}(0)) \quad\text{and}\quad \operatorname{Var}S_{RSS}^+ = n\sum_{j=1}^{k}(1 - H_{(j)}(0))H_{(j)}(0). \qquad (1.9.2)$$

With k fixed and n → ∞, it follows from the independence of $S_{(j)}^+$, j = 1, . . . , k, that

$$(nk)^{-1/2}\Big\{S_{RSS}^+ - n\sum_{j=1}^{k}(1 - H_{(j)}(0))\Big\} \xrightarrow{D} Z \sim N(0, \xi^2), \qquad (1.9.3)$$

and the asymptotic variance is given by

$$\xi^2 = k^{-1}\sum_{j=1}^{k}[1 - H_{(j)}(0)]H_{(j)}(0) = \frac{1}{4} - k^{-1}\sum_{j=1}^{k}\Big(H_{(j)}(0) - \frac{1}{2}\Big)^2. \qquad (1.9.4)$$

It is convenient to introduce the parameter $\delta^2 = 1 - (4/k)\sum(H_{(j)}(0) - 1/2)^2$; then ξ² = δ²/4. The reader is asked to prove the second equality above in Exercise 1.12.26. Using the formulas for the pdfs of the order statistics, it is straightforward to verify that

$$h(t) = k^{-1}\sum_{j=1}^{k}h_{(j)}(t) \quad\text{and}\quad H(t) = k^{-1}\sum_{j=1}^{k}H_{(j)}(t).$$

We now consider testing H0 : θ = 0 versus HA : θ ≠ 0. The following theorem provides the mean and variance of the RSS sign statistic under the null hypothesis.

Theorem 1.9.1. Under the assumption that H0 : θ = 0 is true, F(0) = 1/2,

$$F_{(j)}(0) = \frac{k!}{(j - 1)!(k - j)!}\int_0^{1/2}u^{j-1}(1 - u)^{k-j}\,du$$

Table 1.9.1: Values of F(j)(0), j = 1, . . . , k, and δ² = 1 − (4/k)Σ(F(j)(0) − 1/2)².

  j\k    2     3     4     5     6     7     8     9    10
   1   .750  .875  .938  .969  .984  .992  .996  .998  .999
   2   .250  .500  .688  .813  .891  .938  .965  .981  .989
   3         .125  .313  .500  .656  .773  .856  .910  .945
   4               .063  .188  .344  .500  .637  .746  .828
   5                     .031  .109  .227  .363  .500  .623
   6                           .016  .063  .145  .254  .377
   7                                 .008  .035  .090  .172
   8                                       .004  .020  .055
   9                                             .002  .011
  10                                                   .001
  δ²   .750  .625  .547  .490  .451  .416  .393  .371  .352

and

$$E\,S_{RSS}^+ = nk/2 \quad\text{and}\quad \operatorname{Var}S_{RSS}^+ = nk\Big[1/4 - k^{-1}\sum(F_{(j)}(0) - 1/2)^2\Big].$$

Proof. Use the fact that $k^{-1}\sum F_{(j)}(0) = F(0) = 1/2$, and the expectation formula follows at once. Note that

$$F_{(j)}(0) = \frac{k!}{(j - 1)!(k - j)!}\int_{-\infty}^{0}F(t)^{j-1}(1 - F(t))^{k-j}f(t)\,dt,$$

and then make the change of variable u = F(t).

and then make the change of variable u = F (t). + The variance of SRSS does not depend on H, as expected; however, its computation requires the evaluation of the incomplete beta integral. Table 1.9.1 provides the values of F(j) (0),Punder H0 : θ = 0. The bottom line of the table provides the values of δ 2 = 1 − (4/k) (F(j) (0) − 1/2)2 , an important parameter in assessing the gain of RSS over SRS. + We will compare the SRS sign statistic SSRS based on a sample of nk to the RSS sign + + statistic SRSS . Note that the variance of S SRS is nk/4. Then the ratio of variances is P + + 2 VarSRSS /VarSSRS = δ = 1 − (4/k) (F(j) (0) − 1/2)2. The reduction in variance is given in the last row of Table 1.9.1 and can be quite large. We next show that the parameter δ is an integral part of the efficacy of the RSS L1 methods. It is straight forward using the methods of Section 1.5 and Example 1.5.2 to show that the RSS L1 estimating function is Pitman regular. To compute the efficacy we first note that S¯RSS = (nk)−1

k X n X j=1 i=1

+ sgn(X(j)i ) = (nk)−1 [2SRSS − nk] .

We then have at once that D

(nk)−1/2 S¯RSS →0 Z ∼ n(0, δ 2 ) ,

(1.9.5)


and µ′(0) = 2f(0); see Exercise 1.12.27. See Babu and Koti (1996) for a development of the exact distribution. Hence, the efficacy of the RSS L1 methods is given by

$$c_{RSS} = \frac{2f(0)}{\delta} = \frac{2f(0)}{\{1 - (4/k)\sum_{j=1}^{k}(F_{(j)}(0) - 1/2)^2\}^{1/2}}.$$

We now summarize the inference methods and their efficiency in the following:

1. The test. Reject H0 : θ = 0 in favor of HA : θ > 0 at significance level α if $S_{RSS}^+ \ge (nk/2) + z_\alpha\delta(nk/4)^{1/2}$, where, as usual, 1 − Φ(zα) = α.

2. The estimate. $(nk)^{1/2}\{\text{med}\,X_{(j)i} - \theta\} \xrightarrow{D} Z \sim N(0, \delta^2/4f^2(0))$.

3. The confidence interval. Let $X_{(1)}^*, \ldots, X_{(nk)}^*$ be the ordered values of X(j)i, j = 1, . . . , k and i = 1, . . . , n. Then $[X_{(m+1)}^*, X_{(nk-m)}^*]$ is a (1 − α)100% confidence interval for θ, where $P(S_{RSS}^+ \le m) = \alpha/2$. Using the normal approximation, $m \doteq (nk/2) - z_{\alpha/2}\delta(nk/4)^{1/2}$.

4. Efficiency. The efficiency of the RSS methods with respect to the SRS methods is given by $e(RSS, SRS) = c_{RSS}^2/c_{SRS}^2 = \delta^{-2}$. Hence, the reciprocal of the last line of Table 1.9.1 provides the efficiency values, and they can be quite substantial. Recall from the discussion following Definition 1.5.5 that efficiency can be interpreted as the ratio of sample sizes needed to achieve the same approximate variances, the same approximate local power, and the same confidence interval length. Hence, we write $(nk)_{RSS} \doteq \delta^2(nk)_{SRS}$.

This is really the point of the RSS design. Returning to the example of estimating the volume of wood in a forest, if we let k = 5, then from Table 1.9.1 we would need to destroy and measure only about one half as many trees using the RSS method rather than the SRS method.

As a final note, we mention the problem of assessing the effect of imperfect ranking. Suppose that the expert makes a mistake when asked to identify the jth ordered value in a set of k observations. As expected, there is less gain from using the RSS method. The interesting point is that if the expert simply identifies the supposed jth ordered value by random guess, then δ² = 1 and the two sign tests have the same information; see Hettmansperger (1995) for more detail.

1.10

Interpolated Confidence Intervals for the L1 Inference

When we construct L1 confidence intervals, we are limited in our choice of confidence coefficients because of the discreteness of the binomial distribution. The effect does not wear off

1.10. INTERPOLATED CONFIDENCE INTERVALS FOR THE L1 INFERENCE

57

very quickly as the sample size increases. For example with a sample of size 50, we can have either a 93.5% or a 96.7% confidence interval, and that is as close as we can come to 95%. In the following discussion we provide a method to interpolate between confidence intervals. The method is nonlinear and seems to be essentially distribution-free. We will begin by presenting and illustrating the method and then derive its properties. Suppose γ is the desired confidence coefficient. Further, suppose the following intervals are available from the binomial table: interval (x(k) , x(n−k+1) ) with confidence coefficient γk and interval (x(k+1) , x(n−k) ) with confidence coefficient γk+1 where γk+1 ≤ γ ≤ γk . Then the interpolated interval is [θbL , θbU ], where

θbL = (1 − λ)x(k) + λx(k+1) and θbU = (1 − λ)x(n−k+1) + λx(n−k) λ=

γk − γ (n − k)I and I = . k + (n − 2k)I γk − γk+1

(1.10.1)

(1.10.2)

We call I the interpolation factor and note that if we were using linear interpolation then λ = I. Hence, we see that the interpolation is distinctly nonlinear. As a simple example we take n = 10 and ask for a 95% confidence interval. For k = 2 we find γk = .9786 and γk+1 = .8907. Then I = .325 and λ = .685. Hence, θbL = .342x(2) + .658x(3) and θbU = .342x(9) + .658x(8) . Note that linear interpolation is almost the reverse of the recommended mixtures, namely λ = I = .325 and this can make a substantial difference in small samples. The method is based on the following theorem. This theorem highlights the nonlinear relationship between the interpolation factor and λ. After proving the theorem we will need to develop an approximate solution and then show that it works in practice. Theorem 1.10.1. The interpolation factor I is given by γk − γ I= = 1 − (n − k)2n γk − γk+1

Z



F

k

0



 −λ y (1 − F (y))n−k−1f (y)dy 1−λ

Proof. Without loss of generality we will assume that θ is 0. Then we can write: γk = P0 (xk ≤ 0 ≤ xn−k+1 ) = P0 (k − 1 < S1+ (0) < n − k − 1) and γk+1 = P0 (xk+1 ≤ 0 ≤ xn−k ) = P0 (k < S1+ (0) < n − k) .  Taking the difference, we have, using nk to denote the binomial coefficient, γk − γk+1 =

P0 (S1+ (0)

= k) +

P0 (S1+ (0)

  n (1/2)n−1 . = n − k) = k

(1.10.3)

58

CHAPTER 1. ONE SAMPLE PROBLEMS

We now consider the lower tail probability associated with the confidence interval. First consider Z ∞ n! 1 − γk+1 = F k (t)(1 − F (t))n−k−1 dF (t)(1.10.4) P0 (Xk+1 > 0) = 2 k!(n − k − 1)! 0 + = P0 (S1 (0) ≥ n − k) = P0 (S1+ (0) ≤ k) . We next consider the lower end of the interpolated interval 1−γ = P0 ((1 − γ)Xk + λXk+1 > 0) 2 Z ∞Z y n! F k−1(x)(1 − F (y))n−k−1f (x)f (y)dxdy = −λ (k − 1)!(n − k − 1)! y 0    Z ∞ 1−λ −λy n! 1 k k = F (y) − F (1 − F (y))n−k−1f (y)dy (k − 1)!(n − k − 1)! k 1 − λ 0   Z ∞ −λy n! 1 − γk+1 k (1 − F (y))n−k−1f (y)dy (1.10.5) − F = 2 k!(n − k − 1)! 1 − λ 0 Use ( 1.10.4) in the last line above. Now with ( 1.10.3), substitute into the formula for the interpolation factor and the result follows. Clearly, not only is the relationship between I and λ nonlinear but it also depends on the underlying distribution F . Hence, the interpolated interval is not distribution free. There is one interesting case in which we have a distribution free interval given in the following corollary. Corollary 1.10.1. Suppose F is the cdf of a symmetric distribution. Then I(1/2) = k/n, where we write I(λ) to denote the dependence of the interpolation factor on λ. This shows that when we sample from a symmetric distribution, the interval that lies half between the available intervals does not depend on the underlying distribution. Other interpolated intervals are not distribution free. Our next theorem shows how to approximate the solution and the solution is essentially distribution free. We show by example that the approximate solution works in many cases. Theorem 1.10.2.

. I(λ) = λk/(λ(2k − n) + n − k)

Proof. We consider the integral   Z ∞ −λ k F y (1 − F (y))n−k−1f (y)dy 1−λ 0 The integrand decreases rapidly for moderate powers; hence, we expand the integrand around y = 0. First take logarithms then   −λ λ f (0) k log F y = k log F (0) − k y + o(y) 1−λ 1 − λ F (0)

1.10. INTERPOLATED CONFIDENCE INTERVALS FOR THE L1 INFERENCE

59

Table 1.10.1: Confidence Coefficients for Interpolated Confidence Intervals in Example 1.10.1. DE(Approx)=Double Exponential and the Approximation in Theorem 1.10.2, U=Uniform, N=Normal, C=Cauchy, Linear=Linear Interpolation λ DE(Approx) U N C Linear 0.1 0.976 0.977 0.976 0.976 0.970 0.2 0.973 0.974 0.974 0.974 0.961 0.3 0.970 0.971 0.971 0.970 0.952 0.4 0.966 0.967 0.966 0.966 0.943 0.5 0.961 0.961 0.961 0.961 0.935 0.6 0.955 0.954 0.954 0.954 0.926 0.7 0.946 0.944 0.944 0.946 0.917 0.8 0.935 0.930 0.931 0.934 0.908 0.9 0.918 0.912 0.914 0.918 0.899 and (n − k − 1) log(1 − F (y)) = (n − k − 1) log(1 − F (0)) − (n − k − 1)

f (0) y + o(y) . 1 − F (0)

Substitute r = λk/(1 − λ) and F (0) = 1 − F (0) = 1/2 into the above equations, and add the two equations together. Add and subtract r log(1/2), and group terms so the right side of the second equation appears on the right side along with k log(1/2) − r log(1/2). Hence, we have   −λ k log F y + (n − k − 1) log(1 − F (y)) = k log(1/2) − r log(1/2) 1−λ +(n − r − k − 1) log(1 − F (y)) + o(y) , and, hence,   Z ∞ Z ∞ −λ . k n−k−1 F y (1 − F (y)) f (y)dy = 2−(k−r) (1 − F (y))n+r−k−1f (y)dy 1 − λ 0 0 1 . (1.10.6) = n 2 (n + r − k) Substitute this approximation into the formula for I(λ), use r = λk/(1 − λ) and the result follows. Note that the approximation agrees with Corollary 1.10.1. In addition Exercise 1.12.28 shows that the approximation formula is exact for the double exponential (Laplace) distribution. In Table 1.10.1 we show how well the approximation works for several other distributions. The exact results were obtained by numerical integration of the integral in Theorem 1.10.1. Similar close results were found for asymmetric examples. For further reading see Hettmansperger and Sheather (1986) and Nyblom (1992).

60

CHAPTER 1. ONE SAMPLE PROBLEMS

Example 1.10.1. Cushney-Peebles Example 1.4.1, continued. We now return to this example using it to illustrate the sign test and the L1 interpolated confidence interval. We use the RBR function interpci for the computations. We take as our location model: X1 , . . . , X10 iid from H(x) = F (x − θ), F and θ both unknown, along with the L1 norm. We have already seen that the estimate of θ is the sample median equal to 1.3. Besides obtaining an interpolated 95% confidence interval, we test H0 : θ = 0 versus HA : θ 6= 0. Assuming that the sample is in the vector x, the output for a test and a 95% interpolated confidence interval is:

> tm=interpci(.05,x) Estimation of Median Sample Median is 1.3 Confidence Interval ( 1 , 1.8 ) 89.0625 % Confidence Interval ( 0.9315 , 2.0054 ) 95 % Interpolted Confidence Interval ( 0.8 , 2.4 ) 97.8516 % Results for the Sign Test Test of theta = 0 versus theta not equal to Test stat. S is 9 p-vlaue 0.00390625

0

Note the p-value of the test is .0039 and we would easily reject the null hypothesis at any reasonable level of significance. The interpolated 95% confidence interval for θ shows the reasonable set of values of θ to be between .9315 and 2.0054, given the level of confidence.

1.11

Two Sample Analysis

We now propose a simple way to extend our one sample methods to the comparison of two samples. Suppose X1 , . . . , Xm are iid F (x − θx ) and Y1 , . . . , Yn are iid F (y − θy ) and the two samples are independent. Let ∆ = θy − θx and we wish to test the null hypothesis H0 : ∆ = 0 versus the alternative hypothesis Ha : ∆ 6= 0. Without loss of generality we can consider θx = 0 so that the X sample is from a distribution with cdf F (x) and the Y sample is from a distribution with cdf F (y − ∆). The hypothesis testing rule that we propose is: 1. Construct L1 confidence intervals [XL , XU ] and [YL , YU ]. 2. Reject H0 if the intervals are disjoint.

1.11. TWO SAMPLE ANALYSIS

61

If we consider the confidence interval as a set of reasonable values for the parameter, given the confidence coefficient, then we reject the null hypothesis when the respective reasonable values are disjoint. We must determine the significance level for the test. In particular, for given γx and γy , what is the value of αc , the significance level for the comparison? Perhaps more pertinent: Given αc , what values should we choose for γx and γy ? Below we show that for a broad range of sample sizes, Comparing two 84% CI’s yields a 5% test of H0 : ∆ = 0 versus HA : ∆ 6= 0,

(1.11.1)

where CI denotes confidence interval. In the following theorem we provide the relationship between αc and the pair γx , γy . Define zx by γx = 2Φ(zx ) − 1 and likewise zy by γy = 2Φ(zy ) − 1. Theorem 1.11.1. Suppose m, n → ∞ so that m/N → λ, 0 < λ < 1, N = m + n. Then under the null hypothesis H0 : ∆ = 0, αc = P (XL > YU ) + P (YL > XU ) → 2Φ[−(1 − λ)1/2 zx − λ1/2 zy ] Proof. We will consider αc /2 = P (XL > YU ). From ( 1.5.22) we have zx zy . Sx (0) . Sy (0) XL = − 1/2 and YU = + 1/2 . m2f (0) m 2f (0) m2f (0) n 2f (0) Since m/N → λ D

N 1/2 XL → λ−1/2 Z1 , Z1 ∼ n(−zx /2f (0), 1/4f 2(0)) , and

D

N 1/2 YU → (1 − λ)−1/2 Z2 , Z2 ∼ n(−zy /2f (0), 1/4f 2(0)) .

Now αc /2 = P (XL > YU ) = P (N 1/2 (YU − XL ) < 0) and XL , YU are independent, hence D

N 1/2 (YU − XL ) → λ−1/2 Z1 − (1 − λ)−1/2 Z , and λ

−1/2

Z1 − (1 − λ)

−1/2

Z2 ∼ n



1 2f (0)



zx zy + (1 − λ)1/2 λ1/2



1 , 2 4f (0)



1 1 + λ 1−λ



It then follows that P (N 1/2 (YU − XL ) < 0) → Φ −



zx zy + 1/2 1/2 (1 − λ) λ

  /

1 λ(1 − λ)

Which, when simplified, yields the result in the statement of the theorem.

1/2 !

.

.

62

CHAPTER 1. ONE SAMPLE PROBLEMS Table 1.11.1: Confidence Coefficients for 5% Comparison. λ = m/N m/n zx = zy γx = γy

.500 .550 .600 .650 .750 1.00 1.22 1.50 1.86 3.00 1.39 1.39 1.39 1.40 1.43 .84 .84 .84 .85 .86

To illustrate, we take equal sample sizes so that λ = 1/2 and we take zx = zy = 2. Then we have two 95% confidence intervals and we will reject the null hypothesis H0 : ∆ = 0 if the two intervals are disjoint. The above theorem says that the significance level is approximately equal to αc = 2Φ(−2.83) = .0046. This is a very small level and it will be difficult to reject the null hypothesis. We might prefer a significance level of say αc = .05. We then must find zx and zy so that .05 = 2Φ(−(.5)1/2 (zx + zy )). Note that now we have an infinite number of solutions. If we impose the reasonable condition that the two confidence coefficients are the same then we require that zx = zy = z. Then we have the equation .025 = Φ(−(2)1/2 z) and hence −2 = −(2)1/2 z. So z = 21/2 = 1.39 and the confidence coefficient for the two intervals is γ = γx = γy = 2Φ(1.41) − 1 = .84. Hence, if we have equal sample sizes and we use two 84% confidence intervals then we have a 5% two sided comparison of the two samples. If we set αc = .10, this would correspond to a 5% one sided test. This means that we compare the two confidence intervals in the direction specified by the alternative hypothesis. For example, if we specify ∆ = θy − θx > 0, then we would reject the null hypothesis if the X-interval is completely below the Y -interval. To determine which confidence intervals we again assume that the two intervals will have the same confidence coefficient. Then we must find z such that .05 = Φ(−(2)1/2 z) and this leads to −1.645 = −(2)1/2 z and z = 1.16. Hence, the confidence coefficient for the two intervals is γ = γx = γy = 2Φ(1.16) − 1 = .75. Hence, for a one-sided 5% test or a 10% two-sided test, when you have equal sample sizes, use two 75% confidence intervals. We must now consider what to do if the sample sizes are not equal. Let zc be determined by αc /2 = Φ(−zc ), then, again if we use the same confidence coefficient for the two intervals, z = zx = zy = zc /(λ1/2 + (1 − λ)1/2 ). When m = n so that λ = 1 − λ = .5 we had z = zc /21/2 = .707zc and so z = 1.39 when αc = .05. We now show by example that when αc = .05, z is not sensitive to the value of λ. Table 1.11.1 gives the relevant information. Hence, if we use 84% confidence intervals, then the significance level will be roughly 5% for the comparison for a broad range of ratios of sample sizes. Likewise, we would use 75% intervals for a 10% comparison. See Hettmansperger (1984b) for additional discussion. Next suppose that we want a confidence interval for ∆ = θy − θx . In the following simple theorem we show that the proposed test based on comparing two confidence intervals is equivalent to checking to see if zero is contained in a different confidence interval. This new interval will be a confidence interval for ∆. Theorem 1.11.2. [XL , XU ] and [YL, YU ] are disjoint if and only if 0 is not contained in

1.11. TWO SAMPLE ANALYSIS

63

[YL − XU , YU − XL ]. If we specify our significance level to be αc then we have immediately that 1 − αc = P∆ (YL − XU ≤ ∆ ≤ YU − XL ) and [YL − XU , YU − XL ] is a γc = 1 − αc confidence interval for ∆. This theorem simply points out that the hypothesis test can be equivalently based on a single confidence interval. Hence, two 84% intervals produce a roughly 95% confidence interval for ∆. The confidence interval is easy to construct since we need only find the least and greatest differences of the end points between the respective Y and X intervals. Recall that one way to measure the efficiency of a confidence interval is to find its asymptotic length. This is directly related to the Pitman efficacy of the procedure; see Section 1.5.5. This would seem to be the most natural way to study the efficiency of the test based on confidence intervals. In the following theorem we determine the asymptotic length of the interval for ∆. Theorem 1.11.3. Suppose m, n → ∞ in such a way that m/N → λ, 0 < λ < 1, N = m + n. Further suppose that γc = 2Φ(zc ) − 1. Let Λ be the length of [YL − XU , YU − XL ]. Then N 1/2 Λ 1 → 2zc [λ(1 − λ)]1/2 ]2f (0) Proof. First note that Λ = Λx + Λy , the sum of the two lengths of the X and Y intervals, respectively. Further, N 1/2 N 1/2 N 1/2 Λ = 1/2 n1/2 Λy + = 1/2 m1/2 Λx . n m But by Theorem 1.5.9 this converges in probability to zx /λ1/2 + zy /(1 − λ)1/2 . Now note that (1 − λ)1/2 zx + λ1/2 zy = zc and the result follows. The interesting point about this theorem is that the efficiency of the interval does not depend on how zx and zy are chosen so long as they satisfy (1 − λ)1/2 zx + λ1/2 zy = zc . In addition, this interval has inherited the efficacy of the L1 interval in the one sample location model. We will discuss the two-sample location model in detail in the next chapter. In Hettmansperger (1984b) other choices for zx and zy are discussed; for example, we could choose zx and zy so that the asymptotic standardized lengths are equal. The corresponding confidence coefficients for this choice are more sensitive to unequal sample sizes than the method proposed here. Example 1.11.1. Hendy and Charles Coin Data Hendy and Charles (1970) study the change in silver content in Byzantine coins. During the reign of Manuel I (1143-1180) there were several mintings. We consider the research hypothesis that the silver content changed from the first to the fourth coinage. The data consists in 9 coins identified from the first coinage and 7 coins from the fourth. We suppose

64

CHAPTER 1. ONE SAMPLE PROBLEMS Table 1.11.2: Silver Percentage in Two Mintings First Fourth

5.9 6.8 6.4 5.3 5.6 5.5

7.0 6.6 7.7 7.2 5.1 6.2 5.8 5.8

6.9 6.2

that they are realizations of random samples of coins from the two populations. The percentage of silver in each coin is given in Table 1.11. Let ∆ = θ1 − θ4 where the 1 and 4 indicate the coinage. To test the null hypothesis H0 : ∆ = 0 versus HA : ∆ 6= 0 at α = .05, we construct two 84% L1 confidence intervals and reject the null hypothesis if they are disjoint. The confidence intervals can be computed by using the RBR function onesampsgn with the value alph=.16. Results pertinent to the confidence intervals are: > onesampsgn(First,alpha=.16) Estimate 6.8 SE is 0.2135123 84 % Confidence Interval is ( 6.4 , 7 ) Estimate of the scale parameter tau 0.6405368 > onesampsgn(Fourth,alpha=.16) Estimate 5.6 SE is 0.1779269 84 % Confidence Interval is ( 5.3 , 5.8 ) Estimate of the scale parameter tau 0.4707503 Clearly, the 84% confidence intervals are disjoint, hence, we reject the null hypothesis at a 5% significance level and claim that the emperor apparently held back a little on the fourth coinage. A 95% confidence interval for ∆ = θ1 − θ4 is found by taking the differences in the ends of the confidence intervals: (6.4 − 5.8, 7.0 − 5.3) = (0.6, 1.7). Hence, this analysis suggests that the difference in median percentages is someplace between .6% and 1.7%, with a point estimate of 6.8 − 5.6 = 1.2%. Figure 1.11.1 provides a comparison boxplot of the data for the first and fourth coinages. Marking the 84% confidence intervals on the plot, we can see the relatively large gap between the confidence intervals, i.e., the sharp reduction in silver content from the first to fourth coinage. In addition, the box for the fourth coinage is a bit more narrow than the box for the first coinage indicating that there may be less variation (as measured by the interquartile range) in the fourth coinage. There are no apparent outliers as indicated by the whiskers on the boxplot. Larson and Stroup (1976) analyze this example with a two sample t-test.

1.12. EXERCISES

65

6.5 5.0

5.5

6.0

Percentage of silver

7.0

7.5

Figure 1.11.1: Comparison Boxplots of the Hendy and Charles Coin Data

First

1.12

Fourth

Exercises

1.12.1. Show that if k · k is a norm, then there always exists a value of θ which minimizes kx − θ1k for any x1 , . . . , xn . 1.12.2. Figure 1.12.1 displays the graph of Z(θ) versus θ for n = 20 data points (count the steps) where 20 1 X Z(θ) = √ sign(Xi − θ), n i=1 i.e., the standardized sign (median) process. (a) From the plot, what are the minimum and maximum values of the sample? (b) From the plot, what is the associated point estimate of θ?

66

CHAPTER 1. ONE SAMPLE PROBLEMS

(c) From the plot, determine a 95% confidence interval for θ, (approximate, but show on the graph). (d) From the plot, determine the value of the test statistic and the associated p-value for testing H0 : θ = 0 versus HA : θ > 0.

0 −5

Z(theta)

5

Plot of Z(theta) versus theta

−1

0

1

2

3

theta

Figure 1.12.1: The Graph of Z(θ) versus θ 1.12.3. Show D(θ), ( 1.3.3), is convex and continuous as a function of θ. Further, argue that D(θ) is differentiable almost everywhere. Let S(θ) be a function such that S(θ) = −D ′ (θ) where the derivative exists. Then show that S(θ) is a nonincreasing function. √ √ 2 b 1.12.4. √ Consider the L2 norm. Show that θ = x¯ and that S2 (0) = nt/ n − 1 + t where t = n¯ x/s, and s is the sample standard deviation. Further, show S2 (0) is an increasing function of t so the test based on t is equivalent to S2 (0). 1.12.5. Discuss the consistency of the t-test. Is the t-test resolving? 1.12.6. Discuss the Pitman regularity in the L2 case. 1.12.7. The following R function computes a bootstrap distribution of the sample median. bootmed = function(x,nb){ # Sample is in x and nb is the number of bootstraps n = length(x) bootmed = rep(0,nb) for(i in 1:nb){

1.12. EXERCISES

67

y = sample(x,size=n,replace=T) bootmed[i] = median(y) } bootmed } (a). Use this code to obtain 1000 bootstraped medians for the Shoshoni data of Example 1.4.2. Determine the standard error of this bootstrap sample of medians and compare it with the estimate based on the length of the confidence interval for the Shoshoni data. (b). Now find the mean and variance of the Shoshoni data. Use these estimates to perform a parametric bootstrap of the sample median, as discussed in Example ??. Determine the standard error of this parametric bootstrap sample of medians and compare it with estimates in Part (a). 1.12.8. Using languages such as Minitab or R, obtain a plot of the test sensitivity curves based on the signed-rank Wilcoxon statistic for the Cushney-Peebles Data, Example 1.4.1, similar to the sensitivity curves based on the t test and the sign test as shown in Figure 1.4.1. 1.12.9. In the proof of Theorem 1.5.6, show that ( 1.5.19) and ( 1.5.20) imply that Un (b) converges to −µ′ (0) in probability, pointwise in b, i.e., Un (b) = −µ′ (0) + op (1). 1.12.10. Suppose we are sampling fron the distribution with pdf f (x) =

3 1 exp{−|x|3/2 }, 4 Γ(2/3)

−∞ < x < ∞

and we are considering whether to use the Wilcoxon or sign test. Using the efficacies of these tests, determine which test to use. 1.12.11. For which of the following distributions is the signed-rank Wilcoxon more powerful? Why?  3 2 x −1 < x < 1 2 f1 (x) = 0 elsewhere.  3 (1 − x2 ) −1 < x < 1 2 f2 (x) = 0 elsewhere. 1.12.12. Show that ( 1.5.23) is scale invariant. Hence, the efficiency does not change if X is multiplied by a positive constant. Let f (x, δ) = δ exp(−|x|δ )/2Γ(δ −1), − ∞ < x < ∞, 1 ≤ δ ≤ 2. When δ = 2, f is a normal distribution and when δ = 1, f is a Laplace distribution. Compute and plot as a function of δ the efficiency ( 1.5.23).

68

CHAPTER 1. ONE SAMPLE PROBLEMS

1.12.13. Show that the finite sample breakdown of the Hodges-Lehmann estimate ( 1.3.25) is ǫ∗n = m/n, where m is the solution to the quadratic inequality 2m2 −(4n+ 2)m∗ + n2 + n ≤ 0. . Table ǫ∗n as a function of n and show that ǫ∗n converges to 1 − √12 = .29. 1.12.14. Derive ( 1.6.9). 1.12.15. Prove Lemma 1.7.2. 1.12.16. Prove Theorem 1.7.1. In particular, check the conditions of the Lindeberg Central Limit Theorem to verify ( 1.7.7). 1.12.17. Prove Theorem 1.7.2. 1.12.18. For the general signed-rank norm given by (1.8.1), show that the function Tϕ+ (θ), (1.8.2) is a decreasing step function which steps down only at the Walsh averages. Hint: First show that the ranks of |Xi − θ| and |Xj − θ| switch for θ1 < θ2 if and only if θ1
0. Assume that T (θ) is standardized so that the decision rule of the (asymptotic) level α test is given by Reject H0 : θ = 0 in favor of HA : θ > 0, if T (0) > zα . Further assume that for all |θ| < B, B > 0, √ T (θ/ n) = T (0) − 1.2θ + op (1). (a) For θ0 > 0, determine the asymptotic power γ(θ0 ), i.e., determine γ(θ0 ) = Pθ0 [T (0) > zα ]. (b) Evaluate γ(θ0 ) for n = 36 and θ0 = 0.5.

70

CHAPTER 1. ONE SAMPLE PROBLEMS

1.12.31. Suppose X1 , . . . , X2n are independent observations such that Xi has cdf F (x−θi )0. For testing H0 : θ1 = . . . = θ2n versus HA : θ1 ≤ . . . ≤ θ2n with at least one strict inequality, consider the test statistic, n X S= I(Xn+i > Xi ) . i=1

(a.) Discuss the small sample and asymptotic distribution of S under H0 . (b.) Determine the alternative distribution of S under the alternative θn+i − θi = ∆, ∆ > 0, for all i = 1, . . . , n. Show that the test is consistent for this alternative. This test is called Mann’s (1945) test for trend. 1.12.32. The data in Table 1.12.1 constitutes a sample of size 59 of information on professional baseball players. The data were recorded from the back of a deck of baseball cards, (complements of Carrie McKean). (a). Obtain dotplots of the weights and heights of the baseball players. (b). Assume the weight of a typical adult male is 175 pounds. Use the Wilcoxon test statistic to test the hypotheses H0 : θW = 175 versus HA : θW 6= 175 , where θW is the median weight of a professional baseball player. Compute the p-value. Next obtain a 95% confidence interval for θW using the confidence interval procedure based on the Wilcoxon. Use the dotplot in Part (a) to comment on the assumption of symmetry. (c). Let θH be the median height of a baseball player. Repeat the analysis of Part (b) for the hypotheses H0 : θH = 70 versus HA : θH 6= 70 . 1.12.33. The signed-rank Wilcoxon scores are optimal for the logistic distribution while the sign scores are optimal for the Laplace distribution. A family of score functions which are optimal for distributions with logistic “middles” and Laplace “tails” are the bent scores. These are continuous score functions ϕ+ (u) with a linear (positive slope and intercept 0) piece for 0 < u < b and a constant piece for b < u < 1, for a specified value of b; see Policello and Hettmansperger (1976). These are called signed-rank Winsorized Wilcoxon scores. R (a) Obtain the standardized scores such that [ϕ+ (u)]2 du = 1.

(b) For these scores with b = 0.75, obtain the corresponding estimate of location and an estimate of its standard error for the following data set: 7.94 8.13 8.11 7.96 7.83 7.04 7.91 7.82 7.42 8.06 8.51 7.88 8.96 7.58 8.14 8.06

1.12. EXERCISES

71

Table 1.12.1: Data for professional baseball players, Exercise 1.12.32. The variables are: (H) Height in inches; (W) Weight in pounds; (B) Side of plate from which the player bats, (1-Right handed, 2-Left handed, 3-Switch-hitter); (A) Throwing arm (0-Right, 1-Left); (P) Pitch-hit indicator, (0-Pitcher, 1-Hitter); and (Ave) Average (ERA if pitcher, Batting average if hitter). H W B A P Ave H W B A P Ave 74 218 1 1 0 3.330 79 232 2 1 0 3.100 75 185 1 0 1 0.286 72 190 1 0 1 0.238 77 219 2 1 0 3.040 75 200 2 0 0 3.180 73 185 1 0 1 0.271 70 175 2 0 1 0.279 69 160 3 0 1 0.242 75 200 1 0 1 0.274 73 222 1 0 0 3.920 78 220 1 0 0 3.880 78 225 1 0 0 3.460 73 195 1 0 0 4.570 76 205 1 0 0 3.420 75 205 2 1 1 0.284 77 230 2 0 1 0.303 74 185 1 0 1 0.286 78 225 1 0 0 3.460 71 185 3 0 1 0.218 76 190 1 0 0 3.750 73 210 1 0 1 0.282 72 180 3 0 1 0.236 76 210 2 1 0 3.280 73 185 1 0 1 0.245 73 195 1 0 1 0.243 73 200 2 1 0 4.800 75 205 1 0 0 3.700 74 195 1 0 1 0.276 73 175 1 1 0 4.650 75 195 1 0 0 3.660 73 190 2 1 1 0.238 72 185 2 1 1 0.300 74 185 3 1 0 4.070 75 190 1 0 1 0.239 72 190 3 0 1 0.254 76 200 1 0 0 3.380 73 210 1 0 0 3.290 76 180 2 1 0 3.290 71 195 1 0 1 0.244 72 175 2 1 1 0.290 71 166 1 0 1 0.274 76 195 2 1 0 4.990 71 185 1 1 0 3.730 68 175 2 0 1 0.283 73 160 1 0 0 4.760 73 185 1 0 1 0.271 74 170 2 1 1 0.271 69 160 1 0 1 0.225 76 185 1 0 0 2.840 76 211 3 0 1 0.282 71 155 3 0 1 0.251 77 190 3 0 1 0.212 76 190 1 0 0 3.280 74 195 1 0 1 0.262 71 160 3 0 1 0.270 75 200 1 0 0 3.940 70 155 3 0 1 0.261 73 207 3 0 1 0.251

72

CHAPTER 1. ONE SAMPLE PROBLEMS The software RBR computes this estimate with the call onesampr(x,score=phipb,grad=sphipb,param=c(.75)).

Chapter 2 Two Sample Problems 2.1

Introduction

Let X1 , . . . , Xn1 be a random sample with common distribution function F (x) and density function f (x). Let Y1 , . . . , Yn2 be another random sample, independent of the first, with common distribution function G(x) and density g(x). We will call this the general model throughout this chapter. A natural null hypothesis is H0 : F (x) = G(x). In this chapter we will consider rank and sign tests of this hypothesis. A general alternative to H0 is HA : F (x) 6= G(x) for some x. Except for the Section 2.10 on the scale model we will be generally concerned with the alternative models where one distribution is stochastically larger than the other; for example, the alternative that G is stochastically larger than F which can be expressed as HA : G(x) ≤ F (x) with a strict inequality for some x. This family of alternatives includes the location model, described next, and the Lehmann alternative models discussed in Section 2.7, which are used in survival analysis. As in Chapter 1, the location models will be of primary interest. For these models G(x) = F (x − ∆) for some parameter ∆. Thus the parameter ∆ represents a shift in location between the two distributions. It can be expressed as ∆ = θY − θX where θY and θX are the medians of the distributions of G and F or equivalently as ∆ = µY − µX where, provided they exist, µY and µX are the means of G and F. In the location problem the null hypothesis becomes H0 : ∆ = 0. In addition to tests of this hypothesis we will develop estimates and confidence intervals for ∆. We will call this the location model throughout this chapter and we will show that this is a generalization of the location problem defined in Chapter 1. As in Chapter 1 with the one-sample problems, for the two-sample problems, we offer the reader computational R functions which do the computation for the rank-based analyses dicussed in this chapter. 73

74

2.2

CHAPTER 2. TWO SAMPLE PROBLEMS

Geometric Motivation

In this section, we work with the location model described above. As in Chapter 1, we will derive sign and rank-based tests and estimates from a geometric point of view. As we shall show, their development is analogous to that of least squares procedures in that other norms are used in place of the least squares Euclidean norm. In order to do this we place the problem into the context of a linear model. This will facilitate our geometric development and will also serve as an introduction to Chapter 3, linear models. Let Z′ = (X1 , . . . , Xn1 , Y1 , . . . , Yn2 ) denote the vector of all observations; let n = n1 + n2 denote the total sample size; and let  0 if 1 ≤ i ≤ n1 ci = . (2.2.1) 1 if n1 + 1 ≤ i ≤ n Then we can write the location model as Zi = ∆ci + ei , 1 ≤ i ≤ n ,

(2.2.2)

where e1 , . . . , en are iid with distribution function F (x). Let C = [ci ] denote the n × 1 design matrix and let ΩF U LL denote the column space of C. We can express the location model as Z = C∆ + e ,

(2.2.3)

where e′ = (e1 , . . . , en ) is the n × 1 vector of errors. Note that except for random error, the ˆ minimizes observations Z would lie in ΩF U LL . Thus given a norm, we estimate ∆ so that C∆ b is the vector in ΩF U LL closest to the distance between Z and the subspace ΩF U LL; i.e., C∆ Z. Before turning our attention to ∆, however, we write the problem in terms of the geometry discussed in Chapter 1. Consider any location functional T of the distribution of e. Let θ = T (F ). Define the random variable e∗ = e − θ. Then the distribution function of e∗ is F ∗ (x) = F (x + θ) and its functional is T (F ∗) = 0. Thus the model, (2.2.3), can be expressed as Z = 1θ + C∆ + e∗ . (2.2.4) Note that this is a generalization of the location problem discussed in Chapter 1. From the last paragraph, the distribution function of Xi can be expressed as F (x) = F ∗ (x − θ); hence, T (F ) = θ is a location functional of Xi . Further, the distribution function of Yj can be written as G(x) = F ∗ (x − (∆ + θ)). Thus T (G) = ∆ + θ is a location functional of Yj . Therefore, ∆ is precisely the difference in location functionals between Xi and Yj . Furthermore ∆ does not depend on which location functional is used and will be called the shift parameter. b such Let b = (θ, ∆)′ . Given a norm, we want to choose as our estimate of b a value b b minimizes the distance between the vector of observations Z and the column that [1 C]b space V of the matrix [1 C]. Thus we can use the norms defined in Chapter 1 to estimate b.

2.2. GEOMETRIC MOTIVATION

75

If, as an example, we select the L1 norm, then our estimate of b minimizes D(b) =

n X i=1

|Zi − θ − ci ∆| .

(2.2.5)

Differentiating D with respect to θ and ∆, respectively, and setting the resulting equations to 0 we obtain the equations, n1 X i=1

sgn (Xi − θ) +

n2 X j=1

n2 X j=1

. sgn (Yj − θ − ∆) = 0

(2.2.6)

. sgn (Yj − θ − ∆) = 0 .

(2.2.7)

P n1 . Subtracting the second equation from the first we get i=1 sgn (Xi − θ) = 0; hence, b = b = med {Yj − θ} θb = med {Xi}. Substituting this into the second equation, we get ∆ b = (med {Xi }, med {Yj − θ} b − med {Xi }). We will obtain med {Yj } − med {Xi }; hence, b inference based on the L1 norm in Sections 2.6.1 and 2.6.2. b = (X, Y − If we select the L2 norm then, as shown in Exercise 2.13.1, the LS-estimate b ′ X) . Another norm discussed in Chapter 1 was the weighted L1 norm. In this case b is estimated by minimizing D(b) =

n X i=1

R(|Zi − θ − ci ∆|)|Zi − θ − ci ∆| .

(2.2.8)

This estimate cannot be obtained in closed form; however, fast minimization algorithms for such problems are discussed later in Chapter 3. In the initial statement of the problem, though, θ is a nuisance parameter and we are really interested in ∆, the shift in location between the populations. Hence, we want to define distance in terms of norms which are invariant to θ. The type of norm that is invariant to θ is a pseudo-norm which we define next. Definition 2.2.1. An operator k · k∗ is called a pseudo-norm if it satisfies the following four conditions: ku + vk∗ kαuk∗ kuk∗ kuk∗

≤ = ≥ =

kuk∗ + kvk∗ for all u, v ∈ Rn |α|kuk∗ for all α ∈ R, u ∈ Rn 0 for all u ∈ Rn 0 if and only if u1 = · · · = un

Note that a regular norm satisfies the first three properties but in lieu of the fourth property, the norm of a vector is 0 if and only if the vector is 0. The following inequalities

76

CHAPTER 2. TWO SAMPLE PROBLEMS

establish the invariance of pseudo-norms to the parameter θ: kZ − θ1 − C∆k∗ ≤ kZ − C∆k∗ + kθ1k∗ = kZ − C∆k∗ = kZ − θ1 − C∆ + θ1k∗ ≤ kZ − θ1 − C∆k∗ . Hence, kZ − θ1 − C∆k∗ = kZ − C∆k∗ . Given a pseudo-norm, denote the associated dispersion function by D∗ (∆) = kZ − C∆k∗ . It follows from the above properties of a pseudo-norm that D∗ (∆) is a non-negative, continuous, and convex function of ∆. We next develop an inference which includes estimation of ∆ and tests of hypotheses concerning ∆ for a general pseudo-norm. As an estimate of the shift parameter ∆, we b which solves choose a value ∆ b = ArgminD∗ (∆) = ArgminkZ − C∆k∗ ; ∆

(2.2.9)

S∗ (∆) = − ▽ kZ − C∆k∗

(2.2.10)

b minimizes the distance between Z and ΩF U LL . Another way of defining ∆ b is as the i.e., C∆ stationary point of the gradient of the pseudo-norm. Define the function S∗ by where ▽ denotes the gradient of kZ − C∆k∗ with respect to ∆. Because D∗ (∆) is convex, it follows immediately that S∗ (∆) is nonincreasing in ∆ . b is such that Hence ∆

. b = S∗ (∆) 0.

(2.2.11) (2.2.12)

Given a location functional θ = T (F ), i.e. Model (2.2.4), once ∆ has been estimated we b i . For example, if we chose the median can base an estimate of θ on the residuals Zi − ∆c as our location functional then we could use the median of the residuals to estimate it. We will discuss this in more detail for general linear models in Chapter 3. Next consider the hypotheses H0 : ∆ = 0 versus HA : ∆ 6= 0 .

(2.2.13)

The closer S∗ (0) is to 0 the more plausible is the hypothesis H0 . More formally, we define the gradient test of H0 versus HA by the rejection rule, Reject H0 in favor of HA if S∗ (0) ≤ k or S∗ (0) ≥ l , where the critical values k and l depend on the null distribution of S∗ (0). Typically, the null distribution of S∗ (0) is symmetric about 0 and k = −l. The reduction in dispersion test is given by b ≥m, Reject H0 in favor of HA if D∗ (0) − D∗ (∆)

2.2. GEOMETRIC MOTIVATION

77

where the critical value m is determined by the null distribution of the test statistic. In this chapter, as in Chapter 1, we will be concerned with the gradient test while in Chapter 3 we will use the reduction in dispersion test. A confidence interval for ∆ of confidence (1 − α)100% is the interval {∆ : k < S∗ (∆) < l} and 1 − α = P∆ [k < S∗ (∆) < l] .

(2.2.14)

Since D∗ (∆) is convex, S∗ (∆) is nonincreasing and we have b L = inf{∆ : S∗ (∆) < l} and ∆ b U = sup{∆ : S∗ (∆) > k} ; ∆

(2.2.15)

compare (1.3.10). Often we will be able to invert k < S∗ (∆) < l to find an explicit formula for the upper and lower end points. We will discuss a large class of general pseudo norms in Section 2.5, but now we present the pseudo norms that yield the pooled t-test and the Mann-Whitney-Wilcoxon test.

2.2.1

Least Squares (LS) Analysis

The traditional analysis is based on the squared pseudo-norm given by kuk2LS

=

n n X X i=1 j=1

(ui − uj )2 , u ∈ Rn .

(2.2.16)

It follows, (see Exercise 2.13.1) that ▽kZ − C∆k2LS = −4n1 n2 (Y − X − ∆) ; ˆ LS = Y − X. Eliminating the constant factor 4n1 n2 the hence the classical estimate is ∆ classical test is based on the statistic SLS (0) = Y − X . As shown in Exercise 2.13.1, standardizing SLS results in the two-sample pooled t-statistic. An approximate confidence interval for ∆ is given by Y − X ± t(α/2,n1 +n2 −2) σ b

r

1 1 + , n1 n2

where σ b is the usual pooled estimate of the common standard deviation. This confidence interval is exact if ei has a normal distribution. Asymptotically, we replace t(α/2,n1 +n2 −2) by zα/2 . The test is asymptotically distribution free.

78

2.2.2

CHAPTER 2. TWO SAMPLE PROBLEMS

Mann-Whitney-Wilcoxon (MWW) Analysis

The rank based analysis is based on the pseudo-norm defined by kukR =

n X n X i=1 j=1

|ui − uj | , u ∈ Rn .

(2.2.17)

Note that this pseudo-norm is the L1 -norm based on the differences between the components and that it is the second term of expression (1.3.20), which defines the norm of the signed rank analysis of Chapter 1. Note further, that this pseudo-norm differs from the least squares pseudo-norm in that the square root is taken inside the double summation. In Exercise 2.13.2 the reader is asked to show that this indeed is a pseudo-norm and that further it can be written in terms of ranks as  n  X n+1 kukR = 4 R(ui ) − ui . 2 i=1 From (2.2.17), it follows that the MWW gradient is ▽kZ − C∆kR = −2

n1 X n2 X i=1 j=1

sgn (Yj − Xi − ∆) .

Our estimate of ∆ is a value which makes the gradient zero; that is, makes half of the differences positive and the other half negative. Thus the rank based estimate of ∆ is b R = med {Yj − Xi } . ∆

(2.2.18)

This pseudo-norm estimate is often called the Hodges-Lehmann estimate of shift for the b two sample problem, (Hodges and Lehmann, 1963). As we show in Section p2.4.4, ∆R has an approximate normal distribution with mean ∆ and standard deviation τ (1/n1 ) + (1/n2 ), where the scale parameter τ is given in display (2.4.22). From the gradient we define SR (∆) =

n1 X n2 X i=1 j=1

sgn (Yj − Xi − ∆) .

(2.2.19)

Next define SR+ (∆) = #(Yj − Xi > ∆) .

(2.2.20)

Note that we have (with probability one) that SR (∆) = 2SR+ (∆) − n1 n2 . The statistic SR+ = SR+ (0), originally proposed by Mann and Whitney (1947), will be more convenient to use. The gradient test for the hypotheses (2.2.13) is Reject H0 in favor of HA if SR+ ≤ k or SR+ ≥ n1 n2 − k ,

2.2. GEOMETRIC MOTIVATION

79

where k is chosen by P0 (SR+ ≤ k) = α/2. We show in Section 2.4 that the test statistic is distribution free under H0 and, that further, p it has an asymptotic normal distribution with mean n1 n2 /2 and standard deviation n1 n2 (n1 + n2 + 1)/12 under H0 . Hence, an asymptotic level α test rejects H0 in favor of HA , if |z| > zα/2 where z = √ by

+ SR −(n1 n2 /2)

n1 n2 (n1 +n2 +1)/12

.

(2.2.21)

As shown in Section 2.4.2, the (1 − α)100% MWW confidence interval for ∆ is given [D(k+1) , D(n1 n2 −k) ) ,

(2.2.22)

where k is such that P0 [SR+ ≤ k] = α/2 and D(1) ≤ · · · ≤ D(n1 n2 ) denote the ordered n1 n2 + differences Yj − Xi . It follows from q the asymptotic null distribution of SR that k can be (n+1) . approximated as n12n2 − 21 − zα/2 n1 n212 A rank formulation of the MWW test statistic SR+ (∆) will also prove useful. Letting R(ui ) denote the rank of ui among u1 , . . . , un we can write n2 X j=1

R(Yj − ∆) =

n2 X j=1

{#i (Xi < Yj − ∆) + #i (Yi − ∆ ≤ Yj − ∆)}

= #(Yj − Xi > ∆) + Defining, W (∆) =

n2 X i=1

we thus have the relationship that

n2 (n2 + 1) . 2

R(Yi − ∆) ,

SR+ (∆) = W (∆) −

n2 (n2 + 1) . 2

(2.2.23)

(2.2.24)

The test statistic W (0) was proposed by Wilcoxon (1945). Since it is a linear function of the Mann-Whitney test statistic it has identical statistical properties. We will refer to the statistic, SR+ , as the Mann-Whitney-Wilcoxon statistic and will label it as MWW. As a final note on the geometry of the rank based analysis, reconsider the model with the location functional θ in it, i.e. (2.2.4). Suppose we obtain the R-estimate of ∆, (2.2.18). b R denote the residuals. Next suppose we want to estimate the location Let b eR = Z − C∆ parameter θ by using the weighted L1 normPwhich was discussed for estimation of location in Section 1.7 of Chapter 1. Let kukSR = nj=1 j|u|(j) denote this norm. For the residual vector b eR , expression (1.3.10) of Chapter 1 is given by X X ebi + b e j + (1/4)kb − θ eR kR . (2.2.25) kb e − θ1kSR = 2 i≤j

80

CHAPTER 2. TWO SAMPLE PROBLEMS

Hence the estimate of θ determined by this geometry is the Hodges-Lehmann estimate based on the residuals; i.e.,   ebi + b ej b . (2.2.26) θR = medi≤j 2

b R )′ will be discussed Asymptotic theory for the joint distribution of the random vector (θbR , ∆ in Chapter 3.

2.2.3

Computation

The Mann-Whitney-Wilcoxon analysis which we described above is easily computed using the RBR function twosampwil. This function returns the value of the Mann-Whitney-Wilcoxon b (2.2.18), the associated confidence intest statistic SR+ = SR+ (0), (2.2.20), the estimate ∆, terval (2.2.22), and comparison boxplots of the samples. Also, the R intrinsic function wilcoxon.test and minitab command MANN compute this Mann-Whitney-Wilcoxon analysis.

2.3

Examples

In this section we present two examples which illustrate the methods discussed in the last section. The calculations were performed by the RBR functions twosampwil and twosampt which compute the Mann-Whitney-Wilcoxon and LS analyses, respectively. By convention, for each difference Yj − Xi = 0, we add the value 1/2 to the test statistic SR+ . Further, the returned p-value is calculated with the usual continuity correction. The estimate of τ and its standard error (SE) displayed in the results are given by expression (2.4.27), where a full discussion is given. The LS analysis, computed by twosampt, is based on the traditional pooled two-sample t-test. Example 2.3.1. Quail Data The data for this problem are drawn from a high volume drug screen designed to find compounds which reduce low density lipoproteins, LDL, cholesterol in quail; see McKean, Vidmar and Sievers (1989) for a discussion of this screen. For the purposes of the present example, we have taken the plasma LDL levels of one group of quail who were fed over a specified period of time a special diet mixed with a drug compound and the LDL levels of a second group of quail who were fed the same special diet but without the drug compound over the same length of time. A completely randomized design was employed. We will refer to the first group as the treatment group and the second group as the control group. The data are displayed in Table 2.3.1. Let θC and θT denote the true median levels of LDL for the control and treatment populations, respectively. The parameter of interest is ∆ = θC − θT . We are interested in the alternative hypothesis that the treatment has been effective; hence the hypotheses are: H0 : ∆ = 0 versus HA : ∆ > 0 .

2.3. EXAMPLES

81

Table 2.3.1: Data for Quail Example 64 49 54 64 97 66 76 44 71 70 72 71 55 60 62 46 77 86 Treated 40 31 50 48 152 44 74 38 81 Control

89 71 64

The comparison boxplots returned by the RBR function twosampwil are found in Figure 2.3.1. Note that there is one outlier, the fifth observation of the treated group, which has the value 152. Outliers such as this were typical with most of the data in this study; see McKean et al. (1989). For the data at hand, the treated group appears to have lower LDL levels. Figure 2.3.1: Comparison Boxplots of Treatment and Control Quail LDL Levels

100 80 40

60

LDL cholesterol

120

140

Comparison Boxplots of Treated and Control

Control

Treated

The analyses returned by the functions twosampwil and twosampt are given below. The Mann-Whitney-Wilcoxon test statistic has the value 134.5 with p-value 0.067, while the t-test statistic has value 0.557 with p-value 0.291. The MWW indicates with marginal significance that the treatment performed better than the placebo. The two sample t analysis was impaired by the outlier. The Hodges-Lehmann estimate of ∆, (2.2.18), is 14 and the 90% confidence interval is (−2.0, 24.0). In contrast, the least squares estimate of shift is 5 and the corresponding 90% confidence interval is (−10.25, 20.25). > twosampwil(y,x,alt=1,alpha=.10,namex="Treated",namey="Control",

82

CHAPTER 2. TWO SAMPLE PROBLEMS

nameresp="LDL cholesterol") Test of Delta = 0 Alternative selected is 1 Test Stat. S+ is 134.5 Standardized (z) Test-Stat. 1.495801 p-vlaue 0.06735282 MWW estimate of the shift in location is 14 SE is 90 % Confidence Interval is ( -2 , 24 ) Estimate of the scale parameter tau 21.12283

and

8.180836

> twosampt(y,x,alt=1,alpha=.10,namex="Treated",namey="Control", nameresp="LDL cholesterol") Test of Delta = 0 Alternative selected is 1 Test Stat. ybar-xbar- 0 is 5 Standardized (t) Test-Stat. 0.5577585 and p-vlaue 0.2907209 Mean of y minus the mean of x is 5 SE is 8.964454 90 % Confidence Interval is ( -10.24971 , 20.24971 ) Estimate of the scale parameter sigma 23.14612 As noted above, this data was drawn from data from a high-speed drug screen to discover drug compounds which have the potential to reduce LDL cholesterol. In this screen, if a compound was at least marginally significant the investigation of it would continue, else it would be eliminated from further scrutiny. Hence, for this drug compound, the robust and LS analyses would result in different practical outcomes. Example 2.3.2. Hendy-Charles Coin Data, continuation of Example 1.11.1 Recall that the 84% L1 confidence intervals for the data are disjoint. Thus we reject the null hypothesis that the silver content is the same for the two mintings at the 5% level. We now apply the MWW test and confidence interval to this data and find the Hodges-Lehmann estimate of shift. If the tailweights of the underlying distributions are moderate, the MWW methods are more efficient. The output from the RBR function twosampwil is: > twosampwil(Fourth,First) Test of Delta = 0 Alternative selected is 0 Test Stat. S+ is 61.5 Standardized (z) Test-Stat. 3.122611 and p-vlaue 0.001792544 MWW estimate of the shift in location is 1.1

SE is

0.2999926

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

83

95 % Confidence Interval is ( 0.6 , 1.7 ) Estimate of the scale parameter tau 0.5952794 Note that there is strong statistical evidence that the mintings are different. The HodgesLehmann estimate (2.2.18) is 1.1 which suggests that there is roughly a 1.1% decrease in the silver content from the first to the fourth mintings. The 95% confidence interval, (2.2.22), is (0.6, 1.7). Half the length of the confidence is .45 and this could be reported as the margin of error in estimating ∆, the change in median silver contents from the first to the fourth mintings. Hence we could report 1.1% ± .45%.

2.4

Inference Based on the Mann-Whitney-Wilcoxon

We next develop the theory for inference based on the Mann-Whitney-Wilcoxon statistic, including the test, the estimate and the confidence interval. Although much of the development is for the location model the general model will also be considered. We begin with testing.

2.4.1

Testing

Although the geometric motivation of the test statistic SR+ was derived under the location model, the test can be used for more general models. Recall that the general model is comprised of a random sample X1 , . . . , Xn1 with cdf F (x) and a random sample Y1 , . . . , Yn2 with cdf G(x). For the discussion we select the hypotheses, H0 : F (x) = G(x), for all x versus HA : F (x) ≥ G(x), with strict inequality for some x. (2.4.1) Under this stochastically ordered alternative Y tends to dominate X,; i.e., P (Y > X) > 1/2. Our rank-based decision rule is to reject H0 in favor of HA if SR+ is too large, where SR+ = #(Yj − Xi > 0). Our immediate goal is to make this precise. What we discuss will of course hold for the other one-sided alternative F (x) ≤ G(x) and the two-sided alternative F (x) ≤ G(x) or F (x) ≥ G(x) as well. Furthermore, since the location model is a submodel of the general model, what holds for the general model will hold for it also. It will always be clear which set of hypotheses is being considered. Under H0 , we first show that SR+ is distribution free and then show it is symmetrically distributed about (n1 n2 )/2. Theorem 2.4.1. Under the general null hypothesis in (2.4.1), SR+ is distribution free. Proof: Under the null hypothesis, the combined samples X1 , . . . , Xn1 , Y1 , . . . , Yn2 constitute a random sample of size n from the distribution function F (x). Hence any assignment of n2 ranks from the set of integers {1, . . . , n} to Y1 , . . . , Yn2 is equilikely; i. e., has probability  n −1 independent of F . n2

Theorem 2.4.2. Under H0 in (2.4.1), the distribution of SR+ is symmetric about (n1 n2 )/2.

84

CHAPTER 2. TWO SAMPLE PROBLEMS

Proof: Under H0 in (2.4.1) L(Yj − Xi ) = L(Xi − Yj ) for all i, j; see Exercise 2.13.3. Thus if SR− = #(Xi − Yj > 0) then, under H0 , L(SR+ ) = L(SR− ). Since SR− = n1 n2 − SR+ we have the following string of equalities which proves the result: n1 n2 n1 n2 + x] = P [n1 n2 − SR− ≥ + x] P [SR+ ≥ 2 2 n1 n2 n1 n2 − x] = P [SR+ ≤ − x] = P [SR− ≤ 2 2 Hence for the hypotheses (2.4.1), a level α test based on SR+ would reject H0 if SR+ ≥ cα,n1 ,n2 where PH0 [SR+ ≥ cα,n1 ,n2 ] = α. From the symmetry, note that the lower α critical point is given by n1 n2 − cα,n1 ,n2 . Although SR+ is distribution free under the null hypothesis its distribution cannot be obtained in closed form. The next theorem gives a recursive formula for its distribution. The proof can be found in Exercise 2.13.4; see, also, Hettmansperger (l984, p. 136-137). Theorem 2.4.3. Under the general null hypothesis in (2.4.1), let Pn1 ,n2 (k) = PH0 [SR+ = k]. Then n2 n1 Pn1 ,n2 (k) = Pn1 ,n2 −1 (k − n1 ) + Pn −1,n2 (k) , n1 + n2 n1 + n2 1 where Pn1 ,n2 (k) satisfies the boundary conditions Pi,j (k) = 0 if k < 0, Pi,0 (k) and P0,j (k) are 1 or 0 as k = 0 or k 6= 0. Based on these recursion formulas, tables of the null distribution can be obtained readily, which then can be used to obtain the critical values for the rank based test. Alternatively, the asymptotic null distribution of SR+ can be used to determine approximate critical values. This asymptotic test will be discussed later; see Theorem 2.4.9. We next derive the mean and variance of SR+ under the three models:

(a) the general model where X has distribution function F (x) and Y has distribution function G(x); (b) the location model where G(x) = F (x − ∆); (c) and the null model in which F (x) = G(x). Of course, from Theorem 2.4.2, the null mean of SR+ is (n1 n2 )/2. In our derivation we repeatedly make use of the fact that if H is the distribution function of a random variable Z then the random variable H(Z) has a uniform distribution over the interval (0, 1); see Exercise 2.13.5. Theorem 2.4.4. Assuming that X1 , . . . , Xn1 are iid F (x) and Y1 , . . . , Yn2 are iid G(x) and that these two samples are independent of one another, the means of SR+ under the three models (a)-(c) are:   (a) E SR+ = n1 n2 [1 − E [G(X)]] = n1 n2 E [F (Y )]   (b) E SR+ = n1 n2 [1 − E [F (X − ∆)]] = n1 n2 E [F (X + ∆)]   n1 n2 (c) E SR+ = . 2

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

85

Proof: We shall prove only (a), since results (b) and (c) follow directly from it. We can write SR+ in terms of indicator functions as SR+

=

n1 X n2 X i=1 j=1

I(Yj − Xi > 0) ,

(2.4.2)

where I(t > 0) is 1 or 0 for t > 0 or t ≤ 0, respectively. Let Y have distribution function G, let X have distribution function F , and let X and Y be independent. Then E [I (Y − X > 0)] = E [P [Y > X|X]] = E [1 − G(X)] = E [F (Y )] , where the second equality follows from the independence of X and Y . The results then follow. Theorem 2.4.5. The variances of SR+ under the models (a) - (c) are:   (a) V ar SR+ = +  + (b) V ar SR = +  + (c) V ar SR =

 n1 n2 E [G(X)] − E 2 [G(X)] n1 n2 (n1 − 1)V ar [F (Y )] + n1 n2 (n2 − 1)V ar [G(X)]  n1 n2 E [F (X − ∆)] − E 2 [F (X − ∆)] n1 n2 (n1 − 1)V ar [F (Y )] + n1 n2 (n2 − 1)V ar [F (X − ∆)] n1 n2 (n + 1) . 12

Proof: Again only the result (a) will be obtained. Using the indicator formulation of SR+ , (2.4.2), we have V ar



SR+



=

n1 X n2 X i=1 j=1

V ar [I(Yj − Xi > 0)]+

n2 n1 X n2 X n1 X X i=1 j=1 l=1 k=1

Cov [I(Yj − Xi > 0), I(Yk − Xl > 0)] ,

where the sums for the covariance terms are over all possible combinations except (i, j) = (l, k). For the first term, note that the variance of I(Y − X > 0) is V ar [I(Y > X)] = E [I(Y > X)] − E 2 [I(Y > X)] = E [1 − G(X)] − E 2 [1 − G(X)] = E [G(X)] − E 2 [G(X)] . This yields the first term in (a). For the covariance terms, note that a covariance is 0 unless either j = k or i = l. This leads to the following two cases:

86

CHAPTER 2. TWO SAMPLE PROBLEMS

Case(i) For the covariance terms with j = k and i 6= l, we need E [I(Y > X1 )I(Y > X2 )] which is, E [I(Y > X1 )I(Y > X2 )] = = = =

P [Y > X1 , Y > X2 ] E [P [Y > X1 , Y > X2 |Y ]] E [P [Y > X1 |Y ] P [Y > X2 |Y ]]   E F (Y )2

There are n2 ways to get a j and n1 (n1 −1) ways to get i 6= l; hence there are n1 n2 (n1 −1) covariances of this form. This leads to the second term of (a). Case(ii) ) The terms for the covariances where i = l and j 6= k follow similarly to Case (i). This leads to the third and final term of (a). The last two theorems suggest that the random variable Z =

n1 n2 2 n1 n2 (n+1) 12

S+− qR

has an approx-

imate N(0, 1) distribution under H0 . This follows from the next results which yield the asymptotic distribution of SR+ under general alternatives as well as under the null hypothesis. We will obtain these results by projecting our statistic SR+ down onto a set of linear combinations of independent random variables. Then we can use central limit theory on the ˇ ak (1967) for a discussion of this technique. projection. See H´ajek and Sid´ Let T = T (Z1 , . . . , Zn ) be a random variable based on a sample Z1 , . . . , Zn such that E [T ] = 0. Let p∗k (x) = E [T | Zk = x] , k = 1, . . . , n . Next define the random variable Tp to be Tp =

n X

p∗k (Zk ) .

(2.4.3)

k=1

In the next theorem we show that Tp is the projection of T onto the space of linear functions of Z1 , . . . , Zn . Note that unlike T , Tp is a linear combination of independent random variables; hence, its asymptotic distribution is often easier to obtain than that of T . As the following P projection theorem shows it is in a sense the “closest” linear function of the form pi (Zi ) to T . Pn 2 Theorem 2.4.6. If W = i=1 pi (Zi ) then E [(T − W ) ] is minimized by taking pi (x) = p∗i (x). Furthermore E [(T − Tp )2 ] = V ar [T ] − V ar [Tp ]. Proof: First note that E [p∗k (Zk )] = 0. We have,     E (T − W )2 = E [(T − Tp ) − (W − Tp )]2 (2.4.4)     2 2 = E (T − Tp ) + E (W − Tp ) − 2E [(T − Tp )(W − Tp )] .

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

87

We can write one-half the cross product term as n X i=1

E [(T − Tp )(pi (Zi ) − p∗i (Zi ))] = =

n X

i=1 n X i=1

E [E [(T − Tp )(pi (Zi ) − p∗i (Zi )) | Zi ]] "

"

E (pi (Zi) − p∗i (Zi ))E T −

n X j=1

p∗j (Zj ) | Zi

##

.

The conditional expectation can be written as, X   E p∗j (Zj ) = 0 − 0 = 0 . (E [T | Zi ] − p∗i (Zi )) − j6=i

Hence the cross-product term is zero, and, therefore the left-hand-side of expression (2.4.4) is minimized with respect to W by taking W = Tp . Also since this holds, in particular, for W = 0 we get       E T 2 = E (T − Tp )2 + E Tp2 .

Since both T and Tp have zero means the second result of the theorem also follows. From these results a strategy for obtaining the asymptotic distribution of T is apparent. Namely, find the asymptotic distribution of its projection, Tp and then show V ar [T ] − V ar [Tp ] → 0 as n → ∞. This implies that T and Tp have the same asymptotic distribution; see Exercise 2.13.7. We shall apply this strategy to get the asymptotic distribution of the   rank based methods. As a first step we obtain the projection of SR+ − E SR+ under the general model.   Theorem 2.4.7. Under the general model the projection of the random variable SR+ −E SR+ is n2 n1 X X Tp = n1 (F (Yj ) − E [F (Yj )]) − n2 (G(Xi ) − E [G(Xi )]) . (2.4.5) j=1

i=1

Proof: Define the n random variables Z1 , . . . , Zn by  Xi if 1 ≤ i ≤ n1 . Zi = Yi−n1 if n1 + 1 ≤ i ≤ n We have,     p∗k (x) = E SR+ | Zk = x − E SR+ n1 X n2 X   = E [I(Yj > Xi ) | Zk = x] − E SR+ . i=1 j=1

There are two cases depending on whether 1 ≤ k ≤ n1 or n1 + 1 ≤ k ≤ n1 + n2 = n.

(2.4.6)

88

CHAPTER 2. TWO SAMPLE PROBLEMS

Case (1) Suppose 1 ≤ k ≤ n1 then the conditional expectation in the above expression (2.4.6), depending on the value of i, becomes (a): i 6= k, E [I(Yj > Xi ) | Xk = x] = = (b): i = k, E [I(Yj > Xi ) | Xi = x] = =

E [I(Yj > Xi )] P [Y > X] P [Y > X | X = x] 1 − G(x)

Hence, in this case,   p∗k (x) = n2 (n1 − 1)P [Y > X] + n2 (1 − G(x)) − E SR+ .

Case (2) Next suppose that n1 + 1 ≤ k ≤ n then the conditional expectation in the above expression (2.4.6), depending on the value of j, becomes (a): j 6= k, E [I(Yj > Xi ) | Yk = x] = P [Y > X] (b): j = k, E [I(Yj > Xi ) | Yj = x] = F (x) Hence, in this case,   p∗k (x) = n1 (n2 − 1)P [Y > X] + n1 F (x) − E SR+ .

Combining these results we get Tp =

n1 X i=1

p∗i (Xi )

+

n2 X

p∗j (Yj )

j=1

= n1 n2 (n1 − 1)P [Y > X] + n2 + n1 n2 (n2 − 1)P [Y > X] + n1

n1 X

i=1 n2 X j=1

(1 − G(Xi ))   F (Yj ) − nE SR+ .

This can be simplified by noting that P (Y > X) = E [P (Y > X | X)] = E [1 − G(X)] or similarly P (Y > X) = E [F (Y )] . From (a) of Theorem 2.4.4,   E SR+ = n1 n2 (1 − E [G(X)]) = n1 n2 P (Y > X) .

Substituting these three results into (2.4.6) we get the desired result. An immediate outcome is

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

89

Corollary 2.4.1. Under the general model, if Tp is given by (2.4.5) then Var (Tp ) = n21 n2 Var (F (Y )) + n1 n22 Var (G(X)) . From this it follows that Tp should be standardized as Tp∗ = √

1 Tp . nn1 n2

In order to obtain the asymptotic distribution of Tp and subsequently SR+ we need the following assumption on the design (sample sizes), (D.1) :

ni → λ i , 0 < λi < 1 . n

(2.4.7)

This says that the sample sizes go to ∞ at the same rate. Note that λ1 + λ2 = 1. The asymptotic variance of Tp∗ is thus Var (Tp∗ ) → λ1 Var (F (Y )) + λ2 Var (G(X)) . We first want to obtain the asymptotic distribution under general alternatives. In order to do this we need an assumption concerning the ranges of X and Y . The support of a continuous random variable with distribution function H and density h is defined to be the set {x : h(x) > 0} which is denoted by S(H). Our second assumption states that the intersection of the supports of F and G has a nonempty interior; that is (E.3) :

There is an open interval I such that I ⊂ S(F ) ∩ S(G) .

(2.4.8)

Note that the asymptotic variance of Tp∗ is not zero under (E.3). We are now in the position to find the asymptotic distribution of Tp∗ . Theorem 2.4.8. Under the general model and the assumptions (D.1) and (E.3), Tp∗ has an asymptotic N (0, λ1 Var (F (Y )) + λ2 Var (G(X))) distribution. Proof: By (2.4.5) we can write r r n2 n1 n1 X n2 X ∗ (F (Yj ) − E [F (Yj )]) − (G(Xi ) − E [G(Xi )]) . Tp = nn2 j=1 nn1 i=1

(2.4.9)

Note that both sums on the right side of expression (2.4.9) are composed of independent and identically distributed random variables and that the sums are independent of one another. The result then follows immediately by applying the simple central limit theorem to each sum. This is the key result we need in order to obtain the asymptotic distribution of our test statistic SR+ . We first obtain the result under the general model and then under the null hypothesis. As we will see, both results are immediate.

90

CHAPTER 2. TWO SAMPLE PROBLEMS

Theorem 2.4.9. Under the general model and the conditions (E.3) and (D.1) the random + S + −E [SR ] variable √R has a limiting N(0, 1) distribution. Var (SR+ ) Proof: By the last theorem and Theorem 2.4.6, we need only show that the difference in √ the variances of SR+ / nn1 n2 and Tp∗ goes to 0 as n → ∞. Note that, Var



1 S+ √ nn1 n2 R



 n1 n2 E [G(X)] − E [G(X)]2 nn1 n2 n1 n2 (n1 − 1) n1 n2 (n2 − 1) + Var (F (Y )) + Var (G(X)) ; nn1 n2 nn1 n2

=

√ hence, Var (Tp∗ ) − Var (SR+ / nn1 n2 ) → 0 and the result follows from Exercise (2.13.7). The asymptotic distribution of the test statistic under the null hypothesis follows immediately from this theorem. We record it in the next corollary. Corollary2.4.2. Under H0 : F (x) = G(x) and (D.1) only, the test statistic SR+ is approx(n+1) imately N n12n2 , n1 n212 . Therefore an asymptotic size α test for H0 : F (x) = G(x) versus HA : F (x) 6= G(x) is to reject H0 if |z| ≥ zα/2 where S + − n12n2 z = qR n1 n2 (n+1) 12

and

1 − Φ(zα/2 ) = α/2 . Since we approximate a discrete random variable with continuous one, we think it is advisable in cases of small samples to use a continuity correction. Fix and Hodges (l955) give an Edgeworth approximation to the distribution of SR+ and Bickel (l974) discusses the error of this approximation. Since the standard normal distribution function, Φ, is continuous on the entire real line, we can strengthen the convergence in Theorem 2.4.9 to uniform convergence; that is, the distribution function of the standardized MWW converges uniformly to Φ. Using this, it is not hard to show that the standardized critical values of the MWW converge to their counterparts at the standard normal. Thus if cα,n is the MWW critical value defined by α = PH0 [SR+ ≥ cα,n ] then cα,n − n12n2 q → zα , (2.4.10) n1 n2 (n+1) 12

where 1 − α = Φ(zα ); see Exercise 2.13.8 for details. This result will prove useful in the next section.

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

91

We now consider when the test based on SR+ is consistent. Consider the general set up; i. e., X1 , . . . , Xn1 is a random sample with distribution function F (x) and Y1 , . . . , Yn2 is a random sample with distribution function G(x). Consider the hypotheses H0 : F = G versus HA1 : F (x) ≥ G(x) with F (x0 ) > G(x0 ) for some x0 ∈ Int(S(F ) ∩ S(G)). (2.4.11) Such an alternative is called a stochastically ordered alternative. The next theorem shows that the MWW test statistic is consistent for this alternative. Likewise it is consistent for the other one sided stochastically ordered alternative with F and G interchanged, HA2 , and, also, for the two sided alternative which consists of the union of HA1 and HA2 . These results imply that the MWW test is consistent for location alternatives, provided F and G have overlapping support. As Exercise 2.13.9 shows, it will also be consistent when one support is shifted to the right of the other support. Theorem 2.4.10. Suppose that the assumptions (D.1), (2.4.7), and (E.3), (2.4.8), hold. Under the stochastic ordering alternatives given above, SR+ is a consistent test. Proof: Assume the stochastic ordering alternative HA1 , (2.4.11). For an arbitrary level α, select the critical level cα such that the test that rejects H0 if SR+ ≥ cα has asymptotic level α. We want to show that the power of the test goes to 1 as n → ∞. Since F (x0 ) > G(x0 ) for some point x0 in the interior of S(F )∩S(G), there exists an interval N such that F (x) > G(x) on N. Hence Z Z EHA [G(X)] = G(y)f (y)dy + G(y)f (y)dy c N N Z Z 1 < F (y)f (y)dy + F (y)f (y)dy = (2.4.12) 2 N Nc The power of the test is given by # "  +  cα − (n1 n2 /2) (n1 n2 /2) − EHA (SR+ ) SR+ − EHA (SR+ ) p p + . P HA S R ≥ c α = P HA ≥ p VHA (SR+ ) VHA (SR+ ) VHA (SR+ )

Note by (2.4.10) that

p cα − (n1 n2 /2) cα − (n1 n2 /2) VH0 (SR+ ) p p p → zα κ , = VHA (SR+ ) VH0 (SR+ ) VHA (SR+ )

where κ is a real number (since the variances are of the same order). But by (2.4.12) (n1 n2 /2) − EHA (SR+ ) (n1 n2 /2) − n1 n2 [1 − EHA (G(X)) p p = ] VHA (SR+ ) VHA (SR+ )   n1 n2 − 12 + EHA (G(X)) p → −∞ . = VHA (SR+ ) +

+

SR −EHA (SR ) By Theorem (2.4.9), under HA the random variable √ converges in distribution to + VHA (SR )

a standard normal variate. Since the convergence is uniform, it follows from the above limits that the power converges to 1. Hence the MWW test is consistent.

92

2.4.2

CHAPTER 2. TWO SAMPLE PROBLEMS

Confidence Intervals

Consider the location model (2.2.4). We next obtain a distribution free confidence interval for ∆ by inverting the MWW test. As a first step we have the following result on the function SR+ (∆), (2.2.20): Lemma 2.4.1. SR+ (∆) is a decreasing step function of ∆ which steps down by 1 at each difference Yj − Xi . Its maximum is n1 n2 and its minimum is 0. Proof: Let D(1) ≤ · · · ≤ D(n1 n2 ) denote the ordered n1 n2 differences Yj − Xi . The results follow immediately by writing SR+ (∆) = #(D(i) > ∆). Let α be given and choose cα/2 to be the lower α/2 critical point of the MWW distribution;  i.e., P∆ SR+ (∆) ≤ cα/2, = α/2. By the above lemma we have   1 − α = P∆ cα/2 < SR+ (∆) < n1 n2 − cα/2 h i = P∆ D(cα/2 +1) ≤ ∆ < D(n1 n2 −cα/2 ) .

Thus [D(cα/2 +1) , D(n1 n2 −cα/2 ) ) is a (1 − α)100% confidence interval for ∆; compare (1.3.30). From the asymptotic null distribution theory for SR+ , Corollary (2.4.2), we can approximate cα/2 as r n1 n2 (n + 1) n n . 1 2 − zα/2 − .5 . (2.4.13) cα/2 = 2 12

2.4.3

Statistical Properties of the Inference Based on the MWW

In this section we derive the efficiency properties of the MWW test statistic and properties of its power function under the location model (2.2.4). We begin with an investigation of the power function of the MWW test. For definiteness we will consider the one sided alternative, H0 : ∆ = 0 versus HA : ∆ > 0 .

(2.4.14)

Results similar to those given below can be obtained for the power function of the other one sided and the two sided alternatives. Given a level α, let cα,n1 ,n2 denote the upper critical value for the MWW test of this hypothesis; hence, the test rejects H0 if SR+ ≥ cα,n1 ,n2 . The power function of this test is given by γ(∆) = P∆ [SR+ ≥ cα,n1 ,n2 ] ,

(2.4.15)

where the subscript ∆ on P denotes that the probability is determined when the true parameter is ∆. Recall that SR+ (∆) = #{Yj − Xi > ∆}. The following theorem will prove useful, its proof is similar to that of Theorem 1.3.1 of Chapter 1 and the more general result Theorem A.2.4 of the Appendix. Theorem 2.4.11. For all t, P∆ [SR+ (0) ≥ t] = P0 [SR+ (−∆) ≥ t].

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

93

From Lemma 2.4.1 and Theorem 2.4.11 we have our first important result on the power function of the MWW test; namely, that it is monotone. Theorem 2.4.12. For the above hypotheses (2.4.14), the function γ(∆) in monotonically increasing in ∆. Proof: Let ∆1 < ∆2 . Then −∆2 < −∆1 and, hence, from Lemma 2.4.1, we have SR+ (−∆2 ) ≥ SR+ (−∆1 ). By applying Theorem 2.4.11, the desired result, γ(∆2 ) ≥ γ(∆1 ), follows from the following: 1 − γ(∆2 ) = = ≤ = =

P∆2 [SR+ (0) < cα,n1 ,n2 ] P0 [SR+ (−∆2 ) < cα,n1 ,n2 ] P0 [SR+ (−∆1 ) < cα,n1 ,n2 ] P∆1 [SR+ (0) < cα,n1 ,n2 ] 1 − γ(∆1 ) .

From this we immediately have that the MWW test is unbiased; that is, its power function evaluated at an alternative is always at least as large as its level of significance. We state it as a corollary. Corollary 2.4.3. For the above hypotheses (2.4.14), γ(∆) ≥ α for all ∆ > 0. A more general null hypothesis is given by H0∗ : ∆ ≤ 0 versus HA : ∆ > 0 . If T is any test for these hypotheses with critical region C then we say T is a size α test provided sup P∆ [T ∈ C] = α . ∆≤0

For selected α, it follows from the monotonicity of the MWW power function that the MWW test has size α for this more general null hypothesis. From the above theorems, we have that the MWW power function is monotonically increasing in ∆. Since SR+ (∆) achieves its maximum for ∆ finite, we have by Theorem 1.5.2 of Chapter 1 that the MWW test is resolving; hence, its power function approaches one as ∆ → ∞. Even for the location model, though, we cannot get the power function of the MWW test in closed form. For local alternatives, however, we can obtain an asymptotic expression for the power function. Applications of this result include sample size determination for the MWW test and efficiency comparisons of the MWW with other tests, both of which we consider. We will need the assumption that the density f (x) had finite Fisher Information, i.e., (E.1) f is absolutely continuous, 0 < I(f ) =

R1 0

ϕ2f (u) du < ∞ ,

(2.4.16)

94

CHAPTER 2. TWO SAMPLE PROBLEMS

where ϕf (u) = −

f ′ (F −1 (u)) . f (F −1(u))

(2.4.17)

As discussed in Section 3.4, assumption (E.1) implies that f is uniformly bounded. Once again we will consider the one sided alternative, (2.4.14), (similar results hold for the other one sided and two sided alternatives). Consider a sequence of local alternatives of the form δ HAn : ∆n = √ , (2.4.18) n where δ > 0 is arbitrary but fixed. As a first step, we need to show that SR+ (∆) is Pitman regular as discussed in Chapter + 1. Let S R (∆) = SR+ (∆)/(n1 n2 ). We need to verify the four conditions of Definition 1.5.3 of Chapter 1. The first condition is true by Lemma 2.4.1 and the fourth condition follows from Corollary 2.4.2. By (b) of Theorem 2.4.4, we have +

µ(∆) = E∆ [S R (0)] = 1 − E[F (X − ∆)] .

(2.4.19)

R 2 R By assumption (E.1), (2.4.16), f (x) dx ≤ sup f f (x) dx < ∞. Hence differentiating R (2.4.19) we obtain µ′ (0) = f 2 (x)dx > 0 and, thus, the second condition is true. Hence we + need only show that the third condition, asymptotic linearity of S R (∆) is true. This will follow provided we can show the variance condition (1.5.17) of Theorem 1.5.6 is true. Note that √ √ + + S R (δ/ n) − S R (0) = (n1 n2 )−1 #(0 < Yj − Xi ≤ δ/ n) . This is similar to the MWW statistic itself. Using essentially the same argument as that for the variance of the MWW statistic, Theorem 2.4.5 we get √ + + nVar0 [S R (δ/ n) − S R (0)] =

n(n1 − 1) n (an − a2n ) + (bn − cn ) n1 n2 n1 n2 n(n2 − 1) (dn − a2n ) , + n1 n2 √ 2 √ where an = E 0 [F (X + δ/ n) − F (X)], bn = E0 [(F (Y ) − F (Y − δ/ n)) ], cn = E0 [(F (Y ) − √ √ F (Y − δ/ n))], and dn = E0 [(F (X + δ/ n) − F (X))2 ]. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that an , bn , cn , and dn all converge to 0. Therefore Condition (1.5.17) of Theorem 1.5.6 holds and we have thus established the asymptotic linearity result given by:   Z P 1/2 + √ 2 1/2 + sup n S R (δ/ n) − n S R (0) + δ f (x) dx → 0 , (2.4.20) |δ|≤B

for any B > 0. Therefore, it follows that SR+ (∆) is Pitman regular.

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

95

In order to get the efficacy of the MWW test, we need the quantity σ 2 (0) defined by σ 2 (0) = lim nVar0 (S R (0)) n→0

nn1 n2 (n + 1) = (12λ1 λ2 )−1 ; n→0 n21 n22 12

= lim

see expression (1.5.12). Therefore by (1.5.11) the efficacy of the MWW test is p p √ Z 2 ′ cM W W = µ (0)/σ(0) = λ1 λ2 12 f (x) dx = λ1 λ2 τ −1 ,

(2.4.21)

where τ is the scale parameter given by

√ Z 2 τ = ( 12 f (x)dx)−1 .

(2.4.22)

2.13.10 it is shown that the efficacy of the two sample pooled t-test is √ In Exercise −1 λ1 λ2 σ where σ 2 is the common variance of X and Y . Hence the efficiency of the the MWW test to the two sample t test is the ratio σ 2 /τ 2 . This of course is the same efficiency as that of the signed rank Wilcoxon test to the one sample t test; see (1.7.13). In particular if the distribution of X is normal then the efficiency of the MWW test to the two sample t test is .955. For heavier tailed distributions, this efficiency is usually larger than 1; see Example 1.7.1. As in Chapter 1 it is convenient to summarize the asymptotic linearity result as follows: ) ) ( + √ ( + √ √ S R (δ/ n) − µ(0) S R (0) − µ(0) = n − cM W W δ + op (1) , (2.4.23) n σ(0) σ(0) uniformly for |δ| ≤ B and any B > 0. The next theorem is the asymptotic power lemma for the MWW test. As in Chapter 1, (see Theorem 1.5.8), its proof follows from the Pitman regularity of the MWW test. Theorem 2.4.13. Under the sequence of local alternatives, (2.4.18),   Z p 2 lim γ(∆n ) = P0 [Z ≥ zα − cM W W δ] = 1 − Φ zα − 12λ1 λ2 f δ ; , n→∞

where Z is N(0, 1).

In Exercise 2.13.10, it is shown that if γLS (∆) denotes the power function of the usual two sample t-test then   p δ , (2.4.24) lim γLS (∆n ) = 1 − Φ zα − λ1 λ2 n→∞ σ where σ 2 is the common variance of X and Y . By comparing these two power functions, it is seen that the Wilcoxon is asymptotically more powerful if τ < σ, i.e., if e = c2M W W /c2t > 1.

96

CHAPTER 2. TWO SAMPLE PROBLEMS

As an application of the asymptotic power lemma, we consider sample size determination. Consider the MWW test for the one sided hypothesis (2.4.14). Suppose the level, α, and the power, β, for a particular alternative ∆A are specified. For convenience, assume ∗ equal sample sizes, i. e. n1 = n n∗ denotes the common sample size; hence, 2 = n where √ √ λ1 = λ2 = 2−1 . Express ∆A as 2n∗ ∆A / 2n∗ . Then by Theorem 2.4.13 we have " # r √ 1 2n∗ ∆A . β = 1 − Φ zα − , 4 τ But this implies and

√ √ zβ = zα − τ −1 n∗ ∆A / 2 n∗ =



zα − zβ ∆A

2

(2.4.25)

2τ 2 .

The above value of n∗ is the approximate sample size. Note that it does depend on τ which, in applications, would have to be guessed or estimated in a pilot study; see the discussion in Section 2.4.5, (estimates of τ are discussed in Sections 2.4.5 and 3.7.1). For a specified distribution it can be evaluated; for instance, p if the underlying density is assumed to be normal with standard deviation σ then τ = π/3σ. Using (2.4.24) a similar derivation can be obtained for the usual two sample t-test, resulting in an approximate sample size of  2 zα − zβ ∗ nLS = 2σ 2 . ∆A The ratio of the sample size needed by the MWW test to that of the two sample t test is τ 2 /σ 2 . This provides additional motivation for the definition of efficiency.

2.4.4

Estimation of ∆

Recall from the geometry earlier in this chapter that the estimate of ∆ based on the rankb R = med i,j {Yj − Xi }, (2.2.18). We now obtain several properties of pseudo norm is ∆ this estimate including its asymptotic distribution. This will lead again to the efficiency properties of the rank based methods discussed in the last section. b R = ∆(Y, b For convenience, we note some equivariances of ∆ X), which are established in b Exercise 2.13.11. First, ∆R is translation equivariant; i.e., b R (Y + ∆ + θ, X + θ) = ∆ b R (Y, X) + ∆ , ∆

b R is scale equivariant; i.e., for any ∆ and θ. Second, ∆

b R (aY, aX) = a∆ b R (Y, X) , ∆

b R is an unbiased estimate of ∆ under certain for any a. Based on these we next show that ∆ conditions.

2.4. INFERENCE BASED ON THE MANN-WHITNEY-WILCOXON

97

Theorem 2.4.14. If the errors, e∗i , in the location model (2.2.4) are symmetrically disb R is symmetrically distributed about ∆. tributed about 0, then ∆ Proof: Due to translation equivariance there is no loss of generality in assuming that ∆ and θ are 0. Then Y and X are symmetrically distributed about 0; hence, L(Y ) = L(−Y ) and L(X) = L(−X). Thus from the above equivariance properties we have, b b b L(−∆(Y, X)) = L(∆(−Y, −X) = L(∆(Y, X)) .

b R is symmetrically distributed about 0, and, in general it is symmetrically disTherefore ∆ tributed about ∆.

b R is symmetrically distributed Theorem 2.4.15. Under Model (2.2.4), if n1 = n2 then ∆ about ∆.

b R may be biased if the The reader is asked to prove this in Exercise 2.13.12. In general, ∆ b R is error distribution is not symmetrically distributed but as the following result shows ∆ + always asymptotically unbiased. Since the MWW process SR (∆) was shown to be Pitman √ b regular the asymptotic distribution of n(∆ − ∆) is N(0, c−2 M W W ). In practice, we say b R has an approximate N(∆, τ 2 (n−1 + n−1 )) distribution, ∆ 1 2

where τ was defined in (2.4.22). Recall from Definition 1.5.4 of Chapter 1, that the asymptotic relative efficiency of two Pitman regular estimators is the reciprocal of the ratio of their aymptotic variances. As b LS = Y − X of ∆ is approximately Exercise 2.13.10 shows, the least squares estimate ∆    N ∆, σ 2 n11 + n12 ; hence, 2 b R, ∆ b LS ) = σ = 12σf2 e(∆ τ2

Z

2 f (x) dx . 2

This agrees with the asymptotic relative efficiency results for the MWW test relative to the t test and (1.7.13).

2.4.5

Efficiency Results Based on Confidence Intervals

Let L1−α be the length of the (1 − α)100% distribution free confidence interval based on the MWW statistic discussed in Section 2.4.2. Since this interval is based on the Pitman regular process SR+ (∆), it follows from Theorem 1.5.9 of Chapter 1 that r n1 n2 L1−α P →τ ; (2.4.26) n 2zα/2 that is, the standardized length of a distribution-free confidence interval is a consistent estimate of the scale parameter τ . It further follows from (2.4.26) that, as in Chapter 1, if

98

CHAPTER 2. TWO SAMPLE PROBLEMS

efficiency is based on the relative squared asymptotic lengths of confidence intervals then we obtain the same efficiency results as quoted above for tests and estimates. In the RBR computational function twosampwil a simple degree of freedom adjustment is made in the estimation of τ . In the traditional LS analysis based on the pooled t, this adjustment is equivalent to dividing the pooled estimate of variance by n1 + n2 − 2 instead of n1 + n2 . Hence, as our estimate of τ , the function twosampwil uses r r n1 + n2 n1 n2 L1−α τb = . (2.4.27) n1 + n2 − 2 n 2zα/2 p b R is given by τb (1/n1 ) + (1/n2 ). Thus the standard error (SE) of the estimator ∆ b R . Often in practice The distribution free confidence interval is not symmetric about ∆ b R we can formulate symmetric intervals are desired. Based on the asymptotic distribution of ∆ the approximate interval r 1 1 b ∆R ± zα/2 τˆ + , (2.4.28) n1 n2 where τˆ is a consistent estimate of τ . If we use (2.4.26) as our estimate of τ with the level α, then the confidence interval simplifies to b R ± L1−α . ∆ 2

(2.4.29)

Besides the estimate given in (2.4.26), a consistent estimate of τ was proposed by by Koul, Sievers and McKean (1987) and will be discussed in Section 3.7. Using this estimate small sample studies indicate that zα/2 should be replaced by the t critical value t(α/2,n−1) ; see McKean and Sheather (1991) for a review of small sample studies on R-estimates. In b R is directly analogous to the usual this case, the symmetric confidence interval based on ∆ t interval based on least squares in that the only difference is that σ b is replaced by τb.

Example 2.4.1. Hendy and Charles Coin Data, continued from Examples 1.11.1 and 2.3.2

Recall from Chapter 1 that this example concerned the silver content in two coinages (the first and the fourth) minted during the reign of Manuel I. The data are given in Chapter 1. The Hodges-Lehmann estimate of the difference between the first and the fourth coinage is 1.10 percent of silver and a 95% confidence interval for the difference is (.60, 1.70). The length of this confidence interval is 1.10; hence, the estimate of τ given in expression (2.4.27) is 0.595. The symmetrized confidence interval (2.4.28) based on the t upper .025 critical value is (0.46, 1.74). Both of these intervals are in agreement with the confidence interval obtained in Example 1.11.1 based on the two L1 confidence intervals. Another estimate of τ can be obtained from a similar consideration of the distribution free confidence intervals based on the signed-rank statistic discussed in Chapter 1; see Exercise 2.13.13. Note in this case for consistency, though, we would have to assume that f is symmetric.

2.5. GENERAL RANK SCORES

2.5

99

General Rank Scores

In this section we will be concerned with the location model; i.e., X1 , . . . , Xn1 are iid F (x), Y1 , . . . , Yn2 are iid G(x) = F (x−∆), and the samples are independent of one another. We will present an analysis for this problem based on general rank scores. In this terminology, the Mann-Whitney-Wilcoxon procedures are based on a linear score function. We will present the results for the hypotheses H0 : ∆ = 0 versus H0 : ∆ > 0 .

(2.5.1)

The results for the other one sided and two sided alternatives are similar. We will also be concerned with estimation and confidence intervals for ∆. As in the preceeding sections we will first present the geometry. Recall that the pseudo norm which generated the MWW analysis could be written as a linear combination of ranks times residuals. This is easily generalized. Consider the function kuk∗ =

n X

a(R(ui ))ui ,

(2.5.2)

i=1

P where a(i) are scores such that a(1) ≤ · · · ≤ a(n) and a(i) = 0. For the next theorem, we will also assume that a(i) = −a(n + 1 − i); although, this is only used to show the scalar multiplicative property. P Theorem 2.5.1. Suppose that a(1) ≤ · · · ≤ a(n), a(i) = 0, and a(i) = −a(n + 1 − i). Then the function k · k∗ is a pseudo-norm. Proof: By the connection between ranks and order statistics we can write kuk∗ =

n X

a(i)u(i) .

i=1

Next suppose that u(j) is the last order statistic with a negative score. Since the scores sum to 0, we can write kuk∗ = =

n X i=1

X i≤j

a(i)(u(i) − u(j) ) a(i)(u(i) − u(j) ) +

X i≥j

a(i)(u(i) − u(j) ) .

(2.5.3)

Both terms on the right side are nonnegative; hence, kuk∗ ≥ 0. Since all the terms in (2.5.3) are nonnegative, kuk∗ = 0 implies that all the terms are zero. But since the scores are not all 0, yet sum to zero, we must have a(1) < 0 and a(n) > 0. Hence we must have u(1) = u(j) = u(n) ; i.e., u(1) = · · · = u(n) . Conversely if u(1) = · · · = u(n) then kuk∗ = 0. By the condition a(i) = −a(n + 1 − i) it follows that kαuk∗ = |α|kuk∗ ; see Exercise 2.13.16.

100

CHAPTER 2. TWO SAMPLE PROBLEMS

In order to complete the proof we need to show the triangle inequality holds. This is established by the following string of inequalities: ku + vk∗ = = ≤

n X

a(R(ui + vi ))(ui + vi )

i=1

n X

i=1 n X

a(R(ui + vi ))ui +

n X

a(R(ui + vi ))vi

i=1

a(i)u(i) +

i=1

n X

a(i)v(i)

i=1

= kuk∗ + kvk∗ .

The proof of the above inequality is similar to that of Theorem 1.3.2 of Chapter 1. Based on a set of scores satisfying the above assumptions, we can establish a rank inference for the two sample problem similar to the MWW analysis. We shall do so for general rank scores of the form aϕ (i) = ϕ(i/(n + 1)) , (2.5.4) where ϕ(u) satisfies the following assumptions  ϕ(u) is a nondecreasing function defined on the interval (0, 1) R1 R1 2 ; ϕ(u) du = 0 and ϕ (u) du = 1 0 0

(2.5.5)

see (S.1), (3.4.10) in Chapter 3, also. The last assumptions concerning standardization of the scores are for convenience. The Wilcoxon scores are generated in this way by the linear √ function ϕR (u) = 12(u − (1/2)) and the sign scores are generated by ϕS (u) = sgn(2u − 1). We will denote the corresponding pseudo norm for scores generated by ϕ(u) as kukϕ =

n X

aϕ (R(ui ))ui .

(2.5.6)

i=1

These two sample sign and Wilcoxon scores are generalizations of the sign and Wilcoxon scores discussed in Chapter 1 for the one sample problem. In Section 1.8 of Chapter 1 we presented one sample analyses based on general score functions. Similar to the sign and Wilcoxon cases, we can generate a two sample score function from any one sample score function. For reference we establish this in the following theorem: Theorem 2.5.2. As discussed at the beginning of Section 1.8, let ϕ+ (u) be a score function for the one sample problem. For u ∈ (−1, 0), let ϕ+ (u) = −ϕ+ (−u). Define, ϕ(u) = ϕ+ (2u − 1) , for u ∈ (0, 1) . and kxkϕ =

n X i=1

ϕ(R(xi )/(n + 1))xi .

(2.5.7)

(2.5.8)

2.5. GENERAL RANK SCORES

101

Then k · kϕ is a pseudo-norm on Rn . Furthermore

and

Z

ϕ(u) = −ϕ(1 − u) ,

(2.5.9)

Z

(2.5.10)

1 2

ϕ (u) du =

0

1

(ϕ+ (u))2 du .

0

Proof: As discussed in the beginning of Section 1.8, (see expression (1.8.1)), ϕ+ (u) is a positive valued and nondecreasing function defined on theR interval (0, 1). Based on these 1 properties, it follows that ϕ(u) is nondecreasing and that o ϕ(u) du = 0. Hence k · kϕ is a pseudo-norm on Rn . Properties (2.5.9) and (2.5.10) follow readily; see Exercise 2.13.17 for details. The two sample sign and Wilcoxon scores, cited above, are easily √seen to be generated this way from their one sample counterparts ϕ+ (u) = 1 and ϕ+ (u) = 3u, respectively. As discussed further in Section 2.5.3, properties such as efficiencies of the analysis based on the one sample scores are the same for a two sample analysis based on their corresponding two sample scores. In the notation of (2.2.3), the estimate of ∆ is b ϕ = Argmin kZ − C∆kϕ . ∆

Denote the negative of the gradient of kZ − C∆kϕ by Sϕ (∆). Then based on (2.5.6), Sϕ (∆) =

n2 X j=1

aϕ (R(Yj − ∆)) .

b ϕ equivalently solves the equation, Hence ∆

. b ϕ) = Sϕ ( ∆ 0.

(2.5.11)

(2.5.12)

As with pseudo norms in general, the function kZ − C∆kϕ is a convex function of ∆. The negative of its derivative, Sϕ (∆), is a decreasing step function of ∆ which steps down at the differences Yj − Xi ; see Exercise 2.13.18. Unlike the MWW function SR (∆), the step sizes of Sϕ (∆) are not necessarily the same size. Based on MWW starting values, a simple trace b ϕ . The R function algorithm through the differences can be used to obtain the estimator ∆ twosampr2 computes the rank-based analysis for general scores. The gradient rank test statistic for the hypotheses (2.5.1) is Sϕ =

n2 X j=1

aϕ (R(Yj )) .

(2.5.13)

102

CHAPTER 2. TWO SAMPLE PROBLEMS

Since the test statistic only depends on the ranks of the combined sample it is distribution free under the null hypothesis. As shown in Exercise 2.13.18, E0 [Sϕ ] = 0

(2.5.14)

σϕ2 = V0 [Sϕ ] = Note that we can write the variance as n1 n2 σϕ2 = n−1

( n X i=1

n1 n2 n(n − 1)

1 a2 (i) n

)

n X

a2 (i) .

(2.5.15)

i=1

. n1 n2 = , n−1

(2.5.16)

where R 2 the approximation is due to the fact that the term in braces is a Riemann sum of ϕ (u)du = 1 and, hence, converges to 1. It will be convenient from time to time to use rank statistics based on unstandardized scores; i.e., a rank statistic of the form Sa =

n2 X

a(R(Yj )) ,

(2.5.17)

j=1

where a(i) = ϕ(i/(n + 1)), i = 1, . . . , n is a set of scores. As Exercise 2.13.18 shows the null mean µS and null variance σS2 of Sa are given by n1 n2 X (a(i) − a)2 . µS = n2 a and σS2 = (2.5.18) n(n − 1)

2.5.1

Statistical Methods

The asymptotic null distribution of the statistic Sϕ , (2.5.13), easily follows from Theorem A.2.1 of the Appendix. To see this note that we can use the notation (2.2.1) and (2.2.2) to write Sϕ as a linear rank statistic; i.e.,   n n X X n Sϕ = ci a(R(Zi )) = (ci − c)a Fn (Zi ) , (2.5.19) n+1 i=1 i=1 where Fn is the empirical distribution function of Z1 , . . . , Zn . Our score function ϕ is monotone and square integrable; hence, the conditions on scores in Section A.2 are satisfied. Also F is continuous so the distributional assumption is satisfied. Finally, we need only show that the constants ci satisfy conditions, D.2, (3.4.7), and D.3, (3.4.8). It is a simple exercise to show that n X n1 n2 (ci − c)2 = n i=1  2 2 n2 n1 2 max (ci − c) = max . , 1≤i≤n n2 n2

2.5. GENERAL RANK SCORES

103

Under condition (D.1), (2.4.7), 0 < λi < 1 where lim(ni /n) = λi for i = 1, 2. Using this along with the last two expressions, it is immediate that Noether’s condition, (3.4.9), holds for the ci ’s. Thus the assumptions of Section A.2 hold for the statistic Sϕ . As in expression (A.2.7) of Section A.2, define the random variable Tϕ as Tϕ =

n X i=1

(ci − c¯)ϕ(F (Zi)) .

(2.5.20)

By comparing expressions (2.5.19) and (2.5.20), it seems that the variable Tϕ is an approximation of Sϕ . This follows from Section A.2. Briefly, under H0 the distribution of Tϕ is approximately normal and Var((Tϕ − Sϕ )/σϕ ) → 0; hence, Sϕ is asymptotically normal with mean and variance given by expressions (2.5.14) and (2.5.15), respectively. Hence, an asymptotic level α test of the hypotheses (2.5.1) is Reject H0 in favor of HA , if Sϕ ≥ zα σϕ , where σϕ is defined by (2.5.15). b ϕ of ∆ solves the equation (2.5.12). The interval As discussed above, the estimate ∆ b L, ∆ b U ) is a (1 − α)100% confidence interval for ∆ (based on the asymptotic distribution) (∆ b L and ∆ b U solve the equations provided ∆ p p . . bU) = b L) = Sϕ ( ∆ −zα/2 n1nn2 and Sϕ (∆ zα/2 n1nn2 ,

(2.5.21)

where 1 − Φ(zα/2 ) = α/2. As with the estimate of ∆, these equations can be easily solved with an iterative algorithm; see Exercise 2.13.18.

2.5.2

Efficiency Results

In order to obtain the efficiency results for these statistics, we first show that the process Sϕ (∆) is Pitman regular. For general scores we need to further assume that the density has finite Fisher information; Ri.e., satisfies condition (E.1), (2.4.16). Recall that Fisher 1 information is given by I(f ) = 0 ϕ2F (u) du, where ϕf (u) = −

f ′ (F −1 (u)) . f (F −1(u))

(2.5.22)

Below we will show that the score function ϕf is optimal. Define the parameter τϕ as, τϕ−1

=

Z

Estimation of τϕ is dicussed in Section 3.7.

ϕ(u)ϕf (u)du .

(2.5.23)

104

CHAPTER 2. TWO SAMPLE PROBLEMS

To show that the process Sϕ (∆) is Pitman regular, we show that the four conditions of Definition 1.5.3 are true. As noted after expression (2.5.12), Sϕ (∆) is nonincreasing; hence, the first condition holds. For the second condition, note that we can write Sϕ (∆) =

n2 X i=1

  n2 X n1 n2 a(R(Yi − ∆)) = ϕ Fn (Yi − ∆) + Fn (Yi) , n+1 1 n+1 2 i=1

(2.5.24)

where Fn1 and Fn2 are the empirical cdfs of the samples X1 , . . . , Xn1 and Y1 , . . . , Yn2 , respectively. Hence, passing to the limit we have,   Z ∞ 1 Sϕ (∆) → λ2 ϕ[λ1 F (x) + λ2 F (x − ∆)]f (x − ∆) dx E0 n −∞ Z ∞ = λ2 ϕ[λ1 F (x + ∆) + λ2 F (x)]f (x) dx = µϕ (∆) ; (2.5.25) −∞

see Chernoff and Savage (1958) for a rigorous proof of the limit. Differentiating µϕ (∆) and evaluating the derivative at 0 we obtain Z ∞ ′ µϕ (0) = λ1 λ2 ϕ′ [F (t)]f 2 (t) dt  ′  Z−∞ ∞ f (t) f (t) dt = λ1 λ2 ϕ[F (t)] − f (t) −∞ Z 1 = λ1 λ2 ϕ(u)ϕf (u) du = λ1 λ2 τϕ−1 > 0 . (2.5.26) 0

Hence, the second condition is satisfied. The null asymptotic distribution of Sϕ (0) was established in the Section 2.5.1; hence the fourth condition is true. Hence, we need only establish asymptotic linearity. This result follows from the results for general rank regression statistics which are developed in Section A.2.2 of the Appendix. By Theorem A.2.8 of the Appendix, the asymptotic linearity result for Sϕ (∆) is given by √ 1 1 √ Sϕ (δ/ n) = √ Sϕ (0) − τϕ−1 λ1 λ2 δ + op (1) , n n

(2.5.27)

uniformly for |δ| ≤ B, where B > 0 and τϕ is defined in (2.5.23). Therefore, following Definition 1.5.3 of Chapter 1, the estimating function is Pitman regular. √ By the discussion following (2.5.20), we have that n−1/2 Sϕ (0)/ λ1 λ2 is asymptotically N(0, 1). The efficacy of the test based on Sϕ is thus given by p τϕ−1 λ1 λ2 cϕ = √ = τϕ−1 λ1 λ2 . λ1 λ2

(2.5.28)

2.5. GENERAL RANK SCORES

105

As with the MWW analysis, several important items follow immediately from Pitman regularity. Consider first the behavior of Sϕ under local alternatives. Specifically consider a level α test based on Sϕ for the hypothesis (2.5.1) and the sequence of local alternatives √ Hn : ∆n = δ/ n. As in Chapter 1, it is easy to show that the asymptotic power of the test based on Sϕ is given by lim Pδ/√n [Sϕ ≥ zα σϕ ] = 1 − Φ(zα − δcϕ ) .

n→∞

(2.5.29)

Based on this result, sample size determination for the test based on Sϕ can be conducted similar to that based on the MWW test statistic; see (2.4.25). b ϕ . Recall that the Next consider the asymptotic distribution of the estimator ∆ . b ϕ solves the equation Sϕ (∆ b ϕ ) = 0. Based on Pitman regularity and Theorem estimate ∆ b ϕ is given by 1.5.7 of Chapter 1 the asymptotic distribution ∆ √

D

b ϕ − ∆) → N(0, τ 2 (λ1 λ2 )−1 ) ; n(∆ ϕ

(2.5.30)

By using (2.5.27) and Tϕ (0) to approximate Sϕ (0), we have the following useful result: √

b = n∆

τϕ 1 √ Tϕ (0) + op (1) . λ1 λ2 n

(2.5.31)

We want to select scores such that the efficacy cϕ , (2.5.28), is as large as possible, or b ϕ is as small as possible. How large can equivalently such that the asymptotic variance of ∆ the efficacy be? Similar to (1.8.26), note that we can write Z −1 ϕ(u)ϕf (u)du τϕ = sZ R ϕ(u)ϕf (u)du qR ϕ2f (u)du qR = 2 ϕf (u)du ϕ2 (u)du sZ = ρ

ϕ2f (u)du .

(2.5.32)

The second equation is true since the were standardized as above. In the third equation R scores 2 ρ is a correlation coefficient and ϕf (u)du is Fisher location information, (2.4.16), which we denoted by I(f ). By the Rao-Cram´er lower bound, the smallest asymptotic variance obtainable by an asymptotically unbiased estimate is (λ1 λ2 I(f ))−1. Such an estimate is called asymptotically efficient. Choosing a score function to maximize (2.5.32) is equivalent to choosing a score function to make ρ = 1. This can be achieved by taking the score function b ϕ , is asymptotically efficient. Of to be ϕ(u) = ϕf (u), (2.5.22). The resulting estimate, ∆ course this can be accomplished only provided that the form of f is known; see Exercise 2.13.19. Evidently, the closer the chosen score is to ϕf , the more powerful the rank analysis will be.

106

CHAPTER 2. TWO SAMPLE PROBLEMS

In Exercise 2.13.19, the reader is ask to show that the MWW analysis is asymptotically efficient if the errors have a logistic distribution. For normal errors, it follows in a few steps from expression (2.4.17) that the optimal scores are generated by the normal scores function, ϕN (u) = Φ−1 (u) , (2.5.33) where Φ(u) is the distribution function of a standard normal random variable. Exercise 2.13.19 shows that this score function is standardized. These scores yield an asymptotically efficient analysis if the the errors truly have a normal distribution and, further, e(ϕN , L2 ) ≥ 1; see Theorem 1.8.1. Also, unlike the Mann-Whitney-Wilcoxon analysis, the estimate of the shift ∆ based on the normal scores cannot be obtained in closed form. But as mentioned above for general scores, provided the score function is nondecreasing, simple iterative algorithms can be used to obtain the estimate and the corresponding confidence interval for ∆. In the next sections we will discuss analyses that are asymptotically efficient for other distributions. Example 2.5.1. Quail Data, continued from Example 2.3.1 In the larger study, McKean et al. (1989), from which these data were drawn, the responses were positively skewed with long right tails; although, outliers frequently occurred in the left tail also. McKean et al. conducted an investigation of estimates of the score functions for over 20 of these experiments. Classes of simple scores which seemed appropriate for such data were piecewise linear with one piece which is linear on the first part on the interval (0, b) and with a second piece which is constant on the second part (b, 1); i.e., scores of the form  2 u − 1 if 0 < u < b b(2−b) ϕb (u) = . (2.5.34) b if b ≤ u < 1 2−b These scores are optimal for densities with left logistic and right exponential tails; see Exercise 2.13.19. A value of b which seemed appropriate for this type of data was 3/4. Let P S3/4 = a3/4 (R(Yj )) denote the test statistic based on these scores. The RBR function phibentr with the argument param = 0.75 computes these scores. Using the RBR function twosampr2 with the argument score = phibentr, computes the rank-based analysis for the score function (2.5.34). Assuming that the treated and control observations are in x and y, respectively, the call and the resulting analysis for a one sided test as computed by R is: > tempb = twosampr2(x,y,test=T,alt=1,delta0=0,score=phibentr,grad=sphir, param=.75,alpha=.05,maktable=T) Test of Delta = 0 Alternative selected is Standardized (z) Test-Statistic 1.787738 Estimate

15.5

SE is

7.921817

1 and p-vlaue 0.03690915

2.5. GENERAL RANK SCORES

107

95 % Confidence Interval is ( -2 , 28 ) Estimate of the scale parameter tau 20.45404 Comparing p-values, the analysis based on the score function (2.5.34) is a little more precise than the MWW analysis given in Example 2.3.1. Recall that the data are right skewed, so this result is not surprising. For another class of scores similar to (2.5.34), see the discussion around expression (3.10.6) in Chapter 3.

2.5.3

Connection between One and Two Sample Scores

In Theorem 2.5.2 we discussed how to obtain a corresponding two sample score function given a one sample score function. Here we reverse the problem, showing how to obtain a one sample score function from a two sample score function. This will provide a natural estimate of θ in (2.2.4). We also show the efficiencies and asymptotic properties are the same for such corresponding scores functions. Consider the location model but further assume that X has a symmetric distribution. Then Y also has a symmetric distribution. For associated one sample problems, we could then use the signed rank methods developed in Chapter 1. What one sample scores should we select? First consider what two sample scores would be suitable under symmetry. Assume without loss of generality that X is symmetrically distributed about 0. Recall that the optimal scores are given by the expression (2.5.22). Using the fact that F (x) = 1 − F (−x), it is easy to see (Exercise 2.13.20) that the optimal scores satisfy, ϕf (−u) = −ϕf (1 − u) , for 0 < u < 1 , that is, the optimal score function is odd about 12 . Hence for symmetric distributions, it makes sense to consider two sample scores which are odd about 12 . For this sub-section then assume that the two sample score generating function satisfies the property (S.3) ϕ(1 − u) = −ϕ(u) . (2.5.35) Note that such scores satisfy: ϕ(1/2) = 0 and ϕ(u) ≥ 0 for u ≥ 1/2. Define a one sample score generating function as   u+1 + ϕ (u) = ϕ (2.5.36) 2

and the one sample scores as

+

a (i) = ϕ

+



i n+1



.

(2.5.37)

It follows that these one sample scores are nonnegative and nonincreasing. For example, if two sample scores, that is, scores generated by the √ we use Wilcoxon  function, ϕ(u) = 12 u − 12 then the associated one sample score generating function is

108

CHAPTER 2. TWO SAMPLE PROBLEMS

√ ϕ+ (u) = 3u and, hence, the one sample scores are the Wilcoxon signed-rank scores. If instead we use the two sample sign scores, ϕ(u) = sgn(2u − 1) then the one sample score function is ϕ+ (u) = 1. This results in the one sample sign scores. Suppose we use two sample scores which satisfy (2.5.35) and use the associated one sample scores. Then the corresponding one and two sample efficacies satisfy p (2.5.38) c ϕ = λ 1 λ 2 c ϕ+ , where the efficacies are given by expressions (2.5.28) and (1.8.21). Hence the efficiency and asymptotic properties of the one and two sample analyses are the same. As a final remark, if we write the model as in expression (2.2.4), then we can use the rank statistic based on b i . Then using the one the two sample to estimate ∆. We next form the residuals Zi − ∆c sample scores statistic of Chapter 1, we can estimate θ based on these residuals, as discussed in Chapter 1. In terms of a regression problem we are estimating the intercept parameter θ based on the residuals after fitting the regression coefficient ∆. This is discussed in some detail in Section 3.5.

2.6

L1 Analyses

In this section, we present analyses based on the L1 norm and pseudo norm. We discuss the pseudo norm first, showing that the corresponding test is the familiar Mood’s (1950) test. The test which corresponds to the norm is Mathisen’s (1943) test.

2.6.1

Analysis Based on the L1 Pseudo Norm

Consider the sign scores. These are the scores generated by the function ϕ(u) = sgn(u−1/2). The corresponding pseudo norm is given by,   n X n+1 kukϕ = sgn R(ui) − ui . (2.6.1) 2 i=1 This pseudo norm is optimal for double exponential errors; see Exercise 2.13.19. We have the following relationship between the L1 pseudo norm and the L1 norm. Note that we can write   n X n+1 kukϕ = sgn i − u(i) . 2 i=1 Next consider,

n X i=1

|u(i) − u(n−i+1) | =

n X

sgn(u(i) − u(n−i+1) )(u(i) − u(n−i+1) )

i=1 n X

= 2

i=1

sgn(u(i) − u(n−i+1) )u(i) .

2.6.

L1 ANALYSES

109

Finally note that 

n+1 sgn(u(i) − u(n−i+1) ) = sgn(i − (n − i + 1)) = sgn i − 2



.

Putting these results together we have the relationship, n X i=1

|u(i) − u(n−i+1) | = 2

n X i=1

  n+1 sgn i − u(i) = 2kukϕ . 2

(2.6.2)

Recall that the pseudo norm based Wilcoxon scores can be expressed as the sum of all absolute differences between the components; see (2.2.17). In contrast the pseudo norm based on the sign scores only involves the n symmetric absolute differences |u(i) − u(n−i+1) |. In the two sample location model the corresponding R-estimate based on the pseudo norm (2.6.1) is a value of ∆ which solves the equation   n+1 . =0. Sϕ (∆) = sgn R(Yj − ∆) − 2 j=1 n2 X

(2.6.3)

Note that we are ranking the set {X1 , . . . , Xn1 , Y1 − ∆, . . . , Yn2 − ∆} which is equivalent to ranking the set {X1 − med Xi , . . . , Xn1 − med Xi , Y1 − ∆ − med Xi , . . . , Yn2 − ∆ − med Xi }. We must choose ∆ so that half of the ranks of the Y part of this set are above (n + 1)/2 and half are below. Note that in the X part of the second set, half of the X part is below 0 and half is above 0. Thus we need to choose ∆ so that half of the Y part of this set is below 0 and half is above 0. This is achieved by taking b = med Yj − med Xi . ∆

(2.6.4)

c , M0+ = #(Yj > M)

(2.6.5)

This is the same estimate as produced by the L1 norm, see the discussion following (2.2.5). We shall refer to the above pseudo norm (2.6.1) as the L1 pseudo norm. Actually, as pointed out in Section 2.2, this equivalence between estimates based on the L1 norm and the L1 pseudo norm is true for general regression problems in which the model includes an intercept, as it does here. P 2 The corresponding test statistic for H0 : ∆ = 0 is nj=1 sgn(R(Yj ) − n+1 ). Note that 2 the sgn function here is only counting the number of Yj ’s which are above the combined c = med {X1 , . . . , Xn1 , Y1 , . . . , Yn2 } minus the number below M c. Hence a sample median M more convenient but equivalent test statistic is

which is called Mood’s median test statistic; see Mood (1950).

110

CHAPTER 2. TWO SAMPLE PROBLEMS

Testing Since this L1 -analysis is based on a rank-based pseudo-norm we could use the general theory discussed in Section 2.5 to handle the theory for estimation and testing. As we will point out, though, there are some interesting results pertaining to this analysis. For the null distribution of M0+ , first assume that n is even. Without loss of generality, assume that n = 2r and n1 ≥ n2 . Consider the combined sample as a population of n items, where n2 of the items are Y ’s and n1 items are X’s. Think of the n/2 items which exceed c. Under H0 these items are as likely to be an X as a Y . Hence, M0+ , the number of Y ’s M in the top half of the sample follows the hypergeometric distribution, i.e.,  n1  n2 P (M0+ = k) =

k

r−k  n r

k = 0, . . . , n2 ,

where r = n/2. If n is odd the same result holds except in this case r = (n − 1)/2. Thus as a level α decision rule, we would reject H0 : ∆ = 0 in favor of HA : ∆ > 0, if M0+ ≥ cα , where cα could be determined from the hypergeometric distribution or approximated by the binomial distribution. From the properties of the hypergeometic distribution, E0 [M0+ ] = r(n2 /n) and V0 [M0+ ] = (rn1 n2 (n − r))/(n2 (n − 1)). Under the assumption D.1, (2.4.7), it follows that the limiting distribution of M0+ is normal. Confidence Intervals Exercise 2.13.21 shows that, for n = 2r, c) = M0+ (∆) = #(Yj − ∆ > M

n2 X i=1

I(Y(i) − X(r−i+1) − ∆ > 0) ,

(2.6.6)

and furthermore that the n = 2r differences,

Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2 ) − X(r−n2 +1) , can be ordered only knowing the order statistics from the individual samples. It is further shown that if k is such that P (M0+ ≤ k) = α/2 then a (1 − α)100% confidence interval for ∆ is given by (Y(k+1) − X(r−k) , Y(n2 −k) − X(r−n2 +k+1) ) . The above confidence interval simplifies when n1 = n2 = m, say. In this case the interval becomes (Y(k+1) − X(m−k) , Y(m−k) − X(k+1) ) , which is the difference in endpoints of the two simple L1 confidence intervals (X(k+1) , X(m−k) ) and (Y(k+1) , Y(m−k) ) which were discussed in Section 1.11. Using the normal approximation

2.6.

L1 ANALYSES

111

p to the hypergeometric we have k = m/2 − Zα/2 m2 /(4(2m − 1)) − .5. Hence, the above two intervals have confidence coefficient !   p k − m/2 . γ = 1 − 2Φ p = 1 − 2Φ zα/2 m/(2m − 1) m/4  . = 1 − 2Φ zα/2 2−1/2 .

For example, for the equal sample size case, a 5% two sided Mood’s test is equivalent to rejecting the null hypothesis if the 84% one sample L1 confidence intervals are disjoint. While this also could be done for the unequal sample sizes case, we recommend the direct approach of Section 1.11. Efficiency Results b = We will obtain the efficiency results from the asymptotic distribution of the estimate, ∆ med Yj − med Xi of ∆. Equivalently, we could obtain the results by asymptotic linearity that was derived for arbitrary scores in (2.5.27); see Exercise 2.13.22. Theorem 2.6.1. Under the conditions cited in Example 1.5.2, (L1 Pitman regularity conditions), and (2.4.7), we have √

D

b − ∆) → N(0, (λ1 λ2 4f 2 (0))−1 ) . n(∆

(2.6.7)

Proof: Without loss of generality assume that ∆ and θ are 0. We can write, r r √ n√ n√ b n∆ = n2 med Yj − n1 med Xi . n2 n1

From Example 1.5.2, we have √ √

n2 med Yj =

n2 1 1 X sgnYj + op (1) √ 2f (0) n2 j=1

D

n2 med Yj → Z2 where Z2 is N(0, (4f 2 (0))−1 ). Likewise



D

n1 med Xi → Z1 where Z1 √ b D 2 −1 is N(0, (4f (0)) ). Since Z1 and Z2 are independent, we have that n∆ → (λ2 )−1/2 Z2 − (λ1 )−1/2 Z1 which yields the result. √ The efficacy of Mood’s test is thus λ1 λ2 2f (0). The asymptotic relative efficiency of Mood’s test to the two-sample t test is 4σ 2 f 2 (0), while its asymptotic relative efficiency with R the MWW test is f 2 (0)/(3( f 2 )2 ). These are the same as the efficiency results of the sign test to the t test and to the Wilcoxon signed-rank test, respectively, that were obtained in Chapter 1; see Section 1.7. hence,

Example 2.6.1. Quail Data, continued, Example 2.3.1

112

CHAPTER 2. TWO SAMPLE PROBLEMS

c = 64. For the subsequent For the quail data the median of the combined samples is M test based on Mood’s test we eliminated the three data points which had this value. Thus n = 27, n1 = 9 and n2 = 18. The value of Mood’s test statistic is M0+ = #(Pj > 64) = 11. Since EH0 (M0+ ) = 8.67 and VH0 (M0+ ) = 1.55, the standardized value (using the continuity correction) is 1.47 with a p-value of .071. Using all the data, the point estimate corresponding to Mood’s test is 19 while a 90% confidence interval, using the normal approximation, is (−10, 31).

2.6.2

Analysis Based on the L1 Norm

Another sign type procedure is based on the L1 norm. Reconsider expression (2.2.7) which is the partial derivative of the L1 dispersion function with respect to ∆. We take the parameter θ as a nuisance parameter and we estimate it by med Xi . An aligned sign test procedure for ∆ is then obtained by aligning the Yj ’s with respect to this estimate of θ. The process of interest, then, is n2 X S(∆) = sgn(Yj − med Xi − ∆) . j=1

A test of H0 : ∆ = 0 is based on the statistic Ma+ = #(Yj > med Xi ) .

(2.6.8)

This statistic was proposed by Mathisen (1943) and is also referred to as the control median . test; see Gastwirth (1968). The estimate of ∆ obtained by solving S(∆) = 0 is, of course, b = med Yj − med Xi . the L1 estimate ∆ Testing

Mathisen’s test statistic, similar to Mood’s, has a hypergeometric distribution under H0 . Theorem 2.6.2. Suppose n1 is odd and is written as n1 = 2n∗1 + 1. Then under H0 : ∆ = 0, P (Ma+ = t) =



n∗1 +t n∗1

n2 −t+n∗1 n∗1  n n1



, t = 0, 1, . . . , n2 .

Proof: The proof will be based on a conditional argument. Given X(n∗1 +1) = x, Ma+ is binomial with n2 trials and 1 − F (x) as the probability of success. The density of X(n∗1 +1) is f ∗ (x) =

n1 ! ∗ ∗ (1 − F (x))n1 F (x)n1 f (x) . ∗ 2 (n1 !)

2.6.

L1 ANALYSES

113

Using this and the fact that the samples are independent we get, Z   n2 + (1 − F (x))t F (x)n2 −t f (x)dx P (Ma = t) = t   Z n2 n1 ! ∗ ∗ (1 − F (x))t+n1 F (x)n1 +n2 −t f (x)dx = ∗ 2 t (n !)   1 Z 1 n2 n1 ! ∗ ∗ = (1 − u)t+n1 un1 +n2 −t du . ∗ 2 t (n1 !) 0 By properties of the β function this reduces to the result. Once again using the conditional argument, we obtain the moments of Ma+ as n2 2 n2 (n + 1) ; V0 [Ma+ ] = 4(n1 + 2)

E0 [Ma+ ] =

(2.6.9) (2.6.10)

see Exercise 2.13.23. The result when n1 is even is found in Exercise 2.13.23. For the asymptotic null distribution of Ma+ we shall make use of the linearity result for the sign process derived in Chapter 1; see Example 1.5.2. 2 (n+1) ) distriTheorem 2.6.3. Under H0 and D.1, (2.4.7), Ma+ has an approximate N( n22 , n4(n 1 +2) bution.

Proof: Assume without loss of generality that the true median of X and Y is 0. Let θb = med Xi . Note that n2 X + b + n2 )/2 . Ma = ( sgn(Yj − θ) (2.6.11) j=1

√ Clearly under (D.1), n2 θb is bounded in probability. Hence by the asymptotic linearity result for the L1 analysis, obtained in Example 1.5.2, we have −1/2 n2

n2 X j=1

But we also have

b = n−1/2 sgn(Yj − θ) 2 √

n2 X j=1

√ sgn(Yj ) − 2f (0) n2 θb + op (1) . n

1 √ −1 X b n1 θ = (2f (0) n1 ) sgn(Xi ) + op (1) .

i=1

Therefore −1/2 n2

n2 X j=1

b = n−1/2 sgn(Yj − θ) 2

n2 X j=1

n1 X p −1/2 sgn(Yj ) − n2 /n1 n1 sgn(Xi ) + op (1) . i=1

114

CHAPTER 2. TWO SAMPLE PROBLEMS

Note that −1/2 n2

n2 X j=1

and

D

sgn(Yj ) → N(0, λ−1 1 ) .

n1 X p D −1/2 sgn(Xi ) → N(0, λ2 /λ1 ) . n2 /n1 n1 i=1

The result follows from these asymptotic distributions, the independence of the samples, expression (2.6.11), and the fact that asymptotically the variance of Ma+ satisfies n2 (n + 1) . = n2 (4λ1 )−1 . 4(n1 + 2) Confidence Intervals b = #(Yj − θb > ∆); hence, if k is such that P0 (M + ≤ Note that Ma+ (∆) = #(Yj − ∆ > θ) a b b k) = α/2 then (Y(k+1) − θ, Y(n2 −k) − θ) is a (1 − α)100% confidence interval for ∆. For testing the two sided hypothesis H0 : ∆ = 0 versus HA : ∆ 6= 0 we would reject H0 if 0 is not in the confidence interval. This is equivalent, however, to rejecting if θb is not in the interval (Y(k+1) , Y(n2 −k) ). Suppose we determine k by the normal approximation. Then s r n n2 (n + 1) n2 . 2 . n2 k= − .5 . − zα/2 − .5 = − zα/2 2 4(n1 + 2) 2 4λ1 The confidence interval (Y(k+1) , Y(n2 −k) ), is a γ100%, (γ = 1 − 2Φ(−zα/2 (λ1 )−1/2 ), confidence interval based on the sign procedure for the sample Y1 , . . . , Yn2 . Suppose we √ take α = .05 and have the equal sample sizes case so that λ1 = .5. Then γ = 1 − 2Φ(−2 2). Hence, the two sided 5% test rejects H0 : ∆ = 0 if θb is not in the confidence interval. Remarks on Efficiency

Since the estimator of ∆ based on the Mathisen procedure is the same as that of Mood’s procedure, the asymptotic relative efficiency results for Mathisen’s procedure are the same as that of Mood’s. Using another type of efficiency due to Bahadur (1967), Killeen, Hettmansperger and Sievers (1972) show it is generally better to compute the median of the smaller sample. Curtailed sampling on the Y ’s is one situation where Mathisen’s test would be used instead of Mood’s test since with Mathisen’s test an early decision could be made; see Gastwirth (1968). Example 2.6.2. Quail Data, continued, Examples 2.3.1 and 2.6.1

2.7. ROBUSTNESS PROPERTIES

115

For this data, med Ti = 49. Since one of the placebo values was also 49, we eliminated it in the subsequent computation of Mathisen’s test. The test statistic has the value Ma+ = #(Cj > 49) = 17. Using n2 = 19 and n1 = 10 the null mean and variance are 9.5 and 11.875, respectively. This leads to a standardized test statistic of 2.03 (using the continuity correction) with a p-value of .021. Utilizing all the data, the corresponding point estimate and confidence interval are 19 and (6, 27). This differs from MWW and Mood analyses; see Examples 2.3.1 and 2.6.1, respectively.

2.7

Robustness Properties

In this section we obtain the breakdown points and the influence functions of the L1 and MWW estimates. We first consider the breakdown properties.

2.7.1

Breakdown Properties

We begin with the definition of an equivariant estimator of ∆. For convenience let the vectors X and Y denote the samples {X1 , . . . , Xn1 } and {Y1 , . . . , Yn2 }, respectively. Also let X + a1 = (X1 + a, . . . , Xn1 + a)′ . b Definition 2.7.1. An estimator ∆(X, Y) of ∆ is said to be an equivariant estimator of b b b b ∆ if ∆(X + a1, Y) = ∆(X, Y) − a and ∆(X, Y + a1) = ∆(X, Y) + a. Note that the L1 estimator and the Hodges-Lehmann estimator are both equivariant estimators of ∆. Indeed as Exercise 2.13.24 shows any estimator based on the rank pseudo norms discussed in Section 2.5 are equivariant estimators of ∆. As the following theorem shows the breakdown point of an equivariant estimator is bounded above by .25.

Theorem 2.7.1. Suppose n1 ≤ n2 . Then the breakdown point of an equivariant estimator satisfies ǫ∗ ≤ {[(n1 + 1)/2] + 1}/n, where [·] denotes the greatest integer function. b is an equivariant estimator such that Proof: Let m = [(n1 + 1)/2] + 1. Suppose ∆ ǫ > m/n. Then the estimator remains bounded if m points are corrupted. Let X∗ = (X1 + a, . . . , Xm + a, Xm+1 , . . . , Xn1 )′ . Since we have corrupted m points there exists a B > 0 such that b ∗ , Y) − ∆(X, b |∆(X Y)| ≤ B . (2.7.1) ∗

Next let X∗∗ = (X1 , . . . , Xm , Xm+1 −a, . . . , Xn1 −a)′ . Then X∗∗ contains n1 −m = [n1 /2] ≤ m altered points. Therefore, b ∗∗ , Y) − ∆(X, b |∆(X Y)| ≤ B . (2.7.2) b ∗∗ , Y) = ∆(X b ∗ , Y) + a. By (2.7.1) we have Equivariance implies that ∆(X b b ∗ , Y) ≤ ∆(X, b ∆(X, Y) − B ≤ ∆(X Y) + B

(2.7.3)

116

CHAPTER 2. TWO SAMPLE PROBLEMS

while from (2.7.2) we have b b ∗∗ , Y) ≤ ∆(X, b ∆(X, Y) − B + a ≤ ∆(X Y) + B + a .

(2.7.4)

Taking a = 3B leads to a contradiction between (2.7.2) and (2.7.4). By this theorem the maximum breakdown point of any equivariant estimator is roughly half of the smaller sample proportion. If the sample sizes are equal then the best possible breakdown is 1/4. Example 2.7.1. Breakdown of L1 and MWW estimates b = med Yj − med Xi , achieves the maximal breakdown since The L1 estimator of ∆, ∆ med Yj achieves the maximal breakdown in the one sample problem. b R = med {Yj − Xi } also achieves maximal breakdown. The Hodges-Lehmann estimate ∆ To see this, suppose we corrupt an Xi . Then n2 differences Yj − Xi are corrupted. Hence between samples we maximize the corruption by corrupting the items in the smaller sample, so without loss of generality we can assume that n1 ≤ n2 . Suppose we corrupt m Xi ’s. In order to corrupt med {Yj − Xi } we must corrupt (n1 n2 )/2 differences. Therefore mn2 ≥ (n1 n2 )/2; i.e., m ≥ n1 /2. Hence med {Yj − Xi } has maximal breakdown. Based on Exercise 1.12.13 of Chapter 1, the one sample estimate based on the Wilcoxon signed rank statistic does not achieve the maximal breakdown value of 1/2 in the one sample problem.

2.7.2

Influence Functions

Recall from Section 1.6.1 that the influence function of a Pitman regular estimator based on a single sample X1 , . . . , Xn is the function Ω(z) when the estimator has the represenP tation n−1/2 Ω(Xi ) + op (1). The estimators we are concerned with in this section are Pitman regular; hence, to determine their influence functions we need only obtain similar representations for them. For the L1 estimate we have from the proof of Theorem 2.6.1 that ) (n n1 2 X X √ sgn (X ) 1 1 sgn (Y ) i j b = med Yj − med Xi = √ + op (1) . − n∆ 2f (0) n j=1 λ2 λ 1 i=1

Hence the influence function of the L1 estimate is  −(λ1 2f (0))−1 sgn z if z is an x , Ω(z) = (λ2 2f (0))−1sgn z if z is an y

which is a bounded discontinuous function. For the Hodges-Lehmann estimate, (2.2.18), note that we can write the linearity result (2.4.23) as Z √ √ √ + + n(S (δ/ n) − 1/2) = n(S (0) − 1/2) − δ f 2 + op (1) ,

2.7. ROBUSTNESS PROPERTIES

117

√ b n∆R for δ leads to −1 Z √ √ + 2 b n∆R = f n(S (0) − 1/2) + op (1) .

which upon substituting

Recall the projection of the statistic S R (0) −1/2 given in Theorem 2.4.7. Since the difference between it and this statistic goes to zero in probability we can, after some algebra, obtain the following representation for the Hodges-Lehmann estimator, ) (n −1 Z n2 2 X X √ 1 F (Xi ) − 1/2 F (Yj ) − 1/2 bR = √ n∆ f2 + op (1) . − λ2 λ1 n j=1 i=1 Therefore the influence function for the Hodges-Lehmann estimate is ( R −1 − λ1 f 2 (F (z) − 1/2) if z is an x R 2 −1 Ω(z) = , λ2 f (F (z) − 1/2) if z is an y

which is easily seen to be bounded and continuous. For least squares, since the estimate is Y − X the influence function is  −(λ1 )−1 z if z is an x Ω(Z) = , (λ2 )−1 z if z is an y

which is unbounded and continuous. The Hodges-Lehmann and L1 estimates attain the maximal breakdown point and have bounded influence functions; hence they are robust. On the other hand, the least squares estimate has 0 percent breakdown and an unbounded influence function. One bad point can destroy a least squares analysis. For a general score function ϕ(u), by (2.5.31) we have the asymptotic representation # "n    n2  1 X X τ 1 τ ϕ ϕ b =√ ϕ(F (Xi)) + ϕ(F (Yi)) . ∆ − n i=1 λ1 λ 2 i=1 Hence, the influence function of the R-estimate based on the score function ϕ is given by  τϕ − λ1 ϕ(F (z)) if z is an x Ω(z) = , τϕ ϕ(F (z)) if z is an y λ2

where τϕ is defined by expression (2.5.23). In particular, the influence function is bounded provided the score generating function is bounded. Note that the influence function for the R-estimate based on normal scores is unbounded; hence, this estimate is not robust. Recall Example 1.8.1 in which the one sample normal scores estimate has an unbounded influence function (non robust) but has positive breakdown point (resistant). A rigorous derivation of these influence functions can be based on the influence function derived in Section A.5.2 of the Appendix.

118

2.8

CHAPTER 2. TWO SAMPLE PROBLEMS

Lehmann Alternatives and Proportional Hazards

Consider a two sample problem where the responses are lifetimes of subjects. We shall continue to denote the independent samples by X1 , . . . , Xn1 and Y1 , . . . , Yn2 . Let Xi and Yj have distribution functions F (x) and G(x) respectively. Since we are dealing with lifetimes both Xi and Yj are positive valued random variables. The hazard function for Xi is defined by f (t) hX (t) = 1 − F (t) and represents the likelihood that a subject will die at time t given that he has survived until that time; see Exercise 2.13.25. In this section, we will consider the class of lifetime models that are called Lehmann alternative models for which the distribution function G satisfies 1 − G(x) = (1 − F (x))α ,

(2.8.1)

where the parameter α > 0. See Section 4.4 of Maritz (1981) for an overview of nonparametric methods for these models. The Lehmann model generalizes the exponential scale model F (x) = 1 − exp(−x) and G(x) = 1 − (1 − F (x))α = 1 − exp(−αx). As shown in Exercise 2.13.25, the hazard function of Yj is given by hY (t) = αhX (t); i.e., the hazard function of Yj is proportional to the hazard function of Xi ; hence, these models are also referred to as proportional hazards models; see, also, Section 3.10. The null hypothesis can be expressed as HL0 : α = 1. The alternative we will consider is HLA : α < 1; that is, Y is less hazardous than X; i.e., Y has more chance of long survival than X and is stochastically larger than X. Note that, Pα (Y > X) = Eα [P (Y > X | X)] = Eα [1 − G(X)] = Eα [(1 − F (X))α ] = (α + 1)−1

(2.8.2)

The last equality holds, since 1 − F (X) has a uniform (0, 1) distribution. Under HLA , then, Pα (Y > X) > 1/2; i.e., Y tends to dominate X. The MWW test statistic SR+ = #(Yj > Xi ) is a consistent test statistic for HL0 versus HLA , by Theorem 2.4.10. We reject HL0 in favor of HLA for large values of SR+ . Furthermore by Theorem 2.4.4 and (2.8.2), we have that n1 n2 Eα [SR+ ] = n1 n2 Eα [1 − G(X)] = . 1+α This suggests as an estimate of α, the statistic, α b = ((n1 n2 )/SR+ ) − 1 .

(2.8.3)

By Theorem 2.4.5 it can be shown that Vα (SR+ ) =

n1 n2 (n1 − 1)α n1 n2 (n2 − 1)α2 αn1 n2 + + ; (α + 1)2 (α + 2)(α + 1)2 (2α + 1)(α + 1)2

(2.8.4)

2.8. LEHMANN ALTERNATIVES AND PROPORTIONAL HAZARDS

119

see Exercise 2.13.27. Using this result and the asymptotic distribution of SR+ under general alternatives, Theorem 2.4.9, we can obtain, by the delta method, the asymptotic variance of α b given by   n1 − 1 (n2 − 1)α . (1 + α)2 α 1+ . (2.8.5) + Var α b= n1 n2 α+2 2α + 1

This can be used to obtain an asymptotic confidence interval for α; see Exercise 2.13.27 for details. As in the example below the bootstrap could also be used to estimate the Var(b α).

2.8.1

The Log Exponential and the Savage Statistic

Another rank test which is frequently used in this situation is the log rank test proposed by Savage (1956). In order to obtain this test, first consider the special case where X has the exponential distribution function, F (x) = 1 − e−x/θ , for θ > 0. In this case the hazard function of X is a constant function. Consider the random variable ǫ = log X − log θ. In a few steps we can obtain its distribution function as, P [ǫ ≤ t] = P [log X − log θ ≤ t] = 1 − exp (−et ) ; i.e., ǫ has an extreme value distribution. The density of ǫ is fǫ (t) = exp (t − et ). Hence, we can model log X as the location model: log X = log θ + ǫ .

(2.8.6)

Next consider the distribution of the log Y . Using expression (2.8.1) and a few steps of algebra we get α P [log Y ≤ t] = 1 − exp (− et ) . θ But from this it is easy to see that we can model Y as log Y = log θ + log

1 +ǫ, α

(2.8.7)

where the error random variable has the above extreme value distribution. From (2.8.6) and (2.8.7) we see that the log-transformation problem is simply a two sample location problem with shift parameter ∆ = − log α. Here, HL0 is equivalent to H0 : ∆ = 0 and HLA is equivalent to HA : ∆ > 0. We shall refer to this model as the log exponential model for the remainder of this section. Thus any of the rank-based analyses that we have discussed in this chapter can be used to analyze this model. Lets consider the analysis based on the optimal score function for the model. Based on Section 2.5 and Exercise 2.13.19, the optimal scores for the extreme value distribution are generated by the function (2.8.8) ϕfǫ (u) = −(1 + log(1 − u)) .

120

CHAPTER 2. TWO SAMPLE PROBLEMS

Hence the optimal rank test in the log exponential model is given by     n2 n2  X X R(Yj ) R(log Yj ) SL = =− ϕfǫ 1 + log 1 − n + 1 n+1 j=1 j=1   n2  X R(Yj ) . = − 1 + log 1 − n + 1 j=1

(2.8.9)

We reject HL0 in favor of HLA for large values of SL . By (2.5.14) the null mean of SL is 0 while from (2.5.18) its null variance is given by  2 n  n1 n2 X i 2 σϕfǫ = 1 + log 1 − . (2.8.10) n(n − 1) i=1 n+1

Then an asymptotic level α test rejects HL0 in favor of HLA if SL ≥ zα σϕfǫ . Certainly the statistic SL can be used in the general Lehmann alternative model described above, although, it is not optimal if X does not have an exponential distribution. We shall discuss the efficiency of this test below. b be the estimate of ∆ based on the optimal score function ϕfǫ ; that For estimation, let ∆ b solves the equation is, ∆   n2  X R[log(Yj ) − ∆] . =0. (2.8.11) 1 + log 1 − n+1 j=1 Besides estimation, the confidence intervals discussed in Section 2.5 for general scores, can be obtained for the score function ϕfǫ ; see Example n 2.8.1 o for an illustration. b . As discussed in Exercise 2.13.27, Thus another estimate of α would be α b = exp −∆ an asymptotic confidence interval for α can be formulated from this relationship. Keep in mind, though, that we are assuming that X is exponentially distributed. As a further note, since ϕfǫ (u) is an unbounded function it follows from Section 2.7.2 b is unbounded. Thus the estimate is not robust. that the influence function of ∆ A frequently used, equivalent test statistic to SL was proposed by Savage. To derive it, denote R(Yj ) by Rj . Then we can write   Z 1−Rj /(n+1) Z 0 Rj 1 1 log 1 − = dt = dt . n+1 t 1 Rj /(n+1) 1 − t

We can approximate this last integral by the following Riemann sum: 1 1 1 1 1 1 + +· · ·+ . 1 − Rj /(n + 1) n + 1 1 − (Rj − 1)/(n + 1) n + 1 1 − (Rj − (Rj − 1))/(n + 1) n + 1

This simplifies to

n X 1 1 1 1 + +···+ = . n+1−1 n+1−2 n + 1 − Rj i i=n+1−R j

2.8. LEHMANN ALTERNATIVES AND PROPORTIONAL HAZARDS

121

This suggests the rank statistic S˜L = −n2 +

n2 X

n X

j=1 i=n−Rj

1 . i +1

(2.8.12)

This statistic was proposed by Savage (1956). Note that it is a rank statistic with scores defined by n X 1 aj = −1 + . (2.8.13) i i=n−j+1

Exercise 2.13.28 shows that its null mean and variance are given by EH0 [S˜L ] = 0 σ ˜

2

n1 n2 = n−1

(

n

1X1 1− n j=1 j

)

.

(2.8.14)

Hence an asymptotic level α test is to reject HL0 in favor of HLA if S˜L ≥ σ ˜ zα . ˜ Based on the above Riemann sum it would seem that SL and SL are close statistics. Indeed they are asymptotically equivalent and, hence, both are optimal when X is exponenˇ ak (1967) or Kalbfleish and Prentice (1980) for details. tially distributed; see H´ajek and Sid´

2.8.2

Efficiency Properties

We next derive the asymptotic relative efficiences for the log exponential model with fǫ (t) = exp (t − et ). The MWW statistic, SR+ , is a consistent test for the log exponential model. By (2.4.21), the efficacy of the Wilcoxon test is r √ Z 2p 3p cM W W = 12 fǫ λ1 λ2 = λ1 λ2 ; , 4

Since the Savage test is asymptotically optimal its efficacy is the √ square root of Fisher information, i.e., I 1/2 (fǫ ) discussed in Section 2.5. This efficacy is λ1 λ2 . Hence the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model, is 3/4; see Exercise 2.13.29. √ Recall that the efficacy of the L1 procedures, both Mood’s and Mathisen’s, is 2fǫ (θǫ ) λ1 λ2 , where θǫ denotes the median of the extreme value distribution. This turns out to be √ θǫ = log(log 2)). Hence fǫ (θǫ ) = (log 2)/2, which leads to the efficacy λ1 λ2 log 2 for the L1 methods. Thus the asymptotic relative efficiency of the L1 procedures with respect to the procedure based on Savage scores is (log 2)2 = .480. The asymptotic relative efficiency of the L1 methods to the MWW at this model is .6406. Therefore there is a substantial loss of efficiency if L1 methods are used for the log exponential model. This makes sense since the extreme value distribution has very light tails.

122

CHAPTER 2. TWO SAMPLE PROBLEMS

The variance of a random variable with density fǫ is π 2 /6; hence the asymptotic relative efficiency of the t test to the Savage test at the log exponential model is 6/π 2 = .608. Hence, for the procedures analyzed in this chapter on the log exponential model the Savage test is optimal followed, in order, by the MWW, t, and L1 tests. Example 2.8.1. Lifetimes of an Insulation Fluid. The data below are drawn from an example on page 3 of Lawless (1982); see, also, Nelson (1982, p. 227). They consist of the breakdown times (in minutes) of an electrical insulating fluid when subject to two different levels of voltage stress, 30 and 32 kV. Suppose we are interested in testing to see if the lower level is less hazardous than the higher level. Voltage Level 30 kV Y 32 kV X

17.05 22.66 194.90 47.30 0.40 82.85 3.91 0.27

Times to Breakdown (Minutes) 21.02 175.88 139.07 144.12 20.46 43.40 7.74 9.88 89.29 215.10 2.75 0.79 15.93 0.69 100.58 27.80 13.95 53.24

Let Y and X denote the log of the breakdown times of the insulating fluid at the voltage stesses of 30 kV and 32 kV’s, respectively. Let ∆ = θY − θX denote the shift in locations. We are interested in testing H0 : ∆ = 0 versus HA : ∆ > 0. The comparison boxplots for the log-transformed data are displayed in the left panel of Figure 2.8.1. It appears that the lower level (30 kV) is less hazardous. The RBR function twosampwiltwosampr2+ with the score argument set at philogr obtains the analysis based on the log-rank scores. . Briefly, the results are: Test of Delta = 0 Alternative selected is 1 Standardized (z) Test-Statistic 1.302 and p-vlaue 0.096 Estimate 0.680 SE is 0.776 95 % Confidence Interval is (-0.261, 2.662) Estimate of the scale parameter tau 1.95 The corresponding Mann-Whitney-Wilcoxon analysis is Test of Delta = 0 Alternative selected is 1 Test Stat. S+ is 118 Standardized (z) Test-Stat. 1.816 MWW estimate of the shift in location is 1.297 95 % Confidence Interval is (-0.201, 3.355) Estimate of the scale parameter tau 2.37

SE is

and p-vlaue 0.034 0.944

2.9. TWO SAMPLE RANK SET SAMPLING (RSS)

123

Figure 2.8.1: Comparison Boxplots of Treatment and Control Quail LDL Levels Comparison Boxplots of log 32 kv and log 30 kv

0

−1

0

50

1

2

Breakdown−time

100

Voltage level

3

150

4

5

200

Exponential q−q Plot

0.0

0.5

1.0

1.5

2.0

2.5

log 30 kv

log 32 kv

Exponential Quantiles

While the log-rank is insignificant, the MWW analysis is significant at level 0.034. This difference is not surprising upon considering the q−q plot of the original data at the 32 kV level found in the right panel of Figure 2.8.1. The population quantiles are drawn from an exponential distribution. The plot indicates heavier tails than that of an exponential distribution. In turn, the error distribution for the location model would have heavier tails than the light-tailed extreme ,valued distribution. Thus the MWW analysis is more appropriate. The two sample t-test has value 1.34 with the p-value also of .096. It was impaired by the heavy tails too. Although, the exponential model on the original data seems unlikely, for illustration we consider it. The sum of the ranks of the 30 kV (Y ) sample is 184. The estimate of α based on the MWW statistic is .40. A 90% confidence interval for α based on the approximate (via the delta-method) variance, (2.8.5), is (.06, .74); while, a 90% bootstrap confidence interval based 1000 bootstrap samples is (.15, .88). Hence the MWW test, the corresponding estimate of α and the two confidence intervals indicate that the lower voltage level is less hazardous than the higher level.

2.9

Two Sample Rank Set Sampling (RSS)

The basic background for rank set sampling was discussed in Section 1.9. In this section we extend these ideas to the two sample location problem. Suppose we have the two samples in which X1 , . . . , Xn1 are iid F (x) and Y1 , . . . , Yn2 are iid F (x − ∆) and the two samples

124

CHAPTER 2. TWO SAMPLE PROBLEMS

are independent of one another. In the corresponding RSS design, we take n1 cycles of k samples for X and n2 cycles of q samples for Y . Proceeding as in Section 1.9, we display the measured data as: X(1)1 , . . . , X(1)n1 iid f(1) (t) Y(1)1 , . . . , Y(1)n2 iid f(1) (t − ∆) · · · · · · · · · · · · . · · · · · · X(k)1 , . . . , X(k)n1 iid f(k) (t) Y(q)1 , . . . , Y(q)n2 iid f(q) (t − ∆) To test H0 : ∆ = 0 versus HA : ∆ > 0 we compute Mann-Whitney-Wilcoxon P 2 Pthe n1 statistic with these rank set samples. Letting Usi = nt=1 j=1 I(Y(s)t > X(i)j ), the test statistic is q k X X Usi . URSS = s=1 i=1

Note that Usi is the Mann-Whitney-Wilcoxon statistic computed on the sample of the sth Y order statistics and the ith X order statistics. Even under the null hypothesis H0 : ∆ = 0, Usi is not based on identically distributed samples unless s = i. This complicates the null distribution of URSS . Bohn and Wolfe (1992) present a thorough treatment of the distribution theory for URSS . We note that under H0 : ∆ = 0, URSS is distribution free and further, using the same ideas as in Theorem 1.9.1, EH0 (URSS ) = qkn1 n2 /2. For fixed k and q, provided assumption D.1, p (2.4.7), holds, Theorem 2.4.2 can be applied to show that (URSS − qkn1 n2 /2)/ VH0 (URSS ) has a limiting N(0, 1) distribution. The difficulty is in the calculation of the VH0 (URSS ); recall Theorem 1.9.1 for a similar calculation for the sign statistic. Bohn and Wolfe (1992) present a complex formula for the variance. Bohn and Wolfe provide a table of the approximate null distribution of URSS for q = k = 2, n1 = 1, . . . , 5, n2 = 1, . . . , 5 and likewise for q = k = 3. Another way to approximate the null distribution of URSS is to bootstrap it. Consider, for simplicity, the case k = q = 3 and n1 = n2 = m. Hence the expert must rank three observations and each of the m cycles consists of three samples of size three for each of the X and Y measurements. In order to bootstrap the null distribution of URSS , first align the b the Hodges-Lehmann estimate of shift computed across the two RSS’s. Y -RSS’s with ∆, Our bootstrap sampling is on the data with the indicated sampling distributions: X(1)1 , . . . , X(1)m X(2)1 , . . . , X(2)m X(3)1 , . . . , X(3)m

sample Fˆ(1) (x) Y(1)1 , . . . , Y(1)m sample Fˆ(2) (x) Y(2)1 , . . . , Y(2)m sample Fˆ(3) (x) Y(3)1 , . . . , Y(3)m

b sample Fˆ(1) (y − ∆) b . sample Fˆ(2) (y − ∆) b sample Fˆ(3) (y − ∆)

∗ ∗ In the bootstrap process, for each row i = 1, 2, 3, we take random samples X(i)1 , . . . , X(i)m ∗ ∗ b We then compute U ∗ on these samples. from Fˆ(i) (x) and Y(i)1 , . . . , Y(i)m from Fˆ(2) (y − ∆). RSS ∗ ∗ Repeating this B times, we obtain the sample of test statistics URSS,1 , . . . , URSS,B . Then ∗ the bootstrap p-value for our test is #(URSS,j ≥ URSS )/B, where URSS is the value of the statistic based on the original data. Generally we take B = 1000 for a p-value. It is clear how to modify the above argument to allow for k 6= q and n1 6= n2 .

2.10. TWO SAMPLE SCALE PROBLEM

2.10

125

Two Sample Scale Problem

Frequently it is of interest to investigate whether or not one random variable is more dispersed than another. The general case is when the random variables differ in both location and scale. Suppose the distribution functions of X and Y are given by F (x) and G(y) = F ((y − ∆)/η), respectively; hence L(Y ) = L(ηX + ∆). For discussion, we consider one-sided hypotheses of the form H0 : η = 1 versus HA : η > 1. (2.10.1) The other one-sided or two-sided hypotheses can be handled similarly. Let X1 , . . . , Xn1 and Y1 , . . . , Yn2 be samples drawn on the random variables X and Y , respectively. The traditional test of H0 is the F -test which is based on the ratio of sample variances. As we discuss in Section 2.10.2, though, this test is generally not asymptotically correct, (one of the exceptions is when F (t) is a normal cdf). Indeed, as many simulation studies have shown, this test is extremely liberal in many non-normal situations; see Conover, Johnson and Johnson (1981). Tests of H0 should be invariant to the locations. One way of ensuring this is to first center the observations. For the F -test, the centering is by sample means; instead, we prefer to use the sample medians. Let θbX and θbY denote the sample medians of the X and Y samples, respectively. Then the samples of interest are the folded aligned samples given by |X1∗ |, . . . , |Xn∗1 | and |Y1∗ |, . . . , |Yn∗2 |, where Xi∗ = Xi − θbX and Yi∗ = Yi − θbY .

2.10.1

Optimal Rank-Based Tests

To obtain appropriate score functions for the scale problem, first consider the case when the location parameters of X and Y are known. Without loss of generality, we can then assume that they are 0 and, hence, that L(Y ) = L(ηX). Further because η > 0, we have L(|Y |) = L(η|X|). Let Z′ = (log |X1 |, . . . , log |Xn1 |, log |Y1|, . . . , log |Yn2 |) and ci , (2.2.1), be the dummy indicator variable, i.e., ci = 0 or 1, depending on whether Zi is an X or Y , repectively. Then an equivalent formulation of this problem is Zi = ζci + ei , 1 ≤ i ≤ n ,

(2.10.2)

where ζ = log η, e1 , . . . , en are iid with distribution function F ∗ (x) which is the cdf of log |X|. The hypotheses, (2.10.1), are equivalent to H0 : ζ = 0 versus HA : ζ > 1.

(2.10.3)

Of course, this is the two sample location problem based on the logs of the absolute values of the observations. Hence, the optimal score function for Model 2.10.2 is given by ϕf ∗ (u) = −

f ∗′ (F ∗−1 (u))) . f ∗ (F ∗−1 (u)))

(2.10.4)

126

CHAPTER 2. TWO SAMPLE PROBLEMS

After some simplification, see Exercise 2.13.30, we have ex [f ′ (ex ) − f ′ (−ex )] f ∗′ (x) = +1. − ∗ f (x) f (ex ) + f (−ex )

(2.10.5)

If we further assume that f is symmetric, then expression (2.10.5) for the optimal scores function simplifies to    u + 1 f ′ F −1 u+1 −1 2  ϕf ∗ (u) = −F − 1. (2.10.6) 2 f F −1 u+1 2 This expression is convenient to work with because it depends on F (t) and f (t), the cdf and pdf of X, in the original formulation of this scale problem. The following two examples obtain the optimal score function for the normal and double exponential situations, respectively. Example 2.10.1. L(X) Is Normal Without loss of generality, assume that f (x) is the standard normal density. In this case expression (2.10.6) simplifies to   2 u+1 −1 ϕF K (u) = Φ −1, 2

(2.10.7)

where Φ is the standard normal distribution function; see Exercise 2.13.33. Hence, if we are sampling from a normal distribution this suggests the rank test statistic SF K =

n2  X j=1

−1

Φ



R|Yj | 1 + 2(n + 1) 2

2

,

(2.10.8)

where the F K subscript is due to Fligner and Killeen (1976), who discussed this score function in their work on the two-sample scale problem. Example 2.10.2. L(X) Is Double Exponential Suppose that the density of X is the double exponential, f (x) = 2−1 exp {−|x|}, −∞ < x < ∞. Then as Exercise 2.13.33 shows the optimal rank score function is given by ϕ(u) = −(log (1 − u) + 1) .

(2.10.9)

These scores are not surprising, because the distribution of |X| is exponential. Hence, this is precisely the log linear problem with exponentially distributed lifetime that was discussed in Section 2.8; see the discussion around expression (2.8.8). Example 2.10.3. L(|X|) Is a Member of the Generalized F -family: MWW Statistic

2.10. TWO SAMPLE SCALE PROBLEM

127

In Section 3.10 a discussion is devoted to a large family of commonly used distributions called the generalized F family for survival type data. In particular, as shown there, if |X| follows an F (2, 2)-distribution, then it follows, (Exercise 2.13.31), that the log |X| has a logistic distribution. Thus the MWW statistic is the optimal rank score statistic in this case. Notice the relationship between tail-weight of the distribution and the optimal score function for the scale problem over these last three examples. If the underlying distribution is normal then the optimal score function (2.10.8) is for very light-tailed distributions. Even at the double-exponential, the score function (2.10.9) is still for light-tailed errors. Finally, for the heavy-tailed (variance is ∞) F (2, 2) distribution the score function is the bounded MWW score function. The reason for the difference in location and scale scores is that the optimal score function for the scale case is based on the distribution of the log’s of the original variables. Once a scale score function is selected, following Section 2.5 the general scores process for this problem is given by n2 X

Sϕ (ζ) =

j=1

aϕ (R(log |Yj | − ζ)) ,

(2.10.10)

where the scores a(i) are generated by a(i) = ϕ(i/(n + 1)). A rank test statistic for the hypotheses, (2.10.3), is given by Sϕ = Sϕ (0) =

n2 X j=1

aϕ (R(log |Yj |) =

n2 X j=1

aϕ (R(|Yj |) ,

(2.10.11)

where the last equality holds because the log function is strictly increasing. This is not necessarily a standardized score function, but it follows from the discussion on general scores found in Section 2.5 and (2.5.18) that the null mean µϕ and null variance σϕ2 of the statistic are given by n1 n2 X µϕ = n2 a and σϕ2 = (a(i) − a)2 . (2.10.12) n(n − 1) The asymptotic version of this test statistic rejects H0 at approximate level α if z ≥ zα where z=

Sϕ − µ ϕ . σϕ

The efficacy of the test based on Sϕ is given by expression (2.5.28); i.e., p cϕ = τϕ−1 λ1 λ2 , where τϕ is given by

τϕ−1

=

Z

(2.10.13)

(2.10.14)

1 0

ϕ(u)ϕf ∗ (u) du

(2.10.15)

128

CHAPTER 2. TWO SAMPLE PROBLEMS

and the optimal scores function ϕf ∗ (u) is given in expression (2.10.4). Note that this formula for the efficiacy is under the assumption that the score function ϕ(u) is standardized. Recall the original (realistic) problem, where the distribution functions of X and Y are given by F (x) and G(y) = F ((y − ∆)/η), respectively and the difference in locations, ∆, is unknown. In this case, L(Y ) = L(ηX + ∆). As noted above, the samples of interest are the folded aligned samples given by |X1∗ |, . . . , |Xn∗1 | and |Y1∗ |, . . . , |Yn∗2 |, where Xi∗ = Xi − θbX and Yi∗ = Yi − θbY , where θbX and θbY denote the sample medians of the X and Y samples, respectively. Given a score function ϕ(u), we consider the linear rank statistic, (2.10.11), where the ranking is performed on the folded-aligned observations; i.e., Sϕ∗

=

n2 X j=1

a(R(|Yj∗ |)).

(2.10.16)

The statistic S ∗ is no longer distribution free for finite samples. However, if we further assume that the distributions of X and Y are symmetric, then the test statistic Sϕ∗ is asymptotically distribution free and has the same efficiency properties as Sϕ ; see Puri (1968) and Fligner and Hettmansperger (1979). The requirement that f is symmetric is discussed in detail by Fligner and Hettmansperger (1979). In particular, we standardize the statistic using the mean and variance given in expression (2.10.12). Estimation and confidence intervals for the parameter η are based on the process Sϕ∗ (ζ)

=

n2 X j=1

aϕ (R(log |Yj∗ | − ζ)) ,

An estimate of ζ is a value ζb which solves the equation (2.5.12); i.e., . b = Sϕ∗ (ζ) 0.

(2.10.17)

(2.10.18)

An estimate of η, the ratio of scale parameters, is then b

ηb = eζ .

(2.10.19)

The interval (ζbL , ζbU ) where ζbL and ζbU solve the equations (2.5.21), forms (asymptotically) a (1 − α)100% confidence interval for ζ. The corresponding confidence interval for η is (exp {ζbL}, exp {ζbU }). As a simple rank-based analysis, consider the test and estimation given above based on the optimal scores (2.10.7) for the normal situation. The folded aligned samples version of the test statistic (2.10.8) is the statistic SF∗ K

 2 n2  X R|Yj∗ | 1 −1 = Φ . + 2(n + 1) 2 j=1

(2.10.20)

2.10. TWO SAMPLE SCALE PROBLEM

129

The standardized test statistic is zF∗ K = (SF∗ K − µF K )/σF K , where µF K abd σF K are the vaules of (2.10.12) for the scores (2.10.7). This statistic for non-aligned samples is given on ˇ ak (1967). A version of it was also discussed by Fligner and Killeen page 74 of H´ajek and Sid´ (1976). We refer to this test and the associated estimator and confidence interval as the Fligner-Killeen analysis. The RBR function twoscale with the score function phiscalefk computes the Fligner-Killeen analysis. We next obtain the efficacy of this analysis. Example 2.10.4. Efficacy for the Score Function ϕF K (u). To use expression (2.5.28) for the efficacy, we must first standardize the score function ϕF K (u) = {Φ−1 [(u + 1)/2]}2 − 1, (2.10.7). Using the substitution (u + 1)/2 = Φ(t), we have Z 1 Z ∞ ϕF K (u) du = t2 φ(t) dt − 1 = 1 − 1 = 0. 0

−∞

Hence, the mean is 0. In the same way, Z 1 Z ∞ Z 2 4 [ϕF K (u)] du = t φ(t) dt − 2 0

−∞



t2 φ(t) dt + 1 = 2.

−∞

Thus the standardized score function is √ ϕ∗F K (u) = {Φ−1 [(u + 1)/2]}2 − 1]/ 2. Hence, the efficacy of the Fligner-Killeen analysis is Z 1 p 1 √ {Φ−1 [(u + 1)/2]}2 − 1]ϕf ∗ (u) du, c ϕF K = λ 1 λ 2 2 0

(2.10.21)

(2.10.22)

where the optimal score function ϕf ∗ (u) is given in expression (2.10.4). In particular, the efficacy at the normal distribution is given by Z 1 p √ p 1 √ {Φ−1 [(u + 1)/2]}2 − 1]2 du, = 2 λ1 λ2 . cϕF K (normal) = λ1 λ2 (2.10.23) 2 0 We illustrate the Fligner-Killeen analysis with the following example.

Example 2.10.5. Doksum and Sievers Data. Doksum and Sievers (1976) describe an experiment involving the effect of ozone on weight gain of rats. The experimental group consisted of n2 = 22 rats which were placed in an ozone environment for seven days, while the control group contained n1 = 21 rats which were placed in an ozone free environment for the same amount of time. The response was the weight gain in a rat over the time period. Figure 2.10.1 displays the comparison boxplots for the data. There appears to be a difference in scale. Using the RBR software discussed above,

130

CHAPTER 2. TWO SAMPLE PROBLEMS Figure 2.10.1: Comparison Boxplots of Treated and Control Weight Gains in rats.

20 −10

0

10

Weight Gain

30

40

50

Comparison Boxplots of Control and Ozone

Control

Ozone

the Fligner-Killeen test statistic SF∗ K = 28.711 and its standardized value is zF∗ K = 2.095. The corresponding p-value for a two sided test is 0.036, confirming the impression from the plot. The associated estimate of the ratio (ozone to control) of scales is ηb = 2.36 with a 95% confidence interval of (1.09, 5.10). Conover, Johnson and Johnson (1981) performed a large Monte Carlo study of tests of dispersion, including these folded-aligned rank tests, over a wide variety of situations for the c-sample scale problem. The traditional F -test (Bartlett’s test) did poorly, (as would be expected from our comments below about the lack of robustness of the classical F -test). In certain null situations its empirical α levels exceeded .80 when the nominal α level was .05. One rank test that performed very well was the aligned rank version of a test statistic similar to SF∗ K , (2.10.20), but with the exponent of one instead of two in the definition of the score function. This performed well overall in terms of validity and power except for highly asymmetric distributions, where it has a tendency to be liberal. However, in the following simulation study the Fligner-Killeen test (??) (exponent of two) is empirically valid over the asymmetric situations covered. Example 2.10.6. Simulation Study for Validity of Tests Sϕ∗ Table 2.10.1 displays the results of a small simulation study of the validity of the rankbased tests of scale for various score functions over mostly skewed error distributions. The scores in the study are: (fk2 ), the optimal score function for the normal distribution; (fk), similar to last except the exponent is one; (Wilcoxon), the linear Wilcoxon score function;

2.10. TWO SAMPLE SCALE PROBLEM

131

(Quad), the score function ϕ(u) = u2 ; and (Logistic) the optimal score function if the distribution of X is logistic (see Exercise 2.13.32). The error distributions include the normal and the χ2 (1) distributions and several members of the skewed contaminated normal distribution. In the later case, the random variable X is written as X = X1 (1 − Iǫ ) + Iǫ X2 , where X1 and X2 have N(0, 1) and N(µc , σc2 ) distributions, respectively, Iǫ has a Bernoulli distribution with probability of success ǫ, and X1 , X2 and Iǫ are mutually independent. For the study ǫ was set at 0.3 and µc and σc varied. The pdfs of the three SCN distributions in Table 2.10.1 are shown in Figure 2.10.2. The pdf in the bottom right cornor panel of the figure is that of χ2 (1)-distribution. For all but the last situation in Table 2.10.1, the sample sizes are n1 = 20 and n2 = 25. The last situation is for n1 = n2 = 10, The number of simulations for each situation was set at 1000. For each run, the two sided alternative, HA : η 6= 1, was tested and the estimator of η and an associated confidence interval for η was obtained. Computations were performed by RBR functions. The table shows the empirical α levels at the nominal 0.10, 0.05, and 0.01 levels; the emirical confidence coefficient for a nominal 95% confidence interval, the mean of the estimates of η, and the MSE for ηb. Of the five analyses, overall the Fligner-Killeen analysis (fk2 ) performed the best. This analysis was valid (nominal levels and empirical coverage) in all the situations, except for the χ2 (1) distribution at the 10% level and the larger sample sizes. Even here, its empirical level is 0.128. The other tests were liberal in the skewed situations, some as the Wilcoxon quite liberal. Also, the fk analysis (exponent 1 in its score function) was liberal for the χ2 (1) situations. Notice that the Fligner-Killeen analysis achieved the lowest MSE in all the situations. Hall and Padmanabhan (1997) developed a percentile bootstrap for these rank-based tests which in their accompanying study performed quite well for skewed error distributions as well as the symmetric error distributions. As a final remark, another class of linear rank statistics for the two sample scale problem consists of simple linear rank statistics of the form n2 X S= a(R(Yj )) , (2.10.24) j=1

where the scores are generated as a(i) = ϕ(i/(n + 1)). The folded rank statistics discussed above suggest that ϕ be a convex (or concave) function. One popular score function is the quadratic function ϕ(u) = (u − 1/2)2 . The resulting statistic, 2 n2  X R(Yj ) 1 SM = , (2.10.25) − n + 1 2 j=1 was proposed by Mood (1954) as a test statistic for (??). For the realistic problem with unknown location, though, the observations have to be first aligned. Asymptotic theory holds, provided the underlying distribution is symmetric. This class of aligned rank tests, though, did not perform nearly as well as the folded rank statistics, (2.10.16), in the large Monte Carlo study of Conover et al. (1981). Hence, we recommend the folded rank-based analyses discussed above.

132

CHAPTER 2. TWO SAMPLE PROBLEMS

Table 2.10.1: Empirical Levels, Confidences, and MSE’s for the Monte carlo Study of Example ??. Normal Errors, n1 = 20, n2 = 25 d .95 α b.10 α b.05 α b.01 Cnf ηˆ MSE(ˆ η) Logistic 0.083 0.041 0.006 0.961 1.037 0.060 Quad. 0.080 0.030 0.008 0.970 1.043 0.076 Wilcoxon 0.073 0.033 0.004 0.967 1.042 0.097 2 fk 0.087 0.039 0.004 0.960 1.036 0.057 fk 0.077 0.033 0.005 0.969 1.037 0.067 √ SKCN(µc = 2, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25 Logistic 0.106 0.036 0.006 0.965 1.035 0.076 Quad. 0.106 0.046 0.008 0.953 1.040 0.095 Wilcoxon 0.103 0.049 0.007 0.952 1.043 0.117 2 fk 0.100 0.034 0.006 0.966 1.033 0.073 fk 0.099 0.047 0.006 0.953 1.034 0.085 √ SKCN(µc = 6, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25 Logistic 0.081 0.033 0.006 0.966 1.067 0.166 Quad. 0.122 0.068 0.020 0.933 1.105 0.305 Wilcoxon 0.163 0.103 0.036 0.897 1.125 0.420 2 fk 0.072 0.026 0.005 0.974 1.057 0.126 fk 0.111 0.057 0.015 0.942 1.075 0.229 √ SKCN(µc = 12, σc = 2, ǫc = 0.3), n1 = 20, n2 = 25 Logistic 0.084 0.046 0.007 0.954 1.091 0.298 Quad. 0.138 0.085 0.018 0.916 1.183 0.706 Wilcoxon 0.171 0.116 0.038 0.886 1.188 0.782 2 fk 0.074 0.042 0.007 0.958 1.070 0.201 fk 0.115 0.069 0.015 0.932 1.109 0.400 2 χ (1), n1 = 20, n2 = 25 Logistic 0.154 0.086 0.023 0.913 1.128056 0.353 Quad. 0.249 0.149 0.047 0.851 1.170 0.482 Wilcoxon 0.304 0.197 0.067 0.804 1.196 0.611 fk2 0.128 0.066 0.018 0.936 1.120 0.336 fk 0.220 0.131 0.039 0.870 1.154 0.432 2 χ (1), n1 = 10, n2 = 10 Logistic 0.132 0.062 0.018 0.934 1.360 1.495 Quad. 0.192 0.099 0.035 0.900 1.457 2.108 Wilcoxon 0.276 0.166 0.042 0.833 1.560 3.311 2 fk 0.111 0.057 0.013 0.941 1.335 1.349 fk 0.199 0.103 0.033 0.893 1.450 2.086

2.10. TWO SAMPLE SCALE PROBLEM

133

Figure 2.10.2: Pdfs of Skewed Distributions in the Simulation Study of Example 2.10.6.

f(x)

0.20

f(x) 0.10 0.00

0

2

4

6

8

−2

0

4

6

SCN: µc = 12, σc = 1.41, ε = .3

χ2, One Defree of Freedom

8

10

1.2

x

1.0 0.8 0.6

f(x)

0.4 0.2 0.0 0

5

10

15

0

x

2.10.2

2

x

0.00 0.05 0.10 0.15 0.20 0.25

f(x)

−2

0.00 0.05 0.10 0.15 0.20 0.25

SCN: µc = 6, σc = 1.41, ε = .3

0.30

SCN: µc = 2, σc = 1.41, ε = .3

1

2

3

4

x

Efficacy of the Traditional F -Test

We next obtain the efficacy of the traditional F -test for the ratio of scale parameters. Actually for our development we need not assume that X and Y have the same locations. Let σ22 and σ12 denote the variances of Y and X respectively. Then in the notation in the first paragraph of this section, η 2 = σ22 /σ12 . The classical F -test of the hypotheses (??) is to reject H0 if F ∗ ≥ F (α, n2 − 1, n1 − 1) where F∗ = σ b22 /b σ12 ,

and σ b22 and σ b12 are the sample variances of the samples Y1 , . . . , Yn2 and X1 , . . . , Xn1 , respectively. The F -test is exact size α if f is a normal pdf. Also the test is invariant to differences in location. We first need the asymptotic distribution of F ∗ under the null hypothesis. Instead of working √ with F ∗ it is more convenient mathematically to work with the equivalent test statistic n log F ∗ . We will assume that X has a finite fourth central moment; i.e., µX,4 = E[(X − E(X))4 ] < ∞. Let ξ = (µX,4 /σ14 ) − 3 denote the kurtosis of X. It easily follows that Y has the same kurtosis under the null and alternative hypotheses. A key result, established in Exercise 2.13.38, is that under these conditions √ D σi2 − σi2 ) → N(0, σi4 (ξ + 2)) , for i = 1, 2 . (2.10.26) ni (b It follows immediately by the delta method that √ D ni (log σ bi2 − log σi2 ) → N(0, ξ + 2) , for i = 1, 2 .

(2.10.27)

134

CHAPTER 2. TWO SAMPLE PROBLEMS

Under H0 , σi = σ, say, and the last result, r r √ n√ n√ 2 2 ∗ n log F = n2 (log σ b2 − log σ ) − n1 (log σ b12 − log σ 2 ) n2 n1 D

→ N(0, (ξ + 2))/(λ1 λ2 )) .

The approximate test rejects H0 if p



n log F ∗

(ξ + 2))/(λ1 λ2 )

≥ zα .

(2.10.28)

(2.10.29)

Note that ξ = 0 if X is normal. In practice the test which is used assumes ξ = 0; that is, F ∗ is not corrected by an estimate of ξ. This is one reason that the usual F -test for ratio in variances does not possess robustness of validity; that is, the significance level is not asymptotically distribution free. Unlike the t-test, the F -test for variances is not even asymptotically distribution free under H0 . In order to obtain the efficacy of the F -test, consider the sequence of contiguous alternatives (??). Assume without loss of generality that the locations of X and Y are the same. Under this sequence of alternatives we have Yj = e∆n Uj where Uj is a random variable with cdf F (x) while Yj has cdf F (e∆n x). We also get σ b22 = exp {2∆n }b σU2 where σ bU2 denotes the sample variance of U1 , . . . , Un2 . Let γF (∆) denote the power function of the F -test. The asymptotic power lemma for the F test is Theorem 2.10.1. Assuming that X has a finite fourth moment, with ξ = (µX,4 /σ14 ) − 3, lim γF (∆n ) = P (Z ≥ zα − cF δ) ,

n→∞

where Z has a standard normal distribution and efficacy p p cF = 2 λ 1 λ 2 / ξ + 2 .

(2.10.30)

Proof: The conclusion follows directly upon observing, √ √ n log F ∗ = n(log σ b22 − log σ b12 ) √ √ = n(log σ bU2 + 2(δ/ n) − log σ b12 ) r r n√ n√ 2 2 n2 (log σ bU − log σ ) − n1 (log σ b12 − log σ 2 ) = 2δ + n2 n1

and that the last quantity converges in distribution to a N(2δ, (ξ + 2))/(λ1λ2 )) variate. Let ϕ(u) denote a general score function for an foled-aligned rank-based analysis as discussed above. It then follows that the asymptotic relative efficiency of this analysis to the F -test is the ratio of the squares of their efficacies, i.e., e(S, F ) = c2ϕ /c2F , where cϕ is given in expression (2.5.28). Suppose we use the Fligner-Killeen analysis. Then its efficacy is cϕF K which is given in expression (2.10.22). The ARE between the Fligner-Killeen analysis and the traditional F test analysis is the ratio c2ϕF K /c2F . In particular, if we assume that the underlying distribution is normal, then by (2.10.23), this ratio is one.

2.11. BEHRENS-FISHER PROBLEM

2.11

135

Behrens-Fisher Problem

Consider the general model in Section 2.1 of this chapter, where X1 , . . . , Xn1 is a random sample on the random variable X which has distribution function F (x) and density function f (x) and Y1 , . . . , Yn2 is a second random sample, independent of the first, on the random variable Y which has common distribution function G(x) and density g(x). Let θX and θY denote the medians of X and Y , respectively, and let ∆ = θY − θX . In Section 2.4 we showed that the MWW test was consistent for the stochastically ordered alternative. In the location model where the distributions of X and Y differ by at most a shift in location, the hypothesis F = G is equivalent to the the null hypothesis that ∆ = 0. In this section we drop the location model assumption, that is, we will assume that X and Y have distribution functions F and G respectively, but we still consider the null hypothesis that ∆ = 0. In order to avoid confusion with Section 2.4, we explicitly state the hypotheses of this section as H0 : ∆ = 0 versus HA : ∆ > 0 , where ∆ = θY − θX , and L(X) = F, and L(Y ) = G . (2.11.1) As in the previous sections we have selected a specific alternative for the discussion. The above hypothesis is our most general hypothesis of this section and the modified Mathisen’s test defined below is consistent for it. We will also consider the case where the forms of F and G are the same; that is, G(x) = F (x/η), for some parameter η. Note in this case that L(Y ) = L(ηX); hence, η = T (Y )/T (X) where T (X) is any scale functional, (T (X) > 0 and T (aX) = aT (X) for a ≥ 0). If T (X) = σX , the standard deviation of X, then this is a Behrens-Fisher problem with F unknown. If we further assume that the distributions of X and Y are symmetric then the modified MWW, defined below, can be used to test that ∆ = 0. The most restrictive case, is when both F and G are assumed to be normal distribution functions. This is, of course, the classical Behrens-Fisher problem and the classical solution to it is the Welch type t-test, discussed below. For motivation we first show the behavior of usual the MWW statistic. We then consider general rank procedures and finally specialize to analogues of the L1 and MWW analyses.

2.11.1

Behavior of the Usual MWW Test

In order to motivate the problem, consider the null behavior of the usual MWW test under (2.11.1) with the further restriction that the distributions of X and Y are symmetric. Under H0 , since we are examining null behavior there is no loss of generality if we assume that θX = θY = 0. The asymptotic form of the MWW test rejects H0 in favor of HA if r n1 X n2 X n1 n2 n1 n2 (n + 1) + SR = I(Yj − Xi > 0) ≥ + zα . 2 12 i=1 j=1 This test would have asymptotic level α if F = G. As Exercise 2.13.41 shows, we still have EH0 (SR+ ) = n1 n2 /2 when the densities of X and Y are symmetric. From Theorem 2.4.5, Part

136

CHAPTER 2. TWO SAMPLE PROBLEMS

(a), the variance of the MWW statistic under H0 satisfies the limit, VarH0 (SR+ ) → λ1 Var(F (Y )) + λ2 Var(G(X)) . n1 n2 (n + 1) Recall that we obtained the asymptotic distribution of SR+ , Theorem 2.4.9, under general conditions which cover the current assumptions; hence, the true significance level of the MWW test has the following limiting behavior: " # r n1 n2 n1 n2 (n + 1) αS + = PH0 SR+ ≥ + zα R 2 12 s " # SR+ − n12n2 n1 n2 (n + 1) ≥ zα = P H0 p 12VarH0 (SR+ ) VarH0 (SR+ )   → 1 − Φ zα (12)−1/2 (λ1 Var(F (Y )) + λ2 Var(G(X)))−1/2 . (2.11.2) Under the assumptions that the sample sizes are the same and that L(X) and the L(Y ) have the same form we can simplify expression (2.11.2) further. We express the result in the following theorem.

Theorem 2.11.1. Suppose that the null hypothesis in (2.11.1) is true. Assume that the distributions of Y and X are symmetric, n1 = n2 , and G(x) = F (x/η) where η is an unknown parameter. Then the maximum observed significance level is 1 − Φ(.816zα ) which is approached as η → 0 or η → ∞. R Proof: Under the Rassumptions of the theorem, note that Var(F (Y )) = F 2 (ηt)dF (t) − 41 and Var(G(X)) = F 2 (x/η)dF (x) − 14 . Differentiating (2.11.2) with respect to η we get   φ zα (12)−1/2 ((1/2)Var(F (Y )) + (1/2)Var(G(X)))−1/2 zα (12)−1/2 Z −3/2 Z 2 F (ηt)tf (ηt)f (t)dt + F (t/η)f (t/η)(−t/η )f (t)dt .

(2.11.3)

Making the substitution u = ηt in the first integral, the quantity in braces reduces to R −2 η (F (u) − F (u/η))uf (u)f (u/η)du. Note that the other factors in (2.11.3) are strictly positive. Thus to determine the graphical behavior of (2.11.2) with respect to η, we need only consider the factor in braces. First note that it has a critical point at η = 1. Next consider the case η > 1. In this case F (u) − F (u/η) < 0 on the interval (−∞, 0) and is positive on the interval (0, ∞); hence the factor in braces is positive for η > 1. Using a similar argument this factor is negative for 0 < η < 1. Therefore the limit of the function αS + (η) is decreasing on the interval (0, 1), has a minimum at η = 1 and is increasing on the R interval (1, ∞). Thus the minimum level of significance occurs at η = 1, (the location model), where it is α. By the graphical behavior of the function, maximum levels would occur at the extremes

2.11. BEHRENS-FISHER PROBLEM of 0 and ∞. But it follows that Var(F (Y )) = and Var(G(X)) =

1 F (ηt)dF (t) − → 4



1 F (x/η)dF (x) − → 4



Z

Z

137

2

2

0 1 4

1 4

0

if η → 0 if η → ∞ if η → 0 . if η → ∞

From these two results and (2.11.2), the true significance level of the MWW test satisfies αS + → R



1 − Φ(zα (3/2)−1/2 ) 1 − Φ(zα (3/2)−1/2 )

if η → 0 . if η → ∞

Hence, αS + → 1 − Φ(zα (3/2)−1/2 ) = 1 − Φ(.816zα ) , R

whether η → 0 or ∞. Thus the maximum observed significance level is 1 − Φ(.816zα ) which is approached as η → 0 or η → ∞. For example if α = .05 then .816zα = 1.34 and αS + → 1 − Φ(1.34) = .09. Thus in the R equal sample size case when F and G differ only in scale parameter and are symmetric, the nominal 5% level of the MWW test will not be worse than .09. In order to guarantee that α ≤ .05 choose zα so that 1 − Φ(.816zα ) = .05. This leads to zα = 2.02 which is the critical value for an α = .02. Hence another way of saying this is: by performing a 2% MWW test we are guaranteed that the true (asymptotic) level is at most 5%.

2.11.2

General Rank Tests

Assuming the most general hypothesis, (2.11.1), we will follow the development of Fligner and Policello (1981) to construct general tests. Suppose T represents a rank test statistic, used in the case F = G, and that the test rejects H0 : ∆ = 0 in favor of HA : ∆ > 0 for large values of T . Suppose further that n1/2 (T − µF,G )/σF,G converges in distribution to a standard normal. Let µ0 denote the null mean of T and assume that it is independent of F . Next suppose that σ b is a consistent estimate of σF,G which is a function only of the ranks of the combined sample. This will ensure distribution freeness under H0 ; otherwise, the test statistic will only be asymptotically distribution free. The modified test statistic is n1/2 (T − µ0 ) Tb = . σ b

(2.11.4)

Such a test can be used for the general hypothesis (2.11.1). Fligner and Policello (1981) applied this approach to Mood’s statistic; see Hettmansperger and Malin (1975), also. In the next section, we consider Mathisen’s test.

138

2.11.3

CHAPTER 2. TWO SAMPLE PROBLEMS

Modified Mathisen’s Test

We next present a modified version of Mathisen’s test for the most general hypothesis (2.11.1). Let θbX = medi Xi and define the sign-process S2 (θ) =

n2 X j=1

sgn(Yj − θ) .

(2.11.5)

Recall from expression (2.6.8), Section 2.6.2 that Mathisen’s test statistic (centered version) is given by S2 (θbX ). This will be our test statistic. The modification lies in its asymptotic distribution which is given in the next theorem.

Theorem 2.11.2. Assume the null hypothesis in expression (2.11.1) is true. Then under the assumption (D.1), (2.4.7), √1n2 S2 (θbX ) is asymptotically normal with mean 0 and asymptotic 2 2 variance 1 + K12 where K12 is defined by 2 K12 =

λ2 g 2 (θY ) . λ1 f 2 (θX )

(2.11.6)

Proof: Assume without loss of generality that θX = θY = 0. From the asymptotic linearity results discussed in Example 1.5.2 of Chapter 1, we have that √ 1 . 1 √ S2 (θn ) = √ S2 (0) − 2g(0) n2 θn , n2 n2 √ √ for n|θn | ≤ c, c > 0. Since n2 θbX is bounded in probability, upon substitution in the last expression we get √ 1 . 1 (2.11.7) √ S2 (θbX ) = √ S2 (0) − 2g(0) n2 θbX . n2 n2 In Example 1.5.2, we also have the approximation

where S1 (0) =

Pn1

i=1

. θbX =

1 S1 (0) , n1 2f (0)

sgn(Xi ). Combining (2.11.7) and (2.11.8), we get r 1 g(0) 1 n2 1 . √ S2 (θbX ) = √ S2 (0) − √ S1 (0) . n2 n2 f (0) n1 n1

(2.11.8)

(2.11.9)

√ D The results follows because of independent samples and because Si (0)/ ni → N(0, 1), for i = 1, 2. In order to use this test we need an estimate of K12 . As in Chapter 1, selected order statistics from the sample X1 , . . . , Xn1 will provide a confidence interval for the median of X. Hence given a level α, the interval (L, U), where L1 = X(k+1) , U1 = X(n−k) , and

2.11. BEHRENS-FISHER PROBLEM

139

√ k = n/2 − zα/2 ( n/2) is an approximate (1 − α)100% confidence interval for the median of X. Let DX denote the length of this confidence interval. By Theorem 1.5.9 of Chapter 1, √ n1 DX P → 2f (0) . (2.11.10) 2zα/2 In the same way let DY denote the length of the corresponding (1 − α)100% confidence interval for the median of Y . Define b 12 = DY . (2.11.11) K DX b 12 is a consistent estimate From (2.11.10) and the corresponding result for DY , the estimate K of K12 , under both H0 and HA . Thus the modified Mathisen’s test for the general hypotheses (2.11.1), is to reject H0 at approximately level α if S2 (θbX ) ZM = q ≥ zα . (2.11.12) 2 b n2 (1 + K12 ) To derive the efficacy of this statistic we will use the development of Section 1.5.2. The average to consider is n−1 S2 (θbX ). Let ∆ denote the shift in medians and without loss of generality let θX = 0. Then the mean function we need is lim E∆ (n−1 S2 (θbX )) = µ(∆) . n→∞

Note that we can reexpress the expansion (2.11.9) as 1 S2 (θbX ) n

n2 1 S2 (θbX ) n n2   r r g(0) n2 n1 1 . n2 1 = S2 (0) − S1 (0) n1 n2 f (0) n1 n2 n1   g(0) P∆ → λ2 E∆ [sgn(Y )] − E∆ [sgn(X)] f (0) = λ2 E∆ [sgn(Y )] = µ(∆) , =

(2.11.13)

where the next to last equality holds since θX = 0. Using E∆ (sgn(Y )) = 1 − 2G(−∆), we obtain the derivative µ′ (0) = 2λ2 g(0) . (2.11.14) √ By Theorem 2.11.2 we have the asymptotic null variance of the test statistic S2 (θbX )/ n. From the above discussion then the statistic S2 (θbX ) is Pitman regular with efficacy √ λ1 λ2 2g(0) 2λ2 g(0) . (2.11.15) cM M = p =p 2 λ1 + λ2 (g 2 (0)/f 2 (0)) λ2 (1 + K12 ) Using Theorem 1.5.4 of Chapter 1, consistency of the modified Mathisen’s test for the hypotheses (2.11.1) is obtained provided µ(∆) > µ(0). But this follows immediately from the inequality G(−∆) > G(0).

140

2.11.4

CHAPTER 2. TWO SAMPLE PROBLEMS

Modified MWW Test

+ Recall R by Theorem 2.4.9 that the mean of the MWW test statistic SR is n1 n2 P (Y > X) = 1 − G(x)f (x)dx. For general F and G, though, this mean may not be 1/2 under H0 . Since this section is concerned with methods for testing the specific hypothesis that ∆ = 0, we add the further restriction that the distributions of X and Y are symmetric. Recall from Section 2.11.1 that under this assumption and ∆ = 0 that E(SR+ ) = n1 n2 /2; see Exercise 2.13.41. Using the general development of rank tests, Section 2.11.2, our modified rank test is given by: reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if Z > zα where

Z=

SR+ − (n1 n2 )/2 q , + d Var(SR )

(2.11.16)

d + ) is a consistent estimate of Var(S + ), under H0 . From the asymptotic distriwhere Var(S R R bution theory obtained for SR+ under general conditions, Theorem 2.4.9, it follows that this test has approximate level α. By Theorem 2.4.5, we can express the variance as Z ! Z ! Z Z 2

Var(SR+ )

= n1 n2

GdF −

+ n1 n2 (n2 − 1)

Z

GdF

(1 − G)2 dF −

2

2

+ n1 n2 (n1 − 1)

F dG −

Z

.

(1 − G)dF

2 !

F dG

(2.11.17)

Following the suggestion of Fligner and Policello (1981), we estimate Var(SR+ ) by replacing F and G by the empirical cdfs Fn1 and Gn2 respectively. As Exercise 2.13.42 demonstrates, this estimate is consistent and, further, it is a function of the ranks of the combined sample. Thus the test is distribution free when F (x) = G(x) and is asymptotically distribution free when F and G have symmetric densities. The efficacy for the modified MWW follows using an argument similar to that for the MWW in Section 2.4. As there, the function SR+ (∆) is a decreasing function of ∆. Its mean function is given by Z + + E∆ (SR ) = E0 (SR (−∆)) = n1 n2 (1 − G(x − ∆))f (x)dx . The average to consider here is S R = (n1 n2 )−1 SR+ . Letting µ(∆) denote the mean of S R under R ∆, we have µ′ (0) = g(x)f (x)dx > 0. The variance we need is σ 2 (0) = limn→∞ nVar0 (S R ), which using the above result on variance simplifies to Z ! Z ! Z Z 2

σ 2 (0) = λ−1 2

F 2 dG −

F dG

2

+ λ−1 1

(1 − G)2 dF −

(1 − G)dF

.

2.11. BEHRENS-FISHER PROBLEM

141

The process SR+ (∆) is Pitman regular and, in particular, its efficacy is given by, R √ λ1 λ2 g(x)f (x) cM M W W = r  . R R R 2  R 2  2 2 λ1 F dG − F dG + λ2 (1 − G) dF − (1 − G)dF

(2.11.18) As with the modified Mathisen’s test, we show consistency of the modified MWW test by using Theorem 1.5.4. Again we need only show that µ(0) < µ(∆). But this follows immediately provided the supports of F and G overlap in a neighborhood of 0. Note that this shows that the modified MWW is consistent for the hypotheses (2.11.1) under the further restriction that the densities of X and Y are symmetric.

2.11.5 Efficiencies and Discussion

Before obtaining the asymptotic relative efficiencies of the above procedures, we shall briefly discuss traditional methods. Suppose we restrict F and G to have symmetric densities of the same form with finite variance; that is, F(x) = F0((x − θX)/σX) and G(x) = F0((x − θY)/σY), where F0 is some distribution function with symmetric density f0 and σX and σY are the standard deviations of X and Y, respectively.

Under these assumptions, it follows that √n(Ȳ − X̄ − ∆) converges in distribution to N(0, (σX²/λ1) + (σY²/λ2)); see Exercise 2.13.43. The test is to reject H0 : ∆ = 0 in favor of HA : ∆ > 0 if tW > zα, where

tW = (Ȳ − X̄) / √( sX²/n1 + sY²/n2 ) ,

where sX² and sY² are the sample variances of Xi and Yj, respectively. Under these assumptions, it follows that these sample variances are consistent estimates of σX² and σY², respectively; hence, the test has approximate level α. If F0 is also normal then, under H0, tW has an approximate t distribution with a degrees of freedom correction proposed by Welch (1949). This test is frequently used in practice and we shall subsequently call it the Welch t-test. In contrast, the pooled t-test can behave poorly in this situation, since we have

tp = (Ȳ − X̄) / √( [((n1 − 1)sX² + (n2 − 1)sY²)/(n1 + n2 − 2)] (1/n1 + 1/n2) ) ≐ (Ȳ − X̄) / √( sX²/n2 + sY²/n1 ) ;

that is, the sample variances are divided by the wrong sample sizes. Hence unless the sample sizes are fairly close the pooled t is not asymptotically distribution free. Exercise 2.13.44 obtains the true asymptotic level of tp .
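A small simulation sketch of this point, under assumed normal distributions with unequal scales and unequal sample sizes; only standard R functions are used, and the scales and sizes below are arbitrary illustrative choices.

set.seed(1)
n1 <- 10; n2 <- 40
rej <- replicate(10000, {
  x <- rnorm(n1, sd = 3); y <- rnorm(n2, sd = 1)   # Delta = 0 holds
  c(pooled = t.test(y, x, var.equal = TRUE,
                    alternative = "greater")$p.value < 0.05,
    welch  = t.test(y, x, var.equal = FALSE,
                    alternative = "greater")$p.value < 0.05)
})
rowMeans(rej)   # the Welch t is near 0.05; the pooled t is typically far above it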


In order to get the efficacy of the Welch t, consider the statistic Ȳ − X̄. The mean function at ∆ is µ(∆) = ∆; hence, µ′(0) = 1. It follows from the asymptotic distribution discussed above that

√n [ √(λ1λ2)(Ȳ − X̄) / √(λ2σX² + λ1σY²) ] →D N(0, 1) ;

hence, σ(0) = √(λ2σX² + λ1σY²)/√(λ1λ2). Thus the efficacy of tW is given by

ctW = µ′(0)/σ(0) = √(λ1λ2) / √(λ2σX² + λ1σY²) = [ (σX²/λ1) + (σY²/λ2) ]^{−1/2} .   (2.11.19)

We obtain the AREs of the above procedures for the case where G(x) = F(x/η) and F(x) has density f(x) symmetric about 0 with variance 1. Thus η is the ratio of standard deviations σY/σX. For this case the efficacies (2.11.15), (2.11.18), and (2.11.19) reduce to

cMM = 2√(λ1λ2) f(0) / √(λ2 + λ1η²)
cMMWW = √(λ1λ2) ∫gf / √( λ1[ ∫F² dG − (∫F dG)² ] + λ2[ ∫(1 − G)² dF − (∫(1 − G) dF)² ] )
ctW = √(λ1λ2) / √(λ2 + λ1η²) .

Thus the ARE between the modified Mathisen's procedure and the Welch procedure is the ratio c²MM/c²tW = 4σX²f²(0) = 4f0²(0). This is the same ARE as in the location problem. In particular, the ARE does not depend on η = σY/σX. Thus the modified Mathisen's test in comparison to tW would have poor efficiency at the normal distribution, .63, but in general it would be much more efficient than tW for heavy tailed distributions. Similar to the modified Mathisen's test, the Mood test can also be modified for these problems; see Exercise 2.13.45. Its efficacy is the same as that of Mathisen's test. Asymptotic relative efficiencies involving the modified Wilcoxon do depend on the ratio of scale parameters η. Fligner and Rust (1982) show that if the variances of X and Y are quite different, then the modified Mathisen's test may be as efficient as the modified MWW irrespective of the shape of the underlying distribution.

Fligner and Policello (1981) conducted a simulation study of the pooled t, Welch's t, the MWW, and the modified MWW over situations where F and G differ in scale only. The unmodified tests did not maintain their level. Welch's t performed well when F and G were normal, whereas the modified MWW performed well over all situations, including unequal sample sizes and normal and contaminated normal distributions. In the simulation study performed by Fligner and Rust (1982), they found that the modified Mood test maintains its level over the situations that were considered by Fligner and Policello (1981). As a final note, Welch's t requires distributions with the same shape and the modified MWW requires symmetric densities. The modified Mathisen's test and the modified Mood test, though, are consistent tests for the general problem stated in expression (2.11.1).

2.12 Paired Designs

Consider the situation where we have two treatments of interest, say, A and B, which can be applied to subjects from a population of interest. Suppose we are interested in a particular response after these treatments have been applied. Let X denote the response of a subject after treatment A has been applied, and let Y be the corresponding measurement for a subject after treatment B has been applied. The natural null hypothesis, H0, is that there is no difference in treatment effects. A one sided alternative would be that the response of a subject under treatment B is in general larger than that of a subject under treatment A. Reversing the roles of A and B would yield the other one sided alternative, while the union of these two alternatives would result in the two sided alternative. Again for definiteness we choose as our alternative, HA, the first one sided alternative.

The completely randomized design and the paired design are two experimental designs which are often employed in this situation. In the completely randomized design, n subjects are selected at random from the population of interest and n1 of them are randomly assigned to treatment A while the remaining n2 = n − n1 are assigned to treatment B. At the end of the treatment period, we then have two samples, one on X while the other is on Y. The two sample procedures discussed in the previous sections can be used to analyze the data. Proper randomization along with carefully controlled experimental conditions give credence to the assumptions that the samples are random and are independent of one another. The design that produced the data of Example 2.3.1 was a completely randomized design.

While the completely randomized design is often used in practice, the underlying variability may impair the power of any procedure, robust or classical, to detect alternative hypotheses. The design discussed next usually results in a more powerful analysis, but it does require a pairing device; i.e., a block of length two. Suppose we have a pairing device. Some examples include identical twins for a study on human subjects, litter mates for a study on animal subjects, or the same exterior wall of a house for a study on the durability of exterior house paints. In the paired design, n pairs of subjects are randomly selected from the population of interest. Within each pair, one member is randomly assigned to treatment A while the other receives treatment B. Again let X and Y denote the responses of subjects after treatments A and B, respectively, have been applied. This experimental design results in a sample of pairs (X1, Y1), . . . , (Xn, Yn). The sample differences D1 = X1 − Y1, . . . , Dn = Xn − Yn, however, become the single sample of interest. Note that the random pairing in this design induces, under the null hypothesis, a symmetric distribution for the differences.

Theorem 2.12.1. In a randomized paired design, under the null hypothesis of no treatment effect, the differences Di are symmetrically distributed about 0.

Proof: Let F(x, y) denote the joint distribution of (X, Y). Under the null hypothesis of no treatment effect and randomized pairing, it follows that X and Y are exchangeable random variables; that is, P(X ≤ x, Y ≤ y) = P(X ≤ y, Y ≤ x). Hence for a difference


D = Y − X we have

P[D ≤ t] = P[Y − X ≤ t] = P[X − Y ≤ t] = P[−D ≤ t] .

Thus D and −D have the same distribution; hence D is symmetrically distributed about 0.

Let θ be a location functional for the distribution of Di. We shall further assume that Di is symmetrically distributed under alternative models also. Then we can express the above hypotheses by H0 : θ = 0 versus HA : θ > 0.

Note that the one sample analyses based on signs and signed-ranks discussed in Chapter 1 are appropriate for the randomly paired design. The appropriate sign test statistic is S = Σ sgn(Di), while the signed-rank statistic is T = Σ sgn(Di)R(|Di|). From Chapter 1 we shall summarize the analysis based on the signed-rank statistic. A level α test would reject H0 in favor of HA if T ≥ cα, where cα is determined from the null distribution of the Wilcoxon signed-rank test or from the asymptotic approximation to the distribution. The test is consistent for θ > 0 and it has the efficiency results discussed in Chapter 1. In particular, for normal errors the efficiency of T with respect to the usual paired t-test is .955. The associated point estimate of θ is the Hodges-Lehmann estimate given by θ̂ = med_{i≤j}{(Di + Dj)/2}. A distribution free confidence interval for θ is constructed based on the Walsh averages {(Di + Dj)/2}, i ≤ j, as discussed in Chapter 1. Instead of Wilcoxon scores, general signed-rank scores as discussed in Chapter 1 can also be used.

A similar summary holds for the analysis based on the sign statistic. In fact, for the sign scores we need not assume that D1, . . . , Dn are identically distributed; that is, there can be a block effect. This is discussed further in Chapter 4.

We should mention that if the pairing is not done randomly then Di may or may not be symmetrically distributed. If the symmetry assumption is realistic, then both sign and signed-rank analyses can be used. If, however, it is not realistic, then the sign analysis would still be valid, but caution would be necessary in interpreting the results of the signed-rank analysis.

Example 2.12.1. Darwin Data: The data, Table 2.12.1, are some measurements recorded by Charles Darwin in 1878. They consist of 15 pairs of heights in inches of cross-fertilized plants and self-fertilized plants (Zea mays), each pair grown in the same pot.

Table 2.12.1: Plant Growth
Pot     1      2      3      4      5      6      7      8
Cross- 23.500 12.000 21.000 22.000 19.125 21.500 22.125 20.375
Self-  17.375 20.375 20.000 20.000 18.375 18.625 18.625 15.250
Pot     9     10     11     12     13     14     15
Cross- 18.250 21.625 23.250 21.000 22.125 23.000 12.000
Self-  16.500 18.000 16.250 18.000 12.750 15.500 18.000

Let Di denote the difference between the heights of the cross-fertilized and self-fertilized plants of the ith pot and let θ denote the median of the distribution of Di. Suppose we are interested in testing for an effect; that is, the hypotheses are H0 : θ = 0 versus HA : θ ≠ 0. The boxplot of the differences is displayed in Panel A of Figure 2.12.1, while Panel B gives the normal q−q plot of the differences. As the plots indicate, the differences for Pot 2 and, perhaps, Pot 15 are possible outliers. The results from the RBR functions onesampwil and onesampsgn are shown below.

RBR Results for Darwin Data
Results for the Wilcoxon-Signed-Rank procedure
Test of theta = 0 versus theta not equal to 0
Test-Stat. is T 72  Standardized (z) Test-Stat. is 2.016  p-value 0.043
Estimate 3.1375  SE is 1.244385
95 % Confidence Interval is ( 0.5 , 5.2125 )
Estimate of the scale parameter tau 4.819484

Results for the Sign procedure
Test of theta = 0 versus theta not equal to 0
Test stat. S is 11  Standardized (z) Test-Stat. 2.581  p-value 0.009
Estimate 3  SE is 1.307422
95 % Confidence Interval is ( 1 , 6.125 )
Estimate of the scale parameter tau 5.063624

The value of the signed-rank Wilcoxon statistic for this data is T = 72 with the approximate p-value of .044. The corresponding estimate of θ is 3.14 inches and the 95% confidence interval is (.50, 5.21). There are 13 positive differences, so the standardized value of the sign test statistic is 2.58, with the p-value of 0.01. The corresponding estimate of θ is 3 inches and the 95% interpolated confidence interval is (1.00, 6.13). The paired t-test statistic has the value of 2.15 with p-value 0.050. The difference in sample means is 2.62 inches and the corresponding 95% confidence interval is (0, 5.23). Note that the outliers impaired the t-test and, to a lesser degree, the Wilcoxon signed-rank test; see Exercise 2.13.46 for further analyses.
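For readers without the RBR functions, the following base-R sketch reproduces the main numbers of this example; note that wilcox.test reports T+, the sum of the positive signed ranks, and T = Σ sgn(Di)R(|Di|) = 2T+ − n(n + 1)/2.

cross <- c(23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
           18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000)
self  <- c(17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
           16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000)
d <- cross - self
n <- length(d)
wt <- wilcox.test(d, conf.int = TRUE)   # signed-rank test and H-L estimate
2 * wt$statistic - n * (n + 1) / 2      # T = 72
wt$estimate                             # median of Walsh averages, about 3.14
binom.test(sum(d > 0), n)               # sign test: 13 of 15 positive
t.test(d)                               # paired t-test, for comparison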

2.12.1 Behavior under Alternatives

In this section we will compare sample size determination for the paired design with sample size determination for the completely randomized design. For the paired design, let γ+(θ) denote the power function of the Wilcoxon signed-rank test statistic at the alternative θ. Then the asymptotic power lemma, Theorem 1.5.8 with c = τ⁻¹ = √12 ∫f²(t) dt, for the signed-rank Wilcoxon from Chapter 1 states that at significance level α and under the sequence of

Figure 2.12.1: Boxplot of Darwin Data.

contiguous alternatives θn = θ/√n,

lim_{n→∞} γ+(θn) = P( Z ≥ zα − θ/τ ) ,

where Z has a standard normal distribution.

We will only consider the case where the random vector (Y, X) is jointly normal with variance-covariance matrix

V = σ² [ 1  ρ ; ρ  1 ] .

Then τ = √(π/3) σ √(2(1 − ρ)). Now suppose we select the sample size n* so that the Wilcoxon signed-rank test has power γ+(θ0) to detect the one-sided alternative θ0 > 0 for a level α test. Then, writing θ0 = √n* θ0 / √n*, we have by the asymptotic power lemma and (1.5.25) that

γ+(θ0) ≐ 1 − Φ(zα − √n* θ0/τ) ,

and

n* ≐ ( (zα − z_{γ+(θ0)}) / θ0 )² τ² .

Substituting the value of τ into this final equation, we have that the necessary sample size for the paired design to achieve the desired local power is

n* ≐ ( (zα − z_{γ+(θ0)}) / θ0 )² (π/3) σ² 2(1 − ρ) .   (2.12.1)


Next consider a two-sample design with equal sample sizes n1 = n2 = n*. Assume that X and Y are iid normal with variance σ². Then τ² = (π/3)σ². Hence by (2.4.25), the necessary sample size for the completely randomized design to achieve power γ+(θ0) at the one-sided alternative θ0 > 0 for a level α test is given by

n = ( (zα − z_{γ+(θ0)}) / θ0 )² 2(π/3)σ² .   (2.12.2)

Based on expressions (2.12.1) and (2.12.2), the sample size needed for the paired design is (1 − ρ) times the sample size needed for the completely randomized design. If the pairing device is such that X and Y are strongly, positively correlated, then it pays to use the paired design. The paired design is a disaster, of course, if the variables are negatively correlated.
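The comparison can be computed directly; the following sketch evaluates (2.12.1) and (2.12.2), with the values of θ0, σ, and ρ below being assumed for illustration.

n.paired.vs.crd <- function(theta0, sigma, rho, alpha = 0.05, power = 0.80) {
  k <- ((qnorm(1 - alpha) - qnorm(1 - power)) / theta0)^2
  c(paired = k * (pi / 3) * sigma^2 * 2 * (1 - rho),   # (2.12.1)
    crd    = k * 2 * (pi / 3) * sigma^2)               # (2.12.2)
}
n.paired.vs.crd(theta0 = 0.5, sigma = 1, rho = 0.7)
# the paired design needs (1 - rho) times as many observations per arm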

2.13 Exercises

2.13.1. (a). Derive the L2 estimates of intercept and shift based on the L2 norm on Model (2.2.4). (b). Next apply the pseudo-norm (2.2.16) to (2.2.4) and derive the estimating function. Show that the natural test statistic is the pooled t-statistic.
2.13.2. Show that (2.2.17) is a pseudo-norm. Show, also, that it can be written in terms of ranks; see the formula following (2.2.17).
2.13.3. In the proof of Theorem 2.4.2, verify that L(Yj − Xi) = L(Xi − Yj).
2.13.4. Prove Theorem 2.4.3.
2.13.5. Prove that if a continuous random variable Z has cdf H(z), then the random variable H(Z) has a uniform distribution on (0, 1).
2.13.6. In Theorem 2.4.4, show that E(F(Y)) = ∫F(y) dG(y) = ∫(1 − G(x)) dF(x) = E(1 − G(X)).
2.13.7. Prove that if Zn converges in distribution to Z and if Var(Zn − Wn) and EZn − EWn converge to 0, then Wn also converges in distribution to Z.
2.13.8. Verify (2.4.10).
2.13.9. Explain what happens to the MWW statistic when one support is shifted completely to the right of the other support. What does this imply about the consistency of the MWW in this case?
2.13.10. Show that the L2 estimating function is Pitman regular and derive the efficacy of the pooled t-test. Also, establish the asymptotic power lemma, Theorem 2.4.13, for the L2 case. Finally, establish the asymptotic distribution of √n(Ȳ − X̄).
2.13.11. Prove that the Hodges-Lehmann estimate of shift, (2.2.18), is translation and scale equivariant. (See the discussion in Section 2.4.4.)
2.13.12. Prove Theorem 2.4.15.
2.13.13. In Example 2.4.1, form the residuals Zi − ∆̂ci, i = 1, . . . , n. Then, similar to Section 1.5.5, use these residuals to estimate τ based on (1.3.30).

2.13.14. Simulate independent random samples from N(20, 5²) and N(22, 5²) distributions of sizes 10 and 15, respectively. Let ∆ denote the shift in the locations of the distributions. (a.) Obtain comparison boxplots for your samples. (b.) Use the Wilcoxon procedure to test H0 : ∆ = 0 versus HA : ∆ ≠ 0 at level .05.


(c.) Use the Wilcoxon procedure to estimate ∆ and obtain a 95% confidence interval for it. (d.) Obtain the true value of τ. Use your confidence interval in the last item to obtain an estimate of τ. Obtain a symmetric 95% confidence interval for ∆ based on your estimate. (e.) Form a pooled estimate of τ based on the Wilcoxon signed rank process for each sample. Obtain a symmetric 95% confidence interval for ∆ based on your estimate. Compare it with the estimate from the last item and the true value.
2.13.15. Write minitab macros to bootstrap the distribution of ∆̂. Obtain the bootstrap distribution for 500 bootstraps of data of Problem 2.13.14. What is your bootstrap estimate of τ? Compare with the true value and the other estimates.
2.13.16. Verify the scalar multiple condition for the pseudo-norm in the proof of Theorem 2.5.1.
2.13.17. Verify (2.5.9) and (2.5.10).
2.13.18. Consider the process Sϕ(∆), (2.5.11): (a). Show that Sϕ(∆) is a decreasing step function, with steps occurring at Yj − Xi. (b). Using Part (a) and the MWW estimator as a starting value, write with some details an algorithm which obtains the estimator ∆̂ϕ. (c). Verify expressions (2.5.14), (2.5.15), and (2.5.16).

2.13.19. Consider the optimal score function (2.5.22): (a). Show it is location invariant and scale equivariant. Hence, show that if g(x) = (1/σ)f((x − µ)/σ), then ϕg = σ⁻¹ϕf. (b). Use (2.5.22) to show that the MWW is asymptotically efficient when the underlying distribution is logistic (F(x) = (1 + exp(−x))⁻¹, −∞ < x < ∞). (c). Show that (2.6.1) is optimal for a Laplace or double exponential distribution (f(x) = (1/2) exp(−|x|), −∞ < x < ∞). (d). Show that the optimal score function for the extreme value distribution (f(x) = exp{x − e^x}, −∞ < x < ∞) is given by (2.8.8). (e). Show that the optimal score function for the normal distribution is given by (2.5.33). Show that it is standardized. (f). Show that (2.5.34) is the optimal score function for an underlying distribution that has a left logistic tail and a right exponential tail.


2.13.20. Show that when the underlying density f is symmetric, then ϕf(1 − u) = −ϕf(u).
2.13.21. Show that expression (2.6.6) is true and that the n = 2r differences, Y(1) − X(r) < Y(2) − X(r−1) < · · · < Y(n2) − X(r−n2+1), can be ordered only knowing the order statistics from the individual samples.
2.13.22. Develop the asymptotic linearity formula for Mood's estimating function given in (2.6.3). Then give an alternative proof of Theorem 2.6.1 based on this result.
2.13.23. Verify the moment formulas (2.6.9) and (2.6.10).
2.13.24. Show that any estimator based on the pseudo-norm (2.5.2) is equivariant. Hence, if we multiply the combined sample observations by a constant, then the estimator is multiplied by that same constant.
2.13.25. Suppose X is a continuous random variable representing the time until failure of some process. The hazard function for a continuous random variable X with cdf F is defined to be the instantaneous rate of failure at X = t, conditional on survival to time t. It is formally given by:

hX(t) = lim_{∆t→0+} P(t ≤ X < t + ∆t | X ≥ t) / ∆t .

(a). Show that

hX(t) = f(t) / (1 − F(t)) .

(b). Suppose that Y has cdf given by (2.8.1). Show that the hazard function is given by hY(t) = αhX(t).
2.13.26. Verify (2.8.4).
2.13.27. Apply the delta method of finding the asymptotic distribution of a function to (2.8.3) to find the asymptotic distribution of α̂. Then verify (2.8.5). Explain how this can be used to find an approximate (1 − α)100% confidence interval for α.

2.13.28. Verify (2.8.14).

2.13.29. Show that the asymptotic relative efficiency of the Mann-Whitney-Wilcoxon test to the Savage test at the log exponential model is 3/4.
2.13.30. Verify (2.10.5).
2.13.31. Show that if |X| has an F(2, 2) distribution, then log |X| has a logistic distribution.
2.13.32. Suppose f(t) is the logistic pdf. Show that the optimal score function (2.10.6) is given by ϕ(u) = u{log[(u + 1)/(1 − u)]}.


2.13.33. (a). Verify (2.10.6). (b). Apply (2.10.6) to the normal distribution. (c). Apply (2.10.6) to the Laplace or double exponential distribution.
2.13.34. We consider the Siegel-Tukey (1960) test for the equality of variances when the underlying centers are equal but possibly unknown. The test statistic is the sum of ranks of the Y sample in the combined sample (MWW statistic). However, the ranks are assigned in a different way: In the ordered combined sample assign rank 1 to the smallest value, rank 2 to the largest value, rank 3 to the second largest value, rank 4 to the second smallest value, and so on, alternately assigning ranks to end values. To test H0 : varX = varY vs HA : varX > varY, reject H0 when the sum of ranks of the Y sample is large. Find the mean, variance, and the limiting distribution of the test statistic. Show how to find an approximate size α test.
2.13.35. Develop a sample size formula for the scale problem similar to the sample size formula in the location problem, (2.4.25).
2.13.36. Verify (??).
2.13.37. Compute the efficacy of Mood's scale test, the Ansari-Bradley scale test, and Klotz's scale test discussed in Section ??.
2.13.38. Verify the asymptotic properties given in (2.10.26), (2.10.27) and (2.10.28).
2.13.39. Compute the efficiency of Mood's scale test and the Ansari-Bradley scale test relative to the classical F test for equality of variances.
2.13.40. Show that the Ansari-Bradley scale test is optimal for f(x) = (1/2)(1 + |x|)⁻², −∞ < x < ∞.
2.13.41. Show that when F and G have densities symmetric at 0 (or any common point), the expected value of SR+ is n1n2/2.
2.13.42. Show that the estimate of (2.11.17) based on the empirical cdfs is consistent and that it is a function only of the combined sample ranks.
2.13.43. Under the general model in Section 2.11.5, derive the limiting distribution of √n(Ȳ − ∆ − X̄).
2.13.44. Find the true asymptotic level of the pooled t-test under the null hypothesis in (2.11.1).
2.13.45. Develop a modified Mood's test similar to the modified Mathisen's test discussed in Section 2.11.5.


2.13.46. Construct and discuss a normal quantile plot of the differences from Table 2.12.1. Carry out the Boos test for asymmetry (??). Why do these results suggest that the L1 analysis may be the best analysis in this example?
2.13.47. Consider the data set of information on professional baseball players given in Exercise 1.12.32. Let ∆ denote the shift parameter of the difference between the height of a pitcher and the height of a hitter. (a.) Obtain comparison dotplots between the heights of the pitchers and hitters. Does a shift model seem appropriate? (b.) Use the MWW test statistic to test the hypotheses H0 : ∆ = 0 versus HA : ∆ > 0. Compute the p-value. (c.) Determine a point estimate for ∆ and a 95% confidence interval for ∆ based on the MWW procedure. (d.) Obtain an estimate of the standard deviation of ∆̂. Use it to obtain an approximate 95% confidence interval for ∆.

2.13.48. Repeat Exercise 2.13.47 when ∆ is the shift parameter for the difference in pitchers’ and hitters’ weights.

2.13.49. Repeat Exercise 2.13.47 when ∆ is the shift parameter for the difference in left handed (A-1) and right handed (A-0) pitchers' ERA's and the hypotheses are H0 : ∆ = 0 versus HA : ∆ ≠ 0.

Chapter 3

Linear Models

3.1 Introduction

In this chapter we discuss the theory for a rank-based analysis of a general linear model. Applications of this analysis to experimental design models will be discussed in Chapter 4. The rank-based analysis is complete, consisting of estimation, testing, and diagnostic tools for checking the adequacy of fit of the model, outlier detection, and detection of influential cases. As in the earlier chapters, we present the analysis in terms of its geometry. The analysis could be based on either rank scores or signed-rank scores. We have chosen to use the general rank scores of Chapter 2. This allows the error distribution to be either asymmetric or symmetric. An analysis based on signed-rank scores would parallel the one based on rank scores except that the theory would require a symmetric error distribution; see Hettmansperger and McKean (1983) for discussion. Although the results are established for general score functions, we illustrate the methods with Wilcoxon and sign scores throughout. We will commonly use the subscripts R and S for results based on Wilcoxon and sign scores, respectively.

3.2 Geometry of Estimation and Tests

For i = 1, . . . , n, let Yi denote the ith observation and let xi denote a p × 1 vector of explanatory variables. Consider the linear model

Yi = x′iβ + e*i ,

(3.2.1)

where β is a p × 1 vector of unknown parameters. In this chapter, the components of β are the parameters of interest. We are interested in estimating β and testing linear hypotheses concerning it. However, it will be convenient to also have a location parameter. So, accordingly, let α = T(e*i) be a location functional. One that we will frequently use is the median. Let ei = e*i − α; then T(ei) = 0 and the model can be written as

Yi = α + x′iβ + ei .

(3.2.2)


The parameter α is called an intercept parameter. An argument similar to the one concerning the shift parameter ∆ of Chapter 2 shows that β does not depend on the location functional used. Let Y = (Y1 , . . . , Yn )′ denote the n × 1 vector of observations and let X denote the n × p matrix whose ith row is x′i . We can then express the model as Y = 1α + Xβ + e ,

(3.2.3)

where 1 is an n × 1 vector of ones, and e′ = (e1, . . . , en). Since the model includes an intercept parameter, α, there is no loss in generality in assuming that X is centered; i.e., the columns of X sum to 0. Further, in this chapter, we will assume that X has full column rank p. Let ΩF denote the column space spanned by the columns of X. Note that we can then write the model as

Y = 1α + η + e , where η ∈ ΩF .   (3.2.4)

This model is often called the coordinate free model. Besides estimation of the regression coefficients, we are interested in tests of general linear hypotheses of the form

H0 : Mβ = 0 versus HA : Mβ ≠ 0 ,

(3.2.5)

where M is a q × p matrix of full row rank. In this section, we discuss the geometry of estimation and testing with rank-based procedures for the linear model.

3.2.1 Estimation

With respect to model (3.2.4), we will estimate η by minimizing the distance between Y and the subspace ΩF. In this chapter we will define distance in terms of the norms or pseudo-norms presented in Chapter 2. Consider, first, the general R pseudo-norm discussed in Chapter 2, which is given by expression (2.5.2) and which we write for convenience as

‖v‖ϕ = Σ_{i=1}^n a(R(vi)) vi ,   (3.2.6)

where a(1) ≤ a(2) ≤ · · · ≤ a(n) is a set of scores generated as a(i) = ϕ(i/(n + 1)) for some nondecreasing score function ϕ(u) defined on the interval (0, 1) and standardized such that ∫ϕ(u) du = 0 and ∫ϕ²(u) du = 1. This was shown to be a pseudo-norm in Chapter 2. Recall that the Wilcoxon pseudo-norm is generated by the linear score function ϕ(u) = √12(u − 1/2). We will also discuss the sign pseudo-norm, which is generated by ϕ(u) = sgn(u − 1/2), and show that it is equivalent to using the L1 norm. In Section 3.10 we will also discuss a class of score functions appropriate for survival type analyses.

For the general R pseudo-norm given above by (3.2.6), an R-estimate of η is a vector Ŷϕ such that

Dϕ(Y, ΩF) = ‖Y − Ŷϕ‖ϕ = min_{η∈ΩF} ‖Y − η‖ϕ .   (3.2.7)


Figure 3.2.1: The R-estimate of η is a vector Ŷϕ which minimizes the normed differences, (3.2.6), between Y and ΩF. The distance between Y and the space ΩF is Dϕ(Y, ΩF).

These quantities are represented geometrically in Figure 3.2.1. Once η has been estimated, β can be estimated by solving the equation Xβ = Ŷϕ; that is, the R-estimate of β is β̂ϕ = (X′X)⁻¹X′Ŷϕ. As discussed later in Section 3.7, the intercept α can be estimated by a location estimate based on the residuals ê = Y − Ŷϕ. One that we will frequently use is the median of the residuals, which we denote as α̂S = med{Yi − x′iβ̂ϕ}. Theorem 3.5.7 shows, under regularity conditions, that

(α̂S, β̂′ϕ)′ has an approximate Np+1( (α, β′)′ , V ) distribution, where V = [ n⁻¹τS²  0′ ; 0  τϕ²(X′X)⁻¹ ] ,   (3.2.8)

where τϕ and τS are the scale parameters defined in displays (3.4.4) and (3.4.6), respectively. From this result, an asymptotic confidence interval for the linear function h′β is given by

h′β̂ϕ ± t_{(α/2, n−p−1)} τ̂ϕ √( h′(X′X)⁻¹h ) ,   (3.2.9)

where the estimate τ̂ϕ is discussed in Section 3.7.1. The use of t-critical values instead of z-critical values is documented in the small sample studies cited in Section 3.7. Note the close analogy between this confidence interval and those based on LS estimates. The only difference is that σ̂ has been replaced by τ̂ϕ.

We will make use of the coordinate free model, especially in Chapter 4; however, in this chapter we are primarily concerned with the properties of the estimator β̂ϕ and it will be more convenient to use the coordinate model (3.2.3). Define the dispersion function by

Dϕ(β) = ‖Y − Xβ‖ϕ .   (3.2.10)

Then Dϕ(β̂ϕ) = Dϕ(Y, ΩF) = ‖Y − Ŷϕ‖ϕ is the R-distance between Y and the subspace ΩF. It is also the residual dispersion. Because Dϕ is expressed in terms of a norm, it is a continuous and convex function of β; see Exercise 1.12.3. Exercise 3.16.2 shows that the ranks of the residuals can only change at the boundaries of the regions defined by the n(n − 1)/2 equations yi − x′iβ = yj − x′jβ. Note that in the simple linear regression case, these equations define the sample slopes (yj − yi)/(xj − xi). Hence, in the interior of these regions the ranks are constant. Therefore, Dϕ(β) is a piecewise linear, continuous, convex function of β with gradient (defined almost everywhere) given by

∇Dϕ(β) = −Sϕ(Y − Xβ) ,   (3.2.11)

where

Sϕ(Y − Xβ) = X′a(R(Y − Xβ))   (3.2.12)

b solves the equations and a(R(Y − Xβ)))′ = (a(R(Y1 − x′1 β)), . . . , a(R(Yn − x′n β))). Thus β ϕ . Sϕ (Y − Xβ) = X′ a(R(Y − Xβ))) = 0 ,

(3.2.13)

which are called the R normal equations. A quadratic form in Sϕ (Y − Xβ 0 ) serves as the gradient R-test statistic for testing H0 : β = β 0 versus HA : β 6= β 0 . In terms of the simple regression problem Sϕ (β) is a decreasing step function of β, which steps down at each sample slope. There may be an interval of solutions of Sϕ (β) = 0 or Sϕ (β) may step across the horizontal axis. Let βbϕ denote any point in the interval in the formerPcase and the crossing point in the latter case. The gradient test statistic is Sϕ (β0 ) = xi a(R(yi − xi β0 )). If the x’s are distinct and equally spaced then for Wilcoxon scores this test statistic is equivalent to the test for correlation based on Spearman’s rS ; see Exercise 3.16.4. For the asymptotic distribution theory of estimation and testing, we note that the esb (Y) denote the R-estimate β for the lintimate is location and scale equivariant. Let β ϕ b (Y + Xδ) = β b (Y) + δ and ear model ( 3.2.3). Then, as shown in Exercise 3.16.6, β ϕ ϕ b b β ϕ (kY) = k β ϕ (Y). In particular these results imply, without loss of generality, that the theory developed in the following sections can be accomplished under the assumption that the true β is 0. As a final note, we outline the least squares estimates. The LS estimates of η in model ( 3.2.4) is given by b LS = Argmin kY − ηk2 , Y LS k · kLS denotes the least squares pseudo-norm given by ( 2.2.16) of Chapter 2. The value of η which minimizes this pseudo-norm is b LS = HY , η

(3.2.14)

where H is the projection matrix onto the space ΩF i.e.; H = X(X′X)−1 X′ . Denote the sum of squared residuals by SSE = minη ∈Ω kY − ηk2LS = k(I − H)Yk2LS . In order to have F 2 similar notation we shall denote this minimum by DLS (Y, ΩF ). Also, it is easy to show that ′ −1 ′ b the least squares estimate of β is β LS = (X X) X Y.
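As an illustration of the R-estimation just described, the following hedged R sketch computes a Wilcoxon fit by applying a general-purpose optimizer to the dispersion function (3.2.10). The function name and the use of Nelder-Mead started at the LS fit are our own illustrative choices, not the algorithm of Section 3.7.3; X is the centered design matrix without an intercept column, since the pseudo-norm is invariant to location shifts of the residuals.

wilcoxon.fit <- function(X, y) {
  n <- length(y)
  a <- sqrt(12) * (seq_len(n) / (n + 1) - 0.5)     # Wilcoxon scores a(i)
  D <- function(beta) { e <- y - X %*% beta; sum(a[rank(e)] * e) }
  beta0 <- lm.fit(X, y)$coefficients               # LS starting value
  beta  <- optim(beta0, D)$par                     # minimize the dispersion
  alpha <- median(y - X %*% beta)                  # intercept: median residual
  list(alpha = alpha, beta = beta)
}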

3.2.2 The Geometry of Testing

We next discuss the geometry behind rank-based tests of the general linear hypotheses given by (3.2.5). As above, consider the model (3.2.4),

Y = 1α + η + e , where η ∈ ΩF ,   (3.2.15)

and ΩF is the column space of the full model design matrix X. Let Ŷϕ,ΩF denote the R-fitted value in the full model. Note that Dϕ(Y, ΩF) is the amount of residual dispersion not accounted for in fitting the Model (3.2.4). These are shown geometrically in Figure 3.2.2.

Next let ω denote the subspace of ΩF subject to H0. In symbols, ω = {η ∈ ΩF : η = Xβ, for some β such that Mβ = 0}. In Exercise 3.16.7 the reader is asked to show that ω is a subspace of ΩF of dimension p − q. Let Ŷϕ,ω denote the R-estimate of η when the reduced model is fit, and let Dϕ(Y, ω) = ‖Y − Ŷϕ,ω‖ϕ denote the distance between Y and the subspace ω. These are illustrated by Figure 3.2.2. The nonnegative quantity

RDϕ = Dϕ(Y, ω) − Dϕ(Y, ΩF)   (3.2.16)

denotes the reduction in residual dispersion when we pass from the reduced model to the full model. Large values of RDϕ indicate HA, while small values support H0.

Figure 3.2.2: The reduction in dispersion RDϕ is the difference in normed distances between Y and the subspaces ΩF and ω.

This drop in residual dispersion, RDϕ, is analogous to the drop in residual sums of squares for the LS analysis. In fact, to obtain this reduction in sums of squares, we need only replace the R-norm with the square of the Euclidean norm in the above development. Thus the drop in sums of squared errors is

SS = D²LS(Y, ω) − D²LS(Y, ΩF) ,

where D²LS(Y, ΩF) is defined above. Hence the reduction in sums of squared residuals can be written as

SS = ‖(I − Hω)Y‖²LS − ‖(I − HΩF)Y‖²LS .

The traditional least squares F-test is given by

FLS = (SS/q) / σ̂² ,   (3.2.17)

where σ̂² = D²LS(Y, ΩF)/(n − p). Other than replacing one norm with another, Figures 3.2.1 and 3.2.2 remain the same for the two analyses, LS and R.

In order to be useful as a test statistic, similar to least squares, the reduction in dispersion RDϕ must be standardized. The asymptotic distribution theory that follows suggests the standardization

Fϕ = (RDϕ/q) / (τ̂ϕ/2) ,   (3.2.18)

where τ̂ϕ is the estimate of τϕ discussed in Section 3.7. Small sample studies cited in Section 3.7 indicate that Fϕ should be compared with F-critical values with q and n − (p + 1) degrees of freedom, analogous to the LS classical F-test statistic. Similar to the LS F-test, the test based on Fϕ can be summarized in the ANOVA table, Table 3.2.1. Note that the reduction in dispersion replaces the reduction in sums of squares in the classical table. These robust ANOVA tables were first discussed by Schrader and McKean (1976).
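A sketch of the resulting test, for the special case where the reduced model design matrix Xred is obtained by deleting columns of Xfull; it reuses wilcoxon.fit from Section 3.2.1, and tauhat is an assumed external estimate of τϕ (its estimation is discussed in Section 3.7).

drop.in.disp.test <- function(Xfull, Xred, y, tauhat) {
  disp <- function(X) {
    n <- length(y); a <- sqrt(12) * (seq_len(n) / (n + 1) - 0.5)
    e <- y - X %*% wilcoxon.fit(X, y)$beta
    sum(a[rank(e)] * e)                       # residual dispersion D_phi
  }
  q  <- ncol(Xfull) - ncol(Xred)
  RD <- disp(Xred) - disp(Xfull)              # reduction in dispersion (3.2.16)
  Fphi <- (RD / q) / (tauhat / 2)             # the standardization (3.2.18)
  c(Fphi = Fphi,
    p.value = pf(Fphi, q, length(y) - ncol(Xfull) - 1, lower.tail = FALSE))
}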

Table 3.2.1: Robust ANOVA Table for H0 : Mβ = 0
Source       Reduction in Dispersion            df           Mean Reduction in Dispersion   Fϕ
Regression   RDϕ = Dϕ(Y, ω) − Dϕ(Y, ΩF)         q            RDϕ/q                          Fϕ
Error                                           n − (p + 1)  τ̂ϕ/2

Table 3.2.2: Robust ANOVA Table for H0 : β = 0
Source       Reduction in Dispersion            df           Mean Reduction in Dispersion   Fϕ
Regression   RD = Dϕ(0) − Dϕ(Y, ΩF)             p            RD/p                           Fϕ
Error                                           n − p − 1    τ̂ϕ/2

Tests that all Regression Coefficients are 0

As discussed more fully in Section 3.6, there are three R-test statistics for the hypotheses (3.2.5). These are the R-analogues of the classical tests: the likelihood ratio test, the scores test, and the Wald test. We shall introduce them here for the special null hypothesis that all the regression parameters are 0; i.e.,

H0 : β = 0 versus HA : β ≠ 0 .   (3.2.19)

Their asymptotic theory and small sample properties are discussed in more detail in later sections. In this case, the reduced model dispersion is just the dispersion of the response vector Y, i.e., Dϕ(0). Hence, the R-test based on the reduction in dispersion is

Fϕ = [ (Dϕ(0) − Dϕ(Y, ΩF))/p ] / (τ̂ϕ/2) .   (3.2.20)

As discussed above, Fϕ should be compared with F(α, p, n − p − 1)-critical values. Similar to the general hypothesis, the test based on Fϕ can be expressed in the robust ANOVA table given in Table 3.2.2. This is the robust analogue of the traditional ANOVA table that is printed out for a regression analysis by most least squares regression packages.

The R-scores test is the test based on the gradient. Theorem 3.5.2, below, gives the asymptotic distribution of the gradient Sϕ(0) under the null hypothesis. This leads to the asymptotic level α test: reject H0 if

S′ϕ(0)(X′X)⁻¹Sϕ(0) ≥ χ²α(p) .   (3.2.21)

Note that this test avoids the estimation of τϕ.

The R-Wald test is a quadratic form in the full model estimates. Based on the asymptotic distribution of the full model estimate β̂ϕ given in Corollary 3.5.1, an asymptotic level α test rejects H0 if

( β̂′ϕ(X′X)β̂ϕ / p ) / τ̂ϕ² ≥ F(α, p, n − p − 1) .   (3.2.22)
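The scores test is easy to compute since it needs neither a fit nor τ̂ϕ. A minimal sketch with Wilcoxon scores follows; the function name is ours, X is assumed centered with full column rank, and the divisor σa² (essentially 1 for standardized scores; see Theorem 3.5.1) is included for finite samples.

scores.test <- function(X, y, alpha = 0.05) {
  n <- length(y); p <- ncol(X)
  a <- sqrt(12) * (seq_len(n) / (n + 1) - 0.5)
  S <- t(X) %*% a[rank(y)]                    # S_phi(0) = X'a(R(Y))
  sa2 <- sum(a^2) / (n - 1)
  T <- drop(t(S) %*% solve(t(X) %*% X) %*% S) / sa2
  c(T = T, reject = T >= qchisq(1 - alpha, p))
}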

3.3 Examples

We offer several examples to illustrate the rank-based estimates and test procedures discussed in the last section. For all the examples, we use Wilcoxon scores, ϕ(u) = √12(u − (1/2)), for the rank-based estimates of the regression coefficients. We estimate the intercept by the median of the residuals and we estimate the scale parameter τϕ as discussed in Section 3.7. We begin with a simple regression data set and proceed to multiple regression problems.

Example 3.3.1. Telephone Data
The response for this data set is the number of telephone calls (tens of millions) made in Belgium for the years 1950 through 1973. Time, the years, serves as our only predictor variable. The data is discussed in Rousseeuw and Leroy (1987) and, for convenience, is displayed in Table 3.3.1.

Table 3.3.1: Data for Example 3.3.1. The number of calls is in tens of millions and the years are from 1950-1973.
Year       50    51    52    53    54    55    56    57    58    59    60    61
No. Calls  0.44  0.47  0.47  0.59  0.66  0.73  0.81  0.88  1.06  1.20  1.35  1.49
Year       62    63    64     65     66     67     68     69     70    71    72    73
No. Calls  1.61  2.12  11.90  12.40  14.20  15.90  18.20  21.20  4.30  2.40  2.70  2.90

The Wilcoxon estimates of the intercept and slope are −7.13 and .145, respectively, while the LS estimates are −26 and .504. The reason for this disparity in fits is easily seen in Panel A of Figure 3.3.1, which is a scatterplot of the data overlaid with the LS and Wilcoxon fits. Note that the years 1964 through 1969 had a profound effect on the LS fit, while the Wilcoxon fit was much less sensitive to these years. As discussed in Rousseeuw and Leroy, the recording system for the years 1964 through 1969 differed from that of the other years. Panels B and C of Figure 3.3.1 are the studentized residual plots of the fits; see (3.9.31) of Section 3.9. As with internal LS-studentized residuals, values of the internal R-studentized residuals which exceed 2 in absolute value are potential outliers. Note that the internal Wilcoxon studentized residuals clearly show that the years 1964-1969 are outliers, while the internal LS studentized residuals only detect 1969. The Wilcoxon studentized residuals also mildly detect the year 1970. Based on the scatterplot, this point does not follow the trend of the early (before 1964) years either. The scatterplot and Wilcoxon residual plot indicate that there may be a quadratic trend over the years before the outliers occur. The last few years, though, do not seem to follow this trend. Hence, a linear model for this data is questionable. On the basis of these plots, we will not discuss any formal inference for this data set.
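This example can be checked numerically in base R. Since the pseudo-norm is invariant to a constant shift of the residuals, the dispersion below is minimized over the slope alone, and the intercept is then the median of the residuals; this is a sketch, not the fitting algorithm of Section 3.7.3.

year  <- 50:73
calls <- c(0.44, 0.47, 0.47, 0.59, 0.66, 0.73, 0.81, 0.88, 1.06, 1.20,
           1.35, 1.49, 1.61, 2.12, 11.90, 12.40, 14.20, 15.90, 18.20,
           21.20, 4.30, 2.40, 2.70, 2.90)
n <- length(calls)
a <- sqrt(12) * (seq_len(n) / (n + 1) - 0.5)       # Wilcoxon scores
D <- function(beta) { e <- calls - beta * year; sum(a[rank(e)] * e) }
betaR  <- optimize(D, c(-0.2, 0.6))$minimum        # about .145
alphaR <- median(calls - betaR * year)             # about -7.13
coef(lm(calls ~ year))                             # LS: about -26 and .504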

Figure 3.3.1: Panel A: Scatterplot of the Telephone Data, overlaid with the LS and Wilcoxon fits; Panel B: Internal LS studentized residual plot; Panel C: Internal Wilcoxon studentized residual plot; and Panel D: Wilcoxon dispersion function.

Panel D of Figure 3.3.1 depicts the Wilcoxon dispersion function over the interval (−.2, .6). Note that the Wilcoxon estimate β̂R = .145 is the minimizing value. Next consider the hypotheses H0 : β = 0 versus HA : β ≠ 0. The basis for the test statistic Fϕ can be read from this plot. The reduction in dispersion is given by RD = D(0) − D(.145). Also, the gradient test of these hypotheses would be the negative of the slope of the dispersion function at 0; i.e., −D′(0).

Example 3.3.2. Baseball Salaries
As a large data set, we consider data on the salaries of professional baseball pitchers for the 1987 baseball season. This data set was taken from the data set on baseball salaries which was used in the 1988 ASA Graphics Section Poster Session. It can be obtained at the web site: http://lib.stat.cmu.edu/datasets. Our analysis concerns a subdata set of 176 pitchers, which can be obtained from the authors upon request. Our response variable is the 1987 beginning salary (in log dollars) of these pitchers. As predictors, we took the career summary statistics through the end of the 1986 season. The names of these variables are listed in Table 3.3.2. Panels A - G of Figure 3.3.2 show the scatter plots of the log of salary versus each of the predictors. Certainly the strongest predictor on the basis of these plots is log years; although, linearity in this plot is questionable.

Figure 3.3.2: Panels A - G: Plots of log-salary versus each of the predictors for the baseball data of Example 3.3.2; Panel H: Internal Wilcoxon studentized residual plot.

The internal Wilcoxon studentized residuals, (3.9.31), versus fitted values are displayed in Panel H of Figure 3.3.2. Based on Panels A and H, the pattern in the residual plot follows from the fact that log years is not a linear predictor. Better fitting models are pursued in Exercise 3.16.1. Note that there are several large outliers. The three identified outliers, circled points in Panel H, are interesting. These correspond to the pitchers Steve Carlton, Phil Niekro and Rick Sutcliff. These were very good pitchers, but in 1987 they were at the end of their careers (21, 23, and 21 years of pitching, respectively); hence, they missed the rapid rise in baseball salaries. A diagnostic analysis (see Section 3.9 and Exercise 3.16.1) indicates a few mildly influential points, also. For illustration, though, we will consider the model that we fit. Table 3.3.2 also displays the estimated coefficients and their standard errors. The outliers impaired the LS fit somewhat. The LS estimate of σ is .515, in comparison to the estimate of τ, which is .388. Table 3.3.3 displays the robust ANOVA table for testing that all the coefficients, except


the intercept, are 0. Based on the large value of Fϕ, (3.2.20), the predictors are helpful in explaining the response. In particular, based on Table 3.3.2, the predictors years in professional baseball, earned run average, average innings per year, and average number of saves per year seem more important than the variables wins, losses, and games. These last three variables form a similar group of variables; hence, as an illustration of the rank-based statistic Fϕ, the hypothesis that the coefficients for these three predictors are 0 was tested. The reduction in dispersion for this hypothesis is RD = 1.24, which leads to Fϕ = 2.12, which is significant at the 10% level. This confirms the above observations on the regression coefficients.

Table 3.3.2: Predictors for Baseball Salaries of Pitchers and Their Estimated (Wilcoxon Fit) Coefficients
Predictor                            Estimate   Stand. Error   t-ratio
log Years in professional baseball   .839       .044           19.15
Average wins per year                .045       .028           1.63
Average losses per year              -.024      .026           -.921
Earned run average                   -.146      .070           -2.11
Average games per year               -.006      .004           1.60
Average innings per year             .004       .003           1.62
Average saves per year               .012       .011           1.07
Intercept                            4.22       .324
Scale (τ)                            .388

Table 3.3.3: Wilcoxon ANOVA Table for H0 : β = 0
Source       Reduction in Dispersion   df    Mean Reduction in Dispersion   Fϕ
Regression   78.287                    7     11.18                          57.65
Error                                  168   .194

Example 3.3.3. Potency Data
This example is part of an n = 34 multivariate data set discussed in Chapter 6; see Table 6.6.2 for the data. The experiment concerned the potency of drug compounds which were manufactured under different levels of 4 factors. Here we shall consider only one of the response variables, POT2, which is the potency of a drug compound at the end of two weeks. The factors are: SAI, the amount of intragranular steric acid, which was set at the three levels −1, 0 and 1; SAE, the amount of extragranular steric acid, which was set at the three levels −1, 0 and 1; ADS, the amount of cross carmellose sodium, which was set at the three levels −1, 0 and 1; and TYPE of steric acid, which was set at two levels −1 and 1. The initial potency of the compound, POT0, served as a covariate. In Example 3.9.2 of Section 3.9 a residual analysis of this data set is performed. This analysis indicates that the model which includes the covariate, the linear terms of the factors,

the simple two-way interaction terms of the factors, and the quadratic terms of the three factors SAE, SAI and ADS is adequate. Let xj for j = 1, . . . , 4 denote the level of the factors SAI, SAE, ADS, and TYPE, respectively, and let ci denote the value of the covariate. Then the model is expressed as

yi = α + β1x1,i + β2x2,i + β3x3,i + β4x4,i + β5x1,ix2,i + β6x1,ix3,i + β7x1,ix4,i + β8x2,ix3,i + β9x2,ix4,i + β10x3,ix4,i + β11x²1,i + β12x²2,i + β13x²3,i + β14ci + ei .   (3.3.1)

Table 3.3.4: Wilcoxon and LS Estimates for the Potency Data
                         Wilcoxon Estimates    LS Estimates
Terms       Parameter    Est.      SE          Est.      SE
Intercept   α            7.184     2.96        5.998     4.50
Linear      β1           0.072     0.05        0.000     0.08
            β2           0.023     0.05        -0.018    0.07
            β3           0.166     0.05        0.135     0.07
            β4           0.020     0.04        -0.011    0.05
Two-way     β5           0.042     0.05        0.086     0.08
Inter.      β6           -0.040    0.05        0.035     0.08
            β7           0.040     0.05        0.102     0.07
            β8           -0.085    0.06        -0.030    0.09
            β9           0.024     0.05        0.070     0.07
            β10          -0.049    0.05        -0.011    0.07
Quad.       β11          -0.002    0.10        0.117     0.15
            β12          -0.222    0.09        -0.240    0.13
            β13          0.022     0.09        -0.007    0.14
Covariate   β14          0.092     0.31        0.217     0.47
Scale       τ or σ       .204                  .310

The Wilcoxon and LS estimates of the regression coefficients and their standard errors are given in Table 3.3.4. The Wilcoxon estimates are more precise. As the diagnostic analysis of Example 3.9.2 shows, this is due to the outliers in this data set. Note that the Wilcoxon estimate of the parameter β13, the quadratic term of the factor ADS, is significant. Again referring to the residual analysis given in Example 3.9.2, there is some graphical evidence to retain the three quadratic coefficients in the model. In order to statistically confirm this evidence, we will test the hypotheses

H0 : β12 = β13 = β14 = 0 versus HA : βi ≠ 0 for some i = 12, 13, 14 .

The Wilcoxon test is summarized in Table 3.3.5 and it is based on the test statistic (3.2.18). The value of the test statistic is significant at the .05 level. The LS F-test statistic, though, has the value 1.19. As with its estimates of the regression coefficients, the LS F-test statistic has been impaired by the outliers.

Table 3.3.5: Wilcoxon ANOVA Table for H0 : β12 = β13 = β14 = 0
Source            Reduction in Dispersion   df   Mean Reduction in Dispersion   Fϕ
Quadratic Terms   .977                      3    .326                           3.20
Error                                       19   .102

3.4 Assumptions for Asymptotic Theory

For the asymptotic theory developed in this chapter, certain assumptions on the distribution of the errors, the design matrix, and the scores are needed. The required assumptions for each section may differ, but, for easy reference, we have placed them in this section. The major assumption on the error density function f for much of the rank-based analysis is:

(E.1) f is absolutely continuous, 0 < I(f) < ∞ ,   (3.4.1)

where I(f) denotes Fisher information, (2.4.16). Since f is absolutely continuous, we can write

f(s) − f(t) = ∫_t^s f′(x) dx

for some function f′. An application of the Cauchy-Schwarz inequality yields

|f(s) − f(t)| ≤ I(f)^{1/2} |F(s) − F(t)|^{1/2} ;   (3.4.2)

see Exercise 1.12.20. It follows from (3.4.2) that assumption (E.1) implies that f is uniformly bounded and uniformly continuous. An assumption that will be used for analyses based on the L1 norm is:

(E.2) f(θe) > 0 ,   (3.4.3)

where θe denotes the median of the error distribution; i.e., θe = F⁻¹(1/2). For easy reference, we list again the scale parameter τϕ, (2.5.23),

τϕ = ( ∫ ϕ(u)ϕf(u) du )⁻¹ ,   (3.4.4)

where

ϕf(u) = − f′(F⁻¹(u)) / f(F⁻¹(u)) .   (3.4.5)

Under (E.1) the scale parameter τϕ is well defined. Another scale parameter that will be needed is τS defined as: τS = (2f (θe ))−1 ; (3.4.6) see ( 1.5.21). Note that it is well defined under Assumption (E.2).


As above, let H = X(X′X)⁻¹X′ denote the projection matrix onto Ω, the column space of X. Our asymptotic theory assumes that the design matrix X is embedded in a sequence of design matrices which satisfy the next two properties. We should subscript quantities such as X and the projection matrix with n to show this, but as a matter of convenience we have not done so. We will subscript the leverage values hiin, which are the diagonal entries of the projection matrix H. We will often impose the next two conditions on the design matrix:

(D.2) lim_{n→∞} max_{1≤i≤n} hiin = 0   (3.4.7)
(D.3) lim_{n→∞} n⁻¹X′X = Σ ,   (3.4.8)

where Σ is a p × p positive definite matrix. The first condition has become known as Huber's condition. Huber (1981) showed that (D.2) is a necessary and sufficient design condition for the least squares estimates to have an asymptotic normal distribution provided the errors ei are iid with finite variance. Condition (D.3) reduces to assumption (D.1), (2.4.7), of Chapter 2 for the two sample problem. Another design condition is Noether's condition, which is given by

(N.1) max_{1≤i≤n} x²ik / Σ_{j=1}^n x²jk → 0 for all k = 1, . . . , p .   (3.4.9)

Although this condition will be convenient, as the next lemma shows it is implied by Huber's condition.

Lemma 3.4.1. (D.2) implies (N.1).

Proof: By the generalized Cauchy-Schwarz inequality (see Graybill, 1976, page 224), for all i = 1, . . . , n we have the following equalities:

sup_{‖δ‖=1} ( δ′xix′iδ / δ′X′Xδ ) = x′i(X′X)⁻¹xi = hnii .

Next, for k = 1, . . . , p, take δ to be δk, the p × 1 vector of zeroes except for 1 in the kth component. Then the above equalities imply that

x²ik / Σ_{j=1}^n x²jk ≤ hnii ,  i = 1, . . . , n, k = 1, . . . , p .

Hence

max_{1≤k≤p} max_{1≤i≤n} x²ik / Σ_{j=1}^n x²jk ≤ max_{1≤i≤n} hnii .

Therefore Huber's condition implies Noether's condition.

As in Chapter 2, we will often assume that the score generating function ϕ(u) satisfies assumption (2.5.5). We will in addition assume that it is bounded. For reference, we will assume that ϕ(u) is a function defined on (0, 1) such that

(S.1) ϕ(u) is a nondecreasing, square-integrable, and bounded function with ∫₀¹ ϕ(u) du = 0 and ∫₀¹ ϕ²(u) du = 1 .   (3.4.10)
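These conditions are easy to examine numerically. The following sketch, for an assumed simulated centered design (purely illustrative data), computes the maximum leverage in (D.2) and the Noether ratios in (N.1).

set.seed(2)
X <- scale(matrix(rnorm(100 * 3), 100, 3), center = TRUE, scale = FALSE)
H <- X %*% solve(t(X) %*% X) %*% t(X)       # projection matrix
max(diag(H))                                # max leverage h_nii (Huber's condition)
apply(X^2, 2, function(s) max(s) / sum(s))  # Noether ratios, one per column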


Occasionally we will need further assumptions on the score function. In Section 3.7, we will need to assume that

(S.2) ϕ is differentiable .   (3.4.11)

When estimating the intercept parameter based on signed-rank scores, we need to assume that the score function is odd about 1/2; i.e.,

(S.3) ϕ(1 − u) = −ϕ(u) ;   (3.4.12)

see, also, (2.5.5).

3.5 Theory of Rank-Based Estimates

Consider the linear model given by (3.2.3). To avoid confusion, we will denote the true vector of parameters by (α0, β′0)′; that is, the true model is Y = 1α0 + Xβ0 + e. In this section we will derive the asymptotic theory for the R-analysis, estimation and testing, under the assumptions (E.1), (D.2), (D.3), and (S.1). We will occasionally suppress the subscripts ϕ and R from the notation. For example, we will denote the R-estimate by simply β̂.

3.5.1 R-Estimators of the Regression Coefficients

A key result for both estimation and testing concerns the gradient S(Y − Xβ), (3.2.12). We first derive its mean and covariance matrix and then obtain its asymptotic distribution.

Theorem 3.5.1. Under Model (3.2.3),

E[S(Y − Xβ0)] = 0 and V[S(Y − Xβ0)] = σa² X′X , where σa² = (n − 1)⁻¹ Σ_{i=1}^n a²(i) ≐ 1 .

Proof: Note that S(Y − Xβ0) = X′a(R(e)). Under Model (3.2.3), e1, . . . , en are iid; hence, the ith component of a(R(e)) has mean

E[a(R(ei))] = Σ_{j=1}^n a(j) n⁻¹ = 0 ,

from which the result for the expectation follows. For the result on the variance-covariance matrix, note that V[S(Y − Xβ0)] = X′V[a(R(e))]X. The diagonal entries of the covariance matrix on the RHS are:

V[a(R(ei))] = E[a²(R(ei))] = Σ_{j=1}^n a(j)² n⁻¹ = ((n − 1)/n) σa² .


The off-diagonal entries are the covariances, given by

cov(a(R(ei)), a(R(el))) = E[a(R(ei))a(R(el))]
                        = Σ_{j=1}^n Σ_{k≠j} a(j)a(k) (n(n − 1))⁻¹
                        = −(n(n − 1))⁻¹ Σ_{j=1}^n a²(j)
                        = −σa²/n ,   (3.5.1)

where the third step in the derivation follows from 0 = (Σ_{j=1}^n a(j))². The result (3.5.1) is obtained directly from these variances and covariances. Under (D.3), we have that

V[n^{−1/2} S(Y − Xβ0)] → Σ .   (3.5.2)

This anticipates our next result.

Theorem 3.5.2. Under the Model ( 3.2.3), (E.1), (D.2), (D.3), and (S.1) in Section 3.4,

n^{-1/2} S(Y − Xβ₀) →_D N_p(0, Σ) .   (3.5.3)

Proof: Let S(0) = S(Y − Xβ₀) and let T(0) = X′ϕ(F(Y − Xβ₀)). Under the above assumptions, the discussion around Theorem A.3.1 of the appendix shows that (T(0) − S(0))/√n converges to 0 in probability. Hence we need only show that T(0)/√n converges to the intended distribution. Letting W* = n^{-1/2} t′T(e), where t ≠ 0 is an arbitrary p × 1 vector, it suffices to show that W* converges in distribution to a N(0, t′Σt) distribution. Note that we can write W* as

W* = n^{-1/2} Σ_{k=1}^n t′x_k ϕ(F(e_k)) .   (3.5.4)

Since F is the distribution function of e_k, it follows from ∫ϕ du = 0 that E[W*] = 0, and from ∫ϕ² du = 1 and (D.3) that

V[W*] = n^{-1} Σ_{k=1}^n (t′x_k)² = t′ n^{-1} X′X t → t′Σt > 0 .   (3.5.5)

Since W* is a sum of independent random variables which are not identically distributed, we establish the limit distribution by the Lindeberg-Feller Central Limit Theorem; see Theorem A.1.1 of the Appendix. In the notation of this theorem let B_n² = V[W*]. By ( 3.5.5), B_n² converges to a positive real number. We need to show

lim B_n^{-2} Σ_{k=1}^n E[ n^{-1}(x_k′t)² ϕ²(F(e_k)) I( |n^{-1/2}(x_k′t) ϕ(F(e_k))| > εB_n ) ] = 0 .   (3.5.6)


The key is the factor n^{-1/2}(x_k′t) in the indicator function. By the Cauchy-Schwarz inequality and (D.2) we have the string of inequalities

n^{-1/2}|x_k′t| ≤ n^{-1/2}‖x_k‖‖t‖ = [n^{-1} Σ_{j=1}^p x_{kj}²]^{1/2} ‖t‖ ≤ [p max_j n^{-1} x_{kj}²]^{1/2} ‖t‖ .   (3.5.7)

By assumptions (D.2) and (D.3), it follows that the quantity in brackets in equation ( 3.5.7), and hence n^{-1/2}|x_k′t|, converges to zero as n → ∞. Call the term on the right side of equation ( 3.5.7) M_n. Note that it does not depend on k and M_n → 0. From this string of inequalities, the limit on the left side of (3.5.6) is less than or equal to

lim B_n^{-2} · lim E[ ϕ²(F(e₁)) I( |ϕ(F(e₁))| > εB_n/M_n ) ] · lim n^{-1} Σ_{k=1}^n (x_k′t)² .

The first and third limits are positive reals. For the second limit, note that the random variable inside the expectation is bounded; hence, by the Lebesgue Dominated Convergence Theorem we can interchange the limit and expectation. Since εB_n/M_n → ∞, the expectation goes to 0 and our desired result is obtained.

Similar to Chapter 2, Exercise 3.16.9 obtains the proof of the above theorem for the special case of the Wilcoxon scores by first getting the projection of the statistic W*. Note from this theorem we have the gradient test that all the regression coefficients are 0; that is, H₀: β = 0 versus H_A: β ≠ 0. Consider the test statistic

T = σ_a^{-2} S(Y)′(X′X)^{-1} S(Y) .   (3.5.8)

From the last theorem, an approximate level α test for H₀ versus H_A is:

Reject H₀ in favor of H_A if T ≥ χ²(α, p) ,   (3.5.9)

where χ²(α, p) denotes the upper level α critical value of the χ²-distribution with p degrees of freedom.

Theorem A.3.8 of the Appendix gives the following linearity result for the process S(β_n):

n^{-1/2} S(β_n) = n^{-1/2} S(β₀) − τ_ϕ^{-1} Σ √n(β_n − β₀) + o_p(1) ,   (3.5.10)

for √n(β_n − β₀) = O(1), where the scale parameter τ_ϕ is given by ( 3.4.4). Recall that we made use of this result in Section 2.5 when we showed that the two sample location process under general score functions is Pitman regular.
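As a numerical illustration of the gradient test ( 3.5.8)-( 3.5.9), the following Python sketch simulates a null model with heavy tailed errors and computes T with Wilcoxon scores; the code and its variable names are ours and merely illustrative, not part of any package.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(1)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    X = X - X.mean(axis=0)                      # centered design
    y = rng.standard_t(df=3, size=n)            # H0: beta = 0

    a = np.sqrt(12.0) * (np.arange(1, n + 1) / (n + 1.0) - 0.5)  # Wilcoxon scores
    S = X.T @ a[np.argsort(np.argsort(y))]      # gradient S(Y), ( 3.2.12)
    sigma_a2 = np.sum(a ** 2) / (n - 1)
    T = S @ np.linalg.solve(sigma_a2 * (X.T @ X), S)   # statistic (3.5.8)
    print(T, chi2.ppf(0.95, df=p))              # reject H0 if T exceeds the cutoff

Under H₀ the statistic should fall below the χ²(.05, 3) critical value in about 95% of such simulations.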

If we integrate the RHS of this result we obtain a locally smooth approximation of the dispersion function D(β), given by the following quadratic function:

Q(Y − Xβ) = (2τ_ϕ)^{-1}(β − β₀)′X′X(β − β₀) − (β − β₀)′S(Y − Xβ₀) + D(Y − Xβ₀) .   (3.5.11)

Note that Q depends on τ_ϕ and β₀, so it cannot be used to estimate β. As we will show, the function Q is quite useful for establishing asymptotic properties of the R-estimates and test statistics. As discussed in Section 3.7.3, it also leads to a Gauss-Newton type algorithm for obtaining R-estimates. The following theorem shows that Q provides a local approximation to D. This is an asymptotic quadraticity result which was proved by Jaeckel (1972). It in turn is based on an asymptotic linearity result derived by Jurečková (1971) and displayed above, ( 3.5.10). It is proved in the Appendix; see Theorem A.3.8.

Theorem 3.5.3. Under the Model ( 3.2.3) and the assumptions (E.1), (D.1), (D.2), and (S.1) of Section 3.4, for any ε > 0 and c > 0,

lim_{n→∞} P[ max_{√n‖β−β₀‖ ≤ c} |D(Y − Xβ) − Q(Y − Xβ)| ≥ ε ] = 0 .

3.5.2 R-Estimates of the Intercept

We turn next to the estimation of the intercept parameter α; under assumption (E.2) of Section 3.4 the scale parameter τ_S = (2f(0))^{-1}, ( 3.4.6), is well defined, i.e., f(0) > 0. The process we consider is the sign process based on residuals, given by

S₁(Y − α1 − Xβ̂_ϕ) = Σ_{i=1}^n sgn(Y_i − α − x_i′β̂_ϕ) .   (3.5.18)

As with the sign process in Chapter 1, this process is a nonincreasing step function of α which steps down at the residuals. The solution to the equation

S₁(Y − α1 − Xβ̂_ϕ) ≐ 0   (3.5.19)

is the median of the residuals, which we shall denote by α̂_S = med{Y_i − x_i′β̂_ϕ}. Our goal is to obtain the asymptotic joint distribution of the estimate b̂_ϕ = (α̂_S, β̂_ϕ′)′.

Similar to the R-estimate of β the estimate of the intercept is location and scale equivariant; hence, without loss of generality we will assume that the true intercept and regression parameters are 0. We begin with a lemma.

Lemma 3.5.1. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For any ε > 0 and for any a ∈ R,

lim_{n→∞} P[ |S₁(Y − an^{-1/2}1 − Xβ̂_ϕ) − S₁(Y − an^{-1/2}1)| ≥ ε√n ] = 0 .

The proof of this lemma was first given by Jurečková (1971) for general signed-rank scores, and it is briefly sketched in the Appendix for the sign scores; see Lemma A.3.2. This lemma leads to the asymptotic linearity result for the process ( 3.5.18). We need the following linearity result:

Theorem 3.5.6. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For any ε > 0 and c > 0,

lim_{n→∞} P[ sup_{|a|≤c} |n^{-1/2}S₁(Y − an^{-1/2}1 − Xβ̂_ϕ) − n^{-1/2}S₁(Y − Xβ̂_ϕ) + aτ_S^{-1}| ≥ ε ] = 0 ,

where τ_S is the scale parameter defined in expression ( 3.4.6).


Proof: For any fixed a write

|n^{-1/2}S₁(Y − an^{-1/2}1 − Xβ̂_ϕ) − n^{-1/2}S₁(Y − Xβ̂_ϕ) + aτ_S^{-1}|
 ≤ |n^{-1/2}S₁(Y − an^{-1/2}1 − Xβ̂_ϕ) − n^{-1/2}S₁(Y − an^{-1/2}1)|
 + |n^{-1/2}S₁(Y − an^{-1/2}1) − n^{-1/2}S₁(Y) + aτ_S^{-1}|
 + |n^{-1/2}S₁(Y) − n^{-1/2}S₁(Y − Xβ̂_ϕ)| .

We can apply Lemma 3.5.1 to the first and third terms on the right side of the above inequality. For the middle term we can use the asymptotic linearity result in Chapter 1 for the sign process, ( 1.5.22). This yields the result for any a, and the sup follows from the monotonicity of the process, similar to the proof of Theorem 1.5.6 of Chapter 1.

Letting a = 0 in Lemma 3.5.1, we have that the difference n^{-1/2}S₁(Y − Xβ̂_ϕ) − n^{-1/2}S₁(Y) goes to zero in probability. Thus the asymptotic distribution of n^{-1/2}S₁(Y − Xβ̂_ϕ) is the same as that of n^{-1/2}S₁(Y), namely, N(0, 1). We have two applications of these results. The first is found in the next lemma.

Lemma 3.5.2. Assume conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4. The random variable n^{1/2}α̂_S is bounded in probability.

Proof: Let ε > 0 be given. Since n^{-1/2}S₁(Y − Xβ̂_ϕ) is asymptotically N(0, 1), there exists a c < 0 such that

P[n^{-1/2}S₁(Y − Xβ̂_ϕ) < c] < ε/2 .   (3.5.20)

Take c* = τ_S(c − ε). By the process's monotonicity and the definition of α̂_S, we have the implication n^{1/2}α̂_S < c* ⇒ n^{-1/2}S₁(Y − c*n^{-1/2}1 − Xβ̂_ϕ) ≤ 0. Adding in and subtracting out the above linearity result leads to

P[n^{1/2}α̂_S < c*] ≤ P[n^{-1/2}S₁(Y − c*n^{-1/2}1 − Xβ̂_ϕ) ≤ 0]
 ≤ P[|n^{-1/2}S₁(Y − c*n^{-1/2}1 − Xβ̂_ϕ) − (n^{-1/2}S₁(Y − Xβ̂_ϕ) − c*τ_S^{-1})| ≥ ε]
 + P[n^{-1/2}S₁(Y − Xβ̂_ϕ) − c*τ_S^{-1} < ε] .   (3.5.21)

The first term on the right side can be made less than ε/2 for sufficiently large n, whereas the second term is ( 3.5.20). From this it follows that n^{1/2}α̂_S is bounded below in probability. To finish the proof, a similar argument shows that n^{1/2}α̂_S is bounded above in probability.

As a second application we can write the linearity result of the last theorem as

n^{-1/2}S₁(Y − an^{-1/2}1 − Xβ̂_ϕ) = n^{-1/2}S₁(Y) − aτ_S^{-1} + o_p(1) ,   (3.5.22)

uniformly for all |a| ≤ c and for c > 0. Because α̂_S is a solution to equation ( 3.5.19) and n^{1/2}α̂_S is bounded in probability, the second linearity result, ( 3.5.22), yields, after some simplification, the following asymptotic representation of our estimate of the intercept, for the true intercept α₀:

n^{1/2}(α̂_S − α₀) = τ_S n^{-1/2} Σ_{i=1}^n sgn(Y_i − α₀) + o_p(1) ,   (3.5.23)


where τ_S is given in ( 3.4.6). From this we have that n^{1/2}(α̂_S − α₀) →_D N(0, τ_S²). Our interest, though, is in the joint distribution of α̂_S and β̂_ϕ.

By Corollary 3.5.2 the corresponding asymptotic representation of β̂_ϕ for the true vector of regression coefficients β₀ is

n^{1/2}(β̂_ϕ − β₀) = τ_ϕ (n^{-1}X′X)^{-1} n^{-1/2} X′ϕ(F(Y)) + o_p(1) ,   (3.5.24)

where τ_ϕ is given by ( 3.4.4). The joint asymptotic distribution is given in the following theorem.

Theorem 3.5.7. Under (D.1), (D.2), (S.1), (E.1) and (E.2) in Section 3.4, b̂_ϕ = (α̂_S, β̂_ϕ′)′ has an approximate N_{p+1} distribution with mean (α₀, β₀′)′ and covariance matrix

[ n^{-1}τ_S²        0′
  0       τ_ϕ²(X′X)^{-1} ] .

Proof: As above, assume without loss of generality that the true parameters are 0. It is easier to work with the random vector T_n = (τ_S^{-1}√n α̂_S, √n (τ_ϕ^{-1}(n^{-1}X′X)β̂_ϕ)′)′. Let t = (t₁, t₂′)′ be an arbitrary, nonzero vector in R^{p+1}. We need only show that Z_n = t′T_n has an asymptotically univariate normal distribution. Based on the above asymptotic representations of α̂_S, ( 3.5.23), and β̂_ϕ, ( 3.5.24), we have

Z_n = n^{-1/2} Σ_{k=1}^n ( t₁ sgn(Y_k) + (t₂′x_k)ϕ(F(Y_k)) ) + o_p(1) .   (3.5.25)

Denote the sum on the right side of ( 3.5.25) as Z_n*. We need only show that Z_n* converges in distribution to a univariate normal distribution. Denote the kth summand as Z_{nk}*. We shall use the Lindeberg-Feller Central Limit Theorem. Our application of this theorem is similar to its use in the proof of Theorem 3.5.2. First note that since the score function ϕ is standardized (∫ϕ = 0), E(Z_n*) = 0. Let B_n² = Var(Z_n*). Because the individual summands are independent, the Y_k are identically distributed, ϕ is standardized (∫ϕ² = 1), and the design is centered, B_n² simplifies to

B_n² = n^{-1}( Σ_{k=1}^n t₁² + Σ_{k=1}^n (t₂′x_k)² + 2t₁ cov(sgn(Y₁), ϕ(F(Y₁))) t₂′ Σ_{k=1}^n x_k )
 = t₁² + t₂′(n^{-1}X′X)t₂ + 0 .

Hence by (D.2),

lim_{n→∞} B_n² = t₁² + t₂′Σt₂ ,   (3.5.26)

which is a positive number. To satisfy the Lindeberg-Feller condition, we need to show that for any ε > 0

lim_{n→∞} B_n^{-2} Σ_{k=1}^n E[ Z_{nk}*² I(|Z_{nk}*| > εB_n) ] = 0 .   (3.5.27)


Since B_n² converges to a positive constant, we need only show that the sum converges to 0. By the triangle inequality, the indicator function satisfies

I( n^{-1/2}|t₁| + n^{-1/2}|t₂′x_k||ϕ(F(Y_k))| > εB_n ) ≥ I( |Z_{nk}*| > εB_n ) .   (3.5.28)

Following the discussion after expression ( 3.5.7), we have that n^{-1/2}|t₂′x_k| ≤ M_n, where M_n is independent of k and, furthermore, M_n → 0. Hence, we have

I( |ϕ(F(Y_k))| > (εB_n − n^{-1/2}|t₁|)/M_n ) ≥ I( n^{-1/2}|t₁| + n^{-1/2}|t₂′x_k||ϕ(F(Y_k))| > εB_n ) .   (3.5.29)

Thus the sum in expression ( 3.5.27) is less than or equal to

Σ_{k=1}^n E[ Z_{nk}*² I( |ϕ(F(Y_k))| > (εB_n − n^{-1/2}|t₁|)/M_n ) ]
 = t₁² E[ I( |ϕ(F(Y₁))| > (εB_n − n^{-1/2}|t₁|)/M_n ) ]
 + (2t₁/n) E[ sgn(Y₁)ϕ(F(Y₁)) I( |ϕ(F(Y₁))| > (εB_n − n^{-1/2}|t₁|)/M_n ) ] Σ_{k=1}^n x_k′t₂
 + E[ ϕ²(F(Y₁)) I( |ϕ(F(Y₁))| > (εB_n − n^{-1/2}|t₁|)/M_n ) ] (1/n) Σ_{k=1}^n (t₂′x_k)² .

Because the design is centered, the middle term on the right side is 0. As remarked above, the term (1/n)Σ_{k=1}^n (t₂′x_k)² = (1/n)t₂′X′Xt₂ converges to a positive constant. In the expression (εB_n − n^{-1/2}|t₁|)/M_n, the numerator converges to a positive constant as the denominator converges to 0; hence, the expression goes to ∞. Therefore, since ϕ is bounded, the indicator function converges to 0. Again using the boundedness of ϕ, we can interchange limit and expectation by the Lebesgue Dominated Convergence Theorem. Thus condition ( 3.5.27) is true and, hence, Z_n* converges in distribution to a univariate normal distribution. Therefore, T_n converges to a multivariate normal distribution. Note by ( 3.5.26) it follows that the asymptotic covariance of b̂_ϕ is the result displayed in the theorem.

In the above development, we considered the centered design. In practice, though, we are often concerned with an uncentered design. Let α* denote the intercept for the uncentered model. Then α* = α − x̄′β, where x̄ denotes the vector of column averages of the uncentered design matrix. An estimate of α* based on R-estimates is given by α̂_S* = α̂_S − x̄′β̂_ϕ. Based on the last theorem, it follows (Exercise 3.16.14) that (α̂_S*, β̂_ϕ′)′ is approximately N_{p+1} with mean (α₀, β₀′)′ and covariance matrix

[ κ_n                   −τ_ϕ²x̄′(X′X)^{-1}
  −τ_ϕ²(X′X)^{-1}x̄      τ_ϕ²(X′X)^{-1}   ] ,   (3.5.30)

where κ_n = n^{-1}τ_S² + τ_ϕ²x̄′(X′X)^{-1}x̄, and τ_S and τ_ϕ are given respectively by ( 3.4.6) and ( 3.4.4).
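Given consistent estimates τ̂_S and τ̂_ϕ (see Section 3.7.1), the covariance matrix in ( 3.5.30) is straightforward to assemble numerically; a minimal Python sketch, with our own function name and inputs, is:

    import numpy as np

    def joint_cov(X_unc, tau_S, tau_phi):
        # Approximate covariance (3.5.30) of (alpha*_S, beta_phi')'
        n = X_unc.shape[0]
        xbar = X_unc.mean(axis=0)
        Xc = X_unc - xbar                          # centered design
        XtX_inv = np.linalg.inv(Xc.T @ Xc)
        kappa = tau_S ** 2 / n + tau_phi ** 2 * xbar @ XtX_inv @ xbar
        off = -tau_phi ** 2 * XtX_inv @ xbar
        top = np.concatenate(([kappa], off))
        bottom = np.column_stack((off, tau_phi ** 2 * XtX_inv))
        return np.vstack((top, bottom))

The square roots of the diagonal entries give approximate standard errors for the intercept and slope estimates.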


Intercept Estimate Based on Signed-Rank Scores

Suppose we additionally assume that the errors have a symmetric distribution; i.e., f(−x) = f(x). In this case, all location functionals are the same. Let ϕ_f(u) = −f′(F^{-1}(u))/f(F^{-1}(u)) denote the optimal scores for the density f(x). Then as Exercise 3.16.12 shows, ϕ_f(1 − u) = −ϕ_f(u); that is, the scores are odd about 1/2. Hence, in this subsection we will additionally assume that the scores satisfy property (S.3), ( 3.4.12). For scores satisfying (S.3), the corresponding signed-rank scores are generated as a⁺(i) = ϕ⁺(i/(n + 1)) where ϕ⁺(u) = ϕ((u + 1)/2); see the discussion in Section 2.5.3. For example, if Wilcoxon scores are used, ϕ(u) = √12(u − 1/2), then the signed-rank score function is ϕ⁺(u) = √3 u. Recall from Chapter 1 that these signed-rank scores can be used to define a norm and a subsequent R-analysis. Here we only want to apply the associated one sample signed-rank procedure to the residuals in order to obtain an estimate of the intercept. So consider the process

T⁺(ê_R − α1) = Σ_{i=1}^n sgn(ê_{Ri} − α) a⁺(R|ê_{Ri} − α|) ,   (3.5.31)

where ê_{Ri} = y_i − x_i′β̂_ϕ; see ( 1.8.2). Note that this is the process discussed in Section 1.8, except now the iid observations are replaced by residuals. The process is still a nonincreasing function of α which steps down at the Walsh averages of the residuals; see Exercise 1.12.28. The estimate of the intercept is a value α̂_{ϕ⁺} which solves the equation

T⁺(ê_R − α1) ≐ 0 .   (3.5.32)

If Wilcoxon scores are used then the estimate is the median of the Walsh averages, ( 1.3.25), while if sign scores are used the estimate is the median of the residuals.

Let b̂_ϕ⁺ = (α̂_{ϕ⁺}, β̂_ϕ′)′. We next briefly sketch the development of the asymptotic distribution of b̂_ϕ⁺. Assume without loss of generality that the true parameter vector (α₀, β₀′)′ is 0. Suppose instead of the residuals we had the true errors in ( 3.5.31). Theorem A.2.11 of the Appendix then yields an asymptotic linearity result for the process. McKean and Hettmansperger (1976) show that this result holds for the residuals also; that is,

n^{-1/2} T⁺(ê_R − α1) = n^{-1/2} T⁺(e) − ατ_ϕ^{-1} + o_p(1) ,   (3.5.33)

for all |α| ≤ c, where c > 0. Using arguments similar to those in McKean and Hettmansperger (1976), we can show that √n α̂_{ϕ⁺} is bounded in probability; hence, by ( 3.5.33) we have that

√n α̂_{ϕ⁺} = τ_ϕ n^{-1/2} T⁺(e) + o_p(1) .   (3.5.34)


But by ( A.2.43) and ( A.2.45) of the Appendix, we have the second representation given by

√n α̂_{ϕ⁺} = τ_ϕ n^{-1/2} Σ_{i=1}^n ϕ⁺(F⁺(|e_i|)) sgn(e_i) + o_p(1)
 = τ_ϕ n^{-1/2} Σ_{i=1}^n ϕ⁺(2F(e_i) − 1) + o_p(1) ,   (3.5.35)

where F⁺ is the distribution function of the absolute errors |e_i|. Due to symmetry, F⁺(t) = 2F(t) − 1. Then using the relationship between the rank and the signed-rank scores, ϕ⁺(u) = ϕ((u + 1)/2), we obtain finally

√n α̂_{ϕ⁺} = τ_ϕ n^{-1/2} Σ_{i=1}^n ϕ(F(Y_i)) + o_p(1) .   (3.5.36)

Therefore, using expression ( 3.5.2), we have the asymptotic representation of the estimates:

√n ( α̂_{ϕ⁺} ; β̂_ϕ ) ≐ (τ_ϕ/√n) ( 1′ϕ(F(Y)) ; n(X′X)^{-1}X′ϕ(F(Y)) ) .   (3.5.37)

This and an application of the Lindeberg Central Limit Theorem, similar to the proof of Theorem 3.5.7, leads to the theorem:

Theorem 3.5.8. Under assumptions (D.1), (D.2), (E.1), (E.2), (S.1) and (S.3) of Section 3.4,

(α̂_{ϕ⁺}, β̂_ϕ′)′ has an approximate N_{p+1}( (α₀, β₀′)′ , τ_ϕ²(X₁′X₁)^{-1} ) distribution ,   (3.5.38)

where X₁ = [1 X].
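For Wilcoxon scores, the intercept estimate α̂_{ϕ⁺} above is the median of the Walsh averages of the residuals; a minimal Python sketch (function name ours):

    import numpy as np

    def intercept_signed_rank(resid):
        # Median of the Walsh averages (e_i + e_j)/2, i <= j; this solves
        # (3.5.32) when Wilcoxon signed-rank scores are used.
        i, j = np.triu_indices(len(resid))
        return np.median((resid[i] + resid[j]) / 2.0)

Standard errors then follow from ( 3.5.38) with an estimate of τ_ϕ such as that of Section 3.7.1.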

3.6 Theory of Rank-Based Tests

Consider the general linear hypotheses discussed in Section 3.2,

H₀: Mβ = 0 versus H_A: Mβ ≠ 0 ,   (3.6.1)

where M is a q × p matrix of full row rank. The geometry of R testing, Section 3.2.2, indicated the statistic based on the reduction of dispersion between the reduced and full models, F_ϕ = (RD/q)/(τ̂_ϕ/2), see ( 3.2.18), as a test statistic. In this section we develop the asymptotic theory for this test statistic under null and alternative hypotheses. This theory will be sufficient for two other rank-based tests which we will discuss later. See Table 3.2.2 and the discussion relating to that table for the special case when M = I.


3.6.1 Null Theory of Rank-Based Tests

We proceed with two lemmas about the dispersion function D(β) and its quadratic approximation Q(β) given by expression ( 3.5.11).

Lemma 3.6.1. Let β̂ denote the R-estimate of β in the full model ( 3.2.3); then under (E.1), (S.1), (D.1) and (D.2) of Section 3.4,

D(β̂) − Q(β̂) →_P 0 .   (3.6.2)

Proof: Assume without loss of generality that the true β is 0. Let ε > 0 be given. Choose c₀ such that P[√n‖β̂‖ > c₀] < ε/2 for n sufficiently large. Using asymptotic quadraticity, Theorem A.3.8, we have for n sufficiently large

P[ max_{√n‖β‖ ≤ c₀} |D(β) − Q(β)| ≥ ε ] < ε/2 .

Since the event |D(β̂) − Q(β̂)| ≥ ε is contained in the union of the events √n‖β̂‖ > c₀ and max_{√n‖β‖≤c₀}|D(β) − Q(β)| ≥ ε, the result follows.

Since √n‖β_n‖ = ‖θ‖, we can take c = ‖θ‖ and get

‖n^{-1/2}S(Y − Xβ_n) − (n^{-1/2}S(Y) − τ_ϕ^{-1}Σ(0′, θ′)′)‖ = o_p(1) .   (3.6.27)

The above probability statements hold under the null model and, hence, by contiguity, under the sequence of models ( 3.6.24) also. Under the sequence of models ( 3.6.24), however,

n^{-1/2} S(Y − Xβ_n) →_D N_p(0, Σ) .

Hence, under the sequence of models ( 3.6.24),

n^{-1/2} S(Y) →_D N_p( τ_ϕ^{-1}Σ(0′, θ′)′ , Σ ) .   (3.6.28)


Then under the sequence of models ( 3.6.24),

[ −B′A₁^{-1}  I ] n^{-1/2} S(Y) →_D N_q( τ_ϕ^{-1}W₀θ , W₀ ) .

From this last result, the conclusion readily follows.

Several interesting remarks follow from this theorem. First, since W₀ is positive definite, under alternatives the noncentrality parameter η > 0. Thus the asymptotic distribution of T(τ_ϕ) under the sequence of models ( 3.6.24) has mean q + η. Furthermore, the asymptotic power of a level α test based on T(τ_ϕ) is P[χ_q²(η) ≥ χ²_{α,q}]. Second, note that we can write the noncentrality parameter as

η = (τ_ϕ²n)^{-1}[ θ′A₂θ − (Bθ)′A₁^{-1}Bθ ] .

Both matrices A₂ and A₁^{-1} are positive definite; hence, the noncentrality parameter is maximized when θ is in the kernel of B. One way of assuring this for a design is to take B = 0. Because B = X₁′X₂, this condition holds for orthogonal designs. Therefore orthogonal designs are generally more efficient than non-orthogonal designs.

We next obtain the asymptotic relative efficiency of the test statistic F_ϕ with respect to the least squares classical F-test, F_LS, defined by ( 3.2.17) in Section 3.2.2. The theory for F_LS under local alternatives is outlined in Exercise 3.16.18, where it is shown that, under the additional assumption that the random errors e_i have finite variance σ², the null asymptotic distribution of qF_LS is a central χ_q² distribution. Thus both F_ϕ and F_LS have the same asymptotic null distribution. As outlined in Exercise 3.16.18, under the sequence of models ( 3.6.24) qF_LS has an asymptotic noncentral χ²_{q,η_LS} distribution with noncentrality parameter

η_LS = (σ²)^{-1} θ′W₀^{-1}θ .   (3.6.29)

Based on Theorem 3.6.3, the asymptotic relative efficiency of F_ϕ and F_LS is the ratio of their noncentrality parameters; i.e.,

e(F_ϕ, F_LS) = η_ϕ/η_LS = σ²/τ_ϕ² .

Thus the efficiency results for the rank-based estimates and tests discussed in this section are the same as the efficiency results presented in Chapters 1 and 2. An asymptotically efficient analysis can be obtained if the selected rank score function is ϕ_f(u) = −f₀′(F₀^{-1}(u))/f₀(F₀^{-1}(u)), where f₀ is the form of the density of the error distribution. If the errors have a logistic distribution then the Wilcoxon scores will result in an asymptotically efficient analysis. Usually, however, we have no knowledge of the distribution of the errors, in which case we recommend using Wilcoxon scores. With them, the loss in relative efficiency to the classical analysis at the normal distribution is only 5%, while the gain in efficiency over the classical analysis for long tailed error distributions can be substantial, as discussed in Chapters 1 and 2.
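For Wilcoxon scores, 1/τ_ϕ = √12 ∫f², so e(F_ϕ, F_LS) = 12σ²(∫f²)², and the classical efficiency values are easy to reproduce numerically; the following sketch (our code) evaluates this at three error distributions:

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    def are_wilcoxon_ls(dist):
        # e(F_phi, F_LS) = 12 sigma^2 (int f^2)^2 for Wilcoxon scores
        int_f2, _ = quad(lambda x: dist.pdf(x) ** 2, -np.inf, np.inf)
        return 12.0 * dist.var() * int_f2 ** 2

    print(are_wilcoxon_ls(stats.norm()))      # 3/pi    ~ 0.955
    print(are_wilcoxon_ls(stats.logistic()))  # pi^2/9  ~ 1.097
    print(are_wilcoxon_ls(stats.laplace()))   # 1.5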


Many of the studies reviewed in the article by McKean and Sheather (1991) included power comparisons of the rank-based analyses with the least squares F-test, F_LS. The empirical power of F_LS at normal error distributions was slightly better than the empirical power of F_ϕ under Wilcoxon scores. Under error distributions with heavier tails than the normal distribution, the empirical power of F_ϕ was generally larger, often much larger, than the empirical power of F_LS. These studies provide empirical evidence that the good asymptotic efficiency properties of the rank-based analysis hold in the small sample setting.

As discussed above, the noncentrality parameters of the test statistics F_ϕ and F_LS differ only in the scale parameters. Hence, in practice, planning designs based on the noncentrality parameter of F_ϕ can proceed similarly to the planning of a design using the noncentrality parameter of F_LS; see, for example, the discussion in Chapter 4 of Graybill (1976).

3.6.3 Further Remarks on the Dispersion Function

Let ê denote the rank-based residuals when the linear model, ( 3.2.4), is fit using the scores based on the function ϕ. Suppose the same assumptions hold as above; i.e., (E.1), (D.1), and (D.2) in Section 3.4. In this section, we explore further properties of the residual dispersion D(ê); see also Sections 3.9.2 and 3.11. The functional corresponding to the dispersion function evaluated at the errors e_i is determined as follows: letting F_n denote the empirical distribution function of the iid errors e₁, …, e_n, we have

n^{-1}D(e) = n^{-1} Σ_{i=1}^n a(R(e_i)) e_i
 = n^{-1} Σ_{i=1}^n ϕ( (n/(n+1)) F_n(e_i) ) e_i
 = ∫ ϕ( (n/(n+1)) F_n(x) ) x dF_n(x)
 →_P ∫ ϕ(F(x)) x dF(x) = D_e .   (3.6.30)

As Exercise 3.16.19 shows, D_e is a scale parameter; see also the examples below.

Let D(ê) denote the residual dispersion D(β̂) = D(Y, Ω). We next show that n^{-1}D(ê) also converges in probability to D_e, a result which will prove useful in Sections 3.9.2 and 3.11. Assume without loss of generality that the true β is 0. We can write

D(ê) = (D(ê) − Q(β̂)) + (Q(β̂) − Q(β̃)) + Q(β̃) .

By Lemmas 3.6.1 and 3.6.2 the two differences on the right hand side converge to 0 in probability. After some algebra, we obtain

Q(β̃) = −(τ_ϕ/2){ n^{-1/2}S(e)′ (n^{-1}X′X)^{-1} n^{-1/2}S(e) } + D(e) .


By Theorem 3.5.2 the term in braces on the right side converges in distribution to a χ² random variable with p degrees of freedom. This implies that (D(e) − D(ê))/(τ_ϕ/2) also converges in distribution to a χ² random variable with p degrees of freedom. Although this is a stronger result than we need, it does imply that n^{-1}(D(e) − D(ê)) converges to 0 in probability. Hence, n^{-1}D(ê) converges in probability to D_e.

The natural analog to the least squares F-test statistic is

F_ϕ* = (RD/q) / (σ̂_D/2) ,   (3.6.31)

where σ̂_D = D(ê)/(n − p − 1), rather than F_ϕ. But we have

qF_ϕ* = [ (τ̂_ϕ/2) / (n^{-1}D(ê)/2) ] qF_ϕ →_D κ_F χ²(q) ,   (3.6.32)

where κ_F is defined by

τ̂_ϕ / (n^{-1}D(ê)) →_P κ_F .   (3.6.33)

Hence, to have a limiting χ²-distribution for qF_ϕ* we need to have κ_F = 1. Below we give several examples where this occurs. In the first example, the form of the error distribution is known, while in the second example the errors are normally distributed; however, these cases rarely occur in practice.

There is an even more acute problem with using F_ϕ*, though. In Section A.5.2 of the appendix, we show that the influence function of F_ϕ* is not bounded in the Y-space, while, as noted above, the influence function of the statistic F_ϕ is bounded in the Y-space provided the score function ϕ(u) is bounded. Note, however, that the influence functions of D(ê) and F_ϕ* are linear rather than quadratic, as is the influence function of F_LS. Hence, they are somewhat less sensitive to outliers in the Y-space than F_LS; see Hettmansperger and McKean (1978).

Example 3.6.1. Form of Error Density Known.

Assume that the errors have density f(x) = σ^{-1}f₀(x/σ) where f₀ is known. Our choice of scores would then be the optimal scores given by

ϕ₀(u) = −(1/√I(f₀)) f₀′(F₀^{-1}(u)) / f₀(F₀^{-1}(u)) ,   (3.6.34)

where I(f₀) denotes the Fisher information corresponding to f₀. These scores yield an asymptotically efficient rank-based analysis. Exercise 3.16.20 shows that with these scores τ_ϕ = D_e. Thus κ_F = 1 for this example and

qF*_{ϕ₀} has a limiting χ²(q)-distribution under H₀ .   (3.6.35)


Example 3.6.2. Errors are Normally Distributed.

In this case the form of the error density is f₀(x) = (√(2π))^{-1} exp{−x²/2}; i.e., the standard normal density. This is of course a subcase of the last example. The optimal scores in this case are the normal scores ϕ₀(u) = Φ^{-1}(u), where Φ denotes the standard normal distribution function. Using these scores, the statistic qF*_{ϕ₀} has a limiting χ²(q)-distribution under H₀. Note here that the score function ϕ₀(u) = Φ^{-1}(u) is unbounded; hence the above theory must be modified to obtain this result. Under further regularity conditions on the design matrix, Jurečková (1969) obtained asymptotic linearity for the unbounded score function case; see, also, Koul (1992, p. 51). Using these results, the limiting distribution of qF*_{ϕ₀} can be obtained. The R-estimates based on these scores, however, have an unbounded influence function; see Section 1.8.1.

We next consider this analysis for Wilcoxon and sign scores. If Wilcoxon scores are employed then Exercise 3.16.21 shows that

τ_ϕ = σ √(π/3)   (3.6.36)
D_e = σ √(3/π) .   (3.6.37)

Thus, in this case, a consistent estimate of τ_ϕ/2 is n^{-1}D(ê)(π/6). For sign scores a similar computation yields

τ_S = σ √(π/2)   (3.6.38)
D_e = σ √(2/π) .   (3.6.39)

Hence n^{-1}D(ê)(π/4) is a consistent estimate of τ_S/2.
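These constants are easy to verify numerically. The following sketch (our code) checks ( 3.6.36) and ( 3.6.37) for standard normal errors, σ = 1, by numerical integration:

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    # tau_phi = 1/gamma, with gamma = sqrt(12) int f^2 for Wilcoxon scores
    gamma, _ = quad(lambda x: np.sqrt(12.0) * stats.norm.pdf(x) ** 2,
                    -np.inf, np.inf)
    print(1.0 / gamma, np.sqrt(np.pi / 3.0))   # tau_phi = sigma sqrt(pi/3)

    # D_e = int phi(F(x)) x dF(x), with phi(u) = sqrt(12)(u - 1/2)
    De, _ = quad(lambda x: np.sqrt(12.0) * (stats.norm.cdf(x) - 0.5)
                 * x * stats.norm.pdf(x), -np.inf, np.inf)
    print(De, np.sqrt(3.0 / np.pi))            # D_e = sigma sqrt(3/pi)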

Note that both examples are overly restrictive and, again, in all cases the resulting rank-based test of the general linear hypothesis H₀ has an unbounded influence function, even when the errors have a normal density and the analysis is based on Wilcoxon or sign scores. In general, then, we recommend using a bounded score function ϕ and the corresponding test statistic F_ϕ, ( 3.2.18), which is highly efficient and whose influence function, ( 3.6.21), is bounded in the Y-space.

3.7 Implementation of the R-Analysis

Up to this point, we have presented the geometry and asymptotic theory of the R-analysis. In order to implement the analysis we need to discuss the estimation of the scale parameters τϕ and τS . Estimation of τS is discussed around expression (1.5.28). Here, though, the estimate is based on the residuals. We next discuss estimation of the scale parameter τϕ . We also discuss algorithms for obtaining the rank-based analysis.


3.7.1 Estimates of the Scale Parameter τ_ϕ

The estimators of τ_ϕ that we discuss are based on the R-residuals formed after estimating β. In particular, the estimators do not depend on the estimate of the intercept parameter α. Suppose then we have fit Model ( 3.2.3) based on a score function ϕ which satisfies (S.1), ( 3.4.10); i.e., ϕ is bounded and is standardized so that ∫ϕ = 0 and ∫ϕ² = 1. Let β̂_ϕ denote the R-estimate of β and let ê_R = Y − Xβ̂_ϕ denote the residuals based on the R-fit.

There have been several estimates of τϕ proposed. McKean and Hettmansperger (1976) proposed a Lehmann type estimator based on the standardized length of a confidence interval for the intercept parameter α. This estimator is a function of residuals and is consistent provided the density of the errors is symmetric. It is similar to the estimators of τϕ discussed in Chapter 1. For Wilcoxon scores, Aubuchon and Hettmansperger (1984, 1989) obtained a density type estimator for τϕ and showed it was consistent for symmetric and asymmetric error distributions. Both of these estimators are available as options in the command RREGR in Minitab. In this section we briefly sketch the development of an estimator of τϕ for bounded score functions proposed by Koul, Sievers and McKean (1987). It is a density type estimate based on residuals which is consistent for symmetric and asymmetric error distributions which satisfy (E.1), ( 3.4.1). It further satisfies a uniform consistency property as stated in Theorem 3.7.1. Witt et al. (1995) derived the influence function of this estimator, showing that it is robust. A bootstrap percentile-t procedure based on this estimator did quite well in terms of empirical validity and efficiency in the Monte Carlo study performed by George, McKean, Schucany and Sheather (1995). Let the score function ϕ satisfy (S.1), (S.2), and (S.3) of Section 3.4. Since it is bounded, consider the standardization of it given by

ϕ*(u) = (ϕ(u) − ϕ(0)) / (ϕ(1) − ϕ(0)) .   (3.7.1)

Since ϕ* is a linear function of ϕ, the inference properties under either score function are the same. The score function ϕ* will be useful since it is also a distribution function on (0, 1). Recall that τ_ϕ = 1/γ where

γ = ∫₀¹ ϕ(u)ϕ_f(u) du and ϕ_f(u) = −f′(F^{-1}(u))/f(F^{-1}(u)) .

Note that γ* = ∫ϕ*(u)ϕ_f(u) du = (ϕ(1) − ϕ(0))^{-1} γ. For the present it will be more convenient to work with γ*.


If we make the change of variable u = F(x) in γ*, we can rewrite it as

γ* = −∫_{−∞}^{∞} ϕ*(F(x)) f′(x) dx = ∫_{−∞}^{∞} ϕ*′(F(x)) f²(x) dx = ∫_{−∞}^{∞} f(x) dϕ*(F(x)) ,

where the second equality is obtained upon integration by parts using dv = f′(x) dx and u = ϕ*(F(x)). From the above assumptions on ϕ*, ϕ*(F(x)) is a distribution function. Suppose Z₁ and Z₂ are independent random variables with distribution functions F(x) and ϕ*(F(x)), respectively. Let H(y) denote the distribution function of |Z₁ − Z₂|. It then follows that

H(y) = P[|Z₁ − Z₂| ≤ y] = { ∫_{−∞}^{∞} [F(z₂ + y) − F(z₂ − y)] dϕ*(F(z₂)) ,  y > 0 ;  0 ,  y ≤ 0 } .   (3.7.2)

Let h(y) denote the density of H(y). Upon differentiating under the integral sign in expression ( 3.7.2) it easily follows that

h(0) = 2γ* .   (3.7.3)

So to estimate γ we need to estimate h(0). Using the transformation t = F(z₂), rewrite ( 3.7.2) as

H(y) = ∫₀¹ [ F(F^{-1}(t) + y) − F(F^{-1}(t) − y) ] dϕ*(t) .   (3.7.4)

Next let F̂_n denote the empirical distribution function of the R-residuals and let F̂_n^{-1}(t) = inf{x : F̂_n(x) ≥ t} denote the usual inverse of F̂_n. Let Ĥ_n denote the estimate of H which is obtained by replacing F by F̂_n. Some simplification follows by noting that for t ∈ ((j − 1)/n, j/n], F̂_n^{-1}(t) = ê_{(j)}. This leads to the following form of Ĥ_n:

Ĥ_n(y) = ∫₀¹ [ F̂_n(F̂_n^{-1}(t) + y) − F̂_n(F̂_n^{-1}(t) − y) ] dϕ*(t)
 = Σ_{j=1}^n ∫_{((j−1)/n, j/n]} [ F̂_n(F̂_n^{-1}(t) + y) − F̂_n(F̂_n^{-1}(t) − y) ] dϕ*(t)
 = Σ_{j=1}^n [ F̂_n(ê_{(j)} + y) − F̂_n(ê_{(j)} − y) ][ ϕ*(j/n) − ϕ*((j − 1)/n) ]
 = n^{-1} Σ_{i=1}^n Σ_{j=1}^n [ ϕ*(j/n) − ϕ*((j − 1)/n) ] I(|ê_{(i)} − ê_{(j)}| ≤ y) .   (3.7.5)


An estimate of h(0), and hence of γ*, ( 3.7.3), is an estimate of the form Ĥ_n(t_n)/(2t_n) where t_n is chosen close to 0. Since Ĥ_n is a distribution function, let t̂_{n,δ} denote the δth quantile of Ĥ_n; i.e., t̂_{n,δ} = Ĥ_n^{-1}(δ). Then take t_n = t̂_{n,δ}/√n. Our estimate of γ is given by

γ̂_{n,δ} = (ϕ(1) − ϕ(0)) Ĥ_n(t̂_{n,δ}/√n) / (2 t̂_{n,δ}/√n) .   (3.7.6)

Its consistency is given by the following theorem:

Theorem 3.7.1. Under (E.1), (D.1), (S.1), and (S.2) of Section 3.4, and for any 0 < δ < 1,

sup_{ϕ∈C} |γ̂_{n,δ} − γ| →_P 0 ,

where C denotes the class of all bounded, right continuous, nondecreasing score functions defined on the interval (0, 1).

The proof can be found in Koul et al. (1987). It follows immediately that τ̂_ϕ = 1/γ̂_{n,δ} is a consistent estimate of τ_ϕ. Note that the uniformity condition on the scores in the theorem is more than we need here. This result, though, proves useful in adaptive procedures which estimate the score function; see McKean and Sievers (1989).

Since the scores are differentiable, an approximation of Ĥ_n is obtained by an application of the mean value theorem to ( 3.7.5), which results in

Ĥ_n*(y) = (c_n n)^{-1} Σ_{i=1}^n Σ_{j=1}^n ϕ*′(j/(n + 1)) I(|ê_{(i)} − ê_{(j)}| ≤ y) ,   (3.7.7)

where c_n = Σ_{j=1}^n ϕ*′(j/(n + 1)) is such that Ĥ_n* is a distribution function.

The expression ( 3.7.5) for Ĥ_n contains a density estimate of f based on a rectangular kernel. Hence, in choosing δ we are really choosing a bandwidth for a density estimator. As most kernel type density estimates are sensitive to the bandwidth, so is γ̂_{n,δ} sensitive to δ. Several small sample studies have been done on this estimate of τ_ϕ; see McKean and Sheather (1991) for a summary. In these studies the quality of an estimator of τ_ϕ is based on how well it standardizes test statistics such as F_ϕ, in terms of how close the empirical α-levels of the test statistic are to nominal α-levels. In the same way, scale estimators used in confidence intervals were judged by how close empirical confidence levels were to nominal confidence levels. The major concern is thus the validity of the inference procedure. For moderate sample sizes where the ratio n/p exceeds 5, the value δ = .80 yielded valid estimates. For ratios less than 5, larger values of δ, around .90, gave valid estimates. In all cases it was found that the following simple degrees of freedom correction benefited the analysis:

τ̂_ϕ = √( n/(n − p − 1) ) γ̂^{-1}_{n,δ} .   (3.7.8)

3.7. IMPLEMENTATION OF THE R-ANALYSIS

3.7.2

191

Algorithms for Computing the R-Analysis

As we saw in Section 3.2, the dispersion function D(β) is a continuous convex function of β. Gradient type algorithms, such as steepest descent, can be use to minimize D(β) but they are often agonizingly slow. The algorithm which we describe next is a Newton type of algorithm based on the asymptotic quadraticity of D(β). It is generally much faster than gradient type algorithms and is currently used in the RREGR command in Minitab and in the program RGLM (Kapenga, McKean and Vidmar, 1988). A finite algorithm to minimize D(β) is discussed by Osborne (1985). b (0) . Let The Newton type of algorithm needs an initial estimate which we denote as β b (0) denote the initial residuals and let τbϕ(0) denote the initial estimate of τϕ b e(0) = Y − Xβ b (0) is based on these residuals. By ( 3.5.11) the approximating quadratic to D based on β given by,

   ′    ′  −1   b (0) . b (0) +D Y − Xβ b (0) S Y − Xβ b (0) − β − β b (0) X′ X β − β β−β Q(β) = 2τbϕ (0)

By ( 3.5.13), the value of β which minimizes Q(β) is given by

b (1) = β b (0) + τb(0) (X′ X)−1 S(Y − Xβ b (0) ) . β ϕ

(3.7.9)

This is the first Newton step. In the same way that the first step was defined in terms of the initial estimate, so can a second step be defined in terms of the first step. We shall call these iterated estimates ork-step   estimates. In practice, though, we would want to know (1) (0) b b if D β is less than D β before proceeding. A more formal algorithm is presented

below. These k-step estimates satisfy some interesting properties themselves which we briefly discuss; details can be found in McKean and Hettmansperger (1978). Provided the initial √ b (0) estimate is such that n(β − β) is bounded in probability then for any k ≥ 1 we have  √  (k) P b −β b → n β 0, ϕ

b denotes a minimizing value of D. Hence the k-step estimates have the same where β ϕ b . Furthermore τbϕ(k) is a consistent estimate of τϕ , if it is any of asymptotic distribution as β ϕ (k) the scale estimates discussed in Section 3.7.1 based on k-step residuals. Let Fϕ denote the R-test of a general linear hypothesis based on reduced and full model k-step estimates. Then (k) it can be shown that Fϕ satisfies the same asymptotic properties as the test statistic Fϕ under the null hypothesis and contiguous alternatives. Also it is consistent for any alternative HA .

192

CHAPTER 3. LINEAR MODELS

Formal Algorithm In order to outline the algorithm used by RGLM, first consider the QR-decomposition of X which is given by Q′ X = R , (3.7.10) where Q is an n × n orthogonal matrix and R is an n × p upper triangular matrix of rank p. As discussed in Stewart (1973), Q can be expressed as a product of p Householder transformations. Writing Q = [Q1 Q2 ] where Q1 is n × p, it is easy to show that the columns of Q1 form an orthonormal basis for the column space of X. In particular the projection matrix onto the column space of X is given by H = Q1 Q′1 . The software package LINPACK (1979) is a collection of subroutines which efficiently computes QR-decompositions and it further has routines which obtain projections of vectors. Note that we can write the kth Newton step in terms of residuals as b e(k) = b e(k−1) − τbϕ Ha(R(b e(k−1) )

(3.7.11) (k−1)

where a(R(b e(k−1) ) denotes the vector whose ith component is a(R(b ei ). Let D (k) denote the dispersion function evaluated at b e(k) . The Newton step is a step from b e(k−1) along the direction τbϕ Ha(R(b e(k−1) )). If D (k) < D(k−1) the step has been successful; otherwise, a linear search can be made along the direction to find a value which minimizes D. This would then become the kth step residual. Such a search can be performed using methods such as false position as discussed below in Section 3.7.3. Stopping rules can be based on the relative drop in dispersion, i.e., stop when D (k) − D (k−1) < ǫD , D (k−1)

(3.7.12)

where ǫD is a specified tolerance. A similar stopping rule can be based on the relative size b = Y−b of the step. Upon stopping at step k, obtain the fitted value Y e(k) and then the b estimate of β by solving Xβ = Y. A formal algorithm is: Let ǫD and ǫs be the given stopping tolerances. 1. Set k = 1. Obtain initial residuals b e(k−1) and based upon these get an initial estimate (0) τbϕ of τϕ .

2. Obtain b e(k) as in expression ( 3.7.11). If the step is successful proceed to the next step, otherwise search along the Newton direction for a value which minimizes D then go to the next step. An algorithm for this search is discussed in Section 3.7.3.

3. If the relative drop in dispersion or length of step is within its respective tolerance ǫD or ǫs stop; otherwise set b e(k−1) = b e(k) and go to step (2). 4. Obtain the estimate of β and the final estimate of τϕ .

3.7. IMPLEMENTATION OF THE R-ANALYSIS

193

The QR decomposition can readily be used to form a reduced model design matrix for testing the general linear hypotheses ( 3.2.5), Mβ = 0, where M is a specified q × p matrix. Recall that we called the column space of X, ΩF , and the space ΩF constrained by Mβ = 0 the reduced model space, ω. The key result lies in the following theorem: Theorem 3.7.2. Denote the row space of M by R(M′ ). Let QM be a p × (p − q) matrix whose columns consist of an orthonormal basis for the space (R(M′ ))⊥ . If U = XQM , then R(U) = ω. Proof: If u ∈ ω then u = Xb for some b where Mb = 0. Hence b ∈ (R(M′ ))⊥ ; i.e., b = QM c for some c. Conversely, if u ∈ R(U) then for some c ∈ Rp−q , u = X(QM c). Hence u ∈ R(X) and M(QM c) = (MQM )c = 0. Thus using the LINPACK subroutines mentioned above, it is easy to write an algorithm which obtains the reduced model design matrix U defined above in the theorem. The package RGLM uses such an algorithm to test linear hypotheses; see Kapenga, McKean and Vidmar (1988).

3.7.3

An Algorithm for a Linear Search

The computation for many of the quantities needed in a rank-based analysis involve simple linear searches. Examples include the estimate of the location parameter for a signed-rank procedure, the estimate of the shift in location in the two sample location problem, the estimate of τϕ discussed in Section 3.7 and the search along the Newton direction for a minimizing value in Step (2) of the algorithm for the R-fit in a regression problem discussed in the last section. The following is a generic setup for these problems: solve the equation S(b) = K ,

(3.7.13)

where S(b) is a decreasing step function and K is a specified constant. Without loss of generality we will take K = 0 for the remainder of the discussion. By the monotonicity, a solution always exists, although, it may be an interval of solutions. In almost all cases, S(b) is asymptotically linear; so, the search problem becomes relatively more efficient as the sample size increases. There are certainly many search algorithms that can be used for solving ( 3.7.13). One that we have successfully employed is the Illinois version of regula falsi; see Dowell and Jarratt (1971). McKean and Ryan (1977) employed this routine to obtain the estimate and confidence interval for the two sample Wilcoxon location problem. We will write the generic asymptotic linearity result as, . S(b) = S(b(0) ) − ζ(b − b(0) ) . (3.7.14) The parameter ζ is often of the form δ −1 C where C is some constant. Since δ is a scale parameter, initial estimates of it include such estimates as the MAD, ( 3.9.27), or the sample standard deviation. We have found MAD to usually be preferable. An outline of an algorithm for the search is:

194

CHAPTER 3. LINEAR MODELS

1. Bracket Step. Beginning with an initial estimate b(0) step along the b-axis to b(1) where the interval (b(0) , b(1) ), or vice-versa, brackets the solution. Asymptotic linearity can be used here to make these steps; for instance, if ζ (0) is an estimate of ζ based on b(0) then the first step is b(1) = b(0) + S(b(0) )/ζ (0) . 2. Regula-Falsi. Assume the interval (b(0) , b(1) ) brackets the solution and that b(1) is the more recent value of b(0) , b(1) . If |b(1) − b(0) | < ǫ then stop. Else, the next step is where the secant line determined by b(0) , b(1) intersects the b-axis; i.e., b(2) = b(0) −

b(1) − b(0) S(b(0) ) . (1) (0) S(b ) − S(b )

(3.7.15)

(a) If (b(0) , b(2) ) brackets the solution then replace b(1) by b(2) and go to (2) but use S(b(0) )/2 in place of S(b(0) ) in determination of the secant line, (this is the Illinois modification). (b) If (b(2) , b(1) ) brackets the solution then replace b(0) by b(2) and go to (2). The above algorithm is easy to implement. Such searches are used in the package RGLM; see Kapenga, McKean and Vidmar (1988).

3.8

L1-Analysis

This section is devoted to L1 -procedures. These are widely used procedures; see, for example, Bloomfield and Steiger (1983). We first show that they are equivalent to R-estimates based on the sign score function under Model ( 3.2.4). Hence the asymptotic theory for L1 -estimation and subsequent analysis is contained in Section 3.5. The asymptotic theory for L1 -estimation can also be found in Bassett and Koenker (1978) and Rao (1988) from an L1 point of view. Consider the sign scores; i.e., the scores generated by ϕ(u) = sgn(u −1/2). In this section we shall denote the associated pseudo-norm by kvkS =

n X i=1

sgn(R(vi ) − (n + 1)/2)vi v ∈ Rn ;

see, also, Section 2.6.1. This score function is optimal if the errors follow a double exponential (Laplace) distribution; see Exercise 2.13.19 of Chapter 2. We shall summarize the analysis based on the sign scores, but first we show that indeed the R-estimates based on sign scores are also L1 -estimates, provided that the intercept is estimated by the median of residuals. Consider the intercept model, ( 3.2.4), as given in Section 3.2 and let Ω denote the column space of X and Ω1 denote the column space of the augmented matrix X1 = [1 X]. First consider the R-estimate of η ∈ Ω based on the L1 pseudo-norm. This is a vector b YS ∈ Ω such that YbS = Argminη ∈Ω kY − ηkS .

3.8. L1 -ANALYSIS

195

Next consider the L1 -estimate for the space Ω1 ; i.e., the L1 -estimate of α1 + η. This is b L ∈ Ω1 such that a vector Y 1 where kvkL1 =

P

YbL1 = Argminθ ∈Ω1 kY − θkL1 ,

|vi | is the L1 -norm.

Theorem 3.8.1. R-estimates based on sign scores are equivalent to L1 -estimates; that is, bL = Y b S + med{Y − Y b S }1 .‘ Y 1

(3.8.1)

Proof: Any vector v ∈ Ω1 can be written uniquely as v = a1 + vc where a is a scalar and vc ∈ Ω. Since the sample median minimizes the L1 -distance between a vector and the space spanned by 1, we have kY − vkL1 = kY − a1 − vc kL1 ≥ kY − med{Y − vc }1 − vc kL1 . But it is easy to show that sgn(Yi − med{Y − vc } − vci ) = sgn(R(Yi − vci ) − (n + 1)/2) for i = 1, . . . , n. Putting these two results together along with the fact that the sign scores sum to 0 we have, kY − vkL1 = kY − a1 − vc kL1 ≥ kY − med{Y − vc }1 − vc kL1 = kY − vc kS ; ,

(3.8.2)

for any vector v ∈ Ω1 . Once more using the sign argument above, we can show that b S }1 − Y b S kL = kY − Y b S kS . kY − med{Y − Y 1

(3.8.3)

Putting ( 3.8.2) and ( 3.8.3) together establishes the result. b′ = (b b ′ ) denote the R-estimate of the vector of regression coefficients b = Let b αS , β S S (β0 , β ′ )′ . It follows that these R-estimates are the maximum likelihood estimates if the errors ei are double exponentially distributed; see Exercise 3.16.13. b S has an approximate N(b, τ 2 (X′ X1 )−1 ) From the discussions in Sections 3.5 and 3.5.2, b 1 S distribution, where τS = (2f (0))−1. From this the efficiency properties of the L1 -procedures discussed in the first two chapters carry over to the L1 linear model procedures. In particular its efficiency relative to LS at the normal distribution is .63, and it can be much more efficient than LS for heavier tailed error distributions. As Exercise 3.16.22 shows, the drop in dispersion test based on sign scores, FS , is, except for the scale parameter, the likelihood ratio test of the general linear hypothesis ( 3.2.5), provided the errors have a double exponential distribution. For other error distributions, the same comments about efficiency of the L1 estimates can be made about the test FS . In terms of implementation, Schrader and McKean (1987) found it more difficult to standardize the L1 statistics than other R-procedures, such as the Wilcoxon. Their most successful standardization of FS was based on the following bootstrap procedure: b and α 1. Compute the full model L1 estimates β bS , the full model residuals b e1 , . . . , b en , S and the test statistic FS .

196

CHAPTER 3. LINEAR MODELS

2. Select e e1 , . . . , e en˜ , the n ˜ = n − (p + 1) nonzero residuals.

3. Draw a bootstrap random sample e∗1 , . . . , e∗n˜ with replacement from ee1 , . . . , e en˜ . Calcu∗ ∗ ∗ b + e∗ . b bS + x′iβ late β S and FS , the L1 estimate and test statistic, from the model yi = α S i 4. Independently repeat step 3 a large number B times. The bootstrap p value, p∗ = #{FS∗ ≥ FS }/B. 5. Reject H0 at level α if p∗ ≤ α. Notice that by using full model residuals, the algorithm estimates the null distribution of FS . The algorithm depends on the number B of bootstrap samples taken. We suggest at least 2000.

3.9

Diagnostics

One of the most important parts in the analysis of a linear model is the examination of the resulting fit. Tools for doing this include residual plots and diagnostic techniques. Over the last fifteen years or so, these tools have been developed for fits based on least squares; see, for example, Cook and Weisberg (1982) and Belsley, Kuh and Welsch (1980). Least squares residual plots can be used to detect such things as curvature not accounted for by the fitted model; see, Cook and Weisberg (1989) for a recent discussion. Further diagnostic techniques can be used to detect outliers which are points that differ greatly from pattern set by the bulk of the data and to measure the influence of individual cases on the least squares fit. In this section we explore the properties of the residuals from the rank-based fits, showing how they can be used to determine model misspecification. We present diagnostic techniques for rank-based residuals that detect outlying and influential cases. Together these tools offer the user a residual analysis for the rank-based method for the fit of a linear model similar to the residual analysis based on least squares estimates. In this section we consider the same linear model, ( 3.2.3), as in Section 3.2. For a given b and b score function ϕ, let β eR denote the R-estimate of β and residuals from the R-fit ϕ of the model based on these scores. Much of the discussion is taken from the articles by McKean, Sheather and Hettmansperger (1990, 1991, 1993). Also, see Dixon and McKean (1996) for a robust rank-based approach to modeling heteroscedasticity.

3.9.1

Properties of R-Residuals and Model Misspecification

As we discussed above, a primary use of least squares residuals is in detection of model misspecification. In order to show that the R-residuals can also be used to detect model misspecification, consider the sequence of models Y = 1α + Xβ + Zγ + e ,

(3.9.1)

3.9. DIAGNOSTICS

197

√ where Z is an n × q centered matrix of constants and γ = θ/ n, for θ 6= 0. Note that this sequence of models is contiguous to Model ( 3.2.3). Suppose we fit model ( 3.2.3), i.e. Y = 1α + Xβ + e, when model ( 3.9.1) is the true model. Hence the model has been misspecified. As a first step in examining the residuals in this situation, we consider the limiting distribution of the corresponding R-estimate. b ϕ be the R-estimate for the Theorem 3.9.1. Assume model ( 3.9.1) is the true model. Let β model ( 3.2.3). Suppose that conditions (E.1) and (S.1) of Section 3.4 are true and that conditions (D.1) and (D.2) are true for the augmented matrix [X Z]. Then  b has an approximate Np β + (X′ X)−1 X′ Zθ/√n, τ 2 (X′ X)−1 distribution. (3.9.2) β ϕ ϕ

Proof: Without loss of generality assume that β = 0. Note that the situation here is the same as the situation in Theorem 3.6.3; except now the null hypothesis corresponds to b is the reduced model estimate. Thus we seek the asymptotic distribution of γ = 0 and β ϕ the reduced model estimate. As in Section 3.5.1 it is easier to consider the corresponding e which is the reduced model estimate which minimzes the quadratic Q(Y− pseudo estimate β √ b e P Xβ), ( 3.5.11). Under the null hypothesis, γ = 0, n(β ϕ − β) → 0; hence by contiguity √ b P e → b and β e have the same n(β ϕ − β) 0 under the sequence of models ( 3.9.1). Thus β ϕ e But by ( 3.5.13), distributions under ( 3.9.1); hence, it suffices to find the distribution of β. e = τϕ (X′ X)−1 S(Y) , β

(3.9.3)

where S(Y) is the first p components of the vector T (Y) = [X Z]′ a(R(Y)). By ( 3.6.28) of Theorem 3.6.3 D n−1/2 T (Y) → Np+q (τϕ−1 Σ∗ (0′ , θ ′ )′ , Σ∗ ) , (3.9.4) where Σ∗ is the following limit,

1 lim n→∞ n



X′ X X′ Z Z′ X Z′ Z



= Σ∗ .

e is defined by ( 3.9.3), the result is just an algebraic computation applied to ( 3.9.4). Because β b , which is given in the With a few more steps we can write a first order expression for β ϕ following corollary: Corollary 3.9.1. Under the assumptions of the last theorem, √ b = β + τϕ (X′ X)−1 X′ ϕ(F (e)) + (X′ X)−1 X′ Zθ/ n + op (n−1/2 ) . β ϕ

(3.9.5)

Proof: Without loss of generality assume that the regression coefficients are 0. By ( A.3.10) and expression ( 3.6.27) of Theorem 3.6.3 we can write  ′   ′  1 1 X Zθ X ϕ(F (e)) −1 1 √ T (Y) = √ + op (1) ; + τϕ n Z′ Zθ n n Z′ ϕ(F (e))

198

CHAPTER 3. LINEAR MODELS

hence, the first p components of

√1 T (Y) n

satisfy

1 1 1 √ S(Y) = √ X′ ϕ(F (e)) + τϕ−1 X′ Zθ + op (1) . n n n √ b e P − β) → 0 the result follows. By expression ( 3.9.3) and the fact that n(β From this corollary we obtain the following first order expressions of the R-residuals and R-fitted values: . bR = Y α1 + Xβ + τϕ Hϕ(F (e)) + HZγ . b eR = e − τϕ Hϕ(F (e)) + (I − H)Zγ ,

(3.9.6) (3.9.7)

where H = X (X′ X)−1 X′ . In Exercise 3.16.23 the reader is asked to show that the least squares fitted values and residuals satisfy b LS = α1 + Xβ + He + HZγ Y b eLS = e − He + (I − H)Zγ .

(3.9.8) (3.9.9)

In terms of model mispecification the coefficients of interest are the regression coefficients. Hence, at this time we need not consider the effect of the estimation of the intercept. This avoids the problem of which estimate of the intercept to use. In practice, though, for both R- and LS-fits, the intercept will also be fitted and its effect will be removed from the residuals. We will also include the effect of estimation of the intercept in our discussion of the standardization of residuals and fitted values in Sections 3.9.2 and 3.9.3, respectively. Suppose that the linear model ( 3.2.3) is correct. Based on its first order expression when γ = 0, b eR is a function of the random errors similar to b eLS ; hence, it follows that a plot b R should generally be a random scatter, similar to the least squares residual of b eR versus Y plot. In the case of model misspecification, note that the R-residuals and least squares residuals have the same asymptotic bias, namely (I − H)Zγ. Hence R-residual plots, similar to those of least squares, are useful in identifying model misspecification. For least squares residual plots, since least squares residuals and the fitted values are uncorrelated, any pattern in this plot is due to model misspecification and not the fitting procedure used. The converse, however, is not true. As the example on the potency of drug compounds below illustrates, the least squares residual plot can exhibit a random scatter for a poorly fitted model. This orthogonality in the LS residual plot does, however, make it easier to pick out patterns in the plot. Of course the R-residuals are not orthogonal to the R-fitted values, but they are usually close to orthogonality; see Naranjo et al. (1994). We introduce the following parameter ν to measure the extent of departure from orthogonality. b and b Denote general fitted values and residuals by Y e respectively. The expected departure from orthogonality is the parameter ν defined by h i b . ν=E b (3.9.10) e′ Y

3.9. DIAGNOSTICS

199

For least squares, νLS is of course 0. For R-fits, we have the following first order expression for it: Theorem 3.9.2. Under the assumptions of Theorem 3.9.1 and either Model ( 3.2.3) or Model ( 3.9.1), . νR = pτϕ (E[ϕ(F (e1 ))e1 ] − τϕ ) . (3.9.11) Proof: Suppose Model ( 3.9.1) holds. Using the above first order expressions we have . νR = E [(e + α1 − τϕ Hϕ(F (e)) + (I − H)Zγ)′ (Xβ + τϕ Hϕ(F (e)) + HZγ)] Using E[ϕ(F (e))] = 0, E[e] = E(e1 )1, and the fact that X is centered this expression simplifies to . νR = τϕ E [trHϕ(F (e))e′ ] − τϕ2 E [trHϕ(F (e))ϕ(F (e))′ ] . Since the components of e are independent, the result follows. The result is invariant to either of the models. Although in general, νR 6= 0 for R-estimates, if, as the next corollary shows, optimal scores (see Examples 3.6.1 and 3.6.2) are used the expected departure from orthogonality is 0. Corollary 3.9.2. Under the hypothesis of the last theorem, if optimal R-scores are used then νR = 0. R ′ (F −1 (u)) Proof: Let ϕ(u) = −cf where c is chosen so that ϕ2 (u)du = 1. Then f (F −1 (u)) τϕ =

Z

 ′ −1  −1 f (F (u)) ϕ(u) − du = c. f (F −1 (u))

Some simplification and an integration by parts shows Z Z ϕ(F (e))e dF (e) = −c f ′ (e) de = c. Naranjo et al. (1994) conducted a simulation study to investigate the above properties of rank-based and LS residuals over several small sample situations of null (the true model was fitted) models and misspecified models. Error distributions included the normal distribution and a contaminated normal distribution. Wilcoxon scores were used. The first part of the study concerned the amount of association between residuals and fitted values where the association was measured by several correlation coefficients, including Pearson’s r and Kendall’s τ . Because of orthogonality between the LS residuals and fitted values, Pearson’s r is always 0 for LS. On the other measures of association, however, the results for the Wilcoxon analysis and LS were about the same. In general, there was little association. The second part investigated measures of randomness in a residual plot, including a runs tests and a quadrant count test, (the quadrants were determined by the medians of the residuals and fitted values). The results were similar for the LS and Wilcoxon fits. Both showed validity


Table 3.9.1: Cloud Data

%I-8:         0     1     2     3     4     5     6     7     8
Cloud Point:  22.1  24.5  26.0  26.8  28.2  28.9  30.0  30.4  31.4
%I-8:         0     2     4     6     8     10
Cloud Point:  21.9  26.1  28.5  30.3  31.5  33.1
%I-8:         0     3     6     9
Cloud Point:  22.8  27.3  29.8  31.8

In a power study over a quadratic misspecified model, the Wilcoxon analysis exhibited more power for long tailed error distributions. In summary, the simulation study provided empirical evidence that residual analyses based on Wilcoxon fits are similar to LS based residual analyses.

There are other useful residual plots. Two that we will briefly discuss are q−q plots and added variable plots. As with standard residual plots, the internal R-studentized residuals (see Section 3.9.2) can be used in place of the residuals. Since the R-estimates of β are consistent, the distribution of the residuals should resemble the distribution of the errors. This leads to consideration of another useful residual plot, the q−q plot. In this plot, the quantiles of the target distribution form the horizontal coordinates, while the sample quantiles (the ordered residuals) form the vertical coordinates. Linearity of this plot indicates the appropriateness of the target distribution as the true model distribution; see Exercise 3.16.24. McKean and Sievers (1989) discuss how to use these plots adaptively to select appropriate rank scores. In the next example, we use them to examine how well the R-fit fits the bulk of the data and to highlight outliers.

For the added variable plot, let ê_R denote the residuals from the R-fit of the model Y = α1 + Xβ + e. In this case, Z is a known vector and we wish to decide whether or not to add it to the regression model. For the added variable plot, we regress Z on X; we denote the residuals from this fit by ê(Z | X) = (I − H)Z. The added variable plot consists of the scatter plot of the residuals ê_R versus ê(Z | X). Under model misspecification, γ ≠ 0, from expression ( 3.9.7) the residuals ê_R are also a function of (I − H)Z. Hence the plot can be quite powerful in determining the potential of Z as a predictor; a minimal sketch of the construction follows.
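The following sketch assumes numpy and matplotlib are available and that e_R holds residuals from a rank-based fit computed elsewhere; it is an illustration, not the authors' code.

    import numpy as np
    import matplotlib.pyplot as plt

    def added_variable_plot(X, Z, e_R):
        """Scatter of rank-based residuals against the part of Z orthogonal to X."""
        n = X.shape[0]
        X1 = np.column_stack([np.ones(n), X])    # include the intercept column
        H = X1 @ np.linalg.pinv(X1)              # projection onto the column space
        e_Z_given_X = Z - H @ Z                  # e(Z | X) = (I - H) Z
        plt.scatter(e_Z_given_X, e_R)
        plt.xlabel("e(Z | X)")
        plt.ylabel("R-residuals")
        plt.show()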

Example 3.9.1. Cloud Data

The data for this example can be found in Table 3.9.1. It is taken from an exercise on p. 162 of Draper and Smith (1966). The dependent variable is the cloud point of a liquid, a measure of degree of crystallization in a stock. The independent variable is the percentage of I-8 in the base stock. The subsequent R-fits for this data set were all based on Wilcoxon scores with the intercept estimate α̂_S, the median of the residuals. Panel A of Figure 3.9.1 displays the residual plot (R-residuals versus R-fitted values) of the R-fit of the simple linear model. The curvature in the plot indicates that this model is a poor choice and that a higher degree polynomial model would be more appropriate. Panel B of Figure 3.9.1 displays the residual plot from the R-fit of a quadratic model. Some curvature is still present in the plot. A cubic polynomial was fitted next. Its R-residual plot, found in Panel C of Figure 3.9.1, is much more of a random scatter than the first two plots.


On the basis of the residual plots, the cubic polynomial is an adequate model. Least squares residual plots would also lead to a third degree polynomial.

Figure 3.9.1: Panels A through C are the residual plots of the Wilcoxon fits of the linear, quadratic, and cubic models, respectively, for the Cloud Data. Panel D is the q−q plot based on the Wilcoxon fit of the cubic model.
In the R-residual plot of the cubic model, several points appear to be outlying from the bulk of the data. These points are also apparent in Panel D of Figure 3.9.1, which displays the q−q plot of the R-residuals. Based on these plots, the R-regression appears to have fit the bulk of the data well. The q−q plot suggests that the underlying error distribution has slightly heavier tails than the normal distribution. A scale would be helpful in interpreting these residual plots, as discussed in the next section. Table 3.9.2 displays the estimated coefficients along with their standard errors. The Wilcoxon and least squares fits are practically the same.

Example 3.9.2. Potency Data, Example 3.3.3, continued


Table 3.9.2: Wilcoxon and LS estimates of the regression coefficients for the Cloud Data. Standard errors are in parentheses.

Method          Intercept     Linear       Quadratic    Cubic        Scale
Wilcoxon        22.35 (.18)   2.24 (.17)   -.23 (.04)   .01 (.003)   τ̂_ϕ = .307
Least Squares   22.31 (.15)   2.22 (.15)   -.22 (.04)   .01 (.003)   σ̂ = .281

This example was discussed in Section 3.3. Recall that the data were the result of an experiment concerning the potency of drug compounds manufactured under different levels of 4 factors and one covariate. Here we want to discuss a residual analysis of the rank-based fits of the two models that were fit in Example 3.3.3. First consider Model ( 3.3.1) without the quadratic terms, i.e., without the parameters β₁₁, β₁₂, and β₁₃. The residuals used are the internal R-studentized residuals defined in the next section; see ( 3.9.31). They provide a convenient scale for detecting outliers. The curvature in the Wilcoxon residual plot of this model, Panel A of Figure 3.9.2, is quite apparent, indicating the need for quadratic terms in the model, whereas the LS residual plot, Panel C of Figure 3.9.2, does not exhibit this quadratic effect. As the R-residual plot indicates, there are outliers in the data, and these had an effect on the LS fit. Panels B and D display the residual plots when the squared terms of the factors are added to the model, i.e., when Model ( 3.3.1) was fit. This R-residual plot no longer exhibits the quadratic effect, indicating a better fitting model. Also, by examining the R-plots for both models, it is seen that the outlyingness of some of the outliers indicated in the plot for the first model was accounted for by the larger model.

3.9.2 Standardization of R-Residuals

In this section we obtain an expression for the variance of the R-residuals under Model ( 3.2.3). We will assume in this section that σ², the variance of the errors, is finite. As we show below, similar to the least squares residual, the variance of an R-residual depends both on its location in the x-space and on the underlying variation of the errors. The internal Studentized least squares residuals (residuals divided by their estimated standard errors) have proved useful in diagnostic procedures since they correct for both the model and the underlying variance. The internal R-studentized residuals defined below, ( 3.9.31), are similarly Studentized R-residuals.

A diagnostic use of a Studentized residual is in detecting outlying observations. The R-method provides a robust fit to the bulk of the data. Thus any case with a large Studentized residual can be considered an outlier from this model. Even though a robust fit is resistant to outliers, it is still useful to detect such points. Indeed, in practice these are often the points of most interest. The value of an internally Studentized residual is in its simplicity: it tells how many estimated standard errors a residual is away from the center of the data.

The standardization depends on which estimate of the intercept is selected. We shall obtain the result for α̂_S, the median of the ê_R,i, and only state the results for the intercept based on symmetric errors.


Figure 3.9.2: Panels A and B are the Wilcoxon internal studentized residual plots for the models without and with, respectively, the three quadratic terms β₁₁, β₁₂, and β₁₃. Panels C and D are the analogous plots for the LS fit.
Thus the residuals we seek to standardize are given by

    ê_R = Y − α̂_S 1 − Xβ̂_ϕ .   (3.9.12)

We will obtain a first order approximation of cov(ê_R). Since the residuals are invariant to the regression coefficients, we can assume without loss of generality that the true parameters are zero. Recall that h_ci is the ith diagonal element of H = X(X′X)⁻¹X′ and that hᵢ = n⁻¹ + h_ci.

Theorem 3.9.3. Under the conditions (E.1), (E.2), (D.1), (D.2), and (S.1) of Section 3.4, if the intercept estimate is α̂_S, then a first order representation of the variance of ê_R,i is

    Var(ê_R,i) ≐ σ²(1 − K₁n⁻¹ − K₂h_ci) ,   (3.9.13)

where K₁ and K₂ are defined in expressions ( 3.9.18) and ( 3.9.19), respectively. In the case of a symmetric error distribution, when the estimate of the intercept is given by α̂_ϕ⁺, discussed in Section 3.5.2, and (S.3) also holds,

    Var(ê_R,i) ≐ σ²(1 − K₂hᵢ) .   (3.9.14)

Proof: Using the first order expression for β̂_ϕ given in ( 3.5.24) and the asymptotic representation of α̂_S given by ( 3.5.23), we have

    ê_R ≐ e − τ_S s̄gn(e)1 − Hτ_ϕϕ(F(e)) ,   (3.9.15)

where s̄gn(e) = n⁻¹ Σᵢ sgn(eᵢ) and τ_S and τ_ϕ are defined in expressions ( 3.4.6) and ( 3.4.4), respectively. Because the median of eᵢ is 0 and ∫ϕ(u) du = 0, we have

    E[ê_R] ≐ E(e₁)1 .

Hence,

    cov(ê_R) ≐ E[(e − τ_S s̄gn(e)1 − Hτ_ϕϕ(F(e)) − E(e₁)1)(e − τ_S s̄gn(e)1 − Hτ_ϕϕ(F(e)) − E(e₁)1)′] .   (3.9.16)

Let J = 11′/n denote the projection onto the space spanned by 1. Since our design matrix is [1 X], the leverage of the ith case is hᵢ = n⁻¹ + h_ci, where h_ci is the ith diagonal entry of the projection matrix H. By expanding the above expression and using the independence of the components of e, we get after some simplification (see Exercise 3.16.25):

    Cov(ê_R) ≐ σ²{I − K₁J − K₂H} ,   (3.9.17)

where

    K₁ = (τ_S²/σ²)(2δ_S/τ_S − 1) ,   (3.9.18)
    K₂ = (τ_ϕ²/σ²)(2δ/τ_ϕ − 1) ,   (3.9.19)
    δ_S = E[eᵢ sgn(eᵢ)] ,   (3.9.20)
    δ = E[eᵢ ϕ(F(eᵢ))] ,   (3.9.21)
    σ² = Var(eᵢ) = E((eᵢ − E(eᵢ))²) .   (3.9.22)

This yields the first result, ( 3.9.13). Next consider the case of a symmetric error distribution. If the estimate of the intercept is given by α̂_ϕ⁺, discussed in Section 3.5.2, the result simplifies to ( 3.9.14).

From Cook and Weisberg (1982, p. 11), in the least squares case, Var(ê_LS,i) = σ²(1 − hᵢ), so that K₁ and K₂ are correction factors due to using the rank score function.

Based on the results in the theorem, an estimate of the variance-covariance matrix of ê_R is

    S̃ = σ̂²{I − K̂₁J − K̂₂H_c} ,   (3.9.23)

where

    K̂₁ = (τ̂_S²/σ̂²)(2δ̂_S/τ̂_S − 1) ,   (3.9.24)
    K̂₂ = (τ̂_ϕ²/σ̂²)(2δ̂/τ̂_ϕ − 1) ,   (3.9.25)
    δ̂_S = (1/(n − p)) Σᵢ |ê_R,i| ,   (3.9.26)

and

    δ̂ = (1/(n − p)) D(β̂_ϕ) .

The estimators τ̂_S and τ̂_ϕ are discussed in Section 3.7.1. To complete the estimate of Cov(ê_R), we need to estimate σ. A robust estimate of it is given by the MAD,

    σ̂ = 1.483 medᵢ{|ê_R,i − medⱼ ê_R,j|} ,   (3.9.27)

which is a consistent estimate of σ if the errors have a normal distribution. For the examples discussed here, we used this estimate in ( 3.9.23)–( 3.9.25). It follows from ( 3.9.23) that an estimate of Var(ê_R,i) is

    s̃²_R,i = σ̂²(1 − K̂₁(1/n) − K̂₂h_c,i) ,   (3.9.28)

where h_ci = xᵢ′(X′X)⁻¹xᵢ. Let σ̂²_LS denote the usual least squares estimate of the variance. Least squares residuals are standardized by s̃_LS,i, where

    s̃²_LS,i = σ̂²_LS(1 − hᵢ) ;   (3.9.29)

see page 11 of Cook and Weisberg (1982), and recall that hᵢ = n⁻¹ + xᵢ′(X′X)⁻¹xᵢ. If the error distribution is symmetric, ( 3.9.28) reduces to

    s̃²_R,i = σ̂²(1 − K̂₂hᵢ) .   (3.9.30)

Internal R-studentized Residual

We define the internal R-studentized residuals as

    r_R,i = ê_R,i / s̃_R,i , for i = 1, . . . , n ,   (3.9.31)

where s̃_R,i is the square root of either ( 3.9.28) or ( 3.9.30), depending on whether one assumes an asymmetric or a symmetric error distribution, respectively.
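As an illustration, here is a minimal numpy sketch (not the authors' code) of the internal R-studentized residuals ( 3.9.31) under the asymmetric-errors standardization ( 3.9.28); the inputs tau_S, tau_phi, and the minimized dispersion D(β̂_ϕ) are assumed to come from a rank-based fit obtained elsewhere.

    import numpy as np

    def internal_r_studentized(e_R, X, tau_S, tau_phi, D_beta):
        """Internal R-studentized residuals r_{R,i}, asymmetric-errors case."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)                  # centered design
        h_c = np.diag(Xc @ np.linalg.pinv(Xc))   # centered leverages h_{ci}

        sigma = 1.483 * np.median(np.abs(e_R - np.median(e_R)))   # MAD, (3.9.27)
        delta_S = np.sum(np.abs(e_R)) / (n - p)                   # (3.9.26)
        delta = D_beta / (n - p)                                  # estimate of E[e phi(F(e))]
        K1 = (tau_S**2 / sigma**2) * (2 * delta_S / tau_S - 1)    # (3.9.24)
        K2 = (tau_phi**2 / sigma**2) * (2 * delta / tau_phi - 1)  # (3.9.25)

        var_i = sigma**2 * (1 - K1 / n - K2 * h_c)                # (3.9.28)
        # fall back to sigma^2 (1 - h_i), as in the text, if an estimate is negative
        var_i = np.where(var_i > 0, var_i, sigma**2 * (1 - (1.0 / n + h_c)))
        return e_R / np.sqrt(var_i)                               # (3.9.31)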


It is interesting to compare expression ( 3.9.30) with the estimate of the variance of the least squares residual, σ̂²_LS(1 − hᵢ). The correction factor K̂₂ depends on the score function ϕ(·) and the underlying symmetric error distribution. If, for example, the error distribution is normal and if we use normal scores, then K̂₂ converges in probability to 1; see Exercise 3.16.26. In general, however, we will not wish to specify the error distribution, and then K̂₂ provides a natural adjustment.

A simple benchmark is useful in declaring whether or not a case is an outlier. We are certainly not advocating eliminating such cases but flagging them as potential outliers and targeting them for further study. As we discussed in the last section, the distribution of the R-residuals should resemble the true distribution of the errors; hence a simple rule for all cases is not apparent. In general, unless the residuals appear to be from a highly skewed distribution, a simple rule is to declare a case to be a potential outlier if its residual exceeds two standard errors in absolute value; i.e., |r_R,i| > 2.

The matrix S̃, ( 3.9.23), is an estimate of a first order approximation of cov(ê_R). It is not necessarily positive semidefinite, and we have not constrained it to be so. In practice this has not proved troublesome, since only occasionally have we encountered negative estimates of the variances of the residuals. For instance, the R-fit for the cloud data resulted in one case with a negative variance. Presently, we replace ( 3.9.28) by σ̂√(1 − hᵢ), where σ̂ is the MAD estimate ( 3.9.27), in these situations.

We have already illustrated the internal R-studentized residuals for the potency data of Example 3.9.2 discussed in the last section. We use them next on the Cloud data.

Example 3.9.3. Cloud Data, Example 3.9.1, continued

Returning to the cloud data example, Panel A of Figure 3.9.3 displays a residual plot of the internal Wilcoxon studentized residuals versus the fitted values. It is similar to Panel C of Figure 3.9.1, but it has a meaningful scale on the vertical axis. The residuals for three of the cases (4, 10, and 16) are over two standard errors from the center of the data. These should be flagged as potential outliers. Panel B of Figure 3.9.3 displays the normal q−q plot of the internal Wilcoxon studentized residuals. The underlying error structure appears to have heavier tails than the normal distribution. As with their least squares counterparts, we think the chief benefits of the internal R-studentized residuals are their usefulness in diagnostic plots and in flagging potential outliers.

External R-studentized Residual

Another statistic that is useful for flagging outliers is a robust version of the external t statistic. The LS version of this diagnostic is discussed in detail in Cook and Weisberg (1982). A robust version is discussed in McKean, Sheather and Hettmansperger (1991); we briefly describe this latter approach. Suppose we want to examine the ith case to see if it is an outlier. Consider the mean shift model given by

    Y = X₁b + θᵢdᵢ + e ,   (3.9.32)


Figure 3.9.3: Internal Wilcoxon studentized residual plot, Panel A, and corresponding normal q−q plot, Panel B, for the Cloud Data.
where X₁ is the augmented matrix [1 X] and dᵢ is an n × 1 vector of zeroes except for its ith component, which is a 1. A formal hypothesis that the ith case is an outlier is given by

    H₀: θᵢ = 0 versus H_A: θᵢ ≠ 0 .   (3.9.33)

One way of testing these hypotheses is to use the test procedures described in Section 3.6. This requires fitting Model ( 3.9.32) for each value of i. A second approach is described next. Note that we can rewrite Model ( 3.9.32) equivalently as

    Y = X₁b* + θᵢdᵢ* + e ,   (3.9.34)

where dᵢ* = (I − H₁)dᵢ, H₁ is the projection matrix onto the column space of X₁, and b* = b + H₁dᵢθᵢ; see Exercise 3.16.27. Because of the orthogonality between X₁ and dᵢ*, the least squares estimate of θᵢ can be obtained by a simple linear regression of Y on dᵢ*, or equivalently of ê_LS on dᵢ*. For the rank-based estimate, the asymptotic distribution theory


of the regression estimates suggests a similar approach. Accordingly, let θ̂_R,i denote the R-estimate when ê_R is regressed on dᵢ*. This is a simple regression, and the estimate can be obtained by a linear search algorithm; see Section 3.7.2. As Exercise 3.16.29 shows, this estimate is the inversion of an aligned rank statistic for testing the hypotheses ( 3.9.33). Next let τ̂_ϕ,i denote the estimate of τ_ϕ produced from this regression. We define the external R-studentized residual to be the statistic

    t_R(i) = θ̂_R,i / (τ̂_ϕ,i/√(1 − h_1,i)) ,   (3.9.35)

where h_1,i is the ith diagonal entry of H₁. Note that we have standardized θ̂_R,i by its asymptotic standard error.

A final remark on these external t-statistics is in order. In the mean shift model, ( 3.9.32), the leverage value of the ith case is 1. Hence, the design assumption (D.2), ( 3.4.7), is not true. This invalidates both the LS and the rank-based asymptotic theory for the external t statistics. In light of this, we do not propose the statistic t_R(i) as a test statistic for the hypotheses ( 3.9.33), but as a diagnostic for flagging potential outliers. As a benchmark, we suggest the value 2; a sketch of the computation is given below.
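In the sketch below, the rank-based simple regression is abstracted by the hypothetical functions rfit_slope and tauhat; these names are placeholders, not any particular package's API.

    import numpy as np

    def external_r_studentized(i, X, e_R, rfit_slope, tauhat):
        """External R-studentized residual t_R(i), (3.9.35)."""
        n = X.shape[0]
        X1 = np.column_stack([np.ones(n), X])    # augmented design [1 X]
        H1 = X1 @ np.linalg.pinv(X1)
        d = np.zeros(n)
        d[i] = 1.0
        d_star = d - H1 @ d                      # d*_i = (I - H1) d_i
        theta_hat = rfit_slope(d_star, e_R)      # R-estimate from regressing e_R on d*_i
        tau_hat = tauhat(d_star, e_R)            # estimate of tau_phi from that fit
        return theta_hat / (tau_hat / np.sqrt(1.0 - H1[i, i]))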

3.9.3 Measures of Influential Cases

Since R-estimates have bounded influence in the y-space but not in the x-space, the R-fit may be affected by outlying points in the x-space. We next introduce a statistic which measures the influence of the ith case on the robust fit. We work with the usual model ( 3.2.3). First, we need the first order representation of Ŷ_R. Similar to the proof of Theorem 3.9.3, which obtained the first order representation of the residuals, ( 3.9.15), we have

    Ŷ_R ≐ α1 + Xβ + τ_S s̄gn(e)1 + Hτ_ϕϕ(F(e)) ;   (3.9.36)

see Exercise 3.16.28. Let Ŷ_R(i) denote the R-predicted value of Yᵢ when the ith case is deleted from the model; we shall call this the delete-i model. Then the change in the robust fit due to the ith case is

    RDFFITᵢ = Ŷ_R,i − Ŷ_R(i) .   (3.9.37)

RDFFITᵢ is our measure of the influence of case i. Computation of this statistic is discussed later. Clearly, in order to be useful, RDFFITᵢ must be assessed relative to some scale. RDFFIT is a change in the fitted value; hence, a natural scale for assessing RDFFIT is a fitted value scale. Using α̂_S as our estimate of the intercept, it follows from expression ( 3.9.36) with γ = 0 that

    Var(Ŷ_R,i) ≐ n⁻¹τ_S² + h_c,iτ_ϕ² .   (3.9.38)

Hence, based on a fitted scale assessment, we standardize RDFFITᵢ by an estimate of the square root of this quantity.


For least squares diagnostics there is some discussion on whether to use the original model or the model with the ith point deleted for the estimation of scale. Cook and Weisberg (1982) advocate the original model; in this case the scale estimate is the same for all n cases, which allows casewise comparisons involving the diagnostic. Belsley, Kuh, and Welsch (1980), however, advocate scale estimation based on the delete-i model. Note that both standardizations correct for the model and the underlying variation of the errors. Let τ̂_S(i) and τ̂_ϕ(i) denote the estimates of τ_S and τ_ϕ for the delete-i model, as discussed above. Then our diagnostic, in which RDFFITᵢ is assessed relative to a fitted value scale with estimates of scale based on the delete-i model, is given by

    RDFFITSᵢ = RDFFITᵢ / (n⁻¹τ̂_S²(i) + h_c,iτ̂_ϕ²(i))^{1/2} .   (3.9.39)

This is an R-analogue of the least squares diagnostic DFFITSᵢ proposed by Belsley et al. (1980). For standardization based on the original model, replace τ̂_S(i) and τ̂_ϕ(i) by τ̂_S and τ̂_ϕ, respectively. We shall define

    RDCOOKᵢ = RDFFITᵢ / (n⁻¹τ̂_S² + h_c,iτ̂_ϕ²)^{1/2} .   (3.9.40)

If α̂_ϕ⁺ is used as the estimate of the intercept then, provided the errors have a symmetric distribution, the R-diagnostics are obtained by replacing the variance ( 3.9.38) with Var(Ŷ_R,i) ≐ hᵢτ_ϕ²; see Exercise 3.16.30 for details. This results in the diagnostics

    RDFFITS_symm,i = RDFFITᵢ / (√hᵢ τ̂_ϕ(i)) ,   (3.9.41)

and

    RDCOOK_symm,i = RDFFITᵢ / (√hᵢ τ̂_ϕ) .   (3.9.42)

This eliminates the need to estimate τ_S. A sketch of the computation of RDFFITS is given below.
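The sketch below treats rank_fit as a hypothetical fitting routine, assumed to return the intercept, slopes, fitted values, and the scale estimates τ̂_S and τ̂_ϕ; it is a placeholder for whatever rank-based fitting code is at hand.

    import numpy as np

    def rdffits(i, X, y, rank_fit):
        """RDFFITS_i, (3.9.39), with delete-i scale estimates."""
        n = X.shape[0]
        Xc = X - X.mean(axis=0)
        h_c = np.diag(Xc @ np.linalg.pinv(Xc))            # centered leverages

        full = rank_fit(X, y)                             # fit on all n cases
        keep = np.arange(n) != i
        part = rank_fit(X[keep], y[keep])                 # delete-i fit
        yhat_i_del = part["alpha"] + X[i] @ part["beta"]  # Y-hat_R(i)

        rdffit = full["yhat"][i] - yhat_i_del             # (3.9.37)
        scale = np.sqrt(part["tau_S"]**2 / n + h_c[i] * part["tau_phi"]**2)
        return rdffit / scale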

This eliminates the need to estimate τS . There is also a disagreement on what benchmarks to use for flagging points of potential influence. As Belsley et al. (1980) discuss in some detail, DF Fp IT S is inversely influenced by sample size. They advocate a size-adjusted benchmark of 2 p/n for DF F IT S. Cook √ and Weisberg (1982) suggest a more conservative value which results in p. We shall use both benchmarks in the examples. We realize these diagnostics only flag potentially influential points that require investigation. Similar to the two references cited above, we would never recommend indiscriminately deleting observations solely because their diagnostic values exceed the benchmark. Rather these are potential points of influence which should be investigated. The diagnostics described above are formed with the leverage values based on the projection matrix. These leverage values are nonrobust, (see Rousseeuw and van Zomeren, 1990). For data sets with clusters of outliers in factor space robust leverage values can be formulated


in terms of high breakdown estimates of the center and scatter matrix in factor space. One such choice is the MVE, the minimum volume ellipsoid, proposed by Rousseeuw and van Zomeren (1990). Other estimates could be based on the robust singular value decomposition discussed by Ammann (1993); see also Simpson, Ruppert and Carroll (1992).

We recommend computing Ŷ_R(i) with a one or two step R-estimate based on the residuals from the original model; see Section 3.7.2. Each step involves a single ordering of the residuals, which are nearly in order (in fact, on the first step they are in order), and a single projection onto the range of X (easily obtained by using the routines in LINPACK, as discussed in Section 3.7.2).

The diagnostic RDFFITSᵢ measures the change in the fitted values when the ith case is deleted. Similarly, we can also measure changes in the estimates of the regression coefficients. For the LS analysis, this is the diagnostic DFBETAS proposed by Belsley, Kuh and Welsch (1980). The corresponding diagnostic for the rank-based analysis is

    RDBETASᵢⱼ = (β̂_ϕ,j − β̂_ϕ,j(i)) / (τ̂_ϕ(i)√((X′X)⁻¹ⱼⱼ)) ,   (3.9.43)

where β̂_ϕ(i) denotes the R-estimate of β in the delete-i model. A similar statistic can be constructed for the intercept parameter. Furthermore, a DCOOK version can also be constructed as above. These diagnostics are often used when |RDFFITSᵢ| is large. In such cases, it may be of interest to know which components of the regression coefficients are more influential than other components. The benchmark suggested by Belsley, Kuh and Welsch (1980) is 2/√n.

Example 3.9.4. Free Fatty Acid (FFA) Data

The data for this example can be found in Morrison (1983, p. 64) and, for convenience, in Table 3.9.3. The response is the level of free fatty acid of prepubescent boys, while the independent variables are age, weight, and skin fold thickness. The sample size is 41. Panel A of Figure 3.9.4 depicts the residual plot based on the least squares internal t residuals. From this plot there appear to be several outliers. Certainly cases 12, 22, 26, and 9 are outlying, and perhaps cases 8, 10, and 38. In fact, the first four of these cases probably control the least squares fit, obscuring cases 8, 10, and 38.

As our first R-fit of this data, we used the Wilcoxon scores with the intercept estimated by the median of the residuals, α̂_S. Note that all seven cases stand out in the Wilcoxon residual plot based on the internal R-studentized residuals, ( 3.9.31); see Panel B of Figure 3.9.4. This is further confirmed by the fits displayed in Table 3.9.4, where the LS fit with these seven cases deleted is very similar to the Wilcoxon fit using all the cases. The q−q plot of the internal R-studentized residuals, Panel C of Figure 3.9.4, also highlights these outlying cases. Similar to the residual plot, the q−q plot suggests that the underlying error distribution is positively skewed with a light left tail. The estimates of the regression coefficients and their standard errors are displayed in Table 3.9.4. Due to the skewness in the data, it is not surprising that the LS and R estimates of the intercept are different, since the former estimates the mean of the residuals while the latter estimates the median of the residuals.

Table 3.9.3: Free Fatty Acid (FFA) Data

Case  Age (months)  Weight (lbs)  Skin Fold Thickness  Free Fatty Acid
  1       105            67             0.96                0.759
  2       107            70             0.52                0.274
  3       100            54             0.62                0.685
  4       103            60             0.76                0.526
  5        97            61             1.00                0.859
  6       101            62             0.74                0.652
  7        99            71             0.76                0.349
  8       101            48             0.62                1.120
  9       107            59             0.56                1.059
 10       100            51             0.44                1.035
 11       100            80             0.74                0.531
 12       101            57             0.58                1.333
 13       104            58             1.10                0.674
 14        99            58             0.72                0.686
 15       101            54             0.72                0.789
 16       110            66             0.54                0.641
 17       109            59             0.68                0.641
 18       109            64             0.44                0.355
 19       110            76             0.52                0.256
 20       111            50             0.60                0.627
 21       112            64             0.70                0.444
 22       117            73             0.96                1.016
 23       109            68             0.82                0.582
 24       112            67             0.52                0.325
 25       111            81             1.14                0.368
 26       115            74             0.82                0.818
 27       115            63             0.56                0.384
 28       125            74             0.72                0.509
 29       131            70             0.58                0.634
 30       121            63             0.90                0.526
 31       123            67             0.66                0.337
 32       125            82             0.94                0.307
 33       122            62             0.62                0.748
 34       124            67             0.74                0.401
 35       122            60             0.60                0.451
 36       129            98             1.86                0.344
 37       128            76             0.82                0.545
 38       127            63             0.26                0.781
 39       140            79             0.74                0.501
 40       141            60             0.62                0.524
 41       139            81             0.78                0.318


Figure 3.9.4: Panel A, Internal LS studentized residual plot on the original Free Fatty Acid Data; Panel B, Internal Wilcoxon studentized residual plot on the original Free Fatty Acid Data; Panel C, Internal Wilcoxon studentized normal q−q plot on the original Free Fatty Acid Data; and Panel D, Internal R-studentized residual plot on the original Free Fatty Acid Data based on the score function ϕ.5(u).
Table 3.9.5 displays the values of the R and LS diagnostics for the cases of interest. For the seven cases cited above, the internal Wilcoxon studentized residuals, ( 3.9.31), definitely flag three of the cases, and for two of the others the values exceed 1.70; see Panel B of Figure 3.9.4. As RDFFITS, ( 3.9.39), indicates, none of these seven cases seems to have an effect on the Wilcoxon fit (the liberal benchmark is .62), whereas the 12th case appears to have an effect on the least squares fit. RDFFITS exceeded the benchmark only for case 2, for which it had the value −.64. Case 36, with h₃₆ = .53, has high leverage, but it did not have an adverse effect on either the Wilcoxon fit or the LS fit.

Table 3.9.4: Estimates of β (first cell entry) and σ̂_β̂ (second cell entry) for the Free Fatty Acid Data.

                          Original Data                                       log y
Par.    LS            Wilcoxon      LS (w/o 7 pts.)  R-Bent Score    LS            Wilcoxon
β0       1.70  .33     1.49  .27     1.24  .21        1.37  .21       1.12  .52      .99  .54
β1      -.002  .003   -.001  .003   -.001  .002      -.001  .002     -.001  .005    .000  .005
β2      -.015  .005   -.015  .004   -.013  .003      -.015  .003     -.029  .008   -.031  .008
β3       .205  .167    .274  .137    .285  .103       .355  .104      .444  .263    .555  .271
Scale    .215          .178          .126             .134            .341          .350

Table 3.9.5: Regression diagnostics for cases of interest for the Fatty Acid Data.

                  LS                 Wilcoxon            Bent Score
Case   hᵢ     Int. t   DFFIT     Int. t   DFFIT      Int. t   DFFIT
  8    0.12    1.16     0.43      1.57     0.44       1.73     0.31
  9    0.04    1.74     0.38      2.14     0.13       2.37     0.26
 10    0.09    1.12     0.36      1.59     0.53       1.84     0.30
 12    0.06    2.84     0.79      3.30     0.33       3.59     0.30
 22    0.05    2.26     0.53      2.51    -0.06       2.55     0.11
 26    0.04    1.51     0.32      1.79     0.20       1.86     0.10
 38    0.15    1.27     0.54      1.70     0.53       1.93     0.19
  2    0.10   -1.19    -0.40     -0.17    -0.64      -0.75    -0.48
  7    0.11   -1.07    -0.37     -0.75    -0.44      -0.74    -0.64
 11    0.22    0.56     0.30      0.97     0.31       1.03     0.07
 40    0.25   -0.51    -0.29     -0.31    -0.21      -0.35     0.06
 36    0.53    0.18     0.19     -0.04    -0.27      -0.66    -0.34

This is true too of cases 11 and 40, which were the only other cases whose leverage values exceeded the benchmark of 2p/n. As we noted above, both the residual and the q−q plots indicate that the distribution of the residuals is positively skewed. This suggests a transformation, as discussed below, or perhaps a prudent choice of a score function more appropriate for skewed error distributions than the Wilcoxon scores. The score function ϕ.5(u), ( 2.5.34), is more suited to positively skewed errors. Panel D of Figure 3.9.4 displays the internal R-studentized residuals based on the R-fit using this bent score function. From this plot and the tabled diagnostics, the outliers stand out more from this fit than from the previous two fits. The RDFFITS values for this fit are even smaller than those of the Wilcoxon fit, which is expected since this score function protects on the right. While case 7 has a little influence on the bent score fit, no other case has RDFFITS exceeding the benchmark.

Table 3.9.4 displays the estimates of the betas for the three fits along with their standard errors. At the .05 level, coefficients 2 and 3 are significant for the robust fits, while only coefficient 2 is significant for the LS fit. The robust fits appear to be an improvement over LS. Of the two robust fits, the bent score fit appears to be more precise than the Wilcoxon fit.


LS. Of the two robust fits, the bent score fit appears to be more precise than the Wilcoxon fit. A practical transformation on the response variable suggested by the Box-Cox transformation is the log. Panel A of Figure 3.9.5 shows the internal R-studentized residuals plot based on the Wilcoxon fit of the log transformed response. Note that 5 of the cases still stand out in the plot. The residuals from the transformed response still appear to be skewed as is evident in the q−q plot, Panel B of Figure 3.9.5. From Table 3.9.4, the Wilcoxon fit seems slightly more precise in terms of standard errors. Figure 3.9.5: Panel A, Internal R-studentized residuals plot of the log transfomed Free Fatty Acid Data; Panel B, Corresponding normal q−q plot.

• •

• •

••

0.0 • • •



• • -1.0

• •





••

• •

• • •• • • •• • • •

••

-0.6

-0.2

Wilcoxon fit logs data

3.10

2.0 1.0





• •

0.0

• •

-1.0

1.0



Panel B Wilcoxon Studentized residuals (logs)

• • •

-1.0

Wilcoxon Studentized residuals

2.0

Panel A

• -2





•• • • •• ••••• •• •• ••• • •• • •• •••• ••• ••••

-1

0

1



2

Normal quantiles

3.10 Survival Analysis

In this section we discuss scores which are appropriate for lifetime distributions when the log of lifetime follows a linear model. These are called accelerated failure time models; see


Kalbfleisch and Prentice (1980). Let T denote the lifetime of a subject and let x be a p × 1 vector of covariates associated with T. Let h(t; x) denote the hazard function of T at time t; see Section 2.8. Suppose T follows a log linear model; that is, Y = log T follows the linear model

    Y = α + x′β + e ,   (3.10.1)

where e is a random error with density f. Exponentiating both sides we get T = exp{α + x′β}T₀, where T₀ = exp{e}. Let h₀(t) denote the hazard function of T₀; this is called the baseline hazard function. Then the hazard function of T is given by

    h(t; x) = h₀(t exp{−(α + x′β)}) exp{−(α + x′β)} .   (3.10.2)

Thus the covariate x accelerates or decelerates the failure time of T; hence the name accelerated failure time for these models. An important subclass of the accelerated failure time models are those where T₀ follows a Weibull distribution, i.e.,

    f_{T₀}(t) = λγ(λt)^{γ−1} exp{−(λt)^γ} , t > 0 ,   (3.10.3)

where λ and γ are unknown parameters. In this case it follows that the hazard function of T is proportional to the baseline hazard function, with the covariate acting as the factor of proportionality; i.e.,

    h(t; x) = h₀(t) exp{−(α + x′β)} .   (3.10.4)

Hence these models are called proportional hazards models. Kalbfleisch and Prentice (1980) show that the only proportional hazards models which are also accelerated failure time models are those for which T₀ has the Weibull density. We can write the random error e = log T₀ as e = ξ + γ⁻¹W₀, where ξ = −log γ and W₀ has the extreme value distribution discussed in Section 2.8 of Chapter 2. Thus the optimal rank scores for these log-linear models are generated by the function

    ϕ_fε(u) = −1 − log(1 − u) ;   (3.10.5)

see ( 2.8.8) of Chapter 2.

Next we consider suitable score functions for the general failure time models, ( 3.10.1). As noted in Kalbfleisch and Prentice (1980), many of the error distributions currently used for these models are contained in the log-F class. In this class, e = log T is distributed, down to an unknown scale parameter, as the log of an F random variable with 2m₁ and 2m₂ degrees of freedom. In this case we shall say that e has a GF(2m₁, 2m₂) distribution. The distribution of T is Weibull if (m₁, m₂) → (1, ∞), log-normal if (m₁, m₂) → (∞, ∞), and generalized gamma if (m₁, m₂) → (∞, 1); see Kalbfleisch and Prentice. If (m₁, m₂) = (1, 1), then e has a logistic distribution. In general this class contains a variety of shapes. The distributions are symmetric for m₁ = m₂, positively skewed for m₁ > m₂, and negatively skewed for m₁ < m₂. While Kalbfleisch and Prentice discuss this class for m₁, m₂ ≥ 1, we extend the class to m₁, m₂ > 0 in order to include heavier tailed error distributions.


For random errors with distribution GF(2m₁, 2m₂), the optimal rank score function is given by

    ϕ_{m₁,m₂}(u) = m₁m₂(exp{F⁻¹(u)} − 1) / (m₂ + m₁ exp{F⁻¹(u)}) ,   (3.10.6)

where F is the cdf of the GF(2m₁, 2m₂) distribution; see Exercise 3.16.31. We shall label these scores the GF(2m₁, 2m₂) scores. It follows that the scores are strictly increasing and bounded below by −m₁ and above by m₂. Hence an R-analysis based on these scores has bounded influence in the Y-space. A numerical sketch for generating these scores is given below.
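The following minimal sketch (assuming scipy is available) uses the fact that, since GF(2m₁, 2m₂) is the distribution of the log of an F(2m₁, 2m₂) variate, exp{F⁻¹(u)} is simply the F quantile, so no explicit log-F quantile routine is needed.

    import numpy as np
    from scipy.stats import f as f_dist

    def gf_scores(u, m1, m2):
        """GF(2m1, 2m2) scores, (3.10.6); bounded below by -m1 and above by m2."""
        q = f_dist.ppf(u, 2 * m1, 2 * m2)        # equals exp{F^{-1}(u)} for the log-F cdf F
        return m1 * m2 * (q - 1.0) / (m2 + m1 * q)

    u = np.linspace(0.01, 0.99, 99)
    print(gf_scores(u, 1.0, 1.0)[:3])            # m1 = m2 = 1 gives the linear 2u - 1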

Figure 3.10.1: Schematic of the four classes, C1–C4, of the GF(2m₁, 2m₂) scores. (The horizontal axis is m₁ and the vertical axis is m₂; the quadrant labels in the plot are Pos. Skewed (C1), Light Tailed (C2), Neg. Skewed (C3), and Heavy Tailed (C4).)
This class of scores can be conveniently divided into the four subclasses C1 through C4, which are represented by the four quadrants with center (1, 1), as depicted in Figure 3.10.1. The point (1, 1) in this figure corresponds to the linear-rank, Wilcoxon scores. These scores are optimal for the logistic distribution, GF(2, 2), and form a "natural" center point for the scores. One score function from each class, together with the density for which it is optimal, is plotted in Figure 3.10.2. These plots are generally representative. The score functions in C2


change from concave to convex as u increases and, hence, are suitable for light tailed error structure, while those in C4 pass from convex to concave and are suitable for heavy tailed error structure. The score functions in C3 are always convex and are suitable for negatively skewed error structure with heavy left tails and moderate right tails, while those in C1 are suitable for positively skewed errors with heavy right tails and moderate left tails.

Figure 3.10.2: Column A contains plots of the densities: the Class C1 distribution GF(3, .8); the Class C2 distribution GF(4, 8); the Class C3 distribution GF(.5, 6); and the Class C4 distribution GF(1, .6). Column B contains the corresponding optimal score functions.
Figure 3.10.2 shows how a score function corresponds to its density. If the density has a heavy right tail, then the score function tends to be flat on the right side; hence, the resulting estimate is less sensitive to outliers on the right. If, instead, the density has a light right tail, then the scores tend to rise on the right in order to accentuate points on the right. The plots in Figure 3.10.2 suggest approximating these scores by scores consisting of two or three line segments, such as the bent score function, ( 2.5.34). Generally the GF(2m₁, 2m₂) scores cannot be obtained in closed form due to F⁻¹, but


programs such as Minitab and Splus can easily produce them. There are two interesting subclasses for which closed forms are possible: the subclasses GF(2, 2m₂) and GF(2m₁, 2). As Exercise 3.16.32 shows, the random variables for these classes are the logs of variates having Pareto distributions. For the subclass GF(2, 2m₂), the score generating function is

    ϕ_{m₂}(u) = ((m₂ + 2)/m₂)^{1/2} (m₂ − (m₂ + 1)(1 − u)^{1/m₂}) .   (3.10.7)

These are the powers of rank scores discussed by Mielke (1972) in the context of two sample problems.

It is interesting to note that the asymptotic relative efficiency of the Wilcoxon scores to the optimal rank score function at the GF(2m₁, 2m₂) distribution is given by

    ARE = 12 Γ⁴(m₁ + m₂) Γ²(2m₁) Γ²(2m₂) (m₁ + m₂ + 1) / (Γ⁴(m₁) Γ⁴(m₂) Γ²(2m₁ + 2m₂) m₁m₂) ;   (3.10.8)

see Exercise 3.16.31. This efficiency can be arbitrarily small. For instance, in the subclass GF(2, 2m₂) the efficiency reduces to

    ARE = 3m₂(m₂ + 2)/(2m₂ + 1)² ,   (3.10.9)

which approaches 0 as m₂ → 0 and 3/4 as m₂ → ∞. Thus in the presence of severely skewed errors, the Wilcoxon scores can have arbitrarily low efficiency compared to a fully efficient R-estimate based on the optimal scores.
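The reduction ( 3.10.9) and the value at the Wilcoxon-optimal point are easy to check numerically; a small sketch, using only the standard library:

    from math import gamma

    def are_wilcoxon(m1, m2):                    # expression (3.10.8)
        num = 12 * gamma(m1 + m2)**4 * gamma(2 * m1)**2 * gamma(2 * m2)**2 * (m1 + m2 + 1)
        den = gamma(m1)**4 * gamma(m2)**4 * gamma(2 * m1 + 2 * m2)**2 * m1 * m2
        return num / den

    print(are_wilcoxon(1, 1))                    # 1.0: Wilcoxon scores optimal at GF(2, 2)
    for m2 in (0.1, 1.0, 5.0, 50.0):
        print(are_wilcoxon(1.0, m2), 3 * m2 * (m2 + 2) / (2 * m2 + 1)**2)   # matches (3.10.9)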

For a given problem, the choice of scores can be difficult. McKean and Sievers (1989) discuss several methods for score selection, one of which is illustrated in the next example. This method is adaptive in nature, with the adaptation depending on residuals from an initial fit. In practice, this can lead to overfitting. Its use, however, can lead to insight and may prove beneficial for fitting future data sets of the same type; see McKean et al. (1989) for such an application. Using XLISP-STAT (Tierney, 1990), Wang (1996) presents a graphical interface for methods of score selection.

Example 3.10.1. Insulating Fluid Data

We consider a problem discussed in Nelson (1982, p. 227) and also discussed by Lawless (1982, p. 185). The data consist of breakdown times T of an electrical insulating fluid subject to seven different levels of voltage stress v. Panel A of Figure 3.10.3 displays a scatter plot of Y = log T versus log v. As a full model, we consider a oneway layout, as discussed in Chapter 4, with the response variable Y = log T and with the seven voltage levels as treatments. The comparison boxplots, Panel B of Figure 3.10.3, are an appropriate display for this model.

The one method for score selection that we briefly touch on here is based on q−q plots; see McKean and Sievers (1989). Using Wilcoxon scores, we obtained an initial fit of the oneway layout model as discussed in Chapter 4. Panel C of Figure 3.10.3 displays the q−q plot of the


ordered residuals versus the logistic quantiles based on this fit. Although the left tail of the logistic distribution appears adequate, the right side of the plot indicates that distributions with lighter right tails might be more appropriate. This is confirmed by the near linearity of the GF(2, 10) quantiles versus the Wilcoxon residuals. After trying several R-fits using GF(2m₁, 2m₂) scores with m₁, m₂ ≥ 1, we decided that the q−q plot of the GF(2, 10) fit, Panel D of Figure 3.10.3, appeared to be most linear, and we used it to conduct the following R-analysis.

For the fit of the full model using the GF(2, 10) scores, the minimum value of the dispersion function, D, is 103.298, and the estimate of τ_ϕ is 1.38. Note that this minimum value of D is the analogue of the "pure" sum of squared errors in a least squares analysis; hence, we use the notation DPE = 103.298 for pure error dispersion. We first test the goodness of fit of a simple linear model. The reduced model in this case is the simple linear model; the alternative hypothesis is that the model is not linear but, other than this, it is not specified. Hence the full model is the oneway layout. Thus the hypotheses are

    H₀: Y = α + β log v + e versus H_A: the model is not linear.   (3.10.10)

To test H₀, we fit the reduced model Y = α + β log v + e. The dispersion at the reduced model is 104.399. Since, as noted above, the dispersion at the full model is 103.298, the lack of fit is the reduction in dispersion RD_LOF = 104.399 − 103.298 = 1.101. Therefore the value of the robust test statistic is F_ϕ = .319; the small computation below retraces it. There is no evidence on the basis of this test to contest a linear model.

The GF(2, 10) fit of the simple linear model is Ŷ = 64 − 17.67 log v, which is graphed in Panel A of Figure 3.10.3. Under this linear model, the estimate of the scale parameter τ_ϕ is 1.57. From this we compute a 95% confidence interval for the slope parameter β to be −17.67 ± 3.67; hence, it appears that the slope parameter differs significantly from 0. In Lawless there was interest in computing a confidence interval for E(Y | x = log 20). The robust estimate of this conditional mean is Ŷ = 11.07, and a confidence interval is 11.07 ± 1.9. Similar to the other robust confidence intervals, this interval is the same as in the least squares analysis, except that τ̂_ϕ replaces σ̂. A fuller discussion of the R-analysis of this data set can be found in McKean and Sievers (1989).
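For the record, the quoted statistic can be retraced as follows; the only assumption made here is that the reduction carries 7 − 2 = 5 degrees of freedom, the number of constraints in passing from the oneway layout to the simple linear model.

    RD_LOF = 104.399 - 103.298    # reduction in dispersion for lack of fit
    tau_hat = 1.38                # estimate of tau_phi at the full model
    q = 5                         # assumed degrees of freedom of the reduction
    F_phi = (RD_LOF / q) / (tau_hat / 2)
    print(round(F_phi, 3))        # 0.319, matching the quoted value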

Example 3.10.2. Sensitivity Analysis for Insulating Fluid Data

As noted by Lawless, engineers may suggest a Weibull distribution for breakdown times in this problem. As discussed earlier, this means the errors have an extreme value distribution. This distribution is essentially the limit of a GF(2, 2m) distribution as m → ∞. For completeness we obtained, using the IMSL (1987) subroutine UMIAH, estimates based on an extreme value likelihood function; these estimates are labeled EXT. R-estimates based on the optimal R-score function ( 2.8.8) for the extreme value distribution are labeled REXT. The influence functions of the EXT and REXT estimates are unbounded in the Y-space and, hence, neither estimate is robust; see ( 3.5.17). In order to illustrate this lack of robustness, we conducted a small sensitivity analysis: we replaced the fifth point, which had the value 6.05 (log units), with an outlying observation.

Table 3.10.1: Sensitivity Analysis for Insulating Data.

                                     Value of Y₅
           Original (6.05)    7.75            10.05           16.05           30.05
Estimate     α̂      β̂       α̂      β̂       α̂      β̂       α̂      β̂       α̂      β̂
LS          59.4   -16.4    60.8   -16.8    62.7   -17.3    67.6   -18.7    79.1   -21.9
Wilcoxon    62.7   -17.2    63.1   -17.4    63.0   -17.4    63.1   -17.4    63.1   -17.4
GF(2,10)    64.0   -17.7    65.5   -18.1    67.0   -18.5    67.1   -18.5    67.1   -18.5
REXT        64.1   -17.7    65.5   -18.1    68.3   -18.9    68.3   -18.9    68.3   -18.9
EXT         64.8   -17.7    68.4   -18.7    79.3   -21.8   114.6   -31.8   191.7   -53.5

Table 3.10.1 summarizes the results for several different choices of the outlier. Note that even for the first case, when the changed point is 7.75, which is the maximum of the original data, there is a substantial change in the EXT estimates. The EXT fit is a disaster when the point is changed to 10.05, whereas the R-estimates exhibit robustness; this is even more so for the succeeding cases. Although the REXT estimates have an unbounded influence function, they behaved well in this sensitivity analysis.

3.11 Correlation Model

In this section, we are concerned with the correlation model defined by

    Y = α + x′β + e ,   (3.11.1)

where x is a p-dimensional random vector with distribution function M and density function m, e is a random variable with distribution function F and density f, and x and e are independent. Let H and h denote the joint distribution function and joint density function of Y and x. It follows that

    h(x, y) = f(y − α − x′β) m(x) .   (3.11.2)

Denote the marginal distribution and density functions of Y by G and g. The hypotheses of interest are:

    H₀: Y and x are independent versus H_A: Y and x are dependent .   (3.11.3)

By ( 3.11.2) this is equivalent to the hypotheses H₀: β = 0 versus H_A: β ≠ 0. For this section, we will use the additional assumptions:

    (E.2) Var(e) = σₑ² < ∞   (3.11.4)
    (M.1) E[xx′] = Σ , Σ > 0 .   (3.11.5)

Without loss of generality, assume that E[x] = 0 and E(e) = 0.


Let (x₁, Y₁), . . . , (xₙ, Yₙ) be a random sample from the above model. Define the n × p matrix X₁ to be the matrix whose ith row is the vector xᵢ′, and let X be the corresponding centered matrix, i.e., X = (I − n⁻¹11′)X₁. Thus the notation here agrees with that found in the previous sections. We intend to briefly describe the rank-based analysis for this model. As we will show, using conditional arguments, the asymptotic inference we developed for the fixed-x case holds for the stochastic case also. We then explore measures of association between x and Y. These will be analogues of the classical coefficient of multiple determination, R̄². As with R̄², these robust CMDs will be 0 when x and Y are independent and positive when they are dependent. Besides defining these measures, we will obtain consistent estimates of them. First we show that, conditionally, the assumptions of Section 3.4 hold. Much of the discussion in this section is taken from the paper by Witt, Naranjo and McKean (1995).

3.11.1 Huber's Condition for the Correlation Model

The key assumption on the design matrix for the nonstochastic-x linear model was Huber's condition, (D.2), ( 3.4.7). As we next show, it holds almost surely (a.s.) for the correlation model. This will allow us to easily obtain inference methods for the correlation model, as discussed below. First define the modulus of a matrix A to be

    m(A) = max_{i,j} |a_ij| .   (3.11.6)

As Exercise 3.16.33 shows, the following three facts follow from this definition: m(AB) ≤ p m(A)m(B), where p is the common dimension of A and B; m(AA′) ≥ m(A)²; and m(A) = max a_ii if A is positive semidefinite. We next need a preliminary lemma found in Arnold (1980).

Lemma 3.11.1. Let {aₙ} be a sequence of nonnegative real numbers. If n⁻¹ Σᵢ₌₁ⁿ aᵢ → a₀, then n⁻¹ sup_{1≤i≤n} aᵢ → 0.

Proof: We have

    aₙ/n = (1/n) Σᵢ₌₁ⁿ aᵢ − ((n−1)/n)(1/(n−1)) Σᵢ₌₁ⁿ⁻¹ aᵢ → 0 .   (3.11.7)

Now suppose that n⁻¹ sup_{1≤i≤n} aᵢ ↛ 0. Then for some ε > 0 and for all integers N there exists an n_N such that n_N ≥ N and n_N⁻¹ sup_{1≤i≤n_N} aᵢ ≥ ε. Thus we can find a subsequence of integers {nⱼ} such that nⱼ → ∞ and nⱼ⁻¹ sup_{1≤i≤nⱼ} aᵢ ≥ ε. Let a_{i_{nⱼ}} = sup_{1≤i≤nⱼ} aᵢ. Then, since i_{nⱼ} ≤ nⱼ,

    ε ≤ a_{i_{nⱼ}}/nⱼ ≤ a_{i_{nⱼ}}/i_{nⱼ} .   (3.11.8)

Also, since nⱼ → ∞ and ε > 0, i_{nⱼ} → ∞; hence, expression ( 3.11.8) leads to a contradiction of expression ( 3.11.7).

The following theorem is due to Arnold (1980).


Theorem 3.11.1. Under ( 3.11.5),

    lim_{n→∞} max diag{X(X′X)⁻¹X′} = 0 , a.s.   (3.11.9)

Proof: Using the facts cited above on the modulus of a matrix, we have

    m(X(X′X)⁻¹X′) ≤ p² n⁻¹ m(XX′) m((n⁻¹X′X)⁻¹) .   (3.11.10)

Using the assumptions on the correlation model, the law of large numbers yields (1/n)X′X → Σ, a.s. Hence we need only show that n⁻¹m(XX′) → 0, a.s. Let Uᵢ denote the ith diagonal element of XX′. We then have

    (1/n) Σᵢ₌₁ⁿ Uᵢ = (1/n) tr X′X → tr Σ , a.s.

By Lemma 3.11.1 we have n⁻¹ sup_{i≤n} Uᵢ → 0, a.s. Since XX′ is positive semidefinite, the desired conclusion is obtained from the facts which followed expression ( 3.11.6).

Thus, given X, we have the same assumptions on the design matrix as in the previous sections. By conditioning on X, the theory derived in Section 3.5 holds for the correlation model also. Such a conditional argument is demonstrated in Theorem 3.11.2 below. For later discussion, we summarize the rank-based inference for the correlation model. Given a specified score function ϕ, let β̂_ϕ denote the R-estimate of β defined in Section 3.2. Under the correlation model ( 3.11.1) and the assumptions ( 3.11.4), (S.1), ( 3.4.10), and ( 3.11.5), √n(β̂_ϕ − β) →_D N_p(0, τ_ϕ²Σ⁻¹). Also, the estimates of τ_ϕ discussed in Section 3.7.1 are consistent estimates of τ_ϕ under the correlation model. Let τ̂_ϕ denote such an estimate.

In terms of testing, consider the R-test statistic, F_ϕ = (RD/p)/(τ̂_ϕ/2), of the above hypothesis H₀ of independence. Employing the usual conditional argument, it follows that pF_ϕ →_D χ²(p, δ_R), a.e. M under H_n: β = θ/√n, where the noncentrality parameter is given by δ_R = θ′Σθ/τ_ϕ². Likewise for the LS estimate β̂_LS: using the conditional argument (see Arnold (1980) for details), √n(β̂_LS − β) →_D N_p(0, σ²Σ⁻¹) and, under H_n, pF_LS →_D χ²(p, δ_LS) with noncentrality parameter δ_LS = θ′Σθ/σ². Thus the ARE of the R-test F_ϕ to the least squares test F_LS is the ratio of noncentrality parameters, σ²/τ_ϕ². This is the usual ARE of rank tests to tests based on least squares in simple location models. Hence the test statistic F_ϕ has efficiency robustness. The theory of rank-based tests in Section 3.6 applies to the correlation model.

We return to measures of association and their estimates. For motivation, we consider the least squares measure first.

3.11.2 Traditional Measure of Association and its Estimate

The traditional population coefficient of multiple determination (CMD) is defined by

    R̄² = β′Σβ / (σₑ² + β′Σβ) ;   (3.11.11)

see Arnold (1981). Note that R̄² is a measure of association between Y and x. It lies between 0 and 1, and it is 0 if and only if Y and x are independent (because Y and x are independent if and only if β = 0).

In order to obtain a consistent estimate of R̄², treat xᵢ as nonstochastic and fit, by least squares, the model Yᵢ = α + xᵢ′β + eᵢ, which will be called the full model. The residual amount of variation is SSE = Σᵢ₌₁ⁿ (Yᵢ − α̂_LS − xᵢ′β̂_LS)², where β̂_LS and α̂_LS are the least squares estimates. Next fit the reduced model, defined as the full model subject to H₀: β = 0. The total amount of variation is SST = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)². The reduction in variation in fitting the full model over the reduced model is SSR = SST − SSE. An estimate of R̄² is the proportion of explained variation given by

    R² = SSR/SST .   (3.11.12)

The least squares test statistic for H₀ versus H_A is F_LS = (SSR/p)/σ̂²_LS, where σ̂²_LS = SSE/(n − p − 1). Recall that R² can be expressed as

    R² = SSR/(SSR + (n − p − 1)σ̂²_LS) = [p/(n−p−1)] F_LS / (1 + [p/(n−p−1)] F_LS) .   (3.11.13)

Now consider the general correlation model. As shown in Arnold (1980), under ( 3.11.4) and ( 3.11.5), R² is a consistent estimate of R̄². Under the multivariate normal model, R² is the maximum likelihood estimate of R̄².
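Identity ( 3.11.13) is easy to verify numerically; a small numpy sketch (an illustration, not from the text) on simulated least squares output:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 40, 2
    X = rng.standard_normal((n, p))
    y = 1.0 + X @ np.array([0.8, -0.4]) + rng.standard_normal(n)

    X1 = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    SSE = np.sum((y - X1 @ b)**2)
    SST = np.sum((y - y.mean())**2)
    SSR = SST - SSE

    F_LS = (SSR / p) / (SSE / (n - p - 1))
    c = (p / (n - p - 1)) * F_LS
    print(SSR / SST, c / (1 + c))                # the two sides of (3.11.13) agree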

3.11.3 Robust Measure of Association and its Estimate

The rank-based analogue to the reduction in residual variation is the reduction in residual dispersion, which is given by RD = D(0) − D(β̂_R). Hence, the proportion of dispersion explained by fitting β is

    R1 = RD/D(0) .   (3.11.14)

This is a natural CMD for any robust estimate and, as we shall show below, the population CMD for which R1 is a consistent estimate does satisfy interesting properties. As expression ( A.5.11) of the Appendix shows, however, the influence function of the denominator is not bounded in the Y-space. Hence the statistic R1 is not robust.

In order to obtain a CMD which is robust, consider the test statistic of H₀, F_ϕ = (RD/p)/(τ̂_ϕ/2), ( 3.6.12). As we indicated above, the test statistic F_ϕ has efficiency robustness. Furthermore, as shown in the Appendix, the influence function of F_ϕ is bounded in the Y-space. Hence the test statistic is robust.


Consider the relationship between the classical F-test and R², given by expression ( 3.11.13). In the same way, but using the robust test F_ϕ, we can define a second R-coefficient of multiple determination

    R2 = [p/(n−p−1)] F_ϕ / (1 + [p/(n−p−1)] F_ϕ) = RD/(RD + (n − p − 1)(τ̂_ϕ/2)) .   (3.11.15)

It follows from the above discussion on the R-test statistic that R2 has bounded influence in the Y-space.

The parameters that correspond to the statistics D(0) and D(β̂_R) are, respectively, D̄_y = ∫ϕ(G(y)) y dG(y) and D̄_e = ∫ϕ(F(e)) e dF(e); see the discussion in Section 3.6.3. The population CMDs associated with R1 and R2 are:

    R̄1 = R̄D/D̄_y   (3.11.16)
    R̄2 = R̄D/(R̄D + (τ_ϕ/2)) ,   (3.11.17)

where R̄D = D̄_y − D̄_e. The properties of these parameters are discussed in the next section. The consistency of R1 and R2 is given in the following theorem:

Theorem 3.11.2. Under the correlation model ( 3.11.1) and the assumptions (E.1), ( 2.4.16), (S.1), ( 3.4.10), (S.2), ( 3.4.11), and ( 3.11.5),

    Ri →_P R̄i , a.e. M, i = 1, 2 .

Proof: Note that we can write

    (1/n) D(0) = (1/n) Σᵢ₌₁ⁿ ϕ((n/(n+1)) F̂ₙ(Yᵢ)) Yᵢ = ∫ ϕ((n/(n+1)) F̂ₙ(t)) t dF̂ₙ(t) ,

where F̂ₙ denotes the empirical distribution function of the random sample Y₁, . . . , Yₙ. As n → ∞, the integral converges to D̄_y. Next consider the reduction in dispersion. By Theorem 3.11.1, with probability 1, we can restrict the sample space to a space on which Huber's design condition (D.1) holds and on which n⁻¹X′X → Σ. Then, conditionally on X, we have the assumptions found in Section 3.4 for the nonstochastic model. Hence, from the discussion found in Section 3.6.3, (1/n)D(β̂_R) →_P D̄_e. Hence it is true unconditionally, a.e. M. The consistency of τ̂_ϕ was discussed above. The result then follows.


Example 3.11.1. Measures of Association for Wilcoxon Scores.

For the Wilcoxon score function, ϕ_W(u) = √12(u − 1/2), as Exercise 3.16.34 shows, D̄_y = ∫ϕ(G(y)) y dG(y) = √(3/4) E|Y₁ − Y₂|, where Y₁, Y₂ are iid with distribution function G. Likewise, D̄_e = √(3/4) E|e₁ − e₂|, where e₁, e₂ are iid with distribution function F. Finally, τ_ϕ = (√12 ∫f²)⁻¹. Hence for Wilcoxon scores these coefficients of multiple determination simplify to

    R̄_W1 = (E|Y₁ − Y₂| − E|e₁ − e₂|) / E|Y₁ − Y₂|   (3.11.18)
    R̄_W2 = (E|Y₁ − Y₂| − E|e₁ − e₂|) / (E|Y₁ − Y₂| − E|e₁ − e₂| + (1/(6∫f²))) .   (3.11.19)

As discussed above, in general, R̄_W1 is not robust but R̄_W2 is. A plug-in estimate of R̄_W1 is sketched below.
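The following minimal sketch (an illustration, not from the text) estimates ( 3.11.18) by using the responses and the residuals e_R of a rank-based fit, computed elsewhere, in place of Y and e.

    import numpy as np

    def rbar_w1_hat(y, e_R):
        """Plug-in estimate of (3.11.18) from responses and rank-based residuals."""
        dy = np.abs(y[:, None] - y[None, :])     # pairwise |Y_i - Y_j|
        de = np.abs(e_R[:, None] - e_R[None, :]) # pairwise |e_i - e_j|
        iu = np.triu_indices(len(y), 1)
        return (dy[iu].mean() - de[iu].mean()) / dy[iu].mean()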

Example 3.11.2. Measures of Association for Sign Scores.

For the sign score function, Exercise 3.16.34 shows that D̄_y = ∫ϕ(G(y)) y dG(y) = E|Y − med Y|, where med Y denotes the median of Y. Likewise, D̄_e = E|e − med e|. Hence for sign scores, the coefficients of multiple determination are

    R̄_S1 = (E|Y − med Y| − E|e − med e|) / E|Y − med Y|   (3.11.20)
    R̄_S2 = (E|Y − med Y| − E|e − med e|) / (E|Y − med Y| − E|e − med e| + (4f(med e))⁻¹) .   (3.11.21)

These were obtained by McKean and Sievers (1987) from an L₁ point of view.

3.11.4 Properties of R-Coefficients of Multiple Determination

In this section we explore further properties of the population coefficients of multiple determination proposed in the last section. To show that R̄1 and R̄2, ( 3.11.16) and ( 3.11.17), are indeed measures of association, we have the following two theorems. The proof of the first theorem is quite similar to corresponding proofs of properties of the dispersion function for the nonstochastic model.

Theorem 3.11.3. Suppose f and g satisfy the condition (E.1), ( 3.4.1), and their first moments are finite; then D̄_y > 0 and D̄_e > 0, where D̄_y = ∫ϕ(G(y)) y dG(y).

Proof: It suffices to show it for D̄_y, since the proof for D̄_e is the same. The function ϕ is increasing and ∫ϕ = 0; hence, ϕ must take on both negative and positive values. Thus the set A = {y : ϕ(G(y)) < 0} is not empty and is bounded above. Let y₀ = sup A. Then

    D̄_y = ∫_{−∞}^{y₀} ϕ(G(y))(y − y₀) dG(y) + ∫_{y₀}^{∞} ϕ(G(y))(y − y₀) dG(y) .   (3.11.22)

Since both integrands are nonnegative, it follows that D̄_y ≥ 0. If D̄_y = 0, then it follows from (E.1) that ϕ(G(y)) = 0 for all y ≠ y₀, which contradicts the facts that ϕ takes on both positive and negative values and that G is absolutely continuous.

The next theorem is taken from Witt (1989).

Theorem 3.11.4. Suppose f and g satisfy the conditions (E.1) and (E.2) in Section 3.4 and that ϕ satisfies assumption (S.2), ( 3.4.11). Then R̄D is a strictly convex function of β and has a minimum value of 0 at β = 0.

Proof: We will show that the gradient of R̄D is zero at β = 0 and that its second matrix derivative is positive definite. Note first that the distribution function, G, and density, g, of Y can be expressed as G(y) = ∫F(y − β′x) dM(x) and g(y) = ∫f(y − β′x) dM(x). We have

    ∂R̄D/∂β = −∫∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) u dM(x) dM(u) dy − ∫∫ ϕ[G(y)] y f′(y − β′x) x dM(x) dy .   (3.11.23)

Since E[x] = 0, both terms on the right side of the above expression are 0 at β = 0. Before obtaining the second derivative, we rewrite the first term of ( 3.11.23) as

    −∫ { ∫∫ ϕ′[G(y)] y f(y − β′x) f(y − β′u) dy dM(x) } u dM(u) = −∫ { ∫ ϕ′[G(y)] g(y) y f(y − β′u) dy } u dM(u) .

Next integrate the expression in brackets by parts with respect to y, using dv = ϕ′[G(y)] g(y) dy and t = y f(y − β′u). Since ϕ is bounded and f has a finite second moment, this leads to

    ∂R̄D/∂β = ∫∫ ϕ[G(y)] f(y − β′u) u dy dM(u) + ∫∫ ϕ[G(y)] y f′(y − β′u) u dy dM(u) − ∫∫ ϕ[G(y)] y f′(y − β′x) x dy dM(x)
            = ∫∫ ϕ[G(y)] f(y − β′u) u dy dM(u) .

Hence the second derivative of R̄D is

    ∂²R̄D/∂β∂β′ = −∫∫ ϕ[G(y)] f′(y − β′x) xx′ dy dM(x) − ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) xu′ dy dM(x) dM(u) .   (3.11.24)

Now integrate the first term on the right side of ( 3.11.24) by parts with respect to y, using dt = f′(y − β′x) dy and v = ϕ[G(y)]. This leads to

    ∂²R̄D/∂β∂β′ = −∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x(u − x)′ dy dM(x) dM(u) .   (3.11.25)

We have, however, the following identity:

    ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u)
        = ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) u(u − x)′ dy dM(x) dM(u) − ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) x(u − x)′ dy dM(x) dM(u) .

Since the two integrals on the right side of the last expression are negatives of each other, this combined with expression ( 3.11.25) leads to

    2 ∂²R̄D/∂β∂β′ = ∫∫∫ ϕ′[G(y)] f(y − β′x) f(y − β′u) (u − x)(u − x)′ dy dM(x) dM(u) .

Since the functions f and M are continuous and the score function is increasing, it follows that the right side of this last expression is a positive definite matrix.

It follows from these theorems that the R̄ᵢ satisfy properties of association similar to R̄². We have 0 ≤ R̄ᵢ ≤ 1. By Theorem 3.11.4, R̄ᵢ = 0 if and only if β = 0, if and only if Y and x are independent.

Example 3.11.3. Multivariate Normal Model

Further understanding of the R̄ᵢ can be gleaned from their direct relationship with R̄² for the multivariate normal model.

Theorem 3.11.5. Suppose Model ( 3.11.1) holds. Assume further that (x, Y) follows a multivariate normal distribution with the variance-covariance matrix

    Σ_(x,Y) = [ Σ      Σβ
                β′Σ    σₑ² + β′Σβ ] .   (3.11.26)

Then, from ( 3.11.16) and ( 3.11.17), R̄1 = 1 −

q

2

1−R q 2 1− 1−R q = , 2 2 1 − 1 − R [1 − (1/(2T ))]

(3.11.27) (3.11.28)

R 2 where T = ϕ[Φ(t)]tdΦ(t), Φ is the standard normal distribution function, and R is the traditional coefficient of multiple determination given by ( 3.11.11).

228

CHAPTER 3. LINEAR MODELS

Proof: Note that σy2 = σe2 +β ′ Σβ and E(Y ) = α+β ′ E[x]. Further the distribution function of Y is G(y) = Φ((y −α −β ′ E(x))/σy ) where Φ is the standard normal distribution function. Then Z ∞ Dy = ϕ [Φ(y/σy )] ydΦ(y/σy ) (3.11.29) −∞

= σy T .

(3.11.30)

Similarly, De = σe T . Hence, 2

2

RD = (σy − σe )T

By the definition of R , we have R = 1 − 1−

q

σe2 . σy2

(3.11.31)

This leads to the relationship,

2

1−R =

σy − σe . σy

(3.11.32)

The result ( 3.11.27) follows from the expressions ( 3.11.31) and ( 3.11.32). For the result ( 3.11.28), by the assumptions on the distribution of (x, Y ), the distribution of e is N(0, σe2 ); i.e., f (x) = (2πσe2 )−1/2 exp {−x2 /(2σe2 )} and F (x) = Φ(x/σe ). It follows that f ′ (x)/f (x) = −σe−2 x, which leads to −

f ′ (F −1 (u)) 1 = Φ−1 (u) . ′ f (F (u)) σe

Hence, τϕ−1

Z

1



1 −1 Φ (u) σe



du ϕ(u) Z 1 1 = ϕ(u)Φ−1 (u) du . σe 0

=

0

Upon making the substitution u = Φ(t), we obtain the relationship T = σe /τϕ . Using this, the result ( 3.11.31), and the definition of R2 , ( 3.11.11), we get R2 =

σy −σe σy σy −σe + σσye 2T1 2 σy

.

The result for R2 follows from this and ( 3.11.32). Note that T is free of all parameters. It can be shown directly that the Ri s are one-to-one 2 increasing functions of R ; see Exercise 3.16.35. Hence, for the multivariate normal model 2 the parameters R , R1 , and R2 are equivalent. Although the CMDs are equivalent for the normal model, they measure dependence between x and Y on different scales. We can use the above relationships derived in the last theorem to have these coefficients measure the same quantity at the normal model by

3.11. CORRELATION MODEL

229

2

simply solving for R in terms of R1 and R2 in ( 3.11.27) and ( 3.11.28) respectively. These ∗ ∗ parameters will be useful later so we will call them R1 and R2 respectively. Hence solving as indicated we get ∗2

R1

∗2

R2

= 1 − (1 − R1 )2 2  1 − R2 . = 1− 1 − R2 (1 − (1/(2T 2 ))) 2

∗2

(3.11.33) (3.11.34)

∗2

Again, at the multivariate normal model we have R = R1 = R2 . For Wilcoxon scores and sign scores the reader is ask to show in Exercise 3.16.36 that (1/(2T 2)) = π/6 and (1/(2T 2)) = π/4, respectively.

Example 3.11.4. A Contaminated Normal Model. As an illustration of these population coefficients of multiple determination, we evaluate them for the situation where the random error e has a contaminated normal distribution with proportion of contamination ǫ and the ratio of contaminated variance to uncontaminated σc2 , the random variable x has a univariate normal N(0, 1) distribution, and the parameter β = 1. So β ′ Σβ = 1. Without loss of generality, we took α = 0 in ( 3.11.1). Hence Y and x are dependent. We consider the CMDs based on the Wilcoxon score function only. The density of Y = x + e is given by, !   1−ǫ y ǫ y g(y) = √ φ √ φ p . +p 2 2 1 + σc2 1 + σc2

This leads to the expressions, √ n o p √ 12 −1/2 2 (1 − ǫ)2 2 + 2−1/2 ǫ2 1 + σc2 + ǫ(1 − ǫ)[3 + σc2 ]1/2 Dy = √ 2π √ n o p 12 −1/2 De = √ 2 (1 − ǫ)2 + 2−1/2 ǫ2 σc + ǫ(1 − ǫ) 1 + σc2 2π "√ ( )#−1 12 (1 − ǫ)2 ǫ2 2ǫ(1 − ǫ) √ √ τϕ = + √ +p ; 2π 2 σc 2 σc2 + 1

see Exercise 3.16.37. Based on these quantities the coefficients of multiple determination 2 R , R1 and R2 can be readily formulated. Table 3.11.1 displays these parameters for several values of ǫ and for σc2 = 9 and 100. For ease of interpretation we rescaled the robust CMDs as discussed above. Thus at the normal ∗2 ∗2 2 (ǫ = 0) we have R1 = R2 = R with the common value of .5 in these situations. Certainly as either ǫ or σc change, the amount of dependence between Y and x changes; hence all

230

CHAPTER 3. LINEAR MODELS

Table 3.11.1: Coefficients of Multiple Determination under Contaminated Errors (e). e ∼ CN(ǫ, σc2 = 9) e ∼ CN(ǫ, σc2 = 100) ǫ ǫ CMD .00 .01 .02 .05 .10 .15 .00 .01 .02 .05 .10 .15 2 R .50 .48 .46 .42 .36 .31 .50 .33 .25 .14 .08 .06 ∗ R1 .50 .50 .48 .45 .41 .38 .50 .47 .42 .34 .26 .19 ∗ R2 .50 .50 .49 .47 .44 .42 .50 .49 .47 .45 .40 .36 the coefficients change somewhat. However, R2 decays as the percentage of contamination increases, and the decay is rapid in the case σc2 = 100. This is true also, to a lesser degree, ∗ for R1 which is predictable since its denominator has unbounded influence in the Y -space. ∗ The coefficient R2 shows stability with the increase in contamination. For instance when ∗ σc2 = 100, R2 decays .44 units while R2 decays only .14 units. See Witt et al. (1995) for more discussion on this example. Ghosh and Sen (1971) proposed the mixed rank test statistic to test the hypothesis of independence ( 3.11.3). It is essentially the gradient test of the hypothesis H0 : β = 0. As we showed in Section 3.6, this test statistic is asymptotically equivalent to Fϕ . Ghosh and Sen (1971), also, proposed a pure rank statistic in which both variables are ranked and scored.

3.11.5

Coefficients of Determination for Regression

We have mainly been concerned with coefficients of multiple determination as measures of dependence between the random variables Y and x. In the regression setting, though, the statistic R2 is one of the most widely used statistics, not in the sense of estimating dependence but in the sense of comparing models. As the proportion of variance accounted for, R2 is intuitively appealing. Likewise R1 , the proportion of dispersion accounted for in the fit, is an intuitive statistic. But neither of these statistics are robust. The statistic R2 though is robust and is directly linked (a one-to-one function) to the robust test statistic Fϕ . Furthermore it lies between 0 and 1, having the values 1 for a perfect fit and 0 for a complete lack of fit. These properties make R2 an attractive coefficient of determination for regression as the following example illustrates. Example 3.11.5. Hald Data This data consists of 13 observations and 4 predictors. It can be found in Hald (1952) but it is also discussed in Draper and Smith (1966) where it serves to illustrate a method of predictor subset selection based on R2 . The data are given in Table 3.11.2. The response is the heat evolved in calories per gram of cement. The predictors are the percent in weight

3.11. CORRELATION MODEL Table 3.11.2: x1 7 1 11 11 7 11 3 1 2 21 1 11 10

231 Hald x2 26 29 56 31 52 55 71 31 54 47 40 66 68

Data used in Example 3.11.5 x3 x4 Response 6 60 78.5 15 52 74.3 8 20 104.3 8 47 87.6 6 33 95.9 9 22 109.2 17 6 102.7 22 44 72.5 18 22 93.1 4 26 115.9 23 34 83.8 9 12 113.3 8 12 109.4

Table 3.11.3: Coefficients of Multiple Determination on Hald Data Subset of Original Data Changed Data Predictors R2 R1 R2 R2 R1 R2 {x1 , x2 } .98 .86 .92 .57 .55 .92 {x1 , x3 } .55 .33 .52 .47 .24 .41 {x1 , x4 } .97 .84 .90 .52 .51 .88 {x2 , x3 } .85 .63 .76 .66 .46 .72 {x2 , x4 } .68 .46 .62 .34 .27 .57 {x3 , x4 } .94 .76 .89 .67 .52 .83 of ingredients used in the cement and are given by: x1 x2 x3 x4

= = = =

amount amount amount amount

of of of of

tricalcium aluminate tricalcium silicate tetracalcium alumino ferrite dicalcium silicate .

To illustrate the use of the coefficients of determination R1 and R2 , suppose we are interested in the best two variable predictor model based on coefficients of determination. Table 3.11.3 gives the results for two data sets. The first is the original Hald data while in the second we changed the 11th response observation from 83.8 to 8.8. Note that on the original data all three coefficients choose the subset {x1 , x2 }. For the changed data, though, the outlier severely affects the LS coefficient R2 and the nonrobust coefficient R1 , but the robust coefficient R2 was much less sensitive to the outlier. It chooses the same subset {x1 , x2 } as it did with the original data; however, the LS coefficient selects

232

CHAPTER 3. LINEAR MODELS

the subset {x3 , x4 }, two different predictors than its selection for the original data. The nonrobust coefficient R1 still chooses {x1 , x2 }, although, at a relativity much smaller value. This example illustrates that the coefficient R2 can be used in the selection of predictors in a regression problem. This selection could be formalized like the MAXR procedure in SAS. In a similar vein, the stepwise model building criteria based on LS estimation (Draper and Smith, 1966) could easily be robustified by using R-estimates in place of LSestimates and the robust test statistic Fϕ in place of FLS .

3.12

High Breakdown (HBR) Estimates

By (3.5.17), the influence function of the R-estimate is unbounded in the x-space. While in a designed experiment this is of little consequence, for non-designed experiments where there are widely dispersed xs, (i.e. outliers in factor space), this is of some concern. In this chapter we present R-estimators which have influence functions bounded in both spaces and which can have 50% breakdown. We shall call these estimators high breakdown R (HBR) estimators. Further, we derive diagnostics which differentiate between fits based on these estimators, R-estimators and LS-estimators. Tableman (1990) provides an alternative development of bounded influence R-estimates.

3.12.1

Geometry of the HBR-Estimates

Consider the linear model ( 3.2.3). In Chapter 3, estimation and testing are based on the pseudo-norm, (3.2.6). Here we shall consider the function X kukHBR = bij |ui − uj | , (3.12.1) i 0, where τϕ is the same scale parameter as in Chapter 3; i.e., defined in expression (3.4.4). From this we obtain the asymptotic representation of the R estimator given by √ √ b ϕ = τϕ n(X′ X)−1 X′ ϕ[F(e)] + op (1). nβ (5.1.7)

b is approximately normal Based on (5.1.6) and (5.1.7), it follows that the distribution of β ϕ with mean β and covariance matrix ! m X X′k Σϕ,k Xk (X′X)−1 . (5.1.8) Vϕ = τϕ2 (X′X)−1 k=1

Letting τs = 1/2f (0), α bS is approximately normal with mean α and variance "n # m k X X X 1 var(sgn(ekj )) + cov(sgn(ekj ), sgn(ekj ′ )) . σ12 (0) = τS2 n k=1 j=1 ′

(5.1.9)

j6=j

In this section, we have kept the model general; i.e., we have not specified the covariance b . Define structure. To conduct inference, we need an estimate of the covariance matrix of β ϕ the residuals of the R fit by b . b eR = Y − α bs 1n − Xβ (5.1.10) ϕ

Using these residuals, we estimate the parameter τϕ as discussed in Section 3.7.1. Next, a nonparametric estimate of Σϕ,k , (5.1.5). is obtained by replacing the distribution function F (t) in its definition by the empirical distribution function of the residuals. Based on these results, for a specified vector h ∈ Rp , an approximate (1 − α)100% confidence interval for h′ β is given by q ′b b ϕ h. h β ϕ ± zα/2 h′ V (5.1.11) Consider general linear hypotheses of the form H0 : Mβ = 0 versus HA : Mβ 6= 0, where M is a q × p matrix of rank q. We offer two test statistics. First, the asymptotic b suggests a Wald type test of H0 based on the test statistic distribution of β ϕ b )T [MV b ). b ϕ MT ]−1 (Mβ TW,ϕ = (Mβ ϕ ϕ

(5.1.12)

326

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

Under H0 , TW,ϕ has an asymptotic χ2q distribution with q degrees of freedom. Hence, a nominal level α test is to reject H0 if TW,ϕ ≥ χ2α (q). As in the independent error case, this test is consistent for all alternatives of the form Mβ 6= 0. For efficiency results consider β a sequence of local alternatives of the form: HAn : Mβ n = √n , where β 6= 0. Under this sequence of alternatives TW,ϕ has an asymptotic noncentral χ2q -distribution with noncentrality parameter η = (Mβ)T [MVϕ MT ]−1 Mβ. (5.1.13) A second test utilizes the reduction in dispersion, RDϕ = D(Red) − D(Full), where D(Full) and D(Red) are respectively the minimum values of the dispersion function under the full and reduced (full model constrained by H0 ) models. The asymptotically correct standardization depends on the dependence structure of the errors; see Exercises 5.6.5 and 5.6.6 for discussion on this test and also of the aligned rank test of Chapter 3. Our discussion has been for general scores. If we have knowledge of the distribution of the errors then we can optimize the analysis by selecting a suitable score function. From expression (5.1.8), although the dependence structure appears in the approximate covariance b , as in Chapters 2 and 3, the constant of proportionality is τϕ . Hence, the discussion in of β ϕ Chapters 2 and 3 concerning score selection based on minimizing τϕ is still pertinent for the rank-based analysis of this section. Example 5.2.1 of the next section illustrates such score selection. If the score function is bounded, then based on their asymptotic representation, (5.1.7), these R estimators have bounded influence in response space but not in factor space. However, for outliers in factor space, the high breakdown HBR estimators, (3.12.2), can be extended in the same way as the R estimates.

5.1.1

Applications

In many applications the form of the covariance structure of the random vector of errors ek of Model 5.1.1 is known. This can result in a simplified asymptotic covariance structure for b . We discuss several such cases in the next few sections. In Section 5.2, we consider a β ϕ simple mixed model with block as a random effect. Here, besides an estimate of τϕ , only an additional covariance parameter is required to estimate Vϕ . In Section 5.3.1, we discuss a transformed procedure for a simple mixed model, provided the the block design matrices, Xk ’s, have full column rank. Another rich class of such models is the repeated measure designs, where block is synonymous with subject. Two common types of covariance structure for these designs are: (i) the covariance of the errors for a subject have compound symmetrical structure, i.e., a simple random effect model, or (ii) the errors follow a stationary time series model, for instance an autoregressive model. For Case (ii), the univariate marginals would have the same distribution and, hence, the above assumptions hold for our rank-based estimates. Using the residuals from the rank-based fit, R estimators of the autoregressive parameters of the error distribution can be obtained. These estimates could then be used in the usual way to transform the observations and then a second (generalized) R estimate

5.2. SIMPLE MIXED MODELS

327

could be obtained based on these transformed observations; see Exercise 5.6.7 for details. This is a robust analogue of the two-stage estimation procedure discussed for cluster samples in Rao, Sutradhar and Yue (1993). Generalized R estimators based on transformations are discussed in Sections 5.3 and 5.4.

5.2

Simple Mixed Models

In this section, we discuss a simple mixed model with block or cluster as a random effect. Consider Model (5.1.1), but for each block k, model the error vector ek as ek = 1nk bk + ǫk , where the components of ǫk are independent and identically distributed and bk is a continuous random variable which is independent of ǫk . Hence, we write the model as Yk = α1nk + Xk β + 1nk bk + ǫk , k = 1, . . . m.

(5.2.1)

Assume that the random effects b1 , . . . , bm are independent and identically distributed random variables. It follows that the distribution of ek is exchangeable. In particular, all marginal distributions of ek are the same; so, the theory of Section 5.1 holds. This family of models contains the randomized block designs, but as in Section 5.1 the blocks can be incomplete. b , (5.1.8) simplifies to For this model, the asymptotic variance-covariance matrix of β ϕ P ′ ′ −1 (5.2.2) τϕ2 (X′X)−1 m k=1 Xk Σϕ,k Xk (X X) , Σϕ,k = (1 − ρϕ )1I nk + ρϕ Jnk , where ρϕ = cov {ϕ[F (e11 )], ϕ[F (e12 )]} = E{ϕ[F (e11 )]ϕ[F (e12 )]}. Also, the asymptotic variance of the intercept (5.1.9) simplifies to n−1 τS2 (1 + n∗ ρ∗S ), for ρ∗S = cov [sgn (e11 ), sgn (e12 )] P and n∗ = n−1 m k=1 nk (nk − 1). As with LS, for positive definiteness,  need to assume that Pm nkwe ∗ each of ρϕ and ρS exceeds maxk {−1/(nk − 1)}. Let M = k=1 2 − p, (the subtraction of p, the dimension of the vector β, is a degree of freedom correction). A simple moment estimator of ρϕ is m X X −1 ρbϕ = M a[R(b eki )]a[R(b ekj )]. (5.2.3) k=1 i>j

Plugging this into (5.2.2) and using the estimate of τϕ discussed earlier, we have an estimate of the asymptotic covariance matrix of the R estimators. For the general mixed model (5.1.1) of Section 5.1, the ARE’s for the rank-based procedures are difficult to obtain; however, for the simple mixed model, (5.2.1), the ARE can be obtained in closed form provided the design is centered within each block; see Kloke et al. (2009). The reader is asked to show in Exercise 5.6.2 that for Wilcoxon scores, this ARE is 2 Z 2 2 f (t) dt , (5.2.4) ARE(FW,ϕ , FLS ) = [(1 − ρ)/(1 − ρϕ )]12σ where ρϕ is defined under expression (5.2.2) and ρ is the correlation coefficient within a block. If the random vectors in a block follow the multivariate normal distribution, then this

328

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

ARE lies in the interval [0.8660, 0.9549] when 0 < ρ < 1. The lower bound is attained when ρ → 1. The upper bound is attained when ρ = 0 (the independent case), which is the usual high efficiency of the Wilcoxon to LS at the normal distribution. When −1 < ρ < 0, this ARE lies in [0.9549, 0.9662] and the upper bound is attained when ρ = −0.52 and the lower bound is attained when ρ → −1. Generally, the high efficiency properties of the Wilcoxon analysis to LS analysis in the independent errors case extend to the Wilcoxon analysis for this mixed model design. See Kloke et al. (2009) for details.

5.2.1

Variance Component Estimators

In this section, we assume that the variances of the errors exist. Let Σek denote the variancecovariance matrix of ek . Under the model of this section, the variance-covariance matrix of ek is compound symmetric having the form Σek = σ 2 Ak (ρ) = σ 2 [(1 − ρ)Ink + ρJnk ], where σ 2 = Var(eki), Ink is the identity matrix of order nk , and Jnk is a nk × nk matrix of ones. Letting σb2 and σε2 denote respectively the variances of the random effect bk and the error ε, the total variance is given by σ 2 = σε2 + σb2 . The intraclass correlation coefficient is ρ = σb2 /(σε2 + σb2 ). These parameters, (σε2 , σb2 , σ 2 ), are referred to as the variance components. To estimate these variance components, we use the estimates discussed in Kloke at al. (2009); see, also Rashid and Nandram (1998) and Gerand and Schucany (2007). In block k, rewrite model (5.2.1) as ykj − [α + x′kj β] = bk + εkj , j = 1, . . . , nk . The left side of this expression is estimated by the residual b k = 1, . . . , m; j = 1, . . . , nk . eR,kj = ykj − [b b α + x′kj β],

(5.2.5)

Hence, a predictor (estimate) of bk is given by bbk = med1≤j≤nk {b eR,kj }. Hence a robust estimator of the variance of bk is MAD, (3.9.27); that is, σ bb2

h i2 2 b b b = [MAD1≤k≤m (bk )] = 1.483 med1≤k≤m |bk − med1≤j≤m {bj }| .

(5.2.6)

In this simple mixed model, the residuals b ekj , (5.2.5), are often call the marginal residuals. In addition, though, we have the conditional residuals for the errors εkj which are defined by εbkj = b eR,kj − bbk , j = 1, . . . nk , k = 1, . . . , m. (5.2.7)

A robust estimate of σε2 is then

σ bε2 = [MAD1≤j≤nn ,1≤k≤m (b εkj )]2 .

(5.2.8)

σ b2 = σ bε2 + σ bb2 and ρb = σ bb2 /b σ2 .

(5.2.9)

Hence, robust estimates of the total variance σ 2 and the intraclass correlation coefficient are

Thus, our robust estimates of the variance components are given in expressions (5.2.6), (5.2.8), and (5.2.9).

5.2. SIMPLE MIXED MODELS

5.2.2

329

Studentized Residuals

In Chapter 3, we presented Studentized residuals for R and HBR fits. These residuals are fundamental for diagnostic analyses of linear models. They correct for both the model (factor space) and the underlying covariance structure and allow for a simple benchmark rule for designating potential outliers. In this section, we present Studentized residuals based on the R fit of the simple mixed model, (5.2.1). Because the marginal residuals ebR,kj , (5.2.5), are used to check the quality of fit, these are the appropriate residuals for standardizing. Because the block sample sizes nk are not necessarily the same, some additional notation simplifies the presentation. Let ν1 and ν2 be two parameters and define the block-diagonal matrix B(ν1 , ν2 ) = diag{B1 (ν1 , ν2 ), . . . , Bm (ν1 , ν2 )}, where Bk (ν1 , ν2 ) = (ν1 − ν2 )Ink + ν2 Jnk , k = 1, . . . , m. Hence, for Model (5.2.1), we can write Var(e) = σ 2 B(1, ρ). b ϕ given in expression (5.1.7), a tedious calcuUsing the asymptotic representation for β lation, similar to that in Section 3.9.2, shows that the approximate covariance matrix of b eR is given by

τ2 CR = σ 2 B(1, ρ) + s2 Jn B(1, ρ∗S )Jn + τ 2 Hc B(1, ρϕ )Hc n τs τs ∗ ∗ ∗ ∗ − B(δ11 , δ12 )Jn − τ B(δ11 , δ12 )Hc − Jn B(δ11 , δ12 ) n n τ τs τs τ + Jn B(γ11 , γ12 )Hc − τ Hc B(δ11 , δ12 ) + Hc B(γ11 , γ12 )Jn , (5.2.10) n n where Hc is the projection matrix onto the column space of the centered design matrix Xc , Jn is the n × n matrix of all ones, and ∗ δ11 ∗ δ12 δ11 δ12 γ11 γ12

= = = = = =

E[e11 sgn (e11 )], E[e11 sgn (e12 )], E[e11 ϕ(F (e11 ))], E[e11 ϕ(F (e12 ))], E[sgn(e11 )ϕ(F (e11 ))], E[sgn(e11 )ϕ(F (e12 ))],

and ρϕ and ρ∗S are defined in (5.1.5) and (5.1.8), respectively. To compute the Studentized residuals, estimates of the parameters in CR , (5.2.10), are required. First, consider the matrix σ 2 B(1, ρ). In Section 5.2.1, we obtained robust estimators σ b2 and ρb given in expression (5.2.9). Substituting these estimators for σ 2 and ρ into 2 σ B(1, ρ), we have a robust estimator of σ 2 B(1, ρ) given by σ b2 B(1, ρb). Expression (5.2.3) ∗ ∗ gives a simple moment estimator of ρϕ . The parameters ρ∗S , δ11 , δ12 , δ11 , δ12 , γ11 , and γ12 bR can be estimated in the same way. Substituting these estimators into the matrix CR , let C denote the resulting estimator. b R . Then the tth For t = 1, . . . , n, let b ctt denote the tth diagonal entry of the matrix C Studentized marginal residual based on the R fit is p e∗R,t = ebR,t / b b ctt . (5.2.11)

330

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

As in Chapter 3, the traditional benchmarks used with these Studentized residuals are the limits ±2.

5.2.3

Example and Simulation Studies

In this section we present an example of a randomized block design. It consists of only two blocks, so we also summarize simulation studies which confirms the validity of the rank-based analysis. For the examples and the simulation studies, we computed the rank-based analysis using the collection of R functions Rfit described above. By the traditional fit, we mean the maximum likelihood fit based on multivariate normality of the error random vectors. This fit and subsequent analysis was obtained using the R function lme as discussed in Pinheiro and Bates (2000). The rank-based analysis of this section was computed by Rfit, a collection of R functions. Example 5.2.1 (Crab Grass Data). [tbh] Cobb (1998) presented an example of a complete block design concerning the weight of crab grass. Much of our discussion is drawn from Kloke at al. (2009). There are four fixed factors in the experiment: the density of the crabgrass at four levels, the nitrogen content of the crabgrass at two levels, the phosphorus content of the crabgrass at two levels, and the potassium content of the crabgrass at two levels. Two complete blocks of the experiment were carried out, so altogether there are n = 64 observations. Here block is a random factor and we assume the simple mixed model, (5.2.1), of this section. Under each set of experimental conditions, crab grass was grown in a cup. The response is the dry weight of a unit (cup) of crab grass, in milligrams. The data are presented in Table A.0.1 of Appendix B. We consider the rank-based analysis of this section based on Wilcoxon scores. For the main effects model, Table 5.2.1 displays the estimated effects (contrasts) and standard errors for the Wilcoxon and traditional analyses. For the nutrients, these effects are the differences between the high and low levels, while for the factor density the three contrasts reference the highest density level. There are major differences between the Wilcoxon and the traditional estimates. For the Wilcoxon estimates, the nutrients nitrogen and phosphorus are significant and the contrast between the low and high levels of density is highly significant. Nitrogen is the only significant effect for the traditional analysis. The Wilcoxon statistic to test the density effects has the value TW,ϕ = 20.55 with p = 0.002; while, the traditional test statistic is Flme = 0.82 with p = 0.490. The robust estimates of the variance components are: σ b2 = 206.33, σ bb2 = 20.28, and ρb = 0.098 An outlier accounts for much of this dramatic difference between the robust and traditional analyses. Originally, one of the responses was mistyped; instead of the correct value 97.25, the response was typed as 972.5. As Cobb (1998) notes, this outlier was more difficult to spot in the original units. Upon replacing the outlier with its correct value, the Wilcoxon and traditional analyses are similar; although, the Wilcoxon analysis is still more precise; see the discussion below on the other outliers in this data set. This is true too of the test for the factor density: TW,ϕ = 23.23 (p = 0.001) and Flme = 6.33 with p = 0.001. The robust estimates of the variance components are: σ b2 = 209.20, σ bb2 = 20.45, and ρb = 0.098 These

5.2. SIMPLE MIXED MODELS

331

Table 5.2.1: Wilcoxon and Traditional Estimates and SEs of Effects for the Crabgrass. Wilcoxon Traditional Contrast Est. SE Est. SE Nit 39.90 4.08 69.76 28.7 Pho 10.95 4.08 −11.52 28.7 Pot −1.60 4.08 28.04 28.7 D34 3.26 5.76 57.74 40.6 D24 7.95 5.76 8.36 40.6 D14 24.05 5.76 31.90 40.6 are essentially unchanged from their values on the original data. If on the original data the experimenter had run the robust fit and compared it with the traditional fit, then the outlier would have been discovered immediately. Figure 5.2.1 contains the Wilcoxon Studentized residual plot and q−q plot for the original data. We have removed the large outlier from the plots, so that we can focus on the remaining data. The “vacant middle” in the residual plot is an indication that interaction may be present. For the hypothesis of interaction between the nutrients, the value of the Wald type test statistic is TW,ϕ = 30.61, with p = 0.000. Hence, the R analysis strongly confirms that interaction is present. On the other hand, the traditional likelihood ratio test statistic for this interaction is 2.92, with p = 0.404. In the presence of interaction, many statisticians would consider interaction contrasts instead of a main effects analysis. Hence, for such statisticians, the robust and traditional analyses would have different practical interpretations.

5.2.4

Simulation Studies of Validity

In this data set, the number of blocks is two. Hence, to answer questions concerning the validity of the Wilcoxon analysis, Kloke et al. (2009) conducted a small simulation study. Table 5.2.2 summarizes the empirical confidences and AREs of this study for two situations, normal errors and contaminated normal errors (20% contamination and the ratio of the contaminated variance to the uncontaminated variance at 25). For each situation, the same randomized block design as in the Crab Grass example was used, with the correlation structure as estimated by the Wilcoxon analysis. The empirical confidences of the asymptotic 95% confidence intervals were recorded. These intervals are of the form Estimate ±1.96×SE, where SE denotes the standard errors of the estimates. The number of simulations was 10,000 for each situation, therefore, the error in the table based on the usual 95% confidence interval for a proportion is 0.004. The empirical confidences for the Wilcoxon are quite good with the target of 0.95 usually within range of error. They were perhaps a little conservative at the the contaminated normal situation. Hence, the Wilcoxon analysis appears to be valid for this design. The intervals based on the traditional fit are slightly liberal. The empirical ARE’s between two estimators displayed in Table 5.2.2 are the ratios of empirical mean squared errors of the two estimators. As the table shows, the traditional fit is more efficient

332

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

8 6 4 2 −2 0

Studentized Wilcoxon residual

Studentized Residual Plot, Outlier Deleted

40

60

80

100

Wilcoxon fit

8 6 4 2 −2 0

Studentized Wilcoxon residual

Normal q−q Plot, Outlier Deleted

−2

−1

0

1

2

Normal quantiles

Figure 5.2.1: Studentized Residual and q−q Plots, Minus Large Outlier. at the normal but the efficiencies are close to the value 0.95 for the independent error case. The Wilcoxon analysis is much more efficient over the contaminated normal situation. Does this rank-based analysis differ from the independent error analysis of Chapter 3? As a tentative answer to this question, Kloke et al. (2009) ran 10,000 simulations using the model for the Crab Grass Example. Wilcoxon scores were used for both analyses. To avoid confusion, call the analysis of Chapter 3, the IR analysis, (I for independent errors) and the analysis of this section the R analysis. They considered normal error distributions, setting the variance components at the values of the robust estimates. Because the R and IR fits are the same, they considered the differences in their inferences of the six effects listed in Table 5.2.1. For 95% nominal confidence, the average empirical confidences over these six

Table 5.2.2: Validity of Inference (Empirical Confidence Sizes and AREs)

Contrast Nit Pho Pot D34 D24 D14

Wilc. 0.948 0.953 0.948 0.950 0.951 0.952

Norm. Errors Traditional 0.932 0.934 0.927 0.929 0.929 0.930

ARE 0.938 0.941 0.940 0.936 0.943 0.944

Cont. Norm. Errors Wilc. Traditional ARE 0.964 0.933 7.73 0.964 0.930 7.82 0.966 0.934 7.72 0.964 0.931 7.75 0.960 0.931 7.57 0.960 0.929 7.92

5.3. RANK-BASED PROCEDURES BASED ON ARNOLD TRANSFORMATIONS 333 contrasts are 95.32% and 96.12%, respectively for the R and IR procedures. Hence, both procedures appear valid. For a measure of efficiency, they averaged, across the contrasts, the averages of squared lengths of the confidence intervals. The ratio of the R to the IR averages is 0.914; hence for the simulation, the R inference is about 9% more efficient than the IR inference. Similar results for the traditional analyses are reported in Rao et al. (1993).

5.2.5

Simulation Study of Other Score Functions

Besides, the large outlier there are six other potential outliers in the Cobb data. This quantity of outliers suggests the use of score functions which are more preferable than the Wilcoxon score function for very heavy-tailed error structure. To investigate this, we turned to the family of Winsorized Wilcoxon score functions. Recall that this family was discussed for skewed data in Example 2.5.1. Here, though, asymmetry does not appear to be warranted. We selected the score function which is linear over the interval (0.2, 0.8), i.e., 20% Winsorizing on both sides. We denote it by WW2 . For the parameters as in Table 5.2.1, the WW2 estimates and standard errors (in parentheses) are: 39.16 (3.78), 10.13 (3.78), −2.26 (3.78), 2.55 (5.35), 7.68 (5.35), and 23.28 (5.35). The estimate of the scale parameter τ is 14.97 compared to the Wilcoxon estimate which is 15.56. This indicates that an analysis based on the WW2 fit has more precision than one based on the Wilcoxon fit. To investigate this gain in precision, we ran a small simulation study. We used the same model and the same correlation structure as estimated by the Wilcoxon fit. We considered normal and contaminated normal errors, with the percent of contamination at 20% and the relative variance of the contaminated part at 25. For each situation 10,000 simulations were run. The AREs were very similar for all six parameters, so we only report their averages. For the normal situation the average ARE between the WW2 and Wilcoxon estimates was 0.90; hence, the WW2 estimate was 10% less efficient for the normal situation. For the contaminated normal situation, though, this average was 1.21; hence, the WW2 estimate was 20% more efficient than the Wilcoxon estimate for the contaminated normal situation. There are families of scores functions besides the Winsorized Wilcoxon scores. Gastwirth (1966) presents several families of score functions appropriate for classes of distributions with tails heavier than the exponential distribution. For certain cases, he selects a score based on a maxi-min strategy.

5.3

Rank-Based Procedures Based on Arnold Transformations

In this section, we apply a linear transformation to the mixed model, (5.1.1), and then obtain the R fits. We begin with a brief but necessary discussion of the intercept parameter. Write the mixed model in the long form (5.1.2), Y = 1n α + Xβ + e. Suppose the transformation matrix is A. Multiplying both sides of the model by A, the transformed

334

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

model is of the form Y ∗ = X∗ b + e∗ ,

(5.3.1) ′ ′

where v∗ denotes the vector Av and the vector of parameters is b = (α, β ) . While the original model has an intercept parameter, in general, the transformed model does not. As discussed in Exercise 3.16.39 of Chapter 3, the R fit of Model (5.3.1) is actually the R fit of e ∗ b + e∗ , where X e ∗ = (I − H1 )X∗ and H1 is the projection matrix onto the model Y ∗ = X e ∗ is the centered design matrix based on X∗ . the space spanned by 1; i.e, X As proposed in Exercise 3.16.39, to obtain an R fit of Model (5.3.1), we use the following algorithm: (1.) Fit the model e ∗ b + e∗ . Y ∗ = α1 1 + X

(5.3.2)

By fit we mean: obtain the R estimate of b and then estimate the α1 by the median b ∗ denote the R fit. of the residuals. Let Y 1

b ∗ to the right space; i.e., obtain (2.) Project Y 1

b ∗ = HX∗ Y b ∗. Y 1

(5.3.3)

b∗ = (X∗′ X∗ )−1 X∗′ Y b ∗. b

(5.3.4)

b ∗ ; i.e., our estimator is (3.) Solve X∗ b = Y

b∗ is asymptotically normal with the asymptotic represenAs developed in Exercise 3.16.39, b tation given by (3.16.11) and asymptotic variance given by (3.16.12). We use these results in the remainder of this chapter.

5.3.1

R Fit Based on Arnold Transformed Data

As in the previous sections, consider an experiment done over m blocks, (clusters, centers), and let Yk denote the vector of nk observations for the kth block, k = 1, . . . , m. In this section, we consider the simple mixed model of Section 5.2. Using the notation of expression (5.2.1), Yk follows the model Yk = α1nk + Xk β + 1nk bk + ǫk , where bk is a random effect and β denotes the fixed effects of interest. As in Section 5.2, assume that the blocks are independent and bk and ǫk are independent. Let ek = 1nk bk + ǫk . As in expression (5.1.2), the long form of the model is useful, i.e., Y = 1n α + Xβ + e. Because there P is an intercept parameter in the model, we may assume that X is centered. Let n = m k=1 nk denote the total sample size. For this section, in addition we assume that for all blocks Xk has full column rank p. If the variances of the error variables exist, denote them by Var[bk ] = σb2 and Var[ǫkj ] = σǫ2 . In this case, the variance covariance structure for the kth block is compound symmetric which we denote as Var[ek ] = σ 2 Ak (ρ) = σ 2 [(1 − ρ)Ink + ρJnk ], (5.3.5) where σ 2 = σǫ2 + σb2 , and ρ = σb2 /(σb2 + σǫ2 ).

5.3. RANK-BASED PROCEDURES BASED ON ARNOLD TRANSFORMATIONS 335 Arnold Transformation Arnold (Chapters 14 and 15, 1981) discusses a Helmert transformation for these types of models for traditional (least squares) analyses for balanced designs, i.e., all nk ’s are the same. Kloke and McKean (2010) generalized Arnold’s results to unbalanced designs and developed the properties of the R fit for the transformed data. Consider the nk × nk orthogonal matrix  1 ′  √ 1n nk k Γk = (5.3.6) C′k , ′ where the columns of Ck form an orthonormal basis for 1⊥ nk , (Ck 1nk = 0). We call Γk an Arnold transformation of size nk . Now, apply an Arnold’s Transformation of size nk to the response vector for the kth block  ∗  Yk1 ∗ Yk = Γk Yk = ∗ Yk2 √ ∗ ¯ ′k β + e∗k1 , the contrast component is where the mean component is Yk1 = α∗ + b∗k + nk x ∗ Yk2 = X∗k β + e∗k2 , and the other quantities are:

1 ′ 1 Xk nk nk 1 = √ 1′nk ek nk = Ck X k = C′k ek = bk C′k 1nk + C′k ǫk = C′k ǫk .

¯ ′k = x e∗k1 X∗k e∗k2

In particular, note that the contrast component contains, as a linear model, the fixed effects of interest and, moreover, it is free of the random block effect. ¯ = 0. Furthermore, notice that all the information on β is in the contrast component if x This occurs when the experimental design is replicated at least once in each of the blocks and the covariate does not change. Also, all of the information on β is in the mean component if the covariates are constant within a block. More often, however, there is information on β in both of the components. If this is the case, then for balanced designs, one can put both pieces back together and obtain an estimator using all of the information. For unbalanced designs this is not possible. The approach we take is to ignore the information in the mean component and use the contrast component for inference. Let n∗ = n − m. Then the long form of the Arnold transformation (AT) is Y2∗ = C′ Y, where C′ = diag[C′1 , . . . , C′m ]. So we can model Y2∗ as Y2∗ = X∗ β + e∗2 ,

(5.3.7)

where e∗2 = C′ e, and, provided variances exist, Var[e∗2 ] = σ22 In∗ , σ22 = σ 2 (1 − ρ), and X∗ = C′ X.

336

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

LS Fit on Arnold Transformed Data For the traditional least squares procedure, suppose the variance of the errors exist. Under the additional assumption of normality, the transformed errors are independent. The traditional estimator is thus the usual LS estimator b AT LS = Argminky∗ − X∗ βkLS . β 2

∗′ ∗ −1 ∗′ ∗ b i.e., β AT LS = (X X ) X y2 . This is the extension of Arnold’s (1981) solution that was proposed by Kloke and McKean (2010) for the unbalanced case of Model (5.3.7). As usual, estimate the intercept based on the mean of the residuals,

1 ′ b) 1 (y − y n 1 ′ = 1 (In − X(X∗′ X∗ )−1 X∗ C′ )y = y¯. n

α bLS =

As Exercise 5.6.3 shows the joint asymptotic distribution is     2   α bLS σ1 0′ α (5.3.8) , ∼N ˙ p+1 b 0 σ22 (X∗′ X∗ )−1 β β AT LS P 2 2 2 where σ12 = (σ 2 /n2 ) m k=1 [(1 − ρ)nk + nk ] and σ2 = σ (1 − ρ). Notice that if inference is to be on β then we avoid explicit estimation of ρ. To estimate σ22 we may use σ b22 = Pm Pnk ∗2 b e /(n∗ − p) where b b e∗ = y ∗ − x∗′ β. k=1

j=1 kj

kj

kj

kj

R Fit on Arnold Transformed Data

For the R fit of Model (5.3.7), we briefly sketch the development in Kloke and McKean (2010). Assume that we have selected a score function ϕ(u). We define the Arnold’s transformation rank-based (ATR) estimator of β as the regression through the origin rank estimator defined by the steps (5.3.2) - (5.3.4) of the last section; that is, the rank-based estimator is given by ∗ ∗ b β AT R = Argminky2 − X βkϕ .

(5.3.9)

The results of Section 5.1 ensure that the ATR estimates are consistent and asymptotically normal. The reason for doing an Arnold transformation, though, is that the transformed error variables are uncorrelated. While this does not necessarily mean that they are independent, in the literature they are usually treated as if they are. This is called working independence. The asymptotic distributions discussed next are formulated under the working independence. The simulation results reported in Kloke and McKean (2010) support the validity of the asymptotic distributions over normal and contaminated normal error distributions. Recall from the regression through the origin algorithm that the asymptotic distribution b of β AT R depends on the choice of the estimate of the intercept α1 . For the first case, suppose

5.3. RANK-BASED PROCEDURES BASED ON ARNOLD TRANSFORMATIONS 337 ∗ the median of the residuals is used as the estimate of the intercept, (b αAT R = med{ykj2 − b x∗′ kj β AT R }. Then, under working independence, the joint approximate distribution of the regression parameters is    2 2    α bAT R σs τs,e /n 0′ α , ∼N ˙ p+1 (5.3.10) b β 0 V β AT R P where V is given in expression (3.16.12) of Chapter 3, σs2 = 1 + t∗ ρs , t∗ = m k=1 nk (nk − 1), and ρs = cov[sgn(e11 )sgn(e12 )]. For the second case, assume that the score function ϕ(u) is odd about 1/2; ϕ(1 − u) = + −ϕ(u). Let α bAT R denote the signed-rank estimator of the intercept; see expression (3.5.32) of Chapter 3. Then, under working independence, the joint approximate distribution of the rank-based estimator is   2 2  +    α bAT R σs τs,e /n 0′ α , ∼N ˙ p+1 , (5.3.11) b β 0 V β AT R

where V = τ 2 (X∗′ X∗ )−1 . In comparing expressions (5.3.8) and (5.3.11), we see that asymptotic relative efficiency (ARE) between the ATLS and the ATR estimators is the same as that of LS and R estimates in ordinary linear models. In particular when Wilcoxon scores are used and errors have a normal distribution, the ARE between the ATLS and ATR(Wilcoxon) is the usual 0.95. Hence, for this second case, the ATR estimators are efficiently robust. To complete the practical inference, the scale parameters, τ and τs are based on the distribution of e∗2kj and can be estimated as discussed in Chapter 3. From this, an inference is readily formed for the parameters of the model. Validity of the resulting confidence intervals is confirmed in the simulation study of Kloke and McKean (2010). Studentized residuals are also discussed in this article. A matrix expression such as (5.2.10) for the simple mixed model is derived by the authors; however, unlike the situation in Section 5.2.2, some of the necessary correlations are not straightforward to estimate. Kloke and McKean recommend a bootstrap to estimate the standard error of a residual. We use these in the following example. Example and Discussion The following example is drawn from the article of Kloke and McKean (2010). Although simple, the following data set demonstrates some of the nice features of Arnold’s Transformation, particularly for balanced data. Example 5.3.1 (Milliken and Johnson Data). The data in Table 5.3.1 are from an example found on page 260 of Milliken and Johnson (2002). Each row represents a block of length two. There is one covariate and each of the responses were measurements on different treatments. The model for these data is   −0.5 Yk = α12 + ∆ + βxk 12 + ǫk . 0.5

338

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE Table 5.3.1: Data for x y1 23.2 60.4 26.9 59.9 29.4 64.4 22.7 63.5 30.6 80.6 36.9 75.9 17.6 53.7 28.5 66.3

Example 5.3.1. y2 76.0 76.3 77.8 75.6 94.6 96.1 62.3 81.6

Table 5.3.2: ATR and ATLS estimates and standard errors for Example 5.3.1.

α ∆ β

ATR ATLS Est SE Est SE 70.8 3.54 72.8 8.98 −14.45 1.61 −14.45 1.19 1.43 0.65 1.46 0.33

The Arnold’s Transformation for this model is   1 1 1 . Γk = √ 2 1 −1 ∗ ∗ ′ The transformed responses are Yk∗ = Γk Yk = [Yk1 , Yk2 ] , where ∗ Yk1 = α∗ + β ∗ xk + ǫ∗k1 , ∗ Yk2 = ∆∗ + ǫ∗k2 ,

√ √ α∗ = 2α, β ∗ = 2β, and ∆∗ = √12 ∆. We treat the transformed errors ǫ∗k1 for k = 1, . . . , m and ǫ∗k2 for k = 1, . . . , m as iid. Notice that the first component is a simple linear regression model and the second component is a simple location model. For this example, we use signed-rank to estimate both of the intercept terms. The estimates and standard errors of the parameters are given in Table 5.3.2. Kloke and McKean (2010) plotted bootstrap Studentized residuals for the least squares (top) and Wilcoxon fits. These plots show no serious outliers. To demonstrate the robustness of ATR estimates in the example, Kloke and McKean (i) (2010) conducted a small sensitivity analysis. They set the second data point to y12 = y11 + ∆y, where ∆y varied from -30 to 30. Then the parameters ∆(i) are estimated based on the data set with the outlier. The graph below, displays the relative change of the estimate, b −∆ b (i) ∆ b ∆

5.4. GENERAL ESTIMATING EQUATIONS (GEE)

339

0.00 −0.02 −0.06

−0.04

Relative Change ∆

0.02

0.04

as a function of ∆y.

−30

−20

−10

0

10

20

30

∆y

Over this range of ∆y, the relative changes in the ATR estimate is between −0.042 to 0.062. In contrast, as the reader is asked to show in Exercise 5.6.4, the relative change in ATLS over this range is between 0.125 to 0.394. Hence, the relative change in the ATR estimates is small, which indicates the robustness of the ATR estimates. ????????????????Discussion????????

5.4

General Estimating Equations (GEE)

For longitudinal data, Liang and Zeger (1986) presented an elegant, general iterated reweighted least squares (IRLS) fit of a generalized longitudinal model. As we note below, their fit solves a set of general estimating equations (GEE). Their model is more general than Model (5.1.1). Abebe, McKean and Kloke (2010) developed a rank-based fit of this general model which we present in this section. While analogous to Liang and Zeger’s fit, it is robust in response space. Further, the procedure can easily be generalized to be robust in factor space, also. Consider a longitudinal set of observations over m subjects. Let yit denote the tth response for ith subject for t = 1, 2, . . . , ni and P i = 1, 2, . . . , m. Assume that xit is a p × 1 vector of corresponding covariates. Let n = m i=1 ni denote the total sample size. Assume that the marginal distribution of yit is of the exponential class of distributions and is given by f (yit ) = exp{[yit θit − a(θit ) + b(yit )]φ} , (5.4.1) where φ > 0, θit = h(ηit ), ηit = xTit β, and h(·) is a specified function. Thus the mean and variance of yit are given by E(yit ) = a′ (θit ) and Var(yit ) = a′′ (θit )/φ,

(5.4.2)

340

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

where the ′ denotes derivative. In this notation, the link function is h−1 ◦ (a′ )−1 . More assumptions are stated later for the theory. Let Yi = (yi1 , . . . , yini )T and Xi = (xi1 , . . . , xini )T denote the ni × 1 vector of responses and the ni × p matrix of covariates, respectively, for the ith individual. We consider the general case where the components of the vector of responses for the ith subject, Yi, are dependent. Let θ i = (θi1 , θi2 , . . . , θini )T , so that E(Yi ) = a′ (θ i ) = (a′ (θi1 ), . . . , a′ (θini ))T . For a s × 1 vector of unknown correlation parameters α, let Ci = Ci (α) denote a ni × ni correlation matrix. Define the matrix 1/2

1/2

Vi = Ai Ci (α)Ai /φ ,

(5.4.3)

where Ai = diag{a′′ (θi1 ), . . . , a′′ (θini )}. The matrix Vi need not be the covariance matrix of b i be Yi . In any case, we refer to Ci as the working correlation matrix. For estimation, let V an estimate of Vi. This, in general, requires estimation of α and often an initial estimate of ˆ β. In general, we denote the estimator of α by α(β, φ) to reflect its dependence on β and φ. Liang and Zeger (1986) defined their estimate in terms of general estimating equations (GEE). Define the ni × p Hessian matrix, ∂a′ (θ i ) Di = , ∂β

i = 1, . . . , m .

(5.4.4)

b is the solution to the equations Then their GEE estimator β LS m X i=1

b −1 [Yi − a′ (θ i )] = 0 . DTi V i

(5.4.5)

To motivate our estimator, it is convenient to write this in terms of the Euclidean norm. Define the dispersion function, DLS (β) = =

m X

i=1 m X i=1

=

b −1[Yi − a′ (θ i )] [Yi − a′ (θ i )]T V i

b [V i

−1/2

ni m X X i=1 t=1

b Yi − V i

b a (θ i )]T [V i

−1/2 ′

−1/2

−1/2 ′

Yi − Vi

[yit∗ − dit (β)]2 ,

a (θ i )] (5.4.6)

b −1/2 . b −1/2 Yi = (y ∗ , . . . , y ∗ )T , dit (β) = cT a′ (θ i ), and cT is the tth row of V where Yi∗ = V t t i1 ini i i The gradient of DLS (β) is ▽DLS (β) = −

m X i=1

b −1[Yi − a′ (θ)] . DTi V i

(5.4.7)

5.4. GENERAL ESTIMATING EQUATIONS (GEE)

341

Thus the solution to the GEE equations (5.4.5) also can be expressed as b = Argmin DLS (β) . β LS

(5.4.8)

b is a nonlinear least squares (LS) estimator. We refer to it as From this point of view, β LS GEEWL2 estimator. Consider, then, the robust rank-based nonlinear estimators discussed in Section 3.14. For nonnegative weights (see expression (5.4.10) below), we assume for now that the score function is odd about 1/2, i.e., satisfies (2.5.9). In situations where this assumption is unwarranted, we can adjust the weights to accommodate scores appropriate for skewed error distributions; see the discussion in Section 5.4.3. Next consider the general model defined by expressions (5.4.1) and (5.4.2). As in the LS b −1/2 Yi = (y ∗ , . . . , y ∗ )T , git (β) = cT a′ (θ i ), where cT is the tth row development, let Yi∗ = V t t i1 ini i −1/2 ∗ b of Vi , and let Gi = [git ]. The rank-based dispersion function is given by DR (β) =

ni m X X i=1 t=1

ϕ[R(yit∗ − git (β))/(n + 1)][yit∗ − git (β)] .

(5.4.9)

We next write the R estimator as weighted LS estimator. From this representation the asymptotic theory of the R estimator can be derived. Furthermore, it naturally suggests an IRLS algorithm. Let eit (β) = yit∗ − git (β) denote the (i, t)th residual and let m(β) = med(i,t) {eit (β)} denote the median of all the residuals. Then because the scores sum to 0 we have the identity, DR (β) =

ni m X X i=1 t=1

= =

ϕ[R(eit (β))/(n + 1)][eit (β) − m(β)]

ni m X X ϕ[R(eit (β))/(n + 1)] i=1 t=1 ni m X X i=1 t=1

eit (β) − m(β)

[eit (β) − m(β)]2

wit (β)[eit (β) − m(β)]2 ,

(5.4.10)

where wit (β) = ϕ[R(eit (β))/(n + 1)]/[eit (β) − m(β)] is a weight function. As usual, we take wit (β) = 0 if eit (β) − m(β) = 0. Note that by using the median of the residuals in conjunction with property (2.5.9), the weights are positive. To accommodate other score functions besides those that satisfy (2.5.9) quantiles other than the median can be used; see Example 5.4.3 and Sievers and Abebe (2004) for discussion. For the initial estimator of β, we recommend the rank-based estimator of Chapter 3 based b (0) . As estimates of the weights, we on the score function ϕ(u). Denote this estimator by β R  (0)  (0) b b use w bit β R ; i.e., the weight function evaluated at β . Expression (5.4.10) leads to the

342

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

dispersion function ∗ DR

ni m X   (0) i2  (0)  h  X (0) b b b = e (β) − m w bit β β β|β R it R R i=1 t=1

=

ni m X X i=1 t=1

"r

Let

# r    (0)  2  (0)  (0) b b b eit (β) − w m β . (5.4.11) w bit β bit β R R R

  (0) (1) ∗ b b β R = ArgminD β|β R . n (k) o b , k = 1, 2, . . .. This establishes a sequence of IRLS estimates, β R After some algebraic simplification, we obtain the gradient ∗ ▽DR



m  (k) i  h X (k) b −1/2 W c iV b −1/2 Yi − a′ (θ) − m∗ β b b , DTi V β|β R = −2 R i i

(5.4.12)

(5.4.13)

i=1

 (k)   (k)  b 1/2 m β b b 1, 1 denotes a ni × 1 vector all of whose elements are 1, = V where m∗ β R R i c i = diag{wˆi1 , . . . , w and W ˆini } is the diagonal matrix of weights for the ith subject. Hence, (k+1) b satisfies the general estimating equations (GEE) given by, β R

m X i=1

c iV b −1/2 b −1/2 W DTi V i i

i h  ′ ∗ b (k) =0. Yi − a (θ) − m β R

(5.4.14)

We refer to this weighted, general estimation equations estimator as the GEEWR estimator.

5.4.1

Asymptotic Theory

Recall that both the GEEWL2 and GEEWR estimators were defined in terms of the univariate variables yit∗ . These of course are transformations of the original observations by the estimates of the covariance matrix Vi and the weight matrix Wi . For the theory, we need to consider similar transformed variables using the matrices Vi and Wi , where this notation means that Vi and Wi are evaluated at the true parameters. For i = 1, . . . , m and t = 1, . . . , ni , let −1/2

Yi† = Vi

† † Yi = (yi1 , . . . , yin )T i

−1/2

G†i (β) = Vi a′i (θ) = [git† ] e†it = yit† − git† (β).

(5.4.15)

To obtain asymptotic distribution theory for a GEE procedure, assumptions concerning these errors e†it must be made. Regularity conditions for the GEEWL2 estimates are discussed in

5.4. GENERAL ESTIMATING EQUATIONS (GEE)

343

Liang and Zeger (1986). For the GEEWR estimator, assume these conditions and, further that the marginal pdf of e†it is continuous and the variance-covariance matrix given in (5.4.16) is positive definite. Under these conditions, Abebe et al. (2010) derived the asymptotic distribution of the GEEWR estimator. The proof involves a Taylor series expansion, as in Liang and Zeger’s (1994) proof, and the rank-based theory found in Brunner and Denker (1994) for dependent observations. We state the result in the next theorem. √ b (0) Theorem 5.4.1. Assume that the initial estimate satisfies m(β R − β) = Op (1). Then √ b (k) under the above assumptions, for k ≥ 1, m(β R −β) has an asymptotic normal distribution with mean 0 and covariance matrix, ( m )−1 ( m ) X X −1/2 −1/2 −1/2 −1/2 † T T lim m Di V i W i V i Di Di Vi Var(ϕi )Vi Di m→∞

i=1

×

( m X

i=1

−1/2

DTi Vi

−1/2

Wi V i

i=1

Di

)−1

,

(5.4.16)

where ϕ†i denotes the ni × 1 vector (ϕ[R(e†i1 )/(n + 1)], . . . , ϕ[R(e†ini )/(n + 1)])T .

5.4.2

Implementation and a Monte Carlo Study

For practical use of the GEEWR estimate, the asymptotic covariance matrix (5.4.16) requires estimation. This is true even in the case where percentile bootstrap confidence intervals are employed for inference, because appropriate standardized bootstrap estimates are generally used. We present a nonparametric estimator of the covariance structure and an then approximation to it. We compare these in a small simulation study. Nonparametric (NP) Estimator of Covariance b (k) and (for the ith The covariance structure suggests a simple moment estimator. Let β b (k) denote the final estimates of β and Vi , respectively. Then the residuals which subject) V i estimate e†i ≡ (e†i1 , . . . , e†ini )T are given by h i−1/2 b (k) ), i = 1, . . . , m, b (k) b (k) (β b e†i = V Yi − G i i

(5.4.17)

h i−1/2  (k)    (k) (k) (k) (k) Tb ′ b b b b where Gi = Vi . Let R(b e†it ) denote the rank of eb†it and θ it = h xit β a θ

b †i = (ϕ[R(b among {b e†i′ t′ }, t = 1, . . . , ni ; i = 1, . . . , m. Let ϕ e†i1 )/(n + 1)], . . . , ϕ[R(b e†ini )/(n + bi = ϕ b †i 1ni . Then a moment estimator of the covariance matrix (5.4.16) is b †i − ϕ 1)])T . Let S that expression with Var(ϕ†i ) estimated by \† ) = S biS bT , Var(ϕ i i

(5.4.18)

344

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

and, of course, final estimates of Di and Vi. We label this estimator (NP) Although this is a simple nonparametric estimate of the covariance structure, in a simulation study Abebe et al. (2010) showed that this estimate often leads to a very liberal inference. Werner and Brunner (2007) discovered this in a corresponding rank testing problem. Approximation (AP) of the Nonparametric Estimator The form of the weights, though, suggests a simple approximation, which is based on certain ideal conditions. Suppose the model is correct. Assume that the true transformed errors are independent. Then, because the scores have been standardized, asymptotically Var(ϕ†i ) converges to Ini , so replace it with Ini . This is the first part of the approximation. Next consider the weights. The functional for the weights is of the form ϕ[F (e)]/e. Assuming that F (0) = 1/2, a simple application of the Mean Value Theorem gives the approximation ϕ[F (e)]/e = ϕ′ [F (e)]f (e). The expected value of this approximation can be expressed as   ′ −1 Z ∞ Z 1 f [F (u)] −1 ′ 2 du, (5.4.19) τ = ϕ [F (t)]f (t) dt = ϕ(u) − f [F −1 (u)] −∞ 0 where the second integral is derived from the first by integration by parts followed by a substitution. The parameter τ is of course the usual scale parameter for the R estimates in the linear model based on the score function ϕ(u). The second part of the approximation is to replace the weight matrix by (1/ˆ τ )I. We label this estimator of the covariance matrix of (k) b by (AP). β Monte carlo Study

We report the results of a small simulation study in Abebe et al. (2010) which compares the estimators (NP) and (AP). It also provides empirical information on the relative efficiency b (k) the maximum likelihood estimator (mle) under assumed normality. between β The simulated model is a randomized block design with the fixed factor at five levels and the random (block) factor at seven levels. The distribution of the random effect was taken to be normal. Two error distributions were considered: a normal and a contaminated normal with the contamination rate at 20% and ratio of the contaminated standard deviation to the noncontaminated at five. For the normal error model, the intraclass correlation coefficient was set at 0.5. For each distribution, 10,000 simulations were run. We consider the GEEWR estimator based on a working independence covariance structure. We compared it with the maximum likelihood estimator (mle) for a randomized block design. This yields the traditional analysis used in practice. We used the R function lme (Pinheiro et al., 2007) to compute it. Table 5.4.1 records the results of the empirical efficiencies and empirical confidences between the GEEWR estimator and mle estimator for the fixed effect contrasts between level 1 and the other four levels. The empirical confidence coefficients are for nominal 95%

5.4. GENERAL ESTIMATING EQUATIONS (GEE)

345

confidence intervals based on asymptotic distribution of the GEEWR estimator using the nonparametric (NP) estimate of the covariance structure, the approximation (AP) discussed above, and the mle inference. Table 5.4.1: Empirical Efficiencies and Confidence Coefficients Dist. Method Contrast β21 β31 β41 β51 Empirical Efficiency Norm 0.974 0.974 0.972 0.973 CN 2.065 2.102 2.050 2.055 Empirical Conf. Coeff. Norm mle 0.916 0.915 0.914 0.914 NP 0.546 0.551 0.564 0.549 AP 0.951 0.955 0.954 0.951 CN mle 0.919 0.923 0.916 0.915 NP 0.434 0.445 0.438 0.441 AP 0.890 0.803 0.893 0.889 At the normal distribution, the loss in empirical efficiency of the GEEWR estimates over the mle estimates is only about 3%; while for the contaminated normal distribution the gain in efficiency of the GEEWR estimates over the maximum likelihood estimates is about 200%. Hence, for these situations the GEEWR estimator possesses robustness of efficiency. In terms of empirical confidence coefficients, the nonparametric procedure is quite liberal. In contrast, the approximate procedure confidences are quite close to the nominal confidence (95%) for the normal situation and similar to the those of the mle for the contaminated normal situation.

5.4.3

Example

As an example, we selected part of a study by Plaisance et al. (2007) concerning the effect of a single session of high intensity aerobic exercise on inflammatory markers of subjects taken over time. One purpose of the study was to see if these markers differed depending on the fitness level of the subject. Subjects were placed into one of two groups (High Fitness and Moderate Fitness) depending on the level of their peak oxygen uptake. The response we consider here is C-reactive protein (CRP). Elevated CRP levels are a marker of low-grade chronic inflammation and may predict a higher risk for cardiovascular disease (Ridker et al., 2002). The effect of interest is the difference in CRP between the two groups, which we denote by θ. Hence, a one-sided hypothesis of interest is H0 : θ ≥ 0 versus HA : θ < 0.

(5.4.20)

Out of the 21 subjects in the study, three were removed due to noncompliance or incomplete information. Thus, we consider the remaining 18 individuals, 9 in each group.

346

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

CRP level was obtained 24 hours and immediately prior to the acute bout of exercise and subsequently 24, 72, and 120 hours following exercise giving 90 data points in all. The data are displayed in Table A.0.2 of Appendix B. The top left comparison boxplot of Figure 5.4.1 shows the effect based on the raw responses. An estimate of the effect based on the raw data is difference in medians which is −0.54. Note that the responses are skewed with outliers in each group. We took the time of measurement as a covariate. Let yi and xi denote respectively the 5 × 1 vectors of observations and times of measurements for subject i and let ci denote his/her indicator variable for Group, i.e., its components are either 0 (for Moderate Fitness) or 1 (for High Fitness). Then our model is yi = α15 + θci + βxi + ei , i = 1, . . . 18 ,

(5.4.21)

where ei denotes the vector of errors for the ith individual. We present the results for three covariance structures of ei : working independence (WI), compound symmetry (CS), and autoregressive-one (AR(1)). We fit the GEEWR estimate for each of these covariance structures using Wilcoxon scores. The error model for compound symmetry is the simple mixed model; i.e., ei = bi 1ni + ai , where bi is the random effect for subject i and the components of ai are iid and independent from bi . Let σb2 and σa2 denote the variances of bi and aij , respectively. Let σt2 = σb2 + σa2 denote the total variance and ρ = σb2 /σt2 denote the intraclass coefficient. In this case, the covariance matrix of ei is of the form σt2 [(1 − ρ)I + ρJ]. We estimated these variance component parameters σt2 and ρ at each step of the fit of Model (5.4.21) using the robust estimators discussed in Section 5.2.1 The error model for the AR(1) is eij = ρ1 ei,j−1 + aij , j = 2, . . . ni , where the aij ’s are |s−t| iid, for the ith subject. The (s, t) entry in the covariance matrix of ei is κρ1 , where κ = σa2 /(1 − ρ21 ). To estimate the covariance structure at step k, for each subject, we model this autoregressive model using the current residuals. For each subject, we then estimate ρ1 , using the Wilcoxon regression estimate of Chapter 3. As our estimate of ρ1 , we take the median over subjects of these Wilcoxon regression estimates. Likewise, as our estimate of σa2 we took the median over subjects of MAD2 of the residuals based on the AR(1) fits. Note that there are only 18 experimental units in this problem, nine for each treatment. So it is a small sample problem. Accordingly, we used a bootstrap to standardize the GEEWR estimates. Our bootstrap consisted of resampling the 18 experimenter units, nine from each group. This keeps the covariance structure intact. Then for each bootstrap sample, the GEEWR estimate was computed and recorded. We used 3000 bootstrap samples. With these small samples, the outliers had an effect on the bootstrap, also. Hence, we used the b MAD of the bootstrap estimates of θ as our standard error of θ. Table 5.4.2 summarizes the three GEEWR estimates of θ and β, along with the estimates of the variance components for the CS and AR(1) models. As the comparison boxplot of residuals shows in Figure 5.4.1, the three fits are similar. The WI and AR(1) estimates of the effect θ are quite similar, including their bootstrap standard errors. The CS estimate of θ, though, is more precise and it is closer to the difference (based on the raw data) in medians −0.54. The traditional fit of the simple mixed model (under CS covariance structure), would

5.4. GENERAL ESTIMATING EQUATIONS (GEE) Box Plots: Residuals

2

0

0

1

1

2

CRP

Residuals

3

3

4

4

Group Comparison Box Plots

347

High Fit

Mod. Fit

AR(1)

4 3 0

1

2

CS Residuals

3 2 0

1

CS Residuals

WI

QQ Plot of Residuals for CS Fit

4

Residual Plot of CS Fit

CS

−0.5

−0.4

−0.3

−0.2 CS Fit

−0.1

0.0

−2

−1

0

1

2

Normal Quantiles

Figure 5.4.1: Plots for CRP Data. be the maximum likelihood fit based on normality. We obtained this fit by using the lme function in R. Its estimate of θ is −0.319 with standard error 0.297. For the hypotheses of interest (5.4.20), based on asymptotic normality, the CS GEEWR estimate is marginally significant with p = 0.064, while the mle estimate is insignificant with p = 0.141. Note that the residual and q − q plots of the CS GEEWR fit, bottom plots of Figure 5.4.1, show that the error distribution is right skewed with a heavy right tail. This suggests using scores more appropriate for skewed error distributions than the Wilcoxon scores. We considered a simple score from the class of Winsorized Wilcoxon scores. The Wilcoxon score function is linear. For this data, a suitable Winsorizing score function is the piecewise linear function, which is linear on the interval (0, c) and then constant on the interval (c, 1). As discussed in Example 2.5.1 of Chapter 2, these scores are optimal for a skewed distribution with a logistic left tail and an exponential right tail. We obtained the GEEWR fit of this data using this score function with c = 0.80, i.e, the bend is at 0.80. To insure positive weights, we used the 47th percentile as the location estimator m(β) in the definition of

348

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE Table 5.4.2: Summary of Estimates and Bootstrap Standard Errors (BSE). Wilcoxon Scores b θ BSE βb BSE COV. Cov. Parameters WI −0.291 0.293 −0.0007 0.0007 NA NA CS −0.370 0.244 −.0010 0.0007 σ ˆa2 = 0.013 ρˆ = 0.968 AR(1) −0.303 0.297 −0.0008 0.0015 ρˆ1 = 0.023 σ ˆa2 = 0.032 Winsorized Wilcoxon Scores with Bend at 0.8 CS −0.442 0.282 −0.008 0.0008 σ ˆa2 = 0.017 ρˆ = 0.966

the weights; see the discussion around expression (5.4.10). The computed estimates and their bootstrap standard errors are given in the last row of Table 5.4.2 for the compound symmetry case. The estimate of θ is −0.442 which is closer than the Wilcoxon estimate to the difference in medians based on the raw data. Using the bootstrap standard error, the corresponding z-test for hypotheses (5.4.20) is −1.57 with the p-value of 0.059, which is more significant than the test based on Wilcoxon scores. Computationally, the iterated reweighted GEEWR algorithm remains the same except that the Wilcoxon scores are replaced by these Winsorized Wilcoxon scores. As a final note, the residual plot of the GEEWR fit for the compound symmetric dependence structure also shows some heteroscedasticity. The variability of the residuals is directly proportional to the fitted values. This scalar trend can be modeled robustly using the rank-based procedures discussed in Exercise 3.16.39.

5.5

Time Series

5.6. EXERCISES

5.6

349

Exercises

5.6.1. Assume the simple mixed model (5.2.1). Show that expression (5.2.2) is true. 5.6.2. Obtain the ARE between the R and traditional estimates found in expression (5.2.4), for Wilcoxon scores when the random error vector has a multivariate normal distribution. 5.6.3. Show that the asymptotic distribution of the LS estimator for the Arnold transformed model is given by expression (5.3.8). 5.6.4. Consider Example 5.3.1. (a.) Verify the ATR and ATLS estimates in Table 5.3.2. (b.) Over the range of ∆y used in the example, verify the relative changes in the ATR and ATLS estimates as shown in the example. 5.6.5. Consider the discussion of test statistics around expression (5.1.12). Explore the asymptotic distributions of the drop in dispersion and aligned rank test statistics under the null and contiguous alternatives for the general mixed model. 5.6.6. Continuing with the last exercise, suppose that the simple mixed model (5.2.1) is true. Suppose further that the design is centered within each block; i.e., X′k 1nk = 0p . For example, this is true for an ANOVA design in which all subjects have all treatment combinations such as the Plasma Example of Section 4. (a.) Under this assumption, show that expression (5.2.2) simplifies to Vϕ = τϕ2 (1−ρϕ )(X′ X)−1 ;. (b.) Show that the noncentrality parameter η, (5.1.13), simplifies to η=

1 Mβ ′ [M(X′ X)−1 M′ ]−1 Hβ. τϕ2 (1 − ρϕ )

(c.) Consider as a test statistic the standardized version of the reduction in dispersion, FRD,ϕ =

RDϕ /q . (1 − ρˆϕ )(ˆ τϕ /2) D

Show that under the null hypothesis H0 , qFRD,ϕ → χ2 (q) and that under the sequence D of alternatives HAn , qFRD,ϕ → χ2 (q, η), where the noncentrality parameter η is given in Part (b). (d.) Show that FW,ϕ , (5.1.12), and FRD,ϕ are asymptotically equivalent under the null and local alternative models. (e). Explore the asymptotic distribution of the aligned rank test under the conditions of this exercise. 5.6.7. AR(1) generalized exercise.

350

CHAPTER 5. MODELS WITH DEPENDENT ERROR STRUCTURE

Chapter 6 Multivariate 6.1

Multivariate Location Model

We now consider a statistical model in which we observe vectors of observations. For example, we may record both the SAT verbal and math scores on students. We then wish to investigate the bivariate distribution of scores. We may wish to test the hypothesis that the vector of population locations has changed over time or to estimate the vector of locations. The framework in which we carry out the statistical inference is the multivariate location model which is similar to the location model of Chapter 1. For simplicity and convenience, we will often discuss the bivariate case. The k-dimensional results will often be obvious changes in notation. Suppose that X1 , . . . , Xn are iid random vectors with XTi = (Xi1 , Xi2 ). In this chapter, T denotes transpose and we reserve prime for differentiation. We assume that X has an absolutely continuous distribution with cdf F (s − θ1 , t − θ2 ) and pdf f (s − θ1 , t − θ2 ). We also assume that the marginal distributions are absolutely continuous. The vector θ = (θ1 , θ2 )T is the location vector. Definition 6.1.1. Distribution models for bivariate data. Let F (s, t) be a prototype cdf, then the underlying model will be a shifted version: H(s, t) = F (s − θ1 , t − θ2 ). The following models will be used throughout this chapter. 1. We say the distribution is symmetric when X and −X have the same distribution or f (s, t) = f (−s, −t). This is sometimes called diagonal symmetry. The vector (0, 0)T is the center of symmetry of F and the location functionals all equal the center of symmetry. Unless stated otherwise, we will assume symmetry throughout this chapter. 2. The distribution has spherical symmetry when ΓX and X have the same distribution where Γ is an orthogonal matrix. The pdf has the form g(kxk) where kxk = (xT x)1/2 is the Euclidean norm of x. The contours of the density are circular. 3. In an elliptical model the pdf has the form |det Σ|−1/2 g(xT Σ−1 x), where det denotes determinant and Σ is a symmetric, positive definite matrix. The contours of the density are ellipses. 351

352

CHAPTER 6. MULTIVARIATE

4. A distribution is directionally symmetric if X/kXk and −X/kXk have the same distribution. Note that elliptical symmtery implies symmetry which is turn implies directional symmetry. In an elliptical model, the contours of the density are elliptical and if Σ is the identity matrix then we have a spherically symmetric distribution. An elliptical distribution can be transformed into a spherical one by a transformation of the form Y = DX where D is a nonsingular matrix. Along with various models, we will encounter various transformations in this chapter. The following definition summarizes the transformations. Definition 6.1.2. Data transformations. (a) Y = ΓX is an orthogonal transformation when the matrix Γ is orthogonal. These transformations include rotations and reflections of the data. (b) Y = AX + b is called an affine transformation when A is a nonsingular matrix and b is any vector of real numbers. (c) When the matrix A in (b) is diagonal, we have a special affine transformation called a scale and location transformation. b (d) Suppose t(X) represents one of the above transformations of the data. Let θ(t(X)) denote the estimator computed from the transformed data. Then we say the estimator is ˆ ˆ equivariant if θ(t(X)) = t(θ(X)). Let V (t(X)) denote a test statistic computed from the transformed data. We say the test statistic is invariant when V (t(X)) = V (X). Recall that Hotelling’s T 2 statistic is given by T 2 = n(X − µ)T S−1 (X − µ), where S is the sample covariance matrix. In Exercise 6.8.1, the reader is asked to show that the vector of sample means is affine equivariant and Hotelling’s T 2 test statistic is affine invariant. As in the earlier chapters, we begin with a criterion function or with a set of estimating equations. To fix the ideas, suppose that we wish to estimate θ or test the hypothesis H0 : θ = 0 and we are given a pair of estimating equations:   S1 (θ) =0; (6.1.1) S(θ) = S2 (θ) see Example 6.1.1 for three criterion functions. We now list the usual set of assumptions that we have been using throughout the book. These assumptions guarantee that the estimating equations are Pitman regular in the sense of Definition 1.5.3 so that we can define the estimate and test and develop the necessary asymptotic distribution theory. It will often be convenient to suppose that the true value of θ is 0 which we can do without loss of generality. Definition 6.1.3. Pitman Regularity conditions.

6.1. MULTIVARIATE LOCATION MODEL

353

(a) The components of S(θ) should be nonincreasing functions of θ1 and θ2 . (b) E0 (S(0)) = 0 (c)

D0 √1 S(0) → n

Z ∼ N2 (0, A)

  (d) supkbk≤B √1n S √1n b −

√1 S(0) n

P + Bb → 0 .

The matrix A in (c) is the asymptotic covariance matrix of √1n S(0) and the matrix B in (d) can be computed in various ways, depending on when differentiation and expectation can be interchanged. We list the various computations of B for completeness. Note that ▽ denotes differentiation with respect to the components of θ. B = −E0 ▽

1 S(θ) |θ =0 n

1 = ▽Eθ S(0) |θ =0 n

= E0 [(− ▽ log f (X))ΨT (X)]

(6.1.2)

where ▽ log f (X) denotes the vector of partial derivatives of log f (X) and Ψ(· ) is such that n

1 1 X √ S(θ) = √ Ψ(Xi − θ) + op (1). n n i=1 Brown (1985) proved a multivariate counterpart to Theorem 1.5.6. We state it next and refer the reader to the paper for the proof. Theorem 6.1.1. Suppose conditions (a) - (c) of Definition 6.1.3 hold. Suppose further that B is given by the second expression in ( 6.1.2) and is positive definite. If, for any b,      1 1 1 →0 trace ncov S √ b − S(0) n n n then (d) of Definition 6.1.3 also holds. b The estimate of θ is, of course, the solution of the estimating equations, denoted θ. Conditions (a) and (b) make this reasonable. To test the hypothesis H0 : θ = 0 versus b −1S(0) ≥ χ2 (2), where the upper HA : θ 6= 0, we reject the null hypothesis when n1 ST (0)A α b → A, in α percentile of a chisquare distribution with 2 degrees of freedom. Note that A b will be a simple moment estimator of A. Condition (c) implies probability, and typically A that this is an asymptotically size α test.

354

CHAPTER 6. MULTIVARIATE

With condition (d) we can determine the asymptotic distribution of the estimate and the asymptotic local power of the test; hence, asymptotic efficiencies can be computed. We can determine the quantity that corresponds to the efficacy in the univariate case described in Section 1.5.2 of Chapter 1. We do this next before discussing specific estimating equations. The following proposition follows at once from the assumptions. Theorem 6.1.2. Suppose conditions (a)-(d) in Definition 6.1.3 are satisfied, θ = 0 is the √ b is the solution of true parameter value, and θ n = γ/ n for some fixed vector γ. Further θ the estimating equation. Then 1. 2.

√ b D nθ = B−1 √1n S(0) + op (1) →0 Z ∼ MVN(0, B−1 AB−1) Dθ 1 T S (0)A−1 S(0) →n n

χ2 (2, γ T BA−1 Bγ) ,

,

where χ2 (a, b) is noncentral chisquare with a degrees of freedom and noncentrality parameter b. b → 0 in probaProof: Part 1 follows immediately from condition (d) and letting θ n = θ bility; see Theorem 1.5.7. Part 2 follows by observing (see Theorem 1.5.8) that         1 T 1 1 T 1 −1 2 −1 2 Pθ n S (0)A S(0) ≥ χα (2) = P0 S − √ γ A S − √ γ ≥ χα (2) n n n n and from (d),   1 1 1 D √ S − √ γ = √ S(0) + Bγ + op (1) →0 Z ∼ MVN(Bγ, A). n n n Hence, we have a noncentral chisquare limiting distribution for the quadratic form. Note b is Ω(x) = B−1 Ψ(x) and we say θ b has bounded influence that the influence function of θ provided kΩ(x)k is bounded. Definition 6.1.4. Estimation Efficiency. The efficiency of a bivariate estimator can be measured using the Wilk’s generalized variance defined to be the determinant of the covariance matrix of the estimator: σ12 σ22 (1 − ρ212 ) where ((ρij σi σj )) is the covariance matrix of the bivariate vector of estimates. The estimation efficiency of θb1 relative to θb2 is the square root of the reciprocal ratio of the generalized variances.

This means that the asymptotic covariance matrix given by B−1 AB−1 of the more efficient estimator will be ”small” in the sense of generalized variance. See Bickel (1964) for further discussion of efficiency in the multivariate case. Definition 6.1.5. Test efficiency. When comparing two tests based on S1 and S2 , since the asymptotic local power is an increasing function of the noncentrality parameter, we define the test efficiency as the ratio of the respective noncentrality parameters.

6.1. MULTIVARIATE LOCATION MODEL

355

−1 T In the bivariate case, we have γ T B1 A−1 1 B1 γ divided by γ B2 A2 B2 γ and, unlike the estimation case, the test efficiency may depend on the direction γ along which we approach the origin; see Theorem 6.1.2. Hence, we note that, unlike the univariate case, the testing and estimation efficiencies are not necessarily equal. Bickel (1965) shows that the ratio of noncentrality parameters can be interpreted as the limiting ratio of sample sizes needed for the same asymptotic level and same asymptotic power along the same sequence of alternatives, as in the Pitman efficiency used throughout this book. We can see that BA−1B should be ”large” just as B−1 AB−1 should be ”small”. In the next section we consider how to set up the estimating equations and consider what sort of estimates and tests result. We will be in a position to compute the efficiency of the estimates and tests relative to the traditional least squares estimates and tests. First we list three important criterion functions and their associated estimating equations. Other criterion functions will be introduced in later sections.

Example 6.1.1. Three criterion functions. We now introduce three criterion functions that, in turn, produce estimating equations through differentiation. One of the criterion functions will generate the vector of means, the L2 or least squares estimates. The other two criterion functions will generate different versions of what may be considered L1 estimates or bivariate medians. The two types of medians differ in their equivariance properties. See Small (1990) for an excellent review of multidimensional medians. The vector of means is equivariant under affine transformations of the data; see Exercise 6.8.1. The three criterion functions are: v u n uX D1 (θ) = t [(xi1 − θ1 )2 + (xi2 − θ2 )2 ] (6.1.3) i=1

D2 (θ) = D3 (θ) =

n X

i=1 n X i=1

p (xi1 − θ1 )2 + (xi2 − θ2 )2

(6.1.4)

{|xi1 − θ1 | + |xi2 − θ2 |}

(6.1.5)

In each of these criterion functions we have pushed the square root operation deeper into the expression. As we will see, this produces very different types of estimates. We now take the gradients of these criterion functions and display the corresponding estimating functions. The computation of these gradients is given in Exercise 6.8.2.  P  (x − θ ) i1 1 −1 P (6.1.6) S1 (θ) = [D1 (θ)] (xi2 − θ2 )   n X xi1 − θ1 −1 (6.1.7) kxi − θ i k S2 (θ) = xi2 − θ2 i=1  P  sgn(x − θ ) i1 1 P S3 (θ) = (6.1.8) sgn(xi2 − θ2 )

356

CHAPTER 6. MULTIVARIATE

In ( 6.1.8) if the vector is zero, then we take the term in the summation to be zero also. In Exercise 6.8.3 the reader is asked to verify that S2 (θ) = S3 (θ) in the univariate case; hence, we already see something new in the structure of the bivariate location model over the univariate location model. On the other hand, S1 (θ) and S3 (θ) are componentwise equations unlike S2 (θ) in which the two components are entangled. The solution to ( 6.1.8) is the vector of medians, and the solution to ( 6.1.7) is the spatial median which is discussed in Section 6.3. We will begin with an analysis of componentwise estimating equations and then consider other types. Sections 6.2.3 through 6.4.4 deal with one sample estimates and tests based on vector signs and ranks. Both rotational and affine invariant/equivariant methods are developed. Two and several sample models are treated in Section 6.6 as examples of location models. In Section 6.6 we will be primarily concerned with componentwise methods.

6.2

Componentwise Methods

Note that S1 (θ) and S3 (θ) are of the general form   P ψ(x − θ ) i1 1 S(θ) = P ψ(xi2 − θ2 )

(6.2.1)

where ψ(t) = t or sgn(t) for ( 6.1.6) and ( 6.1.8), respectively. We need to find the matrices A and B in Definition 6.1.3. It is straight forward to verify that, when the true value of θ is 0,   Eψ 2 (X11 ) Eψ(X11 )ψ(X12 ) , (6.2.2) A= Eψ(X11 )ψ(X12 ) Eψ 2 (X22 ) and, from ( 6.1.2), B=



Eψ ′ (X11 ) 0 ′ 0 Eψ (X12 )



.

(6.2.3)

Provided that A is positive definite, the multivariate central limit theorem implies that condition (c) in Definition 6.1.3 is satisfied for the componentwise estimating functions. In the case that ψ(t) = sgn(t), we use the second representation in ( 6.1.2). The estimating functions in (6.2.1) are examples of M-estimating functions; see Maronna, Martin and Yohai (2006). Example 6.2.1. Pulmonary Measurements on Workers Exposed to Cotton Dust. In this example we extend the discussion to k = 3 dimensions. The data consists of n = 12 trivariate (k = 3) observations on workers exposed to cotton dust. The measurements in Table 6.2.1 are changes in measurements of pulmonary functions: FVC (forced vital capacity), FEV3 (forced expiratory volume), and CC (closing capacity); see Merchant et al. (1975).

6.2. COMPONENTWISE METHODS

357

Table 6.2.1: Changes in Pulmonary Function after Six Hours of Exposure to Cotton Dust Subject 1 2 3 4 5 6 7 8 9 10 11 12

FVC FEV3 −.11 −.12 .02 .08 −.02 .03 .07 .19 −.16 −.36 −.42 −.49 −.32 −.48 −.35 −.30 −.10 −.04 .01 −.02 −.10 −.17 −.26 −.30

CC −4.3 4.4 7.5 −.30 −5.8 14.5 −1.9 17.3 2.5 −5.6 2.2 5.5

Let θ T = (θ1 , θ2 , θ3 ) and consider H0 : θ = 0 versus HA : θ 6= 0. First we compute the componentwise sign test. In ( 6.2.1) take ψ(x) = sgn(x), then n−1/2 ST3 = n−1/2 (−6, −6, 2) and the estimate of A = Cov(n−1/2 S3 ) is     P P n sgnx sgnx sgnx sgnx 12 8 −4 i1 i2 i1 i3 P P b = 1  = 1  8 12 0  . sgnx sgnx n sgnx sgnx A i1 i2 i2 i3 P n P 12 −4 0 12 sgnxi1 sgnxi3 sgnxi2 sgnxi3 n P Here the diagonal elements are i sgn2 (Xis ) = n and the off-diagonal elements are values of P b −1 S3 = 3.667, and using the statistics i sgn(Xis )sgn(Xit ). Hence, the test statistic n−1 ST3 A χ2 (3), the approximate p-value is 0.299; see Section 6.2.2. We can also consider the finite sample conditional distribution in which sign changes are generated with a binomial with n = 12 and p = .5; see the discussion in Section 6.2.2. Again note that the signs of all components of the observation vector are either changed or not. b −1 S3 . b remains unchanged so it is simple to generate many values of n−1 ST A The matrix A 3 Out of 2500 values we found 704 greater than or equal to 3.667; hence, the randomization or sign change p-value is approximately 704/2500 = 0.282, quite close to the asymptotic approximation. At any rate, we fail to reject H0 : θ = 0 at any reasonable level. T −1 b X = 14.02 with a p-value of 0.051, based on the F Further, Hotelling’s T 2 = nX Σ distribution for [(n − p)/(n − 1)p]T 2 with 3 and 9 degrees of freedom. Hence, Hotelling’s T 2 is significant at approximately 0.05. Figure 6.2.1 provides boxplots for the data and componentwise normal q−q plots. These boxplots suggest that any differences will be due to the upward shift in the CC distribution. The normal q − q plot of the component CC shows two outlying values on the right side. In the case of the componentwise Wilcoxon test, Section 6.2.3, we consider (n + 1)S4 (0) in ( 6.2.14) along with (n + 1)2 A, essentially in ( 6.2.15). For the pulmonary function data

358

CHAPTER 6. MULTIVARIATE

Figure 6.2.1: Panel A: Boxplots of the changes in pulmonary function for the Cotton Dust Data. Note that the responses have been standardized by componentwise standard deviations; Panel B: normal q−q plot for the component FVC, original scale; Panel C: normal q−q plot for the component FEV3, original scale; Panel D: normal q−q plot for the component CC, original scale. Panel A

Panel B

0.0

• • •

-0.2

Changes in FVC

1 0 -1

• •

• • • -0.4

-2

Standardized responses

2

• • •

FEV_3

FVC

-1.5

-0.5

0.5

1.5

Component

Normal Quantiles

Panel C

Panel D

0.2

CC



• 15

• •

-0.2

10

• •



-1.5

• -5



• -0.5



• • 0

• •

• 5

Changes in CC

0.0

• •

-0.4

Changes in FEV3

• • • •

0.5

Normal Quantiles

1.5

• -1.5



• -0.5

0.5

Normal Quantiles

1.5

6.2. COMPONENTWISE METHODS

359

(n + 1)ST4 (0) = (−63, −52, 28) and 

 649 620.5 −260.5 b = 1  620.5 649.5 −141.5  . (n + 1)2 A n −260.5 −141.5 650

P 2 P 2 The diagonal elements are i R (|Xis |) which should be i i = 650 but differ for the first off-diagonal elements are P two components due to ties among the absolute values. The −1 b −1 S4 (0) = 7.82. R(|X |)R(|X |)sgn (X )sgn (X ). The test statistic is then n ST4 (0)A is it is it i From the χ2 (3) distribution, the approximate p-value is 0.0498. Hence, the Wilcoxon test rejects the null hypothesis at essentially the same level as Hotelling’s T 2 test. In the construction of tests we generally must estimate the matrix A. When testing b If we H0 : θ = 0 the question arises as to whether or not we should center the data using θ. do not center then we are using a reduced model estimate of A; otherwise, it is a full model estimate. Reduced model estimates are generally used in randomization tests. In this case, b must only be computed once in the process of randomizing and recomputing generally, A P b −→ b −1 S. Note also that when H0 : θ = 0 is true, θ the test statistic n−1 ST A 0. Hence, b B−1 AB−1 , we b is valid under H0 . When estimating the asymptotic Cov(θ), the centered A b because we no longer assume that H0 is true. should center A

6.2.1

Estimation

Let θ = (θ1 , θ2 )T denote the true vector of location parameters. Then, when ( 6.1.2) holds, the asymptotic covariance matrix in Theorem 6.1.2 is 

 B−1 AB−1 = 

Eψ2 (X11 −θ1 ) [Eψ′ (X11 −θ1 )]2

Eψ(X11 −θ1 )ψ(X12 −θ2 ) Eψ′ (X11 −θ1 )Eψ′ (X12−θ2 )

Eψ(X11 −θ1 )ψ(X12 −θ2 ) Eψ′ (X11 −θ1 )Eψ′ (X12 −θ2 )

Eψ2 (X12 −θ2 ) [Eψ′ (X12 −θ2 )]2

  

(6.2.4)

Now Theorem 6.1.2 can be applied for various M-estimates to establish asymptotic normality. Our interest is in the comparison of L2 and L1 estimates and we now turn to that discussion. In the case of L2 estimates, corresponding to S1 (θ), we take ψ(t) = t. In this case, θ in expression (6.2.4) is the vector of means. Then it is easy to see that B−1 AB−1 is equal to the covariance matrix of the underlying model, say Σf . In applications, θ is estimated by the vector of component sample means. For the standard errors of these estimates, the vector of componentwise sample means replaces θ in expression (6.2.4) and the expected values are replaced by the corresponding sample moments. Then it is easy to see that the estimate of B−1 AB−1 is equal to the traditional sample covariance matrix. In the first L1 case corresponding to S3 (θ), using ( 6.1.2), we take ψ(t) = sgn(t) and find,

360

CHAPTER 6. MULTIVARIATE

using the second representation in ( 6.1.2), that  1  B−1 AB−1 = 

4f12 (0)

E sgn(X11 −θ1 )sgn(X12 −θ2 ) 4f1 (0)f2 (0)

E sgn(X11 −θ1 )sgn(X12 −θ2 ) 4f1 (0)f2 (0)

1 4f22 (0)



  ,

(6.2.5)

where f1 and f2 denote the marginal pdfs of the joint pdf f (s, t) and θ1 and θ2 denote the componentwise medians. In applications, the estimate of θ is the vector of componentwise sample medians, which we denote by (θb1 , θb2 )′ . For inference an estimate of the asymptotic covariance matrix, (6.2.5) is reqiured. An estimate of Esgn(X11 − θ1 )sgn(X12 − θ2 ) is the P simple moment estimator n−1 sgn(xi1 − θb1 )sgn(xi2 − θb2 ). The estimators discussed in Section 1.5.5, (1.5.28), can be used to estimate the scale parameters 1/2f1 (0) and 1/2f2 (0). We now turn to the efficiency of the vector of sample medians with respect to the vector of sample means. Assume for each component that the median and mean are the same and that without loss of generality their common value is 0. Let δ = det(B−1 AB−1 ) = √ b det(A)/[det(B)]2 be the Wilk’s generalized variance of nθ in Definition 6.1.4. For the 2 2 2 vector of means we have δ = σ1 σ2 (1 − ρ ), the determinant of the underlying variancecovariance matrix. For the vector of sample medians we have 1 − (EsgnX11 sgnX12 )2 δ= 16f12(0)f22 (0) and the efficiency of the vector of medians with respect to the vector of means is given by: s 1 − ρ2 (6.2.6) e(med,mean) = 4σ1 σ2 f1 (0)f2 (0) 1 − [EsgnX11 sgnX12 ]2 Note that EsgnX11 sgnX12 = 4P (X11 < 0, X12 < 0) − 1. When the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ, Exercise 6.8.4 shows that P (X11 < 0, X12 < 0) =

1 1 + . 4 2π sin ρ

Further, the marginal distributions are standard normal; hence,( 6.2.6) becomes s 2 1 − ρ2 e(med, mean) = π 1 − [(2/π) sin−1 ρ]2

(6.2.7)

(6.2.8)

The first factor 2/π ∼ = .637 is the univariate efficiency of the median relative to the mean when the underlying distribution is normal and also the efficiency of the vector of medians relative to the vector of means when the correlation in the underlying model is zero. The second factor accounts for the bivariate structure of the model and, in general, depends on the correlation ρ. Some values of the efficiency are given in Table 6.2.2.

6.2. COMPONENTWISE METHODS

361

Table 6.2.2: Efficiency ( 6.2.8) of the vector of medians relative to the vector of means when the underlying distribution is bivariate normal. ρ eff

0 .64

.1 .2 .3 .4 .63 .63 .62 .60

.5 .6 .7 .8 .58 .56 .52 .47

.9 .99 .40 .22

Clearly, as the elliptical contours of the underlying normal distribution flatten out, the efficiency of the vector of medians decreases. This is the first indication that the vector of medians is not affine (or even rotation) equivariant. The vector of means is affine equivariant and hence the dependency of the efficiency on ρ must be due to the vector of medians. Indeed, Exercise 6.8.5 asks the reader to construct an example showing that when the axes are rotated the vector of means rotates into the new vector of means while the vector of medians fails to do so.

6.2.2

Testing

We now consider the properties of bivariate tests. Recall that we assume the underlying bivariate distribution is symmetric. In addition, we would generally use an odd ψ-function, so that ψ(t) = −ψ(−t). This implies that ψ(t) = ψ(|t|)sgn(t) which will be useful shortly. Now referring to Theorem 6.1.2 along with the corresponding matrix A, the test of 1 T −1 2 H0 : θ = 0 vs HA : θ 6= 0 rejects the null hypothesis when RR n S (0)A S(0) ≥ χα (2). Note that the covariance term in A is Eψ(X11 )ψ(X12 ) = ψ(s)ψ(t)f (s, t) dsdt and it depends upon the underlying bivariate distribution f . Hence, even the sign test based on the componentwise sign statistics S3 (0) is not distribution free under the null hypothesis as it is in the univariate case. In this case, Eψ(X11 )ψ(X12 ) = 4P (X11 < 0, X12 < 0) − 1 as we saw in the discussion of estimation. To make the test operational we must estimate the components of A. Since they are expectations, we use moment estimates, under the null hypothesis. Now condition (c) in Definition 6.1.3 guarantees that the test with the estimated A is asymptotically distribution free since it has a limiting chisquare distribution, independent of the underlying distribution. What can we say about finite samples? First note that   Σψ(|xi1 |)sgn(xi1 ) (6.2.9) S(0) = Σψ(|xi2 |)sgn(xi2 ) Under the assumption of symmetry, (x1 , . . . , xn ) is a realization of (s1 x1 , . . . , sn xn ) where (s1 , . . . , sn ) is a vector of independent random variables each equalling ±1 with probability 1/2, 1/2. Hence Esi = 0 and Es2i = 1. Condition on (x1 , . . . , xn ) then, under the null hypothesis, there are 2n equally likely sign combinations associated with these vectors. Note that the sign changes attach to the entire vector. From ( 6.2.9), we see that conditionally, the scores are not affected by the sign changes and S(0) depends on the sign changes only through the signs of the components of the observation vectors. It follows at once that the

362

CHAPTER 6. MULTIVARIATE

conditional mean of S(0) under the null hypothesis is 0. Further the conditional covariance matrix is given by   P 2 Σψ (|x |) ψ(|x |)ψ(|x |)sgn(x )sgn(x ) i1 i1 i2 i1 i2 P P 2 . (6.2.10) ψ(|xi1 |)ψ(|xi2 |)sgn(xi1 )sgn(xi2 ) ψ (|xi2 |)

Note that conditionally, n−1 times this matrix is an estimate of the matrix A above. Thus we have a conditionally distribution free sign change distribution. For small to moderate n the test statistic (quadratic form) can be computed for each combination of signs and a conditional p-value of the test is the number of values (divided by 2n ) of the test statistic at least as large as the observed value of the test statistic. In the first chapter on univariate methods this argument also leads to unconditionally distribution free tests in the case of the univariate sign and rank tests since in those cases the signs and the ranks do not depend on the values of the conditioning variables. Again, the situation is different in the bivariate case due to the matrix A which must be estimated since it depends on the unknown underlying distribution. In the Exercise 6.8.6 the reader is asked to construct the sign change distributions for some examples.

We now turn to a more detailed analysis of the tests based on S1 = S1 (0) and S3 = S3 (0). Recall that S1 is the vector of sample means. The matrix A is the covariance matrix of the underlying distribution and we take the sample covariance matrix as the natural estimate. T −1 b X which is Hotelling’s T 2 statistic. Note for T 2 , we The resulting test statistic is nX A typically use a centered estimate of A. If we want the randomization distribution then we use the uncentered estimate. Since BA−1 B = Σ−1 f , the covariance matrix of the underlying distribution, the asymptotic noncentrality parameter for Hotelling’s test is γ T Σ−1 f γ. The vector S3 is the vector of component sign statistics. By inverting ( 6.2.5) we can write down the noncentrality parameter for the bivariate componentwise sign test. To illustrate the efficiency of the bivariate sign test relative to Hotelling’s test we simplify the structure as follows: assume that the marginal distributions are identical. Let ξ = 4P (X11 < 0, X12 < 0) − 1 and let ρ denote the underlying correlation, as usual. Then Hotelling’s noncentrality parameter is   1 γ12 − 2ργ1 γ2 + γ22 1 −ρ T γ = γ (6.2.11) −ρ 1 σ 2 (1 − ρ2 ) σ 2 (1 − ρ2 ) Likewise the noncentrality parameter for the bivariate sign test is   4f 2 (0)(γ12 − 2ξγ1γ2 + γ22 ) 4f 2 (0) T 1 −ξ γ = γ −ξ 1 (1 − ξ 2 ) (1 − ξ 2)

(6.2.12)

The efficiency of the bivariate sign test relative to Hotelling’s test is the ratio of the their respective noncentrality parameters: 4f 2 (0)σ 2 (1 − ρ2 )(γ12 − 2ξγ1γ2 + γ22 ) (1 − ξ 2 )(γ12 − 2ργ1 γ2 + γ22 )

(6.2.13)

6.2. COMPONENTWISE METHODS

363

Table 6.2.3: Minimum and maximum efficiencies of the bivariate sign test relative to Hotelling’s T 2 when the underlying distribution is bivariate normal. ρ 0 min .64 max .64

.2 .4 .6 .58 .52 .43 .68 .71 .72

.8 .9 .99 .31 .22 .07 .72 .71 .66

There are three contributing factors in this efficiency: 4f 2 (0)σ 2 which is the univariate efficiency of the sign test relative to the t test, (1 − ρ2 )/(1 − ξ 2 ) due to the dependence structure in the bivariate distribution, and the final factor which reflects the direction of approach of the sequence of alternatives. It is this last factor which separates the testing efficiency from the estimation efficiency. In order to see the effect of direction on the efficiency we will use the following result from matrix theory; see Graybill (1983). Lemma 6.2.1. Suppose D is a nonsingular, square matrix and C is any square matrix and suppose λ1 and λ2 are the minimum and maximum eigen values of CD−1 , then λ1 ≤

γ T Cγ ≤ λ2 γ T Dγ

. The following proposition is left as an Exercise 6.8.7. Theorem 6.2.1. The efficiency e(S3 , S1 ) is bounded between the minimum and maximum of 4f 2 (0)σ 2 (1 − ρ)/(1 − ξ) and 4f 2 (0)σ 2 (1 + ρ)/(1 + ξ). In Table 6.2.3 we give some values of the maximum and minimum efficiencies when the underlying distribution is bivariate normal with means 0, variances 1 and correlation ρ. This table can be compared to Table 6.2.2 which contains the corresponding estimation efficiencies. We have f 2 (0) = (2π)−1 and ξ = (2/π) sin−1 ρ . Hence, the dependence of the efficiency on direction determined by γ is apparent. The examples involving the bivariate normal distribution also show the superiority of the vector of means over the vector of medians and Hotelling’s test over the bivariate sign test as expected. Bickel (1964, 1965) gives a more thorough analysis of the efficiency for general models. He points out that when heavy tailed models are expected then the medians and sign test will be much better provided ρ is not too close to ±1. In the exercises the reader is asked to show that Hotelling’s T 2 statistic is affine invariant. Thus the efficiency properties of this statistic do not depend on ρ. This means that the bivariate sign test cannot be affine invariant; again, this is developed in the exercises. It is now natural to inquire about the properties of the estimate and test based on S2 . This estimating function cannot be written in the componentwise form that we have been considering. Before we turn to this statistic, we consider estimates and tests based on componentwise ranking.

364

6.2.3

CHAPTER 6. MULTIVARIATE

Componentwise Rank Methods

In this part we will sketch the results for the vector of Wilcoxon signed rank statistics discussed in Section 1.7 for each component. See Example 6.2.1 for an illustration of the calculations. In Section 6.6 we provide a full development of componentwise rank-based methods for location and regression models with examples. We let ! P R(|xi1 −θ1 |) sgn(x − θ ) i1 1 (6.2.14) S4 (θ) = P R(|xn+1 i2 −θ2 |) sgn(xi2 − θ2 ) n+1 Using the projection method, Theorem 2.4.6, we have from Exercise 6.8.8, for the case θ = 0,   P +   P F (|x |)sgn(x ) 2 [F (x ) − 1/2] i1 i1 1 i1 1 P + op (1) S4 (0) = P + + op (1) = 2 [F2 (xi2 ) − 1/2] F2 (|xi2 |)sgn(xi2 )

where Fj+ is the marginal distribution of |X1j | for j = 1, 2 and Fj is the marginal distribution of X1j for j = 1, 2; see, also, Section A.2.3 of the Appendix. Symmetry of the marginal distributions is used in the computation of the projections. The conditions (a)-(d) of Definition 6.1.3 can now be verified for the projection and then we note that the vector of rank statistics has the same asymptotic properties. We must identify the matrices A and B for the purposes of constructing the quadratic form test statistic, the asymptotic distribution of the vector of estimates and the noncentrality parameter. The first two conditions, (a) and (b), are easy to check since the multivariate central limit theorem can be applied to the projection. Since under the null hypothesis that θ = 0, F (Xi1 ) has a uniform distribution on (0, 1), and introducing θ and differentiating with respect to θ1 and θ2 , the matrices A and B are   R 2   1 1 2 f1 (t)dt R 0 δ 3 and B = (6.2.15) A= 0 2 f22 (t)dt δ 31 n RR where δ = 4 F1 (s)F2 (t)dF (s, t) − 1. Hence, similar to the vector of sign statistics, the vector of Wilcoxon signed rank statistics also has a covariance which depends on the underlying bivariate distribution. We could construct a conditionally distribution free test but not an unconditionally distribution free one. Of course, the test is asymptotically distribution free. A consistent estimate of the parameter δ in A is given by n

1X Rit Rjt δb = sgnXit sgnXjt , n t=1 (n + 1)(n + 1)

(6.2.16)

where Rit is the rank of |Xit | in the tth component among |X1t |, . . . , |Xnt |. This estimate is the conditional covariance and can be used in estimating A in the construction of an asymptotically distribution free test; when we estimate the asymptotic covariance matrix of b we first center the data and then compute ( 6.2.16). θ

6.2. COMPONENTWISE METHODS

365

Table 6.2.4: Efficiencies of componentwise Wilcoxon methods relative to L2 methods when the underlying distribution is bivariate normal. ρ min max est

0 .96 .96 .96

.2 .94 .96 .96

.4 .93 .97 .95

.6 .91 .97 .94

.8 .89 .96 .93

.9 .88 .96 .92

.99 .87 .96 .91

The estimator that solves S4 (θ) = 0 is the vector of Hodges-Lehmann estimates for the two components; that is, the vector of medians of Walsh averages for each component. Like the vector of medians, the vector of HL estimates is not equivariant under orthogonal transformations and the test is not invariant under these transformations. This will show up in the efficiency with respect to the L2 methods which are an equivariant estimate and an invariant test. Theorem 6.1.2 provides the asymptotic distribution of the estimator and the asymptotic local power of the test. Suppose the underlying distribution is bivariate normal with means 0, variances 1, and correlation ρ, then the estimation and testing efficiencies are given by r 3 1 − ρ2 e(HL, mean) = π 1 − 9δ 2 3 (1 − ρ2 ) γ12 − 6δγ1 γ2 + γ22 } { e(Wilcoxon, Hotelling) = π (1 − 9δ 2 ) γ12 − 2ργ1 γ2 + γ22

(6.2.17) (6.2.18)

Exercise 6.8.9 asks the reader to apply Lemma 6.2.1 and show the testing efficiency is bounded between 3(1 + ρ) 3(1 − ρ) and (6.2.19) ρ 3 −1 2π[2 − π cos ( 2 )] 2π[2 − π3 cos−1 ( 2ρ )] In Table 6.2.4 we provide some values of the minimum and maximum efficiencies as well as estimation efficiency. Note how much more stable the rank methods are than the sign methods. Bickel (1964) points out, however, that when there is heavy contamination and ρ is close to ±1 the estimation efficiency can be arbitrarily close to 0. Further, this efficiency can be arbitrarily large. This behavior is due to the fact that the sign and rank methods are not invariant and equivariant under orthogonal transformations, unlike the L2 methods. Hence, we now turn to an analysis of the methods generated by S2 (θ). Additional material on the componentwise methods can be found in the papers of Bickel (1964, 1965) and the monograph by Puri and Sen (1971). The extension of the results to dimensions higher than two is straightforward and the formulas are obvious. One interesting question is how the efficiencies of the sign or rank methods relative to the L2 methods depend on the dimension. See Section 6.6 and Davis and McKean (1993) for component wise linear model rank-based methods.

366

6.3 6.3.1

CHAPTER 6. MULTIVARIATE

Spatial Methods Spatial sign Methods

We are now ready to consider the estimate and test generated by S2 (θ); recall ( 6.1.4) and ( 6.1.7). This estimating function cannot be written P in componentwise fashion because kxi −θk appears in both components. Note that S2 (θ) = kxi −θk−1 (xi −θ), a sum of unit vectors, so that the estimating function depends on the data only through the directions and not on the magnitudes of xi − θ, i = 1, . . . , n. The vector kxk−1 x is also called the spatial sign of x. It generalizes the notion of univariate sign: sgn(x) = |x|−1 x. Hence, the test is sometimes called the angle test or spatial sign test and the estimate is called the spatial median; see Brown (1983). Milasevic and Ducharme (1987) show that the spatial median is always unique, unlike the univariate median. We will see that the test is invariant under orthogonal transformations and the estimate is equivariant under these transformations. Hence, the methods are rotation invariant and equivariant, properties suitable for methods used on spatial data. However, applications do not have to be confined to spatial data and we will consider these methods competitors to the other methods already discussed. Following our pattern above, we first consider the matrices A and B in Definition 6.1.3. Suppose θ = 0, then since S2 (0) is a sum of independent random variables, condition (c) is immediate with A = EkXk−2 XXT and the obvious estimate of A, under H0 , is n

X b = 1 A kxi k−2 xi xTi , n i=1

(6.3.1)

which can be used to construct the spatial sign test statistic with 1 1 D D b −1 S2 (0) → √ S2 (0) → N2 (0, A) and ST2 (0)A χ2 (2) . n n

(6.3.2)

In order to compute B, we first compute the partial derivatives; then we take the expectation. This yields    1 1 T B=E I− (XX ) , (6.3.3) kXk kXk2 where I is the identity matrix. Use a moment estimate for B similar to the estimate of A. The spatial median is determined by b = Argmin θ

n X i=1

kxi − θk

(6.3.4)

or as the solution to the estimating equations n X xi − θ S2 (θ) = = 0. kx − θk i i=1

(6.3.5)

6.3. SPATIAL METHODS

367

The R package SpatialNP provides routines to compute the spatial median. Gower (1974) calls the estimate the mediancentre and provides a Fortran program for its computation. See Bedall and Zimmerman (1979) for a program in dimensions higher than 2. Further, for higher dimensions see M¨ott¨onen and Oja (1995). We have the asymptotic representation 1 b 1 D √ θ = B−1 √ S2 (0) + op (1) → N2 (0, B−1 AB−1). n n

(6.3.6)

Chaudhuri (1992) provides a sharper analysis for the remainder term in his Theorem 3.2. The consistency of the moment estimates of A and B is established rigorously in the linear ˆ and B ˆ computed model setting by Bai, Chen, Miao, and Rao (1990). Hence, we would use A from the residuals. Bose and Chaudhuri (1993) develop estimates of A and B that converge more quickly than the moment estimates. Bose and Chaudhuri provide a very interesting b than to estimate analysis of why it is easier to estimate the asymptotic covariance matrix of θ the asymptotic variance of the univariate median. Essentially, unlike the univariate case, we do not need to estimate the multivariate density at a point. It is left as an exercise to show that the estimate is equivariant and the test is invariant under orthogonal transformations of the data; see Exercise 6.8.13.

Example 6.3.1. Cork Borings Data We consider a well known example due to Rao (1948) of testing whether the weight of cork borings on trees is independent of the directions: North, South, East and West. In this case we have 4 measurements on each tree and we wish to test the equality of marginal locations: H0 : θN = θS = θE = θW . This is a common hypothesis in repeated measure designs. See Jan and Randles (1996) for an excellent discussion of issues in repeated measures designs. We reduce the data to trivariate vectors via N − E, E − S, S − W . Then we test δ = 0 where δ T = (θN − θS , θS − θE , θE − θW ). Table 6.3.1 displays the original n = 28 four component data vectors. In Table 6.3.2 we display the data differences: N − S, S − E, and E − W along with the unit spatial sign vectors kxk−1 x for each data point. Note that, except for rounding error, the sum of squares in each row is 1 for the spatial sign vectors. We compute the spatial sign statistic to be ST2 = (7.78, −4.99, 6.65) and, from ( 6.3.1),   .2809 −.1321 −.0539 b =  −.1321 .3706 −.0648  . A −.0539 −.0648 .3484 b −1 S2 (0) = 14.74 which yields an asymptotic p-value of .002, using a χ2 Then n−1 ST2 (0)A approximation with 3 degrees of freedom. Hence, we easily reject H0 : δ = 0 and conclude that boring size depends on direction.

368

CHAPTER 6. MULTIVARIATE Table 6.3.1: Weight of Cork Borings (in Centigrams) in Four Directions for 28 Trees N 72 60 56 41 32 30 39 42 37 33 32 63 54 47

E 66 53 57 29 32 35 39 43 40 29 30 45 46 51

S 76 66 64 36 35 34 31 31 31 27 34 74 60 52

W 77 63 58 38 36 26 27 25 25 36 28 63 52 43

N 91 56 79 81 78 46 39 32 60 35 39 50 43 48

E S W 79 100 75 68 47 50 65 70 61 80 68 58 55 67 60 38 37 38 35 34 37 30 30 32 50 67 54 37 48 39 36 39 31 34 37 40 37 39 50 54 57 43

For estimation we return to the original component data. Since we have rejected the null hypothesis of equality of locations, we want to estimate the 4 components of the location bT = vector: θ T = (θ1 , θ2 , θ3 , θ4 ). The spatial median solves S2 (θ) = 0, and we find θ (45.38, 41.54, 43.91, 41.03). For comparison the mean vector is (50.54, 46.18, 49.68, 45.18)T . These computations can be performed using the R package SpatialNP. The issue of how to apply rank methods in repeated measure designs has an extensive literature. In addition to Jan and Randles (1996), Kepner and Robinson (1988) and Akritas and Arnold (1994) discuss the use of rank transforms and pure ranks for testing hypotheses in repeated measure designs. The Friedman test, Exercise 4.8.19, can also be used for repeated measure designs.

Efficiency for Spherical Distributions Expressions for A and B can be simplified and the computation of efficiencies made easier if we transform to polar coordinates. We write     cos φ cos ϕ x=r = rs (6.3.7) sin φ sin ϕ where r = kxk ≥ 0, 0 ≤ φ < 2π, and s = ±1 depending on whether x is above or below the horizontal axis with 0 < ϕ < π. The second representation is similar to 6.2.9 and is useful in the development of the conditional distribution of the test under the null hypothesis. Hence X  cos ϕi  S2 (0) = si (6.3.8) sin ϕi

6.3. SPATIAL METHODS

369

Table 6.3.2: Each row is a data vector for N-S, S-E, E-W along with the components of the spatial sign vector. Row N − E E − S S − W 1 6 -10 -1 2 7 -13 3 3 -1 -7 6 4 12 -7 -2 5 0 -3 -1 6 -5 1 8 7 0 8 4 8 -1 12 6 9 -3 9 6 10 4 2 -9 11 2 -4 6 12 18 -29 11 13 8 -14 8 14 -4 -1 9 15 12 -21 25 16 -12 21 -3 17 14 -5 9 18 1 12 10 19 23 -12 7 20 8 1 -1 21 4 1 -3 22 2 0 -2 23 10 -17 13 24 -2 -11 9 25 3 -3 8 26 16 -3 -3 27 6 -2 -11 28 -6 -3 14

S1 0.51 0.46 -0.11 0.85 0.00 -0.53 0.00 -0.07 -0.27 0.40 0.27 0.50 0.44 -0.40 0.34 -0.49 0.81 0.06 0.86 0.98 0.78 0.71 0.42 -0.14 0.33 0.97 0.47 -0.39

S2 -0.85 -0.86 -0.75 -0.50 -0.95 0.11 0.89 0.89 0.80 0.19 -0.53 -0.80 -0.78 -0.10 -0.60 0.86 -0.29 0.77 -0.44 0.12 0.20 0.00 -0.72 -0.77 -0.33 -0.18 -0.16 -0.19

S3 -0.09 0.20 0.65 -0.14 -0.32 0.84 0.45 0.45 0.53 -0.90 0.80 0.31 0.44 0.91 0.71 -0.12 0.52 0.64 0.26 -0.12 -0.59 -0.71 0.55 0.63 0.88 -0.18 -0.87 0.90

370

CHAPTER 6. MULTIVARIATE

where ϕi is the angle measured counterclockwise between the positive horizontal axis and the line through xi extending indefinitely through the origin and si indicates whether the observation is above or below the axis. Under the null hypothesis θ = 0, si = ±1 with probabilities 1/2, 1/2 and s1 , . . . , sn are independent. Thus, we can condition on ϕ1 , . . . , ϕn to get a conditionally distribution free test. The conditional covariance matrix is  n  X cos2 ϕi cos ϕi sin ϕi (6.3.9) cos ϕi sin ϕi sin2 ϕi i=1

and this is used in the quadratic form with S2 (0) to construct the test statistic; see M¨ott¨onen and Oja (1995, Section 2.1). To consider the asymptotically distribution free version of this test we use the form X  cos φi  (6.3.10) S2 (0) = sin φi

where, recall 0 ≤ φ < 2π, and the multivariate central limit theorem implies that √1n S2 (0) has a limiting bivariate normal distribution with mean 0 and covariance matrix A. We now translate A and its estimate into polar coordinates.    n  1X cos2 φ cos φ sin φ cos2 φi cos φi sin φi b (6.3.11) A=E and A = cos φ sin φ sin2 φ cos φi sin φi sin2 φi n i=1

b −1 S2 (0) ≥ χ2 (2) is an asymptotically size α test. Hence, n1 ST2 (0)A α We next express B in terms of polar coordinates:      sin2 φ − cos φ sin φ cos2 φ cos φ sin φ −1 −1 = Er I− B = Er − cos φ sin φ cos2 φ cos φ sin φ sin2 φ (6.3.12) √ Hence, n times the spatial median is limiting bivariate normal with asymptotic covariance matrix equal to B−1 AB−1 . The corresponding noncentrality parameter of the noncentral chisquare limiting distribution of the test is γ T BA−1Bγ. We are now in a position to evaluate the efficiency of the spatial median and the spatial sign test with respect to the mean vector and Hotelling’s test under various model assumptions. The following result is basic and is derived in Exercise 6.8.10. Theorem 6.3.1. Suppose the underlying distribution is spherically symmetric so that the joint density is of the form f (x) = h(kxk). Let (r, φ) be the polar coordinates. Then r and φ are stochastically independent, the pdf of φ is uniform on (0, 2π] and the pdf of r is g(r) = 2πrf (r), for r > 0. Theorem 6.3.2. If the underlying distribution is spherically symmetric, then the matrices A = (1/2)I and B = [(Er −1 )/2]I. Hence, under the null hypothesis, the test statistic n−1 ST2 (0)A−1 S2 (0) is distribution free over the class of spherically symmetric distributions.

6.3. SPATIAL METHODS

371

Proof. First note that 1 E cos φ sin φ = 2π

Z

cos φ sin φdf = 0 .

Then note that Er −1 cos φ sin φ = Er −1 E cos φ sin φ = 0 . Finally note, E cos2 φ = E sin2 φ = 1/2. We can then compute B−1 AB−1 = [2/(Er −1 )2 ]I and BA−1 B = [(Er −1 )2 /2]I. This implies that the generalized variance of the spatial median and the noncentrality parameter of the angle sign test are given by detB−1 AB−1 = 2/(Er −1 )2 and [(Er −1 )2 /2]γ T γ. Notice that the efficiencies relative to the mean and Hotelling’s test are now equal and independent of the direction. Recall, for the mean vector and T 2 , that A = 2−1 E(r 2 )I, det B−1 AB−1 = 2−1 E(r 2 ), and γ T BA−1Bγ = [2/E(r 2 )]γ T γ. This is because both the spatial L1 methods and the L2 methods are equivariant and invariant with respect to orthogonal (rotations and reflections) transformations. Hence, we see that the efficiency 1 e(spatialL1 , L2 ) = Er 2 {Er −1 }2 . 4

(6.3.13)

If, in addition, we assume the underlying distribution is spherical normal (bivariate norp mal with means 0 and identity covariance matrix) then Er −1 = π/2, Er 2 = 2 and e(spatialL1 , L2 ) = π/4 ≈ .785. Hence, the efficiency of the spatial L1 methods based on S2 (θ) are more efficient relative to the L2 methods at the spherical normal model than the componentwise L1 methods (.637) discussed in Section 6.2.3. In Exercise 6.8.12 the reader is asked to show that the efficiency of the spatial L1 methods relative to the L2 methods with a k-variate spherical model is given by ek (spatial L1 , L2 ) =



k−1 k

2

E(r 2 )[E(r −1 )]2 .

(6.3.14)

√ When the k-variate spherical model is normal, the exercise shows that Er −1 = Γ[(k−1)/2)] with 2Γ(k/2) √ Γ(1/2) = π. Table 6.3.3 gives some values for this efficiency as a function of dimension. Hence, we see that the efficiency increases with dimension. This suggests that the spatial methods are superior to the componentwise L1 methods, at least for spherical models.

Efficiency for Elliptical Distributions We need to consider what happens to the efficiency when the model is elliptical but not spherical. Since the methods that we are considering are equivariant and invariant to rotations, we can eliminate the correlation from the elliptical model with a rotation but then the variances are typically not equal. Hence, we study, without loss of generality, the efficiency

372

CHAPTER 6. MULTIVARIATE

Table 6.3.3: Efficiency as a function of dimension for a k-variate spherical normal model. k e(spatial L1 , L2 )

2 4 6 0.785 0.884 0.920

when the underlying model has unequal variances but covariance 0. Now the L2 methods are affine equivariant and invariant but the spatial L1 methods are not scale equivariant and invariant (hence not affine equivariant and invariant); hence, the efficiency will be a function of the underlying variances. The computations are now more difficult. To fix the ideas suppose the underlying model is bivariate normal with means 0, variances 1 and σ 2 , and covariance 0. If we let X and Z denote iid N(0, 1) random variables, then the model distribution is that of X and Y = σZ. Note that W 2 = Z 2 /X 2 has a standard Cauchy distribution. Now we are ready to determine the matrices A and B. First, by symmetry, we have E cos φ sin φ = E[XY /(X 2 +Y 2 )] = 0 and Er −1 cos φ sin φ = E[XY /(X 2 + Y 2 )3/2 ] = 0; hence, the matrices A and B are diagonal. Next, cos2 φ = X 2 /[X 2 + σ 2 W 2 ] = 1/[1 + σ 2 W 2 ] so we can use the Cauchy density to compute the expectation. Using the method of partial fractions: Z 1 1 1 2 E cos φ = dw = . 2 2 2 (1 + σ w ) π(1 + w ) 1+σ Hence, E sin2 φ = σ/(1 + σ). The next two formulas are given by Brown (1983) and are derivable by several steps of partial integration: r ∞  2 πX (2j)! −1 (1 − σ 2 )j , Er = 2j 2 2 j=0 2 (j!) Er and

−1

1 cos φ = 2 2

r



πX 2 j=0



(2j + 2)!(2j)! 4j+1 2 (j!)2 [(j + 1)!]2

2

Er −1 sin2 φ = Er −1 − Er −1 cos2 φ .

(1 − σ 2 )j ,

Thus A = diag[(1 + σ)−1 , σ(1 + σ)−1 ] and the distribution of the test statistic, even under the normal model depends on σ. The formulas can be used to compute the efficiency of the spatial L1 methods relative to the L2 methods; numerical values are given in Table 6.3.4. The dependency of the efficiency on σ reflects the dependency of the efficiency on the underlying correlation which is present prior to rotation. Hence, just as the componentwise L1 methods have decreasing efficiency as a function of the underlying correlation, the spatial L1 methods have decreasing efficiency as a function of the ratio of underlying variances. It should be emphasized that the spatial methods are most appropriate for spherical models where they have equivariance and invariance properties. The

6.3. SPATIAL METHODS

373

Table 6.3.4: Efficiencies of spatial L1 methods relative to the L2 methods for bivariate normal model with means 0, variances 1 and σ 2 , and 0 correlation, the elliptical case. σ e(spatial L1 , L2 )

1 .8 .6 .4 .2 .05 .01 0.785 0.783 0.773 0.747 0.678 0.593 0.321

componentwise methods, although equivariant and invariant under scale transformations of the components, cannot tolerate changes in correlation. See Mardia (1972) and Fisher (1987, 1993) for further discussion of spatial methods. In higher dimensions, Mardia refers to the angle test as Rayleigh’s test; see Section 9.3.1 of Mardia (1972). M¨ott¨onen and Oja (1995) extend the spatial median and the spatial sign test to higher dimensions. See Table 6.3.6 below for efficiencies relative to Hotelling’s test for higher dimensions and for a multivariate t underlying distribution. Note that for higher dimensions and lower degrees of freedom, the spatial sign test is superior to Hotelling’s T 2 .

6.3.2

Spatial Rank Methods

Spatial Signed Rank Test M¨ott¨onen and Oja (1995) develop the concept of a orthogonally invariant rank vector. Hence, rather than use the univariate concept of rank in the construction of a test, they define a spatial rank vector that has both magnitude and direction. This problem is delicate since there is no inherently natural way to order or rank vectors. We must first review the relationship between sign, rank, and signed rank. Recall the norm, ( 1.3.17) and ( 1.3.21), that was used to generate the Wilcoxon signed rank statistic. Further, recall that the second term in the norm was the basis, in Section 2.2.2, for the Mann-Whitney-Wilcoxon rank sum statistic. We reverse this approach here and show how the one sample signed rank statistic based on ranks of the absolute values can be developed from the ranks of the data. This will provide the motivation for a one sample spatial signed rank statistic. P Let x1 , . . . , xn be a univariate sample. Then 2[Rn (xi )−(n+1)/2] = j sgn(xi −xj ). Thus the centered rank is constructed from the signs of the differences. Now to construct a one sample statistic, we introduce the reflections −x1 , . . . , −xn and consider the centered rank of xi among the 2n combined observations and their reflections. The subscript 2n indicates that the reflections are included in the ranking. X X 2[R2n (xi )−(2n+1)/2] = sgn(xi −xj )+ sgn(xi +xj ) = [2Rn (|xi |)−1]sgn(xi ) ; (6.3.15) j

j

see Exercise 6.8.14. Hence, ranking observations in the combined observations and reflections is essentially equivalent to ranking the absolute values |x1 |, . . . , |xn |. In this way, one sample methods can be developed from two sample methods.

374

CHAPTER 6. MULTIVARIATE

M¨ott¨onen and Oja (1995) use this approach to develop a one sample spatial signed rank statistic. The key is the expression sgn(xi − xj ) + sgn(xi + xj ) which requires only the concept of sign, not rank. Hence, we must find the appropriate extension of sign to two dimensions. In one dimension, sgn(x) = |x|−1 x can be thought of as a unit vector pointing in the positive or negative directions toward x. Likewise u(x) = kxk−1 x is a unit vector in the direction of x. Hence, as in the previous section, we take u(x) P to be the vector spatial sign. The vector centered spatial rank of xi is then R(xi ) = j u(xi − xj ). Thus, the vector spatial signed rank statistic is S5 (0) =

XX i

j

{u(xi − xj ) + u(xi + xj )}

(6.3.16)

This is also the sum of the centered spatial ranks of the observations when ranked in the combined observations and their reflections. Note that −u(xi − xj ) = u(xj − xi ) so that P P u(xi − xj ) = 0 and the statistic can be computed from S5 (0) =

XX i

u(xi + xj ) ,

(6.3.17)

j

which is the direct analog of ( 1.3.24). We now develop a conditional test by conditioning on the data x1 , . . . , xn . From ( 6.3.16) we can write X S5 (0) = r+ (xi ) , (6.3.18) i

P

where r+ (x) = j {u(x − xj ) + u(x + xj )}. Now it is easy to see that r+ (−x) = −r+ (x). Under the null hypothesis of symmetry about 0, we can think of S5 (0) as a realization of P + b r (xi ) where b1 , . . . , bn are iid variables with P (bi = +1) = P (bi = −1) = 1/2. Hence, i i Ebi = 0 and var(bi ) = 1. This means that, conditional on the data, b = Cov d ES5 (0) = 0 and A



 n 1 X + (r (xi ))(r+ (xi ))T . S5 (0) = 3 n3/2 n i=1 1

(6.3.19)

The approximate size α conditional test of H0 : θ = 0 versus HA : θ 6= 0 rejects H0 when 1 T b −1 S A S5 ≥ χ2α (2) , n3 5

(6.3.20)

where χ2α (2) is the upper α percentile from a chisquare distribution with 2 degrees of freedom. Note that the extension to higher dimensions is done in exactly the same way. See Chaudhuri (1992) for rigorous asymptotics.

Example 6.3.2. Cork Borings, Example 6.3.1 continued

6.3. SPATIAL METHODS

375

Table 6.3.5: Each row is a spatial signed rank vector for the data differences in Table 6.3.2. Row 1 2 3 4 5 6 7 8 9 10 11 12 13 14

SR1 0.28 0.28 -0.09 0.58 -0.03 -0.28 0.07 0.01 -0.13 0.23 0.12 0.46 0.30 -0.22

SR2 -0.49 -0.58 -0.39 -0.29 -0.20 0.07 0.43 0.60 0.46 0.13 -0.20 -0.76 -0.56 -0.05

SR3 -0.07 0.12 0.31 -0.11 -0.07 0.43 0.23 0.32 0.34 -0.49 0.33 0.28 0.34 0.49

Row SR1 SR2 SR3 15 0.30 -0.54 0.69 16 -0.40 0.73 -0.07 17 0.60 -0.14 0.39 18 0.10 0.56 0.49 19 0.77 -0.34 0.22 20 0.48 0.10 -0.03 21 0.26 0.08 -0.16 22 0.12 0.00 -0.11 23 0.32 -0.58 0.48 24 -0.14 -0.53 0.42 25 0.19 -0.12 0.45 26 0.73 -0.07 -0.14 27 0.31 -0.12 -0.58 28 -0.30 -0.14 0.67

We use the spatial signed-rank method ( 6.3.20) to test the hypothesis. Table 6.3.5 provides the vector signed-ranks, r+ (xi ) defined in expression ( 6.3.18). Then ST5 (0) = (4.94, −2.90, 5.17), b −1 n3 A



 .1231 −.0655 .0050 .1611 −.0373  , =  −.0655 .0050 −.0373 .1338

b −1 S5 (0) = 11.19 with an approximate p-value of 0.011 based on a χ2 and n−1 ST5 (0)A distribution with 3 degrees of freedom. The Hodges-Lehmann estimate of θ, which solves . bT = (49.30, 45.07, 48.90, 44.59). S5 (θ) = 0, is computed to be θ

Efficiency

The test in ( 6.3.20) can be developed from the point of view of asymptotic theory and the efficiency can be computed. The computations are quite involved. The multivariate t distributiuons provide both a range of tailweights and a range of dimensions. A summary of these efficiencies is found in Table 6.3.6; see M¨ott¨onen, Oja and Tienari (1997) for details. The M¨ott¨onen and Oja (1995) test efficiency increases with the dimension; see especially, the circular normal case. The efficiency begins at .95 and increases! The efficiency also increases with tailweight, as expected. This strongly suggests that the M¨ott¨onen and Oja approach is an excellent way to extend the idea of signed rank from the univariate case. See Example 6.6.2 for a discussion of the two sample spatial rank test.

376

CHAPTER 6. MULTIVARIATE

Table 6.3.6: The row labeled Spatial SR are the asymptotic efficiencies of multivariate spatial signed-rank test, ( 6.3.20), relative to Hotelling’s test under the multivariate t distribution; the efficiencies for the spatial sign test, ( 6.3.2), are given in the rows labeled Spatial Sign.

Dimension 1 2 3 4 6 10

Test Spatial Spatial Spatial Spatial Spatial Spatial Spatial Spatial Spatial Spatial Spatial Spatial

3 1.90 1.62 1.95 2.00 1.98 2.16 2.00 2.25 2.02 2.34 2.05 2.42

SR Sign SR Sign SR Sign SR Sign SR Sign SR Sign

4 1.40 1.13 1.43 1.39 1.45 1.50 1.46 1.56 1.48 1.63 1.49 1.68

Degress of Freedom 6 8 10 15 20 ∞ 1.16 1.09 1.05 1.01 1.00 0.95 0.88 0.80 0.76 0.71 0.70 0.64 1.19 1.11 1.07 1.03 1.01 0.97 1.08 0.98 0.93 0.88 0.85 0.79 1.20 1.12 1.08 1.04 1.02 0.97 1.17 1.06 1.01 0.95 0.92 0.85 1.21 1.13 1.09 1.04 1.025 0.98 1.22 1.11 1.05 0.99 0.96 0.88 1.22 1.14 1.10 1.05 1.03 0.98 1.27 1.15 1.09 1.03 1.00 0.92 1.23 1.14 1.10 1.06 1.04 0.99 1.31 1.19 1.13 1.06 1.03 0.95

Hodges-Lehmann Estimator . The estimator derived from S5 (θ) = 0 is the spatial median of the pairwise averages, a spatial Hodges-Lehmann (1963) estimator. This estimator is studied in great detail by Chaudhuri (1992). His paper contains a thorough review of multidimensional location estimates. He develops a Bahadur representation for the estimate. From his Theorem 3.2, we can immediately conclude that   √ n n 1 b n XX 1 −1 √ θ = B2 u (xi + xj ) + op (1) (6.3.21) n n(n − 1) i=1 j=1 2

where B2 = E{kx∗ k−1 (I − kx∗ k−2 x∗ (x∗ )T )} and x∗ = 21 (x1 + x2 ). Hence, the asymptotic b is determined by that of n−3/2 S5 (0). This leads to distribution of √1n θ 1 b D −1 √ θ → N2 (0, B−1 2 A2 B2 ) , n

(6.3.22)

where A2 = E{u(x1 + x2 )(u(x1 + x2 ))T }. Moment estimates of A2 and B2 can be used. In b defined in expression ( 6.3.19), is a consistent estimate of A2 . Bose fact the estimator A, and Chaudhuri (1993) and Chaudhuri (1993) discuss refinements in the estimation of A2 and B2 . Choi and Marden (1997) extend these spatial rank methods to the two-sample model and the one-way layout. They also consider tests for ordered alternatives; see, also, Oja (2010).

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

6.4 6.4.1

377

Affine Equivariant and Invariant Methods Blumen’s Bivariate Sign Test

It is clear from Tables 6.3.4 and 6.3.6 of efficiencies in the previous section that is desirable to have robust sign and rank methods that are affine invariant and equivariant to compete with LS methods. We begin with yet another representation of the estimating function S2 (θ), ( 6.1.7). Let the ordered ϕ angles be given by 0 ≤ ϕ(1) < ϕ(2) < . . . < ϕ(n) < π and let s(i) = ±1 when the observation corresponding to ϕ(i) is above or below the horizontal axis. Then we can write, as in expression ( 6.3.8), S2 (θ) =

n X i=1

s(i)



cos ϕ(i) sin ϕ(i)



(6.4.1)

Now under the assumption of spherical symmetry, ϕ(i) is distributed as the ith order statistic from the uniform distribution on [0, π) and, hence, Eϕ(i) = πi/(n + 1), i = 1, . . . , n. Recall, in the univariate case, if we believe that the underlying distribution is normal then we could replace the data by the normal scores (expected values of the order statistics from a normal distribution) in a signed rank statistic. The result is the distribution free normal scores test. We will do the same thing here. We replace ϕ(i) by its expected value to construct a scores statistic. Let    X  n n πRi πi X cos n+1 cos n+1 (6.4.2) S6 (θ) = s(i) = si πi πRi sin n+1 sin n+1 i=1

i=1

where R1 , . . . , Rn are the ranks of the unordered angles ϕ1 , . . . , ϕn . Note that s1 , . . . , sn are iid with P (si = 1) = P (si = −1) = 1/2 even if the underlying model is elliptical rather than spherical. Since we now have constant vectors in S6 (θ), it follows that the sign test based on S6 (θ) is distribution free over the class of elliptical models. We look at the test in more detail and consider the efficiency of this sign test relative to Hotelling’s test. First, we have immediately, under the null hypothesis, from the distribution of s1 , . . . , sn that ! P P   cos[πi/(n+1)] sin[πi/(n+1)] cos2 [πi/(n+1)] 1 n P P 2 n cov √ S6 (0) = →A, cos[πi/(n+1)] sin[πi/(n+1)] sin [πi/(n+1)] n n n where A=

R1

cos2 πtdt R1 0 cos πt sin πtdt 0

R1

cos πt sin πtdt 0 R 1 sin2 πtdt 0

!

1 = I, 2

as n → ∞. So reject H0 : θ = 0 if n2 S′6 (0)S6 (0) ≥ χ2α (2) for the asymptotic size α distribution free version of the test where, ( 2 X 2 ) X 2 ′ πi 2 πi s(i) cos S (0)S6 (0) = . (6.4.3) + s(i) sin n 6 n n+1 n+1

378

CHAPTER 6. MULTIVARIATE

This test is not affine invariant. Blumen (1958) created an asymptotically equivariant test that is affine invariant. We can think of Blumen’s statistic as an elliptical scores version of the angle statistic of Brown (1983). In (6.4.3) i/(n + 1) is replaced by (i − 1)/n. Blumen rotated the axes so that ϕ(1) is equal to zero and the data point is on the horizontal axis. Then the remaining scores are uniformly spaced. In this case, π(i − 1)/n is the conditional expectation of ϕ(i) given ϕ(1) = 0. Estimation methods corresponding to Blumen’s test, however, have not yet been developed. To compute the efficiency of Blumen’s test relative to Hotelling’s test we must compute the noncentrality parameter of the limiting chisquare distribution. Hence, we must compute BA−1 B and this leads us to B. Theorem 6.3.2 provides the matrices A and B for the angle sign statistic when the underlying distribution is spherically symmetric. The following theorem shows that the affine invariant sign statistic has the same A and B matrices as in Theorem 6.3.2 and they hold for all elliptical distributions. We discuss the implications after the proof of the proposition. Theorem 6.4.1. If the underlying distribution is elliptical, then corresponding to S6 (0) we have A = 12 I and B = (Er −1 /2)I. Hence, the efficiency of Blumen’s test relative to Hotelling’s test is e(S6 , Hotelling) = E(r 2 )[E(r −1 ]2 /4 which is the same for all elliptical models. Proof. To prove this we show that under a spherical model the angle statistic S2 (0) and scores statistic S6 (0) are asymptotically equivalent. Then S6 (0) will have the same A and B matrices as in Theorem 6.3.2. But since S6 (0) leads to an affine invariant test statistic, it follows that the same A and B continue to apply for elliptical models. Recall that under the spherical model, s(1) , . . . , s(n) are iid with P (si = 1) = P (si = −1) = 1/2 random variables. Then we consider       n n n πi πi 1 X 1 X 1 X cos n+1 cos n+1 − cos ϕ(i) cos ϕi √ s(i) −√ s(i) =√ s(i) πi πi sin n+1 sin ϕi sin n+1 − sin ϕ(i) n n n i=1

i=1

i=1

We treat the two components separately. First     1 X 1 X πi πi √ s(i) cos( − cos ϕ(i) ) ≤ maxi cos − cos ϕ(i) √ s(i) n n+1 n+1 n

The cdf of the uniform distribution on [0, π) is equal to t/π for 0 ≤ t < π. Let Gn (t) be the i πi empirical cdf of the angles ϕi , i = 1, . . . , n. Then G−1 n ( n+1 ) = ϕ(i) and maxi | n+1 − ϕ(i) | ≤ −1 supt |Gn (t) − tπ| = supt |Gn (t) − tπ| → 0 wp1 by the Glivenko-Cantelli Lemma. The result πi now follows by using a linear approximation to cos( n+1 ) − cos ϕ(i) and noting that the cos and sin are bounded. The same argument applies to the second component. Hence, the difference of the two statistics are op (1) and are asymptotically equivalent. The results for the angle statistic now apply to S6 (0) for a spherical model. The affine invariance extends the result to an elliptical model.

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

379

The main implication of this proposition is that the efficiency of the test based on S6 (0) relative to Hotelling’s test is π/4 ≈ .785 for all bivariate normal models, not just the spherical normal model. Recall that the test based on S2 (0), the angle sign test, has efficiency π/4 only for the spherical normal and declining efficiency for elliptical normal models. Hence, we not only gain affine invariance but also have a constant, nondecreasing efficiency. Oja and Nyblom (1989) study a class of sign tests for the bivariate location problem. They show that Blumen’s test is locally most powerful invariant for the entire class of elliptical models. Ducharme and Milasevic (1987) define a normalized spatial median as an estimate of location of a spherical distribution. They construct a confidence region for the modal direction. These methods are resistant to outliers.

6.4.2

Affine Invariant Sign Tests in the Multivariate Case

Affine Invariant Sign Tests Affine invariance is determined in the Blumen test statistic by rearranging the data axes to be uniformly spaced scores. Further, note that the asymptotic covariance matrix A is (1/2)I, where I is the identity. This the covariance matrix for a random vector that is uniformly distributed on the unit circle. The equally spaced scores cannot be constructed in higher dimensions. The approach taken here is due to Randles (2000) in which we seek a linear transformation of the data that makes the data axes roughly equally spaced and the resulting direction vectors will be roughly uniformly distributed on the unit sphere. We choose the transformation so that the sample covariance matrix of the unit vectors of the transformed data is that of a random vector uniformly distributed on the unit sphere. We then compute the spatial sign test (6.3.2) on the transformed data. The result is an affine invariant test. Let x1 , ..., xn be a random sample of size n from a k-variate multivariate symmetric distribution with symmetry center 0. Suppose for the moment that a nonsingular matrix Ux determined by the data, exists and satisfies

n

1X n i=1



Ux xi kUx xi k



Ux xi kUx xi k

T

=

1 I. k

(6.4.4)

Hence, the unit vectors of the transformed data have covariance matrix equal to that of a random vector uniformly distributed on the unit k − sphere. Below we describe a simple and fast way to compute Ux for any dimension k. The test statistic in (6.4.4) computed on the transformed data becomes

where

1 T b −1 k S7 A S7 = ST7 S7 n n n X Ux xi S7 = kUx xi k i=1

(6.4.5)

(6.4.6)

380

CHAPTER 6. MULTIVARIATE

b in (6.3.1) becomes k −1 I because of the definition of Ux in (6.4.4). and A Theorem: Suppose n > k(k − 1) and the underlying distribution is symmetric about 0. Then nk ST7 S7 in (6.4.5) is affine invariant and the limiting distribution, as n → ∞, is chisquare with k degrees of freedom. The following lemma will be helpful in the proof of the theorem. The lemma’s proof depends on a uniqueness result from Tyler (1987). Lemma: Suppose n > k(k − 1) and D is a fixed, nonsingular transformation matrix. Suppose Ux and UDx are defined by (6.4.4). Then a. DT UTDx UDx D = c0 UTx Ux for some positive constant c˙0 that may depend on D and the data and √ b. there exists an orthogonal matrix G such that c0 GUx = UDx D. Proof: Define D∗ = Ux D−1 then  ∗ T  T n  n  1 X U∗ Dxi U Dxi Ux xi 1 X Ux xi 1 = = I. ∗ ∗ n i=1 kU Dxi k kU Dxi k n i=1 kUx xi k kUx xi k k Tyler (1987) showed that the matrix UDx defined from Dx1 , ..., Dxn is unique up to a positive constant. Hence, UDx = aU∗ for some positive constant a. Hence, UTDx UDx = a2 U∗T U∗ = a2 (DT )−1 UTx Ux D−1 and DT UtDx UDx D = a2 UTx Ux which completes the proof of part a with c0 = a2 . −1/2 Define G = c0 UDx DU−1 x where c0 comes from the lemma. Then, using part a, it T follows that G G = I and G is orthogonal. Hence, −1/2

c0 GUx = c0 c0

1/2

UDx DU−1 x Ux = c0 UDx D

and part b follows. Proof of affine invariance: Given D is a fixed, nonsingular matrix, let yi = Dxi for i = 1, ..., n. Then (6.4.6) becomes SD 7

n X UDx Dxi = . kU Dx k Dx i i=1

D T We will show that SDT 7 S7 = S7 S7 and hence does not depend on D. Now, from the lemma, 1/2

c0 GUx x UDx Dx Ux x = =G 1/2 kUDx Dxk kUx xk k c0 GUx x k and SD 7

n X Ux xi =G = GS7 . kU x k x i i=1

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

381

D T Hence, SDT 7 S7 = S7 S7 and the affine invariance follows from the orthogonal invariance of ST7 S7 . Sketch of argument that the asymptotic distribution is chisquared with k degrees of freedom. Tyler (1987) showed that there exists a unique upper triangular matrix U∗ with upper left diagonal element equal to 1 and such that "   ∗ T # U∗ X UX 1 E = I ∗ ∗ kU Xk kU Xk k

√ ∗ and n(Ux − U∗ ) = Op (1). Theorem 6.1.2 implies that (k/n)S∗T 7 S7 is asymptotically ∗ chisquared with k degrees of freedom where U replaces Ux in S7 . But since Ux and U∗ ∗ T are close, (k/n)S∗T 7 S7 − (k/n)S7 S7 = op (1), the asymptotic distribution follows. See the appendix in Randles (2000) for details. We have assumed symmetry of the underlying multivariate distribution. The results continue to hold with the weaker assumption of directional symmetry about 0 in which X/kXk and −X/kXk have the same distribution. In addition to the asymptotic distribution, we can compute or approximate the conditional distribution (given the direction axes of the data) of nk ST7 S7 under the assumption of directional symmetry by listing or sampling the 2n equi-likely values of !T ! n n X Ux xi k X Ux xi δi δi n i=1 kUx xi k kUx xi k i=1 where δi = ±1 for i = 1, ..., n. Hence, it is straight forward to approximate the k − value of the test. Computation of Ux It remains to compute Ux from the data x1 , ..., xn . The following efficient iterative procedure is due to Tyler (1987) who also shows the sequence of iterates converges when n > k(k − 1). We begin with  T n  1 X xi xi V0 = . n i=1 kxi k kxi k

and U0 = Chol (V0−1 ), where Chol (M) is the upper triangular Cholesky decomposition of the positive definite matrix M divided by the upper left diagonal element of the upper triangular matrix. This places a 1 as the first element of the main diagonal and makes Chol (M) unique. If kV0 − k −1 Ik is sufficiently small (a prespecified tolerance) stop and take Ux = U0 . If kV0 − k −1 Ik is large, compute n

1X V1 = n i=1



U0 xi kU0 xi k



U0 xi kU0 xi k

T

.

382

CHAPTER 6. MULTIVARIATE

and compute U1 = Chol (V1−1 ). If kV1 − k −1 Ik is sufficiently small stop and take Ux = U1 U0. If kV1 − k −1 Ik is large compute  T n  1X U1 U0 xi U1 U0 xi V2 = . n i=1 kU1 U0 xi k kU1 U0 xi k

and U2 = Chol (V2−1). If kV2 − k −1 Ik is sufficiently small, stop and take Ux = U2 U1 U0 . If kV2 − k −1 Ik is large compute V3 and U3 and proceed until kVj0 − k −1 Ik is sufficiently small and take Ux = Uj0 Ujo −2 ...U0

Affine Equivariant Median. We now turn to the problem of constructing an affine equivariant estimate of the center of symmetry of the underlying distribution. Our goal is to produce an estimate that is computationally efficient for large samples in any dimension, a problem that plagued some earlier attempts; see Small (1990) for an overview of multivariate medians. The estimate described below was proposed by Hettmansperger and Randles (2002) b is chosen to be the solution of and we refer to it as the HR estimate. The estimator θ n

1 X Ux (xi − θ) =0 n i=1 kUx (xi − θ)k

(6.4.7)

in which Ux is the k × k upper triangular positive definite matrix, with a one in the upper left position on the diagonal, chosen to satisfy  T n  1X Ux (xi − θ) Ux (xi − θ) 1 = I. (6.4.8) n i=1 kUx (xi − θ)k kUx (xi − θ)k k This is a transform-retransform estimate; see, for example, Chakraborty, Chaudhuri and b is computed. Oja (1998). The data are transformed using Ux , and the estimate τb = Ux θ −1 b = U τb . The simultaneous Then the estimate is retransformed back to the original scale θ x solutions of (6.4.7) and (6.4.8) are M-estimates; see Section 6.5.4 for the explicit representation. It follows from this that the estimate is affine equivariant. It is also possible to directly verify the affine equivariance. b involves two routines. The first routine finds the value that The calculation of (Ux , θ) solves (6.4.7) with Ux fixed. This is done by letting yi = Ux xi and finding τb that solves Σ(yi − τ )/ k yi − τ k= 0. Hence, τb is the spatial median of y1 , . . . , yn ; see Section 6.3.1. The b = U−1 τb . The second routine then finds Ux in (6.4.8) as described solution to (6.4.7) is θ x b above for the computation of Ux for a fixed value of θ with xi replaced by xi − θ. b alternates between these two routines until convergence. To The calculation of (Ux , θ) obtain starting values, let θ 0j = xj . Use the second routine to obtain U0j for this value of θ. The starting (θ 0j , U0j ) is the pair that minimizes, for j = 1, ..., n, the inner product " n #T " n # X U0j (xi − θ 0j ) X U0j (xi − θ 0j ) . kU (x − θ )k kU (x − θ )k 0j i 0j 0j i 0j i=1 i=1

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

383

This starting procedure is used, since starting values need to be affine invariant and equivariant. For a fixed Ux there exists a unique solution for θ, and for fixed θ there exists a unique Ux up to multiplicative constant. In simulations and calculations described in Hettmansperger and Randles (2002) the alternating algorithm did not fail to converge. However, the equations b do not fully satisfy all conditions stated in the defining the simultaneous solution (Ux , θ) literature for existence and uniqueness; see Maronna (1976), Tyler (1988), Kent and Tyler (1991). The asymptotic distribution theory developed in Hettmansperger and Randles (2002) b is approximately multivariate normally distributed under the assumption of dishow that θ rectional symmetry and, hence, symmetry. The asymptotic covariance matrix is complicated b and we recommend a bootstrap estimate of the covariance matrix of θ. The approach taken above is more general. If we begin with the orthogonally invariant statistic in (6.3.2) and use a matrix U that satisfies the invariance property in part b of the Lemma then the resulting statistic is affine invariant. For example we could take U to be the inverse of the sample covariance matrix. This results in a test statistic studied by Hossjer and Croux (1995). We prefer the more robust matrix Ux proposed by Tyler (1987). Example 6.4.1. Mathematics and Statistics Exam Scores We now illustrate the one-sample affine invariant spatial sign test (6.4.5) and the affine equivariant spatial median on a small data set. A major advantage of this method is the speed of computation which allows for bootstrap estimates of the covariance matrix and standard errors for the estimator. The data consists of 20 vectors, chosen at random from a larger data set published in Mardia, Kent, and Bibby (1979). Each vector consists of four components and records test scores in Mechanics, Vectors, Analysis, and Statistics. We wish to test the hypothesis that there are no differences among the examination topics. This is a traditional hypothesis in repeated measures designs; see Jan and Randles (1996) for a thorough discussion of this problem. Similar to our findings above on efficiencies, they found that mulitivariate sign and signed rank tests were often superior to least squares in robustness of level and efficiency . Table 6.4.1 provides the original quadrivariate data along with the trivariate data that result when the Statistics score is subtracted from the other three. We suppose that the trivariate data are a sample of size 20 from a symmetric distribution with center θ = (θ1 , θ2 , θ3 )T and we wish to test H0 : θ = 0 versus HA : θ 6= 0. In Table 6.4.1 we have the HR estimates (standard errors) and the tests for the affine spatial methods, Hotelling’s T2 , and Oja’s affine methods described later in Section 6.4.3. The standard errors of the HR estimate are obtained from a boostrap estimate of the covariance matrix. The following estimates are based on 500 bootstrap resamples.   33.88 10.53 21.05 b =  10.53 17.03 12.49  . Cov (θ) 21.05 12.49 32.71

384

CHAPTER 6. MULTIVARIATE

Table 6.4.1: Test Score Data: Mechanics (M), Vectors (V ), Analysis (A), Statistics (S) and differences when Statistics is subtracted from the other three. M 59 52 44 44 30 46 31 42 46 49 17 37 40 35 31 17 49 8 15 0

V 70 64 61 56 69 49 42 60 52 49 53 56 43 36 52 51 50 42 38 40

A 62 63 62 61 52 59 54 49 41 48 43 28 21 48 27 35 23 26 28 9

S M −S V −S A−S 56 3 14 6 54 -2 10 9 46 -2 15 16 36 8 20 25 45 -15 24 7 37 9 12 22 68 -37 -26 -14 33 9 27 16 40 6 12 1 39 10 10 9 51 -34 2 -8 45 -8 11 -17 61 -21 -18 -40 29 6 7 19 40 -9 12 -13 31 -14 20 4 9 40 41 14 40 -32 2 -14 17 -2 21 11 14 -14 26 -5

The standard errors in Table 6.4.1 are the squareroots of the main diagonal of this matrix. The affine sign methods suggest that the major source of statistical significance is the V − S difference. In particular, Vector scores are higher than Statistics scores. A more convenient comparision is achieved by estimating the locations in the four dimensional problem. We find the affine equivariant spatial median for M, V, A, S to be (along with bootstrap standard errors) 36.54 (8.41), 53.04 (5.09), 44.28 (8.39), and 39.65 (7.06). This again reflects the significant differences between Vector scores and Statistics. In fact, it appears the Vector exam was easiest while the other subjects are roughly equivalent. An outlier was created in V by replacing the 70 (first observation) by 0. The results are shown in the lower part of Table 6.4.2. Note, in particular, unlike the robust methods, the p-value for Hotelling’s T 2 test has shifted above 0.05 and, hence, would no longer be considered significant. An affine Invariant Signed Rank Test and Affine Equivariant Estimate The test statistic can be constructed in the same way that the affine invariant sign test was constructed. We will sketch this development below. For a detailed and rigorous development

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

385

Table 6.4.2: Results for the original and contaminated test score data: mean of signed-rank vectors, usual mean vectors, the Hodges-Lehmann estimate of θ; results for the signed-rank test ( 6.4.16) and Hotelling’s T 2 test

Original Data HR Estimate SE HR Mean SE Mean Oja HL-est. Affine Sign Test (6.4.5) Hotelling’s T 2 Oja Signed rank (6.4.16) Contaminated Data HR Estimate SE HR Mean Vector Oja HL-estimate Affine Sign Test (6.4.5) Hotelling’s T 2 Oja Signed rank (6.4.16)

M −S

V −S

A−S

−2.12 5.82 -4.95 4.07 -3.05

13.85 4.13 12.10 3.33 14.06

6.21 5.72 2.40 3.62 4.96

−2.92 5.58 -4.95 -3.90

12.83 8.27 8.60 12.69

Test Statistic

Asymp. p-value

14.19 13.47 14.07

0.0027 0.0037 0.0028

10.76 6.95 10.09

0.0131 .0736 0.0178

6.90 6.60 2.40 4.64

386

CHAPTER 6. MULTIVARIATE

see Oja (2010, Chapter 7) or Oja and Randles (2004). The spatial signed rank statistic is given by S5 in (6.3.19) along with the spatial signed rank covariance matrix, given in this case by n 1X + r (xi )r+ (xi )T . (6.4.9) n i=1 Now suppose we can construct a matrix Vx such that when xi is replaced by Vx xi in (6.4.9) we have P 1

n

1X + 1 1 r (Vx xi )r+ (Vx xi )T = I. + T + k r (Vx xi ) r (Vx xi ) n

(6.4.10)

The divisor in 6.4.10 is the average squared length of the signed rank vectors and is needed normalize (on average) the signed rank vectors. In the simpler sign vector case P to −1 T n [xi xi / k xi k2 ] = 1. The normalized signed rank vectors now have roughly the same covariance structure as vectors uniformly distributed on the unit k − sphere. It is straight forward to develop an iterative routine to compute Vx in the same way we computed Ux for the sign statistic. The signed rank test statistic developed from (6.3.22) is then k T S S8 , n 8

(6.4.11)

P where S8 = r+ (Vx xi ). Again, it can be verified directly that this test statistic is affine invariant. In addition, the p − value of the test can be approximated using the chisquare distribution with k degrees of freedom or by simulation, conditionally using the 2n equally likely values of " n #" n # X k X T + δ r (Vx xi )T δi r+ (Vx xi ) n i=1 i i=1

with δi = ±1. Recall that the Hodges-Lehmann estimate related to the spatial signed rank statistic is the spatial median of the pairwise averages of the data vectors. This estimate is orthogonally equivariant but not affine equivariant. We use the transformation-retransformation method. We transform the data using Vx to get yi = Vx xi i = 1, ..., n and then compute the spatial median of the pairwise averages (yi + yj )/2 which we denote by τb . Then we retransform it b = V−1τb . This estimate is now affine equivariant. Because of the complexity of the back: θ x asymptotic covariance matrix we recommend a bootstrap estimate of the covariance matrix b of θ. Efficiency

Recall Table 6.3.6 which provides efficiency values for either the spatial sign test or the spatial signed rank test relative to Hotelling’s T2 test. The calculations were made for the spherical

6.4. AFFINE EQUIVARIANT AND INVARIANT METHODS

387

t distribution for various degrees of freedom and finally for the spherical normal distribution. Now that we have affine invariant sign and signed rank tests and affine equivariant estimates we can apply these efficiency results to elliptical t and normal distributions. Hence, we again see the superiority of the sign and signed rank methods over Hotelling’s test and the sample mean. The affine invariant tests and affine equivariant estimates are efficient and robust alternatives to the traditional least squares methods. In the case of the affine invariant sign test, Randles (2000) presents a power sensitivity simulation comparing his test to Hotelling’s T 2 test, Blumen’s test, and Oja’s sign test (6.4.14). In addition to the multivariate normal distribution, he included t distributions and a skewed distribution. Randles’ affine invariant sign test performed extremely well. Although Oja’s sign test performed comparably, it is much more computationally intensive than Randles’ test.

6.4.3

The Oja Criterion Function

This method provides a direct approach to affine invariance equivariance and does not require a transform-retransform technique. It is, however, much more computationally intensive. We will only sketch the results in this section and give references where the more detailed derivations can be found. Recall from the univariate location L1 and L2 are P model that m special cases of methods that are derived from minimizing |xi − θ| P , for m = 1 and m = 2. Oja (1983) proposed the bivariate objective function: D8 (θ) = i 0 be given. Then since ϕ(u) is continuous a.e. we can

A.2. SIMPLE LINEAR RANK STATISTICS

429

assume it is continuous at F (y). Hence there exists a δ1 > 0 such that |ϕ(z) − ϕ(F (y))| < ǫ for |z−F (y)| < δ1 . By the uniform continuity of F , choose δ2 > 0 such that |F (t)−F (s)| < δ1 for |s − t| < δ2 . By ( A.2.16) choose N0 so that for n > N0 implies max {|di|} < δ2 .

1≤i≤n

Thus for n > N0 , |F (y) − F (y + di )| < δ1 , for i = 1, . . . , n , and, hence, |ϕ(F (y)) − ϕ (F (y + di ))| < ǫ , for i = 1, . . . , n . Thus for n > N0 ,

max [ϕ(F (y)) − ϕ(F (y + di )]2 < ǫ2 ,

1≤i≤n

Therefore, lim

Z



2

max [ϕ(F (y)) − ϕ(F (y + di ))] f (y) dy

−∞ 1≤i≤n

and we are finished. The next result yields the asymptotic mean of Td .



≤ ǫ2 ,

Theorem A.2.6. Under py and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) ( A.2.17),   1 Epy √ Td → γf σxd . n Proof: By Theorem A.2.3, √1 T n

− γf σxd σx

D

→ N(0, 1) , under qd .

Hence by the transformation Theorem A.2.4, √1 Td n

− γf σxd

σx

D

→ N(0, 1) , under py .

(A.2.36)

By ( A.2.9), √1 T n

D

→ N(0, 1) , under py ;

σx hence by Theorem A.2.5, we must have h i √1 Td − E √1 Td n n σx

D

→ N(0, 1) , under py .

The conclusion follows from the results ( A.2.36) and ( A.2.37).

(A.2.37)

430

APPENDIX A. ASYMPTOTIC RESULTS

By the last two theorems we have under py 1 1 √ Td = √ T + γf σxd + op (1) . n n We need to express these results for the random variables √ S, ( A.2.4), and Sd , ( A.2.32). Because the densities qd√are contiguous to py and (T − S)/ n → 0 in probability under py , it follows that √ (T − S)/ n → 0 in probability under qd . By a change of variable this means (Td − Sd )/ n → 0 in probability under py . This discussion leads to the following two results which we state in a theorem. Theorem A.2.7. Under py and the assumptions ( 3.4.1), ( A.2.1), ( A.2.2), ( A.2.14) ( A.2.17), 1 1 √ Sd = √ S + γf σxd + op (1) n n 1 1 √ Sd = √ T + γf σxd + op (1) . n n

(A.2.38) (A.2.39)

Next we relate the result Theorem A.2.7 to ( 2.5.27), the asymptotic linearity of the general scores statistic in the two sample problem. Recall in the two sample problem that ci = 0 for 1 ≤ i ≤ n1 and ci = 1 for n1 + 1 ≤ i ≤ n1 + n2 = n, ( 2.2.1). Hence,√ xi = ci − c = −n2 /n for 1 ≤ i ≤ n1 and xi = n1 /n for n1 + 1 ≤ i ≤ n. Defining di = −δxi / n, it is easy to check that conditions hold with σxd = −λ1 λ2 δ. Further P ( A.2.14) - ( A.2.17) P ( A.2.32) √ √ becomes Sϕ (δ/ n) = xi a(R(Y xi a(R(Yi )), i − δxi /R n)) and ( A.2.4) becomes Sϕ (0) = R where a(i) = ϕ(i/(n + 1)), ϕ = 0 and ϕ2 = 1. Hence ( A.2.38) becomes √ 1 1 √ Sϕ (δ/ n) = √ Sϕ (0) − λ1 λ2 γf δ + op (1) . n n

√ Finally using the usual partition argument, Theorem 1.5.6, and the monotonicity of Sϕ (δ/ n) we have: Theorem A.2.8. Assuming Finite Fisher information, nondecreasing and square integrable ϕ(u), and ni /n → λi , 0 < λi < 1, i = 1, 2, !   1 1 δ − √ Sϕ (0) + λ1 λ2 γf δ ≥ ǫ → 0 , (A.2.40) Ppx √sup √ Sϕ √ n n n n|δ|≤c

for all ǫ > 0 and for all c > 0.

This theorem establishes ( 2.5.27). As a final note from ( A.2.11), n−1/2 Sϕ (0) is asympP totically N(0, σx2 ), where σx2 = σ 2 (0) = lim n−1 x2i = λ1 λ2 . Hence to determine the efficacy using this approach, we have p λ1 λ2 γ f cϕ = = λ1 λ2 τϕ−1 , (A.2.41) σ(0)

see ( 2.5.28).

A.2. SIMPLE LINEAR RANK STATISTICS

A.2.3

431

Signed-Rank Statistics

In this section we develop the asymptotic local behavior for the general signed rank statistics defined in Section 1.8. Assume that X1 , . . . Xn are a random sample having distribution function H(x) with density h(x) which is symmetric about 0. Recall that general signed rank statistics are given by X a+ (R(|Xi |))sgn(Xi ) , (A.2.42) Tϕ+ =

where the scores are generated as a+ (i) = ϕ+ (i/(n + 1)) R for a nonnegative and square integrable function ϕ+ (u) which is standardized such that (ϕ+ (u))2 du = 1. The null asymptotic distribution of Tϕ+ was derived in Section 1.8 so here we will be concerned with its behavior under local alternatives. Also the derivations here are similar to those for simple linear rank statistics, Section A.2.2; hence, we will be brief. Note that we can write Tϕ+ as   X n + + ϕ Tϕ+ = H (|Xi |) sgn(Xi ) , n+1 n

where Hn+ denotes the empirical distribution function of |X1 |, . . . , |Xn |. This suggests the approximation X ϕ+ (H + (|Xi |))sgn(Xi ) , (A.2.43) Tϕ∗+ =

where H + (x) is the distribution function of |Xi |. Denote the likelihood of the sample X1 , . . . Xn by

px = Πni=1 h(xi ) .

(A.2.44)

A result that we will need is,  P 1 √ Tϕ+ − Tϕ∗+ → 0 , under px . n

(A.2.45)

ˇ This result is shown on page 167 of H´ajek and √ Sid´ak (1967). √ For the sequence of local alternatives, b/ n with b ∈ R, (here we are taking di = −b/ n), we denote the likelihood by   b n qb = Πi=1 h xi − √ . (A.2.46) n For b ∈ R, consider the log of the likelihoods given by, l(η) =

n X i=1

log

h(Xi − η √bn ) h(Xi )

.

(A.2.47)

If we expand l(η) about 0 and evaluate it at η = 1, similar to the expansion ( A.2.19), we obtain n n b X h′ (Xi ) b2 X h(Xi )h′′ (Xi ) − (h′ (Xi ))2 l = −√ + + op (1) , (A.2.48) n i=1 h(Xi ) 2n i=1 h2 (Xi )

432

APPENDIX A. ASYMPTOTIC RESULTS

provided that the third derivative of the log-ratio, evaluated at 0, is square integrable. Under px , the middle term converges in probability to −I(h)b2 /2, provided that the second derivative of the log-ratio, evaluated at 0, is square integrable. An application of Theorem 2 A.1.1 shows that l converges in distribution to a N(− I(h)b , I(h)b2 ). Hence, by LeCam’s first 2 lemma,   (A.2.49) the densities qb = Πni=1 h xi − √bn are contiguous to px = Πni=1 h(xi ) ;

Similar to Section A.2.2, by using √ Theorem A.1.1 we can derive the asymptotic distri∗ bution of the random vector (Tϕ+ / n, l), which we record as: Theorem A.2.9. Under px and some regularity conditions on h,      1 ∗  √ T + 0 1 bγh D n ϕ 2 , , → N2 bγh I(h)b2 l − I(h)b 2

(A.2.50)

where γh = 1/τϕ+ and τϕ+ is given in expression ( 1.8.24). By this last theorem and LeCam’s third lemma, we have 1 D √ Tϕ∗+ → N(bγh , 1) , under qb . (A.2.51) n √ By the result on contiguity, ( A.2.49), the test statistic Tϕ+ / n has the same distribution under qb . A proof of the asymptotic power lemma, Theorem 1.8.1, follows from this result. Next consider a shifted version of Tϕ∗+ given by ∗ Tbϕ +

=

n X i=1

     b b + sgn Xi + √ ϕ H . Xi + √n n +

(A.2.52)

The following identity is readily established:

∗ Pqb [Tϕ∗+ ≤ t] = Ppx [Tbϕ + ≤ t] ;

(A.2.53)

see, also, Theorem 1.3.1. We need the following theorem: Theorem A.2.10. Under px , 

 ∗ ∗ Tϕ∗+ − [Tbϕ + − Epx (Tbϕ+ )] P √ →0. n

√ ∗ n] → 0. But this Proof: As in Theorem A.2.5, it suffices to show that V [(Tϕ∗+ − Tbϕ + )/ variance reduces to      ∗ 2 Z ∞ ∗  Tϕ+ − Tbϕ  + b b + + + + √ sgn x + √ V = ϕ H (|x|) sgn(x) − ϕ H h(x) dx . x + √n n n −∞

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS

433

Since ϕ+ (u) is square integrable, the quantity in braces is dominated by an integrable function. Since it converges pointwise to 0, a.e., an application of the Lebesgue Dominated Convergence Theorem establishes the result. Using the above results, we can proceed as we did for Theorem A.2.6 to show that under px ,   1 ∗ Epx √ Tbϕ+ → bγh . (A.2.54) n

Hence,

1 ∗ 1 ∗ √ Tbϕ + = √ Tϕ+ + bγh + op (1) . n n

(A.2.55)

A similar result holds for the signed-rank statistic. For the results needed in Chapter 1, however, it is convenient to change the notation to: n X

a+ (R|Xi − b|)sgn(Xi − b) .

(A.2.56)

1 1 √ Tϕ+ (θ) = √ Tϕ+ (0) − θγh + op (1) , n n

(A.2.57)

Tϕ+ (b) =

i=1

The above results imply that



n|θ| ≤ B, for B > 0. The general signed-rank statistics found in Chapter 1 are based on norms. In this case, since the scores are nondecreasing, we can strengthen our results to include uniformity; that is,

for

Theorem A.2.11. Assuming Finite Fisher information, nondecreasing and square integrable ϕ+ (u), 1 1 Ppx [√ sup | √ Tϕ+ (θ) − √ Tϕ+ (0) + θγh | ≥ ǫ] → 0 , (A.2.58) n n n|θ|≤B for all ǫ > 0 and all B > 0. A proof can be obtained by the usual partitioning type of argument on the interval R [−B, B]; see the proof of Theorem 1.5.6. Hence, since (ϕ+ (u))2 du = 1, the efficacy is given by cϕ+ = γh ; see ( 1.8.21).

A.3

Results for Rank-Based Analysis of Linear Models

In this section we consider the linear model defined by ( 3.2.3) in Chapter 3. The distribution of the errors satisfies Assumption E.1, ( 3.4.1). The design matrix satisfies conditions D.2, ( 3.4.7), and D.3, ( 3.4.8). We shall assume without loss of generality that the true vector of parameters is 0.

434

APPENDIX A. ASYMPTOTIC RESULTS

It will be easier to work with the√following transformation of the design matrix and parameters. We consider β such that nβ = O(1). Note that we will suppress the notation indicating that β depends on n. Let, ∆ = (X′ X)

1/2

C = X (X′X) di = −c′i ∆ ,

β,

−1/2

(A.3.1) ,

(A.3.2) (A.3.3)

where ci is the ith row of C and note that ∆ = O(1) because n−1 X′ X → Σ > 0 and √ nβ = O(1). Then C′ C = Ip and HC = HX , where HC is the projection matrix onto the column space of C. Note that since X is centered, C is also. Also kcik2 = h2nii where h2nii is the ith diagonal entry of HX . It is straightforward to show that c′i ∆ = x′i β. Using the conditions (D.2) and (D.3), the following conditions are readily established: n X i=1

d = 0 n X 2 di ≤ kci k2 k∆k2 = pk∆k2 , for all n

(A.3.4) (A.3.5)

i=1

max d2i ≤ k∆k2 max kci k2

1≤i≤n

1≤i≤n

(A.3.6)

= k∆k2 max h2nii → 0 as n → ∞ , 1≤i≤n

since k∆k is bounded. For j = 1, . . . , p define Snj (∆) =

n X i=1

cij a(R(Yi − c′i ∆)) ,

(A.3.7)

where the scores are generated by a function ϕ which staisfies (S.1), ( 3.4.10). We now show that the theory established in the Section A.2 for simple linear rank statistics holds for Snj , for each j. √ Fix j, then thePregressionPcoefficients xi of Section A.2 are given by xi = ncij . Note from ( A.3.2) that x2i /n = c2ij = 1; hence, condition ( A.2.2) is true. Further by ( A.3.6), max1≤i≤n x2i Pn 2 = max c2ij → 0 ; 1≤i≤n i=1 xi

hence, condition ( A.2.1) is true. For the sequence di = −c′i ∆, conditions ( A.3.4) - ( A.3.6) imply conditions ( A.2.14) ( A.2.16), (the upper bound in condition ( A.3.6) was actually all that was needed in the proofs of Section A.2). Finally for ( A.2.17), because C is orthogonal, σxd is given by ( n ) p n n X X X 1 X σxd = √ cij cik ∆k = −∆j . (A.3.8) xi di = − cij c′i ∆ = − n i=1 i=1 i=1 k=1

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS

435

Thus by Theorem A.2.7, for j = 1, . . . , p, we have the results, Snj (∆) = Snj (0) − γf ∆j + op (1) Snj (∆) = Tnj (0) − γf ∆j + op (1) , where Tnj (0) =

n X

(A.3.9) (A.3.10)

cij ϕ(F (Yi)) .

(A.3.11)

i=1

Let Sn (∆)′ = (Sn1 (∆), . . . , Snp (∆)). Because component-wise convergence in probability implies that the corresponding vector converges, we have shown that the following theorem is true: Theorem A.3.1. Under the above assumptions, for ǫ > 0 and for all ∆ lim P (kSn (∆) − (Sn (0) − γ∆) k ≥ ǫ) = 0 .

n→∞

(A.3.12)

The conditions we want are asymptotic linearity and quadraticity. Asymptotic linearity is the condition ! lim P

sup kSn (∆) − (Sn (0) − γ∆) k ≥ ǫ k∆k≤c

n→∞

=0,

(A.3.13)

for arbitrary c > 0 and ǫ > 0. This result was first shown by Jureˇckov´a (1971) under more stringent conditions on the design matrix. Consider the dispersion function discussed in Chapter 2. In terms of the above notation Dn (∆) =

n X i=1

a(R(Yi − ci ∆))(Yi − ci ∆) .

(A.3.14)

An approximation of Dn (∆) is the quadratic function Qn (∆) = γ∆′ ∆/2 − ∆′ Sn (0) + Dn (0) .

(A.3.15)

Using Jureˇckov´a’s conditions, Jaeckel (1972) extended the result ( A.3.13) to asymptotic quadraticity which is given by ! lim P

n→∞

sup |Dn (∆) − Qn (∆)| ≥ ǫ

=0,

(A.3.16)

k∆k≤c

for arbitrary c > 0 and ǫ > 0. Our main result of this section shows that ( A.3.12), ( A.3.13), and ( A.3.16) are equivalent. The proof proceeds as in Heiler and Willers (1988) who established their results based on convex function theory. Before proceeding with the proof, for the reader’s convenience, we present some notes on convex functions.

436

A.3.1

APPENDIX A. ASYMPTOTIC RESULTS

Convex Functions

Let f be a real valued function defined on Rp . Recall the definition of a convex function: Definition A.3.1. The function f is convex if f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) ,

(A.3.17)

for 0 < λ < 1. Further, a convex function f is called proper if it is defined on an open set C ∈ Rp and is everywhere finite on C. The convex functions of interest in this appendix are proper with C = Rp . The proof of the following theorem can be found in Rockafellar (1970); see pages 82 and 246. Theorem A.3.2. Suppose f is convex and proper on an open subset C of Rp . Then f is continuous on C and is differentiable almost everywhere on C. We will find it useful to define a subgradient: Definition A.3.2. The vector D(x0 ) is called a subgradient of f at x0 if f (x) − f (x0 ) ≥ D(x0 )′ (x − x0 ) , for all x ∈ C .

(A.3.18)

As shown on page 217 of Rockafellar (1970), a proper convex function which is defined on an open set C has a subgradient at each point in C. Furthermore, at the points of differentiability, the subgradient is unique and it agrees with the gradient. This is a theorem proved on page 242 of Rockafellar which we next state. Theorem A.3.3. Let f be convex. If f is differentiable at x0 then ▽f (x0 ), the gradient of f at x0 , is the unique subgradient of f at x0 . Hence combining Theorems A.3.2 and A.3.3, we see that for proper convex functions the subgradient is the gradient almost everywhere; hence if f is a proper convex function we have, f (x) − f (x0 ) ≥ ▽f (x0 )′ (x − x0 ) , a.e. x ∈ C . (A.3.19) The next theorem can be found on page 90 of Rockafellar (1970). Theorem A.3.4. Let the sequence of convex functions {fn } be proper on C and suppose the sequence converges for all x ∈ C∗ where C∗ is dense in C. Then the functions fn converge on the whole set C to a proper and convex function f and, furthermore, the convergence is uniform on each compact subset of C. The following theorem is a modification by Heiler and Willers (1988) of a theorem found on page 248 of Rockafellar (1970).

A.3. RESULTS FOR RANK-BASED ANALYSIS OF LINEAR MODELS

437

Theorem A.3.5. Suppose in addition to the assumptions of the last theorem the limit function f is differentiable, then lim ▽fn (x) = ▽f (x) , for all x ∈ C .

n→∞

(A.3.20)

Furthermore the convergence is uniform on each compact subset of C. The following result is proved in Heiler and Willers (1988). Theorem A.3.6. Suppose the hypotheses of Theorem A.3.4 hold. Assume, also, that the limit function f is differentiable. Then lim ▽fn (x) = ▽f (x) , for all x ∈ C∗

(A.3.21)

lim fn (x0 ) = f (x0 ) , for at least one x0 ∈ C∗

(A.3.22)

n→∞

and n→∞

where C∗ is dense in C, imply that lim fn (x) = f (x) , for all x ∈ C

n→∞

(A.3.23)

and the convergence is uniform on each compact subset of C.

A.3.2

Asymptotic Linearity and Quadraticity

We now proceed with Heiler and Willers (1988) proof of the equivalence of ( A.3.12), ( A.3.13), and ( A.3.16). Theorem A.3.7. Under Model ( 3.2.3) and the assumptions ( 3.4.7), ( 3.4.8), and ( 3.4.1), the expressions ( A.3.12), ( A.3.13), and ( A.3.16) are equivalent. Proof: ( A.3.12) ⇒ ( A.3.16). Both functions Dn (∆) and Qn (∆) are proper convex functions for ∆ ∈ Rp . Their gradients are given by, ▽Qn (∆) = γ∆ − Sn (0) ▽Dn (∆) = −Sn (∆) , a.e. ∆ ∈ Rp .

(A.3.24) (A.3.25)

By Theorem A.3.2 the gradient of D exists almost everwhere. Where the derivative of Dn (∆) is not defined, we will use the subgradient of Dn (∆), ( A.3.2), which, in the case of proper convex functions, exists everwhere and which agrees uniquely with the gradient at points where D(∆) is differentiable; see Theorem A.3.3 and the surrounding discussion. Combining these results we have, ▽(Dn (∆) − Qn (∆)) = −[Sn (∆) − Sn (0) + γ∆]

(A.3.26)

Let N denote the set of positive integers. Let ∆(1), ∆(2), . . . be a listing of the vectors in p-space with rational components. By (A.3.12) the right side of (A.3.26) goes to 0 in probability for ∆(1). Hence, for every infinite index set N∗ ⊂ N there exists another infinite index set N1∗∗ ⊂ N∗ such that

[Sn(∆(1)) − Sn(0) + γ∆(1)] → 0 a.s.,   (A.3.27)

for n ∈ N1∗∗. Since the right side of (A.3.26) goes to 0 in probability for ∆(2) and N1∗∗ is an infinite index set, there exists another infinite index set N2∗∗ ⊂ N1∗∗ such that

[Sn(∆(i)) − Sn(0) + γ∆(i)] → 0 a.s.,   (A.3.28)

for n ∈ N2∗∗ and i ≤ 2. We continue and, hence, get a sequence of nested infinite index sets N1∗∗ ⊃ N2∗∗ ⊃ · · · ⊃ Ni∗∗ ⊃ · · · such that

[Sn(∆(j)) − Sn(0) + γ∆(j)] → 0 a.s.,   (A.3.29)

for n ∈ Ni∗∗ ⊃ Ni+1∗∗ ⊃ · · · and j ≤ i. Let Ñ be a diagonal infinite index set of the sequence N1∗∗ ⊃ N2∗∗ ⊃ · · · ⊃ Ni∗∗ ⊃ · · ·. Then

[Sn(∆) − Sn(0) + γ∆] → 0 a.s.,   (A.3.30)

for n ∈ Ñ and for all rational ∆.

Define the convex function Hn(∆) = Dn(∆) − Dn(0) + ∆′Sn(0). Then

Dn(∆) − Qn(∆) = Hn(∆) − γ∆′∆/2 ,   (A.3.31)
▽(Dn(∆) − Qn(∆)) = ▽Hn(∆) − γ∆ .   (A.3.32)

Hence by (A.3.30) we have

▽Hn(∆) → γ∆ = ▽(γ∆′∆/2) a.s.,   (A.3.33)

for n ∈ Ñ and for all rational ∆. Also note

Hn(0) = 0 = γ∆′∆/2 |∆=0 .   (A.3.34)

Since Hn is convex and (A.3.33) and (A.3.34) hold, we have by Theorem A.3.6 that {Hn(∆)}n∈Ñ converges to γ∆′∆/2 a.s., uniformly on each compact subset of Rp. That is, by (A.3.31), Dn(∆) − Qn(∆) → 0 a.s., uniformly on each compact subset of Rp. Since N∗ is arbitrary, we can conclude (see Theorem 4, page 103 of Tucker, 1967) that Dn(∆) − Qn(∆) → 0 in probability, uniformly on each compact subset of Rp.

(A.3.16) ⇒ (A.3.13). Let c > 0 be given and let C = {∆ : ‖∆‖ ≤ c}. By (A.3.16) we know that Dn(∆) − Qn(∆) → 0 in probability on C. Using the same diagonal argument as above, for any infinite index set N∗ ⊂ N there exists an infinite index set Ñ ⊂ N∗ such that Dn(∆) − Qn(∆) → 0 a.s. for n ∈ Ñ and for all rational ∆. As in the last part, introduce the function Hn as

Dn(∆) − Qn(∆) = Hn(∆) − γ∆′∆/2 .   (A.3.35)

Hence,

Hn(∆) → γ∆′∆/2 a.s.,   (A.3.36)

for n ∈ Ñ and for all rational ∆. By (A.3.36) and the fact that the function γ∆′∆/2 is differentiable, we have by Theorem A.3.5,

▽Hn(∆) → γ∆ a.s.,   (A.3.37)

for n ∈ Ñ and uniformly on C. This leads to the following string of convergences,

▽(Dn(∆) − Qn(∆)) → 0 a.s.
Sn(∆) − (Sn(0) − γ∆) → 0 a.s.,   (A.3.38)

where both convergences are for n ∈ Ñ and uniformly on C. Since N∗ was arbitrary we can conclude that

Sn(∆) − (Sn(0) − γ∆) → 0 in probability,   (A.3.39)

uniformly on C. Hence (A.3.13) holds.

(A.3.13) ⇒ (A.3.12). This is trivial.

These are the results we wanted. For convenience we summarize asymptotic linearity and asymptotic quadraticity in the following theorem:

Theorem A.3.8. Under Model (3.2.3) and the assumptions (3.4.7), (3.4.8), and (3.4.1),

lim_{n→∞} P( sup_{‖∆‖≤c} ‖Sn(∆) − (Sn(0) − γ∆)‖ ≥ ǫ ) = 0 ,   (A.3.40)
lim_{n→∞} P( sup_{‖∆‖≤c} |Dn(∆) − Qn(∆)| ≥ ǫ ) = 0 ,   (A.3.41)

for all ǫ > 0 and all c > 0.

Proof: This follows from Theorems A.3.1 and A.3.7.
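To illustrate Theorem A.3.8 numerically, the following Monte Carlo sketch (our own construction, not part of the original text) uses Wilcoxon scores, a single centered regressor scaled so that Σ c²i = 1, and standard normal errors, for which γ = √12 ∫f² = √12/(2√π):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    c = rng.normal(size=n)
    c -= c.mean()
    c /= np.sqrt(np.sum(c ** 2))               # centered design, sum c_i^2 = 1
    y = rng.normal(size=n)                     # true Delta = 0, errors N(0,1) (assumed)

    def S(delta):
        # Wilcoxon gradient process S_n(Delta) = sum c_i a(R(Y_i - c_i Delta))
        z = y - delta * c
        r = np.argsort(np.argsort(z)) + 1      # ranks of the residuals
        return np.sum(c * np.sqrt(12.0) * (r / (n + 1) - 0.5))

    gamma = np.sqrt(12.0) / (2.0 * np.sqrt(np.pi))   # sqrt(12) * int f^2 for N(0,1)
    for d in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(d, S(d), S(0.0) - gamma * d)     # columns agree up to Monte Carlo error

For each ∆ in the grid the two printed columns nearly agree, which is the content of the linearity statement (A.3.40).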

A.3.3 Asymptotic Distance Between β̂ and β̃

This section contains a proof of Theorem 3.5.5. It shows that the R-estimate in Chapter 3 is close to the value which minimizes the quadratic approximation to the dispersion function. The proof is due to Jaeckel (1972). For convenience, we have restated the theorem.


Theorem A.3.9. Under the Model (3.2.3), (E.1), (D.1), (D.2) and (S.1) in Section 3.4,

√n(β̂ − β̃) → 0 in probability.

Proof: Choose ǫ > 0 and δ > 0. Since √n β̃ converges in distribution, there exists a c0 such that

P[ ‖β̃‖ ≥ c0/√n ] < δ/2 ,   (A.3.42)

for n sufficiently large. Let

T = min{ Q(Y − Xβ) : ‖β − β̃‖ = ǫ/√n } − Q(Y − Xβ̃) .   (A.3.43)

Since β̃ is the unique minimizer of Q, T > 0; hence, by asymptotic quadraticity we have

P[ max_{√n‖β‖ ≤ c0+ǫ} |D(Y − Xβ) − Q(Y − Xβ)| ≥ T/2 ] < δ/2 ,   (A.3.44)

for sufficiently large n. Hence, except on an event of probability at most δ, ‖β̃‖ < c0/√n and

|D(Y − Xβ) − Q(Y − Xβ)| < T/2 , for all √n‖β‖ ≤ c0 + ǫ ,   (A.3.45)

for sufficiently large n. From this, (A.3.43), and (A.3.45) we get the following string of inequalities:

D(Y − Xβ) > Q(Y − Xβ) − T/2
          ≥ min{ Q(Y − Xβ) : ‖β − β̃‖ = ǫ/√n } − T/2
          = T + Q(Y − Xβ̃) − T/2
          = T/2 + Q(Y − Xβ̃) > D(Y − Xβ̃) ,   (A.3.46)

for ‖β − β̃‖ = ǫ/√n. Thus, D(Y − Xβ) > D(Y − Xβ̃) for ‖β − β̃‖ = ǫ/√n. Since D is convex, we must also have D(Y − Xβ) > D(Y − Xβ̃) for ‖β − β̃‖ ≥ ǫ/√n. But D(Y − Xβ̃) ≥ min D(Y − Xβ) = D(Y − Xβ̂). Hence β̂ must lie inside the disk ‖β − β̃‖ = ǫ/√n with probability of at least 1 − 2δ; that is, P[ ‖β̂ − β̃‖ < ǫ/√n ] > 1 − 2δ. This yields the result.

A.3.4 Consistency of the Test Statistic Fϕ

This section contains a proof of the consistency of the test statistic Fϕ , Theorem 3.6.2. We begin with a lemma.


Lemma A.3.1. Let a > 0 be given and let tn = min_{√n‖β−β̃‖=a} ( Q(β) − Q(β̃) ). Then tn = (2τ)⁻¹ a² λn,1, where λn,1 is the minimum eigenvalue of n⁻¹X′X.

Proof: After some computation, we have

Q(β) − Q(β̃) = (2τ)⁻¹ √n(β − β̃)′ n⁻¹X′X √n(β − β̃) .

Let 0 < λn,1 ≤ · · · ≤ λn,p be the eigenvalues of n⁻¹X′X and let γn,1, . . . , γn,p be a corresponding set of orthonormal eigenvectors. The spectral decomposition of n⁻¹X′X is n⁻¹X′X = Σ_{i=1}^p λn,i γn,i γ′n,i. From this we can show for any vector δ that δ′n⁻¹X′Xδ ≥ λn,1‖δ‖² and, further, that the minimum is achieved over all vectors of unit length when δ = γn,1. It then follows that

min_{‖δ‖=a} δ′n⁻¹X′Xδ = λn,1 a² ,

which yields the conclusion.

Note that by (D.2) of Section 3.4, λn,1 → λ1, for some λ1 > 0. The following is a restatement and a proof of Theorem 3.6.2.

Theorem A.3.10. Suppose conditions (E.1), (D.1), (D.2), and (S.1) of Section 3.4 hold. The test statistic Fϕ is consistent for the hypotheses (3.2.5).

Proof: By the above discussion we need only show that (3.6.23) is true. Let ǫ > 0 be given. Let c0 = (2τ)⁻¹χ²α,q. By Lemma A.3.1, choose a > 0 so large that (2τ)⁻¹a²λ1 > 3c0 + ǫ. Next choose n0 so large that (2τ)⁻¹a²λn,1 > 3c0, for n ≥ n0. Since √n‖β̃ − β0‖ is bounded in probability, there exists a c > 0 and an n1 such that for n ≥ n1,

Pβ0(C1,n) ≥ 1 − (ǫ/2) ,   (A.3.47)

where we define the event C1,n = {√n‖β̃ − β0‖ < c}. Since t > 0, by asymptotic quadraticity, Theorem A.3.8, there exists an n2 such that for n > n2,

Pβ0(C2,n) ≥ 1 − (ǫ/2) ,   (A.3.48)

where C2,n = { max_{√n‖β−β0‖≤c+a} |Q(β) − D(β)| < (t/3) }. For the remainder of the proof assume that n ≥ max{n0, n1, n2} = n∗. Next suppose β is such that √n‖β̃ − β‖ = a. Then on C1,n it follows that √n‖β − β0‖ ≤ c + a. Hence on both C1,n and C2,n we have

D(β) > Q(β) − (t/3) ≥ Q(β̃) + t − (t/3) = Q(β̃) + 2(t/3) > D(β̃) + (t/3) .

Therefore, for all β such that √n‖β − β̃‖ = a, D(β) − D(β̃) > (t/3) > c0. But D is convex; hence on C1,n ∩ C2,n, for all β such that √n‖β − β̃‖ ≥ a, D(β) − D(β̃) > (t/3) > c0.


Finally choose n3 such that for n ≥ n3, δ > (c + a)/√n, where δ is the positive distance between β0 and Rr. Now assume that n ≥ max{n∗, n3} and C1,n ∩ C2,n is true. Recall that the reduced model R-estimate β̂r lies in Rr; hence, β̂r = (β̂′r,1, 0′)′ and

√n‖β̂r − β̃‖ ≥ √n‖β̂r − β0‖ − √n‖β̃ − β0‖ ≥ √n δ − c > a .

Thus on C1,n ∩ C2,n, D(β̂r) − D(β̃) > c0. Thus for n sufficiently large we have

P[ D(β̂r) − D(β̃) > (2τ)⁻¹χ²α,q ] ≥ 1 − ǫ .

Because ǫ was arbitrary, (3.6.23) is true and consistency of Fϕ follows.
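The eigenvalue bound min_{‖δ‖=a} δ′n⁻¹X′Xδ = λn,1 a² used in the proof of Lemma A.3.1 is easy to check numerically; a minimal sketch of ours with an arbitrary random design:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 3
    X = rng.normal(size=(n, p))
    A = X.T @ X / n                            # the matrix n^{-1} X'X

    lam, gam = np.linalg.eigh(A)               # eigenvalues ascending, orthonormal vectors
    a = 2.5
    deltas = rng.normal(size=(10000, p))
    deltas *= a / np.linalg.norm(deltas, axis=1, keepdims=True)   # random delta with norm a
    quad = np.einsum("ij,jk,ik->i", deltas, A, deltas)            # delta' A delta

    print(quad.min(), ">=", lam[0] * a ** 2)              # lower bound lambda_{n,1} a^2
    print((a * gam[:, 0]) @ A @ (a * gam[:, 0]))          # attained at delta = a gamma_{n,1}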

A.3.5 Proof of Lemma 3.5.1

The following lemma was used to establish the asymptotic linearity for the sign process for linear models in Chapter 3. The proof of this lemma was first given by Jurečková (1971) for general scores. We restate the lemma and give its proof for sign scores.

Lemma A.3.2. Assume conditions (E.1), (E.2), (S.1), (D.1) and (D.2) of Section 3.4. For any ǫ > 0 and for any a ∈ R,

lim_{n→∞} P[ |S1(Y − an^{−1/2} − Xβ̂R) − S1(Y − an^{−1/2})| ≥ ǫ√n ] = 0 .

Proof: Let a be arbitrary but fixed and let c > |a|. After matching notation, Theorem A.4.3 leads to the result,

max_{‖(X′X)^{1/2}β‖≤c} | n^{−1/2}S1(Y − an^{−1/2} − Xβ) − n^{−1/2}S1(Y) + (2f(0))a | = op(1) .   (A.3.49)

Obviously the above result holds for β = 0. Hence for any ǫ > 0,

P[ max_{‖(X′X)^{1/2}β‖≤c} | n^{−1/2}S1(Y − an^{−1/2} − Xβ) − n^{−1/2}S1(Y − an^{−1/2}) | ≥ ǫ ]
  ≤ P[ max_{‖(X′X)^{1/2}β‖≤c} | n^{−1/2}S1(Y − an^{−1/2} − Xβ) − n^{−1/2}S1(Y) + (2f(0))a | ≥ ǫ/2 ]
  + P[ | n^{−1/2}S1(Y − an^{−1/2}) − n^{−1/2}S1(Y) + (2f(0))a | ≥ ǫ/2 ] .

By (A.3.49), for n sufficiently large, the two terms on the right side are arbitrarily small. The desired result follows from this since (X′X)^{1/2}β̂ is bounded in probability.
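Before leaving this section, here is a quick numerical look at the linearity statement (A.3.49) at β = 0, under our own assumption of standard normal errors:

    import numpy as np

    # Sign-process linearity behind (A.3.49) at beta = 0:
    # n^{-1/2} S1(Y - a n^{-1/2}) is close to n^{-1/2} S1(Y) - 2 f(0) a,
    # where S1(v) = sum sgn(v_i) and the errors are N(0,1) (assumed).
    rng = np.random.default_rng(2)
    n = 200000
    y = rng.normal(size=n)
    f0 = 1.0 / np.sqrt(2.0 * np.pi)            # standard normal density at 0

    S1 = lambda v: np.sum(np.sign(v))
    for a in (-1.0, 0.5, 2.0):
        lhs = S1(y - a / np.sqrt(n)) / np.sqrt(n)
        rhs = S1(y) / np.sqrt(n) - 2.0 * f0 * a
        print(a, lhs, rhs)                     # agree up to Monte Carlo error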

A.4 Asymptotic Linearity for the L1 Analysis

In this section we obtain a linearity result for the L1 analysis of a linear model. Recall from Section 3.6 that the L1-estimates are equivalent to the R-estimates when the rank scores are generated by the sign function; hence, the distribution theory for the L1-estimates is derived in Section 3.4. The linearity result derived below offers another way to obtain this result. More importantly, though, we need the linearity result for the proof of Lemma 3.5.6 of Section 3.5. As we next show, this result is a corollary to the linearity results derived in the last section. We will assume the same linear model and use the same notation as in Section 3.2. Recall that the L1 estimate of β minimizes the dispersion function,

D1(α, β) = Σ_{i=1}^n |Yi − α − xiβ| .

The corresponding gradient function is the (p + 1) × 1 vector whose components are

▽j D1 = −Σ_{i=1}^n sgn(Yi − α − xiβ)        if j = 0 ,
▽j D1 = −Σ_{i=1}^n xij sgn(Yi − α − xiβ)    if j = 1, . . . , p ,

where j = 0 denotes the partial of D1 with respect to α. The parameter α will denote the location functional med(Yi − xiβ), i.e., the median of the errors. Without loss of generality we will assume that the true parameters are 0.

We first consider the simple linear model. Consider then the notation of Section A.3; see (A.3.1) - (A.3.7). We will derive the analogue of Theorem A.3.8 for the processes

U0(α, ∆) = Σ_{i=1}^n sgn(Yi − α/√n − ∆ci) ,   (A.4.1)
U1(α, ∆) = Σ_{i=1}^n ci sgn(Yi − α/√n − ∆ci) .   (A.4.2)

Let pd = Π_{i=1}^n f0(yi) denote the likelihood for the iid observations Y1, . . . , Yn and let qd = Π_{i=1}^n f0(yi + α/√n + ∆ci) denote the likelihood of the variables Yi − α/√n − ∆ci. We assume throughout that f0(0) > 0. Similar to Section A.2.2, the sequence of densities qd is contiguous to the sequence pd. Note that the processes U0 and U1 are already sums of independent variables; hence, projections are unnecessary. We first work with the process U1.

Lemma A.4.1. Under the above assumptions and as n → ∞,

E0(U1(α, ∆)) → −2∆f0(0) .


Proof: After some simplification we get

E0(U1(α, ∆)) = 2 Σ_{i=1}^n ci [ F0(0) − F0(α/√n + ∆ci) ]
             = 2 Σ_{i=1}^n ci (−∆ci − α/√n) f0(ξin) ,

where, by the mean value theorem, ξin is between 0 and |α/√n + ∆ci|. Since the ci's are centered, we further obtain

E0(U1(α, ∆)) = −2∆ Σ_{i=1}^n c²i [ f0(ξin) − f0(0) ] − 2∆ Σ_{i=1}^n c²i f0(0) .

By assumptions of Section A.2.2, it follows that maxi |α/√n + ∆ci| → 0 as n → ∞. Since Σ_{i=1}^n c²i = 1 and f0 is continuous and positive at 0, the desired result easily follows.

This leads us to our main result for U1(α, ∆):

Theorem A.4.1. Under the above assumptions, for all α and ∆,

U1(α, ∆) − [ U1(0, 0) − 2∆f0(0) ] → 0 in probability,

as n → ∞.

Proof: Because the ci's are centered it follows that Epd(U1(0, 0)) = 0. Thus by the last lemma, we need only show that Var(U1(α, ∆) − U1(0, 0)) → 0. By considering the variance of the sign of a random variable, simplification leads to the bound:

Var(U1(α, ∆) − U1(0, 0)) ≤ 4 Σ_{i=1}^n c²i | F0(α/√n + ∆ci) − F0(0) | .

By our assumptions, maxi |∆ci + α/√n| → 0 as n → ∞. From this and the continuity of F0 at 0, it follows that Var(U1(α, ∆) − U1(0, 0)) → 0.

We need analogous results for the process U0(α, ∆).

Lemma A.4.2. Under the above assumptions, as n → ∞,

E0[U0(α, ∆)] → −2αf0(0) .


Proof: Upon simplification and an application of the mean value theorem,

E0[U0(α, ∆)] = (2/√n) Σ_{i=1}^n [ F0(0) − F0(α/√n + ci∆) ]
             = (−2/√n) Σ_{i=1}^n (α/√n + ci∆) f0(ξin)
             = (−2α/n) Σ_{i=1}^n [ f0(ξin) − f0(0) ] − 2αf0(0) ,

where we have used the fact that the ci's are centered. Note that |ξin| is between 0 and |α/√n + ci∆| and that max |α/√n + ci∆| → 0 as n → ∞. By the continuity of f0 at 0, the desired result follows.

Theorem A.4.2. Under the above assumptions, for all α and ∆,

U0(α, ∆) − [ U0(0, 0) − 2αf0(0) ] → 0 in probability,

as n → ∞.

Proof: Because med Yi is 0, E0[U0(0, 0)] = 0. Hence by the last lemma it then suffices to show that Var(U0(α, ∆) − U0(0, 0)) → 0. But

Var(U0(α, ∆) − U0(0, 0)) ≤ (4/n) Σ_{i=1}^n | F0(α/√n + ci∆) − F0(0) | .

Because max |α/√n + ci∆| → 0 and F0 is continuous at 0, Var(U0(α, ∆) − U0(0, 0)) → 0.

Next consider the multiple regression model as discussed in Section A.3. The only difference in notation is that here we have the intercept parameter included. Let ∆ = (α, ∆1, . . . , ∆p)′ denote the vector of all regression parameters. Take X = [1n : Xc], where Xc denotes a centered design matrix, and as in (A.3.2) take C = X(X′X)^{−1/2}. Note that the first column of C is (1/√n)1n. Let U(∆) = (U0(∆), . . . , Up(∆))′ denote the vector of processes. Similar to the discussion prior to Theorem A.3.1, the last two theorems imply that

U(∆) − [ U(0) − 2f0(0)∆ ] → 0 in probability,

for all ∆ ∈ Rp+1. As in Section A.3, we define the quadratic approximation to D1 as

Q1n(∆) = (2f0(0))∆′∆/2 − ∆′U(0) + D1(0) .

The asymptotic linearity of U and the asymptotic quadraticity of D1 then follow as in the last section. We state the result for reference:


Theorem A.4.3. Under conditions (3.4.1), (3.4.3), (3.4.7) and (3.4.8),

lim_{n→∞} P( max_{‖∆‖≤c} ‖U(∆) − (U(0) − (2f0(0))∆)‖ ≥ ǫ ) = 0 ,   (A.4.3)
lim_{n→∞} P( max_{‖∆‖≤c} |D1(∆) − Q1(∆)| ≥ ǫ ) = 0 ,   (A.4.4)

for all ǫ > 0 and all c > 0.
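To make the objects of this section concrete, the following sketch (our own choice of a Cauchy-error simple regression model; a crude grid search stands in for a proper L1 solver) evaluates the dispersion D1(α, β) and its sign-score gradient:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 500
    x = rng.normal(size=n)
    y = 2.0 + 1.5 * x + rng.standard_cauchy(size=n)    # heavy-tailed errors (assumed model)

    def D1(alpha, beta):
        # L1 dispersion: sum |y_i - alpha - x_i beta|
        return np.sum(np.abs(y - alpha - x * beta))

    def grad(alpha, beta):
        # the sign-score gradient components of this section
        r = np.sign(y - alpha - x * beta)
        return np.array([-np.sum(r), -np.sum(x * r)])

    alphas = np.linspace(1.5, 2.5, 201)
    betas = np.linspace(1.0, 2.0, 201)
    vals = [(D1(a, b), a, b) for a in alphas for b in betas]
    _, a_hat, b_hat = min(vals)
    print(a_hat, b_hat, grad(a_hat, b_hat) / n)        # gradient near zero at the minimizer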

A.5 Influence Functions

In this section we derive the influence functions found in Chapters 1-3 and Chapter 5. Discussions of the influence function can be found in Staudte and Sheather (1990), Hampel et al. (1986) and Huber (1981). For the influence functions of Chapter 3, we will find the Gâteaux derivative of a convenient functional; see Fernholz (1983) and Huber (1981) for rigorous discussions of functionals and derivatives.

Definition A.5.1. Let T be a statistical functional defined on a space of distribution functions and let H denote a distribution function in the domain of T. We say T is Gâteaux differentiable at H if for any distribution function W, such that the distribution functions {(1 − s)H + sW} lie in the domain of T, the following limit exists:

lim_{s→0} ( T[(1 − s)H + sW] − T[H] ) / s = ∫ ψH dW ,   (A.5.1)

for some function ψH. Note by taking W to be H in the above definition we have

∫ ψH dH = 0 .   (A.5.2)

The usual definition of the influence function is obtained by taking the distribution function W to be a point mass distribution. Denote the point mass distribution function at t by ∆t(x). Letting W(x) = ∆t(x), the Gâteaux derivative of T(H) is

lim_{s→0} ( T[(1 − s)H + s∆t(x)] − T[H] ) / s = ψH(x) .   (A.5.3)

The function ψH(x) is called the influence function of T(H). Note that this is the derivative of the functional T[(1 − s)H + s∆t(x)] at s = 0. It measures the rate of change of the functional T(H) at H in the direction of ∆t. A functional is said to be robust when this derivative is bounded.
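As a concrete instance of (A.5.3), the sketch below computes the influence function of the median at H = N(0, 1) by direct contamination; the function contaminated_median is our own construction, and the known limit is sgn(t)/(2h(0)):

    import numpy as np
    from scipy.stats import norm

    # Influence function of the median at H = N(0,1) by contamination:
    # H_s = (1 - s) * Phi + s * Delta_t.  Solve H_s(m_s) = 1/2 and take (m_s - m_0)/s.
    def contaminated_median(t, s):
        if t > 0:                       # point mass lies above the median
            return norm.ppf(0.5 / (1 - s))
        else:                           # point mass lies below the median
            return norm.ppf((0.5 - s) / (1 - s))

    s = 1e-6
    for t in (-2.0, -0.5, 0.5, 2.0):
        psi = (contaminated_median(t, s) - 0.0) / s
        print(t, psi, np.sign(t) / (2 * norm.pdf(0)))   # matches sgn(t)/(2 h(0))

The bounded limit sgn(t)/(2h(0)) is why the median is a robust functional in the sense just defined.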

A.5.1 Influence Function for Estimates Based on Signed-Rank Statistics

In this section we derive the influence function for the one-sample location estimate θ̂ϕ+, (1.8.5), discussed in Chapter 1. We will assume that we are sampling from a symmetric density h(x) with distribution function H(x), as in Section 1.8. As in Chapter 2, we will assume that the one sample score function ϕ+(u) is defined by

ϕ+(u) = ϕ( (u + 1)/2 ) ,   (A.5.4)

where ϕ(u) is a nondecreasing, differentiable function defined on the interval (0, 1) satisfying

ϕ(1 − u) = −ϕ(u) .   (A.5.5)

Recall from Chapter 2 that this assumption is appropriate for scores for samples from symmetrical distributions. For convenience we extend ϕ+(u) to the interval (−1, 0) by

ϕ+(−u) = −ϕ+(u) .   (A.5.6)

Our functional T(H) is defined implicitly by the equation (1.8.5). Using the symmetry of h(x), (A.5.5), and (A.5.6) we can write the defining equation for θ = T(H) as

0 = ∫_{−∞}^{∞} ϕ+( H(x) − H(2θ − x) ) h(x) dx ,
0 = ∫_{−∞}^{∞} ϕ( 1 − H(2θ − x) ) h(x) dx .   (A.5.7)

For the derivation, we will proceed as discussed above; see the discussion around expression (A.5.3). Consider the contaminated distribution of H(x) given by

Ht,ǫ(x) = (1 − ǫ)H(x) + ǫ∆t(x) ,   (A.5.8)

where 0 < ǫ < 1 is the proportion of contamination and ∆t(x) is the distribution function for a point mass at t. By (A.5.3) the influence function is the derivative of the functional at ǫ = 0. To obtain this derivative we implicitly differentiate the defining equation (A.5.7) at Ht,ǫ(x); i.e., at

0 = (1 − ǫ) ∫_{−∞}^{∞} ϕ( 1 − (1 − ǫ)H(2θ − x) − ǫ∆t(2θ − x) ) h(x) dx
    + ǫ ∫_{−∞}^{∞} ϕ( 1 − (1 − ǫ)H(2θ − x) − ǫ∆t(2θ − x) ) d∆t(x) .

Let θ̇ denote the derivative of the functional. Implicitly differentiating this equation and then setting ǫ = 0 and without loss of generality θ = 0, we get

0 = − ∫_{−∞}^{∞} ϕ(H(x))h(x) dx + ∫_{−∞}^{∞} ϕ′(H(x))H(−x)h(x) dx
    − 2θ̇ ∫_{−∞}^{∞} ϕ′(H(x))h²(x) dx − ∫_{−∞}^{∞} ϕ′(H(x))∆t(−x)h(x) dx + ϕ(H(t)) .


Label the four integrals in the above equation as I1, . . . , I4. Since ∫ ϕ(u) du = 0, I1 = 0. For I2 we get

I2 = ∫_{−∞}^{∞} ϕ′(H(x))h(x) dx − ∫_{−∞}^{∞} ϕ′(H(x))H(x)h(x) dx
   = ∫_0^1 ϕ′(u) du − ∫_0^1 ϕ′(u) u du = −ϕ(0) .

Next I4 reduces to

− ∫_{−∞}^{−t} ϕ′(H(x))h(x) dx = − ∫_0^{H(−t)} ϕ′(u) du = ϕ(H(t)) + ϕ(0) .

Combining these results and solving for θ̇ leads to the influence function, which we can write in either of the following two ways:

Ω(t, θ̂ϕ+) = ϕ(H(t)) / ∫_{−∞}^{∞} ϕ′(H(x))h²(x) dx
           = ϕ+(2H(t) − 1) / ( 4 ∫_0^{∞} ϕ+′(2H(x) − 1) h²(x) dx ) .   (A.5.9)
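For Wilcoxon scores, ϕ(u) = √12(u − 1/2), formula (A.5.9) is easy to evaluate numerically. A minimal sketch of ours at H = N(0, 1), showing the bounded influence (the limit as t → ±∞ is ±√π):

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    # Evaluate (A.5.9) for Wilcoxon scores phi(u) = sqrt(12)*(u - 1/2) at
    # H = N(0,1); phi'(u) = sqrt(12), so the denominator is sqrt(12) * int h^2.
    phi = lambda u: np.sqrt(12.0) * (u - 0.5)
    denom = np.sqrt(12.0) * quad(lambda x: norm.pdf(x) ** 2, -np.inf, np.inf)[0]

    for t in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(t, phi(norm.cdf(t)) / denom)    # bounded: tends to +/- sqrt(pi) ~ 1.77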

A.5.2 Influence Functions for Chapter 3

In this section, we derive the influence functions which were presented in Chapter 3. Much of this work was developed in Witt (1989) and Witt, McKean and Naranjo (1995). The correlation model of Section 3.11 is the underlying model for the influence functions derived in this section. Recall that the joint distribution function of x and Y is H, the distribution functions of x, Y and e are M, G and F, respectively, and Σ is the variance-covariance matrix of x.

Let β̂ϕ denote the R-estimate of β for a specified score function ϕ(u). In this section we are interested in deriving the influence functions of this R-estimate and of the corresponding R-test statistic for the general linear hypotheses. We will obtain these influence functions by using the definition of the Gâteaux derivative of a functional, (A.5.1). The influence functions are then obtained by taking W to be the point mass distribution function ∆(x0,y0); see expression (A.5.3). If T is Gâteaux differentiable at H then by setting W = ∆(x0,y0) we see that the influence function of T is given by

Ω(x0, y0; T) = ∫ ψH d∆(x0,y0) = ψH(x0, y0) .   (A.5.10)

As a simple example, we will obtain the influence function of the statistic D(0) = Σ a(R(Yi))Yi. Since G is the distribution function of Y, the corresponding functional is T[G] = ∫ ϕ(G(y)) y dG(y). Hence for a given distribution function W,

T[(1 − s)G + sW] = (1 − s) ∫ ϕ[(1 − s)G(y) + sW(y)] y dG(y) + s ∫ ϕ[(1 − s)G(y) + sW(y)] y dW(y) .

Taking the partial derivative of the right side with respect to s, setting s = 0, and substituting ∆y0 for W leads to the influence function

Ω(y0; D(0)) = − ∫ ϕ(G(y)) y dG(y) − ∫ ϕ′(G(y)) G(y) y dG(y) + ∫_{y0}^{∞} ϕ′(G(y)) y dG(y) + ϕ(G(y0)) y0 .   (A.5.11)

Note that this is not bounded in the Y-space and, hence, the statistic D(0) is not robust. Thus, as noted in Section 3.11, the coefficient of multiple determination R1, (3.11.16), is not robust. A similar development establishes the influence function for the denominator of the LS coefficient of multiple determination R2, showing too that it is not bounded. Hence R2 is not a robust statistic.

Another example is the influence function of the least squares estimate of β, which is given by

Ω(x0, y0; β̂LS) = σ⁻¹ y0 Σ⁻¹ x0 .   (A.5.12)

The influence function of the least squares estimate is, thus, unbounded in both the Y- and x-spaces.

Influence Function of β̂ϕ

Recall that H is the joint distribution function of x and Y. Let the p × 1 vector T(H) denote the functional corresponding to β̂ϕ. Assume without loss of generality that the true β = 0, α = 0, and that E(x) = 0. Hence the distribution function of Y is F(y) and Y and x are independent; i.e., H(x, y) = M(x)F(y). Recall that the R-estimate satisfies the equations

Σ_{i=1}^n xi a( R(Yi − x′iβ̂) ) ≐ 0 .

Let Ĝ∗n denote the empirical distribution function of Yi − x′iβ̂. Then we can rewrite the above equations as

(1/n) Σ_{i=1}^n xi ϕ( (n/(n + 1)) Ĝ∗n(Yi − x′iβ̂) ) ≐ 0 .


Let G∗ denote the distribution function of Y − x′T(H). Then the functional T(H) satisfies

∫ ϕ( G∗(y − x′T(H)) ) x dH(x, y) = 0 .   (A.5.13)

We can show that

G∗(t) = ∫∫_{u ≤ v′T(H)+t} dH(v, u) .   (A.5.14)

Let Hs = (1 − s)H + sW for an arbitrary distribution function W. Then the functional T(H) evaluated at Hs satisfies the equation

(1 − s) ∫ ϕ( G∗s(y − x′T(Hs)) ) x dH(x, y) + s ∫ ϕ( G∗s(y − x′T(Hs)) ) x dW(x, y) = 0 ,

where G∗s is the distribution function of Y − x′T(Hs). We will obtain ∂T/∂s by implicit differentiation. Then upon substituting ∆x0,y0 for W the influence function is given by (∂T/∂s)|s=0, which we will denote by Ṫ. Implicit differentiation leads to

0 = − ∫ ϕ( G∗s(y − x′T(Hs)) ) x dH(x, y) − (1 − s) ∫ ϕ′( G∗s(y − x′T(Hs)) ) (∂G∗s/∂s) x dH(x, y)
    + ∫ ϕ( G∗s(y − x′T(Hs)) ) x dW(x, y) + sB1 ,   (A.5.15)

where B1 is irrelevant since we will be setting s to 0. We first get the partial derivative of G∗s with respect to s. By (A.5.14) and the independence between Y and x at H, we have

G∗s(y − x′T(Hs)) = ∫∫_{u ≤ y−T(Hs)′(x−v)} dHs(v, u)
                 = (1 − s) ∫ F[y − T(Hs)′(x − v)] dM(v) + s ∫∫_{u ≤ y−T(Hs)′(x−v)} dW(v, u) .

Thus,

∂G∗s(y − x′T(Hs))/∂s = − ∫ F[y − T(Hs)′(x − v)] dM(v)
    + (1 − s) ∫ F′[y − T(Hs)′(x − v)] (v − x)′ (∂T/∂s) dM(v)
    + ∫∫_{u ≤ y−T(Hs)′(x−v)} dW(v, u) + sB2 ,

where B2 is irrelevant since we are setting s to 0. Therefore using the independence between Y and x at H, T(H) = 0, and E(x) = 0, we get

∂G∗s(y − x′T(Hs))/∂s |s=0 = −F(y) − f(y)x′Ṫ + WY(y) ,   (A.5.16)


where WY denotes the marginal (second variable) distribution function of W. Upon evaluating expression (A.5.15) at s = 0 and substituting into it expression (A.5.16) we have

0 = − ∫ x ϕ(F(y)) dH(x, y) + ∫ x ϕ′(F(y)) [ −F(y) − f(y)x′Ṫ + WY(y) ] dH(x, y)
    + ∫ x ϕ(F(y)) dW(x, y)
  = − ∫ ϕ′(F(y)) f(y) xx′ Ṫ dH(x, y) + ∫ x ϕ(F(y)) dW(x, y) .

Substituting ∆x0,y0 in for W, we get

0 = −τ⁻¹ Σ Ṫ + x0 ϕ(F(y0)) .

Solving this last expression for Ṫ, the influence function of β̂ϕ is given by

Ω(x0, y0; β̂ϕ) = τ Σ⁻¹ ϕ(F(y0)) x0 .   (A.5.17)

Hence the influence function of β̂ϕ is bounded in the Y-space but not in the x-space. The estimate is thus bias robust. In Chapter 5 we present R-estimates whose influence functions are bounded in both spaces; see Theorems ?? and 3.12.4. Note that the asymptotic representation of β̂ϕ in Corollary 3.5.24 can be written in terms of this influence function as

√n β̂ϕ = n^{−1/2} Σ_{i=1}^n Ω(xi, Yi; β̂ϕ) + op(1) .
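A small numerical reading of (A.5.17), assuming Wilcoxon scores, N(0, 1) errors and a single standardized regressor (so Σ = 1 and τ = √(π/3)); the sketch is ours:

    import numpy as np
    from scipy.stats import norm

    # Evaluate the influence function (A.5.17) for Wilcoxon scores with
    # N(0,1) errors and one standardized regressor, so Sigma^{-1} = 1 and
    # tau = (sqrt(12) * int f^2)^{-1} = sqrt(pi/3).
    tau = np.sqrt(np.pi / 3.0)
    phi = lambda u: np.sqrt(12.0) * (u - 0.5)

    def influence(x0, y0):
        return tau * phi(norm.cdf(y0)) * x0

    for y0 in (-100.0, -1.0, 1.0, 100.0):
        print(y0, influence(1.0, y0))          # bounded in y0 ...
    for x0 in (1.0, 10.0, 100.0):
        print(x0, influence(x0, 1.0))          # ... but linear (unbounded) in x0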

Influence Function of Fϕ

Rewrite the correlation model as

Y = α + x′1β1 + x′2β2 + e

and consider testing the general linear hypotheses

H0 : β2 = 0 versus HA : β2 ≠ 0 ,   (A.5.18)

where β1 and β2 are q × 1 and (p − q) × 1 vectors of parameters, respectively. Let β̂1,ϕ denote the reduced model estimate. Recall that the R-test based upon the drop in dispersion is given by

Fϕ = (RD/q) / (τ̂/2) ,

where RD = D(β̂1,ϕ) − D(β̂ϕ) is the reduction in dispersion. In this section we want to derive the influence function of the test statistic.


Let RD(H) denote the functional for the statistic RD. Then RD(H) = D1(H) − D2(H), where D1(H) and D2(H) are the reduced and full model functionals given by

D1(H) = ∫ ϕ[G∗(y − x′1T1(H))] (y − x′1T1(H)) dH(x, y) ,
D2(H) = ∫ ϕ[G∗(y − x′T(H))] (y − x′T(H)) dH(x, y) ,   (A.5.19)

and T1(H) and T2(H) denote the reduced and full model functionals for β1 and β, respectively. Let βr = (β′1, 0′)′ denote the true vector of parameters under H0. Then the random variables Y − x′βr and x are independent. Next write Σ as

Σ = [ Σ11  Σ12 ]
    [ Σ21  Σ22 ] .

It will be convenient to define the matrices Σr and Σ+r as

Σr = [ Σ11  0 ]
     [ 0    0 ]

and

Σ+r = [ Σ11⁻¹  0 ]
      [ 0      0 ] .

As above, let Hs = (1 − s)H + sW. We begin with a lemma.

Lemma A.5.1. Under the correlation model,

(a)  RD(0) = 0 ,
(b)  ∂RD(Hs)/∂s |s=0 = 0 ,
(c)  ∂²RD(Hs)/∂s² |s=0 = τ ϕ²[F(y0 − x′0βr)] x′0 ( Σ⁻¹ − Σ+r ) x0 .   (A.5.20)

Proof: Part (a) is immediate. For Part (b), it follows from (A.5.19) that

∂D2(Hs)/∂s = − ∫ ϕ[G∗s(y − x′T(Hs))] (y − x′T(Hs)) dH
    + (1 − s) ∫ ϕ′[G∗s(y − x′T(Hs))] (y − x′T(Hs)) (∂G∗s/∂s) dH
    + (1 − s) ∫ ϕ[G∗s(y − x′T(Hs))] (−x′)(∂T/∂s) dH
    + ∫ ϕ[G∗s(y − x′T(Hs))] (y − x′T(Hs)) dW + sB ,   (A.5.21)


where B is irrelevant because we are setting s to 0. Evaluating this at s = 0 and using the independence of Y − x′βr and x, and E(x) = 0, we get after some simplification

∂D2(Hs)/∂s |s=0 = − ∫ ϕ[F(y − x′βr)] (y − x′βr) dH − ∫ ϕ′[F(y − x′βr)] F(y − x′βr) (y − x′βr) dH
    + ∫ ϕ′[F(y − x′βr)] WY(y − x′βr) (y − x′βr) dH + ϕ[F(y0 − x′0βr)] (y0 − x′0βr) .

Differentiating as above and using x′βr = x′1β1, we get the same expression for ∂D1/∂s |s=0. Hence Part (b) is true.

Taking the second partial derivatives of D1(H) and D2(H) with respect to s, the result for Part (c) can be obtained. This is a tedious derivation and details of it can be found in Witt (1989) and Witt et al. (1995).

Since Fϕ is nonnegative, there is no loss in generality in deriving the influence function of √(qFϕ). Letting Q² = 2τ⁻¹RD we have

Ω(x0, y0; √(qFϕ)) = lim_{s→0} ( Q[(1 − s)H + s∆x0,y0] − Q[H] ) / s .

But Q[H] = 0 by Part (a) of Lemma A.5.1. Hence we can rewrite the above limit as

Ω(x0, y0; √(qFϕ)) = lim_{s→0} ( Q²[(1 − s)H + s∆x0,y0] / s² )^{1/2} .

Using Parts (a) and (b) of Lemma A.5.1, we can apply L'Hospital's rule twice to evaluate this limit. Thus

Ω(x0, y0; √(qFϕ)) = ( (1/2) lim_{s→0} ∂²Q²/∂s² )^{1/2}
                  = ( (2τ⁻¹/2) ∂²RD/∂s² |s=0 )^{1/2}
                  = |ϕ[F(y0 − x′0βr)]| √( x′0 [Σ⁻¹ − Σ+r] x0 ) .   (A.5.22)

Hence, the influence function of the rank-based test statistic Fϕ is bounded in the Y-space as long as the score function is bounded. It can be shown that the influence function of the least squares test statistic is not bounded in Y-space. It is clear from the above argument that the coefficient of multiple determination R2 is also robust. Hence, for R-fits, R2 is the preferred coefficient of determination. However, Fϕ is not bounded in the x-space. In Chapter 5 we present statistics whose influence functions are bounded in both spaces, although they are less efficient.

The asymptotic distribution of qFϕ was derived in Section 3.6; however, we can use the above result on the influence function to immediately display it. If we expand Q² into a von Mises expansion at H, we have

Q²(Hs) = Q²(H) + (∂Q²/∂s)|s=0 + (1/2)(∂²Q²/∂s²)|s=0 + R
       = [ ∫ ϕ(F(y − x′βr)) x′ d∆x0,y0(x, y) ] ( Σ⁻¹ − Σ+r ) [ ∫ ϕ(F(y − x′βr)) x d∆x0,y0(x, y) ] + R .   (A.5.23)

Upon substituting the empirical distribution function for ∆x0,y0 in expression (A.5.23), we have at the sample

nQ²(Hs) = [ n^{−1/2} Σ_{i=1}^n x′i ϕ( R(Yi − x′iβr)/(n + 1) ) ] ( Σ⁻¹ − Σ+r ) [ n^{−1/2} Σ_{i=1}^n xi ϕ( R(Yi − x′iβr)/(n + 1) ) ] + op(1) .

This expression is equivalent to the expression (3.6.11), which yields the asymptotic distribution of the test statistic in Section 3.6.
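A quick numerical reading of (A.5.22), assuming Wilcoxon scores, βr = 0, and p = 2 independent standardized regressors with q = 1, so that Σ = I2 and Σ⁻¹ − Σ+r = diag(0, 1); the sketch is ours:

    import numpy as np
    from scipy.stats import norm

    # Influence function (A.5.22) with Wilcoxon scores, beta_r = 0, Sigma = I_2,
    # q = 1 (test beta_2 = 0): only the tested coordinate of x0 enters.
    phi = lambda u: np.sqrt(12.0) * (u - 0.5)

    def influence(x0, y0):
        A = np.diag([0.0, 1.0])                # Sigma^{-1} - Sigma_r^+ for Sigma = I_2
        return abs(phi(norm.cdf(y0))) * np.sqrt(x0 @ A @ x0)

    print(influence(np.array([1.0, 1.0]), 1e6))    # bounded in y0
    print(influence(np.array([1.0, 1e3]), 1.0))    # unbounded in the x-space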

A.5.3 Influence Function of β̂HBR of Chapter 5

The influence function of the high breakdown estimator β̂HBR is discussed in Section 3.12.4. In this section, we restate the theorem and then derive a proof of it.

Theorem A.5.1. The influence function for the estimate β̂HBR is given by

Ω(x0, y0, β̂HBR) = (1/2) CH⁻¹ ∫∫ (x0 − x1) b(x1, x0, y1, y0) sgn{y0 − y1} dF(y1) dM(x1) ,   (A.5.24)

where CH is given by expression (3.12.22).

Proof: Let ∆0(x, y) denote the distribution function of the point mass at the point (x0, y0) and consider the contaminated distribution Ht = (1 − t)H + t∆0 for 0 < t < 1. Let β(Ht) denote the functional at Ht. Then β(Ht) satisfies

0 = ∫∫∫∫ x1 b(x1, x2, y1, y2) [ I(y2 − y1 < (x2 − x1)′β(Ht)) − 1/2 ] dHt(x1, y1) dHt(x2, y2) .   (A.5.25)

We next implicitly differentiate (A.5.25) with respect to t to obtain the derivative of the functional. The value of this derivative at t = 0 is the influence function. Without loss of generality, we can assume that the true parameter β = 0. Under this assumption x and y are independent. Substituting the value of Ht into (A.5.25) and expanding we obtain the four terms:

0 = (1 − t)² ∫∫∫ x1 [ ∫_{−∞}^{y1+(x2−x1)′β(Ht)} b(x1, x2, y1, y2) dF(y2) − 1/2 ] dM(x2) dM(x1) dF(y1)
    + (1 − t)t ∫∫∫∫ x1 b(x1, x2, y1, y2) [ I(y2 − y1 < (x2 − x1)′β(H)) − 1/2 ] dM(x2) dF(y2) d∆0(x1, y1)
    + (1 − t)t ∫∫∫∫ x1 b(x1, x2, y1, y2) [ I(y2 − y1 < (x2 − x1)′β(H)) − 1/2 ] d∆0(x2, y2) dM(x1) dF(y1)
    + t² ∫∫∫∫ x1 b(x1, x2, y1, y2) [ I(y2 − y1 < (x2 − x1)′β(H)) − 1/2 ] d∆0(x2, y2) d∆0(x1, y1) .

Let β̇ denote the derivative of the functional evaluated at 0. Proceeding to implicitly differentiate this equation and evaluating the derivative at 0, we get, after some derivation,

0 = [ ∫∫∫ x1 b(x1, x2, y1, y1) f²(y1) (x2 − x1)′ dy1 dM(x1) dM(x2) ] β̇
    + ∫∫ x0 b(x0, x2, y0, y2) [ I(y2 < y0) − 1/2 ] dF(y2) dM(x2)
    + ∫∫ x1 b(x1, x0, y1, y0) [ I(y0 < y1) − 1/2 ] dF(y1) dM(x1) .

Once again using the symmetry in the x arguments and y arguments of the function b, we can simplify this expression to

0 = − [ (1/2) ∫∫∫ (x2 − x1) b(x1, x2, y1, y1) (x2 − x1)′ f²(y1) dy1 dM(x1) dM(x2) ] β̇
    + ∫∫ (x0 − x1) b(x1, x0, y1, y0) [ I(y1 < y0) − 1/2 ] dF(y1) dM(x1) .

Using the relationship between the indicator function and the sign function and the definition of CH, (3.12.22), we can rewrite this last expression as

0 = −CH β̇ + (1/2) ∫∫ (x0 − x1) b(x1, x0, y1, y0) sgn{y0 − y1} dF(y1) dM(x1) .

Solving for β̇ leads to the desired result.
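As a check on (A.5.24), take constant weights b ≡ 1 (for which the HBR pseudo-norm reduces to the Wilcoxon pseudo-norm), one N(0, 1) regressor for M and N(0, 1) errors for F; then, under these assumptions, CH = (1/2)E(x2 − x1)² ∫f². The Monte Carlo sketch below (our own construction) shows the boundedness in y0:

    import numpy as np

    # Monte Carlo evaluation of (A.5.24) with constant weights b = 1,
    # one N(0,1) regressor (M) and N(0,1) errors (F).
    rng = np.random.default_rng(4)
    x1 = rng.normal(size=200000)
    y1 = rng.normal(size=200000)

    # C_H = (1/2) * E(x2 - x1)^2 * int f^2 = (1/2) * 2 * 1/(2 sqrt(pi)) when b = 1
    CH = 0.5 * 2.0 * (1.0 / (2.0 * np.sqrt(np.pi)))

    def influence(x0, y0):
        return 0.5 * np.mean((x0 - x1) * np.sign(y0 - y1)) / CH

    for y0 in (-50.0, -1.0, 1.0, 50.0):
        print(y0, influence(1.0, y0))     # bounded in y0 for fixed x0

The printed values level off at ±√π for fixed x0 = 1, matching the Wilcoxon influence function recovered earlier in this section.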

A.6 Asymptotic Theory for Chapter 5

In this section we derive the results that are needed in Section 3.12.3 of Chapter 5. These results were first derived by Chang (1995). Our development is taken from the article by Chang, McKean, Naranjo and Sheather (1996). The main goal is to prove Theorem 3.12.2 which we restate here:


Theorem A.6.1. Under assumptions (E.1), (3.4.1), and (H.1) - (H.4), (3.12.10) - (3.12.13),

√n(β̂HBR − β) →d N( 0, (1/4)C⁻¹ΣHC⁻¹ ) .

Besides the notation of Chapter 5, we need:

1. Wij(∆) = (1/2)[sgn(zj − zi) − sgn(yj − yi)], where zj = yj − x′j∆/√n .
2. tij(∆) = (xj − xi)′∆/√n .
3. Bij(t) = E[bij I(0 < yi − yj < t)] .
4. γij = B′ij(0)/E(bij) .
5. Cn = Σ_{i<j} γij bij (xj − xi)(xj − xi)′ .