Introduction to Econometrics with R

Christoph Hanck, Martin Arnold, Alexander Gerber and Martin Schmelzer
2018-10-17



Contents

Preface

1 Introduction
    1.1 A Very Short Introduction to R and RStudio

2 Probability Theory
    2.1 Random Variables and Probability Distributions
    2.2 Random Sampling and the Distribution of Sample Averages
    2.3 Exercises

3 A Review of Statistics using R
    3.1 Estimation of the Population Mean
    3.2 Properties of the Sample Mean
    3.3 Hypothesis Tests Concerning the Population Mean
    3.4 Confidence Intervals for the Population Mean
    3.5 Comparing Means from Different Populations
    3.6 An Application to the Gender Gap of Earnings
    3.7 Scatterplots, Sample Covariance and Sample Correlation
    3.8 Exercises

4 Linear Regression with One Regressor
    4.1 Simple Linear Regression
    4.2 Estimating the Coefficients of the Linear Regression Model
    4.3 Measures of Fit
    4.4 The Least Squares Assumptions
    4.5 The Sampling Distribution of the OLS Estimator
    4.6 Exercises

5 Hypothesis Tests and Confidence Intervals in the Simple Linear Regression Model
    5.1 Testing Two-Sided Hypotheses Concerning the Slope Coefficient
    5.2 Confidence Intervals for Regression Coefficients
    5.3 Regression when X is a Binary Variable
    5.4 Heteroskedasticity and Homoskedasticity
    5.5 The Gauss-Markov Theorem
    5.6 Using the t-Statistic in Regression When the Sample Size Is Small
    5.7 Exercises

6 Regression Models with Multiple Regressors
    6.1 Omitted Variable Bias
    6.2 The Multiple Regression Model
    6.3 Measures of Fit in Multiple Regression
    6.4 OLS Assumptions in Multiple Regression
    6.5 The Distribution of the OLS Estimators in Multiple Regression
    6.6 Exercises

7 Hypothesis Tests and Confidence Intervals in Multiple Regression
    7.1 Hypothesis Tests and Confidence Intervals for a Single Coefficient
    7.2 An Application to Test Scores and the Student-Teacher Ratio
    7.3 Joint Hypothesis Testing Using the F-Statistic
    7.4 Confidence Sets for Multiple Coefficients
    7.5 Model Specification for Multiple Regression
    7.6 Analysis of the Test Score Data Set
    7.7 Exercises

8 Nonlinear Regression Functions
    8.1 A General Strategy for Modelling Nonlinear Regression Functions
    8.2 Nonlinear Functions of a Single Independent Variable
    8.3 Interactions Between Independent Variables
    8.4 Nonlinear Effects on Test Scores of the Student-Teacher Ratio
    8.5 Exercises

9 Assessing Studies Based on Multiple Regression
    9.1 Internal and External Validity
    9.2 Threats to Internal Validity of Multiple Regression Analysis
    9.3 Internal and External Validity when the Regression is Used for Forecasting
    9.4 Example: Test Scores and Class Size
    9.5 Exercises

10 Regression with Panel Data
    10.1 Panel Data
    10.2 Panel Data with Two Time Periods: “Before and After” Comparisons
    10.3 Fixed Effects Regression
    10.4 Regression with Time Fixed Effects
    10.5 The Fixed Effects Regression Assumptions and Standard Errors for Fixed Effects Regression
    10.6 Drunk Driving Laws and Traffic Deaths

11 Regression with a Binary Dependent Variable
    11.1 Binary Dependent Variables and the Linear Probability Model
    11.2 Probit and Logit Regression
    11.3 Estimation and Inference in the Logit and Probit Models
    11.4 Application to the Boston HMDA Data
    11.5 Exercises

12 Instrumental Variables Regression
    12.1 The IV Estimator with a Single Regressor and a Single Instrument
    12.2 The General IV Regression Model
    12.3 Checking Instrument Validity
    12.4 Application to the Demand for Cigarettes
    12.5 Where Do Valid Instruments Come From?
    12.6 Exercises

13 Experiments and Quasi-Experiments
    13.1 Potential Outcomes, Causal Effects and Idealized Experiments
    13.2 Threats to Validity of Experiments
    13.3 Experimental Estimates of the Effect of Class Size Reductions
    13.4 Quasi Experiments

14 Introduction to Time Series Regression and Forecasting
    14.1 Using Regression Models for Forecasting
    14.2 Time Series Data and Serial Correlation
    14.3 Autoregressions
    14.4 Can You Beat the Market? (Part I)
    14.5 Additional Predictors and The ADL Model
    14.6 Lag Length Selection Using Information Criteria
    14.7 Nonstationarity I: Trends
    14.8 Nonstationarity II: Breaks
    14.9 Can You Beat the Market? (Part II)

15 Estimation of Dynamic Causal Effects
    15.1 The Orange Juice Data
    15.2 Dynamic Causal Effects
    15.3 Dynamic Multipliers and Cumulative Dynamic Multipliers
    15.4 HAC Standard Errors
    15.5 Estimation of Dynamic Causal Effects with Strictly Exogeneous Regressors
    15.6 Orange Juice Prices and Cold Weather

16 Additional Topics in Time Series Regression
    16.1 Vector Autoregressions
    16.2 Orders of Integration and the DF-GLS Unit Root Test
    16.3 Cointegration
    16.4 Volatility Clustering and Autoregressive Conditional Heteroskedasticity

Preface

Chair of Econometrics
Department of Business Administration and Economics
University of Duisburg-Essen
Essen, Germany
[email protected]

Last updated on Wednesday, October 17, 2018

In recent years, the statistical programming language R has become an integral part of the curricula of the econometrics classes we teach at the University of Duisburg-Essen. We regularly found that a large share of the students, especially in our introductory undergraduate econometrics courses, have not been exposed to any programming language before and thus have difficulty learning R on their own. With little background in statistics and econometrics, it is natural for beginners to have a hard time understanding the benefits of R skills for learning and applying econometrics. These benefits include the ability to conduct, document and communicate empirical studies, and the means to program simulation studies, which are helpful for, e.g., comprehending and validating theorems that are usually not easily grasped by merely brooding over formulas. Being applied economists and econometricians, we value all of these capabilities and wish to share them with our students.

Instead of confronting students with pure coding exercises and complementary classic literature like the book by Venables and Smith (2010), we figured it would be better to provide interactive learning material that blends R code with the contents of the well-received textbook Introduction to Econometrics by Stock and Watson (2015), which serves as a basis for the lecture. This material is gathered in the present book, Introduction to Econometrics with R, an empirical companion to Stock and Watson (2015). It is an interactive script in the style of a reproducible research report that enables students not only to learn how the results of case studies can be replicated with R but also to strengthen their ability to use the newly acquired skills in other empirical applications.
Conventions Used in this Book

• Italic text indicates new terms, names, buttons and the like.
• Constant width text is generally used in paragraphs to refer to R code. This includes commands, variables, functions, data types, databases and file names.
• Constant width text on gray background indicates R code that can be typed literally by you. It may appear in paragraphs to better distinguish executable from non-executable code statements, but it will mostly be encountered in the shape of large blocks of R code. These blocks are referred to as code chunks.

Acknowledgement

We thank the Stifterverband für die Deutsche Wissenschaft e.V. and the Ministry of Science and Research North Rhine-Westphalia for their financial support. Also, we are grateful to Alexander Blasberg for proofreading and his effort in helping with programming the exercises. A special thanks goes to Achim Zeileis (University of Innsbruck) and Christian Kleiber (University of Basel) for their advice and constructive criticism. Another thanks goes to Rebecca Arnold from the Münster University of Applied Sciences for several suggestions regarding the website design and for providing us with her nice designs for the book cover, logos and icons. We are also indebted to all past students of our introductory econometrics courses at the University of Duisburg-Essen for their feedback.

Chapter 1

Introduction

Interest in the freely available statistical programming language and software environment R (R Core Team, 2018) is soaring. By the time we wrote the first drafts for this project, more than 11000 add-ons (many of them providing cutting-edge methods) had been made available on the Comprehensive R Archive Network (CRAN), an extensive network of FTP servers around the world that store identical and up-to-date versions of R code and its documentation. R dominates other (commercial) software for statistical computing in most fields of research in applied statistics. The fact that it is freely available and open source, together with a large and constantly growing community of users who contribute to CRAN, renders R more and more appealing for empirical economists and econometricians alike.

A striking advantage of using R in econometrics is that it enables students to explicitly document their analysis step-by-step such that it is easy to update and to expand. This makes it possible to re-use code for similar applications with different data. Furthermore, R programs are fully reproducible, which makes it straightforward for others to comprehend and validate results.

In recent years, R has thus become an integral part of the curricula of the econometrics classes we teach at the University of Duisburg-Essen. In some sense, learning to code is comparable to learning a foreign language, and continuous practice is essential for success. Needless to say, presenting bare R code on slides does not encourage students to gain hands-on experience on their own. This is why hands-on work with R is crucial. As for accompanying literature, there are some excellent books that deal with R and its applications to econometrics, e.g., Kleiber and Zeileis (2008). However, such sources may be somewhat beyond the scope of undergraduate students in economics who have little understanding of econometric methods and barely any experience in programming at all.
Consequently, we started to compile a collection of reproducible reports for use in class. These reports provide guidance on how to implement selected applications from the textbook Introduction to Econometrics (Stock and Watson, 2015), which serves as a basis for the lecture and the accompanying tutorials. This process was facilitated considerably by knitr (Xie, 2018b) and R markdown (Allaire et al., 2018). In conjunction, both R packages provide powerful functionality for dynamic report generation, which makes it possible to seamlessly combine plain text, LaTeX, R code and its output in a variety of formats, including PDF and HTML. Moreover, writing and distributing reproducible reports for use in academia has been enriched tremendously by the bookdown package (Xie, 2018a), which has become our main tool for this project. bookdown builds on top of R markdown and allows us to create appealing HTML pages like this one, among other things.

Inspired by Using R for Introductory Econometrics (Heiss, 2016)¹ and with this powerful toolkit at hand, we wrote up our own empirical companion to Stock and Watson (2015). The result, which you have started to look at, is Introduction to Econometrics with R.

Similarly to the book by Heiss (2016), this project is neither a comprehensive econometrics textbook nor is it intended to be a general introduction to R. We feel that Stock and Watson do a great job at explaining the intuition and theory of econometrics, and at any rate better than we could in yet another introductory textbook! Introduction to Econometrics with R is best described as an interactive script in the style of a reproducible research report which aims to provide students with a platform-independent e-learning arrangement by seamlessly intertwining theoretical core knowledge and empirical skills in undergraduate econometrics. Of course, the focus is on empirical applications with R. We leave out derivations and proofs wherever we can. Our goal is to enable students not only to learn how results of case studies can be replicated with R but also to strengthen their ability to use the newly acquired skills in other empirical applications, immediately within Introduction to Econometrics with R.

¹ Heiss (2016) builds on the popular Introductory Econometrics (Wooldridge, 2016) and demonstrates how to replicate the applications discussed therein using R.

To realize this, each chapter contains interactive R programming exercises. These exercises are used as supplements to code chunks that display how previously discussed techniques can be implemented within R. They are generated using the DataCamp light widget and are backed by an R session which is maintained on DataCamp’s servers. You may play around with the example exercise presented below.

As you can see above, the widget consists of two tabs. script.R mimics an .R-file, a file format that is commonly used for storing R code. Lines starting with a # are commented out, that is, they are not recognized as code. Furthermore, script.R works like an exercise sheet where you may write down the solution you come up with. If you hit the button Run, the code will be executed, submission correctness tests are run and you will be notified whether your approach is correct. If it is not correct, you will receive feedback suggesting improvements or hints. The other tab, R Console, is a fully functional R console that can be used for trying out solutions to exercises before submitting them.
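To make the role of comment lines concrete, the following is a minimal sketch of what a file such as script.R might contain. The variable name my_number and the value 42 are hypothetical choices made for illustration; they are not taken from the widget itself.

```r
# Lines starting with # are comments: R skips them when the script is run.
# Hypothetical task: assign a numeric value to a variable, then print it.

# my_number <- 1   # this line is commented out, so it is never executed

my_number <- 42    # assign a numeric value (42 is an arbitrary choice)
print(my_number)   # print the variable's content to the console
```

Running the script prints [1] 42 to the console. Removing the leading # from the commented-out assignment would make R execute that line as well.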
Of course you may submit (almost any) R code and use the console to play around and explore. Simply type a command and hit the Enter key on your keyboard.

As an example, consider the line of code presented in the chunk below. It tells R to compute the number of packages available on CRAN. The code chunk is followed by the output produced.

    # check the number of R packages available on CRAN
    nrow(available.packages(repos = "http://cran.us.r-project.org"))
    ## [1] 13204

Each code chunk is equipped with a button on the outer right hand side which copies the code to your clipboard. This makes it convenient to work with larger code segments in your version of R/RStudio or in the widgets presented throughout the book. In the widget above, you may click on R Console and type nrow(available.packages(repos = "http://cran.us.r-project.org")) (the command from the code chunk above) and execute it by hitting Enter on your keyboard.²

Note that some lines in the widget are commented out; they ask you to assign a numeric value to a variable and then to print the variable’s content to the console. You may enter your solution in script.R and hit the button Run in order to get the feedback described further above. In case you do not know how to solve this sample exercise (don’t panic, that is probably why you are reading this), a click on Hint will provide you with some advice. If you still can’t find a solution, a click on Solution will provide you with another tab, Solution.R, which contains sample solution code. It will often be the case that exercises can be solved in many different ways, and Solution.R presents what we consider comprehensible and idiomatic.

² The R session is initialized by clicking into the widget. This might take a few seconds. Just wait for the indicator next to the button Run to turn green.

1.1 A Very Short Introduction to R and RStudio

R Basics

As mentioned before, this book is not intended to be an introduction to R but a guide on how to use its capabilities for applications commonly encountered in undergraduate econometrics. Those having basic knowledge of R programming will feel comfortable starting with Chapter 2. This section, however, is meant for those who have not worked with R or RStudio before. If you at least know how to create objects and call functions, you can skip it. If you would like to refresh your skills or get a feeling for how to work with RStudio, keep reading.

Figure 1.1: RStudio: the four panes

First of all, start RStudio and open a new R script by selecting File, New File, R Script. In the editor pane, type 1 + 1 and click on the button labeled Run in the top right corner of the editor. By doing so, your line of code is sent to the console and the result of this operation should be displayed right underneath it. As you can see, R works just like a calculator. You can do all arithmetic calculations by using the corresponding operator (+, -, *, / or ^). If you are not sure what the last operator does, try it out and check the results.

Vectors

R is of course more sophisticated than that. We can work with variables or, more generally, objects. Objects are defined by using the assignment operator