Introduction to Regression Analysis. Dr. Devlina Chatterjee, 11th August, 2017

What is regression analysis? Regression analysis is a statistical technique for studying linear relationships between one dependent variable and one or more independent variables, often called predictor variables. Regression is used more and more in analytics research that affects many aspects of our daily lives, from how much we pay for automobile insurance to which ads appear in our social media.

Examples of questions where regression analysis is used: What is the effect of one more year of education on the income of an individual? What are the factors that indicate whether a particular individual will get a job? Can we predict how the price of a particular stock will change in the next few weeks? How will a particular policy, such as a change in taxes on cigarettes, affect the incidence of smoking in a state? How long will a patient survive after being given a particular treatment, compared to not being given that treatment?

Population vs. Sample. The research question may be: does more education lead to more income? We want to understand the relationship between two variables (education and income) in the population. We do not have data for every person in the population, so we look at data for a smaller sample drawn from the population. If the sample is large enough and drawn randomly from the population, then we can make inferences about the population from the relationships observed in the sample. We can draw such inferences because of two fundamental theorems of probability: the Law of Large Numbers and the Central Limit Theorem.

Sampling Distributions. Suppose that we draw all possible samples of size n from a given population and compute a statistic (e.g., a mean, proportion, or standard deviation) for each sample. The probability distribution of that statistic is called a sampling distribution. The standard deviation of this statistic is called the standard error. The mean and variance of the sampling distribution of $\bar{Y}$ are given by $E(\bar{Y}) = \mu_Y$ and $\mathrm{Var}(\bar{Y}) = \sigma_Y^2 / n$. 1. As n increases, the distribution of $\bar{Y}$ becomes more tightly centered around $\mu_Y$ (the Law of Large Numbers). 2. Moreover, the distribution of $\bar{Y}$ becomes normal (the Central Limit Theorem).

Law of Large Numbers The law of large numbers states that the sample mean converges to the distribution mean as the sample size increases, and is one of the fundamental theorems of probability.

The Central Limit Theorem. The Central Limit Theorem states that the sampling distribution of the sample mean of independent, identically distributed random variables will be normal or nearly normal if the sample size is large enough. (Figure: sampling distribution of $\bar{Y}$ when Y is Bernoulli, p = 0.78.)
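
To make these two theorems concrete, here is a minimal R sketch (R is the software whose output appears later in these slides; the sample sizes and number of replications are chosen arbitrarily for illustration) that simulates the sampling distribution of the sample mean when Y is Bernoulli with p = 0.78.

# Simulate the sampling distribution of the mean of Bernoulli(p = 0.78) draws
set.seed(123)
p <- 0.78
for (n in c(5, 25, 100, 400)) {
  # draw 10,000 samples of size n and compute each sample mean
  means <- replicate(10000, mean(rbinom(n, size = 1, prob = p)))
  # LLN: the means concentrate around p; CLT: their distribution looks normal
  cat("n =", n, " mean =", round(mean(means), 3),
      " sd =", round(sd(means), 4), "\n")
}
hist(means, breaks = 30, main = "Sampling distribution of the mean, n = 400")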

Simple Linear Regression. We begin by supposing a general form for the relationship, known as the population regression model: $Y = \beta_0 + \beta_1 X_1 + e$, where Y is the dependent variable and $X_1$ is the independent variable. Our primary interest is to estimate $\beta_0$ and $\beta_1$. These are estimated from a sample of data, and the estimates are termed $\hat{\beta}_0$ and $\hat{\beta}_1$.

Reasons for doing Regression: Determining whether there is evidence that an explanatory variable belongs in a regression model (statistical significance of a variable). Estimating the effect of an explanatory variable on the dependent variable (effect size, i.e., the magnitude of the coefficient). Measuring the explanatory power of a model ($R^2$). Making predictions for individuals not in the original sample.

Regression between weight and height. (Figure: scatter plot of weight against height with residuals $e_1$, $e_2$, $e_3$ marked and fitted lines from two samples; Sample 1: $\hat{\beta}_0 = 105.011$, $\hat{\beta}_1 = 1.108$; Sample 2: $\hat{\beta}_0 = 114.3$, $\hat{\beta}_1 = 1.065$ with height in cm, equivalently 106.5 with height in metres.)
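
A minimal R sketch of the same idea, using simulated data rather than the actual samples from the slide, so the coefficient values here are illustrative only: two random samples drawn from the same population give slightly different estimates of the intercept and slope.

# Two random samples from the same population give slightly different fits
set.seed(1)
draw_sample <- function(n = 100) {
  height <- rnorm(n, mean = 165, sd = 10)            # height in cm
  weight <- -100 + 1.05 * height + rnorm(n, sd = 6)  # assumed "true" relationship plus noise
  data.frame(height, weight)
}
fit1 <- lm(weight ~ height, data = draw_sample())
fit2 <- lm(weight ~ height, data = draw_sample())
coef(fit1)  # intercept and slope from sample 1
coef(fit2)  # slightly different intercept and slope from sample 2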

The idea of statistical significance. Someone may look at the regression results from Sample 1 and Sample 2 and see that the results are slightly different. How much confidence can we have in the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ estimated from our first sample? We need to test the hypothesis that there is indeed a non-zero correlation between Y and X, which translates to testing the null hypotheses $\beta_0 = 0$ and $\beta_1 = 0$.

Hypothesis Testing. We test inferences about population parameters using data from a sample. In order to test a hypothesis in statistics, we must perform the following steps: 1. Formulate a null hypothesis and an alternative hypothesis about the population parameters, $H_0: \mu_X = \mu_{X,0}$ vs. $H_A: \mu_X \neq \mu_{X,0}$. 2. Build a statistic to test the hypothesis, $z = \dfrac{\bar{X} - \mu_{X,0}}{s / \sqrt{n}}$, where $\bar{X}$ is the sample mean and s is the sample standard deviation. 3. Define a decision rule to reject or not to reject the null hypothesis.
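
As an example of the three steps, here is a sketch in R using a one-sample t-test (t replaces z in small samples; the data and the hypothesized mean of 50 are invented for illustration):

# Step 1: H0: mu_X = 50 vs. HA: mu_X != 50
set.seed(42)
x <- rnorm(30, mean = 52, sd = 5)      # hypothetical sample of size n = 30
# Step 2: test statistic t = (xbar - 50) / (s / sqrt(n))
t_stat <- (mean(x) - 50) / (sd(x) / sqrt(length(x)))
# Step 3: decision rule -- reject H0 at the 5% level if |t| exceeds the critical value
reject <- abs(t_stat) > qt(0.975, df = length(x) - 1)
t.test(x, mu = 50)                     # built-in version of the same test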

Statistical Significance: p-values. We have to test the null hypothesis $\beta_1 = 0$, that is, that there is no correlation. We find the standard error associated with the estimated beta as follows: $SE(\hat{\beta}_1) = \sqrt{\dfrac{\frac{1}{n-2} \sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (X_{1i} - \bar{X}_1)^2}}$. Given the estimate $\hat{\beta}_1$ and the standard error of the estimate $SE(\hat{\beta}_1)$, we calculate a t-statistic for $\hat{\beta}_1$: $t = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$. If the t-statistic is greater than 1.96 in absolute value, we can reject the null hypothesis $\beta_1 = 0$ with 95% confidence. The p-value associated with each variable gives the probability that we could have observed a value of $\hat{\beta}_1$ this large or larger if the true value of $\beta_1$ were in fact 0. A very small p-value indicates that there is a very small probability of the real $\beta_1$ being 0, i.e., that there is a statistically significant relationship between Y and X that is not just due to chance alone.
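
A sketch of this calculation in R, reusing the simulated height-weight data from the earlier sketch (the variable names are ours, not from the slides); the hand-computed standard error, t-statistic and p-value match the ones reported by summary():

d   <- draw_sample()                       # simulated data from the sketch above
fit <- lm(weight ~ height, data = d)
b1  <- coef(fit)["height"]
# SE(beta1_hat) = sqrt( (1/(n-2)) * sum(residuals^2) / sum((x - xbar)^2) )
n    <- nrow(d)
se1  <- sqrt(sum(residuals(fit)^2) / (n - 2) / sum((d$height - mean(d$height))^2))
tval <- b1 / se1                           # t-statistic for H0: beta1 = 0
pval <- 2 * pt(-abs(tval), df = n - 2)     # two-sided p-value
c(estimate = b1, se = se1, t = tval, p = pval)
summary(fit)$coefficients["height", ]      # same numbers from R's summary()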

R output

Multiple Linear Regression. Used when there is more than one explanatory variable: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + e$. Why include more than one variable in your regression? To avoid omitted variable bias (to get the most accurate estimate of $\beta_1$), and to get a model that has higher explanatory and predictive power.

Omitted variable bias. Example: the effect of one more year of education on the income of an individual. People who attain higher levels of education often have parents who have higher incomes. Children of high-income parents may have access to better social connections and so get better jobs. When we regress income on education and do not include parental income as a variable, we miss this effect: we may conclude that higher levels of education led to the children getting higher incomes, and arrive at biased estimates of $\beta_1$. Children of poorer parents who are just as well educated may not in fact get high-income jobs. In order to test whether it was the education or the parental connections that led to higher incomes, we need to include both variables in the model. This gives more accurate estimates of the relationship between the independent variable (education) and the dependent variable (income), so that $\hat{\beta}_1$ does not suffer from omitted variable bias.
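
The following R sketch illustrates omitted variable bias with simulated data (all numbers are invented for the illustration): parental income raises both education and income, so the short regression of income on education alone overstates the education coefficient, while including parental income recovers it.

# Simulated example of omitted variable bias
set.seed(7)
n          <- 2000
parent_inc <- rnorm(n, mean = 50, sd = 10)               # the omitted confounder
education  <- 10 + 0.2 * parent_inc + rnorm(n, sd = 2)   # years of schooling
income     <- 5 + 2.0 * education + 0.5 * parent_inc + rnorm(n, sd = 5)
coef(lm(income ~ education))                # biased: estimate of beta1 is too large
coef(lm(income ~ education + parent_inc))   # close to the true beta1 = 2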

OLS assumptions. Assumption 1: The model is linear in parameters and correctly specified. Assumption 2: The matrix X has full rank, in other words (i) the number of observations exceeds the number of variables and (ii) there is no perfect multicollinearity among the independent variables. Assumption 3: The explanatory variables must be exogenous, $E(e_i \mid X) = 0$. Assumption 4: The error terms are independent and identically distributed, $e_i \sim iid(0, \sigma^2)$, in other words (i) observations are randomly selected, (ii) the errors have constant variance (homoscedasticity), and (iii) there is no autocorrelation. Assumption 5: The error terms are normally distributed in the population. Under these assumptions, the OLS estimators are BLUE (Best Linear Unbiased Estimators).
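
Some of these assumptions can be checked informally in R. A sketch, assuming the lmtest package is installed (it provides bptest() and dwtest()), applied to the simulated regression from the omitted variable bias example above:

# Informal checks of the OLS assumptions on a fitted model
library(lmtest)                    # assumed installed; provides bptest(), dwtest()
fit <- lm(income ~ education + parent_inc)
plot(fitted(fit), residuals(fit))  # look for non-constant variance or patterns
bptest(fit)                        # Breusch-Pagan test for heteroscedasticity
dwtest(fit)                        # Durbin-Watson test for autocorrelation
shapiro.test(residuals(fit))       # rough check of normality of the errors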

Different Kinds of Regression Models. Within the generalized regression framework, the appropriate model depends on the type of dependent variable, the kind of data, and the kind of independent variables:
Type of dependent variable: continuous (linear regression) or discrete (logit / probit regression).
Kind of data: cross-sectional data (the models above); panel data (panel data models: 1. entity fixed effects, 2. time fixed effects, 3. random effects); time-series data (time-series models: 1. AR, 2. MA, 3. ARIMA, 4. VAR, 5. VECM, 6. ARCH, 7. GARCH); duration data (survival analysis models: 1. Cox proportional hazard, 2. accelerated failure time (AFT) models).
Kind of independent variables: exogenous (standard regression) or endogenous (instrumental variables regression, simultaneous equation models).

Panel Data Models. Panel data consist of observations on the same n entities at two or more time periods, t = 1, ..., T. Example: road accidents for all states of India recorded over 20 years. Notation for panel data: $(X_{it}, Y_{it})$ where i = 1, ..., n and t = 1, ..., T. In a balanced panel the variables are observed for each entity in every time period; the panel is unbalanced if data are missing for one or more time periods for some entities. Fixed effects regression controls for omitted variable bias when the omitted variables vary across entities but do not change over time: $Y_{it} = \beta_1 X_{1,it} + \dots + \beta_k X_{k,it} + \alpha_i + u_{it}$. Time fixed effects control for omitted variable bias when the omitted variables vary over time but are constant across entities: $Y_{it} = \beta_1 X_{1,it} + \dots + \beta_k X_{k,it} + \lambda_t + u_{it}$.
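
A sketch of estimating these fixed effects models in R, assuming the plm package is installed and a hypothetical panel data frame accident_df with columns state, year, accidents and enforcement (none of these names come from the slides):

# Entity and time fixed effects with the plm package (assumed installed)
library(plm)
# accident_df: hypothetical panel with columns state, year, accidents, enforcement
pdata <- pdata.frame(accident_df, index = c("state", "year"))
fe_entity <- plm(accidents ~ enforcement, data = pdata, model = "within")
fe_both   <- plm(accidents ~ enforcement, data = pdata, model = "within",
                 effect = "twoways")   # both alpha_i and lambda_t
summary(fe_both)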

Discrete Dependent Variables (Logit / Probit). The dependent variable is not continuous but discrete. Example: will a particular individual pass an exam? The dependent variable is coded as 1 when the individual passes and 0 when the individual fails. We define the dependent variable as the probability of the individual passing the exam: $P(Y = 1 \mid X_1, \dots, X_k) = f(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$.

Logit vs. Probit. Logistic regression: $\Pr(Y = 1 \mid X) = F(\beta_0 + \beta_1 X)$, where $F(\beta_0 + \beta_1 X) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$. Probit regression: $\Pr(Y = 1 \mid X) = \Phi(\beta_0 + \beta_1 X)$. Measures of fit for logit and probit: The fraction correctly predicted is the fraction of observations for which the predicted probability is greater than 50% when $Y_i = 1$, or less than 50% when $Y_i = 0$. The pseudo-$R^2$ measures the improvement in the value of the log-likelihood relative to having no X's. The pseudo-$R^2$ simplifies to the $R^2$ in the linear model with normally distributed errors.
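
In R, both models can be fitted with glm(). A sketch using a hypothetical data frame exams with a 0/1 outcome pass and a predictor hours_studied (both invented for illustration), along with the two fit measures above:

# Logit and probit fits with glm(); 'exams' is a hypothetical data frame
# with a 0/1 outcome 'pass' and a predictor 'hours_studied'
logit_fit  <- glm(pass ~ hours_studied, data = exams, family = binomial(link = "logit"))
probit_fit <- glm(pass ~ hours_studied, data = exams, family = binomial(link = "probit"))
# Fraction correctly predicted, using a 50% cutoff
pred <- predict(logit_fit, type = "response") > 0.5
mean(pred == (exams$pass == 1))
# McFadden pseudo-R^2: improvement in log-likelihood over the intercept-only model
1 - as.numeric(logLik(logit_fit)) / as.numeric(logLik(update(logit_fit, . ~ 1)))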

Time-series models. Data recorded at regular intervals of time, e.g., daily, weekly, monthly or annually. Example: we want to predict the price of stocks based on historical data. The primary objective is prediction, not understanding causality. Components of a time series: trends, seasonal movements, cyclical movements and irregular fluctuations. The dependent variable is regressed on lagged values of the same variable (autocorrelation). We need to ensure stationarity of the data, i.e., that the future will be like the past. A typical specification regresses $y_t$ on its own lags and on lags of an explanatory variable x: $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \dots + \beta_p y_{t-p} + \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \dots + \gamma_q x_{t-q} + \epsilon_t$.

Time-series Models.
AR: autoregressive model, a model based on lagged values of the same dependent variable, hence "auto"-regressive.
MA: moving average model, which smooths past data.
ARIMA: autoregressive integrated moving average model, combining AR and MA components after differencing the data to make it stationary.
VAR: vector autoregressive models, econometric models used to capture the linear interdependencies among multiple time series.
VECM: vector error correction model; if the variables are not covariance stationary but their first differences are, they may be modeled with a VECM.
ARCH: ARCH models are used to model financial time series with time-varying volatility, such as stock prices.
GARCH: if an autoregressive moving average (ARMA) model is assumed for the error variance, the model is a generalized autoregressive conditional heteroscedasticity (GARCH) model.
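
A brief R sketch of fitting a few of these univariate models to a single series y (a made-up series here; the ARIMA orders are illustrative, not a recommendation):

# Fitting simple time-series models to a univariate series y
set.seed(99)
y <- cumsum(rnorm(200)) + 0.05 * (1:200)         # made-up non-stationary series
ar_fit    <- arima(y, order = c(1, 0, 0))        # AR(1)
ma_fit    <- arima(y, order = c(0, 0, 1))        # MA(1)
arima_fit <- arima(diff(y), order = c(1, 0, 1))  # ARMA(1,1) on first differences
# equivalently, arima(y, order = c(1, 1, 1)) differences the series internally
predict(arima_fit, n.ahead = 12)$pred            # 12-step-ahead forecast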