
Lecture 1: Intro to Spatial and Temporal Data
Dennis Sun, Stanford University
Stats 253, June 22, 2015

Outline
1. What is Spatial and Temporal Data?
2. Trend Modeling
3. Omitted Variables
4. Overview of this Class

1. What is Spatial and Temporal Data?

Temporal Data
Temporal data are also called time series. [Figure: monthly rainfall in San Francisco, 1995–2015.]

Spatial Data
Spatial observations can be areal units... [Figure: percent of votes for George W. Bush in the 2004 election.]

Spatial Data
...or points in space. [Figure: San Jose house prices from zillow.com.]

What do the two have in common?
[Figure: the monthly rainfall series again, 1995–2015.] Observations that are close in time or space are similar.

Why is this the case?
Common or similar factors drive observations that are nearby in time and space.
- The meteorological phenomena that drive rainfall (e.g., El Niño) in one month typically last a few months.
- Religion and race are strong predictors of voters' choices, and these are likely to be similar in nearby regions.
- School quality is a strong predictor of house prices, and nearby houses belong to the same school district.
To make this precise, assume that each observation $y_i$ can be modeled as a function of predictors $x_i$:
$$y_i = \underbrace{f(x_i)}_{\text{trend}} + \underbrace{\epsilon_i}_{\text{noise}}.$$

2. Trend Modeling

Linear Models
We will focus on the most common model for the trend, a linear model:
$$f(x_i) = x_i^T \beta,$$
although there are others (loess, splines, etc.). We estimate $\beta$ by ordinary least squares (OLS):
$$\hat{\beta} \overset{\text{def}}{=} \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 = \operatorname*{argmin}_{\beta} \|y - X\beta\|^2 = (X^T X)^{-1} X^T y.$$
Is this a good estimator?
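As a quick sanity check (not from the slides), the closed-form estimator can be computed directly in R and compared with lm(); the data below are simulated purely for illustration.

```r
# Sketch: compute the OLS estimator from the closed-form formula and compare to lm().
# The data are simulated purely for illustration.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
X  <- cbind(1, x1, x2)                      # design matrix with an intercept column
beta <- c(2, 1, -0.5)
y  <- drop(X %*% beta + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
cbind(by_hand = drop(beta_hat), from_lm = coef(lm(y ~ x1 + x2)))
```

Solving the linear system with solve(A, b) is numerically preferable to explicitly inverting $X^T X$.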

Properties of OLS
If we assume that $y = X\beta + \epsilon$, where $E[\epsilon \mid X] = 0$, then
$$\hat{\beta} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\beta + \epsilon) = \beta + (X^T X)^{-1} X^T \epsilon.$$
Then,
$$E[\hat{\beta} \mid X] = \beta + E[(X^T X)^{-1} X^T \epsilon \mid X] = \beta,$$
so the OLS estimator is unbiased. In fact, it is the best linear unbiased estimator. (More on this next time.)
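A minimal Monte Carlo sketch of the unbiasedness claim (the true $\beta$ and all simulation settings here are invented for illustration): averaging $\hat{\beta}$ over many datasets generated with $E[\epsilon \mid X] = 0$ should recover $\beta$.

```r
# Sketch: check unbiasedness of OLS by simulation (illustrative settings only).
set.seed(2)
n <- 50
x <- rnorm(n)
X <- cbind(1, x)        # fixed design across simulations
beta <- c(2, 1)

beta_hats <- replicate(5000, {
  eps <- rnorm(n)                            # E[eps | X] = 0
  y   <- drop(X %*% beta + eps)
  drop(solve(t(X) %*% X, t(X) %*% y))
})
rowMeans(beta_hats)     # should be close to c(2, 1)
```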

Example: House Prices in Florida

```
Call:
lm(formula = price ~ size + beds + baths + new, data = houses)

Residuals:
     Min       1Q   Median       3Q      Max
-215.747  -30.833   -5.574   18.800  164.471

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.84922   27.26116  -1.058  0.29262
size          0.11812    0.01232   9.585 1.27e-15 ***
beds         -8.20238   10.44984  -0.785  0.43445
baths         5.27378   13.08017   0.403  0.68772
new          54.56238   19.21489   2.840  0.00553 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54.25 on 95 degrees of freedom
Multiple R-squared:  0.7245,    Adjusted R-squared:  0.713
F-statistic: 62.47 on 4 and 95 DF,  p-value: < 2.2e-16
```

Where do the standard errors come from?
If we further assume $\operatorname{Var}[\epsilon \mid X] = \sigma^2 I$, then we can calculate:
$$\operatorname{Var}[\hat{\beta} \mid X] = \operatorname{Var}\!\left[\beta + (X^T X)^{-1} X^T \epsilon \mid X\right] = \left((X^T X)^{-1} X^T\right) \operatorname{Var}[\epsilon \mid X] \underbrace{\left((X^T X)^{-1} X^T\right)^T}_{X (X^T X)^{-1}} = \sigma^2 \left((X^T X)^{-1} X^T\right)\left(X (X^T X)^{-1}\right) = \sigma^2 (X^T X)^{-1}.$$
Since $\hat{\beta}$ is a random vector, this is a covariance matrix:
$$\operatorname{Var}(\hat{\beta}) = \begin{pmatrix} \operatorname{Var}(\hat{\beta}_1) & \operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_2) & \cdots & \operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_p) \\ \operatorname{Cov}(\hat{\beta}_2, \hat{\beta}_1) & \operatorname{Var}(\hat{\beta}_2) & \cdots & \operatorname{Cov}(\hat{\beta}_2, \hat{\beta}_p) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(\hat{\beta}_p, \hat{\beta}_1) & \operatorname{Cov}(\hat{\beta}_p, \hat{\beta}_2) & \cdots & \operatorname{Var}(\hat{\beta}_p) \end{pmatrix}.$$
The square roots of the diagonal elements give us the standard errors, i.e., $SE(\hat{\beta}_j) = \sqrt{\operatorname{Var}(\hat{\beta}_j)}$.
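The same calculation can be checked in R. This is only a sketch: it uses the built-in mtcars data rather than the house-price data above, since that data frame is not included here.

```r
# Sketch: reproduce lm()'s standard errors from sigma^2 (X^T X)^{-1}.
# Uses the built-in mtcars data purely for illustration.
fit <- lm(mpg ~ wt + hp, data = mtcars)

X      <- model.matrix(fit)
res    <- residuals(fit)
n      <- nrow(X)
p      <- ncol(X)
sigma2 <- sum(res^2) / (n - p)               # usual unbiased estimate of sigma^2

V  <- sigma2 * solve(t(X) %*% X)             # estimated Var(beta_hat | X)
se <- sqrt(diag(V))                          # standard errors

cbind(by_hand = se, from_lm = summary(fit)$coefficients[, "Std. Error"])
```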

3. Omitted Variables

What happens if we omit a variable?
Suppose the following model for house prices is correct:
$$\text{price}_i = \underbrace{\beta_0 + \beta_1\,\text{size}_i + \beta_2\,\text{new}_i}_{\text{trend}} + \underbrace{\epsilon_i}_{\text{noise}},$$
where $E[\epsilon \mid \text{size}, \text{new}] = 0$ and $\operatorname{Var}[\epsilon \mid \text{size}, \text{new}] \propto I$. Suppose we don't actually have data about whether a house is new or not. We omit it from our model, so new becomes part of the noise:
$$\text{price}_i = \underbrace{\beta_0 + \beta_1\,\text{size}_i}_{\text{trend}} + \underbrace{\beta_2\,\text{new}_i + \epsilon_i}_{\text{noise}}.$$
Is this a problem? We are fine as long as
$$E[\text{noise} \mid \text{size}] = 0 \quad\text{and}\quad \operatorname{Var}[\text{noise} \mid \text{size}] \propto I.$$

Omitted Variable Bias
Suppose the first condition is violated, i.e., $E[\text{noise} \mid \text{size}] \neq 0$, i.e.,
$$E[\beta_2\,\text{new} + \epsilon \mid \text{size}] \neq 0.$$
Since $E[\epsilon \mid \text{size}] = 0$, this means
$$E[\beta_2\,\text{new} \mid \text{size}] \neq 0.$$
Two things have to happen for this situation to occur:
- $\beta_2 \neq 0$: the omitted variable is relevant for predicting the response.
- $E[\text{new} \mid \text{size}] \neq 0$: the omitted variable is correlated with a predictor in the model.
Omitted variables are also called confounders. Since $E[\text{noise} \mid \text{size}] \neq 0$, $\hat{\beta}_1$ is no longer unbiased for $\beta_1$.
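A short simulated example (all coefficients and distributions invented for illustration) of both conditions biting at once: new is relevant and is correlated with size, so dropping it biases the coefficient on size.

```r
# Sketch: omitted variable bias in a simulated version of the house-price example.
# All numbers are invented for illustration.
set.seed(3)
n    <- 200
size <- rnorm(n, mean = 1500, sd = 300)
# The omitted variable is correlated with the included predictor:
# larger houses are more likely to be new.
new  <- rbinom(n, 1, prob = plogis((size - 1500) / 300))
price <- 30 + 0.1 * size + 50 * new + rnorm(n, sd = 40)

coef(lm(price ~ size + new))["size"]   # close to the true 0.1
coef(lm(price ~ size))["size"]         # biased upward: new is absorbed into the noise
```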

Correlated Noise
Suppose we are reasonably convinced that new is not correlated with size in our dataset, so we will be able to obtain an unbiased estimator for the effect of size on house prices. But in order for the standard errors to be valid, we need
$$\operatorname{Var}[\beta_2\,\text{new} + \epsilon \mid \text{size}] \propto I.$$
This depends on whether
$$\operatorname{Var}[\text{new} \mid \text{size}] \propto I,$$
but chances are
$$\operatorname{Cov}[\text{new}_i, \text{new}_j \mid \text{size}] \neq 0.$$

A Simulation Study
Suppose we have $n = 20$ observations from
$$y_t = \beta x_t + \epsilon_t, \quad \beta = 1,$$
where $\epsilon_t$ is correlated (generated from an AR(1) process). [Figure: histogram of the OLS estimates $\hat{\beta}$ obtained over 10000 simulations.] According to the simulations, $E[\hat{\beta} \mid x] \approx 1$, so $\hat{\beta}$ is unbiased, and $SE[\hat{\beta} \mid x] \approx 0.15$.

A Simulation Study (continued)
For the same setup, here are the naive SEs from calling the lm function in R. [Figure: histogram of the naive OLS standard errors across the 10000 simulations.] OLS does not estimate the standard error appropriately.
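A sketch of a simulation in this spirit is below. The slides do not give the AR coefficient or the form of x, so those are assumptions; the point is only to compare the true sampling variability of $\hat{\beta}$ with the naive SEs that lm reports.

```r
# Sketch: OLS with AR(1) errors. The AR coefficient (0.7) and the trending
# covariate x are assumptions; the slides do not specify them.
set.seed(4)
n    <- 20
x    <- scale(1:n)[, 1]        # a fixed, smooth covariate
beta <- 1

one_fit <- function() {
  eps <- arima.sim(model = list(ar = 0.7), n = n)   # correlated AR(1) noise
  y   <- beta * x + eps
  fit <- lm(y ~ x - 1)                              # matches y_t = beta * x_t + eps_t
  c(est      = coef(fit)[["x"]],
    naive_se = summary(fit)$coefficients["x", "Std. Error"])
}

sims <- replicate(10000, one_fit())
mean(sims["est", ])        # close to 1: OLS is still unbiased
sd(sims["est", ])          # the true sampling variability of beta_hat
mean(sims["naive_se", ])   # the naive lm SE, which disagrees with the line above
```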

4. Overview of this Class

Why study spatial and temporal statistics?
The focus of this class will be supervised learning when the error is correlated:
$$y_i = f(x_i) + \epsilon_i.$$
We will assume that the omitted variables do not lead to bias ($E[\epsilon \mid X] = 0$). If the omitted variables all have a spatial or temporal structure, then we can try to model it explicitly:
$$\operatorname{Cov}[\epsilon_i, \epsilon_j \mid X] = g(d(i, j)).$$
This will allow us to (1) obtain correct inferences for the variables in the model and (2) obtain a more efficient estimator than the OLS estimator.
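For instance, one common illustrative choice for $g$ (not necessarily the one this class will use) is a covariance that decays exponentially in the distance between observations; a minimal sketch:

```r
# Sketch: an exponentially decaying covariance as a function of distance.
# The parameter values (sigma2, range) are illustrative only.
g <- function(d, sigma2 = 1, range = 2) sigma2 * exp(-d / range)

# Covariance matrix implied for 10 equally spaced time points:
t_pts <- 1:10
Sigma <- g(as.matrix(dist(t_pts)))      # Cov[eps_i, eps_j | X] = g(|i - j|)
round(Sigma[1:4, 1:4], 3)
```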

Course Requirements
We'll have 3 homeworks, which will be coding / data analysis. We'll also have 3 in-class quizzes, which will go over the conceptual issues. These will be graded on a check / resubmit basis. For those taking the class for a letter grade, the grade will be based primarily on a final project.

Structure of the Class
This class will meet Monday, Wednesday, Friday at 2:15pm for the first four weeks. The last four weeks will be dedicated to your final project. I will schedule individual meetings with students, and there may be sporadic lectures covering topics of interest to the class.

Course Website
The course website is stats253.stanford.edu. All materials (syllabus, lecture slides, homeworks) will be posted here. All homework will be submitted through this course website.