Lecture 3: Just a little more math


Last time: Through simple algebra and some facts about sums of normal random variables, we derived some basic results about regression with orthogonal predictors. We used as our major test case a set of orthogonal polynomials as computed by R. We derived the mean and variance of the OLS estimates and established their distribution; in particular, we found that OLS estimates are unbiased.

Last time: We then considered a model selection rule and demonstrated how the orthogonal model allowed us to compute the best fit exactly, no matter how many variables we entered into our model. We interpreted this rule in terms of a keep-or-kill operator acting on the OLS estimates.

Today: We start with one more property of OLS estimates, as formalized by the Gauss-Markov theorem. We use this result to think about why we might want to use a model selection criterion in the first place. We then introduce a new kind of estimate that acts by shrinking the OLS coefficients; we see that it also has many desirable properties. We finish by seeing this new shrinkage estimate as coming from a penalized least squares criterion.

Some rationale: OK, this class is meant to be an applied tour of regression and, this quarter, generalized linear models. But... sticking to OLS leaves us somewhere in the late 1800s and early 1900s methodologically. Today's lecture will bring us up to the mid-1970s, and by Monday we will be discussing techniques developed in your lifetime.

Some history: Ronald Aylmer Fisher (1890-1962). R. A. Fisher is often credited as the single most important figure in 20th-century statistics. Before Fisher, statistics was "an ingenious collection of ad hoc devices" (Efron, 1996). Statistical Methods for Research Workers (1958) looks remarkably like most of the texts we use for Statistics 11, 13, and so on. [Image by A. Barrington-Brown, www.csse.monash.edu.au/~lloyd/tildeimages/people/fisher.ra]

Some history: Ronald Aylmer Fisher (1890-1962). He is responsible for creating a mathematical framework for statistics, but equally important was his impact on data analysis. Fisher once commented that it was over his calculator that he had learned all he knew. At one time, Fisher believed that model specification was outside the field of mathematical statistics; his work focused on estimation, on assessing uncertainty after a model has been selected. [Image by A. Barrington-Brown, www.csse.monash.edu.au/~lloyd/tildeimages/people/fisher.ra]

Some history: John Wilder Tukey (1915-2000). John Tukey is another pioneer of statistical theory and practice. In his text Exploratory Data Analysis (1977), Tukey created graphical tools for exploring features in data. Many of the graphical displays in R were developed by Tukey. His style is iterative, advocating many different analyses. [Image from the Bell Laboratories archives, circa 1965, stat.bell-labs.com/who/tukey]

What happened between Fisher's time (the 1920s and 30s) and Tukey's (the 1960s and 70s)?

The computer: In 1947 the transistor was developed. In 1964 the IBM 360 was introduced and quickly became the standard institutional mainframe computer. By the end of the 1960s, computer resources were generally available at major institutions. (Taken from a PBS history of the computer, www.pbs.org/nerds/timeline/micro.html)

The computer: The computer changed the process of modeling, allowing statisticians to attempt larger, more challenging problems. In the late 70s, Box and his co-authors articulated a pipeline of analysis, a feedback loop that runs from model specification to fitting and diagnostics and back. Once we were free to think about model selection, there was an irresistible urge to formalize the process. Hence, in the 70s you see the rise of selection criteria like AIC, BIC, and Cp.

And since 1970?

Home computers and the Internet: The first personal computer appears in 1975 (the MITS Altair 8800). In 1977 Apple introduces the Apple II at a price of $1,195, with 16K of RAM and no monitor. The first spreadsheet, VisiCalc, ships in 1979 and is designed for the Apple II. The Apple Macintosh appears in 1984. Microsoft Windows 1.0 ships in 1985. (Taken from a PBS history of the computer, www.pbs.org/nerds/timeline/micro.html)

Home computers and the Internet: In the 1990s there is a migration to "ubiquitous computing": there are small but powerful computers in phones, PDAs, cars, you name it. The internet (or rather a nationwide fiber-optic network) connects us, with wireless access becoming standard. At the same time, technologies for data collection, in particular those associated with environmental monitoring, are undergoing a small revolution in the form of sensor networks. All of these developments have had an enormous impact on the practice of statistics.

Upcoming events: An undergraduate summer program, Data Visualization and Its Role in the Practice of Statistics, http://summer.stat.ucla.edu, June 19-25, 2005, on the UCLA campus. Designed for university sophomores and juniors with backgrounds in some quantitative field, with hands-on projects related to current hot topics in statistics: bioinformatics, computer vision, earth and space sciences, computer networks, and industrial statistics.

... but I digress

Regression revisited: Recall the basic assumptions for the normal linear model. We have a response y and (possible) predictor variables x_1, x_2, ..., x_p. We hypothesize a linear model for the response, y = β_1 x_1 + β_2 x_2 + ... + β_p x_p + ε, where ε has a normal distribution with mean 0 and variance σ².

Regression revisited: Then, we collect data. We have n observations y_1, ..., y_n, and each response y_i is associated with the predictors x_i1, x_i2, ..., x_ip. Then, according to our linear model, y_i = β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip + ε_i, and we assume the ε_i are independent, each having a normal distribution with mean 0 and variance σ².
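To make the setup concrete, here is a small R sketch (my own, not from the lecture) that simulates data from this model; the values of n, p, the coefficients, and σ are all made up for illustration.

set.seed(1)
n     <- 100                              # number of observations
p     <- 3                                # number of predictors
beta  <- c(2, -1, 0.5)                    # hypothetical true coefficients
sigma <- 1                                # hypothetical error sd
X     <- matrix(rnorm(n * p), n, p)       # predictors x_i1, ..., x_ip
eps   <- rnorm(n, mean = 0, sd = sigma)   # independent N(0, sigma^2) errors
y     <- drop(X %*% beta) + eps           # responses y_1, ..., y_n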

Regression revisited: To estimate the coefficients β_1, β_2, ..., β_p we turn to OLS, ordinary least squares (this is also maximum likelihood under our normal linear model). We want to choose β_1, β_2, ..., β_p to minimize the OLS criterion Σ_{i=1}^n [y_i - β_1 x_i1 - β_2 x_i2 - ... - β_p x_ip]². You recall that this is just the sum of squared errors that we incur if we predict y_i with β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip.
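Continuing the sketch above, one way to minimize this criterion in R: lm() solves it in closed form (the "- 1" drops the intercept, matching the model as written), and a direct numerical minimization of the same sum of squares gives essentially the same answer.

fit <- lm(y ~ X - 1)                      # OLS, no intercept term
coef(fit)

sse <- function(b) sum((y - X %*% b)^2)   # the OLS criterion itself
optim(rep(0, p), sse, method = "BFGS")$par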

Regression revisited: Computing the OLS estimates was easy; for example, because our predictors are the orthonormal columns from last time, we can differentiate this expression with respect to β_1 and solve to find β̂_1 = Σ_i y_i x_i1. From here we found that E β̂_1 = β_1 and that var(β̂_1) = σ².
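A quick numerical check of that closed form, in the orthonormal setting of the last lecture (a self-contained sketch; poly() returns columns that are orthonormal to numerical precision, and the degree and coefficients below are made up):

x2 <- runif(100)
P  <- poly(x2, degree = 3)                   # orthonormal columns
y2 <- drop(P %*% c(1, 2, 3)) + rnorm(100, sd = 0.5)
coef(lm(y2 ~ P - 1))                         # OLS estimates
drop(crossprod(P, y2))                       # sum_i y_i x_ij, column by column

The two printed vectors agree, as the formula says they should.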

Regression revisited: OLS estimates have a lot to recommend them. They are unbiased (in repeated sampling). Gauss-Markov theorem: among all linear unbiased estimates, they have the smallest variance (they are BLUE, best linear unbiased estimates).

Regression revisited, the BLUEs: To see this, look at some other linear estimate of β_1; that is, we can write β̃_1 = Σ_i d_i y_i. Then, if we set c_i = d_i - x_i1, we can rewrite it as β̃_1 = β̂_1 + Σ_i c_i y_i.

Regression revisited, the BLUEs: For β̃_1 to be unbiased, we must have E β̃_1 = E β̂_1 + Σ_i c_i E y_i = β_1 + β_1 Σ_i c_i x_i1 + ... + β_p Σ_i c_i x_ip. If this is to hold for all values of β_1, ..., β_p, we must have Σ_i c_i x_i1 = 0, Σ_i c_i x_i2 = 0, and so on.

Regression revisited, the BLUEs: For the variance of β̃_1 we just follow our noses: var β̃_1 = var(Σ_i d_i y_i) = var(Σ_i (c_i + x_i1) y_i) = Σ_i (c_i + x_i1)² var(y_i) = σ² Σ_i x_i1² + σ² Σ_i c_i² + 2σ² Σ_i c_i x_i1 = σ² + σ² Σ_i c_i² (using Σ_i x_i1² = 1 for our orthonormal predictors and the constraint Σ_i c_i x_i1 = 0), which is larger than σ² unless the c_i are all zero.
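A small simulation sketch of what this algebra says (my own illustration, with made-up numbers): on an orthonormal design, a different linear unbiased estimate of β_1 has a larger variance than the OLS estimate. The weights c_i below are built to satisfy the unbiasedness constraints and scaled so that Σ c_i² = 1.

set.seed(2)
n2  <- 200
Q   <- poly(runif(n2), degree = 2)        # orthonormal columns
b2  <- c(1.5, -0.7)                       # hypothetical true coefficients
c_i <- residuals(lm(rnorm(n2) ~ Q))       # orthogonal to both columns of Q
c_i <- c_i / sqrt(sum(c_i^2))             # scale so sum(c_i^2) = 1
d_i <- Q[, 1] + c_i                       # weights for the alternative estimate
one_draw <- function() {
  y <- drop(Q %*% b2) + rnorm(n2)
  c(ols = sum(Q[, 1] * y), alt = sum(d_i * y))
}
est <- replicate(5000, one_draw())
rowMeans(est)        # both close to b2[1] = 1.5: both unbiased
apply(est, 1, var)   # ols near 1, alt near 2: OLS has the smaller variance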

Mean squared error: OK, so the OLS estimates are looking pretty good; but suppose we give up unbiasedness? Define the mean squared error of any estimator β̃ to be E(β̃ - β)² = E(β̃ - E β̃ + E β̃ - β)² = E(β̃ - E β̃)² + (E β̃ - β)² = var β̃ + (E β̃ - β)². This relationship is often called the bias-variance tradeoff; the first term is the variance of the estimator and the second is its squared bias.
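Written out as displayed math (this is just the algebra above; the cross term drops out because E(β̃ - E β̃) = 0):

\[
\begin{aligned}
\mathrm{E}(\tilde\beta - \beta)^2
  &= \mathrm{E}\bigl(\tilde\beta - \mathrm{E}\tilde\beta + \mathrm{E}\tilde\beta - \beta\bigr)^2 \\
  &= \mathrm{E}(\tilde\beta - \mathrm{E}\tilde\beta)^2
     + 2\,(\mathrm{E}\tilde\beta - \beta)\,\mathrm{E}(\tilde\beta - \mathrm{E}\tilde\beta)
     + (\mathrm{E}\tilde\beta - \beta)^2 \\
  &= \operatorname{var}\tilde\beta + (\mathrm{E}\tilde\beta - \beta)^2 .
\end{aligned}
\]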

Mean squared error: The Gauss-Markov theorem implies that our OLS estimates have the smallest mean squared error among all linear estimators with no bias. That begs a question: can there be a biased estimator with a smaller mean squared error?

Shrinkage estimators: Let's consider something that initially seems crazy. We will replace our OLS estimates β̂_k with something slightly smaller, β̃_k = [1/(1 + λ)] β̂_k. If λ is zero, we get our OLS estimates back; if λ gets really, really big, everything is crushed to zero.
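In R the shrinkage operation is a one-liner; a tiny sketch with made-up numbers:

beta_hat <- c(2.1, -0.9, 0.15)   # pretend these are our OLS estimates
lambda   <- 0.5
beta_hat / (1 + lambda)          # the shrunken estimates
beta_hat / (1 + 0)               # lambda = 0 gives OLS back
beta_hat / (1 + 1e6)             # a huge lambda crushes everything toward 0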

Mean squared error: Working out the mean squared error, we again follow our noses to find Σ_{k=1}^p E(β̃_k - β_k)² = Σ_{k=1}^p E[ (1/(1+λ)) β̂_k - β_k ]² = (1/(1+λ))² Σ_{k=1}^p E(β̂_k - β_k)² + (λ/(1+λ))² Σ_{k=1}^p β_k² = (1/(1+λ))² p σ² + (λ/(1+λ))² Σ_{k=1}^p β_k².
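In displayed form (the cross terms vanish because the β̂_k are unbiased, so E(β̂_k - β_k) = 0):

\[
\begin{aligned}
\sum_{k=1}^{p} \mathrm{E}(\tilde\beta_k - \beta_k)^2
  &= \sum_{k=1}^{p} \mathrm{E}\!\left(\frac{1}{1+\lambda}\,\hat\beta_k - \beta_k\right)^2 \\
  &= \left(\frac{1}{1+\lambda}\right)^2 \sum_{k=1}^{p} \mathrm{E}(\hat\beta_k - \beta_k)^2
     + \left(\frac{\lambda}{1+\lambda}\right)^2 \sum_{k=1}^{p} \beta_k^2 \\
  &= \left(\frac{1}{1+\lambda}\right)^2 p\,\sigma^2
     + \left(\frac{\lambda}{1+\lambda}\right)^2 \sum_{k=1}^{p} \beta_k^2 .
\end{aligned}
\]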

Mean squared error: Let's look at these two terms, (1/(1+λ))² p σ² + (λ/(1+λ))² Σ_{k=1}^p β_k². The first is the variance component; it equals pσ² at λ = 0 and goes to zero as λ gets large. The second is the squared bias; it is zero at λ = 0 and grows toward Σ_k β_k² as λ gets large (why?).

Shrinkage: In principle, with the right choice of λ, we can get an estimator with a better MSE. The estimate is not unbiased, but what we pay for in bias, we make up for in variance. [Figure: MSE plotted against λ, with curves for the variance, the squared bias, the shrinkage estimator's MSE, and the OLS MSE.]

Mean squared error: We can find the minimum by balancing the two terms, (1/(1+λ))² p σ² + (λ/(1+λ))² Σ_{k=1}^p β_k². It is not hard to show that the minimum must occur at λ = p σ² / Σ_{k=1}^p β_k².
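A quick numerical check of that formula (a sketch with made-up coefficients and σ): minimize the MSE expression over λ numerically and compare with pσ² / Σ β_k².

beta  <- c(2, -1, 0.5)                       # hypothetical true coefficients
sigma <- 1
p     <- length(beta)
mse   <- function(lam) {
  (1 / (1 + lam))^2 * p * sigma^2 + (lam / (1 + lam))^2 * sum(beta^2)
}
optimize(mse, interval = c(0, 10))$minimum   # numerical minimizer
p * sigma^2 / sum(beta^2)                    # the closed-form value

Both come out near 0.57 for these particular numbers.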

Mean squared error: Therefore, if all of the coefficients are large relative to their variances, we want to set λ small. On the other hand, if we have a lot of small coefficients, we will want to pull them closer to zero.

Comparison: So, if our wee bit of math suggests we can do better in terms of MSE by shrinking the coefficients when we have lots of small ones, what advice does AIC give? How does it compare? [Figure: the new coefficient estimate as a function of β̂, comparing the shrinkage rule with the AIC rule.]
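Here is a sketch that reconstructs the comparison in the figure (my own code; it assumes the standard orthonormal-design form of the AIC rule, which keeps a coefficient only when |β̂| exceeds about √2·σ, and the lecture's exact threshold may differ; λ = 0.5 is arbitrary):

sigma    <- 1
lambda   <- 0.5
beta_hat <- seq(-4, 4, length.out = 401)
shrunk   <- beta_hat / (1 + lambda)                               # shrinkage rule
aic_keep <- ifelse(abs(beta_hat) > sqrt(2) * sigma, beta_hat, 0)  # keep-or-kill rule
plot(beta_hat, shrunk, type = "l",
     xlab = "beta-hat", ylab = "new coefficient estimate")
lines(beta_hat, aic_keep, lty = 2)
legend("topleft", legend = c("shrinkage", "AIC"), lty = 1:2)

The shrinkage rule bends every estimate toward zero by the same factor, while the keep-or-kill rule leaves large estimates alone and zeroes out the small ones.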

Next time: OK, so the last two days have been a lot of math; I want you to focus on the concepts rather than the algebra. Think about how you might design a simulation to see how well subset selection is doing versus shrinkage. We will talk about this next time, and you will implement it for homework next week.
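One possible skeleton for that simulation, offered as a starting point rather than the assigned solution (the design, the coefficients, the √2·σ threshold, and λ = 0.5 are all my own choices):

set.seed(3)
n <- 100; sigma <- 1; lambda <- 0.5
P    <- poly(runif(n), degree = 10)         # 10 orthonormal predictors
beta <- c(3, 2, rep(0.2, 8))                # a few big and many small coefficients
one_rep <- function() {
  y        <- drop(P %*% beta) + rnorm(n, sd = sigma)
  beta_hat <- drop(crossprod(P, y))         # OLS on orthonormal columns
  keep     <- ifelse(abs(beta_hat) > sqrt(2) * sigma, beta_hat, 0)
  shrunk   <- beta_hat / (1 + lambda)
  c(ols       = sum((beta_hat - beta)^2),
    selection = sum((keep - beta)^2),
    shrinkage = sum((shrunk - beta)^2))
}
rowMeans(replicate(2000, one_rep()))        # estimated total MSE of each rule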