Linear regression. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda


http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall15

Outline:
- Linear models
- Least-squares estimation
- Overfitting
- Example: Global warming

Regression. The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, ..., x_p, known as covariates, features, or independent variables. The response is assumed to be of the form

    y = h(x) + z,

where x ∈ R^p contains the features and z is noise.

Linear regression. The regression function h is assumed to be linear:

    y^(i) = x^(i)T β + z^(i),  1 ≤ i ≤ n.

Our aim is to estimate β ∈ R^p from the data.

Linear regression. In matrix form,

    ⎡ y^(1) ⎤   ⎡ x_1^(1)  x_2^(1)  ⋯  x_p^(1) ⎤ ⎡ β_1 ⎤   ⎡ z^(1) ⎤
    ⎢ y^(2) ⎥ = ⎢ x_1^(2)  x_2^(2)  ⋯  x_p^(2) ⎥ ⎢ β_2 ⎥ + ⎢ z^(2) ⎥
    ⎢   ⋮   ⎥   ⎢    ⋮        ⋮           ⋮   ⎥ ⎢  ⋮  ⎥   ⎢   ⋮   ⎥
    ⎣ y^(n) ⎦   ⎣ x_1^(n)  x_2^(n)  ⋯  x_p^(n) ⎦ ⎣ β_p ⎦   ⎣ z^(n) ⎦

Equivalently, y = Xβ + z.

Linear model for GDP.

    State            Population   Unemployment rate (%)   GDP (USD millions)
    California       38 332 521   5.5                     2 448 467
    Minnesota         5 420 380   4.0                       334 780
    Oregon            3 930 065   5.5                       228 120
    Nevada            2 790 136   5.8                       141 204
    Idaho             1 612 136   3.8                        65 202
    Alaska              735 132   6.9                        54 256
    South Carolina    4 774 839   4.9                           ???

Linear model for GDP. After normalizing the features and the response,

    y := (0.984, 0.135, 0.092, 0.057, 0.026, 0.022)^T

         ⎡ 0.982  0.419 ⎤
         ⎢ 0.139  0.305 ⎥
    X := ⎢ 0.101  0.419 ⎥
         ⎢ 0.071  0.442 ⎥
         ⎢ 0.041  0.290 ⎥
         ⎣ 0.019  0.526 ⎦

Aim: find β ∈ R^2 such that y ≈ Xβ. The estimate for the GDP of South Carolina will be x_SC^T β.

Least-squares estimation

Least squares. For a fixed β we can evaluate the error using

    Σ_{i=1}^n ( y^(i) − x^(i)T β )² = ||y − Xβ||_2².

The least-squares estimate β_LS minimizes this cost function:

    β_LS := arg min_β ||y − Xβ||_2².

Least-squares fit. [Figure: data points and the least-squares fit, y versus x.]

Linear model for GDP. The least-squares estimate is

    β_LS = (1.010, −0.019)^T.

GDP is roughly proportional to the population; unemployment doesn't help (linearly).
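This estimate can be checked numerically. A minimal sketch with NumPy, using the normalized y and X from the GDP slide (the entries are rounded to three decimals, so the recovered coefficients match the slide only approximately):

```python
import numpy as np

# Normalized response (GDP) and features (population, unemployment rate)
# for California, Minnesota, Oregon, Nevada, Idaho and Alaska.
y = np.array([0.984, 0.135, 0.092, 0.057, 0.026, 0.022])
X = np.array([[0.982, 0.419],
              [0.139, 0.305],
              [0.101, 0.419],
              [0.071, 0.442],
              [0.041, 0.290],
              [0.019, 0.526]])

# Least-squares estimate: minimizes ||y - X beta||_2
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)  # close to the slide's (1.010, -0.019)
```

np.linalg.lstsq solves the same minimization as the normal equations; the near-zero second coefficient is what "unemployment doesn't help (linearly)" refers to.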

Linear model for GDP.

    State            GDP          Estimate
    California       2 448 467    2 446 186
    Minnesota          334 780      334 584
    Oregon             228 120      233 460
    Nevada             141 204      159 088
    Idaho               65 202       90 345
    Alaska              54 256       23 050
    South Carolina     199 256      289 903

Geometric interpretation. Any vector of the form Xβ lies in the span of the columns of X. The least-squares fit Xβ_LS is the closest vector to y that can be represented in this way: it is the projection of y onto the column space of X.
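A consequence of the projection property can be verified numerically: the least-squares residual y − Xβ_LS is orthogonal to every column of X. A minimal sketch with synthetic data (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # 20 observations, 3 features
y = rng.standard_normal(20)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ beta_ls

# X beta_ls is the projection of y onto the column space of X,
# so the residual is orthogonal to every column of X.
print(X.T @ residual)  # numerically zero
```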

Geometric interpretation. [Figure: y, its projection Xβ_LS, and the column space of X.]

Probabilistic interpretation. We model the noise as a random vector Z with iid Gaussian entries of zero mean and variance σ². The data are then a realization of the random vector

    Y := Xβ + Z.

Y is Gaussian with mean Xβ and covariance matrix σ²I.
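A quick simulation of this model, with hypothetical X, β, and σ, confirms that realizations of Y concentrate around Xβ with variance σ² in every entry:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 5, 2, 0.5            # hypothetical dimensions and noise level
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0])

# 100,000 realizations of Y = X beta + Z with Z iid N(0, sigma^2)
Z = sigma * rng.standard_normal((100_000, n))
Y = X @ beta + Z

print(Y.mean(axis=0))  # close to X beta
print(Y.var(axis=0))   # close to sigma^2 = 0.25 in every entry
```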

Likelihood. The joint pdf of Y is

    f_Y(a) := Π_{i=1}^n (1 / (√(2π) σ)) exp( −(a_i − (Xβ)_i)² / (2σ²) )
            = (1 / ((2π)^(n/2) σ^n)) exp( −||a − Xβ||_2² / (2σ²) ).

The likelihood is

    L_y(β) = (1 / ((2π)^(n/2) σ^n)) exp( −||y − Xβ||_2² / (2σ²) ).

Maximum-likelihood estimate. The maximum-likelihood estimate is

    β_ML = arg max_β L_y(β)
         = arg max_β log L_y(β)
         = arg min_β ||y − Xβ||_2²
         = β_LS.
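This equivalence can be checked numerically on synthetic data: the Gaussian log-likelihood is maximized at the least-squares estimate, so perturbing β_LS in any direction decreases it (a sketch; the dimensions and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 3, 1.0           # arbitrary choices
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + sigma * rng.standard_normal(n)

def log_likelihood(beta):
    # Logarithm of the Gaussian likelihood L_y(beta)
    r = y - X @ beta
    return -n / 2 * np.log(2 * np.pi * sigma**2) - r @ r / (2 * sigma**2)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any perturbation of beta_ls has strictly lower log-likelihood
for _ in range(100):
    perturbed = beta_ls + 0.1 * rng.standard_normal(p)
    assert log_likelihood(perturbed) < log_likelihood(beta_ls)
```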

Overfitting

Temperature predictor. A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

Overfitting. If a model is very complex, it may overfit the data. To evaluate a model, we separate the data into a training set and a test set:

1. We fit the model using the training set.
2. We evaluate the error on the test set.

Experiment. X_train, X_test, z_train, and β are iid Gaussian with mean 0 and variance 1:

    y_train = X_train β + z_train
    y_test = X_test β

We use y_train and X_train to compute β_LS, and measure the relative errors

    error_train = ||y_train − X_train β_LS||_2 / ||y_train||_2
    error_test  = ||y_test − X_test β_LS||_2 / ||y_test||_2.
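The experiment can be reproduced as follows (a sketch: the slide does not state the number of features, so p = 50 is an assumption, chosen to match the start of the horizontal axis in the plot):

```python
import numpy as np

def experiment(n, p=50, seed=0):
    """Fit by least squares on noisy training data; evaluate on noiseless test data."""
    rng = np.random.default_rng(seed)
    X_train = rng.standard_normal((n, p))
    X_test = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    z_train = rng.standard_normal(n)

    y_train = X_train @ beta + z_train
    y_test = X_test @ beta

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    err_train = np.linalg.norm(y_train - X_train @ beta_ls) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(y_test - X_test @ beta_ls) / np.linalg.norm(y_test)
    return err_train, err_test

# With n = p = 50 the training data are fit perfectly but the test error is large;
# with n = 500 the training error approaches the noise level and the test error shrinks.
print(experiment(50))
print(experiment(500))
```

The training error is deceptively small when n is close to p, which is exactly the overfitting the temperature-predictor anecdote illustrates.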

Experiment. [Figure: relative ℓ2-norm error on the training and test sets as a function of n, from 50 to 500, together with the noise level in the training data.]

Example: Global warming

Maximum temperatures in Oxford, UK. [Figure: monthly maximum temperatures (Celsius), 1860–2000.]

Maximum temperatures in Oxford, UK. [Figure: monthly maximum temperatures (Celsius), 1900–1905.]

Linear model.

    y_t ≈ β_0 + β_1 cos(2πt/12) + β_2 sin(2πt/12) + β_3 t,  1 ≤ t ≤ n,

where t is the time in months (n = 12 · 150).
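The corresponding design matrix has four columns: an intercept, a yearly cosine and sine, and a linear trend. A sketch with simulated monthly data (not the Oxford measurements; the coefficients are hypothetical), checking that least squares recovers a known trend:

```python
import numpy as np

n = 12 * 150                       # 150 years of monthly data
t = np.arange(1, n + 1)

# Design matrix: intercept, yearly cosine and sine, linear trend
X = np.column_stack([np.ones(n),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])

# Simulated temperatures: hypothetical coefficients plus unit-variance noise
rng = np.random.default_rng(0)
beta_true = np.array([10.0, 5.0, -2.0, 0.0005])
y = X @ beta_true + rng.standard_normal(n)

beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ls)  # close to beta_true; beta_ls[3] is the estimated trend per month
```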

Model fitted by least squares. [Figure: data and fitted model, 1860–2000.]

Model fitted by least squares. [Figure: data and fitted model, 1900–1905.]

Model fitted by least squares. [Figure: data and fitted model, 1960–1965.]

Trend: increase of 0.75 °C per 100 years (1.35 °F). [Figure: data and fitted trend, 1860–2000.]

Model for minimum temperatures. [Figure: data and fitted model, 1860–2000.]

Model for minimum temperatures. [Figure: data and fitted model, 1900–1905.]

Model for minimum temperatures. [Figure: data and fitted model, 1960–1965.]

Trend: increase of 0.88 °C per 100 years (1.58 °F). [Figure: data and fitted trend, 1860–2000.]