Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.


Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Regression. The aim is to learn a function h that relates a response or dependent variable y to several observed variables x_1, x_2, ..., x_p, known as covariates, features or independent variables. The response is assumed to be of the form y = h(x) + z, where x ∈ R^p contains the features and z is noise.

Linear regression. The regression function h is assumed to be linear: y^(i) = (x^(i))^T β + z^(i) for 1 ≤ i ≤ n. Our aim is to estimate β ∈ R^p from the data.

Linear regression. In matrix form,
\[
\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}
=
\begin{bmatrix}
x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_p \\
x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_p \\
\vdots & \vdots & & \vdots \\
x^{(n)}_1 & x^{(n)}_2 & \cdots & x^{(n)}_p
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}
+
\begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(n)} \end{bmatrix}
\]
Equivalently, y = Xβ + z.

Linear model for GDP
State | GDP (millions) | Population | Unemployment rate
North Dakota | 52 089 | 757 952 | 2.4
Alabama | 204 861 | 4 863 300 | 3.8
Mississippi | 107 680 | 2 988 726 | 5.2
Arkansas | 120 689 | 2 988 248 | 3.5
Kansas | 153 258 | 2 907 289 | 3.8
Georgia | 525 360 | 10 310 371 | 4.5
Iowa | 178 766 | 3 134 693 | 3.2
West Virginia | 73 374 | 1 831 102 | 5.1
Kentucky | 197 043 | 4 436 974 | 5.2
Tennessee | ??? | 6 651 194 | 3.0

Centering. The response and each feature column are centered by subtracting their averages: y_cent := y − av(y) and X_cent := X − av(X) (columnwise), with av(y) = 179 236 and av(X) = [3 802 073, 4.1].

Normalizing. The centered response and feature columns are then divided by std(y) = 396 701 and std(X) = [7 720 656, 2.80], respectively, yielding y_norm and X_norm.

Linear model for GDP. Aim: find β ∈ R^2 such that y_norm ≈ X_norm β. The estimate for the GDP of Tennessee is then y_Ten = av(y) + std(y) ⟨x_norm,Ten, β⟩, where x_norm,Ten is Tennessee's feature vector centered using av(X) and normalized using std(X).
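
As an illustration, here is a minimal numpy sketch of this pipeline (a sketch, not the lecture's code). It centers the response and the features, normalizes by the l2 norm of the centered columns, which appears to be the convention behind the std(·) values quoted above, fits β by least squares and predicts Tennessee's GDP; the slides report β_LS = (1.019, −0.111) and an estimate of 345 352, which this procedure should roughly reproduce.

```python
import numpy as np

# GDP (millions), population and unemployment rate for the nine training states
data = [
    ("North Dakota",   52_089,    757_952, 2.4),
    ("Alabama",       204_861,  4_863_300, 3.8),
    ("Mississippi",   107_680,  2_988_726, 5.2),
    ("Arkansas",      120_689,  2_988_248, 3.5),
    ("Kansas",        153_258,  2_907_289, 3.8),
    ("Georgia",       525_360, 10_310_371, 4.5),
    ("Iowa",          178_766,  3_134_693, 3.2),
    ("West Virginia",  73_374,  1_831_102, 5.1),
    ("Kentucky",      197_043,  4_436_974, 5.2),
]
y = np.array([row[1] for row in data], dtype=float)
X = np.array([row[2:] for row in data], dtype=float)

# Center, then normalize by the l2 norm of the centered columns
av_y, av_X = y.mean(), X.mean(axis=0)
std_y = np.linalg.norm(y - av_y)
std_X = np.linalg.norm(X - av_X, axis=0)
y_norm = (y - av_y) / std_y
X_norm = (X - av_X) / std_X

# Least-squares fit of y_norm ~ X_norm @ beta
beta, *_ = np.linalg.lstsq(X_norm, y_norm, rcond=None)

# Predict Tennessee (population 6 651 194, unemployment rate 3.0)
x_ten = (np.array([6_651_194, 3.0]) - av_X) / std_X
print(beta, av_y + std_y * (x_ten @ beta))
```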

Temperature predictor. A friend tells you: "I found a cool way to predict the average daily temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

System of equations. Suppose A is n × p and full rank, and we want to solve A x = b for x. If n < p the system is underdetermined: there are infinitely many solutions for any b (overfitting!). If n = p the system is determined: there is a unique solution for any b (overfitting). If n > p the system is overdetermined: a solution exists only if b ∈ col(A) (if there is noise, there is no exact solution).

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Least squares. For a fixed β we can evaluate the error using
\[
\sum_{i=1}^{n} \left( y^{(i)} - (x^{(i)})^T \beta \right)^2 = \| y - X\beta \|_2^2
\]
The least-squares estimate β_LS minimizes this cost function:
\[
\beta_{LS} := \arg\min_{\beta} \| y - X\beta \|_2 = \left( X^T X \right)^{-1} X^T y \quad \text{if } X \text{ is full rank and } n \ge p
\]

Least-squares fit. [Figure: data points (x, y) and the least-squares fit line.]

Least-squares solution. Let X = U S V^T and decompose y = U U^T y + (I − U U^T) y. By the Pythagorean theorem,
\[
\| y - X\beta \|_2^2 = \| (I - UU^T) y \|_2^2 + \| UU^T y - X\beta \|_2^2,
\]
so
\[
\arg\min_{\beta} \| y - X\beta \|_2^2
= \arg\min_{\beta} \| UU^T y - X\beta \|_2^2
= \arg\min_{\beta} \| UU^T y - USV^T\beta \|_2^2
= \arg\min_{\beta} \| U^T y - SV^T\beta \|_2^2,
\]
which is attained at β = V S^{-1} U^T y = (X^T X)^{-1} X^T y.
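
The chain of equalities above is easy to check numerically. The following sketch (random data, not from the slides) verifies that the normal-equations formula, the SVD formula V S^{-1} U^T y, and numpy's least-squares solver agree when X is full rank with n ≥ p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Normal equations: (X^T X)^{-1} X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD formula: V S^{-1} U^T y
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ y) / s)

# Library least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_svd) and np.allclose(beta_svd, beta_lstsq)
```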

Linear model for GDP. The least-squares estimate is β_LS = (1.019, −0.111): GDP is roughly proportional to the population, and unemployment has a negative (linear) effect.

Linear model for GDP
State | GDP | Estimate
North Dakota | 52 089 | 46 241
Alabama | 204 861 | 239 165
Mississippi | 107 680 | 119 005
Arkansas | 120 689 | 145 712
Kansas | 153 258 | 136 756
Georgia | 525 360 | 513 343
Iowa | 178 766 | 158 097
West Virginia | 73 374 | 59 969
Kentucky | 197 043 | 194 829
Tennessee | 328 770 | 345 352

Maximum temperatures in Oxford, UK. [Figure: monthly maximum temperature (Celsius), 1860-2000.]

Maximum temperatures in Oxford, UK. [Figure: monthly maximum temperature (Celsius), 1900-1905.]

Linear model:
\[
y_t \approx \beta_0 + \beta_1 \cos\left( \frac{2\pi t}{12} \right) + \beta_2 \sin\left( \frac{2\pi t}{12} \right) + \beta_3\, t,
\]
where 1 ≤ t ≤ n is the time in months (n = 12 · 150).
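
The Oxford series itself is not included here, so the sketch below fits the same four-column design to a synthetic monthly series (the baseline, seasonal amplitude, trend and noise level are made-up stand-ins); with the real data, the coefficient of t is what gives the warming trend reported on the next slides.

```python
import numpy as np

# Synthetic stand-in for the Oxford data: n = 12 * 150 monthly temperatures
n = 12 * 150
t = np.arange(1, n + 1)
rng = np.random.default_rng(0)
trend_per_month = 0.75 / (100 * 12)      # assume 0.75 C per century, as estimated in the slides
y = 15 + 7 * np.cos(2 * np.pi * t / 12) + trend_per_month * t + rng.standard_normal(n)

# Design matrix: intercept, yearly cosine and sine, linear trend
X = np.column_stack([np.ones(n),
                     np.cos(2 * np.pi * t / 12),
                     np.sin(2 * np.pi * t / 12),
                     t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated trend (C per 100 years):", beta[3] * 12 * 100)
```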

Model fitted by least squares. [Figure: data and fitted model, 1860-2000.]

Model fitted by least squares. [Figure: data and fitted model, 1900-1905.]

Model fitted by least squares. [Figure: data and fitted model, 1960-1965.]

Trend: increase of 0.75 °C / 100 years (1.35 °F). [Figure: data and fitted trend, 1860-2000.]

Model for minimum temperatures. [Figure: data and fitted model, 1860-2000.]

Model for minimum temperatures. [Figure: data and fitted model, 1900-1905.]

Model for minimum temperatures. [Figure: data and fitted model, 1960-1965.]

Trend: increase of 0.88 °C / 100 years (1.58 °F). [Figure: data and fitted trend for minimum temperatures, 1860-2000.]

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Geometric interpretation. Any vector Xβ is in the span of the columns of X. The least-squares estimate is the closest vector to y that can be represented in this way; this is the projection of y onto the column space of X: X β_LS = U S V^T V S^{-1} U^T y = U U^T y.

Geometric interpretation

Face denoising. We denoise by projecting onto: S_1, the span of the 9 images from the same subject, and S_2, the span of the 360 images in the training set. Relative test errors: ‖x − P_{S_1} y‖_2 / ‖x‖_2 = 0.114 and ‖x − P_{S_2} y‖_2 / ‖x‖_2 = 0.078.

S_1 := span of the 9 images from the same subject. [Figure: the images.]

Denoising via projection onto S_1. [Figure: the signal x, the noise z, and the data y = x + z, each shown with its projection onto S_1, together with the resulting estimate P_{S_1} y.]

S_2 := span of the 360 images in the training set. [Figure: the images.]

Denoising via projection onto S_2. [Figure: the signal x, the noise z, and the data y = x + z, each shown with its projection onto S_2, together with the resulting estimate P_{S_2} y.]

[Figure: the original image x and the two projections P_{S_1} y and P_{S_2} y.]

Lessons of face denoising. What does the intuition from face denoising tell us about linear regression? More features = a larger column space. A larger column space captures more of the true image, but also more of the noise. There is a balance between underfitting and overfitting.

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Motivation. Model the data y_1, ..., y_n as realizations of a set of random variables y_1, ..., y_n. The joint pdf depends on a vector of parameters β: f_β(y_1, ..., y_n) := f_{y_1,...,y_n}(y_1, ..., y_n) is the probability density of y_1, ..., y_n evaluated at the observed data. Idea: choose β such that this density is as high as possible.

Likelihood. The likelihood is equal to the joint pdf, L_{y_1,...,y_n}(β) := f_β(y_1, ..., y_n), interpreted as a function of the parameters. The log-likelihood function is the logarithm of the likelihood, log L_{y_1,...,y_n}(β).

Maximum-likelihood estimator. The likelihood quantifies how likely the data are according to the model. The maximum-likelihood (ML) estimator is
\[
\beta_{ML}(y_1, \ldots, y_n) := \arg\max_{\beta} L_{y_1,\ldots,y_n}(\beta) = \arg\max_{\beta} \log L_{y_1,\ldots,y_n}(\beta).
\]
Maximizing the log-likelihood is equivalent, and often more convenient.

Probabilistic interpretation. We model the noise as an iid Gaussian random vector z whose entries have zero mean and variance σ^2. The data are a realization of the random vector y := Xβ + z, so y is Gaussian with mean Xβ and covariance matrix σ^2 I.

Likelihood. The joint pdf of y is
\[
f_{y}(a) := \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} \left( a[i] - (X\beta)[i] \right)^2 \right)
= \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \| a - X\beta \|_2^2 \right).
\]
The likelihood is
\[
L_{y}(\beta) = \frac{1}{\sqrt{(2\pi)^n}\,\sigma^n} \exp\left( -\frac{1}{2\sigma^2} \| y - X\beta \|_2^2 \right).
\]

Maximum-likelihood estimate. The maximum-likelihood estimate is
\[
\beta_{ML} = \arg\max_{\beta} L_{y}(\beta) = \arg\max_{\beta} \log L_{y}(\beta) = \arg\min_{\beta} \| y - X\beta \|_2^2 = \beta_{LS}.
\]

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Estimation error. If the data are generated according to the linear model y := Xβ + z, then
\[
\beta_{LS} - \beta = \left( X^T X \right)^{-1} X^T \left( X\beta + z \right) - \beta = \left( X^T X \right)^{-1} X^T z,
\]
as long as X is full rank.

LS estimator is unbiased. Assume the noise z is random and has zero mean. Then
\[
E\!\left( \beta_{LS} - \beta \right) = \left( X^T X \right)^{-1} X^T E(z) = 0,
\]
so the estimate is unbiased: its mean equals β.

Least-squares error. If the data are generated according to the linear model y := Xβ + z, then
\[
\frac{\|z\|_2}{\sigma_1} \le \| \beta_{LS} - \beta \|_2 \le \frac{\|z\|_2}{\sigma_p},
\]
where σ_1 and σ_p are the largest and smallest singular values of X.

Least-squares error: proof. The error is given by β_LS − β = (X^T X)^{-1} X^T z. How can we bound ‖(X^T X)^{-1} X^T z‖_2?

Singular values. The singular values of a matrix A ∈ R^{n×p} of rank p satisfy
\[
\sigma_1 = \max_{\|x\|_2 = 1,\; x \in \mathbb{R}^p} \| A x \|_2, \qquad \sigma_p = \min_{\|x\|_2 = 1,\; x \in \mathbb{R}^p} \| A x \|_2.
\]

Least-squares error. β_LS − β = V S^{-1} U^T z. The smallest and largest singular values of V S^{-1} U^T are 1/σ_1 and 1/σ_p, so
\[
\frac{\|z\|_2}{\sigma_1} \le \| V S^{-1} U^T z \|_2 \le \frac{\|z\|_2}{\sigma_p}.
\]

Experiment. X_train, X_test, z_train and β are sampled iid from a standard Gaussian; the data have 50 features. y_train = X_train β + z_train and y_test = X_test β (no test noise). We use y_train and X_train to compute β_LS and evaluate
\[
\text{error}_{\text{train}} = \frac{\| X_{\text{train}} \beta_{LS} - y_{\text{train}} \|_2}{\| y_{\text{train}} \|_2},
\qquad
\text{error}_{\text{test}} = \frac{\| X_{\text{test}} \beta_{LS} - y_{\text{test}} \|_2}{\| y_{\text{test}} \|_2}.
\]
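
A minimal simulation of this experiment (a sketch; the exact numbers depend on the random seed, but the qualitative behavior matches the plot below):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50
beta = rng.standard_normal(p)

for n in [50, 100, 200, 300, 400, 500]:
    X_train = rng.standard_normal((n, p))
    y_train = X_train @ beta + rng.standard_normal(n)
    X_test = rng.standard_normal((n, p))
    y_test = X_test @ beta                      # no test noise

    beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

    err_train = np.linalg.norm(y_train - X_train @ beta_ls) / np.linalg.norm(y_train)
    err_test = np.linalg.norm(y_test - X_test @ beta_ls) / np.linalg.norm(y_test)
    print(n, round(err_train, 3), round(err_test, 3))
```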

Experiment. [Figure: relative training and test errors (l2 norm) as a function of n, from 50 to 500, together with the training noise level.]

Experiment: questions.
1. Can we approximate the relative noise level ‖z‖_2 / ‖y‖_2? Yes: ‖β‖_2 ≈ √50, ‖X_train β‖_2 ≈ √(50 n) and ‖z_train‖_2 ≈ √n, so the relative noise level is about 1/√51 ≈ 0.140.
2. Why does the training error start at 0? At n = 50 = p, X is square and invertible, so the fit is exact.
3. Why does the relative training error converge to the noise level? ‖X_train β_LS − y_train‖_2 = ‖X_train (β_LS − β) − z_train‖_2 and β_LS → β.
4. Why does the relative test error converge to zero? We assumed no test noise, and β_LS → β.

Non-asymptotic bound. Let y := Xβ + z, where the entries of X and z are iid standard Gaussian. The least-squares estimate satisfies
\[
\frac{\sqrt{(1-\epsilon)\, p}}{(1+\epsilon)\sqrt{n}} \le \| \beta_{LS} - \beta \|_2 \le \frac{\sqrt{(1+\epsilon)\, p}}{(1-\epsilon)\sqrt{n}}
\]
with probability at least 1 − 1/p − 2 exp(−p ε² / 8), as long as n ≥ 64 p log(12/ε) / ε².

Proof.
\[
\frac{\| U^T z \|_2}{\sigma_1} \le \| V S^{-1} U^T z \|_2 \le \frac{\| U^T z \|_2}{\sigma_p}.
\]

Projection onto a fixed subspace. Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
\[
P\!\left( k(1-\epsilon) < \| P_{S}\, z \|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\!\left( -\frac{k\epsilon^2}{8} \right).
\]
Consequence: with probability at least 1 − 2 exp(−p ε² / 8),
\[
(1-\epsilon)\, p \le \| U^T z \|_2^2 \le (1+\epsilon)\, p.
\]

Singular values of a Gaussian matrix. Let A be an n × k matrix with iid standard Gaussian entries, with n > k. For any fixed ε > 0, the singular values of A satisfy
\[
\sqrt{n}\,(1-\epsilon) \le \sigma_k \le \sigma_1 \le \sqrt{n}\,(1+\epsilon)
\]
with probability at least 1 − 1/k, as long as n > 64 k log(12/ε) / ε².

Proof. With probability at least 1 − 1/p,
\[
\sqrt{n}\,(1-\epsilon) \le \sigma_p \le \sigma_1 \le \sqrt{n}\,(1+\epsilon),
\]
as long as n ≥ 64 p log(12/ε) / ε².

Experiment. [Figure: relative coefficient error ‖β − β_LS‖_2 / ‖β‖_2 as a function of n (up to 20 000) for p = 50, 100, 200, compared with the 1/√n rate.]

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Condition number. The condition number of A ∈ R^{n×p}, n ≥ p, is the ratio σ_1/σ_p of its largest and smallest singular values. A matrix is ill conditioned if its condition number is large (it is almost rank deficient).

Noise amplification. Let y := Xβ + z, where z is iid standard Gaussian. With probability at least 1 − 2 exp(−ε² / 8),
\[
\| \beta_{LS} - \beta \|_2 \ge \frac{\sqrt{1-\epsilon}}{\sigma_p},
\]
where σ_p is the smallest singular value of X.

Proof.
\[
\| \beta_{LS} - \beta \|_2^2 = \| V S^{-1} U^T z \|_2^2 = \| S^{-1} U^T z \|_2^2 \quad (V \text{ is orthogonal})
= \sum_{i=1}^{p} \frac{(u_i^T z)^2}{\sigma_i^2} \ge \frac{(u_p^T z)^2}{\sigma_p^2}.
\]

Projection onto a fixed subspace. Let S be a k-dimensional subspace of R^n and z ∈ R^n a vector of iid standard Gaussian noise. For any ε > 0,
\[
P\!\left( k(1-\epsilon) < \| P_{S}\, z \|_2^2 < k(1+\epsilon) \right) \ge 1 - 2\exp\!\left( -\frac{k\epsilon^2}{8} \right).
\]
Consequence: with probability at least 1 − 2 exp(−ε² / 8), (u_p^T z)² ≥ 1 − ε.

Example. Let y := Xβ + z, where X is the 6 × 2 matrix with rows (0.212, 0.099), (0.605, 0.298), (0.213, 0.113), (0.589, 0.285), (0.016, 0.006), (0.059, 0.032), β := (0.471, 1.191), and z is a noise vector with ‖z‖_2 = 0.11.

Example. The SVD of X is X = U S V^T with singular values 1.00 and 0.01, so the condition number is σ_1 / σ_2 = 100.

Example.
\[
\beta_{LS} - \beta = V S^{-1} U^T z
= V \begin{bmatrix} 1.00 & 0 \\ 0 & 100.00 \end{bmatrix} U^T z
= V \begin{bmatrix} 0.058 \\ 3.004 \end{bmatrix}
= \begin{bmatrix} 1.270 \\ 2.723 \end{bmatrix},
\]
so that ‖β_LS − β‖_2 = 27.00 ‖z‖_2: the component of the noise aligned with the direction of the small singular value is amplified by a factor of 100.

Multicollinearity. The feature matrix is ill conditioned if any subset of columns is close to being linearly dependent (there is a vector almost in the null space). This occurs when features are highly correlated. For any X ∈ R^{n×p} with normalized columns, if two columns X_i and X_j, i ≠ j, satisfy ⟨X_i, X_j⟩ ≥ 1 − ε²/2, then the smallest singular value satisfies σ_p ≤ ε. Proof idea: consider ‖X(e_i − e_j)‖_2.
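
To see the effect numerically, the sketch below (synthetic data, not from the slides) builds a design matrix with two nearly identical, normalized columns: the smallest singular value is tiny, the condition number is huge, and a small amount of noise typically pushes the least-squares coefficients far from the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)    # almost the same feature
X = np.column_stack([x1, x2])
X = X / np.linalg.norm(X, axis=0)          # normalize the columns

s = np.linalg.svd(X, compute_uv=False)
print("singular values:", s, "condition number:", s[0] / s[-1])

beta_true = np.array([1.0, 1.0])
y = X @ beta_true + 0.01 * rng.standard_normal(n)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta_LS:", beta_ls)                 # typically far from (1, 1)
```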

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

Motivation. Avoid the noise amplification caused by multicollinearity. Problem: noise amplification blows up the coefficients. Solution: penalize large-norm solutions when fitting the model. Adding a penalty term that promotes a particular structure is called regularization.

Ridge regression. For a fixed regularization parameter λ > 0,
\[
\beta_{ridge} := \arg\min_{\beta} \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2 = \left( X^T X + \lambda I \right)^{-1} X^T y.
\]
The λI term increases the singular values of X^T X. When λ → 0, β_ridge → β_LS; when λ → ∞, β_ridge → 0.

Proof. β_ridge is the solution to a modified least-squares problem:
\[
\beta_{ridge} = \arg\min_{\beta} \left\| \begin{bmatrix} y \\ 0 \end{bmatrix} - \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \beta \right\|_2^2
= \left( \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix} \right)^{-1} \begin{bmatrix} X \\ \sqrt{\lambda}\, I \end{bmatrix}^T \begin{bmatrix} y \\ 0 \end{bmatrix}
= \left( X^T X + \lambda I \right)^{-1} X^T y.
\]
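
Both characterizations of β_ridge, the closed form and the augmented least-squares problem used in the proof, are easy to check against each other; a sketch on random data:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_augmented(X, y, lam):
    # Stack sqrt(lam) * I below X and zeros below y, then solve ordinary least squares
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
assert np.allclose(ridge_closed_form(X, y, 0.5), ridge_augmented(X, y, 0.5))
```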

Modified projection.
\[
y_{ridge} := X \beta_{ridge}
= X \left( X^T X + \lambda I \right)^{-1} X^T y
= U S V^T \left( V S^2 V^T + \lambda V V^T \right)^{-1} V S U^T y
= U S V^T V \left( S^2 + \lambda I \right)^{-1} V^T V S U^T y
= U S \left( S^2 + \lambda I \right)^{-1} S U^T y
= \sum_{i=1}^{p} \frac{\sigma_i^2}{\sigma_i^2 + \lambda} \langle y, u_i \rangle\, u_i.
\]
The component of the data in the direction of u_i is shrunk by σ_i² / (σ_i² + λ).
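
The shrinkage formula can be verified directly: X β_ridge equals the sum of the components ⟨y, u_i⟩ u_i, each scaled by σ_i² / (σ_i² + λ). A sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 2.0

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrinkage = s**2 / (s**2 + lam)            # factor applied to each direction u_i
y_ridge_svd = U @ (shrinkage * (U.T @ y))  # shrink each component, then project

assert np.allclose(X @ beta_ridge, y_ridge_svd)
```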

Modified projection: relation to PCA. The component of the data in the direction of u_i is shrunk by σ_i² / (σ_i² + λ). Instead of orthogonally projecting onto the column space of X, as in standard regression, we shrink and project. Which directions are shrunk the most? The directions in the data with the smallest variance. In PCA we delete the directions with smallest variance (i.e., shrink them to zero), so ridge regression can be thought of as a continuous variant of performing regression on principal components.

Ridge-regression estimate. If y := Xβ + z,
\[
\beta_{ridge}
= V \operatorname{diag}\!\left( \frac{\sigma_1^2}{\sigma_1^2+\lambda}, \ldots, \frac{\sigma_p^2}{\sigma_p^2+\lambda} \right) V^T \beta
+ V \operatorname{diag}\!\left( \frac{\sigma_1}{\sigma_1^2+\lambda}, \ldots, \frac{\sigma_p}{\sigma_p^2+\lambda} \right) U^T z,
\]
where X = U S V^T and σ_1, ..., σ_p are the singular values. For comparison, β_LS = β + V S^{-1} U^T z.

Bias-variance tradeoff. The error β_ridge − β can be divided into two terms: a bias (which depends on β) and a variance (which depends on z). The bias equals
\[
E\!\left( \beta_{ridge} \right) - \beta = -V \operatorname{diag}\!\left( \frac{\lambda}{\sigma_1^2+\lambda}, \ldots, \frac{\lambda}{\sigma_p^2+\lambda} \right) V^T \beta.
\]
A larger λ increases the bias, but dampens the noise (decreases the variance).

Example. Let y := Xβ + z, with X, β and z as in the noise-amplification example above (X is the 6 × 2 matrix with singular values 1.00 and 0.01, β := (0.471, 1.191), and ‖z‖_2 = 0.11).

Example. Setting λ = 0.01,
\[
\beta_{ridge} - \beta
= -V \begin{bmatrix} \frac{\lambda}{1+\lambda} & 0 \\ 0 & \frac{\lambda}{0.01^2+\lambda} \end{bmatrix} V^T \beta
+ V \begin{bmatrix} \frac{1}{1+\lambda} & 0 \\ 0 & \frac{0.01}{0.01^2+\lambda} \end{bmatrix} U^T z
\approx -V \begin{bmatrix} 0.01 & 0 \\ 0 & 0.99 \end{bmatrix} V^T \beta
+ V \begin{bmatrix} 0.99 & 0 \\ 0 & 0.99 \end{bmatrix} U^T z
= \begin{bmatrix} 0.329 \\ 0.823 \end{bmatrix}.
\]

Example. Compare the relative errors: ‖β_LS − β‖_2 = 27.00 ‖z‖_2 for least squares versus ‖β_ridge − β‖_2 = 7.96 ‖z‖_2 for ridge regression.

Example. [Figure: ridge coefficients and coefficient error as a function of the regularization parameter (10^-7 to 10^3), with the least-squares fit for comparison.]

Maximum-a-posteriori estimator. Is there a probabilistic interpretation of ridge regression? Bayesian viewpoint: β is modeled as random, not deterministic. The maximum-a-posteriori (MAP) estimator of β given y is
\[
\beta_{MAP}(y) := \arg\max_{\beta} f_{\beta \mid y}(\beta \mid y),
\]
where f_{β|y} is the conditional pdf of β given y.

Maximum-a-posteriori estimator. Let y ∈ R^n be a realization of y := Xβ + z, where β and z are iid Gaussian with zero mean and variances σ_1² and σ_2². If X ∈ R^{n×p} is known, then
\[
\beta_{MAP} = \arg\min_{\beta} \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2, \qquad \lambda := \sigma_2^2 / \sigma_1^2.
\]
What does it mean if σ_1² is tiny or large? How about σ_2²?

Problem. How do we calibrate the regularization parameter? We cannot use the coefficient error (we don't know the true value!) and we cannot minimize the error over the training data (why?). Solution: check the fit on new data.

Cross validation. Given a set of examples (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)):
1. Partition the data into a training set X_train ∈ R^{n_train × p}, y_train ∈ R^{n_train} and a validation set X_val ∈ R^{n_val × p}, y_val ∈ R^{n_val}.
2. Fit the model using the training set for every λ in a set Λ,
\[
\beta_{ridge}(\lambda) := \arg\min_{\beta} \| y_{\text{train}} - X_{\text{train}} \beta \|_2^2 + \lambda \| \beta \|_2^2,
\]
and evaluate the fitting error on the validation set,
\[
\text{err}(\lambda) := \| y_{\text{val}} - X_{\text{val}}\, \beta_{ridge}(\lambda) \|_2^2.
\]
3. Choose the value of λ that minimizes the validation-set error: λ_cv := arg min_{λ ∈ Λ} err(λ).
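
A minimal sketch of this procedure (the synthetic data, the 15/15 split and the λ grid are placeholders standing in for the house-price example that follows):

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cross_validate(X_train, y_train, X_val, y_val, lambdas):
    """Return the lambda that minimizes the validation error, and all errors."""
    errs = [np.linalg.norm(y_val - X_val @ ridge(X_train, y_train, lam)) ** 2
            for lam in lambdas]
    return lambdas[int(np.argmin(errs))], errs

rng = np.random.default_rng(0)
n, p = 30, 9
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.5 * rng.standard_normal(n)
lam_cv, errs = cross_validate(X[:15], y[:15], X[15:], y[15:], np.logspace(-3, 3, 13))
print("lambda_cv:", lam_cv)
```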

Prediction of house prices. Aim: predict the price of a house from
1. Area of the living room
2. Condition (integer between 1 and 5)
3. Grade (integer between 7 and 12)
4. Area of the house without the basement
5. Area of the basement
6. The year it was built
7. Latitude
8. Longitude
9. Average area of the living room of houses within 15 blocks

Prediction of house prices. Training data: 15 houses. Validation data: 15 houses. Test data: 15 houses. Condition number of the training-data feature matrix: 9.94. We evaluate the relative fit ‖y − X β_ridge‖_2 / ‖y‖_2.

Prediction of house prices. [Figure: ridge coefficients and the l2-norm cost on the training and validation sets, as a function of the regularization parameter (10^-3 to 10^3).]

Prediction of house prices. Best λ: 0.27. Validation-set error: 0.672 (least squares: 0.906). Test-set error: 0.799 (least squares: 1.186).

Training. [Figure: estimated vs. true price (dollars) for least squares and ridge regression on the training set.]

Validation. [Figure: estimated vs. true price (dollars) for least squares and ridge regression on the validation set.]

Test. [Figure: estimated vs. true price (dollars) for least squares and ridge regression on the test set.]

Outline: Linear regression · Least-squares estimation · Geometric interpretation · Probabilistic interpretation · Analysis of least-squares estimate · Noise amplification · Ridge regression · Classification

The classification problem. Goal: assign examples to one of several predefined categories. We have n examples of labels and corresponding features, (y^(1), x^(1)), (y^(2), x^(2)), ..., (y^(n), x^(n)). Here we consider only two categories: the labels are 0 or 1.

Logistic function. A smoothed version of the step function: g(t) := 1 / (1 + exp(−t)).

[Figure: the logistic function g(t) = 1 / (1 + exp(−t)) for t between −8 and 8.]

Logistic regression. A generalized linear model: a linear model composed with an entrywise link function, y^(i) ≈ g(β_0 + ⟨x^(i), β⟩).

Maximum likelihood. If y^(1), ..., y^(n) are independent samples of Bernoulli random variables with parameters p_{y^(i)}(1) := g(⟨x^(i), β⟩), where x^(1), ..., x^(n) ∈ R^p are known, the ML estimate of β given y^(1), ..., y^(n) is
\[
\beta_{ML} := \arg\max_{\beta} \sum_{i=1}^{n} y^{(i)} \log g\!\left( \langle x^{(i)}, \beta \rangle \right) + \left( 1 - y^{(i)} \right) \log\!\left( 1 - g\!\left( \langle x^{(i)}, \beta \rangle \right) \right).
\]

Maximum likelihood.
\[
L(\beta) := p_{y^{(1)},\ldots,y^{(n)}}\!\left( y^{(1)}, \ldots, y^{(n)} \right)
= \prod_{i=1}^{n} g\!\left( \langle x^{(i)}, \beta \rangle \right)^{y^{(i)}} \left( 1 - g\!\left( \langle x^{(i)}, \beta \rangle \right) \right)^{1 - y^{(i)}}.
\]

Logistic-regression estimator.
\[
\beta_{LR} := \arg\max_{\beta} \sum_{i=1}^{n} y^{(i)} \log g\!\left( \langle x^{(i)}, \beta \rangle \right) + \left( 1 - y^{(i)} \right) \log\!\left( 1 - g\!\left( \langle x^{(i)}, \beta \rangle \right) \right).
\]
For a new x, the logistic-regression prediction is y_LR := 1 if g(⟨x, β_LR⟩) ≥ 1/2 and 0 otherwise; g(⟨x, β_LR⟩) can be interpreted as the probability that the label is 1.
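
There is no closed-form expression for β_LR, so it is computed numerically. Below is a minimal sketch that maximizes the average log-likelihood by gradient ascent (the learning rate, iteration count and helper names are arbitrary choices, not from the slides):

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, lr=0.1, n_iters=5000):
    """Gradient ascent on the average log-likelihood, with an intercept beta_0."""
    X1 = np.column_stack([np.ones(len(y)), X])   # prepend a column of ones for beta_0
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iters):
        p = logistic(X1 @ beta)                  # predicted probabilities
        beta += lr * X1.T @ (y - p) / len(y)     # gradient of the average log-likelihood
    return beta

def predict(beta, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return (logistic(X1 @ beta) >= 0.5).astype(int)

# Hypothetical usage on centered/normalized features X_train with 0/1 labels y_train:
# beta = fit_logistic_regression(X_train, y_train); y_hat = predict(beta, X_new)
```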

Iris data set Aim: Classify flowers using sepal width and length Two species, 5 examples each: Iris setosa (label 0): sepal lengths 5.4, 4.3, 4.8, 5.1 and 5.7, and sepal widths 3.7, 3, 3.1, 3.8 and 3.8 Iris versicolor (label 1): sepal lengths 6.5, 5.7, 7, 6.3 and 6.1, and sepal widths 2.8, 2.8, 3.2, 2.3 and 2.8 Two new examples: (5.1, 3.5), (5,2)

Iris data set. After centering and normalizing, the fit gives β_LR = (32.1, −29.6) and β_0 = 2.06.
i | x^(i)[1] | x^(i)[2] | ⟨x^(i), β_LR⟩ + β_0 | g(⟨x^(i), β_LR⟩ + β_0)
1 | −0.12 | 0.38 | −12.9 | 0.00
2 | −0.56 | −0.09 | −13.5 | 0.00
3 | −0.36 | −0.02 | −8.9 | 0.00
4 | −0.24 | 0.45 | −18.8 | 0.00
5 | 0.00 | 0.45 | −11.0 | 0.00
6 | 0.33 | −0.22 | 19.1 | 1.00
7 | 0.00 | −0.22 | 8.7 | 1.00
8 | 0.53 | 0.05 | 17.7 | 1.00
9 | 0.25 | −0.55 | 26.3 | 1.00
10 | 0.17 | −0.22 | 13.9 | 1.00

Iris data set. [Figure: the setosa and versicolor training examples and the two new examples in the (sepal length, sepal width) plane, with the logistic-regression probability shown as a color map from 0 to 1.]

Iris data set. [Figure: versicolor and virginica examples in the (sepal width, petal length) plane, with the predicted probability shown as a color map from 0 to 1.]

Digit classification. MNIST data. Aim: distinguish one digit from another. x^(i) is an image of a 6 or a 9; y^(i) = 1 if image i is a 6 and y^(i) = 0 if it is a 9. There are 2000 training examples and 2000 test examples, each half 6s and half 9s. Training error rate: 0.0; test error rate: 0.006.

Digit classification: β. [Figure: the logistic-regression coefficient vector β displayed as an image, with values roughly between −0.4 and 0.4.]

Digit classification: true positives.
β^T x | Probability of 6
20.878 | 1.00
18.217 | 1.00
16.408 | 1.00
[Figure: the corresponding images.]

Digit classification: true negatives.
β^T x | Probability of 6
−14.71 | 0.00
−15.829 | 0.00
−17.02 | 0.00
[Figure: the corresponding images.]

Digit classification: false positives.
β^T x | Probability of 6
7.612 | 0.9995
0.4341 | 0.606
7.822 | 0.9996
[Figure: the corresponding images.]

Digit classification: false negatives.
β^T x | Probability of 6
−5.984 | 0.0025
−2.384 | 0.084
−1.164 | 0.238
[Figure: the corresponding images.]

Digit classification. This is a toy problem: distinguishing one digit from another is very easy (classifying any given digit is harder). We used it to give insight into how logistic regression works. It turns out that, on this simplified problem, a very simple choice of β gives good results. Can you guess it?

Digit classification. β = (average of the 6s) − (average of the 9s). Training error: 0.005; test error: 0.0035. [Figure: this β displayed as an image.]
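
A sketch of this simple classifier (the decision threshold is an assumption; the slides only specify β as the difference of the class averages):

```python
import numpy as np

def mean_difference_classifier(X_train, y_train):
    """beta = mean of the class-1 images minus mean of the class-0 images,
    thresholded at the midpoint between the two class means."""
    mu1 = X_train[y_train == 1].mean(axis=0)
    mu0 = X_train[y_train == 0].mean(axis=0)
    beta = mu1 - mu0
    threshold = beta @ (mu1 + mu0) / 2
    return beta, threshold

def classify(beta, threshold, X):
    return (X @ beta > threshold).astype(int)

# Hypothetical usage with flattened MNIST images (X: n x 784 arrays, y in {0, 1}):
# beta, thr = mean_difference_classifier(X_train, y_train)
# test_error = np.mean(classify(beta, thr, X_test) != y_test)
```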