Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression


REGRESSION

Outline: Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression. Step-wise forward regression.

Regression methods. Statistical techniques for finding the best-fitting curve for a set of perturbed values from an unknown function. The points shown are generated from sin(2πx) and perturbed with Gaussian noise. (Example from C. Bishop: PRML book)

Error functions for regression. Let (x_1, t_1), ..., (x_n, t_n) be pairs of instances and their true values for an unknown function f: X → R, X ⊆ R^k. Let y_1, ..., y_n ∈ R be the values returned by a regression model for the instances x_1, ..., x_n ∈ X.

Sum-of-squares error (also called quadratic error or least-squares error):
e_sq(y_1, ..., y_n, t_1, ..., t_n) = (1/2) Σ_{i=1}^n (y_i − t_i)^2;
its expectation decomposes as variance + bias^2 + noise.

Mean squared error:
mse(y_1, ..., y_n, t_1, ..., t_n) = (1/n) Σ_{i=1}^n (y_i − t_i)^2 (i.e., 2·e_sq/n).

Root-mean-square error:
e_rms(y_1, ..., y_n, t_1, ..., t_n) = sqrt(mse(y_1, ..., y_n, t_1, ..., t_n)).
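A minimal NumPy sketch of these three error functions; the array names y_pred and t_true are hypothetical.

```python
import numpy as np

def sum_of_squares_error(y_pred, t_true):
    # e_sq = 1/2 * sum_i (y_i - t_i)^2  (with the 1/2 convention used above)
    return 0.5 * np.sum((y_pred - t_true) ** 2)

def mean_squared_error(y_pred, t_true):
    # mse = 1/n * sum_i (y_i - t_i)^2
    return np.mean((y_pred - t_true) ** 2)

def root_mean_square_error(y_pred, t_true):
    # e_rms = sqrt(mse)
    return np.sqrt(mean_squared_error(y_pred, t_true))

y_pred = np.array([1.1, 1.9, 3.2])
t_true = np.array([1.0, 2.0, 3.0])
print(sum_of_squares_error(y_pred, t_true),
      mean_squared_error(y_pred, t_true),
      root_mean_square_error(y_pred, t_true))
```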

General idea: Curve fitting. Use an approximation function of the form
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i),   x_i ∈ R^k, φ_j: R^k → R,
e.g., for x_i ∈ R, y(x_i, w) = Σ_{j=0}^M w_j x_i^j with φ_j(x_i) = x_i^j. The φ_j(x_i) are called basis functions.
Minimize the misfit (the displacements, or residuals) between y(x_i, w) and t_i, 1 ≤ i ≤ n, e.g., the sum-of-squares error
(1/2) Σ_{i=1}^n (y(x_i, w) − t_i)^2,
i.e., the sum of squares of the displacements. (Example from C. Bishop: PRML book)

Univariate linear regression. General form of univariate linear regression:
t = w_0 + w_1 x + noise,   x, w_j ∈ R.
Example: Suppose we aim at investigating the relationship between people's height (h_i) and weight (g_i) based on measurements (h_i, g_i), 1 ≤ i ≤ n. We want g_i ≈ w_0 + w_1 h_i, so we find the parameters by solving
min_{w_0, w_1} (1/2) Σ_{i=1}^n (g_i − (w_0 + w_1 h_i))^2.
This is the least-squares method.

Example (from Machine Learning by P. Flach). 9 simulated measurements obtained by adding Gaussian noise with mean 0 and variance 5 to the dashed linear function; the solid line represents linear regression applied to the 9 points.

Optimal parameters for univariate linear regression. Set the derivatives with respect to the intercept (w_0) and the slope (w_1) to zero and solve for each of the variables, respectively:

∂/∂w_0 (1/2) Σ_i (g_i − (w_0 + w_1 h_i))^2 = −Σ_i (g_i − (w_0 + w_1 h_i)) = 0   ⟹   w_0 = mean(g) − w_1 mean(h)

∂/∂w_1 (1/2) Σ_i (g_i − (w_0 + w_1 h_i))^2 = −Σ_i (g_i − (w_0 + w_1 h_i)) h_i = 0   ⟹   w_1 = Σ_i (h_i − mean(h))(g_i − mean(g)) / Σ_i (h_i − mean(h))^2 = Cov(h, g) / Var(h)

The fitted line is g(h) = w_0 + w_1 h = mean(g) + w_1 (h − mean(h)).
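A minimal sketch of this closed-form fit, assuming the height/weight setting above with synthetic data:

```python
import numpy as np

def fit_univariate(h, g):
    # w_1 = Cov(h, g) / Var(h), w_0 = mean(g) - w_1 * mean(h)
    w1 = np.cov(h, g, bias=True)[0, 1] / np.var(h)
    w0 = g.mean() - w1 * h.mean()
    return w0, w1

rng = np.random.default_rng(0)
h = rng.uniform(150, 200, size=50)               # heights in cm (synthetic)
g = 0.9 * h - 90 + rng.normal(0, 5, size=50)     # noisy weights
w0, w1 = fit_univariate(h, g)
print(w0, w1)  # close to (-90, 0.9)
```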

Abstract view on univariate linear regression. For a target variable t that is linearly dependent on a feature x, i.e., t = w_0 + w_1 x + noise, the general solution depends only on
w_1 = Cov(x, t) / Var(x).
This means that the solution is highly sensitive to noise and outliers.
Steps:
1. Normalize the feature by dividing its values by the feature's variance.
2. Calculate the covariance between the target variable and the normalized feature.

Probabilistic view on least squares. Assume
t_i = w_0 + w_1 x_i + ε_i,   ε_i ~ N(0, σ^2), i.i.d. normally distributed errors,
so t_i ~ N(w_0 + w_1 x_i, σ^2) and

P(t_i | w_0, w_1, σ^2) = (1 / sqrt(2πσ^2)) exp(−(t_i − (w_0 + w_1 x_i))^2 / (2σ^2)).

For i.i.d. data points t_1, ..., t_n:

P(t_1, ..., t_n | w_0, w_1, σ^2) = Π_{i=1}^n (1 / sqrt(2πσ^2)) exp(−(t_i − (w_0 + w_1 x_i))^2 / (2σ^2))
= (2πσ^2)^{−n/2} exp(−Σ_{i=1}^n (t_i − (w_0 + w_1 x_i))^2 / (2σ^2)),

so that

ln P(t_1, ..., t_n | w_0, w_1, σ^2) = −(n/2) ln(2π) − (n/2) ln σ^2 − Σ_{i=1}^n (t_i − (w_0 + w_1 x_i))^2 / (2σ^2).

Maximum likelihood estimation of w_0, w_1, σ^2. Setting the partial derivatives of the log-likelihood to zero:

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂w_0 ∝ Σ_i (t_i − (w_0 + w_1 x_i)) = 0   ⟹   w_0 = mean(t) − w_1 mean(x)

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂w_1 ∝ Σ_i (t_i − (w_0 + w_1 x_i)) x_i = 0   ⟹   w_1 = Cov(x, t) / Var(x)

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂σ^2 = −n/(2σ^2) + Σ_i (t_i − (w_0 + w_1 x_i))^2 / (2σ^4) = 0   ⟹   σ^2 = (1/n) Σ_i (t_i − (w_0 + w_1 x_i))^2

The ML estimates of w_0 and w_1 coincide with the least-squares solution.
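A minimal sketch, on synthetic data, checking that the ML estimates above reduce to the least-squares solution and that σ^2 is the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
t = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=200)   # true w0=2.0, w1=0.5, sigma=1.5

w1 = np.cov(x, t, bias=True)[0, 1] / np.var(x)     # w_1 = Cov(x, t) / Var(x)
w0 = t.mean() - w1 * x.mean()                      # w_0 = mean(t) - w_1 * mean(x)
sigma2 = np.mean((t - (w0 + w1 * x)) ** 2)         # sigma^2 = mean squared residual

print(w0, w1, sigma2)  # close to (2.0, 0.5, 1.5**2)
```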

Multivariate linear regression. Writing t_i = w_0 + w_1 x_i + ε_i, 1 ≤ i ≤ n, in matrix form:

(t_1, ..., t_n)^T = [1 x_1; ...; 1 x_n] (w_0, w_1)^T + (ε_1, ..., ε_n)^T

General form of multivariate linear regression: t = Xw + ε, where
t ∈ R^{n×1} is the vector of target variables,
X ∈ R^{n×m} is the matrix of feature vectors (each containing m features),
w ∈ R^{m×1} is the weight vector (i.e., a weight for each feature),
ε ∈ R^{n×1} is the noise vector.

General solution for w in the multivariate case. For univariate linear regression we found w_1 = Cov(x, t) / Var(x). It turns out that the general solution for the weight vector in the multivariate case is

w = (X^T X)^{-1} X^T t

Assume the feature vectors (i.e., rows) in X are 0-centered, i.e., from each row (x_i1, ..., x_im) we have subtracted the column means, where the j-th column mean is (1/n) Σ_i x_ij. Then (1/n) X^T X is the m×m covariance matrix, i.e., it contains the pairwise covariances between all features (what does it contain on the diagonal?). (X^T X)^{-1} decorrelates, centers, and normalizes the features. And (1/n) X^T t is an m-vector holding the covariance between each feature and the output values t.
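A minimal sketch of the multivariate closed-form solution on hypothetical data; a column of ones is prepended so that the intercept w_0 is learned as well:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 3
X_feat = rng.normal(size=(n, m))
true_w = np.array([1.0, -2.0, 0.5])
t = 3.0 + X_feat @ true_w + rng.normal(0, 0.1, size=n)

X = np.hstack([np.ones((n, 1)), X_feat])           # bias column + features
# Solve the normal equations X^T X w = X^T t (more stable than an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)  # close to [3.0, 1.0, -2.0, 0.5]
```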

Effect of correlation between features. [Figure: two 3D plots of y over the features x_1 and x_2; example from Machine Learning by P. Flach.] Red dots represent noisy samples of y, the red plane represents the true function y = x_1 + x_2, the green plane is the function learned by multivariate linear regression, and the blue plane is the function learned by decomposing the problem into two univariate regression problems. On the right, the features are highly correlated, so the sample gives much less information about the true function.

Regularized multivariate linear regression. Regularized least-squares method:

w* = argmin_w (t − Xw)^T (t − Xw) + λ ‖w‖,

where (t − Xw)^T (t − Xw) is the least-squares error and λ‖w‖ is the regularization term. The solution is

w* = (X^T X + λI)^{-1} X^T t,

where I is the identity matrix with 1s on the diagonal and 0s everywhere else. For the regularization one can use:
- Ridge regularization: ‖w‖ = Σ_i w_i^2 (i.e., the squared L2 norm), giving ridge regression.
- Lasso regularization: ‖w‖ = Σ_i |w_i| (i.e., the L1 norm), which favors sparser solutions, giving lasso regression.
λ determines the amount of regularization. Lasso regression is much more sensitive to the choice of λ.
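A minimal sketch of the ridge closed form above; lasso has no closed form and is usually solved iteratively (e.g., by coordinate descent), so only ridge is shown. The data and the λ values are hypothetical:

```python
import numpy as np

def ridge_fit(X, t, lam):
    # w = (X^T X + lambda * I)^{-1} X^T t, solved as a linear system
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ t)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
t = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(0, 0.1, size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_fit(X, t, lam))   # larger lambda shrinks the weights toward 0
```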

Ridge vs. Lasso regularization. [Figure: contours of the unregularized error around the maximum-likelihood solution w_ML in (w_1, w_2) space, together with the ridge regularization term w_1^2 + w_2^2 and the lasso regularization term |w_1| + |w_2|.]

Linear regression for classification. We learned that the general solution for w is w = (X^T X)^{-1} X^T t. For linear classification the goal is to learn a weight vector ŵ for a decision boundary ŵ^T x = b. Can we set ŵ = w? Yes: this gives the least-squares classifier. (X^T X)^{-1} decorrelates, centers, and normalizes the features (good to have). Suppose t = (±1, ..., ±1)^T; what is the result of X^T t? Caution: the complexity of computing (X^T X)^{-1} is O(nm^2 + m^3).
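A minimal sketch of the least-squares classifier on hypothetical two-dimensional data with targets in {+1, −1}:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X_pos = rng.normal(loc=[+2, +2], size=(n // 2, 2))
X_neg = rng.normal(loc=[-2, -2], size=(n // 2, 2))
X = np.hstack([np.ones((n, 1)), np.vstack([X_pos, X_neg])])   # bias column + features
t = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])        # targets +/- 1

w = np.linalg.solve(X.T @ X, X.T @ t)    # same normal equations as for regression
pred = np.sign(X @ w)                    # decision rule: sign(w^T x)
print((pred == t).mean())                # training accuracy
```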

Univariate polynomial curve fitting. Use an approximation function of the form
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i),
where φ_j(x_i) = x_i^j and w_0 is the bias term. Least-squares regression: minimize the misfit between y(x_i, w) and t_i, 1 ≤ i ≤ n, e.g., the sum-of-squares error
(1/2) Σ_{i=1}^n (y(x_i, w) − t_i)^2.
Note that the model is still linear in the weights w_i.

Example of overfitting. [Figure illustrating overfitting in polynomial curve fitting; example from C. Bishop: PRML book.]

Impact of data and regularization. Increasing the number of data points mitigates overfitting. Another possibility: use regularization,

ẽ(w) = (1/2) Σ_i (y(x_i, w) − t_i)^2 + (λ/2) ‖w‖^2,

where λ is the regularization coefficient and the second summand is the regularization term. (Examples from C. Bishop: PRML book)

Polynomial curve fitting: Stochastic gradient descent.

Definition: The gradient of a differentiable function f(w_1, ..., w_M) is defined as
∇_w f = (∂f/∂w_1) e_1 + ... + (∂f/∂w_M) e_M,
where the e_i are orthogonal unit vectors.

Theorem: For a function f that is differentiable in the neighborhood of a point w, the update w' = w − η ∇_w f(w) yields f(w') < f(w) for small enough η > 0.

Least-mean-squares algorithm: for each data point x_i,

w^(τ+1) = w^(τ) − (1/2) η ∇_w (t_i − (w^(τ))^T φ(x_i))^2    // gradient descent; η: learning rate
         = w^(τ) + η (t_i − (w^(τ))^T φ(x_i)) φ(x_i)         // stochastic gradient descent with the least-mean-squares rule
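A minimal sketch of the least-mean-squares rule for polynomial curve fitting; the degree, learning rate, and data are hypothetical choices:

```python
import numpy as np

def poly_features(x, degree):
    # phi(x) = (1, x, x^2, ..., x^degree)
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=100)

degree, eta, epochs = 3, 0.1, 500
Phi = poly_features(x, degree)
w = np.zeros(degree + 1)
for _ in range(epochs):
    for i in rng.permutation(len(x)):    # visit the data points in random order
        error = t[i] - w @ Phi[i]        # t_i - w^T phi(x_i)
        w = w + eta * error * Phi[i]     # least-mean-squares update
print(w)                                  # approximate cubic fit to the noisy sine data
```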

Polynomial curve fitting: Maximum likelihood estimation. Assume each observation t_i comes from the function, with added Gaussian noise:
t_i = y(x_i, w) + ε,   P(ε | σ^2) = N(ε | 0, σ^2),
so P(t_i | x_i, w, σ^2) = N(t_i | y(x_i, w), σ^2). With
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i) = w^T φ(x_i),
we can write the likelihood function based on the observations:
P(t | x, w, σ^2) = Π_i N(t_i | y(x_i, w), σ^2) = Π_i N(t_i | w^T φ(x_i), σ^2).

Polynomial curve fitting: Maximum likelihood estimation (2). We can write the likelihood function based on i.i.d. observations:
P(t | x, w, σ^2) = Π_i N(t_i | y(x_i, w), σ^2) = Π_i N(t_i | w^T φ(x_i), σ^2).
Taking the logarithm:
ln P(t | x, w, σ^2) = Σ_i ln N(t_i | w^T φ(x_i), σ^2) = −(N/2) ln σ^2 − (N/2) ln(2π) − (1/(2σ^2)) Σ_i (t_i − w^T φ(x_i))^2.

Polynomial curve fitting: Maximum likelihood estimation (3). Taking the gradient and setting it to zero:
∇_w ln P(t | x, w, σ^2) = (1/σ^2) Σ_{i=1}^N (t_i − w^T φ(x_i)) φ(x_i)^T = 0.
Solving for w:
w_ML = (Φ^T Φ)^{-1} Φ^T t,   where Φ = [φ_0(x_1) ... φ_M(x_1); ... ; φ_0(x_N) ... φ_M(x_N)] is the design matrix.
Geometrical interpretation: y = Φ w_ML = (φ_0, ..., φ_M) w_ML is the orthogonal projection of t onto the subspace S spanned by the columns of Φ; w_ML minimizes the distance between t and its projection onto S. (Example from C. Bishop: PRML book)
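A minimal sketch of the closed-form solution w_ML = (Φ^T Φ)^{-1} Φ^T t on noisy samples of sin(2πx); the degree and noise level are hypothetical. The final check illustrates the geometrical interpretation: the residual is orthogonal to the columns of Φ.

```python
import numpy as np

rng = np.random.default_rng(6)
N, degree = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=N)

Phi = np.vander(x, degree + 1, increasing=True)   # design matrix, Phi[i, j] = x_i^j
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # numerically solves the normal equations

# Geometrical interpretation: Phi @ w_ml is the orthogonal projection of t onto the
# column space of Phi, so the residual is orthogonal to every basis column.
residual = t - Phi @ w_ml
print(Phi.T @ residual)                            # approximately zero
```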

Reviewing the multivariate case. Generalization to the multivariate case:
y(x, w) = Σ_{i=0}^M w_i φ_i(x) = w^T φ(x),   x ∈ R^k.
The discussed algorithms of stochastic gradient descent and MLE generalize to this case as well. The choice of the φ_i is crucial for the tradeoff between regression quality and complexity.

Choices for basis functions.
- Simplest case: return the i-th component of x, φ_i(x) = x^(i).
- Polynomial basis functions (for x ∈ R): φ_i(x) = x^i (small changes in x affect all basis functions).
- Gaussian basis functions (for x ∈ R): φ_i(x) = exp(−(x − μ_i)^2 / (2s^2)), where μ_i controls the location and s controls the scale (small changes in x affect only nearby basis functions).
(Examples from C. Bishop: PRML book)
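A minimal sketch of a Gaussian basis expansion followed by an ordinary least-squares fit; the centers μ_i and scale s are hypothetical choices:

```python
import numpy as np

def gaussian_features(x, centers, s):
    # One column per Gaussian basis function, plus a constant bias column.
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=50)

centers = np.linspace(0, 1, 9)                  # mu_i: evenly spaced locations
Phi = gaussian_features(x, centers, s=0.15)     # s controls the scale
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # least-squares fit in the new basis
print(np.round(w, 2))
```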

Forward step-wise regression. Goal: find a fitting function y(x_i) = y_1(x_i, w_1) + ... + y_m(x_i, w_m).
Step 1: Fit a first simple function y_1(x_i, w_1): w_1 = argmin_w Σ_i (t_i − y_1(x_i, w))^2.
Step 2: Fit a second simple model y_2(x_i, w) to the residuals of the first: w_2 = argmin_w Σ_i (t_i − y_1(x_i, w_1) − y_2(x_i, w))^2.
Step m: Fit a simple model y_m(x_i, w) to the residuals of the previous step.
Stop when no significant improvement in training error is made.
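A minimal sketch of this procedure, assuming each simple model y_m is a univariate least-squares fit on a single feature chosen greedily; the data and the stopping threshold are hypothetical:

```python
import numpy as np

def fit_one_feature(x, r):
    # Univariate least squares of the residual r on feature x: slope = Cov/Var.
    w1 = np.cov(x, r, bias=True)[0, 1] / np.var(x)
    w0 = r.mean() - w1 * x.mean()
    return w0, w1

rng = np.random.default_rng(8)
n = 200
X = rng.normal(size=(n, 4))
t = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.3, size=n)

residual, prediction, tol = t.copy(), np.zeros(n), 1e-3
for _ in range(20):
    errs, fits = [], []
    for j in range(X.shape[1]):                     # try every feature on the current residuals
        w0, w1 = fit_one_feature(X[:, j], residual)
        fits.append((j, w0, w1))
        errs.append(np.mean((residual - (w0 + w1 * X[:, j])) ** 2))
    j, w0, w1 = fits[int(np.argmin(errs))]
    if np.mean(residual ** 2) - min(errs) < tol:    # stop: no significant improvement
        break
    prediction += w0 + w1 * X[:, j]                 # add the new simple model
    residual = t - prediction                       # residuals for the next step
print("training MSE:", np.mean(residual ** 2))
```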

Further considerations on regression. Other choices of regularization functions: L_p regularization is given by Σ_i |w_i|^p (illustrated for p = 0.5, p = 1, p = 2, p = 4). For p > 1, no sparse solutions are achieved. Tree models can be applied to regression: impurity reduction translates to variance reduction (see also the exercises).

Summary. Main solution for linear regression: univariate w_1 = Cov(x, t) / Var(x); multivariate w = (X^T X)^{-1} X^T t. Regularization mitigates overfitting: Lasso (L1) is sparse with high probability; Ridge (L2) is not sparse. Solution strategies: stochastic gradient descent, MLE, forward step-wise regression.