Outline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression


REGRESSION

Outline: Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression. Step-wise forward regression.

Regression methods. Statistical techniques for finding the best-fitting curve for a set of perturbed values from an unknown function. The points shown are generated from sin(2πx) and perturbed with Gaussian noise. (Example from C. Bishop: PRML book)

Error functions for regression. Let (x_1, t_1), ..., (x_n, t_n) be pairs of instances and their true values for an unknown function f: X → R, X ⊆ R^k. Let y_1, ..., y_n ∈ R be the values returned by a regression model for the instances x_1, ..., x_n ∈ X.

Sum-of-squares error (also called quadratic error or least-squares error):
e_sq(y_1, ..., y_n, t_1, ..., t_n) = (1/2) Σ_{i=1}^n (y_i − t_i)^2;
its expectation decomposes as variance + bias^2 + noise.

Mean squared error:
mse(y_1, ..., y_n, t_1, ..., t_n) = (1/n) Σ_{i=1}^n (y_i − t_i)^2 (i.e., 2·e_sq/n).

Root-mean-square error:
e_rms(y_1, ..., y_n, t_1, ..., t_n) = sqrt(mse(y_1, ..., y_n, t_1, ..., t_n)).
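A minimal NumPy sketch of these three error functions; the array names y_pred and t_true are hypothetical.

```python
import numpy as np

def sum_of_squares_error(y_pred, t_true):
    # e_sq = 1/2 * sum_i (y_i - t_i)^2  (with the 1/2 convention used above)
    return 0.5 * np.sum((y_pred - t_true) ** 2)

def mean_squared_error(y_pred, t_true):
    # mse = 1/n * sum_i (y_i - t_i)^2
    return np.mean((y_pred - t_true) ** 2)

def root_mean_square_error(y_pred, t_true):
    # e_rms = sqrt(mse)
    return np.sqrt(mean_squared_error(y_pred, t_true))

y_pred = np.array([1.1, 1.9, 3.2])
t_true = np.array([1.0, 2.0, 3.0])
print(sum_of_squares_error(y_pred, t_true),
      mean_squared_error(y_pred, t_true),
      root_mean_square_error(y_pred, t_true))
```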

General idea: Curve fitting. Use an approximation function of the form
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i),   x_i ∈ R^k, φ_j: R^k → R,
e.g., for x_i ∈ R, y(x_i, w) = Σ_{j=0}^M w_j x_i^j with φ_j(x_i) = x_i^j. The φ_j(x_i) are called basis functions.
Minimize the misfit (the displacements, or residuals) between y(x_i, w) and t_i, 1 ≤ i ≤ n, e.g., the sum-of-squares error
(1/2) Σ_{i=1}^n (y(x_i, w) − t_i)^2,
i.e., the sum of squares of the displacements. (Example from C. Bishop: PRML book)

Univariate linear regression. General form of univariate linear regression:
t = w_0 + w_1 x + noise,   x, w_j ∈ R.
Example: Suppose we aim at investigating the relationship between people's height (h_i) and weight (g_i) based on measurements (h_i, g_i), 1 ≤ i ≤ n. We want g_i ≈ w_0 + w_1 h_i, so we find the parameters by solving
min_{w_0, w_1} (1/2) Σ_{i=1}^n (g_i − (w_0 + w_1 h_i))^2.
This is the least-squares method.

Example (from Machine Learning by P. Flach). 9 simulated measurements obtained by adding Gaussian noise with mean 0 and variance 5 to the dashed linear function; the solid line represents linear regression applied to the 9 points.

Optimal parameters for univariate linear regression. Set the derivatives with respect to the intercept (w_0) and the slope (w_1) to zero and solve for each of the variables, respectively:

∂/∂w_0 (1/2) Σ_i (g_i − (w_0 + w_1 h_i))^2 = −Σ_i (g_i − (w_0 + w_1 h_i)) = 0   ⟹   w_0 = mean(g) − w_1 mean(h)

∂/∂w_1 (1/2) Σ_i (g_i − (w_0 + w_1 h_i))^2 = −Σ_i (g_i − (w_0 + w_1 h_i)) h_i = 0   ⟹   w_1 = Σ_i (h_i − mean(h))(g_i − mean(g)) / Σ_i (h_i − mean(h))^2 = Cov(h, g) / Var(h)

The fitted line is g(h) = w_0 + w_1 h = mean(g) + w_1 (h − mean(h)).
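A minimal sketch of this closed-form fit, assuming the height/weight setting above with synthetic data:

```python
import numpy as np

def fit_univariate(h, g):
    # w_1 = Cov(h, g) / Var(h), w_0 = mean(g) - w_1 * mean(h)
    w1 = np.cov(h, g, bias=True)[0, 1] / np.var(h)
    w0 = g.mean() - w1 * h.mean()
    return w0, w1

rng = np.random.default_rng(0)
h = rng.uniform(150, 200, size=50)               # heights in cm (synthetic)
g = 0.9 * h - 90 + rng.normal(0, 5, size=50)     # noisy weights
w0, w1 = fit_univariate(h, g)
print(w0, w1)  # close to (-90, 0.9)
```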

Abstract view on univariate linear regression. For a target variable t that is linearly dependent on a feature x, i.e., t = w_0 + w_1 x + noise, the general solution depends only on
w_1 = Cov(x, t) / Var(x).
This means that the solution is highly sensitive to noise and outliers.
Steps:
1. Normalize the feature by dividing its values by the feature's variance.
2. Calculate the covariance between the target variable and the normalized feature.

Probabilistic view on least squares. Assume
t_i = w_0 + w_1 x_i + ε_i,   ε_i ~ N(0, σ^2), i.i.d. normally distributed errors,
so t_i ~ N(w_0 + w_1 x_i, σ^2) and

P(t_i | w_0, w_1, σ^2) = (1 / sqrt(2πσ^2)) exp(−(t_i − (w_0 + w_1 x_i))^2 / (2σ^2)).

For i.i.d. data points t_1, ..., t_n:

P(t_1, ..., t_n | w_0, w_1, σ^2) = Π_{i=1}^n (1 / sqrt(2πσ^2)) exp(−(t_i − (w_0 + w_1 x_i))^2 / (2σ^2))
= (2πσ^2)^{−n/2} exp(−Σ_{i=1}^n (t_i − (w_0 + w_1 x_i))^2 / (2σ^2)),

so that

ln P(t_1, ..., t_n | w_0, w_1, σ^2) = −(n/2) ln(2π) − (n/2) ln σ^2 − Σ_{i=1}^n (t_i − (w_0 + w_1 x_i))^2 / (2σ^2).

Maximum likelihood estimation of w_0, w_1, σ^2. Setting the partial derivatives of the log-likelihood to zero:

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂w_0 ∝ Σ_i (t_i − (w_0 + w_1 x_i)) = 0   ⟹   w_0 = mean(t) − w_1 mean(x)

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂w_1 ∝ Σ_i (t_i − (w_0 + w_1 x_i)) x_i = 0   ⟹   w_1 = Cov(x, t) / Var(x)

∂ ln P(t_1, ..., t_n | w_0, w_1, σ^2) / ∂σ^2 = −n/(2σ^2) + Σ_i (t_i − (w_0 + w_1 x_i))^2 / (2σ^4) = 0   ⟹   σ^2 = (1/n) Σ_i (t_i − (w_0 + w_1 x_i))^2

The ML estimates of w_0 and w_1 coincide with the least-squares solution.
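A minimal sketch, on synthetic data, checking that the ML estimates above reduce to the least-squares solution and that σ^2 is the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
t = 2.0 + 0.5 * x + rng.normal(0, 1.5, size=200)   # true w0=2.0, w1=0.5, sigma=1.5

w1 = np.cov(x, t, bias=True)[0, 1] / np.var(x)     # w_1 = Cov(x, t) / Var(x)
w0 = t.mean() - w1 * x.mean()                      # w_0 = mean(t) - w_1 * mean(x)
sigma2 = np.mean((t - (w0 + w1 * x)) ** 2)         # sigma^2 = mean squared residual

print(w0, w1, sigma2)  # close to (2.0, 0.5, 1.5**2)
```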

Multivariate linear regression. Writing t_i = w_0 + w_1 x_i + ε_i, 1 ≤ i ≤ n, in matrix form:

(t_1, ..., t_n)^T = [1 x_1; ...; 1 x_n] (w_0, w_1)^T + (ε_1, ..., ε_n)^T

General form of multivariate linear regression: t = Xw + ε, where
t ∈ R^{n×1} is the vector of target variables,
X ∈ R^{n×m} is the matrix of feature vectors (each containing m features),
w ∈ R^{m×1} is the weight vector (i.e., a weight for each feature),
ε ∈ R^{n×1} is the noise vector.

General solution for w in the multivariate case. For univariate linear regression we found w_1 = Cov(x, t) / Var(x). It turns out that the general solution for the weight vector in the multivariate case is

w = (X^T X)^{-1} X^T t

Assume the feature vectors (i.e., rows) in X are 0-centered, i.e., from each row (x_i1, ..., x_im) we have subtracted the column means, where the j-th column mean is (1/n) Σ_i x_ij. Then (1/n) X^T X is the m×m covariance matrix, i.e., it contains the pairwise covariances between all features (what does it contain on the diagonal?). (X^T X)^{-1} decorrelates, centers, and normalizes the features. And (1/n) X^T t is an m-vector holding the covariance between each feature and the output values t.
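A minimal sketch of the multivariate closed-form solution on hypothetical data; a column of ones is prepended so that the intercept w_0 is learned as well:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 3
X_feat = rng.normal(size=(n, m))
true_w = np.array([1.0, -2.0, 0.5])
t = 3.0 + X_feat @ true_w + rng.normal(0, 0.1, size=n)

X = np.hstack([np.ones((n, 1)), X_feat])           # bias column + features
# Solve the normal equations X^T X w = X^T t (more stable than an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)  # close to [3.0, 1.0, -2.0, 0.5]
```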

Effect of correlation between features. [Figure: two 3D plots of y over the features x_1 and x_2; example from Machine Learning by P. Flach.] Red dots represent noisy samples of y, the red plane represents the true function y = x_1 + x_2, the green plane is the function learned by multivariate linear regression, and the blue plane is the function learned by decomposing the problem into two univariate regression problems. On the right, the features are highly correlated, so the sample gives much less information about the true function.

Regularized multivariate linear regression. Regularized least-squares method:

w* = argmin_w (t − Xw)^T (t − Xw) + λ ‖w‖,

where (t − Xw)^T (t − Xw) is the least-squares error and λ‖w‖ is the regularization term. The solution is

w* = (X^T X + λI)^{-1} X^T t,

where I is the identity matrix with 1s on the diagonal and 0s everywhere else. For the regularization one can use:
- Ridge regularization: ‖w‖ = Σ_i w_i^2 (i.e., the squared L2 norm), giving ridge regression.
- Lasso regularization: ‖w‖ = Σ_i |w_i| (i.e., the L1 norm), which favors sparser solutions, giving lasso regression.
λ determines the amount of regularization. Lasso regression is much more sensitive to the choice of λ.
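A minimal sketch of the ridge closed form above; lasso has no closed form and is usually solved iteratively (e.g., by coordinate descent), so only ridge is shown. The data and the λ values are hypothetical:

```python
import numpy as np

def ridge_fit(X, t, lam):
    # w = (X^T X + lambda * I)^{-1} X^T t, solved as a linear system
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ t)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
t = X @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(0, 0.1, size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, ridge_fit(X, t, lam))   # larger lambda shrinks the weights toward 0
```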

Ridge vs. Lasso regularization. [Figure: contours of the unregularized error around the maximum-likelihood solution w_ML in (w_1, w_2) space, together with the ridge regularization term w_1^2 + w_2^2 and the lasso regularization term |w_1| + |w_2|.]

Linear regression for classification. We learned that the general solution for w is w = (X^T X)^{-1} X^T t. For linear classification the goal is to learn a weight vector ŵ for a decision boundary ŵ^T x = b. Can we set ŵ = w? Yes: this gives the least-squares classifier. (X^T X)^{-1} decorrelates, centers, and normalizes the features (good to have). Suppose t = (±1, ..., ±1)^T; what is the result of X^T t? Caution: the complexity of computing (X^T X)^{-1} is O(nm^2 + m^3).
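A minimal sketch of the least-squares classifier on hypothetical two-dimensional data with targets in {+1, −1}:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X_pos = rng.normal(loc=[+2, +2], size=(n // 2, 2))
X_neg = rng.normal(loc=[-2, -2], size=(n // 2, 2))
X = np.hstack([np.ones((n, 1)), np.vstack([X_pos, X_neg])])   # bias column + features
t = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])        # targets +/- 1

w = np.linalg.solve(X.T @ X, X.T @ t)    # same normal equations as for regression
pred = np.sign(X @ w)                    # decision rule: sign(w^T x)
print((pred == t).mean())                # training accuracy
```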

Univariate polynomial curve fitting. Use an approximation function of the form
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i),
where φ_j(x_i) = x_i^j and w_0 is the bias term. Least-squares regression: minimize the misfit between y(x_i, w) and t_i, 1 ≤ i ≤ n, e.g., the sum-of-squares error
(1/2) Σ_{i=1}^n (y(x_i, w) − t_i)^2.
Note that the model is still linear in the weights w_i.

Example of overfitting. [Figure illustrating overfitting in polynomial curve fitting; example from C. Bishop: PRML book.]

Impact of data and regularization. Increasing the number of data points mitigates overfitting. Another possibility: use regularization,

ẽ(w) = (1/2) Σ_i (y(x_i, w) − t_i)^2 + (λ/2) ‖w‖^2,

where λ is the regularization coefficient and the second summand is the regularization term. (Examples from C. Bishop: PRML book)

Polynomial curve fitting: Stochastic gradient descent.

Definition: The gradient of a differentiable function f(w_1, ..., w_M) is defined as
∇_w f = (∂f/∂w_1) e_1 + ... + (∂f/∂w_M) e_M,
where the e_i are orthogonal unit vectors.

Theorem: For a function f that is differentiable in the neighborhood of a point w, the update w' = w − η ∇_w f(w) yields f(w') < f(w) for small enough η > 0.

Least-mean-squares algorithm: for each data point x_i,

w^(τ+1) = w^(τ) − (1/2) η ∇_w (t_i − (w^(τ))^T φ(x_i))^2    // gradient descent; η: learning rate
         = w^(τ) + η (t_i − (w^(τ))^T φ(x_i)) φ(x_i)         // stochastic gradient descent with the least-mean-squares rule
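A minimal sketch of the least-mean-squares rule for polynomial curve fitting; the degree, learning rate, and data are hypothetical choices:

```python
import numpy as np

def poly_features(x, degree):
    # phi(x) = (1, x, x^2, ..., x^degree)
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, size=100)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=100)

degree, eta, epochs = 3, 0.1, 500
Phi = poly_features(x, degree)
w = np.zeros(degree + 1)
for _ in range(epochs):
    for i in rng.permutation(len(x)):    # visit the data points in random order
        error = t[i] - w @ Phi[i]        # t_i - w^T phi(x_i)
        w = w + eta * error * Phi[i]     # least-mean-squares update
print(w)                                  # approximate cubic fit to the noisy sine data
```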

Polynomial curve fitting: Maximum likelihood estimation. Assume each observation t_i comes from the function, with added Gaussian noise:
t_i = y(x_i, w) + ε,   P(ε | σ^2) = N(ε | 0, σ^2),
so P(t_i | x_i, w, σ^2) = N(t_i | y(x_i, w), σ^2). With
y(x_i, w) = w_0 + w_1 φ_1(x_i) + ... + w_M φ_M(x_i) = w^T φ(x_i),
we can write the likelihood function based on the observations:
P(t | x, w, σ^2) = Π_i N(t_i | y(x_i, w), σ^2) = Π_i N(t_i | w^T φ(x_i), σ^2).

Polynomial curve fitting: Maximum likelihood estimation (2). We can write the likelihood function based on i.i.d. observations:
P(t | x, w, σ^2) = Π_i N(t_i | y(x_i, w), σ^2) = Π_i N(t_i | w^T φ(x_i), σ^2).
Taking the logarithm:
ln P(t | x, w, σ^2) = Σ_i ln N(t_i | w^T φ(x_i), σ^2) = −(N/2) ln σ^2 − (N/2) ln(2π) − (1/(2σ^2)) Σ_i (t_i − w^T φ(x_i))^2.

Polynomial curve fitting: Maximum likelihood estimation (3). Taking the gradient and setting it to zero:
∇_w ln P(t | x, w, σ^2) = (1/σ^2) Σ_{i=1}^N (t_i − w^T φ(x_i)) φ(x_i)^T = 0.
Solving for w:
w_ML = (Φ^T Φ)^{-1} Φ^T t,   where Φ = [φ_0(x_1) ... φ_M(x_1); ... ; φ_0(x_N) ... φ_M(x_N)] is the design matrix.
Geometrical interpretation: y = Φ w_ML = (φ_0, ..., φ_M) w_ML is the orthogonal projection of t onto the subspace S spanned by the columns of Φ; w_ML minimizes the distance between t and its projection onto S. (Example from C. Bishop: PRML book)
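A minimal sketch of the closed-form solution w_ML = (Φ^T Φ)^{-1} Φ^T t on noisy samples of sin(2πx); the degree and noise level are hypothetical. The final check illustrates the geometrical interpretation: the residual is orthogonal to the columns of Φ.

```python
import numpy as np

rng = np.random.default_rng(6)
N, degree = 10, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=N)

Phi = np.vander(x, degree + 1, increasing=True)   # design matrix, Phi[i, j] = x_i^j
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)    # numerically solves the normal equations

# Geometrical interpretation: Phi @ w_ml is the orthogonal projection of t onto the
# column space of Phi, so the residual is orthogonal to every basis column.
residual = t - Phi @ w_ml
print(Phi.T @ residual)                            # approximately zero
```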

Reviewing the multivariate case. Generalization to the multivariate case:
y(x, w) = Σ_{i=0}^M w_i φ_i(x) = w^T φ(x),   x ∈ R^k.
The discussed algorithms of stochastic gradient descent and MLE generalize to this case as well. The choice of the φ_i is crucial for the tradeoff between regression quality and complexity.

Choices for basis functions.
- Simplest case: return the i-th component of x, φ_i(x) = x^(i).
- Polynomial basis functions (for x ∈ R): φ_i(x) = x^i (small changes in x affect all basis functions).
- Gaussian basis functions (for x ∈ R): φ_i(x) = exp(−(x − μ_i)^2 / (2s^2)), where μ_i controls the location and s controls the scale (small changes in x affect only nearby basis functions).
(Examples from C. Bishop: PRML book)
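A minimal sketch of a Gaussian basis expansion followed by an ordinary least-squares fit; the centers μ_i and scale s are hypothetical choices:

```python
import numpy as np

def gaussian_features(x, centers, s):
    # One column per Gaussian basis function, plus a constant bias column.
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=50)

centers = np.linspace(0, 1, 9)                  # mu_i: evenly spaced locations
Phi = gaussian_features(x, centers, s=0.15)     # s controls the scale
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # least-squares fit in the new basis
print(np.round(w, 2))
```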

Forward step-wise regression. Goal: find a fitting function y(x_i) = y_1(x_i, w_1) + ... + y_m(x_i, w_m).
Step 1: Fit a first simple function y_1(x_i, w_1): w_1 = argmin_w Σ_i (t_i − y_1(x_i, w))^2.
Step 2: Fit a second simple model y_2(x_i, w) to the residuals of the first: w_2 = argmin_w Σ_i (t_i − y_1(x_i, w_1) − y_2(x_i, w))^2.
Step m: Fit a simple model y_m(x_i, w) to the residuals of the previous step.
Stop when no significant improvement in training error is made.
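A minimal sketch of this procedure, assuming each simple model y_m is a univariate least-squares fit on a single feature chosen greedily; the data and the stopping threshold are hypothetical:

```python
import numpy as np

def fit_one_feature(x, r):
    # Univariate least squares of the residual r on feature x: slope = Cov/Var.
    w1 = np.cov(x, r, bias=True)[0, 1] / np.var(x)
    w0 = r.mean() - w1 * x.mean()
    return w0, w1

rng = np.random.default_rng(8)
n = 200
X = rng.normal(size=(n, 4))
t = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.3, size=n)

residual, prediction, tol = t.copy(), np.zeros(n), 1e-3
for _ in range(20):
    errs, fits = [], []
    for j in range(X.shape[1]):                     # try every feature on the current residuals
        w0, w1 = fit_one_feature(X[:, j], residual)
        fits.append((j, w0, w1))
        errs.append(np.mean((residual - (w0 + w1 * X[:, j])) ** 2))
    j, w0, w1 = fits[int(np.argmin(errs))]
    if np.mean(residual ** 2) - min(errs) < tol:    # stop: no significant improvement
        break
    prediction += w0 + w1 * X[:, j]                 # add the new simple model
    residual = t - prediction                       # residuals for the next step
print("training MSE:", np.mean(residual ** 2))
```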

Further considerations on regression. Other choices of regularization functions: L_p regularization is given by Σ_i |w_i|^p (illustrated for p = 0.5, p = 1, p = 2, p = 4). For p > 1, no sparse solutions are achieved. Tree models can be applied to regression: impurity reduction translates to variance reduction (see also the exercises).

Summary. Main solution for linear regression: univariate w_1 = Cov(x, t) / Var(x); multivariate w = (X^T X)^{-1} X^T t. Regularization mitigates overfitting: Lasso (L1) is sparse with high probability; Ridge (L2) is not sparse. Solution strategies: stochastic gradient descent, MLE, forward step-wise regression.