Machine Learning. A. Supervised Learning A.1. Linear Regression. Lars Schmidt-Thieme

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), Institute for Computer Science, University of Hildesheim, Germany

Outline
1. Linear Regression via Normal Equations
2. Minimizing a Function via Gradient Descent
3. Learning Linear Regression Models via Gradient Descent
4. Case Weights

1. Linear Regression via Normal Equations

The Simple Linear Regression Problem

Given a set $D^{\text{train}} := \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subseteq \mathbb{R} \times \mathbb{R}$ called training data, compute the parameters $(\hat\beta_0, \hat\beta_1)$ of a linear regression function
$$\hat y(x) := \hat\beta_0 + \hat\beta_1 x$$
s.t. for a set $D^{\text{test}} \subseteq \mathbb{R} \times \mathbb{R}$ called test set the test error
$$\operatorname{err}(\hat y; D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} (y - \hat y(x))^2$$
is minimal.

Note: $D^{\text{test}}$ has (i) to be from the same data generating process and (ii) not to be available during training.

The (Multiple) Linear Regression Problem

Given a set $D^{\text{train}} := \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \subseteq \mathbb{R}^M \times \mathbb{R}$ called training data, compute the parameters $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_M)$ of a linear regression function
$$\hat y(x) := \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_M x_M$$
s.t. for a set $D^{\text{test}} \subseteq \mathbb{R}^M \times \mathbb{R}$ called test set the test error
$$\operatorname{err}(\hat y; D^{\text{test}}) := \frac{1}{|D^{\text{test}}|} \sum_{(x,y) \in D^{\text{test}}} (y - \hat y(x))^2$$
is minimal.

Note: $D^{\text{test}}$ has (i) to be from the same data generating process and (ii) not to be available during training.
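The test error defined above is the mean squared error over the test set. A minimal sketch, assuming NumPy; the function name `prediction_error`, the model parameters, and the test data are hypothetical and chosen only for illustration:

```python
import numpy as np

def prediction_error(y_hat, X_test, y_test):
    """err(y_hat; D_test): mean squared error of y_hat over the test set."""
    preds = np.array([y_hat(x) for x in X_test])
    return float(np.mean((np.asarray(y_test, dtype=float) - preds) ** 2))

# Hypothetical model (beta_0, beta_1, beta_2) and test data.
beta = np.array([1.0, 2.0, -1.0])
y_hat = lambda x: beta[0] + beta[1:] @ x
X_test = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y_test = np.array([1.0, 3.0, 4.0])

err = prediction_error(y_hat, X_test, y_test)
print(err)
```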

Several predictors

Several predictor variables $x_{\cdot,1}, x_{\cdot,2}, \ldots, x_{\cdot,M}$:
$$\hat y = \hat\beta_0 + \hat\beta_1 x_{\cdot,1} + \hat\beta_2 x_{\cdot,2} + \cdots + \hat\beta_M x_{\cdot,M} = \hat\beta_0 + \sum_{m=1}^{M} \hat\beta_m x_{\cdot,m}$$
with $M + 1$ parameters $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_M$.

Linear form

Several predictor variables $x_{\cdot,1}, x_{\cdot,2}, \ldots, x_{\cdot,M}$:
$$\hat y = \hat\beta_0 + \sum_{m=1}^{M} \hat\beta_m x_{\cdot,m} = \langle \hat\beta, x_{\cdot} \rangle$$
where
$$\hat\beta := (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_M)^T, \quad x_{\cdot} := (1, x_{\cdot,1}, \ldots, x_{\cdot,M})^T.$$
Thus, the intercept is handled like any other parameter, for the artificial constant predictor $x_{\cdot,0} \equiv 1$.

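Simultaneous equations for the whole dataset

For the whole dataset $D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$:
$$y \approx \hat y := X \hat\beta$$
where
$$y := \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \quad \hat y := \begin{pmatrix} \hat y_1 \\ \vdots \\ \hat y_N \end{pmatrix}, \quad X := \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} x_{1,1} & x_{1,2} & \ldots & x_{1,M} \\ \vdots & \vdots & & \vdots \\ x_{N,1} & x_{N,2} & \ldots & x_{N,M} \end{pmatrix}$$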
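As an illustration of the linear form, the constant predictor can be added as a column of ones so that all predictions come from one matrix-vector product. A minimal sketch, assuming NumPy; the helper `add_intercept_column`, the predictor values, and the parameter vector are hypothetical:

```python
import numpy as np

def add_intercept_column(X_raw):
    """Prepend the artificial constant predictor x_{.,0} = 1 to every case."""
    X_raw = np.asarray(X_raw, dtype=float)
    return np.column_stack([np.ones(len(X_raw)), X_raw])

# Hypothetical predictors (N = 3 cases, M = 2 variables) and parameters.
X_raw = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0]])
beta = np.array([0.5, 1.0, -1.0])      # (beta_0, beta_1, beta_2)

X = add_intercept_column(X_raw)
y_hat = X @ beta                        # all N predictions at once
print(y_hat)
```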

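Least squares estimates

Least squares estimates $\hat\beta$ minimize
$$\operatorname{RSS}(\hat\beta, D^{\text{train}}) := \sum_{n=1}^{N} (y_n - \hat y_n)^2 = \|y - \hat y\|^2 = \|y - X\hat\beta\|^2$$
The least squares estimates $\hat\beta$ are computed via the normal equations
$$X^T X \hat\beta = X^T y$$
Proof:
$$\|y - X\hat\beta\|^2 = \langle y - X\hat\beta, y - X\hat\beta \rangle$$
$$\frac{\partial (\ldots)}{\partial \hat\beta} = -2 \langle X, y - X\hat\beta \rangle = -2 (X^T y - X^T X \hat\beta) \overset{!}{=} 0$$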

How to compute least squares estimates $\hat\beta$

Solve the $M \times M$ system of linear equations
$$X^T X \hat\beta = X^T y$$
i.e., $Ax = b$ (with $A := X^T X$, $b := X^T y$, $x := \hat\beta$).

There are several numerical methods available:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition

Learn Linear Regression via Normal Equations

procedure learn-linreg-normeq($D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$)
    $X := (x_1, x_2, \ldots, x_N)^T$
    $y := (y_1, y_2, \ldots, y_N)^T$
    $A := X^T X$
    $b := X^T y$
    $\hat\beta :=$ solve-sle($A$, $b$)
    return $\hat\beta$
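The procedure above could be sketched in Python as follows. This is a sketch, not a reference implementation: it assumes NumPy, uses `np.linalg.solve` in the role of solve-sle, and runs on toy data that is hypothetical (it lies exactly on the line y = 1 + 2x).

```python
import numpy as np

def learn_linreg_normeq(X, y):
    """Least squares estimate via the normal equations X^T X beta = X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = X.T @ X                    # system matrix
    b = X.T @ y                    # right-hand side
    return np.linalg.solve(A, b)   # stands in for solve-sle(A, b)

# Hypothetical toy data; the first column is the constant predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

beta_hat = learn_linreg_normeq(X, y)
print(beta_hat)
```

`np.linalg.solve` raises an error if $X^T X$ is singular, e.g., when two predictor columns are identical.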

Example

Given is the following data:

x_1  x_2  y
1    2    3
2    3    2
4    1    7
5    5    1

Predict a y value for $x_1 = 3$, $x_2 = 4$.

Example / Simple Regression Models for Comparison

$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 = 2.95 + 0.1\, x_1, \qquad \hat y(x_1 = 3) = 3.25$$
$$\hat y = \hat\beta_0 + \hat\beta_2 x_2 = 6.943 - 1.343\, x_2, \qquad \hat y(x_2 = 4) = 1.571$$

[Figure: two panels showing data and model, y against x_1 and y against x_2.]

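Example

Now fit
$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$$
to the data:
$$X = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 3 \\ 1 & 4 & 1 \\ 1 & 5 & 5 \end{pmatrix}, \quad y = \begin{pmatrix} 3 \\ 2 \\ 7 \\ 1 \end{pmatrix}, \quad X^T X = \begin{pmatrix} 4 & 12 & 11 \\ 12 & 46 & 37 \\ 11 & 37 & 39 \end{pmatrix}, \quad X^T y = \begin{pmatrix} 13 \\ 40 \\ 24 \end{pmatrix}$$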

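Example

Gaussian elimination on the augmented system $(X^T X \mid X^T y)$:
$$\left(\begin{array}{rrr|r} 4 & 12 & 11 & 13 \\ 12 & 46 & 37 & 40 \\ 11 & 37 & 39 & 24 \end{array}\right) \sim \left(\begin{array}{rrr|r} 4 & 12 & 11 & 13 \\ 0 & 10 & 4 & 1 \\ 0 & 16 & 35 & -47 \end{array}\right) \sim \left(\begin{array}{rrr|r} 4 & 12 & 11 & 13 \\ 0 & 10 & 4 & 1 \\ 0 & 0 & 143 & -243 \end{array}\right)$$
$$\sim \left(\begin{array}{rrr|r} 4 & 12 & 11 & 13 \\ 0 & 1430 & 0 & 1115 \\ 0 & 0 & 143 & -243 \end{array}\right) \sim \left(\begin{array}{rrr|r} 286 & 0 & 0 & 1597 \\ 0 & 1430 & 0 & 1115 \\ 0 & 0 & 143 & -243 \end{array}\right)$$
i.e.,
$$\hat\beta = \begin{pmatrix} 1597/286 \\ 1115/1430 \\ -243/143 \end{pmatrix} \approx \begin{pmatrix} 5.583 \\ 0.779 \\ -1.699 \end{pmatrix}$$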
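The elimination can be checked with exact rational arithmetic. The sketch below uses Gauss-Jordan elimination on Python's `fractions.Fraction`; the helper `gauss_jordan` is hypothetical and only swaps rows to avoid a zero pivot. It also evaluates the fitted model at $x_1 = 3$, $x_2 = 4$.

```python
from fractions import Fraction

def gauss_jordan(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over the rationals."""
    n = len(A)
    # Augmented matrix (A | b) with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Swap a row with a nonzero entry in this column into pivot position.
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [v / pivot for v in M[col]]
        # Eliminate the column from all other rows.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

XtX = [[4, 12, 11], [12, 46, 37], [11, 37, 39]]
Xty = [13, 40, 24]

beta_hat = gauss_jordan(XtX, Xty)
y_pred = beta_hat[0] + beta_hat[1] * 3 + beta_hat[2] * 4   # x_1 = 3, x_2 = 4
print(beta_hat, y_pred, float(y_pred))
```

With these estimates the prediction for $x_1 = 3$, $x_2 = 4$ is $161/143 \approx 1.126$.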

Example

[Figure: 3D plot with axes x_1, x_2, and y.]

Example / Visualization of Model Fit

To visually assess the model fit, one can plot the residuals $\hat\epsilon := y - \hat y$ against the true values $y$.

[Figure: residuals $y - \hat y$ plotted against the true values $y$.]

2. Minimizing a Function via Gradient Descent

Gradient Descent

Given a function $f : \mathbb{R}^N \to \mathbb{R}$, find $x$ with minimal $f(x)$.

Idea: start from a random $x_0$ and then improve step by step, i.e., choose $x_{i+1}$ with
$$f(x_{i+1}) \le f(x_i)$$
Choose the negative gradient $-\frac{\partial f}{\partial x}(x_i)$ as direction for descent, i.e.,
$$x_{i+1} - x_i = -\alpha_i \cdot \frac{\partial f}{\partial x}(x_i)$$
with a suitable step length $\alpha_i > 0$.

[Figure: graph of f(x) against x.]

Gradient Descent

procedure minimize-gd-fsl($f : \mathbb{R}^N \to \mathbb{R}$, $x_0 \in \mathbb{R}^N$, $\alpha \in \mathbb{R}$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    for $i = 1, \ldots, i_{\max}$ do
        $x_i := x_{i-1} - \alpha \cdot \frac{\partial f}{\partial x}(x_{i-1})$
        if $f(x_{i-1}) - f(x_i) < \epsilon$ then
            return $x_i$
    error "not converged in $i_{\max}$ iterations"

$x_0$: start value
$\alpha$: (fixed) step length / learning rate
$i_{\max}$: maximal number of iterations
$\epsilon$: minimum stepwise improvement
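A possible Python rendering of the fixed-step-length procedure, assuming NumPy. Passing the gradient as a separate callable `grad_f` is an assumption of this sketch; the pseudocode only writes $\frac{\partial f}{\partial x}$.

```python
import numpy as np

def minimize_gd_fsl(f, grad_f, x0, alpha, i_max, eps):
    """Gradient descent with a fixed step length alpha."""
    x_prev = np.asarray(x0, dtype=float)
    for _ in range(i_max):
        x = x_prev - alpha * grad_f(x_prev)
        if f(x_prev) - f(x) < eps:     # stepwise improvement below eps: stop
            return x
        x_prev = x
    raise RuntimeError(f"not converged in {i_max} iterations")

# f(x) = x^2 with gradient 2x, started at x_0 = 2 with alpha = 0.25.
x_min = minimize_gd_fsl(lambda x: float(x @ x), lambda x: 2 * x,
                        x0=[2.0], alpha=0.25, i_max=100, eps=1e-12)
print(x_min)
```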

Example

Example: $f(x) := x^2$, $\frac{\partial f}{\partial x}(x) = 2x$, $x_0 := 2$, $\alpha_i :\equiv 0.25$

Then we compute iteratively, using $x_{i+1} = x_i - \alpha_i \cdot \frac{\partial f}{\partial x}(x_i)$:

i    x_i    ∂f/∂x(x_i)    x_{i+1}
0    2      4             1
1    1      2             0.5
2    0.5    1             0.25
3    0.25   ...           ...

[Figure: graph of f(x) = x^2 with the iterates x_0, x_1, x_2, x_3 marked.]

Step Length

Why do we need a step length? Can we set $\alpha_n \equiv 1$?

The negative gradient gives a direction of descent only in an infinitesimal neighborhood of $x_n$. Thus, the step length may be too large, and the function value of the next point does not decrease.

[Figure: graph of f(x) with the points x_0 and x_1 marked.]
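A small sketch of this effect for $f(x) = x^2$ started at $x_0 = 2$: with step length 0.25 the iterates shrink towards 0, whereas with step length 1 each step jumps to the mirror point, so the function value never decreases. The helper `gd_trajectory` is hypothetical.

```python
def gd_trajectory(grad_f, x0, alpha, steps):
    """Return the list of iterates x_0, x_1, ..., x_steps for a fixed alpha."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - alpha * grad_f(xs[-1]))
    return xs

grad = lambda x: 2 * x                      # gradient of f(x) = x^2

small = gd_trajectory(grad, 2.0, 0.25, 4)   # halves x in every step
large = gd_trajectory(grad, 2.0, 1.0, 4)    # x_{i+1} = -x_i: no progress
print(small)
print(large)
```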

Step Length

There are many different strategies to adapt the step length s.t.
1. the function value actually decreases and
2. the step length does not become too small (and thus convergence slow).

Armijo-Principle:
$$\alpha_n := \max\left\{ \alpha \in \{2^{-j} \mid j \in \mathbb{N}_0\} \;\middle|\; f\left(x_n - \alpha \frac{\partial f}{\partial x}(x_n)\right) \le f(x_n) - \alpha \delta \left\langle \frac{\partial f}{\partial x}(x_n), \frac{\partial f}{\partial x}(x_n) \right\rangle \right\}$$
with $\delta \in (0, 1)$.

Armijo Step Length

procedure steplength-armijo($f : \mathbb{R}^N \to \mathbb{R}$, $x \in \mathbb{R}^N$, $d \in \mathbb{R}^N$, $\delta \in (0, 1)$)
    $\alpha := 1$
    while $f(x) - f(x + \alpha d) < \alpha \delta d^T d$ do
        $\alpha := \alpha / 2$
    return $\alpha$

$x$: last position
$d$: descent direction
$\delta$: minimum steepness ($\delta \approx 0$: any step will do)
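The Armijo step length could look as follows in Python, assuming NumPy arrays for `x` and `d`. The cap `max_halvings` is a safeguard added in this sketch and is not part of the pseudocode.

```python
import numpy as np

def steplength_armijo(f, x, d, delta, max_halvings=60):
    """Halve alpha, starting at 1, until the decrease is at least alpha*delta*d^T d."""
    alpha = 1.0
    for _ in range(max_halvings):
        if f(x) - f(x + alpha * d) >= alpha * delta * (d @ d):
            return alpha
        alpha /= 2
    raise RuntimeError("no admissible step length found")

# f(x) = x^2 at x = 2 with descent direction d = -gradient = -4.
f = lambda x: float(x @ x)
x = np.array([2.0])
d = -2 * x

alpha = steplength_armijo(f, x, d, delta=0.5)
print(alpha)
```

Here $\alpha = 1$ is rejected (the step lands on $-2$ with no decrease), while $\alpha = 0.5$ lands exactly on the minimum and is accepted.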

Gradient Descent

procedure minimize-gd($f : \mathbb{R}^N \to \mathbb{R}$, $x_0 \in \mathbb{R}^N$, $\alpha$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    for $i = 1, \ldots, i_{\max}$ do
        $d := -\frac{\partial f}{\partial x}(x_{i-1})$
        $\alpha_i := \alpha(f, x_{i-1}, d)$
        $x_i := x_{i-1} + \alpha_i \cdot d$
        if $f(x_{i-1}) - f(x_i) < \epsilon$ then
            return $x_i$
    error "not converged in $i_{\max}$ iterations"

$x_0$: start value
$\alpha$: step length function, e.g., steplength-armijo (with fixed $\delta$)
$i_{\max}$: maximal number of iterations
$\epsilon$: minimum stepwise improvement
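A sketch of gradient descent with a pluggable step length function, assuming NumPy. The explicit `grad_f` argument, the default `delta=0.1`, and the quadratic test function are assumptions made for illustration.

```python
import numpy as np

def steplength_armijo(f, x, d, delta=0.1):
    """Armijo backtracking as in the pseudocode; delta=0.1 is an assumed default."""
    alpha = 1.0
    while f(x) - f(x + alpha * d) < alpha * delta * (d @ d):
        alpha /= 2
    return alpha

def minimize_gd(f, grad_f, x0, steplength, i_max, eps):
    """Gradient descent where steplength(f, x, d) chooses alpha_i in every step."""
    x_prev = np.asarray(x0, dtype=float)
    for _ in range(i_max):
        d = -grad_f(x_prev)                    # descent direction
        alpha = steplength(f, x_prev, d)
        x = x_prev + alpha * d
        if f(x_prev) - f(x) < eps:
            return x
        x_prev = x
    raise RuntimeError(f"not converged in {i_max} iterations")

# Hypothetical convex quadratic with minimum at (1, -2).
f = lambda x: float((x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2)
grad_f = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

x_min = minimize_gd(f, grad_f, [0.0, 0.0], steplength_armijo,
                    i_max=1000, eps=1e-12)
print(x_min)
```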

Bold Driver Step Length [Bat89]

A variant of the Armijo principle with memory:

procedure steplength-bolddriver($f : \mathbb{R}^N \to \mathbb{R}$, $x \in \mathbb{R}^N$, $d \in \mathbb{R}^N$, $\alpha^{\text{old}}$, $\alpha^+$, $\alpha^- \in (0, 1)$)
    $\alpha := \alpha^{\text{old}} \alpha^+$
    while $f(x) - f(x + \alpha d) \le 0$ do
        $\alpha := \alpha \alpha^-$
    return $\alpha$

$\alpha^{\text{old}}$: last step length
$\alpha^+$: step length increase factor, e.g., 1.1
$\alpha^-$: step length decrease factor, e.g., 0.5
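A sketch of the bold driver step length in Python. The defaults follow the example factors 1.1 and 0.5 from the slide; the cap `max_shrinks` is a safeguard added here, since the loop would not terminate for $d = 0$.

```python
def steplength_bolddriver(f, x, d, alpha_old, alpha_plus=1.1, alpha_minus=0.5,
                          max_shrinks=60):
    """Try a slightly larger step than last time; shrink until f decreases."""
    alpha = alpha_old * alpha_plus
    for _ in range(max_shrinks):
        if f(x) - f(x + alpha * d) > 0:     # strict decrease: accept
            return alpha
        alpha *= alpha_minus
    raise RuntimeError("no decreasing step length found")

# f(x) = x^2 at x = 2 with d = -gradient = -4 and previous step length 1.
f = lambda x: x * x
alpha = steplength_bolddriver(f, 2.0, -4.0, alpha_old=1.0)
print(alpha)
```

In this run the first trial $\alpha = 1.1$ overshoots to $-2.4$ and is rejected, and the shrunken $\alpha = 0.55$ is accepted.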

Simple Step Length Control in Machine Learning

- The Armijo and Bold Driver step lengths evaluate the objective function (including the loss) several times, and thus are often too costly and not used. They are useful for debugging, though, as they guarantee a decrease in f.
- Constant step lengths $\alpha \in (0, 1)$ are frequently used. If chosen (too) small, the learning algorithm becomes slow, but usually still converges. The step length becomes a hyperparameter that has to be searched.
- Regimes of shrinking step lengths are used: $\alpha_i := \alpha_{i-1} \gamma$, with $\gamma \in (0, 1)$ not too far from 1. If the initial step length $\alpha_0$ is too large, later iterations will fix it. If $\gamma$ is too small, GD may get stuck before convergence.
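Why a small $\gamma$ can stall GD: the step lengths form a geometric series, so their sum is bounded by $\alpha_0 / (1 - \gamma)$ and the iterates can only travel a limited distance. The sketch below illustrates this on $f(x) = x^2$; the helper `gd_shrinking` and the values of $\alpha_0$ and $\gamma$ are hypothetical.

```python
def gd_shrinking(grad_f, x0, alpha0, gamma, steps):
    """Gradient descent with geometrically shrinking step lengths."""
    x, alpha = x0, alpha0
    for _ in range(steps):
        x = x - alpha * grad_f(x)
        alpha *= gamma                 # shrink the step length after each step
    return x

grad = lambda x: 2 * x                 # gradient of f(x) = x^2, minimum at 0

stuck = gd_shrinking(grad, 2.0, alpha0=0.1, gamma=0.5, steps=1000)
fine = gd_shrinking(grad, 2.0, alpha0=0.1, gamma=0.99, steps=1000)
print(stuck, fine)
```

With $\gamma = 0.5$ the iterate stays far from the minimum even after 1000 steps, while $\gamma = 0.99$ gets very close to 0.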

How Many Minima can a Function have?

In general, a function f can have several different minima. GD will find one of them, which one depending on the starting point (with small step lengths, usually one close to the starting point; GD is a local optimization method).

Convexity

A function $f : \mathbb{R}^N \to \mathbb{R}$ is called convex if
$$f(t x_1 + (1 - t) x_2) \le t f(x_1) + (1 - t) f(x_2), \quad \forall x_1, x_2 \in \mathbb{R}^N, \; t \in [0, 1]$$
A twice differentiable function is convex if its Hessian is positive semidefinite everywhere, i.e.,
$$x^T \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right)_{i=1,\ldots,N,\; j=1,\ldots,N} x \ge 0 \quad \forall x \in \mathbb{R}^N$$
For any matrix $A \in \mathbb{R}^{N \times M}$, the matrix $A^T A$ is positive semidefinite.
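The last statement follows in one line, and it is the reason the residual sum of squares of a linear regression model is convex. The second equation applies the statement with $A := X$ and uses the Hessian of $\|y - X\hat\beta\|^2$, which is not written out on the slide:

```latex
x^T (A^T A)\, x = (A x)^T (A x) = \| A x \|^2 \ge 0 \quad \forall x \in \mathbb{R}^M

\frac{\partial^2}{\partial \hat\beta \, \partial \hat\beta^T} \, \| y - X \hat\beta \|^2 = 2 X^T X
\quad \text{(positive semidefinite, hence the objective is convex)}
```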

3. Learning Linear Regression Models via Gradient Descent

Sparse Predictors

Many problems have predictors $x \in \mathbb{R}^M$ that are
- high-dimensional: M is large, and
- sparse: most $x_m$ are zero.

For example, text regression:
- task: predict the rating of a customer review.
- predictors: a text about a product, i.e., a sequence of words.
  - It can be represented as a vector via bag of words: $x_m$ encodes the frequency of word m in a given text.
  - The vector has 30,000-60,000 dimensions for English texts.
  - In short texts such as reviews with a couple of hundred words, at most a couple of hundred dimensions are non-zero.
- target: the customer's rating of the product, a (real) number.

Sparse Predictors: Dense Normal Equations

Recall, the normal equations
$$X^T X \hat\beta = X^T y$$
have dimensions $M \times M$.

Even if X is sparse, generally $X^T X$ will be rather dense:
$$(X^T X)_{m,l} = X_{\cdot,m}^T X_{\cdot,l}$$
For text regression, $(X^T X)_{m,l}$ will be non-zero for every pair of words m, l that co-occurs in any text.

Even worse, even if $A := X^T X$ is sparse, standard methods to solve linear systems (such as Gaussian elimination, LU decomposition etc.) do not take advantage of the sparsity.
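The fill-in can be seen already on a tiny, hypothetical bag-of-words matrix (3 texts, 6 words, made up for illustration). The sketch assumes NumPy and uses dense arrays only to count non-zero entries:

```python
import numpy as np

# Hypothetical bag-of-words counts: rows are texts, columns are words.
X = np.array([[1, 1, 0, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 1]], dtype=float)

XtX = X.T @ X   # entry (m, l) is non-zero iff words m and l co-occur in a text

density_X = np.count_nonzero(X) / X.size
density_XtX = np.count_nonzero(XtX) / XtX.size
print(density_X, density_XtX)
```

Each text with k distinct words contributes a k by k block of non-zeros to $X^T X$, so the effect grows quickly with the length of the texts.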

Learn Linear Regression via Loss Minimization

As an alternative to learning a linear regression model by solving the linear normal equation system, one can minimize the loss directly:
$$f(\hat\beta) := \hat\beta^T X^T X \hat\beta - 2 y^T X \hat\beta + y^T y = (y - X\hat\beta)^T (y - X\hat\beta)$$
$$\frac{\partial f}{\partial \hat\beta}(\hat\beta) = -2 (X^T y - X^T X \hat\beta) = -2 X^T (y - X\hat\beta)$$
When computing $f$ and $\frac{\partial f}{\partial \hat\beta}$,
- avoid computing (dense) $X^T X$.
- always compute (sparse) X times a (dense) vector.

Objective Function and Gradient as Sums over Instances

$$f(\hat\beta) := (y - X\hat\beta)^T (y - X\hat\beta) = \sum_{n=1}^{N} (y_n - x_n^T \hat\beta)^2 = \sum_{n=1}^{N} \epsilon_n^2$$
$$\frac{\partial f}{\partial \hat\beta}(\hat\beta) = -2 X^T (y - X\hat\beta) = -2 \sum_{n=1}^{N} (y_n - x_n^T \hat\beta)\, x_n = -2 \sum_{n=1}^{N} \epsilon_n x_n$$
with $\epsilon_n := y_n - x_n^T \hat\beta$.
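Both forms of the gradient can be compared numerically. The sketch below assumes NumPy and uses the four-case example data from Section 1; the function names are hypothetical. The matrix form needs only two matrix-vector products and never builds $X^T X$.

```python
import numpy as np

def rss(beta, X, y):
    """f(beta) = (y - X beta)^T (y - X beta)."""
    eps = y - X @ beta                  # residuals epsilon_n
    return float(eps @ eps)

def rss_grad(beta, X, y):
    """Gradient -2 X^T (y - X beta), without forming X^T X."""
    return -2.0 * (X.T @ (y - X @ beta))

def rss_grad_sum(beta, X, y):
    """The same gradient as a sum over instances: -2 * sum_n epsilon_n x_n."""
    g = np.zeros_like(beta)
    for x_n, y_n in zip(X, y):
        g += -2.0 * (y_n - x_n @ beta) * x_n
    return g

X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 4.0, 1.0], [1.0, 5.0, 5.0]])
y = np.array([3.0, 2.0, 7.0, 1.0])
beta = np.array([1.0, 0.5, -0.5])       # arbitrary point, chosen for illustration

g_matrix = rss_grad(beta, X, y)
g_sum = rss_grad_sum(beta, X, y)
print(rss(beta, X, y), g_matrix, g_sum)
```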

Learn Linear Regression via Loss Minimization: GD

procedure learn-linreg-GD($D^{\text{train}} := \{(x_1, y_1), \ldots, (x_N, y_N)\}$, $\alpha$, $i_{\max} \in \mathbb{N}$, $\epsilon \in \mathbb{R}^+$)
    $X := (x_1, x_2, \ldots, x_N)^T$
    $y := (y_1, y_2, \ldots, y_N)^T$
    $\hat\beta_0 := (0, \ldots, 0)$
    $\hat\beta :=$ minimize-gd($f(\hat\beta) := (y - X\hat\beta)^T (y - X\hat\beta)$, $\hat\beta_0$, $\alpha$, $i_{\max}$, $\epsilon$)
    return $\hat\beta$
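Put together, the procedure could be sketched as below, assuming NumPy and an Armijo step length. The default `delta=0.1`, the values of `i_max` and `eps`, and the explicit gradient function are assumptions of this sketch. It is run on the four-case example data from Section 1.

```python
import numpy as np

def steplength_armijo(f, x, d, delta=0.1):
    """Armijo backtracking; delta=0.1 is an assumed default."""
    alpha = 1.0
    while f(x) - f(x + alpha * d) < alpha * delta * (d @ d):
        alpha /= 2
    return alpha

def minimize_gd(f, grad_f, x0, steplength, i_max, eps):
    """Gradient descent with a step length function."""
    x_prev = np.asarray(x0, dtype=float)
    for _ in range(i_max):
        d = -grad_f(x_prev)
        x = x_prev + steplength(f, x_prev, d) * d
        if f(x_prev) - f(x) < eps:
            return x
        x_prev = x
    raise RuntimeError(f"not converged in {i_max} iterations")

def learn_linreg_gd(X, y, steplength, i_max, eps):
    """Learn linear regression parameters by minimizing the RSS with GD."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    f = lambda beta: float((y - X @ beta) @ (y - X @ beta))
    grad_f = lambda beta: -2.0 * (X.T @ (y - X @ beta))   # never forms X^T X
    beta0 = np.zeros(X.shape[1])
    return minimize_gd(f, grad_f, beta0, steplength, i_max, eps)

X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 4.0, 1.0], [1.0, 5.0, 5.0]])
y = np.array([3.0, 2.0, 7.0, 1.0])

beta_gd = learn_linreg_gd(X, y, steplength_armijo, i_max=50000, eps=1e-12)
print(beta_gd)
```

On this small problem GD needs many more operations than solving the normal equations; the approach pays off for high-dimensional, sparse X.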

4. Case Weights

Cases of Different Importance

Sometimes different cases are of different importance, e.g., if their measurements are of different accuracy or reliability.

Example: assume the leftmost point is known to be measured with lower reliability. Thus, the model does not need to fit this point as well as it needs to fit the other points. I.e., residuals of this point should get a lower weight than the others.

[Figure: scatter plot of y against x showing data and model.]

Case Weights

In such situations, each case $(x_n, y_n)$ is assigned a case weight $w_n \ge 0$:
- the higher the weight, the more important the case.
- cases with weight 0 should be treated as if they have been discarded from the data set.

Case weights can be managed as an additional pseudo-variable w in implementations.

Weighted Least Squares Estimates

Formally, one tries to minimize the weighted residual sum of squares
$$\sum_{n=1}^{N} w_n (y_n - \hat y_n)^2 = \| W^{\frac{1}{2}} (y - \hat y) \|^2$$
with
$$W := \begin{pmatrix} w_1 & & & 0 \\ & w_2 & & \\ & & \ddots & \\ 0 & & & w_N \end{pmatrix}$$
The same argument as for the unweighted case yields the weighted least squares estimates as the solution of
$$X^T W X \hat\beta = X^T W y$$
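A sketch of weighted least squares via the weighted normal equations, assuming NumPy. The function name and the weight vectors are hypothetical; the data is the four-case example from Section 1. The diagonal matrix W is never built, its effect is obtained by scaling the columns of $X^T$.

```python
import numpy as np

def learn_linreg_wls(X, y, w):
    """Weighted least squares estimate: solve X^T W X beta = X^T W y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w                        # same as X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

X = np.array([[1.0, 1.0, 2.0], [1.0, 2.0, 3.0], [1.0, 4.0, 1.0], [1.0, 5.0, 5.0]])
y = np.array([3.0, 2.0, 7.0, 1.0])

beta_unweighted = learn_linreg_wls(X, y, [1, 1, 1, 1])   # plain least squares
beta_dropped = learn_linreg_wls(X, y, [1, 1, 1, 0])      # last case has weight 0
print(beta_unweighted, beta_dropped)
```

With all weights equal to 1 this reduces to ordinary least squares, and a weight of 0 gives the same estimates as discarding the case.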

Weighted Least Squares Estimates / Example

To downweight the leftmost point, we assign case weights as follows:

w     x      y
1     5.65   3.54
1     3.37   1.75
1     1.97   0.04
1     3.70   4.42
0.1   0.15   3.85
1     8.14   8.75
1     7.42   8.11
1     6.59   5.64
1     1.77   0.18
1     7.74   8.30

[Figure: scatter plot of y against x showing the data, the model with weights, and the model without weights.]

Summary

- For regression, linear models of type $\hat y = x^T \hat\beta$ can be used to predict a quantitative y based on several (quantitative) x.
- A bias term can be modeled as an additional predictor that is constant 1.
- The ordinary least squares (OLS) estimates are the parameters with minimal residual sum of squares (RSS).
- OLS estimates can be computed by solving the normal equations $X^T X \hat\beta = X^T y$ like any system of linear equations, e.g., via Gaussian elimination.
- Alternatively, OLS estimates can be computed iteratively via Gradient Descent.
- Especially for high-dimensional, sparse predictors GD is advantageous as it never has to compute the large, dense $X^T X$.
- Case weights can be handled seamlessly by both methods to model different importance of cases.

Further Readings

[JWHT13, chapter 3], [Mur12, chapter 7], [HTFF05, chapter 3].

References

[Bat89] Roberto Battiti. Accelerated backpropagation learning: Two optimization methods. Complex Systems, 3(4):331-342, 1989.
[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction, volume 27. 2005.
[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Springer, 2013.
[Mur12] Kevin P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012.