Machine Learning. A. Supervised Learning A.1. Linear Regression. Lars Schmidt-Thieme

Machine Learning A. Supervised Learning A.1. Linear Regression Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 / 36

Outline 1. Linear Regression via Normal Equations 2. Minimizing a Function via Gradient Descent 3. Learning Linear Regression Models via Gradient Descent 4. Case Weights 2 / 36

1. Linear Regression via Normal Equations Outline 1. Linear Regression via Normal Equations 2. Minimizing a Function via Gradient Descent 3. Learning Linear Regression Models via Gradient Descent 4. Case Weights 2 / 36

1. Linear Regression via Normal Equations
The Simple Linear Regression Problem

Given a set D^train := {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} ⊆ R × R called training data, compute the parameters (β̂_0, β̂_1) of a linear regression function

    ŷ(x) := β̂_0 + β̂_1 x

s.t. for a set D^test ⊆ R × R called test set the test error

    err(ŷ; D^test) := (1 / |D^test|) Σ_{(x,y) ∈ D^test} (y - ŷ(x))²

is minimal.

Note: D^test has (i) to be from the same data generating process and (ii) not to be available during training. 2 / 36
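
To make the objective concrete, here is a minimal sketch of the test error (my own addition, not part of the slides; the NumPy array names and the example test set are assumptions):

```python
import numpy as np

def y_hat(x, beta0, beta1):
    """Simple linear regression function y_hat(x) = beta0 + beta1 * x."""
    return beta0 + beta1 * x

def test_error(beta0, beta1, x_test, y_test):
    """err(y_hat; D_test) = (1 / |D_test|) * sum over D_test of (y - y_hat(x))^2."""
    residuals = y_test - y_hat(x_test, beta0, beta1)
    return np.mean(residuals ** 2)

# hypothetical test set
x_test = np.array([1.0, 2.0, 3.0])
y_test = np.array([1.1, 1.9, 3.2])
print(test_error(0.0, 1.0, x_test, y_test))
```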

1. Linear Regression via Normal Equations
The (Multiple) Linear Regression Problem

Given a set D^train := {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} ⊆ R^M × R called training data, compute the parameters (β̂_0, β̂_1, ..., β̂_M) of a linear regression function

    ŷ(x) := β̂_0 + β̂_1 x_1 + ... + β̂_M x_M

s.t. for a set D^test ⊆ R^M × R called test set the test error

    err(ŷ; D^test) := (1 / |D^test|) Σ_{(x,y) ∈ D^test} (y - ŷ(x))²

is minimal.

Note: D^test has (i) to be from the same data generating process and (ii) not to be available during training. 3 / 36

1. Linear Regression via Normal Equations
Several Predictors

Several predictor variables x_{·,1}, x_{·,2}, ..., x_{·,M}:

    ŷ = β̂_0 + β̂_1 x_{·,1} + β̂_2 x_{·,2} + ... + β̂_M x_{·,M}
      = β̂_0 + Σ_{m=1}^M β̂_m x_{·,m}

with M + 1 parameters β̂_0, β̂_1, ..., β̂_M. 4 / 36

1. Linear Regression via Normal Equations
Linear Form

Several predictor variables x_{·,1}, x_{·,2}, ..., x_{·,M}:

    ŷ = β̂_0 + Σ_{m=1}^M β̂_m x_{·,m} = ⟨β̂, x_·⟩

where

    β̂ := (β̂_0, β̂_1, ..., β̂_M)^T,   x_· := (1, x_{·,1}, ..., x_{·,M})^T

Thus, the intercept is handled like any other parameter, for the artificial constant predictor x_{·,0} ≡ 1. 5 / 36
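
A minimal sketch of this convention (my own illustration, assuming NumPy arrays): prepend a constant-one column so that the intercept is just another coefficient.

```python
import numpy as np

def add_intercept(X):
    """Prepend the artificial constant predictor x_{.,0} == 1 as a first column."""
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), X])

def predict(X, beta):
    """y_hat = <beta, x_.> for every row, i.e. X @ beta with the intercept column included."""
    return add_intercept(X) @ beta

# hypothetical data: two predictors, three cases
X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0]])
beta = np.array([0.5, 1.0, -1.0])   # (beta_0, beta_1, beta_2)
print(predict(X, beta))
```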

1. Linear Regression via Normal Equations
Simultaneous Equations for the Whole Dataset

For the whole dataset D^train := {(x_1, y_1), ..., (x_N, y_N)}:

    y ≈ ŷ := X β̂

where

    y := (y_1, ..., y_N)^T,   ŷ := (ŷ_1, ..., ŷ_N)^T,
    X := (x_1, ..., x_N)^T, i.e., the N × M matrix with entries X_{n,m} = x_{n,m}
6 / 36

1. Linear Regression via Normal Equations
Least Squares Estimates

Least squares estimates β̂ minimize

    RSS(β̂, D^train) := Σ_{n=1}^N (y_n - ŷ_n)² = ‖y - ŷ‖² = ‖y - X β̂‖²

The least squares estimates β̂ are computed via the normal equations

    X^T X β̂ = X^T y

Proof:

    ‖y - X β̂‖² = ⟨y - X β̂, y - X β̂⟩
    ∂/∂β̂ ⟨y - X β̂, y - X β̂⟩ = -2 ⟨X, y - X β̂⟩ = -2 (X^T y - X^T X β̂)

Setting this gradient to zero yields the normal equations. 7 / 36

1. Linear Regression via Normal Equations
How to Compute Least Squares Estimates β̂

Solve the M × M system of linear equations

    X^T X β̂ = X^T y

i.e., A x = b (with A := X^T X, b := X^T y, x := β̂).

There are several numerical methods available:
1. Gaussian elimination
2. Cholesky decomposition
3. QR decomposition
8 / 36

1. Linear Regression via Normal Equations
Learn Linear Regression via Normal Equations

1: procedure learn-linreg-normeq(D^train := {(x_1, y_1), ..., (x_N, y_N)})
2:   X := (x_1, x_2, ..., x_N)^T
3:   y := (y_1, y_2, ..., y_N)^T
4:   A := X^T X
5:   b := X^T y
6:   β̂ := solve-sle(A, b)
7:   return β̂
9 / 36
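
A minimal NumPy sketch of this procedure (my own translation of the pseudocode; the function name and the choice of solver are assumptions, not part of the slides):

```python
import numpy as np

def learn_linreg_normeq(X, y):
    """Least squares via the normal equations X^T X beta = X^T y.

    X is the N x (M+1) design matrix (intercept column already included),
    y is the N-vector of targets.
    """
    A = X.T @ X
    b = X.T @ y
    # solve-sle: any linear-system solver works here, e.g. Gaussian elimination
    # (np.linalg.solve), a Cholesky factorization, or a QR decomposition.
    return np.linalg.solve(A, b)
```

In practice, np.linalg.lstsq(X, y) solves the same least squares problem directly, without forming X^T X explicitly.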

1. Linear Regression via Normal Equations
Example

Given is the following data:

    x1  x2  y
     1   2  3
     2   3  2
     4   1  7
     5   5  1

Predict a y value for x1 = 3, x2 = 4. 10 / 36

1. Linear Regression via Normal Equations
Example / Simple Regression Models for Comparison

    ŷ = β̂_0 + β̂_1 x1 = 2.95 + 0.1 x1          ŷ = β̂_0 + β̂_2 x2 = 6.943 - 1.343 x2

(figures: data and fitted line, ŷ vs. x1 and ŷ vs. x2)

    ŷ(x1 = 3) = 3.25                           ŷ(x2 = 4) = 1.571

11 / 36

1. Linear Regression via Normal Equations
Example

Now fit to the data (from the previous slide):  ŷ = β̂_0 + β̂_1 x1 + β̂_2 x2

    X = [ 1  1  2 ]         [ 3 ]
        [ 1  2  3 ]     y = [ 2 ]
        [ 1  4  1 ]         [ 7 ]
        [ 1  5  5 ]         [ 1 ]

    X^T X = [  4  12  11 ]           [ 13 ]
            [ 12  46  37 ]   X^T y = [ 40 ]
            [ 11  37  39 ]           [ 24 ]

12 / 36

1. Linear Regression via Normal Equations
Example

Gaussian elimination on the augmented system (X^T X | X^T y):

    [  4   12   11 |   13 ]
    [ 12   46   37 |   40 ]
    [ 11   37   39 |   24 ]

    [  4   12   11 |   13 ]
    [  0   10    4 |    1 ]
    [  0   16   35 |  -47 ]

    [  4   12   11 |   13 ]
    [  0   10    4 |    1 ]
    [  0    0  143 | -243 ]

    [  4   12   11 |   13 ]
    [  0 1430    0 | 1115 ]
    [  0    0  143 | -243 ]

    [ 286    0    0 | 1597 ]
    [   0 1430    0 | 1115 ]
    [   0    0  143 | -243 ]

i.e., β̂ = (1597/286, 1115/1430, -243/143)^T ≈ (5.583, 0.779, -1.699)^T. 13 / 36
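
To check the arithmetic, a small NumPy sketch (my own addition, not part of the slides) that solves the same normal equations and also evaluates the prediction asked for on the example slide:

```python
import numpy as np

# design matrix with intercept column and targets from the example
X = np.array([[1., 1., 2.],
              [1., 2., 3.],
              [1., 4., 1.],
              [1., 5., 5.]])
y = np.array([3., 2., 7., 1.])

beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                            # approx. [ 5.583  0.779 -1.699]

# prediction for x1 = 3, x2 = 4
print(np.array([1., 3., 4.]) @ beta)   # approx. 1.13
```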

1. Linear Regression via Normal Equations
Example

(figure: 3D plot of the data and the fitted regression plane ŷ = β̂_0 + β̂_1 x1 + β̂_2 x2 over the predictors x1 and x2) 14 / 36

1. Linear Regression via Normal Equations
Example / Visualization of Model Fit

To visually assess the model fit, one can plot the residuals ε̂ := y - ŷ against the true values y.

(figure: residuals y - ŷ plotted against y; all residuals are close to zero) 15 / 36

2. Minimizing a Function via Gradient Descent Outline 1. Linear Regression via Normal Equations 2. Minimizing a Function via Gradient Descent 3. Learning Linear Regression Models via Gradient Descent 4. Case Weights 16 / 36

2. Minimizing a Function via Gradient Descent
Gradient Descent

Given a function f : R^N -> R, find x with minimal f(x).

Idea: start from a random x_0 and then improve step by step, i.e., choose x_{i+1} with

    f(x_{i+1}) <= f(x_i)

(figure: a convex function f(x) with successive iterates descending towards its minimum)

Choose the negative gradient -∂f/∂x(x_i) as direction for descent, i.e.,

    x_{i+1} - x_i = -α_i ∂f/∂x(x_i)

with a suitable step length α_i > 0. 16 / 36

2. Minimizing a Function via Gradient Descent
Gradient Descent

1: procedure minimize-gd-fsl(f : R^N -> R, x_0 ∈ R^N, α ∈ R, i_max ∈ N, ε ∈ R+)
2:   for i = 1, ..., i_max do
3:     x_i := x_{i-1} - α ∂f/∂x(x_{i-1})
4:     if f(x_{i-1}) - f(x_i) < ε then
5:       return x_i
6:   error "not converged in i_max iterations"

x_0    start value
α      (fixed) step length / learning rate
i_max  maximal number of iterations
ε      minimum stepwise improvement
17 / 36
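
A minimal Python sketch of this fixed-step-length procedure (my own translation; the gradient is passed in explicitly as a function, which the pseudocode leaves implicit):

```python
import numpy as np

def minimize_gd_fsl(f, grad_f, x0, alpha, i_max, eps):
    """Gradient descent with a fixed step length alpha.

    Stops once the stepwise improvement f(x_{i-1}) - f(x_i) drops below eps.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(i_max):
        x_new = x - alpha * grad_f(x)
        if f(x) - f(x_new) < eps:
            return x_new
        x = x_new
    raise RuntimeError("not converged in i_max iterations")
```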

2. Minimizing a Function via Gradient Descent
Example

Example: f(x) := x², ∂f/∂x(x) = 2x, x_0 := 2, α_i := 0.25.

Then we compute iteratively, using x_{i+1} = x_i - α_i ∂f/∂x(x_i):

    i   x_i    ∂f/∂x(x_i)   x_{i+1}
    0   2      4            1
    1   1      2            0.5
    2   0.5    1            0.25
    3   0.25   ...          ...

(figure: f(x) = x² with the iterates x_0, x_1, x_2, x_3 descending towards 0) 18 / 36
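
Using the sketch above, the same iteration can be reproduced (again my own illustration, not part of the slides):

```python
# f(x) = x^2 with gradient 2x, started at x_0 = 2 with fixed alpha = 0.25
x_min = minimize_gd_fsl(lambda x: x**2, lambda x: 2*x,
                        x0=2.0, alpha=0.25, i_max=100, eps=1e-8)
print(x_min)   # a value close to 0, reached via the iterates 2, 1, 0.5, 0.25, ...
```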

2. Minimizing a Function via Gradient Descent
Step Length

Why do we need a step length? Can we not simply set α_n ≡ 1?

The negative gradient gives a direction of descent only in an infinitesimal neighborhood of x_n. Thus, the step length may be too large, and the function value of the next point does not decrease.

(figure: a gradient step from x_0 to x_1 that overshoots, so the function value does not decrease) 19 / 36

2. Minimizing a Function via Gradient Descent
Step Length

There are many different strategies to adapt the step length s.t.
1. the function value actually decreases and
2. the step length does not become too small (and thus convergence slow).

Armijo principle:

    α_n := max { α ∈ {2^{-j} | j ∈ N_0} :
                 f(x_n - α ∂f/∂x(x_n)) <= f(x_n) - α δ ⟨∂f/∂x(x_n), ∂f/∂x(x_n)⟩ }

with δ ∈ (0, 1). 20 / 36

2. Minimizing a Function via Gradient Descent
Armijo Step Length

1: procedure steplength-armijo(f : R^N -> R, x ∈ R^N, d ∈ R^N, δ ∈ (0, 1))
2:   α := 1
3:   while f(x) - f(x + α d) < α δ d^T d do
4:     α := α / 2
5:   return α

x  last position
d  descent direction
δ  minimum steepness (δ ≈ 0: any step will do)
21 / 36
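
A minimal Python sketch of the Armijo step length (my own translation; it plugs into the gradient descent procedure on the next slide as the step length function α(f, x, d)):

```python
def steplength_armijo(f, x, d, delta=0.01):
    """Halve alpha until f(x) - f(x + alpha*d) >= alpha * delta * d^T d holds.

    d is assumed to be a true descent direction (here: the negative gradient,
    as a NumPy vector); otherwise this loop would not terminate. delta in (0, 1).
    """
    alpha = 1.0
    while f(x) - f(x + alpha * d) < alpha * delta * (d @ d):
        alpha /= 2.0
    return alpha
```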

2. Minimizing a Function via Gradient Descent
Gradient Descent

1: procedure minimize-gd(f : R^N -> R, x_0 ∈ R^N, α, i_max ∈ N, ε ∈ R+)
2:   for i = 1, ..., i_max do
3:     d := -∂f/∂x(x_{i-1})
4:     α_i := α(f, x_{i-1}, d)
5:     x_i := x_{i-1} + α_i d
6:     if f(x_{i-1}) - f(x_i) < ε then
7:       return x_i
8:   error "not converged in i_max iterations"

x_0    start value
α      step length function, e.g., steplength-armijo (with fixed δ)
i_max  maximal number of iterations
ε      minimum stepwise improvement
22 / 36

2. Minimizing a Function via Gradient Descent
Bold Driver Step Length [Bat89]

A variant of the Armijo principle with memory:

1: procedure steplength-bolddriver(f : R^N -> R, x ∈ R^N, d ∈ R^N, α_old, α^+, α^-)
2:   α := α_old α^+
3:   while f(x) - f(x + α d) <= 0 do
4:     α := α α^-
5:   return α

α_old  last step length
α^+    step length increase factor, e.g., 1.1
α^-    step length decrease factor in (0, 1), e.g., 0.5
23 / 36
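
A minimal Python sketch of the bold driver step length (my own translation; the caller keeps track of the previous step length and passes it in as alpha_old):

```python
def steplength_bolddriver(f, x, d, alpha_old, alpha_inc=1.1, alpha_dec=0.5):
    """Bold driver: start from the previous step length scaled up by alpha_inc,
    then shrink by alpha_dec until the function value actually decreases."""
    alpha = alpha_old * alpha_inc
    while f(x) - f(x + alpha * d) <= 0:
        alpha *= alpha_dec
    return alpha
```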

2. Minimizing a Function via Gradient Descent
Simple Step Length Control in Machine Learning

The Armijo and bold driver step lengths evaluate the objective function (including the loss) several times per iteration, and thus often are too costly and not used. But they are useful for debugging, as they guarantee a decrease in f.

Constant step lengths α ∈ (0, 1) are frequently used. If chosen (too) small, the learning algorithm becomes slow, but usually still converges. The step length becomes a hyperparameter that has to be searched.

Regimes of shrinking step lengths are also used:

    α_i := α_{i-1} γ,   γ ∈ (0, 1) not too far from 1

If the initial step length α_0 is too large, later iterations will fix it. If γ is too small, GD may get stuck before convergence. 24 / 36

2. Minimizing a Function via Gradient Descent
How Many Minima Can a Function Have?

In general, a function f can have several different minima. GD will find a random one (with small step lengths, usually one close to the starting point; it is a local optimization method). 25 / 36

2. Minimizing a Function via Gradient Descent
Convexity

A function f : R^N -> R is called convex if

    f(t x_1 + (1 - t) x_2) <= t f(x_1) + (1 - t) f(x_2)   for all x_1, x_2 ∈ R^N, t ∈ [0, 1]

A twice-differentiable function is convex if its Hessian is positive semidefinite, i.e.,

    x^T ( ∂²f / (∂x_i ∂x_j) )_{i=1,...,N; j=1,...,N} x >= 0   for all x ∈ R^N

For any matrix A ∈ R^{N×M}, the matrix A^T A is positive semidefinite. 26 / 36
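
The last statement is what makes the least squares objective convex; a short derivation (my own addition, following directly from the definitions above):

```latex
For any $A \in \mathbb{R}^{N \times M}$ and any $x \in \mathbb{R}^{M}$,
\[
  x^\top A^\top A\, x \;=\; (Ax)^\top (Ax) \;=\; \lVert Ax \rVert^2 \;\ge\; 0,
\]
so $A^\top A$ is positive semidefinite. In particular, the RSS objective
$f(\hat\beta) = \lVert y - X\hat\beta \rVert^2$ has Hessian $2\,X^\top X$,
which is positive semidefinite, so $f$ is convex and every local minimum
of $f$ is a global one.
```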

3. Learning Linear Regression Models via Gradient Descent Outline 1. Linear Regression via Normal Equations 2. Minimizing a Function via Gradient Descent 3. Learning Linear Regression Models via Gradient Descent 4. Case Weights 27 / 36

3. Learning Linear Regression Models via Gradient Descent
Sparse Predictors

Many problems have predictors x ∈ R^M that are
- high-dimensional: M is large, and
- sparse: most x_m are zero.

For example, text regression:
- task: predict the rating of a customer review.
- predictors: a text about a product, i.e., a sequence of words.
  - can be represented as a vector via bag of words: x_m encodes the frequency of word m in a given text.
  - dimensions: 30,000 to 60,000 for English texts.
  - in short texts such as reviews with a couple of hundred words, at most a couple of hundred dimensions are non-zero.
- target: the customer's rating of the product, a (real) number.
27 / 36
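
A minimal sketch of the bag-of-words representation (my own illustration; the tiny vocabulary and the example text are made up):

```python
import numpy as np

def bag_of_words(text, vocabulary):
    """Map a text to a count vector x, where x_m is the frequency of word m."""
    index = {word: m for m, word in enumerate(vocabulary)}
    x = np.zeros(len(vocabulary))
    for word in text.lower().split():
        if word in index:
            x[index[word]] += 1
    return x

vocabulary = ["great", "terrible", "battery", "screen", "price"]
print(bag_of_words("great screen great price", vocabulary))   # [2. 0. 0. 1. 1.]
```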

3. Learning Linear Regression Models via Gradient Descent
Sparse Predictors / Dense Normal Equations

Recall, the normal equations have dimensions M × M:

    X^T X β̂ = X^T y

Even if X is sparse, X^T X generally will be rather dense:

    (X^T X)_{m,l} = X_{·,m}^T X_{·,l}

For text regression, (X^T X)_{m,l} will be non-zero for every pair of words m, l that co-occurs in any text.

Even worse, even if A := X^T X is sparse, standard methods to solve linear systems (such as Gaussian elimination, LU decomposition, etc.) do not take advantage of sparsity. 28 / 36

3. Learning Linear Regression Models via Gradient Descent
Learn Linear Regression via Loss Minimization

Alternatively to learning a linear regression model by solving the linear normal equation system, one can minimize the loss directly:

    f(β̂) := β̂^T X^T X β̂ - 2 y^T X β̂ + y^T y = (y - X β̂)^T (y - X β̂)

    ∂f/∂β̂ (β̂) = -2 (X^T y - X^T X β̂) = -2 X^T (y - X β̂)

When computing f and ∂f/∂β̂:
- avoid computing the (dense) X^T X,
- always compute the (sparse) X times a (dense) vector.
29 / 36

3. Learning Linear Regression Models via Gradient Descent
Objective Function and Gradient as Sums over Instances

    f(β̂) := (y - X β̂)^T (y - X β̂) = Σ_{n=1}^N (y_n - x_n^T β̂)² = Σ_{n=1}^N ε_n²

    ∂f/∂β̂ (β̂) = -2 X^T (y - X β̂) = -2 Σ_{n=1}^N (y_n - x_n^T β̂) x_n = -2 Σ_{n=1}^N ε_n x_n

with ε_n := y_n - x_n^T β̂. 30 / 36

3. Learning Linear Regression Models via Gradient Descent
Learn Linear Regression via Loss Minimization: GD

1: procedure learn-linreg-GD(D^train := {(x_1, y_1), ..., (x_N, y_N)}, α, i_max ∈ N, ε ∈ R+)
2:   X := (x_1, x_2, ..., x_N)^T
3:   y := (y_1, y_2, ..., y_N)^T
4:   β̂_0 := (0, ..., 0)
5:   β̂ := minimize-gd(f(β̂) := (y - X β̂)^T (y - X β̂), β̂_0, α, i_max, ε)
6:   return β̂
31 / 36
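
A minimal NumPy sketch of this procedure, reusing minimize_gd_fsl from the sketch in section 2 (my own translation; note that loss and gradient only use products of X with a vector, so X may just as well be a scipy.sparse matrix):

```python
import numpy as np

def learn_linreg_gd(X, y, alpha=0.01, i_max=10000, eps=1e-8):
    """Least squares via gradient descent, never forming X^T X explicitly."""
    def f(beta):
        r = y - X @ beta              # residuals epsilon_n
        return r @ r                  # RSS

    def grad_f(beta):
        return -2.0 * (X.T @ (y - X @ beta))

    beta0 = np.zeros(X.shape[1])
    return minimize_gd_fsl(f, grad_f, beta0, alpha, i_max, eps)
```

As discussed in section 2, the fixed step length alpha is a hyperparameter here and typically needs to be tuned for the data at hand.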

4. Case Weights Outline 1. Linear Regression via Normal Equations 2. Minimizing a Function via Gradient Descent 3. Learning Linear Regression Models via Gradient Descent 4. Case Weights 32 / 36

4. Case Weights
Cases of Different Importance

Sometimes different cases are of different importance, e.g., if their measurements are of different accuracy or reliability.

Example: assume the leftmost point is known to be measured with lower reliability. Thus, the model does not need to fit this point as well as it needs to fit the other points, i.e., the residual of this point should get a lower weight than the others.

(figure: data with a fitted line; one less reliable point on the left) 32 / 36

4. Case Weights
Case Weights

In such situations, each case (x_n, y_n) is assigned a case weight w_n >= 0:
- the higher the weight, the more important the case.
- cases with weight 0 should be treated as if they have been discarded from the data set.

Case weights can be managed as an additional pseudo-variable w in implementations. 33 / 36

4. Case Weights
Weighted Least Squares Estimates

Formally, one tries to minimize the weighted residual sum of squares

    Σ_{n=1}^N w_n (y_n - ŷ_n)² = ‖ W^{1/2} (y - ŷ) ‖²

with

    W := diag(w_1, w_2, ..., w_N)

The same argument as for the unweighted case results in the weighted least squares estimates

    X^T W X β̂ = X^T W y
34 / 36
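
A minimal NumPy sketch of the weighted normal equations (my own translation of the formula above; the diagonal weight matrix is never formed explicitly, since row-wise broadcasting does the same job):

```python
import numpy as np

def learn_linreg_weighted(X, y, w):
    """Weighted least squares: solve X^T W X beta = X^T W y with W = diag(w)."""
    Xw = X * w[:, None]          # each row of X scaled by its case weight
    A = X.T @ Xw                 # X^T W X
    b = X.T @ (w * y)            # X^T W y
    return np.linalg.solve(A, b)
```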

4. Case Weights
Weighted Least Squares Estimates / Example

To downweight the leftmost point, we assign case weights as follows:

    w    x     y
    1    5.65  3.54
    1    3.37  1.75
    1    1.97  0.04
    1    3.70  4.42
    0.1  0.15  3.85
    1    8.14  8.75
    1    7.42  8.11
    1    6.59  5.64
    1    1.77  0.18
    1    7.74  8.30

(figure: data with the fitted line with weights and the fitted line without weights) 35 / 36
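
Applying the sketch above to this table (my own illustration; the unweighted fit is obtained by simply setting all case weights to 1):

```python
import numpy as np

x = np.array([5.65, 3.37, 1.97, 3.70, 0.15, 8.14, 7.42, 6.59, 1.77, 7.74])
y = np.array([3.54, 1.75, 0.04, 4.42, 3.85, 8.75, 8.11, 5.64, 0.18, 8.30])
w = np.array([1, 1, 1, 1, 0.1, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])        # intercept + one predictor
beta_weighted = learn_linreg_weighted(X, y, w)
beta_unweighted = learn_linreg_weighted(X, y, np.ones_like(w))
print(beta_weighted, beta_unweighted)
```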

4. Case Weights
Summary

- For regression, linear models of type ŷ = x^T β̂ can be used to predict a quantitative y based on several (quantitative) predictors x.
- A bias term can be modeled as an additional predictor that is constant 1.
- The ordinary least squares (OLS) estimates are the parameters with minimal residual sum of squares (RSS).
- OLS estimates can be computed by solving the normal equations X^T X β̂ = X^T y like any system of linear equations, e.g., via Gaussian elimination.
- Alternatively, OLS estimates can be computed iteratively via gradient descent. Especially for high-dimensional, sparse predictors GD is advantageous, as it never has to compute the large, dense X^T X.
- Case weights can be handled seamlessly by both methods to model the different importance of cases.
36 / 36

Further Readings [JWHT13, chapter 3], [Mur12, chapter 7], [HTFF05, chapter 3]. 37 / 36

References

[Bat89] Roberto Battiti. Accelerated backpropagation learning: Two optimization methods. Complex Systems, 3(4):331-342, 1989.

[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction, volume 27. 2005.

[JWHT13] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. Springer, 2013.

[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

38 / 36