Machine Learning. Regression basics. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning
Regression basics: linear regression, non-linear features (polynomial, RBFs, piece-wise), regularization, cross validation, Ridge/Lasso, kernel trick
Marc Toussaint, University of Stuttgart, Summer 2015

[Two plots (MT/plot.h → gnuplot): training data ('train') with a fitted non-linear regression model ('model'), and training data with a non-linear decision boundary.]
Are these linear models? Linear in what?
Input: No. Parameters, features: Yes!
2/21

Linear Modelling is more powerful than it might seem at first!
- Linear Regression on non-linear features: very powerful (polynomials, piece-wise, spline basis, kernels)
- Regularization (Ridge, Lasso) & cross-validation for proper generalization to test data
- Gaussian Processes and SVMs are closely related (linear in kernel features, but with different optimality criteria)
- Liquid/Echo State Machines and Extreme Learning are examples of linear modelling on many (sort of random) non-linear features
- Basic insights in model complexity (effective degrees of freedom)
- Input relevance estimation (z-score) and feature selection (Lasso)
- Linear regression → linear classification (logistic regression: outputs are likelihood ratios)
Good foundation for learning about ML.
(We roughly follow Hastie, Tibshirani, Friedman: Elements of Statistical Learning)
3/21

Linear Regression
Notation: input vector $x \in \mathbb{R}^d$, output value $y \in \mathbb{R}$, parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_d)^\top \in \mathbb{R}^{d+1}$, linear model $f(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j$.
Given training data $D = \{(x_i, y_i)\}_{i=1}^n$ we define the least squares cost (or "loss")
$$L^{\text{ls}}(\beta) = \sum_{i=1}^n \bigl(y_i - f(x_i)\bigr)^2$$
4/21

Optimal parameters β
Augment the input vector with a 1 in front: $\bar{x} = (1, x) = (1, x_1, \ldots, x_d) \in \mathbb{R}^{d+1}$, $\beta = (\beta_0, \beta_1, \ldots, \beta_d) \in \mathbb{R}^{d+1}$, so that $f(x) = \beta_0 + \sum_{j=1}^d \beta_j x_j = \bar{x}^\top \beta$.
Rewrite the sum of squares as
$$L^{\text{ls}}(\beta) = \sum_{i=1}^n (y_i - \bar{x}_i^\top \beta)^2 = \|y - X\beta\|^2, \qquad
X = \begin{pmatrix} \bar{x}_1^\top \\ \vdots \\ \bar{x}_n^\top \end{pmatrix}
  = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\ \vdots & & & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,d} \end{pmatrix}, \quad
y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$$
Optimum:
$$0_d^\top = \frac{\partial L^{\text{ls}}(\beta)}{\partial \beta} = -2(y - X\beta)^\top X
\;\Leftrightarrow\; 0_d = X^\top X\beta - X^\top y
\;\Rightarrow\; \hat{\beta}^{\text{ls}} = (X^\top X)^{-1} X^\top y$$
5/21
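Below is a minimal numpy sketch of this closed-form least squares solution (the toy data, seed, and variable names are my own, not the lecture's demo); it uses np.linalg.solve rather than forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

# Hypothetical 1D toy data (not the slides' dataset)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=40)
y = 0.5 * x - 1.0 + rng.normal(0.0, 0.3, size=40)

# Augment inputs with a leading 1 for the bias beta_0
X = np.column_stack([np.ones_like(x), x])          # shape (n, d+1)

# Least squares optimum: beta = (X^T X)^{-1} X^T y
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ls)                                     # roughly (-1.0, 0.5)
```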

[Two plots (MT/plot.h → gnuplot): training data ('train') with the fitted linear model ('model').]
./x.exe -mode 1 -datafeaturetype 1 -modelfeaturetype 1
6/21

Non-linear features
Replace the inputs $x_i \in \mathbb{R}^d$ by some non-linear features $\phi(x_i) \in \mathbb{R}^k$:
$$f(x) = \sum_{j=1}^k \phi_j(x)\,\beta_j = \phi(x)^\top \beta$$
The optimal β is the same, $\hat{\beta}^{\text{ls}} = (X^\top X)^{-1} X^\top y$, but with $X = \begin{pmatrix} \phi(x_1)^\top \\ \vdots \\ \phi(x_n)^\top \end{pmatrix} \in \mathbb{R}^{n \times k}$.
What are features?
a) Features are an arbitrary set of basis functions.
b) Any function linear in β can be written as $f(x) = \phi(x)^\top \beta$ for some φ, which we denote as "features".
7/21

Example: Polynomial features
Linear: $\phi(x) = (1, x_1, \ldots, x_d) \in \mathbb{R}^{1+d}$
Quadratic: $\phi(x) = (1, x_1, \ldots, x_d, x_1^2, x_1 x_2, x_1 x_3, \ldots, x_d^2) \in \mathbb{R}^{1+d+\frac{d(d+1)}{2}}$
Cubic: $\phi(x) = (\ldots, x_1^3, x_1^2 x_2, x_1^2 x_3, \ldots, x_d^3) \in \mathbb{R}^{1+d+\frac{d(d+1)}{2}+\frac{d(d+1)(d+2)}{6}}$
[Diagram: the input $x = (x_1, \ldots, x_d)$ is mapped by φ to the feature vector $(1, x_1, \ldots, x_d, x_1^2, x_1 x_2, x_1 x_3, \ldots, x_d^2)$, which is combined linearly with β to give $f(x) = \phi(x)^\top \beta$.]
./x.exe -mode 1 -datafeaturetype 1 -modelfeaturetype 1
8/21
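As a concrete illustration of polynomial features in 1D, here is a hedged sketch that reuses the least squares formula above (the helper name `poly_features`, the degree, and the toy data are illustrative choices, not the lecture's demo code):

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to the feature vectors (1, x, x^2, ..., x^degree)."""
    return np.column_stack([x**j for j in range(degree + 1)])

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + rng.normal(0.0, 0.1, size=30)      # noisy sin data, as in the slides' examples

Phi = poly_features(x, degree=3)                   # feature matrix with rows phi(x_i)^T
beta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)     # same closed-form least squares fit
x_test = np.linspace(-3, 3, 100)
y_hat = poly_features(x_test, degree=3) @ beta     # predictions f(x) = phi(x)^T beta
```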

Example: Piece-wise features
Piece-wise constant: $\phi_j(x) = \mathbb{I}(\xi_1 < x < \xi_2)$
Piece-wise linear: $\phi_j(x) = x\,\mathbb{I}(\xi_1 < x < \xi_2)$
Continuous piece-wise linear: $\phi_j(x) = (x - \xi_1)_+$
9/21
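A small sketch of the continuous piece-wise linear ("hinge") features $(x - \xi)_+$ for a set of knots ξ (the knot positions below are an illustrative assumption):

```python
import numpy as np

def hinge_features(x, knots):
    """phi(x) = (1, x, (x - xi_1)_+, ..., (x - xi_m)_+) for scalar inputs x."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - xi, 0.0) for xi in knots]
    return np.column_stack(cols)

knots = [-1.5, 0.0, 1.5]                           # hypothetical knot positions
Phi = hinge_features(np.linspace(-3, 3, 7), knots)
print(Phi.shape)                                   # (7, 2 + number of knots)
```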

Example: Radial Basis Functions (RBF)
Given a set of centers $\{c_j\}_{j=1}^k$, define
$$\phi_j(x) = b(x, c_j) = e^{-\frac{1}{2}\|x - c_j\|^2} \in [0, 1]$$
Each $\phi_j(x)$ measures similarity with the center $c_j$.
Special case: use all training inputs $\{x_i\}_{i=1}^n$ as centers,
$$\phi(x) = \bigl(1, b(x, x_1), \ldots, b(x, x_n)\bigr) \qquad (n+1 \text{ dimensional})$$
This is related to kernel methods and GPs, but not quite the same; we'll discuss this later.
10/21
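Correspondingly, a hedged sketch of RBF features with all training inputs as centers; the `width` parameter is my addition (the slide's formula has unit width, while the demo command exposes an -rbfwidth option):

```python
import numpy as np

def rbf_features(x, centers, width=1.0):
    """phi(x) = (1, b(x, c_1), ..., b(x, c_k)) with b(x, c) = exp(-|x - c|^2 / (2 width^2))."""
    sq_dists = (x[:, None] - centers[None, :]) ** 2
    return np.column_stack([np.ones(len(x)), np.exp(-0.5 * sq_dists / width**2)])

x_train = np.linspace(-3, 3, 20)
Phi = rbf_features(x_train, centers=x_train, width=0.5)   # special case: centers = training inputs
print(Phi.shape)                                          # (20, 21): n + 1 dimensional features
```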

Features
- Polynomial
- Piece-wise
- Radial basis functions (RBF)
- Splines (see Hastie Ch. 5)
Linear regression on top of rich features is extremely powerful!
11/21

The need for regularization
Noisy sin data fitted with radial basis functions:
[Plot: noisy training data ('z.train') and the fitted RBF model ('z.model').]
./x.exe -mode 1 -n 40 -modelfeaturetype 4 -datatype 2 -rbfwidth .1 -sigma .5 -lambda 1e-10
Overfitting & generalization: the model overfits to the data and generalizes badly.
Estimator variance: when you repeat the experiment (keeping the underlying function fixed), the regression always returns a different model estimate.
12/21

Estimator variance
Assumptions: we computed parameters $\hat{\beta} = (X^\top X)^{-1} X^\top y$, and the data was noisy with variance $\text{Var}\{y\} = \sigma^2 I_n$.
Then
$$\text{Var}\{\hat{\beta}\} = (X^\top X)^{-1} \sigma^2$$
High data noise σ → high estimator variance.
More data → less estimator variance: $\text{Var}\{\hat{\beta}\} \propto \frac{1}{n}$.
In practice we don't know σ, but we can estimate it from the deviation from the learnt model:
$$\hat{\sigma}^2 = \frac{1}{n - d - 1} \sum_{i=1}^n \bigl(y_i - f(x_i)\bigr)^2$$
13/21
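The variance formula can be checked empirically by refitting on many fresh noisy draws of y with the inputs held fixed; this simulation sketch (toy setup and seed are mine) compares the empirical covariance of β̂ with $(X^\top X)^{-1}\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 50, 0.5
x = rng.uniform(-3, 3, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-1.0, 0.5])

# Repeat the experiment: same X, fresh noise in y each time
betas = []
for _ in range(2000):
    y = X @ beta_true + rng.normal(0.0, sigma, size=n)
    betas.append(np.linalg.solve(X.T @ X, X.T @ y))

print(np.cov(np.array(betas).T))             # empirical Var{beta_hat}
print(sigma**2 * np.linalg.inv(X.T @ X))     # theoretical (X^T X)^{-1} sigma^2
```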

Estimator variance
Overfitting: picking one specific data set $y \sim N(y_{\text{mean}}, \sigma^2 I_n)$ means picking one specific $\hat{\beta} \sim N(\beta_{\text{mean}}, (X^\top X)^{-1}\sigma^2)$.
If we could reduce the variance of the estimator, we could reduce overfitting and increase generalization.
14/21

Hastie's section on shrinkage methods is great! It describes several ideas on reducing estimator variance by reducing model complexity. We focus on regularization.
15/21

Ridge regression: L2-regularization
We add a regularization term to the cost:
$$L^{\text{ridge}}(\beta) = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 + \lambda \sum_{j=1}^k \beta_j^2$$
NOTE: β₀ is usually not regularized!
Optimum:
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
(where $I = I_k$, or with $I_{1,1} = 0$ if β₀ is not regularized)
16/21
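A minimal sketch of the ridge optimum, using the $I_{1,1} = 0$ trick so the bias β₀ stays unregularized (the function name and the convention that column 0 of Phi is the constant feature are my assumptions):

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """beta = (Phi^T Phi + lam * I)^{-1} Phi^T y, with the bias column left unregularized."""
    I = np.eye(Phi.shape[1])
    I[0, 0] = 0.0                # assume column 0 of Phi is the constant feature (bias)
    return np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)
```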

The objective is now composed of two "potentials": the loss, which depends on the data and jumps around (introduces variance), and the regularization penalty (sitting steadily at zero). Both are pulling at the optimal β → the regularization reduces variance.
The estimator variance becomes less: $\text{Var}\{\hat{\beta}\} = (X^\top X + \lambda I)^{-1} \sigma^2$
The ridge effectively reduces the complexity of the model:
$$df(\lambda) = \sum_{j=1}^d \frac{d_j^2}{d_j^2 + \lambda}$$
where $d_j^2$ is the $j$-th eigenvalue of $X^\top X = V D^2 V^\top$ (details: Hastie 3.4.1).
17/21
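The effective degrees of freedom df(λ) can be computed directly from the eigenvalues of $X^\top X$; a short sketch under the same notation (the function name is mine):

```python
import numpy as np

def effective_dof(Phi, lam):
    """df(lambda) = sum_j d_j^2 / (d_j^2 + lambda), with d_j^2 the eigenvalues of Phi^T Phi."""
    eigvals = np.linalg.eigvalsh(Phi.T @ Phi)    # eigenvalues d_j^2 >= 0 of the symmetric matrix
    return float(np.sum(eigvals / (eigvals + lam)))
```

As λ → 0 this approaches the number of features with nonzero eigenvalue; as λ → ∞ it shrinks towards 0, i.e. the model becomes effectively simpler.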

Choosing λ: generalization error & cross validation
λ = 0 will always have a lower training data error; we need to estimate the generalization error on test data.
k-fold cross-validation:
[Diagram: the data partitioned into subsets $D_1, D_2, \ldots, D_i, \ldots, D_k$]
1: Partition data D into k equal sized subsets $D = \{D_1, \ldots, D_k\}$
2: for $i = 1, \ldots, k$ do
3:   compute $\hat{\beta}_i$ on the training data $D \setminus D_i$, leaving out $D_i$
4:   compute the error $\ell_i = L^{\text{ls}}(\hat{\beta}_i, D_i)$ on the validation data $D_i$
5: end for
6: report the mean error $\hat{\ell} = \frac{1}{k}\sum_i \ell_i$ and variance $\bigl(\frac{1}{k}\sum_i \ell_i^2\bigr) - \hat{\ell}^2$
Choose the λ for which $\hat{\ell}$ is smallest.
18/21
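The pseudocode above translates almost line by line into Python; here is a hedged sketch for choosing λ with ridge regression (the λ grid, k = 10, and the helper names are illustrative, and it assumes a feature matrix Phi and targets y as in the earlier sketches):

```python
import numpy as np

def ridge_beta(Phi, y, lam):
    I = np.eye(Phi.shape[1])
    I[0, 0] = 0.0                                     # bias not regularized
    return np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)

def cv_error(Phi, y, lam, k=10, seed=3):
    """Mean validation squared error of ridge regression over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):                                # for i = 1..k: leave out fold D_i
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta_i = ridge_beta(Phi[train], y[train], lam)
        errs.append(np.mean((y[val] - Phi[val] @ beta_i) ** 2))
    return np.mean(errs)

# Choose the lambda with the smallest estimated generalization error
lambdas = 10.0 ** np.arange(-3.0, 5.0)
# best_lam = min(lambdas, key=lambda lam: cv_error(Phi, y, lam))   # given Phi, y from above
```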

Quadratic features on sinus data:
[Plot: cross-validation error and training error (mean squared error) as a function of λ, for λ ranging from 0.001 to 100000.]
./x.exe -mode 4 -n 10 -modelfeaturetype 2 -datatype 2 -sigma .1
./x.exe -mode 1 -n 10 -modelfeaturetype 2 -datatype 2 -sigma .1
19/21

Summary
- Linear models on non-linear features: extremely powerful
  linear → polynomial → RBF → kernel
  Ridge → Lasso
  regression → classification
- Generalization ↔ Regularization ↔ complexity/dof penalty
- Cross validation to estimate generalization empirically → use it to choose regularization parameters
20/21

Representation:
- $f(x) = \phi(x)^\top \beta$ (also kernel, SVR, GP, splines)
- Neural Networks
- Regression Trees ("tree-features")
- lazy learning, local smoothing, kNN

Objective:
- error: sqr. error, median sqr. error, robust error functions (Huber, Forsyth), SVM regression loss
- regularization: Ridge ($L_2$, $\lambda\|\beta\|^2$), Lasso ($L_1$, $\lambda\|\beta\|_1$)

Solver:
- analytic: $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$
21/21

Appendix: Lasso: L1-regularization
We add an L1 regularization to the cost:
$$L^{\text{lasso}}(\beta) = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 + \lambda \sum_{j=1}^k |\beta_j|$$
NOTE: β₀ is usually not regularized!
This has no closed-form expression for the optimum. (The optimum can be found by solving a quadratic program; see appendix.)
22/21
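Since there is no closed form, the Lasso optimum is found iteratively in practice. Below is a hedged coordinate-descent sketch with soft-thresholding, one standard alternative to the quadratic-program route mentioned above (not the lecture's solver); it leaves β₀ (column 0) unregularized, matching the note:

```python
import numpy as np

def soft_threshold(rho, t):
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def lasso_cd(Phi, y, lam, n_iters=200):
    """Coordinate descent for sum_i (y_i - phi(x_i)^T beta)^2 + lam * sum_{j>=1} |beta_j|."""
    n, k = Phi.shape
    beta = np.zeros(k)
    z = np.sum(Phi**2, axis=0)                            # per-coordinate curvature
    for _ in range(n_iters):
        for j in range(k):
            r_j = y - Phi @ beta + Phi[:, j] * beta[j]    # residual with coordinate j removed
            rho = Phi[:, j] @ r_j
            if j == 0:
                beta[j] = rho / z[j]                      # bias beta_0 is not regularized
            else:
                beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return beta
```

For large enough λ many coefficients of the returned β are exactly zero, which is the sparsity / feature-selection property discussed next.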

Lasso vs. Ridge:
Lasso → sparsity! → feature selection!
23/21

$$L^q(\beta) = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 + \lambda \sum_{j=1}^k |\beta_j|^q$$
Subset selection: q = 0 (counting the number of $\beta_j \neq 0$)
24/21

Appendix: Dual formulation of Ridge
The standard way to write the Ridge regularization:
$$L^{\text{ridge}}(\beta) = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 + \lambda \sum_{j=1}^k \beta_j^2$$
Finding $\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_\beta L^{\text{ridge}}(\beta)$ is equivalent to solving
$$\hat{\beta}^{\text{ridge}} = \operatorname*{argmin}_\beta \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^k \beta_j^2 \le t$$
λ is the Lagrange multiplier for the inequality constraint.
25/21
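A short sketch of the standard argument behind this equivalence (not spelled out on the slide): for λ ≥ 0 the Lagrangian of the constrained problem is

```latex
\mathcal{L}(\beta, \lambda)
  = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2
    + \lambda \Bigl(\sum_{j=1}^k \beta_j^2 - t\Bigr)
```

and for fixed λ, minimizing over β is the same as minimizing $L^{\text{ridge}}(\beta)$, since the term $\lambda t$ does not depend on β; conversely, each λ corresponds to some constraint level t.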

Appendix: Dual formulation of Lasso
The standard way to write the Lasso regularization:
$$L^{\text{lasso}}(\beta) = \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 + \lambda \sum_{j=1}^k |\beta_j|$$
Equivalent formulation (via KKT):
$$\hat{\beta}^{\text{lasso}} = \operatorname*{argmin}_\beta \sum_{i=1}^n \bigl(y_i - \phi(x_i)^\top \beta\bigr)^2 \quad \text{subject to} \quad \sum_{j=1}^k |\beta_j| \le t$$
Decreasing t is called "shrinkage": the space of allowed β shrinks. Some $\beta_j$ will become zero → feature selection.
26/21