
COMS 4721: Machine Learning for Data Science, Lecture 6, 2/2/2017
Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University

UNDERDETERMINED LINEAR EQUATIONS
We now consider the regression problem $y = Xw$, where $X \in \mathbb{R}^{n \times d}$ is fat (i.e., $d > n$). This is called an underdetermined problem: there are more dimensions than observations, so $w$ now has an infinite number of solutions satisfying $y = Xw$.
These sorts of high-dimensional problems often come up:
- In gene analysis there are 1000s of genes but only 100s of subjects.
- Images can have millions of pixels.
- Even polynomial regression can quickly lead to this scenario.

MINIMUM $\ell_2$ REGRESSION

ONE SOLUTION (LEAST NORM)
One possible solution to the underdetermined problem is
$$ w_{\mathrm{ln}} = X^T (XX^T)^{-1} y \quad\Longrightarrow\quad X w_{\mathrm{ln}} = XX^T (XX^T)^{-1} y = y. $$
We can construct another solution by adding to $w_{\mathrm{ln}}$ a vector $\delta \in \mathbb{R}^d$ that is in the null space $\mathcal{N}$ of $X$:
$$ \delta \in \mathcal{N}(X) \;\Leftrightarrow\; X\delta = 0 \text{ and } \delta \neq 0, $$
and so
$$ X(w_{\mathrm{ln}} + \delta) = X w_{\mathrm{ln}} + X\delta = y + 0. $$
In fact, there are an infinite number of possible $\delta$, because $d > n$.
We can show that $w_{\mathrm{ln}}$ is the solution with smallest $\ell_2$ norm. We will use the proof of this fact as an excuse to introduce two general concepts.
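As a quick numerical sanity check (not part of the slides), the following numpy sketch builds a random fat $X$, computes $w_{\mathrm{ln}}$, perturbs it by a null-space direction, and confirms that both vectors solve $y = Xw$ while $w_{\mathrm{ln}}$ has the smaller $\ell_2$ norm. The random data and dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fat X: more dimensions than observations
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Least-norm solution w_ln = X^T (X X^T)^{-1} y
w_ln = X.T @ np.linalg.solve(X @ X.T, y)

# Any nonzero vector in the null space of X gives another solution;
# the last rows of V^T from the SVD span that null space.
_, _, Vt = np.linalg.svd(X)
delta = Vt[-1]
w_other = w_ln + delta

print(np.allclose(X @ w_ln, y))                         # True: w_ln solves y = Xw
print(np.allclose(X @ w_other, y))                      # True: so does w_ln + delta
print(np.linalg.norm(w_ln) < np.linalg.norm(w_other))   # True: w_ln has the smaller norm
```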

TOOLS: ANALYSIS
We can use analysis to prove that $w_{\mathrm{ln}}$ satisfies the optimization problem
$$ w_{\mathrm{ln}} = \arg\min_w \|w\|_2 \quad \text{subject to } Xw = y. $$
(Think of mathematical analysis as the use of inequalities to prove things.)
Proof: Let $w$ be another solution to $Xw = y$, and so $X(w - w_{\mathrm{ln}}) = 0$. Also,
$$ (w - w_{\mathrm{ln}})^T w_{\mathrm{ln}} = (w - w_{\mathrm{ln}})^T X^T (XX^T)^{-1} y = \underbrace{(X(w - w_{\mathrm{ln}}))^T}_{=\,0} (XX^T)^{-1} y = 0. $$
As a result, $w - w_{\mathrm{ln}}$ is orthogonal to $w_{\mathrm{ln}}$. It follows that
$$ \|w\|^2 = \|w - w_{\mathrm{ln}} + w_{\mathrm{ln}}\|^2 = \|w - w_{\mathrm{ln}}\|^2 + \|w_{\mathrm{ln}}\|^2 + 2\underbrace{(w - w_{\mathrm{ln}})^T w_{\mathrm{ln}}}_{=\,0} > \|w_{\mathrm{ln}}\|^2, $$
where the inequality is strict because $w \neq w_{\mathrm{ln}}$ implies $\|w - w_{\mathrm{ln}}\|^2 > 0$.

TOOLS: LAGRANGE MULTIPLIERS
Instead of starting from the solution, start from the problem,
$$ w_{\mathrm{ln}} = \arg\min_w w^T w \quad \text{subject to } Xw = y. $$
Introduce Lagrange multipliers: $L(w, \eta) = w^T w + \eta^T (Xw - y)$.
Minimize $L$ over $w$, maximize over $\eta$. If $Xw \neq y$, we can get $L = +\infty$.
The optimality conditions are
$$ \nabla_w L = 2w + X^T \eta = 0, \qquad \nabla_\eta L = Xw - y = 0. $$
We have everything necessary to find the solution:
1. From the first condition: $w = -X^T \eta / 2$.
2. Plug into the second condition: $\eta = -2(XX^T)^{-1} y$.
3. Plug this back into #1: $w_{\mathrm{ln}} = X^T (XX^T)^{-1} y$.
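A short check (again not from the slides) that the formula obtained from the Lagrange-multiplier derivation agrees with standard library routines that return the minimum-norm solution of an underdetermined system; `np.linalg.pinv` and `np.linalg.lstsq` are used here as reference implementations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_ln = X.T @ np.linalg.solve(X @ X.T, y)           # formula from the Lagrange derivation
w_pinv = np.linalg.pinv(X) @ y                     # Moore-Penrose pseudoinverse solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # lstsq returns the minimum-norm solution

print(np.allclose(w_ln, w_pinv))    # True
print(np.allclose(w_ln, w_lstsq))   # True
```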

SPARSE $\ell_1$ REGRESSION

LS AND RR IN HIGH DIMENSIONS
Least squares and ridge regression are usually not suited for high-dimensional data.
Modern problems:
- Many dimensions/features/predictors
- Only a few of these may be important or relevant for predicting $y$
- Therefore, we need some form of feature selection
Least squares and ridge regression:
- Treat all dimensions equally, without favoring subsets of dimensions
- The relevant dimensions are averaged with irrelevant ones
- Problems: poor generalization to new data, interpretability of results

REGRESSION WITH PENALTIES
Penalty terms. Recall: general ridge regression is of the form
$$ L = \sum_{i=1}^n (y_i - f(x_i; w))^2 + \lambda \|w\|^2. $$
We've referred to the term $\|w\|^2$ as a penalty term and used $f(x_i; w) = x_i^T w$.
Penalized fitting. The general structure of the optimization problem is
total cost = goodness-of-fit term + penalty term.
The goodness-of-fit term measures how well our model $f$ approximates the data. The penalty term makes the solutions we don't want more expensive.
What kind of solutions does the choice $\|w\|^2$ favor or discourage?
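For concreteness (not on the slide), with $f(x_i; w) = x_i^T w$ this penalized objective is the ridge regression problem seen earlier in the course, whose minimizer has the familiar closed form $w_{\mathrm{RR}} = (\lambda I + X^T X)^{-1} X^T y$. A minimal numpy sketch, with arbitrary synthetic data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10))
w_true = rng.standard_normal(10)
y = X @ w_true + 0.1 * rng.standard_normal(50)

w_rr = ridge_fit(X, y, lam=1.0)   # shrunk towards zero relative to least squares
```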

QUADRATIC PENALTIES
Intuitions. With a quadratic penalty, the reduction in cost depends on $w_j$: suppose we reduce $w_j$ by $\Delta w$; the effect on $L$ depends on the starting point of $w_j$, since the same step saves more cost when $w_j$ is large.
Consequence: we should favor vectors $w$ whose entries are of similar size, preferably small.
[Figure: the curve $w_j^2$ plotted against $w_j$, with equal steps $\Delta w$ marked at a small and a large value of $w_j$.]

SPARSITY
Setting: a regression problem with $n$ data points $x \in \mathbb{R}^d$, $d \gg n$. Goal: select a small subset of the $d$ dimensions and switch off the rest. This is sometimes referred to as feature selection.
What does it mean to switch off a dimension? Each entry of $w$ corresponds to a dimension of the data $x$. If $w_k = 0$, the prediction is
$$ f(x, w) = x^T w = w_1 x_1 + \cdots + 0 \cdot x_k + \cdots + w_d x_d, $$
so the prediction does not depend on the $k$th dimension.
Feature selection: find a $w$ that (1) predicts well, and (2) has only a small number of non-zero entries. A $w$ for which most dimensions are 0 is called a sparse solution.

SPARSITY AND PENALTIES
Penalty goal: find a penalty term which encourages sparse solutions.
Quadratic penalty vs. sparsity: suppose $w_k$ is large and all other $w_j$ are very small but non-zero.
- Sparsity: the penalty should keep $w_k$ and push the other $w_j$ to zero.
- Quadratic penalty: will favor entries $w_j$ which all have similar size, and so it will push $w_k$ towards a small value. Overall, a quadratic penalty favors many small, but non-zero, values.
Solution: sparsity can be achieved using linear penalty terms.

LASSO
Sparse regression via the LASSO: Least Absolute Shrinkage and Selection Operator.
With the LASSO, we replace the $\ell_2$ penalty with an $\ell_1$ penalty:
$$ w_{\mathrm{lasso}} = \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_1, \quad \text{where } \|w\|_1 = \sum_{j=1}^d |w_j|. $$
This is also called $\ell_1$-regularized regression.
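A hedged illustration (not from the slides) of the LASSO in the $d \gg n$ regime, using scikit-learn's `Lasso`. Note that scikit-learn minimizes $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$, so its `alpha` corresponds to $\lambda$ only up to the $1/(2n)$ scaling of the fit term; the data and parameter values below are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d, k = 100, 1000, 5                   # d >> n, only k relevant features
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k] = 3.0 * rng.standard_normal(k)
y = X @ w_true + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000)
lasso.fit(X, y)

# The fitted coefficient vector is sparse: typically far fewer than d non-zeros.
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```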

QUADRATIC PENALTIES
[Figure: (left) the quadratic penalty $w_j^2$ and (right) the linear penalty $|w_j|$, each plotted against $w_j$.]
Quadratic penalty $w_j^2$: reducing a large value $w_j$ achieves a larger cost reduction.
Linear penalty $|w_j|$: the cost reduction does not depend on the magnitude of $w_j$.

RIDGE REGRESSION VS LASSO
[Figure: two panels in the $(w_1, w_2)$ plane, each marking the least squares solution $w_{\mathrm{LS}}$.]
This figure applies to $d < n$, but gives intuition for $d > n$.
Red: contours of $(w - w_{\mathrm{LS}})^T (X^T X)(w - w_{\mathrm{LS}})$ (see Lecture 3).
Blue: (left) contours of $\|w\|_1$, and (right) contours of $\|w\|_2^2$.

COEFFICIENT PROFILES: RR VS LASSO
[Figure: coefficient profiles for (a) the $\ell_2$ ($\|w\|^2$) penalty and (b) the $\ell_1$ ($\|w\|_1$) penalty.]

$\ell_p$ REGRESSION
$\ell_p$-norms. These norm penalties can be extended to all norms:
$$ \|w\|_p = \Big( \sum_{j=1}^d |w_j|^p \Big)^{1/p} \quad \text{for } 0 < p. $$
$\ell_p$-regression. The $\ell_p$-regularized linear regression problem is
$$ w_{\ell_p} := \arg\min_w \|y - Xw\|_2^2 + \lambda \|w\|_p^p. $$
We have seen: $\ell_1$-regression = LASSO, and $\ell_2$-regression = ridge regression.

$\ell_p$ PENALIZATION TERMS
[Figure: contour sets of $\|\cdot\|_p$ for $p = 4,\ 2,\ 1,\ 0.5,\ 0.1$.]
Behavior of $\|\cdot\|_p$:
- $p = \infty$: the norm measures the largest absolute entry, $\|w\|_\infty = \max_j |w_j|$.
- $p > 2$: the norm focuses on large entries.
- $p = 2$: large entries are expensive; encourages similar-size entries.
- $p = 1$: encourages sparsity.
- $p < 1$: encourages sparsity as for $p = 1$, but the contour set is not convex (i.e., there is no line of sight between every two points inside the shape).
- $p \to 0$: simply records whether an entry is non-zero, i.e., $\|w\|_0 = \sum_j \mathbb{I}\{w_j \neq 0\}$.
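A small numerical illustration (not from the slides) of these behaviors: a sparse vector and a spread-out vector with the same $\ell_2$ norm are penalized very differently by $\|w\|_p^p$ depending on $p$.

```python
import numpy as np

w_sparse = np.array([1.0, 0.0, 0.0, 0.0])   # one large entry
w_dense  = np.array([0.5, 0.5, 0.5, 0.5])   # same l2 norm, spread across entries

for p in [4, 2, 1, 0.5]:
    pen_sparse = np.sum(np.abs(w_sparse) ** p)
    pen_dense  = np.sum(np.abs(w_dense) ** p)
    print(f"p={p}: sparse {pen_sparse:.2f}, dense {pen_dense:.2f}")
# p=4 penalizes the peaked vector more, p=2 treats them equally,
# and p<=1 penalizes the dense vector more, i.e., it favors sparsity.
```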

COMPUTING THE SOLUTION FOR $\ell_p$
Solution of the $\ell_p$ problem:
- $\ell_2$: aka ridge regression; has a closed-form solution.
- $p \geq 1$, $p \neq 2$: by convex optimization. We won't discuss convex analysis in detail in this class, but two facts are important: there are no locally optimal solutions (i.e., no local minima of $L$), and the true solution can be found exactly using iterative algorithms.
- $p < 1$: we can only find an approximate solution (i.e., the best in its "neighborhood") using iterative algorithms.

Three techniques formulated as optimization problems:

Method           | Goodness-of-fit    | Penalty       | Solution method
Least squares    | $\|y - Xw\|_2^2$   | none          | Analytic solution exists if $X^T X$ is invertible
Ridge regression | $\|y - Xw\|_2^2$   | $\|w\|_2^2$   | Analytic solution always exists
LASSO            | $\|y - Xw\|_2^2$   | $\|w\|_1$     | Numerical optimization to find the solution
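The "numerical optimization" entry for the LASSO can be carried out in several ways; the lecture does not specify one, but a common choice is proximal gradient descent (ISTA), sketched below under the assumption that the objective is exactly $\|y - Xw\|_2^2 + \lambda\|w\|_1$.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize ||y - Xw||_2^2 + lam * ||w||_1 by proximal gradient descent (ISTA)."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)               # gradient of the squared-error term
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

For example, `w = lasso_ista(X, y, lam=0.1)` returns a typically sparse weight vector; accelerated variants (FISTA) or coordinate descent are usually faster in practice.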