Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda


1 Sparse regression Optimization-Based Data Analysis Carlos Fernandez-Granda 3/28/2016

2 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

3 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

4 Regression
Aim: Predict the value of a response / dependent variable y ∈ R from p predictors / features / independent variables X_1, X_2, ..., X_p ∈ R
Methodology:
1. Fit a model using n training examples y_1, y_2, ..., y_n:
   y_i ≈ f(X_i1, X_i2, ..., X_ip), 1 ≤ i ≤ n
2. Use the learned model f to predict from new data

5 Linear regression
f is parametrized by an intercept β_0 and a vector of weights β ∈ R^p:
y_i ≈ β_0 + Σ_{j=1}^p β_j X_ij, 1 ≤ i ≤ n
In matrix form, y ≈ 1 β_0 + X β, where y = (y_1, ..., y_n), X ∈ R^{n×p} has entries X_ij and 1 is the all-ones vector
Aim: Learn β_0 and β from the training data

6 Linear models Signal representation (compression, denoising) x = D c Aim: Find sparse representation of the signal x Columns of D are designed/learned atoms

7 Linear models Inverse problems (compressed sensing, super-resolution) y = A x Aim: Estimate the signal x from the measurements y A models the measurement process

8 Linear models Linear regression y = X β Aim: Learn the relation between the response y and the predictors X 1, X 2,..., X p to predict from new data Each column of X contains a variable that may or may not be related to the response

9 Preprocessing
1. Center each predictor column X_j by subtracting its mean, so that Σ_{i=1}^n X_ij = 0 for all 1 ≤ j ≤ p
2. Normalize each predictor column X_j so that ‖X_j‖_2 = 1 for all 1 ≤ j ≤ p
3. Center the response vector y so that Σ_{i=1}^n y_i = 0
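
These preprocessing steps are straightforward to carry out numerically. A minimal sketch assuming NumPy follows; the function name and the layout (predictors as columns of a matrix X) are illustrative choices, not from the slides.

```python
import numpy as np

def preprocess(X, y):
    """Center and normalize the predictor columns, center the response."""
    X = X - X.mean(axis=0)              # each column now sums to zero
    X = X / np.linalg.norm(X, axis=0)   # each column now has unit l2 norm
    y = y - y.mean()                    # response has zero mean
    return X, y
```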

10 Least-squares fit
Minimize the l2 norm of the error over the training data:
minimize ‖y - X β‖_2
After preprocessing β_0 = 0, so we can ignore it

11 Geometric interpretation
Decompose y into its projection onto the column space of X and its projection onto the orthogonal complement:
y = y_X + y_X⊥
By Pythagoras's theorem,
‖y - X β‖²_2 = ‖y_X⊥‖²_2 + ‖y_X - X β‖²_2
If n ≥ p and X is full rank, β_ls = (X^T X)^{-1} X^T y satisfies X β_ls = y_X
After preprocessing β_0 = 0
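
As a quick numerical sanity check of the closed-form solution, the following sketch (assuming NumPy; the simulated data are illustrative, not from the slides) compares it with a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

# Least-squares fit: minimize ||y - X beta||_2
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Closed form (X^T X)^{-1} X^T y, valid when n >= p and X is full rank
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_ls, beta_closed))  # True
```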

12 Probabilistic interpretation: Maximum likelihood
Assumption: The linear model is correct but the examples are corrupted by additive iid Gaussian noise (with unit variance),
y = X β + z
The likelihood is the probability density function of y parametrized by β:
L(β) = 1 / √((2π)^n) exp( -(1/2) ‖y - X β‖²_2 )
The maximum-likelihood estimate of β is
β_ML = arg max_β L(β) = arg max_β log L(β) = arg min_β ‖y - X β‖_2

13 Temperature predictor
A friend tells you: "I found a cool way to predict the temperature in New York: it's just a linear combination of the temperature in every other state. I fit the model on data from the last month and a half and it's perfect!"

14 Overfitting
If a model is very complex, it may overfit the data
To evaluate a model we separate the data into a training set and a test set:
1. We fit the model using the training set
2. We evaluate the error on the test set

15 Experiment
X_train, X_test, z_train and β are iid Gaussian with mean 0 and variance 1
y_train = X_train β + z_train
y_test = X_test β
We use y_train and X_train to compute β_ls
error_train = ‖X_train β_ls - y_train‖_2 / ‖y_train‖_2
error_test = ‖X_test β_ls - y_test‖_2 / ‖y_test‖_2
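
A minimal simulation of this experiment (a sketch assuming NumPy; the values of n, p and the number of trials are illustrative, not fixed by the slides):

```python
import numpy as np

def ls_experiment(n, p=50, trials=20, seed=0):
    """Average relative training/test error of least squares on synthetic data."""
    rng = np.random.default_rng(seed)
    err_train, err_test = [], []
    for _ in range(trials):
        X_train = rng.standard_normal((n, p))
        X_test = rng.standard_normal((n, p))
        beta = rng.standard_normal(p)
        y_train = X_train @ beta + rng.standard_normal(n)
        y_test = X_test @ beta
        beta_ls, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        err_train.append(np.linalg.norm(X_train @ beta_ls - y_train) / np.linalg.norm(y_train))
        err_test.append(np.linalg.norm(X_test @ beta_ls - y_test) / np.linalg.norm(y_test))
    return np.mean(err_train), np.mean(err_test)

# As n grows, the training error rises toward the noise level while the test error falls
for n in [60, 100, 200, 500]:
    print(n, ls_experiment(n))
```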

16 Experiment
[Figure: relative error (l2 norm) versus n for the training and test sets, with the training noise level shown for reference]

17 Analysis
Data model: y = X β + z, where z is Gaussian noise with variance σ_z²
σ_min / σ_max are the smallest / largest singular values of X
(1 - ɛ) p σ_z² / σ_max² ≤ ‖β - β_ls‖²_2 ≤ (1 + ɛ) p σ_z² / σ_min²
with high probability

18 Proof
β_ls = V Σ^{-1} U^T y, where X = U Σ V^T is the SVD of X, so
‖β - β_ls‖²_2 = Σ_{j=1}^p ( U_j^T z / σ_j )²
U^T z is Gaussian with mean 0 and covariance I
‖U^T z‖²_2 is chi-square with p degrees of freedom, so ‖U^T z‖²_2 ≈ p

19 Consistency
Assume σ_z² = 1
If X is well conditioned and its entries have constant magnitude, σ_j ≈ √n, so
‖β - β_ls‖_2 ≈ √(p / n)
If the entries of β are constant, ‖β‖_2 ≈ √p
The least-squares estimator is consistent:
‖β - β_ls‖_2 / ‖β‖_2 ≈ 1 / √n

20 Experiment
X ∈ R^{n×p}, β ∈ R^p, z ∈ R^n with iid N(0, 1) entries
[Figure: relative coefficient error (l2 norm) versus n for p = 50, 100, 200, decaying like 1/√n]

21 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

22 Maximum temperatures in Oxford, UK
[Figure: maximum temperature (Celsius) over time]

23 Maximum temperatures in Oxford, UK
[Figure: maximum temperature (Celsius) over time]

24 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

25 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

26 Model fitted by least squares
[Figure: maximum temperature data (Celsius) and fitted model]

27 Trend: Increase of 0.75 °C / 100 years (1.35 °F)
[Figure: maximum temperature data (Celsius) and fitted trend]

28 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

29 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

30 Model for minimum temperatures
[Figure: minimum temperature data (Celsius) and fitted model]

31 Trend: Increase of 0.88 °C / 100 years (1.58 °F)
[Figure: minimum temperature data (Celsius) and fitted trend]

32 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

33 Classification
Aim: Predict the value of a binary response y ∈ {0, 1} from p predictors X_1, X_2, ..., X_p ∈ R
Methodology:
1. Fit a model using n training examples y_1, y_2, ..., y_n:
   y_i ≈ f(X_i1, X_i2, ..., X_ip), 1 ≤ i ≤ n
2. Use the learned model f to predict from new data

34 Logistic-regression model
We model the probability that y_i = 1 by
P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g( β_0 + Σ_{j=1}^p β_j X_ij ) = 1 / ( 1 + exp(-β_0 - Σ_{j=1}^p β_j X_ij) )
g is the logistic function or sigmoid, g(t) := 1 / (1 + exp(-t))

35 Cost function
Likelihood of the data under a Bernoulli model in which
P(y_i = 1 | X_i1, X_i2, ..., X_ip) := g( β_0 + Σ_{j=1}^p β_j X_ij )
P(y_i = 0 | X_i1, X_i2, ..., X_ip) := 1 - g( β_0 + Σ_{j=1}^p β_j X_ij )
Assuming independence,
L(β_0, β) = Π_{i=1}^n g( β_0 + Σ_{j=1}^p β_j X_ij )^{y_i} ( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) )^{1 - y_i}

36 Cost function
The log likelihood is concave:
log L(β_0, β) = Σ_{i=1}^n y_i log g( β_0 + Σ_{j=1}^p β_j X_ij ) + (1 - y_i) log( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) )
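
Since the log likelihood is smooth and concave, a generic solver can maximize it. The following is a minimal sketch assuming NumPy and SciPy; the simulated data, variable names and solver choice are illustrative, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic function g(t) = 1 / (1 + exp(-t))

def neg_log_likelihood(params, X, y):
    """Negative log likelihood; params = [beta_0, beta_1, ..., beta_p]."""
    beta0, beta = params[0], params[1:]
    t = beta0 + X @ beta
    # y*log g(t) + (1-y)*log(1-g(t)) simplifies to y*t - log(1 + exp(t))
    return -np.sum(y * t - np.logaddexp(0.0, t))

# Illustrative synthetic data
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -2.0, 0.0])
y = (rng.random(n) < expit(X @ beta_true)).astype(float)

res = minimize(neg_log_likelihood, np.zeros(p + 1), args=(X, y))
print(res.x)  # estimated [beta_0, beta_1, ..., beta_p]
```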

37 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

38 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

39 Sparse regression
Assumption: The response only depends on a subset S of s ≪ p predictors,
y ≈ 1 β_0 + X_S β_S
Regime of interest: p ≈ n
Model-selection problem: Determine which predictors are relevant
Best-subset selection: Choose among all possible sparse models
Intractable unless s is very small

40 Forward stepwise regression
Greedy method similar to orthogonal matching pursuit
Initialization:
j_0 := arg max_j |⟨y, X_j⟩|
J := {j_0}
β_ls := arg min_β ‖y - X_J β‖_2
r^(0) := y - X_J β_ls

41 Forward stepwise regression
Iterations: for k = 1, 2, ...
j_k := arg max_{j ∉ J} |⟨y, P_{col(X_J)⊥} X_j⟩|
J := J ∪ {j_k}
β_ls := arg min_β ‖y - X_J β‖_2
r^(k) := y - X_J β_ls
Stop when the fit does not improve much anymore
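
A sketch of this greedy procedure (assuming NumPy; the function name, the stopping tolerance and the use of a QR factorization for the projection are implementation choices, not from the slides):

```python
import numpy as np

def forward_stepwise(X, y, tol=1e-3, max_features=None):
    """Greedy forward stepwise regression."""
    n, p = X.shape
    max_features = max_features or p
    J = []
    beta_J = np.array([])
    prev_fit = np.linalg.norm(y)
    for _ in range(max_features):
        # Score each unused column by its correlation with y after projecting
        # out the columns already selected (projection onto col(X_J)⊥).
        if J:
            Q, _ = np.linalg.qr(X[:, J])
            project = lambda v: v - Q @ (Q.T @ v)
        else:
            project = lambda v: v
        scores = [abs(y @ project(X[:, j])) if j not in J else -np.inf
                  for j in range(p)]
        J.append(int(np.argmax(scores)))
        beta_J, *_ = np.linalg.lstsq(X[:, J], y, rcond=None)
        fit = np.linalg.norm(y - X[:, J] @ beta_J)
        if prev_fit - fit < tol:   # stop when the fit no longer improves much
            break
        prev_fit = fit
    return J, beta_J
```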

42 The lasso
l1-norm regularized least-squares regression:
minimize (1 / 2n) ‖y - X β‖²_2 + λ ‖β‖_1
We assume that the data are centered, so β_0 = 0
Original formulation:
minimize ‖y - X β‖_2 subject to ‖β‖_1 ≤ τ
Equivalent, but the relation between λ and τ depends on X and y
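
scikit-learn's Lasso minimizes the same 1/(2n)-scaled objective, with alpha playing the role of λ. A minimal sketch on illustrative synthetic data (the parameter values below are not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 50, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = rng.standard_normal(s)
y = X @ beta + 0.5 * rng.standard_normal(n)

# scikit-learn objective: (1 / (2 n)) ||y - X b||_2^2 + alpha ||b||_1
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.count_nonzero(lasso.coef_))  # number of selected predictors
```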

43 Experiment
X_train and X_test are iid Gaussian (0, 1), z_train is iid Gaussian (0, 0.5)
β has 10 nonzero entries, p = 50, n = 100
y_train = X_train β + z_train
y_test = X_test β
For an estimate β̂ learnt from the training data,
error_train = ‖X_train β̂ - y_train‖_2 / ‖y_train‖_2
error_test = ‖X_test β̂ - y_test‖_2 / ‖y_test‖_2

44 Lasso path
[Figure: lasso coefficients versus regularization parameter, with the relevant features highlighted]

45 Training error
[Figure: relative training error (l2 norm) versus n for least squares, forward regression and the lasso, with the noise level shown for reference]

46 Test error
[Figure: relative test error (l2 norm) versus n for least squares, forward regression and the lasso]

47 Logistic regression
Add an l1-norm regularization term to the log likelihood; we minimize
- Σ_{i=1}^n [ y_i log g( β_0 + Σ_{j=1}^p β_j X_ij ) + (1 - y_i) log( 1 - g( β_0 + Σ_{j=1}^p β_j X_ij ) ) ] + λ ‖β‖_1
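
In scikit-learn this corresponds to LogisticRegression with an l1 penalty, where C is the inverse of the regularization strength. A minimal sketch on synthetic data (illustrative only, not the arrhythmia data of the next slide):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic classification data
X, y = make_classification(n_samples=271, n_features=182, n_informative=20,
                           random_state=0)

# penalty="l1" adds the l1 term; smaller C means stronger regularization
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print((model.coef_ != 0).sum(), "nonzero coefficients")
```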

48 Arrhythmia prediction
Predict whether a patient has arrhythmia from n = 271 examples and p = 182 predictors
Age, sex, height, weight
Features obtained from electrocardiogram recordings
Best sparse model uses around 60 predictors

49 Prediction accuracy
[Figure: misclassification error versus log(lambda)]

50 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

51 Least-squares analysis
Data model: y = X β + z, where z is Gaussian noise with variance σ_z² and β is s-sparse
Assumption: X is well conditioned and its entries have constant magnitude, so σ_min ≈ √n
Columns of X have l2 norm ≈ √n
For the least-squares estimator we have
‖β - β_ls‖_2 ≈ σ_z √(p / n)
If p ≈ n but s ≪ p, we should do better!

52 Restricted eigenvalue property
There exists γ > 0 such that for any v ∈ R^p and any subset T with |T| ≤ s,
if ‖v_{T^c}‖_1 ≤ ‖v_T‖_1 then (1 / n) ‖X v‖²_2 ≥ γ ‖v‖²_2
Similar to the restricted isometry property; may hold even if p > n!

53 Guarantees
If X satisfies the RE property and τ = ‖β‖_1, the solution β_lasso to
minimize ‖y - X β‖_2 subject to ‖β‖_1 ≤ τ
satisfies
‖β - β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n)
with probability 1 - 2 exp(-(α - 1) log p) for any α > 1
Close to the least-squares error if we know which predictors are relevant

54 Proof
The error h := β - β_lasso satisfies ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, so by the RE property
‖β - β_lasso‖²_2 ≤ (1 / γn) ‖X h‖²_2

55 Proof
By optimality of β_lasso,
‖X h‖²_2 ≤ 2 |z^T X h| ≤ 2 ‖X^T z‖_∞ ‖h‖_1
Since ‖h_{T^c}‖_1 ≤ ‖h_T‖_1, ‖h‖_1 ≤ 2 √s ‖h‖_2, so
‖β - β_lasso‖_2 ≤ (4 √s / γn) ‖X^T z‖_∞

56 Proof
X_i^T z is Gaussian with variance σ_z² ‖X_i‖²_2, so for t > 0
P( |X_i^T z| > t σ_z ‖X_i‖_2 ) ≤ 2 exp(-t² / 2)
By the union bound,
P( ‖X^T z‖_∞ > t σ_z max_i ‖X_i‖_2 ) ≤ 2p exp(-t² / 2) = 2 exp(-t² / 2 + log p)

57 Proof
Choosing t = √(2 α log p) with α > 1, if max_i ‖X_i‖_2 = √n we have
P( ‖X^T z‖_∞ > σ_z √(2 α n log p) ) ≤ 2 exp(-(α - 1) log p)
We conclude that
‖β - β_lasso‖_2 ≤ (σ_z / γ) √(32 α s log p / n)
with probability 1 - 2 exp(-(α - 1) log p)

58 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

59 Correlated predictors
Error of the least-squares estimator:
‖β - β_ls‖²_2 = Σ_{j=1}^p ( U_j^T z / σ_j )²
If the predictors are strongly correlated, some singular values can be very small
The estimator suffers from noise amplification

60 Ridge regression
Aim: Control the norm of the weights
minimize ‖y - X β‖²_2 + λ² ‖β‖²_2
Equivalent to the regression problem
minimize ‖ [y; 0] - [X; λI] β ‖²_2
(stacking λI below X and zeros below y)
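
A sketch of this equivalence in code (assuming NumPy; lam stands for the λ above, and the comparison with the closed form (X^T X + λ² I)^{-1} X^T y is an illustrative sanity check):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate via the augmented least-squares problem above."""
    n, p = X.shape
    X_aug = np.vstack([X, lam * np.eye(p)])          # stack lam * I below X
    y_aug = np.concatenate([y, np.zeros(p)])         # stack zeros below y
    beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta

# Sanity check against the closed form (X^T X + lam^2 I)^{-1} X^T y
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
lam = 0.7
direct = np.linalg.solve(X.T @ X + lam**2 * np.eye(5), X.T @ y)
print(np.allclose(ridge(X, y, lam), direct))  # True
```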

61 Ridge regression
If y = X β + z,
β_ridge = V diag( σ_1²/(σ_1²+λ²), ..., σ_p²/(σ_p²+λ²) ) V^T β + V diag( σ_1/(σ_1²+λ²), ..., σ_p/(σ_p²+λ²) ) U^T z
Decreases noise amplification if the singular values are small

62 Correlated predictors in model selection Lasso controls norm amplification but tends to choose a small subset of the relevant predictors Aim: Selecting all relevant predictors Why? Interpretability + improving prediction

63 Identical predictors
If two predictors are the same, X_i = X_j, then β_i and β_j should be the same
True for any cost function of the form
minimize (1 / 2n) ‖y - X β‖²_2 + λ R(β)
as long as the regularizer R is strictly convex
Not true for the lasso

64 Experiment
The entries of z_train, Z_train and Z_test are iid Gaussian (0, 1)
y_train := X_A,train β_A + X_B,train β_B + z_train
y_test := X_A,test β_A + X_B,test β_B
X_train := [ X_A,train  X_B,train  Z_train ]
X_test := [ X_A,test  X_B,test  Z_test ]

65 Experiment
Rows of X_A,train, X_B,train, X_A,test, X_B,test are independent Gaussian random vectors with mean zero and covariance matrix Σ
β_A and β_B have 4 nonzero entries each

66 Lasso path
[Figure: lasso coefficients of the two groups versus regularization parameter]

67 Ridge-regression path
[Figure: ridge-regression coefficients of the two groups versus regularization parameter]

68 Elastic net
Combines the lasso and ridge-regression penalties:
minimize ‖y - X β‖²_2 + λ ( (1 - α)/2 ‖β‖²_2 + α ‖β‖_1 )
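
scikit-learn's ElasticNet combines the two penalties in the same way (with a 1/(2n)-scaled loss); l1_ratio plays the role of α and alpha the role of λ. A minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[:8] = 1.0
y = X @ beta + 0.5 * rng.standard_normal(100)

# scikit-learn objective:
# (1/(2n)) ||y - X b||_2^2 + alpha * ( l1_ratio ||b||_1 + (1 - l1_ratio)/2 ||b||_2^2 )
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.count_nonzero(enet.coef_))
```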

69 Elastic-net path, α = 0.5
[Figure: elastic-net coefficients of the two groups versus regularization parameter]

70 Errors
[Figure: relative training and test error (l2 norm) versus regularization parameter for least squares, ridge regression, the elastic net and the lasso]

71 Regression Least-squares regression Example: Global warming Logistic regression Sparse regression Model selection Analysis of the lasso Correlated predictors Group sparsity

72 Group sparse structure
Predictors may be partitioned into m groups G_1, G_2, ..., G_m:
y ≈ β_0 + X β = β_0 + [ X_{G_1} X_{G_2} ... X_{G_m} ] [ β_{G_1}; β_{G_2}; ...; β_{G_m} ]
Aim: Select the groups that are relevant

73 Multi-task learning
Several responses y_1, y_2, ..., y_k modeled with the same predictors:
Y = [ y_1 y_2 ... y_k ] ≈ B_0 + X B = B_0 + X [ β_1 β_2 ... β_k ]
Assumption: The responses depend on the same subset of predictors
Aim: Learn a group-sparse model

74 Mixed l1/l2 norm
The mixed l1/l2 norm
‖β‖_{1,2} := Σ_{i=1}^m ‖β_{G_i}‖_2
promotes group-sparse structure
For multitask learning,
‖B‖_{1,2} := Σ_{i=1}^p ‖B_{i,:}‖_2
where B_{i,:} is the i-th row of B

75 Geometric intuition
G_1 = {1, 2}, G_2 = {3}
[Figure: the l1-norm ball and the l1/l2-norm ball; taken from Statistical Learning with Sparsity: The Lasso and Generalizations by Hastie, Tibshirani and Wainwright]

76 Group and multi-task lasso
Group lasso:
minimize ‖y - β_0 - X β‖²_2 + λ ‖β‖_{1,2}
Multi-task lasso:
minimize ‖Y - B_0 - X B‖²_F + λ ‖B‖_{1,2}

77 Proximal gradient method
Method to solve the optimization problem
minimize f(x) + g(x)
where f is differentiable and prox_g is tractable
Proximal-gradient iteration:
x^(0) = arbitrary initialization
x^(k+1) = prox_{α_k g}( x^(k) - α_k ∇f(x^(k)) )

78 Subdifferential of the l1/l2 norm
q ∈ R^p is a subgradient of the l1/l2 norm at β ∈ R^p if
q_{G_i} = β_{G_i} / ‖β_{G_i}‖_2   if β_{G_i} ≠ 0
‖q_{G_i}‖_2 ≤ 1                   if β_{G_i} = 0

79 Proximal operator of the l1/l2 norm
The proximal operator of the l1/l2 norm is block soft-thresholding:
prox_{α ‖·‖_{1,2}}(β) = BS_α(β)
where α > 0 and
BS_α(β)_{G_i} := β_{G_i} - α β_{G_i} / ‖β_{G_i}‖_2   if ‖β_{G_i}‖_2 ≥ α
BS_α(β)_{G_i} := 0                                   otherwise

80 Iterative Shrinkage-Thresholding for the l1/l2 norm
Proximal gradient method for the problem
minimize (1/2) ‖A x - y‖²_2 + λ ‖x‖_{1,2}
ISTA iteration:
x^(0) = arbitrary initialization
x^(k+1) = BS_{α_k λ}( x^(k) - α_k A^T (A x^(k) - y) )
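
A compact sketch of block soft-thresholding and the ISTA iteration above (assuming NumPy; representing the groups as a list of index arrays and using a constant step size 1/L are implementation choices, not from the slides):

```python
import numpy as np

def block_soft_threshold(beta, alpha, groups):
    """Block soft-thresholding BS_alpha, the prox of alpha * ||.||_{1,2}."""
    out = np.zeros_like(beta)
    for g in groups:                     # each g is an array of indices
        norm_g = np.linalg.norm(beta[g])
        if norm_g > alpha:
            out[g] = (1 - alpha / norm_g) * beta[g]
    return out

def ista_group(A, y, lam, groups, step=None, iters=500):
    """ISTA for (1/2)||A x - y||_2^2 + lam ||x||_{1,2}."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = ||A||^2 (Lipschitz constant of the gradient)
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = block_soft_threshold(x - step * A.T @ (A @ x - y), step * lam, groups)
    return x
```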

81 Speech denoising: STFT thresholding
[Figure: spectrogram, frequency versus time]

82 Speech denoising: STFT block thresholding
[Figure: spectrogram, frequency versus time]

83 Experiment
X_train and X_test are iid Gaussian (0, 1), Z_train is iid Gaussian (0, 0.5)
B ∈ R^{p×k} has s nonzero rows, p = 50, n = 100
Y_train = X_train B + Z_train
Y_test = X_test B
For an estimate B̂ learnt from the training data,
error_test = ‖X_test B̂ - Y_test‖_F / ‖Y_test‖_F

84 Test error, s =
[Figure: relative error (l2 norm) versus regularization parameter for the lasso (k = 4, k = 40) and the multi-task lasso (k = 4, k = 40)]

85 Test error, s = 30
[Figure: relative error (l2 norm) versus regularization parameter for the lasso (k = 4, k = 40) and the multi-task lasso (k = 4, k = 40)]

86 Magnitude of coefficients, s = 30, k = 40
[Figure: magnitude of the coefficients for the lasso estimate, the multi-task lasso estimate, and the original B]
