COMS 4771 Lecture: 1. Fixed-design linear regression 2. Ridge and principal components regression 3. Sparse regression and Lasso
Fixed-design linear regression
A simplified fixed-design setting: $x^{(1)}, x^{(2)}, \dots, x^{(n)} \in \mathbb{R}^p$ are assumed to be fixed, i.e., not random; $y^{(1)}, y^{(2)}, \dots, y^{(n)}$ are independent random variables, with $\mathbb{E}(y^{(i)}) = \langle x^{(i)}, w^* \rangle$ and $\operatorname{var}(y^{(i)}) = \sigma^2$.

In matrix form,
$y = X w^* + \varepsilon$,
where $y = (y^{(1)}, \dots, y^{(n)}) \in \mathbb{R}^n$, the rows of $X \in \mathbb{R}^{n \times p}$ are $x^{(1)}, \dots, x^{(n)}$, $w^* = (w^*_1, \dots, w^*_p) \in \mathbb{R}^p$, and $\varepsilon = (\varepsilon^{(1)}, \dots, \varepsilon^{(n)}) \in \mathbb{R}^n$ has independent entries with $\mathbb{E}(\varepsilon^{(i)}) = 0$ and $\operatorname{var}(\varepsilon^{(i)}) = \sigma^2$.

Want to find $\hat{w} \in \mathbb{R}^p$ based on $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})$ so that
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w} \|_2^2 \right]$
is small (e.g., $o(1)$ as a function of $n$).
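To make the setting concrete, here is a minimal simulation sketch of the fixed-design model; the sizes $n$, $p$, the noise level, and the target $w^*$ below are arbitrary illustration choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5           # arbitrary illustration values
X = rng.normal(size=(n, p))         # design matrix, treated as fixed from here on
w_star = rng.normal(size=p)         # unknown target coefficients w^*

def sample_y(X, w_star, sigma, rng):
    """Draw one response vector y = X w^* + eps with independent mean-zero noise."""
    eps = rng.normal(scale=sigma, size=X.shape[0])
    return X @ w_star + eps

y = sample_y(X, w_star, sigma, rng)
```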
Ordinary least squares

Recall: assuming $X^\top X$ is invertible,
$\hat{w}_{\mathrm{ols}} := (X^\top X)^{-1} X^\top y$.

Expressing $X \hat{w}_{\mathrm{ols}}$ in terms of $X w^*$:
$X \hat{w}_{\mathrm{ols}} = X (X^\top X)^{-1} X^\top (X w^* + \varepsilon) = X w^* + X (X^\top X)^{-1} X^\top \varepsilon = X w^* + \Pi \varepsilon$,
where $\Pi \in \mathbb{R}^{n \times n}$ is the orthogonal projection operator for $\operatorname{ran}(X)$.

Therefore
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_{\mathrm{ols}} \|_2^2 \right] = \frac{\sigma^2 p}{n}$.
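As a sanity check, the $\sigma^2 p / n$ risk can be verified numerically by averaging over repeated noise draws for a fixed design; this is an illustrative sketch with arbitrary sizes, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 0.5
X = rng.normal(size=(n, p))                   # fixed design
w_star = rng.normal(size=p)

def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X assumed invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Monte Carlo estimate of E[(1/n) ||X w^* - X w_ols||^2] over fresh noise draws.
risks = []
for _ in range(2000):
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    risks.append(np.mean((X @ (w_star - ols(X, y))) ** 2))

print(np.mean(risks), sigma**2 * p / n)       # the two values should be close
```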
Ridge regression and principal components regression
Ridge regression

Ordinary least squares only applies if $X^\top X$ is invertible, which is not possible if $p > n$.

Ridge regression (Hoerl, 1962): for $\lambda > 0$,
$\hat{w}_\lambda := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_2^2$.

The gradient of the objective is zero when
$(X^\top X + n \lambda I) w = X^\top y$,
which always has a unique solution (since $\lambda > 0$):
$\hat{w}_\lambda := (X^\top X + n \lambda I)^{-1} X^\top y = \left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} \left( \tfrac{1}{n} X^\top y \right)$.
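A direct implementation of this closed form is short; the sketch below (with arbitrary sizes) also illustrates that the estimator is well defined even when $p > n$.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + n*lam*I)^{-1} X^T y; well-defined for any lam > 0."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 30, 100                       # p > n: X^T X is singular, so OLS is unavailable
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
w_lam = ridge(X, y, lam=0.1)         # still a unique solution
```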
Ridge regression: geometry

The ridge regression objective (as a function of $w$) can be written as
$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_2^2 + (\text{stuff not depending on } w)$.

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and level sets of $\lambda \| w \|_2^2$ centered at $0$.]
Ridge regression: eigendecomposition

Write the eigendecomposition of $\tfrac{1}{n} X^\top X$ as
$\tfrac{1}{n} X^\top X = \sum_{j=1}^{p} \lambda_j v_j v_j^\top$,
where $v_1, v_2, \dots, v_p \in \mathbb{R}^p$ are orthonormal eigenvectors with corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.

The eigenvectors $v_1, v_2, \dots, v_p$ comprise an orthonormal basis for $\mathbb{R}^p$. We'll look at $\hat{w}_{\mathrm{ols}}$ and $w^*$ in this basis: e.g.,
$\hat{w}_{\mathrm{ols}} = \sum_{j=1}^{p} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

The inverse of $\tfrac{1}{n} X^\top X + \lambda I$ has the form
$\left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} = \sum_{j=1}^{p} \frac{1}{\lambda_j + \lambda} v_j v_j^\top$.
Ridge regression vs. ordinary least squares

If $\hat{w}_{\mathrm{ols}}$ exists, then
$\hat{w}_\lambda = \left( \tfrac{1}{n} X^\top X + \lambda I \right)^{-1} \left( \tfrac{1}{n} X^\top X \right) \hat{w}_{\mathrm{ols}}$
$= \left( \sum_{j=1}^{p} \tfrac{1}{\lambda_j + \lambda} v_j v_j^\top \right) \left( \sum_{j=1}^{p} \lambda_j v_j v_j^\top \right) \hat{w}_{\mathrm{ols}}$
$= \left( \sum_{j=1}^{p} \tfrac{\lambda_j}{\lambda_j + \lambda} v_j v_j^\top \right) \hat{w}_{\mathrm{ols}}$ (by orthogonality)
$= \sum_{j=1}^{p} \tfrac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

Interpretation: shrink $\hat{w}_{\mathrm{ols}}$ towards zero by a factor of $\frac{\lambda_j}{\lambda_j + \lambda}$ in direction $v_j$.
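The shrinkage formula can be checked numerically against the closed form; the following sketch (arbitrary synthetic data) computes $\hat{w}_\lambda$ both ways and confirms they match.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.1
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)
evals, V = np.linalg.eigh(X.T @ X / n)           # columns of V are eigenvectors v_j

# Shrink w_ols by lambda_j / (lambda_j + lam) along each eigenvector direction.
shrink = evals / (evals + lam)
w_ridge_spectral = V @ (shrink * (V.T @ w_ols))

w_ridge_direct = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
print(np.allclose(w_ridge_spectral, w_ridge_direct))   # True
```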
Coefficient profile (ridge)

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient values in $\hat{w}_\lambda$ for eight different variables.]
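A profile like this can be traced by solving the ridge problem on a grid of $\lambda$ values; the sketch below uses an arbitrary synthetic design with eight variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = np.logspace(2, -4, 50)       # from large lambda to small lambda
path = np.array([np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
                 for lam in lambdas])  # one row of ridge coefficients per lambda
# Plotting each column of `path` against -log(lambda) reproduces a profile like the one above.
```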
Ridge regression: theory

Theorem:
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] = \underbrace{\lambda \sum_{j=1}^{p} \frac{\lambda_j \lambda}{(\lambda_j + \lambda)^2} \langle v_j, w^* \rangle^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \left( \frac{\lambda_j}{\lambda_j + \lambda} \right)^2}_{\text{variance}}$.

Corollary: since $\lambda_j \lambda / (\lambda_j + \lambda)^2 \leq 1/2$ and $\sum_{j=1}^{p} \lambda_j = \operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)$,
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] \leq \frac{\lambda \| w^* \|_2^2}{2} + \frac{\sigma^2 \operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{2 n \lambda}$.

For instance, using $\lambda := \frac{\sigma}{\| w^* \|_2} \sqrt{\frac{\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{n}}$ guarantees
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right] \leq \sigma \| w^* \|_2 \sqrt{\frac{\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)}{n}}$.

No explicit dependence on $p$. The corollary can be meaningful even if $p = \infty$, as long as $\| w^* \|_2^2$ and $\operatorname{tr}\left( \tfrac{1}{n} X^\top X \right)$ are finite.
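The bias and variance terms of the theorem can be evaluated exactly from the eigendecomposition and compared against a Monte Carlo estimate of the risk; this is an illustrative check with arbitrary sizes, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam = 200, 5, 0.5, 0.1
X = rng.normal(size=(n, p))
w_star = rng.normal(size=p)

evals, V = np.linalg.eigh(X.T @ X / n)
coords = V.T @ w_star                                   # <v_j, w^*> coordinates

bias = lam * np.sum(evals * lam / (evals + lam)**2 * coords**2)
var = sigma**2 / n * np.sum((evals / (evals + lam))**2)
exact_risk = bias + var

# Monte Carlo estimate of E[(1/n) ||X w^* - X w_lambda||^2] for comparison.
risks = []
for _ in range(2000):
    y = X @ w_star + rng.normal(scale=sigma, size=n)
    w_lam = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)
    risks.append(np.mean((X @ (w_star - w_lam)) ** 2))
print(exact_risk, np.mean(risks))                       # should roughly agree
```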
Principal components ("keep or kill") regression

Ridge regression as shrinkage:
$\hat{w}_\lambda = \sum_{j=1}^{p} \frac{\lambda_j}{\lambda_j + \lambda} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

Another approach: principal components regression. Instead of shrinking $\hat{w}_{\mathrm{ols}}$ in all directions, either keep $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j \geq \lambda$) or kill $\hat{w}_{\mathrm{ols}}$ in direction $v_j$ (if $\lambda_j < \lambda$):
$\hat{w}_\lambda^{\mathrm{pc}} := \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\} \langle v_j, \hat{w}_{\mathrm{ols}} \rangle v_j$.

[Figure: shrinkage factor as a function of $\lambda_j$, for ridge and for p.c. regression.]
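A minimal keep-or-kill implementation, following the formula above and assuming $\hat{w}_{\mathrm{ols}}$ exists:

```python
import numpy as np

def pcr(X, y, lam):
    """Principal components ('keep or kill') regression at cut-off lam.

    Keeps the OLS component along eigenvectors of (1/n) X^T X whose eigenvalue is
    at least lam, and zeroes out the rest. Assumes X^T X is invertible.
    """
    n, p = X.shape
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    evals, V = np.linalg.eigh(X.T @ X / n)
    keep = (evals >= lam).astype(float)        # indicator 1{lambda_j >= lam}
    return V @ (keep * (V.T @ w_ols))
```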
Principal components regression: theory

Theorem:
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda^{\mathrm{pc}} \|_2^2 \right] = \underbrace{\sum_{j=1}^{p} \mathbf{1}\{\lambda_j < \lambda\} \, \lambda_j \langle v_j, w^* \rangle^2}_{\text{bias}} + \underbrace{\frac{\sigma^2}{n} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}}_{\text{variance}} \;\leq\; 4 \, \mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda \|_2^2 \right]$.

Should pick $\lambda$ large enough so that $\tfrac{1}{n} \sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\} \ll 1$; here $\sum_{j=1}^{p} \mathbf{1}\{\lambda_j \geq \lambda\}$ is the effective dimension at cut-off point $\lambda$.

Never (much) worse than ridge regression; often substantially better.
Sparse regression and Lasso
Sparsity

Another way to deal with $p > n$ is to only consider sparse $w$, i.e., $w$ with only a small number ($\ll p$) of non-zero entries.

Other advantages of sparsity (especially relative to ridge/p.c. regression):
- Sparse solutions are more interpretable.
- It can be more efficient to evaluate $\langle x, w \rangle$ (both in terms of computing variable values and computing the inner product).
Sparse regression methods

For any $T \subseteq \{1, 2, \dots, p\}$, let $\hat{w}(T) :=$ OLS only using the variables in $T$.

Subset selection (brute-force strategy; a sketch follows below). Pick the $T \subseteq \{1, 2, \dots, p\}$ of size $|T| = k$ for which $\| y - X \hat{w}(T) \|_2^2$ is minimal, and return $\hat{w}(T)$.
- Gives you exactly what you want (for a given value of $k$).
- Only feasible for very small $k$, since the complexity scales with $\binom{p}{k}$. (NP-hard optimization problem.)
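A brute-force sketch of subset selection for a given $k$, enumerating all size-$k$ subsets; illustrative only, since the cost grows as $\binom{p}{k}$ (helper names are arbitrary):

```python
import numpy as np
from itertools import combinations

def ols_on_subset(X, y, T):
    """OLS using only the columns indexed by T, returned as a full p-vector."""
    T = list(T)
    w = np.zeros(X.shape[1])
    w[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    return w

def best_subset(X, y, k):
    """Brute-force subset selection over all (p choose k) subsets of size k."""
    best_w, best_rss = None, np.inf
    for T in combinations(range(X.shape[1]), k):
        w = ols_on_subset(X, y, T)
        rss = np.sum((y - X @ w) ** 2)
        if rss < best_rss:
            best_w, best_rss = w, rss
    return best_w
```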
Sparse regression methods

Forward stepwise regression (greedy strategy; a sketch follows below). Starting with $T = \emptyset$, repeat until $|T| = k$: pick the $j \in \{1, 2, \dots, p\} \setminus T$ for which $\| y - X \hat{w}(T \cup \{j\}) \|_2^2$ is minimal, and add this $j$ to $T$. Return $\hat{w}(T)$.
- Gives you a $k$-sparse solution, with no shrinkage within $T$.
- Primarily only effective when the columns of $X$ are close to orthogonal.
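A greedy sketch of forward stepwise regression, following the description above (helper names are arbitrary):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily add the variable that most reduces the residual sum of squares."""
    T, remaining = [], set(range(X.shape[1]))
    while len(T) < k:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = T + [j]
            w = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = np.sum((y - X[:, cols] @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        T.append(best_j)
        remaining.remove(best_j)
    w_full = np.zeros(X.shape[1])
    w_full[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]
    return w_full
```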
Lasso (Tibshirani, 1994)

Lasso: least absolute shrinkage and selection operator.
$\hat{w}_\lambda^{\mathrm{lasso}} := \arg\min_{w \in \mathbb{R}^p} \tfrac{1}{n} \| y - X w \|_2^2 + \lambda \| w \|_1$.
(Convex, though not differentiable.)

The objective (as a function of $w$) can be written as
$\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2 + \lambda \| w \|_1 + (\text{stuff not depending on } w)$.

[Figure: in the $(w_1, w_2)$ plane, level sets of $\tfrac{1}{n} \| X (w - \hat{w}_{\mathrm{ols}}) \|_2^2$ centered at $\hat{w}_{\mathrm{ols}}$, and level sets of $\lambda \| w \|_1$ centered at $0$.]
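The lecture does not give an algorithm for the Lasso; one standard option is proximal gradient descent (ISTA) with entrywise soft-thresholding, sketched below for the objective $\tfrac{1}{n}\|y - Xw\|_2^2 + \lambda\|w\|_1$. If scikit-learn is available, `Lasso(alpha=lam/2, fit_intercept=False)` should find the same minimizer, since its objective puts a $\tfrac{1}{2n}$ factor on the squared loss.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Proximal gradient (ISTA) for (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n    # Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```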
Coefficient profile (lasso)

[Figure. Horizontal axis: varying $\lambda$ (large $\lambda$ to the left, small $\lambda$ to the right). Vertical axis: coefficient values in $\hat{w}_\lambda^{\mathrm{lasso}}$ for eight different variables.]
Lasso: theory

Many results, mostly roughly of the following flavor. Suppose
- $w^*$ has $k$ non-zero entries;
- $X$ satisfies some special properties (typically not efficiently checkable);
- $\lambda \approx \sigma \sqrt{\frac{2 \log(p)}{n}}$;
then
$\mathbb{E}\left[ \tfrac{1}{n} \| X w^* - X \hat{w}_\lambda^{\mathrm{lasso}} \|_2^2 \right] \leq O\left( \frac{\sigma^2 k \log(p)}{n} \right)$.

Very active subject of research; closely related to "compressed sensing"; intersects with the beautiful subject of high-dimensional convex geometry.
Recap
- Fixed-design setting for studying linear regression methods.
- Ridge and principal components regression make use of the eigenvectors of $\tfrac{1}{n} X^\top X$ ("principal component directions"), and also the corresponding eigenvalues.
- Ridge regression: shrink $\hat{w}_{\mathrm{ols}}$ along the principal component directions by an amount related to the eigenvalue and $\lambda$.
- Principal components regression: keep-or-kill $\hat{w}_{\mathrm{ols}}$ along the principal component directions, based on comparing the eigenvalue to $\lambda$.
- Sparse regression: intractable, but some greedy strategies work.
- Lasso: shrink coefficients towards zero in a way that tends to lead to sparse solutions.