Statistical Learning with the Lasso (spring): The Lasso


1 Statistical Learning with the Lasso, spring. The Lasso. Two motivating examples:
- Yeast, understanding basic life functions: p = 11,904 gene values, n (number of experiments) ~ 10 (Blomberg et al. 2003, 2010).
- fMRI brain scans, function of the brain's language network: p ≈ 3 million voxels, n (number of subjects) ~ 50 (Taylor et al. 2006).

2 The module: the Lasso for
- linear models
- generalized linear models
- group lasso
- smooth functions

Module literature:
- Statistics for High-Dimensional Data: Methods, Theory and Applications, P. Bühlmann and S. van de Geer, Springer 2011.
- Computer Age Statistical Inference, B. Efron and T. Hastie, Cambridge University Press 2016.
- Statistical Learning with Sparsity: the Lasso and Generalizations, T. Hastie, R. Tibshirani, and M. Wainwright, CRC Press 2015.

3 Slides for B&vdG and E&H 12.1, 12.2 (also good to read E&H 16.1, 16.2). Exercise: B&vdG 2.1. Project: find an interesting data set and do a Lasso analysis on it. Exercises and the project should be done in groups of 2 or 3 persons and presented in week 10.

4 The linear model (responses, covariates, parameters, errors):

$$Y_1 = X_1^{(1)}\beta_1 + X_1^{(2)}\beta_2 + \dots + X_1^{(p)}\beta_p + \varepsilon_1$$
$$Y_2 = X_2^{(1)}\beta_1 + X_2^{(2)}\beta_2 + \dots + X_2^{(p)}\beta_p + \varepsilon_2$$
$$\vdots$$
$$Y_n = X_n^{(1)}\beta_1 + X_n^{(2)}\beta_2 + \dots + X_n^{(p)}\beta_p + \varepsilon_n$$

or in matrix form $Y = X\beta + \varepsilon$. The ols (ordinary least squares) estimate of the parameters is

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n \bigl(Y_i - (X_i^{(1)}\beta_1 + X_i^{(2)}\beta_2 + \dots + X_i^{(p)}\beta_p)\bigr)^2 = \arg\min_\beta \|Y - X\beta\|^2.$$

5 ols. The estimate

$$\hat\beta_{\mathrm{ols}} = \arg\min_\beta \|Y - X\beta\|^2$$

can be obtained as follows:

$$f(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^t (Y - X\beta),$$
$$\frac{df(\beta)}{d\beta} = -2X^t(Y - X\beta) = 0 \;\Rightarrow\; X^t Y = X^t X\beta \;\Rightarrow\; \hat\beta_{\mathrm{ols}} = (X^t X)^{-1} X^t Y$$

(if the matrix is invertible). If p is large then ols often overfits (i.e. adapts too much to the noise and gives unreliable predictions and unreliable indications of which covariates are important).
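As a concrete illustration, here is a minimal numpy sketch of the ols computation via the normal equations (the example data and dimensions are assumptions, not from the slides; np.linalg.lstsq is the numerically safer alternative to forming X^t X explicitly):

import numpy as np

def ols(X, Y):
    # Solve the normal equations X^t X beta = X^t Y;
    # requires X^t X to be invertible, so in particular p <= n.
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Hypothetical example data
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)
print(ols(X, Y))   # close to beta_true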

6 E&H 16.1: Forward stepwise regression. The standard statistical way to do regression on many covariates: first do all 1-dimensional regressions and choose the covariate which gives the smallest loss; then do all 2-dimensional regressions and choose the 2 covariates which give the smallest loss; then ... up to some predetermined dimension M. Choose the regression which is best. Computationally impossible in high dimensions. Greedy version: instead of doing all 2-dimensional regressions, in step 2 keep the first selected covariate and just add the best one of the remaining covariates, and so on. Computationally much easier, but the first version can lead to a better model. E&H 16.2: the Lasso. Read it.
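A minimal numpy sketch of the greedy version (the function and variable names are illustrative, not from E&H):

import numpy as np

def greedy_forward_stepwise(X, Y, M):
    # At each step, add the covariate that most reduces the residual sum of squares.
    n, p = X.shape
    selected = []
    for _ in range(min(M, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            Xs = X[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
            rss = np.sum((Y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected   # indices of the chosen covariates, in order of inclusion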

7 B&vdG 2.2. Simplest case: the linear model

$$Y_i = X_i^{(1)}\beta_1 + X_i^{(2)}\beta_2 + \dots + X_i^{(p)}\beta_p + \varepsilon_i, \quad i = 1,\dots,n,$$

i.e. $Y = X\beta + \varepsilon$, but now with p >> n (a "wide" data set). Standardize (cheating?) so that

$$\bar Y = \sum_{i=1}^n Y_i/n = 0, \quad \bar X^{(j)} = 0, \quad \hat\sigma_j^2 = \frac{1}{n}\sum_{i=1}^n \bigl(X_i^{(j)} - \bar X^{(j)}\bigr)^2 = 1, \quad \text{for } j = 1,\dots,p.$$
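For concreteness, a small numpy sketch of this standardization (a sketch under the convention above, i.e. centering and then scaling each column to empirical variance 1 with divisor n):

import numpy as np

def standardize(X, Y):
    # Center the response and each covariate column, then scale each column
    # so that its empirical variance (divisor n) equals 1.
    Yc = Y - Y.mean()
    Xc = X - X.mean(axis=0)
    sigma = np.sqrt((Xc ** 2).mean(axis=0))
    return Xc / sigma, Yc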

8 The Lasso (Least Absolute Shrinkage and Selection Operator). Lagrange formulation:

$$\hat\beta(\lambda) = \arg\min_\beta \bigl( \|Y - X\beta\|^2/n + \lambda \|\beta\|_1 \bigr).$$

This gives the same solution as the primal (constrained) formulation, for R determined by λ and the data, $R = \|\hat\beta(\lambda)\|_1$:

$$\hat\beta_{\mathrm{primal}}(R) = \arg\min_{\beta:\ \|\beta\|_1 \le R} \|Y - X\beta\|^2/n.$$

(Ridge regression: $\hat\beta(\lambda) = \arg\min_\beta \bigl( \|Y - X\beta\|^2/n + \lambda \|\beta\|_2^2 \bigr)$, which similarly equals, for R determined by λ, $\hat\beta_{\mathrm{primal}}(R) = \arg\min_{\beta:\ \|\beta\|_2 \le R} \|Y - X\beta\|^2/n$.)
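In practice the Lagrange form is what standard software solves. A sketch with scikit-learn (assuming that library is available; note that sklearn's Lasso minimizes ||Y - Xβ||²/(2n) + α||β||₁, so its α corresponds to λ/2 in the slide's parametrization):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_lagrange(X, Y, lam):
    # sklearn minimizes ||Y - X b||^2 / (2n) + alpha * ||b||_1,
    # which matches ||Y - X b||^2 / n + lam * ||b||_1 when alpha = lam / 2.
    model = Lasso(alpha=lam / 2, fit_intercept=False)  # data assumed centered/standardized
    model.fit(X, Y)
    return model.coef_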

9 Ordinary least squares (ols): $\hat\beta_{\mathrm{ols}} = \arg\min_\beta \|Y - X\beta\|^2$. If p ≥ n there are more parameters than observations, and ols overfits. Estimation & understanding are then only possible if the parameters β are constrained, e.g. sparse, and to obtain sparse models when p is (very) large one has to select which covariates to use in a good way. One way to do this is to penalize as on the previous slide.

10 Contour lines of the residual sum of squares. (Left) The $\ell_1$-ball constraint corresponding to the Lasso. (Right) The $\ell_2$-ball constraint corresponding to Ridge regression.

11 B&vdG 2.3: Soft thresholding. Simplest case:

$$Y_i = \beta_i + \varepsilon_i, \quad i = 1,\dots,n.$$

Then $\|Y - X\beta\|^2 = \sum_{i=1}^n (Y_i - \beta_i)^2$ and $\|\beta\|_1 = \sum_{i=1}^n |\beta_i|$, so

$$\|Y - X\beta\|^2 + \lambda\|\beta\|_1 = \sum_{i=1}^n \bigl\{ (Y_i - \beta_i)^2 + \lambda|\beta_i| \bigr\},$$

which is minimized coordinatewise by

$$\hat\beta_j = \begin{cases} 0 & \text{if } |Y_j| \le \lambda/2, \\ \mathrm{sign}(Y_j)\,(|Y_j| - \lambda/2) & \text{otherwise,} \end{cases} \qquad =: g_{\mathrm{soft},\,\lambda/2}(Y_j).$$
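A short numpy sketch of the resulting soft-thresholding rule (the threshold λ/2 follows the slide's parametrization):

import numpy as np

def g_soft(y, thresh):
    # Soft thresholding: 0 inside [-thresh, thresh], otherwise shrink towards 0 by thresh.
    return np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)

# For the simplest case Y_i = beta_i + eps_i, the Lasso estimate is beta_hat = g_soft(Y, lam / 2).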

12 Figure: the soft-thresholding function $g_{\mathrm{soft},1}(z)$.

13 Soft thresholding: orthonormal designs. The design is orthonormal if X is n × n and $X^t X = n I_n$. Example (designed experiment with 3 factors):

$$Y = \beta_1 + X^{(2)}\beta_2 + X^{(3)}\beta_3 + X^{(4)}\beta_4 + \varepsilon,$$

where $\beta_1$ is the mean, $X^{(2)} = 1$ if the first factor is high, $X^{(2)} = -1$ if the first factor is low, etc. X is then the resulting n × n matrix with a first column of ones and remaining columns of ±1 entries.
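As an assumed concrete instance of such a design (not necessarily the matrix on the slide): with n = 4 runs, an intercept column and three ±1 factor columns chosen so that the columns are orthogonal, one can check $X^t X = n I_n$ numerically:

import numpy as np

# Intercept plus three +/-1 factor columns of a 4-run (fractional factorial) design.
X = np.array([
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1, -1, -1,  1],
])
n = X.shape[0]
print(np.allclose(X.T @ X, n * np.eye(n)))  # True: the design is orthonormal in the slide's sense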

14 If $X^t X = n I_n$ then

$$X^t Y / n = X^t X \beta / n + X^t \varepsilon / n = \beta + \varepsilon',$$

so we are back in the simple situation, but with $Y_j$ replaced by $Z_j = (X^t Y)_j / n$, and we get

$$\hat\beta_j = g_{\mathrm{soft},\,\lambda/2}\bigl((X^t Y)_j / n\bigr) = g_{\mathrm{soft},\,\lambda/2}(Z_j).$$

(Note that $X^t Y / n = (X^t X)^{-1} X^t Y = \hat\beta_{\mathrm{ols}}$ for an orthonormal design.) Homework: problem 2.1.
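A minimal numpy sketch of this shortcut, i.e. computing the Lasso for an orthonormal design by soft thresholding (assumes an X with $X^t X = n I_n$, e.g. the design above):

import numpy as np

def lasso_orthonormal(X, Y, lam):
    # When X^t X = n I_n, Z = X^t Y / n is the ols estimate and the Lasso
    # solution is componentwise soft thresholding of Z at lam / 2.
    n = X.shape[0]
    Z = X.T @ Y / n
    return np.sign(Z) * np.maximum(np.abs(Z) - lam / 2, 0.0)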

15 B&vdG 2.4: Prediction. The aim is to find

$$f(x) = E(Y \mid X = x) = \sum_{j=1}^p x^{(j)} \beta_j,$$

and hence prediction is closely related to estimation of β. If we use the Lasso for estimation we get the estimated predictor

$$\hat f_\lambda(x) = x^t \hat\beta(\lambda) = \sum_{j=1}^p x^{(j)} \hat\beta_j(\lambda).$$

But how should one choose λ?

16 E&H 12.1, 12.2: (10-fold) cross-validation.
1: Split (somehow) the data into 10 approximately equally sized groups.
2: For each $Y_i$, use the Lasso to estimate β from the 9 groups which do not contain $Y_i$; call this estimate $\hat\beta^{(-i)}(\lambda)$, and use it to compute the predictor $\hat f_\lambda^{(-i)}(x_i) = \sum_{j=1}^p x_i^{(j)} \hat\beta_j^{(-i)}(\lambda)$.
3: Compute the cross-validated mean square prediction error $e(\lambda) = \sum_{i=1}^n \bigl(Y_i - \hat f_\lambda^{(-i)}(x_i)\bigr)^2$ (or use some other appropriate error measure).
4: Repeat this for a grid of λ-values and choose the λ which gives the smallest $e(\lambda)$.
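A sketch of this loop using scikit-learn (an assumption; the λ grid, the α = λ/2 rescaling for sklearn's Lasso, and the shuffling are choices made here, not prescribed by the slides):

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

def cv_choose_lambda(X, Y, lambdas, n_folds=10):
    # 10-fold cross-validation of the mean square prediction error over a lambda grid.
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    errors = []
    for lam in lambdas:
        e = 0.0
        for train, test in kf.split(X):
            model = Lasso(alpha=lam / 2, fit_intercept=False).fit(X[train], Y[train])
            e += np.sum((Y[test] - model.predict(X[test])) ** 2)
        errors.append(e)
    return lambdas[int(np.argmin(errors))]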

17 B&vdG example. Lymph node status of cancer patients: Y = 0 or 1; n = 49 breast cancer tumor samples; p = 7129 gene expression measurements per tumor sample. The data don't follow the linear model, but one can still use the (cross-validated) Lasso to compute $\hat f_\lambda(x) = \sum_{j=1}^p x^{(j)}\hat\beta_j(\lambda)$ and then use it for classification: classify the status of an observation with design vector x as 1 if $\hat f_\lambda(x) > 1/2$ and as 0 if $\hat f_\lambda(x) \le 1/2$. Compared with another method, also through cross-validation: randomly divide the sample into 2/3 training data and 1/3 test data, do the entire model fitting on the training data, count misclassifications on the test data, repeat 100 times, and compute average misclassification percentages: Lasso 21%, alternative 35%. The average number of non-zero $\hat\beta_j$'s for the Lasso was 13.
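A short sketch of the plug-in classification rule (numpy; beta_hat would come from a cross-validated Lasso fit as above, and the names are illustrative):

import numpy as np

def classify(X_new, beta_hat):
    # Classify as 1 if the estimated predictor exceeds 1/2, otherwise as 0.
    f_hat = X_new @ beta_hat
    return (f_hat > 0.5).astype(int)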

18 Choice of estimation method. How does one choose between different estimation methods: Lasso, ridge regression, ols, ...? Sometimes, e.g. in signal processing, one can try out the methods in situations where one can see how well they work (but still, it's impossible to try all methods which have been suggested). Other ways to compare:
- Understanding
- Simulation
- Asymptotics

19 B&vdG 2.4.2: Asymptotics. In basic statistics asymptotics the model is fixed and n, the number of observations, tends to infinity (and one then assumes that in a practical situation with large n the distributions are well described by the asymptotic distribution). This is not possible for p >> n: if the model (and thus p, the number of parameters) is kept fixed, then as n gets large one eventually gets p < n. One instead has to use triangular arrays, where $p = p_n$ tends to infinity at the same time as n tends to infinity.

20 Triangular arrays. In asymptotic statements B&vdG typically assume that $p = p_n$, $X = X_n = (X_{n;i}^{(j)})$, and $\beta = \beta_n = (\beta_{n;j})$ all change as $n \to \infty$, typically with $p_n$ growing faster than n, i.e. with $p_n/n \to \infty$ as $n \to \infty$. Thus one considers a sequence of models

$$Y_n = X_n \beta_n + \varepsilon_n,$$

or in more detail

$$Y_{n;i} = \sum_{j=1}^{p_n} X_{n;i}^{(j)} \beta_{n;j} + \varepsilon_i, \quad i = 1,\dots,n,$$

and derives asymptotic inequalities and limit distributions for such triangular arrays, as $n \to \infty$.

21 Prediction consistency:

$$\bigl(\hat\beta_n(\lambda_n) - \beta_n^0\bigr)^t \hat\Sigma_{X_n} \bigl(\hat\beta_n(\lambda_n) - \beta_n^0\bigr) = o_p(1),$$

with $\hat\Sigma_{X_n} = X_n^t X_n / n$ for a fixed design, and $\hat\Sigma_{X_n}$ equal to the estimated covariance matrix of the covariate vector $X_n$ for a random design with the covariate vectors in the different rows i.i.d. ($Z_n = o_p(1)$ means that $P(|Z_n| > \epsilon) \to 0$ as $n \to \infty$, for any $\epsilon > 0$, i.e. that $Z_n$ tends to zero in probability.) Conditions needed to obtain prediction consistency include $\|\beta_n^0\|_1 = o\bigl(\sqrt{n/\log p_n}\bigr)$, $K^{-1}\sqrt{\log p_n / n} < \lambda_n < K\sqrt{\log p_n / n}$ for some constant K, and that the $\varepsilon_{n;i}$, $i = 1,\dots,n$, are i.i.d. with finite variance.

22 The quantity

$$\bigl(\hat\beta_n(\lambda_n) - \beta_n^0\bigr)^t \hat\Sigma_{X_n} \bigl(\hat\beta_n(\lambda_n) - \beta_n^0\bigr)$$

equals $\|X_n(\hat\beta_n(\lambda_n) - \beta_n^0)\|^2 / n$ for a fixed design, and $E\bigl[\bigl(X_{n;\mathrm{new}}(\hat\beta_n(\lambda_n) - \beta_n^0)\bigr)^2\bigr]$ for a random design, where $X_n(\hat\beta_n(\lambda_n) - \beta_n^0)$ is the difference between the estimated and the true predictor (= regression function).

23 Oracle inequality:

$$E\|X_n(\hat\beta_n(\lambda_n) - \beta_n^0)\|^2 / n = O\!\left(\frac{s_{0;n}\,\log p_n}{n\,\phi_n^2}\right),$$

where $s_{0;n} = \mathrm{card}(S_{0;n}) = |S_{0;n}|$ is the number of non-zero elements in the set $S_{0;n} = \{j:\ \beta_{n;j}^0 \ne 0,\ j = 1,\dots,p_n\}$ of active variables, and $\phi_n^2$ is a compatibility constant determined by the design matrix $X_n$. This means roughly that using the estimated predictor instead of the true one adds a (random) error of a size slightly larger than $1/n$. Requires further conditions.

24 B&vdG 2.5: Estimation consistency:

$$\|\hat\beta_n(\lambda_n) - \beta_n^0\|_q = o_P(1), \quad \text{for } q = 1 \text{ or } 2,$$

under similar conditions as for prediction consistency.

25 Variable screening.

$$\hat S_{0;n}(\lambda) = \{j:\ \hat\beta_{n;j}(\lambda) \ne 0,\ j = 1,\dots,p_n\}.$$

Typically the Lasso solutions $\hat\beta_n(\lambda)$ are not unique (e.g. they are not unique if p > n), but $\hat S_{0;n}(\lambda)$ is still unique (Lemma 2.1). Further, $|\hat S_{0;n}(\lambda)| \le \min(n, p_n)$.

$$S_{0;n} = \{j:\ \beta_{n;j}^0 \ne 0,\ j = 1,\dots,p_n\}$$

is the true set of relevant variables. Asymptotically, under suitable conditions, $\hat S_{0;n}(\lambda)$ will contain (the relevant(C) part of) $S_{0;n}$, but often, depending on the value of λ, it will also contain many more variables. For the choice of λ, read B&vdG.

26 B&vdG: Motif regression for DNA binding sites.
$Y_i$ = binding intensity of HIF1α in coarse DNA segment i;
$X_i^{(j)}$ = abundance score of candidate motif j in DNA segment i;
$i = 1,\dots,n = 287$; $j = 1,\dots,p$.

$$Y_i = \mu + \sum_{j=1}^p X_i^{(j)} \beta_j + \varepsilon_i, \quad i = 1,\dots,287.$$

Scale the covariates X to the same empirical variance, subtract $\bar Y$, and do the Lasso with $\lambda = \hat\lambda_{CV}$ chosen by 10-fold cross-validation for optimal prediction. This gives $|\hat S(\hat\lambda_{CV})| = 26$ non-zero $\hat\beta(\hat\lambda_{CV})$ estimates. B&vdG believe these include the true active variables.

27

28 B&vdG 2.6: Variable selection.

$$\hat\beta(\lambda) = \arg\min_\beta \bigl( \|Y - X\beta\|^2/n + \lambda \|\beta\|_0 \bigr), \quad \text{where } \|\beta\|_0 = \sum_{j=1}^p \mathbf{1}(\beta_j \ne 0).$$

This is not convex: the naive solution would be to go through all possible sub-models (i.e. models with some of the β's set to 0), compute the ols for each of these models, and then choose the model which minimizes the right-hand side. However, there are $2^p$ sub-models, and if p is not small this isn't feasible. More efficient algorithms exist, but it remains a difficult/infeasible problem if p is not small.
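For very small p the naive enumeration can be written down directly; a numpy sketch (exponential in p, for illustration only):

import numpy as np
from itertools import combinations

def best_subset_l0(X, Y, lam):
    # Minimize ||Y - X beta||^2 / n + lam * ||beta||_0 by enumerating all 2^p sub-models.
    n, p = X.shape
    best_val, best_beta = np.sum(Y ** 2) / n, np.zeros(p)   # the empty model
    for k in range(1, p + 1):
        for S in combinations(range(p), k):
            cols = list(S)
            beta_S, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            val = np.sum((Y - X[:, cols] @ beta_S) ** 2) / n + lam * k
            if val < best_val:
                best_val = val
                best_beta = np.zeros(p)
                best_beta[cols] = beta_S
    return best_beta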

29 One possibility is to first use (some variant of) the Lasso to select a set of variables which with high probability includes the set of active variables, and then only do the minimization of the $\ell_0$-penalized mean square error among these variables. This typically reduces the number of operations needed for the minimization to $O(np\min(n,p))$. However, it requires that the Lasso in fact selects all active variables/parameters. Asymptotic conditions to ensure this include a beta-min condition, $\inf_{j \in S_0} |\beta_j| \gg \sqrt{s_0 \log p / n}$, and that a neighbourhood stability or irrepresentable condition holds.
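A sketch of the two-step idea (Lasso screening followed by $\ell_0$ minimization restricted to the screened variables); the brute-force second step here is only for illustration and does not achieve the operation count quoted above:

import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def lasso_then_l0(X, Y, lam_lasso, lam_l0):
    n, p = X.shape
    # Step 1: Lasso screening -- keep the covariates with non-zero coefficients.
    screened = np.flatnonzero(Lasso(alpha=lam_lasso / 2, fit_intercept=False).fit(X, Y).coef_)
    # Step 2: l0-penalized least squares among the screened covariates only.
    best_val, best_beta = np.sum(Y ** 2) / n, np.zeros(p)
    for k in range(1, len(screened) + 1):
        for S in combinations(screened, k):
            cols = list(S)
            beta_S, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            val = np.sum((Y - X[:, cols] @ beta_S) ** 2) / n + lam_l0 * k
            if val < best_val:
                best_val = val
                best_beta = np.zeros(p)
                best_beta[cols] = beta_S
    return best_beta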

30 Examples of sufficient conditions for neighbourhood stability (recall that $\hat\Sigma = X^t X/n$ for a fixed design and that Σ is the correlation matrix of X for a random design):
- Equal positive correlation: $\Sigma_{j,j} = 1$ for all j, and $\Sigma_{j,k} = \rho > 0$ for $j \ne k$, for some $\rho < 1$.
- Markov (or Toeplitz) structure: $\Sigma_{j,k} = \theta^{|j-k|}$, with $|\theta| < 1$.
- Bounded pairwise correlation: $s_0 \max_{j \notin S_0} \sum_{k \in S_0} \Sigma_{j,k}^2 \,/\, \Lambda_{\min}^2(\Sigma_{1,1}) < \tfrac{1}{2}$, where $\Lambda_{\min}(\Sigma_{1,1})$ is the smallest eigenvalue of the covariance matrix of the active variables.

31
