Feature Selection: Part 1


CSE 546: Machine Learning
Lecture 5: Feature Selection: Part 1
Instructor: Sham Kakade

1 Regression in the high dimensional setting

How do we learn when the number of features d is greater than the sample size n? In the previous lecture, we examined ridge regression and provided a dimension-free rate of convergence. Now let us examine feature selection.

2 Feature Selection

Let us suppose there are s relevant features out of the d possible features. Throughout this analysis, let us assume that:

$$Y = Xw + \eta,$$

where $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times d}$. We assume that the support of w (the number of non-zero entries) is s.

2.1 Loss Minimization (Empirical Risk Minimization)

Define our empirical loss as:

$$\hat{L}(w) = \frac{1}{n} \|Xw - Y\|^2,$$

which has no expectation over Y.

Suppose we knew the support size s. One algorithm is to simply find the estimator which minimizes the empirical loss and has support on only s coordinates. In particular, consider the estimator:

$$\hat{w}_{\text{subset selection}} = \arg\min_{|\mathrm{support}(w)| \le s} \hat{L}(w),$$

where the minimization is over vectors with support size at most s. Computing this estimator is not computationally tractable in general (the naive algorithm runs in time $d^s$). Furthermore, finding the best subset is known to be an NP-hard problem.

How much better is this estimator than the naive estimator? Recall that the risk is:

$$R(\hat{w}_{\text{subset selection}}) = \mathbb{E}_Y \|\hat{w} - w\|_\Sigma^2,$$

where the expectation is over Y. We have the following theorem:

Theorem 2.1. Suppose the support size of w is bounded by s. Then the risk is bounded as:

$$R(\hat{w}_{\text{subset selection}}) \le c \, \frac{s \log d}{n} \, \sigma^2$$

(where c is a universal constant).
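To make the setup concrete, here is a small illustrative sketch (not part of the original notes) that generates data from the sparse linear model above and computes the subset-selection estimator by brute-force enumeration of all size-s supports; the dimensions, noise level, and variable names are arbitrary choices of mine, and the loop over all $\binom{d}{s}$ supports is exactly the source of the intractability noted above.

```python
# Hypothetical sketch: sparse linear model Y = X w + eta and brute-force
# best-subset selection. Illustrative only; all sizes and names are assumptions.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 50, 20, 3                          # samples, features, true support size
w_true = np.zeros(d)
w_true[:s] = rng.normal(size=s)              # s relevant features, the rest are zero
X = rng.normal(size=(n, d))
Y = X @ w_true + 0.1 * rng.normal(size=n)    # additive noise eta

def empirical_loss(w):
    return np.mean((X @ w - Y) ** 2)         # L_hat(w) = (1/n) ||Xw - Y||^2

best_loss, w_subset = np.inf, None
for S in itertools.combinations(range(d), s):            # all C(d, s) candidate supports
    cols = list(S)
    wS, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)  # least squares restricted to S
    w = np.zeros(d)
    w[cols] = wS
    if empirical_loss(w) < best_loss:
        best_loss, w_subset = empirical_loss(w), w
```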

2.2 Coordinate dependence?

Clearly, the coordinate system is important here, as the support is defined with respect to this coordinate system. However, note that the scale in each coordinate is irrelevant here. In contrast, note that empirical risk minimization does not depend on the coordinate system.

3 Norms

The $\ell_p$ norm of a vector x is:

$$\|x\|_p = \Big( \sum_i |x_i|^p \Big)^{1/p}.$$

The $\ell_0$ norm is defined as:

$$\|x\|_0 = |\{ i : x_i \neq 0 \}|,$$

which is the number of non-zero entries in x. Technically, the $\ell_0$ norm is not a norm.

4 Lasso

Let us view the subset selection problem as a regularized problem. A relaxed version of a hard constraint on the size of the subset would be to minimize:

$$\hat{L}(w) + \lambda \|w\|_0.$$

One can show that for an appropriate choice of λ this algorithm also enjoys the same risk guarantee as the hard-constrained subset selection algorithm (up to constants). Unfortunately, minimizing this objective function is also not computationally tractable.

A natural convex relaxation is to instead consider minimizing the following:

$$F(w) = \hat{L}(w) + \lambda \|w\|_1,$$

which can be viewed as a convex relaxation of the $\ell_0$ problem. This is referred to as the Lasso.

4.1 Coordinate Scalings

Often it is a good idea to transform the data so that the variance along each coordinate is 1. In other words, for each coordinate j, it often makes sense to do the following transformation:

$$X_{i,j} \leftarrow X_{i,j} / Z_j, \quad \text{where } Z_j = \sqrt{\tfrac{1}{n} \sum_i X_{i,j}^2}.$$

Intuitively, this is to remove an arbitrary scale factor. A more precise reason for this will be discussed in the next lecture.
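As a concrete (hypothetical) illustration of the Lasso objective $F(w)$ and of the scaling in Section 4.1, here is a short sketch; the helper names are my own, and the square-root form of $Z_j$ is the natural reading of the standardization described above.

```python
# Illustrative helpers, not code from the lecture.
import numpy as np

def standardize_columns(X):
    """Rescale column j by Z_j = sqrt((1/n) * sum_i X_{i,j}^2), so that each
    coordinate has unit scale."""
    Z = np.sqrt(np.mean(X ** 2, axis=0))
    return X / Z, Z

def lasso_objective(w, X, Y, lam):
    """F(w) = (1/n) * ||Xw - Y||^2 + lam * ||w||_1."""
    return np.mean((X @ w - Y) ** 2) + lam * np.sum(np.abs(w))
```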

4.2 Optimization & Coordinate Descent

The 1-dimensional case: Suppose that we are in the 1-dimensional case, where each $x_i$ is a scalar and so w is a scalar. The lasso problem is then to minimize:

$$\sum_i (y_i - w x_i)^2 + \lambda |w|,$$

where $|w|$ is the absolute value function. To minimize this function, we can again set the gradient to 0 and solve. A subtlety here is that the absolute value function is non-differentiable at 0. Note that for any $w \neq 0$ the gradient is:

$$-2 \sum_i x_i (y_i - w x_i) + \lambda \, \mathrm{sgn}(w), \qquad (1)$$

where sgn(w) is 1 if w is positive and −1 if w is negative.

There are three cases to check. If the minimizer w is positive, then the first order condition implies that:

$$w = \frac{\sum_i y_i x_i - \lambda/2}{\sum_i x_i^2}.$$

If we compute the right hand side and it is positive, then indeed this value is the minimizer. Now suppose the minimizer w is negative; then the first order condition implies that:

$$w = \frac{\sum_i y_i x_i + \lambda/2}{\sum_i x_i^2}.$$

If we compute the right hand side and it is negative, then indeed this value is the minimizer. Now suppose w is 0. Note that |w| is not differentiable at 0. Here, one can show that we must have:

$$2 \sum_i y_i x_i \in [-\lambda, \lambda].$$

So if we compute the left hand side and it is in the interval $[-\lambda, \lambda]$, then w = 0 is a minimizer. To see this, consider any small perturbation so that $w = \epsilon$. Suppose $\epsilon > 0$. For sufficiently small $\epsilon$, the first term in (1) will still be in the interval $[-\lambda, \lambda]$, and so the gradient will be strictly positive (for small $\epsilon$). Thus gradient descent will push us back to 0. Similarly, for $\epsilon < 0$, we will move back to 0. Formally, the sub-gradient of |w| at 0 can take any value in [−1, 1], each of which gives a valid tangent plane (see the Wikipedia definition).

Coordinate Ascent: The coordinate ascent algorithm for minimizing an objective function $F(w_1, w_2, \ldots, w_d)$ is as follows (a sketch combining this loop with the 1-dimensional solution above appears at the end of this subsection):

1. Initialize: w = 0.
2. Choose a coordinate i (e.g. at random).
3. Update w as follows:
$$w_i \leftarrow \arg\min_{z \in \mathbb{R}} F(w_1, \ldots, w_{i-1}, z, w_{i+1}, \ldots, w_d),$$
where the optimization is over the i-th coordinate (holding the other coordinates fixed). Then return to step 2.

Clearly, many natural variants are possible.
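The three cases above amount to a soft-thresholding operation, and plugging it into the coordinate update gives a complete Lasso solver. The following is a minimal sketch, not code from the lecture: it minimizes the un-normalized form $\|Xw - Y\|^2 + \lambda \|w\|_1$ used in the 1-dimensional derivation (the $1/n$ factor just rescales $\lambda$), and it sweeps the coordinates in order rather than choosing them at random.

```python
# Illustrative coordinate descent for the Lasso via the 1-d soft-thresholding
# solution derived above. Function names and the fixed sweep count are assumptions.
import numpy as np

def soft_threshold(a, t):
    """1-d solution structure: shrink a toward 0 by t, or set it to 0."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0                       # the case 2 * sum_i y_i x_i in [-lambda, lambda]

def lasso_coordinate_descent(X, Y, lam, n_sweeps=100):
    d = X.shape[1]
    w = np.zeros(d)                  # step 1: initialize w = 0
    for _ in range(n_sweeps):
        for i in range(d):           # steps 2-3: update one coordinate at a time
            x_i = X[:, i]
            r = Y - X @ w + x_i * w[i]   # residual with coordinate i removed
            # 1-d problem: minimize sum_j (r_j - z * x_{j,i})^2 + lam * |z|
            w[i] = soft_threshold(x_i @ r, lam / 2.0) / (x_i @ x_i)
    return w
```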

5 Relationship of Lasso to Compressed Sensing

As we discussed earlier, regression can be viewed as finding an approximate solution to an (inconsistent) linear system of equations. In compressed sensing, we are dealing with the setting where the system of equations is consistent, i.e. $Aw = b$ has a solution. Suppose we are in the case where A is of size $n \times d$ and $d > n$, so there are multiple solutions. In particular, we seek the sparsest solution:

$$\min_w \|w\|_0 \quad \text{s.t.} \quad Aw = b.$$

As before, this problem is not computationally tractable. The convex relaxation is the following optimization problem:

$$\min_w \|w\|_1 \quad \text{s.t.} \quad Aw = b.$$

Similar to the case of the lasso, under certain assumptions this can recover the solution to the $\ell_0$ problem.

6 Greedy Algorithms

There are a variety of greedy algorithms and numerous naming conventions for these algorithms. These algorithms must rely on some stopping condition (or some condition to limit the sparsity level of the solution).

6.1 Stagewise Regression / Matching Pursuit / Boosting

Here, we typically do not regularize our objective function and, instead, directly deal with the empirical loss $\hat{L}(w_1, w_2, \ldots, w_d)$. This class of algorithms for minimizing an objective function $\hat{L}(w_1, w_2, \ldots, w_d)$ is as follows (an illustrative sketch for the square loss appears at the end of this subsection):

1. Initialize: w = 0.
2. Choose the coordinate which can result in the greatest decrease in error, i.e.
$$i \leftarrow \arg\min_i \min_{z \in \mathbb{R}} \hat{L}(w_1, \ldots, w_{i-1}, z, w_{i+1}, \ldots, w_d).$$
3. Update w as follows:
$$w_i \leftarrow \arg\min_{z \in \mathbb{R}} \hat{L}(w_1, \ldots, w_{i-1}, z, w_{i+1}, \ldots, w_d),$$
where the optimization is over the i-th coordinate (holding the other coordinates fixed).
4. While some termination condition is not met, return to step 2. This termination condition can be looking at the error on some holdout set or simply running the algorithm for some predetermined number of steps.

Variants: Clearly, many variants are possible. Sometimes (for loss functions other than the square loss) it is costly to do the minimization exactly, so we sometimes choose the coordinate based on another method (e.g. the magnitude of the gradient of a coordinate). We could also re-optimize the weights of all the features which are currently added. Also, sometimes we do backward steps, where we try to prune away some of the features which were added.

Relation to boosting: In boosting, we sometimes do not explicitly enumerate the set of all features. Instead, we have a weak learner which provides us with a new feature. The importance of this viewpoint is that sometimes it is difficult to enumerate the set of all features (e.g. our features could be decision trees, so our feature vector x could be of dimension the number of possible trees). Instead, we just assume some oracle which, in step 2, provides us with a feature. There are numerous variants.
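For the square loss, the inner minimization in steps 2 and 3 has a closed form, so the whole stagewise scheme fits in a few lines. The sketch below is illustrative only; the function name and the predetermined step budget used as the termination condition are my own choices.

```python
# Illustrative stagewise regression / matching pursuit for the square loss.
import numpy as np

def stagewise_regression(X, Y, n_steps=10):
    d = X.shape[1]
    w = np.zeros(d)                              # step 1: initialize w = 0
    for _ in range(n_steps):                     # step 4: fixed number of steps
        r = Y - X @ w                            # current residual
        best_i, best_z, best_loss = None, 0.0, np.inf
        for i in range(d):                       # step 2: greatest decrease in error
            x_i = X[:, i]
            z = w[i] + (x_i @ r) / (x_i @ x_i)   # 1-d optimum, others held fixed
            loss = np.mean((r - x_i * (z - w[i])) ** 2)
            if loss < best_loss:
                best_i, best_z, best_loss = i, z, loss
        w[best_i] = best_z                       # step 3: update that coordinate only
    return w
```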

6.2 Stepwise Regression / Orthogonal Matching Pursuit

Note that the previous algorithm chooses the coordinate i by only checking the improvement in performance while keeping all the other variables fixed. At any given iteration, we have some subset S of features whose weights are not 0. Instead, when determining which coordinate i to add, we could look at the improvement based on re-optimizing the weights on the full set $S \cup \{i\}$. This is a more costly procedure computationally, though there are some ways to reduce the computational cost.
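As a rough illustration of the difference, the sketch below scores each candidate coordinate by refitting all weights on the enlarged support $S \cup \{i\}$ by least squares, which is the re-optimization described above. The function name and the stopping rule (grow the support to a target size s) are assumptions, not part of the notes.

```python
# Illustrative stepwise regression / orthogonal matching pursuit.
import numpy as np

def orthogonal_matching_pursuit(X, Y, s):
    d = X.shape[1]
    S = []                                       # current support
    w = np.zeros(d)
    for _ in range(s):                           # grow the support to size s
        best_i, best_loss, best_wS = None, np.inf, None
        for i in range(d):
            if i in S:
                continue
            cols = S + [i]                       # candidate support S U {i}
            wS, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)  # refit all weights
            loss = np.mean((X[:, cols] @ wS - Y) ** 2)
            if loss < best_loss:
                best_i, best_loss, best_wS = i, loss, wS
        S.append(best_i)
        w = np.zeros(d)
        w[S] = best_wS                           # keep the jointly refit weights
    return w, S
```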