Feature Selection: Part 2


CSE 546: Machine Learning
Lecture 6: Feature Selection, Part 2
Instructor: Sham Kakade

1 Greedy Algorithms (continued from the last lecture)

There are a variety of greedy algorithms and numerous naming conventions for these algorithms. These algorithms must rely on some stopping condition (or some condition to limit the sparsity level of the solution).

1.1 Stagewise Regression / Matching Pursuit / Boosting

Here, we typically do not regularize our objective function and, instead, deal directly with the empirical loss L̂(w_1, w_2, ..., w_d). This class of algorithms for minimizing an objective function F (here, F = L̂(w_1, w_2, ..., w_d)) is as follows (a concrete sketch for the square loss is given at the end of this subsection):

1. Initialize: w = 0.
2. Choose the coordinate i which can result in the greatest decrease in error, i.e.
   i ∈ arg min_i min_{z ∈ R} F(w_1, ..., w_{i-1}, z, w_{i+1}, ..., w_d).
3. Update w as follows:
   w_i ← arg min_{z ∈ R} F(w_1, ..., w_{i-1}, z, w_{i+1}, ..., w_d),
   where the optimization is over the i-th coordinate (holding the other coordinates fixed).
4. While some termination condition is not met, return to step 2. This termination condition can be based on the error on some holdout set, or simply on running the algorithm for some predetermined number of steps.

Variants: Clearly, many variants are possible. Sometimes (for loss functions other than the square loss) it is costly to do the minimization exactly, so we sometimes choose i based on another criterion (e.g. the magnitude of the gradient with respect to a coordinate). We could also re-optimize the weights of all those features which are currently added. Also, sometimes we do backward steps, where we try to prune away some of the features which have been added.

Relation to boosting: In boosting, we sometimes do not explicitly enumerate the set of all features. Instead, we have a weak learner which provides us with a new feature. The importance of this viewpoint is that sometimes it is difficult to enumerate the set of all features (e.g. our features could be decision trees, so our feature vector x could have dimension equal to the number of possible trees). Instead, we just assume some oracle in step 2 which provides us with a feature. There are numerous variants.
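The following is a minimal sketch of the stagewise procedure above, specialized to the square loss so that the per-coordinate minimization in steps 2 and 3 has a closed form. The function name, the fixed step count used as the termination condition, and the NumPy implementation details are illustrative choices, not anything prescribed by the notes.

```python
import numpy as np

def stagewise_regression(X, y, n_steps=50):
    """Greedy coordinate-wise minimization of the square loss (a sketch).

    Each iteration re-optimizes a single coordinate of w, holding the rest
    fixed, choosing the coordinate whose 1-d update most decreases the loss.
    """
    n, d = X.shape
    w = np.zeros(d)                      # step 1: initialize w = 0
    residual = y - X @ w
    col_norms = (X ** 2).sum(axis=0)     # ||X_{:,j}||^2, assumed nonzero
    for _ in range(n_steps):             # step 4: fixed number of steps
        # For the loss ||y - Xw||^2, the optimal 1-d update for coordinate j is
        # z = w_j + <X_{:,j}, residual> / ||X_{:,j}||^2, and the resulting
        # decrease in loss is <X_{:,j}, residual>^2 / ||X_{:,j}||^2.
        corr = X.T @ residual
        j = int(np.argmax(corr ** 2 / col_norms))   # step 2: best coordinate
        delta = corr[j] / col_norms[j]
        w[j] += delta                    # step 3: update only coordinate j
        residual -= delta * X[:, j]      # keep the residual consistent with w
    return w
```

As the notes suggest, the fixed step count could instead be replaced by monitoring the error on a holdout set.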

1.2 Stepwise Regression / Orthogonal Matching Pursuit

Note that the previous algorithm chooses i by only checking the improvement in performance while keeping all the other variables fixed. At any given iteration, we have some subset S of features whose weights are not 0. Instead, when determining which coordinate i to add, we could look at the improvement based on re-optimizing the weights on the full set S ∪ {i}. This is a more computationally costly procedure, though there are some ways to reduce the computational cost.
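Here is a minimal sketch of this stepwise variant for the square loss. It selects candidates by the magnitude of their correlation with the current residual (a common shortcut; the notes describe scoring each candidate by fully re-optimizing over S ∪ {i}, which is costlier) and then re-fits least squares jointly on the selected set. The function name and the use of np.linalg.lstsq are illustrative choices.

```python
import numpy as np

def stepwise_regression(X, y, sparsity):
    """Stepwise regression / orthogonal matching pursuit (a sketch).

    At each iteration, add one coordinate to the support S, then jointly
    re-optimize the weights of all selected features by least squares.
    """
    n, d = X.shape
    S = []                                   # current support set
    w = np.zeros(d)
    residual = y.astype(float)
    for _ in range(sparsity):
        corr = np.abs(X.T @ residual)
        corr[S] = -np.inf                    # do not re-select chosen features
        S.append(int(np.argmax(corr)))
        # Re-optimize the weights on the full set S (joint least squares fit).
        w_S, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        w = np.zeros(d)
        w[S] = w_S
        residual = y - X[:, S] @ w_S
    return w, S
```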

2 Feature Selection in the Orthogonal Case

Let us suppose there are s relevant features out of the d possible features. Throughout this analysis, let us assume that:

Y = X w* + η,

where Y ∈ R^n, X ∈ R^{n×d}, and η ∈ R^n is a Gaussian noise vector (with each coordinate sampled from N(0, σ²)). We assume that the support of w* (the number of non-zero entries) is s.

Let us suppose that our design matrix is orthogonal. In particular, suppose that:

Σ = (1/n) XᵀX = diagonal.

Now let us consider the least squares estimate (ignoring feature selection issues). Under the diagonal assumption, without loss of generality, we can assume that Σ = I (by rescaling each coordinate). Here we have that the j-th coordinate of the (global) least squares estimate, [ŵ_{least squares}]_j, is just the correlation between the j-th dimension and Y:

[ŵ_{least squares}]_j = (1/n) [XᵀY]_j = (1/n) ∑_i X_{i,j} Y_i.

2.1 A high probability regret bound

Suppose we knew the support size s. Let us consider the estimator which minimizes the empirical loss and has support on only s coordinates. In particular, consider the estimator:

ŵ_{subset selection} = arg min_{support(w) ≤ s} L̂(w),

where the inf is over vectors with support size s. In the orthogonal case, computing this estimator is actually rather easy. Provided we have scaled the coordinates so that Σ is the identity, our estimate simply keeps the s largest coordinates (in magnitude) of ŵ_{least squares}. In other words, a simple forward greedy algorithm suffices to compute ŵ_{subset selection}.

Now let us explicitly provide the following high probability bound on the parameter error:

Theorem 2.1. (high probability bound) We have that with probability greater than 1 − δ:

‖ŵ_{subset selection} − w*‖² ≤ 10 s σ² log(2d/δ) / n.

Proof. For any particular coordinate j, the Gaussian tail bound implies that, with probability greater than 1 − δ,

[ŵ_{least squares}]_j ≤ [w*]_j + √(2σ² log(1/δ) / n)     (1)

and also that:

[ŵ_{least squares}]_j ≥ [w*]_j − √(2σ² log(1/δ) / n)

(using the Gaussian tail bound which is proved in the optional reading). We would like these inequalities to hold simultaneously for all coordinates j.

The union bound states that for events E_1 to E_k:

Pr(E_1 or E_2 ... or E_k) ≤ ∑_j Pr(E_j).

Now consider the following 2d events: one of the above two inequalities fails for some coordinate j. Note that if we use δ/(2d) in the above, then the cumulative failure probability is less than:

Pr(any failure) ≤ ∑_j Pr(one-sided failure for j) ≤ 2d (δ/2d) = δ.

Hence, we have that, with probability greater than 1 − δ,

max_j |[ŵ_{least squares}]_j − [w*]_j| ≤ √(2σ² log(2d/δ) / n) := ε.

Here, δ is a bound on the cumulative failure probability, and we have defined ε in the last equality.

Let S be the optimal support set (i.e. the support set of w*). For any w we have:

‖w − w*‖² = ∑_{j ∉ S} [w]_j² + ∑_{j ∈ S} ([w]_j − [w*]_j)².

Now, for those features j ∉ S, we have |[ŵ_{least squares}]_j| ≤ ε. Hence, for those features j ∈ S, we have that:

|[ŵ_{subset selection}]_j − [w*]_j| ≤ 2ε.

To see this, note that if |[w*]_j| > 2ε, then we will include this feature (and our estimation error on it is at most ε). If |[w*]_j| ≤ 2ε, then we might mistakenly exclude this feature (in which case the above is also true). Hence,

∑_{j ∈ S} ([ŵ_{subset selection}]_j − [w*]_j)² ≤ 4 s ε².

Also, as we only include at most s features, we have:

∑_{j ∉ S} ([ŵ_{subset selection}]_j)² ≤ s ε².

Adding these together (and using the value of ε) completes the proof.

2.2 The Lasso in the orthogonal case

Let us consider the case where Σ = I. Note that if Σ is simply diagonal, then we must rescale the coordinates for this algorithm to work (had we used the greedy algorithm, we would not need to explicitly do this rescaling). We can now argue that choosing λ on the order of σ √(log d / n) suffices to give the Lasso a high probability of success.

Theorem 2.2. (the Lasso in the orthogonal case) Suppose Σ = I. Set λ = 10 σ √(log(d/δ) / n). Let

ŵ_{lasso} ∈ arg min_w (1/n) ‖Xw − Y‖² + λ ‖w‖₁.

Then we have that with probability greater than 1 − δ,

‖ŵ_{lasso} − w*‖² ≤ c σ² s log(d/δ) / n,

where c is a universal constant.
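Before turning to the proof, here is a minimal sketch of what the two estimators of this section actually compute in the orthogonal case: subset selection keeps the s largest coordinates of the least squares estimate (Theorem 2.1), while the Lasso soft-thresholds each coordinate, which is the closed form the proof below derives. The threshold λ/2 corresponds to the scaling (1/n)‖Xw − Y‖² + λ‖w‖₁ used here, and the function name is an illustrative choice.

```python
import numpy as np

def orthogonal_estimators(X, y, s, lam):
    """Subset selection and the Lasso when (1/n) X^T X = I (a sketch).

    The least squares estimate is computed coordinate-wise; subset selection
    keeps its s largest entries in magnitude, and the Lasso soft-thresholds
    each entry at lam / 2 for the objective (1/n)||Xw - y||^2 + lam*||w||_1.
    """
    n, d = X.shape
    w_ls = X.T @ y / n                          # [w_ls]_j = (1/n) sum_i X_ij y_i

    # Subset selection: keep the s largest coordinates of w_ls in magnitude.
    w_subset = np.zeros(d)
    keep = np.argsort(-np.abs(w_ls))[:s]
    w_subset[keep] = w_ls[keep]

    # Lasso: coordinate-wise soft thresholding of w_ls.
    w_lasso = np.sign(w_ls) * np.maximum(np.abs(w_ls) - lam / 2.0, 0.0)
    return w_subset, w_lasso
```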

Proof. Note that:

(1/n) ‖Xw − Y‖² + λ ‖w‖₁ = ‖w − w*‖² − (2/n) ηᵀX(w − w*) + λ ‖w‖₁ + (1/n) ‖η‖²,

using that Σ = I and the definition of Y. Hence, dropping the term that does not depend on w, the Lasso is minimizing:

‖w − w*‖² − 2 η̃ᵀ(w − w*) + λ ‖w‖₁,

where we have defined η̃ = (1/n) Xᵀη. By the assumption on X, we have that η̃ is distributed as N(0, (σ²/n) I_d). Hence, the Lasso simplifies to solving d separate 1-dimensional problems. Also, with probability greater than 1 − δ, each coordinate of η̃ is bounded by σ √(2 log(d/δ) / n). Setting λ to (a constant multiple of) this value ensures that all irrelevant features will be thresholded to 0, and all relevant features will have their weights shrunk by at most λ, which results in regret that is the same (up to constants) as that of the subset selection algorithm.

3 When do the Lasso and the greedy algorithm also have low regret?

There has been much work showing that, under suitable assumptions, the Lasso and the greedy algorithms can obtain regret bounds comparable to the subset selection algorithm. These assumptions can be viewed as relaxations of the orthogonality condition. One weakening is based on incoherence, which is essentially an assumption that the feature matrix X has properties similar to those of a random matrix. Namely, this assumption is that the cross-correlation

(1/n) ∑_i X_{i,j} X_{i,k}

is small for all coordinates j ≠ k, where we also assume that:

(1/n) ∑_i X²_{i,j} = 1.

The Restricted Isometry Property (RIP) is a weaker assumption than this. Under either of these assumptions, both the Lasso and the greedy algorithms can have risk bounds comparable to that of the subset selection algorithm.
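As a small companion to the incoherence condition, here is a sketch that computes the largest cross-correlation term above for a given design matrix, after rescaling the columns so that (1/n) ∑_i X²_{i,j} = 1. The function name is an illustrative choice, and the notes do not specify how small this quantity must be for the guarantees to hold.

```python
import numpy as np

def max_cross_correlation(X):
    """Largest |(1/n) sum_i X_ij X_ik| over pairs j != k (a sketch).

    Columns are first rescaled so that (1/n) sum_i X_ij^2 = 1, matching the
    normalization assumed in the incoherence condition above.
    """
    n, d = X.shape
    Xs = X / np.sqrt((X ** 2).mean(axis=0))   # enforce (1/n) sum_i X_ij^2 = 1
    G = Xs.T @ Xs / n                         # G_jk = (1/n) <X_j, X_k>, so G_jj = 1
    return np.abs(G - np.eye(d)).max()
```

For an i.i.d. Gaussian design this quantity is roughly on the order of √(log d / n), which is the sense in which a random matrix behaves like a nearly orthogonal one.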