Binary classification, Part 1

Maxim Raginsky
September 25, 2014

The problem of binary classification can be stated as follows. We have a random couple $Z = (X, Y)$, where $X \in \mathbb{R}^d$ is called the feature vector and $Y \in \{-1,1\}$ is called the label.¹ In the spirit of the model-free framework, we assume that the relationship between the features and the labels is stochastic and described by an unknown probability distribution $P \in \mathcal{P}(\mathsf{Z})$, where $\mathsf{Z} = \mathbb{R}^d \times \{-1,1\}$. In these lectures on binary classification, I will be following mainly two excellent sources: the book by Devroye, Györfi, and Lugosi [DGL96] and the comprehensive survey article by Bousquet, Boucheron, and Lugosi [BBL05].

As usual, we consider the case when we are given an i.i.d. sample of length $n$ from $P$. The goal is to learn a classifier, i.e., a mapping $g : \mathbb{R}^d \to \{-1,1\}$, such that the probability of classification error, $P(g(X) \neq Y)$, is small. As we have seen before, the optimal choice is the Bayes classifier

$$
g^*(x) \triangleq \begin{cases} 1, & \text{if } \eta(x) > 1/2 \\ -1, & \text{otherwise} \end{cases} \tag{1}
$$

where $\eta(x) \triangleq P[Y = 1 \mid X = x]$ is the regression function. However, since we make no assumptions on $P$, in general we cannot hope to learn the Bayes classifier $g^*$. Instead, we focus on a more realistic goal: we fix a collection $\mathcal{G}$ of classifiers and then use the training data to come up with a hypothesis $\hat{g}_n \in \mathcal{G}$, such that

$$
P(\hat{g}_n(X) \neq Y) \approx \inf_{g \in \mathcal{G}} P(g(X) \neq Y)
$$

with high probability. By way of notation, let us write $L(g)$ for the classification error of $g$, i.e., $L(g) \triangleq P(g(X) \neq Y)$, and let $L^*(\mathcal{G})$ denote the smallest classification error attainable over $\mathcal{G}$:

$$
L^*(\mathcal{G}) \triangleq \inf_{g \in \mathcal{G}} L(g).
$$

We will assume that a minimizing $g \in \mathcal{G}$ exists. For future reference, we note that

$$
L(g) = P(g(X) \neq Y) = P(Y g(X) < 0). \tag{2}
$$

Warning: In what follows, we will use $C$ or $c$ to denote various absolute constants; their values may change from line to line.

¹ The reason why we chose $\{-1,1\}$, rather than $\{0,1\}$, for the label space is merely convenience.
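To make the setup concrete, here is a minimal numerical sketch (not part of the original notes) under an assumed toy distribution: $X$ is uniform on $[0,1]$ and $\eta(x) = x$, so the Bayes classifier (1) predicts $1$ exactly when $x > 1/2$. The script estimates the classification error of the Bayes rule and of a trivial constant rule by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy distribution (for illustration only):
# X ~ Uniform[0, 1] and eta(x) = P[Y = 1 | X = x] = x.
def eta(x):
    return x

def bayes_classifier(x):
    # g*(x) = 1 if eta(x) > 1/2, -1 otherwise; cf. equation (1).
    return np.where(eta(x) > 0.5, 1, -1)

def classification_error(g, n=200_000):
    # Monte Carlo estimate of L(g) = P(g(X) != Y).
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < eta(x), 1, -1)
    return np.mean(g(x) != y)

print("L(g*)          ~", classification_error(bayes_classifier))                          # about 0.25
print("L(g = const 1) ~", classification_error(lambda x: np.ones_like(x, dtype=int)))      # about 0.5
```

For this particular $\eta$, the Bayes error equals $\mathbb{E}[\min(\eta(X), 1-\eta(X))] = 1/4$, which the estimate should reproduce up to Monte Carlo error.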

1 Learning linear discriminant rules

One of the simplest classification rules (and one of the first to be studied) is a linear discriminant rule: given a nonzero vector $w = (w^{(1)}, \dots, w^{(d)}) \in \mathbb{R}^d$ and a scalar $b \in \mathbb{R}$, let

$$
g(x) = g_{w,b}(x) \triangleq \begin{cases} 1, & \text{if } \langle w, x \rangle + b > 0 \\ -1, & \text{otherwise} \end{cases} \tag{3}
$$

Let $\mathcal{G}$ be the class of all such linear discriminant rules as $w$ ranges over all nonzero vectors in $\mathbb{R}^d$ and $b$ ranges over all reals:

$$
\mathcal{G} = \left\{ g_{w,b} : w \in \mathbb{R}^d \setminus \{0\},\ b \in \mathbb{R} \right\}.
$$

Given the training sample $Z^n$, let $\hat{g}_n \in \mathcal{G}$ be the output of the ERM algorithm, i.e.,

$$
\hat{g}_n \in \operatorname*{arg\,min}_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{g(X_i) \neq Y_i\}}.
$$

In other words, $\hat{g}_n$ is any classifier of the form (3) that minimizes the number of misclassifications on the training sample. Then we have the following:

Theorem 1. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

$$
L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{d+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}} \tag{4}
$$

holds with probability at least $1 - \delta$.

Proof. A standard argument leads to the bound

$$
L(\hat{g}_n) \le L^*(\mathcal{G}) + 2\Delta_n(Z^n), \tag{5}
$$

where

$$
\Delta_n(Z^n) \triangleq \sup_{g \in \mathcal{G}} \left| L(g) - L_n(g) \right|
$$

is the uniform deviation and $L_n(g)$ denotes the empirical classification error of $g$ on $Z^n$:

$$
L_n(g) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{g(X_i) \neq Y_i\}},
$$

which is the fraction of incorrectly labeled points in the training sample $Z^n$. Consider a classifier $g = g_{w,b} \in \mathcal{G}$ and define the set

$$
C_g \triangleq \left\{ (x,y) \in \mathbb{R}^d \times \{-1,1\} : y\big(\langle w, x \rangle + b\big) \le 0 \right\}.
$$

Then it is easy to see that

$$
L(g) = P(C_g) \qquad \text{and} \qquad L_n(g) = P_n(C_g),
$$

where, as before,

$$
P_n = \frac{1}{n}\sum_{i=1}^n \delta_{Z_i} = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i, Y_i)}
$$

is the empirical distribution of the sample $Z^n$. Let $\mathcal{C}$ denote the collection of all sets of the form $C = C_g$ for some $g \in \mathcal{G}$. Then

$$
\Delta_n(Z^n) = \sup_{C \in \mathcal{C}} \left| P_n(C) - P(C) \right|.
$$

Let $\mathcal{F} = \mathcal{F}_{\mathcal{C}}$ denote the class of indicator functions of the sets in $\mathcal{C}$:

$$
\mathcal{F}_{\mathcal{C}} \triangleq \left\{ \mathbf{1}_{C} : C \in \mathcal{C} \right\}.
$$

Then we know that, with probability at least $1 - \delta$,

$$
\Delta_n(Z^n) \le 2\,\mathbb{E} R_n\big(\mathcal{F}(Z^n)\big) + \sqrt{\frac{\log(1/\delta)}{2n}}, \tag{6}
$$

where $R_n(\mathcal{F}(Z^n))$ is the Rademacher average of the projection of $\mathcal{F}$ onto the sample $Z^n$. Now,

$$
\mathcal{F}(Z^n) = \left\{ (f(Z_1), \dots, f(Z_n)) : f \in \mathcal{F} \right\} = \left\{ \big(\mathbf{1}_{\{Z_1 \in C\}}, \dots, \mathbf{1}_{\{Z_n \in C\}}\big) : C \in \mathcal{C} \right\}.
$$

Therefore, if we prove that $\mathcal{C}$ is a VC class, then

$$
R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}}.
$$

But this follows from the fact that any $C \in \mathcal{C}$ has the form

$$
C = \left\{ (x,y) \in \mathbb{R}^d \times \{-1,1\} : \sum_{j=1}^d w^{(j)} y x^{(j)} + b y \le 0 \right\}
$$

for some $w \in \mathbb{R}^d \setminus \{0\}$ and some $b \in \mathbb{R}$, and the functions $(x,y) \mapsto y$ and $(x,y) \mapsto y x^{(j)}$, $1 \le j \le d$, span a linear space of dimension no greater than $d+1$. Hence, $V(\mathcal{C}) \le d + 1$, so that

$$
R_n(\mathcal{F}(Z^n)) \le C\sqrt{\frac{V(\mathcal{C})}{n}} \le C\sqrt{\frac{d+1}{n}}.
$$

Combining this with (5) and (6), we see that (4) holds with probability at least $1 - \delta$. ∎

1.1 Generalized linear discriminant rules

In the same vein, we may consider classification rules of the form

$$
g(x) = \begin{cases} 1, & \text{if } \sum_{j=1}^k w^{(j)} \psi_j(x) + b > 0 \\ -1, & \text{otherwise} \end{cases} \tag{7}
$$

where $k$ is some positive integer (not necessarily equal to $d$), $w = (w^{(1)}, \dots, w^{(k)}) \in \mathbb{R}^k$ is a nonzero vector, $b \in \mathbb{R}$ is an arbitrary scalar, and $\Psi = \{\psi_j : \mathbb{R}^d \to \mathbb{R}\}_{j=1}^k$ is some fixed dictionary of real-valued functions on $\mathbb{R}^d$. For a fixed $\Psi$, let $\mathcal{G}$ denote the collection of all classifiers of the form (7) as $w$ ranges over all nonzero vectors in $\mathbb{R}^k$ and $b$ ranges over all reals. Then the ERM rule is, again, given by

$$
\hat{g}_n \in \operatorname*{arg\,min}_{g \in \mathcal{G}} L_n(g) = \operatorname*{arg\,min}_{g \in \mathcal{G}} \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{g(X_i) \neq Y_i\}}.
$$

The following result can be proved pretty much along the same lines as Theorem 1:

Theorem 2. There exists an absolute constant $C > 0$, such that for any $n \in \mathbb{N}$ and any $\delta \in (0,1)$, the bound

$$
L(\hat{g}_n) \le L^*(\mathcal{G}) + C\sqrt{\frac{k+1}{n}} + \sqrt{\frac{2\log(1/\delta)}{n}} \tag{8}
$$

holds with probability at least $1 - \delta$.
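As a concrete (and entirely illustrative) companion to (7), the sketch below builds a generalized linear discriminant rule on $\mathbb{R}^2$ from an assumed quadratic dictionary $\Psi = \{x_1, x_2, x_1^2, x_2^2, x_1 x_2\}$ and evaluates its empirical classification error $L_n(g)$. Since exact ERM over such rules is computationally hard in general (see the next subsection), a naive random search over $(w,b)$ is used here as a stand-in for the ERM step; the data-generating rule and noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy sample: labels follow a noisy circular rule (illustration only).
n = 500
X = rng.uniform(-1, 1, size=(n, 2))
Y = np.where(X[:, 0]**2 + X[:, 1]**2 < 0.5, 1, -1)
flip = rng.uniform(size=n) < 0.1            # 10% label noise
Y[flip] = -Y[flip]

def psi(X):
    # Assumed fixed dictionary Psi = {x1, x2, x1^2, x2^2, x1*x2}, so k = 5 in (7).
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

def g(X, w, b):
    # Generalized linear discriminant rule (7): sign of <w, psi(x)> + b.
    return np.where(psi(X) @ w + b > 0, 1, -1)

def empirical_error(w, b):
    # L_n(g): fraction of misclassified training points.
    return np.mean(g(X, w, b) != Y)

# Naive random search over (w, b) as a stand-in for ERM (exact ERM is hard in general).
candidates = ((rng.normal(size=5), rng.normal()) for _ in range(5000))
w_best, b_best = min(candidates, key=lambda wb: empirical_error(*wb))
print("empirical error of best candidate:", empirical_error(w_best, b_best))
```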

1.2 Two fundamental issues

As Theorems 1 and 2 show, the ERM algorithm applied to the collection of all (generalized) linear discriminant rules is guaranteed to work well, in the sense that the classification error of the output hypothesis will, with high probability, be close to the optimum achievable by any discriminant rule with the given structure. The same argument extends to any collection of classifiers $\mathcal{G}$ for which the error sets $\{(x,y) : y g(x) \le 0\}$, $g \in \mathcal{G}$, form a VC class of dimension much smaller than the sample size $n$. In other words, with high probability the difference

$$
L(\hat{g}_n) - L^*(\mathcal{G}) = L(\hat{g}_n) - \inf_{g \in \mathcal{G}} L(g)
$$

will be small. However, precisely because the VC dimension of $\mathcal{G}$ cannot be too large, the approximation properties of $\mathcal{G}$ will be limited.

Another problem is computational. For instance, the problem of finding an empirically optimal linear discriminant rule is NP-hard. In other words, unless P is equal to NP, there is no hope of coming up with an efficient ERM algorithm for linear discriminant rules that would work for all feature space dimensions $d$. If $d$ is fixed, then it is possible to enumerate all projections of a given sample $Z^n$ onto the class of indicators of all halfspaces in $O(n^{d-1}\log n)$ time, which allows for an exhaustive search for an ERM solution, but the usefulness of this naive approach is limited to $d < 5$.

2 Risk bounds for combined classifiers via surrogate loss functions

One way to sidestep the above approximation-theoretic and computational issues is to replace the 0–1 (Hamming) loss function that gives rise to the probability-of-error criterion with some other loss function. What we gain is the ability to bound the performance of various complicated classifiers, built up by combining simpler base classifiers, in terms of the complexity (e.g., the VC dimension) of the collection of the base classifiers, as well as considerable computational advantages, especially if the problem of minimizing the empirical surrogate loss turns out to be a convex programming problem. What we lose, though, is that, in general, we will not be able to compare the generalization error of the learned classifier to the minimum classification risk. Instead, we will have to be content with the fact that the generalization error will be close to the smallest surrogate loss.

We will consider classifiers of the form

$$
g_f(x) = \operatorname{sgn} f(x) \triangleq \begin{cases} 1, & \text{if } f(x) \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{9}
$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is some function. From (2) we have

$$
L(g_f) = P(g_f(X) \neq Y) = P(Y g_f(X) < 0) = P(Y f(X) < 0).
$$

From now on, when dealing with classifiers of the form (9), we will write $L(f)$ instead of $L(g_f)$ to keep the notation simple.

Now we introduce the notion of a surrogate loss function.

Definition 1. A surrogate loss function is any function $\varphi : \mathbb{R} \to \mathbb{R}_+$ such that

$$
\varphi(x) \ge \mathbf{1}_{\{x > 0\}}. \tag{10}
$$

Some examples of commonly used surrogate losses:

1. Exponential: $\varphi(x) = e^x$.
2. Logit: $\varphi(x) = \log_2(1 + e^x)$.
3. Hinge loss: $\varphi(x) = (x+1)_+ \triangleq \max\{x+1, 0\}$.
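The sketch below (an illustration, not part of the notes) implements the three surrogate losses just listed and spot-checks the defining property (10), i.e., that each one dominates the 0–1 loss $\mathbf{1}_{\{x>0\}}$ pointwise on a grid.

```python
import numpy as np

# The three surrogate losses listed above (log base 2 for the logit loss).
surrogates = {
    "exponential": lambda x: np.exp(x),
    "logit":       lambda x: np.log2(1.0 + np.exp(x)),
    "hinge":       lambda x: np.maximum(x + 1.0, 0.0),
}

def zero_one(x):
    # The 0-1 loss 1_{x > 0} from (10).
    return (x > 0).astype(float)

# Spot-check the surrogate property phi(x) >= 1_{x > 0} on a grid.
x = np.linspace(-3.0, 3.0, 601)
for name, phi in surrogates.items():
    assert np.all(phi(x) >= zero_one(x)), name
    print(f"{name:12s}: phi(0) = {float(phi(0.0)):.3f}")
```

All three losses are convex and take the value 1 at the origin, which is what makes minimizing the empirical surrogate risk a tractable convex problem while still controlling the 0–1 error from above.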

Let $\varphi$ be a surrogate loss. Then for any $(x,y) \in \mathbb{R}^d \times \{-1,1\}$ and any $f : \mathbb{R}^d \to \mathbb{R}$ we have $y f(x) < 0$ if and only if $-y f(x) > 0$, and hence

$$
\varphi(-y f(x)) \ge \mathbf{1}_{\{-y f(x) > 0\}} = \mathbf{1}_{\{y f(x) < 0\}}. \tag{11}
$$

Therefore, defining the $\varphi$-risk of $f$ by

$$
A_\varphi(f) \triangleq \mathbb{E}\big[\varphi(-Y f(X))\big]
$$

and its empirical version

$$
A_{\varphi,n}(f) \triangleq \frac{1}{n} \sum_{i=1}^n \varphi(-Y_i f(X_i)),
$$

we see from (11) that

$$
L(f) \le A_\varphi(f) \qquad \text{and} \qquad L_n(f) \le A_{\varphi,n}(f). \tag{12}
$$
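To see (12) in action numerically, here is a small sketch (with assumed toy data and an arbitrary, not learned, linear score $f$) that computes the empirical hinge-loss risk $A_{\varphi,n}(f)$ and checks that it dominates the empirical classification error $L_n(f)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy data and an arbitrary (not learned) linear score f(x) = <w, x> + b.
n, d = 400, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = np.where(X @ w_true + 0.3 * rng.normal(size=n) > 0, 1, -1)

w, b = np.array([0.8, -1.5, 0.2]), 0.1
margins = Y * (X @ w + b)                        # the margins Y_i f(X_i)

def hinge(x):
    return np.maximum(x + 1.0, 0.0)

L_n     = np.mean(margins < 0)                   # empirical classification error L_n(f)
A_phi_n = np.mean(hinge(-margins))               # A_{phi,n}(f) = (1/n) sum_i phi(-Y_i f(X_i))

print(f"L_n(f) = {L_n:.3f} <= A_phi,n(f) = {A_phi_n:.3f}")    # inequality (12)
assert L_n <= A_phi_n
```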

Now that these preliminaries are out of the way, we can state and prove the basic surrogate loss bound:

Theorem 3. Consider any learning algorithm $\mathcal{A} = \{\mathcal{A}_n\}_{n=1}^\infty$, where, for each $n$, the mapping $\mathcal{A}_n$ receives the training sample $Z^n = (Z_1, \dots, Z_n)$ as input and produces a function $\hat{f}_n : \mathbb{R}^d \to \mathbb{R}$ from some class $\mathcal{F}$. Suppose that $\mathcal{F}$ and the surrogate loss $\varphi$ are chosen so that the following conditions are satisfied:

1. There exists some constant $B > 0$ such that
$$
\sup_{(x,y) \in \mathbb{R}^d \times \{-1,1\}} \sup_{f \in \mathcal{F}} \varphi(-y f(x)) \le B.
$$

2. There exists some constant $M_\varphi > 0$ such that $\varphi$ is $M_\varphi$-Lipschitz, i.e.,
$$
|\varphi(u) - \varphi(v)| \le M_\varphi |u - v|, \qquad \forall u, v \in \mathbb{R}.
$$

Then for any $n$ and any $\delta \in (0,1)$ the following bound holds with probability at least $1 - \delta$:

$$
L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + 4 M_\varphi\, \mathbb{E} R_n\big(\mathcal{F}(X^n)\big) + B\sqrt{\frac{\log(1/\delta)}{2n}}. \tag{13}
$$

Proof. Using (12), we can write

$$
L(\hat{f}_n) \le A_\varphi(\hat{f}_n) = A_{\varphi,n}(\hat{f}_n) + A_\varphi(\hat{f}_n) - A_{\varphi,n}(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \sup_{f \in \mathcal{F}} \left| A_\varphi(f) - A_{\varphi,n}(f) \right|.
$$

Now let $\mathcal{H}$ denote the class of functions $h : \mathbb{R}^d \times \{-1,1\} \to \mathbb{R}$ of the form $h(x,y) = -y f(x)$, $f \in \mathcal{F}$. Then

$$
\sup_{f \in \mathcal{F}} \left| A_\varphi(f) - A_{\varphi,n}(f) \right| = \sup_{f \in \mathcal{F}} \left| \mathbb{E}[\varphi(-Y f(X))] - \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i)) \right| = \sup_{h \in \mathcal{H}} \left| P(\varphi \circ h) - P_n(\varphi \circ h) \right|,
$$

where $\varphi \circ h(z) \triangleq \varphi(h(z))$ for every $z = (x,y) \in \mathbb{R}^d \times \{-1,1\}$. Let

$$
\Delta_n(Z^n) \triangleq \sup_{h \in \mathcal{H}} \left| P(\varphi \circ h) - P_n(\varphi \circ h) \right| = \sup_{h \in \mathcal{H}} \left| P(\varphi \circ h - \varphi(0)) - P_n(\varphi \circ h - \varphi(0)) \right|,
$$

where the second equality follows from the fact that adding the same constant to each $\varphi \circ h$ does not change the value of $P_n(\varphi \circ h) - P(\varphi \circ h)$. Using the familiar symmetrization argument, we can write

$$
\mathbb{E}\, \Delta_n(Z^n) \le 2\, \mathbb{E} R_n\big( \mathcal{H}_\varphi(Z^n) \big), \tag{14}
$$

where $\mathcal{H}_\varphi$ denotes the class of all functions of the form $(x,y) \mapsto \varphi(h(x,y)) - \varphi(0)$, $h \in \mathcal{H}$. We now use a very powerful result about Rademacher averages, called the contraction principle, which states the following [LT91]: if $A \subset \mathbb{R}^n$ is a bounded set and $F : \mathbb{R} \to \mathbb{R}$ is an $M$-Lipschitz function satisfying $F(0) = 0$, then

$$
R_n(F \circ A) \le 2 M R_n(A), \tag{15}
$$

where $F \circ A \triangleq \{ (F(a_1), \dots, F(a_n)) : a = (a_1, \dots, a_n) \in A \}$. (The proof of the contraction principle is somewhat involved, and we do not give it here.)

Consider the function $F(u) = \varphi(u) - \varphi(0)$. This function clearly satisfies $F(0) = 0$, and it is $M_\varphi$-Lipschitz, by our assumptions on $\varphi$. Moreover, from our definition of $\mathcal{H}_\varphi$, we immediately see that

$$
\mathcal{H}_\varphi(Z^n) = \big\{ \big( \varphi(h(Z_1)) - \varphi(0), \dots, \varphi(h(Z_n)) - \varphi(0) \big) : h \in \mathcal{H} \big\} = \big\{ (F(h(Z_1)), \dots, F(h(Z_n))) : h \in \mathcal{H} \big\} = F \circ \mathcal{H}(Z^n).
$$

Therefore, applying (15) to $A = \mathcal{H}(Z^n)$ and then using the resulting bound in (14), we obtain

$$
\mathbb{E}\, \Delta_n(Z^n) \le 4 M_\varphi\, \mathbb{E} R_n\big( \mathcal{H}(Z^n) \big).
$$

Furthermore, letting $\sigma^n$ be an i.i.d. Rademacher tuple independent of $Z^n$, we have

$$
R_n(\mathcal{H}(Z^n)) = \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{h \in \mathcal{H}} \sum_{i=1}^n \sigma_i h(Z_i) \right] = \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^n (-\sigma_i Y_i) f(X_i) \right] = \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^n \sigma_i f(X_i) \right] \triangleq R_n(\mathcal{F}(X^n)),
$$

where the last equality uses the fact that, conditionally on $Z^n$, the tuple $(-\sigma_1 Y_1, \dots, -\sigma_n Y_n)$ is again an i.i.d. Rademacher tuple.

This leads to

$$
\mathbb{E}\, \Delta_n(Z^n) \le 4 M_\varphi\, \mathbb{E} R_n\big( \mathcal{F}(X^n) \big). \tag{16}
$$

Now, since every function $\varphi \circ h$ is bounded between $0$ and $B$, the function $\Delta_n(Z^n)$ has bounded differences with $c_1 = \dots = c_n = B/n$. Therefore, from (16) and from McDiarmid's inequality, we have for every $t > 0$ that

$$
\mathbb{P}\Big( \Delta_n(Z^n) \ge 4 M_\varphi\, \mathbb{E} R_n(\mathcal{F}(X^n)) + t \Big) \le \mathbb{P}\Big( \Delta_n(Z^n) \ge \mathbb{E}\,\Delta_n(Z^n) + t \Big) \le e^{-2nt^2/B^2}.
$$

Choosing $t = B\sqrt{(2n)^{-1}\log(1/\delta)}$, we see that

$$
\Delta_n(Z^n) \le 4 M_\varphi\, \mathbb{E} R_n(\mathcal{F}(X^n)) + B\sqrt{\frac{\log(1/\delta)}{2n}}
$$

with probability at least $1 - \delta$. Therefore, since

$$
L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + \Delta_n(Z^n),
$$

we see that (13) holds with probability at least $1 - \delta$. ∎

What the above theorem tells us is that the performance of the learned classifier $\hat{f}_n$ is controlled by the Rademacher average of the class $\mathcal{F}$, and we can always arrange it to be relatively small. We will now look at several specific examples.

3 Weighted linear combination of classifiers

Let $\mathcal{G} = \{g : \mathbb{R}^d \to \{-1,1\}\}$ be a class of base classifiers (not to be confused with Bayes classifiers!), and consider the class

$$
\mathcal{F}_\lambda \triangleq \left\{ f = \sum_{j=1}^N c_j g_j : N \in \mathbb{N},\ \sum_{j=1}^N |c_j| \le \lambda;\ g_1, \dots, g_N \in \mathcal{G} \right\},
$$

where $\lambda > 0$ is a tunable parameter. Then for each $f = \sum_{j=1}^N c_j g_j \in \mathcal{F}_\lambda$ the corresponding classifier $g_f$ of the form (9) is given by

$$
g_f(x) = \operatorname{sgn}\left( \sum_{j=1}^N c_j g_j(x) \right).
$$

A useful way of thinking about $g_f$ is that, upon receiving a feature $x \in \mathbb{R}^d$, it computes the outputs $g_1(x), \dots, g_N(x)$ of the $N$ base classifiers from $\mathcal{G}$ and then takes a weighted majority vote; indeed, if we had $c_1 = \dots = c_N = \lambda/N$, then $g_f(x)$ would precisely correspond to taking a majority vote among the $N$ base classifiers. Note, by the way, that the number $N$ of base classifiers is not fixed, and can be learned from the data.

Now, Theorem 3 tells us that the performance of any learning algorithm that accepts a training sample $Z^n$ and produces a function $\hat{f}_n \in \mathcal{F}_\lambda$ is controlled by the Rademacher average $R_n(\mathcal{F}_\lambda(X^n))$. It turns out, moreover, that we can relate it to the Rademacher average of the base class $\mathcal{G}$.
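The weighted-majority-vote interpretation is easy to see in code. The sketch below (illustrative only; the base class, weights, and data are all assumptions) combines a few decision stumps on the real line into $f = \sum_j c_j g_j$ with $\sum_j |c_j| \le \lambda$, and checks that, with equal weights, $g_f = \operatorname{sgn} f$ reduces to a plain majority vote.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed base class: decision stumps g_t(x) = sign(x - t) on the real line.
def stump(t):
    return lambda x: np.where(x - t >= 0, 1, -1)

thresholds = [-0.5, 0.0, 0.3, 0.7, 1.1]
base = [stump(t) for t in thresholds]            # g_1, ..., g_N
N, lam = len(base), 1.0
c = np.full(N, lam / N)                          # equal weights, sum |c_j| = lambda

def f(x):
    # f(x) = sum_j c_j g_j(x), an element of F_lambda
    return sum(c_j * g_j(x) for c_j, g_j in zip(c, base))

def g_f(x):
    # g_f(x) = sgn f(x), with sgn(0) := 1 as in (9)
    return np.where(f(x) >= 0, 1, -1)

def majority_vote(x):
    votes = np.stack([g_j(x) for g_j in base])   # shape (N, len(x))
    return np.where(votes.sum(axis=0) >= 0, 1, -1)

x = rng.uniform(-2, 2, size=1000)
assert np.array_equal(g_f(x), majority_vote(x))  # equal weights => majority vote
print("g_f agrees with the majority vote on all test points")
```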

To start, note that $\mathcal{F}_\lambda = \lambda \cdot \operatorname{absconv} \mathcal{G}$, where

$$
\operatorname{absconv} \mathcal{G} \triangleq \left\{ \sum_{j=1}^N c_j g_j : N \in \mathbb{N};\ \|c\|_1 = \sum_{j=1}^N |c_j| \le 1;\ g_1, \dots, g_N \in \mathcal{G} \right\}
$$

is the absolute convex hull of $\mathcal{G}$. Therefore

$$
R_n(\mathcal{F}_\lambda(X^n)) = \lambda\, R_n(\mathcal{G}(X^n)).
$$

Now note that the functions in $\mathcal{G}$ are binary-valued. Therefore, assuming that the base class $\mathcal{G}$ is a VC class, we will have

$$
R_n(\mathcal{G}(X^n)) \le C\sqrt{\frac{V(\mathcal{G})}{n}}.
$$

Combining these bounds with the bound of Theorem 3, we conclude that for any $\hat{f}_n$ selected from $\mathcal{F}_\lambda$ based on the training sample $Z^n$, the bound

$$
L(\hat{f}_n) \le A_{\varphi,n}(\hat{f}_n) + C \lambda M_\varphi \sqrt{\frac{V(\mathcal{G})}{n}} + B\sqrt{\frac{\log(1/\delta)}{2n}}
$$

will hold with probability at least $1 - \delta$, where $B$ is the uniform upper bound on $\varphi(-y f(x))$, $f \in \mathcal{F}_\lambda$, $(x,y) \in \mathbb{R}^d \times \{-1,1\}$, and $M_\varphi$ is the Lipschitz constant of the surrogate loss $\varphi$.

Note that the above bound involves only the VC dimension of the base class, which is typically small. On the other hand, the class $\mathcal{F}_\lambda$ obtained by forming weighted combinations of classifiers from $\mathcal{G}$ is extremely rich, and will generally have infinite VC dimension! But there is a price we pay: the first term is the empirical surrogate loss $A_{\varphi,n}(\hat{f}_n)$, rather than the empirical classification error $L_n(\hat{f}_n)$. However, it is possible to choose the surrogate $\varphi$ in such a way that $A_{\varphi,n}(\cdot)$ can be bounded in terms of a quantity related to the number of misclassified training examples.

Here is an example. Fix a positive parameter $\gamma > 0$ and consider

$$
\varphi(x) = \begin{cases} 0, & \text{if } x \le -\gamma \\ 1, & \text{if } x \ge 0 \\ 1 + x/\gamma, & \text{otherwise} \end{cases}
$$

This is a valid surrogate loss with $B = 1$ and $M_\varphi = 1/\gamma$, but in addition we have $\varphi(x) \le \mathbf{1}_{\{x > -\gamma\}}$, which implies that $\varphi(-y f(x)) \le \mathbf{1}_{\{y f(x) < \gamma\}}$. Therefore, for any $f$ we have

$$
A_{\varphi,n}(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-Y_i f(X_i)) \le \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{Y_i f(X_i) < \gamma\}}. \tag{17}
$$

The quantity

$$
L_n^\gamma(f) \triangleq \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{Y_i f(X_i) < \gamma\}} \tag{18}
$$

is called the margin error of $f$. Notice that:

- For any $\gamma > 0$, $L_n^\gamma(f) \ge L_n(f)$.
- The function $\gamma \mapsto L_n^\gamma(f)$ is increasing.

Notice also that we can write

$$
L_n^\gamma(f) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{Y_i f(X_i) < 0\}} + \frac{1}{n}\sum_{i=1}^n \mathbf{1}_{\{0 \le Y_i f(X_i) < \gamma\}},
$$

where the first term is just $L_n(f)$, while the second term is the fraction of training examples that were classified correctly, but only with a small "margin" (the quantity $Y f(X)$ is often called the margin of the classifier $f$).
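The sketch below (illustrative; the margins are assumed random numbers standing in for $Y_i f(X_i)$ on a toy sample) implements this ramp-shaped surrogate and checks the chain $L_n(f) \le A_{\varphi,n}(f) \le L_n^\gamma(f)$ numerically, the right-hand inequality being (17).

```python
import numpy as np

rng = np.random.default_rng(4)

def ramp(x, gamma):
    # The surrogate above: 0 for x <= -gamma, 1 for x >= 0, linear in between.
    return np.clip(1.0 + x / gamma, 0.0, 1.0)

# Assumed margins Y_i f(X_i) for some classifier f on a toy sample.
margins = rng.normal(loc=0.5, scale=1.0, size=1000)

gamma = 0.4
L_n       = np.mean(margins < 0)                 # empirical classification error
A_phi_n   = np.mean(ramp(-margins, gamma))       # empirical surrogate risk, LHS of (17)
L_n_gamma = np.mean(margins < gamma)             # margin error (18)

print(f"L_n = {L_n:.3f} <= A_phi,n = {A_phi_n:.3f} <= L_n^gamma = {L_n_gamma:.3f}")
assert L_n <= A_phi_n <= L_n_gamma
```

Plugging (17) into the combined bound above, with $B = 1$ and $M_\varphi = 1/\gamma$, yields the margin-based bound stated next.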

Theorem 4 (Margin-based risk bound for weighted linear combinations). For any $\gamma > 0$, the bound

$$
L(\hat{f}_n) \le L_n^\gamma(\hat{f}_n) + \frac{C\lambda}{\gamma}\sqrt{\frac{V(\mathcal{G})}{n}} + \sqrt{\frac{\log(1/\delta)}{2n}} \tag{19}
$$

holds with probability at least $1 - \delta$.

Remark 1. Note that the first term on the right-hand side of (19) increases with $\gamma$, while the second term decreases with $\gamma$. Hence, if the learned classifier $\hat{f}_n$ has a small margin error for a large $\gamma$, i.e., it classifies the training samples well and with high "confidence," then its generalization error will be small.

References

[BBL05] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

[LT91] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.