COS 511: Theoretical Machine Learning


Lecturer: Rob Schapire    Lecture #10
Scribe: José Simões Ferreira    March 06, 2013

In the last lecture the concept of Rademacher complexity was introduced, with the goal of showing that for all $f$ in a family of functions $\mathcal{F}$ we have $\hat{E}_S[f] \approx E[f]$. Let us summarize the definitions of interest:

$\mathcal{F}$: a family of functions $f : Z \to [0, 1]$
$S = \langle z_1, \ldots, z_m \rangle$
$\hat{R}_S(\mathcal{F}) = E_\sigma\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i) \right]$
$R_m(\mathcal{F}) = E_S\left[ \hat{R}_S(\mathcal{F}) \right]$

We also began proving the following theorem:

Theorem. With probability at least $1 - \delta$, for all $f \in \mathcal{F}$:
$$E[f] \le \hat{E}_S[f] + 2 R_m(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
$$E[f] \le \hat{E}_S[f] + 2 \hat{R}_S(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
which we now prove in full.

Proof. Let us define
$$\Phi(S) = \sup_{f \in \mathcal{F}} \left( E[f] - \hat{E}_S[f] \right), \qquad E[f] = E_{z \sim D}[f(z)], \qquad \hat{E}_S[f] = \frac{1}{m} \sum_{i=1}^m f(z_i).$$

Step 1:
$$\Phi(S) \le E_S[\Phi(S)] + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right).$$
This was proven last lecture and follows from McDiarmid's inequality.

Step 2:
$$E_S[\Phi(S)] \le E_{S, S'}\left[ \sup_{f \in \mathcal{F}} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right].$$
This was also shown last lecture. We also considered generating new samples $T, T'$ by flipping a coin: running through $i = 1, \ldots, m$, we flip a coin, swapping $z_i$ with $z_i'$ if heads and doing nothing otherwise. We then claimed that the distributions thus generated are the same as those of $S$ and $S'$, and we noted
$$\hat{E}_{S'}[f] - \hat{E}_S[f] = \frac{1}{m} \sum_{i=1}^m \left( f(z_i') - f(z_i) \right),$$

which means we can write
$$\hat{E}_{T'}[f] - \hat{E}_T[f] = \frac{1}{m} \sum_{i=1}^m \sigma_i \left( f(z_i') - f(z_i) \right),$$
which is written in terms of Rademacher random variables $\sigma_i$. We now proceed with the proof.

Step 3: We first claim
$$E_{S, S'}\left[ \sup_{f \in \mathcal{F}} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right] = E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i \left( f(z_i') - f(z_i) \right) \right].$$
To see this, note that the right hand side is effectively the same expectation as the left hand side, but with respect to $T$ and $T'$, which are identically distributed to $S$ and $S'$. Now we can write
$$E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i \left( f(z_i') - f(z_i) \right) \right] \le E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i') \right] + E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m (-\sigma_i) f(z_i) \right],$$
where we are just maximizing over the sums separately. We now note two points:

1. The random variable $-\sigma_i$ has the same distribution as $\sigma_i$;
2. The expectation over $S$ is irrelevant in the first term, since the term inside the expectation does not depend on $S$. Similarly, the expectation over $S'$ is irrelevant in the second term.

Therefore
$$E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i') \right] = E_{S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i') \right] = E_{S'}\left[ E_\sigma\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m \sigma_i f(z_i') \right] \right] = R_m(\mathcal{F})$$
and, similarly,
$$E_{S, S', \sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^m (-\sigma_i) f(z_i) \right] = R_m(\mathcal{F}).$$

Step 4: We have thus shown
$$E_{S, S'}\left[ \sup_{f \in \mathcal{F}} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right] \le 2 R_m(\mathcal{F}).$$
Chaining our results together, we obtain
$$\Phi(S) = \sup_{f \in \mathcal{F}} \left( E[f] - \hat{E}_S[f] \right) \le 2 R_m(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right).$$
We conclude that with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$E[f] - \hat{E}_S[f] \le \Phi(S) \le 2 R_m(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right).$$
Therefore, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$E[f] \le \hat{E}_S[f] + 2 R_m(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right).$$
This is one of the results we were seeking. Proving the result for $\hat{R}_S(\mathcal{F})$ is just a matter of applying McDiarmid's inequality to obtain, with probability at least $1 - \delta$,
$$R_m(\mathcal{F}) \le \hat{R}_S(\mathcal{F}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right).$$

Motivation

The original motivation behind the theorem above was to obtain a relationship between generalization error and training error. We want to be able to say that, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$,
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \text{small term}.$$
We note that $\mathrm{err}(h)$ is evocative of $E[f]$ and $\widehat{\mathrm{err}}(h)$ is evocative of $\hat{E}_S[f]$, which appear in our theorem. Let us write
$$\mathrm{err}(h) = \Pr_{(x, y) \sim D}[h(x) \ne y] = E_{(x, y) \sim D}\left[ \mathbf{1}\{h(x) \ne y\} \right]$$
$$\widehat{\mathrm{err}}(h) = \frac{1}{m} \sum_{i=1}^m \mathbf{1}\{h(x_i) \ne y_i\} = \hat{E}_S\left[ \mathbf{1}\{h(x) \ne y\} \right]$$
as per our definitions. We see that, to fit our definition, we must work with functions $f$ which are indicator functions. Let us define $Z = X \times \{-1, +1\}$ and, for $h \in \mathcal{H}$,
$$f_h(x, y) = \mathbf{1}\{h(x) \ne y\}.$$

Now we can write:
$$E_{(x, y) \sim D}\left[ \mathbf{1}\{h(x) \ne y\} \right] = E[f_h], \qquad \hat{E}_S\left[ \mathbf{1}\{h(x) \ne y\} \right] = \hat{E}_S[f_h], \qquad \mathcal{F}_\mathcal{H} = \{ f_h : h \in \mathcal{H} \}.$$
This allows us to use our theorem to state that, with probability at least $1 - \delta$, for all $h \in \mathcal{H}$:
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + 2 R_m(\mathcal{F}_\mathcal{H}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + 2 \hat{R}_S(\mathcal{F}_\mathcal{H}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
We want to write the above in terms of the Rademacher complexity of $\mathcal{H}$, which we can do by looking at the definition of Rademacher complexity. We have
$$\hat{R}_S(\mathcal{F}_\mathcal{H}) = E_\sigma\left[ \sup_{f_h \in \mathcal{F}_\mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i f_h(x_i, y_i) \right].$$
Now, our functions $f_h$ are just indicator functions and can be written
$$f_h(x_i, y_i) = \frac{1 - y_i h(x_i)}{2}.$$
Further, we are indexing each function by a function $h \in \mathcal{H}$, so we can just index the supremum with $h \in \mathcal{H}$ instead of $f_h \in \mathcal{F}_\mathcal{H}$. Writing this out gives
$$\hat{R}_S(\mathcal{F}_\mathcal{H}) = E_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i \frac{1 - y_i h(x_i)}{2} \right] = \frac{1}{2} E_\sigma\left[ \sup_{h \in \mathcal{H}} \left( \frac{1}{m} \sum_{i=1}^m \sigma_i + \frac{1}{m} \sum_{i=1}^m (-y_i \sigma_i) h(x_i) \right) \right].$$
Because $\sigma_i$ is a Rademacher random variable, its expectation is just 0. For the second term, we note that because the sample $S$ is fixed, the $y_i$'s are fixed, and therefore the term $-y_i \sigma_i$ is distributed the same as $\sigma_i$. Hence, we conclude
$$\hat{R}_S(\mathcal{F}_\mathcal{H}) = 0 + \frac{1}{2} E_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i) \right] = \frac{1}{2} \hat{R}_S(\mathcal{H}).$$
We have therefore shown
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + R_m(\mathcal{H}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
$$\mathrm{err}(h) \le \widehat{\mathrm{err}}(h) + \hat{R}_S(\mathcal{H}) + O\left( \sqrt{\frac{\ln(1/\delta)}{m}} \right)$$
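The identity $\hat{R}_S(\mathcal{F}_\mathcal{H}) = \frac{1}{2} \hat{R}_S(\mathcal{H})$ can be sanity-checked numerically. The following is a minimal sketch (not part of the notes), assuming a small class of threshold functions on the line and estimating both empirical Rademacher complexities by Monte Carlo sampling over $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed sample S of m labeled points (x_i, y_i), with y_i in {-1, +1}.
m = 50
x = np.sort(rng.uniform(0, 1, size=m))
y = rng.choice([-1, 1], size=m)

# A small finite class H of threshold classifiers h_t(x) = sign(x - t),
# with one representative per distinct behavior of the class on S.
thresholds = np.concatenate(([0.0], (x[:-1] + x[1:]) / 2, [1.0]))
H = np.sign(x[None, :] - thresholds[:, None])   # shape (|H|, m), entries in {-1, +1}
H[H == 0] = 1

def emp_rademacher(values, n_mc=20000):
    """Monte Carlo estimate of E_sigma[ sup over rows v of (1/m) sum_i sigma_i * v_i ]."""
    sigmas = rng.choice([-1.0, 1.0], size=(n_mc, m))
    sups = np.max(values @ sigmas.T, axis=0)    # sup over the class, for each sigma draw
    return sups.mean() / m

# Empirical Rademacher complexity of H (correlations with the predictions h(x_i)).
R_H = emp_rademacher(H)

# Empirical Rademacher complexity of F_H (correlations with the losses 1{h(x_i) != y_i}).
losses = (H != y[None, :]).astype(float)
R_FH = emp_rademacher(losses)

print(f"R_S(H)     ~ {R_H:.4f}")
print(f"R_S(F_H)   ~ {R_FH:.4f}")
print(f"R_S(H) / 2 ~ {R_H / 2:.4f}")
```

With enough Monte Carlo samples, the last two printed estimates agree up to sampling noise, matching the factor-of-two relation derived above.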

2 Obtaining other bounds

It was alluded to in class that obtaining the above bounds in terms of Rademacher complexity subsumes other bounds previously shown, which can be demonstrated with an example. We first state a simple theorem (a slightly weaker version of this theorem will be proved in a later homework assignment).

Theorem. For $|\mathcal{H}| < \infty$:
$$\hat{R}_S(\mathcal{H}) \le \sqrt{\frac{2 \ln |\mathcal{H}|}{m}}.$$
Now consider again the definition of empirical Rademacher complexity:
$$\hat{R}_S(\mathcal{H}) = E_\sigma\left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i) \right].$$
We see that it only depends on how the hypotheses behave on the fixed set $S$. We therefore have a finite set of behaviors on the set. Define $\mathcal{H}' \subseteq \mathcal{H}$, where $\mathcal{H}'$ is composed of one representative from $\mathcal{H}$ for each possible labeling of the sample set $S$ by $\mathcal{H}$. Therefore
$$|\mathcal{H}'| = |\Pi_\mathcal{H}(S)| \le \Pi_\mathcal{H}(m).$$
Since the complexity only depends on the behaviors on $S$, we claim
$$\hat{R}_S(\mathcal{H}) = E_\sigma\left[ \sup_{h \in \mathcal{H}'} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i) \right] = \hat{R}_S(\mathcal{H}').$$
We can now use the theorem stated above to write
$$\hat{R}_S(\mathcal{H}) = \hat{R}_S(\mathcal{H}') \le \sqrt{\frac{2 \ln |\Pi_\mathcal{H}(S)|}{m}}.$$
Finally, we recall that after proving Sauer's lemma, we showed $\Pi_\mathcal{H}(m) \le \left( \frac{em}{d} \right)^d$ for $m \ge d$. Therefore
$$\hat{R}_S(\mathcal{H}) \le \sqrt{\frac{2 d \ln\left( \frac{em}{d} \right)}{m}}.$$
We have thus used the Rademacher complexity results to get an upper bound for the case of infinite $\mathcal{H}$ in terms of the VC-dimension.
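As a quick numerical illustration (my own, not from the notes, using an illustrative value $d = 10$), the bound decays roughly like $\sqrt{(d \ln m)/m}$ as the sample size grows:

```python
import math

d = 10                                   # VC-dimension (illustrative value, not from the notes)
for m in (100, 1000, 10000, 100000):
    bound = math.sqrt(2 * d * math.log(math.e * m / d) / m)
    print(f"m = {m:7d}   sqrt(2 d ln(em/d) / m) = {bound:.3f}")
```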

3 Boosting

Up until this point, the PAC learning model we have been considering requires that we be able to learn to arbitrary accuracy. Thus, the problem we have been dealing with is:

Strong learning. $\mathcal{C}$ is strongly PAC-learnable if $\exists$ algorithm $A$ such that $\forall$ distributions $D$, $\forall c \in \mathcal{C}$, $\forall \epsilon > 0$, $\forall \delta > 0$: $A$, given $m = \mathrm{poly}(1/\epsilon, 1/\delta, \ldots)$ examples, computes $h$ such that
$$\Pr[\mathrm{err}(h) > \epsilon] \le \delta.$$

But what if we can only find an algorithm that gives slightly better than an even chance of error (e.g. 40%)? Could we use it to develop a better algorithm, iteratively improving our solution to arbitrary accuracy? We want to consider the following problem:

Weak learning. $\mathcal{C}$ is weakly PAC-learnable if $\exists \gamma > 0$ and $\exists$ algorithm $A$ such that $\forall$ distributions $D$, $\forall c \in \mathcal{C}$, $\forall \delta > 0$: $A$, given $m = \mathrm{poly}(1/\epsilon, 1/\delta, \ldots)$ examples, computes $h$ such that
$$\Pr\left[ \mathrm{err}(h) > \frac{1}{2} - \gamma \right] \le \delta.$$

We note that in this problem we no longer require arbitrary accuracy, but only that the algorithm picked be able to do slightly better than random guessing, with high probability. The natural question that arises is whether weak learning is equivalent to strong learning.

Consider first the simpler case of a fixed distribution $D$. In this case, the answer to our question is no, which we can illustrate through a simple example.

Example: For a fixed $D$, define
$$X = \{0, 1\}^n \cup \{z\}$$
$D$ picks $z$ with probability $1/4$ and with probability $3/4$ picks uniformly from $\{0, 1\}^n$;
$\mathcal{C} = \{ \text{all concepts over } X \}$.
In a training sample, we expect to see $z$ with high probability, and therefore $z$ will be correctly learned by the algorithm. However, the remaining points are exponential in $n$, so that with only $\mathrm{poly}(1/\epsilon, 1/\delta, \ldots)$ examples, we are unlikely to do much better than even chance on the rest of the domain. We therefore expect the error to be given roughly by
$$\mathrm{err}(h) \approx \frac{1}{2} \cdot \frac{3}{4} + 0 \cdot \frac{1}{4} = \frac{3}{8},$$
in which case $\mathcal{C}$ is weakly learnable, but not strongly learnable.

We wish to prove that in the general case of an arbitrary distribution the following theorem holds:

Theorem. Strong and weak learning are equivalent under the PAC learning model.

The way we will reach this result is by developing a boosting algorithm which constructs a strong learning algorithm from a weak learning algorithm.
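A small simulation can make the example concrete. This is a sketch under assumptions not stated in the notes: $n = 20$, a lazily generated random target concept, and a learner that memorizes its training sample and outputs a fixed label elsewhere. The estimated error comes out near $3/8$ regardless of the (polynomially sized) sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                                   # X = {0,1}^n together with one special point z
m_train, m_test = 500, 100_000

def draw(size):
    """Draw points from D: the special point z with prob. 1/4, else uniform over {0,1}^n."""
    points = []
    for _ in range(size):
        if rng.random() < 0.25:
            points.append("z")
        else:
            points.append(tuple(int(b) for b in rng.integers(0, 2, size=n)))
    return points

# The target concept is an essentially random labeling of the huge domain
# (C contains all concepts over X), realized lazily with a memo table.
labels = {}
def c(point):
    if point not in labels:
        labels[point] = int(rng.choice([-1, 1]))
    return labels[point]

# Learner: memorize the training sample exactly; predict +1 on any unseen point.
train = draw(m_train)
memorized = {p: c(p) for p in train}
def h(point):
    return memorized.get(point, 1)

test = draw(m_test)
err = np.mean([h(p) != c(p) for p in test])
print(f"estimated err(h) = {err:.3f}   (the notes predict roughly (1/2)*(3/4) = 0.375)")
```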

3.1 The boosting problem

The challenge faced by the boosting algorithm can be defined by the following problem.

Boosting problem. Given:
$(x_1, y_1), \ldots, (x_m, y_m)$ with $y_i \in \{-1, +1\}$;
access to a weak learner $A$: $\forall$ distributions $D$, given examples from $D$, $A$ computes $h$ such that $\Pr[\mathrm{err}_D(h) > \frac{1}{2} - \gamma] \le \delta$.
Goal: find $H$ such that, with high probability, $\mathrm{err}_D(H) \le \epsilon$ for any fixed $\epsilon$.

[Figure 1: Schematic representation of the boosting algorithm.]

The main idea behind the boosting algorithm is to produce a number of different distributions $D_t$ from $D$, using the sample provided. This is necessary because running $A$ on the same sample alone will not, in general, be enough to produce an arbitrarily accurate hypothesis (certainly so if $A$ is deterministic). A boosting algorithm will therefore run as follows:

Boosting algorithm:
for $t = 1, \ldots, T$:
    run $A$ on $D_t$ to get weak hypothesis $h_t : X \to \{-1, +1\}$
    $\epsilon_t = \mathrm{err}_{D_t}(h_t) = \frac{1}{2} - \gamma_t$, where $\gamma_t \ge \gamma$
end
output $H$, where $H$ is a combination of the weak hypotheses $h_1, \ldots, h_T$.

In the above, the distributions $D_t$ are distributions on the indices $1, \ldots, m$, and may vary from round to round. It is by adjusting these distributions that the boosting algorithm will be able to achieve high accuracy. Intuitively, we want to pick the distributions $D_t$ such that, on each round, they provide us with more information about the points in the sample

that are hard to learn The boostng algorth can be seen scheatcally n Fgure Let us defne: D t () = D t (x, y ) We pck the dstrbuton as follows: : D () = where α t > 0 D t+ () = D { t() e α t f h t (x ) y Z t e αt f h t (x ) = y Intutvely, all our exaples are consdered equally n the frst round of boostng Gong forward, f an exaple s sclassfed, ts weght n the next round wll ncrease, whle the weghts of the correctly classfed exaples wll decrease, so that the classfer wll focus on the exaples whch have proven harder to classfy correctly 8