Intro to Learning Theory

Lecture 1, October 18, 2016
Intro to Learning Theory
Ruth Urner

1 Machine Learning and Learning Theory

Coming soon.

2 Formal Framework

2.1 Basic notions

In our formal model for machine learning, the instances to be classified are members of a set $\mathcal{X}$, the domain set or feature space. Instances are to be classified into a label set $\mathcal{Y}$. For now (and for most of the class), we assume that the label set is binary, that is, $\mathcal{Y} = \{0, 1\}$. For example, an instance $x \in \mathcal{X}$ could be an email, and its label indicates whether the email is spam ($y = 1$) or not spam ($y = 0$). We often assume that the instances are represented as real-valued vectors, that is, $\mathcal{X} \subseteq \mathbb{R}^d$ for some dimension $d$.

A predictor or classifier is a function $h : \mathcal{X} \to \mathcal{Y}$. A learner is a function that takes some training data and maps it to a predictor. We let the training data be denoted by a sequence $S = ((X_1, Y_1), \ldots, (X_n, Y_n))$. Then, formally, a learner $A$ is a function
$$A : \bigcup_{i \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^i \to \mathcal{Y}^{\mathcal{X}}, \qquad A : S \mapsto h,$$
where $\mathcal{Y}^{\mathcal{X}}$ denotes the set of all functions from the set $\mathcal{X}$ to the set $\mathcal{Y}$. For convenience, when the learner is clear from context, we use the notation $h_n$ to denote the output of the learner on data of size $n$, that is, $h_n = A(S)$ for $|S| = n$.

The goal of learning is to produce a predictor $h$ that correctly classifies not only the training data, but also future instances that it has not seen yet. We thus need a mathematical description of how the environment produces instances. In particular, we would like to model that the environment (or nature) remains somehow stable: the process that generated the training data is the same process that will generate future data. We model the data generation as a probability distribution $P$ over $\mathcal{X} \times \mathcal{Y} = \mathcal{X} \times \{0, 1\}$. We further assume that the instances $(X_i, Y_i)$ are i.i.d. (independently and identically distributed) according to $P$.
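A minimal Python sketch of these notions, assuming a toy distribution $P$ over $[0,1]^2 \times \{0,1\}$ and a simple threshold learner; the distribution, the learner, and all names in the code are illustrative choices, not part of the formal framework itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Draw n i.i.d. examples (X_i, Y_i) from a toy distribution P over [0,1]^2 x {0,1}.
    Assumption for illustration: the label is 1 exactly when the first feature exceeds 0.5."""
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    Y = (X[:, 0] > 0.5).astype(int)
    return X, Y

def learner_A(S):
    """A learner A maps a sample S = (X, Y) to a predictor h : X -> {0, 1}.
    This toy learner thresholds the first feature at the smallest positively labeled value seen."""
    X, Y = S
    t = X[Y == 1, 0].min() if (Y == 1).any() else 1.0
    def h(x):
        return int(x[0] >= t)
    return h

S = sample_P(n=20)        # training data of size n, drawn i.i.d. from P
h_n = learner_A(S)        # h_n = A(S), the learner's output on the sample
print(h_n(np.array([0.9, 0.3])))   # predicted label for a new instance
```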

The performance of a classifier $h$ on an instance $(X, Y)$ is measured by a loss function. A loss function is a function $\ell : \mathcal{Y}^{\mathcal{X}} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. The value $\ell(h, X, Y) \in \mathbb{R}$ indicates how badly $h$ predicts on the example $(X, Y)$. We will, for now, work with the binary loss (or 0/1-loss), defined as
$$\ell(h, X, Y) = \mathbb{1}[h(X) \neq Y],$$
where $\mathbb{1}[p]$ denotes the indicator function of the predicate $p$, that is, $\mathbb{1}[p] = 1$ if $p$ is true and $\mathbb{1}[p] = 0$ if $p$ is false. The binary loss is $1$ if the prediction of $h$ on the example $(X, Y)$ is wrong. If the prediction is correct, no loss is suffered and the binary loss assigns value $0$.

We can now formally phrase the goal of learning as aiming for a classifier that has low loss in expectation over the data generating distribution. That is, we would like to output a classifier that has low expected loss, or risk, defined as
$$L(h) = E_{(X,Y)\sim P}[\ell(h, X, Y)] = E_{(X,Y)\sim P}\big[\mathbb{1}[h(X) \neq Y]\big].$$
Since our loss function assumes only values in $\{0, 1\}$, the above expectation is equal to the probability of generating an example $X$ on which $h$ makes a wrong prediction. That is, we have
$$L(h) = E_{(X,Y)\sim P}\big[\mathbb{1}[h(X) \neq Y]\big] = P_{(X,Y)\sim P}[h(X) \neq Y].$$

Note, however, that the learner does not get to see the data generating distribution. It can thus not merely output a classifier of lowest expected loss. The learner needs to make its decisions based on the data $S$. Given a classifier $h$ and data $S$, the learner can evaluate the empirical risk of $h$ on $S$:
$$L_n(h) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(X_i) \neq Y_i].$$
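A minimal sketch of these two quantities for one fixed classifier, assuming a toy distribution on which the true risk can be computed exactly; averaging the empirical risk over many independent samples comes out close to $L(h)$, which previews the result of the next subsection.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_P(n):
    """Toy distribution P over [0,1] x {0,1} (an assumption for illustration): label 1 iff X > 0.5."""
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X > 0.5).astype(int)
    return X, Y

def h(x):
    """A fixed classifier, chosen before seeing any data."""
    return (x > 0.4).astype(int)

def empirical_risk(h, X, Y):
    """L_n(h) = (1/n) * sum_i 1[h(X_i) != Y_i], the average 0/1 loss on the sample."""
    return np.mean(h(X) != Y)

# True risk L(h) = P[h(X) != Y]; for this P and h it can be computed exactly:
# h errs exactly when X falls in (0.4, 0.5], which has probability 0.1.
true_risk = 0.1

# Averaging the empirical risk over many independent samples of size n = 50
# gives a value close to L(h) = 0.1.
estimates = [empirical_risk(h, *sample_P(50)) for _ in range(10_000)]
print(round(np.mean(estimates), 3), true_risk)
```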

2.2 On the relation of empirical and true risk

A natural strategy for the learner would be to simply output a function that has small empirical risk. In favor of this approach, we now show that the empirical risk is an unbiased estimator of the true risk.

Claim 1. For all functions $h : \mathcal{X} \to \{0, 1\}$ and for all sample sizes $n$ we have $E_S[L_n(h)] = L(h)$.

Proof.
$$E_S[L_n(h)] = E_S\Big[\frac{1}{n}\sum_{i=1}^n \mathbb{1}[h(X_i) \neq Y_i]\Big] = \frac{1}{n}\sum_{i=1}^n E_S\big[\mathbb{1}[h(X_i) \neq Y_i]\big] = \frac{1}{n}\sum_{i=1}^n E_{(X,Y)\sim P}\big[\mathbb{1}[h(X) \neq Y]\big] = \frac{1}{n}\sum_{i=1}^n L(h) = L(h),$$
where the second equality holds by linearity of expectation, and the third equality holds since that expectation depends only on one (the $i$-th) example in $S$.

Thus, for any fixed function, the empirical risk gives us an unbiased estimate of the quantity that we are after, the true risk. Note that this holds even for small sample sizes. Moreover, by the law of large numbers, the above claim implies that, with large sample sizes, the empirical risk of a classifier converges to its true risk (in probability). As we see more and more data, the empirical risk of a function becomes a better and better estimate of its true risk.

This may lead us to believe that the simple learning strategy of just finding some function with low empirical risk should succeed at achieving low true risk as we see more and more data. However, the following phenomenon shows that this strategy can in fact go arbitrarily badly wrong.

Claim 2. There exists a distribution $P$ and a learner such that for all $n$ we have $L_n(h_n) = 0$ and $L(h_n) = 1$.

Proof. As the data generating distribution, consider the uniform distribution over $[0, 1] \times \{1\}$ (any distribution with an atomless marginal over $\mathcal{X}$ and constant label $1$ works). That is, in any sample $S$ generated by this $P$, the examples are labeled with $1$, that is, $S = ((X_1, 1), \ldots, (X_n, 1))$. We construct a stubborn learner $A$. The stubborn learner outputs a function that agrees with the sample's labels on points that were in the sample $S$, but keeps believing that the label is $0$ everywhere else. Formally:
$$h_n(X) = A(S)(X) = \begin{cases} 1 & \text{if } (X, 1) \in S, \\ 0 & \text{otherwise.} \end{cases}$$
Now we clearly have $L_n(h_n) = 0$ for all $n$. However, since $S$ is finite, the set of instances $X$ on which $h_n$ predicts $1$ has measure $0$. Thus, with probability $1$, $h_n$ outputs the incorrect label $0$. Thus $L(h_n) = 1$.

The difference between the situations in the above two claims is that, in the second case, the function $h_n$ depends on the data. While, for every fixed function $h$ (fixed before the data is seen), the empirical risk estimates converge to the true risk of that function, this convergence is not uniform over all functions. Claim 2 shows that, at any given sample size, there exist functions for which true and empirical risk are arbitrarily far apart.
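The stubborn learner can be simulated directly. A minimal sketch, assuming the construction above with the uniform distribution on $[0, 1]$ and all labels equal to $1$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_P(n):
    """The distribution from Claim 2: X uniform on [0, 1], every label equal to 1."""
    X = rng.uniform(0.0, 1.0, size=n)
    Y = np.ones(n, dtype=int)
    return X, Y

def stubborn_learner(X_train, Y_train):
    """Returns h_n: predicts 1 only on the exact training points, 0 everywhere else."""
    memorized = set(map(float, X_train))
    def h_n(x):
        return 1 if float(x) in memorized else 0
    return h_n

X_train, Y_train = sample_P(100)
h_n = stubborn_learner(X_train, Y_train)

# Empirical risk is 0: h_n reproduces every training label.
L_n = np.mean([h_n(x) != y for x, y in zip(X_train, Y_train)])

# Fresh points almost surely miss the (measure-zero) memorized set, so h_n predicts 0
# while the true label is 1; the estimated true risk is therefore (essentially) 1.
X_test, Y_test = sample_P(10_000)
L_hat = np.mean([h_n(x) != y for x, y in zip(X_test, Y_test)])
print(L_n, L_hat)   # 0.0 1.0
```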

Now, in machine learning, we do want the function that the learner outputs to be able to depend on the data. Furthermore, the learner only ever gets to see a finite amount of data. We have seen that, for any finite sample size, that is, on any finite amount of data, the empirical risk can be a very bad indicator of the true risk of a function. Basic questions of learning theory thus are: How can we control the (true) risk of a function learned from a finite amount of data? Can we identify situations where we can relate the true and empirical risk?

2.3 Fixing a hypothesis space

We have seen that, if we want our learned function $h_n$ to depend on the data, we have to change the rules for the learner. In Claim 2, we let the learner output any function it wanted. This resulted in the learner adapting itself very well to the data it has seen in the sample, achieving $0$ empirical risk, while not making any progress towards predicting well on unseen examples. The construction of Claim 2 is an extreme version of a phenomenon called overfitting. In informal terms, if a learning method has too much freedom with regard to the functions it can output, it may overadapt to the training data rather than extract structure that will also apply to unseen examples. Overfitting is a frequently encountered phenomenon in practice that one has to guard against.

To prevent the learner from overfitting, we need to restrict the class of predictors. A hypothesis class $\mathcal{H}$ is a set of predictors, $\mathcal{H} \subseteq \{0, 1\}^{\mathcal{X}}$. Instead of allowing the learner to output any function, we will now consider learners that output functions from $\mathcal{H}$. We will see that, in many cases, fixing the hypothesis class before we see the data will let us regain control over the relation between empirical and true risk. However, fixing the hypothesis class also means that there may not be any good function in the class. We will thus rephrase the goal of learning to only require the learner to come up with a function that is (approximately) as good as the best function in the class $\mathcal{H}$. Thus, our new goal is to show that
$$L(h_n) \leq \inf_{h \in \mathcal{H}} L(h) + f(n),$$
where $f$ is a decreasing function of the sample size $n$. That is, as we see more and more data, we would like the true risk of the output of the learner to approach the best risk achievable with the class $\mathcal{H}$. Equivalently, we would like to show that
$$L(h_n) - \inf_{h \in \mathcal{H}} L(h) \leq f(n).$$

2.4 Learnability of finite classes

We now show that the above goal is achievable for finite classes $\mathcal{H} = \{h_1, \ldots, h_N\}$. We will analyze the learner ERM (Empirical Risk Minimization), which outputs a function from $\mathcal{H}$ that has minimal empirical risk:
$$\mathrm{ERM} : S \mapsto \hat{h} \in \operatorname*{argmin}_{h_i \in \mathcal{H}} L_n(h_i).$$
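A small sketch of ERM for a finite class, using a hypothetical class of 21 threshold classifiers on $[0,1]$ and a realizable toy distribution (both chosen only for illustration; the analysis only requires $\mathcal{H}$ to be finite):

```python
import numpy as np

rng = np.random.default_rng(3)

# A finite hypothesis class H = {h_1, ..., h_N}: here, 21 threshold classifiers on [0, 1].
thresholds = np.linspace(0.0, 1.0, num=21)
H = [lambda x, t=t: (x >= t).astype(int) for t in thresholds]

def empirical_risk(h, X, Y):
    return np.mean(h(X) != Y)

def erm(H, X, Y):
    """ERM: return a hypothesis in H with minimal empirical risk on the sample (X, Y)."""
    risks = [empirical_risk(h, X, Y) for h in H]
    return H[int(np.argmin(risks))]   # argmin returns one of the possibly many minimizers

# A realizable toy distribution: labels are generated by a member of H (threshold 0.5).
X = rng.uniform(0.0, 1.0, size=200)
Y = (X >= 0.5).astype(int)

h_hat = erm(H, X, Y)
print(empirical_risk(h_hat, X, Y))   # 0.0 under realizability
```

Note that the sketch simply returns one empirical risk minimizer; there may be several, as discussed next.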

There may be several functions in $\mathcal{H}$ that have lowest empirical risk on a data set $S$. But since every function $h \in \mathcal{H}$ has some empirical risk on data $S$, and the empirical risk can only assume finitely many values (namely multiples of $1/|S| = 1/n$), the argmin is a nonempty subset of $\mathcal{H}$. A learner is an ERM learner if it always outputs some function from this subset.

For now, we will further make a simplifying assumption on the data generating distribution $P$. We will assume that $P$ is realizable with respect to the class $\mathcal{H}$. A distribution is realizable with respect to a hypothesis class $\mathcal{H}$ if there is an $h^* \in \mathcal{H}$ with $L(h^*) = 0$.

Theorem 1. Let $\mathcal{H} = \{h_1, \ldots, h_N\}$ and $\delta \in (0, 1]$. Under the realizability assumption, we have, with probability at least $1 - \delta$ over the generation of the sample $S$,
$$L(\hat{h}) \leq \frac{\log N + \log(1/\delta)}{n}.$$

Proof. Note that $L_n(h^*) = 0$ for all possible samples $S$. Thus, for any $\epsilon > 0$, ERM outputs a function with error at least $\epsilon$ only if $L_n(\hat{h}) = 0$ while $L(\hat{h}) \geq \epsilon$. For every $h \in \mathcal{H}$ with $L(h) \geq \epsilon$, we have
$$P_S\{L_n(h) = 0\} \leq (1 - \epsilon)^n \leq e^{-\epsilon n}.$$
(Recall that, for all $x \in \mathbb{R}$, we have $1 + x \leq e^x$.) Let $\mathcal{H}_\epsilon$ denote the set of functions $h$ in $\mathcal{H}$ with $L(h) \geq \epsilon$. We get, using the union bound,
$$P_S\{L(\hat{h}) \geq \epsilon\} \leq P_S\{\exists h \in \mathcal{H}_\epsilon : L_n(h) = 0\} = P_S\Big\{\bigcup_{h \in \mathcal{H}_\epsilon} \{L_n(h) = 0\}\Big\} \leq |\mathcal{H}_\epsilon| (1 - \epsilon)^n \leq |\mathcal{H}| (1 - \epsilon)^n \leq |\mathcal{H}| e^{-\epsilon n} = N e^{-\epsilon n}.$$
Now we set $\epsilon = \frac{\log N + \log(1/\delta)}{n}$. Plugging in this value for $\epsilon$, we have shown
$$P_S\Big\{L(\hat{h}) \geq \frac{\log N + \log(1/\delta)}{n}\Big\} \leq N e^{-\epsilon n} = \delta,$$
which is equivalent to the statement of the theorem.

Thus, under realizability, we have, with probability at least $1 - \delta$, $L(h_n) - \min_{h \in \mathcal{H}} L(h) \leq f(n)$ for $f(n) = \frac{\log N + \log(1/\delta)}{n}$.
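The bound of Theorem 1 is easy to evaluate numerically; a minimal sketch (the numbers plugged in at the end are arbitrary examples):

```python
import math

def risk_bound(N, n, delta):
    """Theorem 1: under realizability, with probability >= 1 - delta,
    L(h_hat) <= (log N + log(1/delta)) / n for ERM over a class of size N."""
    return (math.log(N) + math.log(1.0 / delta)) / n

def sample_size(N, epsilon, delta):
    """Smallest n for which the bound above drops to at most epsilon."""
    return math.ceil((math.log(N) + math.log(1.0 / delta)) / epsilon)

# Arbitrary example numbers:
print(risk_bound(N=1000, n=5000, delta=0.05))        # about 0.00198
print(sample_size(N=1000, epsilon=0.01, delta=0.05))  # 991
```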