Machine Learning Theory (CS 6783)

Lecture 2: Learning Frameworks, Examples

1 Setting up learning problems

1. $\mathcal{X}$: instance space or input space. Examples:
   - Computer vision: a raw $M \times N$ image vectorized, $\mathcal{X} = \{0, \ldots, 255\}^{M \times N}$, or SIFT features (typically $\mathcal{X} \subseteq \mathbb{R}^d$)
   - Speech recognition: Mel cepstral coefficients, $\mathcal{X} \subseteq \mathbb{R}^{2 \times \mathrm{length}}$
   - Natural language processing: bag-of-words features ($\mathcal{X} \subseteq \mathbb{N}^{\mathrm{document\ size}}$), $n$-grams

2. $\mathcal{Y}$: outcome space or label space. Examples: binary classification $\mathcal{Y} = \{\pm 1\}$, multiclass classification $\mathcal{Y} = \{1, \ldots, K\}$, regression $\mathcal{Y} \subseteq \mathbb{R}$.

3. $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$: loss function (measures prediction error). Examples: classification $\ell(y', y) = \mathbf{1}\{y' \neq y\}$, support vector machines (hinge loss) $\ell(y', y) = \max\{0, 1 - y' y\}$, regression (squared loss) $\ell(y', y) = (y' - y)^2$.

4. $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$: model / hypothesis class (set of functions from input space to outcome space). Examples:
   - Linear classifiers: $\mathcal{F} = \{x \mapsto \mathrm{sign}(f^\top x) : f \in \mathbb{R}^d\}$
   - Linear SVM: $\mathcal{F} = \{x \mapsto f^\top x : f \in \mathbb{R}^d, \|f\|_2 \le R\}$
   - Neural networks (deep learning): $\mathcal{F} = \{x \mapsto \sigma(W_{\mathrm{out}}\, \sigma(W_K\, \sigma(\cdots \sigma(W_2\, \sigma(W_1 x)))))\}$ where $\sigma$ is some non-linear transformation (e.g. ReLU)

The learner observes a sample $S = (x_1, y_1), \ldots, (x_n, y_n)$.

Learning algorithm (forecasting strategy, estimation procedure):
$$\hat{y} : \mathcal{X} \times (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{Y}$$
Given a new input instance $x$, the learning algorithm predicts $\hat{y}(x, S)$. When the context is clear (i.e. the sample $S$ is understood) we will fudge notation and simply write $\hat{y}(\cdot) = \hat{y}(\cdot, S)$; $\hat{y}$ is the predictor returned by the learning algorithm.
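To make these four ingredients concrete, here is a minimal Python sketch (not part of the notes): a zero-one and a hinge loss, a small finite class of linear classifiers, and a learner that maps a sample $S$ to a predictor. The random class and the empirical-risk-minimizing picker at the end are illustrative choices only.

# Minimal sketch (not from the notes): the ingredients above for binary
# classification with the zero-one loss and a small finite hypothesis class.
import numpy as np

def zero_one_loss(y_pred, y):          # l(y', y) = 1{y' != y}
    return float(y_pred != y)

def hinge_loss(y_pred, y):             # l(y', y) = max{0, 1 - y'y}
    return max(0.0, 1.0 - y_pred * y)

# A toy finite hypothesis class of linear classifiers x -> sign(f^T x).
rng = np.random.default_rng(0)
d, N = 5, 100
F = [rng.standard_normal(d) for _ in range(N)]        # the class F = {f_1, ..., f_N}
predict = lambda f, x: np.sign(f @ x) or 1.0          # sign, breaking ties to +1

# The learner sees a sample S = (x_1, y_1), ..., (x_n, y_n) and returns a predictor.
def erm(F, S):
    """Pick the hypothesis with smallest empirical zero-one loss on S."""
    return min(F, key=lambda f: sum(zero_one_loss(predict(f, x), y) for x, y in S))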

Example: linear SVM. The learning algorithm solves the optimization problem
$$w_{\mathrm{SVM}} = \underset{w}{\mathrm{argmin}}\ \frac{1}{n}\sum_{t=1}^{n} \max\{0, 1 - y_t w^\top x_t\} + \lambda\, w^\top w$$
and the predictor is $\hat{y}(x) = \hat{y}(x, S) = w_{\mathrm{SVM}}^\top x$.

1.1 PAC Framework

$\mathcal{Y} = \{\pm 1\}$, $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. Input instances are generated as $x_1, \ldots, x_n \sim D_X$ where $D_X$ is some unknown distribution over the input space. The labels are generated as $y_t = f^*(x_t)$ where the target function $f^* \in \mathcal{F}$. The learning algorithm only gets the sample $S$ and does not know $f^*$ or $D_X$.

Goal: find $\hat{y}$ that minimizes $P_{x \sim D_X}(\hat{y}(x) \neq f^*(x))$.

1.2 Non-parametric Regression

$\mathcal{Y} \subseteq \mathbb{R}$, $\ell(y', y) = (y' - y)^2$. Input instances are generated as $x_1, \ldots, x_n \sim D_X$ where $D_X$ is some unknown distribution over the input space. The labels are generated as $y_t = f^*(x_t) + \varepsilon_t$ with $\varepsilon_t \sim N(0, \sigma)$, where the target function $f^* \in \mathcal{F}$. The learning algorithm only gets the sample $S$ and does not know $f^*$ or $D_X$.

Goal: find $\hat{y}$ that minimizes $\mathbb{E}_{x \sim D_X}(\hat{y}(x) - f^*(x))^2 =: \|\hat{y} - f^*\|_{L_2(D_X)}^2$.

1.3 Statistical Learning (Agnostic PAC)

Generic $\mathcal{X}$, $\mathcal{Y}$, $\ell$ and $\mathcal{F}$. Samples are generated as $(x_1, y_1), \ldots, (x_n, y_n) \sim D$ where $D$ is some unknown distribution over $\mathcal{X} \times \mathcal{Y}$.

Goal: find $\hat{y}$ that minimizes
$$\mathbb{E}_{(x,y) \sim D}\, \ell(\hat{y}(x), y) - \inf_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim D}\, \ell(f(x), y).$$
For any mapping $g : \mathcal{X} \to \mathcal{Y}$ we shall use the notation $L_D(g) = \mathbb{E}_{(x,y) \sim D}\, \ell(g(x), y)$, and so our goal can be rewritten as minimizing
$$L_D(\hat{y}) - \inf_{f \in \mathcal{F}} L_D(f).$$

Remarks:
1. $\hat{y}$ is a random quantity as it depends on the sample.
2. Hence the formal statements we make will be in high probability over the sample or in expectation over the draw of samples.
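Returning to the linear SVM example at the top of this section, the following Python sketch minimizes the stated objective by plain subgradient descent and then estimates $L_D(\hat{y})$ for the zero-one loss on fresh samples. The notes only state the optimization problem, so the solver, step sizes, and synthetic data below are assumptions made purely for illustration.

# Sketch (the notes state only the objective, not a solver): subgradient descent on
#   (1/n) sum_t max{0, 1 - y_t w^T x_t} + lambda * w^T w .
import numpy as np

def train_linear_svm(X, y, lam=0.1, steps=2000, lr=0.05):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)                                   # y_t w^T x_t for all t
        active = margins < 1                                    # points with positive hinge loss
        grad = -(y[active][:, None] * X[active]).sum(axis=0) / n + 2 * lam * w
        w -= lr * grad
    return w

# Illustrative synthetic data (an assumption, purely for the example).
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X = rng.standard_normal((500, 5))
y = np.sign(X @ w_true)

w_svm = train_linear_svm(X, y)
yhat = lambda x: w_svm @ x                                      # predictor yhat(x) = w_svm^T x

# Estimating L_D(yhat) under the zero-one loss with fresh samples from the same D:
X_test = rng.standard_normal((5000, 5))
y_test = np.sign(X_test @ w_true)
print("estimated L_D(yhat):", np.mean(np.sign(X_test @ w_svm) != y_test))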

1.4 Online Learning

For $t = 1$ to $n$:
  (a) Input instance $x_t \in \mathcal{X}$ is produced
  (b) Learning algorithm outputs prediction $\hat{y}_t$
  (c) True outcome $y_t$ is revealed to the learner
End For

One can think of $\hat{y}_t = \hat{y}_t(x_t, ((x_1, y_1), \ldots, (x_{t-1}, y_{t-1})))$.

Goal: find a learning algorithm $\hat{y}$ that minimizes regret w.r.t. the hypothesis class $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$, given by
$$\mathrm{Reg}_n = \frac{1}{n}\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^{n} \ell(f(x_t), y_t).$$

2 Example 1: Classification using a Finite Class, Realizable Setting

In this section we consider the classification setting where $\mathcal{Y} = \{\pm 1\}$ and $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. We further make the realizability assumption, meaning $y_t = f^*(x_t)$ where $f^* \in \mathcal{F}$ is of course not known to the learner.

2.1 Online Framework

The online framework is just as described earlier, with the realizability assumption added in. That is, at every round the true label $y_t$ revealed to us is set as $y_t = f^*(x_t)$ for some fixed $f^* \in \mathcal{F}$ not known to the learning algorithm. However, the $x_t$'s can be presented to us arbitrarily.

First note that under the realizability assumption we have
$$\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{t=1}^{n} \ell(f(x_t), y_t) = \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f^*(x_t) \neq y_t\} = 0.$$
Hence the aim in such a framework is simply to minimize the number of mistakes $\sum_{t=1}^{n} \ell(\hat{y}_t, y_t)$ and prove mistake bounds.

Now say $\mathcal{F} = \{f_1, \ldots, f_N\}$, a finite set of hypotheses. What strategy can we provide for this problem? How well does it work?

If we simply pick some hypothesis that has not made a mistake so far, such an algorithm can make a large number of mistakes (e.g. as many as $N - 1$).

A simple strategy that works in this scenario is the following. At any point $t$, we have observed $x_1, \ldots, x_{t-1}$ and labels $y_1, \ldots, y_{t-1}$. Let
$$\mathcal{F}_t = \{f \in \mathcal{F} : \forall i < t,\ f(x_i) = y_i\}.$$
Now, given $x_t$, we pick $\hat{y}_t = \mathrm{sign}\big(\sum_{f \in \mathcal{F}_t} f(x_t)\big)$. That is, we go with the majority of the predictions of the hypotheses in $\mathcal{F}_t$. How well does this algorithm work?
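Here is a small sketch of this majority-vote strategy (not from the notes). It assumes the finite class is given as a list of prediction functions; the example at the end uses threshold classifiers on the line with a realizable target, and prints the number of mistakes next to the mistake bound proved in the claim below.

# Minimal sketch (not from the notes) of the majority-vote strategy over a
# finite class F = {f_1, ..., f_N} in the realizable online setting.
import numpy as np

def halving_online(F, xs, ys):
    """Run the majority-vote strategy on the sequence (x_t, y_t); return the mistake count."""
    version_space = list(F)                              # F_t: hypotheses consistent so far
    mistakes = 0
    for x_t, y_t in zip(xs, ys):
        votes = sum(f(x_t) for f in version_space)
        y_hat = 1.0 if votes >= 0 else -1.0              # majority vote (ties -> +1)
        if y_hat != y_t:
            mistakes += 1
        # keep only the hypotheses that got this round right
        version_space = [f for f in version_space if f(x_t) == y_t]
    return mistakes

# Example: N threshold classifiers on [0, 1], with f* in the class (realizable).
rng = np.random.default_rng(0)
N, n = 64, 500
thresholds = np.linspace(0, 1, N)
F = [lambda x, th=th: 1.0 if x >= th else -1.0 for th in thresholds]
f_star = F[17]
xs = rng.random(n)
ys = [f_star(x) for x in xs]
print(halving_online(F, xs, ys), "mistakes, versus log2(N) =", np.log2(N))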

Claim. For ay sequece x,..., x, the above algorithm makes at most log 2 N umber of mistakes. Proof. Notice that each time we make a mistake, ie. sig( f F t f(x t )) y t, the we kow that at least half the umber of fuctios i F t are wrog ad so each time we make a mistake, F t+ F t /2 ad hece, we ca make at most log 2 N umber of mistakes. That is the average error is log 2 N. 2.2 PAC Framework I the PAC framework, x,..., x are draw iid from some fixed distributio D X ad our goal is to miimize P x Dx (ŷ(x) f (x)) either i expectatio or high probability over sample {x,..., x }. Ulike the olie settig, i the PAC settig oe ca simply pick ay hypothesis that has ot made ay mistakes o traiig sample. That is, ŷ(, S) = argmi f F {f(x t ) y t }. (x t,y t) S How well does this algorithm work? How should we aalyze this? Let us show a boud of error with high probability over samples. To this ed we will use the so called Berstei cocetratio boud. Fact: Cosider biary r.v. Z,..., Z draw iid. Let µ = EZ be their expectatio. We have the followig boud o the average of these radom variables. (otice that sice Z s are biary their variace if give by µ µ 2 ) ( ) ( ) P µ Z t > θ exp θ2 2µ + θ 3 Now for ay f F, let Z f t = {f(x t ) f (x t ) where x t are draw from D X. Note that EZ f = P x DX (f(x) f (x)). Hece ote that for ay sigle f F, ( ) ( ) P S P x DX (f(x) f (x)) {f(x t ) f (x t )} > θ exp θ2 2µ + θ 3 Let use write the R.H.S. above as δ, ad hece, rewritig, we have that with probability at least δ over sample, P x DX (f(x) f (x)) {f(x t ) f (x t )} log(/δ) Px DX (f(x) f + (x)) log(/δ) 3 This upo further massagig (use iequality ab a/2 + b/2) leads to the boud P x DX (f(x) f (x)) 2 {f(x t ) f (x t )} 2 log(/δ) 4

Using a union bound, we have that for any $\delta > 0$, with probability at least $1 - \delta$ over the sample, simultaneously for all $f \in \mathcal{F}$,
$$P_{x \sim D_X}(f(x) \neq f^*(x)) \le \frac{2}{n}\sum_{t=1}^{n} \mathbf{1}\{f(x_t) \neq f^*(x_t)\} + \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$
Since $\hat{y} \in \mathcal{F}$, from the above we conclude that, for any $\delta > 0$, with probability at least $1 - \delta$ over the sample,
$$P_{x \sim D_X}(\hat{y}(x) \neq f^*(x)) \le \frac{2}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq f^*(x_t)\} + \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$
But note that by the realizability assumption and the definition of $\hat{y}$, we have $\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq f^*(x_t)\} = \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y}(x_t) \neq y_t\} = 0$, and so, with probability at least $1 - \delta$ over the sample,
$$P_{x \sim D_X}(\hat{y}(x) \neq f^*(x)) \le \frac{2\log(|\mathcal{F}|/\delta)}{n}.$$

3 Example 2: Predicting Bits

3.1 Statistical Learning

We consider, as a warm-up example, the simplest statistical learning/prediction problem: that of learning coin flips! Let us consider the case where we don't receive any input instance (i.e. $\mathcal{X}$ is a single dummy point) and $\mathcal{Y} = \{\pm 1\}$. We receive $\pm 1$-valued samples $y_1, \ldots, y_n \in \{\pm 1\}$ drawn iid from a Bernoulli distribution with parameter $p$ (i.e. $Y$ is $+1$ with probability $p$ and $-1$ with probability $1 - p$). Our loss function is the zero-one loss $\ell(y', y) = \mathbf{1}\{y' \neq y\}$. Recall that our goal in statistical learning is to minimize $L_p(\hat{y}) - \inf_{f \in \mathcal{F}} L_p(f)$. (Effectively, our only choice of $\mathcal{F}$ for this problem is the set of constant mappings, $\mathcal{F} = \{\pm 1\}$.)

Claim 2. Let $\hat{y} = \mathrm{sign}\left(\frac{1}{n}\sum_{t=1}^{n} y_t\right)$ be the prediction rule we use. For the problem above, for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_p(\hat{y}) - \inf_{f \in \mathcal{F}} L_p(f) \le 2\sqrt{\frac{\log(4/\delta)}{n}}.$$
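Before the proof, here is a small simulation sketch of Claim 2 (not from the notes): it draws many samples of $n$ bits, applies the majority-vote rule, and checks how often the excess risk exceeds the claimed bound. The values of $p$, $n$, and $\delta$ are arbitrary illustrative choices.

# Simulation sketch (not from the notes) of Claim 2: predict the majority label
# of n iid +/-1 bits and compare the excess risk to the high-probability bound.
import numpy as np

rng = np.random.default_rng(0)
p, n, delta, trials = 0.6, 200, 0.05, 5000

best_risk = min(1 - p, p)                               # inf_f L_p(f), attained by sign(2p-1)
bound = 2 * np.sqrt(np.log(4 / delta) / n)              # bound of Claim 2

excess = []
for _ in range(trials):
    y = np.where(rng.random(n) < p, 1.0, -1.0)          # y_t = +1 w.p. p, -1 w.p. 1-p
    y_hat = 1.0 if y.mean() >= 0 else -1.0              # yhat = sign((1/n) sum_t y_t)
    risk = (1 - p) if y_hat == 1.0 else p               # L_p(yhat) = P(y != yhat)
    excess.append(risk - best_risk)

print("fraction of trials exceeding the bound:", np.mean(np.array(excess) > bound),
      "(should be at most delta =", delta, ")")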

Proof. Note that
$$
\begin{aligned}
L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f)
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \min_{f \in \{\pm 1\}} \mathbb{E}_{y \sim p}\, \mathbf{1}\{f \neq y\} \\
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&= \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} + \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&\le \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq \hat{y}\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\hat{y} \neq y_t\} + \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y_t\} - \mathbb{E}_{y \sim p}\, \mathbf{1}\{\mathrm{sign}(2p - 1) \neq y\} \\
&\le 2 \max_{f \in \{\pm 1\}} \left| \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq f\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\} \right|,
\end{aligned}
$$
where the first inequality uses the fact that $\hat{y}$ minimizes the empirical error $\frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\}$ over constant predictions $f \in \{\pm 1\}$. Hence we conclude that
$$P_S\left(L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) > \theta\right) \le P\left(2 \max_{f \in \{\pm 1\}} \left| \mathbb{E}_{y \sim p}\, \mathbf{1}\{y \neq f\} - \frac{1}{n}\sum_{t=1}^{n} \mathbf{1}\{f \neq y_t\} \right| > \theta\right).$$
Now we can bound the right-hand side above using the Hoeffding/Bernstein bound plus a union bound over the two choices of $f$ as
$$P_S\left(L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) > \theta\right) \le 4 \exp\left(-\frac{n\theta^2}{2}\right).$$
Written another way, we can claim that for any $\delta > 0$, with probability at least $1 - \delta$,
$$L_p(\hat{y}) - \min_{f \in \{\pm 1\}} L_p(f) \le 2\sqrt{\frac{\log(4/\delta)}{n}}. \qquad \square$$

3.2 Can we even hope to handle this problem in the online setting?