Foundations of Machine Learning

Similar documents
Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

CS 6375 Machine Learning

COMS 4771 Introduction to Machine Learning. Nakul Verma

Machine Learning Theory (CS 6783)

Structured Prediction Theory and Algorithms

IFT Lecture 7 Elements of statistical learning theory

Learning with Imperfect Data

Domain Adaptation for Regression

PAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018

Does Unlabeled Data Help?

The PAC Learning Framework -II

Advanced Introduction to Machine Learning CMU-10715

Introduction to Machine Learning

Machine Learning. Lecture 9: Learning Theory. Feng Li.

The definitions and notation are those introduced in the lecture slides.

Generalization theory

Statistical learning theory, Support vector machines, and Bioinformatics

Introduction and Models

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Generalization, Overfitting, and Model Selection

Consistency of Nearest Neighbor Methods

Loss Functions, Decision Theory, and Linear Models

Lecture 1 Measure concentration

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Advanced Machine Learning

Machine Learning (CS 567) Lecture 2

Learning Weighted Automata

Discriminative Models

Empirical Risk Minimization

Rademacher Complexity Bounds for Non-I.I.D. Processes

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization

Generalization bounds

Introduction to Machine Learning Lecture 14. Mehryar Mohri Courant Institute and Google Research

Generalization and Overfitting

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

The sample complexity of agnostic learning with deterministic labels

Rademacher Bounds for Non-i.i.d. Processes

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Perceptron Mistake Bounds

Lecture 5 Neural models for NLP

6.036 midterm review. Wednesday, March 18, 15

Day 3: Classification, logistic regression

The No-Regret Framework for Online Learning

Active Learning and Optimized Information Gathering

Learning with Rejection

GI01/M055: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning

ECE-271B. Nuno Vasconcelos ECE Department, UCSD

Computational Learning Theory

COMS 4771 Lecture Boosting 1 / 16

Warm up: risk prediction with logistic regression

1 Review of The Learning Setting

CS281B/Stat241B. Statistical Learning Theory. Lecture 1.

Introduction to Statistical Learning Theory

Models, Data, Learning Problems

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

Stephen Scott.

Sample Selection Bias Correction

Binary Classification / Perceptron

Part of the slides are adapted from Ziko Kolter

Solving Classification Problems By Knowledge Sets

Learning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27.

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Domain Adaptation Can Quantity Compensate for Quality?

Introduction to Bayesian Learning. Machine Learning Fall 2018

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

TTIC 31230, Fundamentals of Deep Learning, Winter David McAllester. The Fundamental Equations of Deep Learning

Qualifying Exam in Machine Learning

CSCI-567: Machine Learning (Spring 2019)

Online Learning and Sequential Decision Making

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Speech Recognition Lecture 7: Maximum Entropy Models. Mehryar Mohri Courant Institute and Google Research

Computational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar

Computational Learning Theory. CS534 - Machine Learning

PAC-learning, VC Dimension and Margin-based Bounds

Optimization Methods for Machine Learning (OMML)

ECE 5424: Introduction to Machine Learning

Dan Roth 461C, 3401 Walnut

Recap from previous lecture

COMS 4771 Lecture Course overview 2. Maximum likelihood estimation (review of some statistics)

Topics in Natural Language Processing

Class 2 & 3 Overfitting & Regularization

Learning Theory. Sridhar Mahadevan. University of Massachusetts. p. 1/38

CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims

What is semi-supervised learning?

Machine Learning (COMP-652 and ECSE-608)

Understanding Generalization Error: Bounds and Decompositions

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Variance Reduction and Ensemble Methods

Machine Learning Lecture 7

Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1

Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about 3-4 homework assignments + project (topic of your choice). Mailing list: join as soon as possible. page 2

Course Material Textbook. Slides: course web page, http://www.cs.nyu.edu/~mohri/ml17 page 3

This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 4

Machine Learning Definition: computational methods using experience to improve performance. Experience: data-driven task, thus statistics, probability, and optimization. Computer science: learning algorithms, analysis of complexity, theoretical guarantees. Example: use a document's word counts to predict its topic. page 5

Examples of Learning Tasks Text: document classification, spam detection. Language: NLP tasks (e.g., morphological analysis, POS tagging, context-free parsing, dependency parsing). Speech: recognition, synthesis, verification. Image: annotation, face recognition, OCR, handwriting recognition. Games (e.g., chess, backgammon). Unassisted control of vehicles (robots, cars). Medical diagnosis, fraud detection, network intrusion. page 6

Some Broad ML Tasks Classification: assign a category to each item (e.g., document classification). Regression: predict a real value for each item (prediction of stock values, economic variables). Ranking: order items according to some criterion (relevant web pages returned by a search engine). Clustering: partition data into homogeneous regions (analysis of very large data sets). Dimensionality reduction: find lower-dimensional manifold preserving some properties of the data. page 7

General Objectives of ML Theoretical questions: what can be learned, under what conditions? are there learning guarantees? analysis of learning algorithms. Algorithms: more efficient and more accurate algorithms. deal with large-scale problems. handle a variety of different learning problems. page 8

This Course Theoretical foundations: learning guarantees. analysis of algorithms. Algorithms: main mathematically well-studied algorithms. discussion of their extensions. Applications: illustration of their use. page 9

Topics Probability tools, concentration inequalities. PAC learning model, Rademacher complexity, VC-dimension, generalization bounds. Support vector machines (SVMs), margin bounds, kernel methods. Ensemble methods, boosting. Logistic regression and conditional maximum entropy models. On-line learning, weighted majority algorithm, Perceptron algorithm, mistake bounds. Regression, generalization, algorithms. Ranking, generalization, algorithms. Reinforcement learning, MDPs, bandit problems and algorithms. page 10

Definitions and Terminology Example: item, instance of the data used. Features: attributes associated with an item, often represented as a vector (e.g., word counts). Labels: category (classification) or real value (regression) associated with an item. Data: training data (typically labeled). test data (labeled but labels not seen). validation data (labeled, for tuning parameters). page 11

General Learning Scenarios Settings: batch: learner receives full (training) sample, which he uses to make predictions for unseen points. on-line: learner receives one sample at a time and makes a prediction for that sample. Queries: active: the learner can request the label of a point. passive: the learner receives labeled points. page 12

Standard Batch Scenarios Unsupervised learning: no labeled data. Supervised learning: uses labeled data for prediction on unseen points. Semi-supervised learning: uses labeled and unlabeled data for prediction on unseen points. Transduction: uses labeled and unlabeled data for prediction on seen points. page 13

Example - SPAM Detection Problem: classify each e-mail message as SPAM or non-SPAM (binary classification problem). Potential data: large collection of SPAM and non-SPAM messages (labeled examples). page 14

Learning Stages [Diagram: labeled data is split into a training sample, validation data, and a test sample; an algorithm A(·) is trained on the training sample using features and prior knowledge, parameters are selected on the validation data, and the resulting hypothesis is evaluated on the test sample.] page 15

This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 16

Definitions Spaces: input space X, output space Y. Loss function: L: Y × Y → ℝ. L(ŷ, y): cost of predicting ŷ instead of y. binary classification: 0-1 loss, L(y, y') = 1_{y ≠ y'}. regression: Y ⊆ ℝ, L(y, y') = (y' - y)². Hypothesis set: H ⊆ Y^X, subset of functions out of which the learner selects his hypothesis. depends on features. represents prior knowledge about task. page 17
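The two loss functions above are simple enough to state in code. A minimal Python sketch (the function names are illustrative, not from the slides):

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss for binary classification: 1 if y_pred differs from y_true, else 0."""
    return 1.0 if y_pred != y_true else 0.0

def squared_loss(y_pred, y_true):
    """Squared loss for regression: (y' - y)^2."""
    return (y_true - y_pred) ** 2
```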

Supervised Learning Set-Up Training data: sample S of size m drawn i.i.d. from X × Y according to distribution D: S = ((x_1, y_1), ..., (x_m, y_m)). Problem: find hypothesis h ∈ H with small generalization error. deterministic case: output label is a deterministic function of the input, y = f(x). stochastic case: output is a probabilistic function of the input. page 18

Errors Generalization error: for h ∈ H, it is defined by R(h) = E_{(x,y)∼D}[L(h(x), y)]. Empirical error: for h ∈ H and sample S, it is R̂(h) = (1/m) Σ_{i=1}^m L(h(x_i), y_i). Bayes error: R* = inf_{h measurable} R(h). in the deterministic case, R* = 0. page 19
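The empirical error is just an average of losses over the sample. A small sketch, assuming the 0-1 loss and a hypothesis represented as a Python function (the names and the toy sample are illustrative):

```python
def empirical_error(h, sample):
    """Empirical error of h: average 0-1 loss over the labeled sample S."""
    return sum(1.0 if h(x) != y else 0.0 for x, y in sample) / len(sample)

# Toy example: a threshold hypothesis on the real line.
h = lambda x: 1 if x > 0 else 0
S = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 1)]  # h misclassifies the last point
```

The generalization error R(h) cannot be computed this way: it is an expectation under the unknown distribution D, which is exactly why concentration bounds relating R̂(h) to R(h) matter later in the lecture.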

Noise Noise: in binary classification, for any x ∈ X, noise(x) = min{Pr[1|x], Pr[0|x]}. observe that E[noise(x)] = R*. page 20

Learning ≠ Fitting Notion of simplicity/complexity. How do we define complexity? page 21

Generalization Observations: the best hypothesis on the sample may not be the best overall. generalization is not memorization. complex rules (very complex separation surfaces) can be poor predictors. trade-off: complexity of hypothesis set vs sample size (underfitting/overfitting). page 22

Model Selection General equality: for any h ∈ H, with h* the best-in-class hypothesis, R(h) - R* = [R(h) - R(h*)] (estimation) + [R(h*) - R*] (approximation). Approximation: not a random variable, only depends on H. Estimation: only term we can hope to bound. page 23

Empirical Risk Minimization Select hypothesis set H. Find hypothesis h ∈ H minimizing empirical error: h = argmin_{h ∈ H} R̂(h). but H may be too complex. the sample size may not be large enough. page 24
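For a finite hypothesis set, ERM can be implemented by direct search. A sketch over a toy class of threshold classifiers h_t(x) = 1 if x ≥ t (an assumed example, not from the slides):

```python
def erm_threshold(thresholds, sample):
    """ERM over threshold classifiers h_t(x) = 1[x >= t]:
    return the threshold with the smallest empirical error on the sample."""
    def emp_risk(t):
        return sum((1 if x >= t else 0) != y for x, y in sample) / len(sample)
    return min(thresholds, key=emp_risk)
```

On a separable sample the minimizer attains zero empirical error, but the slide's caveats apply: with too rich an H or too small a sample, small empirical error does not imply small generalization error.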

Generalization Bounds Definition: upper bound on Pr[sup_{h ∈ H} |R(h) - R̂(h)| > ε]. Bound on the estimation error of the hypothesis h_0 given by ERM: using R̂(h_0) ≤ R̂(h*), R(h_0) - R(h*) = R(h_0) - R̂(h_0) + R̂(h_0) - R(h*) ≤ R(h_0) - R̂(h_0) + R̂(h*) - R(h*) ≤ 2 sup_{h ∈ H} |R(h) - R̂(h)|. How should we choose H? (model selection problem) page 25

Model Selection [Figure: error as a function of the complexity of H: the estimation error increases with complexity, the approximation error decreases, and their sum gives the upper bound on the error.] H = ∪_γ H_γ. page 26

Structural Risk Minimization (Vapnik, 1995) Principle: consider an infinite sequence of hypothesis sets ordered for inclusion, H_1 ⊆ H_2 ⊆ ... ⊆ H_n ⊆ ..., and select h = argmin_{h ∈ H_n, n ∈ ℕ} R̂(h) + penalty(H_n, m). strong theoretical guarantees. typically computationally hard. page 27

General Algorithm Families Empirical risk minimization (ERM): h = argmin_{h ∈ H} R̂(h). Structural risk minimization (SRM): H_n ⊆ H_{n+1}, h = argmin_{h ∈ H_n, n ∈ ℕ} R̂(h) + penalty(H_n, m). Regularization-based algorithms: for λ ≥ 0, h = argmin_{h ∈ H} R̂(h) + λ‖h‖². page 28
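To make the regularization-based family concrete, here is a hedged one-dimensional example (an assumed toy case, not from the slides): linear hypotheses h_w(x) = w·x with the squared loss, where the regularized objective (1/m) Σ (w·x_i - y_i)² + λw² has a closed-form minimizer.

```python
def regularized_ls_1d(sample, lam):
    """Regularization-based algorithm for h_w(x) = w * x with squared loss:
    minimize (1/m) * sum_i (w * x_i - y_i)^2 + lam * w^2.
    Setting the derivative in w to zero gives w = mean(x*y) / (mean(x^2) + lam)."""
    m = len(sample)
    mean_xx = sum(x * x for x, _ in sample) / m
    mean_xy = sum(x * y for x, y in sample) / m
    return mean_xy / (mean_xx + lam)
```

With lam = 0 this reduces to plain ERM; increasing lam shrinks w toward 0, trading empirical fit for a smaller-norm (simpler) hypothesis.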

This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 29

Basic Properties Union bound: Pr[A ∨ B] ≤ Pr[A] + Pr[B]. Inversion: if Pr[X ≥ ε] ≤ f(ε), then, for any δ > 0, with probability at least 1 - δ, X ≤ f⁻¹(δ). Jensen's inequality: if f is convex, f(E[X]) ≤ E[f(X)]. Expectation: if X ≥ 0, E[X] = ∫_0^{+∞} Pr[X > t] dt. page 30
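The expectation identity can be checked numerically. A sketch using the exponential distribution with mean 1 (tail Pr[X > t] = e^{-t}) as an assumed example; the function name and truncation point are illustrative:

```python
import math

def expectation_from_tail(tail, upper, steps=100_000):
    """Numerically evaluate E[X] = integral_0^inf Pr[X > t] dt for X >= 0,
    via a left Riemann sum truncated at `upper`."""
    dt = upper / steps
    return sum(tail(i * dt) for i in range(steps)) * dt

# Exponential(1): Pr[X > t] = exp(-t), so the integral should come out close to 1.
```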

Basic Inequalities Markov's inequality: if X ≥ 0 and ε > 0, then Pr[X ≥ ε] ≤ E[X]/ε. Chebyshev's inequality: for any ε > 0, Pr[|X - E[X]| ≥ ε] ≤ σ²_X/ε². page 31
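Both bounds can be sanity-checked on a concrete discrete distribution. An assumed example: X with Pr[X = 0] = 0.9 and Pr[X = 10] = 0.1, so E[X] = 1 and Var[X] = 9; the true tail Pr[X ≥ 10] = 0.1 meets Markov's bound 1/10 exactly.

```python
def markov_bound(mean, eps):
    """Markov: for X >= 0 and eps > 0, Pr[X >= eps] <= E[X] / eps."""
    return mean / eps

def chebyshev_bound(var, eps):
    """Chebyshev: Pr[|X - E[X]| >= eps] <= Var[X] / eps^2."""
    return var / eps ** 2
```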

Hoeffding's Inequality Theorem: let X_1, ..., X_m be independent random variables with the same expectation μ and X_i ∈ [a, b] (a < b). Then, for any ε > 0, the following inequalities hold: Pr[μ - (1/m) Σ_{i=1}^m X_i > ε] ≤ exp(-2mε²/(b - a)²), Pr[(1/m) Σ_{i=1}^m X_i - μ > ε] ≤ exp(-2mε²/(b - a)²). page 32
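The bound is easy to evaluate as a function of the sample size. A small sketch (function name illustrative):

```python
import math

def hoeffding_bound(m, eps, a, b):
    """One-sided Hoeffding tail:
    Pr[(1/m) sum X_i - mu > eps] <= exp(-2 * m * eps^2 / (b - a)^2)."""
    return math.exp(-2.0 * m * eps ** 2 / (b - a) ** 2)
```

For variables in [0, 1] (e.g., 0-1 losses), m = 100 and ε = 0.1 give exp(-2) ≈ 0.135, and quadrupling m to 400 drives the bound down to exp(-8), illustrating the exponential decay in m.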

McDiarmid's Inequality (McDiarmid, 1989) Theorem: let X_1, ..., X_m be independent random variables taking values in U and f: U^m → ℝ a function verifying, for all i ∈ [1, m], sup_{x_1,...,x_m, x'_i} |f(x_1, ..., x_i, ..., x_m) - f(x_1, ..., x'_i, ..., x_m)| ≤ c_i. Then, for all ε > 0, Pr[|f(X_1, ..., X_m) - E[f(X_1, ..., X_m)]| > ε] ≤ 2 exp(-2ε²/Σ_{i=1}^m c_i²). page 33

Appendix page 34

Markov's Inequality Theorem: let X be a non-negative random variable with E[X] < ∞; then, for all t > 0, Pr[X ≥ t E[X]] ≤ 1/t. Proof: Pr[X ≥ t E[X]] = Σ_{x ≥ t E[X]} Pr[X = x] ≤ Σ_{x ≥ t E[X]} (x/(t E[X])) Pr[X = x] ≤ Σ_x (x/(t E[X])) Pr[X = x] = E[X/(t E[X])] = 1/t. page 35

Chebyshev's Inequality Theorem: let X be a random variable with Var[X] < ∞; then, for all t > 0, Pr[|X - E[X]| ≥ t σ_X] ≤ 1/t². Proof: observe that Pr[|X - E[X]| ≥ t σ_X] = Pr[(X - E[X])² ≥ t² σ²_X]. The result follows from Markov's inequality. page 36

Weak Law of Large Numbers Theorem: let (X_n)_{n ∈ ℕ} be a sequence of independent random variables with the same mean μ and variance σ² < ∞, and let X̄_n = (1/n) Σ_{i=1}^n X_i; then, for any ε > 0, lim_{n→∞} Pr[|X̄_n - μ| ≥ ε] = 0. Proof: since the variables are independent, Var[X̄_n] = Σ_{i=1}^n Var[X_i/n] = n σ²/n² = σ²/n. Thus, by Chebyshev's inequality, Pr[|X̄_n - μ| ≥ ε] ≤ σ²/(n ε²). page 37
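The convergence can be checked exactly for fair coin flips (an assumed example), where the deviation probability is computable from the binomial law and σ² = 1/4:

```python
from math import comb

def deviation_prob(n, eps):
    """Exact Pr[|mean of n fair coin flips - 1/2| >= eps], via the binomial law."""
    return sum(comb(n, k) for k in range(n + 1)
               if abs(k / n - 0.5) >= eps) / 2 ** n

def chebyshev_rate(n, eps):
    """Chebyshev bound sigma^2 / (n * eps^2), with sigma^2 = 1/4 for a fair coin."""
    return 0.25 / (n * eps ** 2)
```

The exact probability sits well below the Chebyshev rate and shrinks as n grows, as the theorem predicts.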

Concentration Inequalities Some general tools for error analysis and bounds: Hoeffding s inequality (additive). Chernoff bounds (multiplicative). McDiarmid s inequality (more general). page 38

Hoeffding's Lemma Lemma: let X be a random variable with E[X] = 0 and X ∈ [a, b], b ≠ a. Then, for any t > 0, E[e^{tX}] ≤ e^{t²(b-a)²/8}. Proof: by convexity of x ↦ e^{tx}, for all a ≤ x ≤ b, e^{tx} ≤ ((b - x)/(b - a)) e^{ta} + ((x - a)/(b - a)) e^{tb}. Thus, since E[X] = 0, E[e^{tX}] ≤ E[((b - X)/(b - a)) e^{ta} + ((X - a)/(b - a)) e^{tb}] = (b/(b - a)) e^{ta} - (a/(b - a)) e^{tb} = e^{φ(t)}, with φ(t) = log((b/(b - a)) e^{ta} - (a/(b - a)) e^{tb}) = ta + log(b/(b - a) - (a/(b - a)) e^{t(b-a)}). page 39

Taking the derivative gives: φ'(t) = a - a e^{t(b-a)} / (b/(b - a) - (a/(b - a)) e^{t(b-a)}). Note that φ(0) = 0 and φ'(0) = 0. Furthermore, writing α = -a/(b - a) and u = α e^{t(b-a)} / ((1 - α) + α e^{t(b-a)}), φ''(t) = α(1 - α)(b - a)² e^{t(b-a)} / [(1 - α) + α e^{t(b-a)}]² = u(1 - u)(b - a)² ≤ (b - a)²/4. There exists 0 ≤ θ ≤ t such that φ(t) = φ(0) + t φ'(0) + (t²/2) φ''(θ) ≤ t²(b - a)²/8. page 40

Hoeffding's Theorem Theorem: let X_1, ..., X_m be independent random variables with X_i ∈ [a_i, b_i]. Then, for S_m = Σ_{i=1}^m X_i and any ε > 0, the following inequalities hold: Pr[S_m - E[S_m] ≥ ε] ≤ e^{-2ε²/Σ_{i=1}^m (b_i - a_i)²}, Pr[S_m - E[S_m] ≤ -ε] ≤ e^{-2ε²/Σ_{i=1}^m (b_i - a_i)²}. Proof: the proof is based on Chernoff's bounding technique: for any random variable X and t > 0, apply Markov's inequality and select t to minimize Pr[X ≥ ε] = Pr[e^{tX} ≥ e^{tε}] ≤ E[e^{tX}]/e^{tε}. page 41

Using this scheme and the independence of the random variables gives Pr[S_m - E[S_m] ≥ ε] ≤ e^{-tε} E[e^{t(S_m - E[S_m])}] = e^{-tε} Π_{i=1}^m E[e^{t(X_i - E[X_i])}] (lemma applied to X_i - E[X_i]) ≤ e^{-tε} Π_{i=1}^m e^{t²(b_i - a_i)²/8} = e^{-tε} e^{t² Σ_{i=1}^m (b_i - a_i)²/8} ≤ e^{-2ε²/Σ_{i=1}^m (b_i - a_i)²}, choosing t = 4ε/Σ_{i=1}^m (b_i - a_i)². The second inequality is proved in a similar way. page 42

Hoeffding's Inequality Corollary: for any ε > 0, any distribution D, and any hypothesis h: X → {0, 1}, the following inequalities hold: Pr[R(h) - R̂(h) ≥ ε] ≤ e^{-2mε²}, Pr[R̂(h) - R(h) ≥ ε] ≤ e^{-2mε²}. Proof: follows directly from Hoeffding's theorem. Combining these one-sided inequalities yields Pr[|R̂(h) - R(h)| ≥ ε] ≤ 2e^{-2mε²}. page 43
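Inverting the two-sided bound gives a sample-size estimate: 2e^{-2mε²} ≤ δ exactly when m ≥ log(2/δ)/(2ε²). A small sketch (function name illustrative):

```python
import math

def sample_complexity(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps^2) <= delta,
    i.e. m >= log(2 / delta) / (2 * eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For ε = 0.05 and δ = 0.01 this gives on the order of a thousand examples; note the guarantee holds for a single fixed hypothesis h, and uniform guarantees over a hypothesis set H require more (the subject of later lectures).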

Chernoff's Inequality Theorem: for any γ > 0, any distribution D, and any hypothesis h: X → {0, 1}, the following inequalities hold: Pr[R̂(h) ≥ (1 + γ) R(h)] ≤ e^{-m R(h) γ²/3}, Pr[R̂(h) ≤ (1 - γ) R(h)] ≤ e^{-m R(h) γ²/2}. Proof: based on Chernoff's bounding technique. page 44

McDiarmid's Inequality (McDiarmid, 1989) Theorem: let X_1, ..., X_m be independent random variables taking values in U and f: U^m → ℝ a function verifying, for all i ∈ [1, m], sup_{x_1,...,x_m, x'_i} |f(x_1, ..., x_i, ..., x_m) - f(x_1, ..., x'_i, ..., x_m)| ≤ c_i. Then, for all ε > 0, Pr[|f(X_1, ..., X_m) - E[f(X_1, ..., X_m)]| > ε] ≤ 2 exp(-2ε²/Σ_{i=1}^m c_i²). page 45

Comments: Proof: uses Hoeffding's lemma. Hoeffding's inequality is a special case of McDiarmid's with f(x_1, ..., x_m) = (1/m) Σ_{i=1}^m x_i and c_i = (b_i - a_i)/m. page 46

Jensen's Inequality Theorem: let X be a random variable and f a measurable convex function. Then, f(E[X]) ≤ E[f(X)]. Proof: by the definition of convexity, continuity of convex functions, and density of finite distributions. page 47