Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1
Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about 3-4 homework assignments + project (topic of your choice). Mailing list: join as soon as possible. page 2
Course Material Textbook. Slides: course web page. http://www.cs.nyu.edu/~mohri/ml17 page 3
This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 4
Machine Learning Definition: computational methods using experience to improve performance. Experience: data-driven task, thus statistics, probability, and optimization. Computer science: learning algorithms, analysis of complexity, theoretical guarantees. Example: use a document's word counts to predict its topic. page 5
Examples of Learning Tasks Text: document classification, spam detection. Language: NLP tasks (e.g., morphological analysis, POS tagging, context-free parsing, dependency parsing). Speech: recognition, synthesis, verification. Image: annotation, face recognition, OCR, handwriting recognition. Games (e.g., chess, backgammon). Unassisted control of vehicles (robots, cars). Medical diagnosis, fraud detection, network intrusion. page 6
Some Broad ML Tasks Classification: assign a category to each item (e.g., document classification). Regression: predict a real value for each item (prediction of stock values, economic variables). Ranking: order items according to some criterion (relevant web pages returned by a search engine). Clustering: partition data into homogeneous regions (analysis of very large data sets). Dimensionality reduction: find a lower-dimensional manifold preserving some properties of the data. page 7
General Objectives of ML Theoretical questions: what can be learned, under what conditions? are there learning guarantees? analysis of learning algorithms. Algorithms: more efficient and more accurate algorithms. deal with large-scale problems. handle a variety of different learning problems. page 8
This Course Theoretical foundations: learning guarantees. analysis of algorithms. Algorithms: main mathematically well-studied algorithms. discussion of their extensions. Applications: illustration of their use. page 9
Topics Probability tools, concentration inequalities. PAC learning model, Rademacher complexity, VC-dimension, generalization bounds. Support vector machines (SVMs), margin bounds, kernel methods. Ensemble methods, boosting. Logistic regression and conditional maximum entropy models. On-line learning, weighted majority algorithm, Perceptron algorithm, mistake bounds. Regression, generalization, algorithms. Ranking, generalization, algorithms. Reinforcement learning, MDPs, bandit problems and algorithms. page 10
Definitions and Terminology Example: item, instance of the data used. Features: attributes associated to an item, often represented as a vector (e.g., word counts). Labels: category (classification) or real value (regression) associated to an item. Data: training data (typically labeled). test data (labeled but labels not seen). validation data (labeled, for tuning parameters). page 11
General Learning Scenarios Settings: batch: learner receives full (training) sample, which he uses to make predictions for unseen points. on-line: learner receives one sample at a time and makes a prediction for that sample. Queries: active: the learner can request the label of a point. passive: the learner receives labeled points. page 12
Standard Batch Scenarios Unsupervised learning: no labeled data. Supervised learning: uses labeled data for prediction on unseen points. Semi-supervised learning: uses labeled and unlabeled data for prediction on unseen points. Transduction: uses labeled and unlabeled data for prediction on seen points. page 13
Example - SPAM Detection Problem: classify each e-mail message as SPAM or non-SPAM (binary classification problem). Potential data: large collection of SPAM and non-SPAM messages (labeled examples). page 14
Learning Stages [diagram: labeled data split into a training sample, validation data, and a test sample; features and prior knowledge feed the learning algorithm, which is trained on the training sample; parameter selection on the validation data fixes the final hypothesis, which is evaluated on the test sample] page 15
This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 16
Definitions
Spaces: input space $X$, output space $Y$.
Loss function: $L \colon Y \times Y \to \mathbb{R}$; $L(\hat{y}, y)$: cost of predicting $\hat{y}$ instead of $y$.
binary classification: 0-1 loss, $L(y, y') = 1_{y \neq y'}$.
regression: $Y \subseteq \mathbb{R}$, $L(y, y') = (y' - y)^2$.
Hypothesis set: $H \subseteq Y^X$, subset of functions out of which the learner selects his hypothesis.
depends on features.
represents prior knowledge about the task. page 17
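A minimal Python sketch of the two loss functions on this slide (function names are illustrative, not from the course):

```python
def zero_one_loss(y_pred, y):
    """0-1 loss for binary classification: 1 if the labels differ, else 0."""
    return 0.0 if y_pred == y else 1.0

def squared_loss(y_pred, y):
    """Squared loss for regression: (y_pred - y)^2."""
    return (y_pred - y) ** 2
```

Any function $L \colon Y \times Y \to \mathbb{R}$ penalizing a prediction against the true label plays the same role.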
Supervised Learning Set-Up
Training data: sample $S$ of size $m$ drawn i.i.d. from $X \times Y$ according to distribution $D$:
$$S = ((x_1, y_1), \ldots, (x_m, y_m)).$$
Problem: find hypothesis $h \in H$ with small generalization error.
deterministic case: output label deterministic function of input, $y = f(x)$.
stochastic case: output probabilistic function of input. page 18
Errors
Generalization error: for $h \in H$, it is defined by
$$R(h) = \mathbb{E}_{(x, y) \sim D}[L(h(x), y)].$$
Empirical error: for $h \in H$ and sample $S$,
$$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} L(h(x_i), y_i).$$
Bayes error:
$$R^\star = \inf_{h \text{ measurable}} R(h).$$
in the deterministic case, $R^\star = 0$. page 19
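The empirical error is directly computable from a sample. A sketch (names and data are illustrative): a threshold classifier on one-dimensional inputs, evaluated with the 0-1 loss.

```python
def empirical_error(h, sample, loss):
    """R_hat(h) = (1/m) * sum_i loss(h(x_i), y_i) over a labeled sample."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Threshold classifier and 0-1 loss.
h = lambda x: 1 if x >= 0.5 else 0
loss = lambda y_pred, y: 0.0 if y_pred == y else 1.0

# Four labeled points; the last one disagrees with the threshold rule.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
# empirical_error(h, S, loss) -> 0.25 (one mistake out of four)
```

The generalization error $R(h)$, by contrast, is an expectation over the unknown $D$ and can only be estimated or bounded.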
Noise
Noise: in binary classification, for any $x \in X$, observe that
$$\mathrm{noise}(x) = \min\{\Pr[1 \mid x], \Pr[0 \mid x]\}.$$
$$\mathbb{E}_x[\mathrm{noise}(x)] = R^\star.$$ page 20
Learning ≠ Fitting Notion of simplicity/complexity. How do we define complexity? page 21
Generalization Observations: the best hypothesis on the sample may not be the best overall. generalization is not memorization. complex rules (very complex separation surfaces) can be poor predictors. trade-off: complexity of hypothesis set vs sample size (underfitting/overfitting). page 22
Model Selection
General equality: for any $h \in H$, with $h^\star$ the best-in-class hypothesis,
$$R(h) - R^\star = \underbrace{[R(h) - R(h^\star)]}_{\text{estimation}} + \underbrace{[R(h^\star) - R^\star]}_{\text{approximation}}.$$
Approximation: not a random variable, only depends on $H$.
Estimation: only term we can hope to bound. page 23
Empirical Risk Minimization
Select hypothesis set $H$. Find hypothesis $h \in H$ minimizing empirical error:
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$
but $H$ may be too complex.
the sample size may not be large enough. page 24
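ERM is easy to carry out when $H$ is small and finite. A toy sketch (illustrative names): $H$ is a set of one-dimensional threshold classifiers $h_t(x) = 1_{x \ge t}$, and ERM returns the one with the smallest empirical 0-1 error on the sample.

```python
def erm(H, sample):
    """Return the hypothesis in H with minimal empirical 0-1 error on sample."""
    def emp_err(h):
        return sum(1 for x, y in sample if h(x) != y) / len(sample)
    return min(H, key=emp_err)

def make_threshold(t):
    return lambda x: 1 if x >= t else 0

H = [make_threshold(t / 10) for t in range(11)]  # thresholds 0.0, 0.1, ..., 1.0
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1)]
h_star = erm(H, S)
# h_star separates this sample perfectly: zero empirical error.
```

The slide's caveats show up immediately: with a richer $H$ (or a smaller sample), zero empirical error no longer implies small generalization error.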
Generalization Bounds
Definition: upper bound on
$$\Pr\Big[\sup_{h \in H} \big|R(h) - \hat{R}(h)\big| > \epsilon\Big].$$
Bound on the estimation error for the hypothesis $h_0$ given by ERM (using $\hat{R}(h_0) \le \hat{R}(h^\star)$):
$$R(h_0) - R(h^\star) = R(h_0) - \hat{R}(h_0) + \hat{R}(h_0) - R(h^\star) \le R(h_0) - \hat{R}(h_0) + \hat{R}(h^\star) - R(h^\star) \le 2 \sup_{h \in H} \big|R(h) - \hat{R}(h)\big|.$$
How should we choose $H$? (model selection problem) page 25
Model Selection
[figure: error versus complexity, showing the estimation error increasing, the approximation error decreasing, and their upper bound minimized at an intermediate complexity]
$$H = \bigcup_{\gamma} H_\gamma.$$ page 26
Structural Risk Minimization (Vapnik, 1995)
Principle: consider an infinite sequence of hypothesis sets ordered by inclusion,
$$H_1 \subseteq H_2 \subseteq \cdots \subseteq H_n \subseteq \cdots$$
$$h = \operatorname*{argmin}_{h \in H_n, \, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$
strong theoretical guarantees.
typically computationally hard. page 27
General Algorithm Families
Empirical risk minimization (ERM):
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h).$$
Structural risk minimization (SRM): $H_n \subseteq H_{n+1}$,
$$h = \operatorname*{argmin}_{h \in H_n, \, n \in \mathbb{N}} \hat{R}(h) + \mathrm{penalty}(H_n, m).$$
Regularization-based algorithms: $\lambda \ge 0$,
$$h = \operatorname*{argmin}_{h \in H} \hat{R}(h) + \lambda \|h\|^2.$$ page 28
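A toy instance of the regularization family (a sketch, not from the course): one-dimensional ridge regression, minimizing $\hat{R}(w) + \lambda w^2$ with the squared loss. Setting the derivative of $\frac{1}{m}\sum_i (w x_i - y_i)^2 + \lambda w^2$ to zero gives the closed form below.

```python
def ridge_1d(sample, lam):
    """argmin_w (1/m) sum_i (w*x_i - y_i)^2 + lam * w^2, in closed form:
    w = sum_i x_i*y_i / (sum_i x_i^2 + lam * m)."""
    m = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, _ in sample)
    return sxy / (sxx + lam * m)

S = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # exactly y = 2x
# With lam = 0 the fit is exact (w = 2); lam > 0 shrinks w toward 0,
# trading empirical error for a "simpler" hypothesis.
```

The regularization term plays the role of the complexity penalty in SRM, but yields a single convex objective rather than a search over nested sets.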
This Lecture Basic definitions and concepts. Introduction to the problem of learning. Probability tools. page 29
Basic Properties
Union bound: $\Pr[A \vee B] \le \Pr[A] + \Pr[B]$.
Inversion: if $\Pr[X \ge \epsilon] \le f(\epsilon)$, then, for any $\delta > 0$, with probability at least $1 - \delta$, $X \le f^{-1}(\delta)$.
Jensen's inequality: if $f$ is convex, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$.
Expectation: if $X \ge 0$, $\mathbb{E}[X] = \int_0^{+\infty} \Pr[X > t] \, dt$. page 30
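The inversion rule is how tail bounds become high-probability statements. A sketch (function name is illustrative) for a Hoeffding-style tail $f(\epsilon) = e^{-2m\epsilon^2}$: setting $\delta = f(\epsilon)$ and solving for $\epsilon$ gives, with probability at least $1 - \delta$, a deviation of at most $\sqrt{\log(1/\delta)/(2m)}$.

```python
import math

def hoeffding_eps(m, delta):
    """Invert delta = exp(-2*m*eps^2): eps = sqrt(log(1/delta) / (2*m))."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# The bound tightens as the sample size m grows and loosens as delta shrinks.
```

Plugging the returned $\epsilon$ back into the tail recovers $\delta$ exactly, which is a quick consistency check.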
Basic Inequalities
Markov's inequality: if $X \ge 0$ and $\epsilon > 0$, then
$$\Pr[X \ge \epsilon] \le \frac{\mathbb{E}[X]}{\epsilon}.$$
Chebyshev's inequality: for any $\epsilon > 0$,
$$\Pr[|X - \mathbb{E}[X]| \ge \epsilon] \le \frac{\sigma_X^2}{\epsilon^2}.$$ page 31
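A quick numerical sanity check of both inequalities (not a proof; the setup is illustrative): for $X \sim \mathrm{Uniform}[0, 1]$, $\mathbb{E}[X] = 1/2$ and $\mathrm{Var}[X] = 1/12$, and the simulated tail probabilities stay below the Markov and Chebyshev bounds.

```python
import random

rng = random.Random(0)
xs = [rng.random() for _ in range(100_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

eps = 0.4
markov_bound = mean / eps            # Pr[X >= eps] <= E[X] / eps
chebyshev_bound = var / eps ** 2     # Pr[|X - E[X]| >= eps] <= Var[X] / eps^2

p_markov = sum(x >= eps for x in xs) / len(xs)            # true value: 0.6
p_cheb = sum(abs(x - mean) >= eps for x in xs) / len(xs)  # true value: 0.2
# Both empirical probabilities fall below the corresponding bounds.
```

Note how loose Markov can be (bound 1.25 for a probability of 0.6): using the variance, as Chebyshev does, already helps, and the exponential bounds on the following slides are far sharper.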
Hoeffding's Inequality
Theorem: Let $X_1, \ldots, X_m$ be indep. rand. variables with the same expectation $\mu$ and $X_i \in [a, b]$ ($a < b$). Then, for any $\epsilon > 0$, the following inequalities hold:
$$\Pr\Big[\mu - \frac{1}{m} \sum_{i=1}^{m} X_i > \epsilon\Big] \le \exp\Big(\frac{-2m\epsilon^2}{(b - a)^2}\Big)$$
$$\Pr\Big[\frac{1}{m} \sum_{i=1}^{m} X_i - \mu > \epsilon\Big] \le \exp\Big(\frac{-2m\epsilon^2}{(b - a)^2}\Big).$$ page 32
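An empirical illustration (not a proof; parameters are illustrative) for Bernoulli$(1/2)$ variables in $[0, 1]$: the simulated frequency of a two-sided deviation larger than $\epsilon$ stays below $2 e^{-2m\epsilon^2}$.

```python
import math
import random

rng = random.Random(42)
m, eps, trials = 100, 0.1, 2000
bound = 2 * math.exp(-2 * m * eps ** 2)  # two-sided Hoeffding bound, about 0.27

deviations = 0
for _ in range(trials):
    # Sample mean of m Bernoulli(1/2) draws.
    sample_mean = sum(rng.random() < 0.5 for _ in range(m)) / m
    if abs(sample_mean - 0.5) > eps:
        deviations += 1
# deviations / trials stays below the bound.
```

For these parameters the true deviation probability is around 0.04, so the bound holds with room to spare; its value is that it holds for every distribution on $[a, b]$, with no variance information.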
McDiarmid's Inequality (McDiarmid, 1989)
Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, \, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \le c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \le 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$ page 33
Appendix page 34
Markov's Inequality
Theorem: let $X$ be a non-negative random variable with $\mathbb{E}[X] < \infty$, then, for all $t > 0$,
$$\Pr[X \ge t \, \mathbb{E}[X]] \le \frac{1}{t}.$$
Proof:
$$\Pr[X \ge t \, \mathbb{E}[X]] = \sum_{x \ge t \mathbb{E}[X]} \Pr[X = x] \le \sum_{x \ge t \mathbb{E}[X]} \frac{x}{t \, \mathbb{E}[X]} \Pr[X = x] \le \sum_{x} \frac{x}{t \, \mathbb{E}[X]} \Pr[X = x] = \mathbb{E}\Big[\frac{X}{t \, \mathbb{E}[X]}\Big] = \frac{1}{t}.$$ page 35
Chebyshev's Inequality
Theorem: let $X$ be a random variable with $\mathrm{Var}[X] < \infty$, then, for all $t > 0$,
$$\Pr[|X - \mathbb{E}[X]| \ge t \sigma_X] \le \frac{1}{t^2}.$$
Proof: Observe that
$$\Pr[|X - \mathbb{E}[X]| \ge t \sigma_X] = \Pr[(X - \mathbb{E}[X])^2 \ge t^2 \sigma_X^2].$$
The result follows from Markov's inequality. page 36
Weak Law of Large Numbers
Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr[|\overline{X}_n - \mu| \ge \epsilon] = 0.$$
Proof: Since the variables are independent,
$$\mathrm{Var}[\overline{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Thus, by Chebyshev's inequality,
$$\Pr[|\overline{X}_n - \mu| \ge \epsilon] \le \frac{\sigma^2}{n \epsilon^2}.$$ page 37
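A simulation sketch of the law (names and parameters are illustrative) for $\mathrm{Uniform}[0, 1]$ variables ($\mu = 1/2$, $\sigma^2 = 1/12$): the frequency of a deviation $|\overline{X}_n - \mu| \ge \epsilon$ shrinks as $n$ grows, and stays below the Chebyshev bound $\sigma^2 / (n\epsilon^2)$ from the proof.

```python
import random

def deviation_freq(n, eps, trials, seed=0):
    """Fraction of trials in which |mean of n Uniform[0,1] draws - 0.5| >= eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(n)) / n
        if abs(xbar - 0.5) >= eps:
            hits += 1
    return hits / trials

# deviation_freq(n, 0.05, trials) decreases toward 0 as n grows,
# e.g. compare n = 10 with n = 1000.
```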
Concentration Inequalities Some general tools for error analysis and bounds: Hoeffding s inequality (additive). Chernoff bounds (multiplicative). McDiarmid s inequality (more general). page 38
Hoeffding's Lemma
Lemma: Let $X \in [a, b]$ be a random variable with $\mathbb{E}[X] = 0$ and $b \ne a$. Then for any $t > 0$,
$$\mathbb{E}[e^{tX}] \le e^{t^2 (b - a)^2 / 8}.$$
Proof: by convexity of $x \mapsto e^{tx}$, for all $a \le x \le b$,
$$e^{tx} \le \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$$
Thus,
$$\mathbb{E}[e^{tX}] \le \mathbb{E}\Big[\frac{b - X}{b - a} e^{ta} + \frac{X - a}{b - a} e^{tb}\Big] = \frac{b}{b - a} e^{ta} - \frac{a}{b - a} e^{tb} = e^{\phi(t)},$$
with
$$\phi(t) = \log\Big(\frac{b}{b - a} e^{ta} - \frac{a}{b - a} e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}\Big).$$ page 39
Taking the derivative gives:
$$\phi'(t) = a - \frac{a e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}}.$$
Note that: $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore, with $\alpha = \frac{-a}{b - a}$,
$$\phi''(t) = \frac{(1 - \alpha) \alpha e^{t(b - a)} (b - a)^2}{[(1 - \alpha) + \alpha e^{t(b - a)}]^2} = u(1 - u)(b - a)^2 \le \frac{(b - a)^2}{4},$$
where $u = \frac{\alpha e^{t(b - a)}}{(1 - \alpha) + \alpha e^{t(b - a)}}$. There exists $0 \le \theta \le t$ such that:
$$\phi(t) = \phi(0) + t \phi'(0) + \frac{t^2}{2} \phi''(\theta) \le \frac{t^2 (b - a)^2}{8}.$$ page 40
Hoeffding's Theorem
Theorem: Let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $S_m = \sum_{i=1}^{m} X_i$, the following inequalities hold for any $\epsilon > 0$:
$$\Pr[S_m - \mathbb{E}[S_m] \ge \epsilon] \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}$$
$$\Pr[S_m - \mathbb{E}[S_m] \le -\epsilon] \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$$
Proof: The proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize
$$\Pr[X \ge \epsilon] = \Pr[e^{tX} \ge e^{t\epsilon}] \le \frac{\mathbb{E}[e^{tX}]}{e^{t\epsilon}}.$$ page 41
Using this scheme and the independence of the random variables gives
$$\Pr[S_m - \mathbb{E}[S_m] \ge \epsilon] \le e^{-t\epsilon} \, \mathbb{E}[e^{t(S_m - \mathbb{E}[S_m])}] = e^{-t\epsilon} \prod_{i=1}^{m} \mathbb{E}[e^{t(X_i - \mathbb{E}[X_i])}]$$
(lemma applied to $X_i - \mathbb{E}[X_i]$)
$$\le e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8} = e^{-t\epsilon} e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8} \le e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$$
choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.
The second inequality is proved in a similar way. page 42
Hoeffding's Inequality
Corollary: for any $\epsilon > 0$, any distribution $D$ and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[R(h) - \hat{R}(h) \ge \epsilon] \le e^{-2m\epsilon^2}$$
$$\Pr[R(h) - \hat{R}(h) \le -\epsilon] \le e^{-2m\epsilon^2}.$$
Proof: follows directly from Hoeffding's theorem.
Combining these one-sided inequalities yields
$$\Pr\big[|\hat{R}(h) - R(h)| \ge \epsilon\big] \le 2 e^{-2m\epsilon^2}.$$ page 43
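The corollary can be used to size a test sample (a sketch; the function name is illustrative): to guarantee $|\hat{R}(h) - R(h)| \le \epsilon$ with probability at least $1 - \delta$, it suffices that $2 e^{-2m\epsilon^2} \le \delta$, i.e. $m \ge \log(2/\delta) / (2\epsilon^2)$.

```python
import math

def sample_size(eps, delta):
    """Smallest integer m with 2 * exp(-2 * m * eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# e.g. eps = 0.02, delta = 0.05 requires m = 4612 points.
```

Note the favorable scaling: halving $\epsilon$ quadruples $m$, while shrinking $\delta$ only costs a logarithmic factor.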
Chernoff's Inequality
Theorem: for any $\gamma > 0$, any distribution $D$ and any hypothesis $h \colon X \to \{0, 1\}$, the following inequalities hold:
$$\Pr[\hat{R}(h) \ge (1 + \gamma) R(h)] \le e^{-m R(h) \gamma^2 / 3}$$
$$\Pr[\hat{R}(h) \le (1 - \gamma) R(h)] \le e^{-m R(h) \gamma^2 / 2}.$$
Proof: based on Chernoff's bounding technique. page 44
McDiarmid's Inequality (McDiarmid, 1989)
Theorem: let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f \colon U^m \to \mathbb{R}$ a function verifying for all $i \in [1, m]$,
$$\sup_{x_1, \ldots, x_m, \, x_i'} \big|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\big| \le c_i.$$
Then, for all $\epsilon > 0$,
$$\Pr\Big[\big|f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)]\big| > \epsilon\Big] \le 2 \exp\Big(\frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2}\Big).$$ page 45
Comments:
Proof: uses Hoeffding's lemma.
Hoeffding's inequality is a special case of McDiarmid's with
$$f(x_1, \ldots, x_m) = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad c_i = \frac{b_i - a_i}{m}.$$ page 46
Jensen's Inequality
Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].$$
Proof: definition of convexity, continuity of convex functions, and density of finite distributions. page 47