Introduction to Statistical Learning Theory


In the last unit we looked at regularization: adding a $\|w\|^2$ penalty. We add a bias in the sense that we prefer classifiers with low norm. How do we incorporate more complicated prior knowledge? Example: we trained many different face detectors $w_1, \ldots, w_k$ and have a probabilistic model $P(w)$. PAC-Bayes combines a Bayesian approach with an agnostic approach to analyse this situation. We will start with a quick overview of Bayesian methods.

Maximum Likelihood

Assume your data is drawn from a distribution that comes from some parametric family. Example: $P(y \mid x; w) = \mathcal{N}(w^T x, \sigma^2)$, i.e. $y = w^T x + \mathcal{N}(0, \sigma^2)$. For simplicity we assume $\sigma$ is a known fixed parameter. Given a sample $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ we define the (log-)likelihood of $w$ as
$$L(w, S) = \log P(y_1, \ldots, y_m \mid x_1, \ldots, x_m; w) = \sum_{i=1}^m \log P(y_i \mid x_i; w)$$
The maximum likelihood estimator returns $w^* = \arg\max_w L(w, S)$.

Maximum Likelihood

In our example $P(y \mid x; w) = \mathcal{N}(w^T x, \sigma^2)$, i.e. $y = w^T x + \mathcal{N}(0, \sigma^2)$. This means that
$$P(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right).$$
We conclude that the log-likelihood is
$$L(w, S) = -\frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - w^T x_i)^2 + C$$
where $C$ collects the normalization factors that do not depend on $w$. In this model, maximum likelihood is equivalent to minimizing the square loss. The problem is that we actually want to maximize $P(w \mid x, y)$.
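
A minimal numpy sketch of this equivalence; the synthetic data, the noise level sigma and the generating weights below are illustrative, not from the lecture. Maximizing the Gaussian log-likelihood returns the ordinary least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 200, 5, 0.5                        # illustrative sizes and noise level
w_true = rng.normal(size=d)                      # hypothetical ground-truth weights
X = rng.normal(size=(m, d))
y = X @ w_true + sigma * rng.normal(size=m)      # y = w^T x + N(0, sigma^2)

def log_likelihood(w, X, y, sigma):
    """Gaussian log-likelihood L(w, S), up to the constant C."""
    return -np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

# The ML estimate is the least-squares solution, i.e. it minimizes the square loss.
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
print(log_likelihood(w_ml, X, y, sigma) >= log_likelihood(w_true, X, y, sigma))  # True
```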

Maximum a-posteriori

To get $P(w \mid x, y)$ we need a prior distribution $P(w)$. We now have $P(y \mid x, w)$ and $P(w)$, so from Bayes' theorem we get
$$P(w \mid x, y) = \frac{P(y \mid x, w)\, P(w)}{P(y \mid x)} \propto P(y \mid x, w)\, P(w).$$
The maximum a-posteriori (MAP) model is
$$w^* = \arg\max_w \{P(Y \mid X, w)\, P(w)\} = \arg\max_w \{L(w, S) + \log P(w)\}.$$

Maximum a-posteriori

Continuing our example, assume $P(w) = \mathcal{N}(0, \sigma_w^2 I)$. We now get
$$w^* = \arg\max_w \left[-\frac{1}{2\sigma^2}\sum_{i=1}^m (y_i - w^T x_i)^2 - \frac{1}{2\sigma_w^2}\|w\|_2^2\right] = \arg\min_w \left[\sum_{i=1}^m (y_i - w^T x_i)^2 + \frac{\sigma^2}{\sigma_w^2}\|w\|_2^2\right]$$
This is equivalent to doing regularized ERM with $\ell_2$ regularization. If we use a Laplace prior instead, we get $\ell_1$ regularization.
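
A short sketch of the corresponding MAP estimator, assuming the Gaussian prior above; the closed-form ridge solution with regularization constant $\sigma^2/\sigma_w^2$ follows from the display, and the numerical values are illustrative.

```python
import numpy as np

def map_estimate(X, y, sigma=0.5, sigma_w=1.0):
    """MAP for y ~ N(w^T x, sigma^2) with prior w ~ N(0, sigma_w^2 I):
    the minimizer of ||y - Xw||^2 + (sigma^2 / sigma_w^2) ||w||^2."""
    d = X.shape[1]
    lam = sigma ** 2 / sigma_w ** 2
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=100)
print(map_estimate(X, y))                 # shrunk towards 0 relative to least squares
```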

Bayesian Inference

MAP picks the best model, given our model and data. But why do we have to pick one model? We have seen that the optimal classifier can be calculated given $P(y \mid x)$ (assignment 1). The Bayesian approach does exactly that, so we get
$$P(y \mid x, S) = \int_w P(y \mid x, w)\, P(w \mid S)\, dw$$
In some cases (Gaussian) this has an analytic solution; most of the time there isn't one.
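
In the Gaussian case the integral is analytic. A sketch under those assumptions (known noise level sigma and prior scale sigma_w, both illustrative): the posterior $P(w \mid S)$ is Gaussian, and so is the predictive $P(y \mid x, S)$.

```python
import numpy as np

def bayesian_predictive(X, y, x_new, sigma=0.5, sigma_w=1.0):
    """Gaussian posterior over w and Gaussian predictive for y at x_new."""
    d = X.shape[1]
    cov_post = np.linalg.inv(X.T @ X / sigma ** 2 + np.eye(d) / sigma_w ** 2)
    mean_post = cov_post @ X.T @ y / sigma ** 2        # also the MAP estimate
    pred_mean = x_new @ mean_post
    pred_var = x_new @ cov_post @ x_new + sigma ** 2   # parameter + noise uncertainty
    return pred_mean, pred_var

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=50)
print(bayesian_predictive(X, y, np.array([1.0, 0.0, 0.0])))
```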

Introduction

PAC-Bayes: We will consider algorithms that return a posterior, a distribution $Q$ on $H$.

Definition 2.1 (Loss of posterior) Let $Q$ be a distribution on $H$, $D$ a distribution on $X \times Y$ and $S$ a finite sample. Define
$$L_D(Q) = \mathbb{E}_{h \sim Q}[L_D(h)] = \mathbb{E}_{h \sim Q}\left[\mathbb{E}_{z \sim D}[\ell(h, z)]\right]$$
$$L_S(Q) = \mathbb{E}_{h \sim Q}[L_S(h)] = \mathbb{E}_{h \sim Q}\left[\frac{1}{m}\sum_{i=1}^m \ell(h, z_i)\right]$$

Introduction

We can turn a posterior into a learning algorithm:

Definition 2.2 (Gibbs hypothesis) Let $Q$ be a distribution on $H$. The Gibbs hypothesis is the following randomized hypothesis: given $x$, sample $h$ according to $Q$ and return $h(x)$.

It is straightforward to show that its expected loss is $L_D(Q)$.
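
A toy sketch of the Gibbs hypothesis for a small finite class $H$; the three hypotheses and the posterior $Q$ below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
hypotheses = [lambda x: 1, lambda x: -1, lambda x: int(np.sign(x[0]))]  # toy finite H
Q = np.array([0.1, 0.1, 0.8])                                           # posterior on H

def gibbs_predict(x):
    h = hypotheses[rng.choice(len(hypotheses), p=Q)]   # sample a fresh h ~ Q
    return h(x)                                        # return h(x)

print([gibbs_predict(np.array([0.5])) for _ in range(5)])   # a randomized prediction
```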

KL Divergence

We want to show that if $Q$ is similar to $P$ we generalize well. The Kullback-Leibler (KL) divergence is how we measure similarity.

Definition 2.3 (KL Divergence) Let $P, Q$ be continuous or discrete distributions. Define
$$\mathrm{KL}(Q \| P) = \mathbb{E}_{x \sim Q}\left[\ln\frac{Q(x)}{P(x)}\right]$$
Notice this is not symmetric: in general $\mathrm{KL}(Q \| P) \neq \mathrm{KL}(P \| Q)$. The intuition behind this definition comes from information theory.

KL Divergence

Assume we have a finite alphabet and letter $x$ is sent with probability $P(x)$. Shannon's coding theorem states that if you code $x$ with $\log_2(1/P(x))$ bits you get an optimal code. The expected number of bits per letter is then $\mathbb{E}_{x \sim P}\left[\log_2\frac{1}{P(x)}\right] = H(P)$.
Consider now that we use the optimal code for $P$, but the letters are sent according to $Q$. The expected number of bits per letter is now
$$\mathbb{E}_{x \sim Q}\left[\log_2\frac{1}{P(x)}\right] = \mathbb{E}_{x \sim Q}\left[\log_2\frac{1}{Q(x)} + \log_2\frac{Q(x)}{P(x)}\right] = H(Q) + \mathrm{KL}(Q \| P)$$
up to a factor due to the different log base. Since no code can beat the optimal code for $Q$, this shows $\mathrm{KL}(Q \| P) \geq 0$. Another perspective: the mutual information $I(X, Y)$ equals $I(X, Y) = \mathrm{KL}(P(X, Y) \| P(X)P(Y))$.
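
A small numeric check of this coding argument on a made-up three-letter alphabet ($P$ and $Q$ below are arbitrary choices): coding letters drawn from $Q$ with the optimal code for $P$ costs $H(Q) + \mathrm{KL}(Q \| P)$ bits per letter.

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])                    # code lengths are built for P
Q = np.array([0.1, 0.1, 0.8])                      # letters are actually drawn from Q

H_Q = -np.sum(Q * np.log2(Q))                      # optimal bits/letter for Q
cost_wrong_code = np.sum(Q * np.log2(1.0 / P))     # bits/letter with the code for P
KL_QP = np.sum(Q * np.log2(Q / P))                 # KL measured in bits (log base 2)
print(cost_wrong_code, H_Q + KL_QP)                # the two quantities coincide
```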

KL Divergence

Example 1: $P$ is some distribution on $x_1, \ldots, x_m$ and $Q$ puts all its mass on $x_i$; then $\mathrm{KL}(Q \| P) = \ln(1/P(x_i))$.
Example 2: If $P(x_i) = 0$ and $Q(x_i) > 0$ then $\mathrm{KL}(Q \| P) = \infty$.
Example 3: If $\alpha, \beta \in [0, 1]$ then
$$\mathrm{KL}(\alpha \| \beta) := \mathrm{KL}(\mathrm{Bernoulli}(\alpha) \| \mathrm{Bernoulli}(\beta)) = \alpha \ln\left(\frac{\alpha}{\beta}\right) + (1 - \alpha)\ln\left(\frac{1 - \alpha}{1 - \beta}\right)$$
Example 4: If $Q = \mathcal{N}(\mu_0, \Sigma_0)$ and $P = \mathcal{N}(\mu_1, \Sigma_1)$ are Gaussian distributions in dimension $n$, then
$$\mathrm{KL}(Q \| P) = \frac{1}{2}\left(\mathrm{trace}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^T \Sigma_1^{-1}(\mu_1 - \mu_0) - n + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right)$$
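
A minimal sketch implementing Examples 3 and 4 directly from the formulas above (natural log, illustrative inputs):

```python
import numpy as np

def kl_bernoulli(alpha, beta):
    """Example 3: KL between Bernoulli(alpha) and Bernoulli(beta)."""
    return alpha * np.log(alpha / beta) + (1 - alpha) * np.log((1 - alpha) / (1 - beta))

def kl_gaussian(mu0, cov0, mu1, cov1):
    """Example 4: KL(N(mu0, cov0) || N(mu1, cov1)) in dimension n."""
    n = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - n
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

print(kl_bernoulli(0.1, 0.5))
print(kl_gaussian(np.zeros(2), np.eye(2), np.ones(2), 2 * np.eye(2)))
print(kl_bernoulli(0.3, 0.3), kl_gaussian(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2)))  # both 0
```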

KL Bound

We will now prove the following bound:

Theorem 3.1 (McAllester) Let $Q, P$ be distributions on $H$ and $D$ be a distribution on $X \times Y$. Assume $\ell(h, z) \in [0, 1]$. Let $S \sim D^m$ be a sample; then with probability greater or equal to $1 - \delta$ over $S$ we have
$$\mathrm{KL}(L_S(Q) \| L_D(Q)) \leq \frac{\mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)}{m}$$
Notice that the l.h.s. is the KL divergence between two numbers (as in Example 3), while the r.h.s. contains the KL divergence between two distributions. Also notice we assume no connection between $D$ and $P$; it is still an agnostic analysis.

KL Bound

We will split the proof into technical lemmas:

Lemma 3.1 If $X$ is a real-valued random variable satisfying $P(X \leq x) \leq e^{-m f(x)}$, then $\mathbb{E}[e^{(m-1) f(X)}] \leq m$.

Proof: Define $F(x) = P(X \leq x)$, the CDF. From basic properties of the CDF we have $P(F(X) \leq y) \leq y$, and since $F(x) \leq e^{-m f(x)}$ this gives $P(e^{-m f(X)} \leq y) \leq y$. So
$$y \geq P(e^{-m f(X)} \leq y) = P(e^{m f(X)} \geq 1/y) = P\left(e^{(m-1) f(X)} \geq y^{-\frac{m-1}{m}}\right)$$
Define $\nu = y^{-\frac{m-1}{m}}$ and we have $P(e^{(m-1) f(X)} \geq \nu) \leq \nu^{-\frac{m}{m-1}}$. We use the following fact: for a non-negative r.v. $W$ we have $\mathbb{E}[W] = \int_0^\infty P(W \geq \nu)\, d\nu$.

KL Bound

We conclude:
$$\mathbb{E}[e^{(m-1) f(X)}] = \int_0^\infty P(e^{(m-1) f(X)} \geq \nu)\, d\nu \leq 1 + \int_1^\infty \nu^{-\frac{m}{m-1}}\, d\nu = 1 - (m-1)\left[\nu^{-\frac{1}{m-1}}\right]_1^\infty = m$$
We will use the stronger version of the Hoeffding bound we proved in Lecture 1:

Lemma 3.2 (Hoeffding) If $X_1, \ldots, X_m$ are i.i.d. r.v. such that $X_i \in [0, 1]$, and $\bar{X} = \frac{1}{m}\sum_{i=1}^m X_i$, then for $\epsilon \in [0, 1]$ we have
$$P(\bar{X} \leq \epsilon) \leq e^{-m\, \mathrm{KL}(\epsilon \| \mathbb{E}[X_1])}$$
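
A rough Monte Carlo illustration of Lemma 3.2, assuming the lower-tail reading $P(\bar{X} \leq \epsilon) \leq e^{-m\,\mathrm{KL}(\epsilon \| \mathbb{E}[X_1])}$ for $\epsilon$ below the mean; the Bernoulli parameter, threshold and sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, eps, trials = 50, 0.5, 0.35, 100_000   # illustrative parameters, eps < p

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

means = rng.binomial(1, p, size=(trials, m)).mean(axis=1)   # i.i.d. Bernoulli(p) averages
empirical = np.mean(means <= eps)                           # estimated lower-tail probability
bound = np.exp(-m * kl_bernoulli(eps, p))                   # exp(-m KL(eps || p))
print(empirical, bound)                                     # empirical tail sits below the bound
```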

KL Bound

Lemma 3.3 With probability greater than $1 - \delta$ over $S$,
$$\mathbb{E}_{h \sim P}\left[e^{(m-1)\,\mathrm{KL}(L_S(h) \| L_D(h))}\right] \leq \frac{m}{\delta}$$
Proof sketch: using Lemma 3.1 + Lemma 3.2 (Hoeffding) we get that for any $h \in H$ we have $\mathbb{E}_{S \sim D^m}\left[e^{(m-1)\,\mathrm{KL}(L_S(h) \| L_D(h))}\right] \leq m$. The lemma follows by taking the expectation w.r.t. $P$ and applying Markov's inequality.

Finally we need this shift-of-measure lemma:

Lemma 3.4
$$\mathbb{E}_{x \sim Q}[f(x)] \leq \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{x \sim P}\left[e^{f(x)}\right]$$

KL Bound

Proof:
$$\mathbb{E}_{x \sim Q}[f(x)] = \mathbb{E}_{x \sim Q}\left[\ln e^{f(x)}\right] = \mathbb{E}_{x \sim Q}\left[\ln\left(\frac{P(x)}{Q(x)}\, e^{f(x)}\right) + \ln\frac{Q(x)}{P(x)}\right]$$
$$= \mathrm{KL}(Q \| P) + \mathbb{E}_{x \sim Q}\left[\ln\left(\frac{P(x)}{Q(x)}\, e^{f(x)}\right)\right] \leq \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{x \sim Q}\left[\frac{P(x)}{Q(x)}\, e^{f(x)}\right] = \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{x \sim P}\left[e^{f(x)}\right]$$
where we use Jensen's inequality (concavity of $\ln$).

KL Bound

We can now prove Theorem 3.1:

Theorem 3.1 (McAllester) Let $Q, P$ be distributions on $H$ and $D$ be a distribution on $X \times Y$. Assume $\ell(h, z) \in [0, 1]$. Let $S \sim D^m$ be a sample; then with probability greater or equal to $1 - \delta$ over $S$ we have
$$\mathrm{KL}(L_S(Q) \| L_D(Q)) \leq \frac{\mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)}{m}$$

KL Bound

Proof: Define $f(h) = \mathrm{KL}(L_S(h) \| L_D(h))$. Using the shift-of-measure lemma (Lemma 3.4) and Lemma 3.3 we get, with probability greater or equal to $1 - \delta$:
$$\mathbb{E}_{h \sim Q}[m f(h)] \leq \mathrm{KL}(Q \| P) + \ln \mathbb{E}_{h \sim P}\left[e^{m f(h)}\right] \leq \mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)$$
To finish the proof we use the fact that the KL divergence is jointly convex, so from Jensen's inequality
$$\mathrm{KL}(L_S(Q) \| L_D(Q)) = \mathrm{KL}\left(\mathbb{E}_Q[L_S(h)] \,\Big\|\, \mathbb{E}_Q[L_D(h)]\right) \leq \mathbb{E}_Q\left[\mathrm{KL}(L_S(h) \| L_D(h))\right] = \mathbb{E}_Q[f(h)].$$
Dividing the first display by $m$ completes the proof. (sweeping a few subtleties under the rug)

Generalization Bounds

We bounded $\mathrm{KL}(L_S(Q) \| L_D(Q))$. Next step: bound $L_D(Q) - L_S(Q)$. We will show two bounds using the following lemma:

Lemma 3.5 If $a, b \in [0, 1]$ and $\mathrm{KL}(a \| b) \leq x$, then
$$b \leq a + \sqrt{\frac{x}{2}} \qquad \text{and} \qquad b \leq a + 2x + \sqrt{2ax}$$
where the second is much stronger if $a$, i.e. $L_S(Q)$, is very small.
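
Lemma 3.5 is what lets us invert the binary KL appearing in Theorem 3.1. A sketch that computes the exact inverse by bisection and compares it with the two relaxations (the values of $a$ and $x$ are illustrative):

```python
import numpy as np

def kl_bernoulli(a, b):
    eps = 1e-12
    a, b = np.clip(a, eps, 1 - eps), np.clip(b, eps, 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def kl_inverse(a, x, iters=100):
    """Largest b in [a, 1] with KL(a || b) <= x; KL(a || b) is increasing in b for b >= a."""
    lo, hi = a, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if kl_bernoulli(a, mid) <= x else (lo, mid)
    return lo

a, x = 0.05, 0.02                          # think of a as L_S(Q) and x as the r.h.s. of Theorem 3.1
print(kl_inverse(a, x))                    # tightest bound on b = L_D(Q)
print(a + np.sqrt(x / 2))                  # first relaxation of Lemma 3.5
print(a + 2 * x + np.sqrt(2 * a * x))      # second relaxation, better when a is small
```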

Generalization Bounds

Proof of the first inequality: Fix $b$ and define $f(a) = \mathrm{KL}(a \| b) - 2(b - a)^2$. The first and second derivatives are:
$$f'(a) = \ln\left(\frac{a}{1 - a}\right) - \ln\left(\frac{b}{1 - b}\right) - 4(a - b), \qquad f''(a) = \frac{1}{a(1 - a)} - 4$$
The function $a(1 - a)$ has its maximum at $a = 1/2$ with value $1/4$, so $f''(a) \geq 0$. As $f'(b) = 0$, $f(a)$ has its minimum at $a = b$ with $f(b) = 0$. Therefore $2(a - b)^2 \leq \mathrm{KL}(a \| b) \leq x$, proving $b \leq a + \sqrt{\frac{x}{2}}$. The second inequality is left as an exercise. Notice we also have $b \geq a - \sqrt{\frac{x}{2}}$.

Generalization Bounds

We can combine everything to get the following theorem:

Theorem 3.6 (Generalization Bound) Let $Q, P$ be distributions on $H$ and $D$ be a distribution on $X \times Y$. Assume $\ell(h, z) \in [0, 1]$. Let $S \sim D^m$ be a sample; then with probability greater or equal to $1 - \delta$ over $S$ we have
$$L_D(Q) \leq L_S(Q) + \sqrt{\frac{\mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)}{2m}}$$
$$L_D(Q) \leq L_S(Q) + 2\,\frac{\mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)}{m} + \sqrt{\frac{2 L_S(Q)\left(\mathrm{KL}(Q \| P) + \ln\left(\frac{m+1}{\delta}\right)\right)}{m}}$$
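
A minimal sketch evaluating both bounds of Theorem 3.6; the empirical Gibbs loss, $\mathrm{KL}(Q \| P)$, $m$ and $\delta$ below are made-up inputs.

```python
import numpy as np

def pac_bayes_bounds(L_S, kl_QP, m, delta):
    """The two bounds of Theorem 3.6 on the true Gibbs loss L_D(Q)."""
    eps = (kl_QP + np.log((m + 1) / delta)) / m
    bound1 = L_S + np.sqrt(eps / 2)
    bound2 = L_S + 2 * eps + np.sqrt(2 * L_S * eps)
    return bound1, bound2

print(pac_bayes_bounds(L_S=0.05, kl_QP=10.0, m=10_000, delta=0.05))
```

With a small empirical loss the second bound comes out tighter, matching the remark after Lemma 3.5.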