
Information Theory
Lecture 3: Applications to Machine Learning

Mark Reid
Research School of Computer Science
The Australian National University

2nd December, 2014

Outline

1 Prediction Error & Fano's Inequality
2 Online Learning
3 Exponential Families as Maximum Entropy Distributions

Loss and Bayes Risk

Machine learning is often framed in terms of losses. Given observations from $\mathcal{X}$ and predictions $\mathcal{A}$, a loss function $\ell : \mathcal{A} \to \mathbb{R}^{\mathcal{X}}$ assigns a penalty $\ell_x(a)$ for predicting $a \in \mathcal{A}$ when $x \in \mathcal{X}$ is observed.

If observations come from a fixed, unknown distribution $p(x)$ over $\mathcal{X}$, the risk of $a$ is the expected loss
$$R(a; p) = \mathbb{E}_{x \sim p}[\ell_x(a)] = \langle p, \ell(a) \rangle.$$

The Bayes risk is the minimal risk achievable for a given distribution,
$$H(p) = \inf_{a \in \mathcal{A}} R(a; p).$$
$H(p)$ is always concave, being an infimum of functions that are linear in $p$.
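To make the definitions concrete, here is a minimal numerical sketch (mine, not from the slides) using squared loss $\ell_x(a) = (a - x)^2$ on a small finite outcome space; the grid search stands in for the infimum defining the Bayes risk.

    import numpy as np

    # Finite outcome space and an (unknown to the learner) distribution p over it.
    outcomes = np.array([0.0, 1.0, 2.0])
    p = np.array([0.2, 0.5, 0.3])

    def risk(a, p, outcomes):
        """Expected squared loss R(a; p) = E_{x~p}[(a - x)^2]."""
        return np.sum(p * (a - outcomes) ** 2)

    # Bayes risk H(p) = inf_a R(a; p), approximated by a grid search over actions.
    actions = np.linspace(0.0, 2.0, 2001)
    bayes_risk = min(risk(a, p, outcomes) for a in actions)

    # For squared loss the minimiser is the mean, so H(p) equals the variance of x.
    mean = np.sum(p * outcomes)
    print(bayes_risk, np.sum(p * (outcomes - mean) ** 2))  # approximately equal

For squared loss the Bayes-optimal action is the mean of $p$, so the Bayes risk is the variance, which the grid search recovers.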

Log Loss and Entropy

In the special case where predictions are distributions over $\mathcal{X}$ (i.e., $\mathcal{A} = \Delta_{\mathcal{X}}$) and the loss is the log loss $\ell_x(q) = -\log q(x)$, we get
$$R(q; p) = \mathbb{E}_{x \sim p}[-\log q(x)]
\quad\text{and}\quad
H(p) = \inf_{q \in \Delta_{\mathcal{X}}} \mathbb{E}_{x \sim p}[-\log q(x)] = \mathbb{E}_{x \sim p}[-\log p(x)],$$
the Shannon entropy.

Furthermore, the regret (how far a prediction is from optimal) is
$$\mathrm{Regret}(q; p) = R(q; p) - \inf_{q' \in \Delta_{\mathcal{X}}} R(q'; p) = \mathbb{E}_{x \sim p}[-\log q(x) + \log p(x)] = \mathrm{KL}(p \,\|\, q).$$

(Aside: in general, the regret of a proper loss is always a Bregman divergence constructed from the negative Bayes risk of the loss.)
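A quick numerical check of this identity (an illustrative sketch, not part of the slides): on a finite alphabet, the excess log-loss risk of $q$ over the entropy of $p$ equals $\mathrm{KL}(p \,\|\, q)$.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_loss_risk(q, p):
        """Expected log loss R(q; p) = E_{x~p}[-log q(x)] on a finite alphabet."""
        return -np.sum(p * np.log(q))

    # Two random distributions on a 4-letter alphabet.
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))

    entropy = -np.sum(p * np.log(p))      # H(p) = inf_q R(q; p), attained at q = p
    regret = log_loss_risk(q, p) - entropy
    kl = np.sum(p * np.log(p / q))        # KL(p || q)
    print(regret, kl)                      # the two agree up to floating point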

Fano's Inequality

Fano's Inequality. Let $p(X, Y)$ be a joint distribution over $X$ and $Y$, where $Y \in \{1, \ldots, K\}$. If $\hat{Y} = f(X)$ is an estimator for $Y$, then
$$P(\hat{Y} \neq Y) \geq \frac{H(Y \mid X) - 1}{\log_2 K}.$$

Proof: Define $E = 1$ if $\hat{Y} \neq Y$ and $E = 0$ if $\hat{Y} = Y$, and let $p_e = P(E = 1)$. Ignoring $X$ for the moment, apply the chain rule for conditional entropy in two ways:
$$H(E, Y \mid \hat{Y}) = H(Y \mid \hat{Y}) + H(E \mid Y, \hat{Y}) = H(E \mid \hat{Y}) + H(Y \mid E, \hat{Y}).$$
$H(E \mid Y, \hat{Y}) = 0$ since $E$ is determined by $Y$ and $\hat{Y}$.
$H(E \mid \hat{Y}) \leq H(E) \leq 1$ (conditioning reduces entropy; $E$ is binary).
$H(Y \mid E, \hat{Y}) = (1 - p_e)\, H(Y \mid \hat{Y}, E = 0) + p_e\, H(Y \mid \hat{Y}, E = 1) \leq p_e \log_2 K$, since $E = 0 \Rightarrow Y = \hat{Y}$ (so the first term vanishes) and $H(Y \mid \hat{Y}, E = 1) \leq H(Y) \leq \log_2 K$.

Fano's Inequality

Proof (cont.): Combining the three bounds gives
$$H(Y \mid \hat{Y}) + 0 \leq 1 + p_e \log_2 K.$$
By the data processing inequality, $I(Y; \hat{Y}) \leq I(Y; X)$, since $\hat{Y} = f(X)$ means $Y \to X \to \hat{Y}$ forms a Markov chain. Thus
$$I(Y; \hat{Y}) = H(Y) - H(Y \mid \hat{Y}) \leq H(Y) - H(Y \mid X) = I(Y; X),$$
which gives $H(Y \mid \hat{Y}) \geq H(Y \mid X)$ and so
$$H(Y \mid X) \leq 1 + p_e \log_2 K.$$
Rearranging gives Fano's inequality:
$$P(Y \neq \hat{Y}) \geq \frac{H(Y \mid X) - 1}{\log_2 K}.$$
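The inequality is easy to check numerically. The following sketch (an illustration I am adding, not from the slides) draws a random joint distribution, uses the MAP rule as the estimator $\hat{Y} = f(X)$, and compares its error probability with the Fano lower bound.

    import numpy as np

    rng = np.random.default_rng(1)

    # A random joint distribution over X in {0,...,4} and Y in {1,...,K} with K = 8.
    K = 8
    joint = rng.dirichlet(np.ones(5 * K)).reshape(5, K)   # joint[x, y] = p(x, y)

    # MAP estimator Y_hat = f(X) and its error probability.
    f = np.argmax(joint, axis=1)
    p_err = 1.0 - sum(joint[x, f[x]] for x in range(5))

    # Conditional entropy H(Y | X) in bits.
    p_x = joint.sum(axis=1)
    p_y_given_x = joint / p_x[:, None]
    h_y_given_x = -np.sum(joint * np.log2(p_y_given_x, where=p_y_given_x > 0,
                                          out=np.zeros_like(joint)))

    fano_lower_bound = (h_y_given_x - 1.0) / np.log2(K)
    print(p_err, ">=", fano_lower_bound)   # Fano: p_err >= (H(Y|X) - 1) / log2(K)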

Fano's Inequality

We can interpret this inequality in some extreme situations to see whether it makes sense:
$$P(Y \neq \hat{Y}) \geq \frac{H(Y \mid X) - 1}{\log_2 K}.$$
Suppose we are trying to learn pure noise; that is, the class label $Y$ is uniformly distributed and independent of the feature vector $X$. Then $H(Y \mid X) = H(Y) = \log_2 K$ and Fano's inequality becomes
$$P(Y \neq \hat{Y}) \geq \frac{\log_2 K - 1}{\log_2 K} = 1 - \frac{1}{\log_2 K}.$$
This is correct but weak, since in this case every estimator has $P(Y \neq \hat{Y}) = 1 - \frac{1}{K}$.

In short: how much $X$ tells us about $Y$ bounds how well we can predict $Y$ from $X$.

Another Bound

We can also obtain a lower bound on the probability of a match by chance.

Lower bound on matching by chance. Suppose that $Y$ and $Y'$ are i.i.d. with distribution $p(y)$. Then
$$P(Y = Y') \geq 2^{-H(Y)}.$$

This makes intuitive sense: the more spread out the distribution of $Y$, the less chance we have of two independently drawn samples matching. Conversely, if there is no randomness in $Y$ then the probability of a match is 1.

Proof:
$$P(Y = Y') = \sum_y p(y)^2 = \mathbb{E}_{y \sim p}\left[2^{\log_2 p(y)}\right] \geq 2^{\mathbb{E}_{y \sim p}[\log_2 p(y)]} = 2^{-H(Y)},$$
where the inequality is Jensen's inequality applied to the convex function $t \mapsto 2^t$.
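Again, a small numerical check (illustrative only): for a random distribution over six outcomes, the collision probability $\sum_y p(y)^2$ dominates $2^{-H(Y)}$.

    import numpy as np

    rng = np.random.default_rng(2)

    # A random distribution over 6 outcomes.
    p = rng.dirichlet(np.ones(6))

    collision_prob = np.sum(p ** 2)          # P(Y = Y') for i.i.d. Y, Y' ~ p
    entropy_bits = -np.sum(p * np.log2(p))   # H(Y) in bits

    print(collision_prob, ">=", 2.0 ** (-entropy_bits))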

Learning from Expert Advice: Motivation

Online Learning from Expert Advice

Consider the following game, where each $\theta \in \Theta$ denotes an expert and $\ell : \Delta_{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ is a loss. Each round $t = 1, \ldots, T$:

1 Experts make predictions $p_\theta^t \in \Delta_{\mathcal{X}}$
2 The player makes a prediction $p^t \in \Delta_{\mathcal{X}}$ (which may depend on the $p_\theta^t$)
3 A new instance $x^t \in \mathcal{X}$ is observed
4 Losses are updated: expert $L_\theta^t = L_\theta^{t-1} + \ell_{x^t}(p_\theta^t)$; player $L^t = L^{t-1} + \ell_{x^t}(p^t)$

Aim: choose $p^t$ to minimise the regret after $T$ rounds,
$$R(T) = L^T - \min_\theta L_\theta^T.$$
Ideally we want $R(T)$ to grow slowly enough that $\lim_{T \to \infty} \frac{1}{T} R(T) = 0$ ("no regret"). We get no regret if $R(T) \propto \sqrt{T}$ (a "slow rate") or if $R(T)$ is constant (a "fast rate"). A simulation of this protocol is sketched below.
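Here is a minimal simulation of the protocol (my own illustration, not from the slides), with log loss on binary outcomes and a naive "follow the leader" player; the Aggregating Algorithm on the following slides replaces this naive choice of $p^t$.

    import numpy as np

    rng = np.random.default_rng(3)

    # A toy instance of the experts game with log loss on a binary outcome space.
    T, n_experts = 200, 5
    expert_preds = rng.uniform(0.05, 0.95, size=(T, n_experts))  # p_theta^t(x=1)
    outcomes = rng.integers(0, 2, size=T)                        # x^t in {0, 1}

    def log_loss(p1, x):
        """l_x(p) = -log p(x) for a Bernoulli prediction p with p(1) = p1."""
        return -np.log(p1 if x == 1 else 1.0 - p1)

    expert_cum = np.zeros(n_experts)  # L_theta^t
    player_cum = 0.0                  # L^t

    for t in range(T):
        # A simple player: follow the expert with the smallest loss so far.
        leader = int(np.argmin(expert_cum))
        p_t = expert_preds[t, leader]

        player_cum += log_loss(p_t, outcomes[t])
        expert_cum += np.array([log_loss(q, outcomes[t]) for q in expert_preds[t]])

    regret = player_cum - expert_cum.min()
    print("R(T) =", regret)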

Mixable Losses and the Aggregating Algorithm

Vovk (1999) characterised when fast rates are possible in terms of a property of the loss he called mixability.

Mixable Loss. A loss $\ell : \Delta_{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$ is $\eta$-mixable if for any predictions $\{p_\theta \in \Delta_{\mathcal{X}}\}_{\theta \in \Theta}$ and any mixture $\mu \in \Delta_\Theta$ there exists a $p \in \Delta_{\mathcal{X}}$ such that for all $x \in \mathcal{X}$
$$\ell_x(p) \leq \mathrm{Mix}_{\eta,\ell}(\mu, x) := -\frac{1}{\eta} \log \mathbb{E}_{\theta \sim \mu}\left[\exp(-\eta\, \ell_x(p_\theta))\right].$$

Examples:
Log loss $\ell_x(p) = -\log p(x)$ is 1-mixable, since for the mixture prediction $\bar{p} = \mathbb{E}_{\theta \sim \mu}[p_\theta]$ we have $-\log \mathbb{E}_{\theta \sim \mu}[\exp(\log p_\theta(x))] = -\log \mathbb{E}_{\theta \sim \mu}[p_\theta(x)] = -\log \bar{p}(x)$.
Square loss $\ell_x(p) = \|p - \delta_x\|_2^2$ is 2-mixable.
Absolute loss $\ell_x(p) = \|p - \delta_x\|_1$ is not mixable.
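A short check of the log-loss example (illustrative sketch): the mixture prediction attains the mix bound with equality.

    import numpy as np

    rng = np.random.default_rng(4)

    # Three experts predicting distributions over a 4-letter alphabet.
    experts = rng.dirichlet(np.ones(4), size=3)   # rows are p_theta
    mu = rng.dirichlet(np.ones(3))                # mixture over experts
    eta = 1.0

    # Mix_{eta,l}(mu, x) for log loss, and the log loss of the mixture prediction.
    p_bar = mu @ experts
    for x in range(4):
        mix = -np.log(np.sum(mu * np.exp(-eta * (-np.log(experts[:, x]))))) / eta
        print(-np.log(p_bar[x]), "<=", mix)       # equality here, up to rounding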

Mixability Theorem

Mixability guarantees fast rates (i.e., constant $R(T)$).

Mixability implies fast rates. If $\ell$ is an $\eta$-mixable loss then there exists an algorithm that achieves regret
$$R(T) \leq \frac{\log |\Theta|}{\eta}.$$

The witness to this result is the Aggregating Algorithm (AA):
Initialise $\mu^0 = \frac{1}{|\Theta|}$ (uniform over the experts).
Each round $t$: predict using any $p^t$ guaranteed to satisfy $\ell_x(p^t) \leq \mathrm{Mix}_{\eta,\ell}(\mu^{t-1}, x)$ for all $x$; after observing $x^t$, update $\mu^t(\theta) \propto \mu^{t-1}(\theta) \exp(-\eta\, \ell_{x^t}(p_\theta^t))$.
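Below is a sketch of the AA specialised to log loss with $\eta = 1$ (my own illustration; the synthetic data are made up). For log loss the mixture prediction attains the mix bound, so the prediction step is simply the $\mu^{t-1}$-average of the experts' distributions, and the realised regret respects the $\frac{\log|\Theta|}{\eta}$ bound.

    import numpy as np

    rng = np.random.default_rng(5)

    # Aggregating Algorithm for log loss (eta = 1): the mixture prediction attains
    # the mix bound, so AA reduces to Bayesian model averaging over the experts.
    T, n_experts, alphabet = 300, 6, 3
    expert_preds = rng.dirichlet(np.ones(alphabet), size=(T, n_experts))
    outcomes = rng.integers(0, alphabet, size=T)
    eta = 1.0

    mu = np.full(n_experts, 1.0 / n_experts)   # mu^0: uniform prior over experts
    expert_cum = np.zeros(n_experts)
    player_cum = 0.0

    for t in range(T):
        losses = -np.log(expert_preds[t, :, outcomes[t]])   # l_{x^t}(p_theta^t)
        p_t = mu @ expert_preds[t]                           # AA prediction
        player_cum += -np.log(p_t[outcomes[t]])
        expert_cum += losses
        mu = mu * np.exp(-eta * losses)                      # exponential weights
        mu /= mu.sum()

    regret = player_cum - expert_cum.min()
    print(regret, "<=", np.log(n_experts) / eta)             # mixability bound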

Proof of Mixability Theorem

When using the AA to choose $p^t$, first note that if $W^t = \sum_\theta e^{-\eta L_\theta^t}$ (so $W^0 = |\Theta|$ and $\mu^{t-1}(\theta) = e^{-\eta L_\theta^{t-1}} / W^{t-1}$), then
$$\mathbb{E}_{\theta \sim \mu^{t-1}}\left[e^{-\eta\, \ell_{x^t}(p_\theta^t)}\right] = \sum_\theta e^{-\eta\, \ell_{x^t}(p_\theta^t)}\, e^{-\eta L_\theta^{t-1}} / W^{t-1} = W^t / W^{t-1}.$$

Now consider the total loss after $T$ rounds when using the AA to choose $p^t$:
$$L^T = \sum_{t=1}^T \ell_{x^t}(p^t) \leq \sum_{t=1}^T \mathrm{Mix}_{\eta,\ell}(\mu^{t-1}, x^t) = -\frac{1}{\eta} \sum_{t=1}^T \log \mathbb{E}_{\theta \sim \mu^{t-1}}\left[\exp(-\eta\, \ell_{x^t}(p_\theta^t))\right]$$
$$= -\frac{1}{\eta} \sum_{t=1}^T \log \frac{W^t}{W^{t-1}} = -\frac{1}{\eta} \log \frac{W^T}{W^0} \leq -\frac{1}{\eta}\left(-\eta L_\theta^T - \log|\Theta|\right)$$
for every $\theta \in \Theta$, since $W^T \geq e^{-\eta L_\theta^T}$ and $W^0 = |\Theta|$. This gives
$$L^T - L_\theta^T \leq \frac{\log|\Theta|}{\eta},$$
as required.

What's this got to do with Information Theory?

The telescoping of $W^t / W^{t-1}$ in the above argument can be obtained via an additive telescoping in the dual space to $\Theta$, since the mixability condition can be written as
$$\mathrm{Mix}_{\eta,\ell}(\mu, x) = \inf_{\mu' \in \Delta_\Theta} \underbrace{\mathbb{E}_{\theta \sim \mu'}\left[\ell_x(p_\theta)\right]}_{\text{fit the loss}} + \frac{1}{\eta} \underbrace{\mathrm{KL}(\mu' \,\|\, \mu)}_{\text{regularise}}$$
and the minimising $\mu'$ is exactly the distribution maintained by the AA. Furthermore:
The distributions $\mu^t(\theta) \propto e^{-\eta L_\theta^t}$ are like an exponential family with statistic $(L_\theta^t)_{\theta \in \Theta}$.
The regret bound is $\frac{1}{\eta} \mathrm{KL}(\delta_\theta \,\|\, \pi)$ for the uniform prior $\pi = \frac{1}{N}$ and $\delta_\theta$ the point mass on $\theta$.
For log loss, $\eta = 1$ and the AA coincides with Bayesian updating.
(Similar results hold for general Bregman-divergence regularisation too.)
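For completeness, here is the standard Gibbs variational identity behind the claim (a derivation I am adding; it is not spelled out on the slide):
$$\mathbb{E}_{\theta \sim \mu'}\left[\ell_x(p_\theta)\right] + \frac{1}{\eta} \mathrm{KL}(\mu' \,\|\, \mu)
= \frac{1}{\eta} \mathrm{KL}\!\left(\mu' \,\Big\|\, \frac{\mu(\cdot)\, e^{-\eta \ell_x(p_\cdot)}}{Z}\right) - \frac{1}{\eta} \log Z,
\qquad Z = \mathbb{E}_{\theta \sim \mu}\left[e^{-\eta \ell_x(p_\theta)}\right].$$
Since the KL term is nonnegative and vanishes exactly when $\mu'(\theta) \propto \mu(\theta)\, e^{-\eta \ell_x(p_\theta)}$ (the AA update), the infimum over $\mu'$ is $-\frac{1}{\eta} \log Z = \mathrm{Mix}_{\eta,\ell}(\mu, x)$.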

Exponential Families: A Quick Review

Exponential Family. For a statistic $\phi : \mathcal{X} \to \mathbb{R}^d$, an exponential family (w.r.t. some base measure $\lambda$) is a set $\mathcal{F} = \{p_\theta : \theta \in \Theta\}$ of densities of the form
$$p_\theta(x) := \exp\left(\langle \phi(x), \theta \rangle - C(\theta)\right)$$
with finite cumulant $C(\theta) := \log \int_{\mathcal{X}} \exp\langle \phi(x), \theta \rangle \, d\lambda(x)$.
The parameters $\theta \in \Theta$ are the natural parameters. The family $\mathcal{F}$ is regular if $\Theta$ is an open set.

Selected properties:
Convexity: $\Theta$ is a convex set and $C : \Theta \to \mathbb{R}$ is a convex function.
The gradient of the cumulant is the mean of the statistic: $\nabla C(\theta) = \mathbb{E}_{x \sim p_\theta}[\phi(x)]$.
The KL divergence between members is the Bregman divergence of $C$: $\mathrm{KL}(p_{\theta'} \,\|\, p_\theta) = D_C(\theta, \theta')$.
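A concrete check with the Bernoulli family (my own sketch, not part of the slides), where $\phi(x) = x$ and $C(\theta) = \log(1 + e^\theta)$: the derivative of the cumulant is the mean parameter, and the KL divergence matches the Bregman divergence $D_C$.

    import numpy as np

    # Bernoulli as an exponential family: phi(x) = x, C(theta) = log(1 + e^theta),
    # p_theta(x) = exp(x*theta - C(theta)) for x in {0, 1}.
    def C(theta):
        return np.log1p(np.exp(theta))

    def mean(theta):                      # gradient of the cumulant
        return 1.0 / (1.0 + np.exp(-theta))

    def kl_bernoulli(a, b):               # KL between Bernoulli(a) and Bernoulli(b)
        return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

    def bregman_C(t1, t2):                # D_C(t1, t2) = C(t1) - C(t2) - C'(t2)(t1 - t2)
        return C(t1) - C(t2) - mean(t2) * (t1 - t2)

    theta, theta_p = 0.4, -1.1
    # Gradient of the cumulant is the mean parameter (finite-difference check).
    eps = 1e-6
    print((C(theta + eps) - C(theta - eps)) / (2 * eps), mean(theta))
    # KL(p_theta' || p_theta) equals the Bregman divergence D_C(theta, theta').
    print(kl_bernoulli(mean(theta_p), mean(theta)), bregman_C(theta, theta_p))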

Exponential Families via Maximum Entropy

EF distributions are maximum entropy solutions under mean constraints.

Maximum Entropy. Define the Shannon entropy $H(p) = -\int_{\mathcal{X}} p(x) \log p(x) \, d\lambda(x)$. For a given mean value $r \in \mathbb{R}^d$, define the maximum entropy solution
$$p^r = \arg\sup \{ H(p) : p \in \Delta_{\mathcal{X}},\ \mathbb{E}_p[\phi] = r \}$$
and the maximum entropy family $\mathcal{F} = \{p^r\}_{r \in \Phi}$.

Properties:
The exponential family $\{p_\theta\}_{\theta \in \Theta}$ and the MaxEnt family $\{p^r\}_{r \in \Phi}$ contain the same distributions.
A bijection between natural parameters $\theta \in \Theta$ and mean parameters $r \in \Phi$ is given by $r = \nabla C(\theta)$ and $\theta = (\nabla C)^{-1}(r) = \nabla C^*(r)$.
The Lagrangian is $L(p, \theta) = H(p) + \langle \theta, \mathbb{E}_p[\phi] - r \rangle$ with dual variables $\theta$.
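A short derivation (added here; the slides only state the Lagrangian) of why the constrained maximiser has exponential-family form, writing $\nu$ for an extra multiplier on the normalisation constraint:
$$L(p, \theta, \nu) = -\int_{\mathcal{X}} p \log p \, d\lambda + \Big\langle \theta, \int_{\mathcal{X}} \phi\, p \, d\lambda - r \Big\rangle + \nu\Big(\int_{\mathcal{X}} p \, d\lambda - 1\Big),$$
$$\frac{\partial L}{\partial p(x)} = -\log p(x) - 1 + \langle \theta, \phi(x) \rangle + \nu = 0
\quad\Longrightarrow\quad
p(x) \propto \exp\langle \phi(x), \theta \rangle.$$
Normalising gives $p^r = p_\theta$ with the constant absorbed into $C(\theta)$, and the multiplier $\theta$ is fixed by the constraint $\mathbb{E}_{p_\theta}[\phi] = \nabla C(\theta) = r$, which is exactly the bijection above.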

Exponential Families via Convex Duality

Some simple calculations show that $-H$ is convex over $\Delta_{\mathcal{X}}$ and that
$$(-H)^*(q) = \log \int_{\mathcal{X}} \exp(q(x)) \, d\lambda(x)
\qquad\text{and}\qquad
\nabla(-H)^*(q)(x) = \frac{\exp(q(x))}{\int_{\mathcal{X}} \exp(q(\xi)) \, d\lambda(\xi)}.$$

Exponential Families via Convexity. For a statistic $\phi : \mathcal{X} \to \mathbb{R}^d$, each $p_\theta$ in the exponential family for $\phi$ can be written as
$$p_\theta = \nabla(-H)^*(\phi \cdot \theta) \qquad\text{and}\qquad C(\theta) = (-H)^*(\phi \cdot \theta),$$
where $\phi \cdot \theta$ denotes the random variable $x \mapsto \langle \phi(x), \theta \rangle$.
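In the finite, counting-measure case these formulas reduce to the log-sum-exp / softmax pair, which can be sanity-checked numerically (my own sketch; the random search is only an approximation of the supremum defining the conjugate):

    import numpy as np

    rng = np.random.default_rng(6)

    # Finite X with counting measure: (-H)(p) = sum_x p(x) log p(x) on the simplex.
    # The conjugate should be (-H)*(q) = log sum_x exp(q(x)), with softmax as gradient.
    q = rng.normal(size=5)

    # Approximate the conjugate sup_p { <p, q> - (-H)(p) } by sampling simplex points.
    samples = rng.dirichlet(np.ones(5), size=100000)
    numeric_sup = np.max(samples @ q - np.sum(samples * np.log(samples), axis=1))

    logsumexp = np.log(np.sum(np.exp(q)))
    softmax = np.exp(q) / np.sum(np.exp(q))

    print(numeric_sup, "<=", logsumexp)                      # random search stays below
    print(softmax @ q - np.sum(softmax * np.log(softmax)))   # ... and softmax attains it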

Sufficient Statistics

Suppose we have a parametric family of distributions over $\mathcal{X}$, $\mathcal{P}_\Theta = \{p_\theta \in \Delta_{\mathcal{X}} : \theta \in \Theta\}$. In statistics, a sufficient statistic intuitively captures all the information in observations from $\mathcal{X}$ that is relevant for inference in $\mathcal{P}_\Theta$.

Sufficient Statistic. A function $\phi : \mathcal{X} \to \mathbb{R}^K$ is a sufficient statistic for $\mathcal{P}_\Theta$ if $\theta$ and $X$ are conditionally independent given $\phi(X)$, i.e., $\theta \to \phi(X) \to X$.

This intuition can be formalised using mutual information via the data processing inequality: since $\phi(X)$ is a function of $X$ we always have $\theta \to X \to \phi(X)$ and so $I(\theta; X) \geq I(\theta; \phi(X))$. However, the DPI also says that equality holds if and only if $\theta \to \phi(X) \to X$, that is, if and only if $\phi$ is sufficient for $\mathcal{P}_\Theta$.

Sufficient Statistic (Information Theory). A function $\phi : \mathcal{X} \to \mathbb{R}^K$ is a sufficient statistic when $I(\theta; \phi(X)) = I(\theta; X)$.
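To close, a toy numerical illustration (mine, not from the slides): for two i.i.d. Bernoulli($\theta$) observations with $\theta$ drawn uniformly from $\{0.3, 0.7\}$, the count $\phi(X) = X_1 + X_2$ preserves all the mutual information with $\theta$, while the single coordinate $X_1$ does not.

    import numpy as np
    from itertools import product

    # Toy model: theta uniform on {0.3, 0.7}, X = (X1, X2) i.i.d. Bernoulli(theta).
    # The count phi(X) = X1 + X2 should be sufficient: I(theta; phi(X)) = I(theta; X).
    thetas = [0.3, 0.7]
    prior = [0.5, 0.5]

    def mutual_info(joint):
        """Mutual information between the two axes of a joint probability table."""
        joint = np.asarray(joint)
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz]))

    # Joint over (theta, X) with X ranging over {0,1}^2.
    xs = list(product([0, 1], repeat=2))
    joint_x = [[pr * np.prod([t if xi else 1 - t for xi in x]) for x in xs]
               for t, pr in zip(thetas, prior)]

    # Joint over (theta, phi(X)) and over (theta, X1) for comparison.
    joint_phi = [[sum(row[i] for i, x in enumerate(xs) if sum(x) == s) for s in (0, 1, 2)]
                 for row in joint_x]
    joint_x1 = [[sum(row[i] for i, x in enumerate(xs) if x[0] == b) for b in (0, 1)]
                for row in joint_x]

    print(mutual_info(joint_x), mutual_info(joint_phi))  # equal: phi is sufficient
    print(mutual_info(joint_x1))                         # smaller: X1 alone is not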