Lecture 10: Universal coding and prediction


10-704: Information Processing and Learning                         Spring 2012

Lecture 10: Universal coding and prediction

Lecturer: Aarti Singh                                    Scribes: Georg M. Goerg

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

10.1 Universal coding and prediction

We want an encoder that works well for any distribution p ∈ P. For example, to gzip a text file we would like an encoder that works for several different languages, rather than designing a new zip code for each single language. Shannon/Huffman codes could achieve this, but their disadvantage is that in general we have to wait for the whole sequence to arrive before beginning to encode/decode. We would like to design a code that can be encoded and decoded immediately, on a symbol-per-symbol basis.

In this lecture we will focus on the duality between optimal coding and optimal prediction. For example, it is natural to require a code C to have short length L_C(x) for the symbol x; analogously, in prediction we often want to minimize a loss function loss_q(x) for a predictor q. Table 10.1 gives a general overview of several dual concepts and notions of a code C and a predictor q that we will use in this (and upcoming) lectures.

10.1.1 Weak and strong universality

A code C is universal if

    R_C = sup_{p ∈ P} E_p(R_{p,C}) → 0.                                   (10.1)

A predictor q is universally consistent if

    R_q = sup_{p ∈ P} E_p(R_{p,q}) → 0.                                   (10.2)

Just as standard calculus has pointwise and uniform convergence (of, e.g., functions), we can also differentiate between weak and strong universality. A code is

  - weakly universal if the convergence rate depends on p ∈ P, e.g. if R_C = o(n^{-γ_p}) and γ_p changes for every p ∈ P;
  - strongly universal if the convergence rate is the same for all p ∈ P, e.g. R_C = o(n^{-γ}) for all p ∈ P.

(The rate need not necessarily be polynomial; it just serves as an example.)

By Kraft's inequality we know that for prefix codes

    ∑_{x^n ∈ X^n} 2^{-L_C(x^n)} ≤ 1.                                      (10.3)
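The following short Python sketch is my own illustration, not part of the original notes; the prefix code and the source p below are assumptions chosen for concreteness. It checks Kraft's inequality (10.3) for a small single-symbol code and computes the per-symbol redundancy E_p[L_C(x)] - H(p) that appears in Table 10.1.

```python
import math

# Hypothetical prefix code over {a, b, c, d} (an assumption for illustration).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft sum: sum_x 2^{-L_C(x)} must be <= 1 for any prefix code, cf. (10.3).
kraft_sum = sum(2.0 ** (-len(w)) for w in code.values())
assert kraft_sum <= 1.0 + 1e-12
print(f"Kraft sum = {kraft_sum:.3f}")  # = 1.000 here, i.e. the code is complete

# Expected length and redundancy E_p[L_C(x)] - H(p) under an assumed source p.
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
exp_len = sum(p[s] * len(w) for s, w in code.items())
entropy = -sum(px * math.log2(px) for px in p.values())
print(f"E_p[L_C] = {exp_len:.3f} bits, H(p) = {entropy:.3f} bits, "
      f"redundancy = {exp_len - entropy:.3f} bits/symbol")
```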

Table 10.1: Universal coding and prediction: per-symbol properties of a code C and a predictor q for data x_1, ..., x_n from p ∈ P. (We want the code C, respectively the predictor q, to work well for all p ∈ P.)

  Per-symbol loss
    code C:      L_C(x), the length (code length) of symbol x using code C
    predictor q: loss_q(x) = -log q(x)

  Average loss on a sequence x^n
    code C:      (1/n) L_C(x^n); the ideal code length is L_C(x) ≈ -log p(x), the Shannon information content
    predictor q: -(1/n) log q(x^n) = (1/n) ∑_{i=1}^n [-log q(x_i)] if x^n is iid; the negative log-likelihood or self-information loss, i.e. the empirical log-likelihood of the data; -(1/n) log p(x^n) is the likelihood under the true model p

  Expected loss (E-loss = risk)
    code C:      E_p[(1/n) L_C(x^n)]
    predictor q: E_p[-(1/n) log q(x^n)]

  Redundancy / excess loss
    code C:      R_{p,C} = (1/n) [ L_C(x^n) - (-log p(x^n)) ]
    predictor q: R_{p,q} = (1/n) [ -log q(x^n) - (-log p(x^n)) ], the excess loss of predictor q w.r.t. the true p

  Expected redundancy / excess risk
    code C:      E_p(R_{p,C}) = (1/n) [ E_p L_C(x^n) - H(x^n) ] ≥ 0 for uniquely decodable codes
    predictor q: E_p(R_{p,q}) = (1/n) E_p log[ p(x^n)/q(x^n) ] = (1/n) D(p ‖ q) ≥ 0; excess risk = expected excess loss

  Minimax
    code C:      min_C sup_{p ∈ P} E_p(R_{p,C}) =: R_C*   (minimax expected redundancy)
    predictor q: min_q sup_{p ∈ P} (1/n) D(p ‖ q) =: R_n*  (minimax excess risk)
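The predictor column of Table 10.1 can be made concrete in a few lines of Python. This is a sketch of my own (the distributions p and q are arbitrary assumptions): it computes the self-information loss, the risk, and the expected redundancy for an iid source, and checks the identity risk = H(p) + D(p ‖ q).

```python
import math

p = {"a": 0.7, "b": 0.2, "c": 0.1}  # assumed true source
q = {"a": 0.5, "b": 0.3, "c": 0.2}  # some predictor != p

# Self-information loss of q on one symbol: loss_q(x) = -log2 q(x).
loss_q = {x: -math.log2(q[x]) for x in q}

# Risk (expected per-symbol loss) and entropy of the true source p.
risk = sum(p[x] * loss_q[x] for x in p)
entropy = -sum(px * math.log2(px) for px in p.values())

# Expected redundancy / excess risk: E_p[log2 p/q] = D(p||q) >= 0 (per symbol, iid case).
excess = sum(p[x] * math.log2(p[x] / q[x]) for x in p)
print(f"risk = {risk:.4f} bits, H(p) = {entropy:.4f} bits, D(p||q) = {excess:.4f} bits")
assert abs(risk - (entropy + excess)) < 1e-9  # risk = H(p) + D(p||q)
```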

From (10.3) we can construct the corresponding predictor by

    q(x^n) = 2^{-L_C(x^n)} / ∑_{k ∈ X^n} 2^{-L_C(k)}  ≥  2^{-L_C(x^n)},
    so that   L_C(x^n) ≥ -log q(x^n).                                     (10.4)

Similarly, for any distribution q(x^n) we can define a corresponding prefix code that satisfies

    L_C(x^n) ≤ -log q(x^n) + 1.                                           (10.5)

This must be true since we know that at least the Shannon code, with lengths ⌈-log q(x^n)⌉, can achieve this. For prefix codes it therefore follows (combine (10.1) with (10.4)) that

    R_C* ≥ min_q sup_{p ∈ P} (1/n) D(p ‖ q) =: R_n*.                      (10.6)

Let q_n* = arg min_q sup_{p ∈ P} (1/n) D(p ‖ q) be the predictor that achieves the above minimum. We can then construct a Shannon code C from this predictor. By (10.5) we know that R_C will be within 1/n bit per symbol of R_C* (or within 1 bit for the entire sequence).

10.1.2 Prediction problem

We have data x^n from p ∈ P. How much loss do we suffer from using q ≠ p instead of the true p? Here q can be any distribution; however, typically q is an estimate of p depending on the data x^n.

In general, we have to impose some restrictions on the class of distributions P to get universally consistent codes C / predictors q, i.e. to achieve error rates → 0. For example, we often assume i) iid data, ii) Markov chains, etc.

For a specific q consider R_q ≥ R_n*. Typically we try to bound these two quantities to get control over the error rates. Note that q_n* = arg min_q R_q, where R_q is the worst-case excess risk for a particular q; that is, q_n* is the model/estimate q that minimizes the expected worst-case scenario.

We will show later in the course that, in general, the optimal q_n* is a mixture distribution over the class P. In other words, for any q, there exists a mixture distribution p_mix such that the excess risk of q is always greater than or equal to the excess risk of p_mix, i.e.

    D(p ‖ q) ≥ D(p ‖ p_mix).                                              (10.7)

10.1.3 Maximum loss instead of expected loss

Now, instead of the expected loss, consider the maximum loss (maximum over all possible sequences x^n):

    R̃_n = min_q sup_{p ∈ P} max_{x^n} (1/n) log [ p(x^n) / q(x^n) ].       (10.8)
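To illustrate the code / predictor duality in (10.4)-(10.5), here is a small Python sketch of my own (the block length n = 3 and the iid predictor q are assumptions). Shannon code lengths ⌈-log2 q(x^n)⌉ satisfy Kraft's inequality, so a prefix code with these lengths exists, and they are within 1 bit of -log2 q(x^n); conversely, renormalizing 2^{-L} recovers a predictor as in (10.4).

```python
import itertools
import math

alphabet = "ab"
n = 3

def q(xs):
    """Assumed predictor: iid with P(a) = 0.8, purely for illustration."""
    return math.prod(0.8 if x == "a" else 0.2 for x in xs)

seqs = ["".join(s) for s in itertools.product(alphabet, repeat=n)]

# (10.5): Shannon code lengths built from the predictor q.
L = {s: math.ceil(-math.log2(q(s))) for s in seqs}
assert sum(2.0 ** (-L[s]) for s in seqs) <= 1.0         # Kraft holds, so a prefix code exists
assert all(L[s] <= -math.log2(q(s)) + 1 for s in seqs)  # within 1 bit of -log2 q

# (10.4): from code lengths back to a predictor by normalizing 2^{-L}.
Z = sum(2.0 ** (-L[s]) for s in seqs)
q_from_code = {s: 2.0 ** (-L[s]) / Z for s in seqs}
assert all(L[s] >= -math.log2(q_from_code[s]) for s in seqs)

print("per-sequence overhead L + log2 q (bits):",
      {s: round(L[s] + math.log2(q(s)), 3) for s in seqs[:3]})
```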

Let P_ML(x^n) = sup_{p ∈ P} p(x^n) (the MLE). Define the normalized maximum likelihood (NML) distribution as

    NML(x^n) = P_ML(x^n) / ∑_{x^n ∈ X^n} P_ML(x^n) = q̃_n*,                (10.9)

the optimal predictor under maximum loss (instead of E-loss). The normalized maximum likelihood distribution is the best universal predictor under maximum loss.

Theorem 10.1 For any class P of processes with finite alphabet,

    q̃_n* = NML(x^n)   and   R̃_n = (1/n) log ∑_{x^n ∈ X^n} P_ML(x^n).      (10.10)

For a proof, see p. 480 of Csiszár and Shields' tutorial.

10.1.3.1 Problems of NML and maximum loss for arithmetic coding

For arithmetic coding we need the conditional distribution q(x_n | x^{n-1}). But the NML distribution is not consistent under marginalization, in the sense that

    q̃_n*(x^n) ≠ ∑_{x_{n+1}} q̃_{n+1}*(x^{n+1}),                            (10.11)

or equivalently

    q̃_n*(x_1, ..., x_n) ≠ ∑_{x_{n+1}} q̃_{n+1}*(x_1, ..., x_n, x_{n+1}).    (10.12)

Remark: The right hand side in the above expressions yields a valid distribution, but it is not the distribution of x_1, ..., x_n under q̃_n*. This can be seen by recalling the definition of q̃_n*. Thus it is not possible to define a corresponding arithmetic code (Shannon and Huffman codes are possible, though).

We therefore return to the E-loss, since in that case we know that q_n* is a mixture distribution, and a mixture is consistent in the sense that

    q_n*(x_1, ..., x_n) = ∑_{x_{n+1}} q_n*(x_1, ..., x_n, x_{n+1}).          (10.13)
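The following sketch is my own numerical illustration (the iid Bernoulli class and the block length are assumptions): it computes the NML distribution (10.9) at block lengths n and n+1, evaluates the max-loss minimax value from Theorem 10.1, and shows the marginalization inconsistency (10.11) on a concrete sequence.

```python
import itertools
import math

def p_ml(xs):
    """Maximized likelihood sup_theta p_theta(x^n) for the iid Bernoulli class."""
    n, k = len(xs), sum(xs)
    th = k / n
    return (th ** k) * ((1 - th) ** (n - k))  # Python gives 0**0 == 1, as needed

def nml(n):
    seqs = list(itertools.product([0, 1], repeat=n))
    Z = sum(p_ml(s) for s in seqs)            # normalizer (sum of maximized likelihoods)
    return {s: p_ml(s) / Z for s in seqs}, Z

n = 4
nml_n, Z_n = nml(n)
nml_n1, _ = nml(n + 1)

# Per-symbol max-loss minimax value from Theorem 10.1: (1/n) log2 of the normalizer.
print(f"R~_n = {math.log2(Z_n) / n:.4f} bits/symbol")

# Inconsistency (10.11): summing NML at length n+1 over x_{n+1} does not give NML at length n.
x = (0, 0, 1, 1)
marginal = sum(nml_n1[x + (b,)] for b in (0, 1))
print(f"NML_n(x)               = {nml_n[x]:.5f}")
print(f"sum over x_5 of NML_5  = {marginal:.5f}  (different, so no arithmetic coding)")
```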

Examples of model classes and their optimal codes/predictors.

1. P is the class of iid processes with finite alphabet X. It can be shown that the optimal predictor is given by (a runnable sketch of this predictor follows the list)

       q(x^n) = ∏_{i=1}^n [ n_{i-1}(x_i) + 1/2 ] / [ i - 1 + |X|/2 ],        (10.14)

   where

       n_{i-1}(x_i) = # of occurrences of the symbol x_i in the past x^{i-1}.  (10.15)

   The term n_{i-1}(x_i)/(i-1) is simply the frequency with which the symbol was observed before time i; the additional 1/2 (and |X|/2) smoothes out the ML estimate. Thus it avoids assigning probability 0 to symbols that have not occurred yet (but may occur in the future).

   Let n_x be the number of times the symbol x occurred in the entire length-n sequence. Then (10.14) can be rewritten as (see next lecture)

       q(x^n) = ∏_{x ∈ X} [ (n_x - 1/2)(n_x - 3/2) ⋯ (1/2) ] / [ (n - 1 + |X|/2)(n - 2 + |X|/2) ⋯ (|X|/2) ]
              = ∫ π(p) p(x^n) dp,                                            (10.16)

   where π(p) is the Dirichlet(1/2, ..., 1/2) prior (https://en.wikipedia.org/wiki/Dirichlet_distribution). One can show that (for a proof see the next lecture)

       R_n* ≤ R_q ≤ (|X| - 1)/(2n) · log n  [best possible bound]  +  constant/n.   (10.17)

2. P is the class of Markov processes of order 1. Let n_i(k, j) be the count of how many times the pair (k, j) appeared in the first i symbols (x_1, ..., x_i); also let n_i(k) = ∑_j n_i(k, j) be the total number of times the symbol k occurred in the first i symbols. Then

       q(x^n) = ∏_{i=1}^n q(x_i | x^{i-1}),
       q(j | x^{i-1}) = [ n_{i-1}(k, j) + 1/2 ] / [ n_{i-1}(k) + |X|/2 ],    k = x_{i-1}.   (10.18)

   For a Markov process of order m = 1 one can show (see next lecture)

       R_n* ≤ R_q ≤ |X| (|X| - 1)/(2n) · log n  [best possible bound]  +  constant/n.   (10.19)

3. P is the class of Markov processes of order m, i.e. x_i depends on the previous m steps. Then

       q(j | x^{i-1}) = [ # times j occurred preceded by x_{i-m}^{i-1} + 1/2 ] / [ # times x_{i-m}^{i-1} occurred + |X|/2 ].   (10.20)

   Here it holds that

       R_n* ≤ R_q ≤ |X|^m (|X| - 1)/(2n) · log n  [best possible bound]  +  constant_m/n.   (10.21)

   Again, see the next lecture for detailed derivations.
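The add-1/2 predictor (10.14), known as the Krichevsky-Trofimov estimator, is easy to run sequentially. The sketch below is my own illustration on an assumed binary sequence: it compares the sequential code length -log2 q(x^n) with the hindsight ML code length -log2 P_ML(x^n) and with the leading term (|X| - 1)/(2n) log2 n of (10.17) (recall that the bound also carries an additional constant/n term).

```python
import math
from collections import Counter

def kt_sequential_logloss(xs, alphabet):
    """Total code length -log2 q(x^n) of the add-1/2 (KT) predictor in (10.14)."""
    counts = Counter()
    total_bits = 0.0
    for i, x in enumerate(xs):  # i = number of symbols already seen
        prob = (counts[x] + 0.5) / (i + len(alphabet) / 2)
        total_bits += -math.log2(prob)
        counts[x] += 1
    return total_bits

# Assumed example sequence over the binary alphabet.
xs = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
n, A = len(xs), [0, 1]
kt_bits = kt_sequential_logloss(xs, A)

# Best iid fit in hindsight (the maximized likelihood P_ML of (10.9)).
k = sum(xs)
theta_hat = k / n
ml_bits = -(k * math.log2(theta_hat) + (n - k) * math.log2(1 - theta_hat))

print(f"KT: {kt_bits:.2f} bits, hindsight ML: {ml_bits:.2f} bits, "
      f"per-symbol regret = {(kt_bits - ml_bits) / n:.3f} bits, "
      f"leading term (|A|-1)/(2n) log2 n = {(len(A) - 1) / (2 * n) * math.log2(n):.3f} bits")
```

The same counting rule, applied per preceding context as in (10.18) and (10.20), gives the Markov-class predictors; only the counts are conditioned on the previous m symbols.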