Predictive Hypothesis Identification

Slide 1: Predictive Hypothesis Identification
Marcus Hutter
ANU RSISE NICTA
Canberra, ACT 0200, Australia
http://www.hutter1.net/

Slide 2: Contents
- The Problem - Information Summarization
- Estimation, Testing, and Model Selection
- The Predictive Hypothesis Identification Principle
- Exact and Asymptotic Properties & Expressions
- Comparison to MAP, ML, MDL, and Moment Fitting
- Conclusions

Slide 3: Abstract
While statistics focusses on hypothesis testing and on estimating (properties of) the true sampling distribution, in machine learning the performance of learning algorithms on future data is the primary issue. In this paper we bridge the gap with a general principle (PHI) that identifies hypotheses with best predictive performance. This includes predictive point and interval estimation, simple and composite hypothesis testing, (mixture) model selection, and others as special cases. For concrete instantiations we will recover well-known methods, variations thereof, and new ones. In particular we will discover moment estimation and a reparametrization invariant variation of MAP estimation, which beautifully reconciles MAP with ML. One particular feature of PHI is that it can genuinely deal with nested hypotheses.

Slide 4: The Problem - Information Summarization
Given: data D = (x_1, ..., x_n) ∈ X^n (any X) sampled from a distribution p(D|θ) with unknown θ ∈ Ω.
The likelihood function p(D|θ) or the posterior p(θ|D) ∝ p(D|θ)p(θ) contains all statistical information about the sample D.
An information summary or simplification of p(D|θ) is needed (comprehensibility, communication, storage, computational efficiency, mathematical tractability, etc.).
Regimes:
- parameter estimation,
- hypothesis testing,
- model (complexity) selection.
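
As a concrete instance of the posterior p(θ|D) ∝ p(D|θ)p(θ), the following minimal Python sketch uses the Bernoulli model that reappears later in the talk; the uniform Beta(1,1) prior and the counts n_1, n_0 are illustrative assumptions, not taken from the slides.

    from scipy import stats

    # Bernoulli likelihood p(D|theta) = theta^n1 * (1-theta)^n0; with a uniform
    # Beta(1,1) prior, the posterior p(theta|D) is Beta(n1+1, n0+1) in closed form.
    n1, n0 = 10, 10                             # assumed counts of 1s and 0s in D
    posterior = stats.beta(n1 + 1, n0 + 1)
    print(posterior.pdf(0.5))                   # posterior density at theta = 1/2
    print(posterior.mean())                     # posterior mean = (n1+1)/(n+2) = 0.5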

Slide 5: Ways to Summarize the Posterior
- by a single point Θ = {θ} (ML or MAP or mean or stochastic or ...),
- by a convex set Θ ⊆ Ω (e.g. a confidence or credible interval),
- by a finite set of points Θ = {θ_1, ..., θ_l} (mixture models),
- by a sample of points (particle filtering),
- by the mean and covariance matrix (Gaussian approximation),
- by more general density estimation,
- in a few other ways.
I concentrate on set estimation, which includes (multiple) point estimation and hypothesis testing as special cases. Call it: Hypothesis Identification.

Slide 6: Desirable Properties
Any hypothesis identification principle should:
- lead to good predictions (that's what models are ultimately for),
- be broadly applicable,
- be analytically and computationally tractable,
- be defined for, and work with, non-i.i.d. and non-stationary data,
- be reparametrization and representation invariant,
- work for simple and composite hypotheses,
- work for classes containing nested and overlapping hypotheses,
- work in the estimation, testing, and model selection regimes,
- reduce in special cases (approximately) to existing methods.
Here we concentrate on the first item, and will show that the resulting principle nicely satisfies many of the others.

Slide 7: The Main Idea
Machine learning primarily cares about predictive performance, so we address the problem head on.
Goal: predict m future observations x = (x_{n+1}, ..., x_{n+m}) ∈ X^m well.
If θ_0 is the true parameter, then p(x|θ_0) is obviously the best prediction.
If θ_0 is unknown, then the predictive distribution p(x|D) = ∫ p(x|θ) p(θ|D) dθ = p(D,x)/p(D) is best.
Approximate full Bayes by predicting with hypothesis H = {θ ∈ Θ}, i.e. use the (composite) likelihood p(x|Θ) = (1/P[Θ]) ∫_Θ p(x|θ) p(θ) dθ for prediction.
The closer p(x|Θ) is to p(x|θ_0) or p(x|D), the better H's prediction.
Measure closeness with some distance function d(·,·).
Since x and θ_0 are unknown, we must sum or average over them.
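
To make the two predictive objects on this slide concrete, here is a small sketch (my own illustration, not from the talk) for the Bernoulli model with a uniform prior: it computes the Bayes-predictive distribution p(t|D) of the number t of 1s among m future bits, and the composite likelihood p(t|Θ) for an interval hypothesis Θ = [a, b]. The counts, horizon, and interval endpoints are assumed values.

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    n1, n0, m = 10, 10, 20                      # assumed data counts and horizon

    def p_t_given_theta(t, theta):
        # Sampling distribution of t = #1s among m future bits: Binomial(m, theta).
        return stats.binom.pmf(t, m, theta)

    def p_t_given_D(t):
        # Predictive p(t|D) = int p(t|theta) p(theta|D) dtheta = Beta-binomial.
        return stats.betabinom.pmf(t, m, n1 + 1, n0 + 1)

    def p_t_given_Theta(t, a, b):
        # Composite likelihood p(t|Theta) = (1/P[Theta]) int_Theta p(t|theta) p(theta) dtheta,
        # with uniform prior p(theta) = 1 on [0,1], so P[Theta] = b - a.
        val, _ = quad(lambda th: p_t_given_theta(t, th), a, b)
        return val / (b - a)

    print(p_t_given_D(m // 2), p_t_given_Theta(m // 2, 0.4, 0.6))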

Slide 8: Predictive Hypothesis Identification (PHI)
Definition 1 (Predictive Loss). The predictive Loss / Lõss of Θ given D, based on distance d, for m future observations is
  Loss^m_d(Θ, D) := ∫ d(p(x|Θ), p(x|D)) dx,
  Lõss^m_d(Θ, D) := ∫∫ d(p(x|Θ), p(x|θ)) p(θ|D) dx dθ.
Definition 2 (PHI). The best (respectively ˜best) predictive hypothesis in hypothesis class H given D is
  Θ̂^m_d := arg min_{Θ ∈ H} Loss^m_d(Θ, D)   (Θ̃^m_d := arg min_{Θ ∈ H} Lõss^m_d(Θ, D)).
Use p(x|Θ̂^m_d) (respectively p(x|Θ̃^m_d)) for prediction. That's it!
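
Definition 1 translates directly into a numerical routine. The sketch below is an illustration under the same Bernoulli assumptions (uniform prior, sufficient statistic t in place of x, Hellinger distance) and evaluates both Loss and Lõss for a point hypothesis Θ = {θ}.

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    n1, n0, m = 10, 10, 20                      # assumed counts and horizon
    t = np.arange(m + 1)
    posterior = stats.beta(n1 + 1, n0 + 1)      # p(theta|D) under a uniform prior

    def hellinger(p, q):
        return np.sum((np.sqrt(p) - np.sqrt(q))**2)

    def loss(theta_point):
        # Loss^m_h({theta}, D) = sum_t d(p(t|theta), p(t|D))
        p_hyp = stats.binom.pmf(t, m, theta_point)
        p_pred = stats.betabinom.pmf(t, m, n1 + 1, n0 + 1)
        return hellinger(p_hyp, p_pred)

    def loss_tilde(theta_point):
        # Loss~^m_h({theta}, D) = int sum_t d(p(t|theta_point), p(t|theta)) p(theta|D) dtheta
        p_hyp = stats.binom.pmf(t, m, theta_point)
        integrand = lambda th: hellinger(p_hyp, stats.binom.pmf(t, m, th)) * posterior.pdf(th)
        val, _ = quad(integrand, 0.0, 1.0)
        return val

    print(loss(0.5), loss_tilde(0.5))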

Slide 9: A Simple Motivating Example - The Problem
Consider a sequence of n bits from an unknown source. Assume we have observed n_0 = #0s = #1s = n_1.
We want to test whether the unknown source is a fair coin:
  fair (H_f = {θ = 1/2}) versus don't know (H_v = {θ ∈ [0, 1]}),
where H = {H_f, H_v} and θ ∈ Ω = [0, 1] is the bias.
Classical tests involve the choice of some confidence level α.
Problem 1: The answer depends on the confidence level.
Problem 2: The answer should depend on the purpose.

Slide 10: A Simple Motivating Example - Intuition = PHI
A smart customer wants to predict m further bits. We can tell him 1 bit of information: fair or don't know.
m = 1: The answer doesn't matter, since in both cases the customer will predict 50% by symmetry.
m ≪ n: We should use our past knowledge and tell him fair.
m ≫ n: We should ignore our past knowledge and tell him don't know, since the customer, who will have much more data, can make a better judgement himself.
Evaluating PHI exactly on this simple Bernoulli example, p(D|θ) = θ^{n_1}(1−θ)^{n_0}, leads to exactly this conclusion!
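
The PHI computation behind this intuition can be reproduced numerically. The following sketch (my own illustration; the counts are assumed and the prior is uniform) compares the predictive Hellinger Loss of H_f and H_v for several horizons m, using the sufficient statistic t = #1s among the m future bits.

    import numpy as np
    from scipy import stats

    n1 = n0 = 10                                # assumed: equally many 0s and 1s observed
    n = n1 + n0

    def hellinger(p, q):
        return np.sum((np.sqrt(p) - np.sqrt(q))**2)

    def loss_fair(m):
        # H_f = {theta = 1/2}: p(t|H_f) is Binomial(m, 1/2)
        t = np.arange(m + 1)
        return hellinger(stats.binom.pmf(t, m, 0.5),
                         stats.betabinom.pmf(t, m, n1 + 1, n0 + 1))

    def loss_vague(m):
        # H_v = [0,1] with uniform prior: p(t|H_v) = 1/(m+1) for every t
        t = np.arange(m + 1)
        return hellinger(np.full(m + 1, 1.0 / (m + 1)),
                         stats.betabinom.pmf(t, m, n1 + 1, n0 + 1))

    for m in [1, 5, n, 10 * n, 100 * n]:
        f, v = loss_fair(m), loss_vague(m)
        print(f"m={m:5d}  Loss(fair)={f:.4f}  Loss(vague)={v:.4f}  ->",
              "fair" if f < v else "don't know")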

Slide 11: A Simple Motivating Example - MAP vs. ML
Maximum A Posteriori (MAP): P[Ω|D] = 1 ≥ P[Θ|D] ⟹ Θ̂ = Θ^MAP := arg max_{Θ ∈ H} P[Θ|D] = Ω = H_v = don't know, however strong the evidence for a fair coin! MAP is risk averse: it finds a likely true model of low predictive power.
Maximum Likelihood (ML): p(D|H_f) ≥ p(D|Θ) ⟹ Θ̂ = Θ^ML := arg max_{Θ ∈ H} p(D|Θ) = {1/2} = H_f = fair, however weak the evidence for a fair coin! Composite ML risks an (over)precise prediction.
Upshot: Although MAP and ML give identical answers for a uniform prior on simple hypotheses, their naive extensions to composite hypotheses are diametrically opposed. The Intuition/PHI/MAP/ML conclusions hold in general.

Slide 12: Some Popular Distance Functions
(f) f-divergence: d(p, q) = f(p/q) q, for convex f with f(1) = 0
(1) absolute deviation: d(p, q) = |p − q|, f(t) = |t − 1|
(h) Hellinger distance: d(p, q) = (√p − √q)^2, f(t) = (√t − 1)^2
(2) squared distance: d(p, q) = (p − q)^2, no f
(c) chi-square distance: d(p, q) = (p − q)^2 / q, f(t) = (t − 1)^2
(k) KL-divergence: d(p, q) = p ln(p/q), f(t) = t ln t
(r) reverse KL-divergence: d(p, q) = q ln(q/p), f(t) = −ln t
The f-divergences are particularly interesting, since they contain most of the standard distances and make Loss representation invariant.
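
For reference, these pointwise distances are one-liners; the sketch below (an illustration, not part of the slides) implements them and checks the f-divergence form d(p, q) = f(p/q)·q for the Hellinger case.

    import numpy as np

    def absolute(p, q):   return np.abs(p - q)
    def hellinger(p, q):  return (np.sqrt(p) - np.sqrt(q))**2
    def squared(p, q):    return (p - q)**2
    def chi_square(p, q): return (p - q)**2 / q
    def kl(p, q):         return p * np.log(p / q)
    def rkl(p, q):        return q * np.log(q / p)

    # f-divergence form d(p, q) = f(p/q) * q, e.g. Hellinger with f(t) = (sqrt(t) - 1)^2
    f_hellinger = lambda t: (np.sqrt(t) - 1.0)**2
    p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
    assert np.allclose(hellinger(p, q), f_hellinger(p / q) * q)
    print(kl(p, q).sum(), rkl(p, q).sum())      # total KL and reverse-KL divergences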

Slide 13: Exact Properties of PHI
Theorem 3 (Invariance of Loss). Loss^m_d(Θ, D) and Lõss^m_d(Θ, D) are invariant under reparametrization θ → ϑ = g(θ) of Ω. If the distance d is an f-divergence, then they are also independent of the representation x_i → y_i = h(x_i) of the observation space X.
Theorem 4 (PHI for sufficient statistic). Let t = T(x) be a sufficient statistic for θ. Then Loss^m_f(Θ, D) = ∫ d(p(t|Θ), p(t|D)) dt and Lõss^m_f(Θ, D) = ∫∫ d(p(t|Θ), p(t|θ)) p(θ|D) dt dθ, i.e. p(x|·) can be replaced by the probability density p(t|·) of t.
Theorem 5 (Equivalence of PHI^m_{2,r} and P̃HI^m_{2,r}). For the squared distance (2) and the RKL distance (r), Loss^m_d(Θ, D) differs from Lõss^m_d(Θ, D) only by an additive constant c^m_d(D) independent of Θ; hence PHI and P̃HI select the same hypotheses: Θ̂^m_2 = Θ̃^m_2 and Θ̂^m_r = Θ̃^m_r.

Slide 14: Bernoulli Example
p(t|θ) = (m choose t) θ^t (1−θ)^{m−t}, where t = m_1 = #1s is a sufficient statistic.
For the RKL distance and point hypotheses, Theorems 4 and 5 now yield
  θ̃_r = θ̂_r = arg min_θ Loss^m_r(θ, D) = arg min_θ Σ_t p(t|D) ln[p(t|D)/p(t|θ)] = (1/m) E[t|D] = (n_1 + 1)/(n + 2) = Laplace rule.
Fisher Information and Jeffreys Prior
  I_1(θ) := −∫ (∇∇ ln p(x|θ)) p(x|θ) dx = Fisher information matrix,
  J := ∫ √(det I_1(θ)) dθ = intrinsic size of Ω,
  p_J(θ) := √(det I_1(θ)) / J = Jeffreys prior, a popular reparametrization invariant (objective) reference prior.
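
The Laplace-rule result can be verified numerically: minimizing the RKL-based predictive loss over point hypotheses θ recovers (n_1 + 1)/(n + 2). A minimal sketch under the uniform-prior Bernoulli assumptions used above (the counts and horizon are assumed values):

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize_scalar

    n1, n0, m = 7, 3, 25                        # assumed counts and horizon
    t = np.arange(m + 1)
    p_pred = stats.betabinom.pmf(t, m, n1 + 1, n0 + 1)      # p(t|D)

    def rkl_loss(theta):
        # sum_t p(t|D) ln[ p(t|D) / p(t|theta) ]
        p_hyp = stats.binom.pmf(t, m, theta)
        return np.sum(p_pred * np.log(p_pred / p_hyp))

    res = minimize_scalar(rkl_loss, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, (n1 + 1) / (n0 + n1 + 2))      # both approximately 0.667 (Laplace rule)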

Slide 15: Loss for Large m and Point Estimation
Theorem 6 (Lõss^m_h(θ, D) for large m). Under some differentiability assumptions, for point estimation, the predictive Hellinger loss for large m is
  Lõss^m_h(θ, D) = 2 − 2 (8π/m)^{d/2} · p(θ|D)/√(det I_1(θ)) · [1 + O(m^{−1/2})]
                 = 2 − 2 (8π/m)^{d/2} · p(D|θ)/(J p(D)) · [1 + O(m^{−1/2})],
where the first expression holds for any continuous prior density and the second equality holds for Jeffreys prior.

Slide 16: PHI = IMAP (= ML for Jeffreys Prior) for m ≫ n
Minimizing Lõss_h is equivalent to a reparametrization invariant variation of MAP:
  θ̃_h = θ^IMAP := arg max_θ p(θ|D)/√(det I_1(θ)), which for Jeffreys prior equals arg max_θ p(D|θ) = θ^ML.
This is a nice reconciliation of MAP and ML: an improved (invariant) MAP leads, for Jeffreys prior, back to simple ML.
PHI ≈ MDL for m ∝ n
We can also relate PHI to the Minimum Description Length (MDL) principle by taking the logarithm of the second expression in Theorem 6: for Jeffreys prior,
  θ̃_h = arg min_θ { −log p(D|θ) + (d/2) log(m/8π) + log J }.
For m = 4n this is the classical (large-n approximation of) MDL.
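
The IMAP estimator is easy to compute in one dimension. The sketch below (an illustration, not from the slides) uses the Bernoulli model, where I_1(θ) = 1/(θ(1−θ)) and the Jeffreys prior is Beta(1/2, 1/2), and confirms that θ^IMAP coincides with the ML estimate n_1/n under that prior.

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize_scalar

    n1, n0 = 7, 3                               # assumed counts
    fisher = lambda th: 1.0 / (th * (1.0 - th)) # I_1(theta) for the Bernoulli model

    # Posterior under the Jeffreys prior Beta(1/2, 1/2) is Beta(n1 + 1/2, n0 + 1/2).
    posterior = stats.beta(n1 + 0.5, n0 + 0.5)

    neg_imap = lambda th: -posterior.pdf(th) / np.sqrt(fisher(th))
    res = minimize_scalar(neg_imap, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x, n1 / (n1 + n0))                # IMAP equals ML = 0.7 for the Jeffreys prior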

Slide 17: Loss for Large m and Composite Θ
Theorem 7 (Lõss^m_h(Θ, D) for large m). Under some differentiability assumptions, for composite Θ, the predictive Hellinger loss for large m (for Jeffreys prior) is
  Lõss^m_h(Θ, D) = 2 − 2 (8π/m)^{d/4} · √( p(D|Θ) P[Θ|D] / (J P[D]) ) + o(m^{−d/4}).
MAP meets ML half way: the Θ-dependent factor is proportional to the geometric average of the posterior and the composite likelihood.
For large Θ, the likelihood gets small, since the average involves many wrong models.
For small Θ, the posterior P[Θ|D] is proportional to the volume of Θ, hence tends to zero.
The product is maximal for |Θ| ∼ n^{−d/2}, which makes sense.

Slide 18: Finding Θ̃_h Explicitly
Contrary to MAP and ML, an unrestricted maximization of ML×MAP over all measurable Θ ⊆ Ω makes sense, and can be reduced to a one-dimensional maximization.
Theorem 8 (Finding Θ̃_h exactly). Let Θ_γ := {θ : p(D|θ) ≥ γ} be the γ-level set of p(D|θ). If P[Θ_γ] is continuous in γ, then
  Θ̃_h = arg max_Θ P[Θ|D] / √P[Θ] = arg max_{Θ_γ : γ ≥ 0} P[Θ_γ|D] / √P[Θ_γ].
Theorem 9 (Finding Θ̃_h for large n, m ≫ n ≫ 1).
  Θ̃_h = {θ : (θ − θ̂)ᵀ I_1(θ̂) (θ − θ̂) ≤ r²} = ellipsoid, with r ∼ √(d/n).
Loss^m_h: similar to the (asymptotic) expressions for Lõss^m_h.
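
Theorem 8's one-dimensional search over likelihood levels γ is straightforward to implement. The sketch below (my own illustration under the uniform-prior Bernoulli assumptions, with assumed counts) scans γ, forms the level-set interval Θ_γ on a grid, and maximizes P[Θ_γ|D]/√P[Θ_γ].

    import numpy as np
    from scipy import stats

    n1, n0 = 7, 3                               # assumed counts
    posterior = stats.beta(n1 + 1, n0 + 1)      # uniform prior, so P[Theta] = |Theta|

    theta = np.linspace(1e-4, 1 - 1e-4, 20001)
    lik = theta**n1 * (1 - theta)**n0           # p(D|theta), unimodal in theta

    best = None
    for gamma in np.linspace(lik.max() * 0.99, lik.max() * 0.01, 200):
        inside = theta[lik >= gamma]            # level set Theta_gamma is an interval [a, b]
        a, b = inside[0], inside[-1]
        score = (posterior.cdf(b) - posterior.cdf(a)) / np.sqrt(b - a)
        if best is None or score > best[0]:
            best = (score, a, b)
    print("best level-set interval approx [%.3f, %.3f]" % (best[1], best[2]))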

Slide 19: Large Sample Approximations
PHI for large sample sizes n ≫ m. For simplicity, θ ∈ ℝ.
A classical approximation of p(θ|D) is by a Gaussian with the same mean and variance. Generalization to sequential moment fitting (SMF): fit the first k (central) moments
  θ̄_A ≡ µ^A_1 := E[θ|A]  and  µ^A_k := E[(θ − θ̄_A)^k | A]  (k ≥ 2).
The moments µ^D_k are known and can in principle be computed.
Theorem 10 (PHI for large n by SMF). If Θ ∈ H is chosen such that µ^Θ_i = µ^D_i for i = 1, ..., k, then under some technical conditions, Loss^m_f(Θ, D) = O(n^{−k/2}).
Normally, no Θ ∈ H has a better loss order, therefore Θ̂^m_f ≈ Θ.
The SMF choice Θ̂ ≈ Θ̂^m_f depends neither on m nor on the chosen distance f.

Slide 20: Large Sample Applications
Θ = {θ_1, ..., θ_l} unrestricted ⟹ k = l moments can be fit.
For interval estimation, H = {[a, b] : a, b ∈ ℝ, a ≤ b} and a uniform prior, we have θ̄_{[a,b]} = (a + b)/2 and µ^{[a,b]}_2 = (b − a)²/12, hence k = 2 and
  Θ̂ = [ θ̄_D − √(3µ^D_2) ; θ̄_D + √(3µ^D_2) ].
In higher dimensions, common choices of H are convex sets, ellipsoids, and hypercubes.
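
For the Beta-posterior running example, the SMF interval on this slide takes a few lines; this is an illustrative sketch (counts assumed, uniform prior), matching the posterior mean and variance to those of a uniform distribution on [a, b].

    import numpy as np
    from scipy import stats

    n1, n0 = 7, 3                               # assumed counts, uniform prior
    posterior = stats.beta(n1 + 1, n0 + 1)

    mean, var = posterior.mean(), posterior.var()   # theta_bar_D and mu^D_2
    half_width = np.sqrt(3.0 * var)                 # Uniform[a,b] has variance (b-a)^2/12
    a, b = mean - half_width, mean + half_width
    print("SMF interval estimate: [%.3f, %.3f]" % (a, b))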

Slide 21: Conclusion
If prediction is the goal, but full Bayes is not feasible, one should identify (estimate/test/select) the hypothesis (parameter/model/interval) that predicts best.
What best means can depend on the benchmark (Loss, Lõss), the distance function (d), and how long we use the model (m) compared to how much data we have at hand (n).
We have shown that predictive hypothesis identification (PHI) scores well on all the desirable properties listed on Slide 6.
In particular, PHI can properly deal with nested hypotheses, and nicely blends MAP and ML for m ≫ n, with MDL for m ∝ n, and with SMF for n ≫ m.

Slide 22: Thanks! THE END
Questions?
Want to work on this or other things? Apply at ANU/NICTA/me for a PhD or PostDoc position!
Marcus Hutter, ANU RSISE NICTA, Canberra, ACT 0200, Australia
http://www.hutter1.net/