Optimal predictive inference


Optimal predictive inference
Susanne Still, University of Hawaii at Manoa, Information and Computer Sciences
S. Still, J. P. Crutchfield. Structure or Noise? http://lanl.arxiv.org/abs/0708.0654
S. Still, J. P. Crutchfield, C. J. Ellison. Optimal Causal Inference. http://lanl.arxiv.org/abs/0708.1580

The modeling (and machine learning) challenge: a learning machine takes data as input, performs internal computations, and outputs a model. Fundamental question: What is a good model?

Intuition behind the approach we take: 1. A good model predicts well -> find a maximally predictive model; keep the predictive information! 2. A good model is compact -> do not keep irrelevant information.

System: produces an observable x(t). Observer: has access to a history h(t) of length K and makes a representation of the system by creating internal states s(t). [Diagram: the history h(t) = (x(t-K+1), ..., x(t)) of length K is summarized into a state s(t), which is used to predict the future z(t) = (x(t+1), ..., x(t+L)) of length L.] What information do we need to keep for prediction? Simplifying assumptions: discrete time; finite states.

Making a model. Natural processes execute computations and produce information. The predictive information, or excess entropy, measures the information that the past carries about the future:

I_pred = ⟨ log [ p(future, past) / ( p(future) p(past) ) ] ⟩ = I[past; future]
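
The following is a minimal sketch (mine, not from the talk) of the corresponding plug-in estimate from data: the mutual information between length-K pasts and length-L futures of an observed sequence. The block lengths and the period-2 example are arbitrary illustrative choices.

# Plug-in estimate of the predictive information I[past; future] from an
# empirical joint distribution over (past block, future block) pairs.
import numpy as np
from collections import Counter


def predictive_information(sequence, K=3, L=2):
    """Plug-in mutual information (bits) between length-K pasts and length-L futures."""
    pairs = [(tuple(sequence[t - K:t]), tuple(sequence[t:t + L]))
             for t in range(K, len(sequence) - L + 1)]
    joint = Counter(pairs)
    N = sum(joint.values())
    p_joint = {hz: c / N for hz, c in joint.items()}
    p_past, p_future = Counter(), Counter()
    for (h, z), p in p_joint.items():
        p_past[h] += p
        p_future[z] += p
    # I = < log2 [ p(h, z) / (p(h) p(z)) ] >, averaged over the joint distribution
    return sum(p * np.log2(p / (p_past[h] * p_future[z]))
               for (h, z), p in p_joint.items())


# Example: a period-2 process carries exactly 1 bit of predictive information.
print(predictive_information([0, 1] * 5000, K=3, L=2))  # ~1.0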

Quantifying the intuition. Objectives: (1) good generalization <=> good prediction, quantified by the predicted information I[s;z] (s: internal state at time t; z: future starting at t); (2) minimal coding cost (complexity), quantified by the coding rate I[s;h] (h: history up to time t).

Predicted information: the information the model carries about the future, I[s;z] = H[z] - H[z|s]. It measures the reduction in uncertainty about the future when the model state is known. If the state tells us nothing about the future, then H[z|s] = H[z] and I[s;z] = 0. If knowing the state reduces the uncertainty about the future to H[z|s] = 0, then I[s;z] is maximal: I[s;z] = H[z].

Coding rate = historical information that is retained by the model: I[s;h] = H[s] - H[s|h]. H[s] measures the statistical complexity of the model (Crutchfield and Young, 89); computational mechanics. H[s|h] measures how uncertain we are about which state s should represent the history h. Maximum entropy solution (Rose, 1990): max H[s|h] <=> min I[s;h] at given H[s] -> prefers a model with minimal statistical complexity and maximal entropy. (H[s|h] = 0 for deterministic maps, the case in computational mechanics; then I[s;h] = H[s].)

Making a model. A finite-state machine with internal states s: a probabilistic map from the input space to internal states, p(s|h), together with a prediction from the model state s, p(z|s) = Σ_h p(z|h) p(h|s). Objective: construct s such that it is a maximally predictive and efficient summary of the historical information, i.e., find the optimal probabilistic map p(s|h).

Optimization principle: max_{p(s|h)} ( I[s;z] - λ I[s;h] ). This is rate-distortion theory! It maps directly onto the Information Bottleneck method (N. Tishby, F. Pereira and W. Bialek (1999), http://lanl.arxiv.org/abs/physics/0004057), with past = data to compress and future = relevant quantity.

Note: the objective function is equivalent to constructing the states such that they implement causal shielding, a property of the causal states in computational mechanics. Using the Markov condition p(z|h,s) = p(z|h):

I[z;h|s] = ⟨ log [ p(z,h|s) / ( p(z|s) p(h|s) ) ] ⟩_{p(z,h,s)}
         = ⟨ log [ p(z|h) p(h|s) / ( p(z|s) p(h|s) ) ] ⟩_{p(z,h,s)}
         = ⟨ log [ p(z|h) / p(z) ] ⟩_{p(z,h)} - ⟨ log [ p(z|s) / p(z) ] ⟩_{p(z,s)}
         = I[z;h] - I[z;s]

Maximizing I[s;z] is therefore the same as minimizing I[z;h|s], because I[z;h] is constant. The optimization principle now reads:

min_{p(s|h)} ( I[s;h] + β I[z;h|s] )
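
A quick numerical sanity check of this identity (my own sketch, not part of the talk; the toy distributions and alphabet sizes are arbitrary), confirming that I[z;h|s] = I[z;h] - I[z;s] whenever the state s is generated from the history alone:

# Numerical check of I[z; h | s] = I[z; h] - I[z; s] under the Markov
# condition p(z | h, s) = p(z | h), i.e. when s depends on h only.
import numpy as np

rng = np.random.default_rng(0)
nH, nZ, nS = 5, 4, 3
p_h = rng.dirichlet(np.ones(nH))                   # p(h)
p_z_given_h = rng.dirichlet(np.ones(nZ), size=nH)  # p(z|h), one row per h
p_s_given_h = rng.dirichlet(np.ones(nS), size=nH)  # p(s|h), one row per h

# Joint p(h, z, s) = p(h) p(z|h) p(s|h): s depends on h only.
joint = p_h[:, None, None] * p_z_given_h[:, :, None] * p_s_given_h[:, None, :]


def mi(p_xy):
    """Mutual information (bits) of a 2-D joint distribution."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())


I_zh = mi(joint.sum(axis=2))   # I[z; h], joint over (h, z)
I_zs = mi(joint.sum(axis=0))   # I[z; s], joint over (z, s)
p_s = joint.sum(axis=(0, 1))
I_zh_given_s = sum(p_s[k] * mi(joint[:, :, k] / p_s[k]) for k in range(nS))
print(I_zh_given_s, I_zh - I_zs)  # the two numbers agree up to float error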

Writing the objective function this way makes it obvious that this is rate-distortion theory,

min_{p(s|h)} ( I[s;h] + β ⟨ d(h,s) ⟩_{p(h,s)} )

with the distortion function d(h,s) = D_KL[ p(z|h) || p(z|s) ], because

I[z;h|s] = ⟨ log [ p(z,h|s) / ( p(z|s) p(h|s) ) ] ⟩_{p(z,h,s)}
         = ⟨ ⟨ log [ p(z|h) / p(z|s) ] ⟩_{p(z|h,s)} ⟩_{p(h,s)}
         = ⟨ D_KL[ p(z|h) || p(z|s) ] ⟩_{p(h,s)}

Optimization principle: max_{p(s|h)} ( I[s;z] - λ I[s;h] ). Family of solutions:

p(s|h) = [ p(s) / Z(h,λ) ] exp( -(1/λ) D_KL[ p(z|h) || p(z|s) ] )

Iterative algorithm:

p^(j)(s|h) = [ p^(j)(s) / Z(h,λ) ] exp( -(1/λ) D_KL[ p(z|h) || p^(j)(z|s) ] )
p^(j+1)(s) = Σ_h p^(j)(s|h) p(h)
p^(j+1)(z|s) = [ 1 / p^(j+1)(s) ] Σ_h p(z|h) p^(j)(s|h) p(h)

This is a Blahut-Arimoto type of algorithm (the same as the Information Bottleneck algorithm). It converges to a local optimum.
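
A compact sketch of these updates (my own implementation and naming, not the authors' code), assuming p(h) and p(z|h) are given as arrays:

# Blahut-Arimoto / Information-Bottleneck style iteration for the slide's
# self-consistent equations; converges to a local optimum only.
import numpy as np


def optimal_causal_ib(p_h, p_z_given_h, n_states, lam, n_iter=300, seed=0):
    """p_h: shape (nH,); p_z_given_h: shape (nH, nZ). Returns p(s|h), p(s), p(z|s)."""
    rng = np.random.default_rng(seed)
    nH, nZ = p_z_given_h.shape
    p_s_given_h = rng.dirichlet(np.ones(n_states), size=nH)  # soft random init
    eps = 1e-300

    def marginals(p_s_given_h):
        # p(s) = sum_h p(s|h) p(h);  p(z|s) = (1/p(s)) sum_h p(z|h) p(s|h) p(h)
        p_s = np.maximum(p_s_given_h.T @ p_h, eps)
        p_z_given_s = (p_s_given_h * p_h[:, None]).T @ p_z_given_h / p_s[:, None]
        return p_s, p_z_given_s

    for _ in range(n_iter):
        p_s, p_z_given_s = marginals(p_s_given_h)
        # distortion d(h,s) = D_KL[ p(z|h) || p(z|s) ]
        log_ratio = np.log(np.maximum(p_z_given_h[:, None, :], eps)
                           / np.maximum(p_z_given_s[None, :, :], eps))
        d = (p_z_given_h[:, None, :] * log_ratio).sum(axis=2)
        # p(s|h) = p(s)/Z(h,lam) * exp(-d(h,s)/lam), normalized over s
        log_p = np.log(p_s)[None, :] - d / lam
        log_p -= log_p.max(axis=1, keepdims=True)
        p_s_given_h = np.exp(log_p)
        p_s_given_h /= p_s_given_h.sum(axis=1, keepdims=True)

    p_s, p_z_given_s = marginals(p_s_given_h)
    return p_s_given_h, p_s, p_z_given_s


# Toy usage: two of three histories share p(z|h); at small lambda they share a state.
p_h = np.array([0.5, 0.25, 0.25])
p_z_given_h = np.array([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
print(np.round(optimal_causal_ib(p_h, p_z_given_h, n_states=2, lam=0.1)[0], 2))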

Theorem: In the low-temperature regime (λ -> 0) the causal-state partition is found. (S. Still, J. P. Crutchfield, C. Ellison. Optimal Causal Inference. arXiv:0708.1580) The causal states reflect the underlying states of the system -> physically meaningful solution. (J. P. Crutchfield and K. Young (1989) PRL 63:105-108) Causal states are unique and minimal sufficient statistics. (J. P. Crutchfield and C. R. Shalizi (1999) Phys. Rev. E 59(1):275-283)

Proof sketch. Recall:

p(s|h) = [ p(s) / Z(h,λ) ] exp( -(1/λ) D_KL[ p(z|h) || p(z|s) ] )

1. In the low-temperature regime (λ -> 0) this becomes a deterministic partition: p(s|h) = δ_{s,s*} with s* = arg min_s D_KL[ p(z|h) || p(z|s) ].
2. Histories h with the same conditional future distribution, p(z|h) = p(z|s), are assigned to the same state s. This defines an equivalence relation, or probabilistic bisimulation (Milner, 84), which is precisely the causal-state partition of computational mechanics.
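
When p(z|h) is known exactly, the equivalence relation in step 2 can be applied directly. A minimal sketch of that grouping (mine, not from the talk), with a numerical tolerance standing in for exact equality:

# Group histories whose conditional future distributions p(z|h) agree;
# equal labels correspond to the same causal state.
import numpy as np


def causal_state_partition(p_z_given_h, tol=1e-9):
    """Return one state label per history (row of p_z_given_h)."""
    labels = -np.ones(len(p_z_given_h), dtype=int)
    representatives = []  # one p(z|h) row per state discovered so far
    for i, row in enumerate(p_z_given_h):
        for s, rep in enumerate(representatives):
            if np.max(np.abs(row - rep)) < tol:
                labels[i] = s
                break
        else:
            labels[i] = len(representatives)
            representatives.append(row)
    return labels


# Example: the first and third histories predict the same future distribution.
print(causal_state_partition(np.array([[0.5, 0.5], [1.0, 0.0], [0.5, 0.5]])))  # [0 1 0]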

Example 1: Golden Mean process. Produces all binary sequences except those containing 00. [Figure: information curve, predicted information I[R; future (length 2)] versus retained historical information I[past (length 3); R], with the feasible and infeasible regions and the 2-state solution marked. Figure: conditional future distributions P(future | state) for the 2-state model, over the future words 00, 01, 10, 11.] The algorithm finds the 2 states that describe the process fully.
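
One way to reproduce the empirical inputs for this example (my sketch, not the authors' code; the emission probability of 1/2 is the standard Golden Mean parameterization and an assumption here, since the slide does not state it):

# Simulate the Golden Mean process (a 2-state HMM that never emits "00") and
# build the empirical p(h) and p(z|h) for pasts of length 3 and futures of
# length 2, in the array form expected by an IB-style solver.
import numpy as np
from collections import Counter, defaultdict

rng = np.random.default_rng(1)


def golden_mean(n):
    """Binary sequence with no '00' substring."""
    out, state = [], "A"          # A: free choice; B: must emit 1
    for _ in range(n):
        if state == "A" and rng.random() < 0.5:
            out.append(0)
            state = "B"
        else:
            out.append(1)
            state = "A"
    return out


x, K, L = golden_mean(200_000), 3, 2
counts = defaultdict(Counter)
for t in range(K, len(x) - L + 1):
    counts[tuple(x[t - K:t])][tuple(x[t:t + L])] += 1

pasts = sorted(counts)
futures = sorted({z for h in pasts for z in counts[h]})
p_h = np.array([sum(counts[h].values()) for h in pasts], float)
p_h /= p_h.sum()
p_z_given_h = np.array([[counts[h][z] for z in futures] for h in pasts], float)
p_z_given_h /= p_z_given_h.sum(axis=1, keepdims=True)
# Pasts ending in 0 share one conditional future distribution and pasts ending
# in 1 share another: exactly the 2 causal states found by the algorithm.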

Example 2: Even process. Blocks of 1s have even length. [Figure: state diagram of the process, including a transient state.] Irreducible forbidden words: F = {010, 01110, 0111110, ...}. No finite-order Markov process can model the Even process.

Example 2: Even process. Generates all sequences in which blocks of 1s have even length. [Figure: information curve, predicted information I[R; future] versus retained historical information I[past; R], with the feasible and infeasible regions and the 2- and 3-state solutions marked. Figure: conditional future distributions P(future | state) for the 3-state model, over the future words 00, 01, 10, 11.] The algorithm finds the 3 states that describe the process fully.

Hierarchy of optimal models As temperature -> 0, we find causal states. Those capture the full predictive information! But, in general we may not want to keep all detail! We can find more compact models. Those have larger prediction error. Compared to computational mechanics, here we have an extension to non-deterministic models.

Causal compressibility. Study the full range of optimal models to learn about the causal compressibility of the stochastic process at hand. It is encoded in the rate-distortion curve.

Fully causally compressible: the full predictive information can be kept with a model whose complexity is smaller than H[h].
Not causally compressible:
- deterministic and time-reversible processes (RD curve on the diagonal)
- i.i.d. processes (RD curve degenerates to a point at the origin)
Weakly causally compressible: RD curve close to a straight line (small curvature).
Strongly causally compressible: large curvature.
A sketch of how to trace this curve numerically follows below.
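
To make the rate-distortion (information) curve concrete, the following sketch sweeps λ and records the coding rate I[s;h] and the predicted information I[s;z] of each optimal model; its curvature is what the classification above refers to. It assumes the optimal_causal_ib() function and the arrays p_h, p_z_given_h from the earlier sketches are in scope; otherwise it is not self-contained.

# Trace (I[s;h], I[s;z]) along the family of optimal models by sweeping lambda.
import numpy as np


def mutual_information_bits(p_joint):
    """I[X;Y] in bits from a 2-D joint distribution."""
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (px @ py)[mask])).sum())


curve = []
for lam in np.geomspace(2.0, 0.01, 25):
    p_s_given_h, p_s, p_z_given_s = optimal_causal_ib(p_h, p_z_given_h,
                                                      n_states=6, lam=lam)
    rate = mutual_information_bits(p_s_given_h * p_h[:, None])  # I[s;h]
    pred = mutual_information_bits(p_z_given_s * p_s[:, None])  # I[s;z]
    curve.append((rate, pred))
# Plotting pred against rate gives the curve whose shape encodes compressibility.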

Causal compressibility. [Figure: information curves in the (I[s;h], I[s;z]) plane; the coding rate is bounded by H[h] and the predicted information by I[h;z], and the region of causally compressible processes is marked.] Deterministic and time-reversible processes: I[s;z] = I[s;h] and I[h;z] = H[h]; example: periodic processes. i.i.d. processes: I[s;z] = I[s;h] = 0. (S. Still, J. P. Crutchfield. Structure or Noise? http://lanl.arxiv.org/abs/0708.0654)

Example (SNS process). [Figure: information curve, predicted information I[R; future (length 2)] versus retained historical information I[past (length 5); R], with the feasible and infeasible regions, the bounds I[h;z] and H[h], and optimal models with 2 to 6 states marked. Figure: conditional future distributions for the full historical information, the best 6-state model, and the best 3-state model.]

Learning from finite data. So far we assumed knowledge of p(z|h). In practice we have to estimate this distribution from finite data. Sampling errors result in an overestimate of the predictive information => could result in overfitting: it looks as if there are more causal states than there really are.

Finite data. Find the maximum number of states we can use safely without overfitting. Idea: compensate for the sampling error in the objective function. A Taylor expansion gives an estimate of the error. Result: the corrected curve has a maximum => easy to detect the optimal number of states. (S. Still and W. Bialek (2004) Neural Comp. 16(12):2483-2506)
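
To illustrate the idea of subtracting an estimated sampling error, here is a generic first-order bias correction for the plug-in estimate of I[s;z] (a Miller-Madow style estimate). This is a stand-in illustration only, not the specific correction derived in Still and Bialek (2004):

# The plug-in estimate of I[s;z] is biased upward by roughly
# (|S|-1)(|Z|-1) / (2 N ln 2) bits for N samples. Subtracting such an estimate
# penalizes models with more states, so the corrected curve acquires a maximum.
import numpy as np


def corrected_predictive_information(p_sz_empirical, n_samples):
    """Plug-in I[s;z] in bits minus a first-order estimate of its sampling bias."""
    p_s = p_sz_empirical.sum(axis=1, keepdims=True)
    p_z = p_sz_empirical.sum(axis=0, keepdims=True)
    mask = p_sz_empirical > 0
    i_plugin = (p_sz_empirical[mask]
                * np.log2(p_sz_empirical[mask] / (p_s @ p_z)[mask])).sum()
    n_nonzero_s = int((p_s > 0).sum())
    n_nonzero_z = int((p_z > 0).sum())
    bias = (n_nonzero_s - 1) * (n_nonzero_z - 1) / (2 * n_samples * np.log(2))
    return float(i_plugin - bias)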

Golden Mean process, finite data. [Figure: corrected information curve versus number of states N_c; the correction finds the optimal number of states (2), marked by an asterisk. Figure: conditional future distributions over the future words 00, 01, 10, 11, comparing the ideal (truth) with the result of the algorithm.]

Even process, finite data. [Figure: corrected information curve versus number of states N_C; the correction finds the optimal number of states (3), marked by an asterisk. Figure: conditional future distributions over the future words 00, 01, 10, 11, comparing the ideal (truth) with the result of the algorithm.]

Conclusion. Simple and intuitive principles allow us to: 1. Find optimal abstractions: construct maximally predictive models at fixed model complexity; in the limit of full prediction we find the causal states (which constitute unique minimal sufficient statistics); correct for sampling errors due to finite data set size. 2. Characterize a process in terms of its causal compressibility by studying the full range of optimal models.

Extensions. Online learning: reconstructing the epsilon-machine, including transition probabilities. Active learning.