Estimation of information-theoretic quantities


Estimation of information-theoretic quantities
Liam Paninski
Gatsby Computational Neuroscience Unit, University College London
http://www.gatsby.ucl.ac.uk/~liam
liam@gatsby.ucl.ac.uk
November 16, 2004

Estimation of information: some questions
What part of the sensory input is best encoded by a given neuron? Are early sensory systems optimized to transmit information? Do noisy synapses limit the rate of information flow from neuron to neuron? To answer these questions, we need to quantify information.

Mutual information
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
Mathematical reasons for using it: invariance, the uncertainty axioms, the data processing inequality, and the channel and source coding theorems. But an obvious open experimental question remains: can this quantity be computed from real data?
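
Once p(x,y) has been estimated by a histogram, the plug-in version of this formula is straightforward to evaluate. A minimal numpy sketch (not from the slides; the count table is invented for illustration):

```python
# Plug-in estimate of the discrete mutual information I(X;Y), in bits,
# from a 2-D array of joint counts n(x, y).
import numpy as np

def mutual_information_bits(counts):
    p_xy = counts / counts.sum()              # joint distribution p(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)     # marginal p(x), shape (nx, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)     # marginal p(y), shape (1, ny)
    mask = p_xy > 0                           # 0 log 0 = 0 convention
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Example: a noisy binary channel observed 1000 times.
counts = np.array([[450, 50],
                   [100, 400]])
print(mutual_information_bits(counts))        # roughly 0.4 bits for this table
```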

How to estimate information
I is very hard to estimate in general... but lower bounds (via the data processing inequality) are easier. Two ideas:
1) decoding approach: estimate x from y, and use the quality of the estimate to lower-bound I(X;Y);
2) discretization approach: discretize x and y, and estimate the discrete information I_discrete(X;Y) ≤ I(X;Y).

Decoding lower bound
I(X;Y) \ge I(X; \hat{X}(Y))                                          (1)
       = H(X) - H(X \mid \hat{X}(Y))
       \ge H(X) - H[\mathcal{N}(0, \mathrm{Cov}(X - \hat{X}(Y)))]    (2)
(1): data processing inequality
(2): Gaussian maxent: H[\mathcal{N}(0, \mathrm{Cov}(X - \hat{X}(Y)))] \ge H(X \mid \hat{X}(Y))
(Rieke et al., 1997)
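
A sketch of how this bound might be evaluated in practice, assuming (purely for illustration) that the stimulus X is itself Gaussian, so that H(X) is also given by the Gaussian entropy formula; the function name and the arrays x, xhat (stimuli and decoded estimates) are illustrative, not from the slides:

```python
# Sketch of the decoding lower bound; assumes X is Gaussian so that H(X) is
# also given by the Gaussian entropy formula. x, xhat: (n_samples, dim) arrays
# of stimuli and decoded estimates xhat = xhat(y).
import numpy as np

def decoding_lower_bound_bits(x, xhat):
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))           # Cov(X)
    cov_err = np.atleast_2d(np.cov(x - xhat, rowvar=False))   # Cov(X - Xhat(Y))
    # H(X) - H[N(0, Cov(X - Xhat))] = 0.5 * log2 [det(Cov X) / det(Cov err)]
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_e = np.linalg.slogdet(cov_err)
    bound = 0.5 * (logdet_x - logdet_e) / np.log(2)
    return max(0.0, bound)   # a negative value is just an uninformative bound
```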

Gaussian stimuli
X(t) Gaussian + stationary ⇒ specified by its power spectrum (= covariance in the Fourier domain). Use the Shannon formula
I = \int d\omega \, \log_2\!\left(1 + \mathrm{SNR}(\omega)\right),
where SNR(ω) is the signal-to-noise ratio at frequency ω (Rieke et al., 1997).
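
A minimal sketch of evaluating this formula numerically, assuming one already has a "signal" trace and a "noise" trace (e.g. the residual of a decoder) sampled at fs Hz; the function name and the use of Welch spectral estimates are illustrative choices, not the slides' prescription:

```python
# Sketch: information rate (bits/s) from the Gaussian-channel Shannon formula,
# given a signal trace and a noise trace sampled at fs Hz. Power spectra are
# estimated with Welch's method; the frequency integral uses the trapezoid rule.
import numpy as np
from scipy.signal import welch

def info_rate_bits_per_sec(signal, noise, fs, nperseg=1024):
    f, S = welch(signal, fs=fs, nperseg=nperseg)   # signal power spectrum
    _, N = welch(noise, fs=fs, nperseg=nperseg)    # noise power spectrum
    snr = S / N
    return np.trapz(np.log2(1.0 + snr), f)         # integral of log2(1 + SNR(f)) df
```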

Calculating the noise spectrum (figure from Warland et al., 1997).

Pros and cons
Pros: only need to estimate covariances, not the full distribution; the Fourier analysis gives a good picture of which information is kept and which is discarded.
Cons: the tightness of the lower bound depends on the quality of the decoder; the bound can be inaccurate if the noise is non-Gaussian.
Can we estimate I without a decoder, for general X, Y?

Discretization approach
Second approach: discretize the spike train into one of m bins and estimate
I_{discrete}(X;Y) = \sum_{x,y=1}^{m} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}.
Data processing: I_discrete(X;Y) ≤ I(X;Y), for any m. Refine the discretization as more data come in; if m grows slowly enough, Î_discrete → I_discrete → I. This doesn't assume anything about X or about the code p(y|x): it is as nonparametric as possible.

Digitizing spike trains
[figure: spike trains discretized into words of length T]
To compute the entropy rate, take the limit T → ∞ (Strong et al., 1998).
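
A sketch in the spirit of this direct approach: bin the spike train at resolution dt, binarize, chop it into words of word_len bins, and compute the plug-in entropy of the empirical word distribution, in bits per word. The function name and arguments are illustrative, and a real analysis would extrapolate in word length and data size:

```python
# Sketch: word-entropy estimate for a discretized spike train (bits per word).
import numpy as np
from collections import Counter

def word_entropy_bits(spike_times, dt, word_len, t_max):
    edges = np.arange(0.0, t_max + dt, dt)
    binned = (np.histogram(spike_times, bins=edges)[0] > 0).astype(int)  # binarize
    n_words = len(binned) // word_len
    words = [tuple(binned[i * word_len:(i + 1) * word_len]) for i in range(n_words)]
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))       # plug-in entropy of the word distribution
```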

Discretization approach
Use the MLE (plug-in) estimator of H (if we can estimate H, we can estimate I):
\hat{H}_{MLE}(\hat{p}_N) = -\sum_{i=1}^{m} \hat{p}_N(i) \log \hat{p}_N(i).
Obvious concerns: we want N >> m samples to fill in the histograms p(x,y). How large is the bias?

Bias is a major problem
[figure: sampling distributions P(H_est) of the MLE for p uniform on m = 500 bins, with N = 10, 100, 500, and 1000 samples; H_est in bits]

Bias is a major problem
\hat{H}_{MLE} is negatively biased for all p. Rough estimate of the bias: B(\hat{H}_{MLE}) ≈ -(m-1)/2N. The variance is much smaller, ≈ (\log m)^2 / N. No unbiased estimator exists. (Exercise: prove each of the above statements.)
Try the bias-corrected estimator
\hat{H}_{MM} = \hat{H}_{MLE} + \frac{\hat{m} - 1}{2N},
where \hat{m} is the number of bins with at least one sample. \hat{H}_{MM} is due to (Miller, 1955); see also e.g. (Treves and Panzeri, 1995).
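
A minimal sketch of both estimators for a vector of bin counts (the conversion to bits and the choice m̂ = number of occupied bins follow the usual convention; this is not code from the slides):

```python
# Plug-in (MLE) entropy estimate and the Miller-Madow bias-corrected version,
# both in bits, from a 1-D array of bin counts.
import numpy as np

def entropy_mle_bits(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy_miller_madow_bits(counts):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    m_hat = np.count_nonzero(counts)          # number of occupied bins
    # the (m_hat - 1)/(2N) correction is in nats; divide by ln 2 for bits
    return entropy_mle_bits(counts) + (m_hat - 1) / (2.0 * n * np.log(2))
```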

Convergence of common information estimators
Result 1: if N/m → ∞, the ML and bias-corrected estimators converge to the right answer. Converse: if N/m → c < ∞, the ML and bias-corrected estimators converge to the wrong answer. Implication: if N/m is small, the bias is large, even when bias-corrected estimators are used, although the error bars vanish (Paninski, 2003).
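
A small simulation (not from the slides) that illustrates the converse: with N/m held fixed, the plug-in estimator's spread across repeated experiments shrinks as N grows, but its mean stays well below the true entropy:

```python
# Fixed N/m = 0.25: the plug-in MLE remains biased as N grows, even though
# its standard deviation across repeats shrinks.
import numpy as np

def entropy_mle_bits(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
for N in (100, 1000, 10000):
    m = 4 * N                                  # N/m = 0.25, held fixed
    true_H = np.log2(m)                        # entropy of the uniform distribution
    ests = [entropy_mle_bits(np.bincount(rng.integers(0, m, size=N), minlength=m))
            for _ in range(20)]
    print(f"N={N:6d}  true H={true_H:5.2f}  mean estimate={np.mean(ests):5.2f}  sd={np.std(ests):.3f}")
```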

Whence bias?
[figure: sample histograms drawn from a uniform density, shown unsorted/unnormalized and sorted/normalized, for m = N = 100, 1000, and 10000]
If N/m → c < ∞, the sorted histograms converge to the wrong density. Variability in the histograms ⇒ bias in the entropy estimates.

Estimating information on m bins with fewer than m samples
Result 2: a new estimator gives the correct answer even if N < m, and it works well even in the worst case. Interpretation: entropy is easier to estimate than p itself! ⇒ we can estimate the information carried by the neural code even in cases where the codebook p(y|x) is too complex to learn directly (Paninski, 2003; Paninski, 2004).

Sketch of the logic
Good estimators have low error for all p. The (mean-square) error is the sum of the squared bias and the variance. Goal: 1. find simple worst-case bounds on the bias and variance; 2. minimize these bounds over some large but tractable class of estimators.

A simple class of estimators
The plug-in entropy can be written as \sum_i f(n_i), with f(t) = -\frac{t}{N}\log\frac{t}{N}. So look for estimators of the form \hat{H} = \sum_i g_{N,m}(n_i), where g_{N,m} minimizes the worst-case error. g_{N,m} is just an (N+1)-vector (one value for each possible count 0, 1, ..., N). A very simple class, but it turns out to be rich enough.

Deriving a bias bound
B = E(\hat{H}) - H
  = \sum_i E(g(n_i)) - \sum_i f(p_i)
  = \sum_i \sum_j P(n_i = j)\, g(j) - \sum_i f(p_i)
  = \sum_i \left( \sum_j B_j(p_i)\, g(j) - f(p_i) \right)
where B_j(p) = \binom{N}{j} p^j (1-p)^{N-j} is a polynomial in p.
If \sum_j g(j) B_j(p) is close to f(p) for all p, the bias will be small: standard uniform polynomial approximation theory.

Bias and variance
Interesting point: the bias can be made very small (∼ m/N^2), but then the variance explodes, ruining the estimator. In fact, no uniform bounds can hold if m > N^\alpha, \alpha > 1. We have to bound the bias and variance together.

Variance bound
Method of bounded differences: let F(x_1, x_2, ..., x_N) be a function of N i.i.d. random variables. If any single x_i has only a small effect on F, i.e. \max |F(..., x, ...) - F(..., y, ...)| is small, then \mathrm{var}(F) is small.
Our case: \hat{H} = \sum_i g(n_i); if \max_j |g(j) - g(j-1)| is small, then \mathrm{Var}(\sum_i g(n_i)) is small.

Computation
Goal: minimize \max_p (\mathrm{bias}^2 + \mathrm{variance}), using the bounds
|\mathrm{bias}| \le m \max_{0 \le t \le 1} \left| f(t) - \sum_j g(j) B_j(t) \right|,
\mathrm{variance} \le N \max_j |g(j) - g(j-1)|^2.
Idea: minimize the sum of these bounds. The objective is convex in g ⇒ tractable... but still slow for large N.
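
As a concrete illustration (not the actual bound-minimizing computation), the two bounds can be evaluated numerically for any candidate vector g; here g is taken to be the plug-in MLE's coefficients, the approximation target is taken to be the per-bin entropy contribution -t log t, and scipy supplies the binomial polynomials B_j(t):

```python
# Evaluate the worst-case bias and variance bounds for a candidate (N+1)-vector g,
# here the plug-in MLE's g(j) = -(j/N) log(j/N).
import numpy as np
from scipy.stats import binom

N, m = 100, 400
j = np.arange(N + 1)

g = np.zeros(N + 1)                                 # g(0) = 0 (0 log 0 = 0 convention)
g[1:] = -(j[1:] / N) * np.log(j[1:] / N)

t = np.linspace(1e-6, 1.0, 2000)                    # grid over (0, 1]
target = -t * np.log(t)                             # per-bin entropy contribution
B = binom.pmf(j[:, None], N, t[None, :])            # B_j(t), shape (N+1, len(t))

bias_bound = m * np.max(np.abs(target - g @ B))     # m * max_t |f(t) - sum_j g(j) B_j(t)|
var_bound = N * np.max(np.abs(np.diff(g))) ** 2     # N * max_j |g(j) - g(j-1)|^2
print(bias_bound, var_bound)
```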

Fast solution
Trick 1: approximate the maximum error by the mean-square error ⇒ a simple regression problem, with a good solution in O(N^3) time. Trick 2: use a good starting point from approximation theory ⇒ a good solution in O(N) time. The computation time is independent of m. Once g_{N,m} has been calculated, applying the estimator costs exactly the same as the plug-in \sum p \log p.

Error comparisons: upper and lower bounds
[figure: upper (BUB) and lower (JK) bounds on the maximum RMS error (bits) as a function of N, from 10 to 10^6, for N/m = 0.25]

Error comparisons: upper and lower bounds
[figure: upper (BUB) and lower (JK) bounds on the maximum RMS error (bits) as a function of N, for N/m = 0.10, 0.25, and 1.00]
The upper bound on the worst-case error → 0, even if N/m stays bounded at some c > 0.

Error comparisons: integrate-and-fire model data
[figure: true entropy, bias, standard deviation, and RMS error (bits) as functions of firing rate (Hz), comparing the MLE, MM, JK, and BUB estimators]
Similar effects are seen both in vivo and in vitro.

Undersampling example
[figure: true p(y|x), estimated p(y|x), and the estimation error]
m_x = m_y = 700; N/m_{xy} = 0.3
\hat{I}_{MLE} = 2.21 bits
\hat{I}_{MM} = 0.19 bits
\hat{I}_{BUB} = 0.60 bits; conservative (worst-case upper bound) error: ±0.2 bits
true I(X;Y) = 0.62 bits

Other approaches
Compression; Bayesian estimators; parametric modeling.

Compression approaches
Use the interpretation of entropy as the total number of bits required to code the signal (source coding theorem). Apply a compression algorithm (e.g. Lempel-Ziv) to the data and take \hat{H} = the number of bits required. This takes the temporal nature of the data into account more directly than the discretization approach.
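
A rough sketch of the idea, using zlib's Lempel-Ziv-style (DEFLATE) compressor as a stand-in for a purpose-built estimator; the compressed size upper-bounds the entropy rate only asymptotically and includes coder overhead, so this is illustrative rather than a serious estimator:

```python
# Compression-based entropy-rate estimate (bits per time bin) for a binarized
# spike train, using zlib as an off-the-shelf Lempel-Ziv-style compressor.
import zlib
import numpy as np

def compression_entropy_rate_bits(binned_spikes):
    symbols = np.asarray(binned_spikes, dtype=np.uint8).tobytes()
    compressed = zlib.compress(symbols, 9)
    return 8.0 * len(compressed) / len(binned_spikes)   # bits per symbol

# Example: a Bernoulli(0.1) spike train; the true entropy rate is ~0.47 bits/bin,
# and the compressed size overestimates it because of byte-level coder overhead.
rng = np.random.default_rng(1)
train = (rng.random(100_000) < 0.1).astype(np.uint8)
print(compression_entropy_rate_bits(train))
```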

Bayesian approaches
The previous analysis was worst-case: applicable without any knowledge at all of the underlying p. It is easy to perform Bayesian inference if we have a priori knowledge of p (Wolpert and Wolf, 1995); (Nemenman et al., 2002) instead design the prior with the induced distribution on H(p) in mind. (Note: "ignorant" priors on p can place very strong constraints on H(p)!)

Parametric approaches
Fit a model to the data and read off I(X;Y) numerically (e.g., via Monte Carlo). This does depend on the quality of the encoding model, but it does not depend on a Gaussian-noise assumption. E.g., instead of discretizing x → x_discrete and estimating H(x_discrete), use a density estimation technique to estimate the density p(x), then read off H(p(x)) (Beirlant et al., 1997).
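
A minimal sketch of this last route, using a Gaussian kernel density estimate and a Monte Carlo (resubstitution) average of -log p(x); the KDE and the resubstitution choice are illustrative assumptions, not the slides' prescription:

```python
# Differential-entropy estimate (bits): fit a KDE to samples of x, then
# approximate H(p) = -E[log p(X)] by averaging -log p over the samples.
import numpy as np
from scipy.stats import gaussian_kde

def kde_entropy_bits(x):
    kde = gaussian_kde(x)                   # x: (n,) or (dim, n) array of samples
    return -np.mean(kde.logpdf(x)) / np.log(2)

# Example: standard normal samples; the true value is 0.5*log2(2*pi*e) ~ 2.05 bits.
rng = np.random.default_rng(2)
print(kde_entropy_bits(rng.standard_normal(5000)))
```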

Summary
Two lower-bound approaches to estimating information; very general convergence theorems in the discrete case; a discussion of bias-correction techniques; and some more efficient estimators.

References
Beirlant, J., Dudewicz, E., Gyorfi, L., and van der Meulen, E. (1997). Nonparametric entropy estimation: an overview. International Journal of the Mathematical Statistics Sciences, 6:17–39.
Miller, G. (1955). Note on the bias of information estimates. In Information Theory in Psychology II-B, pages 95–100.
Nemenman, I., Shafee, F., and Bialek, W. (2002). Entropy and inference, revisited. Advances in Neural Information Processing Systems, 14.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191–1253.
Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50:2200–2203.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1997). Spikes: Exploring the Neural Code. MIT Press, Cambridge.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., and Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80:197–202.
Treves, A. and Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7:399–407.
Warland, D., Reinagel, P., and Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78:2336–2350.
Wolpert, D. and Wolf, D. (1995). Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52:6841–6854.