Encoding or decoding


Decoding. How well can we learn what the stimulus is by looking at the neural responses? We will discuss two approaches: (1) devise and evaluate explicit algorithms for extracting a stimulus estimate; (2) directly quantify the relationship between stimulus and response using information theory.

The optimal linear estimator. Start with a rate response r(t) and a stimulus s(t). The optimal linear estimator is the kernel K that comes closest to satisfying s_est(t) = ∫ dτ K(τ) r(t − τ). We want to solve for K: multiply by s(t − τ') and integrate over t.

The optimal linear estimator. This produces terms which are simply correlation functions. Because K enters through a convolution, Fourier transforming turns the relation into a product and leaves a straightforward algebraic equation for K(ω); solving it and inverse transforming gives K(τ).

The optimal linear estimator. For a white-noise stimulus the autocorrelation is C_ss(τ) = σ² δ(τ), so K(τ) is simply proportional to the stimulus-response correlation C_rs(τ).
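
As a minimal numerical sketch of this idea (not part of the original slides), the decoding kernel can be estimated from simulated data by dividing cross- and auto-spectra in the Fourier domain; the stimulus length, encoding filter, and noise level below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative forward model: white-noise stimulus passed through an assumed filter.
T = 4096
s = rng.normal(size=T)                                  # white-noise stimulus
taus = np.arange(40)
h = np.exp(-taus / 8.0) * np.sin(taus / 4.0)            # hypothetical encoding filter
r = np.convolve(s, h, mode="full")[:T] + 0.1 * rng.normal(size=T)  # noisy rate response

# Optimal linear decoding kernel, solved as an algebraic equation in the Fourier domain.
S, R = np.fft.rfft(s), np.fft.rfft(r)
eps = 1e-6                                              # regularizer for near-zero power
K = S * np.conj(R) / (np.abs(R) ** 2 + eps)             # K(w) = C_sr(w) / C_rr(w)
s_est = np.fft.irfft(K * R, n=T)                        # s_est = K convolved with r

print("correlation(s, s_est) =", round(float(np.corrcoef(s, s_est)[0, 1]), 2))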

Stimulus reconstruction: the decoding kernel K(t).


Reading minds: the LGN (Yang Dan, UC Berkeley).

Other decoding approaches

Binary choice tasks. Britten et al. 1992: measured both behavior and neural responses.

Behavioral performance

Predictable from neural activity? Discriminability: d' = (⟨r⟩_+ − ⟨r⟩_-) / σ_r

Signal detection theory. Decoding corresponds to comparing the test statistic, r, to a threshold, z. α(z) = P[r ≥ z | −] is the false-alarm rate, or size; β(z) = P[r ≥ z | +] is the hit rate, or power. Find z by maximizing P[correct] = p[+] β(z) + p[−] (1 − α(z)).
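
A small sketch of this procedure (the Gaussian response samples and priors below are made-up stand-ins for p(r|−) and p(r|+)): sweep the threshold, compute α and β empirically, and pick the z that maximizes P[correct].

import numpy as np

rng = np.random.default_rng(1)

# Assumed class-conditional response samples.
r_minus = rng.normal(0.0, 1.0, 10000)       # responses to the '-' stimulus
r_plus = rng.normal(1.5, 1.0, 10000)        # responses to the '+' stimulus
p_plus, p_minus = 0.5, 0.5                  # assumed priors

zs = np.linspace(-4.0, 6.0, 501)
alpha = np.array([(r_minus >= z).mean() for z in zs])   # false-alarm rate (size)
beta = np.array([(r_plus >= z).mean() for z in zs])     # hit rate (power)

p_correct = p_plus * beta + p_minus * (1.0 - alpha)
z_best = zs[np.argmax(p_correct)]
print(f"best threshold z = {z_best:.2f}, P[correct] = {p_correct.max():.3f}")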

ROC curves summarize the performance of the test across thresholds z by plotting β(z) against α(z). We want β → 1 and α → 0.

ROC: two-alternative forced choice. The threshold z is set by the response to the first presentation. The area under the ROC curve corresponds to P[correct].

Is there a better test to use than r? The optimal test function is the likelihood ratio, l(r) = p[r|+] / p[r|−] (Neyman-Pearson lemma). Recall that α(z) = P[r ≥ z | −] is the false-alarm rate (size) and β(z) = P[r ≥ z | +] is the hit rate (power). Then l(z) = (dβ/dz) / (dα/dz) = dβ/dα, i.e. the slope of the ROC curve.

The logistic function. If p[r|+] and p[r|−] are both Gaussian, one can show that P[correct] = ½ erfc(−d'/2). To interpret the results as a two-alternative forced choice, we need simultaneous responses from a '+' neuron and a '−' neuron; simulate the '−' neuron's responses using the same neuron's responses to the '−' stimulus. Ideal observer: performs at the level of the area under the ROC curve.
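
Continuing the same sketch (same assumed Gaussian samples), the ROC curve is β plotted against α over all thresholds; its area approximates the two-alternative forced choice P[correct] and, for equal-variance Gaussians, should match the ½ erfc(−d'/2) formula above.

import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(1)
r_minus = rng.normal(0.0, 1.0, 20000)       # assumed '-' responses
r_plus = rng.normal(1.5, 1.0, 20000)        # assumed '+' responses

zs = np.linspace(-6.0, 8.0, 1001)
alpha = np.array([(r_minus >= z).mean() for z in zs])   # false-alarm rate
beta = np.array([(r_plus >= z).mean() for z in zs])     # hit rate

# Area under the ROC curve via the trapezoid rule (reverse so alpha is increasing).
a, b = alpha[::-1], beta[::-1]
auc = np.sum(0.5 * (b[1:] + b[:-1]) * np.diff(a))

d_prime = (r_plus.mean() - r_minus.mean()) / r_minus.std()
print(f"AUC = {auc:.3f},  0.5*erfc(-d'/2) = {0.5 * erfc(-d_prime / 2.0):.3f}")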

More on d'. Again, if p[r|−] and p[r|+] are Gaussian and p[+] and p[−] are equal, P[+|r] = 1 / [1 + exp(−d' (r − ⟨r⟩)/σ)]. So d' sets the slope of the sigmoid fitted to P[+|r].

Neurons vs organisms. There is a close correspondence between neural and behavioural performance. Why, then, so many neurons? Correlations limit performance.

Priors. Role of priors: find z by maximizing P[correct] = p[+] β(z) + p[−] (1 − α(z)).

The wind or a tiger? Classification of noisy data: single-photon responses (Rieke).

Nonlinear separation of signal and noise. Classification of noisy data: single-photon responses (Rieke). Figures: the conditional distributions P(I|noise) and P(I|signal) over the measured current I, and finally the same distributions weighted by the priors P(noise) and P(signal).

How about costs? Compare P(I|noise) P(noise) against P(I|signal) P(signal).

Building in cost. Penalty for an incorrect answer: L+ and L-. For an observation r, what is the expected loss of each answer? Loss- = L- P[+|r] and Loss+ = L+ P[-|r]. Cut your losses: answer '+' when Loss+ < Loss-, i.e. L+ P[-|r] < L- P[+|r]. Using Bayes, P[+|r] = p[r|+] P[+]/p(r) and P[-|r] = p[r|-] P[-]/p(r), so the rule becomes: answer '+' when l(r) = p[r|+]/p[r|-] > L+ P[-] / (L- P[+]).
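
A sketch of this rule with assumed Gaussian class-conditional densities, priors, and losses (all illustrative): answer '+' whenever the likelihood ratio exceeds L+ P[-] / (L- P[+]).

import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities, priors, and losses.
mu_plus, mu_minus, sigma = 1.5, 0.0, 1.0
P_plus, P_minus = 0.5, 0.5
L_plus, L_minus = 1.0, 4.0   # L+: loss for answering '+' when '-' is true; L-: the reverse

def answer(r):
    """Answer '+' when the likelihood ratio beats the loss-weighted prior ratio."""
    lr = norm.pdf(r, mu_plus, sigma) / norm.pdf(r, mu_minus, sigma)
    return "+" if lr > (L_plus * P_minus) / (L_minus * P_plus) else "-"

for r in (-1.5, 0.0, 2.0):
    print(f"r = {r:.2f} -> answer {answer(r)}")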

Relationship of likelihood to tuning curves. For small stimulus differences, s and s + δs, discriminating between them with the likelihood ratio is again like comparing a test statistic to a threshold.

Decoding from many neurons: population codes. Population code formulation. Methods for decoding: population vector, Bayesian inference, maximum likelihood, maximum a posteriori. Fisher information.

Cricket cercal cells. Jacobs GA et al., J Exp Biol 2008; 211:1819-1828. © 2008 The Company of Biologists Ltd.

Cricket cercal cells

Population vector: RMS error in the estimate (Theunissen & Miller, 1991).

Population coding in M1. Cosine tuning curve of a motor cortical neuron as a function of hand reaching direction.

Population coding in M1. Cosine tuning: (⟨r_a⟩ − r_0)/r_max = cos(s − s_a), where s_a is the preferred direction of neuron a. Population vector: v_pop = Σ_a (r_a − r_0) c_a, with c_a the unit vector along neuron a's preferred direction. For sufficiently large N, v_pop is parallel to the direction of arm movement.
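
A sketch of the population-vector computation with cosine-tuned units; the number of neurons, preferred directions, baseline and peak rates, noise level, and the true direction are all assumptions for illustration.

import numpy as np

rng = np.random.default_rng(2)

N = 100
pref = rng.uniform(0.0, 2.0 * np.pi, N)     # preferred directions s_a (assumed uniform)
r0, rmax = 20.0, 30.0                       # assumed baseline and modulation depth (Hz)
s_true = 1.0                                # assumed movement direction (radians)

# Cosine tuning plus a little response noise.
rates = r0 + rmax * np.cos(s_true - pref) + rng.normal(0.0, 3.0, N)

# Population vector: preferred-direction unit vectors weighted by (r_a - r0).
c = np.stack([np.cos(pref), np.sin(pref)], axis=1)
v_pop = ((rates - r0)[:, None] * c).sum(axis=0)
s_est = np.arctan2(v_pop[1], v_pop[0])

print(f"true direction {s_true:.2f} rad, population-vector estimate {s_est:.2f} rad")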

Population coding in M1. Difficulties with this coding scheme?

Is this the best one can do? The population vector is neither general nor optimal. 'Optimal' means making use of all the information in the stimulus/response distributions.

Bayesian inference. Bayes' law: p[s|r] = p[r|s] p[s] / p[r], where p[r|s] is the likelihood function (a conditional distribution), p[s] is the prior distribution, p[s|r] is the a posteriori distribution, and p[r] is the marginal distribution.

Bayesian estimation. We want an estimator s_Bayes. Introduce a cost function L(s, s_Bayes) and minimize the mean cost. For the least-squares cost, L(s, s_Bayes) = (s − s_Bayes)^2, the solution is the conditional mean, s_Bayes = ∫ ds s p[s|r].

Bayesian inference. By Bayes' law, the a posteriori distribution p[s|r] is proportional to the likelihood function p[r|s] times the prior p[s].

Maximum likelihood. Find the maximum of p[r|s] over s. More generally, this is the probability of the data given the model; here the model is the stimulus, together with an assumed parametric form for the tuning curves.

Bayesian inference. By Bayes' law, the a posteriori distribution p[s|r] is proportional to the likelihood function p[r|s] times the prior p[s].

Population vector: RMS error in the estimate (Theunissen & Miller, 1991).

MAP and ML. ML: the s* which maximizes p[r|s]. MAP: the s* which maximizes p[s|r]. The difference is the role of the prior: the two objectives differ by a factor p[s]/p[r]. For the cercal data:

Decoding an arbitrary continuous stimulus. Work through a specific example: assume independence between neurons and Poisson firing. Noise model: Poisson distribution, P_T[k] = (λT)^k exp(−λT) / k!

Decoding an arbitrary continuous stimulus. E.g. Gaussian tuning curves, f_a(s) = r_max exp(−(s − s_a)^2 / (2σ_a^2)).

We need to know the full p[r|s]. Assume Poisson spike counts k_a (so r_a = k_a/T): P[k_a|s] = (f_a(s)T)^{k_a} exp(−f_a(s)T) / k_a!. Assume independence across neurons: p[r|s] = Π_a P[k_a|s]. Example: population response of 11 cells with Gaussian tuning curves.

ML. Apply maximum likelihood: maximize ln p[r|s] with respect to s. Set the derivative to zero and use the fact that Σ_a f_a(s) is approximately constant for a dense population. From the Gaussian form of the tuning curves, f_a'(s)/f_a(s) = (s_a − s)/σ_a^2, so the condition becomes Σ_a r_a (s_a − s*)/σ_a^2 = 0. If all the σ_a are the same, s*_ML = Σ_a r_a s_a / Σ_a r_a.

MAP. Apply MAP: maximize ln p[s|r] = ln p[r|s] + ln p[s] − ln p[r] with respect to s. Set the derivative to zero and again use Σ_a f_a(s) ≈ constant. From the Gaussian form of the tuning curves and a Gaussian prior with mean s_prior and variance σ_prior^2, s*_MAP = (T Σ_a r_a s_a/σ_a^2 + s_prior/σ_prior^2) / (T Σ_a r_a/σ_a^2 + 1/σ_prior^2).

Given this data: MAP with a prior of mean −2 and variance 1, compared with MAP under a constant (flat) prior, which coincides with ML.
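
A sketch of ML and MAP decoding for an example of this kind: 11 cells with Gaussian tuning curves, independent Poisson counts, and a Gaussian prior with mean −2 and variance 1 as above. The tuning width, peak rate, counting window, and true stimulus are assumed for illustration, and the posterior mean (the Bayes least-squares estimate) is computed alongside.

import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(3)

# Assumed population: 11 Gaussian tuning curves tiling the stimulus axis.
s_pref = np.linspace(-5.0, 5.0, 11)
sigma_tc, r_max, T = 1.0, 30.0, 0.5         # tuning width, peak rate (Hz), window (s)

def f(s):
    """Mean firing rates of all cells for stimulus s."""
    return r_max * np.exp(-(s - s_pref) ** 2 / (2.0 * sigma_tc ** 2))

s_true = 0.7                                # assumed true stimulus
counts = rng.poisson(f(s_true) * T)         # one trial of independent Poisson counts

# Log likelihood and log prior on a stimulus grid.
grid = np.linspace(-6.0, 6.0, 2001)
log_like = np.array([poisson.logpmf(counts, f(s) * T).sum() for s in grid])
log_prior = norm.logpdf(grid, loc=-2.0, scale=1.0)   # prior with mean -2, variance 1

s_ml = grid[np.argmax(log_like)]                     # maximum likelihood
s_map = grid[np.argmax(log_like + log_prior)]        # maximum a posteriori

post = np.exp(log_like + log_prior - (log_like + log_prior).max())
post /= post.sum()
s_mean = np.sum(grid * post)                         # posterior mean (least-squares cost)

print(f"ML {s_ml:.2f}  MAP {s_map:.2f}  posterior mean {s_mean:.2f}  (true {s_true})")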

How good is our estimate? For a stimulus s, we have an estimate s_est. Bias: b_est(s) = ⟨s_est⟩ − s. Variance: σ_est^2(s) = ⟨(s_est − ⟨s_est⟩)^2⟩. Mean squared error: ⟨(s_est − s)^2⟩ = σ_est^2 + b_est^2. Cramér-Rao bound: σ_est^2(s) ≥ (1 + b_est'(s))^2 / I_F(s), where I_F(s) is the Fisher information. (ML is asymptotically unbiased: b_est = b_est' = 0, so σ_est^2 ≥ 1/I_F.)

Fisher information: I_F(s) = ⟨−∂^2 ln p[r|s]/∂s^2⟩. Alternatively, I_F(s) = ⟨(∂ ln p[r|s]/∂s)^2⟩. It quantifies local stimulus discriminability.

Fisher information for Gaussian tuning curves. For Gaussian tuning curves with Poisson statistics, I_F(s) = T Σ_a f_a'(s)^2 / f_a(s) = T Σ_a f_a(s) (s − s_a)^2 / σ_a^4.
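
A numerical sketch of this expression for an assumed Gaussian-tuned Poisson population (the preferred values, peak rate, and counting window are illustrative); varying the tuning width previews the narrow-versus-broad question on the next slide.

import numpy as np

s_pref = np.linspace(-10.0, 10.0, 41)       # assumed dense array of preferred stimuli
r_max, T = 30.0, 0.5                        # assumed peak rate (Hz) and window (s)

def fisher_info(s, sigma_tc):
    """I_F(s) = T * sum_a f_a'(s)^2 / f_a(s) = T * sum_a f_a(s) (s - s_a)^2 / sigma^4."""
    f = r_max * np.exp(-(s - s_pref) ** 2 / (2.0 * sigma_tc ** 2))
    return T * np.sum(f * (s - s_pref) ** 2 / sigma_tc ** 4)

for width in (0.5, 1.0, 2.0):
    print(f"sigma = {width:.1f}:  I_F(0) = {fisher_info(0.0, width):.1f}")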

Are narrow or broad tuning curves better? Approximating the sum over a dense population by an integral gives I_F ∝ 1/σ in one dimension. Thus narrow tuning curves are better. But not in higher dimensions! What happens in 2D?

Fisher information and discrimination. Recall d' = (mean difference)/(standard deviation). We can also decode and then discriminate using the decoded values. Trying to discriminate s and s + Δs: the difference in the ML estimate is Δs (the estimate is unbiased) and its variance is 1/I_F(s), so d' = Δs √I_F(s).

Limitations of these approaches: they work with the tuning curve / mean firing rate only, and they ignore correlations in the population.

The importance of correlation (Shadlen and Newsome, 1998).


Entropy and Shannon information. Model-based vs model-free approaches.

Entropy and Shannon information. For a random variable X with distribution p(x), the entropy is H[X] = −Σ_x p(x) log_2 p(x). The information (surprise) of an outcome x is defined as I[x] = −log_2 p(x), so the entropy is the average information.
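
A small sketch of the definition (the example distributions are arbitrary):

import numpy as np

def entropy(p):
    """H[X] = -sum_x p(x) log2 p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))   # 1 bit
print(entropy([0.9, 0.1]))   # about 0.47 bits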

Mutual information. Typically, 'information' means mutual information: how much knowing the value of one random variable, r (the response), reduces uncertainty about another random variable, s (the stimulus). Variability in the response is due both to different stimuli and to noise. How much of the response variability is useful, i.e. can represent different messages, depends on the noise. The noise can be specific to a given stimulus.

Mutual information. Information quantifies how far r and s are from being independent: I(s;r) = D_KL[P(r,s) || P(r)P(s)]. Alternatively: I(s;r) = H[P(r)] − Σ_s P(s) H[P(r|s)].

Mutual information. Mutual information is the difference between the total response entropy and the mean noise entropy: I(s;r) = H[P(r)] − Σ_s P(s) H[P(r|s)]. We need to know the conditional distribution P(s|r) or P(r|s). Take a particular stimulus s = s_0 and repeat it many times to obtain P(r|s_0); the variability due to noise gives the noise entropy H[P(r|s_0)].
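
A sketch computing I(s;r) = H[P(r)] − Σ_s P(s) H[P(r|s)] from a small, made-up joint table P(s, r):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution P(s, r): rows are stimuli, columns are responses.
P_sr = np.array([[0.40, 0.10],
                 [0.10, 0.40]])

P_s = P_sr.sum(axis=1)                      # marginal over stimuli
P_r = P_sr.sum(axis=0)                      # marginal over responses
P_r_given_s = P_sr / P_s[:, None]           # conditional P(r|s), one row per stimulus

noise_entropy = sum(P_s[i] * entropy(P_r_given_s[i]) for i in range(len(P_s)))
mi = entropy(P_r) - noise_entropy
print(f"I(s;r) = {mi:.3f} bits")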

Mutual information. Information is symmetric in r and s. Examples: if the response is unrelated to the stimulus, what is p[r|s] and what is the MI? If the response is perfectly predicted by the stimulus, what is p[r|s]?

Simple example. r_+ encodes stimulus '+' and r_- encodes stimulus '-', but with a probability of error: P(r_+|+) = 1 − p and P(r_-|−) = 1 − p. What is the response entropy H[p]? What is the noise entropy?

Entropy and Shannon information. Response entropy: H[p_+] = −p_+ log_2 p_+ − (1 − p_+) log_2(1 − p_+), where p_+ = P(r_+); with equal priors, p_+ = ½ and the response entropy is 1 bit. Noise entropy: H[P(r|s)] = −p log_2 p − (1 − p) log_2(1 − p). The mutual information is the difference.
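
A sketch for this binary example with equal priors: the response entropy is 1 bit, the noise entropy is the binary entropy of the error probability p, and the mutual information is their difference.

import numpy as np

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

for p_err in (0.0, 0.1, 0.25, 0.5):
    response_entropy = 1.0                          # equal priors => P(r+) = 1/2
    noise_entropy = binary_entropy(p_err)           # H[P(r|s)] for either stimulus
    print(f"p = {p_err:.2f}: I = {response_entropy - noise_entropy:.3f} bits")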