Bayesian Inference. CS295-7 © Michael J. Black

Population Coding

"Now listen to me closely, young gentlemen. That brain is thinking. Maybe it's thinking about music. Maybe it has a great symphony all thought out or a mathematical formula that would change the world or a book that would make people kinder or the germ of an idea that would save a hundred million people from cancer. This is a very interesting problem, young gentlemen, because if this brain does hold such secrets, how in the world are we ever going to find out?" -- Dalton Trumbo, Johnny Got His Gun

We have seen how activities such as hand motion can be represented by the firing rates of a population of cells, and the population vector method gives a simple procedure for recovering (or inferring) this motion from observed firing rates. In its most general form the population vector method can be written as

    \hat{x} = \kappa \sum_i r_i \bar{x}_i        (1)

where x is some quantity represented by the cells (e.g. hand position or velocity), \hat{x} is the estimated quantity, \kappa is some scaling or normalizing constant, r_i is the firing rate of cell i, and \bar{x}_i is the quantity the cell encodes (e.g. its preferred direction). Until now, this equation has been presented as a simple, intuitive model that, with some tuning of \kappa, produces reasonable reconstructions of the subject's action. But how do we arrive at such a model? How good is it? What assumptions does it make? By understanding the answers to these questions, our hope is to be able to formulate better models, with more principled assumptions, that lead to more accurate reconstructions.
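As a concrete illustration, equation (1) is just a weighted sum and can be computed in a few lines. The four cells, their preferred directions, the firing rates, and the choice \kappa = 1/\sum_i r_i below are all made-up values for this sketch, not data from the course.

```python
import numpy as np

# Hypothetical population of 4 cells with 2D preferred directions (unit vectors)
preferred = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
rates = np.array([30.0, 12.0, 2.0, 8.0])   # observed firing rates (spikes/s), invented

# One common choice of the scaling constant kappa: normalize by total activity
kappa = 1.0 / rates.sum()

# Population vector estimate: x_hat = kappa * sum_i r_i * xbar_i
x_hat = kappa * (rates[:, None] * preferred).sum(axis=0)
```

Because the cell preferring the +x direction fires hardest here, the estimate points mostly along +x.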

CS295-7 © Michael J. Black, 2001-2002

Bayesian Inference

Let us start by posing the problem we are trying to solve in a general probabilistic framework. Let p(X = x | R = r) represent the probability that the state of the system is x given that the firing rates of all the cells are r = [r_1, ..., r_n]^T. X and R are random variables that can take on different values, and p(x|r) represents the entire probability distribution over these values. For notational simplicity we will write p(X = x | R = r) as p(x|r).

If we can formulate this probability distribution then we can do a variety of things, such as find the value of x that maximizes the probability given r, or find the expected value of x. The problem remains, however, of how to formulate such a probabilistic model. Often it is convenient to use the laws of probability theory to rewrite this distribution. Using Bayes' rule we rewrite it as

    p(x|r) = p(r|x) p(x) / p(r).        (2)

The first thing to notice is that if we are interested in estimating something about x, then the denominator p(r) on the right-hand side is not going to be relevant; it is constant with respect to x. Formally, this normalizing term can be computed by marginalizing out x:

    p(r) = \int_x p(r|x) p(x) dx.

In general, it does not need to be estimated explicitly but rather is taken to be a constant such that

    \int_x p(x|r) dx = 1.
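A minimal numeric sketch of equation (2) on a discretized state space may help make the point about p(r): it never needs to be modeled explicitly, since it is just the constant that makes the posterior sum to one. The Gaussian-shaped likelihood below is an arbitrary stand-in, not a model from this course.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)            # discretized state space
prior = np.ones_like(x) / x.size           # uniform prior p(x)

# Made-up likelihood p(r|x): pretend the observed rates favour states near x = 0.3
likelihood = np.exp(-0.5 * ((x - 0.3) / 0.1) ** 2)

unnormalized = likelihood * prior          # numerator of Bayes' rule
p_r = unnormalized.sum()                   # p(r): marginalize out x
posterior = unnormalized / p_r             # p(x|r) now sums to 1
```

Whatever constant p(r) turns out to be, dividing by it only rescales the posterior; the location of its maximum is unchanged.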

Brain Computer Interfaces: Computational and Mathematical Foundations

Bayes' Rule

Bayes' rule is derived from a simple law of probability stating that the joint probability of two random variables A and B can be written as

    p(A, B) = p(A|B) p(B)    or    p(A, B) = p(B|A) p(A).

From these we have

    p(A|B) p(B) = p(B|A) p(A),

or Bayes' rule:

    p(A|B) = p(B|A) p(A) / p(B).

Likelihood

Bayes' rule allows us to represent p(x|r) as the product of two terms which have special significance. The first, p(r|x), represents the likelihood of observing the data (firing rates) r given the state x. If you think of x as a control knob which can be tuned to different values, then the likelihood tells us the probability of the observations given a particular parameter setting.

Prior

The second term in the numerator, p(x), is called the prior. This represents the a priori probability of x; that is, the probability prior to making any measurements. This term is used to represent information we may know about x that is independent of the firing rates r. A particularly important source of prior information will be temporal. Often we want to estimate x at some particular instant in time t where we know the value of x at the previous time instant t-1. We can write this temporal prior as

    p(x_t) = \int p(x_t | x_{t-1}) p(x_{t-1}) dx_{t-1}.

We will return to this later.
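The temporal prior integral above can be sketched on a grid, where it becomes a matrix-vector product. The peaked belief at time t-1 and the Gaussian random-walk transition density are assumed values chosen only for illustration.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 101)
dx = x[1] - x[0]

# Belief about the state at time t-1 (made-up: sharply peaked at x = 0)
p_prev = np.exp(-0.5 * (x / 0.05) ** 2)
p_prev /= p_prev.sum() * dx                       # normalize so it integrates to 1

# Assumed transition density p(x_t | x_{t-1}): Gaussian random walk, std 0.2
trans = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.2) ** 2)
trans /= trans.sum(axis=0, keepdims=True) * dx    # each column integrates to 1

# Temporal prior: p(x_t) = integral of p(x_t | x_{t-1}) p(x_{t-1}) dx_{t-1}
p_t = trans @ p_prev * dx
```

Propagating the belief through the random walk broadens it: the prior at time t is flatter than the belief at time t-1, reflecting uncertainty about how the state moved.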

Posterior

Finally, p(x|r) is called the posterior probability; that is, the probability of x after taking into account the measurements. In Latin, a posteriori means reasoning from observed facts back to their causes (Microsoft Encarta World English Dictionary). The key here is that Bayes' rule formalizes the reasoning process. It tells us how to make inferences that take into account both the observed data and our prior knowledge about the world. We will spend the rest of the course essentially exploring this idea and how to make it practical for computers to infer activity from brains. Below we start with the likelihood term and eventually will see how the population vector method can be thought of in this Bayesian framework. Right now you should be wondering what the simple population vector algorithm means in terms of a likelihood and a prior. Hold that thought...

The Likelihood

One of the benefits of a Bayesian approach to the decoding problem is that it forces us to make our assumptions explicit. In modeling the likelihood term we will start with some simple assumptions and later in the course explore how to make them more realistic.

Conditional Independence

First we can write the likelihood of the state x given the firing rates r of n cells as

    p(r_1, r_2, ..., r_n | x).

If we know x, and the firing rate of a cell depends on x independently of the firing rates of the other cells, then we may assume that the probabilities of firing for the different cells, conditioned on x, are independent. Formally, if p(A | B, C) = p(A | B) then A and C are said to be conditionally independent given B. In other words, once we know B, knowing C does not give any more information. Note that this does not imply the stronger statement that A and C are independent, which would imply p(A, C) = p(A) p(C).
In our case, we only assume that, if we know the state x, the firing rates of the cells are conditionally independent:

    p(r_1, r_2, ..., r_n | x) = p(r_1 | x, r_2, ..., r_n) p(r_2, ..., r_n | x)
                              = p(r_1 | x) p(r_2, ..., r_n | x)
                              = p(r_1 | x) p(r_2 | x) p(r_3, ..., r_n | x)
                              = \prod_{i=1}^n p(r_i | x).

This assumption means that the joint likelihood over all the cells' firing rates is simply represented as the product of the individual likelihoods. Note that this simplification is just an approximation. Since the cells we record from may influence each other, their firing rates may well be dependent on each other.

Generative Model

To specify the likelihood function it is useful to explicitly state our model of how cells encode the state, x_t, at time t in terms of their activity. In particular, we specify a generative model of, for example, the firing rate at time t for cell i as

    r_{i,t} = f_i(x_t) + \eta

where f_i(\cdot) specifies some function that takes the state and produces the expected neural firing, and \eta is some noise which reflects that our model f_i is uncertain or that the data are stochastic. Once again, we are making our assumptions explicit, and throughout the course we will consider a number of different generative models with different tuning functions and different noise properties.

For now, consider a simple tuning function in which cells have some preferred state \bar{x}_i at which they fire maximally, and the activity drops off as the current state x_t differs more from the preferred state. This can be captured using a simple Gaussian tuning function

    f_i(x_t) = k e^{-(x_t - \bar{x}_i)^2 / (2\sigma^2)}

where k is a normalizing constant that is 1/(\sqrt{2\pi}\sigma) if f_i(x) is a probability distribution (which integrates to 1), but more generally can be scaled so that f_i(\cdot) produces a predicted firing rate (rather than a probability). While f_i(\cdot) models how the firing rate varies with x, it does not say anything about the noise we expect in our observations, r_{i,t}.
If the firing rate is estimated in relatively short time bins (≈ 200 ms), then the variation in the activity is well modeled by a Poisson distribution:

    p(r_{i,t} | x_t) = \frac{1}{r_{i,t}!} e^{-f_i(x_t)} f_i(x_t)^{r_{i,t}}
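The generative model just described, Gaussian tuning plus Poisson spiking noise, can be sketched as follows; the peak rate k = 20 and tuning width \sigma = 0.5 are made-up parameter values, not figures from the course.

```python
import numpy as np
from math import lgamma

def tuning(x, x_pref, k=20.0, sigma=0.5):
    # Gaussian tuning curve: expected firing rate f_i(x), peaking at x_pref
    return k * np.exp(-0.5 * ((x - x_pref) / sigma) ** 2)

def poisson_loglik(r, x, x_pref):
    # log p(r | x) for one cell: Poisson with mean f_i(x).
    # lgamma(r + 1) = log(r!), computed stably.
    f = tuning(x, x_pref)
    return r * np.log(f) - f - lgamma(r + 1)
```

For example, a cell preferring x = 0 makes a high spike count much more likely when the state is at its preferred value than when the state is far away.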

where, recall, r_{i,t} is an integer as it is the observed firing at time t (computed in some finite time bin). We call this an inhomogeneous Poisson model since the mean firing rate f_i(x_t) varies over time according to the state, as specified by the tuning function. Now, combined with the conditional independence assumption, we have fully specified the likelihood as

    p(r | x_t) = \prod_{i=1}^n \frac{e^{-f_i(x_t)} f_i(x_t)^{r_{i,t}}}{r_{i,t}!}.

Inference

Inference often involves extracting some value (or values) from p(r|x). There are a number of possibilities. A common approach involves computing a maximum likelihood estimate of x:

    \hat{x}_{ML} = \arg\max_x p(r|x).

This is only a Bayesian estimate if the prior p(x) is uniform. Alternatively we can seek the maximum a posteriori, or MAP, estimate

    \hat{x}_{MAP} = \arg\max_x p(x|r),

which takes into account both the likelihood and the prior, and where

    p(x_t | r) \propto \prod_{i=1}^n \frac{e^{-f_i(x_t)} f_i(x_t)^{r_{i,t}}}{r_{i,t}!} \, p(x_t).

Maximizing p(x|r) is equivalent to minimizing its negative logarithm. For the case described above, with a Gaussian tuning function and Poisson noise, if we further assume a uniform prior then we can write the negative log as

    -\log p(x|r) = \sum_i f_i(x) - \sum_i r_i \log(f_i(x)) + k_1

where k_1 = \sum_i \log(r_i!) is a constant independent of x. Since the tuning function is Gaussian, this further simplifies to

    -\log p(x|r) = \sum_i f_i(x) + \frac{1}{2\sigma^2} \sum_i r_i (x - \bar{x}_i)^2 + k_2

where k_2 now includes both k_1 and the log of the constant scale multiplying the tuning function.
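To illustrate the distinction between the two estimates, here is a small grid-search sketch with a single hypothetical cell: the ML estimate sits at the likelihood peak, while a Gaussian prior centred at zero pulls the MAP estimate toward zero. All numbers (tuning parameters, spike count, prior width) are invented for the example.

```python
import numpy as np

grid = np.linspace(-2.0, 2.0, 801)

# Made-up single cell: Gaussian tuning preferring x = 1.0, peak rate 20
f = 20.0 * np.exp(-0.5 * ((grid - 1.0) / 0.5) ** 2)
r = 25                                      # observed spike count (invented)

# Poisson log likelihood up to the constant -log(r!)
loglik = r * np.log(f) - f

x_ml = grid[np.argmax(loglik)]              # maximum likelihood: prior ignored

# Gaussian prior centred at 0 (std 0.5); its log is added before maximizing
logprior = -0.5 * (grid / 0.5) ** 2
x_map = grid[np.argmax(loglik + logprior)]
```

With a uniform prior the two estimates would coincide; here the prior shifts the MAP estimate away from the likelihood peak at x = 1.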

We further assume that the space of possible values of x is evenly represented by the population of cells we are recording from. This is the case in primary motor cortex, where cells representing movement direction appear randomly distributed and where we can record from a sufficiently large population. When this assumption holds, \sum_{i=1}^n f_i(x) will be roughly constant, independent of x. Therefore it need not be considered in the minimization of -\log p(x|r).

To minimize -\log p(x|r) we take its derivative with respect to x, set it equal to zero, and solve for x:

    \frac{d}{dx} \left( -\log p(x|r) \right) = \frac{1}{2\sigma^2} \sum_i 2 r_i (x - \bar{x}_i) = 0.

Simplifying gives

    \sum_i r_i x = \sum_i r_i \bar{x}_i

or finally

    x = \frac{\sum_i r_i \bar{x}_i}{\sum_i r_i}.

This is simply a scaled version of the population vector result.

Summarizing, the ad hoc population vector method can be understood as embodying a number of implicit assumptions. These are:

- conditional independence of the firing rates of the cells,
- cells have a preferred value \bar{x}_i and a Gaussian tuning function,
- the generative model assumes an inhomogeneous Poisson process,
- the prior probability of x is uniform,
- the cells are sampled uniformly with respect to their encoding,
- there exists a single best (MAP or ML) estimate of x.

Note there is also an inconsistency between the original assumption of cosine tuning for cells in motor cortex and the actual assumption of Gaussian tuning. They are similar but not the same.

We argue that this analysis is useful because it makes explicit what was hidden implicitly in the algorithm. By making the assumptions explicit we can begin to see how to improve on them. We will do so one by one in the coming weeks.
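As a sanity check on the derivation, the closed-form estimate x = \sum_i r_i \bar{x}_i / \sum_i r_i can be compared against a brute-force maximization of the Poisson log likelihood under the assumptions listed above (Gaussian tuning, evenly spaced preferred values so that \sum_i f_i(x) is roughly constant, uniform prior). The population size, tuning parameters, and noise-free rates below are synthetic.

```python
import numpy as np

sigma = 0.5
x_pref = np.linspace(-3.0, 3.0, 61)        # dense, evenly spaced preferred values
true_x = 0.7

# Noise-free firing rates generated by the Gaussian tuning functions (peak 20)
rates = 20.0 * np.exp(-0.5 * ((x_pref - true_x) / sigma) ** 2)

# Closed-form result derived above: x = sum_i r_i xbar_i / sum_i r_i
x_closed = (rates * x_pref).sum() / rates.sum()

# Brute-force ML: maximize sum_i [r_i log f_i(x) - f_i(x)] over a fine grid
grid = np.linspace(-3.0, 3.0, 1201)
f = 20.0 * np.exp(-0.5 * ((grid[None, :] - x_pref[:, None]) / sigma) ** 2)
loglik = (rates[:, None] * np.log(f) - f).sum(axis=0)
x_ml = grid[np.argmax(loglik)]
```

Both routes land on (essentially) the true state, which is the point of the derivation: under these assumptions the population vector is not just a heuristic but the ML/MAP answer.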