Computational Cognitive Science
1 Computational Cognitive Science, Lecture 8. Frank Keller, School of Informatics, University of Edinburgh. Based on slides by Sharon Goldwater. October 14, 2016.
2 Outline:
1. Background: Cognition as Inference; Probability Distributions
2. Bayes' Rule; Comparing Infinitely Many Hypotheses
3. Maximum Likelihood Estimation; Maximum a Posteriori Estimation; Bayesian Integration
4. Choosing a Prior; Conjugate Priors
Reading: Griffiths and Yuille (2006).
3 Cognition as Inference. The story of probabilistic cognitive modeling so far: models define probabilities that correspond to some aspect of human behavior; for example, $P(R_i = A_i)$, the probability of assigning category A to item i in the GCM. Models have parameters that determine these probability distributions (e.g., the scaling factor c in the GCM), and maximum likelihood estimation is a way of setting these parameters: we infer probability distributions from data. So are probabilities and parameter estimators just technical devices, or do they have a cognitive status in our model?
4 Cognition as Inference. The recent literature assumes that probabilities and estimation are cognitively real. The intuitions behind this are: probabilities reflect degrees of belief; humans make observations from which they infer the probabilities on which their behavior is based; so humans also use estimation techniques. But which ones? Maximum likelihood estimation? Intuitively, inference is cognitively plausible if: estimates depend on observations, but also on prior beliefs; as more observations accrue, estimates become more reliable; and when observations are unreliable, prior beliefs are used instead. Today we will discuss the mathematics behind these intuitions.
6 Distributions. Let's recap the distinction between discrete and continuous distributions. Discrete distributions: the sample space S is finite or countably infinite (e.g., the integers); the distribution is a probability mass function, which defines the probability of a random variable taking on a particular value. Example (binomial distribution): $P(x) = \binom{n}{x}\,\theta^x (1-\theta)^{n-x}$. [Figure: the pmf $b(x; 12, 0.5)$.]
7 Distributions. We have also seen examples of continuous distributions: the sample space is uncountably infinite (e.g., the real numbers); the distribution is a probability density function, which defines the probabilities of intervals of the random variable. Example (exponential distribution): $P(x) = \frac{1}{\theta}\, e^{-x/\theta}$.
8 Discrete vs. Continuous. Discrete distributions: $P(X = x) \geq 0$ for all $x \in S$; $\sum_{x \in S} P(x) = 1$; law of total probability: $P(Y) = \sum_i P(Y \mid X_i)\, P(X_i)$; expectation: $E[X] = \sum_x x\, P(X = x)$. Continuous distributions: $P(x) \geq 0$ for all $x \in \mathbb{R}$; $\int P(x)\,dx = 1$; law of total probability: $P(y) = \int P(y \mid x)\, P(x)\,dx$; expectation: $E[X] = \int x\, P(x)\,dx$.
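As a sanity check on these rules (not part of the slides), here is a minimal Python sketch, assuming NumPy and SciPy are available; the parameter choices (n = 12, θ = 0.5 for the binomial, θ = 2 for the exponential) are illustrative:

```python
# Sketch: normalisation and expectation for one discrete and one continuous distribution.
import numpy as np
from scipy import stats, integrate

# Discrete: binomial b(x; 12, 0.5)
n, theta = 12, 0.5
xs = np.arange(n + 1)
pmf = stats.binom.pmf(xs, n, theta)
print(pmf.sum())           # probabilities sum to 1
print((xs * pmf).sum())    # E[X] = n * theta = 6

# Continuous: exponential with density (1/theta) * exp(-x/theta), theta = 2
theta = 2.0
pdf = lambda x: (1.0 / theta) * np.exp(-x / theta)
print(integrate.quad(pdf, 0, np.inf)[0])                    # integrates to 1
print(integrate.quad(lambda x: x * pdf(x), 0, np.inf)[0])   # E[X] = theta = 2
```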
10 Bayes' Rule. In its general form, the inference task consists of determining the probability of a hypothesis given some data. Notation: h is the hypothesis we are interested in; H is the hypothesis space (the set of all possible hypotheses); y is the observed data (note we use y rather than d). According to Bayes' rule: $P(h \mid y) = \frac{P(y \mid h)\, P(h)}{P(y)}$ where $P(h \mid y)$ is the posterior, $P(y \mid h)$ the likelihood, and $P(h)$ the prior. We can compute the denominator using the law of total probability: $P(y) = \sum_{h' \in H} P(y \mid h')\, P(h')$.
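As an illustration (my own, not from the slides), a minimal Python sketch of Bayes' rule over a finite hypothesis space, with the denominator obtained via the law of total probability; the function name and data structures are assumptions:

```python
# Sketch: posterior over a finite hypothesis space via Bayes' rule.
def posterior(hypotheses, prior, likelihood):
    """prior[h] = P(h), likelihood[h] = P(y | h); returns P(h | y) for each h."""
    p_y = sum(likelihood[h] * prior[h] for h in hypotheses)  # law of total probability
    return {h: likelihood[h] * prior[h] / p_y for h in hypotheses}
```

The coin example below is an instance of exactly this computation.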
15 Bayes' Rule. Example: a box contains two coins, one that comes up heads 50% of the time, and one that comes up heads 90% of the time. You pick one of the coins, flip it 10 times, and observe HHHHHHHHHH. Which coin was flipped? What if you had observed HHTHTHTTHT? Let θ be the probability that the coin comes up heads, so we have two hypotheses: $h_0: \theta = 0.5$ and $h_1: \theta = 0.9$. The probability of a sequence y with $N_H$ heads and $N_T$ tails is: $P(y \mid \theta) = \theta^{N_H} (1-\theta)^{N_T}$. This is a Bernoulli distribution (a special case of the binomial distribution).
17 Bayes' Rule. We can compare the probabilities of the two hypotheses directly by computing the odds: $\frac{P(h_1 \mid y)}{P(h_0 \mid y)} = \frac{P(y \mid h_1)}{P(y \mid h_0)} \cdot \frac{P(h_1)}{P(h_0)}$ where the left-hand side is the posterior odds, the first factor on the right is the likelihood ratio, and the second is the prior odds. We get posterior odds of 357:1 in favor of $h_1$ for HHHHHHHHHH and 165:1 in favor of $h_0$ for HHTHTHTTHT.
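A small sketch (again my own, not from the slides) reproducing these posterior odds numerically, assuming equal prior probability of picking either coin:

```python
# Sketch: posterior odds for h1 (theta = 0.9) vs. h0 (theta = 0.5).
def seq_likelihood(seq, theta):
    n_h, n_t = seq.count("H"), seq.count("T")
    return theta ** n_h * (1 - theta) ** n_t

for seq in ["HHHHHHHHHH", "HHTHTHTTHT"]:
    prior_odds = 0.5 / 0.5                       # uniform prior over the two coins
    likelihood_ratio = seq_likelihood(seq, 0.9) / seq_likelihood(seq, 0.5)
    print(seq, likelihood_ratio * prior_odds)    # ~357 for the first, ~1/165 for the second
```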
22 Comparing Infinitely Many Hypotheses. Let's now assume that θ, the probability of the coin coming up heads, can be anywhere between 0 and 1. Now we have infinitely many hypotheses, but Bayes' rule still applies: $P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}$ where the probability of the data is: $P(y) = \int_0^1 P(y \mid \theta)\, P(\theta)\,d\theta$. But how do we compute θ? There are three options.
23 Maximum Likelihood Estimation. 1. Choose the θ that makes y most probable, i.e., ignore P(θ): $\hat\theta = \arg\max_\theta P(y \mid \theta)$. This is the maximum likelihood (ML) estimate of θ. Problem: the ML estimate often does not generalize well (it overfits the data). It is a point estimate, and hence fails to take the shape of the posterior distribution into account.
24 Maximum a Posteriori Estimation. 2. Choose the θ that is most probable given y: $\hat\theta = \arg\max_\theta P(\theta \mid y) = \arg\max_\theta P(y \mid \theta)\, P(\theta)$. This is the maximum a posteriori (MAP) estimate of θ, and it is equivalent to the ML estimate when P(θ) is uniform. Non-uniform priors can reduce overfitting, but the MAP estimate still doesn't account for the shape of $P(\theta \mid y)$.
25 Bayesian Integration. 3. Instead of maximizing, take the expected value of θ: $E[\theta] = \int_0^1 \theta\, P(\theta \mid y)\,d\theta = \int_0^1 \theta\, \frac{P(y \mid \theta)\, P(\theta)}{P(y)}\,d\theta \propto \int_0^1 \theta\, P(y \mid \theta)\, P(\theta)\,d\theta$. This is the posterior mean, the average over all hypotheses. For our coin flip example (with a uniform prior), the posterior is: $P(\theta \mid y) = \frac{(N_H + N_T + 1)!}{N_H!\, N_T!}\,\theta^{N_H} (1-\theta)^{N_T} = \mathrm{Beta}(N_H + 1, N_T + 1)$. This is known as the Beta distribution.
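A quick numerical check (my own illustration, assuming SciPy; the counts 7 heads and 3 tails are arbitrary) that integrating θ against the Beta posterior gives the closed-form posterior mean derived below:

```python
# Sketch: posterior mean by numerical integration vs. the closed form.
from scipy import stats, integrate

n_h, n_t = 7, 3
post = stats.beta(n_h + 1, n_t + 1)     # P(theta | y) under a uniform prior
mean_numeric = integrate.quad(lambda t: t * post.pdf(t), 0, 1)[0]
print(mean_numeric)                     # ≈ 0.6667
print((n_h + 1) / (n_h + n_t + 2))      # closed form given below: 8/12 ≈ 0.6667
```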
27 Beta Distribution. [Figure: Beta distribution plots.]
28 Maximum Likelihood Estimate. Using the Beta distribution, the ML estimate (equivalent to the MAP estimate with a uniform prior) works out as: $\hat\theta = \frac{N_H}{N_H + N_T}$. This is a relative frequency estimate: it's simply the frequency of heads over the total number of coin flips. This estimate is insensitive to sample size: if we get 10 heads and 0 tails, we are as certain about θ as if we get 100 heads and 0 tails. This explains the overfitting.
29 Posterior Mean. Let's compare this with the posterior mean, which for the Beta distribution works out as: $E[\theta] = \frac{N_H + 1}{N_H + N_T + 2}$. This is the average over all values of θ. It pays attention to sample size (compare E[θ] for 10 heads and 0 tails vs. 100 heads and 0 tails), and it is less prone to overfitting. We can think of this as adding pseudocounts to the relative frequency estimate; this is called smoothing. Note that we are still assuming a uniform prior!
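A quick sketch (my own illustration, using only the two formulas above) comparing the two estimates for 10 vs. 100 heads with no tails:

```python
# Sketch: ML estimate vs. posterior mean (uniform prior) for all-heads data.
def ml_estimate(n_h, n_t):
    return n_h / (n_h + n_t)

def posterior_mean(n_h, n_t):
    return (n_h + 1) / (n_h + n_t + 2)

for n_h in (10, 100):
    print(n_h, ml_estimate(n_h, 0), posterior_mean(n_h, 0))
# ML is 1.0 in both cases; the posterior mean is 11/12 ≈ 0.917 vs. 101/102 ≈ 0.990,
# so it becomes more confident as the sample grows.
```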
30 Choosing a Prior. Let's assume we want to use a non-uniform prior. We could again use the Beta distribution: $P(\theta) = \mathrm{Beta}(V_H + 1, V_T + 1)$ where $V_H, V_T > 1$ encode our beliefs about likely values of θ. This distribution has a mean of $(V_H + 1)/(V_H + V_T + 2)$ and becomes concentrated around the mean as $V_H + V_T$ increases. For example, $V_H = V_T = 1000$ puts a strong prior on θ = 0.5. The parameters that govern the prior distribution are called hyperparameters (here, $V_H$ and $V_T$ are hyperparameters).
31 Choosing a Prior. Using the $\mathrm{Beta}(V_H + 1, V_T + 1)$ prior, the posterior distribution becomes: $P(\theta \mid y) = \frac{(N_H + N_T + V_H + V_T + 1)!}{(N_H + V_H)!\,(N_T + V_T)!}\,\theta^{N_H + V_H} (1-\theta)^{N_T + V_T}$ which is $\mathrm{Beta}(N_H + V_H + 1, N_T + V_T + 1)$. The MAP estimate for this posterior is then: $\hat\theta = \frac{N_H + V_H}{N_H + N_T + V_H + V_T}$ and the posterior mean becomes: $E[\theta] = \frac{N_H + V_H + 1}{N_H + N_T + V_H + V_T + 2}$.
32 Choosing a Prior. Returning to our example, if we use a Beta prior with $V_H = V_T = 1000$, and our data consist of a sequence of 10 heads and 0 tails, then: $E[\theta] = \frac{N_H + V_H + 1}{N_H + N_T + V_H + V_T + 2} = \frac{1011}{2012} \approx 0.5$. So we retain our belief that θ = 0.5, even though we've seen strong evidence to the contrary. This would change had we seen 100 heads rather than 10. Compare this to the maximum likelihood estimate, which is: $\hat\theta = \frac{N_H}{N_H + N_T} = 1$.
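A sketch of the same calculation (the helper name is mine, not from the slides); it also shows how the posterior mean starts to shift with 100 heads, while the ML estimate is already at 1:

```python
# Sketch: posterior mean with a Beta(V_H + 1, V_T + 1) pseudocount prior.
def posterior_mean_with_prior(n_h, n_t, v_h, v_t):
    return (n_h + v_h + 1) / (n_h + n_t + v_h + v_t + 2)

print(posterior_mean_with_prior(10, 0, 1000, 1000))    # ≈ 0.5025: the prior dominates
print(posterior_mean_with_prior(100, 0, 1000, 1000))   # ≈ 0.524: belief starts to shift
print(10 / (10 + 0))                                   # ML estimate: 1.0
```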
33 Conjugate Priors. The likelihood was Bernoulli distributed and the prior Beta distributed; this ensured that the posterior was also Beta distributed. This is because the Bernoulli and the Beta distribution are conjugate distributions. Using a conjugate prior can make the computation of the posterior tractable (e.g., by ensuring that there is an analytic solution). Likelihood and conjugate prior pairs: Bernoulli / Beta; Binomial / Beta; Multinomial / Dirichlet; Normal / Normal.
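Because prior and posterior lie in the same family, conjugate updating amounts to adding the observed counts to the prior's parameters. A minimal sketch (assuming SciPy is available; the numbers reuse the pseudocount example above):

```python
# Sketch: Beta prior + Bernoulli likelihood -> Beta posterior (conjugacy).
from scipy import stats

a, b = 1001, 1001                      # Beta(V_H + 1, V_T + 1) with V_H = V_T = 1000
n_h, n_t = 10, 0                       # observed heads and tails
posterior = stats.beta(a + n_h, b + n_t)
print(posterior.mean())                # ≈ 0.5025, matching the closed form above
```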
34 Summary. Cognitive tasks can be modeled as probabilistic inference; using Bayes' rule, inference can be broken down into posterior, likelihood, and prior distributions; standard techniques such as maximum likelihood estimation or MAP generate point estimates of the parameters; Bayesian techniques instead average (Bayesian integration) over all parameter values, which makes them less prone to overfitting and allows the use of informative priors; the prior distribution is typically chosen to be conjugate with the likelihood distribution.
35 References. Griffiths, Tom L. and Alan Yuille. 2006. A primer on probabilistic inference. Trends in Cognitive Sciences 10(7).