Day 1: Probability and speech perception


1 Day 1: Probability and speech perception 1

2 Day 2: Human sentence parsing 2

3 Day 3: Noisy-channel sentence processing?

4 Day 4: Language production & acquisition. [Diagram: child-directed utterances ("what's that", "the doggie", "yeah", "where's the doggie") related to a grammar/lexicon (abstract internal representation).]

5 Computational Psycholinguistics, Day 1. Klinton Bicknell and Roger Levy (Northwestern & UCSD), July 7.

7 Computational Psycholinguistics. Psycholinguistics deals with the problem of how humans (1) comprehend, (2) produce, and (3) acquire language. In this class, we will study these problems from a computational, and especially probabilistic/Bayesian, perspective.

8 Class goals:
- Introduce you to the technical foundations of modeling work in the field
- Overview the literature and major areas in which computational psycholinguistic research is carried out
- Acquaint you with some of the key models and their empirical support
- Give you experience in understanding the details of a model from the papers
- Give you practice in critical analysis of models

9 What is computational modeling? Why do we do it? Any phenomenon involving human behavior is so complex that we cannot hope to formulate a comprehensive theory Instead, we devise a model that simplifies the phenomenon to capture some key aspect of it 4/38

15 What might we use a model for? Models can serve any of the following (related) functions:
- Prediction: estimating the behavior/properties of a new state/datum on the basis of an existing dataset
- Hypothesis testing: a framework for determining whether a given factor has an appreciable influence on some other variable
- Data simulation: creating artificial data more cheaply and quickly than through empirical data collection
- Summarization: if phenomenon X is complex but relevant to phenomenon Y, it can be most effective to use a simple model of X when constructing a model of Y
- Insight: most generally, a good model can be explored in ways that give insight into the phenomenon under consideration

16 Feedback from you. Please take a moment to fill out a sheet of paper with this info: Name (optional); School & Program/Department; Year/stage in program; Computational Linguistics background; Psycholinguistics background; Probability/Statistics/Machine Learning background; Do you know about (weighted) finite-state automata?; Do you know about (probabilistic) context-free grammars?; Other courses you're taking at ESSLLI; (other side) What do you hope to learn in this class?

17 Today's content: foundations of probability theory; joint, marginal, and conditional probability; Bayes' Rule; Bayes nets (a.k.a. directed acyclic graphical models, DAGs); the Gaussian distribution; a probabilistic model of human phoneme categorization; a probabilistic model of the perceptual magnet effect.

20 Probability spaces. Traditionally, probability spaces are defined in terms of sets. An event E is a subset of a sample space Ω: E ⊆ Ω. A probability space P on a sample space Ω is a function from events E in Ω to real numbers such that the following three axioms hold:
1. P(E) ≥ 0 for all E ⊆ Ω (non-negativity).
2. If E₁ and E₂ are disjoint, then P(E₁ ∪ E₂) = P(E₁) + P(E₂) (disjoint union).
3. P(Ω) = 1 (properness).
We can also think of these things as involving logical rather than set relations:
Subset: A ⊆ B corresponds to A → B
Disjointness: E₁ ∩ E₂ = ∅ corresponds to ¬(E₁ ∧ E₂)
Union: E₁ ∪ E₂ corresponds to E₁ ∨ E₂

22 A simple example. In historical English, object NPs could appear both preverbally and postverbally: [VP Object Verb] or [VP Verb Object]. There is a broad cross-linguistic tendency for pronominal objects to occur earlier on average than non-pronominal objects. So, hypothetical probabilities from historical English, arranged as a joint table over word order X (Object Preverbal vs. Object Postverbal) and object pronominality Y (Pronoun vs. Not Pronoun); the numeric cell values were not preserved in this transcription. We will sometimes call this the joint distribution P(X, Y) over two random variables: here, verb-object word order X and object pronominality Y.

26 Checking the axioms of probability.
1. P(E) ≥ 0 for all E ⊆ Ω (non-negativity).
2. If E₁ and E₂ are disjoint, then P(E₁ ∪ E₂) = P(E₁) + P(E₂) (disjoint union).
3. P(Ω) = 1 (properness).
For the word-order example, we can consider the sample space to be Ω = {Preverbal+Pronoun, Preverbal+Not Pronoun, Postverbal+Pronoun, Postverbal+Not Pronoun}. Disjoint union tells us the probabilities of non-atomic events: if we define E₁ = {Preverbal+Pronoun, Postverbal+Not Pronoun}, then P(E₁) = P(Preverbal+Pronoun) + P(Postverbal+Not Pronoun). Check for properness: P(Ω), the sum of all four cell probabilities, equals 1.

28 Marginal probability. Sometimes we have a joint distribution P(X, Y) over random variables X and Y, but we're interested in the distribution implied over one of them (here, without loss of generality, X). The marginal probability distribution P(X) is
P(X = x) = Σ_y P(X = x, Y = y)

30 Marginal probability: an example. Using the joint distribution table above, finding the marginal distribution on X:
P(X = Preverbal) = P(X = Preverbal, Y = Pronoun) + P(X = Preverbal, Y = Not Pronoun)
P(X = Postverbal) = P(X = Postverbal, Y = Pronoun) + P(X = Postverbal, Y = Not Pronoun)
So the marginal distribution on X is P(X) over {Preverbal, Postverbal}, and likewise the marginal distribution on Y is P(Y) over {Pronoun, Not Pronoun}.
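To make the marginalization concrete, here is a minimal Python sketch that sums a joint table over one variable. The joint probabilities are invented placeholders (the slide's actual numbers were not preserved); only the mechanics matter.

```python
# Minimal sketch of marginalization from a joint distribution P(X, Y).
# The probabilities below are made-up placeholders, chosen only so the
# table sums to 1; they are not the values from the original slides.
joint = {
    ("Preverbal", "Pronoun"): 0.20,
    ("Preverbal", "NotPronoun"): 0.35,
    ("Postverbal", "Pronoun"): 0.10,
    ("Postverbal", "NotPronoun"): 0.35,
}

def marginal(joint, axis):
    """Sum the joint P(X, Y) over the other variable (axis 0 = X, axis 1 = Y)."""
    out = {}
    for outcome, p in joint.items():
        key = outcome[axis]
        out[key] = out.get(key, 0.0) + p
    return out

print(marginal(joint, 0))  # P(X): Preverbal ~0.55, Postverbal ~0.45
print(marginal(joint, 1))  # P(Y): Pronoun ~0.30, NotPronoun ~0.70
```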

31 Conditional probability. The conditional probability of event B given that A has occurred/is known is defined as follows:
P(B | A) ≜ P(A, B) / P(A)

35 Conditional probability: an example. Given the joint distribution P(X, Y) and the marginals P(X) and P(Y) above, how do we calculate P(Y = Pronoun | X = Postverbal)? By the definition:
P(Y = Pronoun | X = Postverbal) = P(X = Postverbal, Y = Pronoun) / P(X = Postverbal)

43 The chain rule. A joint probability can be rewritten as the product of marginal and conditional probabilities:
P(E₁, E₂) = P(E₂ | E₁) P(E₁)
And this generalizes to more than two variables:
P(E₁, E₂, E₃) = P(E₃ | E₁, E₂) P(E₂ | E₁) P(E₁)
...
P(E₁, E₂, ..., Eₙ) = P(Eₙ | E₁, E₂, ..., Eₙ₋₁) ... P(E₂ | E₁) P(E₁)
Breaking a joint probability down into the product of a marginal probability and several conditional probabilities this way is called chain rule decomposition.
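As a quick sanity check, the sketch below verifies the two-variable chain rule P(X, Y) = P(Y | X) P(X) on the same made-up joint table used in the earlier sketch.

```python
# Verify P(X=x, Y=y) == P(Y=y | X=x) * P(X=x) for every cell of the
# (made-up) joint table from the previous sketch.
joint = {
    ("Preverbal", "Pronoun"): 0.20,
    ("Preverbal", "NotPronoun"): 0.35,
    ("Postverbal", "Pronoun"): 0.10,
    ("Postverbal", "NotPronoun"): 0.35,
}

p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p          # marginal P(X)

for (x, y), p_xy in joint.items():
    p_y_given_x = p_xy / p_x[x]           # conditional P(Y=y | X=x)
    assert abs(p_y_given_x * p_x[x] - p_xy) < 1e-12
print("chain rule decomposition holds for every cell")
```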

50 Bayes' Rule (Bayes' Theorem):
P(A | B) = P(B | A) P(A) / P(B)
With extra background random variables I:
P(A | B, I) = P(B | A, I) P(A | I) / P(B | I)
This theorem follows directly from the definition of conditional probability:
P(A, B) = P(B | A) P(A)
P(A, B) = P(A | B) P(B)
So P(A | B) P(B) = P(B | A) P(A), and dividing both sides by P(B) gives
P(A | B) = P(B | A) P(A) / P(B)

51 Bayes' Rule, more closely inspected:
P(A | B) = P(B | A) P(A) / P(B)
where P(A | B) is the posterior, P(B | A) is the likelihood, P(A) is the prior, and P(B) is the normalizing constant.

53 Bayes' Rule in action. Let me give you the same information you had before: P(Y = Pronoun), P(X = Preverbal | Y = Pronoun), and P(X = Preverbal | Y = Not Pronoun). Imagine you're an incremental sentence processor. You have encountered a transitive verb but haven't encountered the object yet. Inference under uncertainty: how likely is it that the object is a pronoun?

60 Bayes' Rule in action. Given P(Y = Pronoun), P(X = Preverbal | Y = Pronoun), and P(X = Preverbal | Y = Not Pronoun):
P(Y = Pron | X = PostV) = P(X = PostV | Y = Pron) P(Y = Pron) / P(X = PostV)
                        = P(X = PostV | Y = Pron) P(Y = Pron) / Σ_y P(X = PostV, Y = y)
                        = P(X = PostV | Y = Pron) P(Y = Pron) / Σ_y P(X = PostV | Y = y) P(Y = y)
                        = P(X = PostV | Y = Pron) P(Y = Pron) / [P(PostV | Pron) P(Pron) + P(PostV | NotPron) P(NotPron)]
Plugging in the numbers from the table gives the posterior probability that the object is a pronoun.
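The same calculation in code, as a minimal sketch. The three input numbers are placeholders consistent with the made-up joint table above, not the values from the original slides.

```python
# Bayes' Rule with likelihoods and a prior, for P(Y = Pronoun | X = Postverbal).
p_pron = 0.30                     # prior P(Y = Pronoun), assumed placeholder
p_prev_given_pron = 2.0 / 3.0     # P(X = Preverbal | Y = Pronoun), assumed
p_prev_given_notpron = 0.50       # P(X = Preverbal | Y = Not Pronoun), assumed

# Likelihood of the observed datum (a postverbal object) under each hypothesis.
p_postv_given_pron = 1.0 - p_prev_given_pron
p_postv_given_notpron = 1.0 - p_prev_given_notpron

# Normalizing constant: marginal probability of a postverbal object.
p_postv = (p_postv_given_pron * p_pron
           + p_postv_given_notpron * (1.0 - p_pron))

posterior = p_postv_given_pron * p_pron / p_postv
print(f"P(Y = Pronoun | X = Postverbal) = {posterior:.3f}")  # ~0.222
```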

64 Other ways of writing Bayes' Rule. In P(A | B) = P(B | A) P(A) / P(B), the hardest part of using Bayes' Rule was calculating the normalizing constant P(B) (a.k.a. the partition function). Hence there are two other ways we often write Bayes' Rule:
1. Emphasizing explicit marginalization: P(A | B) = P(B | A) P(A) / Σ_a P(A = a, B)
2. Ignoring the partition function: P(A | B) ∝ P(B | A) P(A)
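The proportional form is how Bayes' Rule is usually used in practice: compute an unnormalized score (likelihood times prior) for every hypothesis, then divide by their sum. A sketch with the same placeholder numbers as above:

```python
# Unnormalized posterior scores, then normalization by the partition function.
# Priors and likelihoods are the same assumed placeholders as above.
hypotheses = {
    "Pronoun":    {"prior": 0.30, "lik_postverbal": 1.0 - 2.0 / 3.0},
    "NotPronoun": {"prior": 0.70, "lik_postverbal": 0.50},
}

unnormalized = {h: v["lik_postverbal"] * v["prior"] for h, v in hypotheses.items()}
z = sum(unnormalized.values())                 # the partition function
posterior = {h: score / z for h, score in unnormalized.items()}
print(posterior)   # {'Pronoun': ~0.222, 'NotPronoun': ~0.778}
```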

65 (Conditional) independence. Events A and B are said to be conditionally independent given information C if P(A, B | C) = P(A | C) P(B | C). Conditional independence of A and B given C is often written A ⊥ B | C.

69 Directed graphical models. A lot of the interesting joint probability distributions in the study of language involve conditional independencies among the variables. So next we'll introduce you to a general framework for specifying conditional independencies among collections of random variables. It won't allow us to express all possible independencies that may hold, but it goes a long way. And I hope that you'll agree that the framework is intuitive too!

77 A non-linguistic example. Imagine a factory that produces three types of coins in equal volumes: fair coins; 2-headed coins; 2-tailed coins. Generative process: the factory produces a coin of type X and sends it to you; you receive the coin and flip it twice, with H(eads)/T(ails) outcomes Y₁ and Y₂. Receiving a coin from the factory and flipping it twice is sampling (or taking a sample) from the joint distribution P(X, Y₁, Y₂).

82 This generative process as a Bayes net. The directed acyclic graphical model (DAG), or Bayes net: X → Y₁ and X → Y₂. Semantics of a Bayes net: the joint distribution can be expressed as the product of the conditional distributions of each variable given only its parents. In this DAG, P(X, Y₁, Y₂) = P(X) P(Y₁ | X) P(Y₂ | X), with:
P(X): Fair 1/3, 2-H 1/3, 2-T 1/3
P(Y₁ = H | X): Fair 1/2, 2-H 1, 2-T 0 (and P(Y₁ = T | X) is the complement)
P(Y₂ = H | X): Fair 1/2, 2-H 1, 2-T 0 (likewise)

87 Conditional independence in Bayes nets. Given the tables above: conditioned on not having any further information, are the two coin flips Y₁ and Y₂ in this generative process independent? That is, if C = {}, is it the case that Y₁ ⊥ Y₂ | C? No!
P(Y₂ = H) = 1/2 (you can see this by symmetry), but
P(Y₂ = H | Y₁ = H) = P(X = Fair | Y₁ = H)·(1/2) + P(X = 2-H | Y₁ = H)·1 = (1/3)·(1/2) + (2/3)·1 = 5/6
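The same point can be checked by brute-force enumeration of the joint P(X) P(Y₁|X) P(Y₂|X). This is a minimal sketch of the coin-factory model as specified above:

```python
from itertools import product

# Brute-force enumeration of the coin-factory Bayes net
# P(X, Y1, Y2) = P(X) P(Y1|X) P(Y2|X), to check the (in)dependence claims.
p_x = {"fair": 1 / 3, "2H": 1 / 3, "2T": 1 / 3}
p_heads_given_x = {"fair": 0.5, "2H": 1.0, "2T": 0.0}

def p_flip(x, y):
    """P(flip outcome y | coin type x)."""
    ph = p_heads_given_x[x]
    return ph if y == "H" else 1.0 - ph

joint = {(x, y1, y2): p_x[x] * p_flip(x, y1) * p_flip(x, y2)
         for x, y1, y2 in product(p_x, "HT", "HT")}

p_y2_heads = sum(p for (x, y1, y2), p in joint.items() if y2 == "H")
p_y1_heads = sum(p for (x, y1, y2), p in joint.items() if y1 == "H")
p_both = sum(p for (x, y1, y2), p in joint.items() if y1 == "H" and y2 == "H")

print(p_y2_heads)            # 0.5
print(p_both / p_y1_heads)   # 0.8333... = 5/6: the flips are *not* independent
```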

92 Formally assessing conditional independence in Bayes nets. The comprehensive criterion for assessing conditional independence is known as d-separation.
- A path between two disjoint node sets A and B is a sequence of edges connecting some node in A with some node in B.
- A node on a given path has converging arrows if two edges on the path connect to it and point to it.
- A node on the path has non-converging arrows if two edges on the path connect to it, but at least one does not point to it.
A third disjoint node set C d-separates A and B if for every path between A and B, either:
1. there is some node on the path with converging arrows which is not in C; or
2. there is some node on the path whose arrows do not converge and which is in C.

93 Major types of d-separation. C d-separates A and B if for every path between A and B, either (1) there is some node on the path with converging arrows which is not in C, or (2) there is some node on the path whose arrows do not converge and which is in C. [Figure: four example configurations: common-cause d-separation (A ← C → B, C observed); intervening d-separation (A → C → B, C observed); explaining away, i.e. no d-separation (A → C ← B, C observed); and d-separation in the absence of knowledge of C (A → C ← B, C unobserved).]

96 Back to our example: X → Y₁, X → Y₂. Without looking at the coin before flipping it, the outcome Y₁ of the first flip gives me information about the type of coin, and affects my beliefs about the outcome of Y₂. But if I look at the coin before flipping it, Y₁ and Y₂ are rendered independent.

102 An example of explaining away: "I saw an exhibition about the, uh..." There are several causes of disfluency, including: an upcoming word is difficult to produce (e.g., low frequency: astrolabe); the speaker's attention was distracted by something in the non-linguistic environment. A reasonable graphical model: W (hard word?) → D (disfluency?) ← A (attention distracted?).

105 An example of explaining away, continued. W: hard word? A: attention distracted? D: disfluency? Without knowledge of D, there's no reason to expect that W and A are correlated. But hearing a disfluency demands a cause. Knowing that there was a distraction explains away the disfluency, reducing the probability that the speaker was planning to utter a hard word.

110 An example of the disfluency model. W: hard word? A: attention distracted? D: disfluency? Let's suppose that both hard words and distractions are unusual, the latter more so: P(W = hard) = 0.25, P(A = distracted) = 0.15. Hard words and distractions both induce disfluencies; having both makes a disfluency really likely. The conditional probability table P(D | W, A) gives the disfluency probability for each of the four combinations (easy/hard × undistracted/distracted); its numeric entries were not preserved in this transcription. Suppose that we observe the speaker uttering a disfluency. What is P(W = hard | D = disfluent)? Now suppose we also learn that her attention is distracted. What does that do to our beliefs about W? That is, what is P(W = hard | D = disfluent, A = distracted)?

115 An example of the disfluency model. Fortunately, there is automated machinery to turn the Bayesian crank:
P(W = hard) = 0.25
P(W = hard | D = disfluent) = 0.57
P(W = hard | D = disfluent, A = distracted) = 0.40
Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W): A explained D away. A caveat: the type of relationship among A, W, and D will depend on the values one finds in the probability tables P(W), P(A), and P(D | W, A)!
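One way to "turn the Bayesian crank" is exact inference by enumeration over the small Bayes net. The sketch below uses the slide's priors, but the P(D | W, A) entries are assumed placeholders (the original table was not preserved), so the first posterior comes out near, not exactly at, 0.57; the qualitative explaining-away pattern is the point.

```python
from itertools import product

# Exact inference by enumeration in the disfluency Bayes net
# P(W, A, D) = P(W) P(A) P(D | W, A).
p_w = {"hard": 0.25, "easy": 0.75}
p_a = {"distracted": 0.15, "undistracted": 0.85}
p_disfl = {  # assumed placeholder values for P(D = disfluent | W, A)
    ("easy", "undistracted"): 0.05,
    ("easy", "distracted"): 0.40,
    ("hard", "undistracted"): 0.40,
    ("hard", "distracted"): 0.80,
}

def posterior_hard(evidence):
    """P(W = hard | evidence), where evidence fixes D and optionally A."""
    num = den = 0.0
    for w, a in product(p_w, p_a):
        if "A" in evidence and a != evidence["A"]:
            continue
        d_prob = p_disfl[(w, a)] if evidence["D"] == "disfluent" else 1 - p_disfl[(w, a)]
        p = p_w[w] * p_a[a] * d_prob
        den += p
        if w == "hard":
            num += p
    return num / den

print(posterior_hard({"D": "disfluent"}))                     # ~0.60 (prior was 0.25)
print(posterior_hard({"D": "disfluent", "A": "distracted"}))  # 0.40: A explains D away
```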

116 Summary thus far. Key points: Bayes' Rule is a compelling framework for modeling inference under uncertainty; DAGs/Bayes nets are a broad class of models for specifying joint probability distributions with conditional independencies. Classic Bayes net references: Pearl (1988, 2000); Jordan (1998); Russell and Norvig (2003, Chapter 14); Bishop (2006, Chapter 8).

117 References I
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Jordan, M. I., editor (1998). Learning in Graphical Models. Cambridge, MA: MIT Press.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 2nd edition.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition.

118 An example of the disfluency model: computing P(W = hard | D = disfluent, A = distracted). Abbreviating W = hard as hard, W = easy as easy, D = disfluent as disfl, A = distracted as distr, A = undistracted as undistr:
P(hard | disfl, distr) = P(disfl | hard, distr) P(hard | distr) / P(disfl | distr)   (Bayes' Rule)
                       = P(disfl | hard, distr) P(hard) / P(disfl | distr)           (independence of W and A, from the DAG)
P(disfl | distr) = Σ_w P(disfl | W = w, distr) P(W = w)                               (marginalization)
                 = P(disfl | hard, distr) P(hard) + P(disfl | easy, distr) P(easy)
Plugging in the table values gives P(hard | disfl, distr) = 0.40.

119 An example of the disfluency model: computing P(W = hard | D = disfluent).
P(hard | disfl) = P(disfl | hard) P(hard) / P(disfl)   (Bayes' Rule)
P(disfl | hard) = Σ_a P(disfl | A = a, hard) P(A = a | hard)
                = P(disfl | distr, hard) P(distr) + P(disfl | undistr, hard) P(undistr)
P(disfl | easy) = Σ_a P(disfl | A = a, easy) P(A = a | easy)
                = P(disfl | distr, easy) P(distr) + P(disfl | undistr, easy) P(undistr)
P(disfl) = Σ_w P(disfl | W = w) P(W = w) = P(disfl | hard) P(hard) + P(disfl | easy) P(easy)
Plugging in the table values gives P(hard | disfl) = 0.57.

120 Sound categorization: our first computational psycholinguistic problem. We hear an acoustic signal and must recover a sound category. Our example: distinguishing two similar sound categories, a voicing contrast between a pair of stops: /b/ vs. /p/ or /d/ vs. /t/.

121 Sound categorization. Voice onset time (VOT) is the primary cue distinguishing voiced and voiceless stops (Chen, 1980).

122 Sound categorization. [Figure: identification curve for /d/ vs. /t/ as a function of voice onset time, across stimulus and sentence contexts (Connine et al., 1991).] How do people do this?

123 Bayesian sound categorization. Generative model: c ~ discrete choice, e.g., p(p) = p(b) = 0.5; S | c ~ [some distribution]. Here c is the category and S the sound value. Bayesian inference:
p(c | S) = p(S | c) p(c) / p(S) = p(S | c) p(c) / Σ_{c'} p(S | c') p(c')
The prior p(c) is the probability of each category overall (the first step of the generative model); the likelihood p(S | c) is [some distribution].

124 Plan: some high-level considerations in building cognitive models; probability in continuous spaces and the Gaussian distribution; deriving and testing a probabilistic model of sound categorization; a closely related model of the perceptual magnet effect.

125 Marr's levels of analysis. Three levels of computational models (Marr, 1982): the computational level (what is the structure of the information-processing problem? what are the inputs and outputs? what information is relevant to solving the problem?); the algorithmic level (what representations and algorithms are used?); the implementational level (how are the representations and algorithms implemented neurally?). The levels are mutually constraining, and each is necessary to fully understand a system.

126 Rational analysis. How to perform rational analysis (Anderson, 1990). Background: organism behavior is optimized for common problems both by evolution and by learning. Step 1: specify a formal model of the problem to be solved and the agent's goals, making as few assumptions about computational limitations as possible. Step 2: derive optimal behavior given the problem and goals. Step 3: compare optimal behavior to agent behavior. Step 4: if predictions are off, revisit assumptions about limitations and iterate.

127 Bayesian sound categorization. Generative model: c ~ discrete choice, e.g., p(p) = p(b) = 0.5; S | c ~ [some distribution]. Here c is the category and S the sound (VOT). Bayesian inference: p(c | S) = p(S | c) p(c) / Σ_{c'} p(S | c') p(c'), with prior p(c) the probability of each category overall and likelihood p(S | c) [some distribution].

128 Continuous probability. We can't just assign every VOT outcome a probability: there are uncountably many possible outcomes (e.g., 60.1, 60.01, ...). Instead, we use a probability density function that assigns each outcome a non-negative density; actual probability is now an integral of the density function (area under the curve). Properness requires that ∫ p(x) dx = 1.

134 Continuous probability: a common continuous distribution is the Gaussian, a.k.a. the normal. [Figures: example Gaussian probability density functions plotted over F2 (Hz) and over VOT (ms).]

135 Gaussian parameters. Normal(μ, σ²) = N(μ, σ²) has two parameters. Most probability distributions are properly families of distributions, indexed by parameters: e.g., N(μ = 10, σ² = 10) vs. N(μ = 20, σ² = 5). Formal definition of the Gaussian probability density function:
p(x) = (1 / √(2πσ²)) exp[ −(x − μ)² / (2σ²) ]
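A direct transcription of that density into code, as a minimal sketch (a library routine such as scipy.stats.norm.pdf would do the same job):

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, straight from the formula above."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(60, 0, 400))    # ~0.0002, used in the /b/-/p/ example below
print(gaussian_pdf(60, 100, 400))  # ~0.0027
```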

136 Gaussian parameters: mean = expected value = expectation = μ. Formal definition: E(X) = ∫_{−∞}^{+∞} x p(x) dx. Intuitively, the center of mass (here: 0 and 50 for the two densities plotted).

137 Gaussian parameters: variance = Var = σ². Formal definition: Var(X) = E[(X − E(X))²]; an equivalent alternative definition is Var(X) = E[X²] − E[X]². Intuitively, how broadly outcomes are dispersed (here: 25 and 100 for the two densities plotted).
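A quick numerical check that the two variance definitions agree, on simulated draws from an arbitrarily chosen Gaussian (the parameters below are just for illustration):

```python
import random

random.seed(0)
xs = [random.gauss(50, 10) for _ in range(100_000)]   # draws from N(50, 100)

mean = sum(xs) / len(xs)
var_centered = sum((x - mean) ** 2 for x in xs) / len(xs)      # E[(X - E[X])^2]
var_shortcut = sum(x * x for x in xs) / len(xs) - mean ** 2    # E[X^2] - E[X]^2

print(mean, var_centered, var_shortcut)   # both variance estimates ~100
```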

138 Gaussian parameters: putting both parameters together. [Figure: Gaussian densities p(x) with various means and variances, e.g., μ = 0 with σ² = 1, 2, and 0.5, and μ = 2.]

139 Bayesian sound categorization: modeling ideal speech sound categorization. Which Gaussian category did the sound come from? [Figure: probability densities for /b/ and /p/ along the VOT dimension.]

140 Bayesian sound categorization. Generative model: c ~ discrete choice, e.g., p(p) = p(b) = 0.5; S | c ~ Gaussian(μ_c, σ²_c). Bayesian inference: p(c | S) = p(S | c) p(c) / p(S) = p(S | c) p(c) / Σ_{c'} p(S | c') p(c'), with prior p(c) the probability of each category overall and likelihood p(S | c) = Gaussian(μ_c, σ²_c).

141 Bayesian sound categorization: concrete parameters. c ~ discrete choice, p(p) = p(b) = 0.5; S | c ~ normal, with μ_b = 0, μ_p = 100, σ_b = σ_p = 20. Concrete example, for an observed VOT of 60 ms:
p(b | 60) = p(60 | b) p(b) / [p(60 | b) p(b) + p(60 | p) p(p)]
          = .0002(.5) / [.0002(.5) + .0027(.5)] ≈ .08
where each likelihood is the Gaussian density p(x) = (1/√(2πσ²)) exp[−(x − μ)²/(2σ²)] evaluated at x = 60.
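The same posterior computation in code, a minimal sketch reusing the Gaussian density from above with the parameter values stated on the slide:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def p_b_given_sound(s, prior_b=0.5, mu_b=0.0, mu_p=100.0, sigma=20.0):
    """Posterior p(/b/ | S = s) for two Gaussian categories with equal priors."""
    score_b = gaussian_pdf(s, mu_b, sigma) * prior_b
    score_p = gaussian_pdf(s, mu_p, sigma) * (1.0 - prior_b)
    return score_b / (score_b + score_p)

print(p_b_given_sound(60))   # ~0.08, i.e. a 60 ms VOT is very likely a /p/
```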

142 Bayesian sound categorization: the categorization function. Which (Gaussian) category did the sound come from? [Figures: the two category densities for /b/ and /p/ over VOT, and the resulting posterior probability of /b/ as a function of VOT.]

143 Bayesian sound categorization: the categorization function. The slope of the ideal categorization function changes with category variance. [Figures: [b] and [p] category densities and the posterior probability of /b/ over VOT for lower- vs. higher-variance categories.]

144 Bayesian sound categorization. Clayards et al. (2008) tested exactly this prediction: they trained participants with Gaussian categories of two variances, then tested categorization. [Figures: predicted posterior probability of /b/ and observed proportion of /b/ responses as a function of VOT.]

145 Bayesian sound categorization: wrapping up categorization. We assumed knowledge of the categories (which were Gaussian distributions), found the exact posterior probability that a sound belongs to each of two categories with a simple application of Bayes' rule, and confirmed the Bayesian model's prediction that the categorization function becomes less steep as category variance gets larger. Let's move on to a more complex situation.

146 Bayesian sound categorization. A more complex situation: the perceptual magnet effect. Empirical work by Kuhl and colleagues [Kuhl et al., 1992; Iverson & Kuhl, 1995]; the modeling work we discuss is from Feldman and colleagues [Feldman & Griffiths, 2007; Feldman et al., 2009].


148 Perceptual magnet effect. [Figure: /i/ and /ε/ stimuli in vowel space (Iverson & Kuhl, 1995).]

149 Perceptual magnet effect. [Figure: actual stimuli vs. perceived stimuli, with perceived stimuli drawn toward the category centers (Iverson & Kuhl, 1995).] To account for this, we need a new generative model for speech perception.

155 Speech perception. The speaker chooses a phonetic category c; the speaker articulates a target production T; noise in the speech signal intervenes; the listener hears a speech sound S. Inferring an acoustic value means computing p(T | S).

158 Statistical model. Choose a category c with probability p(c). Articulate a target production T with probability p(T | c) = N(μ_c, σ²_c). The listener hears speech sound S with probability p(S | T) = N(T, σ²_S).

162 Statistical model. [Diagram: phonetic category c → target production T, with T | c ~ N(μ_c, σ²_c); target production T → speech sound S via speech signal noise, with S | T ~ N(T, σ²_S).] The hypotheses h are the target productions T, with prior p(h) given by the phonetic category structure; the data d is the speech sound S, with likelihood p(d | h) given by the speech signal noise.

163 Bayes for speech perception. Listeners must infer the target production based on the speech sound they hear and their prior knowledge of phonetic categories. Data (d): speech sound S. Hypotheses (h): target productions T. Prior (p(h)): phonetic category structure p(T | c). Likelihood (p(d | h)): speech signal noise p(S | T). Then p(h | d) ∝ p(d | h) p(h).

166 Bayes for speech perception. [Figure: the prior (category distribution), the likelihood centered on the heard speech sound S, and the resulting posterior, which lies between them.] For a single category c, the posterior expectation of the target production is
E[T | S, c] = (σ²_c S + σ²_S μ_c) / (σ²_c + σ²_S)

167 Perceptual Warping

169 Multiple categories. We want to compute p(T | S); marginalize over categories:
p(T | S) = Σ_c p(T | S, c) p(c | S)
i.e., the single-category solution weighted by the probability of category membership.

174 Multiple categories. [Figure: two overlapping category distributions and a heard speech sound S.] Combining the single-category posterior means, weighted by category membership:
E[T | S] = Σ_c [(σ²_c S + σ²_S μ_c) / (σ²_c + σ²_S)] p(c | S)
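A compact sketch of this posterior-mean computation for two Gaussian categories. All parameter values below (category means, category variance, noise variance, equal priors) are illustrative assumptions, not the values from Feldman and colleagues.

```python
import math

# Perceptual-magnet posterior mean E[T | S] for a mixture of Gaussian categories.
categories = [
    {"prior": 0.5, "mu": 224.0, "var": 100.0},   # hypothetical /i/-like category
    {"prior": 0.5, "mu": 424.0, "var": 100.0},   # hypothetical /e/-like category
]
noise_var = 400.0                                # assumed speech-signal noise sigma_S^2

def expected_target(s):
    # p(S | c) = N(S; mu_c, var_c + noise_var), used to get p(c | S)
    weights = [c["prior"] * math.exp(-(s - c["mu"]) ** 2 / (2 * (c["var"] + noise_var)))
               / math.sqrt(2 * math.pi * (c["var"] + noise_var)) for c in categories]
    z = sum(weights)
    post_c = [w / z for w in weights]
    # E[T | S, c] = (var_c * S + noise_var * mu_c) / (var_c + noise_var)
    means = [(c["var"] * s + noise_var * c["mu"]) / (c["var"] + noise_var) for c in categories]
    return sum(p * m for p, m in zip(post_c, means))

for s in (250.0, 320.0, 400.0):
    print(s, expected_target(s))   # percepts are pulled toward the nearer category mean
```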

175 Perceptual Warping

176 Perceptual warping. To compare the model to humans, we use a 13-step continuum and estimate the perceptual distance between each adjacent pair of stimuli in humans and in the model.

177 Modeling the /i/-/e/ data. [Figure: relative distances between neighboring stimuli, MDS (human) vs. model; perceptual distance plotted against stimulus number.]

178 Bayesian sound categorization: conclusions. Continuous probability theory lets us build ideal models of speech perception. Part 1: we can build a principled model of categorization which fits human data well (e.g., categorization is less steep for high-variance categories). Part 2: we can predict how linguistic category structure warps perceptual space: speech sounds are perceived as being closer to the center of their likely category.


Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Bayesian networks. Soleymani. CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018

Bayesian networks. Soleymani. CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Bayesian networks CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018 Soleymani Slides have been adopted from Klein and Abdeel, CS188, UC Berkeley. Outline Probability

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Probability Review. September 25, 2015

Probability Review. September 25, 2015 Probability Review September 25, 2015 We need a tool to 1) Formulate a model of some phenomenon. 2) Learn an instance of the model from data. 3) Use it to infer outputs from new inputs. Why Probability?

More information

MATH MW Elementary Probability Course Notes Part I: Models and Counting

MATH MW Elementary Probability Course Notes Part I: Models and Counting MATH 2030 3.00MW Elementary Probability Course Notes Part I: Models and Counting Tom Salisbury salt@yorku.ca York University Winter 2010 Introduction [Jan 5] Probability: the mathematics used for Statistics

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

Graphical Models - Part I

Graphical Models - Part I Graphical Models - Part I Oliver Schulte - CMPT 726 Bishop PRML Ch. 8, some slides from Russell and Norvig AIMA2e Outline Probabilistic Models Bayesian Networks Markov Random Fields Inference Outline Probabilistic

More information

A Brief Introduction to Graphical Models. Presenter: Yijuan Lu November 12,2004

A Brief Introduction to Graphical Models. Presenter: Yijuan Lu November 12,2004 A Brief Introduction to Graphical Models Presenter: Yijuan Lu November 12,2004 References Introduction to Graphical Models, Kevin Murphy, Technical Report, May 2001 Learning in Graphical Models, Michael

More information

COMP9414: Artificial Intelligence Reasoning Under Uncertainty

COMP9414: Artificial Intelligence Reasoning Under Uncertainty COMP9414, Monday 16 April, 2012 Reasoning Under Uncertainty 2 COMP9414: Artificial Intelligence Reasoning Under Uncertainty Overview Problems with Logical Approach What Do the Numbers Mean? Wayne Wobcke

More information

Econ 325: Introduction to Empirical Economics

Econ 325: Introduction to Empirical Economics Econ 325: Introduction to Empirical Economics Lecture 2 Probability Copyright 2010 Pearson Education, Inc. Publishing as Prentice Hall Ch. 3-1 3.1 Definition Random Experiment a process leading to an uncertain

More information

Some Probability and Statistics

Some Probability and Statistics Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my

More information

Probabilistic Models in the Study of Language

Probabilistic Models in the Study of Language Probabilistic Models in the Study of Language Roger Levy November 6, 2012 Roger Levy Probabilistic Models in the Study of Language draft, November 6, 2012 ii Contents About the exercises ix 2 Univariate

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network.

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network. ecall from last time Lecture 3: onditional independence and graph structure onditional independencies implied by a belief network Independence maps (I-maps) Factorization theorem The Bayes ball algorithm

More information

CMPSCI 240: Reasoning about Uncertainty

CMPSCI 240: Reasoning about Uncertainty CMPSCI 240: Reasoning about Uncertainty Lecture 17: Representing Joint PMFs and Bayesian Networks Andrew McGregor University of Massachusetts Last Compiled: April 7, 2017 Warm Up: Joint distributions Recall

More information

Lecture Notes 1 Basic Probability. Elements of Probability. Conditional probability. Sequential Calculation of Probability

Lecture Notes 1 Basic Probability. Elements of Probability. Conditional probability. Sequential Calculation of Probability Lecture Notes 1 Basic Probability Set Theory Elements of Probability Conditional probability Sequential Calculation of Probability Total Probability and Bayes Rule Independence Counting EE 178/278A: Basic

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

Review of Probabilities and Basic Statistics

Review of Probabilities and Basic Statistics Alex Smola Barnabas Poczos TA: Ina Fiterau 4 th year PhD student MLD Review of Probabilities and Basic Statistics 10-701 Recitations 1/25/2013 Recitation 1: Statistics Intro 1 Overview Introduction to

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 14: Bayes Nets 10/14/2008 Dan Klein UC Berkeley 1 1 Announcements Midterm 10/21! One page note sheet Review sessions Friday and Sunday (similar) OHs on

More information

Origins of Probability Theory

Origins of Probability Theory 1 16.584: INTRODUCTION Theory and Tools of Probability required to analyze and design systems subject to uncertain outcomes/unpredictability/randomness. Such systems more generally referred to as Experiments.

More information

Unit 1: Sequence Models

Unit 1: Sequence Models CS 562: Empirical Methods in Natural Language Processing Unit 1: Sequence Models Lecture 5: Probabilities and Estimations Lecture 6: Weighted Finite-State Machines Week 3 -- Sep 8 & 10, 2009 Liang Huang

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics School of Computing & Communication, UTS January, 207 Random variables Pre-university: A number is just a fixed value. When we talk about probabilities: When X is a continuous random variable, it has a

More information

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell

Aarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Machine Learning 0-70/5 70/5-78, 78, Spring 00 Probability 0 Aarti Singh Lecture, January 3, 00 f(x) µ x Reading: Bishop: Chap, Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Announcements Homework

More information

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts

More information

Our Status. We re done with Part I Search and Planning!

Our Status. We re done with Part I Search and Planning! Probability [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] Our Status We re done with Part

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation CS 188: Artificial Intelligence Spring 2010 Lecture 15: Bayes Nets II Independence 3/9/2010 Pieter Abbeel UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore Current

More information

Generative Techniques: Bayes Rule and the Axioms of Probability

Generative Techniques: Bayes Rule and the Axioms of Probability Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 8 3 March 2017 Generative Techniques: Bayes Rule and the Axioms of Probability Generative

More information

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes.

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes. Chapter 2 Introduction to Probability 2.1 Probability Model Probability concerns about the chance of observing certain outcome resulting from an experiment. However, since chance is an abstraction of something

More information

Belief Update in CLG Bayesian Networks With Lazy Propagation

Belief Update in CLG Bayesian Networks With Lazy Propagation Belief Update in CLG Bayesian Networks With Lazy Propagation Anders L Madsen HUGIN Expert A/S Gasværksvej 5 9000 Aalborg, Denmark Anders.L.Madsen@hugin.com Abstract In recent years Bayesian networks (BNs)

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 16: Bayes Nets IV Inference 3/28/2011 Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew Moore Announcements

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015 Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability

More information

Axioms of Probability? Notation. Bayesian Networks. Bayesian Networks. Today we ll introduce Bayesian Networks.

Axioms of Probability? Notation. Bayesian Networks. Bayesian Networks. Today we ll introduce Bayesian Networks. Bayesian Networks Today we ll introduce Bayesian Networks. This material is covered in chapters 13 and 14. Chapter 13 gives basic background on probability and Chapter 14 talks about Bayesian Networks.

More information

Classification & Information Theory Lecture #8

Classification & Information Theory Lecture #8 Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing

More information

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan Lecture 9: Naive Bayes, SVM, Kernels Instructor: Outline 1 Probability basics 2 Probabilistic Interpretation of Classification 3 Bayesian Classifiers, Naive Bayes 4 Support Vector Machines Probability

More information

Probability Review. Chao Lan

Probability Review. Chao Lan Probability Review Chao Lan Let s start with a single random variable Random Experiment A random experiment has three elements 1. sample space Ω: set of all possible outcomes e.g.,ω={1,2,3,4,5,6} 2. event

More information

Probability, Entropy, and Inference / More About Inference

Probability, Entropy, and Inference / More About Inference Probability, Entropy, and Inference / More About Inference Mário S. Alvim (msalvim@dcc.ufmg.br) Information Theory DCC-UFMG (2018/02) Mário S. Alvim (msalvim@dcc.ufmg.br) Probability, Entropy, and Inference

More information

NPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic

NPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic NPFL108 Bayesian inference Introduction Filip Jurčíček Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014

More information

Machine Recognition of Sounds in Mixtures

Machine Recognition of Sounds in Mixtures Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis

More information

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27

Probability Review. Yutian Li. January 18, Stanford University. Yutian Li (Stanford University) Probability Review January 18, / 27 Probability Review Yutian Li Stanford University January 18, 2018 Yutian Li (Stanford University) Probability Review January 18, 2018 1 / 27 Outline 1 Elements of probability 2 Random variables 3 Multiple

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 14: Bayes Nets II Independence 3/9/2011 Pieter Abbeel UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore Announcements

More information

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Probabilistic Reasoning. (Mostly using Bayesian Networks) Probabilistic Reasoning (Mostly using Bayesian Networks) Introduction: Why probabilistic reasoning? The world is not deterministic. (Usually because information is limited.) Ways of coping with uncertainty

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

CSE 473: Artificial Intelligence Autumn 2011

CSE 473: Artificial Intelligence Autumn 2011 CSE 473: Artificial Intelligence Autumn 2011 Bayesian Networks Luke Zettlemoyer Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore 1 Outline Probabilistic models

More information

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty Lecture 10: Introduction to reasoning under uncertainty Introduction to reasoning under uncertainty Review of probability Axioms and inference Conditional probability Probability distributions COMP-424,

More information

Course Introduction. Probabilistic Modelling and Reasoning. Relationships between courses. Dealing with Uncertainty. Chris Williams.

Course Introduction. Probabilistic Modelling and Reasoning. Relationships between courses. Dealing with Uncertainty. Chris Williams. Course Introduction Probabilistic Modelling and Reasoning Chris Williams School of Informatics, University of Edinburgh September 2008 Welcome Administration Handout Books Assignments Tutorials Course

More information

Probability theory. References:

Probability theory. References: Reasoning Under Uncertainty References: Probability theory Mathematical methods in artificial intelligence, Bender, Chapter 7. Expert systems: Principles and programming, g, Giarratano and Riley, pag.

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Grundlagen der Künstlichen Intelligenz

Grundlagen der Künstlichen Intelligenz Grundlagen der Künstlichen Intelligenz Uncertainty & Probabilities & Bandits Daniel Hennes 16.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Uncertainty Probability

More information

Recitation 2: Probability

Recitation 2: Probability Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions

More information

Probability. CS 3793/5233 Artificial Intelligence Probability 1

Probability. CS 3793/5233 Artificial Intelligence Probability 1 CS 3793/5233 Artificial Intelligence 1 Motivation Motivation Random Variables Semantics Dice Example Joint Dist. Ex. Axioms Agents don t have complete knowledge about the world. Agents need to make decisions

More information

Overview of Probability. Mark Schmidt September 12, 2017

Overview of Probability. Mark Schmidt September 12, 2017 Overview of Probability Mark Schmidt September 12, 2017 Dungeons & Dragons scenario: You roll dice 1: Practical Application Roll or you sneak past monster. Otherwise, you are eaten. If you survive, you

More information