Advanced Probabilistic Modeling in R, Day 1
Roger Levy, University of California, San Diego
July 20, 2015

Today's content
- Quick review of probability: axioms, joint and conditional probabilities, Bayes' Rule, conditional independence
- Bayes Nets (a.k.a. directed acyclic graphical models, DAGs)
- The Gaussian distribution
- Example: human phoneme categorization
- Maximum likelihood estimation
- Bayesian parameter estimation
- Frequentist hypothesis testing
- Bayesian hypothesis testing

Probability spaces

Traditionally, probability spaces are defined in terms of sets. An event E is a subset of a sample space Ω: E ⊆ Ω.

A probability space P on a sample space Ω is a function from events E ⊆ Ω to real numbers such that the following three axioms hold:
1. P(E) ≥ 0 for all E ⊆ Ω (non-negativity).
2. If E_1 and E_2 are disjoint, then P(E_1 ∪ E_2) = P(E_1) + P(E_2) (disjoint union).
3. P(Ω) = 1 (properness).

Joint, conditional, and marginal probabilities

Given the joint distribution P(X, Y) over two random variables X and Y, the conditional distribution P(Y | X) is defined as

  P(Y | X) ≜ P(X, Y) / P(X)

The marginal probability distribution P(X) is

  P(X = x) = Σ_y P(X = x, Y = y)

These concepts can be extended to arbitrary numbers of random variables.
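
A minimal R sketch of these definitions, using a small made-up joint distribution over two binary variables (the numbers and names are hypothetical, chosen only for illustration):

```r
## Hypothetical joint distribution P(X, Y) over two binary random variables,
## stored as a 2 x 2 table whose entries sum to 1.
joint <- matrix(c(0.3, 0.1,
                  0.2, 0.4),
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))

## Marginal P(X): sum the joint probabilities over the values of Y.
p_x <- rowSums(joint)

## Conditional P(Y | X): divide each row of the joint by the matching marginal.
p_y_given_x <- sweep(joint, 1, p_x, "/")

p_x          # marginal distribution of X
p_y_given_x  # each row sums to 1
```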

The chain rule

A joint probability can be rewritten as the product of marginal and conditional probabilities:

  P(E_1, E_2) = P(E_2 | E_1) P(E_1)

And this generalizes to more than two variables:

  P(E_1, E_2) = P(E_2 | E_1) P(E_1)
  P(E_1, E_2, E_3) = P(E_3 | E_1, E_2) P(E_2 | E_1) P(E_1)
  ...
  P(E_1, E_2, ..., E_n) = P(E_n | E_1, E_2, ..., E_{n-1}) ... P(E_2 | E_1) P(E_1)

Breaking a joint probability down into the product of a marginal probability and several conditional probabilities this way is called chain rule decomposition.
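
As a quick numerical check, here is a sketch in R that verifies the three-variable chain rule decomposition on an arbitrary made-up joint distribution (all numbers are hypothetical):

```r
## A made-up joint distribution P(E1, E2, E3) over three binary variables,
## stored as a 2 x 2 x 2 array whose entries sum to 1.
p <- array(c(0.10, 0.05, 0.15, 0.10,
             0.20, 0.05, 0.25, 0.10),
           dim = c(2, 2, 2),
           dimnames = list(E1 = c("a", "b"), E2 = c("a", "b"), E3 = c("a", "b")))

## Chain rule: P(E1, E2, E3) = P(E3 | E1, E2) * P(E2 | E1) * P(E1)
p_e1      <- apply(p, 1, sum)              # P(E1)
p_e1e2    <- apply(p, c(1, 2), sum)        # P(E1, E2)
p_e2_g_e1 <- sweep(p_e1e2, 1, p_e1, "/")   # P(E2 | E1)
p_e3_g_12 <- sweep(p, c(1, 2), p_e1e2, "/")# P(E3 | E1, E2)

## Reassemble the joint from the factors; should match p up to rounding.
reconstructed <- sweep(sweep(p_e3_g_12, c(1, 2), p_e2_g_e1, "*"), 1, p_e1, "*")
all.equal(reconstructed, p)
```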

Bayes' Rule (Bayes' Theorem)

  P(A | B) = P(B | A) P(A) / P(B)

With extra background random variables I:

  P(A | B, I) = P(B | A, I) P(A | I) / P(B | I)

This theorem follows directly from the definition of conditional probability:

  P(A, B) = P(B | A) P(A)
  P(A, B) = P(A | B) P(B)

So P(A | B) P(B) = P(B | A) P(A), and dividing both sides by P(B) gives

  P(A | B) = P(B | A) P(A) / P(B)

Other ways of writing Bayes' Rule

  P(A | B) = P(B | A) P(A) / P(B)

where P(B | A) is the likelihood, P(A) is the prior, and P(B) is the normalizing constant. The hardest part of using Bayes' Rule is usually calculating the normalizing constant (a.k.a. the partition function). Hence there are two other common ways of writing Bayes' Rule:

1. Emphasizing explicit marginalization:

  P(A | B) = P(B | A) P(A) / Σ_a P(A = a, B)

2. Ignoring the partition function:

  P(A | B) ∝ P(B | A) P(A)
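
In practice, the "ignore the partition function" form means you can compute an unnormalized posterior and then renormalize at the end. A small R sketch with hypothetical numbers for a three-valued hypothesis A and a single observed value of B:

```r
## Hypothetical prior over three hypotheses and likelihoods P(B = b | A = a)
## for the particular value b that was observed.
prior      <- c(a1 = 0.5, a2 = 0.3, a3 = 0.2)
likelihood <- c(a1 = 0.9, a2 = 0.4, a3 = 0.1)

## Unnormalized posterior: likelihood times prior.
unnorm <- likelihood * prior

## The normalizing constant P(B = b) is the sum over a of P(B = b | a) P(a),
## so dividing by sum(unnorm) recovers the proper posterior.
posterior <- unnorm / sum(unnorm)
posterior   # sums to 1
```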

(Conditional) Independence

Events A and B are said to be conditionally independent given information C if

  P(A, B | C) = P(A | C) P(B | C)

Conditional independence of A and B given C is often expressed as A ⊥ B | C.

Directed graphical models

A lot of the interesting joint probability distributions in the study of language involve conditional independencies among the variables. So next we'll introduce you to a general framework for specifying conditional independencies among collections of random variables. It won't allow us to express all possible independencies that may hold, but it goes a long way. And I hope that you'll agree that the framework is intuitive too!

A non-linguistic example

Imagine a factory that produces three types of coins in equal volumes:
- Fair coins;
- 2-headed coins;
- 2-tailed coins.

Generative process:
- The factory produces a coin of type X and sends it to you;
- You receive the coin and flip it twice, with H(eads)/T(ails) outcomes Y_1 and Y_2.

Receiving a coin from the factory and flipping it twice is sampling (or taking a sample) from the joint distribution P(X, Y_1, Y_2).

This generative process as a Bayes Net

The directed acyclic graphical model (DAG), or Bayes net: X is the parent of both flips, X → Y_1 and X → Y_2.

Semantics of a Bayes net: the joint distribution can be expressed as the product of the conditional distributions of each variable given only its parents. In this DAG,

  P(X, Y_1, Y_2) = P(X) P(Y_1 | X) P(Y_2 | X)

  X     P(X)
  Fair  1/3
  2-H   1/3
  2-T   1/3

  X     P(Y_1 = H | X)  P(Y_1 = T | X)
  Fair  1/2             1/2
  2-H   1               0
  2-T   0               1

  X     P(Y_2 = H | X)  P(Y_2 = T | X)
  Fair  1/2             1/2
  2-H   1               0
  2-T   0               1

Conditional independence in Bayes nets

(Probability tables for P(X), P(Y_1 | X), and P(Y_2 | X) as on the previous slide.)

Question: Conditioned on not having any further information, are the two coin flips Y_1 and Y_2 in this generative process independent? That is, if C = {}, is it the case that Y_1 ⊥ Y_2 | C?

No!
- P(Y_2 = H) = 1/2 (you can see this by symmetry)
- But P(Y_2 = H | Y_1 = H) = 1/3 × 1/2 [coin was fair] + 2/3 × 1 [coin was 2-headed] = 5/6
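
The same answer can be obtained by brute-force enumeration of the joint distribution P(X, Y_1, Y_2) in R. A minimal sketch of that check (variable names are mine):

```r
## Enumerate the joint distribution P(X, Y1, Y2) for the coin-factory example.
coins  <- c("Fair", "2-H", "2-T")
p_x    <- c(Fair = 1/3, `2-H` = 1/3, `2-T` = 1/3)
p_head <- c(Fair = 1/2, `2-H` = 1,   `2-T` = 0)   # P(Y = H | X), same for both flips

grid <- expand.grid(X = coins, Y1 = c("H", "T"), Y2 = c("H", "T"),
                    stringsAsFactors = FALSE)
p_flip <- function(x, y) ifelse(y == "H", p_head[x], 1 - p_head[x])
grid$p <- p_x[grid$X] * p_flip(grid$X, grid$Y1) * p_flip(grid$X, grid$Y2)

## Marginal P(Y2 = H): should be 1/2.
sum(grid$p[grid$Y2 == "H"])

## Conditional P(Y2 = H | Y1 = H): should be 5/6.
sum(grid$p[grid$Y1 == "H" & grid$Y2 == "H"]) / sum(grid$p[grid$Y1 == "H"])
```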

Formally assessing conditional independence in Bayes Nets

The comprehensive criterion for assessing conditional independence is known as d-separation.
- A path between two disjoint node sets A and B is a sequence of edges connecting some node in A with some node in B.
- Any node on a given path has converging arrows if two edges on the path connect to it and point to it.
- A node on the path has non-converging arrows if two edges on the path connect to it, but at least one does not point to it.

A third disjoint node set C d-separates A and B if for every path between A and B, either:
1. there is some node on the path with converging arrows which is not in C; or
2. there is some node on the path whose arrows do not converge and which is in C.

Major types of d-separation

A node set C d-separates A and B if for every path between A and B, either:
1. there is some node on the path with converging arrows which is not in C; or
2. there is some node on the path whose arrows do not converge and which is in C.

[Figure: four small DAGs over X, Y, Z illustrating common-cause d-separation (from knowing Z), intervening d-separation (from knowing Y), explaining away (knowing Z prevents d-separation), and d-separation in the absence of knowledge of Z.]

Back to our example

(DAG: X → Y_1, X → Y_2)

Without looking at the coin before flipping it, the outcome Y_1 of the first flip gives me information about the type of coin, and affects my beliefs about the outcome of Y_2. But if I look at the coin before flipping it, Y_1 and Y_2 are rendered independent.

An example of explaining away

"I saw an exhibition about the, uh..."

There are several causes of disfluency, including:
- An upcoming word is difficult to produce (e.g., low frequency: "astrolabe");
- The speaker's attention was distracted by something in the non-linguistic environment.

A reasonable graphical model has W and A as parents of D (W → D ← A), with:
  W: hard word?
  A: attention distracted?
  D: disfluency?

An example of explaining away

  W: hard word?   A: attention distracted?   D: disfluency?   (W → D ← A)

Without knowledge of D, there's no reason to expect that W and A are correlated. But hearing a disfluency demands a cause. Knowing that there was a distraction "explains away" the disfluency, reducing the probability that the speaker was planning to utter a hard word.

An example of the disfluency model

Let's suppose that both hard words and distractions are unusual, the latter more so:

  P(W = hard) = 0.25
  P(A = distracted) = 0.15

Hard words and distractions both induce disfluencies; having both makes a disfluency really likely:

  W     A             D = no disfluency   D = disfluency
  easy  undistracted  0.99                0.01
  easy  distracted    0.7                 0.3
  hard  undistracted  0.85                0.15
  hard  distracted    0.4                 0.6

An example of the disfluency model

(Priors P(W), P(A) and the table P(D | W, A) as on the previous slide.)

Suppose that we observe the speaker uttering a disfluency. What is P(W = hard | D = disfluent)? Now suppose we also learn that her attention is distracted. What does that do to our beliefs about W? That is, what is P(W = hard | D = disfluent, A = distracted)?

An example of the disfluency model

Fortunately, there is automated machinery to "turn the Bayesian crank":

  P(W = hard) = 0.25
  P(W = hard | D = disfluent) = 0.57
  P(W = hard | D = disfluent, A = distracted) = 0.40

Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W): A explained D away.

A caveat: the type of relationship among A, W, and D will depend on the values one finds in the probability tables P(W), P(A), and P(D | W, A)!

Summary thus far

Key points:
- Bayes' Rule is a compelling framework for modeling inference under uncertainty.
- DAGs/Bayes Nets are a broad class of models for specifying joint probability distributions with conditional independencies.

Classic Bayes Net references: [?]; [?]; [?], Chapter 14; [?], Chapter 8.


An example of the disfluency model: P(W = hard | D = disfluent, A = distracted)

Abbreviations: hard = W=hard, easy = W=easy, disfl = D=disfluent, distr = A=distracted, undistr = A=undistracted.

  P(hard | disfl, distr) = P(disfl | hard, distr) P(hard | distr) / P(disfl | distr)   (Bayes' Rule)
                         = P(disfl | hard, distr) P(hard) / P(disfl | distr)           (independence from the DAG)

  P(disfl | distr) = Σ_{w'} P(disfl | W = w', distr) P(W = w')                          (marginalization)
                   = P(disfl | hard, distr) P(hard) + P(disfl | easy, distr) P(easy)
                   = 0.6 × 0.25 + 0.3 × 0.75 = 0.375

  P(hard | disfl, distr) = (0.6 × 0.25) / 0.375 = 0.4

An example of the disfluency model: P(W = hard | D = disfluent)

  P(hard | disfl) = P(disfl | hard) P(hard) / P(disfl)                                  (Bayes' Rule)

  P(disfl | hard) = Σ_a P(disfl | A = a, hard) P(A = a | hard)
                  = P(disfl | distr, hard) P(distr | hard) + P(disfl | undistr, hard) P(undistr | hard)
                  = 0.6 × 0.15 + 0.15 × 0.85 = 0.2175

  P(disfl | easy) = Σ_a P(disfl | A = a, easy) P(A = a | easy)
                  = P(disfl | distr, easy) P(distr | easy) + P(disfl | undistr, easy) P(undistr | easy)
                  = 0.3 × 0.15 + 0.01 × 0.85 = 0.0535

  P(disfl) = Σ_{w'} P(disfl | W = w') P(W = w')
           = P(disfl | hard) P(hard) + P(disfl | easy) P(easy)
           = 0.2175 × 0.25 + 0.0535 × 0.75 = 0.0945

  P(hard | disfl) = (0.2175 × 0.25) / 0.0945 ≈ 0.575
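
These hand computations can be reproduced by enumerating the joint distribution P(W, A, D) = P(W) P(A) P(D | W, A) in R. A minimal sketch (variable and function names are mine):

```r
## Enumerate the joint distribution for the disfluency Bayes net.
grid <- expand.grid(W = c("easy", "hard"),
                    A = c("undistracted", "distracted"),
                    D = c("no disfluency", "disfluency"),
                    stringsAsFactors = FALSE)

p_w <- c(easy = 0.75, hard = 0.25)
p_a <- c(undistracted = 0.85, distracted = 0.15)

## P(D = disfluency | W, A), from the conditional probability table above.
p_disfl <- function(w, a) {
  ifelse(w == "easy" & a == "undistracted", 0.01,
  ifelse(w == "easy" & a == "distracted",   0.30,
  ifelse(w == "hard" & a == "undistracted", 0.15, 0.60)))
}
p_d <- ifelse(grid$D == "disfluency",
              p_disfl(grid$W, grid$A), 1 - p_disfl(grid$W, grid$A))
grid$p <- p_w[grid$W] * p_a[grid$A] * p_d

## P(W = hard | D = disfluency) -- should be about 0.575.
with(grid, sum(p[W == "hard" & D == "disfluency"]) / sum(p[D == "disfluency"]))

## P(W = hard | D = disfluency, A = distracted) -- should be 0.40.
with(grid, sum(p[W == "hard" & D == "disfluency" & A == "distracted"]) /
           sum(p[D == "disfluency" & A == "distracted"]))
```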

Bayesian parameter estimation

The scenario: you are a native English speaker in whose experience passivizable constructions are passivized with frequency q.

1. The ball hit the window. (Active)
2. The window was hit by the ball. (Passive)

You encounter a new dialect of English and hear data y consisting of n passivizable utterances, m of which were passivized: X ~ Bern(π).

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

Anatomy of Bayesian inference

Simplest possible scenario: I → θ → Y.

The corresponding Bayesian inference:

  P(θ | y, I) = P(y | θ, I) P(θ | I) / P(y | I)
              = P(y | θ) P(θ | I) / P(y | I)      (because y ⊥ I | θ)

where P(y | θ) is the likelihood for θ, P(θ | I) is the prior over θ, and P(y | I) is the likelihood marginalized over θ.

At the bottom of the graph, our model is the binomial distribution:

  y | θ ~ Binom(n, θ)

But to get things going we have to set the prior P(θ | I).

Priors for the binomial distribution

For a model with parameters θ, a prior distribution is just some joint probability distribution P(θ). Because the prior is often supposed to account for knowledge we bring to the table, we often write P(θ | I) to be explicit.

- Model parameters are nearly always real-valued, so P(θ) is generally a multivariate continuous distribution.
- In general, the sky is the limit as to what you choose for P(θ).
- But in many cases there are useful priors that will make your life easier.

The beta distribution

The beta distribution has two parameters α_1, α_2 > 0 and is defined as:

  P(π | α_1, α_2) = (1 / B(α_1, α_2)) π^(α_1 - 1) (1 - π)^(α_2 - 1)      (0 ≤ π ≤ 1, α_1 > 0, α_2 > 0)

where the beta function B(α_1, α_2) serves as a normalizing constant:

  B(α_1, α_2) = ∫_0^1 π^(α_1 - 1) (1 - π)^(α_2 - 1) dπ
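
In R, dbeta() implements this density and beta() the normalizing constant. As a quick sanity check, here is a sketch (with arbitrarily chosen parameter values) verifying the integral definition of B(α_1, α_2) numerically:

```r
a1 <- 3; a2 <- 24   # arbitrary example values for alpha_1 and alpha_2

## The beta function as a normalizing constant: integrate the unnormalized density.
unnorm <- function(p) p^(a1 - 1) * (1 - p)^(a2 - 1)
integrate(unnorm, lower = 0, upper = 1)$value   # numerical integral
beta(a1, a2)                                    # R's built-in beta function

## The normalized density at a point, by hand and via dbeta().
p <- 0.1
unnorm(p) / beta(a1, a2)
dbeta(p, a1, a2)
```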

Some beta distributions

[Figure: densities of Beta(1,1), Beta(0.5,0.5), Beta(3,3), and Beta(3,0.5) over π ∈ [0, 1].]

If X ~ Beta(α_1, α_2):
- E[X] = α_1 / (α_1 + α_2)
- If α_1, α_2 > 1, then X has a mode at (α_1 - 1) / (α_1 + α_2 - 2)
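
The curves in this figure can be reproduced with dbeta(); a minimal R sketch (line types and plotting choices are mine):

```r
## Plot several beta densities on (0, 1), matching the parameter settings in the figure.
p <- seq(0.001, 0.999, length.out = 501)
plot(p, dbeta(p, 1, 1), type = "l", ylim = c(0, 4),
     xlab = expression(pi), ylab = expression(p(pi)))
lines(p, dbeta(p, 0.5, 0.5), lty = 2)
lines(p, dbeta(p, 3, 3),     lty = 3)
lines(p, dbeta(p, 3, 0.5),   lty = 4)
legend("top", lty = 1:4,
       legend = c("Beta(1,1)", "Beta(0.5,0.5)", "Beta(3,3)", "Beta(3,0.5)"))
```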

Using the beta distribution as a prior

1. The ball hit the window. (Active)
2. The window was hit by the ball. (Passive)

Let us use a beta distribution as a prior for our problem; hence I = {α_1, α_2}.

  P(π | y, α_1, α_2) = P(y | π) P(π | α_1, α_2) / P(y | α_1, α_2)      (1)

Since the denominator is not a function of π, it is a normalizing constant. Ignore it and work in terms of proportionality:

  P(π | y, α_1, α_2) ∝ P(y | π) P(π | α_1, α_2)

The likelihood for the binomial distribution is

  P(y | π) = (n choose m) π^m (1 - π)^(n - m)

The beta prior is

  P(π | α_1, α_2) = (1 / B(α_1, α_2)) π^(α_1 - 1) (1 - π)^(α_2 - 1)

Using the beta distribution as a prior

Ignore (n choose m) and B(α_1, α_2) (both constant in π):

  P(π | y, α_1, α_2) ∝ π^m (1 - π)^(n - m) × π^(α_1 - 1) (1 - π)^(α_2 - 1)      [likelihood × prior]
                     ∝ π^(m + α_1 - 1) (1 - π)^(n - m + α_2 - 1)

Crucial trick: this is itself a beta distribution! Recall that if π ~ Beta(α_1, α_2) then

  P(π) = (1 / B(α_1, α_2)) π^(α_1 - 1) (1 - π)^(α_2 - 1)

Hence P(π | y, α_1, α_2) is distributed as Beta(α_1 + m, α_2 + n - m). With a beta prior and a binomial likelihood, the posterior is still beta-distributed. This is called conjugacy.

Using our beta-binomial model

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

To estimate π it is common to use maximum a posteriori (MAP) estimation: choose the value of π with highest posterior probability.
- P(passive | passivizable clause) ≈ 0.08 (Roland et al., 2007)
- The mode of a beta distribution is (α_1 - 1) / (α_1 + α_2 - 2)
- Hence we might use α_1 = 3, α_2 = 24 (note that 2/25 = 0.08)
- Suppose that n = 7, m = 2: our posterior will be Beta(5, 29), hence π̂ = 4/32 = 0.125
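
A short R sketch of this conjugate update and the MAP estimate (the numbers follow the slide; the variable names are mine):

```r
## Prior chosen so that its mode is roughly the corpus rate of 0.08 (Roland et al., 2007).
alpha1 <- 3; alpha2 <- 24
(alpha1 - 1) / (alpha1 + alpha2 - 2)   # prior mode: 0.08

## Observed data in the new dialect: n passivizable utterances, m passives.
n <- 7; m <- 2

## Conjugate update: posterior is Beta(alpha1 + m, alpha2 + n - m).
post1 <- alpha1 + m          # 5
post2 <- alpha2 + n - m      # 29

## MAP estimate of pi: the mode of the posterior beta distribution.
(post1 - 1) / (post1 + post2 - 2)      # 4/32 = 0.125
```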

Beta-binomial posterior distributions

[Figure: the prior density, the (rescaled) likelihoods, and the posterior densities of π for n = 7 and n = 21, plotted over p ∈ [0, 1].]

Fully Bayesian density estimation

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

In the fully Bayesian view, we don't summarize our posterior beliefs into a point estimate; rather, we marginalize over them in predicting the future:

  P(y_new | y, I) = ∫_θ P(y_new | θ) P(θ | y, I) dθ

This leads to the beta-binomial predictive model:

  P(r | k, I, y) = (k choose r) B(α_1 + m + r, α_2 + n - m + k - r) / B(α_1 + m, α_2 + n - m)
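
A sketch in R of this predictive distribution, written directly from the formula above using choose() and beta(), and evaluated for the running example (α_1 = 3, α_2 = 24, n = 7, m = 2, k = 50). The function name dbetabinom_pred is mine, and the comparison binomial curve here plugs in the MAP estimate 0.125; which point estimate the slide's binomial curve uses is an assumption on my part:

```r
## Beta-binomial predictive probability of r passives in the next k utterances,
## given a Beta(alpha1, alpha2) prior and m passives observed out of n.
dbetabinom_pred <- function(r, k, alpha1, alpha2, n, m) {
  choose(k, r) * beta(alpha1 + m + r, alpha2 + n - m + k - r) /
    beta(alpha1 + m, alpha2 + n - m)
}

alpha1 <- 3; alpha2 <- 24; n <- 7; m <- 2
k <- 50; r <- 0:k

pred_bb  <- dbetabinom_pred(r, k, alpha1, alpha2, n, m)
## Plug-in binomial using the MAP estimate (an assumed choice for the comparison).
pred_bin <- dbinom(r, size = k, prob = (alpha1 + m - 1) / (alpha1 + alpha2 + n - 2))

sum(pred_bb)   # should be 1 (up to numerical error)
round(cbind(r, beta_binomial = pred_bb, binomial = pred_bin)[1:11, ], 4)
```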

Fully Bayesian density estimation

[Figure: P(k passives out of 50 trials | y, I) for k = 0, ..., 50, under the plug-in binomial model versus the beta-binomial predictive model.]

Fully Bayesian density estimation

In this case (as in many others), marginalizing over the model parameters allows for greater dispersion in the model's predictions. This is because the new observations are only conditionally independent given θ; with uncertainty about θ, they are linked!

  (DAG: I → θ, and θ → y_new^(1), y_new^(2), ..., y_new^(N))

References

Roland, D., Dick, F., and Elman, J. L. (2007). Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, 57:348-379.