Bayesian Models for Sequences

Bayesian Models for Sequences
(c) D. Poole, A. Mackworth 2010, W. Menzel 2013 - Artificial Intelligence, Lecture 6.6

the world is dynamic:
- old information becomes obsolete
- new information is available
- the decisions an agent takes need to reflect these changes
the dynamics of the world can be captured by means of state-based models

Bayesian Models for Sequences
changes in the world are modelled as transitions between subsequent states
state transitions can be
- clocked, e.g.
  - speech: every 10 ms
  - vision: every 40 ms
  - stock market trends: every 24 hours
- triggered by external events, e.g.
  - language: every other word
  - travel planning: potential transfer points

Bayesian Models for Sequences
main purpose:
- predicting the probability of the next event
- computing the probability of a (sub-)sequence
important application areas: speech and language processing, genome analysis, time series predictions (stock market, natural disasters, ...)

Markov chain
A Markov chain is a special sort of belief network for sequential observations:

  S_0 -> S_1 -> S_2 -> S_3 -> S_4 -> ...

Thus, P(S_{t+1} | S_0, ..., S_t) = P(S_{t+1} | S_t).
Intuitively, S_t conveys all of the information about the history that can affect the future states.
"The past is independent of the future given the present."

Stationary Markov chain
A stationary Markov chain is one where for all t > 0, t' > 0: P(S_{t+1} | S_t) = P(S_{t'+1} | S_{t'}).
Under this condition the network consists of two different kinds of slices:
- for the initial state, without previous nodes (parents), we specify P(S_0)
- for all the following states we specify P(S_t | S_{t-1})
Simple model, easy to specify; often a highly natural model.
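A minimal sketch of a stationary Markov chain being specified by exactly these two ingredients, P(S_0) and one reusable P(S_t | S_{t-1}), and then rolled out; the two-state weather model and its probabilities are illustrative assumptions, not from the slides:

```python
import random

# Sketch of a stationary Markov chain: the same transition distribution
# P(S_t | S_{t-1}) is reused for every time slice.
# The states and numbers below are illustrative, not from the lecture.

P0 = {"sun": 0.6, "rain": 0.4}                      # P(S_0)
P_trans = {                                          # P(S_t | S_{t-1})
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def sample(dist):
    """Draw one value from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # numerical safety net

def roll_out(T):
    """Roll the chain out over T slices, on demand."""
    states = [sample(P0)]
    for _ in range(T - 1):
        states.append(sample(P_trans[states[-1]]))
    return states

print(roll_out(10))
```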

Stationary Markov chain
The network can be extended indefinitely:
- it is rolled out over the full length of the observation sequence
- rolling out the network can be done on demand (incrementally)
- the length of the observation sequence need not be known in advance

Higher-order Markov Models
modelling dependencies of various lengths
- bigrams: chain S_0 -> S_1 -> S_2 -> S_3 -> S_4
- trigrams: each S_i additionally depends on S_{i-2}
three different time slices have to be modelled:
- for S_0: P(S_0)
- for S_1: P(S_1 | S_0)
- for all others: P(S_i | S_{i-2}, S_{i-1})

Higher-order Markov Models
quadrograms: P(S_i | S_{i-3}, S_{i-2}, S_{i-1})
(chain over S_0 ... S_4 with dependencies reaching back three states)
four different kinds of time slices required

Markov Models
examples of Markov chains for German letter sequences:
- unigrams: aiobnin*tarsfneonlpiitdregedcoa*ds*e*dbieastnreleeucdkeaitb*dnurlarsls*omn*keu**svdleeoieei*...
- bigrams: er*agepteprteiningeit*gerelen*re*unk*ves*mterone*hin*d*an*nzerurbom*...
- trigrams: billunten*zugen*die*hin*se*sch*wel*war*gen*man*nicheleblant*diertunderstim*...
- quadrograms: eist*des*nich*in*den*plassen*kann*tragen*was*wiese*zufahr*...
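Sequences like these can be generated with a few lines of code. A minimal sketch, assuming a tiny made-up corpus and a character bigram model estimated by counting; the corpus, the use of '*' as a word-boundary symbol, and the output are assumptions, not the lecture's training data:

```python
import random
from collections import Counter, defaultdict

# Sketch: estimate character bigram probabilities P(c_i | c_{i-1}) from a
# tiny corpus and sample a letter sequence from them.  '*' is used here as a
# word-boundary symbol; the corpus is illustrative.

corpus = "*der*hund*und*die*katze*liefen*in*den*garten*"

counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    counts[prev][cur] += 1

def sample_next(prev):
    """Sample the next character proportionally to the bigram counts."""
    total = sum(counts[prev].values())
    r, acc = random.uniform(0, total), 0
    for c, n in counts[prev].items():
        acc += n
        if r < acc:
            return c
    return c  # numerical safety net

seq = ["*"]
for _ in range(60):
    seq.append(sample_next(seq[-1]))
print("".join(seq))
```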

Hidden Markov Model
Often the observation does not deterministically depend on the state of the model.
This can be captured by a Hidden Markov Model (HMM) ... even if the state transitions are not directly observable.

Hidden Markov Model
An HMM is a belief network where states and observations are separated:

  S_0 -> S_1 -> S_2 -> S_3 -> S_4
   |      |      |      |      |
  O_0    O_1    O_2    O_3    O_4

- P(S_0) specifies the initial conditions
- P(S_{t+1} | S_t) specifies the dynamics
- P(O_t | S_t) specifies the sensor model
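A minimal sketch of how these three distributions combine: the joint probability of one state sequence together with one observation sequence. The state/observation names and numbers are made up for illustration (loosely echoing the milk-test example later in the lecture):

```python
# Sketch: the joint probability an HMM assigns to one state sequence and one
# observation sequence, using exactly the three ingredients on the slide:
# P(S_0), P(S_{t+1} | S_t) and P(O_t | S_t).  Distributions are illustrative.

P0      = {"healthy": 0.9, "infected": 0.1}
P_trans = {"healthy":  {"healthy": 0.95, "infected": 0.05},
           "infected": {"healthy": 0.30, "infected": 0.70}}
P_obs   = {"healthy":  {"neg": 0.9, "pos": 0.1},
           "infected": {"neg": 0.2, "pos": 0.8}}

def joint(states, observations):
    p = P0[states[0]] * P_obs[states[0]][observations[0]]
    for t in range(1, len(states)):
        p *= P_trans[states[t - 1]][states[t]]      # dynamics
        p *= P_obs[states[t]][observations[t]]      # sensor model
    return p

print(joint(["healthy", "healthy", "infected"], ["neg", "pos", "pos"]))
```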

Example (1): robot localization
Suppose a robot wants to determine its location based on its actions and its sensor readings: localization.
This can be represented by the augmented HMM:
(network over Act_0 ... Act_3, Loc_0 ... Loc_4, Obs_0 ... Obs_4: each Loc_{t+1} depends on Loc_t and Act_t, each Obs_t depends on Loc_t)
Combining two kinds of uncertainty:
- the location depends probabilistically on the robot's action
- the sensor data are noisy

Example localization domain
Circular corridor, with 16 locations. Doors at positions 2, 4, 7, 11.
The robot starts at an unknown location and must determine where it is.

Example Sensor Model
P(Observe Door | At Door) = 0.8
P(Observe Door | Not At Door) = 0.1

Example Dynamics Model
P(loc_t = L   | action_{t-1} = goRight ∧ loc_{t-1} = L) = 0.1
P(loc_t = L+1 | action_{t-1} = goRight ∧ loc_{t-1} = L) = 0.8
P(loc_t = L+2 | action_{t-1} = goRight ∧ loc_{t-1} = L) = 0.074
P(loc_t = L'  | action_{t-1} = goRight ∧ loc_{t-1} = L) = 0.002 for any other location L'
All location arithmetic is modulo 16.
The action goLeft works the same but to the left.
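The sensor and dynamics models above are all that is needed to keep a belief over the robot's location up to date. A minimal filtering sketch, reusing the numbers from these slides (16 locations, doors at 2, 4, 7, 11, the sensor model, and the goRight dynamics); the predict/correct loop itself is a generic HMM filter and an assumption, not code from the lecture:

```python
# Sketch: belief updating for the corridor example.  The numbers come from
# the slides above; the filtering code is a generic forward/HMM filter.

N_LOCS = 16
DOORS = {2, 4, 7, 11}

def sensor(observe_door, loc):
    p_door = 0.8 if loc in DOORS else 0.1
    return p_door if observe_door else 1.0 - p_door

def go_right(prev_loc, loc):
    delta = (loc - prev_loc) % N_LOCS          # location arithmetic modulo 16
    return {0: 0.1, 1: 0.8, 2: 0.074}.get(delta, 0.002)

def update(belief, observe_door):
    """One predict-then-correct step after a goRight action."""
    predicted = [sum(belief[p] * go_right(p, l) for p in range(N_LOCS))
                 for l in range(N_LOCS)]
    corrected = [predicted[l] * sensor(observe_door, l) for l in range(N_LOCS)]
    z = sum(corrected)
    return [c / z for c in corrected]

belief = [1.0 / N_LOCS] * N_LOCS          # start: unknown location
for obs in [True, False, True]:           # door, no door, door
    belief = update(belief, obs)
print(max(range(N_LOCS), key=lambda l: belief[l]), belief)
```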

Combining sensor information
the robot can have many (noisy) sensors for signals from the environment, e.g. a light sensor in addition to the door sensor
sensor fusion: combining information from different sources
(network as before, but each Loc_t now has two observation children: D_t and L_t)
D_t: door sensor value at time t
L_t: light sensor value at time t

Hidden Markov Models
Example (2): medical diagnosis (milk infection test) (Jensen and Nielsen 2007)
- the probability of the test outcome depends on the cow being infected or not (Infected? -> Test)
- the probability of the cow being infected depends on the cow being infected on the previous day: first-order Markov model
(chain Inf_1 -> Inf_2 -> ... -> Inf_5, with a Test_t node attached to each Inf_t)

Hidden Markov Models
- the probability of the cow being infected depends on the cow being infected on the two previous days (incubation and infection periods of more than one day): second-order Markov model
(chain over Inf_1 ... Inf_5 with additional arcs Inf_{t-2} -> Inf_t; Test_t attached to each Inf_t)
- assumes only random test errors
- weaker independence assumptions: more powerful model, but more data required for training

Hidden Markov Models
- the probability of the test outcome also depends on the cow's health and the test outcome on the previous day
- can also capture systematic test errors
- second-order Markov model for the infection, first-order Markov model for the test results
(as before, with additional arcs Test_{t-1} -> Test_t)

Hidden Markov Models
Example (3): tagging for natural language processing
- annotating the word forms in a sentence with part-of-speech information:
  Yesterday/RB the/DT school/NNS was/VBD closed/VBN
- topic areas: He did some field work. (field_military, field_agriculture, field_physics, field_social sci., field_optics, ...)
- semantic roles: The winner [Beneficiary] received the trophy [Theme] at the town hall [Location]

Hidden Markov Models
- sequence labelling problem
- the label depends on the current state and the most recent history
- one-to-one correspondence between states, tags, and word forms

Hidden Markov Models
- causal (generative) model of the sentence generation process
- tags are assigned to states; the underlying state (tag) sequence produces the observations (word forms)
- typical independence assumptions:
  - trigram probabilities for the state transitions
  - word form probabilities depend only on the current state
(tag chain Tag_0 -> Tag_1 -> ... -> Tag_4, each Tag_t emitting Word_t)
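Written out, these independence assumptions factorise the joint probability of a tag sequence and a word sequence as below; this is the standard trigram-HMM formulation, with the handling of the first two positions (e.g. via padding tags) left implicit as an assumption:

```latex
P(t_1,\dots,t_n, w_1,\dots,w_n)
  \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_i)
```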

Hidden Markov Model
weaker independence assumption (stronger model): the probability of a word form also depends on the previous and subsequent state
(as before, but each Word_t additionally has incoming arcs from Tag_{t-1} and Tag_{t+1})

Two alternative graphical representations
- influence diagrams, belief networks, Bayesian networks, causal networks, graphical models, ...
- state transition diagrams (probabilistic finite state machines)

                            Bayesian networks                          State transition diagrams
  state nodes               variables with states as values            states
  edges into state nodes    causal influence and their probabilities   possible state transitions
  # state nodes             length of the observation sequence         # model states
  observation nodes         variables with observations as values      observation values
  edges into observ. nodes  conditional probability tables             conditional probabilities

Two alternative graphical representations
Tagging as a Bayesian network:
(tag chain Tag_0 -> ... -> Tag_4, each Tag_t emitting Word_t)
- possible state transitions are not directly visible
- they are indirectly encoded in the conditional probability tables
- sometimes state transition diagrams are better suited to illustrate the model topology

Two alternative graphical representations
Tagging as a state transition diagram (possible only for bigram models):
(states t_1 ... t_4, each emitting word forms w_1 ... w_3)
ergodic model: full connectivity between all states

Hidden Markov Models
Example (4): speech recognition, Swype gesture recognition
- observation subsequences of unknown length are mapped to one label: alignment problem
- full connectivity is not desired: a phone/syllable/word realization cannot be reversed

Hidden Markov Models
possible model topologies for phones (only transitions depicted; states 0 ... 4, all other transition probabilities are 0):
(1) left-to-right model with skips: P(1|0), P(1|1), P(2|0), P(2|1), P(2|2), P(3|1), P(3|2), P(3|3), P(4|2), P(4|3)
(2) left-to-right model with a single skip: P(1|0), P(1|1), P(2|1), P(2|2), P(3|1), P(3|2), P(3|3), P(4|3)
(3) strictly linear left-to-right model: P(1|0), P(1|1), P(2|1), P(2|2), P(3|2), P(3|3), P(4|3)
the more data available, the more sophisticated (and powerful) models can be trained

Hidden Markov Models
composition of submodels on multiple levels:
- phone models have to be concatenated into word models
- word models are concatenated into utterance models
(example: the phone models [f], [a], [n] are concatenated into the word model [f a n])

Dynamic Bayesian Networks
using complex state descriptions, encoded by means of features:
- the model can be in different states at the same time
- more efficient implementation of state transitions
- modelling of transitions between sub-models
- factoring out different influences on the outcome: interplay of several actuators (muscles, motors, ...)
- modelling partly asynchronized processes:
  - coordinated movement of different body parts (e.g. sign language)
  - synchronization between speech sounds and lip movements
  - synchronization between speech and gesture
  - ...

Dynamic Bayesian Networks
problem: state-transition probability tables are sparse, i.e. they contain a large number of zero probabilities
alternative model structure: separation of state and transition variables
- deterministic state variables
- stochastic transition variables
- observation variables
causal links can be stochastic or deterministic:
- stochastic: conditional probabilities to be estimated
- deterministic: to be specified manually (decision trees)

Dynamic Bayesian Networks
state variables:
- distinct values for each state of the corresponding HMM
- the value at slice t+1 is a deterministic function of the state and the transition of slice t
transition variables:
- probability distribution over which arc to take to leave a state of the corresponding HMM
- the number of values is the outdegree of the corresponding state in the HMM
using transition variables is more efficient than using stochastic state variables with zero probabilities for the impossible state transitions
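A minimal sketch of this factoring, assuming a small illustrative left-to-right topology (the states s1, s2, s3 and the arcs "stay"/"advance" are assumptions): only the choice of outgoing arc is stochastic, while the successor state is a deterministic table lookup.

```python
import random

# Sketch: separating a deterministic state variable from a stochastic
# transition variable.  Each state stores a distribution over its outgoing
# arcs only (outdegree many values); the next state is a deterministic
# function of (state, chosen arc).  Topology and numbers are illustrative.

P_transition = {          # P(transition variable | state)
    "s1": {"stay": 0.6, "advance": 0.4},
    "s2": {"stay": 0.7, "advance": 0.3},
    "s3": {"stay": 1.0},                  # final state: outdegree 1
}

def next_state(state, arc):
    """Deterministic successor function: a table, no probabilities."""
    order = ["s1", "s2", "s3"]
    return state if arc == "stay" else order[order.index(state) + 1]

def step(state):
    dist = P_transition[state]
    arc = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state(state, arc)

state = "s1"
for _ in range(10):
    state = step(state)
print(state)
```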

Dynamic Bayesian Networks
composite models: some applications require the model to be composed out of sub-models
- speech: phones -> syllables -> words -> utterances
- vision: sub-parts -> parts -> composites
- genomics: nucleotides -> amino acids -> proteins

Dynamic Bayesian Networks
composite models: the length of the sub-segments is not known in advance
naive concatenation would require generating all possible segmentations of the input sequence
(figure: two concatenated sub-models, e.g. for /n/ and /ow/, each with a layer for the evolution of articulation and a layer for the acoustic emission; at the junction the open question is which sub-model to choose next)

Dynamic Bayesian Networks
additional sub-model variables select the next sub-model to choose:
- sub-model index variables
- stochastic transition variables
- sub-model state variables
- observation variables
sub-model index variables: which sub-model to use at each point in time
sub-model index and transition variables model legal sequences of sub-models (control layer)
several control layers can be combined

Dynamic Bayesian Networks
factored models (1): factoring out different influences on the observation
e.g. articulation: asynchronous movement of the articulators (lips, tongue, jaw, ...)
(network layers: state -> articulators -> observation)
if the data is drawn from a factored source, full DBNs are superior to the special case of HMMs

Dynamic Bayesian Networks
factored models (2): coupling of different input channels, e.g. acoustic and visual information in speech processing
naïve approach (1): data-level fusion
(single network: state -> mixtures -> observation over the combined data)
too strong synchronisation constraints

Dynamic Bayesian Networks
naïve approach (2): independent input streams (separate models for the acoustic channel and the visual channel)
no synchronisation at all

Dynamic Bayesian Networks
product model: a single state layer with mixtures for the visual channel and the acoustic channel
- state values are taken from the cross product of acoustic and visual states
- large probability distributions have to be trained

Dynamic Bayesian Networks
factorial model (Nefian et al. 2002): two state layers (factor 1, factor 2) with mixtures for the visual and the acoustic channel
- independent (hidden) states
- indirect influence by means of the explaining-away effect
- loose coupling of the input channels

Dynamic Bayesian Networks
inference is extremely expensive:
- nodes are connected across slices
- domains are not locally restricted
- cliques become intractably large
but: the joint distribution usually need not be computed
- only maximum detection is required: finding the optimal path through a lattice
- dynamic programming can be applied (Viterbi algorithm)
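A minimal sketch of this maximum detection for the plain HMM case: the Viterbi algorithm finds the single most probable state sequence by dynamic programming over a lattice. The model dictionaries follow the shape of the earlier sketches and are assumptions, not lecture data:

```python
import math

# Sketch: Viterbi decoding for an HMM, in log space to avoid underflow.
# Assumes all referenced probabilities are non-zero (otherwise use -inf
# handling or smoothing).

def viterbi(observations, states, P0, P_trans, P_obs):
    # best[t][s]: log probability of the best path ending in state s at time t
    best = [{s: math.log(P0[s]) + math.log(P_obs[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(P_trans[p][s])) for p in states),
                key=lambda x: x[1])
            best[t][s] = score + math.log(P_obs[s][observations[t]])
            back[t][s] = prev
    # follow the back pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

With the dictionaries from the earlier HMM sketch, `viterbi(["neg", "pos", "pos"], ["healthy", "infected"], P0, P_trans, P_obs)` returns the most probable health-state sequence for those three test results.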

Learning of Bayesian Networks
estimating the probabilities for a given structure:
- for complete data: maximum likelihood estimation, Bayesian estimation
- for incomplete data: expectation maximization, gradient descent methods
learning the network structure

Parameter estimation
complete data:
- maximum likelihood estimation
- Bayesian estimation
incomplete data:
- expectation maximization
- (gradient descent techniques)

Maximum Likelihood Estimation
likelihood of the model M given the (training) data D:
  L(M | D) = ∏_{d ∈ D} P(d | M)
log-likelihood:
  LL(M | D) = Σ_{d ∈ D} log_2 P(d | M)
choose among several possible models for describing the data according to the principle of maximum likelihood:
  Θ̂ = argmax_Θ L(M_Θ | D) = argmax_Θ LL(M_Θ | D)
the models only differ in the set of parameters Θ

Maximum Likelihood Estimation
complete data: estimating the parameters by counting
  P(A = a) = N(A = a) / Σ_{v ∈ dom(A)} N(A = v)
  P(A = a | B = b, C = c) = N(A = a, B = b, C = c) / N(B = b, C = c)
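A minimal sketch of this counting estimate for a conditional probability table P(A | B, C); the records and variable values are illustrative:

```python
from collections import Counter

# Sketch: maximum likelihood estimation of P(A | B, C) by counting over
# complete records.  The data below is made up for illustration.

data = [  # (A, B, C)
    ("yes", "b1", "c1"), ("no", "b1", "c1"), ("yes", "b1", "c1"),
    ("yes", "b2", "c1"), ("no", "b2", "c2"), ("yes", "b2", "c2"),
]

joint = Counter((a, b, c) for a, b, c in data)   # N(A=a, B=b, C=c)
cond  = Counter((b, c) for _, b, c in data)      # N(B=b, C=c)

def p(a, b, c):
    return joint[(a, b, c)] / cond[(b, c)]

print(p("yes", "b1", "c1"))   # 2/3
```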

Rare events
sparse data results in pessimistic estimates for unseen events:
- if the count for an event in the database is 0, the event is considered impossible by the model
- in many applications most events will never be observed, irrespective of the sample size

Rare events
Bayesian estimation: using an estimate of the prior probability as the starting point for the counting
- estimation of maximum a posteriori parameters
- no zero counts can occur
- if nothing else is available, use an even (uniform) distribution as prior
Bayesian estimate in the binary case with an even prior:
  P(yes) = (n + 1) / (n + m + 2)
  n: counts for yes, m: counts for no
effectively, virtual counts are added to the estimate
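A small numeric illustration of this formula against the plain maximum likelihood estimate; the counts are made up:

```python
# Sketch: the slide's binary Bayesian estimate with a uniform prior, compared
# with the maximum likelihood estimate.  With n = 3 "yes" and m = 0 "no"
# observations, "no" keeps a non-zero probability under the Bayesian estimate.

def ml(n, m):
    return n / (n + m)

def bayes(n, m):
    return (n + 1) / (n + m + 2)   # one virtual count per outcome

n, m = 3, 0
print(ml(n, m), 1 - ml(n, m))          # 1.0 0.0  -> "no" looks impossible
print(bayes(n, m), 1 - bayes(n, m))    # 0.8 0.2  -> "no" stays possible
```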

Rare events
alternative: smoothing as a post-processing step
- remove probability mass from the frequent observations ...
- ... and distribute it over the unobserved ones
- floor method, discounting, ...

Incomplete Data
missing at random (MAR): the probability that a value is missing depends only on the observed values
- e.g. confirmation measurement: values are available only if the preceding measurement was positive/negative
missing completely at random (MCAR): the probability that a value is missing is also independent of the values
- e.g. stochastic failures of the measurement equipment
- e.g. hidden/latent variables (mixture coefficients of a Gaussian mixture distribution)
nonignorable: neither MAR nor MCAR
- the probability depends on the unseen values, e.g. exit polls for extremist parties

Expectation Maximization
estimating the underlying distribution of variables that are not directly observable
- expectation: complete the data set, using the current estimate h = Θ to calculate expectations for the missing values; this applies the model to be learned (Bayesian inference)
- maximization: use the completed data set to find a new maximum likelihood estimate h' = Θ'

Expectation Maximization
- the full data consists of tuples ⟨x_{i1}, ..., x_{ik}, z_{i1}, ..., z_{il}⟩; only the x_i can be observed
- training data: X = {x_1, ..., x_m}
- hidden information: Z = {z_1, ..., z_m}
- parameters of the distribution to be estimated: Θ
- Z can be treated as a random variable with p(Z) = f(Θ, X)
- full data: Y = {y | y = x_i ∪ z_i}
- hypothesis h of Θ, which needs to be revised into h'

Expectation Maximization
goal of EM: h' = argmax_{h'} E[log_2 p(Y | h')]
define a function Q(h' | h) = E[log_2 p(Y | h') | h, X]
Estimation (E) step: calculate Q(h' | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
  Q(h' | h) <- E[log_2 p(Y | h') | h, X]
Maximization (M) step: replace hypothesis h by the h' that maximizes the function Q:
  h <- argmax_{h'} Q(h' | h)

Expectation Maximization
- the expectation step requires applying the model to be learned: Bayesian inference
- gradient ascent search: converges to the next local optimum, the global optimum is not guaranteed

Expectation Maximization
E step: Q(h' | h) <- E[ln p(Y | h') | h, X]
M step: h <- argmax_{h'} Q(h' | h)
If Q is continuous, EM converges to the local maximum of the likelihood function P(Y | h').
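A minimal end-to-end EM sketch on a classic toy problem: each record of coin tosses was produced by one of two coins with unknown biases, but which coin was used (the hidden Z) is not recorded. The E step computes expected coin assignments under the current hypothesis, the M step re-estimates the biases from the completed data. The data and starting hypothesis are illustrative, not from the lecture:

```python
from math import comb

# Sketch of EM for two coins with hidden coin identity.
data = [(5, 10), (9, 10), (8, 10), (4, 10), (7, 10)]   # (heads, tosses)
theta = [0.6, 0.5]                                     # initial hypothesis h

def binom(heads, tosses, p):
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

for _ in range(20):
    # E step: expected responsibility of each coin for each record under theta
    resp = []
    for heads, tosses in data:
        w = [binom(heads, tosses, p) for p in theta]
        total = sum(w)
        resp.append([x / total for x in w])
    # M step: maximum likelihood re-estimate of theta from the completed data
    theta = [
        sum(r[k] * heads for r, (heads, tosses) in zip(resp, data)) /
        sum(r[k] * tosses for r, (heads, tosses) in zip(resp, data))
        for k in range(2)
    ]

print(theta)   # converges to a local optimum, here roughly [0.80, 0.52]
```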

Learning the Network Structure
learning the network structure:
- the space of possible networks is extremely large (> O(2^n))
- a Bayesian network over a complete graph is always a possible answer, but not an interesting one (no modelling of independencies)
- problem of overfitting
two approaches:
- constraint-based learning
- (score-based learning)

Constraint-based Structure Learning
- estimate the pairwise degree of independence using conditional mutual information
- determine the direction of the arcs between non-independent nodes

Estimating Independence
conditional mutual information:
  CMI(A, B | X) = Σ_X P(X) Σ_{A,B} P(A, B | X) log_2 [ P(A, B | X) / (P(A | X) P(B | X)) ]
- two nodes are independent if CMI(A, B | X) = 0
- choose as non-independent all pairs of nodes where the significance of a χ²-test on the hypothesis CMI(A, B | X) = 0 is above a certain user-defined threshold
- high minimal significance level: more links are established
- the result is a skeleton of possible candidates for causal influence
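A minimal sketch of estimating CMI(A, B | X) from data by plugging relative frequencies into the formula above (the significance test is omitted); the records are illustrative:

```python
import math
from collections import Counter

# Sketch: estimate CMI(A, B | X) from complete records (a, b, x) using
# relative frequencies.  Unobserved value combinations contribute 0.

def cmi(records):
    n = len(records)
    c_x   = Counter(x for _, _, x in records)
    c_ax  = Counter((a, x) for a, _, x in records)
    c_bx  = Counter((b, x) for _, b, x in records)
    c_abx = Counter(records)
    total = 0.0
    for (a, b, x), n_abx in c_abx.items():
        p_ab_given_x = n_abx / c_x[x]
        p_a_given_x = c_ax[(a, x)] / c_x[x]
        p_b_given_x = c_bx[(b, x)] / c_x[x]
        total += (c_x[x] / n) * p_ab_given_x * math.log2(
            p_ab_given_x / (p_a_given_x * p_b_given_x))
    return total

# A and B are perfectly dependent given X in this made-up data: CMI = 1.0
print(cmi([(0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)] * 10))
```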

Determining Causal Influence
Rule 1 (introduction of v-structures): if A − C and B − C, but not A − B, introduce a v-structure A -> C <- B if there exists a set of nodes X so that A is d-separated from B given X
(figure: the undirected fragment A − C − B is oriented into A -> C <- B)

Determining Causal Influence
Rule 2 (avoid new v-structures): when Rule 1 has been exhausted and there is a structure A -> C − B but not A − B, then direct C -> B
Rule 3 (avoid cycles): if directing A -> B would introduce a cycle in the graph, direct A <- B instead
Rule 4 (choose randomly): if no other rule can be applied to the graph, choose an undirected link and give it an arbitrary direction

Determining Causal Influence
(worked example: a skeleton over the nodes A, B, C, D, E, F, G is oriented step by step by repeatedly applying Rule 1, Rule 2 and Rule 4)

Determining Causal Influence
independence/non-independence candidates might contradict each other:
e.g. ¬I(A,B), ¬I(A,C), ¬I(B,C), but I(A,B | C), I(A,C | B) and I(B,C | A)
remove a link and build a chain out of the remaining ones
(figure: the triangle over A, B, C is replaced by a chain)
uncertain region: different heuristics might lead to different structures

Determining Causal Influence
independence findings: I(A,C), I(A,D), I(B,D)
(figures: the structure over A, B, C, D suggested by these findings, and an alternative structure in which a hidden variable E accounts for the observed dependencies)
the problem might be caused by a hidden variable E

Constraint-based Structure Learning useful results can only be expected, if the data is complete no (unrecognized) hidden variables obscure the induced influence links the observations are a faithful sample of an underlying Bayesian network the distribution of cases in D reflects the distribution determined by the underlying network the estimated probability distribution is very close to the underlying one the underlying distribution is recoverable from the observations c D. Poole, A. Mackworth 2010, W. Menzel 2013 Artificial Intelligence, Lecture 6.6, Page 66

Constraint-based Structure Learning example of an unrecoverable distribution: two switches: P(A = up) = P(B = up) = 0.5 P(C = on) = 1 if val(a) = val(b) I(A,C),I(B,C) A B C problem: independence decisions are taken on individual links (CMI), not on complete link configurations P(C A,B) = ( 1 0 0 1 ) c D. Poole, A. Mackworth 2010, W. Menzel 2013 Artificial Intelligence, Lecture 6.6, Page 67