Expectation Maximisation (EM)
CS 486/686: Introduction to Artificial Intelligence
University of Waterloo


Incomplete Data
So far we have seen problems where
- Values of all attributes are known
- Learning is relatively easy
Many real-world problems have hidden variables
- Incomplete data
- Missing attribute values

Maximum Likelihood Learning
Learning of Bayes net parameters
- θ_{V=true, Par(V)=x} = P(V=true | Par(V)=x)
- θ_{V=true, Par(V)=x} = (# instances with V=true and Par(V)=x) / (# instances with Par(V)=x)
Assumes all attributes have values
- What if some values are missing?
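As a concrete illustration of the counting estimate above, here is a minimal Python sketch. The data layout (a list of dicts) and the variable names (HeartDisease, Smoker) are hypothetical, not from the slides.

from collections import Counter

def ml_estimate(data, var, parents, parent_values):
    """ML estimate of P(var=True | parents=parent_values) by counting.

    data: list of dicts mapping variable names to observed values.
    Assumes every record is complete (no missing values).
    """
    matching = [d for d in data
                if all(d[p] == v for p, v in zip(parents, parent_values))]
    if not matching:
        return None  # no data for this parent configuration
    return sum(d[var] is True for d in matching) / len(matching)

# Hypothetical toy data with HeartDisease and a single parent, Smoker.
data = [
    {"Smoker": True,  "HeartDisease": True},
    {"Smoker": True,  "HeartDisease": False},
    {"Smoker": True,  "HeartDisease": True},
    {"Smoker": False, "HeartDisease": False},
]
print(ml_estimate(data, "HeartDisease", ["Smoker"], [True]))  # 2/3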

Naïve Solutions
Ignore examples with missing attribute values
- What if all examples have missing attribute values?
Ignore hidden variables
- Model might become much more complex

Hidden Variables
Heart disease example:
a) With a hidden variable: simpler network (fewer CPT parameters)
b) Without the hidden variable: more complex network (many CPT parameters)

Direct ML
Maximize the likelihood directly:
h_ML = argmax_h P(e | h) = argmax_h Σ_Z P(e, Z | h)
where e denotes the values of the evidence variables E and Z are the hidden variables

Expectation-Maximization (EM)
If we knew the missing values, computing h_ML would be trivial
Idea: guess h_ML, then iterate
- Expectation: based on the current h_ML, compute the expectation of the missing values
- Maximization: based on the expected missing values, compute a new h_ML

Expectation-Maximization (EM) Formally
Approximate maximum likelihood by iteratively computing
h_{i+1} = argmax_h Σ_Z P(Z | h_i, e) log P(e, Z | h)
- Expectation: the sum over Z, weighted by P(Z | h_i, e), is an expected log likelihood
- Maximization: the argmax chooses the hypothesis h that maximizes this expectation

EM
- The log inside the expectation linearizes the product: log P(e, Z | h) becomes a sum of local terms, so the maximization decomposes over the parameters
- Each iteration gives a monotonic improvement (non-decrease) of the likelihood
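A standard sketch of why the likelihood cannot decrease (not from the slides; this is the usual Jensen-inequality argument):

\log P(e \mid h) - \log P(e \mid h_i)
\;\ge\; \sum_Z P(Z \mid h_i, e)\, \log \frac{P(e, Z \mid h)}{P(e, Z \mid h_i)}
\;=\; Q(h \mid h_i) - Q(h_i \mid h_i)

where Q(h | h_i) = Σ_Z P(Z | h_i, e) log P(e, Z | h) is exactly the quantity maximized on the previous slide. Since h_{i+1} maximizes Q(· | h_i), the right-hand side is at least 0 at h = h_{i+1}, so the log likelihood never decreases.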

Example
Assume we have two coins, A and B
- The probability of getting heads with A is θA
- The probability of getting heads with B is θB
We want to estimate θA and θB from a number of trials
(Example from S. Zafeiriou, Advanced Statistical Machine Learning, Imperial College)

Example: Coin A and Coin B
Five trials of 10 flips each, with the coin used in each trial known:
- Trial 1 (Coin B): H T T T H H T H T H (5 H, 5 T)
- Trial 2 (Coin A): H H H H T H H H H H (9 H, 1 T)
- Trial 3 (Coin A): H T H H H H H T H H (8 H, 2 T)
- Trial 4 (Coin B): H T H T T T H H T T (4 H, 6 T)
- Trial 5 (Coin A): T H H H T H H H T H (7 H, 3 T)
Totals: Coin A 24 H, 6 T; Coin B 9 H, 11 T

Example
Per-trial counts with the coin identities known:
- Coin A: 9 H, 1 T; 8 H, 2 T; 7 H, 3 T (total 24 H, 6 T)
- Coin B: 5 H, 5 T; 4 H, 6 T (total 9 H, 11 T)
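With the coin identities known, the maximum-likelihood estimates follow directly from these counts (the arithmetic is implied by the slide rather than shown in the transcription):

\theta_A = \frac{24}{24 + 6} = 0.80, \qquad \theta_B = \frac{9}{9 + 11} = 0.45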

Example
Now assume we do not know which coin was used in which trial (hidden variable)
- Trial 1: H T T T H H T H T H
- Trial 2: H H H H T H H H H H
- Trial 3: H T H H H H H T H H
- Trial 4: H T H T T T H H T T
- Trial 5: T H H H T H H H T H
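With the coin identities hidden, the quantity being maximized is a mixture likelihood over the five trials. A standard restatement (not on the slide), writing h_t for the number of heads in trial t and assuming each coin is equally likely to be picked for a trial:

P(\text{data} \mid \theta_A, \theta_B) \;\propto\; \prod_{t=1}^{5} \left[ \tfrac{1}{2}\, \theta_A^{h_t} (1-\theta_A)^{10-h_t} + \tfrac{1}{2}\, \theta_B^{h_t} (1-\theta_B)^{10-h_t} \right]

(up to binomial coefficients that do not depend on the parameters).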

Example
Initialization: pick starting values for θA and θB
E step: compute the expected counts of heads and tails under the current parameters
Trial 1: H T T T H H T H T H
- Coin A: 2.2 H, 2.2 T
- Coin B: 2.8 H, 2.8 T
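How the 2.2 / 2.8 split arises: the initialization values are not visible in the transcription, but the numbers are consistent with the common choice θA = 0.6, θB = 0.5, which is assumed here. For trial 1 (5 heads, 5 tails):

P(A \mid \text{trial 1}) = \frac{\theta_A^{5}(1-\theta_A)^{5}}{\theta_A^{5}(1-\theta_A)^{5} + \theta_B^{5}(1-\theta_B)^{5}} = \frac{0.6^{5} \cdot 0.4^{5}}{0.6^{5} \cdot 0.4^{5} + 0.5^{10}} \approx 0.45

so coin A is credited with 0.45 × 5 ≈ 2.2 expected heads and 2.2 expected tails, and coin B with the remaining 0.55 × 5 ≈ 2.8 of each.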

Example
Posterior probability of each coin per trial, and the resulting expected counts:
- Trial 1: H T T T H H T H T H (0.45 A, 0.55 B): Coin A 2.2 H, 2.2 T; Coin B 2.8 H, 2.8 T
- Trial 2: H H H H T H H H H H (0.80 A, 0.20 B): Coin A 7.2 H, 0.8 T; Coin B 1.8 H, 0.2 T
- Trial 3: H T H H H H H T H H (0.73 A, 0.27 B): Coin A 5.9 H, 1.5 T; Coin B 2.1 H, 0.5 T
- Trial 4: H T H T T T H H T T (0.35 A, 0.65 B): Coin A 1.4 H, 2.1 T; Coin B 2.6 H, 3.9 T
- Trial 5: T H H H T H H H T H (0.65 A, 0.35 B): Coin A 4.5 H, 1.9 T; Coin B 2.5 H, 1.1 T
Totals: Coin A 21.3 H, 8.6 T; Coin B 11.7 H, 8.4 T

Example
M step: compute new parameters from the expected counts
- Coin A: 21.3 H, 8.6 T (from per-trial expected counts 2.2/2.2, 7.2/0.8, 5.9/1.5, 1.4/2.1, 4.5/1.9)
- Coin B: 11.7 H, 8.4 T (from per-trial expected counts 2.8/2.8, 1.8/0.2, 2.1/0.5, 2.6/3.9, 2.5/1.1)
Repeat (E step, M step) until convergence
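The updated parameters implied by these expected counts (not visible in the transcription, but they follow by the same counting rule as before) are θA ≈ 21.3 / (21.3 + 8.6) ≈ 0.71 and θB ≈ 11.7 / (11.7 + 8.4) ≈ 0.58. Below is a minimal Python sketch of the whole two-coin EM loop; the initialization θA = 0.6, θB = 0.5 is an assumption, not taken from the transcription.

import numpy as np

# The five trials from the slides: number of heads out of 10 flips each.
heads = np.array([5, 9, 8, 4, 7])
flips = 10

def em_two_coins(theta_a=0.6, theta_b=0.5, n_iters=10):
    """EM for the two-coin example where the coin used per trial is hidden."""
    for _ in range(n_iters):
        # E step: posterior probability that each trial came from coin A
        # (binomial likelihoods; the binomial coefficient cancels out).
        like_a = theta_a ** heads * (1 - theta_a) ** (flips - heads)
        like_b = theta_b ** heads * (1 - theta_b) ** (flips - heads)
        resp_a = like_a / (like_a + like_b)
        resp_b = 1.0 - resp_a

        # Expected counts of heads and tails attributed to each coin.
        heads_a, tails_a = resp_a @ heads, resp_a @ (flips - heads)
        heads_b, tails_b = resp_b @ heads, resp_b @ (flips - heads)

        # M step: re-estimate each coin's bias from the expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b

print(em_two_coins())  # approaches roughly (0.80, 0.52) on this data

After one iteration the expected counts match the table above (21.3 H / 8.6 T for coin A, 11.7 H / 8.4 T for coin B).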

EM: k-means Algorithm
Inputs:
- Set of examples, E
- Input features X1, ..., Xn, where val(e, Xj) = value of feature Xj for example e
- k classes
Outputs:
- Function class: E -> {1, ..., k}, where class(e) = i means example e belongs to class i
- Function pval, where pval(i, Xj) is the predicted value of feature Xj for each example in class i

k-means Algorithm
Sum-of-squares error for class i and pval:
Σ_{e: class(e)=i} Σ_{j=1..n} (pval(i, Xj) - val(e, Xj))^2
Goal: find a final class and pval that minimize the total sum-of-squares error over all classes.

Minimizing the Error
- Given class, the pval that minimizes the sum-of-squares error is the mean value of each feature for that class
- Given pval, each example can be assigned to the class that minimizes the error for that example
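A one-line justification of the first claim (standard, not on the slide): for a single feature with values x_1, ..., x_m in class i,

\frac{d}{d\mu} \sum_{t=1}^{m} (x_t - \mu)^2 = -2 \sum_{t=1}^{m} (x_t - \mu) = 0 \quad\Longrightarrow\quad \mu = \frac{1}{m} \sum_{t=1}^{m} x_t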

k-means Algorithm
Randomly assign the examples to classes
Repeat the following two steps until the E step does not change the assignment of any example:
- M: For each class i and feature Xj, set pval(i, Xj) to the mean of val(e, Xj) over the examples e currently in class i
- E: For each example e, assign e to the class i that minimizes Σ_j (pval(i, Xj) - val(e, Xj))^2
(A code sketch of this loop follows the example data below.)

k-means Example
Data set of (X, Y) pairs:
(0.7, 5.1), (1.5, 6), (2.1, 4.5), (2.4, 5.5), (3, 4.4), (3.5, 5), (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7), (6.7, 2.5), (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)
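A minimal Python sketch of the algorithm above, run on this data set. The choice k = 3 and the random initialization are assumptions; the transcription does not state them.

import random

# The (X, Y) pairs listed on the slide.
data = [(0.7, 5.1), (1.5, 6), (2.1, 4.5), (2.4, 5.5), (3, 4.4), (3.5, 5),
        (4.5, 1.5), (5.2, 0.7), (5.3, 1.8), (6.2, 1.7), (6.7, 2.5),
        (8.5, 9.2), (9.1, 9.7), (9.5, 8.5)]

def kmeans(examples, k=3, seed=0):
    rng = random.Random(seed)
    # Randomly assign the examples to classes.
    classes = [rng.randrange(k) for _ in examples]
    while True:
        # M step: pval(i, Xj) = mean of feature Xj over the examples in class i.
        pval = []
        for i in range(k):
            members = [e for e, c in zip(examples, classes) if c == i]
            if members:
                pval.append(tuple(sum(f) / len(members) for f in zip(*members)))
            else:
                pval.append(rng.choice(examples))  # restart an empty class
        # E step: assign each example to the class minimizing squared error.
        new_classes = [min(range(k),
                           key=lambda i: sum((p - v) ** 2
                                             for p, v in zip(pval[i], e)))
                       for e in examples]
        if new_classes == classes:  # stable assignment: stop
            return classes, pval
        classes = new_classes

classes, means = kmeans(data)
print(classes)  # class index for each example
print(means)    # per-class mean (pval) for X and Y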

Example Data (plot of the data set)

Random Assignment to Classes (plot)

Assign Each Example to Closest Mean (plot)

Reassign Each Example (plot)

Properties of k-means
An assignment is stable if neither the M step nor the E step changes the assignment
- The algorithm will eventually converge to a stable assignment, a local minimum of the error
- There is no guarantee that it converges to the global minimum
Increasing k can always decrease the error, until k equals the number of distinct examples