CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019). Prof. Victor Adamchik, University of Southern California. March 19, 2019.

Administration

- TA3 is due this week.
- TA4 will be available next week.
- PA4 (Clustering, Markov chains) will be available in two weeks.

Outline

1. Boosting
2. Gaussian mixture models

Top 10 Algorithms in Machine Learning...

You should know (in 2019):
- k-Nearest Neighbors
- Decision Trees and Random Forests
- Naive Bayes
- Linear and Logistic Regression
- Artificial Neural Networks
- SVM
- Clustering (k-means)
- Boosting
- Dimensionality Reduction Algorithms
- Markov Chains

Outline

1. Boosting
   - Examples
   - AdaBoost
   - Derivation of AdaBoost
2. Gaussian mixture models

Introduction

Boosting is a meta-algorithm: it takes a base algorithm (for classification, regression, ranking, etc.) as input and boosts its accuracy.

- Main idea: combine weak rules of thumb (e.g., 51% accuracy) to form a highly accurate predictor (e.g., 99% accuracy).
- Works very well in practice (especially in combination with trees).
- Often resistant to overfitting.
- Has strong theoretical guarantees.

We again focus on binary classification.

A simple example

Email spam detection. Given a training set like:
- ("Want to make money fast? ...", spam)
- ("Viterbi Research Gist ...", not spam)

The procedure:
- First obtain a classifier by applying a base algorithm, which can be a rather simple/weak one, like a decision stump: e.g., contains the word "money" => spam.
- Reweight the examples so that difficult ones get more attention, e.g., spam that doesn't contain the word "money".
- Obtain another classifier by applying the same base algorithm: e.g., empty "to" address => spam.
- Repeat...
- The final classifier is the (weighted) majority vote of all weak classifiers.

The base algorithm

A base algorithm $A$ (also called a weak learning algorithm or oracle) takes a training set $S$ weighted by $D$ as input and outputs a classifier $h \leftarrow A(S, D)$.

- This can be any off-the-shelf classification algorithm (e.g., decision trees, logistic regression, neural nets).
- Many algorithms can deal with a weighted training set: for an algorithm that minimizes some loss, we can simply replace the total loss by the weighted total loss (see the sketch below).
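As a generic illustration of the last point (a sketch, not a formula from the slides): if the base algorithm normally minimizes a total loss $\sum_n \ell(h(x_n), y_n)$, its weighted version minimizes

$$
\min_{h \in \mathcal{H}} \; \sum_{n=1}^{N} D(n)\, \ell\big(h(x_n), y_n\big),
$$

which recovers the unweighted objective when $D(n) = 1/N$.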

Boosting Algorithms

Given:
- a training set $S$
- a base algorithm $A$

Two things specify a boosting algorithm:
- how to reweight the examples?
- how to combine all the weak classifiers?

AdaBoost is one of the most successful boosting algorithms.

The AdaBoost Algorithm, 1990

Given $N$ samples $\{(x_n, y_n)\}$ with $y_n \in \{+1, -1\}$, and a base algorithm $A$.

Initialize $D_1(n) = \frac{1}{N}$ to be uniform.

For $t = 1, \ldots, T$:
- Train a weak classifier $h_t \leftarrow A(S, D_t)$ based on the current weights $D_t(n)$, by minimizing the weighted classification error
  $\epsilon_t = \sum_n D_t(n)\,\mathbb{I}[y_n \neq h_t(x_n)]$
- Calculate the importance of $h_t$ as
  $\beta_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
  (note that $\beta_t > 0 \Leftrightarrow \epsilon_t < 0.5$)

The Betas

Calculate the importance of $h_t$ as $\beta_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$, noting that $\beta_t > 0 \Leftrightarrow \epsilon_t < 0.5$.
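For intuition, a few values computed directly from this formula (not taken from the slide): a weak classifier at chance level receives zero weight, and the weight grows as the weighted error shrinks.

$$
\epsilon_t = 0.5 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln(1) = 0, \qquad
\epsilon_t = 0.3 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln\!\big(\tfrac{0.7}{0.3}\big) \approx 0.42, \qquad
\epsilon_t = 0.1 \;\Rightarrow\; \beta_t = \tfrac{1}{2}\ln(9) \approx 1.10.
$$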

The AdaBoost Algorithm (continued)

For $t = 1, \ldots, T$:
- Train a weak classifier $h_t \leftarrow A(S, D_t)$.
- Calculate $\beta_t$.
- Update the weights:
  $\tilde{D}_{t+1}(n) = D_t(n)\, e^{-\beta_t y_n h_t(x_n)} = \begin{cases} D_t(n)\, e^{-\beta_t} & \text{if } h_t(x_n) = y_n \\ D_t(n)\, e^{\beta_t} & \text{otherwise} \end{cases}$
  and normalize them so that $D_{t+1}(n) = \tilde{D}_{t+1}(n) \big/ \sum_{n'} \tilde{D}_{t+1}(n')$.

Output the final classifier:
$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \beta_t h_t(x)\right)$
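To make the steps above concrete, here is a minimal Python sketch of the AdaBoost loop. It is illustrative code, not from the course materials; the base algorithm is passed in as a function `base_learner(X, y, D)` that returns a classifier `h` with `h(X)` producing labels in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, base_learner, T):
    """AdaBoost sketch. y holds labels in {-1, +1};
    base_learner(X, y, D) returns a classifier h with h(X) in {-1, +1}^N."""
    N = len(y)
    D = np.full(N, 1.0 / N)                     # D_1: uniform weights
    classifiers, betas = [], []
    for _ in range(T):
        h = base_learner(X, y, D)               # h_t = A(S, D_t)
        pred = h(X)
        eps = np.sum(D[pred != y])              # weighted error epsilon_t
        beta = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # importance beta_t
        D = D * np.exp(-beta * y * pred)        # e^{-beta} if correct, e^{+beta} if wrong
        D = D / D.sum()                         # normalize: sum_n D_{t+1}(n) = 1
        classifiers.append(h)
        betas.append(beta)

    def H(X_new):                               # final weighted majority vote
        return np.sign(sum(b * h(X_new) for h, b in zip(classifiers, betas)))
    return H
```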

Example

- 10 data points in $\mathbb{R}^2$. The size of + or − indicates the weight, which starts from the uniform $D_1$.
- The base algorithm is a decision stump.
- Observe that no single stump can predict very accurately on this dataset.

Round 1 ($t = 1$): classifier $h_1$ and updated weights $D_2$

- 3 points are misclassified (circled): $\epsilon_1 = 0.3$.
- $\beta_1 = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_1}{\epsilon_1}\right) \approx 0.42$.
- $D_2$ puts more weight on those examples.

Round 2 ($t = 2$): classifier $h_2$ and updated weights $D_3$

- 3 points are misclassified (circled): $\epsilon_2 = 0.21$, $\beta_2 \approx 0.65$.
- $D_3$ puts more weight on those examples.

Round 3 ($t = 3$): classifier $h_3$

- Again 3 points are misclassified (circled): $\epsilon_3 = 0.14$, $\beta_3 \approx 0.92$.

Final classifier: combining 3 classifiers

$H(x) = \operatorname{sign}\big(0.42\, h_1(x) + 0.65\, h_2(x) + 0.92\, h_3(x)\big)$

All data points are now classified correctly, even though each weak classifier makes 3 mistakes.

Overfitting

When $T$ is large, the model is very complicated and overfitting can happen.

Resistance to overfitting

However, very often AdaBoost is resistant to overfitting. This used to be a mystery, but by now a rigorous theory has been developed to explain the phenomenon.

Why does AdaBoost work?

In fact, AdaBoost also follows the general framework of minimizing a surrogate loss.

Step 1: the model class that AdaBoost considers is
$\left\{ \operatorname{sgn}(f(\cdot)) \;\middle|\; f(\cdot) = \sum_{t=1}^{T} \beta_t h_t(\cdot) \text{ for some } \beta_t \ge 0 \text{ and } h_t \in \mathcal{H} \right\}$
where $\mathcal{H}$ is the set of models considered by the base algorithm.

Step 2: the loss that AdaBoost minimizes is the exponential loss
$\sum_{n=1}^{N} \exp(-y_n f(x_n))$

Greedy minimization

Step 3: AdaBoost minimizes the exponential loss greedily, that is, it finds $\beta_t, h_t$ one pair at a time for $t = 1, \ldots, T$.

Specifically, let $f_t = \sum_{\tau=1}^{t} \beta_\tau h_\tau$. Suppose we have found $f_{t-1}$; what should $f_t$ be?
$f_t = \sum_{\tau=1}^{t-1} \beta_\tau h_\tau + \beta_t h_t = f_{t-1} + \beta_t h_t.$

Greedily, we want to find $\beta_t, h_t$ to minimize
$\sum_{n=1}^{N} \exp(-y_n f_t(x_n)) = \sum_{n=1}^{N} \exp(-y_n f_{t-1}(x_n)) \exp(-y_n \beta_t h_t(x_n))$

Next, we use the definition of the weights (the weight-update step above).

Greedy minimization (continued)

Claim: $\exp(-y_n f_{t-1}(x_n)) \propto D_t(n)$.

Proof.
$D_t(n) \propto D_{t-1}(n) \exp(-y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto D_{t-2}(n) \exp(-y_n \beta_{t-2} h_{t-2}(x_n)) \exp(-y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto \cdots$
$\;\propto D_1(n) \exp(-y_n \beta_1 h_1(x_n) - \cdots - y_n \beta_{t-1} h_{t-1}(x_n))$
$\;\propto \exp(-y_n f_{t-1}(x_n))$

Remark: all weights $D_t(n)$ are normalized, $\sum_n D_t(n) = 1$.

Greedy minimization (continued)

So the goal becomes finding $\beta_t \ge 0$ and $h_t \in \mathcal{H}$ that minimize
$\operatorname*{argmin}_{\beta_t, h_t} \sum_{n=1}^{N} \exp(-y_n f_t(x_n)) = \operatorname*{argmin}_{\beta_t, h_t} \sum_{n=1}^{N} D_t(n) \exp(-y_n \beta_t h_t(x_n))$

We decompose the weighted loss into two parts:
$\sum_{n=1}^{N} D_t(n) \exp(-y_n \beta_t h_t(x_n))$
$= \sum_{n: y_n \neq h_t(x_n)} D_t(n)\, e^{\beta_t} + \sum_{n: y_n = h_t(x_n)} D_t(n)\, e^{-\beta_t}$
$= \epsilon_t e^{\beta_t} + (1 - \epsilon_t) e^{-\beta_t}$  (recall $\epsilon_t = \sum_{n: y_n \neq h_t(x_n)} D_t(n)$)
$= \epsilon_t (e^{\beta_t} - e^{-\beta_t}) + e^{-\beta_t}$

We find $h_t$ by minimizing the weighted classification error $\epsilon_t$.

Minimizing the weighted classification error

Thus, we want to choose $h_t$ such that
$h_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \; \epsilon_t = \operatorname*{argmin}_{h_t \in \mathcal{H}} \sum_{n: y_n \neq h_t(x_n)} D_t(n)$

This is exactly the first step of the AdaBoost algorithm above: train a weak classifier based on the current weights $D_t(n)$.

Greedy minimization (continued)

When $h_t$ (and thus $\epsilon_t$) is fixed, we then find $\beta_t$ to minimize
$\epsilon_t (e^{\beta_t} - e^{-\beta_t}) + e^{-\beta_t}$

Taking the derivative with respect to $\beta_t$ and setting it to zero gives the optimal
$\beta_t^{*} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
which is precisely the importance weight used by the AdaBoost algorithm.

Exercise: verify the solution $\beta_t^{*}$ (a check follows below).
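As a check of this step, setting the derivative to zero:

$$
\frac{d}{d\beta_t}\Big[\epsilon_t e^{\beta_t} + (1-\epsilon_t)e^{-\beta_t}\Big]
  = \epsilon_t e^{\beta_t} - (1-\epsilon_t)e^{-\beta_t} = 0
\;\Longrightarrow\;
e^{2\beta_t} = \frac{1-\epsilon_t}{\epsilon_t}
\;\Longrightarrow\;
\beta_t^{*} = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right),
$$

and the second derivative $\epsilon_t e^{\beta_t} + (1-\epsilon_t)e^{-\beta_t}$ is positive, so this stationary point is indeed a minimum.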

Updating the weights

Now that we have improved our classifier into
$f_t(x_n) = f_{t-1}(x_n) + \beta_t h_t(x_n)$

at the $t$-th iteration we need to compute the weights associated with this classifier:
$D_{t+1}(n) \propto e^{-y_n f_t(x_n)} = e^{-y_n [f_{t-1}(x_n) + \beta_t h_t(x_n)]} \propto D_t(n)\, e^{-y_n \beta_t h_t(x_n)} = \begin{cases} D_t(n)\, e^{\beta_t} & \text{if } y_n \neq h_t(x_n) \\ D_t(n)\, e^{-\beta_t} & \text{if } y_n = h_t(x_n) \end{cases}$

which is precisely the weight-update step of the AdaBoost algorithm above.

Remarks

Note that the AdaBoost algorithm itself never specifies how we obtain $h_t$, as long as it minimizes the weighted classification error
$\epsilon_t = \sum_n D_t(n)\,\mathbb{I}[y_n \neq h_t(x_n)]$
In this sense AdaBoost is a meta-algorithm and can be used with any classifier for which we can do the above.

Exercise: how do we choose the decision stump at the second round, given the weights $D_2$ of the example above? We can simply enumerate all possible ways of placing vertical and horizontal lines to separate the data points into two classes and pick the one with the smallest weighted classification error (a sketch follows below).
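A possible implementation of this enumeration as a base learner for the AdaBoost sketch above; the names and structure are illustrative assumptions, not the course's code.

```python
import numpy as np

def stump_learner(X, y, D):
    """Return the axis-aligned decision stump with the smallest weighted error under D."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):                   # each feature (vertical/horizontal split)
        for thr in np.unique(X[:, j]):            # each candidate threshold
            for sign in (+1, -1):                 # which side of the split is labeled +1
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D[pred != y])        # weighted classification error
                if err < best_err:
                    best_err, best = err, (j, thr, sign)
    j, thr, sign = best
    return lambda Xq, j=j, thr=thr, sign=sign: sign * np.where(Xq[:, j] > thr, 1, -1)

# Example usage with the AdaBoost sketch above (3 rounds, as in the worked example):
# H = adaboost(X, y, stump_learner, T=3)
```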

Summary of boosting

- The key idea of boosting is to combine weak predictors into a strong one.
- There are many boosting algorithms; AdaBoost is the most classic one.
- AdaBoost greedily minimizes the exponential loss.
- AdaBoost tends not to overfit.

Taxonomy of ML Models

There are two kinds of classification models in machine learning: generative models and discriminative models.

Discriminative models:
- Examples: nearest neighbor, traditional neural networks, SVM.
- We learn $f(\cdot)$ on a data set $\{(x_i, y_i)\}$ to output the most likely $y$ for an unseen $x$.
- Having $f(\cdot)$, we know how to discriminate unseen $x$'s from different classes.
- We learn the decision boundary between the classes.
- We have no idea how the data was generated.

Taxonomy of ML Models (continued)

Generative models:
- Examples: Naïve Bayes, Gaussian mixture models, hidden Markov models, generative adversarial networks (GANs).
- Widely used in unsupervised machine learning.
- A probabilistic way to think about how the data might have been generated.
- Learn the joint probability distribution $P(x, y)$ and predict $P(y \mid x)$ with the help of Bayes' theorem (see below).
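Concretely, once the joint distribution $P(x, y)$ is learned, prediction follows from Bayes' theorem:

$$
p(y \mid x) \;=\; \frac{p(x, y)}{p(x)} \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}.
$$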

Outline

1. Boosting
2. Gaussian mixture models
   - Motivation and Model
   - EM algorithm

Gaussian mixture models

Gaussian mixture models (GMMs) are a probabilistic approach to clustering:
- more explanatory than minimizing the K-means objective
- can be seen as a soft version of K-means

To fit GMMs, we will introduce a powerful method for learning probabilistic models: the Expectation-Maximization (EM) algorithm.

A generative model

For classification, we discussed the sigmoid model to explain how the labels are generated. Similarly, for clustering, we want to come up with a probabilistic model $p$ that explains how the data is generated; that is, each point is an independent sample $x \sim p$.

What probabilistic model generates data like this?

Gaussian mixture models: intuition

We will model each region with a Gaussian distribution; this leads to the idea of Gaussian mixture models (GMMs). The problem we now face is that (i) we do not know which (color) region a data point comes from, and (ii) we do not know the parameters of the Gaussian distribution in each region. We need to find all of them from the unsupervised data $D = \{x_n\}_{n=1}^{N}$.

GMM: formal definition

A GMM has the following density function:
$p(x) = \sum_{k=1}^{K} \omega_k\, N(x \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \omega_k\, \frac{1}{\sqrt{(2\pi)^D |\Sigma_k|}}\, e^{-\frac{1}{2}(x-\mu_k)^{\mathrm{T}} \Sigma_k^{-1} (x-\mu_k)}$

where
- $K$: the number of Gaussian components (the same as the number of clusters we want)
- $\mu_k$ and $\Sigma_k$: mean and covariance matrix of the $k$-th Gaussian
- $\omega_1, \ldots, \omega_K$: mixture weights; they represent how much each component contributes to the final distribution and satisfy two properties: $\omega_k > 0$ for all $k$, and $\sum_k \omega_k = 1$.
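As a sanity check of the definition, here is a small sketch that evaluates this density directly from the formula above; the example parameters at the bottom are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a single point x."""
    D = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_pdf(x, weights, means, covs):
    """GMM density p(x) = sum_k w_k N(x | mu_k, Sigma_k)."""
    return sum(w * gaussian_pdf(x, mu, S) for w, mu, S in zip(weights, means, covs))

# Illustrative example with K = 2 components in 2D:
weights = [0.3, 0.7]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_pdf(np.array([1.0, 1.0]), weights, means, covs))
```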

Another view

By introducing a latent variable $z \in [K]$ that indicates cluster membership, we can see $p$ as a marginal distribution:
$p(x) = \sum_{k=1}^{K} p(x, z = k) = \sum_{k=1}^{K} p(z = k)\, p(x \mid z = k) = \sum_{k=1}^{K} \omega_k\, N(x \mid \mu_k, \Sigma_k)$

Here $x$ and $z$ are both random variables drawn from the model:
- $x$ is observed
- $z$ is unobserved/latent

An example

The conditional distributions are
$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1)$
$p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2)$
$p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$

The marginal distribution is
$p(x) = p(\text{red})\, N(x \mid \mu_1, \Sigma_1) + p(\text{blue})\, N(x \mid \mu_2, \Sigma_2) + p(\text{green})\, N(x \mid \mu_3, \Sigma_3)$

Learning GMMs

Learning a GMM means finding all the parameters $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.

In the process, we will also learn the latent variables $z_n$:
$p(z_n = k \mid x_n) \triangleq \gamma_{nk} \in [0, 1]$
i.e. a soft assignment of each point to each cluster, as opposed to the hard assignment made by K-means.

GMMs are more explanatory than K-means:
- both learn the cluster centers $\mu_k$;
- in addition, a GMM learns the cluster weights $\omega_k$ and covariances $\Sigma_k$, so we can predict the probability of seeing a new point, and we can generate synthetic data.

How do we learn these parameters?

An obvious attempt is maximum-likelihood estimation (MLE): find
$\operatorname*{argmax}_{\theta} \ln \prod_{n=1}^{N} p(x_n; \theta) = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \ln p(x_n; \theta) \triangleq \operatorname*{argmax}_{\theta} P(\theta)$

This is called the incomplete likelihood (since the $z_n$'s are unobserved), and it is intractable in general (a non-concave problem). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.
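For the GMM in particular, writing out the objective shows where the difficulty comes from: the logarithm sits outside the sum over components, so the objective does not decompose across components and is non-concave in $\theta$:

$$
P(\theta) \;=\; \sum_{n=1}^{N} \ln p(x_n;\theta)
         \;=\; \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \omega_k\, N(x_n \mid \mu_k, \Sigma_k) \right).
$$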

Preview of EM for learning GMMs

Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.

Step 1 (E-step): update the soft assignments (fixing the parameters):
$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k\, N(x_n \mid \mu_k, \Sigma_k)$

Step 2 (M-step): update the model parameters (fixing the assignments):
$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \qquad \mu_k = \frac{\sum_n \gamma_{nk}\, x_n}{\sum_n \gamma_{nk}}, \qquad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\mathrm{T}}$

Step 3: return to Step 1 if not converged.
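A minimal Python sketch of these E-/M-step updates (illustrative code, not the course's implementation; the small `1e-6` ridge added to the covariances is an extra numerical safeguard, not part of the slide's updates):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a GMM on data X of shape (N, D); returns (weights, means, covs)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 0: initialize omega_k, mu_k, Sigma_k
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)   # small ridge for stability

    for _ in range(n_iters):
        # E-step: gamma_nk proportional to omega_k * N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters given the soft assignments
        Nk = gamma.sum(axis=0)
        weights = Nk / N
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)

    return weights, means, covs
```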

Demo

Generate 50 data points from a mixture of 2 Gaussians with
$\omega_1 = 0.3,\; \mu_1 = 0.8,\; \Sigma_1 = 0.52$ and $\omega_2 = 0.7,\; \mu_2 = 1.2,\; \Sigma_2 = 0.35$.

In the accompanying figure (not reproduced here):
- the histogram represents the data;
- the red curve represents the ground-truth density $p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k)$;
- the blue curve represents the learned density at a specific round.

EM demo.pdf shows how the blue curve moves towards the red curve quickly via EM.
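Following the latent-variable view from earlier (draw $z$ first, then $x$ given $z$), the demo data could be generated with a sketch like the following; the parameter values are taken as transcribed above and the code is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])           # omega_1, omega_2
means = np.array([0.8, 1.2])             # mu_1, mu_2 (values as transcribed)
variances = np.array([0.52, 0.35])       # Sigma_1, Sigma_2 (scalar variances in 1D)

z = rng.choice(2, size=50, p=weights)              # latent component for each point
x = rng.normal(means[z], np.sqrt(variances[z]))    # draw x | z from N(mu_z, Sigma_z)
```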

EM algorithm

In general, EM is a heuristic for solving MLE with latent variables (not just for GMMs), i.e. for finding the maximizer of
$P(\theta) = \sum_{n=1}^{N} \ln p(x_n; \theta)$
where
- $\theta$ is the parameter of a general probabilistic model,
- the $x_n$'s are observed random variables,
- the $z_n$'s are latent variables.

Again, directly optimizing this objective is intractable.

High-level idea

Keep maximizing a lower bound of $P$ that is more manageable.
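For reference, the lower bound in question is the standard one obtained from Jensen's inequality (stated here as background; presumably developed in the following lecture). For any distribution $q_n$ over the latent variable $z_n$,

$$
\ln p(x_n;\theta)
  \;=\; \ln \sum_{k=1}^{K} q_n(k)\,\frac{p(x_n, z_n = k;\theta)}{q_n(k)}
  \;\ge\; \sum_{k=1}^{K} q_n(k)\,\ln\frac{p(x_n, z_n = k;\theta)}{q_n(k)},
$$

since $\ln$ is concave. The E-step tightens this bound by choosing $q_n(k) = p(z_n = k \mid x_n; \theta)$, and the M-step maximizes the bound over $\theta$.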