Expectation Maximization, and Learning from Partly Unobserved Data (part 2)


Expectation Maximization, and Learning from Partly Unobserved Data (part 2)
Machine Learning 10-701, April 2005
Tom M. Mitchell, Carnegie Mellon University

Clustering Outline
K-means
EM: Mixture of Gaussians
Training classifiers with partially unlabeled data
  Naïve Bayes and EM
  Reweighting the labeled examples using P(X)
  Co-training
  Regularization based on unlabeled data

1. Unsupervised Clustering: K-means and Mixtures of Gaussians

Clustering
Given a set of data points, group them. Unsupervised learning.
Which patients are similar? (or which earthquakes, customers, faces, ...)

K-means Clustering
Given data <x_1 ... x_n> and K, assign each x_i to one of K clusters C_1 ... C_K, minimizing
J = Σ_{j=1..K} Σ_{x_i ∈ C_j} ||x_i − μ_j||²
where μ_j is the mean over all points in cluster C_j.
K-Means Algorithm:
Initialize the means μ_1 ... μ_K randomly.
Repeat until convergence:
1. Assign each point x_i to the cluster with the closest mean μ_j.
2. Calculate the new mean for each cluster.
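A minimal Python sketch of the algorithm above (this is not the lecture's applet; numpy, the Euclidean distance, and the toy data are my assumptions):

```python
# K-means: alternate between assigning points to the nearest mean and
# recomputing each cluster mean, until the means stop changing.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # random init
    for _ in range(n_iter):
        # Step 1: assign each point x_i to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: recompute each cluster mean (keep the old mean if a cluster is empty)
        new_mu = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else mu[j]
                           for j in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, assign

# Toy data: three well-separated blobs
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
means, labels = kmeans(X, K=3)
```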

K-Means applet: run the applet; try 3 clusters, 15 points.

Mixtures of Gaussians
K-means is EM-ish, but makes hard assignments of x_i to clusters. Let's derive a real EM algorithm for clustering.
What objective function shall we optimize? Maximize the data likelihood!
What form of P(X) should we assume? A mixture of Gaussians.
Mixture distribution: assume P(x) is a mixture of K different Gaussians, and that each data point x is generated by a 2-step process:
1. Choose one of the K Gaussians, z, according to π_1 ... π_K.
2. Generate x according to the Gaussian N(μ_z, Σ_z).
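A small sketch of this 2-step generative process (the mixing weights, means, and shared spherical σ below are made-up values for illustration):

```python
# Sample from a mixture of K Gaussians: first pick a component z ~ pi,
# then draw x ~ N(mu_z, sigma^2 I).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                  # mixing weights pi_1 ... pi_K
mu = np.array([[0., 0.], [5., 5.], [0., 5.]])   # component means mu_1 ... mu_K
sigma = 1.0                                     # shared spherical std dev

def sample_mixture(n):
    z = rng.choice(len(pi), size=n, p=pi)              # step 1: choose a Gaussian
    x = mu[z] + sigma * rng.standard_normal((n, 2))    # step 2: generate x ~ N(mu_z, sigma^2 I)
    return x, z

X, Z = sample_mixture(500)
```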

EM for Mixture of Gaussians
Simplify to make this easier:
1. assume the X_i are conditionally independent given Z
2. assume only 2 classes, and assume ...
3. assume σ known; π_1 ... π_K and μ_1i ... μ_Ki unknown
[Figure: graphical model with Z pointing to X_1, X_2, X_3, X_4. Observed: X; unobserved: Z.]

EM
Given observed variables X and unobserved Z, define
Q(θ' | θ) = E_{Z|X,θ}[ log P(X, Z | θ') ]
where θ is the current parameter estimate.
Iterate until convergence:
E Step: Calculate P(Z(n) | X(n), θ) for each example X(n). Use this to construct Q(θ' | θ).
M Step: Replace the current θ by θ ← argmax_{θ'} Q(θ' | θ).
[Figure: graphical model Z → X_1, X_2, X_3, X_4.]

EM E Step
Calculate P(Z(n) | X(n), θ) for each observed example X(n) = <x_1(n), x_2(n), ..., x_T(n)>. With conditionally independent Gaussian features,
P(z(n)=k | x(n), θ) = π_k Π_i N(x_i(n); μ_ki, σ) / Σ_j π_j Π_i N(x_i(n); μ_ji, σ)
[Figure: graphical model Z → X_1, X_2, X_3, X_4.]

EM M Step
First consider the update for π. Terms of Q that do not involve π have no influence on this update, so π becomes the expected count of examples with z(n)=1, divided by the number of examples:
π ← (1/N) Σ_n P(z(n)=1 | x(n), θ)
[Figure: graphical model Z → X_1, X_2, X_3, X_4.]

EM M Step
Now consider the update for μ_ji. Terms of Q that do not involve μ_ji have no influence, giving
μ_ji ← Σ_n P(z(n)=j | x(n), θ) x_i(n) / Σ_n P(z(n)=j | x(n), θ)
Compare the above to the MLE if Z were observable:
μ_ji = Σ_n δ(z(n)=j) x_i(n) / Σ_n δ(z(n)=j)
[Figure: graphical model Z → X_1, X_2, X_3, X_4.]
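Putting the E and M steps together, here is a sketch of EM for the simplified 2-class case above (features conditionally independent given Z, shared known σ; the variable names are mine):

```python
# EM for a 2-component mixture of axis-aligned Gaussians with known, shared sigma.
# E step: compute P(z(n)=1 | x(n), theta); M step: re-estimate pi and the means.
import numpy as np

def em_two_gaussians(X, sigma=1.0, n_iter=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = 0.5                                                    # P(Z=1)
    mu = X[rng.choice(N, size=2, replace=False)].astype(float)  # mu[k, i], k in {0, 1}
    for _ in range(n_iter):
        # E step: responsibilities r1(n) = P(z(n)=1 | x(n), theta)
        log_p1 = np.log(pi)     - ((X - mu[1]) ** 2).sum(axis=1) / (2 * sigma ** 2)
        log_p0 = np.log(1 - pi) - ((X - mu[0]) ** 2).sum(axis=1) / (2 * sigma ** 2)
        r1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        # M step: expected-count versions of the MLE formulas
        pi = r1.mean()
        mu[1] = (r1[:, None] * X).sum(axis=0) / r1.sum()
        mu[0] = ((1 - r1)[:, None] * X).sum(axis=0) / (1 - r1).sum()
    return pi, mu, r1
```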

Mixture of Gaussians applet: run the applet; try 2 clusters; see different local minima with different random starts.

K-Means vs Mixture of Gaussians
Both are iterative algorithms that assign points to clusters.
Objective function: K-Means minimizes Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||²; Mixture of Gaussians maximizes P(X | θ).
Mixture of Gaussians is the more general formulation; it is equivalent to K-Means when Σ_k = σI and σ → 0.

Using Unlabeled Data to Help Train a Naïve Bayes Classifier
Learn P(Y | X).
[Figure: graphical model Y → X1, X2, X3, X4.]

Y  X1  X2  X3  X4
1   0   0   1   1
0   0   1   0   0
0   0   0   1   0
?   0   1   1   0
?   0   1   0   1

From [Nigam et al., 2000]

E Step:
P(y_j | d_i; θ) = P(y_j) Π_t P(w_t | y_j)^N(w_t, d_i) / Σ_k P(y_k) Π_t P(w_t | y_k)^N(w_t, d_i)
M Step:
P(w_t | y_j) = (1 + Σ_i N(w_t, d_i) P(y_j | d_i)) / (|V| + Σ_s Σ_i N(w_s, d_i) P(y_j | d_i))
where w_t is the t-th word in the vocabulary, d_i is the i-th document, and N(w_t, d_i) is the number of times w_t occurs in d_i.

Elaboration 1: Downweight the influence of unlabeled examples by a factor λ.
New M step: multiply the counts contributed by unlabeled documents by λ in the sums above; λ is chosen by cross-validation.
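A sketch of this semi-supervised Naïve Bayes EM in the spirit of [Nigam et al., 2000], with the unlabeled soft counts downweighted by λ; the function name, Laplace smoothing constants, and data layout (bag-of-words count matrices) are my assumptions, not the paper's code:

```python
# Semi-supervised multinomial Naive Bayes trained with EM.
# Xl, yl: labeled word-count matrix and labels; Xu: unlabeled word-count matrix.
import numpy as np

def ss_naive_bayes_em(Xl, yl, Xu, n_classes, lam=1.0, n_iter=20):
    V = Xl.shape[1]                                   # vocabulary size
    Yl = np.eye(n_classes)[yl]                        # one-hot labels, fixed throughout
    Ru = np.full((Xu.shape[0], n_classes), 1.0 / n_classes)  # P(y | d) for unlabeled docs
    for _ in range(n_iter):
        # M step: priors and per-class word probabilities (Laplace smoothed),
        # with unlabeled soft counts weighted by lam
        class_counts = Yl.sum(axis=0) + lam * Ru.sum(axis=0)
        prior = (1 + class_counts) / (n_classes + class_counts.sum())
        word_counts = Yl.T @ Xl + lam * (Ru.T @ Xu)   # shape [n_classes, V]
        theta = (1 + word_counts) / (V + word_counts.sum(axis=1, keepdims=True))
        # E step: recompute P(y | d) for each unlabeled document
        log_post = np.log(prior) + Xu @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Ru = np.exp(log_post)
        Ru /= Ru.sum(axis=1, keepdims=True)
    return prior, theta
```

With lam=1.0 every unlabeled document counts as much as a labeled one; lam=0 recovers purely supervised Naïve Bayes.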

Using one labeled example per class

Experimental Evaluation
Newsgroup postings: 20 newsgroups, 1000 per group.
Web page classification: student, faculty, course, project; 4,199 web pages.
Reuters newswire articles: 12,902 articles, 90 topic categories.

20 Newsgroups

20 Newsgroups

Combining Labeled and Unlabeled Data How else can unlabeled data be useful for supervised learning/function approximation?

1. Use U to reweight labeled examples
The true error of a hypothesis h is Σ_x P(x) δ(h(x) ≠ f(x)), where δ(h(x) ≠ f(x)) is 1 if hypothesis h disagrees with the true function f on x, else 0. Estimate P(X) from the unlabeled data U and use it to reweight the labeled examples.
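One way to realize this idea is to estimate P(X) from U with a simple density estimator and weight each labeled example accordingly when measuring a hypothesis's error; the kernel density estimate and all names below are illustrative assumptions, not the lecture's method:

```python
# Reweight labeled examples by a density P(X) estimated from unlabeled data U,
# then measure h's weighted disagreement with the true labels.
import numpy as np

def kde_density(U, x, bandwidth=1.0):
    # Gaussian kernel density estimate of P(x) from the unlabeled points U
    d2 = ((U - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * bandwidth ** 2)).mean()

def reweighted_error(h, Xl, yl, U, bandwidth=1.0):
    w = np.array([kde_density(U, x, bandwidth) for x in Xl])
    w /= w.sum()
    disagree = (h(Xl) != yl).astype(float)   # 1 where h disagrees with the true label
    return float((w * disagree).sum())
```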

3. If the Problem Setting Provides Redundantly Sufficient Features, use CoTraining
In some settings, the available data features are redundant and we can train two classifiers using different features. In this case, the two classifiers should at least agree on the classification of each unlabeled example. Therefore, we can use the unlabeled data to constrain the training of both classifiers, forcing them to agree.

Redundantly Sufficient Features Professor Faloutsos my advisor

Redundantly Sufficient Features Professor Faloutsos my advisor

Redundantly Sufficient Features

Redundantly Sufficient Features Professor Faloutsos my advisor

CoTraining Algorithm #1 [Blum & Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
  Train g1 (hyperlink classifier) using L
  Train g2 (page classifier) using L
  Allow g1 to label p positive, n negative examples from U
  Allow g2 to label p positive, n negative examples from U
  Add these self-labeled examples to L
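A sketch of the algorithm above using two scikit-learn Naïve Bayes classifiers as g1 and g2 (the lecture's g1/g2 were hyperlink and page classifiers; here the two views are whatever count-feature matrices you pass in, binary 0/1 labels are assumed, and both classes must appear in the initial labeled set):

```python
# CoTraining Algorithm #1, roughly following Blum & Mitchell (1998):
# each round, each view classifier labels its most confident p positive
# and n most confident negative unlabeled examples, which are added to L.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=1, rounds=10):
    X1_l, X2_l, y_l = X1_l.copy(), X2_l.copy(), y_l.copy()
    unlabeled = np.arange(len(X1_u))
    g1, g2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if len(unlabeled) == 0:
            break
        g1.fit(X1_l, y_l)                      # view-1 classifier
        g2.fit(X2_l, y_l)                      # view-2 classifier
        chosen = {}
        for g, Xu_view in ((g1, X1_u), (g2, X2_u)):
            proba = g.predict_proba(Xu_view[unlabeled])
            pos = unlabeled[np.argsort(-proba[:, 1])[:p]]   # most confident positives
            neg = unlabeled[np.argsort(-proba[:, 0])[:n]]   # most confident negatives
            chosen.update({int(i): 1 for i in pos})
            chosen.update({int(i): 0 for i in neg})
        for i, label in chosen.items():        # add self-labeled examples to L
            X1_l = np.vstack([X1_l, X1_u[i]])
            X2_l = np.vstack([X2_l, X2_u[i]])
            y_l = np.append(y_l, label)
        unlabeled = np.setdiff1d(unlabeled, list(chosen))
    return g1, g2
```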

CoTraining: Experimental Results
Begin with 12 labeled web pages (academic course); provide 1,000 additional unlabeled web pages.
Average error learning from labeled data: 11.1%; average error with cotraining: 5.0%.
Typical run:

CoTraining Setting
Learn f : X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and there exist g1, g2 such that g1(x1) = g2(x2) = f(x).
If
  x1, x2 are conditionally independent given y, and
  f is PAC learnable from noisy labeled data,
then f is PAC learnable from a weak initial classifier plus unlabeled data.

Co-Training Rote Learner
[Figure: bipartite graph linking hyperlink phrases (e.g., "my advisor") to pages, with + and − labels.]

Co-Training Rote Learner
[Figure: the same bipartite graph, with + and − labels filled in across the connected components of hyperlinks and pages.]

Expected Rote CoTraining error given m examples
CoTraining setting: learn f : X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and g1(x1) = g2(x2) = f(x).
E[error] = Σ_j P(x ∈ g_j) (1 − P(x ∈ g_j))^m
where g_j is the j-th connected component of the graph.
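A quick numeric check of this bound (the component masses are made up): as m grows, the chance that some connected component receives no labeled example shrinks, and with it the expected error.

```python
# Expected rote co-training error: a component g_j contributes its mass
# whenever none of the m labeled examples falls inside it.
component_mass = [0.4, 0.3, 0.2, 0.1]   # P(x in g_j), summing to 1
for m in (5, 20, 100):
    expected_error = sum(p * (1 - p) ** m for p in component_mass)
    print(m, round(expected_error, 4))
```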

How many unlabeled examples suffice?
We want to assure that the connected components in the underlying distribution, G_D, are connected components in the observed sample, G_S.
O(log(N)/α) examples assure that, with high probability, G_S has the same connected components as G_D [Karger, 94], where N is the size of G_D and α is the min cut over all connected components of G_D.

PAC Generalization Bounds on CoTraining [Dasgupta et al., NIPS 2001]

What if the CoTraining Assumption Is Not Perfectly Satisfied?
Idea: we want classifiers that produce a maximally consistent labeling of the data. If learning is an optimization problem, what function should we optimize?

What Objective Function?
E = E1 + E2 + c3·E3 + c4·E4
E1 = Σ_{<x,y> ∈ L} (y − ĝ1(x))²          error on labeled examples
E2 = Σ_{<x,y> ∈ L} (y − ĝ2(x))²          error on labeled examples
E3 = Σ_{x ∈ U} (ĝ1(x) − ĝ2(x))²          disagreement over unlabeled examples
E4 = ( (1/|L|) Σ_{<x,y> ∈ L} y − (1/(|L|+|U|)) Σ_{x ∈ L∪U} (ĝ1(x) + ĝ2(x))/2 )²   misfit to estimated class priors

What Function Approximators?
ĝ1(x) = 1 / (1 + e^(−Σ_j w_{j,1} x_j)),   ĝ2(x) = 1 / (1 + e^(−Σ_j w_{j,2} x_j))
Same functional form as Naïve Bayes and Max Entropy, but no word independence assumption, and we use both labeled and unlabeled data.
Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4.
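A sketch of this gradient co-training idea: two logistic-regression view classifiers trained by gradient descent on E = E1 + E2 + E3 + E4. The gradients follow from the squared-error form above; the exact E4 term, learning rate, and weighting constants c3, c4 are my assumptions:

```python
# Gradient co-training: jointly fit g1 (view X1) and g2 (view X2) by descending
# E = E1 + E2 + c3*E3 + c4*E4 (labeled error + disagreement + prior misfit).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_cotrain(X1_l, X2_l, y, X1_u, X2_u, lr=0.05, steps=2000, c3=1.0, c4=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.01, size=X1_l.shape[1])
    w2 = rng.normal(scale=0.01, size=X2_l.shape[1])
    prior = y.mean()                               # class prior estimated from L
    n_all = len(y) + len(X1_u)
    for _ in range(steps):
        g1_l, g2_l = sigmoid(X1_l @ w1), sigmoid(X2_l @ w2)
        g1_u, g2_u = sigmoid(X1_u @ w1), sigmoid(X2_u @ w2)
        s1_l, s2_l = g1_l * (1 - g1_l), g2_l * (1 - g2_l)   # sigmoid derivatives
        s1_u, s2_u = g1_u * (1 - g1_u), g2_u * (1 - g2_u)
        # E1, E2: squared error of each classifier on the labeled examples
        grad1 = -2 * ((y - g1_l) * s1_l) @ X1_l
        grad2 = -2 * ((y - g2_l) * s2_l) @ X2_l
        # E3: disagreement of the two classifiers on the unlabeled examples
        diff = g1_u - g2_u
        grad1 += c3 * 2 * (diff * s1_u) @ X1_u
        grad2 -= c3 * 2 * (diff * s2_u) @ X2_u
        # E4: misfit between the labeled-data prior and the average prediction on L+U
        avg_pred = (g1_l.sum() + g1_u.sum() + g2_l.sum() + g2_u.sum()) / (2 * n_all)
        mis = avg_pred - prior
        grad1 += c4 * 2 * mis * (np.concatenate([s1_l, s1_u]) @ np.vstack([X1_l, X1_u])) / (2 * n_all)
        grad2 += c4 * 2 * mis * (np.concatenate([s2_l, s2_u]) @ np.vstack([X2_l, X2_u])) / (2 * n_all)
        w1 -= lr * grad1
        w2 -= lr * grad2
    return w1, w2
```

Setting c4 = 0 corresponds to the "cotraining without fitting class priors (E4)" row in the results below.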

Classifying Jobs for FlipDog
X1: job title; X2: job description.

Gradient CoTraining
Classifying FlipDog job descriptions: SysAdmin vs. WebProgrammer.
Final accuracy: labeled data alone 86%; cotraining 96%.

Gradient CoTraining
Classifying upper-case sequences as person names. Error rates:

                                              25 labeled,      2300 labeled,
                                              5000 unlabeled   5000 unlabeled
Using labeled data only                       .24              .13
Cotraining                                    .15*             .11*
Cotraining without fitting class priors (E4)  .27*

* sensitive to the weights of error terms E3 and E4

CoTraining Summary
Unlabeled data improves supervised learning when example features are redundantly sufficient.
Family of algorithms that train multiple classifiers.
Theoretical results:
  Expected error for rote learning.
  If X1, X2 are conditionally independent given Y, then f is PAC learnable from a weak initial classifier plus unlabeled data.
  Error bounds in terms of disagreement between g1(x1) and g2(x2).
Many real-world problems of this type:
  Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]
  Web page classification [Blum, Mitchell 98]
  Word sense disambiguation [Yarowsky 95]
  Speech recognition [de Sa, Ballard 98]

Clustering: What you should know
K-means algorithm: hard labels.
EM for mixtures of Gaussians: probabilistic labels.
Be able to derive your own EM algorithm.
Using unlabeled data to help with supervised classification:
  Naïve Bayes augmented by unlabeled data
  Using unlabeled data to reweight labeled examples
  Co-training
  Using unlabeled data for regularization