CS 6140: Machine Learning Spring 2017



1 CS 6140: Machine Learning, Spring 2017. Instructor: Lu Wang, College of Computer and Information Science, Northeastern University. Webpage:

2 Assignment 3 is due on 3/30. 4/13: course project presentation. 4/20: final exam.

3 What we learned: Sequence labeling models: Hidden Markov Models, Maximum-entropy Markov model, Conditional Random Fields

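4 Sample Markov Model for POS [Figure: state-transition diagram with states start, Det, Noun, PropNoun, Verb, stop and transition probabilities (e.g. 0.1, 0.95) on the edges]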

5 The Markov Assumption

6 Hidden Markov Models (HMMs) Words Part-of-Speech tags

7 Formally

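8 Viterbi Backtrace [Figure: trellis with states s_0, s_1, s_2, ..., s_N, s_F on the vertical axis and time steps t_1, t_2, t_3, ..., t_{T-1}, t_T on the horizontal axis] Most likely sequence: s_0 s_N s_1 s_2 s_2 s_F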

9 Log-Linear Models

11 Using Log-Linear Models

12 Conditional Random Fields (CRFs)

13 Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation [Some slides are borrowed from Christopher Bishop and David Sontag]

22 Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation


23 K-means Algorithm. Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype (mean). Initialize prototypes, then iterate between two phases. Step 1: assign each data point to the nearest prototype. Step 2: update prototypes to be the cluster means. Simplest version is based on Euclidean distance.
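The two phases can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture: the function name `kmeans`, the optional `init` argument, and the default of initializing from randomly chosen data points are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, K, init=None, n_iters=100, seed=0):
    """Plain K-means with squared Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumed initialization: K distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    else:
        centers = np.asarray(init, dtype=float).copy()
    for _ in range(n_iters):
        # Step 1: assign each data point to its nearest prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: move each prototype to the mean of its assigned points.
        new_centers = centers.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # leave an empty cluster's prototype in place
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # prototypes stopped moving, so assignments are stable
        centers = new_centers
    return centers, labels

# Two small, well-separated groups of points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, labels = kmeans(X, K=2, init=X[[0, 1]])
```

The result depends on the initial prototypes: K-means only finds a local optimum, so in practice it is usually run from several initializations.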

24 BCS Summer School, Exeter, 2003 Christopher M. Bishop


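40 The Gaussian Distribution. Multivariate Gaussian $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$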

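41 Gaussian Mixtures. Linear superposition of Gaussians: $p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Normalization and positivity require $\sum_{k=1}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$. Can interpret the mixing coefficients $\pi_k$ as prior probabilities.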

42 Example: Mixture of 3 Gaussians

43 Contours of Probability Distribution

44 Sampling from the Gaussian Mixture. To generate a data point: first pick one of the components $k$ with probability $\pi_k$, then draw a sample from that component. Repeat these two steps for each new data point.
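The two-step sampling procedure can be sketched as follows. The function name `sample_gmm` and the three-component parameters are made up for illustration and are not the values behind the lecture's figures.

```python
import numpy as np

def sample_gmm(pis, mus, covs, n, seed=0):
    """Draw n points from a Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    pis = np.asarray(pis, dtype=float)
    # Step 1: pick a component for each point, with probability given by
    # the mixing coefficients.
    z = rng.choice(len(pis), size=n, p=pis)
    # Step 2: draw each point from the Gaussian of its chosen component.
    X = np.stack([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z

# Hypothetical mixture of 3 Gaussians in two dimensions.
pis = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([-4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])]
X, z = sample_gmm(pis, mus, covs, n=2000)
```

Keeping `z` gives the labelled data set; discarding it gives the unlabelled data set that the fitting problem starts from.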

45 Data Set

46 Data Set Without Labels

47 Fitting the Gaussian Mixture. We wish to invert this process: given the data set, find the corresponding parameters: mixing coefficients, means, covariances.

48 Fitting the Gaussian Mixture. We wish to invert this process: given the data set, find the corresponding parameters: mixing coefficients, means, covariances. If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster. Problem: the data set is unlabelled. We shall refer to the labels as latent (= hidden) variables.

49 Data Set Without Labels


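50 Posterior Probabilities. We can think of the mixing coefficients $\pi_k$ as prior probabilities for the components. For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities. These are given from Bayes' theorem by $\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$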
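A minimal NumPy sketch of the responsibility computation via Bayes' theorem: prior (mixing coefficient) times Gaussian likelihood, normalized over components. The helper names `gaussian_pdf` and `responsibilities` and the example parameters are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate Gaussian density evaluated at each row of X."""
    X = np.asarray(X, dtype=float)
    mu = np.asarray(mu, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = X.shape[1]
    diff = X - mu
    # Squared Mahalanobis distance of each point: diff^T cov^{-1} diff.
    maha = (diff * np.linalg.solve(cov, diff.T).T).sum(axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def responsibilities(X, pis, mus, covs):
    """Posterior probability of each component for each point, shape (N, K)."""
    weighted = np.stack(
        [pis[k] * gaussian_pdf(X, mus[k], covs[k]) for k in range(len(pis))],
        axis=1,
    )
    # Normalize over components so that each row sums to one.
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical two-component mixture with equal priors and unit covariances.
pis = [0.5, 0.5]
mus = [[0.0, 0.0], [4.0, 0.0]]
covs = [np.eye(2), np.eye(2)]
X = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 0.0]])
gamma = responsibilities(X, pis, mus, covs)
```

In this symmetric example the point halfway between the two means is shared equally by both components, while points sitting on a mean are assigned almost entirely to that component.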

51 Posterior Probabilities (colour coded)

55 Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation



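78 EM in General. Consider an arbitrary distribution $q(\mathbf{Z})$ over the latent variables (p is the true distribution). The following decomposition always holds: $\ln p(\mathbf{X} \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$, where $\mathcal{L}(q, \theta) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \theta)}{q(\mathbf{Z})}$ and $\mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \theta)}{q(\mathbf{Z})}$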
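The identity "log likelihood = lower bound + KL divergence" can be checked numerically on a toy problem with one discrete latent variable. The joint probabilities and the choice of q below are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical joint p(X = x_obs, Z = z) for the observed x_obs and z in {0, 1, 2}.
joint = np.array([0.10, 0.25, 0.05])
log_evidence = np.log(joint.sum())   # ln p(X): marginalize out Z
posterior = joint / joint.sum()      # p(Z | X)

def lower_bound(q, joint):
    """L(q) = sum_Z q(Z) ln( p(X, Z) / q(Z) )."""
    return float(np.sum(q * np.log(joint / q)))

def kl_divergence(q, p):
    """KL(q || p) = -sum_Z q(Z) ln( p(Z) / q(Z) )."""
    return float(-np.sum(q * np.log(p / q)))

q = np.array([0.6, 0.3, 0.1])        # an arbitrary distribution over Z
L = lower_bound(q, joint)
KL = kl_divergence(q, posterior)

# The decomposition holds for any valid q; the bound is tight when q is the posterior.
assert np.isclose(L + KL, log_evidence)
assert np.isclose(lower_bound(posterior, joint), log_evidence)
```

Because the KL term is never negative, L(q) is a lower bound on ln p(X) for every q, which is what makes it a useful objective for EM.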


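80 Optimizing the Bound. E-step: maximize the bound $\mathcal{L}(q, \theta)$ with respect to $q(\mathbf{Z})$; equivalent to minimizing the KL divergence; sets $q(\mathbf{Z})$ equal to the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \theta)$. M-step: maximize the bound with respect to $\theta$; equivalent to maximizing the expected complete-data log likelihood. Each EM cycle must increase the incomplete-data likelihood unless already at a (local) maximum.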
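As a concrete instance, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. The function names, the quantile-based initialization, and the synthetic data are assumptions for this example; there are no safeguards against collapsing components.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Univariate Gaussian density (broadcasts over its arguments)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K, n_iters=50):
    """EM for a 1-D Gaussian mixture (illustrative sketch)."""
    # Assumed initialization: equal weights, means at evenly spaced
    # quantiles of the data, and the overall data variance for every component.
    pis = np.full(K, 1.0 / K)
    mus = np.quantile(x, (np.arange(K) + 0.5) / K)
    variances = np.full(K, x.var())
    log_liks = []
    for _ in range(n_iters):
        # E-step: set q to the posterior, i.e. compute the responsibilities.
        weighted = pis * normal_pdf(x[:, None], mus, variances)  # shape (N, K)
        log_liks.append(np.log(weighted.sum(axis=1)).sum())      # incomplete-data log likelihood
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log likelihood,
        # which has closed-form weighted-average updates for a Gaussian mixture.
        Nk = resp.sum(axis=0)
        pis = Nk / len(x)
        mus = (resp * x[:, None]).sum(axis=0) / Nk
        variances = (resp * (x[:, None] - mus) ** 2).sum(axis=0) / Nk
    return pis, mus, variances, np.array(log_liks)

# Synthetic data: two well-separated clusters with the labels discarded.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
pis, mus, variances, log_liks = em_gmm_1d(x, K=2)
```

The recorded `log_liks` sequence should be non-decreasing up to floating-point error, matching the guarantee stated on the slide.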

81 E-step

82 M-step

83 Today's Outline: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation [Slides are based on David Blei's ICML 2012 tutorial]

93 Generative model for a document in LDA
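The body of this slide did not survive in the text, so the sketch below follows the standard LDA generative process rather than the slide itself: draw per-document topic proportions from a Dirichlet, then for each word draw a topic and then a word from that topic. The function name `generate_document`, the vocabulary size, and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def generate_document(alpha, topics, n_words, seed=0):
    """Sample one document from the standard LDA generative process (sketch).

    alpha  : Dirichlet parameter over the K topics, shape (K,)
    topics : per-topic word distributions, shape (K, V), rows sum to 1
    """
    rng = np.random.default_rng(seed)
    K, V = topics.shape
    # 1. Draw the document's topic proportions.
    theta = rng.dirichlet(alpha)
    # 2. For each word position, draw a topic assignment from theta.
    z = rng.choice(K, size=n_words, p=theta)
    # 3. Draw each word from the distribution of its assigned topic.
    words = np.array([rng.choice(V, p=topics[k]) for k in z])
    return theta, z, words

# Hypothetical setup: 3 topics over a vocabulary of 20 word ids.
rng = np.random.default_rng(1)
topics = rng.dirichlet(np.full(20, 0.5), size=3)
theta, z, words = generate_document(alpha=np.full(3, 0.5), topics=topics, n_words=50)
```

Unlike a mixture model, where a whole document comes from one component, each word here has its own topic assignment, so a document mixes several topics.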


109 Generative model for a document in LDA


112 Comparison of mixture and admixture models

113 Usage of LDA

114 EM for mixture models

115 EM for mixture models


119 What We Learned Today: Bayesian Networks, Mixture Models, Expectation Maximization, Latent Dirichlet Allocation

120 Homework Reading: Murphy , , More about EM: http://cs229.stanford.edu/notes/cs229-notes7b.pdf http://cs229.stanford.edu/notes/cs229-notes8.pdf More about LDA: http://menome.com/wp/wp-content/uploads/2014/12/Blei2011.pdf http://obphio.us/pdfs/lda_tutorial.pdf
