Unsupervised Learning
1 2018 EE448, Big Data Mining, Lecture 7: Unsupervised Learning. Weinan Zhang, Shanghai Jiao Tong University
2 ML Problem Setting. Unsupervised learning: first build and learn $p(x)$ and then infer the conditional dependence $p(x_t \mid x_i)$; each dimension of $x$ is treated equally. Supervised learning: directly learn the conditional dependence $p(x_t \mid x_i)$, where $x_t$ is the label to predict.
3 Definition of Unsupervised Learning. Given the training dataset $D = \{x_i\}_{i=1,2,\dots,N}$, let the machine learn the data's underlying patterns: latent variables $z \to x$; probability density function (p.d.f.) estimation $p(x)$; good data representation $\phi(x)$ (used for discrimination).
4 Uses of Unsupervised Learning: data structure discovery, data science; data compression; outlier detection; input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards); a theory of biological learning and perception. Slide credit: Maneesh Sahani
5 Content. Fundamentals of Unsupervised Learning: K-means clustering, principal component analysis. Probabilistic Unsupervised Learning: mixture of Gaussians, EM methods.
6 K-Means Clustering
7 K-Means Clustering
8 K-Means Clustering. Provide the number of desired clusters k. Randomly choose k instances as seeds, one per cluster, i.e. the initial centroid for each cluster. Iterate: assign each instance to the cluster with the closest centroid; re-estimate the centroid of each cluster. Stop when the clustering converges, or after a fixed number of iterations. Slide credit: Ray Mooney
9 K-Means Clustering: Centroid. Assume instances are real-valued vectors $x \in \mathbb{R}^d$. Clusters are based on centroids, the center of gravity or mean of the points in a cluster $C_k$: $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$. Slide credit: Ray Mooney
10 K-Means Clustering: Distance. Distance to a centroid $L(x; \mu_k)$: Euclidean distance (L2 norm) $L_2(x; \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{k,m})^2}$; Manhattan distance (L1 norm) $L_1(x; \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{k,m}|$; cosine distance $L_{\cos}(x; \mu_k) = 1 - \frac{x^\top \mu_k}{|x|\,|\mu_k|}$. Slide credit: Ray Mooney
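A minimal NumPy sketch of the procedure on slides 8-10, using the L2 distance for the assignment step; the function name kmeans and all variable names are illustrative, not from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: L2 distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: re-estimate each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, assign

# Example usage on toy 2-D data with two well-separated groups:
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, assign = kmeans(X, k=2)
```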
11 K-Means Example (K=2). Pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged! Slide credit: Ray Mooney
12 K-Means Time Complexity. Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors. Reassigning clusters: O(knd) distance computations. Computing centroids: each instance vector gets added once to some centroid, O(nd). Assume these two steps are each done once for I iterations: O(Iknd). Slide credit: Ray Mooney
13 K-Means Clustering Objective. The objective of K-means is to minimize the total sum of the squared distance from every point to its corresponding cluster centroid: $\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x, \mu_k)$, with $\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$. Finding the global optimum is NP-hard. The K-means algorithm is guaranteed to converge to a local optimum.
14 Seed Choice Results can vary based on random seed selection. Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings. Select good seeds using a heuristic or the results of another method.
15 Clustering Applications. Text mining: cluster documents for related search, cluster words for query suggestion. Recommender systems and advertising: cluster users for item/ad recommendation, cluster items for related item suggestion. Image search: cluster images for similar-image search and duplicate detection. Speech recognition or separation: cluster phonetic features.
16 Principal Component Analysis (PCA). An example of 2-dimensional data: $x_1$ is the piloting skill of a pilot, $x_2$ is how much he/she enjoys flying. Main components: $u_1$ is the intrinsic piloting karma of a person, $u_2$ is some noise. Example credit: Andrew Ng
17 Principal Component Analysis (PCA). PCA tries to identify the subspace in which the data approximately lies. PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations. $\mathbb{R}^d \to \mathbb{R}^k$, $k \leq d$.
18 PCA Data Preprocessing. Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$, typically we first pre-process the data to normalize its mean and variance. 1. Move the center of the data set to 0: $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$, then $x^{(i)} \leftarrow x^{(i)} - \mu$. 2. Unify the variance of each variable: $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)})^2$, then $x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$.
19 PCA Data Preprocessing. Zero out the mean of the data. Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same scale.
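A minimal sketch of the preprocessing on slides 18-19, assuming the data sit in a NumPy array X with one instance per row; the helper name preprocess is illustrative.

```python
import numpy as np

def preprocess(X, eps=1e-12):
    """Zero out the mean and rescale each coordinate to unit variance."""
    X = X - X.mean(axis=0)     # step 1: move the center of the data to 0
    sigma = X.std(axis=0)      # per-coordinate standard deviation
    return X / (sigma + eps)   # step 2: unify the variance of each variable
```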
20 PCA Solution. PCA finds the directions with the largest data variance, which correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues.
21 PCA Solution: Data Projection. The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$. The variance of the projection is $\frac{1}{m} \sum_{i=1}^{m} (x^{(i)\top} u)^2 = \frac{1}{m} \sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u = u^\top \Sigma u$.
22 PCA Solution: Largest Eigenvalues. $\max_u u^\top \Sigma u$ s.t. $\|u\| = 1$, where $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top}$. Finding the k principal components of the data amounts to finding the k principal eigenvectors of $\Sigma$, i.e. the top-k eigenvectors with the largest eigenvalues. The projected vector for $x^{(i)}$ is $y^{(i)} = [u_1^\top x^{(i)}, u_2^\top x^{(i)}, \dots, u_k^\top x^{(i)}]^\top \in \mathbb{R}^k$.
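A possible NumPy sketch of slides 20-22: eigendecompose the empirical covariance matrix and keep the top-k eigenvectors. The function and variable names are mine, not the lecture's, and the data are assumed to be preprocessed as above.

```python
import numpy as np

def pca(X, k):
    """Project preprocessed data X (m, d) onto its top-k principal components."""
    m = len(X)
    Sigma = (X.T @ X) / m                 # covariance matrix (data assumed zero-mean)
    w, U = np.linalg.eigh(Sigma)          # eigenvalues in ascending order, eigenvectors as columns
    top = U[:, np.argsort(w)[::-1][:k]]   # top-k eigenvectors with the largest eigenvalues
    return X @ top                        # y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
```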
23 Eigendecomposition Revisited. For a positive semi-definite square matrix $\Sigma \in \mathbb{R}^{d \times d}$, suppose $u$ (with $\|u\| = 1$) is an eigenvector with scalar eigenvalue $w$: $\Sigma u = w u$. There are d eigenvector-eigenvalue pairs $(u_i, w_i)$. These d eigenvectors are orthogonal, thus they form an orthonormal basis: $\sum_{i=1}^{d} u_i u_i^\top = I$. Thus any vector $v$ can be written as $v = \Big( \sum_{i=1}^{d} u_i u_i^\top \Big) v = \sum_{i=1}^{d} (u_i^\top v) u_i = \sum_{i=1}^{d} v_{(i)} u_i$. $\Sigma$ can be written as $\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top$, where $U = [u_1, u_2, \dots, u_d]$ and $W = \mathrm{diag}(w_1, w_2, \dots, w_d)$.
24 Eigendecomposition Revisited. Given the data $X = [x_1^\top; x_2^\top; \dots; x_n^\top]$ (one instance per row) and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor $\frac{1}{m}$ for simplicity), the variance in direction $u_i$ is $\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$. The variance in any direction $v$ is $\|X v\|^2 = \Big\| X \sum_i v_{(i)} u_i \Big\|^2 = \sum_{ij} v_{(i)}\, u_i^\top \Sigma u_j\, v_{(j)} = \sum_{i=1}^{d} v_{(i)}^2 w_i$, where $v_{(i)}$ is the projection length of $v$ on $u_i$. If $v^\top v = 1$, then $\arg\max_{\|v\|=1} \|X v\|^2 = u_{(\max)}$: the direction of greatest variance is the eigenvector with the largest eigenvalue.
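A quick numerical sanity check of the two claims above, assuming NumPy (this sketch is mine, not part of the lecture): the variance of the data projected onto eigenvector $u_i$ equals its eigenvalue $w_i$, and no unit direction exceeds the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ np.diag([3.0, 1.0, 0.3])  # anisotropic toy data
Sigma = X.T @ X                                               # dropping 1/m as on the slide
w, U = np.linalg.eigh(Sigma)                                  # eigenvalues in ascending order

# Variance in the direction of eigenvector u_i equals the eigenvalue w_i.
for i in range(3):
    u = U[:, i]
    print(np.linalg.norm(X @ u) ** 2, "vs", w[i])

# Any unit direction v has variance at most the largest eigenvalue.
v = rng.standard_normal(3)
v /= np.linalg.norm(v)
assert np.linalg.norm(X @ v) ** 2 <= w[-1] + 1e-6
```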
25 PCA Discussion PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by them.
26 PCA Visualization
27 PCA Visualization
28 Content. Fundamentals of Unsupervised Learning: K-means clustering, principal component analysis. Probabilistic Unsupervised Learning: mixture of Gaussians, EM methods.
29 Mixture of Gaussians
30 Mixture of Gaussians
31 Graphical Model for Mixture of Gaussians. Given a training set $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, model the data by specifying a joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$, where $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ with $p(z^{(i)} = j) = \phi_j$ ($\phi$ are the parameters of the latent variable distribution), and $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$. The latent variable $z$ is the Gaussian cluster ID, indicating which Gaussian each observed data point $x$ comes from.
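To make the generative story concrete, here is a small sketch (assuming NumPy; function and variable names are illustrative) that samples from this model: first draw the cluster ID $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, then draw $x^{(i)}$ from the chosen Gaussian.

```python
import numpy as np

def sample_gmm(m, phi, mus, Sigmas, seed=0):
    """Draw m points from a mixture of Gaussians with mixing weights phi."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(phi), size=m, p=phi)                    # z^(i) ~ Multinomial(phi)
    x = np.array([rng.multivariate_normal(mus[j], Sigmas[j])   # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
                  for j in z])
    return x, z

# Example usage: two 2-D Gaussians with weights 0.3 and 0.7.
phi = [0.3, 0.7]
mus = [np.zeros(2), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
X, z = sample_gmm(500, phi, mus, Sigmas)
```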
32 Data Likelihood. We want to maximize $\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)\, \phi_j$. There is no closed-form solution from simply setting $\partial \ell(\phi, \mu, \Sigma) / \partial(\phi, \mu, \Sigma) = 0$.
33 Data Likelihood Maximization. For each data point $x^{(i)}$, the latent variable $z^{(i)}$ indicates which Gaussian it comes from. If we knew $z^{(i)}$, the data likelihood would be $\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \big]$.
34 Data Likelihood Maximization. Given $z^{(i)}$, maximize the data likelihood $\max_{\phi,\mu,\Sigma} \ell(\phi, \mu, \Sigma) = \max_{\phi,\mu,\Sigma} \sum_{i=1}^{m} \big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \big]$. It is easy to get the solution: $\phi_j = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}$, $\mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}$, $\Sigma_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}$.
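If the $z^{(i)}$ were observed, the closed-form solution above maps directly to code. A minimal NumPy sketch, assuming z is an array of integer labels in 0..k-1 and every cluster contains several points (the function name is illustrative):

```python
import numpy as np

def mle_known_z(X, z, k):
    """Closed-form MLE of (phi, mu, Sigma) when the Gaussian ID of each point is known."""
    phi = np.array([(z == j).mean() for j in range(k)])                   # phi_j: fraction of points in cluster j
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])             # mu_j: mean of cluster j
    Sigma = np.array([np.cov(X[z == j].T, bias=True) for j in range(k)])  # Sigma_j: covariance with 1/N_j normalization
    return phi, mu, Sigma
```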
35 Latent Variable Inference. Given the parameters $\mu, \Sigma, \phi$, it is not hard to infer the posterior of the latent variable $z^{(i)}$ for each instance: $p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}$, where the prior of $z^{(i)}$ is $p(z^{(i)} = j; \phi)$ and the likelihood is $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$.
36 Expectation Maximization Methods. E-step: infer the posterior distribution of the latent variables given the model parameters. M-step: tune the parameters to maximize the data likelihood given the latent variable distribution. EM methods iteratively execute the E-step and M-step until convergence.
37 EM Methods for Mixture of Gaussians. Repeat until convergence: (E-step) For each i, j, set $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$. (M-step) Update the parameters: $\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}$, $\mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}$, $\Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$.
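Putting the E-step and M-step together, a compact sketch of EM for a mixture of Gaussians, assuming NumPy and SciPy's multivariate_normal density; the initialization, the small ridge term on the covariances, and all names are illustrative choices, not from the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities w, M-step updates phi, mu, Sigma."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]     # initialize means at random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma) via Bayes' rule.
        dens = np.column_stack([multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)])
        w = dens * phi
        w /= w.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted versions of the closed-form estimates.
        Nj = w.sum(axis=0)                           # effective number of points per component
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        Sigma = np.array([
            ((w[:, j, None] * (X - mu[j])).T @ (X - mu[j])) / Nj[j] + 1e-6 * np.eye(d)
            for j in range(k)
        ])
    return phi, mu, Sigma, w
```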
38 General EM Methods. Claims: 1. After each E-M step, the data likelihood will not decrease. 2. The EM algorithm finds a (local) maximum of a latent variable model's likelihood. Now let's discuss the general EM method and verify that it improves the data likelihood and that it converges.
39 Jensen's Inequality. Theorem. Let f be a convex function, and let X be a random variable. Then: $E[f(X)] \geq f(E[X])$. Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
40 Jensen's Inequality. $E[f(X)] \geq f(E[X])$. Figure credit: Andrew Ng
41 Jensen's Inequality. Figure credit: Maneesh Sahani
42 General EM Methods: Problem. Given the training dataset $D = \{x_i\}_{i=1,2,\dots,N}$, let the machine learn the data's underlying patterns. Assume latent variables $z \to x$. We wish to fit the parameters of a model $p(x, z)$ to the data, where the log-likelihood is $\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z; \theta)$.
43 General EM Methods: Problems. EM methods solve the problems where explicitly finding the maximum likelihood estimate (MLE) is hard: $\theta = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$. But given $z^{(i)}$ observed, the MLE is easy: $\theta = \arg\max_\theta \sum_{i=1}^{N} \log p(x^{(i)} \mid z^{(i)}; \theta)$. EM methods give an efficient solution for the MLE by iteratively doing: E-step: construct a (good) lower bound of the log-likelihood; M-step: optimize that lower bound.
44 General EM Methods: Lower Bound. For each instance i, let $q_i$ be some distribution over $z^{(i)}$: $\sum_z q_i(z) = 1$, $q_i(z) \geq 0$. Thus the data log-likelihood is $\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$, by Jensen's inequality ($-\log(x)$ is a convex function). This gives a lower bound of $\ell(\theta)$.
45 General EM Methods: Lower Bound. $\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$. Then what $q_i(z)$ should we choose?
46 REVIEW: Jensen's Inequality. Theorem. Let f be a convex function, and let X be a random variable. Then: $E[f(X)] \geq f(E[X])$. Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
47 General EM Methods: Lower Bound. $\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$. Then what $q_i(z)$ should we choose? In order to make the above inequality tight (to hold with equality), it is sufficient that $p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$ for some constant c. We can derive $\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)})\, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$. As such, $q_i(z)$ is written as the posterior distribution $q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$.
48 General EM Methods. Repeat until convergence: (E-step) For each i, set $q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$. (M-step) Update the parameters $\theta = \arg\max_\theta \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$.
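Because $q_i$ is held fixed during the M-step, the $-\log q_i(z^{(i)})$ term does not depend on $\theta$; a standard rewriting (not shown explicitly on the slide) is that the M-step is equivalent to maximizing the expected complete-data log-likelihood:

```latex
\theta = \arg\max_\theta \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}
       = \arg\max_\theta \sum_{i=1}^{N} \mathbb{E}_{z^{(i)} \sim q_i}\!\big[ \log p(x^{(i)}, z^{(i)}; \theta) \big]
```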
49 Convergence of EM. Denote $\theta^{(t)}$ and $\theta^{(t+1)}$ as the parameters of two successive iterations of EM; we prove that $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$, which shows EM always monotonically improves the log-likelihood and thus ensures EM will at least converge to a local optimum.
50 Proof of EM Convergence. Starting from $\theta^{(t)}$, we choose the posterior of the latent variable $q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. This choice ensures Jensen's inequality holds with equality: $\ell(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$. The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right-hand side of the above equation. Thus $\ell(\theta^{(t+1)}) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})}$ [lower bound] $\geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$ [parameter optimization] $= \ell(\theta^{(t)})$.
51 Remark on EM Convergence. If we define $J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$, then we know $\ell(\theta) \geq J(q, \theta)$. EM can also be viewed as coordinate ascent on J: the E-step maximizes it w.r.t. q, and the M-step maximizes it w.r.t. $\theta$.
52 Coordinate Ascent in EM. E-step: $q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. M-step: $\theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$. Figure credit: Maneesh Sahani