CS249: ADVANCED DATA MINING
Vector Data: Clustering: Part II
Instructor: Yizhou Sun (yzsun@cs.ucla.edu)
May 3, 2017

Methods to Learn (course roadmap, recapped from last lecture):
- Vector Data: Classification (Decision Tree; Naïve Bayes; Logistic Regression; SVM; NN); Clustering (K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means); Prediction (Linear Regression; GLM)
- Text Data: Clustering (PLSA; LDA); Feature Representation (Word embedding)
- Recommender System: Clustering (Matrix Factorization); Ranking (Collaborative Filtering)
- Graph & Network: Classification (Label Propagation); Clustering (SCAN; Spectral Clustering); Ranking (PageRank); Feature Representation (Network embedding)

Vector Data: Clustering: Part 2 (outline)
- Revisit K-means
- Kernel K-means
- Mixture Model and EM algorithm
- Summary

Recall K-Means. Objective function:
$$J = \sum_{j=1}^{k} \sum_{i: C(i)=j} \|x_i - c_j\|^2$$
(total within-cluster variance). Re-arrange the objective function:
$$J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2, \quad w_{ij} \in \{0,1\},$$
where $w_{ij} = 1$ if $x_i$ belongs to cluster $j$, and $w_{ij} = 0$ otherwise. We are looking for the best assignment $w_{ij}$ and the best centers $c_j$.

Solution of K-Means: iterate two steps on $J = \sum_{j=1}^{k} \sum_i w_{ij} \|x_i - c_j\|^2$.
Step 1: Fix the centers $c_j$ and find the assignment $w_{ij}$ that minimizes $J$: set $w_{ij} = 1$ for the $j$ with the smallest $\|x_i - c_j\|^2$.
Step 2: Fix the assignment $w_{ij}$ and find the centers that minimize $J$: set the first derivative to zero, $\partial J / \partial c_j = -2 \sum_i w_{ij}(x_i - c_j) = 0$, which gives $c_j = \sum_i w_{ij} x_i / \sum_i w_{ij}$. Note that $\sum_i w_{ij}$ is the number of objects in cluster $j$.
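As a concrete illustration of this two-step iteration, here is a minimal NumPy sketch (not the course's reference implementation; the function name, initialization, and convergence check are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate Step 1 (assignment) and Step 2 (center update)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initialize with random data points
    for _ in range(n_iter):
        # Step 1: fix centers, assign each point to its closest center (w_ij)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: fix assignments, move each center to the mean of its cluster
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # J can no longer decrease
            break
        centers = new_centers
    return labels, centers
```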

Converges! Why? Each step can only decrease $J$ (Step 1 picks the best assignment for the current centers, Step 2 picks the best centers for the current assignment), and $J$ is bounded below by 0, so K-means converges, although only to a local optimum.

Limitations of K-Means: K-means has problems when clusters have different sizes or densities, or non-spherical shapes.

Limitations of K-Means: Different Sizes and Density

Limitations of K-Means: Non-Spherical Shapes

Demo: http://webdocs.cs.ualberta.ca/~yaling/cluster/applet/code/cluster.html

Connections of K-means to Other Methods: K-means is closely related to kernel K-means and the Gaussian Mixture Model.

Vector Data: Clustering: Part 2 (outline): Revisit K-means; Kernel K-means; Mixture Model and EM algorithm; Summary

Kernel K-Means. How to cluster the following data? Use a non-linear map $\phi: \mathbb{R}^n \to F$ that maps each data point into a higher (possibly infinite) dimensional space, $x \mapsto \phi(x)$, and work with the dot-product (kernel) matrix $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.

Typical Kernel Functions (recall kernel SVM).
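The slide's kernel table is an image; as a rough substitute, the sketch below lists a few kernels that commonly appear in such tables (linear, polynomial, Gaussian/RBF). The hyperparameter defaults are illustrative assumptions, not values taken from the slide:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_matrix(X, kernel):
    """K[i, j] = <phi(x_i), phi(x_j)>, computed without ever forming phi explicitly."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```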

Solution of Kernel K-Means. Objective function in the new feature space:
$$J = \sum_{j=1}^{k} \sum_i w_{ij} \|\phi(x_i) - c_j\|^2$$
Algorithm: fixing the assignment $w_{ij}$, the centers are $c_j = \sum_i w_{ij} \phi(x_i) / \sum_i w_{ij}$. In the assignment step, assign each data point to the closest center:
$$d(x_i, c_j) = \Big\|\phi(x_i) - \frac{\sum_{i'} w_{i'j}\, \phi(x_{i'})}{\sum_{i'} w_{i'j}}\Big\|^2 = \phi(x_i) \cdot \phi(x_i) - \frac{2 \sum_{i'} w_{i'j}\, \phi(x_i) \cdot \phi(x_{i'})}{\sum_{i'} w_{i'j}} + \frac{\sum_{i'} \sum_{l} w_{i'j} w_{lj}\, \phi(x_{i'}) \cdot \phi(x_l)}{(\sum_{i'} w_{i'j})^2}$$
We never need to know $\phi(x)$ itself, only the kernel matrix $K_{ij}$.
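A minimal sketch of this assignment step using only a precomputed kernel matrix K (the three terms of $d(x_i, c_j)$ above); this is an illustrative implementation with my own random initialization, not the instructor's code:

```python
import numpy as np

def kernel_kmeans(K, k, n_iter=50, seed=0):
    """Kernel K-means on a precomputed n x n kernel matrix K."""
    n = len(K)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)  # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for j in range(k):
            members = labels == j
            nj = members.sum()
            if nj == 0:
                dist[:, j] = np.inf  # empty cluster: never chosen
                continue
            # d(x_i, c_j) = K_ii - 2 * sum_{l in C_j} K_il / n_j + sum_{l,m in C_j} K_lm / n_j^2
            dist[:, j] = (np.diag(K)
                          - 2 * K[:, members].sum(axis=1) / nj
                          + K[np.ix_(members, members)].sum() / nj ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

For example, `kernel_kmeans(kernel_matrix(X, rbf_kernel), k=2)` would cluster with the RBF kernel sketched above.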

Advantages and Disadvantages of Kernel K-Means
Advantages: the algorithm is able to identify non-linear cluster structures.
Disadvantages: the number of cluster centers needs to be predefined; the algorithm is more complex, and its time complexity is large.
References: "Kernel k-means and Spectral Clustering" by Max Welling; "Kernel k-means, Spectral Clustering and Normalized Cut" by Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis; "An Introduction to Kernel Methods" by Colin Campbell.

Vector Data: Clustering: Part 2 (outline): Revisit K-means; Kernel K-means; Mixture Model and EM algorithm; Summary

Fuzzy Set and Fuzzy Cluster. The clustering methods discussed so far assign every data object to exactly one cluster. Some applications need fuzzy or soft cluster assignment; for example, an e-game could belong to both entertainment and software. Methods: fuzzy clusters and probabilistic model-based clusters. A fuzzy cluster is a fuzzy set $S$ with membership function $F_S: X \to [0, 1]$ (a value between 0 and 1).

Mixture Model-Based Clustering. A set $C$ of $k$ probabilistic clusters $C_1, \ldots, C_k$ with probability density functions $f_1, \ldots, f_k$ and cluster prior probabilities $w_1, \ldots, w_k$, where $\sum_j w_j = 1$. The joint probability of an object $i$ and its cluster $C_j$ is $P(x_i, z_i = C_j) = w_j f_j(x_i)$, and the probability of $i$ is $P(x_i) = \sum_j w_j f_j(x_i)$.

Maximum Likelihood Estimation. Since objects are assumed to be generated independently, for a data set $D = \{x_1, \ldots, x_n\}$ we have:
$$P(D) = \prod_i P(x_i) = \prod_i \sum_j w_j f_j(x_i)$$
$$\log P(D) = \sum_i \log P(x_i) = \sum_i \log \sum_j w_j f_j(x_i)$$
Task: find a set $C$ of $k$ probabilistic clusters such that $P(D)$ is maximized.
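This log-likelihood objective is easy to evaluate for given parameters; below is a small hedged sketch for 1-d Gaussian components (the helper name and the use of SciPy are my choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood_1d(x, w, mu, sigma):
    """log P(D) = sum_i log sum_j w_j * f_j(x_i), with f_j = N(mu_j, sigma_j^2)."""
    comp = np.asarray(w) * norm.pdf(np.asarray(x)[:, None], loc=mu, scale=sigma)  # shape (n, k)
    return np.log(comp.sum(axis=1)).sum()
```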

The EM (Expectation Maximization) Algorithm: a framework for approaching maximum likelihood or maximum a posteriori estimates of the parameters of statistical models.
E-step: assign objects to clusters according to the current fuzzy clustering or the current parameters of the probabilistic clusters:
$$w_{ij}^{t+1} = p(z_i = j \mid \theta_j^t, x_i) \propto p(x_i \mid z_i = j, \theta_j^t)\, p(z_i = j)$$
M-step: find the new clustering or new parameters that maximize the expected likelihood with respect to the conditional distribution $p(z_i = j \mid \theta_j^t, x_i)$:
$$\theta^{t+1} = \arg\max_{\theta} \sum_i \sum_j w_{ij}^{t+1} \log L(x_i, z_i = j \mid \theta)$$

Gaussian Mixture Model: a generative model. For each object, pick its distribution component $Z \sim \mathrm{Multinomial}(w_1, \ldots, w_k)$, then sample a value from the selected distribution, $X \sim N(\mu_Z, \sigma_Z^2)$. The overall likelihood function is
$$L(D \mid \theta) = \prod_i \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2), \quad \text{s.t. } \sum_j w_j = 1 \text{ and } w_j \ge 0.$$
Q: What is $\theta$ here?
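A minimal sketch of this generative process (component choice, then a Gaussian draw); the parameter values in the usage line are made up for illustration:

```python
import numpy as np

def sample_gmm_1d(n, w, mu, sigma, seed=0):
    """Generate n points: Z ~ Multinomial(w), then X ~ N(mu_Z, sigma_Z^2)."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(w), size=n, p=w)                       # pick a component per object
    x = rng.normal(loc=np.asarray(mu)[z], scale=np.asarray(sigma)[z])
    return x, z

# Example with made-up parameters theta = (w, mu, sigma):
x, z = sample_gmm_1d(500, w=[0.3, 0.7], mu=[-2.0, 3.0], sigma=[1.0, 0.5])
```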

Estimating Parameters.
$$\log L(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2)$$
Taking the first derivative with respect to $\mu_j$:
$$\frac{\partial L}{\partial \mu_j} = \sum_i \frac{w_j}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j} = \sum_i \frac{w_j\, p(x_i \mid \mu_j, \sigma_j^2)}{\sum_{j'} w_{j'}\, p(x_i \mid \mu_{j'}, \sigma_{j'}^2)} \cdot \frac{1}{p(x_i \mid \mu_j, \sigma_j^2)} \cdot \frac{\partial p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j} = \sum_i w_{ij}\, \frac{\partial \log p(x_i \mid \mu_j, \sigma_j^2)}{\partial \mu_j}$$
where $w_{ij} = P(Z = j \mid X = x_i, \theta)$. This looks like weighted likelihood estimation, but the weight is itself determined by the parameters!

Apply the EM Algorithm: 1-d. An iterative algorithm (at iteration $t+1$):
E (expectation) step: evaluate the weights $w_{ij}$ when $\mu_j, \sigma_j, w_j$ are given:
$$w_{ij}^{t+1} = \frac{w_j^t\, p(x_i \mid \mu_j^t, (\sigma_j^2)^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, (\sigma_{j'}^2)^t)}$$
M (maximization) step: evaluate $\mu_j, \sigma_j, w_j$ when the $w_{ij}$'s are given, maximizing the weighted likelihood. This is equivalent to Gaussian parameter estimation in which each point carries a weight of belonging to each distribution:
$$\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1} x_i}{\sum_i w_{ij}^{t+1}}, \qquad (\sigma_j^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} (x_i - \mu_j^{t+1})^2}{\sum_i w_{ij}^{t+1}}, \qquad w_j^{t+1} \propto \sum_i w_{ij}^{t+1}$$
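These updates translate directly into code; here is a minimal NumPy sketch of EM for a 1-d GMM (the initialization choices and the lack of log-space computation are simplifications of my own):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    w = np.full(k, 1.0 / k)                        # mixing weights w_j
    mu = rng.choice(x, size=k, replace=False)      # initialize means from data points
    var = np.full(k, x.var())                      # initialize variances
    for _ in range(n_iter):
        # E-step: w_ij proportional to w_j * p(x_i | mu_j, var_j), normalized over j
        resp = w * gaussian_pdf(x[:, None], mu, var)       # shape (n, k)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        nj = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
        w = nj / n
    return w, mu, var, resp
```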

Example: 1-D GMM

2-d Gaussian. Bivariate Gaussian distribution for a two-dimensional random variable $X = (X_1, X_2)^T$:
$$X \sim N\!\left(\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma(X_1, X_2) \\ \sigma(X_1, X_2) & \sigma_2^2 \end{pmatrix}\right)$$
where $\mu_1$ and $\mu_2$ are the means of $X_1$ and $X_2$; $\sigma_1$ and $\sigma_2$ are the standard deviations of $X_1$ and $X_2$; and $\sigma(X_1, X_2)$ is the covariance between $X_1$ and $X_2$, i.e., $\sigma(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)]$.

Apply the EM Algorithm: 2-d. An iterative algorithm (at iteration $t+1$):
E (expectation) step: evaluate the weights $w_{ij}$ when $\mu_j, \Sigma_j, w_j$ are given:
$$w_{ij}^{t+1} = \frac{w_j^t\, p(x_i \mid \mu_j^t, \Sigma_j^t)}{\sum_{j'} w_{j'}^t\, p(x_i \mid \mu_{j'}^t, \Sigma_{j'}^t)}$$
M (maximization) step: evaluate $\mu_j, \Sigma_j, w_j$ when the $w_{ij}$'s are given, maximizing the weighted likelihood (again equivalent to Gaussian parameter estimation with weighted points):
$$\mu_j^{t+1} = \frac{\sum_i w_{ij}^{t+1} x_i}{\sum_i w_{ij}^{t+1}}, \qquad (\sigma_{j,1}^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} (x_{i,1} - \mu_{j,1}^{t+1})^2}{\sum_i w_{ij}^{t+1}}, \qquad (\sigma_{j,2}^2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} (x_{i,2} - \mu_{j,2}^{t+1})^2}{\sum_i w_{ij}^{t+1}},$$
$$\sigma_j(X_1, X_2)^{t+1} = \frac{\sum_i w_{ij}^{t+1} (x_{i,1} - \mu_{j,1}^{t+1})(x_{i,2} - \mu_{j,2}^{t+1})}{\sum_i w_{ij}^{t+1}}, \qquad w_j^{t+1} \propto \sum_i w_{ij}^{t+1}$$
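In code, the only change from the 1-d version is that the M-step estimates a weighted covariance matrix per component; its 2x2 entries are exactly the $\sigma_{j,1}^2$, $\sigma_{j,2}^2$, and $\sigma_j(X_1, X_2)$ updates above. A minimal sketch of that update, written generically for d dimensions (my own helper, not from the slides):

```python
import numpy as np

def m_step_covariances(X, resp, mu_new):
    """Sigma_j = sum_i w_ij (x_i - mu_j)(x_i - mu_j)^T / sum_i w_ij, per component j."""
    n, d = X.shape
    k = resp.shape[1]
    sigmas = np.zeros((k, d, d))
    for j in range(k):
        diff = X - mu_new[j]                                  # shape (n, d)
        weighted_outer = resp[:, j, None, None] * diff[:, :, None] * diff[:, None, :]
        sigmas[j] = weighted_outer.sum(axis=0) / resp[:, j].sum()
    return sigmas
```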

K-Means: A Special Case of the Gaussian Mixture Model. When each Gaussian component has covariance matrix $\sigma^2 I$ (soft K-means):
$$w_{ij} \propto p(x_i \mid \mu_j, \sigma^2)\, w_j \propto \exp\!\left(-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}\right) w_j$$
(the exponent is a distance!). When $\sigma^2 \to 0$, the soft assignment becomes a hard assignment: $w_{ij} \to 1$ if $x_i$ is closest to $\mu_j$ (why?).
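A tiny numerical illustration of this limit (assuming equal priors $w_j$ for simplicity; the example points and values are made up):

```python
import numpy as np

def soft_assignment(x, mus, sigma2):
    """w_ij proportional to exp(-||x - mu_j||^2 / (2 sigma^2)); becomes one-hot as sigma^2 -> 0."""
    d2 = np.array([np.sum((x - m) ** 2) for m in mus])
    logits = -d2 / (2 * sigma2)
    logits -= logits.max()                # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
x = np.array([1.0, 0.0])
print(soft_assignment(x, mus, sigma2=1.0))    # soft weights over the two components
print(soft_assignment(x, mus, sigma2=1e-4))   # effectively a hard assignment to the closest center
```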

Why Does EM Work? E-step: compute a tight lower bound $L$ of the original objective function $\ell$ at $\theta^{old}$. M-step: find $\theta^{new}$ that maximizes the lower bound. Then
$$\ell(\theta^{new}) \ge L(\theta^{new}) \ge L(\theta^{old}) = \ell(\theta^{old}),$$
so the objective never decreases.

How to Find a Tight Lower Bound? The choice of $q(h)$ is the key to the tight lower bound we want to get. By Jensen's inequality,
$$\log p(d \mid \theta) = \log \sum_h q(h)\, \frac{p(d, h \mid \theta)}{q(h)} \ge \sum_h q(h) \log \frac{p(d, h \mid \theta)}{q(h)}.$$
When does "=" hold, so that the lower bound is tight? When $q(h) = p(h \mid d, \theta)$ (why?).

In the GMM Case:
$$\log L(D; \theta) = \sum_i \log \sum_j w_j\, p(x_i \mid \mu_j, \sigma_j^2) \ge \sum_i \sum_j w_{ij} \left( \log\big(w_j\, p(x_i \mid \mu_j, \sigma_j^2)\big) - \log w_{ij} \right)$$
Here $\log\big(w_j\, p(x_i \mid \mu_j, \sigma_j^2)\big) = \log L(x_i, z_i = j \mid \theta)$, and the $-\log w_{ij}$ term does not involve $\theta$, so it can be dropped when maximizing over $\theta$.

Advantages and Disadvantages of GMM
Strengths: mixture models are more general than partitioning, allowing clusters with different densities and sizes; clusters can be characterized by a small number of parameters; the results may satisfy the statistical assumptions of the generative model.
Weaknesses: converges only to a local optimum (overcome by running multiple times with random initialization); computationally expensive if the number of distributions is large or the data set contains very few observed data points; hard to estimate the number of clusters; can only deal with spherical clusters.

Vector Data: Clustering: Part 2 (outline): Revisit K-means; Kernel K-means; Mixture Model and EM algorithm; Summary

Summary
- Revisit k-means: objective function; derivation of the solution
- Kernel k-means: objective function; solution; connection to k-means
- Mixture models: Gaussian mixture model; multinomial mixture model; EM algorithm; connection to k-means