Data Mining Techniques

1 Data Mining Techniques CS Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.)

2 Project

3 Project Deadlines
  3 Feb: Form teams of 2-4 people
  17 Feb: Submit abstract (1 paragraph)
  3 Mar: Submit proposals (2 pages)
  13 Mar: Milestone 1 (exploratory analysis)
  31 Mar: Milestone 2 (statistical analysis)
  14 Apr: Submit reports (10 pages)
  21 Apr: Submit peer reviews

4 Project Proposal
  Length: 2 pages
  Should describe
    The dataset that you will be looking at
    The algorithms that you intend to use
    Project milestones
      Milestone 1 recommendation: Exploratory analysis
      Milestone 2 recommendation: Data mining analysis

5 Evaluation of Clustering

6 Clusters in Random Data
  [Figure: four x-y scatter plots with panels Random Points, DBSCAN, K-means, Complete Link]

7 Clustering Criteria
  Internal Quality Criteria (measure compactness of clusters)
    Sum of Squared Error (SSE)
    Scatter Criteria
  External Quality Criteria
    Precision-Recall Measure
    Mutual Information

8 Scatter Criteria (Internal)
  Let $x = (x_1, \ldots, x_d)^T$ and let $C_1, \ldots, C_K$ be a clustering of $\{x_1, \ldots, x_N\}$. Define
  Size of each cluster: $N_i = |C_i|$, $i = 1, 2, \ldots, K$
  Mean for each cluster: $\mu_i = \frac{1}{N_i} \sum_{x \in C_i} x$, $i = 1, 2, \ldots, K$
  Total mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ or $\mu = \frac{1}{N} \sum_{i=1}^{K} N_i \mu_i$

9 Scatter Criteria (Internal)
  Scatter matrix for the $i$th cluster: $S_i = \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T$ (outer product)
  Within-cluster scatter matrix: $S_W = \sum_{i=1}^{K} S_i$
  Between-cluster scatter matrix: $S_B = \sum_{i=1}^{K} N_i (\mu_i - \mu)(\mu_i - \mu)^T$ (outer product)

10 Scatter Criteria (Internal)
  The trace criteria use the trace of a matrix, the sum of its diagonal elements. A good partition of the data should have:
  Low $\mathrm{tr}(S_W)$: similar to minimizing SSE
  High $\mathrm{tr}(S_B)$
  High $\mathrm{tr}(S_B) / \mathrm{tr}(S_W)$
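
The scatter criteria can be sketched in plain Python as follows. This is a minimal illustration, assuming each cluster is given as a list of d-dimensional points; the helper names (`mean`, `scatter`, `trace_criteria`) and the toy data are hypothetical and not from the lecture.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def scatter(points, center, weight=1.0):
    """Sum over points of weight * (x - center)(x - center)^T."""
    d = len(center)
    S = [[0.0] * d for _ in range(d)]
    for x in points:
        diff = [x[j] - center[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                S[a][b] += weight * diff[a] * diff[b]
    return S

def trace(S):
    """Sum of the diagonal elements of a square matrix."""
    return sum(S[j][j] for j in range(len(S)))

def trace_criteria(clusters):
    """Return (tr(S_W), tr(S_B)) for a list of clusters."""
    all_points = [x for c in clusters for x in c]
    mu = mean(all_points)                             # total mean
    tr_w = 0.0
    tr_b = 0.0
    for c in clusters:
        mu_i = mean(c)                                # cluster mean
        tr_w += trace(scatter(c, mu_i))               # adds tr(S_i)
        tr_b += trace(scatter([mu_i], mu, len(c)))    # adds N_i * |mu_i - mu|^2
    return tr_w, tr_b

# Toy data (hypothetical): two tight clusters, far apart along the first axis.
clusters = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 0.0], [4.0, 2.0]]]
tr_w, tr_b = trace_criteria(clusters)
tr_total = trace(scatter([x for c in clusters for x in c],
                         mean([x for c in clusters for x in c])))
print(tr_w, tr_b, tr_b / tr_w)
```

On this toy data the within-cluster and between-cluster traces add up to the trace of the total scatter about the overall mean, which is why lowering tr(S_W) and raising tr(S_B) are two views of the same goal.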

11 Mutual Information (External)

12 Mutual Information (External) Uncorrelated Variables

14 Mutual Information (External) Perfectly Correlated Variables

20 Mutual Information (External)
  $y_n$: True class label for example $n$
  $z_n$: Clustering label for example $n$

22 Mutual Information (External)

23 Mutual Information (External) What happens to I(Y;Z) if we swap cluster labels?

25 Mutual Information (External) Mutual Information is invariant under label permutations
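
A small sketch of the invariance claim, assuming the usual empirical definition I(Y;Z) = sum over (y, z) of p(y, z) log [p(y, z) / (p(y) p(z))] with probabilities estimated from label counts. The function name `mutual_information` and the label vectors are illustrative assumptions.

```python
from collections import Counter
from math import log

def mutual_information(y, z):
    """Empirical I(Y;Z) in nats between two label sequences of equal length."""
    n = len(y)
    joint = Counter(zip(y, z))    # counts of (class, cluster) pairs
    count_y = Counter(y)          # counts of class labels
    count_z = Counter(z)          # counts of cluster labels
    return sum(
        (c / n) * log(c * n / (count_y[a] * count_z[b]))
        for (a, b), c in joint.items()
    )

y = [0, 0, 0, 1, 1, 1]            # true class labels y_n (hypothetical)
z = [2, 2, 1, 1, 1, 1]            # clustering labels z_n (hypothetical)
swap = {1: 2, 2: 1}               # relabel cluster 1 as 2 and vice versa
z_swapped = [swap[k] for k in z]

mi = mutual_information(y, z)
mi_swapped = mutual_information(y, z_swapped)
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
mi_perfect = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
print(mi, mi_swapped, mi_independent, mi_perfect)
```

Swapping cluster names only reorders the terms of the sum, so `mi` and `mi_swapped` agree. The last two calls reproduce the two extreme cases from the earlier slides: independent labels and perfectly dependent labels.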

26 Mixture Models

27 Review: K-means Clustering
  [Figure: data points with cluster centers μ1, μ2, μ3]
  Objective: Sum of Squares
  $\mathrm{SSE} = \sum_{k=1}^{K} \sum_{n=1}^{N} I[z_n = k] \, \|x_n - \mu_k\|^2$
  $z_n$: Assignment for point $n$
  $\mu_k$: Center for cluster $k$
  Alternate between two steps
  1. Update assignments
  2. Update centers
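
The two alternating steps can be sketched as below. This is a minimal plain-Python version under stated assumptions: initial centers are supplied by the caller, an empty cluster keeps its previous center, and iteration stops once assignments no longer change. The names `sq_dist` and `kmeans` and the data are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(X, centers, n_iter=100):
    """Return (assignments z, centers, SSE) after alternating the two steps."""
    centers = [list(c) for c in centers]
    z = None
    for _ in range(n_iter):
        # 1. Update assignments: each point goes to its nearest center.
        new_z = [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
                 for x in X]
        if new_z == z:
            break                                  # assignments are stable
        z = new_z
        # 2. Update centers: each center becomes the mean of its points.
        for k in range(len(centers)):
            members = [x for x, zn in zip(X, z) if zn == k]
            if members:                            # assumption: keep old center if empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(x, centers[zn]) for x, zn in zip(X, z))
    return z, centers, sse

X = [[0, 0], [0, 1], [5, 5], [5, 6]]               # toy data (hypothetical)
z, centers, sse = kmeans(X, centers=[[0, 0], [5, 5]])
print(z, centers, sse)
```

Neither step can increase the SSE, so the loop terminates, though the result depends on the initial centers.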

28 Review: Regression
  Objective: Sum of Squares
  Probabilistic Interpretation:
  $y_n = x_n^\top w + \epsilon_n$, $\epsilon_n \sim \mathrm{Norm}(0, \sigma^2)$
  $\log p(y \mid w) = -E(w) + \text{const.}$

29 K-means: Probabilistic Generalization
  Generative Model
  $z_n \sim \mathrm{Discrete}(\pi)$
  $x_n \mid z_n = k \sim \mathrm{Norm}(\mu_k, \Sigma_k)$
  Questions
  1. What is $\log p(X, z \mid \mu, \Sigma, \pi)$?
  2. For what choice of $\pi$ and $\Sigma$ do we recover K-means?
  Same as K-means when: $\pi_k = 1/K$, $\Sigma_k = \sigma^2 I$
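
One way to work through the two questions, written as a sketch rather than the lecture's own derivation. It assumes $d$-dimensional data and reuses the SSE from the K-means review; note that $\pi$ inside $\log(2\pi\sigma^2)$ is the numerical constant, not the mixture weights.

```latex
\begin{align*}
\log p(X, z \mid \mu, \Sigma, \pi)
  &= \sum_{n=1}^{N} \sum_{k=1}^{K} I[z_n = k]
     \left( \log \pi_k + \log \mathrm{Norm}(x_n \mid \mu_k, \Sigma_k) \right) \\
\intertext{and, substituting $\pi_k = 1/K$ and $\Sigma_k = \sigma^2 I$,}
  &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{K} I[z_n = k] \, \|x_n - \mu_k\|^2
     - N \log K - \frac{N d}{2} \log (2 \pi \sigma^2) \\
  &= -\frac{1}{2\sigma^2} \, \mathrm{SSE} + \text{const.}
\end{align*}
```

With $\sigma^2$ held fixed, maximizing this log joint over $z$ and $\mu$ is therefore the same as minimizing the K-means objective.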

30 Gaussian K-means
  Algorithm
  Initialize parameters $\theta$
  Repeat until convergence
  1. Update cluster assignments
  2. Update parameters
  $\theta := \{\mu_{1:K}, \Sigma_{1:K}, \pi\}$

31 Gaussian K-means
  Assignment Update
  [equation image]
  Parameter Updates
  $z_{nk} := I[z_n = k]$
  $N_k := \sum_{n=1}^{N} z_{nk}$
  $\pi = (N_1/N, \ldots, N_K/N)$
  $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$
  $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$
  How can we deal with overlapping clusters in a better way?

32 Gaussian K-means
  Assignment Update and Parameter Updates as on the previous slide
  Idea: Replace hard assignments with soft assignments

33 Gaussian Soft K-means
  Idea: Replace hard assignments with soft assignments
  Soft Assignment Update
  [equation image]
  Parameter Updates
  $N_k := \sum_{n=1}^{N} \gamma_{nk}$
  $\pi = (N_1/N, \ldots, N_K/N)$
  $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n$
  $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$
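
A minimal sketch of the soft-assignment loop, simplified to one-dimensional data so that each covariance is a scalar variance. The soft assignment used here, gamma_nk proportional to pi_k * Norm(x_n | mu_k, var_k), is the standard responsibility formula and is an assumption, since the slide's own equation is not in the text. The function names, the variance floor, and the data are likewise hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2.0 * pi * var)

def em_gmm_1d(xs, mus, variances, weights, n_iter=50):
    """Soft K-means / EM for a 1D Gaussian mixture (sketch)."""
    K, N = len(mus), len(xs)
    mus, variances, weights = list(mus), list(variances), list(weights)
    for _ in range(n_iter):
        # Soft assignment update: gamma[n][k] sums to 1 over k for each point.
        gamma = []
        for x in xs:
            w = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(w)
            gamma.append([wk / total for wk in w])
        # Parameter updates, with gamma in place of the hard indicators z_nk.
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            weights[k] = Nk / N
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / Nk
            var = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, xs)) / Nk
            variances[k] = max(var, 1e-6)      # assumption: floor to avoid collapse
    return mus, variances, weights, gamma

xs = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]         # toy data (hypothetical)
mus, variances, weights, gamma = em_gmm_1d(
    xs, mus=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, variances, weights)
```

For well-separated data like this the responsibilities become nearly 0 or 1 and the result matches hard assignment; the two methods differ when clusters overlap.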

34 EM for Gaussian Mixtures Credit: Andrew Moore

41 Expectation Maximization

42 Maximum Likelihood Estimation
  Supervised (e.g. regression)
  $w^* = \arg\max_w \log p(y \mid X, w) = \arg\max_w \sum_{n=1}^{N} \log p(y_n \mid x_n, w)$
  Solve for zero gradient to find maximum
  Unsupervised (e.g. GMM)
  $\theta^* = \arg\max_\theta \log \sum_{z} p(X, z \mid \theta) = \arg\max_\theta \sum_{n=1}^{N} \log \sum_{z_n=1}^{K} p(x_n, z_n \mid \theta)$
  Not so easy here, because of sum inside logarithm

45 Lower Bound on Log Likelihood (multiplication by 1) (multiplication by 1) (Bayes rule)

47 Lower Bound on Log Likelihood

48 Lower Bound on Log Likelihood Claim:
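
A sketch of the standard decomposition that this sequence of slides builds toward, written with the $q$, $\theta$, $\mathcal{L}$, and KL notation of the Generalized EM slides. The intermediate steps and their pairing with the slide annotations are an assumption; $q(z)$ is any distribution over the latent variables, and $\log$ is the natural logarithm.

```latex
\begin{align*}
\log p(X \mid \theta)
  &= \sum_{z} q(z) \log p(X \mid \theta)
     && \text{(multiplication by 1: } \textstyle\sum_z q(z) = 1\text{)} \\
  &= \sum_{z} q(z) \log \left[ p(X \mid \theta) \, \frac{q(z)}{q(z)} \right]
     && \text{(multiplication by 1)} \\
  &= \sum_{z} q(z) \log \left[ \frac{p(X, z \mid \theta)}{p(z \mid X, \theta)} \, \frac{q(z)}{q(z)} \right]
     && \text{(Bayes rule)} \\
  &= \underbrace{\sum_{z} q(z) \log \frac{p(X, z \mid \theta)}{q(z)}}_{\mathcal{L}(q, \theta)}
   + \underbrace{\sum_{z} q(z) \log \frac{q(z)}{p(z \mid X, \theta)}}_{\mathrm{KL}(q \,\|\, p(z \mid X, \theta))}
  \;\ge\; \mathcal{L}(q, \theta)
\end{align*}
```

The final inequality uses only the non-negativity of the KL divergence, and it holds with equality exactly when $q(z) = p(z \mid X, \theta)$.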

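49 Intermezzo: KL Divergence
  KL Divergence
  Properties
  $\mathrm{KL}(q \,\|\, p) \ge 0$
  If $\mathrm{KL}(q \,\|\, p) = 0$, then $q = p$
  $\mathrm{KL}(q \,\|\, p) \ne \mathrm{KL}(p \,\|\, q)$ (in general)

50 Intermezzo: Information Theory
  KL Divergence
  $D(p \,\|\, q) = \sum_{x \in A} p(x) \log \frac{p(x)}{q(x)}$, where $A$ is the set of $x$ with $p(x) > 0$
  Properties: as on the previous slide
  Proof (Jensen's inequality)
  $-D(p \,\|\, q) = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \le \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in A} q(x) \le \log \sum_{x \in X} q(x) = \log 1 = 0$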
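
The three properties can be checked numerically on small discrete distributions. This is an illustrative sketch: the function name `kl` and the two distributions are assumptions, and the function requires p(x) > 0 wherever q(x) > 0.

```python
from math import log

def kl(q, p):
    """KL(q || p) in nats for discrete distributions given as lists.

    Terms with q(x) = 0 contribute nothing; p(x) must be positive
    wherever q(x) is positive.
    """
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]                 # hypothetical distributions over two outcomes
p = [0.9, 0.1]

kl_qp = kl(q, p)               # KL(q || p)
kl_pq = kl(p, q)               # KL(p || q)
kl_qq = kl(q, q)               # KL of a distribution with itself
print(kl_qp, kl_pq, kl_qq)
```

The two directions give different positive values, and the divergence of a distribution from itself is zero.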

51 Intermezzo: Information Theory KL Divergence Entropy Mutual Information

52 Lower Bound on Log Likelihood Claim:

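53 Generalized EM
  [Figure: $\ln p(X \mid \theta)$ shown as the sum of $\mathcal{L}(q, \theta)$ and $\mathrm{KL}(q \,\|\, p)$]
  1. Lower bound is sum over log, not log of sum

55 Generalized EM
  [Figure: $\ln p(X \mid \theta)$ shown as the sum of $\mathcal{L}(q, \theta)$ and $\mathrm{KL}(q \,\|\, p)$]
  2. Bound is tight when $q(z) = p(z \mid X, \theta)$

56 Generalized EM
  [Figure: labels $\mathcal{L}(q, \theta^{\text{old}})$ and $\ln p(X \mid \theta^{\text{old}})$]
  E-step: maximize with respect to $q(z)$

57 Generalized EM
  [Figure: labels $\mathrm{KL}(q \,\|\, p)$, $\mathcal{L}(q, \theta^{\text{new}})$, and $\ln p(X \mid \theta^{\text{new}})$]
  M-step: maximize with respect to $\theta$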

58 Gaussian Mixture Model
  Generative Model
  $z_n \sim \mathrm{Discrete}(\pi)$
  $x_n \mid z_n = k \sim \mathrm{Norm}(\mu_k, \Sigma_k)$
  Expectation Maximization
  Initialize $\theta$
  Repeat until convergence
  1. Expectation Step
  2. Maximization Step

59 GMM Advantages / Disadvantages
  [Figure 9.5: Example of 500 points drawn from the mixture of 3 Gaussians. (a) Samples from the joint distribution $p(z)p(x \mid z)$, in which the three states of $z$, corresponding to the three components of the mixture, are depicted in red, green, and blue. (b) The corresponding samples from the marginal distribution $p(x)$, obtained by ignoring the values of $z$ and plotting only the $x$ values. The data set in (a) is said to be complete, whereas that in (b) is incomplete. (c) The same samples, with colours representing the responsibilities $\gamma(z_{nk})$ associated with data point $x_n$, plotted using proportions of red, blue, and green ink given by $\gamma(z_{nk})$ for $k = 1, 2, 3$.]

60 GMM Advantages / Disadvantages
  + Works with overlapping clusters
  + Works with clusters of different densities
  + Same complexity as K-means
  - Can get stuck in local maximum
  - Need to set number of components

61 Model Selection Need to specify two components 1. Likelihood 2. Mixture distribution How do we know that we have made good choices?

62 Model Selection
  Strategy 1: Cross-validation
  Split data into K folds. For each fold k:
    Perform EM to learn $\theta$ from training set $X_{\text{train}}$
    Calculate test set likelihood $p(X_{\text{test}} \mid \theta)$
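
A sketch of this strategy for choosing the number of mixture components, again restricted to one-dimensional data. Everything specific here is an assumption made for illustration: the deterministic initialization at order statistics, the variance floor, the interleaved fold split, and the names `fit_gmm_1d` and `heldout_log_likelihood`. The number of folds is called `n_folds` to keep it distinct from the number of components `K`.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return exp(-0.5 * (x - mu) ** 2 / var) / sqrt(2.0 * pi * var)

def fit_gmm_1d(xs, K, n_iter=100):
    """EM for a 1D Gaussian mixture with K components (sketch)."""
    N = len(xs)
    order = sorted(xs)
    # Assumption: start the means at evenly spaced order statistics.
    mus = [order[(2 * k + 1) * N // (2 * K)] for k in range(K)]
    grand_mean = sum(xs) / N
    variances = [sum((x - grand_mean) ** 2 for x in xs) / N] * K
    weights = [1.0 / K] * K
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        gamma = []
        for x in xs:
            w = [weights[k] * normal_pdf(x, mus[k], variances[k]) for k in range(K)]
            total = sum(w)
            gamma.append([wk / total for wk in w])
        # M-step: re-estimate weights, means, and variances.
        for k in range(K):
            Nk = sum(g[k] for g in gamma)
            weights[k] = Nk / N
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / Nk
            var = sum(g[k] * (x - mus[k]) ** 2 for g, x in zip(gamma, xs)) / Nk
            variances[k] = max(var, 1e-3)      # assumption: variance floor
    return mus, variances, weights

def heldout_log_likelihood(xs, K, n_folds=3):
    """Sum over folds of log p(X_test | theta), with theta learned on X_train."""
    total = 0.0
    for f in range(n_folds):
        test = [x for i, x in enumerate(xs) if i % n_folds == f]
        train = [x for i, x in enumerate(xs) if i % n_folds != f]
        mus, variances, weights = fit_gmm_1d(train, K)
        for x in test:
            total += log(sum(w * normal_pdf(x, m, v)
                             for m, v, w in zip(mus, variances, weights)))
    return total

# Toy data (hypothetical): two groups, interleaved so every fold sees both.
a = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
b = [8.5, 9.0, 9.5, 10.0, 10.5, 11.0]
xs = [v for pair in zip(a, b) for v in pair]

scores = {K: heldout_log_likelihood(xs, K) for K in (1, 2)}
best_K = max(scores, key=scores.get)
print(scores, best_K)
```

The held-out score prefers two components on this data. Larger candidate values of K can be compared the same way, though a sketch this simple may need a more careful initialization and numerically safer likelihood evaluation for them.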

63 Model Selection
  Strategy 2: Model Evidence
  Define a prior $p(\theta)$ and evaluate the marginal likelihood $p(D \mid K)$
  Two families of methods
    Variational Inference
    Importance Sampling

64 Variational Inference (Sketch) Lower bound on Log Evidence Variational E-step Variational M-step

65 Variational Inference (Sketch)
  [Figure: axes $p(D \mid K)$ and $K$]
  Can use lower bound on evidence to select best model
  Variational inference often assigns zero weight to superfluous components
