Data Mining Techniques

1 Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Pre-final Review Jan-Willem van de Meent

2 Feedback

3 Feedback (also posted on Piazza) Also, please fill out your TRACE evaluations!

4 Background

5 Multivariate Normal Density. Parameters: $\Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$
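
As a small illustration of the covariance parameter above, the expectation can be replaced by an average over samples. This is only a sketch: the helper names `mean_vector` and `covariance_matrix` and the toy data are assumptions made for this example, and the estimate divides by N rather than N-1.

```python
def mean_vector(xs):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(xs)
    return [sum(x[i] for x in xs) / n for i in range(len(xs[0]))]


def covariance_matrix(xs):
    """Empirical estimate of Sigma_ij = E[(x_i - mu_i)(x_j - mu_j)]."""
    n, d = len(xs), len(xs[0])
    mu = mean_vector(xs)
    return [
        [sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / n for j in range(d)]
        for i in range(d)
    ]


# Toy data (hypothetical): the second coordinate is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
sigma = covariance_matrix(data)
```

The resulting matrix is symmetric by construction, since swapping i and j leaves each product unchanged.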

6 The Dirichlet Distribution

7 The Dirichlet Distribution

8 Information Theory KL Divergence Entropy Mutual Information
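
The formulas for this slide did not survive in the transcript, so the sketch below uses the standard textbook definitions for discrete distributions (natural logarithm): H(p) = -Σ p log p, KL(p || q) = Σ p log(p/q), and I(X; Y) = KL(p(x, y) || p(x) p(y)). Function names and the example tables are assumptions for illustration.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i log p_i, with 0 log 0 treated as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) = KL(p(x, y) || p(x) p(y)) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    flat_joint = [pxy for row in joint for pxy in row]
    flat_indep = [a * b for a in px for b in py]  # same row-major order
    return kl_divergence(flat_joint, flat_indep)


uniform = [0.5, 0.5]
independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y independent
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is a copy of X
```

In this toy setting, independent variables share no information, while a variable that copies another shares exactly the entropy of one fair coin flip.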

9 Conjugacy Likelihood (discrete) Prior (Dirichlet) Question: What distribution is the posterior? More examples:
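
One way to check an answer to the question above: for a discrete likelihood with a Dirichlet prior, the standard result is that the posterior is again Dirichlet, with each concentration parameter increased by the count of the corresponding outcome. A minimal sketch, with hypothetical function names and toy observations:

```python
from collections import Counter


def dirichlet_posterior(alpha, observations):
    """Return posterior concentrations alpha_k + N_k for outcomes 0..K-1."""
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(alpha)]


def posterior_mean(alpha_post):
    """Mean of a Dirichlet: alpha_k / sum_j alpha_j."""
    total = sum(alpha_post)
    return [a / total for a in alpha_post]


alpha_prior = [1.0, 1.0, 1.0]   # uniform prior over 3 outcomes (assumption)
draws = [0, 0, 2, 0, 1]         # toy observations
alpha_post = dirichlet_posterior(alpha_prior, draws)
mean_post = posterior_mean(alpha_post)
```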

10 Mixture Models

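11 Review: K-means Clustering. Objective: Sum of Squares, $\mathrm{SSE} = \sum_{k=1}^{K} \sum_{n=1}^{N} I[z_n = k] \, \|x_n - \mu_k\|^2$, where $z_n$ is the assignment for point $n$ and $\mu_k$ is the center for cluster $k$. Alternate between two steps: 1. Update assignments 2. Update centers. [Figure: cluster centers μ1, μ2, μ3]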
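
The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a tuned implementation: the function names, the fixed iteration count, and the choice to leave an empty cluster's center unchanged are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(xs, centers, n_iters=20):
    """Alternate assignment and center updates; return (z, centers, sse)."""
    centers = [list(c) for c in centers]
    z = [0] * len(xs)
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest center.
        z = [min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
             for x in xs]
        # Step 2: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            members = [x for x, zn in zip(xs, z) if zn == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    # Sum-of-squares objective for the final assignments and centers.
    sse = sum(sq_dist(x, centers[zn]) for x, zn in zip(xs, z))
    return z, centers, sse


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
z, centers, sse = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Neither step can increase the objective, which is why the procedure converges, though possibly to a local optimum that depends on the initial centers.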

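12 Review: Regression. Objective: Sum of Squares $E(w)$. Probabilistic Interpretation: $y_n = x_n^\top w + \epsilon_n$ with $\epsilon_n \sim \mathrm{Norm}(0, \sigma^2)$, so that $\log p(y \mid w) = -\frac{1}{2\sigma^2} E(w) + \text{const.}$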

13 K-means: Probabilistic Generalization. Generative Model: $z_n \sim \mathrm{Discrete}(\pi)$, $x_n \mid z_n = k \sim \mathrm{Norm}(\mu_k, \Sigma_k)$. Questions: 1. What is $\log p(x, z \mid \mu, \Sigma, \pi)$? 2. For what choice of $\pi$ and $\Sigma$ do we recover K-means? Same as K-means when: $\pi_k = 1/K$, $\Sigma_k = \sigma^2 I$

14 Gaussian K-means. Assignment Update. Parameter Updates: $N_k := \sum_{n=1}^{N} z_{nk}$ with $z_{nk} := I[z_n = k]$; $\pi = (N_1/N, \ldots, N_K/N)$; $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} x_n$; $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} z_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$. Idea: Replace hard assignments with soft assignments

15 Gaussian Soft K-means. Soft Assignment Update. Parameter Updates: $N_k := \sum_{n=1}^{N} \gamma_{nk}$; $\pi = (N_1/N, \ldots, N_K/N)$; $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n$; $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^\top$. Idea: Replace hard assignments with soft assignments

16 Lower Bound on Log Likelihood (multiplication by 1)

17 Lower Bound on Log Likelihood (multiplication by 1) (multiplication by 1)

18 Lower Bound on Log Likelihood (multiplication by 1) (multiplication by 1) (Bayes rule)

19 Lower Bound on Log Likelihood (multiplication by 1) (multiplication by 1) (Bayes rule)

20 Lower Bound on Log Likelihood
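
The equations on these slides were not preserved; only the step annotations remain (two multiplications by one, then Bayes rule). The derivation below is one standard way to obtain the bound that is consistent with those annotations. The symbols $q(z)$ for an arbitrary distribution over the latent variables and $\mathcal{L}(q, \theta)$ for the bound are assumptions of this sketch, not notation recovered from the slides.

```latex
\begin{align*}
\log p(x \mid \theta)
&= \sum_z q(z) \log p(x \mid \theta)
   && \text{(multiplication by 1, since } \textstyle\sum_z q(z) = 1\text{)} \\
&= \sum_z q(z) \log \left[ p(x \mid \theta) \, \frac{q(z)}{q(z)} \right]
   && \text{(multiplication by 1)} \\
&= \sum_z q(z) \log \left[ \frac{p(x, z \mid \theta)}{p(z \mid x, \theta)} \, \frac{q(z)}{q(z)} \right]
   && \text{(Bayes rule)} \\
&= \underbrace{\sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)}}_{\mathcal{L}(q, \theta)}
 + \underbrace{\sum_z q(z) \log \frac{q(z)}{p(z \mid x, \theta)}}_{\mathrm{KL}(q \,\|\, p(z \mid x, \theta)) \,\ge\, 0} \\
&\ge \mathcal{L}(q, \theta)
\end{align*}
```

The bound is tight when $q(z) = p(z \mid x, \theta)$, because the KL term then vanishes.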

21 Gaussian Mixture Model. Generative Model: $z_n \sim \mathrm{Discrete}(\pi)$, $x_n \mid z_n = k \sim \mathrm{Norm}(\mu_k, \Sigma_k)$. Expectation Maximization: Initialize θ. Repeat until convergence: 1. Expectation Step 2. Maximization Step
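
To make the two steps concrete, here is a sketch of EM for a mixture of one-dimensional Gaussians, so that each covariance reduces to a scalar variance. The restriction to one dimension, the function names, the toy data and the fixed number of iterations are all assumptions for illustration. The E-step computes soft assignments (responsibilities) and the M-step applies the weighted parameter updates.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_step(xs, pi, mu, var):
    """One E-step followed by one M-step; returns updated (pi, mu, var)."""
    K, N = len(pi), len(xs)
    # E-step: responsibilities gamma[n][k] proportional to pi_k * Norm(x_n; mu_k, var_k).
    gamma = []
    for x in xs:
        w = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
        total = sum(w)
        gamma.append([wk / total for wk in w])
    # M-step: weighted counts, proportions, means and variances.
    Nk = [sum(g[k] for g in gamma) for k in range(K)]
    pi = [Nk[k] / N for k in range(K)]
    mu = [sum(g[k] * x for g, x in zip(gamma, xs)) / Nk[k] for k in range(K)]
    var = [sum(g[k] * (x - mu[k]) ** 2 for g, x in zip(gamma, xs)) / Nk[k]
           for k in range(K)]
    return pi, mu, var


def log_likelihood(xs, pi, mu, var):
    """log p(x | theta) summed over all data points."""
    return sum(
        math.log(sum(pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(len(pi))))
        for x in xs
    )


xs = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.2]          # toy data, two clear groups
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # initial theta (assumption)
initial_ll = log_likelihood(xs, *params)
for _ in range(25):
    params = em_step(xs, *params)
final_ll = log_likelihood(xs, *params)
pi_hat, mu_hat, var_hat = params
```

A practical implementation would also work in log space and guard against variances collapsing to zero; both are omitted here for brevity.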

22 GMM Advantages / Disadvantages. [Figure 9.5: samples drawn from a mixture of 3 Gaussians. (a) Samples from the joint distribution p(z)p(x|z), in which the three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue. (b) The corresponding samples from the marginal distribution p(x), obtained by ignoring the values of z and just plotting the x values. The data set in (a) is said to be complete, whereas that in (b) is incomplete. (c) The same samples, in which the colours represent the value of the responsibilities γ(z_nk) associated with data point x_n.] + Works with overlapping clusters + Works with clusters of different densities + Same complexity as K-means - Can get stuck in local maximum - Need to set number of components

23 GMM Advantages / Disadvantages + Works with overlapping clusters + Works with clusters of different densities + Same complexity as K-means - Can get stuck in local maximum - Need to set number of components

24 Model Selection. Strategy 1: Cross-validation. Split data into K folds. For each fold k: perform EM to learn θ from training set $X_{\text{train}}$, then calculate test set likelihood $p(X_{\text{test}} \mid \theta)$

25 Latent Dirichlet Allocation

26 Word Mixtures. Idea: Model text as a mixture over words (ignore order). Simple intuition: Documents exhibit multiple topics. [Figure: generative model for latent Dirichlet allocation (LDA), showing example topics (gene, dna, genetic, ...; life, evolve, organism, ...; brain, neuron, nerve, ...; data, number, computer, ...) alongside the words and topics of a document]

27 EM for Word Mixtures Generative Model E-step: Update assignments M-step: Update parameters

28 Topic Modeling. Each topic is a distribution over words. Each document is a mixture over topics. Each word is drawn from one topic distribution. [Figure: topics (gene, dna, genetic, ...; life, evolve, organism, ...; brain, neuron, nerve, ...; data, number, computer, ...), documents, and topic proportions and assignments]

29 Topic Modeling. [Same figure, with the words and topics of a document labeled]

30 EM for Topic Models (PLSI/PLSA*) Generative Model E-step: Update assignments M-step: Update parameters *(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)

31 Latent Dirichlet Allocation (a.k.a. PLSI/PLSA with priors). [Graphical model: proportions parameter α, per-document topic proportions θ_d, per-word topic assignment Z_{d,n}, observed word W_{d,n}, topics β_k, topic parameter η; plates over N, D and K]

32 Community Detection

33 Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness). Repeat until k clusters found: 1. Calculate betweenness 2. Remove edge(s) with highest betweenness (Adapted from: Mining of Massive Datasets,

34 Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness). [Figure: three successive steps and the resulting hierarchical network decomposition] (Adapted from: Mining of Massive Datasets,

35 Calculating Betweenness. Step 1. Count number of shortest paths from the root node to each node (Adapted from: Mining of Massive Datasets,

36 Calculating Betweenness. Step 2. Propagate credit upwards, splitting according to number of paths to parents. Example: 1 path to K. Split in ratio 3:3 (Adapted from: Mining of Massive Datasets,

37 Calculating Betweenness. Step 2. Propagate credit upwards, splitting according to number of paths to parents. Examples: 1+0.5 paths to J. Split 1:2. 1 path to K. Split in ratio 3:3 (Adapted from: Mining of Massive Datasets,
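
The two steps can be sketched as follows. For every root, a breadth-first search counts shortest paths (Step 1), then credit is passed from the deepest nodes upwards and split among parents in proportion to their path counts (Step 2). Summing over all roots and halving gives edge betweenness for an undirected graph. The adjacency-dictionary representation, the function name and the example graph are assumptions of this sketch.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """adj maps each node to the set of its neighbours (undirected graph)."""
    credit_total = defaultdict(float)
    for root in adj:
        # Step 1: BFS from root, counting shortest paths to every node.
        dist = {root: 0}
        n_paths = defaultdict(int)
        n_paths[root] = 1
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v is a parent of w
                    n_paths[w] += n_paths[v]
                    parents[w].append(v)
        # Step 2: propagate credit upwards, splitting by number of paths.
        credit = {v: 1.0 for v in order}
        for w in reversed(order):
            for v in parents[w]:
                share = credit[w] * n_paths[v] / n_paths[w]
                credit_total[frozenset((v, w))] += share
                credit[v] += share
    # Each pair of endpoints was counted once from either side.
    return {edge: c / 2.0 for edge, c in credit_total.items()}


# Two triangles joined by the bridge 2-3 (hypothetical example graph).
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
bet = edge_betweenness(two_triangles)
bridge = max(bet, key=bet.get)
```

In this example the bridge carries every shortest path between the two triangles, so it is the first edge the Girvan-Newman procedure would remove.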

41 Determining the Number of Communities Hierarchical decomposition Choosing a cut-off Analogous problem to deciding on number of clusters in hierarchical clustering (Adapted from: Mining of Massive Datasets,

42 Modularity Idea: Compare fraction of edges within module to fraction that would be observed for random connections Adjacency Matrix Node Degree Node Assignment (Adapted from: Mining of Massive Datasets,

43 Modularity Use modularity to optimize connectivity within modules (Adapted from: Mining of Massive Datasets,
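
The modularity formula itself is missing from the transcript. The sketch below assumes the commonly used definition $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$, with adjacency matrix $A$, node degrees $k_i$, node assignments $c_i$ and $m$ edges, rewritten as a sum over communities. Function and variable names are hypothetical.

```python
def modularity(edges, community):
    """Modularity of an undirected, unweighted graph.

    edges: list of (i, j) pairs, each edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    degree = {}
    for i, j in edges:
        degree[i] = degree.get(i, 0) + 1
        degree[j] = degree.get(j, 0) + 1
    # Fraction of edges that fall within a module.
    within = sum(1 for i, j in edges if community[i] == community[j]) / m
    # Fraction expected if edges were placed at random with the same degrees.
    degree_by_community = {}
    for node, k in degree.items():
        c = community[node]
        degree_by_community[c] = degree_by_community.get(c, 0) + k
    expected = sum((d / (2.0 * m)) ** 2 for d in degree_by_community.values())
    return within - expected


# Two triangles joined by one edge (hypothetical example).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {node: "a" for node in range(6)}
q_split = modularity(edges, split)
q_lumped = modularity(edges, lumped)
```

Placing every node in a single module scores zero under this definition, which is why modularity is useful for choosing among candidate cut-offs.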

44 Minimum Cuts. Minimum Cut: $y^* = \arg\min_{y \in \{-1, 1\}^n} \sum_{(i,j) \in E} (y_i - y_j)^2$. Problem: Can't enumerate all choices $y_1, \ldots, y_n$ (Adapted from: Mining of Massive Datasets,

45 Laplacian Matrix Difference of Degree and Adjacency Matrix (Adapted from: Mining of Massive Datasets,

46 Eigenvectors of the Laplacian. Properties of Laplacian: Real-valued, symmetric. Rows/columns sum to 0 (Adapted from: Mining of Massive Datasets,

47 Second Eigenvector (Fiedler Vector) Eigenvalue is related to the cut: i j (Adapted from: Mining of Massive Datasets,

48 Minimum Cuts. Minimum Cut: $y^* = \arg\min_{y \in \{-1, 1\}^n} \sum_{(i,j) \in E} (y_i - y_j)^2$. Solution: use sign of Fiedler vector (Adapted from: Mining of Massive Datasets,
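
A dependency-free sketch of this recipe: build the Laplacian as degree minus adjacency, approximate the eigenvector of its second-smallest eigenvalue, and split nodes by sign. The eigenvector is found here by power iteration on a shifted matrix while projecting out the constant vector; that numerical shortcut, the starting vector and all names are assumptions of this example. In practice a library eigensolver would be the usual choice.

```python
import math


def laplacian(n, edges):
    """L = D - A for an undirected graph on nodes 0..n-1."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L


def fiedler_vector(L, n_iters=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L."""
    n = len(L)
    # Twice the maximum degree bounds the largest eigenvalue of L, so the
    # matrix (shift * I - L) has non-negative eigenvalues in reversed order.
    shift = 2.0 * max(L[i][i] for i in range(n))
    x = [float(i) for i in range(n)]  # arbitrary non-constant start (assumption)
    for _ in range(n_iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]  # remove the constant eigenvector
        x = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        x = [xi / norm for xi in x]
    return x


# Two triangles joined by the edge 2-3 (hypothetical example graph).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
L = laplacian(6, edges)
x2 = fiedler_vector(L)
side = [xi > 0 for xi in x2]  # partition by sign
```

The overall sign of an eigenvector is arbitrary, so only the grouping of nodes is meaningful, not which side is labeled positive.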

49 Normalized Cuts Optimal cut Minimum cut Problem: minimal cut is not necessarily a good splitting criterion (Adapted from: Mining of Massive Datasets,

50 Solving Normalized Cuts. Optimal cut. Minimum cut. Solve using Normalized Laplacian (for derivation see: Shi & Malik, IEEE TPAMI, 2000) (Adapted from: Mining of Massive Datasets,

51 Example: Spectral Partitioning. [Plot: value of $x_2$ against rank in $x_2$] (Adapted from: Mining of Massive Datasets,

52 Example: Spectral Partitioning. [Plot: value of $x_2$ against rank in $x_2$] (Adapted from: Mining of Massive Datasets,

53 k-way Spectral Clustering Example: Clustering with 2 eigenvectors

54 Link Analysis

55 PageRank: Recursive Formulation. $r_j = r_i/3 + r_k/4$. [Figure: page j with in-links from page i (carrying $r_i/3$) and page k (carrying $r_k/4$), and three out-links each carrying $r_j/3$] A link's vote is proportional to the importance of its source page. If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes. Page j's own importance is the sum of the votes on its in-links. (adapted from: Mining of Massive Datasets,

56 Equivalent Formulation: Random Surfer. $r_j = r_i/3 + r_k/4$. [Figure: page j with in-links from pages i and k and three out-links] At time t a surfer is on some page i. At time t+1 the surfer follows a link to a new page at random. Define rank $r_i$ as fraction of time spent on page i. (adapted from: Mining of Massive Datasets,

57 PageRank: Problems. 1. Dead Ends: Nodes with no outgoing links. Where do surfers go next? 2. Spider Traps: Subgraph with no outgoing links to wider graph. Surfers are trapped with no way out. (adapted from: Mining of Massive Datasets,

58 Solution: Random Teleports. Model for teleporting random surfer: At time t = 0 pick a page at random. At each subsequent time t: with probability β follow an outgoing link at random; with probability 1-β teleport to a new initial location at random. PageRank Equation [Page & Brin 1998]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$ (adapted from: Mining of Massive Datasets,
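
The PageRank equation with teleports can be solved by repeatedly applying it to a rank vector until it stops changing. The sketch below assumes a graph with no dead ends, a default of β = 0.85 and a fixed number of iterations; those choices and all names are illustrative assumptions rather than anything specified on the slide.

```python
def pagerank(out_links, beta=0.85, n_iters=100):
    """Power iteration for PageRank with uniform random teleports.

    out_links: dict mapping each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dead ends).
    """
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(n_iters):
        # Teleport term: (1 - beta) / N for every page.
        new_r = {p: (1.0 - beta) / n for p in pages}
        # Link term: page i passes beta * r_i / d_i along each out-link.
        for i, targets in out_links.items():
            for j in targets:
                new_r[j] += beta * r[i] / len(targets)
        r = new_r
    return r


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Because every page distributes all of its rank and the teleport terms sum to 1-β, the ranks keep summing to one at every iteration. Handling dead ends would need an extra rule, for example teleporting with probability one from such pages.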

59 PageRank: Extensions. Topic-specific PageRank: Restrict teleportation to some set S of pages related to a specific topic. Set $p_i = 1/|S|$ if $i \in S$, $p_i = 0$ otherwise. Trust Propagation: Use set S of trusted pages as teleport set

60 Recommender Systems

61 The Long Tail (from:

62 Problem Setting Task: Predict user preferences for unseen items Content-based filtering: Use user/item features Collaborative filtering: Use similarity in ratings

63 Neighborhood Based Methods. (user,user) similarity: predict rating based on average from k-nearest users; good if item base is smaller than user base; good if item base changes rapidly. (item,item) similarity: predict rating based on average from k-nearest items; good if the user base is smaller than item base; good if user base changes rapidly

64 (item,item) similarity. Empirical estimate of Pearson correlation coefficient: $\hat{\rho}_{ij} = \frac{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})(r_{uj} - b_{uj})}{\sqrt{\sum_{u \in U(i,j)} (r_{ui} - b_{ui})^2 \sum_{u \in U(i,j)} (r_{uj} - b_{uj})^2}}$. Regularize towards 0 for small support: $s_{ij} = \frac{|U(i,j)|}{|U(i,j)| + \lambda} \hat{\rho}_{ij}$. Regularize towards baseline for small neighborhood
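
A sketch of the similarity computation, where U(i, j) is the set of users who rated both items. The shrinkage factor |U(i,j)| / (|U(i,j)| + λ) is an assumption about the exact form of the regularizer, as are the nested-dictionary data layout, the constant baseline and all names.

```python
import math


def item_similarity(ratings, baseline, i, j, lam=10.0):
    """Shrunk Pearson correlation between items i and j.

    ratings: dict user -> dict item -> rating.
    baseline: function (user, item) -> baseline estimate b_ui.
    lam: shrinkage strength (hypothetical default).
    """
    # U(i, j): users who rated both items.
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    di = [ratings[u][i] - baseline(u, i) for u in users]
    dj = [ratings[u][j] - baseline(u, j) for u in users]
    denom = math.sqrt(sum(a * a for a in di) * sum(b * b for b in dj))
    if denom == 0.0:
        return 0.0  # no co-ratings, or no variation around the baseline
    rho = sum(a * b for a, b in zip(di, dj)) / denom
    # Shrink towards 0 when few users support the estimate.
    return len(users) / (len(users) + lam) * rho


# Toy ratings (hypothetical): deviations from 3.0 are proportional.
ratings = {
    "u1": {"x": 5.0, "y": 4.0},
    "u2": {"x": 1.0, "y": 2.0},
    "u3": {"x": 4.0, "y": 3.5},
}
s_xy = item_similarity(ratings, lambda u, i: 3.0, "x", "y", lam=3.0)
```

With three supporting users and λ = 3, a perfect correlation is shrunk to one half, which illustrates how weakly supported similarities are discounted.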

65 Matrix Factorization Moonrise Kingdom Idea: pose as (biased) matrix factorization problem

66 Alternating Least Squares (regress $x_i$ given W) (regress $w_u$ given X)
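
To show the alternation without any linear algebra library, the sketch below restricts the factorization to rank 1, so each regression has a closed-form scalar solution. The rank-1 restriction, the ridge penalty `lam`, the omission of bias terms and all names are simplifying assumptions; the objective minimized is the squared error over observed ratings plus `lam` times the squared factors.

```python
def als_rank1(ratings, n_users, n_items, lam=1e-6, n_iters=50):
    """Rank-1 alternating least squares on observed ratings.

    ratings: dict mapping (user, item) -> rating.
    Returns user factors w and item factors x with r_ui ~ w[u] * x[i].
    """
    w = [1.0] * n_users
    x = [1.0] * n_items
    for _ in range(n_iters):
        # Regress x_i given W: ridge regression on the users who rated item i.
        for i in range(n_items):
            num = sum(w[u] * r for (u, ii), r in ratings.items() if ii == i)
            den = sum(w[u] ** 2 for (u, ii) in ratings if ii == i) + lam
            x[i] = num / den
        # Regress w_u given X: ridge regression on the items rated by user u.
        for u in range(n_users):
            num = sum(x[i] * r for (uu, i), r in ratings.items() if uu == u)
            den = sum(x[i] ** 2 for (uu, i) in ratings if uu == u) + lam
            w[u] = num / den
    return w, x


# Observed entries of a rank-1 matrix with rows (1, 2, 3) and (2, 4, 6);
# the entry for user 1, item 2 is held out (hypothetical example).
observed = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
w, x = als_rank1(observed, n_users=2, n_items=3)
predicted = w[1] * x[2]
```

Each half-step solves its regression exactly with the other factor held fixed, so the regularized objective never increases.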

67 Ratings are not given at random Netflix ratings Yahoo! music ratings Yahoo! survey answers

68 Ratings are not given at random. [Figure: users × movies matrices $r_{ui}$ and $c_{ui}$, labeled matrix factorization, regression and data]

69 Improvements: Add biases. Do SGD, but also learn biases μ, $b_u$ and $b_i$. [Plot: Factor models, error vs. #parameters. RMSE against millions of parameters for NMF, BiasSVD, SVD++ and further SVD variants]

70 Improvements: who rated what. Account for the fact that ratings are not missing at random. [Plot: Factor models, error vs. #parameters. RMSE against millions of parameters for NMF, BiasSVD, SVD++ and further SVD variants]

71 Improvements: temporal effects. [Plot: Factor models, error vs. #parameters. RMSE against millions of parameters for NMF, BiasSVD, SVD++ and further SVD variants]

72 As with the Midterm. Exam questions will be conceptual (and range from straightforward to hard). You may bring notes, slide printouts and textbooks. You may not use any internet-enabled electronics. The exam is designed to have a median score of 75/100 (though this is not an exact science)
