Data Mining Techniques

Similar documents
Data Mining Techniques

Data Mining Techniques

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

CS249: ADVANCED DATA MINING

CSC411: Final Review. James Lucas & David Madras. December 3, 2018

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CS6220: DATA MINING TECHNIQUES

ECE 5984: Introduction to Machine Learning

CS6220: DATA MINING TECHNIQUES

Mixtures of Gaussians. Sargur Srihari

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Data Mining Techniques

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

CS6220: DATA MINING TECHNIQUES

Link Mining PageRank. From Stanford C246

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Introduction to Graphical Models

Data Mining Techniques

Link Analysis. Stony Brook University CSE545, Fall 2016

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Restricted Boltzmann Machines for Collaborative Filtering

Data and Algorithms of the Web

K-Means, Expectation Maximization and Segmentation. D.A. Forsyth, CS543

Andriy Mnih and Ruslan Salakhutdinov

Learning Bayesian networks

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Slides based on those in:

CS6220: DATA MINING TECHNIQUES

Generative Clustering, Topic Modeling, & Bayesian Inference

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Collaborative Filtering. Radek Pelánek

Using SVD to Recommend Movies

CS281 Section 4: Factor Analysis and PCA

Clustering using Mixture Models

CS145: INTRODUCTION TO DATA MINING

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Final Exam, Machine Learning, Spring 2009

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Chris Bishop's PRML Ch. 8: Graphical Models

CS246 Final Exam, Winter 2011

Machine Learning for Signal Processing Bayes Classification and Regression

Collaborative topic models: motivations cont

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Recommendation Systems

Graphical Models for Collaborative Filtering

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CSCI-567: Machine Learning (Spring 2019)

Latent Variable Models and Expectation Maximization

An Introduction to Spectral Learning

Latent Variable View of EM. Sargur Srihari

Latent Variable Models and Expectation Maximization

Collaborative Filtering

Recommendation Systems

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Collaborative Filtering: A Machine Learning Perspective

Lecture 13 : Variational Inference: Mean Field Approximation

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

Expectation Maximization

Clustering and Gaussian Mixture Models

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

STA 4273H: Statistical Machine Learning

Clustering VS Classification

Probabilistic Graphical Models & Applications

Link Analysis Ranking

Gaussian Mixture Models

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Clustering and Gaussian Mixtures

Statistical Machine Learning

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Dimension Reduction. David M. Blei. April 23, 2012

CS Lecture 18. Topic Models and LDA

Recommendation Systems

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Collaborative Topic Modeling for Recommending Scientific Articles

Machine Learning for Signal Processing Bayes Classification

Online Algorithms for Sum-Product

Modeling User Rating Profiles For Collaborative Filtering

CS6220: DATA MINING TECHNIQUES

Document and Topic Models: plsa and LDA

Learning Bayesian belief networks

Jeffrey D. Ullman Stanford University

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Undirected Graphical Models

CS 572: Information Retrieval

Click Prediction and Preference Ranking of RSS Feeds

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Generative Models for Discrete Data

Unsupervised Learning

11 : Gaussian Graphic Models and Ising Models

Online Dictionary Learning with Group Structure Inducing Norms

STATS 306B: Unsupervised Learning Spring Lecture 5 April 14

Lecture 8: Clustering & Mixture Models

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Probabilistic clustering

Expectation maximization

Matrix Factorization Techniques for Recommender Systems

CS249: ADVANCED DATA MINING

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Transcription:

Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Pre-final Review Jan-Willem van de Meent

Feedback

Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

Background

Multivariate Normal Density: Parameters: Σ_ij = E[(x_i − µ_i)(x_j − µ_j)]

The Dirichlet Distribution

The Dirichlet Distribution

Information Theory KL Divergence Entropy Mutual Information

Conjugacy Likelihood (discrete) Prior (Dirichlet) Question: What distribution is the posterior? More examples: https://en.wikipedia.org/wiki/conjugate_prior
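
For reference, a sketch of the conjugacy computation the slide asks about, for a discrete (categorical) likelihood with outcome counts N_k and a Dirichlet(α_1, …, α_K) prior; the posterior is again a Dirichlet:

\[
p(\pi \mid x_{1:N})
\;\propto\; \prod_{n=1}^{N} \pi_{x_n} \; \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}
\;=\; \prod_{k=1}^{K} \pi_k^{N_k + \alpha_k - 1}
\quad\Longrightarrow\quad
\pi \mid x_{1:N} \;\sim\; \mathrm{Dirichlet}(\alpha_1 + N_1, \dots, \alpha_K + N_K)
\]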

Mixture Models

Review: K-means Clustering Objective: Sum of Squares SSE = Σ_{k=1}^K Σ_{n=1}^N I[z_n = k] ‖x_n − µ_k‖², where z_n is the assignment for point n and µ_k is the center for cluster k. [Figure: example clustering with centers µ_1, µ_2, µ_3.] Alternate between two steps: 1. Update assignments 2. Update centers
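
A minimal NumPy sketch of these two alternating updates (function and variable names are illustrative, not from the slides):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # X: (N, D) data matrix; K: number of clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # initialize at random points
    for _ in range(n_iters):
        # 1. Update assignments: nearest center for each point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
        z = dists.argmin(axis=1)
        # 2. Update centers: mean of the points assigned to each cluster
        centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                            for k in range(K)])
    sse = ((X - centers[z]) ** 2).sum()  # sum-of-squares objective
    return z, centers, sse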

Review: Regression Objective: Sum of Squares Probabilistic Interpretation: y_n = x_n^⊤ w + ε_n, ε_n ∼ Norm(0, σ²), so log p(y | w) = −(1/2σ²) E(w) + const., where E(w) is the sum-of-squares error.

K-means: Probabilistic Generalization Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(µ_k, Σ_k). Questions: 1. What is log p(x, z | µ, Σ, π)? 2. For what choice of π and Σ do we recover K-means? Same as K-means when: π_k = 1/K, Σ_k = σ² I

Gaussian K-means Assignment Update: z_nk := I[z_n = k], N_k := Σ_{n=1}^N z_nk. Parameter Updates: π = (N_1/N, …, N_K/N), µ_k = (1/N_k) Σ_{n=1}^N z_nk x_n, Σ_k = (1/N_k) Σ_{n=1}^N z_nk (x_n − µ_k)(x_n − µ_k)^⊤

Gaussian Soft K-means Idea: Replace hard assignments with soft assignments. Soft Assignment Update: γ_nk := π_k Norm(x_n | µ_k, Σ_k) / Σ_{l=1}^K π_l Norm(x_n | µ_l, Σ_l), N_k := Σ_{n=1}^N γ_nk. Parameter Updates: π = (N_1/N, …, N_K/N), µ_k = (1/N_k) Σ_{n=1}^N γ_nk x_n, Σ_k = (1/N_k) Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^⊤

Lower Bound on Log Likelihood
log p(x | θ)
= Σ_z q(z) log p(x | θ)   (multiplication by 1 = Σ_z q(z))
= Σ_z q(z) log [ p(x | θ) q(z) / q(z) ]   (multiplication by 1 = q(z)/q(z))
= Σ_z q(z) log [ p(x, z | θ) q(z) / ( q(z) p(z | x, θ) ) ]   (Bayes' rule: p(x | θ) = p(x, z | θ) / p(z | x, θ))
= Σ_z q(z) log [ p(x, z | θ) / q(z) ] + Σ_z q(z) log [ q(z) / p(z | x, θ) ]
= L(q, θ) + KL( q(z) ‖ p(z | x, θ) ) ≥ L(q, θ)

Gaussian Mixture Model Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(µ_k, Σ_k). Expectation Maximization: Initialize θ. Repeat until convergence: 1. Expectation Step 2. Maximization Step
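
A NumPy/SciPy sketch of this EM loop for a Gaussian mixture, directly mirroring the soft K-means updates above (a minimal illustration with made-up variable names, not the course's reference code):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = X[rng.choice(N, size=K, replace=False)]  # initialize means at random points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(z_n = k | x_n, theta)
        log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the K-means parameter updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma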

GMM Advantages / Disadvantages [Figure 9.5 from Bishop, PRML: 500 points drawn from a mixture of 3 Gaussians. (a) Samples from the joint distribution p(z)p(x|z), coloured red, green, and blue by mixture component (complete data). (b) The corresponding samples from the marginal p(x), obtained by ignoring z (incomplete data). (c) The same samples coloured by the responsibilities γ(z_nk), plotted using proportions of red, blue, and green ink.]

GMM Advantages / Disadvantages + Works with overlapping clusters + Works with clusters of different densities + Same complexity as K-means - Can get stuck in local maximum - Need to set number of components

Model Selection Strategy 1: Cross-validation Split data into K folds. For each fold k: Perform EM to learn θ from the training set X_train; Calculate the test set likelihood p(X_test | θ)
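
A sketch of this strategy, assuming scikit-learn's GaussianMixture as the EM implementation (the fold scheme and candidate values here are illustrative):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.mixture import GaussianMixture

def select_num_components(X, candidate_Ks=(1, 2, 3, 4, 5), n_folds=5, seed=0):
    scores = {}
    for K in candidate_Ks:
        fold_scores = []
        for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
            gmm = GaussianMixture(n_components=K, random_state=seed).fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))  # mean held-out log-likelihood per point
        scores[K] = np.mean(fold_scores)
    return max(scores, key=scores.get), scores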

Latent Dirichlet Allocation

Word Mixtures Generative model. Idea: Model text as a mixture over words (ignore order). [Figure adapted from Blei's latent Dirichlet allocation (LDA) article: example topics such as gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer, each a distribution over words. Simple intuition: documents exhibit multiple topics.]

EM for Word Mixtures Generative Model E-step: Update assignments M-step: Update parameters

Topic Modeling [Figure adapted from Blei: Topics (gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; data 0.02, number 0.02, computer 0.01), Documents, and per-document Topic proportions and assignments, with each word in the document coloured by its topic.] Each topic is a distribution over words. Each document is a mixture over topics. Each word is drawn from one topic distribution.

EM for Topic Models (PLSI/PLSA*) Generative Model E-step: Update assignments M-step: Update parameters *(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
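
The update equations did not survive transcription; as a sketch, the standard PLSA EM updates over a document-term count matrix look like this (NumPy; variable names are my own):

import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    # counts: (D, V) document-term count matrix n(d, w)
    D, V = counts.shape
    rng = np.random.default_rng(seed)
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # p(z | d), shape (D, K)
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w | z), shape (K, V)
    for _ in range(n_iters):
        # E-step: update assignments q(z | d, w) proportional to p(z | d) p(w | z)
        q = p_z_d[:, None, :] * p_w_z.T[None, :, :]      # shape (D, V, K)
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step: update parameters from expected counts n(d, w) q(z | d, w)
        expected = counts[:, :, None] * q
        p_w_z = expected.sum(axis=0).T                   # (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)                     # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z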

Latent Dirichlet Allocation (a.k.a. PLSI/PLSA with priors) [Plate diagram: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment Z_{d,n} → observed word W_{d,n} ← topics β_k ← topic parameter η, with plates over the N words in a document, the D documents, and the K topics.]

Community Detection

Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness) [Figure: example network with edge betweenness values.] Repeat until k clusters found: 1. Calculate betweenness 2. Remove edge(s) with highest betweenness (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness) [Figure: successive edge removals (Step 1, Step 2, Step 3) yield a hierarchical decomposition of the network.] (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
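
A sketch of this loop using the NetworkX library (assuming an undirected graph; edge_betweenness_centrality and connected_components are standard NetworkX calls):

import networkx as nx

def girvan_newman(G, k):
    # Hierarchical divisive clustering: repeatedly remove the highest-betweenness
    # edge(s) until the graph falls apart into k connected components.
    G = G.copy()
    while nx.number_connected_components(G) < k and G.number_of_edges() > 0:
        # 1. Calculate betweenness for every edge
        betweenness = nx.edge_betweenness_centrality(G)
        # 2. Remove the edge(s) with highest betweenness
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
    return [set(c) for c in nx.connected_components(G)]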

Calculating Betweenness Step 1. Count the number of shortest paths from the starting node to each node (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Calculating Betweenness [Figure: credit propagation example, e.g. credit split 1:2 between parents according to the number of shortest paths through each.] Step 2. Propagate credit upwards, splitting according to number of paths to parents (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Determining the Number of Communities Hierarchical decomposition Choosing a cut-off Analogous problem to deciding on number of clusters in hierarchical clustering (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Modularity Idea: Compare the fraction of edges within a module to the fraction that would be observed for random connections. Ingredients: the adjacency matrix, the node degrees, and the node-to-module assignment (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
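
The standard modularity score these ingredients combine into (a reference formula, not transcribed from the slide), with A the adjacency matrix, k_i the node degrees, c_i the community assignments, and m the total number of edges:

\[
Q \;=\; \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
\]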

Modularity Use modularity to optimize connectivity within modules (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Minimum Cuts Minimum Cut: y* = argmin_{y ∈ {0,1}^n} Σ_{(i,j)∈E} (y_i − y_j)². Problem: Can't enumerate all choices y_1, …, y_n (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Laplacian Matrix

     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

Difference of Degree and Adjacency Matrix: L = D − A (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Eigenvectors of the Laplacian [Same Laplacian matrix as on the previous slide.] Properties of the Laplacian: real-valued, symmetric; rows/columns sum to 0 (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Second Eigenvector (Fiedler Vector) The second-smallest eigenvalue is related to the cut: λ_2 = min_{x ⊥ 1, ‖x‖ = 1} Σ_{(i,j)∈E} (x_i − x_j)² (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Minimum Cuts Minimum Cut: y* = argmin_{y ∈ {0,1}^n} Σ_{(i,j)∈E} (y_i − y_j)². Solution: use the sign of the Fiedler vector (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
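
A NumPy sketch of this sign-of-Fiedler-vector split, with the graph given as an adjacency matrix (names are illustrative):

import numpy as np

def fiedler_partition(A):
    # A: symmetric (n, n) adjacency matrix of an undirected graph
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A              # Laplacian: degree matrix minus adjacency matrix
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0                   # cut assignment by sign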

Normalized Cuts [Figure: example graph with the minimum cut and the optimal cut marked.] Problem: the minimal cut is not necessarily a good splitting criterion (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Solving Normalized Cuts [Figure: the same graph with the minimum cut and the optimal cut marked.] Solve using the Normalized Laplacian (for derivation see: Shi & Malik, IEEE TPAMI, 2000) (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Example: Spectral Partitioning [Plot: value of x_2, the second eigenvector component for each node, against rank in x_2.] (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

k-way Spectral Clustering Example: Clustering with 2 eigenvectors

Link Analysis

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) PageRank: Recursive Formulation [Figure: page j receives r_i/3 from page i (3 out-links) and r_k/4 from page k (4 out-links), and passes r_j/3 along each of its 3 out-links.] r_j = r_i/3 + r_k/4. A link's vote is proportional to the importance of its source page. If page j with importance r_j has n out-links, each link gets r_j/n votes. Page j's own importance is the sum of the votes on its in-links.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) Equivalent Formulation: Random Surfer [Same figure: r_j = r_i/3 + r_k/4.] At time t a surfer is on some page i. At time t+1 the surfer follows a link to a new page at random. Define rank r_i as the fraction of time spent on page i.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) PageRank: Problems 1. Dead Ends: nodes with no outgoing links. Where do surfers go next? 2. Spider Traps: a subgraph with no outgoing links to the wider graph. Surfers are trapped with no way out.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) Solution: Random Teleports Model for the teleporting random surfer: At time t = 0 pick a page at random. At each subsequent time t: with probability β follow an outgoing link at random; with probability 1−β teleport to a new location at random. PageRank Equation [Page & Brin 1998]: r_j = Σ_{i→j} β r_i/d_i + (1−β)/N
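
A power-iteration sketch of this equation (β = 0.85 is a common choice, not from the slide; dead ends are handled here by teleporting from them uniformly):

import numpy as np

def pagerank(adj, beta=0.85, n_iters=100):
    # adj[i, j] = 1 if page i links to page j
    N = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic surfer matrix: follow a random out-link,
    # or jump uniformly at random from a dead end.
    M = np.where(out_deg[:, None] > 0, adj / np.maximum(out_deg, 1)[:, None], 1.0 / N)
    r = np.full(N, 1.0 / N)
    for _ in range(n_iters):
        r = beta * (M.T @ r) + (1 - beta) / N  # PageRank equation with random teleports
    return r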

PageRank: Extensions Topic-specific PageRank: Restrict teleportation to some set S of pages related to a specific topic. Set p_i = 1/|S| if i ∈ S, p_i = 0 otherwise. Trust Propagation: Use a set S of trusted pages as the teleport set

Recommender Systems

The Long Tail (from: https://www.wired.com/2004/10/tail/)

Problem Setting Task: Predict user preferences for unseen items Content-based filtering: Use user/item features Collaborative filtering: Use similarity in ratings

Neighborhood Based Methods (user, user) similarity predict rating based on average from k-nearest users good if item base is smaller than user base good if item base changes rapidly (item,item) similarity predict rating based on average from k-nearest items good if the user base is smaller than item base good if user base changes rapidly

(item, item) similarity Empirical estimate of the Pearson correlation coefficient: ρ̂_ij = Σ_{u∈U(i,j)} (r_ui − b_ui)(r_uj − b_uj) / √( Σ_{u∈U(i,j)} (r_ui − b_ui)² · Σ_{u∈U(i,j)} (r_uj − b_uj)² ). Regularize towards 0 for small support: s_ij = |U(i,j)| / (|U(i,j)| + λ) · ρ̂_ij. Regularize towards the baseline for small neighborhoods.
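
A sketch of this shrunk item-item correlation with ratings stored as a dict of dicts (λ is the shrinkage constant; all names here are mine):

import numpy as np

def shrunk_item_similarity(ratings, baselines, i, j, lam=100.0):
    # ratings[u][i]: rating r_ui; baselines[u][i]: baseline prediction b_ui
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]  # U(i, j)
    if not users:
        return 0.0
    di = np.array([ratings[u][i] - baselines[u][i] for u in users])
    dj = np.array([ratings[u][j] - baselines[u][j] for u in users])
    rho = (di @ dj) / np.sqrt((di @ di) * (dj @ dj) + 1e-12)  # empirical Pearson estimate
    return len(users) / (len(users) + lam) * rho              # shrink towards 0 for small support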

Matrix Factorization [Example: a sparsely observed row of ratings for Moonrise Kingdom (4, 5, 4, 4, …) and its low-dimensional latent-factor representation.] Idea: pose as (biased) matrix factorization problem

Alternating Least Squares [Figure: the sparsely observed ratings matrix is approximated by a product of item factors X and user factors W.] Alternate between: (regress x_i given W) and (regress w_u given X)
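
A minimal alternating least squares sketch on a dense rating matrix with missing entries stored as NaN (the ridge penalty λ is an assumption; biases are omitted for brevity):

import numpy as np

def als(R, K=10, lam=0.1, n_iters=20, seed=0):
    # R: (n_users, n_items) ratings with np.nan where unobserved
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_users, K))  # user factors w_u
    X = 0.1 * rng.standard_normal((n_items, K))  # item factors x_i
    observed = ~np.isnan(R)
    for _ in range(n_iters):
        # Regress each w_u on the factors of the items that user rated
        for u in range(n_users):
            idx = observed[u]
            if idx.any():
                Xi = X[idx]
                W[u] = np.linalg.solve(Xi.T @ Xi + lam * np.eye(K), Xi.T @ R[u, idx])
        # Regress each x_i on the factors of the users who rated that item
        for i in range(n_items):
            idx = observed[:, i]
            if idx.any():
                Wu = W[idx]
                X[i] = np.linalg.solve(Wu.T @ Wu + lam * np.eye(K), Wu.T @ R[idx, i])
    return W, X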

Ratings are not given at random Netflix ratings Yahoo! music ratings Yahoo! survey answers

Ratings are not given at random [Figure: two users × movies matrices — the observed ratings r_ui, used as matrix factorization data, and the observation indicators c_ui, used as regression data.]

Improvements [Plot: "Factor models: Error vs. #parameters" — RMSE (roughly 0.875 to 0.91) against millions of parameters for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Add biases: do SGD, but also learn the biases μ, b_u and b_i
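
A sketch of one such SGD update for a single observed rating under the biased model r_ui ≈ μ + b_u + b_i + w_u · x_i (the learning rate and regularization values are illustrative):

import numpy as np

def sgd_step(r_ui, mu, b_u, b_i, w_u, x_i, lr=0.005, lam=0.02):
    # w_u, x_i: latent factor vectors (NumPy arrays); b_u, b_i: scalar biases
    err = r_ui - (mu + b_u + b_i + w_u @ x_i)
    # L2-regularized gradient steps on the biases and both factor vectors
    b_u_new = b_u + lr * (err - lam * b_u)
    b_i_new = b_i + lr * (err - lam * b_i)
    w_u_new = w_u + lr * (err * x_i - lam * w_u)
    x_i_new = x_i + lr * (err * w_u - lam * x_i)
    return b_u_new, b_i_new, w_u_new, x_i_new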

Improvements [Same plot: "Factor models: Error vs. #parameters" for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Who rated what: account for the fact that ratings are not missing at random.

Improvements [Same plot: "Factor models: Error vs. #parameters" for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Temporal effects.

As with the Midterm: Exam questions will be conceptual (and range from straightforward to hard). You may bring notes, slide printouts and textbooks. You may not use any internet-enabled electronics. The exam is designed to have a median score of 75/100 (though this is not an exact science).