Data Mining Techniques

Similar documents
Data Mining Techniques

Data Mining Techniques

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

CS249: ADVANCED DATA MINING

CSC411: Final Review. James Lucas & David Madras. December 3, 2018

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CS6220: DATA MINING TECHNIQUES

ECE 5984: Introduction to Machine Learning

CS6220: DATA MINING TECHNIQUES

Mixtures of Gaussians. Sargur Srihari

Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a

Data Mining Techniques

Clustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning

CS6220: DATA MINING TECHNIQUES

Link Mining PageRank. From Stanford C246

PROBABILISTIC LATENT SEMANTIC ANALYSIS

Introduction to Graphical Models

Data Mining Techniques

Link Analysis. Stony Brook University CSE545, Fall 2016

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Restricted Boltzmann Machines for Collaborative Filtering

Data and Algorithms of the Web

K-Means, Expectation Maximization and Segmentation. D.A. Forsyth, CS543

Andriy Mnih and Ruslan Salakhutdinov

Learning Bayesian networks

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Slides based on those in:

CS6220: DATA MINING TECHNIQUES

Generative Clustering, Topic Modeling, & Bayesian Inference

Introduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin

Collaborative Filtering. Radek Pelánek

Using SVD to Recommend Movies

CS281 Section 4: Factor Analysis and PCA

Clustering using Mixture Models

CS145: INTRODUCTION TO DATA MINING

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Final Exam, Machine Learning, Spring 2009

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Chris Bishop's PRML Ch. 8: Graphical Models

CS246 Final Exam, Winter 2011

Machine Learning for Signal Processing Bayes Classification and Regression

Collaborative topic models: motivations cont

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

Recommendation Systems

Graphical Models for Collaborative Filtering

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CSCI-567: Machine Learning (Spring 2019)

Latent Variable Models and Expectation Maximization

An Introduction to Spectral Learning

Latent Variable View of EM. Sargur Srihari

Latent Variable Models and Expectation Maximization

Collaborative Filtering

Recommendation Systems

Matrix Factorization & Latent Semantic Analysis Review. Yize Li, Lanbo Zhang

Collaborative Filtering: A Machine Learning Perspective

Lecture 13 : Variational Inference: Mean Field Approximation

COMS 4721: Machine Learning for Data Science Lecture 18, 4/4/2017

Expectation Maximization

Clustering and Gaussian Mixture Models

Binary Principal Component Analysis in the Netflix Collaborative Filtering Task

STA 4273H: Statistical Machine Learning

Clustering VS Classification

Probabilistic Graphical Models & Applications

Link Analysis Ranking

Gaussian Mixture Models

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Clustering and Gaussian Mixtures

Statistical Machine Learning

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Dimension Reduction. David M. Blei. April 23, 2012

CS Lecture 18. Topic Models and LDA

Recommendation Systems

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Collaborative Topic Modeling for Recommending Scientific Articles

Machine Learning for Signal Processing Bayes Classification

Online Algorithms for Sum-Product

Modeling User Rating Profiles For Collaborative Filtering

CS6220: DATA MINING TECHNIQUES

Document and Topic Models: plsa and LDA

Learning Bayesian belief networks

Jeffrey D. Ullman Stanford University

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Undirected Graphical Models

CS 572: Information Retrieval

Click Prediction and Preference Ranking of RSS Feeds

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Generative Models for Discrete Data

Unsupervised Learning

11 : Gaussian Graphic Models and Ising Models

Online Dictionary Learning with Group Structure Inducing Norms

STATS 306B: Unsupervised Learning Spring Lecture 5 April 14

Lecture 8: Clustering & Mixture Models

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Probabilistic clustering

Expectation maximization

Matrix Factorization Techniques for Recommender Systems

CS249: ADVANCED DATA MINING

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

Transcription:

Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Pre-final Review Jan-Willem van de Meent

Feedback

Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

Background

Multivariate Normal Density: Parameters: Σ_ij = E[(x_i − µ_i)(x_j − µ_j)]

The Dirichlet Distribution

The Dirichlet Distribution

Information Theory KL Divergence Entropy Mutual Information

Conjugacy Likelihood (discrete) Prior (Dirichlet) Question: What distribution is the posterior? More examples: https://en.wikipedia.org/wiki/conjugate_prior
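
For reference, a sketch of the conjugacy computation the slide asks about, for a discrete (categorical) likelihood with outcome counts N_k and a Dirichlet(α_1, …, α_K) prior; the posterior is again a Dirichlet:

\[
p(\pi \mid x_{1:N})
\;\propto\; \prod_{n=1}^{N} \pi_{x_n} \; \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}
\;=\; \prod_{k=1}^{K} \pi_k^{N_k + \alpha_k - 1}
\quad\Longrightarrow\quad
\pi \mid x_{1:N} \;\sim\; \mathrm{Dirichlet}(\alpha_1 + N_1, \dots, \alpha_K + N_K)
\]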

Mixture Models

Review: K-means Clustering Objective: Sum of Squares SSE = Σ_{k=1}^K Σ_{n=1}^N I[z_n = k] ‖x_n − µ_k‖², where z_n is the assignment for point n and µ_k is the center for cluster k. [Figure: example clustering with centers µ_1, µ_2, µ_3.] Alternate between two steps: 1. Update assignments 2. Update centers
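
A minimal NumPy sketch of these two alternating updates (function and variable names are illustrative, not from the slides):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # X: (N, D) data matrix; K: number of clusters
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # initialize at random points
    for _ in range(n_iters):
        # 1. Update assignments: nearest center for each point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
        z = dists.argmin(axis=1)
        # 2. Update centers: mean of the points assigned to each cluster
        centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                            for k in range(K)])
    sse = ((X - centers[z]) ** 2).sum()  # sum-of-squares objective
    return z, centers, sse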

Review: Regression Objective: Sum of Squares Probabilistic Interpretation: y_n = x_n^⊤ w + ε_n, ε_n ∼ Norm(0, σ²), so log p(y | w) = −(1/2σ²) E(w) + const., where E(w) is the sum-of-squares error.

K-means: Probabilistic Generalization Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(µ_k, Σ_k). Questions: 1. What is log p(x, z | µ, Σ, π)? 2. For what choice of π and Σ do we recover K-means? Same as K-means when: π_k = 1/K, Σ_k = σ² I

Gaussian K-means Assignment Update: z_nk := I[z_n = k], N_k := Σ_{n=1}^N z_nk. Parameter Updates: π = (N_1/N, …, N_K/N), µ_k = (1/N_k) Σ_{n=1}^N z_nk x_n, Σ_k = (1/N_k) Σ_{n=1}^N z_nk (x_n − µ_k)(x_n − µ_k)^⊤

Gaussian Soft K-means Idea: Replace hard assignments with soft assignments. Soft Assignment Update: γ_nk := π_k Norm(x_n | µ_k, Σ_k) / Σ_{l=1}^K π_l Norm(x_n | µ_l, Σ_l), N_k := Σ_{n=1}^N γ_nk. Parameter Updates: π = (N_1/N, …, N_K/N), µ_k = (1/N_k) Σ_{n=1}^N γ_nk x_n, Σ_k = (1/N_k) Σ_{n=1}^N γ_nk (x_n − µ_k)(x_n − µ_k)^⊤

Lower Bound on Log Likelihood
log p(x | θ)
= Σ_z q(z) log p(x | θ)   (multiplication by 1 = Σ_z q(z))
= Σ_z q(z) log [ p(x | θ) q(z) / q(z) ]   (multiplication by 1 = q(z)/q(z))
= Σ_z q(z) log [ p(x, z | θ) q(z) / ( q(z) p(z | x, θ) ) ]   (Bayes' rule: p(x | θ) = p(x, z | θ) / p(z | x, θ))
= Σ_z q(z) log [ p(x, z | θ) / q(z) ] + Σ_z q(z) log [ q(z) / p(z | x, θ) ]
= L(q, θ) + KL( q(z) ‖ p(z | x, θ) ) ≥ L(q, θ)

Gaussian Mixture Model Generative Model: z_n ∼ Discrete(π), x_n | z_n = k ∼ Norm(µ_k, Σ_k). Expectation Maximization: Initialize θ. Repeat until convergence: 1. Expectation Step 2. Maximization Step
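
A NumPy/SciPy sketch of this EM loop for a Gaussian mixture, directly mirroring the soft K-means updates above (a minimal illustration with made-up variable names, not the course's reference code):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iters=50, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # mixture weights
    mu = X[rng.choice(N, size=K, replace=False)]  # initialize means at random points
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] = p(z_n = k | x_n, theta)
        log_r = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: weighted versions of the K-means parameter updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma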

GMM Advantages / Disadvantages [Figure 9.5 from Bishop, PRML: 500 points drawn from a mixture of 3 Gaussians. (a) Samples from the joint distribution p(z)p(x|z), coloured red, green, and blue by mixture component (complete data). (b) The corresponding samples from the marginal p(x), obtained by ignoring z (incomplete data). (c) The same samples coloured by the responsibilities γ(z_nk), plotted using proportions of red, blue, and green ink.]

GMM Advantages / Disadvantages + Works with overlapping clusters + Works with clusters of different densities + Same complexity as K-means - Can get stuck in local maximum - Need to set number of components

Model Selection Strategy 1: Cross-validation Split data into K folds. For each fold k: Perform EM to learn θ from the training set X_train; Calculate the test set likelihood p(X_test | θ)
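
A sketch of this strategy, assuming scikit-learn's GaussianMixture as the EM implementation (the fold scheme and candidate values here are illustrative):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.mixture import GaussianMixture

def select_num_components(X, candidate_Ks=(1, 2, 3, 4, 5), n_folds=5, seed=0):
    scores = {}
    for K in candidate_Ks:
        fold_scores = []
        for train_idx, test_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
            gmm = GaussianMixture(n_components=K, random_state=seed).fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))  # mean held-out log-likelihood per point
        scores[K] = np.mean(fold_scores)
    return max(scores, key=scores.get), scores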

Latent Dirichlet Allocation

Word Mixtures Generative model. Idea: Model text as a mixture over words (ignore order). [Figure adapted from Blei's latent Dirichlet allocation (LDA) article: example topics such as gene/dna/genetic, life/evolve/organism, brain/neuron/nerve, data/number/computer, each a distribution over words. Simple intuition: documents exhibit multiple topics.]

EM for Word Mixtures Generative Model E-step: Update assignments M-step: Update parameters

Topic Modeling [Figure adapted from Blei: Topics (gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; data 0.02, number 0.02, computer 0.01), Documents, and per-document Topic proportions and assignments, with each word in the document coloured by its topic.] Each topic is a distribution over words. Each document is a mixture over topics. Each word is drawn from one topic distribution.

EM for Topic Models (PLSI/PLSA*) Generative Model E-step: Update assignments M-step: Update parameters *(Probabilistic Latent Semantic Indexing, a.k.a. Probabilistic Latent Semantic Analysis)
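
The update equations did not survive transcription; as a sketch, the standard PLSA EM updates over a document-term count matrix look like this (NumPy; variable names are my own):

import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    # counts: (D, V) document-term count matrix n(d, w)
    D, V = counts.shape
    rng = np.random.default_rng(seed)
    p_z_d = rng.dirichlet(np.ones(K), size=D)   # p(z | d), shape (D, K)
    p_w_z = rng.dirichlet(np.ones(V), size=K)   # p(w | z), shape (K, V)
    for _ in range(n_iters):
        # E-step: update assignments q(z | d, w) proportional to p(z | d) p(w | z)
        q = p_z_d[:, None, :] * p_w_z.T[None, :, :]      # shape (D, V, K)
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step: update parameters from expected counts n(d, w) q(z | d, w)
        expected = counts[:, :, None] * q
        p_w_z = expected.sum(axis=0).T                   # (K, V)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = expected.sum(axis=1)                     # (D, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z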

Latent Dirichlet Allocation (a.k.a. PLSI/PLSA with priors) [Plate diagram: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment Z_{d,n} → observed word W_{d,n} ← topics β_k ← topic parameter η, with plates over the N words in a document, the D documents, and the K topics.]

Community Detection

Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness) [Figure: example network with edge betweenness values.] Repeat until k clusters found: 1. Calculate betweenness 2. Remove edge(s) with highest betweenness (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Girvan-Newman Algorithm (hierarchical divisive clustering according to betweenness) [Figure: successive edge removals (Step 1, Step 2, Step 3) yield a hierarchical decomposition of the network.] (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
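
A sketch of this loop using the NetworkX library (assuming an undirected graph; edge_betweenness_centrality and connected_components are standard NetworkX calls):

import networkx as nx

def girvan_newman(G, k):
    # Hierarchical divisive clustering: repeatedly remove the highest-betweenness
    # edge(s) until the graph falls apart into k connected components.
    G = G.copy()
    while nx.number_connected_components(G) < k and G.number_of_edges() > 0:
        # 1. Calculate betweenness for every edge
        betweenness = nx.edge_betweenness_centrality(G)
        # 2. Remove the edge(s) with highest betweenness
        max_b = max(betweenness.values())
        for edge, b in list(betweenness.items()):
            if b == max_b:
                G.remove_edge(*edge)
    return [set(c) for c in nx.connected_components(G)]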

Calculating Betweenness Step 1. Count the number of shortest paths from the starting node to each node (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Calculating Betweenness [Figure: credit propagation example, e.g. credit split 1:2 between parents according to the number of shortest paths through each.] Step 2. Propagate credit upwards, splitting according to number of paths to parents (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Determining the Number of Communities Hierarchical decomposition Choosing a cut-off Analogous problem to deciding on number of clusters in hierarchical clustering (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Modularity Idea: Compare the fraction of edges within a module to the fraction that would be observed for random connections. Ingredients: the adjacency matrix, the node degrees, and the node-to-module assignment (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
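
The standard modularity score these ingredients combine into (a reference formula, not transcribed from the slide), with A the adjacency matrix, k_i the node degrees, c_i the community assignments, and m the total number of edges:

\[
Q \;=\; \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)
\]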

Modularity Use modularity to optimize connectivity within modules (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Minimum Cuts Minimum Cut: y* = argmin_{y ∈ {0,1}^n} Σ_{(i,j)∈E} (y_i − y_j)². Problem: Can't enumerate all choices y_1, …, y_n (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Laplacian Matrix

     1   2   3   4   5   6
1    3  -1  -1   0  -1   0
2   -1   2  -1   0   0   0
3   -1  -1   3  -1   0   0
4    0   0  -1   3  -1  -1
5   -1   0   0  -1   3  -1
6    0   0   0  -1  -1   2

Difference of Degree and Adjacency Matrix: L = D − A (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Eigenvectors of the Laplacian [Same Laplacian matrix as on the previous slide.] Properties of the Laplacian: real-valued, symmetric; rows/columns sum to 0 (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Second Eigenvector (Fiedler Vector) The second-smallest eigenvalue is related to the cut: λ_2 = min_{x ⊥ 1, ‖x‖ = 1} Σ_{(i,j)∈E} (x_i − x_j)² (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Minimum Cuts Minimum Cut: y* = argmin_{y ∈ {0,1}^n} Σ_{(i,j)∈E} (y_i − y_j)². Solution: use the sign of the Fiedler vector (Adapted from: Mining of Massive Datasets, http://www.mmds.org)
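
A NumPy sketch of this sign-of-Fiedler-vector split, with the graph given as an adjacency matrix (names are illustrative):

import numpy as np

def fiedler_partition(A):
    # A: symmetric (n, n) adjacency matrix of an undirected graph
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A              # Laplacian: degree matrix minus adjacency matrix
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    fiedler = eigvecs[:, 1]               # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0                   # cut assignment by sign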

Normalized Cuts [Figure: example graph with the minimum cut and the optimal cut marked.] Problem: the minimal cut is not necessarily a good splitting criterion (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Solving Normalized Cuts [Figure: the same graph with the minimum cut and the optimal cut marked.] Solve using the Normalized Laplacian (for derivation see: Shi & Malik, IEEE TPAMI, 2000) (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

Example: Spectral Partitioning [Plot: value of x_2, the second eigenvector component for each node, against rank in x_2.] (Adapted from: Mining of Massive Datasets, http://www.mmds.org)

k-way Spectral Clustering Example: Clustering with 2 eigenvectors

Link Analysis

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) PageRank: Recursive Formulation [Figure: page j receives r_i/3 from page i (3 out-links) and r_k/4 from page k (4 out-links), and passes r_j/3 along each of its 3 out-links.] r_j = r_i/3 + r_k/4. A link's vote is proportional to the importance of its source page. If page j with importance r_j has n out-links, each link gets r_j/n votes. Page j's own importance is the sum of the votes on its in-links.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) Equivalent Formulation: Random Surfer [Same figure: r_j = r_i/3 + r_k/4.] At time t a surfer is on some page i. At time t+1 the surfer follows a link to a new page at random. Define rank r_i as the fraction of time spent on page i.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) PageRank: Problems 1. Dead Ends: nodes with no outgoing links. Where do surfers go next? 2. Spider Traps: a subgraph with no outgoing links to the wider graph. Surfers are trapped with no way out.

(Adapted from: Mining of Massive Datasets, http://www.mmds.org) Solution: Random Teleports Model for the teleporting random surfer: At time t = 0 pick a page at random. At each subsequent time t: with probability β follow an outgoing link at random; with probability 1−β teleport to a new location at random. PageRank Equation [Page & Brin 1998]: r_j = Σ_{i→j} β r_i/d_i + (1−β)/N
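
A power-iteration sketch of this equation (β = 0.85 is a common choice, not from the slide; dead ends are handled here by teleporting from them uniformly):

import numpy as np

def pagerank(adj, beta=0.85, n_iters=100):
    # adj[i, j] = 1 if page i links to page j
    N = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-stochastic surfer matrix: follow a random out-link,
    # or jump uniformly at random from a dead end.
    M = np.where(out_deg[:, None] > 0, adj / np.maximum(out_deg, 1)[:, None], 1.0 / N)
    r = np.full(N, 1.0 / N)
    for _ in range(n_iters):
        r = beta * (M.T @ r) + (1 - beta) / N  # PageRank equation with random teleports
    return r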

PageRank: Extensions Topic-specific PageRank: Restrict teleportation to some set S of pages related to a specific topic. Set p_i = 1/|S| if i ∈ S, p_i = 0 otherwise. Trust Propagation: Use a set S of trusted pages as the teleport set

Recommender Systems

The Long Tail (from: https://www.wired.com/2004/10/tail/)

Problem Setting Task: Predict user preferences for unseen items Content-based filtering: Use user/item features Collaborative filtering: Use similarity in ratings

Neighborhood Based Methods (user, user) similarity predict rating based on average from k-nearest users good if item base is smaller than user base good if item base changes rapidly (item,item) similarity predict rating based on average from k-nearest items good if the user base is smaller than item base good if user base changes rapidly

(item, item) similarity Empirical estimate of the Pearson correlation coefficient: ρ̂_ij = Σ_{u∈U(i,j)} (r_ui − b_ui)(r_uj − b_uj) / √( Σ_{u∈U(i,j)} (r_ui − b_ui)² · Σ_{u∈U(i,j)} (r_uj − b_uj)² ). Regularize towards 0 for small support: s_ij = |U(i,j)| / (|U(i,j)| + λ) · ρ̂_ij. Regularize towards the baseline for small neighborhoods.
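
A sketch of this shrunk item-item correlation with ratings stored as a dict of dicts (λ is the shrinkage constant; all names here are mine):

import numpy as np

def shrunk_item_similarity(ratings, baselines, i, j, lam=100.0):
    # ratings[u][i]: rating r_ui; baselines[u][i]: baseline prediction b_ui
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]  # U(i, j)
    if not users:
        return 0.0
    di = np.array([ratings[u][i] - baselines[u][i] for u in users])
    dj = np.array([ratings[u][j] - baselines[u][j] for u in users])
    rho = (di @ dj) / np.sqrt((di @ di) * (dj @ dj) + 1e-12)  # empirical Pearson estimate
    return len(users) / (len(users) + lam) * rho              # shrink towards 0 for small support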

Matrix Factorization [Example: a sparsely observed row of ratings for Moonrise Kingdom (4, 5, 4, 4, …) and its low-dimensional latent-factor representation.] Idea: pose as (biased) matrix factorization problem

Alternating Least Squares [Figure: the sparsely observed ratings matrix is approximated by a product of item factors X and user factors W.] Alternate between: (regress x_i given W) and (regress w_u given X)
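
A minimal alternating least squares sketch on a dense rating matrix with missing entries stored as NaN (the ridge penalty λ is an assumption; biases are omitted for brevity):

import numpy as np

def als(R, K=10, lam=0.1, n_iters=20, seed=0):
    # R: (n_users, n_items) ratings with np.nan where unobserved
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((n_users, K))  # user factors w_u
    X = 0.1 * rng.standard_normal((n_items, K))  # item factors x_i
    observed = ~np.isnan(R)
    for _ in range(n_iters):
        # Regress each w_u on the factors of the items that user rated
        for u in range(n_users):
            idx = observed[u]
            if idx.any():
                Xi = X[idx]
                W[u] = np.linalg.solve(Xi.T @ Xi + lam * np.eye(K), Xi.T @ R[u, idx])
        # Regress each x_i on the factors of the users who rated that item
        for i in range(n_items):
            idx = observed[:, i]
            if idx.any():
                Wu = W[idx]
                X[i] = np.linalg.solve(Wu.T @ Wu + lam * np.eye(K), Wu.T @ R[idx, i])
    return W, X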

Ratings are not given at random Netflix ratings Yahoo! music ratings Yahoo! survey answers

Ratings are not given at random [Figure: two users × movies matrices — the observed ratings r_ui, used as matrix factorization data, and the observation indicators c_ui, used as regression data.]

Improvements [Plot: "Factor models: Error vs. #parameters" — RMSE (roughly 0.875 to 0.91) against millions of parameters for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Add biases: do SGD, but also learn the biases μ, b_u and b_i
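
A sketch of one such SGD update for a single observed rating under the biased model r_ui ≈ μ + b_u + b_i + w_u · x_i (the learning rate and regularization values are illustrative):

import numpy as np

def sgd_step(r_ui, mu, b_u, b_i, w_u, x_i, lr=0.005, lam=0.02):
    # w_u, x_i: latent factor vectors (NumPy arrays); b_u, b_i: scalar biases
    err = r_ui - (mu + b_u + b_i + w_u @ x_i)
    # L2-regularized gradient steps on the biases and both factor vectors
    b_u_new = b_u + lr * (err - lam * b_u)
    b_i_new = b_i + lr * (err - lam * b_i)
    w_u_new = w_u + lr * (err * x_i - lam * w_u)
    x_i_new = x_i + lr * (err * w_u - lam * x_i)
    return b_u_new, b_i_new, w_u_new, x_i_new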

Improvements [Same plot: "Factor models: Error vs. #parameters" for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Who rated what: account for the fact that ratings are not missing at random.

Improvements [Same plot: "Factor models: Error vs. #parameters" for NMF, BiasSVD, SVD++, SVD v.2, SVD v.3, SVD v.4.] Temporal effects.

As with the Midterm: Exam questions will be conceptual (and range from straightforward to hard). You may bring notes, slide printouts and textbooks. You may not use any internet-enabled electronics. The exam is designed to have a median score of 75/100 (though this is not an exact science).