Knowledge Discovery and Data Mining 1 (VO) (707.003)
1 Knowledge Discovery and Data Mining 1 (VO) (707.003). Probabilistic Latent Semantic Analysis. Denis Helic, KTI, TU Graz, Jan 16, 2014
2 Big picture: KDDM. Figure: the knowledge discovery process (preprocessing, transformation, data mining) builds on mathematical tools (probability theory, linear algebra, information theory, statistical inference) and on infrastructure (Map-Reduce).
3 Outline: 1. Introduction and Recap, 2. Probabilistic Generative Models, 3. Topic Models, 4. Probabilistic Latent Semantic Analysis.
4 Introduction and Recap. Short recap: SVD and LSA. Singular Value Decomposition: let $M \in \mathbb{R}^{m \times n}$ be a matrix and let r be the rank of M (the rank of a matrix is the largest number of linearly independent rows or columns). Then we can find matrices U, V, and Σ with the following properties: $U \in \mathbb{R}^{m \times r}$ is a column-orthonormal matrix, $V \in \mathbb{R}^{n \times r}$ is a column-orthonormal matrix, and $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix. The matrix M can then be written as $M = U \Sigma V^T$.
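A minimal NumPy sketch of such a decomposition (the matrix values are made up for illustration):

```python
import numpy as np

# Small illustrative matrix (values are made up)
M = np.array([[1.0, 1.0, 0.0],
              [3.0, 3.0, 0.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 4.0]])

# Reduced ("economy") SVD: with full_matrices=False the shared dimension is
# min(m, n); keeping only the nonzero singular values gives the rank-r form.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
Sigma = np.diag(s)

print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # U is column-orthonormal
print(np.allclose(M, U @ Sigma @ Vt))            # M = U Sigma V^T
```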
5 Introduction and Recap. Short recap: SVD and LSA. Note that we always use V in its transposed form, so it is the rows of V^T that are orthonormal. Σ is diagonal (all elements not on the main diagonal are 0), and its elements are called the singular values of M. Figure: the form of a singular-value decomposition, $M = U \Sigma V^T$ (figure from Mining Massive Datasets).
6 Introduction and Recap. Short recap: SVD and LSA. Let M be a utility matrix with people's ratings for movies. The rows of M are people, the columns of M are movies. The rows of U are people, the columns of U are concepts: U connects people to concepts. The rows of V^T are concepts, the columns of V^T are movies: V connects movies to concepts. Σ represents the importance of the concepts.
7 Introduction and Recap. Short recap: SVD and LSA. Let M be a term-document matrix with term occurrences in the documents. The rows of M are terms, the columns of M are documents. The rows of U are terms, the columns of U are concepts: U connects terms to concepts. The rows of V^T are concepts, the columns of V^T are documents: V connects documents to concepts. Σ represents the importance of the concepts.
8 Introduction and Recap. Short recap: SVD and LSA. Vector Space Model: documents are represented as term vectors and cosine similarity is used to compute scores. The Vector Space Model cannot cope with two classic problems arising in natural languages: synonymy (two words having the same meaning) and polysemy (one word having multiple meanings).
9 Introduction and Recap. Short recap: SVD and LSA. In latent semantic analysis (LSA), or latent semantic indexing (LSI), we use SVD to create a low-rank approximation of the term-document matrix. We select the k largest singular values and create an approximation M_k to the original matrix. We thus map each term/document to a k-dimensional space of concepts. These concepts are hidden (latent) in the collection; they represent the semantics of the terms and documents, e.g. the topics of terms and documents.
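A short sketch of forming the rank-k approximation by keeping only the k largest singular values (the toy term-document counts are made up):

```python
import numpy as np

def low_rank_approx(M, k):
    """Rank-k approximation M_k: keep only the k largest singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy term-document matrix: rows are terms, columns are documents
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 3.0],
              [0.0, 1.0, 0.0, 2.0]])

M_2 = low_rank_approx(M, k=2)
print(M_2.round(2))  # terms and documents now live in a 2-dimensional concept space
```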
10 Introduction and Recap. Short recap: SVD and LSA. By computing a low-rank approximation of the original term-document matrix, the SVD brings together terms with similar co-occurrences. Retrieval quality may actually be improved by the approximation! As we reduce k, recall improves. A value of k in the low hundreds tends to increase precision as well (this suggests that a suitable k addresses some of the challenges of synonymy). Retrieval is done by folding the query into the low-rank space.
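Folding can be sketched as follows: the query term vector q is mapped into the concept space as $\hat{q} = \Sigma_k^{-1} U_k^T q$ and compared with the documents there (toy data, assuming the usual LSI folding formula):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (counts are made up)
M = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 3.0],
              [0.0, 1.0, 0.0, 2.0]])

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Fold the query into the k-dimensional concept space: q_hat = Sigma_k^{-1} U_k^T q
q = np.array([1.0, 1.0, 0.0, 0.0])   # hypothetical query using the first two terms
q_hat = np.diag(1.0 / s_k) @ U_k.T @ q

# Rank documents by cosine similarity in the concept space
docs = Vt_k.T                        # one row per document
cos = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(cos.round(3))
```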
11 Introduction and Recap. Disadvantages of LSA. A statistical foundation is missing: SVD assumes normally distributed data, but term occurrences are not normally distributed. Still, it often works remarkably well! Why?
12 Introduction and Recap. Disadvantages of LSA. A statistical foundation is missing: SVD assumes normally distributed data, but term occurrences are not normally distributed. Still, it often works remarkably well! Why? Matrix entries are weighted (e.g. tf-idf) and those weighted entries may be approximately normally distributed.
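As an aside, one common variant of the tf-idf weighting mentioned above looks like this (toy counts, and just one of several tf-idf definitions):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents
counts = np.array([[2.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 3.0, 1.0]])

n_docs = counts.shape[1]
df = (counts > 0).sum(axis=1)                     # document frequency of each term
idf = np.log(n_docs / df)                         # inverse document frequency
tf = counts / counts.sum(axis=0, keepdims=True)   # term frequency, normalized per document

tfidf = tf * idf[:, None]                         # weighted entries replace the raw counts
print(tfidf.round(3))
```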
13 Probabilistic Generative Models. Recap: Model-based methods. Statistical inference is based on fitting a probabilistic model to the data. The idea is based on a probabilistic, or generative, model. Such models assign a probability to observing specific data examples, e.g. observing words in a text document. Generative models are a powerful method to encode specific assumptions about how unknown parameters interact to create data.
14 Probabilistic Generative Models. Recap: Generative models. How does a generative model work? It defines a conditional probability distribution over the data given a hypothesis, P(D|h). Given h, we generate data from the conditional distribution P(D|h). Generative models have many advantages; the main disadvantage is that fitting the models can be more complicated than an algorithmic approach.
15 Probabilistic Generative Models. Recap: Inference. (Statistical) inference is the reverse of the generation process. We are given some data D, e.g. a collection of documents, and we want to estimate the model, or more precisely the parameters of the hypothesis h, that are most likely to have generated the data. Figure: generation runs from h to D via P(D|h); inference runs in the opposite direction, from D back to h.
16 Probabilistic Generative Models. Recap: Naive Bayes document models. We discussed generative models in connection with Naive Bayes classification. We introduced the multinomial generative model and the Bernoulli generative model. In the multinomial model we assume that documents are generated from a multinomial distribution, i.e. the number of occurrences of terms in a document is a multinomial random variable. In the Bernoulli model we assume that documents are generated from a multivariate Bernoulli distribution. The distributions were conditioned on the document class.
17 Topic Models. Topic models. The document class is something that we observe in our data (at least in the training data). Other observable entities: documents and words. However, there are some entities which are present but not observable, i.e. they are hidden; they are latent, e.g. the concepts in LSA. Let us call those entities topics.
18 Topic Models. Topic models. A topic model is a probabilistic generative model that we can use to generate the observable data, i.e. documents and words. In the other direction we have inference: when we observe a specific data instance we can infer the model. Probabilistic model: we will have joint probability distributions, and typically we will work with conditional probability distributions.
19 Topic Models. Probabilistic topic models. Each document is a probability distribution over topics; the distribution over topics represents the essence, the body, or the gist of a given document. Each topic is a probability distribution over words. Topic "Education": school, students, education, university, ... Topic "Budget": million, finance, tax, program, ...
20 Topic Models. Document generation process. 1. For each document d choose a mixture of topics z. 2. For every word slot draw a topic from the mixture with probability p(z|d). 3. Then draw a word from that topic with probability p(w|z).
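A minimal sketch of this generative process (the topics, vocabulary, and probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["school", "students", "university", "million", "finance", "tax"]
# p(w|z): one word distribution per topic (rows sum to 1)
p_w_given_z = np.array([[0.40, 0.30, 0.25, 0.02, 0.02, 0.01],   # "Education" topic
                        [0.02, 0.02, 0.01, 0.35, 0.30, 0.30]])  # "Budget" topic
# p(z|d): the topic mixture of one hypothetical document
p_z_given_d = np.array([0.7, 0.3])

doc = []
for _ in range(10):                      # 10 word slots
    z = rng.choice(2, p=p_z_given_d)     # draw a topic from the document's mixture
    w = rng.choice(6, p=p_w_given_z[z])  # draw a word from that topic
    doc.append(vocab[w])
print(doc)
```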
21 Topic Models. Document generation process. Figure: figure from slides by Thomas Hofmann.
22 Probabilistic Latent Semantic Analysis. Document generation process. Figure: graphical model representations of models of discrete data, (b) the mixture of unigrams model and (c) the pLSI/aspect model, in plate notation with document d, topic variable z, word w, and plate sizes N and M; figure from the LDA paper by Blei et al.
23 Probabilistic Latent Semantic Analysis. Distributions. We are interested in the joint probability of the observable variables, p(d, w). However, we have a joint probability of the observable and the latent variables, p(d, w, z). Thus, we have to marginalize over z to obtain p(d, w):
$p(d, w) = \sum_z p(d, w, z) = \sum_z p(d, w \mid z)\, p(z)$
24 Probabilistic Latent Semantic Analysis. Recap: Conditional independence. Definition: suppose P(C) > 0. Events A and B are conditionally independent given C if:
$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)$
25 Probabilistic Latent Semantic Analysis. Distributions. We made the same assumption in Naive Bayes classification. Documents and words are conditionally independent given the topic:
$p(d, w \mid z) = p(d \mid z)\, p(w \mid z)$
$p(d, w) = \sum_z p(d \mid z)\, p(w \mid z)\, p(z)$
26 Probabilistic Latent Semantic Analysis. Distributions.
$p(d, w) = \sum_z p(d \mid z)\, p(w \mid z)\, p(z)$
This is the symmetric formulation of pLSA: we select a topic z, then with probability p(d|z) a document d, and then with probability p(w|z) the words for that document. We repeat the process for all documents.
27 Probabilistic Latent Semantic Analysis. Distributions. We can reformulate the last equation. Let us see what p(d, z) is, writing the joint probability in both orders:
$p(d, z) = p(z)\, p(d \mid z) = p(d)\, p(z \mid d)$
28 Probabilistic Latent Semantic Analysis. Distributions. We can now substitute this in the symmetric equation:
$p(d, w) = \sum_z p(d \mid z)\, p(w \mid z)\, p(z) = \sum_z p(z \mid d)\, p(w \mid z)\, p(d) = p(d) \sum_z p(z \mid d)\, p(w \mid z)$
29 Probabilistic Latent Semantic Analysis. Distributions. This is the asymmetric formulation. Thus, we first pick a document with p(d) and then select all words for that document from p(w|d), given by:
$p(d, w) = p(w \mid d)\, p(d), \qquad p(w \mid d) = \sum_z p(w \mid z)\, p(z \mid d)$
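A small numeric check that the symmetric and asymmetric formulations describe the same joint distribution (all parameter values are made up):

```python
import numpy as np

# Made-up parameters for 3 documents, 2 topics, 4 words
p_d = np.array([0.5, 0.3, 0.2])                   # p(d)
p_z_given_d = np.array([[0.9, 0.1],               # p(z|d), one row per document
                        [0.4, 0.6],
                        [0.2, 0.8]])
p_w_given_z = np.array([[0.5, 0.3, 0.1, 0.1],     # p(w|z), one row per topic
                        [0.1, 0.1, 0.4, 0.4]])

# Asymmetric formulation: p(d,w) = p(d) * sum_z p(z|d) p(w|z)
p_dw_asym = p_d[:, None] * (p_z_given_d @ p_w_given_z)

# Symmetric formulation: p(d,w) = sum_z p(z) p(d|z) p(w|z), with p(z) and p(d|z) via Bayes' rule
p_z = p_d @ p_z_given_d                           # p(z) = sum_d p(d) p(z|d)
p_d_given_z = (p_z_given_d * p_d[:, None]) / p_z  # p(d|z) = p(z|d) p(d) / p(z)
p_dw_sym = (p_d_given_z * p_z) @ p_w_given_z

print(np.allclose(p_dw_asym, p_dw_sym))           # True: both give the same p(d,w)
print(round(p_dw_asym.sum(), 6))                  # 1.0: a proper joint distribution
```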
30 Probabilistic Latent Semantic Analysis. pLSA decomposition.
$p(w_i \mid d_j) = \sum_{k=1}^{K} p(w_i \mid z_k)\, p(z_k \mid d_j)$
Figure: figure from slides by Josef Sivic.
31 Probabilistic Latent Semantic Analysis. pLSA comparison with SVD.
$p(d, w) = \sum_z p(w \mid z)\, p(z)\, p(d \mid z)$
Figure: the form of a singular-value decomposition, $M = U \Sigma V^T$ (figure from Mining Massive Datasets).
32 Probabilistic Latent Semantic Analysis. pLSA comparison with SVD. Word probabilities given topics, p(w|z): matrix U. Document probabilities given topics, p(d|z): matrix V. Topic probabilities, p(z): matrix Σ. Difference: in pLSA the values in all matrices are normalized and non-negative; they are probabilities.
33 Probabilistic Latent Semantic Analysis. Parameter inference. We will infer the parameters using the Maximum Likelihood Estimator (MLE). First, we need to write down the likelihood function. Let n(w_i, d_j) be the number of occurrences of word w_i in document d_j, and let p(w_i, d_j) be the probability of observing a single occurrence of word w_i in document d_j. Then, the probability of observing n(w_i, d_j) occurrences of word w_i in document d_j is given by:
$p(w_i, d_j)^{n(w_i, d_j)}$
34 Probabilistic Latent Semantic Analysis. Parameter inference. The probability of observing the complete document collection is then given by the product of the probabilities of observing every single word in every document with the corresponding number of occurrences. That is the likelihood:
$L = \prod_{i=1}^{m} \prod_{j=1}^{n} p(w_i, d_j)^{n(w_i, d_j)}$
Taking the logarithm gives the log-likelihood:
$\ell = \sum_{i=1}^{m} \sum_{j=1}^{n} n(w_i, d_j) \log p(w_i, d_j) = \sum_{i=1}^{m} \sum_{j=1}^{n} n(w_i, d_j) \log \Big( \sum_{l=1}^{K} p(w_i \mid z_l)\, p(z_l)\, p(d_j \mid z_l) \Big)$
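A small sketch of evaluating this log-likelihood for made-up counts and parameters:

```python
import numpy as np

# n[i, j] = number of occurrences of word w_i in document d_j (made-up counts)
n = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 2.0],
              [1.0, 1.0, 0.0]])

# Made-up pLSA parameters: 3 words, 3 documents, 2 topics
p_z = np.array([0.6, 0.4])              # p(z_l)
p_w_given_z = np.array([[0.6, 0.1],     # p(w_i|z_l), one column per topic (columns sum to 1)
                        [0.3, 0.2],
                        [0.1, 0.7]])
p_d_given_z = np.array([[0.5, 0.2],     # p(d_j|z_l), one column per topic (columns sum to 1)
                        [0.3, 0.3],
                        [0.2, 0.5]])

# p(w_i, d_j) = sum_l p(w_i|z_l) p(z_l) p(d_j|z_l)
p_wd = (p_w_given_z * p_z) @ p_d_given_z.T

log_lik = (n * np.log(p_wd)).sum()      # sum_ij n(w_i, d_j) log p(w_i, d_j)
print(round(log_lik, 3))
```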
35 Probabilistic Latent Semantic Analysis. EM algorithm. We cannot maximize this likelihood analytically because of the logarithm of the sum. A standard procedure is to use an algorithm called Expectation-Maximization (EM). This is an iterative method to estimate the parameters of models with latent variables. Each iteration consists of two steps: an expectation step (E) and a maximization step (M).
36 Probabilistic Latent Semantic Analysis. EM algorithm. In the E step we create a function for the expectation of the log-likelihood using the current parameter estimates. In the M step we compute the parameters which maximize this expected log-likelihood. These parameter estimates are then used to determine the distribution of the latent variables in the next E step. Let us illustrate the EM algorithm in the general case.
37 Probabilistic Latent Semantic Analysis. EM algorithm. We observe some data D generated by a probabilistic model with parameters θ and some latent variables z. We are interested in the likelihood of the data D given the parameters θ: p(D|θ). However, there exists a joint probability distribution of the data D and the latent variables z: p(D, z|θ). Thus, to obtain p(D|θ) we have to marginalize out z:
$p(D \mid \theta) = \sum_z p(D \mid z, \theta)\, p(z \mid \theta)$
38 Probabilistic Latent Semantic Analysis. EM algorithm. We are now interested in maximizing this likelihood, which is equivalent to maximizing the log-likelihood:
$\log p(D \mid \theta) = \log \Big( \sum_z p(D \mid z, \theta)\, p(z \mid \theta) \Big)$
Jensen's inequality for concave functions such as log gives us $E[f(x)] \le f(E[x])$.
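A tiny numeric illustration of Jensen's inequality for the concave log function, E[log X] ≤ log E[X] (values are made up):

```python
import numpy as np

x = np.array([0.5, 2.0, 4.0])    # values of a positive random variable X
p = np.array([0.2, 0.5, 0.3])    # their probabilities (sum to 1)

lhs = (p * np.log(x)).sum()      # E[log X]
rhs = np.log((p * x).sum())      # log E[X]
print(round(lhs, 3), "<=", round(rhs, 3), lhs <= rhs)
```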
39 Probabilistic Latent Semantic Analysis. EM algorithm. Introducing an arbitrary distribution q(z) over the latent variables:
$\log \Big( \sum_z p(D \mid z, \theta)\, p(z \mid \theta) \Big) = \log \Big( \sum_z q(z) \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \Big) = \log \Big( E_q\Big[ \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \Big] \Big)$
By Jensen's inequality this is greater than or equal to:
$\log \Big( E_q\Big[ \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \Big] \Big) \ge E_q\Big[ \log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \Big]$
40 Probabilistic Latent Semantic Analysis. EM algorithm. $E_q\big[ \log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \big]$ is then a lower bound on the log-likelihood. Thus, we can maximize this lower bound; the EM algorithm maximizes exactly this lower bound.
41 Probabilistic Latent Semantic Analysis. EM algorithm.
$E_q\Big[ \log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} \Big] = \sum_z q(z) \log \frac{p(D \mid z, \theta)\, p(z \mid \theta)}{q(z)} = \sum_z q(z) \log \frac{p(z \mid D, \theta)\, p(D \mid \theta)}{q(z)}$
$= \sum_z q(z) \log p(D \mid \theta) + \sum_z q(z) \log \frac{p(z \mid D, \theta)}{q(z)} = \log p(D \mid \theta) - \sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)}$
This lower bound attains its maximum when $\sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)} = 0$, which is the case when $q(z) = p(z \mid D, \theta)$.
42 Probabilistic Latent Semantic Analysis. EM algorithm. p(z|D, θ) is the posterior of z, and $\sum_z q(z) \log \frac{q(z)}{p(z \mid D, \theta)}$ is the Kullback-Leibler (KL) divergence between q(z) and that posterior. Thus, in the E step we use the current values of the parameters to calculate the posterior of z. The M step is then problem dependent.
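A short sketch of that KL divergence term, showing that it vanishes exactly when q(z) equals the posterior (the distributions are made up):

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = sum_z q(z) log(q(z) / p(z)); non-negative, zero iff q == p."""
    return float((q * np.log(q / p)).sum())

posterior = np.array([0.5, 0.3, 0.2])      # a hypothetical p(z|D, theta)
q = np.array([0.7, 0.2, 0.1])              # some other candidate q(z)

print(round(kl(q, posterior), 4))          # > 0: the bound is not tight
print(round(kl(posterior, posterior), 4))  # 0.0: choosing q(z) = p(z|D, theta) makes it tight
```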
43 Probabilistic Latent Semantic Analysis. EM algorithm for pLSA. E step (posterior of the topic for each document-word pair):
$p(z \mid w, d) = \frac{p(z, w, d)}{p(w, d)} = \frac{p(d)\, p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(d)\, p(z' \mid d)\, p(w \mid z')} = \frac{p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(z' \mid d)\, p(w \mid z')}$
M step (re-estimate the parameters from the expected counts):
$p(w \mid z) \propto \sum_d n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_w n(d, w)\, p(z \mid d, w)$
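A compact, self-contained EM sketch for pLSA along these lines (illustrative code written for this note, not the lecture notebook referenced on the next slide; the toy counts are made up):

```python
import numpy as np

def plsa_em(n, K, iters=100, seed=0):
    """Minimal pLSA EM.

    n: (D, W) array with n[d, w] = count of word w in document d.
    Returns p(z|d) of shape (D, K) and p(w|z) of shape (K, W).
    """
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_z_given_d = rng.random((D, K)); p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    p_w_given_z = rng.random((K, W)); p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

    for _ in range(iters):
        # E step: posterior p(z|d,w) proportional to p(z|d) p(w|z)
        post = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]      # shape (D, K, W)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M step: re-estimate parameters from the expected counts n(d,w) p(z|d,w)
        weighted = n[:, None, :] * post                               # shape (D, K, W)
        p_w_given_z = weighted.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = weighted.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12

    return p_z_given_d, p_w_given_z

# Toy corpus: 4 documents over a 6-word vocabulary (counts are made up)
n = np.array([[4, 3, 2, 0, 0, 0],
              [3, 4, 1, 0, 1, 0],
              [0, 0, 1, 3, 4, 2],
              [0, 1, 0, 4, 3, 3]], dtype=float)

p_z_given_d, p_w_given_z = plsa_em(n, K=2, iters=200)
print(p_z_given_d.round(2))  # each document's topic mixture p(z|d)
print(p_w_given_z.round(2))  # each topic's word distribution p(w|z)
```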
44 Probabilistic Latent Semantic Analysis. Example. IPython Notebook examples; slightly modified code from: http://kti.tugraz.at/staff/denis/courses/kddm1/plsa.ipynb. Command line: ipython notebook --pylab=inline plsa.ipynb
45 Probabilistic Latent Semantic Analysis. Example. Table: user-movie rating matrix with the users Joe, Jim, John, Jack, Jill, Jenny, and Jane as rows and the movies Alien, Star Wars, Casablanca, and Titanic as columns.
46 Probabilistic Latent Semantic Analysis. Example. Table: the same user-movie rating matrix (users Joe, Jim, John, Jack, Jill, Jenny, Jane; movies Alien, Star Wars, Casablanca, Titanic).
47 Probabilistic Latent Semantic Analysis. Example. Figure: eight selected factors ("segment 1", "segment 2", "matrix 1", "matrix 2", "line 1", "line 2", "power 1", "power 2") from a 128-factor decomposition; the displayed word stems are the 10 most probable words in the class-conditional distribution P(w|z), from top to bottom in descending order. For two example documents containing the word "segment" the posteriors differ: P(z_k | d_1, w = "segment") = (0.951, 0.0001, ...) and P(z_k | d_2, w = "segment") = (0.025, 0.867, ...). From Hofmann, 2000.
48 Probabilistic Latent Semantic Analysis. Performance. In IR, the performance of a retrieval system based on this model (PLSI) was typically found superior to both the vector space model (cosine similarity) and non-probabilistic latent semantic indexing (LSI). (We skip the details here.) From Th. Hofmann, 2000. Figure: from Hofmann, 2000.