CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year's slides


1 CSE 494/598 Lecture-6: Latent Semantic Indexing LYDIA MANIKONDA HTTP:// / **Content adapted from last year's slides

2 Announcements
Homework-1 and Quiz-1
Project part-2 released
Important dates and activities:
03/04/2016: Midterm 1
03/18/2016: Project part 2
04/22/2016: Project part 3
05/02/2016 or 05/04/2016: Final Exam

3 Today: Latent Semantic Indexing

4 Correlation Analysis: correlation matrices, normalized correlation matrices, association clusters, scalar clusters

5 Terms and Documents as mutually dependent vectors
[Figure: term-document matrix with document vectors labeled a through i and the terms Interface, User, System, Human, Computer, Response, Time, EPS, Survey, Trees, Graph, Minors; cell values not recoverable.]
Instead of doc-doc similarity, we can compute term-term similarity. If terms are independent, the term-term similarity matrix should be diagonal (*). If it is not diagonal, we use the correlations to add related terms to the query. But we can also ask the question: are there independent dimensions that define the space where terms and docs are vectors?
(*) Note that the ij-th element of the term-term matrix is the dot product of the i-th term vector and the j-th term vector.
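
The footnote on the term-term matrix can be sketched in a few lines. This is only an illustration: the tiny document-term matrix and its term labels are made up, not the values from the slide's figure.

```python
import numpy as np

# Hypothetical document-term matrix: rows are documents, columns are terms.
terms = ["human", "interface", "computer", "user"]
dt = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

# Entry (i, j) is the dot product of term vector i and term vector j,
# where a term vector is a column of dt.
tt = dt.T @ dt

# If the terms were independent, every off-diagonal entry would be zero.
is_diagonal = np.allclose(tt, np.diag(np.diag(tt)))

print(tt)
print("term-term matrix is diagonal:", is_diagonal)
```

Here the off-diagonal entries are non-zero wherever two terms share a document, which is the signal used to add related terms to a query.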

6 Beyond Correlation Analysis: PCA/LSI
Suppose I start with documents described in terms of just 2 keywords, u and v.
1. Add a bunch of new keywords (2u+3v; 4u+v) and give the new document matrix. Will you be able to tell that the documents are really 2-D (there are only two independent keywords)?
2. Suppose we add a bit of noise to each of the new terms in the above scenario. Can you now discover that the documents are really 2-D?
3. Suppose I remove the original keywords u and v from the document-term matrix and give you only the new linearly dependent keywords. Can you now tell that the documents are 2-D?

7 Beyond Correlation Analysis
Notice that in the last case, the true dimensions of the data are not even present in the representation! You have to re-discover the true dimensions as linear combinations of the given dimensions. This means that the current terms are themselves vectors in the original space.
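
The three scenarios can be tried numerically. The sketch below is an illustration under assumptions: the documents are random, the third derived keyword (u - 2v) is an extra one invented here, and the noise level of 0.01 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 hypothetical documents described by two independent keywords u and v.
uv = rng.random((20, 2))
u, v = uv[:, 0], uv[:, 1]

# Scenario 3: only derived keywords are given; u and v themselves are gone.
# 2u+3v and 4u+v come from the slide; u-2v is an extra one added for this sketch.
derived = np.column_stack([2 * u + 3 * v, 4 * u + v, u - 2 * v])

# Three columns, but only two independent directions.
hidden_rank = np.linalg.matrix_rank(derived)

# Scenario 2: a little noise makes the matrix full rank, yet the singular
# values still show two dominant directions and one that is nearly zero.
noisy = derived + 0.01 * rng.standard_normal(derived.shape)
s = np.linalg.svd(noisy, compute_uv=False)

print("rank without noise:", hidden_rank)
print("singular values with noise:", s)
```

The gap between the second and third singular values is what lets the 2-D structure be recovered even when the rank test alone no longer reports 2.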

8 PCA/LSI
The fact that keywords in the documents are not actually independent, and that they have synonymy and polysemy among them, often manifests itself as if some malicious oracle had mixed up the data as above. We need dimensionality reduction techniques.
If the keyword dependence is only linear (as above), a general polynomial-complexity technique called Principal Components Analysis (PCA) is able to do this dimensionality reduction. PCA applied to documents is called Latent Semantic Indexing (LSI).
If the dependence is nonlinear, you need nonlinear dimensionality reduction techniques (such as neural networks), which are much costlier.

9 Visual Example: Data on Fish
[Figure: scatter plot with axes Length and Height.]

10 Move the origin to the centroid of the data. But are these the best axes?

11 Better if one axis accounts for most of the data variation and each axis is orthogonal to the others. What should we call the red axis? Size (a "factor").

12 Reduce Dimensions
What if we only consider size? We retain 1.75/2.00 x 100 = 87.5% of the original variation. Thus, by discarding the yellow axis we lose only 12.5% of the original information.
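
The 1.75 out of 2.00 split is what you would get from two standardized variables (each with variance 1) whose correlation is 0.75. That correlation value is an assumption made here to reproduce the slide's numbers; the slide itself does not state it.

```python
import numpy as np

# Assumed correlation of 0.75 between standardized length and height,
# chosen only because it reproduces the 1.75 / 2.00 split on the slide.
corr = np.array([[1.0, 0.75],
                 [0.75, 1.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(corr)

# Variance along the major axis divided by the total variance.
retained = eigvals[-1] / eigvals.sum()

print("variance along each new axis:", eigvals)
print("fraction retained by the major axis:", retained)
print("major axis direction:", eigvecs[:, -1])
```

The major axis weights length and height equally, which fits the slide's reading of it as a "size" factor.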

13 If you can do it for fish, why not for docs?
We have documents as vectors in the space of terms. We want to transform the axes so that the new axes are:
Orthonormal (independent axes). Notice that the new fish axes are uncorrelated.
Ordered in terms of the amount of variation in the documents they capture.
Pick the top K dimensions (axes) in this ordering, and use these new K dimensions to do the vector-space similarity ranking.
Why? It can reduce noise, eliminate dependent variables, and capture synonymy and polysemy.
How? SVD (Singular Value Decomposition).

14 SVD is the Solution, but what is the problem?
The rank of a matrix M is defined as the size of the largest square sub-matrix of M that has a non-zero determinant. The rank of M is also equal to the number of non-zero singular values it has. The rank of M is related to the true dimensionality of M: if you add a bunch of rows to M that are linear combinations of the existing rows of M, the rank of the new matrix is still the same as the rank of M.
The distance between two equi-sized matrices M and M', written $\|M - M'\|$, is defined as the sum of the squares of the differences between the corresponding entries, $\sum_{u,v} (m_{uv} - m'_{uv})^2$. It is zero when M = M'.
What we want to do: given M of rank R, find a matrix M' of rank R' < R such that $\|M - M'\|$ is the smallest (it would be zero if the oracle added no noise beyond linear combinations).
Using calculus, it can be shown that the solution is related to eigen decomposition; more specifically, to the Singular Value Decomposition (SVD) of a matrix.
The SVD of a matrix dt is three matrices df, ff, tf such that dt = df*ff*tf':
df is the eigenvectors of dt*dt'
tf is the eigenvectors of dt'*dt
ff is a diagonal matrix whose diagonal values are the positive square roots of the eigenvalues of dt*dt' or dt'*dt
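
The relationship between the three SVD factors and the eigen decompositions can be checked numerically. This is a minimal sketch on a random matrix; the variable names follow the slide's df, ff, tf, and `tf_t` is this sketch's name for tf transposed, which is what NumPy returns.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = rng.random((5, 3))          # hypothetical 5 documents x 3 terms

# NumPy returns df, the singular values, and tf' (already transposed).
df, sing, tf_t = np.linalg.svd(dt, full_matrices=False)
ff = np.diag(sing)

# The three factors multiply back to the original matrix.
reconstructed = df @ ff @ tf_t

# Singular values are the positive square roots of the eigenvalues of
# dt' * dt (eigvalsh returns ascending order, so reverse it).
eig_tt = np.linalg.eigvalsh(dt.T @ dt)[::-1]

print("max reconstruction error:", np.abs(reconstructed - dt).max())
print("squared singular values: ", sing ** 2)
print("eigenvalues of dt'*dt:   ", eig_tt)
```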

15 Facts about SVD
Relation between SVD and eigenvalue decomposition:
Eigenvalue decomposition is defined only for square matrices. Real symmetric matrices are guaranteed to have real-valued eigenvalues; other square matrices may not. PCA (principal component analysis) is normally done on correlation matrices, which are square and symmetric (think of d-d or t-t matrices).
SVD is defined for all matrices. Given a matrix dt, we consider the eigen decomposition of the correlation matrices d-d (dt*dt') and t-t (dt'*dt). The SVD is (1) the eigenvectors of d-d, (2) the positive square roots of the eigenvalues of d-d or t-t, and (3) the eigenvectors of t-t.
Both d-d and t-t are symmetric (they are correlation matrices), and they have the same non-zero eigenvalues. In general, $MM^T$ and $M^TM$ are different matrices (they coincide when M is symmetric), so their eigenvectors will in general be different, although their non-zero eigenvalues are the same.
Since SVD is defined in terms of the eigenvalues and eigenvectors of the correlation matrices of a matrix, the singular values are always real-valued (even if M is not symmetric).
In general, the SVD of a matrix M coincides with its eigen decomposition only if M is square, symmetric, and has no negative eigenvalues.
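
A short derivation shows why the two correlation matrices share eigenvalues but not eigenvectors. It assumes the standard SVD form $M = U D V^T$ with $U^T U = I$, $V^T V = I$, and $D$ diagonal; these symbol names are the usual convention and are assumptions of this sketch.

```latex
\begin{aligned}
M M^T &= (U D V^T)(U D V^T)^T = U D \,(V^T V)\, D\, U^T = U D^2 U^T \\
M^T M &= (U D V^T)^T (U D V^T) = V D \,(U^T U)\, D\, V^T = V D^2 V^T
\end{aligned}
```

Both right-hand sides are eigen decompositions with the same diagonal matrix $D^2$, so the non-zero eigenvalues agree and equal the squared singular values, while the eigenvectors are the columns of $U$ in one case and of $V$ in the other.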


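16 Overview of Latent Semantic Indexing
[Figure: the doc-term matrix factored into three matrices, then truncated to k factors.]
$dt_{(d \times t)} = df_{(d \times f)} \cdot ff_{(f \times f)} \cdot tf^{T}_{(f \times t)}$, truncated to $dt_k = df_k \cdot ff_k \cdot tf_k^{T}$ with shapes $(d \times k)(k \times k)(k \times t)$.
Equivalently, $A_{(m \times n)} = U_{(m \times r)} D_{(r \times r)} V^T_{(r \times n)}$ and $\hat{A}_k = U_k D_k V_k^T$ with shapes $(m \times k)(k \times k)(k \times n)$.
Singular Value Decomposition: convert the doc-term matrix into three matrices D-F, F-F, T-F, where DF*FF*TF' gives the original matrix back.
Reduce dimensionality: throw out the low-order rows and columns.
Recreate matrix: multiply to produce the approximate term-document matrix. dt_k is a rank-k matrix that is closest to dt.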
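
The truncate-and-multiply step can be sketched as follows. The helper name `rank_k_approx` and the random input are assumptions of this example. Reporting the loss as the discarded share of the squared singular values is one common convention, not necessarily the formula used on the slides.

```python
import numpy as np

def rank_k_approx(dt, k):
    """Rebuild dt from only its top k singular values and vectors."""
    df, sing, tf_t = np.linalg.svd(dt, full_matrices=False)
    return df[:, :k] @ np.diag(sing[:k]) @ tf_t[:k, :]

rng = np.random.default_rng(2)
dt = rng.random((6, 4))              # hypothetical 6 documents x 4 terms
dt_k = rank_k_approx(dt, 2)

# Sum of squared entry differences between dt and its rank-2 version.
dist = np.sum((dt - dt_k) ** 2)

# That distance equals the sum of the squared discarded singular values.
sing = np.linalg.svd(dt, compute_uv=False)
discarded = np.sum(sing[2:] ** 2)
loss_fraction = dist / np.sum(sing ** 2)

print("distance:", dist, "discarded:", discarded)
print("fraction of squared singular values lost:", loss_fraction)
```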

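17 t1 = database, t2 = sql, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
[Figure: printout of the SVD of the example doc-term matrix; numeric values not recoverable. Labeled parts: D-F, the eigenvectors of dd (dt*dt'), the principal document directions; F-F, the singular values (positive square roots of the eigenvalues of dd or tt); T-F, the eigenvectors of tt (dt'*dt), the principal term directions; the new document coordinates d-f*f-f; and a formula for the percentage loss at rank k in terms of the singular values.]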

18 t1 = database, t2 = sql, t3 = index, t4 = regression, t5 = likelihood, t6 = linear
For the database/regression example: suppose D1 is a new doc containing "database" 50 times and D2 contains "SQL" 50 times.

19 [Figure: the example at different ranks.] Rank=2: variance loss 7.5%. Rank= . Rank=4: variance loss 1.4%.

20 LSI Ranking
Given a query, either add the query as a document in the D-T matrix and do the SVD, or convert the query vector (separately) to the LSI space:
DFq*FF = q*TF; this is the weighted query document in LSI space.
Reduce dimensionality as needed, then do the vector-space similarity in the LSI space.
Derivation:
$DT = DF \cdot FF \cdot TF'$
$q = DF_q \cdot FF \cdot TF'$
$q \cdot TF = DF_q \cdot FF \cdot TF' \cdot TF$
$DF_q \cdot FF = q \cdot TF$ (since $TF' \cdot TF = I$)
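
One way to sanity-check the query conversion is to use an existing document as the query: its converted coordinates should match that document's own row of DF*FF. The matrix below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
dt = rng.random((5, 4))              # hypothetical 5 documents x 4 terms

df, sing, tf_t = np.linalg.svd(dt, full_matrices=False)
tf = tf_t.T

# Rows of DF*FF are the documents expressed in LSI space.
doc_coords = df * sing               # same as df @ np.diag(sing)

# Treat document 2 as if it were an incoming query.
q = dt[2]
q_coords = q @ tf                    # q * TF = DFq * FF

print("query in LSI space:   ", q_coords)
print("document 2 in LSI space:", doc_coords[2])
```

The step only works because the columns of TF are orthonormal, so TF'*TF is the identity.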

21 Using LSI
LSI can be used on the entire corpus:
First compute the SVD of the entire corpus.
Store the first k columns of the df*ff matrix, [df*ff]_k.
Keep the tf matrix handy.
When a new query q comes, take the first k columns of q*tf.
Compute the vector similarity between [q*tf]_k and all rows of [df*ff]_k, rank the documents, and return them.
LSI can also be used as a way of clustering the results returned by normal vector space ranking:
Take the top 50 or 100 documents returned by some ranking (e.g. vector ranking).
Do LSI on these documents.
Take the first k columns of the resulting [df*ff] matrix. Each row in this matrix is the representation of the original document in the reduced space.
Cluster the documents in this reduced space (we will talk about clustering later).
MANJARA did this. We will need fast SVD computation algorithms for this; the MANJARA folks developed approximate algorithms for SVD.
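
The corpus-wide procedure can be sketched as a single function. The name `lsi_rank`, the use of cosine similarity, and the toy counts over the six database/regression terms are all assumptions made for illustration; the counts are not the slides' data.

```python
import numpy as np

def lsi_rank(dt, q, k):
    """Rank documents (rows of dt) against query q in a k-dimensional LSI space."""
    df, sing, tf_t = np.linalg.svd(dt, full_matrices=False)
    doc_coords = df[:, :k] * sing[:k]      # first k columns of df*ff
    q_coords = q @ tf_t[:k, :].T           # first k columns of q*tf
    # Cosine similarity between the query and every document row.
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_coords)
    scores = doc_coords @ q_coords / np.where(norms == 0, 1.0, norms)
    return np.argsort(-scores), scores

# Columns: database, sql, index, regression, likelihood, linear (made-up counts).
dt = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],    # mentions database and index, but never sql
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
], dtype=float)

q = np.array([0, 1, 0, 0, 0, 0], dtype=float)   # query containing only "sql"
ranking, scores = lsi_rank(dt, q, k=2)

print("ranking:", ranking)
print("scores: ", np.round(scores, 3))
```

In this toy corpus the second document never contains "sql", yet it scores alongside the other database documents because it shares their latent dimension. A production system would compute the SVD once and cache `doc_coords` rather than redoing it per query.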

22 SVD Computation complexity
For an m x n matrix, SVD computation has $O(k m^2 n + k' n^3)$ complexity, with k = 4 and k' = 22 for the best algorithms. Approximate algorithms that exploit the sparsity of M are available (and being developed).

23 Summary: What LSI can do
LSI analysis effectively does dimensionality reduction, noise reduction, exploitation of redundant data, and correlation analysis and query expansion (with related words). Any one of the individual effects can be achieved with simpler techniques (see scalar clustering etc.), but LSI does all of them together.

24 LSI (dimensionality reduction) vs. Feature Selection
Before reducing dimensions, LSI first finds a new basis (coordinate axes) and then selects a subset of them.
Good, because the original axes may be too correlated to find top-k subspaces containing most of the variance.
Bad, because the new dimensions may not have any significance to the user. What are the two dimensions of the database example? Something like 0.44*database + 0.33*sql...
An alternative is to select a subset of the original features themselves.
The advantage is that the selected features are readily understandable by the users (to the extent they understood the original features).
One disadvantage is that, as we saw in the fish example, all the original dimensions may have about the same variance, while a (linear) combination of them might capture much more variation.
Another disadvantage is that, since original features (unlike LSI features) may be correlated, finding the best subset of k features is not the same as sorting individual features by the variance they capture and taking the top k (as we could do with LSI). The second feature we pick should be the one that is least correlated with the first one (Jadiel's point).

25 LSI as a special case of LDA
Dimensionality reduction (or feature selection) is typically done in the context of specific classification tasks. We want to pick dimensions (or features) that maximally differentiate across classes while having minimal variance within any given class.
When doing dimensionality reduction with respect to a classification task, we need to focus on dimensions that increase variance across classes and reduce variance within each class. Doing this is called LDA (linear discriminant analysis).
LSI as given is insensitive to any particular classification task and focuses only on data variance. LSI is a special case of LDA where each point defines its own class. This makes sense, since relevant vs. irrelevant documents are query dependent.
In the slide's figure (not reproduced here), the red line corresponds to the dimension with the most data variance. However, the green line corresponds to the axis that does a better job of capturing the class variance (assuming that the two different blobs correspond to the different classes).
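
The contrast between the red and green lines can be reproduced on synthetic data. This sketch uses Fisher's two-class discriminant as a stand-in for LDA. The two blobs, their spreads, and their offsets are invented here and are not the slide's data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

# Two hypothetical classes: wide spread along x, separated along y.
spread = np.array([3.0, 0.3])
class_a = rng.standard_normal((n, 2)) * spread + np.array([0.0, 1.0])
class_b = rng.standard_normal((n, 2)) * spread + np.array([0.0, -1.0])
data = np.vstack([class_a, class_b])

# Direction of most overall variance (what PCA/LSI would pick first).
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pca_dir = vt[0]

# Fisher's discriminant: inverse within-class scatter times the mean difference.
mean_a, mean_b = class_a.mean(axis=0), class_b.mean(axis=0)
within = ((class_a - mean_a).T @ (class_a - mean_a)
          + (class_b - mean_b).T @ (class_b - mean_b))
lda_dir = np.linalg.solve(within, mean_a - mean_b)
lda_dir /= np.linalg.norm(lda_dir)

print("max-variance direction:     ", np.round(pca_dir, 3))
print("class-separating direction: ", np.round(lda_dir, 3))
```

The max-variance direction lies almost along x, where the classes overlap completely, while the discriminant direction lies almost along y, where they separate.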

26 LSI vs. Nonlinear dimensionality reduction
LSI only captures linear correlations. It cannot capture nonlinear dependencies between the original dimensions. For example, if the data points all fall on a simple manifold (e.g. a circle), then the features are nonlinearly correlated (here $X^2 + Y^2 = c$), and LSI analysis can't reduce dimensionality.
One idea is to use techniques such as neural nets or manifold learning.
Another, simpler idea is to first blow up the dimensionality of the data by introducing new axes that are nonlinear combinations of the existing ones (e.g. $X^2$, $Y^2$, $\sqrt{XY}$, etc.). We can then capture linear correlations across these nonlinear dimensions by doing LSI in this enlarged space, and map the k important dimensions found back to the original space. So, in order to reduce dimensions, we first increase them (talk about crazy!). A way of doing this implicitly is the kernel trick. (Advanced; optional.)
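
The circle example can be checked directly. This sketch assumes points spaced evenly on a circle with c = 4 and uses only the squared features $X^2$ and $Y^2$ as the enlarged space. The helper name `centered_singular_values` is made up for this example.

```python
import numpy as np

# 200 points evenly spaced on the circle X^2 + Y^2 = c.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
c = 4.0
x = np.sqrt(c) * np.cos(theta)
y = np.sqrt(c) * np.sin(theta)

def centered_singular_values(data):
    """Singular values of the data after moving the origin to the centroid."""
    return np.linalg.svd(data - data.mean(axis=0), compute_uv=False)

# Original space: both directions carry equal variation, so nothing can be dropped.
s_orig = centered_singular_values(np.column_stack([x, y]))

# Enlarged space of squared features: X^2 + Y^2 = c is now a linear dependence.
s_lift = centered_singular_values(np.column_stack([x ** 2, y ** 2]))

print("singular values in (X, Y):    ", s_orig)
print("singular values in (X^2, Y^2):", s_lift)
```

In the lifted space the second singular value is numerically zero, so a linear method can discard one dimension there even though it could not in the original coordinates.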

27 Lessons Learned Today: Principal Component Analysis, dimensionality reduction, Latent Semantic Indexing, Singular Value Decomposition
