Lecture 19, November 19, 2012

Size: px

Start display at page:

Download "Lecture 19, November 19, 2012"

Mitchell Carroll
5 years ago
Views:

1 Machine Learning 0-70/5-78, Fall 0 Latent Space Analysis SVD and Topic Models Eric Xing Lecture 9, November 9, 0 Reading: Tutorial on Topic ACL Eric CMU, 006-0

2 We are inundated with data (from imagesgooglecn) Humans cannot afford to deal with (eg, search, browse, or measure similarity) a huge number of text and media documents We need computers to help out Eric CMU, 006-0

3 A task: Say, we want to have a mapping, so that Compare similarity Classify contents Cluster/group/categorize docs Distill semantics and perspectives Eric CMU,

4 Representation: Data: Bag of Words Representation As for the Arabian and Palestinean voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West bank (and most probably the same for Golan Heights) back to the Arabs An 8 months of "negotiations" in Madrid, and Washington proved these predictions Now many will jump on me saying why are you blaming israelis for no-result negotiations I would say why would the Arabs stall the negotiations, what do they have to loose? Arabian negotiations against peace Israel Arabs blaming Each document is a vector in the word space Ignore the order of words in a document Only count matters! A high-dimensional and sparse representation Not efficient text processing tasks, eg, search, document classification, or similarity measure Not effective for browsing Eric CMU,

5 Subspace analysis Document Term = * * X (m x n) T Λ D T (m x k) (k x k) (k x n) cluster/topic/bas A priori weights is Distributions s (subspace) Clustering: (0,) matrix LSI/NMF: arbitrary matrices Topic Models: stochastic matrix Sparse coding: arbitrary sparse matrices Memberships (coordinates) Eric CMU,

6 An example: Eric CMU,

7 Principal Component Analysis The new variables/dimensions Are linear combinations of the original ones Are uncorrelated with one another Orthogonal in original dimension space Capture as much of the original variance in the data as possible l Variable B Origina PC PC Are called Principal Components Orthogonal directions of greatest variance in data Projections along PC discriminate the data most along any one axis Original Variable A First principal component is the direction of greatest variability (covariance) in the data Second is the next orthogonal (uncorrelated) direction of greatest variability So first remove all the variability along the first component, and then find the next direction of greatest variability And so on Eric CMU,

8 Computing the Components Projection of vector x onto an axis (dimension) u is u T x Direction of greatest variability is that in which the average square of the projection is greatest: Maximize u T XX T u st u T u = Construct Langrangian u T XX T u λu T u Vector of partial derivatives set to zero xx T u λu = (xx T λi) u = 0 As u 0 then u must be an eigenvector of XX T with eigenvalue λ λ is the principal eigenvalue of the correlation matrix C= XX T The eigenvalue denotes the amount of variability captured along that dimension Eric CMU,

the similarities of the original variables based on how data samples project to them

9 Computing the Components Similarly for the next axis, etc So, the new axes are the eigenvectors of the matrix of correlations of the original variables, which captures the similarities of the original variables based on how data samples project to them Geometrically: centering followed by rotation Linear transformation Eric CMU,

10 Eigenvalues & Eigenvectors For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal Sv {, } = λ {, } v{, and λ λ v v = }, All eigenvalues of a real symmetric matrix are real 0 if S λi = 0 and S = T S λ R All eigenvalues of a positive semidefinite matrix are nonnegative w R n T, w Sw 0, then if Sv = λ v λ 0 Eric CMU,

11 Eigen/diagonal Decomposition Let be a square matrix with m linearly independent eigenvectors (a non-defective matrix) Theorem: Exists an eigen decomposition diagonal Unique for distinc t eigenvalues (cf matrix diagonalization theorem) Columns of U are eigenvectors of S Diagonal elements of are eigenvalues of Eric CMU, 006-0

12 PCs, Variance and Least-Squares The first PC retains the greatest amount of variation in the sample The k th PC retains the kth greatest t fraction of the variation in the sample The k th largest eigenvalue of the correlation matrix C is the variance in the sample along the k th PC The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal ogo to all previous ones Eric CMU, 006-0

13 The Corpora Matrix Doc Doc Doc 3 Word X = Word 0 8 Word Word Word Eric CMU,

14 Singular Value Decomposition For an m n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: A = U Σ V m m m n V is n n T The columns of U are orthogonal eigenvectors of AA T The columns of V are orthogonal eigenvectors of A T A Eigenvalues λ λ r of AA T are the eigenvalues of A T A Σ = σ = λ i λ i ( σ ) diag σ r Eric CMU, Singular values 4

15 SVD and PCA The first root is called the prinicipal eigenvalue which has an associated orthonormal (u T u = ) eigenvector u Subsequent roots are ordered such that λ > λ > > λ M with rank(d) non-zero values Eigenvectors form an orthonormal basis ie u it u j = δ ij The eigenvalue decomposition of XX T = UΣU T where U = [u, u,, u M ] and Σ = diag[λ, λ,, λ M ] Similarly the eigenvalue decomposition of X T X = VΣV T The SVD is closely related to the above X=U Σ / V T The left eigenvectors U, right eigenvectors V, singular values = square root of eigenvalues Eric CMU,

16 HowManyPCs? For n original dimensions, sample covariance matrix is nxn, and has up to n eigenvectors So n PCs Where does dimensionality reduction come from? Can ignore the components of lesser significance 5 0 (%) Variance PC PC PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC0 You do lose some information, but if the eigenvalues are small, you don t lose much n dimensions in original data calculate n eigenvectors and eigenvalues choose only the first p eigenvectors, based on their eigenvalues final data set has only p dimensions Eric CMU,

17 K is the number of singular values used Eric CMU, Within 40 threshold 7

18 Summary: Latent Semantic Indexing(Deerwester et al, 990) Document Term = * * X (m x n) T Λ D T (m x k) (k x k) (k x n) r w= K k= r d k λ T LSI does not define a properly p normalized probability distribution of observed and latent entities Does not support probabilistic reasoning under uncertainty and data fusion k k Eric CMU,

19 Connecting Probability Models to Data (Generative Model) P(Data Parameters) Probabilistic Model Real World Data P(Parameters Data) (Inference) Eric CMU,

20 Latent Semantic Structure in GM Latent Structure l Distribution over words P( w) = P( w,l) l Inferring latent structure P(w l) P( l) Words w P( l w) = P(w ) Eric CMU,

21 How to Model Semantics? Q: What is it about? A: Mainly MT, with syntax, some learning AdMixing Proportion MT Syntax Learning Source Target SMT Alignment Score BLEU Parse Tree Noun Phrase Grammar CFG likelihood EM Hidden Parameters Estimation argmax Top pics A Hierarchical Phrase-Based Model for Statistical ti ti lmachine Translation We present a statistical phrase-based Translation model that uses hierarchical phrases phrases that contain sub-phrases The model is formally a synchronous context-free grammar but is learned from a bitext t without t any syntactic ti information Thus it can be seen as a shift to the formal machinery of syntax based translation systems without any linguistic commitment In our experiments using BLEU as a metric, the hierarchical Phrase based model achieves a relative Improvement of 75% over Pharaoh, a state-of-the-art phrase-based system Unigram over vocabulary Topic Models Eric CMU, 006-0

22 WhythisisUseful? Q: What is it about? A: Mainly MT, with syntax, some learning AdMixing for Statistical Machine Translation Proportion MT Syntax Learning Q: give me similar document? Structured way of browsing the collection A Hierarchical Phrase-Based Model f St ti ti lm hi T l ti We present a statistical phrase-based Translation model that uses hierarchical phrases phrases that contain sub-phrases The model is formally a synchronous context-free grammar but is learned from a bitext t without t any syntactic ti information Thus it can be seen as a shift to the formal machinery of syntax based translation systems without any linguistic commitment In our experiments using BLEU as a metric, the hierarchical Phrase based model achieves a relative Other tasks Improvement of 75% over Pharaoh, a state-of-the-art phrase-based system Other tasks Dimensionality reduction TF-IDF vs topic mixing proportion Classification, clustering, and more Eric CMU, 006-0

23 Words in Contexts It was a nice shot Eric CMU,

24 Words in Contexts (con'd) the opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election Eric CMU,

25 A possible generative process of a document TOPIC 3 8 DOCUMENT : money bank bank loan river stream bank money river bank money bank loan money stream bank money bank bank loan river stream bank money river bank money bank loan bank money stream DOCUMENT : river stream bank stream bank money loan river stream loan bank river bank bank stream river loan bank stream bank money loan river stream bank stream bank money river 7 stream loan bank river bank money bank stream river bank stream bank money TOPIC Mixture Components (distributions over elements) admixing weight vector θ (represents all components contributions) vector θ Bayesian approach: use priors Admixture weights ~ Dirichlet( α ) Mixture components ~ Dirichlet( Γ ) Eric CMU,

26 Probabilistic LSI Hoffman (999) β k θ d z n w n N M Eric CMU,

27 Probabilistic LSI A "generative" model Models each word in a document as a sample from a mixture model Each word is generated from a single topic, different words in the document may be generated from different topics A topic is characterized by a distribution over words Each document is represented as a list of admixing proportions for the components (ie topic vector θ ) β k θ d z n w n N M Eric CMU,

28 Latent Dirichlet Allocation Blei, Ng and Jordan (003) Essentially a Bayesian plsi: η β k K α θ z n w n N M Eric CMU,

29 LDA Generative model Models each word in a document as a sample from a mixture model Each word is generated from a single topic, different words in the document may be generated from different topics A topic is characterized by a distribution over words Each document is represented as a list of admixing proportions for the components (ie topic vector) The topic vectors and the word rates each follows a Dirichlet prior --- essentially a Bayesian plsi η β k K α θ z n w n Eric CMU, N M 9

30 Topic Models = Mixed Membership Models = Admixture Generating a document Draw θ from the prior For each word n - Draw z n - Draw w n from multinomia l z n, ( θ ) { β } from multinomia l( β ) : k zn Prior θ z β K w N d Which prior to use? N Eric CMU,

Choices of Priors Dirichlet (LDA) (Blei et al 003)

capture variations in each topic s intensity

Lafferty 005, Ahmed & Xing 006) Capture the intuition

in intensity together Not a conjugate prior implies hard

31 Choices of Priors Dirichlet (LDA) (Blei et al 003) Conjugate prior means efficient inference Can only capture variations in each topic s intensity independently Logistic Normal (CTM=LoNTAM) (Blei & Lafferty 005, Ahmed & Xing 006) Capture the intuition that some topics are highly correlated and can rise up in intensity together Not a conjugate prior implies hard inference Nested CRP (Blei et al 005) Defines hierarchy on topics Eric CMU,

32 Generative Semantic of LoNTAM Generating a document µ Draw θ from the For each word γ ~ N - Draw z n n from prior - Draw wn zn, : θ ~ LN K multinomia l ( θ ) { β } from multinomia l( β ) k zn ( µ, Σ) ( µ, Σ) γ 0 K K = K = exp log + θi γ i i= C ( γ ) K = log + i= e γ i e γ i Eric CMU, β K - Log Partition Function - Normalization Constant Σ γ z w N d 3 N

33 Outcomes from a topic model The topics β in a corpus: There is no name for each topic, you need to name it! There is no objective measure of good/bad The shown topics are the good ones, there are many many trivial ones, meaningless ones, redundant ones, you need to manually prune the results How many topics? Eric CMU,

Outcomes from a topic model The topic vector θ of each doc Create an embedding of docs in a topic space Their no ground truth of θ to measure quality of inference But on θ it is possible to define an

34 Outcomes from a topic model The topic vector θ of each doc Create an embedding of docs in a topic space Their no ground truth of θ to measure quality of inference But on θ it is possible to define an objective measure of goodness, such as classification error, retrieval of similar docs, clustering, etc, of documents But there is no consensus on whether these tasks bear the true value of topic models Eric CMU,

Outcomes from a topic model The per-word topic indicator z: Not very useful under the bag of word representation, because of loss of ordering But it is possible to define

35 Outcomes from a topic model The per-word topic indicator z: Not very useful under the bag of word representation, because of loss of ordering But it is possible to define simple probabilistic linguistic constraints (eg, bi-grams) over z and get potentially interesting results [Griffiths, Steyvers, Blei, & Tenenbaum, 004] Eric CMU,

36 Outcomes from a topic model The topic graph S (when using CTM): Kind of interesting for understanding/visualizing large corpora Eric CMU, [David Blei, MLSS09] 36

37 Outcomes from a topic model Topic change trends [David Blei, MLSS09] Eric CMU,

38 The Big Picture Unstructured Collection Structured Topic Network Topic Discovery w n w T x x x x x w Dimensionality T x k x x T Word Simplex Reduction Eric CMU, Topic Space (eg, a Simplex) 38

{D i } Parameter estimation arg max ( µ,

39 Computation on LDA Inference Given a Document D Posterior: P(Θ µ,σ, β,d) Evaluation: P(D µ,σ, Σ β ) β i θ n Learning Given a collection of documents {D i } Parameter estimation arg max ( µ, Σ, β ) log ( P( µ, Σ, β )) D i Eric CMU,

40 Exact Bayesian inference on LDA is intractable A possible query: is intractable p q y? ) (? ) (, = = D z p D p m n n πθ n Close form solution? ) ( ), ( ) ( D p D p D p n n = π π, θ n θ n ) ( ) ( ) ( ) ( ) ( } {,,, D p d d G p p z p x p m n n z i n n m n m n z m n = φ π φ α π π φ θ n θ -n θ n β β = } {,,, ) ( ) ( ) ( ) ( ) ( m n n z N n n m n m n z m n d d d G p p z p x p D p φ π π φ α π π φ L L θ n θ n θ θ β β β Sum in the denominator over T n terms, and integrate over n k-dimensional topic vectors 40 Eric CMU, 006-0

41 Approximate Inference Variational Inference Mean field approximation (Blei et al) Expectation ti propagation (Minka et al) Variational nd -order Taylor approximation (Ahmed and Xing) Markov Chain Monte Carlo Gibbs sampling (Griffiths et al) Eric CMU,

42 Collapsed Gibbs sampling (Tom Griffiths & Mark Steyvers) Collapsed Gibbs sampling Integrate out θ For variables z = z, z,, z n Draw z (t+) i from P(z i z -i, w) β i z =z (t+) (t+) (t+) (t) (t) -i, z,, z i-, z i+,, z n θ n Eric CMU,

43 Gibbs sampling Need full conditional distributions for variables Since we only sample z we need θ n β i G G number of times word w assigned to topic j number of times topic j used in document d Eric CMU,

44 Gibbs sampling i w i d i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5 iteration Eric CMU,

45 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? Eric CMU,

46 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

47 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

48 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

49 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

50 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

51 Gibbs sampling iteration i w i d i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5? G G Eric CMU,

52 Gibbs sampling iteration 000 i w i d i z i z i z i MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK SCIENTIFIC KNOWLEDGE JOY 5 G G Eric CMU,

53 Learning a TM Maximum likelihood estimation: Need statistics on topic-specific word assignment (due to z), topic vector distribution (due to θ), etc Eg,, this is the formula for topic k: These are hidden variables, therefore need an EM algorithm (also known as data augmentation, or DA, in Monte Carlo paradigm) This is a reduce step in parallel implementation Eric CMU,

54 Conclusion GM-based topic models are cool Flexible Modular Interactive There are many ways of implementing topic models unsupervised supervised Efficient Inference/learning algorithms GMF, with Laplace approx for non-conjugate dist MCMC Many applications Word-sense disambiguation Image understanding Network inference Eric CMU,

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Principal Components Analysis. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Principal Components Analysis Le Song Lecture 22, Nov 13, 2012 Based on slides from Eric Xing, CMU Reading: Chap 12.1, CB book 1 2 Factor or Component