Advanced Artificial Intelligence
October 1, 2009
Outline

Part I: Theoretical Background
1. Motive (Previous Research, Exchangeability)
2. Notation and Terminology
3. Comparison (Geometric Interpretation)
4. Inference (Variational Inference, Parameter Estimation)

Part II: Application and Results
5. Example
6. Empirical Results (Document Modeling, Document Classification, Collaborative Filtering)
7. Summary
Part I: Theoretical Background
Why did the research start?

- People want to learn many things from various document collections
- Large collections of discrete data are hard to handle
- We don't want to lose the essential statistical relationships
Latent Semantic Indexing

- Deerwester et al. proposed LSI
- Uses tools from linear algebra (singular value decomposition)
- Decomposes the term-document occurrence matrix into relations between terms and concepts and between concepts and documents
- Achieves dimensionality reduction
- Captures basic linguistic notions
Generative Probabilistic Model of Text Corpora

- Developed to substantiate the claims regarding LSI
- Suggests there is no need for LSI: fit the generative model directly with Bayesian methods
Probabilistic LSI

- Models each word in a document as a sample from a mixture of topics
- Has a solid foundation in statistics
- But: no probabilistic model at the document level
- Overfitting, since the number of parameters grows linearly with the corpus size
- Not a proper generative model for unseen documents
de Finetti's Theorem

- The bag-of-words assumption means word order is ignored, i.e., words are exchangeable
- Any collection of infinitely exchangeable random variables has a representation as a mixture distribution
- So consider mixture models that capture the exchangeability of both words and documents
Notation

- word: the basic unit of discrete data
- document: a sequence of words
- corpus: a collection of documents
What is LDA?

- A generative probabilistic model of a corpus
- Encodes our view of how documents are generated
Generative Process

1. Choose $N \sim \mathrm{Poisson}(\xi)$
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$
3. For each of the $N$ words $w_n$:
   - Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$
   - Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$
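A minimal Python sketch of this generative process; the values of k, V, xi, alpha, and beta below are illustrative toys, not parameters from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    k, V, xi = 3, 10, 8.0                     # topics, vocabulary size, Poisson rate
    alpha = np.full(k, 0.1)                   # Dirichlet prior over topic proportions
    beta = rng.dirichlet(np.full(V, 0.1), k)  # k x V topic-word multinomials

    def generate_document():
        N = rng.poisson(xi)                   # choose N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)          # choose theta ~ Dir(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(k, p=theta)        # choose topic z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])      # choose word w_n ~ p(w_n | z_n, beta)
            words.append(w)
        return words

    print(generate_document())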
Graphical Model Representation

- $\alpha$, $\beta$: corpus-level parameters
- $\theta_d$: document-level variable
- $z_{dn}$, $w_{dn}$: word-level variables
Unigram Model

The words of every document are drawn independently from a single multinomial distribution:

$$p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$$
Mixture of Unigrams

Each document is generated by first choosing a topic $z$ and then generating $N$ words independently from the conditional multinomial $p(w \mid z)$:

$$p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$$
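To make the contrast concrete, a toy sketch computing both likelihoods for a short document; all distributions below are random illustrations, not fitted models.

    import numpy as np

    rng = np.random.default_rng(0)
    V, k = 10, 3
    pw = rng.dirichlet(np.ones(V))        # unigram model p(w)
    pz = rng.dirichlet(np.ones(k))        # topic prior p(z)
    pwz = rng.dirichlet(np.ones(V), k)    # conditional multinomials p(w | z)

    words = [0, 3, 7, 3]                  # a toy document of word indices

    # Unigram model: p(w) = prod_n p(w_n)
    p_unigram = np.prod(pw[words])

    # Mixture of unigrams: p(w) = sum_z p(z) prod_n p(w_n | z)
    p_mixture = np.sum(pz * np.prod(pwz[:, words], axis=1))

    print(p_unigram, p_mixture)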
Probabilistic Latent Semantic Indexing

A document label $d$ and a word $w_n$ are conditionally independent given an unobserved topic $z$:

$$p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$$
Geometric Interpretation

[Figure]
Inference

Given $\alpha$, $\beta$ and a document, calculate the posterior distribution of the hidden variables:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$$

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$$

Intractable due to the coupling between $\theta$ and $\beta$! Use an approximation:
- Laplace approximation
- Variational approximation
- Markov chain Monte Carlo
Graphical Model Representation

Modify the original graphical model so that the coupling between $\theta$ and $\beta$ disappears.

[Figure: LDA]  [Figure: Variational distribution]
Variational Distribution

The approximation:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)$$

Using the Kullback-Leibler divergence between the variational distribution and the true posterior as the dissimilarity function:

$$(\gamma^*, \phi^*) = \arg\min_{(\gamma, \phi)} D\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big)$$
Variational Inference Algorithm

initialize $\phi_{ni}^{0} := 1/k$ for all $i$ and $n$
initialize $\gamma_i := \alpha_i + N/k$ for all $i$
repeat
    for $n = 1$ to $N$
        for $i = 1$ to $k$
            $\phi_{ni}^{t+1} := \beta_{i w_n} \exp(\Psi(\gamma_i^t))$
        normalize $\phi_n^{t+1}$ to sum to 1
    $\gamma^{t+1} := \alpha + \sum_{n=1}^{N} \phi_n^{t+1}$
until convergence

Roughly $O(N^2 k)$ operations per document.
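A minimal Python sketch of these per-document updates, assuming beta is the k x V topic-word matrix, alpha the length-k Dirichlet parameter, and words the document as a list of vocabulary indices; an illustration, not the authors' implementation.

    import numpy as np
    from scipy.special import psi  # the digamma function Psi

    def variational_inference(words, alpha, beta, tol=1e-6, max_iter=100):
        N, k = len(words), len(alpha)
        phi = np.full((N, k), 1.0 / k)   # phi_ni := 1/k
        gamma = alpha + N / k            # gamma_i := alpha_i + N/k
        for _ in range(max_iter):
            old_gamma = gamma.copy()
            for n, w in enumerate(words):
                # phi_ni proportional to beta[i, w_n] * exp(Psi(gamma_i))
                phi[n] = beta[:, w] * np.exp(psi(gamma))
                phi[n] /= phi[n].sum()               # normalize phi_n to sum to 1
            gamma = alpha + phi.sum(axis=0)          # gamma := alpha + sum_n phi_n
            if np.abs(gamma - old_gamma).sum() < tol:  # until convergence
                break
        return gamma, phi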
Variational EM Algorithm

Given a corpus of documents $D$, we wish to find parameters $\alpha$ and $\beta$ that maximize the (marginal) log likelihood of the data. Use the variational EM algorithm:

1. (E-step) For each document, find the optimizing values of the variational parameters $\gamma_d^*$ and $\phi_d^*$
2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to $\alpha$ and $\beta$
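A hedged sketch of the outer EM loop, reusing variational_inference from the previous slide; corpus is assumed to be a list of documents given as word-index lists. Only the closed-form beta update is shown, with alpha held fixed (the paper updates alpha by Newton-Raphson).

    import numpy as np

    def variational_em(corpus, k, V, alpha, n_iter=20):
        rng = np.random.default_rng(0)
        beta = rng.dirichlet(np.ones(V), k)  # random initial topic-word distributions
        for _ in range(n_iter):
            beta_new = np.zeros((k, V))
            for words in corpus:
                # E-step: optimize gamma_d and phi_d for this document
                gamma, phi = variational_inference(words, alpha, beta)
                # M-step statistic: beta_ij proportional to sum_d sum_n phi_dni * w_dn^j
                for n, w in enumerate(words):
                    beta_new[:, w] += phi[n]
            beta = beta_new / beta_new.sum(axis=1, keepdims=True)
        return beta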
Part II: Application and Results
TREC AP corpus

Most probable words from four example topics:

Arts       Budgets      Children    Education
NEW        MILLION      CHILDREN    SCHOOL
FILM       TAX          WOMEN       STUDENTS
SHOW       PROGRAM      PEOPLE      SCHOOLS
MUSIC      BUDGET       CHILD       EDUCATION
MOVIE      BILLION      YEARS       TEACHERS
PLAY       FEDERAL      FAMILIES    HIGH
MUSICAL    YEAR         WORK        PUBLIC
BEST       SPENDING     PARENTS     TEACHER
ACTOR      NEW          SAYS        BENNETT
FIRST      STATE        FAMILY      MANIGAT
YORK       PLAN         WELFARE     NAMPHY
OPERA      MONEY        MEN         STATE
THEATER    PROGRAMS     PERCENT     PRESIDENT
ACTRESS    GOVERNMENT   CARE        ELEMENTARY
LOVE       CONGRESS     LIFE        HAITI
Inference on an Unseen Document

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants. Lincoln Center's share will be $200,000 for its new building, which will house young artists and provide new public facilities. The Metropolitan Opera Co. and New York Philharmonic will receive $400,000 each. The Juilliard School, where music and the performing arts are taught, will get $250,000. The Hearst Foundation, a leading supporter of the Lincoln Center Consolidated Corporate Fund, will make its usual annual $100,000 donation, too.
Document Modeling

- Compare the generalization performance of the models
- Use perplexity to evaluate them:

$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}$$
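A small helper matching this formula; it assumes the per-document log likelihoods log p(w_d) (in practice a variational lower bound) and the document lengths N_d have already been computed.

    import numpy as np

    def perplexity(log_likelihoods, doc_lengths):
        # perplexity = exp( - sum_d log p(w_d) / sum_d N_d )
        return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))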
Perplexity Results

[Figure]
Document Classification

- Binary classification experiment
- Low-dimensional representation by LDA vs. all word features
- Used an SVM for training
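A hedged sketch of this setup using scikit-learn as a stand-in for the authors' implementation: fit LDA to a term-count matrix, use the resulting topic proportions as low-dimensional features, and train an SVM on them. counts, labels, and k are assumed inputs, not values from the paper.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    def lda_svm_accuracy(counts, labels, k=50):
        # counts: (n_docs x V) term-count matrix; labels: binary class labels
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        features = lda.fit_transform(counts)   # V-dim counts -> k topic proportions
        X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
        svm = SVC().fit(X_tr, y_tr)            # train SVM on the LDA features
        return svm.score(X_te, y_te)           # held-out classification accuracy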
Classification Result

Almost no reduction in classification performance, even though the feature space is reduced by 99.6 percent.
Collaborative Filtering

Evaluate by holding out one item per test user and measuring how well the model predicts it given the user's other items:

$$\mathrm{predictive\ perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(w_{d,N_d} \mid \mathbf{w}_{d,1:N_d-1})}{M} \right\}$$
Perplexity Result

[Figure]
Summary

- A generative probabilistic model for collections of discrete data
- Based on a simple exchangeability assumption
- Exact inference is intractable; approximate inference algorithms are used
Correctness

- Hard to evaluate the actual correctness of the model
- The experiments do not report how many times they were repeated
- No report of whether the differences are statistically significant
- Classification is evaluated only for the binary case
Assessment

- Big impact in the field
- Referenced many times
- Many applications of LDA
Thank you!