Applying the LDA topic model to a corpus of Italian Supreme Court decisions

Paolo Fantini
Statistical Service of the Ministry of Justice - Italy
CESS Conference - Rome - November 25, 2014
Our goal

Finding documents similar to a given document in a large collection of text documents, by applying topic models to the Italgiure Supreme Court archive.
The problem

The Italgiure electronic archive:
- the legal information system of the Italian Supreme Court
- two million text documents about concluded trials in both civil and criminal matters
Topic modeling

Methods for automatically organizing, understanding, searching, and summarizing large electronic archives of text documents. A topic model aims to discover the hidden themes that pervade the collection.
Probabilistic model

[Figure: a two-topic example over Italian words - TOPIC 1 (autovettura, fari) and TOPIC 2 (strada, provinciale) - showing three example documents with their topic proportions (DOC1: 1.0 on topic 1; DOC2: 0.5 on each topic; DOC3: 1.0 on topic 2) and a topic assignment for every word.]

Generative model for documents: the words in documents are assumed to be drawn from a generative probabilistic process that includes hidden variables (topics).

Infer the hidden structure using posterior inference: what are the topics that describe this collection?
Model assumptions

[Figure: the same two-topic example, with topics, per-document proportions, and per-word assignments.]

- each topic is a probability distribution over words
- each document is a mixture of collection-wide topics
- each word is drawn from one of those topics
- the only parameter to be fixed is the number of topics: K
To make a new document

[Figure: the same two-topic example.]

- choose a distribution over topics
- then, for each word in that document: choose a topic at random according to this distribution and draw a word from that topic
Posterior inference

[Figure: the same example documents, now with topics, proportions, and assignments hidden.]

- we only observe the words in documents
- topics, proportions, and assignments are hidden variables
- the goal is to compute p(topics, proportions, assignments | documents)
Note that

- each document is a mixture of K topics
- posterior expectations of the per-document topic proportions can be used to estimate the mixing coefficients
- the K estimated mixing coefficients provide a low-dimensional representation of documents (K << W, the number of distinct terms)
Similarity between two documents

- two documents are similar when they exhibit the same topics
- similarity can be measured by the distance between the estimated mixing coefficients, i.e. the estimated per-document topic proportions (a sketch follows)
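A minimal sketch of this measure in R, assuming `gamma` is the D x K matrix of estimated per-document topic proportions (e.g., as returned by the topicmodels package used later); `sym_kl` and `similar_docs` are illustrative names, not functions of our Supreme package:

```r
# Symmetric KL divergence between two probability vectors p and q
# (proportions estimated by variational inference are strictly
# positive, so the logarithms stay finite)
sym_kl <- function(p, q) {
  sum(p * log(p / q)) + sum(q * log(q / p))
}

# Rank all documents by divergence from a query document;
# the smallest values correspond to the most similar documents
similar_docs <- function(gamma, query, n = 10) {
  d <- apply(gamma, 1, sym_kl, q = gamma[query, ])
  head(sort(d), n)
}
```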
Data

- 3.6K documents relating to final decisions in civil matters by the Italian Supreme Court
- 3M words, 25K unique terms (stop words and rare words removed)

Model

- document-term matrix with 3.6K rows and 1.2K columns (the vocabulary reduction is described under model selection below)
- 55-topic LDA model fitted using variational inference

R packages

- topicmodels, lda, and Supreme, a self-developed package containing utilities for handling ISC data and functions for model selection
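A minimal sketch of the fitting step with the topicmodels package; `dtm` stands for the 3.6K x 1.2K document-term matrix (the name is ours, the API is the package's):

```r
library(topicmodels)

# method = "VEM" selects variational EM, the inference method used here
lda_fit <- LDA(dtm, k = 55, method = "VEM")

# posterior expectations of the per-document topic proportions:
# one row per document, one column per topic
gamma <- posterior(lda_fit)$topics

# top 10 terms characterizing each topic
terms(lda_fit, 10)
```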
Benchmark: similar documents in the Italgiure system

- search procedure based on tf-idf, a measure of the importance of a term for a document
- two documents are similar when they share the same important terms
- no probabilistic model of documents is assumed (a sketch follows)
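For comparison, a sketch of a tf-idf ranking of the kind this benchmark relies on, written with the tm R package; `corpus` and `query` are illustrative names, and this is not the actual Italgiure implementation:

```r
library(tm)

# tf-idf weighted document-term matrix built from a tm corpus
m <- as.matrix(weightTfIdf(DocumentTermMatrix(corpus)))

# cosine similarity between the query document and every document:
# documents sharing the same important (high tf-idf) terms score highest
sims <- (m %*% m[query, ])[, 1] /
  (sqrt(rowSums(m^2)) * sqrt(sum(m[query, ]^2)))
head(sort(sims, decreasing = TRUE), 10)
```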
Example 1: Italgiure and LDA give the same results

Id doc   KL div      Italgiure
1880     0.0000000   Yes
1883     0.0000123   Yes
1881     0.0334733   Yes
1884     0.0825216   Yes
1882     0.1411761   Yes
1885     0.1935806   Yes
2693     0.7787369   No
2687     0.7865201   Yes
2696     0.7955861   Yes
2686     0.7957056   Yes

Table: Top-10 documents similar to document 1880 using the 55-topic LDA model and the symmetric KL divergence
Example 2: LDA retrieves similar documents when Italgiure does not

Id doc   KL div      Italgiure
1000     0.0000000   Yes
3349     0.7783704   No
1807     1.0416604   No
3354     1.0774032   No
1628     1.1333891   No
3240     1.1631313   No
2071     1.2383640   No
3353     1.2800263   No
1817     1.2908571   No
1615     1.4328374   No

Table: Top-10 documents similar to document 1000 using the 55-topic LDA model and the symmetric KL divergence
Summary of my talk

Introduction to topic models
- generative models for documents
- hidden thematic structure
- posterior inference problem
- use posterior estimates to perform tasks such as information retrieval, document similarity, exploration, classification...

Application to a real-world problem
- retrieving documents similar to a given document in a corpus of Italian Supreme Court decisions
- many more experiments are needed... but the first results are encouraging
LDA as a graphical model

[Plate diagram: proportions parameter α → per-document topic proportions θ_d → per-word topic assignment z_{d,n} → observed word w_{d,n} ← topics β_k ← topic parameter η; plates over N words, D documents, and K topics; the node w_{d,n} is shaded.]

- nodes are random variables and edges indicate dependence
- shaded nodes are observed
- plates indicate replicated variables
LDA as a graphical model

The graphical model encodes the joint distribution:

$$p(\beta, \theta, z, w) = \prod_{i=1}^{K} p(\beta_i \mid \eta) \prod_{d=1}^{D} \left( p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
Use posterior expectations

This joint defines a posterior. From a collection of documents, infer:
- per-word topic assignments $z_{d,n}$
- per-document topic proportions $\theta_d$
- per-corpus topic distributions $\beta_k$

Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...
Latent Dirichlet Allocation

Let $K$, the number of topics, be fixed a priori and let $w = (w_1, \ldots, w_N)$ be an observed document containing $N$ words from a vocabulary with $V$ different terms, $w_n \in \{1, \ldots, V\}$, $n = 1, \ldots, N$. This is the generative process underlying the LDA model:

Step 1: For each topic $k = 1, \ldots, K$, choose a distribution over words:
$$\beta_k \sim \mathrm{Dirichlet}_V(\eta) \quad (1)$$

Step 2: For each document $d = 1, \ldots, D$, choose a vector of topic proportions:
$$\theta_d \sim \mathrm{Dirichlet}_K(\alpha) \quad (2)$$

Step 3: For each word $w_{d,n}$ in document $d$:
1. Choose a topic assignment: $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$, $z_{d,n} \in \{1, \ldots, K\}$ (3)
2. Choose a word: $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$, $w_{d,n} \in \{1, \ldots, V\}$ (4)

A simulation sketch follows.
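A direct simulation of steps 1-3 in plain R; the dimensions and hyperparameter values below are made up for illustration:

```r
# Draw n samples from a Dirichlet distribution with parameter vector a
rdirichlet <- function(n, a) {
  x <- matrix(rgamma(n * length(a), shape = a), nrow = n, byrow = TRUE)
  x / rowSums(x)
}

V <- 1000; K <- 55; N <- 200              # vocabulary size, topics, words
eta <- rep(0.1, V); alpha <- rep(0.1, K)  # illustrative hyperparameters

beta  <- rdirichlet(K, eta)               # step 1: a distribution over words per topic
theta <- rdirichlet(1, alpha)[1, ]        # step 2: topic proportions for one document
z <- sample(K, N, replace = TRUE, prob = theta)             # step 3.1: assignments
w <- sapply(z, function(k) sample(V, 1, prob = beta[k, ]))  # step 3.2: words
```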
Model selection: how to choose the document-term matrix

- the documents in the observed collection are labelled
- the original document-term matrix consists of 3.6K rows and 25K columns
- applying a lasso logistic classification scheme reduces the number of columns to 1.2K (see the sketch below)
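A sketch of the reduction step with the glmnet package, assuming `dtm_full` is the original 3.6K x 25K (sparse) matrix with named columns and `labels` the document labels; the variable names and exact workflow are illustrative:

```r
library(glmnet)

# lasso (alpha = 1) multinomial logistic regression, lambda chosen by CV
cv_fit <- cv.glmnet(dtm_full, labels, family = "multinomial", alpha = 1)

# keep only the terms with a nonzero coefficient for at least one class
coefs <- coef(cv_fit, s = "lambda.min")   # one sparse coefficient vector per class
keep <- unique(unlist(lapply(coefs, function(b) rownames(b)[b[, 1] != 0])))
keep <- setdiff(keep, "(Intercept)")
dtm <- dtm_full[, keep]                   # ~1.2K columns remain
```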
Model selection: how to choose the number of topics K

- we perform a classification task over a predefined grid of values for the number of topics K, using the estimated per-document topic proportions as document features and the document labels as target classes
- the best value of K is the one with the maximum average accuracy obtained via cross-validation (see the sketch below)
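A sketch of the grid search, where `cv_accuracy()` stands for an assumed helper computing cross-validated classification accuracy from features and labels (such utilities belong in our Supreme package, but the code here is only illustrative):

```r
library(topicmodels)

grid_K <- seq(10, 100, by = 5)   # illustrative grid of candidate values
accuracy <- sapply(grid_K, function(K) {
  fit <- LDA(dtm, k = K, method = "VEM")
  gamma <- posterior(fit)$topics   # per-document topic proportions
  cv_accuracy(gamma, labels)       # assumed CV-accuracy helper
})
best_K <- grid_K[which.max(accuracy)]
```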
Approximate posterior inference algorithms

- Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
- Mean field variational methods (Blei et al., 2003)
- Collapsed variational inference (Teh et al., 2006)
- Online variational inference (Hoffman et al., 2010)
- ...