Applying LDA topic model to a corpus of Italian Supreme Court decisions

1 Applying LDA topic model to a corpus of Italian Supreme Court decisions
Paolo Fantini, Statistical Service of the Ministry of Justice - Italy
CESS Conference - Rome - November 25, 2014

2 Our goal
- find documents similar to a given document in a large collection of text documents
- by applying topic models to the Italgiure Supreme Court archive

3 The problem
The Italgiure electronic archive:
- legal information system of the Italian Supreme Court
- two million text documents, specifically about concluded trials in both civil and criminal matters

4 Topic modeling
- methods for automatically organizing, understanding, searching, and summarizing large electronic archives of text documents
- a topic model aims to discover the hidden themes that pervade the collection

5 Probabilistic model
[Figure: topics, proportions, and assignments for three example documents. TOPIC 1: autovettura (car), fari (headlights). TOPIC 2: strada provinciale (provincial road). Each word is followed by its topic assignment.
DOC1: autovettura 1, veicolo (vehicle) 1, autovettura 1, fari 1, veicolo 1, fari 1, autovettura 1, pneumatici (tires) 1
DOC2: veicolo 1, strada (road) 2, margine (edge) 2, fari 1, carreggiata (carriageway) 2, strada 2, autovettura 1, fari 1
DOC3: strada 2, strada 2, margine 2, strada 2, margine 2, carreggiata 2, strada 2, strada 2; topic proportion 1.0]
- Generative model for documents: the words in documents are assumed to be generated by a probabilistic process that includes hidden variables (topics)

6 Probabilistic model
[Figure: the example documents from slide 5, shown once with topics, proportions, and assignments and once with only the observed words of DOC1, DOC2, and DOC3.]
- Infer the hidden structure using posterior inference: what are the topics that describe this collection?

7 Model assumptions
[Figure: the example documents from slide 5, with and without topic assignments.]
- each topic is a probability distribution over words
- each document is a mixture of collection-wide topics
- each word is drawn from one of those topics
- the only parameter to be fixed is the number of topics, K

8 To make a new document
[Figure: the example documents from slide 5, with topics, proportions, and assignments.]
- choose a distribution over topics
- then, for each word in that document: choose a topic at random according to this distribution, and draw a word from that topic

9 Posterior inference
[Figure: the example documents from slide 5 with only the observed words shown.]
- we only observe the words in documents
- topics, proportions, and assignments are hidden variables
- the quantity of interest is the posterior p(topics, proportions, assignments | documents)

10 Note that
- each document is a mixture of K topics
- posterior expectations of the per-document topic proportions can be used to estimate the mixing coefficients
- the K estimated mixing coefficients provide a low-dimensional representation of documents (K << W, with W the number of unique terms)
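
As a hedged illustration of the second point: under mean-field variational inference (the inference method named on the Model slide), the approximate posterior of the proportions of document d is a Dirichlet distribution. The symbol γ for its parameter is notation assumed here, following Blei et al. (2003); it does not appear in the slides. The posterior expectation used as the estimated mixing coefficient is then:

```latex
q(\theta_d) = \mathrm{Dirichlet}_K(\gamma_d)
\quad\Longrightarrow\quad
\hat{\theta}_{d,k} = \mathbb{E}_q[\theta_{d,k}]
  = \frac{\gamma_{d,k}}{\sum_{j=1}^{K} \gamma_{d,j}},
\qquad k = 1, \dots, K .
```

The K values sum to one for each document, so each document is represented by a point on the (K - 1)-dimensional simplex.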

11 Similarity between two documents
- two documents are similar when they exhibit the same topics
- similarity can be measured by the distance between the estimated mixing coefficients, i.e. the estimated per-document topic proportions
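
A minimal sketch of this idea, using the symmetric KL divergence that the result tables later in the talk report. The exact symmetric form used by the author is not stated; the sum KL(p||q) + KL(q||p) is assumed here, and the function names, the clipping constant `eps`, and the toy proportions (read off the topic assignments of the three example documents) are all illustrative assumptions.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two discrete distributions.

    Probabilities are clipped at eps so that zero entries do not break the logarithm.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_k, q_k in zip(p, q):
        p_k = max(p_k, eps)
        q_k = max(q_k, eps)
        # p*log(p/q) + q*log(q/p) simplifies to (p - q)*log(p/q)
        total += (p_k - q_k) * math.log(p_k / q_k)
    return total


def top_similar(theta, query_id, n=10):
    """Return the n documents closest to query_id as (doc_id, divergence), nearest first.

    theta maps a document id to its estimated per-document topic proportions.
    """
    query = theta[query_id]
    scored = [
        (symmetric_kl(query, props), doc_id)
        for doc_id, props in theta.items()
        if doc_id != query_id
    ]
    scored.sort()
    return [(doc_id, div) for div, doc_id in scored[:n]]


# Toy proportions over two topics for the three example documents (assumed values).
theta = {"DOC1": [1.0, 0.0], "DOC2": [0.5, 0.5], "DOC3": [0.0, 1.0]}
ranking = top_similar(theta, "DOC1", n=2)
```

With these toy values, DOC2 (which shares topic 1 with DOC1) ranks ahead of DOC3 (which does not).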

12 Data
- 3.6K documents relating to final decisions in civil matters by the Italian Supreme Court
- 3M words
- 25K unique terms, after removing stop words and rare words

13 Model
- document-term matrix with 3.6K rows and 1.2K columns
- 55-topic LDA model fitted using variational inference

14 R packages
- topicmodels
- lda
- Supreme, a self-developed package containing utilities for handling ISC data and functions for model selection

15 Benchmark: similar documents in the Italgiure system
- the search procedure is based on tf-idf, a measure of the importance of a term for a document
- two documents are similar when they share the same important terms
- no probabilistic model of documents is assumed
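
To make the benchmark concrete, here is a small sketch of tf-idf weighting. The slides do not describe Italgiure's actual weighting or matching rule, so the textbook form tf x log(N / df) and the cosine comparison are assumptions, as are the function names. The tokens are those of the three example documents from the earlier slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf-idf weight} dictionary.

    tf is the relative term frequency in the document; idf is log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "autovettura veicolo autovettura fari veicolo fari autovettura pneumatici".split(),
    "veicolo strada margine fari carreggiata strada autovettura fari".split(),
    "strada strada margine strada margine carreggiata strada strada".split(),
]
vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors[0], vectors[1])
sim_13 = cosine(vectors[0], vectors[2])
```

DOC1 and DOC3 share no terms, so their similarity under this scheme is zero; this is the kind of case where a term-matching benchmark and a topic-based measure can disagree.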

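16 Example 1: Italgiure and LDA give the same results
Table: Top-10 documents most similar to document 1880, using the 55-topic LDA model and the symmetric KL divergence.
Columns: Id doc | KL div | Italgiure (Id doc and KL div values not recoverable)
Italgiure column, in rank order: Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes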

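17 Example 2: LDA retrieves similar documents when Italgiure does not
Table: Top-10 documents most similar to document 1000, using the 55-topic LDA model and the symmetric KL divergence.
Columns: Id doc | KL div | Italgiure (Id doc and KL div values not recoverable)
Italgiure column, in rank order: Yes, No, No, No, No, No, No, No, No, No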

18 Summary of my talk
Introduction to topic models:
- generative models for documents
- hidden thematic structure
- posterior inference problem
- use posterior estimates to perform tasks such as information retrieval, document similarity, exploration, classification...

19 Summary of my talk
Application to a real-world problem:
- retrieving documents similar to a given document from a corpus of Italian Supreme Court decisions
- many more experiments are needed, but the first results are encouraging

21 LDA as a graphical model
[Figure: plate diagram of LDA. Nodes: α (proportions parameter), θ_d (per-document topic proportions), Z_{d,n} (per-word topic assignment), W_{d,n} (observed word), β_k (topics), η (topic parameter). Plates: N, D, K.]
- nodes are random variables and edges indicate dependence
- shaded nodes are observed
- plates indicate replicated variables

22 LDA as a graphical model
[Figure: the plate diagram from slide 21.]
Joint distribution of the hidden and observed variables:
$$\prod_{i=1}^{K} p(\beta_i \mid \eta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$

23 Use posterior expectations
[Figure: the plate diagram from slide 21.]
- this joint defines a posterior
- from a collection of documents, infer:
  - the per-word topic assignments $z_{d,n}$
  - the per-document topic proportions $\theta_d$
  - the per-corpus topic distributions $\beta_k$
- then use posterior expectations to perform the task at hand, e.g. information retrieval, document similarity, exploration,...

24 Latent Dirichlet Allocation
Let K, the number of topics, be fixed a priori, and let $w = (w_1, \dots, w_N)$ be an observed document containing N words from a vocabulary with V different terms, $w_n \in \{1, \dots, V\}$, $n = 1, \dots, N$. The generative process underlying the LDA model is:
Step 1: For each topic $k = 1, \dots, K$, choose a distribution over words:
$$\beta_k \sim \mathrm{Dirichlet}_V(\eta) \qquad (1)$$
Step 2: For each document $d = 1, \dots, D$, choose a vector of topic proportions:
$$\theta_d \sim \mathrm{Dirichlet}_K(\alpha) \qquad (2)$$
Step 3: For each word $w_{d,n}$ in document d:
1. Choose a topic assignment:
$$Z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \quad Z_{d,n} \in \{1, \dots, K\} \qquad (3)$$
2. Choose a word:
$$W_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}), \quad W_{d,n} \in \{1, \dots, V\} \qquad (4)$$
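
The three steps can be simulated directly. The following is a sketch under stated assumptions, not the code used for the talk (which relied on R packages): symmetric Dirichlet priors with scalar parameters `alpha` and `eta`, a fixed document length, 0-based topic and word indices, and the function names are all choices made here for illustration.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, eta, seed=0):
    """Simulate the LDA generative process (steps 1 to 3).

    Returns (beta, corpus): beta[k] is topic k's distribution over words, and each
    corpus entry holds theta (proportions), z (assignments), and w (word indices).
    """
    rng = random.Random(seed)
    # Step 1: one distribution over the vocabulary per topic.
    beta = [sample_dirichlet([eta] * vocab_size, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Step 2: per-document topic proportions.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        # Step 3.1: a topic assignment for every word position.
        z = rng.choices(range(n_topics), weights=theta, k=doc_len)
        # Step 3.2: each word is drawn from the topic assigned to its position.
        w = [rng.choices(range(vocab_size), weights=beta[k])[0] for k in z]
        corpus.append({"theta": theta, "z": z, "w": w})
    return beta, corpus


beta, corpus = generate_corpus(
    n_docs=3, doc_len=8, n_topics=2, vocab_size=7, alpha=0.5, eta=0.5, seed=1
)
```

Posterior inference runs this process in reverse: given only the `w` lists, it estimates `beta`, `theta`, and `z`.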

25 Model selection: how to choose the document-term matrix
- labelled documents are available in the observed collection
- the original document-term matrix consists of 3.6K rows and 25K columns
- applying a lasso logistic classification scheme reduces the number of columns to 1.2K

26 Model selection: how to choose the number of topics K
- we perform a classification task over a predefined grid of values for the number of topics K, using the estimated per-document topic proportions as document features and the document labels as target classes
- the best value of K is the one with the maximum average accuracy obtained via cross-validation
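
The selection loop can be sketched as follows. The slides do not name the classifier, the number of folds, or the grid, so everything concrete here is an assumption: `features_for_k` is a hypothetical callable standing in for "fit a K-topic LDA model and return the estimated per-document topic proportions", the nearest-centroid classifier is only a stand-in, and the toy features are invented for the demonstration.

```python
def kfold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds interleaved held-out index lists."""
    if not 2 <= n_folds <= n_items:
        raise ValueError("need 2 <= n_folds <= n_items")
    return [list(range(fold, n_items, n_folds)) for fold in range(n_folds)]


def cv_accuracy(features, labels, fit, predict, n_folds=5):
    """Average held-out accuracy of a classifier over n_folds folds."""
    accuracies = []
    for held_out in kfold_indices(len(labels), n_folds):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / len(accuracies)


def select_k(grid, features_for_k, labels, fit, predict, n_folds=5):
    """Return (best K, {K: average accuracy}) over the grid of candidate values."""
    scores = {
        k: cv_accuracy(features_for_k(k), labels, fit, predict, n_folds)
        for k in grid
    }
    return max(scores, key=scores.get), scores


def fit_centroids(rows, labels):
    """Stand-in classifier: the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, x in enumerate(row):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((x - y) ** 2 for x, y in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Toy data: with K = 1 every document has proportions [1.0], which carries no
# information about the label; with K = 2 the invented proportions separate the classes.
labels = ["A", "B"] * 5


def toy_features(k):
    if k == 1:
        return [[1.0] for _ in labels]
    return [[1.0, 0.0] if label == "A" else [0.0, 1.0] for label in labels]


best_k, scores = select_k([1, 2], toy_features, labels, fit_centroids, predict_centroid)
```

In the toy run the uninformative K = 1 features reach chance-level accuracy, so the loop selects K = 2.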

27 Approximate posterior inference algorithms
- Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
- Mean field variational methods (Blei et al., 2003)
- Collapsed variational inference (Teh et al., 2006)
- Online variational inference (Hoffman et al., 2010)
- ...
