Applying LDA topic model to a corpus of Italian Supreme Court decisions


Applying LDA topic model to a corpus of Italian Supreme Court decisions
Paolo Fantini, Statistical Service of the Ministry of Justice, Italy
CESS Conference, Rome, November 25, 2014

Our goal: finding documents similar to a given document in a large collection of text documents, by applying topic models to the Italgiure Supreme Court archive.

The problem. The Italgiure electronic archive: the legal information system of the Italian Supreme Court; two million text documents, specifically about concluded trials in both civil and criminal matters.

Topic modelling: methods for automatically organizing, understanding, searching, and summarizing large electronic archives of text documents. A topic model aims to discover the hidden themes that pervade the collection.

Probabilistic model

[Figure: toy example with two topics (TOPIC 1: autovettura, fari, veicolo, ...; TOPIC 2: strada, margine, carreggiata, ...), per-document topic proportions and per-word topic assignments for three short example documents]

Generative model for documents: words in documents are assumed to be observed from a generative probabilistic process that includes hidden variables (topics).

Probabilistic model (continued). Infer the hidden structure using posterior inference: what are the topics that describe this collection?

Model assumptions:
each topic is a probability distribution over words;
each document is a mixture of collection-wide topics;
each word is drawn from one of those topics;
the only parameter to be fixed is the number of topics, K.

To make a new document: choose a distribution over topics; then, for each word in that document, choose a topic at random according to this distribution and draw a word from that topic.

Posterior inference: we only observe the words in documents; topics, proportions and assignments are hidden variables. The quantity of interest is the posterior p(topics, proportions, assignments | documents).
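One step the slides leave implicit, added here for clarity: by Bayes' rule this posterior is

p(β_{1:K}, θ_{1:D}, z | w) = p(β_{1:K}, θ_{1:D}, z, w) / p(w)

and the normalizing constant p(w), the marginal probability of the observed documents, requires summing over every possible configuration of topic assignments, so the posterior cannot be computed exactly. This is why the approximate inference algorithms listed at the end of the talk are needed.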

Note that each document is a mixture of K topics; posterior expectations of the per-document topic proportions can be used to estimate the mixing coefficients; the K estimated mixing coefficients provide a low-dimensional representation of documents (K << W, the number of distinct terms).

Similarity between two documents: two documents are similar when they exhibit the same topics; similarity can be measured by the distance between the estimated mixing coefficients, i.e. the estimated per-document topic proportions.
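A minimal R sketch of this similarity measure, not the author's Supreme package code: given a D x K matrix theta of estimated per-document topic proportions (see the model-fitting sketch further below), rank documents by their symmetric Kullback-Leibler divergence from a query document. Object and function names are illustrative.

## Rank documents by symmetric KL divergence between topic-proportion vectors.
## `theta` is assumed to be a D x K matrix of per-document topic proportions.
sym_kl <- function(p, q, eps = 1e-12) {
  p <- p + eps; q <- q + eps                  # avoid log(0)
  p <- p / sum(p); q <- q / sum(q)
  sum(p * log(p / q)) + sum(q * log(q / p))
}

most_similar <- function(theta, doc_id, n = 10) {
  d <- apply(theta, 1, sym_kl, q = theta[doc_id, ])
  head(sort(d), n)                            # smaller divergence = more similar
}

Used as most_similar(theta, doc_id), this returns a ranking like the top-10 tables shown later; the query document itself appears first with divergence 0.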

Data: 3.6K documents relating to final decisions in civil matters by the Italian Supreme Court; 3M words; 25K unique terms (stop words and rare words removed).

Model: document-term matrix with 3.6K rows and 1.2K columns; 55-topic LDA model fitted using variational inference.

R packages: topicmodels, lda, and Supreme, a self-developed package containing utilities for handling ISC data and functions for model selection.
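A minimal sketch, not the author's actual code, of how a model like the 55-topic one can be fitted with the topicmodels package on a document-term matrix built with tm. The corpus directory and object names are hypothetical, and the lasso-based term selection that reduces the matrix to 1.2K columns (described later) is omitted here.

## Sketch: build a document-term matrix and fit a 55-topic LDA model with
## variational EM. Paths and preprocessing choices are illustrative only.
library(tm)
library(topicmodels)

corpus <- VCorpus(DirSource("decisions/", encoding = "UTF-8"),
                  readerControl = list(language = "it"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("italian"))

dtm <- DocumentTermMatrix(corpus)

lda_55 <- LDA(dtm, k = 55, method = "VEM", control = list(seed = 1234))

## posterior expectations: per-document topic proportions and per-topic word distributions
theta <- posterior(lda_55)$topics   # D x K matrix of mixing coefficients
beta  <- posterior(lda_55)$terms    # K x V matrix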

Benchmark: similar documents in the Italgiure system. The search procedure is based on the tf-idf measure of the importance of a term for a document; two documents are similar when they share the same important terms; no probabilistic model of documents is assumed.
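For contrast, a hedged sketch of the kind of tf-idf term-matching retrieval the benchmark is described as using; the actual Italgiure search procedure is not shown in the talk, so this is an illustration only, reusing the corpus object from the previous sketch.

## Illustration only (not Italgiure's implementation): weight the
## document-term matrix by tf-idf and rank documents by cosine similarity.
library(tm)

dtm_tfidf <- weightTfIdf(DocumentTermMatrix(corpus))
m <- as.matrix(dtm_tfidf)                      # dense; fine for a small corpus

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- apply(m, 1, cosine_sim, b = m[1, ])    # similarity to the first document
head(sort(sims, decreasing = TRUE), 10)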

Example 1: Italgiure and LDA give the same results

Id doc   KL div       Italgiure
1880     0.0000000    Yes
1883     0.0000123    Yes
1881     0.0334733    Yes
1884     0.0825216    Yes
1882     0.1411761    Yes
1885     0.1935806    Yes
2693     0.7787369    No
2687     0.7865201    Yes
2696     0.7955861    Yes
2686     0.7957056    Yes

Table: Top-10 similar documents to document 1880 using the 55-topic LDA model and the symmetric KL divergence

Example 2: LDA retrieves similar documents when Italgiure does not

Id doc   KL div       Italgiure
1000     0.0000000    Yes
3349     0.7783704    No
1807     1.0416604    No
3354     1.0774032    No
1628     1.1333891    No
3240     1.1631313    No
2071     1.2383640    No
3353     1.2800263    No
1817     1.2908571    No
1615     1.4328374    No

Table: Top-10 similar documents to document 1000 using the 55-topic LDA model and the symmetric KL divergence

Summary of my talk

Introduction to topic models: generative models for documents; hidden thematic structure; the posterior inference problem; use of posterior estimates to perform tasks such as information retrieval, document similarity, exploration, classification, ...

Application to a real-world problem: retrieving documents similar to a given document, with reference to a corpus of Italian Supreme Court decisions. We need many more experiments, but the first results are encouraging.

LDA as a graphical model

[Plate diagram: proportions parameter α and topic parameter η; per-document topic proportions θ_d; per-word topic assignments Z_{d,n}; observed words W_{d,n}; topics β_k; plates over the N words in each document, the D documents and the K topics]

Nodes are random variables and edges indicate dependence; shaded nodes are observed; plates indicate replicated variables.

LDA as a graphical model (continued). The joint distribution implied by the graph is

p(β_{1:K}, θ_{1:D}, z, w) = ∏_{i=1}^{K} p(β_i | η) ∏_{d=1}^{D} [ p(θ_d | α) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n}) ]

Use posterior expectations. This joint defines a posterior. From a collection of documents, infer: the per-word topic assignments z_{d,n}; the per-document topic proportions θ_d; the per-corpus topic distributions β_k. Then use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, exploration, ...

Latent Dirichlet Allocation

Let K, the number of topics, be fixed a priori, and let w = (w_1, ..., w_N) be an observed document containing N words from a vocabulary with V different terms, w_n ∈ {1, ..., V}, n = 1, ..., N. This is the generative process underlying the LDA model:

Step 1: For each topic k = 1, ..., K, choose a distribution over words:
  β_k ~ Dirichlet_V(η)   (1)

Step 2: For each document d = 1, ..., D, choose a vector of topic proportions:
  θ_d ~ Dirichlet_K(α)   (2)

Step 3: For each word w_{d,n} in document d,
  1. Choose a topic assignment: Z_{d,n} ~ Multinomial(θ_d), Z_{d,n} ∈ {1, ..., K}   (3)
  2. Choose a word: W_{d,n} ~ Multinomial(β_{Z_{d,n}}), W_{d,n} ∈ {1, ..., V}   (4)
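A small base-R simulation of this generative process, with toy values for K, V, N, α and η (not values from the talk); Dirichlet draws are obtained by normalizing independent Gamma variates.

## Toy simulation of the LDA generative process (illustrative values only)
rdirichlet1 <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

K <- 2; V <- 6; N <- 8                  # topics, vocabulary size, words per document
eta <- rep(0.1, V); alpha <- rep(1, K)  # symmetric hyperparameters
vocab <- c("autovettura", "fari", "veicolo", "strada", "margine", "carreggiata")

beta  <- t(sapply(1:K, function(k) rdirichlet1(eta)))  # Step 1: K x V topic distributions
theta <- rdirichlet1(alpha)                            # Step 2: topic proportions for one document

z <- sample(1:K, N, replace = TRUE, prob = theta)      # Step 3.1: per-word topic assignments
w <- sapply(z, function(k) sample(1:V, 1, prob = beta[k, ]))  # Step 3.2: words
vocab[w]                                               # the simulated document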

Model selection: how to choose the document-term matrix. The documents in the observed collection are labelled; the original document-term matrix consists of 3.6K rows and 25K columns; applying a lasso logistic classification scheme reduces the number of columns to 1.2K.
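A hedged sketch of how such a lasso-based term selection could look with the glmnet package; the labels, preprocessing and exact scheme used by the author are not given in the talk, so the objects below (doc_labels in particular) are hypothetical.

## Hypothetical sketch: keep only terms with a non-zero lasso (multinomial
## logistic) coefficient for at least one class, then subset the matrix.
library(glmnet)
library(Matrix)

X <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                  dims = dim(dtm), dimnames = dimnames(dtm))
y <- doc_labels                              # hypothetical document labels

fit   <- cv.glmnet(X, y, family = "multinomial", alpha = 1)
coefs <- coef(fit, s = "lambda.min")         # one sparse coefficient vector per class
keep  <- unique(unlist(lapply(coefs, function(b) rownames(b)[as.vector(b) != 0])))
keep  <- setdiff(keep, "(Intercept)")

dtm_reduced <- dtm[, which(colnames(dtm) %in% keep)]   # roughly 1.2K columns kept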

Model selection: how to choose the number of topics K. We perform a classification task over a predefined grid of values for K, using the estimated per-document topic proportions as document features and the document labels as target classes. The best value of K is the one associated with the maximum average accuracy obtained via cross-validation.
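A schematic version of this selection loop; the classifier (nnet::multinom as a stand-in), the number of folds and the grid of K values are illustrative assumptions, and dtm_reduced / doc_labels come from the hypothetical sketch above.

## Schematic sketch: for each K on a grid, fit LDA, then score a simple
## classifier on the per-document topic proportions via cross-validation.
library(topicmodels)
library(nnet)

cv_accuracy <- function(theta, labels, folds = 5) {
  df  <- data.frame(label = factor(labels), theta)
  idx <- sample(rep(1:folds, length.out = nrow(df)))
  accs <- sapply(1:folds, function(f) {
    fit  <- multinom(label ~ ., data = df[idx != f, ], trace = FALSE)
    pred <- predict(fit, newdata = df[idx == f, ])
    mean(as.character(pred) == as.character(df$label[idx == f]))
  })
  mean(accs)
}

grid <- seq(25, 100, by = 5)                 # illustrative grid of K values
acc  <- sapply(grid, function(K) {
  theta_K <- posterior(LDA(dtm_reduced, k = K, method = "VEM"))$topics
  cv_accuracy(theta_K, doc_labels)
})
best_K <- grid[which.max(acc)]               # the talk settles on K = 55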

Approximate posterior inference algorithms:
Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)
Mean field variational methods (Blei et al., 2003)
Collapsed variational inference (Teh et al., 2006)
Online variational inference (Hoffman et al., 2010)
...