how to *do* computationally assisted research

digital literacy @ comwell
Kristoffer L Nielbo
knielbo@sdu.dk
knielbo.github.io/
March 22, 2018

# class inheritance demo: a Researcher is a Person
class Person(object):
    def __init__(self, name):
        self.name = name

    def says_hello(self):
        print "Hello, my name is", self.name

class Researcher(Person):
    def __init__(self, title=None, areas=None, **kwargs):
        super(Researcher, self).__init__(**kwargs)
        self.title = title
        self.areas = areas

KLN = Researcher(name="Kristoffer L Nielbo",
                 title="Associate professor",
                 areas=["Humanities Computing", "Culture Analytics", "eScience"])
KLN.says_hello()

evolution of workflows


data

data: objects that are described over a set of (qualitative or quantitative) features
- fundamental difference between structured data and unstructured* data
- unstructured data include word processing files, pdfs, emails, social media posts, digital images, video, and audio
- today > 80% of all data are unstructured
- unstructured data require expertise in culture, media, linguistics...

data access and sampling
select (sample*) a set of documents (target data) relevant to your research question from a data collection
>> online databases and research libraries are excellent resources
- proprietary issues
- data protection acts
- ethical concerns
- availability (e.g., historical sources)
>> sample requirements
- all the data
- balancing and stratification
- bias reduction

we will focus on documents stored locally as plain text without markup

"""The First Book of Moses, called Genesis

{1:1} In the beginning God created the heaven and the earth. {1:2}
And the earth was without form, and void; and darkness was upon the
face of the deep. And the Spirit of God moved upon the face of the
waters.

{1:3} And God said, Let there be light: and there was light. {1:4}
And God saw the light, that it was good: and God divided the light"""

BUT with a bit of code everything is possible

import urllib2
from HTMLParser import HTMLParser

# print the tag structure and text content of a web page
class html_parser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "start tag:", tag
    def handle_endtag(self, tag):
        print "end tag  :", tag
    def handle_data(self, data):
        print "data     :", data

url = "https://knielbo.github.io//"
response = urllib2.urlopen(url)
webpage = response.read()
parser = html_parser()
parser.feed(webpage)

preprocessing

preprocessing: language normalization
to prepare a document, we need to parse, slice, and split it at the relevant level(s). unstructured data are very noisy, so to increase the signal we remove irrelevant data through preprocessing.
a range of text normalization techniques can be used to preprocess the data (see the sketch below):
- casefolding
- removal of non-alphanumeric characters (punctuation, blanks) and numerals
- vocabulary pruning
- identification of parts of speech
- reduction of inflectional forms through stemming and lemmatization
- disambiguation
- synonym substitution...
one man's rubbish may be another's treasure
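a minimal normalization sketch in Python, covering only the casefolding, character-removal, and stopword-pruning steps from the list above; the function and the stopword list are illustrative assumptions, not part of the original slides:

import re

def normalize(text, stopwords=None):
    # casefolding
    text = text.lower()
    # remove non-alphanumeric characters and numerals
    text = re.sub(r"[^a-z\s]", " ", text)
    # tokenize on whitespace
    tokens = text.split()
    # vocabulary pruning: drop stopwords if a list is supplied
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(normalize("{1:3} And God said, Let there be light: and there was light.",
                stopwords=["and", "there"]))
# -> ['god', 'said', 'let', 'be', 'light', 'was', 'light']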

example: stemming
- normalization by reducing inflected words to their stem, base, or root form
- the stem need not be identical to the morphological root
- it is sufficient that related words map to the same stem (the stem need not be a valid root)
- search engines treat words with the same stem as synonyms (conflation)

Porter stemming algorithm, step 1a (a code sketch follows below):
SSES -> SS    caresses -> caress
IES  -> I     ponies   -> poni
              ties     -> ti
SS   -> SS    caress   -> caress
S    ->       cats     -> cat
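a sketch of just this step in Python; the function name is an assumption, and a full stemmer (e.g., NLTK's PorterStemmer) implements all steps, not only 1a:

def porter_step_1a(word):
    # apply the step 1a suffix rules listed above
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I
    if word.endswith("ss"):
        return word           # SS   -> SS
    if word.endswith("s"):
        return word[:-1]      # S    ->
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w + " -> " + porter_step_1a(w))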

preprocessing: structuring words
selecting the right formalism for representing a problem over a data set; many techniques rely on basic probabilistic or geometric properties of the data set

example (raw):
I am Daniel I am Sam Sam I am
That Sam-I-am That Sam-I-am! I do not like that Sam-I-am...

example (normalized):
I am Daniel I am Sam Sam I am
That Sam I am That Sam I am I do not like that Sam I am...

word counts:
a          1   59   0.073
am         1   16   0.02
and        1   24   0.03
anyhwhere  1    1   0.001
anywhere   1    7   0.009
...
you        1   34   0.042
total     55  804   1.0
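counting tokens to build a frequency table like the one above takes a few lines with the standard library; a sketch on the normalized toy text (the numbers it prints come from this toy text, not from the slide's data):

from collections import Counter

text = "I am Daniel I am Sam Sam I am That Sam I am That Sam I am I do not like that Sam I am"
tokens = text.lower().split()
counts = Counter(tokens)
total = sum(counts.values())
for word in sorted(counts):
    # word, count, relative frequency
    print("%-8s %3d   %.3f" % (word, counts[word], counts[word] / float(total)))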

example: the vector space model
any collection of m documents can be represented in the vector space model by a document-term matrix of m documents and n terms. the vector space model is a basic modeling mechanism for a word- or document-space, depending on whether we look at rows or columns (see the sketch below).
- a document vector with only one word is collinear with that word's axis in the vocabulary
- a document vector that does not contain a specific word is orthogonal/perpendicular to that word's axis
- two documents are identical if they contain the same words in a different order (BOW assumption)
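a minimal document-term matrix sketch in plain Python; the toy documents and variable names are illustrative assumptions:

docs = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
vocab = sorted({w.lower() for d in docs for w in d.split()})
dtm = [[d.lower().split().count(w) for w in vocab] for d in docs]

for d, row in zip(docs, dtm):
    print(str(row) + "  <- " + d)
# rows are document vectors: "I am Sam" and "Sam I am" get identical rows
# (the BOW assumption), and a document without a given word has a 0 in that
# word's column, i.e., it is orthogonal to that word's axis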

analysis

analysis: basic properties
describe basic properties of the data, e.g., simple distributions and relations. these are results in themselves or input to more advanced analysis; their value depends critically on domain knowledge (a small sketch follows below).
beauty lies in simplicity
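one simple distribution of this kind is the rank-frequency profile of the vocabulary; a small sketch, assuming the tokens come out of the preprocessing step above:

from collections import Counter

tokens = "i am sam sam i am that sam i am i do not like that sam i am".split()
counts = Counter(tokens)

# rank-frequency profile: most frequent words first
for rank, (word, n) in enumerate(counts.most_common(), start=1):
    print("%2d  %-6s %3d" % (rank, word, n))

print("tokens: %d, types: %d" % (len(tokens), len(counts)))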

latent Dirichlet allocation (LDA)

notation:
- α: Dirichlet prior on the per-document topic distributions (proportions parameter)
- β: Dirichlet prior on the per-topic word distributions (topic parameter)
- θ_i: topic distribution for document i (per-document topic proportions)
- φ_k: word distribution for topic k (the topics)
- z_ij: topic for the j-th word in document i (per-word topic assignment)
- w_ij: the observed word

Procedure 1: Generative Model
1: choose θ_i ~ Dir(α), i is a document
2: choose φ_k ~ Dir(β), k is a topic
3: for each word position (i, j) do
4:   choose a topic z_ij ~ Multinomial(θ_i)
5:   choose a word w_ij ~ Multinomial(φ_{z_ij})
6: end for

the joint distribution defines a posterior probability P(θ, z, φ | w)
use the posterior to:
- train on a corpus: Bayesian inference on θ and φ
- for a new document d: fix P(w | z) to infer P(z | d)
multiple inference algorithms are available (variational expectation-maximization / VEM and Gibbs sampling / Gibbs); see the sketch below
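the slides do not name a specific implementation, so purely as an illustrative sketch, scikit-learn's variational LDA (the toy corpus and n_components=2 are assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "god created the heaven and the earth",
        "let there be light and there was light"]

X = CountVectorizer(stop_words="english").fit_transform(docs)   # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)    # per-document topic proportions (theta)
topic_words = lda.components_        # per-topic word weights (phi, unnormalized)

print(doc_topics.round(2))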

interpretation and evaluation

evaluation: is our model valid?

[figure: relevant objects (e.g., ham) vs. irrelevant objects (e.g., spam), and the set of objects classified with the relevant class label, marked CORRECT or ERROR]

precision: fraction of retrieved instances that are relevant
    P = TP / (TP + FP)
recall: fraction of relevant instances that are retrieved
    R = TP / (TP + FN)
P and R are inversely related. identify the balance through a precision-recall curve.
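a small worked example (the confusion counts are made up for illustration):

def precision(tp, fp):
    # fraction of retrieved instances that are relevant
    return tp / float(tp + fp)

def recall(tp, fn):
    # fraction of relevant instances that are retrieved
    return tp / float(tp + fn)

# e.g., 40 true positives, 10 false positives, 20 false negatives
print(precision(40, 10))   # 0.8
print(recall(40, 20))      # 0.666...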

interpretation: what does our model mean?
- philosophers and sinologists have been debating the existence of mind-body dualism in classical Chinese philosophy
- with domain experts, unsupervised learning was used to identify a multi-level dualistic semantic space
- one model (LDA) was further utilized to predict the class of origin of controversial text slices

knowledge


cases


History: Predictive Causality & Slow Decay
- historians and media researchers theorize about the causal dependencies between public discourse and advertisement
- time series analysis of keyword frequencies (from seed lists) indicated that for some categories ads shape society, while other categories merely reflect it
- advertisements show a faster decay (on-off intermittent behavior) than public discourse (long-range dependencies)