Statistical Structure in Natural Language
Tom Jackson, April 4th PACM seminar
Advisor: Bill Bialek


Some Linguistic Questions
- Why does language work so well? Unlimited capacity and flexibility; requires little conscious computational effort.
- Why does language work at all? Wittgenstein's dilemma: rather accurate transmission of meaning from one mind to another.
- How do we acquire language so easily? Poverty of the stimulus, yet everyone learns the same language.

Some Linguistic Answers
[Photo: Noam Chomsky]
- Language as words and rules: semantic and syntactic components of language
- Hierarchical structure: morphology (intra-word), syntax (intra-sentence), compositional rules and paragraph structure
- Chomsky: grammar must be universal
- Nowak: evolution of syntactic communication; it is more efficient to express an increasing number of concepts using combinatorial rules

Computational Linguistics
[Photo: Ferdinand de Saussure]
- Language performance vs. language competence: the Chomskyan Revolution, linguistics as psychology, corpus-based methods outside the mainstream
- Inherent drawbacks: no access to contextual meaning; word meaning is defined only through relationships to other words. Vive la différance!
- Relevant for applications
- What are appropriate metrics to use?

Information Theory
[Photo: Claude Shannon]
- Entropy, i.e. information capacity: $H(X) = -\sum_{x} p_x \log p_x$
- Mutual information, the information X provides about Y and vice versa: $I(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} p_{xy} \log \frac{p_{xy}}{p_x p_y}$
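For concreteness, a minimal numerical sketch of these two quantities (not part of the original slides; Python, numpy, and a toy joint distribution are assumed):

```python
# A minimal sketch: entropy and mutual information computed directly
# from an assumed toy joint distribution p_xy (values in bits).
import numpy as np

def entropy(p):
    """H = -sum p log2 p, ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint distribution matrix."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# Toy example: a mildly correlated pair of binary variables.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))  # about 0.278 bits
```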

Naïve Entropy Estimation
$\hat{H}(\{n_i\}) = -\sum_{i=1}^{K} \frac{n_i}{N} \log \frac{n_i}{N}$
- Pro: entirely straightforward
- Con: useless in the interesting, undersampled case ($N \lesssim K$). In the undersampled regime, just because we don't see an event doesn't mean it has zero probability.
- Analogous to overfitting a curve to data; more appropriate to consider a smooth class of distributions
- Problem: how to apply Occam's principle in an unbiased way?
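A sketch of the plug-in estimator and its undersampling bias (assumed Python/numpy, not the talk's C implementation; the toy numbers are illustrative):

```python
# The naive "plug-in" entropy estimator built from observed counts n_i.
# In the undersampled regime it systematically underestimates the entropy.
import numpy as np

def naive_entropy(counts):
    """H_hat = -sum_i (n_i/N) log2(n_i/N); unseen bins contribute nothing."""
    counts = np.asarray(counts, dtype=float)
    f = counts[counts > 0] / counts.sum()
    return -np.sum(f * np.log2(f))

# Uniform distribution over K = 1000 bins, but only N = 100 samples:
rng = np.random.default_rng(0)
K, N = 1000, 100
counts = np.bincount(rng.integers(K, size=N), minlength=K)
print(naive_entropy(counts))   # roughly 6.5 bits
print(np.log2(K))              # true entropy: about 9.97 bits
```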

Bayesian Inference
- Bayes' theorem: $\Pr(\text{model} \mid \text{data}) = \frac{\Pr(\text{data} \mid \text{model}) \Pr(\text{model})}{\Pr(\text{data})}$
- Here our prior is on the space of distributions: $P_\beta(\{q_i\}) = \frac{1}{Z}\, \delta\!\left(1 - \sum_{i=1}^{K} q_i\right) \prod_{i=1}^{K} q_i^{\beta - 1}$
- This generalizes the uniform prior to the Dirichlet prior parameterized by $\beta$.
- Posterior mean: $\langle q_i \rangle_\beta = \frac{n_i + \beta}{N + K\beta}$
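A small sketch of the posterior-mean formula above (assumed Python/numpy; the counts are made up):

```python
# Posterior mean of each bin probability under a symmetric Dirichlet(beta)
# prior: <q_i>_beta = (n_i + beta) / (N + K*beta).
# beta = 1 is the uniform prior (Laplace smoothing); beta -> 0 recovers
# the naive frequency estimate n_i / N.
import numpy as np

def dirichlet_posterior_mean(counts, beta):
    counts = np.asarray(counts, dtype=float)
    N, K = counts.sum(), counts.size
    return (counts + beta) / (N + K * beta)

counts = [5, 3, 0, 0]              # two bins never observed
for beta in (0.01, 1.0, 10.0):
    print(beta, dirichlet_posterior_mean(counts, beta))
```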

Bias in the Dirichlet Prior
[Figure: a priori (all $n_i = 0$) entropy estimate as a function of $\beta$, 1σ error shaded, for K = 10, K = 100, and K = 1000]
- The prior is highly biased for fixed values of $\beta$
- Obtain an unbiased prior by averaging over $\beta$
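A sketch of the point the figure makes (assumed Python/scipy, not reproduced from the slides): with no data, the Dirichlet prior by itself essentially pins down the entropy, so any fixed $\beta$ is a strong hidden assumption.

```python
# Prior-mean entropy with all n_i = 0:
#   <S>_beta = psi(K*beta + 1) - psi(beta + 1)   (in nats),
# which sweeps from ~0 (small beta) up to ln K (large beta).
import numpy as np
from scipy.special import digamma

def prior_mean_entropy(K, beta):
    return digamma(K * beta + 1.0) - digamma(beta + 1.0)

for K in (10, 100, 1000):
    for beta in (0.001, 0.1, 1.0, 10.0):
        print(K, beta, prior_mean_entropy(K, beta), "ln K =", np.log(K))
```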

The Estimator
$\hat{S}^m = \frac{1}{Z} \int d\beta\, \frac{d\xi}{d\beta}\, \rho(\{n_i\}, \beta)\, \langle S^m \rangle_\beta$, with $Z$ the corresponding normalization,
$\rho(\{n_i\}, \beta) = \frac{\Gamma(K\beta)}{\Gamma(N + K\beta)} \prod_{i=1}^{K} \frac{\Gamma(n_i + \beta)}{\Gamma(\beta)}$
$\langle S \rangle_\beta = \sum_{i=1}^{K} \frac{n_i + \beta}{N + K\beta} \left[ \Psi(N + K\beta + 1) - \Psi(n_i + \beta + 1) \right]$
- Computational complexity is still linear in N
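A rough numerical sketch of this averaged estimator (assumed Python/scipy with a simple grid over $\beta$; the talk's actual implementation was in C, and the published entropy-estimation papers give the real algorithm):

```python
# Average <S>_beta over beta, weighting by the evidence rho({n_i}, beta)
# and the measure d(xi)/d(beta), where xi(beta) is the a priori expected
# entropy from the previous slide. All entropies are in nats internally.
import numpy as np
from scipy.special import digamma, polygamma, gammaln

def mean_entropy_given_beta(counts, beta):
    """<S>_beta = sum_i (n_i+b)/(N+K*b) * [psi(N+K*b+1) - psi(n_i+b+1)]"""
    counts = np.asarray(counts, dtype=float)
    N, K = counts.sum(), counts.size
    w = (counts + beta) / (N + K * beta)
    return np.sum(w * (digamma(N + K * beta + 1.0) - digamma(counts + beta + 1.0)))

def log_evidence(counts, beta):
    """log rho = log G(K b) - log G(N+K b) + sum_i [log G(n_i+b) - log G(b)]"""
    counts = np.asarray(counts, dtype=float)
    N, K = counts.sum(), counts.size
    return (gammaln(K * beta) - gammaln(N + K * beta)
            + np.sum(gammaln(counts + beta) - gammaln(beta)))

def averaged_entropy(counts, betas=np.logspace(-3, 2, 400)):
    """Evidence-weighted average of <S>_beta over a grid of beta values."""
    counts = np.asarray(counts, dtype=float)
    K = counts.size
    # d(xi)/d(beta) for xi(beta) = psi(K*beta + 1) - psi(beta + 1)
    dxi = K * polygamma(1, K * betas + 1.0) - polygamma(1, betas + 1.0)
    log_w = np.array([log_evidence(counts, b) for b in betas])
    w = np.exp(log_w - log_w.max()) * dxi
    s = np.array([mean_entropy_given_beta(counts, b) for b in betas])
    dbeta = np.gradient(betas)
    return np.sum(w * s * dbeta) / np.sum(w * dbeta)

# Same toy data as before: 100 samples from a uniform over 1000 bins.
counts = np.bincount(np.random.default_rng(1).integers(1000, size=100), minlength=1000)
print(averaged_entropy(counts) / np.log(2))  # in bits; should land much closer
# to the true log2(1000) ~ 9.97 than the naive estimate (~6.5 bits)
```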

Getting to Work
- Summer '03 research project
- MI estimation algorithm implemented in ~5K lines of C code
- Texts used:
  - Project Gutenberg's Moby Dick
  - Project Gutenberg's War and Peace: K = 18004, N = 567657
  - Reuters news articles corpus: K = 40696, N = 2478116

Results 1
[Figure: War and Peace data; curves for the original text, sentence order scrambled, words within sentences scrambled, and all words scrambled]

Results 2
[Figure: War and Peace data; curves for the original text, sentence order scrambled, words within sentences scrambled, and all words scrambled]

Results 3
[Figure: War and Peace data (original text, sentence order scrambled, words within sentences scrambled, all words scrambled) together with the Reuters corpus]

Information Bottleneck
- Are we really observing a crossover between syntactic and semantic constraints? What carries the information at a given separation?
- Information Bottleneck method: compress $X$ into $\tilde{X}$ while maximizing $I(\tilde{X}; Y)$. Two algorithms to determine $p(\tilde{x} \mid x)$ (see the sketch below):
  - Simulated annealing (top-down clustering)
  - Agglomerative (bottom-up clustering)
- Difficult to implement in this case: we start with $K \sim 10^5$ but expect relevant clusters at $\tilde{K} \sim 10$, and there is considerable degeneracy from hapax legomena
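A minimal sketch of the agglomerative (bottom-up) variant mentioned above (assumed Python/numpy; the talk gives no code, and this follows the standard greedy merge criterion of minimizing the information lost about $Y$):

```python
# Agglomerative IB sketch: repeatedly merge the pair of clusters whose merger
# loses the least information about Y. Inputs: p_x (marginal over the K items)
# and p_y_given_x (K x M matrix of conditionals p(y|x)).
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def merge_cost(p_x, p_y_given_x, a, b):
    """Information (bits) lost about Y by merging clusters a and b."""
    w = p_x[a] + p_x[b]
    pi_a, pi_b = p_x[a] / w, p_x[b] / w
    p_mix = pi_a * p_y_given_x[a] + pi_b * p_y_given_x[b]
    js = pi_a * kl(p_y_given_x[a], p_mix) + pi_b * kl(p_y_given_x[b], p_mix)
    return w * js

def agglomerative_ib(p_x, p_y_given_x, n_clusters):
    """Greedy bottom-up clustering; returns cluster marginals and conditionals."""
    p_x = list(p_x)
    p_y_given_x = [np.asarray(r, dtype=float) for r in p_y_given_x]
    while len(p_x) > n_clusters:
        pairs = [(merge_cost(p_x, p_y_given_x, a, b), a, b)
                 for a in range(len(p_x)) for b in range(a + 1, len(p_x))]
        _, a, b = min(pairs)
        w = p_x[a] + p_x[b]
        merged = (p_x[a] * p_y_given_x[a] + p_x[b] * p_y_given_x[b]) / w
        for idx in (b, a):
            del p_x[idx]; del p_y_given_x[idx]
        p_x.append(w); p_y_given_x.append(merged)
    return np.array(p_x), np.vstack(p_y_given_x)
```

The full pairwise scan at every merge is why starting from $K \sim 10^5$ word types is impractical in this naive form, which matches the "difficult to implement" point above.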

Conclusions
- Robust estimation of functionals of probability distributions is possible even with undersampling
- Pairwise mutual information displays power-law behavior, with breaks at semantically significant length scales

Possible Applications
- Semantic extraction: automatic document summarization, search engines
- Data compression: Shannon's compression theorem; existing algorithms work at finite range only

For Further Reading
- Language: "Words and Rules: The Ingredients of Language", by Steven Pinker; Nature 417:611
- Entropy estimation: Physical Review E 52:6841; xxx.lanl.gov/abs/physics/0108025

End