Statistical Structure in Natural Language. Tom Jackson April 4th PACM seminar Advisor: Bill Bialek


1 Statistical Structure in Natural Language Tom Jackson April 4th PACM seminar Advisor: Bill Bialek

2 Some Linguistic Questions
- Why does language work so well?
  - Unlimited capacity and flexibility
  - Requires little conscious computational effort
- Why does language work at all?
  - Wittgenstein's dilemma
  - Rather accurate transmission of meaning from one mind to another
- How do we acquire language so easily?
  - Poverty of the stimulus
  - Everyone learns the same language

3 Some Linguistic Answers
[Image: Noam Chomsky]
- Language as words and rules
  - Semantic and syntactic components of language
- Hierarchical structure
  - Morphology (intra-word)
  - Syntax (intra-sentence)
  - Compositional rules and paragraph structure
- Chomsky: Grammar must be universal
- Nowak: evolution of syntactic communication
  - More efficient to express an increasing number of concepts using combinatorial rules.

4 Computational Linguistics
[Image: Ferdinand de Saussure]
- Language performance vs. language competence
  - The Chomskyan Revolution
  - Linguistics as psychology
  - Corpus-based methods outside the mainstream
- Inherent drawbacks
  - No access to contextual meaning
  - Word meaning only defined through relationships to other words
  - Vive la différance!
- Relevant for applications
- What are appropriate metrics to use?

5 Information Theory
[Image: Claude Shannon]
- Entropy: information capacity
  $$H(X) = -\sum_{x} p(x) \log p(x)$$
- Mutual information: the information X provides about Y, and vice versa.
  $$I(X;Y) = H(X) + H(Y) - H(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
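As a minimal sketch of the two definitions on this slide, the following computes entropy and mutual information from a known joint probability table. The function names and the choice of base-2 logarithms (bits) are assumptions made for illustration; the slide does not fix a base.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table joint[x][y]."""
    p_x = [sum(row) for row in joint]            # marginal over y
    p_y = [sum(col) for col in zip(*joint)]      # marginal over x
    p_xy = [p for row in joint for p in row]     # flattened joint
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

# Two fair bits that are independent, then two that are identical.
independent = [[0.25, 0.25], [0.25, 0.25]]
identical = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(independent), mutual_information(identical))
```

Independent variables share no information, while two identical fair bits share exactly one bit.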

6 Naïve Entropy Estimation
$$\hat{H}(\{n_i\}) = -\sum_{i=1}^{K} \frac{n_i}{N} \log \frac{n_i}{N}$$
- Pro: Entirely straightforward
- Con: Useless for the interesting case N < K (fewer samples than possible outcomes)
- In the undersampled regime, just because we don't see an event doesn't mean it has zero probability.
  - Analogous to overfitting a curve to data
  - More appropriate to consider a smooth class of distributions
- Problem: How to apply Occam's principle in an unbiased way?
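The naïve estimator can be sketched in a few lines, assuming the data arrive as a sequence of hashable symbols. The function name is hypothetical, and natural logarithms (nats) are an arbitrary choice.

```python
import math
from collections import Counter

def naive_entropy(samples):
    """Plug-in entropy estimate in nats, using observed frequencies n_i / N."""
    counts = Counter(samples)
    n_total = sum(counts.values())
    return -sum((n / n_total) * math.log(n / n_total) for n in counts.values())

# Four samples, all distinct: the estimate is log(4) regardless of how many
# outcomes were actually possible.
print(naive_entropy(["the", "whale", "sea", "ship"]))
```

Because unseen outcomes are assigned zero probability, the estimate can never exceed log N, which is why it fails when the number of possible outcomes exceeds the sample size.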

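7 Bayesian Inference
- Bayes' Theorem:
  $$\Pr(\text{model} \mid \text{data}) = \frac{\Pr(\text{data} \mid \text{model}) \, \Pr(\text{model})}{\Pr(\text{data})}$$
- Here our prior is on the space of distributions
  $$P_\beta(\{q_i\}) = \frac{1}{Z} \, \delta\!\left(1 - \sum_{i=1}^{K} q_i\right) \prod_{i=1}^{K} q_i^{\beta - 1}$$
- Generalize uniform prior to the Dirichlet prior parameterized by β.
  $$\langle q_i \rangle_\beta = \frac{n_i + \beta}{N + K\beta}$$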
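The posterior-mean formula on this slide amounts to adding β pseudo-counts to every outcome. A minimal sketch, assuming the counts are given as a list with one entry per possible outcome (including zeros for unseen ones); the function name is hypothetical.

```python
def dirichlet_posterior_mean(counts, beta):
    """Posterior mean of each q_i under a symmetric Dirichlet(beta) prior."""
    n_total = sum(counts)      # N, the number of samples
    k = len(counts)            # K, the number of possible outcomes
    return [(n + beta) / (n_total + k * beta) for n in counts]

# Three possible outcomes, one of them never observed.
print(dirichlet_posterior_mean([3, 0, 1], beta=1.0))
```

With β = 1 (the uniform prior) the unseen outcome receives probability 1/7 instead of zero; as β approaches 0 the estimate approaches the naïve frequencies.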

8 Bias in the Dirichlet Prior
[Figure: a priori (all n = 0) entropy estimate as a function of β, with the 1σ error shaded; curves for K = 10, K = 100, and K = 1000]
- Prior highly biased for fixed values of β
- Obtain an unbiased prior by averaging over β

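9 The estimator
$$\hat{S}^m = \frac{1}{Z(\{n_i\})} \int d\beta \, \frac{d\xi}{d\beta} \, \rho(\{n_i\}, \beta) \, \langle S^m \rangle_\beta$$
$$\rho(\{n_i\}, \beta) = \frac{\Gamma(K\beta)}{\Gamma(N + K\beta)} \prod_{i=1}^{K} \frac{\Gamma(n_i + \beta)}{\Gamma(\beta)}$$
$$\langle S \rangle_\beta = -\sum_{i=1}^{K} \frac{n_i + \beta}{N + K\beta} \left[ \Psi(n_i + \beta + 1) - \Psi(N + K\beta + 1) \right]$$
- Computational complexity still linear in N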
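The fixed-β posterior mean entropy on this slide can be sketched with the standard library alone. This covers only the inner quantity at a single β, not the average over β, and it is not the C implementation described in the talk. The `digamma` helper (recurrence plus asymptotic series) and both function names are assumptions made for illustration.

```python
import math

def digamma(x):
    """Digamma function Psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # Psi(x) = Psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def posterior_mean_entropy(counts, beta):
    """Posterior mean entropy in nats at fixed beta; counts has one entry per outcome."""
    n_total = sum(counts)
    denom = n_total + len(counts) * beta       # N + K * beta
    psi_total = digamma(denom + 1.0)
    return -sum(
        (n + beta) / denom * (digamma(n + beta + 1.0) - psi_total)
        for n in counts
    )

# A priori estimate (all counts zero) for K = 10 and two choices of beta.
print(posterior_mean_entropy([0] * 10, 0.1), posterior_mean_entropy([0] * 10, 1.0))
```

Setting every count to zero gives the a priori entropy estimate for a given K and β, so the same function shows how strongly a fixed β pins down the entropy before any data are seen.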

10 Getting to Work
- Summer '03 research project
- MI estimation algorithm implemented in ~5K lines of C code
- Texts used:
  - Project Gutenberg's Moby Dick
  - Project Gutenberg's War and Peace
    - K = 18004, N =
  - Reuters news articles corpus
    - K = 40696, N =

11 Results 1
[Figure: War and Peace data. Curves: original text; sentence order scrambled; words within sentence scrambled; all words scrambled]

12 Results 2
[Figure: War and Peace data. Curves: original text; sentence order scrambled; words within sentence scrambled; all words scrambled]

13 Results 3
[Figure: War and Peace data and Reuters corpus. Curves: original text; sentence order scrambled; words within sentence scrambled; all words scrambled]
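The three scrambled controls named in the results legends could be generated along the following lines. This is a sketch under stated assumptions, not the procedure used in the talk: the text is assumed to be already split into sentences (lists of words), the function names are hypothetical, and keeping the original sentence lengths in the all-words scramble is a design choice made here.

```python
import random

def scramble_sentence_order(sentences, rng):
    """Shuffle whole sentences; word order inside each sentence is kept."""
    shuffled = [list(s) for s in sentences]
    rng.shuffle(shuffled)
    return shuffled

def scramble_within_sentences(sentences, rng):
    """Shuffle words inside each sentence; sentence order is kept."""
    result = []
    for sentence in sentences:
        words = list(sentence)
        rng.shuffle(words)
        result.append(words)
    return result

def scramble_all_words(sentences, rng):
    """Shuffle every word in the text, then re-cut to the original sentence lengths."""
    words = [w for s in sentences for w in s]
    rng.shuffle(words)
    result, start = [], 0
    for sentence in sentences:
        result.append(words[start:start + len(sentence)])
        start += len(sentence)
    return result

# Toy input; a fixed seed keeps the run repeatable.
sentences = [["call", "me", "ishmael"], ["it", "was", "a", "whale"], ["he", "sailed"]]
rng = random.Random(0)
print(scramble_within_sentences(sentences, rng))
```

Each control preserves the word frequencies of the original text, so differences between the curves reflect word order rather than vocabulary.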

14 Information Bottleneck
- Are we really observing a crossover between syntactic and semantic constraints? What carries the information at a given separation?
- Information Bottleneck method: compress $X$ into $\tilde{X}$ while maximizing $I(\tilde{X};Y)$. Two algorithms to determine $p(\tilde{x} \mid x)$:
  - Simulated Annealing (top down clustering)
  - Agglomerative (bottom up clustering)
- Difficult to implement in this case
  - Start with $K \sim 10^5$; expect relevant clusters at $K \sim 10$
  - Considerable degeneracy from hapax legomena

15 Conclusions
- Robust estimation of probability distribution functionals is possible even with undersampling
- Pairwise mutual information displays power law behavior, with breaks at semantically significant length scales
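To make "pairwise mutual information" at a given separation concrete, here is a minimal sketch that pairs each word with the word d positions later and applies the plug-in (naïve) formula. It is deliberately not the Bayesian estimator from the talk: the plug-in estimate is biased upward when pairs are undersampled. The function name and the use of bits are assumptions.

```python
import math
from collections import Counter

def pairwise_mi(words, d):
    """Plug-in mutual information (bits) between words[i] and words[i + d]."""
    if d < 1 or d >= len(words):
        raise ValueError("separation d must satisfy 1 <= d < len(words)")
    pairs = list(zip(words, words[d:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b]).
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

# A strictly alternating sequence: each word fully determines the word d later.
alternating = ["a", "b"] * 50
print(pairwise_mi(alternating, 2))
```

Evaluating this function over a range of d values for a real text gives the kind of curve the conclusions refer to, subject to the bias noted above.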

16 Possible Applications
- Semantic extraction
  - Automatic document summarizing
  - Search engines
- Data compression
  - Shannon compression theorem
  - Existing algorithms work at finite range only

17 For further reading
- Language
  - Words and Rules: the ingredients of language, by Steven Pinker
  - Nature 417:611
- Entropy estimation
  - Physical Review E 52:6841
  - xxx.lanl.gov/abs/physics/

18 End
