How to Use Probabilities: The Crash Course


Goals of this lecture. Probability notation like p(Y | X): What does this expression mean? How can I manipulate it? How can I estimate its value in practice? Probability models: What is one? Can we build one for language ID? How do I know if my model is any good?

3 Kinds of Statistics. Descriptive: mean Hopkins SAT (or median). Confirmatory: statistically significant? Predictive: wanna bet? (That last kind is this course. Why?)

Fugue for Tinhorns. The opening number from Guys and Dolls, the 1950 Broadway musical about gamblers. Words & music by Frank Loesser. Video: https://www.youtube.com/results?search_query=fugue+for+tinhorns Lyrics: http://www.lyricsmania.com/fugue_for_tinhorns_lyrics_guys_and_dolls.html

Notation for Greenhorns. The "Paul Revere" probability model: p(Paul Revere wins | weather's clear) = 0.9

What does that really mean? p(Paul Revere wins | weather's clear) = 0.9.
Past performance? Revere's won 90% of races with clear weather.
Hypothetical performance? If he ran the race in many parallel universes ...
Subjective strength of belief? Would pay up to 90 cents for a chance to win $1.
Output of some computable formula? OK, but then which formulas should we trust? p(Y | X) versus q(Y | X)?

p is a function on sets of outcomes.
p(win | clear) = p(win, clear) / p(clear)
[Venn diagram: the set of all outcomes (races), with overlapping regions for "weather's clear" and "Paul Revere wins".]

p is a function on sets of outcomes.
p(win | clear) = p(win, clear) / p(clear), where the bar notation is syntactic sugar, the comma denotes logical conjunction of predicates, and "clear" is a predicate selecting the races where the weather's clear.
[Same Venn diagram: "weather's clear", "Paul Revere wins", all outcomes (races).]
p measures the total probability of a set of outcomes (an "event").
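A minimal sketch in Python, with made-up outcome probabilities (the numbers are invented, chosen only so the answer comes out to the earlier slide's 0.9): p is just "total probability of a set of outcomes", and the conditional probability is a ratio of two such event probabilities.

```python
# Hypothetical outcome space of races: each outcome records the weather and
# whether Paul Revere won, with an (invented) probability.
outcomes = {
    ("clear", "revere_wins"): 0.18,
    ("clear", "revere_loses"): 0.02,
    ("cloudy", "revere_wins"): 0.10,
    ("cloudy", "revere_loses"): 0.70,
}

def p(event):
    """Total probability of a set of outcomes (an 'event')."""
    return sum(prob for outcome, prob in outcomes.items() if outcome in event)

all_outcomes = set(outcomes)
clear = {o for o in all_outcomes if o[0] == "clear"}        # predicate: weather's clear
wins = {o for o in all_outcomes if o[1] == "revere_wins"}   # predicate: Revere wins

# Conditional probability: p(win | clear) = p(win, clear) / p(clear)
p_win_given_clear = p(wins & clear) / p(clear)
print(p_win_given_clear)   # 0.18 / 0.20 = 0.9 (up to floating-point rounding)
```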

Most of the Required Properties of p (axioms):
p(∅) = 0; p(all outcomes) = 1.
p(X) ≤ p(Y) for any X ⊆ Y.
p(X) + p(Y) = p(X ∪ Y) provided X ∩ Y = ∅; e.g., p(win & clear) + p(win & ¬clear) = p(win).
[Same Venn diagram: "weather's clear", "Paul Revere wins", all outcomes (races).] p measures the total probability of a set of outcomes (an "event").

Commas denote conjunction.
p(Paul Revere wins, Valentine places, Epitaph shows | weather's clear)
What happens as we add conjuncts to the left of the bar? The probability can only decrease, and the numerator of the historical estimate is likely to go to zero:
(# times Revere wins AND Val places AND weather's clear) / (# times weather's clear)

Commas denote conjunction.
p(Paul Revere wins, Valentine places, Epitaph shows | weather's clear)
p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, ...)
What happens as we add conjuncts to the right of the bar? The probability could increase or decrease. The probability gets more relevant to our case (less bias), but the probability estimate gets less reliable (more variance):
(# times Revere wins AND weather's clear AND it's May 17) / (# times weather's clear AND it's May 17)
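A small sketch, with invented race records, of this bias/variance trade-off: the historical estimate is a ratio of counts, and each extra condition to the right of the bar shrinks the pool of matching races.

```python
# Hypothetical race records; the field names and values are invented for illustration.
races = [
    {"revere_wins": True,  "clear": True,  "date": "May 17"},
    {"revere_wins": False, "clear": True,  "date": "June 2"},
    {"revere_wins": True,  "clear": True,  "date": "July 9"},
    {"revere_wins": False, "clear": False, "date": "May 17"},
]

def estimate(records, target, conditions):
    """Historical estimate: (# records where conditions and target hold) /
    (# records where conditions hold). Returns None if nothing matches."""
    matching = [r for r in records if all(r[k] == v for k, v in conditions.items())]
    if not matching:
        return None   # denominator is zero: no data at all under these conditions
    return sum(r[target] for r in matching) / len(matching)

# Few conditions: many matching races, a stable but possibly biased estimate.
print(estimate(races, "revere_wins", {"clear": True}))                    # 2/3
# More conditions: more relevant to today's race, but now rests on a single race.
print(estimate(races, "revere_wins", {"clear": True, "date": "May 17"}))  # 1/1
```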

Simplifying the Right Side: Backing Off.
p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, ...)
Back off by dropping most of the conditions. The result is not exactly what we want, but at least we can get a reasonable estimate of it! (i.e., more bias but less variance). Try to keep the conditions that we suspect will have the most influence on whether Paul Revere wins.

Simplifying the Left Side: Backing Off.
p(Paul Revere wins, Valentine places, Epitaph shows | weather's clear)
Dropping conjuncts to the left of the bar is NOT ALLOWED! But we can do something similar to help.

Factoring the Left Side: The Chain Rule.
p(Revere, Valentine, Epitaph | weather's clear)
= p(Revere | Valentine, Epitaph, weather's clear)
* p(Valentine | Epitaph, weather's clear)
* p(Epitaph | weather's clear)
In count notation: RVEW/W = RVEW/VEW * VEW/EW * EW/W. True because the numerators cancel against the denominators. Makes perfect sense when read from bottom to top: first Epitaph?, then Valentine?, then Revere? (e.g., 1/3 * 1/5 * 1/4 for the three factors above). [Decision-tree diagram branching on Epitaph?, Valentine?, Revere? in turn.]
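A quick numerical check of the cancellation argument, using invented counts (chosen so that the three factors come out to the slide's example values 1/3, 1/5, 1/4):

```python
# Invented counts of past clear-weather races:
#   W    = clear races
#   EW   = Epitaph shows & clear
#   VEW  = Valentine places & Epitaph shows & clear
#   RVEW = Revere wins & Valentine places & Epitaph shows & clear
W, EW, VEW, RVEW = 60, 15, 3, 1

joint = RVEW / W                              # p(Revere, Valentine, Epitaph | clear)
chain = (RVEW / VEW) * (VEW / EW) * (EW / W)  # chain-rule product: 1/3 * 1/5 * 1/4
print(joint, chain)                           # both 1/60; the middle counts cancel
assert abs(joint - chain) < 1e-12
```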

Factoring the Left Side: The Chain Rule.
p(Revere, Valentine, Epitaph | weather's clear)        RVEW/W
= p(Revere | Valentine, Epitaph, weather's clear)      = RVEW/VEW
* p(Valentine | Epitaph, weather's clear)              * VEW/EW
* p(Epitaph | weather's clear)                         * EW/W
True because the numerators cancel against the denominators. Makes perfect sense when read from bottom to top. The chain rule moves material to the right of the bar so it can be ignored (backed off). If this probability is unchanged by backoff, we say Revere was CONDITIONALLY INDEPENDENT of Valentine and Epitaph (conditioned on the weather's being clear). Often we just ASSUME conditional independence to get the nice product above.

Factoring the Left Side: The Chain Rule.
p(Revere | Valentine, Epitaph, weather's clear). Conditional independence lets us use backed-off data from all four of these cases to estimate their shared probabilities. If this probability is unchanged by backoff, we say Revere was CONDITIONALLY INDEPENDENT of Valentine and Epitaph (conditioned on the weather's being clear).
[Decision-tree diagram: given clear weather, branch on Epitaph? and then Valentine?; in all four resulting cases the final Revere? question has the same probabilities (yes 1/4, no 3/4), so the Epitaph and Valentine answers are irrelevant.]

Courses in handicapping [i.e., estimating a horse's odds] should be required, like composition and Western civilization, in our universities. For sheer complexity, there's nothing like a horse race, excepting life itself... To weigh and evaluate a vast grid of information, much of it meaningless, and to arrive at sensible, if erroneous, conclusions, is a skill not to be sneezed at. - Richard Russo, The Risk Pool (a novel)

Remember Language ID? "Horses and Lukasiewicz are on the curriculum." Is this English or Polish or what? We had some notion of using n-gram models ... Is it good (= likely) English? Is it good (= likely) Polish? The space of outcomes will be not races but character sequences (x_1, x_2, x_3, ...) where x_n = EOS.

Remember Language ID? Let p(X) = probability of text X in English. Let q(X) = probability of text X in Polish. Which probability is higher? (We'd also like a bias toward English, since it's more likely a priori; ignore that for now.) "Horses and Lukasiewicz are on the curriculum." p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)

Apply the Chain Rule.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
= p(x_1=h) (est. 4470/52108)
* p(x_2=o | x_1=h) (est. 395/4470)
* p(x_3=r | x_1=h, x_2=o) (est. 5/395)
* p(x_4=s | x_1=h, x_2=o, x_3=r) (est. 3/5)
* p(x_5=e | x_1=h, x_2=o, x_3=r, x_4=s) (est. 3/3)
* p(x_6=s | x_1=h, x_2=o, x_3=r, x_4=s, x_5=e) (est. 0/3)
* ...
= 0 (counts from the Brown corpus)

Back Off on the Right Side.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ p(x_1=h) (est. 4470/52108)
* p(x_2=o | x_1=h) (est. 395/4470)
* p(x_3=r | x_1=h, x_2=o) (est. 5/395)
* p(x_4=s | x_2=o, x_3=r) (est. 12/919)
* p(x_5=e | x_3=r, x_4=s) (est. 12/126)
* p(x_6=s | x_4=s, x_5=e) (est. 3/485)
* ...
= 7.3e-10 * ... (counts from the Brown corpus)

Change the Notation.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ p(x_1=h) (est. 4470/52108)
* p(x_2=o | x_1=h) (est. 395/4470)
* p(x_i=r | x_{i-2}=h, x_{i-1}=o, i=3) (est. 5/395)
* p(x_i=s | x_{i-2}=o, x_{i-1}=r, i=4) (est. 12/919)
* p(x_i=e | x_{i-2}=r, x_{i-1}=s, i=5) (est. 12/126)
* p(x_i=s | x_{i-2}=s, x_{i-1}=e, i=6) (est. 3/485)
* ...
= 7.3e-10 * ... (counts from the Brown corpus)

Another Independence Assumption.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ p(x_1=h) (est. 4470/52108)
* p(x_2=o | x_1=h) (est. 395/4470)
* p(x_i=r | x_{i-2}=h, x_{i-1}=o) (est. 1417/14765)
* p(x_i=s | x_{i-2}=o, x_{i-1}=r) (est. 1573/26412)
* p(x_i=e | x_{i-2}=r, x_{i-1}=s) (est. 1610/12253)
* p(x_i=s | x_{i-2}=s, x_{i-1}=e) (est. 2044/21250)
* ...
= 5.4e-7 * ... (counts from the Brown corpus)
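The same product, checked in code using the Brown-corpus counts shown above (only these six factors, so this is just the prefix of the full product for the sentence):

```python
# Counts from the slide, one (numerator, denominator) pair per factor of
# p(h) * p(o|h) * p(r|h,o) * p(s|o,r) * p(e|r,s) * p(s|s,e)
factors = [(4470, 52108), (395, 4470), (1417, 14765),
           (1573, 26412), (1610, 12253), (2044, 21250)]

prob = 1.0
for num, den in factors:
    prob *= num / den
print(f"{prob:.2e}")   # roughly 5.5e-07; the slide shows this product as 5.4e-7
```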

Simplify the Notation.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ p(x_1=h) (est. 4470/52108)
* p(x_2=o | x_1=h) (est. 395/4470)
* p(r | h, o) (est. 1417/14765)
* p(s | o, r) (est. 1573/26412)
* p(e | r, s) (est. 1610/12253)
* p(s | s, e) (est. 2044/21250)
* ...
(counts from the Brown corpus)

Simplify the Notation.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ p(h | BOS, BOS) (est. 4470/52108)
* p(o | BOS, h) (est. 395/4470)
* p(r | h, o) (est. 1417/14765)
* p(s | o, r) (est. 1573/26412)
* p(e | r, s) (est. 1610/12253)
* p(s | s, e) (est. 2044/21250)
* ...
These are the parameters of our old trigram generator! Same assumptions about language. These basic probabilities are used to define p(horses); the estimates shown are the values of those parameters, as naively estimated from Brown-corpus counts.

Simplify the Notation.
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ t_{BOS,BOS,h} (est. 4470/52108)
* t_{BOS,h,o} (est. 395/4470)
* t_{h,o,r} (est. 1417/14765)
* t_{o,r,s} (est. 1573/26412)
* t_{r,s,e} (est. 1610/12253)
* t_{s,e,s} (est. 2044/21250)
* ...
The parameters of our old trigram generator! Same assumptions about language. This notation emphasizes that they're just real variables whose values must be estimated; the estimates shown are naively computed from Brown-corpus counts.

Definition: Probability Model. [Diagram: a Trigram Model (defined in terms of parameters like t_{h,o,r} and t_{o,r,s}), combined with specific parameter values, yields a definition of p, which we can use to generate random text or to find event probabilities.]
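A minimal sketch of that box diagram: a trigram model is a table of parameters t_{x,y,z} = p(z | x, y), and once parameter values are plugged in we can both generate random text and find event probabilities. The tiny parameter table below is invented for illustration; it is not the Brown-corpus estimate.

```python
import random

# Invented parameters t[(x, y)][z] = p(next char z | previous two chars x, y).
# "BOS" marks the beginning of the text, "EOS" the end.
t = {
    ("BOS", "BOS"): {"h": 0.6, "o": 0.4},
    ("BOS", "h"): {"o": 1.0},
    ("BOS", "o"): {"r": 1.0},
    ("h", "o"): {"r": 0.5, "EOS": 0.5},
    ("o", "r"): {"s": 1.0},
    ("r", "s"): {"e": 0.7, "EOS": 0.3},
    ("s", "e"): {"s": 1.0},
    ("e", "s"): {"EOS": 1.0},
}

def generate(t, max_len=20):
    """Generate random text by repeatedly sampling the next character."""
    x, y, out = "BOS", "BOS", []
    for _ in range(max_len):
        dist = t[(x, y)]
        z = random.choices(list(dist), weights=list(dist.values()))[0]
        if z == "EOS":
            break
        out.append(z)
        x, y = y, z
    return "".join(out)

def prob(t, text):
    """p(text): the product of the per-character trigram parameters."""
    x, y, p = "BOS", "BOS", 1.0
    for z in list(text) + ["EOS"]:
        p *= t.get((x, y), {}).get(z, 0.0)
        x, y = y, z
    return p

print(generate(t))          # e.g. "horses", "hors", or "ho"
print(prob(t, "horses"))    # 0.6 * 1.0 * 0.5 * 1.0 * 0.7 * 1.0 * 1.0 = 0.21
```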

English vs. Polish. [Diagram: the same Trigram Model (defined in terms of parameters like t_{h,o,r} and t_{o,r,s}) combined with English parameter values gives a definition of p, and combined with Polish parameter values gives a definition of q; to classify a text x, compute p(x) and q(x).]

What is X in p(X)? An element (or subset) of some implicit outcome space: e.g., a race; e.g., a sentence. What if the outcome is a whole text? p(text) = p(sentence 1, sentence 2, ...) = p(sentence 1) * p(sentence 2 | sentence 1) * ...

What is X in p(X)? An element (or subset) of some implicit outcome space, e.g., a race, a sentence, a text. Suppose an outcome is a sequence of letters: p(horses). But we rewrote p(horses) as p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...) ≈ p(x_1=h) * p(x_2=o | x_1=h) * ... What does this variable=value notation mean?

Random Variables: What is "variable" in p(variable=value)? Answer: the variable is really a function of the Outcome.
p(x_1=h) * p(x_2=o | x_1=h) * ... : the Outcome is a sequence of letters; x_2 is the second letter in the sequence.
p(number of heads=2), or just p(H=2) or p(2): the Outcome is a sequence of 3 coin flips; H is the number of heads.
p(weather's clear=true), or just p(weather's clear): the Outcome is a race; "weather's clear" is true or false.

Random Variables: What is "variable" in p(variable=value)? Answer: the variable is really a function of the Outcome.
p(x_1=h) * p(x_2=o | x_1=h) * ... : the Outcome is a sequence of letters; x_2(Outcome) is the second letter in the sequence.
p(number of heads=2), or just p(H=2) or p(2): the Outcome is a sequence of 3 coin flips; H(Outcome) is the number of heads.
p(weather's clear=true), or just p(weather's clear): the Outcome is a race; weather's clear(Outcome) is true or false.

Random Variables: What is "variable" in p(variable=value)? p(number of heads=2), or just p(H=2). The Outcome is a sequence of 3 coin flips; H is the number of heads in the outcome. So p(H=2) = p(H(Outcome)=2), which picks out the set of outcomes with 2 heads: = p({HHT, HTH, THH}) = p(HHT) + p(HTH) + p(THH). [Diagram: all 8 outcomes TTT, TTH, HTT, HTH, THT, THH, HHT, HHH, with the 2-heads subset circled.]
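The coin-flip example in code: H is literally a function of the outcome, and p(H=2) sums the probabilities of the outcomes it picks out (a fair coin is assumed, so each of the 8 sequences gets probability 1/8).

```python
from itertools import product

# Outcome space: all sequences of 3 coin flips, each with probability 1/8.
outcomes = {"".join(seq): 1 / 8 for seq in product("HT", repeat=3)}

def H(outcome):
    """The random variable 'number of heads' is a function of the outcome."""
    return outcome.count("H")

# p(H=2) picks out the outcomes with exactly 2 heads and sums their probabilities.
p_H_equals_2 = sum(prob for outcome, prob in outcomes.items() if H(outcome) == 2)
print(p_H_equals_2)   # 0.375 = 3/8: the outcomes HHT, HTH, THH
```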

Random Variables: What is "variable" in p(variable=value)? p(weather's clear). The Outcome is a race; "weather's clear" is true or false of the outcome. So p(weather's clear) = p(weather's clear(Outcome)=true), which picks out the set of outcomes with clear weather. [Venn diagram again: p(win | clear) = p(win, clear) / p(clear), with regions "weather's clear" and "Paul Revere wins" inside all outcomes (races).]

Random Variables: What is "variable" in p(variable=value)? p(x_1=h) * p(x_2=o | x_1=h) * ... The Outcome is a sequence of letters; x_2 is the second letter in the sequence. So p(x_2=o) = p(x_2(Outcome)=o), which picks out the set of outcomes whose second letter is o: = Σ p(outcome) over all outcomes whose second letter is o = p(horses) + p(boffo) + p(xoyzkklp) + ...

Back to the trigram model of p(horses).
p(x_1=h, x_2=o, x_3=r, x_4=s, x_5=e, x_6=s, ...)
≈ t_{BOS,BOS,h} (est. 4470/52108)
* t_{BOS,h,o} (est. 395/4470)
* t_{h,o,r} (est. 1417/14765)
* t_{o,r,s} (est. 1573/26412)
* t_{r,s,e} (est. 1610/12253)
* t_{s,e,s} (est. 2044/21250)
* ...
The parameters of our old trigram generator! Same assumptions about language. This notation emphasizes that they're just real variables whose values must be estimated; the estimates shown are naively computed from Brown-corpus counts.

A Different Model. Exploit the fact that "horses" is a common word. Consider p(W_1 = horses), where the word vector W is a function of the outcome (the sentence), just as the character vector X is.
p(W_1 = horses) = p(W_i = horses | i=1) ≈ p(W_i = horses) = 7.2e-5
The independence assumption says that sentence-initial words W_1 are just like all other words W_i (this gives us more data to use). Much larger than the previous estimate of 5.4e-7. Why? Advantages, disadvantages?

Improving the New Model: Weaken the Independence Assumption. Don't totally cross off i=1, since it's not irrelevant: yes, "horses" is common, but less so at the start of a sentence, since most sentences start with determiners.
p(W_1 = horses)
= Σ_t p(W_1=horses, T_1=t)
= Σ_t p(W_1=horses | T_1=t) * p(T_1=t)
= Σ_t p(W_i=horses | T_i=t, i=1) * p(T_1=t)
≈ Σ_t p(W_i=horses | T_i=t) * p(T_1=t)
= p(W_i=horses | T_i=PlNoun) * p(T_1=PlNoun) + p(W_i=horses | T_i=Verb) * p(T_1=Verb) + ...
= (72/55912) * (977/52108) + (0/15258) * (146/52108) + ...
= 2.4e-5 + 0 + ... + 0 = 2.4e-5
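The sum over tags above, written out with the two counts the slide cites (the remaining tags' terms are all 0 on the slide, so they are omitted here):

```python
# Counts cited on the slide:
#   (count of "horses" with tag t, count of tag t)  and
#   (count of sentences starting with tag t, count of sentences)
terms = [
    ((72, 55912), (977, 52108)),   # t = PlNoun
    ((0, 15258), (146, 52108)),    # t = Verb
    # ... the slide shows the terms for the other tags are all 0
]

p_w1_horses = sum((cw / ct) * (cs / cn) for (cw, ct), (cs, cn) in terms)
print(f"{p_w1_horses:.1e}")   # 2.4e-05, as on the slide
```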

Which Model is Better? Model 1: predict each letter X_i from the previous 2 letters X_{i-2}, X_{i-1}. Model 2: predict each word W_i from its part of speech T_i, having predicted T_i from i. The models make different independence assumptions that reflect different intuitions. Which intuition is better???

Measure Performance! Which model does better on language ID? Administer a test where you know the right answers. Seal up the test data until the test happens; this simulates real-world conditions, where new data comes along that you didn't have access to when choosing or training the model. In practice, split off a test set as soon as you obtain the data, and never look at it. You need enough test data to get statistical significance. Report all results on the test data. For a different task (e.g., speech transcription instead of language ID), use that task to evaluate the models.
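A sketch of this evaluation protocol with hypothetical data and a placeholder classifier; only the train/test discipline is the point, not the particular sentences or model.

```python
import random

# Hypothetical labeled data: (sentence, language) pairs. The sentences and the
# classifier below are placeholders for illustration only.
data = [
    ("Horses and Lukasiewicz are on the curriculum.", "English"),
    ("(a Polish sentence)", "Polish"),
    ("(another English sentence)", "English"),
    ("(another Polish sentence)", "Polish"),
]

random.seed(0)
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]   # split off `test` immediately; don't look at it

def classify(sentence):
    """Placeholder: really, return the language whose model gives higher probability."""
    return "English"

# Choose and train models using `train` only; report accuracy on `test` only.
accuracy = sum(classify(s) == lang for s, lang in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```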

Cross-Entropy ("xent"). Another common measure of model quality. Task-independent. Continuous, so slight improvements show up here even if they don't change the # of right answers on the task. Just measure the probability of (enough) test data: higher probability means the model better predicts the future. There's a limit to how well you can predict random stuff; the limit depends on how random the dataset is (easier to predict weather than headlines, especially in Arizona).

Cross-Entropy ("xent"). We want the probability of the test data to be high:
p(h | BOS, BOS) * p(o | BOS, h) * p(r | h, o) * p(s | o, r) * ... = 1/8 * 1/8 * 1/8 * 1/16 * ...
High probability corresponds to low xent, via 3 cosmetic improvements:
Take the logarithm (base 2) to prevent underflow: log2(1/8 * 1/8 * 1/8 * 1/16 * ...) = log2(1/8) + log2(1/8) + log2(1/8) + log2(1/16) + ... = (-3) + (-3) + (-3) + (-4) + ...
Negate to get a positive value, in bits: 3 + 3 + 3 + 4 + ...
Average: divide by the length of the text, giving 3.25 bits per letter (or per word). (Equivalently, the geometric average of 1/2^3, 1/2^3, 1/2^3, 1/2^4 is 1/2^3.25 ≈ 1/9.5.)
We want this to be small (equivalent to wanting good compression!). The lower limit is called entropy, obtained in principle as the cross-entropy of the true model measured on an infinite amount of data. Or use perplexity = 2^xent ("9.5 choices" instead of 3.25 bits).
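The same arithmetic in code, using the slide's four example per-letter probabilities:

```python
import math

# Per-letter probabilities of the test text under the model (slide's example values).
probs = [1 / 8, 1 / 8, 1 / 8, 1 / 16]

log_prob = sum(math.log2(p) for p in probs)   # log2 of the product; avoids underflow
xent = -log_prob / len(probs)                 # cross-entropy in bits per letter
perplexity = 2 ** xent                        # 2 to the cross-entropy

print(xent)        # 3.25 bits per letter
print(perplexity)  # about 9.51 (the slide's "9.5 choices")
```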