Class: Backoff (sections 4.6 and 4.7)


October 1, 2012

Administrivia

Next week Adi will talk on variable length Markov chains, and Jordan will talk on HMMs. I'll be setting up both of their talks this week.

Good prediction means good compression means good models

Three views:
  Probability: build a good model
  Information theory: construct a short code
  Statistics: predict

All three are basically the same, in particular if we are using log loss.
  Probability: no loss, just build a good model. Requires statistics to say some models are better than others.
  Information theory: loss = code length
  Statistics: loss = $-\log p(\text{observed})$

Arithmetic coding

How to convert a probability to a good code.
  Very easy: equally likely events.
    It is easy to convert equal probabilities to a code.
    Spot ourselves an extra bit or two.
    List the events in order and just assign bit strings to them.
  Almost as easy: dyadic code books.
  Try the same trick with unequal probabilities.
    Now some events have more than one code.
    Killing off these redundant codes leads to:
      Huffman codes (if you sort by probabilities)
      Arithmetic codes (if you don't sort)
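The claim that code length and log loss are the same thing can be checked on a small case. The following is a minimal sketch, not the construction used in lecture: the four-symbol dyadic distribution and the helper name `huffman_code_lengths` are assumptions made for illustration. For dyadic probabilities each Huffman code length equals $-\log_2 p$, so the expected code length matches the entropy.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code over `probs`."""
    # Heap entries are (probability, unique tie-breaker, {symbol: depth so far}).
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees; every symbol inside gets one bit deeper.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical dyadic distribution (every probability is a power of 1/2).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)

expected_length = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
print(lengths, expected_length, entropy)
```

For non-dyadic probabilities the Huffman lengths are integers and so can only approximate $-\log_2 p$; arithmetic coding is the usual way to close that gap.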

Information theory

Good prediction implies good codes
  Primitive form: Huffman codes
  Fancy form: Arithmetic codes
  Idea: random code book
    We have a probability for each string.
    Sample a string, call it string 1.
    Sample another string, call it string 2.
    ... (skip repeats if you like)
    Tell someone the index at which the true string is sampled.
    Theorem: the index is about $1/p$, hence takes about $\log_2(1/p) = -\log_2 p$ bits.

Markov chains for good prediction
  A nice model is a k-state Markov chain.
  Theorem: LZ compresses as well as any k-state Markov chain.
  Variable length Markov chains are nice models.
  Theorem: LZ compresses as well as a variable length Markov chain.
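The random code book claim follows from a short calculation. This is a sketch under two assumptions not spelled out in the notes: the samples are independent draws from the model (repeats are not skipped), and the true string has probability $p$ under the model. Then the index $N$ of the first draw that matches the true string is geometric:

```latex
% Assumption: i.i.d. draws from the model; the true string has probability p.
\begin{align*}
\Pr(N = n) &= (1-p)^{n-1}\,p, \qquad n = 1, 2, \ldots \\
\mathbb{E}[N] &= \sum_{n \ge 1} n\,(1-p)^{n-1}\,p = \frac{1}{p} \\
\mathbb{E}[\log_2 N] &\le \log_2 \mathbb{E}[N] = \log_2 \frac{1}{p} = -\log_2 p
\end{align*}
```

The last line uses Jensen's inequality. Writing down the index $N$ takes roughly $\log_2 N$ bits, so the expected message length is at most about $-\log_2 p$, which is the log loss of the model on the true string.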

Good compression implies good prediction
  Kraft inequality: $\sum 2^{-\text{code length}} \le 1$
  So treat $2^{-\text{code length}}$ as a probability.
  This allows converting any compression scheme into a probability forecasting scheme.
  Bob and I realized we could use this idea plus LZ to come up with a universal prediction algorithm. We started writing up a paper only to find that it had already won an award for best paper the year before. Their only error was the list of authors: we weren't on it. So, yet another simultaneous discovery.

Backoff

Problem: Some contexts are full, others empty. How to deal with empty contexts?

Naive solution: Use the best context available.
  If you have 5-grams, use them, otherwise
  If you have 4-grams, use them, otherwise
  ...
  With nothing: use the base rate.

Example: we can do this when we have lots of examples of 5-grams.
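The naive rule "use the longest context that has any data" can be sketched in a few lines. This is an illustration only: the toy corpus, the function names `train` and `naive_backoff`, and the choice of trigrams as the longest order are all assumptions, not part of the notes.

```python
from collections import Counter, defaultdict

def train(tokens, max_order):
    """Count next-word occurrences for every context of length 0..max_order-1."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_order):
            if i - k < 0:
                break
            context = tuple(tokens[i - k:i])  # the k words before position i
            counts[context][word] += 1
    return counts

def naive_backoff(counts, history, max_order):
    """Return (context used, relative frequencies) from the longest seen context."""
    for k in range(min(max_order - 1, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in counts:
            total = sum(counts[context].values())
            return context, {w: c / total for w, c in counts[context].items()}
    raise ValueError("no training data")

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
counts = train(tokens, max_order=3)

ctx_full, dist_full = naive_backoff(counts, ["the", "cat"], 3)      # trigram context seen
ctx_short, dist_short = naive_backoff(counts, ["purple", "cat"], 3)  # falls back to ("cat",)
ctx_base, dist_base = naive_backoff(counts, ["purple", "dog"], 3)    # falls back to base rate
print(ctx_full, ctx_short, ctx_base)
```

Note the all-or-nothing behaviour: a context seen once is trusted completely, and a context never seen contributes nothing.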

Example:
  'Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe:
  All mimsy were the borogoves,
  And the mome raths outgrabe.
None of these contexts has been seen before, so back off to the base rate. Hence we guess very likely words next, like "a", "the", ..., "did".

Interpolation

Using a context either fully or not at all (0 / 1) seems stupid statistically. Better would be to have smooth rounding.

General method:
$$\hat P(w_n \mid w_{n-1}, \ldots) = \lambda_0(w\text{'s})\,\hat P(w_n) + \lambda_1(w\text{'s})\,\hat P(w_n \mid w_{n-1}) + \cdots$$

This begs the question of how to pick the $\lambda$'s.
  The book's solution is cross validation. I.e. DUCK!
  But a strong statement is that the $\lambda$'s depend only on the history, not the current word.
  Is this right? Katz says no.

Katz backoff

Simple improvement: use Good-Turing for the really low counts.
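A two-term version of the interpolation formula can be sketched as follows. The toy corpus, the function name `interpolated_prob`, and the fixed weight `lam1 = 0.7` are assumptions for illustration; in practice the weights would be tuned on held-out data. The weight here depends only on the history (whether the previous word was ever seen), never on the word being predicted.

```python
from collections import Counter

def interpolated_prob(word, prev, unigrams, bigrams, lam1):
    """Mix the base rate P(word) with the bigram estimate P(word | prev)."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    # Number of times prev was followed by any word.
    prev_total = sum(c for (a, _), c in bigrams.items() if a == prev)
    p_bi = bigrams[(prev, word)] / prev_total if prev_total else 0.0
    # If prev has no followers in the data, put all the weight on the base rate.
    weight = lam1 if prev_total else 0.0
    return (1 - weight) * p_uni + weight * p_bi

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = sorted(unigrams)

p_seen = interpolated_prob("cat", "the", unigrams, bigrams, lam1=0.7)    # bigram seen
p_unseen = interpolated_prob("sat", "the", unigrams, bigrams, lam1=0.7)  # bigram never seen
total_the = sum(interpolated_prob(w, "the", unigrams, bigrams, 0.7) for w in vocab)
total_dog = sum(interpolated_prob(w, "dog", unigrams, bigrams, 0.7) for w in vocab)
print(p_seen, p_unseen, total_the, total_dog)
```

Because each component is a proper distribution and the weights sum to one, the mixture sums to one for every history, and unseen bigrams still receive positive probability through the base rate term.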

Good-Turing is not really helpful for zero counts though, since it basically gives base rates. (Could this be improved by thinking of different models for each context?)

So back off until you have some data to actually use. But since we are now using the word to be forecast, we have to worry about whether the probabilities sum to 1.

Example. Consider the words x, y, z in order:

$$\hat P_2(z \mid x, y) = \begin{cases} P(z \mid x, y) & \text{if } C(x, y, z) > 0 \\ \alpha_2(x, y)\,\hat P_1(z \mid y) & \text{otherwise} \end{cases}$$

$$\hat P_1(z \mid y) = \begin{cases} P(z \mid y) & \text{if } C(y, z) > 0 \\ \alpha_1(y)\,\hat P_0(z) & \text{otherwise} \end{cases}$$

$$\hat P_0(z) = P(z)$$

(Note: There are slight changes from the book. I'm using what should be done, not necessarily what either Katz or others have said to do.)

The $P(\cdot)$ are defined using Good-Turing.
  That is a proper probabilistic model.
  It assigns intelligent probabilities to unseen events.
  We can sum over all these unseen events and call that total $\alpha(\cdot)$.
  This keeps the probabilities summing to 1.
  $P$ is now the Good-Turing probability.
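The one-level case $\hat P_1(z \mid y)$ can be sketched in code. Two things here are assumptions and differ from the notes: a fixed absolute discount stands in for the Good-Turing estimate of $P(z \mid y)$, and $\alpha_1(y)$ is taken to be the leftover mass divided by the base-rate mass of the words not seen after $y$ (the usual Katz normalization), so that the probabilities sum to exactly 1. The corpus and the name `katz_bigram` are hypothetical.

```python
from collections import Counter

def katz_bigram(tokens, discount=0.5):
    """Return (prob, vocab) where prob(z, y) is a backed-off estimate of P(z | y)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())
    p0 = {w: c / total for w, c in unigrams.items()}  # base rate P_0(z)

    def prob(z, y):
        seen = {b: c for (a, b), c in bigrams.items() if a == y}
        n_y = sum(seen.values())
        if n_y == 0:
            return p0.get(z, 0.0)  # context never seen: use the base rate
        if z in seen:
            # Discounted estimate (stand-in for Good-Turing, an assumption).
            return (seen[z] - discount) / n_y
        # Total mass removed from the seen words.
        leftover = discount * len(seen) / n_y
        # Base-rate mass of the words never seen after y.
        unseen_mass = 1.0 - sum(p0[w] for w in seen)
        if unseen_mass <= 0:
            return 0.0  # every vocabulary word was seen after y
        alpha = leftover / unseen_mass
        return alpha * p0.get(z, 0.0)

    return prob, sorted(unigrams)

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the cat ran".split()
prob, vocab = katz_bigram(tokens)

sums = {y: sum(prob(z, y) for z in vocab) for y in ["the", "cat", "on", "ran"]}
p_backed_off = prob("sat", "the")  # "the sat" never occurs in the corpus
print(sums, p_backed_off)
```

Unlike interpolation, the weaker context is consulted only for words with zero count in the stronger one, which is why the normalizing constant has to depend on which words were seen.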

Normalization

A Bayesian model generates proper predictions for zero events.
  This is just Good-Turing.
  But it uses a constant to fill in.
What Katz does is refill these zeros by using the weaker contexts.
So $\alpha$ is the total probability estimate of getting any zero count item.

Alternative models for backoff

Lots of similar results in information theory:
  Variable length Markov chain
  Context tree algorithm
  LZW
  Other compression algorithms
Let me describe a few of them.

A more principled solution?

Backoff is our first real statistics problem.

Room for lots of real statistical thinking here.
Thesis anyone?
