Exercise 1: Basics of probability calculus


1 Exercise 1: Basics of probability calculus
Stig-Arne Grönroos
Department of Signal Processing and Acoustics
Aalto University, School of Electrical Engineering

2 Ex 1.1: Conditional probability
Unconditional probability: $P(A)$
Conditional probability: $P(A \mid B)$
Joint probability: $P(A, B)$
Chain rule: $P(A, B) = P(A)\,P(B \mid A)$

3 Ex 1.1: Conditional probability
The following probabilities might apply to English:
$P(\text{word is abbreviation} \mid \text{word has three letters}) = 0.8$
$P(\text{word has three letters}) =$
What is the probability that an observed word is a three-letter abbreviation?

4 Ex 1.1: Conditional probability
What is the probability that an observed word is a three-letter abbreviation?
$P(\text{word is abbreviation}, \text{word has three letters})$
First take the probability that a word is three letters long, then the probability that it is an abbreviation given that it has three letters:
$P(\text{abbreviation}, \text{three letters}) = P(\text{three letters})\,P(\text{abbreviation} \mid \text{three letters})$
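
The chain-rule step can be checked numerically. This is a minimal sketch: the value 0.8 comes from the exercise, while `p_three_letters` is a purely hypothetical placeholder, because the figure used in the exercise is not given here.

```python
def joint_probability(p_a: float, p_b_given_a: float) -> float:
    """Chain rule: P(A, B) = P(A) * P(B | A)."""
    if not (0.0 <= p_a <= 1.0 and 0.0 <= p_b_given_a <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    return p_a * p_b_given_a


# From the exercise: P(abbreviation | three letters) = 0.8
p_abbrev_given_three = 0.8
# Hypothetical placeholder, NOT the value used in the exercise.
p_three_letters = 0.01

p_joint = joint_probability(p_three_letters, p_abbrev_given_three)
print(f"P(abbreviation, three letters) = {p_joint:.4f}")
```

The joint probability can never exceed either factor, which is a quick sanity check on any hand calculation.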

5 Ex 1.2: Bayes Theorem
A stemming program for English can determine whether the stem for does is
1. the verb do
2. or the noun doe (female deer).
The stem do is much more common: only 1 [...] does is an inflection of doe.
The program returns does → doe. What is the probability that the program is correct?

6 Ex 1.2: Bayes Theorem
Table: $P(R = C_i \mid T = C_j)$, with columns for the true stem $T$ ($C_1$, $C_2$) and rows for the result $R$ ($C_1$, $C_2$).
$P(T = C_1) =$ (do)
$P(T = C_2) =$ (doe)

7 Ex 1.2: Bayes Theorem
We have $P(R = C_i \mid T = C_j)$, but we want $P(T = C_j \mid R = C_i)$.
Bayes theorem:
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{\sum_i P(A \mid B_i)\,P(B_i)}$$
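
Bayes theorem can be sketched in a few lines of code. All numbers below are hypothetical placeholders (the table values and priors of the exercise are not given here), and the function name `posterior` is an illustrative choice.

```python
def posterior(likelihoods, priors):
    """Return P(B_j | A) for every j, given P(A | B_j) and P(B_j)."""
    if len(likelihoods) != len(priors):
        raise ValueError("need one likelihood per prior")
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(A) = sum_i P(A | B_i) P(B_i)
    if evidence == 0:
        raise ValueError("the observation has zero probability")
    return [j / evidence for j in joint]


# Hypothetical values, NOT the ones from the exercise.
priors = [0.999, 0.001]      # P(T = do), P(T = doe)
likelihoods = [0.05, 0.95]   # P(R = doe | T = do), P(R = doe | T = doe)

post = posterior(likelihoods, priors)
print(f"P(T = doe | R = doe) = {post[1]:.4f}")
```

With a very small prior for doe, even an accurate program is usually wrong when it answers doe, which is the point of turning the conditional probability around.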

8 Ex 1.2: Bayes Theorem
Practical when you need to turn a conditional probability around.
Forms the basis for Bayesian statistics, where you have
a prior probability $P(B)$ and
some new observations $A$,
which you want to combine into a posterior probability $P(B \mid A)$.

9 Ex 1.3: Zipf's law
Sort the words so that the most common word comes first ($r = 1$). Also include how many times each word occurred in the text ($f$).
Zipf alleges that $f \propto \frac{1}{r}$, i.e. $f$ times $r$ remains constant.
Does this apply to a randomly generated language, which has 30 letters including the word boundary?

10 Ex 1.3: Zipf's law
A particular one-letter word: generate two symbols, a word boundary after something else. There are 29 words of this kind. $P(s = t_1) =$
A particular two-letter word: there are $29^2$ words of this kind. $P(s = t_1, t_2) =$
A particular three-letter word: there are $29^3$ words of this kind. $P(s = t_1, t_2, t_3) =$

11 Ex 1.3: Zipf's law
Table: Zipf constant (columns $r$, $f$, $k$). $r$: the ranking number when sorted by frequency; $f$: expected occurrence count in a text of words.
Even for a random language, $k$ remains quite constant for a large range of $r$.
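
The behaviour of $k = r \cdot f$ for a random language can be sketched as follows. The assumptions are mine rather than the exercise's: every one of the 30 symbols is drawn independently with probability 1/30, empty words are ignored when ranking, each length group is represented by its middle rank, and `TEXT_LENGTH` is a hypothetical text size.

```python
ALPHABET = 30            # symbols, including the word boundary (from the exercise)
LETTERS = ALPHABET - 1   # 29 ordinary letters
TEXT_LENGTH = 1_000_000  # hypothetical number of words in the text


def word_probability(n):
    """Probability of one particular n-letter word: n letters, then a boundary."""
    return (1 / ALPHABET) ** (n + 1)


def rank_range(n):
    """First and last rank occupied by the n-letter words (shorter words rank first)."""
    first = sum(LETTERS ** m for m in range(1, n)) + 1
    last = first + LETTERS ** n - 1
    return first, last


def zipf_constant(n, text_length=TEXT_LENGTH):
    """k = rank * expected frequency, using the middle rank of the n-letter group."""
    first, last = rank_range(n)
    mid_rank = (first + last) / 2
    expected_count = text_length * word_probability(n)
    return mid_rank * expected_count


for n in range(1, 5):
    first, last = rank_range(n)
    print(f"length {n}: ranks {first}..{last}, k = {zipf_constant(n):.0f}")
```

Under these assumptions the printed values of k stay within a few percent of each other, even though the ranks span several orders of magnitude.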

12 Ex 1.3: Zipf's law
Figure: Zipf's law for a randomly generated language. $k = \text{rank} \cdot \text{frequency}$ as a function of rank. For a randomly generated language with 30 letters, $k$ is roughly constant.

13 Ex 1.3: Zipf's law
Figure: Zipf's law for a Finnish corpus of 32 million words. Logarithmic plot of ranks (x axis) and frequencies (y axis).

14 Ex 1.3: Zipf's law
The power law distribution is highly peaked at low frequencies.
Even a quite short list of the most frequent words gives good coverage of tokens.
The long tail of infrequent words contains most of the interesting content words.

15 Ex 1.4: Central limit theorem
Figure: Expectation as center of probability mass. CC BY-SA 3.0 Erzbischof.

16 Ex 1.4: Central limit theorem
Expectation and variance:
$$E(x) = \int x\,p(x)\,dx \quad (1)$$
$$\mathrm{Var}(x) = \int \big(x - E(x)\big)^2 p(x)\,dx \quad (2)$$
Expectation and variance of a sum of independent random variables:
$$E(x + y) = E(x) + E(y) \quad (3)$$
$$\mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y) \quad (4)$$
Variance of a random variable multiplied by a constant:
$$\mathrm{Var}(ax) = a^2\,\mathrm{Var}(x) \quad (5)$$

17 Ex 1.4: Central limit theorem
Throwing one 101-sided die. Equal probability $p(x) = \frac{1}{101}$.
Expectation:
$$E(x) = \sum_{i=0}^{100} i\,p(x = i) = \frac{1}{101}(0 + 1 + 2 + \dots + 100) = \frac{1}{101}\big((1 + 100) + (2 + 99) + (3 + 98) + \dots + (50 + 51)\big) = \frac{50 \cdot 101}{101} = 50 \quad (6)$$

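18 Ex 1.4: Central limit theorem
Variance:
$$\mathrm{Var}(x) = \sum_{i=0}^{100} \big(i - E(x)\big)^2 p(x = i) = \frac{1}{101}\big((-50)^2 + (-49)^2 + \dots + 49^2 + 50^2\big) = \frac{2}{101}\big(1^2 + 2^2 + \dots + 50^2\big)$$
Now we can use the formula
$$1^2 + 2^2 + \dots + n^2 = \frac{n(n + 1)(2n + 1)}{6}$$
to get the result
$$\mathrm{Var}(x) = \frac{2}{101} \cdot \frac{50 \cdot 51 \cdot 101}{6} = 850 \quad (7)$$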
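
The two results for the 101-sided die can be verified with exact rational arithmetic. This is a small sketch that assumes the faces are numbered 0 to 100, as in the sums of the exercise; the function name is illustrative.

```python
from fractions import Fraction


def uniform_die_moments(sides=101):
    """Exact mean and variance of a fair die with faces 0 .. sides - 1."""
    p = Fraction(1, sides)
    faces = range(sides)
    mean = sum(i * p for i in faces)
    variance = sum((i - mean) ** 2 * p for i in faces)
    return mean, variance


mean, variance = uniform_die_moments()
print(mean, variance)  # prints: 50 850
```

Using `Fraction` avoids floating-point rounding, so the comparison with the hand-derived values is exact.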

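19 Ex 1.4: Central limit theorem
Calculate the expectation for the mean of two dice, $(x + y)/2$:
$$E\left(\frac{x + y}{2}\right) = \frac{1}{2}\big(E(x) + E(y)\big) = \frac{1}{2}(50 + 50) = 50$$
The expectation does not change. What about the variance, then? The throws are independent, so
$$\mathrm{Var}\left(\frac{x + y}{2}\right) = \mathrm{Var}\left(\frac{x}{2}\right) + \mathrm{Var}\left(\frac{y}{2}\right) = \frac{1}{4}\mathrm{Var}(x) + \frac{1}{4}\mathrm{Var}(y) = \frac{1}{4}(850 + 850) = 425$$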

20 Ex 1.4: Central limit theorem
We throw ten dice. Extending the previous solutions:
$$E\left(\frac{x_1 + x_2 + \dots + x_{10}}{10}\right) = \frac{10 \cdot 50}{10} = 50$$
$$\mathrm{Var}\left(\frac{x_1 + x_2 + \dots + x_{10}}{10}\right) = \frac{10 \cdot 850}{10^2} = 85$$
As we throw even more dice, the distribution will sharpen around the expectation. At the infinite limit, the expectation is 50 and the variance 0, which means that we will always get a result of 50.
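
The sharpening of the distribution can be observed in a small simulation. This is a sketch using only the standard library; the number of throws and the seed are arbitrary choices, and the simulated variance is compared with 850 / n, the value that follows from the variance rules for sums and constant factors.

```python
import random
import statistics


def simulate_mean_of_dice(n_dice, n_throws=20_000, sides=101, seed=0):
    """Simulate the mean of n_dice fair dice with faces 0 .. sides - 1."""
    rng = random.Random(seed)
    return [
        sum(rng.randrange(sides) for _ in range(n_dice)) / n_dice
        for _ in range(n_throws)
    ]


for n_dice in (1, 2, 10):
    means = simulate_mean_of_dice(n_dice)
    print(
        f"{n_dice:2d} dice: mean = {statistics.fmean(means):.2f}, "
        f"variance = {statistics.pvariance(means):.1f}, "
        f"theory = {850 / n_dice:.1f}"
    )
```

The simulated mean stays near 50 for every curve, while the variance shrinks roughly in proportion to 1 / n.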

21 Ex 1.4: Central limit theorem
Figure: Throwing dice, with one curve per number of dice. The throw was simulated 1 million times for each curve.

22 Ex 1.4: Central limit theorem
Here we used the uniform distribution, but the CLT applies to most reasonable cases (i.i.d., finite E and Var).
The normal distribution (Gaussian) is common in nature, and easy to use in closed-form solutions.
Normal distribution approximation.
Sample size in scientific experiments.

23 Ex 1.5: Minimum description length
Agree in advance on a model class that could generate the data.
For our message, select a particular model by setting $\theta$.
Use that model to compress the data.
The receiver does not know the $\theta$ we chose, so we must send that too.

24 Ex 1.5: Minimum description length
Model parameters: $\theta$
Description length (DL) to encode parameters: $L(\theta)$
DL to encode data given $\theta$: $L(x \mid \theta)$
Total code length: $L(x, \theta) = L(\theta) + L(x \mid \theta)$
DL to encode message with $p(i)$: $-\log p(i)$ bits
Prior distribution: $p(\theta)$
Likelihood (a function of $\theta$): $p(x \mid \theta)$
Posterior: $p(\theta \mid x)$
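
The code lengths can be tabulated for a toy model. Everything concrete below is a hypothetical illustration, not part of the exercise: the model class is a coin whose bias $\theta$ is restricted to three values, and the prior and the data are made up.

```python
import math


def code_length_bits(p):
    """Ideal code length, in bits, for an event of probability p."""
    return -math.log2(p)


# Hypothetical prior p(theta) over three allowed coin biases.
prior = {0.25: 0.25, 0.5: 0.5, 0.75: 0.25}
# Hypothetical observations x (1 = heads, 0 = tails).
data = [1, 1, 0, 1, 1, 1, 0, 1]


def likelihood(theta, xs):
    """p(x | theta) for independent coin flips."""
    return math.prod(theta if x == 1 else 1 - theta for x in xs)


def total_length(theta, xs):
    """Two-part code length L(x, theta) = L(theta) + L(x | theta)."""
    return code_length_bits(prior[theta]) + code_length_bits(likelihood(theta, xs))


for theta in prior:
    print(f"theta = {theta}: L(x, theta) = {total_length(theta, data):.2f} bits")

best_mdl = min(prior, key=lambda t: total_length(t, data))
best_map = max(prior, key=lambda t: prior[t] * likelihood(t, data))
print(best_mdl, best_map)
```

In this toy setting the parameter with the shortest total code is also the one that maximizes $p(\theta)\,p(x \mid \theta)$.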

25 Ex 1.5: Minimum description length
Show that the optimal selection of the parameters in a two-part coding scheme equals Maximum A Posteriori estimation.
Minimizing the total message length:
$$\hat{\theta} = \arg\min_{\theta} L(x, \theta)$$
...
Maximizing the posterior of $\theta$ after observing $x$:
$$\hat{\theta} = \arg\max_{\theta} p(\theta \mid x)$$
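
One way to connect the two expressions is sketched below. It assumes ideal code lengths $L(\cdot) = -\log p(\cdot)$ and uses the fact that $p(x)$ does not depend on $\theta$; it is a possible line of argument, not necessarily the derivation intended in the exercise.

```latex
\begin{align*}
\hat{\theta} &= \arg\min_{\theta} L(x, \theta)
              = \arg\min_{\theta} \big[ L(\theta) + L(x \mid \theta) \big] \\
             &= \arg\min_{\theta} \big[ -\log p(\theta) - \log p(x \mid \theta) \big]
              = \arg\max_{\theta} \log \big( p(\theta)\, p(x \mid \theta) \big) \\
             &= \arg\max_{\theta} \frac{p(x \mid \theta)\, p(\theta)}{p(x)}
              = \arg\max_{\theta} p(\theta \mid x)
\end{align*}
```

The step from the logarithm to the posterior relies on the logarithm being monotonically increasing and on Bayes theorem.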

26 Ex 1.5: Minimum description length
In compression, statistical regularities are used for compressing data.
In MDL-flavored machine learning, compression is used for finding statistical regularities.
