Basics of probability calculus
Stig-Arne Grönroos
Department of Signal Processing and Acoustics
Aalto University, School of Electrical Engineering
stig-arne.gronroos@aalto.fi
21.01.2016
Ex 1.1: Conditional probability

Unconditional probability: P(A)
Conditional probability: P(A | B)
Joint probability: P(A, B)
Chain rule: P(A, B) = P(A) P(B | A)
Ex 1.1: Conditional probability

The following probabilities might apply to English:
P(word is abbreviation | word has three letters) = 0.8
P(word has three letters) = 0.0003

What is the probability that an observed word is a three-letter abbreviation?
Ex 1.1: Conditional probability

What is the probability that an observed word is a three-letter abbreviation?
P(word is abbreviation, word has three letters)

First, how probable it is for a word to be three letters long, then how probable it is to be an abbreviation given that it has three letters:
= P(three letters) · P(abbreviation | three letters)
= 0.0003 · 0.8
= 0.00024
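A minimal Python sketch of this chain-rule calculation (the probabilities are the illustrative figures from the slide, not measured values):

```python
# Chain rule: P(A, B) = P(B) * P(A | B)
p_three_letters = 0.0003            # P(word has three letters)
p_abbrev_given_three = 0.8          # P(abbreviation | three letters)

p_three_letter_abbrev = p_three_letters * p_abbrev_given_three
print(p_three_letter_abbrev)        # 0.00024
```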
Ex 1.2: Bayes' Theorem

A stemming program for English can determine whether the stem of "does" is
1. the verb "do"
2. or the noun "doe" (female deer).

The stem "do" is much more common: only 1 in 1000 occurrences of "does" is an inflection of "doe".

The program returns that "does" is an inflection of "doe". What is the probability that the program is correct?
Ex 1.2: Bayes' Theorem

Confusion probabilities P(R = C_i | T = C_j), where T is the true stem and R is the program's result. C_1 = "do", C_2 = "doe".

                    True T = C_1   True T = C_2
Result R = C_1          0.95           0.05
Result R = C_2          0.05           0.95

Priors: P(T = C_1) = 0.999, P(T = C_2) = 0.001
Ex 1.2: Bayes' Theorem

We have P(R = C_i | T = C_j), but we want P(T = C_j | R = C_i).

Bayes' theorem:
P(B_j | A) = P(A | B_j) P(B_j) / P(A)
           = P(A | B_j) P(B_j) / Σ_i P(A | B_i) P(B_i)
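A minimal Python sketch (not part of the original slides) that applies Bayes' theorem to the numbers on the previous slide to answer the exercise:

```python
# Priors: how often "does" really comes from each stem
p_do, p_doe = 0.999, 0.001
# Program behaviour P(result | true stem), from the confusion table
p_say_doe_given_doe = 0.95
p_say_doe_given_do = 0.05

# Bayes' theorem: P(true stem is "doe" | program says "doe")
evidence = p_say_doe_given_doe * p_doe + p_say_doe_given_do * p_do
posterior = p_say_doe_given_doe * p_doe / evidence
print(posterior)  # ~0.019: despite 95% accuracy, the answer "doe" is usually wrong
```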
Ex 1.2: Bayes' Theorem

Practical when you need to turn a conditional probability around =)

Forms the basis for Bayesian statistics, where you have a prior probability P(B) and some new observations A, which you want to combine into a posterior probability P(B | A).
Ex 1.3: Zipf's law

Sort the words so that the most common word comes first (r = 1). Also include how many times each word occurred in the text (f).

Zipf's law states that f ∝ 1/r, i.e. f times r remains constant.

Does this apply to a randomly generated language, which has 30 letters including the word boundary?
Ex 1.3: Zipf's law

A particular one-letter word: generate two symbols, a word boundary after something else. There are 29 words of this kind.
P(s = t_1) = (1/30) · (1/30)

A particular two-letter word. There are 29^2 words of this kind.
P(s = t_1, t_2) = (1/30) · (1/30) · (1/30)

A particular three-letter word. There are 29^3 words of this kind.
P(s = t_1, t_2, t_3) = (1/30) · (1/30) · (1/30) · (1/30)
Ex 1.3: Zipf's law

r           f           k = r · f
15          1111        16111
450         37.04       16648
13064       1.235       16129
378900      0.0412      15593
10988000    0.00137     15073
318660000   0.0000457   14570

Table: Zipf constant. r: the ranking number when sorted by frequency, f: expected occurrence count in a text of 1,000,000 words. Even for a random language, k remains quite constant for a large range of r.
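A minimal Python sketch (not part of the original exercise) that reproduces the values in the table above; the representative rank of each word-length class is taken as the middle of its rank range:

```python
# Random language: 30 equiprobable symbols, one of which is the word boundary.
# A specific word of length n has probability (1/30)**(n+1), and there are
# 29**n distinct words of that length.
N_TOKENS = 1_000_000  # size of the hypothetical text

cumulative_rank = 0
for n in range(1, 7):
    n_words = 29 ** n                      # number of words of length n
    f = N_TOKENS * (1 / 30) ** (n + 1)     # expected count of each such word
    r = cumulative_rank + n_words / 2      # representative (middle) rank
    print(f"n={n}: r={r:12.0f}  f={f:10.4g}  k = r*f = {r * f:6.0f}")
    cumulative_rank += n_words
```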
Ex 1.3: Zipf's law

Figure: k = rank · frequency as a function of rank, for a randomly generated language with 30 letters. k is roughly constant.
Ex 1.3: Zipf's law

Figure: Logarithmic plot of ranks and frequencies for a Finnish corpus of 32 million words.
Ex 1.3: Zipf's law

The power-law distribution is highly peaked at low frequencies. Already a quite short list of the most frequent words will give good coverage of tokens. The long tail of infrequent words contains most of the interesting content words.
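A minimal Python sketch (an assumed workflow, not the original script) showing how rank-frequency pairs for such a plot can be computed from a tokenized corpus:

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be the full tokenized text.
tokens = "the cat sat on the mat and the dog sat on the cat".split()

counts = Counter(tokens)
# Sort by frequency, most common first; ranks start from 1.
for r, (word, f) in enumerate(counts.most_common(), start=1):
    print(f"r={r:2d}  f={f:2d}  r*f={r * f:3d}  {word}")
```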
Ex 1.4: Central limit theorem

Figure: Expectation as the center of probability mass. (Image CC BY-SA 3.0 Erzbischof.)
Ex 1.4: Central limit theorem

Expectation and variance:
E(x) = ∫ x p(x) dx                        (1)
Var(x) = ∫ (x − E(x))^2 p(x) dx           (2)

Expectation and variance of a sum of independent random variables:
E(x + y) = E(x) + E(y)                    (3)
Var(x + y) = Var(x) + Var(y)              (4)

Variance of a random variable multiplied by a constant:
Var(ax) = a^2 Var(x)                      (5)
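A quick NumPy sketch (not from the slides) checking properties (3)-(5) empirically with two independent uniform random variables:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.integers(0, 101, size=n)   # uniform on 0..100
y = rng.integers(0, 101, size=n)   # independent copy

print(np.mean(x + y), np.mean(x) + np.mean(y))  # E(x+y) ≈ E(x) + E(y)
print(np.var(x + y), np.var(x) + np.var(y))     # Var(x+y) ≈ Var(x) + Var(y)
print(np.var(3 * x), 9 * np.var(x))             # Var(ax) = a^2 Var(x)
```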
Ex 1.4: Central limit theorem

Throwing one 101-sided die (faces 0, ..., 100), each face with equal probability p(x) = 1/101.

Expectation:
E(x) = Σ_{i=0}^{100} i · p(x = i)
     = (1/101) (1 + 2 + 3 + 4 + ... + 100)
     = (1/101) ((1 + 100) + (2 + 99) + (3 + 98) + ... + (50 + 51))
     = (1/101) · 50 · 101
     = 50                                                            (6)
Ex 1.4: Central limit theorem

Variance:
Var(x) = Σ_{i=0}^{100} (i − E(x))^2 p(x = i)
       = (1/101) (50^2 + 49^2 + ... + 1 + 0 + 1 + 2^2 + ... + 49^2 + 50^2)
       = (2/101) (1 + 2^2 + ... + 49^2 + 50^2)

Now we can use the formula
1 + 2^2 + 3^2 + ... + n^2 = n(n + 1)(2n + 1)/6
to get the result
Var(x) = (2/101) · (50 · 51 · 101)/6 = 850                           (7)
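A quick Python check (not part of the slides) of the exact expectation and variance of the 101-sided die:

```python
faces = range(101)   # faces 0..100, each with probability 1/101
p = 1 / 101

E = sum(i * p for i in faces)
Var = sum((i - E) ** 2 * p for i in faces)
print(E, Var)        # 50.0 850.0 (up to floating-point rounding)
```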
Ex 1.4: Central limit theorem

Calculate the expectation of the average of two dice, (x + y)/2:
E((x + y)/2) = (1/2) (E(x) + E(y)) = (1/2) (50 + 50) = 50

The expectation does not change. What about the variance?
Var((x + y)/2) = Var(x/2) + Var(y/2) = (1/4) Var(x) + (1/4) Var(y) = (1/4) (850 + 850) = 425
Ex 1.4: Central limit theorem

We throw ten dice. Extending the previous solutions:
E((x_1 + x_2 + ... + x_10)/10) = (1/10) · 10 · 50 = 50
Var((x_1 + x_2 + ... + x_10)/10) = (1/100) · 10 · 850 = 85

As we throw even more dice, the distribution sharpens around the expectation. In the limit of infinitely many dice, the expectation is 50 and the variance is 0, which means that we always get a result of 50.
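A minimal NumPy sketch (not from the slides) simulating the average of n dice, mirroring the figure on the next slide; the sample variance should come out close to 850/n:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000  # smaller than the 1 million used on the slide, to keep memory modest

for n_dice in (1, 2, 3, 5, 10, 100):
    throws = rng.integers(0, 101, size=(n_trials, n_dice))  # 101-sided dice
    means = throws.mean(axis=1)                             # average per trial
    print(f"{n_dice:3d} dice: mean={means.mean():6.2f}  "
          f"var={means.var():8.2f}  (theory {850 / n_dice:7.2f})")
```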
Ex 1.4: Central limit theorem

Figure: Throwing dice. Histograms of the average of 1, 2, 3, 5, 10 and 100 dice; the throw was simulated 1 million times for each curve.
Ex 1.4: Central limit theorem

Here we used the uniform distribution, but the CLT applies to most reasonable cases (i.i.d. variables with finite expectation and variance).
The normal distribution (Gaussian) is common in nature, and easy to use in closed-form solutions.
Applications: normal-distribution approximations; choosing the sample size in scientific experiments.
Ex 1.5: Minimum description length

Agree in advance on a model class that could generate the data.
For our message, select a particular model by setting θ.
Use that model to compress the data.
The receiver does not know the θ we chose, so we must send that too.
Ex 1.5: Minimum description length

Model parameters: θ
Description length (DL) to encode the parameters: L(θ)
DL to encode the data given θ: L(x | θ)
Total code length: L(x, θ) = L(θ) + L(x | θ)
DL to encode a message i with probability p(i): −log p(i) bits

Prior distribution: p(θ)
Likelihood (a function of θ): p(x | θ)
Posterior: p(θ | x)
Ex 1.5: Minimum description length

Show that the optimal selection of the parameters in the two-part coding scheme equals Maximum A Posteriori (MAP) estimation.

Minimizing the total message length:
θ̂ = arg min_θ L(x, θ)
...
θ̂ = arg max_θ p(θ | x)
Maximizing the posterior of θ after observing x.
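One way to fill in the elided steps (a standard argument, not copied from the slides), using the fact from the previous slide that a message with probability p takes −log p bits:

```latex
\begin{align*}
\hat{\theta} &= \arg\min_\theta L(x, \theta) \\
             &= \arg\min_\theta \bigl[ L(\theta) + L(x \mid \theta) \bigr] \\
             &= \arg\min_\theta \bigl[ -\log p(\theta) - \log p(x \mid \theta) \bigr] \\
             &= \arg\max_\theta \, p(\theta)\, p(x \mid \theta) \\
             &= \arg\max_\theta \, \frac{p(\theta)\, p(x \mid \theta)}{p(x)}
                && \text{$p(x)$ does not depend on $\theta$} \\
             &= \arg\max_\theta \, p(\theta \mid x)
                && \text{Bayes' theorem}
\end{align*}
```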
Ex 1.5: Minimum description length

In compression, statistical regularities are used for compressing data. In MDL-flavored machine learning, compression is used for finding statistical regularities.