CS 591, Lecture 2. Data Analytics: Theory and Applications. Boston University. Charalampos E. Tsourakakis. January 25th, 2017
Probability Theory
The theory of probability is a system for making better guesses. http://www.feynmanlectures.caltech.edu/i_06.html
By the probability of a particular outcome of an observation we mean our estimate for the most likely fraction of a number of repeated observations that will yield that particular outcome. http://www.feynmanlectures.caltech.edu/i_06.html

$$P(A) = \frac{N_A}{N}$$
Inclusion-Exclusion theorem

Theorem. Suppose $n \in \mathbb{N}$ and $A_i$ is a finite set for $1 \le i \le n$. It follows that

$$\Big|\bigcup_{i=1}^{n} A_i\Big| = \sum_{1 \le i_1 \le n} |A_{i_1}| - \sum_{1 \le i_1 < i_2 \le n} |A_{i_1} \cap A_{i_2}| + \sum_{1 \le i_1 < i_2 < i_3 \le n} |A_{i_1} \cap A_{i_2} \cap A_{i_3}| - \cdots + (-1)^{n+1} \Big|\bigcap_{i=1}^{n} A_i\Big|.$$

Application (aka matching hat problem): Deal two packs of shuffled cards simultaneously. What is the probability that no pair of identical cards will be exposed simultaneously?
Inclusion-Exclusion theorem

Fix the first pack. Let $A_i$ be the set of all possible arrangements of the second pack which match the card in position $i$ of the first pack, and let $X = \bigcup_i A_i$. Details on whiteboard.

$$\frac{|X|}{52!} = \frac{1}{52!}\left[\binom{52}{1} 51! - \binom{52}{2} 50! + \binom{52}{3} 49! - \cdots - \binom{52}{52} 0!\right] = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots - \frac{1}{52!} \approx 1 - \sum_{i=0}^{\infty} \frac{(-1)^i}{i!} = 1 - \frac{1}{e}.$$

Thus the desired probability of no match tends to $1/e$ as $n \to +\infty$.
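Not on the original slides: a quick Monte Carlo sanity check of the $1/e$ answer. A minimal sketch; the trial count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def no_match_probability(n_cards=52, n_trials=100_000):
    """Estimate the probability that a random permutation of
    n_cards has no fixed point (no position where both packs match)."""
    no_match = 0
    for _ in range(n_trials):
        perm = rng.permutation(n_cards)
        if not np.any(perm == np.arange(n_cards)):
            no_match += 1
    return no_match / n_trials

print(no_match_probability())  # ~0.368
print(1 / np.e)                # 0.36787944...
```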
Fundamental Rules

$$\Pr[X \cup Y] = \Pr[X] + \Pr[Y] - \Pr[X \cap Y] \quad (1)$$

Sum rule:

$$\Pr[X] = \sum_y \Pr[X, Y = y] = \sum_y \Pr[X \mid Y = y]\, \Pr[Y = y] \quad (2)$$

Product rule:

$$\Pr[X, Y] = \Pr[X \cap Y] = \Pr[X]\, \Pr[Y \mid X] = \Pr[Y]\, \Pr[X \mid Y] \quad (3)$$
Fundamental Rules

By applying the product rule multiple times we obtain the chain rule:

$$\Pr[X_1, X_2, \ldots, X_n] = \Pr[X_1]\, \Pr[X_2, \ldots, X_n \mid X_1] = \cdots = \Pr[X_1]\, \Pr[X_2 \mid X_1]\, \Pr[X_3 \mid X_2, X_1] \cdots \Pr[X_n \mid X_1, \ldots, X_{n-1}] \quad (4)$$

Conditional probability:

$$\Pr[X \mid Y] = \frac{\Pr[X \cap Y]}{\Pr[Y]} \quad (5)$$
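A small numeric illustration of the sum and product rules; the joint table below is hypothetical, made up just for the example.

```python
import numpy as np

# Hypothetical joint distribution Pr[X=x, Y=y], X in {0,1}, Y in {0,1,2}.
joint = np.array([[0.10, 0.25, 0.15],
                  [0.20, 0.05, 0.25]])

# Sum rule: marginals are obtained by summing out the other variable.
px = joint.sum(axis=1)   # Pr[X=x] = sum_y Pr[X=x, Y=y]
py = joint.sum(axis=0)   # Pr[Y=y]

# Product rule: Pr[X, Y] = Pr[X | Y] Pr[Y], column by column.
px_given_y = joint / py
assert np.allclose(px_given_y * py, joint)
print(px, py)
```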
Reminder: Bayes rule

Bayes rule is a direct application of conditional probabilities:

$$\Pr[H \mid D] = \frac{\Pr[D \mid H]\, \Pr[H]}{\Pr[D]}, \quad \Pr[D] > 0,$$

or, in words, posterior $\propto$ likelihood $\times$ prior.
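A worked example as a sketch; the prior and likelihoods below are invented for illustration, not taken from the slides.

```python
# Hypothetical numbers: prior Pr[H] = 0.01,
# likelihoods Pr[D | H] = 0.9 and Pr[D | not H] = 0.05.
prior = 0.01
like_h, like_not_h = 0.9, 0.05

# Pr[D] via the sum rule, then Bayes' rule for the posterior.
evidence = like_h * prior + like_not_h * (1 - prior)
posterior = like_h * prior / evidence   # Pr[H | D]
print(posterior)  # ~0.154: posterior is proportional to likelihood x prior
```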
Independence and Conditional Independence

We say $X$ and $Y$ are unconditionally independent, marginally independent, or just independent if

$$\Pr[X \mid Y] = \Pr[X], \quad \Pr[Y \mid X] = \Pr[Y].$$

As a result, $\Pr[X, Y] = \Pr[X]\, \Pr[Y]$. Notation: $X \perp Y$.
Independence and Conditional Independence

[Figure slides, source: http://colah.github.io/posts/2015-09-Visual-Information/ with captions "Independence visualized", "Closer to reality", "... or alternatively...".]
Independence and Conditional Independence

We say $X$ and $Y$ are conditionally independent given $Z$ if

$$\Pr[X, Y \mid Z] = \Pr[X \mid Z]\, \Pr[Y \mid Z].$$

The joint distribution then factorizes as $\Pr[X, Y, Z] = \Pr[X \mid Z]\, \Pr[Y \mid Z]\, \Pr[Z]$. Notation: $X \perp Y \mid Z$.
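A sketch of checking this factorization numerically; the conditional tables below are hypothetical and constructed so that $X \perp Y \mid Z$ holds by design.

```python
import numpy as np

# Hypothetical binary X, Y, Z with Pr[X,Y,Z] = Pr[X|Z] Pr[Y|Z] Pr[Z].
pz = np.array([0.4, 0.6])
px_given_z = np.array([[0.7, 0.3],   # rows indexed by z, columns by x
                       [0.2, 0.8]])
py_given_z = np.array([[0.5, 0.5],
                       [0.9, 0.1]])

joint = np.einsum('zx,zy,z->xyz', px_given_z, py_given_z, pz)

# Verify Pr[X,Y | Z=z] = Pr[X | Z=z] Pr[Y | Z=z] for each z.
pxy_given_z = joint / pz   # divide by Pr[Z=z] along the last axis
for z in range(2):
    assert np.allclose(pxy_given_z[:, :, z],
                       np.outer(px_given_z[z], py_given_z[z]))
print(joint.sum())  # 1.0
```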
Mean, variance, covariance

For discrete RVs $E[X] = \sum_x x \Pr[X = x]$, and for continuous $E[X] = \int x\, p(x)\, dx$.

The variance and the standard deviation $\mathrm{std}[X] = \sigma$ are defined as

$$\mathrm{Var}[X] = \sigma^2 = E\big[(X - E[X])^2\big] = E[X^2] - E[X]^2.$$

Reminder: Jensen's inequality states that if $f$ is convex, then $f(E[X]) \le E[f(X)]$.
Mean, variance, covariance

Covariance of two random variables $X, Y$:

$$\mathrm{cov}[X, Y] = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\, E[Y].$$

In general, if $\mathbf{x}$ is a $d$-dimensional random vector, the covariance is defined as

$$\mathrm{cov}[\mathbf{x}] = E\big[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^T\big].$$

Pearson correlation coefficient:

$$\mathrm{corr}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\, \mathrm{Var}[Y]}}.$$
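A quick empirical check of these definitions with NumPy; the linear relationship between x and y below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)   # positively correlated with x

# cov[X,Y] = E[XY] - E[X]E[Y], then the Pearson correlation.
cov_xy = (x * y).mean() - x.mean() * y.mean()
corr_xy = cov_xy / np.sqrt(x.var() * y.var())

print(cov_xy, corr_xy)       # corr ~ 2/sqrt(5) ~ 0.894
print(np.cov(x, y))          # 2x2 covariance matrix
print(np.corrcoef(x, y))     # Pearson correlation matrix
```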
Mean, variance, covariance

[Figure: correlation examples, from Wikipedia.]
Probability distributions

We will go over a few important ones.
Discrete distributions

Details on whiteboard.

- Bernoulli: $X \sim \mathrm{Ber}(p)$
- Binomial: $X \sim \mathrm{Bin}(n, p)$
- Multinomial: $\mathbf{x} \sim \mathrm{Mu}(n, \theta)$
- Poisson: $X \sim \mathrm{Po}(\lambda)$
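A minimal sampling sketch for these four distributions using NumPy's generator API; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

ber   = rng.binomial(1, 0.3, size=5)            # Bernoulli(0.3) = Bin(1, 0.3)
binom = rng.binomial(10, 0.3, size=5)           # Bin(n=10, p=0.3)
multi = rng.multinomial(10, [0.2, 0.3, 0.5])    # Mu(n=10, theta)
pois  = rng.poisson(4.0, size=5)                # Po(lambda=4)

print(ber, binom, multi, pois)
```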
Continuous Univariate distributions

- Normal: $X \sim \mathcal{N}(x; \mu, \sigma^2)$
- Student t distribution: $X \sim \mathcal{T}(x; \mu, \sigma^2, \nu)$
- Laplace: $X \sim \mathrm{Lap}(x; \mu, \beta)$
- Gamma: $X \sim \mathrm{Ga}(x; \alpha, \beta)$
- Pareto: $X \sim \mathrm{Pareto}(x; k, m)$
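These densities can all be evaluated with scipy.stats, as sketched below; note that scipy parameterizes the Gamma with scale $= 1/\beta$ and its standard Pareto has support $x \ge 1$. The evaluation points and parameters are arbitrary.

```python
from scipy import stats

x = 0.5
print(stats.norm.pdf(x, loc=0, scale=1))       # N(mu=0, sigma=1)
print(stats.t.pdf(x, df=3, loc=0, scale=1))    # Student t, nu=3
print(stats.laplace.pdf(x, loc=0, scale=1))    # Lap(mu=0, beta=1)
print(stats.gamma.pdf(x, a=2.0, scale=1/3.0))  # Ga(alpha=2, beta=3); scale = 1/beta
print(stats.pareto.pdf(1.5, b=2.5))            # Pareto(k=2.5), support x >= 1
```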
Multivariate normal distribution

$$\mathcal{N}(\mathbf{x}; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \right\}$$

where $\mu = E[\mathbf{x}]$ and $\Sigma = \mathrm{Cov}[\mathbf{x}]$. $\Sigma^{-1} = \Lambda$ is also known as the precision matrix. Isotropic case: $\Sigma = \sigma^2 I$.
Multivariate normal distribution

[Contour plot for $\mu = (0, 0)$, $\Sigma = \begin{pmatrix} 2 & 1.8 \\ 1.8 & 2 \end{pmatrix}$.]
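A sketch evaluating this density both via scipy.stats and directly from the formula above, using the $\mu$ and $\Sigma$ of the contour plot.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.8],
                  [1.8, 2.0]])

mvn = multivariate_normal(mean=mu, cov=Sigma)
print(mvn.pdf([0.0, 0.0]))   # density at the mean

# Same value from the formula, using the precision matrix Lambda.
x = np.array([0.0, 0.0])
D = 2
Lambda = np.linalg.inv(Sigma)
quad = (x - mu) @ Lambda @ (x - mu)
dens = np.exp(-quad / 2) / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
print(dens)                  # matches mvn.pdf
```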
Linear transformations of Random Variables

Suppose $f$ is a linear function: $\mathbf{y} = f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}$. Then, by linearity of expectation,

$$E[\mathbf{y}] = A\, E[\mathbf{x}] + \mathbf{b} \quad (6)$$

and the covariance transforms as

$$\mathrm{Cov}[\mathbf{y}] = A\, \mathrm{Cov}[\mathbf{x}]\, A^T. \quad (7)$$

If $f$ is scalar-valued, $y = \mathbf{a}^T \mathbf{x} + b$, then

$$\mathrm{Var}[y] = \mathrm{Var}[\mathbf{a}^T \mathbf{x} + b] = \mathbf{a}^T\, \mathrm{Cov}[\mathbf{x}]\, \mathbf{a}. \quad (8)$$
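An empirical check of (6) and (7) by sampling; $A$, $\mathbf{b}$, and the covariance of $\mathbf{x}$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
b = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 1.8],
                  [1.8, 2.0]])

x = rng.multivariate_normal([0, 0], Sigma, size=100_000)
y = x @ A.T + b              # y = Ax + b, applied row-wise

print(y.mean(axis=0))        # ~ A E[x] + b = b, since E[x] = 0
print(np.cov(y.T))           # ~ A Sigma A^T
print(A @ Sigma @ A.T)       # exact value for comparison
```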
Information Theory
Information Theory

Suppose Bob wants to communicate with Alice by sending her bits. Example: [figure]

Can we use fewer than 2 bits? [figure]

Can we use fewer than 1.75 bits? [figure]
Information Theory

Suppose there are $n$ events, the $k$-th event occurring with probability $p_k$. Shannon entropy, or just entropy, is defined as:

$$H(p_1, \ldots, p_n) = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}.$$
Information Theory

Intuition: when the $k$-th event happens, we receive $\log_2 \frac{1}{p_k}$ bits of information. Therefore, $H(p_1, \ldots, p_n)$ is the expected number of bits in a random event.

If $p_k = 0$, we define $p_k \log \frac{1}{p_k} = 0$. To see why: $\lim_{\epsilon \to 0^+} \epsilon \log \frac{1}{\epsilon} = 0$.

Question: For what values $p_1, \ldots, p_n$ is the entropy maximized?
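A minimal entropy function, using the $0 \cdot \log \frac{1}{0} = 0$ convention, that also illustrates the answer to the question: the uniform distribution maximizes $H$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with the convention 0 * log(1/0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -(p[nz] * np.log2(p[nz])).sum()

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
print(entropy([0.25] * 4))                 # 2.0: uniform maximizes H
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0: no uncertainty
```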
Information Theory

Cross-entropy: [figure]
Information Theory

Cross-entropy:

$$H_p(q) = \sum_x q(x) \log \frac{1}{p(x)}.$$

$H(p) = 1.75$, $H(q) = 1.75$, $H_p(q) = 2.25$, $H_q(p) = 2.375$. Cross-entropy isn't symmetric!

For the interested: Cross entropy and neural networks
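The figures above are reproduced exactly by $p = (1/2, 1/4, 1/8, 1/8)$ and $q = (1/8, 1/2, 1/4, 1/8)$; assuming these are the distributions from the slides' running example, a quick check:

```python
import numpy as np

def cross_entropy(p, q):
    """H_p(q) = sum_x q(x) log2(1/p(x)): average code length when
    events follow q but the code is optimized for p."""
    p, q = np.asarray(p), np.asarray(q)
    return -(q * np.log2(p)).sum()

p = np.array([1/2, 1/4, 1/8, 1/8])   # assumed example distribution
q = np.array([1/8, 1/2, 1/4, 1/8])   # assumed example distribution

print(cross_entropy(p, p))  # 1.75  = H(p)
print(cross_entropy(q, q))  # 1.75  = H(q)
print(cross_entropy(p, q))  # 2.25  = H_p(q)
print(cross_entropy(q, p))  # 2.375 = H_q(p): not symmetric
```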
Information Theory

Kullback-Leibler divergence (aka relative entropy):

$$KL(p, q) = \sum_k p_k \log \frac{p_k}{q_k}.$$

Equivalently, $KL(p, q) = H_q(p) - H(p)$.

Theorem (Information Inequality):

$$KL(p, q) \ge 0, \quad (9)$$

with equality iff $p = q$.
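scipy.stats.entropy computes the KL divergence when given two arguments; with the same $p$ and $q$ as in the cross-entropy sketch above:

```python
import numpy as np
from scipy.stats import entropy as kl   # entropy(p, q) computes KL(p, q)

p = np.array([1/2, 1/4, 1/8, 1/8])
q = np.array([1/8, 1/2, 1/4, 1/8])

print(kl(p, q, base=2))  # 0.625 = H_q(p) - H(p) = 2.375 - 1.75
print(kl(q, p, base=2))  # 0.5   = H_p(q) - H(q) = 2.25  - 1.75
print(kl(p, p, base=2))  # 0.0: equality iff p = q
```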
Information Theory

How similar is the joint probability distribution $p(X, Y)$ to the factorization $p(X)p(Y)$?

Mutual information:

$$I(X; Y) = KL\big(p(X, Y)\, \|\, p(X)p(Y)\big) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \quad (10)$$
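A sketch computing $I(X;Y)$ from a joint table; the table below is hypothetical, and any dependent pair $X, Y$ gives a positive value.

```python
import numpy as np

# Hypothetical joint distribution Pr[X, Y] over binary X, Y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

# I(X;Y) = KL(p(X,Y) || p(X)p(Y)), in bits.
indep = np.outer(px, py)
mi = (pxy * np.log2(pxy / indep)).sum()
print(mi)  # ~0.278 > 0, since X and Y are dependent here
```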