Statistical Methods for Intelligent Information Processing (SMIIP)
Lecture 2: Introduction to Probability
Shuigeng Zhou, School of Computer Science
September 20, 2017
Outline
Background and concepts
Some discrete distributions
Some continuous distributions
Joint probability distributions
Transformations of random variables
Monte Carlo approximation
Information theory
Examples
Background and Concepts
What is probability?
"Probability theory is nothing but common sense reduced to calculation." (Pierre-Simon Laplace)
Two probability interpretations:
Frequentist interpretation (objectivists): probabilities represent long-run frequencies of events
Bayesian interpretation (subjectivists): probability is used to quantify our uncertainty about something
German tank problem
During World War II, German tanks were sequentially numbered; assume 1, 2, 3, ..., N
Some of the numbers became known to Allied forces when tanks were captured or records were seized
Allied statisticians developed an estimation procedure to determine N
At the end of WWII, the serial-number estimate for German tank production was very close to the actual figure
Sampling methods
Convenience sampling: obtain the easiest sample you can get (this is a bad idea)
Random sampling: any method where every member of the population has an equal chance of being selected
Stratified sampling: split the population into groups (strata) and sample from each group separately; the goal is for each stratum to be homogeneous (its members are very similar)
Cluster sampling: randomly select a few clusters and sample all members of those clusters
Systematic sampling: set an order for the data, start from a random element, and then select every k-th member, with k = N/n, where N is the population size and n is the number of samples to be selected
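A minimal sketch of this interval rule, assuming for simplicity that N is divisible by n (the function name and the integer rounding are illustrative choices, not from the slides):

```python
import random

def systematic_sample(population, n):
    """Order the data, pick a random start in the first interval,
    then take every k-th member with k = N // n."""
    N = len(population)
    k = N // n                    # sampling interval (assumes n <= N)
    start = random.randrange(k)   # random starting element
    return [population[start + i * k] for i in range(n)]

print(systematic_sample(list(range(50)), 5))  # e.g. [3, 13, 23, 33, 43]
```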
Basic concepts (1)
Event A and its probability P(A): $0 \le P(A) \le 1$
Discrete random variable X: state space $\chi$, probability mass function (pmf) p(x)
Probability of a union of two events A and B: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Joint probability: the probability of the joint event A and B, $P(A, B) = P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$ (the product rule)
Conditional probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$ if $P(B) > 0$
Marginal distribution: $P(A) = \sum_b P(A, B = b) = \sum_b P(A \mid B = b)\,P(B = b)$ (the sum rule)
Basic concepts (2)
Continuous random variable X
Cumulative distribution function (cdf): $F(q) = P(X \le q)$
Probability density function (pdf): f(x), with $P(a < X \le b) = \int_a^b f(x)\,dx$
Quantile (分位数): if F is the cdf of X and $F(x_\alpha) = \alpha$, then $x_\alpha$ is the $\alpha$ quantile of F
Mean, or expected value: $\mu = E[X]$; variance: $\sigma^2 = E[(X - \mu)^2] = E[X^2] - \mu^2$
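A quick sketch of these quantities using scipy.stats; the standard normal here is just a running example, not a distribution named on the slide:

```python
from scipy import stats

X = stats.norm(loc=0, scale=1)    # standard normal as an example rv

print(X.cdf(1.96))                # F(1.96) = P(X <= 1.96) ~ 0.975
print(X.pdf(0.0))                 # density at 0 ~ 0.399
print(X.ppf(0.975))               # the 0.975 quantile (inverse cdf) ~ 1.96
print(X.mean(), X.var())          # 0.0 1.0
```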
Mode, median and range
Median: the middle value in the dataset
Mode: the value that occurs most often in the dataset
Range: the difference between the largest and the smallest values
Descriptive variables
Descriptive statistics to measure the central tendency
Variance estimation
It measures dispersion: the scatter of the values about the mean
Variance estimation
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$, where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample variance: taking n samples from the population, estimate the variance as $\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_y)^2$, where $\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$
Sampling multiple times and computing the expected value of $\sigma_y^2$ gives $E[\sigma_y^2] = \frac{n-1}{n}\sigma^2$, so $\sigma^2 = \frac{n}{n-1}E[\sigma_y^2]$
Taking the variance of a single sample as an estimate of $E[\sigma_y^2]$, the sample variance $s^2$ is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \mu_y)^2$
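A simulation sketch of the (n-1)/n bias above; the population distribution, seed, and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=100_000)  # sigma^2 ~ 4
sigma2 = population.var()                                  # population variance

n, trials = 10, 20_000
biased, unbiased = [], []
for _ in range(trials):
    y = rng.choice(population, size=n)
    biased.append(y.var(ddof=0))     # divide by n
    unbiased.append(y.var(ddof=1))   # divide by n-1 (Bessel's correction)

print(sigma2)             # ~4.0
print(np.mean(biased))    # ~ (n-1)/n * sigma^2 ~ 3.6
print(np.mean(unbiased))  # ~ sigma^2 ~ 4.0
```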
Independence and conditional independence
Unconditional (marginal) independence: $X \perp Y \iff P(X, Y) = P(X)\,P(Y)$
Conditional independence: $X \perp Y \mid Z \iff P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
Bayes rule
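In its standard form, for a discrete variable Y:

$$P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\,P(Y = y')}$$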
Some Common Discrete Distributions
The binomial and Bernoulli distributions
Binomial distribution: toss a coin n times; the probability of having k heads, with head probability $\theta$, is $\mathrm{Bin}(k \mid n, \theta) = \binom{n}{k}\,\theta^k (1-\theta)^{n-k}$
Bernoulli: a special case of the binomial distribution where the coin is tossed only once (n = 1)
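A direct sketch of this pmf (the function name is illustrative):

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k heads in n tosses of a coin with head probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(binom_pmf(3, 10, 0.5))   # ~0.117
print(binom_pmf(1, 1, 0.5))    # Bernoulli: the n = 1 special case
print(sum(binom_pmf(k, 10, 0.5) for k in range(11)))  # pmf sums to 1
```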
The binomial distribution
The multinomial and multinoulli distributions
Multinomial distribution: toss a K-sided die n times; $x = (x_1, x_2, \ldots, x_K)$ is a vector counting how many times each side appears
Multinoulli: a special case of the multinomial distribution with n = 1
Summary of the multinomial and related distributions
Application: DNA sequence motifs
The Poisson distribution
The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents
The Poisson distribution
The Poisson distribution, $\mathrm{Poi}(x \mid \lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}$, can be derived by considering a binomial distribution B(n, p) with $n \to \infty$ and $p \to 0$ while np = λ stays fixed
Mean and variance of the Poisson distribution
Recall that the mean of a binomial distribution B(n, p) is np and its variance is np(1-p) = λ(1-p)
Since the Poisson distribution approximates a binomial distribution as n approaches infinity and p becomes extremely small, its mean is E(X) = np = λ
Its variance λ(1-p) ~ λ when p is very small
So the mean and variance of the Poisson distribution are the same: λ
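A small numerical sketch of this limit; λ and the evaluation point k = 2 are arbitrary:

```python
from math import comb, exp, factorial

lam = 3.0

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# As n grows with p = lam / n held so that np = lam,
# the binomial pmf approaches the Poisson pmf
for n in (10, 100, 10_000):
    print(n, binom_pmf(2, n, lam / n), poisson_pmf(2, lam))
```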
The Poisson distribution
Empirical distribution
$p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A)$, the fraction of the samples $x_1, \ldots, x_N$ that fall in A. Here, A is a range
Discrete probability distributions
Some Common Continuous Distributions
Gaussian (normal) distribution
pdf: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$; the standard normal distribution is $\mathcal{N}(0, 1)$
The cdf of the Gaussian is defined as $\Phi(x; \mu, \sigma^2) = \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\,dz$, which has no closed form
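A sketch evaluating both, writing the cdf via the error function; the erf identity is a standard one, assumed here rather than taken from the slide:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def norm_cdf(x, mu=0.0, sigma=1.0):
    # no closed form; the usual expression uses the error function erf
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(norm_pdf(0.0))   # ~0.3989
print(norm_cdf(1.96))  # ~0.975
```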
Why is the Gaussian distribution important?
It is simple, with only two parameters, and easy to use
Many real-world phenomena have an approximately Gaussian distribution
According to the central limit theorem, sums of independent random variables have an approximately Gaussian distribution
Student t distribution
The Gaussian distribution is sensitive to outliers; a more robust alternative is the Student t distribution
When ν = 1, it is known as the Cauchy or Lorentz distribution, which has heavy tails
When ν >> 5, it approaches the Gaussian distribution
The Laplace distribution
Also called the double-sided exponential distribution: $\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b}\exp\left(-\frac{|x - \mu|}{b}\right)$
pdf and log(pdf)
Effect of outliers
The gamma distribution
The gamma distribution is a flexible distribution for positive real-valued random variables
The beta distribution
The beta distribution has support over the interval [0, 1] and is defined as follows: $\mathrm{Beta}(x \mid a, b) = \frac{1}{B(a, b)}\,x^{a-1}(1-x)^{b-1}$
Here B(a, b) is the beta function: $B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$
The beta distribution
a = b = 1: uniform distribution
a and b < 1: bimodal distribution with spikes at 0 and 1
a and b > 1: unimodal distribution
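A quick sketch of these three regimes with scipy.stats; the parameter values are arbitrary examples of each case:

```python
from scipy import stats

for a, b in [(1, 1), (0.1, 0.1), (2, 3)]:
    d = stats.beta(a, b)
    # density near the endpoints and at the center reveals the shape:
    # flat for (1, 1), spikes at 0 and 1 for (0.1, 0.1), a single hump for (2, 3)
    print((a, b), d.pdf(0.01), d.pdf(0.5), d.pdf(0.99))
```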
Pareto distribution
The Pareto distribution is used to model quantities that exhibit long tails, also called heavy tails
The Pareto pdf is defined as follows: $\mathrm{Pareto}(x \mid k, m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$
This distribution has the following properties: mean $= \frac{km}{k-1}$ (if k > 1), mode $= m$, var $= \frac{m^2 k}{(k-1)^2(k-2)}$ (if k > 2)
Pareto distribution
Continuous probability distributions
Joint Probability Distributions
Covariance
A joint probability distribution has the form p(x_1, ..., x_D) for a set of D > 1 variables
The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related: $\mathrm{cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$
For a d-dimensional random vector x, its covariance matrix is: $\mathrm{cov}[\mathbf{x}] = E\left[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^{T}\right]$
Correlation
The (Pearson) correlation coefficient between X and Y is defined as $\mathrm{corr}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}}$
For a d-dimensional random vector x, its correlation matrix is the matrix with entries $\mathrm{corr}[X_i, X_j]$
Correlation
The correlation coefficient measures the degree of linearity; it is not related to the slope of the regression line
The regression coefficient (the slope) is $\mathrm{cov}[X, Y]/\mathrm{var}[X]$
If X and Y are independent, meaning p(X, Y) = p(X)p(Y), then cov[X, Y] = 0 and hence corr[X, Y] = 0, so they are uncorrelated
However, the converse is not true: uncorrelated does not imply independent
Correlation
The multivariate Gaussian
The pdf of the multivariate Gaussian or multivariate normal (MVN) in D dimensions is
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}\,\exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$$
Here, $\boldsymbol{\mu} = E[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$ is the D x D covariance matrix.
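A sketch that draws MVN samples and checks the mean vector and covariance matrix empirically; the particular μ and Σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x = rng.multivariate_normal(mu, Sigma, size=50_000)
print(x.mean(axis=0))            # ~ mu
print(np.cov(x, rowvar=False))   # ~ Sigma
```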
2D Gaussians
Multivariate Student t distribution
The pdf of the multivariate Student t distribution is
$$\mathcal{T}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma(\nu/2 + D/2)}{\Gamma(\nu/2)}\,\frac{|\boldsymbol{\Sigma}|^{-1/2}}{(\nu\pi)^{D/2}}\,\left[1 + \frac{1}{\nu}(\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]^{-\frac{\nu + D}{2}}$$
The distribution has the following properties: mean $= \boldsymbol{\mu}$, mode $= \boldsymbol{\mu}$, cov $= \frac{\nu}{\nu - 2}\boldsymbol{\Sigma}$ (for ν > 2)
Dirichlet distribution
The Dirichlet distribution is a multivariate generalization of the beta distribution, which has support over the probability simplex, defined by $S_K = \{\mathbf{x} : 0 \le x_k \le 1,\ \sum_{k=1}^{K} x_k = 1\}$
The pdf is defined as follows: $\mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K} x_k^{\alpha_k - 1}\,\mathbb{I}(\mathbf{x} \in S_K)$
The distribution has these properties (with $\alpha_0 = \sum_k \alpha_k$): $E[x_k] = \frac{\alpha_k}{\alpha_0}$, $\mathrm{mode}[x_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$, $\mathrm{var}[x_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$
Transformations of Random Variables
Linear transformations
Suppose f() is a linear function: $\mathbf{y} = f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$
We have $E[\mathbf{y}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and $\mathrm{cov}[\mathbf{y}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^{T}$, where $\boldsymbol{\mu} = E[\mathbf{x}]$ and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$
If f() is a scalar-valued function, $f(\mathbf{x}) = \mathbf{a}^{T}\mathbf{x} + b$, then $E[y] = \mathbf{a}^{T}\boldsymbol{\mu} + b$ and $\mathrm{var}[y] = \mathbf{a}^{T}\boldsymbol{\Sigma}\mathbf{a}$
General transformations
If X is a discrete rv, we can derive the pmf for y by simply summing up the probability mass for all the x's such that f(x) = y: $p_y(y) = \sum_{x:\,f(x) = y} p_x(x)$
If X is continuous, the change-of-variables formula gives $p_y(y) = p_x(x)\left|\frac{dx}{dy}\right|$, where $x = f^{-1}(y)$
Multivariate change of variables
Let f be a function that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $\mathbf{y} = f(\mathbf{x})$. Then its Jacobian matrix J is given by $\mathbf{J}_{\mathbf{x}\to\mathbf{y}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$, the matrix of partial derivatives $\frac{\partial y_i}{\partial x_j}$
If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping $\mathbf{y} \to \mathbf{x}$: $p_y(\mathbf{y}) = p_x(\mathbf{x})\,\left|\det \mathbf{J}_{\mathbf{y}\to\mathbf{x}}\right|$
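A one-dimensional sanity-check sketch: for X ~ Uniform(0, 1) and y = x², the formula gives p_y(y) = 1/(2√y); the Monte Carlo density below should match it (sample size, window width, and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1_000_000)
y = x**2                       # y = f(x) with f(x) = x^2

# change of variables: p_y(y) = p_x(sqrt(y)) * |dx/dy| = 1 / (2*sqrt(y))
y0, h = 0.25, 0.01
mc_density = np.mean((y > y0 - h / 2) & (y < y0 + h / 2)) / h
print(mc_density)              # ~1.0 from samples
print(1 / (2 * np.sqrt(y0)))   # = 1.0 from the formula
```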
Central limit theorem
Now consider N random variables with pdfs (not necessarily Gaussian) $p(x_i)$, each with mean μ and variance σ². We assume each variable is independent and identically distributed, or iid for short
Let $S_N = \sum_{i=1}^{N} X_i$ be the sum of the rvs. One can show that, as N increases, the distribution of this sum approaches $\mathcal{N}(N\mu, N\sigma^2)$; equivalently, $Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}}$ approaches the standard normal
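A simulation sketch with uniform (decidedly non-Gaussian) summands; N and the number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 100_000

# X_i ~ Uniform(0, 1): mu = 0.5, sigma^2 = 1/12
x = rng.uniform(0, 1, size=(trials, N))
s = x.sum(axis=1)
z = (s - N * 0.5) / np.sqrt(N / 12)   # standardized sums

print(z.mean(), z.std())              # ~0, ~1
print(np.mean(z < 1.96))              # ~0.975, matching the standard normal cdf
```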
Central limit theorem
Monte Carlo Approximation
Monte Carlo approximation
In general, computing the distribution of a function of an rv using the change of variables formula can be difficult
One simple but powerful alternative is Monte Carlo approximation:
First, generate S samples from the distribution, call them $x_1, \ldots, x_S$ (for example, by Markov chain Monte Carlo, or MCMC)
Then, approximate the distribution of f(X) by using the empirical distribution of $\{f(x_s)\}_{s=1}^{S}$
Monte Carlo approximation
By varying the function f(), we can approximate many quantities of interest, such as:
$\bar{x} = \frac{1}{S}\sum_{s=1}^{S} x_s \to E[X]$
$\frac{1}{S}\sum_{s=1}^{S}(x_s - \bar{x})^2 \to \mathrm{var}[X]$
$\frac{1}{S}\,\#\{x_s \le c\} \to P(X \le c)$
$\mathrm{median}\{x_1, \ldots, x_S\} \to \mathrm{median}(X)$
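A sketch of these estimates for a standard normal, plus the classic area-of-a-circle example; the sample size and the choice of target distribution are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 1_000_000
x = rng.standard_normal(S)    # samples from N(0, 1)

print(np.mean(x**2))          # E[X^2] = 1
print(np.mean(x <= 1.96))     # P(X <= 1.96) ~ 0.975
print(np.median(x))           # median = 0

# classic example: area of the unit circle via uniform samples in [-1, 1]^2
u = rng.uniform(-1, 1, size=(S, 2))
print(4 * np.mean((u**2).sum(axis=1) <= 1))   # ~ pi
```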
Monte Carlo approximation
Some Concepts of Information Theory
Entropy
Entropy of a random variable X with distribution p: $\mathbb{H}(X) = -\sum_{k=1}^{K} p(X = k)\,\log_2 p(X = k)$
For binary random variables, with $p(X = 1) = \theta$, we have $\mathbb{H}(X) = -\left[\theta\log_2\theta + (1 - \theta)\log_2(1 - \theta)\right]$
This is called the binary entropy function
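A direct sketch of this definition; the helper name and test distributions are illustrative:

```python
from math import log2

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)

print(entropy([0.5, 0.5]))   # 1 bit: a fair coin
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin
print(entropy([0.25] * 4))   # 2 bits: uniform over 4 outcomes
```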
Entropy
KL divergence
$\mathrm{KL}(p \,\|\, q) = \sum_{k} p_k \log\frac{p_k}{q_k}$
KL divergence is the average number of extra bits needed to encode data from p using a code designed for q
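A sketch of the definition, also showing that KL is not symmetric; the example distributions are arbitrary:

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, p))   # 0: identical distributions
print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits: KL is not symmetric
```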
Why mutual information?
Often, we want to know something about a variable Y from another variable X
Correlation can measure the relationship between two variables, but it is only defined for real-valued variables
Furthermore, it cannot describe independence between two variables well:
Independent -> uncorrelated
Uncorrelated does not imply independent
Mutual information
For two rvs X and Y, the MI is defined as $\mathbb{I}(X; Y) = \mathrm{KL}\big(p(X, Y)\,\|\,p(X)\,p(Y)\big) = \sum_x\sum_y p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$
In terms of conditional entropy, $\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X)$
We can show that $\mathbb{I}(X; Y) \ge 0$, with equality iff p(X, Y) = p(X)p(Y)
So the MI between X and Y is the reduction in uncertainty about X after observing Y
Pointwise mutual information
For two events (not random variables) x and y, PMI is defined as $\mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$
PMI measures the discrepancy between these events occurring together and what would be expected by chance
The MI of X and Y is just the expected value of the PMI
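A sketch computing MI as the expected PMI over a joint pmf table; the two test tables are illustrative:

```python
import numpy as np

def mutual_information(pxy):
    """I(X; Y) in bits from a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    # MI is the expected pointwise mutual information log p(x,y)/(p(x)p(y))
    pmi = np.zeros_like(pxy)
    pmi[mask] = np.log2(pxy[mask] / (px * py)[mask])
    return (pxy * pmi).sum()

print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0: independent
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1 bit: X determines Y
```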
Two Examples
Example 1: medical diagnosis (rare diseases)

                  breast cancer: p(y=1)    healthy: p(y=0)
Test positive     p(x=1|y=1)               p(x=1|y=0)          p(x=1)
Test negative     p(x=0|y=1)               p(x=0|y=0)          p(x=0)

If Jenny tests positive for breast cancer, what is the probability that she really suffers from breast cancer? That is, what is p(y=1|x=1)?
Example 1: medical diagnosis (rare diseases)

                  breast cancer: p(y=1) = 0.004    healthy: p(y=0) = 0.996
Test positive     p(x=1|y=1) = 0.8                 p(x=1|y=0) = 0.1           p(x=1) = 0.004*0.8 + 0.996*0.1
Test negative     p(x=0|y=1) = 0.2                 p(x=0|y=0) = 0.9           p(x=0)

A positive test should be treated carefully for rare diseases
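Applying Bayes rule to the numbers in the table:

$$p(y=1 \mid x=1) = \frac{p(x=1 \mid y=1)\,p(y=1)}{p(x=1)} = \frac{0.8 \times 0.004}{0.8 \times 0.004 + 0.1 \times 0.996} = \frac{0.0032}{0.1028} \approx 0.031$$

So even after a positive test, the probability of disease is only about 3%, which is why a positive result for a rare disease must be interpreted carefully.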
Example 2: German tank problem
1. Frequentist statistics
Sample k serial numbers and let the largest be m. The probability of this event is
$$P(\text{largest} = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}$$
The expected value of the largest observed serial number is
$$E[\text{largest}] = \mu = \sum_{m=k}^{N} m\,\frac{\binom{m-1}{k-1}}{\binom{N}{k}} = \frac{k(N+1)}{k+1}$$
Solving for N gives $N = \mu\left(1 + \frac{1}{k}\right) - 1$, and taking the observed m as the estimate of $\mu$ yields $\hat{N} = m\left(1 + \frac{1}{k}\right) - 1$
For k = 4 samples {2, 6, 7, 14}, m = 14: $\hat{N} = 14 \times 1.25 - 1 = 16.5$
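A one-function sketch of this estimator (the function name is illustrative):

```python
def frequentist_estimate(serials):
    """N_hat = m * (1 + 1/k) - 1, with m the max serial and k the sample size."""
    k, m = len(serials), max(serials)
    return m * (1 + 1 / k) - 1

print(frequentist_estimate([2, 6, 7, 14]))   # 16.5
```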
German tank problem
2. Bayesian statistics
The Bayesian approach considers the credibility that the total number of enemy tanks N equals n, given that the number of observed tanks K equals k and the maximum observed serial number M equals m: $(N = n \mid M = m, K = k)$, abbreviated $(n \mid m, k)$
By the rule of conditional probability, $(n \mid m, k) = \dfrac{(m \mid n, k)\,(n \mid k)}{(m \mid k)}$
The probability that the maximum serial number among k observed tanks equals m, given that the total number of tanks is known to be n: $(m \mid n, k) = \dfrac{\binom{m-1}{k-1}}{\binom{n}{k}}$
German tank problem (Bayesian statistics)
German tank problem (Bayesian statistics)
https://en.wikipedia.org/wiki/German_tank_problem
German tank problem
Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, 14, so the maximum observed serial number is m = 14. The unknown total number of tanks is denoted N
German tank problem
According to conventional Allied intelligence estimates, Germany was producing about 1,400 tanks per month between June 1940 and September 1942. Plugging the serial numbers of captured tanks into the formula gave an estimate of 246 per month. After the war, captured German production records from the ministry of Albert Speer showed the actual figure to be 245. Estimates for certain specific months are as follows:
The end
Assignment: read Chapter 2 of the Murphy book