Statistical Methods for Intelligent Information Processing (SMIIP)
Lecture 2: Introduction to Probability
Shuigeng Zhou, School of Computer Science
September 20, 2017
Outline
Background and concepts
Some discrete distributions
Some continuous distributions
Joint probability distributions
Transformations of random variables
Monte Carlo approximation
Information theory
Examples
Background and Concepts
What is probability?
"Probability theory is nothing but common sense reduced to calculation." (Pierre-Simon Laplace)
Two probability interpretations:
Frequentist interpretation (objectivists): probabilities represent long-run frequencies of events
Bayesian interpretation (subjectivists): probability is used to quantify our uncertainty about something
German tank problem
During World War II, German tanks were sequentially numbered; assume 1, 2, 3, ..., N
Some of the numbers became known to Allied forces when tanks were captured or records were seized
Allied statisticians developed an estimation procedure to determine N
At the end of WWII, the serial-number estimate for German tank production was very close to the actual figure
Sampling methods
Convenience sampling: obtain the easiest sample you can get (this is a bad idea)
Random sampling: any method where every member of the population has an equal chance of being selected
Stratified sampling: split the population into groups (strata) and sample from each group separately; the goal is for each stratum to be homogeneous (its members are very similar)
Cluster sampling: randomly select a few clusters and sample all members of those clusters
Systematic sampling: set an order for the data, start from a random element, and then select every k-th member, with k = N/n, where N is the population size and n is the number of samples to be selected
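A minimal sketch of this interval rule, assuming for simplicity that N is divisible by n (the function name and the integer rounding are illustrative choices, not from the slides):

```python
import random

def systematic_sample(population, n):
    """Order the data, pick a random start in the first interval,
    then take every k-th member with k = N // n."""
    N = len(population)
    k = N // n                    # sampling interval (assumes n <= N)
    start = random.randrange(k)   # random starting element
    return [population[start + i * k] for i in range(n)]

print(systematic_sample(list(range(50)), 5))  # e.g. [3, 13, 23, 33, 43]
```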
Basic concepts (1)
Event A and its probability P(A): $0 \le P(A) \le 1$
Discrete random variable X: state space $\chi$, probability mass function (pmf) p(x)
Probability of a union of two events A and B: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Joint probability: the probability of the joint event A and B, $P(A, B) = P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$ (the product rule)
Conditional probability: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$ if $P(B) > 0$
Marginal distribution: $P(A) = \sum_b P(A, B = b) = \sum_b P(A \mid B = b)\,P(B = b)$ (the sum rule)
Basic concepts (2)
Continuous random variable X
Cumulative distribution function (cdf): $F(q) = P(X \le q)$
Probability density function (pdf): f(x), with $P(a < X \le b) = \int_a^b f(x)\,dx$
Quantile (分位数): if F is the cdf of X and $F(x_\alpha) = \alpha$, then $x_\alpha$ is the $\alpha$ quantile of F
Mean, or expected value: $\mu = E[X]$; variance: $\sigma^2 = E[(X - \mu)^2] = E[X^2] - \mu^2$
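A quick sketch of these quantities using scipy.stats; the standard normal here is just a running example, not a distribution named on the slide:

```python
from scipy import stats

X = stats.norm(loc=0, scale=1)    # standard normal as an example rv

print(X.cdf(1.96))                # F(1.96) = P(X <= 1.96) ~ 0.975
print(X.pdf(0.0))                 # density at 0 ~ 0.399
print(X.ppf(0.975))               # the 0.975 quantile (inverse cdf) ~ 1.96
print(X.mean(), X.var())          # 0.0 1.0
```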
Mode, median and range
Median: the middle value in the dataset
Mode: the value that occurs most often in the dataset
Range: the difference between the largest and the smallest values
Descriptive variables
Descriptive statistics to measure the central tendency
Variance estimation
It measures dispersion: the scatter of the values about the mean
Variance estimation
Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$, where $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample variance: taking n samples from the population, estimate the variance as $\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_y)^2$, where $\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$
Sampling multiple times and computing the expected value of $\sigma_y^2$ gives $E[\sigma_y^2] = \frac{n-1}{n}\sigma^2$, so $\sigma^2 = \frac{n}{n-1}E[\sigma_y^2]$
Taking the variance of a single sample as an estimate of $E[\sigma_y^2]$, the sample variance $s^2$ is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \mu_y)^2$
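A simulation sketch of the (n-1)/n bias above; the population distribution, seed, and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=100_000)  # sigma^2 ~ 4
sigma2 = population.var()                                  # population variance

n, trials = 10, 20_000
biased, unbiased = [], []
for _ in range(trials):
    y = rng.choice(population, size=n)
    biased.append(y.var(ddof=0))     # divide by n
    unbiased.append(y.var(ddof=1))   # divide by n-1 (Bessel's correction)

print(sigma2)             # ~4.0
print(np.mean(biased))    # ~ (n-1)/n * sigma^2 ~ 3.6
print(np.mean(unbiased))  # ~ sigma^2 ~ 4.0
```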
Independence and conditional independence
Unconditional (marginal) independence: $X \perp Y \iff P(X, Y) = P(X)\,P(Y)$
Conditional independence: $X \perp Y \mid Z \iff P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$
Bayes rule
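In its standard form, for a discrete variable Y:

$$P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\,P(Y = y)}{\sum_{y'} P(X = x \mid Y = y')\,P(Y = y')}$$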
Some Common Discrete Distributions
The binomial and Bernoulli distributions
Binomial distribution: toss a coin n times; the probability of having k heads, with head probability $\theta$, is $\mathrm{Bin}(k \mid n, \theta) = \binom{n}{k}\,\theta^k (1-\theta)^{n-k}$
Bernoulli: a special case of the binomial distribution where the coin is tossed only once (n = 1)
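A direct sketch of this pmf (the function name is illustrative):

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k heads in n tosses of a coin with head probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

print(binom_pmf(3, 10, 0.5))   # ~0.117
print(binom_pmf(1, 1, 0.5))    # Bernoulli: the n = 1 special case
print(sum(binom_pmf(k, 10, 0.5) for k in range(11)))  # pmf sums to 1
```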
The binomial distribution
The multinomial and multinoulli distributions
Multinomial distribution: toss a K-sided die n times; $x = (x_1, x_2, \ldots, x_K)$ is a vector counting how many times each side appears
Multinoulli: a special case of the multinomial distribution with n = 1
Summary of the multinomial and related distributions
Application: DNA sequence motifs
The Poisson distribution
The Poisson distribution is often used as a model for counts of rare events like radioactive decay and traffic accidents
The Poisson distribution
The Poisson distribution, $\mathrm{Poi}(x \mid \lambda) = e^{-\lambda}\,\frac{\lambda^x}{x!}$, can be derived by considering a binomial distribution B(n, p) with $n \to \infty$ and $p \to 0$ while np = λ stays fixed
Mean and variance of the Poisson distribution
Recall that the mean of a binomial distribution B(n, p) is np and its variance is np(1-p) = λ(1-p)
Since the Poisson distribution approximates a binomial distribution as n approaches infinity and p becomes extremely small, its mean is E(X) = np = λ
Its variance λ(1-p) ~ λ when p is very small
So the mean and variance of the Poisson distribution are the same: λ
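A small numerical sketch of this limit; λ and the evaluation point k = 2 are arbitrary:

```python
from math import comb, exp, factorial

lam = 3.0

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# As n grows with p = lam / n held so that np = lam,
# the binomial pmf approaches the Poisson pmf
for n in (10, 100, 10_000):
    print(n, binom_pmf(2, n, lam / n), poisson_pmf(2, lam))
```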
The Poisson distribution
Empirical distribution
$p_{\mathrm{emp}}(A) = \frac{1}{N}\sum_{i=1}^{N}\delta_{x_i}(A)$, the fraction of the samples $x_1, \ldots, x_N$ that fall in A. Here, A is a range
Discrete probability distributions
Some Common Continuous Distributions
Gaussian (normal) distribution
pdf: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$; the standard normal distribution is $\mathcal{N}(0, 1)$
The cdf of the Gaussian is defined as $\Phi(x; \mu, \sigma^2) = \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\,dz$, which has no closed form
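A sketch evaluating both, writing the cdf via the error function; the erf identity is a standard one, assumed here rather than taken from the slide:

```python
from math import erf, exp, pi, sqrt

def norm_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def norm_cdf(x, mu=0.0, sigma=1.0):
    # no closed form; the usual expression uses the error function erf
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(norm_pdf(0.0))   # ~0.3989
print(norm_cdf(1.96))  # ~0.975
```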
Why is the Gaussian distribution important?
It is simple, with only two parameters, and easy to use
Many real-world phenomena have an approximately Gaussian distribution
According to the central limit theorem, sums of independent random variables have an approximately Gaussian distribution
Student t distribution
The Gaussian distribution is sensitive to outliers; a more robust alternative is the Student t distribution
When ν = 1, it is known as the Cauchy or Lorentz distribution, which has heavy tails
When ν >> 5, it approaches the Gaussian distribution
The Laplace distribution
Also called the double-sided exponential distribution: $\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b}\exp\left(-\frac{|x - \mu|}{b}\right)$
pdf and log(pdf)
Effect of outliers
The gamma distribution
The gamma distribution is a flexible distribution for positive real-valued random variables
The beta distribution
The beta distribution has support over the interval [0, 1] and is defined as follows: $\mathrm{Beta}(x \mid a, b) = \frac{1}{B(a, b)}\,x^{a-1}(1-x)^{b-1}$
Here B(a, b) is the beta function: $B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$
The beta distribution
a = b = 1: uniform distribution
a and b < 1: bimodal distribution with spikes at 0 and 1
a and b > 1: unimodal distribution
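A quick sketch of these three regimes with scipy.stats; the parameter values are arbitrary examples of each case:

```python
from scipy import stats

for a, b in [(1, 1), (0.1, 0.1), (2, 3)]:
    d = stats.beta(a, b)
    # density near the endpoints and at the center reveals the shape:
    # flat for (1, 1), spikes at 0 and 1 for (0.1, 0.1), a single hump for (2, 3)
    print((a, b), d.pdf(0.01), d.pdf(0.5), d.pdf(0.99))
```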
Pareto distribution
The Pareto distribution is used to model quantities that exhibit long tails, also called heavy tails
The Pareto pdf is defined as follows: $\mathrm{Pareto}(x \mid k, m) = k\,m^k\,x^{-(k+1)}\,\mathbb{I}(x \ge m)$
This distribution has the following properties: mean $= \frac{km}{k-1}$ (if k > 1), mode $= m$, var $= \frac{m^2 k}{(k-1)^2(k-2)}$ (if k > 2)
Pareto distribution
Continuous probability distributions
Joint Probability Distributions
Covariance
A joint probability distribution has the form p(x_1, ..., x_D) for a set of D > 1 variables
The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related: $\mathrm{cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$
For a d-dimensional random vector x, its covariance matrix is: $\mathrm{cov}[\mathbf{x}] = E\left[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^{T}\right]$
Correlation
The (Pearson) correlation coefficient between X and Y is defined as $\mathrm{corr}[X, Y] = \frac{\mathrm{cov}[X, Y]}{\sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}}$
For a d-dimensional random vector x, its correlation matrix is the matrix with entries $\mathrm{corr}[X_i, X_j]$
Correlation
The correlation coefficient measures the degree of linearity; it is not related to the slope of the regression line
The regression coefficient (the slope) is $\mathrm{cov}[X, Y]/\mathrm{var}[X]$
If X and Y are independent, meaning p(X, Y) = p(X)p(Y), then cov[X, Y] = 0 and hence corr[X, Y] = 0, so they are uncorrelated
However, the converse is not true: uncorrelated does not imply independent
Correlation
The multivariate Gaussian
The pdf of the multivariate Gaussian or multivariate normal (MVN) in D dimensions is
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}\,\exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]$$
Here, $\boldsymbol{\mu} = E[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$ is the D x D covariance matrix.
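A sketch that draws MVN samples and checks the mean vector and covariance matrix empirically; the particular μ and Σ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

x = rng.multivariate_normal(mu, Sigma, size=50_000)
print(x.mean(axis=0))            # ~ mu
print(np.cov(x, rowvar=False))   # ~ Sigma
```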
2D Gaussians
Multivariate Student t distribution
The pdf of the multivariate Student t distribution is
$$\mathcal{T}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma(\nu/2 + D/2)}{\Gamma(\nu/2)}\,\frac{|\boldsymbol{\Sigma}|^{-1/2}}{(\nu\pi)^{D/2}}\,\left[1 + \frac{1}{\nu}(\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right]^{-\frac{\nu + D}{2}}$$
The distribution has the following properties: mean $= \boldsymbol{\mu}$, mode $= \boldsymbol{\mu}$, cov $= \frac{\nu}{\nu - 2}\boldsymbol{\Sigma}$ (for ν > 2)
Dirichlet distribution
The Dirichlet distribution is a multivariate generalization of the beta distribution, which has support over the probability simplex, defined by $S_K = \{\mathbf{x} : 0 \le x_k \le 1,\ \sum_{k=1}^{K} x_k = 1\}$
The pdf is defined as follows: $\mathrm{Dir}(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})}\prod_{k=1}^{K} x_k^{\alpha_k - 1}\,\mathbb{I}(\mathbf{x} \in S_K)$
The distribution has these properties (with $\alpha_0 = \sum_k \alpha_k$): $E[x_k] = \frac{\alpha_k}{\alpha_0}$, $\mathrm{mode}[x_k] = \frac{\alpha_k - 1}{\alpha_0 - K}$, $\mathrm{var}[x_k] = \frac{\alpha_k(\alpha_0 - \alpha_k)}{\alpha_0^2(\alpha_0 + 1)}$
Transformations of Random Variables
Linear transformations
Suppose f() is a linear function: $\mathbf{y} = f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}$
We have $E[\mathbf{y}] = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and $\mathrm{cov}[\mathbf{y}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^{T}$, where $\boldsymbol{\mu} = E[\mathbf{x}]$ and $\boldsymbol{\Sigma} = \mathrm{cov}[\mathbf{x}]$
If f() is a scalar-valued function, $f(\mathbf{x}) = \mathbf{a}^{T}\mathbf{x} + b$, then $E[y] = \mathbf{a}^{T}\boldsymbol{\mu} + b$ and $\mathrm{var}[y] = \mathbf{a}^{T}\boldsymbol{\Sigma}\mathbf{a}$
General transformations
If X is a discrete rv, we can derive the pmf for y by simply summing up the probability mass for all the x's such that f(x) = y: $p_y(y) = \sum_{x:\,f(x) = y} p_x(x)$
If X is continuous, the change-of-variables formula gives $p_y(y) = p_x(x)\left|\frac{dx}{dy}\right|$, where $x = f^{-1}(y)$
Multivariate change of variables
Let f be a function that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $\mathbf{y} = f(\mathbf{x})$. Then its Jacobian matrix J is given by $\mathbf{J}_{\mathbf{x}\to\mathbf{y}} = \frac{\partial \mathbf{y}}{\partial \mathbf{x}}$, the matrix of partial derivatives $\frac{\partial y_i}{\partial x_j}$
If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping $\mathbf{y} \to \mathbf{x}$: $p_y(\mathbf{y}) = p_x(\mathbf{x})\,\left|\det \mathbf{J}_{\mathbf{y}\to\mathbf{x}}\right|$
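A one-dimensional sanity-check sketch: for X ~ Uniform(0, 1) and y = x², the formula gives p_y(y) = 1/(2√y); the Monte Carlo density below should match it (sample size, window width, and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=1_000_000)
y = x**2                       # y = f(x) with f(x) = x^2

# change of variables: p_y(y) = p_x(sqrt(y)) * |dx/dy| = 1 / (2*sqrt(y))
y0, h = 0.25, 0.01
mc_density = np.mean((y > y0 - h / 2) & (y < y0 + h / 2)) / h
print(mc_density)              # ~1.0 from samples
print(1 / (2 * np.sqrt(y0)))   # = 1.0 from the formula
```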
Central limit theorem
Now consider N random variables with pdfs (not necessarily Gaussian) $p(x_i)$, each with mean μ and variance σ². We assume each variable is independent and identically distributed, or iid for short
Let $S_N = \sum_{i=1}^{N} X_i$ be the sum of the rvs. One can show that, as N increases, the distribution of this sum approaches $\mathcal{N}(N\mu, N\sigma^2)$; equivalently, $Z_N = \frac{S_N - N\mu}{\sigma\sqrt{N}}$ approaches the standard normal
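A simulation sketch with uniform (decidedly non-Gaussian) summands; N and the number of trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 100_000

# X_i ~ Uniform(0, 1): mu = 0.5, sigma^2 = 1/12
x = rng.uniform(0, 1, size=(trials, N))
s = x.sum(axis=1)
z = (s - N * 0.5) / np.sqrt(N / 12)   # standardized sums

print(z.mean(), z.std())              # ~0, ~1
print(np.mean(z < 1.96))              # ~0.975, matching the standard normal cdf
```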
Central limit theorem
Monte Carlo Approximation
Monte Carlo approximation
In general, computing the distribution of a function of an rv using the change of variables formula can be difficult
One simple but powerful alternative is Monte Carlo approximation:
First, generate S samples from the distribution, call them $x_1, \ldots, x_S$ (for example, by Markov chain Monte Carlo, or MCMC)
Then, approximate the distribution of f(X) by using the empirical distribution of $\{f(x_s)\}_{s=1}^{S}$
Monte Carlo approximation
By varying the function f(), we can approximate many quantities of interest, such as:
$\bar{x} = \frac{1}{S}\sum_{s=1}^{S} x_s \to E[X]$
$\frac{1}{S}\sum_{s=1}^{S}(x_s - \bar{x})^2 \to \mathrm{var}[X]$
$\frac{1}{S}\,\#\{x_s \le c\} \to P(X \le c)$
$\mathrm{median}\{x_1, \ldots, x_S\} \to \mathrm{median}(X)$
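A sketch of these estimates for a standard normal, plus the classic area-of-a-circle example; the sample size and the choice of target distribution are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 1_000_000
x = rng.standard_normal(S)    # samples from N(0, 1)

print(np.mean(x**2))          # E[X^2] = 1
print(np.mean(x <= 1.96))     # P(X <= 1.96) ~ 0.975
print(np.median(x))           # median = 0

# classic example: area of the unit circle via uniform samples in [-1, 1]^2
u = rng.uniform(-1, 1, size=(S, 2))
print(4 * np.mean((u**2).sum(axis=1) <= 1))   # ~ pi
```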
Monte Carlo approximation
Some Concepts of Information Theory
Entropy
Entropy of a random variable X with distribution p: $\mathbb{H}(X) = -\sum_{k=1}^{K} p(X = k)\,\log_2 p(X = k)$
For binary random variables, with $p(X = 1) = \theta$, we have $\mathbb{H}(X) = -\left[\theta\log_2\theta + (1 - \theta)\log_2(1 - \theta)\right]$
This is called the binary entropy function
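A direct sketch of this definition; the helper name and test distributions are illustrative:

```python
from math import log2

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)

print(entropy([0.5, 0.5]))   # 1 bit: a fair coin
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin
print(entropy([0.25] * 4))   # 2 bits: uniform over 4 outcomes
```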
Entropy
KL divergence
$\mathrm{KL}(p \,\|\, q) = \sum_{k} p_k \log\frac{p_k}{q_k}$
KL divergence is the average number of extra bits needed to encode data from p using a code designed for q
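A sketch of the definition, also showing that KL is not symmetric; the example distributions are arbitrary:

```python
from math import log2

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, p))   # 0: identical distributions
print(kl_divergence(p, q))   # ~0.737 bits
print(kl_divergence(q, p))   # ~0.531 bits: KL is not symmetric
```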
Why mutual information?
Often, we want to know something about a variable Y from another variable X
Correlation can measure the relationship between two variables, but it is only defined for real-valued variables
Furthermore, it cannot describe independence between two variables well:
Independent -> uncorrelated
Uncorrelated does not imply independent
Mutual information
For two rvs X and Y, the MI is defined as $\mathbb{I}(X; Y) = \mathrm{KL}\big(p(X, Y)\,\|\,p(X)\,p(Y)\big) = \sum_x\sum_y p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$
In terms of conditional entropy, $\mathbb{I}(X; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) = \mathbb{H}(Y) - \mathbb{H}(Y \mid X)$
We can show that $\mathbb{I}(X; Y) \ge 0$, with equality iff p(X, Y) = p(X)p(Y)
So the MI between X and Y is the reduction in uncertainty about X after observing Y
Pointwise mutual information
For two events (not random variables) x and y, PMI is defined as $\mathrm{PMI}(x, y) = \log\frac{p(x, y)}{p(x)\,p(y)}$
PMI measures the discrepancy between these events occurring together and what would be expected by chance
The MI of X and Y is just the expected value of the PMI
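A sketch computing MI as the expected PMI over a joint pmf table; the two test tables are illustrative:

```python
import numpy as np

def mutual_information(pxy):
    """I(X; Y) in bits from a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    # MI is the expected pointwise mutual information log p(x,y)/(p(x)p(y))
    pmi = np.zeros_like(pxy)
    pmi[mask] = np.log2(pxy[mask] / (px * py)[mask])
    return (pxy * pmi).sum()

print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))  # 0: independent
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))      # 1 bit: X determines Y
```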
Two Examples
Example 1: medical diagnosis (rare diseases)

                  breast cancer: p(y=1)    healthy: p(y=0)
Test positive     p(x=1|y=1)               p(x=1|y=0)          p(x=1)
Test negative     p(x=0|y=1)               p(x=0|y=0)          p(x=0)

If Jenny tests positive for breast cancer, what is the probability that she really suffers from breast cancer? That is, what is p(y=1|x=1)?
Example 1: medical diagnosis (rare diseases)

                  breast cancer: p(y=1) = 0.004    healthy: p(y=0) = 0.996
Test positive     p(x=1|y=1) = 0.8                 p(x=1|y=0) = 0.1           p(x=1) = 0.004*0.8 + 0.996*0.1
Test negative     p(x=0|y=1) = 0.2                 p(x=0|y=0) = 0.9           p(x=0)

A positive test should be treated carefully for rare diseases
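Applying Bayes rule to the numbers in the table:

$$p(y=1 \mid x=1) = \frac{p(x=1 \mid y=1)\,p(y=1)}{p(x=1)} = \frac{0.8 \times 0.004}{0.8 \times 0.004 + 0.1 \times 0.996} = \frac{0.0032}{0.1028} \approx 0.031$$

So even after a positive test, the probability of disease is only about 3%, which is why a positive result for a rare disease must be interpreted carefully.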
Example 2: German tank problem
1. Frequentist statistics
Sample k serial numbers and let the largest be m. The probability of this event is
$$P(\text{largest} = m) = \frac{\binom{m-1}{k-1}}{\binom{N}{k}}$$
The expected value of the largest observed serial number is
$$E[\text{largest}] = \mu = \sum_{m=k}^{N} m\,\frac{\binom{m-1}{k-1}}{\binom{N}{k}} = \frac{k(N+1)}{k+1}$$
Solving for N gives $N = \mu\left(1 + \frac{1}{k}\right) - 1$, and taking the observed m as the estimate of $\mu$ yields $\hat{N} = m\left(1 + \frac{1}{k}\right) - 1$
For k = 4 samples {2, 6, 7, 14}, m = 14: $\hat{N} = 14 \times 1.25 - 1 = 16.5$
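A one-function sketch of this estimator (the function name is illustrative):

```python
def frequentist_estimate(serials):
    """N_hat = m * (1 + 1/k) - 1, with m the max serial and k the sample size."""
    k, m = len(serials), max(serials)
    return m * (1 + 1 / k) - 1

print(frequentist_estimate([2, 6, 7, 14]))   # 16.5
```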
German tank problem
2. Bayesian statistics
The Bayesian approach considers the credibility that the total number of enemy tanks N equals n, given that the number of observed tanks K equals k and the maximum observed serial number M equals m: $(N = n \mid M = m, K = k)$, abbreviated $(n \mid m, k)$
By the rule of conditional probability, $(n \mid m, k) = \dfrac{(m \mid n, k)\,(n \mid k)}{(m \mid k)}$
The probability that the maximum serial number among k observed tanks equals m, given that the total number of tanks is known to be n: $(m \mid n, k) = \dfrac{\binom{m-1}{k-1}}{\binom{n}{k}}$
German tank problem (Bayesian statistics)
German tank problem (Bayesian statistics)
https://en.wikipedia.org/wiki/German_tank_problem
German tank problem
Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, 14, so the maximum observed serial number is m = 14. The unknown total number of tanks is denoted N
German tank problem
According to conventional Allied intelligence estimates, Germany was producing about 1,400 tanks per month between June 1940 and September 1942. Plugging the serial numbers of captured tanks into the formula gave an estimate of 246 per month. After the war, captured German production records from the ministry of Albert Speer showed the actual figure to be 245. Estimates for certain specific months are as follows:
The end
Assignment: read Chapter 2 of the Murphy book