Lecture 2: Introduction to Probability

1 Statistical Methods for Intelligent Information Processing (SMIIP) Lecture 2: Introduction to Probability Shuigeng Zhou School of Computer Science September 20, 2017

2 Outline: Background and concepts; Some discrete distributions; Some continuous distributions; Joint probability distribution; Transformations of random variables; Monte Carlo approximation; Information theory; Examples 2017/9/25 SMIIP 2

3 Background and Concepts

4 What is probability? "Probability theory is nothing but common sense reduced to calculation." (Pierre Laplace) Two interpretations of probability. Frequentist interpretation (objectivists): probabilities represent long-run frequencies of events. Bayesian interpretation (subjectivists): probability is used to quantify our uncertainty about something.

5 German tank problem. During World War II, German tanks were sequentially numbered; assume 1, 2, 3, ..., N. Some of the numbers became known to Allied forces when tanks were captured or records seized. Allied statisticians developed an estimation procedure to determine N. At the end of WWII, the serial-number estimate of German tank production turned out to be very close to the actual figure.

6 Sampling methods. Convenience sampling: take the easiest sample you can get. This is a bad idea, because such a sample is rarely representative of the population.

7 Sampling methods. Random sampling: any method in which every member of the population has an equal chance of being selected.

8 Sampling methods. Stratified sampling: split the population into groups (strata) and sample from each group separately. The goal is for each stratum to be homogeneous (its members are very similar).

9 Sampling methods. Cluster sampling: randomly select a few clusters and sample all members of those clusters.

10 Sampling methods. Systematic sampling: fix an order for the data, start from a random element, and then select every $k$-th member, with $k = N/n$, where $N$ is the dataset size and $n$ is the number of samples to be selected.
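A minimal Python sketch of systematic sampling as described on this slide. The function name `systematic_sample`, the integer interval `N // n`, and the choice to draw the random start from the first interval are assumptions made for illustration; they are not taken from the lecture.

```python
import random

def systematic_sample(data, n, rng=None):
    """Pick every k-th item from a random start, with k = N // n."""
    rng = rng or random.Random()
    N = len(data)
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= len(data)")
    k = N // n                    # sampling interval (integer part of N/n)
    start = rng.randrange(k)      # assumed: random start inside the first interval
    return data[start::k][:n]     # every k-th member, truncated to n items

population = list(range(100))     # hypothetical ordered dataset
sample = systematic_sample(population, 10, random.Random(0))
print(sample)
```

Because the start is the only random choice, the sample is fully determined once the start is known, which is why the ordering of the data matters for this method.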

11 Basic concepts (1). Event $A$ and its probability $p(A)$: $0 \le p(A) \le 1$. Discrete random variable $X$ with state space $\mathcal{X}$ and probability mass function (pmf) $p(x)$. Probability of a union of two events $A$ and $B$: $p(A \lor B) = p(A) + p(B) - p(A \land B)$. Joint probability, the probability of the joint event $A$ and $B$: $p(A, B) = p(A \land B) = p(A)\,p(B \mid A) = p(B)\,p(A \mid B)$ (product rule). Conditional probability: $p(A \mid B) = \frac{p(A, B)}{p(B)}$ if $p(B) > 0$. Marginal distribution: $p(A) = \sum_b p(A, B = b) = \sum_b p(A \mid B = b)\,p(B = b)$ (sum rule).
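The sum rule, product rule, and union formula can be checked numerically on a small joint table. The sketch below uses a made-up joint pmf over two binary variables; the numbers and helper names are illustrative assumptions only.

```python
import math

# Hypothetical joint pmf p(A=a, B=b) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def p_a(a):
    """Sum rule: marginalize B out of the joint."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def p_b(b):
    """Sum rule: marginalize A out of the joint."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def p_a_given_b(a, b):
    """Conditional probability p(A=a | B=b), defined when p(B=b) > 0."""
    return joint[(a, b)] / p_b(b)

# Product rule: p(A, B) = p(A | B) p(B)
product_rule_holds = math.isclose(p_a_given_b(1, 1) * p_b(1), joint[(1, 1)])
# Union: p(A or B) = p(A) + p(B) - p(A and B)
p_union = p_a(1) + p_b(1) - joint[(1, 1)]
print(p_a(1), p_b(1), p_a_given_b(1, 1), p_union, product_rule_holds)
```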

12 Basic concepts (2). Continuous random variable $X$. Cumulative distribution function (cdf): $F(q) = p(X \le q)$. Probability density function (pdf): $f(x)$, with $p(a < X \le b) = \int_a^b f(x)\,dx$. Quantile: if $F$ is the cdf of $X$ and $F(x_\alpha) = \alpha$, then $x_\alpha$ is the $\alpha$ quantile of $F$. Mean, or expected value: $\mu = E[X] = \int x\,f(x)\,dx$; variance: $\sigma^2 = E[(X - \mu)^2]$.

13 Mode, median and range. Median: the middle value of the sorted dataset. Mode: the value that occurs most often in the dataset. Range: the difference between the largest and the smallest values.

14 Descriptive variables

15 Descriptive statistics to measure the central tendency

16 Variance estimation. The variance measures dispersion: the scatter of the values about the mean.

17 Variance estimation. Population variance: with $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$. Sample variance: taking $n$ samples from the population, estimate the variance as $\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_y)^2$, with $\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$. Sampling multiple times and computing the expected value of $\sigma_y^2$ gives $E[\sigma_y^2] = \frac{n-1}{n}\sigma^2$, so $\sigma^2 = \frac{n}{n-1}E[\sigma_y^2]$. Taking the variance of a single sampling as $E[\sigma_y^2]$, the sample variance $s^2$ is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \mu_y)^2$.
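The bias of the $1/n$ estimator can be seen in a small simulation. The sketch below assumes a hypothetical population (the six faces of a die), draws samples with replacement, and averages both estimators over many repetitions; the function names and sizes are illustrative choices.

```python
import random
import statistics

def biased_var(ys):
    """Variance with a 1/n factor (sigma_y^2 on the slide)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def unbiased_var(ys):
    """Sample variance s^2 with a 1/(n-1) factor."""
    n = len(ys)
    return biased_var(ys) * n / (n - 1)

rng = random.Random(0)
population = [1, 2, 3, 4, 5, 6]            # hypothetical population: faces of a die
sigma2 = statistics.pvariance(population)  # population variance, 35/12
n, trials = 5, 20000
draws = [rng.choices(population, k=n) for _ in range(trials)]
avg_biased = sum(biased_var(d) for d in draws) / trials
avg_unbiased = sum(unbiased_var(d) for d in draws) / trials
print(sigma2, avg_biased, avg_unbiased)
```

The average of the biased estimator should land near $\frac{n-1}{n}\sigma^2$, while the average of $s^2$ should land near $\sigma^2$.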

18 Independence and conditional independence. Unconditional (marginal) independence: $X \perp Y \iff p(X, Y) = p(X)\,p(Y)$. Conditional independence: $X \perp Y \mid Z \iff p(X, Y \mid Z) = p(X \mid Z)\,p(Y \mid Z)$.

19 Bayes rule

20 Some Common Discrete Distributions

21 The binomial and Bernoulli distributions. Binomial distribution: toss a coin $n$ times; it gives the probability of getting $k$ heads. Bernoulli distribution: the special case of the binomial distribution in which the coin is tossed only once ($n = 1$).

22 The binomial distribution

23 The multinomial and multinoulli distributions. Multinomial distribution: toss a $K$-sided die $n$ times; $x = (x_1, x_2, \ldots, x_K)$ is a vector counting how many times each side appears. Multinoulli distribution: the special case of the multinomial distribution with $n = 1$.

24 Summary of the multinomial and related distributions

25 Application: DNA sequence motifs

26 The Poisson distribution. The Poisson distribution is often used as a model for counts of rare events such as radioactive decay and traffic accidents.

27 The Poisson distribution. Considering a binomial distribution

28 Mean and variance of the Poisson distribution. Recall that a binomial distribution $B(n, p)$ has mean $np$ and variance $np(1-p)$. Writing $\lambda = np$, the variance is $\lambda(1-p)$. The Poisson distribution approximates the binomial distribution when $n$ approaches infinity and $p$ is extremely small, so its mean is $E[x] = np = \lambda$ and its variance is $\lambda(1-p) \approx \lambda$. The mean and variance of the Poisson distribution are therefore the same: $\lambda$.
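The binomial-to-Poisson approximation can be checked numerically. The sketch below assumes the standard Poisson pmf $e^{-\lambda}\lambda^k/k!$ (not spelled out in this transcript) and illustrative values $\lambda = 3$, $n = 10000$.

```python
import math

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Standard Poisson pmf with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, n = 3.0, 10_000          # illustrative values: large n, small p
p = lam / n
max_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21))
print("largest pmf gap for k = 0..20:", max_gap)
print("binomial mean and variance:", n * p, n * p * (1 - p))
```

With $p$ this small, the binomial variance $np(1-p)$ is nearly indistinguishable from the mean $np = \lambda$, which matches the argument on the slide.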

29 The Poisson distribution

30 Empirical distribution. Here, A is a range

31 Discrete probability distributions

32 Some Common Continuous Distributions

33 Gaussian (normal) distribution. Standard normal distribution. The CDF of the Gaussian is defined as

34 Why is the Gaussian distribution important? It is simple, with only two parameters, and easy to use. Many real-world phenomena have an approximately Gaussian distribution. According to the central limit theorem, sums of independent random variables have an approximately Gaussian distribution.

35 Student t distribution. The Gaussian distribution is sensitive to outliers. A more robust distribution is the Student t distribution. When $\nu = 1$, it is known as the Cauchy or Lorentz distribution, which has heavy tails. When $\nu \gg 5$, it approaches a Gaussian distribution.

36 The Laplace distribution. Also called the double-sided exponential distribution.

37 pdf and log(pdf)

38 Effect of outliers

39 The gamma distribution. The gamma distribution is a flexible distribution for positive real-valued random variables.

40 The beta distribution. The beta distribution has support over the interval [0, 1] and is defined as follows: Here B(p, q) is the beta function:

41 The beta distribution. $a = b = 1$: uniform distribution. $a, b < 1$: bimodal distribution with spikes at 0 and 1. $a, b > 1$: unimodal distribution.

42 Pareto distribution. The Pareto distribution is used to model quantities that exhibit long tails, also called heavy tails. The Pareto pdf is defined as follows: This distribution has the following properties

43 Pareto distribution

44 Continuous probability distributions

45 Joint Probability Distributions

46 Covariance. A joint probability distribution has the form $p(x_1, \ldots, x_D)$ for a set of $D > 1$ variables. The covariance between two rvs $X$ and $Y$ measures the degree to which $X$ and $Y$ are (linearly) related: $\mathrm{cov}[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]\,E[Y]$. For a $d$-dimensional random vector $x$, its covariance matrix is $\mathrm{cov}[x] = E[(x - E[x])(x - E[x])^T]$.

47 Correlation. The (Pearson) correlation coefficient between $X$ and $Y$ is defined as $\mathrm{corr}[X, Y] = \mathrm{cov}[X, Y] / \sqrt{\mathrm{var}[X]\,\mathrm{var}[Y]}$. For a $d$-dimensional random vector $x$, its correlation matrix $R$ has entries $R_{ij} = \mathrm{corr}[X_i, X_j]$.

48 Correlation. The correlation coefficient measures the degree of linearity; it is not the slope of the regression line. The regression coefficient is $\mathrm{cov}[X, Y] / \mathrm{var}[X]$. If $X$ and $Y$ are independent, meaning $p(X, Y) = p(X)\,p(Y)$, then $\mathrm{cov}[X, Y] = 0$ and hence $\mathrm{corr}[X, Y] = 0$, so they are uncorrelated. However, the converse is not true: uncorrelated does not imply independent.
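A small exact example of "uncorrelated does not imply independent". The sketch assumes a hypothetical pair: $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so $Y$ is fully determined by $X$ and yet the covariance vanishes. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X**2.
xs = [-1, 0, 1]
p = Fraction(1, 3)
joint = {(x, x * x): p for x in xs}       # p(X=x, Y=y)

def expect(g):
    """Expectation of g(X, Y) under the joint pmf."""
    return sum(prob * g(x, y) for (x, y), prob in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

p_x0 = sum(prob for (x, _), prob in joint.items() if x == 0)
p_y0 = sum(prob for (_, y), prob in joint.items() if y == 0)
print("cov[X, Y] =", cov)
print("p(X=0, Y=0) =", joint[(0, 0)], "vs p(X=0) p(Y=0) =", p_x0 * p_y0)
```

The covariance is zero, but the joint probability at $(0, 0)$ differs from the product of the marginals, so the variables are dependent.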

49 Correlation

50 The multivariate Gaussian. The pdf of the multivariate Gaussian or multivariate normal (MVN) in $D$ dimensions is $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right]$. Here, $\mu = E[x] \in \mathbb{R}^D$ is the mean vector, and $\Sigma = \mathrm{cov}[x]$ is the $D \times D$ covariance matrix.

51 2D Gaussians

52 Multivariate Student t distribution. The pdf of the multivariate Student t distribution is The distribution has the following properties

53 Dirichlet distribution. The Dirichlet distribution is a multivariate generalization of the beta distribution, which has support over the probability simplex, defined by The pdf is defined as follows: The distribution has these properties

54 Transformations of Random Variables

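55 Linear transformations. Suppose f() is a linear function, $y = f(x) = Ax + b$. We have $E[y] = A\mu + b$ and $\mathrm{cov}[y] = A \Sigma A^T$, where $\mu = E[x]$ and $\Sigma = \mathrm{cov}[x]$. If f() is a scalar-valued function, $f(x) = a^T x + b$, then $E[a^T x + b] = a^T \mu + b$ and $\mathrm{var}[a^T x + b] = a^T \Sigma a$.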
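For the scalar-valued case, the standard results are $E[a^T x + b] = a^T \mu + b$ and $\mathrm{var}[a^T x + b] = a^T \Sigma a$, where $\mu = E[x]$ and $\Sigma = \mathrm{cov}[x]$. The sketch below computes these for a hypothetical 2-D example and compares them with a Monte Carlo estimate from correlated Gaussian samples. All numbers and names are illustrative assumptions.

```python
import math
import random

def affine_moments(a, b, mu, Sigma):
    """Mean and variance of y = a^T x + b given E[x] = mu and cov[x] = Sigma."""
    d = len(a)
    mean = sum(a[i] * mu[i] for i in range(d)) + b
    var = sum(a[i] * Sigma[i][j] * a[j] for i in range(d) for j in range(d))
    return mean, var

# Hypothetical 2-D example with correlation rho between the components.
rho = 0.5
mu, Sigma = [1.0, -2.0], [[1.0, rho], [rho, 1.0]]
a, b = [2.0, 3.0], 0.5
mean, var = affine_moments(a, b, mu, Sigma)

# Monte Carlo check using correlated unit-variance Gaussian samples.
rng = random.Random(0)
ys = []
for _ in range(50_000):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = [mu[0] + z1, mu[1] + rho * z1 + math.sqrt(1 - rho**2) * z2]
    ys.append(a[0] * x[0] + a[1] * x[1] + b)
mc_mean = sum(ys) / len(ys)
mc_var = sum((y - mc_mean) ** 2 for y in ys) / (len(ys) - 1)
print(mean, var, mc_mean, mc_var)
```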

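56 General transformations. If $X$ is a discrete rv, we can derive the pmf for $y$ by simply summing up the probability mass for all the $x$'s such that $f(x) = y$: $p_y(y) = \sum_{x : f(x) = y} p_x(x)$. If $X$ is continuous and $f$ is monotonic (hence invertible), $p_y(y) = p_x(x)\,\left|\frac{dx}{dy}\right|$.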
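The discrete case amounts to grouping probability mass by the value of $f(x)$. A minimal sketch, assuming a hypothetical $X$ uniform on $\{1, \ldots, 10\}$ and $f(x)$ indicating whether $x$ is even; the helper name `transform_pmf` is made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def transform_pmf(pmf_x, f):
    """pmf of y = f(x): add up the mass of every x that maps to the same y."""
    pmf_y = defaultdict(Fraction)
    for x, p in pmf_x.items():
        pmf_y[f(x)] += p
    return dict(pmf_y)

pmf_x = {x: Fraction(1, 10) for x in range(1, 11)}   # X uniform on {1, ..., 10}
pmf_y = transform_pmf(pmf_x, lambda x: x % 2 == 0)   # y is True when x is even
print(pmf_y)
```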

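57 Multivariate change of variables. Let $f$ be a function that maps $\mathbb{R}^n$ to $\mathbb{R}^n$, and let $y = f(x)$. Then its Jacobian matrix $J$ is given by $J_{x \to y} = \partial y / \partial x$, with entries $J_{ij} = \partial y_i / \partial x_j$. If $f$ is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping $y \to x$: $p_y(y) = p_x(x)\,|\det J_{y \to x}|$.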

58 Central limit theorem. Now consider $N$ random variables with pdfs (not necessarily Gaussian) $p(x_i)$, each with mean $\mu$ and variance $\sigma^2$. We assume the variables are independent and identically distributed (iid for short). Let $S_N = \sum_{i=1}^{N} X_i$ be the sum of the rvs. One can show that, as $N$ increases, the distribution of this sum approaches a Gaussian with mean $N\mu$ and variance $N\sigma^2$: $p(S_N = s) = \frac{1}{\sqrt{2\pi N\sigma^2}} \exp\left(-\frac{(s - N\mu)^2}{2N\sigma^2}\right)$.
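A quick simulation illustrates the theorem: sums of iid non-Gaussian variables should have mean $N\mu$, variance $N\sigma^2$, and roughly 68.3% of their mass within one standard deviation of the mean, as a Gaussian does. The sketch assumes Uniform(0, 1) variables and illustrative sizes.

```python
import random
import statistics

rng = random.Random(0)
N, trials = 30, 20_000            # illustrative sizes
mu, sigma2 = 0.5, 1 / 12          # mean and variance of Uniform(0, 1)

# Each entry is one realization of S_N, the sum of N iid uniform draws.
sums = [sum(rng.random() for _ in range(N)) for _ in range(trials)]
mean_S = statistics.fmean(sums)
var_S = statistics.variance(sums)
sd = (N * sigma2) ** 0.5
within_one_sd = sum(abs(s - N * mu) <= sd for s in sums) / trials
print(mean_S, var_S, within_one_sd)   # compare with N*mu, N*sigma2, about 0.683
```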

59 Central limit theorem

60 Monte Carlo Approximation

61 Monte Carlo approximation. In general, computing the distribution of a function of an rv using the change of variables formula can be difficult. One simple but powerful alternative is Monte Carlo approximation, as follows. First, we generate $S$ samples from the distribution, call them $x_1, \ldots, x_S$, for example by Markov chain Monte Carlo (MCMC). Then, we can approximate the distribution of $f(X)$ by the empirical distribution of $\{f(x_s)\}_{s=1}^{S}$.
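A minimal sketch of the two steps, assuming $x \sim \mathrm{Uniform}(-1, 1)$ and $f(x) = x^2$ as an illustrative choice. Direct sampling is used here rather than MCMC because the distribution is easy to sample from; the exact values $E[f(x)] = 1/3$ and $P(f(x) \le 0.25) = 0.5$ are available for comparison.

```python
import random

rng = random.Random(0)
S = 100_000
xs = [rng.uniform(-1, 1) for _ in range(S)]   # assumed: x ~ Uniform(-1, 1)
fs = [x * x for x in xs]                      # samples of f(x) = x^2

mean_f = sum(fs) / S                          # approximates E[f(x)] = 1/3
frac_small = sum(f <= 0.25 for f in fs) / S   # approximates P(f(x) <= 0.25) = 0.5
print(mean_f, frac_small)
```

Any other summary of the distribution of $f(x)$ (variance, quantiles, a histogram) can be read off the same list `fs`.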

62 Monte Carlo approximation. By varying the function f(), we can approximate many quantities of interest, such as

63 Monte Carlo approximation

64 Some Concepts of Information Theory

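65 Entropy. The entropy of a random variable $X$ with distribution $p$ over $K$ states is $H(X) = -\sum_{k=1}^{K} p(X = k) \log_2 p(X = k)$. For a binary random variable with $p(X = 1) = \theta$, we have $H(X) = -[\theta \log_2 \theta + (1 - \theta) \log_2 (1 - \theta)]$. This is called the binary entropy function.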
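A short sketch of the entropy and the binary entropy function, assuming the usual definition $H(X) = -\sum_k p_k \log_2 p_k$ with the convention $0 \log 0 = 0$. The function names are illustrative.

```python
import math

def entropy(probs):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(theta):
    """Entropy of a binary variable with p(X = 1) = theta."""
    return entropy([theta, 1 - theta])

print(binary_entropy(0.5), binary_entropy(0.1), entropy([0.25] * 4))
```

The binary entropy peaks at 1 bit for $\theta = 0.5$ and falls to 0 as $\theta$ approaches 0 or 1.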

66 Entropy

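67 KL divergence. $\mathrm{KL}(p \,\|\, q) = \sum_k p_k \log \frac{p_k}{q_k}$. The KL divergence is the average number of extra bits needed to encode the data when distribution $q$ is used for the encoding instead of the true distribution $p$.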
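The "extra bits" reading can be checked directly: assuming the standard definition $\mathrm{KL}(p \,\|\, q) = \sum_k p_k \log_2 (p_k / q_k)$, the divergence equals the cross entropy of $p$ under $q$ minus the entropy of $p$. The two distributions below are made up for illustration.

```python
import math

def entropy(p):
    """Entropy of p in bits."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def cross_entropy(p, q):
    """Average bits needed to encode data from p with a code built for q."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.5, 0.5]      # hypothetical true distribution
q = [0.9, 0.1]      # hypothetical coding distribution
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that the KL divergence is not symmetric in its arguments.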

68 Why mutual information? Often, we want to learn something about a variable $Y$ from another variable $X$. Correlation can measure the relationship between two variables, but it is defined only for real-valued variables. Furthermore, it cannot describe the independence between two variables well: independent implies uncorrelated, but uncorrelated does not imply independent.

69 Mutual information. For two rvs $X$ and $Y$, the MI is defined as $I(X; Y) = \mathrm{KL}(p(X, Y) \,\|\, p(X)\,p(Y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$. In terms of conditional entropy, $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$. We can show that $I(X; Y) \ge 0$, with equality iff $p(X, Y) = p(X)\,p(Y)$. MI can be interpreted as the reduction in uncertainty about $X$ after observing $Y$.

70 Pointwise mutual information. For two events (not random variables) $x$ and $y$, PMI is defined as $\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$. PMI measures the discrepancy between these events occurring together and what would be expected by chance. The MI of $X$ and $Y$ is just the expected value of the PMI.
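The statement that MI is the expected value of PMI can be written out directly. The sketch assumes the standard definition $\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}$ and a hypothetical joint pmf over two binary variables; it also checks that an independent joint gives zero MI.

```python
import math

# Hypothetical joint pmf p(x, y) for two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginals(joint):
    """Return the marginal pmfs p(x) and p(y) as dictionaries."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def pmi(joint, x, y):
    """Pointwise mutual information of the events x and y, in bits."""
    px, py = marginals(joint)
    return math.log2(joint[(x, y)] / (px[x] * py[y]))

def mutual_information(joint):
    """MI in bits, computed as the expected value of PMI."""
    return sum(p * pmi(joint, x, y) for (x, y), p in joint.items() if p > 0)

independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(joint), mutual_information(independent))
```

Pairs that co-occur more often than chance would predict have positive PMI, pairs that co-occur less often have negative PMI, and the weighted average is never negative.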

71 Two Examples

72 Example 1: medical diagnosis (rare diseases)

                 breast cancer: p(y = 1)   healthy: p(y = 0)   marginal
Test positive    p(x = 1 | y = 1)          p(x = 1 | y = 0)    p(x = 1)
Test negative    p(x = 0 | y = 1)          p(x = 0 | y = 0)    p(x = 0)

If Jenny tests positive for breast cancer, what is the probability that she really has breast cancer?

73 Example 1: medical diagnosis (rare diseases)

                 breast cancer: p(y = 1)   healthy: p(y = 0)   marginal
Test positive    p(x = 1 | y = 1)          p(x = 1 | y = 0)    p(x = 1)
Test negative    p(x = 0 | y = 1)          p(x = 0 | y = 0)    p(x = 0)

If Jenny tests positive for breast cancer, what is the probability that she really has breast cancer? That is, what is p(y = 1 | x = 1)?

74 Example 1: medical diagnosis (rare diseases)

                 breast cancer: p(y = 1) = 0.004   healthy: p(y = 0) = 0.996   marginal
Test positive    p(x = 1 | y = 1) = 0.8            p(x = 1 | y = 0) = 0.1      p(x = 1) = 0.004*0.8 + 0.996*0.1
Test negative    p(x = 0 | y = 1)                  p(x = 0 | y = 0)            p(x = 0)

If Jenny tests positive for breast cancer, what is the probability that she really has breast cancer? That is, what is p(y = 1 | x = 1)?

75 Example 1: medical diagnosis (rare diseases)

                 breast cancer: p(y = 1) = 0.004   healthy: p(y = 0) = 0.996   marginal
Test positive    p(x = 1 | y = 1) = 0.8            p(x = 1 | y = 0) = 0.1      p(x = 1) = 0.004*0.8 + 0.996*0.1
Test negative    p(x = 0 | y = 1)                  p(x = 0 | y = 0)            p(x = 0)

By Bayes rule, p(y = 1 | x = 1) = 0.004*0.8 / (0.004*0.8 + 0.996*0.1) ≈ 0.031. A positive test result should therefore be interpreted carefully for rare diseases.
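The posterior follows from Bayes rule using the numbers in the table: prior p(y = 1) = 1 - 0.996 = 0.004, p(x = 1 | y = 1) = 0.8, and p(x = 1 | y = 0) = 0.1. The function and parameter names below are illustrative.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """p(y=1 | x=1) by Bayes rule."""
    # Evidence p(x=1), obtained with the sum rule over y.
    evidence = prior * sensitivity + (1 - prior) * false_positive_rate
    return prior * sensitivity / evidence

post = posterior_positive(prior=0.004, sensitivity=0.8, false_positive_rate=0.1)
print(round(post, 3))   # 0.031
```

Even with a fairly accurate test, the posterior is only about 3%, because the 10% false-positive rate applies to the much larger healthy population.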

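76 Example 2: German tank problem. 1. Frequentist statistics. Sample $k$ serial numbers and let $m$ be the largest one. The probability of this event is $P(\text{largest} = m) = \binom{m-1}{k-1} / \binom{N}{k}$, so the expected largest value is $\mu = E[\text{largest}] = \sum_{m=k}^{N} m \binom{m-1}{k-1} / \binom{N}{k} = \frac{k(N+1)}{k+1}$. Solving for $N$ gives $N = \mu\,(1 + \frac{1}{k}) - 1$. Using the observed maximum $m$ as the estimate of $\mu$, $\hat{N} = m\,(1 + \frac{1}{k}) - 1$. With $k = 4$ samples and $m = 14$: $\hat{N} = 14\,(1 + \frac{1}{4}) - 1 = 16.5$.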
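A sketch of the frequentist estimate, assuming the standard estimator $\hat{N} = m\,(1 + 1/k) - 1$, which follows from $E[\text{largest}] = k(N+1)/(k+1)$. The code verifies that identity exactly for a small hypothetical $N$ and then applies the estimator to $k = 4$, $m = 14$ from the slide. Function names are illustrative.

```python
from fractions import Fraction
from math import comb

def expected_max(N, k):
    """E[largest serial] when k of the serials 1..N are drawn without replacement."""
    return sum(Fraction(m * comb(m - 1, k - 1), comb(N, k)) for m in range(k, N + 1))

def estimate_total(m, k):
    """Frequentist estimate N_hat = m (1 + 1/k) - 1."""
    return m * (1 + Fraction(1, k)) - 1

print(float(estimate_total(14, 4)))                   # 16.5
print(expected_max(20, 4) == Fraction(4 * 21, 5))     # True
```

The intuition is that the gap above the largest observed serial should be about as large as the average gap between the observed serials.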

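77 German tank problem. 2. Bayesian statistics. The Bayesian approach considers the credibility $(N = n \mid M = m, K = k)$, abbreviated $(n \mid m, k)$, that the total number of enemy tanks $N$ equals $n$, given that the number of observed tanks $K$ equals $k$ and the largest serial number $M$ equals $m$. The rule of conditional probability gives $(n \mid m, k) = (m \mid n, k)\,\frac{(n \mid k)}{(m \mid k)}$. The probability that the largest serial number among $k$ observed tanks equals $m$, when the total number of tanks is known to be $n$, is $(m \mid n, k) = \binom{m-1}{k-1} / \binom{n}{k}$ for $k \le m \le n$.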

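78 German tank problem (Bayesian statistics)

79 German tank problem (Bayesian statistics)

80 German tank problem. Suppose an intelligence officer has spotted $k = 4$ tanks with serial numbers ..., the largest observed serial number being $m = 14$. The unknown total number of tanks is denoted $N$.

81 German tank problem. According to conventional Allied intelligence estimates, Germany was producing about 1,400 tanks per month between June 1940 and September 1942. Plugging the serial numbers of captured tanks into the estimation formula gave a figure of 246 per month. After the war, German production records captured from the ministry of Albert Speer showed the actual number to be 245. Estimates for some specific months are as follows:

82 The end. Assignment: read Chapter 2 of the Murphy book.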
