Lecture 2: Introduction to Probability


Statistical Methods for Intelligent Information Processing (SMIIP). Lecture 2: Introduction to Probability. Shuigeng Zhou, School of Computer Science. September 20, 2017.

Outline: Background and concepts; some discrete distributions; some continuous distributions; joint probability distributions; transformations of random variables; Monte Carlo approximation; information theory; examples.

Background and Concepts

What is probability? "Probability theory is nothing but common sense reduced to calculation." (Pierre-Simon Laplace) There are two interpretations of probability: the frequentist interpretation (objectivist), in which probabilities represent long-run frequencies of events, and the Bayesian interpretation (subjectivist), in which probability is used to quantify our uncertainty about something.

German tank problem. During World War II, German tanks were sequentially numbered; assume 1, 2, 3, …, N. Some of the numbers became known to the Allied Forces when tanks were captured or records were seized. Allied statisticians developed an estimation procedure to determine N. At the end of WWII, the serial-number estimate for German tank production was very close to the actual figure.

Sampling methods. Convenience sampling: take the easiest sample you can get (this is a bad idea).

Sampling methods. Random sampling: any method in which every member of the population has an equal chance of being selected.

Sampling methods. Stratified sampling: split the population into groups (strata) and sample from each group separately. The goal is for each stratum to be homogeneous (its members are very similar).

Sampling methods. Cluster sampling: randomly select a few clusters and sample all members of those clusters.

Sampling methods. Systematic sampling: set an order for the data, start from a random element, and then select every k-th member, with k = N/n, where N is the dataset size and n is the number of samples to be selected.
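As a quick illustration, here is a minimal Python sketch of systematic sampling; the function and variable names are invented for this example:

```python
import random

def systematic_sample(data, n):
    """Draw n samples: pick a random start, then take every k-th element (k = N/n)."""
    N = len(data)
    k = N // n                     # sampling interval, k = N/n rounded down
    start = random.randrange(k)    # random starting element within the first interval
    return [data[start + i * k] for i in range(n)]

population = list(range(100))      # toy dataset of size N = 100
print(systematic_sample(population, 10))   # e.g. [3, 13, 23, ..., 93]
```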

Basic concepts (1). Event A and its probability p(A): 0 ≤ p(A) ≤ 1. Discrete random variable X with state space χ and probability mass function (pmf) p(x). Probability of a union of two events A and B: p(A ∪ B) = p(A) + p(B) − p(A ∩ B). Joint probability, the probability of the joint event A and B: p(A, B) = p(A ∩ B) = p(A) p(B|A) = p(B) p(A|B) --- the product rule. Conditional probability: p(A|B) = p(A, B)/p(B), defined when p(B) > 0. Marginal distribution: p(A) = Σ_b p(A, B = b) = Σ_b p(A|B = b) p(B = b) --- the sum rule.
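To make the sum and product rules concrete, here is a small NumPy sketch; the 2×2 joint table is invented for illustration:

```python
import numpy as np

# Invented 2x2 joint distribution p(A, B); rows index A, columns index B.
p_AB = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_A = p_AB.sum(axis=1)        # sum rule: p(A) = sum_b p(A, B=b)
p_B = p_AB.sum(axis=0)        # p(B) = sum_a p(A=a, B)
p_A_given_B = p_AB / p_B      # conditional: p(A|B) = p(A, B) / p(B)

# Product rule check: p(A, B) = p(A|B) p(B)
assert np.allclose(p_A_given_B * p_B, p_AB)
print("p(A) =", p_A, " p(B) =", p_B)
```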

Basic concepts (2). Continuous random variable X. Cumulative distribution function (cdf): F(q) = p(X ≤ q). Probability density function (pdf) f(x): p(a < X ≤ b) = ∫_a^b f(x) dx. Quantile: if F is the cdf of X and F(x_α) = α, then x_α is the α quantile of F. Mean, or expected value: E(X) = ∫ x f(x) dx. Variance: var(X) = E[(X − μ)²] = E(X²) − μ², where μ = E(X).

Mode, median and range. Median: the middle value in the dataset. Mode: the value that occurs most often in the dataset. Range: the difference between the largest and the smallest values.

Descriptive variables

Descriptive statistics to measure the central tendency

The variance estimation. Variance measures dispersion: the scatter of the values about the mean.

The variance estimation. Population variance: σ² = (1/N) Σ_{i=1}^N (x_i − μ)² = (1/N) Σ_{i=1}^N x_i² − μ², where μ = (1/N) Σ_{i=1}^N x_i. Taking n samples from the population, estimate the variance by σ_y² = (1/n) Σ_{i=1}^n (y_i − μ_y)², where μ_y = (1/n) Σ_{i=1}^n y_i. Sampling multiple times and computing the expected value of σ_y² gives E(σ_y²) = ((n − 1)/n) σ², so σ² = (n/(n − 1)) E(σ_y²). Taking the variance of a single sampling as an estimate of E(σ_y²), the sample variance s² is s² = (1/(n − 1)) Σ_{i=1}^n (y_i − μ_y)².
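A short simulation, a sketch with an arbitrary Gaussian population, makes the n/(n − 1) correction visible:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=5.0, scale=2.0, size=100_000)  # true variance = 4
n = 10

# Repeat the n-sample experiment many times and average the two estimators.
biased, unbiased = [], []
for _ in range(20_000):
    y = rng.choice(population, size=n)
    biased.append(np.var(y))            # divides by n
    unbiased.append(np.var(y, ddof=1))  # divides by n - 1

print(np.mean(biased))     # ~ (n-1)/n * sigma^2 ≈ 3.6
print(np.mean(unbiased))   # ~ sigma^2 ≈ 4.0
```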

Independence and conditional independence. Unconditional (marginal) independence: X ⊥ Y iff p(X, Y) = p(X) p(Y). Conditional independence: X ⊥ Y | Z iff p(X, Y | Z) = p(X | Z) p(Y | Z).

Bayes rule: p(A|B) = p(B|A) p(A)/p(B), where p(B) = Σ_a p(B|A = a) p(A = a) by the sum and product rules.

Some Common Discrete Distributions

The binomial and Bernoulli distributions. Binomial distribution: toss a coin n times; the probability of getting k heads is Bin(k|n, θ) = C(n, k) θ^k (1 − θ)^(n−k). Bernoulli: the special case of the binomial distribution where the coin is tossed only once (n = 1).
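SciPy can evaluate both pmfs directly; a minimal sketch with arbitrarily chosen parameters:

```python
from scipy.stats import binom, bernoulli

# Probability of k = 3 heads in n = 10 tosses of a fair coin.
print(binom.pmf(3, n=10, p=0.5))   # ≈ 0.117

# Bernoulli is the n = 1 special case of the binomial.
print(bernoulli.pmf(1, p=0.5))     # 0.5
print(binom.pmf(1, n=1, p=0.5))    # same value
```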

The binomial distribution

The multinomial and multinoulli distributions. Multinomial distribution: toss a K-sided die n times; x = (x_1, x_2, …, x_K) is a vector recording how many times each side appears. Multinoulli: the special case of the multinomial distribution with n = 1.

Summary of the multinomial and related distributions: with n trials and K outcomes, n = 1, K = 2 gives the Bernoulli; n > 1, K = 2 the binomial; n = 1, K > 2 the multinoulli; and n > 1, K > 2 the multinomial.

Application: DNA sequence motifs

The Poisson distribution. The Poisson distribution is often used as a model for counts of rare events, like radioactive decay and traffic accidents. Its pmf is Poi(x|λ) = e^(−λ) λ^x / x! for x ∈ {0, 1, 2, …}.

The Poisson distribution as a limit. Consider a binomial distribution B(n, p) with n → ∞ and p → 0 while np = λ is held fixed; the binomial pmf then converges to the Poisson pmf.

Mean and variance of the Poisson distribution. Recall that the mean of a binomial distribution B(n, p) is np and its variance is np(1 − p) = λ(1 − p). Since the Poisson distribution is the limit of the binomial distribution as n approaches infinity and p becomes extremely small, its mean is E(x) = np = λ, and its variance λ(1 − p) ≈ λ when p is very small. The mean and variance of the Poisson distribution are therefore both λ.
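A quick NumPy simulation, a sketch with λ = 3 chosen arbitrarily, confirms that the mean and variance coincide and that the binomial with large n and small p behaves the same way:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)
print(x.mean(), x.var())    # both ≈ 3.0

# The binomial approximation: n large, p small, np = lam held fixed.
y = rng.binomial(n=10_000, p=lam / 10_000, size=1_000_000)
print(y.mean(), y.var())    # also ≈ 3.0
```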

The Poisson distribution

Empirical distribution. Given samples x_1, …, x_N, the empirical distribution assigns probability p_emp(A) = (1/N) Σ_{i=1}^N δ_{x_i}(A); here A is a range (a set of values).

Discrete probability distributions

Some Common Continuous Distributions

Gaussian (normal) distribution. The pdf is N(x|μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)); the standard normal distribution is the case μ = 0, σ = 1. The cdf of the Gaussian is defined as Φ(x; μ, σ²) = ∫_{−∞}^x N(z|μ, σ²) dz.

Why is the Gaussian distribution important? It is simple, with only two parameters, and easy to use. Many phenomena in the real world have an approximately Gaussian distribution. And according to the central limit theorem, sums of independent random variables have an approximately Gaussian distribution.

Student t distribution. The Gaussian distribution is sensitive to outliers; a more robust alternative is the Student t distribution. When ν = 1, it is known as the Cauchy or Lorentz distribution, which has a heavy tail. When ν ≫ 5, it approaches the Gaussian distribution.

The Laplace distribution. Also called the double-sided exponential distribution, with pdf Lap(x|μ, b) = (1/(2b)) exp(−|x − μ|/b).

pdf and log(pdf)

Effect of Outliers

The gamma distribution. The gamma distribution is a flexible distribution for positive real-valued random variables, with pdf Ga(x|a, b) = (b^a/Γ(a)) x^(a−1) e^(−bx) for x > 0.

The beta distribution. The beta distribution has support over the interval [0, 1] and is defined as Beta(x|a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b), where B(a, b) is the beta function: B(a, b) = Γ(a)Γ(b)/Γ(a + b).

The beta distribution. a = b = 1: the uniform distribution. a, b < 1: a bimodal distribution with spikes at 0 and 1. a, b > 1: a unimodal distribution.

Pareto distribution. The Pareto distribution is used to model the distribution of quantities that exhibit long tails, also called heavy tails. The Pareto pdf is Pareto(x|k, m) = k m^k x^(−(k+1)) for x ≥ m. The distribution has mode m and mean km/(k − 1) for k > 1.

Pareto distribution

Continuous probability distributions

Joint Probability Distributions

Covariance. A joint probability distribution has the form p(x_1, ..., x_D) for a set of D > 1 variables. The covariance between two rvs X and Y measures the degree to which X and Y are (linearly) related: cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X) E(Y). For a d-dimensional random vector x, its covariance matrix is cov(x) = E[(x − E(x))(x − E(x))^T].

Correlation. The (Pearson) correlation coefficient between X and Y is defined as corr(X, Y) = cov(X, Y)/√(var(X) var(Y)). For a d-dimensional random vector x, its correlation matrix is the matrix with entries corr(x_i, x_j).

Correlation. The correlation coefficient is a degree of linearity; it is not related to the slope of the regression line. The regression coefficient is cov(X, Y)/var(X). If X and Y are independent, meaning p(X, Y) = p(X) p(Y), then cov(X, Y) = 0 and hence corr(X, Y) = 0, so they are uncorrelated. However, the converse is not true: uncorrelated does not imply independent.

Correlation

The multivariate Gaussian. The pdf of the multivariate Gaussian or multivariate normal (MVN) in D dimensions is N(x|μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)^T Σ^(−1) (x − μ)). Here μ = E(x) ∈ R^D is the mean vector, and Σ = cov(x) is the D × D covariance matrix.
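A short sketch, with an invented μ and Σ, evaluating this pdf both via scipy.stats and directly from the formula:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])           # invented mean vector
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])      # invented covariance matrix

x = np.array([0.5, 0.5])
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))

# Same value computed directly from the MVN formula above.
D = len(mu)
diff = x - mu
val = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / \
      np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
print(val)
```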

2D Gaussians

Multivariate Student t distribution. The pdf of the multivariate Student t distribution in D dimensions is proportional to [1 + (1/ν)(x − μ)^T Σ^(−1) (x − μ)]^(−(ν+D)/2). The distribution has mean and mode μ, and covariance νΣ/(ν − 2) for ν > 2.

Dirichlet distribution. The Dirichlet distribution is a multivariate generalization of the beta distribution, with support over the probability simplex S_K = {x : 0 ≤ x_k ≤ 1, Σ_{k=1}^K x_k = 1}. The pdf is defined as Dir(x|α) = (1/B(α)) Π_{k=1}^K x_k^(α_k − 1) for x ∈ S_K. With α_0 = Σ_k α_k, the distribution has mean E(x_k) = α_k/α_0.

Transformations of Random Variables

Linear transformations. Suppose f() is a linear function: y = f(x) = Ax + b. Then E(y) = Aμ + b and cov(y) = AΣA^T, where μ = E(x) and Σ = cov(x). If f() is a scalar-valued function, f(x) = a^T x + b, then E(y) = a^T μ + b and var(y) = a^T Σ a.
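The following NumPy sketch, with an invented μ, Σ, A and b, checks both identities by simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 3.0]])
b = np.array([4.0, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A.T + b                                 # y = A x + b, applied row-wise

print(y.mean(axis=0), A @ mu + b)               # empirical mean vs. A mu + b
print(np.cov(y.T), A @ Sigma @ A.T, sep="\n")   # empirical cov vs. A Sigma A^T
```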

General transformations. If X is a discrete rv, we can derive the pmf of Y = f(X) by simply summing up the probability mass for all the x's such that f(x) = y: p_Y(y) = Σ_{x: f(x) = y} p_X(x). If X is continuous, we work with the cdf, P_Y(y) = p(f(X) ≤ y), and differentiate; for monotonic f this gives p_Y(y) = p_X(x) |dx/dy| with x = f^(−1)(y).

Multivariate change of variables. Let f be a function that maps R^n to R^n, and let y = f(x). Then its Jacobian matrix J_{x→y} has entries (J)_{ij} = ∂y_i/∂x_j. If f is an invertible mapping, we can define the pdf of the transformed variables using the Jacobian of the inverse mapping y → x: p_Y(y) = p_X(x) |det J_{y→x}|.

Central limit theorem. Now consider N random variables with pdfs (not necessarily Gaussian) p(x_i), each with mean μ and variance σ². We assume each variable is independent and identically distributed, or iid for short. Let S_N = Σ_{i=1}^N X_i be the sum of the rvs. One can show that, as N increases, the distribution of this sum approaches the Gaussian N(S_N | Nμ, Nσ²); equivalently, the standardized sum (S_N − Nμ)/(σ√N) approaches the standard normal.
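A minimal simulation sketch, summing uniform variables (chosen arbitrarily as a clearly non-Gaussian case):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100                         # number of iid terms in each sum
S = rng.uniform(0, 1, size=(50_000, N)).sum(axis=1)

mu, sigma2 = 0.5, 1.0 / 12.0    # mean and variance of Uniform(0, 1)
z = (S - N * mu) / np.sqrt(N * sigma2)
print(z.mean(), z.var())        # ≈ 0 and ≈ 1: standardized sums look N(0, 1)
```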

Central limit theorem

Monte Carlo Approximation

Monte Carlo approximation. In general, computing the distribution of a function of an rv using the change-of-variables formula can be difficult. One simple but powerful alternative is Monte Carlo approximation: first, generate S samples from the distribution, call them x_1, ..., x_S (for example, by Markov chain Monte Carlo, or MCMC); then approximate the distribution of f(X) by the empirical distribution of {f(x_s)}_{s=1}^S.

Monte Carlo approximation. By varying the function f(), we can approximate many quantities of interest, such as the mean E[f(X)] ≈ (1/S) Σ_{s=1}^S f(x_s), as well as variances, quantiles, and event probabilities.
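For instance, the following sketch estimates several quantities for f(X) = X² with X ~ Uniform(−1, 1), a case chosen because every answer can be checked analytically:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=1_000_000)   # S samples from p(x) = Uniform(-1, 1)
y = x ** 2                               # f(x) = x^2

print(y.mean())              # E[f(X)] = 1/3
print(y.var())               # var[f(X)] = 4/45 ≈ 0.089
print(np.quantile(y, 0.5))   # median of f(X) = 0.25
print((y < 0.25).mean())     # p(f(X) < 0.25) = p(|X| < 0.5) = 0.5
```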

Monte Carlo approximation

Some Concepts of Information Theory

Entropy. The entropy of a random variable X with distribution p is H(X) = −Σ_x p(x) log₂ p(x). For a binary random variable with p(X = 1) = θ, we have H(θ) = −[θ log₂ θ + (1 − θ) log₂(1 − θ)]; this is called the binary entropy function.
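A small Python sketch of the entropy in bits, with a few hand-checkable test distributions:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution p (zero entries contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log2(nz))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform over 4 states
print(entropy([0.5, 0.5]))                 # 1.0 bit: binary entropy peaks at theta = 0.5
print(entropy([0.9, 0.1]))                 # ≈ 0.469 bits: less uncertain
```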

Entropy

KL divergence. KL(p‖q) = Σ_x p(x) log(p(x)/q(x)) is the average number of extra bits needed to encode data drawn from p when we use a code based on q instead of p.

Why mutual information? Often, we want to know something about a variable Y from another variable X. Correlation can measure the relationship between two variables, but it is defined only for real-valued variables. Furthermore, it cannot fully describe independence between two variables: independent implies uncorrelated, but uncorrelated does not imply independent.

Mutual information. For two rvs X and Y, the MI is defined as I(X; Y) = KL(p(X, Y)‖p(X) p(Y)) = Σ_x Σ_y p(x, y) log(p(x, y)/(p(x) p(y))). Equivalently, I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), where H(X|Y) is the conditional entropy. We can show that I(X; Y) ≥ 0, with equality iff p(X, Y) = p(X) p(Y). MI can thus be read as the reduction in uncertainty about X after observing Y.
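A sketch computing MI from a joint table via the KL form above; the two test tables are invented edge cases:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits, computed as KL(p(X,Y) || p(X) p(Y))."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    ratio = p_xy / (p_x * p_y)
    nz = p_xy > 0                        # zero-probability cells contribute 0
    return np.sum(p_xy[nz] * np.log2(ratio[nz]))

# Independent case: joint = outer product of marginals -> MI = 0.
print(mutual_information(np.array([[0.25, 0.25], [0.25, 0.25]])))   # 0.0
# Perfectly dependent case: X = Y -> MI = H(X) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))       # 1.0
```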

Pointwise mutual information. For two events (not random variables) x and y, PMI is defined as PMI(x, y) = log(p(x, y)/(p(x) p(y))). PMI measures the discrepancy between how often the events occur together and what would be expected by chance. The MI of X and Y is just the expected value of the PMI.

Two Examples

Example 1: medical diagnosis (rare diseases). Let y = 1 denote breast cancer and y = 0 healthy; let x = 1 denote a positive test and x = 0 a negative test. The relevant quantities are the prior p(y = 1), the test characteristics p(x = 1 | y = 1) and p(x = 1 | y = 0) (with complements p(x = 0 | y = 1) and p(x = 0 | y = 0)), and the evidence p(x = 1).

If Jenny tests positive for breast cancer, what is the probability that she really has breast cancer, p(y = 1 | x = 1)?

With p(y = 1) = 0.004, p(y = 0) = 0.996, p(x = 1 | y = 1) = 0.8, and p(x = 1 | y = 0) = 0.1, the evidence is p(x = 1) = 0.004 × 0.8 + 0.996 × 0.1 = 0.1028. By Bayes rule, p(y = 1 | x = 1) = (0.8 × 0.004)/0.1028 ≈ 0.031. A positive test should therefore be interpreted carefully for rare diseases.
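The same computation in a few lines of Python, using the numbers from this slide:

```python
# Bayes rule for the rare-disease example.
p_y1 = 0.004       # prior: p(cancer)
p_y0 = 0.996       # prior: p(healthy)
p_x1_y1 = 0.8      # sensitivity: p(positive | cancer)
p_x1_y0 = 0.1      # false-positive rate: p(positive | healthy)

p_x1 = p_y1 * p_x1_y1 + p_y0 * p_x1_y0   # evidence: p(positive) = 0.1028
p_y1_x1 = p_y1 * p_x1_y1 / p_x1          # posterior: p(cancer | positive)
print(p_y1_x1)                           # ≈ 0.031: only ~3% despite a positive test
```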

Example 2: German tank problem. 1. Frequentist statistics. Sample k serial numbers; the largest is m. The probability of this event is P(largest = m) = C(m − 1, k − 1)/C(N, k), and the expected maximum is E(largest) = Σ_{m=k}^N m C(m − 1, k − 1)/C(N, k) = k(N + 1)/(k + 1) = μ. Solving for N gives N = μ(1 + 1/k) − 1, so replacing μ by the observed maximum yields the estimator N̂ = m(1 + 1/k) − 1. With k = 4 samples {2, 6, 7, 14}, we have m = 14 and N̂ = 14 × 1.25 − 1 = 16.5.
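A simulation sketch, with an arbitrarily chosen true N, showing that this estimator is unbiased:

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 1_000, 4                                # true tank count, sample size

estimates = []
for _ in range(100_000):
    m = rng.choice(np.arange(1, N + 1), size=k, replace=False).max()
    estimates.append(m * (1 + 1 / k) - 1)      # frequentist estimator m(1 + 1/k) - 1

print(np.mean(estimates))                      # ≈ 1000: the estimator is unbiased
```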

German tank problem. 2. Bayesian statistics. The Bayesian approach considers the credibility that the total number of enemy tanks N equals n, given that the total number of observed tanks K equals k and the maximum observed serial number M equals m: (N = n | M = m, K = k), abbreviated (n | m, k). By conditional probability, (n | m, k) = (m | n, k)(n | k)/(m | k). The probability that the maximum serial number among k observed tanks equals m, given that the total number of tanks is known to be n, is (m | n, k) = C(m − 1, k − 1)/C(n, k) for k ≤ m ≤ n.

German tank problem (Bayesian statistics)

German tank problem (Bayesian statistics). https://en.wikipedia.org/wiki/German_tank_problem

German tank problem. Suppose an intelligence officer has spotted k = 4 tanks with serial numbers 2, 6, 7, 14, so the maximum observed serial number is m = 14. The unknown total number of tanks is denoted N.

German tank problem. According to conventional Allied intelligence estimates, Germany was producing around 1,400 tanks a month between June 1940 and September 1942. Plugging the serial numbers of captured tanks into the formula gave a monthly estimate of 246. After the war, captured German production records from the ministry of Albert Speer showed the actual figure to be 245. Estimates for some specific months are as follows:

The end. Assignment: read Chapter 2 of the Murphy book.