Introduction to Probability and Statistics (Continued)
1 Introduction to Probability and Statistics (Continued) Prof. Nicholas Zabaras, Center for Informatics and Computational Science, University of Notre Dame, Notre Dame, Indiana, USA. URL: https://cics.nd.edu/ August 3, 2018
2 Contents Sum and product rules of probability; Bayes' rule; conditional probability; independence and conditional independence; the conditional independence theorem; random variables; CDF, mean and variance; the uniform and Gaussian distributions; the binomial and Bernoulli distributions; the multinomial and multinoulli distributions
3 References Following closely Chris Bishop's PRML book, Chapter 2, and Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 2. Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific. Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
4 Joint Probability Probability theory provides a consistent framework for the quantification and manipulation of uncertainty. It forms one of the central foundations for pattern recognition and machine learning. The probability that event X takes the value $x_i$ and event Y takes the value $y_j$ is written $P(X = x_i, Y = y_j)$ and is called the joint probability of $X = x_i$ and $Y = y_j$. In terms of counts, $P(X = x_i, Y = y_j) = \frac{n_{ij}}{N}$, where $n_{ij}$ is the number of times (in N trials) that the event $X = x_i, Y = y_j$ occurs. Similarly, $c_i$ is the number of times that $X = x_i$ occurs.
5 The Sum and Product Rules Even complex calculations in probability are simply derived from the sum and product rules of probability. Sum rule: $P(X = x_i) = \sum_{j=1}^{L} P(X = x_i, Y = y_j)$. Product rule: $P(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i}\,\frac{c_i}{N} = P(Y = y_j \mid X = x_i)\, P(X = x_i)$. The product rule leads to the chain rule: $P(X_{1:D}) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_D \mid X_{1:D-1})$.
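The sum and product rules above can be checked numerically. The following Python sketch (the lecture's demos use MATLAB/PMTK; this is an illustrative version with made-up counts $n_{ij}$) builds a joint table from counts and verifies both rules:

```python
import numpy as np

# Hypothetical 3x2 count table n_ij for N trials (made-up numbers).
n = np.array([[10, 30],
              [20, 15],
              [ 5, 20]], dtype=float)
N = n.sum()
P_xy = n / N                       # joint: P(X = x_i, Y = y_j) = n_ij / N

# Sum rule: marginal P(X = x_i) = sum_j P(X = x_i, Y = y_j)
P_x = P_xy.sum(axis=1)

# Product rule: P(X, Y) = P(Y | X) P(X), with P(Y = y_j | X = x_i) = n_ij / c_i
c = n.sum(axis=1)                  # c_i = number of times X = x_i occurs
P_y_given_x = n / c[:, None]
reconstructed = P_y_given_x * P_x[:, None]

assert np.allclose(reconstructed, P_xy)
assert abs(P_x.sum() - 1.0) < 1e-12
```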
6 Conditional Probability and Bayes' Rule $P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{P(X = x)\, P(Y = y \mid X = x)}{\sum_{x'} P(X = x')\, P(Y = y \mid X = x')}$. Bayes' theorem plays a central role in pattern recognition and machine learning. The normalizing factor P(Y) is given by the sum rule: $P(Y = y) = \sum_{x'} P(X = x', Y = y) = \sum_{x'} P(X = x')\, P(Y = y \mid X = x')$.
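A short numeric sketch of Bayes' rule (a hypothetical binary example with made-up prior and likelihood values, not from the slides): X is a binary hypothesis and Y a binary observation; the normalizer is computed with the sum rule exactly as above:

```python
# Hypothetical binary example: X = hypothesis true/false, Y = positive observation.
p_x = {True: 0.01, False: 0.99}            # prior P(X) (made-up)
p_y_given_x = {True: 0.95, False: 0.05}    # likelihood P(Y = pos | X) (made-up)

# Normalizer via the sum rule: P(Y = pos) = sum_x' P(X = x') P(Y = pos | X = x')
p_y = sum(p_x[x] * p_y_given_x[x] for x in p_x)

# Bayes' rule: posterior P(X = true | Y = pos)
posterior = p_x[True] * p_y_given_x[True] / p_y
```

Note how the small prior keeps the posterior well below the likelihood 0.95.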
7 Example: Generative Classifier $P(y = c \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{P(y = c \mid \boldsymbol{\theta})\, P(\mathbf{x} \mid y = c, \boldsymbol{\theta})}{\sum_{c'} P(y = c' \mid \boldsymbol{\theta})\, P(\mathbf{x} \mid y = c', \boldsymbol{\theta})}$. Here we classify feature vectors x to classes using the above posterior. It is a generative classifier, as it specifies how to generate data x using the class-conditional probabilities $P(\mathbf{x} \mid y = c, \boldsymbol{\theta})$ and class priors $P(y = c \mid \boldsymbol{\theta})$. θ are (unknown) parameters defining these distributions. In a discriminative setting, the posterior $P(y = c \mid \mathbf{x})$ is directly fitted.
8 Independence, Conditional Probability Two events A and B are independent (written as $A \perp B$) if $P(A \cap B) = P(A)\, P(B)$. Conditional probability: the probability that A happens provided that B happens, $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Note that $P(A \mid B) \neq P(A \cap B)$. Using the above equations, we see that for independent events, $P(A \mid B) = P(A)$.
9 Independence, Conditional Independence X and Y are unconditionally independent or marginally independent, denoted $X \perp Y$, if we can represent the joint as the product of the two marginals: $X \perp Y \iff P(X, Y) = P(X)\, P(Y)$. X and Y are conditionally independent (CI) given Z iff the conditional joint can be written as a product of conditional marginals: $X \perp Y \mid Z \iff P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$.
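Conditional independence can be illustrated with a small table. The sketch below (hypothetical numbers, not from the slides) constructs a joint $p(x, y, z) = p(z)\,p(x \mid z)\,p(y \mid z)$, which satisfies $X \perp Y \mid Z$ by construction, and checks that X and Y are nonetheless *not* marginally independent:

```python
import numpy as np

# Hypothetical distributions (made-up numbers), indexed [z, x] and [z, y].
p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.3, 0.7], [0.8, 0.2]])
p_y_given_z = np.array([[0.5, 0.5], [0.1, 0.9]])
# Joint p(z, x, y) = p(z) p(x|z) p(y|z)  =>  X ⟂ Y | Z by construction
p_zxy = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

# Check CI: for each z, p(x, y | z) equals the outer product of its marginals
p_xy_given_z = p_zxy / p_zxy.sum(axis=(1, 2), keepdims=True)
for z in range(2):
    outer = np.outer(p_xy_given_z[z].sum(axis=1), p_xy_given_z[z].sum(axis=0))
    assert np.allclose(p_xy_given_z[z], outer)

# But marginally, p(x, y) = sum_z p(z, x, y) does NOT factorize
p_xy = p_zxy.sum(axis=0)
assert not np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))
```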
10 Pairwise vs. Mutual Independence Pairwise independence does not imply mutual independence. Consider 4 balls (numbered 1, 2, 3, 4) in a box. You draw one at random. Define the following events: $X_1$: ball 1 or 2 is drawn; $X_2$: ball 2 or 3 is drawn; $X_3$: ball 1 or 3 is drawn. Note that the three events are pairwise independent, e.g. $P(X_1, X_2) = \frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2} = P(X_1)\, P(X_2)$. However: $P(X_1, X_2, X_3) = 0 \neq \frac{1}{8} = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = P(X_1)\, P(X_2)\, P(X_3)$.
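The four-ball example can be verified by direct enumeration. A minimal Python sketch using exact rational arithmetic:

```python
from fractions import Fraction

balls = {1, 2, 3, 4}                       # equally likely draws
X1, X2, X3 = {1, 2}, {2, 3}, {1, 3}        # the three events from the slide

def P(event):
    """Probability of an event under a uniform draw from the box."""
    return Fraction(len(event & balls), len(balls))

# Pairwise independence: P(Xi ∩ Xj) = P(Xi) P(Xj) = 1/4 for every pair
assert P(X1 & X2) == P(X1) * P(X2) == Fraction(1, 4)
assert P(X1 & X3) == P(X1) * P(X3) == Fraction(1, 4)
assert P(X2 & X3) == P(X2) * P(X3) == Fraction(1, 4)

# But not mutual independence: the triple intersection is empty
assert P(X1 & X2 & X3) == 0
assert P(X1) * P(X2) * P(X3) == Fraction(1, 8)
```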
11 Independence, Conditional Independence Consider the following example. Define: event a = 'it will rain tomorrow', event b = 'the ground is wet today', and event c = 'raining today'. c causes both a and b, thus given c we don't need to know about b to predict a: $a \perp b \mid c \iff P(a, b \mid c) = P(a \mid c)\, P(b \mid c)$, so that $P(a \mid b, c) = P(a \mid c)$. Observing a root node separates the children!
12 Independence, Conditional Independence Assume unconditional independence $X \perp Y$. Let us assume that X takes 6 values and Y takes 5 values. The cost of defining p(X, Y) is drastically reduced if $X \perp Y$: the parameters are reduced from 29 (= 30 − 1) to 9 = (6 − 1) + (5 − 1), where we subtract one from each count to account for the sum-to-one probability constraint. Independence is key to efficient probabilistic modeling (naïve Bayes classifiers, Markov models, graphical models, etc.).
13 Conditional Independence Theorem $X \perp Y \mid Z$ iff there exist functions g and h such that $P(x, y \mid z) = g(x, z)\, h(y, z)$ for all x, y, z such that $P(z) > 0$. Independence implies the factorization: $X \perp Y \mid Z \Rightarrow P(x, y \mid z) = P(x \mid z)\, P(y \mid z) = g(x, z)\, h(y, z)$. Factorization implies independence: from $P(x, y \mid z) = g(x, z)\, h(y, z)$, summing over x and y gives $1 = \sum_{x,y} P(x, y \mid z) = \left(\sum_x g(x, z)\right)\left(\sum_y h(y, z)\right)$ (1); marginalizing over y gives $P(x \mid z) = g(x, z) \sum_y h(y, z)$ (2); marginalizing over x gives $P(y \mid z) = h(y, z) \sum_x g(x, z)$ (3). Combining (2) and (3): $P(x \mid z)\, P(y \mid z) = g(x, z)\, h(y, z) \left(\sum_y h(y, z)\right)\left(\sum_x g(x, z)\right) = g(x, z)\, h(y, z) = P(x, y \mid z)$, where the last step uses (1).
14 Independence Relations: Decomposition The following decomposition independence relation holds: $X \perp Y, W \mid Z \Rightarrow X \perp Y \mid Z$. Proof: $X \perp Y, W \mid Z \Rightarrow P(X, Y, W \mid Z) = P(X \mid Z)\, P(Y, W \mid Z)$ (independence). Then $P(X, Y \mid Z) = \sum_w P(X, Y, w \mid Z) = \sum_w P(X \mid Z)\, P(Y, w \mid Z) = P(X \mid Z)\, P(Y \mid Z)$ (sum rule), i.e. $X \perp Y \mid Z$.
15 Independence Relations: Weak Union The following weak union independence relation holds: $X \perp Y, W \mid Z \Rightarrow X \perp Y \mid Z, W$. Proof: $X \perp Y, W \mid Z \Rightarrow P(X, Y, W \mid Z) = P(X \mid Z)\, P(Y, W \mid Z)$. Then $P(X, Y \mid Z, W) = \frac{P(X, Y, W \mid Z)}{P(W \mid Z)} = \frac{P(X \mid Z)\, P(Y, W \mid Z)}{P(W \mid Z)} = P(X \mid Z)\, P(Y \mid Z, W) = P(X \mid Z, W)\, P(Y \mid Z, W)$, i.e. $X \perp Y \mid Z, W$. Note that in the last step we used the decomposition independence relation: $X \perp Y, W \mid Z \Rightarrow X \perp W \mid Z \Rightarrow P(X, W \mid Z) = P(X \mid Z)\, P(W \mid Z)$, so that $P(X \mid Z, W) = \frac{P(X, W \mid Z)}{P(W \mid Z)} = P(X \mid Z)$.
16 Independence Relations: Contraction The following contraction independence relation holds: $X \perp W \mid Z, Y$ and $X \perp Y \mid Z \Rightarrow X \perp Y, W \mid Z$. Proof: $X \perp W \mid Y, Z \Rightarrow P(X, W \mid Y, Z) = P(X \mid Y, Z)\, P(W \mid Y, Z)$ (a); $X \perp Y \mid Z \Rightarrow P(X \mid Y, Z) = P(X \mid Z)$ (b). Then $P(X, Y, W \mid Z) = P(X, W \mid Y, Z)\, P(Y \mid Z) \overset{(a)}{=} P(X \mid Y, Z)\, P(W \mid Y, Z)\, P(Y \mid Z) \overset{(b)}{=} P(X \mid Z)\, P(W, Y \mid Z)$, i.e. $X \perp W, Y \mid Z$.
17 Random Variables Define Ω to be a probability space equipped with a probability measure P that measures the probability of events E. Ω contains all possible events in the form of its own subsets. A real-valued random variable X is a mapping $X: \Omega \to \mathbb{R}$. We call $x = X(\omega)$, $\omega \in \Omega$, a realization of X. Probability distribution of X: for $B \subset \mathbb{R}$, $P_X(B) = P\!\left(X^{-1}(B)\right) = P\!\left(\{\omega : X(\omega) \in B\}\right)$. Probability density: $P_X(B) = \int_B p_X(x)\, dx$. We often write $p_X(x) = p(x)$.
18 Cumulative Distribution Function $p(x) \ge 0$, $\int_{-\infty}^{\infty} p(x)\, dx = 1$; $F(z) = \int_{-\infty}^{z} p(x)\, dx$ (cumulative distribution function); $P\!\left(x \in (a, b)\right) = \int_a^b p(x)\, dx = F(b) - F(a)$. The CDF for a random variable X is the function F(x) that returns the probability that X is less than x.
19 Mean and Variance The expected (mean) value of X is: $\mathbb{E}[X] = \int x\, p_X(x)\, dx$. The variance of X is defined as: $\mathrm{var}[X] = \mathbb{E}\!\left[(X - \mathbb{E}[X])^2\right] = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$. The often useful standard deviation is defined as $\mathrm{std}[X] = \sqrt{\mathrm{var}[X]}$.
20 Mean and Variance If T(X) is a function of X (called a statistic), the mean of T(X) is given by $\mathbb{E}[T(X)] = \int T(x)\, p_X(x)\, dx$. The variance of T(X) is also defined as: $\mathrm{var}[T(X)] = \mathbb{E}\!\left[(T(X) - \mathbb{E}[T(X)])^2\right] = \mathbb{E}[T(X)^2] - \left(\mathbb{E}[T(X)]\right)^2$. The k-th central moment is defined as (k = 3: skewness, k = 4: kurtosis): $\mu_k[X] = \int (x - \mathbb{E}[X])^k\, p_X(x)\, dx$.
21 Expectation For the discrete case: $\mathbb{E}[f] = \sum_x p(x)\, f(x)$. For the continuous case: $\mathbb{E}[f] = \int p(x)\, f(x)\, dx$. Conditional expectation: $\mathbb{E}[f \mid y] = \sum_x p(x \mid y)\, f(x)$ in the discrete case, or $\mathbb{E}[f \mid y] = \int p(x \mid y)\, f(x)\, dx$ in the continuous case.
22 Expectation The expectation of a random variable is not necessarily a value that we should expect a realization to have. Let $\Omega = [-1, 1]$ with the probability P being uniform: $P(I) = \frac{1}{2}\int_I dx = \frac{|I|}{2}$, $I \subset [-1, 1]$. Consider the following two random variables: $X_1: [-1, 1] \to \mathbb{R}$ with $X_1(\omega) = 1$ for $\omega \ge 0$ and $X_1(\omega) = -1$ for $\omega < 0$; and $X_2: [-1, 1] \to \mathbb{R}$ with $X_2(\omega) = \omega$. We can see that $\mathbb{E}[X_1] = \mathbb{E}[X_2] = 0$, although $|X_1(\omega)| = 1$ for every realization, so $X_1$ never takes its expected value.
23 Uniform Random Variable Consider the uniform distribution $U(x \mid a, b) = \frac{1}{b-a}\, \mathbb{1}(a \le x \le b)$. What is the CDF of a uniform random variable U(0, 1)? $P(U \le x) = x$ for $0 \le x \le 1$, $= 1$ for $x > 1$, and $= 0$ otherwise. You can show easily that the moments of $U(x \mid a, b)$ are: $\mathbb{E}[x \mid a, b] = \frac{a+b}{2}$, $\mathbb{E}[x^2 \mid a, b] = \frac{a^2 + ab + b^2}{3}$, $\mathrm{var}[x \mid a, b] = \frac{(b-a)^2}{12}$. Note that it is possible for p(x) > 1, but the density still needs to integrate to 1. For example, note that $U(x \mid 0, 1/2) = 2 \cdot \mathbb{1}(0 \le x \le 1/2)$.
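The quoted moments of $U(a, b)$ can be sanity-checked by sampling. A minimal Python sketch (illustrative endpoints a = 2, b = 5, chosen arbitrarily):

```python
import numpy as np

a, b = 2.0, 5.0
rng = np.random.default_rng(0)
x = rng.uniform(a, b, size=200_000)

# Closed-form moments of U(a, b) from the slide
mean, var = (a + b) / 2, (b - a) ** 2 / 12
assert abs(x.mean() - mean) < 0.02     # sample mean ≈ (a + b)/2 = 3.5
assert abs(x.var() - var) < 0.02       # sample variance ≈ (b − a)²/12 = 0.75

# A density may exceed 1: U(x | 0, 1/2) has height 2, yet area 2 × 1/2 = 1
height = 1 / (0.5 - 0.0)
assert height * 0.5 == 1.0
```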
24 The Gaussian Distribution A random variable X is Gaussian or normally distributed, $X \sim \mathcal{N}(\mu, \sigma^2)$, if: $P(X \le t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx$. The probability density is: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. To show that this density is normalized, note the following trick. Let $I = \int_{-\infty}^{\infty} \exp\!\left(-\frac{x^2}{2\sigma^2}\right) dx$; then $I^2 = \int\!\!\int \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right) dx\, dy$. Set $r^2 = x^2 + y^2$; then $I^2 = \int_0^{2\pi}\!\!\int_0^{\infty} \exp\!\left(-\frac{r^2}{2\sigma^2}\right) r\, dr\, d\theta = 2\pi \int_0^{\infty} \exp\!\left(-\frac{r^2}{2\sigma^2}\right) r\, dr = 2\pi\sigma^2 \int_0^{\infty} e^{-u}\, du = 2\pi\sigma^2$, substituting $u = \frac{r^2}{2\sigma^2}$. Thus $I = \sqrt{2\pi\sigma^2}$.
25 The Gaussian Distribution A random variable X is Gaussian or normally distributed, $X \sim \mathcal{N}(\mu, \sigma^2)$, if its density is $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The following can be shown easily with direct integration: $\mathbb{E}[X] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) x\, dx = \mu$, $\mathrm{var}[X] = \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) (x-\mu)^2\, dx = \sigma^2$. The following integrals are useful in these derivations: $\int_{-\infty}^{\infty} e^{-u^2}\, du = \sqrt{\pi}$, $\int_{-\infty}^{\infty} u\, e^{-u^2}\, du = 0$, $\int_{-\infty}^{\infty} u^2 e^{-u^2}\, du = \frac{\sqrt{\pi}}{2}$. We often work with the precision of a Gaussian, λ = 1/σ². The higher the λ, the narrower the distribution is.
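The normalization, mean, and variance integrals above can be verified by numerical quadrature. A minimal Python sketch with illustrative parameters (μ = 1.5, σ = 0.7, chosen arbitrarily), using a simple trapezoid rule:

```python
import numpy as np

def trapz(f, x):
    """Trapezoid-rule integral of sampled values f over grid x."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

mu, sigma = 1.5, 0.7
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

norm = trapz(pdf, x)                     # ∫ N(x|μ,σ²) dx = 1
mean = trapz(x * pdf, x)                 # = μ
var = trapz((x - mean) ** 2 * pdf, x)    # = σ²

assert abs(norm - 1) < 1e-6
assert abs(mean - mu) < 1e-6
assert abs(var - sigma ** 2) < 1e-6
```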
26 CDF of a Gaussian Plot of the standard normal $\mathcal{N}(0, 1)$ PDF and CDF (run MATLAB function gaussplotDemo from Kevin Murphy's PMTK). $F(x; \mu, \sigma^2) = \int_{-\infty}^{x} \mathcal{N}(z \mid \mu, \sigma^2)\, dz = \frac{1}{2}\!\left[1 + \mathrm{erf}\!\left(z/\sqrt{2}\right)\right]$, where $z = (x - \mu)/\sigma$ and $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$.
27 CDF and the Standard Normal Assume $X \sim \mathcal{N}(0, 1)$. Define Φ(x) = P(X < x), the cumulative distribution of X: $\Phi(x) = \int_{-\infty}^{x} p(z)\, dz = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) dz$. Now assume $X \sim \mathcal{N}(\mu, \sigma^2)$. Then $F(x) = P(X < x \mid \mu, \sigma^2) = \Phi\!\left(\frac{x - \mu}{\sigma}\right)$.
28 Univariate Gaussian $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, with $\mathbb{E}[X] = \mu$ and $\mathrm{var}[X] = \sigma^2$. Shorthand: we say $X \sim \mathcal{N}(\mu, \sigma^2)$ to mean X is distributed as a Gaussian with parameters μ and σ².
29 Univariate Gaussian $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The Gaussian represents symmetric phenomena without long tails; it is inappropriate for skewness, fat tails, multimodality, etc. The popularity of the Gaussian is related to facts such as: it is completely defined in terms of the mean and the variance; the central limit theorem shows that the sum of i.i.d. random variables has approximately a Gaussian distribution, making it an appropriate choice for modeling noise (limit of additive small effects); and the Gaussian distribution makes the least assumptions (maximum entropy) of all distributions with given mean and variance.
30 The Normal Model The normal (or Gaussian) distribution $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ is one of the most studied and one of the most used distributions. Sample $x_1, \ldots, x_n$ from a normal distribution $\mathcal{N}(\mu, \sigma^2)$. In a typical inference problem, we are interested in: estimation of (μ, σ); confidence regions on (μ, σ); tests on (μ, σ) and comparison with other samples.
31 Datasets: Normal Data NormalData: relative changes in reported larcenies between 1991 and 1995 (relative to 1991) for the 90 most populous US counties (source: FBI). MATLAB implementation. From Bayesian Core, J.-M. Marin and C.P. Robert, Chapter 2 (available online).
32 Datasets: CMBData CMBdata: spectral representation of the cosmological microwave background (CMB), i.e. electromagnetic radiation from photons dating back to 300,000 years after the Big Bang, expressed as the difference in apparent temperature from the mean temperature. MATLAB implementation, normal estimation. From Bayesian Core, J.-M. Marin and C.P. Robert, Chapter 2 (available online). MLE-based estimates of μ and σ² are used in constructing the Gaussian shown with the solid line. Maximum likelihood estimators (MLE) will be discussed in a follow-up lecture.
33 Quantiles Recall that the probability density function $p_X(\cdot)$ of a random variable X is defined as the derivative of the cumulative distribution function, so that $F(x_0) = \int_{-\infty}^{x_0} p_X(x)\, dx$ and $\int_{-\infty}^{\infty} p_X(x)\, dx = 1$. The value $y_\alpha$ such that $F(y_\alpha) = \alpha$ is called the α-quantile of the distribution with CDF F. The median of course satisfies $F(y_{0.5}) = 0.5$. One can define tail area probabilities: the shaded regions of the figure each contain α/2 of the probability mass. For $\mathcal{N}(0, 1)$, the leftmost cutoff point is $\Phi^{-1}(\alpha/2)$, where Φ is the CDF of $\mathcal{N}(0, 1)$. If α = 0.05, the central interval is 95%, the left cutoff is −1.96 and the right is +1.96 (figure generated by quantileDemo from PMTK). For $\mathcal{N}(\mu, \sigma^2)$, the 95% interval becomes $(\mu - 1.96\sigma,\ \mu + 1.96\sigma)$, often approximated as $\mu \pm 2\sigma$.
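The ±1.96 cutoffs can be reproduced with the standard library's normal distribution (the slides use PMTK's quantileDemo in MATLAB; this is an illustrative Python sketch with arbitrary μ and σ):

```python
from statistics import NormalDist

std = NormalDist()                       # standard normal N(0, 1)
alpha = 0.05
lo = std.inv_cdf(alpha / 2)              # Φ⁻¹(α/2) ≈ −1.96
hi = std.inv_cdf(1 - alpha / 2)          # Φ⁻¹(1 − α/2) ≈ +1.96
assert abs(lo + hi) < 1e-9               # symmetric cutoffs
assert abs(hi - 1.959964) < 1e-4

# For N(μ, σ²), the central 95% interval is (μ − 1.96σ, μ + 1.96σ)
mu, sigma = 10.0, 2.0                    # illustrative parameters
d = NormalDist(mu, sigma)
assert abs(d.inv_cdf(1 - alpha / 2) - (mu + hi * sigma)) < 1e-9

# The median is the 0.5-quantile: F(y_0.5) = 0.5
assert abs(d.inv_cdf(0.5) - mu) < 1e-12
```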
34 Binary Variables Consider a coin flipping experiment with heads = 1 and tails = 0. With $\mu \in [0, 1]$: $p(x = 1 \mid \mu) = \mu$, $p(x = 0 \mid \mu) = 1 - \mu$. This defines the Bernoulli distribution as follows: $\mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1-x}$. Using the indicator function, we can also write this as: $\mathrm{Bern}(x \mid \mu) = \mu^{\mathbb{1}(x=1)} (1 - \mu)^{\mathbb{1}(x=0)}$.
35 Bernoulli Distribution Recall that in general $\mathbb{E}[f] = \sum_x p(x)\, f(x)$ or $\mathbb{E}[f] = \int p(x)\, f(x)\, dx$, and $\mathrm{var}[f] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$. For the Bernoulli distribution $\mathrm{Bern}(x \mid \mu) = \mu^x (1-\mu)^{1-x}$, we can easily show from the definitions: $\mathbb{E}[x] = \mu$, $\mathrm{var}[x] = \mu(1 - \mu)$, and $H[x] = -\sum_{x \in \{0,1\}} p(x \mid \mu) \ln p(x \mid \mu) = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu)$. Here H[x] is the entropy of the distribution.
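The Bernoulli mean, variance, and entropy formulas can be checked directly from the definitions. A minimal Python sketch (μ = 0.3 is an arbitrary illustrative value):

```python
import math

mu = 0.3
pmf = {1: mu, 0: 1 - mu}                  # Bern(x|μ) = μ^x (1 − μ)^(1−x)

mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())
H = -sum(p * math.log(p) for p in pmf.values())   # entropy in nats

assert abs(mean - mu) < 1e-12
assert abs(var - mu * (1 - mu)) < 1e-12
assert abs(H - (-mu * math.log(mu) - (1 - mu) * math.log(1 - mu))) < 1e-12
```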
36 Likelihood Function for Bernoulli Distribution Consider the data set $\mathcal{D} = \{x_1, x_2, \ldots, x_N\}$ in which we have m heads (x = 1) and N − m tails (x = 0). The likelihood function takes the form: $p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n} (1 - \mu)^{1 - x_n} = \mu^m (1 - \mu)^{N - m}$.
37 Binomial Distribution Consider the discrete random variable $X \in \{0, 1, 2, \ldots, N\}$. We define the binomial distribution as follows: $\mathrm{Bin}(X = m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$. In our coin flipping experiment, it gives the probability in N flips of getting m heads, with μ being the probability of getting heads in one flip. It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as $N \to \infty$, $N\mu \to \lambda$ is the Poisson(λ) distribution. Figure: the binomial distribution $\mathrm{Bin}(m \mid 10, 0.25)$ (MATLAB code).
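Both the binomial pmf and its Poisson limit can be sketched in a few lines. The following Python snippet (illustrative values; the λ and N here are chosen arbitrarily for the limit check) verifies normalization and compares $\mathrm{Bin}(m \mid N, \lambda/N)$ against Poisson(λ) for large N:

```python
from math import comb, exp, factorial

def binom_pmf(m, N, mu):
    """Bin(m | N, μ) = C(N, m) μ^m (1 − μ)^(N − m)."""
    return comb(N, m) * mu ** m * (1 - mu) ** (N - m)

# Normalization: the pmf sums to 1
N, mu = 10, 0.25
assert abs(sum(binom_pmf(m, N, mu) for m in range(N + 1)) - 1) < 1e-12

# Poisson limit: N → ∞ with Nμ = λ fixed
lam, bigN = 2.0, 20_000
poisson = lambda m: exp(-lam) * lam ** m / factorial(m)
for m in range(6):
    assert abs(binom_pmf(m, bigN, lam / bigN) - poisson(m)) < 1e-3
```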
38 Binomial Distribution The binomial distribution $\mathrm{Bin}(m \mid N, \mu)$ for N = 10 and μ = 0.25, 0.9 is shown below, using MATLAB function binomDistPlot from Kevin Murphy's PMTK.
39 Mean, Variance of the Binomial Distribution For independent events, the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. Because $m = x_1 + \ldots + x_N$, and for each observation the mean and variance are known from the Bernoulli distribution: $\mathbb{E}[m] = \sum_{m=0}^{N} m\, \mathrm{Bin}(m \mid N, \mu) = \mathbb{E}[x_1 + \ldots + x_N] = N\mu$, and $\mathrm{var}[m] = \sum_{m=0}^{N} (m - \mathbb{E}[m])^2\, \mathrm{Bin}(m \mid N, \mu) = \mathrm{var}[x_1 + \ldots + x_N] = N\mu(1 - \mu)$. One can also compute $\mathbb{E}[m]$ and $\mathbb{E}[m^2]$ by differentiating (twice) the identity $\sum_{m=0}^{N} \binom{N}{m} \mu^m (1 - \mu)^{N - m} = 1$ with respect to μ. Try it!
40 Binomial Distribution: Normalization To show that the binomial is correctly normalized, we use the following identities: $\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m}$ (*), which can be shown with direct substitution, and the binomial theorem $(1 + x)^N = \sum_{m=0}^{N} \binom{N}{m} x^m$ (**). The binomial theorem is proved by induction using (*) and noticing: $(1+x)^{N+1} = (1+x) \sum_{m=0}^{N} \binom{N}{m} x^m = \sum_{m=0}^{N} \binom{N}{m} x^m + \sum_{m=1}^{N+1} \binom{N}{m-1} x^m = 1 + \sum_{m=1}^{N} \left[\binom{N}{m} + \binom{N}{m-1}\right] x^m + x^{N+1} \overset{(*)}{=} \sum_{m=0}^{N+1} \binom{N+1}{m} x^m$. To finally show normalization using (**): $\sum_{m=0}^{N} \binom{N}{m} \mu^m (1-\mu)^{N-m} = (1-\mu)^N \sum_{m=0}^{N} \binom{N}{m} \left(\frac{\mu}{1-\mu}\right)^m = (1-\mu)^N \left(1 + \frac{\mu}{1-\mu}\right)^N = 1$.
41 Generalization of the Bernoulli Distribution We are now looking at discrete variables that can take on one of K possible mutually exclusive states. The variable is represented by a K-dimensional vector x in which one of the elements $x_k$ equals 1, and all remaining elements equal 0: $\mathbf{x} = (0, 0, \ldots, 1, 0, \ldots, 0)^T$. These vectors satisfy $\sum_{k=1}^{K} x_k = 1$. Let the probability of $x_k = 1$ be denoted as $\mu_k$. Then $p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$, with $\sum_{k=1}^{K} \mu_k = 1$ and $\mu_k \ge 0$, where $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)^T$.
42 Multinoulli/Categorical Distribution The distribution is already normalized: $\sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu}) = \sum_{k=1}^{K} \mu_k = 1$. The mean of the distribution is computed as: $\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \sum_{\mathbf{x}} p(\mathbf{x} \mid \boldsymbol{\mu})\, \mathbf{x} = (\mu_1, \ldots, \mu_K)^T = \boldsymbol{\mu}$, similar to the result for the Bernoulli distribution. The multinoulli, also known as the categorical distribution, is often denoted as (Mu here is the multinomial distribution): $\mathrm{Cat}(\mathbf{x} \mid \boldsymbol{\mu}) = \mathrm{Mu}(\mathbf{x} \mid 1, \boldsymbol{\mu})$. The parameter 1 stands to emphasize that we roll a K-sided die once (N = 1); see next for the multinomial distribution.
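The mean $\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \boldsymbol{\mu}$ can be checked by sampling one-hot vectors. A minimal Python sketch (the μ values are arbitrary illustrative parameters):

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])          # categorical parameters, sum to 1
rng = np.random.default_rng(0)

# Draw one-hot vectors x with P(x_k = 1) = μ_k (a K-sided die rolled once)
K, n = len(mu), 100_000
draws = rng.choice(K, size=n, p=mu)
X = np.eye(K)[draws]                    # each row is a one-hot realization

# Empirical mean of the one-hot vectors ≈ (μ_1, …, μ_K)^T
assert np.allclose(X.mean(axis=0), mu, atol=0.01)
# Normalization: Σ_x p(x | μ) = Σ_k μ_k = 1
assert abs(mu.sum() - 1) < 1e-12
```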
43 Likelihood: Multinoulli Distribution Let us consider a data set $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$. The likelihood becomes: $p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_{n=1}^{N} x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$, where $m_k = \sum_{n=1}^{N} x_{nk}$ is the number of observations of $x_k = 1$. $m_k$ is the sufficient statistic of the distribution.
44 MLE Estimate: Multinoulli Distribution To compute the maximum likelihood (MLE) estimate of μ, we maximize an augmented log-likelihood with a Lagrange multiplier λ enforcing the sum-to-one constraint: $\ln p(\mathcal{D} \mid \boldsymbol{\mu}) + \lambda \left(\sum_{k=1}^{K} \mu_k - 1\right) = \sum_{k=1}^{K} m_k \ln \mu_k + \lambda \left(\sum_{k=1}^{K} \mu_k - 1\right)$. Setting the derivative with respect to $\mu_k$ equal to zero gives $\mu_k = -\frac{m_k}{\lambda}$. Substitution into the constraint $\sum_{k=1}^{K} \mu_k = 1$ gives $-\frac{1}{\lambda} \sum_{k=1}^{K} m_k = 1 \Rightarrow \lambda = -N$, so that $\mu_k^{ML} = \frac{m_k}{N}$. As expected, this is the fraction in the N observations with $x_k = 1$.
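The closed-form MLE $\mu_k^{ML} = m_k/N$ is just the empirical fraction, which the following Python sketch confirms on synthetic one-hot data (the true μ values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true = np.array([0.1, 0.6, 0.3])     # arbitrary ground-truth parameters
K, N = 3, 50_000
draws = rng.choice(K, size=N, p=mu_true)
X = np.eye(K)[draws]                    # one-hot data set D

m = X.sum(axis=0)                       # sufficient statistics m_k
mu_mle = m / N                          # μ_k^ML = m_k / N

assert abs(mu_mle.sum() - 1) < 1e-12    # MLE respects the simplex constraint
assert np.allclose(mu_mle, mu_true, atol=0.01)
```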
45 Multinomial Distribution We can also consider the joint distribution of $m_1, \ldots, m_K$ in N observations conditioned on the parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_K)$. From the expression for the likelihood given earlier, $p(\mathcal{D} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{m_k}$, the multinomial distribution $\mathrm{Mu}(m_1, \ldots, m_K \mid N, \boldsymbol{\mu})$ with parameters N and μ takes the form: $p(m_1, m_2, \ldots, m_K \mid N, \mu_1, \mu_2, \ldots, \mu_K) = \frac{N!}{m_1!\, m_2! \cdots m_K!}\, \mu_1^{m_1} \mu_2^{m_2} \cdots \mu_K^{m_K}$, where $\sum_{k=1}^{K} m_k = N$.
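The multinomial pmf can be implemented directly from this formula; the sketch below (illustrative parameters K = 3, N = 4, made-up μ) also checks that it sums to 1 over all count vectors with $m_1 + m_2 + m_3 = N$:

```python
from math import factorial

def multinomial_pmf(m, mu):
    """Mu(m_1,…,m_K | N, μ) = N!/(m_1!…m_K!) Π_k μ_k^{m_k}, N = Σ_k m_k."""
    N = sum(m)
    coeff = factorial(N)
    for mk in m:
        coeff //= factorial(mk)        # exact integer multinomial coefficient
    p = float(coeff)
    for mk, muk in zip(m, mu):
        p *= muk ** mk
    return p

# Normalization over the simplex of counts with m_1 + m_2 + m_3 = N
mu, N = (0.2, 0.5, 0.3), 4
total = sum(multinomial_pmf((i, j, N - i - j), mu)
            for i in range(N + 1) for j in range(N + 1 - i))
assert abs(total - 1) < 1e-12
```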
UVA CS 4501-001 / 6501 007 Introduc8on to Machine Learning and Data Mining Lecture 13: Probability and Sta3s3cs Review (cont.) + Naïve Bayes Classifier Yanjun Qi / Jane, PhD University of Virginia Department
More informationLecture 1 October 9, 2013
Probabilistic Graphical Models Fall 2013 Lecture 1 October 9, 2013 Lecturer: Guillaume Obozinski Scribe: Huu Dien Khue Le, Robin Bénesse The web page of the course: http://www.di.ens.fr/~fbach/courses/fall2013/
More informationSample Spaces, Random Variables
Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted
More informationQuick Tour of Basic Probability Theory and Linear Algebra
Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions
More informationSingle Maths B: Introduction to Probability
Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction
More informationSummary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016
8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying
More informationSTAT2201. Analysis of Engineering & Scientific Data. Unit 3
STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationReview of probability
Review of probability Computer Sciences 760 Spring 2014 http://pages.cs.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts definition of probability random variables
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationAn Introduction to Statistical and Probabilistic Linear Models
An Introduction to Statistical and Probabilistic Linear Models Maximilian Mozes Proseminar Data Mining Fakultät für Informatik Technische Universität München June 07, 2017 Introduction In statistical learning
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationMachine Learning. Probability Basics. Marc Toussaint University of Stuttgart Summer 2014
Machine Learning Probability Basics Basic definitions: Random variables, joint, conditional, marginal distribution, Bayes theorem & examples; Probability distributions: Binomial, Beta, Multinomial, Dirichlet,
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationECE531: Principles of Detection and Estimation Course Introduction
ECE531: Principles of Detection and Estimation Course Introduction D. Richard Brown III WPI 22-January-2009 WPI D. Richard Brown III 22-January-2009 1 / 37 Lecture 1 Major Topics 1. Web page. 2. Syllabus
More informationChapter 1 Statistical Reasoning Why statistics? Section 1.1 Basics of Probability Theory
Chapter 1 Statistical Reasoning Why statistics? Uncertainty of nature (weather, earth movement, etc. ) Uncertainty in observation/sampling/measurement Variability of human operation/error imperfection
More informationDiscrete Probability Distributions
Discrete Probability Distributions EGR 260 R. Van Til Industrial & Systems Engineering Dept. Copyright 2013. Robert P. Van Til. All rights reserved. 1 What s It All About? The behavior of many random processes
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 1, 2011 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationCS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning
CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we
More informationEE514A Information Theory I Fall 2013
EE514A Information Theory I Fall 2013 K. Mohan, Prof. J. Bilmes University of Washington, Seattle Department of Electrical Engineering Fall Quarter, 2013 http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationProbabilistic Machine Learning
Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationPreliminary statistics
1 Preliminary statistics The solution of a geophysical inverse problem can be obtained by a combination of information from observed data, the theoretical relation between data and earth parameters (models),
More informationEXAM. Exam #1. Math 3342 Summer II, July 21, 2000 ANSWERS
EXAM Exam # Math 3342 Summer II, 2 July 2, 2 ANSWERS i pts. Problem. Consider the following data: 7, 8, 9, 2,, 7, 2, 3. Find the first quartile, the median, and the third quartile. Make a box and whisker
More informationRandomized Algorithms
Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationMachine Learning. Bayes Basics. Marc Toussaint U Stuttgart. Bayes, probabilities, Bayes theorem & examples
Machine Learning Bayes Basics Bayes, probabilities, Bayes theorem & examples Marc Toussaint U Stuttgart So far: Basic regression & classification methods: Features + Loss + Regularization & CV All kinds
More informationArtificial Intelligence
Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:
More informationRepresentation. Stefano Ermon, Aditya Grover. Stanford University. Lecture 2
Representation Stefano Ermon, Aditya Grover Stanford University Lecture 2 Stefano Ermon, Aditya Grover (AI Lab) Deep Generative Models Lecture 2 1 / 32 Learning a generative model We are given a training
More informationChapter 3. Chapter 3 sections
sections 3.1 Random Variables and Discrete Distributions 3.2 Continuous Distributions 3.4 Bivariate Distributions 3.5 Marginal Distributions 3.6 Conditional Distributions 3.7 Multivariate Distributions
More informationSome Probability and Statistics
Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my
More informationWhy study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables
ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section
More informationToday. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion
Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence
More informationNaïve Bayes. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824
Naïve Bayes Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative HW 1 out today. Please start early! Office hours Chen: Wed 4pm-5pm Shih-Yang: Fri 3pm-4pm Location: Whittemore 266
More information