Introduction to Probability and Statistics (Continued)

Size: px

Start display at page:

Download "Introduction to Probability and Statistics (Continued)"

Jonah Benson
5 years ago
Views:

1 Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science University of otre Dame otre Dame, Indiana, USA URL: August 7, 8

2 Contents The binomial and Bernoulli distributions The multinomial and multinoulli distributions The Poisson distribution Student s T Laplace distribution Gamma distribution Beta distribution Pareto distribution Introduction to Covariance and Correlation

3 References Following closely Chris Bishops PRML book, Chapter Kevin Murphy s, Machine Learning: A probablistic perspective, Chapter Jaynes, E. T. (3). Probability Theory: The Logic of Science. Cambridge University Press. Bertsekas, D. and J. Tsitsiklis (8). Introduction to Probability. Athena Scientific. nd Edition Wasserman, L. (4). All of statistics. A Concise Course in Statistical Inference. Springer. 3

4 Binary Variables Consider a coin flipping experiment with heads = and tails =. With [,] px ( ) px ( ) This defines the Bernoulli distribution as follows: Bern( ) ( ) x x x Using the indicator function, we can also write this as: Bern ( x ) ( ) ( x ) ( x) 4

5 Bernoulli Distribution Recall that in general [ f ] p( x) f ( x), [ f ] p( x) f ( x) dx x var[ f ] [ f ( x) ] [ f ( x)] For the Bernoulli distribution can easily show from the definitions: [ x] var[ x] ( ) Bern( ) ( ) x x x, we [ x] p( x )ln p( x ) ln ( )ln( ) x, Here H[x] is the entropy of the distribution 5

6 Likelihood Function for Bernoulli Distribution Consider the data set D x, x,..., x in which we have m heads (x = ), and m tails (x = ) The likelihood function takes the form: p xn xn m m ( ) p( xn ) ( ) D ( ) n n 6

7 Binomial Distribution Consider the discrete random variable X,,,..., We define the Binomial distribution as follows: Bin( X m, ) ( ) m m m In our coin flipping experiment, it gives the probability in flips to get m heads with μ being the probability getting heads in one flip. It can be shown (see S. Ross, Introduction to Probability Models) that the limit of the binomial distribution as, is the Poisson(l) distribution., l binomial distribution Bin(,.5) Matlab Code 7

8 Binomial Distribution.35 The Binomial distribution for =, and.5,.9 is shown below using MatLab function binomdistplot from Kevin Murphys PMTK.5 =.5.4 = Bin(, )

9 Mean,Variance of the Binomial Distribution Consider for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. Because m = x x, and for each observation the mean and variance are known from the Bernoulli distribution: [ m] mbin ( m, ) [ x... x ] m m One can also compute (twice) the identity var[ m] m [ m] Bin( m, ) var[ x... x ] ( ) m [ m], [ m ] by differentiating ( ) wrt μ. Try it! m m m 9

10 Binomial Distribution: ormalization To show that the Binomial is correctly normalized, we use the following identities: Can be shown with direct substitution: Binomial theorem: m ( x) x (**) mm This theorem is proved by induction using (*) and noticing: To finally show normalization using (**): (*) n n n m m m m m ( x) x ( x) x x x x mm m m m m m m m m * m m m m x x x x x x m m m m m m m m m m ( ) ( ) ( ) m m m m m

11 Generalization of the Bernoulli Distribution We are now looking at discrete variables that can take on one of K possible mutually exclusive states. The variable is represented by a K-dimensional vector x in which one of the elements x k equals, and all remaining elements equal : x =,,,,,, T These vectors satisfy: K k x k Let the probability of x k = be denoted as μ k. Then K K K xk ( xk) ( x ) k k, k, k k k k p where μ = μ,, μ K T.

Multinoulli/Categorical Distribution The distribution is already normalized: K K k x x k k xk p( x ) The mean of the distribution is computed as: x xp( x ),.

12 Multinoulli/Categorical Distribution The distribution is already normalized: K K k x x k k xk p( x ) The mean of the distribution is computed as: x xp( x ),..., K x similar to the result for the Bernoulli distribution. The Multinoulli also known as the Categorical distribution often denoted as (Mu here is the multinomial distribution): x x x, Cat Multinoulli Mu The parameter stands to emphasize that we roll a K-sided dice once ( = ) see next for the multinomial distribution. k T

13 Likelihood: Multinoulli Distribution Let us consider a data set becomes: D x,..., x. The likelihood K K xnk K xnk n mk p( D ), where : m x k k k k nk n k k k n is the # of observations of x k =. m k is the sufficient statistic of the distribution. 3

14 MLE Estimate: Multinoulli Distribution To compute the maximum likelihood (MLE) estimate of μ, we maximize an augmented log-likelihood K K K ln p( D ) l k mk ln k l k k k k Setting the derivative wrt μ k equal to zero: Substitution into the constraint K mk l K m m K k K k k l k l k k K K m k k m k m As expected, this is the fraction in the observations of x k = k 4

15 Multinomial Distribution We can also consider the joint distribution of m,, m K in observations conditioned on the parameters μ = (μ,, μ K ). From the expression for the likelihood given earlier p K mk ( D ) k k the multinomial distribution Mu parameters and μ takes the form: ( m,..., m, ) K with! K m m mk p( m, m,..., m,,,..., )... where m K K K k k m! m!... mk! 5

16 Example: Biosequence Analysis Consider a set of DA sequences where there are rows (sequences) and 5 columns (locations along the genome). Several locations are conserved by evolution (e.g., because they are part c g a t a c g g g g t c g a a c a a t c c g a g a t c g c a c a a t c c g t g t t g g g a c a a t c g g c a t g c g g g c g a g c c g c g t a c g a a c a t a c g g a g c a c g a a t a a t c c g g g c a t g t a c g a g c c g a g t a c a g a c c a t c c g c g t a a g c a g g a t a c g a g a t g a c a Location along the genome of a gene coding region), since the corresponding columns tend to be pure e.g., column 7 is all g s. To visuallize the data (sequence logo), we plot the letters A, C, G and T with a font size proportional to their empirical probability, and with the most probable letter on the top. Sequences =: 6

17 Example: Biosequence Analysis The empirical probability distribution at location t, is obtained by normalizing the vector of counts (see MLE estimate) θ t = i= This distribution is known as a motif. I(X it = ), I(X it = ), I(X it = 3 ), I(X it = 4) i= i= i= Can also compute the most probable letter in each location; this is the consensus sequence. Use MatLab function seqlogodemo from Kevin Murphys PMTK Bits Sequence Position 7

18 Summary of Discrete Distributions A summary of the multinomial and related discrete distributions is summarized below on a Table from Kevin Murphy s textbook n = (one roll of the dice), n = ( rolls of the dice) K = (binary variables), K = (-of-k encoding) 8

19 The Poisson Distribution X,,,3,... We say that has a Poisson distribution with parameter λ >, if its pmf is X ~ Poi( l) : Poi( x l) e x l l x! This is a model for counts of rare events. Poi(l=.).4.4 Poi(l=.) Use MatLab function poissonplotdemo from Kevin Murphys PMTK

20 The Empirical Distribution Given data, D = {x,, x }, we define the empirical distribution as: p ( A) ( A), Dirac Measure : ( A) emp xi x i if xi A if xi A We can also associate weights with each sample: Generalize pem p ( x) ( x) p ( x) w ( x), w, w xi emp i xi i i i i i This corresponds to a histogram with spikes at each sample point with height equal to the corresponding weight. This distribution assigns zero weight to any point not in the dataset. ote that the sample mean of f(x) is the expectation of f(x) under the empirical distribution: [ f ( x)] f ( x) ( x) dx f ( x ) xi i i n i

21 Student s T Distribution l l x px (, l, ) ( ) // ( ) / The parameter λ is called the precision of the T-distribution, even though it is not in general equal to the inverse of the variance (see below on behavior as υ ). The parameter υ is called the degrees of freedom. For the particular case of υ =, the T-distribution reduces to the Cauchy distribution. In the limit υ, the T-distribution T(x μ, λ, υ) becomes a Gaussian (x μ, λ ) with mean μ and precision λ.

22 For υ, T(x μ, λ, υ) Becomes a Gaussian ( ) / l l x px (, l, ) ( ) We first write the distribution as follows: For large υ, we can approximate the log as follows: // // x x l l T ( x, l, ) exp ln x x l l T ( x, l, ) exp O( ) exp O( ) In the limit υ, the T-distribution T(x μ, λ, υ) is indeed a Gaussian (x μ, λ ) with mean μ and precision λ. The normalization of the T is valid in this limit as well (so the Gaussian obtained is normalized).

23 Student s T Distribution Mean :, Mode : Var :, l l l x px (, l, ) ( ) // ( ) / student distribution v= v=. v=., l For, we obtain ( l, ) MatLab Code 3

24 Student s T Vs the Gaussian We plot: x,, ( x,,), x,/ T Lap The mean and variance of the Student s is undefined for υ = prob. density functions Gauss Student Laplace Logs of the PDFs. The Student s is OT log concave. Run MatLab function studentlaplacepdfplot from Kevin Murphys PMTK When υ =, the distribution is known as Cauchy or Lorentz. Due to its heavy tails, the mean does not converge. Recommended to use υ = log prob. density functions Gauss Student Laplace

25 Student s T Distribution Gamma ab, p( x, a, b) x, d / a b exp x ( a) The Student s T distribution can be seen from the equation above (see following two slides for proof) as an infinite mixture of Gaussians each of them with different precision (governed by a Gamma distribution) The result is a distribution that in general has longer tails than a Gaussian. This gives the T-distribution robustness, i.e. the T- distribution is much less sensitive than the Gaussian to the presence of outliers. a e b d 5

26 Appendix: Student s T as a Mixture of Gaussians If we have a univariate Gaussian T(x μ, τ ) together with a prior Gamma(τ a, b) and we integrate out the precision, we obtain the marginal distribution of x Introduce the transformation Gamma p( x, a, b) x, a, b d / a b a b exp x e d ( a) z b x a / b / a p( x, a, b) exp z d ( a) / a b ( a) A A to simplify as: / a z exp / a z z dz 6

dz ( a) / a/ It is common to redefine the parameters in this distribution as: / (

27 Appendix: Student s T as a Mixture of Gaussians a / b / a p( x, a, b) z exp / a z z dz ( a) A Recalling the definition of the Gamma function: / a/ a/ a b b x expz z dz ( a) / a/ It is common to redefine the parameters in this distribution as: / ( ) l l x px (, l, ) ( ) a ( a) exp z z dz a b p( x, a, b) b x ( a ) ( a) // a, l a b 7

28 Robustness of Student s T Distribution The robustness of the T-distribution is illustrated here by comparing the maximum likelihood solutions for a Gaussian and a T-distribution (3 data points from the Gaussian are used). The effect of a small number of outliers (Fig. on the right) is less significant for the T-distribution than for the Gaussian T-distribution & Gaussian nearly overlap Gaussian MatLab Code T-distribution 8

29 Robustness of Student s T Distribution The earlier simulation is repeated here with the PMTK toolbox gaussian student T laplace.5 gaussian student T laplace Run MatLab function robustdemo from Kevin Murphys PMTK 9

30 The Laplace Distribution Another distribution with heavy tails is the Laplace distribution, also known as the double sided exponential distribution. It has the following pdf: x b Lap x, b e b μ is a location parameter and b > is a scale parameter Mean, Mode, Var b Its robust to outliers (see earlier demonstration). It puts mores probability density at than the Gaussian. This property is a useful way to encourage sparsity in a model. 3

31 Beta Distribution The Beta(α, β) distribution with follows: x [,],, is defined as The expected value, mode and variance of a Beta random variable x with (hyper-) parameters α and β : var x ( ) ( ) ( ) ( x) x ( x), beta(, ) ( ) ( ) beta(, ) x, x x ormalizing factor mode é ë xù û = a - a + b a=., b=. a=., b=. a=., b=3. a=8., b=4. beta distributions For more information visit this link

32 Beta Distribution If α = β =, we obtain a uniform distribution. If α and β are both less than, we get a bimodal distribution with spikes at and. If α and β are both greater than, the distribution is unimodal. Run betaplotdemo from PMTK a=., b=. a=., b=. a=., b=3. a=8., b=4. α=β= beta distributions α,β> α,β<

33 3 Beta Distribution Beta(.,.) 3 Beta(,).5.5 See Matlab implementation pdf.5 pdf x 3.5 Beta(,3) x 3.5 Beta(8,4) pdf.5 pdf x x 33

34 Gamma Function The gamma function extends the factorial to real numbers: ( ) With integration by parts: ( ) ( x) x ( x) ( ) ( ) x u x u e du For integer n: ( x ) x( x) ( n) ( n)! For more information visit this link. 34

35 Beta Distribution: ormalization Showing that the Beta(α, β) distribution is normalized correctly is a bit tricky. We need to prove that: ( ) ( ) ( ) d Follow the steps: (a) change the variable y below to y = t x; (b) change the order of integration in the shaded triangular region; and (c) change x to via x = tμ: t m ( ) ( ) x y t x e dx y e dy x e t x dt dx y t x x t t t x e t x dx dt t e t tdt d ) ( d 35 x t = x

36 Gamma Distribution- Rate Parametrization It is frequently a model for waiting times. For important properties see here. It is more often parameterized in terms of a shape parameter a and an inverse scale parameter b = /θ, called a rate parameter: a b a bx a u p( x a, b) x e, x,, ( a) u e du ( a) The mean, mode and variance with this parametrization are: a, for a x mod ex b var x b otherwise b 36

37 Plots of Gamma Distribution Gamma b ( a) a a ( X a, b) x exp( xb), b As we increase the rate b, the distribution squeezes leftwards and upwards. For a <, the mode is at zero Gamma distributions a=.,b=. a=.5,b=. a=.,b= Run gammaplotdemo from PMTK 37

38 Gamma Distribution An empirical PDF of rainfall data fitted with a Gamma distribution MoM MLE Run MatLab function gammarainfalldemo from PMTK 38

39 Exponential Distribution This is defined as Expon( X l) Gamma ( X, l) l exp( xl), x, Here λ is the rate parameter. This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate λ. 39

40 Chi-Squared Distribution This is defined as x ( X ) Gamma ( X, ) x exp( ), x, This is the distribution of the sum of squared Gaussian random variables. More precisely, Let Zi ~ (,) and S Z, then : S ~ i i 4

41 Inverse Gamma Distribution This is defined as follows: where: If X X a b X X a b a is the shape and b the scale parameters. ote that b is a scale parameter since: It can be shown that: ~ Gamma (, ) ~ InvGamma (, ) InvGamma b ( a) a ( a) ( X a, b) x exp( b / x), x, X InvGamma ( a,) InvGamma ( X a, b) b b b b b Mean ( exists for a ), Mode, var ( exists for a ) a a ( a ) ( a ) 4

42 The Pareto Distribution Used to model the distribution of quantities that exhibit long tails (heavy tails) Pareto X k, m = km k x (k+) I x m This density asserts that x must be greater than some constant m, but not too much greater, k controls what is too much. Modeling the frequency of words vs. their rank (e.g. the, of, etc.) or the wealth of people.* As k, the distribution approaches δ(x m). On a log-log scale, the pdf forms a straight line of the form log p(x) = a log x + c for some constants a and c (power law, Zipf s law). * Basis of the distribution: a high proportion of a population has low income and only few have very high incomes. 4

43 The Pareto Distribution Applications: Modeling the frequency of words vs their rank, distribution of wealth (k =Pareto Index), etc. Pareto k ( k) ( X k, m) km x ( x m), km Mean ( if k ), k Mode m, mk ( k) ( k) var ( if k ) Pareto distribution m=., k=. m=., k=.5 m=., k= ParetoPlot from PMTK 43

44 Covariance Consider two random variables XY, :. The joint probability distribution is defined as: P X A, Y B P X ( A) Y ( B) p( x, y) dxdy Two random variables are independent if p( x, y) p( x) p( y) The covariance of X and Y is defined as: AB cov( X, Y ) ( X ( X ))( Y ( Y )) It is straight forward to verify that: cov( X, Y ) XY X Y 44

45 Correlation, Center ormalized Random Variables Consider two random variables XY, :. The correlation coefficient of X and Y is defined as: corrc( X, Y ) cov( XY, ) where the standard deviations of X and Y are X cov( X), cov( Y) The center normalized random variables are defined as: It is straight forward to verify that: Y X Y X ~ = Y ~ = X E X σ X Y E Y σ Y E X ~ = E Y ~ = var X ~ = var Y ~ = 45

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email: