Probability Background

Size: px

Start display at page:

Download "Probability Background"

Whitney Cobb
5 years ago
Views:

1 CS76 Spring 0 Advanced Machine Learning robability Background Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu robability Meure A sample space Ω is the set of all possible outcomes. Elements ω Ω are called sample outcomes, while subsets A Ω are called events. For example, for a die roll, Ω = {,,, 4, 5, 6}, ω = 5 is an outcome, A = {5} is the event that the outcome is 5, and A = {,, 5} is the event that the outcome is odd. Ideally, one would like to be able to sign probabilities to all As. This is trivial for finite Ω. However, when Ω = R strange things happen: it is no longer possible to consistently sign probabilities to all subsets of R. The problem is not singleton sets such A = {π}, which happily h probability 0 (we will define it formally). This happens because of the existence of non-meurable sets, whose construction is non-trivial. See for example http: // www. math. kth. se/ matstat/ gru/ godis/ nonme. pdf Instead, we will restrict ourselves to only some of the events. A σ-algebra B is a set of Ω subsets satisfying:. B. if A B then A c B (complementarity). if A, A,... B then i= A i B (countable unions) The sets in B are called meurable. The pair (Ω, B) is called a meurable space. For Ω = R, we take the smallest σ-algebra that contains all the open subsets and call it the Borel σ-algebra. It turns out the Borel σ-algebra can be defined alternatively the smallest σ-algebra that contains all the closed subsets. Can you see why? It also follows that singleton sets {x} where x R is in the Borel σ-algebra. A meure is a function : B R satisfying:. (A) 0 for all A B. If A, A,... are disjoint then ( i= A i) = i= (A i) Note these imply ( ) = 0. The triple (Ω, B, ) is called a meure space. The Lebesgue meure is a uniform meure over the Borel σ-algebra with the usual meaning of length, area or volume, depending on dimensionality. For example, for R the Lebesgue meure of the interval (a, b) is b a. A probability meure is a meure satisfying additionally the normalization constraint: (Ω) =. Such a triple (Ω, B, ) is called a probability space. When Ω is finite, ({ω}) h the intuitive meaning: the chance that the outcome is ω. When Ω = R, ((a, b)) is the probability ms signed to the interval (a, b).

2 robability Background Random Variables Let (Ω, B, ) be a probability space. Let (R, R) be the usual meurable space of reals and its Borel σ- algebra. A random variable is a function X : Ω R such that the preimage of any set A R is meurable in B: X (A) = {ω : X(ω) A} B. This allows us to define the following (the first is the new definition, while the nd and rd s are the already-defined probability meure on B): (X A) = (X (A)) = ({ω : X(ω) A}) () (X = x) = (X (x)) = ({ω : X(ω) = x}) () A random variable is a function. Intuitively, the world generates random outcomes ω, while a random variable deterministically translates them into numbers. Example Let Ω = {(x, y) : x + y } be the unit disk. Consider drawing a point at random from Ω. The outcome is of the form ω = (x, y). Some example random variables are X(ω) = x, Y (ω) = y, Z(ω) = xy, and W (ω) = x + y. Example Let X = ω be a uniform random variable (to be defined later) with the sample space Ω = [0, ]. A sequence of different random variables {X n } n= can be defined follows, where {z} = if z is true, and 0 otherwise: X (ω) = ω + {ω [0, ]} () X (ω) = ω + {ω [0, ]} (4) X (ω) = ω + {ω [, ]} (5) X 4 (ω) = ω + {ω [0, ]} (6) X 5 (ω) = ω + {ω [, ]} (7) X 6 (ω) = ω + {ω [, ]} (8)... Given a random variable X, the cumulative distribution function (CDF) is the function F X : R [0, ] F X (x) = (X x) = (X (, x]) = ({ω : X(ω) (, x]}). (9) A random variable X is discrete if it takes countably many values. We define the probability ms function f X (x) = (X = x). A random variable X is continuous if there exists a function f X such that. f X (x) 0 for all x R. f X(x)dx =. for every a b, (X [a, b]) = b a f X(x)dx. The function f X is called the probability density function (DF) of X. CDF and DF are related by F X (x) = x f X (t)dt (0) f X (x) = F X(x) at all points x at which F X is differentiable. () It is certainly possible for the DF of a continuous random variable to be larger than one: f X (x). In fact, the DF can be unbounded in x / for x (0, ) and 0 otherwise. However, (x) = 0 for all x. Recall that and f X is related by integration over an interval. We will often use p instead of f X to denote a DF later in cls.

3 robability Background Some Random Variables. Discrete Dirac or point ms distribution X δ a if (X = a) = with CDF F (x) = 0 if x < a and if x a. Binomial. A random variable X h a binomial distribution with parameters n (number of trials) and p (head probability) X Binomial(n, p) with probability ms function ( ) n p x x ( p) n x for x = 0,,..., n () 0 otherwise If X Binomial(n, p) and X Binomial(n, p) then X + X Binomial(n + n, p). Think this merging two coin flip experiments on the same coin. The binomial distribution is used to model test set error of a clsifier. Assuming a clsifier s true error rate is p (with respect to the unknown underlying joint distribution we will make it precise later in cls), then on a test set of size n the number of misclsified items follow Binomial(n, p). Bernoulli. Binomial with n =. Multinomial. The d-dimensional version of binomial. The parameter p = (p,..., p d ) is now the probabilities of a d-sided die, and x = (x,..., x d ) is the counts of each face. ( ) n d x,..., x k= px k k if d k= x k = n d () 0 otherwise The multinomial distribution is typically used with the bag-of-word representation of text documents. λ λx oisson. X oisson(λ) if e x! for x = 0,,,.... If X oisson(λ ) and X oisson(λ ) then X + X oisson(λ + λ ). This is a distribution on unbounded counts with a probability ms function hump (mode not the mean at λ ). It can be used, for example, to model the length of a document. Geometric. X Geom(p) if p( p) x for x =,,... This is another distribution on unbounded counts. Its probability ms function h no hump (mode not the mean at ) and decrees monotonically. Think of it a stick breaking procedure: start with a stick of length. Each day you take away p fraction of the remaining stick. Then f(x) is the length you get on day x. We will see more interesting stick breaking in Bayesian nonparametrics.. Continuous Gaussian (also called Normal distribution). X N(µ, σ ) with parameters µ R (the mean) and σ (the variance) if ) ( σ π exp (x µ) σ. (4) The square root of variance σ > 0 is called the standard deviation. If µ = 0, σ =, X h a standard normal distribution. In this ce, X is usually written Z. Some useful properties: (Scaling) If X N(µ, σ ), then Z = (X µ)/σ N(0, ) (Independent sum) If X i N(µ i, σi ) are independent, then i X i N ( i µ i, ) i σ i Note there is concentration of meure in the independent sum: the spread or stddev grows n, not n.

4 robability Background 4 This is one DF you need to pay close attention to. The CDF of a standard normal is usually written Φ(z), which h no closed-form expression. However, it qualitatively resembles a sigmoid function and is often used in translating unbounded real margin values into probabilities of binary clses. You might recognize the RBF kernel an unnormalized version of the Gaussian DF. The parameter σ or confusingly σ or a scaled version of either is called the bandwidth. χ distribution. If Z,..., Z k are independent standard normal random variables, then Y = k i Z i h a χ distribution with k degrees of freedom. If X i N(µ i, σi ) are independent, then ( ) Xi µ i i σ i h a χ distribution with k degrees of freedom. The DF for the χ distribution with k degrees of freedom is k/ Γ(k/) xk/ e x/, x > 0. (5) Multivariate Gaussian. Let x, µ R d, Σ S+ d a symmetric, positive definite matrix of size d d. Then X N(µ, Σ) with DF ( exp ) Σ / (π) d/ (x µ) Σ (x µ). (6) Here, µ is the mean vector, Σ is the covariance matrix, Σ its determinant, and Σ its inverse (exists because Σ is positive definite). If we have two (groups of) variables that are jointly Gaussian: then we have: (Marginal) x N(µ x, A) [ ] x y N ([ µx µ y ] [ ]) A C, C B (Conditional) y x N(µ y + C A (x µ x ), B C A C) You want to pay attention to Multivariate Gaussian too. It is used frequently in mixture models (Mixture of Gaussian), and building blocks for Gaussian rocesses. The conditional is useful for inference. Exponential. X Exp(β) with parameter β > 0, if β e x/β. Not to be confused with the exponential family (more later). Gamma. The Gamma function (not distribution) is defined Γ(α) = 0 y α e y dy with α > 0. The Gamma function is an extension to the factorial function: Γ(n) = (n )! when n is a positive integer. It also satisfies Γ(α + ) = αγ(α) for α > 0. X h a Gamma distribution with shape parameter α > 0 and scale parameter β > 0, denoted by X Gamma(α, β), if Gamma(, β) is the same Exp(β). Beta. X Beta(α, β) with parameters α, β > 0, if (7) β α Γ(α) xα e x/β, x > 0. (8) Γ(α + β) Γ(α)Γ(β) xα ( x) β, x (0, ). (9) Beta(, ) is uniform in [0, ]. Beta(α <, β < ) h a U-shape. Beta(α >, β > ) is unimodal with mean α/(α + β) and mode (α )/(α + β ). Beta distribution h a special status in machine learning because it is the conjugate prior of the binomial and Bernoulli distributions. A draw from a beta distribution can be thought of generating a (bied) coin. A draw from the corresponding Bernoulli distribution can be thought of a flip of that coin.

5 robability Background 5 Dirichlet. The Dirichlet distribution is the multivariate version of the beta distribution. X Dir(α,..., α d ) with parameters α i > 0, if Γ( d i α i) d d i Γ(α x αi i, (0) i) where x = (x,..., x d ) with x i > 0, d i x i =. This support is called the open (d ) dimensional simplex. The Dirichlet distribution is important for machine learning because it is the conjugate prior for multinomial distributions. The analogy here is that of a dice factory (Dirichlet) and die rolls (multinomial). It is used extensively for modeling bag-of-word documents. We will also see it in Dirichlet rocesses. t (or Student s t) Distribution X t ν h a t distribution with ν degrees of freedom, if Γ ( ) ν+ ( νπγ ν ) i ) (ν+)/ ( + x () ν It is similar to a Gaussian distribution, but with heavier tails (more likely to produce extreme values). When ν =, it becomes a standard Gaussian distribution. Student is the pen name of William Gosset who worked for Guinness, because the brewery did not want its employees to publish any scientific papers for the fear of leaking trade secret (sounds familiar?). Cauchy. The Cauchy distribution is a special ce of t-distribution with ν =. The DF is ( πγ + ( ) ) () x x 0 γ where x 0 is the location parameter for the mode, and γ is the scale parameter. It is notable for the lack of mean: E(X) does not exist because x x df X(x) =. Similarly, it h no variance or higher moments. However, the mode and median are both x 0. This is a very heavy-tailed distribution compared to Gaussian, so much so that draws from Cauchy will occionally produce very large values and the sample mean never settles. 4 Convergence of Random Variables Let X, X,..., X n,... be a sequence of random variables, and X be another random variable. Think of X n the clsifier trained on a training set of size n. It is a random variable because the training set is random (an iid sample of the underlying unknown joint distribution XY ). Think of X the true clsifier (well, it is a fixed quantity, not random, but that s fine). We are interested in how the clsifiers behave we have more and more training data: Do they get closer to X? What do we mean by close? These questions are made precise by the different types of convergence. Study of such topics are called large sample theory or ymptotic theory. X n converges to X in distribution, written X n X, if lim F X n (t) = F (t) () at all t where F is continuous. Here, F Xn is the CDF of X n, and F is the CDF of X. We expect to see the next outcome in a sequence of random experiments becoming better and better modeled by the probability distribution of X. In other words, the probability for X n to be in a given interval is approximately equal to the probability for X to be in the same interval, n grows. Example Let X,..., X n be iid continuous random variables. Then trivially X n X. But note (X = X n ) = 0.

6 robability Background 6 It is their distributions that are the same, not their values which are controlled by different randomness. Example 4 Let X,..., X n be identical (but not independent) copies of X N(0, ). Then X n Y = X. Example 5 Let X n uniform[0, /n]. Then X n δ 0. This is often written X n 0. Interestingly, note F (0) = but F Xn (0) = 0 for all n, so lim F Xn (0) F (0). This does not contradict the definition of convergence in distribution, because t = 0 is not a point at which F is continuous. Example 6 Let X n h the DF f n (x) = ( cos(πnx)), x (0, ). Then X n uniform(0, ). Note the DFs f n s do not converge at all, but the CDFs do. Theorem (The Central Limit Theorem). Let X,..., X n be iid with finite mean µ and finite variance σ > 0. Let X n = n n i X i. Then Z n = X n µ σ/ N(0, ). (4) n Example 7 Let X i uniform(, ) for i =,... Note µ = 0, σ = /. Then Z n = n i X i N(0, ). Example 8 Let X i beta(, ). Note µ =, σ = 8. Then Z n = 8n ( n i X i / ) N(0, ). The Central Limit Theorem is the most widely used application of convergence in distribution. Demo: CLT uniform.m and CLT beta.m Does the CLT apply to the Cauchy distribution? X n converges to X in probability, written X n X, if for any ɛ > 0 lim ( X n X > ɛ) = 0. (5) Convergence in probability is central to machine learning because a very important concept, consistency, is defined using it. Convergence in probability implies convergence in distribution. The reverse is not true in general. However, convergence in distribution to a point ms distribution implies convergence in probability. Example, Example 4, and Example 6 do not converge in probability. Example 5 does. The definition might be eier to understand if we rewrite it lim ({ω : X n(ω) X(ω) > ɛ}) = 0. That is, the fraction of outcomes ω on which X n and X disagree must shrink to zero. When X n (ω) and X(ω) do disagree, they can differ by a lot in value. More importantly, note that X n (ω) does not need to converge to X(ω) pointwise for any ω. This will be the distinguishing property between converge in probability and converge almost surely (to be defined shortly). Example 9 X n X in Example. X n converges almost surely to X, written X n X, if ({ }) ω : lim X n(ω) = X(ω) = (6)

7 robability Background 7 Example 0 Example does not converge almost surely to X there. This is because lim X n (ω) does not exist for any ω. ointwise convergence is central to. implies, which in turn implies. Example X n 0 (and hence X n 0 and Xn 0) does not imply convergence in expectation E(X n ) 0. To see this, let { /n, if Xn = n (X n ) = (7) /n, if X n = 0 Then X n 0. However, E(X n ) = n n = n does not converge. X n converges in rth mean where r, written X n L r X, if L r implies. lim E( X n X r ) = 0. (8) Lr implies, Ls if r > s. There is no general order between Lr and. Theorem (The Weak Law of Large Numbers). If X,..., X n are iid, then X n = n n i= X i µ(x ). Theorem (The Strong Law of Large Numbers). If X,..., X n are iid, and E( X ) <, then X n = n n i= X i µ(x ).

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Review of Basic Probability The fundamentals, random variables, probability distributions Probability mass/density functions