Statistics for scientists and engineers

February 2006

Contents

1 Introduction
1.1 Motivation - why study statistics?
1.2 Examples
1.3 Statistics

2 Probability theory
2.1 Random variables
2.2 Vectors of random variables
2.3 Conditioning
2.4 Moments and expected values
2.5 Transformation of random variables
2.5.1 One-dimensional case
2.5.2 Multi-variate case
2.5.3 Example
2.5.4 Special case: affine transformations

3 Some useful distributions
3.1 Continuous RVs
3.1.1 Dirac delta distribution
3.1.2 Uniform distribution
3.1.3 Normal distribution
3.1.4 Multi-variate normal distribution
3.1.5 Exponential distribution
3.1.6 Chi-squared distribution
3.1.7 Beta distribution
3.1.8 Gamma distribution
3.2 Discrete RVs
3.2.1 Kronecker delta distribution
3.2.2 Bernoulli distribution
3.2.3 Binomial distribution
3.2.4 Poisson distribution

4 Some important relations
4.1 Orthonormal transformation of independent normal RVs
4.2 Statistics of normal random variables
4.3 Chi-squared distribution
4.4 Relation between Chi-squared and Gamma distribution
4.5 Relation between Gamma and Beta distributions
4.6 Statistics from normal RVs
1 Introduction

1.1 Motivation - why study statistics?

Statistical modeling and analysis, including the collection and interpretation of data, form an essential part of the scientific method in diverse fields, including the social, biological, and physical sciences. Statistical theory is primarily based on the mathematical theory of probability, and covers a wide range of topics, from highly abstract areas to topics directly relevant for applications. The main goal of the theory of statistics is to draw information from data. Data can come in a variety of forms, including signals in continuous time and lists of discrete-time values in multiple dimensions. Important aspects include:

1. Model construction: to get insight into a problem, we build a (generally drastically oversimplified) model of the problem. This model should capture the properties in which we are interested and abstract away everything else. Example: Shannon's BSC in information theory.
2. Methods: given a certain model, we derive methods for extracting useful information from the data. Example: MMSE estimation.
3. Performance comparison: different methods need to be compared in terms of certain performance criteria. We also introduce notions of optimality. Example: the information inequality.
4. Algorithm design: in most cases, estimation methods do not lead to closed-form solutions. We need to develop clever numerical methods to solve these problems. Examples: Newton-Raphson, Expectation-Maximization, Turbo codes.

1.2 Examples

- A sequence of N elements from an assembly line. An unknown number θ of these elements is defective. We would like to know what θ is, but we do not have the time or resources to investigate each of the N elements. We choose to draw n elements without replacement and try to determine θ.
- We wish to study how the income of a large population (e.g., grad students) is distributed. An exhaustive study of the entire population is impossible. We base our study on n samples.
- We make n observations to determine a constant µ.
The observations are corrupted by random fluctuations. For instance, we transmit a symbol $b \in \{-1, +1\}$ n times and try to recover that bit in the presence of thermal noise.

To solve these problems, we first need to construct a model. Let us consider some simple models:

- Define a random variable $X_k$ indicating whether or not item k out of the n items is defective, so $X_k \in \{\text{defective}, \text{OK}\}$. A possible model could be independent observations with $p_{X_k}(\text{defective}) = \theta$ and $p_{X_k}(\text{OK}) = 1 - \theta$. This model may not be correct: for instance, we may have drawn our samples just after a machine in the assembly line was repaired.
- We introduce a random variable representing the incomes of n people, $X = [X_1, \ldots, X_n]$. The joint distribution of these incomes is given by $p_X(x)$. Let us assume the incomes are independent, so that $p_X(x) = \prod_{k=1}^n p_{X_k}(x_k)$. Finally, let us model $p_{X_k}(x_k)$ as a normal distribution with mean µ and variance σ², both independent of k.
- The observation is given by a random variable $X = [X_1, \ldots, X_n]$, with $X_k = \mu + W_k$, where $W_k$ is a noise sample. Clearly $p_X(x)$ can be found from $p_W(w)$. Now we can introduce some additional assumptions regarding $p_W(w)$. We can say that the noise at time k does not depend on the noise at time $l \neq k$: the noise samples are independent. In that case $p_W(w) = \prod_{k=1}^n p_{W_k}(w_k)$. We can also assume the noise samples are identically distributed, so that $p_{W_k}(w_k)$ is independent of k. The specific distribution $p_W(w)$ depends on the physical properties of the noise.
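The measurement model above, $X_k = \mu + W_k$, is easy to simulate. The following sketch uses invented example values (µ = 1.5, iid Gaussian noise with σ = 0.5, n = 10 000 observations; none of these come from the notes) and recovers µ with the sample mean:

```python
import random

# Simulation sketch of the model X_k = mu + W_k with iid Gaussian noise W_k.
# mu, sigma and n are invented example values, not from the notes.
random.seed(7)
mu, sigma, n = 1.5, 0.5, 10_000
x = [mu + random.gauss(0.0, sigma) for _ in range(n)]

# The sample mean is a natural statistic for mu; its error is of order sigma/sqrt(n).
estimate = sum(x) / n
print(abs(estimate - mu) < 0.05)  # True
```

With n = 10 000 samples the standard error of the estimate is σ/√n = 0.005, so the 0.05 tolerance is comfortably met.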
We see that we always need to introduce some simplifying assumptions. These are generally based on knowledge regarding the observations. For instance, when some observations look normally distributed, we model them as such. These models are not always 100% accurate. Look at the income distribution: a normal distribution can result in some people having negative incomes!

1.3 Statistics

Definition: A statistic T is a map from an observation space Ω to some space of values J. $T(x)$ is usually what you compute after you observe x. J is commonly (a subset of) a Euclidean space. The choice of the statistic is closely related to what we are trying to infer from the data.

Examples:
- the fraction of defective items out of n samples;
- the sample mean $T(x) = \frac{1}{n} \sum_{k=1}^n x_k = \bar{x}$;
- the sample variance $T(x) = \frac{1}{n} \sum_{k=1}^n (x_k - \bar{x})^2$.

2 Probability theory

We now give a not-too-rigorous introduction to the basics of probability.

2.1 Random variables

Given a sample space Ω (e.g., heads or tails: $\Omega = \{H, T\}$), a random variable (RV) X is a mapping from Ω to the real numbers R. When Ω is finite or countably infinite, X is said to be a discrete RV. Otherwise X is a continuous RV. With each RV we can associate a cumulative distribution function (CDF):
$F_X(x) = P\{\omega \in \Omega : X(\omega) \le x\}$
where $P\{E\}$ is the probability of some event E. We generally abuse notation and write $P\{X \le x\}$ instead of $P\{\omega \in \Omega : X(\omega) \le x\}$. We will often use probability density functions (pdfs) $p_X(x)$, defined through
$\int_a^b p_X(x)\,dx = P\{a \le X \le b\} = P\{\omega \in \Omega : a \le X(\omega) \le b\}$
where the last equation is a slight abuse of notation. Note that $p_X(x)$ and $P\{X = x\}$ are not necessarily the same thing! Note also that $\int_{-\infty}^{+\infty} p_X(x)\,dx = 1$. For continuous RVs, when $F_X(x)$ is differentiable,
$\frac{d}{dx} F_X(x) = p_X(x).$
Similarly,
$F_X(x) = \int_{-\infty}^{x} p_X(u)\,du.$
In the case of discrete RVs, we use slightly different terminology: $F_X(x)$ is the cumulative mass function (CMF), while $p_X(x)$ is the probability mass function (pmf). In that case, we have $p_X(x) = P\{X = x\}$.
For discrete RVs, all integrals should be replaced by summations, and the sections dealing with differentiation cannot be applied.
2.2 Vectors of random variables

The concept of a RV is easily extended to the multi-dimensional case. We can group n random variables $X_1, \ldots, X_n$ in a vector X. This vector is again a random variable, with probability density function $p_X(x)$. $X_1, \ldots, X_n$ are said to be mutually independent when
$p_X(x) = \prod_{k=1}^n p_{X_k}(x_k).$
$X_1, \ldots, X_n$ are said to be identically distributed when $p_{X_k}(x) = p_{X_l}(x)$ for any k and l. In many problems we will consider variables which are independent and identically distributed (iid). The marginal distributions $p_{X_k}(x_k)$, $k = 1, \ldots, n$, can be obtained as follows:
$p_{X_k}(x_k) = \int \cdots \int p_X(x)\, dx_1 \cdots dx_{k-1}\, dx_{k+1} \cdots dx_n.$

2.3 Conditioning

Given two random variables X and Y, the conditional probability function $p_{X|Y}(x|y)$ is given by
$p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}$  (1)
for $p_Y(y) \neq 0$. This is to be interpreted as the probability function of X, given that $Y = y$. In $p_{X|Y}(x|y)$, x is the random variable, while y should be interpreted as a parameter. Also, $\int p_{X|Y}(x|y)\,dx = 1$ for all y, while $\int p_{X|Y}(x|y)\,dy = g(x)$ for some function g (not necessarily equal to 1).

Examples:
- What happens when X and Y are independent?
- Two dice, with $X_k$ the number of eyes on die k and $Y_k \in \{\text{even}, \text{odd}\}$. Determine $p_{X_1,X_2}(x_1,x_2)$ and $p_{X_1,Y_1}(x_1,y_1)$, as well as the conditional probability functions.

Bayes' Rule

Probably one of the most useful results is Bayes' rule. Looking back at (1), we easily find that
$p_{X,Y}(x,y) = p_{X|Y}(x|y)\, p_Y(y) = p_{Y|X}(y|x)\, p_X(x)$
so that
$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{p_Y(y)}.$
Learn this rule by heart; you will be using it a lot! As a variation, note that $p_Y(y) = \int p_{X,Y}(x,y)\,dx = \int p_{Y|X}(y|x)\, p_X(x)\,dx$, so that
$p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x)\, p_X(x)}{\int p_{Y|X}(y|x)\, p_X(x)\,dx}$
which is known as Bayes' theorem.
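Bayes' rule can be applied directly to the dice example above. The following sketch computes the posterior $p_{X|Y}(x|\text{even})$ for a single fair die, using exact rational arithmetic (the fair-die prior is the example's assumption):

```python
from fractions import Fraction

# Bayes' rule on the dice example: X is the face of a fair die (1..6),
# Y is its parity ("even" or "odd").
p_X = {x: Fraction(1, 6) for x in range(1, 7)}  # prior p_X(x)
parity = {x: ("even" if x % 2 == 0 else "odd") for x in p_X}

def posterior(y):
    """p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y), with p_Y(y) = sum over x."""
    likelihood = {x: Fraction(int(parity[x] == y)) for x in p_X}
    p_Y = sum(likelihood[x] * p_X[x] for x in p_X)
    return {x: likelihood[x] * p_X[x] / p_Y for x in p_X}

post = posterior("even")
print(post[2])  # 1/3: the three even faces are equally likely a posteriori
print(post[1])  # 0: an odd face is impossible given Y = even
```

The denominator p_Y is exactly the marginalization $\int p_{Y|X}\,p_X$ from Bayes' theorem, computed as a sum in the discrete case.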
2.4 Moments and expected values

We introduce the expectation operator on a RV X: given a function $g : R \to \Gamma$, the expectation (or: expected value) of g w.r.t. $p_X(x)$ is given by
$E_X\{g(X)\} = \int g(x)\, p_X(x)\,dx.$
Observe that $E_X\{g(X)\}$ is just an element of Γ, no longer dependent on x. Special cases are the moments and the central moments:
$\mu_n = E_X\{X^n\}$
$\bar{\mu}_n = E_X\{(X - \mu_1)^n\}.$
The mean and variance of a distribution are given by $\mu_1$ and $\bar{\mu}_2$, respectively. We will sometimes denote the mean by µ and the variance by σ². The standard deviation is given by σ.

Properties of the expectation operator:
- Linearity: $E_X\{g_1(X) + g_2(X)\} = E_X\{g_1(X)\} + E_X\{g_2(X)\}$.
- Uncorrelated RVs: X and Y are said to be uncorrelated when $E_{X,Y}\{XY\} = E_X\{X\}\,E_Y\{Y\}$. Show that independent RVs are uncorrelated. Show that uncorrelated RVs are not necessarily independent.
- Conditional expectation: $E_{X|Y}\{g(X)\} = \int g(x)\, p_{X|Y}(x|Y)\,dx$, which is a function of the RV Y.
- Iterated expectations: $E_{X,Y}\{g(X,Y)\} = E_Y\{E_{X|Y}\{g(X,Y)\}\}$.
- Expectations and functions: let $Y = g(X)$; then $E_X\{g(X)\} = E_Y\{Y\}$, so we can evaluate $E_Y\{Y\}$ without explicit knowledge of $p_Y(y)$.

2.5 Transformation of random variables

The discrete case is trivial, so we will focus on continuous RVs.

2.5.1 One-dimensional case

Given a RV X and an invertible function $f : R \to R$, we wish to determine the probability distribution of $Y = f(X)$. We see that $X = f^{-1}(Y) = g(Y)$. It is easily verified that
$p_Y(y) = p_X(g(y)) \left| \frac{dg(y)}{dy} \right|.$
When f is not invertible, we use a different technique:
$F_Y(y) = P\{Y \le y\} = P\{f(X) \le y\}$
which should be evaluated further and then differentiated w.r.t. y.
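The one-dimensional transformation formula can be sanity-checked by simulation. A sketch under invented assumptions: take X ~ U(0,1) and Y = f(X) = −ln X, so that g(y) = e^{−y} and the formula gives $p_Y(y) = 1 \cdot e^{-y}$, an exponential distribution with rate 1. We compare the empirical CDF at one point against the exponential CDF:

```python
import math
import random

# Transformation of a RV: X ~ U(0,1), Y = f(X) = -ln(X).
# With g(y) = f^{-1}(y) = e^{-y}, p_Y(y) = p_X(g(y)) |dg/dy| = e^{-y} for y > 0,
# i.e. an exponential distribution with rate 1.
random.seed(0)
n = 200_000
# 1 - random.random() lies in (0, 1], so the logarithm is always defined.
samples = [-math.log(1.0 - random.random()) for _ in range(n)]

# Compare the empirical P{Y <= 1} with the exponential CDF value 1 - e^{-1}.
empirical = sum(s <= 1.0 for s in samples) / n
exact = 1.0 - math.exp(-1.0)
print(abs(empirical - exact) < 0.01)  # True
```

This is also exactly the inverse-CDF method for generating exponential samples from uniform ones.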
2.5.2 Multi-variate case

Given real-valued RVs $X_1, X_2, \ldots, X_N$ and functions $h_1, \ldots, h_N$ with $h_k : R^N \to R$, we define $Y = [Y_1, \ldots, Y_N]^T$ and $X = [X_1, \ldots, X_N]^T$, where $Y_n = h_n(X)$, or simply $Y = h(X)$. We assume h is one-to-one (invertible), so that $X = h^{-1}(Y)$ with $X_n = g_n(Y)$. We now introduce the Jacobian as the determinant of the matrix $J(y)$ with
$[J(y)]_{k,n} = \frac{\partial}{\partial y_n} g_k(y).$
Then
$p_Y(y) = p_X(h^{-1}(y))\, |\det J(y)|.$

2.5.3 Example

Problem: given $X_1$ and $X_2$ with known $p_{X_1,X_2}(x_1,x_2)$, and $Y_1 = X_1 + X_2$, determine $p_{Y_1}(y_1)$.

Solution 1: We see that the transformation is not one-to-one, so we first introduce $Y_2 = X_1 - X_2$. Now $Y = h(X)$ is invertible: given $Y_1$ and $Y_2$, we find $X_1$ and $X_2$ as
$X_1 = \frac{Y_1 + Y_2}{2} = g_1(Y_1, Y_2)$
$X_2 = \frac{Y_1 - Y_2}{2} = g_2(Y_1, Y_2)$
so that
$[J(y)]_{1,1} = \frac{\partial g_1}{\partial y_1} = \frac{1}{2}, \quad [J(y)]_{1,2} = \frac{\partial g_1}{\partial y_2} = \frac{1}{2}, \quad [J(y)]_{2,1} = \frac{\partial g_2}{\partial y_1} = \frac{1}{2}, \quad [J(y)]_{2,2} = \frac{\partial g_2}{\partial y_2} = -\frac{1}{2}$
so that
$J(y) = \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{bmatrix}$
and $|\det J(y)| = 1/2$. Hence
$p_{Y_1,Y_2}(y_1,y_2) = \frac{1}{2}\, p_{X_1,X_2}\!\left(\frac{y_1+y_2}{2},\, \frac{y_1-y_2}{2}\right).$
And finally
$p_{Y_1}(y_1) = \int p_{Y_1,Y_2}(y_1,y_2)\,dy_2.$

Solution 2 (assuming $X_1$ and $X_2$ are independent): since
$F_{Y_1|X_2}(y|x_2) = P\{Y_1 \le y \mid X_2 = x_2\} = P\{X_1 + X_2 \le y \mid X_2 = x_2\} = P\{X_1 \le y - x_2\} = F_{X_1}(y - x_2)$
we have
$p_{Y_1|X_2}(y|x_2) = p_{X_1}(y - x_2)$
and
$p_{Y_1}(y) = \int p_{X_1}(y - x_2)\, p_{X_2}(x_2)\,dx_2$
which can be interpreted as a convolution of two pdfs.

2.5.4 Special case: affine transformations

Introduce an $N \times N$ matrix A and an $N \times 1$ vector c, and define
$Y = h(X) = AX + c;$
then $h(\cdot)$ is an affine transformation. When A is invertible, $X = A^{-1}(Y - c)$ and $J(y) = A^{-1}$, so that
$p_Y(y) = p_X(A^{-1}(y - c))\, |\det A^{-1}| = \frac{p_X(A^{-1}(y - c))}{|\det A|}.$
When A is an invertible square matrix with $AA^T = A^T A = I$, we say that A is orthonormal. Orthonormal matrices are norm-preserving: $\|AX\| = \|X\|$, where $\|X\|^2 = \sum_k X_k^2$.
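The convolution result of Solution 2 is easy to test numerically. A Monte Carlo sketch under invented assumptions: for independent X1, X2 ~ U(0,1), the convolution gives the triangular density $p_{Y_1}(y) = y$ on [0,1] and $2 - y$ on [1,2], so e.g. $P\{Y_1 \le 0.5\} = \int_0^{0.5} y\,dy = 0.125$:

```python
import random

# Sum of two independent RVs: p_Y1 is the convolution of p_X1 and p_X2.
# For X1, X2 ~ U(0,1) independent, Y1 = X1 + X2 has the triangular density
# p(y) = y on [0,1] and p(y) = 2 - y on [1,2].
random.seed(1)
n = 200_000
ys = [random.random() + random.random() for _ in range(n)]

# The triangular density predicts P{Y1 <= 0.5} = 0.125.
empirical = sum(y <= 0.5 for y in ys) / n
print(abs(empirical - 0.125) < 0.01)  # True
```

With 200 000 samples the standard error is well below the 0.01 tolerance.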
3 Some useful distributions

For a more exhaustive list, go to http://mathworld.wolfram.com/topics/statisticaldistributions.html. With each distribution we will provide the mean µ and the variance σ².

3.1 Continuous RVs

3.1.1 Dirac delta distribution

The Dirac delta distribution is used when we have absolute certainty regarding a random variable. We write $X \sim \delta(x)$, where $\delta(x)$ is defined through
$\int f(x)\,\delta(x)\,dx = f(0)$
and has $\mu = \sigma^2 = 0$.

3.1.2 Uniform distribution

$X \sim U(a, b)$, with $b > a$; then
$p_X(x) = \frac{1}{b-a}$ for $a < x < b$, and 0 elsewhere.
Also,
$\mu = \frac{a+b}{2}$ and $\sigma^2 = \frac{(b-a)^2}{12}.$

3.1.3 Normal distribution

Probably the most important distribution. $X \sim N(\mu, \sigma^2)$:
$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$

3.1.4 Multi-variate normal distribution

$X = [X_1, \ldots, X_N]$ is said to have a multi-variate normal distribution, $X \sim N(m, \Sigma)$, if its probability function has the form
$p_X(x) = \frac{1}{(2\pi)^{N/2} \sqrt{\det \Sigma}} \exp\left(-\frac{1}{2}(x - m)^T \Sigma^{-1} (x - m)\right)$
where
$E_X\{X\} = m$
and
$E_X\{(X - m)(X - m)^T\} = \Sigma.$

Properties:
- When $X \sim N(m, \Sigma)$, then $X_k \sim N(\mu_k, \sigma_k^2)$. Determine $\mu_k$ and $\sigma_k^2$.
- When $X \sim N(m, \Sigma)$ and $X_k$ and $X_l$ are uncorrelated, then they are also independent.
- $X_k \sim N(\mu_k, \sigma_k^2)$ for all k does not imply $X \sim N(m, \Sigma)$!
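The normal distribution's two parameters can be checked against simulated samples. A small Monte Carlo sketch, where µ = 3 and σ = 2 are invented example values:

```python
import random
import statistics

# Sampling N(mu, sigma^2) and checking the first two moments empirically
# (a sanity-check sketch, not a proof; mu and sigma are arbitrary examples).
random.seed(2)
mu, sigma = 3.0, 2.0
xs = [random.gauss(mu, sigma) for _ in range(200_000)]

print(abs(statistics.fmean(xs) - mu) < 0.05)           # True: sample mean ~ mu
print(abs(statistics.pvariance(xs) - sigma**2) < 0.1)  # True: sample variance ~ sigma^2
```

Note that `random.gauss` takes the standard deviation σ, not the variance σ², as its second argument.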
3.1.5 Exponential distribution

X has an exponential distribution with rate parameter $\lambda > 0$ when
$p_X(x) = \lambda e^{-\lambda x}$ for $x > 0$, and 0 elsewhere.
Then
$\mu = \frac{1}{\lambda}$ and $\sigma^2 = \frac{1}{\lambda^2}.$

3.1.6 Chi-squared distribution

When $Y_k \sim N(0, 1)$, with $Y_1, \ldots, Y_n$ independent, then $X = \sum_{k=1}^n Y_k^2$ has a chi-squared distribution with n degrees of freedom. We write $X \sim \chi^2_n$, with
$p_X(x) = \frac{x^{n/2-1} e^{-x/2}}{\Gamma(n/2)\, 2^{n/2}}, \quad x > 0$
where $\Gamma(z)$ is the Gamma function. Also,
$\mu = n$ and $\sigma^2 = 2n.$

Properties:
- $X_k \sim \chi^2_{n_k}$, with $X_1, \ldots, X_L$ independent; then $\sum_{k=1}^L X_k \sim \chi^2_{\sum_k n_k}$.
- $\Gamma(z+1) = z\Gamma(z)$, $\Gamma(1) = 1$. So $\Gamma(n+1) = n!$ and $\Gamma(1/2) = \sqrt{\pi}$.

3.1.7 Beta distribution

$X \sim \beta(r, s)$ is defined for $x \in [0, 1]$ with
$p_X(x) = \frac{x^{r-1}(1-x)^{s-1}}{B(r, s)}$
where $B(r, s)$ is the beta function, given by
$B(r, s) = \frac{\Gamma(r)\,\Gamma(s)}{\Gamma(r+s)}.$
Mean and variance are given by $\mu = \frac{r}{r+s}$ and $\sigma^2 = \frac{rs}{(r+s)^2 (r+s+1)}$.

3.1.8 Gamma distribution

$X \sim \Gamma(k, \theta)$ for $x > 0$, $k > 0$ and $\theta > 0$, with
$p_X(x) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\, \theta^k}.$
Mean and variance are given by $\mu = \theta k$ and $\sigma^2 = \theta^2 k$. Somewhat confusingly, sometimes you will see the rate parameterization $\lambda = 1/\theta$, written $X \sim \Gamma(k, \lambda)$ with $p_X(x) = x^{k-1} e^{-\lambda x} \lambda^k / \Gamma(k)$. Beware!
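The chi-squared moments (µ = n, σ² = 2n) can be checked by simulating the defining construction, a sum of n squared standard normals, and this also previews the Gamma connection of Section 4.4: the same moments come out of Γ(k = n/2, θ = 2). A Monte Carlo sketch with invented values for n and the sample count:

```python
import random
import statistics

# Sanity check: a chi-squared RV with n degrees of freedom is a sum of n squared
# N(0,1) RVs, and coincides with Gamma(k = n/2, theta = 2). Both constructions
# should show mean n and variance 2n. (n and trials are arbitrary examples.)
random.seed(3)
n, trials = 6, 100_000
chi2_s = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(trials)]
gamma_s = [random.gammavariate(n / 2, 2.0) for _ in range(trials)]  # shape, scale

for xs in (chi2_s, gamma_s):
    print(abs(statistics.fmean(xs) - n) < 0.1)          # True: mean = n
    print(abs(statistics.pvariance(xs) - 2 * n) < 0.5)  # True: variance = 2n
```

`random.gammavariate(alpha, beta)` uses the shape/scale convention, matching (k, θ) above.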
3.2 Discrete RVs

3.2.1 Kronecker delta distribution

This is the discrete counterpart of the Dirac delta distribution: $p_X(x) = \delta(x)$ with
$\delta(x) = 1$ for $x = 0$, and $\delta(x) = 0$ for $x \neq 0$,
and $\mu = \sigma^2 = 0$.

3.2.2 Bernoulli distribution

There are two possible outcomes, 0 (failure) and 1 (success), with probability $1 - p$ and p, respectively:
$p_X(0) = 1 - p, \quad p_X(1) = p$
with
$\mu = p$ and $\sigma^2 = p(1-p).$

3.2.3 Binomial distribution

X is the number of successes out of N Bernoulli trials:
$p_X(x) = \binom{N}{x} p^x (1-p)^{N-x}$
for $x \in \{0, 1, \ldots, N\}$, where
$\binom{N}{x} = \frac{N!}{x!\,(N-x)!}$
and
$\mu = Np$ and $\sigma^2 = Np(1-p).$

3.2.4 Poisson distribution

Events occur with a known average rate of λ events per unit of time. Then the number of events X in a unit of time has the following distribution for $x \ge 0$:
$p_X(x) = \frac{e^{-\lambda} \lambda^x}{x!}$
with
$\mu = \lambda$ and $\sigma^2 = \lambda.$
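The binomial moments quoted above follow directly from the pmf. A short exact check (N = 10 and p = 0.3 are arbitrary example values):

```python
import math

# Binomial pmf from N Bernoulli trials: p_X(x) = C(N, x) p^x (1-p)^(N-x).
def binom_pmf(x, N, p):
    return math.comb(N, x) * p**x * (1 - p) ** (N - x)

N, p = 10, 0.3  # arbitrary example values
pmf = [binom_pmf(x, N, p) for x in range(N + 1)]

# The pmf sums to 1, and its first two moments match mu = Np, sigma^2 = Np(1-p).
mean = sum(x * pmf[x] for x in range(N + 1))
var = sum((x - mean) ** 2 * pmf[x] for x in range(N + 1))
print(abs(sum(pmf) - 1.0) < 1e-9)         # True
print(abs(mean - N * p) < 1e-9)           # True
print(abs(var - N * p * (1 - p)) < 1e-9)  # True
```

`math.comb` computes the binomial coefficient exactly, so the only error here is floating-point rounding.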
4 Some important relations

4.1 Orthonormal transformation of independent normal RVs

Theorem: Let $Z = [Z_1, Z_2, \ldots, Z_n]^T$ have independent normally distributed elements with the same variance σ² and expected value $E\{Z\} = d$. Let $Y = g(Z) = AZ + c$, where A is an $n \times n$ orthonormal matrix. Then Y has independent normal components with the same variance σ² and $E\{Y\} = Ad + c$.

Proof: We know that
$p_Y(y) = p_Z(A^{-1}(y - c))\, |\det A^{-1}| = p_Z(A^{-1}(y - c))$
since $|\det A| = 1$ for orthonormal matrices. Now, since
$p_Z(z) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (z_i - d_i)^2\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|z - d\|^2}{2\sigma^2}\right)$
we find
$p_Y(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|A^{-1}(y - c) - d\|^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|y - (Ad + c)\|^2}{2\sigma^2}\right)$
where we have used $\|A^{-1}(y - c) - d\| = \|A^{-1}(y - c - Ad)\| = \|y - Ad - c\|$, i.e., the fact that A and $A^{-1}$ are norm-preserving. The result factors into a product of one-dimensional normal densities, which proves that the $Y_k$'s are independent and normally distributed with variance σ² and $E\{Y\} = Ad + c$. QED.

4.2 Statistics of normal random variables

Theorem: Let $Z = [Z_1, \ldots, Z_N]$ be an iid sample from a $N(\mu, \sigma^2)$ population, and let $\bar{Z} = \frac{1}{N}\sum_i Z_i$. Then
1. $\bar{Z}$ and $\sum_i (Z_i - \bar{Z})^2$ are independent;
2. $\bar{Z} \sim N(\mu, \sigma^2/N)$;
3. $\frac{1}{\sigma^2} \sum_i (Z_i - \bar{Z})^2 \sim \chi^2_{N-1}$.

Proof: We introduce an orthonormal matrix A and $Y = AZ$ with $Y_1 = \sqrt{N}\,\bar{Z}$ and $\sum_{i=2}^N Y_i^2 = \sum_i (Z_i - \bar{Z})^2$. This matrix is constructed as follows: select the first row as $[1/\sqrt{N}, 1/\sqrt{N}, \ldots, 1/\sqrt{N}]$; the remaining rows are then obtained by the Gram-Schmidt orthogonalization procedure. Then
$Y_1 = \frac{1}{\sqrt{N}} \sum_i Z_i = \sqrt{N}\,\bar{Z}$
and, since A is norm-preserving ($\|Y\| = \|AZ\| = \|Z\|$),
$\sum_{i=1}^N Y_i^2 = \sum_{i=1}^N Z_i^2.$
We find that
$\sum_{i=2}^N Y_i^2 = \sum_i Z_i^2 - Y_1^2 = \sum_i Z_i^2 - N\bar{Z}^2 = \sum_i (Z_i - \bar{Z})^2.$
Since A is an orthonormal matrix, $a_i^T a_j = \delta_{ij}$, where $a_i^T$ denotes the i-th row of A. Since $a_1^T = [1/\sqrt{N}, \ldots, 1/\sqrt{N}]$, we see that for $j \neq 1$
$a_j^T a_1 = \frac{1}{\sqrt{N}} \sum_i a_{ji} = 0$
so that for $j \neq 1$
$E\{Y_j\} = E\{a_j^T Z\} = a_j^T E\{Z\} = \mu \sum_i a_{ji} = 0.$
We can draw the following conclusions:
- By the theorem of Section 4.1, $Y_1, Y_2, \ldots, Y_N$ are independent normal RVs with variance σ², with $E\{Y_j\} = 0$ for $j \neq 1$ and $E\{Y_1\} = \sqrt{N}\mu$.
- $Y_1 \sim N(\sqrt{N}\mu, \sigma^2)$, so that $\bar{Z} = Y_1/\sqrt{N} \sim N(\mu, \sigma^2/N)$. This proves the second part of the theorem.
- $Y_2, \ldots, Y_N \sim N(0, \sigma^2)$ iid, so that $Y_2/\sigma, \ldots, Y_N/\sigma \sim N(0, 1)$. We see that $\frac{1}{\sigma^2}\sum_{i=2}^N Y_i^2 \sim \chi^2_{N-1}$. This proves the third part of the theorem.
- Since $Y_2, \ldots, Y_N$ are independent of $Y_1$, it follows that $\bar{Z}$ is independent of $\sum_i (Z_i - \bar{Z})^2$. This proves the first part of the theorem.

4.3 Chi-squared distribution

Problem: $Z \sim N(\mu, \sigma^2)$. Determine the distribution of $X = \left(\frac{Z - \mu}{\sigma}\right)^2$. Relate the result to the $\chi^2_1$ distribution.

Solution:
Let $Y = \frac{Z - \mu}{\sigma}$, with $Y \sim N(0, 1)$. Since $X = Y^2$ is a non-invertible function, we cannot use the Jacobian. However,
$F_X(x) = P\{X \le x\} = P\{Y^2 \le x\} = P\{-\sqrt{x} \le Y \le \sqrt{x}\} = F_Y(\sqrt{x}) - F_Y(-\sqrt{x}).$
Taking the derivative w.r.t. x, and noting that $p_Y(y)$ is an even function:
$p_X(x) = p_Y(\sqrt{x})\,\frac{1}{2\sqrt{x}} + p_Y(-\sqrt{x})\,\frac{1}{2\sqrt{x}} = \frac{p_Y(\sqrt{x})}{\sqrt{x}} = \frac{1}{\sqrt{2\pi x}} \exp\left(-\frac{x}{2}\right).$
If we consider the $\chi^2_1$ distribution, this gives us the same result:
$\frac{x^{-1/2} e^{-x/2}}{\Gamma(1/2)\, 2^{1/2}} = \frac{e^{-x/2}}{\sqrt{2\pi x}}.$

4.4 Relation between Chi-squared and Gamma distribution

Problem: How are the gamma distribution and the $\chi^2$ distribution related?

Solution: $\Gamma(k, \theta)$ with $k = n/2$ and $\theta = 2$ yields
$p_X(x) = \frac{x^{n/2-1} e^{-x/2}}{\Gamma(n/2)\, 2^{n/2}}$
which is clearly equal to the $\chi^2_n$ distribution.

4.5 Relation between Gamma and Beta distributions

Problem: $X_1 \sim \Gamma(k_1, \theta)$ and $X_2 \sim \Gamma(k_2, \theta)$, independent.
(A) Determine the distribution of $Y_1 = X_1 + X_2$ and of $Y_2 = X_1/(X_1 + X_2)$.
(B) Show that $Y_1$ and $Y_2$ are independent.
(C) Show from this result that the sum of squared iid zero-mean unit-variance normal RVs has a $\chi^2$ distribution.

Solution:
(A) Since we can always introduce $Z_i = X_i/\theta$, with $Z_i \sim \Gamma(k_i, 1)$, we can assume $\theta = 1$ without loss of generality. Since $X_1$ and $X_2$ are independent,
$p_{X_1,X_2}(x_1,x_2) = \frac{x_1^{k_1-1} e^{-x_1}}{\Gamma(k_1)} \cdot \frac{x_2^{k_2-1} e^{-x_2}}{\Gamma(k_2)}.$
Since $Y_1 = X_1 + X_2$ and $Y_2 = X_1/(X_1 + X_2)$, we have $X_1 = Y_1 Y_2$ and $X_2 = Y_1(1 - Y_2)$. Hence
$J(y) = \begin{bmatrix} y_2 & y_1 \\ 1 - y_2 & -y_1 \end{bmatrix}$
and $|\det J(y)| = |-y_1 y_2 - y_1(1 - y_2)| = y_1$. We find that
$p_{Y_1,Y_2}(y_1,y_2) = p_{X_1,X_2}(y_1 y_2,\, y_1(1 - y_2))\, y_1 = \frac{(y_1 y_2)^{k_1-1}\,(y_1(1-y_2))^{k_2-1}\, e^{-y_1}}{\Gamma(k_1)\,\Gamma(k_2)}\, y_1$
$= \frac{y_1^{k_1+k_2-1}\, e^{-y_1}}{\Gamma(k_1+k_2)} \cdot \frac{\Gamma(k_1+k_2)}{\Gamma(k_1)\,\Gamma(k_2)}\, y_2^{k_1-1}(1-y_2)^{k_2-1} = \frac{y_1^{k_1+k_2-1}\, e^{-y_1}}{\Gamma(k_1+k_2)} \cdot \frac{y_2^{k_1-1}(1-y_2)^{k_2-1}}{B(k_1, k_2)}$
so that $Y_1 \sim \Gamma(k_1 + k_2, 1)$ and $Y_2 \sim \beta(k_1, k_2)$.
(B) follows from (A), since $p_{Y_1,Y_2}(y_1,y_2)$ can be written as $p_{Y_1}(y_1)\, p_{Y_2}(y_2)$.
(C) When $Y_l \sim N(0, 1)$ iid and $X_l = Y_l^2$, then $X_l \sim \chi^2_1 = \Gamma(1/2, 2)$. By repeated application of (A), $\sum_{l=1}^n X_l \sim \Gamma(n/2, 2) = \chi^2_n$.

4.6 Statistics from normal RVs

Problem: $X = [X_1, \ldots, X_N]$ is a vector of iid RVs with $X_k \sim N(\mu, \sigma^2)$. Compute the expected value of the following RVs:
(A) the sample mean $M = \frac{1}{N} \sum_k X_k$;
(B) the sample variance with known mean, $R = \frac{1}{N} \sum_k (X_k - \mu)^2$;
(C) the sample variance with unknown mean, $S = \frac{1}{N} \sum_k (X_k - M)^2$.

Solution:
(A)
$E_X\{M\} = \frac{1}{N} \sum_k E\{X_k\} = \mu.$
(B)
$E_X\{R\} = \frac{1}{N} \sum_k E\{(X_k - \mu)^2\} = \frac{1}{N} \sum_k E\{X_k^2 - 2\mu X_k + \mu^2\} = \frac{1}{N} \sum_k (\mu^2 + \sigma^2 - 2\mu^2 + \mu^2) = \sigma^2.$
(C)
$E_X\{S\} = \frac{1}{N}\, E_X\left\{ \sum_k (X_k - M)^2 \right\} = \frac{1}{N}\, E_X\left\{ \sum_k X_k^2 - 2M \sum_k X_k + N M^2 \right\} = \frac{1}{N}\, E_X\left\{ \sum_k X_k^2 - N M^2 \right\}$
$= E\{X_k^2\} - E\{M^2\} = (\mu^2 + \sigma^2) - \left(\mu^2 + \frac{\sigma^2}{N}\right) = \frac{N-1}{N}\, \sigma^2.$
Verify that $E\{X_k^2\} = \mu^2 + \sigma^2$ and that $E\{M^2\} = \frac{\sigma^2}{N} + \mu^2$.
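The bias result $E\{S\} = \frac{N-1}{N}\sigma^2$ (which also follows from part 3 of the theorem in Section 4.2, since $S = \frac{\sigma^2}{N}\cdot\chi^2_{N-1}$ in distribution) can be verified by simulation. A Monte Carlo sketch with invented parameter values:

```python
import random
import statistics

# Monte Carlo check of E{S} = (N-1)/N * sigma^2 for the sample variance computed
# with the sample mean M in place of the unknown mean mu.
# mu, sigma, N and the trial count are arbitrary example values.
random.seed(6)
mu, sigma, N, trials = 0.5, 3.0, 5, 100_000
s_values = []
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(N)]
    m = sum(x) / N
    s_values.append(sum((xk - m) ** 2 for xk in x) / N)

expected = (N - 1) / N * sigma**2  # 7.2 for N = 5, sigma = 3
print(abs(statistics.fmean(s_values) - expected) < 0.1)  # True
```

This is exactly why the unbiased sample variance divides by N − 1 rather than N.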