MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2




Department of Mathematics and Statistics, Boston University

1 Revision: Probability Theory

1.1 Random Variables

A real-valued random variable is a function from a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ to a given codomain $(\mathbb{R}, \mathcal{B})$. (The precise meanings of these spaces are not important for the remainder of this course.) Strictly speaking, therefore, a value or realization of that function can be written, for any $\omega \in \Omega$, as

$$X(\omega) = x.$$

For notational convenience, we often omit any reference to the sample space, $\Omega$. However, we still systematically distinguish a random variable from one of its realizations, by using upper case for the former and lower case for the latter. A realization of a random variable is also referred to as an observed value. Note that this upper/lower case convention is not respected by Weisberg, in the main textbook for this course.

1.2 Expectation Operator

The expectation or expected value of a real-valued random variable (rv) $X$, with probability density function (pdf) $p(x)$ defined over the real line $\mathbb{R}$, is given by

$$\mathbb{E}[X] := \int_{\mathbb{R}} x \, p(x) \, dx.$$

This corresponds to the average value of $X$ over its domain. If the random variable $X$ were instead defined over a discrete space $\mathcal{X}$, we would compute its expected value by replacing the integral with a summation, as follows,

$$\mathbb{E}[X] := \sum_{x \in \mathcal{X}} x \, \mathbb{P}[X = x].$$

Since the expectation operator, $\mathbb{E}$, takes into account all of the values of a random variable, it acts on an upper case $X$, and not on a single realization, $x$. A similar notation is adopted for the variance and covariance operators, below.

Crucially, the expectation operator, $\mathbb{E}[X]$, is a linear function of $X$. For any given real numbers $\alpha, \beta \in \mathbb{R}$, we have

$$\mathbb{E}[\alpha + \beta X] = \alpha + \beta \, \mathbb{E}[X].$$
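The discrete definition of the expectation and the linearity property above can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die as the random variable; the helper names `expectation` and `affine`, and the values of `alpha` and `beta`, are illustrative choices that do not come from the lecture notes.

```python
from fractions import Fraction


def expectation(pmf):
    """E[X] = sum over x of x * P[X = x], for a pmf given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def affine(pmf, alpha, beta):
    """pmf of alpha + beta * X (beta != 0 keeps the transformed values distinct)."""
    return {alpha + beta * x: p for x, p in pmf.items()}


# Hypothetical example: a fair six-sided die, with exact rational probabilities.
die = {x: Fraction(1, 6) for x in range(1, 7)}

alpha, beta = 2, 3  # arbitrary illustrative constants
lhs = expectation(affine(die, alpha, beta))  # E[alpha + beta X]
rhs = alpha + beta * expectation(die)        # alpha + beta E[X]

print(expectation(die), lhs, rhs)
```

Exact fractions are used so that both sides can be compared with `==` rather than up to rounding error.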

This linear relationship can be extended to any linear combination of random variables, $X_1, \ldots, X_n$, such that if we are interested in the expectation of $\alpha + \sum_{i=1}^{n} \beta X_i$, we obtain

$$\mathbb{E}\Big[\alpha + \sum_{i=1}^{n} \beta X_i\Big] = \alpha + \beta \sum_{i=1}^{n} \mathbb{E}[X_i].$$

The same argument applies with distinct coefficients $\beta_i$ in place of the common $\beta$.

1.3 Variance Operator

The variance of a random variable $X$ is defined as the expected squared difference between the observed values of $X$ and its mean value,

$$\operatorname{Var}[X] := \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - \mathbb{E}[X]^2.$$

Since the Euclidean distance between two real numbers, $a$ and $b$, is defined as $d(a, b) := |a - b|$, one may geometrically interpret the variance as the average squared distance of the $x$'s from the mean value, $\mathbb{E}[X]$.

The variance operator is non-linear. Given a linear combination of uncorrelated random variables, $\alpha_0 + \sum_{i=1}^{n} \beta_i X_i$, we have

$$\operatorname{Var}\Big[\alpha_0 + \sum_{i=1}^{n} \beta_i X_i\Big] = \sum_{i=1}^{n} \beta_i^2 \operatorname{Var}[X_i].$$

The first term, $\alpha_0$, has been eliminated because the variance of a constant is nil, since $\mathbb{E}[\alpha_0] = \alpha_0$. Finally, the standard deviation of a variable $X$ is defined as

$$\sigma_X := \sqrt{\operatorname{Var}[X]}.$$

1.4 Covariance and Correlation

The covariance of two random variables, $X$ and $Y$, is defined as the expected product of the differences between the observed values of these two random variables and their respective mean values,

$$\operatorname{Cov}[X, Y] := \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \operatorname{Cov}[Y, X],$$

since the covariance can be seen to be symmetric, by inspection. This quantity describes how two random variables vary jointly. As for the variance operator, the covariance is non-linear, such that for any affine transformations of two random variables $X$ and $Y$, we obtain

$$\operatorname{Cov}[\alpha_x + \beta_x X, \alpha_y + \beta_y Y] = \beta_x \beta_y \operatorname{Cov}[X, Y].$$

Also, observe that the covariance operator is a generalization of the variance, since $\operatorname{Cov}[X, X] = \operatorname{Var}[X]$.

The Pearson product-moment correlation coefficient between random variables $X$ and $Y$ is defined as the covariance between $X$ and $Y$, standardized by the product of their respective standard deviations,

$$\rho(X, Y) := \frac{\operatorname{Cov}[X, Y]}{\sqrt{\operatorname{Var}[X] \operatorname{Var}[Y]}}.$$

The correlation coefficient is especially valuable because (i) it does not depend on units of measurement, and (ii) its values lie between $-1$ and $1$. The latter can be easily proved by an application of the Cauchy-Schwarz inequality, which states that

$$\langle x, y \rangle^2 \leq \langle x, x \rangle \langle y, y \rangle, \quad \text{for any } x, y \in \mathbb{R}^d.$$

Applied to the centered variables with the inner product $\langle U, V \rangle := \mathbb{E}[UV]$, this gives $\operatorname{Cov}[X, Y]^2 \leq \operatorname{Var}[X] \operatorname{Var}[Y]$.
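The covariance identities of this section can be verified exactly on a small discrete example. The sketch below assumes a hypothetical joint distribution on four points; the names `expect`, `cov` and `joint`, and the coefficients `a_x`, `b_x`, `a_y`, `b_y`, are introduced for illustration only.

```python
from fractions import Fraction
from math import sqrt


def expect(joint, g):
    """E[g(X, Y)] for a joint pmf given as {(x, y): probability}."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def cov(joint, g, h):
    """Cov[g, h] = E[(g - E[g])(h - E[h])] for two functions of (X, Y)."""
    mg, mh = expect(joint, g), expect(joint, h)
    return expect(joint, lambda x, y: (g(x, y) - mg) * (h(x, y) - mh))


# Hypothetical joint distribution on four points (probabilities sum to 1).
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(2, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(4, 10),
}


def X(x, y):
    return x


def Y(x, y):
    return y


var_x = cov(joint, X, X)   # Cov[X, X] = Var[X]
var_y = cov(joint, Y, Y)
cov_xy = cov(joint, X, Y)

# Affine transformations: Cov[a_x + b_x X, a_y + b_y Y] = b_x b_y Cov[X, Y].
a_x, b_x, a_y, b_y = 5, -2, 1, 3  # arbitrary illustrative constants
cov_affine = cov(joint, lambda x, y: a_x + b_x * x, lambda x, y: a_y + b_y * y)

# Pearson correlation, computed in floating point because of the square root.
rho = float(cov_xy) / sqrt(float(var_x) * float(var_y))

print(var_x, var_y, cov_xy, cov_affine, rho)
```

The additive constants `a_x` and `a_y` drop out of the covariance, and the correlation stays within the bounds given by the Cauchy-Schwarz inequality.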

1.5 Random Samples

We will often consider random samples from a given population, whose moments are controlled by some unknown parameters, such as a mean $\mu$ and a variance $\sigma^2$. Such a sample is commonly denoted as follows,

$$X_i \overset{\text{iid}}{\sim} f(\mu, \sigma^2), \quad i = 1, \ldots, n,$$

where we use an upper case on the left-hand side, for every index $i$ from $1$ to $n$. The variance of a linear combination of random variables is determined by the relationships between the random variables in that sequence.

i. Independent and identically distributed (IID) random variables:

$$\operatorname{Var}\Big[\alpha_0 + \sum_{i=1}^{n} \beta_i X_i\Big] = \sigma^2 \sum_{i=1}^{n} \beta_i^2.$$

ii. Independent, but not identically distributed random variables:

$$\operatorname{Var}\Big[\alpha_0 + \sum_{i=1}^{n} \beta_i X_i\Big] = \sum_{i=1}^{n} \beta_i^2 \operatorname{Var}[X_i].$$

iii. Neither independent, nor identically distributed random variables:

$$\operatorname{Var}\Big[\alpha_0 + \sum_{i=1}^{n} \beta_i X_i\Big] = \sum_{i=1}^{n} \beta_i^2 \operatorname{Var}[X_i] + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \beta_i \beta_j \operatorname{Cov}[X_i, X_j],$$

where the factor of two in the second term comes from the fact that the matrix of covariances is symmetric.

1.6 Conditional Moments

Any conditional expectation is itself a random variable. Given two random variables $Y$ and $X$, where $X$ takes values in some (metric) space $\mathcal{X}$ with $\sigma$-algebra $\mathcal{B}_{\mathcal{X}}$,

$$X : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathcal{X}, \mathcal{B}_{\mathcal{X}}),$$

the conditional expectation of $Y$ given $X$ corresponds to the mapping

$$\mathbb{E}_{Y|X}[Y \mid X] : (\Omega, \mathcal{F}, \mathbb{P}) \to (\mathbb{R}, \mathcal{B}).$$

Thus, for any $\omega \in \Omega$ with $X(\omega) = x$, we have the following equivalence,

$$\mathbb{E}_{Y|X}[Y \mid X = X(\omega)] = \mathbb{E}_{Y|X}[Y \mid X = x].$$

The essential rules governing the manipulation of conditional moments are as follows:

1. The law of total probability:

$$\mathbb{P}[Y = y] = \sum_{x \in \mathcal{X}} \mathbb{P}[Y = y \mid X = x] \, \mathbb{P}[X = x].$$

2. The law of total expectation, or tower rule:

$$\mathbb{E}_Y[Y] = \mathbb{E}_X\big[\mathbb{E}_{Y|X}[Y \mid X]\big] = \int_{\mathcal{X}} \mathbb{E}_{Y|X}[Y \mid X = x] \, p_X(x) \, dx,$$

where $p_X$ denotes the pdf of $X$.

3. The law of total variance, or variance decomposition:

$$\operatorname{Var}_Y[Y] = \mathbb{E}_X\big[\operatorname{Var}_{Y|X}[Y \mid X]\big] + \operatorname{Var}_X\big[\mathbb{E}_{Y|X}[Y \mid X]\big].$$
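The three rules above can be checked by direct enumeration when both variables are discrete. The following sketch assumes a hypothetical joint pmf on four points; all names (`joint`, `cond`, `cond_mean`, and so on) are illustrative and do not appear in the lecture notes.

```python
from fractions import Fraction

# Hypothetical joint pmf P[X = x, Y = y], stored as {(x, y): probability}.
joint = {
    (0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
    (1, 1): Fraction(1, 4), (1, 5): Fraction(1, 4),
}


def mean(pmf):
    """Expectation of a pmf given as {value: probability}."""
    return sum(v * p for v, p in pmf.items())


def variance(pmf):
    """Variance of a pmf given as {value: probability}."""
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())


# Marginal pmfs of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

# Conditional pmf of Y given X = x: P[Y = y | X = x] = P[X = x, Y = y] / P[X = x].
cond = {x: {y: p / p_x[x] for (xx, y), p in joint.items() if xx == x} for x in p_x}

# E[Y | X = x] and Var[Y | X = x], viewed as functions of x.
cond_mean = {x: mean(cond[x]) for x in p_x}
cond_var = {x: variance(cond[x]) for x in p_x}

# 1. Law of total probability: P[Y = y] = sum_x P[Y = y | X = x] P[X = x].
total_prob_ok = all(
    p_y[y] == sum(cond[x].get(y, 0) * p_x[x] for x in p_x) for y in p_y
)

# 2. Law of total expectation: E[Y] = E[E[Y | X]].
e_cond_mean = sum(cond_mean[x] * p_x[x] for x in p_x)

# 3. Law of total variance: Var[Y] = E[Var[Y | X]] + Var[E[Y | X]].
e_cond_var = sum(cond_var[x] * p_x[x] for x in p_x)
var_cond_mean = sum((cond_mean[x] - e_cond_mean) ** 2 * p_x[x] for x in p_x)

print(mean(p_y), e_cond_mean, variance(p_y), e_cond_var + var_cond_mean)
```

Here the outer expectation over $X$ is a finite sum, which is the discrete counterpart of the integral in the second rule.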

2 Revision: Linear Algebra

2.1 Basic Terminology

1. A matrix $X$ is a rectangular array of elements, $(X)_{ij} = x_{ij}$.
2. A matrix $X$ is of order $r \times c$ if it has $r$ rows and $c$ columns, with $i = 1, \ldots, r$ and $j = 1, \ldots, c$.
3. A column vector is a matrix of order $r \times 1$. (All vectors used in this course will be assumed to be column vectors.)
4. A matrix is said to be square if $r = c$.
5. A square matrix is symmetric if $x_{ij} = x_{ji}$ for all $i, j = 1, \ldots, r$.
6. A square matrix is diagonal if $x_{ij} = 0$ for every $i \neq j$.
7. The $n \times n$ diagonal matrix whose diagonal elements are all equal to 1 is the identity matrix, and is denoted $I_n$.
8. A scalar is a matrix of order $1 \times 1$, or more precisely an element of the field supporting the vector space under scrutiny.
9. Two matrices, $A$ and $B$, of respective orders $r_A \times c_A$ and $r_B \times c_B$, are said to be conformable if their orders are suitable for performing a given operation. For instance, when performing addition on $A$ and $B$, we require that $r_A = r_B$ and $c_A = c_B$.

2.2 Unary and Binary Operations

Matrix Addition and Subtraction

1. Matrix addition and subtraction are only conducted between conformable matrices, such that if $C = A + B$, it follows that all three matrices have the same order.
2. These two operations are performed elementwise.
3. Addition is commutative, $A + B = B + A$. Subtraction is not, since $A - B = -(B - A)$.
4. Addition is associative, $(A + B) + C = A + (B + C)$.

Matrix Multiplication

1. Matrix multiplication is a binary operation, which takes two matrices as arguments.
2. The matrix product of $A$ and $B$, of respective orders $r \times t$ and $t \times c$, is a matrix $C$ of order $r \times c$, with elements
$$(C)_{ij} = (AB)_{ij} = \sum_{k=1}^{t} a_{ik} b_{kj}.$$
3. Matrix multiplication is not commutative. That is, in general, $AB \neq BA$.
4. Matrix multiplication is associative, such that $A(BC) = (AB)C$. (However, these two sequences of multiplications may differ in their computational complexity, i.e. in their number of computational steps.)

Matrix Transpose

1. Matrix transposition is a unary operation. It takes a single matrix as argument.
2. The transpose of a matrix $A$ of order $r \times c$ is the matrix $A^T$ of order $c \times r$, with $(A^T)_{ij} = (A)_{ji}$.
3. The transpose of a product is equal to the product of the transposes in opposite order, such that $(AB)^T = B^T A^T$.
4. The dot product or (Euclidean) inner product of two vectors is obtained by transposing one of them, irrespective of the order, such that for any two vectors $x$ and $y$ of order $n \times 1$, we have
$$x^T y = \sum_{i=1}^{n} x_i y_i = \sum_{i=1}^{n} y_i x_i = y^T x. \tag{1}$$
5. The norm or length of a vector $x$ is given by $\|x\| = (x^T x)^{1/2}$.
6. The inner product is distributive with respect to vector addition:
$$(x - y)^T (x - y) = x^T x - 2 x^T y + y^T y.$$

Matrix Inverse

1. An $n \times n$ square matrix $A$ is said to be invertible (or non-singular) if there exists an $n \times n$ matrix $B$ such that $AB = BA = I_n$.
2. The inverse of a diagonal matrix $D$ is obtained by inverting its diagonal elements elementwise,
$$(D^{-1})_{ii} = (D_{ii})^{-1}.$$

2.3 Random Vectors

Finally, we can combine the tools of probability theory with those of linear algebra, in order to consider the moments of a random vector. In the lecture notes, random vectors will be denoted by bold lower case letters. This differs from the convention adopted by Weisberg in the textbook. Thus, a random vector $\mathbf{y}$ is defined as

$$\mathbf{y} := [y_1, \ldots, y_n]^T.$$

The expectation of a (column) random vector $\mathbf{y}$ of order $n \times 1$ is given by applying the expectation operator elementwise,

$$\mathbb{E}[\mathbf{y}] = \begin{pmatrix} \mathbb{E}[y_1] \\ \mathbb{E}[y_2] \\ \vdots \\ \mathbb{E}[y_n] \end{pmatrix}.$$

As for single random variables, the expectation of a random vector is linear. For convenience, let $\mathbf{a}_0$ and $\mathbf{y}$ be $n \times 1$ vectors, and let $A$ be an $n \times n$ matrix. Here, $\mathbf{a}_0$ and $A$ are non-random, whereas $\mathbf{y}$ is a random vector, as above. Then,

$$\mathbb{E}[\mathbf{a}_0 + A\mathbf{y}] = \mathbf{a}_0 + A \, \mathbb{E}[\mathbf{y}],$$

which can be verified by observing that for $i = 1, \ldots, n$, we have

$$(\mathbb{E}[A\mathbf{y}])_i = \mathbb{E}\Big[\sum_{j=1}^{n} A_{ij} y_j\Big] = \sum_{j=1}^{n} A_{ij} \, \mathbb{E}[y_j] = (A \, \mathbb{E}[\mathbf{y}])_i.$$

Moreover, we can define the variance (or covariance) matrix of a random vector $\mathbf{y}$ of order $n \times 1$, denoted $\operatorname{Var}[\mathbf{y}]$, as a matrix of order $n \times n$ with the following diagonal entries,

$$(\operatorname{Var}[\mathbf{y}])_{ii} = \operatorname{Var}[y_i],$$

and non-diagonal entries,

$$(\operatorname{Var}[\mathbf{y}])_{ij} = \operatorname{Cov}[y_i, y_j].$$

Thus, altogether, the variance/covariance matrix of the random vector $\mathbf{y}$ is given by an outer product,

$$\operatorname{Var}[\mathbf{y}] := \mathbb{E}\big[(\mathbf{y} - \mathbb{E}[\mathbf{y}])(\mathbf{y} - \mathbb{E}[\mathbf{y}])^T\big].$$

Explicitly, this gives the matrix

$$\operatorname{Var}[\mathbf{y}] = \begin{pmatrix}
\operatorname{Var}[y_1] & \operatorname{Cov}[y_1, y_2] & \cdots & \operatorname{Cov}[y_1, y_n] \\
\operatorname{Cov}[y_2, y_1] & \operatorname{Var}[y_2] & \cdots & \operatorname{Cov}[y_2, y_n] \\
\vdots & & \ddots & \operatorname{Cov}[y_{n-1}, y_n] \\
\operatorname{Cov}[y_n, y_1] & \cdots & \operatorname{Cov}[y_n, y_{n-1}] & \operatorname{Var}[y_n]
\end{pmatrix}.$$

In contrast to the expectation of a random vector, the variance of the transformed version of $\mathbf{y}$ is non-linear, since we have

$$\operatorname{Var}[\mathbf{a}_0 + A\mathbf{y}] = A \operatorname{Var}[\mathbf{y}] A^T.$$

By extension, the covariance of two random vectors $\mathbf{y}$ and $\mathbf{z}$ of order $n \times 1$ is given by a matrix of order $n \times n$,

$$\operatorname{Cov}[\mathbf{y}, \mathbf{z}] = \mathbb{E}\big[(\mathbf{y} - \mathbb{E}[\mathbf{y}])(\mathbf{z} - \mathbb{E}[\mathbf{z}])^T\big]. \tag{2}$$

Moreover, both the variance and covariance of random vectors satisfy the classical decompositions of the variance and covariance operators for real-valued random variables. For two univariate random variables, $X$ and $Y$, recall that we have

$$\operatorname{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2, \quad \text{and} \quad \operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y].$$

Similarly, for two $n$-dimensional random vectors $\mathbf{x}$ and $\mathbf{y}$, we have

$$\operatorname{Var}[\mathbf{x}] = \mathbb{E}[\mathbf{x}\mathbf{x}^T] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{x}]^T, \quad \text{and} \quad \operatorname{Cov}[\mathbf{x}, \mathbf{y}] = \mathbb{E}[\mathbf{x}\mathbf{y}^T] - \mathbb{E}[\mathbf{x}]\mathbb{E}[\mathbf{y}]^T.$$

3 Three Types of Independence

3.1 Probabilistic Independence

Given some probability space, two events $A, B \subseteq \Omega$ are independent when

$$\mathbb{P}[A \cap B] = \mathbb{P}[A]\,\mathbb{P}[B].$$

Two random variables are independent when their cumulative distribution functions (CDFs) satisfy

$$F_{X,Y}(x, y) = F_X(x) F_Y(y),$$

or equivalently, in terms of their probability density functions (pdfs),

$$f_{X,Y}(x, y) = f_X(x) f_Y(y).$$
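The moment identities for random vectors in Section 2.3, namely $\mathbb{E}[\mathbf{a}_0 + A\mathbf{y}] = \mathbf{a}_0 + A\,\mathbb{E}[\mathbf{y}]$ and $\operatorname{Var}[\mathbf{a}_0 + A\mathbf{y}] = A \operatorname{Var}[\mathbf{y}] A^T$, can be checked exactly on a small example. The sketch below assumes a hypothetical random vector in $\mathbb{R}^2$ taking three equally likely values; the helpers `matmul`, `transpose`, `mean_vector` and `var_matrix` are illustrative names, and matrices are stored as plain lists of rows.

```python
from fractions import Fraction


def matmul(A, B):
    """(AB)_ij = sum_k a_ik b_kj, for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """(A^T)_ij = (A)_ji."""
    return [list(col) for col in zip(*A)]


def mean_vector(points, probs):
    """E[y], computed elementwise over the support of a discrete random vector."""
    n = len(points[0])
    return [sum(p * y[i] for y, p in zip(points, probs)) for i in range(n)]


def var_matrix(points, probs):
    """Var[y] = E[(y - E[y])(y - E[y])^T], computed entry by entry."""
    mu = mean_vector(points, probs)
    n = len(mu)
    return [
        [sum(p * (y[i] - mu[i]) * (y[j] - mu[j]) for y, p in zip(points, probs))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical discrete random vector y in R^2: three equally likely points.
points = [[0, 0], [2, 1], [1, 4]]
probs = [Fraction(1, 3)] * 3

# Non-random shift a0 and matrix A (arbitrary illustrative values).
a0 = [1, -1]
A = [[2, 1], [0, 3]]

# Realizations of the transformed vector z = a0 + A y.
transformed = [
    [a0[i] + sum(A[i][j] * y[j] for j in range(2)) for i in range(2)]
    for y in points
]

mu_y = mean_vector(points, probs)
lhs_mean = mean_vector(transformed, probs)                                  # E[a0 + A y]
rhs_mean = [a0[i] + sum(A[i][j] * mu_y[j] for j in range(2)) for i in range(2)]

lhs_var = var_matrix(transformed, probs)                                    # Var[a0 + A y]
rhs_var = matmul(matmul(A, var_matrix(points, probs)), transpose(A))        # A Var[y] A^T

print(lhs_mean, rhs_mean)
print(lhs_var, rhs_var)
```

The shift `a0` affects the mean but not the variance matrix, and the resulting variance matrix is symmetric, as expected from its definition as an outer product.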

3.2 Statistical Independence (Uncorrelated)

Two random variables are said to be statistically independent or uncorrelated when their covariance is nil, $\operatorname{Cov}[X, Y] = 0$. Clearly, if $X$ and $Y$ are probabilistically independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$, and therefore

$$\operatorname{Cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0.$$

The converse does not hold in general.

3.3 Linear Independence

Two vectors of $n$ realizations, $\mathbf{x}$ and $\mathbf{y}$, from two random variables $X$ and $Y$ are linearly independent when there do not exist coefficients $\alpha_x, \alpha_y \in \mathbb{R}$, not both zero, such that

$$\alpha_x \mathbf{x} + \alpha_y \mathbf{y} = \mathbf{0},$$

where $\mathbf{0}$ is the $n$-dimensional vector with zero entries.

If two random variables are statistically independent, then any sequence of realizations from these random variables has a low probability of yielding two linearly dependent vectors of realizations. Making such a statement precise, however, would require an appeal to much more probability theory than is required for this course. These observations provide us with a more precise interpretation of covariance as a measure of linear dependence: the larger the estimated covariance, the larger the probability of obtaining two vectors of realizations that are linearly dependent.
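The distinction between the three notions can be illustrated on a small example. The sketch below assumes a hypothetical pair in which $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$: the two variables are uncorrelated, yet they are not probabilistically independent. The helper `linearly_dependent` is also an illustrative construction: it tests two vectors of realizations for linear dependence through the equality case of the Cauchy-Schwarz inequality, and it assumes integer (or otherwise exact) entries.

```python
from fractions import Fraction

# Hypothetical example: X uniform on {-1, 0, 1} and Y = X^2.
joint = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expect(g):
    """E[g(X, Y)] under the joint pmf defined above."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


# Cov[X, Y] = E[XY] - E[X] E[Y].
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Probabilistic independence would require P[X = 0, Y = 0] = P[X = 0] P[Y = 0].
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)


def dot(u, v):
    """Euclidean inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def linearly_dependent(x, y):
    """Two vectors are linearly dependent iff Cauchy-Schwarz holds with equality."""
    return dot(x, y) ** 2 == dot(x, x) * dot(y, y)


print(cov_xy, p_joint, p_x0 * p_y0)
print(linearly_dependent([1, 2, 3], [2, 4, 6]))
print(linearly_dependent([1, 2, 3], [1, 0, 1]))
```

With floating-point data, the exact equality in `linearly_dependent` would need to be replaced by a comparison up to a tolerance.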
