Review (Probability & Linear Algebra)
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
M. Soleymani
Outline
- Axioms of probability theory
- Conditional probability, joint probability, Bayes rule
- Discrete and continuous random variables
- Probability mass and density functions
- Expected value, variance, standard deviation
- Expectation for two variables: covariance, correlation
- Some probability distributions: Gaussian distribution
- Linear algebra
Basic Probability Elements
- Sample space (Ω): the set of all possible outcomes (or worlds); outcomes are assumed to be mutually exclusive.
- An event is a set of outcomes (i.e., a subset of Ω).
- A random variable is a function defined over the sample space, e.g.:
  - Gender: {male, female}
  - Height: ℝ
Probability Space
A probability space is defined as a triple (Ω, F, P):
- A sample space Ω that contains the set of all possible outcomes (outcomes also called states of nature).
- A set F whose elements are called events; the events are subsets of Ω, and F should be a Borel field.
- A probability measure P that assigns probabilities to events.
Probability and Logic
- Probability theory can be viewed as a generalization of propositional logic.
- Probabilistic logic: a is a propositional logic sentence; the belief of an agent in a is no longer restricted to true, false, unknown: P(a) can range from 0 to 1.
Probability Axioms (Kolmogorov)
Axioms define a reasonable theory of uncertainty. Kolmogorov's probability axioms (propositional notation):
- P(a) ≥ 0
- P(a) = 1 if a is a tautology (P(True) = 1)
- P(a ∨ b) = P(a) + P(b) − P(a ∧ b)
Random Variables
- Random variables: variables in probability theory.
- Domain of random variables: Boolean, discrete, or continuous.
- Probability distribution: the function describing the probabilities of the possible values of a random variable, e.g., P(X = 1), P(X = 2), …
Random Variables
- A random variable is a function that maps every outcome in Ω to a real (or complex) number.
- This lets us define probabilities (and expectation, variance, …) easily as functions on real numbers.
- Example: a coin flip mapped to the real line: Head ↦ 0, Tail ↦ 1.
Probabilistic Inference
- Joint probability distribution: specifies the probability of every atomic event; every probability can be found from it (by summing over atomic events).
- Prior and posterior probabilities: belief in the absence or presence of evidence.
- Bayes rule: used when it is difficult to compute P(a|b) directly but we have information about P(b|a).
- Independence: new evidence may be irrelevant.
Joint Probability Distribution
- The probability of all combinations of the values of a set of random variables.
- If two or more random variables are considered together, they can be described in terms of their joint probability.
- Example: joint probability of features (X₁, X₂, …, X_d).
Prior and Posterior Probabilities
- Prior or unconditional probabilities of propositions: belief in the absence of any other evidence, e.g., P(a) = 0.14, P(b) = 0.6.
- Posterior or conditional probabilities: belief in the presence of evidence; P(a|b) is the probability of a given that b is true.
Conditional Probability
- P(a|b) = P(a, b) / P(b), if P(b) > 0
- Product rule (an alternative formulation): P(a, b) = P(a|b) P(b) = P(b|a) P(a)
- P(·|b) obeys the same rules as probabilities, e.g., P(a|b) + P(¬a|b) = 1
Conditional Probability
- For statistically dependent variables, knowing the value of one variable may allow us to better estimate the other.
- All probabilities are in effect conditional probabilities, e.g., P(a) = P(a | Ω).
- Conditioning on b renormalizes the probabilities of the events that jointly occur with b.
Conditional Probability: Example
Rolling a fair die:
- a: the outcome is an even number
- b: the outcome is a prime number
P(a|b) = P(a, b) / P(b) = (1/6) / (1/2) = 1/3
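The dice computation above can be checked by brute-force enumeration of the six equally likely outcomes; a minimal sketch:

```python
# Verify P(a | b) = P(a, b) / P(b) for the fair-die example by enumeration.
outcomes = [1, 2, 3, 4, 5, 6]
even = {2, 4, 6}      # event a
prime = {2, 3, 5}     # event b

p_b = sum(1 for o in outcomes if o in prime) / 6          # P(b) = 1/2
p_ab = sum(1 for o in outcomes if o in even & prime) / 6  # P(a, b) = 1/6
p_a_given_b = p_ab / p_b
print(p_a_given_b)  # 1/3
```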
Chain Rule
The chain rule is derived by successive application of the product rule:
P(x₁, …, xₙ) = P(xₙ | x₁, …, xₙ₋₁) P(x₁, …, xₙ₋₁)
            = P(xₙ | x₁, …, xₙ₋₁) P(xₙ₋₁ | x₁, …, xₙ₋₂) P(x₁, …, xₙ₋₂)
            = … = ∏ᵢ₌₁ⁿ P(xᵢ | x₁, …, xᵢ₋₁)
Law of Total Probability
If B₁, …, Bₙ are mutually exclusive events covering Ω, and A is an event (subset of Ω):
P(A) = P(A|B₁)P(B₁) + P(A|B₂)P(B₂) + … + P(A|Bₙ)P(Bₙ)
Independence
Propositions a and b are independent iff:
P(a|b) = P(a) ⟺ P(b|a) = P(b) ⟺ P(a, b) = P(a) P(b)
Knowing b tells us nothing about a (and vice versa).
Bayes Rule
Bayes rule, obtained from the product rule P(a, b) = P(a|b)P(b) = P(b|a)P(a):
P(a|b) = P(b|a) P(a) / P(b)
In some problems, it may be difficult to compute P(a|b) directly, yet we might have information about P(b|a).
Bayes Rule
Often it is useful to derive the rule a bit further, expanding the denominator with the law of total probability:
P(a|b) = P(b|a) P(a) / Σ_{a'} P(b|a') P(a')
Bayes Rule: Example
Meningitis (m) and stiff neck (s):
P(s) = 0.01, P(s|m) = 0.7, P(m) = 0.00002
P(m|s) = P(s|m) P(m) / P(s) = 0.7 × 0.00002 / 0.01 = 0.0014
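A quick sketch of the computation above (note P(m) = 0.00002, i.e., 1 in 50,000, is what makes the stated answer 0.0014 come out):

```python
# Bayes rule for the meningitis / stiff-neck example.
p_s_given_m = 0.7   # P(s | m)
p_m = 0.00002       # P(m): prior probability of meningitis
p_s = 0.01          # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # ~0.0014
```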
Bayes Rule: General Form
Conditioned on some background evidence e:
P(a | b, e) = P(b | a, e) P(a | e) / P(b | e)
Calculating Probabilities from the Joint Distribution
Useful techniques:
- Marginalization: P(x) = Σ_y P(x, y)
- Conditioning: P(x) = Σ_y P(x|y) P(y)
- Normalization: P(X|e) = α P(X, e), with α = 1 / Σ_x P(x, e)
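The three techniques can be sketched on a small, hypothetical 2×2 joint table (the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical joint table P(X, Y); rows index X in {0,1}, columns Y in {0,1}.
joint = np.array([[0.3, 0.2],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)               # marginalization: P(x) = sum_y P(x, y)
p_y = joint.sum(axis=0)

# conditioning / normalization: P(x | y=0) = alpha * P(x, y=0)
alpha = 1.0 / joint[:, 0].sum()       # alpha = 1 / sum_x P(x, y=0)
p_x_given_y0 = alpha * joint[:, 0]

print(p_x, p_x_given_y0)
```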
Probability Mass Function (PMF)
- The PMF gives the probability of each value of a discrete random variable; each impulse magnitude equals the probability of the corresponding outcome.
- P(X = x) ≥ 0 and Σ_x P(X = x) = 1
- Example: PMF of a fair die: P(X = x) = 1/6 for x ∈ {1, …, 6}
Probability Density Function (PDF)
The PDF is defined for continuous random variables:
p(x₀) = lim_{Δ→0} P(x₀ − Δ/2 ≤ X ≤ x₀ + Δ/2) / Δ
so the probability of x ∈ [x₀ − Δ/2, x₀ + Δ/2] is approximately p(x₀)Δ for small Δ.
Properties: p(x) ≥ 0 and ∫ p(x) dx = 1.
Cumulative Distribution Function (CDF)
Defined as the integral of the PDF (similarly defined for discrete variables, with a summation instead of an integral):
F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} p_X(u) du
Properties:
- Non-decreasing, right continuous
- F_X(−∞) = 0, F_X(+∞) = 1
- p_X(x) = dF_X(x)/dx
- P(u < X ≤ v) = F_X(v) − F_X(u)
Distribution Statistics
Basic descriptors of distributions:
- Mean value
- Variance & standard deviation
- Moments
- Covariance & correlation
Expected Value
Expected (or mean) value: the weighted average of all possible values of the random variable.
- Discrete random variable: E[X] = Σ_x x P(x)
- Function of a discrete random variable: E[g(X)] = Σ_x g(x) P(x)
- Continuous random variable: E[X] = ∫ x p(x) dx
- Function of a continuous random variable: E[g(X)] = ∫ g(x) p(x) dx
Variance
Variance: a measure of how far the values of a random variable spread around its expected value:
Var(X) = σ_X² = E[(X − E[X])²] = E[X²] − (E[X])²
Standard deviation: the square root of the variance, σ_X = √Var(X).
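Both forms of the variance formula can be checked numerically, here for the fair-die PMF used earlier:

```python
import numpy as np

# E[X], Var(X), and sigma for a fair die, straight from the definitions.
x = np.arange(1, 7)
p = np.full(6, 1/6)

mean = np.sum(x * p)                   # E[X] = sum_x x P(x) = 3.5
var = np.sum((x - mean)**2 * p)        # Var(X) = E[(X - E[X])^2] = 35/12
var_alt = np.sum(x**2 * p) - mean**2   # Var(X) = E[X^2] - (E[X])^2
std = np.sqrt(var)

print(mean, var)  # 3.5, ~2.9167
```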
Moments
- n-th order moment of a random variable: Mₙ = E[Xⁿ]
- n-th order central moment: E[(X − μ_X)ⁿ]
- The first-order moment is the mean value; the second-order moment equals the variance plus the square of the mean (E[X²] = σ² + μ²).
Correlation & Covariance
- Correlation: Corr(X, Y) = E[XY]; for discrete RVs, Σ_x Σ_y x y P(x, y).
- Covariance is the correlation of the mean-removed variables:
  Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
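The identity Cov(X, Y) = E[XY] − μ_X μ_Y can be verified on a small, hypothetical joint PMF (the table below is made up for illustration):

```python
import numpy as np

# Hypothetical joint PMF over x, y in {0, 1}; positively associated values.
x_vals = np.array([0.0, 1.0])
y_vals = np.array([0.0, 1.0])
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])                 # P(x, y)

e_xy = sum(x * y * joint[i, j]                 # Corr(X, Y) = E[XY]
           for i, x in enumerate(x_vals)
           for j, y in enumerate(y_vals))
mu_x = np.sum(x_vals * joint.sum(axis=1))
mu_y = np.sum(y_vals * joint.sum(axis=0))
cov = e_xy - mu_x * mu_y                       # mean-removed correlation
print(cov)  # positive: X and Y tend to move together
```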
Covariance: Example
(Figure: scatter plots of two pairs of RVs, one with Cov(X, Y) = 0 and one with Cov(X, Y) = 0.9.)
Covariance Properties
The covariance value shows the tendency of a pair of RVs to vary together:
- Cov(X, Y) > 0: X and Y tend to increase together
- Cov(X, Y) < 0: Y tends to decrease when X increases
- Cov(X, Y) = 0: no linear correlation between X and Y
Orthogonal, Uncorrelated & Independent RVs
- Orthogonal random variables: E[XY] = 0
- Uncorrelated random variables: Cov(X, Y) = 0
- Independent random variables are uncorrelated (Cov(X, Y) = 0), but uncorrelated random variables are not necessarily independent.
Pearson's Product-Moment Correlation
ρ_{X,Y} = Cov(X, Y) / (σ_X σ_Y)
- Defined only if both σ_X and σ_Y are finite and nonzero.
- ρ shows the degree of linear dependence between X and Y.
- −1 ≤ ρ ≤ 1, by the Cauchy–Schwarz inequality.
- ρ = 1 shows a perfect positive linear relationship, and ρ = −1 a perfect negative one.
Pearson's Correlation: Examples
(Figure: scatter plots showing ρ = Corr(X, Y) for various data sets. [Wikipedia])
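The extreme values ρ = ±1 for perfect linear relationships can be sketched with numpy's `corrcoef`:

```python
import numpy as np

# Perfect linear relationships give correlation coefficients of +1 and -1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                      # perfect positive linear relationship

rho_pos = np.corrcoef(x, y)[0, 1]      # close to +1.0
rho_neg = np.corrcoef(x, -y)[0, 1]     # close to -1.0
print(rho_pos, rho_neg)
```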
Covariance Matrix
If X = [X₁, …, X_d]ᵀ is a vector of random variables (d-dim random vector), the covariance matrix indicates the tendency of each pair of RVs to vary together:
μ = E[X]
Σ = E[(X − μ)(X − μ)ᵀ], i.e., the (i, j) entry is E[(Xᵢ − μᵢ)(Xⱼ − μⱼ)]
Covariance Matrix: Two Variables
Σ = [[σ₁², Cov(X, Y)], [Cov(X, Y), σ₂²]]
Examples: Σ = [[1, 0], [0, 1]] (uncorrelated), Σ = [[1, 0.9], [0.9, 1]] (strongly correlated)
Covariance Matrix
Σᵢⱼ shows the covariance of Xᵢ and Xⱼ:
Σᵢⱼ = Cov(Xᵢ, Xⱼ) = E[(Xᵢ − μᵢ)(Xⱼ − μⱼ)]
Sums of Random Variables (Z = X + Y)
- Mean: E[Z] = E[X] + E[Y] = μ_X + μ_Y
- Variance: Var(Z) = Var(X) + Var(Y) + 2 Cov(X, Y); if X, Y are independent: Var(Z) = Var(X) + Var(Y)
- Distribution: p_Z(z) = ∫ p_{X,Y}(x, z − x) dx; if X, Y are independent, this becomes the convolution p_Z(z) = ∫ p_X(x) p_Y(z − x) dx
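For discrete independent RVs the convolution formula becomes a discrete convolution; a sketch for the sum of two fair dice:

```python
import numpy as np

# Distribution of Z = X + Y for two independent fair dice via convolution:
# p_Z(z) = sum_x p_X(x) p_Y(z - x).
p_die = np.full(6, 1/6)             # P(X = 1..6)
p_sum = np.convolve(p_die, p_die)   # P(Z = 2..12); index k <-> z = k + 2

# P(Z = 7) = 6/36 (index 5), and means add for independent RVs.
z = np.arange(2, 13)
print(p_sum[5], np.sum(z * p_sum))
```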
Some Famous Probability Mass and Density Functions
- Uniform, X ~ U(a, b): p(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.
- Gaussian (Normal), X ~ N(μ, σ²): p(x) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²))
Some Famous Probability Mass and Density Functions
- Binomial, X ~ B(n, p): P(X = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ
- Exponential: p(x) = λ e^{−λx} for x ≥ 0, and 0 otherwise.
Gaussian (Normal) Distribution
- Widely used to model the distribution of continuous variables.
- About 68% of the mass lies within [μ − σ, μ + σ], and about 95% within [μ − 2σ, μ + 2σ].
- Standard normal distribution: N(0, 1).
Multivariate Gaussian Distribution
x is a vector of Gaussian variables:
p(x) = N(x; μ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))
μ = E[x] = [E(x₁), …, E(x_d)]ᵀ
Σ = E[(x − μ)(x − μ)ᵀ]
Mahalanobis distance: r² = (x − μ)ᵀ Σ⁻¹ (x − μ)
(Figure: density surface of a bivariate Gaussian over (x₁, x₂).)
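The density formula and the Mahalanobis distance can be evaluated directly; a sketch with a hypothetical μ and Σ (at x = μ the exponent vanishes, so the density is just the normalizing constant):

```python
import numpy as np

# Evaluate the multivariate Gaussian density from the formula above (d = 2).
mu = np.array([0.0, 1.0])                      # hypothetical mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # hypothetical covariance

def gaussian_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    r2 = diff @ np.linalg.inv(Sigma) @ diff    # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * r2) / norm

# At x = mu: p(mu) = 1 / ((2 pi)^{d/2} |Sigma|^{1/2})
p_at_mean = gaussian_pdf(mu, mu, Sigma)
print(p_at_mean)
```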
Multivariate Gaussian Distribution
- The covariance matrix is always symmetric and positive semidefinite.
- A multivariate Gaussian is completely specified by d + d(d + 1)/2 parameters.
- Special cases of Σ:
  - Σ = σ²I: independent random variables with the same variance (circularly symmetric Gaussian)
  - Diagonal Σ: independent random variables with different variances
Multivariate Gaussian Distribution: Level Surfaces
- The Gaussian distribution is constant on surfaces in x-space for which (x − μ)ᵀ Σ⁻¹ (x − μ) = c (a hyper-ellipsoid).
- The principal axes of the hyper-ellipsoids are the eigenvectors of Σ.
- Bivariate Gaussian: the curves of constant density are ellipses.
Bivariate Gaussian Distribution
λ₁ and λ₂ are the eigenvalues of Σ, and u₁ and u₂ are the corresponding eigenvectors; the constant-density ellipses have axes along u₁, u₂ with lengths proportional to √λ₁, √λ₂.
Linear Transformations on a Multivariate Gaussian
Whitening transform: A_w = Φ Λ^{−1/2}, where Φ is the matrix whose columns are the eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues. The transformed variable y = A_wᵀ x has identity covariance.
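A minimal sketch of the whitening transform, using a hypothetical covariance matrix: building A_w = Φ Λ^{−1/2} from the eigen decomposition and checking that the whitened covariance A_wᵀ Σ A_w is the identity.

```python
import numpy as np

# Whitening transform built from the eigen decomposition of Sigma.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                # hypothetical covariance

eigvals, Phi = np.linalg.eigh(Sigma)          # Sigma = Phi Lambda Phi^T
A_w = Phi @ np.diag(eigvals ** -0.5)          # A_w = Phi Lambda^{-1/2}

cov_y = A_w.T @ Sigma @ A_w                   # covariance of y = A_w^T x
print(cov_y)                                  # identity matrix
```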
Gaussian Distribution Properties
Some attractive properties of the Gaussian distribution:
- Marginal and conditional distributions of a Gaussian are also Gaussian.
- After a linear transformation, a Gaussian distribution is again Gaussian.
- There exists a linear transformation that diagonalizes the covariance matrix (whitening transform); it converts the multivariate normal distribution into a spherical one.
- The Gaussian maximizes the entropy among distributions with a given mean and variance.
- The Gaussian is stable and infinitely divisible.
- Central limit theorem: some distributions can be approximated by a Gaussian when their parameter value is sufficiently large (e.g., Binomial).
Central Limit Theorem
(Under mild conditions.) Suppose X₁, …, X_N are i.i.d. (independent, identically distributed) RVs with finite variances, and let S_N = Σᵢ₌₁ᴺ Xᵢ be their sum. The distribution of S_N (suitably standardized) converges to a normal distribution as N increases, regardless of the distribution of the RVs.
Example: Xᵢ ~ uniform; the density of S_N looks increasingly Gaussian as N grows from N = 1.
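The CLT can be illustrated exactly (no sampling needed) by convolving a skewed PMF with itself: the skewness of the sum of N i.i.d. copies decays as 1/√N, pulling the distribution toward the symmetric Gaussian shape. The PMF below is made up for illustration.

```python
import numpy as np

def skewness(p):
    """Skewness of a PMF p over the support 0, 1, ..., len(p)-1."""
    x = np.arange(len(p))
    mu = np.sum(x * p)
    var = np.sum((x - mu) ** 2 * p)
    return np.sum((x - mu) ** 3 * p) / var ** 1.5

p1 = np.array([0.5, 0.3, 0.2])   # a skewed PMF on {0, 1, 2}
p_sum = p1.copy()
for _ in range(15):              # exact PMF of the sum of N = 16 copies
    p_sum = np.convolve(p_sum, p1)

# Skewness of the sum is skew(X) / sqrt(N) = skew(X) / 4 for N = 16.
print(skewness(p1), skewness(p_sum))
```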
Linear Algebra: Basic Definitions
- Matrix: A = [aᵢⱼ]_{m×n}
- Matrix transpose: B = Aᵀ, with bᵢⱼ = aⱼᵢ for 1 ≤ i ≤ n, 1 ≤ j ≤ m
- Symmetric matrix: A = Aᵀ, i.e., aᵢⱼ = aⱼᵢ
- Vector (column vector): a = [a₁, …, aₙ]ᵀ
Linear Mapping
A linear function f satisfies:
- f(x + y) = f(x) + f(y)
- f(αx) = α f(x) for all α
In general, a matrix A ∈ ℝ^{m×n} denotes a linear map f: ℝⁿ → ℝᵐ given by y = Ax, i.e., yᵢ = aᵢ₁x₁ + … + aᵢₙxₙ = Σⱼ aᵢⱼ xⱼ.
Linear Algebra: Basic Definitions
- Inner (dot) product: ⟨a, b⟩ = aᵀb = Σᵢ aᵢbᵢ
- Matrix multiplication: for A = [aᵢⱼ]_{m×p} and B = [bᵢⱼ]_{p×n}, AB = C = [cᵢⱼ]_{m×n} with cᵢⱼ = ⟨Aᵢ., B.ⱼ⟩, the inner product of the i-th row of A and the j-th column of B.
Inner Product
- Inner (dot) product: aᵀb = Σᵢ aᵢbᵢ
- Length (Euclidean norm) of a vector: ‖a‖ = √(aᵀa) = √(Σᵢ aᵢ²); a is normalized iff ‖a‖ = 1.
- Angle between vectors a and b: cos θ = aᵀb / (‖a‖ ‖b‖)
- Orthogonal vectors a and b: aᵀb = 0
- Orthonormal set of vectors v₁, …, vₖ: vᵢᵀvⱼ = 1 if i = j, and 0 otherwise.
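The norm, angle, and orthogonality definitions above can be sketched numerically:

```python
import numpy as np

# Norms, angles, and orthogonality from the inner-product formulas.
a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])

norm_a = np.sqrt(a @ a)                           # ||a|| = 5
a_hat = a / norm_a                                # normalized: ||a_hat|| = 1
cos_theta = (a @ b) / (norm_a * np.sqrt(b @ b))   # cos of the angle
theta = np.arccos(cos_theta)

print(norm_a, a @ b, theta)  # a and b are orthogonal: angle pi/2
```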
Linear Independence
A set of vectors is linearly independent if no vector is a linear combination of the others:
c₁a₁ + c₂a₂ + … + cₙaₙ = 0 ⟹ c₁ = c₂ = … = cₙ = 0
Matrix Determinant and Trace
- Determinant (cofactor expansion along row i):
  det(A) = Σⱼ₌₁ⁿ aᵢⱼ Aᵢⱼ, i = 1, …, n, where Aᵢⱼ = (−1)^{i+j} det(Mᵢⱼ)
- det(AB) = det(A) det(B)
- Trace: tr[A] = Σⱼ₌₁ⁿ aⱼⱼ
Matrix Inversion
- B is the inverse of A (B = A⁻¹) iff AB = BA = Iₙ
- A⁻¹ exists iff det(A) ≠ 0 (A is nonsingular)
- Singular: det(A) = 0
- Ill-conditioned: A is nonsingular but close to being singular
- Pseudo-inverse for a non-square matrix: A# = (AᵀA)⁻¹Aᵀ when AᵀA is not singular; then A#A = I
Matrix Rank
- rank(A): the maximum number of linearly independent columns (or rows) of A.
- An n×n matrix A is full rank (rank(A) = n) iff A is nonsingular (det(A) ≠ 0).
Eigenvectors and Eigenvalues
Av = λv, v ≠ 0
Characteristic equation: det(A − λI) = 0, an n-th order polynomial with n roots λ₁, …, λₙ:
- tr(A) = Σⱼ₌₁ⁿ λⱼ
- det(A) = ∏ⱼ₌₁ⁿ λⱼ
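The trace and determinant identities above are easy to check on a small (hypothetical) matrix:

```python
import numpy as np

# Check tr(A) = sum of eigenvalues and det(A) = product of eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals = np.linalg.eigvals(A)
print(eigvals.sum(), np.trace(A))          # both equal 5
print(eigvals.prod(), np.linalg.det(A))    # both equal 5
```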
Eigenvectors and Eigenvalues: Symmetric Matrix
- For a symmetric matrix, the eigenvectors corresponding to distinct eigenvalues are orthogonal.
- These eigenvectors can be used to form an orthonormal set Φ = [v₁, …, vₙ], with ΦᵀΦ = I (so Φ⁻¹ = Φᵀ).
Eigen Decomposition: Symmetric Matrix
AΦ = ΦΛ, with Λ = diag(λ₁, …, λₙ)
Eigen decomposition of a symmetric matrix:
A = ΦΛΦᵀ
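A sketch of the decomposition with numpy's `eigh` (which is specialized for symmetric matrices): the eigenvectors form an orthonormal set and reconstruct A exactly.

```python
import numpy as np

# Eigen decomposition of a symmetric matrix: A = Phi Lambda Phi^T.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

lam, Phi = np.linalg.eigh(A)   # eigenvalues (ascending) and eigenvectors

print(Phi.T @ Phi)                   # identity: orthonormal eigenvectors
print(Phi @ np.diag(lam) @ Phi.T)    # reconstructs A
```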
Positive Definite Matrix
- A symmetric matrix A is positive definite iff xᵀAx > 0 for all x ≠ 0.
- The eigenvalues of a positive definite matrix are positive: λᵢ > 0 for all i.
Simple Vector Derivatives
- ∂(aᵀx)/∂x = a
- ∂(xᵀAx)/∂x = (A + Aᵀ)x (= 2Ax for symmetric A)