Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology


Review (probability, linear algebra). CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2012. Some slides have been adapted from Prof. H.R. Rabiee's and Prof. R. Gutierrez-Osuna's lectures.

Outline: Axioms of probability theory; conditional probability, joint probability, Bayes rule; discrete and continuous random variables; probability mass, density, and distribution functions; expected value, variance, standard deviation; expectation for two variables: covariance, correlation; linear algebra.

Basic probability elements. Sample space (Ω): the set of all possible outcomes (or worlds); outcomes are assumed to be mutually exclusive. An event is a set of outcomes (i.e., a subset of Ω). A random variable is a function defined over the sample space, e.g., Gender: Ω → {male, female}; Height: Ω → R.

Probability space. A triple (Ω, F, P): Ω is a nonempty set whose elements are called outcomes or states of nature; F is a set whose elements, called events, are subsets of Ω (F should be a Borel field); P is the probability measure.

Probability and logic. Probability theory can be viewed as a generalization of propositional logic. Probabilistic logic: a is a propositional logic sentence, but the belief of an agent in a is no longer restricted to true, false, or unknown; P(a) can range from 0 to 1.

Probability axioms (Kolmogorov). The axioms define a reasonable theory of uncertainty. Kolmogorov's probability axioms (in propositional notation): P(a) ≥ 0; P(a) = 1 if a is a tautology; P(a ∨ b) = P(a) + P(b) − P(a ∧ b).

Random variables. Random variables are the variables of probability theory. The domain of a random variable may be Boolean, discrete, or continuous. A probability distribution is the function describing the probabilities of the possible values of a random variable, e.g., P(X = 1) = p1, P(X = 2) = p2, …

Random variables. A random variable is a function that maps every outcome in Ω to a real (or complex) number, so that probabilities (and expectation, variance, …) can be defined conveniently as functions on real numbers. Example: a coin flip mapped to the real line as Head → 0, Tail → 1.

Probabilistic inference. Joint probability distribution: specifies the probability of every atomic event; every probability can be found from it (by summing over atomic events). Prior and posterior probabilities: belief in the absence or presence of evidence. Bayes rule: used when it is difficult to compute P(a|b) but we have information about P(b|a). Independence: new evidence may be irrelevant.

Joint probability distribution. The probability of all combinations of values for a set of random variables: if two or more random variables are considered together, they can be described in terms of their joint probability. Example: P(Day, Weather) is a 7 × 4 matrix of values, partially shown below.

              W=sunny   W=rainy   W=cloudy   W=snow
Day=Saturday   0.08      0.02      0.04       0.01
Day=Sunday     0.07      0.01      0.06       0.00
Day=Friday     0.09      0.02      0.03       0.01

Prior and posterior probabilities. Prior or unconditional probabilities of propositions express belief in the absence of any other evidence, e.g., P(a) = 0.14, P(Weather = sunny) = 0.6. Posterior or conditional probabilities express belief in the presence of evidence: P(a|b) is the probability of a given that b is true, e.g., P(Weather = sunny | Day = Friday).

Conditional probability. P(a|b) = P(a, b) / P(b), if P(b) > 0. Product rule (an alternative formulation): P(a, b) = P(a|b) P(b) = P(b|a) P(a). P(·|b) obeys the same rules as probabilities, e.g., P(a|b) + P(¬a|b) = 1.

Conditional probability. For statistically dependent variables, knowing the value of one variable may allow us to better estimate the other. All probabilities are in effect conditional probabilities, e.g., P(a) = P(a|Ω). Conditioning on b renormalizes the probability of the events that occur jointly with b.

Example. Rolling a fair die; a: the outcome is an even number; b: the outcome is a prime number. Then P(a|b) = P(a, b) / P(b) = (1/6) / (1/2) = 1/3.
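The die example can be checked by brute-force enumeration over the sample space; a minimal sketch using exact rational arithmetic:

```python
from fractions import Fraction

# Sample space of a fair die; each outcome has probability 1/6.
omega = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)

even = {2, 4, 6}    # event a: the outcome is even
prime = {2, 3, 5}   # event b: the outcome is prime

p_b = sum(p for o in omega if o in prime)           # P(b) = 1/2
p_ab = sum(p for o in omega if o in even & prime)   # P(a, b) = P({2}) = 1/6
p_a_given_b = p_ab / p_b                            # P(a|b) = 1/3
```

Using `Fraction` instead of floats keeps the conditional probability exact.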

Chain rule. The chain rule is derived by successive application of the product rule: P(x1, …, xn) = P(xn | x1, …, xn−1) P(x1, …, xn−1) = P(xn | x1, …, xn−1) P(xn−1 | x1, …, xn−2) P(x1, …, xn−2) = … = P(x1) Π_{i=2..n} P(xi | x1, …, xi−1).

Law of total probability. If b1, …, bn are mutually exclusive and exhaustive events and a is an event (a subset of Ω), then P(a) = P(a|b1) P(b1) + P(a|b2) P(b2) + … + P(a|bn) P(bn).

Independence. Propositions a and b are independent iff P(a|b) = P(a) ⟺ P(b|a) = P(b) ⟺ P(a, b) = P(a) P(b). Knowing b tells us nothing about a (and vice versa).

Bayes' rule. Obtained from the product rule P(a, b) = P(a|b) P(b) = P(b|a) P(a): P(a|b) = P(b|a) P(a) / P(b). In some problems it may be difficult to compute P(a|b) directly, yet we might have information about P(b|a).

Bayes' rule: example. Meningitis (m) and stiff neck (s): P(s) = 0.01, P(s|m) = 0.7, P(m) = 0.0002. Then P(m|s) = P(s|m) P(m) / P(s) = 0.7 × 0.0002 / 0.01 = 0.014.
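A one-line computation reproduces the example (a sketch using the slide's numbers):

```python
# Bayes' rule: P(m|s) = P(s|m) P(m) / P(s).
p_s = 0.01          # prior probability of a stiff neck
p_s_given_m = 0.7   # likelihood of a stiff neck given meningitis
p_m = 0.0002        # prior probability of meningitis

p_m_given_s = p_s_given_m * p_m / p_s   # ≈ 0.014
```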

Bayes rule. Often it is useful to derive the rule a bit further: expanding the denominator with the law of total probability gives P(a|b) = P(b|a) P(a) / Σ_{a'} P(b|a') P(a'), so the denominator acts as a normalization constant over all values a' of a.

Calculating probabilities from the joint distribution. Useful techniques: Marginalization: P(a) = Σ_b P(a, b). Conditioning: P(a) = Σ_b P(a|b) P(b). Normalization: P(a|b) = P(a, b) / Σ_{a'} P(a', b).

Inference: example. Joint probability distribution P(Cavity, Toothache, Catch):

             toothache           ¬toothache
             catch    ¬catch     catch    ¬catch
  cavity     0.108    0.012      0.072    0.008
  ¬cavity    0.016    0.064      0.144    0.576

Conditional probabilities: P(¬cavity | toothache) = P(¬cavity, toothache) / P(toothache) = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4.
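Marginalization and conditioning on this table are just sums over the joint entries; a minimal NumPy sketch (assuming NumPy is available):

```python
import numpy as np

# Joint distribution from the slide: rows = cavity / ¬cavity,
# columns = (toothache,catch), (toothache,¬catch), (¬toothache,catch), (¬toothache,¬catch).
joint = np.array([
    [0.108, 0.012, 0.072, 0.008],   # cavity
    [0.016, 0.064, 0.144, 0.576],   # ¬cavity
])

# Marginalize: P(toothache) sums the first two columns over cavity.
p_toothache = joint[:, :2].sum()

# Condition: P(¬cavity | toothache).
p_not_cavity_given_toothache = joint[1, :2].sum() / p_toothache
```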

Probability Mass (Distribution) Function (PMF). For discrete random variables: a sum of impulses, where the magnitude of each impulse represents the probability of occurrence of that outcome. P_X(x) ≥ 0 and Σ_i P_X(x_i) = 1. Example I: rolling a fair die, P_X(x) = 1/6 for x = 1, …, 6.

Probability Mass Function (PMF) (cont'd). Example II: the sum of two fair dice, with PMF values 1/36, 2/36, …, 6/36, …, 2/36, 1/36 for sums 2 through 12. Note: the sum of all probabilities should be equal to ONE. (Why?)
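The two-dice PMF can be built by counting the 36 equally likely outcome pairs; a short sketch:

```python
from fractions import Fraction
from collections import Counter

# Count how many of the 36 ordered outcome pairs give each sum.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}
```

The triangular shape (peaking at 7 with probability 6/36) falls out of the counting, and the exact fractions make the sum-to-one check trivial.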

Probability Density Function (PDF). For continuous random variables: p_X(x0) = lim_{Δx→0} P(x0 < X ≤ x0 + Δx) / Δx, so P(x0 < X ≤ x0 + Δx) ≈ p_X(x0) Δx. Properties: p_X(x) ≥ 0 and ∫ p_X(x) dx = 1.

Cumulative Distribution Function (CDF). F_X(x) = P(X ≤ x) = ∫_{−∞}^{x} p_X(u) du; it can be defined as the integral of the PDF, and is similarly defined for discrete variables. Properties: non-decreasing; right-continuous; F_X(−∞) = 0; F_X(∞) = 1; p_X(x) = dF_X(x)/dx; P(u < X ≤ v) = F_X(v) − F_X(u).

Distribution statistics. The most basic descriptors of a distribution include: mean value; variance and standard deviation; covariance; moments.

Expected value. Discrete variable: E[X] = Σ_x x p_X(x); expectation of a function: E[g(X)] = Σ_x g(x) p_X(x). Continuous variable: E[X] = ∫ x p_X(x) dx; E[g(X)] = ∫ g(x) p_X(x) dx.
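The discrete definitions translate directly into code; a sketch for the fair die:

```python
from fractions import Fraction

# PMF of a fair die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum_x x p(x) and E[g(X)] = sum_x g(x) p(x), here with g(x) = x^2.
e_x = sum(x * p for x, p in pmf.items())       # E[X]   = 7/2
e_x2 = sum(x**2 * p for x, p in pmf.items())   # E[X^2] = 91/6
```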

Moments. The n-th order moment of a random variable X: M_n = E[X^n]. Normalized (central) form: E[(X − μ_X)^n]. The mean is the first-order moment; the variance is the second moment minus the square of the mean.

Variance. Var(X) = σ_X² = E[(X − μ_X)²] = E[X²] − μ_X²; the covariance of a random variable with itself. Standard deviation σ_X: the square root of the variance.

Correlation & covariance. Knowing about a random variable X, how much information do we gain about Y? (Linear dependency.) Correlation: Corr(X, Y) = E[XY]. Covariance: Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y = Σ_x Σ_y (x − μ_X)(y − μ_Y) P(x, y). The correlation coefficient is the normalized covariance: ρ = Cov(X, Y) / (σ_X σ_Y).
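These quantities can be estimated from samples; a sketch (assuming NumPy, with an illustrative linear dependence y = 2x + noise, so the true ρ is 2/√5 ≈ 0.89):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # y depends linearly on x

# Sample covariance and correlation coefficient, per the definitions.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))   # Cov(X, Y) ≈ 2
rho = cov_xy / (x.std() * y.std())                  # rho ≈ 2/sqrt(5)
```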

Orthogonal, uncorrelated random variables. Orthogonal random variables: E[XY] = 0. Uncorrelated random variables: Cov(X, Y) = 0. Independent random variables are uncorrelated, but uncorrelated variables are not necessarily independent.

Covariance matrix: two variables. Σ = [ σ11  σ12 ; σ21  σ22 ] = [ σ1²  σ12 ; σ21  σ2² ], where σ11 = var(X), σ22 = var(Y), and σ12 = σ21 = Cov(X, Y).

Covariance: example (figure omitted).

Covariance properties. The covariance matrix indicates the tendency of each pair of features to vary together. If x_i and x_j tend to increase together, then σ_ij > 0; if x_i tends to decrease when x_j increases, then σ_ij < 0; if x_i and x_j are uncorrelated, then σ_ij = 0.

Covariance matrix. If x = (x1, …, xd)ᵀ is a vector of random variables (a d-dimensional random vector) with mean μ = E[x] = (E[x1], …, E[xd])ᵀ, then Σ = E[(x − μ)(x − μ)ᵀ], the d × d matrix whose (i, j) entry is E[(x_i − μ_i)(x_j − μ_j)].

Covariance matrix. Σ_ij denotes the covariance of x_i and x_j: Σ_ij = Cov(x_i, x_j) = E[(x_i − μ_i)(x_j − μ_j)].

Sums of random variables. z = x + y. Mean: E[z] = E[x] + E[y]. Variance: Var(z) = Var(x) + Var(y) + 2 Cov(x, y); if x, y are independent, Var(z) = Var(x) + Var(y). Distribution: p_z(z) = ∫ p_{x,y}(x, z − x) dx; if x, y are independent this becomes the convolution p_z(z) = ∫ p_x(x) p_y(z − x) dx.
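For independent discrete variables the convolution can be computed directly; a sketch recovering the two-dice PMF (assuming NumPy):

```python
import numpy as np

# PMF of one fair die over the values 1..6.
die = np.full(6, 1 / 6)

# PMF of z = x + y is the convolution of the two PMFs (values 2..12).
pmf_sum = np.convolve(die, die)

# P(z = 7): value 2 sits at index 0, so value 7 is index 5.
p_seven = pmf_sum[5]
```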

Some famous masses and densities. Uniform density: X ~ U(a, b), p(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise. Gaussian (normal) density: perhaps the most widely used distribution in all of science, fully defined by 2 parameters: X ~ N(μ, σ²), p(x) = (1 / (σ √(2π))) e^{−(x−μ)² / (2σ²)}.
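Both densities are short enough to write from their formulas; a stdlib-only sketch:

```python
import math

def uniform_pdf(x, a, b):
    """Density of U(a, b): 1/(b - a) on [a, b], and 0 elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
```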

Some famous masses and densities. Binomial mass: X ~ B(n, p), P(X = k) = C(n, k) p^k (1 − p)^{n−k}. Exponential density: p(x) = λ e^{−λx} for x ≥ 0, and 0 for x < 0 (i.e., p(x) = λ e^{−λx} U(x), with U the unit step).

Gaussian (Normal) Density (figure omitted).

Multivariate normal density. x is a vector of d Gaussian variables: p(x) = N(μ, Σ) = (1 / ((2π)^{d/2} |Σ|^{1/2})) e^{−½ (x−μ)ᵀ Σ⁻¹ (x−μ)}, where μ = E[x] = (E[x1], …, E[xd])ᵀ and Σ = E[(x − μ)(x − μ)ᵀ]. Mahalanobis distance: r² = (x − μ)ᵀ Σ⁻¹ (x − μ).
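The density and the Mahalanobis distance can be evaluated directly from the formula; a sketch (assuming NumPy — in d = 2 with μ = 0 and Σ = I the density at the origin is 1/(2π)):

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(sigma, d))

def mvn_pdf(x, mu, sigma):
    """Density of N(mu, Sigma) in d dimensions."""
    d = len(mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * mahalanobis_sq(x, mu, sigma)) / norm
```

Using `np.linalg.solve` instead of explicitly inverting Σ is the numerically preferred way to apply Σ⁻¹.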

Bivariate normal densities. The level curves are ellipses: the principal axes are the eigenvectors of Σ, and the width in each direction is proportional to the square root of the corresponding eigenvalue.

Gaussian pdf properties. It is closed under linear transformation. The product of two Gaussians is also Gaussian. All marginal and conditional densities of a Gaussian are also Gaussian. Diagonalization of the covariance matrix: the rotated variables are independent. The Gaussian distribution is infinitely divisible.

Central Limit Theorem. Given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N, the number of samples, increases. CLT's large-sample properties: how large is large enough? Some books will tell you 30 is large enough.

Central limit theorem (loosely). Suppose X1, …, XN are i.i.d. (independent, identically distributed) random variables with finite variances, and let S_N = Σ_{i=1..N} X_i be their sum. The PDF of S_N converges to a normal distribution as N increases, regardless of the distribution of the X_i. Example: X_i ~ uniform, N = 1, …
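A quick simulation makes the theorem concrete; a sketch (assuming NumPy) averaging N = 100 uniform(0, 1) variables, whose sample means should be approximately N(1/2, (1/12)/N):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw many batches of N uniform variables and average each batch.
N, trials = 100, 20_000
means = rng.uniform(size=(trials, N)).mean(axis=1)

emp_mean = means.mean()   # should be close to 1/2
emp_var = means.var()     # should be close to (1/12)/N = 1/1200
```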

Linear algebra. Matrix: A = [a_ij]_{m×n}, with entries a_11, a_12, …, a_mn. Matrix transpose: B = Aᵀ, b_ij = a_ji for 1 ≤ i ≤ n, 1 ≤ j ≤ m. Symmetric matrix: a_ij = a_ji. Vector: a column a = (a1, …, an)ᵀ, so aᵀ = [a1, …, an].

Matrix and vector product. Inner (dot) product: aᵀb = Σ_{i=1..n} a_i b_i. Matrix multiplication: for A = [a_ij]_{m×p} and B = [b_ij]_{p×n}, AB = C = [c_ij]_{m×n} with c_ij = row_i(A) · col_j(B).

Inner product. Inner (dot) product: aᵀb = Σ_i a_i b_i. Length (Euclidean norm) of a vector: ‖a‖ = √(aᵀa) = √(Σ_i a_i²); a is normalized iff ‖a‖ = 1. Angle between vectors: cos θ = aᵀb / (‖a‖ ‖b‖); geometrically, the inner product measures the projection of one vector onto the other. Orthogonal vectors a and b: aᵀb = 0. Orthonormal set of vectors a1, …, an: ‖a_i‖ = 1 and a_iᵀa_j = 0 for i ≠ j.
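These definitions map one-to-one onto NumPy operations; a sketch with two orthogonal vectors (assuming NumPy):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, -3.0])

dot = a @ b                                   # inner product: 3*4 + 4*(-3) = 0
norm_a = np.sqrt(a @ a)                       # Euclidean length of a: 5
cos_theta = dot / (norm_a * np.sqrt(b @ b))   # cosine of the angle: 0 -> orthogonal
```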

Linear independence. A set of vectors is linearly independent if no vector is a linear combination of the others: c1 a1 + c2 a2 + … + cn an = 0 ⟹ c1 = c2 = … = cn = 0.

Matrix determinant and trace. Determinant (cofactor expansion): det(A) = Σ_{j=1..n} a_ij A_ij for any i = 1, …, n, where A_ij = (−1)^{i+j} det(M_ij) and M_ij is the minor obtained by deleting row i and column j of A; det(AB) = det(A) det(B). Trace: tr[A] = Σ_{j=1..n} a_jj.

Matrix inversion. The inverse of A satisfies AB = BA = I with B = A⁻¹; it exists iff det(A) ≠ 0 (A is nonsingular). Singular: det(A) = 0. Ill-conditioned: A is nonsingular but close to being singular. Pseudo-inverse for a non-square A (when AᵀA is not singular): A# = (AᵀA)⁻¹Aᵀ, which satisfies A#A = I.
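The left pseudo-inverse formula can be verified numerically on a small tall matrix; a sketch (assuming NumPy, with an illustrative 3×2 full-rank A):

```python
import numpy as np

# A tall full-rank matrix, so A^T A is invertible.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Left pseudo-inverse A# = (A^T A)^{-1} A^T.
A_pinv = np.linalg.inv(A.T @ A) @ A.T

# A# A = I, but A A# is only a projection onto the column space of A.
check = A_pinv @ A
```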

Matrix rank. rank(A): the maximum number of linearly independent columns (or rows) of A; equivalently, the dimension of the largest square submatrix of A that has a nonzero determinant.

Matrix rank: properties. rank(A_{m×n}) ≤ min(m, n); A is full rank when rank(A) = min(m, n). A square matrix A_{n×n} has rank n iff A is nonsingular (det(A) ≠ 0).

Eigenvectors and eigenvalues. Characteristic equation: det(A − λI_n) = 0, an n-th order polynomial with n roots λ1, …, λn. Properties: tr[A] = Σ_{j=1..n} λ_j and det[A] = Π_{j=1..n} λ_j.

Eigenvectors and eigenvalues. A v_j = λ_j v_j, j = 1, …, n, with ‖v_j‖ = 1. For a symmetric A, the eigenvalues are real and the eigenvectors can be chosen orthonormal.

Eigen-decomposition: symmetric matrix. A symmetric A can be written as A = VΛVᵀ, where the columns of V are the orthonormal eigenvectors of A and Λ = diag(λ1, …, λn).
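The decomposition of a symmetric matrix can be checked numerically; a sketch (assuming NumPy, with an illustrative 2×2 symmetric A whose eigenvalues are 1 and 3):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is the routine for symmetric matrices: it returns real eigenvalues
# (ascending) and orthonormal eigenvectors as the columns of V.
lam, V = np.linalg.eigh(A)

# Reconstruct A = V diag(lambda) V^T.
reconstructed = V @ np.diag(lam) @ V.T
```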

Positive definite matrix. A symmetric A is positive definite iff xᵀAx > 0 for all x ≠ 0. The eigenvalues of a positive definite matrix are positive: λ_j > 0.

Some vector derivatives. Standard identities: ∂(aᵀx)/∂x = a and ∂(xᵀAx)/∂x = (A + Aᵀ)x.

Linear mapping. A linear function satisfies f(x + y) = f(x) + f(y) and f(αx) = α f(x). A matrix is a useful notion to encode a linear map: f(x) = Ax.