Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra
February - June 2018
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part IV: Probability and Uncertainty
Linear Curve Fitting - Least Squares

N = 10, x = (x_1, ..., x_N)^T, t = (t_1, ..., t_N)^T, with x_i ∈ R, t_i ∈ R, i = 1, ..., N
Model: y(x, w) = w_1 x + w_0
Design matrix: X = [x 1]
Solution: w = (X^T X)^{-1} X^T t

[Figure: data points and fitted line, x from 0 to 10, t from 0 to 8]

How do we choose a noise model?
Simple Experiment

1. Choose a box: red box with p(B = r) = 4/10, blue box with p(B = b) = 6/10.
2. Choose an item from the selected box, each item with equal probability.

Given that we have chosen an orange, what is the probability that the box we chose was the blue one?
What do we know?

p(F = o | B = b) = 1/4    p(F = a | B = b) = 3/4
p(F = o | B = r) = 3/4    p(F = a | B = r) = 1/4

Given that we have chosen an orange, what is the probability that the box we chose was the blue one? That is, what is p(B = b | F = o)?
Calculating the Posterior p(B = b | F = o)

p(B = b | F = o) = p(F = o | B = b) p(B = b) / p(F = o)

Sum rule for the denominator:
p(F = o) = p(F = o, B = b) + p(F = o, B = r)
         = p(F = o | B = b) p(B = b) + p(F = o | B = r) p(B = r)
         = (1/4)(6/10) + (3/4)(4/10) = 9/20

Therefore
p(B = b | F = o) = (1/4)(6/10)(20/9) = 1/3
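The slide's arithmetic can be reproduced exactly with Python's fractions module; a minimal sketch (the variable names are our own):

```python
from fractions import Fraction

# Priors over boxes and conditional probabilities of fruit given box,
# as stated on the slides.
p_b = {"r": Fraction(4, 10), "b": Fraction(6, 10)}          # p(B)
p_f_given_b = {                                             # p(F | B)
    ("o", "b"): Fraction(1, 4), ("a", "b"): Fraction(3, 4),
    ("o", "r"): Fraction(3, 4), ("a", "r"): Fraction(1, 4),
}

# Sum rule: p(F = o) = sum over B of p(F = o | B) p(B)
p_o = sum(p_f_given_b[("o", box)] * p_b[box] for box in p_b)

# Bayes' theorem: p(B = b | F = o)
posterior_blue = p_f_given_b[("o", "b")] * p_b["b"] / p_o

print(p_o)             # 9/20
print(posterior_blue)  # 1/3
```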
Prior and Posterior - revisited

Before choosing an item from a box, the most complete information we have is p(B), the prior. Note that in our example p(B = b) = 6/10, so choosing a box according to the prior, we would opt for the blue box.
Once we observe some data (e.g. choose an orange), we can calculate the posterior probability p(B = b | F = o) via Bayes' theorem.
After observing an orange, the posterior probability is p(B = b | F = o) = 1/3, and therefore p(B = r | F = o) = 2/3. Having observed an orange, it is now more likely that it came from the red box.
Bayes' Rule

posterior = likelihood × prior / normalisation:

p(Y | X) = p(X | Y) p(Y) / p(X)

where the normalisation is given by the sum rule:

p(X) = Σ_Y p(X | Y) p(Y)
Interpretations of Probability

Classical or frequentist interpretation: probabilities as long-run frequencies of repeatable events.
Bayesian view: probabilities represent uncertainty.
Example: will the Arctic ice cap have disappeared by the end of the century? Fresh evidence can change our opinion on ice loss.
Goal: quantify uncertainty, and revise that uncertainty in the light of new evidence.
We therefore use the Bayesian interpretation of probability.
Andrey Kolmogorov - Axiomatic Probability Theory (1933)

Let (Ω, F, P) be a measure space with P(Ω) = 1. Then (Ω, F, P) is a probability space, with sample space Ω, event space F and probability measure P.

Axiom 1: P(E) ≥ 0 for all E ∈ F.
Axiom 2: P(Ω) = 1.
Axiom 3: P(E_1 ∪ E_2 ∪ ...) = Σ_i P(E_i) for any countable sequence of pairwise disjoint events E_1, E_2, ....
Richard Threlkeld Cox - Cox's Theorem (1946, 1961)

Postulates (according to Arnborg and Sjödin):
1. Divisibility and comparability: the plausibility of a statement is a real number, dependent on the information we have related to the statement.
2. Common sense: plausibilities should vary sensibly with the assessment of plausibilities in the model.
3. Consistency: if the plausibility of a statement can be derived in many ways, all the results must be equal.

Common sense includes consistency with Aristotelian logic.
Result: the numerical quantities representing plausibility all behave according to the rules of probability (either p ∈ [0, 1] or p ∈ [1, ∞]). We denote these quantities Bayesian probabilities.
Curve Fitting - revisited

Uncertainty about the parameters w is captured in the prior probability p(w).
Observed data: D = {t_1, ..., t_N}.
Calculate the uncertainty in w after the data D have been observed:

p(w | D) = p(D | w) p(w) / p(D)

p(D | w) as a function of w is the likelihood function. The likelihood expresses how probable the data are for different values of w; it is not a probability distribution over w.
Likelihood Function - Frequentist versus Bayesian

The likelihood function p(D | w) appears in both approaches.

Frequentist approach:
- w is considered a fixed parameter, with its value defined by some estimator
- error bars on the estimated w are obtained from the distribution of possible data sets D

Bayesian approach:
- there is only one single data set D (the one observed)
- uncertainty in the parameters comes from a probability distribution over w
Frequentist Estimator - Maximum Likelihood

Choose the w for which the likelihood p(D | w) is maximal, i.e. the w for which the probability of the observed data is maximal.
The error function is the negative log of the likelihood function. Since log is a monotonic function, maximising the likelihood is equivalent to minimising the error.
Example: a fair-looking coin is tossed three times and always lands on heads. The maximum likelihood estimate of the probability of landing heads is 1.
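A numerical sketch of the coin example (the grid of candidate values is our own choice): minimising the negative log-likelihood over a grid pushes the estimate of µ to its upper boundary, illustrating how maximum likelihood overfits three heads in three tosses.

```python
import math

def nll(mu, heads, n):
    """Negative log-likelihood of a Bernoulli coin with head probability mu."""
    return -(heads * math.log(mu) + (n - heads) * math.log(1 - mu))

# Three tosses, three heads: scan candidate values mu in (0, 1).
candidates = [i / 100 for i in range(1, 100)]
mu_ml = min(candidates, key=lambda mu: nll(mu, heads=3, n=3))
print(mu_ml)  # 0.99 -- the largest candidate; the true optimum is mu = 1
```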
Bayesian Approach

- including prior knowledge is easy (via the prior p(w))
- BUT: a badly chosen prior can lead to bad results
- the choice of prior is subjective
- sometimes the choice of prior is motivated by a convenient mathematical form
- we need to sum/integrate over the whole parameter space; this is made practical by
  - advances in sampling (Markov Chain Monte Carlo methods)
  - advances in approximation schemes (Variational Bayes, Expectation Propagation)
Expectations

The expectation is the weighted average of a function f(x) under the probability distribution p(x):

E[f] = Σ_x p(x) f(x)       (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx      (probability density p(x))
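Both forms can be sketched in a few lines (the distributions below are our own examples; the continuous integral is approximated by Monte Carlo sampling):

```python
import random

random.seed(0)

# Discrete case: E[f] = sum over x of p(x) f(x), for a loaded die and f(x) = x**2.
p = {1: 0.1, 2: 0.1, 3: 0.2, 4: 0.2, 5: 0.2, 6: 0.2}
e_discrete = sum(px * x**2 for x, px in p.items())   # exactly 17.7

# Continuous case via Monte Carlo: E[f] ~ (1/N) sum of f(x_i) with x_i drawn
# from p(x); here p(x) is uniform on [0, 1], f(x) = x**2, so E[f] = 1/3.
n = 100_000
e_mc = sum(random.random()**2 for _ in range(n)) / n
```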
Variance

For an arbitrary function f(x):

var[f] = E[(f(x) - E[f(x)])²] = E[f(x)²] - E[f(x)]²

Special case f(x) = x:

var[x] = E[(x - E[x])²] = E[x²] - E[x]²
The Gaussian Distribution

For x ∈ R with mean µ and variance σ²:

N(x | µ, σ²) = (2πσ²)^{-1/2} exp{ -(x - µ)² / (2σ²) }

[Figure: Gaussian density N(x | µ, σ²), centred at µ with width 2σ]
The Gaussian Distribution

N(x | µ, σ²) > 0 and ∫ N(x | µ, σ²) dx = 1

Expectation of x:   E[x] = ∫ N(x | µ, σ²) x dx = µ
Expectation of x²:  E[x²] = ∫ N(x | µ, σ²) x² dx = µ² + σ²
Variance of x:      var[x] = E[x²] - E[x]² = σ²
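These three identities can be verified numerically with a simple Riemann sum (the values of µ and σ² below are our own example):

```python
import math

def gauss(x, mu, sigma2):
    """Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Check normalisation, mean and second moment by a Riemann sum on [-10, 10].
mu, sigma2 = 1.5, 0.25
dx = 1e-3
xs = [-10 + i * dx for i in range(20000)]
z   = sum(gauss(x, mu, sigma2) * dx for x in xs)          # ~ 1
ex  = sum(x * gauss(x, mu, sigma2) * dx for x in xs)      # ~ mu
ex2 = sum(x * x * gauss(x, mu, sigma2) * dx for x in xs)  # ~ mu^2 + sigma^2
```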
Linear Curve Fitting - Least Squares

N = 10, x = (x_1, ..., x_N)^T, t = (t_1, ..., t_N)^T, with x_i ∈ R, t_i ∈ R, i = 1, ..., N
y(x, w) = w_1 x + w_0,  X = [x 1],  w = (X^T X)^{-1} X^T t

We assume

t = y(x, w) + ε

where y(x, w) is deterministic and ε is Gaussian noise.

[Figure: data points and fitted line, x from 0 to 10, t from 0 to 8]
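A minimal simulation of this noise model (the true line t = 0.7x + 1 and noise level σ = 0.1 are our own choices): data are generated as t = y(x, w) + ε and w is recovered from the normal equations, written out explicitly for the 2x2 case.

```python
import random

random.seed(1)

# Generate noisy data from a true line: t = w1 * x + w0 + eps, eps ~ N(0, sigma^2).
w0_true, w1_true, sigma = 1.0, 0.7, 0.1
xs = list(range(10))
ts = [w1_true * x + w0_true + random.gauss(0, sigma) for x in xs]

# Least squares w = (X^T X)^{-1} X^T t for X = [x 1], expanded by hand.
n = len(xs)
sx, st = sum(xs), sum(ts)
sxx = sum(x * x for x in xs)
sxt = sum(x * t for x, t in zip(xs, ts))
det = n * sxx - sx * sx
w1 = (n * sxt - sx * st) / det
w0 = (sxx * st - sx * sxt) / det
# w1 ~ 0.7 and w0 ~ 1.0, up to the noise level
```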
The Mode of a Probability Distribution

Mode of a distribution: the value that occurs most frequently in a probability distribution. For a probability density function, it is the value x at which the density attains its maximum.
The Gaussian distribution has one mode (it is unimodal); the mode is µ.
If there are multiple local maxima in the probability distribution, the distribution is called multimodal (example: a mixture of three Gaussians).

[Figure: a unimodal Gaussian density N(x | µ, σ²) and a multimodal density p(x)]
The Bernoulli Distribution

Two possible outcomes x ∈ {0, 1} (e.g. a coin, which may be damaged):
p(x = 1 | µ) = µ for 0 ≤ µ ≤ 1,  p(x = 0 | µ) = 1 - µ

Bernoulli distribution: Bern(x | µ) = µ^x (1 - µ)^{1-x}
Expectation of x: E[x] = µ
Variance of x: var[x] = µ(1 - µ)
The Binomial Distribution

Flip a coin N times. What is the distribution of observing heads exactly m times? This is a distribution over m ∈ {0, ..., N}.

Binomial distribution: Bin(m | N, µ) = (N choose m) µ^m (1 - µ)^{N-m}

[Figure: histogram of Bin(m | N, µ) for N = 10, µ = 0.25]
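The histogram on the slide can be reproduced directly from the formula; a sketch for N = 10, µ = 0.25:

```python
from math import comb

def binom_pmf(m, n, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(n, m) * mu**m * (1 - mu)**(n - m)

# Distribution over m for N = 10, mu = 0.25, as in the slide's histogram.
pmf = [binom_pmf(m, 10, 0.25) for m in range(11)]
mode = max(range(11), key=lambda m: pmf[m])
print(mode)  # 2 -- the most probable number of heads
```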
The Beta Distribution

Beta(µ | a, b) = Γ(a + b) / (Γ(a) Γ(b)) µ^{a-1} (1 - µ)^{b-1}

[Figure: Beta densities over µ ∈ [0, 1] for (a, b) = (0.1, 0.1), (1, 1), (2, 3) and (8, 4)]
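As a sanity check of the normaliser, the mean of Beta(µ | a, b) is a/(a + b); a sketch verifying this for a = 2, b = 3 by a midpoint Riemann sum (the Gamma function comes from the standard library):

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) with the Gamma-function normaliser from the slide."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

# Midpoint Riemann sum for the mean of Beta(2, 3) on [0, 1].
a, b = 2, 3
dmu = 1e-4
mids = [(i + 0.5) * dmu for i in range(10000)]
mean = sum(mu * beta_pdf(mu, a, b) * dmu for mu in mids)
# mean ~ a / (a + b) = 0.4
```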
The Multivariate Gaussian Distribution

For x ∈ R^D with mean µ ∈ R^D and covariance matrix Σ ∈ R^{D×D}:

N(x | µ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} exp{ -(1/2) (x - µ)^T Σ^{-1} (x - µ) }

where |Σ| is the determinant of Σ.

[Figure: elliptical contour of a 2D Gaussian centred at µ, with principal axes u_1, u_2 of lengths λ_1^{1/2}, λ_2^{1/2}]
The Multivariate Gaussian Distribution over a vector x

We can find a linear transformation to a new coordinate system y = U^T (x - µ) in which the Gaussian factorises, where U is the eigenvector matrix of the covariance matrix Σ with eigenvalue matrix E:

Σ U = U E,   E = diag(λ_1, λ_2, ..., λ_D)

U can be made an orthogonal matrix, so the columns u_i of U are unit vectors which are orthogonal to each other:
u_i^T u_j = 1 if i = j, and 0 if i ≠ j.

Now we can write Σ and its inverse (prove that Σ Σ^{-1} = I):

Σ = U E U^T = ∑_{i=1}^{D} λ_i u_i u_i^T
Σ^{-1} = U E^{-1} U^T = ∑_{i=1}^{D} (1/λ_i) u_i u_i^T
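The decomposition Σ = U E U^T can be illustrated without any linear-algebra library: for a symmetric 2x2 covariance matrix the eigenvalues and eigenvectors have a closed form (the numbers below are our own example).

```python
import math

# Eigendecomposition of a symmetric 2x2 covariance matrix Sigma = U E U^T.
s11, s12, s22 = 2.0, 0.8, 1.0

tr = s11 + s22                       # trace
det = s11 * s22 - s12 * s12          # determinant
disc = math.sqrt(tr * tr / 4 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # eigenvalues, lam1 >= lam2

# Unit eigenvector u1 for lam1; u2 follows by orthogonality.
vx, vy = lam1 - s22, s12
norm = math.hypot(vx, vy)
u1 = (vx / norm, vy / norm)
u2 = (-u1[1], u1[0])

# Reconstruct Sigma as the sum of lam_i * u_i u_i^T.
r11 = lam1 * u1[0] ** 2 + lam2 * u2[0] ** 2
r12 = lam1 * u1[0] * u1[1] + lam2 * u2[0] * u2[1]
r22 = lam1 * u1[1] ** 2 + lam2 * u2[1] ** 2
# (r11, r12, r22) recovers (2.0, 0.8, 1.0) up to rounding
```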
The Multivariate Gaussian Distribution over a vector x

Now use the linear transformation y = U^T (x - µ) and Σ^{-1} = U E^{-1} U^T to transform the exponent:

(x - µ)^T Σ^{-1} (x - µ) = ∑_{j=1}^{D} y_j² / λ_j

Exponentiating the sum (and taking care of the normalising factors) results in a product of scalar Gaussian distributions along the orthogonal directions u_j:

p(y) = ∏_{j=1}^{D} (2πλ_j)^{-1/2} exp{ -y_j² / (2λ_j) }

[Figure: 2D Gaussian contour with transformed axes y_1, y_2 along u_1, u_2]
Partitioned Gaussians

Given a joint Gaussian distribution N(x | µ, Σ) with mean vector µ, covariance matrix Σ and precision matrix Λ ≡ Σ^{-1}, assume the variables can be partitioned into two sets:

x = (x_a, x_b)^T,  µ = (µ_a, µ_b)^T
Σ = [ Σ_aa  Σ_ab ; Σ_ba  Σ_bb ],  Λ = [ Λ_aa  Λ_ab ; Λ_ba  Λ_bb ]

with
Λ_aa = (Σ_aa - Σ_ab Σ_bb^{-1} Σ_ba)^{-1}
Λ_ab = -(Σ_aa - Σ_ab Σ_bb^{-1} Σ_ba)^{-1} Σ_ab Σ_bb^{-1}

Conditional distribution:
p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^{-1}),  µ_{a|b} = µ_a - Λ_aa^{-1} Λ_ab (x_b - µ_b)

Marginal distribution:
p(x_a) = N(x_a | µ_a, Σ_aa)
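For scalar partitions the conditional can be checked in a few lines: the precision-matrix form above must agree with the direct covariance form µ_{a|b} = µ_a + Σ_ab Σ_bb^{-1} (x_b - µ_b). The numbers below are our own example.

```python
# Conditional of a 2D Gaussian, precision form checked against covariance form.
mu_a, mu_b = 0.5, 0.5
s_aa, s_ab, s_bb = 0.04, 0.03, 0.09
x_b = 0.7

# Direct covariance form.
mu_cond = mu_a + s_ab / s_bb * (x_b - mu_b)
var_cond = s_aa - s_ab**2 / s_bb

# Precision form: Lambda_aa = (S_aa - S_ab S_bb^{-1} S_ba)^{-1},
# Lambda_ab = -Lambda_aa S_ab S_bb^{-1},
# mu_{a|b} = mu_a - Lambda_aa^{-1} Lambda_ab (x_b - mu_b).
lam_aa = 1.0 / (s_aa - s_ab**2 / s_bb)
lam_ab = -lam_aa * s_ab / s_bb
mu_cond2 = mu_a - (1.0 / lam_aa) * lam_ab * (x_b - mu_b)
# Both routes give the same conditional mean, and var_cond == 1 / lam_aa.
```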
Partitioned Gaussians

[Figure: contours of a Gaussian distribution over two variables x_a and x_b (left), and the marginal distribution p(x_a) together with the conditional distribution p(x_a | x_b) for x_b = 0.7 (right).]
Decision Theory - Key Ideas

Two classes C_1 and C_2, joint distribution p(x, C_k). Using Bayes' theorem:

p(C_k | x) = p(x | C_k) p(C_k) / p(x)

Example: cancer treatment (K = 2 classes)
- data x: an X-ray image
- C_1: patient has cancer (C_2: patient has no cancer)
- p(C_1) is the prior probability of a person having cancer
- p(C_1 | x) is the posterior probability of a person having cancer after having seen the X-ray data
Decision Theory - Key Ideas

We need a rule which assigns each value of the input x to one of the available classes: the input space is partitioned into decision regions R_k, which leads to decision boundaries or decision surfaces.

Probability of a mistake:
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
Decision Theory - Key Ideas

Probability of a mistake:
p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx

Goal: minimise p(mistake).

[Figure: joint densities p(x, C_1) and p(x, C_2), with the decision boundary between regions R_1 and R_2 and the optimal boundary x_0]
Decision Theory - Key Ideas

For multiple classes, instead of minimising the probability of mistakes, maximise the probability of correct classification:

p(correct) = ∑_{k=1}^{K} p(x ∈ R_k, C_k) = ∑_{k=1}^{K} ∫_{R_k} p(x, C_k) dx
Minimising the Expected Loss

Not all mistakes are equally costly. Weight each misclassification of x into the wrong class C_j, instead of assigning it to the correct class C_k, by a factor L_kj. The expected loss is then

E[L] = ∑_k ∑_j ∫_{R_j} L_kj p(x, C_k) dx

Goal: minimise the expected loss E[L].
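For a single input x this reduces to choosing the class j that minimises ∑_k L_kj p(C_k | x). A sketch with an invented loss matrix for the cancer example (the numbers are illustrative, not from the slides): missing a cancer case is penalised 100 times more than a false alarm.

```python
# Hypothetical loss matrix: L[k][j] is the loss for deciding class j
# when the true class is k (0 = cancer, 1 = healthy).
L = [[0, 100],   # true cancer: deciding "healthy" costs 100
     [1, 0]]     # true healthy: deciding "cancer" costs 1
posterior = [0.05, 0.95]   # p(C_1 | x), p(C_2 | x) for some input x

# Expected loss of each decision j, then pick the minimiser.
expected_loss = [sum(L[k][j] * posterior[k] for k in range(2)) for j in range(2)]
decision = min(range(2), key=lambda j: expected_loss[j])
print(decision)  # 0 -- decide "cancer" even at 5% posterior, because of the loss
```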
The Reject Region

Avoid making automated decisions on difficult cases. Difficult cases:
- the posterior probabilities p(C_k | x) are all very small
- the joint distributions p(x, C_k) have comparable values

[Figure: posteriors p(C_1 | x) and p(C_2 | x) with threshold θ; inputs whose maximum posterior falls below θ lie in the reject region]
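The reject option can be written as a rule: classify only when the largest posterior exceeds a threshold θ, otherwise defer the case (e.g. to a human expert). A minimal sketch; the function name and threshold are our own.

```python
def classify_with_reject(posteriors, theta):
    """Return the index of the most probable class, or "reject" if the
    largest posterior is below the threshold theta."""
    best = max(range(len(posteriors)), key=lambda k: posteriors[k])
    if posteriors[best] >= theta:
        return best
    return "reject"

print(classify_with_reject([0.9, 0.1], theta=0.8))    # 0
print(classify_with_reject([0.55, 0.45], theta=0.8))  # reject
```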
S-fold Cross Validation

Given a set of N data items and targets. Goal: find the best model (type of model, number of parameters like the order p of the polynomial, or the regularisation constant λ), while avoiding overfitting.
Solution: train a machine learning algorithm with some of the data, and evaluate it with the rest. If we have plenty of data:
1. Train a range of models, or one model with a range of parameter values.
2. Compare their performance on an independent data set (the validation set) and choose the one with the best predictive performance.
3. Still, overfitting to the validation set can occur. Therefore, use a third test set for the final evaluation. (Keep the test set in a safe and never give it to the developers ;)
S-fold Cross Validation

For few data there is a dilemma: few training data or few test data. The solution is cross-validation: use a proportion (S - 1)/S of the available data for training, but use all of the data to assess the performance. For very scarce data one may use S = N, which is also called the leave-one-out technique.
S-fold Cross Validation

Partition the data into S groups. Use S - 1 groups to train a set of models that are then evaluated on the remaining group. Repeat for all S choices of the held-out group, and average the performance scores from the S runs.

[Figure: the four runs for S = 4, each holding out a different quarter of the data]
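The partitioning step can be sketched as follows (a minimal index-based version; the interleaved grouping is one arbitrary choice of partition):

```python
def s_fold_splits(n, s):
    """Yield (train_indices, held_out_indices) for each of the S runs."""
    idx = list(range(n))
    folds = [idx[i::s] for i in range(s)]   # S roughly equal groups
    for i in range(s):
        held_out = folds[i]
        train = [j for j in idx if j not in held_out]
        yield train, held_out

splits = list(s_fold_splits(n=8, s=4))
print(len(splits))   # 4 runs
print(splits[0][1])  # held-out group of the first run: [0, 4]
```

Each index appears in exactly one held-out group, so every data point is used for evaluation exactly once while still contributing to training in the other S - 1 runs.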