
Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February – June 2018. (Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part IV: Probability and Uncertainty

Linear Curve Fitting - Least Squares

$N = 10$ data points, $\mathbf{x} = (x_1, \dots, x_N)^T$, $\mathbf{t} = (t_1, \dots, t_N)^T$, with $x_i \in \mathbb{R}$, $t_i \in \mathbb{R}$ for $i = 1, \dots, N$. Linear model $y(x, \mathbf{w}) = w_1 x + w_0$ with design matrix $X = [\mathbf{x} \;\; \mathbf{1}]$, and least-squares solution $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$. [Figure: the $N = 10$ data points plotted against $x$.] How do we choose a noise model?

Simple Experiment

1 Choose a box: red box with probability $p(B = r) = 4/10$, blue box with probability $p(B = b) = 6/10$.
2 Choose any item of the selected box with equal probability.
Given that we have chosen an orange, what is the probability that the box we chose was the blue one?

What do we know?

$p(F = o \mid B = b) = 1/4$, $p(F = a \mid B = b) = 3/4$, $p(F = o \mid B = r) = 3/4$, $p(F = a \mid B = r) = 1/4$.
Given that we have chosen an orange, what is the probability that the box we chose was the blue one? That is, what is $p(B = b \mid F = o)$?

Calculating the Posterior $p(B = b \mid F = o)$

By Bayes' theorem,
$$p(B = b \mid F = o) = \frac{p(F = o \mid B = b)\, p(B = b)}{p(F = o)}$$
Sum rule for the denominator:
$$p(F = o) = p(F = o, B = b) + p(F = o, B = r) = p(F = o \mid B = b)\,p(B = b) + p(F = o \mid B = r)\,p(B = r) = \frac{1}{4}\cdot\frac{6}{10} + \frac{3}{4}\cdot\frac{4}{10} = \frac{9}{20}$$
Therefore
$$p(B = b \mid F = o) = \frac{1}{4}\cdot\frac{6}{10}\cdot\frac{20}{9} = \frac{1}{3}$$
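As a numerical sanity check on this calculation, here is a minimal sketch that applies the sum and product rules directly (the variable names are mine, not from the slides):

```python
# Illustrative sketch: verifying the box/fruit posterior with the sum and product rules.
from fractions import Fraction as Frac

prior = {"r": Frac(4, 10), "b": Frac(6, 10)}   # p(B)
likelihood = {"r": Frac(3, 4), "b": Frac(1, 4)}  # p(F = o | B)

# Sum rule: p(F = o) = sum_B p(F = o | B) p(B)
evidence = sum(likelihood[box] * prior[box] for box in prior)

# Bayes' theorem: p(B = b | F = o)
posterior_blue = likelihood["b"] * prior["b"] / evidence
print(evidence, posterior_blue)   # 9/20 and 1/3, as above
```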

Simple Experiment - revisited

Before choosing an item from a box, the most complete information we have is the prior $p(B)$. Note that in our example $p(B = b) = \frac{6}{10}$, so judging from the prior alone we would opt for the blue box. Once we observe some data (e.g. we draw an orange), we can calculate the posterior probability $p(B = b \mid F = o)$ via Bayes' theorem. After observing an orange, the posterior probability is $p(B = b \mid F = o) = \frac{1}{3}$ and therefore $p(B = r \mid F = o) = \frac{2}{3}$: having observed an orange, it is now more likely that the orange came from the red box.

Bayes Rule

$$\underbrace{p(Y \mid X)}_{\text{posterior}} = \frac{\overbrace{p(X \mid Y)}^{\text{likelihood}}\;\overbrace{p(Y)}^{\text{prior}}}{\underbrace{p(X)}_{\text{normalisation}}} = \frac{p(X \mid Y)\, p(Y)}{\sum_Y p(X \mid Y)\, p(Y)}$$

Classical or frequentist interpretation of probabilities: long-run frequencies of repeatable events. Bayesian view: probabilities represent uncertainty. Example: will the Arctic ice cap have disappeared by the end of the century? This is not a repeatable experiment, yet fresh evidence can change our opinion on the ice loss. Goal: quantify uncertainty and revise it in the light of new evidence. We therefore use the Bayesian interpretation of probability.

Andrey Kolmogorov - Axiomatic Probability Theory (1933)

Let $(\Omega, \mathcal{F}, P)$ be a measure space with $P(\Omega) = 1$. Then $(\Omega, \mathcal{F}, P)$ is a probability space, with sample space $\Omega$, event space $\mathcal{F}$ and probability measure $P$.
1. Axiom: $P(E) \ge 0$ for all $E \in \mathcal{F}$.
2. Axiom: $P(\Omega) = 1$.
3. Axiom: $P(E_1 \cup E_2 \cup \dots) = \sum_i P(E_i)$ for any countable sequence of pairwise disjoint events $E_1, E_2, \dots$

Richard Threlkeld Cox (1946, 1961)

Postulates (according to Arnborg and Sjödin):
1 Divisibility and comparability: the plausibility of a statement is a real number and depends on the information we have related to the statement.
2 Common sense: plausibilities should vary sensibly with the assessment of plausibilities in the model.
3 Consistency: if the plausibility of a statement can be derived in many ways, all the results must be equal.
Common sense includes consistency with Aristotelian logic.
Result: the numerical quantities representing plausibility all behave according to the rules of probability (either $p \in [0, 1]$ or $p \in [1, \infty]$). We denote these quantities Bayesian probabilities.

Curve fitting - revisited

Uncertainty about the parameter $\mathbf{w}$ is captured in the prior probability $p(\mathbf{w})$. Given observed data $D = \{t_1, \dots, t_N\}$, we calculate the uncertainty in $\mathbf{w}$ after the data $D$ have been observed:
$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\, p(\mathbf{w})}{p(D)}$$
$p(D \mid \mathbf{w})$ viewed as a function of $\mathbf{w}$ is the likelihood function. The likelihood expresses how probable the data are for different values of $\mathbf{w}$; it is not a probability distribution over $\mathbf{w}$.

Likelihood Function - Frequentist versus Bayesian

Frequentist approach: $\mathbf{w}$ is considered a fixed parameter whose value is defined by some estimator; error bars on the estimated $\mathbf{w}$ are obtained from the distribution of possible data sets $D$.
Bayesian approach: there is only one single data set $D$ (the one actually observed); the uncertainty in the parameters comes from a probability distribution over $\mathbf{w}$.
In both views the likelihood function $p(D \mid \mathbf{w})$ plays a central role.

Frequentist Estimator - Maximum Likelihood

Choose the $\mathbf{w}$ for which the likelihood $p(D \mid \mathbf{w})$ is maximal, i.e. for which the probability of the observed data is maximal. The error function is the negative log of the likelihood function; since the log is a monotonic function, maximising the likelihood is equivalent to minimising the error.
Example: a fair-looking coin is tossed three times and always lands on heads. The maximum likelihood estimate of the probability of landing heads is 1.
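A minimal sketch of this coin example (my own illustration, not from the slides): for a Bernoulli likelihood the maximum likelihood estimate is the sample mean, so three heads out of three tosses gives exactly 1.

```python
import numpy as np

tosses = np.array([1, 1, 1])          # three tosses, all heads (1 = heads)

# Bernoulli log-likelihood: log p(D | mu) = sum_i [x_i log mu + (1 - x_i) log(1 - mu)]
def log_likelihood(mu, data):
    eps = 1e-12                        # avoid log(0) at the boundary
    return np.sum(data * np.log(mu + eps) + (1 - data) * np.log(1 - mu + eps))

mus = np.linspace(0.01, 0.99, 99)
mu_ml = mus[np.argmax([log_likelihood(mu, tosses) for mu in mus])]
print(mu_ml, tosses.mean())            # grid maximum 0.99; the closed form gives exactly 1.0
```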

Bayesian Approach

Including prior knowledge is easy (via the prior $p(\mathbf{w})$). BUT: a badly chosen prior can lead to bad results; the choice of prior is subjective, and it is sometimes motivated by a convenient mathematical form. We also need to sum/integrate over the whole parameter space, which has become feasible through advances in sampling (Markov Chain Monte Carlo methods) and in approximation schemes (Variational Bayes, Expectation Propagation).
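To make the role of the prior concrete, here is a hedged sketch (my example and numbers, not from the slides) using a conjugate Beta prior on the coin's heads probability: the posterior mean after three heads stays well below 1, unlike the maximum likelihood estimate above.

```python
# Conjugate Beta-Bernoulli update (illustrative sketch).
a, b = 2.0, 2.0                 # Beta(a, b) prior: mildly favours fair coins
heads, tails = 3, 0             # observed data: three heads, no tails

a_post, b_post = a + heads, b + tails        # Beta posterior parameters
posterior_mean = a_post / (a_post + b_post)  # E[mu | D]

print(posterior_mean)           # 5/7 ~ 0.714, pulled towards the prior mean 0.5
```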

Expectations

The weighted average of a function $f(x)$ under the probability distribution $p(x)$:
$$\mathbb{E}[f] = \sum_x p(x)\, f(x) \qquad \text{(discrete distribution } p(x)\text{)}$$
$$\mathbb{E}[f] = \int p(x)\, f(x)\, dx \qquad \text{(probability density } p(x)\text{)}$$
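A small sketch of both cases (illustrative numbers of my own): a discrete expectation as a weighted sum, and a continuous one approximated by Monte Carlo sampling.

```python
import numpy as np

# Discrete case: E[f] = sum_x p(x) f(x) for a fair die with f(x) = x^2
p = np.full(6, 1 / 6)                    # uniform pmf over {1, ..., 6}
x = np.arange(1, 7)
print(np.sum(p * x**2))                  # 91/6 ~ 15.17

# Continuous case: E[f] = int p(x) f(x) dx, approximated by sampling from p(x) = N(0, 1)
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.mean(samples**2))               # ~1.0, since E[x^2] = var + mean^2 = 1
```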

Variance

For an arbitrary function $f(x)$:
$$\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2$$
Special case $f(x) = x$:
$$\operatorname{var}[x] = \mathbb{E}\big[(x - \mathbb{E}[x])^2\big] = \mathbb{E}[x^2] - \mathbb{E}[x]^2$$

The Gaussian Distribution

For $x \in \mathbb{R}$ with mean $\mu$ and variance $\sigma^2$:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\}$$
[Figure: the density $\mathcal{N}(x \mid \mu, \sigma^2)$ plotted against $x$, centred at $\mu$ with the width $2\sigma$ indicated.]

The Gaussian Distribution

$\mathcal{N}(x \mid \mu, \sigma^2) > 0$ and $\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, dx = 1$.
Expectation of $x$: $\mathbb{E}[x] = \int \mathcal{N}(x \mid \mu, \sigma^2)\, x\, dx = \mu$
Expectation of $x^2$: $\mathbb{E}[x^2] = \int \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, dx = \mu^2 + \sigma^2$
Variance of $x$: $\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2$
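These moments are easy to check numerically; a minimal sketch with parameters of my choosing:

```python
import numpy as np

mu, sigma = 1.5, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=1_000_000)

# Monte Carlo estimates of the moments listed above
print(x.mean())            # ~ mu = 1.5
print((x**2).mean())       # ~ mu^2 + sigma^2 = 6.25
print(x.var())             # ~ sigma^2 = 4.0
```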

Linear Curve Fitting - Least Squares

$N = 10$ data points, $\mathbf{x} = (x_1, \dots, x_N)^T$, $\mathbf{t} = (t_1, \dots, t_N)^T$, with $x_i \in \mathbb{R}$, $t_i \in \mathbb{R}$ for $i = 1, \dots, N$. Linear model $y(x, \mathbf{w}) = w_1 x + w_0$ with design matrix $X = [\mathbf{x} \;\; \mathbf{1}]$, and least-squares solution $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$. [Figure: the data points plotted against $x$ with the fitted line.] We assume
$$t = \underbrace{y(x, \mathbf{w})}_{\text{deterministic}} + \underbrace{\epsilon}_{\text{Gaussian noise}}$$
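The following sketch generates synthetic data under exactly this noise model and recovers the weights with the normal equations (the true weights and noise level are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
w_true = np.array([0.7, 0.5])                 # [w1, w0], chosen for illustration

# t = y(x, w) + eps with Gaussian noise eps ~ N(0, 0.5^2)
x = rng.uniform(0, 10, size=N)
t = w_true[0] * x + w_true[1] + rng.normal(0, 0.5, size=N)

# Design matrix X = [x 1] and the least-squares solution w = (X^T X)^{-1} X^T t
X = np.column_stack([x, np.ones(N)])
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)                                      # close to [0.7, 0.5]
```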

The Mode of a Probability Distribution

The mode of a distribution is the value that occurs most frequently; for a probability density function it is the value $x$ at which the density attains its maximum. The Gaussian has one mode (it is unimodal); its mode is $\mu$. If there are multiple local maxima, the distribution is called multimodal (example: a mixture of three Gaussians). [Figures: a unimodal Gaussian $\mathcal{N}(x \mid \mu, \sigma^2)$ with mode $\mu$, and a multimodal density $p(x)$ with several local maxima.]

The Bernoulli Distribution

Two possible outcomes $x \in \{0, 1\}$ (e.g. a coin which may be damaged):
$$p(x = 1 \mid \mu) = \mu \quad \text{for } 0 \le \mu \le 1, \qquad p(x = 0 \mid \mu) = 1 - \mu$$
Bernoulli distribution: $\operatorname{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$
Expectation of $x$: $\mathbb{E}[x] = \mu$. Variance of $x$: $\operatorname{var}[x] = \mu(1 - \mu)$.

The Binomial Distribution

Flip a coin $N$ times. What is the distribution of observing heads exactly $m$ times? This is a distribution over $m \in \{0, \dots, N\}$.
Binomial distribution:
$$\operatorname{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$$
[Figure: histogram of $\operatorname{Bin}(m \mid N, \mu)$ for $N = 10$, $\mu = 0.25$.]
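A short sketch evaluating this pmf for the figure's parameters ($N = 10$, $\mu = 0.25$); the helper function name is mine:

```python
from math import comb

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) mu^m (1 - mu)^(N - m)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]
print([round(p, 3) for p in pmf])   # peaks around m = 2, matching the figure
print(sum(pmf))                     # 1.0 (normalised)
```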

The Beta Distribution

$$\operatorname{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \mu^{a - 1} (1 - \mu)^{b - 1}$$
[Figure: Beta densities over $\mu \in [0, 1]$ for $(a, b) = (0.1, 0.1)$, $(1, 1)$, $(2, 3)$ and $(8, 4)$.]

The Gaussian Distribution

For $\mathbf{x} \in \mathbb{R}^D$ with mean $\boldsymbol{\mu} \in \mathbb{R}^D$ and covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\Big\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big\}$$
where $|\Sigma|$ is the determinant of $\Sigma$. [Figure: elliptical contour of the density in the $(x_1, x_2)$ plane, centred at $\boldsymbol{\mu}$, with principal axes $\mathbf{u}_1$, $\mathbf{u}_2$ of lengths $\lambda_1^{1/2}$, $\lambda_2^{1/2}$ and rotated coordinates $y_1$, $y_2$.]

The Gaussian Distribution over a vector x

We can find a linear transformation to a new coordinate system $\mathbf{y}$ in which the Gaussian factorises. Write $\Sigma U = U E$ and $\mathbf{y} = U^T(\mathbf{x} - \boldsymbol{\mu})$, where $U$ is the eigenvector matrix of the covariance matrix $\Sigma$ with eigenvalue matrix
$$E = \begin{pmatrix} \lambda_1 & 0 & \dots & 0 \\ 0 & \lambda_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \lambda_D \end{pmatrix}$$
$U$ can be made an orthogonal matrix, so the columns $\mathbf{u}_i$ of $U$ are unit vectors which are orthogonal to each other:
$$\mathbf{u}_i^T \mathbf{u}_j = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$
Now we can write $\Sigma$ and its inverse (prove that $\Sigma\Sigma^{-1} = I$):
$$\Sigma = U E U^T = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^T, \qquad \Sigma^{-1} = U E^{-1} U^T = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T$$

The Gaussian Distribution over a vector x

Now use the linear transformation $\mathbf{y} = U^T(\mathbf{x} - \boldsymbol{\mu})$ and $\Sigma^{-1} = U E^{-1} U^T$ to transform the exponent $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ into
$$\mathbf{y}^T E^{-1} \mathbf{y} = \sum_{j=1}^{D} \frac{y_j^2}{\lambda_j}$$
Exponentiating the sum (and taking care of the normalisation factors) results in a product of scalar-valued Gaussian distributions in the orthogonal directions $\mathbf{u}_i$:
$$p(\mathbf{y}) = \prod_{j=1}^{D} \frac{1}{(2\pi\lambda_j)^{1/2}} \exp\Big\{-\frac{y_j^2}{2\lambda_j}\Big\}$$
[Figure: elliptical contour with principal axes $\mathbf{u}_1$, $\mathbf{u}_2$ of lengths $\lambda_1^{1/2}$, $\lambda_2^{1/2}$ in the rotated coordinates $y_1$, $y_2$.]
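A hedged numerical sketch of this diagonalisation (the covariance matrix below is my own illustrative choice): after the transform $\mathbf{y} = U^T(\mathbf{x} - \boldsymbol{\mu})$, the sample covariance of $\mathbf{y}$ is approximately the diagonal eigenvalue matrix $E$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigendecomposition Sigma = U E U^T (eigh returns eigenvalues and an orthonormal U)
eigvals, U = np.linalg.eigh(Sigma)
print(np.allclose(U @ np.diag(eigvals) @ U.T, Sigma))   # True

# Transformed samples y = U^T (x - mu) have (approximately) diagonal covariance E
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = (x - mu) @ U
print(np.cov(y.T))                                      # ~ diag(eigvals)
```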

Partitioned Gaussians

Given a joint Gaussian distribution $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)$, where $\boldsymbol{\mu}$ is the mean vector, $\Sigma$ the covariance matrix and $\Lambda \equiv \Sigma^{-1}$ the precision matrix, assume that the variables can be partitioned into two sets:
$$\mathbf{x} = \begin{pmatrix} \mathbf{x}_a \\ \mathbf{x}_b \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, \quad \Lambda = \begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}$$
with
$$\Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}, \qquad \Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1}$$
Conditional distribution:
$$p(\mathbf{x}_a \mid \mathbf{x}_b) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_{a|b}, \Lambda_{aa}^{-1}), \qquad \boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \Lambda_{aa}^{-1}\Lambda_{ab}(\mathbf{x}_b - \boldsymbol{\mu}_b)$$
Marginal distribution:
$$p(\mathbf{x}_a) = \mathcal{N}(\mathbf{x}_a \mid \boldsymbol{\mu}_a, \Sigma_{aa})$$

Partitioned Gaussians

[Figure: contours of a Gaussian distribution over two variables $x_a$ and $x_b$ (left), and the marginal distribution $p(x_a)$ and conditional distribution $p(x_a \mid x_b)$ for $x_b = 0.7$ (right).]
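As a small sketch of these formulas in two dimensions (my own numbers, not those behind the figure), the conditional can be computed via the equivalent Schur-complement form, $\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\mathbf{x}_b - \boldsymbol{\mu}_b)$ with covariance $\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} = \Lambda_{aa}^{-1}$:

```python
import numpy as np

# Illustrative 2-D example: x = (x_a, x_b)
mu = np.array([0.5, 0.5])
Sigma = np.array([[0.04, 0.03],
                  [0.03, 0.06]])
x_b = 0.7

mu_a, mu_b = mu[0], mu[1]
S_aa, S_ab, S_ba, S_bb = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

# Conditional p(x_a | x_b): mean and variance via the Schur complement
cond_mean = mu_a + S_ab / S_bb * (x_b - mu_b)
cond_var = S_aa - S_ab / S_bb * S_ba

print(cond_mean, cond_var)     # conditional p(x_a | x_b = 0.7)
print(mu_a, S_aa)              # marginal p(x_a) = N(mu_a, Sigma_aa)
```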

Decision Theory - Key Ideas

Two classes $\mathcal{C}_1$ and $\mathcal{C}_2$ with joint distribution $p(\mathbf{x}, \mathcal{C}_k)$. Using Bayes' theorem,
$$p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})}$$
Example: cancer treatment ($K = 2$ classes). The data $\mathbf{x}$ is an X-ray image; $\mathcal{C}_1$: the patient has cancer ($\mathcal{C}_2$: the patient has no cancer). $p(\mathcal{C}_1)$ is the prior probability of a person having cancer; $p(\mathcal{C}_1 \mid \mathbf{x})$ is the posterior probability of the person having cancer after having seen the X-ray data.

Decision Theory - Key Ideas

We need a rule which assigns each value of the input $\mathbf{x}$ to one of the available classes, partitioning the input space into decision regions $\mathcal{R}_k$. This leads to decision boundaries or decision surfaces. The probability of a mistake is
$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}$$

Decision Theory - Key Ideas

The probability of a mistake is
$$p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}$$
Goal: minimise $p(\text{mistake})$. [Figure: the joint densities $p(x, \mathcal{C}_1)$ and $p(x, \mathcal{C}_2)$ plotted against $x$, with a decision boundary $\hat{x}$ dividing the regions $\mathcal{R}_1$ and $\mathcal{R}_2$ and the optimal boundary $x_0$.]
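The decision rule that minimises this quantity assigns each $\mathbf{x}$ to the class with the larger joint density $p(\mathbf{x}, \mathcal{C}_k)$ (equivalently, the larger posterior). A hedged sketch with two one-dimensional Gaussian class-conditionals of my own choosing, estimating $p(\text{mistake})$ by Monte Carlo:

```python
import numpy as np

# Illustrative two-class problem: p(x, C_k) = p(x | C_k) p(C_k)
priors = np.array([0.6, 0.4])
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def joint(x):
    """p(x, C_k) = p(x | C_k) p(C_k) for k = 1, 2, stacked row-wise."""
    return np.stack([priors[k] * gauss_pdf(x, means[k], stds[k]) for k in (0, 1)])

# Assign each x to the class with the larger joint density, estimate p(mistake) by Monte Carlo
rng = np.random.default_rng(0)
labels = rng.choice(2, size=200_000, p=priors)
x = rng.normal(means[labels], stds[labels])
predictions = np.argmax(joint(x), axis=0)
print(np.mean(predictions != labels))   # ~0.15, the smallest achievable error for this setup
```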

Decision Theory - Key Ideas

With multiple classes, instead of minimising the probability of mistakes we maximise the probability of correct classification:
$$p(\text{correct}) = \sum_{k=1}^{K} p(\mathbf{x} \in \mathcal{R}_k, \mathcal{C}_k) = \sum_{k=1}^{K} \int_{\mathcal{R}_k} p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}$$

Minimising the Expected Loss

Not all mistakes are equally costly. Weight each misclassification of $\mathbf{x}$ to the wrong class $\mathcal{C}_j$, instead of assigning it to the correct class $\mathcal{C}_k$, by a factor $L_{kj}$. The expected loss is now
$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}$$
Goal: minimise the expected loss $\mathbb{E}[L]$.
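Minimising the expected loss amounts to assigning each $\mathbf{x}$ to the class $j$ that minimises $\sum_k L_{kj}\, p(\mathcal{C}_k \mid \mathbf{x})$. A minimal sketch with a hypothetical loss matrix for the cancer example (the numbers are mine):

```python
import numpy as np

# Hypothetical loss matrix L[k, j]: cost of deciding class j when the truth is class k.
# Class 0 = cancer, class 1 = healthy; missing a cancer is penalised heavily.
L = np.array([[0.0, 1000.0],
              [1.0, 0.0]])

def decide(posterior):
    """Pick the class j minimising the expected loss sum_k L[k, j] * p(C_k | x)."""
    expected_loss = posterior @ L          # one entry per possible decision j
    return int(np.argmin(expected_loss))

print(decide(np.array([0.01, 0.99])))      # 0: even a 1% cancer posterior triggers treatment
print(decide(np.array([0.0005, 0.9995])))  # 1: only a very confident 'healthy' posterior decides 1
```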

The Reject Region

Avoid making automated decisions on difficult cases. Difficult cases are those where the largest posterior probability $p(\mathcal{C}_k \mid \mathbf{x})$ is significantly less than 1, or equivalently where the joint distributions $p(\mathbf{x}, \mathcal{C}_k)$ have comparable values. [Figure: the posteriors $p(\mathcal{C}_1 \mid x)$ and $p(\mathcal{C}_2 \mid x)$ plotted against $x$ with a threshold $\theta$; inputs whose largest posterior falls below $\theta$ lie in the reject region.]
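A small sketch of the reject option (the function name and threshold are my own): report the most probable class only if its posterior clears the threshold $\theta$, otherwise defer.

```python
import numpy as np

def classify_with_reject(posterior, theta=0.9):
    """Return the most probable class, or None (reject) if its posterior is below theta."""
    k = int(np.argmax(posterior))
    return k if posterior[k] >= theta else None

print(classify_with_reject(np.array([0.95, 0.05])))   # 0
print(classify_with_reject(np.array([0.55, 0.45])))   # None: too uncertain, defer the decision
```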

S-fold Cross Validation

Given a set of N data items and targets. Goal: find the best model (type of model, number of parameters such as the order p of the polynomial, or the regularisation constant λ) while avoiding overfitting. Solution: train a machine learning algorithm with some of the data and evaluate it with the rest. If we have plenty of data:
1 Train a range of models, or a model with a range of parameters.
2 Compare the performance on an independent data set (the validation set) and choose the model with the best predictive performance.
3 Overfitting to the validation set can still occur; therefore, use a third test set for the final evaluation. (Keep the test set in a safe and never give it to the developers ;)

S-fold Cross Validation

For few data there is a dilemma: few training data or few test data. The solution is cross-validation: use a proportion (S − 1)/S of the available data for training, but use all of the data to assess the performance. For very scarce data one may use S = N, which is also called the leave-one-out technique.

S-fold Cross Validation

Partition the data into S groups. Use S − 1 groups to train a set of models that are then evaluated on the remaining group. Repeat for all S choices of the held-out group, and average the performance scores from the S runs. [Figure: example for S = 4, showing four runs, each holding out a different quarter of the data.]
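A minimal sketch of the splitting and averaging procedure (the helper name and the placeholder score are my own; real use would train and score a model inside the loop):

```python
import numpy as np

def s_fold_splits(n_items, S, seed=0):
    """Yield (train_indices, validation_indices) for each of the S runs."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_items)
    folds = np.array_split(indices, S)
    for held_out in range(S):
        train = np.concatenate([folds[i] for i in range(S) if i != held_out])
        yield train, folds[held_out]

# Example for S = 4: average a per-run score over the four runs.
scores = []
for train_idx, val_idx in s_fold_splits(n_items=20, S=4):
    score = len(val_idx)          # placeholder for: train on train_idx, evaluate on val_idx
    scores.append(score)
print(np.mean(scores))            # average performance over the S runs
```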