Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra
February - June 2018
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part II
Introduction
Flavour of this course
- Formalise intuitions about problems
- Use the language of mathematics to express models
- Geometry, vectors, linear algebra for reasoning
- Probabilistic models to capture uncertainty
- Design and analysis of algorithms
- Numerical algorithms in Python
- Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Some artificial data, created from the function sin(2πx) plus random noise, for x between 0 and 1.
[Figure: the noisy target values t plotted against x, together with the underlying curve.]
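A minimal sketch of how such a toy dataset could be generated with NumPy. The seed and the noise level 0.3 are assumptions for illustration, not values stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (an assumption)

N = 10
x = rng.uniform(0.0, 1.0, size=N)                           # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

print(x.shape, t.shape)
```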
Input Specification
N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ ℝ and t_i ∈ ℝ for i = 1, ..., N
Model Specification
M: order of the polynomial
y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m
A nonlinear function of x, but a linear function of the unknown model parameters w.
How can we find good parameters w = (w_0, ..., w_M)^T?
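The polynomial model can be evaluated as a matrix-vector product against a design matrix whose columns are the powers of x. This is a sketch, not the course's own code; the helper names are my own:

```python
import numpy as np

def poly_design_matrix(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M, so y = Phi @ w."""
    return np.vander(x, M + 1, increasing=True)

def y(x, w):
    """Evaluate y(x, w) = Σ_m w_m x^m for scalar or array x."""
    M = len(w) - 1
    return poly_design_matrix(np.atleast_1d(x), M) @ w

w = np.array([1.0, -2.0, 3.0])   # w_0 + w_1 x + w_2 x^2
print(y(0.5, w))                 # 1 - 2*0.5 + 3*0.25 = 0.75
```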
Learning is Improving Performance
[Figure: a training point (x_n, t_n) and the model prediction y(x_n, w); the vertical bar between them is the error.]
Performance measure: the error between the targets and the predictions of the model on the training data
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2
E(w) has a unique minimum, attained at some w*.
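The minimiser w* of the sum-of-squares error can be found with an ordinary least-squares solve. A sketch under the same assumptions as before (synthetic sin(2πx) data, my own function names):

```python
import numpy as np

def E(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * Σ_n (y(x_n, w) - t_n)^2."""
    Phi = np.vander(x, len(w), increasing=True)   # design matrix, columns x^m
    r = Phi @ w - t
    return 0.5 * r @ r

def fit(x, t, M):
    """Least-squares minimiser of E(w) for a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
w_star = fit(x, t, M=3)
print(E(w_star, x, t))
```

Since w* is the global minimiser, perturbing it can only increase the error.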
Model Comparison or Model Selection
y(x, w) = Σ_{m=0}^{M} w_m x^m
M = 0: y(x, w) = w_0. [Figure: constant fit.]
M = 1: y(x, w) = w_0 + w_1 x. [Figure: straight-line fit.]
M = 3: y(x, w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3. [Figure: cubic fit.]
M = 9: y(x, w) = w_0 + w_1 x + ... + w_8 x^8 + w_9 x^9. [Figure: the degree-9 fit passes through every data point but oscillates wildly: overfitting.]
Testing the Model
Train the model to obtain w*.
Get 100 new data points.
Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*) / N)
[Figure: E_RMS for the training data and the test data, plotted against M = 0, ..., 9.]
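The train/test comparison above can be sketched as follows. This is illustrative code under assumed seeds and noise level, not the figure's exact setup:

```python
import numpy as np

def rms(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

rng = np.random.default_rng(2)
f = lambda u: np.sin(2 * np.pi * u)
x_train = np.linspace(0.0, 1.0, 10)
t_train = f(x_train) + rng.normal(scale=0.2, size=10)
x_test = rng.uniform(0.0, 1.0, 100)            # 100 new data points
t_test = f(x_test) + rng.normal(scale=0.2, size=100)

train_rms, test_rms = {}, {}
for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_rms[M] = rms(w, x_train, t_train)
    test_rms[M] = rms(w, x_test, t_test)
    print(M, train_rms[M], test_rms[M])
```

With M = 9 and N = 10 the polynomial interpolates the training data (training error essentially zero), while the test error stays at least at the noise level.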
Testing the Model

           M = 0     M = 1      M = 3          M = 9
  w_0*      0.19      0.82       0.31           0.35
  w_1*                -1.27       7.99         232.37
  w_2*                          -25.43       -5321.83
  w_3*                           17.37       48568.31
  w_4*                                     -231639.30
  w_5*                                      640042.26
  w_6*                                    -1061800.52
  w_7*                                     1042400.18
  w_8*                                     -557682.99
  w_9*                                      125201.43

Table: Coefficients w* for polynomials of various order.
More Data
N = 15
[Figure: the fit with N = 15 data points.]
More Data
N = 100
Heuristic: have no less than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.
[Figure: the fit with N = 100 data points.]
Regularisation
How can we constrain the growth of the coefficients w?
Add a regularisation term to the error function:
Ẽ(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2 + (λ/2) ||w||^2
where the squared norm of the parameter vector w is
||w||^2 ≡ w^T w = w_0^2 + w_1^2 + ... + w_M^2
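Minimising Ẽ(w) has a closed-form solution (ridge regression): setting the gradient to zero gives (Φ^T Φ + λI) w = Φ^T t. A sketch with assumed data and two assumed values of λ:

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Closed-form minimiser of Ẽ(w) = 1/2 Σ (y(x_n, w) - t_n)^2 + λ/2 ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

w_small = fit_regularised(x, t, M=9, lam=np.exp(-18))   # weak regularisation
w_large = fit_regularised(x, t, M=9, lam=1.0)           # strong regularisation
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Larger λ shrinks the coefficient vector, which is exactly how the regulariser constrains the wild coefficients of the M = 9 fit.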
Regularisation
M = 9
[Figure: the regularised fit with ln λ = -18.]

Regularisation
M = 9
[Figure: the regularised fit with ln λ = 0.]

Regularisation
M = 9
[Figure: E_RMS for the training data and the test data, plotted against ln λ.]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Task: regression
- Experience: x input examples, t output labels
- Performance: squared error
- Model choice
- Regularisation
- Do not train on the test set!
[Figure: samples from a joint distribution p(X, Y), where Y takes the two values Y = 1 and Y = 2.]
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60
[Figure: the same counts shown as histograms of p(X, Y) for Y = 1 and Y = 2.]
Sum Rule
From the table above:
p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60
In general:
p(X = d) = Σ_Y p(X = d, Y)
p(X) = Σ_Y p(X, Y)
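The sum rule on the count table amounts to summing the normalised joint array along one axis. A sketch using the table from the slide:

```python
import numpy as np

# joint counts n(X, Y) from the slide's table: rows Y = 1, Y = 2; columns a..i
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()       # p(X, Y); total count is 60

p_x = joint.sum(axis=0)             # sum rule: p(X) = Σ_Y p(X, Y)
p_y = joint.sum(axis=1)             # sum rule: p(Y) = Σ_X p(X, Y)

d = 3                               # column index of X = d
print(p_x[d])                       # 9/60
print(p_y)                          # [34/60, 26/60]
```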
Sum Rule
p(X) = Σ_Y p(X, Y)
p(Y) = Σ_X p(X, Y)
[Figure: the marginals p(X) and p(Y) obtained from the joint counts.]
Product Rule
Conditional probability: p(X = d | Y = 1) = 8/34
Calculate p(Y = 1): p(Y = 1) = Σ_X p(X, Y = 1) = 34/60
Then p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1) = (8/34)(34/60) = 8/60
In general:
p(X, Y) = p(X | Y) p(Y)
Product Rule
p(X) = Σ_Y p(X, Y)
p(X, Y) = p(X | Y) p(Y)
[Figure: the marginal p(X) and the conditional p(X | Y = 1).]
Sum Rule and Product Rule
Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X | Y) p(Y)
Bayes Theorem
Use the product rule in both orders:
p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)
Bayes Theorem:
p(Y | X) = p(X | Y) p(Y) / p(X)
only defined for p(X) > 0, and
p(X) = Σ_Y p(X, Y)           (sum rule)
     = Σ_Y p(X | Y) p(Y)     (product rule)
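Applied to the count table, Bayes theorem inverts the conditional: from p(X | Y) we recover p(Y | X). A sketch reusing the slide's table:

```python
import numpy as np

# joint counts from the slide's table (rows Y = 1, Y = 2; columns a..i)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()        # p(X, Y)
p_y = joint.sum(axis=1)              # p(Y), sum rule
p_x = joint.sum(axis=0)              # p(X), sum rule

d = 3                                # column index of X = d
p_x_given_y1 = joint[0] / p_y[0]     # p(X | Y = 1) = p(X, Y = 1) / p(Y = 1)

# Bayes: p(Y = 1 | X = d) = p(X = d | Y = 1) p(Y = 1) / p(X = d)
p_y1_given_xd = p_x_given_y1[d] * p_y[0] / p_x[d]
print(p_y1_given_xd)                 # 8/9
```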
Probability Densities
Real-valued variable x ∈ ℝ.
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.
p(x ∈ (a, b)) = ∫_a^b p(x) dx
[Figure: a density p(x), its cumulative distribution P(x), and an interval of width δx.]
Constraints on p(x)
Nonnegative: p(x) ≥ 0
Normalisation: ∫ p(x) dx = 1
Cumulative distribution function P(x)
P(x) = ∫_{-∞}^{x} p(z) dz
(d/dx) P(x) = p(x)
Multivariate Probability Density
Vector x ≡ (x_1, ..., x_D)^T
Nonnegative: p(x) ≥ 0
Normalisation: ∫ p(x) dx = 1
This means ∫ ... ∫ p(x) dx_1 ... dx_D = 1.
Sum and Product Rule for Densities
Sum Rule: p(x) = ∫ p(x, y) dy
Product Rule: p(x, y) = p(y | x) p(x)
Expectations
The weighted average of a function f(x) under the probability distribution p(x):
E[f] = Σ_x p(x) f(x)        (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx       (probability density p(x))
How to approximate E[f]
Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:
E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)
How do we draw points from a probability distribution p(x)? A lecture on sampling is coming up.
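A minimal Monte Carlo sketch of this finite-sum approximation, choosing f(x) = x² and x ~ N(0, 1) so the exact answer E[f] = 1 is known (the choice of f, distribution, and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f] for f(x) = x^2 under x ~ N(0, 1); the exact value is 1.
N = 200_000
x = rng.normal(size=N)
estimate = np.mean(x ** 2)
print(estimate)        # close to 1.0
```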
Expectation of a function of several variables
For an arbitrary function f(x, y):
E_x[f(x, y)] = Σ_x p(x) f(x, y)        (discrete distribution p(x))
E_x[f(x, y)] = ∫ p(x) f(x, y) dx       (probability density p(x))
Note that E_x[f(x, y)] is a function of y.
Conditional Expectation
For an arbitrary function f(x):
E_x[f | y] = Σ_x p(x | y) f(x)         (discrete distribution p(x))
E_x[f | y] = ∫ p(x | y) f(x) dx        (probability density p(x))
Note that E_x[f | y] is a function of y. Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why? Because E_x[f(x) | y] is a function of y, so the outer expectation can only be taken over y.)
E_y[E_x[f(x) | y]] = Σ_y p(y) E_x[f | y]
                   = Σ_y p(y) Σ_x p(x | y) f(x)
                   = Σ_{x,y} f(x) p(x, y)
                   = Σ_x f(x) p(x)
                   = E_x[f(x)]
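The identity E_y[E_x[f(x) | y]] = E_x[f(x)] can be checked numerically on the earlier count table. A sketch with an arbitrary f of my choosing:

```python
import numpy as np

# joint counts from the earlier table (rows Y = 1, Y = 2; columns a..i)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()              # p(x, y)
p_y = joint.sum(axis=1)                    # p(y)
p_x = joint.sum(axis=0)                    # p(x)

f = np.arange(9.0) ** 2                    # an arbitrary f(x) on the 9 states

# inner expectation: E_x[f | y] = Σ_x p(x | y) f(x), one value per y
cond_exp = (joint / p_y[:, None]) @ f
# outer expectation over y, and the direct expectation over x
lhs = p_y @ cond_exp                       # E_y[E_x[f(x) | y]]
rhs = p_x @ f                              # E_x[f(x)]
print(lhs, rhs)                            # the two agree
```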
Variance
For an arbitrary function f(x):
var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2
Special case f(x) = x:
var[x] = E[(x - E[x])^2] = E[x^2] - E[x]^2
Covariance
Two random variables x ∈ ℝ and y ∈ ℝ:
cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])] = E_{x,y}[x y] - E[x] E[y]
With E[x] = a and E[y] = b:
cov[x, y] = E_{x,y}[(x - a)(y - b)]
          = E_{x,y}[x y] - E_{x,y}[x b] - E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] - b E_{x,y}[x] - a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] - a b - a b + a b
          = E_{x,y}[x y] - a b
          = E_{x,y}[x y] - E[x] E[y]
(using E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)
Covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
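Both forms of the covariance (the definition and the expanded E[xy] - E[x]E[y]) give the same number, which is easy to verify on sampled data. The data-generating process here is an assumption chosen so that the true covariance is 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)           # correlated with x; true cov is 0.5

# covariance from the definition, and via E[xy] - E[x]E[y]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
cov_alt = np.mean(x * y) - x.mean() * y.mean()
print(cov_def, cov_alt)
```

The two sample quantities are algebraically identical, so they match to floating-point precision.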
Covariance for Vector Valued Variables
Two random variables x ∈ ℝ^D and y ∈ ℝ^D:
cov[x, y] = E_{x,y}[(x - E[x])(y^T - E[y^T])] = E_{x,y}[x y^T] - E[x] E[y^T]