Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science, Canberra
February - June 2018
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part II
Introduction
Flavour of this course
- Formalise intuitions about problems
- Use the language of mathematics to express models
- Geometry, vectors, linear algebra for reasoning
- Probabilistic models to capture uncertainty
- Design and analysis of algorithms
- Numerical algorithms in Python
- Understand the choices when designing machine learning methods
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Some artificial data, created from the function sin(2πx) plus random noise, for x between 0 and 1.
[Figure: the noisy target values t plotted against x, together with the underlying curve.]
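A minimal sketch of how such a toy dataset could be generated with NumPy. The seed and the noise level 0.3 are assumptions for illustration, not values stated on the slide:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (an assumption)

N = 10
x = rng.uniform(0.0, 1.0, size=N)                           # inputs in [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

print(x.shape, t.shape)
```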
Input Specification
N = 10
x ≡ (x_1, ..., x_N)^T
t ≡ (t_1, ..., t_N)^T
x_i ∈ ℝ and t_i ∈ ℝ for i = 1, ..., N
Model Specification
M: order of the polynomial
y(x, w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m
A nonlinear function of x, but a linear function of the unknown model parameters w.
How can we find good parameters w = (w_0, ..., w_M)^T?
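The polynomial model can be evaluated as a matrix-vector product against a design matrix whose columns are the powers of x. This is a sketch, not the course's own code; the helper names are my own:

```python
import numpy as np

def poly_design_matrix(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M, so y = Phi @ w."""
    return np.vander(x, M + 1, increasing=True)

def y(x, w):
    """Evaluate y(x, w) = Σ_m w_m x^m for scalar or array x."""
    M = len(w) - 1
    return poly_design_matrix(np.atleast_1d(x), M) @ w

w = np.array([1.0, -2.0, 3.0])   # w_0 + w_1 x + w_2 x^2
print(y(0.5, w))                 # 1 - 2*0.5 + 3*0.25 = 0.75
```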
Learning is Improving Performance
[Figure: a training point (x_n, t_n) and the model prediction y(x_n, w); the vertical bar between them is the error.]
Performance measure: the error between the targets and the predictions of the model on the training data
E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2
E(w) has a unique minimum, attained at some w*.
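The minimiser w* of the sum-of-squares error can be found with an ordinary least-squares solve. A sketch under the same assumptions as before (synthetic sin(2πx) data, my own function names):

```python
import numpy as np

def E(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * Σ_n (y(x_n, w) - t_n)^2."""
    Phi = np.vander(x, len(w), increasing=True)   # design matrix, columns x^m
    r = Phi @ w - t
    return 0.5 * r @ r

def fit(x, t, M):
    """Least-squares minimiser of E(w) for a degree-M polynomial."""
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)
w_star = fit(x, t, M=3)
print(E(w_star, x, t))
```

Since w* is the global minimiser, perturbing it can only increase the error.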
Model Comparison or Model Selection
y(x, w) = Σ_{m=0}^{M} w_m x^m
M = 0: y(x, w) = w_0. [Figure: constant fit.]
M = 1: y(x, w) = w_0 + w_1 x. [Figure: straight-line fit.]
M = 3: y(x, w) = w_0 + w_1 x + w_2 x^2 + w_3 x^3. [Figure: cubic fit.]
M = 9: y(x, w) = w_0 + w_1 x + ... + w_8 x^8 + w_9 x^9. [Figure: the degree-9 fit passes through every data point but oscillates wildly: overfitting.]
Testing the Model
Train the model to obtain w*.
Get 100 new data points.
Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*) / N)
[Figure: E_RMS for the training data and the test data, plotted against M = 0, ..., 9.]
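The train/test comparison above can be sketched as follows. This is illustrative code under assumed seeds and noise level, not the figure's exact setup:

```python
import numpy as np

def rms(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), i.e. the root of the mean squared residual."""
    Phi = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

rng = np.random.default_rng(2)
f = lambda u: np.sin(2 * np.pi * u)
x_train = np.linspace(0.0, 1.0, 10)
t_train = f(x_train) + rng.normal(scale=0.2, size=10)
x_test = rng.uniform(0.0, 1.0, 100)            # 100 new data points
t_test = f(x_test) + rng.normal(scale=0.2, size=100)

train_rms, test_rms = {}, {}
for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
    train_rms[M] = rms(w, x_train, t_train)
    test_rms[M] = rms(w, x_test, t_test)
    print(M, train_rms[M], test_rms[M])
```

With M = 9 and N = 10 the polynomial interpolates the training data (training error essentially zero), while the test error stays at least at the noise level.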
Testing the Model

           M = 0     M = 1      M = 3          M = 9
  w_0*      0.19      0.82       0.31           0.35
  w_1*                -1.27       7.99         232.37
  w_2*                          -25.43       -5321.83
  w_3*                           17.37       48568.31
  w_4*                                     -231639.30
  w_5*                                      640042.26
  w_6*                                    -1061800.52
  w_7*                                     1042400.18
  w_8*                                     -557682.99
  w_9*                                      125201.43

Table: Coefficients w* for polynomials of various order.
More Data
N = 15
[Figure: the fit with N = 15 data points.]
More Data
N = 100
Heuristic: have no less than 5 to 10 times as many data points as parameters.
But the number of parameters is not necessarily the most appropriate measure of model complexity!
Later: the Bayesian approach.
[Figure: the fit with N = 100 data points.]
Regularisation
How can we constrain the growth of the coefficients w?
Add a regularisation term to the error function:
Ẽ(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) - t_n)^2 + (λ/2) ||w||^2
where the squared norm of the parameter vector w is
||w||^2 ≡ w^T w = w_0^2 + w_1^2 + ... + w_M^2
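Minimising Ẽ(w) has a closed-form solution (ridge regression): setting the gradient to zero gives (Φ^T Φ + λI) w = Φ^T t. A sketch with assumed data and two assumed values of λ:

```python
import numpy as np

def fit_regularised(x, t, M, lam):
    """Closed-form minimiser of Ẽ(w) = 1/2 Σ (y(x_n, w) - t_n)^2 + λ/2 ||w||^2."""
    Phi = np.vander(x, M + 1, increasing=True)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=10)

w_small = fit_regularised(x, t, M=9, lam=np.exp(-18))   # weak regularisation
w_large = fit_regularised(x, t, M=9, lam=1.0)           # strong regularisation
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Larger λ shrinks the coefficient vector, which is exactly how the regulariser constrains the wild coefficients of the M = 9 fit.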
Regularisation
M = 9
[Figure: the regularised fit with ln λ = -18.]

Regularisation
M = 9
[Figure: the regularised fit with ln λ = 0.]

Regularisation
M = 9
[Figure: E_RMS for the training data and the test data, plotted against ln λ.]
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
- Task: regression
- Experience: x input examples, t output labels
- Performance: squared error
- Model choice
- Regularisation
- Do not train on the test set!
[Figure: samples from a joint distribution p(X, Y), where Y takes the two values Y = 1 and Y = 2.]
Y vs. X    a   b   c   d   e   f   g   h   i   sum
Y = 2      0   0   0   1   4   5   8   6   2    26
Y = 1      3   6   8   8   5   3   1   0   0    34
sum        3   6   8   9   9   8   9   6   2    60
[Figure: the same counts shown as histograms of p(X, Y) for Y = 1 and Y = 2.]
Sum Rule
From the table above:
p(X = d, Y = 1) = 8/60
p(X = d) = p(X = d, Y = 2) + p(X = d, Y = 1) = 1/60 + 8/60 = 9/60
In general:
p(X = d) = Σ_Y p(X = d, Y)
p(X) = Σ_Y p(X, Y)
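The sum rule on the count table amounts to summing the normalised joint array along one axis. A sketch using the table from the slide:

```python
import numpy as np

# joint counts n(X, Y) from the slide's table: rows Y = 1, Y = 2; columns a..i
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()       # p(X, Y); total count is 60

p_x = joint.sum(axis=0)             # sum rule: p(X) = Σ_Y p(X, Y)
p_y = joint.sum(axis=1)             # sum rule: p(Y) = Σ_X p(X, Y)

d = 3                               # column index of X = d
print(p_x[d])                       # 9/60
print(p_y)                          # [34/60, 26/60]
```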
Sum Rule
p(X) = Σ_Y p(X, Y)
p(Y) = Σ_X p(X, Y)
[Figure: the marginals p(X) and p(Y) obtained from the joint counts.]
Product Rule
Conditional probability: p(X = d | Y = 1) = 8/34
Calculate p(Y = 1): p(Y = 1) = Σ_X p(X, Y = 1) = 34/60
Then p(X = d, Y = 1) = p(X = d | Y = 1) p(Y = 1) = (8/34)(34/60) = 8/60
In general:
p(X, Y) = p(X | Y) p(Y)
Product Rule
p(X) = Σ_Y p(X, Y)
p(X, Y) = p(X | Y) p(Y)
[Figure: the marginal p(X) and the conditional p(X | Y = 1).]
Sum Rule and Product Rule
Sum Rule: p(X) = Σ_Y p(X, Y)
Product Rule: p(X, Y) = p(X | Y) p(Y)
Bayes Theorem
Use the product rule in both orders:
p(X, Y) = p(X | Y) p(Y) = p(Y | X) p(X)
Bayes Theorem:
p(Y | X) = p(X | Y) p(Y) / p(X)
only defined for p(X) > 0, and
p(X) = Σ_Y p(X, Y)           (sum rule)
     = Σ_Y p(X | Y) p(Y)     (product rule)
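Applied to the count table, Bayes theorem inverts the conditional: from p(X | Y) we recover p(Y | X). A sketch reusing the slide's table:

```python
import numpy as np

# joint counts from the slide's table (rows Y = 1, Y = 2; columns a..i)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()        # p(X, Y)
p_y = joint.sum(axis=1)              # p(Y), sum rule
p_x = joint.sum(axis=0)              # p(X), sum rule

d = 3                                # column index of X = d
p_x_given_y1 = joint[0] / p_y[0]     # p(X | Y = 1) = p(X, Y = 1) / p(Y = 1)

# Bayes: p(Y = 1 | X = d) = p(X = d | Y = 1) p(Y = 1) / p(X = d)
p_y1_given_xd = p_x_given_y1[d] * p_y[0] / p_x[d]
print(p_y1_given_xd)                 # 8/9
```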
Probability Densities
Real-valued variable x ∈ ℝ.
The probability of x falling in the interval (x, x + δx) is given by p(x) δx for infinitesimally small δx.
p(x ∈ (a, b)) = ∫_a^b p(x) dx
[Figure: a density p(x), its cumulative distribution P(x), and an interval of width δx.]
Constraints on p(x)
Nonnegative: p(x) ≥ 0
Normalisation: ∫ p(x) dx = 1
Cumulative distribution function P(x)
P(x) = ∫_{-∞}^{x} p(z) dz
(d/dx) P(x) = p(x)
Multivariate Probability Density
Vector x ≡ (x_1, ..., x_D)^T
Nonnegative: p(x) ≥ 0
Normalisation: ∫ p(x) dx = 1
This means ∫ ... ∫ p(x) dx_1 ... dx_D = 1.
Sum and Product Rule for Densities
Sum Rule: p(x) = ∫ p(x, y) dy
Product Rule: p(x, y) = p(y | x) p(x)
Expectations
The weighted average of a function f(x) under the probability distribution p(x):
E[f] = Σ_x p(x) f(x)        (discrete distribution p(x))
E[f] = ∫ p(x) f(x) dx       (probability density p(x))
How to approximate E[f]
Given a finite number N of points x_n drawn from the probability distribution p(x), approximate the expectation by a finite sum:
E[f] ≈ (1/N) Σ_{n=1}^{N} f(x_n)
How do we draw points from a probability distribution p(x)? A lecture on sampling is coming up.
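A minimal Monte Carlo sketch of this finite-sum approximation, choosing f(x) = x² and x ~ N(0, 1) so the exact answer E[f] = 1 is known (the choice of f, distribution, and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f] for f(x) = x^2 under x ~ N(0, 1); the exact value is 1.
N = 200_000
x = rng.normal(size=N)
estimate = np.mean(x ** 2)
print(estimate)        # close to 1.0
```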
Expectation of a function of several variables
For an arbitrary function f(x, y):
E_x[f(x, y)] = Σ_x p(x) f(x, y)        (discrete distribution p(x))
E_x[f(x, y)] = ∫ p(x) f(x, y) dx       (probability density p(x))
Note that E_x[f(x, y)] is a function of y.
Conditional Expectation
For an arbitrary function f(x):
E_x[f | y] = Σ_x p(x | y) f(x)         (discrete distribution p(x))
E_x[f | y] = ∫ p(x | y) f(x) dx        (probability density p(x))
Note that E_x[f | y] is a function of y. Other notation used in the literature: E_{x|y}[f].
What is E[E[f(x) | y]]? Can we simplify it?
This must mean E_y[E_x[f(x) | y]]. (Why? Because E_x[f(x) | y] is a function of y, so the outer expectation can only be taken over y.)
E_y[E_x[f(x) | y]] = Σ_y p(y) E_x[f | y]
                   = Σ_y p(y) Σ_x p(x | y) f(x)
                   = Σ_{x,y} f(x) p(x, y)
                   = Σ_x f(x) p(x)
                   = E_x[f(x)]
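The identity E_y[E_x[f(x) | y]] = E_x[f(x)] can be checked numerically on the earlier count table. A sketch with an arbitrary f of my choosing:

```python
import numpy as np

# joint counts from the earlier table (rows Y = 1, Y = 2; columns a..i)
counts = np.array([
    [3, 6, 8, 8, 5, 3, 1, 0, 0],   # Y = 1
    [0, 0, 0, 1, 4, 5, 8, 6, 2],   # Y = 2
])
joint = counts / counts.sum()              # p(x, y)
p_y = joint.sum(axis=1)                    # p(y)
p_x = joint.sum(axis=0)                    # p(x)

f = np.arange(9.0) ** 2                    # an arbitrary f(x) on the 9 states

# inner expectation: E_x[f | y] = Σ_x p(x | y) f(x), one value per y
cond_exp = (joint / p_y[:, None]) @ f
# outer expectation over y, and the direct expectation over x
lhs = p_y @ cond_exp                       # E_y[E_x[f(x) | y]]
rhs = p_x @ f                              # E_x[f(x)]
print(lhs, rhs)                            # the two agree
```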
Variance
For an arbitrary function f(x):
var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2
Special case f(x) = x:
var[x] = E[(x - E[x])^2] = E[x^2] - E[x]^2
Covariance
Two random variables x ∈ ℝ and y ∈ ℝ:
cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])] = E_{x,y}[x y] - E[x] E[y]
With E[x] = a and E[y] = b:
cov[x, y] = E_{x,y}[(x - a)(y - b)]
          = E_{x,y}[x y] - E_{x,y}[x b] - E_{x,y}[a y] + E_{x,y}[a b]
          = E_{x,y}[x y] - b E_{x,y}[x] - a E_{x,y}[y] + a b E_{x,y}[1]
          = E_{x,y}[x y] - a b - a b + a b
          = E_{x,y}[x y] - a b
          = E_{x,y}[x y] - E[x] E[y]
(using E_{x,y}[x] = E_x[x] = a, E_{x,y}[y] = E_y[y] = b, and E_{x,y}[1] = 1)
Covariance expresses how strongly x and y vary together. If x and y are independent, their covariance vanishes.
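Both forms of the covariance (the definition and the expanded E[xy] - E[x]E[y]) give the same number, which is easy to verify on sampled data. The data-generating process here is an assumption chosen so that the true covariance is 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)           # correlated with x; true cov is 0.5

# covariance from the definition, and via E[xy] - E[x]E[y]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))
cov_alt = np.mean(x * y) - x.mean() * y.mean()
print(cov_def, cov_alt)
```

The two sample quantities are algebraically identical, so they match to floating-point precision.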
Covariance for Vector Valued Variables
Two random variables x ∈ ℝ^D and y ∈ ℝ^D:
cov[x, y] = E_{x,y}[(x - E[x])(y^T - E[y^T])] = E_{x,y}[x y^T] - E[x] E[y^T]