
1 Statistical Machine Learning. Statistical Machine Learning Group and College of Engineering and Computer Science. Machine Learning Summer School, 27 September - 6 October. (Figures from C. M. Bishop, "Pattern Recognition and Machine Learning", and T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning".)

2 Overview: 1 What is Machine Learning? 2 Definition 3 Examples of Machine Learning 4 Related Fields 5 Fundamental Types of Learning 6 Basic Probability Theory 7 Polynomial Curve Fitting

3 Linear Regression: 8 Linear Basis Function Models 9 Maximum Likelihood and Least Squares 10 Regularized Least Squares 11 Bayesian Regression 12 Example for Bayesian Regression 13 Predictive Distribution 14 Limitations of Linear Basis Function Models

4 Linear Classification: 15 Classification 16 Generalised Linear Model 17 Inference and Decision 18 Decision Theory 19 Fisher's Linear Discriminant 20 The Perceptron Algorithm 21 Probabilistic Generative Models 22 Discrete Features 23 Logistic Regression 24 Feature Space

5 Neural Networks: 25 Neural Networks 26 Parameter Optimisation

6 Kernel Methods and SVM: 27 Kernel Methods 28 Maximum Margin Classifiers

7 Mixture Models and EM: 29 K-means Clustering 30 Mixture Models and EM 31 Mixture of Bernoulli Distributions 32 EM for Gaussian Mixtures - Latent Variables 33 Convergence of EM

8 Resources: 34 Sampling from the Uniform Distribution 35 Sampling from Standard Distributions 36 Rejection Sampling 37 Importance Sampling 38 Markov Chain Monte Carlo - The Idea

9 More: 39 More

10 Part I Overview: What is Machine Learning? Definition; Examples of Machine Learning; Related Fields; Fundamental Types of Learning; Basic Probability Theory; Polynomial Curve Fitting

11 What is Machine Learning? Definition: Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data. Learning from past experience (training data); generalisation; quantifying "learning": to improve their performance over time we need to quantify performance.

12 What is Machine Learning? Definition: Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data. Learning from past experience (training data); generalisation; quantifying "learning": to improve their performance over time we need to quantify performance. Definition (Mitchell, 1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

13 Why Machine Learning? Machine learning is essential when: humans are unable to explain their expertise (e.g. speech recognition); humans are not around to help (e.g. navigation on Mars, underwater robotics); there is a large amount of data with possible hidden relationships and correlations (empirical sciences, e.g. discovering unusual astronomical objects); the environment changes (fast) over time (e.g. mobile phone networks); solutions need to be adapted to many particular cases (e.g. junk mail). Example: it is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.

14 Junk Mail Filtering: Given examples of data (mail) and targets {Junk, NoJunk}, learn to identify new incoming mail as Junk or NoJunk, and continue to learn as the user classifies new mail.

15 Handwritten Digit Recognition: Given handwritten ZIP codes on letters, money amounts on cheques, etc., learn to correctly recognise new handwritten digits. For nonsense input, answering "Don't know" is preferred to outputting some wrong digit.

16 Backgammon: The world's best computer program, TD-GAMMON (Tesauro 1992, 1995), played over a million games against itself. It now plays at the level of the human world champion.

17 Image Denoising: (Figure: original image, noise added, denoised.) McAuley et al., "Learning High-Order MRF Priors of Color Images", ICML 2006.

18 Separating Audio Sources: The cocktail party problem (human brains may do it differently ;-). (Figure: audio sources, microphones, audio mixtures.)

19 Other Applications of Machine Learning: autonomous robotics, detecting credit card fraud, detecting network intrusion, bioinformatics, neuroscience, medical diagnosis, stock market analysis, ...

20 Related Fields: Artificial Intelligence (AI), Statistics, Game Theory, Neuroscience, Psychology, Data Mining, Computer Science, Adaptive Control Theory, ...

21 Fundamental Types of Learning: Unsupervised Learning (association, clustering, density estimation, blind source separation); Supervised Learning (regression, classification); Reinforcement Learning; Others (agents, active learning, semi-supervised learning, transductive learning, ...).

22 Unsupervised Learning: Only input data are given, no targets (labels). Goal: determine how the data are organised.

23 Unsupervised Learning - Clustering: Group similar instances. Example applications: clustering customers in customer relationship management; image compression via colour quantisation.

24 Supervised Learning: Given pairs of data and targets (= labels), learn a mapping from the data to the targets (training). Goal: use the learned mapping to correctly predict the target for new input data. This requires generalising well from the training data/target pairs.

25 Reinforcement Learning: Find suitable actions in a given environment with the goal of maximising some reward. Example: game playing, where there is one reward (negative or positive) at the end of the game. Correct input/output pairs are never presented; the reward might only come after many actions; and the current action may influence not only the current reward but also future rewards.

26 Reinforcement Learning: (Figure: at each step i the agent receives an observation and a reward, and chooses an action: observation_i, reward_i -> action_i.) Exploration versus exploitation. Well suited for problems with a long-term versus short-term reward trade-off. Naturally focuses on online performance.

27 Basic Probability Theory: Probability is a way of expressing knowledge or belief that an event will occur or has occurred. Example (fair six-sided die): sample space Ω = {1, 2, 3, 4, 5, 6}; events Even = {2, 4, 6}, Odd = {1, 3, 5}; probability P(3) = 1/6, P(Odd) = P(Even) = 1/2; outcome 3 ∈ Ω; conditional probability P(3 | Odd) = P(3 and Odd)/P(Odd) = (1/6)/(1/2) = 1/3. General axioms: P(∅) = 0 ≤ P(A) ≤ P(Ω) = 1; P(A ∪ B) + P(A ∩ B) = P(A) + P(B); P(A ∩ B) = P(A | B) P(B). Rules of probability: sum rule P(X) = Σ_Y P(X, Y); product rule P(X, Y) = P(X | Y) P(Y).
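
The die example can be checked numerically. A minimal sketch (not part of the original slides) that enumerates the sample space and verifies the conditional probability, the sum rule and the product rule:

```python
from fractions import Fraction

# Fair six-sided die: every outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

odd = {1, 3, 5}
even = {2, 4, 6}

def prob(event):
    """P(event) = sum of the probabilities of its outcomes."""
    return sum(p[o] for o in event)

# Conditional probability P(3 | Odd) = P({3} and Odd) / P(Odd) = 1/3.
p_3_given_odd = prob({3} & odd) / prob(odd)
print(p_3_given_odd)                               # 1/3

# Sum rule over the partition {Odd, Even}: P(3) = P(3, Odd) + P(3, Even).
assert prob({3}) == prob({3} & odd) + prob({3} & even)

# Product rule: P(3 and Odd) = P(3 | Odd) P(Odd).
assert prob({3} & odd) == p_3_given_odd * prob(odd)

# Axiom check: P(A ∪ B) + P(A ∩ B) = P(A) + P(B) for A = Odd, B = {1, 2}.
A, B = odd, {1, 2}
assert prob(A | B) + prob(A & B) == prob(A) + prob(B)
```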

28 Probability Jargon: (Un)fair coin: Ω = {Tail = 0, Head = 1}, P(1) = θ ∈ [0, 1]. Likelihood: P(1101 | θ) = θ·θ·(1−θ)·θ. Maximum likelihood (ML) estimate: θ̂ = arg max_θ P(1101 | θ) = 3/4. Prior: if we are indifferent, then P(θ) = const. Evidence: P(1101) = Σ_θ P(1101 | θ) P(θ) (= 1/20 for a uniform prior, where the sum is actually an integral over θ). Posterior: P(θ | 1101) = P(1101 | θ) P(θ) / P(1101) ∝ θ³(1−θ) (Bayes' rule). Maximum a posteriori (MAP) estimate: θ̂ = arg max_θ P(θ | 1101) = 3/4. Predictive distribution: P(1 | 1101) = P(11011)/P(1101) = 2/3. Expectation: E[f | ...] = Σ_θ f(θ) P(θ | ...), e.g. E[θ | 1101] = 2/3. Variance: var(θ) = E[(θ − E[θ])²] = 2/63 (for the posterior). Probability density: p(θ) = (1/ε) P([θ, θ + ε]) for ε → 0.
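
A small numerical check of the coin example, assuming the observed sequence 1101 and a uniform prior over θ as above (a sketch; the grid spacing is an arbitrary choice, not from the slides):

```python
import numpy as np

dtheta = 1e-4
theta = np.arange(0.0, 1.0 + dtheta, dtheta)       # grid over [0, 1]
likelihood = theta**3 * (1 - theta)                # P(1101 | θ) = θ·θ·(1−θ)·θ
prior = np.ones_like(theta)                        # uniform prior, P(θ) = const

# Evidence P(1101) = ∫ P(1101 | θ) P(θ) dθ ≈ 1/20
evidence = np.sum(likelihood * prior) * dtheta
posterior = likelihood * prior / evidence          # P(θ | 1101), Bayes' rule

theta_ml = theta[np.argmax(likelihood)]                         # ≈ 3/4
theta_map = theta[np.argmax(posterior)]                         # ≈ 3/4 (flat prior)
post_mean = np.sum(theta * posterior) * dtheta                  # E[θ | 1101] ≈ 2/3
post_var = np.sum((theta - post_mean)**2 * posterior) * dtheta  # ≈ 2/63
predictive = post_mean                                          # P(1 | 1101) = E[θ | 1101]

print(theta_ml, theta_map, evidence, post_mean, post_var, predictive)
```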

29 Polynomial Curve Fitting: Some artificial data created from the function sin(2πx) plus random noise, for x = 0, ..., 1. (Figure: training targets t plotted against x.)

30 Polynomial Curve Fitting - Training Data: N = 10 points, x ≡ (x_1, ..., x_N)^T and t ≡ (t_1, ..., t_N)^T, with x_i ∈ R and t_i ∈ R for i = 1, ..., N.

31 Polynomial Curve Fitting - Model Specification: M is the order of the polynomial, y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m. This is a nonlinear function of x, but a linear function of the unknown model parameters w. How can we find good parameters w = (w_0, w_1, ..., w_M)^T?

32 Learning is Improving Performance: (Figure: target t_n and model prediction y(x_n, w) at input x_n.) Performance measure: the error between the targets and the model predictions on the training data, E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)². There is a unique minimum of E(w) at some argument w*.
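
A minimal sketch of this fitting problem (the noise level 0.3 and the random seed are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                     # training points, polynomial order
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)   # noisy targets

def E(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.polynomial.polynomial.polyval(x, w)   # y(x, w) = sum_m w_m x^m
    return 0.5 * np.sum((y - t) ** 2)

# Minimise E(w): closed-form least-squares fit on the Vandermonde matrix.
X = np.vander(x, M + 1, increasing=True)         # columns 1, x, x^2, ..., x^M
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w_star, E(w_star, x, t))
```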

33 Model is a Constant Function (M = 0): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0. (Figure: fit for M = 0.)

34 Model is a Linear Function (M = 1): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x. (Figure: fit for M = 1.)

35 Model is a Cubic Polynomial (M = 3): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x + w_2 x² + w_3 x³. (Figure: fit for M = 3.)

36 Model is a 9th-Order Polynomial (M = 9): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x + ... + w_8 x⁸ + w_9 x⁹. Overfitting. (Figure: fit for M = 9.)

37 Testing the Fitted Model: Train the model to get w*, then take 100 new data points. Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*)/N). (Figure: E_RMS on the training and test sets as a function of M.)

38 Parameters of the Fitted Model: Table of coefficients w* for polynomials of order M = 0, 1, 3, 9 (the numerical values are not reproduced here; the coefficients grow dramatically in magnitude as M increases).

39 Get More Data: N = 15. (Figure: fit with N = 15 data points.)

40 Get Even More Data: N = 100. Heuristic: have no fewer than 5 to 10 times as many data points as parameters. But the number of parameters is not necessarily the most appropriate measure of model complexity! Later: the Bayesian approach. (Figure: fit with N = 100 data points.)

41 Regularisation: How can we constrain the growth of the coefficients w? Add a regularisation term to the error function: Ẽ(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)² + (λ/2) ‖w‖², where the squared norm of the parameter vector is ‖w‖² ≡ w^T w = w_0² + w_1² + ... + w_M².
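
A sketch of the regularised fit for M = 9, using the closed-form solution that is derived later for regularised least squares (the data generation and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

Phi = np.vander(x, M + 1, increasing=True)       # design matrix, columns x^0 .. x^M
lam = np.exp(-18)                                # regularisation coefficient, ln λ = −18

# Minimiser of Ẽ(w) = 1/2 Σ (y(x_n, w) − t_n)^2 + λ/2 ||w||^2:
#   w = (λI + Φ^T Φ)^{-1} Φ^T t
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(w_reg)
```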

42 Regularisation: (Figure: M = 9 fit with ln λ = −18.)

43 Regularisation: (Figure: M = 9 fit with ln λ = 0.)

44 Regularisation: (Figure: training and test E_RMS for M = 9 as a function of ln λ.)

45 Part II Linear Regression: Linear Basis Function Models; Maximum Likelihood and Least Squares; Regularized Least Squares; Bayesian Regression; Example for Bayesian Regression; Predictive Distribution; Limitations of Linear Basis Function Models

46 Linear Regression Model: Input "feature" vector x = (x^(0) = 1, x^(1), ..., x^(D))^T ∈ R^{D+1}; linear regression model y(x, w) = Σ_{j=0}^{D} w_j x^(j) = w^T x; model parameters w = (w_0, ..., w_D)^T, where w_0 is the bias. (Figure: hyperplanes for w = {(2, 1, 1), (5, 2, 1), (1, 2, 2)}.)

47 Linear Regression - Finding the Best Model: Use the training data (x_1, t_1), ..., (x_N, t_N) and a loss function (performance measure) to find the best w. Example: residual sum of squares, Loss(w) = Σ_{n=1}^{N} (t_n − y(x_n, w))², with ŵ = arg min_w Loss(w). (Figure: least-squares regression plane, from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 3.)

48 Linear Basis Function Models: Linear combination of fixed nonlinear basis functions φ_j(x) ∈ R, y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x), with parameters w = (w_0, ..., w_{M−1})^T, where w_0 is the bias parameter, basis functions φ = (φ_0, ..., φ_{M−1})^T, and the convention φ_0(x) = 1.
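
A sketch of how a design matrix of fixed basis functions might be assembled (the Gaussian centres and width below are illustrative choices, not values from the slides):

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each centre mu_j."""
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

def design_matrix(x, mu, s):
    """Phi with phi_0(x) = 1 (bias column) followed by the Gaussian basis functions."""
    return np.hstack([np.ones((len(x), 1)), gaussian_basis(x, mu, s)])

x = np.linspace(0, 1, 10)
mu = np.linspace(0, 1, 8)          # basis function centres (illustrative)
Phi = design_matrix(x, mu, s=0.1)
print(Phi.shape)                    # (N, M) with M = 9 here
```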

49 Polynomial Basis Functions: For a scalar input variable x, φ_j(x) = x^j. Limitation: polynomials are global functions of the input variable x. Extension: split the input space into regions and fit a different polynomial to each region (spline functions). (Figure: polynomial basis functions.)

50 Gaussian Basis Functions: For a scalar input variable x, φ_j(x) = exp{−(x − µ_j)²/(2s²)}. This is not a probability distribution; no normalisation is required, since scaling is taken care of by the model parameters w. (Figure: Gaussian basis functions.)

51 Sigmoidal Basis Functions: For a scalar input variable x, φ_j(x) = σ((x − µ_j)/s), where σ(a) is the logistic sigmoid function defined by σ(a) = 1/(1 + exp(−a)). σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1. (Figure: sigmoidal basis functions.)

52 Other Basis Functions - Wavelets: Wavelets are localised in both space and frequency, and mutually orthogonal to simplify their application. (Figure: Haar and Symmlet-8 wavelets, from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 5.)

53 Other Basis Functions - 2D Splines: Splines are polynomials restricted to regions of the input space. (Figure from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 5.)

54 Maximum Likelihood and Least Squares: No special assumption is made about the basis functions φ_j(x); in the simplest case, one can think of φ_j(x) = x^j. Assume the target t is given by t = y(x, w) + ε, a deterministic term plus noise, where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus p(t | x, w, β) = N(t | y(x, w), β^{−1}).

55 Maximum Likelihood and Least Squares: Likelihood of one target t given the data x: p(t | x, w, β) = N(t | y(x, w), β^{−1}). Given a set of inputs X = {x_1, ..., x_N} with corresponding target values t = (t_1, ..., t_N)^T, assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{−1}) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}).

56 Maximum Likelihood and Least Squares: Consider the logarithm of the likelihood p(t | X, w, β) (the logarithm is a monotone function!): ln p(t | X, w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{−1}) = Σ_{n=1}^{N} ln( sqrt(β/(2π)) exp{−(β/2)(t_n − w^T φ(x_n))²} ) = (N/2) ln β − (N/2) ln(2π) − β E_D(w), where the sum-of-squares error function is E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}². Therefore arg max_w ln p(t | X, w, β) = arg min_w E_D(w).

57 Maximum Likelihood and Least Squares: Rewrite the error function as E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}² = (1/2)(t − Φw)^T (t − Φw), where t = (t_1, ..., t_N)^T and Φ is the N × M design matrix whose row n is (φ_0(x_n), φ_1(x_n), ..., φ_{M−1}(x_n)). The maximum likelihood estimate is w_ML = arg max_w ln p(t | w, β) = arg min_w E_D(w) = (Φ^T Φ)^{−1} Φ^T t = Φ† t, where Φ† is the Moore-Penrose pseudo-inverse of Φ.
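
A sketch of the maximum likelihood solution via the pseudo-inverse (the data and polynomial basis are illustrative; the plug-in estimate of β is the standard maximum likelihood result, which is not spelled out on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)           # polynomial basis phi_j(x) = x^j

# Maximum likelihood weights: w_ML = (Phi^T Phi)^{-1} Phi^T t = pinv(Phi) @ t
w_ml = np.linalg.pinv(Phi) @ t

# Plug-in noise precision: 1/beta_ML = mean squared residual.
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
print(w_ml, beta_ml)
```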

58 Regularized Least Squares: Add regularisation in order to prevent overfitting: minimise E_D(w) + λ E_W(w) with regularisation coefficient λ. For the simple quadratic regulariser E_W(w) = (1/2) w^T w, the solution is w = (λI + Φ^T Φ)^{−1} Φ^T t.

59 Regularized Least Squares: A more general regulariser is E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q. The choice q = 1 (lasso) leads to a sparse model if λ is large enough. (Figure: contours of the regulariser for q = 0.5, 1, 2, 4.)

60 Comparison of Quadratic and Lasso Regularisers: Assume a sufficiently large regularisation coefficient λ. Quadratic regulariser: (1/2) Σ_{j=1}^{M} w_j². Lasso regulariser: (1/2) Σ_{j=1}^{M} |w_j|. (Figure: error contours and regulariser constraint regions in the (w_1, w_2) plane.)

61 Bayesian Regression: Bayes' theorem: posterior = likelihood × prior / normalisation, p(w | t) = p(t | w) p(w) / p(t). The likelihood for i.i.d. data is p(t | w) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{−1}) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}) = const × exp{−(β/2)(t − Φw)^T (t − Φw)}, where we have left out the conditioning on x (always assumed) and on β, which is assumed to be constant.

62 How to Choose a Prior? Can we find a prior for the given likelihood which (a) makes sense for the problem at hand and (b) allows us to find a posterior in a nice form? An answer to the second question: Definition (Conjugate Prior): A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).

63 Examples of Conjugate Prior Distributions: Discrete likelihoods - Bernoulli: Beta; Binomial: Beta; Poisson: Gamma; Multinomial: Dirichlet. Continuous likelihoods - Uniform: Pareto; Exponential: Gamma; Normal: Normal; Multivariate normal: Multivariate normal.

64 Bayesian Regression: With no data points (N = 0), start with the prior. Each posterior acts as the prior for the next data/target pair, which nicely fits a sequential learning framework: p(w) combined with the likelihood p(t_1 | w, x_1) gives, via Bayes, p(w | t_1, x_1); combined with p(t_2 | w, x_2) this gives p(w | t_1, x_1, t_2, x_2); and so on.

65 Sequential Update of the Posterior: Example of a linear (basis function) model with a single input x and single output t; linear model y(x, w) = w_0 + w_1 x. Data creation: (1) choose x_n from the uniform distribution U(x | −1, 1); (2) calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = −0.3 and a_1 = 0.5; (3) add Gaussian noise with standard deviation σ = 0.2, i.e. t_n ~ N(t | f(x_n, a), 0.04). Set the precision of the Gaussian prior to α = 2.0.

66 Sequential Update of the Posterior: (Figure: sequential updates of the posterior as data points arrive.)

67 Sequential Update of the Posterior: (Figure continued.)

68 Predictive Distribution: Definition (the predictive distribution): the predictive distribution is the probability of a test target t given test data x, the training data set X and the training targets t: p(t | x, X, t). How to calculate it: p(t | x, X, t) = ∫ p(t, w | x, X, t) dw (sum rule) = ∫ p(t | w, x, X, t) p(w | x, X, t) dw = ∫ p(t | w, x) p(w | X, t) dw, where the first factor depends on testing only and the second on training only.

69 Predictive Distribution - Isotropic Gaussian Prior: With the (simplified) isotropic Gaussian prior p(w | α) = N(w | 0, α^{−1} I), the predictive distribution p(t | x, X, t) is Gaussian, with variance after N data points have been seen σ²_N(x) = 1/β + φ(x)^T (αI + βΦ^T Φ)^{−1} φ(x), where the first term is the noise of the data and the second term the uncertainty in w. Moreover σ²_{N+1}(x) ≤ σ²_N(x) and lim_{N→∞} σ²_N(x) = 1/β.
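
A sketch of the predictive variance formula above. The posterior mean m_N = β S_N Φ^T t is the standard companion result for Bayesian linear regression with this prior (it is not written on the slide); α, β, the basis and the data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 6
alpha, beta = 2.0, 25.0                          # prior precision, noise precision
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.vander(x, M, increasing=True)           # design matrix

# Posterior over w for the Gaussian prior N(w | 0, alpha^-1 I):
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predictive(x_star):
    """Predictive mean and variance at a new input x_star."""
    phi = np.vander(np.atleast_1d(x_star), M, increasing=True)[0]
    mean = m_N @ phi
    # sigma_N^2(x) = 1/beta + phi^T (alpha I + beta Phi^T Phi)^{-1} phi
    var = 1.0 / beta + phi @ S_N @ phi
    return mean, var

print(predictive(0.5))
```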

70 Predictive Distribution - Isotropic Gaussian Prior: (Figures.) Example with artificial sinusoidal data from sin(2πx) (green) plus added noise, showing the mean of the predictive distribution (red) and the region of one standard deviation around the mean (red shaded) as more data points are observed.

71 Samples from the Posterior Distribution: (Figures.) Example with artificial sinusoidal data from sin(2πx) (green) plus added noise, showing functions y(x, w) (red) for samples w drawn from the posterior distribution p(w | X, t).

72 Limitations of Linear Basis Function Models: The basis functions φ_j(x) are fixed before the training data set is observed. Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D. But typical data sets have two nice properties which can be exploited if the basis functions are not fixed: (1) the data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D, so we need algorithms which place basis functions only where the data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks); (2) the target variables may depend only on a few significant directions within the data manifold, so we need algorithms which can exploit this property (neural networks).

73 Curse of Dimensionality: Linear algebra allows us to operate in n-dimensional vector spaces using intuition from our 3-dimensional world as a vector space; there are no surprises as long as n is finite. But if we add more structure to a vector space (e.g. an inner product or a metric), the intuition gained from the 3-dimensional world around us may be wrong. Example: a sphere of radius r = 1. What is the fraction of the volume of the sphere in D-dimensional space which lies between radius r = 1 − ε and r = 1? The volume scales like r^D, so the volume of a sphere of radius r is V_D(r) = K_D r^D, and (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D.

74 Curse of Dimensionality: Fraction of the volume of the sphere in D-dimensional space which lies between radius r = 1 − ε and r = 1: (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D. (Figure: volume fraction versus ε for several values of D.)
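
The volume-fraction formula is easy to tabulate; a quick sketch (the particular D and ε values are illustrative):

```python
# Fraction of a unit sphere's volume lying within a shell of thickness eps:
#   (V_D(1) - V_D(1 - eps)) / V_D(1) = 1 - (1 - eps)^D
for D in (1, 2, 5, 20, 200):
    for eps in (0.01, 0.1, 0.5):
        frac = 1 - (1 - eps) ** D
        print(f"D={D:4d}  eps={eps:.2f}  fraction={frac:.4f}")
```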

75 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D. (Figure: p(r) versus r for several values of D.)

76 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D. Example for D = 2: assume µ = 0 and Σ = I, so N(x | 0, I) = (1/2π) exp{−(1/2) x^T x}. Use the coordinate transformation x_1 = r cos(φ), x_2 = r sin(φ). The density in the new coordinates is p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J| = (1/2π) exp{−(1/2)(x_1² + x_2²)} · r, where |J| = r is the determinant of the Jacobian of the coordinate transformation. Hence p(r, φ | 0, I) = (1/2π) r exp{−r²/2}.

77 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for D = 2 (with µ = 0, Σ = I): p(r, φ | 0, I) = (1/2π) r exp{−r²/2}. Integrating over all angles φ gives p(r | 0, I) = ∫_0^{2π} (1/2π) r exp{−r²/2} dφ = r exp{−r²/2}. (Figure: p(r) versus r for several values of D.)

78 Part III Linear Classification: Classification; Generalised Linear Model; Inference and Decision; Decision Theory; Fisher's Linear Discriminant; The Perceptron Algorithm; Probabilistic Generative Models; Discrete Features; Logistic Regression; Feature Space

79 Classification: Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K. This divides the input space into different regions.

80 How to Represent Binary Class Labels? Class labels are no longer real values as in regression, but a discrete set. Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2). We can interpret the value of t as the probability of class C_1, with only two possible values, 0 or 1. Note: other conventions for mapping classes to integers are possible; check the setup.

81 How to Represent Multi-Class Labels? If there are more than two classes (K > 2), we call it a multi-class setup. Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class C_j being encoded. Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T. Note: other conventions for mapping multi-class labels to integers are possible; check the setup.

82 Linear Model: Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w, y(x_n, w) = w^T φ(x_n). But generally y(x_n, w) ∈ R. Example: which class does y(x, w) = 0.71623 belong to?

83 Generalised Linear Model: Apply a mapping f : R → Z to the linear model to obtain discrete class labels. Generalised linear model: y(x_n, w) = f(w^T φ(x_n)), where f(·) is called the activation function and f^{−1}(·) the link function. (Figure: example of an activation function, f(z) = sign(z).)

84 Three Models for Decision Problems: In increasing order of complexity. Discriminant functions: find a discriminant function f(x) which maps each input directly onto a class label. Discriminative models: (1) solve the inference problem of determining the posterior class probabilities p(C_k | x); (2) use decision theory to assign each new x to one of the classes. Generative models: (1) solve the inference problem of determining the class-conditional probabilities p(x | C_k); (2) also infer the prior class probabilities p(C_k); (3) use Bayes' theorem to find the posterior p(C_k | x); (4) alternatively, model the joint distribution p(x, C_k) directly; (5) use decision theory to assign each new x to one of the classes.

85 Decision Theory - Key Ideas: Probability of a mistake: p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx. Goal: minimise p(mistake). (Figure: joint densities p(x, C_1) and p(x, C_2) and the decision regions R_1, R_2.)

86 Minimising the Expected Loss: Not all mistakes are equally costly. Weight each misclassification of x into the wrong class C_j, instead of the correct class C_k, by a factor L_{kj}. The expected loss is then E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx. Goal: minimise the expected loss E[L].

87 The Reject Region: Avoid making automated decisions on difficult cases. Difficult cases are those where the posterior probabilities p(C_k | x) are all small, or the joint distributions p(x, C_k) have comparable values. (Figure: posteriors p(C_1 | x) and p(C_2 | x) with a threshold θ defining the reject region.)

88 Least Squares for Classification: Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values. Is this also possible for classification? Given input data x belonging to one of K classes C_k, use the 1-of-K binary coding scheme. Each class is described by its own linear model: y_k(x) = w_k^T x + w_{k0}, for k = 1, ..., K.

89 Least Squares for Classification: With the conventions w̃_k = (w_{k0}, w_k^T)^T ∈ R^{D+1}, x̃ = (1, x^T)^T ∈ R^{D+1}, and W̃ = [w̃_1, ..., w̃_K] ∈ R^{(D+1)×K}, we get the (vector-valued) discriminant function y(x) = W̃^T x̃ ∈ R^K. For a new input x, the class is then given by the index of the largest value in the vector y(x).

90 Determine W̃: Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the 1-of-K coding scheme, define a matrix T whose row n is t_n^T. The sum-of-squares error can now be written as E_D(W̃) = (1/2) tr{(X̃W̃ − T)^T (X̃W̃ − T)}. The minimum of E_D(W̃) is reached for W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃† T, where X̃† is the pseudo-inverse of X̃.
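
A sketch of least-squares classification using the pseudo-inverse solution W̃ = X̃† T (the toy two-class Gaussian data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 2, 100
# Toy data: two Gaussian blobs in 2D.
X = np.vstack([rng.normal([0, 0], 0.5, (N, 2)), rng.normal([2, 2], 0.5, (N, 2))])
labels = np.repeat(np.arange(K), N)

X_tilde = np.hstack([np.ones((2 * N, 1)), X])    # prepend the bias input 1
T = np.eye(K)[labels]                            # 1-of-K target matrix, row n = t_n^T

# Least-squares solution: W_tilde = pinv(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T

# Classify a new input by the index of the largest entry of y(x) = W_tilde^T x_tilde.
def classify(x_new):
    x_t = np.concatenate(([1.0], x_new))
    return int(np.argmax(W_tilde.T @ x_t))

print(classify(np.array([0.1, -0.2])), classify(np.array([2.1, 1.9])))
```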

91 Discriminant Function for Multi-Class: The discriminant function is therefore y(x) = W̃^T x̃ = T^T (X̃†)^T x̃, where X̃ is given by the training data and x̃ is the new input. Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) also obeys the same constraint, a^T y(x) + b = 0. For the 1-of-K coding scheme the sum of all components of t_n is one, and therefore all components of y(x) sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

92 Deficiencies of the Least Squares Approach: (Figure: magenta curve - decision boundary for the least squares approach; green curve - decision boundary for the logistic regression model described later.)

93 Deficiencies of the Least Squares Approach: (Figure continued: magenta - least squares decision boundary; green - logistic regression decision boundary.)

94 Fisher's Linear Discriminant: View linear classification as dimensionality reduction: y(x) = w^T x. If y ≥ −w_0 then choose class C_1, otherwise C_2. But there are many possible projections from a D-dimensional input space onto one dimension, and projection always means loss of information. For classification we want to preserve the class separation in one dimension. Can we find a projection which maximally preserves the class separation?

95 Fisher's Linear Discriminant: (Figure: samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.)

96 Fisher's Linear Discriminant - First Try: Given N_1 input data points of class C_1 and N_2 of class C_2, calculate the centres of the two classes, m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n. Choose w so as to maximise the separation of the projected class means, w^T (m_1 − m_2). Problem: this fails when the classes have non-uniform covariance.

97 Fisher's Linear Discriminant: Also measure the within-class variance for each class, s_k² = Σ_{n∈C_k} (y_n − m_k)², where y_n = w^T x_n. Maximise the Fisher criterion J(w) = (m_2 − m_1)² / (s_1² + s_2²).

98 Fisher's Linear Discriminant: The Fisher criterion can be rewritten as J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class covariance, S_B = (m_2 − m_1)(m_2 − m_1)^T, and S_W is the within-class covariance, S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T.

99 Fisher's Linear Discriminant: The Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w) has a maximum for Fisher's linear discriminant, w ∝ S_W^{−1}(m_2 − m_1). Fisher's linear discriminant is NOT itself a discriminant, but it can be used to construct one by choosing a threshold y_0 in the projection space.
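
A sketch of Fisher's linear discriminant, w ∝ S_W^{−1}(m_2 − m_1), on illustrative two-class data (the class means and covariance below are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative two-class data with correlated (non-uniform) covariance.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([2, 1], cov, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class covariance S_W, summed over both classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m2 - m1)      # Fisher direction (defined up to scale)
w /= np.linalg.norm(w)

# Project the data; choosing a threshold y_0 on w^T x then yields an actual classifier.
y1, y2 = X1 @ w, X2 @ w
print(w, y1.mean(), y2.mean())
```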

100 The Perceptron Algorithm: The perceptron ("Mark 1", Cornell University, 1960) was the first computer which could learn new skills by trial and error.

101 The Perceptron Algorithm: Frank Rosenblatt (1928-1971), "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962).

102 The Perceptron Algorithm: Two-class model. Create a feature vector φ(x) by a fixed nonlinear transformation of the input x. Generalised linear model y(x) = f(w^T φ(x)), with φ(x) containing a bias element φ_0(x) = 1. Nonlinear activation function: f(a) = +1 if a ≥ 0, and −1 if a < 0. Target coding for the perceptron: t = +1 for class C_1 and t = −1 for class C_2.

103 The Perceptron Algorithm - Error Function: Idea: minimise the total number of misclassified patterns. Problem: as a function of w, this count is piecewise constant, so the gradient is zero almost everywhere. Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0. Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M, E_P(w) = −Σ_{n∈M} w^T φ(x_n) t_n.

104 Perceptron - Stochastic Gradient Descent: Perceptron criterion (with the notation φ_n = φ(x_n)): E_P(w) = −Σ_{n∈M} w^T φ_n t_n. One iteration at step τ: (1) choose a training pair (x_n, t_n); (2) update the weight vector w by w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n. As y(x, w) does not depend on the norm of w, one can set η = 1: w^(τ+1) = w^(τ) + φ_n t_n.
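
A sketch of the perceptron update loop with η = 1 (linearly separable toy data are assumed; φ(x) is simply the input with a bias element here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Linearly separable toy data, targets coded as +1 / -1.
X = np.vstack([rng.normal([0, 0], 0.4, (50, 2)), rng.normal([3, 3], 0.4, (50, 2))])
t = np.hstack([np.full(50, -1.0), np.full(50, +1.0)])
Phi = np.hstack([np.ones((100, 1)), X])          # phi_0(x) = 1 bias element

w = np.zeros(Phi.shape[1])
for epoch in range(100):
    misclassified = 0
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:               # pattern violates w^T phi_n t_n > 0
            w = w + phi_n * t_n                  # w^(tau+1) = w^(tau) + phi_n t_n
            misclassified += 1
    if misclassified == 0:                       # converged (training set separable)
        break

print(w, epoch, misclassified)
```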

105 The Perceptron Algorithm - Update 1: (Figure: update of the perceptron weights from a misclassified pattern (green), w^(τ+1) = w^(τ) + φ_n t_n.)

106 The Perceptron Algorithm - Update 2: (Figure: update of the perceptron weights from another misclassified pattern (green), w^(τ+1) = w^(τ) + φ_n t_n.)

107 The Perceptron Algorithm - Convergence: Does the algorithm converge? For a single update step, −w^(τ+1)T φ_n t_n = −w^(τ)T φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^(τ)T φ_n t_n, because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖² > 0, so the error contribution of this pattern decreases. BUT: the contributions to the error from the other misclassified patterns might have increased. AND: some correctly classified patterns might now be misclassified. Perceptron convergence theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.

108 Three Models for Decision Problems (recap of slide 84): in increasing order of complexity, discriminant functions, discriminative models, and generative models.

109 Probabilistic Generative Models: Generative approach: model the class-conditional densities p(x | C_k) and the priors p(C_k) to calculate the posterior probability for class C_1: p(C_1 | x) = p(x | C_1) p(C_1) / (p(x | C_1) p(C_1) + p(x | C_2) p(C_2)) = 1/(1 + exp(−a(x))) = σ(a(x)), where a and the logistic sigmoid function σ(a) are given by a(x) = ln [p(x | C_1) p(C_1)] / [p(x | C_2) p(C_2)] = ln p(x, C_1)/p(x, C_2), and σ(a) = 1/(1 + exp(−a)).

110 Logistic Sigmoid: The logistic sigmoid function σ(a) = 1/(1 + exp(−a)) is a "squashing function" because it maps the real axis into the finite interval (0, 1). Symmetry: σ(−a) = 1 − σ(a). Derivative: (d/da) σ(a) = σ(a) σ(−a) = σ(a)(1 − σ(a)). The inverse is called the logit function, a(σ) = ln(σ/(1 − σ)). (Figure: the logistic sigmoid σ(a) and the logit a(σ).)

111 Probabilistic Generative Models - Multiclass: The normalised exponential is given by p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j), where a_k = ln(p(x | C_k) p(C_k)). It is also called the softmax function, as it is a smoothed version of the max function. Example: if a_k ≫ a_j for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
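
A sketch of the softmax / normalised exponential (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the slide):

```python
import numpy as np

def softmax(a):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                  # shift for stability; the result is unchanged
    e = np.exp(a)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # smooth, sums to 1
print(softmax([100.0, 1.0, 0.1]))    # a_k >> a_j: p(C_k | x) ≈ 1, others ≈ 0
```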

112 General Case - K Classes, Different Covariance: If each class-conditional probability is Gaussian and has a different covariance, the quadratic terms −(1/2) x^T Σ^{−1} x no longer cancel each other out, and we get a quadratic discriminant.

113 Discrete Features - Naive Bayes: Assume the input space consists of discrete features, in the simplest case x_i ∈ {0, 1}. For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries; together with the normalisation constraint, these are 2^D − 1 independent variables, which grows exponentially with the number of features. The naive Bayes assumption is that all features, conditioned on the class C_k, are independent of each other: p(x | C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}.

114 Discrete Features - Naive Bayes: With the naive Bayes model p(x | C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}, we can again find the factors a_k in the normalised exponential p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k)/Σ_j exp(a_j) as a linear function of the x_i: a_k(x) = Σ_{i=1}^{D} {x_i ln µ_{ki} + (1 − x_i) ln(1 − µ_{ki})} + ln p(C_k).

115 Three Models for Decision Problems (recap of slide 84): in increasing order of complexity, discriminant functions, discriminative models, and generative models.

116 Logistic Regression is Classification: Two classes, where the posterior of class C_1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ: p(C_1 | φ) = y(φ) = σ(w^T φ), and p(C_2 | φ) = 1 − p(C_1 | φ). The number of model parameters equals the dimension M of the feature space. Compare this to fitting two Gaussians with a shared covariance: 2M parameters for the means plus M(M + 1)/2 for the shared covariance, i.e. M(M + 5)/2 in total. For larger M, the logistic regression model has a clear advantage.

117 Logistic Regression is Classification: Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N, where φ_n = φ(x_n) and the class membership is coded as t_n ∈ {0, 1}. Likelihood function: p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}, where y_n = p(C_1 | φ_n). Error function: the negative log likelihood, resulting in the cross-entropy error function E(w) = −ln p(t | w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}.

118 Logistic Regression is Classification: Error function (cross-entropy error): E(w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}, with y_n = p(C_1 | φ_n) = σ(w^T φ_n). Gradient of the error function (using dσ/da = σ(1 − σ)): ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n. The gradient does not contain any sigmoid function; for each data point, the contribution is the product of the deviation y_n − t_n and the basis function vector φ_n. BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
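
A sketch of maximum likelihood logistic regression by batch gradient descent using the gradient above (the learning rate, iteration count and toy data are illustrative assumptions; the slides do not prescribe a particular optimiser):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data, targets t_n in {0, 1}.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
t = np.hstack([np.zeros(100), np.ones(100)])
Phi = np.hstack([np.ones((200, 1)), X])          # features with a bias element

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
eta = 0.01                                       # learning rate (illustrative)
for _ in range(5000):
    y = sigmoid(Phi @ w)                         # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                       # grad E(w) = sum_n (y_n - t_n) phi_n
    w -= eta * grad

y = np.clip(sigmoid(Phi @ w), 1e-12, 1 - 1e-12)  # clip to avoid log(0)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy error
print(w, E)
```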

119 Original Input versus Feature Space

We have used the direct input x until now.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centered at the green crosses in the input space.
(Figure: data in the original input space (x_1, x_2) and the same data mapped into the feature space (φ_1, φ_2).)

120 Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in the input space, they will also overlap in the feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!
(Figure: decision boundary in the input space (x_1, x_2) and the corresponding linear boundary in the feature space (φ_1, φ_2).)

121 Part IV: Neural Networks

Neural Networks
Parameter Optimisation

122 Functional Transformations

As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and a hidden unit z_0 = 1:
    y_k(x, w) = g\left( \sum_{j=0}^{M} w_{kj}^{(2)} h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)
Compare to the generalised linear model
    y_k(x, w) = g\left( \sum_{j=0}^{M} w_{kj}^{(2)} φ_j(x) \right)
(Figure: two-layer network diagram with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}.)
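A minimal forward-pass sketch of the two-layer mapping above (assuming NumPy; the weight shapes and toy numbers are an illustrative convention, not the lecture's).

import numpy as np

def two_layer_forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Forward pass y_k(x, w) = g( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) ).

    x  : (D,) input vector (without the bias entry)
    W1 : (M, D + 1) first-layer weights, column 0 multiplies the bias x_0 = 1
    W2 : (K, M + 1) second-layer weights, column 0 multiplies the bias z_0 = 1
    h  : hidden activation, g : output activation (identity for regression)
    """
    x_ext = np.concatenate(([1.0], x))        # absorb bias: x_0 = 1
    a = W1 @ x_ext                             # first-layer activations
    z_ext = np.concatenate(([1.0], h(a)))     # hidden units, with z_0 = 1
    return g(W2 @ z_ext)

# tiny example: D = 2 inputs, M = 3 hidden units, K = 1 output (random weights)
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
print(two_layer_forward(np.array([0.5, -1.0]), W1, W2))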

123 Variable Basis Functions in a Neural Network

φ(x) = σ(w_0 + w_1 x_1 + w_2 x_2) for different parameter vectors w.
(Figure: four surface plots of the basis function over (x_1, x_2), for w = (0, 1, 0.1), w = (0, 0.1, 1), w = (0, 0.5, 0.5), and w = (1, 0.5, 0.5).)

124 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = x^2.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

125 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = sin(x).
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

126 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = |x|.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

127 Approximation Capabilities of Neural Networks

Neural network approximating the Heaviside function
    f(x) = 1 for x ≥ 0, and f(x) = 0 for x < 0.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

128 Approximation Capabilities of Neural Networks

Neural network for two-class classification: 2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.
(Figure) Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: optimal decision boundary computed from the known data distribution.

129 Parameter Optimisation

Nonlinear mapping from input x_n to output y(x_n, w).
Sum-of-squares error function over all training data:
    E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2,
where we have N pairs of input vectors x_n and target vectors t_n.
Find the parameter ŵ which minimises E(w) by gradient descent:
    ŵ = \arg\min_w E(w)

130 Error Backpropagation

Given the current errors δ_k, the activation function h(·), its derivative h'(·), and the outputs z_i of the previous layer.
The errors in the previous layer follow from the backpropagation formula
    δ_j = h'(a_j) \sum_k w_{kj} δ_k.
The components of the gradient of E_n are then
    \frac{∂E_n(w)}{∂w_{ji}} = δ_j z_i.
(Figure: the errors δ_k propagated backwards through the weights w_{kj} to obtain δ_j at unit z_j.)
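A sketch of these backpropagation formulas for a two-layer tanh network with linear outputs and per-example sum-of-squares error, reusing the weight convention of the forward-pass sketch above; this is an illustration under those assumptions, not the lecture's code.

import numpy as np

def backprop_single_example(x, t, W1, W2):
    """Gradient of the per-example sum-of-squares error for a two-layer tanh network.

    W1 : (M, D + 1), W2 : (K, M + 1); x : (D,), t : (K,).
    Returns (dE/dW1, dE/dW2).
    """
    # forward pass
    x_ext = np.concatenate(([1.0], x))
    a = W1 @ x_ext                        # hidden activations a_j
    z = np.tanh(a)
    z_ext = np.concatenate(([1.0], z))
    y = W2 @ z_ext                        # linear outputs

    # output errors delta_k = y_k - t_k (sum-of-squares error, linear output units)
    delta_k = y - t
    # backpropagation formula: delta_j = h'(a_j) * sum_k w_kj delta_k  (h = tanh)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)

    # gradient components dE_n/dw_ji = delta_j * z_i
    dW2 = np.outer(delta_k, z_ext)
    dW1 = np.outer(delta_j, x_ext)
    return dW1, dW2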

131 Efficiency of Error Backpropagation

As the number of weights W is usually much larger than the number of units (the network is well connected), the complexity of calculating the gradient ∂E_n(w)/∂w_{ji} via error backpropagation is O(W).
Compare this to numerical differentiation using forward differences
    \frac{∂E_n(w)}{∂w_{ji}} = \frac{E_n(w_{ji} + ε) - E_n(w_{ji})}{ε} + O(ε)
or the numerically more stable (fewer round-off errors) symmetric differences
    \frac{∂E_n(w)}{∂w_{ji}} = \frac{E_n(w_{ji} + ε) - E_n(w_{ji} - ε)}{2ε} + O(ε^2),
which both need O(W^2) operations.
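For comparison, a symmetric-difference gradient routine as described above (a generic sketch; the quadratic test function at the end is just an example with a known gradient).

import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Symmetric-difference gradient of a scalar function E at w, accurate to O(eps^2).

    Needs two evaluations of E per weight; if each evaluation is itself O(W),
    the total cost is O(W^2), which is the point of the comparison above.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

E = lambda w: 0.5 * np.sum(w ** 2)         # exact gradient is w itself
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(E, w))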

132 Regularisation in Neural Networks

Model complexity matters again.
(Figure: examples of two-layer networks with M = 1, M = 3, and M = 10 hidden units fitted to the same data set.)

133 Regularisation via Early Stopping

Stop training at the minimum of the validation set error.
(Figure: training set error and validation set error as functions of the number of training iterations.)

134 Part V: Kernel Methods and SVM

Kernel Methods
Maximum Margin Classifiers

135 Kernel Methods

Keep (some of) the training data and recast prediction as a linear combination of kernel functions evaluated at the kept training data points and the new test point.
Let L(t, y(x)) be any loss function and J(f) any penalty quadratic in f; then the minimiser of the penalised loss
    \sum_{n=1}^{N} L(t_n, y(x_n)) + λ J(f)
has the form
    f(x) = \sum_{n=1}^{N} α_n k(x_n, x)
with α minimising
    \sum_{n=1}^{N} L(t_n, (Kα)_n) + λ α^T K α,
where the kernel matrix has entries K_{ij} = K_{ji} = k(x_i, x_j).
Kernel trick, based on Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional (possibly infinite-dimensional) space.
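For the special case of a squared loss, the penalised problem above has the closed-form solution α = (K + λI)^{-1} t, i.e. kernel ridge regression. A minimal sketch, assuming NumPy and a Gaussian kernel; the function names and toy data are illustrative.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, t, lam=0.1, gamma=1.0):
    """For squared loss, alpha minimising ||t - K alpha||^2 + lam alpha^T K alpha
    satisfies (K + lam I) alpha = t."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_n alpha_n k(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy 1-D regression example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, t, lam=0.1, gamma=0.5)
X_new = np.linspace(-3, 3, 5)[:, None]
print(kernel_ridge_predict(X, alpha, X_new, gamma=0.5))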

136 Maximum Margin Classifiers

Support vector machines choose the decision boundary which maximises the smallest distance to the samples in both classes:
    ŵ = \arg\max_{w : \|w\| = 1} \min_n [ t_n (w^T φ(x_n)) ],    t_n ∈ {-1, 1}
Linear boundary for the identity features φ_k(x) = x_k.
(Figure: maximum margin boundary with the contours y = -1, y = 0, y = 1.)

137 Maximum Margin Classifiers

Non-linear boundary for general φ(x):
    ŵ = \sum_{n=1}^{N} α_n φ(x_n)
for a few non-zero α_n and the corresponding x_n (the support vectors).
    \hat{f}(x) = ŵ^T φ(x) = \sum_{n=1}^{N} α_n k(x_n, x),    with k(x_n, x) = φ(x_n)^T φ(x)

138 Overlapping Class Distributions

Introduce a slack variable ξ_n for each data point n:
    ξ_n = 0 if the data point is correctly classified and on the margin boundary or beyond,
    ξ_n = |t_n - y(x_n)| otherwise.
(Figure: slack variables; points with ξ = 0 on or beyond the correct margin boundary (y = ±1), 0 < ξ < 1 inside the margin but correctly classified, ξ > 1 misclassified.)

139 Overlapping Class Distributions

(Figure) The ν-SVM algorithm using Gaussian kernels exp(-γ ||x - x'||^2) with γ = 0.45, applied to a nonseparable data set in two dimensions. Support vectors are indicated by circles.

140 Part VI: Mixture Models and EM

K-means Clustering
Mixture Models and EM
Mixture of Bernoulli Distributions
EM for Gaussian Mixtures - Latent Variables
Convergence of EM

141 K-means Clustering

Goal: partition N feature vectors x_n into K clusters using the Euclidean distance d(x_i, x_j) = ||x_i - x_j||, such that each feature vector belongs to the cluster with the nearest mean.
Distortion measure:
    J(μ, cl) = \sum_{n=1}^{N} d(x_n, μ_{cl(x_n)})^2
where cl(x_n) is the index of the cluster centre closest to x_n.
Start with K arbitrary cluster centres μ_k.
E-step: minimise J w.r.t. cl(x_n): assign each data point x_n to its closest cluster, with index cl(x_n).
M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.
Iterating over the E/M steps converges to a local minimum of J.
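A compact K-means sketch following the two steps above (assuming NumPy; initialising the centres from K randomly chosen data points is just one common choice).

import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Plain K-means: alternate assignment and mean-update steps.

    X : (N, D) data, K : number of clusters. Returns (centres, assignments).
    """
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), K, replace=False)]   # K arbitrary initial centres
    for _ in range(n_iters):
        # E step: assign each point to the nearest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        cl = d2.argmin(axis=1)
        # M step: recompute each centre as the mean of its assigned points
        new_centres = np.array([X[cl == k].mean(axis=0) if np.any(cl == k) else centres[k]
                                for k in range(K)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, cl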

142 K-means Clustering - Example

(Figure: nine panels (a)-(i) showing successive assignment and mean-update steps of K-means on a two-dimensional data set.)

143 Mixture Models and EM

Mixture of Gaussians:
    p(x | π, μ, Σ) = \sum_{k=1}^{K} π_k N(x | μ_k, Σ_k)
Maximise the likelihood p(X | π, μ, Σ) w.r.t. π, μ, Σ.
Compare to the two steps of K-means:
E-step: minimise J w.r.t. cl(x_n): assign each data point x_n to its closest cluster.
M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.
(Figure: panels (a)-(f) showing successive EM iterations for a Gaussian mixture fit, labelled by the number of completed iterations L, e.g. L = 2 and L = 5.)

144 EM for Gaussian Mixtures

Given a Gaussian mixture and data X, maximise the log likelihood w.r.t. the parameters (π, μ, Σ).
1 Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k. Evaluate the log likelihood function.
2 E step: evaluate the responsibilities γ(z_nk) using the current parameters
    γ(z_nk) = \frac{π_k N(x_n | μ_k, Σ_k)}{\sum_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)}
3 M step: re-estimate the parameters using the current γ(z_nk), with N_k = \sum_{n=1}^{N} γ(z_nk):
    μ_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n
    Σ_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) (x_n - μ_k^{new})(x_n - μ_k^{new})^T
    π_k^{new} = \frac{N_k}{N}
4 Evaluate the log likelihood
    \ln p(X | π, μ, Σ) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} π_k^{new} N(x_n | μ_k^{new}, Σ_k^{new}) \right\};
  if not converged, go to step 2.
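A sketch of this EM loop, assuming NumPy and SciPy's multivariate normal density; the initialisation, the small diagonal jitter on the covariances, and the convergence tolerance are illustrative choices, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, rng=None):
    """EM for a Gaussian mixture, following the E/M updates on the slide.

    X : (N, D) data. Returns (pi, mu, Sigma, log_likelihoods).
    """
    rng = np.random.default_rng(rng)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]              # initialise means from data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    lls = []
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])       # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate pi, mu, Sigma
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # log likelihood and convergence check
        lls.append(np.log(dens.sum(axis=1)).sum())
        if len(lls) > 1 and abs(lls[-1] - lls[-2]) < tol:
            break
    return pi, mu, Sigma, lls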

145 Mixture of Bernoulli Distributions

A set of D binary variables x_i, i = 1, ..., D, each governed by a Bernoulli distribution with parameter μ_i. Therefore
    p(x | μ) = \prod_{i=1}^{D} μ_i^{x_i} (1 - μ_i)^{1 - x_i}
Expectation and covariance:
    E[x] = μ,    cov[x] = diag\{ μ_i (1 - μ_i) \}

146 Mixture of Bernoulli Distributions

Mixture with
    p(x | μ, π) = \sum_{k=1}^{K} π_k p(x | μ_k),    p(x | μ_k) = \prod_{i=1}^{D} μ_{ki}^{x_i} (1 - μ_{ki})^{1 - x_i}
A similar calculation as for the mixture of Gaussians gives
    γ(z_nk) = \frac{π_k p(x_n | μ_k)}{\sum_{j=1}^{K} π_j p(x_n | μ_j)},    N_k = \sum_{n=1}^{N} γ(z_nk)
    \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n,    μ_k = \bar{x}_k,    π_k = \frac{N_k}{N}

147 EM for Mixture of Bernoulli Distributions - Digits

(Figure) Examples from a digits data set; each pixel takes only binary values.
(Figure) Parameters μ_{ki} for each component in the mixture.
(Figure) Fit of a single multivariate Bernoulli distribution, for comparison.

148 The Role of Latent Variables

EM finds the maximum likelihood solution for models with latent variables.
Two kinds of variables: observed variables X and latent variables Z, plus model parameters θ.
The log likelihood is then
    \ln p(X | θ) = \ln \left\{ \sum_{Z} p(X, Z | θ) \right\}
The optimisation is difficult because of the log of the sum.
Assume that maximisation of the distribution p(X, Z | θ) over the complete data set {X, Z} is straightforward.
But we only have the incomplete data set {X} and the posterior distribution p(Z | X, θ).

149 EM - Key Idea

Key idea of EM: as Z is not observed, work with an averaged version Q(θ, θ^{old}) of the complete-data log likelihood ln p(X, Z | θ), averaged over all states of Z:
    Q(θ, θ^{old}) = \sum_{Z} p(Z | X, θ^{old}) \ln p(X, Z | θ)

150 EM Algorithm

1 Choose an initial setting for the parameters θ^{old}.
2 E step: evaluate p(Z | X, θ^{old}).
3 M step: evaluate θ^{new} given by
    θ^{new} = \arg\max_{θ} Q(θ, θ^{old}),    where    Q(θ, θ^{old}) = \sum_{Z} p(Z | X, θ^{old}) \ln p(X, Z | θ)
4 Check for convergence of the log likelihood or of the parameter values. If not yet converged, set θ^{old} = θ^{new} and go to step 2.

151 EM Algorithm - Convergence

Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:
    \ln p(X, Z | θ) = \ln p(Z | X, θ) + \ln p(X | θ).
Multiply by an arbitrary distribution q(Z) and sum over Z:
    \sum_{Z} q(Z) \ln p(X, Z | θ) = \sum_{Z} q(Z) \ln p(Z | X, θ) + \ln p(X | θ).
Rewrite as
    \ln p(X | θ) = L(q, θ) + KL(q \| p)
with
    L(q, θ) = \sum_{Z} q(Z) \ln \frac{p(X, Z | θ)}{q(Z)},    KL(q \| p) = - \sum_{Z} q(Z) \ln \frac{p(Z | X, θ)}{q(Z)}.
KL(q‖p) is the Kullback-Leibler divergence.

152 Kullback-Leibler Divergence

A (non-symmetric) measure of the difference between two distributions p(y) and q(y):
    KL(q \| p) = \sum_{y} q(y) \ln \frac{q(y)}{p(y)} = - \sum_{y} q(y) \ln \frac{p(y)}{q(y)}    (discrete case)
    KL(q \| p) = \int q(y) \ln \frac{q(y)}{p(y)} \, dy = - \int q(y) \ln \frac{p(y)}{q(y)} \, dy    (continuous case)
Properties:
KL(q‖p) ≥ 0
not symmetric: KL(q‖p) ≠ KL(p‖q)
KL(q‖p) = 0 iff q = p
invariant under parameter transformations
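A tiny sketch of the discrete form, mainly to make the asymmetry visible (assuming NumPy; the epsilon guard is an implementation convenience, and the two distributions are made-up numbers).

import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Discrete KL(q || p) = sum_y q(y) ln(q(y) / p(y)); assumes q and p are normalised."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl_divergence(q, p), kl_divergence(p, q))   # the two values differ: not symmetric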

153 EM Algorithm - Convergence

The two parts of ln p(X | θ):
    \ln p(X | θ) = L(q, θ) + KL(q \| p)
Since KL(q‖p) ≥ 0, the term L(q, θ) is a lower bound:
    L(q, θ) ≤ \ln p(X | θ)
(Figure: decomposition of ln p(X | θ) into L(q, θ) and KL(q‖p).)

154 EM Algorithm - E Step

Hold θ^{old} fixed. Maximise the lower bound L(q, θ^{old}) with respect to q(·).
L(q, θ^{old}) is a functional, and ln p(X | θ^{old}) does NOT depend on q(·).
The maximum of L(q, θ^{old}) occurs when the Kullback-Leibler divergence vanishes.
Therefore, choose q(Z) = p(Z | X, θ^{old}), which gives
    KL(q \| p) = 0    and    L(q, θ^{old}) = \ln p(X | θ^{old}).

155 EM Algorithm - M Step

Hold q(·) = p(Z | X, θ^{old}) fixed. Maximise the lower bound L(q, θ) with respect to θ:
    θ^{new} = \arg\max_{θ} L(q, θ) = \arg\max_{θ} \sum_{Z} q(Z) \ln p(X, Z | θ)
Then L(q, θ^{new}) > L(q, θ^{old}) unless the maximum has already been reached.
As q(·) = p(Z | X, θ^{old}) is held fixed, p(Z | X, θ^{new}) will not be equal to q(·), so the Kullback-Leibler divergence will be greater than zero (unless converged). Hence
    \ln p(X | θ^{new}) = L(q, θ^{new}) + KL(q \| p) ≥ L(q, θ^{new}) > L(q, θ^{old}) = \ln p(X | θ^{old}).

156 EM Algorithm - Parameter View

(Figure: the incomplete-data log likelihood ln p(X | θ) and the lower bound L(q, θ) as functions of θ, with the points θ^{old} and θ^{new}.)
Red curve: incomplete-data log likelihood. Blue curve: lower bound after the E step. Green curve: lower bound after the M step.

157 Part VII: Sampling

Sampling from the Uniform Distribution
Sampling from Standard Distributions
Rejection Sampling
Importance Sampling
Markov Chain Monte Carlo - The Idea

158 Sampling from the Uniform Distribution

In a computer this is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers.
Example: linear congruential generators
    z^{(n+1)} = (a z^{(n)} + c) mod m
for modulus m > 0, multiplier 0 < a < m, increment 0 ≤ c < m, and seed z^{(0)}.
Other classes of pseudorandom number generators:
Lagged Fibonacci generators
Linear feedback shift registers
Generalised feedback shift registers
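A minimal linear congruential generator sketch; the particular constants (a, c, m) below are one well-known published choice used purely as an example, and any values satisfying the stated constraints define a generator.

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator z_{n+1} = (a z_n + c) mod m; yields values in [0, 1)."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z / m

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])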

159 Example: RANDU Random Number Generator

Used since the 1960s on many machines.
Defined by the recurrence
    z^{(n+1)} = (65539 \, z^{(n)}) \mod 2^{31}

160 RANDU Looks Somehow OK?

Plotting (z^{(n+2)}, z^{(n+1)}, z^{(n)})^T in 3D ...
(Figure: 3D scatter plot of consecutive RANDU triples, which looks uniform from this viewpoint.)

161 RANDU Not Really OK

Plotting (z^{(n+2)}, z^{(n+1)}, z^{(n)})^T in 3D ... and changing the viewpoint reveals that all points fall into 15 planes.
(Figure: the same scatter plot from a different viewpoint, showing the 15 planes.)

162 A Bad Generator - RANDU

Analyse the recurrence
    z^{(n+1)} = (65539 \, z^{(n)}) \mod 2^{31}
Taking every equation modulo 2^{31} and writing 65539 = 2^{16} + 3, we can correlate three consecutive samples:
    z^{(n+2)} = (2^{16} + 3)^2 z^{(n)} = (2^{32} + 6 \cdot 2^{16} + 9) z^{(n)} = (6 (2^{16} + 3) - 9) z^{(n)} = 6 z^{(n+1)} - 9 z^{(n)}
Marsaglia, George, "Random Numbers Fall Mainly in the Planes", Proc. National Academy of Sciences 61, 25-28, 1968.
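A short sketch that reproduces RANDU and checks the correlation z^{(n+2)} = 6 z^{(n+1)} - 9 z^{(n)} (mod 2^31) numerically (assuming NumPy; the seed is arbitrary but should be odd).

import numpy as np

def randu(seed, n):
    """Generate n RANDU samples z_{k+1} = 65539 * z_k mod 2^31."""
    z = np.empty(n, dtype=np.int64)
    z[0] = seed
    for k in range(1, n):
        z[k] = (65539 * z[k - 1]) % 2**31
    return z

z = randu(seed=1, n=1000)
lhs = z[2:]
rhs = (6 * z[1:-1] - 9 * z[:-2]) % 2**31
print(np.all(lhs == rhs))   # True: consecutive triples satisfy the plane equation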

163 Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function
    h(y) = \int_{-\infty}^{y} p(\hat{y}) \, d\hat{y}
Transform the samples z ~ U(0, 1) by
    y = h^{-1}(z)
to obtain samples y distributed according to p(y).

164 Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
If a uniformly distributed random variable z is transformed using y = h^{-1}(z), then y will be distributed according to p(y).
(Figure: the cumulative distribution h(y) mapping uniform samples z ∈ (0, 1) to samples y from p(y).)

165 Sampling from the Exponential Distribution

Goal: sample from the exponential distribution
    p(y) = λ e^{-λ y} for y ≥ 0, and p(y) = 0 for y < 0,
with rate parameter λ > 0.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function
    h(y) = \int_{0}^{y} p(\hat{y}) \, d\hat{y} = \int_{0}^{y} λ e^{-λ \hat{y}} \, d\hat{y} = 1 - e^{-λ y}
Transform the samples z ~ U(0, 1) by
    y = h^{-1}(z) = -\frac{1}{λ} \ln(1 - z)
to obtain samples y distributed according to the exponential distribution.
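A direct sketch of this inverse-CDF transform (assuming NumPy; the rate and sample size are arbitrary).

import numpy as np

def sample_exponential(lam, n, rng=None):
    """Inverse-CDF sampling: y = -ln(1 - z)/lam with z ~ U(0, 1)."""
    rng = np.random.default_rng(rng)
    z = rng.uniform(0.0, 1.0, size=n)
    return -np.log(1.0 - z) / lam

samples = sample_exponential(lam=2.0, n=100_000, rng=0)
print(samples.mean())   # close to 1/lambda = 0.5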

166 Sampling the Gaussian Distribution - Box-Muller

1 Generate pairs of uniformly distributed random numbers z_1, z_2 ∈ (-1, 1) (e.g. z_i = 2z - 1 for z from U(0, 1)).
2 Discard any pair (z_1, z_2) unless z_1^2 + z_2^2 ≤ 1. This results in a uniform distribution inside the unit circle, p(z_1, z_2) = 1/π.
3 Evaluate r^2 = z_1^2 + z_2^2 and
    y_1 = z_1 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2},    y_2 = z_2 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2}
4 Then y_1 and y_2 are independent with joint distribution
    p(y_1, y_2) = p(z_1, z_2) \left| \frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| = \frac{1}{\sqrt{2π}} e^{-y_1^2/2} \, \frac{1}{\sqrt{2π}} e^{-y_2^2/2}
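A sketch of the polar Box-Muller steps above (assuming NumPy; written for clarity rather than speed).

import numpy as np

def box_muller_polar(n, rng=None):
    """Polar Box-Muller: rejection inside the unit circle, then the transform above."""
    rng = np.random.default_rng(rng)
    samples = []
    while len(samples) < n:
        z1, z2 = rng.uniform(-1.0, 1.0, size=2)
        r2 = z1**2 + z2**2
        if 0.0 < r2 <= 1.0:                      # keep only points inside the unit circle
            factor = np.sqrt(-2.0 * np.log(r2) / r2)
            samples.extend([z1 * factor, z2 * factor])
    return np.array(samples[:n])

y = box_muller_polar(100_000, rng=0)
print(y.mean(), y.std())   # close to 0 and 1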

167 Rejection Sampling

Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p(z) up to some unknown normalisation constant Z_p:
    p(z) = \frac{1}{Z_p} \tilde{p}(z)
Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k and all z it holds that
    k q(z) ≥ \tilde{p}(z)
(Figure: the scaled proposal k q(z) enveloping \tilde{p}(z); a proposed sample z_0 with a uniform value u_0 drawn under k q(z_0).)

168 Rejection Sampling

1 Generate a random number z_0 from the distribution q(z).
2 Generate a number u_0 from the uniform distribution over [0, k q(z_0)].
3 If u_0 > \tilde{p}(z_0), reject the pair (z_0, u_0).
4 The remaining pairs are uniformly distributed under the curve \tilde{p}(z).
5 The z values are then distributed according to p(z).
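A generic rejection-sampling sketch following these steps; the unnormalised target, the uniform proposal, and the bound k below are made-up examples chosen so that k q(z) ≥ p̃(z) holds.

import numpy as np

def rejection_sample(p_tilde, q_sample, q_pdf, k, n, rng=None):
    """Propose z0 ~ q, draw u0 ~ U[0, k q(z0)], accept z0 if u0 <= p_tilde(z0).

    Requires k q(z) >= p_tilde(z) for all z.
    """
    rng = np.random.default_rng(rng)
    samples = []
    while len(samples) < n:
        z0 = q_sample(rng)
        u0 = rng.uniform(0.0, k * q_pdf(z0))
        if u0 <= p_tilde(z0):
            samples.append(z0)
    return np.array(samples)

# example: unnormalised target on [0, 1], uniform proposal q(z) = 1 on [0, 1]
p_tilde = lambda z: z**2 * (1 - z) ** 3 if 0 <= z <= 1 else 0.0
q_sample = lambda rng: rng.uniform(0.0, 1.0)
q_pdf = lambda z: 1.0
k = 0.04                       # upper bound on p_tilde (its maximum is about 0.035)
print(rejection_sample(p_tilde, q_sample, q_pdf, k, n=5, rng=0))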

169 Importance Sampling

Provides a framework to directly approximate the expectation E_p[f(z)] with respect to some distribution p(z). It does NOT itself provide samples from p(z).
Again use a proposal distribution q(z) and draw samples z^{(l)} from it. Then
    E[f] = \int f(z) p(z) \, dz = \int f(z) \frac{p(z)}{q(z)} q(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})} f(z^{(l)})
(Figure: target p(z), proposal q(z), and the function f(z).)

170 Importance Sampling - Unnormalised

Consider both p(z) and q(z) to be unnormalised, p(z) = \tilde{p}(z)/Z_p and q(z) = \tilde{q}(z)/Z_q. It follows that
    E[f] \approx \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} r_l f(z^{(l)}),    r_l = \frac{\tilde{p}(z^{(l)})}{\tilde{q}(z^{(l)})}.
Use the same set of samples to estimate
    \frac{Z_p}{Z_q} \approx \frac{1}{L} \sum_{l=1}^{L} r_l,
resulting in the formula for unnormalised distributions
    E[f] \approx \sum_{l=1}^{L} w_l f(z^{(l)}),    w_l = \frac{r_l}{\sum_{m=1}^{L} r_m}.
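A sketch of the self-normalised estimator with weights w_l = r_l / Σ_m r_m (assuming NumPy; the unnormalised Gaussian target and the wider Gaussian proposal are illustrative).

import numpy as np

def importance_expectation(f, p_tilde, q_tilde, q_sample, L, rng=None):
    """Self-normalised importance sampling estimate of E_p[f] with unnormalised p and q."""
    rng = np.random.default_rng(rng)
    z = q_sample(rng, L)                       # L samples from the proposal
    r = p_tilde(z) / q_tilde(z)                # importance ratios r_l
    w = r / r.sum()                            # normalised weights w_l
    return np.sum(w * f(z))

# example: E[z^2] under an (unnormalised) standard Gaussian, proposal N(0, 2^2)
p_tilde = lambda z: np.exp(-0.5 * z**2)
q_tilde = lambda z: np.exp(-0.5 * (z / 2.0) ** 2)
q_sample = lambda rng, L: 2.0 * rng.normal(size=L)
print(importance_expectation(lambda z: z**2, p_tilde, q_tilde, q_sample, L=100_000, rng=0))
# close to 1, the variance of the standard Gaussian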

171 Importance Sampling - Key Points

Try to choose sample points in the input space where the product f(z) p(z) is large, or at least where p(z) is large.
The importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).
Success depends on how well q(z) approximates p(z).
If p(z) > 0 in some region, then q(z) > 0 is necessary there.

172 Markov Chain Monte Carlo

Goal: generate samples from the distribution p(z).
Idea: build a machine which uses the current sample to decide which sample to produce next, in such a way that the overall distribution of the samples will be p(z).
1 The current sample z^{(τ)} is known. Generate a new candidate z* from a proposal distribution q(z | z^{(τ)}) that we know how to sample from.
2 Accept or reject the new candidate according to an appropriate criterion:
    z^{(τ+1)} = z* if accepted, z^{(τ)} if rejected.
3 The proposal distribution depends on the current state.

173 Metropolis Algorithm

1 Choose a symmetric proposal distribution: q(z_A | z_B) = q(z_B | z_A).
2 Accept the new candidate z* with probability
    A(z*, z^{(τ)}) = \min\left( 1, \frac{p(z*)}{p(z^{(τ)})} \right)
3 How? Choose a random number u uniformly distributed in (0, 1) and accept the new candidate if A(z*, z^{(τ)}) > u.
4 Set
    z^{(τ+1)} = z* if accepted, z^{(τ)} if rejected.
Rejection of a candidate leads to the inclusion of the previous sample again (different from rejection sampling).
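A minimal Metropolis sketch with a symmetric Gaussian random-walk proposal; the target density, step size, and chain length below are illustrative choices only.

import numpy as np

def metropolis(p_tilde, z0, n_samples, step=1.0, rng=None):
    """Metropolis sampling with the symmetric proposal q(z* | z) = N(z* | z, step^2 I)."""
    rng = np.random.default_rng(rng)
    z = np.atleast_1d(np.asarray(z0, dtype=float))
    chain = [z.copy()]
    for _ in range(n_samples - 1):
        z_new = z + step * rng.normal(size=z.shape)         # symmetric proposal
        accept_prob = min(1.0, p_tilde(z_new) / p_tilde(z))
        if rng.uniform() < accept_prob:
            z = z_new                                       # accept
        chain.append(z.copy())                              # on rejection, keep old sample
    return np.array(chain)

# example: sample a correlated 2-D Gaussian given only up to normalisation
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
p_tilde = lambda z: np.exp(-0.5 * z @ Sigma_inv @ z)
chain = metropolis(p_tilde, z0=[0.0, 0.0], n_samples=20_000, step=0.5, rng=0)
print(chain.mean(axis=0))   # close to (0, 0)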

174 Metropolis Algorithm - Illustration

Sampling from a Gaussian distribution (the black contour shows one standard deviation).
The proposal distribution is an isotropic Gaussian with standard deviation 0.2.
150 candidates were generated; 43 were rejected.
(Figure: accepted steps shown in green, rejected steps in red.)

175 Markov Chain Monte Carlo - Metropolis-Hastings

Generalisation of the Metropolis algorithm to nonsymmetric proposal distributions q_k.
At step τ, draw a candidate z* from the distribution q_k(z | z^{(τ)}), where k labels the set of possible transitions.
Accept with probability
    A_k(z*, z^{(τ)}) = \min\left( 1, \frac{p(z*) \, q_k(z^{(τ)} | z*)}{p(z^{(τ)}) \, q_k(z* | z^{(τ)})} \right)
The choice of proposal distribution is critical. A common choice is a Gaussian centered on the current state:
small variance: high acceptance rate, but a slow walk through the state space, and the samples are not independent
large variance: high rejection rate

176 Part VIII: More

177 More

Graphical Models
Gaussian Processes
Sequential Data
Sequential Decision Theory
Learning Agents
Reinforcement Learning
Theoretical Model Selection
Additive Models, Trees, and Related Methods
Approximate (Variational) Inference
Boosting
Concept Learning
Computational Learning Theory
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
...

178 Part IX: Resources

Journals
Books
Datasets

179 Journals

Journal of Machine Learning Research
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Neural Networks
Neural Computation
Neural Networks
Annals of Statistics
Journal of the American Statistical Association
SIAM Journal on Applied Mathematics (SIAP)
...

180 Conferences

International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Artificial Neural Networks (ICANN)
...

181 Books

Pattern Recognition and Machine Learning, Christopher M. Bishop
The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman

182 Books

Pattern Classification, Richard O. Duda, Peter E. Hart, David G. Stork
Introduction to Machine Learning, Ethem Alpaydin


More information