
1 Statistical Machine Learning. Statistical Machine Learning Group and College of Engineering and Computer Science. Machine Learning Summer School, 27 September - 6 October. (Figures from C. M. Bishop, "Pattern Recognition and Machine Learning", and T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning".)

2 Overview: 1 What is Machine Learning? 2 Definition 3 Examples of Machine Learning 4 Related Fields 5 Fundamental Types of Learning 6 Basic Probability Theory 7 Polynomial Curve Fitting

3 Linear Regression: 8 Linear Basis Function Models 9 Maximum Likelihood and Least Squares 10 Regularized Least Squares 11 Bayesian Regression 12 Example for Bayesian Regression 13 Predictive Distribution 14 Limitations of Linear Basis Function Models

4 Linear Classification: 15 Classification 16 Generalised Linear Model 17 Inference and Decision 18 Decision Theory 19 Fisher's Linear Discriminant 20 The Perceptron Algorithm 21 Probabilistic Generative Models 22 Discrete Features 23 Logistic Regression 24 Feature Space

5 Neural Networks: 25 Neural Networks 26 Parameter Optimisation

6 Kernel Methods and SVM: 27 Kernel Methods 28 Maximum Margin Classifiers

7 Mixture Models and EM: 29 K-means Clustering 30 Mixture Models and EM 31 Mixture of Bernoulli Distributions 32 EM for Gaussian Mixtures - Latent Variables 33 Convergence of EM

8 Resources: 34 Sampling from the Uniform Distribution 35 Sampling from Standard Distributions 36 Rejection Sampling 37 Importance Sampling 38 Markov Chain Monte Carlo - The Idea

9 More: 39 More

10 Part I Overview: What is Machine Learning? Definition; Examples of Machine Learning; Related Fields; Fundamental Types of Learning; Basic Probability Theory; Polynomial Curve Fitting

11 What is Machine Learning? Definition: Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data. Learning from past experience (training data); generalisation; quantifying "learning": to improve their performance over time we need to quantify performance.

12 What is Machine Learning? Definition: Machine learning is concerned with the design and development of algorithms that allow computers (machines) to improve their performance over time based on data. Learning from past experience (training data); generalisation; quantifying "learning": to improve their performance over time we need to quantify performance. Definition (Mitchell, 1998): A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

13 Why Machine Learning? Machine learning is essential when: humans are unable to explain their expertise (e.g. speech recognition); humans are not around to help (e.g. navigation on Mars, underwater robotics); there is a large amount of data with possible hidden relationships and correlations (empirical sciences, e.g. discovering unusual astronomical objects); the environment changes (fast) over time (e.g. mobile phone networks); solutions need to be adapted to many particular cases (e.g. junk mail). Example: it is easier to write a program that learns to play checkers or backgammon well by self-play than to convert the expertise of a master player into a program.

14 Junk Mail Filtering: Given examples of data (mail) and targets {Junk, NoJunk}, learn to identify new incoming mail as Junk or NoJunk, and continue to learn as the user classifies new mail.

15 Handwritten Digit Recognition: Given handwritten ZIP codes on letters, money amounts on cheques, etc., learn to correctly recognise new handwritten digits. For nonsense input, answering "Don't know" is preferred to outputting some wrong digit.

16 Backgammon: The world's best computer program, TD-GAMMON (Tesauro 1992, 1995), played over a million games against itself. It now plays at the level of the human world champion.

17 Image Denoising: (Figure: original image, noise added, denoised.) McAuley et al., "Learning High-Order MRF Priors of Color Images", ICML 2006.

18 Separating Audio Sources: The cocktail party problem (human brains may do it differently ;-). (Figure: audio sources, microphones, audio mixtures.)

19 Other Applications of Machine Learning: autonomous robotics, detecting credit card fraud, detecting network intrusion, bioinformatics, neuroscience, medical diagnosis, stock market analysis, ...

20 Related Fields: Artificial Intelligence (AI), Statistics, Game Theory, Neuroscience, Psychology, Data Mining, Computer Science, Adaptive Control Theory, ...

21 Fundamental Types of Learning: Unsupervised Learning (association, clustering, density estimation, blind source separation); Supervised Learning (regression, classification); Reinforcement Learning; Others (agents, active learning, semi-supervised learning, transductive learning, ...).

22 Unsupervised Learning: Only input data are given, no targets (labels). Goal: determine how the data are organised.

23 Unsupervised Learning - Clustering: Group similar instances. Example applications: clustering customers in customer relationship management; image compression via colour quantisation.

24 Supervised Learning: Given pairs of data and targets (= labels), learn a mapping from the data to the targets (training). Goal: use the learned mapping to correctly predict the target for new input data. This requires generalising well from the training data/target pairs.

25 Reinforcement Learning: Find suitable actions in a given environment with the goal of maximising some reward. Example: game playing, where there is one reward (negative or positive) at the end of the game. Correct input/output pairs are never presented; the reward might only come after many actions; and the current action may influence not only the current reward but also future rewards.

26 Reinforcement Learning: (Figure: at each step i the agent receives an observation and a reward, and chooses an action: observation_i, reward_i -> action_i.) Exploration versus exploitation. Well suited for problems with a long-term versus short-term reward trade-off. Naturally focuses on online performance.

27 Basic Probability Theory: Probability is a way of expressing knowledge or belief that an event will occur or has occurred. Example (fair six-sided die): sample space Ω = {1, 2, 3, 4, 5, 6}; events Even = {2, 4, 6}, Odd = {1, 3, 5}; probability P(3) = 1/6, P(Odd) = P(Even) = 1/2; outcome 3 ∈ Ω; conditional probability P(3 | Odd) = P(3 and Odd)/P(Odd) = (1/6)/(1/2) = 1/3. General axioms: P(∅) = 0 ≤ P(A) ≤ P(Ω) = 1; P(A ∪ B) + P(A ∩ B) = P(A) + P(B); P(A ∩ B) = P(A | B) P(B). Rules of probability: sum rule P(X) = Σ_Y P(X, Y); product rule P(X, Y) = P(X | Y) P(Y).
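
The die example can be checked numerically. A minimal sketch (not part of the original slides) that enumerates the sample space and verifies the conditional probability, the sum rule and the product rule:

```python
from fractions import Fraction

# Fair six-sided die: every outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

odd = {1, 3, 5}
even = {2, 4, 6}

def prob(event):
    """P(event) = sum of the probabilities of its outcomes."""
    return sum(p[o] for o in event)

# Conditional probability P(3 | Odd) = P({3} and Odd) / P(Odd) = 1/3.
p_3_given_odd = prob({3} & odd) / prob(odd)
print(p_3_given_odd)                               # 1/3

# Sum rule over the partition {Odd, Even}: P(3) = P(3, Odd) + P(3, Even).
assert prob({3}) == prob({3} & odd) + prob({3} & even)

# Product rule: P(3 and Odd) = P(3 | Odd) P(Odd).
assert prob({3} & odd) == p_3_given_odd * prob(odd)

# Axiom check: P(A ∪ B) + P(A ∩ B) = P(A) + P(B) for A = Odd, B = {1, 2}.
A, B = odd, {1, 2}
assert prob(A | B) + prob(A & B) == prob(A) + prob(B)
```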

28 Probability Jargon: (Un)fair coin: Ω = {Tail = 0, Head = 1}, P(1) = θ ∈ [0, 1]. Likelihood: P(1101 | θ) = θ·θ·(1−θ)·θ. Maximum likelihood (ML) estimate: θ̂ = arg max_θ P(1101 | θ) = 3/4. Prior: if we are indifferent, then P(θ) = const. Evidence: P(1101) = Σ_θ P(1101 | θ) P(θ) (= 1/20 for a uniform prior, where the sum is actually an integral over θ). Posterior: P(θ | 1101) = P(1101 | θ) P(θ) / P(1101) ∝ θ³(1−θ) (Bayes' rule). Maximum a posteriori (MAP) estimate: θ̂ = arg max_θ P(θ | 1101) = 3/4. Predictive distribution: P(1 | 1101) = P(11011)/P(1101) = 2/3. Expectation: E[f | ...] = Σ_θ f(θ) P(θ | ...), e.g. E[θ | 1101] = 2/3. Variance: var(θ) = E[(θ − E[θ])²] = 2/63 (for the posterior). Probability density: p(θ) = (1/ε) P([θ, θ + ε]) for ε → 0.
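
A small numerical check of the coin example, assuming the observed sequence 1101 and a uniform prior over θ as above (a sketch; the grid spacing is an arbitrary choice, not from the slides):

```python
import numpy as np

dtheta = 1e-4
theta = np.arange(0.0, 1.0 + dtheta, dtheta)       # grid over [0, 1]
likelihood = theta**3 * (1 - theta)                # P(1101 | θ) = θ·θ·(1−θ)·θ
prior = np.ones_like(theta)                        # uniform prior, P(θ) = const

# Evidence P(1101) = ∫ P(1101 | θ) P(θ) dθ ≈ 1/20
evidence = np.sum(likelihood * prior) * dtheta
posterior = likelihood * prior / evidence          # P(θ | 1101), Bayes' rule

theta_ml = theta[np.argmax(likelihood)]                         # ≈ 3/4
theta_map = theta[np.argmax(posterior)]                         # ≈ 3/4 (flat prior)
post_mean = np.sum(theta * posterior) * dtheta                  # E[θ | 1101] ≈ 2/3
post_var = np.sum((theta - post_mean)**2 * posterior) * dtheta  # ≈ 2/63
predictive = post_mean                                          # P(1 | 1101) = E[θ | 1101]

print(theta_ml, theta_map, evidence, post_mean, post_var, predictive)
```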

29 Polynomial Curve Fitting: Some artificial data created from the function sin(2πx) plus random noise, for x = 0, ..., 1. (Figure: training targets t plotted against x.)

30 Polynomial Curve Fitting - Training Data: N = 10 points, x ≡ (x_1, ..., x_N)^T and t ≡ (t_1, ..., t_N)^T, with x_i ∈ R and t_i ∈ R for i = 1, ..., N.

31 Polynomial Curve Fitting - Model Specification: M is the order of the polynomial, y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{m=0}^{M} w_m x^m. This is a nonlinear function of x, but a linear function of the unknown model parameters w. How can we find good parameters w = (w_0, w_1, ..., w_M)^T?

32 Learning is Improving Performance: (Figure: target t_n and model prediction y(x_n, w) at input x_n.) Performance measure: the error between the targets and the model predictions on the training data, E(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)². There is a unique minimum of E(w) at some argument w*.
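
A minimal sketch of this fitting problem (the noise level 0.3 and the random seed are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                                     # training points, polynomial order
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)   # noisy targets

def E(w, x, t):
    """Sum-of-squares error E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    y = np.polynomial.polynomial.polyval(x, w)   # y(x, w) = sum_m w_m x^m
    return 0.5 * np.sum((y - t) ** 2)

# Minimise E(w): closed-form least-squares fit on the Vandermonde matrix.
X = np.vander(x, M + 1, increasing=True)         # columns 1, x, x^2, ..., x^M
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w_star, E(w_star, x, t))
```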

33 Model is a Constant Function (M = 0): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0. (Figure: fit for M = 0.)

34 Model is a Linear Function (M = 1): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x. (Figure: fit for M = 1.)

35 Model is a Cubic Polynomial (M = 3): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x + w_2 x² + w_3 x³. (Figure: fit for M = 3.)

36 Model is a 9th-Order Polynomial (M = 9): y(x, w) = Σ_{m=0}^{M} w_m x^m = w_0 + w_1 x + ... + w_8 x⁸ + w_9 x⁹. Overfitting. (Figure: fit for M = 9.)

37 Testing the Fitted Model: Train the model to get w*, then take 100 new data points. Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*)/N). (Figure: E_RMS on the training and test sets as a function of M.)

38 Parameters of the Fitted Model: Table of coefficients w* for polynomials of order M = 0, 1, 3, 9 (the numerical values are not reproduced here; the coefficients grow dramatically in magnitude as M increases).

39 Get More Data: N = 15. (Figure: fit with N = 15 data points.)

40 Get Even More Data: N = 100. Heuristic: have no fewer than 5 to 10 times as many data points as parameters. But the number of parameters is not necessarily the most appropriate measure of model complexity! Later: the Bayesian approach. (Figure: fit with N = 100 data points.)

41 Regularisation: How can we constrain the growth of the coefficients w? Add a regularisation term to the error function: Ẽ(w) = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)² + (λ/2) ‖w‖², where the squared norm of the parameter vector is ‖w‖² ≡ w^T w = w_0² + w_1² + ... + w_M².
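
A sketch of the regularised fit for M = 9, using the closed-form solution that is derived later for regularised least squares (the data generation and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

Phi = np.vander(x, M + 1, increasing=True)       # design matrix, columns x^0 .. x^M
lam = np.exp(-18)                                # regularisation coefficient, ln λ = −18

# Minimiser of Ẽ(w) = 1/2 Σ (y(x_n, w) − t_n)^2 + λ/2 ||w||^2:
#   w = (λI + Φ^T Φ)^{-1} Φ^T t
w_reg = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)
print(w_reg)
```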

42 Regularisation: (Figure: M = 9 fit with ln λ = −18.)

43 Regularisation: (Figure: M = 9 fit with ln λ = 0.)

44 Regularisation: (Figure: training and test E_RMS for M = 9 as a function of ln λ.)

45 Part II Linear Regression: Linear Basis Function Models; Maximum Likelihood and Least Squares; Regularized Least Squares; Bayesian Regression; Example for Bayesian Regression; Predictive Distribution; Limitations of Linear Basis Function Models

46 Linear Regression Model: Input "feature" vector x = (x^(0) = 1, x^(1), ..., x^(D))^T ∈ R^{D+1}; linear regression model y(x, w) = Σ_{j=0}^{D} w_j x^(j) = w^T x; model parameters w = (w_0, ..., w_D)^T, where w_0 is the bias. (Figure: hyperplanes for w = {(2, 1, 1), (5, 2, 1), (1, 2, 2)}.)

47 Linear Regression - Finding the Best Model: Use the training data (x_1, t_1), ..., (x_N, t_N) and a loss function (performance measure) to find the best w. Example: residual sum of squares, Loss(w) = Σ_{n=1}^{N} (t_n − y(x_n, w))², with ŵ = arg min_w Loss(w). (Figure: least-squares regression plane, from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 3.)

48 Linear Basis Function Models: Linear combination of fixed nonlinear basis functions φ_j(x) ∈ R, y(x, w) = Σ_{j=0}^{M−1} w_j φ_j(x) = w^T φ(x), with parameters w = (w_0, ..., w_{M−1})^T, where w_0 is the bias parameter, basis functions φ = (φ_0, ..., φ_{M−1})^T, and the convention φ_0(x) = 1.
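
A sketch of how a design matrix of fixed basis functions might be assembled (the Gaussian centres and width below are illustrative choices, not values from the slides):

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) for each centre mu_j."""
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

def design_matrix(x, mu, s):
    """Phi with phi_0(x) = 1 (bias column) followed by the Gaussian basis functions."""
    return np.hstack([np.ones((len(x), 1)), gaussian_basis(x, mu, s)])

x = np.linspace(0, 1, 10)
mu = np.linspace(0, 1, 8)          # basis function centres (illustrative)
Phi = design_matrix(x, mu, s=0.1)
print(Phi.shape)                    # (N, M) with M = 9 here
```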

49 Polynomial Basis Functions: For a scalar input variable x, φ_j(x) = x^j. Limitation: polynomials are global functions of the input variable x. Extension: split the input space into regions and fit a different polynomial to each region (spline functions). (Figure: polynomial basis functions.)

50 Gaussian Basis Functions: For a scalar input variable x, φ_j(x) = exp{−(x − µ_j)²/(2s²)}. This is not a probability distribution; no normalisation is required, since scaling is taken care of by the model parameters w. (Figure: Gaussian basis functions.)

51 Sigmoidal Basis Functions: For a scalar input variable x, φ_j(x) = σ((x − µ_j)/s), where σ(a) is the logistic sigmoid function defined by σ(a) = 1/(1 + exp(−a)). σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(2a) − 1. (Figure: sigmoidal basis functions.)

52 Other Basis Functions - Wavelets: Wavelets are localised in both space and frequency, and mutually orthogonal to simplify their application. (Figure: Haar and Symmlet-8 wavelets, from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 5.)

53 Other Basis Functions - 2D Splines: Splines are polynomials restricted to regions of the input space. (Figure from The Elements of Statistical Learning, 2nd ed., Hastie, Tibshirani & Friedman 2009, Chap. 5.)

54 Maximum Likelihood and Least Squares: No special assumption is made about the basis functions φ_j(x); in the simplest case, one can think of φ_j(x) = x^j. Assume the target t is given by t = y(x, w) + ε, a deterministic term plus noise, where ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus p(t | x, w, β) = N(t | y(x, w), β^{−1}).

55 Maximum Likelihood and Least Squares: Likelihood of one target t given the data x: p(t | x, w, β) = N(t | y(x, w), β^{−1}). Given a set of inputs X = {x_1, ..., x_N} with corresponding target values t = (t_1, ..., t_N)^T, assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the targets t is then p(t | X, w, β) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{−1}) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}).

56 Maximum Likelihood and Least Squares: Consider the logarithm of the likelihood p(t | X, w, β) (the logarithm is a monotone function!): ln p(t | X, w, β) = Σ_{n=1}^{N} ln N(t_n | w^T φ(x_n), β^{−1}) = Σ_{n=1}^{N} ln( sqrt(β/(2π)) exp{−(β/2)(t_n − w^T φ(x_n))²} ) = (N/2) ln β − (N/2) ln(2π) − β E_D(w), where the sum-of-squares error function is E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}². Therefore arg max_w ln p(t | X, w, β) = arg min_w E_D(w).

57 Maximum Likelihood and Least Squares: Rewrite the error function as E_D(w) = (1/2) Σ_{n=1}^{N} {t_n − w^T φ(x_n)}² = (1/2)(t − Φw)^T (t − Φw), where t = (t_1, ..., t_N)^T and Φ is the N × M design matrix whose row n is (φ_0(x_n), φ_1(x_n), ..., φ_{M−1}(x_n)). The maximum likelihood estimate is w_ML = arg max_w ln p(t | w, β) = arg min_w E_D(w) = (Φ^T Φ)^{−1} Φ^T t = Φ† t, where Φ† is the Moore-Penrose pseudo-inverse of Φ.
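
A sketch of the maximum likelihood solution via the pseudo-inverse (the data and polynomial basis are illustrative; the plug-in estimate of β is the standard maximum likelihood result, which is not spelled out on this slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
Phi = np.vander(x, M, increasing=True)           # polynomial basis phi_j(x) = x^j

# Maximum likelihood weights: w_ML = (Phi^T Phi)^{-1} Phi^T t = pinv(Phi) @ t
w_ml = np.linalg.pinv(Phi) @ t

# Plug-in noise precision: 1/beta_ML = mean squared residual.
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
print(w_ml, beta_ml)
```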

58 Regularized Least Squares: Add regularisation in order to prevent overfitting: minimise E_D(w) + λ E_W(w) with regularisation coefficient λ. For the simple quadratic regulariser E_W(w) = (1/2) w^T w, the solution is w = (λI + Φ^T Φ)^{−1} Φ^T t.

59 Regularized Least Squares: A more general regulariser is E_W(w) = (1/2) Σ_{j=1}^{M} |w_j|^q. The choice q = 1 (lasso) leads to a sparse model if λ is large enough. (Figure: contours of the regulariser for q = 0.5, 1, 2, 4.)

60 Comparison of Quadratic and Lasso Regularisers: Assume a sufficiently large regularisation coefficient λ. Quadratic regulariser: (1/2) Σ_{j=1}^{M} w_j². Lasso regulariser: (1/2) Σ_{j=1}^{M} |w_j|. (Figure: error contours and regulariser constraint regions in the (w_1, w_2) plane.)

61 Bayesian Regression: Bayes' theorem: posterior = likelihood × prior / normalisation, p(w | t) = p(t | w) p(w) / p(t). The likelihood for i.i.d. data is p(t | w) = Π_{n=1}^{N} N(t_n | y(x_n, w), β^{−1}) = Π_{n=1}^{N} N(t_n | w^T φ(x_n), β^{−1}) = const × exp{−(β/2)(t − Φw)^T (t − Φw)}, where we have left out the conditioning on x (always assumed) and on β, which is assumed to be constant.

62 How to Choose a Prior? Can we find a prior for the given likelihood which (a) makes sense for the problem at hand and (b) allows us to find a posterior in a nice form? An answer to the second question: Definition (Conjugate Prior): A class of prior probability distributions p(w) is conjugate to a class of likelihood functions p(x | w) if the resulting posterior distributions p(w | x) are in the same family as p(w).

63 Examples of Conjugate Prior Distributions: Discrete likelihoods - Bernoulli: Beta; Binomial: Beta; Poisson: Gamma; Multinomial: Dirichlet. Continuous likelihoods - Uniform: Pareto; Exponential: Gamma; Normal: Normal; Multivariate normal: Multivariate normal.

64 Bayesian Regression: With no data points (N = 0), start with the prior. Each posterior acts as the prior for the next data/target pair, which nicely fits a sequential learning framework: p(w) combined with the likelihood p(t_1 | w, x_1) gives, via Bayes, p(w | t_1, x_1); combined with p(t_2 | w, x_2) this gives p(w | t_1, x_1, t_2, x_2); and so on.

65 Sequential Update of the Posterior: Example of a linear (basis function) model with a single input x and single output t; linear model y(x, w) = w_0 + w_1 x. Data creation: (1) choose x_n from the uniform distribution U(x | −1, 1); (2) calculate f(x_n, a) = a_0 + a_1 x_n, where a_0 = −0.3 and a_1 = 0.5; (3) add Gaussian noise with standard deviation σ = 0.2, i.e. t_n ~ N(t | f(x_n, a), 0.04). Set the precision of the Gaussian prior to α = 2.0.

66 Sequential Update of the Posterior: (Figure: sequential updates of the posterior as data points arrive.)

67 Sequential Update of the Posterior: (Figure continued.)

68 Predictive Distribution: Definition (the predictive distribution): the predictive distribution is the probability of a test target t given test data x, the training data set X and the training targets t: p(t | x, X, t). How to calculate it: p(t | x, X, t) = ∫ p(t, w | x, X, t) dw (sum rule) = ∫ p(t | w, x, X, t) p(w | x, X, t) dw = ∫ p(t | w, x) p(w | X, t) dw, where the first factor depends on testing only and the second on training only.

69 Predictive Distribution - Isotropic Gaussian Prior: With the (simplified) isotropic Gaussian prior p(w | α) = N(w | 0, α^{−1} I), the predictive distribution p(t | x, X, t) is Gaussian, with variance after N data points have been seen σ²_N(x) = 1/β + φ(x)^T (αI + βΦ^T Φ)^{−1} φ(x), where the first term is the noise of the data and the second term the uncertainty in w. Moreover σ²_{N+1}(x) ≤ σ²_N(x) and lim_{N→∞} σ²_N(x) = 1/β.
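
A sketch of the predictive variance formula above. The posterior mean m_N = β S_N Φ^T t is the standard companion result for Bayesian linear regression with this prior (it is not written on the slide); α, β, the basis and the data below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 6
alpha, beta = 2.0, 25.0                          # prior precision, noise precision
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.vander(x, M, increasing=True)           # design matrix

# Posterior over w for the Gaussian prior N(w | 0, alpha^-1 I):
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predictive(x_star):
    """Predictive mean and variance at a new input x_star."""
    phi = np.vander(np.atleast_1d(x_star), M, increasing=True)[0]
    mean = m_N @ phi
    # sigma_N^2(x) = 1/beta + phi^T (alpha I + beta Phi^T Phi)^{-1} phi
    var = 1.0 / beta + phi @ S_N @ phi
    return mean, var

print(predictive(0.5))
```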

70 Predictive Distribution - Isotropic Gaussian Prior: (Figures.) Example with artificial sinusoidal data from sin(2πx) (green) plus added noise, showing the mean of the predictive distribution (red) and the region of one standard deviation around the mean (red shaded) as more data points are observed.

71 Samples from the Posterior Distribution: (Figures.) Example with artificial sinusoidal data from sin(2πx) (green) plus added noise, showing functions y(x, w) (red) for samples w drawn from the posterior distribution p(w | X, t).

72 Limitations of Linear Basis Function Models: The basis functions φ_j(x) are fixed before the training data set is observed. Curse of dimensionality: the number of basis functions grows rapidly, often exponentially, with the dimensionality D. But typical data sets have two nice properties which can be exploited if the basis functions are not fixed: (1) the data lie close to a nonlinear manifold whose intrinsic dimension is much smaller than D, so we need algorithms which place basis functions only where the data are (e.g. radial basis function networks, support vector machines, relevance vector machines, neural networks); (2) the target variables may depend only on a few significant directions within the data manifold, so we need algorithms which can exploit this property (neural networks).

73 Curse of Dimensionality: Linear algebra allows us to operate in n-dimensional vector spaces using intuition from our 3-dimensional world as a vector space; there are no surprises as long as n is finite. But if we add more structure to a vector space (e.g. an inner product or a metric), the intuition gained from the 3-dimensional world around us may be wrong. Example: a sphere of radius r = 1. What is the fraction of the volume of the sphere in D-dimensional space which lies between radius r = 1 − ε and r = 1? The volume scales like r^D, so the volume of a sphere of radius r is V_D(r) = K_D r^D, and (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D.

74 Curse of Dimensionality: Fraction of the volume of the sphere in D-dimensional space which lies between radius r = 1 − ε and r = 1: (V_D(1) − V_D(1 − ε)) / V_D(1) = 1 − (1 − ε)^D. (Figure: volume fraction versus ε for several values of D.)
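
The volume-fraction formula is easy to tabulate; a quick sketch (the particular D and ε values are illustrative):

```python
# Fraction of a unit sphere's volume lying within a shell of thickness eps:
#   (V_D(1) - V_D(1 - eps)) / V_D(1) = 1 - (1 - eps)^D
for D in (1, 2, 5, 20, 200):
    for eps in (0.01, 0.1, 0.5):
        frac = 1 - (1 - eps) ** D
        print(f"D={D:4d}  eps={eps:.2f}  fraction={frac:.4f}")
```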

75 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D. (Figure: p(r) versus r for several values of D.)

76 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for various values of the dimensionality D. Example for D = 2: assume µ = 0 and Σ = I, so N(x | 0, I) = (1/2π) exp{−(1/2) x^T x}. Use the coordinate transformation x_1 = r cos(φ), x_2 = r sin(φ). The density in the new coordinates is p(r, φ | 0, I) = N(x(r, φ) | 0, I) |J| = (1/2π) exp{−(1/2)(x_1² + x_2²)} · r, where |J| = r is the determinant of the Jacobian of the coordinate transformation. Hence p(r, φ | 0, I) = (1/2π) r exp{−r²/2}.

77 Curse of Dimensionality: Probability density with respect to the radius r of a Gaussian distribution for D = 2 (with µ = 0, Σ = I): p(r, φ | 0, I) = (1/2π) r exp{−r²/2}. Integrating over all angles φ gives p(r | 0, I) = ∫_0^{2π} (1/2π) r exp{−r²/2} dφ = r exp{−r²/2}. (Figure: p(r) versus r for several values of D.)

78 Part III Linear Classification: Classification; Generalised Linear Model; Inference and Decision; Decision Theory; Fisher's Linear Discriminant; The Perceptron Algorithm; Probabilistic Generative Models; Discrete Features; Logistic Regression; Feature Space

79 Classification: Goal: given input data x, assign it to one of K discrete classes C_k, where k = 1, ..., K. This divides the input space into different regions.

80 How to Represent Binary Class Labels? Class labels are no longer real values as in regression, but a discrete set. Two classes: t ∈ {0, 1} (t = 1 represents class C_1 and t = 0 represents class C_2). We can interpret the value of t as the probability of class C_1, with only two possible values, 0 or 1. Note: other conventions for mapping classes to integers are possible; check the setup.

81 How to Represent Multi-Class Labels? If there are more than two classes (K > 2), we call it a multi-class setup. Often used: the 1-of-K coding scheme, in which t is a vector of length K with all entries 0 except for t_j = 1, where j is the index of the class C_j being encoded. Example: given 5 classes {C_1, ..., C_5}, membership in class C_2 is encoded as the target vector t = (0, 1, 0, 0, 0)^T. Note: other conventions for mapping multi-class labels to integers are possible; check the setup.

82 Linear Model: Idea: use again a linear model as in regression, where y(x, w) is a linear function of the parameters w, y(x_n, w) = w^T φ(x_n). But generally y(x_n, w) ∈ R. Example: which class does y(x, w) = 0.71623 belong to?

83 Generalised Linear Model: Apply a mapping f : R → Z to the linear model to obtain discrete class labels. Generalised linear model: y(x_n, w) = f(w^T φ(x_n)), where f(·) is called the activation function and f^{−1}(·) the link function. (Figure: example of an activation function, f(z) = sign(z).)

84 Three Models for Decision Problems: In increasing order of complexity. Discriminant functions: find a discriminant function f(x) which maps each input directly onto a class label. Discriminative models: (1) solve the inference problem of determining the posterior class probabilities p(C_k | x); (2) use decision theory to assign each new x to one of the classes. Generative models: (1) solve the inference problem of determining the class-conditional probabilities p(x | C_k); (2) also infer the prior class probabilities p(C_k); (3) use Bayes' theorem to find the posterior p(C_k | x); (4) alternatively, model the joint distribution p(x, C_k) directly; (5) use decision theory to assign each new x to one of the classes.

85 Decision Theory - Key Ideas: Probability of a mistake: p(mistake) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx. Goal: minimise p(mistake). (Figure: joint densities p(x, C_1) and p(x, C_2) and the decision regions R_1, R_2.)

86 Minimising the Expected Loss: Not all mistakes are equally costly. Weight each misclassification of x into the wrong class C_j, instead of the correct class C_k, by a factor L_{kj}. The expected loss is then E[L] = Σ_k Σ_j ∫_{R_j} L_{kj} p(x, C_k) dx. Goal: minimise the expected loss E[L].

87 The Reject Region: Avoid making automated decisions on difficult cases. Difficult cases are those where the posterior probabilities p(C_k | x) are all small, or the joint distributions p(x, C_k) have comparable values. (Figure: posteriors p(C_1 | x) and p(C_2 | x) with a threshold θ defining the reject region.)

88 Least Squares for Classification: Regression with a linear function of the model parameters and minimisation of a sum-of-squares error function resulted in a closed-form solution for the parameter values. Is this also possible for classification? Given input data x belonging to one of K classes C_k, use the 1-of-K binary coding scheme. Each class is described by its own linear model: y_k(x) = w_k^T x + w_{k0}, for k = 1, ..., K.

89 Least Squares for Classification: With the conventions w̃_k = (w_{k0}, w_k^T)^T ∈ R^{D+1}, x̃ = (1, x^T)^T ∈ R^{D+1}, and W̃ = [w̃_1, ..., w̃_K] ∈ R^{(D+1)×K}, we get the (vector-valued) discriminant function y(x) = W̃^T x̃ ∈ R^K. For a new input x, the class is then given by the index of the largest value in the vector y(x).

90 Determine W̃: Given a training set {x_n, t_n}, n = 1, ..., N, where t_n is the class in the 1-of-K coding scheme, define a matrix T whose row n is t_n^T. The sum-of-squares error can now be written as E_D(W̃) = (1/2) tr{(X̃W̃ − T)^T (X̃W̃ − T)}. The minimum of E_D(W̃) is reached for W̃ = (X̃^T X̃)^{−1} X̃^T T = X̃† T, where X̃† is the pseudo-inverse of X̃.
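
A sketch of least-squares classification using the pseudo-inverse solution W̃ = X̃† T (the toy two-class Gaussian data are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 2, 100
# Toy data: two Gaussian blobs in 2D.
X = np.vstack([rng.normal([0, 0], 0.5, (N, 2)), rng.normal([2, 2], 0.5, (N, 2))])
labels = np.repeat(np.arange(K), N)

X_tilde = np.hstack([np.ones((2 * N, 1)), X])    # prepend the bias input 1
T = np.eye(K)[labels]                            # 1-of-K target matrix, row n = t_n^T

# Least-squares solution: W_tilde = pinv(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T

# Classify a new input by the index of the largest entry of y(x) = W_tilde^T x_tilde.
def classify(x_new):
    x_t = np.concatenate(([1.0], x_new))
    return int(np.argmax(W_tilde.T @ x_t))

print(classify(np.array([0.1, -0.2])), classify(np.array([2.1, 1.9])))
```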

91 Discriminant Function for Multi-Class: The discriminant function is therefore y(x) = W̃^T x̃ = T^T (X̃†)^T x̃, where X̃ is given by the training data and x̃ is the new input. Interesting property: if for every t_n the same linear constraint a^T t_n + b = 0 holds, then the prediction y(x) also obeys the same constraint, a^T y(x) + b = 0. For the 1-of-K coding scheme the sum of all components of t_n is one, and therefore all components of y(x) sum to one. BUT: the components are not probabilities, as they are not constrained to the interval (0, 1).

92 Deficiencies of the Least Squares Approach: (Figure: magenta curve - decision boundary for the least squares approach; green curve - decision boundary for the logistic regression model described later.)

93 Deficiencies of the Least Squares Approach: (Figure continued: magenta - least squares decision boundary; green - logistic regression decision boundary.)

94 Fisher's Linear Discriminant: View linear classification as dimensionality reduction: y(x) = w^T x. If y ≥ −w_0 then choose class C_1, otherwise C_2. But there are many possible projections from a D-dimensional input space onto one dimension, and projection always means loss of information. For classification we want to preserve the class separation in one dimension. Can we find a projection which maximally preserves the class separation?

95 Fisher's Linear Discriminant: (Figure: samples from two classes in a two-dimensional input space and their histograms when projected onto two different one-dimensional spaces.)

96 Fisher's Linear Discriminant - First Try: Given N_1 input data points of class C_1 and N_2 of class C_2, calculate the centres of the two classes, m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n. Choose w so as to maximise the separation of the projected class means, w^T (m_1 − m_2). Problem: this fails when the classes have non-uniform covariance.

97 Fisher's Linear Discriminant: Also measure the within-class variance for each class, s_k² = Σ_{n∈C_k} (y_n − m_k)², where y_n = w^T x_n. Maximise the Fisher criterion J(w) = (m_2 − m_1)² / (s_1² + s_2²).

98 Fisher's Linear Discriminant: The Fisher criterion can be rewritten as J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class covariance, S_B = (m_2 − m_1)(m_2 − m_1)^T, and S_W is the within-class covariance, S_W = Σ_{n∈C_1} (x_n − m_1)(x_n − m_1)^T + Σ_{n∈C_2} (x_n − m_2)(x_n − m_2)^T.

99 Fisher's Linear Discriminant: The Fisher criterion J(w) = (w^T S_B w) / (w^T S_W w) has a maximum for Fisher's linear discriminant, w ∝ S_W^{−1}(m_2 − m_1). Fisher's linear discriminant is NOT itself a discriminant, but it can be used to construct one by choosing a threshold y_0 in the projection space.
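
A sketch of Fisher's linear discriminant, w ∝ S_W^{−1}(m_2 − m_1), on illustrative two-class data (the class means and covariance below are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative two-class data with correlated (non-uniform) covariance.
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([2, 1], cov, 100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class covariance S_W, summed over both classes.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m2 - m1)      # Fisher direction (defined up to scale)
w /= np.linalg.norm(w)

# Project the data; choosing a threshold y_0 on w^T x then yields an actual classifier.
y1, y2 = X1 @ w, X2 @ w
print(w, y1.mean(), y2.mean())
```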

100 The Perceptron Algorithm: The perceptron ("Mark 1", Cornell University, 1960) was the first computer which could learn new skills by trial and error.

101 The Perceptron Algorithm: Frank Rosenblatt (1928-1971), "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms" (Spartan Books, 1962).

102 The Perceptron Algorithm: Two-class model. Create a feature vector φ(x) by a fixed nonlinear transformation of the input x. Generalised linear model y(x) = f(w^T φ(x)), with φ(x) containing a bias element φ_0(x) = 1. Nonlinear activation function: f(a) = +1 if a ≥ 0, and −1 if a < 0. Target coding for the perceptron: t = +1 for class C_1 and t = −1 for class C_2.

103 The Perceptron Algorithm - Error Function: Idea: minimise the total number of misclassified patterns. Problem: as a function of w, this count is piecewise constant, so the gradient is zero almost everywhere. Better idea: using the (−1, +1) target coding scheme, we want all patterns to satisfy w^T φ(x_n) t_n > 0. Perceptron criterion: add the errors for all patterns belonging to the set of misclassified patterns M, E_P(w) = −Σ_{n∈M} w^T φ(x_n) t_n.

104 Perceptron - Stochastic Gradient Descent: Perceptron criterion (with the notation φ_n = φ(x_n)): E_P(w) = −Σ_{n∈M} w^T φ_n t_n. One iteration at step τ: (1) choose a training pair (x_n, t_n); (2) update the weight vector w by w^(τ+1) = w^(τ) − η ∇E_P(w) = w^(τ) + η φ_n t_n. As y(x, w) does not depend on the norm of w, one can set η = 1: w^(τ+1) = w^(τ) + φ_n t_n.
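
A sketch of the perceptron update loop with η = 1 (linearly separable toy data are assumed; φ(x) is simply the input with a bias element here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Linearly separable toy data, targets coded as +1 / -1.
X = np.vstack([rng.normal([0, 0], 0.4, (50, 2)), rng.normal([3, 3], 0.4, (50, 2))])
t = np.hstack([np.full(50, -1.0), np.full(50, +1.0)])
Phi = np.hstack([np.ones((100, 1)), X])          # phi_0(x) = 1 bias element

w = np.zeros(Phi.shape[1])
for epoch in range(100):
    misclassified = 0
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:               # pattern violates w^T phi_n t_n > 0
            w = w + phi_n * t_n                  # w^(tau+1) = w^(tau) + phi_n t_n
            misclassified += 1
    if misclassified == 0:                       # converged (training set separable)
        break

print(w, epoch, misclassified)
```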

105 The Perceptron Algorithm - Update 1: (Figure: update of the perceptron weights from a misclassified pattern (green), w^(τ+1) = w^(τ) + φ_n t_n.)

106 The Perceptron Algorithm - Update 2: (Figure: update of the perceptron weights from another misclassified pattern (green), w^(τ+1) = w^(τ) + φ_n t_n.)

107 The Perceptron Algorithm - Convergence: Does the algorithm converge? For a single update step, −w^(τ+1)T φ_n t_n = −w^(τ)T φ_n t_n − (φ_n t_n)^T φ_n t_n < −w^(τ)T φ_n t_n, because (φ_n t_n)^T φ_n t_n = ‖φ_n t_n‖² > 0, so the error contribution of this pattern decreases. BUT: the contributions to the error from the other misclassified patterns might have increased. AND: some correctly classified patterns might now be misclassified. Perceptron convergence theorem: if the training set is linearly separable, the perceptron algorithm is guaranteed to find a solution in a finite number of steps.

108 Three Models for Decision Problems (recap of slide 84): in increasing order of complexity, discriminant functions, discriminative models, and generative models.

109 Probabilistic Generative Models: Generative approach: model the class-conditional densities p(x | C_k) and the priors p(C_k) to calculate the posterior probability for class C_1: p(C_1 | x) = p(x | C_1) p(C_1) / (p(x | C_1) p(C_1) + p(x | C_2) p(C_2)) = 1/(1 + exp(−a(x))) = σ(a(x)), where a and the logistic sigmoid function σ(a) are given by a(x) = ln [p(x | C_1) p(C_1)] / [p(x | C_2) p(C_2)] = ln p(x, C_1)/p(x, C_2), and σ(a) = 1/(1 + exp(−a)).

110 Logistic Sigmoid: The logistic sigmoid function σ(a) = 1/(1 + exp(−a)) is a "squashing function" because it maps the real axis into the finite interval (0, 1). Symmetry: σ(−a) = 1 − σ(a). Derivative: (d/da) σ(a) = σ(a) σ(−a) = σ(a)(1 − σ(a)). The inverse is called the logit function, a(σ) = ln(σ/(1 − σ)). (Figure: the logistic sigmoid σ(a) and the logit a(σ).)

111 Probabilistic Generative Models - Multiclass: The normalised exponential is given by p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k) / Σ_j exp(a_j), where a_k = ln(p(x | C_k) p(C_k)). It is also called the softmax function, as it is a smoothed version of the max function. Example: if a_k ≫ a_j for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
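
A sketch of the softmax / normalised exponential (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something from the slide):

```python
import numpy as np

def softmax(a):
    """p(C_k | x) = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = np.asarray(a, dtype=float)
    a = a - a.max()                  # shift for stability; the result is unchanged
    e = np.exp(a)
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))      # smooth, sums to 1
print(softmax([100.0, 1.0, 0.1]))    # a_k >> a_j: p(C_k | x) ≈ 1, others ≈ 0
```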

112 General Case - K Classes, Different Covariance: If each class-conditional probability is Gaussian and has a different covariance, the quadratic terms −(1/2) x^T Σ^{−1} x no longer cancel each other out, and we get a quadratic discriminant.

113 Discrete Features - Naive Bayes: Assume the input space consists of discrete features, in the simplest case x_i ∈ {0, 1}. For a D-dimensional input space, a general distribution would be represented by a table with 2^D entries; together with the normalisation constraint, these are 2^D − 1 independent variables, which grows exponentially with the number of features. The naive Bayes assumption is that all features, conditioned on the class C_k, are independent of each other: p(x | C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}.

114 Discrete Features - Naive Bayes: With the naive Bayes model p(x | C_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}, we can again find the factors a_k in the normalised exponential p(C_k | x) = p(x | C_k) p(C_k) / Σ_j p(x | C_j) p(C_j) = exp(a_k)/Σ_j exp(a_j) as a linear function of the x_i: a_k(x) = Σ_{i=1}^{D} {x_i ln µ_{ki} + (1 − x_i) ln(1 − µ_{ki})} + ln p(C_k).

115 Three Models for Decision Problems (recap of slide 84): in increasing order of complexity, discriminant functions, discriminative models, and generative models.

116 Logistic Regression is Classification: Two classes, where the posterior of class C_1 is a logistic sigmoid σ(·) acting on a linear function of the feature vector φ: p(C_1 | φ) = y(φ) = σ(w^T φ), and p(C_2 | φ) = 1 − p(C_1 | φ). The number of model parameters equals the dimension M of the feature space. Compare this to fitting two Gaussians with a shared covariance: 2M parameters for the means plus M(M + 1)/2 for the shared covariance, i.e. M(M + 5)/2 in total. For larger M, the logistic regression model has a clear advantage.

117 Logistic Regression is Classification: Determine the parameters via maximum likelihood for data (φ_n, t_n), n = 1, ..., N, where φ_n = φ(x_n) and the class membership is coded as t_n ∈ {0, 1}. Likelihood function: p(t | w) = Π_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}, where y_n = p(C_1 | φ_n). Error function: the negative log likelihood, resulting in the cross-entropy error function E(w) = −ln p(t | w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}.

118 Logistic Regression is Classification: Error function (cross-entropy error): E(w) = −Σ_{n=1}^{N} {t_n ln y_n + (1 − t_n) ln(1 − y_n)}, with y_n = p(C_1 | φ_n) = σ(w^T φ_n). Gradient of the error function (using dσ/da = σ(1 − σ)): ∇E(w) = Σ_{n=1}^{N} (y_n − t_n) φ_n. The gradient does not contain any sigmoid function; for each data point, the contribution is the product of the deviation y_n − t_n and the basis function vector φ_n. BUT: the maximum likelihood solution can exhibit over-fitting even for many data points; one should then use a regularised error or MAP.
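
A sketch of maximum likelihood logistic regression by batch gradient descent using the gradient above (the learning rate, iteration count and toy data are illustrative assumptions; the slides do not prescribe a particular optimiser):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-class data, targets t_n in {0, 1}.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
t = np.hstack([np.zeros(100), np.ones(100)])
Phi = np.hstack([np.ones((200, 1)), X])          # features with a bias element

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(Phi.shape[1])
eta = 0.01                                       # learning rate (illustrative)
for _ in range(5000):
    y = sigmoid(Phi @ w)                         # y_n = sigma(w^T phi_n)
    grad = Phi.T @ (y - t)                       # grad E(w) = sum_n (y_n - t_n) phi_n
    w -= eta * grad

y = np.clip(sigmoid(Phi @ w), 1e-12, 1 - 1e-12)  # clip to avoid log(0)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))   # cross-entropy error
print(w, E)
```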

119 Original Input versus Feature Space

We have used the direct input x until now.
All classification algorithms also work if we first apply a fixed nonlinear transformation of the inputs using a vector of basis functions φ(x).
Example: use two Gaussian basis functions centered at the green crosses in the input space.
(Figure: data in the original input space (x_1, x_2) and the same data mapped into the feature space (φ_1, φ_2).)

120 Original Input versus Feature Space

Linear decision boundaries in the feature space correspond to nonlinear decision boundaries in the input space.
Classes which are NOT linearly separable in the input space can become linearly separable in the feature space.
BUT: if classes overlap in the input space, they will also overlap in the feature space. Nonlinear features φ(x) cannot remove the overlap, but they may increase it!
(Figure: decision boundary in the input space (x_1, x_2) and the corresponding linear boundary in the feature space (φ_1, φ_2).)

121 Part IV: Neural Networks

Neural Networks
Parameter Optimisation

122 Functional Transformations

As before, the biases can be absorbed into the weights by introducing an extra input x_0 = 1 and a hidden unit z_0 = 1:
    y_k(x, w) = g\left( \sum_{j=0}^{M} w_{kj}^{(2)} h\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)
Compare to the generalised linear model
    y_k(x, w) = g\left( \sum_{j=0}^{M} w_{kj}^{(2)} φ_j(x) \right)
(Figure: two-layer network diagram with inputs x_0, ..., x_D, hidden units z_0, ..., z_M, outputs y_1, ..., y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}.)
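A minimal forward-pass sketch of the two-layer mapping above (assuming NumPy; the weight shapes and toy numbers are an illustrative convention, not the lecture's).

import numpy as np

def two_layer_forward(x, W1, W2, h=np.tanh, g=lambda a: a):
    """Forward pass y_k(x, w) = g( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) ).

    x  : (D,) input vector (without the bias entry)
    W1 : (M, D + 1) first-layer weights, column 0 multiplies the bias x_0 = 1
    W2 : (K, M + 1) second-layer weights, column 0 multiplies the bias z_0 = 1
    h  : hidden activation, g : output activation (identity for regression)
    """
    x_ext = np.concatenate(([1.0], x))        # absorb bias: x_0 = 1
    a = W1 @ x_ext                             # first-layer activations
    z_ext = np.concatenate(([1.0], h(a)))     # hidden units, with z_0 = 1
    return g(W2 @ z_ext)

# tiny example: D = 2 inputs, M = 3 hidden units, K = 1 output (random weights)
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
print(two_layer_forward(np.array([0.5, -1.0]), W1, W2))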

123 Variable Basis Functions in a Neural Network

φ(x) = σ(w_0 + w_1 x_1 + w_2 x_2) for different parameter vectors w.
(Figure: four surface plots of the basis function over (x_1, x_2), for w = (0, 1, 0.1), w = (0, 0.1, 1), w = (0, 0.5, 0.5), and w = (1, 0.5, 0.5).)

124 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = x^2.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

125 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = sin(x).
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

126 Approximation Capabilities of Neural Networks

Neural network approximating f(x) = |x|.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

127 Approximation Capabilities of Neural Networks

Neural network approximating the Heaviside function
    f(x) = 1 for x ≥ 0, and f(x) = 0 for x < 0.
(Figure) Two-layer network with 3 hidden units (tanh activation functions) and linear outputs, trained on 50 data points sampled from the interval (-1, 1). Red: resulting output. Dashed: output of the hidden units.

128 Approximation Capabilities of Neural Networks

Neural network for two-class classification: 2 inputs, 2 hidden units with tanh activation function, 1 output with logistic sigmoid activation function.
(Figure) Red: y = 0.5 decision boundary. Dashed blue: z = 0.5 hidden unit contours. Green: optimal decision boundary computed from the known data distribution.

129 Parameter Optimisation

Nonlinear mapping from input x_n to output y(x_n, w).
Sum-of-squares error function over all training data:
    E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) - t_n \|^2,
where we have N pairs of input vectors x_n and target vectors t_n.
Find the parameter ŵ which minimises E(w) by gradient descent:
    ŵ = \arg\min_w E(w)

130 Error Backpropagation

Given the current errors δ_k, the activation function h(·), its derivative h'(·), and the outputs z_i of the previous layer.
The errors in the previous layer follow from the backpropagation formula
    δ_j = h'(a_j) \sum_k w_{kj} δ_k.
The components of the gradient of E_n are then
    \frac{∂E_n(w)}{∂w_{ji}} = δ_j z_i.
(Figure: the errors δ_k propagated backwards through the weights w_{kj} to obtain δ_j at unit z_j.)
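A sketch of these backpropagation formulas for a two-layer tanh network with linear outputs and per-example sum-of-squares error, reusing the weight convention of the forward-pass sketch above; this is an illustration under those assumptions, not the lecture's code.

import numpy as np

def backprop_single_example(x, t, W1, W2):
    """Gradient of the per-example sum-of-squares error for a two-layer tanh network.

    W1 : (M, D + 1), W2 : (K, M + 1); x : (D,), t : (K,).
    Returns (dE/dW1, dE/dW2).
    """
    # forward pass
    x_ext = np.concatenate(([1.0], x))
    a = W1 @ x_ext                        # hidden activations a_j
    z = np.tanh(a)
    z_ext = np.concatenate(([1.0], z))
    y = W2 @ z_ext                        # linear outputs

    # output errors delta_k = y_k - t_k (sum-of-squares error, linear output units)
    delta_k = y - t
    # backpropagation formula: delta_j = h'(a_j) * sum_k w_kj delta_k  (h = tanh)
    delta_j = (1.0 - np.tanh(a) ** 2) * (W2[:, 1:].T @ delta_k)

    # gradient components dE_n/dw_ji = delta_j * z_i
    dW2 = np.outer(delta_k, z_ext)
    dW1 = np.outer(delta_j, x_ext)
    return dW1, dW2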

131 Efficiency of Error Backpropagation

As the number of weights W is usually much larger than the number of units (the network is well connected), the complexity of calculating the gradient ∂E_n(w)/∂w_{ji} via error backpropagation is O(W).
Compare this to numerical differentiation using forward differences
    \frac{∂E_n(w)}{∂w_{ji}} = \frac{E_n(w_{ji} + ε) - E_n(w_{ji})}{ε} + O(ε)
or the numerically more stable (fewer round-off errors) symmetric differences
    \frac{∂E_n(w)}{∂w_{ji}} = \frac{E_n(w_{ji} + ε) - E_n(w_{ji} - ε)}{2ε} + O(ε^2),
which both need O(W^2) operations.
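For comparison, a symmetric-difference gradient routine as described above (a generic sketch; the quadratic test function at the end is just an example with a known gradient).

import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    """Symmetric-difference gradient of a scalar function E at w, accurate to O(eps^2).

    Needs two evaluations of E per weight; if each evaluation is itself O(W),
    the total cost is O(W^2), which is the point of the comparison above.
    """
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

E = lambda w: 0.5 * np.sum(w ** 2)         # exact gradient is w itself
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(E, w))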

132 Regularisation in Neural Networks

Model complexity matters again.
(Figure: examples of two-layer networks with M = 1, M = 3, and M = 10 hidden units fitted to the same data set.)

133 Regularisation via Early Stopping

Stop training at the minimum of the validation set error.
(Figure: training set error and validation set error as functions of the number of training iterations.)

134 Part V: Kernel Methods and SVM

Kernel Methods
Maximum Margin Classifiers

135 Kernel Methods

Keep (some of) the training data and recast prediction as a linear combination of kernel functions evaluated at the kept training data points and the new test point.
Let L(t, y(x)) be any loss function and J(f) any penalty quadratic in f; then the minimiser of the penalised loss
    \sum_{n=1}^{N} L(t_n, y(x_n)) + λ J(f)
has the form
    f(x) = \sum_{n=1}^{N} α_n k(x_n, x)
with α minimising
    \sum_{n=1}^{N} L(t_n, (Kα)_n) + λ α^T K α,
where the kernel matrix has entries K_{ij} = K_{ji} = k(x_i, x_j).
Kernel trick, based on Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function k(x, y) can be expressed as a dot product in a high-dimensional (possibly infinite-dimensional) space.
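For the special case of a squared loss, the penalised problem above has the closed-form solution α = (K + λI)^{-1} t, i.e. kernel ridge regression. A minimal sketch, assuming NumPy and a Gaussian kernel; the function names and toy data are illustrative.

import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, t, lam=0.1, gamma=1.0):
    """For squared loss, alpha minimising ||t - K alpha||^2 + lam alpha^T K alpha
    satisfies (K + lam I) alpha = t."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), t)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    """f(x) = sum_n alpha_n k(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy 1-D regression example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, t, lam=0.1, gamma=0.5)
X_new = np.linspace(-3, 3, 5)[:, None]
print(kernel_ridge_predict(X, alpha, X_new, gamma=0.5))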

136 Maximum Margin Classifiers

Support vector machines choose the decision boundary which maximises the smallest distance to the samples in both classes:
    ŵ = \arg\max_{w : \|w\| = 1} \min_n [ t_n (w^T φ(x_n)) ],    t_n ∈ {-1, 1}
Linear boundary for the identity features φ_k(x) = x_k.
(Figure: maximum margin boundary with the contours y = -1, y = 0, y = 1.)

137 Maximum Margin Classifiers

Non-linear boundary for general φ(x):
    ŵ = \sum_{n=1}^{N} α_n φ(x_n)
for a few non-zero α_n and the corresponding x_n (the support vectors).
    \hat{f}(x) = ŵ^T φ(x) = \sum_{n=1}^{N} α_n k(x_n, x),    with k(x_n, x) = φ(x_n)^T φ(x)

138 Overlapping Class Distributions

Introduce a slack variable ξ_n for each data point n:
    ξ_n = 0 if the data point is correctly classified and on the margin boundary or beyond,
    ξ_n = |t_n - y(x_n)| otherwise.
(Figure: slack variables; points with ξ = 0 on or beyond the correct margin boundary (y = ±1), 0 < ξ < 1 inside the margin but correctly classified, ξ > 1 misclassified.)

139 Overlapping Class Distributions

(Figure) The ν-SVM algorithm using Gaussian kernels exp(-γ ||x - x'||^2) with γ = 0.45, applied to a nonseparable data set in two dimensions. Support vectors are indicated by circles.

140 Part VI: Mixture Models and EM

K-means Clustering
Mixture Models and EM
Mixture of Bernoulli Distributions
EM for Gaussian Mixtures - Latent Variables
Convergence of EM

141 K-means Clustering

Goal: partition N feature vectors x_n into K clusters using the Euclidean distance d(x_i, x_j) = ||x_i - x_j||, such that each feature vector belongs to the cluster with the nearest mean.
Distortion measure:
    J(μ, cl) = \sum_{n=1}^{N} d(x_n, μ_{cl(x_n)})^2
where cl(x_n) is the index of the cluster centre closest to x_n.
Start with K arbitrary cluster centres μ_k.
E-step: minimise J w.r.t. cl(x_n): assign each data point x_n to its closest cluster, with index cl(x_n).
M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.
Iterating over the E/M steps converges to a local minimum of J.
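A compact K-means sketch following the two steps above (assuming NumPy; initialising the centres from K randomly chosen data points is just one common choice).

import numpy as np

def kmeans(X, K, n_iters=100, rng=None):
    """Plain K-means: alternate assignment and mean-update steps.

    X : (N, D) data, K : number of clusters. Returns (centres, assignments).
    """
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), K, replace=False)]   # K arbitrary initial centres
    for _ in range(n_iters):
        # E step: assign each point to the nearest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        cl = d2.argmin(axis=1)
        # M step: recompute each centre as the mean of its assigned points
        new_centres = np.array([X[cl == k].mean(axis=0) if np.any(cl == k) else centres[k]
                                for k in range(K)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, cl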

142 K-means Clustering - Example

(Figure: nine panels (a)-(i) showing successive assignment and mean-update steps of K-means on a two-dimensional data set.)

143 Mixture Models and EM

Mixture of Gaussians:
    p(x | π, μ, Σ) = \sum_{k=1}^{K} π_k N(x | μ_k, Σ_k)
Maximise the likelihood p(X | π, μ, Σ) w.r.t. π, μ, Σ.
Compare to the two steps of K-means:
E-step: minimise J w.r.t. cl(x_n): assign each data point x_n to its closest cluster.
M-step: minimise J w.r.t. μ_k: find the new μ_k as the mean of the points belonging to cluster k.
(Figure: panels (a)-(f) showing successive EM iterations for a Gaussian mixture fit, labelled by the number of completed iterations L, e.g. L = 2 and L = 5.)

144 EM for Gaussian Mixtures

Given a Gaussian mixture and data X, maximise the log likelihood w.r.t. the parameters (π, μ, Σ).
1 Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k. Evaluate the log likelihood function.
2 E step: evaluate the responsibilities γ(z_nk) using the current parameters
    γ(z_nk) = \frac{π_k N(x_n | μ_k, Σ_k)}{\sum_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)}
3 M step: re-estimate the parameters using the current γ(z_nk), with N_k = \sum_{n=1}^{N} γ(z_nk):
    μ_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n
    Σ_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) (x_n - μ_k^{new})(x_n - μ_k^{new})^T
    π_k^{new} = \frac{N_k}{N}
4 Evaluate the log likelihood
    \ln p(X | π, μ, Σ) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} π_k^{new} N(x_n | μ_k^{new}, Σ_k^{new}) \right\};
  if not converged, go to step 2.
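A sketch of this EM loop, assuming NumPy and SciPy's multivariate normal density; the initialisation, the small diagonal jitter on the covariances, and the convergence tolerance are illustrative choices, not part of the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, rng=None):
    """EM for a Gaussian mixture, following the E/M updates on the slide.

    X : (N, D) data. Returns (pi, mu, Sigma, log_likelihoods).
    """
    rng = np.random.default_rng(rng)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]              # initialise means from data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    lls = []
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])       # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate pi, mu, Sigma
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # log likelihood and convergence check
        lls.append(np.log(dens.sum(axis=1)).sum())
        if len(lls) > 1 and abs(lls[-1] - lls[-2]) < tol:
            break
    return pi, mu, Sigma, lls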

145 Mixture of Bernoulli Distributions

A set of D binary variables x_i, i = 1, ..., D, each governed by a Bernoulli distribution with parameter μ_i. Therefore
    p(x | μ) = \prod_{i=1}^{D} μ_i^{x_i} (1 - μ_i)^{1 - x_i}
Expectation and covariance:
    E[x] = μ,    cov[x] = diag\{ μ_i (1 - μ_i) \}

146 Mixture of Bernoulli Distributions

Mixture with
    p(x | μ, π) = \sum_{k=1}^{K} π_k p(x | μ_k),    p(x | μ_k) = \prod_{i=1}^{D} μ_{ki}^{x_i} (1 - μ_{ki})^{1 - x_i}
A similar calculation as for the mixture of Gaussians gives
    γ(z_nk) = \frac{π_k p(x_n | μ_k)}{\sum_{j=1}^{K} π_j p(x_n | μ_j)},    N_k = \sum_{n=1}^{N} γ(z_nk)
    \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} γ(z_nk) x_n,    μ_k = \bar{x}_k,    π_k = \frac{N_k}{N}

147 EM for Mixture of Bernoulli Distributions - Digits

(Figure) Examples from a digits data set; each pixel takes only binary values.
(Figure) Parameters μ_{ki} for each component in the mixture.
(Figure) Fit of a single multivariate Bernoulli distribution, for comparison.

148 The Role of Latent Variables

EM finds the maximum likelihood solution for models with latent variables.
Two kinds of variables: observed variables X and latent variables Z, plus model parameters θ.
The log likelihood is then
    \ln p(X | θ) = \ln \left\{ \sum_{Z} p(X, Z | θ) \right\}
The optimisation is difficult because of the log of the sum.
Assume that maximisation of the distribution p(X, Z | θ) over the complete data set {X, Z} is straightforward.
But we only have the incomplete data set {X} and the posterior distribution p(Z | X, θ).

149 EM - Key Idea

Key idea of EM: as Z is not observed, work with an averaged version Q(θ, θ^{old}) of the complete-data log likelihood ln p(X, Z | θ), averaged over all states of Z:
    Q(θ, θ^{old}) = \sum_{Z} p(Z | X, θ^{old}) \ln p(X, Z | θ)

150 EM Algorithm

1 Choose an initial setting for the parameters θ^{old}.
2 E step: evaluate p(Z | X, θ^{old}).
3 M step: evaluate θ^{new} given by
    θ^{new} = \arg\max_{θ} Q(θ, θ^{old}),    where    Q(θ, θ^{old}) = \sum_{Z} p(Z | X, θ^{old}) \ln p(X, Z | θ)
4 Check for convergence of the log likelihood or of the parameter values. If not yet converged, set θ^{old} = θ^{new} and go to step 2.

151 EM Algorithm - Convergence

Start with the product rule for the observed variables X, the unobserved variables Z, and the parameters θ:
    \ln p(X, Z | θ) = \ln p(Z | X, θ) + \ln p(X | θ).
Multiply by an arbitrary distribution q(Z) and sum over Z:
    \sum_{Z} q(Z) \ln p(X, Z | θ) = \sum_{Z} q(Z) \ln p(Z | X, θ) + \ln p(X | θ).
Rewrite as
    \ln p(X | θ) = L(q, θ) + KL(q \| p)
with
    L(q, θ) = \sum_{Z} q(Z) \ln \frac{p(X, Z | θ)}{q(Z)},    KL(q \| p) = - \sum_{Z} q(Z) \ln \frac{p(Z | X, θ)}{q(Z)}.
KL(q‖p) is the Kullback-Leibler divergence.

152 Kullback-Leibler Divergence

A (non-symmetric) measure of the difference between two distributions p(y) and q(y):
    KL(q \| p) = \sum_{y} q(y) \ln \frac{q(y)}{p(y)} = - \sum_{y} q(y) \ln \frac{p(y)}{q(y)}    (discrete case)
    KL(q \| p) = \int q(y) \ln \frac{q(y)}{p(y)} \, dy = - \int q(y) \ln \frac{p(y)}{q(y)} \, dy    (continuous case)
Properties:
KL(q‖p) ≥ 0
not symmetric: KL(q‖p) ≠ KL(p‖q)
KL(q‖p) = 0 iff q = p
invariant under parameter transformations
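A tiny sketch of the discrete form, mainly to make the asymmetry visible (assuming NumPy; the epsilon guard is an implementation convenience, and the two distributions are made-up numbers).

import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """Discrete KL(q || p) = sum_y q(y) ln(q(y) / p(y)); assumes q and p are normalised."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl_divergence(q, p), kl_divergence(p, q))   # the two values differ: not symmetric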

153 EM Algorithm - Convergence

The two parts of ln p(X | θ):
    \ln p(X | θ) = L(q, θ) + KL(q \| p)
Since KL(q‖p) ≥ 0, the term L(q, θ) is a lower bound:
    L(q, θ) ≤ \ln p(X | θ)
(Figure: decomposition of ln p(X | θ) into L(q, θ) and KL(q‖p).)

154 EM Algorithm - E Step

Hold θ^{old} fixed. Maximise the lower bound L(q, θ^{old}) with respect to q(·).
L(q, θ^{old}) is a functional, and ln p(X | θ^{old}) does NOT depend on q(·).
The maximum of L(q, θ^{old}) occurs when the Kullback-Leibler divergence vanishes.
Therefore, choose q(Z) = p(Z | X, θ^{old}), which gives
    KL(q \| p) = 0    and    L(q, θ^{old}) = \ln p(X | θ^{old}).

155 EM Algorithm - M Step

Hold q(·) = p(Z | X, θ^{old}) fixed. Maximise the lower bound L(q, θ) with respect to θ:
    θ^{new} = \arg\max_{θ} L(q, θ) = \arg\max_{θ} \sum_{Z} q(Z) \ln p(X, Z | θ)
Then L(q, θ^{new}) > L(q, θ^{old}) unless the maximum has already been reached.
As q(·) = p(Z | X, θ^{old}) is held fixed, p(Z | X, θ^{new}) will not be equal to q(·), so the Kullback-Leibler divergence will be greater than zero (unless converged). Hence
    \ln p(X | θ^{new}) = L(q, θ^{new}) + KL(q \| p) ≥ L(q, θ^{new}) > L(q, θ^{old}) = \ln p(X | θ^{old}).

156 EM Algorithm - Parameter View

(Figure: the incomplete-data log likelihood ln p(X | θ) and the lower bound L(q, θ) as functions of θ, with the points θ^{old} and θ^{new}.)
Red curve: incomplete-data log likelihood. Blue curve: lower bound after the E step. Green curve: lower bound after the M step.

157 Part VII: Sampling

Sampling from the Uniform Distribution
Sampling from Standard Distributions
Rejection Sampling
Importance Sampling
Markov Chain Monte Carlo - The Idea

158 Sampling from the Uniform Distribution

In a computer this is usually done via a pseudorandom number generator: an algorithm generating a sequence of numbers that approximates the properties of random numbers.
Example: linear congruential generators
    z^{(n+1)} = (a z^{(n)} + c) mod m
for modulus m > 0, multiplier 0 < a < m, increment 0 ≤ c < m, and seed z^{(0)}.
Other classes of pseudorandom number generators:
Lagged Fibonacci generators
Linear feedback shift registers
Generalised feedback shift registers
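A minimal linear congruential generator sketch; the particular constants (a, c, m) below are one well-known published choice used purely as an example, and any values satisfying the stated constraints define a generator.

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Linear congruential generator z_{n+1} = (a z_n + c) mod m; yields values in [0, 1)."""
    z = seed
    while True:
        z = (a * z + c) % m
        yield z / m

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])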

159 Example: RANDU Random Number Generator

Used since the 1960s on many machines.
Defined by the recurrence
    z^{(n+1)} = (65539 \, z^{(n)}) \mod 2^{31}

160 RANDU Looks Somehow OK?

Plotting (z^{(n+2)}, z^{(n+1)}, z^{(n)})^T in 3D ...
(Figure: 3D scatter plot of consecutive RANDU triples, which looks uniform from this viewpoint.)

161 RANDU Not Really OK

Plotting (z^{(n+2)}, z^{(n+1)}, z^{(n)})^T in 3D ... and changing the viewpoint reveals that all points fall into 15 planes.
(Figure: the same scatter plot from a different viewpoint, showing the 15 planes.)

162 A Bad Generator - RANDU

Analyse the recurrence
    z^{(n+1)} = (65539 \, z^{(n)}) \mod 2^{31}
Taking every equation modulo 2^{31} and writing 65539 = 2^{16} + 3, we can correlate three consecutive samples:
    z^{(n+2)} = (2^{16} + 3)^2 z^{(n)} = (2^{32} + 6 \cdot 2^{16} + 9) z^{(n)} = (6 (2^{16} + 3) - 9) z^{(n)} = 6 z^{(n+1)} - 9 z^{(n)}
Marsaglia, George, "Random Numbers Fall Mainly in the Planes", Proc. National Academy of Sciences 61, 25-28, 1968.
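A short sketch that reproduces RANDU and checks the correlation z^{(n+2)} = 6 z^{(n+1)} - 9 z^{(n)} (mod 2^31) numerically (assuming NumPy; the seed is arbitrary but should be odd).

import numpy as np

def randu(seed, n):
    """Generate n RANDU samples z_{k+1} = 65539 * z_k mod 2^31."""
    z = np.empty(n, dtype=np.int64)
    z[0] = seed
    for k in range(1, n):
        z[k] = (65539 * z[k - 1]) % 2**31
    return z

z = randu(seed=1, n=1000)
lhs = z[2:]
rhs = (6 * z[1:-1] - 9 * z[:-2]) % 2**31
print(np.all(lhs == rhs))   # True: consecutive triples satisfy the plane equation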

163 Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function
    h(y) = \int_{-\infty}^{y} p(\hat{y}) \, d\hat{y}
Transform the samples z ~ U(0, 1) by
    y = h^{-1}(z)
to obtain samples y distributed according to p(y).

164 Sampling from Standard Distributions

Goal: sample from p(y), which is given in analytical form.
If a uniformly distributed random variable z is transformed using y = h^{-1}(z), then y will be distributed according to p(y).
(Figure: the cumulative distribution h(y) mapping uniform samples z ∈ (0, 1) to samples y from p(y).)

165 Sampling from the Exponential Distribution

Goal: sample from the exponential distribution
    p(y) = λ e^{-λ y} for y ≥ 0, and p(y) = 0 for y < 0,
with rate parameter λ > 0.
Suppose uniformly distributed samples z in the interval (0, 1) are available.
Calculate the cumulative distribution function
    h(y) = \int_{0}^{y} p(\hat{y}) \, d\hat{y} = \int_{0}^{y} λ e^{-λ \hat{y}} \, d\hat{y} = 1 - e^{-λ y}
Transform the samples z ~ U(0, 1) by
    y = h^{-1}(z) = -\frac{1}{λ} \ln(1 - z)
to obtain samples y distributed according to the exponential distribution.
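A direct sketch of this inverse-CDF transform (assuming NumPy; the rate and sample size are arbitrary).

import numpy as np

def sample_exponential(lam, n, rng=None):
    """Inverse-CDF sampling: y = -ln(1 - z)/lam with z ~ U(0, 1)."""
    rng = np.random.default_rng(rng)
    z = rng.uniform(0.0, 1.0, size=n)
    return -np.log(1.0 - z) / lam

samples = sample_exponential(lam=2.0, n=100_000, rng=0)
print(samples.mean())   # close to 1/lambda = 0.5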

166 Sampling the Gaussian Distribution - Box-Muller

1 Generate pairs of uniformly distributed random numbers z_1, z_2 ∈ (-1, 1) (e.g. z_i = 2z - 1 for z from U(0, 1)).
2 Discard any pair (z_1, z_2) unless z_1^2 + z_2^2 ≤ 1. This results in a uniform distribution inside the unit circle, p(z_1, z_2) = 1/π.
3 Evaluate r^2 = z_1^2 + z_2^2 and
    y_1 = z_1 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2},    y_2 = z_2 \left( \frac{-2 \ln r^2}{r^2} \right)^{1/2}
4 Then y_1 and y_2 are independent with joint distribution
    p(y_1, y_2) = p(z_1, z_2) \left| \frac{\partial(z_1, z_2)}{\partial(y_1, y_2)} \right| = \frac{1}{\sqrt{2π}} e^{-y_1^2/2} \, \frac{1}{\sqrt{2π}} e^{-y_2^2/2}
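A sketch of the polar Box-Muller steps above (assuming NumPy; written for clarity rather than speed).

import numpy as np

def box_muller_polar(n, rng=None):
    """Polar Box-Muller: rejection inside the unit circle, then the transform above."""
    rng = np.random.default_rng(rng)
    samples = []
    while len(samples) < n:
        z1, z2 = rng.uniform(-1.0, 1.0, size=2)
        r2 = z1**2 + z2**2
        if 0.0 < r2 <= 1.0:                      # keep only points inside the unit circle
            factor = np.sqrt(-2.0 * np.log(r2) / r2)
            samples.extend([z1 * factor, z2 * factor])
    return np.array(samples[:n])

y = box_muller_polar(100_000, rng=0)
print(y.mean(), y.std())   # close to 0 and 1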

167 Rejection Sampling

Assumption 1: sampling directly from p(z) is difficult, but we can evaluate p(z) up to some unknown normalisation constant Z_p:
    p(z) = \frac{1}{Z_p} \tilde{p}(z)
Assumption 2: we can draw samples from a simpler distribution q(z), and for some constant k and all z it holds that
    k q(z) ≥ \tilde{p}(z)
(Figure: the scaled proposal k q(z) enveloping \tilde{p}(z); a proposed sample z_0 with a uniform value u_0 drawn under k q(z_0).)

168 Rejection Sampling

1 Generate a random number z_0 from the distribution q(z).
2 Generate a number u_0 from the uniform distribution over [0, k q(z_0)].
3 If u_0 > \tilde{p}(z_0), reject the pair (z_0, u_0).
4 The remaining pairs are uniformly distributed under the curve \tilde{p}(z).
5 The z values are then distributed according to p(z).
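A generic rejection-sampling sketch following these steps; the unnormalised target, the uniform proposal, and the bound k below are made-up examples chosen so that k q(z) ≥ p̃(z) holds.

import numpy as np

def rejection_sample(p_tilde, q_sample, q_pdf, k, n, rng=None):
    """Propose z0 ~ q, draw u0 ~ U[0, k q(z0)], accept z0 if u0 <= p_tilde(z0).

    Requires k q(z) >= p_tilde(z) for all z.
    """
    rng = np.random.default_rng(rng)
    samples = []
    while len(samples) < n:
        z0 = q_sample(rng)
        u0 = rng.uniform(0.0, k * q_pdf(z0))
        if u0 <= p_tilde(z0):
            samples.append(z0)
    return np.array(samples)

# example: unnormalised target on [0, 1], uniform proposal q(z) = 1 on [0, 1]
p_tilde = lambda z: z**2 * (1 - z) ** 3 if 0 <= z <= 1 else 0.0
q_sample = lambda rng: rng.uniform(0.0, 1.0)
q_pdf = lambda z: 1.0
k = 0.04                       # upper bound on p_tilde (its maximum is about 0.035)
print(rejection_sample(p_tilde, q_sample, q_pdf, k, n=5, rng=0))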

169 Importance Sampling

Provides a framework to directly approximate the expectation E_p[f(z)] with respect to some distribution p(z). It does NOT itself provide samples from p(z).
Again use a proposal distribution q(z) and draw samples z^{(l)} from it. Then
    E[f] = \int f(z) p(z) \, dz = \int f(z) \frac{p(z)}{q(z)} q(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})} f(z^{(l)})
(Figure: target p(z), proposal q(z), and the function f(z).)

170 Importance Sampling - Unnormalised

Consider both p(z) and q(z) to be unnormalised, p(z) = \tilde{p}(z)/Z_p and q(z) = \tilde{q}(z)/Z_q. It follows that
    E[f] \approx \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} r_l f(z^{(l)}),    r_l = \frac{\tilde{p}(z^{(l)})}{\tilde{q}(z^{(l)})}.
Use the same set of samples to estimate
    \frac{Z_p}{Z_q} \approx \frac{1}{L} \sum_{l=1}^{L} r_l,
resulting in the formula for unnormalised distributions
    E[f] \approx \sum_{l=1}^{L} w_l f(z^{(l)}),    w_l = \frac{r_l}{\sum_{m=1}^{L} r_m}.
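A sketch of the self-normalised estimator with weights w_l = r_l / Σ_m r_m (assuming NumPy; the unnormalised Gaussian target and the wider Gaussian proposal are illustrative).

import numpy as np

def importance_expectation(f, p_tilde, q_tilde, q_sample, L, rng=None):
    """Self-normalised importance sampling estimate of E_p[f] with unnormalised p and q."""
    rng = np.random.default_rng(rng)
    z = q_sample(rng, L)                       # L samples from the proposal
    r = p_tilde(z) / q_tilde(z)                # importance ratios r_l
    w = r / r.sum()                            # normalised weights w_l
    return np.sum(w * f(z))

# example: E[z^2] under an (unnormalised) standard Gaussian, proposal N(0, 2^2)
p_tilde = lambda z: np.exp(-0.5 * z**2)
q_tilde = lambda z: np.exp(-0.5 * (z / 2.0) ** 2)
q_sample = lambda rng, L: 2.0 * rng.normal(size=L)
print(importance_expectation(lambda z: z**2, p_tilde, q_tilde, q_sample, L=100_000, rng=0))
# close to 1, the variance of the standard Gaussian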

171 Importance Sampling - Key Points

Try to choose sample points in the input space where the product f(z) p(z) is large, or at least where p(z) is large.
The importance weights r_l correct the bias introduced by sampling from the proposal distribution q(z) instead of the wanted distribution p(z).
Success depends on how well q(z) approximates p(z).
If p(z) > 0 in some region, then q(z) > 0 is necessary there.

172 Markov Chain Monte Carlo

Goal: generate samples from the distribution p(z).
Idea: build a machine which uses the current sample to decide which sample to produce next, in such a way that the overall distribution of the samples will be p(z).
1 The current sample z^{(τ)} is known. Generate a new candidate z* from a proposal distribution q(z | z^{(τ)}) that we know how to sample from.
2 Accept or reject the new candidate according to an appropriate criterion:
    z^{(τ+1)} = z* if accepted, z^{(τ)} if rejected.
3 The proposal distribution depends on the current state.

173 Metropolis Algorithm

1 Choose a symmetric proposal distribution: q(z_A | z_B) = q(z_B | z_A).
2 Accept the new candidate z* with probability
    A(z*, z^{(τ)}) = \min\left( 1, \frac{p(z*)}{p(z^{(τ)})} \right)
3 How? Choose a random number u uniformly distributed in (0, 1) and accept the new candidate if A(z*, z^{(τ)}) > u.
4 Set
    z^{(τ+1)} = z* if accepted, z^{(τ)} if rejected.
Rejection of a candidate leads to the inclusion of the previous sample again (different from rejection sampling).
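A minimal Metropolis sketch with a symmetric Gaussian random-walk proposal; the target density, step size, and chain length below are illustrative choices only.

import numpy as np

def metropolis(p_tilde, z0, n_samples, step=1.0, rng=None):
    """Metropolis sampling with the symmetric proposal q(z* | z) = N(z* | z, step^2 I)."""
    rng = np.random.default_rng(rng)
    z = np.atleast_1d(np.asarray(z0, dtype=float))
    chain = [z.copy()]
    for _ in range(n_samples - 1):
        z_new = z + step * rng.normal(size=z.shape)         # symmetric proposal
        accept_prob = min(1.0, p_tilde(z_new) / p_tilde(z))
        if rng.uniform() < accept_prob:
            z = z_new                                       # accept
        chain.append(z.copy())                              # on rejection, keep old sample
    return np.array(chain)

# example: sample a correlated 2-D Gaussian given only up to normalisation
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
p_tilde = lambda z: np.exp(-0.5 * z @ Sigma_inv @ z)
chain = metropolis(p_tilde, z0=[0.0, 0.0], n_samples=20_000, step=0.5, rng=0)
print(chain.mean(axis=0))   # close to (0, 0)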

174 Metropolis Algorithm - Illustration

Sampling from a Gaussian distribution (the black contour shows one standard deviation).
The proposal distribution is an isotropic Gaussian with standard deviation 0.2.
150 candidates were generated; 43 were rejected.
(Figure: accepted steps shown in green, rejected steps in red.)

175 Markov Chain Monte Carlo - Metropolis-Hastings

Generalisation of the Metropolis algorithm to nonsymmetric proposal distributions q_k.
At step τ, draw a candidate z* from the distribution q_k(z | z^{(τ)}), where k labels the set of possible transitions.
Accept with probability
    A_k(z*, z^{(τ)}) = \min\left( 1, \frac{p(z*) \, q_k(z^{(τ)} | z*)}{p(z^{(τ)}) \, q_k(z* | z^{(τ)})} \right)
The choice of proposal distribution is critical. A common choice is a Gaussian centered on the current state:
small variance: high acceptance rate, but a slow walk through the state space, and the samples are not independent
large variance: high rejection rate

176 Part VIII: More

177 More

Graphical Models
Gaussian Processes
Sequential Data
Sequential Decision Theory
Learning Agents
Reinforcement Learning
Theoretical Model Selection
Additive Models, Trees, and Related Methods
Approximate (Variational) Inference
Boosting
Concept Learning
Computational Learning Theory
Genetic Algorithms
Learning Sets of Rules
Analytical Learning
...

178 Part IX: Resources

Journals
Books
Datasets

179 Journals

Journal of Machine Learning Research
IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Neural Networks
Neural Computation
Neural Networks
Annals of Statistics
Journal of the American Statistical Association
SIAM Journal on Applied Mathematics (SIAP)
...

180 Conferences

International Conference on Machine Learning (ICML)
European Conference on Machine Learning (ECML)
Neural Information Processing Systems (NIPS)
Algorithmic Learning Theory (ALT)
Computational Learning Theory (COLT)
Uncertainty in Artificial Intelligence (UAI)
International Joint Conference on Artificial Intelligence (IJCAI)
International Conference on Artificial Neural Networks (ICANN)
...

181 Books

Pattern Recognition and Machine Learning, Christopher M. Bishop
The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani, Jerome Friedman

182 Books

Pattern Classification, Richard O. Duda, Peter E. Hart, David G. Stork
Introduction to Machine Learning, Ethem Alpaydin


More information