TDA231. Logistic regression


1 TDA231: Logistic Regression. Devdatt Dubhashi, Dept. of Computer Science and Engineering, Chalmers University. February 19, 2016.

2 Some data. [Figure: scatter plot of the two-class training data in the $(x_1, x_2)$ plane.]

3–6 In the Bayes classifier, we built a model of each class and then used Bayes' rule:

$$P(T_{new} = k \mid x_{new}, X, t) = \frac{p(x_{new} \mid T_{new} = k, X, t)\,P(T_{new} = k)}{\sum_j p(x_{new} \mid T_{new} = j, X, t)\,P(T_{new} = j)}.$$

An alternative is to model $P(T_{new} = k \mid x_{new}, X, t) = f(x_{new}; w)$ directly, with some parameters $w$. We've seen $f(x_{new}; w) = w^T x_{new}$ before; can we use it here? No: its output is unbounded and so can't be a probability. But we can use $P(T_{new} = k \mid x_{new}, w) = h(f(x_{new}; w))$, where $h(\cdot)$ squashes $f(x_{new}; w)$ to lie between 0 and 1, i.e. into a probability.

7 $h(\cdot)$. For (binary) logistic regression, we use the sigmoid function:

$$P(T_{new} = 1 \mid x_{new}, w) = h(w^T x_{new}) = \frac{1}{1 + \exp(-w^T x_{new})}.$$

[Figure: the sigmoid $1/(1 + \exp(-w^T x))$ plotted against $w^T x$.]
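To make the squashing step concrete, here is a minimal numpy sketch (the function and variable names are mine, not from the course code) showing an unbounded score $w^T x$ being turned into a probability:

```python
import numpy as np

def sigmoid(a):
    """Squash an unbounded score a into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([1.5, -0.7])        # hypothetical parameter vector
x_new = np.array([2.0, 1.0])     # hypothetical new input
score = w @ x_new                # w^T x_new: unbounded, here 2.3
p = sigmoid(score)               # squashed into a probability, about 0.909
print(score, p)
```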

8–9 Bayesian logistic regression. Recall the Bayesian ideas: in theory, if we place a prior on $w$ and define a likelihood, we can obtain a posterior

$$p(w \mid X, t) = \frac{p(t \mid X, w)\,p(w)}{p(t \mid X)},$$

and we can make predictions by taking expectations (averaging over $w$):

$$P(T_{new} = 1 \mid x_{new}, X, t) = \mathbb{E}_{p(w \mid X, t)}\{P(T_{new} = 1 \mid x_{new}, w)\}.$$

Sounds good so far...

10 Defining a prior. Choose a Gaussian prior that factorises over the $D$ dimensions of $w$:

$$p(w) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \sigma^2).$$

The choice of prior is always important from a data-analysis point of view. Previously it was also important for the maths; that isn't the case today: we could choose any prior, since no prior makes the maths easier!

11–12 Logistic regression: likelihood. First assume independence across data points:

$$p(t \mid X, w) = \prod_{n=1}^{N} p(t_n \mid x_n, w).$$

We have already defined each factor: it's our squashing function! If $t_n = 1$:

$$P(t_n = 1 \mid x_n, w) = \frac{1}{1 + \exp(-w^T x_n)},$$

and if $t_n = 0$:

$$P(t_n = 0 \mid x_n, w) = 1 - P(t_n = 1 \mid x_n, w).$$
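As a small illustration of this likelihood, the sketch below evaluates $\log p(t \mid X, w)$ for 0/1 targets; the data and names are hypothetical, not from the course:

```python
import numpy as np

def log_likelihood(w, X, t):
    """log p(t | X, w) for binary logistic regression with 0/1 targets t."""
    p1 = 1.0 / (1.0 + np.exp(-(X @ w)))          # P(t_n = 1 | x_n, w) for every n
    return np.sum(t * np.log(p1) + (1 - t) * np.log(1.0 - p1))

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])   # toy inputs, one row per point
t = np.array([1, 0, 0])                                 # toy 0/1 targets
print(log_likelihood(np.array([0.1, -0.2]), X, t))
```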

13 Defining a prior (identical to slide 10: the factorised Gaussian prior and the remarks on prior choice are shown again here).

14 Posterior.

$$p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w)\,p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}.$$

Now things start going wrong. We can't compute $p(w \mid X, t, \sigma^2)$ analytically: the prior is not conjugate to the likelihood (no prior is!). This means we don't know the form of $p(w \mid X, t, \sigma^2)$, and we can't compute the marginal likelihood

$$p(t \mid X, \sigma^2) = \int p(t \mid X, w)\,p(w \mid \sigma^2)\,dw.$$

15–19 What can we compute?

$$p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w)\,p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}.$$

We can compute the numerator, $p(t \mid X, w)\,p(w \mid \sigma^2)$, so define

$$g(w; X, t, \sigma^2) = p(t \mid X, w)\,p(w \mid \sigma^2).$$

Armed with this, we have three options:
1. Find the most likely value of $w$: a point estimate.
2. Approximate $p(w \mid X, t, \sigma^2)$ with something easier.
3. Sample from $p(w \mid X, t, \sigma^2)$.

The examples we will see aren't the only ways of approximating or sampling, and they are general techniques, not unique to logistic regression.
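Since $g$ is just likelihood times prior, its logarithm is a sum of two terms. A minimal sketch under the assumptions above (0/1 targets, isotropic Gaussian prior with variance `sigma2`); it uses the algebraically equivalent form $t_n a_n - \log(1 + e^{a_n})$ of the Bernoulli log-likelihood and drops constants that do not depend on $w$:

```python
import numpy as np

def log_g(w, X, t, sigma2):
    """log g(w; X, t, sigma^2) = log p(t | X, w) + log p(w | sigma^2), up to constants in w."""
    a = X @ w                                        # scores w^T x_n
    log_lik = np.sum(t * a - np.logaddexp(0.0, a))   # sum_n [t_n a_n - log(1 + exp(a_n))]
    log_prior = -0.5 * np.sum(w ** 2) / sigma2       # Gaussian prior N(0, sigma^2) per dimension
    return log_lik + log_prior
```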

20–21 MAP estimate. Our first method is to find the value of $w$ that maximises $p(w \mid X, t, \sigma^2)$; call it $\hat{w}$. Since $g(w; X, t, \sigma^2) \propto p(w \mid X, t, \sigma^2)$, $\hat{w}$ also maximises $g(w; X, t, \sigma^2)$. This is very similar to maximum likelihood, but with the additional effect of the prior; it is known as the MAP (maximum a posteriori) solution. Once we have $\hat{w}$, we make predictions with

$$P(t_{new} = 1 \mid x_{new}, \hat{w}) = \frac{1}{1 + \exp(-\hat{w}^T x_{new})}.$$

22–24 MAP. When we met maximum likelihood, we could find $\hat{w}$ exactly with some algebra. We can't do that here: we can't solve $\frac{\partial g(w; X, t, \sigma^2)}{\partial w} = 0$ analytically. So we resort to numerical optimisation:
1. Guess $\hat{w}$.
2. Change it a bit in a way that increases $g(w; X, t, \sigma^2)$.
3. Repeat until no further increase is possible.

Many algorithms exist that differ in how they do step 2, e.g. gradient descent (applied to the negative log of $g$). They are not covered in this course; you just need to know that sometimes we can't do things analytically and there are methods to help us! Ask John!
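One simple way to implement step 2 is gradient ascent on $\log g$. For this model the gradient is $\sum_n (t_n - \sigma(w^T x_n))\,x_n - w/\sigma^2$, which leads to the sketch below; the step size, iteration count, and toy data are illustrative choices, not course settings:

```python
import numpy as np

def map_estimate(X, t, sigma2, lr=0.1, n_iters=2000):
    """Gradient ascent on log g(w; X, t, sigma^2) to find the MAP estimate w-hat."""
    w = np.zeros(X.shape[1])                    # step 1: an initial guess for w-hat
    for _ in range(n_iters):
        p1 = 1.0 / (1.0 + np.exp(-(X @ w)))     # P(t_n = 1 | x_n, w)
        grad = X.T @ (t - p1) - w / sigma2      # gradient of log g with respect to w
        w = w + lr * grad                       # step 2: move a bit uphill
    return w                                    # step 3 approximated by a fixed budget

# Toy usage with hypothetical data
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -0.5]])
t = np.array([1, 1, 0, 0])
print(map_estimate(X, t, sigma2=10.0))
```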

25 MAP numerical optimisation for our data. [Figure. Left: the data in the $(x_1, x_2)$ plane. Right: evolution of the components $w_1, w_2$ of $\hat{w}$ against iteration of the numerical optimisation.]

26–27 Decision boundary. Once we have $\hat{w}$, we can classify new examples. The decision boundary is a useful visualisation: it is the line corresponding to $P(T_{new} = 1 \mid x_{new}, \hat{w}) = 0.5$. [Figure: the data with the decision boundary overlaid.] So

$$\frac{1}{2} = \frac{1}{1 + \exp(-\hat{w}^T x_{new})} \;\Longrightarrow\; \exp(-\hat{w}^T x_{new}) = 1 \;\Longrightarrow\; \hat{w}^T x_{new} = 0.$$
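For the two-dimensional, no-bias model used in these slides, the boundary $\hat{w}^T x_{new} = 0$ is a line through the origin, so it can be traced directly from $\hat{w}$ (the value below is hypothetical):

```python
import numpy as np

def boundary_x2(w_hat, x1):
    """x2 coordinates of the decision boundary w^T x = 0 for 2-D inputs without a bias term:
    w1*x1 + w2*x2 = 0  =>  x2 = -w1*x1 / w2 (assumes w2 != 0)."""
    return -w_hat[0] * x1 / w_hat[1]

w_hat = np.array([1.2, -0.8])              # hypothetical MAP estimate
x1 = np.linspace(-5.0, 5.0, 11)
print(np.column_stack([x1, boundary_x2(w_hat, x1)]))
```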

28 Predictive probabilities. [Figure: contours of $P(T_{new} = 1 \mid x_{new}, \hat{w})$ over the $(x_1, x_2)$ plane.] Do they look sensible?

29 Sampling from the posterior. Suppose we can produce samples $w_1, w_2, \ldots, w_s, \ldots$ from $p(w \mid X, t, \sigma^2)$. Then we can average the predictions to approximate the predictive probability:

$$P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S} \frac{1}{1 + \exp(-w_s^T x_{new})}.$$

30 Magic! We can sample directly from $p(w \mid X, t, \sigma^2)$ even though we can't compute it! Various algorithms exist; we'll use Metropolis-Hastings.

31–34 Back to the script: Metropolis-Hastings. MH produces a sequence of samples $w_1, w_2, \ldots, w_s, \ldots$. Imagine we've just produced $w_{s-1}$. MH first proposes a candidate for $w_s$ (call it $\tilde{w}_s$) based on $w_{s-1}$, and then decides whether or not to accept $\tilde{w}_s$: if accepted, $w_s = \tilde{w}_s$; if not, $w_s = w_{s-1}$. So there are two distinct steps: proposal and acceptance.

35–37 MH proposal. Treat $\tilde{w}_s$ as a random variable conditioned on $w_{s-1}$, i.e. we need to define $p(\tilde{w}_s \mid w_{s-1})$. Note that this does not necessarily have to be similar to the posterior we're trying to sample from; we can choose whatever we like! For example, use a Gaussian centred on $w_{s-1}$ with some covariance $\Sigma_p$:

$$p(\tilde{w}_s \mid w_{s-1}, \Sigma_p) = \mathcal{N}(w_{s-1}, \Sigma_p).$$

[Figure: the proposal density around $w_{s-1}$ for two different choices of $\Sigma_p$.]

38–41 MH acceptance. The acceptance decision is based on the following ratio:

$$r = \frac{p(\tilde{w}_s \mid X, t, \sigma^2)}{p(w_{s-1} \mid X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)},$$

which simplifies to quantities we can all compute:

$$r = \frac{g(\tilde{w}_s; X, t, \sigma^2)}{g(w_{s-1}; X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)}.$$

We then use the following rules: if $r \geq 1$, accept: $w_s = \tilde{w}_s$; if $r < 1$, accept with probability $r$. If we do this enough, we'll eventually be sampling from $p(w \mid X, t, \sigma^2)$, no matter where we started, i.e. for any $w_1$.

42 Where to start? Convergence theorem for Markov chains: no matter where the chain is started, the MH process will always converge (under some technical conditions) to its target distribution!

43–46 When to stop? How do we know the Markov chain has converged? Two practical checks (a sketch of the first follows below):
1. Start chains from different starting points and run until they look the same.
2. Apply statistical hypothesis testing to the empirical distributions.
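One standard way to operationalise the first check is the Gelman-Rubin $\hat{R}$ statistic, which compares within-chain and between-chain variance for chains started at different points; values close to 1 suggest the chains have mixed into the same distribution. This is a generic MCMC diagnostic, not something taken from these slides; the sketch assumes chains stored as a `(n_chains, n_samples, D)` array:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat, one value per parameter dimension."""
    m, n, _ = chains.shape
    chain_means = chains.mean(axis=1)                 # per-chain means, shape (m, D)
    W = chains.var(axis=1, ddof=1).mean(axis=0)       # average within-chain variance
    B = n * chain_means.var(axis=0, ddof=1)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n                 # pooled variance estimate
    return np.sqrt(var_hat / W)                       # close to 1 => chains look the same
```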

47 MH flowchart. Set $s = 1$ and choose $w_s$. Then repeat: set $s = s + 1$; generate $\tilde{w}_s$ from $p(\tilde{w}_s \mid w_{s-1})$; compute the acceptance ratio $r$. If $r \geq 1$, set $w_s = \tilde{w}_s$. Otherwise generate $u$ from $U(0, 1)$: if $u \leq r$, set $w_s = \tilde{w}_s$; if not, $w_s = w_{s-1}$.
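A compact sketch of this loop, assuming the `log_g` function defined earlier and a Gaussian random-walk proposal (which is symmetric, so the proposal terms in $r$ cancel); the step size and sample count are illustrative tuning choices. Working in log space and comparing $\log u$ with $\log r$ avoids numerical underflow and is equivalent to the accept-if-$r \geq 1$, else accept-with-probability-$r$ rule above.

```python
import numpy as np

def metropolis_hastings(log_g, X, t, sigma2, n_samples=5000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings targeting p(w | X, t, sigma^2), proportional to g."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    w = np.zeros(D)                                  # arbitrary starting point w_1
    log_g_w = log_g(w, X, t, sigma2)
    samples = np.empty((n_samples, D))
    for s in range(n_samples):
        w_prop = w + step * rng.standard_normal(D)   # propose w~_s ~ N(w_{s-1}, step^2 I)
        log_g_prop = log_g(w_prop, X, t, sigma2)
        if np.log(rng.uniform()) < log_g_prop - log_g_w:
            w, log_g_w = w_prop, log_g_prop          # accept: w_s = w~_s
        samples[s] = w                               # rejection keeps w_s = w_{s-1}
    return samples
```

With the toy data from the MAP sketch, `metropolis_hastings(log_g, X, t, sigma2=10.0)` returns a `(5000, 2)` array of posterior samples.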

48–49 MH walkthrough. [Figures: panels showing the sequence of proposed and accepted samples in the $(w_1, w_2)$ plane as the chain progresses.]

50 What do the samples look like? [Figure: samples from the posterior obtained using MH, plotted in the $(w_1, w_2)$ plane.]

51–52 Predictions with MH. MH provides us with a set of samples $w_1, \ldots, w_S$. These can be used to approximate the predictive probability:

$$P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S} \frac{1}{1 + \exp(-w_s^T x_{new})}.$$

[Figure: contours of $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2)$ over the $(x_1, x_2)$ plane.]
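The Monte Carlo average itself is only a couple of lines; the sketch below assumes `samples` is the array returned by the MH sketch above:

```python
import numpy as np

def predict_proba(samples, x_new):
    """Approximate P(t_new = 1 | x_new, X, t, sigma^2) by averaging over posterior samples."""
    scores = samples @ x_new                          # w_s^T x_new for every sample s
    return np.mean(1.0 / (1.0 + np.exp(-scores)))     # (1/S) sum_s sigmoid(w_s^T x_new)
```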

53 Summary. We introduced logistic regression, a probabilistic binary classifier, and saw that we couldn't compute the posterior. We introduced examples of two alternatives: the MAP solution, and sampling with Metropolis-Hastings. The second is better than the first (in terms of predictions), but each has greater complexity! To think about: what if the posterior is multi-modal?
