TDA231. Logistic regression
1 TDA231. Devdatt Dubhashi, Dept. of Computer Science and Engineering, Chalmers University. February 19, 2016.
2 Some data. [Figure: scatter plot of the training data in the $(x_1, x_2)$ plane.]
3-6 In the Bayes classifier, we built a model of each class and then used Bayes' rule: $P(T_{new} = k \mid x_{new}, X, t) = \frac{p(x_{new} \mid T_{new} = k, X, t)\, p(T_{new} = k)}{\sum_j p(x_{new} \mid T_{new} = j, X, t)\, p(T_{new} = j)}$. An alternative is to directly model $P(T_{new} = k \mid x_{new}, X, t) = f(x_{new}; w)$ with some parameters $w$. We've seen $f(x_{new}; w) = w^T x_{new}$ before; can we use it here? No: its output is unbounded and so can't be a probability. But we can use $P(T_{new} = k \mid x_{new}, w) = h(f(x_{new}; w))$, where $h(\cdot)$ squashes $f(x_{new}; w)$ to lie between 0 and 1, i.e. a probability.
7 $h(\cdot)$: for (binary) logistic regression, we use the sigmoid function: $P(T_{new} = 1 \mid x_{new}, w) = h(w^T x_{new}) = \frac{1}{1 + \exp(-w^T x_{new})}$. [Figure: the sigmoid $1/(1 + \exp(-w^T x))$ plotted against $w^T x$.]
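As a minimal sketch (not from the slides; it assumes NumPy vectors `w` and `x_new` of equal length), the squashing function and the resulting class probability look like this:

```python
import numpy as np

def sigmoid(a):
    """Squash a real number (or array) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def p_class1(x_new, w):
    """P(T_new = 1 | x_new, w) = h(w^T x_new) for binary logistic regression."""
    return sigmoid(w @ x_new)
```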
8-9 Bayesian logistic regression. Recall Bayesian ideas... In theory, if we place a prior on $w$ and define a likelihood, we can obtain a posterior: $p(w \mid X, t) = \frac{p(t \mid X, w)\, p(w)}{p(t \mid X)}$. And we can make predictions by taking expectations (averaging over $w$): $P(T_{new} = 1 \mid x_{new}, X, t) = \mathbb{E}_{p(w \mid X, t)}\{P(T_{new} = 1 \mid x_{new}, w)\}$. Sounds good so far...
10 Defining a prior. Choose a Gaussian prior: $p(w) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \sigma^2)$. Prior choice is always important from a data-analysis point of view. Previously, it was also important for the maths. This isn't the case today: we could choose any prior, since no prior makes the maths easier!
11-12 Logistic regression: likelihood. First assume independence: $p(t \mid X, w) = \prod_{n=1}^{N} p(t_n \mid x_n, w)$. We have already defined this: it's our squashing function! If $t_n = 1$: $P(t_n = 1 \mid x_n, w) = \frac{1}{1 + \exp(-w^T x_n)}$, and if $t_n = 0$: $P(t_n = 0 \mid x_n, w) = 1 - P(t_n = 1 \mid x_n, w)$.
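A sketch of the corresponding likelihood, assuming a hypothetical N x D data matrix `X` and binary labels `t` in {0, 1}; working in log space avoids numerical underflow when N is large:

```python
import numpy as np

def log_likelihood(w, X, t):
    """log p(t | X, w) = sum_n [ t_n log h(w^T x_n) + (1 - t_n) log(1 - h(w^T x_n)) ]."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # h(w^T x_n) for every row x_n of X
    eps = 1e-12                        # guard against log(0)
    return np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```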
13 Defining a prior. Choose a Gaussian prior: $p(w) = \prod_{d=1}^{D} \mathcal{N}(w_d \mid 0, \sigma^2)$. Prior choice is always important from a data-analysis point of view. Previously, it was also important for the maths. This isn't the case today: we could choose any prior, since no prior makes the maths easier!
14 Posterior. $p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}$. Now things start going wrong. We can't compute $p(w \mid X, t)$ analytically: the prior is not conjugate to the likelihood, and no prior is! This means we don't know the form of $p(w \mid X, t, \sigma^2)$, and we can't compute the marginal likelihood: $p(t \mid X, \sigma^2) = \int p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)\, dw$
15-19 What can we compute? $p(w \mid X, t, \sigma^2) = \frac{p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)}{p(t \mid X, \sigma^2)}$. We can compute the numerator $p(t \mid X, w, \sigma^2)\, p(w \mid \sigma^2)$, so define $g(w; X, t, \sigma^2) = p(t \mid X, w)\, p(w \mid \sigma^2)$. Armed with this, we have three options:
- Find the most likely value of $w$ (a point estimate).
- Approximate $p(w \mid X, t, \sigma^2)$ with something easier.
- Sample from $p(w \mid X, t, \sigma^2)$.
These aren't the only ways of approximating/sampling, and they are general techniques, not unique to logistic regression.
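Although the posterior itself is unavailable, $g$ is easy to evaluate. A sketch in log space, reusing the hypothetical `log_likelihood` above and taking `sigma2` as the prior variance:

```python
import numpy as np

def log_prior(w, sigma2):
    """log N(w; 0, sigma2 * I) for the isotropic Gaussian prior."""
    D = len(w)
    return -0.5 * D * np.log(2 * np.pi * sigma2) - 0.5 * np.dot(w, w) / sigma2

def log_g(w, X, t, sigma2):
    """log g(w; X, t, sigma2) = log p(t | X, w) + log p(w | sigma2)."""
    return log_likelihood(w, X, t) + log_prior(w, sigma2)
```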
20-21 MAP estimate. Our first method is to find the value of $w$ that maximises $p(w \mid X, t, \sigma^2)$ (call it $\hat{w}$). Since $g(w; X, t, \sigma^2) \propto p(w \mid X, t, \sigma^2)$, $\hat{w}$ therefore also maximises $g(w; X, t, \sigma^2)$. Very similar to maximum likelihood, but with the additional effect of the prior. Known as the MAP (maximum a posteriori) solution. Once we have $\hat{w}$, we make predictions with: $P(t_{new} = 1 \mid x_{new}, \hat{w}) = \frac{1}{1 + \exp(-\hat{w}^T x_{new})}$
22-24 MAP. When we met maximum likelihood, we could find $\hat{w}$ exactly with some algebra. We can't do that here (we can't solve $\frac{\partial g(w; X, t, \sigma^2)}{\partial w} = 0$ analytically). So we resort to numerical optimisation: 1. Guess $\hat{w}$. 2. Change it a bit in a way that increases $g(w; X, t, \sigma^2)$. 3. Repeat until no further increase is possible. Many algorithms exist that differ in how they do step 2, e.g. gradient descent. Not covered in this course; you just need to know that sometimes we can't do things analytically and there are methods to help us! Ask John!
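As one hedged illustration of step 2 (a plain gradient method; the step size and iteration budget below are arbitrary choices, not values from the lecture), we can ascend the gradient of $\log g$, which for this model is $X^T(t - h(Xw)) - w/\sigma^2$:

```python
import numpy as np

def map_estimate(X, t, sigma2, step_size=0.1, n_iters=5000):
    """Numerically maximise log g(w; X, t, sigma2) by gradient ascent."""
    w = np.zeros(X.shape[1])                 # 1. guess w_hat
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # h(w^T x_n) for every data point
        grad = X.T @ (t - p) - w / sigma2    # gradient of log g at the current w
        w = w + step_size * grad             # 2. move a bit in the uphill direction
    return w                                 # 3. stop after a fixed budget of steps
```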
25 MAP numerical optimisation for our data. [Figure: left, the data; right, the evolution of the components of $\hat{w}$ against iteration number.]
26-27 Decision boundary. Once we have $\hat{w}$, we can classify new examples. The decision boundary is a useful visualisation: it is the line corresponding to $P(T_{new} = 1 \mid x_{new}, \hat{w}) = 0.5$. [Figure: the data with the decision boundary overlaid.] Setting $\frac{1}{1 + \exp(-\hat{w}^T x_{new})} = \frac{1}{2}$ gives $\exp(-\hat{w}^T x_{new}) = 1$, i.e. $\hat{w}^T x_{new} = 0$.
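A tiny worked example (the numbers for $\hat{w}$ are made up, and we assume a 2-D input with no bias term, as in the plots): classification just checks the sign of $\hat{w}^T x_{new}$, and the boundary is the line on which that score is zero.

```python
import numpy as np

w_hat = np.array([1.5, -2.0])       # hypothetical MAP estimate for 2-D data
x_new = np.array([0.4, 0.3])

score = w_hat @ x_new               # w_hat^T x_new = 1.5*0.4 - 2.0*0.3 = 0.0
label = 1 if score > 0 else 0       # exactly on the boundary here, so P(T_new = 1) = 0.5
slope = -w_hat[0] / w_hat[1]        # boundary line x2 = slope * x1 (when w_hat[1] != 0)
```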
28 Predictive probabilities. [Figure: contours of $P(T_{new} = 1 \mid x_{new}, \hat{w})$ over the data.] Do they look sensible?
29 Sampling from the posterior. Suppose we can produce samples $w_1, w_2, \ldots, w_s, \ldots$ from $p(w \mid X, t, \sigma^2)$. Then we can average the predictions to approximate the expectation under $p(w \mid X, t, \sigma^2)$: $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} = 1 \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S}\frac{1}{1 + \exp(-w_s^T x_{new})}$
30 Magic! We can sample directly from $p(w \mid X, t, \sigma^2)$ even though we can't compute it! Various algorithms exist; we'll use Metropolis-Hastings.
31-34 Back to the script: Metropolis-Hastings. MH produces a sequence of samples $w_1, w_2, \ldots, w_s, \ldots$. Imagine we've just produced $w_{s-1}$. MH first proposes a possible $w_s$ (call it $\tilde{w}_s$) based on $w_{s-1}$. MH then decides whether or not to accept $\tilde{w}_s$: if accepted, $w_s = \tilde{w}_s$; if not, $w_s = w_{s-1}$. Two distinct steps: proposal and acceptance.
35-37 MH proposal. Treat $\tilde{w}_s$ as a random variable conditioned on $w_{s-1}$, i.e. we need to define $p(\tilde{w}_s \mid w_{s-1})$. Note that this does not necessarily have to be similar to the posterior we're trying to sample from; we can choose whatever we like! E.g. use a Gaussian centred on $w_{s-1}$ with some covariance: $p(\tilde{w}_s \mid w_{s-1}, \Sigma_p) = \mathcal{N}(w_{s-1}, \Sigma_p)$. [Figure: example proposal densities for two different settings of $\Sigma_p$.]
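A sketch of this Gaussian random-walk proposal, with `Sigma_p` a hypothetical proposal covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

def propose(w_prev, Sigma_p):
    """Draw w_tilde ~ N(w_prev, Sigma_p)."""
    return rng.multivariate_normal(mean=w_prev, cov=Sigma_p)
```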
38-41 MH acceptance. The choice of acceptance is based on the following ratio: $r = \frac{p(\tilde{w}_s \mid X, t, \sigma^2)}{p(w_{s-1} \mid X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)}$, which simplifies to (all of which we can compute): $r = \frac{g(\tilde{w}_s; X, t, \sigma^2)}{g(w_{s-1}; X, t, \sigma^2)} \cdot \frac{p(w_{s-1} \mid \tilde{w}_s, \Sigma_p)}{p(\tilde{w}_s \mid w_{s-1}, \Sigma_p)}$. We now use the following rules: if $r \geq 1$, accept: $w_s = \tilde{w}_s$; if $r < 1$, accept with probability $r$. If we do this enough, we'll eventually be sampling from $p(w \mid X, t)$, no matter where we started, i.e. for any $w_1$.
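Putting proposal and acceptance together, here is a minimal Metropolis-Hastings sketch for this posterior. It reuses the hypothetical `log_g` from earlier; because the Gaussian random-walk proposal is symmetric, the proposal terms cancel in $r$, so only the ratio of $g$ values is needed, computed in log space:

```python
import numpy as np

def metropolis_hastings(X, t, sigma2, Sigma_p, n_samples=5000, w_init=None):
    """Sample from p(w | X, t, sigma2) using a Gaussian random-walk proposal."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
    log_g_curr = log_g(w, X, t, sigma2)
    samples = []
    for _ in range(n_samples):
        w_tilde = rng.multivariate_normal(w, Sigma_p)   # proposal step
        log_g_prop = log_g(w_tilde, X, t, sigma2)
        log_r = log_g_prop - log_g_curr                 # log acceptance ratio (symmetric proposal)
        if np.log(rng.uniform()) < log_r:               # accept if r >= 1, or with probability r
            w, log_g_curr = w_tilde, log_g_prop
        samples.append(w.copy())                        # if rejected, repeat the previous sample
    return np.array(samples)
```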
42 Where to Start? Convergence theorem for Markov chains: no matter where the chain is started, the MH process will always converge (under some technical conditions) to its target distribution!
43-46 When to Stop? How do we know the Markov chain has converged? Two options: start chains from different starting points and run until they look the same, or apply statistical hypothesis testing to the empirical distributions.
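As one concrete (and hedged) version of the "compare several chains" idea, the Gelman-Rubin statistic, which is not part of the lecture itself, compares between-chain and within-chain variance; values close to 1 suggest the chains have mixed:

```python
import numpy as np

def gelman_rubin(chains):
    """chains: array of shape (M, N) -- M chains, each with N draws of one scalar parameter."""
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # average within-chain variance
    B = N * chain_means.var(ddof=1)           # between-chain variance
    V_hat = (N - 1) / N * W + B / N           # pooled variance estimate
    return np.sqrt(V_hat / W)                 # R-hat: close to 1 suggests convergence
```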
47 MH flowchart: set $s = 1$ and choose $w_s$; then repeatedly set $s = s + 1$, generate $\tilde{w}_s$ from $p(\tilde{w}_s \mid w_{s-1})$ and compute the acceptance ratio $r$; if $r \geq 1$, set $w_s = \tilde{w}_s$; otherwise generate $u$ from $U(0, 1)$ and set $w_s = \tilde{w}_s$ if $u \leq r$, else $w_s = w_{s-1}$.
48-49 MH walkthrough. [Figures: a sequence of panels showing proposals and accepted samples of $(w_1, w_2)$ as the sampler progresses.]
50 What do the samples look like? [Figure: samples of $(w_1, w_2)$ drawn from the posterior using MH.]
51-52 Predictions with MH. MH provides us with a set of samples $w_1, \ldots, w_S$. These can be used to approximate the posterior predictive probability: $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2) = \mathbb{E}_{p(w \mid X, t, \sigma^2)}\{P(t_{new} = 1 \mid x_{new}, w)\} \approx \frac{1}{S}\sum_{s=1}^{S}\frac{1}{1 + \exp(-w_s^T x_{new})}$. [Figure: contours of $P(t_{new} = 1 \mid x_{new}, X, t, \sigma^2)$ over the data.]
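A sketch of this Monte-Carlo average, assuming `samples` is the S x D array returned by the hypothetical sampler above:

```python
import numpy as np

def predictive_prob(x_new, samples):
    """P(t_new = 1 | x_new, X, t, sigma2) ~ (1/S) sum_s 1 / (1 + exp(-w_s^T x_new))."""
    scores = samples @ x_new                        # w_s^T x_new for every sample
    return np.mean(1.0 / (1.0 + np.exp(-scores)))
```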
53 Summary. Introduced logistic regression, a probabilistic binary classifier. Saw that we couldn't compute the posterior. Introduced examples of two alternatives: the MAP solution, and sampling via Metropolis-Hastings. The second is better than the first (in terms of predictions)... but it also has greater complexity! To think about: what if the posterior is multi-modal?