Where now? Machine Learning and Bayesian Inference
|
|
- Evelyn Joseph
- 5 years ago
- Views:
Transcription
1 Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone etension 67 wwwclcamacuk/ sbh/ Where now? There are some simple take-home messages from the study of SVMs: You can get state-of-the-art performance You can do this using the kernel trick to obtain a non-linear model You can do this without invoking the full machinery of the Bayes-optimal classifier Part III: back to Bayes Bayesian neural networks Gaussian processes BUT: You don t have anything keeping you honest regarding which assumptions you re making As we shall see, by using the full-strength probabilistic framework we gain some useful etras In particular, the ability to assign confidences to our predictions Copyright c Sean Holden -7 We re now going to see how the idea of the Bayes-optimal classifier can be applied to neural networks We re only going to consider regression Classification can also be done this way, but it s a bit more complicated For classification we derived the Bayes-optimal classifier as the maimizer of: Pr C, s) = Pr C w, ) pw s) dw For regression the Bayes-optimal classifier ends up having the same epression as we ve already seen We want to compute: We have: A neural network computing a function h w ) In fact this can be pretty much any parameterized function we like) A training sequence s T = [, y ) m, y m ) ], split into and y = y y y m ) X = m ) py, s) = py w, ) pw s) dw }{{}}{{} Likelihood Posterior s is the training set is a new eample to be classified Y is the RV representing the prediction for
2 It turns out that if you try to incorporate the density p) modelling how feature vectors are generated, things can get complicated So: We regard all input vectors as fied: they are not treated as random variables This means that, strictly speaking, they should no longer appear in epressions like py w, ) However, this seems to be uniformly disliked writing py w) for an epression that still depends on seems confusing Solution: write py w; ) instead Note the semi-colon! So we re actually going to look at py y;, X) = py w; ) pw y; X) dw }{{}}{{} Likelihood Posterior NOTE: this is a notational hack There s nothing new, just an attempt at clarity What s going on? Turning prior into posterior Let s make a brief sidetrack into what s going on with the posterior density pw y; X) py w; X)pw) Typically, the prior starts wide and as we see more data the posterior narrows pw y;x) and pw) 8 6 The posterior density pw y; X) becomes more localised 6 wmap Prior Posterior What s going on? Turning prior into posterior This can be seen very clearly if we use real numbers: So now we have three things to do: Eamples Prior density pw) - STEP : remind ourselves what py w; ) is STEP : remind ourselves what pw y; X) is STEP : do the integral This is the fun bit ) - - Likelihood py w; X) w - - w Posterior density pw y;x) - The first two steps are straightforward as we ve already derived them when looking at maimum-likelihood and MAP learning w - - w w - - w 7 8
3 STEP : assuming Gaussian noise is added to the labels so y = h w ) + ɛ where ɛ N, σn) we have the usual likelihood py w; ) = ep ) Y h πσ n σn w )) Here, the subscript in σ n reminds us that it s the variance of the noise Traditionally this is re-written using the hyperparameter so the likelihood is β = σ n py w; ) ep β ) Y h w)) STEP : the posterior is also eactly as it was when we derived the MAP learning algorithms pw y; X) py w; X)pw) and as before, the likelihood is py w; X) ep β ) m y i h w i )) i= = ep βew)) and using a Gaussian prior with mean and covariance Σ = σ I gives pw) ep α ) w where traditionally the second hyperparameter is α = /σ Combining these pw y; X) = )) α w Zα, β) ep + βew) 9 What s going on? Turning prior into posterior Considering the central part of pw y; X): α w + βew) What happens as the number m of eamples increases? The first term corresponding to the prior remains fied The second term corresponding to the likelihood increases Step : putting together steps and, the integral we need to evaluate is: I ep β ) Y h w)) } {{ } Likelihood )) α w ep + βew) }{{} Posterior Obviously this gives us all a sad face because there is no solution So what can we do now? dw So for small training sequences the prior dominates, but for large ones w ML is a good approimation to w MAP
4 In order to make further progress it s necessary to perform integrals of the general form F w)pw y; X) dw I Method : approimation to pw y; X) ep β ) Y h w)) } {{ } Likelihood py w;) )) α w ep + βew) }{{} Posterior pw y;x) dw for various functions F and this is generally not possible There are two ways to get around this: We can use an approimate form for pw y; X) We can use Monte Carlo methods We ll be taking a look at both possibilities The first approach introduces a Gaussian approimation to pw y; X) by using a Taylor epansion of Sw) = α w at the maimum a posteriori weights w MAP This allows us to use a standard integral + βew) The result will be approimate but we hope it s good! Let s recall how Taylor series work Reminder: Taylor epansion Reminder: Taylor epansion In one dimension the Taylor epansion about a point R for a function f : R R is The functions of interest look like this: 6 The function f) 6 The function ep f)) f) f ) +! )f ) +! ) f ) f) ep f)) + + k! ) k f k ) - - What does this look like for the kinds of function we re interested in? As an eample We can try to approimate where ep f)) f) = 7 + This has a form similar to Sw), but in one dimension By replacing f) with its Taylor epansion about its maimum, which is at ma = 7 we can see what the approimation to ep f)) looks like Note that the ep hugely emphasises peaks 6
5 Reminder: Taylor epansion Here are the approimations for k =, k = and k = Reminder: Taylor epansion In multiple dimensions the Taylor epansion for k = is - - Taylor epansion for k = -6 - ep f)) eact The use of k = looks promising Taylor epansion for k = ep f)) using Taylor epansion for k = 6 - Taylor epansion for k = where denotes gradient f) f ) +! ) T f) f) = +! ) T f) ) f) and f) is the matri with elements f) M ij = f) i j ) f) n Looks complicated, but it s just the obvious etension of the -dimensional case) 7 8 Method : approimation to pw y; X) Applying this to Sw) and epanding around w MAP Sw) = α w + βew) Sw MAP ) + w w MAP) T Aw w MAP ) As w MAP minimises the function the first derivatives are zero and the corresponding term in the Taylor epansion disappears The quantity A = Sw) wmap can be simplified This is because Method : approimation to pw y; X) We actually already know something about how to get w MAP : A method such as backpropagation can be used to compute Sw) The vector w MAP can then be obtained using any standard optimisation method such as gradient descent) It s also likely to be straightforward to compute Ew): The quantity Ew) can be evaluated using an etended form of backpropagation α w A = + βew)) = αi + β Ew MAP ) wmap 9
6 A useful integral Dropping for this slide only the special meaning usually given to the vector, here is a useful standard integral: If A R n n is symmetric then for b R n and c R ep T A + T b + c )) d R n = π) n/ A / ep c bt A )) b You re not epected to know how to evaluate this, but see the handout on the course web page if you re curious To make this easy to refer to, let s call it the BIG INTEGRAL Defining we now have an approimation Using the BIG INTEGRAL Method : approimation to pw y; X) w = w w MAP pw y; X) Sw Z ep MAP ) ) wt A w where W is the number of weights Z = π) W/ A / ep Sw MAP )) Let s plug this approimation back into the epression for the Bayes-optimum and see what we get No, I won t ask you to evaluate it in the eam Method : approimation to pw y; X) Method : second approimation I ep β ) Y h w)) } {{ } Likelihood py w;) ep ) wt A w }{{} Approimation to pw y;x) dw This leads to py y;, X) ep β Y hwmap ) g T w ) ) wt A w dw There is still no solution! We need another approimation We can introduce a linear approimation of h w ) at w MAP : h w ) h wmap ) + g T w where g = h w ) wmap By linear approimation we just mean the Taylor epansion for k = ) SUCCESS!!! This integral can be evaluated this is an eercise) using the BIG INTEGRAL to give THE ANSWER py y;, X) ep y h w MAP )) ) πσ Y where σ Y σ Y = β + gt A g We really are making assumptions here this is OK if we assume that pw y; X) is narrow, which depends on A
7 Method : final epression Hooray! But what does it mean? This is a Gaussian density, so we can now see that: py y;, X) peaks at h wmap ) That is, the MAP solution Method : final epression Hooray! But what does it mean? Interpreted graphically: Typical behaviour of the Bayesian solution The variance σ Y can be interpreted as a measure of certainty: The first term of σy is /β and corresponds to the noise The second term of σ Y is gt A g and corresponds to the width of pw y; X) Plotting ±σ Y around the prediction gives a measure of certainty 6 Method II: Markov chain Monte Carlo MCMC) methods The second solution to the problem of performing integrals I = F w)pw y; X)dw is to use Monte Carlo methods The basic approach is to make the approimation I N N F w i ) i= where the w i have distribution pw y; X) Unfortunately, generating w i with a given distribution can be non-trivial MCMC methods A simple technique is to introduce a random walk, so w i+ = w i + ɛ where ɛ is zero mean spherical Gaussian and has small variance Obviously the sequence w i does not have the required distribution However, we can use the Metropolis algorithm, which does not accept all the steps in the random walk: If pw i+ y; X) > pw i y; X) then accept the step Else accept the step with probability pw i+ y;x) pw i y;x) In practice, the Metropolis algorithm has several shortcomings, and a great deal of research eists on improved methods, see: R Neal, Probabilistic inference using Markov chain Monte Carlo methods, University of Toronto, Department of Computer Science Technical Report CRG-TR-9-,
8 A very) brief introduction to how to learn hyperparameters So far in our coverage of the Bayesian approach to neural networks, the hyperparameters α and β were assumed to be known and fied But this is not a good assumption because α corresponds to the width of the prior and β to the noise variance So we really want to learn these from the data as well How can this be done? We now take a look at one of several ways of addressing this problem Note: from now on I m going to leave out the dependencies on and X as leaving them in starts to make everything cluttered The prior and likelihood depend on α and β respectively so we now make this clear and write pw y, α, β) = py w, β)pw α) py α, β) Don t worry about recalling the actual epressions for the prior and likelihood we re not going to delve deep enough to need them Let s write down directly something that might be useful to know: pα, β y) = py α, β)pα, β) py) 9 Hierarchical Bayes and the evidence If we know pα, β y) then a straightforward approach is to use the values for α and β that maimise it: argma pα, β y) α,β Here is a standard trick: assume that the prior pα, β) is flat, so that we can just maimise py α, β) Hierarchical Bayes and the evidence The quantity py α, β) is called the evidence or marginal likelihood When we re-wrote our earlier equation for the posterior density of the weights, making α and β eplicit, we found pw y, α, β) = py w, β)pw α) py α, β) This is called type II maimum likelihood and is one common way of doing the job So the evidence is the denominator in this equation This is the common pattern and leads to the idea of hierarchical Bayes: the evidence for the hyperparameters at one level is the denominator in the relevant application of Bayes theorem
9 There is an alternative approach to Bayesian regression and classification: The fundamental idea is to not think in terms of weights w that specify functions Instead the idea is to deal with functions directly Fundamental to this is the concept of a Gaussian process We will continue to omit the dependencies on and X to keep the notation simple We have seen that inference can be performed by: Computing the posterior density pw y) of the parameters given the observed labels Computing the Bayes-optimal prediction py y) = py w)pw y) dw which is the epected value of the likelihood for a new point Choosing any hyperparameters p using the evidence py p) But shouldn t we deal with functions directly, not via parameters? What happens if we deal directly with functions f, rather than choosing them via parameters? f) Can we change the equation for prediction to py y) = py f)pf y) df in any sensible way? This obviously requires us to talk about probability densities over functions That is probably not something you have ever seen before In the diagram: four samples f pf ) from a probability density defined on functions f) Can we change the equation for prediction to py y) = py f)pf y) df in any sensible way? This is quite straightforward, using the concept of a Gaussian process GP) 6
10 Definition: say we have a set of RVs This set forms a Gaussian process if any finite subset of them is jointly Gaussian distributed The same four samples f pf ), where F is in fact a GP The crosses mark the values of the sampled functions at four different values of f) - - What happens when we randomly select a function that is a GP? We are only ever interested in a finite number of its values This is because we only need to deal with the values in the training set and for any new points we want to predict Consequently we can use a GP as a prior rather than having a prior pw) Note again the key point: we are randomly selecting functions and we can say something about their behaviour for any finite collection of arguments And that is enough, as we only ever have finite quantities of data - - Because F is a GP any such finite set of values has a jointly Gaussian distribution 7 8 GP priors To specify a GP on vectors in R n, we just need: Squared eponential prior, l = / γ-eponential prior, l = γ = / A mean function m : R n R A covariance function k : R n R n R f) - f) - - m) = E f F [f)] k, ) = E f F [f ) m ))f ) m ))] Rational quadratic prior, l = α = Neural network prior, σ =, σ = We then write F GPm, k) to denote that F is a GP By specifying m and k we get different kinds of function when sampling F f) f)
11 Polynomial; Eponential: Squared eponential: Gamma eponential: Covariance functions k, ) = c + T ) k k, ) = ep ) l k, ) = ep ) l k, ) = ep l ) γ ) Covariance functions Rational quadratic; k, ) = + ) α αl Eponential: k, ) = sin ) T Σ ) + )T Σ ) + )T Σ ))/ where ) T = [ T] As usual these have associated hyperparameters These have to be dealt with correctly as always Gaussian processes: generating data Gaussian processes: generating data with noise Say we have some data y i = f i ) for i =,, m and f GPm, k) Remember, the i are fied, not RVs) Any finite set of points must be jointly Gaussian So py) = N m, K) where m T = [ m ) m m ) ] and K is the Gram matri K ij = k i, j ) Note : this is not py f) We can completely remove the need for integration! Note : from now on we will assume m) = It is straightforward to incorporate a non-zero mean) Now add noise to the data Say we add Gaussian noise so y i = f i ) + ɛ i Again, i =,, m and f GPm, k), but now we also have ɛ i N, σ ) As we are adding Gaussian RVs, we have py) = N, K + σ I) BUT: in order to do prediction we actually need to involve a new point, for which we want to predict the corresponding value y
12 Gaussian processes: prediction SO: we incorporate, for which we want to predict the corresponding value y By eactly the same argument py, y) = N, K ) where [ ] k k K T = k K + σ I k T = [ k, ) k, m ) ] k = k, ) + σ Note : all we ve done here is to epand the Gram matri by an etra row and column to get K Note : whether or not you include σ in k is a matter of choice What difference does it make? This is an Eercise) Split so and correspondingly Gaussian density: marginals and conditionals For a normal RV N µ, Σ) p) = π)d Σ ep ) µ)t Σ µ) µ = What are p ) and p )? [ µ µ ] = [ ] [ ] Σ Σ Σ = Σ Σ 6 Gaussian density: marginals and conditionals Define the precision matri [ ] Λ = Σ Λ Λ = Λ Λ It is possible to show that p ) = N µ, Σ ) p ) = N µ Λ Λ µ ), Λ ) Inverting a block matri In the last slide, we see: [ ] Σ Λ Λ = Λ Λ Re-writing Σ as [ ] A B Σ = C D it is possible to show it is an Eercise to do this) that Λ = A Λ = A BD Λ = D CA Λ = D + D CA BD where A = A BD C) 7 8
13 GP regression GP regression To do prediction all that s left is to compute py y) Squared eponential prior, l = / γ-eponential prior, l = γ = / Because everything is Gaussian this turns out to be easy: py, y) = N, K ) [ ] k k K T = k L L = K + σ I From these we want to know py y) f) f) Rational quadratic prior, l = a = Neural network prior, σ =, σ = Only two things are needed: the inverse formula for a block matri and the formula for obtaining a conditional from a joint Gaussian Using these we can show it is an Eercise to derive this) that f) f) py y) = N k } T L {{ y }, } k k{{ T L k} ) Mean Variance Learning the hyperparameters A nice side-effect of this formulation is that we get a usable epression for the marginal likelihood If we incorporate the hyperparameters p, which in this case are any parameters associated with k, ) along with σ, then we ve just computed py y, p) = py, y p) py p) The denominator is the marginal likelihood, and we computed it above on slide : py p) = N, L) = π)m L ep ) yt L y Learning the hyperparameters As usual this looks nicer if we consider its log This is a rare beast: log py p) = log L yt L y d log π It s a sensible formula that tells you how good a set p of hyperparameters is That means you can use it as an alternative to cross-validation to search for hyperparameters As a bonus you can generally differentiate it so it s possible to use gradientbased search
Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationThe maximum margin classifier. Machine Learning and Bayesian Inference. Dr Sean Holden Computer Laboratory, Room FC06
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC06 Suggestion: why not drop all this probability nonsense and just do this: 2 Telephone etension 63725 Email: sbh@cl.cam.ac.uk
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBayesian Linear Regression [DRAFT - In Progress]
Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationAn example to illustrate frequentist and Bayesian approches
Frequentist_Bayesian_Eample An eample to illustrate frequentist and Bayesian approches This is a trivial eample that illustrates the fundamentally different points of view of the frequentist and Bayesian
More information6.867 Machine learning
6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationLecture 3: Pattern Classification. Pattern classification
EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationBayesian Inference of Noise Levels in Regression
Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop
More informationRegression with Input-Dependent Noise: A Bayesian Treatment
Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,
More informationNotes on Discriminant Functions and Optimal Classification
Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationMachine Learning and Bayesian Inference
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 63725 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Copyright c Sean Holden 22-17. 1 Artificial
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationTDA231. Logistic regression
TDA231 Devdatt Dubhashi dubhashi@chalmers.se Dept. of Computer Science and Engg. Chalmers University February 19, 2016 Some data 5 x2 0 5 5 0 5 x 1 In the Bayes classifier, we built a model of each class
More informationState Space and Hidden Markov Models
State Space and Hidden Markov Models Kunsch H.R. State Space and Hidden Markov Models. ETH- Zurich Zurich; Aliaksandr Hubin Oslo 2014 Contents 1. Introduction 2. Markov Chains 3. Hidden Markov and State
More informationApproximate inference, Sampling & Variational inference Fall Cours 9 November 25
Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs
More information1 Rational Exponents and Radicals
Introductory Algebra Page 1 of 11 1 Rational Eponents and Radicals 1.1 Rules of Eponents The rules for eponents are the same as what you saw earlier. Memorize these rules if you haven t already done so.
More informationSTA 414/2104, Spring 2014, Practice Problem Set #1
STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,
More informationBayesian Support Vector Machines for Feature Ranking and Selection
Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction
More informationLecture 1c: Gaussian Processes for Regression
Lecture c: Gaussian Processes for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationArtificial Neural Networks 2
CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b
More informationArtificial Intelligence: what have we seen so far? Machine Learning and Bayesian Inference
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6375 Email: sbh@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh/ Artificial Intelligence: what have we seen so
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationIntroduction. So, why did I even bother to write this?
Introduction This review was originally written for my Calculus I class, but it should be accessible to anyone needing a review in some basic algebra and trig topics. The review contains the occasional
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationGaussians. Hiroshi Shimodaira. January-March Continuous random variables
Cumulative Distribution Function Gaussians Hiroshi Shimodaira January-March 9 In this chapter we introduce the basics of how to build probabilistic models of continuous-valued data, including the most
More informationLecture 4: Training a Classifier
Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as
More informationSupervised Learning Coursework
Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session
More informationBayesian Approach 2. CSC412 Probabilistic Learning & Reasoning
CSC412 Probabilistic Learning & Reasoning Lecture 12: Bayesian Parameter Estimation February 27, 2006 Sam Roweis Bayesian Approach 2 The Bayesian programme (after Rev. Thomas Bayes) treats all unnown quantities
More informationRelevance Vector Machines
LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian
More informationCreating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005
Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over
More informationBayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine
Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview
More informationGaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1
Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan
Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More informationProbabilistic Graphical Models Lecture 20: Gaussian Processes
Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationQuadratic Equations Part I
Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationOverview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation
Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More informationComputer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression
Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
More informationCSC2515 Assignment #2
CSC2515 Assignment #2 Due: Nov.4, 2pm at the START of class Worth: 18% Late assignments not accepted. 1 Pseudo-Bayesian Linear Regression (3%) In this question you will dabble in Bayesian statistics and
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London
More informationCOMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation
COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted
More informationRational Expressions & Equations
Chapter 9 Rational Epressions & Equations Sec. 1 Simplifying Rational Epressions We simply rational epressions the same way we simplified fractions. When we first began to simplify fractions, we factored
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 7 Unsupervised Learning Statistical Perspective Probability Models Discrete & Continuous: Gaussian, Bernoulli, Multinomial Maimum Likelihood Logistic
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationDevelopment of Stochastic Artificial Neural Networks for Hydrological Prediction
Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental
More informationLecture 4: Training a Classifier
Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as
More informationBayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge
Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference
More information1 Bayesian Linear Regression (BLR)
Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,
More informationGaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction
Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction Agathe Girard Dept. of Computing Science University of Glasgow Glasgow, UK agathe@dcs.gla.ac.uk Carl Edward Rasmussen Gatsby
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationAnalytic Long-Term Forecasting with Periodic Gaussian Processes
Nooshin Haji Ghassemi School of Computing Blekinge Institute of Technology Sweden Marc Peter Deisenroth Department of Computing Imperial College London United Kingdom Department of Computer Science TU
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationCALCULUS I. Integrals. Paul Dawkins
CALCULUS I Integrals Paul Dawkins Table of Contents Preface... ii Integrals... Introduction... Indefinite Integrals... Computing Indefinite Integrals... Substitution Rule for Indefinite Integrals... More
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationLecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu
Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes
More informationIntroduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem
CS 189 Introduction to Machine Learning Spring 2018 Note 22 1 Duality As we have seen in our discussion of kernels, ridge regression can be viewed in two ways: (1) an optimization problem over the weights
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More information+ y = 1 : the polynomial
Notes on Basic Ideas of Spherical Harmonics In the representation of wavefields (solutions of the wave equation) one of the natural considerations that arise along the lines of Huygens Principle is the
More informationMobile Robot Localization
Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations
More information18.9 SUPPORT VECTOR MACHINES
744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the
More informationThe binary entropy function
ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but
More informationLecture Machine Learning (past summer semester) If at least one English-speaking student is present.
Advanced Machine Learning Lecture 1 Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwth-aachen.de/
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationGenerative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul
Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationMachine Learning Lecture 2
Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More information17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk
94 Jonathan Richard Shewchuk 7 Neural Networks NEURAL NETWORKS Can do both classification & regression. [They tie together several ideas from the course: perceptrons, logistic regression, ensembles of
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More information