Where now? Machine Learning and Bayesian Inference

Machine Learning and Bayesian Inference
Dr Sean Holden, Computer Laboratory, Room FC6
Telephone extension 63725, www.cl.cam.ac.uk/~sbh/
Copyright (c) Sean Holden 2002-17.

Where now?

There are some simple take-home messages from the study of SVMs:

- You can get state-of-the-art performance.
- You can do this using the kernel trick to obtain a non-linear model.
- You can do this without invoking the full machinery of the Bayes-optimal classifier.

BUT: you don't have anything keeping you honest regarding which assumptions you're making. As we shall see, by using the full-strength probabilistic framework we gain some useful extras, in particular the ability to assign confidences to our predictions.

Part III: back to Bayes

- Bayesian neural networks
- Gaussian processes

Bayesian neural networks

We're now going to see how the idea of the Bayes-optimal classifier can be applied to neural networks. We're only going to consider regression; classification can also be done this way, but it's a bit more complicated. For classification we derived the Bayes-optimal classifier as the maximizer of

    Pr(C | x, s) = ∫ Pr(C | w, x) p(w | s) dw

For regression the Bayes-optimal predictor ends up having the same expression as we've already seen. We have:

- A neural network computing a function h_w(x). (In fact this can be pretty much any parameterized function we like.)
- A training sequence s^T = [(x_1, y_1), ..., (x_m, y_m)], split into y = (y_1 y_2 ... y_m)^T and X = (x_1 x_2 ... x_m).

Here s is the training set, x is a new example to be classified, and Y is the RV representing the prediction for x. We want to compute

    p(Y | x, s) = ∫ p(Y | w, x) p(w | s) dw

where the first factor in the integrand is the likelihood and the second is the posterior.

It turns out that if you try to incorporate the density p(x), modelling how feature vectors are generated, things can get complicated. So: we regard all input vectors x as fixed: they are not treated as random variables. This means that, strictly speaking, they should no longer appear in expressions like p(Y | w, x). However, this seems to be uniformly disliked: writing p(Y | w) for an expression that still depends on x seems confusing. Solution: write p(Y | w; x) instead. Note the semi-colon! So we're actually going to look at

    p(Y | y; x, X) = ∫ p(Y | w; x) p(w | y; X) dw

where the first factor in the integrand is the likelihood and the second is the posterior. NOTE: this is a notational hack. There's nothing new, just an attempt at clarity.

What's going on? Turning prior into posterior

Let's make a brief sidetrack into what's going on with the posterior density

    p(w | y; X) ∝ p(y | w; X) p(w)

Typically the prior starts wide, and as we see more data the posterior narrows: the posterior density p(w | y; X) becomes more localised around w_MAP. [Figure: the prior p(w) and the posterior p(w | y; X), with w_MAP marked.] This can be seen very clearly if we use real numbers. [Figure: example data, the prior density p(w), the likelihood p(y | w; X) and the posterior density p(w | y; X).]

So now we have three things to do:

- STEP 1: remind ourselves what p(Y | w; x) is.
- STEP 2: remind ourselves what p(w | y; X) is.
- STEP 3: do the integral. (This is the fun bit.)

The first two steps are straightforward, as we've already derived them when looking at maximum-likelihood and MAP learning.
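
Before going through the steps, here is a tiny numerical illustration of the prior-to-posterior narrowing just described. It uses a deliberately simple assumed model (a single weight w, with y = wx plus Gaussian noise) so that the posterior is available in closed form; the model and all numbers are illustrative only, not the example from the slides.

```python
import numpy as np

alpha, beta = 1.0, 25.0            # prior precision and noise precision (illustrative)
w_true = 0.8                       # assumed "true" weight used to simulate data
rng = np.random.default_rng(0)

for m in [0, 1, 5, 50]:
    x = rng.uniform(-1, 1, size=m)
    y = w_true * x + rng.normal(0.0, 1.0 / np.sqrt(beta), size=m)
    # With prior w ~ N(0, 1/alpha) and likelihood precision beta per example,
    # the posterior over w is Gaussian with:
    post_prec = alpha + beta * np.sum(x ** 2)
    post_mean = beta * np.sum(x * y) / post_prec
    print(f"m = {m:3d}: posterior mean = {post_mean:+.3f}, std = {post_prec ** -0.5:.3f}")
```

The printed standard deviation shrinks as m grows: the prior dominates for small training sets and the data dominate for large ones.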

STEP 1: assuming Gaussian noise is added to the labels, so y = h_w(x) + ε where ε ~ N(0, σ_n²), we have the usual likelihood

    p(Y | w; x) = (1/√(2πσ_n²)) exp(−(Y − h_w(x))² / (2σ_n²))

Here the subscript in σ_n reminds us that this is the noise variance. Traditionally this is re-written using the hyperparameter β = 1/σ_n², so the likelihood is

    p(Y | w; x) ∝ exp(−(β/2)(Y − h_w(x))²)

STEP 2: the posterior is also exactly as it was when we derived the MAP learning algorithms,

    p(w | y; X) ∝ p(y | w; X) p(w)

and, as before, the likelihood is

    p(y | w; X) ∝ exp(−(β/2) Σ_{i=1}^m (y_i − h_w(x_i))²) = exp(−βE(w))

where E(w) = (1/2) Σ_{i=1}^m (y_i − h_w(x_i))². Using a Gaussian prior with mean 0 and covariance Σ = σ²I gives

    p(w) ∝ exp(−(α/2)‖w‖²)

where traditionally the second hyperparameter is α = 1/σ². Combining these,

    p(w | y; X) = (1/Z(α, β)) exp(−((α/2)‖w‖² + βE(w)))

What's going on? Turning prior into posterior

Considering the central part of p(w | y; X),

    (α/2)‖w‖² + βE(w)

what happens as the number m of examples increases? The first term, corresponding to the prior, remains fixed; the second term, corresponding to the likelihood, increases. So for small training sequences the prior dominates, but for large ones w_ML is a good approximation to w_MAP.

STEP 3: putting together steps 1 and 2, the integral we need to evaluate is

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−((α/2)‖w‖² + βE(w))) dw

with the first factor being the likelihood and the second the (unnormalised) posterior. Obviously this gives us all a sad face, because there is no closed-form solution. So what can we do now?
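
To make the quantities above concrete, here is a toy numpy sketch of h_w(x), E(w) and the central part S(w) = (α/2)‖w‖² + βE(w), assuming a small one-hidden-layer tanh network. The architecture, the weight packing and the names (h_w, E, S, n_hidden) are illustrative assumptions, not the lecture's own code.

```python
import numpy as np

def h_w(w, X, n_hidden=5):
    """Toy one-hidden-layer tanh network; w packs (W1, b1, W2, b2)."""
    d = X.shape[1]
    W1 = w[:n_hidden * d].reshape(n_hidden, d)
    b1 = w[n_hidden * d:n_hidden * (d + 1)]
    W2 = w[n_hidden * (d + 1):n_hidden * (d + 2)]
    b2 = w[n_hidden * (d + 2)]
    return np.tanh(X @ W1.T + b1) @ W2 + b2

def E(w, X, y):
    """Sum-of-squares error E(w) = (1/2) sum_i (y_i - h_w(x_i))^2."""
    return 0.5 * np.sum((y - h_w(w, X)) ** 2)

def S(w, X, y, alpha, beta):
    """Minus the log posterior, up to a constant: (alpha/2)||w||^2 + beta E(w)."""
    return 0.5 * alpha * (w @ w) + beta * E(w, X, y)

# Total number of weights for d-dimensional inputs: n_hidden*d + 2*n_hidden + 1.
```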

In order to make further progress it's necessary to perform integrals of the general form

    ∫ F(w) p(w | y; X) dw

for various functions F, and this is generally not possible. There are two ways to get around this:

- We can use an approximate form for p(w | y; X).
- We can use Monte Carlo methods.

We'll be taking a look at both possibilities.

Method 1: approximation to p(w | y; X)

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−((α/2)‖w‖² + βE(w))) dw

with the first factor being the likelihood p(Y | w; x) and the second the posterior p(w | y; X). The first approach introduces a Gaussian approximation to p(w | y; X) by using a Taylor expansion of

    S(w) = (α/2)‖w‖² + βE(w)

at the maximum a posteriori weights w_MAP. This allows us to use a standard integral. The result will be approximate, but we hope it's good! Let's recall how Taylor series work.

Reminder: Taylor expansion

In one dimension, the Taylor expansion about a point x₀ ∈ R for a function f : R → R is

    f(x) ≈ f(x₀) + (1/1!)(x − x₀)f'(x₀) + (1/2!)(x − x₀)²f''(x₀) + ... + (1/k!)(x − x₀)^k f^(k)(x₀)

What does this look like for the kinds of function we're interested in? The functions of interest look like exp(−f(x)). [Figure: the function f(x) and the function exp(−f(x)).] As an example, we can try to approximate exp(−f(x)) for a particular f(x) that has a form similar to S(w), but in one dimension. By replacing f(x) with its Taylor expansion about the point at which exp(−f(x)) is maximised, we can see what the approximation to exp(−f(x)) looks like. Note that the exp hugely emphasises peaks. (A sketch of this idea appears below.)
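
The particular one-dimensional f(x) used on the slides is not recoverable from this transcription, so the sketch below uses an assumed quartic to illustrate the same idea: replace f with its second-order Taylor expansion about the point where exp(−f(x)) peaks, then compare the two curves.

```python
import numpy as np

f = lambda x: 0.25 * x**4 - x**2 + 0.5 * x      # assumed example function
df = lambda x: x**3 - 2.0 * x + 0.5             # f'
d2f = lambda x: 3.0 * x**2 - 2.0                # f''

# Find the minimum of f (the maximum of exp(-f)) by a crude grid search.
xs = np.linspace(-3.0, 3.0, 10001)
x0 = xs[np.argmin(f(xs))]

# Second-order Taylor approximation of f about x0 (f'(x0) is ~0 there).
f_quad = lambda x: f(x0) + df(x0) * (x - x0) + 0.5 * d2f(x0) * (x - x0) ** 2

exact = np.exp(-f(xs))
approx = np.exp(-f_quad(xs))
print("peak location:", x0)
print("max abs error near the peak:",
      np.max(np.abs(exact - approx)[np.abs(xs - x0) < 0.5]))
```

Near the peak the quadratic (k = 2) approximation is very accurate, which is exactly what the exp-weighting rewards.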

Reminder: Taylor expansion

Here are the approximations for the first few values of k. [Figure: the exact exp(−f(x)) compared with exp(−f(x)) using the Taylor expansion of f for increasing k.] The use of k = 2 looks promising.

In multiple dimensions, the Taylor expansion for k = 2 is

    f(x) ≈ f(x₀) + (x − x₀)ᵀ ∇f(x)|_{x₀} + (1/2)(x − x₀)ᵀ ∇∇f(x)|_{x₀} (x − x₀)

where ∇ denotes the gradient

    ∇f(x) = (∂f(x)/∂x₁, ..., ∂f(x)/∂x_n)ᵀ

and ∇∇f(x) is the matrix with elements

    M_ij = ∂²f(x) / ∂x_i ∂x_j

(Looks complicated, but it's just the obvious extension of the 1-dimensional case.)

Method 1: approximation to p(w | y; X)

Applying this to S(w) = (α/2)‖w‖² + βE(w) and expanding around w_MAP:

    S(w) ≈ S(w_MAP) + (1/2)(w − w_MAP)ᵀ A (w − w_MAP)

As w_MAP minimises the function, the first derivatives are zero and the corresponding term in the Taylor expansion disappears. The quantity A = ∇∇S(w)|_{w_MAP} can be simplified. This is because

    A = ∇∇((α/2)‖w‖² + βE(w))|_{w_MAP} = αI + β∇∇E(w_MAP)

We actually already know something about how to get w_MAP:

- A method such as backpropagation can be used to compute ∇S(w).
- The vector w_MAP can then be obtained using any standard optimisation method (such as gradient descent).

It's also likely to be straightforward to compute ∇∇E(w): this quantity can be evaluated using an extended form of backpropagation. (A rough numerical sketch follows.)
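
A rough sketch of obtaining w_MAP and A = αI + β∇∇E(w_MAP), reusing the toy h_w, E and S from the earlier sketch. Finite differences stand in for the extended backpropagation mentioned above, and the optimiser choice is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_laplace(X, y, alpha, beta, n_weights, eps=1e-4, seed=0):
    """Find w_MAP by minimising S(w), then build A = alpha*I + beta*Hessian(E)."""
    rng = np.random.default_rng(seed)
    w0 = 0.1 * rng.standard_normal(n_weights)
    # Any standard optimiser will do; here scipy's BFGS minimises S(w).
    w_map = minimize(S, w0, args=(X, y, alpha, beta), method="BFGS").x

    # Finite-difference Hessian of E at w_MAP.
    I = np.eye(n_weights)
    H = np.zeros((n_weights, n_weights))
    E0 = E(w_map, X, y)
    for i in range(n_weights):
        for j in range(n_weights):
            H[i, j] = (E(w_map + eps * (I[i] + I[j]), X, y)
                       - E(w_map + eps * I[i], X, y)
                       - E(w_map + eps * I[j], X, y) + E0) / eps ** 2
    A = alpha * I + beta * 0.5 * (H + H.T)   # symmetrise the numerical Hessian
    return w_map, A
```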

A useful integral

Dropping, for this slide only, the special meaning usually given to the vector x, here is a useful standard integral: if A ∈ R^{n×n} is symmetric, then for b ∈ R^n and c ∈ R

    ∫_{R^n} exp(−(1/2)(xᵀAx + xᵀb + c)) dx = (2π)^{n/2} |A|^{−1/2} exp(−(1/2)(c − (1/4)bᵀA⁻¹b))

You're not expected to know how to evaluate this, but see the handout on the course web page if you're curious. To make this easy to refer to, let's call it the BIG INTEGRAL.

Method 1: approximation to p(w | y; X)

Defining Δw = w − w_MAP, we now have an approximation

    p(w | y; X) ≈ (1/Z) exp(−S(w_MAP) − (1/2)ΔwᵀAΔw)

where, using the BIG INTEGRAL,

    Z = (2π)^{W/2} |A|^{−1/2} exp(−S(w_MAP))

and W is the number of weights. Let's plug this approximation back into the expression for the Bayes-optimum and see what we get. (No, I won't ask you to evaluate it in the exam.)

    I ∝ ∫ exp(−(β/2)(Y − h_w(x))²) exp(−(1/2)ΔwᵀAΔw) dw

with the first factor being the likelihood p(Y | w; x) and the second the approximation to p(w | y; X). There is still no closed-form solution! We need another approximation.

Method 1: second approximation

We can introduce a linear approximation of h_w(x) at w_MAP:

    h_w(x) ≈ h_{w_MAP}(x) + gᵀΔw    where    g = ∇_w h_w(x)|_{w_MAP}

(By linear approximation we just mean the Taylor expansion for k = 1.) This leads to

    p(Y | y; x, X) ∝ ∫ exp(−(β/2)(Y − h_{w_MAP}(x) − gᵀΔw)² − (1/2)ΔwᵀAΔw) dw

SUCCESS!!! This integral can be evaluated (this is an Exercise) using the BIG INTEGRAL to give THE ANSWER:

    p(Y | y; x, X) ≈ (1/√(2πσ_Y²)) exp(−(Y − h_{w_MAP}(x))² / (2σ_Y²))

where

    σ_Y² = 1/β + gᵀA⁻¹g

We really are making assumptions here: this is OK if we assume that p(w | y; X) is narrow, which depends on A.
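
Continuing the same toy sketch, the predictive mean h_{w_MAP}(x) and variance σ_Y² = 1/β + gᵀA⁻¹g can be computed as below. The vector g is again obtained by finite differences rather than backpropagation, and the code reuses the assumed h_w from the earlier sketch.

```python
import numpy as np

def predict(x_new, w_map, A, beta, eps=1e-5):
    """Predictive mean and variance at a single new input x_new (1-D array)."""
    x_new = np.atleast_2d(x_new)
    mean = h_w(w_map, x_new)[0]                       # h_{w_MAP}(x)
    g = np.array([(h_w(w_map + eps * e, x_new)[0] - mean) / eps
                  for e in np.eye(len(w_map))])       # g = grad_w h_w(x) at w_MAP
    var = 1.0 / beta + g @ np.linalg.solve(A, g)      # sigma_Y^2 = 1/beta + g^T A^{-1} g
    return mean, var
```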

Method 1: final expression

Hooray! But what does it mean? This is a Gaussian density, so we can now see that:

- p(Y | y; x, X) peaks at h_{w_MAP}(x): that is, at the MAP solution.
- The variance σ_Y² can be interpreted as a measure of certainty: the first term of σ_Y², 1/β, corresponds to the noise, and the second term, gᵀA⁻¹g, corresponds to the width of p(w | y; X).
- Plotting ±σ_Y around the prediction gives a measure of certainty.

Interpreted graphically: [Figure: typical behaviour of the Bayesian solution.]

Method 2: Markov chain Monte Carlo (MCMC) methods

The second solution to the problem of performing integrals

    I = ∫ F(w) p(w | y; X) dw

is to use Monte Carlo methods. The basic approach is to make the approximation

    I ≈ (1/N) Σ_{i=1}^N F(w_i)

where the w_i have distribution p(w | y; X). Unfortunately, generating w_i with a given distribution can be non-trivial.

MCMC methods

A simple technique is to introduce a random walk, so

    w_{i+1} = w_i + ε

where ε is zero-mean spherical Gaussian with small variance. Obviously the sequence w_i does not have the required distribution. However, we can use the Metropolis algorithm, which does not accept all the steps in the random walk:

- If p(w_{i+1} | y; X) > p(w_i | y; X) then accept the step.
- Else accept the step with probability p(w_{i+1} | y; X) / p(w_i | y; X).

In practice the Metropolis algorithm has several shortcomings, and a great deal of research exists on improved methods; see: R. Neal, "Probabilistic inference using Markov chain Monte Carlo methods", University of Toronto, Department of Computer Science, Technical Report CRG-TR-93-1, 1993.
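
Here is a minimal, self-contained sketch of the random-walk Metropolis scheme just described. It works with any unnormalised log-density (for instance log p(w) = −S(w) from the earlier sketch); the step size and sample count are illustrative.

```python
import numpy as np

def metropolis(log_p, w0, n_samples=5000, step=0.05, seed=0):
    """Random-walk Metropolis: returns an array of (correlated) samples."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    lp = log_p(w)
    samples = []
    for _ in range(n_samples):
        w_prop = w + step * rng.standard_normal(w.shape)   # random-walk proposal
        lp_prop = log_p(w_prop)
        # Accept if more probable; otherwise accept with probability p(w')/p(w).
        if np.log(rng.uniform()) < lp_prop - lp:
            w, lp = w_prop, lp_prop
        samples.append(w.copy())
    return np.array(samples)

# The Monte Carlo estimate of I = E[F(w)] is then, for example:
#     samples = metropolis(lambda w: -S(w, X, y, alpha, beta), w0)
#     I_hat = np.mean([F(w) for w in samples], axis=0)
```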

A (very) brief introduction to how to learn hyperparameters

So far in our coverage of the Bayesian approach to neural networks, the hyperparameters α and β were assumed to be known and fixed. But this is not a good assumption, because α corresponds to the width of the prior and β to the noise variance. So we really want to learn these from the data as well. How can this be done? We now take a look at one of several ways of addressing this problem. Note: from now on I'm going to leave out the dependencies on x and X, as leaving them in starts to make everything cluttered.

The prior and likelihood depend on α and β respectively, so we now make this clear and write

    p(w | y, α, β) = p(y | w, β) p(w | α) / p(y | α, β)

Don't worry about recalling the actual expressions for the prior and likelihood: we're not going to delve deep enough to need them. Let's write down directly something that might be useful to know:

    p(α, β | y) = p(y | α, β) p(α, β) / p(y)

Hierarchical Bayes and the evidence

If we know p(α, β | y) then a straightforward approach is to use the values of α and β that maximise it:

    argmax_{α,β} p(α, β | y)

Here is a standard trick: assume that the prior p(α, β) is flat, so that we can just maximise p(y | α, β). This is called type II maximum likelihood and is one common way of doing the job.

The quantity p(y | α, β) is called the evidence or marginal likelihood. When we re-wrote our earlier equation for the posterior density of the weights, making α and β explicit, we found

    p(w | y, α, β) = p(y | w, β) p(w | α) / p(y | α, β)

so the evidence is the denominator in this equation. This is the common pattern, and it leads to the idea of hierarchical Bayes: the evidence for the hyperparameters at one level is the denominator in the relevant application of Bayes' theorem.

Gaussian processes

There is an alternative approach to Bayesian regression and classification. The fundamental idea is not to think in terms of weights w that specify functions; instead the idea is to deal with functions directly. Fundamental to this is the concept of a Gaussian process. We will continue to omit the dependencies on x and X to keep the notation simple.

We have seen that inference can be performed by:

- Computing the posterior density p(w | y) of the parameters given the observed labels.
- Computing the Bayes-optimal prediction

      p(Y | y) = ∫ p(Y | w) p(w | y) dw

  which is the expected value of the likelihood for a new point.
- Choosing any hyperparameters p using the evidence p(y | p).

But shouldn't we deal with functions directly, not via parameters? What happens if we deal directly with functions f, rather than choosing them via parameters? Can we change the equation for prediction to

    p(Y | y) = ∫ p(Y | f) p(f | y) df

in any sensible way? This obviously requires us to talk about probability densities over functions. That is probably not something you have ever seen before. [Figure: four samples f ~ p(f) from a probability density defined on functions.] In fact this is quite straightforward, using the concept of a Gaussian process (GP).

Gaussian processes

Definition: say we have a set of RVs {F(x)}, one for each input x. This set forms a Gaussian process if any finite subset of them is jointly Gaussian distributed. [Figure: the same four samples f ~ p(f), where F is in fact a GP; the crosses mark the values of the sampled functions at four different values of x.]

What happens when we randomly select a function that is a GP? We are only ever interested in a finite number of its values, because we only need to deal with the values at the points in the training set and at any new points we want to predict. Consequently we can use a GP as a prior, rather than having a prior p(w). Note again the key point: we are randomly selecting functions, and we can say something about their behaviour for any finite collection of arguments. And that is enough, as we only ever have finite quantities of data. Because F is a GP, any such finite set of values has a jointly Gaussian distribution.

GP priors

To specify a GP on vectors x in R^n, we just need:

- A mean function m : R^n → R, with m(x) = E_{f~F}[f(x)].
- A covariance function k : R^n × R^n → R, with k(x, x') = E_{f~F}[(f(x) − m(x))(f(x') − m(x'))].

We then write F ~ GP(m, k) to denote that F is a GP. By specifying m and k we get different kinds of function when sampling F. [Figure: sample functions drawn from a squared exponential prior, a γ-exponential prior, a rational quadratic prior and a neural network prior, for particular hyperparameter settings.]
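
To see what "sampling a function from a GP prior" means in practice, here is a small numpy sketch that draws sample functions from a zero-mean GP with a squared exponential covariance function on a grid of inputs; the lengthscale and grid are illustrative choices.

```python
import numpy as np

def k_se(x1, x2, lengthscale=0.5):
    """Squared exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

xs = np.linspace(-2.0, 2.0, 200)                 # a finite set of input points
K = k_se(xs, xs) + 1e-8 * np.eye(len(xs))        # small jitter for numerical stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=4)
print(samples.shape)   # (4, 200): each row is one sampled function evaluated on the grid
```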

Covariance functions

Polynomial: k(x, x') = (c + xᵀx')^k
Exponential: k(x, x') = exp(−‖x − x'‖ / l)
Squared exponential: k(x, x') = exp(−‖x − x'‖² / (2l²))
Gamma exponential: k(x, x') = exp(−(‖x − x'‖ / l)^γ)
Rational quadratic: k(x, x') = (1 + ‖x − x'‖² / (2αl²))^{−α}
Neural network: k(x, x') = sin⁻¹( 2x̃ᵀΣx̃' / √((1 + 2x̃ᵀΣx̃)(1 + 2x̃'ᵀΣx̃')) ), where x̃ᵀ = [1 xᵀ]

As usual these have associated hyperparameters, and these have to be dealt with correctly, as always.

Gaussian processes: generating data

Say we have some data y_i = f(x_i) for i = 1, ..., m, where f ~ GP(m, k). (Remember, the x_i are fixed, not RVs.) Any finite set of points must be jointly Gaussian, so

    p(y) = N(m, K)

where mᵀ = [m(x_1) ... m(x_m)] and K is the Gram matrix with K_ij = k(x_i, x_j).

Note 1: this is not p(y | f). We can completely remove the need for integration!
Note 2: from now on we will assume m(x) = 0. (It is straightforward to incorporate a non-zero mean.)

Gaussian processes: generating data with noise

Now add noise to the data. Say we add Gaussian noise, so y_i = f(x_i) + ε_i. Again i = 1, ..., m and f ~ GP(m, k), but now we also have ε_i ~ N(0, σ²). As we are adding Gaussian RVs, we have

    p(y) = N(0, K + σ²I)

BUT: in order to do prediction we actually need to involve a new point x, for which we want to predict the corresponding value Y.
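
Before turning to prediction, here are two of the covariance functions above written out as code, together with a draw of a noisy data vector with p(y) = N(0, K + σ²I). The hyperparameter values are illustrative, and these sketches assume scalar inputs.

```python
import numpy as np

def k_gamma_exp(x1, x2, l=1.0, gamma=1.5):
    """Gamma exponential: k(x, x') = exp(-(|x - x'| / l)^gamma)."""
    r = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-(r / l) ** gamma)

def k_rq(x1, x2, l=1.0, alpha=2.0):
    """Rational quadratic: k(x, x') = (1 + |x - x'|^2 / (2 alpha l^2))^(-alpha)."""
    r2 = (x1[:, None] - x2[None, :]) ** 2
    return (1.0 + r2 / (2.0 * alpha * l ** 2)) ** (-alpha)

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-2.0, 2.0, size=10))
sigma2 = 0.1
K = k_rq(x_train, x_train)
y = rng.multivariate_normal(np.zeros(len(x_train)), K + sigma2 * np.eye(len(x_train)))
print(y)   # one draw of noisy training targets with distribution N(0, K + sigma^2 I)
```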

Gaussian processes: prediction

SO: we incorporate x, for which we want to predict the corresponding value Y. By exactly the same argument,

    p(y, Y) = N(0, K')

where

    K' = [ k    kᵀ      ]
         [ k    K + σ²I ]

with kᵀ = [k(x, x_1) ... k(x, x_m)] and the scalar k = k(x, x) + σ².

Note 1: all we've done here is to expand the Gram matrix by an extra row and column to get K'.
Note 2: whether or not you include σ² in the scalar k is a matter of choice. (What difference does it make? This is an Exercise.)

Gaussian density: marginals and conditionals

For a normal RV x ~ N(µ, Σ),

    p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ))

Split x so that

    x = [ x_1 ]      µ = [ µ_1 ]      Σ = [ Σ_11  Σ_12 ]
        [ x_2 ]          [ µ_2 ]          [ Σ_21  Σ_22 ]

What are p(x_1) and p(x_1 | x_2)? Define the precision matrix

    Λ = Σ⁻¹ = [ Λ_11  Λ_12 ]
              [ Λ_21  Λ_22 ]

It is possible to show that

    p(x_1) = N(µ_1, Σ_11)
    p(x_1 | x_2) = N(µ_1 − Λ_11⁻¹ Λ_12 (x_2 − µ_2), Λ_11⁻¹)

Inverting a block matrix

On the last slide we used Σ⁻¹ = Λ written in blocks. Re-writing Σ as

    Σ = [ A  B ]
        [ C  D ]

it is possible to show (it is an Exercise to do this) that

    Λ_11 = M
    Λ_12 = −M B D⁻¹
    Λ_21 = −D⁻¹ C M
    Λ_22 = D⁻¹ + D⁻¹ C M B D⁻¹

where M = (A − B D⁻¹ C)⁻¹.
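
The marginal and conditional formulas above translate directly into code. The sketch below splits a joint Gaussian into blocks and computes p(x_1) and p(x_1 | x_2) via the precision matrix, as on the slides; an equivalent form uses the Schur complement Σ_11 − Σ_12 Σ_22⁻¹ Σ_21 directly.

```python
import numpy as np

def gaussian_marginal_and_conditional(mu, Sigma, x2, d1):
    """Split N(mu, Sigma) so that x1 is the first d1 components; return the
    parameters of the marginal p(x1) and of the conditional p(x1 | x2)."""
    mu1, mu2 = mu[:d1], mu[d1:]
    Sigma11 = Sigma[:d1, :d1]

    Lam = np.linalg.inv(Sigma)                    # precision matrix
    Lam11, Lam12 = Lam[:d1, :d1], Lam[:d1, d1:]

    cond_mean = mu1 - np.linalg.solve(Lam11, Lam12 @ (x2 - mu2))
    cond_cov = np.linalg.inv(Lam11)               # equals Sigma11 - Sigma12 Sigma22^{-1} Sigma21
    return (mu1, Sigma11), (cond_mean, cond_cov)
```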

GP regression

[Figure: GP regression using the squared exponential, γ-exponential, rational quadratic and neural network priors.]

To do prediction, all that's left is to compute p(Y | y). Because everything is Gaussian this turns out to be easy:

    p(y, Y) = N(0, K')      K' = [ k  kᵀ ]      L = K + σ²I
                                 [ k  L  ]

From these we want to know p(Y | y). Only two things are needed: the inverse formula for a block matrix, and the formula for obtaining a conditional from a joint Gaussian. Using these we can show (it is an Exercise to derive this) that

    p(Y | y) = N(kᵀL⁻¹y, k − kᵀL⁻¹k)

with mean kᵀL⁻¹y and variance k − kᵀL⁻¹k.

Learning the hyperparameters

A nice side-effect of this formulation is that we get a usable expression for the marginal likelihood. If we incorporate the hyperparameters p, which in this case are any parameters associated with k(x, x') along with σ², then we've just computed

    p(Y | y, p) = p(Y, y | p) / p(y | p)

The denominator is the marginal likelihood, and we computed it above:

    p(y | p) = N(0, L) = (1/√((2π)^m |L|)) exp(−(1/2)yᵀL⁻¹y)

As usual this looks nicer if we consider its log. This is a rare beast:

    log p(y | p) = −(1/2)log|L| − (1/2)yᵀL⁻¹y − (m/2)log 2π

It's a sensible formula that tells you how good a set p of hyperparameters is. That means you can use it as an alternative to cross-validation when searching for hyperparameters. As a bonus, you can generally differentiate it, so it's possible to use gradient-based search.
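
Putting the pieces together: a sketch of GP regression prediction and of the log marginal likelihood, reusing the assumed k_se covariance function from the sampling sketch above. The noise level and lengthscale are illustrative; in practice you would search over them, or follow their gradients, using log p(y | p).

```python
import numpy as np

def gp_predict(x_train, y_train, x_new, sigma2=0.1, lengthscale=0.5):
    """Predictive mean k^T L^{-1} y and variance k - k^T L^{-1} k at a scalar x_new."""
    L = k_se(x_train, x_train, lengthscale) + sigma2 * np.eye(len(x_train))
    k_vec = k_se(x_train, np.array([x_new]), lengthscale)[:, 0]
    mean = k_vec @ np.linalg.solve(L, y_train)
    # Scalar k = k(x, x) + sigma^2, including the noise as in the slides' choice.
    k_scalar = k_se(np.array([x_new]), np.array([x_new]), lengthscale)[0, 0] + sigma2
    var = k_scalar - k_vec @ np.linalg.solve(L, k_vec)
    return mean, var

def log_marginal_likelihood(x_train, y_train, sigma2=0.1, lengthscale=0.5):
    """log p(y | p) = -(1/2) log|L| - (1/2) y^T L^{-1} y - (m/2) log 2*pi."""
    L = k_se(x_train, x_train, lengthscale) + sigma2 * np.eye(len(x_train))
    _, logdet = np.linalg.slogdet(L)
    m = len(y_train)
    return (-0.5 * logdet - 0.5 * y_train @ np.linalg.solve(L, y_train)
            - 0.5 * m * np.log(2 * np.pi))
```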

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

The maximum margin classifier. Machine Learning and Bayesian Inference. Dr Sean Holden Computer Laboratory, Room FC06

The maximum margin classifier. Machine Learning and Bayesian Inference. Dr Sean Holden Computer Laboratory, Room FC06 Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC06 Suggestion: why not drop all this probability nonsense and just do this: 2 Telephone etension 63725 Email: sbh@cl.cam.ac.uk

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Bayesian Machine Learning

Bayesian Machine Learning Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning

More information

An example to illustrate frequentist and Bayesian approches

An example to illustrate frequentist and Bayesian approches Frequentist_Bayesian_Eample An eample to illustrate frequentist and Bayesian approches This is a trivial eample that illustrates the fundamentally different points of view of the frequentist and Bayesian

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information

Lecture 3: Pattern Classification. Pattern classification

Lecture 3: Pattern Classification. Pattern classification EE E68: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mitures and

More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of

More information

Bayesian Inference of Noise Levels in Regression

Bayesian Inference of Noise Levels in Regression Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop

More information

Regression with Input-Dependent Noise: A Bayesian Treatment

Regression with Input-Dependent Noise: A Bayesian Treatment Regression with Input-Dependent oise: A Bayesian Treatment Christopher M. Bishop C.M.BishopGaston.ac.uk Cazhaow S. Qazaz qazazcsgaston.ac.uk eural Computing Research Group Aston University, Birmingham,

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Machine Learning and Bayesian Inference

Machine Learning and Bayesian Inference Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 63725 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Copyright c Sean Holden 22-17. 1 Artificial

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes

CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail

More information

TDA231. Logistic regression

TDA231. Logistic regression TDA231 Devdatt Dubhashi dubhashi@chalmers.se Dept. of Computer Science and Engg. Chalmers University February 19, 2016 Some data 5 x2 0 5 5 0 5 x 1 In the Bayes classifier, we built a model of each class

More information

State Space and Hidden Markov Models

State Space and Hidden Markov Models State Space and Hidden Markov Models Kunsch H.R. State Space and Hidden Markov Models. ETH- Zurich Zurich; Aliaksandr Hubin Oslo 2014 Contents 1. Introduction 2. Markov Chains 3. Hidden Markov and State

More information

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25 Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs

More information

1 Rational Exponents and Radicals

1 Rational Exponents and Radicals Introductory Algebra Page 1 of 11 1 Rational Eponents and Radicals 1.1 Rules of Eponents The rules for eponents are the same as what you saw earlier. Memorize these rules if you haven t already done so.

More information

STA 414/2104, Spring 2014, Practice Problem Set #1

STA 414/2104, Spring 2014, Practice Problem Set #1 STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,

More information

Bayesian Support Vector Machines for Feature Ranking and Selection

Bayesian Support Vector Machines for Feature Ranking and Selection Bayesian Support Vector Machines for Feature Ranking and Selection written by Chu, Keerthi, Ong, Ghahramani Patrick Pletscher pat@student.ethz.ch ETH Zurich, Switzerland 12th January 2006 Overview 1 Introduction

More information

Lecture 1c: Gaussian Processes for Regression

Lecture 1c: Gaussian Processes for Regression Lecture c: Gaussian Processes for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk

More information

Machine Learning Basics III

Machine Learning Basics III Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Artificial Intelligence: what have we seen so far? Machine Learning and Bayesian Inference

Artificial Intelligence: what have we seen so far? Machine Learning and Bayesian Inference Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6375 Email: sbh@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh/ Artificial Intelligence: what have we seen so

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Introduction. So, why did I even bother to write this?

Introduction. So, why did I even bother to write this? Introduction This review was originally written for my Calculus I class, but it should be accessible to anyone needing a review in some basic algebra and trig topics. The review contains the occasional

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Gaussians. Hiroshi Shimodaira. January-March Continuous random variables

Gaussians. Hiroshi Shimodaira. January-March Continuous random variables Cumulative Distribution Function Gaussians Hiroshi Shimodaira January-March 9 In this chapter we introduce the basics of how to build probabilistic models of continuous-valued data, including the most

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Supervised Learning Coursework

Supervised Learning Coursework Supervised Learning Coursework John Shawe-Taylor Tom Diethe Dorota Glowacka November 30, 2009; submission date: noon December 18, 2009 Abstract Using a series of synthetic examples, in this exercise session

More information

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning

Bayesian Approach 2. CSC412 Probabilistic Learning & Reasoning CSC412 Probabilistic Learning & Reasoning Lecture 12: Bayesian Parameter Estimation February 27, 2006 Sam Roweis Bayesian Approach 2 The Bayesian programme (after Rev. Thomas Bayes) treats all unnown quantities

More information

Relevance Vector Machines

Relevance Vector Machines LUT February 21, 2011 Support Vector Machines Model / Regression Marginal Likelihood Regression Relevance vector machines Exercise Support Vector Machines The relevance vector machine (RVM) is a bayesian

More information

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005

Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach. Radford M. Neal, 28 February 2005 Creating Non-Gaussian Processes from Gaussian Processes by the Log-Sum-Exp Approach Radford M. Neal, 28 February 2005 A Very Brief Review of Gaussian Processes A Gaussian process is a distribution over

More information

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

More information

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1 Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan

Lecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

20: Gaussian Processes

20: Gaussian Processes 10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction

More information

Probabilistic Graphical Models Lecture 20: Gaussian Processes

Probabilistic Graphical Models Lecture 20: Gaussian Processes Probabilistic Graphical Models Lecture 20: Gaussian Processes Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 30, 2015 1 / 53 What is Machine Learning? Machine learning algorithms

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes

COMP 551 Applied Machine Learning Lecture 20: Gaussian processes COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation

Overview. Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Overview Probabilistic Interpretation of Linear Regression Maximum Likelihood Estimation Bayesian Estimation MAP Estimation Probabilistic Interpretation: Linear Regression Assume output y is generated

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

CSC2515 Assignment #2

CSC2515 Assignment #2 CSC2515 Assignment #2 Due: Nov.4, 2pm at the START of class Worth: 18% Late assignments not accepted. 1 Pseudo-Bayesian Linear Regression (3%) In this question you will dabble in Bayesian statistics and

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation

COMP 551 Applied Machine Learning Lecture 21: Bayesian optimisation COMP 55 Applied Machine Learning Lecture 2: Bayesian optimisation Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp55 Unless otherwise noted, all material posted

More information

Rational Expressions & Equations

Rational Expressions & Equations Chapter 9 Rational Epressions & Equations Sec. 1 Simplifying Rational Epressions We simply rational epressions the same way we simplified fractions. When we first began to simplify fractions, we factored

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Machine Learning 4771

Machine Learning 4771 Machine Learning 4771 Instructor: Tony Jebara Topic 7 Unsupervised Learning Statistical Perspective Probability Models Discrete & Continuous: Gaussian, Bernoulli, Multinomial Maimum Likelihood Logistic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge

Bayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference

More information

1 Bayesian Linear Regression (BLR)

1 Bayesian Linear Regression (BLR) Statistical Techniques in Robotics (STR, S15) Lecture#10 (Wednesday, February 11) Lecturer: Byron Boots Gaussian Properties, Bayesian Linear Regression 1 Bayesian Linear Regression (BLR) In linear regression,

More information

Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction

Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction Gaussian Process priors with Uncertain Inputs: Multiple-Step-Ahead Prediction Agathe Girard Dept. of Computing Science University of Glasgow Glasgow, UK agathe@dcs.gla.ac.uk Carl Edward Rasmussen Gatsby

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Analytic Long-Term Forecasting with Periodic Gaussian Processes

Analytic Long-Term Forecasting with Periodic Gaussian Processes Nooshin Haji Ghassemi School of Computing Blekinge Institute of Technology Sweden Marc Peter Deisenroth Department of Computing Imperial College London United Kingdom Department of Computer Science TU

More information

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?

Machine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels? Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity

More information

CALCULUS I. Integrals. Paul Dawkins

CALCULUS I. Integrals. Paul Dawkins CALCULUS I Integrals Paul Dawkins Table of Contents Preface... ii Integrals... Introduction... Indefinite Integrals... Computing Indefinite Integrals... Substitution Rule for Indefinite Integrals... More

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem CS 189 Introduction to Machine Learning Spring 2018 Note 22 1 Duality As we have seen in our discussion of kernels, ridge regression can be viewed in two ways: (1) an optimization problem over the weights

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

+ y = 1 : the polynomial

+ y = 1 : the polynomial Notes on Basic Ideas of Spherical Harmonics In the representation of wavefields (solutions of the wave equation) one of the natural considerations that arise along the lines of Huygens Principle is the

More information

Mobile Robot Localization

Mobile Robot Localization Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

The binary entropy function

The binary entropy function ECE 7680 Lecture 2 Definitions and Basic Facts Objective: To learn a bunch of definitions about entropy and information measures that will be useful through the quarter, and to present some simple but

More information

Lecture Machine Learning (past summer semester) If at least one English-speaking student is present.

Lecture Machine Learning (past summer semester) If at least one English-speaking student is present. Advanced Machine Learning Lecture 1 Organization Lecturer Prof. Bastian Leibe (leibe@vision.rwth-aachen.de) Introduction 20.10.2015 Bastian Leibe Visual Computing Institute RWTH Aachen University http://www.vision.rwth-aachen.de/

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Computer Vision Group Prof. Daniel Cremers. 3. Regression

Computer Vision Group Prof. Daniel Cremers. 3. Regression Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk

17 Neural Networks NEURAL NETWORKS. x XOR 1. x Jonathan Richard Shewchuk 94 Jonathan Richard Shewchuk 7 Neural Networks NEURAL NETWORKS Can do both classification & regression. [They tie together several ideas from the course: perceptrons, logistic regression, ensembles of

More information

The connection of dropout and Bayesian statistics

The connection of dropout and Bayesian statistics The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University

More information