Perturbation Theory for Variational Inference


Manfred Opper (TU Berlin), Marco Fraccaro (Technical University of Denmark), Ulrich Paquet (Apple), Alex Susemihl (TU Berlin), Ole Winther (Technical University of Denmark)

Abstract

The variational approximation is known to underestimate the true variance of a probability measure, like a posterior distribution. In latent variable models whose parameters are permutation-invariant with respect to the likelihood, the variational approximation might self-prune and ignore its parameters. We view the variational free energy and its accompanying evidence lower bound as a first-order term from a perturbation of the true log partition function and derive a power series of corrections. We sketch by means of a variational matrix factorization example how a further term could correct for predictions from a self-pruned approximation.

1 Introduction

We consider the probability measure $p(x) = \mu(x)\, e^{-H(x)} / Z$ on a random variable $x$. We are often concerned with the computation of the negative log partition function

$$-\log Z = -\log \int d\mu(x)\, e^{-H(x)},$$

or of expectations $E[f(x)] = \frac{1}{Z} \int d\mu(x)\, f(x)\, e^{-H(x)}$ under $p$. As these are typically not analytically tractable, the variational approximation introduces a tractable approximating measure $q(x) = \frac{1}{Z_q}\, \mu(x)\, e^{-H_q(x)}$, and adjusts the parameters in $H_q$ in such a way that the free energy, which is defined via the Kullback-Leibler divergence, is minimal. This yields the form

$$F_q = D(q \,\|\, p) - \log Z = E_q\!\left[\log \frac{q(x)}{p(x)}\right] - \log Z = -\log Z_q + E_q[V(x)], \qquad (1)$$

where $V = H - H_q$. The minimizer of $F_q$ usually underestimates the variance of $p$, and symmetries are sometimes not broken [1, 3], so that the approximation self-prunes. In Sec. 3 we illustrate this effect with a small matrix factorization example, and then show how predictions (expectations $E[f(x)]$ of $f(x)$ under the posterior $p$) could be improved.

2 Perturbation theory

Perturbation theory aims to find approximate solutions to a problem given exact solutions of a simpler related sub-problem (the VB solution in our case).

If we define $\hat{H}_\lambda = H_q + \lambda V = (1-\lambda) H_q + \lambda H$, we have $\hat{H}_1 = H$ and $\hat{H}_0 = H_q$. Let us now define the perturbation expansion using the cumulant expansion of $\log E_q\!\left[e^{-\lambda V}\right]$, where $E_q$ denotes the expectation with respect to the variational distribution:

$$-\log \int d\mu(x)\, e^{-\hat{H}_\lambda(x)} = -\log Z_q - \log E_q\!\left[e^{-\lambda V(x)}\right] = \underbrace{-\log Z_q + \lambda E_q[V]}_{F_q \text{ for } \lambda = 1} - \frac{\lambda^2}{2}\, E_q\!\left[(V - E_q V)^2\right] + \frac{\lambda^3}{3!}\, E_q\!\left[(V - E_q V)^3\right] + \dots \qquad (2)$$

At the end, of course, we set $\lambda = 1$. The first order term in (2) yields the variational free energy $F_q$ in (1), i.e. the negative of the usual evidence lower bound (ELBO). It is therefore reasonable to correct it using higher orders. Note that this may not be a convergent series, but may lead to an asymptotic expansion only. The approach is similar to corrections to Expectation Propagation [4, 6], but the computation of the correction terms here requires less effort. A variational linear response correction can also be applied [5].

Expectations. In a similar way, we can compute corrections for expectations of functions. We first define $E_\lambda$ as the expectation with respect to $p_\lambda(x) \propto \mu(x)\, e^{-\hat{H}_\lambda(x)}$ (we then have $E_q = E_0$) and notice that

$$E_\lambda[f(x)] = \frac{\int d\mu(x)\, f(x)\, e^{-\hat{H}_\lambda(x)}}{\int d\mu(x)\, e^{-\hat{H}_\lambda(x)}} = \frac{\int d\mu(x)\, f(x)\, e^{-\hat{H}_0(x)}\left(1 - \lambda V(x) + \frac{\lambda^2}{2} V^2(x) - \frac{\lambda^3}{3!} V^3(x) \pm \dots\right)}{\int d\mu(x)\, e^{-\hat{H}_0(x)}\left(1 - \lambda V(x) + \frac{\lambda^2}{2} V^2(x) - \frac{\lambda^3}{3!} V^3(x) \pm \dots\right)} = \frac{E_0\!\left[f(x)\left(1 - \lambda V + \frac{\lambda^2}{2} V^2 - \frac{\lambda^3}{3!} V^3 \pm \dots\right)\right]}{E_0\!\left[1 - \lambda V + \frac{\lambda^2}{2} V^2 - \frac{\lambda^3}{3!} V^3 \pm \dots\right]}. \qquad (3)$$

Using $\frac{1}{1-z} = 1 + z + z^2 + \dots$ (which only converges for $|z| < 1$) we can re-expand the part from the denominator in (3) to get

$$\left(1 - \lambda E_0[V] + \frac{\lambda^2}{2} E_0[V^2] \mp \dots\right)^{-1} = 1 + \lambda E_0[V] - \frac{\lambda^2}{2} E_0[V^2] + \lambda^2 \left(E_0[V]\right)^2 \pm \dots \qquad (4)$$

Putting this together with the numerator in (3) yields

$$E_\lambda[f(x)] = \left(E_0[f(x)] - \lambda E_0[f(x) V] + \frac{\lambda^2}{2} E_0[f(x) V^2] \mp \dots\right)\left(1 + \lambda E_0[V] - \frac{\lambda^2}{2} E_0[V^2] + \lambda^2 (E_0[V])^2 \pm \dots\right)$$
$$= E_0[f(x)] - \lambda E_0[f(x) V] + \lambda E_0[f(x)]\, E_0[V] + \frac{\lambda^2}{2} E_0[f(x) V^2] - \frac{\lambda^2}{2} E_0[f(x)]\, E_0[V^2] - \lambda^2 E_0[f(x) V]\, E_0[V] + \lambda^2 E_0[f(x)] \left(E_0[V]\right)^2 \pm \dots$$
$$= E_0[f(x)] - \lambda\, \mathrm{Cov}_0\!\left(f(x), V(x)\right) - \lambda^2 E_0[V(x)]\, \mathrm{Cov}_0\!\left(f(x), V(x)\right) + \frac{\lambda^2}{2}\, \mathrm{Cov}_0\!\left(f(x), V^2(x)\right) \pm \dots \qquad (5)$$
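To make (5) concrete, here is a small numerical illustration (ours, not part of the paper): a one-dimensional example with a standard Gaussian base measure, an arbitrary quartic $H$ and a deliberately non-optimised Gaussian $H_q$, evaluated by quadrature; all constants are arbitrary choices.

```python
# Minimal sketch (not from the paper): check the expectation corrections of eq. (5)
# on a 1-D example p(x) ∝ mu(x) exp(-H(x)), mu = N(0, 1), using quadrature.
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)                  # quadrature grid
mu = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)     # base measure mu(x)

H = 0.3 * x**4 + 0.5 * x**2 - x                   # an arbitrary "intractable" H
Hq = 1.5 * (x - 0.2)**2                           # an arbitrary (non-optimised) Gaussian H_q
V = H - Hq                                        # perturbation V = H - H_q

def expect(fvals, weights):
    """E[f] under the normalized density proportional to `weights` on the grid."""
    return np.trapz(fvals * weights, x) / np.trapz(weights, x)

qw = mu * np.exp(-Hq)        # unnormalized q(x)  (lambda = 0)
pw = mu * np.exp(-H)         # unnormalized p(x)  (lambda = 1)

f = x**2                     # test function f(x)
E0f = expect(f, qw)
EV, EV2 = expect(V, qw), expect(V**2, qw)
cov_fV = expect(f * V, qw) - E0f * EV
cov_fV2 = expect(f * V**2, qw) - E0f * EV2

first = E0f - cov_fV                                  # first-order truncation of (5)
second = first - EV * cov_fV + 0.5 * cov_fV2          # second-order truncation of (5)
print("E_q[f]            :", E0f)
print("first order  (5)  :", first)
print("second order (5)  :", second)
print("exact E_p[f]      :", expect(f, pw))
```

Since $V$ is not small here, the truncations need not improve monotonically; the example only makes the covariance form of (5) and its implementation concrete.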

3 Variational Matrix Factorization

As a toy example, we factorize a scalar matrix $r = xy + \epsilon$ with $\epsilon \sim N(\epsilon; 0, \beta^{-1})$. In the parlance of recommender systems (as we'll later consider sparsely observed matrices $R$) a user (modelled by a latent scalar $x$) rates an item (modelled by $y$). One rating $r \sim N(r; xy, \beta^{-1})$ is observed. Assume priors $p(x) = N(x; 0, \alpha_x^{-1})$ and $p(y) = N(y; 0, \alpha_y^{-1})$. Let $q(x, y) = q(x)\, q(y)$ be a factorized approximation with $q(x) = N(x; \mu_x, \gamma_x^{-1})$ and $q(y) = N(y; \mu_y, \gamma_y^{-1})$. We will investigate how the variational prediction of a rating $E[xy]$ and its further terms in (5) change as the observation noise standard deviation $\beta^{-1/2}$ increases.

[Figure 1: The expected affinity $E[xy]$ under $p(x, y \mid r, \beta)$ as a Monte Carlo ground truth (MC) and its approximations (VB, VB-1st, VB-2nd), plotted against the noise standard deviation $\beta^{-1/2}$, with $\lambda = 1$. We plot the results with $r = 1$ and $\alpha_x = \alpha_y = 1$. With these parameters the VB estimate is zero for $\beta^{-1} > 1/2$ (see Fig. 2 and the analysis in [3]), but is accurately corrected through the first order term in (7). Under less noise symmetry is broken ($\beta^{-1} < 1/2$), and the first order correction (6) remains accurate.]

Motivation. This toy example is the building block for computing the corrections for real recommender system models, i.e. matrix factorization models with sparsely observed ratings matrices and $K$ latent dimensions for both user and item vectors. As shown in the Supplementary Material, the corrections for the predicted rating from user $i$ to item $j$, $E[x_i^T y_j]$, can be shown to be the sum of $K(N_i + M_j)$ corrections like the ones presented in this section, when $q$ fully factorizes. $N_i$ denotes the number of items rated by user $i$ and $M_j$ the number of users that have rated item $j$. We expect the correction to be largest when $N_i$ and $M_j$ are small.

Before commencing to technical details, consider Fig. 1, which is accompanied by Fig. 2. When there is little observation noise, with $\beta^{-1}$ being small, the variational Bayes (VB) solution $E_q[xy]$ closely matches $E[xy]$, obtained from a Monte Carlo (MC) estimate. As the noise is increased, the VB solution stops breaking symmetry and snaps to zero (see Fig. 2, and Nakajima et al.'s analysis [3]), after which $E_q[xy] = 0$ will always be predicted. The first order correction gives a substantial improvement towards the ground truth, although, with no guarantee that (5) is a convergent series, the inclusion of a second order term introduces a degradation.

Technical details. The terms that constitute $V(x, y) = H(x, y) - H_q(x, y)$ are

$$H(x, y) = \frac{\beta}{2}(r - xy)^2 + \frac{\alpha_x}{2} x^2 + \frac{\alpha_y}{2} y^2, \qquad H_q(x, y) = \frac{\gamma_x}{2}(x - \mu_x)^2 + \frac{\gamma_y}{2}(y - \mu_y)^2.$$

Let's look at the first order of the expansion in (5),

$$E_1[xy] = E_q[xy] - \lambda E_q[xy\, V] + \lambda E_q[xy]\, E_q[V],$$

bearing in mind that $q$ is the minimizer of $F_q = -\log Z_q + E_q[V]$. The terms that we require in the expansion are

$$E_q[xy] = \mu_x \mu_y,$$
$$E_q[xy]\, E_q[V] - E_q[xy\, V] = -\mu_x \mu_y \left(\frac{\alpha_x}{\gamma_x} + \frac{\alpha_y}{\gamma_y} + \frac{4\beta}{\gamma_x \gamma_y}\right) + \beta r \left(\frac{\mu_y^2}{\gamma_x} + \frac{\mu_x^2}{\gamma_y} + \frac{1}{\gamma_x \gamma_y}\right) - \beta \mu_x \mu_y \left(\frac{\mu_y^2}{\gamma_x} + \frac{\mu_x^2}{\gamma_y}\right), \qquad (6)$$

with the latter term being $-\mathrm{Cov}_q(xy, V)$.
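A small numerical sketch of this construction (our own illustration, not code from the paper): it iterates the mean-field fixed-point updates for $q(x)q(y)$, estimates $\mathrm{Cov}_q(xy, V)$ by sampling from $q$ instead of using the closed form (6), and compares the corrected prediction with an MCMC ground truth. The values $r = 1$ and $\alpha_x = \alpha_y = 1$ follow Fig. 1; $\beta = 1.5$ is an arbitrary choice inside the regime where VB does not break symmetry.

```python
# Minimal sketch (not from the paper): VB for the toy model r = x*y + noise, plus the
# first-order correction E_q[xy] - Cov_q(xy, V) from (5), versus an MCMC ground truth.
import numpy as np

rng = np.random.default_rng(0)
r, beta, ax, ay = 1.0, 1.5, 1.0, 1.0       # r, alpha_x, alpha_y as in Fig. 1;
                                           # beta chosen in the non-symmetry-breaking regime

# mean-field fixed-point updates for q(x) = N(mx, 1/gx), q(y) = N(my, 1/gy)
mx, my, gx, gy = 0.1, 0.1, 1.0, 1.0        # small nonzero init so symmetry can break if possible
for _ in range(1000):
    gx = ax + beta * (my**2 + 1.0 / gy); mx = beta * r * my / gx
    gy = ay + beta * (mx**2 + 1.0 / gx); my = beta * r * mx / gy

# first-order correction: estimate Cov_q(xy, V) by sampling from q
xs = rng.normal(mx, 1.0 / np.sqrt(gx), 500_000)
ys = rng.normal(my, 1.0 / np.sqrt(gy), 500_000)
V = (0.5 * beta * (r - xs * ys)**2 + 0.5 * ax * xs**2 + 0.5 * ay * ys**2
     - 0.5 * gx * (xs - mx)**2 - 0.5 * gy * (ys - my)**2)
cov_xyV = np.mean(xs * ys * V) - np.mean(xs * ys) * np.mean(V)
print("VB prediction E_q[xy] :", mx * my)
print("VB + first order      :", mx * my - cov_xyV)

# Monte Carlo ground truth for E[xy] under p(x, y | r) via random-walk Metropolis
def log_post(x, y):
    return -0.5 * (beta * (r - x * y)**2 + ax * x**2 + ay * y**2)

x, y, samples = 0.0, 0.0, []
for t in range(200_000):
    xp, yp = x + 0.5 * rng.normal(), y + 0.5 * rng.normal()
    if np.log(rng.uniform()) < log_post(xp, yp) - log_post(x, y):
        x, y = xp, yp
    if t >= 20_000:
        samples.append(x * y)
print("MC ground truth       :", float(np.mean(samples)))
```

For this noise level the VB prediction is exactly zero and the first-order term supplies essentially the whole estimate, which is the regime Fig. 1 highlights.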

[Figure 2 (panels at increasing values of the noise variance $\beta^{-1}$, together with the corresponding affinity distributions): The posterior $p(x, y \mid r, \beta)$ and the accompanying fully factorized VB solution (circular contours), for $r = 1$. The approximation breaks symmetry at $\beta^{-1} = 1/2$. Fig. 1 presents corrections to the mean of $p(xy \mid r, \beta)$.]

To illustrate the effect of the correction, consider the high observation noise regime in Fig. 2, where $\beta^{-1} > 1/2$. The VB approximation locks on to a zero mean, and does not break symmetry. With $\mu_x = \mu_y = 0$ for example, the only remaining term to first order is (using $\lambda = 1$)

$$E[xy] \approx \frac{\beta r}{\gamma_x \gamma_y}, \qquad (7)$$

and the correction is verifiably in the direction of $r$.

4 Summary

We've motivated, by means of a toy example, the benefit of, and a framework for, a perturbation-theoretic view on variational inference. Outside the scope of the workshop, current and future work encompasses the application to matrix factorization (see Supplementary Material), variational inference of stochastic differential equations, Gaussian Process (GP) classification, and the variational approach for sparse GP regression.

References

[1] D. J. C. MacKay. Local minima, symmetry-breaking, and model pruning in variational free energy minimization. Technical report, University of Cambridge Cavendish Laboratory, 2001.
[2] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12(Nov):2583-2648, 2011.
[3] S. Nakajima, M. Sugiyama, S. D. Babacan, and R. Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14(Jan):1-37, 2013.
[4] M. Opper, U. Paquet, and O. Winther. Perturbative corrections for approximate inference in Gaussian latent variable models. Journal of Machine Learning Research, 14(Sep), 2013.
[5] M. Opper and O. Winther. Variational linear response. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
[6] U. Paquet, O. Winther, and M. Opper. Perturbation corrections in approximate inference: Mixture modelling applications. Journal of Machine Learning Research, 10(Jun), 2009.

Supplementary Material

Matrix factorization - general case

We now consider a recommender system with $N$ users, $M$ items and $K$ latent dimensions. An $N \times M$ sparse rating matrix $R$ can be constructed using each user's list of rated items in the catalogue. An element $r_{ij}$ in $R$ contains the rating that user $i$ has given to item $j$. Matrix factorization techniques factorize the rating matrix as $R = X^T Y + E$, where $X$ is a $K \times N$ user matrix, the item matrix $Y$ is $K \times M$ and the residual term $E$ is $N \times M$. We denote the set of observed entries in $R$ by $\mathcal{R}$. We use a Gaussian likelihood for the ratings, i.e. $r_{ij} \sim N(r_{ij}; x_i^T y_j, \beta^{-1})$, and isotropic Gaussian priors for both user and item vectors:

$$p(x_i) = \prod_{k=1}^K N(x_{ik}; 0, \alpha_x^{-1}), \qquad p(y_j) = \prod_{k=1}^K N(y_{jk}; 0, \alpha_y^{-1}). \qquad (8)$$

It is common to take a fully factorized approximation for user $i$'s and item $j$'s vectors over dimensions $k = 1, \dots, K$ (notice that we now use $\eta$ and $\zeta$ for item means and precisions, to avoid subscript spaghetti):

$$q(x_i) = \prod_k q(x_{ik}) = \prod_k N(x_{ik}; \mu_{ik}, \gamma_{ik}^{-1}), \qquad q(y_j) = \prod_k q(y_{jk}) = \prod_k N(y_{jk}; \eta_{jk}, \zeta_{jk}^{-1}).$$

The main quantity of interest in a recommender system is the predicted rating $E[x_i^T y_j]$, whose first order correction requires the computation of $\mathrm{Cov}(x_i^T y_j, V(X, Y))$. With our assumptions, the functions $H$ and $H_q$ in $V(X, Y) = H(X, Y) - H_q(X, Y)$ are

$$H(X, Y) = \frac{\beta}{2} \sum_{(i,j) \in \mathcal{R}} (r_{ij} - x_i^T y_j)^2 + \frac{\alpha_x}{2} \sum_i x_i^T x_i + \frac{\alpha_y}{2} \sum_j y_j^T y_j,$$
$$H_q(X, Y) = \frac{1}{2} \sum_{i,k} \gamma_{ik} (x_{ik} - \mu_{ik})^2 + \frac{1}{2} \sum_{j,k} \zeta_{jk} (y_{jk} - \eta_{jk})^2.$$

Counterintuitively, we see that while we are only interested in correcting $E[x_i^T y_j]$, both $H$ and $H_q$ also depend on terms relative to items not seen by user $i$ and users that have not rated item $j$. As we will see shortly, however, these terms do not contribute to $\mathrm{Cov}(x_i^T y_j, V(X, Y))$. In its unsimplified form,

$$\mathrm{Cov}\!\left(x_i^T y_j,\; V(X, Y)\right) = \mathrm{Cov}\!\left(x_i^T y_j,\; \frac{\beta}{2} \sum_{(n,m) \in \mathcal{R}} (r_{nm} - x_n^T y_m)^2 + \frac{\alpha_x}{2} \sum_{n=1}^N x_n^T x_n + \frac{\alpha_y}{2} \sum_{m=1}^M y_m^T y_m - \frac{1}{2} \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk} (x_{nk} - \mu_{nk})^2 - \frac{1}{2} \sum_{m=1}^M \sum_{k=1}^K \zeta_{mk} (y_{mk} - \eta_{mk})^2\right).$$

In the computation of $\mathrm{Cov}(x_i^T y_j, V(X, Y))$, all the terms in $V(X, Y)$ that do not depend on $x_i$ or $y_j$ are constants and leave the covariance unchanged: considering the user vectors we have for example

$$\mathrm{Cov}\!\left(x_i^T y_j,\; \frac{\alpha_x}{2} \sum_{n=1}^N x_n^T x_n\right) = \mathrm{Cov}\!\left(x_i^T y_j,\; \frac{\alpha_x}{2}\, x_i^T x_i\right).$$

In a similar way, among the likelihood terms $\sum_{(n,m) \in \mathcal{R}} (r_{nm} - x_n^T y_m)^2$ we only need to keep the ones corresponding to the items rated by user $i$ and the users that rated item $j$ (i.e. only the $i$-th row and $j$-th column of the ratings matrix). Defining $\mathcal{M}(i)$ as the set containing the indices of the items rated by user $i$ and $\mathcal{N}(j)$ as the set containing the indices of the users that rated item $j$, we have

$$\mathrm{Cov}\!\left(x_i^T y_j,\; \sum_{(n,m) \in \mathcal{R}} (r_{nm} - x_n^T y_m)^2\right) = \mathrm{Cov}\!\left(x_i^T y_j,\; \sum_{m \in \mathcal{M}(i)} (r_{im} - x_i^T y_m)^2 + \sum_{n \in \mathcal{N}(j)} (r_{nj} - x_n^T y_j)^2\right).$$
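This reduction can be sanity-checked by brute force on a tiny synthetic problem. The sketch below (ours, not from the paper; dimensions, observed entries and parameter values are all arbitrary) samples from a fully factorized $q$ and compares a Monte Carlo estimate of $\mathrm{Cov}(x_i^T y_j, V(X, Y))$ using the full $V$ with the estimate obtained after keeping only the terms that involve $x_i$ or $y_j$.

```python
# Minimal sketch (not from the paper): on a tiny synthetic problem, Monte Carlo check that
# terms of V(X, Y) involving neither x_i nor y_j leave Cov(x_i^T y_j, V) unchanged.
import numpy as np

rng = np.random.default_rng(1)
N, M, K, S = 4, 5, 2, 300_000
i, j = 0, 0
obs = [(0, 1), (0, 2), (1, 0), (2, 0), (3, 4)]     # observed (user, item) pairs; (i, j) unobserved
R = {nm: rng.normal() for nm in obs}
beta, ax, ay = 2.0, 1.0, 1.0

# an arbitrary fully factorized q: means/precisions for X (K x N) and Y (K x M)
mu, gam = rng.normal(size=(K, N)), rng.uniform(1, 3, (K, N))
eta, zet = rng.normal(size=(K, M)), rng.uniform(1, 3, (K, M))
X = mu[..., None] + rng.normal(size=(K, N, S)) / np.sqrt(gam)[..., None]
Y = eta[..., None] + rng.normal(size=(K, M, S)) / np.sqrt(zet)[..., None]

def V(terms, users, items):
    """beta/2 * selected likelihood terms + prior terms - H_q terms for the selected users/items."""
    v = sum(0.5 * beta * (R[(n, m)] - (X[:, n] * Y[:, m]).sum(0))**2 for (n, m) in terms)
    for n in users:
        v = v + 0.5 * ax * (X[:, n]**2).sum(0) - 0.5 * (gam[:, n, None] * (X[:, n] - mu[:, n, None])**2).sum(0)
    for m in items:
        v = v + 0.5 * ay * (Y[:, m]**2).sum(0) - 0.5 * (zet[:, m, None] * (Y[:, m] - eta[:, m, None])**2).sum(0)
    return v

cov = lambda a, b: np.mean(a * b) - np.mean(a) * np.mean(b)
f = (X[:, i] * Y[:, j]).sum(0)                       # x_i^T y_j for every sample
relevant = [(n, m) for (n, m) in obs if n == i or m == j]

print("Cov with full V    :", cov(f, V(obs, range(N), range(M))))
print("Cov with reduced V :", cov(f, V(relevant, [i], [j])))   # agree up to MC error
```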

After removing all the constant terms the simplified covariance is given by

$$\mathrm{Cov}(x_i^T y_j, V) = \mathrm{Cov}\!\left(x_i^T y_j,\; \frac{\beta}{2} \sum_{m \in \mathcal{M}(i)} (r_{im} - x_i^T y_m)^2 + \frac{\beta}{2} \sum_{n \in \mathcal{N}(j)} (r_{nj} - x_n^T y_j)^2 + \frac{\alpha_x}{2} x_i^T x_i + \frac{\alpha_y}{2} y_j^T y_j - \frac{1}{2} \sum_{k=1}^K \gamma_{ik} (x_{ik} - \mu_{ik})^2 - \frac{1}{2} \sum_{k=1}^K \zeta_{jk} (y_{jk} - \eta_{jk})^2\right)$$
$$= \frac{\beta}{2}\, \mathrm{Cov}\!\left(x_i^T y_j,\, \sum_{m \in \mathcal{M}(i)} (r_{im} - x_i^T y_m)^2\right) + \frac{\beta}{2}\, \mathrm{Cov}\!\left(x_i^T y_j,\, \sum_{n \in \mathcal{N}(j)} (r_{nj} - x_n^T y_j)^2\right) + \mathrm{Cov}\!\left(x_i^T y_j,\; \frac{\alpha_x}{2} x_i^T x_i + \frac{\alpha_y}{2} y_j^T y_j - \frac{1}{2} \sum_{k=1}^K \gamma_{ik} (x_{ik} - \mu_{ik})^2 - \frac{1}{2} \sum_{k=1}^K \zeta_{jk} (y_{jk} - \eta_{jk})^2\right)$$
$$= \frac{\beta}{2}\, f(x_i, y_j, Y_{\mathcal{M}(i)}) + \frac{\beta}{2}\, f(y_j, x_i, X_{\mathcal{N}(j)}) + g(x_i, y_j),$$

where $X_{\mathcal{N}(j)}$ is the restriction of $X$ to the columns indexed by $\mathcal{N}(j)$, and $Y_{\mathcal{M}(i)}$ is the restriction of $Y$ to the columns indexed by $\mathcal{M}(i)$.

Computation of $g(x_i, y_j)$. Thanks to our choice of a fully factorized VB distribution and the bi-linearity of the covariance operator we have

$$g(x_i, y_j) = \mathrm{Cov}\!\left(\sum_{k=1}^K x_{ik} y_{jk},\; \sum_{k=1}^K \left[\frac{\alpha_x}{2} x_{ik}^2 + \frac{\alpha_y}{2} y_{jk}^2 - \frac{\gamma_{ik}}{2} (x_{ik} - \mu_{ik})^2 - \frac{\zeta_{jk}}{2} (y_{jk} - \eta_{jk})^2\right]\right) = \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \frac{\alpha_x}{2} x_{ik}^2 + \frac{\alpha_y}{2} y_{jk}^2 - \frac{\gamma_{ik}}{2} (x_{ik} - \mu_{ik})^2 - \frac{\zeta_{jk}}{2} (y_{jk} - \eta_{jk})^2\right).$$

Computation of $f(x_i, y_j, Y_{\mathcal{M}(i)})$ and $f(y_j, x_i, X_{\mathcal{N}(j)})$. We now focus only on the computation of $f(x_i, y_j, Y_{\mathcal{M}(i)})$, as the results for $f(y_j, x_i, X_{\mathcal{N}(j)})$ are analogous. We rewrite the term as

$$f(x_i, y_j, Y_{\mathcal{M}(i)}) = \mathrm{Cov}\!\left(x_i^T y_j,\, \sum_{m \in \mathcal{M}(i)} (r_{im} - x_i^T y_m)^2\right) = \sum_{m \in \mathcal{M}(i)} \mathrm{Cov}\!\left(x_i^T y_j,\; r_{im}^2 - 2 r_{im}\, x_i^T y_m + (x_i^T y_m)^2\right) = \sum_{m \in \mathcal{M}(i)} \left\{-2 r_{im}\, \mathrm{Cov}\!\left(x_i^T y_j, x_i^T y_m\right) + \mathrm{Cov}\!\left(x_i^T y_j, (x_i^T y_m)^2\right)\right\},$$

and analyse separately each of the covariance terms. Thanks to the full factorization of the VB posterior we have

$$\mathrm{Cov}\!\left(x_i^T y_j, x_i^T y_m\right) = \mathrm{Cov}\!\left(\sum_{k=1}^K x_{ik} y_{jk},\; \sum_{k=1}^K x_{ik} y_{mk}\right) = \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right).$$

For $\mathrm{Cov}(x_i^T y_j, (x_i^T y_m)^2)$ we get instead

$$\mathrm{Cov}\!\left(x_i^T y_j, (x_i^T y_m)^2\right) = \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \sum_{k'=1}^K (x_{ik'} y_{mk'})^2 + 2 \sum_{k'=1}^K \sum_{p < k'} x_{ik'} y_{mk'}\, x_{ip} y_{mp}\right)$$
$$= \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk}, (x_{ik} y_{mk})^2\right) + 2 \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \sum_{c \neq k} x_{ic} y_{mc}\, x_{ik} y_{mk}\right)$$
$$= \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk}, (x_{ik} y_{mk})^2\right) + 2 \sum_{k=1}^K \sum_{c \neq k} E[x_{ic} y_{mc}]\, \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right),$$

where terms that do not involve component $k$ drop out because they are independent of $x_{ik} y_{jk}$ under the fully factorized $q$. Combining these results we obtain

$$f(x_i, y_j, Y_{\mathcal{M}(i)}) = \sum_{m \in \mathcal{M}(i)} \sum_{k=1}^K \left\{-2 r_{im}\, \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right) + \mathrm{Cov}\!\left(x_{ik} y_{jk}, (x_{ik} y_{mk})^2\right) + 2 \sum_{c \neq k} E[x_{ic} y_{mc}]\, \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right)\right\}$$
$$= \sum_{m \in \mathcal{M}(i)} \sum_{k=1}^K \left\{-2 \Big(r_{im} - \sum_{c \neq k} E[x_{ic} y_{mc}]\Big)\, \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right) + \mathrm{Cov}\!\left(x_{ik} y_{jk}, (x_{ik} y_{mk})^2\right)\right\}.$$

If we then define $r_{im}^{(k)}$ as the expected contribution to the rating $r_{im}$ from component $k$, i.e.

$$r_{im}^{(k)} = r_{im} - \sum_{c \neq k} E[x_{ic} y_{mc}] = r_{im} - \sum_{c \neq k} \mu_{ic}\, \eta_{mc},$$

we can rewrite $f(x_i, y_j, Y_{\mathcal{M}(i)})$ as

$$f(x_i, y_j, Y_{\mathcal{M}(i)}) = \mathrm{Cov}\!\left(x_i^T y_j,\, \sum_{m \in \mathcal{M}(i)} (r_{im} - x_i^T y_m)^2\right) = \sum_{m \in \mathcal{M}(i)} \sum_{k=1}^K \left\{-2 r_{im}^{(k)}\, \mathrm{Cov}\!\left(x_{ik} y_{jk}, x_{ik} y_{mk}\right) + \mathrm{Cov}\!\left(x_{ik} y_{jk}, (x_{ik} y_{mk})^2\right)\right\} = \sum_{m \in \mathcal{M}(i)} \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \big(r_{im}^{(k)} - x_{ik} y_{mk}\big)^2\right).$$

For each of the $|\mathcal{M}(i)|$ points we will then only need the correction for $K$ univariate problems (that are far simpler to compute). Notice in particular that this result is only possible thanks to our assumption of fully factorized priors and VB posteriors, as there are no correlations among variables of the same user/item vectors to be taken into account when computing the covariances.
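The per-component reduction can likewise be verified numerically. The sketch below (ours, not from the paper; all numbers arbitrary) compares, for a single likelihood term, a Monte Carlo estimate of $\mathrm{Cov}(x_i^T y_j, (r_{im} - x_i^T y_m)^2)$ with the sum over $k$ of the univariate covariances built from $r_{im}^{(k)}$.

```python
# Minimal sketch (not from the paper): Monte Carlo check of the per-component reduction
#   Cov(x_i^T y_j, (r_im - x_i^T y_m)^2) = sum_k Cov(x_ik y_jk, (r_im^(k) - x_ik y_mk)^2),
# with r_im^(k) = r_im - sum_{c != k} mu_ic * eta_mc, under a fully factorized q.
import numpy as np

rng = np.random.default_rng(2)
K, S = 3, 1_000_000
r_im = 1.7                                                 # an arbitrary observed rating

mu_i, gam_i = rng.normal(size=K), rng.uniform(1, 3, K)     # q(x_i): component means/precisions
eta_j, zet_j = rng.normal(size=K), rng.uniform(1, 3, K)    # q(y_j)
eta_m, zet_m = rng.normal(size=K), rng.uniform(1, 3, K)    # q(y_m)

x_i = mu_i[:, None]  + rng.normal(size=(K, S)) / np.sqrt(gam_i)[:, None]
y_j = eta_j[:, None] + rng.normal(size=(K, S)) / np.sqrt(zet_j)[:, None]
y_m = eta_m[:, None] + rng.normal(size=(K, S)) / np.sqrt(zet_m)[:, None]

cov = lambda a, b: np.mean(a * b) - np.mean(a) * np.mean(b)

# full K-dimensional covariance of one likelihood term
lhs = cov((x_i * y_j).sum(0), (r_im - (x_i * y_m).sum(0))**2)

# sum of K univariate covariances built from the residual ratings r_im^(k)
rhs = 0.0
for k in range(K):
    r_k = r_im - sum(mu_i[c] * eta_m[c] for c in range(K) if c != k)
    rhs += cov(x_i[k] * y_j[k], (r_k - x_i[k] * y_m[k])**2)

print("full covariance     :", lhs)
print("sum of K univariate :", rhs)                        # equal up to MC error
```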

Putting it all together: corrections for the predicted ratings

Combining the results above, the covariance term in the first order correction for the required expectation of this matrix factorization problem is given by

$$\mathrm{Cov}(x_i^T y_j, V) = \frac{\beta}{2} \sum_{m \in \mathcal{M}(i)} \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \big(r_{im}^{(k)} - x_{ik} y_{mk}\big)^2\right) + \frac{\beta}{2} \sum_{n \in \mathcal{N}(j)} \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \big(r_{nj}^{(k)} - x_{nk} y_{jk}\big)^2\right) + \sum_{k=1}^K \mathrm{Cov}\!\left(x_{ik} y_{jk},\; \frac{\alpha_x}{2} x_{ik}^2 + \frac{\alpha_y}{2} y_{jk}^2 - \frac{\gamma_{ik}}{2} (x_{ik} - \mu_{ik})^2 - \frac{\zeta_{jk}}{2} (y_{jk} - \eta_{jk})^2\right).$$

Due to the fully factorized VB approximation, the computation of these covariances does not require vector operations and can easily be done manually or using symbolic mathematics software such as SymPy. The required non-central Gaussian moments can be computed using Wick's theorem. Assuming that $r_{ij}$ is an unobserved rating that we want to predict (so that the likelihood term in $V(X, Y)$ does not depend on $x_i^T y_j$; in the toy example presented in Section 3 we were instead computing a correction on an observed rating), the final expression for the covariance term in the first order correction of $E[x_i^T y_j]$ is

$$\mathrm{Cov}(x_i^T y_j, V) = \sum_{k=1}^K \left\{\sum_{m \in \mathcal{M}(i)} \left(\frac{\beta\, \mu_{ik}\, \eta_{jk}\, \eta_{mk}^2}{\gamma_{ik}} - \frac{\beta\, r_{im}^{(k)}\, \eta_{jk}\, \eta_{mk}}{\gamma_{ik}} + \frac{\beta\, \mu_{ik}\, \eta_{jk}}{\gamma_{ik}\, \zeta_{mk}}\right) + \sum_{n \in \mathcal{N}(j)} \left(\frac{\beta\, \mu_{ik}\, \eta_{jk}\, \mu_{nk}^2}{\zeta_{jk}} - \frac{\beta\, r_{nj}^{(k)}\, \mu_{ik}\, \mu_{nk}}{\zeta_{jk}} + \frac{\beta\, \mu_{ik}\, \eta_{jk}}{\gamma_{nk}\, \zeta_{jk}}\right) + \left(\frac{\alpha_x}{\gamma_{ik}} + \frac{\alpha_y}{\zeta_{jk}}\right) \mu_{ik}\, \eta_{jk}\right\}.$$

To illustrate the effect of the correction we can compute its value in the case where symmetries in the user vector $x_i$ are not broken, i.e. $\mu_i = 0$ (this may happen for noisy users with only few rated items). In this case, the expected rating with a first order correction is (with $\lambda = 1$)

$$E[x_i^T y_j] \approx \sum_{k=1}^K \sum_{m \in \mathcal{M}(i)} \frac{\beta\, r_{im}^{(k)}\, \eta_{jk}\, \eta_{mk}}{\gamma_{ik}}.$$

We see that the expectation depends on the ratings given by user $i$ to other items in the catalogue, weighted by how similar their vectors are to $y_j$ (so that we have a positive contribution from $r_{im}^{(k)}$ to the predicted rating only if the $k$-th components of the means of the $j$-th and $m$-th item vectors have the same direction).
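The symbolic route mentioned in the text can be sketched as follows (our own illustration, not code accompanying the paper): each per-component covariance is a polynomial expectation under independent Gaussians, so it can be computed with SymPy by expanding around the $q$-means and substituting central moments (Wick's theorem). The last line checks the result against the corresponding summand of the final expression above.

```python
# Minimal sketch (not from the paper): one per-component covariance, computed symbolically
# with SymPy via Gaussian central moments (Wick's theorem), and compared with the
# corresponding summand of the final expression above.
import sympy as sp

mu, eta_j, eta_m, r_k = sp.symbols('mu eta_j eta_m r_k')
g, zj, zm = sp.symbols('gamma zeta_j zeta_m', positive=True)
dx, dj, dm = sp.symbols('dx dj dm')                       # zero-mean fluctuations under q

# central moments E[d^p] of the independent Gaussian fluctuations, for p = 0..4
mom = {dx: [1, 0, 1/g,  0, 3/g**2],
       dj: [1, 0, 1/zj, 0, 3/zj**2],
       dm: [1, 0, 1/zm, 0, 3/zm**2]}

def E(expr):
    """Expectation of a polynomial in dx, dj, dm (independent, zero-mean Gaussians)."""
    poly = sp.Poly(sp.expand(expr), dx, dj, dm)
    return sp.simplify(sum(c * mom[dx][px] * mom[dj][pj] * mom[dm][pm]
                           for (px, pj, pm), c in poly.terms()))

x_ik, y_jk, y_mk = mu + dx, eta_j + dj, eta_m + dm        # q-means plus fluctuations
f = x_ik * y_jk
h = (r_k - x_ik * y_mk)**2
cov = sp.simplify(E(f * h) - E(f) * E(h))
print(cov)

# corresponding summand of the final expression, divided by beta/2
summand = 2 * (mu * eta_j * eta_m**2 / g - r_k * eta_j * eta_m / g + mu * eta_j / (g * zm))
print(sp.simplify(cov - summand))                         # prints 0
```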
