Expectation Propagation performs smooth gradient descent
Guillaume Dehaene
1 Expectation Propagation performs smooth gradient descent (Guillaume Dehaene)
2 In a nutshell
Problem: posteriors are uncomputable. Solution: parametric approximations.
But which one should we choose? Laplace? Variational Bayes? Expectation Propagation?
I will unite these three methods.
3 Outline
I. Gaussian approximation methods
- A. Laplace
- B. Variational Bayes
- C. Expectation Propagation
II. Smooth gradient methods
- A. Fixed-point conditions
- B. Reformulating gradient descent
- C. Gaussian smoothing
- D. Using the factor structure
III. Some consequences
4 I. Gaussian approximation methods
We have collected some data D_1, …, D_n. We have a great IID model:
- Prior: p(θ)
- Conditional: p(D_i | θ)
The posterior: p(θ | D_1, …, D_n) = (1 / Z(D_1, …, D_n)) p(θ) ∏_i p(D_i | θ)
5 Uncomputability
Usually, the posterior is uncomputable:
- θ is high-dimensional
- The likelihoods have a complicated structure
Two solutions:
- Sampling methods
- Approximation methods
We are going to focus on Gaussian approximations.
6 A. The Laplace approximation
The problem: finding a Gaussian approximation q(θ) of p(θ), i.e. a quadratic approximation of ψ(θ) = −log p(θ). The most basic quadratic approximation is the Taylor expansion to second order.
7 A. The Laplace approximation
We center at the global maximum of p(θ): θ_MAP. The first-order term vanishes there, so
ψ(θ) ≈ ψ(θ_MAP) + ((θ − θ_MAP)² / 2) Hψ(θ_MAP)
The Laplace approximation:
- A Gaussian
- Centered at θ_MAP
- With inverse-variance Hψ(θ_MAP)
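As a concrete illustration, here is a minimal 1-D sketch of this construction, assuming ψ = −log p is available as a callable together with its second derivative; the names (psi, h_psi) and the toy target are illustrative, not from the talk:

```python
# Minimal 1-D Laplace approximation sketch; psi = -log p, h_psi = its
# second derivative. Names and toy target are illustrative assumptions.
from scipy.optimize import minimize_scalar

def laplace_approximation(psi, h_psi):
    theta_map = minimize_scalar(psi).x         # minimize psi = maximize p
    return theta_map, 1.0 / h_psi(theta_map)   # variance = H_psi(theta_MAP)^-1

# Toy non-Gaussian target: p(theta) ∝ exp(-theta^4/4 - theta^2/2)
psi   = lambda t: t**4 / 4 + t**2 / 2
h_psi = lambda t: 3 * t**2 + 1
mu, var = laplace_approximation(psi, h_psi)    # mu ≈ 0, var = 1
```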
8 B. The Variational Bayes approximation
Laplace is fine but not very principled. Instead, let's find the Gaussian which minimizes a «sensible notion of distance» to p(θ):
q_VB = argmin_q KL(q, p)
Distance = reverse Kullback-Leibler divergence:
KL(q, p) = ∫ q(θ) log( q(θ) / p(θ) ) dθ
9 B. The Variational Bayes approximation
Parameterize Gaussians with their mean and std: (μ, σ), with θ_{μ,σ} = μ + σ η₀ and η₀ ~ N(0, 1).
q(θ | μ, σ) = (1 / (√(2π) σ)) exp( −(θ − μ)² / (2σ²) )
KL(q, p) = E[ψ(μ + σ η₀)] − log σ + const
Gradient in μ: ∇_μ = E[ψ′(μ + σ η₀)]
Gradient in σ: ∇_σ = σ E[Hψ(μ + σ η₀)] − 1/σ
10 B. The Variational Bayes approximation
Gradient in μ: ∇_μ = E[ψ′(μ + σ η₀)]
Gradient in σ: ∇_σ = σ E[Hψ(μ + σ η₀)] − 1/σ
We can use stochastic gradient descent to optimize the reverse KL divergence, using samples from η₀.
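A minimal 1-D sketch of this stochastic scheme, using the reparameterization θ = μ + σ η₀ from the previous slide; the learning rate, batch size and toy target are my own illustrative assumptions:

```python
# Minimal 1-D sketch of SGD on the reverse KL, using the gradients above.
# d_psi and h_psi are the first two derivatives of psi = -log p.
import numpy as np

def vb_gaussian(d_psi, h_psi, mu=0.0, sigma=1.0, lr=1e-2, steps=5000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        theta = mu + sigma * rng.standard_normal(32)   # theta = mu + sigma * eta0
        g_mu = d_psi(theta).mean()                     # grad_mu ≈ E[psi'(mu + sigma*eta0)]
        g_sigma = sigma * h_psi(theta).mean() - 1.0 / sigma   # grad_sigma
        mu -= lr * g_mu
        sigma = max(sigma - lr * g_sigma, 1e-6)        # keep sigma positive
    return mu, sigma

# Same toy target as before: psi(t) = t^4/4 + t^2/2
mu, sigma = vb_gaussian(lambda t: t**3 + t, lambda t: 3 * t**2 + 1)
```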
11 C. Expectation Propagation
First key idea:
- Instead of a global approximation q(θ) ≈ p(θ)
- We use the factor structure of p(θ): p(θ) = ∏_{i=1}^n f_i(θ)
- And compute n local approximations g_i(θ) ≈ f_i(θ)
We can recover a global approximation as: q_EP(θ) = ∏_i g_i(θ)
12 C. Expectation Propagation
First key idea: compute n local approximations g_i(θ).
Second key idea:
- Iteratively refine the approximation g_i(θ) ≈ f_i(θ)
- Using the other approximations g_j(θ), j ≠ i, as context
13 C. Expectation Propagation
To update g_i(θ):
- Define the «hybrid»: h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ)
- Compute the mean and variance of h_i(θ), and the Gaussian q which has that same mean and variance
- New approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ)
14 C. Expectation Propagation
First key idea: compute n local approximations g_i(θ).
Second key idea: iteratively refine the approximation g_i(θ) ≈ f_i(θ):
g_i(θ) = Gauss( f_i(θ) ∏_{j≠i} g_j(θ) ) / ∏_{j≠i} g_j(θ)
15 C. Expectation Propagation
Computing the mean and the variance of the hybrid requires model-specific work:
- Analytic
- Quadrature methods
- Pre-compiled approximations
- Sampling
EP can also use non-Gaussian approximations.
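To make the iteration concrete, here is a minimal 1-D sketch of Gaussian EP in which the hybrid moments are computed by brute-force quadrature on a grid; the site parameterization, grid and names are my own illustrative choices, not part of the talk:

```python
# Minimal 1-D Gaussian EP sketch. Sites g_i are kept in natural parameters
# (precision, precision * mean); hybrid moments come from grid quadrature.
# log_f is a list of vectorized callables log f_i (illustrative interface).
import numpy as np

def ep_1d(log_f, n_iter=50, grid=np.linspace(-10, 10, 2001)):
    n = len(log_f)
    prec = np.full(n, 1e-3)                  # site precisions
    pmean = np.zeros(n)                      # site precision * mean
    for _ in range(n_iter):
        for i in range(n):
            # Cavity: product of all sites except i (still Gaussian).
            c_prec, c_pm = prec.sum() - prec[i], pmean.sum() - pmean[i]
            # Hybrid h_i ∝ f_i * cavity, evaluated on the grid.
            log_h = log_f[i](grid) - 0.5 * c_prec * grid**2 + c_pm * grid
            w = np.exp(log_h - log_h.max()); w /= w.sum()
            m = (w * grid).sum()                    # E_{h_i}[theta]
            v = (w * (grid - m) ** 2).sum()         # Var_{h_i}[theta]
            # New site: moment-matched Gaussian divided by the cavity.
            prec[i], pmean[i] = 1.0 / v - c_prec, m / v - c_pm
    p = prec.sum()
    return pmean.sum() / p, 1.0 / p          # mean and variance of q_EP
```

Keeping the sites in natural parameters makes the divide-by-the-cavity step of the slide a simple subtraction of parameters.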
16 Summary
To deal with an uncomputable target distribution p(θ) = ∏_{i=1}^n f_i(θ):
- Laplace approximation: GD on ψ(θ)
- VB approximation: SGD on KL(q, p)
- EP: the EP iteration
These methods couldn't be further apart!!
17 Summary
Gradient descent is well-understood and intuitive:
- Dynamics of an object sliding down a slope
Stochastic optimization of KL(q, p) and the EP iteration are unintuitive. This makes them hard to use.
18 II. Smooth gradient methods
I will now unite these three methods under a single framework: smooth gradient methods.
- These iterate on Gaussian approximations to p(θ)
- They are closely related to GD, hence intuitive
- The three methods correspond to special cases
19 A. Fixed-point conditions
The methods can be united because their fixed-point conditions are extremely similar.
Laplace: the gradient is 0: ∇ψ(θ_MAP) = 0
20 A. Fixed-point conditions
Variational Bayes. Recall the gradient:
Gradient in μ: ∇_μ = E[ψ′(μ + σ η₀)]
Gradient in σ: ∇_σ = σ E[Hψ(μ + σ η₀)] − 1/σ
The optimal Gaussian q_VB must respect:
E_{q_VB}[∇ψ(θ)] = 0
E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹
21 A. Fixed-point conditions
Variational Bayes:
E_{q_VB}[∇ψ(θ)] = 0
E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹
The smoothed gradient must be 0. The covariance is related to the peakedness of log p.
22 A. Fixed-point conditions
Expectation Propagation. Recall the update of g_i(θ):
- Define the «hybrid»: h_i(θ) ∝ f_i(θ) ∏_{j≠i} g_j(θ)
- Compute the mean and variance of h_i(θ), and the Gaussian q which has that same mean and variance
- New approximation: g_i(θ) = q(θ) / ∏_{j≠i} g_j(θ)
23 A. Fixed-point conditions
Expectation Propagation. Easy fixed-point condition: all the hybrids h_i(θ) and the global Gaussian approximation q_EP(θ) have the same mean / variance.
Using this, and a little bit of math (Dehaene 2016):
Σ_i E_{h_i}[∇log f_i(θ)] = 0
−Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇log f_i(θ)] = Cov(q_EP)⁻¹
24 A. Fixed-point conditions
Laplace: ∇ψ(θ_MAP) = 0
VB: E_{q_VB}[∇ψ(θ)] = 0 and E_{q_VB}[Hψ(θ)] = Cov(q_VB)⁻¹
EP: Σ_i E_{h_i}[∇log f_i(θ)] = 0 and −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇log f_i(θ)] = Cov(q_EP)⁻¹
25 A. Fixed-point conditions
Laplace: ∇ψ(θ_MAP) = 0
VB: E_{q_VB}[∇ψ(θ)] = 0 and Cov(q_VB)⁻¹ E_{q_VB}[(θ − μ_VB) ∇ψ(θ)] = Cov(q_VB)⁻¹
EP: Σ_i E_{h_i}[∇log f_i(θ)] = 0 and −Σ_i Cov(h_i)⁻¹ E_{h_i}[(θ − μ_EP) ∇log f_i(θ)] = Cov(q_EP)⁻¹
26 B. Reformulating gradient descent
The first step: reframing gradient descent as
- Iterating over Gaussian approximations to p(θ)
- Fixed point = Laplace approximation
Key idea: GD corresponds to using a linear approximation of ψ = −log p.
27 B. Reformulating gradient descent
Interpretation of GD: θ_{n+1} = θ_n − λ ∇ψ(θ_n)
«Please find the maximizer of the equation»
−(θ − θ_n) ∇ψ(θ_n) − (1 / (2λ)) (θ − θ_n)²
This is the same as the mean of the Gaussian:
q_{n+1}(θ) ∝ exp( −(θ − θ_n) ∇ψ(θ_n) − (1 / (2λ)) (θ − θ_n)² )
28 B. Reformulating gradient descent
A trivial reformulation of GD. Iterate:
q_{n+1}(θ) ∝ exp( −(θ − θ_n) ∇ψ(θ_n) − (1 / (2λ)) (θ − θ_n)² )
θ_{n+1} = E_{q_{n+1}}[θ]
This iterates Gaussian approximations to p(θ). But the fixed point isn't the Laplace approximation! (Its variance is λ, regardless of the curvature of ψ.)
29 B. Reformulating gradient descent
We need to use an optimization algorithm which uses a quadratic approximation of ψ.
Newton's method: θ_{n+1} = θ_n − Hψ(θ_n)⁻¹ ∇ψ(θ_n)
«Please find the maximizer of the equation»
−(θ − θ_n) ∇ψ(θ_n) − (1/2) Hψ(θ_n) (θ − θ_n)²
30 B. Reformulating gradient descent
A trivial reformulation of Newton's. Iterate:
q_{n+1}(θ) ∝ exp( −(θ − θ_n) ∇ψ(θ_n) − (Hψ(θ_n) / 2) (θ − θ_n)² )
θ_{n+1} = E_{q_{n+1}}[θ]
We iterate Gaussian approximations of p(θ) until we find the Laplace approximation.
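A minimal 1-D sketch of this reformulation, with d_psi and h_psi the assumed first and second derivatives of ψ = −log p: each step builds q_{n+1} and moves to its mean, which is exactly a Newton step.

```python
# Minimal sketch of "Newton's method as Gaussian iteration" (1-D).
# d_psi, h_psi: first/second derivative of psi = -log p (illustrative names).
def newton_as_gaussians(d_psi, h_psi, theta=1.0, n_iter=20):
    for _ in range(n_iter):
        beta = h_psi(theta)                  # inverse variance of q_{n+1}
        theta = theta - d_psi(theta) / beta  # mean of q_{n+1} = Newton step
    return theta, 1.0 / beta                 # fixed point = Laplace approximation
```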
31 Algorithm 1: disguised gradient descent (Newton's method)
[Diagram: DGD takes ψ, builds a quadratic approximation, exponentiates it, and returns a Gaussian]
32 C. Gaussian smoothing
The Laplace approximation is a point approximation. We could improve the algorithm:
- Smooth the objective function: ψ̃ = ψ ∗ exp( −θ² / (2σ²) )
- Run the algorithm on ψ̃
33 C. Gaussian smoothing
How should we choose the smoothing bandwidth σ?
We could choose it once for all steps, OR, on each step, we could use the current Gaussian approximation q_n(θ) to smooth ψ.
34 Algorithm 2: smoothed gradient descent
- Initialize with any Gaussian q_0
- Loop:
  μ_n = E_{q_n}[θ]
  r = E_{q_n}[∇ψ(θ)]
  β = E_{q_n}[Hψ(θ)]
  q_{n+1}(θ) ∝ exp( −r (θ − μ_n) − (β/2) (θ − μ_n)² )
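In 1-D the two Gaussian expectations can be computed with Gauss-Hermite quadrature, giving this minimal sketch of algorithm 2 (the quadrature order and names are illustrative assumptions):

```python
# Minimal 1-D sketch of Algorithm 2 (smoothed gradient descent / Newton).
# The expectations under q_n use Gauss-Hermite quadrature for N(0, 1).
import numpy as np

def smoothed_newton(d_psi, h_psi, mu=0.0, var=1.0, n_iter=50):
    x, w = np.polynomial.hermite_e.hermegauss(30)
    w /= w.sum()                              # weights for expectations under N(0, 1)
    for _ in range(n_iter):
        theta = mu + np.sqrt(var) * x         # quadrature nodes under q_n
        r = (w * d_psi(theta)).sum()          # r    = E_qn[psi'(theta)]
        beta = (w * h_psi(theta)).sum()       # beta = E_qn[H_psi(theta)]
        # q_{n+1} ∝ exp(-r(θ-μ_n) - (β/2)(θ-μ_n)²): mean μ_n - r/β, variance 1/β
        mu, var = mu - r / beta, 1.0 / beta
    return mu, var
```

At a fixed point, r = 0 and β = 1/var, which are exactly the VB conditions of slide 20.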
35 Algorithm 2: smoothed gradient descent [figure]
36 D. Using the factor structure
In order to use EP, p(θ) needs to have a nice factor structure:
p(θ) = ∏_{i=1}^n f_i(θ)
For ψ: ψ(θ) = Σ_{i=1}^n φ_i(θ), with φ_i = −log f_i
37 D. Using the factor structure
Crazy idea:
- We could use the VB algorithm
- With non-Gaussian smoothing
- On each component φ_i of ψ
38 D. Using the factor structure
The algorithm 2 update has two equivalent forms:
β = E_{q_n}[Hψ(θ)]
OR
β = Cov(q_n)⁻¹ E_{q_n}[(θ − μ_n) ∇ψ(θ)]
When we replace q_n by h_i, we have a choice to make: which form should we use?
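(A step the slide leaves implicit: for a Gaussian q_n the two forms agree by Stein's lemma; integrating by parts against the Gaussian density gives E_{q_n}[(θ − μ_n) ∇ψ(θ)] = Cov(q_n) E_{q_n}[Hψ(θ)], so multiplying by Cov(q_n)⁻¹ recovers E_{q_n}[Hψ(θ)]. For the non-Gaussian hybrids h_i this identity fails, so the two forms genuinely differ.)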
39 Algorithm 3: smooth EP
- Initialize with any Gaussians g_1, g_2, …, g_n
- Loop:
  h_i ∝ f_i ∏_{j≠i} g_j
  μ_i = E_{h_i}[θ]
  r_i = E_{h_i}[∇φ_i(θ)]
  β_i = Var(h_i)⁻¹ E_{h_i}[(θ − μ_i) ∇φ_i(θ)]
  g_i(θ) ∝ exp( −r_i (θ − μ_i) − (β_i/2) (θ − μ_i)² )
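A minimal 1-D sketch of this loop, reusing the grid-quadrature trick from the EP sketch above; d_phi is a list of derivatives φ_i′ = −(log f_i)′, and all names are illustrative assumptions:

```python
# Minimal 1-D sketch of Algorithm 3 (smooth EP). Sites g_i are stored in
# natural parameters; note beta_i uses the second form of the update,
# computed under the (non-Gaussian) hybrid h_i.
import numpy as np

def smooth_ep_1d(log_f, d_phi, n_iter=50, grid=np.linspace(-10, 10, 2001)):
    n = len(log_f)
    beta = np.full(n, 1e-3)      # site inverse variances
    pmean = np.zeros(n)          # site (inverse variance * mean)
    for _ in range(n_iter):
        for i in range(n):
            b_cav, pm_cav = beta.sum() - beta[i], pmean.sum() - pmean[i]
            # Hybrid h_i ∝ f_i * prod_{j != i} g_j, evaluated on the grid.
            log_h = log_f[i](grid) - 0.5 * b_cav * grid**2 + pm_cav * grid
            w = np.exp(log_h - log_h.max()); w /= w.sum()
            mu_i = (w * grid).sum()
            var_i = (w * (grid - mu_i) ** 2).sum()
            r_i = (w * d_phi[i](grid)).sum()           # E_hi[phi_i'(theta)]
            beta[i] = (w * (grid - mu_i) * d_phi[i](grid)).sum() / var_i
            # g_i ∝ exp(-r_i(θ-μ_i) - (β_i/2)(θ-μ_i)²), in natural parameters:
            pmean[i] = beta[i] * mu_i - r_i
    b = beta.sum()
    return pmean.sum() / b, 1.0 / b          # mean and variance of q_EP
```

Per the next slide, this loop should track the plain EP sketch from slide 15; comparing the two numerically is a quick sanity check.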
40 D. Using the factor structure
Smooth EP is actually exactly equivalent to EP!!
This ties EP to a much more intuitive algorithm: Newton's method. However, we have lost the most important feature: the explicit objective function.
41 Summary
I have presented a family of algorithms which:
- Iterate Gaussian approximations to p(θ)
- By computing smoothed quadratic approximations of log p(θ) (or parts of it)
Different smoothings correspond to different known methods:
- Laplace = no smoothing
- VB = Gaussian smoothing
- EP = various hybrid smoothings
42 III. Consequences
Key observation: EP and algorithm 2 are closely related to Newton's method. They must behave in a similar fashion!!
43 III. Consequences
Newton's method has two striking features:
- Very fast convergence near its fixed points
- Possible oscillations (overshooting)
Solution: complement it with line-search methods.
EP is also known for its oscillations! Are these also overshoots of the target?
44 III. Consequences
EP still needs improvements. We could import good ideas from Newton's method and use them on EP:
- Line-search (but how?)
- A non-SPD second-order term β_i (but how?)
Finally, the smooth Newton view of EP might be useful on its own.
45 III. Consequences
Since Laplace, VB and EP are closely related, we can ask whether they behave similarly in some situations.
- Laplace = no smoothing
- VB = Gaussian smoothing
- EP = various hybrid smoothings
Whenever the smoothing distributions are similar, the methods are similar!
46 III. Consequences
If all hybrids are almost equal to the global approximation, h_i ≈ q_n, i.e. f_i / g_i is negligible compared to q_n, then EP and VB have the same smoothing and behave similarly.
47 III. Consequences
Furthermore, if q_n is almost a Dirac distribution, i.e. its width is negligible compared to the oscillations of ψ and/or the φ_i functions, then algorithm 2 behaves similarly to algorithm 1. Thus, the VB approximation behaves similarly to the Laplace approximation.
48 Conclusion
Various approximation methods are closely related: they can be obtained through smooth Newton methods.
- Laplace = no smoothing
- VB = Gaussian smoothing
- EP = hybrid smoothings
Corollary: VB and EP are very closely related. EP behaves similarly to Newton's method.
49 Speculation
Smooth Newton variants might be computationally useful for VB and EP. VB and EP should give better approximations than Laplace. Can this give a path towards understanding or improving the convergence of EP and VB?
50 References
- Minka (2001), Expectation Propagation for approximate Bayesian inference
- Seeger (2007), Expectation Propagation for Exponential Families
- Dehaene & Barthelmé (2017), Expectation Propagation in the large-data limit
- Dehaene (2016), Expectation Propagation performs a smoothed gradient descent