A Tutorial on Variational Bayes

Similar documents
GRE 精确 完整 数学预测机经 发布适用 2015 年 10 月考试

d) There is a Web page that includes links to both Web page A and Web page B.

On the Quark model based on virtual spacetime and the origin of fractional charge

Chapter 2 Bayesian Decision Theory. Pattern Recognition Soochow, Fall Semester 1

Integrated Algebra. Simplified Chinese. Problem Solving

Principia and Design of Heat Exchanger Device 热交换器原理与设计

Lecture 2: Introduction to Probability

0 0 = 1 0 = 0 1 = = 1 1 = 0 0 = 1

Chapter 2 the z-transform. 2.1 definition 2.2 properties of ROC 2.3 the inverse z-transform 2.4 z-transform properties

( 选出不同类别的单词 ) ( 照样子完成填空 ) e.g. one three

The dynamic N1-methyladenosine methylome in eukaryotic messenger RNA 报告人 : 沈胤

Source mechanism solution

2012 AP Calculus BC 模拟试卷

Riemann s Hypothesis and Conjecture of Birch and Swinnerton-Dyer are False

A proof of the 3x +1 conjecture

5. Polymorphism, Selection. and Phylogenetics. 5.1 Population genetics. 5.2 Phylogenetics

Type and Propositions

通量数据质量控制的理论与方法 理加联合科技有限公司

Lecture 2. Random variables: discrete and continuous

三类调度问题的复合派遣算法及其在医疗运营管理中的应用

系统生物学. (Systems Biology) 马彬广

CT reconstruction Simulation

Molecular weights and Sizes

Galileo Galilei ( ) Title page of Galileo's Dialogue concerning the two chief world systems, published in Florence in February 1632.

= lim(x + 1) lim x 1 x 1 (x 2 + 1) 2 (for the latter let y = x2 + 1) lim

Alternative flat coil design for electromagnetic forming using FEM

Algorithms and Complexity

Lecture Note on Linear Algebra 16. Eigenvalues and Eigenvectors

Chapter 22 Lecture. Essential University Physics Richard Wolfson 2 nd Edition. Electric Potential 電位 Pearson Education, Inc.

Cooling rate of water

Service Bulletin-04 真空电容的外形尺寸

QTM - QUALITY TOOLS' MANUAL.

Synthesis of PdS Au nanorods with asymmetric tips with improved H2 production efficiency in water splitting and increased photostability

Explainable Recommendation: Theory and Applications

Easter Traditions 复活节习俗

Chapter 4. Mobile Radio Propagation Large-Scale Path Loss

Halloween 万圣节. Do you believe in ghosts? 你相信有鬼吗? Read the text below and do the activity that follows. 阅读下面的短文, 然后完成练习 :

上海激光电子伽玛源 (SLEGS) 样机的实验介绍

Sichuan Earthquake 四川地震

课内考试时间? 5/10 5/17 5/24 课内考试? 5/31 课内考试? 6/07 课程论文报告

2. The lattice Boltzmann for porous flow and transport

Digital Image Processing. Point Processing( 点处理 )

4.8 Low-Density Parity-Check Codes

Happy Niu Year 牛年快乐 1

Rigorous back analysis of shear strength parameters of landslide slip

能源化学工程专业培养方案. Undergraduate Program for Specialty in Energy Chemical Engineering 专业负责人 : 何平分管院长 : 廖其龙院学术委员会主任 : 李玉香

Lecture Note on Linear Algebra 14. Linear Independence, Bases and Coordinates

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

Workshop on Numerical Partial Differential Equations and Scientific Computing

USTC SNST 2014 Autumn Semester Lecture Series

Effect of lengthening alkyl spacer on hydroformylation performance of tethered phosphine modified Rh/SiO2 catalyst

Conditional expectation and prediction

Geomechanical Issues of CO2 Storage in Deep Saline Aquifers 二氧化碳咸水层封存的力学问题

澳作生态仪器有限公司 叶绿素荧光测量中的 PAR 测量 植物逆境生理生态研究方法专题系列 8 野外进行荧光测量时, 光照和温度条件变异性非常大 如果不进行光照和温度条件的控制或者精确测量, 那么荧光的测量结果将无法科学解释

课内考试时间 5/21 5/28 课内考试 6/04 课程论文报告?

Lecture 13 : Variational Inference: Mean Field Approximation

Key Topic. Body Composition Analysis (BCA) on lab animals with NMR 采用核磁共振分析实验鼠的体内组分. TD-NMR and Body Composition Analysis for Lab Animals

Chapter 20 Cell Division Summary

There are only 92 stable elements in nature

The preload analysis of screw bolt joints on the first wall graphite tiles in East

Design, Development and Application of Northeast Asia Resources and Environment Scientific Expedition Data Platform

Concurrent Engineering Pdf Ebook Download >>> DOWNLOAD

Human CXCL9 / MIG ELISA Kit

Multivariate Statistics Analysis: 多元统计分析

13: Variational inference II

國立中正大學八十一學年度應用數學研究所 碩士班研究生招生考試試題

Fundamentals of Heat and Mass Transfer, 6 th edition

Differential Equations (DE)

生物統計教育訓練 - 課程. Introduction to equivalence, superior, inferior studies in RCT 謝宗成副教授慈濟大學醫學科學研究所. TEL: ext 2015

Pore structure effects on the kinetics of methanol oxidation over nanocast mesoporous perovskites

深圳市凯琦佳科技股份有限公司 纳入规格书

Measurement of accelerator neutron radiation field spectrum by Extended Range Neutron Multisphere Spectrometers and unfolding program

Project Report of DOE

RESEARCH AND APPLICATION OF THE EVAPORATION CAPACITY SPATIAL INTERPOLATION METHOD FOR AGRICULUTRAL ENVIRONMENT / 面向农业环境的蒸发量空间插值方法研究和应用

Atomic & Molecular Clusters / 原子分子团簇 /

2NA. MAYFLOWER SECONDARY SCHOOL 2018 SEMESTER ONE EXAMINATION Format Topics Comments. Exam Duration. Number. Conducted during lesson time

Unsupervised Learning

Lecture 4-1 Nutrition, Laboratory Culture and Metabolism of Microorganisms

Numbers and Fundamental Arithmetic

CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection

École Doctorale Mathématiques, Sciences de l'information et de l'ingénieur UdS INSA ENGEES THÈSE

Numerical Analysis in Geotechnical Engineering

Finite-Difference Methods (FDM)

Effect of promoters on the selective hydrogenolysis of glycerol over Pt/W containing catalysts

Ni based catalysts derived from a metal organic framework for selective oxidation of alkanes

Enhancement of the activity and durability in CO oxidation over silica supported Au nanoparticle catalyst via CeOx modification

Zinc doped g C3N4/BiVO4 as a Z scheme photocatalyst system for water splitting under visible light

Chapter 10 ALCOHOLS, PHENOLS, AND ETHERS

Chapter 1 Linear Regression with One Predictor Variable

Ch.9 Liquids and Solids

NiFe layered double hydroxide nanoparticles for efficiently enhancing performance of BiVO4 photoanode in

An Introduction to Generalized Method of Moments. Chen,Rong aronge.net

Photo induced self formation of dual cocatalysts on semiconductor surface

tan θ(t) = 5 [3 points] And, we are given that d [1 points] Therefore, the velocity of the plane is dx [4 points] (km/min.) [2 points] (The other way)

Lecture English Term Definition Chinese: Public

Definition: Arenes are hydrocarbons based on the benzene ring as a structural unit.

复合功能 VZθ 执行器篇 Multi-Functional VZθActuator

Variational Message Passing. By John Winn, Christopher M. Bishop Presented by Andy Miller

1. Length of Daytime (7 points) 白昼长度 (7 分 )

Transcription:

A Tutorial on Variational Bayes Junhao Hua ( 华俊豪 ) Laboratory of Machine and Biological Intelligence, Department of Information Science & Electronic Engineering, ZheJiang University 204/3/27 Email: huajh7@gmail.com For further information, see: http://www.huajh7.com

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example: VB for Mixture model Discussion Application Reference 2

A Problem: How Learn From Data? Typical, we use a complex statistical model, but how to learn its parameters and latent variables? Data: X Model: P(X \theta, Z) 3

Challenge Maximum Likelihood: Overfits the data Model Complexity Computational tractability Bayesian Framework: Arising intractable integrals: partition function posterior of unobserved variables Approximate Inference: Monte Carlo Sampling: e.g. MCMC, particle filter. Variational Bayes 4

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example: VB for Mixture model Discussion Application Reference 5

Variational Free energy Basic Idea: conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD). Use a simpler variational distribution, Q(Z), to approximate the true posterior P(Z X) Two alternative explanations Minimize (reverse) Kullback-Leibler divergence QZ ( ) QZ ( ) DKL( Q P) = QZ ( ) log = QZ ( ) log + log PD ( ) PZ ( D) PZD (, ) Maximum variational free energy(lower bound) LQ ( ) = QZ ( ) log PZD (, ) QZ ( ) log QZ ( ) = E[log PZD (, )] + HQ ( ) Z Z Z Z Q 6

Optimization Techniques: Mean Field Approximation Originated in the statistical physics literature Conditional Independence assumption Decoupling: intractable distribution -> a product of tractable marginal distributions (tractable subgraph) Factorization: M QZ ( ) = qz ( i D) i= 7

Optimization Techniques: Variational methods Optimization Problem: Maximum the lower bound LQZ ( ( )) = E [ln PZD (, )] + HQZ ( ( )) Q( Z) QZ ( ) = Q( Z) Where i i i Subject to normalization constraints: i. Qi( Zi) dzi = Seek the extremum of a functional: Euler Lagrange equation 8

Derivation Consider the partition Consider Energy term, Q( Z) i i i We define normalization constant. {, },where = i i Z = i Z \ Zi Z Z Z E [ln PZD (, )] = ( Q ( Z))ln( ZDdZ, ) = Q ( Z ) Q ( Z )ln( Z, D) dz dz i i i i i i = Q ( Z ) ln( Z, D) dz i i Q ( Z ) i i i = Q( Z)ln exp ln( ZD, ) i i Q Q Z Q Z dz * = i( i)ln i ( i) i +lnc Q ( Z ) = exp ln( Z, D) * i i Q ( Z ) i i i C ( Z ) i dz i, where C is the 9

Derivation (cont.) Consider the entropy, = i Then we get the functional, H ( Q( Z)) ( Q ( Z ))ln Q ( Z ) dz i k k k i i = Q ( Z ) Q ( Z )ln Q ( Z ) dz dz i i i i i i i i i = Q ( Z )ln Q ( Z ) dz i i i i i i Q ( Z ) = Q ( Z )ln Q ( Z ) dz i i i i i L(Q(Z))= * Qi( Zi)ln Qi ( Zi) dzi + Qi( Zi)ln Qi( Zi) dzi + lnc i =( * Qi( Zi)ln Qi ( Zi) dzi Qi( Zi)ln Qi( Zi) dzi)+ Qk( Zk)ln Qk( Zk) dzk + lnc k i Q ( Z ) = Q ( Z )ln dz + Q ( Z )ln Q ( Z ) dz + lnc * i i i i i k k k k k Qi( Zi) k i * = D KL ( i ( i ) i ( i )) + [ i ( i )] + ln Q Z Q Z HQ Z C i i 0

Derivation (cont.) Maximizing energy functional L w.r.t. each Q_i could be achieved by Lagrange multipliers and functional differentiation i Q Z Q Z Q Z dz = *. { D KL[ i( i) i ( i)] λi( i( i) i )}: 0 Qi( Zi) A long algebraic derivation would then eventually lead to a Gibbs distribution; Fortunately, L will be maximized when the KL divergence is zero, Q Z = Q Z = PZ Z D * i( i) i ( i) exp ln ( i, i, ) Q ( Z ) i i C Where C is normalization constant.

Challenge Q Z = Q Z = PZ Z D * i( i) i ( i) exp ln ( i, i, ) Q ( Z ) i i C The expectation can be intractable. We need pick a family of distributions Q that allow for exact inference Then Find Q' Q that maximizes the functional energy. 2

Challenge The expectation can be intractable. We need pick a family of distributions Q that allow for exact inference Q' Q Then Find functional energy. that maximizes the Exponential Family 3

Why Exponential Family? Principle of maximum entropy Density function: Log partition function: canonical parameters Mean parameters 4

Properties of Exponential family Mean parameters: θ various statistical computations, among them marginalization and maximum likelihood estimation, can be understood as transforming from one parameterization to the other. All realizable mean parameters Always a convex subset of Forward mapping From canonical parameters Backward mapping From mean parameters d φ( x) to the mean parameters θ to the canonical parameters φ( x) θ 5

Properties of partition function A 6

Conjugate Duality: Maximum Likelihood and Maximum Entropy The variational representation of log partition function The conjugate dual function to A 7

Nonconvexity for Naïve Mean Field Mean field optimization is always nonconvex for any exponential family in which the state space is finite. It is a strict subset of M(G) Contains all of the extreme points of polytope 8

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example: VB for Mixture model Discussion Application Reference 9

Inference in Bayesian Networks Variational Message Passing (Winn, Bishop, 2003.) Message from parents: my X uy Message to parents: = φ Update natural parameter vector : =,{ } ( ) m u m ({ m } ) i pa X Y XY X i X i cp Y φ = φ + m * Y Y i Y j Y Y j ch Y PX ( φ) = exp[ φ T ux ( ) + f( X) + g ( φ)] 20

Summary of VB Mean-field Assumption Variational Methods QZ ( ) exp ln PZ (, Z, D) C i i i Q ( Z ) orq ( mb ( Z )) i i Conjugate-exponential family Forward, backward mapping 2

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example VB for Mixture model Discussion Application Reference 22

Mixture of Gaussian (MoG) N K z (, µ, Λ ) = Nx ( n µ k, Λk ) nk n= k= px Z pxz (,, π, µ, Λ ) = px ( Z, µ, Λ) pz ( π) p( π ) p( µ Λ) p( Λ) 23

Infinite Student s t-mixture DP( α, G ) 0 Dirichlet Process G = π j ( V) δθ j= Dirichlet Process Mixture Stick-Breaking prior j j π ( V) = V ( V) j j i i= V ~ Beta(, α) j p( α) = Gam( α η, η ) 2 N p( X ) = π ( V ) St( x µ, Λ, v ) n= j= j n j j j 24

Latent Dirichlet Allocation (LDA) 25

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example: VB for Mixture model Discussion Application Reference 26

步骤一 : 选择无信息先验分布 选择先验分布原则 : 共轭分布,Jefferys 原则, 最大熵原则等 一般要求先验分布应取共轭分布 (conjugate distribution) 才合适, 即先验分布 h(θ) 与后验分布 h(θ x) 属于同一分布类型 π Λ z i=,..., k 0 i=,..., k 0 0 i=,..., k ~ Nm ( 0,( 0Λi) ) i=,..., N X ~ SymDir( K, α ) ~ W( w, υ ) µ β ~ Mult(, π ) N( µ ) i=,..., N z 说明 : K: 单高斯分布个数,N: 样本个数 SymDir() :K 维对称 Dirichlet 分布 ; 它是分类分布 (categorical) 或多项式分布的共轭先验分布 W() 表示 Wishart 分布 ; 对多元高斯分布, 它是 Precision 矩阵 ( 逆协方差矩阵 ) 的共轭先验 Mult() 表示多项分布 ; 是二项式分布的推广, 表示在一个 K 维向量中只有一项为, 其它都为 0. N() 为多元高斯分布 X = { x,..., x } 是 N个训练样本, 每项都是服从多元高斯分布的 K维向量 ; Z = { z,..., z } 是一组潜在变量, 每项 z = { z, z } 表示对应的样本 x 属于哪个混合部分 ; N k k,... nk k π = { π,..., π } 表示每个单高斯分布混合比例 ; µ N K 和 Λ 分别表示每个单高斯分布参数的均值和精度 ; i=,..., k i=,..., k K, α, β, w, υ, m 称为超参数 ( hyperparameter), 都为已知量 0 0 0 0 0 27

用 盘子表示法 (plate notation) 表示多元高斯混合模型, 如图所示 小正方形表示不变的超参数, 如 β 0,ν 0,α 0,µ 0,W 0 ; 圆圈表示随机变量, 如 π, zi, xi, µ k, Λk ; 圆圈内的值为已知量, 其中 [K],[D] 表示 K D 维的向量,[D,D] 表示 DxD 的矩阵 ; 单个 K 表示一个有 K 个值的 categorical 变量 ; 波浪线和开关表示变量 x i 通过一个 K 维向量 z i 来选择其他传入的变量 (µ k, Ʌ k ) 28

步骤二 : 写出联合概率密度函数 假设各参数与潜在变量条件独立, 则联合概率密度函数可以表示为 pxz (,, π, µ, Λ ) = px ( Z, µ, Λ) pz ( π) p( π ) p( µ Λ) p( Λ) 每个因子为 : 其中, N K z (, µ, Λ ) = Nx ( n µ k, Λk ) n= k= px Z pz ( π) = n= k= Γ( Kα0) p( π) = Γ K ( α ) 0 k = α0 k ( ) ( k 0,( 0 k) ) p( Λ ) = W( Λ w, ν ) N k K π K z k p µ Λ = N µ m β Λ nk π 0 0 T N( x µ, ) = exp ( ) ( ) D/2 /2 x µ x µ (2 π ) 2 ( ν D )/2 W ( Λ w, ν) = B( w, ν) Λ exp( Tr( w Λ)) 2 D ν/2 νd/2 D( D )/4 ν + i Bw (, ν) = w (2 π Γ( )) 2 i= nk 29

步骤三 : 计算边缘密度 (VB- marginal) () 计算 Z 的边缘密度, 根据平均场假设, 有 qz (, π, µ, Λ ) = qzq ( ) ( π, µ, Λ) * ln q( Z) = E [ln pxz (,, π, µ, Λ )] + const 其中 π, µ, Λ = E [ln px ( Z, µ, Λ) pz ( π) p( π ) p( µ Λ) p( Λ )] + const π, µ, Λ = E[ln pz ( π )] + E [ln px ( Z, µ, Λ )] + const π µ, Λ N K = z ln ρ + const n= k= * znk 两边分别取对数可得, q ( Z) ρnk * znk 归一化, 得 q ( Z) r, 其中 nk nk D T ln ρnk = E[ln πk] + E[ln Λk ] ln(2 π ) Eµ, [( x ) ( )] k Λk n µ k Λk xn µ k 2 2 2 N K = n= k= nk N K n= k= r 可见 q * ( Z) 是多个单观测多项式分布 (single-observation multinomial distribution) 的乘积 更进一步, 根据 categorical 分布, 有 Ez [ ] = r nk ρ nk = K j = nk ρ nk nj 30

K (2) 计算 π 的概率密度, q( π, µ, Λ ) = q( π ) q( µ k, Λk) * ln ( π) Z, µ, Λ[ (, π, µ, )] N * r nk α0 n 两边取对数 q ( π)~ π + = k, 可见 q * ( 是 π ) Dirichlet 分布, 其中 α = α0 + Nk, Nk = rnk. 0 k = q = E p X Z Λ + const q * ( π)~ Dir( α) = ln p( π) + E [ln p( Z π)] + const Z K N K =( α -) lnπ + r lnπ + const k nk k k= n= k= K n= N n= 3

最后同时考虑 µ, Λ, 对于每一个单高斯分布有, µ Λ = µ Λ µ Λ * ln q( k, k) EZ, π, µ, Λ [ln px ( Z, k, k) p( k, k)] i k N = ln ( µ k, k) [ nk]ln ( n µ k, k ) n= 经过一系列重组化解将得到 Gaussian-Wishart 分布, i k p Λ + E z N x Λ + const q ( µ, Λ ) = N( µ m,( β Λ ) ) W( Λ w, ν ) * k k k k k k k k k 其中 βk = β0 + Nk, mk = ( β0m0 + Nx k k ), βk β0nk T wk = w0 + NkSk + ( xk m0)( xk m0), β0 + Nk vk = v0 + Nk, N xk = rnk xn, Nk n= N T Sk = rnk( xk xk)( xk xk). Nk n= 32

步骤四 : 迭代收敛 最后, 注意到对 π,µ,ʌ 的边缘概率都需要且只需要 r nk ; 另一方面,r nk 的 T 计算需要 ρ nk, 而这又是基于 E[ln π k], E[ln Λk ], Eµ, [( ) ( )], k Λ x k n µ k Λk xn µ k 即需要知道 π,µ,ʌ 的值 不难确定这三个期望的一般表达式为 : 这些结果能导出, K 且 r =. ( K ) i= ln πk E[ln πk ] = ψα ( k) ψ αi D vk + i ln Λ k E[ln Λ k ] = ψ Dln 2 ln i= k 2 + + Λ Eµ k, Λ x Λ x = D + x m W x m k k = nk T T [( n µ k) k( n µ k)] βk νk( n k) k( n k) D ν /2 T k rnk π kλ k exp ( xn mk) Wk( xn mk) 2βk 2 33

Summary: Variational Inference for GMM q * ( π)~ Dir( α) α = α + N 0 k N k N = r n= nk q ( µ, Λ ) = N( µ m,( β Λ ) ) W( Λ w, ν ) * k k k k k k k k k N β = β + N, m = ( β m+ Nx), v = v+ N, x = r x k 0 k k 0 0 k k k 0 k k nk n βk Nk n= β N w w N S ( x m )( x m ), S r ( x x )( x x ) N 0 k T k = 0 + k k + k 0 k 0 k = nk k n k n β0 + Nk Nk n= VBM-Step Latent variable Soft-count or ESS T N K * z q Z = rnk n= k= ( ) nk VBE-Step D ν /2 T k rnk π kλ k exp ( xn mk) Wk( xn mk) 2βk 2 34

Outline Motivation The Variational Bayesian Framework Variational Free Energy Optimization Tech. Mean Field Approximation Exponential Family Bayesian Networks Example: VB for Mixture model Discussion Application Reference 35

EM vs. VB 36

The Accuracy-vs-Complexity trade-off 37

Application Matrix Factorization: Probabilistic PCA, Mixtures of PPCA, Independent Factor Analysis(IFA), nonlinear ICA/IFA/SSM, Mixture of Bayesian ICA, Bayesian Mixture of Factor Analyzers, etc. Time Series: Bayesian HMMs, variational Kalman filtering, Switching State-space models, etc. Topic model: Latent Dirichlet Allocation(LDA), (Hierarchical) Dirichlet Process (Mixture) Model, Bayesian Nonparametrical Models, etc. Variational Gaussian Process Classifiers Sparse Bayesian Learning Variational Bayesian Filtering, etc. 38

Reference Neal, Radford M., and Geoffrey E. Hinton. "A view of the EM algorithm that justifies incremental, sparse, and other variants." Learning in graphical models. Springer Netherlands, 998. 355-368. Attias, Hagai. "Inferring parameters and structure of latent variable models by variational Bayes." Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 999. Winn, John, Christopher M. Bishop, and Tommi Jaakkola. "Variational Message Passing." Journal of Machine Learning Research 6.4 (2005). Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and Trends in Machine Learning.-2 (2008): -305. Šmídl, Václav, and Anthony Quinn. The variational Bayes method in signal processing. Springer, 2006. Wikipedia, Variational Bayesian methods, http://en.wikipedia.org/wiki/variational_bayes 39

Any Question? 40