Statistical Machine Learning Lecture 4: Variational Bayes
1 / 29 Statistical Machine Learning, Lecture 4: Variational Bayes. Melih Kandemir, Özyeğin University, İstanbul, Turkey
2 / 29 Synonyms
- Variational Bayes
- Variational Inference
- Variational Bayesian Inference
- Mean-Field Inference (in certain cases)
3 / 29 Information
We would like to measure the amount of information received when a binary variable $x \in \{0, 1\}$ is observed. Information is the degree of surprise after observing $x$. We want to devise a function $h(x)$ that measures the information gained from $x$.
4 / 29 What should h(x) look like?
Suppose we observe two independent binary variables $x$ and $y$. The information received from observing both should be the sum of the information from the individual events: $h(x, y) = h(x) + h(y)$. Because independence implies $p(x, y) = p(x)p(y)$, it is suitable to measure information by
$$h(x) = -\log_2 p(x).$$
The base 2 is arbitrary. When base 2 is used, the unit of information is called a bit. The negative sign ensures that the measure of information is non-negative.
5 / 29 Entropy
The expected amount of information for $x \sim p(x)$ is
$$H[x] = -\sum_x p(x) \log_2 p(x).$$
Note that the case $p(x) = 0$ looks degenerate. We handle it with $\lim_{p \to 0} p \ln p = 0$, so a state with $p(x) = 0$ contributes 0 to $H[x]$.
6 / 29 Example 1
Consider the case where we have four possible states. When they are equally likely, the entropy turns out to be
$$H[x] = -4 \left[ \frac{1}{4} \log_2 \frac{1}{4} \right] = 2 \text{ bits}.$$
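7 / 29 Example 2
Assume we again have four possible states, this time with probabilities $\left(\frac{5}{8}, \frac{1}{4}, \frac{1}{16}, \frac{1}{16}\right)$. Then the entropy is
$$H[x] = -\frac{5}{8} \log_2 \frac{5}{8} - \frac{1}{4} \log_2 \frac{1}{4} - \frac{2}{16} \log_2 \frac{1}{16} = 1.42 \text{ bits}.$$
The uniform case carries more information: 2 bits versus 1.42 bits.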
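The two examples can be checked numerically. The sketch below is only an illustration: the helper name `entropy_bits` is hypothetical, and the last two probabilities of the second example are taken to be 1/16 each, an assumption that reproduces the 1.42 bits stated on the slide.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; states with p = 0 are skipped (they contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 1: four equally likely states.
uniform = [1/4, 1/4, 1/4, 1/4]
# Example 2: assumed probabilities (the last two values are an assumption).
skewed = [5/8, 1/4, 1/16, 1/16]

h_uniform = entropy_bits(uniform)
h_skewed = entropy_bits(skewed)

print(f"uniform: {h_uniform:.2f} bits")
print(f"skewed:  {h_skewed:.2f} bits")
```

The skip of zero-probability states mirrors the limit argument on the entropy slide rather than evaluating `log2(0)`.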
8 / 29 Measures of information content
- $-\log_2 p(x)$: bits
- $-\ln p(x)$: nats
Distributions that maximize the entropy
- Discrete: uniform
- Continuous (for a given location and spread): normal
9 / 29 Entropy in the continuous domain
There is no exact counterpart. Discretizing $x$ into bins of width $\Delta$ and using the mean value theorem, we find that the entropy of a continuous density $p(x)$ differs from the term below by $-\ln \Delta$:
$$H[x] = -\int p(x) \log p(x)\,dx.$$
This term is called the differential entropy. Although the differential entropy diverges from the exact entropy as $\Delta \to 0$, it is often used in place of the plain entropy for continuous densities. We will adopt the same convention here.
10 / 29 Relative entropy or KL divergence
Suppose that, for some reason, we need to approximate $p(x)$ by another density $q(x)$ that has more pleasant properties. The additional information (in nats) that must be conveyed as a result of using $q(x)$ in place of $p(x)$ is
$$-\log q(x) - \left(-\log p(x)\right) = -\log \frac{q(x)}{p(x)} = \log \frac{p(x)}{q(x)}.$$
Since $x$ follows $p(x)$, the expected additional information is
$$\int p(x) \log \frac{p(x)}{q(x)}\,dx.$$
This quantity is called the relative entropy or Kullback-Leibler divergence and is denoted by $\mathrm{KL}[p \,\|\, q]$.
11 / 29 Convexity
Consider an arbitrary function $f(x)$ and the points $\lambda a + (1 - \lambda) b$ with $\lambda \in [0, 1]$, which lie between $a$ and $b$. If the chord joining $(a, f(a))$ and $(b, f(b))$ always lies on or above $f(x)$, then $f(x)$ is called a convex function. More formally, $f(x)$ is said to be convex if for any $a$ and $b$ the inequality below is satisfied:
$$\lambda f(a) + (1 - \lambda) f(b) \geq f(\lambda a + (1 - \lambda) b).$$
Figure: C. Bishop, Pattern Recognition and Machine Learning, Springer.
12 / 29 Jensen's inequality
We can prove by induction that convexity also holds for more than two points:
$$\sum_{i=1}^{M} \lambda_i f(x_i) \geq f\left(\sum_{i=1}^{M} \lambda_i x_i\right),$$
where $\{x_1, \dots, x_M\}$ is a set of points in the function domain and $\sum_{i=1}^{M} \lambda_i = 1$ with $\lambda_i \geq 0$. We can think of $\{\lambda_1, \dots, \lambda_M\}$ as the parameters of a categorical distribution with $M$ states. Hence we can equivalently write
$$E[f(x)] \geq f(E[x]).$$
This outcome generalizes to continuous variables straightforwardly (use Riemann integration):
$$\int f(x) p(x)\,dx \geq f\left(\int x\, p(x)\,dx\right).$$
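A quick numerical check of Jensen's inequality can make the statement concrete. The sketch below uses the convex function $f(x) = -\log x$; the sample points, the weights, and the variable names are arbitrary choices made for illustration.

```python
import math
import random

def f(x):
    """A convex function on x > 0."""
    return -math.log(x)

random.seed(0)
M = 5
# Arbitrary points in the domain of f.
xs = [random.uniform(0.1, 5.0) for _ in range(M)]
# Arbitrary non-negative weights, normalized to sum to one.
raw = [random.random() + 1e-9 for _ in range(M)]
total = sum(raw)
lams = [r / total for r in raw]

expected_f = sum(l * f(x) for l, x in zip(lams, xs))   # E[f(x)]
f_expected = f(sum(l * x for l, x in zip(lams, xs)))   # f(E[x])
jensen_gap = expected_f - f_expected

print(f"E[f(x)] = {expected_f:.4f}, f(E[x]) = {f_expected:.4f}")
```

The gap `jensen_gap` should be non-negative for any choice of points and weights, and it vanishes when all points coincide.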
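13 / 29 KL divergence is a dissimilarity measure
Considering that $-\log x$ is a convex function, Jensen's inequality gives
$$\mathrm{KL}[p \,\|\, q] = -\int p(x) \log \frac{q(x)}{p(x)}\,dx \geq -\log \underbrace{\int \frac{q(x)}{p(x)} p(x)\,dx}_{=1} = 0.$$
Because $-\log x$ is a strictly convex function (i.e. equality holds only at intersection points), $p(x) = q(x) \Leftrightarrow \mathrm{KL}[p \,\|\, q] = 0$. Hence, the KL divergence is a dissimilarity measure between two densities. It is not a metric in the strict sense, because it is not symmetric: in general $\mathrm{KL}[p \,\|\, q] \neq \mathrm{KL}[q \,\|\, p]$.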
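The properties above (non-negativity, zero only for identical distributions, asymmetry) can be illustrated on discrete distributions. This is a minimal sketch with two hypothetical four-state distributions; the function name `kl` is an assumption, and it requires $q_i > 0$ wherever $p_i > 0$.

```python
import math

def kl(p, q):
    """Discrete KL[p || q] in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over four states.
p = [5/8, 1/4, 1/16, 1/16]
q = [1/4, 1/4, 1/4, 1/4]

kl_pq = kl(p, q)
kl_qp = kl(q, p)
kl_pp = kl(p, p)

print(f"KL[p||q] = {kl_pq:.4f} nats")
print(f"KL[q||p] = {kl_qp:.4f} nats")
print(f"KL[p||p] = {kl_pp:.4f} nats")
```

Swapping the arguments changes the value, which is why the direction of the divergence matters when it is later used as a loss function.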
14 / 29 Calculus of variations
Typically we have scalars or vectors as variables, and we operate on mappings from these variables to other entities. For instance, in $f(x): \mathbb{R}^D \to \mathbb{R}$, the vector $x$ is our variable of interest and $f(\cdot)$ is a function of it. There are some cases where we take functions as the variables of interest and operate on mappings from functions to other entities:
$$F: f(x) \to \mathbb{R}.$$
Such mappings are called functionals. One example is the KL divergence. The branch of mathematics that focuses on functionals is called the calculus of variations.
15 / 29 What if we have non-conjugate priors?
Assume we are given a data set $X = \{x_1, \dots, x_N\}$ and a Bayesian model
$$X \mid \theta \sim \prod_{n=1}^{N} p(x_n \mid \theta), \qquad \theta \sim p(\theta),$$
where the prior $p(\theta)$ on the set of latent variables is not conjugate to the likelihood $p(x_n \mid \theta)$. We are interested in the posterior $p(\theta \mid X)$, which does not have a closed-form solution. What shall we do then?
16 / 29 Approximating the posterior
Choose a density $q(\theta \mid \gamma)$ parameterized by $\gamma$, and construct an optimization problem that makes $q(\theta \mid \gamma)$ as similar as possible to the true posterior $p(\theta \mid X)$. But what sort of optimization problem would be suitable? Hint: Put the pieces together.
17 / 29 How about this?
$$\operatorname*{argmin}_{q(\theta \mid \gamma)} \mathrm{KL}[p(\theta \mid X) \,\|\, q(\theta \mid \gamma)]$$
Did we solve the problem now?
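18 / 29 How about this?
Not quite!
$$\mathrm{KL}[p(\theta \mid X) \,\|\, q(\theta \mid \gamma)] = \int p(\theta \mid X) \log \frac{p(\theta \mid X)}{q(\theta \mid \gamma)}\,d\theta.$$
The loss function depends on $p(\theta \mid X)$, which we do not know. We have ended up where we started!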
19 / 29 How about the other way around?
$$\operatorname*{argmin}_{q(\theta \mid \gamma)} \mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta \mid X)]$$
This is at least worth pursuing. Approximating the posterior by solving this optimization problem is called Variational Bayes! There are also ways to go forward from $\mathrm{KL}[p(\theta \mid X) \,\|\, q(\theta \mid \gamma)]$ by introducing further approximations. This is called Expectation Propagation. We will cover that approach towards the end of the semester.
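20 / 29 Variational Bayes
Using $p(\theta \mid X) = p(\theta, X) / p(X)$,
$$\mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta \mid X)] = \int q(\theta \mid \gamma) \log \frac{q(\theta \mid \gamma)}{p(\theta \mid X)}\,d\theta$$
$$= \int q(\theta \mid \gamma) \log \frac{q(\theta \mid \gamma)\, p(X)}{p(\theta, X)}\,d\theta$$
$$= \int q(\theta \mid \gamma) \log q(\theta \mid \gamma)\,d\theta + \int q(\theta \mid \gamma) \log p(X)\,d\theta - \int q(\theta \mid \gamma) \log p(\theta, X)\,d\theta.$$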
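21 / 29 Variational Bayes
$$\mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta \mid X)] = \underbrace{E_{q(\theta \mid \gamma)}[\log q(\theta \mid \gamma)]}_{-H_{q(\theta \mid \gamma)}[\theta]} + \underbrace{E_{q(\theta \mid \gamma)}[\log p(X)]}_{\log p(X)} - E_{q(\theta \mid \gamma)}[\log p(\theta, X)].$$
Rearranging the terms, we get the interesting outcome below:
$$\underbrace{\log p(X)}_{\text{const}} = \underbrace{E_{q(\theta \mid \gamma)}[\log p(\theta, X)] + H_{q(\theta \mid \gamma)}[\theta]}_{\mathcal{L}} + \underbrace{\mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta \mid X)]}_{\geq 0}.$$
Hence, $\mathcal{L}$ is a lower bound on the log of the evidence, which is why it is called the Evidence Lower Bound (ELBO). The ELBO equals the log-evidence iff $q(\theta \mid \gamma) = p(\theta \mid X)$.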
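The decomposition of the log-evidence into ELBO plus KL divergence can be verified on a toy problem where the exact posterior is available. The sketch below assumes a latent variable with three discrete states; the prior, the likelihood values for one fixed observation, and the candidate `q` are all hypothetical numbers chosen for illustration.

```python
import math

# Hypothetical model: theta takes one of three values.
prior = [0.5, 0.3, 0.2]            # p(theta = k)
likelihood = [0.10, 0.40, 0.70]    # p(X | theta = k) for the observed X

joint = [pr * li for pr, li in zip(prior, likelihood)]   # p(theta, X)
evidence = sum(joint)                                    # p(X)
posterior = [j / evidence for j in joint]                # p(theta | X)

def elbo(q):
    """E_q[log p(theta, X)] plus the entropy of q."""
    expected_log_joint = sum(qk * math.log(jk) for qk, jk in zip(q, joint) if qk > 0)
    entropy = -sum(qk * math.log(qk) for qk in q if qk > 0)
    return expected_log_joint + entropy

def kl(q, p):
    """Discrete KL[q || p] in nats."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)

q = [0.2, 0.5, 0.3]   # an arbitrary approximate posterior
log_evidence = math.log(evidence)

print(f"log p(X)  = {log_evidence:.6f}")
print(f"ELBO + KL = {elbo(q) + kl(q, posterior):.6f}")
```

Plugging in the exact posterior for `q` makes the KL term vanish, so the ELBO then matches the log-evidence.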
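22 / 29 Variational Bayes: inference becomes optimization
Because $\log p(X)$ does not depend on $\gamma$, minimizing the KL divergence is equivalent to maximizing the ELBO:
$$\operatorname*{argmin}_{q(\theta \mid \gamma)} \mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta \mid X)] \equiv \operatorname*{argmax}_{q(\theta \mid \gamma)} \mathcal{L}.$$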
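23 / 29 Inference as optimization
Let us take a closer look at the generic form and consider the feasibility of the approach:
$$\operatorname*{argmax}_{q(\theta \mid \gamma)} \mathcal{L} = \operatorname*{argmax}_{\gamma} \left\{ \sum_{n=1}^{N} E_{q(\theta \mid \gamma)}[\log p(x_n \mid \theta)] + E_{q(\theta \mid \gamma)}[\log p(\theta)] + H_{q(\theta \mid \gamma)}[\theta] \right\}$$
$$= \operatorname*{argmax}_{\gamma} \left\{ \sum_{n=1}^{N} E_{q(\theta \mid \gamma)}[\log p(x_n \mid \theta)] - \mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta)] \right\}.$$
Calculate $E_{q(\theta \mid \gamma)}[\log p(x_n \mid \theta)]$ and look up $H_{q(\theta \mid \gamma)}[\theta]$ or, alternatively, $\mathrm{KL}[q(\theta \mid \gamma) \,\|\, p(\theta)]$. Take the gradient of the ELBO with respect to $\gamma$ and optimize.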
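24 / 29 Mean-Field Variational Bayes
Let us choose $q(\theta) = \prod_{i \in P} q(\theta_i)$, where $P$ is a partitioning of the set of all latent variables. Expressing the ELBO in terms of one of the partitions reads
$$\mathcal{L} = \int q(\theta_j) \underbrace{\left\{ \int \log p(X, \theta) \prod_{i \neq j} q(\theta_i)\,d\theta_i \right\}}_{E_{q(\theta) \setminus q(\theta_j)}[\log p(X, \theta)]} d\theta_j - \int q(\theta_j) \log q(\theta_j)\,d\theta_j + \text{const}.$$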
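25 / 29 Mean-Field Variational Bayes
Let us define $\tilde{p} \propto \exp\left\{ E_{q(\theta) \setminus q(\theta_j)}[\log p(X, \theta)] \right\}$ as a new density and fix all the factors except $q(\theta_j)$. The ELBO then equals $-\mathrm{KL}[q(\theta_j) \,\|\, \tilde{p}]$ up to a constant, so the ideal $q(\theta_j)$ would make $\mathrm{KL}[q(\theta_j) \,\|\, \tilde{p}] = 0$. Hence,
$$q(\theta_j) \leftarrow \exp\left\{ E_{q(\theta) \setminus q(\theta_j)}[\log p(X, \theta)] \right\} / Z_j,$$
where $Z_j = \int \exp\left\{ E_{q(\theta) \setminus q(\theta_j)}[\log p(X, \theta)] \right\} d\theta_j$ is the normalizer. Update all partitions individually using this update rule. Iterate until convergence.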
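The coordinate-wise update can be tried on a tiny discrete model, where each factor is updated as the normalized exponential of the expected log-joint under the other factor. The sketch below assumes two binary latent variables and a hypothetical 2x2 table for $p(X, \theta_1, \theta_2)$; all numbers and names are illustrative.

```python
import math

# Hypothetical unnormalized joint p(X, theta1, theta2) for the observed X.
joint = [[0.30, 0.10],
         [0.05, 0.25]]

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def elbo(q1, q2):
    """E_q[log p(X, theta)] + H[q] for the factorized q = q1 * q2."""
    value = 0.0
    for i in range(2):
        for j in range(2):
            weight = q1[i] * q2[j]
            if weight > 0:
                value += weight * (math.log(joint[i][j]) - math.log(weight))
    return value

# Arbitrary initial factors.
q1, q2 = [0.5, 0.5], [0.9, 0.1]
history = [elbo(q1, q2)]

for _ in range(20):
    # Update q1 with q2 fixed: exponentiate the expected log-joint, then normalize.
    q1 = normalize([math.exp(sum(q2[j] * math.log(joint[i][j]) for j in range(2)))
                    for i in range(2)])
    # Update q2 with q1 fixed.
    q2 = normalize([math.exp(sum(q1[i] * math.log(joint[i][j]) for i in range(2)))
                    for j in range(2)])
    history.append(elbo(q1, q2))

log_evidence = math.log(sum(sum(row) for row in joint))
print(f"final ELBO = {history[-1]:.4f}, log p(X) = {log_evidence:.4f}")
```

Each sweep should not decrease the ELBO, and the final value stays below the log-evidence because the true posterior in this table does not factorize.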
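26 / 29 Genuine Bayesian linear regression
$$y \mid X, w \sim \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^T x_n, \beta^{-1}),$$
$$w \mid \alpha \sim \mathcal{N}(w \mid 0, \alpha^{-1} I),$$
$$\alpha \sim \mathcal{G}(\alpha \mid a_0, b_0).$$
Approximate $p(w, \alpha \mid X, y)$ with $q(w, \alpha) = q(w) q(\alpha)$.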
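27 / 29 Update for q(w)
$$\log q(w) = E_{q(\alpha)}[\log p(y \mid w, X)] + E_{q(\alpha)}[\log p(w \mid \alpha)] + \underbrace{E_{q(\alpha)}[\log p(\alpha)]}_{\text{const}} + \text{const}$$
$$= -\frac{\beta}{2} w^T X^T X w + \beta y^T X w - \frac{E_{q(\alpha)}[\alpha]}{2} w^T w + \text{const}$$
$$= -\frac{1}{2} w^T \Big[ \beta X^T X + \underbrace{E_{q(\alpha)}[\alpha]}_{a} I \Big] w + \beta y^T X w + \text{const}.$$
Completing the square, taking the exponent, and normalizing gives the update rule
$$q(w) \leftarrow \mathcal{N}(w \mid m, S),$$
where
$$S = [\beta X^T X + a I]^{-1}, \qquad m = \beta S X^T y.$$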
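28 / 29 Update for q(α)
$$\log q(\alpha) = \underbrace{E_{q(w)}[\log p(y \mid w, X)]}_{\text{const}} + E_{q(w)}[\log p(w \mid \alpha)] + E_{q(w)}[\log p(\alpha)] + \text{const}$$
$$= \underbrace{-\frac{1}{2} \log |\alpha^{-1} I|}_{\frac{D}{2} \log \alpha} - \frac{\alpha}{2} \underbrace{E_{q(w)}[w^T w]}_{\langle w^T w \rangle} + (a_0 - 1) \log \alpha - b_0 \alpha + \text{const}$$
$$= \left( \frac{D}{2} + a_0 - 1 \right) \log \alpha - \left( \frac{1}{2} \langle w^T w \rangle + b_0 \right) \alpha + \text{const}.$$
Taking the exponent and normalizing gives
$$q(\alpha) \leftarrow \mathcal{G}\left( \alpha \,\Big|\, \frac{D}{2} + a_0,\; \frac{1}{2} \langle w^T w \rangle + b_0 \right).$$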
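29 / 29 Handling the first and second moments
$$\mathrm{cov}(x, y) = E[(x - E[x])(y - E[y])^T]$$
$$= E[x y^T - x E[y]^T - E[x] y^T + E[x] E[y]^T]$$
$$= E[x y^T] - E[x] E[y]^T - E[x] E[y]^T + E[x] E[y]^T$$
$$= E[x y^T] - E[x] E[y]^T.$$
Setting $y = x$ gives $E[x x^T] = \mathrm{cov}(x, x) + E[x] E[x]^T$. Taking the trace of both sides yields
$$\mathrm{Tr}\left[ E[x x^T] \right] = \mathrm{Tr}[\mathrm{cov}(x, x)] + \mathrm{Tr}\left[ E[x] E[x]^T \right]$$
$$E\left[ \mathrm{Tr}[x x^T] \right] = \mathrm{Tr}[\mathrm{cov}(x, x)] + \mathrm{Tr}\left[ E[x] E[x]^T \right]$$
$$E[x^T x] = \mathrm{Tr}[\mathrm{cov}(x, x)] + E[x]^T E[x].$$
Consequently, $\langle w^T w \rangle = \mathrm{Tr}(S) + m^T m$. Note also that in the first update $a = (D/2 + a_0) / (\langle w^T w \rangle / 2 + b_0)$, which is the mean of $q(\alpha)$.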
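The two updates and the moment identity can be combined into a small coordinate-ascent loop. The sketch below is an illustration under stated assumptions, not the lecture's own implementation: the data are synthetic, the noise precision `beta` is treated as known, the hyperparameters `a0` and `b0` and the true weights are hypothetical, and $D = 2$ is chosen so that the matrix inverse can be written by hand without external libraries.

```python
import random

random.seed(0)
beta = 25.0              # known noise precision (assumed)
a0, b0 = 1e-2, 1e-2      # Gamma prior hyperparameters (assumed)
w_true = (0.5, -1.5)     # hypothetical ground-truth weights
N, D = 50, 2

# Synthetic data with features x_n = (1, u_n).
X = [(1.0, random.uniform(-1.0, 1.0)) for _ in range(N)]
y = [w_true[0] * x0 + w_true[1] * x1 + random.gauss(0.0, beta ** -0.5)
     for x0, x1 in X]

# Sufficient statistics X^T X (2x2) and X^T y (length 2).
sxx = [[sum(xn[r] * xn[c] for xn in X) for c in range(D)] for r in range(D)]
sxy = [sum(xn[r] * yn for xn, yn in zip(X, y)) for r in range(D)]

def inv2(A):
    """Inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

a = a0 / b0   # initialize E[alpha] at the prior mean
for _ in range(100):
    # Update q(w) = N(m, S): S = [beta X^T X + a I]^{-1}, m = beta S X^T y.
    S = inv2([[beta * sxx[r][c] + (a if r == c else 0.0) for c in range(D)]
              for r in range(D)])
    m = [beta * sum(S[r][c] * sxy[c] for c in range(D)) for r in range(D)]
    # Second moment: <w^T w> = Tr(S) + m^T m.
    ww = S[0][0] + S[1][1] + m[0] ** 2 + m[1] ** 2
    # Update q(alpha) = Gamma(D/2 + a0, <w^T w>/2 + b0) and take its mean.
    a_new = (D / 2 + a0) / (ww / 2 + b0)
    converged = abs(a_new - a) < 1e-10
    a = a_new
    if converged:
        break

print("posterior mean of w:", [round(v, 3) for v in m])
print("E[alpha]:", round(a, 3))
```

With this much data the posterior mean `m` should land close to the hypothetical `w_true`, while `a` settles at the expected prior precision implied by the inferred weights.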
More informationUnderstanding Covariance Estimates in Expectation Propagation
Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department
More informationHands-On Learning Theory Fall 2016, Lecture 3
Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete
More informationStatistical Learning Theory
Statistical Learning Theory Part I : Mathematical Learning Theory (1-8) By Sumio Watanabe, Evaluation : Report Part II : Information Statistical Mechanics (9-15) By Yoshiyuki Kabashima, Evaluation : Report
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationRobustness and duality of maximum entropy and exponential family distributions
Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More informationECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 2017
ECE 587 / STA 563: Lecture 2 Measures of Information Information Theory Duke University, Fall 207 Author: Galen Reeves Last Modified: August 3, 207 Outline of lecture: 2. Quantifying Information..................................
More information