PMR Learning as Inference


Probabilistic Modelling and Reasoning
Amos Storkey
School of Informatics, University of Edinburgh

Outline
1 Modelling
2 The Exponential Family
3 Bayesian Sets

Modelling
Probabilistic modelling is about building models and using them. What do we mean by modelling?

A Generative Model
Building an idealisation to capture the essential elements of an item. Think of a model as a model for future data generation: given a model (a distribution) we can sample from that distribution to get artificial data. We need to specify enough to do this generation, and we often make the IID (Independent and Identically Distributed) assumption. Models are not truth; they try to capture our uncertainties.

The Inverse Problem
We built a generative model, or a set of generative models, on the basis of what we know (the prior), and we can generate artificial data. But what if we want to learn a good distribution for data that we then see? How is goodness measured?

Likelihood
Explaining data: a particular distribution explains the data better if the data is more probable under that distribution. This is the likelihood approach. P(D|M), the probability of the data D given a distribution (or model) M, is called the likelihood of the model. Under the IID assumption it is

$$P(D|M) = \prod_{n=1}^{N} P(x_n|M),$$

i.e. the product of the probabilities of generating each data point individually; this is a result of the independence assumption (independence means the joint is a product of probabilities by definition). Try different M (different distributions) and pick the M with the highest likelihood: the Maximum Likelihood approach.
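As a minimal sketch of the "try different M, pick the highest likelihood" idea, here is a small Python example; the binary dataset and the grid of candidate Bernoulli parameters are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical binary dataset (not the slides' data) and a grid of candidate Bernoulli models.
data = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
candidate_ps = np.linspace(0.01, 0.99, 99)

def log_likelihood(p, x):
    """log P(D|p) = sum_n log P(x_n|p), using the IID factorisation."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

scores = [log_likelihood(p, data) for p in candidate_ps]
best_p = candidate_ps[int(np.argmax(scores))]
print(f"highest-likelihood candidate: p = {best_p:.2f}")  # close to the empirical fraction of ones
```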

Bernoulli Model
Data: a sequence of N = 20 binary (0/1) observations. Continuous range of hypotheses: M = p, the data being generated from a Boolean (Bernoulli) distribution with P(x = 1|p) = p. The likelihood of the data, letting c = the number of ones, is

$$\prod_{n=1}^{N} P(x_n|p) = p^c (1-p)^{20-c}.$$

What is the maximum likelihood hypothesis? Differentiate with respect to p to find the maximum. In fact it is usually easier to differentiate log P(D|M): log is monotonic, so argmax log f(x) = argmax f(x). The log-likelihood is

$$\log \prod_{n=1}^{N} P(x_n|p) = c\log p + (20-c)\log(1-p).$$

Set d/dp log P(D|M) = c/p − (20−c)/(1−p) to zero to find the maximum. So c(1−p) − (20−c)p = 0, which gives p = c/20. Maximum likelihood is unsurprising.

Distributions over Parameters (Parameter Bias)
Although we are uncertain about the parameter values, often some are more probable than others. Uncertainty → probability: put a prior (distribution) on the parameters and compute the maximum posterior instead of the maximum likelihood. Bayes rule gives the posterior

$$P(p|D) = \frac{P(D|p)P(p)}{P(D)}, \qquad P(D) = \int dp\, P(D|p)P(p),$$

where P(D) does not depend on p. Hence

$$\operatorname*{argmax}_p P(D|p)P(p) = \operatorname*{argmax}_p \big(\log P(D|p) + \log P(p)\big),$$

with the log prior acting as a penalty term.

Maximum Posterior
Prior: P(p) ∝ p(1−p). Let c = the number of ones (9). Then

$$\log P(p|D) = c\log p + (20-c)\log(1-p) + \log p + \log(1-p) + \text{const}.$$

Set d/dp log P(p|D) = (c+1)/p − (20−c+1)/(1−p) to zero to find the maximum. So (c+1)(1−p) − (21−c)p = 0, which gives p = (c+1)/22 = 10/22 ≈ 0.455. With this prior, the maximum posterior prefers p slightly closer to 1/2 than the maximum likelihood estimate c/20 = 0.45.
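As a check on the arithmetic above, a short sketch using the reconstructed counts (c = 9 ones out of N = 20 observations, values inferred from the worked numbers in this section) and the p(1−p), i.e. Beta(2, 2), prior:

```python
# Maximum likelihood vs maximum posterior for the Bernoulli model.
# c ones out of N observations; prior P(p) proportional to p(1-p), i.e. Beta(2, 2).
c, N = 9, 20

p_ml = c / N                 # maximises c*log(p) + (N-c)*log(1-p)
p_map = (c + 1) / (N + 2)    # maximises the log posterior under the Beta(2, 2) prior

print(f"maximum likelihood estimate: {p_ml:.3f}")   # 0.450
print(f"maximum posterior estimate:  {p_map:.3f}")  # 0.455, pulled slightly towards 1/2
```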

Uncertainty of Parameters
Maximizing the posterior gets us one value for the parameter. Is it right? No, it is an estimate. But how good an estimate? There is some uncertainty; how much? Uncertainty → probability: find the posterior distribution over the parameters, not just its maximum.

Posterior Distribution
[Figure: prior density in red, posterior density in blue, plotted over p.]

Test
Prior P(p) ∝ p(1−p). What is the probability of the next item x being 1? To predict, we could take the maximum posterior parameter and compute the probability of the next item (≈ 0.45). But there are lots of possible posterior parameters, some more probable than others. Instead marginalise:

$$P(x=1|D) = \int dp\, P(x=1|p)\, P(p|D).$$

This gives approximately 0.46.

Inference and Marginalisation
So far we considered choosing between models where each model defined a precise distribution. But what if each model defines a whole type of distribution, and we do not know the precise parameters of that distribution? Compute the evidence or marginal likelihood: marginalise out the unknown parameters to get the likelihood of the model.
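A minimal numerical sketch of the marginalisation above, assuming the reconstructed counts from this section (c = 9, N = 20) and the Beta(2, 2) prior; the grid integration is only to make the integral explicit, since the Beta posterior gives the answer analytically.

```python
import numpy as np

# Posterior predictive P(x=1|D) by marginalising over p, compared with plugging in the MAP value.
c, N = 9, 20
a, b = 2, 2                              # Beta(2, 2) prior, density proportional to p(1-p)

p_map = (c + a - 1) / (N + a + b - 2)    # plug-in prediction with the maximum posterior parameter

# Marginalisation: integrate P(x=1|p) * P(p|D) over p on a grid.
p = np.linspace(1e-6, 1 - 1e-6, 100_000)
dp = p[1] - p[0]
log_post = (c + a - 1) * np.log(p) + (N - c + b - 1) * np.log(1 - p)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dp                  # normalise the posterior density numerically
p_pred = np.sum(p * post) * dp           # E[p|D] = P(x=1|D)

print(f"plug-in (MAP) prediction: {p_map:.3f}")   # about 0.45
print(f"marginalised prediction:  {p_pred:.3f}")  # about 0.46, i.e. (c+a)/(N+a+b)
```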

Learning as Inference
It is just as if the parameters were nodes in our graphical model. In fact, that is exactly what they are. Latent variables are intrinsic: there is a separate variable for each data item. Parameters are extrinsic: they are shared across all data items.

Summary of Bayesian Computation
Define a prior model P(D), usually by using

$$P(D) = \int d\theta\, P(D|\theta)P(\theta)$$

and defining the likelihood P(D|θ) with parameters θ, and the prior distribution (over parameters) P(θ|α), which might itself be parameterised by hyper-parameters α. Condition on the data to get the posterior distribution over parameters P(θ|D). Use the posterior distribution for prediction (inference):

$$P(x|D) = \int d\theta\, P(x|\theta)P(\theta|D).$$

Why Not Maximize?
We have described learning as an inference procedure. But why not maximize? Why be Bayesian? Why not compute the best parameters and compare? More parameters give a better fit to the data, so under maximum likelihood bigger is always better. But this might be overfitting: only these particular parameters work, and many others don't.

[Figure: evidence P(D|H_1) and P(D|H_2) plotted against possible data sets D, with a region C_1 marked.]

Prefer models that are unlikely to accidentally explain the data. That said, maximum posterior parameters are often good enough in large data scenarios.

Recap
For a Bernoulli likelihood with a Beta prior, we could do the Bayesian computation analytically. For a Binomial likelihood and a Beta prior, we could do the Bayesian computation analytically. For a Multinomial likelihood and a Dirichlet prior, we could do the Bayesian computation analytically. Question: are there other distributions for which we can do analytical Bayesian computations? Is this a good thing? Discuss.
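To make the evidence idea concrete, here is a small sketch (not from the slides) comparing the marginal likelihood of a fixed fair-coin model with that of a Beta-Bernoulli model in which p is integrated out; the counts reuse the running example and the standard analytic evidence B(c+a, N−c+b)/B(a, b), computed via scipy.special.betaln.

```python
import numpy as np
from scipy.special import betaln

# Marginal likelihood (evidence) comparison for a binary data set with c ones out of N:
#   H1: p is fixed at 0.5.
#   H2: p is unknown, given a Beta(2, 2) prior, and integrated out.
c, N = 9, 20
a, b = 2, 2

log_evidence_h1 = N * np.log(0.5)                            # P(D|H1)
log_evidence_h2 = betaln(c + a, N - c + b) - betaln(a, b)    # P(D|H2) = B(c+a, N-c+b) / B(a, b)

print(f"log P(D|H1) = {log_evidence_h1:.3f}")
print(f"log P(D|H2) = {log_evidence_h2:.3f}")
print(f"log Bayes factor H1 vs H2 = {log_evidence_h1 - log_evidence_h2:.3f}")
```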

Analytical Methods
Yes: conjugate exponential models. The good thing: it is easy to do the sums. The bad thing: the prior distribution should match your beliefs. Does a Beta distribution match your beliefs? Is it good enough? Certainly not always.

The Exponential Family
Any distribution over some x that can be written as

$$P(x|\eta) = h(x)\, g(\eta)\, \exp\!\left(\eta^T u(x)\right),$$

with h and g known, is in the exponential family of distributions. Many of the distributions we have seen are in the exponential family; a notable exception is the t-distribution. The η are called the natural parameters of the distribution.

Wait, I Didn't Get That!
More simply: any distribution that can be written such that the interaction term (between parameters and variables) is log-linear in the parameters is in the exponential family, i.e.

$$\log P(x|\eta) = \sum_i \eta_i u_i(x) + (\text{other terms that only contain } x \text{ or } \eta).$$

A distribution is usually parameterised in a way that is different from the exponential family form, so it is sometimes useful to convert to the exponential family representation and find the natural parameters.

Multinomial Distribution
$$P(x|\{\log p_k\}) \propto \exp\!\Big(\sum_k x_k \log p_k\Big).$$
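As an illustration of converting to the exponential-family representation (a standard exercise, not worked on these slides), the Bernoulli distribution has natural parameter η = log(p/(1−p)) and sufficient statistic u(x) = x:

```python
import numpy as np

# Bernoulli in exponential-family form:
#   P(x|p) = p^x (1-p)^(1-x) = exp(eta*x - log(1 + exp(eta))),
# with natural parameter eta = log(p/(1-p)) and sufficient statistic u(x) = x.
def bernoulli_standard(x, p):
    return p**x * (1 - p)**(1 - x)

def bernoulli_natural(x, eta):
    return np.exp(eta * x - np.log1p(np.exp(eta)))

p = 0.3
eta = np.log(p / (1 - p))
for x in (0, 1):
    print(x, bernoulli_standard(x, p), bernoulli_natural(x, eta))  # the two parameterisations agree
```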

The Gaussian Distribution
We need to introduce the Gaussian first. Definition: the one-dimensional Gaussian distribution is given by

$$P(x|\mu,\sigma^2) = N(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where µ is the mean of the Gaussian and σ² is the variance. If µ = 0 and σ² = 1 then N(x; µ, σ²) is called a standard Gaussian.

Plot
[Figure: density of a standard one-dimensional Gaussian distribution.]
This is a standard one-dimensional Gaussian distribution. All Gaussians have the same shape, subject to scaling and displacement: if x is distributed N(x; µ, σ²), then y = (x − µ)/σ is distributed N(y; 0, 1).

Normalisation
Remember that all distributions must integrate to one. The factor 1/√(2πσ²) is called a normalisation constant; it ensures this is the case. Hence tighter Gaussians have higher peaks.
[Figure: Gaussian densities with different variances; the narrower Gaussian has the higher peak.]
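A quick numerical sketch (not from the slides) checking that the density integrates to one and that tighter Gaussians have higher peaks:

```python
import numpy as np

# Gaussian density: normalisation and peak height 1/sqrt(2*pi*sigma^2) for a few variances.
def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-20, 20, 200_001)
dx = x[1] - x[0]
for sigma2 in (0.25, 1.0, 4.0):
    density = gaussian_pdf(x, mu=0.0, sigma2=sigma2)
    print(f"sigma^2 = {sigma2}: integral = {np.sum(density) * dx:.4f}, peak = {density.max():.4f}")
```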

Central Limit Theorems
Let the X_i have mean 0 and variance Σ, not necessarily Gaussian, and be subject to various conditions (e.g. IID, light tails). Then

$$\frac{1}{\sqrt{N}} \sum_{i=1}^{N} X_i \rightarrow N(0, \Sigma)$$

asymptotically as N → ∞.

Multivariate Gaussian
The vector x is multivariate Gaussian if, for mean µ and covariance matrix Σ, it is distributed according to

$$P(x|\mu,\Sigma) = \frac{1}{|2\pi\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$

The univariate Gaussian is a special case of this. Σ is called a covariance matrix: it says how much attributes co-vary. More later.

Multivariate Gaussian: Picture
[Figure: density of a two-dimensional Gaussian distribution.]

The Exponential Family: Gaussian Distribution
$$P(x|\eta) \propto \exp\!\Big(\sum_k \eta_k x_k - \frac{1}{2}\sum_{ij} (\Sigma^{-1})_{ij}\, x_i x_j\Big), \qquad \eta = \Sigma^{-1}\mu.$$
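A short sketch (with made-up numbers) connecting the two forms above: the exponential-family quadratic form with η = Σ⁻¹µ matches the standard multivariate Gaussian log-density up to an x-independent constant.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Lambda = np.linalg.inv(Sigma)     # precision matrix Sigma^{-1}
eta = Lambda @ mu                 # natural parameters eta = Sigma^{-1} mu

def log_density(x):
    """Standard multivariate Gaussian log-density."""
    d = x - mu
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)
    return -0.5 * logdet - 0.5 * d @ Lambda @ d

def log_expfam_unnormalised(x):
    """Exponential-family form eta^T x - 0.5 * x^T Sigma^{-1} x, up to a constant."""
    return eta @ x - 0.5 * x @ Lambda @ x

x1, x2 = rng.normal(size=2), rng.normal(size=2)
# Differences of log-densities are free of the constant, so the two forms agree on them.
print(log_density(x1) - log_density(x2))
print(log_expfam_unnormalised(x1) - log_expfam_unnormalised(x2))
```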

Pause: Conjugate Exponential Models
If the prior takes the same functional form as the posterior for a given likelihood, the prior is said to be conjugate for that likelihood. There is a conjugate prior for any exponential family distribution. If the prior and likelihood are conjugate and exponential, then the model is said to be conjugate exponential. In conjugate exponential models, the Bayesian integrals can be done analytically.

Conjugacy
In high-dimensional spaces it is hard to accurately estimate the parameters using maximum likelihood, so we can utilise Bayesian methods. The conjugate distribution for the Gaussian mean parameter is another Gaussian. The conjugate distribution for the Gaussian precision (inverse variance) parameter is the Gamma distribution. The conjugate distribution for the Gaussian precision matrix (inverse covariance) is the Wishart distribution. The conjugate distribution for the Gaussian with both mean and precision matrix unknown is the Gaussian-Wishart distribution. The Wishart distribution is a distribution over matrices!

Conjugacy (continued)
Remember: for a conjugate prior the posterior is of the same form, so given the data we just need to update the hyperparameters of the prior distribution to get the posterior. For example, take a Gaussian N(µ, Λ⁻¹) with fixed precision Λ, but µ distributed as N(µ₀, Λ₀⁻¹). After observing n data points with sample mean x̄:

$$\text{posterior mean} = (\Lambda_0 + n\Lambda)^{-1}(\Lambda_0\mu_0 + n\Lambda\bar{x}), \qquad \text{posterior precision} = \Lambda_0 + n\Lambda.$$
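A sketch of this update with made-up numbers (the dimensions, precisions and data below are illustrative only):

```python
import numpy as np

# Conjugate update for the mean of a Gaussian with known precision, following the formulas above:
#   prior:      mu ~ N(mu0, inv(Lambda0))
#   likelihood: x_i ~ N(mu, inv(Lambda)), i = 1..n
rng = np.random.default_rng(1)

Lambda = np.array([[4.0, 0.0],
                   [0.0, 1.0]])                   # known data precision
mu0 = np.zeros(2)                                 # prior mean
Lambda0 = 0.1 * np.eye(2)                         # prior precision (a broad prior)

true_mu = np.array([1.5, -0.5])
X = rng.multivariate_normal(true_mu, np.linalg.inv(Lambda), size=50)
n, xbar = len(X), X.mean(axis=0)

posterior_precision = Lambda0 + n * Lambda
posterior_mean = np.linalg.solve(posterior_precision, Lambda0 @ mu0 + n * (Lambda @ xbar))

print("posterior mean:", posterior_mean)          # close to the sample mean with this much data
print("posterior precision:\n", posterior_precision)
```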

Conjugacy: Evidence
Gaussian likelihood N(x; µ, Σ), Gaussian prior N(µ; µ₀, Σ₀). Simple case: µ₀ = 0, Σ known. What is the marginal likelihood (evidence)? We know the marginal likelihood is Gaussian, so using x = µ + ɛ, with ɛ having mean 0 and covariance Σ, compute the mean m and covariance C of the marginal likelihood:

$$m = \langle x\rangle = \langle\mu\rangle + \langle\epsilon\rangle = 0, \qquad C = \langle xx^T\rangle = \langle\mu\mu^T\rangle + \langle\epsilon\epsilon^T\rangle = \Sigma_0 + \Sigma.$$

Give Me More...
Red, Orange, Yellow, Aquamarine. Haggis, Mountains, Loch, Celtic, Castle. Trees, Forests, Pruning, Parent, Machine Learning, Bayesian. (Cf. Google Sets.)

Different Features
We have a large database of objects, each described by binary features, denoted D⁺ (e.g. the Web), and a small number of examples from that database, each with various (binary) features, which we collect into D_c. We want to pick things from D⁺ that belong to the same set as those in D_c. How should we do it?

Model
The data consists of D_c and a query point x; denote the two together by D. Two models: M₁, in which all of D comes from the same subset C, or M₂, in which D_c comes from the same subset C but x comes from the general distribution over all data D⁺. The parameter vector is a vector of (Boolean) probabilities, one for each feature. D⁺ is vast, so we presume a maximum likelihood estimate is good enough for this general distribution: we have a vector θ⁺ for it.
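A small sanity check (a sketch with made-up covariances) of the evidence result above: if µ ~ N(0, Σ₀) and x|µ ~ N(µ, Σ), then marginally x has mean 0 and covariance Σ₀ + Σ.

```python
import numpy as np

rng = np.random.default_rng(2)

Sigma0 = np.array([[1.0, 0.3],
                   [0.3, 0.5]])                   # prior covariance of mu
Sigma = np.array([[0.8, 0.0],
                  [0.0, 0.2]])                    # likelihood covariance

# x = mu + eps with mu ~ N(0, Sigma0) and eps ~ N(0, Sigma), so marginally x ~ N(0, Sigma0 + Sigma).
mus = rng.multivariate_normal(np.zeros(2), Sigma0, size=20_000)
eps = rng.multivariate_normal(np.zeros(2), Sigma, size=20_000)
x = mus + eps

print("empirical covariance of x:\n", np.cov(x.T))
print("Sigma0 + Sigma:\n", Sigma0 + Sigma)        # the two should roughly agree
```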

Score
The parameter vector θ_c for the subset C is not known, so we put a conjugate prior on the parameters: a Beta distribution for each component i of the feature vector, with hyper-parameters a_i and b_i. Compute P(D|M₁)/P(D|M₂), called the Bayes factor. The larger this ratio is, the more it favours x being included in the set. This is Bayesian model comparison with the parameters integrated out:

$$P(D|M) = \int P(D|\theta)\,P(\theta|\alpha)\,d\theta.$$
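A sketch of this score with made-up data and hyper-parameters. Under M₁ the unknown θ_c is integrated out against the per-feature Beta(a_i, b_i) priors; under M₂ the seed set D_c is treated the same way but the query x is explained by the background maximum-likelihood vector θ⁺, so the Bayes factor reduces to P(x|D_c)/P(x|θ⁺).

```python
import numpy as np

def log_score(x, Dc, theta_plus, a, b):
    """log P(D|M1) - log P(D|M2) = log P(x|Dc) - log P(x|theta_plus), independent binary features."""
    N = Dc.shape[0]
    c = Dc.sum(axis=0)                          # per-feature counts of ones in the seed set
    p_on = (a + c) / (a + b + N)                # posterior predictive P(x_i = 1 | Dc)
    log_p_x_given_Dc = np.sum(x * np.log(p_on) + (1 - x) * np.log(1 - p_on))
    log_p_x_background = np.sum(x * np.log(theta_plus) + (1 - x) * np.log(1 - theta_plus))
    return log_p_x_given_Dc - log_p_x_background

# Hypothetical seed set: 3 items with 4 binary features, plus a background frequency vector.
Dc = np.array([[1, 1, 0, 0],
               [1, 1, 1, 0],
               [1, 0, 0, 0]])
theta_plus = np.array([0.5, 0.3, 0.2, 0.4])     # maximum-likelihood background feature frequencies
a = b = np.ones(4)                              # Beta(1, 1) hyper-parameters for each feature

for x in (np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])):
    print(x, round(log_score(x, Dc, theta_plus, a, b), 3))  # the first candidate scores higher
```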