NPFL108 Bayesian inference. Introduction. Filip Jurčíček. Institute of Formal and Applied Linguistics Charles University in Prague Czech Republic

NPFL108 Bayesian inference. Introduction. Filip Jurčíček, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic. Home page: http://ufal.mff.cuni.cz/~jurcicek Version: 21/02/2014

Outline
- The course objective
- Syllabus
- Literature
- Course structure
- Basics of Probability Theory
- Basic probability distributions

The course objective The course aims to provide students with a basic understanding of modern Bayesian inference methods used in Bayesian machine learning. Being Bayesian is about managing uncertainty and using data efficiently. In many tasks, such as stock trading or speech recognition, the uncertainty is inherent and there is always less data than we really need.

What is Machine Learning? The design of computational systems that discover patterns in a collection of data instances in an automated manner. The ultimate goal is to use the discovered patterns to make predictions on new data instances not seen before. Instead of manually encoding patterns in computer programs, we make computers learn these patterns without explicitly programming them. Figure source [Hinton et al. 2006].

Model-based Machine Learning We design a probabilistic model which explains how the data is generated. An inference algorithm combines the model and the data to make predictions. Probability theory is used to deal with uncertainty in the model or the data.

Syllabus #1
- Introduction
- Random variables
- Sum rule, product rule, Bayes rule
- Independence
- Prior, likelihood, posterior, predictions
- Basic probability distributions
- Types of priors: conjugate vs. non-conjugate, proper vs. improper, informative vs. uninformative

Syllabus #2
- Bayesian inference for parameters of the normal distribution
  - Unknown mean, known variance
  - Known mean, unknown variance
  - Unknown mean and variance, conjugate prior
  - Unknown mean and variance, non-conjugate prior

Syllabus #3
- Inference in discrete graphical models
  - Variables, parameters, networks, plate notation
  - Conditional independence, Markov blanket
  - Message passing, belief propagation, loopy belief propagation
- Approximate inference
  - Variational inference / variational Bayes
  - Expectation propagation
  - Sampling methods: Metropolis-Hastings, Gibbs, slice sampling, random walk
- Non-parametric Bayesian methods
  - Gaussian processes
  - Dirichlet processes

Paradigm shift Point estimates vs. posterior estimates. An example:
- coin flips (data): H T H H -> point estimate 3/4; Bayesian estimate?
- coin flips (data): H T H H T H H H -> point estimate 6/8 = 3/4; Bayesian estimate?
We need distributions over parameters!
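Both data sets give the same point estimate 3/4, but a Bayesian treatment also captures how confident we should be. A minimal Python sketch (NumPy/SciPy, as the course prefers), assuming a uniform Beta(1, 1) prior over the head probability — the prior is my assumption and is not stated on the slide:
```python
from scipy import stats

def coin_estimates(flips, a=1.0, b=1.0):
    heads = flips.count("H")
    tails = flips.count("T")
    point = heads / len(flips)                     # maximum-likelihood point estimate
    posterior = stats.beta(a + heads, b + tails)   # Beta posterior over P(heads)
    return point, posterior

for data in (["H", "T", "H", "H"],
             ["H", "T", "H", "H", "T", "H", "H", "H"]):
    point, post = coin_estimates(data)
    print(len(data), "flips: point =", point,
          ", posterior mean =", round(post.mean(), 3),
          ", 95% interval =", [round(x, 3) for x in post.interval(0.95)])
```
With more data the point estimate stays at 3/4, while the posterior mean moves and its interval shrinks, which is exactly the extra information the point estimate throws away.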

True or False?
- False: being Bayesian is just about having priors. True: being Bayesian is about managing uncertainty.
- False: Bayesian methods are slow. True: Bayesian methods can be as fast as Expectation-Maximisation.
- False: non-parametric means no parameters. True: non-parametric means the number of parameters grows as necessary given the data.
- False: variational inference is complicated. True: variational inference is an extension of Expectation-Maximisation.

Literature
- C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
- D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
- B. Thomson and S. Young, "Bayesian update of dialogue state: A POMDP framework for spoken dialogue systems," Computer Speech and Language, vol. 24, no. 4, pp. 562-588, 2010.
- D. Marek, Implementation of Approximate Bayesian Methods for Dialogue State Estimation in Dialogue Systems (in Czech: Implementace aproximativních Bayesovských metod pro odhad stavu v dialogových systémech), diploma thesis (theoretical computer science), UFAL, MFF, CUNI, 2013.

Course structure Mixed lectures and practicals. Each of you will give your own lecture:
- description of some inference problem
- derivation of the solution
- presentation of an implementation of the problem in some programming language (I prefer Python ;-) using only NumPy and SciPy)
A lot of work!

Available problems You can invent your own, or pick one of these:
- Message passing, belief propagation, loopy belief propagation in discrete graphical models
- Laplace approximation: the probit regression model
- Variational inference: the probit regression model
- Variational inference: 2D Ising model
- Variational inference: unknown mean and variance of a Gaussian with improper priors
- Expectation propagation: the clutter problem
- Variational inference: the clutter problem
- Expectation propagation: the probit regression model
- Bayesian inference for regression
- Gibbs sampling: probit regression
- Metropolis-Hastings random walk: logit regression
- Gibbs sampling: probit regression in a multi-class setting
- Gaussian processes: sampling from a GP; inference in a GP with prior mean m(x) = 0 and covariance k(x, x') ∈ {squared exponential, rational quadratic}; analyse the impact of different covariance functions on the resulting approximations

Basics of Probability Theory: random variables; sum rule, product rule, Bayes rule; independence; prior, likelihood, posterior, predictions.

Random variable A random variable (RV) is a variable whose value is subject to variation due to chance (i.e. randomness, in a mathematical sense). A random variable conceptually does not have a single, fixed value; it can take on a set of possible values, each with an associated probability. We talk about a probability distribution P(X) over the values of an RV X.

Examples of random variables
- the outcome of a coin flip (heads or tails) or a die roll (1, ..., 6)
- a person's marital status (no, yes)
- a person's number of children (0, 1, 2, ...)
- a person's height (real numbers between 0 and +inf)
- the temperature on the same day next year
- parameters of the distributions describing other RVs
RVs are broadly divided into discrete and continuous.

Sum rule #1 Let's have two RVs:
- X - a person's marital status
- Y - a person's number of children
Then:
- P(X=yes) is the probability that a person is married
- P(Y=0) is the probability that a person has no children
- P(X=yes, Y=0) is the JOINT probability that a person is married and has no children
- P(X=yes | Y=0) is the CONDITIONAL probability that a person is married given that he/she has no children

Sum rule #2 Computes marginal probabilities from joint probabilities. The sum rule says: P(X) = Σ_Y P(X, Y). In more precise notation: P(X = x_i) = Σ_{y_j} P(X = x_i, Y = y_j).
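To make the sum rule concrete, here is a minimal NumPy sketch with a made-up 2x3 joint table (the numbers are illustrative only):
```python
import numpy as np

# Illustrative joint distribution P(X, Y): rows index values of X, columns values of Y.
joint = np.array([[0.10, 0.20, 0.05],
                  [0.30, 0.25, 0.10]])

p_x = joint.sum(axis=1)        # sum rule: P(X) = sum over Y of P(X, Y)
p_y = joint.sum(axis=0)        # sum rule: P(Y) = sum over X of P(X, Y)
print(p_x, p_y, joint.sum())   # both marginals and the total mass (1.0)
```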

Product rule Relates the joint probability to the conditional and marginal probabilities. The product rule says: P(X, Y) = P(X | Y) P(Y).

Bayes rule The theory of probability can be derived using just the sum and product rules. Bayes' theorem gives the relationship between the marginal probabilities of X and Y and the conditional probabilities of X given Y and Y given X: P(X, Y) = P(X | Y) P(Y), and therefore P(Y | X) = P(X, Y) / P(X) = P(X | Y) P(Y) / P(X) = P(X | Y) P(Y) / Σ_Y P(X | Y) P(Y).
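A minimal numerical sketch of Bayes' rule for discrete variables (the prior and likelihood values are made up for illustration):
```python
import numpy as np

p_y = np.array([0.4, 0.6])                # prior P(Y) over two values of Y
p_x_given_y = np.array([[0.7, 0.3],       # likelihood P(X | Y): rows index Y, columns index X
                        [0.2, 0.8]])      # each row sums to 1

evidence = (p_x_given_y[:, 0] * p_y).sum()        # P(X = x_0) = sum over Y of P(x_0 | Y) P(Y)
posterior = p_x_given_y[:, 0] * p_y / evidence    # Bayes rule: P(Y | X = x_0)
print(posterior, posterior.sum())                 # the posterior sums to 1
```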

Independence Independence of X and Y: P(X, Y) = P(X) P(Y). Conditional independence of X and Y given Z: P(X, Y | Z) = P(X | Z) P(Y | Z).

Bayesian Model Framework The probabilistic model M with parameters θ explains how the data D is generated by specifying the likelihood function p(D | θ, M). Our initial uncertainty about θ is encoded in the prior distribution p(θ | M). Bayes rule allows us to update our uncertainty about θ given D (the posterior): p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M).

Bayesian predictions We can then generate probabilistic predictions for some new data point x given D and M using: p(x | D, M) = ∫ p(x | θ) p(θ | D, M) dθ. Example: the buttered toast phenomenon.
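Continuing the coin example, a minimal sketch of the predictive integral evaluated numerically, assuming the Beta(4, 2) posterior obtained from H T H H with a Beta(1, 1) prior (my assumption, as above):
```python
from scipy import stats
from scipy.integrate import quad

posterior = stats.beta(4, 2)   # p(theta | D) after 3 heads and 1 tail with a Beta(1, 1) prior

# p(x = heads | D, M) = integral over theta of p(x = heads | theta) p(theta | D, M)
p_heads, _ = quad(lambda theta: theta * posterior.pdf(theta), 0.0, 1.0)
print(p_heads, posterior.mean())   # both equal 2/3 in this conjugate case
```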

Probabilistic Graphical Models The Bayesian framework requires us to specify a high-dimensional distribution p(x_1, ..., x_k) over the data, model parameters and latent variables. Working with fully flexible joint distributions is intractable! We will work with structured distributions, in which each random variable interacts directly with only a few others. These distributions will have many conditional independences. This structure allows us to obtain a compact representation of the distribution and to use computationally efficient inference algorithms. The framework of probabilistic graphical models allows us to represent and work with such structured distributions in an efficient manner.

Examples of Probabilistic Graphical Models Figure source [Koller et al. 2009].

Examples of Probabilistic Graphical Models BN examples: Naive Bayes; Hidden Markov Model.

Basic probability distributions
- Bernoulli, Binomial
- Beta
- Categorical (Discrete), Multinomial
- Dirichlet
- (Multivariate) Normal
- Gamma and inverse Gamma
- Wishart

Bernoulli Distribution for x ∈ {0, 1} governed by μ ∈ [0, 1] such that μ = p(x = 1). Bern(x; μ) = μ^x (1 − μ)^(1−x). E(x) = μ, Var(x) = μ(1 − μ).

Binomial The binomial distribution is a generalisation of the Bernoulli distribution to multiple trials. Distribution for m ∈ {0, 1, ..., N}, the number of successes in N trials, governed by μ ∈ [0, 1], the probability of success. Bin(m; N, μ) = (N choose m) μ^m (1 − μ)^(N−m). E(m) = Nμ, Var(m) = Nμ(1 − μ). Source: Wikipedia.
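A quick sanity check of the stated Bernoulli and binomial moments using scipy.stats (the parameter values are illustrative):
```python
from scipy import stats

mu, N = 0.3, 10
print(stats.bernoulli(mu).mean(), stats.bernoulli(mu).var())   # mu and mu * (1 - mu)
print(stats.binom(N, mu).mean(), stats.binom(N, mu).var())     # N * mu and N * mu * (1 - mu)
```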

Beta Distribution for μ ∈ [0, 1], such as the probability of a binary event. Beta is very often used as a prior for the parameters of Bernoulli and binomial distributions.
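The density itself was not transcribed from the slide; for reference, the standard form is Beta(μ; a, b) = Γ(a + b) / (Γ(a) Γ(b)) · μ^(a−1) (1 − μ)^(b−1), with E(μ) = a / (a + b) and Var(μ) = ab / ((a + b)^2 (a + b + 1)).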

Multinomial We draw n balls with replacement from a bag containing balls of k different categories. Let x_i denote the number of balls drawn from category i and p_i the probability of category i, for i = 1, ..., k, e.g. counts {0, 0, 5, 2, 4, 0, 0}.
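A minimal NumPy sketch of such a draw, matching the example counts above (the category probabilities are made up):
```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.05, 0.05, 0.40, 0.15, 0.25, 0.05, 0.05])   # illustrative category probabilities
counts = rng.multinomial(11, p)    # one draw of n = 11 balls over k = 7 categories
print(counts, counts.sum())        # e.g. something like [0 0 5 2 4 0 0]; the counts sum to n
```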

Dirichlet The Dirichlet distribution is very often used as a prior for the parameters of a multinomial distribution.
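A minimal sketch of that conjugacy: a Dirichlet(α) prior updated with observed category counts gives a Dirichlet(α + counts) posterior (a standard result; the numbers reuse the illustrative counts above):
```python
import numpy as np

alpha = np.ones(7)                               # symmetric Dirichlet prior over 7 categories
counts = np.array([0, 0, 5, 2, 4, 0, 0])         # observed category counts
alpha_post = alpha + counts                      # Dirichlet-multinomial conjugate update
posterior_mean = alpha_post / alpha_post.sum()   # E[p_i | counts]
print(posterior_mean)
```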

Multivariate Gaussian
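The body of this slide (presumably a figure with the density) was not transcribed; for reference, the standard multivariate Gaussian density over x ∈ R^D is N(x; μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2) (x − μ)^T Σ^(−1) (x − μ)), with mean μ and covariance matrix Σ.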

Gamma Distribution for τ > 0 governed by a > 0 and b > 0. Gam(τ; a, b) = (1 / Γ(a)) b^a τ^(a−1) e^(−bτ). E(τ) = a/b, Var(τ) = a/b^2. Gamma is very often used as a prior for the precision of a normal distribution; the inverse Gamma is typically used as a prior for the variance. Source: Wikipedia.
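A quick sanity check of the stated moments with scipy.stats; note that Gam(τ; a, b) with rate b corresponds to scipy's gamma(a, scale=1/b) (the parameter values are illustrative):
```python
from scipy import stats

a, b = 3.0, 2.0
tau = stats.gamma(a, scale=1.0 / b)   # shape a, rate b
print(tau.mean(), tau.var())          # a / b = 1.5 and a / b**2 = 0.75
```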

Wishart The Wishart distribution is a multivariate equivalent of the Gamma distribution.

Summary With ML, computers learn patterns and then use them to make predictions. With ML, we avoid manually encoding patterns in computer programs. Model-based ML separates knowledge about the data generation process (the model) from reasoning and prediction (the inference algorithm). The Bayesian framework allows us to do model-based ML using probability distributions, which must be structured for tractability. Probabilistic graphical models encode such structured distributions by specifying the conditional independences (factorisations) that they must satisfy.

Thank you! Filip Jurčíček, Institute of Formal and Applied Linguistics, Charles University in Prague, Czech Republic. Home page: http://ufal.mff.cuni.cz/~jurcicek