Introduction to Probabilistic Machine Learning


Introduction to Probabilistic Machine Learning
Piyush Rai, Dept. of CSE, IIT Kanpur
(Mini-course 1), Nov 03, 2015

Machine Learning
Detecting trends/patterns in the data
Making predictions about future data
Two schools of thought:
Learning as optimization: fit a model to minimize some loss function
Learning as inference: infer parameters of the data-generating distribution
The two are not really completely disjoint ways of thinking about learning

Plan for the mini-course
A series of 4 talks:
Introduction to Probabilistic and Bayesian Machine Learning (today)
Case Study: Bayesian Linear Regression, Approximate Bayesian Inference (Nov 5)
Nonparametric Bayesian modeling for function approximation (Nov 7)
Nonparametric Bayesian modeling for clustering/dimensionality reduction (Nov 8)

Machine Learning via Probabilistic Modeling
Assume data $X = \{x_1, \ldots, x_N\}$ is generated from a probabilistic model:
$x_1, \ldots, x_N \sim p(x \mid \theta)$
Data usually assumed i.i.d. (independent and identically distributed)
For i.i.d. data, the probability of the observed data $X$ given model parameters $\theta$:
$p(X \mid \theta) = p(x_1, \ldots, x_N \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$
$p(x_n \mid \theta)$ denotes the likelihood w.r.t. data point $n$
The form of $p(x_n \mid \theta)$ depends on the type/characteristics of the data
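To make the i.i.d. likelihood concrete, here is a minimal Python sketch (not from the slides) that evaluates $\log p(X \mid \theta)$ for a Bernoulli model; the coin-toss data and parameter values are made up for illustration.

```python
import numpy as np

# Toy coin-toss data (made up for illustration): 1 = heads, 0 = tails
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def bernoulli_log_likelihood(theta, x):
    """log p(X | theta) = sum_n log p(x_n | theta) for i.i.d. Bernoulli data."""
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

print(bernoulli_log_likelihood(0.5, x))   # log-likelihood at theta = 0.5
print(bernoulli_log_likelihood(0.75, x))  # higher, since 6/8 tosses are heads
```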

Some common probability distributions

Maximum Likelihood Estimation (MLE)
We wish to estimate parameters $\theta$ from observed data $\{x_1, \ldots, x_N\}$
MLE does this by finding $\theta$ that maximizes the (log-)likelihood $p(X \mid \theta)$:
$\hat{\theta} = \arg\max_{\theta} \log p(X \mid \theta) = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n \mid \theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n \mid \theta)$
MLE now reduces to solving an optimization problem w.r.t. $\theta$
MLE has some nice theoretical properties (e.g., consistency as $N \to \infty$)
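A minimal MLE sketch for the Bernoulli model, assuming the same toy coin-toss data as above; it compares a numerical optimizer against the closed-form sample-mean estimate (the data are illustrative, not from the course).

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # toy coin tosses (made up)

def neg_log_likelihood(theta):
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

# Numerical MLE: minimize the negative log-likelihood over (0, 1)
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)      # approx 0.75
print(x.mean())   # closed-form Bernoulli MLE: the sample mean, also 0.75
```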

Injecting Prior Knowledge
Often, we might a priori know something about the parameters
A prior distribution $p(\theta)$ can encode/specify this knowledge
Bayes rule gives us the posterior distribution over $\theta$, $p(\theta \mid X)$
The posterior reflects our updated knowledge about $\theta$ using the observed data:
$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{\overbrace{p(X \mid \theta)}^{\text{Likelihood}}\ \overbrace{p(\theta)}^{\text{Prior}}}{\int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta}$
Note: $\theta$ is now a random variable

Maximum-a-Posteriori (MAP) Estimation
MAP estimation finds $\theta$ that maximizes the posterior $p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta)$:
$\hat{\theta} = \arg\max_{\theta} \log \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta) = \arg\max_{\theta} \left[ \sum_{n=1}^{N} \log p(x_n \mid \theta) + \log p(\theta) \right]$
MAP now reduces to solving an optimization problem w.r.t. $\theta$
Objective function very similar to MLE, except for the $\log p(\theta)$ term
In some sense, MAP is just a regularized MLE
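A sketch of MAP estimation for the same toy Bernoulli example, assuming a Beta(a, b) prior with made-up hyperparameters; the $\log p(\theta)$ term pulls the estimate toward the prior.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])    # toy coin tosses (made up)
a, b = 2.0, 2.0                            # Beta(a, b) prior, values assumed for illustration

N, s = len(x), x.sum()
theta_mle = s / N                          # MLE: 0.75
theta_map = (s + a - 1) / (N + a + b - 2)  # MAP: mode of Beta(a + s, b + N - s), here 0.70

print(theta_mle, theta_map)                # MAP is pulled toward the prior mean 0.5
```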

Bayesian Learning
Both MLE and MAP only give a point estimate (a single best answer) of $\theta$
How can we capture/quantify the uncertainty in $\theta$?
Need to infer the full posterior distribution:
$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{\overbrace{p(X \mid \theta)}^{\text{Likelihood}}\ \overbrace{p(\theta)}^{\text{Prior}}}{\int_{\theta} p(X \mid \theta)\, p(\theta)\, d\theta}$
Requires doing fully Bayesian inference
Inference is sometimes easy and sometimes a (very) hard problem

A Simple Example of Bayesian Inference
We want to estimate a coin's bias $\theta \in (0, 1)$ based on $N$ tosses
The likelihood model: $x_1, \ldots, x_N \sim \text{Bernoulli}(\theta)$, with
$p(x_n \mid \theta) = \theta^{x_n} (1 - \theta)^{1 - x_n}$
The prior: $\theta \sim \text{Beta}(a, b)$, with
$p(\theta \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1 - \theta)^{b-1}$
The posterior:
$p(\theta \mid X) \propto \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta \mid a, b) \propto \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n}\, \theta^{a-1} (1 - \theta)^{b-1} = \theta^{a + \sum_{n=1}^{N} x_n - 1} (1 - \theta)^{b + N - \sum_{n=1}^{N} x_n - 1}$
Thus the posterior is $\text{Beta}\!\left(a + \sum_{n=1}^{N} x_n,\; b + N - \sum_{n=1}^{N} x_n\right)$
Here, the posterior has the same form as the prior (both Beta)
Also very easy to perform online inference
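The online inference mentioned above can be illustrated in a few lines of Python; the prior hyperparameters and tosses below are made up for the example.

```python
import numpy as np
from scipy.stats import beta

a, b = 1.0, 1.0                # Beta(1, 1) prior (uniform); values assumed for illustration
tosses = [1, 0, 1, 1, 0, 1]    # toy data arriving one toss at a time

# Online conjugate update: each toss just increments one of the Beta counts
for x_n in tosses:
    a, b = a + x_n, b + (1 - x_n)

print(a, b)                          # Beta(5, 3) posterior
print(beta(a, b).mean())             # posterior mean of theta
print(beta(a, b).interval(0.95))     # 95% credible interval for theta
```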

Conjugate Priors
Recall: $p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$
Given some data distribution (likelihood) $p(X \mid \theta)$ and a prior $p(\theta) = \pi(\theta \mid \alpha)$..
The prior is conjugate if the posterior also has the same form, i.e.,
$p(\theta \mid \alpha, X) = \frac{p(X \mid \theta)\, \pi(\theta \mid \alpha)}{p(X)} = \pi(\theta \mid \alpha')$
Several pairs of distributions are conjugate to each other, e.g.,
Gaussian-Gaussian
Beta-Bernoulli
Beta-Binomial
Gamma-Poisson
Dirichlet-Multinomial
..
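As another example from the list above, a sketch of the Gamma-Poisson conjugate update; the hyperparameters and counts are illustrative assumptions.

```python
from scipy.stats import gamma

# Gamma-Poisson conjugacy: prior lambda ~ Gamma(shape=a0, rate=b0), likelihood x_n ~ Poisson(lambda)
a0, b0 = 2.0, 1.0              # prior hyperparameters (made up)
counts = [3, 5, 4, 2]          # toy Poisson observations

a_post = a0 + sum(counts)      # shape: a0 + sum of the counts
b_post = b0 + len(counts)      # rate:  b0 + number of observations

posterior = gamma(a=a_post, scale=1.0 / b_post)  # scipy parameterizes by scale = 1/rate
print(posterior.mean())        # posterior mean of the Poisson rate
```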

A Non-Conjugate Case
Want to learn a classifier $\theta$ for predicting label $x \in \{-1, +1\}$ for a point $z$
Assume a logistic likelihood model for the labels:
$p(x_n \mid \theta) = \frac{1}{1 + \exp(-x_n \theta^{\top} z_n)}$
The prior: $\theta \sim \text{Normal}(\mu, \Sigma)$ (Gaussian, not conjugate to the logistic):
$p(\theta \mid \mu, \Sigma) \propto \exp\!\left(-\tfrac{1}{2} (\theta - \mu)^{\top} \Sigma^{-1} (\theta - \mu)\right)$
The posterior $p(\theta \mid X) \propto \prod_{n=1}^{N} p(x_n \mid \theta)\, p(\theta \mid \mu, \Sigma)$ does not have a closed form
Approximate Bayesian inference is needed in such cases:
Sampling-based approximations: MCMC methods
Optimization-based approximations: Variational Bayes, Laplace approximation, etc.
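For the sampling-based route, here is a minimal random-walk Metropolis (one of the MCMC methods mentioned) sketch for this logistic-likelihood/Gaussian-prior model, on made-up 2-D data; a real implementation would tune the proposal scale and check convergence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data (made up): labels X in {-1, +1}, features Z
Z = rng.normal(size=(50, 2))
theta_true = np.array([2.0, -1.0])
X = np.where(rng.random(50) < 1.0 / (1.0 + np.exp(-Z @ theta_true)), 1, -1)

def log_posterior(theta, mu=0.0, var=10.0):
    # log p(theta | X) up to a constant: logistic log-likelihood + Gaussian log-prior
    log_lik = -np.sum(np.log1p(np.exp(-X * (Z @ theta))))
    log_prior = -0.5 * np.sum((theta - mu) ** 2) / var
    return log_lik + log_prior

# Random-walk Metropolis: a simple MCMC approximation to the posterior
theta = np.zeros(2)
samples = []
for t in range(5000):
    proposal = theta + 0.2 * rng.normal(size=2)
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[1000:])                 # discard burn-in
print(samples.mean(axis=0), samples.std(axis=0))   # posterior mean and spread
```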

Benefits of Bayesian Modeling
Our estimate of $\theta$ is not a single value (a "point") but a distribution
Can model and quantify the uncertainty (or "variance") in $\theta$ via $p(\theta \mid X)$
Can use the uncertainty in various tasks such as diagnosis and predictions, e.g., making predictions by averaging over all possible values of $\theta$:
$p(y \mid x, X, Y) = \mathbb{E}_{p(\theta \mid \cdot)}[p(y \mid x, \theta)] = \int p(y \mid x, \theta)\, p(\theta \mid X, Y)\, d\theta$
Also allows quantifying the uncertainty in the predictions
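A small sketch of this predictive averaging for the earlier Beta-Bernoulli coin example, approximating the expectation with posterior samples; the posterior counts are taken from the earlier toy data, so all values are illustrative.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)

# Posterior over the coin bias from the earlier toy example (assumed values)
a_post, b_post = 5.0, 3.0

# Posterior predictive p(y = 1 | X) = E_{p(theta | X)}[p(y = 1 | theta)],
# approximated by averaging over posterior samples of theta
theta_samples = beta(a_post, b_post).rvs(size=10000, random_state=rng)
p_heads_mc = np.mean(theta_samples)           # Monte Carlo estimate
p_heads_exact = a_post / (a_post + b_post)    # exact answer: the posterior mean

print(p_heads_mc, p_heads_exact)
```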

Other Benefits of Bayesian Modeling
Hierarchical model construction: parameters can depend on hyperparameters
Hyperparameters need not be tuned but can be inferred from data, e.g., by maximizing the marginal likelihood $p(X \mid \alpha)$
Provides robustness: e.g., learning the sparsity hyperparameter in sparse regression, or learning kernel hyperparameters in kernel methods
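A minimal empirical-Bayes sketch of this idea: for the Beta-Bernoulli model the marginal likelihood $p(X \mid a, b)$ has a closed form, so the hyperparameters can be chosen by maximizing it, here with a simple grid search over made-up data.

```python
import numpy as np
from scipy.special import betaln

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # toy coin tosses (made up)
N, s = len(x), x.sum()

def log_marginal_likelihood(a, b):
    # log p(X | a, b) for the Beta-Bernoulli model with theta integrated out:
    # p(X | a, b) = B(a + s, b + N - s) / B(a, b)
    return betaln(a + s, b + N - s) - betaln(a, b)

# Empirical Bayes sketch: pick hyperparameters by maximizing the evidence on a grid
grid = np.linspace(0.1, 10.0, 100)
best = max(((a, b) for a in grid for b in grid), key=lambda ab: log_marginal_likelihood(*ab))
print(best)
```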

Other Benefits of Probabilistic/Bayesian Modeling
Can introduce local parameters (latent variables) associated with each data point and infer those as well
Used in many problems: Gaussian mixture model, probabilistic principal component analysis, factor analysis, topic models

Other Benefits of Probabilistic/Bayesian Modeling
Enables a modular architecture: simple models can be neatly combined to solve more complex problems
Allows jointly learning across multiple data sets (sometimes also known as multitask learning or transfer learning)

Other Benefits of Bayesian Modeling
Nonparametric Bayesian modeling: a principled way to learn the model size
E.g., how many clusters (Gaussian mixture model or graph clustering), how many basis vectors (PCA) or dictionary elements (sparse coding or dictionary learning), how many topics (topic models such as LDA), etc.
Nonparametric Bayesian modeling allows the model size to grow with the data

Other Benefits of Bayesian Modeling
Sequential data acquisition or active learning
Can check how confident the learned model is w.r.t. a new data point:
$p(\theta \mid \lambda) = \text{Normal}(\theta \mid 0, \lambda^2)$  (Prior)
$p(y \mid x, \theta) = \text{Normal}(y \mid \theta^{\top} x, \sigma^2)$  (Likelihood)
$p(\theta \mid Y, X) = \text{Normal}(\theta \mid \mu_{\theta}, \Sigma_{\theta})$  (Posterior)
$p(y_0 \mid x_0, Y, X) = \text{Normal}(y_0 \mid \mu_0, \sigma_0^2)$  (Predictive distribution)
$\mu_0 = \mu_{\theta}^{\top} x_0$  (Predictive mean)
$\sigma_0^2 = \sigma^2 + x_0^{\top} \Sigma_{\theta} x_0$  (Predictive variance)
Gives a strategy to choose data points sequentially for improved learning with a budget on the amount of data available
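A sketch of these formulas for a toy Bayesian linear regression, including how the predictive variance could guide which point to query next; all data and hyperparameter values are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression with a bias feature (data and noise level made up for illustration)
lam2, sigma2 = 1.0, 0.1                      # prior variance lambda^2, noise variance sigma^2
X = np.column_stack([np.ones(5), rng.uniform(-1, 0, size=5)])   # observed inputs
Y = X @ np.array([0.5, 2.0]) + np.sqrt(sigma2) * rng.normal(size=5)

# Posterior over theta (Gaussian, in closed form for this conjugate model)
Sigma_theta = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / lam2)
mu_theta = Sigma_theta @ X.T @ Y / sigma2

# Predictive mean/variance at candidate inputs; an active learner could query
# the candidate with the largest predictive variance next
candidates = np.column_stack([np.ones(3), np.array([-0.5, 0.5, 2.0])])
pred_mean = candidates @ mu_theta
pred_var = sigma2 + np.einsum("ij,jk,ik->i", candidates, Sigma_theta, candidates)
print(pred_mean, pred_var)
print(candidates[np.argmax(pred_var)])       # most uncertain candidate (far from the data)
```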

Next Talk
Case study on Bayesian sparse linear regression
Hyperparameter estimation
Introduction to approximate Bayesian inference

Thanks! Questions?