Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning
Riashat Islam, Reasoning and Learning Lab, McGill University
September 11, 2018

ML : Many Methods with Many Links

Modelling Views of Machine Learning
Machine Learning is the science of learning models from data:
- Define a space of possible models
- Learn the parameters and structure of models from data
- Make predictions and decisions

Probabilistic Modelling
- Provides a framework for understanding what learning is. It describes how to represent and manipulate uncertainty about models and predictions.
- A model describes data one could observe from a system.
- We use the mathematics of probability theory to express all forms of uncertainty and noise associated with the model.
- We use inverse probability (Bayes rule) to infer unknown quantities, adapt our models, make accurate predictions and learn from data.

Bayes Rule
- Bayes rule tells us how to do inference about hypotheses (the uncertain quantities) from data (the measured quantities).
- Learning and prediction can be seen as forms of inference.
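
For reference, a sketch of the rule in symbols, with D denoting the observed data and θ the uncertain hypothesis or parameters (this notation is assumed here; the slide states the rule in words):

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
\qquad
P(D) = \int P(D \mid \theta)\, P(\theta)\, d\theta .
```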

Why do we care?
A robot, in order to behave intelligently, should be able to represent beliefs about propositions in the world:
- the charging station is at location (x, y, z)
- the range finder is malfunctioning
- that stormtrooper is hostile
Using probabilistic models, we want to represent the strengths of these beliefs, and be able to manipulate them.
Probabilistic learning can also be used for calibrated models and prediction uncertainty - getting systems that know what they don't know (more on this later...)

Recall : How do we fit this dataset?
Goal: predict the target y associated with any arbitrary input x.

Recall : Regression Problems
Data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). Each y_i is related to x_i by some function, plus noise independent across i.
Possible approaches:
- Use a simple parametric model: y_i is a linear function of x_i plus noise.
- Use a non-parametric method: e.g. nearest-neighbour or other sophisticated local smoothing models (later...)
- Use a flexible model for the conditional distribution of y_i given x_i, e.g. Gaussian Processes, Neural Networks (later...)

Recall : Model of the data
To predict at a new input x we need to postulate a model of the data; we will estimate y with f(x).
Example: a polynomial
f_w(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M  (1)
where w_0, ..., w_M are the weights of the polynomial, the parameters of the model.

Recall : Least squares fit for polynomials

Recall : Least squares fit for polynomials
- Should we choose a polynomial?
- What degree should we choose for the polynomial?
- For a given degree, how do we choose the weights?

Recall : Least squares fit for polynomials
Idea: measure the quality of the fit to the training data by the squared error
e_n^2 = (y_n - f(x_n))^2  (2)
and find the model parameters that minimize the sum of squared errors
E(w) = Σ_{n=1}^{N} e_n^2  (3)
f_w(x) = w_0 · 1 + w_1 x + w_2 x^2 + ... = Σ_{m=0}^{M} w_m φ_m(x)  (4)
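
A minimal sketch of this least squares fit in Python/NumPy, assuming a 1-D input, the monomial basis φ_m(x) = x^m, and a small synthetic dataset (the dataset, the degree, and the function names are illustrative choices, not from the lecture):

```python
import numpy as np

def design_matrix(x, degree):
    """Monomial basis phi_m(x) = x**m for m = 0..degree, stacked as columns."""
    return np.vstack([x**m for m in range(degree + 1)]).T

# Synthetic 1-D dataset (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

degree = 3
Phi = design_matrix(x, degree)                 # N x (M+1) design matrix

# Least squares: minimize E(w) = ||y - Phi w||^2.
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

E = np.sum((y - Phi @ w_ls) ** 2)              # sum of squared errors, eq. (3)
print("fitted weights:", w_ls, " E(w):", E)
```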

Recall : Overfitting

Recall : Regularization
Intuition: complicated hypotheses lead to overfitting.
Idea: change the error function to penalize hypothesis complexity
J(w) = J_D(w) + λ J_pen(w)  (5)
λ controls how much we value fitting the data well vs. having a simple hypothesis.
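
One common choice for J_pen(w) is the squared norm of the weights, which gives ridge regression; this particular penalty is an assumption made for the illustration, as is the λ value. A sketch of the penalized fit, reusing the design matrix from the previous snippet:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Minimize J(w) = ||y - Phi w||^2 + lam * ||w||^2 (closed-form solution)."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# Example usage with Phi, y from the least squares sketch above:
# w_ridge = ridge_fit(Phi, y, lam=1e-3)
```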

Some Open Questions
- Do we think that all models are equally probable... before we see any data?
- What does the probability of a model even mean?
- Do we need to choose a single best model, or can we consider several?
- We need a framework to answer such questions.
- Perhaps our training targets are contaminated with noise. What to do? We will start here.

Observation Noise
Assume the data is generated by the red function, with each f(x_n) independently contaminated by a noise term ɛ_n. The observations are noisy:
y_n = f_w(x_n) + ɛ_n  (6)
We characterize the noise by a probability density function, ɛ_n ~ N(ɛ_n; 0, σ²_noise).

Probability of Observed Data given Model
Given that y = f + ɛ, we can write the probability of y given f, where E(w) is the sum of squared errors
E(w) = Σ_{n=1}^{N} (y_n - f_w(x_n))^2 = ||y - Φw||^2  (7)
Since f = Φw, we can write p(y | w, σ²_noise) = p(y | f, σ²_noise) for a given Φ.

Statistical Parameter Fitting
- Given instances x_1, x_2, ..., x_m that are i.i.d., find a set of parameters w such that the data can be summarized by a probability P(x | w).
- The parameters w depend on the family of probability distributions we consider (e.g. Multinomial, Gaussian).
- For regression and supervised methods, we have special target variables we are interested in: P(y | x, w).

Likelihood Function
The likelihood of the parameters is the probability of the data given the parameters: p(y | w, σ²_noise) is the probability of the observed data given the weights, and L(w) ∝ p(y | x, w, σ²_noise) is the likelihood of the weights.
Suppose we assume that the noise in the regression model is Gaussian (normal) with mean zero and some variance σ². We can then write down the likelihood function for the parameters w, which is the joint probability density of all the y_i as a function of w:
L(w) = P(y_1, ..., y_n | x_1, ..., x_n, w) = ∏_{i=1}^{n} N(y_i; w^T φ(x_i), σ²)  (8)
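
In code, the log of this likelihood for a candidate weight vector can be evaluated directly; the sketch below assumes the Φ and y arrays from the earlier snippets and an arbitrarily chosen noise variance (all names are illustrative):

```python
import numpy as np

def gaussian_log_likelihood(w, Phi, y, sigma2):
    """log L(w) = sum_i log N(y_i; w^T phi(x_i), sigma2) under i.i.d. Gaussian noise."""
    resid = y - Phi @ w
    n = y.shape[0]
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(resid**2) / sigma2

# Example: a better-fitting weight vector gets a higher log-likelihood.
# gaussian_log_likelihood(w_ls, Phi, y, sigma2=0.01)
```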

Maximum Likelihood Estimation
The probability of y given f is therefore
p(y | f, σ²) = N(y; f, σ² I) = (2πσ²)^{-N/2} exp(-E(w) / (2σ²))  (9)
where E(w) is the sum of squared errors, E(w) = ||y - Φw||².
Maximum Likelihood: we can fit the model weights to the data by maximising the likelihood,
ŵ = arg max_w L(w) = arg max_w exp(-E(w) / (2σ²)) = arg min_w E(w)  (10)
With an additive, independent Gaussian noise model, the maximum likelihood and least squares solutions are the same.
We still have not solved the prediction problem! We still overfit.
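
A one-line check of this equivalence, obtained by taking the negative log of equation (9) (standard algebra, with Φ the design matrix):

```latex
-\log p(\mathbf{y} \mid \mathbf{f}, \sigma^2)
= \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\|\mathbf{y} - \Phi\mathbf{w}\|^2
```

so minimizing the negative log-likelihood over w is the same as minimizing E(w) = ||y - Φw||², since the first term does not depend on w.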

Probabilistic Approach
So far, we looked at a probabilistic approach to learning a parametric form of a function that could fit the data. For a Gaussian noise model, this approach makes the same predictions as using the squared loss error function:
log p(y | x, w, σ²) ∝ -(1 / 2σ²) Σ_{i=1}^{N} (f(x_i, w) - y(x_i))²  (11)
Regularization: we still need to use a penalized loss function for regularization.

Probabilistic Approach
The probabilistic approach helps us interpret the error measure of the deterministic approach, and gives us a sense of the noise level σ². Probabilistic methods thus provide an intuitive framework for representing uncertainty and for model development.

Multiple Explanations of the Data
- We do not believe all models are equally probable explanations of the data; we may believe that a simpler model is more probable than a complex one.
- Model complexity and uncertainty: we do not know which particular function generated the data.
- More than one of our models can perfectly fit the data, and we believe more than one of our models could have generated it.
- We want to reason in terms of a set of possible explanations, not just one.

Likelihood and MLE

Bayesian Machine Learning

What does it mean to be Bayesian?
- Dealing with all sources of parameter uncertainty
- Potentially dealing with structure uncertainty
Figure: Yarin Gal's thesis, Uncertainty in Deep Learning (2016)

Why should we care about Uncertainty?
Figure: Yarin Gal talk (2016), "What my deep model doesn't know"

Bayesian View
- Formulate knowledge about the situation probabilistically:
  - Define a model that expresses qualitative aspects of our knowledge. The model will have unknown parameters.
  - Specify a prior probability distribution for these unknown parameters that expresses our beliefs about which values are likely, before seeing the data.
- Gather the data.
- Compute the posterior distribution for the parameters, given the observed data.
- Use this posterior distribution to:
  - Make predictions by averaging over the posterior distribution
  - Make decisions so as to minimize posterior expected loss

Maximum A Posteriori (MAP)
Recall in MLE: choose the value that maximizes the probability of the observed data,
θ_MLE = arg max_θ P(D | θ)  (12)
In MAP estimation, we use prior beliefs: choose the value that is most probable given the observed data and the prior belief,
θ_MAP = arg max_θ P(θ | D) = arg max_θ P(D | θ) P(θ)  (13)
In MAP, we seek the value of θ which maximizes the posterior P(θ | D); it estimates θ as the mode of the posterior distribution.
These are point estimates and not fully Bayesian. In Bayesian methods, we use distributions to summarize and draw inferences from data.

Priors on Parameters induce Priors on Functions

Finding the Posterior Distribution
Given the prior over functions (parameters), we want to find the posterior distribution and use it to make predictions.
The posterior distribution for the model parameters given the observed data is found by combining the prior distribution with the likelihood for the parameters given the data (Bayes rule).
The denominator is just a normalizing constant, so up to proportionality we can write: posterior ∝ likelihood × prior.

Predictive Distribution
We make predictions by integrating with respect to the posterior distribution. Think of each setting of w as a different model: the resulting integral is a Bayesian model average, an average of infinitely many models weighted by their posterior probabilities.
No overfitting - complexity is automatically calibrated.

Maximum Likelihood, Parametric Model
Data: x, y. Model M: y = f_w(x) + ɛ, with Gaussian likelihood
p(y | x, w, M) ∝ ∏_{n=1}^{N} exp(-½ (y_n - f_w(x_n))² / σ²_noise)  (14)
Maximize the likelihood:
w_ML = arg max_w p(y | x, w, M)  (15)
Make predictions by plugging in the ML estimate:
p(y* | x*, w_ML, M)  (16)

Bayesian Inference, Parametric Model
Prior, likelihood, posterior and predictive distributions are all Gaussian distributions.
Data: x, y. Model M: y = f_w(x) + ɛ, with Gaussian likelihood
p(y | x, w, M) ∝ ∏_{n=1}^{N} exp(-½ (y_n - f_w(x_n))² / σ²_noise)  (17)
and parameter prior p(w | M).
Posterior parameter distribution (Bayes rule):
p(w | x, y, M) = p(w | M) p(y | w, x, M) / p(y | x, M)  (18)
Making predictions (marginalizing out the parameters):
p(y* | x*, x, y, M) = ∫ p(y* | w, x*, M) p(w | x, y, M) dw  (19)
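
A compact sketch of equations (18) and (19) for the linear-in-the-parameters model f_w(x) = w^T φ(x), assuming a zero-mean isotropic Gaussian prior w ~ N(0, α⁻¹ I); the particular α and σ²_noise values, and the function names, are assumptions made for illustration, while the closed-form expressions are the standard Gaussian results:

```python
import numpy as np

def posterior(Phi, y, alpha, sigma2_noise):
    """Posterior p(w | x, y) = N(w; m_N, S_N) for prior w ~ N(0, alpha^-1 I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + (Phi.T @ Phi) / sigma2_noise
    S_N = np.linalg.inv(S_N_inv)
    m_N = S_N @ Phi.T @ y / sigma2_noise
    return m_N, S_N

def predictive(phi_star, m_N, S_N, sigma2_noise):
    """Predictive p(y* | x*, x, y) = N(y*; phi(x*)^T m_N, noise + model variance)."""
    mean = phi_star @ m_N
    var = sigma2_noise + phi_star @ S_N @ phi_star
    return mean, var

# Example usage with Phi, y and design_matrix from the earlier snippets:
# m_N, S_N = posterior(Phi, y, alpha=1.0, sigma2_noise=0.01)
# mean, var = predictive(design_matrix(np.array([0.5]), 3)[0], m_N, S_N, 0.01)
```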

Posterior and Predictive Distributions

Conjugate Priors
For most Bayesian inference problems, the integrals needed to do inference and prediction are not analytically tractable - hence the need for various approximations. Most of the exceptions involve conjugate priors, which combine nicely with the likelihood to give a posterior distribution of the same form.
Basic idea: given a likelihood function p(x | θ), choose a family of prior distributions such that the integrals can be obtained tractably. If the prior p(θ) and the posterior p(θ | x) belong to the same family of distributions, the prior is called a conjugate prior.
If the likelihood function is Gaussian, choosing a Gaussian prior over the mean will ensure that the posterior distribution is also Gaussian.
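
As a concrete instance of this last point (a standard textbook result, sketched here with assumed prior parameters μ_0 and τ_0²): for a Gaussian likelihood with known variance σ² and a Gaussian prior on the mean,

```latex
x_i \sim \mathcal{N}(\mu, \sigma^2), \qquad \mu \sim \mathcal{N}(\mu_0, \tau_0^2)
\;\;\Longrightarrow\;\;
p(\mu \mid x_{1:n}) = \mathcal{N}\!\left(
\frac{\tau_0^{-2}\mu_0 + \sigma^{-2}\sum_i x_i}{\tau_0^{-2} + n\,\sigma^{-2}},
\;\left(\tau_0^{-2} + n\,\sigma^{-2}\right)^{-1}
\right),
```

so the posterior stays in the Gaussian family, with precisions adding and the mean being a precision-weighted average of prior mean and data.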

Regularisation = MAP Bayesian Inference
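
The link suggested by this slide title can be sketched as follows, assuming the Gaussian likelihood above and a zero-mean Gaussian prior w ~ N(0, α⁻¹ I) (this specific prior is an assumption made for the illustration):

```latex
\hat{\mathbf{w}}_{\text{MAP}}
= \arg\max_{\mathbf{w}} \big[\log p(\mathbf{y}\mid \mathbf{x},\mathbf{w}) + \log p(\mathbf{w})\big]
= \arg\min_{\mathbf{w}} \Big[\|\mathbf{y}-\Phi\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2\Big],
\qquad \lambda = \alpha\,\sigma^2_{\text{noise}},
```

i.e. MAP estimation with a Gaussian prior recovers the penalized objective J(w) = J_D(w) + λ J_pen(w) from the regularization slide, with the prior playing the role of the penalty.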

Representing Prior and Posterior by Samples
The complex distributions we will often use as priors, or obtain as posteriors, may not be easy to represent explicitly. A general technique is to represent a distribution by a sample of many values drawn randomly from it. We can then:
- Visualize the distribution by viewing these sample values, or low-dimensional projections of them (PCA... later)
- Make Monte Carlo estimates of probabilities or expectations with respect to the distribution, by taking averages over these sample values
Obtaining a sample from the prior is easy! Obtaining a sample from the posterior is usually more difficult - but this is nevertheless a dominant approach to Bayesian computation.
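
A minimal sketch of the Monte Carlo idea, drawing samples from the Gaussian posterior computed in the earlier Bayesian linear regression snippet; the sample size, the threshold in the probability estimate, and the function name are arbitrary illustrative choices:

```python
import numpy as np

def monte_carlo_prediction(phi_star, m_N, S_N, n_samples=5000, seed=0):
    """Estimate E[f(x*)] and P(f(x*) > 0) by averaging over posterior samples of w."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(m_N, S_N, size=n_samples)  # samples w ~ p(w | data)
    f_star = W @ phi_star                                   # f(x*) for each sampled w
    return f_star.mean(), np.mean(f_star > 0.0)

# Example usage (m_N, S_N from the posterior snippet, design_matrix from earlier):
# mean_est, prob_pos = monte_carlo_prediction(design_matrix(np.array([0.5]), 3)[0], m_N, S_N)
```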

Representing Prior and Posterior by Samples

Comparing Models : Marginal Likelihood
What if we are unsure which model is right? So far we assumed we were able to start by making a definite choice of model.
We can compare models based on the marginal likelihood (also known as the model evidence) for each model - this is the probability the model assigns to the observed data. It is the normalizing constant in Bayes rule, which we ignored previously.
Here, M_1 represents the condition that model M_1 is the correct one. Similarly, we can compute P(data | M_2) for some other model.
We might choose the model that gives higher probability to the data, or average predictions from both models with weights based on the marginal likelihood, multiplied by any prior preference we have for M_1 versus M_2.

Marginal Likelihood
From the equation for the posterior distribution, the marginal likelihood is given by
p(y | x, M) = ∫ p(w | M) p(y | x, w, M) dw  (20)
Second-level inference: model comparison,
p(M | y, x) = p(y | x, M) p(M) / p(y | x) ∝ p(y | x, M) p(M)  (21)
The marginal likelihood is used to select between models.
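
For the Gaussian parametric model above, the integral in (20) has a closed form: with prior w ~ N(0, α⁻¹ I) and Gaussian noise, y | x, M ~ N(0, σ²_noise I + α⁻¹ Φ Φᵀ). A sketch of the evidence computation under these assumptions (α, σ²_noise, and the function names are illustrative; design_matrix is the helper from the earlier snippet), which could be used to compare, say, polynomial degrees:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_marginal_likelihood(Phi, y, alpha, sigma2_noise):
    """log p(y | x, M) for prior w ~ N(0, alpha^-1 I) and Gaussian observation noise."""
    N = Phi.shape[0]
    cov = sigma2_noise * np.eye(N) + (Phi @ Phi.T) / alpha
    return multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(y)

# Example: compare candidate polynomial degrees by their evidence.
# for d in range(1, 10):
#     print(d, log_marginal_likelihood(design_matrix(x, d), y, alpha=1.0, sigma2_noise=0.01))
```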

Posterior Probability of a Function

Priors on parameters induce priors on functions

Maximum Likelihood, Parametric Model

Bayesian Inference, Parametric Model

Bayesian Inference, Parametric Model (cont)

Posterior and Predictive Distribution in detail

Marginal Likelihood

Understanding Marginal Likelihood. Models

Understanding Marginal Likelihood. Posterior

Predictive Distribution

Conclusions

Thank You
Slides dedicated to David MacKay ("I wish I had known you better...") - Information Theory, Inference, and Learning Algorithms.