CSC321 Lecture 18: Learning Probabilistic Models


CSC321 Lecture 18: Learning Probabilistic Models. Roger Grosse.

Overview
So far in this course: mainly supervised learning. Language modeling was our one unsupervised task; we broke it down into a series of prediction tasks. This was an example of distribution estimation: we'd like to learn a distribution which looks as much as possible like the input data. This lecture: basic concepts in probabilistic modeling. This will be review if you've taken 411. Following two lectures: more recent approaches to unsupervised learning.

Maximum Likelihood
We already used maximum likelihood in this course for training language models. Let's cover it in a bit more generality. Motivating example: estimating the parameter of a biased coin. You flip a coin 100 times. It lands heads $N_H = 55$ times and tails $N_T = 45$ times. What is the probability it will come up heads if we flip it again? Model: flips are independent Bernoulli random variables with parameter $\theta$; assume the observations are independent and identically distributed (i.i.d.).

Maximum Likelihood
The likelihood function is the probability of the observed data, as a function of $\theta$. In our case, it's the probability of a particular sequence of H's and T's. Under the Bernoulli model with i.i.d. observations,
$$L(\theta) = p(\mathcal{D}) = \theta^{N_H}(1-\theta)^{N_T}$$
This takes very small values (in this case, $L(0.5) = 0.5^{100} \approx 7.9 \times 10^{-31}$). Therefore, we usually work with log-likelihoods:
$$\ell(\theta) = \log L(\theta) = N_H \log \theta + N_T \log(1-\theta)$$
Here, $\ell(0.5) = \log 0.5^{100} = 100 \log 0.5 = -69.31$.
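As a quick numerical check of the values above, here is a minimal Python/NumPy sketch (the function name is my own, not from the lecture):

```python
import numpy as np

N_H, N_T = 55, 45

def log_likelihood(theta):
    """Bernoulli log-likelihood: N_H log(theta) + N_T log(1 - theta)."""
    return N_H * np.log(theta) + N_T * np.log(1 - theta)

print(np.exp(log_likelihood(0.5)))   # L(0.5) = 0.5**100, about 7.9e-31
print(log_likelihood(0.5))           # l(0.5) = 100 * log(0.5), about -69.31
```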

Maximum Likelihood
[Plot of the likelihood over $\theta$ for $N_H = 55$, $N_T = 45$.]

Maximum Likelihood
Good values of $\theta$ should assign high probability to the observed data. This motivates the maximum likelihood criterion. Remember how we found the optimal solution to linear regression by setting derivatives to zero? We can do that again for the coin example:
$$\frac{d\ell}{d\theta} = \frac{d}{d\theta}\left(N_H \log \theta + N_T \log(1-\theta)\right) = \frac{N_H}{\theta} - \frac{N_T}{1-\theta}$$
Setting this to zero gives the maximum likelihood estimate:
$$\hat{\theta}_{\mathrm{ML}} = \frac{N_H}{N_H + N_T}$$
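A minimal sketch (my own, not from the lecture) confirming the closed-form estimate by a grid search over the log-likelihood:

```python
import numpy as np

N_H, N_T = 55, 45
thetas = np.linspace(0.01, 0.99, 999)
log_liks = N_H * np.log(thetas) + N_T * np.log(1 - thetas)

theta_grid = thetas[np.argmax(log_liks)]   # maximizer found by grid search
theta_ml = N_H / (N_H + N_T)               # closed-form MLE from the slide
print(theta_grid, theta_ml)                # both approximately 0.55
```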

Maximum Likelihood
This is equivalent to minimizing cross-entropy. Let $t_i = 1$ for heads and $t_i = 0$ for tails. Then
$$\mathcal{L}_{\mathrm{CE}} = \sum_i -t_i \log \theta - (1-t_i)\log(1-\theta) = -N_H \log \theta - N_T \log(1-\theta) = -\ell(\theta)$$
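A small numeric check of this identity, under the same coin-flip setup (variable names are mine):

```python
import numpy as np

N_H, N_T = 55, 45
t = np.array([1] * N_H + [0] * N_T)   # 1 for heads, 0 for tails
theta = 0.3                           # any value in (0, 1) works

cross_entropy = np.sum(-t * np.log(theta) - (1 - t) * np.log(1 - theta))
neg_log_lik = -(N_H * np.log(theta) + N_T * np.log(1 - theta))
print(np.isclose(cross_entropy, neg_log_lik))   # True: L_CE = -l(theta)
```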

Maximum Likelihood
Recall the Gaussian, or normal, distribution:
$$\mathcal{N}(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The Central Limit Theorem says that sums of lots of independent random variables are approximately Gaussian. In machine learning, we use Gaussians a lot because they make the calculations easy.

Maximum Likelihood
Suppose we want to model the distribution of temperatures in Toronto in March, and we've recorded the following observations: -2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4. Assume they're drawn from a Gaussian distribution with known standard deviation $\sigma = 5$, and we want to find the mean $\mu$. Log-likelihood function:
$$\ell(\mu) = \log \prod_{i=1}^N \left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right)\right] = \sum_{i=1}^N \log \left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right)\right] = \sum_{i=1}^N \underbrace{-\tfrac{1}{2}\log 2\pi - \log \sigma}_{\text{constant!}} - \frac{(x^{(i)}-\mu)^2}{2\sigma^2}$$

Maximum Likelihood
Maximize the log-likelihood by setting the derivative to zero:
$$0 = \frac{d\ell}{d\mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^N \frac{d}{d\mu}(x^{(i)}-\mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^N (x^{(i)}-\mu)$$
Solving, we get $\mu = \frac{1}{N}\sum_{i=1}^N x^{(i)}$. This is just the mean of the observed values, or the empirical mean.

Maximum Likelihood
In general, we don't know the true standard deviation $\sigma$, but we can solve for it as well. Set the partial derivatives to zero, just like in linear regression:
$$0 = \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^N (x^{(i)}-\mu)$$
$$0 = \frac{\partial \ell}{\partial \sigma} = \frac{\partial}{\partial \sigma} \sum_{i=1}^N \left[-\frac{1}{2}\log 2\pi - \log \sigma - \frac{1}{2\sigma^2}(x^{(i)}-\mu)^2\right] = \sum_{i=1}^N \left[-\frac{1}{\sigma} + \frac{1}{\sigma^3}(x^{(i)}-\mu)^2\right] = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^N (x^{(i)}-\mu)^2$$
This gives the maximum likelihood estimates
$$\hat{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{i=1}^N x^{(i)} \qquad \hat{\sigma}_{\mathrm{ML}} = \sqrt{\frac{1}{N}\sum_{i=1}^N (x^{(i)}-\mu)^2}$$
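A minimal NumPy sketch computing these estimates for the Toronto observations listed earlier (variable names are mine):

```python
import numpy as np

# March temperature observations from the slides
x = np.array([-2.5, -9.9, -12.1, -8.9, -6.0, -4.8, 2.4])

mu_ml = x.mean()                               # empirical mean = MLE of mu
sigma_ml = np.sqrt(np.mean((x - mu_ml) ** 2))  # MLE of sigma (divides by N, not N-1)
print(mu_ml, sigma_ml)
```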

Maximum Likelihood
So far, maximum likelihood has told us to use empirical counts or statistics:
Bernoulli: $\theta = \frac{N_H}{N_H + N_T}$
Gaussian: $\mu = \frac{1}{N}\sum_i x^{(i)}$, $\sigma^2 = \frac{1}{N}\sum_i (x^{(i)}-\mu)^2$
This doesn't always happen; e.g. for the neural language model, there was no closed form, and we needed to use gradient descent. But these simple examples are still very useful for thinking about maximum likelihood.

Data Sparsity
Maximum likelihood has a pitfall: if you have too little data, it can overfit. E.g., what if you flip the coin twice and get H both times?
$$\hat{\theta}_{\mathrm{ML}} = \frac{N_H}{N_H + N_T} = \frac{2}{2+0} = 1$$
Because it never observed T, it assigns this outcome probability 0. This problem is known as data sparsity. If you observe a single T in the test set, the log-likelihood is $-\infty$.
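A tiny sketch illustrating the failure mode (an assumed setup, not from the lecture): two observed heads give $\hat{\theta}_{\mathrm{ML}} = 1$, and a single tail in the test set then has log-probability $-\infty$.

```python
import numpy as np

train = np.array([1, 1])                 # two flips, both heads
theta_ml = train.mean()                  # MLE: N_H / (N_H + N_T) = 1.0

test = np.array([0])                     # a single tail in the test set
probs = np.where(test == 1, theta_ml, 1 - theta_ml)   # model probability of each flip
with np.errstate(divide="ignore"):
    print(theta_ml, np.log(probs).sum())               # 1.0, -inf
```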

(The rest of this lecture is optional.)

Bayesian Parameter Estimation (optional)
In maximum likelihood, the observations are treated as random variables, but the parameters are not. The Bayesian approach treats the parameters as random variables as well. To define a Bayesian model, we need to specify two distributions: the prior distribution $p(\theta)$, which encodes our beliefs about the parameters before we observe the data, and the likelihood $p(\mathcal{D}|\theta)$, same as in maximum likelihood. When we update our beliefs based on the observations, we compute the posterior distribution using Bayes' Rule:
$$p(\theta|\mathcal{D}) = \frac{p(\theta)\,p(\mathcal{D}|\theta)}{\int p(\theta')\,p(\mathcal{D}|\theta')\,d\theta'}$$
We rarely ever compute the denominator explicitly.

Bayesian Parameter Estimation (optional)
Let's revisit the coin example. We already know the likelihood:
$$L(\theta) = p(\mathcal{D}) = \theta^{N_H}(1-\theta)^{N_T}$$
It remains to specify the prior $p(\theta)$. We can choose an uninformative prior, which assumes as little as possible. A reasonable choice is the uniform prior. But our experience tells us 0.5 is more likely than 0.99. One particularly useful prior that lets us specify this is the beta distribution:
$$p(\theta; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\theta^{a-1}(1-\theta)^{b-1}$$
The proportionality notation lets us ignore the normalization constant:
$$p(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$$

Bayesian Parameter Estimation (optional)
[Plot: beta distribution densities for various values of $a$, $b$.]
Some observations: the expectation is $\mathbb{E}[\theta] = a/(a+b)$; the distribution gets more peaked when $a$ and $b$ are large; the uniform distribution is the special case $a = b = 1$. The main thing the beta distribution is used for is as a prior for the Bernoulli distribution.

Bayesian Parameter Estimation (optional)
Computing the posterior distribution:
$$p(\theta|\mathcal{D}) \propto p(\theta)\,p(\mathcal{D}|\theta) \propto \left[\theta^{a-1}(1-\theta)^{b-1}\right]\left[\theta^{N_H}(1-\theta)^{N_T}\right] = \theta^{a-1+N_H}(1-\theta)^{b-1+N_T}$$
This is just a beta distribution with parameters $N_H + a$ and $N_T + b$. The posterior expectation of $\theta$ is:
$$\mathbb{E}[\theta|\mathcal{D}] = \frac{N_H + a}{N_H + N_T + a + b}$$
The parameters $a$ and $b$ of the prior can be thought of as pseudo-counts. The reason this works is that the prior and likelihood have the same functional form. This phenomenon is known as conjugacy, and it's very useful.
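A minimal sketch of this conjugate update using scipy.stats; the choice $a = b = 2$ is purely an illustrative assumption on my part:

```python
from scipy.stats import beta

a, b = 2, 2                 # beta prior hyperparameters (pseudo-counts)
N_H, N_T = 55, 45           # observed coin flips

posterior = beta(a + N_H, b + N_T)   # conjugacy: posterior is Beta(N_H + a, N_T + b)
print(posterior.mean())              # (N_H + a) / (N_H + N_T + a + b) = 57/104
```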

Bayesian Parameter Estimation (optional)
[Plots: Bayesian inference for the coin flip example in the small data setting ($N_H = 2$, $N_T = 0$) and the large data setting ($N_H = 55$, $N_T = 45$).]
When you have enough observations, the data overwhelm the prior.

Bayesian Parameter Estimation (optional)
What do we actually do with the posterior? The posterior predictive distribution is the distribution over future observables given the past observations. We compute it by marginalizing out the parameter(s):
$$p(\mathcal{D}'|\mathcal{D}) = \int p(\theta|\mathcal{D})\,p(\mathcal{D}'|\theta)\,d\theta$$
For the coin flip example:
$$\theta_{\mathrm{pred}} = \Pr(x = H \mid \mathcal{D}) = \int p(\theta|\mathcal{D})\Pr(x = H \mid \theta)\,d\theta = \int \mathrm{Beta}(\theta; N_H + a, N_T + b)\,\theta\,d\theta = \mathbb{E}_{\mathrm{Beta}(\theta; N_H + a, N_T + b)}[\theta] = \frac{N_H + a}{N_H + N_T + a + b}$$
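As a sanity check on the last integral, a sketch that computes the predictive probability both by numerical integration and by the closed form (again assuming $a = b = 2$ for illustration):

```python
from scipy.stats import beta
from scipy.integrate import quad

a, b = 2, 2
N_H, N_T = 55, 45

posterior = beta(a + N_H, b + N_T)

# theta_pred = integral of p(theta | D) * theta over [0, 1]
theta_pred_numeric, _ = quad(lambda th: posterior.pdf(th) * th, 0, 1)
theta_pred_closed = (N_H + a) / (N_H + N_T + a + b)
print(theta_pred_numeric, theta_pred_closed)   # both approximately 0.548
```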

Bayesian Parameter Estimation (optional)
Bayesian estimation of the mean temperature in Toronto: assume the observations are i.i.d. Gaussian with known standard deviation $\sigma$ and unknown mean $\mu$, and place a broad Gaussian prior over $\mu$, centered at 0. We can compute the posterior and posterior predictive distributions analytically (full derivation in the notes). Why is the posterior predictive distribution more spread out than the posterior distribution?

Bayesian Parameter Estimation (optional)
Comparison of maximum likelihood and Bayesian parameter estimation: the Bayesian approach deals better with data sparsity. Maximum likelihood is an optimization problem, while Bayesian parameter estimation is an integration problem. This means maximum likelihood is much easier in practice, since we can just do gradient descent; automatic differentiation packages make it really easy to compute gradients. There aren't any comparable black-box tools for Bayesian parameter estimation (although Stan can do quite a lot).

Maximum A-Posteriori Estimation (optional)
Maximum a-posteriori (MAP) estimation: find the most likely parameter settings under the posterior. This converts the Bayesian parameter estimation problem into a maximization problem:
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta\, p(\theta|\mathcal{D}) = \arg\max_\theta\, p(\theta, \mathcal{D}) = \arg\max_\theta\, p(\theta)\,p(\mathcal{D}|\theta) = \arg\max_\theta\, \log p(\theta) + \log p(\mathcal{D}|\theta)$$

Maximum A-Posteriori Estimation (optional)
Joint probability in the coin flip example:
$$\log p(\theta, \mathcal{D}) = \log p(\theta) + \log p(\mathcal{D}|\theta) = \text{const} + (a-1)\log\theta + (b-1)\log(1-\theta) + N_H \log\theta + N_T \log(1-\theta) = \text{const} + (N_H + a - 1)\log\theta + (N_T + b - 1)\log(1-\theta)$$
Maximize by finding a critical point:
$$0 = \frac{d}{d\theta}\log p(\theta, \mathcal{D}) = \frac{N_H + a - 1}{\theta} - \frac{N_T + b - 1}{1-\theta}$$
Solving for $\theta$,
$$\hat{\theta}_{\mathrm{MAP}} = \frac{N_H + a - 1}{N_H + N_T + a + b - 2}$$
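A minimal sketch comparing the closed-form MAP estimate with a numerical maximization of the log joint probability (with the same assumed $a = b = 2$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 2, 2
N_H, N_T = 55, 45

def neg_log_joint(theta):
    """Negative of log p(theta, D), up to an additive constant."""
    return -((N_H + a - 1) * np.log(theta) + (N_T + b - 1) * np.log(1 - theta))

result = minimize_scalar(neg_log_joint, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_map_closed = (N_H + a - 1) / (N_H + N_T + a + b - 2)
print(result.x, theta_map_closed)    # both approximately 0.549 (56/102)
```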

Maximum A-Posteriori Estimation (optional)
Comparison of estimates in the coin flip example (the numbers correspond to a prior with $a = b = 2$):

| Estimate | Formula | $N_H = 2$, $N_T = 0$ | $N_H = 55$, $N_T = 45$ |
| --- | --- | --- | --- |
| $\hat{\theta}_{\mathrm{ML}}$ | $\frac{N_H}{N_H + N_T}$ | 1 | 55/100 = 0.55 |
| $\theta_{\mathrm{pred}}$ | $\frac{N_H + a}{N_H + N_T + a + b}$ | 4/6 ≈ 0.67 | 57/104 ≈ 0.548 |
| $\hat{\theta}_{\mathrm{MAP}}$ | $\frac{N_H + a - 1}{N_H + N_T + a + b - 2}$ | 3/4 = 0.75 | 56/102 ≈ 0.549 |

$\hat{\theta}_{\mathrm{MAP}}$ assigns nonzero probabilities as long as $a, b > 1$.

Maximum A-Posteriori Estimation (optional)
[Plots: comparison of predictions in the Toronto temperatures example, with 1 observation and with 7 observations.]