
Density Estimation

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin

Supervised vs Unsupervised Learning

The goal of learning is to train probabilistic models from observed data: $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ for supervised learning or $\mathcal{D} = \{x_n\}_{n=1}^N$ for unsupervised learning.

Supervised learning: Assume a parameterized model $p(y \mid x, \theta) = \int p(y, z \mid x, \theta)\, dz$. Use $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ to learn a mapping from input to output under a probabilistic model. Examples: linear regression, logistic regression, mixture of experts.

Unsupervised learning: Assume a parameterized model $p(x \mid \theta) = \int p(x, z \mid \theta)\, dz$. Fit a probabilistic model to $\mathcal{D} = \{x_1, \ldots, x_N\}$. Examples: latent class models (e.g., MoG) and latent feature models (e.g., PPCA).

Why Latent Variable Models? An Example

A Gaussian $p(x_1, \ldots, x_D)$ requires $D$ independent parameters for the mean vector and $D(D+1)/2$ independent parameters for the covariance matrix, i.e., $D(D+3)/2$ parameters in total. The number of independent parameters grows with $D^2$.

Marginal independence assumes $p(x_1, \ldots, x_D) = \prod_{i=1}^D p(x_i)$, which requires just $2D$ free parameters.

Conditional independence assumes $p(x_1, \ldots, x_D \mid z_1, \ldots, z_K) = \prod_{i=1}^D p(x_i \mid z_1, \ldots, z_K)$. In the case of linear models, the number of independent parameters grows with $D$ (actually, $DK + 2D$ are needed).
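A quick numerical check of these counts (a minimal sketch of my own; the choice of latent dimension $K$ is illustrative):

```python
# Parameter counts for modeling a D-dimensional Gaussian density.
def full_gaussian(D):
    # D for the mean, D*(D+1)/2 for a full symmetric covariance: D*(D+3)/2 total.
    return D + D * (D + 1) // 2

def marginal_independence(D):
    # One mean and one variance per dimension.
    return 2 * D

def linear_latent(D, K):
    # Linear latent variable model: D*K loadings plus a mean and a
    # noise variance per observed dimension (DK + 2D).
    return D * K + 2 * D

for D in (10, 100, 1000):
    print(D, full_gaussian(D), marginal_independence(D), linear_latent(D, K=5))
```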

Density Estimation

Density estimation is the problem of modeling a probability density function $p(x)$, given a finite number of data points $\{x_n\}_{n=1}^N$ drawn from that density function.

Approaches to density estimation:

Parametric estimation: Assumes a specific functional form for the density model. A number of parameters are optimized by fitting the model to the data set. Examples: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, Bayesian inference.

Nonparametric estimation: No specific functional form is assumed; the form of the density is determined entirely by the data. Examples: Parzen windows and Bayesian nonparametrics.

Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE)

The likelihood function is nothing but a parameterized density $p(x \mid \theta)$ that is used to model a set of data $X = \{x_1, \ldots, x_N\}$, which are assumed to be drawn independently from $p(x \mid \theta)$:
$$p(X \mid \theta) = \prod_{n=1}^N p(x_n \mid \theta).$$
Maximum likelihood seeks the optimum values for the parameters by maximizing the likelihood function computed from the training data. The log-likelihood is given by
$$\mathcal{L}(\theta) = \sum_{n=1}^N \log p(x_n \mid \theta).$$
ML finds
$$\theta_{ML} = \arg\max_\theta \mathcal{L}(\theta).$$
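As a concrete illustration (a minimal sketch of my own, not from the slides), the Gaussian log-likelihood can be evaluated directly, and its maximizer over the mean coincides with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)   # data drawn from N(2, 1)

def log_likelihood(theta, x, sigma2=1.0):
    # L(theta) = sum_n log N(x_n | theta, sigma2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - theta)**2 / (2 * sigma2))

thetas = np.linspace(0.0, 4.0, 2001)
theta_ml = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print(theta_ml, x.mean())   # the grid maximizer agrees with the sample mean
```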

MLE: Kullback Matching Perspective

Suppose that we are given a set of data $X = [x_1, \ldots, x_N]$ drawn from an underlying distribution $p(x)$.

Empirical distribution: $\tilde{p}(x) = \frac{1}{N} \sum_{n=1}^N \delta(x - x_n)$. Model: $p(x \mid \theta)$.

Fit the model $p(x \mid \theta)$ to the data $X$:
$$\arg\min_\theta D_{KL}[\tilde{p}(x) \,\|\, p(x \mid \theta)] = \arg\min_\theta \int \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x \mid \theta)}\, dx = \arg\min_\theta \left[ -H(\tilde{p}) - \int \tilde{p}(x) \log p(x \mid \theta)\, dx \right],$$
leading to
$$\arg\max_\theta \mathbb{E}_{\tilde{p}}[\log p(x \mid \theta)] = \arg\max_\theta \frac{1}{N} \int \sum_{n=1}^N \delta(x - x_n) \log p(x \mid \theta)\, dx = \arg\max_\theta \frac{1}{N} \sum_{n=1}^N \log p(x_n \mid \theta).$$
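For a finite sample space this equivalence is easy to check numerically. The sketch below (my own discrete example, not from the slides) fits a categorical model and shows that the distribution minimizing the KL divergence to the empirical distribution is just the empirical frequencies, i.e., the MLE:

```python
import numpy as np

counts = np.array([30, 50, 20])        # observed counts for three categories
p_emp = counts / counts.sum()          # empirical distribution \tilde{p}

def kl(p, q):
    # D_KL[p || q] for discrete distributions (terms with p = 0 contribute 0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Randomly search over candidate model distributions q on the simplex.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=20000)
best = candidates[np.argmin([kl(p_emp, q) for q in candidates])]

print(p_emp)   # MLE for the categorical model: the empirical frequencies
print(best)    # KL-minimizing candidate: approximately the same distribution
```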

Estimation

Estimator: a statistic whose calculated value is used to estimate the model parameter $\theta$. Estimate: a particular realization of an estimator, $\hat{\theta}$.

Good estimators are:
Consistent: $\lim_{N \to \infty} P(|\hat{\theta} - \theta| > \epsilon) = 0$.
Unbiased: $\mathbb{E}_{p(x \mid \theta)}[\hat{\theta}] = \theta$.

Parameter Estimation: An Example

Suppose that we wish to estimate $\theta$ from its noisy observations $x_n = \theta + \epsilon_n$ for $n = 1, \ldots, N$, where $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$.

Estimator 1: take the first sample only, $\hat{\theta} = x_1$. Mean and variance: $\mathbb{E}[\hat{\theta}] = \theta$, $\mathrm{var}(\hat{\theta}) = \sigma^2$.

Estimator 2: take the average, $\tilde{\theta} = \frac{1}{N} \sum_{n=1}^N x_n$. Mean and variance: $\mathbb{E}[\tilde{\theta}] = \theta$, $\mathrm{var}(\tilde{\theta}) = \frac{\sigma^2}{N}$.

Both estimators are unbiased, but $\mathrm{var}(\tilde{\theta}) \le \mathrm{var}(\hat{\theta})$. It turns out that $\tilde{\theta} = \theta_{ML}$.
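A short simulation (my own sketch; the particular values of $\theta$, $\sigma$, and $N$ are arbitrary) confirms that both estimators are unbiased while the averaging estimator has variance $\sigma^2/N$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, N, trials = 1.5, 2.0, 25, 20000

# Simulate many data sets and apply both estimators to each one.
x = theta + sigma * rng.standard_normal((trials, N))
first_sample = x[:, 0]           # estimator 1: first observation only
sample_mean = x.mean(axis=1)     # estimator 2: average of all N observations

print(first_sample.mean(), first_sample.var())   # ~ theta, ~ sigma^2
print(sample_mean.mean(), sample_mean.var())     # ~ theta, ~ sigma^2 / N
```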

In this example, the MLE is determined by solving
$$\theta_{ML} = \arg\max_\theta \sum_{n=1}^N \log p(x_n \mid \theta) = \arg\max_\theta \sum_{n=1}^N \log \mathcal{N}(x_n \mid \theta, \sigma^2),$$
where
$$\mathcal{N}(x_n \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \theta)^2 \right\}.$$
Solving $\frac{\partial}{\partial\theta} \left[ \sum_{n=1}^N \log \mathcal{N}(x_n \mid \theta, \sigma^2) \right] = 0$ for $\theta$ yields
$$\theta_{ML} = \frac{1}{N} \sum_{n=1}^N x_n.$$

Maximum A Posteriori (MAP) Estimation

MAP Estimation

$$\theta_{MAP} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \arg\max_\theta p(x \mid \theta)\, p(\theta) = \arg\max_\theta \left[ \log p(x \mid \theta) + \log p(\theta) \right].$$

The prior $\log p(\theta)$ plays a critical role in protecting against overfitting. If our belief says the function should be smooth, then the prior acts like a regularizer (which penalizes overly complex models).

An Example of MAP Estimation: Univariate Normal

Assume $x \sim \mathcal{N}(\mu, 1)$. Use a prior $p(\mu) = \mathcal{N}(0, \alpha^2)$. Then we have
$$\mathcal{L} = \log p(x \mid \mu) + \log p(\mu) = -\frac{1}{2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{1}{2\alpha^2} \mu^2 + \text{const}.$$
It follows from $\frac{\partial \mathcal{L}}{\partial \mu} = 0$ that
$$\mu_{MAP} = \frac{1}{N + \frac{1}{\alpha^2}} \sum_{n=1}^N x_n.$$
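A minimal sketch of this estimator (my own; the data and the prior variances $\alpha^2$ are illustrative choices) shows how the strength of the prior moves $\mu_{MAP}$ between 0 and $\mu_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
x = rng.normal(0.8, 1.0, size=N)          # x ~ N(mu, 1), as on the slide

mu_ml = x.mean()
for alpha2 in (0.01, 1.0, 100.0):         # prior variance alpha^2
    mu_map = x.sum() / (N + 1.0 / alpha2)
    print(alpha2, mu_ml, mu_map)
# A tight prior (small alpha^2) pulls the estimate towards 0;
# a broad prior makes mu_MAP approach mu_ML.
```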

For $N \gg \frac{1}{\alpha^2}$ (the influence of the prior is negligible), we have
$$\mu_{MAP} \approx \mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n.$$
For a very strong belief in the prior, i.e., $\frac{1}{\alpha^2} \gg N$, we have $\mu_{MAP} \approx 0$. If few data points are available, the prior biases the estimate towards the prior expected value.

Bayesian Inference

Bayesian Inference

A Bayesian considers $\theta$ as a random variable and wants to know how prior knowledge of $\theta$ changes in the light of new observations $d$, where $d = (x, y)$ in the case of supervised learning and $d = x$ in the case of unsupervised learning.

We need to calculate the posterior distribution
$$p(\theta \mid d) = \frac{\overbrace{p(d \mid \theta)}^{\text{likelihood}}\; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\int p(d \mid \theta)\, p(\theta)\, d\theta}_{\text{marginal likelihood}}}.$$
In general, the marginal likelihood (or evidence) is hard to compute.
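One simple, though often inefficient, way to approximate the evidence is plain Monte Carlo over the prior. The sketch below (my own illustration; the data, prior, and known noise variance are assumptions) estimates $p(\mathcal{D}) = \int p(\mathcal{D} \mid \mu)\, p(\mu)\, d\mu$ for a Gaussian-mean model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.8, 1.0, size=10)        # observed data, known sigma^2 = 1

# Evidence p(D) = E_{p(mu)}[ p(D | mu) ], with prior mu ~ N(0, 1).
mu_samples = rng.normal(0.0, 1.0, size=100000)
log_lik = (-0.5 * np.log(2 * np.pi)
           - 0.5 * (x[None, :] - mu_samples[:, None])**2).sum(axis=1)
evidence = np.exp(log_lik).mean()        # naive Monte Carlo average
print(evidence)
```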

Bayesian Inference: Predictive Distribution

The unsupervised Bayesian would want to calculate the probability of a new data point $x$, given the data $\mathcal{D}$:
$$p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$
The supervised Bayesian would want to calculate the probability over target values, given an input data point and the previous data points:
$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$
The Bayesian approach performs a weighted average over all values of $\theta$, instead of choosing a specific value for $\theta$.
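For a one-dimensional parameter this weighted average can be carried out numerically on a grid. The sketch below (my own illustration; the prior, data, and grid are assumptions) evaluates the predictive density $p(x^\ast \mid \mathcal{D})$ for the Gaussian-mean model:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.8, 1.0, size=10)          # observed data, known sigma^2 = 1

# Discretize theta and form p(theta | D) ∝ p(D | theta) p(theta) on the grid
# (prior: theta ~ N(0, 1); constants cancel in the normalization).
theta = np.linspace(-4.0, 4.0, 4001)
log_prior = -0.5 * theta**2
log_lik = (-0.5 * (x[None, :] - theta[:, None])**2).sum(axis=1)
post = np.exp(log_prior + log_lik)
post /= post.sum()

# Predictive density at a new point x*: weighted average of p(x* | theta).
x_star = 1.0
pred = np.sum(post * np.exp(-0.5 * (x_star - theta)**2) / np.sqrt(2.0 * np.pi))
print(pred)
```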

Bayesian Inference: Posterior Calculation

The posterior distribution of $\theta$ is updated using Bayes' rule, where the likelihood is given by $p(\mathcal{D} \mid \theta) = \prod_{n=1}^N p(x_n \mid \theta)$:
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'} = \frac{p(\theta) \prod_{n=1}^N p(x_n \mid \theta)}{\int p(\theta') \prod_{n=1}^N p(x_n \mid \theta')\, d\theta'}.$$
Conjugate prior: a prior $p(\theta)$ which gives rise to a posterior $p(\theta \mid \mathcal{D})$ having the same functional form, given $p(\mathcal{D} \mid \theta)$.

Bayesian Inference: A Few Remarks

We never actually estimate a single value of $\theta$. Instead, we determine the posterior density over all values of $\theta$ and use it to integrate over all possible values of $\theta$.

Approximate inference: Laplace approximation, variational Bayes, Markov chain Monte Carlo (MCMC).
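As an illustration of the MCMC option above, here is a minimal random-walk Metropolis sketch of my own (not from the slides) for the Gaussian-mean model with a $\mathcal{N}(0, 1)$ prior; the samples approximate the posterior $p(\mu \mid \mathcal{D})$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.8, 1.0, size=10)              # data, known sigma^2 = 1

def log_post(mu):
    # log p(mu | D) up to a constant: Gaussian likelihood + N(0, 1) prior
    return -0.5 * np.sum((x - mu)**2) - 0.5 * mu**2

samples, mu = [], 0.0
for _ in range(20000):
    prop = mu + 0.5 * rng.standard_normal()    # random-walk proposal
    if np.log(rng.random()) < log_post(prop) - log_post(mu):
        mu = prop                              # accept; otherwise keep current mu
    samples.append(mu)

samples = np.array(samples[5000:])             # discard burn-in
print(samples.mean(), samples.var())           # ≈ posterior mean and variance
```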

Bayesian Inference: An Example

Suppose $x \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is assumed to be known. Find the mean $\mu$, given a set of data points $\{x_n\}$. Assume the prior for $\mu$ to be Gaussian,
$$p_0(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{ -\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2 \right\}.$$
Observing a set of $N$ data points, we calculate the posterior
$$p(\mu \mid \mathcal{D}) = \frac{p_0(\mu)}{p(\mathcal{D})} \prod_{n=1}^N p(x_n \mid \mu), \quad \text{where} \quad p(x_n \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(x_n - \mu)^2 \right\}.$$

After tedious calculations, we have
$$p(\mu \mid \mathcal{D}) = \frac{1}{\sqrt{2\pi\tilde{\sigma}^2}} \exp\left\{ -\frac{1}{2\tilde{\sigma}^2}(\mu - \tilde{\mu})^2 \right\},$$
where
$$\tilde{\mu} = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \tilde{\sigma}^2 = \frac{\sigma_0^2 \sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad \frac{1}{\tilde{\sigma}^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \ \text{(precision)}.$$
When $N = 0$, $\tilde{\mu}$ reduces to the prior mean and $\tilde{\sigma}^2$ to the prior variance, as expected. As $N \to \infty$, the posterior mean is given by the ML solution and the posterior variance goes to 0, so the posterior distribution becomes infinitely peaked around the ML solution.
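These update formulas are easy to implement directly. The sketch below (my own; the prior variance $\sigma_0^2$ is an assumed value, matching the setting of the figure on the next slide) shows the posterior mean and variance as $N$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma2 = 0.8, 0.1        # data-generating mean and known variance
mu0, sigma02 = 0.0, 0.1           # prior mean and (assumed) prior variance
x = rng.normal(mu_true, np.sqrt(sigma2), size=10)

for N in (0, 1, 2, 10):
    mu_ml = x[:N].mean() if N > 0 else 0.0   # unused when N = 0
    mu_post = (N * sigma02 * mu_ml + sigma2 * mu0) / (N * sigma02 + sigma2)
    var_post = sigma02 * sigma2 / (N * sigma02 + sigma2)
    print(N, mu_post, var_post)
# N = 0 returns the prior; as N grows the posterior mean moves towards the
# ML estimate and the posterior variance shrinks.
```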

Figure: The data points are generated from a Gaussian of mean 0.8 and variance 0.1, and the prior is chosen to have mean 0. The posterior distribution is shown for increasing numbers N of data points.

Kernel Density Estimation

Kernel Density Estimation: Nonparametric Approach

Place a kernel on each data point and compute an average to estimate the probability distribution of $x$, given a set of data points $\{x_1, x_2, \ldots, x_N\}$:
$$\hat{p}(x) = \frac{1}{N} \sum_{n=1}^N k(x, x_n, \lambda_x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{Z_x} \exp\left\{ -\lambda_x \|x - x_n\|^2 \right\}.$$
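A minimal one-dimensional sketch of this estimator (my own; the data, the bandwidth $\lambda_x$, and the normalizer $Z_x = \sqrt{\pi/\lambda_x}$ for a Gaussian kernel are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])

def kde(x, data, lam):
    # p_hat(x) = (1/N) sum_n (1/Z) exp(-lam * (x - x_n)^2), where
    # Z = sqrt(pi / lam) makes each Gaussian kernel integrate to one.
    Z = np.sqrt(np.pi / lam)
    return np.mean(np.exp(-lam * (x[:, None] - data[None, :])**2), axis=1) / Z

grid = np.linspace(-5.0, 5.0, 501)
density = kde(grid, data, lam=2.0)
print(density.sum() * (grid[1] - grid[0]))   # ≈ 1: the estimate integrates to one
```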