Density Estimation
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin
Supervised vs Unsupervised Learning

The goal of learning is to train probabilistic models from observed data: $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^N$ for supervised learning or $\mathcal{D} = \{x_n\}_{n=1}^N$ for unsupervised learning.

Supervised learning
- Assume a parameterized model $p(y \mid x, \theta) = \int p(y, z \mid x, \theta)\, dz$.
- Use $\mathcal{D} = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ to learn a mapping from input to output under a probabilistic model.
- Examples: linear regression, logistic regression, mixture of experts.

Unsupervised learning
- Assume a parameterized model $p(x \mid \theta) = \int p(x, z \mid \theta)\, dz$.
- Fit a probabilistic model to $\mathcal{D} = \{x_1, \ldots, x_N\}$.
- Examples: latent class models (e.g., MoG) and latent feature models (e.g., PPCA).
Why Latent Variable Models?: An Example

A general Gaussian $p(x_1, \ldots, x_D)$ requires $D$ independent parameters for the mean vector and $\frac{D(D+1)}{2}$ independent parameters for the covariance matrix, $\frac{D(D+3)}{2}$ parameters in total. The number of independent parameters grows with $D^2$.

Marginal independence assumes
\[ p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i), \]
which requires just $2D$ free parameters.

Conditional independence assumes
\[ p(x_1, \ldots, x_D \mid z_1, \ldots, z_K) = \prod_{i=1}^{D} p(x_i \mid z_1, \ldots, z_K). \]
In the case of linear models, the number of independent parameters grows with $D$ (more precisely, $DK + 2D$ are needed).
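As a quick sanity check of these counts, a minimal Python sketch ($D$ and $K$ are arbitrary example values, not from the slides):

```python
# Parameter counts for modeling a D-dimensional density, as tallied above.
D, K = 10, 3  # arbitrary example: 10 observed dimensions, 3 latent factors

full_gaussian = D + D * (D + 1) // 2   # mean vector + symmetric covariance
fully_factorized = 2 * D               # per-dimension mean and variance
linear_latent = D * K + 2 * D          # loading matrix + per-dim mean and noise variance

print(full_gaussian)     # 65 -> grows as O(D^2)
print(fully_factorized)  # 20 -> grows as O(D)
print(linear_latent)     # 50 -> grows as O(D) for fixed K
```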
Density Estimation

Density estimation is the problem of modeling a probability density function $p(x)$, given a finite number of data points $\{x_n\}_{n=1}^N$ drawn from that density.

Approaches to density estimation
- Parametric estimation: assumes a specific functional form for the density model; a number of parameters are optimized by fitting the model to the data set. Examples: maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, Bayesian inference.
- Nonparametric estimation: no specific functional form is assumed, allowing the form of the density to be determined entirely by the data. Examples: Parzen windows and Bayesian nonparametrics.
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)

The likelihood function is nothing but a parameterized density $p(x \mid \theta)$ that is used to model a set of data $X = \{x_1, \ldots, x_N\}$, assumed to be drawn independently from $p(x \mid \theta)$:
\[ p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta). \]
Maximum likelihood seeks the optimal values of the parameters by maximizing the likelihood function computed from the training data. The log-likelihood is given by
\[ \mathcal{L}(\theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta), \]
and ML finds
\[ \theta_{ML} = \arg\max_{\theta} \mathcal{L}(\theta). \]
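A minimal numerical illustration of MLE, assuming a univariate Gaussian model with known variance and simulated data (all settings here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)  # data drawn i.i.d. from p(x | theta)

def log_likelihood(theta, x, sigma2=1.0):
    # L(theta) = sum_n log N(x_n | theta, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - theta) ** 2 / (2 * sigma2))

thetas = np.linspace(0.0, 4.0, 1001)
theta_ml = thetas[np.argmax([log_likelihood(t, x) for t in thetas])]
print(theta_ml, x.mean())  # grid maximizer matches the sample mean (up to grid spacing)
```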
MLE: Kullback Matching Perspective

Suppose we are given a set of data $X = \{x_1, \ldots, x_N\}$ drawn from an underlying distribution $p(x)$.
- Empirical distribution: $\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n)$.
- Model: $p(x \mid \theta)$.

Fit the model $p(x \mid \theta)$ to the data $X$:
\[ \arg\min_{\theta} D_{KL}[\hat{p}(x) \,\|\, p(x \mid \theta)] = \arg\min_{\theta} \int \hat{p}(x) \log \frac{\hat{p}(x)}{p(x \mid \theta)}\, dx = \arg\min_{\theta} \left[ -H(\hat{p}) - \int \hat{p}(x) \log p(x \mid \theta)\, dx \right], \]
leading to
\[ \arg\max_{\theta} \mathbb{E}_{\hat{p}} \log p(x \mid \theta) = \arg\max_{\theta} \frac{1}{N} \int \sum_{n=1}^{N} \delta(x - x_n) \log p(x \mid \theta)\, dx = \arg\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p(x_n \mid \theta). \]
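The equivalence can be checked numerically by discretizing: bin the data into a histogram standing in for $\hat{p}$, and compare the binned KL divergence with the average log-likelihood over a grid of $\theta$. A sketch under those assumptions (the histogram and the grid are illustration devices, not part of the derivation above):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(1.5, 1.0, size=5000)  # model family: N(theta, 1)

# Discretized empirical distribution (a histogram stands in for the delta mixture).
counts, edges = np.histogram(x, bins=50)
p_hat = counts / counts.sum()

thetas = np.linspace(0.0, 3.0, 601)
kl, avg_ll = [], []
for t in thetas:
    q = norm.cdf(edges[1:], loc=t) - norm.cdf(edges[:-1], loc=t)  # model bin mass
    mask = p_hat > 0
    kl.append(np.sum(p_hat[mask] * np.log(p_hat[mask] / q[mask])))
    avg_ll.append(np.mean(norm.logpdf(x, loc=t)))

# The KL minimizer and the average log-likelihood maximizer nearly coincide.
print(thetas[np.argmin(kl)], thetas[np.argmax(avg_ll)])  # both close to 1.5
```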
Estimation

- Estimator: a statistic whose calculated value is used to estimate a model parameter $\theta$.
- Estimate: a particular realization $\hat{\theta}$ of an estimator.

Good estimators are:
- Consistent: $\lim_{N \to \infty} P(|\hat{\theta} - \theta| > \epsilon) = 0$.
- Unbiased: $\mathbb{E}_{p(x \mid \theta)}[\hat{\theta}] = \theta$.
Parameter Estimation: An Example

Suppose we wish to estimate $\theta$ from its noisy observations $x_n = \theta + \epsilon_n$ for $n = 1, \ldots, N$, where $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$.

- Estimator 1 (take the first sample only): $\hat{\theta} = x_1$, with mean and variance
\[ \mathbb{E}[\hat{\theta}] = \theta, \qquad \mathrm{var}(\hat{\theta}) = \sigma^2. \]
- Estimator 2 (take the average): $\bar{\theta} = \frac{1}{N} \sum_{n=1}^{N} x_n$, with mean and variance
\[ \mathbb{E}[\bar{\theta}] = \theta, \qquad \mathrm{var}(\bar{\theta}) = \frac{\sigma^2}{N}. \]

Both estimators are unbiased, but $\mathrm{var}(\bar{\theta}) \leq \mathrm{var}(\hat{\theta})$. It turns out that $\bar{\theta} = \theta_{ML}$.
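A Monte Carlo check of the two estimators' bias and variance (the values of $\theta$, $\sigma$, and $N$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, N, trials = 1.0, 2.0, 25, 100_000

# x_n = theta + eps_n with eps_n ~ N(0, sigma^2); draw many datasets at once.
x = theta + sigma * rng.standard_normal((trials, N))

first_sample = x[:, 0]        # estimator 1: keep only x_1
sample_mean = x.mean(axis=1)  # estimator 2: average all N samples

print(first_sample.mean(), first_sample.var())  # ~ theta, ~ sigma^2     (1.0, 4.0)
print(sample_mean.mean(), sample_mean.var())    # ~ theta, ~ sigma^2 / N (1.0, 0.16)
```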
In this example, the MLE is determined by solving
\[ \theta_{ML} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(x_n \mid \theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log \mathcal{N}(x_n \mid \theta, \sigma^2), \]
where
\[ \mathcal{N}(x_n \mid \theta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (x_n - \theta)^2 \right\}. \]
Solving $\frac{\partial}{\partial \theta} \left[ \sum_{n=1}^{N} \log \mathcal{N}(x_n \mid \theta, \sigma^2) \right] = 0$ for $\theta$ yields
\[ \theta_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n. \]
Maximum A Posteriori (MAP) Estimation
MAP Estimation

\[ \theta_{MAP} = \arg\max_{\theta} p(\theta \mid x) = \arg\max_{\theta} \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \arg\max_{\theta} p(x \mid \theta)\, p(\theta) = \arg\max_{\theta} \left[ \log p(x \mid \theta) + \log p(\theta) \right]. \]

The prior $\log p(\theta)$ plays a critical role in protecting against overfitting. If our belief says the function should be smooth, then the prior acts like a regularizer, penalizing overly complex models.
An Example of MAP Estimation: Univariate Normal

Assume $x \sim \mathcal{N}(\mu, 1)$ and use a prior $p(\mu) = \mathcal{N}(0, \alpha^2)$. Then we have
\[ \mathcal{L} = \log p(x \mid \theta) + \log p(\theta) = -\frac{1}{2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\alpha^2} \mu^2 + \text{const}. \]
It follows from $\frac{\partial \mathcal{L}}{\partial \mu} = 0$ that
\[ \mu_{MAP} = \frac{1}{N + \frac{1}{\alpha^2}} \sum_{n=1}^{N} x_n. \]
For $N \gg \frac{1}{\alpha^2}$ (the influence of the prior is negligible), we have
\[ \mu_{MAP} \approx \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n. \]
For very strong belief in the prior, i.e., $\alpha^2 \ll \frac{1}{N}$, we have $\mu_{MAP} \approx 0$.

If few data points are available, the prior biases the estimate towards the prior expected value.
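A short sketch of the closed-form $\mu_{MAP}$ and its two limiting regimes (the simulated data and the values of $\alpha^2$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0.8, scale=1.0, size=20)  # x_n ~ N(mu, 1) with true mu = 0.8
mu_ml = x.mean()

def mu_map(x, alpha2):
    # mu_MAP = (1 / (N + 1/alpha^2)) * sum_n x_n
    return x.sum() / (len(x) + 1.0 / alpha2)

print(mu_ml)                     # ML estimate
print(mu_map(x, alpha2=100.0))   # weak prior (alpha^2 large): close to mu_ml
print(mu_map(x, alpha2=1e-3))    # strong prior around 0: shrunk toward 0
```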
Bayesian Inference
Bayesian Inference

A Bayesian considers $\theta$ as a random variable and wants to know how prior knowledge of $\theta$ changes in light of new observations $d$, where $d = (x, y)$ in the case of supervised learning and $d = x$ in the case of unsupervised learning. This requires calculating the posterior distribution
\[ p(\theta \mid d) = \frac{\overbrace{p(d \mid \theta)}^{\text{likelihood}} \, \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\int p(d \mid \theta)\, p(\theta)\, d\theta}_{\text{marginal likelihood}}}. \]
In general, the marginal likelihood (or evidence) is hard to compute.
Bayesian Inference: Predictive Distribution

The unsupervised Bayesian would want to calculate the probability of a new data point $x$, given the data $\mathcal{D}$:
\[ p(x \mid \mathcal{D}) = \int p(x, \theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta, \mathcal{D})\, p(\theta \mid \mathcal{D})\, d\theta = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta. \]
The supervised Bayesian would want to calculate the probability over target values, given an input data point and the previous data points:
\[ p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta. \]
The Bayesian approach performs a weighted average over all values of $\theta$, instead of choosing a specific value for $\theta$.
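The predictive integral can be approximated by averaging the likelihood over posterior samples. A sketch for the Gaussian-mean case treated later in this lecture, where the exact predictive $\mathcal{N}(x^* \mid \tilde{\mu}, \tilde{\sigma}^2 + \sigma^2)$ is available as a reference (all numeric values here are assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Suppose the posterior over the mean is p(mu | D) = N(mu_t, s2_t), as derived
# for the Gaussian example later in this lecture; sigma2 is the known noise.
mu_t, s2_t, sigma2 = 0.7, 0.05, 1.0

x_star = 1.2
mu_samples = rng.normal(mu_t, np.sqrt(s2_t), size=200_000)  # theta ~ p(theta | D)

# Monte Carlo:  p(x* | D) ~= (1/S) sum_s p(x* | mu_s)
p_mc = norm.pdf(x_star, loc=mu_samples, scale=np.sqrt(sigma2)).mean()
# Exact predictive for this conjugate case: N(x* | mu_t, s2_t + sigma2)
p_exact = norm.pdf(x_star, loc=mu_t, scale=np.sqrt(s2_t + sigma2))
print(p_mc, p_exact)  # the two agree up to Monte Carlo error
```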
Bayesian Inference: Posterior Calculation

The posterior distribution of $\theta$ is updated using Bayes' rule, where the likelihood is given by $p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$:
\[ p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} = \frac{p(\theta)\, p(\mathcal{D} \mid \theta)}{\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'} = \frac{p(\theta) \prod_{n=1}^{N} p(x_n \mid \theta)}{\int p(\theta') \prod_{n=1}^{N} p(x_n \mid \theta')\, d\theta'}. \]
Conjugate prior: a prior $p(\theta)$ which gives rise to a posterior $p(\theta \mid \mathcal{D})$ having the same functional form, given $p(\mathcal{D} \mid \theta)$.
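A standard illustration of conjugacy is a Beta prior paired with a Bernoulli likelihood, where the posterior update amounts to adding counts. This example is not from the slides; it is sketched here only for concreteness:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.3, size=50)  # Bernoulli(theta) data with true theta = 0.3

# Beta(a, b) prior; the Beta family is conjugate to the Bernoulli likelihood,
# so the posterior is again Beta with updated counts.
a0, b0 = 1.0, 1.0                  # uniform prior
a_post = a0 + x.sum()              # a0 + number of ones
b_post = b0 + len(x) - x.sum()     # b0 + number of zeros

print(a_post / (a_post + b_post))  # posterior mean, close to 0.3
```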
Bayesian Inference: A Few Remarks

We never actually estimate a single value of $\theta$. Instead, we determine the posterior density over all values of $\theta$ and use it to integrate over all possible values of $\theta$.

Approximate inference:
- Laplace approximation
- Variational Bayes
- Markov chain Monte Carlo (MCMC)
Bayesian Inference: An Example

Suppose $x \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma^2$ is assumed to be known. We wish to find the mean $\mu$, given a set of data points $\{x_n\}$. Assume the prior for $\mu$ is Gaussian,
\[ p_0(\mu) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left\{ -\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \right\}. \]
Observing a set of $N$ data points, we calculate the posterior
\[ p(\mu \mid \mathcal{D}) = \frac{p_0(\mu)}{p(\mathcal{D})} \prod_{n=1}^{N} p(x_n \mid \mu), \quad \text{where} \quad p(x_n \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (x_n - \mu)^2 \right\}. \]
After tedious calculations, we have
\[ p(\mu \mid \mathcal{D}) = \frac{1}{\sqrt{2\pi\tilde{\sigma}^2}} \exp\left\{ -\frac{1}{2\tilde{\sigma}^2} (\mu - \tilde{\mu})^2 \right\}, \]
where
\[ \tilde{\mu} = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \mu_{ML} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \tilde{\sigma}^2 = \frac{\sigma_0^2 \sigma^2}{N\sigma_0^2 + \sigma^2}, \qquad \frac{1}{\tilde{\sigma}^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \ \text{(precision)}. \]
When $N = 0$, $\tilde{\mu}$ reduces to the prior mean and $\tilde{\sigma}^2$ to the prior variance, as expected. As $N \to \infty$, the posterior mean is given by the ML solution and the posterior variance goes to 0, so that the posterior distribution becomes infinitely peaked around the ML solution.
[Figure] The data points are generated from a Gaussian of mean 0.8 and variance 0.1, and the prior is chosen to have mean 0. The posterior distribution is shown for increasing numbers N of data points.
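A sketch reproducing the figure's setup with the update equations from the previous slide (the true mean 0.8, variance 0.1, and prior mean 0 are from the caption; the prior variance $\sigma_0^2$ is an assumed value):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2 = 0.1          # known data variance
mu0, s02 = 0.0, 0.1   # prior N(mu0, s02); s02 is an assumed value
x = rng.normal(0.8, np.sqrt(sigma2), size=10)

for N in [0, 1, 2, 10]:
    xs = x[:N]
    mu_ml = xs.mean() if N > 0 else 0.0  # unused weight when N = 0
    mu_t = (N * s02 * mu_ml + sigma2 * mu0) / (N * s02 + sigma2)
    s2_t = s02 * sigma2 / (N * s02 + sigma2)
    print(N, round(mu_t, 3), round(s2_t, 4))  # posterior sharpens toward 0.8
```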
Kernel Density Estimation
Kernel Density Estimation: Nonparametric Approach

Place a kernel on each data point and compute an average to estimate the probability distribution of $x$, given a set of data points $\{x_1, x_2, \ldots, x_N\}$:
\[ \hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} k(x, x_n; \lambda_x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{Z_x} \exp\left\{ -\lambda_x \|x - x_n\|^2 \right\}. \]
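A minimal implementation of this estimator with the Gaussian kernel above, where $Z_x = (\pi/\lambda_x)^{d/2}$ makes each kernel integrate to one (the bandwidth $\lambda_x$ and the test data are arbitrary choices):

```python
import numpy as np

def kde(x_query, x_data, lam=2.0):
    """Kernel density estimate p_hat(x) = (1/N) sum_n (1/Z) exp(-lam ||x - x_n||^2)."""
    x_query = np.atleast_2d(x_query)  # (Q, d)
    x_data = np.atleast_2d(x_data)    # (N, d)
    d = x_data.shape[1]
    Z = (np.pi / lam) ** (d / 2)      # normalizer of the Gaussian kernel
    sq_dists = ((x_query[:, None, :] - x_data[None, :, :]) ** 2).sum(-1)  # (Q, N)
    return np.exp(-lam * sq_dists).mean(axis=1) / Z

rng = np.random.default_rng(7)
data = rng.normal(0.0, 1.0, size=(500, 1))
grid = np.linspace(-3, 3, 7)[:, None]
print(kde(grid, data))  # smooth estimate of the standard normal density
```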