Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino, 2017.

Parameters and hyper-parameters

1. All the described methods depend on some choice of hyper-parameters.
2. For instance, recall $\lambda$ (the bandwidth of the kernel/basis) and $\sigma_e$ (the standard deviation of the noise):

$$\psi(x - x_n) = \exp\left(-\frac{\|x - x_n\|^2}{2\lambda^2}\right), \qquad y = f(x) + e,$$

where $e \sim \mathcal{N}(e; 0, \sigma_e^2)$.
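To make the role of the two hyper-parameters concrete, here is a minimal sketch for scalar inputs. The function names `psi` and `noisy_observations`, the choice $f = \sin$ and the input grid are illustrative assumptions, not part of the slides.

```python
import math
import random

def psi(x, x_n, lam):
    """Gaussian kernel/basis centred at x_n with bandwidth lam (scalar inputs assumed)."""
    return math.exp(-((x - x_n) ** 2) / (2.0 * lam ** 2))

def noisy_observations(f, xs, sigma_e, rng):
    """Return y = f(x) + e, with e ~ N(0, sigma_e^2), for every x in xs."""
    return [f(x) + rng.gauss(0.0, sigma_e) for x in xs]

rng = random.Random(0)
xs = [i / 10.0 for i in range(-20, 21)]          # hypothetical inputs in [-2, 2]
ys = noisy_observations(math.sin, xs, sigma_e=0.1, rng=rng)
```

A larger `lam` makes `psi` decay more slowly with the distance between `x` and `x_n`; a larger `sigma_e` makes the observations scatter more around `f(x)`.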

Cross Validation (CV)

Split the dataset $D = \{x_i, y_i\}_{i=1}^{N}$ into two sets, $D_{train} = \{x_i^{(TR)}, y_i^{(TR)}\}_{i=1}^{N_{TR}}$ and $D_{test} = \{x_i^{(V)}, y_i^{(V)}\}_{i=1}^{N_V}$ (or $D_{validation}$), so that $D = D_{train} \cup D_{test}$. Then:

1. Given some values of the hyper-parameters $\theta = [\lambda, \sigma_e]$, compute the estimator $\hat{f}(x|\theta)$ using $D_{train}$.
2. Validate how good the solution $\hat{f}(x|\theta)$ is using $D_{test}$.

For instance, we can try to minimize the MSE in prediction:

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{n=1}^{N_V} \left(y_n^{(V)} - \hat{f}(x_n^{(V)}|\theta)\right)^2.$$
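The two steps can be sketched as a grid search. This is only an illustration under stated assumptions: the estimator is a simple kernel-weighted average (a stand-in, since the slide does not fix $\hat{f}$ here), so only $\lambda$ is tuned; the data, the 80/40 split and the grid are hypothetical.

```python
import math
import random

def kernel_smoother(x, train, lam):
    """Kernel-weighted average of the training outputs (assumed estimator f_hat(x | lam))."""
    w = [math.exp(-((x - xi) ** 2) / (2.0 * lam ** 2)) for xi, _ in train]
    total = sum(w)
    if total == 0.0:  # every kernel underflowed: fall back to the plain mean
        return sum(yi for _, yi in train) / len(train)
    return sum(wi * yi for wi, (_, yi) in zip(w, train)) / total

def validation_sse(train, valid, lam):
    """Sum of squared prediction errors on the validation set."""
    return sum((yv - kernel_smoother(xv, train, lam)) ** 2 for xv, yv in valid)

rng = random.Random(1)
data = []
for _ in range(120):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.1)))
train, valid = data[:80], data[80:]              # D_train and D_test

lam_grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]       # candidate values in Theta
errors = {lam: validation_sse(train, valid, lam) for lam in lam_grid}
lam_hat = min(errors, key=errors.get)            # arg min of the validation error
```

With an estimator that also depends on $\sigma_e$, the grid would run over pairs $(\lambda, \sigma_e)$ in the same way.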

Cross Validation (CV)

Note that the previous procedure is equivalent to

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \exp\left(-\sum_{n=1}^{N_V} \left(y_n^{(V)} - \hat{f}(x_n^{(V)}|\theta)\right)^2\right).$$

However, we can also minimize or maximize other cost or pay-off functions (not only the error in prediction), or use other estimators of $\theta$, considering the mean or the median instead of the maximum.

Other estimators for CV

Denote by

$$p(y^{(V)}|\theta) \propto \exp\left(-\sum_{n=1}^{N_V} \left(y_n^{(V)} - \hat{f}(x_n^{(V)}|\theta)\right)^2\right)$$

the CV error-in-prediction likelihood, by $p(\theta)$ the prior over the hyper-parameters $\theta$, and by $p(\theta|y^{(V)})$ the corresponding posterior. Then we can also define other estimators, for instance the Minimum Mean Square Error (MMSE) estimator,

$$\hat{\theta}_{MMSE} = \int_{\Theta} \theta \, p(\theta|y^{(V)}) \, d\theta, \qquad (1)$$

instead of using the maximum $\hat{\theta}_{MAP}$.
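On a finite grid of candidate values, the MAP and MMSE estimators can be approximated as follows. This is a rough sketch: the grid and the validation costs are made-up numbers for illustration, and the prior is assumed uniform over the grid points.

```python
import math

def grid_posterior(costs):
    """Normalised weights proportional to exp(-cost), i.e. a uniform prior on the grid."""
    c_min = min(costs)                            # subtract the minimum for numerical stability
    w = [math.exp(-(c - c_min)) for c in costs]
    z = sum(w)
    return [wi / z for wi in w]

thetas = [0.1, 0.2, 0.5, 1.0, 2.0]                # hypothetical grid of bandwidths
costs = [1.9, 0.6, 0.5, 1.4, 6.0]                 # hypothetical validation sums of squares

post = grid_posterior(costs)
theta_map = thetas[post.index(max(post))]         # maximum of the posterior
theta_mmse = sum(t * p for t, p in zip(thetas, post))   # posterior mean, discrete version of (1)
```

The two estimators generally differ: the MAP picks the single best grid point, whereas the MMSE averages all grid points weighted by their posterior mass.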

K-fold CV

Split the dataset $D = \{x_i, y_i\}_{i=1}^{N}$ into $K$ sets $D^{(k)}$. For $k = 1, \ldots, K$:

1. Given some hyper-parameters $\theta = [\lambda, \sigma_e]$, and using $D^{(k)}$ as training set, compute the estimator $\hat{f}_k(x|\theta)$.
2. Obtain $\hat{\theta}^{(k)}$ considering the remaining $K - 1$ sets as validation sets.

Finally, compute

$$\hat{\theta} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}.$$
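A minimal sketch of this loop, following the slide literally (fold $k$ is the training set and the other $K - 1$ folds are used for validation). The kernel-weighted-average estimator, the data and the grid are illustrative assumptions; only $\lambda$ is tuned because this stand-in estimator does not depend on $\sigma_e$.

```python
import math
import random

def kernel_smoother(x, train, lam):
    """Kernel-weighted average of the training outputs (assumed estimator)."""
    w = [math.exp(-((x - xi) ** 2) / (2.0 * lam ** 2)) for xi, _ in train]
    total = sum(w)
    if total == 0.0:  # every kernel underflowed: fall back to the plain mean
        return sum(yi for _, yi in train) / len(train)
    return sum(wi * yi for wi, (_, yi) in zip(w, train)) / total

def best_lambda(train, valid, lam_grid):
    """Grid value minimising the squared prediction error on valid."""
    def sse(lam):
        return sum((yv - kernel_smoother(xv, train, lam)) ** 2 for xv, yv in valid)
    return min(lam_grid, key=sse)

def k_fold_estimate(data, K, lam_grid):
    """Average of the per-fold estimates theta^(k)."""
    folds = [data[k::K] for k in range(K)]
    theta_k = []
    for k in range(K):
        train = folds[k]                          # fold k is the training set, as on the slide
        valid = [pair for j in range(K) if j != k for pair in folds[j]]
        theta_k.append(best_lambda(train, valid, lam_grid))
    return sum(theta_k) / K, theta_k

rng = random.Random(2)
data = []
for _ in range(120):
    x = rng.uniform(-3.0, 3.0)
    data.append((x, math.sin(x) + rng.gauss(0.0, 0.1)))
lam_grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
lam_hat, per_fold = k_fold_estimate(data, K=4, lam_grid=lam_grid)
```

Swapping the roles of `train` and `valid` inside the loop gives the other common convention, in which each fold is held out once and the remaining folds are used for training.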

Leave-One-Out and All-in

Leave-One-Out: in this case we consider exactly $K = N$ sets, each one formed by $N - 1$ data points, with only one left out.

All-in: all the data are used for training. This is not CV ($K = 1$ with $N$ data). Let us look at the marginal likelihood approach to clarify this point.

Alternative to the Error in Prediction: Marginal Likelihood

Given the studied models, the marginal likelihood has the form (or a similar one)

$$p(y|\theta) = \mathcal{N}(y \mid 0, \Psi + \sigma_e^2 I_N),$$

where $\lambda$ affects the construction of $\Psi$ (recall that $\theta = [\lambda, \sigma_e]$). We can try to maximize the marginal likelihood,

$$\hat{\theta} = \arg\max_{\theta \in \Theta} p(y|\theta).$$

It can be used with CV (inside it) or without CV ("All-in").

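Marginal Likelihood

Recall that

$$\log p(y|\theta) = -\frac{1}{2} y^{\top} (K + \sigma^2 I_N)^{-1} y - \frac{1}{2} \log\left[\det(K + \sigma^2 I_N)\right] + \text{const},$$

where $K$ plays the role of $\Psi$ and $\sigma$ the role of $\sigma_e$ in the Gaussian form of the marginal likelihood. With a uniform prior density $p(\theta) \propto \mathbb{I}(\theta)$, the posterior density is

$$p(\theta|y) \propto p(y|\theta) p(\theta) \propto p(y|\theta) \mathbb{I}(\theta), \qquad (2)$$

where $\mathbb{I}(\theta) = 1$ if $\theta \in \Theta$ and $\mathbb{I}(\theta) = 0$ otherwise (if $\theta \notin \Theta$).

Maximum a Posteriori (MAP) estimator:

$$\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|y). \qquad (3)$$

Minimum Mean Square Error (MMSE) estimator:

$$\hat{\theta}_{MMSE} = \int_{\Theta} \theta \, p(\theta|y) \, d\theta. \qquad (4)$$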
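The log marginal likelihood can be evaluated through a Cholesky factorization of $K + \sigma^2 I_N$. The following is a small pure-Python sketch, assuming scalar inputs and a Gaussian kernel for $K$; the helper names, the data and the grid are hypothetical. With a uniform prior on the grid, the maximizer of the marginal likelihood coincides with the MAP estimate restricted to that grid.

```python
import math
import random

def gram(xs, lam):
    """Kernel matrix K built from the Gaussian kernel with bandwidth lam."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * lam ** 2)) for b in xs] for a in xs]

def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(xs, ys, lam, sigma):
    """log N(y | 0, K + sigma^2 I_N), including the constant term."""
    n = len(xs)
    C = gram(xs, lam)
    for i in range(n):
        C[i][i] += sigma ** 2
    L = cholesky(C)
    z = []                                        # forward substitution: L z = y
    for i in range(n):
        z.append((ys[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)                  # equals y^T C^{-1} y
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

rng = random.Random(3)
xs = [rng.uniform(-3.0, 3.0) for _ in range(30)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
grid = [(lam, sig) for lam in (0.1, 0.5, 1.0, 2.0) for sig in (0.05, 0.1, 0.5)]
theta_hat = max(grid, key=lambda th: log_marginal_likelihood(xs, ys, th[0], th[1]))
```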

Global View

In general, the elements that must be analyzed/chosen are:

1. Different cost or pay-off functions (including Cross Validation (CV) and mini-batch approaches).
2. Different estimators (MAP, MMSE, median, etc.).
3. Choice of the prior pdfs (in a Bayesian framework).
4. Computational algorithms (for approximating the estimators).

Remarks:

- Several combinations are possible.
- Different conclusions for different Machine Learning algorithms.
- Compare methods: complexity, number of parameters/hyper-parameters.

SECOND PART: given a posterior, approximation of the estimators by MONTE CARLO

Inference using Monte Carlo

Given a posterior $\bar{\pi}(\theta) = p(\theta|y)$, we want to obtain its maximum, expected value (mean, i.e. $h(\theta) = \theta$ in the integral that follows), median, covariance matrix and other moments, such as

$$I = \int_{\Theta} h(\theta) \bar{\pi}(\theta) \, d\theta.$$

In general this cannot be done analytically, so we do it numerically. Deterministic methods fail in high dimensions and cannot be applied easily.

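Inference using Monte Carlo

Let us assume that we are able to evaluate point-wise the unnormalized posterior $\pi(\theta) = p(y|\theta) p(\theta)$. Then

$$\bar{\pi}(\theta) = p(\theta|y) = \frac{1}{Z} \pi(\theta), \qquad \text{where} \quad Z = \int_{\Theta} \pi(\theta) \, d\theta$$

is the marginal likelihood, $Z = p(y)$.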

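Monte Carlo approximation

Our problem is to compute numerically integrals of the type

$$I = \int_{\Theta} h(\theta) \bar{\pi}(\theta) \, d\theta \qquad (5)$$

$$\phantom{I} = \frac{1}{Z} \int_{\Theta} h(\theta) \pi(\theta) \, d\theta. \qquad (6)$$

Monte Carlo approximation:

$$I = \int_{\Theta} h(\theta) \bar{\pi}(\theta) \, d\theta \qquad (7)$$

$$\phantom{I} \approx \frac{1}{T} \sum_{t=1}^{T} h(\theta_t), \qquad (8)$$

where $\theta_t \sim \bar{\pi}(\theta)$.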
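As a quick illustration of approximation (8), consider a case where sampling is easy and the answer is known: if $\bar{\pi}$ is a standard Gaussian and $h(\theta) = \theta^2$, then $I = 1$. The target, the function $h$ and the sample size are assumptions chosen only for this sketch.

```python
import random

def monte_carlo_mean(h, sampler, T):
    """Approximate I = E[h(theta)] by the average of h over T samples."""
    return sum(h(sampler()) for _ in range(T)) / T

rng = random.Random(0)
estimate = monte_carlo_mean(
    h=lambda theta: theta ** 2,
    sampler=lambda: rng.gauss(0.0, 1.0),          # theta_t ~ N(0, 1)
    T=20000,
)
```

The error of the average shrinks roughly like $1/\sqrt{T}$ when the samples are independent.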

Monte Carlo approximation: Sampling methods

The problem is then to generate random vectors from $\bar{\pi}(\theta)$.

Sampling methods: procedures to generate random vectors from a generic density.

Sampling methods are NOT RELATED to Nyquist and Signal Processing sampling procedures. Here, "to sample from" and "to draw from" mean to generate random vectors/numbers.

[Figure with two panels, using the notation $\theta = x = [x_1, x_2]$; axes $x_1$ and $x_2$.]

Proposal density and target density

Proposal density: $q(\theta)$, easy to sample from (we can easily draw random samples from $q$).

Target density: the posterior $\bar{\pi}(\theta)$.

Sampling method: converts samples from $q(\theta)$ into samples distributed according to $\bar{\pi}(\theta)$.

A sampling method can be considered a filter: it takes random vectors/numbers distributed according to $q(\theta)$ and converts them into vectors distributed according to $\bar{\pi}(\theta)$.

Proposal density and target density (2)

Samples from proposal $q(\theta)$ ⟹ SAMPLING METHOD (Monte Carlo) ⟹ Samples from target $\bar{\pi}(\theta)$

[Figure: block diagram of the sampling method, with plots of the samples before and after.]

Evaluating versus Sampling a density

IMPORTANT: it is mandatory to distinguish between the following.

Evaluating a density (or a function): given an $x$, obtain the output $y = \bar{\pi}(x)$. Example:

$$y = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

Sampling (or drawing) from a density: generate vectors/numbers $x$ according to $\bar{\pi}(x)$. Namely, if we generate several samples, the histogram of these samples approximates the shape of $\bar{\pi}(x)$. Example: x = randn(1,1).
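The distinction can be sketched in Python, with `random.gauss` playing the role of `randn`. The values of `mu` and `sigma` and the interval check are illustrative assumptions.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian density at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

mu, sigma = 1.0, 2.0

# Evaluating: a number in, a number out.
value = gaussian_pdf(0.5, mu, sigma)

# Sampling: random numbers out, distributed according to the density.
rng = random.Random(0)
samples = [rng.gauss(mu, sigma) for _ in range(10000)]

# Crude histogram check: the fraction of samples in [mu - sigma, mu + sigma]
# should be close to the Gaussian probability of that interval (about 0.683).
inside = sum(1 for s in samples if abs(s - mu) <= sigma) / len(samples)
```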

Classification of sampling methods

MAIN FAMILIES:

- Direct methods: based on random variable transformation.
  - independent samples (the best, almost).
  - computational effort: lowest.
  - applicability: low.
- Rejection sampling
  - independent samples (the best, almost).
  - computational effort: higher (depending on the acceptance rate).
  - applicability: wider than direct methods, but in general low.
- Importance sampling (IS)
  - weighted samples.
  - computational effort: low.
  - applicability: always.
- Markov Chain Monte Carlo (MCMC)
  - positively correlated samples.
  - computational effort: low.
  - applicability: always.
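To illustrate what "weighted samples" means for importance sampling, here is a minimal self-normalized IS sketch. The slides only name the family, so the specific setup is an assumption: an unnormalized standard Gaussian target, a wider Gaussian proposal, and $h(\theta) = \theta^2$, for which $I = 1$ and $Z = \sqrt{2\pi}$.

```python
import math
import random

def target_unnorm(theta):
    """Unnormalised target pi(theta): a standard Gaussian without its constant."""
    return math.exp(-0.5 * theta ** 2)

def proposal_pdf(theta, scale):
    """Normalised Gaussian proposal q(theta) with standard deviation scale."""
    return math.exp(-0.5 * (theta / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))

def importance_sampling(h, T, scale, rng):
    """Return the self-normalised estimate of I and the estimate of Z."""
    thetas = [rng.gauss(0.0, scale) for _ in range(T)]             # draws from q
    weights = [target_unnorm(t) / proposal_pdf(t, scale) for t in thetas]
    z_hat = sum(weights) / T                                       # estimates Z
    i_hat = sum(w * h(t) for w, t in zip(weights, thetas)) / sum(weights)
    return i_hat, z_hat

rng = random.Random(0)
i_hat, z_hat = importance_sampling(lambda t: t ** 2, T=20000, scale=2.0, rng=rng)
```

Every draw from the proposal is kept, but each one counts in proportion to its weight $\pi(\theta)/q(\theta)$.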

Markov Chain Monte Carlo (MCMC)

MCMC: we generate a Markov chain that has the posterior density $\bar{\pi}(\theta)$ as its invariant/stationary density,

$$\theta_0 \rightarrow \theta_1 \rightarrow \theta_2 \rightarrow \cdots \rightarrow \theta_t.$$

After a burn-in period (with length $t_b$), we have

$$\theta_t \sim \bar{\pi}(\theta), \qquad \text{for } t \geq t_b.$$

Problem: we do not know $t_b$. (We will use all the samples without discarding any of them, hoping that $T$ is large enough.)

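Metropolis-Hastings (MH) algorithm

The Metropolis-Hastings (MH) sampler:

1. Choose $\theta_0$.
2. For $t = 1, \ldots, T$:
   2.1 Generate $\theta' \sim q(\theta|\theta_{t-1})$.
   2.2 Set $\theta_t = \theta'$ with probability
   $$\alpha = \min\left[1, \frac{\pi(\theta') \, q(\theta_{t-1}|\theta')}{\pi(\theta_{t-1}) \, q(\theta'|\theta_{t-1})}\right],$$
   otherwise set $\theta_t = \theta_{t-1}$ (with probability $1 - \alpha$).
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.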
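The steps above can be sketched as follows for a scalar $\theta$. The sketch assumes a Gaussian random-walk proposal, which is symmetric, so the $q$ terms in $\alpha$ cancel; it works with $\log \pi$ for numerical stability. The target (an unnormalized Gaussian with mean 1), the step size and $T$ are illustrative choices.

```python
import math
import random

def metropolis_hastings(log_target, theta0, T, step, rng):
    """Random-walk MH; returns the chain theta_1..theta_T and the acceptance rate."""
    theta, log_p = theta0, log_target(theta0)
    chain, accepted = [], 0
    for _ in range(T):
        candidate = theta + rng.gauss(0.0, step)       # theta' ~ q(theta | theta_{t-1})
        log_p_cand = log_target(candidate)
        # Symmetric proposal: alpha = min(1, pi(theta') / pi(theta_{t-1})).
        alpha = math.exp(min(0.0, log_p_cand - log_p))
        if rng.random() < alpha:
            theta, log_p = candidate, log_p_cand
            accepted += 1
        chain.append(theta)                            # on rejection, theta_t = theta_{t-1}
    return chain, accepted / T

def log_target(theta):
    """log pi(theta) for an unnormalised Gaussian with mean 1 and unit variance."""
    return -0.5 * (theta - 1.0) ** 2

rng = random.Random(0)
chain, acc_rate = metropolis_hastings(log_target, theta0=0.0, T=20000, step=2.0, rng=rng)
post_mean = sum(chain) / len(chain)                    # MMSE estimate from the chain
```

Only ratios of $\pi$ appear, which is why the normalizing constant $Z$ is never needed. For a non-symmetric proposal, the log of the ratio $q(\theta_{t-1}|\theta') / q(\theta'|\theta_{t-1})$ would have to be added before taking the minimum.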

From MH to Gibbs

In MH, we propose vectors/samples $\theta = [\theta_1, \ldots, \theta_L] \in \mathbb{R}^L$ directly on the space of dimension $L$.

There are also component-wise strategies that work component by component in order to construct a complete sample/vector $\theta = [\theta_1, \ldots, \theta_L]$.

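Bidimensional Gibbs Sampling (L = 2)

Consider $\pi(\theta_1, \theta_2)$, and note that

$$\bar{\pi}_1(\theta_1|\theta_2) \propto \pi(\theta_1, \theta_2), \qquad \bar{\pi}_2(\theta_2|\theta_1) \propto \pi(\theta_1, \theta_2).$$

Assume that we are able to draw from the conditionals $\bar{\pi}_1$ and $\bar{\pi}_2$ (strong assumption).

The Bidimensional Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 Draw $\theta_{1,t} \sim \bar{\pi}_1(\theta_1|\theta_{2,t-1})$.
   2.2 Draw $\theta_{2,t} \sim \bar{\pi}_2(\theta_2|\theta_{1,t})$.
   2.3 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.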
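A case where the strong assumption holds is a zero-mean, unit-variance bivariate Gaussian with correlation $\rho$: each conditional is the Gaussian $\mathcal{N}(\rho\,\theta_{other}, 1 - \rho^2)$. The sketch below uses this target purely as an illustration; the value of $\rho$, the starting point and $T$ are assumptions.

```python
import math
import random

def gibbs_bivariate_gaussian(rho, theta0, T, rng):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian with correlation rho."""
    cond_std = math.sqrt(1.0 - rho ** 2)
    theta1, theta2 = theta0
    chain = []
    for _ in range(T):
        theta1 = rng.gauss(rho * theta2, cond_std)     # draw from pi_1(theta_1 | theta_{2,t-1})
        theta2 = rng.gauss(rho * theta1, cond_std)     # draw from pi_2(theta_2 | theta_{1,t})
        chain.append((theta1, theta2))
    return chain

rng = random.Random(0)
chain = gibbs_bivariate_gaussian(rho=0.8, theta0=(0.0, 0.0), T=40000, rng=rng)
mean1 = sum(t1 for t1, _ in chain) / len(chain)
corr_hat = sum(t1 * t2 for t1, t2 in chain) / len(chain)   # estimates E[theta_1 theta_2] = rho
```

Each update moves along one coordinate axis only, and every draw is accepted.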

Bidimensional Gibbs Sampling (L = 2)

[Figure, using the notation $\theta = x = [x_1, x_2]$; axes $x_1$ and $x_2$.]

[Figure with two panels, using the same notation; axes $x_1$ and $x_2$ in each panel.]

Gibbs Sampling

Assume that we are able to draw from the full-conditionals $\bar{\pi}_l$, $l = 1, \ldots, L$ (strong assumption).

The Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}, \ldots, \theta_{l,0}, \ldots, \theta_{L,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 For $l = 1, \ldots, L$:
       2.1.1 Draw $\theta_{l,t} \sim \bar{\pi}_l(\theta_l|\theta_{1:l-1,t}, \theta_{l+1:L,t-1})$.
   2.2 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}, \ldots, \theta_{l,t}, \ldots, \theta_{L,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.

MH-within-Gibbs

If we are not able to draw from the full-conditionals, what do we do? We use another MCMC method inside the Gibbs sampler, e.g., an MH method inside Gibbs.

The MH-within-Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}, \ldots, \theta_{l,0}, \ldots, \theta_{L,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 For $l = 1, \ldots, L$:
       2.1.1 Draw $\theta_{l,t}$ from $\bar{\pi}_l(\theta_l|\theta_{1:l-1,t}, \theta_{l+1:L,t-1})$ using an MH algorithm (for instance, another $T$ steps of MH).
   2.2 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}, \ldots, \theta_{l,t}, \ldots, \theta_{L,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.
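A minimal sketch of this scheme, assuming a random-walk MH update for each component. Since the full-conditional of component $l$ is proportional to the joint $\pi(\theta)$ seen as a function of $\theta_l$, the acceptance ratio can be computed from the joint alone. The target (an unnormalized bivariate Gaussian with correlation 0.5), the step size and the number of inner steps are illustrative assumptions.

```python
import math
import random

RHO = 0.5  # correlation of the illustrative target

def log_target(theta):
    """log pi(theta) for an unnormalised zero-mean bivariate Gaussian with correlation RHO."""
    t1, t2 = theta
    return -(t1 * t1 - 2.0 * RHO * t1 * t2 + t2 * t2) / (2.0 * (1.0 - RHO * RHO))

def mh_within_gibbs(log_target, theta0, T, step, inner_steps, rng):
    """Update one component at a time with inner_steps random-walk MH moves each."""
    theta = list(theta0)
    chain = []
    for _ in range(T):
        for l in range(len(theta)):
            for _inner in range(inner_steps):
                candidate = list(theta)
                candidate[l] += rng.gauss(0.0, step)   # propose a move of component l only
                log_ratio = log_target(candidate) - log_target(theta)
                if rng.random() < math.exp(min(0.0, log_ratio)):
                    theta = candidate
        chain.append(tuple(theta))                     # theta_t after all L components
    return chain

rng = random.Random(0)
chain = mh_within_gibbs(log_target, theta0=(0.0, 0.0), T=20000, step=1.5, inner_steps=2, rng=rng)
mean1 = sum(t1 for t1, _ in chain) / len(chain)
corr_hat = sum(t1 * t2 for t1, t2 in chain) / len(chain)   # should be close to RHO
```

Unlike the plain Gibbs sampler, a component may stay at its previous value when all its inner proposals are rejected.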

Questions? Thanks!