Learning the hyper-parameters
Luca Martino, 2017

Parameters and hyper-parameters

1. All the described methods depend on some choice of hyper-parameters.
2. For instance, recall $\lambda$ (the bandwidth of the kernel/basis function) and $\sigma_e$ (the standard deviation of the noise):

$$\psi(\mathbf{x}, \mathbf{x}_n) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2\lambda^2}\right), \qquad y = f(\mathbf{x}) + e,$$

where $e \sim \mathcal{N}(e; 0, \sigma_e^2)$.

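Cross Validation (CV)

Split the dataset $\mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$ into two sets,

$$\mathcal{D}_{\text{train}} = \{\mathbf{x}_i^{(TR)}, y_i^{(TR)}\}_{i=1}^{N_{TR}}, \qquad \mathcal{D}_{\text{test}} = \{\mathbf{x}_i^{(V)}, y_i^{(V)}\}_{i=1}^{N_V} \quad (\text{or } \mathcal{D}_{\text{validation}}),$$

so that $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}$. Then:

1. Given some values of the hyper-parameters $\theta = [\lambda, \sigma_e]$, compute the estimator $\hat{f}(\mathbf{x} \mid \theta)$ using $\mathcal{D}_{\text{train}}$.
2. Validate how good the solution $\hat{f}(\mathbf{x} \mid \theta)$ is using $\mathcal{D}_{\text{test}}$. For instance, we can try to minimize the MSE in prediction:

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{n=1}^{N_V} \left( y_n^{(V)} - \hat{f}(\mathbf{x}_n^{(V)} \mid \theta) \right)^2.$$
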
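The two steps above can be sketched as a small grid search. This is only an illustration under stated assumptions: a kernel-weighted average (a Nadaraya-Watson smoother) stands in for the estimator of the slides, so only $\lambda$ is tuned here ($\sigma_e$ does not enter this stand-in), and the data, the grid and all function names are hypothetical.

```python
import math
import random

def gaussian_basis(x, xn, lam):
    """psi(x, x_n) = exp(-(x - x_n)^2 / (2 lam^2)) for scalar inputs."""
    return math.exp(-((x - xn) ** 2) / (2.0 * lam ** 2))

def fit_predict(x_train, y_train, lam):
    """Return f_hat(x | lam): a kernel-weighted average of the training outputs.
    ASSUMPTION: this smoother is a stand-in for the estimator of the slides."""
    def f_hat(x):
        w = [gaussian_basis(x, xn, lam) for xn in x_train]
        s = sum(w)
        if s == 0.0:  # every weight underflowed: fall back to the plain mean
            return sum(y_train) / len(y_train)
        return sum(wi * yi for wi, yi in zip(w, y_train)) / s
    return f_hat

def validation_sse(x_tr, y_tr, x_val, y_val, lam):
    """Sum of squared prediction errors on the validation set."""
    f_hat = fit_predict(x_tr, y_tr, lam)
    return sum((y - f_hat(x)) ** 2 for x, y in zip(x_val, y_val))

# Synthetic data (hypothetical): y = sin(x) + noise.
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(120)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
x_tr, y_tr, x_val, y_val = xs[:80], ys[:80], xs[80:], ys[80:]

# Step 1 and step 2 for every candidate value, then take the arg min.
lam_grid = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
errors = {lam: validation_sse(x_tr, y_tr, x_val, y_val, lam) for lam in lam_grid}
lam_hat = min(errors, key=errors.get)
print("selected bandwidth:", lam_hat)
```

The grid plays the role of $\Theta$; a finer grid or a continuous optimizer could replace it.
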
Cross Validation (CV)

Note that the previous procedure is equivalent to

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \exp\left( -\sum_{n=1}^{N_V} \left( y_n^{(V)} - \hat{f}(\mathbf{x}_n^{(V)} \mid \theta) \right)^2 \right).$$

However, we can also minimize or maximize other cost or pay-off functions (not only the error in prediction), or use other estimators of $\theta$, considering the mean or the median instead of the maximum.

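Other estimators for CV

Denote as

$$p(\mathbf{y}^{(V)} \mid \theta) \propto \exp\left( -\sum_{n=1}^{N_V} \left( y_n^{(V)} - \hat{f}(\mathbf{x}_n^{(V)} \mid \theta) \right)^2 \right)$$

the "CV-Error in Prediction" likelihood, as $p(\theta)$ the prior over the hyper-parameters $\theta$, and as $p(\theta \mid \mathbf{y}^{(V)})$ the corresponding posterior. Then we can also define other estimators, for instance the Minimum Mean Square Error (MMSE) estimator,

$$\hat{\theta}_{\text{MMSE}} = \int_{\Theta} \theta\, p(\theta \mid \mathbf{y})\, d\theta, \qquad (1)$$

instead of using the maximum $\hat{\theta}_{\text{MAP}}$.
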
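As a rough illustration of the difference between the two estimators, the posterior can be approximated on a finite grid of candidate values. The sketch below assumes a discrete uniform prior on the grid points, and the grid and validation errors are made-up numbers, not results from the slides.

```python
import math

def cv_posterior_weights(sse_values):
    """Normalized weights proportional to exp(-SSE), i.e. the CV likelihood
    combined with a uniform prior on the grid points (an assumption)."""
    m = min(sse_values)  # subtract the minimum for numerical stability
    w = [math.exp(-(s - m)) for s in sse_values]
    z = sum(w)
    return [wi / z for wi in w]

lam_grid = [0.1, 0.2, 0.5, 1.0]  # hypothetical candidate bandwidths
sse = [3.0, 1.2, 1.0, 4.5]       # hypothetical validation errors
weights = cv_posterior_weights(sse)

# MAP: the grid point with the largest posterior weight.
lam_map = lam_grid[max(range(len(lam_grid)), key=lambda i: weights[i])]
# MMSE: the posterior mean, a weighted average of the grid points.
lam_mmse = sum(lam * w for lam, w in zip(lam_grid, weights))
print(lam_map, lam_mmse)
```

The MMSE value generally does not coincide with any grid point, because it averages over all candidates instead of picking the best one.
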
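K-fold CV

Split the dataset $\mathcal{D} = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$ into $K$ sets $\mathcal{D}^{(k)}$. For $k = 1, \ldots, K$:

1. Given some hyper-parameters $\theta = [\lambda, \sigma_e]$, and using $\mathcal{D}^{(k)}$ as training set, compute the estimator $\hat{f}_k(\mathbf{x} \mid \theta)$.
2. Obtain $\hat{\theta}^{(k)}$ considering the remaining $K - 1$ sets as validation sets.

Finally, compute

$$\hat{\theta} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}^{(k)}.$$
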
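A minimal sketch of this loop follows, keeping the convention stated above ($\mathcal{D}^{(k)}$ for training, the other $K-1$ sets for validation). The estimator is a deliberately simple toy (a shrunken mean with one hyper-parameter), chosen only to keep the example short; it, the data and the function names are assumptions.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def best_theta(y_train, y_val, grid):
    """Pick the theta in grid whose toy estimator has the lowest validation SSE.
    ASSUMPTION: toy estimator f_hat = sum(y_train) / (len(y_train) + theta)."""
    def sse(theta):
        f_hat = sum(y_train) / (len(y_train) + theta)
        return sum((y - f_hat) ** 2 for y in y_val)
    return min(grid, key=sse)

rng = random.Random(1)
y = [2.0 + rng.gauss(0.0, 1.0) for _ in range(60)]  # hypothetical data
K = 5
folds = k_fold_indices(len(y), K, rng)
grid = [0.0, 0.5, 1.0, 2.0, 4.0]

theta_k = []
for k in range(K):
    train = [y[i] for i in folds[k]]                            # D^(k) as training set
    val = [y[i] for j in range(K) if j != k for i in folds[j]]  # the other K-1 sets
    theta_k.append(best_theta(train, val, grid))

theta_hat = sum(theta_k) / K  # average of the K per-fold estimates
print(theta_k, theta_hat)
```

Many textbooks swap the roles (train on $K-1$ folds, validate on the held-out one); only the two list comprehensions inside the loop would change.
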
Leave-One-Out and All-in

Leave-One-Out: in this case, we consider exactly $K = N$ sets, each one formed by $N - 1$ data points with only one left out.

All-in: all the data are used for training. This is not CV ($K = 1$ with $N$ data). Let us look at the marginal likelihood approach to clarify this point.

Alternative to the Error in Prediction: Marginal Likelihood

Given the studied models, the marginal likelihood has the form (or similar)

$$p(\mathbf{y} \mid \theta) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \Psi + \sigma_e^2 \mathbf{I}_N),$$

where $\lambda$ affects the construction of $\Psi$ (recall that $\theta = [\lambda, \sigma_e]$). We can try to maximize the marginal likelihood,

$$\hat{\theta} = \arg\max_{\theta \in \Theta} p(\mathbf{y} \mid \theta).$$

It can be used with (inside) or without ("All-in") CV.

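Marginal Likelihood

Recall that

$$\log p(\mathbf{y} \mid \theta) = -\frac{1}{2}\, \mathbf{y}^{\top} (\mathbf{K} + \sigma^2 \mathbf{I}_N)^{-1} \mathbf{y} - \frac{1}{2} \log\left[ \det(\mathbf{K} + \sigma^2 \mathbf{I}_N) \right] + \text{const}.$$

With a uniform prior density $p(\theta) = \mathbb{I}(\theta)$, the posterior density is

$$p(\theta \mid \mathbf{y}) \propto p(\mathbf{y} \mid \theta)\, p(\theta) = p(\mathbf{y} \mid \theta)\, \mathbb{I}(\theta), \qquad (2)$$

where $\mathbb{I}(\theta) = 1$ if $\theta \in \Theta$ and $\mathbb{I}(\theta) = 0$ otherwise (if $\theta \notin \Theta$).

Maximum a Posteriori (MAP) estimator:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} p(\theta \mid \mathbf{y}). \qquad (3)$$

Minimum Mean Square Error (MMSE) estimator:

$$\hat{\theta}_{\text{MMSE}} = \int_{\Theta} \theta\, p(\theta \mid \mathbf{y})\, d\theta. \qquad (4)$$
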
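The log marginal likelihood can be evaluated through a Cholesky factorization of $\mathbf{K} + \sigma^2 \mathbf{I}_N$. The sketch below is a plain-Python illustration for scalar inputs, assuming $\mathbf{K}$ is built from the Gaussian basis function with bandwidth $\lambda$; the data, the grid and the function names are hypothetical, and the constant is written out as $-\frac{N}{2}\log(2\pi)$.

```python
import math

def gaussian_kernel_matrix(x, lam):
    """K[i][j] = exp(-(x_i - x_j)^2 / (2 lam^2)) for scalar inputs."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * lam ** 2)) for b in x] for a in x]

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_marginal_likelihood(x, y, lam, sigma):
    """log N(y | 0, K + sigma^2 I), including the -N/2 log(2 pi) constant."""
    n = len(x)
    c = gaussian_kernel_matrix(x, lam)
    for i in range(n):
        c[i][i] += sigma ** 2
    low = cholesky(c)
    # Forward substitution solves L z = y, so that y^T C^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

# "All-in" maximization over a small grid of bandwidths (hypothetical data).
x_obs = [0.5 * i for i in range(12)]
y_obs = [math.sin(v) for v in x_obs]
lam_grid = [0.2, 0.5, 1.0, 2.0]
lam_hat = max(lam_grid, key=lambda lam: log_marginal_likelihood(x_obs, y_obs, lam, 0.1))
print("bandwidth maximizing the marginal likelihood:", lam_hat)
```

With a uniform prior on the grid, this arg max is also the MAP estimate restricted to the grid.
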
Global View

In general, the elements that must be analyzed/chosen are:

1. Different cost or pay-off functions (including Cross Validation (CV) and mini-batch approaches).
2. Different estimators (MAP, MMSE, median, etc.).
3. Choice of the prior pdfs (in a Bayesian framework).
4. Computational algorithms (for approximating the estimators).

Several combinations are possible.
Different conclusions hold for different Machine Learning algorithms.
Compare methods: complexity, number of parameters/hyper-parameters.

SECOND PART: given a posterior, approximation of the estimators by MONTE CARLO

Inference using Monte Carlo

Given a posterior $\bar{\pi}(\theta) = p(\theta \mid \mathbf{y})$, we want to obtain its maximum, expected value (mean; $h(\theta) = \theta$, see below), median, covariance matrix and other moments, such as

$$I = \int_{\Theta} h(\theta)\, \bar{\pi}(\theta)\, d\theta.$$

In general this cannot be done analytically, so we will do it numerically. Deterministic methods fail in high dimensions and cannot be applied easily.

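Inference using Monte Carlo

Let us consider that we are able to evaluate point-wise $\pi(\theta) = p(\mathbf{y} \mid \theta)\, p(\theta)$. Then

$$\bar{\pi}(\theta) = p(\theta \mid \mathbf{y}) = \frac{1}{Z}\, \pi(\theta), \qquad \text{where } Z = \int_{\Theta} \pi(\theta)\, d\theta$$

is the marginal likelihood, $Z = p(\mathbf{y})$.
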
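Monte Carlo approximation

Our problem is to compute numerically integrals of the type

$$I = \int_{\Theta} h(\theta)\, \bar{\pi}(\theta)\, d\theta \qquad (5)$$

$$= \frac{1}{Z} \int_{\Theta} h(\theta)\, \pi(\theta)\, d\theta. \qquad (6)$$

Monte Carlo approximation:

$$I = \int_{\Theta} h(\theta)\, \bar{\pi}(\theta)\, d\theta \qquad (7)$$

$$\approx \frac{1}{T} \sum_{t=1}^{T} h(\theta_t), \qquad (8)$$

where $\theta_t \sim \bar{\pi}(\theta)$.
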
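A minimal sketch of the approximation in (8), assuming for illustration that the posterior is a Gaussian $\mathcal{N}(1, 2^2)$ from which we can draw directly (in practice this is exactly the difficult part). The function names are hypothetical.

```python
import random

def monte_carlo_expectation(h, sampler, T):
    """Approximate I = E[h(theta)] by the average of h over T samples."""
    return sum(h(sampler()) for _ in range(T)) / T

rng = random.Random(0)

def sampler():
    # Stand-in posterior: N(mean=1, std=2). ASSUMPTION for illustration only.
    return rng.gauss(1.0, 2.0)

mean_est = monte_carlo_expectation(lambda t: t, sampler, 100_000)               # h(theta) = theta
second_moment_est = monte_carlo_expectation(lambda t: t * t, sampler, 100_000)  # h(theta) = theta^2
print(mean_est, second_moment_est)  # exact values are 1 and 1 + 2^2 = 5
```

Changing `h` is all that is needed to target another moment; the samples themselves can be reused.
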
Monte Carlo approximation: Sampling methods

Then the problem is to generate random vectors from $\bar{\pi}(\theta)$.

Sampling methods: procedures to generate random vectors from a generic density.

Sampling methods are NOT related to Nyquist and Signal Processing sampling procedures. "To sample from" and "to draw from" mean to generate random vectors/numbers.

[Figure, with other notation ($\theta = \mathbf{x} = [x_1, x_2]$): two panels with axes $x_1$ and $x_2$.]

Proposal density and target density

Proposal density: $q(\theta)$, easy to sample from (we can easily draw random samples from $q$).

Target density: the posterior $\bar{\pi}(\theta)$.

Sampling method: converts samples from $q(\theta)$ into samples distributed according to $\bar{\pi}(\theta)$.

A sampling method can be considered a filter: it takes random vectors/numbers distributed according to $q(\theta)$ and converts them into vectors distributed according to $\bar{\pi}(\theta)$.

Proposal density and target density (2)

Samples from proposal $q(\theta)$ $\Longrightarrow$ SAMPLING METHOD (Monte Carlo) $\Longrightarrow$ Samples from target $\bar{\pi}(\theta)$

[Figure: block diagram of the sampling method, with a plot of the densities.]

Evaluating versus Sampling a density

IMPORTANT: it is mandatory to distinguish between:

Evaluating a density (or a function): given an $x$, obtain the output $y = \pi(x)$. Ex:

$$z = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

Sampling (or drawing) from a density: generate vectors/numbers $x'$ according to $\pi(x)$. Namely, if we generate several samples $x', x'', \ldots$, the histogram of these samples approximates the shape of $\pi(x)$. Ex: x = randn(1,1).

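The distinction can be made concrete with a standard Gaussian. In this sketch, Python's `random.gauss` plays the role of `randn`, and a single histogram bin around zero is compared with the evaluated density; the bin width and sample size are arbitrary choices.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """EVALUATING: given x, return the density value N(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)

# SAMPLING: generate numbers distributed according to N(0, 1).
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

# One histogram bin centered at 0: (fraction of samples in the bin) / (bin width)
# should approximate the density evaluated at 0.
width = 0.2
fraction = sum(1 for s in samples if -width / 2 <= s < width / 2) / len(samples)
density_est = fraction / width
print(gaussian_pdf(0.0, 0.0, 1.0), density_est)
```
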
Classification of sampling methods

MAIN FAMILIES:

Direct methods (based on random variable transformation): independent samples (the best, almost); computational effort: lowest; applicability: low.

Rejection sampling: independent samples (the best, almost); computational effort: higher (depending on the acceptance rate); applicability: wider than direct methods, but in general low.

Importance sampling (IS): weighted samples; computational effort: low; applicability: always.

Markov Chain Monte Carlo (MCMC): positively correlated samples; computational effort: low; applicability: always.

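To illustrate what "weighted samples" means for importance sampling, here is a minimal self-normalized IS sketch. It assumes an unnormalized target $\pi(\theta) = \exp(-\theta^2/2)$ (so $Z = \sqrt{2\pi}$ is known and can be checked) and a wider Gaussian proposal; both choices and all names are illustrative assumptions.

```python
import math
import random

def unnormalized_target(theta):
    """pi(theta) = exp(-theta^2 / 2); its integral is Z = sqrt(2 pi)."""
    return math.exp(-0.5 * theta * theta)

def proposal_pdf(theta, scale):
    """Density of the proposal q = N(0, scale^2)."""
    return math.exp(-0.5 * (theta / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))

rng = random.Random(0)
scale = 3.0  # proposal wider than the target (assumption)
T = 100_000
thetas = [rng.gauss(0.0, scale) for _ in range(T)]
weights = [unnormalized_target(t) / proposal_pdf(t, scale) for t in thetas]

# The average weight estimates Z; the weighted average of h estimates I.
z_est = sum(weights) / T
second_moment_est = sum(w * t * t for w, t in zip(weights, thetas)) / sum(weights)
print(z_est, second_moment_est)  # exact values: sqrt(2 pi) and 1
```

The samples are independent draws from $q$, and the weights correct for the mismatch between $q$ and the target.
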
Markov Chain Monte Carlo (MCMC)

MCMC: we generate a Markov chain that has the posterior density $\bar{\pi}(\theta)$ as an invariant/stationary density,

$$\theta_0 \to \theta_1 \to \theta_2 \to \cdots \to \theta_t.$$

After a "burn-in" period (with length $t_b$), we have

$$\theta_t \sim \bar{\pi}(\theta), \qquad \text{for } t \geq t_b.$$

Problem: we do not know $t_b$. (We will use all the samples without discarding any of them, hoping that $T$ is large enough.)

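Metropolis-Hastings (MH) algorithm

The Metropolis-Hastings (MH) sampler:

1. Choose $\theta_0$.
2. For $t = 1, \ldots, T$:
   2.1 Generate $\theta' \sim q(\theta \mid \theta_{t-1})$.
   2.2 Set $\theta_t = \theta'$ with probability

   $$\alpha = \min\left[ 1, \frac{\pi(\theta')\, q(\theta_{t-1} \mid \theta')}{\pi(\theta_{t-1})\, q(\theta' \mid \theta_{t-1})} \right],$$

   otherwise set $\theta_t = \theta_{t-1}$ (with probability $1 - \alpha$).
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.
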
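The sampler above can be sketched for a scalar $\theta$ as follows. This is a minimal illustration, assuming a Gaussian random-walk proposal $q(\theta' \mid \theta_{t-1}) = \mathcal{N}(\theta_{t-1}, \text{step}^2)$; since this proposal is symmetric, the two $q$ terms cancel in $\alpha$. The target, the step size and the function names are assumptions.

```python
import math
import random

def metropolis_hastings(log_target, theta0, T, step, rng):
    """Random-walk MH for a scalar theta.
    log_target returns log pi(theta) up to an additive constant."""
    chain = []
    theta = theta0
    log_p = log_target(theta)
    for _ in range(T):
        cand = theta + rng.gauss(0.0, step)  # step 2.1: propose theta'
        log_p_cand = log_target(cand)
        # Step 2.2: accept with probability alpha = min(1, pi(theta') / pi(theta_{t-1})),
        # tested in the log domain; 1 - random() lies in (0, 1], so the log is finite.
        if math.log(1.0 - rng.random()) < log_p_cand - log_p:
            theta, log_p = cand, log_p_cand
        chain.append(theta)                  # a rejected proposal repeats theta_{t-1}
    return chain

def log_target(theta):
    # Unnormalized log-density of N(2, 0.5^2). ASSUMPTION: illustrative target.
    return -((theta - 2.0) ** 2) / (2.0 * 0.25)

rng = random.Random(0)
chain = metropolis_hastings(log_target, theta0=0.0, T=50_000, step=1.0, rng=rng)
mean_est = sum(chain) / len(chain)
var_est = sum((t - mean_est) ** 2 for t in chain) / len(chain)
print(mean_est, var_est)  # exact values: 2 and 0.25
```

Only the unnormalized density is needed, because $Z$ cancels in the ratio. All samples are kept here, with no burn-in removal.
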
From MH to Gibbs

In MH, we propose vectors/samples $\theta' = [\theta'_1, \ldots, \theta'_L] \in \mathbb{R}^L$ directly in the space of dimension $L$.

There are also "component-wise" strategies that work component by component in order to construct a complete sample/vector $\theta = [\theta_1, \ldots, \theta_L]$.

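Bidimensional Gibbs Sampling (L = 2)

Consider $\bar{\pi}(\theta_1, \theta_2)$, and note that

$$\bar{\pi}_1(\theta_1 \mid \theta_2) \propto \bar{\pi}(\theta_1, \theta_2), \qquad \bar{\pi}_2(\theta_2 \mid \theta_1) \propto \bar{\pi}(\theta_1, \theta_2).$$

Assume that we are able to draw from the conditionals $\bar{\pi}_1$ and $\bar{\pi}_2$ (strong assumption).

The Bidimensional Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 Draw $\theta_{1,t} \sim \bar{\pi}_1(\theta_1 \mid \theta_{2,t-1})$.
   2.2 Draw $\theta_{2,t} \sim \bar{\pi}_2(\theta_2 \mid \theta_{1,t})$.
   2.3 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.
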
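A case where the "strong assumption" holds is a bivariate Gaussian with zero means, unit variances and correlation $\rho$: each conditional is the Gaussian $\mathcal{N}(\rho\, \theta_{\text{other}}, 1 - \rho^2)$. The sketch below uses this target purely as an illustration; the value of $\rho$ and the function name are assumptions.

```python
import math
import random

def gibbs_bivariate_gaussian(rho, theta0, T, rng):
    """Gibbs sampler for a zero-mean, unit-variance bivariate Gaussian
    with correlation rho, using its exact conditionals."""
    cond_std = math.sqrt(1.0 - rho ** 2)
    th1, th2 = theta0
    chain = []
    for _ in range(T):
        th1 = rng.gauss(rho * th2, cond_std)  # step 2.1: theta_1 given the old theta_2
        th2 = rng.gauss(rho * th1, cond_std)  # step 2.2: theta_2 given the new theta_1
        chain.append((th1, th2))              # step 2.3
    return chain

rng = random.Random(0)
rho = 0.8
chain = gibbs_bivariate_gaussian(rho, theta0=(0.0, 0.0), T=50_000, rng=rng)
mean1 = sum(a for a, _ in chain) / len(chain)
cross_moment = sum(a * b for a, b in chain) / len(chain)
print(mean1, cross_moment)  # exact values: 0 and rho
```

Every draw is accepted, but consecutive samples are correlated, more strongly so as $|\rho|$ approaches 1.
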
Bidimensional Gibbs Sampling (L = 2)

[Figure, with other notation ($\theta = \mathbf{x} = [x_1, x_2]$): plot with axes $x_1$ and $x_2$.]

Bidimensional Gibbs Sampling (L = 2)

[Figure, with other notation ($\theta = \mathbf{x} = [x_1, x_2]$): two panels with axes $x_1$ and $x_2$.]

Gibbs Sampling

Assume that we are able to draw from the full-conditionals $\bar{\pi}_l$, $l = 1, \ldots, L$ (strong assumption).

The Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}, \ldots, \theta_{l,0}, \ldots, \theta_{L,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 For $l = 1, \ldots, L$:
       2.1.1 Draw $\theta_{l,t} \sim \bar{\pi}_l(\theta_l \mid \theta_{1:l-1,t}, \theta_{l+1:L,t-1})$.
   2.2 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}, \ldots, \theta_{l,t}, \ldots, \theta_{L,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.

MH-within-Gibbs

What do we do if we are not able to draw from the full-conditionals? We use another MCMC method inside the Gibbs sampler, e.g., an MH method inside Gibbs.

The MH-within-Gibbs sampler:

1. Choose $\theta_0 = [\theta_{1,0}, \theta_{2,0}, \ldots, \theta_{l,0}, \ldots, \theta_{L,0}]$.
2. For $t = 1, \ldots, T$:
   2.1 For $l = 1, \ldots, L$:
       2.1.1 Draw $\theta_{l,t}$ from $\bar{\pi}_l(\theta_l \mid \theta_{1:l-1,t}, \theta_{l+1:L,t-1})$ using an MH algorithm (for instance, a further $T$ steps of MH).
   2.2 Set $\theta_t = [\theta_{1,t}, \theta_{2,t}, \ldots, \theta_{l,t}, \ldots, \theta_{L,t}]$.
3. Outputs: $\{\theta_1, \ldots, \theta_T\}$.

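A minimal sketch of this scheme, assuming that only the joint density can be evaluated (up to a constant). Each full-conditional is proportional to the joint, so an inner random-walk MH step on component $l$ can use the joint in its acceptance ratio. The target (a correlated bivariate Gaussian, chosen so the result can be checked), the number of inner steps and all names are illustrative assumptions.

```python
import math
import random

def mh_within_gibbs(log_target, theta0, T, step, n_inner, rng):
    """Update one component at a time with n_inner random-walk MH steps.
    log_target returns the log of the joint density up to a constant."""
    theta = list(theta0)
    log_p = log_target(theta)
    chain = []
    for _ in range(T):
        for l in range(len(theta)):                  # step 2.1: sweep over components
            for _inner in range(n_inner):            # step 2.1.1: inner MH on component l
                cand = list(theta)
                cand[l] += rng.gauss(0.0, step)      # symmetric proposal, so q cancels
                log_p_cand = log_target(cand)
                if math.log(1.0 - rng.random()) < log_p_cand - log_p:
                    theta, log_p = cand, log_p_cand
        chain.append(tuple(theta))                   # step 2.2
    return chain

RHO = 0.5  # illustrative correlation of the target

def log_target(theta):
    # Unnormalized log-density of a zero-mean, unit-variance bivariate Gaussian.
    t1, t2 = theta
    return -(t1 * t1 - 2.0 * RHO * t1 * t2 + t2 * t2) / (2.0 * (1.0 - RHO ** 2))

rng = random.Random(0)
chain = mh_within_gibbs(log_target, theta0=(0.0, 0.0), T=40_000, step=1.5, n_inner=2, rng=rng)
mean1 = sum(a for a, _ in chain) / len(chain)
cross_moment = sum(a * b for a, b in chain) / len(chain)
print(mean1, cross_moment)  # exact values: 0 and RHO
```

More inner steps bring each component draw closer to an exact draw from its full-conditional, at a proportionally higher cost per iteration.
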
Questions?

THANKS!

References:

[1] L. Martino, V. Elvira. Metropolis Sampling. Wiley StatsRef: Statistics Reference Online, 2017. arXiv:1704.04629.
[2] L. Martino, V. Elvira, G. Camps-Valls. The Recycling Gibbs Sampler for Efficient Learning. Digital Signal Processing (to appear), 2017. arXiv:1611.07056.