Bayesian Interpretations of Regularization
1 Bayesian Interpretations of Regularization
Charlie Frogner. 9.520 Class 15, April 1, 2009.
2 The Plan
Regularized least squares maps {(x_i, y_i)}_{i=1}^n to a function that minimizes the regularized loss:
f_S = argmin_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H
Can we justify Tikhonov regularization from a probabilistic point of view?
3 The Plan
- Bayesian estimation basics
- Bayesian interpretation of ERM
- Bayesian interpretation of linear RLS
- Bayesian interpretation of kernel RLS
- Transductive model
- Infinite dimensions = weird
4 Some notation
S = {(x_i, y_i)}_{i=1}^n is the set of observed input/output pairs in R^d × R (the training set).
X and Y denote the matrices [x_1, …, x_n]^T ∈ R^{n×d} and [y_1, …, y_n]^T ∈ R^n, respectively.
θ is a vector of parameters in R^p.
p(Y|X, θ) is the joint distribution over the outputs Y, given the inputs X and the parameters.
5 Estimation
The setup: a model relates observed quantities (say α) to an unobserved quantity (say β). We want an estimator: a map from the observed data α back to an estimate of the unobserved β. Nothing new yet...
Estimator: β ∈ B is unobserved, α ∈ A is observed. An estimator for β is a function β̂ : A → B such that β̂(α) is an estimate of β given an observation α.
6 Estimation
Tikhonov regularization fits in the estimation framework:
f_S = argmin_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_H
Regression model: y_i = f(x_i) + ε_i, with ε_i ~ N(0, σ_ε²).
7 Bayesian Estimation
Difference: a Bayesian model specifies the joint distribution p(β, α), usually via a measurement model p(α|β) and a prior p(β).
Bayesian model: β is unobserved, α is observed, and
p(β, α) = p(α|β) p(β)
8 ERM as a Maximum Likelihood Estimator (Linear)
Empirical risk minimization:
f_S(x) = x^T θ̂_ERM(S), θ̂_ERM(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)²
Measurement model:
Y|X, θ ~ N(Xθ, σ_ε² I)
X is fixed/non-random; θ is unknown.
10 ERM as a Maximum Likelihood Estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
We want to estimate θ. We can do this without defining a prior on θ: maximize the likelihood, i.e. the probability of the observations.
Likelihood: the likelihood of any fixed parameter vector θ is
L(θ) = p(Y|X, θ)
Note: we always condition on X.
11 ERM as a Maximum Likelihood Estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
Likelihood:
L(θ) = N(Y; Xθ, σ_ε² I) ∝ exp(−(1/(2σ_ε²)) ‖Y − Xθ‖²)
So the maximum likelihood estimator is ERM:
argmax_θ L(θ) = argmin_θ ‖Y − Xθ‖²
13 Really?
Yes: minimizing ‖Y − Xθ‖² is the same as maximizing exp(−(1/(2σ_ε²)) ‖Y − Xθ‖²).
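The equivalence above is easy to check numerically. Below is a minimal sketch (synthetic data and names of my own choosing, not from the slides) showing that the maximizer of the Gaussian likelihood is exactly the least-squares / ERM solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the measurement model Y = X theta + eps, eps ~ N(0, sigma^2 I).
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=n)

# Maximizing the Gaussian likelihood is minimizing ||Y - X theta||^2,
# so the least-squares solution is the maximum likelihood estimate.
theta_ml = np.linalg.lstsq(X, Y, rcond=None)[0]

# The same minimizer via the normal equations X^T X theta = X^T Y.
theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(theta_ml, theta_ne))  # the two coincide
```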
15 What about regularization?
Linear regularized least squares:
f_S(x) = x^T θ̂, θ̂_RLS(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖²
Is there a model of Y and θ that yields linear RLS? Yes:
exp(−(1/(2σ_ε²)) (Σ_{i=1}^n (y_i − x_i^T θ)² + λn ‖θ‖²)) ∝ p(Y|X, θ) p(θ)
17 What about regularization?
Linear regularized least squares:
f_S(x) = x^T θ̂, θ̂_RLS(S) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖²
Is there a model of Y and θ that yields linear RLS? Yes, factoring the exponential:
exp(−(1/(2σ_ε²)) Σ_{i=1}^n (y_i − x_i^T θ)²) · exp(−(λn/(2σ_ε²)) ‖θ‖²) ∝ p(Y|X, θ) · p(θ)
The first factor is the measurement model p(Y|X, θ); the second is the prior p(θ).
19 Linear RLS as a MAP estimator
Measurement model: Y|X, θ ~ N(Xθ, σ_ε² I).
Add a prior: θ ~ N(0, I).
Matching the RLS objective then requires σ_ε² = λn.
How do we estimate θ?
20 The Bayesian method
Take p(Y|X, θ) and p(θ). Apply Bayes' rule to get the posterior:
p(θ|X, Y) = p(Y|X, θ) p(θ) / p(Y|X) = p(Y|X, θ) p(θ) / ∫ p(Y|X, θ) p(θ) dθ
Use the posterior to estimate θ.
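As a sanity check on Bayes' rule, the sketch below (a hypothetical scalar model of my own choosing, not from the slides) computes the posterior two ways: by normalizing likelihood × prior on a grid, and by the conjugate closed form for a Gaussian prior and Gaussian likelihood. The two posterior means agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scalar model: y_i = theta + eps_i, eps_i ~ N(0, sigma2), prior theta ~ N(0, 1).
sigma2 = 0.5
y = 1.3 + np.sqrt(sigma2) * rng.normal(size=20)
n = y.size

# Bayes' rule on a grid: posterior proportional to likelihood x prior, normalized numerically.
grid = np.linspace(-5.0, 5.0, 20001)
dx = grid[1] - grid[0]
log_post = -0.5 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0) / sigma2 - 0.5 * grid ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx
mean_grid = (grid * post).sum() * dx

# Conjugate closed form: Gaussian likelihood + Gaussian prior -> Gaussian posterior.
post_var = 1.0 / (n / sigma2 + 1.0)
post_mean = post_var * y.sum() / sigma2

print(mean_grid, post_mean)  # the two agree to grid precision
```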
21 Estimators that use the posterior
Bayes least squares estimator: the Bayes least squares estimator for θ, given the observed Y, is
θ̂_BLS(Y|X) = E[θ | X, Y]
i.e. the mean of the posterior.
Maximum a posteriori estimator: the MAP estimator for θ, given the observed Y, is
θ̂_MAP(Y|X) = argmax_θ p(θ|X, Y)
i.e. a mode of the posterior.
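For a Gaussian posterior the mean and the mode coincide, so BLS = MAP; in general they differ. A small illustration with a hypothetical Beta posterior (an example of my own choosing, using the standard Beta mean and mode formulas):

```python
import numpy as np

# Hypothetical Beta(2, 5) posterior (e.g. from coin flips with a Beta prior):
# the mean (BLS) and the mode (MAP) are different estimates.
a, b = 2.0, 5.0
bls = a / (a + b)              # posterior mean: 2/7
map_ = (a - 1) / (a + b - 2)   # posterior mode: 1/5

# Check the mode numerically against the (unnormalized) Beta density.
t = np.linspace(1e-6, 1 - 1e-6, 100001)
dens = t ** (a - 1) * (1 - t) ** (b - 1)
print(bls, map_, t[np.argmax(dens)])  # mean and mode do not coincide here
```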
23 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I).
Joint over Y and θ:
[Y; θ] | X ~ N([0; 0], [[XX^T + σ_ε² I, X], [X^T, I]])
Condition on the observed Y.
24 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I).
Posterior:
θ | X, Y ~ N(μ_{θ|X,Y}, Σ_{θ|X,Y})
where
μ_{θ|X,Y} = X^T (XX^T + σ_ε² I)^{-1} Y
Σ_{θ|X,Y} = I − X^T (XX^T + σ_ε² I)^{-1} X
This is Gaussian, so
θ̂_MAP(Y|X) = θ̂_BLS(Y|X) = X^T (XX^T + σ_ε² I)^{-1} Y
26 Linear RLS as a MAP estimator
Model: Y|X, θ ~ N(Xθ, σ_ε² I), θ ~ N(0, I), so that
θ̂_MAP(Y|X) = X^T (XX^T + σ_ε² I)^{-1} Y
Recall the linear RLS solution:
θ̂_RLS(Y|X) = argmin_θ (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ‖θ‖² = X^T (XX^T + λn I)^{-1} Y
The two coincide when σ_ε² = λn. How do we write the estimated function?
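The identity between the MAP estimate and the RLS solution can be verified directly. The sketch below uses synthetic data and assumes the correspondence σ_ε² = λn (matching the 1/n factor in the empirical loss); it compares the posterior-mean formula with the primal ridge solution:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 0.1
sigma2 = n * lam  # noise variance matching the RLS tradeoff: sigma_eps^2 = lambda * n

# MAP estimate (posterior mean): X^T (X X^T + sigma2 I)^{-1} Y.
theta_map = X.T @ np.linalg.solve(X @ X.T + sigma2 * np.eye(n), Y)

# RLS / ridge solution of the primal problem: (X^T X + lambda n I)^{-1} X^T Y.
theta_rls = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)

print(np.allclose(theta_map, theta_rls))  # equal, by the push-through identity
```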
28 What about Kernel RLS?
We can use basically the same trick to derive kernel RLS. How?
f_S = argmin_{f∈H_K} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_K}
Feature space: f(x) = ⟨θ, φ(x)⟩_F, giving
exp(−(1/(2σ_ε²)) (Σ_{i=1}^n (y_i − φ(x_i)^T θ)² + λn θ^T θ))
The feature space must be finite-dimensional.
30 What about Kernel RLS?
We can use basically the same trick to derive kernel RLS. How?
f_S = argmin_{f∈H_K} (1/n) Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_{H_K}
Feature space: f(x) = ⟨θ, φ(x)⟩_F, and factoring as before:
exp(−(1/(2σ_ε²)) Σ_{i=1}^n (y_i − φ(x_i)^T θ)²) · exp(−(λn/(2σ_ε²)) θ^T θ)
i.e. a measurement model times a prior. The feature space must be finite-dimensional.
32 More notation
φ(X) = [φ(x_1), …, φ(x_n)]^T
K(X, X) is the kernel matrix: [K(X, X)]_ij = K(x_i, x_j)
K(x, X) = [K(x, x_1), …, K(x, x_n)]
f(X) = [f(x_1), …, f(x_n)]^T
33 What about Kernel RLS?
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Then:
θ̂_MAP(Y|X) = φ(X)^T (φ(X)φ(X)^T + σ_ε² I)^{-1} Y
What is φ(X)φ(X)^T? It's K(X, X).
35 What about Kernel RLS?
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Then:
θ̂_MAP(Y|X) = φ(X)^T (K(X, X) + σ_ε² I)^{-1} Y
Estimated function?
f̂_MAP(x) = φ(x)^T θ̂_MAP(Y|X)
= φ(x)^T φ(X)^T (K(X, X) + σ_ε² I)^{-1} Y
= K(x, X) (K(X, X) + λn I)^{-1} Y
= f̂_RLS(x)
(using σ_ε² = λn).
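The kernel step above can be checked with a kernel whose feature map is explicitly finite-dimensional. The sketch below uses the polynomial kernel K(x, z) = (1 + xz)² with feature map φ(x) = [1, √2·x, x²] (an illustrative choice of my own, not from the slides), verifying both that φ(X)φ(X)^T = K(X, X) and that the feature-space MAP estimate and the kernel-only formula give identical predictions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Explicit finite-dimensional feature map for K(x, z) = (1 + x z)^2 on 1-D inputs.
def phi(x):
    return np.stack([np.ones_like(x), np.sqrt(2.0) * x, x ** 2], axis=-1)

def kern(a, b):
    return (1.0 + np.outer(a, b)) ** 2

x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=30)
sigma2 = 0.5

Phi = phi(x)    # n x p feature matrix phi(X)
K = kern(x, x)  # kernel matrix K(X, X); equals Phi @ Phi.T

# Feature-space MAP estimate, then evaluate f(x*) = phi(x*)^T theta_MAP ...
theta_map = Phi.T @ np.linalg.solve(K + sigma2 * np.eye(x.size), y)
xs = np.linspace(-1, 1, 7)
f_feat = phi(xs) @ theta_map

# ... versus the kernel-only formula f(x*) = K(x*, X)(K(X, X) + sigma2 I)^{-1} Y.
f_kern = kern(xs, x) @ np.linalg.solve(K + sigma2 * np.eye(x.size), y)
print(np.allclose(Phi @ Phi.T, K), np.allclose(f_feat, f_kern))
```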
37 A prior over functions
Model: Y|X, θ ~ N(φ(X)θ, σ_ε² I), θ ~ N(0, I).
Can we write this as a prior on H_K?
38 A prior over functions
Remember Mercer's theorem:
K(x_i, x_j) = Σ_k ν_k ψ_k(x_i) ψ_k(x_j)
where ν_k ψ_k(·) = ∫ K(·, y) ψ_k(y) dy for all k. The functions {√ν_k ψ_k(·)} form an orthonormal basis for H_K.
Let φ(·) = [√ν_1 ψ_1(·), …, √ν_p ψ_p(·)]. Then:
H_K = {θ^T φ(·) | θ ∈ R^p}
39 A prior over functions
We showed: when θ ~ N(0, I),
f̂_MAP(·) = θ̂_MAP(Y|X)^T φ(·) = f̂_RLS(·)
Taking φ(·) = [√ν_1 ψ_1, …, √ν_p ψ_p], the prior θ ~ N(0, I) is equivalently a prior on the functions f(·) = θ^T φ(·); i.e., the functions in H_K are Gaussian distributed:
p(f) ∝ exp(−½ ‖f‖²_{H_K}) = exp(−½ θ^T θ)
Note: again we need H_K to be finite-dimensional.
40 A prior over functions
So: p(f) ∝ exp(−½ ‖f‖²_{H_K}) corresponds to θ ~ N(0, I), and f̂_MAP = f̂_RLS.
Assuming p(f) ∝ exp(−½ ‖f‖²_{H_K}),
p(f|X, Y) = p(Y|X, f) p(f) / p(Y|X)
∝ exp(−(1/(2σ_ε²)) ‖Y − f(X)‖²) · exp(−½ ‖f‖²_{H_K})
= exp(−(1/(2σ_ε²)) ‖Y − f(X)‖² − ½ ‖f‖²_{H_K})
41 A quick recap
We wanted to know if RLS has a probabilistic interpretation.
Empirical risk minimization is ML:
p(Y|X, θ) ∝ e^{−(1/(2σ_ε²)) ‖Y − Xθ‖²}
Linear RLS is MAP:
p(Y, θ|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − Xθ‖²} e^{−½ θ^T θ}
Kernel RLS is also MAP:
p(Y, θ|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − φ(X)θ‖²} e^{−½ θ^T θ}
This is equivalent to a Gaussian prior on H_K:
p(Y, f|X) ∝ e^{−(1/(2σ_ε²)) ‖Y − f(X)‖²} e^{−½ ‖f‖²_{H_K}}
But these don't work for infinite-dimensional function spaces...
45 Transductive setting
We hinted at problems if dim H_K = ∞. Idea: forget about estimating θ (i.e. f). Instead, estimate the predicted outputs Y* = [y*_1, …, y*_M]^T at the test inputs X* = [x*_1, …, x*_M]^T.
We need the joint distribution over Y and Y*.
46 Transductive setting
Say Y and Y* are jointly Gaussian:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[Λ_Y, Λ_{YY*}], [Λ_{Y*Y}, Λ_{Y*}]])
We want: kernel RLS. General form for the posterior:
Y* | X*, Y ~ N(μ_{Y*|Y}, Σ_{Y*|Y})
where
μ_{Y*|Y} = μ_{Y*} + Λ_{Y*Y} Λ_Y^{-1} (Y − μ_Y)
Σ_{Y*|Y} = Λ_{Y*} − Λ_{Y*Y} Λ_Y^{-1} Λ_{YY*}
48 Transductive setting
Set Λ_Y = K(X, X) + σ_ε² I, Λ_{YY*} = K(X, X*), Λ_{Y*} = K(X*, X*).
Posterior:
Y* | X*, Y ~ N(μ_{Y*|Y}, Σ_{Y*|Y})
where
μ_{Y*|Y} = μ_{Y*} + K(X*, X) (K(X, X) + σ_ε² I)^{-1} (Y − μ_Y)
Σ_{Y*|Y} = K(X*, X*) − K(X*, X) (K(X, X) + σ_ε² I)^{-1} K(X, X*)
So, with zero means: Ŷ*_MAP = f̂_RLS(X*).
49 Transductive setting
Model:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[K(X, X) + σ_ε² I, K(X, X*)], [K(X*, X), K(X*, X*)]])
The MAP estimate (= posterior mean) is the RLS function at every point x*, regardless of dim H_K.
Are the prior and posterior (on points!) consistent with a distribution on H_K?
50 Transductive setting
Strictly speaking, θ and f don't come into play here at all.
Have: p(Y*|X*, Y). Do not have: p(θ|X, Y) or p(f|X, Y).
But if H_K is finite-dimensional, the joint over Y and Y* is consistent with:
Y = f(X) + ε, Y* = f(X*),
where f ∈ H_K is a random trajectory from a Gaussian process over the domain, with mean μ and covariance K.
(Ergo, people call this "Gaussian process regression." Also "kriging," because of a guy.)
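The transductive posterior above is exactly what Gaussian process regression code computes. A minimal sketch with zero prior means and an assumed squared-exponential kernel (my choice of K and of all names; the slides do not mandate a kernel):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, ell=0.3):
    # Squared-exponential kernel (an assumed choice of K, not from the slides).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)
sigma2 = 0.01

xs = np.linspace(0.0, 1.0, 50)            # test inputs X*
Ky = rbf(x, x) + sigma2 * np.eye(x.size)  # Lambda_Y = K(X, X) + sigma2 I
Ksx = rbf(xs, x)                          # K(X*, X)

# Posterior over Y* given Y (zero prior means):
mu = Ksx @ np.linalg.solve(Ky, y)                      # posterior mean = RLS prediction
cov = rbf(xs, xs) - Ksx @ np.linalg.solve(Ky, Ksx.T)   # posterior covariance
sd = np.sqrt(np.clip(np.diag(cov), 0.0, None))         # for confidence intervals

print(mu.shape, sd.max())  # posterior variance never exceeds the prior variance (1 here)
```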
52 Recap redux
Empirical risk minimization is the maximum likelihood estimator when:
y = x^T θ + ε
Linear RLS is the MAP estimator when:
y = x^T θ + ε, θ ~ N(0, I)
Kernel RLS is the MAP estimator when:
y = φ(x)^T θ + ε, θ ~ N(0, I)
in finite-dimensional H_K.
Kernel RLS is the MAP estimator at points when:
[Y; Y*] ~ N([μ_Y; μ_Y*], [[K(X, X) + σ_ε² I, K(X, X*)], [K(X*, X), K(X*, X*)]])
in possibly infinite-dimensional H_K.
53 Is this useful in practice?
Want confidence intervals + believe the posteriors are meaningful ⇒ yes. Maybe other reasons?
[Figure: Gaussian process regression with confidence intervals, showing the regression function and the observed points.]
54 What is going on with infinite-dimensional H_K?
We wrote down statements like θ ~ N(0, I) and f ~ N(0, I). The space H_K can be written
H_K = {f : ‖f‖_{H_K} < ∞} = {θ^T φ(·) : Σ_{i=1}^∞ θ_i² < ∞}
Difference between finite and infinite dimensions: not every θ yields a function θ^T φ(·) in H_K.
A hint: θ ~ N(0, I) ⇒ E ‖θ^T φ(·)‖²_{H_K} = ∞.
In fact: θ ~ N(0, I) ⇒ θ^T φ(·) ∈ H_K with probability zero.
So be careful out there.
57 A hint that things are amiss
Assume θ ~ N(0, I) and {φ_i}_{i=1}^∞ an orthonormal basis of H_K. Then
E ‖θ^T φ‖²_{H_K} = E ‖Σ_{i=1}^∞ θ_i φ_i‖²_{H_K} = Σ_{i=1}^∞ E θ_i² = Σ_{i=1}^∞ 1 = ∞
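A Monte Carlo sketch of the same computation: truncating the expansion at dimension d, the expected squared norm is d, so it grows without bound as the truncation increases (the dimensions, draw count, and seed are illustrative choices of my own):

```python
import numpy as np

rng = np.random.default_rng(5)

# Truncate theta ~ N(0, I) at dimension d: the expected squared norm of the
# truncated expansion is sum_{i<=d} E theta_i^2 = d, which diverges as d grows.
for d in (10, 100, 1000, 10000):
    theta = rng.normal(size=(200, d))      # 200 Monte Carlo draws of theta
    est = (theta ** 2).sum(axis=1).mean()  # estimate of E ||theta||^2
    print(d, est)                          # est tracks d in every case
```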
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationHierarchical Models & Bayesian Model Selection
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationProbabilistic Graphical Models
Parameter Estimation December 14, 2015 Overview 1 Motivation 2 3 4 What did we have so far? 1 Representations: how do we model the problem? (directed/undirected). 2 Inference: given a model and partially
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationMachine Learning Basics: Stochastic Gradient Descent. Sargur N. Srihari
Machine Learning Basics: Stochastic Gradient Descent Sargur N. srihari@cedar.buffalo.edu 1 Topics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation Sets
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes
CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes Roger Grosse Roger Grosse CSC2541 Lecture 2 Bayesian Occam s Razor and Gaussian Processes 1 / 55 Adminis-Trivia Did everyone get my e-mail
More informationIntroduction into Bayesian statistics
Introduction into Bayesian statistics Maxim Kochurov EF MSU November 15, 2016 Maxim Kochurov Introduction into Bayesian statistics EF MSU 1 / 7 Content 1 Framework Notations 2 Difference Bayesians vs Frequentists
More informationIntroduction to Probabilistic Machine Learning
Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning
More informationBayesian RL Seminar. Chris Mansley September 9, 2008
Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in
More informationDD Advanced Machine Learning
Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationStatistical learning. Chapter 20, Sections 1 4 1
Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationBayesian Inference. Chapter 9. Linear models and regression
Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More information