Econ 2148, fall 2017
Gaussian process priors, reproducing kernel Hilbert spaces, and splines
Maximilian Kasy, Department of Economics, Harvard University
1 / 37

Agenda
6 equivalent representations of the posterior mean in the Normal-Normal model.
Gaussian process priors for regression functions.
Reproducing kernel Hilbert spaces and splines.
Applications from my own work, to 1. optimal treatment assignment in experiments, and 2. optimal insurance and taxation.
2 / 37

Takeaways for this part of class
In a Normal means model with Normal prior, there are a number of equivalent ways to think about regularization: posterior mean, penalized least squares, shrinkage, etc.
We can extend from estimation of means to estimation of functions using Gaussian process priors.
Gaussian process priors yield the same function estimates as penalized least squares regressions.
Theoretical tool: reproducing kernel Hilbert spaces.
Special case: spline regression.
3 / 37

Normal posterior means: equivalent representations
Setup: $\theta \in \mathbb{R}^k$, $X \mid \theta \sim N(\theta, I_k)$.
Loss: $L(\hat\theta, \theta) = \sum_i (\hat\theta_i - \theta_i)^2$.
Prior: $\theta \sim N(0, C)$.
4 / 37

Normal posterior means: equivalent representations
6 equivalent representations of the posterior mean:
1. Minimizer of weighted average risk
2. Minimizer of posterior expected loss
3. Posterior expectation
4. Posterior best linear predictor
5. Penalized least squares estimator
6. Shrinkage estimator
5 / 37

Normal posterior means: equivalent representations
1) Minimizer of weighted average risk
Minimize weighted average risk (= Bayes risk), averaging the loss $L(\hat\theta,\theta) = \sum_i (\hat\theta_i - \theta_i)^2$ over both
1. the sampling distribution $f_{X\mid\theta}$, and
2. weighting values of $\theta$ using the decision weights (prior) $\pi_\theta$.
Formally,
$\hat\theta(\cdot) = \operatorname{argmin}_{t(\cdot)} \int E_\theta[L(t(X), \theta)]\, d\pi(\theta).$
6 / 37

Normal posterior means: equivalent representations
2) Minimizer of posterior expected loss
Minimize posterior expected loss, averaging the loss $L(\hat\theta,\theta) = \sum_i (\hat\theta_i - \theta_i)^2$ over just the posterior distribution $\pi_{\theta\mid X}$.
Formally,
$\hat\theta(x) = \operatorname{argmin}_{t} \int L(t, \theta)\, d\pi_{\theta\mid X}(\theta \mid x).$
7 / 37

Normal posterior means: equivalent representations
3 and 4) Posterior expectation and posterior best linear predictor
Note that
$\begin{pmatrix} X \\ \theta \end{pmatrix} \sim N\left(0, \begin{pmatrix} C + I & C \\ C & C \end{pmatrix}\right).$
Posterior expectation: $\hat\theta = E[\theta \mid X]$.
Posterior best linear predictor: $\hat\theta = E^*[\theta \mid X] = C\,(C + I)^{-1} X.$
8 / 37

Normal posterior means: equivalent representations
5) Penalization
Minimize
1. the sum of squared residuals,
2. plus a quadratic penalty term.
Formally,
$\hat\theta = \operatorname{argmin}_{t} \sum_{i=1}^{n} (X_i - t_i)^2 + \|t\|^2,$
where $\|t\|^2 = t' C^{-1} t.$
9 / 37

Normal posterior means: equivalent representations
6) Shrinkage
Diagonalize C: find
1. an orthonormal matrix U of eigenvectors, and
2. a diagonal matrix D of eigenvalues,
so that $C = U D U'$.
Change of coordinates, using U: $\tilde X = U' X$, $\tilde\theta = U' \theta$.
Componentwise shrinkage in the new coordinates:
$\hat{\tilde\theta}_i = \frac{d_i}{d_i + 1}\, \tilde X_i. \quad (1)$
10 / 37
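To make the equivalence concrete, here is a small numerical sketch (my own, not part of the original slides) that draws a random prior covariance C and one observation X, and checks that the posterior best linear predictor, the penalized least squares solution, and componentwise shrinkage in the eigenbasis of C all return the same estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# Prior covariance C (random symmetric positive definite) and one draw of X | theta ~ N(theta, I_k).
A = rng.normal(size=(k, k))
C = A @ A.T + np.eye(k)
theta = rng.multivariate_normal(np.zeros(k), C)
X = theta + rng.normal(size=k)

# (4) Posterior best linear predictor: theta_hat = C (C + I)^{-1} X.
theta_blp = C @ np.linalg.solve(C + np.eye(k), X)

# (5) Penalized least squares: argmin_t ||X - t||^2 + t' C^{-1} t, closed form (I + C^{-1})^{-1} X.
theta_pls = np.linalg.solve(np.eye(k) + np.linalg.inv(C), X)

# (6) Componentwise shrinkage in the eigenbasis of C = U D U'.
d, U = np.linalg.eigh(C)
X_tilde = U.T @ X
theta_shrunk = U @ (d / (d + 1) * X_tilde)

print(np.allclose(theta_blp, theta_pls), np.allclose(theta_blp, theta_shrunk))  # True True
```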

Normal posterior means: equivalent representations
Practice problem
Show that these 6 objects are all equivalent to each other.
11 / 37

Normal posterior means: equivalent representations
Solution (sketch)
1. Minimizer of weighted average risk = minimizer of posterior expected loss: see decision slides.
2. Minimizer of posterior expected loss = posterior expectation: first order condition for the quadratic loss function, pull the derivative inside, and switch the order of integration.
3. Posterior expectation = posterior best linear predictor: X and θ are jointly Normal, and conditional expectations for multivariate Normals are linear.
4. Posterior expectation = penalized least squares: the posterior is symmetric and unimodal, so the posterior mean is the posterior mode. Posterior mode = maximizer of the posterior log-likelihood = maximizer of the joint log-likelihood, since the denominator $f_X$ does not depend on θ.
12 / 37

Normal posterior means: equivalent representations
Solution (sketch), continued
5. Penalized least squares = posterior expectation: any penalty of the form $t' A t$ for A symmetric positive definite corresponds to the log of a Normal prior $\theta \sim N(0, A^{-1})$.
6. Componentwise shrinkage = posterior best linear predictor: the change of coordinates turns $\hat\theta = C\,(C + I)^{-1} X$ into $\hat{\tilde\theta} = D\,(D + I)^{-1} \tilde X$. Diagonality implies $D\,(D + I)^{-1} = \operatorname{diag}\left(\frac{d_i}{d_i + 1}\right)$.
13 / 37

Gaussian process regression
Gaussian processes for machine learning: a machine learning / metrics dictionary
machine learning : metrics
supervised learning : regression
features : regressors
weights : coefficients
bias : intercept
14 / 37

Gaussian process regression
Gaussian prior for linear regression
Normal linear regression model: suppose we observe n i.i.d. draws of $(Y_i, X_i)$, where $Y_i$ is real valued and $X_i$ is a k-vector.
$Y_i = X_i' \beta + \varepsilon_i$
$\varepsilon_i \mid X, \beta \sim N(0, \sigma^2)$
$\beta \mid X \sim N(0, \Omega)$ (prior)
Note: we will leave conditioning on X implicit in the following slides.
15 / 37

Gaussian process regression
Practice problem ("weight space view")
Find the posterior expectation of β. Find the posterior expectation of $x'\beta$ for some (non-random) point x.
Hints:
1. The posterior expectation is the maximum a posteriori.
2. The log likelihood takes a penalized least squares form.
16 / 37

Gaussian process regression
Solution
Joint log likelihood of $(Y, \beta)$:
$\log(f_{Y,\beta}) = \log(f_{Y\mid\beta}) + \log(f_\beta) = \text{const.} - \frac{1}{2\sigma^2} \sum_i (Y_i - X_i'\beta)^2 - \frac{1}{2}\, \beta' \Omega^{-1} \beta.$
First order condition for the maximum a posteriori:
$0 = \frac{\partial}{\partial\beta} \log(f_{Y,\beta}) = \frac{1}{\sigma^2} \sum_i (Y_i - X_i'\beta)\, X_i' - \beta' \Omega^{-1}.$
Thus
$\hat\beta = \left( \sum_i X_i X_i' + \sigma^2 \Omega^{-1} \right)^{-1} \sum_i X_i Y_i,$
$E[x'\beta \mid Y] = x'\hat\beta = x' \left( X'X + \sigma^2 \Omega^{-1} \right)^{-1} X'Y.$
17 / 37
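As a quick illustration (my own sketch, with an assumed prior covariance Ω and noise variance), the following simulates the Normal linear regression model and computes the posterior mean $\hat\beta = (X'X + \sigma^2\Omega^{-1})^{-1}X'Y$, comparing it to OLS; the prior pulls the estimate toward zero, ridge-style.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2 = 50, 3, 1.0

Omega = np.eye(k)                      # prior covariance of beta (assumed for illustration)
beta = rng.multivariate_normal(np.zeros(k), Omega)
X = rng.normal(size=(n, k))
Y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)

# Posterior mean / MAP: (X'X + sigma^2 Omega^{-1})^{-1} X'Y
beta_post = np.linalg.solve(X.T @ X + sigma2 * np.linalg.inv(Omega), X.T @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

print("posterior mean:", beta_post)
print("OLS:           ", beta_ols)     # ridge-type shrinkage toward the prior mean of zero
```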

Gaussian process regression
The previous derivation required inverting a k × k matrix. We can instead do prediction by inverting an n × n matrix. n might be smaller than k if there are many features. This will lead to a "function space view" of prediction.
Practice problem ("kernel trick")
Find the posterior expectation of $f(x) = E[Y \mid X = x] = x'\beta$. Wait, didn't we just do that?
Hints:
1. Start by figuring out the variance / covariance matrix of $(x'\beta, Y)$.
2. Then deduce the best linear predictor of $x'\beta$ given Y.
18 / 37

Gaussian process regression
Solution
The joint distribution of $(x'\beta, Y)$ is given by
$\begin{pmatrix} x'\beta \\ Y \end{pmatrix} \sim N\left(0, \begin{pmatrix} x'\Omega x & x'\Omega X' \\ X\Omega x & X\Omega X' + \sigma^2 I_n \end{pmatrix}\right).$
Denote $C = X\Omega X'$ and $c(x) = X\Omega x$. Then
$E[x'\beta \mid Y] = c(x)' (C + \sigma^2 I_n)^{-1} Y.$
Contrast with the previous representation:
$E[x'\beta \mid Y] = x' (X'X + \sigma^2 \Omega^{-1})^{-1} X'Y.$
19 / 37
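The two formulas on this slide can be checked numerically. The sketch below (my own, with an assumed prior Ω, noise variance, and prediction point) computes $E[x'\beta \mid Y]$ both ways: the weight space formula solves a k × k system, the function space formula an n × n system, and the two predictions coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma2 = 30, 4, 0.5

Omega = 2.0 * np.eye(k)                        # prior covariance of beta (assumed)
X = rng.normal(size=(n, k))
Y = X @ rng.multivariate_normal(np.zeros(k), Omega) + np.sqrt(sigma2) * rng.normal(size=n)
x = rng.normal(size=k)                         # prediction point (assumed)

# Weight space view: x' (X'X + sigma^2 Omega^{-1})^{-1} X'Y   (k x k system)
pred_weight = x @ np.linalg.solve(X.T @ X + sigma2 * np.linalg.inv(Omega), X.T @ Y)

# Function space view: c(x)' (C + sigma^2 I_n)^{-1} Y, with C = X Omega X', c(x) = X Omega x   (n x n system)
Cmat = X @ Omega @ X.T
cx = X @ Omega @ x
pred_function = cx @ np.linalg.solve(Cmat + sigma2 * np.eye(n), Y)

print(pred_weight, pred_function)              # identical up to numerical error
```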

Gaussian process regression
General GP regression
Suppose we observe n i.i.d. draws of $(Y_i, X_i)$, where $Y_i$ is real valued and $X_i$ is a k-vector.
$Y_i = f(X_i) + \varepsilon_i$
$\varepsilon_i \mid X, f(\cdot) \sim N(0, \sigma^2)$
Prior: f is distributed according to a Gaussian process,
$f \mid X \sim GP(0, C),$
where C is a covariance kernel, $\operatorname{Cov}(f(x), f(x') \mid X) = C(x, x')$.
We will again leave conditioning on X implicit in the following slides.
20 / 37

Gaussian process regression
Practice problem
Find the posterior expectation of f(x).
Hints:
1. Start by figuring out the variance / covariance matrix of $(f(x), Y)$.
2. Then deduce the best linear predictor of f(x) given Y.
21 / 37

Gaussian process regression
Solution
The joint distribution of $(f(x), Y)$ is given by
$\begin{pmatrix} f(x) \\ Y \end{pmatrix} \sim N\left(0, \begin{pmatrix} C(x,x) & c(x)' \\ c(x) & C + \sigma^2 I_n \end{pmatrix}\right),$
where $c(x)$ is the n-vector with entries $C(x, X_i)$, and C is the n × n matrix with entries $C_{i,j} = C(X_i, X_j)$. Then, as before,
$E[f(x) \mid Y] = c(x)' (C + \sigma^2 I_n)^{-1} Y.$
Read: $\hat f(\cdot) = E[f(\cdot) \mid Y]$ is a linear combination of the functions $C(\cdot, X_i)$ with weights $(C + \sigma^2 I_n)^{-1} Y$.
22 / 37
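Here is a minimal numerical sketch of this posterior mean formula (my own; the kernel, its hyperparameters, and the simulated data are assumptions chosen for illustration, using a squared exponential kernel like the one on the next slide).

```python
import numpy as np

def sqexp_kernel(x1, x2, l=0.2, tau2=1.0):
    """Squared exponential covariance kernel (hyperparameters assumed)."""
    return tau2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l)

rng = np.random.default_rng(3)
n, sigma2 = 40, 0.1

X = np.sort(rng.uniform(0, 1, size=n))
Y = np.sin(4 * np.pi * X) + np.sqrt(sigma2) * rng.normal(size=n)

x_grid = np.linspace(0, 1, 200)
C = sqexp_kernel(X, X)                      # n x n matrix with entries C(X_i, X_j)
c = sqexp_kernel(x_grid, X)                 # each row is c(x)' for one grid point x

# Posterior mean: E[f(x) | Y] = c(x)' (C + sigma^2 I_n)^{-1} Y
weights = np.linalg.solve(C + sigma2 * np.eye(n), Y)
f_hat = c @ weights

print(f_hat[:5])
```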

Gaussian process regression
Hyperparameters and marginal likelihood
Usually, the covariance kernel $C(\cdot,\cdot)$ depends on hyperparameters η.
Example: squared exponential kernel with $\eta = (l, \tau^2)$ (length-scale l, variance $\tau^2$),
$C(x, x') = \tau^2 \exp\left( -\tfrac{1}{2l} \|x - x'\|^2 \right).$
Following the empirical Bayes paradigm, we can estimate η by maximizing the marginal log likelihood:
$\hat\eta = \operatorname{argmax}_\eta\; -\tfrac{1}{2} \log\det(C_\eta + \sigma^2 I) - \tfrac{1}{2}\, Y' (C_\eta + \sigma^2 I)^{-1} Y.$
Alternatively, we could choose η using cross-validation or Stein's unbiased risk estimate.
23 / 37
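A small sketch of the empirical Bayes step (my own): evaluate the Gaussian marginal log likelihood of Y over a grid of assumed hyperparameter values and pick the maximizer. In practice one would typically use a gradient-based optimizer; the grid search here only illustrates the criterion.

```python
import numpy as np

def sqexp_kernel(x1, x2, l, tau2):
    # Squared exponential kernel, parameterized as on the slide: tau^2 exp(-||x - x'||^2 / (2 l))
    return tau2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l)

def marginal_loglik(Y, X, l, tau2, sigma2):
    n = len(Y)
    K = sqexp_kernel(X, X, l, tau2) + sigma2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * logdet - 0.5 * Y @ np.linalg.solve(K, Y)

rng = np.random.default_rng(4)
n, sigma2 = 40, 0.1
X = rng.uniform(0, 1, size=n)
Y = np.sin(4 * np.pi * X) + np.sqrt(sigma2) * rng.normal(size=n)

# Candidate values of eta = (l, tau^2) chosen arbitrarily for illustration.
grid = [(l, tau2) for l in (0.01, 0.05, 0.2, 1.0) for tau2 in (0.5, 1.0, 2.0)]
eta_hat = max(grid, key=lambda eta: marginal_loglik(Y, X, eta[0], eta[1], sigma2))
print("selected (l, tau2):", eta_hat)
```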

Splines and Reproducing Kernel Hilbert Spaces
Penalized least squares: for some (semi-)norm $\|f\|$,
$\hat f = \operatorname{argmin}_f \sum_i (Y_i - f(X_i))^2 + \lambda \|f\|^2.$
Leading case: splines, e.g.,
$\hat f = \operatorname{argmin}_f \sum_i (Y_i - f(X_i))^2 + \lambda \int f''(x)^2\, dx.$
Can we think of penalized regressions in terms of a prior? If so, what is the prior distribution?
24 / 37

Splines and Reproducing Kernel Hilbert Spaces
The finite dimensional case
Consider the finite dimensional analog to penalized regression:
$\hat\theta = \operatorname{argmin}_t \sum_{i=1}^{n} (X_i - t_i)^2 + \|t\|_C^2,$
where $\|t\|_C^2 = t' C^{-1} t$.
We saw before that this is the posterior mean when $X \mid \theta \sim N(\theta, I_k)$, $\theta \sim N(0, C)$.
25 / 37

Splines and Reproducing Kernel Hilbert Spaces
The reproducing property
The norm $\|t\|_C$ corresponds to the inner product $\langle t, s \rangle_C = t' C^{-1} s$.
Let $C_i = (C_{i1}, \ldots, C_{ik})$. Then, for any vector y,
$\langle C_i, y \rangle_C = y_i.$
Practice problem: verify this.
26 / 37
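A two-line numerical check of this finite dimensional reproducing property (my own sketch): with $\langle t, s \rangle_C = t' C^{-1} s$ and $C_i$ the i-th row of C, we get $\langle C_i, y \rangle_C = (C C^{-1} y)_i = y_i$.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4
A = rng.normal(size=(k, k))
C = A @ A.T + np.eye(k)                 # symmetric positive definite
y = rng.normal(size=k)

inner = C @ np.linalg.inv(C) @ y        # i-th entry is <C_i, y>_C = C_i' C^{-1} y
print(np.allclose(inner, y))            # True: the reproducing property
```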

Splines and Reproducing Kernel Hilbert Spaces
Reproducing kernel Hilbert spaces
Now consider a general Hilbert space of functions equipped with an inner product $\langle \cdot, \cdot \rangle$ and corresponding norm $\|\cdot\|$, such that for all x there exists an $M_x$ such that for all f,
$|f(x)| \le M_x \|f\|.$
Read: function evaluation is continuous with respect to the norm.
Hilbert spaces with this property are called reproducing kernel Hilbert spaces (RKHS). Note that $L^2$ spaces are not RKHS in general!
27 / 37

Splines and Reproducing Kernel Hilbert Spaces
The reproducing kernel
Riesz representation theorem: for every continuous linear functional L on a Hilbert space H, there exists a $g_L \in H$ such that for all $f \in H$,
$L(f) = \langle g_L, f \rangle.$
Applied to function evaluation on an RKHS:
$f(x) = \langle C_x, f \rangle.$
Define the reproducing kernel:
$C(x_1, x_2) = \langle C_{x_1}, C_{x_2} \rangle.$
By construction:
$C(x_1, x_2) = C_{x_1}(x_2) = C_{x_2}(x_1).$
28 / 37

Splines and Reproducing Kernel Hilbert Spaces
Practice problem
Show that $C(\cdot,\cdot)$ is positive semi-definite, i.e., for any $(x_1, \ldots, x_k)$ and $(a_1, \ldots, a_k)$,
$\sum_{i,j} a_i a_j C(x_i, x_j) \ge 0.$
Given a positive definite kernel $C(\cdot,\cdot)$, construct a corresponding Hilbert space.
29 / 37

Splines and Reproducing Kernel Hilbert Spaces
Solution
Positive semi-definiteness:
$\sum_{i,j} a_i a_j C(x_i, x_j) = \sum_{i,j} a_i a_j \langle C_{x_i}, C_{x_j} \rangle = \left\langle \sum_i a_i C_{x_i}, \sum_j a_j C_{x_j} \right\rangle = \left\| \sum_i a_i C_{x_i} \right\|^2 \ge 0.$
Construction of the Hilbert space: take linear combinations of the functions $C(x, \cdot)$ (and their limits) with inner product
$\left\langle \sum_i a_i C(x_i, \cdot), \sum_j b_j C(y_j, \cdot) \right\rangle_C = \sum_{i,j} a_i b_j C(x_i, y_j).$
30 / 37

Splines and Reproducing Kernel Hilbert Spaces
Kolmogorov consistency theorem: for a positive definite kernel $C(\cdot,\cdot)$ we can always define a corresponding prior
$f \sim GP(0, C).$
Recap: for each regression penalty, when function evaluation is continuous w.r.t. the penalty norm, there exists a corresponding prior.
Next: the solution to the penalized regression problem is the posterior mean for this prior.
31 / 37

Splines and Reproducing Kernel Hilbert Spaces
Solution to penalized regression
Let $\hat f$ be the solution to the penalized regression
$\hat f = \operatorname{argmin}_f \sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_C^2.$
Practice problem
Show that the solution to the penalized regression has the form
$\hat f(x) = c(x)' (C + n\lambda I)^{-1} Y,$
where $C_{ij} = C(X_i, X_j)$ and $c(x) = (C(X_1, x), \ldots, C(X_n, x))$.
Hints:
Write $\hat f(\cdot) = \sum_i a_i C(X_i, \cdot) + \rho(\cdot)$, where ρ is orthogonal to $C(X_i, \cdot)$ for all i. Show that ρ = 0. Solve the resulting least squares problem in $a_1, \ldots, a_n$.
32 / 37

Splines and Reproducing Kernel Hilbert Spaces
Solution
Using the reproducing property, the objective can be written as
$\sum_i (Y_i - f(X_i))^2 + \lambda \|f\|_C^2$
$= \sum_i (Y_i - \langle C(X_i, \cdot), f \rangle)^2 + \lambda \|f\|_C^2$
$= \sum_i \left( Y_i - \left\langle C(X_i, \cdot), \sum_j a_j C(X_j, \cdot) + \rho \right\rangle \right)^2 + \lambda \left\| \sum_i a_i C(X_i, \cdot) + \rho \right\|_C^2$
$= \sum_i \left( Y_i - \sum_j a_j C(X_i, X_j) \right)^2 + \lambda \left( \sum_{i,j} a_i a_j C(X_i, X_j) + \|\rho\|_C^2 \right)$
$= \|Y - Ca\|^2 + \lambda \left( a'Ca + \|\rho\|_C^2 \right).$
Given a, this is minimized by setting ρ = 0. Now solve the quadratic program using first order conditions.
33 / 37
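The representer-theorem solution can be coded directly. The sketch below (my own; the squared exponential kernel from the earlier slide is an arbitrary choice of $C(\cdot,\cdot)$, and the data and λ are assumptions) computes $\hat f(x) = c(x)'(C + n\lambda I)^{-1}Y$ in the form stated on the practice-problem slide; the $n\lambda$ scaling matches the convention in which the squared residuals are averaged over n, as in Wahba (1990).

```python
import numpy as np

def sqexp_kernel(x1, x2, l=0.2, tau2=1.0):
    # An arbitrary positive definite kernel C(.,.) used for illustration.
    return tau2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l)

def penalized_fit(X, Y, lam, kernel):
    """Return a function x -> f_hat(x) = c(x)' (C + n*lam*I)^{-1} Y."""
    n = len(Y)
    C = kernel(X, X)
    a = np.linalg.solve(C + n * lam * np.eye(n), Y)   # weights on the functions C(., X_i)
    return lambda x_new: kernel(np.atleast_1d(x_new), X) @ a

rng = np.random.default_rng(6)
n = 50
X = rng.uniform(0, 1, size=n)
Y = np.sin(4 * np.pi * X) + 0.3 * rng.normal(size=n)

f_hat = penalized_fit(X, Y, lam=1e-3, kernel=sqexp_kernel)
print(f_hat(np.array([0.1, 0.5, 0.9])))
```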

Splines and Reproducing Kernel Hilbert Spaces
Splines
Now what about the spline penalty $\int f''(x)^2\, dx$? Is function evaluation continuous for this norm?
Yes, if we restrict attention to functions such that $f(0) = f'(0) = 0$. The penalty is a semi-norm that equals 0 for all linear functions.
It corresponds to the GP prior with
$C(x_1, x_2) = \frac{x_1 x_2^2}{2} - \frac{x_2^3}{6}$ for $x_2 \le x_1$.
This is in fact the covariance of integrated Brownian motion!
34 / 37
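To see this kernel in action, the sketch below (my own) builds the integrated Brownian motion covariance $C(x_1, x_2) = \frac{x_1 x_2^2}{2} - \frac{x_2^3}{6}$ for $x_2 \le x_1$ (and symmetrically otherwise) and plugs it into the GP posterior mean formula from the earlier slides; the points are assumed to lie in [0, 1], and the simulated target function satisfies $f(0) = f'(0) = 0$, since the linear part would otherwise have to be added back as on the next slide.

```python
import numpy as np

def spline_kernel(x1, x2):
    """Integrated Brownian motion covariance: C(s, t) = s*t^2/2 - t^3/6 for t <= s (symmetric)."""
    s = np.maximum(x1[:, None], x2[None, :])   # larger argument
    t = np.minimum(x1[:, None], x2[None, :])   # smaller argument
    return s * t ** 2 / 2 - t ** 3 / 6

rng = np.random.default_rng(7)
n, sigma2 = 40, 0.05
X = np.sort(rng.uniform(0, 1, size=n))
Y = X ** 2 * np.sin(3 * np.pi * X) + np.sqrt(sigma2) * rng.normal(size=n)   # satisfies f(0) = f'(0) = 0

C = spline_kernel(X, X)
x_grid = np.linspace(0, 1, 100)
c = spline_kernel(x_grid, X)

f_hat = c @ np.linalg.solve(C + sigma2 * np.eye(n), Y)   # GP posterior mean with the spline kernel
print(f_hat[:5])
```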

Splines and Reproducing Kernel Hilbert Spaces
Practice problem
Verify that C is indeed the reproducing kernel for the inner product
$\langle f, g \rangle = \int_0^1 f''(x) g''(x)\, dx.$
Takeaway: spline regression is equivalent to the limit of a posterior mean where the prior is such that
$f(x) = A_0 + A_1 x + g(x),$
where $g \sim GP(0, C)$ and $A \sim N(0, v \cdot I)$, as $v \to \infty$.
35 / 37

Splines and Reproducing Kernel Hilbert Spaces
Solution
We have to show: $\langle C_x, g \rangle = g(x)$. Plug in the definition of $C_x$; in the last two steps, use integration by parts and $g(0) = g'(0) = 0$. This yields:
$\langle C_x, g \rangle = \int_0^1 C_x''(y) g''(y)\, dy$
$= \int_0^x \left( \frac{x y^2}{2} - \frac{y^3}{6} \right)'' g''(y)\, dy + \int_x^1 \left( \frac{y x^2}{2} - \frac{x^3}{6} \right)'' g''(y)\, dy$
$= \int_0^x (x - y)\, g''(y)\, dy$
$= x\, (g'(x) - g'(0)) - (y\, g'(y))\big|_{y=0}^{x} + \int_0^x g'(y)\, dy$
$= g(x).$
36 / 37

References
Gaussian process priors:
Williams, C. and Rasmussen, C. (2006). Gaussian Processes for Machine Learning. MIT Press, chapter 2.
Splines and reproducing kernel Hilbert spaces:
Wahba, G. (1990). Spline Models for Observational Data, volume 59. Society for Industrial and Applied Mathematics, chapter 1.
37 / 37