Scalable kernel methods and their use in black-box optimization

Slide 1: Scalable kernel methods and their use in black-box optimization
David Eriksson, Center for Applied Mathematics, Cornell University. November 9.

Slide 2: Main projects
1. Scaling Gaussian process regression [Today]. Collaborators: David Bindel, Kun Dong, Hannes Nickisch, Andrew Wilson
2. Scaling Gaussian process regression with derivatives [Today]. Collaborators: David Bindel, Kun Dong, Eric Lee, Andrew Wilson
3. Energy bound optimization [A-exam]. Collaborators: David Bindel
4. Asynchronous surrogate optimization [A-exam]. Collaborators: David Bindel, Christine Shoemaker
5. Khatri-Rao systems of equations [Another time]. Collaborators: Alex Townsend, Charles Van Loan

Slide 3: Outline
1. Scattered data interpolation: positive definite kernels, conditionally positive definite kernels, radial basis functions
2. Gaussian processes: kernel learning, approximate kernel learning
3. Gaussian processes with derivatives: incorporating derivatives

Slide 4: Section 1. Scattered data interpolation (positive definite kernels, conditionally positive definite kernels, radial basis functions)

Slide 5: Scattered data interpolation
Given: pairwise distinct points $X = \{x_i\}_{i=1}^n \subset \Omega \subset \mathbb{R}^d$ and function values $f_X = [f(x_1), \ldots, f(x_n)]^T$.
Goal: find a continuous function $s_{f,X}$ such that $s_{f,X}(x_i) = f(x_i)$, $i = 1, \ldots, n$.
We can use a linear combination of continuous basis functions: $s_{f,X}(x) = \sum_{i=1}^n \lambda_i b_i(x)$.
Need to solve $A_X \lambda = f_X$, where $(A_X)_{ij} = b_j(x_i)$.
Well-posed if $A_X$ is non-singular. When is this the case?
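The interpolation system above can be assembled in a few lines. The following is a minimal numpy sketch (not from the slides): it picks a Gaussian kernel with an arbitrary lengthscale as the basis, forms $A_X$, solves $A_X \lambda = f_X$, and evaluates the interpolant.

```python
import numpy as np

def gaussian_kernel(x, y, ell=0.3):
    # k(x, y) = exp(-|x - y|^2 / (2 ell^2)); the lengthscale is an arbitrary choice
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

# Pairwise distinct 1D points and function values to interpolate
X = np.linspace(0, 1, 10)
fX = np.sin(2 * np.pi * X)

# Interpolation system A_X lambda = f_X with (A_X)_ij = b_j(x_i) = k(x_i, x_j)
A = gaussian_kernel(X, X)
lam = np.linalg.solve(A, fX)

# Evaluate s_{f,X}(x) = sum_i lambda_i k(x, x_i) on a fine grid
xs = np.linspace(0, 1, 200)
s = gaussian_kernel(xs, X) @ lam
print(np.max(np.abs(gaussian_kernel(X, X) @ lam - fX)))  # ~0: interpolates the data
```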

Slide 6: Basis functions
($d = 1$): We can choose basis functions independent of the data.
Example: polynomial interpolation with the monomial basis, where $\det A_X = \prod_{1 \le i < j \le n} (x_j - x_i) \ne 0$, so $A_X$ is always non-singular if the points in $X$ are pairwise distinct.
($d \ge 2$): Famous negative result (Mairhuber-Curtis): in order for $\det A_X \ne 0$ to hold for all pairwise distinct $X \subset \Omega$, the basis functions must depend on $X$.

Slide 7: Positive definite kernels
Characterizing all data-dependent basis functions is challenging.
Common restriction: require that $A_X$ is always s.p.d. This is achieved by using an s.p.d. kernel: $b_i(x) = k(x, x_i)$.
Definition (positive definite kernel): a (continuous) symmetric function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a positive definite kernel if for all $X$ and $\lambda$ such that (1) the points in $X$ are pairwise distinct and (2) $\lambda \ne 0$, we have $\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j k(x_i, x_j) > 0$.

Slide 8: Popular positive definite kernels
White noise: $k(x, y) = \sigma^2 \delta_{xy}$
Gaussian (SE): $k(x, y) = s^2 \exp\left(-\frac{\|x - y\|^2}{2\ell^2}\right)$
Matérn 1/2: $k(x, y) = s^2 \exp\left(-\frac{\|x - y\|}{\ell}\right)$
Matérn 3/2: $k(x, y) = s^2 \left(1 + \frac{\sqrt{3}\,\|x - y\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|x - y\|}{\ell}\right)$
Matérn 5/2: $k(x, y) = s^2 \left(1 + \frac{\sqrt{5}\,\|x - y\|}{\ell} + \frac{5\|x - y\|^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}\,\|x - y\|}{\ell}\right)$
Rational quadratic: $k(x, y) = s^2 \left(1 + \frac{\|x - y\|^2}{2\alpha\ell^2}\right)^{-\alpha}$
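For concreteness, here are plain numpy versions of these kernels as functions of the distance $r = \|x - y\|$; the default values of the signal variance $s$, lengthscale $\ell$, and $\alpha$ are placeholders, not values from the talk.

```python
import numpy as np

def se(r, s=1.0, l=1.0):
    # Gaussian / squared exponential: s^2 exp(-r^2 / (2 l^2))
    return s**2 * np.exp(-r**2 / (2 * l**2))

def matern12(r, s=1.0, l=1.0):
    return s**2 * np.exp(-r / l)

def matern32(r, s=1.0, l=1.0):
    return s**2 * (1 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)

def matern52(r, s=1.0, l=1.0):
    return s**2 * (1 + np.sqrt(5) * r / l + 5 * r**2 / (3 * l**2)) * np.exp(-np.sqrt(5) * r / l)

def rational_quadratic(r, s=1.0, l=1.0, alpha=1.0):
    return s**2 * (1 + r**2 / (2 * alpha * l**2)) ** (-alpha)

# r is the pairwise distance ||x - y||, e.g. for scalar inputs:
x = np.linspace(0, 1, 5)
r = np.abs(np.subtract.outer(x, x))
K = matern52(r)  # 5 x 5 kernel matrix
```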

Slide 9: Polynomial precision
Example: the Gaussian kernel cannot reproduce $f(x) = \text{constant}$.
Desirable: $s_{f,X}$ exact for low-degree polynomials, often referred to as polynomial precision.
Mairhuber-Curtis implies we need additional assumptions on $X$.
Definition: a set of points $X$ is $\nu$-unisolvent if the only polynomial of degree at most $\nu$ interpolating zero data on $X$ is the zero polynomial.
Example: three collinear points in $\mathbb{R}^2$, such as $(0,0)$, $(1,1)$, $(2,2)$, are not 1-unisolvent.

Slide 10: Kernels and polynomial precision
Assume the points $X$ are $\nu$-unisolvent and $\{\pi_i\}_{i=1}^m$ is a basis for $\Pi_\nu^d$ (polynomials of degree at most $\nu$).
Look for $s_{f,X}(x) = \sum_{i=1}^n \lambda_i k(x, x_i) + \sum_{i=1}^m \mu_i \pi_i(x)$.
We now have $n$ equations and $m + n$ unknowns, so we add the $m$ discrete orthogonality conditions $\sum_{i=1}^n \lambda_i \pi_j(x_i) = 0$, $j = 1, \ldots, m$.
This allows us to use a larger family of kernels!

Slide 11: Conditionally positive definite kernels and polynomial precision
Letting $(K_{XX})_{ij} = k(x_i, x_j)$ and $(P_X)_{ij} = \pi_j(x_i)$, solve
$\begin{bmatrix} K_{XX} & P_X \\ P_X^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} = \begin{bmatrix} f_X \\ 0 \end{bmatrix}$
Need: $X$ $(\nu - 1)$-unisolvent, $p \in \Pi_{\nu-1}^d$, and $k$ conditionally positive definite of order $\nu$.
Definition (conditionally positive definite kernel): a (continuous) symmetric function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a conditionally positive definite kernel of order $\nu$ if for all $X$ and $\lambda$ such that (1) the points in $X$ are pairwise distinct and (2) $\lambda \ne 0$ with $\sum_{i=1}^n \lambda_i q(x_i) = 0$ for all $q \in \Pi_{\nu-1}^d$, we have $\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j k(x_i, x_j) > 0$.
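A small numpy sketch of this augmented system, under the assumption of a cubic RBF (conditionally positive definite of order 2) with a linear polynomial tail in $\mathbb{R}^2$; the test function and point set are made up for illustration.

```python
import numpy as np

# Cubic RBF phi(r) = r^3 is conditionally positive definite of order 2,
# so we append a linear polynomial tail (degree <= 1) and enforce the
# discrete orthogonality conditions P^T lambda = 0.
rng = np.random.default_rng(0)
X = rng.random((20, 2))                     # 20 points in R^2 (assumed 1-unisolvent)
f = np.sin(X[:, 0]) + X[:, 1] ** 2

R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
K = R ** 3                                  # (K_XX)_ij = phi(||x_i - x_j||)
P = np.hstack([np.ones((len(X), 1)), X])    # basis {1, x_1, x_2} for degree-1 polynomials

A = np.block([[K, P], [P.T, np.zeros((3, 3))]])
rhs = np.concatenate([f, np.zeros(3)])
coef = np.linalg.solve(A, rhs)
lam, mu = coef[:len(X)], coef[len(X):]

def s(x):
    # s_{f,X}(x) = sum_i lambda_i phi(||x - x_i||) + polynomial tail
    r = np.linalg.norm(X - x, axis=1)
    return r ** 3 @ lam + np.concatenate([[1.0], x]) @ mu

print(abs(s(X[0]) - f[0]))                  # ~0: interpolates the data
```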

Slide 12: Radial basis functions
Important special case: $k(x, y) = \varphi(r)$ where $r = \|x - y\|$.
Examples: cubic ($\varphi(r) = r^3$), thin-plate spline ($\varphi(r) = r^2 \log r$).
Semi-norm: $\|s_{f,X}\|^2 = \langle s, s \rangle = \lambda^T \Phi_{XX} \lambda$.
Native space norm: $\|f\|_{N_\varphi} = \sup \{ \|s_{f,X}\| : X \subset \Omega,\ |X| < \infty \}$.
Generic error estimate: $|f(x) - s_{f,X}(x)| \le P_{\varphi,X}(x) \sqrt{\|f\|_{N_\varphi}^2 - \|s_{f,X}\|^2}$.
Power function: $[P_{\varphi,X}(x)]^2 = \varphi(0) - \begin{bmatrix} \Phi_{Xx} \\ P_x^T \end{bmatrix}^T \begin{bmatrix} \Phi_{XX} & P_X \\ P_X^T & 0 \end{bmatrix}^{-1} \begin{bmatrix} \Phi_{Xx} \\ P_x^T \end{bmatrix}$.
The power function is a Schur complement after adding $x$.

Slide 13: Section 2. Gaussian processes (kernel learning, approximate kernel learning)

Slide 14: Gaussian process interpolation
A GP defines a distribution over functions: $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$.
Mean function: $\mu : \mathbb{R}^d \to \mathbb{R}$, often a low-degree polynomial.
Covariance function: $\mathrm{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$ for an s.p.d. kernel.
Posterior mean and variance at $x$:
$\mathbb{E}[f(x)] = \mu(x) + K_{xX} K_{XX}^{-1} (y_X - \mu_X)$, $\qquad \mathbb{V}[f(x)] = K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx}$.
Compared to RBFs, $\mathbb{V}[f(x)]$ tells us about the average case.
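A minimal numpy sketch of these posterior formulas for a 1D SE kernel with zero prior mean; the data, lengthscale, and jitter are illustrative choices, not the talk's settings.

```python
import numpy as np

def se_kernel(A, B, s=1.0, l=0.2):
    # SE kernel on scalar inputs: s^2 exp(-(a - b)^2 / (2 l^2))
    return s**2 * np.exp(-np.subtract.outer(A, B) ** 2 / (2 * l**2))

X = np.linspace(0, 1, 8)                    # training inputs
y = np.sin(4 * X)                           # noise-free observations, zero prior mean
sigma2 = 1e-8                               # small jitter for numerical stability

K = se_kernel(X, X) + sigma2 * np.eye(len(X))
xs = np.linspace(0, 1, 100)
Ks = se_kernel(xs, X)                       # K_{xX}
Kss = se_kernel(xs, xs)                     # K_{xx}

alpha = np.linalg.solve(K, y)
mean = Ks @ alpha                           # E[f(x)] = K_{xX} K_{XX}^{-1} (y - mu)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # V[f(x)] = K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx}
var = np.diag(cov)
print(float(mean[0]), float(var[0]))
```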

Slide 15: Draws from a GP prior with zero mean (figure).

Slide 16: Draws from the GP posterior (figure).

Slide 17: Posterior mean and variance (figure).

Slide 18: Gaussian process regression
Assume we observe $y_X = f_X + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Add a white noise kernel: $\tilde{k}(x, y) = k(x, y) + \sigma^2 \delta_{xy}$. We often do this even when there is no noise.
Weyl: $\varphi(r) \in C^\nu \Rightarrow \lambda_n = o(n^{-\nu - 1/2})$. Example: $\lambda_n$ decays exponentially for the Gaussian (SE) kernel.
Adding $\sigma^2 \delta_{xy}$ guarantees $\lambda_n \ge \sigma^2$.
Gershgorin: $\kappa(\Phi_{XX} + \sigma^2 I) \le \frac{n\,\varphi(0) + \sigma^2}{\sigma^2}$. Example: $\kappa(\Phi_{XX} + \sigma^2 I) \lesssim n \left(\frac{s}{\sigma}\right)^2$ for the Gaussian (SE) kernel.

Slide 19: Kernel hyper-parameters
How do we learn the optimal kernel hyper-parameters $\theta$? The fully Bayesian approach is expensive, so we often do MLE.
Log marginal likelihood: $\log p(y_X \mid \theta) = \mathcal{L}_y + \mathcal{L}_K - \frac{n}{2} \log 2\pi$.
Need to compute, with $c = \tilde{K}_{XX}^{-1} (y_X - \mu_X)$:
$\mathcal{L}_y = -\frac{1}{2} (y_X - \mu_X)^T c$, $\qquad \mathcal{L}_K = -\frac{1}{2} \log \det \tilde{K}_{XX}$,
$\frac{\partial \mathcal{L}_y}{\partial \theta_i} = \frac{1}{2} c^T \frac{\partial \tilde{K}_{XX}}{\partial \theta_i} c$, $\qquad \frac{\partial \mathcal{L}_K}{\partial \theta_i} = -\frac{1}{2} \operatorname{tr}\left(\tilde{K}_{XX}^{-1} \frac{\partial \tilde{K}_{XX}}{\partial \theta_i}\right)$.

Slide 20: Exact kernel learning
Compute a dense Cholesky factorization $\tilde{K}_{XX} = L L^T$: $O(n^3)$ flops.
Solves and log-determinant computations with $\tilde{K}_{XX}$ are then trivial:
$\tilde{K}_{XX} \backslash c = L^T \backslash (L \backslash c)$, $\qquad \log \det \tilde{K}_{XX} = 2 \sum_{i=1}^n \log L_{ii}$.
Works for small $n$, but dense linear algebra is not scalable!
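A short sketch of the Cholesky-based computation, assuming a zero prior mean and using scipy's cho_factor/cho_solve; the kernel and hyperparameters are arbitrary stand-ins.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(K, y):
    # log p(y) = -1/2 y^T K^{-1} y - 1/2 log det K - n/2 log 2*pi (zero mean)
    n = len(y)
    L, lower = cho_factor(K, lower=True)        # dense Cholesky: O(n^3)
    c = cho_solve((L, lower), y)                # c = K^{-1} y via triangular solves
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log det K = 2 sum_i log L_ii
    return -0.5 * y @ c - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

# Example with an SE kernel plus noise (hyperparameters chosen arbitrarily)
X = np.linspace(0, 1, 50)
y = np.sin(6 * X) + 0.1 * np.random.default_rng(0).standard_normal(50)
K = np.exp(-np.subtract.outer(X, X) ** 2 / (2 * 0.2**2)) + 0.01 * np.eye(50)
print(log_marginal_likelihood(K, y))
```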

Slide 21: Iterative methods
Assumption: we have access to a fast MVM with $\tilde{K}_{XX}$.
Use a Krylov method to solve linear systems with $\tilde{K}_{XX}$: $\mathcal{K}_k(A, b) = \mathrm{span}\{b, Ab, \ldots, A^{k-1} b\}$.
$\tilde{K}_{XX}$ is s.p.d., so we use the conjugate gradient (CG) method, which only interacts with $\tilde{K}_{XX}$ via MVMs.
CG converges in $n$ iterations in exact arithmetic, and a few iterations are enough for many kernels.
Small $\ell$: $\tilde{K}_{XX}$ is almost diagonal, so convergence is fast.
Large $\ell$: use a pivoted Cholesky preconditioner, $\tilde{K}_{XX} \approx P (L L^T) P^T$.
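A bare-bones CG solver that only touches the matrix through a user-supplied MVM callback, in the spirit of the slide; the dense kernel MVM in the example stands in for a fast structured one.

```python
import numpy as np

def conjugate_gradient(mvm, b, tol=1e-8, maxiter=None):
    # Solve A x = b for s.p.d. A, using A only through matrix-vector products.
    x = np.zeros_like(b)
    r = b.copy()                         # residual b - A x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter or len(b)):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Example: solve (K + sigma^2 I) x = y with a dense MVM standing in for a fast one
X = np.linspace(0, 1, 200)
K = np.exp(-np.subtract.outer(X, X) ** 2 / (2 * 0.1**2)) + 0.1 * np.eye(200)
y = np.sin(5 * X)
x = conjugate_gradient(lambda v: K @ v, y)
print(np.linalg.norm(K @ x - y))
```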

Slide 22: Stochastic trace estimation
How do we approximate $\log \det \tilde{K}_{XX}$ using fast MVMs? Note that $\log \det \tilde{K}_{XX} = \mathrm{tr}(\log \tilde{K}_{XX})$.
Estimating the trace of a matrix: stochastic trace estimation.
If $z$ has independent random entries with $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = 1$, then $\mathbb{E}[z^T A z] = \sum_{i=1}^n \sum_{j=1}^n a_{ij} \mathbb{E}[z_i z_j] = \mathrm{tr}(A)$.
Common choices of probe vector $z$: Hutchinson ($z_i = \pm 1$ with probability 0.5) and Gaussian ($z_i \sim \mathcal{N}(0, 1)$).
This requires fast computation of $\log(\tilde{K}_{XX}) z$: function application with a Hermitian matrix, i.e. Lanczos.
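A sketch of the Hutchinson estimator for $\log \det \tilde{K}_{XX} = \mathrm{tr}(\log \tilde{K}_{XX})$. For clarity, the action of $\log(\tilde{K}_{XX})$ on a probe vector is computed exactly through an eigendecomposition here; at scale that step would be replaced by the Lanczos approximation of the next slide.

```python
import numpy as np

def hutchinson_logdet(K, num_probes=50, rng=None):
    # Estimate log det K = tr(log K) with Hutchinson probe vectors z_i = +/-1.
    rng = rng or np.random.default_rng(0)
    w, Q = np.linalg.eigh(K)
    def logK_mvm(z):
        # Exact log(K) z for this small demo (Lanczos would replace this at scale)
        return Q @ (np.log(w) * (Q.T @ z))
    n = K.shape[0]
    estimates = []
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        estimates.append(z @ logK_mvm(z))     # E[z^T log(K) z] = tr(log K)
    return np.mean(estimates)

x = np.linspace(0, 1, 100)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.02) + 0.1 * np.eye(100)
print(hutchinson_logdet(K), np.linalg.slogdet(K)[1])   # estimate vs. exact log det
```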

Slide 23: Lanczos
Lanczos computes the factorization $\tilde{K}_{XX} Q = Q T$, with $Q$ orthogonal and $T$ tridiagonal.
An elegant three-term recursion with one MVM per iteration.
Converges in $k \le p$ steps if $\tilde{K}_{XX}$ has $p$ distinct eigenvalues.
Function application starting at $z / \|z\|$: $f(\tilde{K}_{XX}) z = Q f(T) Q^T z = \|z\|\, Q f(T) e_1$.
Truncate after $k \ll n$ steps: $f(\tilde{K}_{XX}) z \approx \|z\|\, Q_k f(T_k) e_1$.
N.B.: CG is a special case of Lanczos.
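A compact Lanczos sketch (with full reorthogonalization and no breakdown handling) that builds $Q$ and tridiagonal $T$ from MVMs and then forms $f(\tilde{K}_{XX}) z \approx \|z\|\, Q_k f(T_k) e_1$; the function and variable names are mine, not from a library.

```python
import numpy as np

def lanczos(mvm, z, k):
    # k steps of Lanczos starting from z: returns Q (n x k) and tridiagonal T (k x k)
    # with A Q ~= Q T, built from one MVM per iteration (assumes no early breakdown).
    n = len(z)
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    q, q_prev, b = z / np.linalg.norm(z), np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        w = mvm(q) - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
        if j < k - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

def f_of_A_times_z(mvm, z, k, f=np.log):
    # f(A) z ~= ||z|| Q f(T) e_1, with f(T) via the eigendecomposition of the small T
    Q, T = lanczos(mvm, z, k)
    w, V = np.linalg.eigh(T)
    fT_e1 = V @ (f(w) * V[0, :])      # f(T) e_1 = V f(Lambda) V^T e_1
    return np.linalg.norm(z) * (Q @ fT_e1)

# Example: approximate log(K) z and compare with the exact result
x = np.linspace(0, 1, 300)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.02) + 0.1 * np.eye(300)
z = np.random.default_rng(0).choice([-1.0, 1.0], size=300)
approx = f_of_A_times_z(lambda v: K @ v, z, k=30)
w, V = np.linalg.eigh(K)
exact = V @ (np.log(w) * (V.T @ z))
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```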

Slide 24: Fast MVMs: SKI
Structured kernel interpolation (SKI): $K_{XX} \approx W^T K_{UU} W$.
$U$ is a structured grid with $m$ points.
$K_{UU}$ is BTTB (with Kronecker structure for product kernels).
$W$ is a sparse matrix of interpolation weights.
Can apply an MVM with $K_{XX}$ in $O(m \log m)$ time using the FFT.
The grid structure limits SKI to roughly 5 dimensions.
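A 1D toy version of the SKI approximation: sparse interpolation weights onto a regular grid, a grid kernel matrix, and an MVM of the form $W (K_{UU} (W^T v))$. Linear interpolation weights and a dense $K_{UU}$ keep the sketch short; the actual method uses local cubic interpolation and exploits the Toeplitz/BTTB structure of $K_{UU}$ with FFTs.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ski_weights(X, u):
    # Sparse linear interpolation weights from scattered 1D points X onto grid u
    # (two nonzeros per row; the real method uses local cubic interpolation).
    m, h = len(u), u[1] - u[0]
    idx = np.clip(np.floor((X - u[0]) / h).astype(int), 0, m - 2)
    t = (X - u[idx]) / h
    rows = np.repeat(np.arange(len(X)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1 - t, t], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(X), m))

rng = np.random.default_rng(0)
X = np.sort(rng.random(2000))                    # n scattered inputs
u = np.linspace(0, 1, 100)                       # inducing grid U with m points
W = ski_weights(X, u)                            # n x m sparse interpolation matrix
Kuu = np.exp(-np.subtract.outer(u, u) ** 2 / (2 * 0.1**2))   # Toeplitz on a regular grid

def ski_mvm(v):
    # K_XX v ~= W (K_UU (W^T v)); the K_UU product could use an FFT since it is Toeplitz
    return W @ (Kuu @ (W.T @ v))

v = rng.standard_normal(len(X))
print(ski_mvm(v)[:3])
```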

Slide 25: SKI for product kernels (SKIP)
Main idea: $(A \circ B) x = \mathrm{diag}(A\, \mathrm{diag}(x)\, B^T)$.
Cost for an MVM: $O(n r^2)$ flops if $A$ and $B$ have rank $r$.
Assume tensor product structure: $k(x, y) = \prod_{i=1}^d k_i(x_i, y_i)$. Many popular kernels (e.g., SE) have tensor product structure.
Use SKI in each dimension: $K \approx (W_1 K_1 W_1^T) \circ \cdots \circ (W_d K_d W_d^T)$.
Divide and conquer plus truncated rank-$r$ Lanczos factorizations: $K \approx (Q_1 T_1 Q_1^T) \circ (Q_2 T_2 Q_2^T)$.
Constructing the SKIP kernel: $O(n + m \log m + r^3 n \log d)$ flops.
Often achieves high accuracy for $r \ll n$.
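The identity $(A \circ B) x = \mathrm{diag}(A\, \mathrm{diag}(x)\, B^T)$ is easy to check numerically when $A$ and $B$ are given in low-rank form; the sketch below evaluates it through the factors in $O(n r^2)$ and compares against the dense Hadamard product. The matrices here are random stand-ins, not SKI factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 5
# Low-rank factors: A = U1 V1^T, B = U2 V2^T, each of rank r
U1, V1 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
U2, V2 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
x = rng.standard_normal(n)

# Direct evaluation after forming the dense Hadamard product: O(n^2)
A, B = U1 @ V1.T, U2 @ V2.T
direct = (A * B) @ x

# (A o B) x = diag(A diag(x) B^T), evaluated through the factors: O(n r^2)
M = V1.T @ (x[:, None] * V2)                 # r x r core, M = V1^T diag(x) V2
fast = np.sum((U1 @ M) * U2, axis=1)         # row i: U1[i]^T M U2[i]
print(np.max(np.abs(direct - fast)))         # agreement up to round-off
```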

Slide 26: Rainfall
Method              | n    | m   | MSE | Time [min]
Lanczos             | 528k | 3M  | -   | -
Scaled eigenvalues  | 528k | 3M  | -   | -
Exact               | 12k  | -   | -   | -
Data: hourly precipitation data at 5500 weather stations, aggregated into daily precipitation; 628k entries in total.
Train on 528k data points, test on the remainder. Use SKI with 100 points per spatial dimension and 300 in time.
Reference comparison: exact computation on a 12k-entry subset.

Slide 27: Hickory data set
Our approach can be used for non-Gaussian likelihoods. Example: a log-Gaussian Cox process.
Count data for hickory trees in Michigan; the area is discretized using a grid.
Use the Poisson likelihood with the SE kernel and a Laplace approximation for the posterior.
The scaled eigenvalue method uses the Fiedler bound.
Figure: (a) count data, (b) exact, (c) scaled eigenvalues, (d) Lanczos.

Slide 28: Section 3. Gaussian processes with derivatives (incorporating derivatives)

Slide 29: Gaussian processes with derivatives
Assume we observe both $f(x)$ and $\nabla f(x)$, and let $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$.
Differentiation is a linear operator:
$\mu^\nabla(x) = \begin{bmatrix} \mu(x) \\ \nabla \mu(x) \end{bmatrix}$, $\qquad k^\nabla(x, x') = \begin{bmatrix} k(x, x') & (\nabla_{x'} k(x, x'))^T \\ \nabla_x k(x, x') & \nabla_x \nabla_{x'} k(x, x') \end{bmatrix}$
Multi-output GP: $\begin{bmatrix} f(x) \\ \nabla f(x) \end{bmatrix} \sim \mathcal{GP}(\mu^\nabla(x), k^\nabla(x, x'))$.
Exact kernel learning and inference is now $O(n^3 d^3)$ flops, since it involves a kernel matrix of size $n(d+1) \times n(d+1)$.
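For $d = 1$ and an SE kernel the blocks of $k^\nabla$ can be written out explicitly; the sketch below assembles the $2n \times 2n$ joint covariance of $(f(x_i), f'(x_i))$ with illustrative hyperparameters.

```python
import numpy as np

def grad_se_kernel_matrix(x, s=1.0, l=0.3):
    # Joint covariance of (f(x_i), f'(x_i)) for a 1D SE kernel:
    #   [[ k,         d k / dx'       ],
    #    [ d k / dx,  d^2 k / dx dx'  ]]
    r = np.subtract.outer(x, x)                  # r_ij = x_i - x_j
    k = s**2 * np.exp(-r**2 / (2 * l**2))
    k_x = -(r / l**2) * k                        # derivative w.r.t. the first argument
    k_xxp = (1.0 / l**2 - r**2 / l**4) * k       # mixed second derivative
    return np.block([[k, k_x.T], [k_x, k_xxp]])  # size 2n x 2n (n(d+1) in general)

x = np.linspace(0, 1, 5)
K = grad_se_kernel_matrix(x)
print(K.shape, np.all(np.linalg.eigvalsh(K + 1e-10 * np.eye(10)) > 0))
```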

Slide 30: Example: the Branin function
Gradient information can make the GP model more accurate.
Figure: (left) true Branin function, (middle) SE GP without gradients, (right) SE GP with gradients.

Slide 31: Extending SKI and SKIP
Differentiate the approximation scheme.
D-SKI: $k(x, x') \approx \sum_i w_i(x) k(u_i, x')$, so $\nabla k(x, x') \approx \sum_i \nabla w_i(x) k(u_i, x')$.
D-SKIP: differentiate each Hadamard product.
Figure: (left) log10 error in the D-SKI approximation and comparison to the exact spectrum; (right) log10 error in the D-SKIP approximation and comparison to the exact spectrum.

Slide 32: Active subspaces
Can estimate the active subspace from gradients: $C = \int_\Omega \nabla f(x) \nabla f(x)^T \, dx = Q \Lambda Q^T$.
$\lambda_i$ measures the average change in $f$ along $q_i$.
Optimal $d$-dimensional subspace $P$: the first $d$ columns of $Q$.
Active subspace approximation: $f(x) \approx f(P P^T x)$. Can work with the kernel $\hat{k}(x, x') = k(P^T x, P^T x')$.
We estimate $C$ using Monte Carlo integration: $C \approx \frac{1}{n} \sum_{i=1}^n \nabla f(x_i) \nabla f(x_i)^T$.
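A small numpy sketch of the Monte Carlo estimate of $C$ and the resulting subspace $P$, tested on a made-up 10-dimensional function with a planted 2-dimensional active subspace.

```python
import numpy as np

def estimate_active_subspace(grads, d_active):
    # Monte Carlo estimate C ~ (1/n) sum_i grad f(x_i) grad f(x_i)^T, then take
    # the leading eigenvectors of C as the active subspace P.
    C = grads.T @ grads / grads.shape[0]
    lam, Q = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]                # eigh returns ascending eigenvalues
    return Q[:, order[:d_active]], lam[order]

# Toy function f(x) = g(P_true^T x) in 10 dimensions with a 2D active subspace
rng = np.random.default_rng(0)
P_true, _ = np.linalg.qr(rng.standard_normal((10, 2)))
def grad_f(x):
    u = P_true.T @ x
    return P_true @ np.array([2 * u[0], np.cos(u[1])])   # chain rule through P_true

Xs = rng.standard_normal((200, 10))
G = np.array([grad_f(x) for x in Xs])
P, lam = estimate_active_subspace(G, d_active=2)
print(lam[:4])   # only the first two eigenvalues are (numerically) nonzero
```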

Slide 33: Bayesian optimization with active subspace learning
1: Generate experimental design
2: Evaluate experimental design
3: while budget not exhausted do
4:   Calculate active subspace $P$ using sampled gradients
5:   Fit GP with derivatives using $k(P^T x, P^T x')$
6:   Optimize $u_{n+1} = \arg\max A(u)$ and set $x_{n+1} = P u_{n+1}$
7:   Sample point $x_{n+1}$, value $f_{n+1}$, and gradient $\nabla f_{n+1}$
8:   Update data $D_{n+1} = D_n \cup \{x_{n+1}, f_{n+1}, \nabla f_{n+1}\}$
9: end while

Slide 34: Bayesian optimization with EI
5-dimensional Ackley function randomly embedded in 50 dimensions. We observe noisy values and noisy gradients.
Use active subspace learning from sampled gradients and D-SKI in the active subspace for fast kernel learning.
Active subspace learning improves the performance of BO.
Figure: comparison of BO with exact GPs, BO with D-SKI, BFGS, and random sampling.

Slide 35: Stanford bunny
Recovering the Stanford bunny from 25k noisy normals.
Spline kernel: $k(x, y) = s^2 (\|x - y\|^3 + a \|x - y\|^2 + b)$.
Fit an implicit GP surface: $f(x_i) = 0$, $\nabla f(x_i) = n_i$.

Slide 36: (section divider)

Slide 37: Thank you for your attention!
