Scalable kernel methods and their use in black-box optimization

Slide 1: Scalable kernel methods and their use in black-box optimization
David Eriksson, Center for Applied Mathematics, Cornell University. November 9.

Slide 2: Main projects
1. Scaling Gaussian process regression [Today]. Collaborators: David Bindel, Kun Dong, Hannes Nickisch, Andrew Wilson
2. Scaling Gaussian process regression with derivatives [Today]. Collaborators: David Bindel, Kun Dong, Eric Lee, Andrew Wilson
3. Energy bound optimization [A-exam]. Collaborators: David Bindel
4. Asynchronous surrogate optimization [A-exam]. Collaborators: David Bindel, Christine Shoemaker
5. Khatri-Rao systems of equations [Another time]. Collaborators: Alex Townsend, Charles Van Loan

Slide 3: Outline
1. Scattered data interpolation: positive definite kernels, conditionally positive definite kernels, radial basis functions
2. Gaussian processes: kernel learning, approximate kernel learning
3. Gaussian processes with derivatives: incorporating derivatives

Slide 4: Section 1. Scattered data interpolation (positive definite kernels, conditionally positive definite kernels, radial basis functions)

Slide 5: Scattered data interpolation
Given: pairwise distinct points $X = \{x_i\}_{i=1}^n \subset \Omega \subset \mathbb{R}^d$ and function values $f_X = [f(x_1), \ldots, f(x_n)]^T$.
Goal: find a continuous function $s_{f,X}$ such that $s_{f,X}(x_i) = f(x_i)$, $i = 1, \ldots, n$.
We can use a linear combination of continuous basis functions: $s_{f,X}(x) = \sum_{i=1}^n \lambda_i b_i(x)$.
Need to solve $A_X \lambda = f_X$, where $(A_X)_{ij} = b_j(x_i)$.
Well-posed if $A_X$ is non-singular. When is this the case?
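The interpolation system above can be assembled in a few lines. The following is a minimal numpy sketch (not from the slides): it picks a Gaussian kernel with an arbitrary lengthscale as the basis, forms $A_X$, solves $A_X \lambda = f_X$, and evaluates the interpolant.

```python
import numpy as np

def gaussian_kernel(x, y, ell=0.3):
    # k(x, y) = exp(-|x - y|^2 / (2 ell^2)); the lengthscale is an arbitrary choice
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * ell ** 2))

# Pairwise distinct 1D points and function values to interpolate
X = np.linspace(0, 1, 10)
fX = np.sin(2 * np.pi * X)

# Interpolation system A_X lambda = f_X with (A_X)_ij = b_j(x_i) = k(x_i, x_j)
A = gaussian_kernel(X, X)
lam = np.linalg.solve(A, fX)

# Evaluate s_{f,X}(x) = sum_i lambda_i k(x, x_i) on a fine grid
xs = np.linspace(0, 1, 200)
s = gaussian_kernel(xs, X) @ lam
print(np.max(np.abs(gaussian_kernel(X, X) @ lam - fX)))  # ~0: interpolates the data
```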

Slide 6: Basis functions
($d = 1$): We can choose basis functions independent of the data.
Example: polynomial interpolation with the monomial basis, where $\det A_X = \prod_{1 \le i < j \le n} (x_j - x_i) \ne 0$, so $A_X$ is always non-singular if the points in $X$ are pairwise distinct.
($d \ge 2$): Famous negative result (Mairhuber-Curtis): in order for $\det A_X \ne 0$ to hold for all pairwise distinct $X \subset \Omega$, the basis functions must depend on $X$.

Slide 7: Positive definite kernels
Characterizing all data-dependent basis functions is challenging.
Common restriction: require that $A_X$ is always s.p.d. This is achieved by using an s.p.d. kernel: $b_i(x) = k(x, x_i)$.
Definition (positive definite kernel): a (continuous) symmetric function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a positive definite kernel if for all $X$ and $\lambda$ such that (1) the points in $X$ are pairwise distinct and (2) $\lambda \ne 0$, we have $\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j k(x_i, x_j) > 0$.

Slide 8: Popular positive definite kernels
White noise: $k(x, y) = \sigma^2 \delta_{xy}$
Gaussian (SE): $k(x, y) = s^2 \exp\left(-\frac{\|x - y\|^2}{2\ell^2}\right)$
Matérn 1/2: $k(x, y) = s^2 \exp\left(-\frac{\|x - y\|}{\ell}\right)$
Matérn 3/2: $k(x, y) = s^2 \left(1 + \frac{\sqrt{3}\,\|x - y\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|x - y\|}{\ell}\right)$
Matérn 5/2: $k(x, y) = s^2 \left(1 + \frac{\sqrt{5}\,\|x - y\|}{\ell} + \frac{5\|x - y\|^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}\,\|x - y\|}{\ell}\right)$
Rational quadratic: $k(x, y) = s^2 \left(1 + \frac{\|x - y\|^2}{2\alpha\ell^2}\right)^{-\alpha}$
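For concreteness, here are plain numpy versions of these kernels as functions of the distance $r = \|x - y\|$; the default values of the signal variance $s$, lengthscale $\ell$, and $\alpha$ are placeholders, not values from the talk.

```python
import numpy as np

def se(r, s=1.0, l=1.0):
    # Gaussian / squared exponential: s^2 exp(-r^2 / (2 l^2))
    return s**2 * np.exp(-r**2 / (2 * l**2))

def matern12(r, s=1.0, l=1.0):
    return s**2 * np.exp(-r / l)

def matern32(r, s=1.0, l=1.0):
    return s**2 * (1 + np.sqrt(3) * r / l) * np.exp(-np.sqrt(3) * r / l)

def matern52(r, s=1.0, l=1.0):
    return s**2 * (1 + np.sqrt(5) * r / l + 5 * r**2 / (3 * l**2)) * np.exp(-np.sqrt(5) * r / l)

def rational_quadratic(r, s=1.0, l=1.0, alpha=1.0):
    return s**2 * (1 + r**2 / (2 * alpha * l**2)) ** (-alpha)

# r is the pairwise distance ||x - y||, e.g. for scalar inputs:
x = np.linspace(0, 1, 5)
r = np.abs(np.subtract.outer(x, x))
K = matern52(r)  # 5 x 5 kernel matrix
```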

Slide 9: Polynomial precision
Example: the Gaussian kernel cannot reproduce $f(x) = \text{constant}$.
Desirable: $s_{f,X}$ exact for low-degree polynomials, often referred to as polynomial precision.
Mairhuber-Curtis implies we need additional assumptions on $X$.
Definition: a set of points $X$ is $\nu$-unisolvent if the only polynomial of degree at most $\nu$ interpolating zero data on $X$ is the zero polynomial.
Example: three collinear points in $\mathbb{R}^2$, such as $(0,0)$, $(1,1)$, $(2,2)$, are not 1-unisolvent.

Slide 10: Kernels and polynomial precision
Assume the points $X$ are $\nu$-unisolvent and $\{\pi_i\}_{i=1}^m$ is a basis for $\Pi_\nu^d$ (polynomials of degree at most $\nu$).
Look for $s_{f,X}(x) = \sum_{i=1}^n \lambda_i k(x, x_i) + \sum_{i=1}^m \mu_i \pi_i(x)$.
We now have $n$ equations and $m + n$ unknowns, so we add the $m$ discrete orthogonality conditions $\sum_{i=1}^n \lambda_i \pi_j(x_i) = 0$, $j = 1, \ldots, m$.
This allows us to use a larger family of kernels!

Slide 11: Conditionally positive definite kernels and polynomial precision
Letting $(K_{XX})_{ij} = k(x_i, x_j)$ and $(P_X)_{ij} = \pi_j(x_i)$, solve
$\begin{bmatrix} K_{XX} & P_X \\ P_X^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ \mu \end{bmatrix} = \begin{bmatrix} f_X \\ 0 \end{bmatrix}$
Need: $X$ $(\nu - 1)$-unisolvent, $p \in \Pi_{\nu-1}^d$, and $k$ conditionally positive definite of order $\nu$.
Definition (conditionally positive definite kernel): a (continuous) symmetric function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is called a conditionally positive definite kernel of order $\nu$ if for all $X$ and $\lambda$ such that (1) the points in $X$ are pairwise distinct and (2) $\lambda \ne 0$ with $\sum_{i=1}^n \lambda_i q(x_i) = 0$ for all $q \in \Pi_{\nu-1}^d$, we have $\sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j k(x_i, x_j) > 0$.
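A small numpy sketch of this augmented system, under the assumption of a cubic RBF (conditionally positive definite of order 2) with a linear polynomial tail in $\mathbb{R}^2$; the test function and point set are made up for illustration.

```python
import numpy as np

# Cubic RBF phi(r) = r^3 is conditionally positive definite of order 2,
# so we append a linear polynomial tail (degree <= 1) and enforce the
# discrete orthogonality conditions P^T lambda = 0.
rng = np.random.default_rng(0)
X = rng.random((20, 2))                     # 20 points in R^2 (assumed 1-unisolvent)
f = np.sin(X[:, 0]) + X[:, 1] ** 2

R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
K = R ** 3                                  # (K_XX)_ij = phi(||x_i - x_j||)
P = np.hstack([np.ones((len(X), 1)), X])    # basis {1, x_1, x_2} for degree-1 polynomials

A = np.block([[K, P], [P.T, np.zeros((3, 3))]])
rhs = np.concatenate([f, np.zeros(3)])
coef = np.linalg.solve(A, rhs)
lam, mu = coef[:len(X)], coef[len(X):]

def s(x):
    # s_{f,X}(x) = sum_i lambda_i phi(||x - x_i||) + polynomial tail
    r = np.linalg.norm(X - x, axis=1)
    return r ** 3 @ lam + np.concatenate([[1.0], x]) @ mu

print(abs(s(X[0]) - f[0]))                  # ~0: interpolates the data
```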

Slide 12: Radial basis functions
Important special case: $k(x, y) = \varphi(r)$ where $r = \|x - y\|$.
Examples: cubic ($\varphi(r) = r^3$), thin-plate spline ($\varphi(r) = r^2 \log r$).
Semi-norm: $\|s_{f,X}\|^2 = \langle s, s \rangle = \lambda^T \Phi_{XX} \lambda$.
Native space norm: $\|f\|_{N_\varphi} = \sup \{ \|s_{f,X}\| : X \subset \Omega,\ |X| < \infty \}$.
Generic error estimate: $|f(x) - s_{f,X}(x)| \le P_{\varphi,X}(x) \sqrt{\|f\|_{N_\varphi}^2 - \|s_{f,X}\|^2}$.
Power function: $[P_{\varphi,X}(x)]^2 = \varphi(0) - \begin{bmatrix} \Phi_{Xx} \\ P_x^T \end{bmatrix}^T \begin{bmatrix} \Phi_{XX} & P_X \\ P_X^T & 0 \end{bmatrix}^{-1} \begin{bmatrix} \Phi_{Xx} \\ P_x^T \end{bmatrix}$.
The power function is a Schur complement after adding $x$.

Slide 13: Section 2. Gaussian processes (kernel learning, approximate kernel learning)

Slide 14: Gaussian process interpolation
A GP defines a distribution over functions: $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$.
Mean function: $\mu : \mathbb{R}^d \to \mathbb{R}$, often a low-degree polynomial.
Covariance function: $\mathrm{cov}(f(x_i), f(x_j)) = k(x_i, x_j)$ for an s.p.d. kernel.
Posterior mean and variance at $x$:
$\mathbb{E}[f(x)] = \mu(x) + K_{xX} K_{XX}^{-1} (y_X - \mu_X)$, $\qquad \mathbb{V}[f(x)] = K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx}$.
Compared to RBFs, $\mathbb{V}[f(x)]$ tells us about the average case.
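A minimal numpy sketch of these posterior formulas for a 1D SE kernel with zero prior mean; the data, lengthscale, and jitter are illustrative choices, not the talk's settings.

```python
import numpy as np

def se_kernel(A, B, s=1.0, l=0.2):
    # SE kernel on scalar inputs: s^2 exp(-(a - b)^2 / (2 l^2))
    return s**2 * np.exp(-np.subtract.outer(A, B) ** 2 / (2 * l**2))

X = np.linspace(0, 1, 8)                    # training inputs
y = np.sin(4 * X)                           # noise-free observations, zero prior mean
sigma2 = 1e-8                               # small jitter for numerical stability

K = se_kernel(X, X) + sigma2 * np.eye(len(X))
xs = np.linspace(0, 1, 100)
Ks = se_kernel(xs, X)                       # K_{xX}
Kss = se_kernel(xs, xs)                     # K_{xx}

alpha = np.linalg.solve(K, y)
mean = Ks @ alpha                           # E[f(x)] = K_{xX} K_{XX}^{-1} (y - mu)
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)   # V[f(x)] = K_{xx} - K_{xX} K_{XX}^{-1} K_{Xx}
var = np.diag(cov)
print(float(mean[0]), float(var[0]))
```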

Slide 15: Draws from a GP prior with zero mean (figure).

Slide 16: Draws from the GP posterior (figure).

Slide 17: Posterior mean and variance (figure).

Slide 18: Gaussian process regression
Assume we observe $y_X = f_X + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$.
Add a white noise kernel: $\tilde{k}(x, y) = k(x, y) + \sigma^2 \delta_{xy}$. We often do this even when there is no noise.
Weyl: $\varphi(r) \in C^\nu \Rightarrow \lambda_n = o(n^{-\nu - 1/2})$. Example: $\lambda_n$ decays exponentially for the Gaussian (SE) kernel.
Adding $\sigma^2 \delta_{xy}$ guarantees $\lambda_n \ge \sigma^2$.
Gershgorin: $\kappa(\Phi_{XX} + \sigma^2 I) \le \frac{n\,\varphi(0) + \sigma^2}{\sigma^2}$. Example: $\kappa(\Phi_{XX} + \sigma^2 I) \lesssim n \left(\frac{s}{\sigma}\right)^2$ for the Gaussian (SE) kernel.

Slide 19: Kernel hyper-parameters
How do we learn the optimal kernel hyper-parameters $\theta$? The fully Bayesian approach is expensive, so we often do MLE.
Log marginal likelihood: $\log p(y_X \mid \theta) = \mathcal{L}_y + \mathcal{L}_K - \frac{n}{2} \log 2\pi$.
Need to compute, with $c = \tilde{K}_{XX}^{-1} (y_X - \mu_X)$:
$\mathcal{L}_y = -\frac{1}{2} (y_X - \mu_X)^T c$, $\qquad \mathcal{L}_K = -\frac{1}{2} \log \det \tilde{K}_{XX}$,
$\frac{\partial \mathcal{L}_y}{\partial \theta_i} = \frac{1}{2} c^T \frac{\partial \tilde{K}_{XX}}{\partial \theta_i} c$, $\qquad \frac{\partial \mathcal{L}_K}{\partial \theta_i} = -\frac{1}{2} \operatorname{tr}\left(\tilde{K}_{XX}^{-1} \frac{\partial \tilde{K}_{XX}}{\partial \theta_i}\right)$.

Slide 20: Exact kernel learning
Compute a dense Cholesky factorization $\tilde{K}_{XX} = L L^T$: $O(n^3)$ flops.
Solves and log-determinant computations with $\tilde{K}_{XX}$ are then trivial:
$\tilde{K}_{XX} \backslash c = L^T \backslash (L \backslash c)$, $\qquad \log \det \tilde{K}_{XX} = 2 \sum_{i=1}^n \log L_{ii}$.
Works for small $n$, but dense linear algebra is not scalable!
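A short sketch of the Cholesky-based computation, assuming a zero prior mean and using scipy's cho_factor/cho_solve; the kernel and hyperparameters are arbitrary stand-ins.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(K, y):
    # log p(y) = -1/2 y^T K^{-1} y - 1/2 log det K - n/2 log 2*pi (zero mean)
    n = len(y)
    L, lower = cho_factor(K, lower=True)        # dense Cholesky: O(n^3)
    c = cho_solve((L, lower), y)                # c = K^{-1} y via triangular solves
    logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log det K = 2 sum_i log L_ii
    return -0.5 * y @ c - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)

# Example with an SE kernel plus noise (hyperparameters chosen arbitrarily)
X = np.linspace(0, 1, 50)
y = np.sin(6 * X) + 0.1 * np.random.default_rng(0).standard_normal(50)
K = np.exp(-np.subtract.outer(X, X) ** 2 / (2 * 0.2**2)) + 0.01 * np.eye(50)
print(log_marginal_likelihood(K, y))
```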

Slide 21: Iterative methods
Assumption: we have access to a fast MVM with $\tilde{K}_{XX}$.
Use a Krylov method to solve linear systems with $\tilde{K}_{XX}$: $\mathcal{K}_k(A, b) = \mathrm{span}\{b, Ab, \ldots, A^{k-1} b\}$.
$\tilde{K}_{XX}$ is s.p.d., so we use the conjugate gradient (CG) method, which only interacts with $\tilde{K}_{XX}$ via MVMs.
CG converges in $n$ iterations in exact arithmetic, and a few iterations are enough for many kernels.
Small $\ell$: $\tilde{K}_{XX}$ is almost diagonal, so convergence is fast.
Large $\ell$: use a pivoted Cholesky preconditioner, $\tilde{K}_{XX} \approx P (L L^T) P^T$.
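A bare-bones CG solver that only touches the matrix through a user-supplied MVM callback, in the spirit of the slide; the dense kernel MVM in the example stands in for a fast structured one.

```python
import numpy as np

def conjugate_gradient(mvm, b, tol=1e-8, maxiter=None):
    # Solve A x = b for s.p.d. A, using A only through matrix-vector products.
    x = np.zeros_like(b)
    r = b.copy()                         # residual b - A x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter or len(b)):
        Ap = mvm(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Example: solve (K + sigma^2 I) x = y with a dense MVM standing in for a fast one
X = np.linspace(0, 1, 200)
K = np.exp(-np.subtract.outer(X, X) ** 2 / (2 * 0.1**2)) + 0.1 * np.eye(200)
y = np.sin(5 * X)
x = conjugate_gradient(lambda v: K @ v, y)
print(np.linalg.norm(K @ x - y))
```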

Slide 22: Stochastic trace estimation
How do we approximate $\log \det \tilde{K}_{XX}$ using fast MVMs? Note that $\log \det \tilde{K}_{XX} = \mathrm{tr}(\log \tilde{K}_{XX})$.
Estimating the trace of a matrix: stochastic trace estimation.
If $z$ has independent random entries with $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = 1$, then $\mathbb{E}[z^T A z] = \sum_{i=1}^n \sum_{j=1}^n a_{ij} \mathbb{E}[z_i z_j] = \mathrm{tr}(A)$.
Common choices of probe vector $z$: Hutchinson ($z_i = \pm 1$ with probability 0.5) and Gaussian ($z_i \sim \mathcal{N}(0, 1)$).
This requires fast computation of $\log(\tilde{K}_{XX}) z$: function application with a Hermitian matrix, i.e. Lanczos.
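A sketch of the Hutchinson estimator for $\log \det \tilde{K}_{XX} = \mathrm{tr}(\log \tilde{K}_{XX})$. For clarity, the action of $\log(\tilde{K}_{XX})$ on a probe vector is computed exactly through an eigendecomposition here; at scale that step would be replaced by the Lanczos approximation of the next slide.

```python
import numpy as np

def hutchinson_logdet(K, num_probes=50, rng=None):
    # Estimate log det K = tr(log K) with Hutchinson probe vectors z_i = +/-1.
    rng = rng or np.random.default_rng(0)
    w, Q = np.linalg.eigh(K)
    def logK_mvm(z):
        # Exact log(K) z for this small demo (Lanczos would replace this at scale)
        return Q @ (np.log(w) * (Q.T @ z))
    n = K.shape[0]
    estimates = []
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        estimates.append(z @ logK_mvm(z))     # E[z^T log(K) z] = tr(log K)
    return np.mean(estimates)

x = np.linspace(0, 1, 100)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.02) + 0.1 * np.eye(100)
print(hutchinson_logdet(K), np.linalg.slogdet(K)[1])   # estimate vs. exact log det
```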

Slide 23: Lanczos
Lanczos computes the factorization $\tilde{K}_{XX} Q = Q T$, with $Q$ orthogonal and $T$ tridiagonal.
An elegant three-term recursion with one MVM per iteration.
Converges in $k \le p$ steps if $\tilde{K}_{XX}$ has $p$ distinct eigenvalues.
Function application starting at $z / \|z\|$: $f(\tilde{K}_{XX}) z = Q f(T) Q^T z = \|z\|\, Q f(T) e_1$.
Truncate after $k \ll n$ steps: $f(\tilde{K}_{XX}) z \approx \|z\|\, Q_k f(T_k) e_1$.
N.B.: CG is a special case of Lanczos.
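A compact Lanczos sketch (with full reorthogonalization and no breakdown handling) that builds $Q$ and tridiagonal $T$ from MVMs and then forms $f(\tilde{K}_{XX}) z \approx \|z\|\, Q_k f(T_k) e_1$; the function and variable names are mine, not from a library.

```python
import numpy as np

def lanczos(mvm, z, k):
    # k steps of Lanczos starting from z: returns Q (n x k) and tridiagonal T (k x k)
    # with A Q ~= Q T, built from one MVM per iteration (assumes no early breakdown).
    n = len(z)
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    q, q_prev, b = z / np.linalg.norm(z), np.zeros(n), 0.0
    for j in range(k):
        Q[:, j] = q
        w = mvm(q) - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
        if j < k - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    return Q, T

def f_of_A_times_z(mvm, z, k, f=np.log):
    # f(A) z ~= ||z|| Q f(T) e_1, with f(T) via the eigendecomposition of the small T
    Q, T = lanczos(mvm, z, k)
    w, V = np.linalg.eigh(T)
    fT_e1 = V @ (f(w) * V[0, :])      # f(T) e_1 = V f(Lambda) V^T e_1
    return np.linalg.norm(z) * (Q @ fT_e1)

# Example: approximate log(K) z and compare with the exact result
x = np.linspace(0, 1, 300)
K = np.exp(-np.subtract.outer(x, x) ** 2 / 0.02) + 0.1 * np.eye(300)
z = np.random.default_rng(0).choice([-1.0, 1.0], size=300)
approx = f_of_A_times_z(lambda v: K @ v, z, k=30)
w, V = np.linalg.eigh(K)
exact = V @ (np.log(w) * (V.T @ z))
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```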

Slide 24: Fast MVMs: SKI
Structured kernel interpolation (SKI): $K_{XX} \approx W^T K_{UU} W$.
$U$ is a structured grid with $m$ points.
$K_{UU}$ is BTTB (with Kronecker structure for product kernels).
$W$ is a sparse matrix of interpolation weights.
Can apply an MVM with $K_{XX}$ in $O(m \log m)$ time using the FFT.
The grid structure limits SKI to roughly 5 dimensions.
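A 1D toy version of the SKI approximation: sparse interpolation weights onto a regular grid, a grid kernel matrix, and an MVM of the form $W (K_{UU} (W^T v))$. Linear interpolation weights and a dense $K_{UU}$ keep the sketch short; the actual method uses local cubic interpolation and exploits the Toeplitz/BTTB structure of $K_{UU}$ with FFTs.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ski_weights(X, u):
    # Sparse linear interpolation weights from scattered 1D points X onto grid u
    # (two nonzeros per row; the real method uses local cubic interpolation).
    m, h = len(u), u[1] - u[0]
    idx = np.clip(np.floor((X - u[0]) / h).astype(int), 0, m - 2)
    t = (X - u[idx]) / h
    rows = np.repeat(np.arange(len(X)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1 - t, t], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(X), m))

rng = np.random.default_rng(0)
X = np.sort(rng.random(2000))                    # n scattered inputs
u = np.linspace(0, 1, 100)                       # inducing grid U with m points
W = ski_weights(X, u)                            # n x m sparse interpolation matrix
Kuu = np.exp(-np.subtract.outer(u, u) ** 2 / (2 * 0.1**2))   # Toeplitz on a regular grid

def ski_mvm(v):
    # K_XX v ~= W (K_UU (W^T v)); the K_UU product could use an FFT since it is Toeplitz
    return W @ (Kuu @ (W.T @ v))

v = rng.standard_normal(len(X))
print(ski_mvm(v)[:3])
```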

Slide 25: SKI for product kernels (SKIP)
Main idea: $(A \circ B) x = \mathrm{diag}(A\, \mathrm{diag}(x)\, B^T)$.
Cost for an MVM: $O(n r^2)$ flops if $A$ and $B$ have rank $r$.
Assume tensor product structure: $k(x, y) = \prod_{i=1}^d k_i(x_i, y_i)$. Many popular kernels (e.g., SE) have tensor product structure.
Use SKI in each dimension: $K \approx (W_1 K_1 W_1^T) \circ \cdots \circ (W_d K_d W_d^T)$.
Divide and conquer plus truncated rank-$r$ Lanczos factorizations: $K \approx (Q_1 T_1 Q_1^T) \circ (Q_2 T_2 Q_2^T)$.
Constructing the SKIP kernel: $O(n + m \log m + r^3 n \log d)$ flops.
Often achieves high accuracy for $r \ll n$.
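The identity $(A \circ B) x = \mathrm{diag}(A\, \mathrm{diag}(x)\, B^T)$ is easy to check numerically when $A$ and $B$ are given in low-rank form; the sketch below evaluates it through the factors in $O(n r^2)$ and compares against the dense Hadamard product. The matrices here are random stand-ins, not SKI factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 5
# Low-rank factors: A = U1 V1^T, B = U2 V2^T, each of rank r
U1, V1 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
U2, V2 = rng.standard_normal((n, r)), rng.standard_normal((n, r))
x = rng.standard_normal(n)

# Direct evaluation after forming the dense Hadamard product: O(n^2)
A, B = U1 @ V1.T, U2 @ V2.T
direct = (A * B) @ x

# (A o B) x = diag(A diag(x) B^T), evaluated through the factors: O(n r^2)
M = V1.T @ (x[:, None] * V2)                 # r x r core, M = V1^T diag(x) V2
fast = np.sum((U1 @ M) * U2, axis=1)         # row i: U1[i]^T M U2[i]
print(np.max(np.abs(direct - fast)))         # agreement up to round-off
```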

Slide 26: Rainfall
Method              | n    | m   | MSE | Time [min]
Lanczos             | 528k | 3M  | -   | -
Scaled eigenvalues  | 528k | 3M  | -   | -
Exact               | 12k  | -   | -   | -
Data: hourly precipitation data at 5500 weather stations, aggregated into daily precipitation; 628k entries in total.
Train on 528k data points, test on the remainder. Use SKI with 100 points per spatial dimension and 300 in time.
Reference comparison: exact computation on a 12k-entry subset.

Slide 27: Hickory data set
Our approach can be used for non-Gaussian likelihoods. Example: a log-Gaussian Cox process.
Count data for hickory trees in Michigan; the area is discretized using a grid.
Use the Poisson likelihood with the SE kernel and a Laplace approximation for the posterior.
The scaled eigenvalue method uses the Fiedler bound.
Figure: (a) count data, (b) exact, (c) scaled eigenvalues, (d) Lanczos.

Slide 28: Section 3. Gaussian processes with derivatives (incorporating derivatives)

Slide 29: Gaussian processes with derivatives
Assume we observe both $f(x)$ and $\nabla f(x)$, and let $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$.
Differentiation is a linear operator:
$\mu^\nabla(x) = \begin{bmatrix} \mu(x) \\ \nabla \mu(x) \end{bmatrix}$, $\qquad k^\nabla(x, x') = \begin{bmatrix} k(x, x') & (\nabla_{x'} k(x, x'))^T \\ \nabla_x k(x, x') & \nabla_x \nabla_{x'} k(x, x') \end{bmatrix}$
Multi-output GP: $\begin{bmatrix} f(x) \\ \nabla f(x) \end{bmatrix} \sim \mathcal{GP}(\mu^\nabla(x), k^\nabla(x, x'))$.
Exact kernel learning and inference is now $O(n^3 d^3)$ flops, since it involves a kernel matrix of size $n(d+1) \times n(d+1)$.
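For $d = 1$ and an SE kernel the blocks of $k^\nabla$ can be written out explicitly; the sketch below assembles the $2n \times 2n$ joint covariance of $(f(x_i), f'(x_i))$ with illustrative hyperparameters.

```python
import numpy as np

def grad_se_kernel_matrix(x, s=1.0, l=0.3):
    # Joint covariance of (f(x_i), f'(x_i)) for a 1D SE kernel:
    #   [[ k,         d k / dx'       ],
    #    [ d k / dx,  d^2 k / dx dx'  ]]
    r = np.subtract.outer(x, x)                  # r_ij = x_i - x_j
    k = s**2 * np.exp(-r**2 / (2 * l**2))
    k_x = -(r / l**2) * k                        # derivative w.r.t. the first argument
    k_xxp = (1.0 / l**2 - r**2 / l**4) * k       # mixed second derivative
    return np.block([[k, k_x.T], [k_x, k_xxp]])  # size 2n x 2n (n(d+1) in general)

x = np.linspace(0, 1, 5)
K = grad_se_kernel_matrix(x)
print(K.shape, np.all(np.linalg.eigvalsh(K + 1e-10 * np.eye(10)) > 0))
```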

Slide 30: Example: the Branin function
Gradient information can make the GP model more accurate.
Figure: (left) true Branin function, (middle) SE GP without gradients, (right) SE GP with gradients.

Slide 31: Extending SKI and SKIP
Differentiate the approximation scheme.
D-SKI: $k(x, x') \approx \sum_i w_i(x) k(u_i, x')$, so $\nabla k(x, x') \approx \sum_i \nabla w_i(x) k(u_i, x')$.
D-SKIP: differentiate each Hadamard product.
Figure: (left) log10 error in the D-SKI approximation and comparison to the exact spectrum; (right) log10 error in the D-SKIP approximation and comparison to the exact spectrum.

Slide 32: Active subspaces
Can estimate the active subspace from gradients: $C = \int_\Omega \nabla f(x) \nabla f(x)^T \, dx = Q \Lambda Q^T$.
$\lambda_i$ measures the average change in $f$ along $q_i$.
Optimal $d$-dimensional subspace $P$: the first $d$ columns of $Q$.
Active subspace approximation: $f(x) \approx f(P P^T x)$. Can work with the kernel $\hat{k}(x, x') = k(P^T x, P^T x')$.
We estimate $C$ using Monte Carlo integration: $C \approx \frac{1}{n} \sum_{i=1}^n \nabla f(x_i) \nabla f(x_i)^T$.
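A small numpy sketch of the Monte Carlo estimate of $C$ and the resulting subspace $P$, tested on a made-up 10-dimensional function with a planted 2-dimensional active subspace.

```python
import numpy as np

def estimate_active_subspace(grads, d_active):
    # Monte Carlo estimate C ~ (1/n) sum_i grad f(x_i) grad f(x_i)^T, then take
    # the leading eigenvectors of C as the active subspace P.
    C = grads.T @ grads / grads.shape[0]
    lam, Q = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]                # eigh returns ascending eigenvalues
    return Q[:, order[:d_active]], lam[order]

# Toy function f(x) = g(P_true^T x) in 10 dimensions with a 2D active subspace
rng = np.random.default_rng(0)
P_true, _ = np.linalg.qr(rng.standard_normal((10, 2)))
def grad_f(x):
    u = P_true.T @ x
    return P_true @ np.array([2 * u[0], np.cos(u[1])])   # chain rule through P_true

Xs = rng.standard_normal((200, 10))
G = np.array([grad_f(x) for x in Xs])
P, lam = estimate_active_subspace(G, d_active=2)
print(lam[:4])   # only the first two eigenvalues are (numerically) nonzero
```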

Slide 33: Bayesian optimization with active subspace learning
1: Generate experimental design
2: Evaluate experimental design
3: while budget not exhausted do
4:   Calculate active subspace $P$ using sampled gradients
5:   Fit GP with derivatives using $k(P^T x, P^T x')$
6:   Optimize $u_{n+1} = \arg\max A(u)$ and set $x_{n+1} = P u_{n+1}$
7:   Sample point $x_{n+1}$, value $f_{n+1}$, and gradient $\nabla f_{n+1}$
8:   Update data $D_{n+1} = D_n \cup \{x_{n+1}, f_{n+1}, \nabla f_{n+1}\}$
9: end while

Slide 34: Bayesian optimization with EI
5-dimensional Ackley function randomly embedded in 50 dimensions. We observe noisy values and noisy gradients.
Use active subspace learning from sampled gradients and D-SKI in the active subspace for fast kernel learning.
Active subspace learning improves the performance of BO.
Figure: comparison of BO with exact GPs, BO with D-SKI, BFGS, and random sampling.

Slide 35: Stanford bunny
Recovering the Stanford bunny from 25k noisy normals.
Spline kernel: $k(x, y) = s^2 (\|x - y\|^3 + a \|x - y\|^2 + b)$.
Fit an implicit GP surface: $f(x_i) = 0$, $\nabla f(x_i) = n_i$.

Slide 36: (section divider)

Slide 37: Thank you for your attention!
