Scalable kernel methods and their use in black-box optimization
1  Scalable kernel methods and their use in black-box optimization with derivatives
   David Eriksson
   Center for Applied Mathematics, Cornell University
   November 9
2  Main projects
   1. Scaling Gaussian process regression [Today]
      Collaborators: David Bindel, Kun Dong, Hannes Nickisch, Andrew Wilson
   2. Scaling Gaussian process regression with derivatives [Today]
      Collaborators: David Bindel, Kun Dong, Eric Lee, Andrew Wilson
   3. Energy bound optimization [A-exam]
      Collaborators: David Bindel
   4. Asynchronous surrogate optimization [A-exam]
      Collaborators: David Bindel, Christine Shoemaker
   5. Khatri-Rao systems of equations [Another time]
      Collaborators: Alex Townsend, Charles Van Loan
3  Outline
   1. Scattered data interpolation
      Positive definite kernels; conditionally positive definite kernels; radial basis functions
   2. Gaussian processes
      Kernel learning; approximate kernel learning
   3. Gaussian processes with derivatives
      Incorporating derivatives
4  Section 1: Scattered data interpolation
5  Scattered data interpolation
   Given:
     Pairwise distinct points X = {x_i}_{i=1}^n ⊂ Ω ⊆ R^d
     Function values f_X = [f(x_1), ..., f(x_n)]^T
   Goal: find a continuous function s_{f,X} such that
     s_{f,X}(x_i) = f(x_i),  i = 1, ..., n
   Can use a linear combination of continuous basis functions:
     s_{f,X}(x) = Σ_{i=1}^n λ_i b_i(x)
   Need to solve A_X λ = f_X, where (A_X)_{ij} = b_j(x_i)
   Well-posed if A_X is non-singular. When is this the case?
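As a concrete sketch of the linear system above (illustrative only; the helper names are my own, not from the talk), here is scattered-data interpolation with a Gaussian kernel basis b_i(x) = k(x, x_i):

```python
# Scattered-data interpolation: solve A_X lambda = f_X, then s_{f,X}(x_i) = f(x_i).
import numpy as np

def gaussian_kernel(X, Y, ell=0.5):
    """k(x, y) = exp(-||x - y||^2 / (2 ell^2)) for row-stacked points."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))           # pairwise distinct points in [0,1]^2
f = np.sin(X[:, 0]) + X[:, 1] ** 2      # function values f_X

A = gaussian_kernel(X, X)               # (A_X)_ij = k(x_i, x_j), s.p.d.
lam = np.linalg.solve(A, f)             # interpolation weights lambda

s = A @ lam                             # evaluate s_{f,X} at the nodes
assert np.allclose(s, f)                # interpolation conditions hold
```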
6  Basis functions
   d = 1: can choose basis functions independent of the data
     Example: polynomial interpolation with the monomial basis,
       det A_X = Π_{1 ≤ i < j ≤ n} (x_j − x_i) ≠ 0
     Always non-singular if the points in X are pairwise distinct
   d ≥ 2: famous negative result (Mairhuber-Curtis):
     in order for det A_X ≠ 0 to hold for all pairwise distinct X ⊂ Ω,
     the basis functions must depend on X
7  Positive definite kernels
   Characterizing all data-dependent basis functions is challenging
   Common restriction: require that A_X is always s.p.d.
   Achieved by using an s.p.d. kernel: b_i(x) = k(x, x_i)
   Definition (positive definite kernel): a (continuous) symmetric function
   k : R^d × R^d → R is called a positive definite kernel if for all X, λ such that
     1. the points in X are pairwise distinct,
     2. λ ≠ 0,
   it holds that Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j k(x_i, x_j) > 0.
8  Popular positive definite kernels
   White noise:        k(x, y) = σ² δ_{xy}
   Gaussian (SE):      k(x, y) = s² exp(−‖x − y‖² / (2ℓ²))
   Matérn 1/2:         k(x, y) = s² exp(−‖x − y‖/ℓ)
   Matérn 3/2:         k(x, y) = s² (1 + √3 ‖x − y‖/ℓ) exp(−√3 ‖x − y‖/ℓ)
   Matérn 5/2:         k(x, y) = s² (1 + √5 ‖x − y‖/ℓ + 5‖x − y‖²/(3ℓ²)) exp(−√5 ‖x − y‖/ℓ)
   Rational quadratic: k(x, y) = s² (1 + ‖x − y‖²/(2αℓ²))^{−α}
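A minimal sketch of two of the Matérn kernels listed above (function names are my own), with a numerical check that the resulting kernel matrix on pairwise distinct points is positive definite:

```python
# Matern 3/2 and 5/2 kernels as functions of r = ||x - y||.
import numpy as np

def matern32(r, s=1.0, ell=1.0):
    """s^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell)."""
    a = np.sqrt(3.0) * r / ell
    return s**2 * (1.0 + a) * np.exp(-a)

def matern52(r, s=1.0, ell=1.0):
    """s^2 (1 + sqrt(5) r / ell + 5 r^2 / (3 ell^2)) exp(-sqrt(5) r / ell)."""
    a = np.sqrt(5.0) * r / ell
    return s**2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)  # 5r^2/(3 ell^2) = a^2/3

X = np.linspace(0, 1, 10)[:, None]
r = np.abs(X - X.T)
K = matern32(r)
# positive definiteness: all eigenvalues > 0 for pairwise distinct points
assert np.linalg.eigvalsh(K).min() > 0
assert np.linalg.eigvalsh(matern52(r)).min() > 0
```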
9  Polynomial precision
   Example: the Gaussian kernel cannot reproduce f(x) = constant
   Desirable: s_{f,X} exact for low-degree polynomials
     Often referred to as polynomial precision
   Mairhuber-Curtis ⇒ need additional assumptions on X
   Definition: a set of points X is ν-unisolvent if the only polynomial of degree
   at most ν interpolating zero data on X is the zero polynomial.
   Example (three collinear points in R²): the points (0, 0), (1, 1), (2, 2)
   are not 1-unisolvent
10  Kernels and polynomial precision
   Assume: the points X are ν-unisolvent, and {π_i}_{i=1}^m is a basis for Π^d_ν
   (polynomials of degree ≤ ν)
   Look for
     s_{f,X}(x) = Σ_{i=1}^n λ_i k(x, x_i) + Σ_{i=1}^m µ_i π_i(x)
   We now have n equations and n + m unknowns
   Add the m discrete orthogonality conditions:
     Σ_{i=1}^n λ_i π_j(x_i) = 0,  j = 1, ..., m
   Allows us to use a larger family of kernels!
11  Kernels and polynomial precision
   Letting (K_XX)_{ij} = k(x_i, x_j) and (P_X)_{ij} = π_j(x_i):
     [ K_XX   P_X ] [ λ ]   [ f_X ]
     [ P_X^T   0  ] [ µ ] = [  0  ]
   Need: X (ν−1)-unisolvent, p ∈ Π^d_{ν−1}, k c.p.d. of order ν
   Definition (conditionally positive definite kernel): a (continuous) symmetric
   function k : R^d × R^d → R is called a conditionally positive definite kernel
   of order ν if for all X, λ such that
     1. the points in X are pairwise distinct,
     2. λ ≠ 0 and Σ_{i=1}^n λ_i q(x_i) = 0 for all q ∈ Π^d_{ν−1},
   it holds that Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j k(x_i, x_j) > 0.
12  Radial basis functions
   Important special case: k(x, y) = φ(r), where r = ‖x − y‖
     Cubic: φ(r) = r³;  thin-plate spline: φ(r) = r² log r
   Semi-norm: |s_{f,X}|² = ⟨s, s⟩ = λ^T Φ_XX λ
   Native space: ‖f‖_{N_φ} = sup { |s_{f,X}| : X ⊂ Ω, |X| < ∞ }
   Generic error estimate:
     |f(x) − s_{f,X}(x)|² ≤ [P_{φ,X}(x)]² ( ‖f‖²_{N_φ} − |s_{f,X}|² )
   Power function:
     [P_{φ,X}(x)]² = φ(0) − [ Φ_Xx ]^T [ Φ_XX   P_X ]^{-1} [ Φ_Xx ]
                            [ P_x  ]   [ P_X^T   0  ]      [ P_x  ]
   The power function is a Schur complement after adding x
13  Section 2: Gaussian processes
14  Gaussian process interpolation
   Defines a distribution over functions: f(x) ~ GP(µ(x), k(x, x'))
   Mean function µ : R^d → R, often a low-degree polynomial
   Covariance function: cov(f(x_i), f(x_j)) = k(x_i, x_j), an s.p.d. kernel
   Posterior mean and variance at x:
     E[f(x)] = µ(x) + K_xX K_XX^{-1} (y_X − µ_X)
     V[f(x)] = K_xx − K_xX K_XX^{-1} K_Xx
   Compared to RBFs, V[f(x)] tells us about the average case
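A minimal sketch of the posterior formulas above for a zero-mean GP with an SE kernel (illustrative setup; a small jitter is added for numerical stability):

```python
# GP posterior: E[f] = K_xX K_XX^{-1} y_X,  V[f] = K_xx - K_xX K_XX^{-1} K_Xx.
import numpy as np

def se(X, Y, s=1.0, ell=0.3):
    d2 = (X[:, None] - Y[None, :]) ** 2
    return s**2 * np.exp(-d2 / (2 * ell**2))

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(size=15))
y = np.sin(2 * np.pi * X)
Xs = np.linspace(0, 1, 50)                   # test points

K = se(X, X) + 1e-8 * np.eye(len(X))         # jitter for conditioning
Ks = se(Xs, X)
alpha = np.linalg.solve(K, y)

mean = Ks @ alpha                            # posterior mean at Xs
V = np.linalg.solve(K, Ks.T)                 # K_XX^{-1} K_Xx
var = se(Xs, Xs).diagonal() - np.einsum("ij,ji->i", Ks, V)

assert var.min() > -1e-6                     # posterior variance is nonnegative
assert mean.shape == (50,)
```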
15  Draws from a GP prior with zero mean  [figure]
16  Draws from the GP posterior  [figure]
17  Posterior mean and variance  [figure]
18  Gaussian process regression
   Assume we observe y_X = f_X + ε, ε ~ N(0, σ²I)
   Add a white noise kernel: k̃(x, y) = k(x, y) + σ² δ_{xy}
   We often do this even in the case of no noise:
   Weyl: φ ∈ C^ν ⇒ λ_n = o(n^{−ν−1/2})
     Example: λ_n decays exponentially for the Gaussian (SE) kernel
   Adding σ² δ_{xy} guarantees λ_n ≥ σ²
   Gershgorin: κ(Φ_XX + σ²I) ≤ 1 + n φ(0)/σ²
     Example: κ(Φ_XX + σ²I) ≤ 1 + n (s/σ)² for the Gaussian (SE) kernel
19  Kernel hyper-parameters
   How do we learn the optimal kernel hyper-parameters θ?
   The fully Bayesian approach is expensive; we often do MLE instead
   Log marginal likelihood:
     log p(y_X | θ) = L_y + L_K − (n/2) log 2π
   Need to compute, with c = K̃_XX^{-1} (y_X − µ_X):
     L_y = −(1/2) (y_X − µ_X)^T c
     L_K = −(1/2) log det K̃_XX
     ∂L_y/∂θ_i = (1/2) c^T (∂K̃_XX/∂θ_i) c
     ∂L_K/∂θ_i = −(1/2) tr( K̃_XX^{-1} ∂K̃_XX/∂θ_i )
20  Exact kernel learning
   Compute a dense Cholesky factorization K̃_XX = L L^T: O(n³) flops
   Solves and log-det computations with K̃_XX are then trivial:
     K̃_XX \ c = L^T \ (L \ c)
     log det K̃_XX = 2 Σ_{i=1}^n log L_ii
   Works for small n, but dense linear algebra is not scalable!
21  Iterative methods
   Assumption: we have access to a fast MVM with K̃_XX
   Use a Krylov method to solve linear systems with K̃_XX:
     K_k(A, b) = span{b, Ab, ..., A^{k−1}b}
   K̃_XX is s.p.d. ⇒ use the conjugate gradient (CG) method
     Only interacts with K̃_XX via MVMs
     Converges in n iterations in exact arithmetic
     A few iterations are enough for many kernels
   Small ℓ: K̃_XX almost diagonal ⇒ fast convergence
   Large ℓ: pivoted Cholesky preconditioner, K̃_XX ≈ P(LL^T)P^T
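A minimal CG sketch that touches the matrix only through matrix-vector products, as described above (my own implementation, not the talk's code):

```python
# Conjugate gradients for s.p.d. systems using only MVMs.
import numpy as np

def cg(matvec, b, tol=1e-10, maxiter=200):
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = matvec(p)                 # the only interaction with the matrix
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 50))
A = A @ A.T + 50 * np.eye(50)          # well-conditioned s.p.d. matrix
b = rng.standard_normal(50)
x = cg(lambda v: A @ v, b)
assert np.allclose(A @ x, b, atol=1e-6)
```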
22  Stochastic trace estimation
   How do we approximate log det K̃_XX using fast MVMs?
   Note that log det K̃_XX = tr(log K̃_XX)
   Estimating the trace of a matrix ⇒ stochastic trace estimation
   If z has independent random entries with E[z_i] = 0, E[z_i²] = 1:
     E[z^T A z] = Σ_{i=1}^n Σ_{j=1}^n a_ij E[z_i z_j] = tr(A)
   Common choices of probe vector z:
     Hutchinson: z_i = ±1 with probability 0.5
     Gaussian: z_i ~ N(0, 1)
   This requires fast computation of log(K̃_XX) z:
     function application with a Hermitian matrix ⇒ Lanczos
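The Hutchinson estimator above can be sketched directly for tr(A) (here applied to a dense symmetric test matrix; in the GP setting A would be log K̃_XX, applied via Lanczos):

```python
# Hutchinson stochastic trace estimation: E[z^T A z] = tr(A) for +-1 probes.
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 100))
A = A @ A.T                               # symmetric test matrix

num_probes = 2000
est = 0.0
for _ in range(num_probes):
    z = rng.choice([-1.0, 1.0], size=100)  # Hutchinson probe vector
    est += z @ (A @ z)                     # only needs MVMs with A
est /= num_probes

assert abs(est - np.trace(A)) / abs(np.trace(A)) < 0.1
```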
23  Lanczos
   Lanczos computes the factorization K̃_XX Q = Q T,
   with Q orthogonal and T tridiagonal
   Elegant three-term recursion with one MVM per iteration
   Converges in k ≤ p steps if K̃_XX has p distinct eigenvalues
   Function application starting at z/‖z‖:
     f(K̃_XX) z = Q f(T) Q^T z = ‖z‖ Q f(T) e_1
   Truncate after k ≪ n steps: f(K̃_XX) z ≈ ‖z‖ Q_k f(T_k) e_1
   N.B. CG is a special case of Lanczos
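A sketch of the Lanczos recursion above for approximating f(A)z (my own implementation with full reorthogonalization; here f = exp and k = n, so the result matches the exact eigendecomposition):

```python
# Truncated Lanczos approximation of f(A) z:  f(A) z ~= ||z|| Q_k f(T_k) e_1.
import numpy as np

def lanczos_fA_z(matvec, z, k, f):
    n = len(z)
    Q = np.zeros((n, k))
    alpha = np.zeros(k)
    beta = np.zeros(k - 1)
    Q[:, 0] = z / np.linalg.norm(z)
    for j in range(k):
        w = matvec(Q[:, j])                          # one MVM per iteration
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)     # full reorthogonalization
        if j < k - 1:
            beta[j] = np.linalg.norm(w)
            Q[:, j + 1] = w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    evals, V = np.linalg.eigh(T)                     # f(T) = V f(Lambda) V^T
    fT_e1 = V @ (f(evals) * V[0, :])
    return np.linalg.norm(z) * Q @ fT_e1

rng = np.random.default_rng(6)
A = rng.standard_normal((40, 40))
A = (A + A.T) / 2                                    # symmetric test matrix
z = rng.standard_normal(40)
approx = lanczos_fA_z(lambda v: A @ v, z, k=40, f=np.exp)

w, U = np.linalg.eigh(A)
exact = U @ (np.exp(w) * (U.T @ z))
assert np.linalg.norm(approx - exact) < 1e-6 * np.linalg.norm(exact)
```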
24  Fast MVMs: SKI
   Structured kernel interpolation (SKI): K_XX ≈ W^T K_UU W
     U is a structured grid with m points
     K_UU is BTTB (with Kronecker structure for product kernels)
     W is a sparse matrix of interpolation weights
   Can apply an MVM with K̃_XX in O(m log m) time using the FFT
   Grid structure limits the approach to ≈ 5 dimensions
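The fast grid MVM behind SKI can be illustrated in 1-D, where K_UU is a symmetric Toeplitz matrix for a stationary kernel on a regular grid and the MVM is done via circulant embedding and the FFT (my own sketch of the standard technique, not the talk's code):

```python
# Fast Toeplitz MVM via circulant embedding + FFT (1-D analogue of the BTTB MVM).
import numpy as np

m = 256
u = np.linspace(0, 1, m)                      # regular grid U
col = np.exp(-(u - u[0]) ** 2 / 0.02)         # first column of Toeplitz K_UU

def toeplitz_mvm(col, x):
    """Embed the symmetric Toeplitz matrix in a circulant of size 2m-2."""
    c = np.concatenate([col, col[-2:0:-1]])   # circulant first column
    y = np.fft.irfft(np.fft.rfft(c) * np.fft.rfft(x, n=len(c)), n=len(c))
    return y[:len(x)]                         # first m entries give K_UU x

x = np.random.default_rng(7).standard_normal(m)
K = np.exp(-(u[:, None] - u[None, :]) ** 2 / 0.02)   # dense reference
assert np.allclose(toeplitz_mvm(col, x), K @ x)
```

The FFT-based MVM costs O(m log m) instead of the O(m²) dense product, which is the source of SKI's speed.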
25  SKI for product kernels (SKIP)
   Main idea: (A ∘ B)x = diag(A diag(x) B^T)
     Cost for an MVM: O(nr²) flops if A, B have rank r
   Assume tensor product structure: k(x, y) = Π_{i=1}^d k_i(x_i, y_i)
     Many popular kernels (e.g., SE) have tensor product structure
   Use SKI in each dimension:
     K ≈ (W_1 K_1 W_1^T) ∘ ... ∘ (W_d K_d W_d^T)
   Divide and conquer + truncated rank-r Lanczos factorizations:
     K ≈ (Q_1 T_1 Q_1^T) ∘ (Q_2 T_2 Q_2^T)
   Constructing the SKIP kernel: O(n + m log m + r³ n log d) flops
   Often achieve high accuracy for r ≪ n
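The Hadamard-product MVM identity above can be verified directly on small matrices (illustrative sketch):

```python
# Verify (A o B) x = diag(A diag(x) B^T), where o is the Hadamard product.
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
B = rng.standard_normal((6, 6))
x = rng.standard_normal(6)

lhs = (A * B) @ x                       # Hadamard-product MVM, formed explicitly
rhs = np.diag(A @ np.diag(x) @ B.T)     # identity from the slide
assert np.allclose(lhs, rhs)
```

When A = U_A V_A^T and B = U_B V_B^T have rank r, the right-hand side can be evaluated in O(nr²) flops without ever forming A ∘ B.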
26  Rainfall
   Method              n     m    MSE   Time [min]
   Lanczos             528k  3M   —     —
   Scaled eigenvalues  528k  3M   —     —
   Exact               12k   —    —     —
   Data: hourly precipitation data at 5500 weather stations,
   aggregated into daily precipitation
   Total data: 628k entries; train on 528k data points, test on the remainder
   Use SKI with 100 points per spatial dimension, 300 in time
   Reference comparison: exact computation (12k entries)
27  Hickory data set
   Our approach can be used for non-Gaussian likelihoods
   Example: log-Gaussian Cox process
     Count data for hickory trees in Michigan
     Area discretized using a grid
     Use the Poisson likelihood with the SE kernel
     Laplace approximation for the posterior
   The scaled eigenvalue method uses the Fiedler bound
   Figure: (a) count data, (b) exact, (c) scaled eigenvalues, (d) Lanczos
28  Section 3: Gaussian processes with derivatives
29  Gaussian processes with derivatives
   Assume we observe both f(x) and ∇f(x)
   Let f(x) ~ GP(µ(x), k(x, x'))
   Differentiation is a linear operator:
     µ^∇(x) = [ µ(x) ]          k^∇(x, x') = [ k(x, x')       (∇_{x'} k(x, x'))^T ]
              [ ∇µ(x) ]                      [ ∇_x k(x, x')    ∇² k(x, x')        ]
   Multi-output GP:
     [ f(x)  ]
     [ ∇f(x) ] ~ GP( µ^∇(x), k^∇(x, x') )
   Exact kernel learning and inference is now O(n³d³) flops
     Involves a kernel matrix of size n(d+1) × n(d+1)
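For d = 1 and the SE kernel, the joint value/gradient kernel matrix above can be assembled explicitly (my own sketch; the block structure follows the slide's k^∇):

```python
# Joint kernel matrix for values and gradients of a 1-D SE kernel.
import numpy as np

def se_blocks(x, y, ell=0.7):
    d = x[:, None] - y[None, :]
    k = np.exp(-d**2 / (2 * ell**2))
    dk_dy = (d / ell**2) * k                  # d k / d y
    dk_dx = (-d / ell**2) * k                 # d k / d x
    d2k = (1 / ell**2 - d**2 / ell**4) * k    # d^2 k / dx dy
    return np.block([[k, dk_dy], [dk_dx, d2k]])

x = np.linspace(0, 1, 8)
K = se_blocks(x, x) + 1e-8 * np.eye(16)       # n(d+1) x n(d+1) = 16 x 16, plus jitter
assert np.allclose(K, K.T)                    # symmetric by construction
assert np.linalg.eigvalsh(K).min() > 0        # positive definite
```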
30  Example: Branin function
   Gradient information can make the GP model more accurate
   Figure: (left) true Branin function; (middle) GP with SE kernel, no gradients;
   (right) GP with SE kernel and gradients
31  Extending SKI and SKIP
   Differentiate the approximation scheme
   D-SKI:
     k(x, x') ≈ Σ_i w_i(x) k(u_i, x')
     ∇k(x, x') ≈ Σ_i ∇w_i(x) k(u_i, x')
   D-SKIP: differentiate each Hadamard product
   Figure: (left) log10 error of the D-SKI approximation and comparison to the
   exact spectrum; (right) log10 error of the D-SKIP approximation and
   comparison to the exact spectrum
32  Active subspaces
   Can estimate the active subspace from gradients:
     C = ∫_Ω ∇f(x) ∇f(x)^T dx = Q Λ Q^T
   λ_i measures the average change in f along q_i
   Optimal d-dimensional subspace P: first d columns of Q
   Active subspace approximation: f(x) ≈ f(P P^T x)
   Can work with the kernel k̂(x, x') = k(P^T x, P^T x')
   We estimate C using Monte Carlo integration:
     C ≈ (1/n) Σ_{i=1}^n ∇f(x_i) ∇f(x_i)^T
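The Monte Carlo estimate of C can be sketched on a toy ridge function f(x) = sin(a^T x), whose gradients all point along a, so the dominant eigenvector of C recovers the 1-D active subspace (illustrative setup, not from the talk):

```python
# Monte Carlo active-subspace estimation: C ~= (1/n) sum grad f grad f^T.
import numpy as np

rng = np.random.default_rng(9)
D = 10
a = rng.standard_normal(D)
a /= np.linalg.norm(a)

def grad_f(x):
    return np.cos(a @ x) * a               # gradient of f(x) = sin(a^T x)

Xs = rng.standard_normal((500, D))         # Monte Carlo sample points
C = sum(np.outer(grad_f(x), grad_f(x)) for x in Xs) / len(Xs)

lam, Q = np.linalg.eigh(C)                 # eigenvalues in ascending order
q1 = Q[:, -1]                              # dominant direction
assert abs(abs(q1 @ a) - 1.0) < 1e-8       # recovers the active direction
assert lam[-2] < 1e-10 * lam[-1]           # remaining eigenvalues vanish
```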
33  Bayesian optimization with active subspace learning
   1: Generate experimental design
   2: Evaluate experimental design
   3: while budget not exhausted do
   4:   Calculate active subspace P using sampled gradients
   5:   Fit GP with derivatives using k(P^T x, P^T x')
   6:   Optimize u_{n+1} = arg max A(u); set x_{n+1} = P u_{n+1}
   7:   Sample point x_{n+1}, value f_{n+1}, and gradient ∇f_{n+1}
   8:   Update data D_{n+1} = D_n ∪ {x_{n+1}, f_{n+1}, ∇f_{n+1}}
   9: end while
34  Bayesian optimization with EI
   5-dimensional Ackley function randomly embedded in 50 dimensions
   Observe noisy values and noisy gradients
   Use active subspace learning from sampled gradients
   Use D-SKI in the active subspace for fast kernel learning
   Active subspace learning improves the performance of BO
   Figure legend: BO exact, BO D-SKI, BFGS, random sampling
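The acquisition function A(u) here is expected improvement; the slide does not spell out the formula, so the following is a standard EI sketch for minimization given the GP posterior mean and standard deviation (function name and arguments are my own):

```python
# Expected improvement for minimization: EI = (f* - mu) Phi(z) + sigma phi(z).
import math

def expected_improvement(mu, sigma, f_best):
    if sigma == 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

# EI is zero with no predicted improvement and no uncertainty,
# and positive whenever the posterior is uncertain.
assert expected_improvement(1.0, 0.0, 0.5) == 0.0
assert expected_improvement(0.0, 1.0, 0.0) > 0.0
```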
35  Stanford bunny
   Recovering the Stanford bunny from 25k noisy normals
   Spline kernel: k(x, y) = s² ( ‖x − y‖³ + a ‖x − y‖² + b )
   Fit an implicit GP surface: f(x_i) = 0, ∇f(x_i) = n_i
37  Thank you for your attention!
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationFundamentals of Matrices
Maschinelles Lernen II Fundamentals of Matrices Christoph Sawade/Niels Landwehr/Blaine Nelson Tobias Scheffer Matrix Examples Recap: Data Linear Model: f i x = w i T x Let X = x x n be the data matrix
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London
More informationKernel-based Approximation. Methods using MATLAB. Gregory Fasshauer. Interdisciplinary Mathematical Sciences. Michael McCourt.
SINGAPORE SHANGHAI Vol TAIPEI - Interdisciplinary Mathematical Sciences 19 Kernel-based Approximation Methods using MATLAB Gregory Fasshauer Illinois Institute of Technology, USA Michael McCourt University
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationSupport Vector Machines
Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More information26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.
10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with
More informationApplications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices
Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.
More informationContents. Preface... xi. Introduction...
Contents Preface... xi Introduction... xv Chapter 1. Computer Architectures... 1 1.1. Different types of parallelism... 1 1.1.1. Overlap, concurrency and parallelism... 1 1.1.2. Temporal and spatial parallelism
More informationA MATRIX-FREE APPROACH FOR SOLVING THE PARAMETRIC GAUSSIAN PROCESS MAXIMUM LIKELIHOOD PROBLEM
Preprint ANL/MCS-P1857-0311 A MATRIX-FREE APPROACH FOR SOLVING THE PARAMETRIC GAUSSIAN PROCESS MAXIMUM LIKELIHOOD PROBLEM MIHAI ANITESCU, JIE CHEN, AND LEI WANG Abstract. Gaussian processes are the cornerstone
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 19: Computing the SVD; Sparse Linear Systems Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationGeometric Modeling Summer Semester 2012 Linear Algebra & Function Spaces
Geometric Modeling Summer Semester 2012 Linear Algebra & Function Spaces (Recap) Announcement Room change: On Thursday, April 26th, room 024 is occupied. The lecture will be moved to room 021, E1 4 (the
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationDefault Priors and Effcient Posterior Computation in Bayesian
Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature
More informationMachine Learning. Bayesian Regression & Classification. Marc Toussaint U Stuttgart
Machine Learning Bayesian Regression & Classification learning as inference, Bayesian Kernel Ridge regression & Gaussian Processes, Bayesian Kernel Logistic Regression & GP classification, Bayesian Neural
More informationA orthonormal basis for Radial Basis Function approximation
A orthonormal basis for Radial Basis Function approximation 9th ISAAC Congress Krakow, August 5-9, 2013 Gabriele Santin, joint work with S. De Marchi Department of Mathematics. Doctoral School in Mathematical
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationState Space Representation of Gaussian Processes
State Space Representation of Gaussian Processes Simo Särkkä Department of Biomedical Engineering and Computational Science (BECS) Aalto University, Espoo, Finland June 12th, 2013 Simo Särkkä (Aalto University)
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More information