Efficient Complex Output Prediction

1 Efficient Complex Output Prediction
Florence d'Alché-Buc
Joint work with Romain Brault, Alex Lambert, Maxime Sangnier
October 12, 2017
LTCI, Télécom ParisTech, Institut Mines-Télécom, Université Paris-Saclay

2 Outline
Motivation and Goals
Operator-valued Kernel Regression
Scaling up Operator-valued Kernel Regression
Conclusion

3 Classic Regression
Using training data $\{(x_i, y_i),\ i = 1, \dots, N\}$, build a scalar-valued function $f$ that predicts an output $y \in \mathbb{R}$ given some input $x \in \mathcal{X}$.
Complex output regression: when $\mathcal{Y} = \mathbb{R}^p$, a set of structured objects, or a function space.

4 Multiple Output Regression
When $\mathcal{Y} = \mathbb{R}^p$:
Image understanding: predict the name of an object in an image. $\mathcal{X}$: image representation space; $\mathcal{Y} = \mathbb{R}^p$: semantic space.
Joint quantile regression: $\mathcal{Y} = \mathbb{R}^p$, treated as multitask learning over the desired quantile levels $\tau_1, \dots, \tau_p$ (Sangnier et al. 2016).

5 Multiple Output Regression
When $\mathcal{Y}$ is a set of structured objects:
Identification of metabolites from mass spectra. $\mathcal{X}$: mass spectra space; $\mathcal{Y}$: set of metabolites.
When $\mathcal{Y} = \mathcal{F}$, a space of functions:
Functional quantile regression. $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathcal{H}$, a reproducing kernel Hilbert space (Brault 2017).

6 Learning functions with values in a Hilbert space $\mathcal{Y}$
Operator-valued kernels and vector-valued reproducing kernel Hilbert spaces (RKHS)
Nonparametric learning
Various loss functions for data fitting
Various kinds of regularization
Theoretical guarantees, both statistical and in terms of optimization, which also lead to efficient learning algorithms

8 Operator-valued Kernels
Natural extension of scalar kernels to vector-valued functions. Allows coupling between outputs.
Let $\mathcal{X}$ be some input space and $\mathcal{Y}$ a Hilbert space.
Scalar kernel versus operator-valued kernel (OVK):
Domain: $k(x, z) \in \mathbb{R}$ versus $K(x, z) \in \mathcal{L}(\mathcal{Y})$.
Symmetry: $k(x, z) = k(z, x)$ versus $K(x, z) = K(z, x)^*$.
Positive definiteness: $\forall c_i, c_j \in \mathbb{R},\ \sum_{i,j=1}^{N} c_i c_j\, k(x_i, x_j) \geq 0$ versus $\forall c_i, c_j \in \mathbb{R}^p,\ \sum_{i,j=1}^{N} \langle c_i, K(x_i, x_j) c_j \rangle_{\mathcal{Y}} \geq 0$.
The simplest operator-valued kernel is the decomposable one: $K(x, x') = k(x, x')\, B$, where $B$ is a positive semi-definite $p \times p$ matrix and $k$ is a scalar-valued kernel on $\mathcal{X}$.
$B = I$ recovers the case of $p$ independent scalar-valued kernel machines.
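To make the decomposable kernel concrete, the sketch below builds its block Gram matrix as a Kronecker product and checks symmetry and positive semi-definiteness numerically. The Gaussian choice for $k$, the bandwidth `gamma`, and the function names are illustrative assumptions, not taken from the slides.

```python
import numpy as np


def gaussian_kernel(X, Z, gamma=1.0):
    """Scalar Gaussian kernel matrix k(x, z) = exp(-gamma * ||x - z||^2) (assumed choice of k)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def decomposable_gram(X, B, gamma=1.0):
    """Block Gram matrix of K(x, z) = k(x, z) B; block (i, j) equals k(x_i, x_j) * B."""
    return np.kron(gaussian_kernel(X, X, gamma), B)


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # N = 5 inputs in R^3
A = rng.normal(size=(2, 2))
B = A @ A.T                          # a positive semi-definite 2 x 2 output matrix

G = decomposable_gram(X, B)          # shape (N * p, N * p) = (10, 10)
G_indep = decomposable_gram(X, np.eye(2))  # B = I: no coupling between outputs

min_eig = np.linalg.eigvalsh(G).min()
print("symmetric:", np.allclose(G, G.T), "| smallest eigenvalue:", min_eig)
```

With `B = np.eye(2)` every off-diagonal entry inside each $p \times p$ block is zero, which is the "independent scalar machines" case mentioned on the slide.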

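9 Vector-valued RKHS
Given an OVK $K$, there is a unique vector-valued RKHS $\mathcal{H}_K$.
Feature maps: $K(x, z) = \Phi(x)^* \Phi(z)$.
Representer theorems.
Representer theorem (Micchelli and Pontil, 2005): given a training set $\{(x_1, y_1), \dots, (x_N, y_N)\} \subset \mathcal{X} \times \mathcal{Y}$, the minimizer
$$\hat{f} = \arg\min_{f \in \mathcal{H}_K} \|f\|_{\mathcal{H}_K}^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i))$$
admits an expansion of the form
$$\hat{f}(\cdot) = \sum_{i=1}^{N} K(\cdot, x_i)\, c_i,$$
where $c_i \in \mathcal{Y}$.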
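As an illustration of the representer expansion, here is a minimal sketch for the squared loss with a decomposable kernel, where the coefficients $c_i$ solve a single linear system. The squared loss, the Gaussian scalar kernel, the ridge parameter `reg` (a reparameterization of the trade-off constant), and all function names are assumptions made for the example.

```python
import numpy as np


def gaussian_kernel(X, Z, gamma=1.0):
    """Scalar Gaussian kernel matrix (assumed choice of k)."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_ovk_ridge(X, Y, B, reg=1e-2, gamma=1.0):
    """Solve (G + reg * I) vec(C) = vec(Y), with G the block Gram matrix of k(x, z) B."""
    N, p = Y.shape
    G = np.kron(gaussian_kernel(X, X, gamma), B)
    c = np.linalg.solve(G + reg * np.eye(N * p), Y.reshape(-1))
    return c.reshape(N, p)  # row i holds the coefficient c_i in R^p


def predict_ovk_ridge(X_train, C, B, X_new, gamma=1.0):
    """Evaluate f(x) = sum_i k(x, x_i) B c_i for each row x of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ C @ B.T


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy inputs
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1]), X[:, 0] * X[:, 1]], axis=1)
B = 0.5 * np.eye(3) + 0.5 * np.ones((3, 3)) / 3   # PSD coupling matrix
reg = 1e-2

C = fit_ovk_ridge(X, Y, B, reg=reg)
Y_hat = predict_ovk_ridge(X, C, B, X)
print("training RMSE:", np.sqrt(np.mean((Y_hat - Y) ** 2)))
```

The linear system has size $Np \times Np$, which is why the naive closed form becomes expensive as $N$ and $p$ grow.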

11 Regression in Vector-valued RKHS: a few examples
Image understanding: predict the object name in an image
Surrogate loss $\ell_{\text{Fisher}}(y, f(x)) = \|\nabla_\theta \ln p_\theta(y) - f(x)\|^2$, followed by a pre-image step (Djerrab et al. 2017)
Decomposable kernels
Very good results on few-shot learning (Caltech101)
Sparse modeling of time series
Loss: ϵ-insensitive loss and transformable kernels (Lim et al. 2013, 2015; Sangnier et al. 2016)
Application: modeling climate data (Lim et al. 2015)

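12 Joint Quantile Regression as multitask learning
Loss: $\ell_{\text{pinball}}(y, f(x)) = \ell_\tau(y\mathbf{1} - f(x) - b)$, with $y \in \mathbb{R}$ but $f(x) \in \mathbb{R}^p$
Decomposable kernel whose matrix is parameterized by the desired quantile levels $\tau_1, \dots, \tau_p$ (Sangnier et al. 2016)
Pinball loss:
$$\ell_\tau(r) = \sum_{j=1}^{p} \begin{cases} \tau_j r_j & \text{if } r_j \geq 0, \\ (\tau_j - 1)\, r_j & \text{if } r_j < 0. \end{cases}$$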
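The multi-quantile pinball loss can be written in a few lines. The sketch below evaluates it on residuals $y\mathbf{1} - f(x) - b$; the function names and the toy numbers are illustrative assumptions.

```python
import numpy as np


def pinball_loss(residuals, taus):
    """Sum over quantile levels of tau_j * r_j if r_j >= 0, else (tau_j - 1) * r_j."""
    r = np.asarray(residuals, dtype=float)
    taus = np.asarray(taus, dtype=float)
    per_level = np.where(r >= 0, taus * r, (taus - 1.0) * r)
    return per_level.sum(axis=-1)


def joint_quantile_loss(y, f_x, b, taus):
    """Pinball loss of the residuals y * 1 - f(x) - b, one value per sample.

    y: shape (N,), f_x: shape (N, p), b: shape (p,), taus: shape (p,).
    """
    residuals = np.asarray(y, dtype=float)[:, None] - f_x - b
    return pinball_loss(residuals, taus)


taus = np.array([0.1, 0.5, 0.9])
y = np.array([1.0, 2.0])
f_x = np.array([[0.5, 1.0, 1.5],
                [2.5, 2.0, 1.0]])
b = np.zeros(3)

losses = joint_quantile_loss(y, f_x, b, taus)
print(losses)
```

Each term is non-negative, and over-prediction of a low quantile (or under-prediction of a high one) is penalized more heavily than the opposite error.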

14 Scalability of Regression in vector-valued RKHS
Focus on kernel ridge regression for $\mathcal{Y} = \mathbb{R}^p$:
Prediction in time linear in the number of training points: $O(Np^2)$
Naive learning (closed form) in $O(N^3 p^3)$
How to make the method scalable? Find a matrix-valued feature map $\tilde{\Phi}$ such that
$$K(x, z) \approx \tilde{K}(x, z) = \tilde{\Phi}(x)^* \tilde{\Phi}(z), \quad (1)$$
in order to work with the linear model
$$\tilde{f}(x) = \tilde{\Phi}(x)^* \theta, \quad (2)$$
where $\theta \in \mathbb{R}^D$.
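The gain from a finite-dimensional feature map can be seen on ridge regression: solving for $\theta$ needs a $D \times D$ system instead of an $Np \times Np$ one, and both routes give the same model when the kernel is exactly $\Phi(x)^* \Phi(z)$. The sketch below checks this with random placeholder feature matrices; the sizes, the parameter `reg`, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, D = 20, 3, 8
reg = 0.1

# Placeholder feature map values: Phi[i] is a (D, p) matrix standing in for Phi(x_i).
Phi = rng.normal(size=(N, D, p))
Y = rng.normal(size=(N, p))

# Stack the feature matrices side by side: Z has shape (D, N * p),
# so that Z.T @ theta concatenates the predictions Phi(x_i)^* theta.
Z = Phi.transpose(1, 0, 2).reshape(D, N * p)
y = Y.reshape(-1)

# Primal route: one D x D linear system.
theta_primal = np.linalg.solve(Z @ Z.T + reg * np.eye(D), Z @ y)

# Dual (kernel) route: one (N * p) x (N * p) system on the block Gram matrix.
G = Z.T @ Z                                   # block (i, j) is Phi(x_i)^* Phi(x_j)
c = np.linalg.solve(G + reg * np.eye(N * p), y)
theta_dual = Z @ c

print("same solution:", np.allclose(theta_primal, theta_dual))
```

When $D \ll Np$ the primal system is much smaller, which is the motivation for looking for an approximate feature map.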

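16 Toward spectral approximation of OVK
Theorem (Bochner for OVK, Carmeli et al. 2010): let $K: \mathbb{R}^d \times \mathbb{R}^d \to \mathcal{L}(\mathcal{Y})$ be a translation-invariant, positive definite, continuous OVK. There exists a unique non-negative Borel operator-valued measure $Q$ such that
$$\forall (x, z) \in \mathbb{R}^d \times \mathbb{R}^d, \quad K(x, z) = \int_{\mathbb{R}^d} \cos\big(\langle x - z, \omega \rangle\big)\, dQ(\omega).$$
Find $B: \mathbb{R}^d \to \mathcal{L}(\mathcal{U}; \mathcal{Y})$ and a scalar positive measure $\mu$ such that
$$dQ(\omega) = B(\omega) B(\omega)^*\, d\mu(\omega).$$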

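17 Operator Random Fourier Features (ORFF)
Assume $\mu$ is a probability distribution. Then, given $(\omega_j)_{j=1}^{D} \sim \mu$ i.i.d., construct (Brault et al. 2016)
$$\tilde{\Phi}: \mathcal{X} \to \mathcal{L}(\mathcal{Y}, \mathcal{U}^{2D}), \quad x \mapsto \frac{1}{\sqrt{D}} \bigoplus_{j=1}^{D} \begin{pmatrix} \cos(\langle x, \omega_j \rangle)\, B(\omega_j)^* \\ \sin(\langle x, \omega_j \rangle)\, B(\omega_j)^* \end{pmatrix}.$$
$\tilde{\Phi}$ is an approximate feature map for the kernel $K$:
$$\forall (x, z) \in \mathbb{R}^d \times \mathbb{R}^d, \quad \tilde{\Phi}(x)^* \tilde{\Phi}(z) = \frac{1}{D} \sum_{j=1}^{D} \cos\big(\langle x - z, \omega_j \rangle\big)\, B(\omega_j) B(\omega_j)^* \xrightarrow[D \to \infty]{} K(x, z),$$
where the convergence holds $\mu$-almost everywhere in the weak sense.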
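A minimal sketch of this construction for one specific case, the decomposable Gaussian kernel $K(x, z) = \exp(-\gamma \|x - z\|^2)\, B$. For that kernel, $\mu$ is the Gaussian distribution $\mathcal{N}(0, 2\gamma I)$ and $B(\omega)$ can be taken constant, equal to a factor $L$ with $L L^\top = B$. The kernel choice, the Cholesky factor, the function name `orff_decomposable`, and the ordering of the blocks (all cosines first, then all sines) are assumptions of this example rather than the authors' implementation.

```python
import numpy as np


def orff_decomposable(X, omegas, L):
    """ORFF for K(x, z) = k(x - z) * (L @ L.T), with L of shape (p, u).

    Returns an array of shape (N, 2 * D * u, p); entry n is the matrix Phi(x_n).
    """
    D = omegas.shape[0]
    proj = X @ omegas.T                                   # <x, omega_j>, shape (N, D)
    trig = np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(D)
    # Each scalar cos/sin value multiplies the block B(omega_j)^* = L.T of shape (u, p).
    blocks = np.einsum("nj,up->njup", trig, L.T)          # shape (N, 2D, u, p)
    return blocks.reshape(X.shape[0], -1, L.shape[0])


rng = np.random.default_rng(0)
d, D, gamma = 3, 5000, 0.5
B = np.array([[1.0, 0.5],
              [0.5, 1.0]])
L = np.linalg.cholesky(B)                                 # L @ L.T == B

X = rng.normal(size=(4, d))
# Spectral distribution of exp(-gamma * ||delta||^2): Gaussian with variance 2 * gamma.
omegas = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))

Phi = orff_decomposable(X, omegas, L)
K_approx = Phi[0].T @ Phi[1]
K_exact = np.exp(-gamma * np.sum((X[0] - X[1]) ** 2)) * B
print("max abs error:", np.max(np.abs(K_approx - K_exact)))
```

The error shrinks as the number of sampled frequencies `D` grows; on the diagonal, $\tilde{\Phi}(x)^* \tilde{\Phi}(x)$ equals $B$ exactly because $\cos^2 + \sin^2 = 1$.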

18 Application to Functional Quantile Regression
Toy dataset: $N = 1000$ points, $D = 100$ (ORFF on the input kernel) and $D = 100$ (RFF on the output kernel)
Matches the performance obtained by multi-task learning (Sangnier et al. 2016), but is faster and gives access to all quantile levels.

20 Conclusion
Operator-valued kernel regression extends kernel methods to more involved prediction problems
Versatile framework: losses and kernels
Scalability obtained with Random Fourier Feature techniques
Theoretical guarantees on approximation

21 Perspectives
Theoretical properties of learning with ORFF
Stacking ORFF / links with deep learning
Towards hybrid architectures (Mairal 2016)
Image/text understanding (combining deep neural architectures and ORFF)
Anomaly detection (extending one-class SVM)
Spatio-temporal data: climate and epidemic data

22 Thank you for your attention
Our contributions
C. Brouard, F. d'Alché-Buc, M. Szafranski: Semi-supervised Penalized Output Kernel Regression for Link Prediction. ICML 2011.
N. Lim, Y. Senbabaoglu, G. Michailidis, F. d'Alché-Buc: OKVAR-Boost: a novel boosting algorithm to infer nonlinear dynamics and interactions in gene regulatory networks. Bioinformatics 29(11) (2013).
N. Lim, F. d'Alché-Buc, C. Auliac, G. Michailidis: Operator-valued Kernel based Vector Autoregressive Models for Network Inference. Machine Learning Journal, April.
C. Brouard, M. Szafranski, F. d'Alché-Buc: Input Output Kernel Regression for supervised and semi-supervised structured output learning. JMLR (2016).
C. Brouard, H. Shen, K. Dührkop, F. d'Alché-Buc, S. Böcker, J. Rousu: Fast metabolite identification with Input Output Kernel Regression. Bioinformatics 32(12) (2016).
M. Sangnier, O. Fercoq, F. d'Alché-Buc: Joint quantile regression in vector-valued RKHSs. NIPS 2016.
R. Brault, M. Heinonen, F. d'Alché-Buc: Random Fourier Features for Operator-valued Kernels. ACML 2016.
M. Djerrab, A. Garcia, M. Sangnier, F. d'Alché-Buc: Output Fisher Embedding Regression. Machine Learning Journal (in revision) (2017).
M. Sangnier, O. Fercoq, F. d'Alché-Buc: Data sparse nonparametric regression with ϵ-insensitive losses. ACML 2017.

23 Collaborations are welcome
One-year postdoc position (from January)
Master internship positions (April-September)
Co-supervision of PhD theses
Contact:
