Minimax-optimal distribution regression


1 Zoltán Szabó (Gatsby Unit, UCL) Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL) ISNPS, Avignon June 12, 2016

2 Example: sustainability
Goal: aerosol prediction = air pollution → climate.
Prediction using labelled bags:
bag := multi-spectral satellite measurements over an area,
label := local aerosol value.

3 Existing methods
Multi-instance learning [Haussler, 1999, Gärtner et al., 2002] (set kernel).
Sensible methods in regression: few, and they
1 rely on restrictive technical conditions,
2 would need super-high resolution satellite images.

4 One-page summary
Contributions:
1 Practical: state-of-the-art accuracy (aerosol).
2 Theoretical:
General bags: graphs, time series, texts, ...
Consistency of set kernel in regression (17-year-old open problem).
How many samples/bag?

6 Objects in the bags
Examples:
time-series modelling: user = set of time-series,
computer vision: image = collection of patch vectors,
NLP: corpus = bag of documents,
network analysis: group of people = bag of friendship graphs, ...
Wider context (statistics): point estimation tasks.

10 Regression on labelled bags
Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{l}$, where $\hat{P}_i$ is a bag sampled from $P_i$ and $N := |\hat{P}_i|$; test bag: $\hat{P}$.
Estimator:
$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{l} \sum_{i=1}^{l} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \|f\|_{H}^2$, where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.
Prediction:
$\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Challenges:
1 Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$
2 How many samples/bag?
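
The prediction formula $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' ITE code: it assumes a Gaussian kernel $k$ on the bag elements, averages it over all pairs to get $K$, and uses hypothetical toy bags (draws from $N(m_i, 1)$ with label $m_i$). The names `gaussian_gram`, `set_kernel`, `predict` and the values of `lam` and `sigma` are made up for the example.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Matrix of Gaussian kernel values k(a_i, b_j) for the rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                - 2.0 * A @ B.T
                + np.sum(B**2, axis=1)[None, :])
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def set_kernel(A, B, sigma=1.0):
    """K(mu_A, mu_B): mean of k(a_i, b_j) over all pairs (a_i, b_j)."""
    return gaussian_gram(A, B, sigma).mean()

def predict(train_bags, y, test_bag, lam=1e-3, sigma=1.0):
    """Evaluate y_hat(P) = g^T (G + l*lam*I)^{-1} y for one test bag."""
    l = len(train_bags)
    G = np.array([[set_kernel(Pi, Pj, sigma) for Pj in train_bags]
                  for Pi in train_bags])
    g = np.array([set_kernel(test_bag, Pi, sigma) for Pi in train_bags])
    # Solve the linear system instead of forming the inverse explicitly.
    return g @ np.linalg.solve(G + l * lam * np.eye(l), y)

# Toy data (hypothetical): bag i holds 50 draws from N(m_i, 1), label y_i = m_i.
rng = np.random.default_rng(0)
means = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
train_bags = [rng.normal(m, 1.0, size=(50, 1)) for m in means]
y = means.copy()

low_bag = rng.normal(-2.0, 1.0, size=(50, 1))
high_bag = rng.normal(2.0, 1.0, size=(50, 1))
y_low = predict(train_bags, y, low_bag)
y_high = predict(train_bags, y, high_bag)
```

Computing $G$ costs $O(l^2 N^2)$ kernel evaluations here, which is why the number of samples per bag matters in practice as well as in theory.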

12 Regression on labelled bags: similarity
Let us define an inner product on distributions [$K(P,Q)$]:
1 Set kernel: for $A = \{a_i\}_{i=1}^{N}$, $B = \{b_j\}_{j=1}^{N}$,
$K(A,B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \left\langle \frac{1}{N}\sum_{i=1}^{N} \varphi(a_i), \frac{1}{N}\sum_{j=1}^{N} \varphi(b_j) \right\rangle$, where $\frac{1}{N}\sum_{i=1}^{N} \varphi(a_i)$ is the feature of bag $A$.
2 Taking the limit [Berlinet and Thomas-Agnan, 2004, Altun and Smola, 2006, Smola et al., 2007]: for $a \sim P$, $b \sim Q$,
$K(P,Q) = E_{a,b}\, k(a,b) = \langle E_a \varphi(a), E_b \varphi(b) \rangle$, where $E_a \varphi(a) =: \mu_P$ is the feature of distribution $P$.
Example (Gaussian kernel): $k(a,b) = e^{-\|a-b\|_2^2/(2\sigma^2)}$.
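
The identity behind the set kernel (the average of $k(a_i, b_j)$ equals the inner product of the averaged features) can be checked numerically. The Gaussian kernel has an infinite-dimensional feature map, so the sketch below swaps in a degree-2 polynomial kernel $k(a,b) = (a^T b)^2$, whose feature map $\varphi(a) = \mathrm{vec}(a a^T)$ is explicit. That kernel choice and all function names are assumptions made for illustration only.

```python
import numpy as np

def poly2_kernel(a, b):
    """k(a, b) = (a^T b)^2, a kernel with an explicit finite feature map."""
    return float(a @ b) ** 2

def poly2_feature(a):
    """phi(a) = vec(a a^T), so that <phi(a), phi(b)> = (a^T b)^2."""
    return np.outer(a, a).ravel()

def set_kernel(A, B, k):
    """Average of k(a_i, b_j) over all pairs drawn from bags A and B."""
    return sum(k(a, b) for a in A for b in B) / (len(A) * len(B))

def mean_embedding(A, phi):
    """Feature of a bag: the average of phi over its elements."""
    return np.mean([phi(a) for a in A], axis=0)

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))            # bag A: 6 points in R^3
B = rng.normal(loc=1.0, size=(6, 3))   # bag B: 6 points in R^3

lhs = set_kernel(A, B, poly2_kernel)
rhs = mean_embedding(A, poly2_feature) @ mean_embedding(B, poly2_feature)
```

Both `lhs` and `rhs` agree up to floating-point error, which is the finite-sample version of $K(P,Q) = \langle \mu_P, \mu_Q \rangle$.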

13 Regression on labelled bags: baseline
Quality of estimator, baseline: $R(f) = E_{(\mu_P, y) \sim \rho} [f(\mu_P) - y]^2$, $f_\rho$ = best regressor.
How many samples/bag are needed to reach the accuracy of $f_\rho$? Is it possible?
Assume (for a moment): $f_\rho \in H(K)$.

16 Our result: how many samples/bag
Known [Caponnetto and De Vito, 2007]: best/achieved rate
$R(f_z^{\lambda}) - R(f_\rho) = O\left(l^{-\frac{bc}{bc+1}}\right)$,
$b$: size of the input space, $c$: smoothness of $f_\rho$.
Let $N = \tilde{O}(l^a)$. $N$: size of the bags, $l$: number of bags.
Our result: if $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.
In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
Consequence: regression with set kernel is consistent.
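
To make the exponents concrete, the small sketch below evaluates the rate exponent $bc/(bc+1)$ and the bag-size exponent $a = b(c+1)/(bc+1)$. The values $b = 2$, $c = 2$ and $l = 1000$ are illustrative assumptions, not taken from the slides, and the bag-size estimate ignores the constants and logarithmic factors hidden in $\tilde{O}$.

```python
from fractions import Fraction

def rate_exponent(b, c):
    """Exponent in the excess-risk rate O(l^{-bc/(bc+1)})."""
    return Fraction(b * c, b * c + 1)

def bag_size_exponent(b, c):
    """Exponent a in N = O~(l^a) that suffices for the optimal rate."""
    return Fraction(b * (c + 1), b * c + 1)

b, c = 2, 2                       # illustrative values (assumption)
rate = rate_exponent(b, c)        # 4/5
a = bag_size_exponent(b, c)       # 6/5, below the quadratic exponent 2

l = 1000                          # hypothetical number of bags
n_sufficient = l ** float(a)      # order of magnitude only: constants and logs dropped
n_quadratic = l ** 2
```

With these values the sufficient bag size grows like $l^{6/5}$ rather than $l^2$, which is the sub-quadratic improvement the slide refers to.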

17 Aerosol prediction result (100 × RMSE)
We perform on par with the state-of-the-art, hand-engineered method.
Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012: (± ): hand-crafted features.
Our prediction accuracy: 7.81 (±1.64): no expert knowledge.
Code in ITE: #2 on mloss,

18 Summary Task: regression on bags/distributions. Result: minimax optimality, sub-quadratic bag size, specifically: set kernel is consistent. Preprint (JMLR, in revision):

19 Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS and IIS. A part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

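20 Why can we get consistency/rates? Intuition
Convergence of the mean embedding: $\|\mu_P - \mu_{\hat{P}}\|_H = O\left(\frac{1}{\sqrt{N}}\right)$.
Hölder property of $K$ ($0 < L$, $0 < h \le 1$): $\|K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}})\|_H \le L \|\mu_P - \mu_{\hat{P}}\|_H^h$.
$f_{\hat{z}}^{\lambda}$ depends nicely on $K(\mu_{\hat{P}}, \mu_{\hat{Q}}) = \langle K(\cdot, \mu_{\hat{P}}), K(\cdot, \mu_{\hat{Q}}) \rangle_H$. [39 pages]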
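
The $1/\sqrt{N}$ convergence of the mean embedding can be observed in a small simulation. The sketch assumes a one-dimensional $P = N(0,1)$ and a Gaussian kernel with bandwidth `SIGMA`, a setting chosen only because $\|\mu_P - \mu_{\hat{P}}\|_H^2$ then has a closed form; the sample sizes and trial count are arbitrary.

```python
import numpy as np

SIGMA = 1.0  # bandwidth of the Gaussian kernel (assumed value)

def sq_embedding_error(x, sigma=SIGMA):
    """Squared RKHS distance ||mu_P - mu_P_hat||^2 for P = N(0, 1) in 1-D."""
    s2 = sigma**2
    # <mu_P, mu_P> = E k(a, a') for independent a, a' ~ N(0, 1)
    pp = sigma / np.sqrt(s2 + 2.0)
    # <mu_P, k(., x_i)> = E k(a, x_i) for each sample x_i
    px = sigma / np.sqrt(s2 + 1.0) * np.exp(-x**2 / (2.0 * (s2 + 1.0)))
    # <mu_P_hat, mu_P_hat> = average of k(x_i, x_j) over all pairs
    diffs = x[:, None] - x[None, :]
    xx = np.exp(-diffs**2 / (2.0 * s2)).mean()
    return pp - 2.0 * px.mean() + xx

rng = np.random.default_rng(0)

def mean_sq_error(n, trials=300):
    """Monte Carlo average of the squared error over independent bags of size n."""
    return np.mean([sq_embedding_error(rng.normal(size=n)) for _ in range(trials)])

err_small = mean_sq_error(25)
err_large = mean_sq_error(400)
# If the norm decays like 1/sqrt(N), the squared error decays like 1/N,
# so this ratio should be near 400 / 25 = 16 up to Monte Carlo noise.
ratio = err_small / err_large
```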

21 Extensions
1 Misspecified setting ($f_\rho \in L^2 \setminus H$):
Consistency: convergence to $\inf_{f \in H} \|f - f_\rho\|_{L^2}$.
Smoothness on $f_\rho$: computational & statistical tradeoff.

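22 Extensions
2 Vector-valued output: $Y$: separable Hilbert space $\Rightarrow$ $K(\mu_P, \mu_Q) \in L(Y)$.
Prediction on a test bag $\hat{P}$: $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Specifically: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.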

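23 Other valid similarities
Recall: $K(P,Q) = \langle \mu_P, \mu_Q \rangle$.
$K_G$: $e^{-\|\mu_P - \mu_Q\|^2 / (2\theta^2)}$
$K_e$: $e^{-\|\mu_P - \mu_Q\| / (2\theta^2)}$
$K_C$: $\left(1 + \|\mu_P - \mu_Q\|^2 / \theta^2\right)^{-1}$
$K_t$: $\left(1 + \|\mu_P - \mu_Q\|^{\theta}\right)^{-1}$
$K_i$: $\left(\|\mu_P - \mu_Q\|^2 + \theta^2\right)^{-1/2}$
Functions of $\|\mu_P - \mu_Q\|$ $\Rightarrow$ computation: similar to set kernel.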
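
Since each of these kernels depends only on $\|\mu_P - \mu_Q\|$, they can be evaluated from three set-kernel values via $\|\mu_A - \mu_B\|^2 = K(A,A) - 2K(A,B) + K(B,B)$. The sketch below assumes a Gaussian base kernel on the bag elements; the function names, the toy bags and `theta = 1.0` are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Matrix of Gaussian kernel values k(a_i, b_j) for the rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                - 2.0 * A @ B.T
                + np.sum(B**2, axis=1)[None, :])
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

def embedding_distance(A, B, sigma=1.0):
    """||mu_A - mu_B|| computed from the set-kernel values of the two bags."""
    sq = (gaussian_gram(A, A, sigma).mean()
          - 2.0 * gaussian_gram(A, B, sigma).mean()
          + gaussian_gram(B, B, sigma).mean())
    return np.sqrt(max(sq, 0.0))  # clip tiny negative round-off

# The five kernels of the slide, written as functions of d = ||mu_P - mu_Q||.
KERNELS = {
    "K_G": lambda d, theta: np.exp(-d**2 / (2.0 * theta**2)),
    "K_e": lambda d, theta: np.exp(-d / (2.0 * theta**2)),
    "K_C": lambda d, theta: 1.0 / (1.0 + d**2 / theta**2),
    "K_t": lambda d, theta: 1.0 / (1.0 + d**theta),
    "K_i": lambda d, theta: 1.0 / np.sqrt(d**2 + theta**2),
}

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(40, 2))   # toy bag from N(0, I)
B = rng.normal(1.5, 1.0, size=(40, 2))   # toy bag from N(1.5, I)

theta = 1.0
d_ab = embedding_distance(A, B)
values = {name: K(d_ab, theta) for name, K in KERNELS.items()}
```

The cost is dominated by the three Gram matrices, so these kernels are no more expensive than the set kernel itself.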

24 Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Conference on Learning Theory (COLT).
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML).
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc. … convolutions.pdf).
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT).
