Distribution Regression with Minimax-Optimal Guarantee


1 Distribution Regression with Minimax-Optimal Guarantee
Zoltán Szabó (Gatsby Unit, UCL)
Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), Arthur Gretton (Gatsby Unit, UCL)
MASCOT-NUM, Toulouse, March 25, 2016

2 Example: sustainability
Goal: aerosol prediction = air pollution → climate.
Prediction using labelled bags:
  bag := multi-spectral satellite measurements over an area,
  label := local aerosol value.

3 Example: existing methods
Multi-instance learning [Haussler, 1999; Gärtner et al., 2002] (set kernel).
Sensible methods in regression are few; they need
  1. restrictive technical conditions,
  2. super-high-resolution satellite images.

5 One-page summary
Contributions:
  1. Practical: state-of-the-art accuracy (aerosol).
  2. Theoretical:
     General bags: graphs, time series, texts, ...
     Consistency of set kernel in regression (17-year-old open problem).
     How many samples/bag?
AISTATS-2015 (oral, 6.11%); JMLR in revision.

7 Objects in the bags
Examples:
  time-series modelling: user = set of time-series,
  computer vision: image = collection of patch vectors,
  NLP: corpus = bag of documents,
  network analysis: group of people = bag of friendship graphs, ...
Wider context (statistics): point estimation tasks.

11 Regression on labelled bags
Given:
  labelled bags: $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{l}$, $\hat{P}_i$: bag from $P_i$, $N := |\hat{P}_i|$,
  test bag: $\hat{P}$.
Estimator:
  $$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{l} \sum_{i=1}^{l} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \|f\|_{H(K)}^2,$$
  where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.
Prediction:
  $$\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y,$$
  $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Challenges:
  1. Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$
  2. How many samples/bag?
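The prediction formula can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' ITE implementation: the function name `ridge_predict` is hypothetical, and the Gram matrix `G` and the vector `g` are assumed to be precomputed with some kernel K on the mean embeddings.

```python
import numpy as np

def ridge_predict(G, g, y, lam):
    """Evaluate y_hat = g^T (G + l*lam*I)^{-1} y for a single test bag.

    G   : (l, l) Gram matrix, G[i, j] = K(mu_Pi, mu_Pj) on the training bags
    g   : (l,)   vector, g[i] = K(mu_P_test, mu_Pi)
    y   : (l,)   training labels
    lam : regularization parameter lambda > 0
    """
    G = np.asarray(G, dtype=float)
    g = np.asarray(g, dtype=float)
    y = np.asarray(y, dtype=float)
    l = G.shape[0]
    # Solve the linear system rather than forming the inverse explicitly.
    alpha = np.linalg.solve(G + l * lam * np.eye(l), y)
    return float(g @ alpha)
```

Solving the system once and reusing `alpha` for many test bags avoids repeating the cubic-cost step per prediction.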

13 Regression on labelled bags: similarity
Let us define an inner product on distributions, $K(P,Q)$:
  1. Set kernel: $A = \{a_i\}_{i=1}^{N}$, $B = \{b_j\}_{j=1}^{N}$.
     $$K(A,B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \left\langle \frac{1}{N} \sum_{i=1}^{N} \varphi(a_i), \frac{1}{N} \sum_{j=1}^{N} \varphi(b_j) \right\rangle,$$
     where $\frac{1}{N} \sum_{i=1}^{N} \varphi(a_i)$ is the feature of bag $A$.
  2. Taking limit [Berlinet and Thomas-Agnan, 2004; Altun and Smola, 2006; Smola et al., 2007]: $a \sim P$, $b \sim Q$,
     $$K(P,Q) = E_{a,b}\, k(a,b) = \left\langle E_a \varphi(a), E_b \varphi(b) \right\rangle,$$
     where $E_a \varphi(a) =: \mu_P$ is the feature of distribution $P$.
Example (Gaussian kernel): $k(a,b) = e^{-\|a-b\|_2^2/(2\sigma^2)}$.
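A minimal sketch of the set kernel with the Gaussian base kernel, assuming each bag is stored as an array of shape (N, d). The function names are illustrative, not taken from the talk. The slide uses bags of equal size N; taking the mean over all pairs also covers bags of different sizes.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """All pairwise values k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    A = np.asarray(A, dtype=float)  # shape (N_A, d)
    B = np.asarray(B, dtype=float)  # shape (N_B, d)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def set_kernel(A, B, sigma=1.0):
    """K(A, B): average of k over all pairs, i.e. the inner product
    of the two empirical mean embeddings."""
    return float(gaussian_kernel_matrix(A, B, sigma).mean())
```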

18 RKHS definition(s)
Given: a set $D$.
Kernel: $k(a,b) = \langle \varphi(a), \varphi(b) \rangle_F$, $F$: Hilbert space.
RKHS: $H \subset \mathbb{R}^D$ Hilbert space, $\delta_b(f) = f(b)$ is continuous ($\forall b$).
Reproducing kernel of an $H \subset \mathbb{R}^D$ Hilbert space:
  1. $k(\cdot, b) \in H$,
  2. $\langle f, k(\cdot, b) \rangle_H = f(b)$.
  Note: $k(a,b) = \langle k(\cdot,a), k(\cdot,b) \rangle_H$.
$k : D \times D \to \mathbb{R}$ symmetric is positive definite if $G = [k(x_i, x_j)]_{i,j=1}^{n} \succeq 0$.

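19 Other valid similarities
Recall: $K(P,Q) = \langle \mu_P, \mu_Q \rangle$.
  $K_G = e^{-\|\mu_P - \mu_Q\|^2/(2\theta^2)}$
  $K_e = e^{-\|\mu_P - \mu_Q\|/(2\theta^2)}$
  $K_C = \left(1 + \|\mu_P - \mu_Q\|^2/\theta^2\right)^{-1}$
  $K_t = \left(1 + \|\mu_P - \mu_Q\|^{\theta}\right)^{-1}$
  $K_i = \left(\|\mu_P - \mu_Q\|^2 + \theta^2\right)^{-1/2}$
Functions of $\|\mu_P - \mu_Q\|$ ⇒ computation: similar to set kernel.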
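Why the computation is similar to the set kernel: the squared distance of two mean embeddings expands into three set-kernel evaluations, $\|\mu_A - \mu_B\|^2 = K(A,A) - 2K(A,B) + K(B,B)$. The sketch below assumes a Gaussian base kernel and bags stored as (N, d) arrays; all function names are hypothetical.

```python
import numpy as np

def _set_kernel(A, B, sigma):
    """Average Gaussian kernel value over all pairs (a_i, b_j)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return float(np.exp(-sq / (2.0 * sigma ** 2)).mean())

def embedding_sq_dist(A, B, sigma=1.0):
    """||mu_A - mu_B||^2 = K(A,A) - 2 K(A,B) + K(B,B)."""
    d2 = _set_kernel(A, A, sigma) - 2.0 * _set_kernel(A, B, sigma) + _set_kernel(B, B, sigma)
    return max(d2, 0.0)  # guard against tiny negative rounding errors

def K_G(A, B, theta, sigma=1.0):
    """Gaussian kernel on mean embeddings."""
    return float(np.exp(-embedding_sq_dist(A, B, sigma) / (2.0 * theta ** 2)))

def K_C(A, B, theta, sigma=1.0):
    """Cauchy kernel on mean embeddings."""
    return 1.0 / (1.0 + embedding_sq_dist(A, B, sigma) / theta ** 2)

def K_i(A, B, theta, sigma=1.0):
    """Inverse multiquadric kernel on mean embeddings."""
    return 1.0 / float(np.sqrt(embedding_sq_dist(A, B, sigma) + theta ** 2))
```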

20 Regression on labelled bags: baseline
Quality of estimator, baseline:
  $$R(f) = E_{(\mu_P, y) \sim \rho} \left[ f(\mu_P) - y \right]^2,$$
  $f_\rho$ = best regressor.
How many samples/bag to get the accuracy of $f_\rho$? Possible?
Assume (for a moment): $f_\rho \in H(K)$.

23 Our result: how many samples/bag
Known [Caponnetto and De Vito, 2007]: best achievable rate
  $$R(f_z^{\lambda}) - R(f_\rho) = O\left(l^{-\frac{bc}{bc+1}}\right),$$
  $b$: size of the input space, $c$: smoothness of $f_\rho$.
Let $N = \tilde{O}(l^a)$. $N$: size of the bags. $l$: number of bags.
Our result:
  If $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.
  In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
  Consequence: regression with set kernel is consistent.
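A small numeric illustration of the two exponents. The helper names and the values b = 2, c = 1.5 are hypothetical and serve only as an example; the formulas are the ones stated on the slide.

```python
def rate_exponent(b, c):
    """Exponent r in the excess-risk rate O(l^{-r}), r = bc / (bc + 1)."""
    return b * c / (b * c + 1.0)

def bag_size_exponent(b, c):
    """Exponent a in N = O~(l^a) that suffices for the optimal rate,
    a = b(c + 1) / (bc + 1)."""
    return b * (c + 1.0) / (b * c + 1.0)

# Illustrative values only (not taken from the talk).
b, c = 2.0, 1.5
r = rate_exponent(b, c)       # 3 / 4
a = bag_size_exponent(b, c)   # 5 / 4, i.e. sub-quadratic bag size
```

With these values, bags of size roughly l^1.25 (up to log factors) would already suffice, compared with l^2 under the simpler sufficient condition.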

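24 Why can we get consistency/rates? Intuition
Convergence of the mean embedding:
  $$\|\mu_P - \mu_{\hat{P}}\|_{H(k)} = O\left(\frac{1}{\sqrt{N}}\right).$$
Hölder property of $K$ ($0 < L$, $0 < h \le 1$):
  $$\|K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}})\|_{H(K)} \le L \|\mu_P - \mu_{\hat{P}}\|_{H(k)}^{h}.$$
$f_{\hat{z}}^{\lambda}$ depends "nicely" on $\mu_{\hat{P}}$. [39 pages]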
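One standard way to see the $1/\sqrt{N}$ rate in expectation, sketched under two assumptions that are not stated on the slide: the bag points $a_1, \dots, a_N$ are i.i.d. from $P$, and the base kernel is bounded, $k(a,a) \le B_k$ (the constant $B_k$ is notation introduced here).

```latex
\begin{align*}
E\,\|\mu_{\hat{P}} - \mu_P\|_{H(k)}^2
  &= E\,\Big\| \frac{1}{N} \sum_{i=1}^{N} \big(\varphi(a_i) - \mu_P\big) \Big\|_{H(k)}^2 \\
  &= \frac{1}{N^2} \sum_{i=1}^{N} E\,\|\varphi(a_i) - \mu_P\|_{H(k)}^2
     && \text{(cross terms vanish: independent, zero mean)} \\
  &= \frac{1}{N} \Big( E_a\, k(a,a) - \|\mu_P\|_{H(k)}^2 \Big)
   \le \frac{B_k}{N}, \\
E\,\|\mu_{\hat{P}} - \mu_P\|_{H(k)}
  &\le \sqrt{\frac{B_k}{N}}
     && \text{(Jensen's inequality)}.
\end{align*}
```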

25 Extensions
1. Misspecified setting ($f_\rho \in L^2 \setminus H$):
   Consistency: convergence to $\inf_{f \in H} \|f - f_\rho\|_{L^2}$.
   Smoothness on $f_\rho$: computational & statistical tradeoff.

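26 Extensions
2. Vector-valued output:
   $Y$: separable Hilbert space ⇒ $K(\mu_P, \mu_Q) \in L(Y)$.
   Prediction on a test bag $\hat{P}$:
   $$\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y,$$
   $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
   Specifically: $Y = \mathbb{R}$ ⇒ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $L(Y) = \mathbb{R}^{d \times d}$.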

27 Aerosol prediction result (100 × RMSE)
We perform on par with the state-of-the-art, hand-engineered method.
  Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012: (± ): hand-crafted features.
  Our prediction accuracy: 7.81 (±1.64), with no expert knowledge.
Code in ITE: #2 on mloss.

28 Summary
Problem: distribution regression.
Contribution: computational & statistical tradeoff analysis; specifically,
  the set kernel is consistent: 17-year-old open question,
  the minimax optimal rate is achievable with sub-quadratic bag size.
Details (JMLR in revision):

29 Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS and IIS. A part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

30 References
Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Conference on Learning Theory (COLT).
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML).
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc. ... convolutions.pdf).
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT).
