Distribution Regression: A Simple Technique with Minimax-optimal Guarantee
Zoltán Szabó (CMAP, École Polytechnique). Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), and Arthur Gretton (Gatsby Unit, UCL). Parisian Statistics Seminar, March 27, 2017.
Example: sustainability
Goal: aerosol prediction (the link between air pollution and climate).
Prediction using labelled bags: bag := multi-spectral satellite measurements over an area; label := local aerosol value.
Example: existing methods
Multi-instance learning [Haussler, 1999; Gärtner et al., 2002] (set kernel).
Sensible methods in regression are few: (1) they impose restrictive technical conditions, and (2) super-high-resolution satellite images would be needed.
One-page summary
Contributions:
1. Practical: state-of-the-art accuracy (aerosol prediction).
2. Theoretical: general bags (graphs, time series, texts, ...); consistency of the set kernel in regression (a 17-year-old open problem); how many samples per bag are needed?
Objects in the bags
Examples: time-series modelling: user = set of time series; computer vision: image = collection of patch vectors; NLP: corpus = bag of documents; network analysis: group of people = bag of friendship graphs; ...
Wider context (statistics): point estimation tasks.
Regression on labelled bags
Given labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^l$, where $\hat{P}_i$ is a bag of $N := |\hat{P}_i|$ samples from $P_i$, and a test bag $\hat{P}$.
Estimator:
$$f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{l} \sum_{i=1}^l \big[ f(\mu_{\hat{P}_i}) - y_i \big]^2 + \lambda \|f\|_H^2,$$
where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.
Prediction: $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Challenges:
1. Inner product of distributions: $K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j}) = ?$
2. How many samples per bag?
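As an illustration, the prediction formula above can be sketched in a few lines of Python (a toy setup, not the authors' ITE implementation: the inner kernel $k$ is Gaussian, the outer kernel is the plain set kernel $K(\mu_P, \mu_Q) = \langle \mu_P, \mu_Q \rangle$, and the bag data and labels are made up):

```python
import numpy as np

def gauss_k(a, b, sigma=1.0):
    # inner kernel k(a,b) on points of the domain, all pairs at once
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(d**2, axis=2) / (2 * sigma**2))

def set_kernel(A, B, sigma=1.0):
    # K(A,B) = (1/(|A||B|)) sum_ij k(a_i, b_j) = <mu_A, mu_B>_H
    return gauss_k(A, B, sigma).mean()

def fit_predict(bags, y, test_bags, lam=1e-3, sigma=1.0):
    # ridge regression on mean embeddings: yhat = g^T (G + l*lam*I)^{-1} y
    l = len(bags)
    G = np.array([[set_kernel(A, B, sigma) for B in bags] for A in bags])
    alpha = np.linalg.solve(G + l * lam * np.eye(l), y)
    g = np.array([[set_kernel(T, B, sigma) for B in bags] for T in test_bags])
    return g @ alpha

# toy problem: the label of a bag is the mean of its underlying Gaussian
rng = np.random.default_rng(0)
means = rng.uniform(-1, 1, size=30)
bags = [rng.normal(m, 0.1, size=(50, 1)) for m in means]
y = means
test_bags = [rng.normal(0.5, 0.1, size=(50, 1))]
print(fit_predict(bags, y, test_bags))  # prediction for a bag drawn around 0.5
```

Note that the Gram matrix $G$ is built once from the $l$ training bags, so a new test bag only costs one row of set-kernel evaluations.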
Regression on labelled bags: similarity
Let us define an inner product on distributions, $K(P, Q)$:
1. Set kernel: for bags $A = \{a_i\}_{i=1}^N$, $B = \{b_j\}_{j=1}^N$,
$$K(A, B) = \frac{1}{N^2} \sum_{i,j=1}^N k(a_i, b_j) = \Big\langle \underbrace{\frac{1}{N} \sum_{i=1}^N \varphi(a_i)}_{\text{feature of bag } A}, \frac{1}{N} \sum_{j=1}^N \varphi(b_j) \Big\rangle.$$
2. Taking the limit [Berlinet and Thomas-Agnan, 2004; Altun and Smola, 2006; Smola et al., 2007]: for $a \sim P$, $b \sim Q$,
$$K(P, Q) = \mathbb{E}_{a,b}\, k(a, b) = \Big\langle \underbrace{\mathbb{E}_a \varphi(a)}_{=: \mu_P, \text{ feature of } P}, \mathbb{E}_b \varphi(b) \Big\rangle.$$
Example (Gaussian kernel): $k(a, b) = e^{-\|a-b\|_2^2 / (2\sigma^2)}$.
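The equality of the double-sum form and the inner-product-of-bag-features form can be checked numerically with a finite-dimensional feature map. The sketch below uses the linear kernel $k(a,b) = \langle a, b \rangle$, whose feature map $\varphi$ is the identity, so the feature of a bag is simply its sample mean (for the Gaussian kernel $\varphi$ is infinite-dimensional and this shortcut is unavailable):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))   # bag of 6 points in R^3
B = rng.normal(size=(9, 3))   # bag of 9 points in R^3

# double-sum form: (1/(N_A N_B)) sum_ij k(a_i, b_j) with linear k(a,b) = <a,b>
double_sum = sum(a @ b for a in A for b in B) / (len(A) * len(B))

# feature form: <mu_A, mu_B>, where mu is the bag mean (phi = identity)
feature_form = A.mean(axis=0) @ B.mean(axis=0)

print(abs(double_sum - feature_form))  # ≈ 0: the two forms agree
```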
RKHS definition(s)
Given a set $D$. Kernel: $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_F$, where $F$ is a Hilbert space.
RKHS: a Hilbert space $H \subseteq \mathbb{R}^D$ in which the evaluation functional $\delta_b(f) = f(b)$ is continuous for all $b$.
Reproducing kernel of a Hilbert space $H \subseteq \mathbb{R}^D$:
1. $k(\cdot, b) \in H$,
2. $\langle f, k(\cdot, b) \rangle_H = f(b)$.
Note: $k(a, b) = \langle k(\cdot, a), k(\cdot, b) \rangle_H$.
A symmetric $k : D \times D \to \mathbb{R}$ is positive definite if $G = [k(x_i, x_j)]_{i,j=1}^n \succeq 0$ for all $n$ and all $x_i$.
Kernel examples on $D = \mathbb{R}^d$, $\theta > 0$:
$k_G(a, b) = e^{-\|a-b\|_2^2 / (2\theta^2)}$ (Gaussian),
$k_e(a, b) = e^{-\|a-b\|_2 / (2\theta^2)}$ (exponential),
$k_C(a, b) = \big(1 + \|a-b\|_2^2 / \theta^2\big)^{-1}$ (Cauchy),
$k_t(a, b) = \big(1 + \|a-b\|_2^{\theta}\big)^{-1}$ (Student),
$k_p(a, b) = (\langle a, b \rangle + \theta)^p$ (polynomial),
$k_r(a, b) = 1 - \frac{\|a-b\|_2^2}{\|a-b\|_2^2 + \theta}$ (rational quadratic),
$k_i(a, b) = \big(\|a-b\|_2^2 + \theta^2\big)^{-1/2}$ (inverse multiquadric),
$k_{M,3/2}(a, b) = \Big(1 + \frac{\sqrt{3}\|a-b\|_2}{\theta}\Big) e^{-\sqrt{3}\|a-b\|_2 / \theta}$ (Matérn, $\nu = 3/2$),
$k_{M,5/2}(a, b) = \Big(1 + \frac{\sqrt{5}\|a-b\|_2}{\theta} + \frac{5\|a-b\|_2^2}{3\theta^2}\Big) e^{-\sqrt{5}\|a-b\|_2 / \theta}$ (Matérn, $\nu = 5/2$).
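A few of the kernels above written out as code (a minimal sketch; vector inputs, bandwidth $\theta > 0$ as in the formulas):

```python
import numpy as np

def k_gauss(a, b, theta=1.0):
    # Gaussian: exp(-||a-b||^2 / (2 theta^2))
    return np.exp(-np.sum((a - b) ** 2) / (2 * theta**2))

def k_cauchy(a, b, theta=1.0):
    # Cauchy: (1 + ||a-b||^2 / theta^2)^{-1}
    return 1.0 / (1.0 + np.sum((a - b) ** 2) / theta**2)

def k_invmq(a, b, theta=1.0):
    # inverse multiquadric: (||a-b||^2 + theta^2)^{-1/2}
    return 1.0 / np.sqrt(np.sum((a - b) ** 2) + theta**2)

def k_matern32(a, b, theta=1.0):
    # Matern with nu = 3/2
    r = np.linalg.norm(a - b)
    return (1 + np.sqrt(3) * r / theta) * np.exp(-np.sqrt(3) * r / theta)

a, b = np.array([0.0, 1.0]), np.array([1.0, 1.0])
print(k_gauss(a, a), k_matern32(a, a))  # 1.0 1.0 (both equal 1 at r = 0)
```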
Regression on labelled bags: baseline
Quality of an estimator, baseline: $R(f) = \mathbb{E}_{(\mu_P, y) \sim \rho} [f(\mu_P) - y]^2$; $f_\rho$ = best regressor.
How many samples per bag are needed to achieve the accuracy of $f_\rho$? Is this possible at all?
Assume (for a moment): $f_\rho \in H(K)$.
Our result: how many samples per bag
Known [Caponnetto and De Vito, 2007]: best/realized rate
$$R(f_z^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{bc}{bc+1}}\big),$$
where $b$ captures the size of the input space and $c$ the smoothness of $f_\rho$.
Let $N = \tilde{O}(l^a)$, with $N$ the size of the bags and $l$ the number of bags.
Our result: if $2 \le a$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate. In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
Consequence: regression with the set kernel is consistent.
Well-specified case: computational & statistical tradeoff
Let $N = \tilde{O}(l^a)$. Our result:
If $\frac{b(c+1)}{bc+1} \le a$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{bc}{bc+1}}\big)$.
If $a \le \frac{b(c+1)}{bc+1}$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{ac}{c+1}}\big)$.
Meaning: smaller $a$ brings computational savings, but reduced statistical efficiency.
$c \mapsto \frac{b(c+1)}{bc+1}$ is decreasing: easier problems allow smaller bags.
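The two regimes combine into a single excess-risk exponent as a function of $a$; a small helper makes the tradeoff concrete (the parameter values below are purely illustrative):

```python
# excess-risk exponent e(a) implied by the two regimes:
# R(f) - R(f_rho) = O(l^{-e(a)}), with threshold a* = b(c+1)/(bc+1)
def excess_risk_exponent(a, b, c):
    threshold = b * (c + 1) / (b * c + 1)
    if a >= threshold:
        return b * c / (b * c + 1)  # best achievable rate, no further gain
    return a * c / (c + 1)          # reduced rate: cheaper (smaller) bags

# example: b = 1, c = 1 gives threshold a* = 1 and best exponent 1/2
print(excess_risk_exponent(0.5, 1, 1), excess_risk_exponent(2.0, 1, 1))  # 0.25 0.5
```

The two branches agree exactly at the threshold $a = \frac{b(c+1)}{bc+1}$, so the exponent is continuous in $a$.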
Why can we get consistency/rates? Intuition
Convergence of the mean embedding: $\|\mu_P - \mu_{\hat{P}}\|_H = O\big(\frac{1}{\sqrt{N}}\big)$.
Hölder property of $K$ ($0 < L$, $0 < h \le 1$): $\|K(\cdot, \mu_P) - K(\cdot, \mu_{\hat{P}})\|_H \le L \|\mu_P - \mu_{\hat{P}}\|_H^h$.
Hence $f_{\hat{z}}^{\lambda}$ depends nicely on $\mu_{\hat{P}}$.
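The $O(1/\sqrt{N})$ convergence can be checked by simulation in the simplest case, the linear kernel, where the mean embedding of $\hat{P}$ is just the sample mean (a Monte Carlo sketch, not part of the original slides):

```python
import numpy as np

# check ||mu_P - mu_Phat||_H = O(1/sqrt(N)) for the linear kernel,
# where the embedding of P = N(0, I_3) is the zero vector
rng = np.random.default_rng(0)

def embedding_error(N, trials=200):
    # average distance between the empirical embedding (sample mean)
    # and the true embedding (zero), over independent bags of size N
    samples = rng.normal(size=(trials, N, 3))
    return np.mean(np.linalg.norm(samples.mean(axis=1), axis=1))

e100, e10000 = embedding_error(100), embedding_error(10000)
print(e10000 / e100)  # close to sqrt(100/10000) = 0.1
```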
Valid similarities
Recall: $K(P, Q) = \langle \mu_P, \mu_Q \rangle$. Further valid choices:
$K_G(P, Q) = e^{-\|\mu_P - \mu_Q\|^2 / (2\theta^2)}$,
$K_e(P, Q) = e^{-\|\mu_P - \mu_Q\| / (2\theta^2)}$,
$K_C(P, Q) = \big(1 + \|\mu_P - \mu_Q\|^2 / \theta^2\big)^{-1}$,
$K_t(P, Q) = \big(1 + \|\mu_P - \mu_Q\|^{\theta}\big)^{-1}$,
$K_i(P, Q) = \big(\|\mu_P - \mu_Q\|^2 + \theta^2\big)^{-1/2}$.
These are functions of $\|\mu_P - \mu_Q\|$; their computation is similar to that of the set kernel.
Extensions
1. Misspecified setting ($f_\rho \in L^2 \setminus H$): consistency, i.e. convergence to $\inf_{f \in H} \|f - f_\rho\|_{L^2}$; under smoothness assumptions on $f_\rho$: a computational & statistical tradeoff.
2. Vector-valued output: $Y$ a separable Hilbert space, $K(\mu_P, \mu_Q) \in L(Y)$. Prediction on a test bag $\hat{P}$: $\hat{y}(\hat{P}) = g^T (G + l\lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$. Specifically: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.
Misspecified case: consistency
Our result: let $N = \tilde{O}(l)$, $l \to \infty$, $\lambda \to 0$, $\lambda l \to \infty$. Then
$$R\big(f_{\hat{z}}^{\lambda}\big) - R(f_\rho) \to \inf_{f \in H} \|f - f_\rho\|_{L^2}.$$
Misspecified case: $s$-smooth $f_\rho$
Let $N = \tilde{O}(l^{2a})$ and let $f_\rho$ be $s$-smooth, $s \in (0, 1]$.
Our result (computational & statistical tradeoff):
If $\frac{s+1}{s+2} \le a$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{2s}{s+2}}\big)$.
If $a \le \frac{s+1}{s+2}$, then $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\big(l^{-\frac{2sa}{s+1}}\big)$.
Meaning: smaller $a$ brings computational savings, but reduced statistical efficiency.
Sensible choice: $a = \frac{s+1}{s+2}$, so $2a \le \frac{4}{3} < 2$!
$s \mapsto \frac{2s}{s+2}$ is increasing: easier task = better rate. As $s \to 0$ ($f_\rho \in L^2$ only): arbitrarily slow rate. At $s = 1$: $O(l^{-2/3})$ speed.
Misspecified case: optimality
Our rate: $r(l) = l^{-\frac{2s}{s+2}}$. One-stage sampled optimal rate: $r_o(l) = l^{-\frac{2s}{2s+1}}$ [Steinwart et al., 2009], under $s$-smoothness + an eigendecay constraint, $D$ compact metric, $Y = \mathbb{R}$.
[Figure: the rate exponents $2s/(2s+1)$ (one-stage optimal) and $2s/(s+2)$ (ours) plotted as functions of the smoothness $s$.]
$s$-smoothness: intuition
Assumption: $f_\rho \in \mathrm{Im}(C^s)$, $s \in (0, 1]$, where $C$ is the uncentered covariance operator.
Imagine $C \in \mathbb{R}^{d \times d}$ is a symmetric matrix: $C = U \Lambda U^T$, so $Cv = \sum_{n=1}^d \lambda_n \langle u_n, v \rangle u_n$.
General $C$:
$$C(v) = \sum_n \lambda_n \langle u_n, v \rangle u_n, \qquad C^s(v) = \sum_n \lambda_n^s \langle u_n, v \rangle u_n,$$
$$\mathrm{Im}(C^s) = \Big\{ \sum_n c_n u_n : \sum_n c_n^2 \lambda_n^{-2s} < \infty \Big\}.$$
Larger $s$ means faster decay of the Fourier coefficients $c_n$.
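The finite-dimensional picture can be reproduced directly: $C^s$ via the eigendecomposition $C = U \Lambda U^T$ (a numpy sketch; the data matrix is made up):

```python
import numpy as np

def frac_power(C, s):
    # C^s = U diag(lambda^s) U^T for symmetric PSD C
    lam, U = np.linalg.eigh(C)
    return (U * lam**s) @ U.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
C = X.T @ X / 50                    # uncentered covariance, PSD

half = frac_power(C, 0.5)
print(np.allclose(half @ half, C))  # True: (C^{1/2})^2 recovers C
```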
Aerosol prediction result ($100 \times$ RMSE)
We perform on par with the state-of-the-art, hand-engineered method.
Baseline (hand-crafted features): Zhuang Wang, Liang Lan, Slobodan Vucetic. IEEE Transactions on Geoscience and Remote Sensing, 2012.
Our prediction accuracy: 7.81 ($\pm$1.64), with no expert knowledge.
Code in the ITE toolbox (#2 on mloss).
Summary
Problem: distribution regression.
Contribution: a computational & statistical tradeoff analysis; the set kernel yields a simple algorithm with a minimax-optimal rate.
Zoltán Szabó, Bharath K. Sriperumbudur, Barnabás Póczos, Arthur Gretton. Learning Theory for Distribution Regression. Journal of Machine Learning Research, 17(152):1-40, 2016.
Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation and by NSF grants IIS and IIS. Part of the work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.
References
Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Conference on Learning Theory (COLT).
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML).
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT).
Steinwart, I., Hush, D. R., and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Conference on Learning Theory (COLT).
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationLearnability of Gaussians with flexible variances
Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007
More informationJoint Emotion Analysis via Multi-task Gaussian Processes
Joint Emotion Analysis via Multi-task Gaussian Processes Daniel Beck, Trevor Cohn, Lucia Specia October 28, 2014 1 Introduction 2 Multi-task Gaussian Process Regression 3 Experiments and Discussion 4 Conclusions
More informationEECS 598: Statistical Learning Theory, Winter 2014 Topic 11. Kernels
EECS 598: Statistical Learning Theory, Winter 2014 Topic 11 Kernels Lecturer: Clayton Scott Scribe: Jun Guo, Soumik Chatterjee Disclaimer: These notes have not been subjected to the usual scrutiny reserved
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationKernel Ridge Regression. Mohammad Emtiyaz Khan EPFL Oct 27, 2015
Kernel Ridge Regression Mohammad Emtiyaz Khan EPFL Oct 27, 2015 Mohammad Emtiyaz Khan 2015 Motivation The ridge solution β R D has a counterpart α R N. Using duality, we will establish a relationship between
More informationConvergence rates of spectral methods for statistical inverse learning problems
Convergence rates of spectral methods for statistical inverse learning problems G. Blanchard Universtität Potsdam UCL/Gatsby unit, 04/11/2015 Joint work with N. Mücke (U. Potsdam); N. Krämer (U. München)
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationManifold Regularization
Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is
More informationThe Learning Problem and Regularization
9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning
More informationOn learning with kernels for unordered pairs
Martial Hue MartialHue@mines-paristechfr Jean-Philippe Vert Jean-PhilippeVert@mines-paristechfr Mines ParisTech, Centre for Computational Biology, 35 rue Saint Honoré, F-77300 Fontainebleau, France Institut
More informationA Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes
A Note on Extending Generalization Bounds for Binary Large-Margin Classifiers to Multiple Classes Ürün Dogan 1 Tobias Glasmachers 2 and Christian Igel 3 1 Institut für Mathematik Universität Potsdam Germany
More informationInverse Density as an Inverse Problem: the Fredholm Equation Approach
Inverse Density as an Inverse Problem: the Fredholm Equation Approach Qichao Que, Mikhail Belkin Department of Computer Science and Engineering The Ohio State University {que,mbelkin}@cse.ohio-state.edu
More informationKernel Methods. Outline
Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More information26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.
10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationOptimal kernel methods for large scale learning
Optimal kernel methods for large scale learning Alessandro Rudi INRIA - École Normale Supérieure, Paris joint work with Luigi Carratino, Lorenzo Rosasco 6 Mar 2018 École Polytechnique Learning problem
More informationKernel Methods. Barnabás Póczos
Kernel Methods Barnabás Póczos Outline Quick Introduction Feature space Perceptron in the feature space Kernels Mercer s theorem Finite domain Arbitrary domain Kernel families Constructing new kernels
More information