Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates



1 Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates
Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley; Apr 29, 2014)
Gatsby Unit, Tea Talk, June 10, 2014

2 Outline
Motivation. Algorithm. Consistency results.

3 Motivation: non-parametric regression
Given: training samples $\{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.
Assumption: $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} P$.
Goal: $\hat{f} : \mathcal{X} \to \mathbb{R}$ that predicts well on future inputs.
Objective function: mean square prediction error, i.e. $J(f) := E[(f(X) - Y)^2] \to \min_{f:\ \text{measurable}}$. (1)
Optimal solution (theoretical): the regression function $f^*(x) = E[Y \mid X = x]$. (2)

4 Motivation: ridge regressor
Regularized M-estimators: data-dependent loss + regularization.
Example: least-squares loss + squared Hilbert norm.
Our focus: function class = RKHS, $H = H(K)$.
Kernel ridge regression: $\hat{f} := \arg\min_{f \in H} \frac{1}{N} \sum_{i=1}^N [f(x_i) - y_i]^2 + \lambda \|f\|_H^2$ ($\lambda > 0$). (3)

5 Motivation: analytical solution
Explicit solution: $\hat{f}(\cdot) = \sum_{i=1}^N \alpha_i K(\cdot, x_i)$, (4)
where $K = [K(x_i, x_j)] \in \mathbb{R}^{N \times N}$, $\alpha = (K + \lambda N I)^{-1} y \in \mathbb{R}^N$. (5)

6 Motivation: analytical solution (continued)
Slight problem: the explicit solution (4)-(5) scales terribly; time complexity: $O(N^3)$.
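
A minimal NumPy sketch of the closed form (4)-(5) may help here. The Gaussian kernel, its bandwidth, the synthetic data and the names `gaussian_kernel`, `krr_fit`, `krr_predict` are illustrative assumptions, not taken from the talk. The dense linear solve is the $O(N^3)$ step.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2)) (assumed kernel)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def krr_fit(X, y, lam, kernel=gaussian_kernel):
    """Solve (K + lam * N * I) alpha = y, i.e. equation (5)."""
    N = X.shape[0]
    K = kernel(X, X)
    return np.linalg.solve(K + lam * N * np.eye(N), y)

def krr_predict(X_train, alpha, X_new, kernel=gaussian_kernel):
    """Evaluate f_hat(x) = sum_i alpha_i K(x, x_i), i.e. equation (4)."""
    return kernel(X_new, X_train) @ alpha

# Synthetic one-dimensional regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(200)

alpha = krr_fit(X, y, lam=1e-3)
y_hat = krr_predict(X, alpha, X)
```

Solving the linear system directly, rather than forming the inverse in (5), is the usual numerical choice; the cost is still cubic in $N$.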

7 Motivation: approximations
Low-rank methods. Examples: incomplete Cholesky, Nyström approximation. Prediction error guarantees: hardly studied.
Early stopping methods: early stopping acts as regularization. Examples: gradient descent, conjugate gradient.
Time complexity: $O(d^2 N)$ (low-rank), $O(t N^2)$ (early stopping).

8 Motivation: current approach
Decomposition-based technique:
randomly partition the $N$ samples into $m$ equal-sized subsets $S_i$;
fit independent ridge regressors $\hat{f}_i$ ($i = 1, \ldots, m$);
average the obtained predictors: $\bar{f} = \frac{1}{m} \sum_{i=1}^m \hat{f}_i$, where $\hat{f}_i = \arg\min_{f \in H} \frac{1}{|S_i|} \sum_{(x,y) \in S_i} [f(x) - y]^2 + \lambda \|f\|_H^2$.
Time complexity: $O\left(m \left(\frac{N}{m}\right)^3\right) = O\left(\frac{N^3}{m^2}\right)$.

9 Algorithm: $\bar{f}$
Sub-problems: each one uses the same $\lambda$ as if it had all $N$ samples.
Under-regularization: each local estimate has small bias, but its variance blows up!
Average: reduces the variance enough; minimax optimality holds for certain kernel classes.
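
The partition, local fit and averaging steps can be sketched as follows. This is a sequential sketch under stated assumptions: the Gaussian kernel, the synthetic data and the names `dc_krr_fit` and `dc_krr_predict` are hypothetical, and `np.array_split` only gives exactly equal blocks when $m$ divides $N$. Each block uses the same `lam`, as the slide prescribes.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * bandwidth^2)) (assumed kernel)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def dc_krr_fit(X, y, lam, m, rng):
    """Fit one kernel ridge regressor per block of a random partition into m blocks."""
    N = X.shape[0]
    blocks = np.array_split(rng.permutation(N), m)
    models = []
    for idx in blocks:
        X_i, y_i = X[idx], y[idx]
        n_i = len(idx)
        K_i = gaussian_kernel(X_i, X_i)
        # Same lam as for the full data set; only the block size n_i changes.
        alpha_i = np.linalg.solve(K_i + lam * n_i * np.eye(n_i), y_i)
        models.append((X_i, alpha_i))
    return models

def dc_krr_predict(models, X_new):
    """Average the m local predictors: f_bar = (1/m) * sum_i f_hat_i."""
    preds = [gaussian_kernel(X_new, X_i) @ alpha_i for X_i, alpha_i in models]
    return np.mean(preds, axis=0)

# Illustrative synthetic data.
rng = np.random.default_rng(0)
N, m, lam = 400, 4, 1e-3
X = rng.uniform(-1.0, 1.0, size=(N, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(N)
X_test = np.linspace(-1.0, 1.0, 50)[:, None]

models = dc_krr_fit(X, y, lam, m, rng)
y_bar = dc_krr_predict(models, X_test)
```

The loop over blocks has no cross-block dependence, so in a distributed setting each iteration could run on its own machine; with `m=1` the sketch reduces to ordinary kernel ridge regression.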

10 Notations
$(\mathcal{X}, K)$, $(X, Y) \sim P$, $X \sim P_X$, $n = N/m$, $m$ = # of blocks.
$S_K : L^2(P_X) \to H = H(K)$, $\mathrm{id} = S_K^* : H \to L^2(P_X)$,
$S_K(f)(x) = \int_{\mathcal{X}} K(x, x') f(x') \, dP_X(x')$, $T_K = \mathrm{id} \circ S_K$. (6)
$T_K$: compact, positive, self-adjoint operator (if $H$ is separable and $\int_{\mathcal{X}} K(x, x) \, dP_X(x) < \infty$).

11 Notations (continued)
Spectral theorem $\Rightarrow$ there is a countable orthonormal system (ONS) $\{\phi_i\}$ of eigenvectors of $T_K$ in $L^2(P_X)$, with eigenvalues $\mu_i$ ($\mu_i > 0$, $\mu_i \to 0$).
W.l.o.g.: $\phi_i \in H$.

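12 Mercer theorem: $K \leftrightarrow \{(\phi_i, \mu_i)\}$
If $\mathcal{X}$ is a compact metric space and $K$ is continuous, then $K(u, v) = \sum_{j=1}^\infty \mu_j \phi_j(u) \phi_j(v)$. (7)
Note ($T_K$ conditions from $(\mathcal{X}, K)$ conditions):
$K$: bounded.
$\mathcal{X}$: compact metric $\Rightarrow$ separable.
$\mathcal{X}$: separable, $K$: continuous $\Rightarrow$ $H = H(K)$: separable.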

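13 Some notations
$\|h\|_2^2 := \|h\|_{L^2(P_X)}^2 = \int_{\mathcal{X}} h^2(x) \, dP_X(x)$.
Our bound on the MSE $E[\|\bar{f} - f^*\|_2^2]$ is formulated in terms of
$\mathrm{tr}(K) = \sum_{j=1}^\infty \mu_j$, $\gamma(\lambda) = \sum_{j=1}^\infty \frac{1}{1 + \lambda/\mu_j}$, $\beta_d = \sum_{j=d+1}^\infty \mu_j$. (8)
Intuition:
$\mathrm{tr}(K)$: "size" of the kernel operator ($T_K$).
$\gamma(\lambda)$: "effective dimensionality" of $T_K$ w.r.t. $L^2(P_X)$.
$\beta_d$: tail decay of the eigenvalues of $T_K$ ($d \ge 0$ free parameter). $\beta_0 = \mathrm{tr}(K)$.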
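
The three spectral quantities in (8) are easy to evaluate once a spectrum is given. The sketch below assumes a finite (truncated) list of eigenvalues sorted in decreasing order, with `mu[0]` playing the role of $\mu_1$; the function names and the example spectrum $\mu_j = j^{-2}$ are illustrative choices, not from the talk.

```python
def kernel_trace(mu):
    """tr(K) = sum_j mu_j (truncated to the eigenvalues supplied)."""
    return sum(mu)

def effective_dimension(mu, lam):
    """gamma(lam) = sum_j 1 / (1 + lam / mu_j), over the positive eigenvalues."""
    return sum(1.0 / (1.0 + lam / mu_j) for mu_j in mu if mu_j > 0)

def tail_sum(mu, d):
    """beta_d = sum_{j > d} mu_j; with 0-based lists this is sum(mu[d:])."""
    return sum(mu[d:])

# Illustrative, polynomially decaying spectrum truncated at 1000 terms.
mu = [j ** -2.0 for j in range(1, 1001)]

print("tr(K)       =", kernel_trace(mu))
print("gamma(1e-2) =", effective_dimension(mu, 1e-2))
print("beta_10     =", tail_sum(mu, 10))
```

For a rank-$r$ kernel with equal eigenvalues, `effective_dimension` approaches $r$ as `lam` tends to 0, which matches the "effective dimensionality" reading above.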

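14 Assumptions: tail behaviour of $\phi_j$-s, bounded variance
A: $\exists k \ge 2$, $\rho < \infty$ such that $E[\phi_j(X)^{2k}] \le \rho^{2k}$ ($j = 1, 2, \ldots$).
A': $\exists \rho < \infty$ such that $\sup_{u \in \mathcal{X}} |\phi_j(u)| \le \rho$ ($j = 1, 2, \ldots$).
Assumption A' $\Rightarrow$ Assumption A: $E[\phi_j(X)^{2k}] \le E\left[\sup_{u \in \mathcal{X}} |\phi_j(u)|^{2k}\right] \le E[\rho^{2k}] = \rho^{2k}$. (9)
B: $f^* \in H$. $\exists \sigma > 0$ such that $\forall x \in \mathcal{X}$: $E[(Y - f^*(x))^2 \mid X = x] \le \sigma^2$.
Notation+: $b(n, d, k) = \max\left( \sqrt{\max(k, \log(d))}, \frac{\max(k, \log(d))}{n^{\frac{1}{2} - \frac{1}{k}}} \right)$.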
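
For concreteness, the auxiliary quantity $b(n, d, k)$ can be evaluated as in this small sketch. The function name `b` mirrors the slide's notation; the requirement $d \ge 1$ (so that the logarithm is defined) is an assumption of the sketch.

```python
import math

def b(n, d, k):
    """b(n, d, k) = max( sqrt(max(k, log d)), max(k, log d) / n^(1/2 - 1/k) ).

    Assumes d >= 1 so that log(d) is defined, and k >= 2 as in Assumption A.
    """
    t = max(k, math.log(d))
    return max(math.sqrt(t), t / n ** (0.5 - 1.0 / k))

# With n = 100, d = 1, k = 4: t = 4, so the first branch sqrt(4) = 2 dominates
# the second branch 4 / 100**0.25.
print(b(100, 1, 4))
```

The second branch shrinks as the block size $n$ grows (for $k > 2$), so for large blocks $b$ is governed by $\sqrt{\max(k, \log d)}$.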

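15 Main result ($C$: universal constant)
If $f^* \in H$ and assumptions A and B hold, then
$E[\|\bar{f} - f^*\|_2^2] \le \left(8 + \frac{12}{m}\right) \lambda \|f^*\|_H^2 + \frac{12 \sigma^2 \gamma(\lambda)}{N} + \inf_{d \in \mathbb{N}} \{T_1(d) + T_2(d) + T_3(d)\}$,
$T_1(d) = \frac{8 \rho^4 \|f^*\|_H^2 \, \mathrm{tr}(K) \beta_d}{\lambda}$,
$T_2(d) = \frac{4 \|f^*\|_H^2 + 2\sigma^2/\lambda}{m} \left( \mu_{d+1} + \frac{12 \rho^4 \mathrm{tr}(K) \beta_d}{\lambda} \right)$,
$T_3(d) = \left[ C b(n, d, k) \frac{\rho^2 \gamma(\lambda)}{\sqrt{n}} \right]^k \|f^*\|_2^2 \left( 1 + \frac{2\sigma^2}{m\lambda} + \frac{4 \|f^*\|_H^2}{m} \right)$.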

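16 Main result: intuition
"Simplified" form: $E[\|\bar{f} - f^*\|_2^2] = O\Big( \underbrace{\lambda \|f^*\|_H^2}_{\text{squared bias}} + \underbrace{\frac{\sigma^2 \gamma(\lambda)}{N}}_{\text{variance}} \Big)$.
For 3 kernel families, this is "correct" (idea):
For large enough $d$ and small enough $m$: $T_3(d) \lesssim \frac{\gamma(\lambda)}{N}$.
$T_1(d)$, $T_2(d)$: either 0, or smaller than the others.
$\lambda = \frac{\gamma(\lambda)}{N}$: fixed point equation $\Rightarrow$ $\lambda^*$. Rate: $\frac{\gamma(\lambda^*)}{N}$.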
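
The fixed point equation $\lambda = \gamma(\lambda)/N$ can be solved numerically. The sketch below uses bisection, which is one reasonable choice rather than anything prescribed in the talk; it relies on $\gamma$ being decreasing in $\lambda$ and assumes a finite list of eigenvalues. The names `effective_dimension` and `solve_fixed_point` are hypothetical.

```python
def effective_dimension(mu, lam):
    """gamma(lam) = sum_j 1 / (1 + lam / mu_j), over the positive eigenvalues."""
    return sum(1.0 / (1.0 + lam / mu_j) for mu_j in mu if mu_j > 0)

def solve_fixed_point(mu, N, tol=1e-12):
    """Bisection for lam = gamma(lam) / N.

    gamma(lam) / N - lam is strictly decreasing in lam, positive at lam = 0 and
    non-positive at lam = (#positive eigenvalues) / N, so the root is bracketed.
    """
    lo, hi = 0.0, sum(1 for mu_j in mu if mu_j > 0) / N
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if effective_dimension(mu, mid) / N > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Rank-10 spectrum with unit eigenvalues and N = 1000 samples (illustrative).
lam_star = solve_fixed_point([1.0] * 10, N=1000)
rate = effective_dimension([1.0] * 10, lam_star) / 1000
print(lam_star, rate)
```

In this finite-rank example the solution is close to $r/N = 0.01$, in line with the choice $\lambda = r/N$ in Consequence-1.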

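17 Consequence-1 (finite rank kernel; example: linear/polynomial)
Assumption: $\mathrm{rank}(K) = r$, $\lambda = \frac{r}{N}$, A (or A') and B. If
$m \le c \, \frac{N^{\frac{k-4}{k-2}}}{r^2 \rho^{\frac{4k}{k-2}} \log^{\frac{k}{k-2}}(r)}$ (A), or $m \le c \, \frac{N}{r^2 \rho^4 \log(N)}$ (A'),
then $E[\|\bar{f} - f^*\|_2^2] = O\left(\frac{\sigma^2 r}{N}\right)$. (10)
Moreover, (10) is minimax-optimal: $\exists c > 0$ such that $\inf_{\tilde{f}} \sup_{f^* \in B_H(1)} E[\|\tilde{f} - f^*\|_2^2] \ge c \, \frac{r}{N}$, (11)
where $B_H(1) = \{f \in H : \|f\|_H \le 1\}$ and the infimum is over all estimators $\tilde{f}$.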

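18 Consequence-2 (polynomially decaying eigenvalues; example: Sobolev; $C$: universal constant)
Assumption: $\mu_j \le C j^{-2\nu}$ ($j = 1, 2, \ldots$), $\nu > \frac{1}{2}$, $\lambda = 1/N^{\frac{2\nu}{2\nu+1}}$, A (or A') and B. If [$c = c(\nu)$]
$m \le c \left( \frac{N^{\frac{2(k-4)\nu - k}{2\nu+1}}}{\rho^{4k} \log^k(N)} \right)^{\frac{1}{k-2}}$ (A), or $m \le c \, \frac{N^{\frac{2\nu-1}{2\nu+1}}}{\rho^4 \log(N)}$ (A'), where $\frac{2\nu-1}{2\nu+1} \in (0, 1)$,
then $E[\|\bar{f} - f^*\|_2^2] = O\left( \left(\frac{\sigma^2}{N}\right)^{\frac{2\nu}{2\nu+1}} \right)$, where $\frac{2\nu}{2\nu+1} \in \left(\frac{1}{2}, 1\right)$. (12)
Moreover, (12) is minimax-optimal.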

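19 Consequence-3 (exponentially decaying eigenvalues; example: RBF; $c_i > 0$)
Assumption: $\mu_j \le c_1 e^{-c_2 j^2}$, $\lambda = \frac{1}{N}$, A (or A') and B. If
$m \le c \, \frac{N^{\frac{k-4}{k-2}}}{\rho^{\frac{4k}{k-2}} \log^{\frac{2k-1}{k-2}}(N)}$ (A), or $m \le c \, \frac{N}{\rho^4 \log^2(N)}$ (A'),
then $E[\|\bar{f} - f^*\|_2^2] = O\left( \sigma^2 \frac{\sqrt{\log(N)}}{N} \right)$. (13)
Moreover, (13) is minimax-optimal.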
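
To get a feel for how many blocks the three consequences allow, the sketch below evaluates the upper bounds on $m$ under the bounded-eigenfunction assumption A'. The constant $c$ is not specified in the talk, so `c=1.0` is a placeholder assumption, and the function names are hypothetical; the numbers are only meaningful up to that constant.

```python
import math

def max_blocks_finite_rank(N, r, rho, c=1.0):
    """Consequence-1 under A': m <= c * N / (r^2 * rho^4 * log N)."""
    return c * N / (r**2 * rho**4 * math.log(N))

def max_blocks_polynomial(N, nu, rho, c=1.0):
    """Consequence-2 under A': m <= c * N^((2 nu - 1) / (2 nu + 1)) / (rho^4 * log N)."""
    return c * N ** ((2.0 * nu - 1.0) / (2.0 * nu + 1.0)) / (rho**4 * math.log(N))

def max_blocks_exponential(N, rho, c=1.0):
    """Consequence-3 under A': m <= c * N / (rho^4 * log^2 N)."""
    return c * N / (rho**4 * math.log(N) ** 2)

N = 10**6
print("finite rank (r = 5):   ", max_blocks_finite_rank(N, r=5, rho=1.0))
print("polynomial (nu = 1):   ", max_blocks_polynomial(N, nu=1.0, rho=1.0))
print("exponential decay:     ", max_blocks_exponential(N, rho=1.0))
```

Up to constants, faster eigenvalue decay permits more blocks: the polynomial case allows only a fractional power of $N$, while the finite-rank and exponential cases allow $m$ nearly linear in $N$.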

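20 Theorem: decomposition trick
$E\|\bar{f} - f^*\|_2^2 = E\|(\bar{f} - E[\bar{f}]) + (E[\bar{f}] - f^*)\|_2^2$
$= E\|\bar{f} - E[\bar{f}]\|_2^2 + \|E[\bar{f}] - f^*\|_2^2 + 2 E\langle \bar{f} - E[\bar{f}], E[\bar{f}] - f^* \rangle_{L^2(P_X)}$
$= E\left\|\frac{1}{m} \sum_{i=1}^m (\hat{f}_i - E[\hat{f}_i])\right\|_2^2 + \|E[\bar{f}] - f^*\|_2^2$
$= \frac{1}{m^2} \sum_{i=1}^m E\|\hat{f}_i - E[\hat{f}_i]\|_2^2 + \|E[\hat{f}_1] - f^*\|_2^2$
$\le \frac{1}{m} E\|\hat{f}_1 - f^*\|_2^2 + \|E[\hat{f}_1] - f^*\|_2^2 = \frac{\text{variance}}{m} + \text{bias}^2$,
using $f^* \in H$, $E[\hat{f}_i] = \arg\min_{f \in H} E\|\hat{f}_i - f\|_2^2$, $E[\bar{f}] = E[\hat{f}_j]$, and $E\langle \text{rnd}, \text{const} \rangle = \langle E[\text{rnd}], \text{const} \rangle$ (so the cross term vanishes). The fourth line uses that the $\hat{f}_i - E[\hat{f}_i]$ are independent and zero-mean, so that $E\|\sum_{i=1}^m h_i\|^2 = \sum_{i=1}^m E\|h_i\|^2$ for such $h_i$ in a Hilbert space.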
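
A toy Monte Carlo check can make the variance/$m$ plus squared-bias structure tangible. The sketch deliberately replaces kernel ridge regression with a scalar analogue, which is an assumption for illustration only: each block estimates a mean $\theta$ by the ridge-shrunken block average $\bar{y}_i / (1 + \lambda)$, and the $m$ local estimates are averaged. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma = 1.0, 1.0      # true mean and noise level (illustrative)
n, m, lam = 10, 5, 0.1       # block size, number of blocks, ridge parameter
trials = 20000               # Monte Carlo repetitions

# data[t, i, :] holds block i in repetition t.
data = theta + sigma * rng.standard_normal((trials, m, n))

# Local ridge estimates: argmin_t (1/n) sum_j (t - y_j)^2 + lam * t^2 = mean / (1 + lam).
local = data.mean(axis=2) / (1.0 + lam)
averaged = local.mean(axis=1)

# Empirical MSE of the averaged estimator.
mse_avg = np.mean((averaged - theta) ** 2)

# Exact variance of one local estimate and its squared bias.
variance_1 = sigma**2 / (n * (1.0 + lam) ** 2)
bias_sq = (theta / (1.0 + lam) - theta) ** 2
predicted = variance_1 / m + bias_sq

print("empirical MSE:", mse_avg)
print("variance/m + bias^2:", predicted)
```

Averaging divides the variance by $m$ but leaves the bias untouched, which is why each local problem can afford to be under-regularized.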

21 Summary
Goal: conditional expectation approximation.
Tool: kernel ridge regression, $O(N^3)$ time.
Studied algorithm: simple, parallelizable.
Result: MSE bound; explicit rates + minimax optimality for 3 (kernel, $P$) classes.

22 Thank you for the attention!

23 Operator property: definitions
A bounded linear operator $T : H \to H$ ($H$: Hilbert space) is
positive: $\langle Ta, a \rangle_H \ge 0$ ($\forall a \in H$).
self-adjoint: $T^* = T$.
compact: the closure of $T(B_H)$ is compact, where $B_H = \{u \in H : \|u\|_H \le 1\}$. Example: finite rank operator. Alternative definition: the compact operators are the closure of the finite rank operators (in operator norm).

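24 Sobolev space
$\mathcal{X} \subseteq \mathbb{R}^d$: bounded domain. $p \in [1, \infty]$, $|\alpha| = \sum_{i=1}^d \alpha_i$.
Weak derivative of $u$ (extension of the integration by parts formula): $D^\alpha u$.
$W^{m,p}(\mathcal{X}) := \{u \in L^p(\mathcal{X}) : D^\alpha u \in L^p(\mathcal{X}), \ \forall |\alpha| \le m\}$.
Example: $W^{1,\infty}(I)$ = Lipschitz functions on the interval $I$.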
