Local Self-Tuning in Nonparametric Regression
Samory Kpotufe, ORFE, Princeton University
Local Regression
Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$, with $f$ nonparametric, i.e. $f \in \mathcal{F}$ where $\dim(\mathcal{F}) = \infty$.
Learn: $f_n(x) = \text{avg}(Y_i)$ over Neighbors(x) (e.g. k-NN, kernel, or tree-based regression).
Quite basic, hence common in modern applications.
Sensitive to the choice of Neighbors(x): $k$, bandwidth $h$, tree cell size.
Goal: choose Neighbors(x) optimally!
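As a concrete illustration of such a local-averaging learner, here is a minimal k-NN regression sketch in Python; the plain unweighted averaging and the toy data are my own illustrative choices, not the specific estimator analyzed later in the talk.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """Minimal k-NN regression: average the Y values of the k nearest X_i to x."""
    dists = np.linalg.norm(X - x, axis=1)      # distances in R^D
    nn = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    return Y[nn].mean()                        # plain (unweighted) local average

# Toy usage: Y = f(X) + noise with f(x) = sin(2*pi*x_1)
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(knn_regress(np.array([0.3, 0.5, 0.5]), X, Y, k=25))
```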
Performance depends on problem parameters
Performance would depend on dim(X) and on how fast f varies...
Suppose $\mathcal{X} \subset \mathbb{R}^D$, and for all $x, x'$: $|f(x) - f(x')| \leq \lambda \|x - x'\|^{\alpha}$.
Performance measure: $\|f_n - f\|_{2,P_X}^2 \doteq \mathbb{E}_X |f_n(X) - f(X)|^2$.
Minimax global performance (Stone '80-'82): $\|f_n - f\|_{2,P_X}^2 \asymp \lambda^{2D/(2\alpha+D)} n^{-2\alpha/(2\alpha+D)}$.
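To make the curse of dimensionality concrete, a quick back-of-the-envelope computation (my own illustrative numbers, Lipschitz case $\alpha = 1$) of the $n$-exponent $2\alpha/(2\alpha+D)$ above, and of how much more data is needed to halve the squared error as $D$ grows:

```python
# Minimax rate exponent 2*alpha/(2*alpha + D): squared error behaves like n^(-exponent).
alpha = 1.0  # Lipschitz f
for D in (2, 5, 10, 50):
    exponent = 2 * alpha / (2 * alpha + D)
    # samples needed to halve the squared error scale like 2^(1/exponent)
    print(f"D={D:2d}: rate n^(-{exponent:.3f}), halving the error needs ~{2**(1/exponent):.0f}x more data")
```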
Some milder situations for $\mathcal{X} \subset \mathbb{R}^D$: f is quite smooth, f is sparse, f is additive, ...
Of interest here: $\mathcal{X}$ has low intrinsic dimension $d \ll D$.
$\mathcal{X} \subset \mathbb{R}^D$ but has low intrinsic dimension $d \ll D$
[Figures: linear data, manifold data, sparse data]
Common approach: dimension reduction or estimation.
Dimension reduction/estimation increases tuning!
Recent alternative: $f_n$ operates in $\mathbb{R}^D$ but adapts to the unknown $d$ of $\mathcal{X}$.
We want: $\|f_n - f\|_{2,P_X}^2 \approx n^{-1/(Cd)} \ll n^{-1/(CD)}$.
Some work on adaptivity to intrinsic dimension:
Kernel and local polynomial regression: Bickel and Li 2006; Lafferty and Wasserman 2007 (manifold dim.).
Dyadic tree classification: Scott and Nowak 2006 (box dim.).
RPtree, dyadic tree regression: Kpotufe and Dasgupta 2011 (doubling dim.).
Adaptivity to intrinsic d
Main insight: key algorithmic quantities depend on d, not on D.
Kernel reg.: avg. mass of a ball of radius $\epsilon$ is approx. $\epsilon^d$.
RPtree: number of cells of diameter $\epsilon$ is approx. $\epsilon^{-d}$.
For Lipschitz f, $\|f_{n,\epsilon} - f\|_{2,P_X}^2 \lesssim \frac{\epsilon^{-d}}{n} + \epsilon^2$.
Cross-validate over $\epsilon$ for a good rate in terms of d (a sketch follows below).
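A minimal sketch of what "cross-validate over $\epsilon$" could look like for a fixed-bandwidth local average; the box kernel, the grid, and the single holdout split are my own illustrative choices, not the exact procedure from the cited works.

```python
import numpy as np

def box_kernel_regress(x, X, Y, eps):
    """Average Y_i over the ball B(x, eps); fall back to the nearest neighbor if the ball is empty."""
    dists = np.linalg.norm(X - x, axis=1)
    in_ball = dists <= eps
    if not in_ball.any():
        return Y[np.argmin(dists)]
    return Y[in_ball].mean()

def cv_choose_eps(X_tr, Y_tr, X_val, Y_val, grid):
    """Pick the bandwidth eps minimizing the holdout squared error."""
    errs = []
    for eps in grid:
        preds = np.array([box_kernel_regress(x, X_tr, Y_tr, eps) for x in X_val])
        errs.append(np.mean((preds - Y_val) ** 2))
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2)); Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(400)
best = cv_choose_eps(X[:300], Y[:300], X[300:], Y[300:], np.geomspace(0.05, 1.0, 10))
print("selected eps:", best)
```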
Insights help with tuning under time constraints.
Main idea: compress the data in a way that respects the structure of $\mathcal{X}$.
Faster kernel regression [Kpo. 2009]. Fast online tree-regression [Kpo. and Orabona 2013].
$O(\log n)$ time vs. $O(n)$, and the regression rates remain optimal!
So far, we have viewed d as a global characteristic of X...
Problem complexity is likely to depend on location!
Space $\mathcal{X}$: $d = d(x)$. Function $f$: $\alpha = \alpha(x)$.
Choose Neighbors(x) adaptively so that: $|f_n(x) - f(x)|^2 \lesssim \lambda_x^{2d_x/(2\alpha_x+d_x)} \, n^{-2\alpha_x/(2\alpha_x+d_x)}$.
Choosing Neighbors(x): we cannot cross-validate locally at x!
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.
Local smoothness
Use local Hölder parameters $\lambda = \lambda(x)$, $\alpha = \alpha(x)$ on $B(x, r)$:
for all $x' \in B(x, r)$, $|f(x) - f(x')| \leq \lambda \, \rho(x, x')^{\alpha}$.
Example: $f(x) = |x|^{\alpha}$ is flatter at $x = 0$ as $\alpha$ is increased.
Local dimension
[Figure: d-dimensional balls centered at x.]
Volume growth: $\mathrm{vol}(B(x, r)) = C r^d = \epsilon^{-d} \, \mathrm{vol}(B(x, \epsilon r))$.
If $P_X$ is uniform on $B(x, r)$, then $P_X(B(x, r)) = \epsilon^{-d} P_X(B(x, \epsilon r))$.
Def.: $P_X$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \leq r$ and $\epsilon \in (0, 1]$, $P_X(B(x, r')) \leq C \epsilon^{-d} P_X(B(x, \epsilon r'))$.
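The homogeneity condition suggests a simple empirical illustration (my own sketch, not an estimator used in the talk): count sample points in balls of radius r and r/2 around x; when $P_X$ is locally d-dimensional the ratio of the counts behaves like $2^d$.

```python
import numpy as np

def local_dim_estimate(x, X, r):
    """Rough local dimension from ball-mass growth: P(B(x,r)) / P(B(x,r/2)) ~ 2^d."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r, n_half = np.sum(dists <= r), np.sum(dists <= r / 2)
    if n_half == 0:
        return None  # not enough data at this scale
    return np.log2(n_r / n_half)

# Data living on a 2-d plane embedded in R^10: the estimate should be close to 2.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(2000, 2))
X = np.hstack([Z, np.zeros((2000, 8))])
print(local_dim_estimate(X[0], X, r=0.3))
```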
The growth of P X can capture the intrinsic dimension in B(x). Location of query x matters! Size of neighborhood B matters!
Size of neighborhood B matters! For k-NN or kernel regression, the size of B depends on n and on k or h.
The growth of $P_X(B)$ can capture the intrinsic dimension locally.
[Figures: linear data, manifold data, sparse data]
$\mathcal{X}$ can be a collection of subspaces of various dimensions.
Intrinsic d tightly captures the minimax rate:
Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \leq \epsilon^{-d} \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $P_{X,Y}$, where $P_X = \mu$ and $f(x) = \mathbb{E}[Y \mid x]$ is $\lambda$-Lipschitz, such that
$\mathbb{E}_{P_{X,Y}^n} \|f_n - f\|_{2,\mu}^2 \gtrsim \lambda^{2d/(2+d)} n^{-2/(2+d)}$.
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.
Main Assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$.
$P_X$ is locally homogeneous with unknown $d(x)$.
$f$ is $\lambda$-Lipschitz on $\mathcal{X}$, i.e. $\alpha = 1$.
k-NN regression: $f_n(x)$ = weighted avg of the $Y_i$ over k-NN(x).
Suppose $\mathcal{X} \subset \mathbb{R}^D$: the learner operates in $\mathbb{R}^D$! No dimensionality reduction, no dimension estimation!
Bias-Variance tradeoff
$\mathbb{E}_{(X_i, Y_i)_1^n} |f_n(x) - f(x)|^2 = \underbrace{\mathbb{E}\, |f_n(x) - \mathbb{E} f_n(x)|^2}_{\text{Variance}} + \underbrace{|\mathbb{E} f_n(x) - f(x)|^2}_{\text{Bias}^2}$
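For completeness, the standard one-step derivation behind this decomposition (with conditioning on the design points suppressed for readability): expanding the square around $\mathbb{E} f_n(x)$,

\[
\mathbb{E}\,|f_n(x) - f(x)|^2
= \mathbb{E}\,|f_n(x) - \mathbb{E} f_n(x)|^2
+ |\mathbb{E} f_n(x) - f(x)|^2
+ 2\,\mathbb{E}\bigl[f_n(x) - \mathbb{E} f_n(x)\bigr]\,\bigl(\mathbb{E} f_n(x) - f(x)\bigr),
\]

and the cross term vanishes since $\mathbb{E}[f_n(x) - \mathbb{E} f_n(x)] = 0$.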
General intuition:
Fix $n \geq k \geq \log n$, and let $x$ belong to a region $B$ of dimension $d$.
The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/k$.
(Bias of $f_n(x)$) $\approx r_k(x)$, the distance from $x$ to its k-th nearest neighbor.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{k} + r_k(x)^2$.
It turns out: $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$.
Choosing k locally at x - Intuition
Remember: cross-validation or dimension estimation at x are impractical.
Instead: monitor the surrogates $1/k$ and $r_k(x)^2$ and balance them directly from the data (a sketch follows below).
Main technical hurdle: the intrinsic dimension might vary with k.
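A minimal sketch of such a local balancing rule, assuming the variance surrogate is $1/k$ and the bias surrogate is $\lambda^2 r_k(x)^2$ with $\lambda$ treated as known; the exact selection rule analyzed in the talk may differ (e.g. in constants and logarithmic factors).

```python
import numpy as np

def choose_k_local(x, X, lam=1.0, k_min=5):
    """Pick k(x) by balancing the variance surrogate 1/k against the bias
    surrogate lam^2 * r_k(x)^2, where r_k(x) is the k-th nearest-neighbor distance."""
    r = np.sort(np.linalg.norm(X - x, axis=1))        # r[k-1] = distance to the k-th NN
    ks = np.arange(k_min, len(X) + 1)
    score = 1.0 / ks + (lam * r[ks - 1]) ** 2         # surrogate of variance + bias^2
    return int(ks[np.argmin(score)])

# Two regions of different local density: the selected k adapts to the query location.
rng = np.random.default_rng(0)
X_dense = rng.uniform(size=(900, 2)) * 0.5            # densely sampled region
X_sparse = rng.uniform(size=(100, 2)) * 4 + 2         # sparsely sampled region
X = np.vstack([X_dense, X_sparse])
print(choose_k_local(np.array([0.25, 0.25]), X), choose_k_local(np.array([4.0, 4.0]), X))
```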
Choosing k(x) - Result
Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any $B$ centered at $x$ s.t. $P_X(B) \geq n^{-1/3}$. Suppose $P_X$ is $(C, d)$-homogeneous on $B$. We have
$|f_n(x) - f(x)|^2 \leq C \left( \frac{\ln n}{n P_X(B)} \right)^{2/(2+d)} \lambda^2$.
As $n \to \infty$, the claim applies to any $B$ centered at $x$, with $P_X(B) \to 0$.
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example. (Recent work with Vikas Garg)
Main Assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$ of diameter 1.
$P_X$ is locally homogeneous with unknown $d(x)$.
$f$ is locally Hölder with unknown $\lambda(x)$, $\alpha(x)$.
Kernel regression: $f_n(x)$ = weighted avg of the $Y_i$ for $X_i \in B_\rho(x, h)$.
General intuition:
Fix $0 < h < 1$, and let $x$ belong to a region $B$ of dimension $d$.
The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/n_h(x)$, where $n_h(x)$ = number of samples falling in $B(x, h)$.
(Bias of $f_n(x)$)$^2 \approx h^{2\alpha}$.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{n_h(x)} + h^{2\alpha}$.
It turns out: $n_h(x) \approx n h^d$, where $d = d(B)$.
From the previous intuition
Suppose we know $\alpha(x)$ but not $d(x)$.
Monitor $\frac{1}{n_h(x)}$ and $h^{2\alpha}$.
Picking $\hat{h}_d(x)$ that balances the two (a sketch follows below): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
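A minimal sketch of this rule for known $\alpha$, with $n_h(x)$ taken to be the number of sample points in $B(x, h)$ and the bandwidth chosen on a finite grid; the constants and the grid are illustrative choices, not the talk's precise rule.

```python
import numpy as np

def choose_h_known_alpha(x, X, alpha, grid):
    """Pick h(x) balancing the variance surrogate 1/n_h(x) against the bias surrogate h^(2*alpha)."""
    dists = np.linalg.norm(X - x, axis=1)
    best_h, best_score = grid[0], np.inf
    for h in grid:
        n_h = np.sum(dists <= h)                      # empirical mass of B(x, h), times n
        score = (np.inf if n_h == 0 else 1.0 / n_h) + h ** (2 * alpha)
        if score < best_score:
            best_h, best_score = h, score
    return best_h

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
print(choose_h_known_alpha(np.array([0.5, 0.5]), X, alpha=1.0, grid=np.geomspace(0.02, 0.5, 15)))
```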
From Lepski
Suppose we know $d(x)$ but not $\alpha(x)$.
Intuition: for every $h < h^*$, $\frac{1}{n h^d} > h^{2\alpha}$; therefore for such $h$,
$|f_n(h; x) - f(x)|^2 \lesssim \frac{1}{n h^d} + h^{2\alpha} \leq 2 \cdot \frac{1}{n h^d}$.
From Lepski
Suppose we know $d(x)$ but not $\alpha(x)$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n h^d}} \right]$, $h < h^*$, must intersect!
Picking $\hat{h}_\alpha(x)$ as the largest such $h$: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
Combine Lepski with the previous intuition
We know neither $d$ nor $\alpha$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n_h(x)}} \right]$, $h < \hat{h}_d$, must intersect!
Picking $\hat{h}_{\alpha,d}(x)$ (a sketch follows below): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(\hat{h}_d) \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
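A minimal sketch of the combined rule under the stated assumptions: confidence intervals of half-width $2/\sqrt{n_h(x)}$ around the kernel estimates, and the selected bandwidth is the largest grid value whose interval still intersects those of all smaller bandwidths. The kernel profile, constants, and grid are illustrative choices, not the exact procedure from the talk.

```python
import numpy as np

def kernel_est(x, X, Y, h):
    """Plain average of Y_i over B(x, h) (nearest neighbor if the ball is empty)."""
    dist = np.linalg.norm(X - x, axis=1)
    inb = dist <= h
    return Y[inb].mean() if inb.any() else Y[np.argmin(dist)]

def lepski_choose_h(x, X, Y, grid):
    """Largest h in the (increasing) grid whose interval intersects those of all smaller bandwidths."""
    dists = np.linalg.norm(X - x, axis=1)
    intervals = []
    for h in sorted(grid):
        n_h = max(int(np.sum(dists <= h)), 1)
        est = kernel_est(x, X, Y, h)
        width = 2.0 / np.sqrt(n_h)                    # surrogate confidence half-width
        intervals.append((h, est - width, est + width))
    chosen = intervals[0][0]
    for i, (h, lo, hi) in enumerate(intervals):
        # does interval(h) intersect every interval at smaller bandwidths?
        if all(lo <= hi_s and lo_s <= hi for _, lo_s, hi_s in intervals[:i]):
            chosen = h
        else:
            break
    return chosen

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
Y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
print(lepski_choose_h(np.array([0.5, 0.5]), X, Y, np.geomspace(0.02, 0.5, 12)))
```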
Choosing $\hat{h}_{\alpha,d}(x)$ - Result
Tightness assumption on $d(x)$: there exists $r_0$ such that for all $x \in \mathcal{X}$ there are $C, C', d$ with: for all $r \leq r_0$, $C r^d \leq P_X(B(x, r)) \leq C' r^d$.
Theorem: Suppose $\hat{h}_{\alpha,d}(x)$ is chosen as described. Let $n \geq N(r_0)$. The following holds w.h.p. simultaneously for all $x$. Let $d, \alpha, \lambda$ be the local problem parameters on $B(x, r_0)$. We have
$|f_n(x) - f(x)|^2 \lesssim \left( \frac{\ln n}{n} \right)^{2\alpha/(2\alpha+d)} \lambda^{2d/(2\alpha+d)}$.
The rate is optimal.
Simulation on data with mixed spatial complexity... the approach works, but should be made more efficient!
Future direction: Extend self-tuning to tree-based kernel implementations.
Initial experiments with tree-based kernel. Without cross-validation: automatically detect the interval containing the good h.
Future direction: Extend self-tuning to data streaming setting.
Initial streaming experiments:
[Figure: normalized RMSE on the test set vs. number of training samples (up to 4 × 10^4); synthetic dataset with d = 5, D = 100, first 1000 samples with d = 1; curves: incremental tree-based vs. fixed d = 1, d = 5, d = 10.]
Robustness to increasing intrinsic dimension.
Future direction: Adaptive error bands. Would lead to more local sampling strategies.
TAKE HOME MESSAGE:
We can adapt to intrinsic $d(x)$ without preprocessing.
Local learners can self-tune optimally to local $d(x)$ and $\alpha(x)$.
Results extend to plug-in classification!
Many potential future directions!
Thank you!