Local Self-Tuning in Nonparametric Regression
Samory Kpotufe, ORFE, Princeton University
Local Regression
Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$, with $f$ nonparametric, i.e. $f \in \mathcal{F}$ where $\dim(\mathcal{F}) = \infty$.
Learn: $f_n(x) = \text{avg}(Y_i)$ over Neighbors(x) (e.g. k-NN, kernel, or tree-based regression).
Quite basic, hence common in modern applications.
Sensitive to the choice of Neighbors(x): $k$, bandwidth $h$, tree cell size.
Goal: choose Neighbors(x) optimally!
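As a concrete illustration of such a local-averaging learner, here is a minimal k-NN regression sketch in Python; the plain unweighted averaging and the toy data are my own illustrative choices, not the specific estimator analyzed later in the talk.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """Minimal k-NN regression: average the Y values of the k nearest X_i to x."""
    dists = np.linalg.norm(X - x, axis=1)      # distances in R^D
    nn = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    return Y[nn].mean()                        # plain (unweighted) local average

# Toy usage: Y = f(X) + noise with f(x) = sin(2*pi*x_1)
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(knn_regress(np.array([0.3, 0.5, 0.5]), X, Y, k=25))
```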
Performance depends on problem parameters
Performance would depend on dim(X) and on how fast f varies...
Suppose $\mathcal{X} \subset \mathbb{R}^D$, and for all $x, x'$: $|f(x) - f(x')| \leq \lambda \|x - x'\|^{\alpha}$.
Performance measure: $\|f_n - f\|_{2,P_X}^2 \doteq \mathbb{E}_X |f_n(X) - f(X)|^2$.
Minimax global performance (Stone '80-'82): $\|f_n - f\|_{2,P_X}^2 \asymp \lambda^{2D/(2\alpha+D)} n^{-2\alpha/(2\alpha+D)}$.
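To make the curse of dimensionality concrete, a quick back-of-the-envelope computation (my own illustrative numbers, Lipschitz case $\alpha = 1$) of the $n$-exponent $2\alpha/(2\alpha+D)$ above, and of how much more data is needed to halve the squared error as $D$ grows:

```python
# Minimax rate exponent 2*alpha/(2*alpha + D): squared error behaves like n^(-exponent).
alpha = 1.0  # Lipschitz f
for D in (2, 5, 10, 50):
    exponent = 2 * alpha / (2 * alpha + D)
    # samples needed to halve the squared error scale like 2^(1/exponent)
    print(f"D={D:2d}: rate n^(-{exponent:.3f}), halving the error needs ~{2**(1/exponent):.0f}x more data")
```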
Some milder situations for $\mathcal{X} \subset \mathbb{R}^D$: f is quite smooth, f is sparse, f is additive, ...
Of interest here: $\mathcal{X}$ has low intrinsic dimension $d \ll D$.
$\mathcal{X} \subset \mathbb{R}^D$ but has low intrinsic dimension $d \ll D$
[Figures: linear data, manifold data, sparse data]
Common approach: dimension reduction or estimation.
Dimension reduction/estimation increases tuning!
Recent alternative: $f_n$ operates in $\mathbb{R}^D$ but adapts to the unknown $d$ of $\mathcal{X}$.
We want: $\|f_n - f\|_{2,P_X}^2 \approx n^{-1/(Cd)} \ll n^{-1/(CD)}$.
Some work on adaptivity to intrinsic dimension:
Kernel and local polynomial regression: Bickel and Li 2006; Lafferty and Wasserman 2007 (manifold dim.).
Dyadic tree classification: Scott and Nowak 2006 (box dim.).
RPtree, dyadic tree regression: Kpotufe and Dasgupta 2011 (doubling dim.).
Adaptivity to intrinsic d
Main insight: key algorithmic quantities depend on d, not on D.
Kernel reg.: avg. mass of a ball of radius $\epsilon$ is approx. $\epsilon^d$.
RPtree: number of cells of diameter $\epsilon$ is approx. $\epsilon^{-d}$.
For Lipschitz f, $\|f_{n,\epsilon} - f\|_{2,P_X}^2 \lesssim \frac{\epsilon^{-d}}{n} + \epsilon^2$.
Cross-validate over $\epsilon$ for a good rate in terms of d (a sketch follows below).
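A minimal sketch of what "cross-validate over $\epsilon$" could look like for a fixed-bandwidth local average; the box kernel, the grid, and the single holdout split are my own illustrative choices, not the exact procedure from the cited works.

```python
import numpy as np

def box_kernel_regress(x, X, Y, eps):
    """Average Y_i over the ball B(x, eps); fall back to the nearest neighbor if the ball is empty."""
    dists = np.linalg.norm(X - x, axis=1)
    in_ball = dists <= eps
    if not in_ball.any():
        return Y[np.argmin(dists)]
    return Y[in_ball].mean()

def cv_choose_eps(X_tr, Y_tr, X_val, Y_val, grid):
    """Pick the bandwidth eps minimizing the holdout squared error."""
    errs = []
    for eps in grid:
        preds = np.array([box_kernel_regress(x, X_tr, Y_tr, eps) for x in X_val])
        errs.append(np.mean((preds - Y_val) ** 2))
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2)); Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(400)
best = cv_choose_eps(X[:300], Y[:300], X[300:], Y[300:], np.geomspace(0.05, 1.0, 10))
print("selected eps:", best)
```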
Insights help with tuning under time constraints.
Main idea: compress the data in a way that respects the structure of $\mathcal{X}$.
Faster kernel regression [Kpo. 2009]. Fast online tree-regression [Kpo. and Orabona 2013].
$O(\log n)$ time vs. $O(n)$, and the regression rates remain optimal!
So far, we have viewed d as a global characteristic of X...
Problem complexity is likely to depend on location!
Space $\mathcal{X}$: $d = d(x)$. Function $f$: $\alpha = \alpha(x)$.
Choose Neighbors(x) adaptively so that: $|f_n(x) - f(x)|^2 \lesssim \lambda_x^{2d_x/(2\alpha_x+d_x)} \, n^{-2\alpha_x/(2\alpha_x+d_x)}$.
Choosing Neighbors(x): we cannot cross-validate locally at x!
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.
Local smoothness
Use local Hölder parameters $\lambda = \lambda(x)$, $\alpha = \alpha(x)$ on $B(x, r)$:
for all $x' \in B(x, r)$, $|f(x) - f(x')| \leq \lambda \, \rho(x, x')^{\alpha}$.
Example: $f(x) = |x|^{\alpha}$ is flatter at $x = 0$ as $\alpha$ is increased.
Local dimension
[Figure: d-dimensional balls centered at x.]
Volume growth: $\mathrm{vol}(B(x, r)) = C r^d = \epsilon^{-d} \, \mathrm{vol}(B(x, \epsilon r))$.
If $P_X$ is uniform on $B(x, r)$, then $P_X(B(x, r)) = \epsilon^{-d} P_X(B(x, \epsilon r))$.
Def.: $P_X$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \leq r$ and $\epsilon \in (0, 1]$, $P_X(B(x, r')) \leq C \epsilon^{-d} P_X(B(x, \epsilon r'))$.
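The homogeneity condition suggests a simple empirical illustration (my own sketch, not an estimator used in the talk): count sample points in balls of radius r and r/2 around x; when $P_X$ is locally d-dimensional the ratio of the counts behaves like $2^d$.

```python
import numpy as np

def local_dim_estimate(x, X, r):
    """Rough local dimension from ball-mass growth: P(B(x,r)) / P(B(x,r/2)) ~ 2^d."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r, n_half = np.sum(dists <= r), np.sum(dists <= r / 2)
    if n_half == 0:
        return None  # not enough data at this scale
    return np.log2(n_r / n_half)

# Data living on a 2-d plane embedded in R^10: the estimate should be close to 2.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(2000, 2))
X = np.hstack([Z, np.zeros((2000, 8))])
print(local_dim_estimate(X[0], X, r=0.3))
```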
The growth of P X can capture the intrinsic dimension in B(x). Location of query x matters! Size of neighborhood B matters!
Size of neighborhood B matters! For k-NN or kernel regression, the size of B depends on n and on k or h.
The growth of $P_X(B)$ can capture the intrinsic dimension locally.
[Figures: linear data, manifold data, sparse data]
$\mathcal{X}$ can be a collection of subspaces of various dimensions.
Intrinsic d tightly captures the minimax rate:
Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \leq \epsilon^{-d} \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $P_{X,Y}$, where $P_X = \mu$ and $f(x) = \mathbb{E}[Y \mid x]$ is $\lambda$-Lipschitz, such that
$\mathbb{E}_{P_{X,Y}^n} \|f_n - f\|_{2,\mu}^2 \gtrsim \lambda^{2d/(2+d)} n^{-2/(2+d)}$.
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.
Main Assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$.
$P_X$ is locally homogeneous with unknown $d(x)$.
$f$ is $\lambda$-Lipschitz on $\mathcal{X}$, i.e. $\alpha = 1$.
k-NN regression: $f_n(x)$ = weighted avg of the $Y_i$ over k-NN(x).
Suppose $\mathcal{X} \subset \mathbb{R}^D$: the learner operates in $\mathbb{R}^D$! No dimensionality reduction, no dimension estimation!
Bias-Variance tradeoff
$\mathbb{E}_{(X_i, Y_i)_1^n} |f_n(x) - f(x)|^2 = \underbrace{\mathbb{E}\, |f_n(x) - \mathbb{E} f_n(x)|^2}_{\text{Variance}} + \underbrace{|\mathbb{E} f_n(x) - f(x)|^2}_{\text{Bias}^2}$
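For completeness, the standard one-step derivation behind this decomposition (with conditioning on the design points suppressed for readability): expanding the square around $\mathbb{E} f_n(x)$,

\[
\mathbb{E}\,|f_n(x) - f(x)|^2
= \mathbb{E}\,|f_n(x) - \mathbb{E} f_n(x)|^2
+ |\mathbb{E} f_n(x) - f(x)|^2
+ 2\,\mathbb{E}\bigl[f_n(x) - \mathbb{E} f_n(x)\bigr]\,\bigl(\mathbb{E} f_n(x) - f(x)\bigr),
\]

and the cross term vanishes since $\mathbb{E}[f_n(x) - \mathbb{E} f_n(x)] = 0$.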
General intuition:
Fix $n \geq k \geq \log n$, and let $x$ belong to a region $B$ of dimension $d$.
The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/k$.
(Bias of $f_n(x)$) $\approx r_k(x)$, the distance from $x$ to its k-th nearest neighbor.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{k} + r_k(x)^2$.
It turns out: $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$.
Choosing k locally at x - Intuition
Remember: cross-validation or dimension estimation at x are impractical.
Instead: monitor the surrogates $1/k$ and $r_k(x)^2$ and balance them directly from the data (a sketch follows below).
Main technical hurdle: the intrinsic dimension might vary with k.
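A minimal sketch of such a local balancing rule, assuming the variance surrogate is $1/k$ and the bias surrogate is $\lambda^2 r_k(x)^2$ with $\lambda$ treated as known; the exact selection rule analyzed in the talk may differ (e.g. in constants and logarithmic factors).

```python
import numpy as np

def choose_k_local(x, X, lam=1.0, k_min=5):
    """Pick k(x) by balancing the variance surrogate 1/k against the bias
    surrogate lam^2 * r_k(x)^2, where r_k(x) is the k-th nearest-neighbor distance."""
    r = np.sort(np.linalg.norm(X - x, axis=1))        # r[k-1] = distance to the k-th NN
    ks = np.arange(k_min, len(X) + 1)
    score = 1.0 / ks + (lam * r[ks - 1]) ** 2         # surrogate of variance + bias^2
    return int(ks[np.argmin(score)])

# Two regions of different local density: the selected k adapts to the query location.
rng = np.random.default_rng(0)
X_dense = rng.uniform(size=(900, 2)) * 0.5            # densely sampled region
X_sparse = rng.uniform(size=(100, 2)) * 4 + 2         # sparsely sampled region
X = np.vstack([X_dense, X_sparse])
print(choose_k_local(np.array([0.25, 0.25]), X), choose_k_local(np.array([4.0, 4.0]), X))
```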
Choosing k(x) - Result
Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any $B$ centered at $x$ s.t. $P_X(B) \geq n^{-1/3}$. Suppose $P_X$ is $(C, d)$-homogeneous on $B$. We have
$|f_n(x) - f(x)|^2 \leq C \left( \frac{\ln n}{n P_X(B)} \right)^{2/(2+d)} \lambda^2$.
As $n \to \infty$, the claim applies to any $B$ centered at $x$, with $P_X(B) \to 0$.
NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example. (Recent work with Vikas Garg)
Main Assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$ of diameter 1.
$P_X$ is locally homogeneous with unknown $d(x)$.
$f$ is locally Hölder with unknown $\lambda(x)$, $\alpha(x)$.
Kernel regression: $f_n(x)$ = weighted avg of the $Y_i$ for $X_i \in B_\rho(x, h)$.
General intuition:
Fix $0 < h < 1$, and let $x$ belong to a region $B$ of dimension $d$.
The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/n_h(x)$, where $n_h(x)$ = number of samples falling in $B(x, h)$.
(Bias of $f_n(x)$)$^2 \approx h^{2\alpha}$.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{n_h(x)} + h^{2\alpha}$.
It turns out: $n_h(x) \approx n h^d$, where $d = d(B)$.
From the previous intuition
Suppose we know $\alpha(x)$ but not $d(x)$.
Monitor $\frac{1}{n_h(x)}$ and $h^{2\alpha}$.
Picking $\hat{h}_d(x)$ that balances the two (a sketch follows below): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
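A minimal sketch of this rule for known $\alpha$, with $n_h(x)$ taken to be the number of sample points in $B(x, h)$ and the bandwidth chosen on a finite grid; the constants and the grid are illustrative choices, not the talk's precise rule.

```python
import numpy as np

def choose_h_known_alpha(x, X, alpha, grid):
    """Pick h(x) balancing the variance surrogate 1/n_h(x) against the bias surrogate h^(2*alpha)."""
    dists = np.linalg.norm(X - x, axis=1)
    best_h, best_score = grid[0], np.inf
    for h in grid:
        n_h = np.sum(dists <= h)                      # empirical mass of B(x, h), times n
        score = (np.inf if n_h == 0 else 1.0 / n_h) + h ** (2 * alpha)
        if score < best_score:
            best_h, best_score = h, score
    return best_h

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
print(choose_h_known_alpha(np.array([0.5, 0.5]), X, alpha=1.0, grid=np.geomspace(0.02, 0.5, 15)))
```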
From Lepski
Suppose we know $d(x)$ but not $\alpha(x)$.
Intuition: for every $h < h^*$, $\frac{1}{n h^d} > h^{2\alpha}$; therefore for such $h$,
$|f_n(h; x) - f(x)|^2 \lesssim \frac{1}{n h^d} + h^{2\alpha} \leq 2 \cdot \frac{1}{n h^d}$.
From Lepski
Suppose we know $d(x)$ but not $\alpha(x)$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n h^d}} \right]$, $h < h^*$, must intersect!
Picking $\hat{h}_\alpha(x)$ as the largest such $h$: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
Combine Lepski with the previous intuition
We know neither $d$ nor $\alpha$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n_h(x)}} \right]$, $h < \hat{h}_d$, must intersect!
Picking $\hat{h}_{\alpha,d}(x)$ (a sketch follows below): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(\hat{h}_d) \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
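A minimal sketch of the combined rule under the stated assumptions: confidence intervals of half-width $2/\sqrt{n_h(x)}$ around the kernel estimates, and the selected bandwidth is the largest grid value whose interval still intersects those of all smaller bandwidths. The kernel profile, constants, and grid are illustrative choices, not the exact procedure from the talk.

```python
import numpy as np

def kernel_est(x, X, Y, h):
    """Plain average of Y_i over B(x, h) (nearest neighbor if the ball is empty)."""
    dist = np.linalg.norm(X - x, axis=1)
    inb = dist <= h
    return Y[inb].mean() if inb.any() else Y[np.argmin(dist)]

def lepski_choose_h(x, X, Y, grid):
    """Largest h in the (increasing) grid whose interval intersects those of all smaller bandwidths."""
    dists = np.linalg.norm(X - x, axis=1)
    intervals = []
    for h in sorted(grid):
        n_h = max(int(np.sum(dists <= h)), 1)
        est = kernel_est(x, X, Y, h)
        width = 2.0 / np.sqrt(n_h)                    # surrogate confidence half-width
        intervals.append((h, est - width, est + width))
    chosen = intervals[0][0]
    for i, (h, lo, hi) in enumerate(intervals):
        # does interval(h) intersect every interval at smaller bandwidths?
        if all(lo <= hi_s and lo_s <= hi for _, lo_s, hi_s in intervals[:i]):
            chosen = h
        else:
            break
    return chosen

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
Y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(2000)
print(lepski_choose_h(np.array([0.5, 0.5]), X, Y, np.geomspace(0.02, 0.5, 12)))
```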
Choosing $\hat{h}_{\alpha,d}(x)$ - Result
Tightness assumption on $d(x)$: there exists $r_0$ such that for all $x \in \mathcal{X}$ there are $C, C', d$ with: for all $r \leq r_0$, $C r^d \leq P_X(B(x, r)) \leq C' r^d$.
Theorem: Suppose $\hat{h}_{\alpha,d}(x)$ is chosen as described. Let $n \geq N(r_0)$. The following holds w.h.p. simultaneously for all $x$. Let $d, \alpha, \lambda$ be the local problem parameters on $B(x, r_0)$. We have
$|f_n(x) - f(x)|^2 \lesssim \left( \frac{\ln n}{n} \right)^{2\alpha/(2\alpha+d)} \lambda^{2d/(2\alpha+d)}$.
The rate is optimal.
Simulation on data with mixed spatial complexity... the approach works, but should be made more efficient!
Future direction: Extend self-tuning to tree-based kernel implementations.
Initial experiments with tree-based kernel. Without cross-validation: automatically detect the interval containing the good h.
Future direction: Extend self-tuning to data streaming setting.
Initial streaming experiments:
[Figure: normalized RMSE on the test set vs. number of training samples (up to 4 × 10^4); synthetic dataset with d = 5, D = 100, first 1000 samples with d = 1; curves: incremental tree-based vs. fixed d = 1, d = 5, d = 10.]
Robustness to increasing intrinsic dimension.
Future direction: Adaptive error bands. Would lead to more local sampling strategies.
TAKE HOME MESSAGE:
We can adapt to intrinsic $d(x)$ without preprocessing.
Local learners can self-tune optimally to local $d(x)$ and $\alpha(x)$.
Results extend to plug-in classification!
Many potential future directions!
Thank you!