Local Self-Tuning in Nonparametric Regression. Samory Kpotufe, ORFE, Princeton University

Local Regression
Data: $\{(X_i, Y_i)\}_{i=1}^n$, with $Y = f(X) + \text{noise}$ and $f$ nonparametric, i.e. $f \in \mathcal{F}$ with $\dim(\mathcal{F}) = \infty$.
Learn: $f_n(x) = \text{avg}(Y_i)$ over $\text{Neighbors}(x)$ (e.g. k-NN, kernel, or tree-based regression).
Quite basic, hence common in modern applications. Sensitive to the choice of $\text{Neighbors}(x)$: $k$, bandwidth $h$, tree cell size.
Goal: choose $\text{Neighbors}(x)$ optimally!
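
To make the setup concrete, here is a minimal sketch (not from the talk) of a plain k-NN local-averaging regressor; the toy data and function names are illustrative assumptions.

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """Predict f(x) as the unweighted average of the Y-values
    of the k nearest neighbors of x among the rows of X."""
    dists = np.linalg.norm(X - x, axis=1)
    idx = np.argpartition(dists, k - 1)[:k]   # indices of the k closest points
    return Y[idx].mean()

# toy usage: noisy sine observed in 2 ambient dimensions
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(knn_regress(np.array([0.2, -0.3]), X, Y, k=20))
```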

Performance depends on problem parameters
Performance depends on $\dim(\mathcal{X})$ and on how fast $f$ varies. Suppose $\mathcal{X} \subset \mathbb{R}^D$, and for all $x, x'$, $|f(x) - f(x')| \le \lambda \|x - x'\|^\alpha$.
Performance measure: $\|f_n - f\|_{2,P_X}^2 := \mathbb{E}_X |f_n(X) - f(X)|^2$.
Minimax global performance (Stone, 1980-82): $\|f_n - f\|_{2,P_X}^2 \asymp \lambda^{2D/(2\alpha+D)}\, n^{-2\alpha/(2\alpha+D)}$.
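
A quick numeric illustration (my own, not from the slides) of how the minimax exponent degrades with the ambient dimension $D$ for Lipschitz $f$ ($\alpha = 1$):

```python
# Minimax rate n^(-2*alpha/(2*alpha + D)): the exponent shrinks rapidly with D
alpha = 1.0
for D in (1, 2, 5, 10, 50, 100):
    print(f"D = {D:3d}: error decays like n^(-{2 * alpha / (2 * alpha + D):.3f})")
```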

Some milder situations for $\mathcal{X} \subset \mathbb{R}^D$: $f$ is quite smooth, $f$ is sparse, $f$ is additive, ... Of interest here: $\mathcal{X}$ has low intrinsic dimension $d \ll D$.

$\mathcal{X} \subset \mathbb{R}^D$ but has low intrinsic dimension $d \ll D$: linear data, manifold data, sparse data.
Common approach: dimension reduction or estimation.

Dimension reduction/estimation increases tuning!
Recent alternative: $f_n$ operates in $\mathbb{R}^D$ but adapts to the unknown $d$ of $\mathcal{X}$. We want: $\|f_n - f\|_{2,P_X}^2 \approx n^{-1/Cd} \ll n^{-1/CD}$.

Some work on adaptivity to intrinsic dimension:
Kernel and local polynomial regression: Bickel and Li 2006; Lafferty and Wasserman 2007 (manifold dimension).
Dyadic tree classification: Scott and Nowak 2006 (box dimension).
RPtree and dyadic tree regression: Kpotufe and Dasgupta 2011 (doubling dimension).

Adaptivity to intrinsic d
Main insight: key algorithmic quantities depend on $d$, not on $D$.
Kernel regression: the average mass of a ball of radius $\epsilon$ is approximately $\epsilon^d$.
RPtree: the number of cells of diameter $\epsilon$ is approximately $\epsilon^{-d}$.
For Lipschitz $f$: $\|f_{n,\epsilon} - f\|_{2,P_X}^2 \lesssim \frac{\epsilon^{-d}}{n} + \epsilon^2$.
Cross-validate over $\epsilon$ for a good rate in terms of $d$ (see the sketch below).
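
Here is a sketch of that cross-validation step: a box-kernel ($\epsilon$-ball) average plus a simple holdout selection of $\epsilon$ over a grid. The function names, holdout split, and constants are my own illustrative choices, not the talk's procedure.

```python
import numpy as np

def ball_average(x, X, Y, eps):
    """Average of Y_i over the X_i falling in the ball B(x, eps)."""
    mask = np.linalg.norm(X - x, axis=1) <= eps
    return Y[mask].mean() if mask.any() else Y.mean()

def cv_epsilon(X, Y, grid, n_val=100, seed=0):
    """Pick eps from `grid` by holdout squared error; the selected eps
    ends up tracking the intrinsic dimension d rather than D."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    val, tr = perm[:n_val], perm[n_val:]
    def holdout_err(eps):
        preds = np.array([ball_average(x, X[tr], Y[tr], eps) for x in X[val]])
        return float(np.mean((preds - Y[val]) ** 2))
    return min(grid, key=holdout_err)

# usage: eps_hat = cv_epsilon(X, Y, grid=np.geomspace(0.05, 1.0, 15))
```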

Insights help with tuning under time constraints.
Main idea: compress the data in a way that respects the structure of $\mathcal{X}$.
Faster kernel regression [Kpotufe 2009]; fast online tree regression [Kpotufe and Orabona 2013].
$O(\log n)$ time vs. $O(n)$, and the regression rates remain optimal!

So far, we have viewed d as a global characteristic of X...

Problem complexity is likely to depend on location!
Space $\mathcal{X}$: $d = d(x)$. Function $f$: $\alpha = \alpha(x)$.
Choose $\text{Neighbors}(x)$ adaptively so that $|f_n(x) - f(x)|^2 \lesssim \lambda_x^{2d_x/(2\alpha_x+d_x)}\, n^{-2\alpha_x/(2\alpha_x+d_x)}$.
Choosing $\text{Neighbors}(x)$: we cannot cross-validate locally at $x$!

NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.

Local smoothness
Use local Hölder parameters $\lambda = \lambda(x)$, $\alpha = \alpha(x)$ on $B(x, r)$: for all $x' \in B(x, r)$, $|f(x) - f(x')| \le \lambda\, \rho(x, x')^\alpha$.
Example: $f(x) = |x|^\alpha$ is flatter at $x = 0$ as $\alpha$ increases.

Local dimension
[Figure: d-dimensional balls centered at x.]
Volume growth: $\text{vol}(B(x, r)) = C r^d = \epsilon^{-d}\, \text{vol}(B(x, \epsilon r))$.
If $P_X$ is uniform on $B(x, r)$, then $P_X(B(x, r)) = \epsilon^{-d}\, P_X(B(x, \epsilon r))$.
Definition: $P_X$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \le r$ and $\epsilon > 0$, $P_X(B(x, r')) \le C \epsilon^{-d}\, P_X(B(x, \epsilon r'))$.
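
The homogeneity condition says that ball masses near $x$ grow like $r^d$. Purely to illustrate the definition (the methods in this talk deliberately avoid estimating $d$), here is a rough empirical check of that growth rate; the data and helper name are assumptions of mine.

```python
import numpy as np

def ball_growth_dim(x, X, r):
    """Rough local-dimension read-off from empirical ball masses:
    if P_X(B(x, r)) ~ r^d near x, then log2(n_r / n_{r/2}) ~ d."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r, n_half = np.sum(dists <= r), np.sum(dists <= r / 2)
    return np.log2(n_r / n_half) if n_half > 0 else np.nan

# 2-dimensional data embedded in R^10: the read-off is close to 2, not 10
rng = np.random.default_rng(1)
Z = rng.uniform(-1, 1, size=(5000, 2))
X = np.hstack([Z, np.zeros((5000, 8))])
print(ball_growth_dim(np.zeros(10), X, r=0.5))
```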

The growth of $P_X$ can capture the intrinsic dimension in a ball $B$ around $x$. The location of the query $x$ matters! The size of the neighborhood $B$ matters!

Size of the neighborhood $B$ matters! For k-NN or kernel regression, the size of $B$ depends on $n$ and on $k$ or $h$ respectively.

The growth of $P_X(B)$ can capture the intrinsic dimension locally: linear data, manifold data, sparse data. $\mathcal{X}$ can be a collection of subspaces of various dimensions.

Intrinsic d tightly captures the minimax rate.
Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \le \epsilon^{-d}\, \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $P_{X,Y}$ with $P_X = \mu$ and $f(x) = \mathbb{E}[Y \mid x]$ $\lambda$-Lipschitz, such that
$\mathbb{E}_{P_{X,Y}^n} \|f_n - f\|_{2,\mu}^2 \gtrsim \lambda^{2d/(2+d)}\, n^{-2/(2+d)}$.

NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example.

Main Assumptions: $\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$. $P_X$ is locally homogeneous with unknown $d(x)$. $f$ is $\lambda$-Lipschitz on $\mathcal{X}$, i.e. $\alpha = 1$.
k-NN regression: $f_n(x) =$ weighted average of the $Y_i$ over $\text{k-NN}(x)$.
If $\mathcal{X} \subset \mathbb{R}^D$, the learner operates in $\mathbb{R}^D$: no dimensionality reduction, no dimension estimation!

Bias-Variance tradeoff
$\mathbb{E}_{(X_i, Y_i)_1^n} |f_n(x) - f(x)|^2 = \underbrace{\mathbb{E} |f_n(x) - \mathbb{E} f_n(x)|^2}_{\text{Variance}} + \underbrace{|\mathbb{E} f_n(x) - f(x)|^2}_{\text{Bias}^2}.$
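
A small Monte Carlo check (my own construction, not from the talk) of this decomposition for a k-NN estimate at a fixed point $x_0$: the empirical MSE equals the empirical variance plus the squared bias.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(3 * x)
x0, k, n, trials = 0.3, 20, 500, 2000
preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-1, 1, n)                       # fresh sample each trial
    Y = f(X) + 0.2 * rng.standard_normal(n)
    idx = np.argsort(np.abs(X - x0))[:k]            # k nearest neighbors of x0
    preds[t] = Y[idx].mean()
mse = np.mean((preds - f(x0)) ** 2)
var, bias2 = preds.var(), (preds.mean() - f(x0)) ** 2
print(mse, var + bias2)                             # identical up to rounding
```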

General intuition (k-NN): Fix $k$ with $n \ge k \ge \log n$, and let $x$ lie in a region $B$ of dimension $d$. The rate of convergence of $f_n(x)$ depends on:
Variance of $f_n(x)$: $\lesssim 1/k$.
Bias of $f_n(x)$: $\lesssim r_k(x)$, the distance from $x$ to its $k$-th nearest neighbor.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{k} + r_k(x)^2$.
It turns out: $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$.
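
The claim $r_k(x) \approx (k/n)^{1/d}$ is easy to see empirically (a toy check of my own, constants ignored): for data on a $d$-dimensional subspace of $\mathbb{R}^D$, the k-NN radius scales with the intrinsic $d$, not with $D$.

```python
import numpy as np

def knn_radius(x, X, k):
    """Distance from x to its k-th nearest neighbor among the rows of X."""
    return np.sort(np.linalg.norm(X - x, axis=1))[k - 1]

rng = np.random.default_rng(2)
n, d, D = 20000, 2, 50
Z = rng.uniform(-1, 1, size=(n, d))
X = np.hstack([Z, np.zeros((n, D - d))])     # intrinsically 2-d data in R^50
for k in (10, 100, 1000):
    print(k, round(knn_radius(np.zeros(D), X, k), 3),
          round((k / n) ** (1 / d), 3))      # same order of magnitude
```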

Choosing k locally at x: Intuition
Remember: cross-validation or dimension estimation at $x$ are impractical.
Instead: choose $k(x)$ directly from the data, balancing the known variance proxy $1/k$ against the observable bias proxy $r_k(x)^2$ (a sketch follows below).
Main technical hurdle: the intrinsic dimension might vary with $k$.
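
Here is a minimal sketch of one plausible balancing rule for $k(x)$, assuming $\lambda$ is known and $\alpha = 1$. The exact selection rule used in the talk is not spelled out in this transcript, so treat this as illustrative only.

```python
import numpy as np

def choose_k_local(x, X, lam):
    """Increase k until the variance proxy 1/k is no longer larger than the
    bias proxy (lam * r_k(x))^2, where r_k(x) is the k-NN radius at x.
    Sketch of the balancing idea only, not the talk's exact rule."""
    n = len(X)
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    k_min = max(1, int(np.log(n)))
    for k in range(k_min, n + 1):
        if 1.0 / k <= (lam * dists[k - 1]) ** 2:
            return k
    return n
```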

Choosing k(x): Result
Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any ball $B$ centered at $x$ such that $P_X(B) \ge n^{-1/3}$, and suppose $P_X$ is $(C, d)$-homogeneous on $B$. We have
$|f_n(x) - f(x)|^2 \le C\, \lambda^2 \left(\frac{\ln n}{n\, P_X(B)}\right)^{2/(2+d)}.$
As $n \to \infty$, the claim applies to any ball $B$ centered at $x$ with $P_X(B) > 0$.

NEXT: I. Local notions of smoothness and dimension. II. Local adaptivity to dimension: k-nn example. III. Full local adaptivity: kernel example. (Recent work with Vikas Garg)

Main Assumptions: $\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$ of diameter 1. $P_X$ is locally homogeneous with unknown $d(x)$. $f$ is locally Hölder with unknown $\lambda(x)$, $\alpha(x)$.
Kernel regression: $f_n(x) =$ weighted average of the $Y_i$ for $X_i \in B_\rho(x, h)$.

General intuition (kernel): Fix $0 < h < 1$, and let $x$ lie in a region $B$ of dimension $d$. The rate of convergence of $f_n(x)$ depends on:
Variance of $f_n(x)$: $\lesssim 1/n_h(x)$, where $n_h(x)$ is the number of samples in $B(x, h)$.
Bias$^2$ of $f_n(x)$: $\lesssim h^{2\alpha}$.
We have: $|f_n(x) - f(x)|^2 \lesssim \frac{1}{n_h(x)} + h^{2\alpha}$.
It turns out: $n_h(x) \approx n h^d$, where $d = d(B)$.

From the previous intuition
Suppose we know $\alpha(x)$ but not $d(x)$. Monitor $\frac{1}{n_h(x)}$ and $h^{2\alpha}$, and pick $\hat h_d(x)$ where the two balance (see the sketch below). Then
$|f_n(x) - f(x)|^2 \lesssim \text{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
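
A sketch of this balancing step when $\alpha(x)$ is known: pick the bandwidth where the observable variance proxy $1/n_h(x)$ crosses the bias proxy $h^{2\alpha}$. Constants are omitted, and the grid and names are my own.

```python
import numpy as np

def balance_bandwidth(x, X, alpha, grid):
    """Return the smallest h in the (sorted) grid at which the variance proxy
    1/n_h(x) drops below the bias proxy h^(2*alpha)."""
    dists = np.linalg.norm(X - x, axis=1)
    for h in sorted(grid):
        n_h = int(np.sum(dists <= h))
        if n_h > 0 and 1.0 / n_h <= h ** (2 * alpha):
            return h
    return max(grid)
```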

From Lepski
Suppose we know $d(x)$ but not $\alpha(x)$. Intuition: for every $h < h^*$, $\frac{1}{nh^d} > h^{2\alpha}$, therefore for such $h$:
$|f_n(h; x) - f(x)|^2 \lesssim \frac{1}{nh^d} + h^{2\alpha} \le 2 \cdot \frac{1}{nh^d}$.

From Lepski (continued)
Suppose we know $d(x)$ but not $\alpha(x)$. All intervals $\left[f_n(h; x) \pm 2\sqrt{\tfrac{1}{nh^d}}\right]$, $h < h^*$, must intersect!
Picking $\hat h_\alpha(x)$ as the largest such $h$: $|f_n(x) - f(x)|^2 \lesssim \text{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.

Combine Lepski with the previous intuition
We know neither $d$ nor $\alpha$. All intervals $\left[f_n(h; x) \pm 2\sqrt{\tfrac{1}{n_h(x)}}\right]$, $h < \hat h_d$, must intersect!
Picking $\hat h_{\alpha,d}(x)$: $|f_n(x) - f(x)|^2 \lesssim \text{err}(\hat h_d) \lesssim \text{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
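
Here is a rough sketch of the interval-intersection rule: grow $h$ over a grid and keep the largest $h$ for which the confidence intervals $f_n(h'; x) \pm 2/\sqrt{n_{h'}(x)}$, for all $h' \le h$, still share a common point. The constant 2, the grid, and the stopping details are illustrative assumptions, not the talk's precise procedure.

```python
import numpy as np

def kernel_avg(x, X, Y, h):
    """Box-kernel average over B(x, h); also returns n_h(x)."""
    mask = np.linalg.norm(X - x, axis=1) <= h
    return (Y[mask].mean() if mask.any() else Y.mean()), int(mask.sum())

def lepski_bandwidth(x, X, Y, grid):
    """Largest h in the grid such that the intervals
    f_n(h'; x) +/- 2/sqrt(n_{h'}(x)) for all h' <= h still have a nonempty
    common intersection (a sketch of the Lepski-style rule)."""
    lo, hi = -np.inf, np.inf          # running intersection of the intervals
    best = min(grid)
    for h in sorted(grid):            # smallest h first
        est, n_h = kernel_avg(x, X, Y, h)
        if n_h == 0:
            continue
        half = 2.0 / np.sqrt(n_h)
        lo, hi = max(lo, est - half), min(hi, est + half)
        if lo > hi:                   # intervals stopped intersecting
            break
        best = h
    return best

# usage: h_hat = lepski_bandwidth(x0, X, Y, grid=np.geomspace(0.05, 1.0, 20))
```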

Choosing $\hat h_{\alpha,d}(x)$: Result
Tightness assumption on $d(x)$: there exists $r_0$ such that for all $x \in \mathcal{X}$ there are $C, C', d$ with: for all $r \le r_0$, $C r^d \le P_X(B(x, r)) \le C' r^d$.
Theorem: Suppose $\hat h_{\alpha,d}(x)$ is chosen as described, and let $n \ge N(r_0)$. The following holds w.h.p. simultaneously for all $x$. Let $d, \alpha, \lambda$ be the local problem parameters on $B(x, r_0)$. We have
$|f_n(x) - f(x)|^2 \lesssim \lambda^{2d/(2\alpha+d)} \left(\frac{\ln n}{n}\right)^{2\alpha/(2\alpha+d)}.$
The rate is optimal.

Simulation on data with mixed spatial complexity... the approach works, but should be made more efficient!

Future direction: Extend self-tuning to tree-based kernel implementations.

Initial experiments with a tree-based kernel: without cross-validation, automatically detect the interval containing $h^*$.

Future direction: Extend self-tuning to data streaming setting.

Initial streaming experiments: [Figure: normalized RMSE on a test set vs. number of training samples (up to 4 x 10^4); synthetic dataset with d = 5, D = 100, first 1000 samples with d = 1; legend: incremental tree-based, fixed d = 1, fixed d = 5, fixed d = 10.] Robustness to increasing intrinsic dimension.

Future direction: Adaptive error bands. Would lead to more local sampling strategies.

TAKE HOME MESSAGE:
We can adapt to intrinsic $d(x)$ without preprocessing.
Local learners can self-tune optimally to the local $d(x)$ and $\alpha(x)$.
The results extend to plug-in classification!
Many potential future directions!

Thank you!