Local regression, intrinsic dimension, and nonparametric sparsity

Local regression, intrinsic dimension, and nonparametric sparsity. Samory Kpotufe, Toyota Technological Institute - Chicago and Max Planck Institute for Intelligent Systems.

I. Local regression and (local) intrinsic dimension. II. Nonparametric sparsity: improving local regression.

Local Regression. Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$. We will assume $f$ is Lipschitz-continuous: $|f(x) - f(x')| \le \lambda \rho(x, x')$. Learn: $f_n(x) = \text{avg}(Y_i)$ over $X_i \in B(x)$. In particular: $f_{n,k}(x) = \text{kernel avg}(Y_i)$ over the $k$-NN ball $B(x)$ (k-NN or adaptive-bandwidth kernel regression). Quite basic and works well, hence common in practice! New explanation for its success: adaptivity to local problem complexity (local density, e.g. I. Abramson '82; local dimension of $\mathcal{X}$).

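A minimal sketch of such a local regressor in NumPy, assuming a plain (unweighted) average over the k nearest neighbors; the function name knn_regress and the toy data are ours, for illustration only.

```python
import numpy as np

def knn_regress(X, Y, x, k):
    """k-NN regression estimate f_{n,k}(x): average the Y_i whose X_i are
    the k nearest neighbors of the query x (uniform kernel for simplicity)."""
    dists = np.linalg.norm(X - x, axis=1)   # rho(x, X_i), Euclidean here
    nn = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    return Y[nn].mean()

# Toy usage: Y = f(X) + noise with f Lipschitz.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
f = lambda x: np.sin(2 * x[..., 0]) + 0.5 * np.abs(x[..., 1])
Y = f(X) + 0.1 * rng.standard_normal(500)
print(knn_regress(X, Y, np.zeros(3), k=20))
```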

Dimension and regression: curse of dimension. Suppose $\mathcal{X} \subset \mathbb{R}^D$. There exist distributions on $(X, Y)$ such that the excess risk $\|f_{n,k} - f\|_{2,\mu}^2 = \mathbb{E}_X |f_{n,k}(X) - f(X)|^2$ is of the form $n^{-2/(2+D)}$. This is true for all nonparametric regressors! (Stone '82)

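For a sense of scale: at $D = 2$ the rate $n^{-2/(2+D)}$ is $n^{-1/2}$, so halving the error requires about 4 times more data; at $D = 20$ the rate is $n^{-1/11}$, and halving the error requires roughly $2^{11} \approx 2000$ times more data.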

Fortunately, high-dimensional data often has low intrinsic complexity: linear data, manifold data, sparse data. Common approach: dimension reduction, e.g. PCA, manifold learning (LLE, Isomap, Laplacian eigenmaps, kernel PCA, ...).

Main result: k-NN performs well without dimension reduction! $f_{n,k}$ converges at a rate adaptive to the unknown intrinsic dimension. Furthermore, k-NN is locally adaptive: $f_{n,k}(x)$ adapts to the intrinsic dimension locally at $x$. The result suggests that more can be gained by tuning $k$ than by tuning the parameters of your favorite dimension-reduction procedure.

Other work on adaptivity to intrinsic dimension:
- Kernel and local polynomial regression: Bickel and Li 2006; Lafferty and Wasserman (manifold dim.)
- Dyadic tree classification: Scott and Nowak (box dim.)
- 1-NN regression: Kulkarni and Posner (doubling dim.)
- RPtree and dyadic tree regression: Kpotufe and Dasgupta (doubling dim.)
- Tree-kernel hybrids: Kpotufe (doubling dim.)
The above results assume more restrictive notions of dimension.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Intrinsic dimension. [Figure: $d$-dimensional balls centered at $x$.] Volume growth: $\mathrm{vol}(B(x, r)) = C\, r^d = \epsilon^{-d}\, \mathrm{vol}(B(x, \epsilon r))$. Suppose $\mu$ is uniform on $B(x, r)$; then $\mu(B(x, r)) = \epsilon^{-d}\, \mu(B(x, \epsilon r))$. Definition: $\mu$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \le r$ and $\epsilon > 0$, $\mu(B(x, r')) \le C \epsilon^{-d}\, \mu(B(x, \epsilon r'))$.

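This homogeneity condition can be probed empirically: comparing how many sample points fall in $B(x, r)$ versus $B(x, \epsilon r)$ gives a crude local dimension estimate $d \approx \log(\text{count ratio}) / \log(1/\epsilon)$. A minimal sketch follows; the function name local_dim_estimate and the toy setup are ours, not from the talk.

```python
import numpy as np

def local_dim_estimate(X, x, r, eps=0.5):
    """Crude local dimension at x: if mu(B(x, r)) ~ eps^(-d) * mu(B(x, eps*r)),
    then d ~ log(n_r / n_eps_r) / log(1/eps), using empirical ball counts."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r = np.sum(dists <= r)
    n_eps_r = np.sum(dists <= eps * r)
    if n_eps_r == 0 or n_r == n_eps_r:
        return np.nan                      # not enough resolution at this scale
    return np.log(n_r / n_eps_r) / np.log(1.0 / eps)

# Data lying on a 2-d subspace of R^10: the estimate should hover near 2.
rng = np.random.default_rng(0)
X = np.zeros((5000, 10))
X[:, :2] = rng.uniform(-1, 1, size=(5000, 2))
print(local_dim_estimate(X, X[0], r=0.5))
```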

Given a query $x$, the behavior of $\mu$ in a neighborhood $B$ of $x$ can capture the intrinsic dimension in $B$. The location of the query $x$ matters! The size of the neighborhood $B$ matters! For k-NN, $B$ is a region centered at $x$ of mass about $k/n$.

The behavior of $\mu$ can capture the intrinsic dimension locally. For k-NN, locality will depend on $n$ and $k$. (Examples: linear data, manifold data, sparse data.) Suppose $\mu = \sum_i \omega_i \mu_i$, where each $\mu_i$ is $(C_i, d_i)$-homogeneous on $B(x)$; then $\mu$ is $(C, d)$-homogeneous on $B(x)$ for some $C$ and $d \le \max_i d_i$. E.g. $\mathcal{X}$ is a collection of subspaces of various dimensions.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Adaptivity for k - general intuition: Fix $n \ge k \ge \log n$, and let $x$ belong to a region $B$ of dimension $d$. The rate of convergence of $f_{n,k}(x)$ depends on: (Variance of $f_{n,k}(x)$) $\approx 1/k$; (Bias of $f_{n,k}(x)$) $\approx r_k(x)$, the distance from $x$ to its $k$-th nearest neighbor. It turns out that $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$. Also, $r_k(x)$ depends on $\mu(B)$ (it is smaller in dense regions $B$).

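A small numerical check of the claim $r_k(x) \approx (k/n)^{1/d}$: we compute the k-NN radius at the origin for data of ambient dimension 10 but intrinsic dimension $d \in \{2, 5\}$. The setup (uniform data on a $d$-dimensional coordinate subspace) is ours, for illustration only; the radius tracks $(k/n)^{1/d}$ up to constants.

```python
import numpy as np

def knn_radius(X, x, k):
    """Distance r_k(x) from x to its k-th nearest neighbor among the rows of X."""
    dists = np.linalg.norm(X - x, axis=1)
    return np.sort(dists)[k - 1]

rng = np.random.default_rng(0)
n, D, k = 20000, 10, 100
for d in (2, 5):                               # intrinsic dimension of the data
    X = np.zeros((n, D))
    X[:, :d] = rng.uniform(-1, 1, size=(n, d)) # data lives on a d-dim subspace
    print(d, knn_radius(X, np.zeros(D), k), (k / n) ** (1.0 / d))
```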

Adaptivity for k - result: Theorem: The following holds w.h.p. simultaneously for all $x \in \mathcal{X}$ and $\log n \le k \le n$. Consider any $B(x, r)$ centered at $x$, s.t. $\mu(B(x, r)) \ge k/n$. Suppose $\mu$ is $(C, d)$-homogeneous on $B$. We have $|f_{n,k}(x) - f(x)|^2 \lesssim \frac{\sigma_Y^2 + t_Y^2}{k} + \lambda^2 r^2 \left(\frac{Ck}{n\,\mu(B)}\right)^{2/d}$. The rate is best if $x$ is in a dense region $B$ with low dimension $d$.

Proof idea: $|f_{n,k}(x) - f(x)| \le |f_{n,k}(x) - \mathbb{E} f_{n,k}(x)| + |\mathbb{E} f_{n,k}(x) - f(x)|$. Uniform variance bound over $\mathcal{X}$: fix $\mathbf{X}$ and $x$; then $|f_{n,k}(x) - \mathbb{E} f_{n,k}(x)|^2 \approx \mathrm{Var}_{Y|\mathbf{X}} f_{n,k}(x) \approx 1/k$. Fixing $\mathbf{X}$, there are at most $n^{\text{VC of balls}}$ possible values of $f_{n,k}(x)$. Uniform bias bound over $\mathcal{X}$: in $d$ dimensions, $\mu(B(x, r)) \approx r^d \Rightarrow \mathbb{E}\, r_{k,n}(x) \approx (k/n)^{1/d}$. Here $d$ changes with $r$, but a similar idea holds. For all $B$, $\mu_n(B) \approx \mu(B)$, so that for all $x$, $r_{k,n}(x) \approx \mathbb{E}\, r_{k,n}(x)$.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Choosing k(x) - best possible rate in terms of d. Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \le \epsilon^{-d}\, \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $\mathcal{D}_{X,Y}$, where $\mathcal{D}_X = \mu$ and $f(x) \equiv \mathbb{E}[Y \mid x]$ is $\lambda$-Lipschitz, such that $\mathbb{E}_{\mathcal{D}_{X,Y}^n} \|f_n - f\|_{2,\mu}^2 \gtrsim \lambda^{2d/(2+d)}\, n^{-2/(2+d)}$.

Choosing k(x) - best possible rate in terms of d. Intuition: we want to create a family $\mathcal{F}$ of $\lambda$-Lipschitz functions $f$ which vary a lot! The amount of variation we can impose will depend on $d$.

Choosing k locally at x - intuition. Note: cross-validation and dimension estimation require large sample sizes, which are unlikely to be available in small neighborhoods of $x$. Instead: choose $k(x)$ directly at $x$, balancing the variance term ($\approx 1/k$) against the bias term ($\approx r_k(x)$). Main technical hurdle: the intrinsic dimension might vary with $k$.

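A minimal sketch of the balancing idea only, not the exact selection rule from the talk: pick the smallest $k$ at which an assumed variance surrogate $\sigma^2/k$ drops below an assumed bias surrogate $\lambda^2 r_k(x)^2$; the constants sigma2 and lam2 and the function name choose_k_local are ours.

```python
import numpy as np

def choose_k_local(X, x, sigma2=1.0, lam2=1.0, k_min=20):
    """Pick k at the query x by balancing a variance surrogate sigma2/k against
    a bias surrogate lam2 * r_k(x)^2, where r_k(x) is the k-NN radius at x."""
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    for k in range(k_min, len(dists) + 1):
        if sigma2 / k <= lam2 * dists[k - 1] ** 2:  # variance no longer dominates
            return k
    return len(dists)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))
print(choose_k_local(X, np.zeros(2)))
```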

Choosing k(x) - result. Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any $B$ centered at $x$, s.t. $\mu(B) \ge n^{-1/3}$. Suppose $\mu$ is $(C, d)$-homogeneous on $B$. We have $|f_{n,k(x)}(x) - f(x)|^2 \lesssim \lambda^2 \left(\frac{C}{n\,\mu(B)}\right)^{2/(2+d)}$. As $n \to \infty$ the claim applies to any $B$ centered at $x$ with $\mu(B) > 0$.

Proof idea: Suppose $d$ is fixed over $\mathcal{X}$. Then $k^* \approx n^{2/(2+d)}$, so show that $\mathrm{rate}(k(x)) \approx \mathrm{rate}(k^*) \approx n^{-2/(2+d)}$. Since $d$ varies with $r$, apply the argument over a sufficiently large ball so as to include all $k^*$ neighbors for any $d \ge 1$.

Results likely extend to:
- Higher-order polynomial regression/classification using k-NN.
- Full adaptivity: include function smoothness.
- Local choice of bandwidth in kernel regression.

Take-home message: k-NN regression performs well without dimension reduction! Furthermore, it is possible to choose $k$ at $x$ so as to achieve good performance in terms of the local dimension. Question: is there a general principle for designing adaptive learners? We've assumed so far that nature (or an expert) provided the right metric $(\mathcal{X}, \rho)$!

II. Nonparametric sparsity: improving local regression. Based on joint work with Abdeslam Boularias.

Sparse f: Parametric sparsity: $f(x) = \sum_{i \le d} w_i f_i(x)$, most $w_i = 0$. Nonparametric sparsity: $f(x) = \sum_i w_i f_i(x)$, most $w_i = 0$; or $f(x) = g(\mathrm{projection}(x))$ onto $R$ relevant variables, $R \subset [d]$. Here: for every $f'_i$ (derivative of $f$ along coordinate $i$), let $\|f'_i\|_{1,\mu} \equiv \mathbb{E}_X |f'_i(X)|$. The set $\{\|f'_i\|_{1,\mu}\}_{i \le d}$ is approximately sparse: $f$ could depend on all features of $X$, but not equally on all!

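To make the quantity $\|f'_i\|_{1,\mu}$ concrete, here is a small Monte-Carlo sketch for a hand-picked $f$ that depends on all coordinates, but with rapidly decaying strength; the function and constants are ours, purely for illustration.

```python
import numpy as np

# f depends on all 5 coordinates, but with rapidly decaying strength.
coef = np.array([1.0, 0.5, 0.1, 0.01, 0.001])
f = lambda X: np.sin(X @ coef)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100000, 5))
# df/dx_i = coef_i * cos(<coef, x>), so E|f'_i| = coef_i * E|cos(<coef, X>)|.
grad_norms = np.abs(coef[None, :] * np.cos(X @ coef)[:, None]).mean(axis=0)
print(grad_norms)   # approximately sparse: the first coordinates dominate
```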

A simple idea: gradient weighting. Reweight $x_i \mapsto W_i\, x_i$, where $W_i \approx \|f'_i\|_{1,\mu}$. Equivalently, replace the Euclidean distance with $\rho(x, x') = \sqrt{(x - x')^\top W (x - x')}$. Similar to metric learning, but cheaper: we estimate a single $\rho$! Now run any local regressor $f_n$ on $(\mathcal{X}, \rho)$.

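A minimal sketch of this preprocessing in NumPy, assuming the weights $W_i$ are already available (e.g. from the finite-difference estimate a few slides below): rescale coordinates so that Euclidean distance on the rescaled data equals $\rho(x, x')^2 = \sum_i W_i (x_i - x'_i)^2$, then run the same k-NN regressor unchanged. The square-root rescaling, the stand-in weights, and the function names are ours.

```python
import numpy as np

def gradient_weight_rescale(X, W):
    """Rescale coordinates so that Euclidean distance on the output equals
    rho(x, x')^2 = sum_i W_i (x_i - x'_i)^2 on the original coordinates."""
    return X * np.sqrt(W)[None, :]

def knn_regress(X, Y, x, k):
    dists = np.linalg.norm(X - x, axis=1)
    return Y[np.argsort(dists)[:k]].mean()

# Toy data: only the first 2 of 10 coordinates matter.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 10))
Y = np.sin(2 * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(2000)
W = np.array([1.0, 1.0] + [1e-3] * 8)   # stand-in gradient weights
Xw, x0 = gradient_weight_rescale(X, W), np.zeros(10)
print(knn_regress(X, Y, x0, k=50), knn_regress(Xw, Y, np.sqrt(W) * x0, k=50))
```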

Intuition: suppose $\|f'_2\|_{1,\mu} \ll \|f'_1\|_{1,\mu}$. Lower $\mathrm{Var}(f_n(x))$: suppose $\|f'_i\|_{1,\mu}$ is large only for $i \in R \subset [d]$. Balls $B_\rho$ contain more points relative to Euclidean balls; formally, $\mu(B_\rho(x, \epsilon\, \rho(\mathcal{X}))) \gtrsim \epsilon^{|R|}$ instead of $\epsilon^d$. Bias of $f_n(x)$ is relatively unaffected: no sample is too far from $x$ in any relevant direction. Formally, $f$ has the following Lipschitz property: $|f(x) - f(x')| \le \left(\sum_{i \in R} \frac{\sup_z |f'_i(z)|}{\sqrt{W_i}}\right) \rho(x, x')$.

Efficient estimation of $\|f'_i\|_{1,\mu}$: don't estimate $f'_i$ itself. $W_i \equiv \mathbb{E}_n \left[\frac{|f_{n,h}(X + t e_i) - f_{n,h}(X - t e_i)|}{2t}\, \mathbf{1}\{A_{n,i}(X)\}\right]$, where $A_{n,i}(X) \equiv$ we are confident in both estimates $f_{n,h}(X \pm t e_i)$. Fast preprocessing, and online: just 2 estimates of $f_{n,h}$ at each $X$. Metric learning optimizes over a space of possible metrics. Only a small parameter search grid is needed to adapt to the $d$ dimensions; KR with bandwidths $\{h_i\}_1^d$ needs a search grid over $d$ parameters. General: preprocessing for any distance-based regressor. Other methods apply to particular regressors, e.g. Rodeo for local linear regression, metric learning for KR.

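A rough sketch of this finite-difference estimator: average $|f_{n,h}(X + t e_i) - f_{n,h}(X - t e_i)| / (2t)$ over sample points, here using a simple k-NN pilot estimate in place of the kernel estimate $f_{n,h}$ and omitting the confidence indicator $\mathbf{1}\{A_{n,i}(X)\}$ for brevity; all names and constants are ours.

```python
import numpy as np

def knn_estimate(X, Y, q, k):
    """Pilot k-NN regression estimate at the query q."""
    d = np.linalg.norm(X - q, axis=1)
    return Y[np.argsort(d)[:k]].mean()

def gradient_weights(X, Y, t=0.1, k=30, n_queries=200):
    """W_i ~ E_n |f_n(X + t e_i) - f_n(X - t e_i)| / (2t), using a k-NN pilot
    estimate and a subsample of query points for speed."""
    d = X.shape[1]
    W = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = t
        W[i] = np.mean([abs(knn_estimate(X, Y, x + e, k) - knn_estimate(X, Y, x - e, k)) / (2 * t)
                        for x in X[:n_queries]])
    return W

# Toy check: Y depends strongly on coordinate 0, weakly on 1, not at all on the rest.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))
Y = np.sin(2 * X[:, 0]) + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(2000)
print(gradient_weights(X, Y))
```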

Practicality of the approach: on many real-world datasets, $\|f'_i\|_{1,\mu}$ varies enough! [Figure: estimated gradient weights on SARCOS robot joint 7, Parkinson's, Telecom, Ailerons.]

$W_i$ consistently estimates $\|f'_i\|_{1,\mu}$. Theorem: Under general regularity conditions on $\mu$ and smoothness conditions on $f$, we have with high probability, for all $i \in [d]$: $\left|W_i - \|f'_i\|_{1,\mu}\right| \lesssim \frac{A(n)}{t}\left(\frac{1}{\sqrt{n h^d}} + h\, \sup_i \|f'_i\|_\infty\right) + 2 \sup_i \|f'_i\|_\infty \left(\sqrt{\frac{\ln(2d/\delta)}{n}} + \mu(\partial_{t,i}(\mathcal{X}))\right) + \epsilon_{t,i}$.

Results on real-world datasets, kernel regression.

             Barrett joint 1   Barrett joint 5   SARCOS joint 1   SARCOS joint 5   Housing
KR error     0.50 ±            ±                 ±                ±                ±0.08
KR-ρ error   0.38 ±            ±                 ±                ±                ±0.06

             Concrete Strength   Wine Quality   Telecom   Ailerons   Parkinson's
KR error     0.42 ±              ±              ±         ±          ±0.03
KR-ρ error   0.37 ±              ±              ±         ±          ±0.03

Results on real-world datasets, k-NN regression.

               Barrett joint 1   Barrett joint 5   SARCOS joint 1   SARCOS joint 5   Housing
k-NN error     0.41 ±            ±                 ±                ±                ±0.09
k-NN-ρ error   0.29 ±            ±                 ±                ±                ±0.06

               Concrete Strength   Wine Quality   Telecom   Ailerons   Parkinson's
k-NN error     0.40 ±              ±              ±         ±          ±0.01
k-NN-ρ error   0.38 ±              ±              ±         ±          ±0.01

MSE vs. training size, kernel regression. [Figure: SARCOS joint 7 with KR; Ailerons with KR; Telecom with KR.]

MSE vs. training size, k-NN regression. [Figure: SARCOS joint 7 with k-NN; Ailerons with k-NN; Telecom with k-NN.]

Take-home message: gradient weights help local regressors! So simple that there should be much room for improvement! Thanks!
