Local regression, intrinsic dimension, and nonparametric sparsity

Local regression, intrinsic dimension, and nonparametric sparsity. Samory Kpotufe, Toyota Technological Institute - Chicago and Max Planck Institute for Intelligent Systems.

I. Local regression and (local) intrinsic dimension. II. Nonparametric sparsity: improving local regression.

Local Regression. Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$. We will assume $f$ is Lipschitz-continuous: $|f(x) - f(x')| \le \lambda \rho(x, x')$. Learn: $f_n(x) = \text{avg}(Y_i)$ over $X_i \in B(x)$. In particular: $f_{n,k}(x) = \text{kernel avg}(Y_i)$ over the $k$-NN ball $B(x)$ (k-NN or adaptive-bandwidth kernel regression). Quite basic and works well, hence common in practice! New explanation for its success: adaptivity to local problem complexity (local density, e.g. I. Abramson '82; local dimension of $\mathcal{X}$).

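A minimal sketch of such a local regressor in NumPy, assuming a plain (unweighted) average over the k nearest neighbors; the function name knn_regress and the toy data are ours, for illustration only.

```python
import numpy as np

def knn_regress(X, Y, x, k):
    """k-NN regression estimate f_{n,k}(x): average the Y_i whose X_i are
    the k nearest neighbors of the query x (uniform kernel for simplicity)."""
    dists = np.linalg.norm(X - x, axis=1)   # rho(x, X_i), Euclidean here
    nn = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    return Y[nn].mean()

# Toy usage: Y = f(X) + noise with f Lipschitz.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 3))
f = lambda x: np.sin(2 * x[..., 0]) + 0.5 * np.abs(x[..., 1])
Y = f(X) + 0.1 * rng.standard_normal(500)
print(knn_regress(X, Y, np.zeros(3), k=20))
```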

Dimension and regression: curse of dimension. Suppose $\mathcal{X} \subset \mathbb{R}^D$. There exist distributions on $(X, Y)$ such that the excess risk $\|f_{n,k} - f\|_{2,\mu}^2 = \mathbb{E}_X |f_{n,k}(X) - f(X)|^2$ is of the form $n^{-2/(2+D)}$. This is true for all nonparametric regressors! (Stone '82)

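For a sense of scale: at $D = 2$ the rate $n^{-2/(2+D)}$ is $n^{-1/2}$, so halving the error requires about 4 times more data; at $D = 20$ the rate is $n^{-1/11}$, and halving the error requires roughly $2^{11} \approx 2000$ times more data.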

Fortunately, high-dimensional data often has low intrinsic complexity: linear data, manifold data, sparse data. Common approach: dimension reduction, e.g. PCA, manifold learning (LLE, Isomap, Laplacian eigenmaps, kernel PCA, ...).

Main result: k-NN performs well without dimension reduction! $f_{n,k}$ converges at a rate adaptive to the unknown intrinsic dimension. Furthermore, k-NN is locally adaptive: $f_{n,k}(x)$ adapts to the intrinsic dimension locally at $x$. The result suggests that more can be gained by tuning $k$ than by tuning the parameters of your favorite dimension-reduction procedure.

Other work on adaptivity to intrinsic dimension:
- Kernel and local polynomial regression: Bickel and Li 2006; Lafferty and Wasserman (manifold dim.)
- Dyadic tree classification: Scott and Nowak (box dim.)
- 1-NN regression: Kulkarni and Posner (doubling dim.)
- RPtree and dyadic tree regression: Kpotufe and Dasgupta (doubling dim.)
- Tree-kernel hybrids: Kpotufe (doubling dim.)
The above results assume more restrictive notions of dimension.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Intrinsic dimension. [Figure: $d$-dimensional balls centered at $x$.] Volume growth: $\mathrm{vol}(B(x, r)) = C\, r^d = \epsilon^{-d}\, \mathrm{vol}(B(x, \epsilon r))$. Suppose $\mu$ is uniform on $B(x, r)$; then $\mu(B(x, r)) = \epsilon^{-d}\, \mu(B(x, \epsilon r))$. Definition: $\mu$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \le r$ and $\epsilon > 0$, $\mu(B(x, r')) \le C \epsilon^{-d}\, \mu(B(x, \epsilon r'))$.

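This homogeneity condition can be probed empirically: comparing how many sample points fall in $B(x, r)$ versus $B(x, \epsilon r)$ gives a crude local dimension estimate $d \approx \log(\text{count ratio}) / \log(1/\epsilon)$. A minimal sketch follows; the function name local_dim_estimate and the toy setup are ours, not from the talk.

```python
import numpy as np

def local_dim_estimate(X, x, r, eps=0.5):
    """Crude local dimension at x: if mu(B(x, r)) ~ eps^(-d) * mu(B(x, eps*r)),
    then d ~ log(n_r / n_eps_r) / log(1/eps), using empirical ball counts."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r = np.sum(dists <= r)
    n_eps_r = np.sum(dists <= eps * r)
    if n_eps_r == 0 or n_r == n_eps_r:
        return np.nan                      # not enough resolution at this scale
    return np.log(n_r / n_eps_r) / np.log(1.0 / eps)

# Data lying on a 2-d subspace of R^10: the estimate should hover near 2.
rng = np.random.default_rng(0)
X = np.zeros((5000, 10))
X[:, :2] = rng.uniform(-1, 1, size=(5000, 2))
print(local_dim_estimate(X, X[0], r=0.5))
```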

Given a query $x$, the behavior of $\mu$ in a neighborhood $B$ of $x$ can capture the intrinsic dimension in $B$. The location of the query $x$ matters! The size of the neighborhood $B$ matters! For k-NN, $B$ is a region centered at $x$ of mass about $k/n$.

The behavior of $\mu$ can capture the intrinsic dimension locally. For k-NN, locality will depend on $n$ and $k$. (Examples: linear data, manifold data, sparse data.) Suppose $\mu = \sum_i \omega_i \mu_i$, where each $\mu_i$ is $(C_i, d_i)$-homogeneous on $B(x)$; then $\mu$ is $(C, d)$-homogeneous on $B(x)$ for some $C$ and $d \le \max_i d_i$. E.g. $\mathcal{X}$ is a collection of subspaces of various dimensions.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Adaptivity for k - general intuition: Fix $n \ge k \ge \log n$, and let $x$ belong to a region $B$ of dimension $d$. The rate of convergence of $f_{n,k}(x)$ depends on: (Variance of $f_{n,k}(x)$) $\approx 1/k$; (Bias of $f_{n,k}(x)$) $\approx r_k(x)$, the distance from $x$ to its $k$-th nearest neighbor. It turns out that $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$. Also, $r_k(x)$ depends on $\mu(B)$ (it is smaller in dense regions $B$).

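A small numerical check of the claim $r_k(x) \approx (k/n)^{1/d}$: we compute the k-NN radius at the origin for data of ambient dimension 10 but intrinsic dimension $d \in \{2, 5\}$. The setup (uniform data on a $d$-dimensional coordinate subspace) is ours, for illustration only; the radius tracks $(k/n)^{1/d}$ up to constants.

```python
import numpy as np

def knn_radius(X, x, k):
    """Distance r_k(x) from x to its k-th nearest neighbor among the rows of X."""
    dists = np.linalg.norm(X - x, axis=1)
    return np.sort(dists)[k - 1]

rng = np.random.default_rng(0)
n, D, k = 20000, 10, 100
for d in (2, 5):                               # intrinsic dimension of the data
    X = np.zeros((n, D))
    X[:, :d] = rng.uniform(-1, 1, size=(n, d)) # data lives on a d-dim subspace
    print(d, knn_radius(X, np.zeros(D), k), (k / n) ** (1.0 / d))
```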

Adaptivity for k - result: Theorem: The following holds w.h.p. simultaneously for all $x \in \mathcal{X}$ and $\log n \le k \le n$. Consider any $B(x, r)$ centered at $x$, s.t. $\mu(B(x, r)) \ge k/n$. Suppose $\mu$ is $(C, d)$-homogeneous on $B$. We have $|f_{n,k}(x) - f(x)|^2 \lesssim \frac{\sigma_Y^2 + t_Y^2}{k} + \lambda^2 r^2 \left(\frac{Ck}{n\,\mu(B)}\right)^{2/d}$. The rate is best if $x$ is in a dense region $B$ with low dimension $d$.

Proof idea: $|f_{n,k}(x) - f(x)| \le |f_{n,k}(x) - \mathbb{E} f_{n,k}(x)| + |\mathbb{E} f_{n,k}(x) - f(x)|$. Uniform variance bound over $\mathcal{X}$: fix $\mathbf{X}$ and $x$; then $|f_{n,k}(x) - \mathbb{E} f_{n,k}(x)|^2 \approx \mathrm{Var}_{Y|\mathbf{X}} f_{n,k}(x) \approx 1/k$. Fixing $\mathbf{X}$, there are at most $n^{\text{VC of balls}}$ possible values of $f_{n,k}(x)$. Uniform bias bound over $\mathcal{X}$: in $d$ dimensions, $\mu(B(x, r)) \approx r^d \Rightarrow \mathbb{E}\, r_{k,n}(x) \approx (k/n)^{1/d}$. Here $d$ changes with $r$, but a similar idea holds. For all $B$, $\mu_n(B) \approx \mu(B)$, so that for all $x$, $r_{k,n}(x) \approx \mathbb{E}\, r_{k,n}(x)$.

Outline: Intrinsic dimension. Adaptivity for any $\log n \le k \le n$. Choosing a good $k = k(x)$.

Choosing k(x) - best possible rate in terms of d. Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \le \epsilon^{-d}\, \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $\mathcal{D}_{X,Y}$, where $\mathcal{D}_X = \mu$ and $f(x) \equiv \mathbb{E}[Y \mid x]$ is $\lambda$-Lipschitz, such that $\mathbb{E}_{\mathcal{D}_{X,Y}^n} \|f_n - f\|_{2,\mu}^2 \gtrsim \lambda^{2d/(2+d)}\, n^{-2/(2+d)}$.

Choosing k(x) - best possible rate in terms of d. Intuition: we want to create a family $\mathcal{F}$ of $\lambda$-Lipschitz functions $f$ which vary a lot! The amount of variation we can impose will depend on $d$.

Choosing k locally at x - intuition. Note: cross-validation and dimension estimation require large sample sizes, which are unlikely to be available in small neighborhoods of $x$. Instead: choose $k(x)$ directly at $x$, balancing the variance term ($\approx 1/k$) against the bias term ($\approx r_k(x)$). Main technical hurdle: the intrinsic dimension might vary with $k$.

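A minimal sketch of the balancing idea only, not the exact selection rule from the talk: pick the smallest $k$ at which an assumed variance surrogate $\sigma^2/k$ drops below an assumed bias surrogate $\lambda^2 r_k(x)^2$; the constants sigma2 and lam2 and the function name choose_k_local are ours.

```python
import numpy as np

def choose_k_local(X, x, sigma2=1.0, lam2=1.0, k_min=20):
    """Pick k at the query x by balancing a variance surrogate sigma2/k against
    a bias surrogate lam2 * r_k(x)^2, where r_k(x) is the k-NN radius at x."""
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    for k in range(k_min, len(dists) + 1):
        if sigma2 / k <= lam2 * dists[k - 1] ** 2:  # variance no longer dominates
            return k
    return len(dists)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(5000, 2))
print(choose_k_local(X, np.zeros(2)))
```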

Choosing k(x) - result. Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any $B$ centered at $x$, s.t. $\mu(B) \ge n^{-1/3}$. Suppose $\mu$ is $(C, d)$-homogeneous on $B$. We have $|f_{n,k(x)}(x) - f(x)|^2 \lesssim \lambda^2 \left(\frac{C}{n\,\mu(B)}\right)^{2/(2+d)}$. As $n \to \infty$ the claim applies to any $B$ centered at $x$ with $\mu(B) > 0$.

Proof idea: Suppose $d$ is fixed over $\mathcal{X}$. Then $k^* \approx n^{2/(2+d)}$, so show that $\mathrm{rate}(k(x)) \approx \mathrm{rate}(k^*) \approx n^{-2/(2+d)}$. Since $d$ varies with $r$, apply the argument over a sufficiently large ball so as to include all $k^*$ neighbors for any $d \ge 1$.

Results likely extend to:
- Higher-order polynomial regression/classification using k-NN.
- Full adaptivity: include function smoothness.
- Local choice of bandwidth in kernel regression.

Take-home message: k-NN regression performs well without dimension reduction! Furthermore, it is possible to choose $k$ at $x$ so as to achieve good performance in terms of the local dimension. Question: is there a general principle for designing adaptive learners? We've assumed so far that nature (or an expert) provided the right metric $(\mathcal{X}, \rho)$!

II. Nonparametric sparsity: improving local regression. Based on joint work with Abdeslam Boularias.

Sparse f: Parametric sparsity: $f(x) = \sum_{i \le d} w_i f_i(x)$, most $w_i = 0$. Nonparametric sparsity: $f(x) = \sum_i w_i f_i(x)$, most $w_i = 0$; or $f(x) = g(\mathrm{projection}(x))$ onto $R$ relevant variables, $R \subset [d]$. Here: for every $f'_i$ (derivative of $f$ along coordinate $i$), let $\|f'_i\|_{1,\mu} \equiv \mathbb{E}_X |f'_i(X)|$. The set $\{\|f'_i\|_{1,\mu}\}_{i \le d}$ is approximately sparse: $f$ could depend on all features of $X$, but not equally on all!

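To make the quantity $\|f'_i\|_{1,\mu}$ concrete, here is a small Monte-Carlo sketch for a hand-picked $f$ that depends on all coordinates, but with rapidly decaying strength; the function and constants are ours, purely for illustration.

```python
import numpy as np

# f depends on all 5 coordinates, but with rapidly decaying strength.
coef = np.array([1.0, 0.5, 0.1, 0.01, 0.001])
f = lambda X: np.sin(X @ coef)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100000, 5))
# df/dx_i = coef_i * cos(<coef, x>), so E|f'_i| = coef_i * E|cos(<coef, X>)|.
grad_norms = np.abs(coef[None, :] * np.cos(X @ coef)[:, None]).mean(axis=0)
print(grad_norms)   # approximately sparse: the first coordinates dominate
```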

A simple idea: gradient weighting. Reweight $x_i \mapsto W_i\, x_i$, where $W_i \approx \|f'_i\|_{1,\mu}$. Equivalently, replace the Euclidean distance with $\rho(x, x') = \sqrt{(x - x')^\top W (x - x')}$. Similar to metric learning, but cheaper: we estimate a single $\rho$! Now run any local regressor $f_n$ on $(\mathcal{X}, \rho)$.

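A minimal sketch of this preprocessing in NumPy, assuming the weights $W_i$ are already available (e.g. from the finite-difference estimate a few slides below): rescale coordinates so that Euclidean distance on the rescaled data equals $\rho(x, x')^2 = \sum_i W_i (x_i - x'_i)^2$, then run the same k-NN regressor unchanged. The square-root rescaling, the stand-in weights, and the function names are ours.

```python
import numpy as np

def gradient_weight_rescale(X, W):
    """Rescale coordinates so that Euclidean distance on the output equals
    rho(x, x')^2 = sum_i W_i (x_i - x'_i)^2 on the original coordinates."""
    return X * np.sqrt(W)[None, :]

def knn_regress(X, Y, x, k):
    dists = np.linalg.norm(X - x, axis=1)
    return Y[np.argsort(dists)[:k]].mean()

# Toy data: only the first 2 of 10 coordinates matter.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 10))
Y = np.sin(2 * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(2000)
W = np.array([1.0, 1.0] + [1e-3] * 8)   # stand-in gradient weights
Xw, x0 = gradient_weight_rescale(X, W), np.zeros(10)
print(knn_regress(X, Y, x0, k=50), knn_regress(Xw, Y, np.sqrt(W) * x0, k=50))
```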

Intuition: suppose $\|f'_2\|_{1,\mu} \ll \|f'_1\|_{1,\mu}$. Lower $\mathrm{Var}(f_n(x))$: suppose $\|f'_i\|_{1,\mu}$ is large only for $i \in R \subset [d]$. Balls $B_\rho$ contain more points relative to Euclidean balls; formally, $\mu(B_\rho(x, \epsilon\, \rho(\mathcal{X}))) \gtrsim \epsilon^{|R|}$ instead of $\epsilon^d$. Bias of $f_n(x)$ is relatively unaffected: no sample is too far from $x$ in any relevant direction. Formally, $f$ has the following Lipschitz property: $|f(x) - f(x')| \le \left(\sum_{i \in R} \frac{\sup_z |f'_i(z)|}{\sqrt{W_i}}\right) \rho(x, x')$.

Efficient estimation of $\|f'_i\|_{1,\mu}$: don't estimate $f'_i$ itself. $W_i \equiv \mathbb{E}_n \left[\frac{|f_{n,h}(X + t e_i) - f_{n,h}(X - t e_i)|}{2t}\, \mathbf{1}\{A_{n,i}(X)\}\right]$, where $A_{n,i}(X) \equiv$ we are confident in both estimates $f_{n,h}(X \pm t e_i)$. Fast preprocessing, and online: just 2 estimates of $f_{n,h}$ at each $X$. Metric learning optimizes over a space of possible metrics. Only a small parameter search grid is needed to adapt to the $d$ dimensions; KR with bandwidths $\{h_i\}_1^d$ needs a search grid over $d$ parameters. General: preprocessing for any distance-based regressor. Other methods apply to particular regressors, e.g. Rodeo for local linear regression, metric learning for KR.

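A rough sketch of this finite-difference estimator: average $|f_{n,h}(X + t e_i) - f_{n,h}(X - t e_i)| / (2t)$ over sample points, here using a simple k-NN pilot estimate in place of the kernel estimate $f_{n,h}$ and omitting the confidence indicator $\mathbf{1}\{A_{n,i}(X)\}$ for brevity; all names and constants are ours.

```python
import numpy as np

def knn_estimate(X, Y, q, k):
    """Pilot k-NN regression estimate at the query q."""
    d = np.linalg.norm(X - q, axis=1)
    return Y[np.argsort(d)[:k]].mean()

def gradient_weights(X, Y, t=0.1, k=30, n_queries=200):
    """W_i ~ E_n |f_n(X + t e_i) - f_n(X - t e_i)| / (2t), using a k-NN pilot
    estimate and a subsample of query points for speed."""
    d = X.shape[1]
    W = np.zeros(d)
    for i in range(d):
        e = np.zeros(d); e[i] = t
        W[i] = np.mean([abs(knn_estimate(X, Y, x + e, k) - knn_estimate(X, Y, x - e, k)) / (2 * t)
                        for x in X[:n_queries]])
    return W

# Toy check: Y depends strongly on coordinate 0, weakly on 1, not at all on the rest.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 5))
Y = np.sin(2 * X[:, 0]) + 0.3 * X[:, 1] + 0.05 * rng.standard_normal(2000)
print(gradient_weights(X, Y))
```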

Practicality of the approach: on many real-world datasets, $\|f'_i\|_{1,\mu}$ varies enough! [Figure: estimated gradient weights on SARCOS robot joint 7, Parkinson's, Telecom, Ailerons.]

$W_i$ consistently estimates $\|f'_i\|_{1,\mu}$. Theorem: Under general regularity conditions on $\mu$ and smoothness conditions on $f$, we have with high probability, for all $i \in [d]$: $\left|W_i - \|f'_i\|_{1,\mu}\right| \lesssim \frac{A(n)}{t}\left(\frac{1}{\sqrt{n h^d}} + h\, \sup_i \|f'_i\|_\infty\right) + 2 \sup_i \|f'_i\|_\infty \left(\sqrt{\frac{\ln(2d/\delta)}{n}} + \mu(\partial_{t,i}(\mathcal{X}))\right) + \epsilon_{t,i}$.

Results on real-world datasets, kernel regression.

             Barrett joint 1   Barrett joint 5   SARCOS joint 1   SARCOS joint 5   Housing
KR error     0.50 ±            ±                 ±                ±                ±0.08
KR-ρ error   0.38 ±            ±                 ±                ±                ±0.06

             Concrete Strength   Wine Quality   Telecom   Ailerons   Parkinson's
KR error     0.42 ±              ±              ±         ±          ±0.03
KR-ρ error   0.37 ±              ±              ±         ±          ±0.03

Results on real-world datasets, k-NN regression.

               Barrett joint 1   Barrett joint 5   SARCOS joint 1   SARCOS joint 5   Housing
k-NN error     0.41 ±            ±                 ±                ±                ±0.09
k-NN-ρ error   0.29 ±            ±                 ±                ±                ±0.06

               Concrete Strength   Wine Quality   Telecom   Ailerons   Parkinson's
k-NN error     0.40 ±              ±              ±         ±          ±0.01
k-NN-ρ error   0.38 ±              ±              ±         ±          ±0.01

MSE vs. training size, kernel regression. [Figure: SARCOS joint 7 with KR; Ailerons with KR; Telecom with KR.]

MSE vs. training size, k-NN regression. [Figure: SARCOS joint 7 with k-NN; Ailerons with k-NN; Telecom with k-NN.]

Take-home message: gradient weights help local regressors! So simple that there should be much room for improvement! Thanks!
