Local regression, intrinsic dimension, and nonparametric sparsity
Local regression, intrinsic dimension, and nonparametric sparsity
Samory Kpotufe, Toyota Technological Institute at Chicago and Max Planck Institute for Intelligent Systems
I. Local regression and (local) intrinsic dimension.
II. Nonparametric sparsity: improving local regression.
Local Regression

Data: {(X_i, Y_i)}_{i=1}^n, Y = f(X) + noise.
We will assume f is Lipschitz-continuous: |f(x) − f(x′)| ≤ λρ(x, x′).
Learn: f_n(x) = avg(Y_i) over X_i ∈ B(x). In particular, f_{n,k}(x) = kernel avg(Y_i) over the k-NN ball B(x) (k-NN or adaptive-bandwidth kernel regression).
Quite basic and works well, hence common in practice!
New explanation for its success: adaptivity to local problem complexity (local density (e.g. I. Abramson '82), local dimension of X).
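The k-NN average above can be sketched in a few lines. This is a minimal illustration with uniform weights, not the talk's kernel-weighted estimator; the data and function are made up for the example:

```python
import numpy as np

def knn_regress(X, Y, x, k):
    """Plain k-NN regression: average Y over the k nearest neighbors of x."""
    dists = np.linalg.norm(X - x, axis=1)   # rho(x, X_i), Euclidean here
    nearest = np.argsort(dists)[:k]         # indices of the k closest samples
    return Y[nearest].mean()

# Noisy samples of the Lipschitz function f(x) = |x_1|
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
Y = np.abs(X[:, 0]) + 0.05 * rng.standard_normal(500)

print(knn_regress(X, Y, np.array([0.5, 0.0]), k=25))  # roughly 0.5
```

The kernel-weighted version would simply replace the plain mean with a distance-weighted mean over the same neighbors.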
Dimension and regression: curse of dimension

Suppose X ⊂ ℝ^D. There exist distributions on (X, Y) such that the excess risk
‖f_{n,k} − f‖²_{2,µ} ≜ E_X |f_{n,k}(X) − f(X)|²
is of the form n^{−2/(2+D)}. This is true for all nonparametric regressors! (Stone '82)
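To see the curse concretely: for the rate n^{−2/(2+D)} to reach a fixed accuracy ε, the sample size must grow like ε^{−(2+D)/2}, i.e. exponentially in D. A quick back-of-the-envelope check (constants ignored):

```python
def samples_needed(D, eps=0.1):
    """Smallest n (up to constants) with n**(-2/(2+D)) <= eps,
    i.e. n = eps**(-(2+D)/2)."""
    return eps ** (-(2 + D) / 2)

for D in (2, 8, 18):
    print(f"D={D}: n ~ {samples_needed(D):.0e}")
# The requirement grows from ~1e2 (D=2) to ~1e5 (D=8) to ~1e10 (D=18).
```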
Fortunately, high-dimensional data often has low intrinsic complexity: linear data, manifold data, sparse data.
Common approach: dimension reduction, e.g. PCA, manifold learning (e.g. LLE, Isomap, Laplacian eigenmaps, kernel PCA, ...).
Main result: k-NN performs well without dimension reduction!

f_{n,k} converges at a rate adaptive to the unknown intrinsic dimension. Furthermore, k-NN is locally adaptive: f_{n,k}(x) adapts to the intrinsic dimension locally at x.
The result suggests that more can be gained tuning k than tuning the parameters of your favorite dimension-reduction procedure.
Other work on adaptivity to intrinsic dimension:
- Kernel and local polynomial regression: Bickel and Li 2006, Lafferty and Wasserman (manifold dim.)
- Dyadic tree classification: Scott and Nowak (box dim.)
- 1-NN regression: Kulkarni and Posner (doubling dim.)
- RPtree, dyadic tree regression: Kpotufe and Dasgupta (doubling dim.)
- Tree-kernel hybrids: Kpotufe (doubling dim.)
The above results assume more restrictive notions of dimension.
Outline:
- Intrinsic dimension
- Adaptivity for any log n ≲ k ≲ n
- Choosing a good k = k(x)
Intrinsic dimension

Figure: d-dimensional balls centered at x.
Volume growth: vol(B(x, r)) = C·r^d = ɛ^{−d}·vol(B(x, ɛr)).
Suppose µ is U(B(x, r)); then µ(B(x, r)) = ɛ^{−d}·µ(B(x, ɛr)).
Definition: µ is (C, d)-homogeneous on B(x, r) if for all r′ ≤ r and ɛ > 0, µ(B(x, r′)) ≤ C·ɛ^{−d}·µ(B(x, ɛr′)).
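The homogeneity condition suggests a simple empirical probe of local dimension (an illustration of the definition, not a procedure from the talk): since µ(B(x, r)) ≈ C·r^d, the ratio of empirical masses at radii r and r/2 gives d ≈ log₂(µ_n(B(x, r)) / µ_n(B(x, r/2))):

```python
import numpy as np

def local_dim_estimate(X, x, r):
    """Doubling estimate of local dimension at x from empirical ball masses:
    if mu(B(x, r)) ~ C * r**d, then d ~ log2(mass(r) / mass(r/2))."""
    dists = np.linalg.norm(X - x, axis=1)
    mass_r = np.sum(dists <= r)
    mass_half = np.sum(dists <= r / 2)
    return np.log2(mass_r / mass_half)

# 2-dimensional data embedded in R^5: ambient dimension 5, intrinsic dimension 2
rng = np.random.default_rng(1)
Z = rng.uniform(-1, 1, size=(20000, 2))
X = np.hstack([Z, np.zeros((20000, 3))])

print(local_dim_estimate(X, np.zeros(5), r=0.5))  # close to 2, not 5
```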
Given a query x, the behavior of µ in a neighborhood B of x can capture the intrinsic dimension in B. The location of the query x matters! The size of the neighborhood B matters! For k-NN, B is a region centered at x, of mass ≈ k/n.
The behavior of µ can capture the intrinsic dimension locally. For k-NN, locality will depend on n and k. (Linear data, manifold data, sparse data.)
Suppose µ = Σ_i ω_i µ_i, where each µ_i is (C_i, d_i)-homogeneous on B(x); then µ is (C, d)-homogeneous on B(x) for some C and d ≤ max_i d_i, e.g. X a collection of subspaces of various dimensions.
Outline:
- Intrinsic dimension
- Adaptivity for any log n ≲ k ≲ n
- Choosing a good k = k(x)
Adaptivity for k: general intuition

Fix n ≥ k ≥ log n, and let x lie in a region B of dimension d. The rate of convergence of f_{n,k}(x) depends on:
- Variance of f_{n,k}(x) ≈ 1/k.
- Bias of f_{n,k}(x) ≈ r_k(x), the distance from x to its k-th nearest neighbor.
It turns out that r_k(x) ≈ (k/n)^{1/d}, where d = d(B). Also, r_k(x) depends on µ(B): it is smaller in dense regions B.
Adaptivity for k: result

Theorem: The following holds w.h.p. simultaneously for all x ∈ X and log n ≲ k ≲ n. Consider any ball B = B(x, r) centered at x such that µ(B(x, r)) ≥ k/n, and suppose µ is (C, d)-homogeneous on B. We have
|f_{n,k}(x) − f(x)|² ≲ (σ²_Y + t²_Y)/k + λ²r²·(Ck/(nµ(B)))^{2/d}.
The rate is best if x is in a dense region B with low dimension d.
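Ignoring the constants (C, λ, µ(B)), the two terms of the bound trade off in k: the variance term 1/k falls while the squared-bias term (k/n)^{2/d} rises, and equating them gives k* ≈ n^{2/(2+d)}, at which point both terms are of order n^{−2/(2+d)}. A numeric sanity check of that balancing:

```python
def variance_term(k):
    """Variance proxy of the k-NN average, up to constants."""
    return 1.0 / k

def bias_sq_term(k, n, d):
    """Squared-bias proxy r_k(x)**2 ~ (k/n)**(2/d), up to constants."""
    return (k / n) ** (2 / d)

n, d = 10**6, 2
k_star = n ** (2 / (2 + d))                 # = 1000 for n = 1e6, d = 2
print(variance_term(k_star), bias_sq_term(k_star, n, d))
# Both terms equal n**(-2/(2+d)) = 1e-3 at the balancing k*.
```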
Proof idea

|f_{n,k}(x) − f(x)| ≤ |f_{n,k}(x) − Ẽ f_{n,k}(x)| + |Ẽ f_{n,k}(x) − f(x)|.

Uniform variance bound over X: fixing the sample X and x, |f_{n,k}(x) − Ẽ f_{n,k}(x)|² ≈ Var_{Y|X} f_{n,k}(x) ≈ 1/k. Fixing X, there are at most n^{VC of balls} possible values of f_{n,k}(x).

Uniform bias bound over X: in d dimensions, µ(B(x, r)) ≈ r^d implies E r_{k,n}(x) ≈ (k/n)^{1/d}. Here d changes with r, but a similar idea holds. For all balls B, µ_n(B) ≈ µ(B), so that for all x, r_{k,n}(x) ≈ E r_{k,n}(x).
Outline:
- Intrinsic dimension
- Adaptivity for any log n ≲ k ≲ n
- Choosing a good k = k(x)
Choosing k(x): best possible rate in terms of d

Theorem: Consider a metric measure space (X, ρ, µ) such that for all x ∈ X, r > 0, ɛ > 0, we have µ(B(x, r)) ≥ ɛ^{−d}·µ(B(x, ɛr)). Then, for any regressor f_n, there exists D_{X,Y}, where D_X = µ and f(x) ≜ E[Y | x] is λ-Lipschitz, such that
E ‖f_n − f‖²_{2,µ} ≳ λ^{2d/(2+d)}·n^{−2/(2+d)}.
Choosing k(x): best possible rate in terms of d

Intuition: we want to create a family F of λ-Lipschitz functions f which vary a lot! The amount of variation we can impose depends on d.
Choosing k locally at x: intuition

Note: cross-validation and dimension estimation require large sample sizes, which are unlikely to be available in small neighborhoods of x. Instead:
Main technical hurdle: the intrinsic dimension might vary with k.
Choosing k(x): result

Theorem: Suppose k(x) is chosen as above. The following holds w.h.p. simultaneously for all x. Consider any ball B centered at x such that µ(B) ≥ n^{−1/3}, and suppose µ is (C, d)-homogeneous on B. We have
|f_{n,k(x)}(x) − f(x)|² ≲ λ²·(C/(nµ(B)))^{2/(2+d)}.
As n → ∞ the claim applies to any B centered at x with µ(B) → 0.
Proof idea

Suppose d is fixed over X. Then k* ≈ n^{2/(2+d)}, so we show that rate(k(x)) ≈ rate(k*) ≈ n^{−2/(2+d)}. Since d varies with r, we apply the argument over a sufficiently large ball that includes all k* neighbors for any d ≥ 1.
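A local choice of k can be sketched as a Lepski-style balancing rule. This is a hypothetical version consistent with the bounds above, not necessarily the talk's exact rule, and the noise level σ² and Lipschitz constant λ are assumed known here:

```python
import numpy as np

def choose_k_local(X, x, sigma2=0.05, lam=1.0, k_min=5):
    """Largest k whose variance proxy sigma2/k still dominates the squared-bias
    proxy (lam * r_k(x))**2, where r_k(x) is the distance to the k-th neighbor.
    A balancing heuristic, not the talk's exact procedure."""
    r = np.sort(np.linalg.norm(X - x, axis=1))   # r[k-1] = distance to k-th neighbor
    ks = np.arange(k_min, len(X) + 1)
    balanced = sigma2 / ks >= (lam * r[ks - 1]) ** 2
    return int(ks[balanced][-1]) if balanced.any() else k_min

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(1000, 1))           # dense 1-d cloud
print(choose_k_local(X, np.zeros(1)))            # larger k where the data is denser
```

In a denser region r_k(x) is smaller for the same k, so the rule returns a larger k there, matching the theorem's dependence on µ(B).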
Results likely extend to:
- Higher-order polynomial regression/classification using k-NN.
- Full adaptivity: include function smoothness.
- Local choice of bandwidth in kernel regression.
Take-home message

k-NN regression performs well without dimension reduction! Furthermore, it is possible to choose k at x so as to achieve good performance in terms of local dimension.
Question: Is there a general principle for designing adaptive learners? We have assumed so far that nature (or an expert) provided the right metric (X, ρ)!
II. Nonparametric sparsity: improving local regression.
Based on joint work with Abdeslam Boularias.
Sparse f

Parametric sparsity: f(x) = Σ_{i≤d} w_i f_i(x), most w_i = 0.
Nonparametric sparsity: f(x) = Σ_i w_i f_i(x), most w_i = 0, or f(x) = g(projection of x) onto R relevant variables, R ⊂ [d].
Here: for every f′_i (derivative of f along coordinate i), let ‖f′_i‖_{1,µ} ≜ E_X |f′_i(X)|. The set {‖f′_i‖_{1,µ}}_{i≤d} is approximately sparse: f could depend on all features of X, but not equally on all!
A simple idea: gradient weighting

Reweight x_i ↦ W_i·x_i, where W_i ≈ ‖f′_i‖_{1,µ}. Equivalently, replace the Euclidean distance with ρ(x, x′)² = (x − x′)ᵀ W (x − x′).
Similar to metric learning, but cheaper: we estimate a single ρ! Now run any local regressor f_n on (X, ρ).
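Gradient weighting amounts to rescaling each coordinate before running an ordinary distance-based regressor. A minimal sketch with k-NN; the weight vector W and the data here are illustrative, not estimated as in the talk:

```python
import numpy as np

def weighted_knn_regress(X, Y, W, x, k):
    """k-NN regression under rho(x, x')**2 = sum_i W_i * (x_i - x'_i)**2,
    i.e. plain k-NN after rescaling coordinate i by sqrt(W_i)."""
    scaled_diff = (X - x) * np.sqrt(W)      # diagonal metric as coordinate rescaling
    dists = np.linalg.norm(scaled_diff, axis=1)
    return Y[np.argsort(dists)[:k]].mean()

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(2000, 10))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(2000)  # only coordinate 0 matters
W = np.array([1.0] + [0.01] * 9)        # heavy weight on the relevant coordinate
x = np.full(10, 0.3)
print(weighted_knn_regress(X, Y, W, x, k=50))  # close to sin(0.9) ~ 0.78
```

With the downweighted irrelevant coordinates, the 50 neighbors concentrate around x in the relevant direction, which is exactly the variance/bias effect described on the next slide.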
Intuition: suppose ‖f′_2‖_{1,µ} ≪ ‖f′_1‖_{1,µ}.

Lower Var(f_n(x)): suppose ‖f′_i‖_{1,µ} is large only for i ∈ R ⊂ [d]. Balls B_ρ contain more points relative to Euclidean balls; formally, the mass of ρ-balls scales like ɛ^{|R|} instead of ɛ^d.

Bias of f_n(x) is relatively unaffected: no sample is too far from x in any relevant direction. Formally, f has the following Lipschitz property:
|f(x) − f(x′)| ≤ (Σ_{i∈R} ‖f′_i‖ / √W_i) · ρ(x, x′).
Efficient estimation of ‖f′_i‖_{1,µ}: don't estimate f′_i itself.

W_i = E_n [ |f_{n,h}(X + t·e_i) − f_{n,h}(X − t·e_i)| / (2t) · 1{A_{n,i}(X)} ],
where A_{n,i}(X) is the event that we are confident in both estimates f_{n,h}(X ± t·e_i).
- Fast preprocessing, and online: just 2 estimates of f_{n,h} per sample point X.
- Metric learning optimizes over a space of possible metrics; here only a small parameter search grid is needed to adapt to d dimensions, whereas KR with per-coordinate bandwidths {h_i}_1^d needs a search grid over d bandwidths.
- General: preprocessing for any distance-based regressor. Other methods apply to particular regressors, e.g. Rodeo for local linear regression, metric learning for KR.
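The finite-difference estimator of W_i can be sketched as follows, with a k-NN pilot estimate standing in for f_{n,h} and the confidence indicator A_{n,i} omitted (both simplifications are assumptions of this sketch, not the talk's exact estimator):

```python
import numpy as np

def knn_pilot(X, Y, q, k=30):
    """Pilot regression estimate at query q (stand-in for f_{n,h})."""
    d = np.linalg.norm(X - q, axis=1)
    return Y[np.argsort(d)[:k]].mean()

def gradient_weights(X, Y, t=0.2, n_queries=100, seed=0):
    """W_i ~ E_n |f(X + t e_i) - f(X - t e_i)| / (2t), averaged over
    a subsample of the data points (confidence indicator omitted)."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    queries = X[rng.choice(n, size=min(n_queries, n), replace=False)]
    W = np.zeros(D)
    for i in range(D):
        e = np.zeros(D)
        e[i] = t
        diffs = [abs(knn_pilot(X, Y, q + e) - knn_pilot(X, Y, q - e))
                 for q in queries]
        W[i] = np.mean(diffs) / (2 * t)
    return W

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(800, 3))
Y = 2.0 * X[:, 0]                  # f depends on coordinate 0 only, |f'_1| = 2
W = gradient_weights(X, Y)
print(W)                           # W[0] near 2; W[1], W[2] substantially smaller
```

Only 2·d pilot evaluations per sample point are needed, which is the cheapness the slide emphasizes relative to metric learning.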
Practicality of the approach: on many real-world datasets, ‖f′_i‖_{1,µ} varies enough across coordinates! (Plots: SARCOS robot joint 7, Parkinson's, Telecom, Ailerons.)
W_i consistently estimates ‖f′_i‖_{1,µ}

Theorem: Under general regularity conditions on µ, and smoothness conditions on f, we have with high probability (up to constants):
max_{i∈[d]} |W_i − ‖f′_i‖_{1,µ}| ≲ A(n)/(t·√(nh^d)) + h·sup_{i∈[d]} ‖f′_i‖ + 2·sup_{i∈[d]} ‖f′_i‖·(√(ln(2d/δ)/n) + µ(∂_{t,i}(X))) + ɛ_{t,i}.
Results on real-world datasets, kernel regression: normalized error ± std for KR vs KR-ρ on Barrett joints 1 and 5, SARCOS joints 1 and 5, Housing, Concrete Strength, Wine Quality, Telecom, Ailerons, and Parkinson's. Most table entries are illegible in this transcription; where legible, KR-ρ improves on plain KR, e.g. Barrett joint 1: KR 0.50 vs KR-ρ 0.38; Concrete Strength: KR 0.42 vs KR-ρ 0.37.
Results on real-world datasets, k-NN regression: same datasets, k-NN vs k-NN-ρ. Where legible, k-NN-ρ improves on plain k-NN, e.g. Barrett joint 1: k-NN 0.41 vs k-NN-ρ 0.29; Concrete Strength: k-NN 0.40 vs k-NN-ρ 0.38.
MSE for varying training sizes, kernel regression. (Plots: SARCOS joint 7 with KR; Ailerons with KR; Telecom with KR.)
MSE for varying training sizes, k-NN regression. (Plots: SARCOS joint 7 with k-NN; Ailerons with k-NN; Telecom with k-NN.)
Take-home message

Gradient weights help local regressors! The approach is so simple that there should be much room for improvement. Thanks!
More informationEnsemble Methods and Random Forests
Ensemble Methods and Random Forests Vaishnavi S May 2017 1 Introduction We have seen various analysis for classification and regression in the course. One of the common methods to reduce the generalization
More informationClass 2 & 3 Overfitting & Regularization
Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating
More informationGeometric Inference for Probability distributions
Geometric Inference for Probability distributions F. Chazal 1 D. Cohen-Steiner 2 Q. Mérigot 2 1 Geometrica, INRIA Saclay, 2 Geometrica, INRIA Sophia-Antipolis 2009 June 1 Motivation What is the (relevant)
More informationMath 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015
Math 30-: Midterm Practice Solutions Northwestern University, Winter 015 1. Give an example of each of the following. No justification is needed. (a) A metric on R with respect to which R is bounded. (b)
More informationGeometric View of Machine Learning Nearest Neighbor Classification. Slides adapted from Prof. Carpuat
Geometric View of Machine Learning Nearest Neighbor Classification Slides adapted from Prof. Carpuat What we know so far Decision Trees What is a decision tree, and how to induce it from data Fundamental
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationKernel-Based Contrast Functions for Sufficient Dimension Reduction
Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationEconomics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation
Economics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 19: Nonparametric Analysis
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationNeural Networks, Convexity, Kernels and Curses
Neural Networks, Convexity, Kernels and Curses Yoshua Bengio Work done with Nicolas Le Roux, Olivier Delalleau and Hugo Larochelle August 26th 2005 Perspective Curse of Dimensionality Most common non-parametric
More informationManifold Learning and it s application
Manifold Learning and it s application Nandan Dubey SE367 Outline 1 Introduction Manifold Examples image as vector Importance Dimension Reduction Techniques 2 Linear Methods PCA Example MDS Perception
More informationDivide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates
: A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More information41903: Introduction to Nonparametrics
41903: Notes 5 Introduction Nonparametrics fundamentally about fitting flexible models: want model that is flexible enough to accommodate important patterns but not so flexible it overspecializes to specific
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationNonparametric Methods
Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis
More informationConsistency of Nearest Neighbor Classification under Selective Sampling
JMLR: Workshop and Conference Proceedings vol 23 2012) 18.1 18.15 25th Annual Conference on Learning Theory Consistency of Nearest Neighbor Classification under Selective Sampling Sanjoy Dasgupta 9500
More informationSample Complexity of Learning Mahalanobis Distance Metrics. Nakul Verma Janelia, HHMI
Sample Complexity of Learning Mahalanobis Distance Metrics Nakul Verma Janelia, HHMI feature 2 Mahalanobis Metric Learning Comparing observations in feature space: x 1 [sq. Euclidean dist] x 2 (all features
More informationAssignment 2 (Sol.) Introduction to Machine Learning Prof. B. Ravindran
Assignment 2 (Sol.) Introduction to Machine Learning Prof. B. Ravindran 1. Let A m n be a matrix of real numbers. The matrix AA T has an eigenvector x with eigenvalue b. Then the eigenvector y of A T A
More informationLecture 10: Dimension Reduction Techniques
Lecture 10: Dimension Reduction Techniques Radu Balan Department of Mathematics, AMSC, CSCAMM and NWC University of Maryland, College Park, MD April 17, 2018 Input Data It is assumed that there is a set
More informationExploiting Sparse Non-Linear Structure in Astronomical Data
Exploiting Sparse Non-Linear Structure in Astronomical Data Ann B. Lee Department of Statistics and Department of Machine Learning, Carnegie Mellon University Joint work with P. Freeman, C. Schafer, and
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationLaplacian Eigenmaps for Dimensionality Reduction and Data Representation
Introduction and Data Representation Mikhail Belkin & Partha Niyogi Department of Electrical Engieering University of Minnesota Mar 21, 2017 1/22 Outline Introduction 1 Introduction 2 3 4 Connections to
More informationNonparametric Inference in Cosmology and Astrophysics: Biases and Variants
Nonparametric Inference in Cosmology and Astrophysics: Biases and Variants Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Collaborators:
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBayesian simultaneous regression and dimension reduction
Bayesian simultaneous regression and dimension reduction MCMski II Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University January 10, 2008
More informationSparse Kernel Machines - SVM
Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support
More informationNonlinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Kernel PCA 2 Isomap 3 Locally Linear Embedding 4 Laplacian Eigenmap
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationMetric Spaces and Topology
Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies
More informationLinear and Non-Linear Dimensionality Reduction
Linear and Non-Linear Dimensionality Reduction Alexander Schulz aschulz(at)techfak.uni-bielefeld.de University of Pisa, Pisa 4.5.215 and 7.5.215 Overview Dimensionality Reduction Motivation Linear Projections
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
More informationSemi-Supervised Learning by Multi-Manifold Separation
Semi-Supervised Learning by Multi-Manifold Separation Xiaojin (Jerry) Zhu Department of Computer Sciences University of Wisconsin Madison Joint work with Andrew Goldberg, Zhiting Xu, Aarti Singh, and Rob
More informationProbabilistic Machine Learning. Industrial AI Lab.
Probabilistic Machine Learning Industrial AI Lab. Probabilistic Linear Regression Outline Probabilistic Classification Probabilistic Clustering Probabilistic Dimension Reduction 2 Probabilistic Linear
More informationVBM683 Machine Learning
VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More informationCS534 Machine Learning - Spring Final Exam
CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the
More informationGaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012
Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationMulti-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics
1 / 38 Multi-task Learning with Gaussian Processes, with Applications to Robot Inverse Dynamics Chris Williams with Kian Ming A. Chai, Stefan Klanke, Sethu Vijayakumar December 2009 Motivation 2 / 38 Examples
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationMultiscale Wavelets on Trees, Graphs and High Dimensional Data
Multiscale Wavelets on Trees, Graphs and High Dimensional Data ICML 2010, Haifa Matan Gavish (Weizmann/Stanford) Boaz Nadler (Weizmann) Ronald Coifman (Yale) Boaz Nadler Ronald Coifman Motto... the relationships
More informationFace Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi
Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold
More information... SPARROW. SPARse approximation Weighted regression. Pardis Noorzad. Department of Computer Engineering and IT Amirkabir University of Technology
..... SPARROW SPARse approximation Weighted regression Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Université de Montréal March 12, 2012 SPARROW 1/47 .....
More informationCSC 411: Lecture 02: Linear Regression
CSC 411: Lecture 02: Linear Regression Richard Zemel, Raquel Urtasun and Sanja Fidler University of Toronto (Most plots in this lecture are from Bishop s book) Zemel, Urtasun, Fidler (UofT) CSC 411: 02-Regression
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationSupport Vector Machines
Support Vector Machines Reading: Ben-Hur & Weston, A User s Guide to Support Vector Machines (linked from class web page) Notation Assume a binary classification problem. Instances are represented by vector
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More information10-701/ Machine Learning - Midterm Exam, Fall 2010
10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam
More informationCMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors. Furong Huang /
CMSC 422 Introduction to Machine Learning Lecture 4 Geometry and Nearest Neighbors Furong Huang / furongh@cs.umd.edu What we know so far Decision Trees What is a decision tree, and how to induce it from
More information