Local Self-Tuning in Nonparametric Regression. Samory Kpotufe, ORFE, Princeton University


2 Local Regression
Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$, with $f$ nonparametric: $f \in \mathcal{F}$, i.e. $\dim(\mathcal{F}) = \infty$.
Learn: $f_n(x) = \text{avg}(Y_i)$ over Neighbors$(x)$ (e.g. k-NN, kernel, or tree-based regression).
Quite basic, hence common in modern applications.
Sensitive to the choice of Neighbors$(x)$: $k$, bandwidth $h$, tree cell size.
Goal: choose Neighbors$(x)$ optimally!

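To make the estimator concrete, here is a minimal sketch of such a local regressor (my own illustration, not code from the talk; the k-NN average stands in for any choice of Neighbors$(x)$):

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """Minimal k-NN regressor: average the Y-values of the k points closest to x.
    X: (n, D) array of covariates, Y: (n,) array of responses."""
    dists = np.linalg.norm(X - x, axis=1)   # distances from the query to all samples
    neighbors = np.argsort(dists)[:k]       # indices of the k nearest samples
    return Y[neighbors].mean()              # local average = f_n(x)

# Tiny usage example on synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)
print(knn_regress(np.array([0.2, -0.4]), X, Y, k=10))
```

Already here, the k that works well depends on the noise level and on how fast f varies, which is the tuning problem the rest of the talk addresses.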

6 Performance depends on problem parameters
Performance depends on $\dim(\mathcal{X})$ and on how fast $f$ varies...
Suppose $\mathcal{X} \subset \mathbb{R}^D$, and for all $x, x'$: $|f(x) - f(x')| \le \lambda \|x - x'\|^\alpha$.
Performance measure: $\|f_n - f\|^2_{2,P_X} \doteq \mathbb{E}_X |f_n(X) - f(X)|^2$.
Minimax global performance (Stone 1980-82): $\|f_n - f\|^2_{2,P_X} \asymp \lambda^{2D/(2\alpha+D)} \, n^{-2\alpha/(2\alpha+D)}$.

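To make the contrast concrete, here is a worked instance of the exponent above for illustrative values (the specific numbers D = 20 and d = 2 are mine, chosen only for illustration):

\[
\left.\frac{2\alpha}{2\alpha + D}\right|_{\alpha = 1,\, D = 20} = \frac{1}{11}
\qquad\text{vs.}\qquad
\left.\frac{2\alpha}{2\alpha + d}\right|_{\alpha = 1,\, d = 2} = \frac{1}{2},
\]

so reaching error roughly $0.1$ requires $n \approx 10^{11}$ samples at rate $n^{-1/11}$, but only $n \approx 10^{2}$ at rate $n^{-1/2}$. This is the curse of dimensionality that the milder situations below aim to escape.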

10 Some milder situations for $\mathcal{X} \subset \mathbb{R}^D$
$f$ is quite smooth, $f$ is sparse, $f$ is additive, ...
Of interest here: $\mathcal{X}$ has low intrinsic dimension $d \ll D$.



15 $\mathcal{X} \subset \mathbb{R}^D$ but has low intrinsic dimension $d \ll D$
[Figure: examples of linear data, manifold data, and sparse data.]
Common approach: dimension reduction or estimation.

16 Dimension reduction/estimation increases tuning!
Recent alternative: $f_n$ operates in $\mathbb{R}^D$ but adapts to the unknown $d$ of $\mathcal{X}$.
We want: $\|f_n - f\|^2_{2,P_X} \approx n^{-1/Cd} \ll n^{-1/CD}$.


19 Some work on adaptivity to intrinsic dimension:
Kernel and local polynomial regression: Bickel and Li 2006, Lafferty and Wasserman (manifold dim.).
Dyadic tree classification: Scott and Nowak (box dim.).
RPtree, dyadic tree regression: Kpotufe and Dasgupta (doubling dim.).

20 Adaptivity to intrinsic d
Main insight: key algorithmic quantities depend on $d$, not on $D$.
Kernel reg.: the average mass of a ball of radius $\epsilon$ is approx. $\epsilon^d$.
RPtree: the number of cells of diameter $\epsilon$ is approx. $\epsilon^{-d}$.
For Lipschitz $f$: $\|f_{n,\epsilon} - f\|^2_{2,P_X} \lesssim \dfrac{\epsilon^{-d}}{n} + \epsilon^2$.
Cross-validate over $\epsilon$ for a good rate in terms of $d$.
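As a concrete illustration of the cross-validation step, here is a minimal sketch (my own, with an arbitrary bandwidth grid and holdout split) that tunes a single global bandwidth for a box-kernel regressor:

```python
import numpy as np

def box_kernel_regress(x, X, Y, eps):
    """Average the Y_i whose X_i fall in the ball B(x, eps); fall back to the nearest point."""
    dists = np.linalg.norm(X - x, axis=1)
    in_ball = dists <= eps
    if not in_ball.any():
        in_ball[np.argmin(dists)] = True
    return Y[in_ball].mean()

def cv_bandwidth(X, Y, grid, n_train):
    """Pick the bandwidth minimizing squared error on a holdout set (global tuning)."""
    Xtr, Ytr, Xva, Yva = X[:n_train], Y[:n_train], X[n_train:], Y[n_train:]
    errs = [np.mean([(box_kernel_regress(x, Xtr, Ytr, eps) - y) ** 2
                     for x, y in zip(Xva, Yva)])
            for eps in grid]
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 3))
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(400)
print(cv_bandwidth(X, Y, grid=np.geomspace(0.05, 1.0, 10), n_train=300))
```

Note that this produces one global bandwidth; the point of the remaining slides is that a single global choice cannot be right everywhere when complexity varies with location.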

25 Insights help with tuning under time-constraints.
Main idea: compress the data in a way that respects the structure of $\mathcal{X}$.
Faster kernel regression [Kpo. 2009]. Fast online tree-regression [Kpo. and Orabona 2013].
$O(\log n)$ time vs. $O(n)$, and the regression rates remain optimal!


29 So far, we have viewed d as a global characteristic of X...

30 Problem complexity is likely to depend on location!
Space $\mathcal{X}$: $d = d(x)$. Function $f$: $\alpha = \alpha(x)$.
Choose Neighbors$(x)$ adaptively so that $|f_n(x) - f(x)|^2 \lesssim \lambda_x^{2d_x/(2\alpha_x + d_x)} \, n^{-2\alpha_x/(2\alpha_x + d_x)}$.
The catch: we cannot cross-validate locally at $x$!


33 NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example.

34 Local smoothness
Use local Hölder parameters $\lambda = \lambda(x)$, $\alpha = \alpha(x)$ on $B(x, r)$:
for all $x' \in B(x, r)$, $|f(x) - f(x')| \le \lambda \, \rho(x, x')^\alpha$.
Example: $f(x) = |x|^\alpha$ is flatter at $x = 0$ as $\alpha$ is increased.

35 Local dimension
[Figure: d-dimensional balls centered at x.]
Volume growth: $\mathrm{vol}(B(x, r)) = C' r^d = \epsilon^{-d} \, \mathrm{vol}(B(x, \epsilon r))$.
If $P_X$ is uniform on $B(x, r)$, then $P_X(B(x, r)) = \epsilon^{-d} \, P_X(B(x, \epsilon r))$.
Def.: $P_X$ is $(C, d)$-homogeneous on $B(x, r)$ if for all $r' \le r$, $\epsilon > 0$: $P_X(B(x, r')) \le C \epsilon^{-d} \, P_X(B(x, \epsilon r'))$.

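The homogeneity condition also suggests a simple empirical probe of local dimension: the ratio of the masses of $B(x, r)$ and $B(x, r/2)$ scales like $2^d$. The sketch below is my own illustration of this idea on synthetic data, not a procedure from the talk (whose point is precisely that such explicit estimation can be avoided):

```python
import numpy as np

def local_dim_estimate(x, X, r):
    """Crude local-dimension probe: log2 of the ratio of empirical masses of
    B(x, r) and B(x, r/2); under (C, d)-homogeneity this ratio behaves like 2^d."""
    dists = np.linalg.norm(X - x, axis=1)
    mass_r = max(int((dists <= r).sum()), 1)
    mass_half = max(int((dists <= r / 2).sum()), 1)
    return np.log2(mass_r / mass_half)

rng = np.random.default_rng(2)
# A 2-dimensional flat "manifold" embedded in R^10: intrinsic d = 2, ambient D = 10.
low = rng.uniform(-1, 1, size=(5000, 2))
X = np.hstack([low, np.zeros((5000, 8))])
print(local_dim_estimate(X[0], X, r=0.5))   # typically close to 2
```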

39 The growth of $P_X$ can capture the intrinsic dimension in $B(x)$.
The location of the query $x$ matters! The size of the neighborhood $B$ matters!


42 The size of the neighborhood $B$ matters!
For k-NN or kernel regression, the size of $B$ depends on $n$ and on $k$ or $h$.


45 The growth of $P_X(B)$ can capture the intrinsic dimension locally.
[Figure: linear data, manifold data, sparse data.]
$\mathcal{X}$ can be a collection of subspaces of various dimensions.

46 Intrinsic d tightly captures the minimax rate:
Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \le \epsilon^{-d} \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $P_{X,Y}$ with $P_X = \mu$ and $f(x) = \mathbb{E}[Y \mid x]$ $\lambda$-Lipschitz, such that $\mathbb{E}_{P_{X,Y}^n} \|f_n - f\|^2_{2,\mu} \gtrsim \lambda^{2d/(2+d)} \, n^{-2/(2+d)}$.


48 NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example.

49 Main assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$. $P_X$ is locally homogeneous with unknown $d(x)$. $f$ is $\lambda$-Lipschitz on $\mathcal{X}$, i.e. $\alpha = 1$.
k-NN regression: $f_n(x) =$ weighted average of the $Y_i$ over k-NN$(x)$.
If $\mathcal{X} \subset \mathbb{R}^D$, the learner operates in $\mathbb{R}^D$! No dimensionality reduction, no dimension estimation!


51 Bias-Variance tradeoff
$\mathbb{E}_{(X_i, Y_i)_1^n} |f_n(x) - f(x)|^2 = \underbrace{\mathbb{E} |f_n(x) - \mathbb{E} f_n(x)|^2}_{\text{Variance}} + \underbrace{|\mathbb{E} f_n(x) - f(x)|^2}_{\text{Bias}^2}$.
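For completeness, the decomposition follows by adding and subtracting $\mathbb{E} f_n(x)$; the cross term vanishes in expectation:

\[
\mathbb{E}\,|f_n(x) - f(x)|^2
= \mathbb{E}\,\big| (f_n(x) - \mathbb{E} f_n(x)) + (\mathbb{E} f_n(x) - f(x)) \big|^2
= \mathbb{E}\,|f_n(x) - \mathbb{E} f_n(x)|^2 + |\mathbb{E} f_n(x) - f(x)|^2,
\]

since $\mathbb{E}[f_n(x) - \mathbb{E} f_n(x)] = 0$.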

52 General intuition:
Fix $n \ge k \ge \log n$, and let $x$ lie in a region $B$ of dimension $d$. The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/k$. (Bias of $f_n(x)$) $\approx r_k(x)$, the distance from $x$ to its $k$-th nearest neighbor.
We have: $|f_n(x) - f(x)|^2 \lesssim \dfrac{1}{k} + r_k(x)^2$.
It turns out: $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$.

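Balancing the two terms above gives the rate in the upcoming theorem (a worked step, not on the slides):

\[
\frac{1}{k} \approx r_k(x)^2 \approx \left(\frac{k}{n}\right)^{2/d}
\;\Longrightarrow\;
k \approx n^{2/(2+d)},
\qquad
|f_n(x) - f(x)|^2 \lesssim \frac{1}{k} \approx n^{-2/(2+d)},
\]

which matches the $\big(\ln n / (n P_X(B))\big)^{2/(2+d)}$ form below up to logarithmic factors.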

57 Choosing k locally at x: intuition
Remember: cross-validation or dimension estimation at $x$ is impractical.
Instead: balance the two observable surrogates at $x$, i.e. choose $k(x)$ where $1/k$ and $\lambda^2 r_k(x)^2$ are of the same order (see the sketch below).
Main technical hurdle: the intrinsic dimension might vary with $k$.

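Here is a minimal sketch of such a local balancing rule (my own illustration in the spirit of the slides, treating λ as a known constant; not the exact procedure or constants from the paper):

```python
import numpy as np

def choose_k_local(x, X, lam=1.0, k_min=5):
    """Pick k(x) as the largest k whose variance surrogate 1/k still dominates
    the bias surrogate (lam * r_k(x))^2, where r_k(x) is the k-NN radius at x."""
    dists = np.sort(np.linalg.norm(X - x, axis=1))
    k_star = k_min
    for k in range(k_min, len(dists) + 1):
        r_k = dists[k - 1]                  # distance to the k-th nearest neighbor
        if 1.0 / k < (lam * r_k) ** 2:      # bias surrogate now dominates: stop
            break
        k_star = k
    return k_star

def knn_regress_local(x, X, Y, lam=1.0):
    """k-NN regression with the locally chosen k(x)."""
    k = choose_k_local(x, X, lam)
    neighbors = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Y[neighbors].mean()
```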

59 Choosing k(x): result
Theorem: Suppose $k(x)$ is chosen as above. The following holds w.h.p. simultaneously for all $x$. Consider any $B$ centered at $x$ s.t. $P_X(B) \ge n^{-1/3}$, and suppose $P_X$ is $(C, d)$-homogeneous on $B$. We have
$|f_n(x) - f(x)|^2 \lesssim \lambda^2 \left(\dfrac{C \ln n}{n \, P_X(B)}\right)^{2/(2+d)}.$
As $n \to \infty$ the claim applies to any $B$ centered at $x$ with $P_X(B) \to 0$.


61 NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example. (Recent work with Vikas Garg)

62 Main assumptions:
$\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$ of diameter 1. $P_X$ is locally homogeneous with unknown $d(x)$. $f$ is locally Hölder with unknown $\lambda(x)$, $\alpha(x)$.
Kernel regression: $f_n(x) =$ weighted average of the $Y_i$ for $X_i \in B_\rho(x, h)$.


64 General intuition:
Fix $0 < h < 1$, and let $x$ lie in a region $B$ of dimension $d$. The rate of convergence of $f_n(x)$ depends on:
(Variance of $f_n(x)$) $\approx 1/n_h(x)$, where $n_h(x)$ is the number of samples in $B(x, h)$. (Bias$^2$ of $f_n(x)$) $\approx h^{2\alpha}$.
We have: $|f_n(x) - f(x)|^2 \lesssim \dfrac{1}{n_h(x)} + h^{2\alpha}$.
It turns out: $n_h(x) \approx n h^d$, where $d = d(B)$.

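As in the k-NN case, balancing the two terms identifies the ideal local bandwidth (again a worked step, not on the slides):

\[
\frac{1}{n h^d} \approx h^{2\alpha}
\;\Longrightarrow\;
h^* \approx n^{-1/(2\alpha + d)},
\qquad
|f_n(x) - f(x)|^2 \lesssim (h^*)^{2\alpha} \approx n^{-2\alpha/(2\alpha + d)},
\]

the local rate targeted by the selection procedures on the next slides.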

69 From the previous intuition
Suppose we know $\alpha(x)$ but not $d(x)$. Monitor $\dfrac{1}{n_h(x)}$ and $h^{2\alpha}$.
Picking $\hat h_d(x)$ where the two balance: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.

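A minimal sketch of this monitoring rule when α is known (my own illustration; the bandwidth grid and the stopping rule are simplifications):

```python
import numpy as np

def choose_h_known_alpha(x, X, alpha, h_grid=None):
    """Return the smallest h on the grid at which the bias surrogate h^(2*alpha)
    catches up with the variance surrogate 1/n_h(x), with n_h(x) = #{X_i in B(x, h)}."""
    if h_grid is None:
        h_grid = np.geomspace(0.01, 1.0, 50)   # increasing bandwidths
    dists = np.linalg.norm(X - x, axis=1)
    for h in h_grid:                           # as h grows, variance falls and bias grows
        n_h = max(int((dists <= h).sum()), 1)
        if h ** (2 * alpha) >= 1.0 / n_h:      # the two terms balance: stop here
            return h
    return h_grid[-1]
```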

71 From Lepski's method
Suppose we know $d(x)$ but not $\alpha(x)$.
Intuition: for every $h < h^*$, $\dfrac{1}{n h^d} > h^{2\alpha}$, therefore for such $h$: $|f_n(h; x) - f(x)|^2 \lesssim \dfrac{1}{n h^d} + h^{2\alpha} \le 2 \cdot \dfrac{1}{n h^d}$.


73 From Lepski's method
Suppose we know $d(x)$ but not $\alpha(x)$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\dfrac{1}{n h^d}} \right]$, $h < h^*$, must intersect!
Picking $\hat h_\alpha(x)$ (the largest $h$ for which they still do): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.


75 Combine Lepski's method with the previous intuition
We know neither $d$ nor $\alpha$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\dfrac{1}{n_h(x)}} \right]$, $h < \hat h_d$, must intersect!
Picking $\hat h_{\alpha,d}(x)$ (the largest such $h$): $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(\hat h_d) \approx \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.

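A minimal sketch of the combined, fully data-driven rule (my own illustration in the spirit of the slides; not the exact procedure or constants from the paper):

```python
import numpy as np

def box_kernel(x, X, Y, h):
    """Plain average of the Y_i with X_i in B(x, h); returns (estimate, n_h(x))."""
    in_ball = np.linalg.norm(X - x, axis=1) <= h
    n_h = max(int(in_ball.sum()), 1)
    est = Y[in_ball].mean() if in_ball.any() else Y.mean()
    return est, n_h

def lepski_estimate(x, X, Y, h_grid=None):
    """Lepski-style self-tuning at x: among increasing bandwidths, keep the largest h
    whose interval [f_n(h;x) +/- 2/sqrt(n_h(x))] intersects the intervals of all
    smaller bandwidths, and return the corresponding estimate."""
    if h_grid is None:
        h_grid = np.geomspace(0.01, 1.0, 40)
    ests, widths = [], []
    for h in h_grid:
        est, n_h = box_kernel(x, X, Y, h)
        ests.append(est)
        widths.append(2.0 / np.sqrt(n_h))
    best = ests[0]
    for i in range(1, len(h_grid)):
        # Intervals [a +/- wa] and [b +/- wb] intersect iff |a - b| <= wa + wb.
        if all(abs(ests[i] - ests[j]) <= widths[i] + widths[j] for j in range(i)):
            best = ests[i]
        else:
            break
    return best
```

The rule inspects only observable quantities, $f_n(h; x)$ and $n_h(x)$, which is what makes local self-tuning possible without cross-validation at $x$.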

77 Choosing $\hat h_{\alpha,d}(x)$: result
Tightness assumption on $d(x)$: there exists $r_0$ such that for all $x \in \mathcal{X}$ there are $C, C', d$ with $C r^d \le P_X(B(x, r)) \le C' r^d$ for all $r \le r_0$.
Theorem: Suppose $\hat h_{\alpha,d}(x)$ is chosen as described, and let $n \ge N(r_0)$. The following holds w.h.p. simultaneously for all $x$. Let $d, \alpha, \lambda$ be the local problem parameters on $B(x, r_0)$. We have
$|f_n(x) - f(x)|^2 \lesssim \lambda^{2d/(2\alpha+d)} \left(\dfrac{\ln n}{n}\right)^{2\alpha/(2\alpha+d)}.$
The rate is optimal.


79 Simulation on data with mixed spatial complexity... the approach works, but should be made more efficient!


81 Future direction: Extend self-tuning to tree-based kernel implementations.

82 Initial experiments with a tree-based kernel:
Without cross-validation: automatically detect the interval containing the optimal $h$.

83 Future direction: Extend self-tuning to the data-streaming setting.

84 Initial streaming experiments:
[Figure: normalized RMSE on a test set vs. number of training samples (x 10^4); synthetic dataset with d = 5, D = 100, where the first 1000 samples have d = 1; curves compare the incremental tree-based method against versions tuned for fixed d = 1, 5, 10.]
Robustness to increasing intrinsic dimension.

85 Future direction: Adaptive error bands. Would lead to more local sampling strategies.

86 TAKE HOME MESSAGE:
We can adapt to intrinsic $d(x)$ without preprocessing.
Local learners can self-tune optimally to local $d(x)$ and $\alpha(x)$.
Results extend to plug-in classification! Many potential future directions!


89 Thank you!
