Local Self-Tuning in Nonparametric Regression. Samory Kpotufe ORFE, Princeton University
Local Regression
Data: $\{(X_i, Y_i)\}_{i=1}^n$, $Y = f(X) + \text{noise}$, with $f \in \mathcal{F}$ nonparametric, i.e. $\dim(\mathcal{F}) = \infty$.
Learn: $f_n(x) = \text{avg}(Y_i)$ over Neighbors(x) (e.g. k-NN, kernel, or tree-based regression).
Quite basic, hence common in modern applications.
Sensitive to the choice of Neighbors(x): k, bandwidth h, tree cell size.
Goal: choose Neighbors(x) optimally!
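As an illustration, here is a minimal k-NN instance of the local averaging rule above. The function name and toy data are our own, not from the talk; a sketch only:

```python
import numpy as np

def knn_regress(x, X, Y, k):
    """Predict f_n(x) as the average of Y_i over the k nearest neighbors of x."""
    dists = np.linalg.norm(X - x, axis=1)   # distances from x to all training points
    nn = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    return Y[nn].mean()

# Toy check on near-noiseless data: f(x) = sum of coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
Y = X.sum(axis=1)
x0 = np.array([0.5, 0.5])
print(knn_regress(x0, X, Y, k=10))  # close to f(x0) = 1.0
```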
Performance depends on problem parameters
Performance would depend on dim(X) and how fast f varies...
Suppose $\mathcal{X} \subset \mathbb{R}^D$, and for all $x, x'$: $|f(x) - f(x')| \le \lambda \|x - x'\|^\alpha$.
Performance measure: $\|f_n - f\|^2_{2,P_X} \doteq \mathbb{E}_X |f_n(X) - f(X)|^2$.
Minimax global performance (Stone '80-'82): $\|f_n - f\|^2_{2,P_X} \asymp \lambda^{2D/(2\alpha+D)} \, n^{-2\alpha/(2\alpha+D)}$.
Some milder situations for $\mathcal{X} \subset \mathbb{R}^D$: f is quite smooth, f is sparse, f is additive, ...
Of interest here: $\mathcal{X}$ has low intrinsic dimension $d \ll D$.
$\mathcal{X} \subset \mathbb{R}^D$ but has low intrinsic dimension $d \ll D$: linear data, manifold data, sparse data.
Common approach: dimension reduction or estimation.
Dimension reduction/estimation increases tuning!
Recent alternative: $f_n$ operates in $\mathbb{R}^D$ but adapts to the unknown d of $\mathcal{X}$.
We want: $\|f_n - f\|^2_{2,P_X} \lesssim n^{-1/Cd} \ll n^{-1/CD}$.
Some work on adaptivity to intrinsic dimension:
Kernel and local polynomial regression: Bickel and Li 2006, Lafferty and Wasserman (manifold dim.).
Dyadic tree classification: Scott and Nowak (box dim.).
RPtree, dyadic tree regression: Kpotufe and Dasgupta (doubling dim.).
Adaptivity to intrinsic d
Main insight: key algorithmic quantities depend on d, not on D.
Kernel reg.: avg. mass of a ball of radius $\epsilon$ is approx. $\epsilon^d$.
RPtree: number of cells of diameter $\epsilon$ is approx. $\epsilon^{-d}$.
For Lipschitz f, $\|f_{n,\epsilon} - f\|^2_{2,P_X} \lesssim \epsilon^{-d}/n + \epsilon^2$.
Cross-validate over $\epsilon$ for a good rate in terms of d.
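The cross-validation step can be sketched as follows: hold out part of the sample and pick the $\epsilon$ (here, a box-kernel bandwidth) with the smallest validation error. The helper names and synthetic data are illustrative assumptions, not the talk's procedure:

```python
import numpy as np

def box_kernel_regress(x, X, Y, eps):
    """Average Y_i over training points within distance eps of x (box kernel)."""
    mask = np.linalg.norm(X - x, axis=1) <= eps
    return Y[mask].mean() if mask.any() else Y.mean()  # fall back to global mean

def cv_choose_eps(X, Y, grid, n_val=100):
    """Pick eps minimizing squared error on a held-out split."""
    Xtr, Ytr, Xva, Yva = X[n_val:], Y[n_val:], X[:n_val], Y[:n_val]
    errs = [np.mean([(box_kernel_regress(x, Xtr, Ytr, e) - y) ** 2
                     for x, y in zip(Xva, Yva)]) for e in grid]
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(1)
X = rng.uniform(size=(600, 3))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=600)
eps = cv_choose_eps(X, Y, grid=[0.05, 0.1, 0.2, 0.4, 0.8])
print(eps)
```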
Insights help with tuning under time constraints.
Main idea: compress the data in a way that respects the structure of $\mathcal{X}$.
Faster kernel regression [Kpo. 2009]. Fast online tree-regression [Kpo. and Orabona 2013].
$O(\log n)$ time vs. $O(n)$, and regression rates remain optimal!
So far, we have viewed d as a global characteristic of $\mathcal{X}$...
Problem complexity is likely to depend on location!
Space $\mathcal{X}$: $d = d(x)$. Function f: $\alpha = \alpha(x)$.
Choose Neighbors(x) adaptively so that $|f_n(x) - f(x)|^2 \lesssim \lambda_x^{2d_x/(2\alpha_x+d_x)} \, n^{-2\alpha_x/(2\alpha_x+d_x)}$.
Choosing Neighbors(x) is hard: we cannot cross-validate locally at x!
NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example.
Local smoothness
Use local Hölder parameters $\lambda = \lambda(x)$, $\alpha = \alpha(x)$ on $B(x, r)$: for all $x' \in B(x, r)$, $|f(x) - f(x')| \le \lambda \rho(x, x')^\alpha$.
Example: $f(x) = |x|^\alpha$ is flatter at $x = 0$ as $\alpha$ is increased.
Local dimension
[Figure: d-dimensional balls centered at x.]
Volume growth: $\mathrm{vol}(B(x, r)) = C_d r^d = \epsilon^{-d} \, \mathrm{vol}(B(x, \epsilon r))$.
If $P_X$ is uniform on $B(x, r)$, then $P_X(B(x, r)) = \epsilon^{-d} P_X(B(x, \epsilon r))$.
Def.: $P_X$ is (C, d)-homogeneous on $B(x, r)$ if for all $r' \le r$ and $\epsilon > 0$, $P_X(B(x, r')) \le C \epsilon^{-d} P_X(B(x, \epsilon r'))$.
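The homogeneity condition suggests a simple empirical check: under (C, d)-homogeneity, doubling the radius multiplies the ball mass by roughly $2^d$, so $\hat d = \log_2\big(\hat P(B(x, r)) / \hat P(B(x, r/2))\big)$ recovers the local dimension. A sketch (our own illustration, not the talk's estimator):

```python
import numpy as np

def local_dim(x, X, r):
    """Estimate local intrinsic dimension at x from the growth of empirical
    ball masses: P(B(x, r)) ~ 2^d * P(B(x, r/2)) under (C, d)-homogeneity."""
    dists = np.linalg.norm(X - x, axis=1)
    n_r, n_half = (dists <= r).sum(), (dists <= r / 2).sum()
    if n_half == 0:
        return float("nan")
    return float(np.log2(n_r / n_half))

# Data on a 2-d plane embedded in R^10: the estimate should be near 2.
rng = np.random.default_rng(2)
Z = rng.uniform(-1, 1, size=(20000, 2))
X = np.zeros((20000, 10))
X[:, :2] = Z
print(local_dim(np.zeros(10), X, r=0.5))
```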
The growth of $P_X$ can capture the intrinsic dimension in B(x).
Location of the query x matters! Size of the neighborhood B matters!
Size of the neighborhood B matters!
For k-NN or kernel regression, the size of B depends on n and (k or h).
The growth of $P_X(B)$ can capture the intrinsic dimension locally: linear data, manifold data, sparse data.
$\mathcal{X}$ can be a collection of subspaces of various dimensions.
Intrinsic d tightly captures the minimax rate.
Theorem: Consider a metric measure space $(\mathcal{X}, \rho, \mu)$ such that for all $x \in \mathcal{X}$, $r > 0$, $\epsilon > 0$, we have $\mu(B(x, r)) \le \epsilon^{-d} \mu(B(x, \epsilon r))$. Then, for any regressor $f_n$, there exists $P_{X,Y}$, where $P_X = \mu$ and $f(x) = \mathbb{E}[Y \mid x]$ is $\lambda$-Lipschitz, such that $\mathbb{E}_{P_{X,Y}^n} \|f_n - f\|^2_{2,\mu} \gtrsim \lambda^{2d/(2+d)} \, n^{-2/(2+d)}$.
NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example.
Main assumptions: $\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$; $P_X$ is locally homogeneous with unknown d(x); f is $\lambda$-Lipschitz on $\mathcal{X}$, i.e. $\alpha = 1$.
k-NN regression: $f_n(x)$ = weighted avg of $Y_i$ over the k nearest neighbors of x.
If $\mathcal{X} \subset \mathbb{R}^D$, the learner operates in $\mathbb{R}^D$! No dimensionality reduction, no dimension estimation!
Bias-variance tradeoff:
$\mathbb{E}_{(X_i, Y_i)_1^n} |f_n(x) - f(x)|^2 = \underbrace{\mathbb{E}\,|f_n(x) - \mathbb{E} f_n(x)|^2}_{\text{Variance}} + \underbrace{|\mathbb{E} f_n(x) - f(x)|^2}_{\text{Bias}^2}$.
General intuition: fix $n \ge k \ge \log n$, and let $x \in$ a region B of dimension d. The rate of convergence of $f_n(x)$ depends on:
Variance of $f_n(x) \propto 1/k$. Bias of $f_n(x) \propto r_k(x)$, the distance from x to its k-th nearest neighbor.
We have $|f_n(x) - f(x)|^2 \lesssim 1/k + r_k(x)^2$.
It turns out $r_k(x) \approx (k/n)^{1/d}$, where $d = d(B)$.
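Balancing the two terms numerically: minimizing the surrogate $1/k + (k/n)^{2/d}$ over k gives $k^* \propto n^{2/(2+d)}$ up to constants, which in turn yields the $n^{-2/(2+d)}$ rate. A quick check (illustrative only):

```python
import numpy as np

def best_k(n, d):
    """Minimize the surrogate risk 1/k + (k/n)^(2/d) over k = 1..n."""
    k = np.arange(1, n + 1)
    risk = 1.0 / k + (k / n) ** (2.0 / d)
    return int(k[np.argmin(risk)])

n, d = 100000, 2
k_star = best_k(n, d)
print(k_star, n ** (2 / (2 + d)))  # numeric minimizer vs. the n^(2/(2+d)) prediction
```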
Choosing k locally at x — intuition
Remember: cross-validation or dimension estimation at x is impractical. Instead: [selection rule illustrated on slide].
Main technical hurdle: the intrinsic dimension might vary with k.
Choosing k(x) — result
Theorem: Suppose k(x) is chosen as above. The following holds w.h.p. simultaneously for all x. Consider any B centered at x such that $P_X(B) \ge n^{-1/3}$, and suppose $P_X$ is (C, d)-homogeneous on B. We have
$|f_n(x) - f(x)|^2 \le C \lambda^2 \left(\frac{\ln n}{n P_X(B)}\right)^{2/(2+d)}$.
As $n \to \infty$, the claim applies to any B centered at x with $P_X(B) \to 0$.
NEXT:
I. Local notions of smoothness and dimension.
II. Local adaptivity to dimension: k-NN example.
III. Full local adaptivity: kernel example. (Recent work with Vikas Garg)
Main assumptions: $\mathcal{X}$ is a metric space $(\mathcal{X}, \rho)$ of diameter 1; $P_X$ is locally homogeneous with unknown d(x); f is locally Hölder with unknown $\lambda(x)$, $\alpha(x)$.
Kernel regression: $f_n(x)$ = weighted avg of $Y_i$ for $X_i \in B_\rho(x, h)$.
General intuition: fix $0 < h < 1$, and let $x \in$ a region B of dimension d. The rate of convergence of $f_n(x)$ depends on:
Variance of $f_n(x) \propto 1/n_h(x)$, where $n_h(x)$ is the number of samples in $B(x, h)$. Bias of $f_n(x) \propto h^\alpha$.
We have $|f_n(x) - f(x)|^2 \lesssim 1/n_h(x) + h^{2\alpha}$.
It turns out $n_h(x) \approx n h^d$, where $d = d(B)$.
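As in the k-NN case, balancing $1/(n h^d)$ against $h^{2\alpha}$ gives $h^* \propto n^{-1/(2\alpha+d)}$ up to constants. A quick numerical check (illustrative only):

```python
import numpy as np

def best_h(n, d, alpha):
    """Minimize the surrogate risk 1/(n h^d) + h^(2 alpha) over a fine h-grid."""
    h = np.linspace(1e-3, 1.0, 100000)
    risk = 1.0 / (n * h ** d) + h ** (2 * alpha)
    return float(h[np.argmin(risk)])

n, d, alpha = 100000, 2, 1.0
print(best_h(n, d, alpha), n ** (-1 / (2 * alpha + d)))  # minimizer vs. n^(-1/(2a+d))
```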
From the previous intuition: suppose we know $\alpha(x)$ but not $d(x)$. Monitor $1/n_h(x)$ and $h^{2\alpha}$.
Picking $\hat h_d(x)$: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
From Lepski: suppose we know $d(x)$ but not $\alpha(x)$.
Intuition: for every $h < h^*$, $\frac{1}{n h^d} > h^{2\alpha}$; therefore for such h,
$|f_n(h; x) - f(x)|^2 \lesssim \frac{1}{n h^d} + h^{2\alpha} \le 2 \cdot \frac{1}{n h^d}$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n h^d}} \right]$, $h < h^*$, must intersect!
Picking $\hat h_\alpha(x)$: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
Combine Lepski with the previous intuition: we know neither d nor $\alpha$.
All intervals $\left[ f_n(h; x) \pm 2\sqrt{\tfrac{1}{n_h(x)}} \right]$, $h < \hat h_d$, must intersect!
Picking $\hat h_{\alpha,d}(x)$: $|f_n(x) - f(x)|^2 \lesssim \mathrm{err}(\hat h_d) \approx \mathrm{err}(h^*) \approx n^{-2\alpha/(2\alpha+d)}$.
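The interval-intersection idea can be sketched as follows: grow h along a grid and stop as soon as the new interval no longer intersects the intervals of all smaller bandwidths. This is a simplified Lepski-style rule with illustrative constants (half-width $2/\sqrt{n_h}$), not the exact procedure from the talk:

```python
import numpy as np

def kernel_est(x, X, Y, h):
    """Box-kernel estimate at x and the half-width 2/sqrt(n_h) of its interval."""
    mask = np.linalg.norm(X - x, axis=1) <= h
    n_h = int(mask.sum())
    if n_h == 0:
        return None
    return Y[mask].mean(), 2.0 / np.sqrt(n_h)

def lepski_h(x, X, Y, grid):
    """Largest h on the grid whose interval still intersects the intervals of
    all smaller bandwidths (simplified Lepski-style rule)."""
    ivals, best = [], None
    for h in sorted(grid):
        est = kernel_est(x, X, Y, h)
        if est is None:
            continue
        f_h, w = est
        lo, hi = f_h - w, f_h + w
        if all(lo <= hi2 and lo2 <= hi for lo2, hi2 in ivals):
            best = h                    # still consistent with smaller bandwidths
            ivals.append((lo, hi))
        else:
            break                       # bias now dominates: stop enlarging h
    return best

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(2000, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)
h_hat = lepski_h(np.array([0.5]), X, Y, [0.02, 0.05, 0.1, 0.2, 0.5, 1.0])
print(h_hat)
```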
Choosing $\hat h_{\alpha,d}(x)$ — result
Tightness assumption on d(x): there exists $r_0$ such that for all $x \in \mathcal{X}$ there are $C, C', d$ with, for all $r \le r_0$, $C r^d \le P_X(B(x, r)) \le C' r^d$.
Theorem: Suppose $\hat h_{\alpha,d}(x)$ is chosen as described, and let $n \ge N(r_0)$. The following holds w.h.p. simultaneously for all x. Let $d, \alpha, \lambda$ be the local problem parameters on $B(x, r_0)$. We have
$|f_n(x) - f(x)|^2 \lesssim \lambda^{2d/(2\alpha+d)} \left(\frac{\ln n}{n}\right)^{2\alpha/(2\alpha+d)}$.
The rate is optimal.
Simulation on data with mixed spatial complexity... the approach works, but should be made more efficient!
Future direction: extend self-tuning to tree-based kernel implementations.
Initial experiments with tree-based kernel: without cross-validation, automatically detect the interval containing h.
Future direction: extend self-tuning to the data streaming setting.
Initial streaming experiments: synthetic dataset, d = 5, D = 100. [Figure: normalized RMSE on the test set vs. # training samples (×10⁴); curves: incremental tree-based, fixed d = 1, fixed d = 5, fixed d = 10; first 1000 samples have d = 1.]
Robustness to increasing intrinsic dimension.
Future direction: adaptive error bands. Would lead to more local sampling strategies.
TAKE-HOME MESSAGE:
We can adapt to intrinsic d(x) without preprocessing.
Local learners can self-tune optimally to local d(x) and α(x).
Results extend to plug-in classification!
Many potential future directions!
Thank you!
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationMachine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)
Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March
More informationEconomics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation
Economics 620, Lecture 19: Introduction to Nonparametric and Semiparametric Estimation Nicholas M. Kiefer Cornell University Professor N. M. Kiefer (Cornell University) Lecture 19: Nonparametric Analysis
More informationIntroduction to Machine Learning Fall 2017 Note 5. 1 Overview. 2 Metric
CS 189 Introduction to Machine Learning Fall 2017 Note 5 1 Overview Recall from our previous note that for a fixed input x, our measurement Y is a noisy measurement of the true underlying response f x):
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationMetric Spaces and Topology
Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationMath 5210, Definitions and Theorems on Metric Spaces
Math 5210, Definitions and Theorems on Metric Spaces Let (X, d) be a metric space. We will use the following definitions (see Rudin, chap 2, particularly 2.18) 1. Let p X and r R, r > 0, The ball of radius
More informationThe Decision List Machine
The Decision List Machine Marina Sokolova SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 sokolova@site.uottawa.ca Nathalie Japkowicz SITE, University of Ottawa Ottawa, Ont. Canada,K1N-6N5 nat@site.uottawa.ca
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationMath General Topology Fall 2012 Homework 6 Solutions
Math 535 - General Topology Fall 202 Homework 6 Solutions Problem. Let F be the field R or C of real or complex numbers. Let n and denote by F[x, x 2,..., x n ] the set of all polynomials in n variables
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More information21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)
10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationLocality Sensitive Hashing
Locality Sensitive Hashing February 1, 016 1 LSH in Hamming space The following discussion focuses on the notion of Locality Sensitive Hashing which was first introduced in [5]. We focus in the case of
More informationwhere m is the maximal ideal of O X,p. Note that m/m 2 is a vector space. Suppose that we are given a morphism
8. Smoothness and the Zariski tangent space We want to give an algebraic notion of the tangent space. In differential geometry, tangent vectors are equivalence classes of maps of intervals in R into the
More informationActive learning. Sanjoy Dasgupta. University of California, San Diego
Active learning Sanjoy Dasgupta University of California, San Diego Exploiting unlabeled data A lot of unlabeled data is plentiful and cheap, eg. documents off the web speech samples images and video But
More informationIntroduction to Gaussian Process
Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationSparse Kernel Machines - SVM
Sparse Kernel Machines - SVM Henrik I. Christensen Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu Henrik I. Christensen (RIM@GT) Support
More informationCSE446: non-parametric methods Spring 2017
CSE446: non-parametric methods Spring 2017 Ali Farhadi Slides adapted from Carlos Guestrin and Luke Zettlemoyer Linear Regression: What can go wrong? What do we do if the bias is too strong? Might want
More informationvariance of independent variables: sum of variances So chebyshev predicts won t stray beyond stdev.
Announcements No class monday. Metric embedding seminar. Review expectation notion of high probability. Markov. Today: Book 4.1, 3.3, 4.2 Chebyshev. Remind variance, standard deviation. σ 2 = E[(X µ X
More informationThe Curse of Dimensionality for Local Kernel Machines
The Curse of Dimensionality for Local Kernel Machines Yoshua Bengio, Olivier Delalleau & Nicolas Le Roux April 7th 2005 Yoshua Bengio, Olivier Delalleau & Nicolas Le Roux Snowbird Learning Workshop Perspective
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationLearning with multiple models. Boosting.
CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models
More informationOptimality Conditions for Nonsmooth Convex Optimization
Optimality Conditions for Nonsmooth Convex Optimization Sangkyun Lee Oct 22, 2014 Let us consider a convex function f : R n R, where R is the extended real field, R := R {, + }, which is proper (f never
More informationNonparametric Methods
Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis
More informationVoting (Ensemble Methods)
1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationIntroduction to Real Analysis
Introduction to Real Analysis Joshua Wilde, revised by Isabel Tecu, Takeshi Suzuki and María José Boccardi August 13, 2013 1 Sets Sets are the basic objects of mathematics. In fact, they are so basic that
More informationLecture 18: March 15
CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may
More informationMath 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015
Math 30-: Midterm Practice Solutions Northwestern University, Winter 015 1. Give an example of each of the following. No justification is needed. (a) A metric on R with respect to which R is bounded. (b)
More informationTsybakov noise adap/ve margin- based ac/ve learning
Tsybakov noise adap/ve margin- based ac/ve learning Aar$ Singh A. Nico Habermann Associate Professor NIPS workshop on Learning Faster from Easy Data II Dec 11, 2015 Passive Learning Ac/ve Learning (X j,?)
More informationLow Bias Bagged Support Vector Machines
Low Bias Bagged Support Vector Machines Giorgio Valentini Dipartimento di Scienze dell Informazione Università degli Studi di Milano, Italy valentini@dsi.unimi.it Thomas G. Dietterich Department of Computer
More informationLearning Theory Continued
Learning Theory Continued Machine Learning CSE446 Carlos Guestrin University of Washington May 13, 2013 1 A simple setting n Classification N data points Finite number of possible hypothesis (e.g., dec.
More informationProblem Set 1: Solutions Math 201A: Fall Problem 1. Let (X, d) be a metric space. (a) Prove the reverse triangle inequality: for every x, y, z X
Problem Set 1: s Math 201A: Fall 2016 Problem 1. Let (X, d) be a metric space. (a) Prove the reverse triangle inequality: for every x, y, z X d(x, y) d(x, z) d(z, y). (b) Prove that if x n x and y n y
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationLinear Models for Regression. Sargur Srihari
Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood
More informationA New Look at Nearest Neighbours: Identifying Benign Input Geometries via Random Projections
Journal of Machine Learning Research 6 A New Look at Nearest Neighbours: Identifying Benign Input Geometries via Random Projections Ata Kabán School of Computer Science, The University of Birmingham, Edgbaston,
More informationExploiting k-nearest Neighbor Information with Many Data
Exploiting k-nearest Neighbor Information with Many Data 2017 TEST TECHNOLOGY WORKSHOP 2017. 10. 24 (Tue.) Yung-Kyun Noh Robotics Lab., Contents Nonparametric methods for estimating density functions Nearest
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More information