References for online kernel methods


Holly Hensley
5 years ago

W. Liu, J. Principe, S. Haykin. Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley.
W. Liu, P. Pokharel, J. Principe. The kernel least mean square algorithm. IEEE Trans. on Signal Processing, vol. 56, no. 2, Feb.
K. Slavakis, S. Theodoridis, I. Yamada. Adaptive constrained learning in reproducing kernel Hilbert spaces. IEEE Trans. on Signal Processing, vol. 57, no. 12, Dec.
C. Richard, J.-C. Bermudez, P. Honeine. Online prediction of time series data with kernels. IEEE Trans. on Signal Processing, vol. 57, Mar.
Y. Engel, S. Mannor, R. Meir. The kernel recursive least-squares algorithm. IEEE Trans. on Signal Processing, vol. 52, no. 8, Aug.
C. Williams. Prediction with Gaussian processes: from linear regression to linear prediction and beyond. In Learning in Graphical Models, ed. M. Jordan, MIT Press.
C. Rasmussen, C. Williams. Gaussian Processes for Machine Learning. MIT Press.

Results from Probability

Classification Error (general case)

Training error: $J_{emp}(w) = \frac{1}{m} \sum_{i=1}^{m} I(d(i) \neq f(x(i), w))$
Test error: $J(w) = \int I(d \neq f(x, w))\, p(x, d)\, dx\, dd$

Change notation to learning functions h instead of parameters w:
$x \in X$ (input or instance)
$y = h(x)$, $h \in H$ (label or concept; consider binary labels)
$S = ((x(1), d(1)), \ldots, (x(m), d(m)))$ (sample drawn iid from some unknown distribution D)
$h = f(S, H)$ takes a sample S and chooses a hypothesis from H

Training error: $J_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} I(d(i) \neq h(x(i)))$
Test error: $J(h) = \int I(d \neq h(x))\, p(x, d)\, dx\, dd$
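The training error defined above is just the fraction of misclassified samples. A minimal sketch follows; the hypothesis `h_threshold` and the toy data are illustrative assumptions, not part of the slides.

```python
def empirical_error(h, xs, ds):
    """J_emp(h): fraction of samples (x(i), d(i)) with h(x(i)) != d(i)."""
    mistakes = sum(1 for x, d in zip(xs, ds) if h(x) != d)
    return mistakes / len(xs)


def h_threshold(x):
    """Hypothetical binary hypothesis: label 1 if the first coordinate is positive."""
    return 1 if x[0] > 0.0 else 0


# Toy sample S (assumed for illustration): inputs in R^2 with binary labels.
xs = [(-1.0, 2.0), (0.5, -1.0), (2.0, 0.3), (-0.2, 0.1)]
ds = [0, 1, 0, 0]

train_err = empirical_error(h_threshold, xs, ds)
print(train_err)
```

The test error J(h) cannot be computed this way because p(x, d) is unknown; in practice it is estimated on held-out samples with the same function.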

Empirical Risk Minimization (ERM)

Assume that the hypothesis class H is finite with $|H| = k$.
ERM chooses the hypothesis h* such that
$h^* = \arg\min_{h_j \in H} \frac{1}{m} \sum_{i=1}^{m} I(d(i) \neq h_j(x(i)))$
How well does ERM do at achieving small generalization error?
Bounds on the generalization error in terms of the empirical training error can be derived using the union bound and the Hoeffding inequality.
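ERM over a finite class can be sketched as a direct search over the k hypotheses. The class of one-dimensional threshold functions and the data below are assumptions chosen for illustration only.

```python
def empirical_error(h, xs, ds):
    """J_emp(h): fraction of samples misclassified by h."""
    return sum(1 for x, d in zip(xs, ds) if h(x) != d) / len(xs)


def make_threshold(t):
    """Hypothetical hypothesis h_t(x) = 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0


def erm(hypotheses, xs, ds):
    """Return (index, training error) of the hypothesis with the smallest J_emp."""
    errors = [empirical_error(h, xs, ds) for h in hypotheses]
    best = min(range(len(hypotheses)), key=errors.__getitem__)
    return best, errors[best]


# Finite hypothesis class with k = 5 (assumed grid of thresholds).
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
hypotheses = [make_threshold(t) for t in thresholds]

# Toy sample (assumed for illustration).
xs = [0.1, 0.4, 0.6, 0.9]
ds = [0, 0, 1, 1]

best_index, best_error = erm(hypotheses, xs, ds)
print(thresholds[best_index], best_error)
```

Ties are broken by taking the first minimizer, which is one arbitrary but valid choice of h*.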

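Learning Theory results for finite H

1) Given $h \in H$, then $P(|J_{emp}(h) - J(h)| > \varepsilon) \le 2 \exp(-2m\varepsilon^2)$
2) Uniform convergence: $P\left(\bigcup_j \{|J_{emp}(h_j) - J(h_j)| > \varepsilon\}\right) \le 2k \exp(-2m\varepsilon^2)$, or equivalently $P\left(\bigcap_j \{|J_{emp}(h_j) - J(h_j)| \le \varepsilon\}\right) \ge 1 - 2k \exp(-2m\varepsilon^2)$
3) Sample complexity: Let $\delta = 2k \exp(-2m\varepsilon^2)$. Then with probability at least $1 - \delta$, if $m \ge \frac{1}{2\varepsilon^2} \log\frac{2k}{\delta}$, we have $|J_{emp}(h) - J(h)| \le \varepsilon$ for all $h \in H$
4) Error bound: Solve for $\varepsilon$. Then with probability at least $1 - \delta$, for all $h \in H$, $|J_{emp}(h) - J(h)| \le \sqrt{\frac{1}{2m} \log\frac{2k}{\delta}}$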
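Results 3) and 4) are easy to evaluate numerically. The sketch below assumes natural logarithms (as implied by the exponential in the Hoeffding bound); the function names and the values of k, ε, and δ are illustrative assumptions.

```python
import math


def sample_complexity(k, eps, delta):
    """Smallest integer m with m >= log(2k/delta) / (2 eps^2), as in result 3)."""
    return math.ceil(math.log(2 * k / delta) / (2 * eps ** 2))


def error_bound(k, m, delta):
    """Uniform deviation bound sqrt(log(2k/delta) / (2m)), as in result 4)."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))


# Assumed example values: 1000 hypotheses, accuracy 0.05, confidence 95%.
k, eps, delta = 1000, 0.05, 0.05
m = sample_complexity(k, eps, delta)
print(m, error_bound(k, m, delta))
```

With m samples the bound from 4) is at most ε, and with one sample fewer it exceeds ε, so the two formulas are inverses of each other up to rounding.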

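Proofs of learning theory results

Since the pairs (x(i), d(i)) are drawn iid from an unknown distribution, for any fixed hypothesis h the indicators $I(d(i) \neq h(x(i)))$ are iid Bernoulli random variables.
We also have that $J(h) = P(d \neq h(x))$, which is the mean of each indicator.
Applying the Hoeffding inequality gives
1) $P(|J_{emp}(h) - J(h)| > \varepsilon) \le 2 \exp(-2m\varepsilon^2)$
Applying the union bound over the k hypotheses then gives
2) $P\left(\bigcap_j \{|J_{emp}(h_j) - J(h_j)| \le \varepsilon\}\right) \ge 1 - 2k \exp(-2m\varepsilon^2)$
Let $\delta = 2k \exp(-2m\varepsilon^2)$ be the confidence value, $\varepsilon$ the error, and m the sample size.
Results 3) and 4) follow from 2) by solving for m and for $\varepsilon$, respectively.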
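The two steps that the outline above leaves implicit, the union bound and the inversion of δ, can be written out as follows. This is a sketch of the standard argument, with $A_j$ introduced here as shorthand for the event $\{|J_{emp}(h_j) - J(h_j)| > \varepsilon\}$.

```latex
% Union bound over the k hypotheses, using result 1) for each A_j
P\Big(\bigcup_{j=1}^{k} A_j\Big) \le \sum_{j=1}^{k} P(A_j) \le 2k \exp(-2m\varepsilon^2)

% Setting the right-hand side equal to delta and taking logarithms
\delta = 2k \exp(-2m\varepsilon^2)
\;\Longleftrightarrow\;
\log\frac{2k}{\delta} = 2m\varepsilon^2

% Solving for m gives result 3); solving for epsilon gives result 4)
m = \frac{1}{2\varepsilon^2} \log\frac{2k}{\delta},
\qquad
\varepsilon = \sqrt{\frac{1}{2m} \log\frac{2k}{\delta}}
```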

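Generalization Theorem

Let $h \in H$ with $|H| = k$ (finite hypothesis class)
$h^* = \arg\min_{h \in H} J_{emp}(h)$ (hypothesis with best training error)
$h_{opt} = \arg\min_{h \in H} J(h)$ (best hypothesis)
Fix m and $\delta$. Then with $\varepsilon = \sqrt{\frac{1}{2m} \log\frac{2k}{\delta}}$, with probability at least $1 - \delta$,
$J(h^*) \le J(h_{opt}) + 2\varepsilon$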
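The factor of 2 comes from applying the uniform deviation bound twice, once to h* and once to h_opt. A sketch of the chain of inequalities, each holding on the event that $|J_{emp}(h) - J(h)| \le \varepsilon$ for all h in H:

```latex
\begin{aligned}
J(h^*) &\le J_{emp}(h^*) + \varepsilon
  && \text{(uniform convergence applied to } h^*\text{)} \\
       &\le J_{emp}(h_{opt}) + \varepsilon
  && \text{(} h^* \text{ minimizes the training error)} \\
       &\le J(h_{opt}) + 2\varepsilon
  && \text{(uniform convergence applied to } h_{opt}\text{)}
\end{aligned}
```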

Bias versus Variance Dilemma

Generalization error bounds depend on two terms:
$J(h^*) \le J_{emp}(h^*) + \sqrt{\frac{1}{2m} \log\frac{2k}{\delta}}$
The first term refers to bias. If H is not large enough, the bias could be high.
The second term refers to variance. If H is too large, the variance could be high.
H can be enlarged or reduced depending on whether bias or variance is too high.
Using more training examples m also reduces the second term.

Training Error and Generalization Error Plots

[Figure: error versus number of training examples m, showing J(h) and J_emp(h), one panel for a simple model and one for a complex model.]

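Structural Risk Minimization

Consider a set of growing function classes of increasing complexity:
$H_1 \subset H_2 \subset \cdots \subset H_k \subset H_{k+1} \subset \cdots$

[Figure: error versus complexity, showing the training error, the complexity term, and the resulting bound on test error.]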

Comments on generalization bounds

Bounds depend on the training error and the complexity of the hypothesis class (bias vs. variance).
Bounds do not depend on the distribution from which the examples are drawn (uniform convergence).
Bounds are not tight because they rely on the union bound.
Bounds grow slowly with k, since they depend only on log(k).
What if the hypothesis class is infinite?
Tighter bounds are found by using the VC dimension, a measure of the dimensionality of H.

VC dimension

Consider function classes where each function labels each input as 1 or 0.
A set of m points is shattered by a function class if the function class realizes all $2^m$ possible labelings of the points.
The VC dimension of a function class is the largest cardinality of a set of points that is shattered by the function class.
Example: linear threshold functions in Euclidean n-space have VC dimension n+1.
The VC dimension measures the complexity of the function class.
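Shattering can be checked by brute force for simple classes. The sketch below uses closed intervals on the real line as the function class (an assumption chosen for illustration; the slides list this class later with VC dimension 2) and assumes the points are distinct.

```python
def interval_labels(points, a, b):
    """Labeling of `points` by the indicator of the closed interval [a, b]."""
    return tuple(1 if a <= x <= b else 0 for x in points)


def realizable_labelings(points):
    """All labelings of distinct real `points` realizable by one closed interval.

    An interval labels a contiguous run of the sorted points, so intervals
    whose endpoints are sample points, plus one empty interval, are enough.
    """
    labelings = {tuple(0 for _ in points)}  # interval containing no point
    for a in points:
        for b in points:
            if a <= b:
                labelings.add(interval_labels(points, a, b))
    return labelings


def is_shattered(points):
    """True if closed intervals realize all 2^m labelings of the m points."""
    return len(realizable_labelings(points)) == 2 ** len(points)


print(is_shattered([1.0, 2.0]))
print(is_shattered([1.0, 2.0, 3.0]))
```

Two points are shattered, but three are not, because the labeling (1, 0, 1) would require an interval with a gap.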

Growth functions and numbers

Growth function: Let X be a set of inputs. Identify each function h with the set of points where its output label is 1.
$\Pi_H(X) = \{h \cap X : h \in H\}$; note that $\Pi_H(X) \subseteq \{0,1\}^X$ (power set). If equality holds, then H shatters X.
Growth number: $\Pi_H(m) = \max_{|X| = m} |\Pi_H(X)|$
VC dimension: $VC(H) = \max\{m : \Pi_H(m) = 2^m\}$
If no such maximum exists, then the VC dimension is infinite.
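These definitions translate directly into set operations when the input domain is finite. The following sketch assumes a small finite domain and represents each hypothesis as a `frozenset` of the points it labels 1; the example class (integer intervals on {0, ..., 4}) is an illustrative assumption.

```python
from itertools import combinations


def restriction(hypotheses, X):
    """Pi_H(X): the distinct sets h ∩ X over all h in H."""
    X = frozenset(X)
    return {h & X for h in hypotheses}


def growth_number(hypotheses, domain, m):
    """Pi_H(m): the largest |Pi_H(X)| over all subsets X of the domain with |X| = m."""
    return max(len(restriction(hypotheses, X)) for X in combinations(domain, m))


def vc_dimension(hypotheses, domain):
    """Largest m with Pi_H(m) = 2^m, searched over a finite domain (assumption)."""
    d = 0
    for m in range(1, len(domain) + 1):
        if growth_number(hypotheses, domain, m) != 2 ** m:
            break  # if no set of size m is shattered, no larger set is either
        d = m
    return d


# Assumed example: all integer intervals {a, ..., b} in {0, ..., 4}, plus the empty set.
domain = list(range(5))
H = {frozenset(range(a, b + 1)) for a in domain for b in domain if a <= b}
H.add(frozenset())

print([growth_number(H, domain, m) for m in range(1, 4)])
print(vc_dimension(H, domain))
```

Enumerating all subsets is exponential in the domain size, so this is only practical for toy examples.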

Capabilities of Linear Threshold Functions

Three learning algorithms for linear threshold functions (LTF) have been discussed: PLA, SVM, LS SVM (FLDA).
How can we describe the capabilities of LTFs?
Given m points, how many dichotomies can a homogeneous LTF (HLTF, zero threshold) realize?
General position (GP): m points in $R^n$ are in GP if every subset of $k \le \min(m, n)$ points is linearly independent.

Function Counting Theorem

Given m points in $R^n$ in GP, there are C(m,n) dichotomies that can be realized, where
$\Pi(m) = C(m, n) = 2 \sum_{k=0}^{n-1} \binom{m-1}{k}$
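The counting formula is straightforward to evaluate. A minimal sketch using Python's `math.comb`; the function name `count_dichotomies` is an assumption made here.

```python
from math import comb


def count_dichotomies(m, n):
    """C(m, n) = 2 * sum_{k=0}^{n-1} binom(m-1, k) for m points in GP in R^n."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


# When m <= n every dichotomy is realizable, so C(m, n) = 2^m.
print(count_dichotomies(3, 3), 2 ** 3)

# When m = 2n exactly half of the 2^m dichotomies are realizable.
n = 3
print(count_dichotomies(2 * n, n), 2 ** (2 * n))
```

`math.comb(m - 1, k)` returns 0 when k exceeds m - 1, which is what makes the sum collapse to $2^{m-1}$ for m ≤ n.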

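FCT Proof

Recursion: $C(m+1, n) = C(m, n) + C(m, n-1)$
Given m points, add a point x* so that all m+1 points are in GP.
Construct a hyperplane by projecting into the null space of x*.
For any dichotomy of the m points, x* will either be ambiguous (it can be assigned to either side) or not.
The number of dichotomies for which x* is ambiguous is $C(m, n-1)$, and each of these yields two dichotomies of the m+1 points.
Induction proof:
Base step: $C(m, 1) = C(1, n) = 2$
Induction step: apply the recursion.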
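As a sanity check on the induction, the recursion with its base cases can be compared against the closed-form sum from the Function Counting Theorem. This is a numerical check over a small range, not a proof; the function names are assumptions.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def c_recursive(m, n):
    """C(m, n) from the recursion C(m+1, n) = C(m, n) + C(m, n-1)."""
    if m == 1 or n == 1:
        return 2  # base step: C(m, 1) = C(1, n) = 2
    return c_recursive(m - 1, n) + c_recursive(m - 1, n - 1)


def c_closed_form(m, n):
    """C(m, n) = 2 * sum_{k=0}^{n-1} binom(m-1, k)."""
    return 2 * sum(comb(m - 1, k) for k in range(n))


agree = all(
    c_recursive(m, n) == c_closed_form(m, n)
    for m in range(1, 9)
    for n in range(1, 9)
)
print(agree)
```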

Graphical representation of FCT proof

[Figure: m points in GP with an added point x*, illustrating $C(m+1, n) = C(m, n) + C(m, n-1)$.]

LTF Capacity

HLTF capacity is n; LTF capacity is n+1.
If the points are not in GP, the capacity is less.
Random capacity of an HLTF is 2n.
Higher capacity is achieved by nonlinear threshold functions, with capacity dependent on the number of inputs.
An LTF can only realize a limited number of Boolean functions.

VC dimension examples

Homogeneous Linear Threshold Functions: n
Linear Threshold Functions: n+1
Quadratic Threshold Functions: (n+1)(n+2)/2
One closed interval: 2
Closed intervals:
Axis-aligned rectangles: 2n
