Curve learning
Gérard Biau, Université Montpellier II
Summary
- The problem
- The mathematical model
- Functional classification
  1. Fourier filtering
  2. Wavelet filtering
- Applications
The problem
In many experiments, practitioners collect samples of curves and other functional observations. We wish to investigate how classical learning rules can be extended to handle functions.
A typical problem
[figure]
A speech recognition problem
[figure: sample curves for the word pairs YES/NO and BOAT/GOAT, and the phonemes SH/AO]
Classification
Given a random pair (X, Y) ∈ R^d × {0, 1}, the problem of classification is to predict the unknown nature Y of a new observation X. The statistician creates a classifier g : R^d → {0, 1} which represents her guess of the label of X. An error occurs if g(X) ≠ Y, and the probability of error for a particular classifier g is

    L(g) = P{g(X) ≠ Y}.
Classification
The Bayes rule

    g*(x) = 0 if P{Y = 0 | X = x} ≥ P{Y = 1 | X = x},
            1 otherwise,

is the optimal decision. That is, for any decision function g : R^d → {0, 1},

    P{g*(X) ≠ Y} ≤ P{g(X) ≠ Y}.

The problem is thus to construct a reasonable classifier g_n based on i.i.d. observations (X_1, Y_1), ..., (X_n, Y_n).
Error criteria
The goal is to make the error probability P{g_n(X) ≠ Y} close to the Bayes error L* = P{g*(X) ≠ Y}.

Definition. A decision rule g_n is called consistent if

    P{g_n(X) ≠ Y} → L*  as n → ∞.

Definition. A decision rule is called universally consistent if it is consistent for any distribution of the pair (X, Y).
Two simple and popular rules
The moving window rule:

    g_n(x) = 0 if Σ_{i=1}^n 1{Y_i = 0, X_i ∈ B_{x,h_n}} ≥ Σ_{i=1}^n 1{Y_i = 1, X_i ∈ B_{x,h_n}},
             1 otherwise.

The nearest neighbor rule:

    g_n(x) = 0 if Σ_{i=1}^{k_n} 1{Y_(i)(x) = 0} ≥ Σ_{i=1}^{k_n} 1{Y_(i)(x) = 1},
             1 otherwise,

where Y_(i)(x) is the label of the i-th nearest neighbor of x.

Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
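Both rules are plain majority votes. A minimal sketch in Python, assuming the Euclidean metric on R^d and a hypothetical toy data set (this illustrates the finite-dimensional rules, not yet the functional setting):

```python
import numpy as np

def moving_window_rule(x, X, Y, h):
    """Majority vote among training points in the ball B_{x,h}
    (Euclidean distance at most h). Ties go to label 0, as on the slide."""
    in_ball = np.linalg.norm(X - x, axis=1) <= h
    return 0 if np.sum(Y[in_ball] == 0) >= np.sum(Y[in_ball] == 1) else 1

def nearest_neighbor_rule(x, X, Y, k):
    """Majority vote among the k training points closest to x."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    neighbors = Y[order[:k]]
    return 0 if np.sum(neighbors == 0) >= np.sum(neighbors == 1) else 1

# Toy data: two well-separated Gaussian clusters in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])
Y = np.array([0] * 20 + [1] * 20)
print(moving_window_rule(np.zeros(2), X, Y, h=0.5))  # -> 0
print(nearest_neighbor_rule(np.ones(2), X, Y, k=3))  # -> 1
```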
Notation
The model. The random variable X takes values in a metric space (F, ρ). The distribution of the pair (X, Y) is completely specified by µ and η:

    µ(A) = P{X ∈ A}  and  η(x) = P{Y = 1 | X = x}.

And all the definitions remain the same...
Assumptions
(H1) The space F is complete.

(H2) The following differentiation result holds:

    lim_{h→0} (1/µ(B_{x,h})) ∫_{B_{x,h}} η dµ = η(x)  in µ-probability.

Aïe aïe aïe...
Results
Theorem [consistency]. Assume that (H1) and (H2) hold. If h_n → 0 and, for every k ≥ 1, N_k(h_n/2)/n → 0 as n → ∞, then

    lim_{n→∞} L(g_n) = L*.

h_n ≈ 1/log log n: curse of dimensionality.
Curse of dimensionality?
Consider the hypercube [−a, a]^d and an inscribed hypersphere with radius r = a. Then

    f_d = (volume of sphere) / (volume of cube) = π^{d/2} / (d 2^{d−1} Γ(d/2)).

[table: f_d for increasing d]

As the dimension increases, most spherical neighborhoods will be empty!
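The volume ratio f_d can be evaluated directly; a short check of the formula above (the rapid decay is what empties the spherical neighborhoods):

```python
from math import pi, gamma

def sphere_to_cube_ratio(d):
    """Fraction of the hypercube [-a, a]^d occupied by the inscribed
    hypersphere of radius a; the ratio does not depend on a."""
    return pi ** (d / 2) / (d * 2 ** (d - 1) * gamma(d / 2))

for d in (1, 2, 5, 10, 20):
    print(d, sphere_to_cube_ratio(d))
# f_1 = 1, f_2 = pi/4, and the ratio collapses toward 0 as d grows
```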
Classification in Hilbert spaces
Here, the observations X_i take values in the infinite-dimensional space L²([0, 1]).

The trigonometric basis, formed by the functions ψ_1(t) = 1, ψ_{2j}(t) = √2 cos(2πjt), and ψ_{2j+1}(t) = √2 sin(2πjt), j = 1, 2, ..., is a complete orthonormal system in L²([0, 1]).

We let the Fourier representation of X_i be

    X_i(t) = Σ_{j=1}^∞ X_{ij} ψ_j(t).
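The coefficients X_{ij} = ⟨X_i, ψ_j⟩ can be approximated numerically; a sketch, assuming each curve is available on a regular grid of [0, 1] and using a simple Riemann sum for the inner product:

```python
import numpy as np

def trig_basis(j, t):
    """j-th function of the trigonometric basis of L^2([0,1])."""
    if j == 1:
        return np.ones_like(t)
    m = j // 2  # frequency index
    if j % 2 == 0:
        return np.sqrt(2) * np.cos(2 * np.pi * m * t)
    return np.sqrt(2) * np.sin(2 * np.pi * m * t)

def fourier_coefficients(x, t, d):
    """Approximate X_j = <X, psi_j> by a Riemann sum over the grid t."""
    return np.array([np.mean(x * trig_basis(j, t)) for j in range(1, d + 1)])

# Sanity check on a curve with known expansion: X = 3*psi_1 + psi_2.
t = np.linspace(0, 1, 2000, endpoint=False)
x = 3.0 + np.sqrt(2) * np.cos(2 * np.pi * t)
coefs = fourier_coefficients(x, t, d=5)
print(np.round(coefs, 3))  # -> [3. 1. 0. 0. 0.]
```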
Dimension reduction
Select the effective dimension d to approximate each X_i by

    X_i(t) ≈ Σ_{j=1}^d X_{ij} ψ_j(t).

Perform nearest neighbor classification based on the coefficient vector X_i^{(d)} = (X_{i1}, ..., X_{id}) for the selected dimension d.
Data-splitting
We split the data into
- a training sequence (X_1, Y_1), ..., (X_l, Y_l),
- a testing sequence (X_{l+1}, Y_{l+1}), ..., (X_{l+m}, Y_{l+m}).

The training sequence defines a class of nearest neighbor classifiers with members g_{l,k,d}.
Empirical risk minimization
The testing sequence is used to select a particular classifier by minimizing the penalized empirical risk

    (1/m) Σ_{i∈J_m} 1{g_{l,k,d}(X_i^{(d)}) ≠ Y_i} + λ_d/√m.

In practice, the regularization penalty may be formally replaced by λ_d = 0 for d ≤ d_n and λ_d = +∞ for d ≥ d_n + 1.
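With the practical penalty λ_d = 0 for d ≤ d_n and +∞ beyond, the selection step reduces to a search over (k, d) on the testing sequence. A sketch, assuming hypothetical coefficient vectors are already computed and using k-NN classifiers built on the training half:

```python
import numpy as np

def knn_predict(x, Xtr, Ytr, k):
    """k-NN majority vote on the first coordinates of the coefficient vectors."""
    order = np.argsort(np.linalg.norm(Xtr - x, axis=1))
    votes = Ytr[order[:k]]
    return int(np.sum(votes == 1) > np.sum(votes == 0))

def select_classifier(Xl, Yl, Xm, Ym, ks, d_n):
    """Minimize the empirical risk on the testing sequence over
    k in ks and d = 1, ..., d_n (lambda_d = 0 in this range,
    +infinity beyond, so the penalty term drops out)."""
    best, best_risk = None, np.inf
    for d in range(1, d_n + 1):
        for k in ks:
            errors = sum(knn_predict(x[:d], Xl[:, :d], Yl, k) != y
                         for x, y in zip(Xm, Ym))
            risk = errors / len(Ym)
            if risk < best_risk:
                best, best_risk = (k, d), risk
    return best, best_risk

# Hypothetical toy coefficients: only the first coordinate is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
Y = (X[:, 0] > 0).astype(int)
(k, d), risk = select_classifier(X[:40], Y[:40], X[40:], Y[40:],
                                 ks=(1, 3, 5), d_n=3)
print((k, d), risk)
```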
Result
Theorem [oracle inequality]. Assume that

    Σ_{d=1}^∞ e^{−2λ_d²} < ∞.   (1)

Then there exists a constant C > 0 such that

    E L(ĝ_n) − L* ≤ inf_{d≥1} [ (L_d* − L*) + inf_{1≤k≤l} (L(g_{l,k,d}) − L_d*) + λ_d/√m ] + C √(log l / m).
Results
Theorem [consistency]. Under assumption (1) and

    lim_{n→∞} l = ∞,  lim_{n→∞} m = ∞,  lim_{n→∞} (log l)/m = 0,

the classifier ĝ_n is universally consistent.

Under stronger assumptions, the classifier ĝ_n is strongly consistent, that is,

    P{ĝ_n(X) ≠ Y | (X_1, Y_1), ..., (X_n, Y_n)} → L*  with probability one.
Illustration
[figure: sample curves for the word pairs YES/NO and BOAT/GOAT, and the phonemes SH/AO]
Results
[table: error rates for the FOURIER, NN-DIRECT and QDA methods]
Multiresolution analysis

    V_0 ⊂ V_1 ⊂ V_2 ⊂ ... ⊂ L²([0, 1]).

φ: the father wavelet, such that {φ_{j,k} : k = 0, ..., 2^j − 1} is an orthonormal basis of V_j.

For each j, one constructs an orthonormal basis {ψ_{j,k} : k = 0, ..., 2^j − 1} of W_j such that V_{j+1} = V_j ⊕ W_j. ψ is the mother wavelet.

Any function X_i of L²([0, 1]) reads

    X_i(t) = Σ_{j=0}^∞ Σ_{k=0}^{2^j−1} ξ^i_{j,k} ψ_{j,k}(t) + η^i φ_{0,0}(t).
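A concrete instance of this expansion, assuming the Haar wavelet (the slides do not name a particular family) and inner products approximated by Riemann sums on a grid:

```python
import numpy as np

def haar_psi(j, k, t):
    """Haar wavelet psi_{j,k}(t) = 2^{j/2} psi(2^j t - k), where the
    mother wavelet psi is +1 on [0, 1/2) and -1 on [1/2, 1)."""
    u = 2 ** j * t - k
    return 2 ** (j / 2) * (((0 <= u) & (u < 0.5)).astype(float)
                           - ((0.5 <= u) & (u < 1)).astype(float))

def haar_coefficients(x, t, J):
    """xi_{j,k} = <X, psi_{j,k}> and eta = <X, phi_{0,0}> (phi_{0,0} = 1),
    approximated by Riemann sums on the grid t, for levels j < J."""
    eta = np.mean(x)
    xi = {(j, k): np.mean(x * haar_psi(j, k, t))
          for j in range(J) for k in range(2 ** j)}
    return eta, xi

# Sanity check via orthonormality: X = 2 * psi_{1,0}.
t = np.linspace(0, 1, 1024, endpoint=False)
x = 2.0 * haar_psi(1, 0, t)
eta, xi = haar_coefficients(x, t, J=3)
print(round(xi[(1, 0)], 3))  # -> 2.0
print(round(eta, 3))         # -> 0.0
```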
[figure: the mother wavelet ψ and scaled/translated versions, among them ψ_{0,0}, ψ_{0,1}, ψ_{4,4}, ψ_{4,8}, ψ_{4,12}]
Fix a maximum resolution level J and expand each observation as

    X_i(t) ≈ Σ_{j=0}^{J−1} Σ_{k=0}^{2^j−1} ξ^i_{j,k} ψ_{j,k}(t) + η^i φ_{0,0}(t).

Rewrite each X_i as a linear combination of 2^J basis functions ψ_j:

    X_i(t) ≈ Σ_{j=1}^{2^J} X_{ij} ψ_j(t).

How to select the most pertinent basis functions?
Thresholding
Reorder the first 2^J basis functions {ψ_1, ..., ψ_{2^J}} into {ψ_{j_1}, ..., ψ_{j_{2^J}}} via the scheme

    Σ_{i=1}^l X_{ij_1}² ≥ Σ_{i=1}^l X_{ij_2}² ≥ ... ≥ Σ_{i=1}^l X_{ij_{2^J}}².

Wavelet coefficients are ranked using a global thresholding on the mean of the squared coefficients.
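The reordering scheme above is a sort of the basis functions by their total squared coefficient over the training curves; a minimal sketch, with the coefficients stored in an l × 2^J matrix:

```python
import numpy as np

def rank_basis_by_energy(C):
    """C is the l x 2^J matrix of wavelet coefficients X_{ij}.
    Return column indices j_1, j_2, ... such that
    sum_i X_{i j_1}^2 >= sum_i X_{i j_2}^2 >= ..."""
    energy = np.sum(C ** 2, axis=0)
    return np.argsort(-energy)  # descending order of energy

# Small example: column energies are 13, 0.05 and 3.25.
C = np.array([[3.0, 0.1, 1.0],
              [2.0, 0.2, 1.5]])
print(rank_basis_by_energy(C))  # -> [0 2 1]
```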
Classification
For each d = 1, ..., 2^J, let D_l^{(d)} be a collection of rules g^{(d)} : R^d → {0, 1}.

Let S_{C_l^{(d)}}(m) denote the corresponding shatter coefficients, and let S_{C_l^{(J)}}(m) be the shatter coefficient of all rules {g^{(d)} : d = 1, ..., 2^J}.
Selection step
We select both d and g^{(d)} optimally by minimizing the empirical probability of error:

    (d̂, ĝ^{(d̂)}) = argmin_{1≤d≤2^J, g^{(d)}∈D_l^{(d)}} (1/m) Σ_{i∈J_m} 1[g^{(d)}(X_i^{(d)}) ≠ Y_i].

Theorem.

    E L(ĝ) − L* ≤ (L*_{2^J} − L*) + E{ inf_{1≤d≤2^J, g^{(d)}∈D_l^{(d)}} (L_l(g^{(d)}) − L*_{2^J}) } + C E√( log S_{C_l^{(J)}}(m) / m ).
Illustration
sh, iy, dcl, ao, aa; l = 250, m = 250.
Methods
- W-QDA: D_l^{(d)} contains quadratic discriminant rules.
- W-NN: D_l^{(d)} contains nearest-neighbor rules.
- W-T: D_l^{(d)} contains binary trees.
- F-NN: Fourier filtering approach.
- MPLSR: Multivariate Partial Least Squares Regression.
Results
[table: error rates and selected dimension d̂ for W-QDA, W-NN, W-T, F-NN and MPLSR]
A simulated example
Pairs (X_i(t), Y_i) are generated by

    X_i(t) = (1/50) ( sin(F_i^1 πt) f_{µ_i,σ_i}(t) + sin(F_i^2 πt) f_{1−µ_i,σ_i}(t) ) + ε_i,

where
- f_{µ,σ} is the density of N(µ, σ²), with µ ~ U(0.1, 0.4) and σ ~ U(0, 0.005),
- F_i^1 and F_i^2 ~ U(50, 150),
- ε_i ~ N(0, 0.5).

For i = 1, ..., n,

    Y_i = 0 if µ_i ≤ ..., 1 otherwise.
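A sketch of this generator, with stated assumptions where the transcription is ambiguous: f_{µ,σ} is taken as the Gaussian density, σ is bounded away from 0 so the density stays finite, and the label cut-off on µ_i, elided on the slide, is replaced by a hypothetical THRESHOLD:

```python
import numpy as np

def gaussian_density(t, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at the grid t."""
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def simulate_pair(rng, t):
    """One (X_i, Y_i) following the slide's generator. THRESHOLD is a
    hypothetical placeholder: the slide does not give the cut-off on mu_i."""
    mu = rng.uniform(0.1, 0.4)
    sigma = rng.uniform(1e-4, 0.005)  # assumption: avoid sigma = 0
    F1, F2 = rng.uniform(50, 150, size=2)
    eps = rng.normal(0, 0.5)
    x = (np.sin(F1 * np.pi * t) * gaussian_density(t, mu, sigma)
         + np.sin(F2 * np.pi * t) * gaussian_density(t, 1 - mu, sigma)) / 50 + eps
    THRESHOLD = 0.25  # hypothetical value, not from the slide
    y = 0 if mu <= THRESHOLD else 1
    return x, y

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 512)
x, y = simulate_pair(rng, t)
print(x.shape, y)
```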
Results
[table: error rates and selected dimension d̂ for W-QDA, W-NN, W-T, F-NN and MPLSR]
Sampling
In practice, the curves are always observed at discrete time points, say t_p, p = 1, ..., P. Only an estimate of the wavelet coefficients is available:

    X_i(t) ≈ Σ_{j=1}^{2^J} X̂_{ij} ψ_j(t),  where  X̂_{ij} = (1/P) Σ_{p=1}^P X_i(t_p) ψ_j(t_p).
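The estimate X̂_{ij} is a Riemann sum of the inner product over the observation grid; a sketch, assuming the curves are stored row-wise and the basis is given as a list of functions:

```python
import numpy as np

def empirical_coefficients(samples, basis, t):
    """hat{X}_{ij} = (1/P) * sum_p X_i(t_p) * psi_j(t_p).
    samples: n x P matrix of curve values; basis: list of functions of t."""
    P = len(t)
    design = np.column_stack([psi(t) for psi in basis])  # P x (number of psi_j)
    return samples @ design / P

# Sanity check with two trigonometric basis functions: X = 2*psi_1 + psi_2.
t = np.linspace(0, 1, 256, endpoint=False)
basis = [lambda u: np.ones_like(u),
         lambda u: np.sqrt(2) * np.cos(2 * np.pi * u)]
samples = np.vstack([2.0 + np.sqrt(2) * np.cos(2 * np.pi * t)])
print(np.round(empirical_coefficients(samples, basis, t), 3))  # -> [[2. 1.]]
```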
Sampling consistency
Theorem. Let D_l^{(d)} be the family of all k-nearest neighbor rules, and suppose that

    lim_{n→∞} l = ∞,  lim_{n→∞} m = ∞,  lim_{n→∞} (log l)/m = 0.

Then the selected rule ĝ is universally consistent, that is,

    lim_{J→∞} lim_{n→∞} lim_{P→∞} L(ĝ) = L*.
On the Kernel Rule for Function Classification
On the Kernel Rule for Function Classification Christophe ABRAHAM a, Gérard BIAU b, and Benoît CADRE b a ENSAM-INRA, UMR Biométrie et Analyse des Systèmes, 2 place Pierre Viala, 34060 Montpellier Cedex,
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationRandom forests and averaging classifiers
Random forests and averaging classifiers Gábor Lugosi ICREA and Pompeu Fabra University Barcelona joint work with Gérard Biau (Paris 6) Luc Devroye (McGill, Montreal) Leo Breiman Binary classification
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationRandom projection ensemble classification
Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationA talk on Oracle inequalities and regularization. by Sara van de Geer
A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationAdapted Feature Extraction and Its Applications
saito@math.ucdavis.edu 1 Adapted Feature Extraction and Its Applications Naoki Saito Department of Mathematics University of California Davis, CA 95616 email: saito@math.ucdavis.edu URL: http://www.math.ucdavis.edu/
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationIntro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation
Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationProbability Models for Bayesian Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationIs cross-validation valid for small-sample microarray classification?
Is cross-validation valid for small-sample microarray classification? Braga-Neto et al. Bioinformatics 2004 Topics in Bioinformatics Presented by Lei Xu October 26, 2004 1 Review 1) Statistical framework
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationNearest Neighbor Pattern Classification
Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Introduction to Classification Algorithms Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationEM Algorithm & High Dimensional Data
EM Algorithm & High Dimensional Data Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Gaussian EM Algorithm For the Gaussian mixture model, we have Expectation Step (E-Step): Maximization Step (M-Step): 2 EM
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationOptimal decision and naive Bayes
Optimal decision and naive Bayes Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2017 1 Outline Standard supervised learning hypothesis Optimal model The Naive Bayes Classifier 2 Data model Data
More informationBayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI
Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationWeak Signals: machine-learning meets extreme value theory
Weak Signals: machine-learning meets extreme value theory Stephan Clémençon Télécom ParisTech, LTCI, Université Paris Saclay machinelearningforbigdata.telecom-paristech.fr 2017-10-12, Workshop Big Data
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationMaterial presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.
Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?
More informationSTA 414/2104, Spring 2014, Practice Problem Set #1
STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationLearning Linear Detectors