Curve learning
Gérard Biau, Université Montpellier II
Summary
- The problem
- The mathematical model
- Functional classification
  1. Fourier filtering
  2. Wavelet filtering
- Applications
The problem
In many experiments, practitioners collect samples of curves and other functional observations. We wish to investigate how classical learning rules can be extended to handle functions.
A typical problem
[Figure: a sample of curves; vertical axis 0-20, horizontal axis 0-300.]
A speech recognition problem
[Figure: sample curves for the six speech classes YES, NO, BOAT, GOAT, SH, AO.]
Classification
Given a random pair $(X, Y) \in \mathbb{R}^d \times \{0, 1\}$, the problem of classification is to predict the unknown nature $Y$ of a new observation $X$.
The statistician creates a classifier $g : \mathbb{R}^d \to \{0, 1\}$ which represents her guess of the label of $X$.
An error occurs if $g(X) \neq Y$, and the probability of error of a particular classifier $g$ is
$$L(g) = P\{g(X) \neq Y\}.$$
Classification
The Bayes rule
$$g^*(x) = \begin{cases} 0 & \text{if } P\{Y = 0 \mid X = x\} \geq P\{Y = 1 \mid X = x\} \\ 1 & \text{otherwise} \end{cases}$$
is the optimal decision. That is, for any decision function $g : \mathbb{R}^d \to \{0, 1\}$,
$$P\{g^*(X) \neq Y\} \leq P\{g(X) \neq Y\}.$$
The problem is thus to construct a reasonable classifier $g_n$ based on i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$.
Error criteria
The goal is to make the error probability $P\{g_n(X) \neq Y\}$ close to the Bayes risk $L^* = L(g^*)$.
Definition. A decision rule $g_n$ is called consistent if
$$P\{g_n(X) \neq Y\} \to L^* \quad \text{as } n \to \infty.$$
Definition. A decision rule is called universally consistent if it is consistent for any distribution of the pair $(X, Y)$.
Two simple and popular rules
The moving window rule:
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = 0,\, X_i \in B_{x,h_n}\}} \geq \sum_{i=1}^{n} \mathbf{1}_{\{Y_i = 1,\, X_i \in B_{x,h_n}\}} \\ 1 & \text{otherwise.} \end{cases}$$
The nearest neighbor rule:
$$g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^{k_n} \mathbf{1}_{\{Y_{(i)}(x) = 0\}} \geq \sum_{i=1}^{k_n} \mathbf{1}_{\{Y_{(i)}(x) = 1\}} \\ 1 & \text{otherwise.} \end{cases}$$
Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
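As a concrete illustration, here is a minimal Python sketch of both rules for finite-dimensional observations, assuming the Euclidean metric; the function and variable names are illustrative, not from the reference above.

```python
import numpy as np

def moving_window_rule(x, X, Y, h):
    """Majority vote over the training points falling in the ball B_{x,h}."""
    in_ball = np.linalg.norm(X - x, axis=1) <= h  # X: (n, d) array, Y: 0/1 labels
    return 0 if np.sum(in_ball & (Y == 0)) >= np.sum(in_ball & (Y == 1)) else 1

def nearest_neighbor_rule(x, X, Y, k):
    """Majority vote among the k training points closest to x."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    votes = Y[order[:k]]
    return 0 if np.sum(votes == 0) >= np.sum(votes == 1) else 1
```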
Notation
The model. The random variable $X$ takes values in a metric space $(\mathcal{F}, \rho)$. The distribution of the pair $(X, Y)$ is completely specified by $\mu$ and $\eta$:
$$\mu(A) = P\{X \in A\} \quad \text{and} \quad \eta(x) = P\{Y = 1 \mid X = x\}.$$
And all the definitions remain the same...
Assumptions
(H1) The space $\mathcal{F}$ is complete.
(H2) The following differentiation result holds:
$$\lim_{h \to 0} \frac{1}{\mu(B_{x,h})} \int_{B_{x,h}} \eta \, d\mu = \eta(x) \quad \text{in } \mu\text{-probability}.$$
Ouch, ouch, ouch...
Results
Theorem [consistency]. Assume that (H1) and (H2) hold. If $h_n \to 0$ and, for every $k \geq 1$, $N_k(h_n/2)/n \to 0$ as $n \to \infty$, then
$$\lim_{n \to \infty} L(g_n) = L^*.$$
$h_n \sim 1/\log\log n$: curse of dimensionality.
Curse of dimensionality?
Consider the hypercube $[-a, a]^d$ and an inscribed hypersphere with radius $r = a$. Then
$$f_d = \frac{\text{volume of the sphere}}{\text{volume of the cube}} = \frac{\pi^{d/2}}{d \, 2^{d-1} \, \Gamma(d/2)}.$$

d     1   2      3      4      5      6      7
f_d   1   0.785  0.524  0.308  0.164  0.081  0.037

As the dimension increases, most spherical neighborhoods will be empty!
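The ratio is easy to check numerically; a short sketch reproducing the table above:

```python
import math

# f_d = pi^(d/2) / (d * 2^(d-1) * Gamma(d/2)): sphere-to-cube volume ratio.
for d in range(1, 8):
    f_d = math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))
    print(f"d = {d}: f_d = {f_d:.3f}")
```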
Classification in Hilbert spaces
Here, the observations $X_i$ take values in the infinite-dimensional space $L^2([0, 1])$.
The trigonometric basis, formed by the functions $\psi_1(t) = 1$, $\psi_{2j}(t) = \sqrt{2}\cos(2\pi j t)$, and $\psi_{2j+1}(t) = \sqrt{2}\sin(2\pi j t)$, $j = 1, 2, \ldots$, is a complete orthonormal system in $L^2([0, 1])$.
We let the Fourier representation of $X_i$ be
$$X_i(t) = \sum_{j=1}^{\infty} X_{ij} \, \psi_j(t).$$
Dimension reduction
Select the effective dimension $d$ and approximate each $X_i$ by
$$X_i(t) \approx \sum_{j=1}^{d} X_{ij} \, \psi_j(t).$$
Perform nearest neighbor classification on the coefficient vectors $X_i^{(d)} = (X_{i1}, \ldots, X_{id})$ for the selected dimension $d$.
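A minimal sketch of the projection step, assuming each curve is sampled on a regular grid of $[0, 1]$ so that the inner products can be approximated by Riemann sums; the function name is illustrative.

```python
import numpy as np

def fourier_coefficients(x, t, d):
    """First d coefficients of the sampled curve x in the trigonometric basis."""
    coeffs = []
    for j in range(1, d + 1):
        if j == 1:
            psi = np.ones_like(t)                                # psi_1(t) = 1
        elif j % 2 == 0:
            psi = np.sqrt(2) * np.cos(2 * np.pi * (j // 2) * t)  # psi_{2j'}
        else:
            psi = np.sqrt(2) * np.sin(2 * np.pi * (j // 2) * t)  # psi_{2j'+1}
        coeffs.append(np.mean(x * psi))        # Riemann sum for <x, psi_j>
    return np.array(coeffs)
```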
Data-splitting
We split the data into
- a training sequence $(X_1, Y_1), \ldots, (X_l, Y_l)$;
- a testing sequence $(X_{l+1}, Y_{l+1}), \ldots, (X_{m+l}, Y_{m+l})$.
The training sequence defines a class of nearest neighbor classifiers with members $g_{l,k,d}$.
Empirical risk minimization
The testing sequence is used to select a particular classifier by minimizing the penalized empirical risk
$$\frac{1}{m} \sum_{i \in J_m} \mathbf{1}_{\{g_{l,k,d}(X_i^{(d)}) \neq Y_i\}} + \frac{\lambda_d}{\sqrt{m}}.$$
In practice, the regularization penalty may be formally replaced by $\lambda_d = 0$ for $d \leq d_n$ and $\lambda_d = +\infty$ for $d \geq d_n + 1$.
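A sketch of the whole selection step, assuming the Fourier coefficient vectors of the training and testing curves have been precomputed; the penalty array lam (with lam[d] standing for $\lambda_d$) and the restriction to odd k are illustrative choices, not part of the method as stated.

```python
import numpy as np

def select_classifier(train_C, train_Y, test_C, test_Y, d_max, lam):
    """Minimize the penalized empirical risk over the pairs (k, d)."""
    m = len(test_Y)
    best, best_risk = None, np.inf
    for d in range(1, d_max + 1):
        for k in range(1, len(train_Y) + 1, 2):       # odd k avoids vote ties
            errors = 0
            for x, y in zip(test_C[:, :d], test_Y):
                dist = np.linalg.norm(train_C[:, :d] - x, axis=1)
                votes = train_Y[np.argsort(dist)[:k]]
                errors += int((votes.mean() > 0.5) != y)
            risk = errors / m + lam[d] / np.sqrt(m)   # penalized empirical risk
            if risk < best_risk:
                best, best_risk = (k, d), risk
    return best                                       # the selected pair (k, d)
```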
Result
Theorem [oracle inequality]. Assume that
$$\sum_{d=1}^{\infty} e^{-2\lambda_d^2} < \infty. \qquad (1)$$
Then there exists a constant $C > 0$ such that
$$L(\hat{g}_n) - L^* \leq \inf_{d \geq 1} \left[ (L_d^* - L^*) + \inf_{1 \leq k \leq l} \left( L(g_{l,k,d}) - L_d^* \right) + \frac{\lambda_d}{\sqrt{m}} \right] + C \sqrt{\frac{\log l}{m}}.$$
Results
Theorem [consistency]. Under assumption (1) and
$$\lim_{n \to \infty} l = \infty, \quad \lim_{n \to \infty} m = \infty, \quad \lim_{n \to \infty} \frac{\log l}{m} = 0,$$
the classifier $\hat{g}_n$ is universally consistent.
Under stronger assumptions, the classifier $\hat{g}_n$ is strongly consistent, that is,
$$P\{\hat{g}_n(X) \neq Y \mid (X_1, Y_1), \ldots, (X_n, Y_n)\} \to L^* \quad \text{with probability one.}$$
Illustration
[Figure: sample curves for the six speech classes YES, NO, BOAT, GOAT, SH, AO.]
Results
Method      Error rates
FOURIER     0.10   0.21   0.16
NN-DIRECT   0.36   0.42   0.42
QDA         0.07   0.35   0.19
Multiresolution analysis
$$V_0 \subset V_1 \subset V_2 \subset \ldots \subset L^2([0, 1]).$$
$\phi$: the father wavelet, such that $\{\phi_{j,k} : k = 0, \ldots, 2^j - 1\}$ is an orthonormal basis of $V_j$.
For each $j$, one constructs an orthonormal basis $\{\psi_{j,k} : k = 0, \ldots, 2^j - 1\}$ of $W_j$ such that $V_{j+1} = V_j \oplus W_j$. $\psi$ is the mother wavelet.
Any function $X_i$ of $L^2([0, 1])$ reads
$$X_i(t) = \sum_{j=0}^{\infty} \sum_{k=0}^{2^j - 1} \xi_{j,k}^i \, \psi_{j,k}(t) + \eta^i \, \phi_{0,0}(t).$$
[Figure: the mother wavelet $\psi$ and the rescaled wavelets $\psi_{0,0}$, $\psi_{0,1}$, $\psi_{4,4}$, $\psi_{4,8}$, $\psi_{4,12}$.]
Fix a maximum resolution level $J$ and expand each observation as
$$X_i(t) \approx \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \xi_{j,k}^i \, \psi_{j,k}(t) + \eta^i \, \phi_{0,0}(t).$$
Rewrite each $X_i$ as a linear combination of $2^J$ basis functions $\psi_j$:
$$X_i(t) \approx \sum_{j=1}^{2^J} X_{ij} \, \psi_j(t).$$
How should we select the most pertinent basis functions?
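Before turning to that question, here is a sketch of the truncated expansion with the Haar family as a concrete choice of $(\phi, \psi)$; any other wavelet basis would do, and the Riemann-sum approximation of the coefficients is an assumption.

```python
import numpy as np

def haar_psi(j, k, t):
    """psi_{j,k}(t) = 2^(j/2) psi(2^j t - k), with psi the Haar mother wavelet."""
    u = 2 ** j * t - k
    return 2 ** (j / 2) * (((0 <= u) & (u < 0.5)) * 1.0 - ((0.5 <= u) & (u < 1)) * 1.0)

def haar_coefficients(x, t, J):
    """Approximate eta^i and the xi^i_{j,k}, j < J, by Riemann sums on the grid t."""
    eta = np.mean(x)                       # coefficient on phi_{0,0} = 1 on [0, 1]
    xi = {(j, k): np.mean(x * haar_psi(j, k, t))
          for j in range(J) for k in range(2 ** j)}
    return eta, xi
```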
Thresholding
Reorder the first $2^J$ basis functions $\{\psi_1, \ldots, \psi_{2^J}\}$ into $\{\psi_{j_1}, \ldots, \psi_{j_{2^J}}\}$ via the scheme
$$\sum_{i=1}^{l} X_{ij_1}^2 \geq \sum_{i=1}^{l} X_{ij_2}^2 \geq \ldots \geq \sum_{i=1}^{l} X_{ij_{2^J}}^2.$$
Wavelet coefficients are thus ranked by a global thresholding on the mean of the squared coefficients.
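In code, the reordering is one line; a sketch assuming coeffs is an $(l \times 2^J)$ array whose column $j$ holds the coefficients $X_{ij}$ of the $l$ training curves.

```python
import numpy as np

def rank_basis_functions(coeffs):
    """Indices j_1, ..., j_{2^J} sorted by decreasing total squared coefficient."""
    energy = np.sum(coeffs ** 2, axis=0)   # sum over i of X_{ij}^2, per column j
    return np.argsort(-energy)
```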
Classification
For each $d = 1, \ldots, 2^J$, let $\mathcal{D}_l^{(d)}$ be a collection of rules $g^{(d)} : \mathbb{R}^d \to \{0, 1\}$.
Let $S_{\mathcal{C}_l^{(d)}}(m)$ denote the corresponding shatter coefficients, and let $S_{\mathcal{C}_l^{(J)}}(m)$ be the shatter coefficient of the class of all rules $\{g^{(d)} : d = 1, \ldots, 2^J\}$.
Selection step
We select both $d$ and $g^{(d)}$ optimally by minimizing the empirical probability of error:
$$\left(\hat{d}, \hat{g}^{(\hat{d})}\right) = \operatorname*{arg\,min}_{1 \leq d \leq 2^J, \; g^{(d)} \in \mathcal{D}_l^{(d)}} \; \frac{1}{m} \sum_{i \in J_m} \mathbf{1}_{[g^{(d)}(X_i^{(d)}) \neq Y_i]}.$$
Theorem.
$$L(\hat{g}) - L^* \leq L_{2^J}^* - L^* + E\left\{ \inf_{1 \leq d \leq 2^J, \; g^{(d)} \in \mathcal{D}_l^{(d)}} L_l(g^{(d)}) \right\} - L_{2^J}^* + C \, E\sqrt{\frac{\log S_{\mathcal{C}_l^{(J)}}(m)}{m}}.$$
Illustration
[Figure: sample curves for the five phoneme classes sh, iy, dcl, ao, aa; $l = 250$, $m = 250$.]
Methods
- W-QDA: $\mathcal{D}_l^{(d)}$ contains quadratic discriminant rules.
- W-NN: $\mathcal{D}_l^{(d)}$ contains nearest-neighbor rules.
- W-T: $\mathcal{D}_l^{(d)}$ contains binary trees.
- F-NN: Fourier filtering approach.
- MPLSR: Multivariate Partial Least Squares Regression.
Results
Method   Error rate   $\hat{d}$
W-QDA    0.1042        7.30
W-NN     0.1096       19.52
W-T      0.1253        9.10
F-NN     0.1277       48.76
MPLSR    0.0904        5.96
A simulated example
[Figure: six simulated sample curves.]
A simulated example
Pairs $(X_i(t), Y_i)$ are generated by
$$X_i(t) = \frac{1}{50}\left( \sin(F_i^1 \pi t) \, f_{\mu_i, \sigma_i}(t) + \sin(F_i^2 \pi t) \, f_{1 - \mu_i, \sigma_i}(t) \right) + \varepsilon_i,$$
where
- $f_{\mu,\sigma}$ is the $\mathcal{N}(\mu, \sigma^2)$ density, with $\mu \sim \mathcal{U}(0.1, 0.4)$ and $\sigma \sim \mathcal{U}(0, 0.005)$;
- $F_i^1$ and $F_i^2 \sim \mathcal{U}(50, 150)$;
- $\varepsilon_i \sim \mathcal{N}(0, 0.5)$.
For $i = 1, \ldots, n$,
$$Y_i = \begin{cases} 0 & \text{if } \mu_i \leq 0.25 \\ 1 & \text{otherwise} \end{cases}$$
(see the sketch of this generator below).
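A sketch of the generating mechanism, following the formulas above; the grid of 1000 time points is an assumption, and $\mathcal{N}(0, 0.5)$ is read as variance 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_density(t, mu, sigma):
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def simulate_pair(t):
    mu = rng.uniform(0.1, 0.4)
    sigma = rng.uniform(0, 0.005)                    # may be tiny: sharp peak
    F1, F2 = rng.uniform(50, 150, size=2)
    eps = rng.normal(0, np.sqrt(0.5), size=t.shape)  # N(0, 0.5) noise
    X = (np.sin(F1 * np.pi * t) * normal_density(t, mu, sigma)
         + np.sin(F2 * np.pi * t) * normal_density(t, 1 - mu, sigma)) / 50 + eps
    Y = 0 if mu <= 0.25 else 1
    return X, Y

t = np.linspace(0, 1, 1000)
X_1, Y_1 = simulate_pair(t)
```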
Results
Method   Error rate   $\hat{d}$
W-QDA    0.0843        7.36
W-NN     0.1255       23.92
W-T      0.1275       20.48
F-NN     0.3493       75.88
MPLSR    0.4413        3.68
Sampling
In practice, the curves are always observed at discrete time points, say $t_p$, $p = 1, \ldots, P$.
Only an estimate of the wavelet coefficients is available:
$$X_i(t) \approx \sum_{j=1}^{2^J} \hat{X}_{ij} \, \psi_j(t), \quad \text{where} \quad \hat{X}_{ij} = \frac{1}{P} \sum_{p=1}^{P} X_i(t_p) \, \psi_j(t_p).$$
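A one-line sketch of the coefficient estimate above, assuming the curve is observed on the grid $t_1, \ldots, t_P$ and $\psi_j$ is available as a function:

```python
import numpy as np

def estimated_coefficient(x_samples, t_samples, psi_j):
    """X-hat_{ij} = (1/P) * sum_p X_i(t_p) * psi_j(t_p)."""
    return np.mean(x_samples * psi_j(t_samples))
```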
Sampling consistency
Theorem. Let $\mathcal{D}_l^{(d)}$ be the family of all $k$-nearest neighbor rules, and suppose that
$$\lim_{n \to \infty} l = \infty, \quad \lim_{n \to \infty} m = \infty, \quad \lim_{n \to \infty} \frac{\log l}{m} = 0.$$
Then the selected rule $\hat{g}$ is universally consistent, that is,
$$\lim_{J \to \infty} \lim_{n \to \infty} \lim_{P \to \infty} L(\hat{g}) = L^*.$$