
Curve learning Gérard Biau UNIVERSITÉ MONTPELLIER II p.1/35

Summary: The problem; The mathematical model; Functional classification (1. Fourier filtering, 2. Wavelet filtering); Applications. p.2/35

The problem In many experiments, practitioners collect samples of curves and other functional observations. We wish to investigate how classical learning rules can be extended to handle functions. p.3/35

A typical problem. [figure: a sample of observed curves] p.4/35

A speech recognition problem. [figure: recorded waveforms for the words YES, NO, BOAT, GOAT and the phonemes SH, AO] p.5/35

Classification. Given a random pair $(X, Y) \in \mathbb{R}^d \times \{0, 1\}$, the problem of classification is to predict the unknown nature $Y$ of a new observation $X$. The statistician creates a classifier $g: \mathbb{R}^d \to \{0, 1\}$ which represents her guess of the label of $X$. An error occurs if $g(X) \ne Y$, and the probability of error for a particular classifier $g$ is $L(g) = P\{g(X) \ne Y\}$. p.6/35

Classification. The Bayes rule
$$ g^*(x) = \begin{cases} 0 & \text{if } P\{Y = 0 \mid X = x\} \ge P\{Y = 1 \mid X = x\} \\ 1 & \text{otherwise,} \end{cases} $$
is the optimal decision. That is, for any decision function $g: \mathbb{R}^d \to \{0, 1\}$,
$$ P\{g^*(X) \ne Y\} \le P\{g(X) \ne Y\}. $$
The problem is thus to construct a reasonable classifier $g_n$ based on i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. p.7/35
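The Bayes rule is easy to see on a toy model. The sketch below is not from the talk: the priors and class-conditional densities are illustrative assumptions chosen only to show how $g^*$ and a Monte Carlo estimate of its error $L^*$ are computed.

```python
import numpy as np
from scipy.stats import norm

# Toy model (illustrative assumption): Y ~ Bernoulli(0.5),
# X | Y=0 ~ N(0, 1) and X | Y=1 ~ N(1.5, 1).
p1 = 0.5
f0, f1 = norm(0.0, 1.0), norm(1.5, 1.0)

def bayes_rule(x):
    """g*(x) = 0 iff P(Y=0 | X=x) >= P(Y=1 | X=x)."""
    post0 = (1 - p1) * f0.pdf(x)
    post1 = p1 * f1.pdf(x)          # unnormalised posteriors suffice for the comparison
    return np.where(post0 >= post1, 0, 1)

# Monte Carlo estimate of the Bayes error L* = P{g*(X) != Y}.
rng = np.random.default_rng(0)
n = 100_000
y = rng.binomial(1, p1, n)
x = np.where(y == 0, rng.normal(0.0, 1.0, n), rng.normal(1.5, 1.0, n))
print("estimated L* ~", np.mean(bayes_rule(x) != y))
```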

Error criteria. The goal is to make the error probability $P\{g_n(X) \ne Y\}$ close to the Bayes error $L^* = L(g^*)$. Definition. A decision rule $g_n$ is called consistent if $P\{g_n(X) \ne Y\} \to L^*$ as $n \to \infty$. Definition. A decision rule is called universally consistent if it is consistent for any distribution of the pair $(X, Y)$. p.8/35

Two simple and popular rules. The moving window rule:
$$ g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^n \mathbf{1}_{\{Y_i = 0,\, X_i \in B_{x,h_n}\}} \ge \sum_{i=1}^n \mathbf{1}_{\{Y_i = 1,\, X_i \in B_{x,h_n}\}} \\ 1 & \text{otherwise.} \end{cases} $$
The nearest neighbor rule:
$$ g_n(x) = \begin{cases} 0 & \text{if } \sum_{i=1}^{k_n} \mathbf{1}_{\{Y_{(i)}(x) = 0\}} \ge \sum_{i=1}^{k_n} \mathbf{1}_{\{Y_{(i)}(x) = 1\}} \\ 1 & \text{otherwise.} \end{cases} $$
Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition. p.9/35
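A minimal sketch of the two rules above for finite-dimensional inputs with the Euclidean metric; the function names and the choice of metric are mine, not part of the talk.

```python
import numpy as np

def moving_window_rule(x, X_train, y_train, h):
    """Majority vote over the training points falling in the ball B_{x,h} (ties go to class 0)."""
    in_ball = np.linalg.norm(X_train - x, axis=1) <= h
    votes0 = np.sum(y_train[in_ball] == 0)
    votes1 = np.sum(y_train[in_ball] == 1)
    return 0 if votes0 >= votes1 else 1

def knn_rule(x, X_train, y_train, k):
    """Majority vote over the k nearest neighbours of x (ties go to class 0)."""
    idx = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    return 0 if np.sum(y_train[idx] == 0) >= np.sum(y_train[idx] == 1) else 1
```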

Notation. The model. The random variable $X$ takes values in a metric space $(\mathcal{F}, \rho)$. The distribution of the pair $(X, Y)$ is completely specified by $\mu$ and $\eta$:
$$ \mu(A) = P\{X \in A\} \quad \text{and} \quad \eta(x) = P\{Y = 1 \mid X = x\}. $$
And all the definitions remain the same... p.10/35

Assumptions. (H1) The space $\mathcal{F}$ is complete. (H2) The following differentiation result holds:
$$ \lim_{h \to 0} \frac{1}{\mu(B_{x,h})} \int_{B_{x,h}} \eta \, d\mu = \eta(x) \quad \text{in } \mu\text{-probability}. $$
Ouch, ouch, ouch... p.11/35

Results. Theorem [consistency]. Assume that (H1) and (H2) hold. If $h_n \to 0$ and, for every $k \ge 1$, $N_k(h_n/2)/n \to 0$ as $n \to \infty$, then $\lim_{n \to \infty} L(g_n) = L^*$. $h_n \sim 1/\log\log n$: curse of dimensionality. p.12/35

Curse of dimensionality? Consider the hypercube $[-a, a]^d$ and an inscribed hypersphere with radius $r = a$. Then
$$ f_d = \frac{\text{volume sphere}}{\text{volume cube}} = \frac{\pi^{d/2}}{d\, 2^{d-1}\, \Gamma(d/2)}. $$
d:   1      2      3      4      5      6      7
f_d: 1      0.785  0.524  0.308  0.164  0.081  0.037
As the dimension increases, most spherical neighborhoods will be empty! p.13/35
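The ratio $f_d$ can be checked numerically; this short computation reproduces the table above.

```python
import math

def sphere_to_cube_ratio(d):
    """f_d = pi^(d/2) / (d * 2^(d-1) * Gamma(d/2))."""
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))

for d in range(1, 8):
    print(d, round(sphere_to_cube_ratio(d), 3))
# prints: 1 1.0, 2 0.785, 3 0.524, 4 0.308, 5 0.164, 6 0.081, 7 0.037
```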

Classification in Hilbert spaces. Here, the observations $X_i$ take values in the infinite dimensional space $L^2([0, 1])$. The trigonometric basis, formed by the functions $\psi_1(t) = 1$, $\psi_{2j}(t) = \sqrt{2}\cos(2\pi j t)$, and $\psi_{2j+1}(t) = \sqrt{2}\sin(2\pi j t)$, $j = 1, 2, \ldots$, is a complete orthonormal system in $L^2([0, 1])$. We let the Fourier representation of $X_i$ be
$$ X_i(t) = \sum_{j=1}^{\infty} X_{ij}\, \psi_j(t). $$
p.14/35

Dimension reduction. Select the effective dimension $d$ to approximate each $X_i$ by
$$ X_i(t) \approx \sum_{j=1}^{d} X_{ij}\, \psi_j(t). $$
Perform nearest neighbor classification based on the coefficient vector $X_i^{(d)} = (X_{i1}, \ldots, X_{id})$ for the selected dimension $d$. p.15/35
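A sketch of the Fourier filtering step, assuming each curve is observed on a regular grid of $[0, 1]$ so that the inner products $X_{ij} = \langle X_i, \psi_j \rangle$ can be approximated by Riemann sums; the grid handling and the plain-numpy nearest neighbor vote are illustrative choices, not the talk's implementation.

```python
import numpy as np

def trig_basis(t, d):
    """First d functions of the trigonometric basis psi_1, psi_2, ... evaluated on the grid t."""
    cols = [np.ones_like(t)]
    j = 1
    while len(cols) < d:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * t))
        if len(cols) < d:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * t))
        j += 1
    return np.stack(cols, axis=1)            # shape (P, d)

def fourier_coefficients(curves, t, d):
    """Approximate X_ij = <X_i, psi_j> by a Riemann sum on the grid t; curves has shape (n, P)."""
    Psi = trig_basis(t, d)
    return curves @ Psi / len(t)              # shape (n, d)

def knn_predict(x_new, X, y, k):
    """k-NN majority vote on the coefficient vectors (ties go to class 0)."""
    idx = np.argsort(np.linalg.norm(X - x_new, axis=1))[:k]
    return 0 if np.sum(y[idx] == 0) >= np.sum(y[idx] == 1) else 1
```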

Data-splitting. We split the data into a training sequence $(X_1, Y_1), \ldots, (X_l, Y_l)$ and a testing sequence $(X_{l+1}, Y_{l+1}), \ldots, (X_{m+l}, Y_{m+l})$. The training sequence defines a class of nearest neighbor classifiers with members $g_{l,k,d}$. p.16/35

Empirical risk minimization. The testing sequence is used to select a particular classifier by minimizing the penalized empirical risk
$$ \frac{1}{m} \sum_{i \in J_m} \mathbf{1}\big\{ g_{l,k,d}(X_i^{(d)}) \ne Y_i \big\} + \frac{\lambda_d}{\sqrt{m}}. $$
In practice, the regularization penalty may be formally replaced by $\lambda_d = 0$ for $d \le d_n$ and $\lambda_d = +\infty$ for $d \ge d_n + 1$. p.17/35
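A sketch of the data-splitting selection step, assuming coefficient vectors have already been computed for all $n = l + m$ curves and a k-NN predictor like the one sketched earlier is available; the grids over $k$ and $d$ and the helper names are assumptions.

```python
import numpy as np

def select_k_d(coeffs, y, l, lambdas, k_grid, knn_predict):
    """Pick (k, d) minimising empirical test error + lambda_d / sqrt(m)."""
    m = len(y) - l
    train_c, train_y = coeffs[:l], y[:l]
    test_c, test_y = coeffs[l:], y[l:]
    best, best_score = None, np.inf
    for d, lam in enumerate(lambdas, start=1):          # lambdas[d-1] plays the role of lambda_d
        for k in k_grid:
            preds = [knn_predict(x[:d], train_c[:, :d], train_y, k) for x in test_c]
            score = np.mean(np.array(preds) != test_y) + lam / np.sqrt(m)
            if score < best_score:
                best, best_score = (k, d), score
    return best
```

With the formal replacement from the slide, one would take lambdas = [0.0] * d_n, so that dimensions beyond d_n are simply excluded from the search.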

Result. Theorem [oracle inequality]. Assume that
$$ \sum_{d=1}^{\infty} e^{-2\lambda_d^2} < \infty. \qquad (1) $$
Then there exists a constant $C > 0$ such that
$$ L(\hat{g}_n) - L^* \le \inf_{d \ge 1} \Big[ (L_d^* - L^*) + \inf_{1 \le k \le l} \big( L(g_{l,k,d}) - L_d^* \big) + \frac{\lambda_d}{\sqrt{m}} \Big] + C \sqrt{\frac{\log l}{m}}. $$
p.18/35

Results. Theorem [consistency]. Under assumption (1) and
$$ \lim_{n \to \infty} l = \infty, \quad \lim_{n \to \infty} m = \infty, \quad \lim_{n \to \infty} \frac{\log l}{m} = 0, $$
the classifier $\hat{g}_n$ is universally consistent. Under stronger assumptions, the classifier $\hat{g}_n$ is strongly consistent, that is,
$$ P\big\{ \hat{g}_n(X) \ne Y \mid (X_1, Y_1), \ldots, (X_n, Y_n) \big\} \to L^* \quad \text{with probability one.} $$
p.19/35

Illustration. [figure: recorded waveforms for YES, NO, BOAT, GOAT, SH, AO] p.20/35

Results.
Method      Error rates
FOURIER     0.10  0.21  0.16
NN-DIRECT   0.36  0.42  0.42
QDA         0.07  0.35  0.19
p.21/35

Multiresolution analysis. $V_0 \subset V_1 \subset V_2 \subset \ldots \subset L^2([0, 1])$. $\phi$: the father wavelet such that $\{\phi_{j,k} : k = 0, \ldots, 2^j - 1\}$ is an orthonormal basis of $V_j$. For each $j$, one constructs an orthonormal basis $\{\psi_{j,k} : k = 0, \ldots, 2^j - 1\}$ of $W_j$ such that $V_{j+1} = V_j \oplus W_j$. $\psi$ is the mother wavelet. Any function $X_i$ of $L^2([0, 1])$ reads
$$ X_i(t) = \sum_{j=0}^{\infty} \sum_{k=0}^{2^j - 1} \xi^i_{j,k}\, \psi_{j,k}(t) + \eta^i\, \phi_{0,0}(t). $$
p.22/35

[figure: the mother wavelet ψ and the wavelets ψ_{0,0}, ψ_{0,1}, ψ_{4,4}, ψ_{4,8}, ψ_{4,12}] p.23/35

Fix a maximum resolution level $J$ and expand each observation as
$$ X_i(t) \approx \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \xi^i_{j,k}\, \psi_{j,k}(t) + \eta^i\, \phi_{0,0}(t). $$
Rewrite each $X_i$ as a linear combination of $2^J$ basis functions $\psi_j$:
$$ X_i(t) \approx \sum_{j=1}^{2^J} X_{ij}\, \psi_j(t). $$
How to select the most pertinent basis functions? p.24/35
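A sketch of the coefficient computation with the Haar system; the slides do not name a wavelet family, so Haar is an assumption made only to keep the example self-contained. The basis is ordered as $\phi_{0,0}, \psi_{0,0}, \psi_{1,0}, \psi_{1,1}, \ldots$, giving $2^J$ functions, and inner products are approximated by Riemann sums on the sampling grid.

```python
import numpy as np

def haar_mother(t):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def haar_basis(t, J):
    """phi_{0,0} followed by psi_{j,k}, j = 0..J-1, k = 0..2^j-1: exactly 2^J functions."""
    cols = [np.ones_like(t)]                           # father wavelet phi_{0,0}
    for j in range(J):
        for k in range(2 ** j):
            cols.append(2 ** (j / 2) * haar_mother(2 ** j * t - k))
    return np.stack(cols, axis=1)                       # shape (P, 2^J)

def wavelet_coefficients(curves, t, J):
    """Approximate the coefficients (eta^i, xi^i_{j,k}) by Riemann sums on the grid t."""
    B = haar_basis(t, J)
    return curves @ B / len(t)                           # shape (n, 2^J)
```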

Thresholding. Reorder the first $2^J$ basis functions $\{\psi_1, \ldots, \psi_{2^J}\}$ into $\{\psi_{j_1}, \ldots, \psi_{j_{2^J}}\}$ via the scheme
$$ \sum_{i=1}^{l} X_{ij_1}^2 \ge \sum_{i=1}^{l} X_{ij_2}^2 \ge \ldots \ge \sum_{i=1}^{l} X_{ij_{2^J}}^2. $$
Wavelet coefficients are ranked using a global thresholding on the mean of the squared coefficients. p.25/35
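Once the $l$ training coefficient vectors are stored in a matrix, the reordering is a one-liner; a minimal sketch (the helper name is mine):

```python
import numpy as np

def rank_basis_functions(train_coeffs):
    """Indices j_1, j_2, ... sorted by decreasing sum_i X_{ij}^2 over the l training curves."""
    energy = np.sum(train_coeffs ** 2, axis=0)
    return np.argsort(-energy)          # j_1 has the largest total squared coefficient
```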

Classification. For each $d = 1, \ldots, 2^J$, let $\mathcal{D}_l^{(d)}$ be a collection of rules $g^{(d)}: \mathbb{R}^d \to \{0, 1\}$. Let $S_{\mathcal{C}_l^{(d)}}(m)$ denote the corresponding shatter coefficients, and let $S_{\mathcal{C}_l}^{(J)}(m)$ be the shatter coefficient of the class of all rules $\{g^{(d)} : d = 1, \ldots, 2^J\}$. p.26/35

Selection step. We select both $d$ and $g^{(d)}$ optimally by minimizing the empirical probability of error:
$$ \big( \hat{d}, \hat{g}^{(\hat{d})} \big) = \arg\min_{1 \le d \le 2^J,\ g^{(d)} \in \mathcal{D}_l^{(d)}} \ \frac{1}{m} \sum_{i \in J_m} \mathbf{1}\big[ g^{(d)}(X_i^{(d)}) \ne Y_i \big]. $$
Theorem.
$$ L(\hat{g}) - L^* \le \big( L_{2^J}^* - L^* \big) + \mathbf{E}\Big\{ \inf_{1 \le d \le 2^J,\ g^{(d)} \in \mathcal{D}_l^{(d)}} L_l(g^{(d)}) - L_{2^J}^* \Big\} + C\, \mathbf{E}\sqrt{ \frac{\log\big( S_{\mathcal{C}_l}^{(J)}(m) \big)}{m} }. $$
p.27/35
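A sketch of the joint selection over $d$ and the candidate rules, using off-the-shelf scikit-learn classifiers (QDA, k-NN, a tree) as stand-ins for the families $\mathcal{D}_l^{(d)}$ used in the talk; the specific classifiers and their settings are assumptions, and the coefficients are assumed to be already reordered as above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

def select_dimension_and_rule(coeffs, y, l, d_max):
    """Minimise the empirical test error jointly over d and the candidate rule families."""
    candidates = {
        "QDA": QuadraticDiscriminantAnalysis,
        "NN": lambda: KNeighborsClassifier(n_neighbors=5),
        "Tree": DecisionTreeClassifier,
    }
    best, best_err = None, np.inf
    for d in range(1, d_max + 1):
        Xtr, Xte = coeffs[:l, :d], coeffs[l:, :d]
        for name, make in candidates.items():
            clf = make().fit(Xtr, y[:l])                  # fit on the training half
            err = np.mean(clf.predict(Xte) != y[l:])       # empirical error on the testing half
            if err < best_err:
                best, best_err = (d, name), err
    return best, best_err
```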

Illustration. [figure: sample curves for the phonemes sh, iy, dcl, ao, aa], l = 250, m = 250. p.28/35

Methods.
W-QDA: $\mathcal{D}_l^{(d)}$ contains quadratic discriminant rules.
W-NN: $\mathcal{D}_l^{(d)}$ contains nearest-neighbor rules.
W-T: $\mathcal{D}_l^{(d)}$ contains binary trees.
F-NN: Fourier filtering approach.
MPLSR: Multivariate Partial Least Squares Regression.
p.29/35

Results.
Method   Error rate   d̂
W-QDA    0.1042       7.30
W-NN     0.1096       19.52
W-T      0.1253       9.10
F-NN     0.1277       48.76
MPLSR    0.0904       5.96
p.30/35

A simulated example. [figure: sample simulated curves from the two classes] p.31/35

A simulated example. Pairs $(X_i(t), Y_i)$ are generated by
$$ X_i(t) = \frac{1}{50} \Big( \sin(F_i^1 \pi t)\, f_{\mu_i, \sigma_i}(t) + \sin(F_i^2 \pi t)\, f_{1-\mu_i, \sigma_i}(t) \Big) + \varepsilon_i, $$
where $f_{\mu,\sigma}$ is the $\mathcal{N}(\mu, \sigma^2)$ density, with $\mu \sim \mathcal{U}(0.1, 0.4)$ and $\sigma \sim \mathcal{U}(0, 0.005)$; $F_i^1$ and $F_i^2 \sim \mathcal{U}(50, 150)$; and $\varepsilon_i \sim \mathcal{N}(0, 0.5)$. For $i = 1, \ldots, n$,
$$ Y_i = \begin{cases} 0 & \text{if } \mu_i \le 0.25 \\ 1 & \text{otherwise.} \end{cases} $$
p.32/35
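A sketch of the simulated model as reconstructed above; the sampling grid, the $1/50$ factor, the label threshold at $0.25$, and the reading of $\mathcal{N}(0, 0.5)$ as variance $0.5$ are assumptions drawn from that reconstruction.

```python
import numpy as np
from scipy.stats import norm

def simulate(n, P=512, seed=0):
    """Generate n sampled curves on a regular grid of [0,1] with their labels."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, P)
    X = np.empty((n, P))
    y = np.empty(n, dtype=int)
    for i in range(n):
        mu = rng.uniform(0.1, 0.4)
        sigma = rng.uniform(0.0, 0.005) + 1e-6          # avoid a degenerate N(mu, 0) density
        F1, F2 = rng.uniform(50, 150, size=2)
        eps = rng.normal(0.0, np.sqrt(0.5), P)           # assumption: 0.5 read as the variance
        X[i] = (np.sin(F1 * np.pi * t) * norm.pdf(t, mu, sigma)
                + np.sin(F2 * np.pi * t) * norm.pdf(t, 1 - mu, sigma)) / 50 + eps
        y[i] = 0 if mu <= 0.25 else 1
    return X, y
```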

Results.
Method   Error rate   d̂
W-QDA    0.0843       7.36
W-NN     0.1255       23.92
W-T      0.1275       20.48
F-NN     0.3493       75.88
MPLSR    0.4413       3.68
p.33/35

Sampling. In practice, the curves are always observed at discrete time points, say $t_p$, $p = 1, \ldots, P$. Only an estimate of the wavelet coefficients is available:
$$ X_i(t) \approx \sum_{j=1}^{2^J} \hat{X}_{ij}\, \psi_j(t), \quad \text{where} \quad \hat{X}_{ij} = \frac{1}{P} \sum_{p=1}^{P} X_i(t_p)\, \psi_j(t_p). $$
p.34/35
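The discrete coefficient estimate is the same Riemann-sum projection used in the sketches above; spelled out for one curve and one basis function (the helper name is mine):

```python
import numpy as np

def estimated_coefficient(x_samples, psi_j_samples):
    """X_hat_ij = (1/P) * sum_p X_i(t_p) * psi_j(t_p), from the sampled curve and sampled basis function."""
    P = len(x_samples)
    return np.dot(x_samples, psi_j_samples) / P
```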

Sampling consistency. Theorem. Let $\mathcal{D}_l^{(d)}$ be the family of all $k$-nearest neighbor rules, and suppose that
$$ \lim_{n \to \infty} l = \infty, \quad \lim_{n \to \infty} m = \infty, \quad \lim_{n \to \infty} \frac{\log l}{m} = 0. $$
Then the selected rule $\hat{g}$ is universally consistent, that is,
$$ \lim_{J \to \infty} \lim_{n \to \infty} \lim_{P \to \infty} L(\hat{g}) = L^*. $$
p.35/35