Curve learning
Gérard Biau, Université Montpellier II
Summary
- The problem
- The mathematical model
- Functional classification
  1. Fourier filtering
  2. Wavelet filtering
- Applications
The problem
In many experiments, practitioners collect samples of curves and other functional observations. We wish to investigate how classical learning rules can be extended to handle functions.
A typical problem
[figure]
A speech recognition problem
[figure: sample curves for the word pairs YES/NO and BOAT/GOAT, and the phonemes SH/AO]
Classification
Given a random pair (X, Y) ∈ R^d × {0, 1}, the problem of classification is to predict the unknown nature Y of a new observation X. The statistician creates a classifier g : R^d → {0, 1} which represents her guess of the label of X. An error occurs if g(X) ≠ Y, and the probability of error for a particular classifier g is

    L(g) = P{g(X) ≠ Y}.
Classification
The Bayes rule

    g*(x) = 0 if P{Y = 0 | X = x} ≥ P{Y = 1 | X = x},
            1 otherwise,

is the optimal decision. That is, for any decision function g : R^d → {0, 1},

    P{g*(X) ≠ Y} ≤ P{g(X) ≠ Y}.

The problem is thus to construct a reasonable classifier g_n based on i.i.d. observations (X_1, Y_1), ..., (X_n, Y_n).
Error criteria
The goal is to make the error probability P{g_n(X) ≠ Y} close to the Bayes error L* = P{g*(X) ≠ Y}.

Definition. A decision rule g_n is called consistent if

    P{g_n(X) ≠ Y} → L*  as n → ∞.

Definition. A decision rule is called universally consistent if it is consistent for any distribution of the pair (X, Y).
Two simple and popular rules
The moving window rule:

    g_n(x) = 0 if Σ_{i=1}^n 1{Y_i = 0, X_i ∈ B_{x,h_n}} ≥ Σ_{i=1}^n 1{Y_i = 1, X_i ∈ B_{x,h_n}},
             1 otherwise.

The nearest neighbor rule:

    g_n(x) = 0 if Σ_{i=1}^{k_n} 1{Y_(i)(x) = 0} ≥ Σ_{i=1}^{k_n} 1{Y_(i)(x) = 1},
             1 otherwise,

where Y_(i)(x) is the label of the i-th nearest neighbor of x.

Devroye, Györfi and Lugosi (1996). A Probabilistic Theory of Pattern Recognition.
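Both rules are plain majority votes. A minimal sketch in Python, assuming the Euclidean metric on R^d and a hypothetical toy data set (this illustrates the finite-dimensional rules, not yet the functional setting):

```python
import numpy as np

def moving_window_rule(x, X, Y, h):
    """Majority vote among training points in the ball B_{x,h}
    (Euclidean distance at most h). Ties go to label 0, as on the slide."""
    in_ball = np.linalg.norm(X - x, axis=1) <= h
    return 0 if np.sum(Y[in_ball] == 0) >= np.sum(Y[in_ball] == 1) else 1

def nearest_neighbor_rule(x, X, Y, k):
    """Majority vote among the k training points closest to x."""
    order = np.argsort(np.linalg.norm(X - x, axis=1))
    neighbors = Y[order[:k]]
    return 0 if np.sum(neighbors == 0) >= np.sum(neighbors == 1) else 1

# Toy data: two well-separated Gaussian clusters in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])
Y = np.array([0] * 20 + [1] * 20)
print(moving_window_rule(np.zeros(2), X, Y, h=0.5))  # -> 0
print(nearest_neighbor_rule(np.ones(2), X, Y, k=3))  # -> 1
```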
Notation
The model. The random variable X takes values in a metric space (F, ρ). The distribution of the pair (X, Y) is completely specified by µ and η:

    µ(A) = P{X ∈ A}  and  η(x) = P{Y = 1 | X = x}.

And all the definitions remain the same...
Assumptions
(H1) The space F is complete.

(H2) The following differentiation result holds:

    lim_{h→0} (1/µ(B_{x,h})) ∫_{B_{x,h}} η dµ = η(x)  in µ-probability.

Aïe aïe aïe...
Results
Theorem [consistency]. Assume that (H1) and (H2) hold. If h_n → 0 and, for every k ≥ 1, N_k(h_n/2)/n → 0 as n → ∞, then

    lim_{n→∞} L(g_n) = L*.

h_n ≈ 1/log log n: curse of dimensionality.
Curse of dimensionality?
Consider the hypercube [−a, a]^d and an inscribed hypersphere with radius r = a. Then

    f_d = (volume of sphere) / (volume of cube) = π^{d/2} / (d 2^{d−1} Γ(d/2)).

[table: f_d for increasing d]

As the dimension increases, most spherical neighborhoods will be empty!
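The volume ratio f_d can be evaluated directly; a short check of the formula above (the rapid decay is what empties the spherical neighborhoods):

```python
from math import pi, gamma

def sphere_to_cube_ratio(d):
    """Fraction of the hypercube [-a, a]^d occupied by the inscribed
    hypersphere of radius a; the ratio does not depend on a."""
    return pi ** (d / 2) / (d * 2 ** (d - 1) * gamma(d / 2))

for d in (1, 2, 5, 10, 20):
    print(d, sphere_to_cube_ratio(d))
# f_1 = 1, f_2 = pi/4, and the ratio collapses toward 0 as d grows
```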
Classification in Hilbert spaces
Here, the observations X_i take values in the infinite-dimensional space L²([0, 1]).

The trigonometric basis, formed by the functions ψ_1(t) = 1, ψ_{2j}(t) = √2 cos(2πjt), and ψ_{2j+1}(t) = √2 sin(2πjt), j = 1, 2, ..., is a complete orthonormal system in L²([0, 1]).

We let the Fourier representation of X_i be

    X_i(t) = Σ_{j=1}^∞ X_{ij} ψ_j(t).
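The coefficients X_{ij} = ⟨X_i, ψ_j⟩ can be approximated numerically; a sketch, assuming each curve is available on a regular grid of [0, 1] and using a simple Riemann sum for the inner product:

```python
import numpy as np

def trig_basis(j, t):
    """j-th function of the trigonometric basis of L^2([0,1])."""
    if j == 1:
        return np.ones_like(t)
    m = j // 2  # frequency index
    if j % 2 == 0:
        return np.sqrt(2) * np.cos(2 * np.pi * m * t)
    return np.sqrt(2) * np.sin(2 * np.pi * m * t)

def fourier_coefficients(x, t, d):
    """Approximate X_j = <X, psi_j> by a Riemann sum over the grid t."""
    return np.array([np.mean(x * trig_basis(j, t)) for j in range(1, d + 1)])

# Sanity check on a curve with known expansion: X = 3*psi_1 + psi_2.
t = np.linspace(0, 1, 2000, endpoint=False)
x = 3.0 + np.sqrt(2) * np.cos(2 * np.pi * t)
coefs = fourier_coefficients(x, t, d=5)
print(np.round(coefs, 3))  # -> [3. 1. 0. 0. 0.]
```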
Dimension reduction
Select the effective dimension d to approximate each X_i by

    X_i(t) ≈ Σ_{j=1}^d X_{ij} ψ_j(t).

Perform nearest neighbor classification based on the coefficient vector X_i^{(d)} = (X_{i1}, ..., X_{id}) for the selected dimension d.
Data-splitting
We split the data into
- a training sequence (X_1, Y_1), ..., (X_l, Y_l),
- a testing sequence (X_{l+1}, Y_{l+1}), ..., (X_{l+m}, Y_{l+m}).

The training sequence defines a class of nearest neighbor classifiers with members g_{l,k,d}.
Empirical risk minimization
The testing sequence is used to select a particular classifier by minimizing the penalized empirical risk

    (1/m) Σ_{i∈J_m} 1{g_{l,k,d}(X_i^{(d)}) ≠ Y_i} + λ_d/√m.

In practice, the regularization penalty may be formally replaced by λ_d = 0 for d ≤ d_n and λ_d = +∞ for d ≥ d_n + 1.
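With the practical penalty λ_d = 0 for d ≤ d_n and +∞ beyond, the selection step reduces to a search over (k, d) on the testing sequence. A sketch, assuming hypothetical coefficient vectors are already computed and using k-NN classifiers built on the training half:

```python
import numpy as np

def knn_predict(x, Xtr, Ytr, k):
    """k-NN majority vote on the first coordinates of the coefficient vectors."""
    order = np.argsort(np.linalg.norm(Xtr - x, axis=1))
    votes = Ytr[order[:k]]
    return int(np.sum(votes == 1) > np.sum(votes == 0))

def select_classifier(Xl, Yl, Xm, Ym, ks, d_n):
    """Minimize the empirical risk on the testing sequence over
    k in ks and d = 1, ..., d_n (lambda_d = 0 in this range,
    +infinity beyond, so the penalty term drops out)."""
    best, best_risk = None, np.inf
    for d in range(1, d_n + 1):
        for k in ks:
            errors = sum(knn_predict(x[:d], Xl[:, :d], Yl, k) != y
                         for x, y in zip(Xm, Ym))
            risk = errors / len(Ym)
            if risk < best_risk:
                best, best_risk = (k, d), risk
    return best, best_risk

# Hypothetical toy coefficients: only the first coordinate is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
Y = (X[:, 0] > 0).astype(int)
(k, d), risk = select_classifier(X[:40], Y[:40], X[40:], Y[40:],
                                 ks=(1, 3, 5), d_n=3)
print((k, d), risk)
```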
Result
Theorem [oracle inequality]. Assume that

    Σ_{d=1}^∞ e^{−2λ_d²} < ∞.   (1)

Then there exists a constant C > 0 such that

    E L(ĝ_n) − L* ≤ inf_{d≥1} [ (L_d* − L*) + inf_{1≤k≤l} (L(g_{l,k,d}) − L_d*) + λ_d/√m ] + C √(log l / m).
Results
Theorem [consistency]. Under assumption (1) and

    lim_{n→∞} l = ∞,  lim_{n→∞} m = ∞,  lim_{n→∞} (log l)/m = 0,

the classifier ĝ_n is universally consistent.

Under stronger assumptions, the classifier ĝ_n is strongly consistent, that is,

    P{ĝ_n(X) ≠ Y | (X_1, Y_1), ..., (X_n, Y_n)} → L*  with probability one.
Illustration
[figure: sample curves for the word pairs YES/NO and BOAT/GOAT, and the phonemes SH/AO]
Results
[table: error rates for the FOURIER, NN-DIRECT and QDA methods]
Multiresolution analysis

    V_0 ⊂ V_1 ⊂ V_2 ⊂ ... ⊂ L²([0, 1]).

φ: the father wavelet, such that {φ_{j,k} : k = 0, ..., 2^j − 1} is an orthonormal basis of V_j.

For each j, one constructs an orthonormal basis {ψ_{j,k} : k = 0, ..., 2^j − 1} of W_j such that V_{j+1} = V_j ⊕ W_j. ψ is the mother wavelet.

Any function X_i of L²([0, 1]) reads

    X_i(t) = Σ_{j=0}^∞ Σ_{k=0}^{2^j−1} ξ^i_{j,k} ψ_{j,k}(t) + η^i φ_{0,0}(t).
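A concrete instance of this expansion, assuming the Haar wavelet (the slides do not name a particular family) and inner products approximated by Riemann sums on a grid:

```python
import numpy as np

def haar_psi(j, k, t):
    """Haar wavelet psi_{j,k}(t) = 2^{j/2} psi(2^j t - k), where the
    mother wavelet psi is +1 on [0, 1/2) and -1 on [1/2, 1)."""
    u = 2 ** j * t - k
    return 2 ** (j / 2) * (((0 <= u) & (u < 0.5)).astype(float)
                           - ((0.5 <= u) & (u < 1)).astype(float))

def haar_coefficients(x, t, J):
    """xi_{j,k} = <X, psi_{j,k}> and eta = <X, phi_{0,0}> (phi_{0,0} = 1),
    approximated by Riemann sums on the grid t, for levels j < J."""
    eta = np.mean(x)
    xi = {(j, k): np.mean(x * haar_psi(j, k, t))
          for j in range(J) for k in range(2 ** j)}
    return eta, xi

# Sanity check via orthonormality: X = 2 * psi_{1,0}.
t = np.linspace(0, 1, 1024, endpoint=False)
x = 2.0 * haar_psi(1, 0, t)
eta, xi = haar_coefficients(x, t, J=3)
print(round(xi[(1, 0)], 3))  # -> 2.0
print(round(eta, 3))         # -> 0.0
```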
[figure: the mother wavelet ψ and scaled/translated versions, among them ψ_{0,0}, ψ_{0,1}, ψ_{4,4}, ψ_{4,8}, ψ_{4,12}]
Fix a maximum resolution level J and expand each observation as

    X_i(t) ≈ Σ_{j=0}^{J−1} Σ_{k=0}^{2^j−1} ξ^i_{j,k} ψ_{j,k}(t) + η^i φ_{0,0}(t).

Rewrite each X_i as a linear combination of 2^J basis functions ψ_j:

    X_i(t) ≈ Σ_{j=1}^{2^J} X_{ij} ψ_j(t).

How to select the most pertinent basis functions?
Thresholding
Reorder the first 2^J basis functions {ψ_1, ..., ψ_{2^J}} into {ψ_{j_1}, ..., ψ_{j_{2^J}}} via the scheme

    Σ_{i=1}^l X_{ij_1}² ≥ Σ_{i=1}^l X_{ij_2}² ≥ ... ≥ Σ_{i=1}^l X_{ij_{2^J}}².

Wavelet coefficients are ranked using a global thresholding on the mean of the squared coefficients.
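The reordering scheme above is a sort of the basis functions by their total squared coefficient over the training curves; a minimal sketch, with the coefficients stored in an l × 2^J matrix:

```python
import numpy as np

def rank_basis_by_energy(C):
    """C is the l x 2^J matrix of wavelet coefficients X_{ij}.
    Return column indices j_1, j_2, ... such that
    sum_i X_{i j_1}^2 >= sum_i X_{i j_2}^2 >= ..."""
    energy = np.sum(C ** 2, axis=0)
    return np.argsort(-energy)  # descending order of energy

# Small example: column energies are 13, 0.05 and 3.25.
C = np.array([[3.0, 0.1, 1.0],
              [2.0, 0.2, 1.5]])
print(rank_basis_by_energy(C))  # -> [0 2 1]
```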
Classification
For each d = 1, ..., 2^J, let D_l^{(d)} be a collection of rules g^{(d)} : R^d → {0, 1}.

Let S_{C_l^{(d)}}(m) denote the corresponding shatter coefficients, and let S_{C_l^{(J)}}(m) be the shatter coefficient of all rules {g^{(d)} : d = 1, ..., 2^J}.
Selection step
We select both d and g^{(d)} optimally by minimizing the empirical probability of error:

    (d̂, ĝ^{(d̂)}) = argmin_{1≤d≤2^J, g^{(d)}∈D_l^{(d)}} (1/m) Σ_{i∈J_m} 1[g^{(d)}(X_i^{(d)}) ≠ Y_i].

Theorem.

    E L(ĝ) − L* ≤ (L*_{2^J} − L*) + E{ inf_{1≤d≤2^J, g^{(d)}∈D_l^{(d)}} (L_l(g^{(d)}) − L*_{2^J}) } + C E√( log S_{C_l^{(J)}}(m) / m ).
Illustration
sh, iy, dcl, ao, aa; l = 250, m = 250.
Methods
- W-QDA: D_l^{(d)} contains quadratic discriminant rules.
- W-NN: D_l^{(d)} contains nearest-neighbor rules.
- W-T: D_l^{(d)} contains binary trees.
- F-NN: Fourier filtering approach.
- MPLSR: Multivariate Partial Least Squares Regression.
Results
[table: error rates and selected dimension d̂ for W-QDA, W-NN, W-T, F-NN and MPLSR]
A simulated example
Pairs (X_i(t), Y_i) are generated by

    X_i(t) = (1/50) ( sin(F_i^1 πt) f_{µ_i,σ_i}(t) + sin(F_i^2 πt) f_{1−µ_i,σ_i}(t) ) + ε_i,

where
- f_{µ,σ} is the density of N(µ, σ²), with µ ~ U(0.1, 0.4) and σ ~ U(0, 0.005),
- F_i^1 and F_i^2 ~ U(50, 150),
- ε_i ~ N(0, 0.5).

For i = 1, ..., n,

    Y_i = 0 if µ_i ≤ ..., 1 otherwise.
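A sketch of this generator, with stated assumptions where the transcription is ambiguous: f_{µ,σ} is taken as the Gaussian density, σ is bounded away from 0 so the density stays finite, and the label cut-off on µ_i, elided on the slide, is replaced by a hypothetical THRESHOLD:

```python
import numpy as np

def gaussian_density(t, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at the grid t."""
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def simulate_pair(rng, t):
    """One (X_i, Y_i) following the slide's generator. THRESHOLD is a
    hypothetical placeholder: the slide does not give the cut-off on mu_i."""
    mu = rng.uniform(0.1, 0.4)
    sigma = rng.uniform(1e-4, 0.005)  # assumption: avoid sigma = 0
    F1, F2 = rng.uniform(50, 150, size=2)
    eps = rng.normal(0, 0.5)
    x = (np.sin(F1 * np.pi * t) * gaussian_density(t, mu, sigma)
         + np.sin(F2 * np.pi * t) * gaussian_density(t, 1 - mu, sigma)) / 50 + eps
    THRESHOLD = 0.25  # hypothetical value, not from the slide
    y = 0 if mu <= THRESHOLD else 1
    return x, y

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 512)
x, y = simulate_pair(rng, t)
print(x.shape, y)
```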
Results
[table: error rates and selected dimension d̂ for W-QDA, W-NN, W-T, F-NN and MPLSR]
Sampling
In practice, the curves are always observed at discrete time points, say t_p, p = 1, ..., P. Only an estimate of the wavelet coefficients is available:

    X_i(t) ≈ Σ_{j=1}^{2^J} X̂_{ij} ψ_j(t),  where  X̂_{ij} = (1/P) Σ_{p=1}^P X_i(t_p) ψ_j(t_p).
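The estimate X̂_{ij} is a Riemann sum of the inner product over the observation grid; a sketch, assuming the curves are stored row-wise and the basis is given as a list of functions:

```python
import numpy as np

def empirical_coefficients(samples, basis, t):
    """hat{X}_{ij} = (1/P) * sum_p X_i(t_p) * psi_j(t_p).
    samples: n x P matrix of curve values; basis: list of functions of t."""
    P = len(t)
    design = np.column_stack([psi(t) for psi in basis])  # P x (number of psi_j)
    return samples @ design / P

# Sanity check with two trigonometric basis functions: X = 2*psi_1 + psi_2.
t = np.linspace(0, 1, 256, endpoint=False)
basis = [lambda u: np.ones_like(u),
         lambda u: np.sqrt(2) * np.cos(2 * np.pi * u)]
samples = np.vstack([2.0 + np.sqrt(2) * np.cos(2 * np.pi * t)])
print(np.round(empirical_coefficients(samples, basis, t), 3))  # -> [[2. 1.]]
```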
Sampling consistency
Theorem. Let D_l^{(d)} be the family of all k-nearest neighbor rules, and suppose that

    lim_{n→∞} l = ∞,  lim_{n→∞} m = ∞,  lim_{n→∞} (log l)/m = 0.

Then the selected rule ĝ is universally consistent, that is,

    lim_{J→∞} lim_{n→∞} lim_{P→∞} L(ĝ) = L*.
On the Kernel Rule for Function Classification
On the Kernel Rule for Function Classification Christophe ABRAHAM a, Gérard BIAU b, and Benoît CADRE b a ENSAM-INRA, UMR Biométrie et Analyse des Systèmes, 2 place Pierre Viala, 34060 Montpellier Cedex,
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationConsistency of Nearest Neighbor Methods
E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study
More informationRandom forests and averaging classifiers
Random forests and averaging classifiers Gábor Lugosi ICREA and Pompeu Fabra University Barcelona joint work with Gérard Biau (Paris 6) Luc Devroye (McGill, Montreal) Leo Breiman Binary classification
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationContents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)
Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationRandom projection ensemble classification
Random projection ensemble classification Timothy I. Cannings Statistics for Big Data Workshop, Brunel Joint work with Richard Samworth Introduction to classification Observe data from two classes, pairs
More informationRecap from previous lecture
Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationA talk on Oracle inequalities and regularization. by Sara van de Geer
A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationDirect Learning: Linear Classification. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Classification Logistic regression models for classification problem We consider two class problem: Y {0, 1}. The Bayes rule for the classification is I(P(Y = 1 X = x) > 1/2) so
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationAdapted Feature Extraction and Its Applications
saito@math.ucdavis.edu 1 Adapted Feature Extraction and Its Applications Naoki Saito Department of Mathematics University of California Davis, CA 95616 email: saito@math.ucdavis.edu URL: http://www.math.ucdavis.edu/
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationIntro. ANN & Fuzzy Systems. Lecture 15. Pattern Classification (I): Statistical Formulation
Lecture 15. Pattern Classification (I): Statistical Formulation Outline Statistical Pattern Recognition Maximum Posterior Probability (MAP) Classifier Maximum Likelihood (ML) Classifier K-Nearest Neighbor
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationProbability Models for Bayesian Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationIs cross-validation valid for small-sample microarray classification?
Is cross-validation valid for small-sample microarray classification? Braga-Neto et al. Bioinformatics 2004 Topics in Bioinformatics Presented by Lei Xu October 26, 2004 1 Review 1) Statistical framework
More informationEXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING
EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: August 30, 2018, 14.00 19.00 RESPONSIBLE TEACHER: Niklas Wahlström NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationNearest Neighbor Pattern Classification
Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Introduction to Classification Algorithms Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationCMSC858P Supervised Learning Methods
CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationEM Algorithm & High Dimensional Data
EM Algorithm & High Dimensional Data Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Gaussian EM Algorithm For the Gaussian mixture model, we have Expectation Step (E-Step): Maximization Step (M-Step): 2 EM
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationOptimal decision and naive Bayes
Optimal decision and naive Bayes Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2017 1 Outline Standard supervised learning hypothesis Optimal model The Naive Bayes Classifier 2 Data model Data
More informationBayes Classifiers. CAP5610 Machine Learning Instructor: Guo-Jun QI
Bayes Classifiers CAP5610 Machine Learning Instructor: Guo-Jun QI Recap: Joint distributions Joint distribution over Input vector X = (X 1, X 2 ) X 1 =B or B (drinking beer or not) X 2 = H or H (headache
More informationMachine Learning. Regression-Based Classification & Gaussian Discriminant Analysis. Manfred Huber
Machine Learning Regression-Based Classification & Gaussian Discriminant Analysis Manfred Huber 2015 1 Logistic Regression Linear regression provides a nice representation and an efficient solution to
More informationWeak Signals: machine-learning meets extreme value theory
Weak Signals: machine-learning meets extreme value theory Stephan Clémençon Télécom ParisTech, LTCI, Université Paris Saclay machinelearningforbigdata.telecom-paristech.fr 2017-10-12, Workshop Big Data
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationDoes Modeling Lead to More Accurate Classification?
Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationMaterial presented. Direct Models for Classification. Agenda. Classification. Classification (2) Classification by machines 6/16/2010.
Material presented Direct Models for Classification SCARF JHU Summer School June 18, 2010 Patrick Nguyen (panguyen@microsoft.com) What is classification? What is a linear classifier? What are Direct Models?
More informationSTA 414/2104, Spring 2014, Practice Problem Set #1
STA 44/4, Spring 4, Practice Problem Set # Note: these problems are not for credit, and not to be handed in Question : Consider a classification problem in which there are two real-valued inputs, and,
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationLearning Linear Detectors