Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition


Pattern Recognition and Machine Learning
James L. Crowley
ENSIMAG 3 - MMIS, Fall Semester 2017
Lesson 1, 4 October 2017

Outline: Learning and Evaluation for Pattern Recognition
  Notation ... 2
  1. The Pattern Recognition Problem ... 3
     Discriminant and Decision Functions ... 4
     Machine Learning ... 5
     Training and Validation ... 6
  2. Two-Class Pattern Detectors ... 7
  3. Performance Metrics for 2-Class Detectors ... 9
     ROC Curves ... 9
     True Positives and False Positives ... 9
     Precision and Recall ... 11
     F-Measure ... 11
     Accuracy ... 12
     Matthews Correlation Coefficient ... 12

Notation

x_d            A feature. An observed or measured value.
X              A vector of D features.
D              The number of dimensions of the vector X.
K              The number of classes.
C_k            The k-th class.
X ∈ C_k        Statement that the observation X is a member of class C_k.
Ĉ              The estimated class.
R(X)           A recognition function.
Ĉ = R(X)       A recognition function that predicts Ĉ from X.
               For a detection function (K = 2), Ĉ ∈ {P, N}.
y_m            The true class for an observation X_m.
{X_m}, {y_m}   Training samples of X for learning, along with ground truth y_m.
y_m            An annotation (or ground truth) for sample X_m.
M              The number of training samples.
1-2

1. The Pattern Recognition Problem

Pattern Recognition is the process of assigning observations to categories. An observation is sometimes called an "entity" or "data" depending on the context and domain.

Observations are produced by some form of sensor. A sensor is a transducer that transforms physical phenomena into digital measurements. These measurements are classically called "features". Features may be Boolean, natural numbers, integers, real numbers or symbolic labels.

In most interesting problems, the sensor provides a feature vector, X, composed of D properties or features:

X = (x_1, x_2, ..., x_D)^T

Our problem is to build a function, called a recognizer or classifier, R(X), that maps the observation X into a statement that the observation belongs to a class Ĉ from a set of K possible classes:

R(X) → Ĉ

In most classic techniques, the class Ĉ is from a set of K known classes {C_k}. For a detection function, there are K = 2 classes: k = 1 is a positive detection, P; k = 2 is a negative detection, N. Thus Ĉ ∈ {P, N}.

Almost all current classification techniques require the number of classes, K, to be fixed ({C_k} is a closed set). An interesting research problem is how to design classification algorithms that allow {C_k} to be an open set that grows with experience.
1-3

Discriminant and Decision Functions

The classification function R(X) can typically be decomposed into two parts:

Ĉ = R(X) = d(g(X))

where g(X) is a discriminant function and d(g(X)) is a decision function.

g(X):      A discriminant function that transforms X (a vector of real numbers) into R^K.
d(g(X)):   A decision function: R^K → Ĉ ∈ {C_k}.

The discriminant is typically a vector of functions, with one function for each class:

g(X) = (g_1(X), g_2(X), ..., g_K(X))^T

The decision function d(g(X)) can be an arg-max{}, a logistic sigmoid function, or any other function that selects C_k from X. A common choice is arg-max{}:

Ĉ = d(g(X)) = arg-max_k { g_k(X) }

In some problems there is a notion of cost for errors, and the cost is different for different decisions. In this case we seek to minimize the cost of an error rather than the number of errors, by biasing the classification with a notion of risk.
1-4
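As a concrete sketch of the decomposition Ĉ = d(g(X)) with d(·) = arg-max, the following uses hypothetical linear discriminants g_k(X) = W_k·X + b_k; the weights are illustrative values, not part of the lesson:

```python
# Sketch of C^ = d(g(X)) with d(.) = arg-max, for K = 3 hypothetical
# linear discriminants g_k(X) = W_k . X + b_k (weights chosen for illustration).

def g(X):
    """Discriminant: maps a feature vector X to a vector of K real values."""
    W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # one weight vector per class
    b = [0.0, 0.0, 0.5]
    return [sum(w_i * x_i for w_i, x_i in zip(W_k, X)) + b_k
            for W_k, b_k in zip(W, b)]

def d(scores):
    """Decision: arg-max over the K discriminant values, returning class index k."""
    return max(range(len(scores)), key=lambda k: scores[k])

def R(X):
    """Recognizer: C^ = d(g(X))."""
    return d(g(X))

print(R([2.0, 0.5]))   # index of the class with the largest discriminant
```

Any other decision function (e.g. a logistic sigmoid for K = 2) could replace d() without changing the structure.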

Machine Learning

Machine learning explores the study and construction of algorithms that can learn from and make predictions about data. Learning for pattern recognition is only one of many forms of machine learning. Over the last 50 years, many forms of machine learning have been developed for many different applications.

For pattern recognition, the common approach is to use a set of training data to estimate the discriminant function g(X). The danger with supervised learning is that the model may not generalize to data outside the training set. The quality of the recognizer depends on the degree to which the training data {X_m} represents the range of variations of real data. A variety of algorithms have been developed, each with its own advantages and disadvantages.

Supervised Learning
Most classical methods for learning a recognition function learn from a set of labeled training data, composed of M independent examples {X_m} for which we know the true class {y_m}. The set {X_m}, {y_m} is called the training data. Having the true class {y_m} makes it much easier to estimate the component functions g_k(X) of

g(X) = (g_1(X), g_2(X), ..., g_K(X))^T

Most of the techniques that we will see use supervised learning.

Unsupervised Learning
Unsupervised learning techniques learn the recognition function without a labeled training set. Such methods typically require a much larger sample of data for learning.

Semi-Supervised Learning
A number of hybrid algorithms exist that initiate learning from a labeled training set and then extend the learning with unlabeled data.
1-5

Training and Validation

In most learning algorithms, we use the training data both to estimate the recognizer and to evaluate the results. However, there is a FUNDAMENTAL RULE in machine learning:

NEVER TEST WITH THE SAME DATA

A typical approach is to use cross validation (also known as rotation estimation) in learning. Cross validation partitions the training data into N folds (complementary subsets). A subset of the folds is used to train the classifier, and the result is tested on the remaining folds.

A taxonomy of common techniques includes:
Exhaustive cross-validation
  o Leave-p-out cross-validation
  o Leave-one-out cross-validation
Non-exhaustive cross-validation
  o K-fold cross-validation
  o 2-fold cross-validation
  o Repeated sub-sampling validation
1-6
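The fold rotation behind K-fold cross-validation can be sketched as follows; the split logic is minimal (no shuffling, M assumed divisible by N), and the train/test steps are left as placeholders rather than a real learner:

```python
# Sketch of K-fold cross-validation: partition M training samples into N folds,
# then rotate which fold is held out for testing while the rest train.
# Assumes M is divisible by N; a real implementation would also shuffle.

def k_fold_splits(M, N):
    """Yield (train_indices, test_indices) for each of the N folds."""
    indices = list(range(M))
    fold_size = M // N
    for f in range(N):
        test = indices[f * fold_size : (f + 1) * fold_size]
        train = [i for i in indices if i not in test]
        yield train, test

# Example: 12 samples, 3 folds -> every sample is tested exactly once.
for train_idx, test_idx in k_fold_splits(12, 3):
    print(len(train_idx), len(test_idx))   # prints "8 4" for each fold
```

Leave-one-out cross-validation is the special case N = M, so each "fold" is a single sample.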

2. Two-Class Pattern Detectors

A pattern detector is a classifier with K = 2:
Class k = 1: the target pattern, also known as P or positive.
Class k = 2: everything else, also known as N or negative.

Pattern detectors are used in computer vision, for example to detect faces, road signs, publicity logos, or other patterns of interest. They are also used in signal communications, data mining and many other domains.

The pattern detector is learned as a detection function g(X) followed by a decision rule, d(). For K = 2 this can be reduced to a single function, as

g(X) = g_1(X) / (g_1(X) + g_2(X))

so that g(X) ≥ 0.5 is equivalent to g_1(X) ≥ g_2(X) (assuming g_2(X) ≥ 0).

Note that a threshold value other than 0.5 can be used. This is equivalent to biasing the detector.

The detection function is learned from a set of training data composed of M sample observations {X_m}, where each sample observation is labeled with an indicator variable {y_m}:

y_m = P or Positive for examples of the target pattern (class k = 1)
y_m = N or Negative for all other examples (class k = 2)

Observations for which g(X) ≥ 0.5 are estimated to be members of the target class. These will be called POSITIVE or P. Observations for which g(X) < 0.5 are estimated to be members of the background. These will be called NEGATIVE or N.

We can encode this as a decision function to define our detection function R(X):

R(X) = d(g(X)) = { P if g(X) ≥ 0.5
                 { N if g(X) < 0.5
1-7
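A minimal sketch of this reduction, assuming g_1 and g_2 are non-negative class likelihoods (e.g. from ratios of histograms); the numeric values below are illustrative only:

```python
# Sketch of a two-class detector: g(X) = g1(X) / (g1(X) + g2(X)), with the
# decision R(X) = P if g(X) >= 0.5 else N.  g1 and g2 stand in for
# hypothetical non-negative likelihood values for the two classes.

def g(g1, g2):
    """Single detection function built from the two discriminant values."""
    return g1 / (g1 + g2)

def R(g1, g2, threshold=0.5):
    """P if g(X) >= threshold, N otherwise; threshold != 0.5 biases the detector."""
    return 'P' if g(g1, g2) >= threshold else 'N'

print(R(0.8, 0.2))                  # 'P'  (g1 >= g2)
print(R(0.2, 0.8))                  # 'N'
print(R(0.6, 0.4, threshold=0.7))   # 'N'  (biased detector)
```

Raising the threshold above 0.5 trades missed detections for fewer false alarms, which is exactly the bias explored by the ROC curve in the next section of the notes.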

For training we need ground truth (annotation). For each training sample, the annotation or ground truth tells us the real class y_m:

y_m = { P if X_m is in the target class
      { N otherwise

The classification can be TRUE or FALSE:

if R(X_m) = y_m then T else F

This gives four cases:

R(X_m) = y_m AND R(X_m) = P  is a TRUE POSITIVE or TP
R(X_m) ≠ y_m AND R(X_m) = P  is a FALSE POSITIVE or FP
R(X_m) ≠ y_m AND R(X_m) = N  is a FALSE NEGATIVE or FN
R(X_m) = y_m AND R(X_m) = N  is a TRUE NEGATIVE or TN

To better understand the detector we need a tool to explore the trade-off between making false detections (false positives) and missed detections (false negatives). The Receiver Operating Characteristic (ROC) provides such a tool.
1-8
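The four cases can be counted directly by comparing predictions to ground truth; the labels below are made-up values for illustration:

```python
# Sketch: label each sample outcome as TP, FP, FN or TN by comparing the
# detector output R(X_m) to the ground-truth annotation y_m.

def outcome(prediction, truth):
    """Return 'TP', 'FP', 'FN' or 'TN' for one sample."""
    if prediction == 'P':
        return 'TP' if truth == 'P' else 'FP'
    else:
        return 'TN' if truth == 'N' else 'FN'

predictions = ['P', 'P', 'N', 'N', 'P']   # R(X_m), illustrative values
truths      = ['P', 'N', 'N', 'P', 'P']   # y_m, illustrative values

counts = {'TP': 0, 'FP': 0, 'FN': 0, 'TN': 0}
for p, y in zip(predictions, truths):
    counts[outcome(p, y)] += 1

print(counts)   # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```

Note that the identities #P = #TP + #FN and #N = #FP + #TN from the next page hold for these counts.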

3. Performance Metrics for 2-Class Detectors

ROC Curves

Two-class classifiers have long been used for signal detection problems in communications, and have been used to demonstrate optimality for signal detection methods. The quality metric that is used is the Receiver Operating Characteristic (ROC) curve. This curve can be used to describe or compare any method for signal or pattern detection.

The ROC curve is generated by adding a variable bias term to the discriminant function:

R(X) = d(g(X) + B)

and plotting the rate of true positive detections vs false positive detections, where R(X) is the classifier as in lesson 1. As the bias term B is swept through a range of values, it changes the ratio of true positive detections to false positives.

For a ratio of histograms, g(X) is a probability ranging from 0 to 1. B can range from less than −0.5 to more than +0.5. When B ≤ −0.5 all detections will be Negative. When B > +0.5 all detections will be Positive. Between −0.5 and +0.5, R(X) will give a mix of TP, TN, FP and FN.

The bias term B can act as an adjustable gain that sets the sensitivity of the detector. The bias term allows us to trade false positives for false negatives. The resulting curve is called a Receiver Operating Characteristic (ROC) curve. The ROC plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as a function of B for the training data {X_m}, {y_m}.

True Positives and False Positives

For each training sample, the detection is either Positive (P) or Negative (N):

IF g(X_m) + B ≥ 0.5 THEN P else N

The detection can be TRUE (T) or FALSE (F) depending on the indicator variable y_m:

IF y_m = R(X_m) THEN T else F
1-9
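Sweeping B and recording one (FPR, TPR) point per value traces the ROC curve. A sketch with made-up scores g(X_m) and labels (not output of a real detector):

```python
# Sketch of ROC generation: for each bias B, decide P if g(X_m) + B >= 0.5,
# then compute TPR = #TP/#P and FPR = #FP/#N.  Scores and labels below are
# illustrative values only.

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # g(X_m), probabilities in [0, 1]
labels = ['P', 'P', 'N', 'P', 'N', 'N']   # ground truth y_m

def roc_point(B):
    """Return (FPR, TPR) for one value of the bias term B."""
    tp = sum(1 for s, y in zip(scores, labels) if s + B >= 0.5 and y == 'P')
    fp = sum(1 for s, y in zip(scores, labels) if s + B >= 0.5 and y == 'N')
    return fp / labels.count('N'), tp / labels.count('P')

# Sweep B from -0.5 (everything Negative) to +0.5 (everything Positive).
for B in [-0.5, -0.2, 0.0, 0.2, 0.5]:
    print(B, roc_point(B))
```

At B = −0.5 the point is (0, 0), at B = +0.5 it is (1, 1), and the intermediate points trace the trade-off between false positives and missed detections.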

Combining these two values, any detection can be a True Positive (TP), False Positive (FP), True Negative (TN) or False Negative (FN). For the M samples of the training data {X_m}, {y_m} we can define:

#P as the number of Positives,
#N as the number of Negatives,
#T as the number of True, and
#F as the number of False.

From these we can define:

#TP as the number of True Positives,
#FP as the number of False Positives,
#TN as the number of True Negatives,
#FN as the number of False Negatives.

Note that #P = #TP + #FN and #N = #FP + #TN.

The True Positive Rate (TPR) is:

TPR = #TP / #P = #TP / (#TP + #FN)

The False Positive Rate (FPR) is:

FPR = #FP / #N = #FP / (#FP + #TN)

The ROC plots the TPR against the FPR as the bias B is swept through a range of values. When B is less than −0.5, all the samples are detected as N, and both the TPR and FPR are 0. As B increases, both the TPR and FPR increase. Normally the TPR should rise monotonically with the FPR. If the TPR and FPR are equal, then the detector is no better than chance. The closer the curve approaches the upper left corner, the better the detector.

                        y_m = R(X_m) ?
d(g(X_m)+B ≥ 0.5)       T                      F
P                       True Positive (TP)     False Positive (FP)
N                       True Negative (TN)     False Negative (FN)
1-10

Precision and Recall

Precision, also called Positive Predictive Value (PPV), is the fraction of retrieved instances that are relevant to the problem:

PPV = TP / (TP + FP)

A perfect precision score (PPV = 1.0) means that every result retrieved by a search was relevant, but says nothing about whether all relevant documents were retrieved.

Recall, also known as sensitivity (S), hit rate, and True Positive Rate (TPR), is the fraction of relevant instances that are retrieved:

S = TPR = TP / #P = TP / (TP + FN)

A perfect recall score (TPR = 1.0) means that all relevant documents were retrieved by the search, but says nothing about how many irrelevant documents were also retrieved.

Both precision and recall are therefore based on an understanding and measure of relevance. In our case, relevance corresponds to True. Precision answers the question: how many of the Positive elements are True? Recall answers the question: how many of the True elements are Positive?

In many domains, there is an inverse relationship between precision and recall. It is possible to increase one at the cost of reducing the other.

F-Measure

The F-measures combine precision and recall into a single value. The F_2 measure evaluates the effectiveness of retrieval with respect to a user who attaches 2 times as much importance to recall as to precision, and thus weights recall higher than precision. The F_1 score weights recall and precision equally.

F_1 Score:

F_1 = 2TP / (2TP + FP + FN)

The F_1 score is the harmonic mean of precision and sensitivity. For two values, the harmonic mean is the square of the geometric mean divided by the arithmetic mean.

Accuracy

Accuracy is the fraction of test cases that are correctly classified (T):

ACC = #T / M = (TP + TN) / M

where M is the quantity of test data.

Note that the terms Accuracy and Precision have a very different meaning in measurement theory. In measurement theory, accuracy is the average distance from a true value, while precision is a measure of the reproducibility of the measurement.

Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is a measure of the quality of binary (two-class) classifications. This measure was proposed by the biochemist Brian W. Matthews in 1975. The MCC takes into account true and false positives and negatives, and is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications.

The MCC returns a value between +1 and −1, where +1 represents a perfect prediction, 0 a prediction no better than random, and −1 total disagreement between prediction and observation:

MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The original formula given by Matthews was:
1-12

With M the total quantity of test data:

M = TN + TP + FN + FP
S = (TP + FN) / M
P = (TP + FP) / M

MCC = (TP/M − S·P) / √(P·S·(1 − S)(1 − P))
1-13
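The two forms of the MCC can be checked numerically to agree; the confusion-matrix counts below are illustrative values, not data from the lesson:

```python
# Sketch: compute the MCC both ways - the confusion-matrix form and
# Matthews' original form via S and P - and verify that they agree.
from math import sqrt, isclose

TP, TN, FP, FN = 40, 30, 10, 20   # illustrative counts

# Confusion-matrix form
mcc1 = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# Matthews' original form
M = TN + TP + FN + FP
S = (TP + FN) / M                 # fraction of samples that are truly positive
P = (TP + FP) / M                 # fraction of samples predicted positive
mcc2 = (TP / M - S * P) / sqrt(P * S * (1 - S) * (1 - P))

print(round(mcc1, 6), round(mcc2, 6), isclose(mcc1, mcc2))
```

Both expressions are algebraically identical; the S-and-P form simply normalizes the counts by M first.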