
PATTERN RECOGNITION AND MACHINE LEARNING
Slide Set 3: Detection Theory
January 2018
Heikki Huttunen
heikki.huttunen@tut.fi
Department of Signal Processing
Tampere University of Technology

Detection theory

• In this section, we will briefly consider detection theory.
• Detection theory has many topics in common with machine learning.
• The methods are based on estimation theory and attempt to answer questions such as:
  • Is a signal of a specific model present in our time series? E.g., detection of a noisy sinusoid: beep or no beep?
  • Is the transmitted pulse present in the radar signal at time t?
  • Does the mean level of a signal change at time t?
  • After calculating the mean change in pixel values of subsequent frames in a video, is there something moving in the scene?
  • Is there a person in this video frame?

Detection theory

• The area is closely related to hypothesis testing, which is widely used e.g. in medicine: is the response in patients due to the new drug or due to random fluctuations?
• Consider the detection of a sinusoidal waveform:

[Figure: three panels (Noiseless Signal, Noisy Signal, Detection Result) over samples 0 to 900.]

Detection theory

• In our case, the hypotheses could be

$$H_1: x[n] = A \cos(2\pi f_0 n + \phi) + w[n]$$
$$H_0: x[n] = w[n]$$

• This example corresponds to the detection of a noisy sinusoid.
• The hypothesis $H_1$ corresponds to the case that the sinusoid is present, and is called the alternative hypothesis.
• The hypothesis $H_0$ corresponds to the case that the measurements consist of noise only, and is called the null hypothesis.

Introductory Example

• The Neyman-Pearson approach is the classical way of solving detection problems in an optimal manner.
• It relies on the so-called Neyman-Pearson theorem.
• Before stating the theorem, consider a simplistic detection problem, where we observe one sample $x[0]$ from one of two densities: $\mathcal{N}(0, 1)$ or $\mathcal{N}(1, 1)$.
• The task is to choose the correct density in an optimal manner.
• Our hypotheses are now $H_1: \mu = 1$ and $H_0: \mu = 0$, and the corresponding likelihoods are plotted below.

Introductory Example

[Figure: the likelihoods $p(x[0] \mid H_0)$ and $p(x[0] \mid H_1)$ of observing different values of $x[0]$, plotted over $-4 \le x[0] \le 4$.]

• An obvious approach for deciding the density would be to choose the one which is higher for the particular $x[0]$.
• More specifically, study the likelihoods and choose the more likely one.

Introductory Example

• The likelihoods are

$$H_1: \quad p(x[0] \mid \mu = 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x[0] - 1)^2}{2}\right),$$
$$H_0: \quad p(x[0] \mid \mu = 0) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x[0])^2}{2}\right).$$

• One should select $H_1$ if "$\mu = 1$" is more likely than "$\mu = 0$".
• In other words, select $H_1$ if $p(x[0] \mid \mu = 1) > p(x[0] \mid \mu = 0)$.

Introductory Example

• Let's state this in terms of $x[0]$:

$$p(x[0] \mid \mu = 1) > p(x[0] \mid \mu = 0)$$
$$\iff \frac{p(x[0] \mid \mu = 1)}{p(x[0] \mid \mu = 0)} > 1$$
$$\iff \frac{\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x[0]-1)^2}{2}\right)}{\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x[0])^2}{2}\right)} > 1$$

Introductory Example

$$\iff \exp\left(-\frac{(x[0]-1)^2 - (x[0])^2}{2}\right) > 1$$
$$\iff (x[0])^2 - (x[0]-1)^2 > 0$$
$$\iff 2x[0] - 1 > 0$$
$$\iff x[0] > \frac{1}{2}.$$

• In other words, choose $H_1$ if $x[0] > 0.5$ and $H_0$ if $x[0] < 0.5$.
• Studying the ratio of likelihoods on the second row of the derivation is the key.
• This ratio is called the likelihood ratio, and its comparison to a threshold $\gamma$ (here $\gamma = 1$) is called the likelihood ratio test (LRT).
• Of course, the threshold may also be chosen differently from $\gamma = 1$.
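To make the test concrete, here is a minimal sketch of the LRT in Python (the helper name lrt_decision is illustrative, not from the slides), computing the likelihood ratio from the two Gaussian densities and comparing it to the threshold:

import scipy.stats as stats

def lrt_decision(x0, gamma=1.0):
    # Likelihood ratio of N(1,1) versus N(0,1) for a single sample x0
    L = stats.norm.pdf(x0, loc=1, scale=1) / stats.norm.pdf(x0, loc=0, scale=1)
    return 1 if L > gamma else 0  # 1 = decide H1, 0 = decide H0

print(lrt_decision(0.7))  # 1, since 0.7 > 0.5
print(lrt_decision(0.3))  # 0, since 0.3 < 0.5

With gamma = 1 this reproduces the rule "choose H1 if x[0] > 0.5" derived above.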

Error Types

• It might be that the detection problem is not symmetric, and some errors are more costly than others.
• For example, when detecting a disease, a missed detection is more costly than a false alarm.
• The tradeoff between misses and false alarms can be adjusted using the threshold of the LRT.

Error Types

• The figure below illustrates the probabilities of the two kinds of errors.
• The blue area on the left corresponds to the probability of choosing $H_1$ while $H_0$ holds (a false alarm).
• The red area is the probability of choosing $H_0$ while $H_1$ holds (a missed detection).

[Figure: the two likelihoods with shaded error regions, labeled "Decide H0 when H1 holds" (red) and "Decide H1 when H0 holds" (blue).]

Error Types

• By adjusting the detection threshold, either error probability can be made arbitrarily small (at the expense of the other).

[Figure: two panels. Left: detection threshold at 0, giving a small amount of missed detections (red) but many false alarms (blue). Right: detection threshold at 1.5, giving a small amount of false alarms (blue) but many missed detections (red).]

Error Types

• Both probabilities can be calculated.
• The probability of false alarm for the threshold $\gamma = 1.5$ is

$$P_{FA} = P(x[0] > \gamma \mid \mu = 0) = \int_{1.5}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx \approx 0.0668.$$

• The probability of a missed detection is

$$P_M = P(x[0] < \gamma \mid \mu = 1) = \int_{-\infty}^{1.5} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right) dx \approx 0.6915.$$

Error Types

• An equivalent, but more useful, quantity is the complement of $P_M$: the probability of detection,

$$P_D = 1 - P_M = \int_{1.5}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right) dx \approx 0.3085.$$
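These three numbers are plain Gaussian tail integrals, so they can be verified with scipy's cumulative and survival functions; a small check for the threshold gamma = 1.5 used above:

import scipy.stats as stats

gamma = 1.5
P_FA = stats.norm.sf(gamma, loc=0, scale=1)   # tail of N(0,1) above gamma
P_M  = stats.norm.cdf(gamma, loc=1, scale=1)  # mass of N(1,1) below gamma
P_D  = 1 - P_M

print(P_FA, P_M, P_D)  # approximately 0.0668, 0.6915, 0.3085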

Neyman-Pearson Theorem

• Since $P_{FA}$ and $P_D$ depend on each other, we would like to maximize $P_D$ subject to a given maximum allowed $P_{FA}$. Luckily, the following theorem makes this easy.
• Neyman-Pearson Theorem: For a fixed $P_{FA}$, the likelihood ratio test maximizes $P_D$ with the decision rule

$$L(\mathbf{x}) = \frac{p(\mathbf{x}; H_1)}{p(\mathbf{x}; H_0)} > \gamma,$$

where the threshold $\gamma$ is the value for which

$$\int_{\{\mathbf{x}\,:\,L(\mathbf{x}) > \gamma\}} p(\mathbf{x}; H_0)\, d\mathbf{x} = P_{FA}.$$

Neyman-Pearson Theorem

• As an example, suppose we want to find the best detector for our introductory example, and we can tolerate 10% false alarms ($P_{FA} = 0.1$).
• According to the theorem, the detection rule is: select $H_1$ if

$$\frac{p(x \mid \mu = 1)}{p(x \mid \mu = 0)} > \gamma.$$

• As derived earlier, this is equivalent to comparing $x[0]$ against a threshold, so the only thing to find out now is the threshold $\gamma$ such that

$$\int_{\gamma}^{\infty} p(x \mid \mu = 0)\, dx = 0.1.$$

Neyman-Pearson Theorem

• This can be done with the Python function isf, which evaluates the inverse of the survival function (the complementary cumulative distribution function):

>>> import scipy.stats as stats
>>> # Compute threshold such that P_FA = 0.1
>>> T = stats.norm.isf(0.1, loc=0, scale=1)
>>> print(T)
1.28155156554

• The parameters loc and scale are the mean and the standard deviation of the Gaussian density, respectively.

Detector for a known waveform

• The NP approach applies to all cases where the likelihoods are available.
• An important special case is that of a known waveform $s[n]$ embedded in a WGN sequence $w[n]$:

$$H_1: x[n] = s[n] + w[n]$$
$$H_0: x[n] = w[n].$$

• An example of a case where the waveform is known is the detection of radar signals, where a pulse $s[n]$ transmitted by us is reflected back after some propagation time.

Detector for a known waveform

• For this case the likelihoods are

$$p(\mathbf{x} \mid H_1) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x[n] - s[n])^2}{2\sigma^2}\right),$$
$$p(\mathbf{x} \mid H_0) = \prod_{n=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x[n])^2}{2\sigma^2}\right).$$

• The likelihood ratio test is easily obtained as

$$\frac{p(\mathbf{x} \mid H_1)}{p(\mathbf{x} \mid H_0)} = \exp\left(-\frac{1}{2\sigma^2}\left(\sum_{n=0}^{N-1} (x[n] - s[n])^2 - \sum_{n=0}^{N-1} (x[n])^2\right)\right) > \gamma.$$

Detector for a known waveform

• This simplifies by taking the logarithm of both sides:

$$-\frac{1}{2\sigma^2}\left(\sum_{n=0}^{N-1} (x[n] - s[n])^2 - \sum_{n=0}^{N-1} (x[n])^2\right) > \ln\gamma.$$

• This further simplifies into

$$\frac{1}{\sigma^2} \sum_{n=0}^{N-1} x[n]s[n] - \frac{1}{2\sigma^2} \sum_{n=0}^{N-1} (s[n])^2 > \ln\gamma.$$

Detector for a known waveform

• Since $s[n]$ is a known waveform (i.e., constant with respect to the data), we can simplify the procedure by moving it to the right-hand side and combining it with the threshold:

$$\sum_{n=0}^{N-1} x[n]s[n] > \sigma^2 \ln\gamma + \frac{1}{2}\sum_{n=0}^{N-1} (s[n])^2.$$

• We can equivalently call the right-hand side our new threshold (say $\gamma'$) to get the final decision rule:

$$\sum_{n=0}^{N-1} x[n]s[n] > \gamma'.$$
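As a small numpy sketch of this correlator (the template, noise level, and data below are illustrative assumptions, not values from the slides), the statistic clearly separates the two hypotheses:

import numpy as np

rng = np.random.default_rng(0)
N = 100
n = np.arange(N)
s = np.cos(2 * np.pi * 0.05 * n)   # known waveform; here a sinusoid template
sigma = 0.5

x1 = s + sigma * rng.standard_normal(N)  # observation under H1 (signal present)
x0 = sigma * rng.standard_normal(N)      # observation under H0 (noise only)

# Test statistic: correlation of the data with the known waveform
print(np.sum(x1 * s))  # concentrates around sum(s**2) = N/2 = 50
print(np.sum(x0 * s))  # concentrates around 0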

Examples

• This leads to some rather obvious results.
• The detector for a known DC level $A$ in WGN is

$$\sum_{n=0}^{N-1} x[n] A > \gamma' \iff A \sum_{n=0}^{N-1} x[n] > \gamma'.$$

• Equally well, we can set a new threshold $\gamma'' = \gamma'/(AN)$. This way the detection rule becomes $\bar{x} > \gamma''$.
• Note that a negative $A$ would invert the inequality.

Examples

• The detector for a sinusoid in WGN is

$$\sum_{n=0}^{N-1} x[n] A \cos(2\pi f_0 n + \phi) > \gamma'.$$

• Again, we can divide by $A$ to get

$$\sum_{n=0}^{N-1} x[n] \cos(2\pi f_0 n + \phi) > \gamma''.$$

• In other words, we check the correlation with the sinusoid. Note that the amplitude $A$ does not affect our statistic, only the threshold, which is anyway selected according to the fixed $P_{FA}$ rate. A sliding-window version of this correlator is sketched below.
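To localize the sinusoid in a long recording, the same correlation statistic can be computed in a sliding window; a sketch under assumed values of f0, window length, and burst location:

import numpy as np

rng = np.random.default_rng(1)
f0, N, L = 0.01, 1000, 100                 # assumed frequency, length, window
x = 0.5 * rng.standard_normal(N)           # noise everywhere...
n = np.arange(300, 500)
x[300:500] += np.cos(2 * np.pi * f0 * n)   # ...plus a sinusoid burst

template = np.cos(2 * np.pi * f0 * np.arange(L))
stat = np.convolve(x, template[::-1], 'same')  # sliding correlation
print(stat.argmax())  # lands inside the burst region (roughly 300..500)

Note that the statistic oscillates with the phase alignment between the data and the template; this sensitivity to phase is exactly what motivates the random-signal model of the later slides.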

Examples

• As an example, the figure below shows the detection process with $\sigma = 0.5$.

[Figure: three panels (Noiseless Signal, Noisy Signal, Detection Result) over samples 0 to 900; the detection statistic oscillates around zero.]

Detection of random signals

• The problem with the previous approach is that the model is too restrictive: the results depend on how well the phases match.
• The model can be relaxed by considering random signals, whose exact form is unknown but whose correlation structure is known. Since the correlation captures the frequency (but not the phase), this is exactly what we want.
• In general, the detection of a random signal can be formulated as follows.
• Suppose $\mathbf{s} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_s)$ and $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$. Then the detection problem is the hypothesis test

$$H_0: \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$$
$$H_1: \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}_s + \sigma^2 \mathbf{I})$$

Detection of random signals

• It can be shown that the decision rule becomes: decide $H_1$ if

$$\mathbf{x}^T \hat{\mathbf{s}} > \gamma',$$

where $\hat{\mathbf{s}} = \mathbf{C}_s (\mathbf{C}_s + \sigma^2 \mathbf{I})^{-1} \mathbf{x}$.
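This structure is often called an estimator-correlator: the MMSE estimate of the signal is correlated with the data. A small sketch with an assumed low-rank signal covariance (all values illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(2)
N, sigma = 50, 0.5
F = rng.standard_normal((N, 2))
C_s = F @ F.T                            # assumed signal covariance

s  = F @ rng.standard_normal(2)          # random signal with covariance C_s
x1 = s + sigma * rng.standard_normal(N)  # observation under H1
x0 = sigma * rng.standard_normal(N)      # observation under H0

A = C_s @ np.linalg.inv(C_s + sigma**2 * np.eye(N))
print(x1 @ A @ x1)  # x^T s_hat: typically large under H1
print(x0 @ A @ x0)  # typically small under H0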

Example of Random Signal Detection

• Without going into the details, let's jump directly to the derived decision rule for the sinusoid:

$$\left|\sum_{n=0}^{N-1} x[n] \exp(-2\pi i f_0 n)\right| > \gamma'.$$

• As an example, the figure below shows the detection process with $\sigma = 0.5$.
• Note the simplicity of the Python implementation:

import numpy as np
# f0: sinusoid frequency, n: array of sample indices, xn: noisy input signal
h = np.exp(-2 * np.pi * 1j * f0 * n)
y = np.abs(np.convolve(h, xn, 'same'))

Example of Random Signal Detection

[Figure: three panels (Noiseless Signal, Noisy Signal, Detection Result) over samples 0 to 900; the detection statistic is now nonnegative.]

Receiver Operating Characteristics

• A usual way of illustrating detector performance is the Receiver Operating Characteristic (ROC) curve.
• The ROC curve describes the relationship between $P_{FA}$ and $P_D$ for all possible values of the threshold $\gamma$.
• The functional relationship between $P_{FA}$ and $P_D$ depends on the problem and the selected detector.

Receiver Operating Characteristics

• For example, in the DC level example, the two probabilities are

$$P_D(\gamma) = \int_{\gamma}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right) dx,$$
$$P_{FA}(\gamma) = \int_{\gamma}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx.$$

• It is easy to see the relationship (substitute $x \to x + 1$ in the first integral):

$$P_D(\gamma) = \int_{\gamma - 1}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx = P_{FA}(\gamma - 1).$$

Receiver Operating Characteristics

• Plotting $P_D$ against $P_{FA}$ for all values of $\gamma$ results in the following curve.

[Figure: ROC curve of $P_D$ versus $P_{FA}$; large $\gamma$ (low sensitivity) corresponds to the lower left end of the curve, small $\gamma$ (high sensitivity) to the upper right end.]
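The curve above can be reproduced with a few lines, since both quantities are Gaussian tail probabilities (a sketch using scipy's survival function):

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

gammas = np.linspace(-4, 5, 500)
P_FA = stats.norm.sf(gammas)        # tail of N(0,1) above gamma
P_D  = stats.norm.sf(gammas - 1)    # = P_FA(gamma - 1)

plt.plot(P_FA, P_D)
plt.xlabel('Probability of False Alarm $P_{FA}$')
plt.ylabel('Probability of Detection $P_D$')
plt.title('ROC Curve')
plt.show()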

Receiver Operating Characteristics

• The higher the ROC curve, the better the performance.
• A random guess has a diagonal ROC curve.
• This gives rise to a widely used measure of detector performance: the Area Under (ROC) Curve, or AUC, criterion.
• The benefit of the AUC is that it is threshold independent; it tests the accuracy over all thresholds.
• In the DC level case, the performance increases as the noise variance $\sigma^2$ decreases (since the problem becomes easier).
• Below are the ROC plots for various values of $\sigma^2$.

Receiver Operating Characteristics

[Figure: ROC curves for $\sigma$ = 0.2 (AUC = 1.00), $\sigma$ = 0.4 (AUC = 0.96), $\sigma$ = 0.6 (AUC = 0.88), $\sigma$ = 0.8 (AUC = 0.81), and $\sigma$ = 1.0 (AUC = 0.76), together with the Random Guess diagonal (AUC = 0.5).]
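The AUC values in the legend can be checked by numerically integrating $P_D$ over $P_{FA}$; the sketch below assumes the curves correspond to the single-sample problem $\mathcal{N}(0, \sigma^2)$ versus $\mathcal{N}(1, \sigma^2)$ (an assumption about the slide's setup, though the resulting numbers match the legend):

import numpy as np
import scipy.stats as stats

for sigma in [0.2, 0.4, 0.6, 0.8, 1.0]:
    gammas = np.linspace(-10, 10, 10001)
    P_FA = stats.norm.sf(gammas, loc=0, scale=sigma)
    P_D  = stats.norm.sf(gammas, loc=1, scale=sigma)
    auc = np.trapz(P_D[::-1], P_FA[::-1])  # integrate P_D over increasing P_FA
    print(sigma, round(auc, 2))  # 1.0, 0.96, 0.88, 0.81, 0.76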

Empirical AUC

• Initially, the AUC and ROC stem from radar and radio detection problems.
• More recently, the AUC has become one of the standard measures of classification performance as well.
• Usually a closed-form expression for $P_D$ and $P_{FA}$ cannot be derived.
• Thus, the ROC and AUC are most often computed empirically, i.e., by evaluating the prediction results on a held-out test set.

Classification Example ROC and AUC

• For example, consider the 2-dimensional dataset shown in the figure.
• The data is split into training and test sets, which are similar but not exactly the same.
• Let's train 4 classifiers on the training data and compute the ROC for each on the test data.

[Figure: two scatter plots of the 2-dimensional data, labeled Training Data and Test Data.]

Classification Example ROC and AUC

• A linear classifier trained with the training data produces the class boundary shown in the figure.
• The class boundary has the orientation and location that minimize the overall classification error on the training data.
• The boundary is defined by $y = c_1 x + c_0$, with the parameters $c_1$ and $c_0$ learned from the data.

[Figure: classifier with the minimum-error boundary; 5.0% of crosses are detected as circles, and 13.5% of circles are detected as crosses.]

Classification Example ROC and AUC

• We can adjust the sensitivity of the classification by moving the decision boundary up or down.
• In other words, we slide the parameter $c_0$ in $y = c_1 x + c_0$.
• This can be seen as a tuning parameter for plotting the ROC curve.

[Figure: two panels. With the boundary lifted up, 0.5% of crosses are detected as circles but 35.0% of circles are detected as crosses. With the boundary lifted down, 26.0% of crosses are detected as circles but only 2.0% of circles are detected as crosses.]

Classification Example ROC and AUC

• When the boundary slides from bottom to top, we plot the empirical ROC curve.
• Plotting starts from the upper right corner.
• Every time the boundary passes a blue cross, the curve moves left.
• Every time the boundary passes a red circle, the curve moves down.

[Figure: the resulting empirical ROC curve (AUC = 0.98), shown next to the classifier with the minimum-error boundary from the previous figure.]

Classification Example ROC and AUC

• The real usage is for comparing classifiers.
• Below is a plot of the ROC curves of 4 widely used classifiers.
• Each classifier produces a class membership score over which the tuning parameter slides.

[Figure: ROC curves for Logistic Regression (AUC = 0.98), Support Vector Machine (AUC = 0.96), Random Forest (AUC = 0.97), and Nearest Neighbor (AUC = 0.96).]

ROC and AUC code in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

classifiers = [(LogisticRegression(), "Logistic Regression"),
               (SVC(probability=True), "Support Vector Machine"),
               (RandomForestClassifier(n_estimators=100), "Random Forest"),
               (KNeighborsClassifier(), "Nearest Neighbor")]

# x, y: training data; x_test, y_test: held-out test data
for clf, name in classifiers:
    clf.fit(x, y)
    ROC = []
    for gamma in np.linspace(0, 1, 1000):
        # Count of class-0 test samples scored at or below the threshold
        err1 = np.count_nonzero(clf.predict_proba(x_test[y_test == 0, :])[:, 1] <= gamma)
        # Count of class-1 test samples scored above the threshold (detections)
        err2 = np.count_nonzero(clf.predict_proba(x_test[y_test == 1, :])[:, 1] > gamma)
        err1 = float(err1) / np.count_nonzero(y_test == 0)
        err2 = float(err2) / np.count_nonzero(y_test == 1)
        ROC.append([err1, err2])
    ROC = np.array(ROC)
    ROC = ROC[::-1, :]  # order the points by increasing P_FA
    auc = roc_auc_score(y_test, clf.predict_proba(x_test)[:, 1])
    plt.plot(1 - ROC[:, 0], ROC[:, 1], linewidth=2,
             label="%s (AUC = %.2f)" % (name, auc))