Lecture 3. STAT161/261 Introduction to Pattern Recognition and Machine Learning Spring 2018 Prof. Allie Fletcher


Previous lectures
What is machine learning? Objectives of machine learning
Supervised and unsupervised learning: examples and approaches
Multivariate linear regression:
  Predicts any continuous-valued target from a vector of features
  Important: simple to compute, parameters easy to interpret
  Illustrates the basic procedure: model formulation, loss function, ...
  Many natural phenomena have a linear relationship
Subsequent lectures build up the theory behind such parametric estimation techniques

Outline Principles of Supervised Learning Model Selection and Generalization (Alpaydin 2.7 & 2.8, Bishop 1.3) Overfitting and Underfitting Decision Theory (1.5 Bishop and Ch3 Alpaydin) Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Outline Principles of Supervised Learning Model Selection and Generalization Overfitting and Underfitting Decision Theory Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Memorization vs. Generalization
Two key concepts in ML:
  Memorization: finding an algorithm that fits the training data well
  Generalization: gives good results on data not yet seen. Prediction.
Example: Suppose we have only three samples of fish. Can we learn a classification rule? Sure.
[Figure: fish weight vs. fish length, showing Class 1 (sea bass) and Class 2 (salmon) samples]

Memorization vs. Generalization
Many possible classifiers fit the training data. It is easy to memorize the data set, but we need to generalize to new data.
All three classifiers below (Classifier 1, C2, and C3) fit the data. But which one will predict a new sample correctly?
[Figure: fish weight vs. fish length with three candidate decision boundaries separating sea bass from salmon]

Memorization vs. Generalization
Which classifier predicts the new sample correctly?
  Classifier 1 predicts salmon
  Classifier 2 predicts salmon
  Classifier 3 predicts sea bass
We do not know which one is right: there is not enough training data. We need more samples to generalize.
[Figure: the new sample (marked ?) in the fish weight vs. fish length plane, classified differently by the three boundaries]

Basic Tradeoff
Generalization requires assumptions; ML uses a model.
Basic tradeoff between three factors:
  Model complexity: allows fitting complex relationships
  Amount of training data
  Generalization error: how well the model fits new samples
This class provides principled ways to formulate models that can capture complex behavior and to analyze how well they perform under statistical assumptions.

Generalization: Underfitting and Overfitting
Example: consider fitting a polynomial.
  Assume a low-order polynomial: easy to train, fewer parameters to estimate, but the model does not capture the full relation. Underfitting.
  Assume too high a polynomial order: fits complex behavior, but is sensitive to noise and needs many samples. Overfitting.
This course: how to rigorously quantify model selection and algorithm performance.

Ingredients in Supervised Learning
Select a model: ŷ = g(x, θ)
  Describes how we predict the target y from the features x
  Has parameters θ
Get training data: (x_i, y_i), i = 1, ..., n
Select a loss function L(y_i, ŷ_i)
  Measures how well the prediction matches the true value on the training data
Design an algorithm to try to minimize the loss: θ̂ = arg min_θ Σ_{i=1}^n L(y_i, ŷ_i)
The art of machine learning is developing principled methods for models and algorithms when the loss function is often intractable and the data are complex and large.
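
To make these ingredients concrete, here is a minimal sketch in Python (NumPy assumed available), using made-up data and a simple linear model ŷ = θ0 + θ1·x as the assumed g(x, θ) with a squared-error loss; the "algorithm" here is just least squares via np.polyfit.

```python
import numpy as np

# Hypothetical training data: n samples of one feature x and target y
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)   # assumed "true" linear relation plus noise

# Model: y_hat = g(x, theta) = theta[0] + theta[1] * x
def g(x, theta):
    return theta[0] + theta[1] * x

# Loss: squared error summed over the training data
def loss(theta):
    return np.sum((y - g(x, theta)) ** 2)

# "Algorithm": for a linear model with squared loss the minimizer has a
# closed form (least squares); np.polyfit returns [slope, intercept].
theta1, theta0 = np.polyfit(x, y, deg=1)
print("fitted parameters:", theta0, theta1)
print("training loss:", loss((theta0, theta1)))
```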

Outline Principles of Supervised Learning Model Selection and Generalization Overfitting and Underfitting Decision Theory Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Decision Theory
How do we make decisions in the presence of uncertainty?
History: prominent in WWII: radar for detecting aircraft, codebreaking, decryption.
Observed data x ∈ X, state y ∈ Y.
p(x | y): conditional distribution, a model of how the data is generated.
Example: y ∈ {0, 1} (salmon vs. sea bass, or airplane vs. bird, etc.), x = length of fish.

Maximum Likelihood (ML) Decision
Which fish type is more likely given the observed fish length x?
If p(x | y=0) > p(x | y=1), guess salmon; otherwise classify the fish as sea bass.
Equivalently: if p(x | y=1) / p(x | y=0) > 1, guess sea bass  [likelihood ratio test, LRT]
or: if log [p(x | y=1) / p(x | y=0)] > 0, guess sea bass  [log-likelihood ratio]
  ŷ_ML = α(x) = arg max_y p(x | y)
Seems reasonable, but what if salmon may be much more likely than sea bass?

Maximum a Posteriori (MAP) Decision
Introduce prior probabilities p(y=0) and p(y=1).
Salmon more likely than sea bass: p(y=0) > p(y=1).
Now, which type of fish is more likely given the observed fish length?
Bayes rule: p(y | x) = p(x | y) p(y) / p(x)
Including prior probabilities: if p(y=0 | x) > p(y=1 | x), guess salmon; otherwise, pick sea bass.
  ŷ_MAP = α(x) = arg max_y p(y | x) = arg max_y p(x | y) p(y)
As a ratio test: if p(y=1 | x) / p(y=0 | x) > 1, guess sea bass.
Equivalently, via Bayes rule: if p(x | y=1) / p(x | y=0) > p(y=0) / p(y=1), guess sea bass.
This still ignores that different mistakes can have different importance.
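
A small sketch of the ML and MAP rules in Python; the class-conditional length models (Gaussian lengths for salmon and sea bass) and the 70/30 prior are invented purely for illustration, not taken from the lecture.

```python
import numpy as np
from scipy.stats import norm

# Invented class-conditional models for fish length (cm):
# salmon (y=0) ~ N(60, 8^2), sea bass (y=1) ~ N(80, 10^2); salmon assumed more common.
p_x_given_0 = norm(loc=60.0, scale=8.0)
p_x_given_1 = norm(loc=80.0, scale=10.0)
prior0, prior1 = 0.7, 0.3

def ml_decision(x):
    # ML rule: compare the likelihoods p(x|y=1) and p(x|y=0)
    return int(p_x_given_1.pdf(x) > p_x_given_0.pdf(x))

def map_decision(x):
    # MAP rule: weight each likelihood by its prior before comparing
    return int(p_x_given_1.pdf(x) * prior1 > p_x_given_0.pdf(x) * prior0)

for x in (65.0, 72.0, 78.0):
    print(f"length {x} cm -> ML: class {ml_decision(x)}, MAP: class {map_decision(x)}")
```

For lengths near the middle of the two distributions, the two rules can disagree: the prior pulls the MAP decision toward the more common class.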

Making it more interesting: full-on Bayes
What does a mistake cost? A plane with a missile, or just a big bird?
Define a loss or cost L(α(x), y): the cost of decision α(x) when the true state is y (also often denoted C_ij):

              y = 0                     y = 1
  α(x) = 0    correct, cost L(0,0)      incorrect, cost L(0,1)
  α(x) = 1    incorrect, cost L(1,0)    correct, cost L(1,1)

Classic example: Pascal's wager

Risk Minimization
So now we have: the likelihood functions p(x | y), the priors p(y), a decision rule α(x), and a loss function L(α(x), y).
Risk is expected loss:
  E[L] = L(0,0) p(α(x)=0, y=0) + L(0,1) p(α(x)=0, y=1) + L(1,0) p(α(x)=1, y=0) + L(1,1) p(α(x)=1, y=1)
Without loss of generality, assign zero cost to correct decisions:
  E[L] = L(1,0) p(α(x)=1 | y=0) p(y=0) + L(0,1) p(α(x)=0 | y=1) p(y=1)
Bayes decision theory says: pick the decision rule α(x) to minimize the risk.

Visualizing Errors
Type I error (false alarm or false positive): decide H1 when H0 is true.
Type II error (missed detection or false negative): decide H0 when H1 is true.
There is a tradeoff between the two; the error probabilities can be worked out from the conditional probabilities.

Hypothesis Testing (the more formal statement)
Two possible hypotheses for the data: H0 (null hypothesis) and H1 (alternate hypothesis).
Model statistically: p(x | H_i), i = 0, 1; assume some distribution for each hypothesis.
Given the likelihoods p(x | H_i), i = 0, 1, and the prior probabilities p_i = P(H_i),
compute the posterior P(H_i | x): how likely is H_i given the data and prior knowledge?
Bayes rule: P(H_i | x) = p(x | H_i) p_i / p(x) = p(x | H_i) p_i / [p(x | H_0) p_0 + p(x | H_1) p_1]

MAP: Minimum Probability of Error
Probability of error: P_err = P(Ĥ ≠ H) = P(Ĥ = 0 | H_1) p_1 + P(Ĥ = 1 | H_0) p_0
Written as an integral: P(Ĥ ≠ H) = ∫ p(x) P(Ĥ ≠ H | x) dx
The error is minimized by the MAP estimator: Ĥ = 1 when P(H_1 | x) ≥ P(H_0 | x).
Using Bayes rule: Ĥ = 1 when p(x | H_1) p_1 ≥ p(x | H_0) p_0.
Equivalent to an LRT with γ = p_0 / p_1: a probabilistic interpretation of the threshold.

Bayes Risk Minimization
As before, express the risk as an integration over x:
  R = Σ_{i,j} C_ij ∫ P(H_j | x) 1{Ĥ(x) = i} p(x) dx
To minimize, select Ĥ(x) = 1 when
  C_10 P(H_0 | x) + C_11 P(H_1 | x) ≤ C_00 P(H_0 | x) + C_01 P(H_1 | x)
  ⟺ P(H_1 | x) / P(H_0 | x) ≥ (C_10 - C_00) / (C_01 - C_11)
By Bayes theorem, this is equivalent to an LRT:
  p(x | H_1) / p(x | H_0) ≥ (C_10 - C_00) p_0 / [(C_01 - C_11) p_1]
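
A minimal numeric sketch of the Bayes-risk LRT; the Gaussian likelihoods, priors, and costs below are assumptions chosen only for illustration (here a miss is taken to be ten times as costly as a false alarm).

```python
import numpy as np
from scipy.stats import norm

# Assumed setup: p(x|H0) = N(0,1), p(x|H1) = N(1,1), priors p0 = 0.8, p1 = 0.2,
# costs C00 = C11 = 0, C10 = 1 (false alarm), C01 = 10 (miss).
p0, p1 = 0.8, 0.2
C00, C11, C10, C01 = 0.0, 0.0, 1.0, 10.0
lik0, lik1 = norm(0.0, 1.0), norm(1.0, 1.0)

# LRT threshold from the slide: gamma = (C10 - C00) p0 / ((C01 - C11) p1)
gamma = (C10 - C00) * p0 / ((C01 - C11) * p1)

def decide(x):
    # Choose H1 when the likelihood ratio meets or exceeds gamma
    return int(lik1.pdf(x) / lik0.pdf(x) >= gamma)

for x in (-0.5, 0.3, 1.2):
    lr = lik1.pdf(x) / lik0.pdf(x)
    print(f"x = {x:+.1f} -> decide H{decide(x)}  (LR = {lr:.2f}, gamma = {gamma:.2f})")
```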

Scalar Gaussian (the same example, but posed as additive noise)
  H_0: x = w,      w ~ N(0, σ^2)
  H_1: x = A + w,  w ~ N(0, σ^2),  A = 1
Example: a medical test for some disease
  x = measured value for the patient
  H_0 = patient is fine, H_1 = patient is ill
Probability model: x is elevated when the disease is present.

Example: Scalar Gaussians
Hypotheses:
  H_0: x = w,      w ~ N(0, σ^2)
  H_1: x = A + w,  w ~ N(0, σ^2)
Problem: use the LRT to define a classifier and compute P_D and P_FA.
Step 1. Write the probability distributions:
  p(x | H_0) = (1 / (√(2π) σ)) exp(-x² / (2σ²)),   p(x | H_1) = (1 / (√(2π) σ)) exp(-(x - A)² / (2σ²))

Scalar Gaussian (continued)
Step 2. Write the log-likelihood ratio:
  L(x) = ln [p(x | H_1) / p(x | H_0)] = (1/(2σ²)) [x² - (x - A)²] = (1/(2σ²)) (2Ax - A²)
Step 3. The LRT L(x) ≥ γ is equivalent to x ≥ t, with t = (2σ²γ + A²) / (2A).
Write all further answers in terms of t instead of γ.
Classifier: Ĥ = 1 if x ≥ t, Ĥ = 0 if x < t.

Scalar Gaussian (continued)
Step 4. Compute the error probabilities.
  P_D = P(Ĥ = 1 | H_1) = P(x ≥ t | H_1). Under H_1, x ~ N(A, σ²), so P_D = Q((t - A)/σ).
  Similarly, P_FA = P(x ≥ t | H_0) = Q(t/σ).
Here Q(z) is the Gaussian Q-function: Q(z) = P(Z ≥ z), Z ~ N(0, 1).
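
As a quick check of these formulas, the sketch below evaluates Q(·) with scipy.stats.norm.sf and compares against a Monte Carlo estimate. The values A = 1, σ = 0.5, and the threshold t = A/2 (the slide's formula with γ = 0) are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

# Scalar Gaussian detection: H0: x = w, H1: x = A + w, w ~ N(0, sigma^2)
A, sigma = 1.0, 0.5
t = A / 2                                  # threshold for the ML choice gamma = 0

# Q(z) = P(Z >= z) for Z ~ N(0,1) is the Gaussian tail, i.e. norm.sf(z)
P_D = norm.sf((t - A) / sigma)             # P(x >= t | H1), x ~ N(A, sigma^2)
P_FA = norm.sf(t / sigma)                  # P(x >= t | H0), x ~ N(0, sigma^2)
print(f"analytic:  P_D = {P_D:.4f}, P_FA = {P_FA:.4f}")

# Quick Monte Carlo check of the same quantities
rng = np.random.default_rng(1)
x0 = rng.normal(0.0, sigma, 200_000)       # samples under H0
x1 = rng.normal(A, sigma, 200_000)         # samples under H1
print(f"simulated: P_D = {(x1 >= t).mean():.4f}, P_FA = {(x0 >= t).mean():.4f}")
```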

Review: Gaussian Q-Function
Problem: suppose X ~ N(μ, σ²). We often must compute probabilities like P(X ≥ t), which have no closed-form expression.
Define the Gaussian Q-function: Q(z) = P(Z ≥ z), Z ~ N(0, 1).
Let Z = (X - μ)/σ. Then P(X ≥ t) = P(Z ≥ (t - μ)/σ) = Q((t - μ)/σ).

Example: Two Exponentials
Hypotheses: H_i: p(x | H_i) = λ_i e^{-λ_i x}, i = 0, 1, with λ_0 > λ_1.
Find the ML detector threshold and the probability of false alarm.
Step 1. Write the conditional probability distributions. Nothing to do; already given.
Step 2. Log-likelihood ratio:
  L(x) = ln [p(x | H_1) / p(x | H_0)] = (λ_0 - λ_1) x + ln(λ_1 / λ_0)
ML (LRT with γ = 0): pick H_1 if x ≥ t, where t = (1 / (λ_0 - λ_1)) ln(λ_0 / λ_1).
More generally, L(x) ≥ γ is equivalent to x ≥ t.

Two Exponentials (continued)
Compute the error probabilities:
  P_D = P(Ĥ = 1 | H_1) = P(x ≥ t | H_1) = ∫_t^∞ p(x | H_1) dx = ∫_t^∞ λ_1 e^{-λ_1 x} dx = e^{-λ_1 t}
Similarly, P_FA = e^{-λ_0 t}.
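
A short sketch with assumed rates λ0 = 2 and λ1 = 0.5 (any λ0 > λ1 works), computing the ML threshold and the resulting P_D and P_FA, plus a Monte Carlo sanity check.

```python
import numpy as np

# Two-exponential detection with assumed rates
lam0, lam1 = 2.0, 0.5

# ML threshold from the slide: t = ln(lam0/lam1) / (lam0 - lam1)
t = np.log(lam0 / lam1) / (lam0 - lam1)
P_D = np.exp(-lam1 * t)      # P(x >= t | H1)
P_FA = np.exp(-lam0 * t)     # P(x >= t | H0)
print(f"t = {t:.3f}, P_D = {P_D:.3f}, P_FA = {P_FA:.3f}")

# Monte Carlo sanity check (numpy's exponential takes scale = 1/lambda)
rng = np.random.default_rng(2)
x0 = rng.exponential(scale=1 / lam0, size=200_000)
x1 = rng.exponential(scale=1 / lam1, size=200_000)
print(f"simulated: P_D = {(x1 >= t).mean():.3f}, P_FA = {(x0 >= t).mean():.3f}")
```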

MAP Example
Hypotheses: H_i: x ~ N(μ_i, σ²), p_i = P(H_i), i = 0, 1.
Two Gaussian densities with different means but the same variance (assume μ_1 > μ_0).
Problem: find the MAP estimate.
Solution: first write the densities:
  p(x | H_i) = (1 / (√(2π) σ)) exp(-(x - μ_i)² / (2σ²))
MAP estimate: select Ĥ = 1 when p(x | H_1) p_1 ≥ p(x | H_0) p_0.
In the log domain: -(x - μ_1)²/(2σ²) + ln p_1 ≥ -(x - μ_0)²/(2σ²) + ln p_0

MAP Example (continued)
More simplifications: Ĥ = 1 when
  -(x - μ_1)²/(2σ²) + ln p_1 ≥ -(x - μ_0)²/(2σ²) + ln p_0
  ⟺ (x - μ_0)² - (x - μ_1)² ≥ 2σ² ln(p_0/p_1)
  ⟺ 2(μ_1 - μ_0) x - (μ_1² - μ_0²) ≥ 2σ² ln(p_0/p_1)
  ⟺ x ≥ (μ_1 + μ_0)/2 + (σ²/(μ_1 - μ_0)) ln(p_0/p_1)

MAP Example (continued)
MAP estimator: Ĥ = 1 when x ≥ t, with threshold
  t = (μ_1 + μ_0)/2 + (σ²/(μ_1 - μ_0)) ln(p_0/p_1)
The first term is the midpoint between the Gaussians; the threshold shifts to the left when p_0 < p_1.
[Figure: the two Gaussian densities with means μ_0 and μ_1 and the threshold t between them]
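
A small sketch of this threshold with assumed values (μ0 = 0, μ1 = 2, σ = 1, priors 0.7/0.3, all chosen for illustration), also evaluating the resulting probability of error.

```python
import numpy as np
from scipy.stats import norm

# MAP threshold for two equal-variance Gaussians, with assumed parameters
mu0, mu1, sigma = 0.0, 2.0, 1.0
p0, p1 = 0.7, 0.3

# Threshold from the slide: midpoint plus a prior-dependent shift
t = (mu1 + mu0) / 2 + sigma**2 / (mu1 - mu0) * np.log(p0 / p1)

# Resulting probability of error: P(x >= t | H0) p0 + P(x < t | H1) p1
P_err = norm.sf((t - mu0) / sigma) * p0 + norm.cdf((t - mu1) / sigma) * p1
print(f"MAP threshold t = {t:.3f}, probability of error = {P_err:.4f}")
```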

Multiple Classes
We often have multiple classes: y ∈ {1, ..., K}. Most methods extend easily:
  ML: take the max of the K likelihoods: ŷ = arg max_{i=1,...,K} p(x | y = i)
  MAP: take the max of the K posteriors: ŷ = arg max_{i=1,...,K} p(y = i | x)
  LRT: take the max of the K weighted likelihoods: ŷ = arg max_{i=1,...,K} p(x | y = i) γ_i
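
A minimal multiclass MAP sketch; the 1-D Gaussian class conditionals and the priors below are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Assumed model: class i has likelihood N(mus[i], 1) and prior priors[i]
mus = np.array([-2.0, 0.0, 3.0])
priors = np.array([0.2, 0.5, 0.3])

def map_class(x):
    # Posterior is proportional to likelihood * prior; take arg max over the K classes
    scores = norm.pdf(x, loc=mus, scale=1.0) * priors
    return int(np.argmax(scores))

for x in (-1.5, 0.4, 2.0):
    print(f"x = {x:+.1f} -> class {map_class(x)}")
```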

Outline Principles of Supervised Learning Model Selection and Generalization Overfitting and Underfitting Decision Theory Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Receiver Operating Characteristic
Any binary decision strategy has a trade-off in errors.
Reminder of errors: TP = true positive, TN = true negative, FP = false positive, FN = false negative.
ROC curves show error tradeoffs; they typically illustrate the tradeoff between the TP rate and the FP rate.

ROC Curve
Plot P_D vs. P_FA, tracing out the pair (P_FA(γ), P_D(γ)) as the threshold γ varies: this shows the tradeoff.
Random guessing: select H_1 randomly α percent of the time. Then P_D = α and P_FA = α, so P_D = P_FA (the diagonal line).

Area Under the Curve (AUC)
A simple measure of quality: AUC = the average of P_D(γ) with γ = x drawn under H_0.
Proof: AUC = ∫ P_D(γ) |dP_FA(γ)| = ∫ P_D(γ) p(γ | H_0) dγ
[Figure: ROC curve with the detection rate on the vertical axis and the false-alarm rate on the horizontal axis]

ROC Example: Two Exponentials
Hypotheses: H_i: p(x | H_i) = λ_i e^{-λ_i x}, i = 0, 1.
From before, the LRT is Ĥ = 1 if x ≥ t, Ĥ = 0 if x < t.
Error probabilities: P_D = e^{-λ_1 t}, P_FA = e^{-λ_0 t}.
ROC curve: write P_D in terms of P_FA. Since t = -(1/λ_0) ln P_FA, we get P_D = P_FA^{λ_1/λ_0}.
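
The sketch below traces this ROC numerically for the same assumed rates as before and approximates the AUC; from the slide's ROC formula, the AUC for these exponentials works out to λ0/(λ0 + λ1).

```python
import numpy as np

# Trace the ROC for the two-exponential example with assumed rates
lam0, lam1 = 2.0, 0.5

t = np.linspace(0.0, 10.0, 1000)           # sweep the threshold
P_FA = np.exp(-lam0 * t)
P_D = np.exp(-lam1 * t)
# Closed form from the slide: P_D = P_FA^(lam1/lam0)
assert np.allclose(P_D, P_FA ** (lam1 / lam0))

# Approximate AUC = integral of P_FA^(lam1/lam0) over P_FA in [0, 1]
u = np.linspace(0.0, 1.0, 100_001)
auc = np.mean(u ** (lam1 / lam0))
print(f"AUC ~ {auc:.4f} (closed form lam0/(lam0+lam1) = {lam0/(lam0+lam1):.4f})")
```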

Outline Principles of Supervised Learning Model Selection and Generalization Overfitting and Underfitting Decision Theory Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Problems in Using Hypothesis Testing
The hypothesis testing formulation requires knowledge of the likelihood p(x | H_i) and possibly of the prior P(H_i). Where do we get these?
Approach 1: learn the distributions from data, then apply hypothesis testing.
Approach 2: use hypothesis testing to select a form for the classifier, then learn the parameters of the classifier directly from data.

Outline Principles of Supervised Learning Model Selection and Generalization Overfitting and Underfitting Decision Theory Binary Classification Maximum Likelihood and Log likelihood Bayes Methods: MAP and Bayes Risk Receiver operating characteristic Minimum probability of error Issues in applying Bayesian classification Curse of Dimensionality

Intuition in High Dimensions
Examples of Bayes decision theory can be misleading because they are given in low-dimensional spaces (1 or 2 dimensions). Most ML problems today are high-dimensional, and our geometric intuition in high dimensions is often wrong.
Example: consider the volume of a sphere of radius r = 1 in D dimensions. What is the fraction of the volume in a thin shell between 1 - ε ≤ r ≤ 1?

Example: Sphere Hardening
Let V_D(r) = volume of a sphere of radius r in dimension D: V_D(r) = K_D r^D.
Let ρ_D(ε) = fraction of the volume in a shell of thickness ε:
  ρ_D(ε) = [V_D(1) - V_D(1 - ε)] / V_D(1) = 1 - (1 - ε)^D
[Figure: ρ_D(ε) vs. ε for D = 1, 5, 20, 100; the fraction approaches 1 rapidly as D grows]
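
A short numeric illustration of this formula for a few dimensions (ε = 0.05 and the list of D values are chosen arbitrarily).

```python
import numpy as np

# Fraction of a unit sphere's volume in a thin outer shell of thickness eps:
# rho_D(eps) = 1 - (1 - eps)^D
eps = 0.05
for D in (1, 5, 20, 100, 1000):
    rho = 1 - (1 - eps) ** D
    print(f"D = {D:4d}: fraction of volume in the outer {eps:.0%} shell = {rho:.3f}")
```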

Gaussian Sphere Hardening
Consider a Gaussian i.i.d. vector x = (x_1, ..., x_D), x_i ~ N(0, 1).
As D → ∞, the probability density concentrates on the shell ‖x‖² ≈ D, even though x = 0 is the most likely point.
Let r = (x_1² + x_2² + ... + x_D²)^{1/2}. Then:
  D = 1: p(r) = c e^{-r²/2}
  D = 2: p(r) = c r e^{-r²/2}
  general D: p(r) = c r^{D-1} e^{-r²/2}
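
A quick Monte Carlo illustration (sample sizes arbitrary) that the norm of an i.i.d. N(0,1) vector concentrates near √D as the dimension grows.

```python
import numpy as np

# Gaussian sphere hardening: ||x|| concentrates near sqrt(D) for x_i ~ N(0,1)
rng = np.random.default_rng(3)
for D in (1, 10, 100, 1000):
    x = rng.normal(size=(10_000, D))
    r = np.linalg.norm(x, axis=1)
    print(f"D = {D:5d}: mean ||x|| = {r.mean():7.2f}  (sqrt(D) = {np.sqrt(D):7.2f}),  std = {r.std():.2f}")
```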

Example: Sphere Hardening (conclusions)
As the dimension increases, essentially all of the volume of a sphere concentrates near its surface!
Similar example: for a Gaussian i.i.d. vector x = (x_1, ..., x_d), x_i ~ N(0, 1), as d → ∞ the probability density concentrates on the shell ‖x‖² ≈ d, even though x = 0 is the most likely point.

Computational Issues
In high dimensions, classifiers need a large number of parameters.
Example: suppose x = (x_1, ..., x_d) and each x_i takes on L values. Then x takes on L^d values.
Consider a general classifier f(x) that assigns each x some value. If there are no restrictions on f(x), it needs L^d parameters; for instance, with L = 10 and d = 20 that is already 10^20 table entries.

Curse of Dimensionality
Curse of dimensionality: as the dimension increases,
  the number of parameters for general functions grows exponentially, and
  most operations (fitting the function, optimizing, storage) become computationally intractable.
Much of what ML is doing today is finding tractable approximate approaches for high dimensions.