INF 4300 Introduction to classification. Anne Solberg. Based on Chapter 2 (2.1-2.6) in Duda and Hart: Pattern Classification.


INF 4300: Introduction to classification
Anne Solberg (anne@ifi.uio.no)
Based on Chapter 2 (2.1-2.6) in Duda and Hart: Pattern Classification

Mandatory project
Main task: classification. You must implement a classification algorithm. Tentative schedule: exercise available in November; tentative deadline: November 3.

Introduction to classification
Supervised classification is related to thresholding, which divides the image into two classes: foreground and background. Thresholding is a two-class classification problem based on a 1D feature vector, where the feature vector consists of only the grey level f(x,y). How can we classify a feature vector of N shape features into the correct character type? We will now study multivariate classification theory, where we use N features to determine if an object belongs to one of a set of K object classes. Recommended additional reading: Pattern Classification, R. Duda, P. Hart and D. Stork: Chapter 1 (Introduction) and Chapter 2 (Bayesian Decision Theory), 2.1-2.6.

From INF 2310: Thresholding
Basic thresholding assigns all pixels in the image to one of 2 classes, foreground or background:

$g(x,y) = \begin{cases} 0 & \text{if } f(x,y) \le T \\ 1 & \text{if } f(x,y) > T \end{cases}$

This can be seen as a 2-class classification problem based on a single feature, the gray level.
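As a minimal sketch, this thresholding rule in Python (the image values and the threshold T below are made-up for illustration):

```python
import numpy as np

def threshold(image, T):
    """Two-class classification by thresholding: 0 = background, 1 = foreground."""
    return (image > T).astype(np.uint8)

# Hypothetical 8-bit image: dark background, bright foreground.
f = np.array([[10, 12, 200],
              [11, 180, 210],
              [9, 13, 190]], dtype=np.uint8)
print(threshold(f, T=100))  # label image with values in {0, 1}
```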

Classification error for thresholding
(Figure: the background and foreground gray-level histograms overlap around the threshold t; on one side of t, foreground pixels are misclassified as background, and on the other side, background pixels are misclassified as foreground.)

We assume that b(z) is the normalized histogram for background and f(z) is the normalized histogram for foreground. The histograms are estimates of the probability distributions of the gray levels in the image. Let P(F) and P(B) be the prior probabilities for foreground and background (P(B) + P(F) = 1). The normalized histogram for the image is then given by

$p(z) = P(B)\,b(z) + P(F)\,f(z)$

The probability of misclassification, given a threshold t, is:

$E_B(t) = \int_{-\infty}^{t} f(z)\,dz, \qquad E_F(t) = \int_{t}^{\infty} b(z)\,dz$

Find the T that minimizes the error
Compute the derivative of the total error E(t) with respect to t:

$E(t) = P(F)\int_{-\infty}^{t} f(z)\,dz + P(B)\int_{t}^{\infty} b(z)\,dz$

Set the derivative equal to 0:

$\frac{dE(t)}{dt} = 0 \;\Rightarrow\; P(F)\,f(T) = P(B)\,b(T)$

Minimum error is achieved by setting T equal to the point where the prior-weighted probabilities for foreground and background are equal.

Distributions, standard deviation and variance
A Gaussian (normal) distribution is specified by the mean value $\mu$ and the variance $\sigma^2$:

$p(z) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(z-\mu)^2}{2\sigma^2}}$

Variance $\sigma^2$, standard deviation $\sigma$.
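As a sketch, the crossing point P(F) f(T) = P(B) b(T) can be found numerically when both class densities are Gaussian; the parameter values below are made-up:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Made-up class parameters: dark background, bright foreground.
mu_b, sigma_b, P_b = 50.0, 10.0, 0.6   # background
mu_f, sigma_f, P_f = 150.0, 20.0, 0.4  # foreground

def diff(t):
    """P(F) f(t) - P(B) b(t); the optimal threshold T is its root."""
    return P_f * norm.pdf(t, mu_f, sigma_f) - P_b * norm.pdf(t, mu_b, sigma_b)

# The weighted densities cross somewhere between the two means.
T = brentq(diff, mu_b, mu_f)

# E(T): foreground mass below T plus background mass above T.
error = P_f * norm.cdf(T, mu_f, sigma_f) + P_b * norm.sf(T, mu_b, sigma_b)
print(f"optimal T = {T:.2f}, error = {error:.4f}")
```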

Two Gaussian distributions for a single feature
Assume that b(z) and f(z) are Gaussian distributions; then

$p(z) = \frac{P(B)}{\sqrt{2\pi}\,\sigma_B} e^{-\frac{(z-\mu_B)^2}{2\sigma_B^2}} + \frac{P(F)}{\sqrt{2\pi}\,\sigma_F} e^{-\frac{(z-\mu_F)^2}{2\sigma_F^2}}$

$\mu_B$ and $\mu_F$ are the mean values for background and foreground; $\sigma_B^2$ and $\sigma_F^2$ are the variances for background and foreground.

The 2-class classification problem summarized
Given two Gaussian distributions b(z) and f(z), the classes have prior probabilities P(F) and P(B). Every pixel should be assigned to the class that minimizes the classification error. The classification error is minimized at the point where P(F) f(z) = P(B) b(z). What we will do now is to generalize to N-dimensional features and K classes.

How do we find the best border between K classes with N features? We will find the theoretical answer and a geometrical interpretation of class means, variance, and the equivalent of a threshold.

The goal of classification
We estimate the decision boundaries based on training data. Classification performance is always estimated on a separate test data set, because we try to measure the generalization performance: the classifier should perform well when classifying new samples and have the lowest possible classification error. We often face a tradeoff between classification error on the training set and generalization ability when determining the complexity of the decision boundary.

Probability theory (Appendix A.4)
Let x be a discrete random variable that can assume any of a finite number of M different values. The M different values will in our case be one of M classes. The probability that x belongs to class i is $p_i = \Pr(x = i)$, $i = 1, \ldots, M$. A probability distribution must sum to 1, and probabilities must be positive, so $p_i \ge 0$ and $\sum_{i=1}^{M} p_i = 1$.

Expected values
The expected value or mean of a random variable x is:

$E[x] = \mu = \sum_{i=1}^{M} i\,p_i$

The variance or second-order moment is:

$\mathrm{Var}(x) = \sigma^2 = E[(x-\mu)^2] = \sum_{i=1}^{M} (i-\mu)^2\,p_i$
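A minimal numeric check of these two formulas in Python (the discrete distribution below is made-up):

```python
import numpy as np

# Made-up distribution over M = 4 values (e.g. class labels 1..4).
values = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])      # non-negative, sums to 1
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

mean = np.sum(values * p)                # E[x] = sum_i i * p_i
var = np.sum((values - mean) ** 2 * p)   # Var(x) = E[(x - mu)^2]
print(mean, var)                         # 3.0 1.0
```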

Pairs of random variables
Let x and y be random variables. The joint probability of observing a pair of values (x = i, y = j) is $p_{ij}$. Alternatively, we can define a joint probability distribution function p(x, y), for which $p(x,y) \ge 0$ and $\sum_x \sum_y p(x,y) = 1$. The marginal distributions for x and y (if we want to eliminate one of them) are:

$p_x(x) = \sum_y p(x,y), \qquad p_y(y) = \sum_x p(x,y)$

Statistical independence and expected values of two variables
Variables x and y are statistically independent if and only if $p(x,y) = p_x(x)\,p_y(y)$. Two variables are uncorrelated if $\sigma_{xy} = 0$. Expected values of two variables:

$E[f(x,y)] = \sum_x \sum_y f(x,y)\,p(x,y)$

$\mu_x = E[x], \quad \mu_y = E[y], \quad \sigma_{xy} = E[(x-\mu_x)(y-\mu_y)]$

Expected values of M variables
Using vector notation:

$\boldsymbol{\mu} = E[\mathbf{x}], \qquad \boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$

Conditional probability
If two variables are statistically dependent, knowing the value of one of them lets us get a better estimate of the value of the other one. The conditional probability of x given y is:

$\Pr(x = i \mid y = j) = \frac{\Pr(x = i,\, y = j)}{\Pr(y = j)}$

and for distributions:

$p(x \mid y) = \frac{p(x,y)}{p(y)}$

Example: threshold a page with dark text on a white background. x is the grey level of a pixel and y is its class (F or B). If we consider which grey levels x can have, we expect small values if y is text (y = F) and large values if y is background (y = B).

Bayes rule in general
The equation:

$p(y \mid x) = \frac{p(x \mid y)\,p(y)}{p(x)}$

In words: posterior = likelihood × prior / evidence. To be explained for the classification problem later :-)

Mean vectors and covariance matrices in N dimensions
If f is an N-dimensional feature vector, we can formulate its mean vector and covariance matrix as:

$\boldsymbol{\mu} = E[\mathbf{f}], \qquad \boldsymbol{\Sigma} = E[(\mathbf{f}-\boldsymbol{\mu})(\mathbf{f}-\boldsymbol{\mu})^T]$

With N features, the mean vector will be of size N×1 and $\boldsymbol{\Sigma}$ of size N×N.
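In practice, the expectations are replaced by sample averages over training data. A small NumPy sketch with synthetic data (np.cov with rowvar=False treats each row as one feature vector):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training set: n = 500 samples, N = 3 correlated features each.
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.0, 0.5]])

mu = X.mean(axis=0)              # mean vector, shape (3,)
Sigma = np.cov(X, rowvar=False)  # covariance matrix, shape (3, 3)
print(mu.shape, Sigma.shape)
```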

Bayes rule for a classification problem
Suppose we have J classes $\omega_j$, $j = 1, \ldots, J$, where $\omega_j$ is the class label for a pixel and x is the observed gray level (or feature vector). We can use Bayes rule to find an expression for the class with the highest probability:

$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$   (posterior probability = likelihood × prior probability / normalizing factor)

For thresholding, $P(\omega_j)$ is the prior probability for background or foreground. If we don't have special knowledge that one of the classes occurs more frequently than the other classes, we set them equal for all classes: $P(\omega_j) = 1/J$, $j = 1, \ldots, J$. $p(x \mid \omega_j)$ is the probability density function that models the likelihood of x.

Bayes rule explained

$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\,P(\omega_j)}{p(x)}$

$p(x \mid \omega_j)$ is the probability density function that models the likelihood of observing gray level x if the pixel belongs to class $\omega_j$. Typically, we assume a type of distribution, e.g. Gaussian, and the mean and covariance of that distribution are fitted to some data that we know belong to that class. This fitting is called classifier training. $P(\omega_j \mid x)$ is the posterior probability that the pixel actually belongs to class $\omega_j$. We will soon see that the classifier that achieves the minimum error is a classifier that assigns each pixel to the class that has the highest posterior probability. $p(x)$ is just a scaling factor that assures that the probabilities sum to 1.
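A small sketch of computing the posteriors for one gray level under two fitted Gaussian likelihoods (all numbers below are made-up):

```python
import numpy as np
from scipy.stats import norm

x = 120.0                                    # observed gray level
priors = np.array([0.5, 0.5])                # P(omega_j), equal if no prior knowledge
means, sigmas = [50.0, 150.0], [10.0, 20.0]  # fitted class-conditional Gaussians

likelihoods = np.array([norm.pdf(x, m, s) for m, s in zip(means, sigmas)])
evidence = np.sum(likelihoods * priors)      # p(x), the normalizing factor
posteriors = likelihoods * priors / evidence
print(posteriors, posteriors.sum())          # the posteriors sum to 1
```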

Probability of error
If we have 2 classes, we make an error either if we decide $\omega_1$ when the true class is $\omega_2$, or if we decide $\omega_2$ when the true class is $\omega_1$. If $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, we have more belief that x belongs to $\omega_1$, and we decide $\omega_1$. The probability of error is then:

$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$

Back to classification error for thresholding
(Figure: the background and foreground distributions again; in one region, foreground pixels are misclassified as background, and in the other region, background pixels are misclassified as foreground.)

$P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$

Minimizing the error

$P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$

When we derived the optimal threshold, we showed that the minimum error was achieved by placing the threshold (the decision border) at the point where $P(\omega_1 \mid x) = P(\omega_2 \mid x)$. This is still valid.

Bayes decision rule
In the 2-class case, our goal of minimizing the error implies a decision rule: decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise $\omega_2$. For J classes, the rule analogously extends to choosing the class with the maximum a posteriori probability. The decision boundary is the border between classes i and j, simply where $P(\omega_i \mid x) = P(\omega_j \mid x)$, exactly where the threshold was set in minimum error thresholding!

Bayes classification with J classes and D features
How do we generalize: to more than one feature at a time; to J classes; to considering loss functions, so that some errors are more costly than others?

Feature space
If we measure d features, x will be a d-dimensional feature vector. Let $\{\omega_1, \ldots, \omega_c\}$ be a set of c classes. The posterior probability for class $\omega_j$ is now computed as:

$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\,P(\omega_j)}{p(\mathbf{x})}, \qquad p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\,P(\omega_j)$

Still, we assign a pixel with feature vector x to the class that has the highest posterior probability: decide $\omega_j$ if $P(\omega_j \mid \mathbf{x}) > P(\omega_i \mid \mathbf{x})$ for all $i \ne j$.

Discriminant functions
The decision rule "decide $\omega_j$ if $P(\omega_j \mid \mathbf{x}) > P(\omega_i \mid \mathbf{x})$ for all $i \ne j$" can be written as: assign x to $\omega_j$ if $g_j(\mathbf{x}) > g_i(\mathbf{x})$ for all $i \ne j$. The classifier computes c discriminant functions and selects the class corresponding to the largest value of the discriminant function. Since classification consists of choosing the class that has the largest value, a scaling of the discriminant function $g_i(\mathbf{x})$ by $f(g_i(\mathbf{x}))$ will not affect the decision if f is a monotonically increasing function. This can lead to simplifications, as we will soon see.

Equivalent discriminant functions
The following choices of discriminant functions give equivalent decisions:

$g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\,P(\omega_i)}{p(\mathbf{x})}$
$g_i(\mathbf{x}) = p(\mathbf{x} \mid \omega_i)\,P(\omega_i)$
$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$

The effect of the decision rules is to divide the feature space into c decision regions $R_1, \ldots, R_c$. If $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$, then x is in region $R_i$. The regions are separated by decision boundaries, surfaces in feature space where the discriminant functions for two classes are equal.
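A quick numeric illustration that the log form picks the same class as $p(\mathbf{x} \mid \omega_i)\,P(\omega_i)$, since ln is monotonically increasing (all parameters made-up):

```python
import numpy as np
from scipy.stats import norm

x = 120.0
priors = [0.6, 0.4]
means, sigmas = [50.0, 150.0], [10.0, 20.0]

g_lin = [norm.pdf(x, m, s) * P for m, s, P in zip(means, sigmas, priors)]
g_log = [norm.logpdf(x, m, s) + np.log(P) for m, s, P in zip(means, sigmas, priors)]

# The monotone transform ln(.) does not change which class wins.
assert np.argmax(g_lin) == np.argmax(g_log)
print("decide class", np.argmax(g_lin) + 1)
```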

Decision functions, two classes
If we have only two classes, assigning x to $\omega_1$ if $g_1(\mathbf{x}) > g_2(\mathbf{x})$ is equivalent to using a single discriminant function $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$ and deciding $\omega_1$ if $g(\mathbf{x}) > 0$. The following functions are equivalent:

$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x})$
$g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$

The Gaussian density, univariate case (a single feature)
To use a classifier, we need to select a probability density function $p(x \mid \omega_i)$. The most commonly used probability density is the normal (Gaussian) distribution:

$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

with expected value (mean) $\mu = E[x] = \int x\,p(x)\,dx$ and variance $\sigma^2 = E[(x-\mu)^2] = \int (x-\mu)^2\,p(x)\,dx$.

Training a univariate Gaussian classifier
To be able to compute the value of the discriminant function, we need an estimate of $\mu_i$ and $\sigma_i$ for each class. Assume that we know the true class labels for some pixels, and that this is given in a mask image. Training the classifier then consists of computing $\mu_i$ and $\sigma_i$ from all pixels with class label i in the mask file.

Classification with a univariate Gaussian
Decide on values for the prior probabilities $P(\omega_i)$; if we have no prior information, assume that all classes are equally probable, $P(\omega_i) = 1/c$. Estimate $\mu_i$ and $\sigma_i$ based on training data. Compute the discriminant function

$g_i(x) = p(x \mid \omega_i)\,P(\omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right) P(\omega_i)$

for all classes, and assign each pattern to the class with the highest value. A simple measure of classification accuracy can be to count the percentage of correctly classified pixels, overall, averaged over all classes, or per class. If a pixel has true class label k, it is correctly classified if the assigned class label equals k.
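A minimal sketch of this training and classification procedure on a synthetic image (the slides do not prescribe an implementation; all names, sizes, and parameter values below are made-up):

```python
import numpy as np
from scipy.stats import norm

def train(image, mask, classes):
    """Estimate (mu_i, sigma_i) per class from the pixels labeled in the mask."""
    return {c: (image[mask == c].mean(), image[mask == c].std()) for c in classes}

def classify(image, params, priors):
    """Assign each pixel to the class with largest g_i(x) = ln p(x|w_i) + ln P(w_i)."""
    classes = sorted(params)
    g = np.stack([norm.logpdf(image, *params[c]) + np.log(priors[c]) for c in classes])
    return np.asarray(classes)[np.argmax(g, axis=0)]

# Synthetic image: dark class 1 in the top half, bright class 2 in the bottom half.
rng = np.random.default_rng(1)
image = np.vstack([rng.normal(50, 10, (32, 64)), rng.normal(150, 20, (32, 64))])
mask = np.zeros((64, 64), dtype=int)
mask[:8, :], mask[-8:, :] = 1, 2   # a few labeled training rows per class

params = train(image, mask, classes=[1, 2])
labels = classify(image, params, priors={1: 0.5, 2: 0.5})

truth = np.repeat([1, 2], 32)[:, None] * np.ones((1, 64), dtype=int)
print("overall accuracy:", (labels == truth).mean())
```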

Example: image and training masks
(Figure: an example image shown together with its training masks.)