Computer Vision. Pattern Recognition Concepts Part I. Luis F. Teixeira MAP-i 2012/13


What is it? Pattern Recognition. Many definitions in the literature: "The assignment of a physical object or event to one of several prespecified categories" (Duda and Hart). "A problem of estimating density functions in a high-dimensional space and dividing the space into the regions of categories or classes" (Fukunaga). "Given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples" (Ripley). "The science that concerns the description or classification (recognition) of measurements" (Schalkoff). "The process of giving names ω to observations x" (Schürmann). "Pattern Recognition is concerned with answering the question: What is this?" (Morse).

Pattern Recognition. Related fields: adaptive signal processing, machine learning, robotics and vision, cognitive sciences, mathematical statistics, nonlinear optimization, exploratory data analysis, fuzzy and genetic systems, detection and estimation theory, formal languages, structural modeling, biological cybernetics, computational neuroscience. Applications: image processing, computer vision, speech recognition, scene understanding, search, retrieval and visualization, computational photography, human-computer interaction, biometrics, document analysis, industrial inspection, financial forecast, medical diagnosis, surveillance and security, art, cultural heritage and entertainment.

Pattern Recognition System. A typical pattern recognition system contains: a sensor; a preprocessing mechanism; a feature extraction mechanism (manual or automated); a classification or description algorithm; a set of examples (training set) already classified or described.

Algorithms. Classification: supervised, categorical labels (Bayesian classifier, KNN, SVM, Decision Tree, Neural Network, etc.). Clustering: unsupervised, categorical labels (mixture models, K-means clustering, hierarchical clustering, etc.). Regression: supervised or unsupervised, real-valued labels.

Concepts: Feature. A feature is any distinctive aspect, quality or characteristic. Features may be symbolic (e.g., color) or numeric (e.g., height). The combination of d features is represented as a d-dimensional column vector called a feature vector. The d-dimensional space defined by the feature vector is called the feature space. Objects are represented as points in the feature space; this representation is called a scatter plot.

Concepts: Pattern. A pattern is a composite of traits or features characteristic of an individual. In classification, a pattern is a pair of variables {x, ω}, where x is a collection of observations or features (feature vector) and ω is the concept behind the observation (label). What makes a good feature vector? The quality of a feature vector is related to its ability to discriminate examples from different classes: examples from the same class should have similar feature values; examples from different classes should have different feature values.

Concepts: Good features? (figure illustrating feature properties)

Concepts: Classifiers. The goal of a classifier is to partition the feature space into class-labeled decision regions. Borders between decision regions are called decision boundaries.

Classification. y = f(x): y is the output prediction, f is the prediction function, x is the feature vector. Training: given a training set of labeled examples {(x_1, y_1), ..., (x_N, y_N)}, estimate the prediction function f by minimizing the prediction error on the training set. Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x).
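As an illustration of this train/test view, here is a minimal sketch in Python (NumPy) that learns a very simple prediction function f (a nearest-centroid rule) on invented 2-D feature vectors; the data, function names and class labels are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def train_nearest_centroid(X, y):
    """Learn a very simple f: store one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Apply the learned f to a never-before-seen feature vector x."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

# Toy training set {(x_1, y_1), ..., (x_N, y_N)} -- values are made up
X_train = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 4.5], [4.2, 4.1]])
y_train = np.array([0, 0, 1, 1])

f = train_nearest_centroid(X_train, y_train)
print(predict(f, np.array([3.9, 4.4])))   # expected: class 1
```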

Classification. Given a collection of labeled examples (training examples, e.g., a digit labeled "four"), come up with a function that will predict the labels of new examples (novel input, e.g., an unlabeled "nine"). How good is the function we come up with to do the classification? It depends on the mistakes made and on the cost associated with those mistakes.

An example*. Problem: sorting incoming fish on a conveyor belt according to species. Assume that we have only two kinds of fish: salmon and sea bass. *Adapted from Duda, Hart and Stork, Pattern Classification, 2nd Ed.

An example: the problem. (figure: what humans see vs. what computers see)

An example: decision process. What kind of information can distinguish one species from the other? Length, width, weight, number and shape of fins, tail shape, etc. What can cause problems during sensing? Lighting conditions, position of fish on the conveyor belt, camera noise, etc. What are the steps in the process? Capture image -> isolate fish -> take measurements -> make decision.

An example: our system. Sensor: the camera captures an image as a new fish enters the sorting area. Preprocessing: adjustments for average intensity levels; segmentation to separate fish from background. Feature extraction: assume a fisherman told us that a sea bass is generally longer than a salmon; we can use length as a feature and decide between sea bass and salmon according to a threshold on length. Classification: collect a set of examples from both species, plot a distribution of lengths for both classes, and determine a decision boundary (threshold) that minimizes the classification error.
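A hedged sketch of the classification step (choosing the length threshold that minimizes training error) is below; the length distributions and all numbers are synthetic assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
salmon_len = rng.normal(30.0, 4.0, 200)    # hypothetical salmon lengths
seabass_len = rng.normal(38.0, 4.0, 200)   # hypothetical sea bass lengths

lengths = np.concatenate([salmon_len, seabass_len])
labels = np.concatenate([np.zeros(200), np.ones(200)])   # 1 = sea bass

# Decision rule: "sea bass if length > threshold"; pick the threshold
# with the lowest classification error on the training samples.
candidates = np.sort(lengths)
errors = [np.mean((lengths > t) != labels) for t in candidates]
best = int(np.argmin(errors))
print(f"threshold = {candidates[best]:.1f}, training error = {errors[best]:.1%}")
```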

An example: features. We estimate the system's probability of error and obtain a discouraging result of 40%. Can we improve this result?

An example: features. Even though sea bass are longer than salmon on average, there are many examples of fish where this observation does not hold. Committed to achieving a higher recognition rate, we try a number of features (width, area, position of the eyes w.r.t. the mouth...), only to find out that these features contain no discriminatory information. Finally we find a good feature: the average intensity of the scales.

An example: features. Histogram of the lightness feature for the two types of fish in the training samples. It looks easier to choose the threshold, but we still cannot make a perfect decision.

An example: multiple features. We can use two features in our decision: lightness x_1 and length x_2. Each fish image is now represented as a point (feature vector) x = [x_1, x_2]^T in a two-dimensional feature space.

An example: multiple features. Scatter plot of the lightness and length features for the training samples. We can compute a decision boundary to divide the feature space into two regions, with a classification rate of 95.7%.
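A minimal sketch of fitting a linear decision boundary on two features, assuming invented (lightness, length) samples and a simple least-squares fit; this is only one possible way to obtain such a boundary and is not the method used on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented (lightness, length) samples for the two species
salmon = rng.normal([4.0, 30.0], [1.0, 4.0], size=(200, 2))
seabass = rng.normal([7.0, 38.0], [1.0, 4.0], size=(200, 2))

X = np.vstack([salmon, seabass])
y = np.concatenate([-np.ones(200), np.ones(200)])   # -1 salmon, +1 sea bass

# Fit w1*x1 + w2*x2 + b to the labels by least squares and
# classify by the sign of the fitted value (a linear boundary).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
rate = np.mean(np.sign(A @ w) == y)
print(f"training classification rate = {rate:.1%}")
```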

An example: cost of error. We should also consider the costs of the different errors we make in our decisions. For example, suppose the fish packing company knows that: customers who buy salmon will object vigorously if they see sea bass in their cans; customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans. How does this knowledge affect our decision?

An example: cost of error. We could intuitively shift the decision boundary to minimize an alternative cost function.
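A sketch of this idea, assuming a 1-D lightness feature, Gaussian-like synthetic data and made-up error costs; it simply searches for the threshold that minimizes the total (asymmetric) cost.

```python
import numpy as np

rng = np.random.default_rng(2)
salmon = rng.normal(4.0, 1.0, 200)       # class C2 (hypothetical lightness)
seabass = rng.normal(7.0, 1.0, 200)      # class C1 (hypothetical lightness)

cost_bass_as_salmon = 10.0   # sea bass in a salmon can: customers object
cost_salmon_as_bass = 1.0    # salmon in a sea bass can: acceptable

def total_cost(t):
    # Decision rule: "sea bass if lightness > t"
    bass_missed = np.sum(seabass <= t) * cost_bass_as_salmon
    salmon_missed = np.sum(salmon > t) * cost_salmon_as_bass
    return bass_missed + salmon_missed

thresholds = np.linspace(2.0, 9.0, 200)
best_t = min(thresholds, key=total_cost)
print(f"cost-minimizing threshold = {best_t:.2f}")  # shifted toward the salmon side
```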

An example: generalization. The issue of generalization: the recognition rate of our linear classifier (95.7%) met the design specifications, but we still think we can improve the performance of the system. We then design an über-classifier that obtains an impressive classification rate of 99.9975% on the training samples with a much more complex decision boundary.

An example: generalization. The issue of generalization: satisfied with our classifier, we integrate the system and deploy it to the fish processing plant. A few days later the plant manager calls to complain that the system is misclassifying an average of 25% of the fish. What went wrong?

Design cycle. Data collection: probably the most time-intensive component of a pattern recognition problem. Data acquisition and sensing: measurements of physical variables; important issues include bandwidth, resolution, sensitivity, distortion, SNR, latency, etc. Collecting training and testing data: how can we know when we have an adequately large and representative set of samples?

Design cycle. Feature choice: critical to the success of pattern recognition; finding a new representation in terms of features; discriminative features have similar values for similar patterns and different values for different patterns. Model learning and estimation: learning a mapping between features and pattern groups and categories.

Design cycle. Model selection: definition of design criteria; domain dependence and prior information; computational cost and feasibility; parametric vs. non-parametric models; types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid. Model training: how can we learn the rule from data? Given a feature set and a blank model, adapt the model to explain the data. Supervised, unsupervised and reinforcement learning.

Design cycle. Predicting: using features and learned models to assign a pattern to a category. Evaluation: how can we estimate the performance with training samples? How can we predict the performance with future data? Problems of overfitting and generalization.

Review of probability theory. Basic probability: X is a random variable; P(X) is the probability that X takes a certain value, described by a probability mass function for discrete X or a probability density function (PDF) for continuous X.

Conditional probability. If A and B are two events, the probability of event A when we already know that event B has occurred is defined by the relation P[A|B] = P[A ∩ B] / P[B], for P[B] > 0. P[A|B] is read as the conditional probability of A conditioned on B, or simply the probability of A given B. (figure: graphical interpretation)

Conditional probability. Theorem of total probability: let B_1, B_2, ..., B_N be mutually exclusive events; then P[A] = P[A|B_1]P[B_1] + ... + P[A|B_N]P[B_N] = Σ_{k=1}^{N} P[A|B_k]P[B_k]. Bayes theorem: given B_1, B_2, ..., B_N, a partition of the sample space S, suppose that event A occurs; what is the probability of event B_j? Using the definition of conditional probability and the theorem of total probability we obtain P[B_j|A] = P[A ∩ B_j] / P[A] = P[A|B_j]P[B_j] / Σ_{k=1}^{N} P[A|B_k]P[B_k].
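A tiny numeric check of these two formulas with a made-up partition {B_1, B_2}; the probabilities are invented for illustration only.

```python
# Assumed priors and likelihoods for a two-event partition
P_B = [0.3, 0.7]                 # P[B1], P[B2]
P_A_given_B = [0.9, 0.2]         # P[A|B1], P[A|B2]

# Theorem of total probability: P[A] = sum_k P[A|Bk] P[Bk]
P_A = sum(pa * pb for pa, pb in zip(P_A_given_B, P_B))

# Bayes theorem: P[B1|A] = P[A|B1] P[B1] / P[A]
posterior_B1 = P_A_given_B[0] * P_B[0] / P_A
print(P_A, posterior_B1)         # 0.41 and roughly 0.659
```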

Bayes theorem. For pattern recognition, Bayes theorem can be expressed as P(ω_j|x) = P(x|ω_j)P(ω_j) / Σ_{k=1}^{N} P(x|ω_k)P(ω_k) = P(x|ω_j)P(ω_j) / P(x), where ω_j is the j-th class and x is the feature vector. Each term in the Bayes theorem has a special name: P(ω_j) is the prior probability (of class ω_j); P(ω_j|x) is the posterior probability (of class ω_j given the observation x); P(x|ω_j) is the likelihood (conditional probability of x given class ω_j); P(x) is the evidence (a normalization constant that does not affect the decision). Two commonly used decision rules are Maximum A Posteriori (MAP): choose the class ω_j with highest P(ω_j|x); and Maximum Likelihood (ML): choose the class ω_j with highest P(x|ω_j). ML and MAP are equivalent for non-informative priors (P(ω_j) constant).
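A minimal sketch of the MAP and ML rules for two classes, assuming Gaussian likelihoods with invented means, variances and priors; it shows how an informative prior can change the decision.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density, used here as an assumed likelihood model."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([0.8, 0.2])                    # assumed P(w1), P(w2)
means, stds = np.array([4.0, 7.0]), np.array([1.0, 1.0])

x = 5.8                                          # observed feature value
likelihoods = gauss_pdf(x, means, stds)          # P(x|w_j)
posteriors = likelihoods * priors
posteriors /= posteriors.sum()                   # divide by the evidence P(x)

print("ML decision :", np.argmax(likelihoods))   # ignores the priors -> class 1
print("MAP decision:", np.argmax(posteriors))    # uses the priors    -> class 0
```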

Bayesian decision theory. Bayesian decision theory is a statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions. Fish sorting example: define C, the type of fish we observe (state of nature), as a random variable where C = C_1 for sea bass and C = C_2 for salmon. P(C_1) is the a priori probability that the next fish is a sea bass; P(C_2) is the a priori probability that the next fish is a salmon.

Prior probabilities. Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. How can we choose P(C_1) and P(C_2)? Set P(C_1) = P(C_2) if they are equiprobable (uniform priors); we may use different values depending on the fishing area, time of the year, etc. Assume there are no other types of fish, so P(C_1) + P(C_2) = 1. In a general classification problem with K classes, prior probabilities reflect prior expectations of observing each class, and Σ_{i=1}^{K} P(C_i) = 1.

Class-conditional probabilities. Let x be a continuous random variable representing the lightness measurement. Define p(x|C_j) as the class-conditional probability density (the probability of x given that the state of nature is C_j, for j = 1, 2). p(x|C_1) and p(x|C_2) describe the difference in lightness between the populations of sea bass and salmon.

Posterior probabilities. Suppose we know P(C_j) and p(x|C_j) for j = 1, 2, and measure the lightness of a fish as the value x. Define P(C_j|x) as the a posteriori probability (the probability that the type is C_j, given the measurement of feature value x). We can use the Bayes formula to convert the prior probability into the posterior probability: P(C_j|x) = p(x|C_j)P(C_j) / p(x), where p(x) = Σ_{j=1}^{2} p(x|C_j)P(C_j).

Making a decision. How can we make a decision after observing the value of x? Decide C_1 if P(C_1|x) > P(C_2|x), and C_2 otherwise. Rewriting the rule gives: decide C_1 if p(x|C_1)/p(x|C_2) > P(C_2)/P(C_1), and C_2 otherwise. The Bayes decision rule minimizes the error of this decision.
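A short sketch of the likelihood-ratio form of this rule, again assuming Gaussian class-conditional densities for lightness; the means, spreads and priors are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

P_C1, P_C2 = 2 / 3, 1 / 3            # assumed priors for sea bass (C1) and salmon (C2)
x = 6.0                              # measured lightness

# Decide C1 if p(x|C1)/p(x|C2) > P(C2)/P(C1)
ratio = gauss_pdf(x, 7.0, 1.0) / gauss_pdf(x, 4.0, 1.0)
decision = "C1 (sea bass)" if ratio > P_C2 / P_C1 else "C2 (salmon)"
print(round(ratio, 2), decision)
```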

Making a decision: confusion matrix. For C_1 we have: true C_1 assigned C_1 is a correct detection; true C_2 assigned C_1 is a false alarm; true C_1 assigned C_2 is a mis-detection; true C_2 assigned C_2 is a correct rejection. The two types of errors (false alarm and mis-detection) can have distinct costs.
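A minimal sketch of counting these four outcomes from true and assigned labels; the label arrays below are made up for illustration.

```python
import numpy as np

true     = np.array(["C1", "C1", "C2", "C2", "C1", "C2"])
assigned = np.array(["C1", "C2", "C1", "C2", "C1", "C2"])

correct_detection = np.sum((true == "C1") & (assigned == "C1"))
mis_detection     = np.sum((true == "C1") & (assigned == "C2"))
false_alarm       = np.sum((true == "C2") & (assigned == "C1"))
correct_rejection = np.sum((true == "C2") & (assigned == "C2"))
print(correct_detection, false_alarm, mis_detection, correct_rejection)
```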

Minimum-error-rate classification. Let {C_1, ..., C_K} be the finite set of K states of nature (classes, categories). Let x be the D-component vector-valued random variable (feature vector). If all errors are equally costly, the minimum-error decision rule is defined as: decide C_i if P(C_i|x) > P(C_j|x) for all j ≠ i. The resulting error is called the Bayes error and is the best performance that can be achieved.

Bayesian decision theory. Bayesian decision theory gives the optimal decision rule under the assumption that the true values of the probabilities are known. But how can we estimate (learn) the unknown p(x|C_j), j = 1, ..., K? Parametric models: assume that the form of the density functions is known. Non-parametric models: no assumption about the form.

Bayesian decision theory. Parametric models: density models (e.g., Gaussian), mixture models (e.g., mixture of Gaussians), Hidden Markov Models, Bayesian Belief Networks. Non-parametric models: histogram-based estimation, Parzen window estimation, nearest neighbour estimation.

Gaussian density. The Gaussian can be considered as a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector. For x ∈ R^D: p(x) = N(μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(-(1/2) (x - μ)^T Σ^{-1} (x - μ)). For x ∈ R: p(x) = N(μ, σ²) = (1 / (√(2π) σ)) exp(-(x - μ)² / (2σ²)). Some properties of the Gaussian: analytically tractable; completely specified by the 1st and 2nd moments; has the maximum entropy of all distributions with a given mean and variance; many processes are asymptotically Gaussian (Central Limit Theorem); for jointly Gaussian variables, uncorrelatedness implies independence.
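A small sketch that evaluates the multivariate Gaussian density exactly as written above; the mean vector and covariance matrix are arbitrary example values.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate N(mu, Sigma) at x for x in R^D."""
    D = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_density(np.array([0.5, -1.0]), mu, Sigma))
```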

Bayes linear classifier. Let us assume that the class-conditional densities are Gaussian and then explore the resulting form for the posterior probabilities. Assume that all classes share the same covariance matrix; the density for class C_k is then given by p(x|C_k) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp(-(1/2) (x - μ_k)^T Σ^{-1} (x - μ_k)). We model the class-conditional densities p(x|C_k) and class priors p(C_k), and use these to compute the posterior probabilities p(C_k|x) through Bayes' theorem: p(C_k|x) = p(x|C_k)p(C_k) / Σ_{j=1}^{K} p(x|C_j)p(C_j). Assuming only 2 classes, the decision boundary is linear.
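A hedged sketch of this shared-covariance Gaussian classifier on synthetic 2-D data; the data, priors and the resulting linear discriminant form g_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + ln p(C_k) are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([4.0, 30.0], [[1.0, 0.0], [0.0, 16.0]], 200)  # class C1
X2 = rng.multivariate_normal([7.0, 38.0], [[1.0, 0.0], [0.0, 16.0]], 200)  # class C2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Xc = np.vstack([X1 - mu1, X2 - mu2])
Sigma = Xc.T @ Xc / Xc.shape[0]            # shared covariance estimate
prior1 = prior2 = 0.5                      # assumed equal priors
Sinv = np.linalg.inv(Sigma)

def g(x, mu, prior):
    # Linear discriminant obtained after taking logs and dropping common terms
    return x @ Sinv @ mu - 0.5 * mu @ Sinv @ mu + np.log(prior)

x_new = np.array([5.5, 34.0])
print("C1" if g(x_new, mu1, prior1) > g(x_new, mu2, prior2) else "C2")
```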

Bayesian decision theory. Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities P(C_k) and the class-conditional densities p(x|C_k). Unfortunately, we rarely have complete knowledge of the probabilistic structure. However, we can often find design samples or training data that include particular representatives of the patterns we want to classify. The maximum likelihood estimates of a Gaussian are μ̂ = (1/n) Σ_{i=1}^{n} x_i and Σ̂ = (1/n) Σ_{i=1}^{n} (x_i - μ̂)(x_i - μ̂)^T.
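A minimal sketch of computing these maximum likelihood estimates from a sample; the sample itself is synthetic, drawn from an assumed Gaussian so the estimates can be compared with the true parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.multivariate_normal([1.0, 2.0], [[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_hat = X.mean(axis=0)                     # (1/n) * sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / X.shape[0]      # (1/n) * sum_i (x_i - mu)(x_i - mu)^T

print(mu_hat)
print(Sigma_hat)
```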

Supervised classification. In practice there are two general strategies: use the training data to build a representative probability model, separately modeling the class-conditional densities and priors (generative); or directly construct a good decision boundary and model the posterior (discriminative).

Discriminative classifiers. In general, discriminant functions can be defined independently of the Bayesian rule. They lead to suboptimal solutions, yet, if chosen appropriately, they can be computationally more tractable. Moreover, in practice they may also lead to better solutions; this may be the case, for example, if the nature of the underlying pdfs is unknown.

Discriminative classifiers. Non-Bayesian classifiers. Distance-based classifiers: minimum (mean) distance classifier, nearest neighbour classifier. Decision boundary-based classifiers: linear discriminant functions, support vector machines, neural networks, decision trees.

References. Selim Aksoy, Introduction to Pattern Recognition, Part I, http://retina.cs.bilkent.edu.tr/papers/patrec_tutorial1.pdf. Ricardo Gutierrez-Osuna, Introduction to Pattern Recognition, http://research.cs.tamu.edu/prism/lectures/pr/pr_l1.pdf. Jaime S. Cardoso, Classification Concepts, http://www.dcc.fc.up.pt/~mcoimbra/lectures/mapi_1112/CV_1112_5b_ClassificationConcepts.pdf. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, John Wiley & Sons, 2001.