Statistical Pattern Recognition


Statistical Pattern Recognition. Classification: Non-Parametric Modeling. Hamid R. Rabiee, Jafar Muhammadi. Spring 2014. http://ce.sharif.edu/courses/92-93/2/ce725-2/

Agenda: Parametric Modeling; Non-Parametric Modeling; Density Estimation; Parzen Window; Parzen Window - Illustration; Parzen Window and Classification; K-Nearest Neighbor (K-NN); K-NN - Illustration; K-NN and a-posteriori probabilities; K-NN and Classification; Pros and Cons.

Parametric Modeling. Data availability in a Bayesian framework: we could design an optimal classifier if we knew P(w_i) and P(x | w_i). Unfortunately, we rarely have that much information available! Assumptions: a priori information about the problem, and the form of the underlying density. Example: normality of P(x | w_i), characterized by 2 parameters. Estimation techniques (studied in the stochastic processes course): Maximum-Likelihood (ML) and Bayesian estimation (MAP: Maximum A Posteriori); the results are nearly identical, but the approaches are different! Other techniques (will be discussed later): Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM).
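
To make the parametric route concrete, here is a minimal sketch (not from the slides; data and names are illustrative) of maximum-likelihood estimation for a univariate Gaussian class-conditional density, where the ML estimates are the sample mean and the biased sample variance.

```python
import numpy as np

# Hypothetical 1-D samples assumed to come from a single class w_i
x = np.array([2.1, 1.8, 2.5, 2.0, 1.6, 2.3])

# ML estimates for N(mu, sigma^2): sample mean and biased sample variance (divide by n)
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()

def gaussian_pdf(t, mu, sigma2):
    """Evaluate the fitted Gaussian density at t."""
    return np.exp(-(t - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(mu_hat, sigma2_hat, gaussian_pdf(2.0, mu_hat, sigma2_hat))
```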

Non-Parametric Modeling. Non-parametric modeling tries to model arbitrary distributions without assuming a certain parametric form. Non-parametric models can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known. Moreover, they can be used with multimodal distributions, which are much more common in practice than unimodal distributions. There are two types of non-parametric methods: estimating P(x | w_j) (Parzen window), and estimating P(w_j | x) directly, bypassing density estimation and going straight to the a-posteriori probabilities (K-Nearest Neighbor).

Density Estimation. Basic idea: the probability that a vector x falls in a region R is

$$P = \int_R p(x')\,dx'$$

so P is a smoothed (or averaged) version of the density function p(x). If we have a sample of size n, the probability that exactly k of the n points fall in R is binomial:

$$P_k = \binom{n}{k}\, P^k (1 - P)^{n-k}$$

and the expected value of k is E(k) = nP. The ML estimate of P is $\hat{P}_{ML} = k/n$; therefore, the ratio k/n is a good estimate of the probability P. Assuming p(x) is continuous and the region R is so small that p does not vary significantly within it, we can write (V is the volume of R):

$$P = \int_R p(x')\,dx' \approx p(x)\,V$$

Combining the equations above, the density estimate becomes

$$p(x) \approx \frac{k/n}{V}$$
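
As a toy numeric illustration of $p(x) \approx (k/n)/V$ (my sketch, not from the slides): draw samples from a known density, count how many fall in a small region around x, and divide.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)        # n samples from N(0, 1)

x, half_width = 0.0, 0.05                # region R = [x - 0.05, x + 0.05]
V = 2 * half_width                       # 1-D "volume" of R
k = np.sum(np.abs(samples - x) <= half_width)   # number of samples falling in R
n = samples.size

p_hat = (k / n) / V                      # p(x) ~ (k/n) / V
print(p_hat)                             # close to 1/sqrt(2*pi) ~ 0.399 for N(0,1) at 0
```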

Density Estimation (continued). The volume V needs to approach zero if we want this estimate to converge to the true density. Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited; theoretically, if an unlimited number of samples were available, we could circumvent this difficulty. To estimate the density at x subject to these limitations, we proceed in steps: in the n-th step, consider a total of n data samples and form a region R_n containing x. Let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the n-th estimate of p(x); then

$$p_n(x) = \frac{k_n/n}{V_n}$$

Three necessary conditions for p_n(x) to converge to p(x) are

$$\lim_{n\to\infty} V_n = 0, \qquad \lim_{n\to\infty} k_n = \infty, \qquad \lim_{n\to\infty} k_n/n = 0$$

There are two different ways of obtaining sequences of regions that satisfy these conditions. Parzen-window estimation: shrink an initial region, e.g. $V_n = 1/\sqrt{n}$, and show that $p_n(x) \to p(x)$. k_n-nearest-neighbor estimation: specify $k_n$ as some function of n, such as $k_n = \sqrt{n}$, and grow the volume $V_n$ until it encloses the $k_n$ nearest neighbors of x.
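
A quick numeric check (an illustration, not part of the slides) that the Parzen choice $V_n = 1/\sqrt{n}$ and the k-NN choice $k_n = \sqrt{n}$ satisfy the three convergence conditions as n grows.

```python
import math

# V_n -> 0, k_n -> infinity, k_n / n -> 0
for n in (10, 1_000, 100_000, 10_000_000):
    V_n = 1 / math.sqrt(n)      # Parzen-window choice: shrink the region
    k_n = math.sqrt(n)          # k-NN choice: grow k, but slower than n
    print(f"n={n:>9}  V_n={V_n:.5f}  k_n={k_n:>8.1f}  k_n/n={k_n / n:.6f}")
```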

Density Estimation: Parzen window vs. k-nearest neighbor (figure).

Parzen Window. Parzen-window approach to estimating densities: assume the region R_n is a d-dimensional hypercube with edge length h_n, so its volume is $V_n = h_n^d$. Let $\varphi(u)$ be the following window function:

$$\varphi(u) = \begin{cases} 1 & |u_j| \le 1/2, \; j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}$$

Then $\varphi((x - x_i)/h_n)$ is equal to unity if $x_i$ falls within the hypercube of volume $V_n$ centered at x, and equal to zero otherwise. The number of samples in this hypercube is

$$k_n = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

and we obtain the following estimate:

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{V_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

p_n(x) estimates p(x) as an average of functions of x and the samples x_i (i = 1, ..., n). These window functions can themselves be general density functions.
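
A minimal sketch of this hypercube estimator, assuming the samples sit in a NumPy array of shape (n, d); function and variable names are mine, not from the slides.

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h**d."""
    n, d = samples.shape
    V = h ** d
    k = hypercube_window((x - samples) / h).sum()   # samples inside the hypercube
    return k / (n * V)

# Usage: 500 two-dimensional samples, estimate the density at the origin
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
print(parzen_estimate(np.zeros(2), X, h=0.5))
```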

Parzen Window. Example: the behavior of the Parzen-window method when both p(x) and $\varphi(u)$ are N(0, 1). Let

$$\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}, \qquad h_n = \frac{h_1}{\sqrt{n}} \quad (h_1 \text{ a known parameter})$$

Thus

$$p_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_n}\, \varphi\!\left(\frac{x - x_i}{h_n}\right)$$

so p_n is an average of normal densities centered at the samples x_i. Numerical result for n = 1 and h_1 = 1:

$$p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - x_1)^2/2} = N(x_1, 1)$$

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable.
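
A 1-D sketch of the Gaussian-window version with $h_n = h_1/\sqrt{n}$; again the names are illustrative, and the n = 1 case reproduces the $N(x_1, 1)$ result above.

```python
import numpy as np

def gaussian_parzen(x, samples, h1):
    """1-D Parzen estimate with phi(u) = N(0,1) pdf and h_n = h1 / sqrt(n)."""
    n = samples.size
    h_n = h1 / np.sqrt(n)
    u = (x - samples) / h_n
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return phi.sum() / (n * h_n)          # average of Gaussians centered at the samples

# With a single sample and h1 = 1, the estimate equals N(x1, 1)
x1 = np.array([0.0])
print(gaussian_parzen(0.0, x1, h1=1.0))   # ~ 1/sqrt(2*pi) ~ 0.399
```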

Parzen Window - Illustration. Example illustration (figure). Note that the n = ∞ estimates are the same and match the true density function regardless of window width.

Parzen Window - Illustration. Example 2: the case where $p(x) = \lambda_1 U(a, b) + \lambda_2 T(c, d)$ is an unknown density, a mixture of a uniform and a triangle density. Then p_n behaves the same as in the previous example.

Parzen Window and Classification. In classifiers based on Parzen-window estimation, we estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior. Using the points of only category w_i, P(x | w_i) can be estimated; knowing P(w_i), the posterior probabilities can then be found. The decision region for a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure (see next slide).

Parzen Window and Classification. The left figure: a small h (complicated boundaries); the right figure: a larger h (simple boundaries). Comparing the upper and lower regions of the two cases, a small h is appropriate for the upper region and a large h for the lower region; no single window width is ideal overall.
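
A sketch of the Parzen-window classifier described above, under the assumptions of 1-D data, a Gaussian window, and equal priors unless others are supplied; all names are illustrative.

```python
import numpy as np

def gaussian_parzen(x, samples, h):
    """1-D Parzen density estimate with a Gaussian window of width h."""
    u = (x - samples) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)).sum() / (samples.size * h)

def parzen_classify(x, class_samples, h, priors=None):
    """Pick the label maximizing p(x | w_i) * P(w_i), densities via Parzen windows."""
    labels = list(class_samples)
    if priors is None:
        priors = {lab: 1.0 / len(labels) for lab in labels}   # equal priors by default
    scores = {lab: gaussian_parzen(x, class_samples[lab], h) * priors[lab] for lab in labels}
    return max(scores, key=scores.get)

# Usage with two hypothetical classes
data = {"w1": np.array([0.0, 0.3, -0.2, 0.1]), "w2": np.array([2.0, 2.4, 1.8, 2.2])}
print(parzen_classify(0.2, data, h=0.5))   # -> "w1"
```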

Parzen Window 1D Example. Suppose we have 7 samples D = {2, 3, 4, 8, 10, 11, 12}. Let the window width be h = 3; estimate the density at x = 1.

Parzen Window 1D Example. Suppose we have 7 samples D = {2, 3, 4, 8, 10, 11, 12}, h = 3. Plot the probability density function obtained with the Parzen window; notice that the resulting PDF is not smooth.
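
A sketch of this 1-D example. The slides do not state which window function is used here, so the numbers below assume the hypercube (boxcar) window defined earlier; with a different window the values change.

```python
import numpy as np

D = np.array([2, 3, 4, 8, 10, 11, 12], dtype=float)   # the 7 samples from the slide
h = 3.0                                               # window width from the slide

def parzen_1d(x, samples, h):
    """Boxcar window: count samples with |x - x_i| / h <= 1/2, then divide by n*h."""
    k = np.sum(np.abs(x - samples) / h <= 0.5)
    return k / (samples.size * h)

print(parzen_1d(1.0, D, h))     # only x_i = 2 lands in the window -> 1/21 ~ 0.048

# The full estimate over a grid is piecewise constant, hence the "not smooth" PDF
grid = np.linspace(0, 14, 141)
pdf = np.array([parzen_1d(x, D, h) for x in grid])
```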

K-Nearest Neighbor. Goal: a solution to the problem of choosing the unknown best window function. Let the cell volume be a function of the training data: center a cell about x and let it grow until it captures k_n samples (k_n = f(n)); these k_n samples are called the k_n nearest neighbors of x. Two possibilities can occur: if the density is high near x, the cell will be small, which provides good resolution; if the density is low, the cell will grow large and stop only when higher-density regions are reached. We can obtain a family of estimates by setting $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$.
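
A minimal 1-D sketch of the k-NN density estimate: grow an interval around x until it holds k samples, take V as its length, and use $p_n(x) = (k_n/n)/V_n$; names and data are illustrative.

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """Grow an interval around x until it holds k samples; p(x) ~ (k/n) / V."""
    n = samples.size
    r = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest sample
    V = 2 * r                                 # length of the interval [x - r, x + r]
    return (k / n) / V

rng = np.random.default_rng(2)
X = rng.normal(size=1000)
k = int(np.sqrt(X.size))                      # the k_n = sqrt(n) choice from the slides
print(knn_density_1d(0.0, X, k))              # roughly 0.399 for N(0, 1) at 0
```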

K-NN - Illustration. How do we classify a point using k-NN? K = 1: it belongs to the square class; K = 3: it belongs to the triangle class; K = 7: it belongs to the square class.
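
A sketch of the k-NN classification rule in the illustration: take the k nearest labeled samples (Euclidean distance) and vote; the toy data below is mine and only mimics the square/triangle picture.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """Label x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# As in the figure, the predicted label can change with k
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y_train = np.array(["square", "square", "triangle", "triangle", "triangle"])
print(knn_classify(np.array([0.1, 0.1]), X_train, y_train, k=1))   # -> "square"
print(knn_classify(np.array([0.1, 0.1]), X_train, y_train, k=5))   # -> "triangle"
```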

K-NN - Illustration. For $k_n = \sqrt{n}$ and n = 1, the estimate becomes

$$p_n(x) = \frac{k_n}{n V_n} = \frac{1}{V_1} = \frac{1}{2\,|x - x_1|}$$

K-NN and a-posteriori probabilities. Goal: estimate P(w_i | x) from a set of n labeled samples. Place a cell of volume V around x and capture k samples, k_i of which turn out to be labeled w_i. An estimate of the joint density is then

$$p_n(x, w_i) = \frac{k_i/n}{V}$$

and the resulting estimate of the posterior probability is

$$P_n(w_i \mid x) = \frac{p_n(x, w_i)}{\sum_{j=1}^{c} p_n(x, w_j)} = \frac{k_i}{k}$$

k_i/k is the fraction of the samples within the cell that are labeled w_i. For minimum error rate, the most frequently represented category within the cell is selected. If k is large and the cell sufficiently small, the performance approaches the best possible.
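
A sketch of the posterior estimate $P_n(w_i \mid x) \approx k_i/k$ derived above: the class fractions among the k nearest neighbors (illustrative names and data, continuing the previous sketch).

```python
import numpy as np
from collections import Counter

def knn_posteriors(x, X_train, y_train, k):
    """Estimate P(w_i | x) as k_i / k, the class fractions among the k nearest samples."""
    nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
    counts = Counter(y_train[i] for i in nearest)
    return {label: counts.get(label, 0) / k for label in np.unique(y_train)}

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9], [0.9, 1.1]])
y_train = np.array(["square", "square", "triangle", "triangle", "triangle"])
print(knn_posteriors(np.array([0.1, 0.1]), X_train, y_train, k=3))  # {'square': 2/3, 'triangle': 1/3}
```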

K-NN and Classification. The nearest neighbor rule (K = 1): let D = {x_1, x_2, ..., x_n} be a set of n labeled prototypes, and let x' ∈ D be the closest prototype to a test point x; the nearest-neighbor rule for classifying x is to assign it the label associated with x'. The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate. However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be demonstrated!). Think more about it: it means that 50% of the information needed to optimally classify a point x is aggregated within its nearest labeled neighbor. If n → ∞, it is always possible to find x' sufficiently close so that P(w_i | x') ≈ P(w_i | x); if P(w_m | x) ≈ 1, then the nearest-neighbor selection is almost always the same as the Bayes selection.

K-NN and Classification. The nearest neighbor rule: in 2D, the nearest neighbor rule leads to a partitioning of the input space into Voronoi cells; in 3D the cells are three-dimensional, and the decision boundary resembles the surface of a crystal.

Pros and Cons. No assumptions are needed about the distributions ahead of time (generality). With enough samples, convergence to an arbitrarily complicated target density can be obtained. However, the number of samples needed may be very large (it grows exponentially with the dimensionality of the feature space). These methods are very sensitive to the choice of window size (if too small, most of the volume will be empty; if too large, important variations may be lost). There may be severe requirements for computation time and storage.

Any Questions? End of Lecture 8. Thank you! Spring 2014. http://ce.sharif.edu/courses/92-93/2/ce725-2/