Pattern Classification


Pattern Classification
6.345 Automatic Speech Recognition: Semi-Parametric Classifiers
- Introduction
- Parametric classifiers
- Semi-parametric classifiers
- Dimensionality reduction
- Significance testing

Semi-Parametric Classifiers
- Mixture densities
- ML parameter estimation
- Mixture implementations
- Expectation maximization (EM)

Mixture Densities
A PDF composed of a mixture of m component densities {ω_1, ..., ω_m}:
$$p(x) = \sum_{j=1}^{m} p(x \mid \omega_j)\,P(\omega_j)$$
The component PDF parameters and the mixture weights P(ω_j) are typically unknown, making parameter estimation a form of unsupervised learning.
Gaussian mixtures assume Normal components:
$$p(x \mid \omega_k) \sim N(\mu_k, \Sigma_k)$$
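
As a concrete illustration (not from the lecture), here is a minimal numpy/scipy sketch that evaluates a Gaussian mixture density of this form; the component parameters are made-up values chosen only for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, covs):
    """Evaluate p(x) = sum_j P(w_j) N(x; mu_j, Sigma_j) at each row of x."""
    x = np.atleast_2d(x)
    p = np.zeros(len(x))
    for w, mu, cov in zip(weights, means, covs):
        p += w * multivariate_normal.pdf(x, mean=mu, cov=cov)
    return p

# Hypothetical two-component, two-dimensional mixture (illustrative values only)
weights = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
print(gmm_pdf([[0.0, 0.0], [3.0, 1.0]], weights, means, covs))
```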

Gaussian Mixture Example: One Dimension
[Figure: a two-component mixture density in one dimension, x from −4σ to 4σ]
$$p(x) = 0.6\,p_1(x) + 0.4\,p_2(x), \qquad p_1(x) \sim N(-\sigma, \sigma^2), \qquad p_2(x) \sim N(1.5\sigma, \sigma^2)$$

Gaussian Example
[Figure: Gaussian PDF fits to the first 9 MFCC dimensions of the phone [s]]

Independent Mixtures
[Figure: [s], two Gaussian mixture components per dimension, fit independently to each of the first 9 MFCC dimensions]

Mixture Components
[Figure: the individual components of the two-component-per-dimension mixtures for [s]]

ML Parameter Estimation: 1D Gaussian Mixture Means
$$\log L(\mu_k) = \sum_{i=1}^{n} \log p(x_i) = \sum_{i=1}^{n} \log \sum_{j=1}^{m} p(x_i \mid \omega_j)\,P(\omega_j)$$
$$\frac{\partial \log L(\mu_k)}{\partial \mu_k} = \sum_i \frac{\partial}{\partial \mu_k} \log p(x_i) = \sum_i \frac{1}{p(x_i)}\,\frac{\partial}{\partial \mu_k}\,p(x_i \mid \omega_k)\,P(\omega_k)$$
$$\frac{\partial}{\partial \mu_k}\,p(x_i \mid \omega_k) = \frac{\partial}{\partial \mu_k}\,\frac{1}{\sqrt{2\pi}\,\sigma_k}\,e^{-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}} = p(x_i \mid \omega_k)\,\frac{(x_i - \mu_k)}{\sigma_k^2}$$
Since $p(x_i \mid \omega_k)\,P(\omega_k)/p(x_i) = P(\omega_k \mid x_i)$,
$$\frac{\partial \log L(\mu_k)}{\partial \mu_k} = \sum_i P(\omega_k \mid x_i)\,\frac{(x_i - \mu_k)}{\sigma_k^2} = 0
\quad\Longrightarrow\quad
\hat\mu_k = \frac{\sum_i P(\omega_k \mid x_i)\,x_i}{\sum_i P(\omega_k \mid x_i)}$$

Gaussian Mixtures: ML Parameter Estimation
The maximum likelihood solutions are of the form:
$$\hat\mu_k = \frac{\frac{1}{n}\sum_i \hat P(\omega_k \mid x_i)\,x_i}{\frac{1}{n}\sum_i \hat P(\omega_k \mid x_i)}
\qquad
\hat\Sigma_k = \frac{\frac{1}{n}\sum_i \hat P(\omega_k \mid x_i)\,(x_i - \hat\mu_k)(x_i - \hat\mu_k)^t}{\frac{1}{n}\sum_i \hat P(\omega_k \mid x_i)}
\qquad
\hat P(\omega_k) = \frac{1}{n}\sum_i \hat P(\omega_k \mid x_i)$$

Gaussian Mixtures: ML Parameter Estimation
The ML solutions are typically found iteratively:
- Select a set of initial estimates for P̂(ω_k), μ̂_k, Σ̂_k
- Use the set of n samples to re-estimate the mixture parameters until some convergence criterion is met
Clustering procedures, similar to the K-means procedure, are often used to provide the initial parameter estimates.
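
The following is a minimal numpy sketch of this iterative re-estimation for a one-dimensional Gaussian mixture (my own illustration, not from the lecture; the fixed iteration count stands in for a real convergence test):

```python
import numpy as np

def em_gmm_1d(x, mu, var, prior, n_iter=20):
    """Iterative ML re-estimation of a 1-D Gaussian mixture.

    x: (n,) samples; mu, var, prior: (m,) initial estimates.
    Returns the updated (mu, var, prior) after n_iter iterations.
    """
    x = np.asarray(x, dtype=float)
    mu, var, prior = [np.array(a, dtype=float) for a in (mu, var, prior)]
    for _ in range(n_iter):
        # E-step: posteriors P(w_k | x_i), shape (n, m)
        lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        post = lik * prior
        post /= post.sum(axis=1, keepdims=True)
        # M-step: posterior-weighted means, variances, and mixture weights
        nk = post.sum(axis=0)
        mu = (post * x[:, None]).sum(axis=0) / nk
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        prior = nk / len(x)
    return mu, var, prior
```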

Example: 4 Samples, 2 Densities
1. Data: X = {x_1, x_2, x_3, x_4} = {2, 1, −1, −2}
2. Init: p(x|ω_1) ~ N(1, 1), p(x|ω_2) ~ N(−1, 1), P(ω_i) = 0.5
3. Estimate the posteriors:

              x_1 = 2   x_2 = 1   x_3 = −1   x_4 = −2
   P(ω_1|x)    0.98      0.88      0.12       0.02
   P(ω_2|x)    0.02      0.12      0.88       0.98

   $$p(X) = \frac{0.5^4}{(2\pi)^2}\,(e^{-0.5} + e^{-4.5})(e^{0} + e^{-2})(e^{0} + e^{-2})(e^{-0.5} + e^{-4.5})$$
4. Recompute the mixture parameters (shown only for ω_1):
   P̂(ω_1) = (0.98 + 0.88 + 0.12 + 0.02)/4 = 0.5
   μ̂_1 = [0.98(2) + 0.88(1) + 0.12(−1) + 0.02(−2)] / (0.98 + 0.88 + 0.12 + 0.02) = 1.34
   σ̂²_1 = [0.98(2 − 1.34)² + 0.88(1 − 1.34)² + 0.12(−1 − 1.34)² + 0.02(−2 − 1.34)²] / (0.98 + 0.88 + 0.12 + 0.02) = 0.70
5. Repeat steps 3 and 4 until convergence
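
These hand-computed values can be checked with a few lines of numpy (my own check, not part of the lecture); one E-step and one M-step reproduce them up to rounding:

```python
import numpy as np

x = np.array([2.0, 1.0, -1.0, -2.0])
mu, var, prior = np.array([1.0, -1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

# E-step: posteriors P(w_k | x_i)
lik = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
post = lik * prior
post /= post.sum(axis=1, keepdims=True)
print(post[:, 0].round(2))          # [0.98 0.88 0.12 0.02]

# M-step for component 1
nk = post[:, 0].sum()
mu1 = (post[:, 0] * x).sum() / nk
var1 = (post[:, 0] * (x - mu1) ** 2).sum() / nk
print(round(mu1, 2), round(var1, 2), round(nk / len(x), 2))
# 1.34, ~0.69 (the slide's 0.70 comes from posteriors rounded to two decimals), 0.5
```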

[s] Duration: 2 Densities

Iter   μ_1   μ_2   σ_1   σ_2
 0     152    95    35    23
 1     150    97    37    24
 2     148    98    39    25
 3     147   100    41    25
 4     146   100    42    26
 5     146   102    43    26
 6     146   102    44    26
 7     145   102    44    26

Iter   P(ω_1)   P(ω_2)   log p(X)
 0     0.384    0.616     2727
 1     0.376    0.624     2762
 2     0.369    0.631     2773
 3     0.362    0.638     2778
 4     0.356    0.644     2781
 5     0.349    0.651     2783
 6     0.344    0.656     2784
 7     0.338    0.662     2785

[Figure: histograms of [s] duration (0.05–0.30 sec) overlaid with the two-density mixture fit at successive iterations]

Gaussian Mixture Example: Two Dimensions
[Figure: a two-dimensional Gaussian mixture shown as a 3-dimensional PDF surface and as PDF contours]

Two Dimensional Mixtures
[Figure: mixture fits to [s] in the MFCC dimension 5 vs. dimension 6 plane, comparing diagonal vs. full covariance and two vs. three mixture components]

Two Dimensional Components
[Figure: the individual mixture components in the MFCC dimension 5 vs. dimension 6 plane]

Mixture of Gaussians: Implementation Variations
- Diagonal Gaussians are often used instead of full-covariance Gaussians:
  - They reduce the number of parameters
  - They can potentially model the underlying PDF just as well if enough components are used
- Mixture parameters are often constrained to be the same in order to reduce the number of parameters that need to be estimated:
  - Richter Gaussians share the same mean in order to better model the PDF tails
  - Tied mixtures share the same Gaussian parameters across all classes; only the mixture weights P̂(ω_i) are class specific (also known as semi-continuous)
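
To make the parameter savings concrete, here is a small counting sketch (my own illustration; the dimension, component count, and class count are made-up numbers):

```python
def gmm_param_count(d, m, covariance="full", n_classes=1, tied=False):
    """Rough parameter count for m-component GMMs over d-dimensional features.

    covariance: "full" has d(d+1)/2 covariance terms per component, "diag" has d.
    tied=True sketches the semi-continuous case: one shared pool of m Gaussians,
    with only the mixture weights being class specific.
    """
    cov_terms = d * (d + 1) // 2 if covariance == "full" else d
    gaussians = m * (d + cov_terms)          # means + covariances
    if tied:
        return gaussians + m * n_classes     # shared Gaussians + per-class weights
    return n_classes * (gaussians + m)       # each class gets its own mixture

# e.g. d=14 MFCCs, m=8 components, 40 classes (illustrative only)
print(gmm_param_count(14, 8, "full", n_classes=40))
print(gmm_param_count(14, 8, "diag", n_classes=40))
print(gmm_param_count(14, 8, "diag", n_classes=40, tied=True))
```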

Richter Gaussian Mixtures
[Figure: [s] log duration modeled with 2 Richter Gaussians, showing the overall Richter density and its individual components]

Expectation-Maximization (EM)
- Used for determining parameters, θ, for incomplete data, X = {x_i} (i.e., unsupervised learning problems)
- Introduces a variable, Z = {z_j}, to make the data complete, so that θ can be solved for using conventional ML techniques:
$$\log L(\theta) = \log p(X, Z \mid \theta) = \sum_{i,j} \log p(x_i, z_j \mid \theta)$$
- In reality, z_j can only be estimated by P(z_j | x_i, θ), so we can only compute the expectation of log L(θ):
$$E = \mathcal{E}(\log L(\theta)) = \sum_i \sum_j P(z_j \mid x_i, \theta)\,\log p(x_i, z_j \mid \theta)$$
- EM solutions are computed iteratively until convergence:
  1. Compute the expectation of log L(θ)
  2. Compute the values of θ which maximize E

EM Parameter Estimation: 1D Gaussian Mixture Means
Let z_i be the component id, {ω_j}, to which x_i belongs:
$$E = \mathcal{E}(\log L(\theta)) = \sum_i \sum_j P(z_j \mid x_i, \theta)\,\log p(x_i, z_j \mid \theta)$$
Converting to mixture component notation:
$$E = \mathcal{E}(\log L(\mu_k)) = \sum_i \sum_j P(\omega_j \mid x_i)\,\log p(x_i, \omega_j)$$
Differentiating with respect to μ_k:
$$\frac{\partial E}{\partial \mu_k} = \sum_i P(\omega_k \mid x_i)\,\frac{\partial}{\partial \mu_k}\log p(x_i, \omega_k) = \sum_i P(\omega_k \mid x_i)\left(\frac{x_i - \mu_k}{\sigma_k^2}\right) = 0$$
$$\hat\mu_k = \frac{\sum_i P(\omega_k \mid x_i)\,x_i}{\sum_i P(\omega_k \mid x_i)}$$

EM Properties
Each iteration of EM will increase the likelihood of X:
$$\log\frac{p(X \mid \theta')}{p(X \mid \theta)} = \sum_i \log\frac{p(x_i \mid \theta')}{p(x_i \mid \theta)}
= \sum_i \sum_j P(z_j \mid x_i, \theta)\left(\log\frac{p(x_i \mid \theta')\,p(x_i, z_j \mid \theta)}{p(x_i, z_j \mid \theta')\,p(x_i \mid \theta)} + \log\frac{p(x_i, z_j \mid \theta')}{p(x_i, z_j \mid \theta)}\right)$$
Using Bayes' rule, $p(x_i, z_j \mid \theta)/p(x_i \mid \theta) = P(z_j \mid x_i, \theta)$, and the Kullback-Leibler distance, the first term satisfies
$$\sum_j P(z_j \mid x_i, \theta)\,\log\frac{P(z_j \mid x_i, \theta)}{P(z_j \mid x_i, \theta')} \ge 0$$
Since θ' was determined to maximize $\mathcal{E}(\log L(\theta))$:
$$\sum_i \sum_j P(z_j \mid x_i, \theta)\,\log\frac{p(x_i, z_j \mid \theta')}{p(x_i, z_j \mid \theta)} \ge 0$$
Combining these two properties: $p(X \mid \theta') \ge p(X \mid \theta)$.

Dimensionality Reduction
- Given a training set, PDF parameter estimation becomes less robust as dimensionality increases
- Increasing dimensions can make it more difficult to obtain insight into any underlying structure
- Analytical techniques exist which can transform a sample space to a different set of dimensions:
  - If the original dimensions are correlated, the same information may require fewer dimensions
  - The transformed space will often have a more Normal distribution than the original space
  - If the new dimensions are orthogonal, it can be easier to model the transformed space

Principal Components Analysis
PCA linearly transforms a d-dimensional vector, x, to a d′-dimensional vector, y, via orthonormal vectors, W:
$$y = W^t x, \qquad W = \{w_1, \ldots, w_{d'}\}, \qquad W^t W = I$$
If d′ < d, x can only be partially reconstructed from y:
$$\hat x = W y$$
The principal components, W, minimize the distortion, D, between x and x̂ on the training data X = {x_1, ..., x_n}:
$$D = \sum_{i=1}^{n} \lVert x_i - \hat x_i \rVert^2$$
PCA is also known as the Karhunen-Loève (K-L) expansion (the w_i are sinusoids for some stochastic processes).

PCA Computation
[Figure: the original axes (x_1, x_2) rotated to the principal axes (y_1, y_2)]
W corresponds to the first d′ eigenvectors, P, of Σ:
$$P = \{e_1, \ldots, e_d\}, \qquad \Sigma = P \Lambda P^t, \qquad w_i = e_i$$
The full covariance structure of the original space, Σ, is transformed to a diagonal covariance structure, Λ.
The eigenvalues, {λ_1, ..., λ_d}, represent the variances in Λ.
The axes of the d′-space contain the maximum amount of variance, and the distortion is the sum of the discarded eigenvalues:
$$D = \sum_{i=d'+1}^{d} \lambda_i$$
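
A minimal numpy sketch of this computation (my own illustration; the data below are random placeholders):

```python
import numpy as np

def pca(X, d_prime):
    """Project d-dimensional rows of X onto the first d_prime principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = np.cov(Xc, rowvar=False)                 # Sigma, d x d
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    W = eigvecs[:, order[:d_prime]]                # first d' eigenvectors as columns
    Y = Xc @ W                                     # y = W^t (x - mean)
    distortion = eigvals[order[d_prime:]].sum()    # sum of discarded eigenvalues (average per sample)
    return W, Y, mean, distortion

# Illustrative usage with random 40-dimensional data
X = np.random.randn(500, 40) @ np.random.randn(40, 40)
W, Y, mean, D = pca(X, d_prime=10)
X_hat = Y @ W.T + mean                             # partial reconstruction x_hat = W y
```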

PCA Example
- Original feature vector: mean rate response (d = 40)
- Data obtained from 100 speakers from the TIMIT corpus
- The first 10 components explain 98% of the total variance
[Figure: percent of total variance (40%–100%) vs. number of components (1–10), comparing PCA with a cosine transform]

PCA Example
[Figure: the leading principal components plotted as values against frequency (Bark)]

PCA for Boundary Classification
- Eight non-uniform averages computed from 14 MFCCs
- The first 50 dimensions are used for classification
[Figure: the second and seventh principal components, shown as weights over the left/right context averages (Left 8 through Right 8) and cepstral coefficients C0–C13]

PCA Issues
PCA can be performed using either:
- the covariance matrix Σ, or
- the correlation coefficient matrix P, where
$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}, \qquad |\rho_{ij}| \le 1$$
P is usually preferred when the input dimensions have significantly different ranges.
PCA can also be used to normalize or whiten the original d-dimensional space to simplify subsequent processing, so that Σ = P = Λ = I.
The whitening operation can be done in one step: z = V^t x.
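
The slide as transcribed does not spell out V; the sketch below assumes the standard choice V = PΛ^(−1/2) (eigenvectors scaled by the inverse square roots of their eigenvalues), which makes the transformed covariance the identity:

```python
import numpy as np

def whiten(X):
    """Whiten d-dimensional data so its covariance becomes the identity.

    Assumes V = P Lambda^{-1/2} with Sigma = P Lambda P^t (my assumption;
    the transcribed slide only states z = V^t x).
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    eigvals, P = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = P / np.sqrt(eigvals)                    # scale each eigenvector by 1/sqrt(lambda)
    Z = Xc @ V                                  # z = V^t (x - mean) for each sample
    return Z, V, mean

Z, V, mean = whiten(np.random.randn(1000, 5) @ np.random.randn(5, 5))
print(np.round(np.cov(Z, rowvar=False), 2))     # approximately the identity matrix
```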

Significance Testing
- To properly compare the results of different classifier algorithms, A_1 and A_2, it is necessary to perform significance tests:
  - Large differences can be insignificant for small test sets
  - Small differences can be significant for large test sets
- General significance tests evaluate the hypothesis that the probability of being correct, p_i, is the same for both algorithms
- The most powerful comparisons can be made using common train and test corpora and a common evaluation criterion:
  - Results then reflect differences between the algorithms rather than accidental differences between test sets
  - Significance tests can be more precise when identical data are used, since they can focus on tokens misclassified by only one algorithm rather than on all tokens

McNemar's Significance Test
When algorithms A_1 and A_2 are tested on identical data, we can collapse the results into a 2x2 matrix of counts:

A_1 \ A_2    Correct   Incorrect
Correct       n_00      n_01
Incorrect     n_10      n_11

To compare the algorithms, we test the null hypothesis H_0 that
$$p_1 = p_2, \qquad\text{or}\qquad n_{01} = n_{10}, \qquad\text{or}\qquad q = \frac{n_{01}}{n_{01} + n_{10}} = \frac{1}{2}$$
Given H_0, the probability of observing k tokens asymmetrically classified out of n = n_01 + n_10 has a Binomial PMF:
$$P(k) = \binom{n}{k}\left(\frac{1}{2}\right)^n$$
McNemar's Test measures the probability, P, of all cases that meet or exceed the observed asymmetric distribution, and tests P < α.

McNemar's Significance Test (cont'd)
The probability, P, is computed by summing the two PMF tails:
$$P = \sum_{k=0}^{l} P(k) + \sum_{k=m}^{n} P(k), \qquad l = \min(n_{01}, n_{10}), \qquad m = \max(n_{01}, n_{10})$$
[Figure: the Binomial probability distribution over 0..n with the tails below min and above max shaded]
For large n, a Normal distribution is often assumed.
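
A minimal scipy sketch of this exact two-tailed computation (my own illustration, directly implementing the tail sums above):

```python
from scipy.stats import binom

def mcnemar_exact(n01, n10):
    """Exact two-tailed McNemar probability under H0: Binomial(n, 1/2)."""
    n = n01 + n10
    lo, hi = min(n01, n10), max(n01, n10)
    # P = sum_{k <= lo} P(k) + sum_{k >= hi} P(k)
    return binom.cdf(lo, n, 0.5) + binom.sf(hi - 1, n, 0.5)
```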

Significance Test Example (Gillick and Cox, 1989)
- Common test set of 1400 tokens
- Algorithms A_1 and A_2 make 72 and 62 errors respectively
- Are the differences significant? That depends on how the errors overlap:

Case 1:
A_1 \ A_2    Correct   Incorrect
Correct       1266       62
Incorrect       72        0
n = 134, m = 72, P = 0.437

Case 2:
A_1 \ A_2    Correct   Incorrect
Correct       1325        3
Incorrect       13       59
n = 16, m = 13, P = 0.0213

Case 3:
A_1 \ A_2    Correct   Incorrect
Correct       1328        0
Incorrect       10       62
n = 10, m = 10, P = 0.0020
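
Running the sketch above on these three cases reproduces the quoted probabilities (a check of mine, not part of the original slides):

```python
for n01, n10 in [(62, 72), (3, 13), (0, 10)]:
    print(round(mcnemar_exact(n01, n10), 4))
# approximately 0.437, 0.0213, 0.002
```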

References
- Huang, Acero, and Hon, Spoken Language Processing, Prentice-Hall, 2001.
- Duda, Hart, and Stork, Pattern Classification, John Wiley & Sons, 2001.
- Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
- Bishop, Neural Networks for Pattern Recognition, Clarendon Press, 1995.
- Gillick and Cox, "Some Statistical Issues in the Comparison of Speech Recognition Algorithms," Proc. ICASSP, 1989.