EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Pattern Classification

1 The problem of classification
2 Linear and nonlinear classifiers
3 Probabilistic classification
4 Gaussians, mixtures and EM
5 Methodological remarks

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering, Spring

Pattern classification
Classification means: finding categorical (discrete) labels for real-world (continuous) observations
[Figure: spectrogram with formant tracks (frequency in Hz vs. time in s), with vowel regions labeled "ay" and "ao"]
Why?
- information extraction
- conditional processing
- detect exceptions
Building a classifier
Define classes/attributes
- could state explicit rules
- better to define through training examples
Define feature space
Define decision algorithm
- set parameters from examples
Measure performance
- calculate (weighted) error rate
[Figure: Pols vowel formants, F1 vs. F2 in Hz, for "u" (x) vs. "o" (o)]

Classification system parts
Sensor → (signal) → Pre-processing/segmentation → (segment) → Feature extraction → (feature vector) → Classification → (class) → Post-processing
- Pre-processing/segmentation: e.g. STFT, locate vowels
- Feature extraction: e.g. formant extraction
- Post-processing: e.g. context constraints, costs/risk
Feature extraction
Right features are critical
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications
Theoretically equivalent features may act very differently in practice
- representations make important aspects explicit
- remove irrelevant information
Feature design incorporates domain knowledge
- more data → less need for cleverness?
Smaller feature space (fewer dimensions) → simpler models (fewer parameters) → less training data needed → faster training

Minimum Distance Classification
[Figure: Pols vowel formants, F1 vs. F2 in Hz, for "u" (x), "o" (o) and "a" (+), plus a new observation (*) to be classified]
Find closest match (Nearest Neighbor):
    $\min_{\omega,i} D(x, y_{\omega,i})$
- choice of distance metric $D(x, y)$ is important
Find representative match (class prototypes):
    $\min_{\omega} D(x, z_\omega)$
- class data $\{y_{\omega,i}\}$ → class model $z_\omega$
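A minimal sketch of the two matching rules above (not from the original slides), in Python/NumPy; the formant-like feature values and class labels are invented for illustration.

```python
import numpy as np

# Toy labeled examples y_{w,i}: rows are [F1, F2] feature vectors (made-up values)
train_X = np.array([[300., 800.], [320., 850.],     # class "u"
                    [450., 900.], [470., 950.],     # class "o"
                    [750., 1300.], [780., 1250.]])  # class "a"
train_y = np.array(["u", "u", "o", "o", "a", "a"])

def nearest_neighbor(x):
    """Closest match: label of the single nearest training example."""
    d = np.linalg.norm(train_X - x, axis=1)          # distance to every example
    return train_y[np.argmin(d)]

def min_distance_to_prototype(x):
    """Representative match: label of the nearest class mean (prototype z_w)."""
    classes = np.unique(train_y)
    protos = np.array([train_X[train_y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(protos - x, axis=1)
    return classes[np.argmin(d)]

x_new = np.array([400., 870.])                        # new observation
print(nearest_neighbor(x_new), min_distance_to_prototype(x_new))
```

Swapping the Euclidean norm for another metric D changes both rules, which is the point of the bullet above.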
2 Linear and nonlinear classifiers
Minimum Euclidean distance is equivalent to a linear discriminant function:
- examples $\{y_{\omega,i}\}$ → template $z_\omega = \text{mean}\{y_{\omega,i}\}$
- observed feature vector $x = [x_1\ x_2\ \ldots]^T$
- Euclidean distance metric: $D^2[x, y] = (x - y)^T(x - y)$
- minimum distance rule:
    class $i = \arg\min_i D[x, z_i] = \arg\min_i D^2[x, z_i]$
- i.e. choose class O over U if $D^2[x, z_O] < D^2[x, z_U]$ ...

Linear classifier (cont'd)
Decision rule: choose class O over U if:
    $D^2[x, z_O] < D^2[x, z_U]$
    $(x - z_O)^T(x - z_O) < (x - z_U)^T(x - z_U)$
    $x^T x - 2 x^T z_O + z_O^T z_O < x^T x - 2 x^T z_U + z_U^T z_U$
    $2 x^T (z_O - z_U) > z_O^T z_O - z_U^T z_U$
    $(x - z_U)^T (z_O - z_U) > \tfrac{1}{2} (z_O - z_U)^T (z_O - z_U)$
      (projection)                  (threshold)
i.e. choose O when the projection of x onto the direction $z_O - z_U$ lies more than halfway from $z_U$ to $z_O$ ...
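A quick numeric check (not part of the original notes) that the minimum-distance rule and the projection-vs-threshold form above pick the same class; the template vectors are arbitrary made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
z_O = np.array([600., 900.])     # made-up class-O template (mean)
z_U = np.array([300., 750.])     # made-up class-U template (mean)

def choose_O_by_distance(x):
    # D^2[x, z_O] < D^2[x, z_U]
    return np.sum((x - z_O) ** 2) < np.sum((x - z_U) ** 2)

def choose_O_by_linear_rule(x):
    # (x - z_U)^T (z_O - z_U) > 1/2 (z_O - z_U)^T (z_O - z_U)
    w = z_O - z_U
    return (x - z_U) @ w > 0.5 * (w @ w)

for _ in range(1000):                       # the two rules agree everywhere
    x = rng.uniform(0, 1500, size=2)
    assert choose_O_by_distance(x) == choose_O_by_linear_rule(x)
print("rules agree")
```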
Linear classification boundaries
Min-distance divides the space normal to the line joining the class centers:
[Figure: decision boundary as the perpendicular bisector between z_U and z_O; the projection $(x - z_U)^T(z_O - z_U) / \|z_O - z_U\|$ is compared against $\tfrac{1}{2}\|z_O - z_U\|$]
Scaling axes changes the boundary:
[Figure: squeezing the data horizontally or vertically moves the min-distance boundary]

Decision boundaries
Linear functions give linear boundaries
- planes and hyperplanes as dimensions increase
Linear boundaries can become curves under feature transformations
- e.g. a linear discriminant in $x' = [x_1\ x_2\ x_1^2\ x_2^2\ \ldots]^T$
- support vector machines...
What about more complex boundaries?
- i.e. if a linear discriminant is $\sum_i w_i x_i$, how about a nonlinear function $F[\sum_i w_i x_i]$?
- changes only the threshold, but $\sum_j F[\sum_i w_{ij} x_i]$ offers more flexibility, and $F[\sum_j w_{jk} F[\sum_i w_{ij} x_i]]$ has even more... (see the small feature-transformation sketch below)
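To make the feature-transformation bullet concrete, a tiny sketch (not from the slides): a discriminant that is linear in squared features produces a circular boundary in the original space. The weights and radius are arbitrary illustration values.

```python
import numpy as np

def phi(x):
    """Expanded feature vector x' = [x1, x2, x1^2, x2^2]."""
    return np.array([x[0], x[1], x[0] ** 2, x[1] ** 2])

# Linear discriminant w . phi(x) + b; with these weights it reduces to
# r^2 - (x1^2 + x2^2), i.e. a circle of radius r = 2 in the original space.
w = np.array([0., 0., -1., -1.])
b = 2.0 ** 2

def inside(x):
    return w @ phi(x) + b > 0

print(inside(np.array([1.0, 1.0])))   # True: inside radius 2
print(inside(np.array([3.0, 0.0])))   # False: outside radius 2
```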
Neural networks
Sums over nonlinear functions of sums → large range of decision surfaces
e.g. Multi-layer perceptron (MLP) with one hidden layer:
    $y_k = F[\sum_j w_{jk}\, F[\sum_i w_{ij} x_i]]$
[Figure: network diagram with input layer $x_i$, hidden layer $h_j$, output layer $y_k$, and nonlinearity $F[\cdot]$ at each unit]
Problem is finding the weights $w_{ij}$... (training)

Back-propagation training
Find $w_{ij}$ to minimize e.g. $E = \sum_n |y(x_n) - t_n|^2$ for a training set of patterns $\{x_n\}$ and desired (target) outputs $\{t_n\}$
Differentiate the error with respect to the weights
- contribution from each pattern $x_n, t_n$:
    $y_k = F[\sum_j w_{jk} h_j]$
    $\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y} \cdot \frac{\partial y}{\partial w_{jk}} = 2(y - t)\, F'[\sum_j w_{jk} h_j]\, h_j$
    $w_{jk} \leftarrow w_{jk} - \alpha \frac{\partial E}{\partial w_{jk}}$
i.e. gradient descent with learning rate $\alpha$
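A self-contained sketch of these update rules (not the course's code): a 2-input, 3-hidden, 1-output MLP trained by plain gradient descent on a toy XOR-style problem, using the sigmoid F and its derivative F' = F(1 - F) from the next slide. Sizes, learning rate, and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def F(a):
    """Sigmoid nonlinearity (see the next slide); its derivative is F (1 - F)."""
    return 1.0 / (1.0 + np.exp(-a))

# Small 2:3:1 MLP with random initial weights and zero biases (illustrative sizes)
W_ih = rng.normal(scale=0.5, size=(2, 3))    # input -> hidden weights w_ij
b_h  = np.zeros(3)
W_ho = rng.normal(scale=0.5, size=(3, 1))    # hidden -> output weights w_jk
b_o  = np.zeros(1)
alpha = 0.5                                  # learning rate

# Toy training patterns {x_n} and targets {t_n} (XOR-like)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

for epoch in range(10000):
    h = F(X @ W_ih + b_h)                    # hidden activations h_j
    y = F(h @ W_ho + b_o)                    # outputs y_k
    # Gradient of E = sum_n |y - t|^2, back-propagated layer by layer
    d_out = 2 * (y - T) * y * (1 - y)        # dE/d(output pre-activation)
    d_hid = (d_out @ W_ho.T) * h * (1 - h)   # dE/d(hidden pre-activation)
    W_ho -= alpha * h.T @ d_out;  b_o -= alpha * d_out.sum(axis=0)
    W_ih -= alpha * X.T @ d_hid;  b_h -= alpha * d_hid.sum(axis=0)

print(np.round(F(F(X @ W_ih + b_h) @ W_ho + b_o).ravel(), 2))  # should move towards [0, 1, 1, 0]
```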
Neural net example
2 input units (normalized F1, F2), 5 hidden units, 3 output units ("U", "O", "A")
[Figure: resulting decision regions for A, O and U over the (F1, F2) plane]
Sigmoid nonlinearity:
    $F[x] = \frac{1}{1 + e^{-x}}$,    $\frac{dF}{dx} = F(1 - F)$
[Figure: plot of sigm(x) and its derivative d/dx sigm(x)]

Neural net training
2:5:3 net: mean-squared error by training epoch
[Figure: MS error vs. training epoch, and decision contours at an early and a later stage of training]
Support Vector Machines (SVMs)
Principled nonlinear decision boundaries...
Key idea: the boundary can be linear in a higher-dimensional space, e.g. under a projection $\Phi(x)$
- depends only on training points at the edges (support vectors)
- unique optimal solution for a given projection $\Phi$
- solution involves only inner products in the projected space, e.g. via a kernel $k(x, y) = \Phi(x) \cdot \Phi(y)$ (see the short kernel sketch after the next slide)
[Figure: separating hyperplane $(w \cdot x) + b = 0$ with margin planes $(w \cdot x) + b = \pm 1$, support vectors on the margins, class labels $y_i = \pm 1$]

3 Statistical Interpretation
Observations x are random variables whose distribution depends on the class:
    Class $\omega_i$ (hidden, discrete) → Observation x (continuous), drawn from $p(x|\omega_i)$
[Figure: class priors $\Pr(\omega_i)$ for $\omega_1 \ldots \omega_4$ and the corresponding class-conditional densities $p(x|\omega_i)$]
Source distributions $p(x|\omega_i)$
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
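Referring back to the SVM slide above: a minimal numeric check (not from the notes) that a simple polynomial kernel computes the inner product in an explicitly projected space, so the projection $\Phi$ never has to be formed. The particular kernel and vectors are just illustrations.

```python
import numpy as np

def phi(x):
    """Explicit projection for the homogeneous 2nd-order polynomial kernel in 2-D."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """The same inner product, evaluated directly in the original space."""
    return (x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(phi(x) @ phi(y), k(x, y))   # both print 1.0: Phi(x).Phi(y) = (x.y)^2
```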
Random Variables review
Random vars have joint distributions (pdf's): $p(x, y)$
[Figure: joint pdf p(x, y) with its marginals p(x) and p(y), means $\mu_x, \mu_y$ and spreads $\sigma_x, \sigma_y$]
- marginal: $p(x) = \int p(x, y)\, dy$
- covariance: $\Sigma = E[(o - \mu)(o - \mu)^T] = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}$

Conditional probability
Knowing one value in a joint distribution constrains the remainder:
[Figure: slicing p(x, y) at y = Y gives the conditional p(x | y = Y)]
Bayes rule:
    $p(x, y) = p(x|y)\, p(y) = p(y|x)\, p(x)$
    $\Rightarrow\ p(y|x) = \frac{p(x|y)\, p(y)}{p(x)}$
- can reverse conditioning given priors/marginal
- either term can be discrete or continuous
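A tiny numeric illustration of Bayes' rule (not from the slides), with invented discrete probabilities for a binary class y and a binary observation x:

```python
# Made-up prior p(y) and likelihood p(x|y) for two classes and a binary observation
p_y = {0: 0.7, 1: 0.3}
p_x_given_y = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.2, 1: 0.8}}

x = 1
p_x = sum(p_x_given_y[y][x] * p_y[y] for y in p_y)              # evidence p(x)
posterior = {y: p_x_given_y[y][x] * p_y[y] / p_x for y in p_y}  # p(y|x)
print(posterior)   # {0: ~0.23, 1: ~0.77}; the posteriors sum to 1
```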
Priors and posteriors
Bayesian inference can be interpreted as updating prior beliefs with new information, x:
Bayes Rule:
    $\Pr(\omega_i|x) = \frac{p(x|\omega_i)\, \Pr(\omega_i)}{\sum_j p(x|\omega_j)\, \Pr(\omega_j)}$
    with $\Pr(\omega_i)$ the prior probability, $p(x|\omega_i)$ the likelihood, the denominator $p(x)$ the evidence, and $\Pr(\omega_i|x)$ the posterior probability
Posterior is prior scaled by likelihood & normalized by evidence (so Σ(posteriors) = 1)
Objection: priors are often unknown
- ...but omitting them amounts to assuming they are all equal

Bayesian (MAP) classifier
Minimize the probability of error by choosing the maximum a posteriori (MAP) class:
    $\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$
Intuitively right
- choose the most probable class in light of the observation
Formally, expected probability of error:
    $E[\Pr(err)] = \int_X p(x)\, \Pr(err|x)\, dx = \sum_i \int_{X_{\hat\omega_i}} p(x)\, (1 - \Pr(\omega_i|x))\, dx$
    where $X_{\hat\omega_i}$ is the region of X where $\omega_i$ is chosen
... but $\Pr(\omega_i|x)$ is largest in that region, so $\Pr(err)$ is minimized.
Practical implementation
Optimal classifier is $\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$, but we don't know $\Pr(\omega_i|x)$
Can model it directly, e.g. train a neural net / SVM to map from inputs x to a set of outputs $\Pr(\omega_i|x)$
- a discriminative model
Often easier to model the conditional distributions $p(x|\omega_i)$, then use Bayes' rule to find the MAP class:
    labeled training examples $\{x_n, \omega_n\}$ → sort according to class → estimate conditional pdf $p(x|\omega_1)$ for class $\omega_1$, etc.

Likelihood models
Given models for $p(x|\omega_i)$ for each class, the search for
    $\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i|x)$
becomes
    $\hat\omega = \arg\max_{\omega_i} \frac{p(x|\omega_i)\, \Pr(\omega_i)}{\sum_j p(x|\omega_j)\, \Pr(\omega_j)}$
but the denominator $p(x)$ is the same over all $\omega_i$, hence
    $\hat\omega = \arg\max_{\omega_i}\, p(x|\omega_i)\, \Pr(\omega_i)$
or even
    $\hat\omega = \arg\max_{\omega_i}\, [\log p(x|\omega_i) + \log \Pr(\omega_i)]$
Choose parameters to maximize the data likelihood $p(x_{train}|\omega_i)$
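A compact sketch of this generative route (not from the notes): fit a 1-D Gaussian $p(x|\omega)$ per class from labeled examples (the Gaussian form anticipates the next section), estimate priors from class counts, and pick the class maximizing log-likelihood plus log-prior. All data values are invented.

```python
import numpy as np

# Labeled training examples {x_n, w_n}: 1-D features, made-up values
x_train = np.array([1.0, 1.2, 0.8, 1.1,  3.0, 3.3, 2.8, 3.1, 2.9])
w_train = np.array(['u', 'u', 'u', 'u',  'o', 'o', 'o', 'o', 'o'])

# Sort by class and estimate a Gaussian p(x|w) plus a prior Pr(w) for each class
classes = np.unique(w_train)
params = {}
for c in classes:
    xc = x_train[w_train == c]
    params[c] = (xc.mean(), xc.var(), len(xc) / len(x_train))  # mu, sigma^2, prior

def log_gaussian(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def map_class(x):
    # argmax over classes of log p(x|w) + log Pr(w)
    return max(classes, key=lambda c: log_gaussian(x, params[c][0], params[c][1])
                                      + np.log(params[c][2]))

print(map_class(1.5), map_class(2.6))   # expect 'u' then 'o'
```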
Sources of error
[Figure: two scaled class-conditional densities $p(x|\omega_1)\Pr(\omega_1)$ and $p(x|\omega_2)\Pr(\omega_2)$, the decision regions $X_{\hat\omega_1}$ and $X_{\hat\omega_2}$, and the shaded areas contributing to $\Pr(err)$]
Suboptimal threshold / regions ("bias" error)
- use a Bayes classifier!
Incorrect distributions (model error)
- better distribution models / more training data
Misleading features ("Bayes" error)
- irreducible for a given feature set, regardless of classification scheme

4 Gaussian models
Easiest way to model distributions is via a parametric model
- assume a known form, estimate a few parameters
Gaussian model is simple & useful; in 1-D:
    $p(x|\omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left[-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2\right]$
    (the leading factor is the normalization that makes it sum to 1)
Parameters: mean $\mu_i$ and variance $\sigma_i^2$
[Figure: 1-D Gaussian showing $\mu_i$, $\sigma_i$, the peak value $P_M$ and the value $P_M e^{-1/2}$ at $\mu_i \pm \sigma_i$]
Gaussians in d dimensions:
    $p(x|\omega_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]$
Described by a d-dimensional mean vector $\mu_i$ and a $d \times d$ covariance matrix $\Sigma_i$
[Figure: surface and contour plots of a 2-D Gaussian density]
Classify by maximizing the log likelihood, i.e.
    $\arg\max_{\omega_i}\ -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{1}{2}\log|\Sigma_i| + \log \Pr(\omega_i)$

Multivariate Gaussian covariance
Eigen analysis of the inverse of the covariance gives
    $\Sigma_i^{-1} = (SV)^T (SV)$ with V the eigenvector matrix and diagonal $S = \Lambda^{-1/2}$
    Gaussian $\sim \exp\!\left[-\frac{1}{2}(x - \mu_i)^T (SV)^T (SV)(x - \mu_i)\right]$
i.e. (SV) transforms x into an uncorrelated, normalized-variance space
[Figure: x space vs. SVx space; a vector v maps to v' = SVv]
Mahalanobis distance $(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$
- the (squared) Euclidean distance in the normalized space
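A short numerical check (not from the notes) that the Mahalanobis form above equals squared Euclidean distance after the whitening transform SV; the covariance values are arbitrary.

```python
import numpy as np

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # arbitrary illustrative covariance

# Eigen analysis: Sigma = V diag(lam) V^T, so Sigma^-1 = (S V^T)^T (S V^T)
# with S = diag(lam^-1/2), i.e. the whitening transform SV = S V^T
lam, V = np.linalg.eigh(Sigma)
SV = np.diag(lam ** -0.5) @ V.T

x = np.array([2.5, 1.0])
d_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
d_white = np.sum((SV @ (x - mu)) ** 2)  # squared Euclidean distance after whitening
print(d_mahal, d_white)                 # the two values agree
```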
Gaussian Mixture models (GMMs)
Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlation
What about a weighted sum? i.e.
    $p(x) = \sum_k c_k\, p(x|m_k)$
where $\{c_k\}$ is a set of weights and $\{p(x|m_k)\}$ is a set of Gaussian components
- can fit anything given enough components
Interpretation: each observation is generated by one of the Gaussians, chosen at random, with priors $c_k = \Pr(m_k)$

Gaussian mixtures (2)
e.g. nonlinear correlation:
[Figure: original data with a nonlinearly correlated distribution, the fitted Gaussian components, and the resulting mixture density surface]
Problem: finding the $c_k$ and $m_k$ parameters
- easy if we knew which $m_k$ generated each x
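A brief sketch of the generative interpretation above (not from the notes): pick a component index at random according to the priors $c_k$, then draw x from that Gaussian. All parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# A 3-component 1-D GMM with made-up weights, means, and variances
c   = np.array([0.5, 0.3, 0.2])     # component priors c_k = Pr(m_k)
mu  = np.array([-2.0, 0.5, 3.0])
var = np.array([0.3, 1.0, 0.5])

def sample(n):
    k = rng.choice(len(c), size=n, p=c)           # pick a Gaussian at random...
    return rng.normal(mu[k], np.sqrt(var[k]))     # ...then draw x from it

def pdf(x):
    # mixture density p(x) = sum_k c_k N(x; mu_k, var_k)
    return sum(ck / np.sqrt(2 * np.pi * vk) * np.exp(-(x - mk) ** 2 / (2 * vk))
               for ck, mk, vk in zip(c, mu, var))

xs = sample(10000)
print(xs.mean(), (c * mu).sum())     # sample mean approaches the mixture mean
print(pdf(0.0))                      # mixture density evaluated at x = 0
```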
Expectation-maximization (EM)
General procedure for estimating a model when some parameters are unknown
- e.g. which component generated a value in a Gaussian mixture model
Procedure:
Iteratively update fixed model parameters $\Theta$ to maximize Q, the expected value of the log-probability of the known training data $x_{trn}$ and the unknown parameters u:
    $Q(\Theta, \Theta^{old}) = \sum_u \Pr(u|x_{trn}, \Theta^{old}) \log p(u, x_{trn}|\Theta)$
- iterative because $\Pr(u|x_{trn}, \Theta^{old})$ changes each time we change $\Theta$
- can prove this increases the data likelihood, hence maximum-likelihood (ML) model
- local optimum: depends on initialization

Fitting GMMs with EM
Want to find the component Gaussians $m_k = \{\mu_k, \Sigma_k\}$ plus weights $c_k = \Pr(m_k)$ to maximize the likelihood of the training data $x_{trn}$
If we could assign each x to a particular $m_k$, estimation would be direct
Hence, treat the k's (mixture indices) as unknown, form $Q = E[\log p(x, k|\Theta)]$, differentiate with respect to the model parameters → equations for $\mu_k$, $\Sigma_k$ and $c_k$ to maximize Q...
GMM EM update equations
Solve for the parameters that maximize Q:
    $\mu_k = \frac{\sum_n p(k|x_n, \Theta)\, x_n}{\sum_n p(k|x_n, \Theta)}$
    $\Sigma_k = \frac{\sum_n p(k|x_n, \Theta)\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n p(k|x_n, \Theta)}$
    $c_k = \frac{1}{N} \sum_n p(k|x_n, \Theta)$
Each involves $p(k|x_n, \Theta)$, the "fuzzy membership" of $x_n$ in Gaussian k
Each parameter update is just a sample average, weighted by the fuzzy membership

GMM examples
Vowel data fit with different mixture counts:
[Figure: fits with 1, 2, 3 and 4 Gaussian components; the data log-probability improves as components are added]
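A compact sketch (not the course's code) of the E/M loop implementing the updates above for a two-component 1-D mixture; the data and initialization are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D data drawn from two made-up clusters
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 1.0, 300)])
N, K = len(x), 2

# Arbitrary initialization of weights, means, and variances
c, mu, var = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.full(K, 1.0)

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for it in range(50):
    # E-step: fuzzy membership p(k | x_n, Theta)
    joint = np.stack([c[k] * gauss(x, mu[k], var[k]) for k in range(K)])  # (K, N)
    gamma = joint / joint.sum(axis=0, keepdims=True)
    # M-step: membership-weighted sample averages
    Nk = gamma.sum(axis=1)
    mu = (gamma * x).sum(axis=1) / Nk
    var = (gamma * (x - mu[:, None]) ** 2).sum(axis=1) / Nk
    c = Nk / N

print(np.round(mu, 2), np.round(var, 2), np.round(c, 2))  # near the generating values
```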
5 Methodological Remarks: Training and test data
A rich model can learn every training example ("overtraining")
[Figure: error rate vs. training epochs or parameters; training-data error keeps falling while test-data error turns back up (overfitting)]
But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a cross-validation set to decide when to stop training
For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly)...

Model complexity
More model parameters → better fit
- more Gaussian mixture components
- more MLP hidden units
More training data → can use larger models
For a fixed training set size, there will be some optimal model size (to avoid overfitting); see the sketch below
[Figure: word error rate (WER%) vs. hidden-layer size for several training-set sizes ("WER for PLPN-8k nets vs. net size & training data"); more parameters and more training data both help, with an optimal parameter/data ratio for constant training time]
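As a stand-in for the held-out-data idea above, a small runnable sketch (not from the notes, and using polynomial regression rather than a classifier): training error keeps improving as the model grows, while error on a held-out validation set picks out an intermediate model size.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: noisy samples of a smooth made-up function
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)

# Hold out part of the data; never use it for fitting
x_tr, y_tr = x[:40], y[:40]
x_va, y_va = x[40:], y[40:]

def mse(p, xs, ys):
    return np.mean((np.polyval(p, xs) - ys) ** 2)

# Growing model size (polynomial degree): training error falls monotonically,
# but validation error typically flags the over-sized models
for degree in (1, 3, 5, 9, 12):
    p = np.polyfit(x_tr, y_tr, degree)
    print(degree, round(mse(p, x_tr, y_tr), 3), round(mse(p, x_va, y_va), 3))
```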
Model combination
No single model is always best
But different models have different strengths
→ Benefits from combining several models, e.g.
    $\Pr(\omega|x) = \sum_k w_k\, \Pr(\omega|x, m_k)$
where each $\Pr(\omega|x, m_k)$ comes from a different model / feature set and $w_k = \Pr(m_k|x)$ estimates the probability that model k is correct
Should be able to incorporate everything into one model, but it may be easier to combine in practice
Trick is to have good estimates for $w_k$:
- static, from training data
- dynamic, from test data...

Summary
Classification is making discrete (hard) decisions
Basis is comparison with known examples
- explicitly, or via a model
Probabilistic interpretation for principled decisions
Ways to build models:
- neural nets can learn posteriors $\Pr(\omega|x)$
- Gaussians can model pdfs $p(x|\omega)$
- other approaches...
EM allows estimation of complex models
Parting thought: How do you know what's going on?