Lecture 3: Pattern Classification

EE E6820: Speech & Audio Processing & Recognition
Lecture 3: Pattern Classification

1. The problem of classification
2. Linear and nonlinear classifiers
3. Probabilistic classification
4. Gaussians, mixtures and EM
5. Methodological remarks

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e6820/
Columbia University Dept. of Electrical Engineering, Spring 2006

1 Pattern classification

Classification means finding categorical (discrete) labels for real-world (continuous) observations.

[Figure: speech spectrogram (f/Hz vs. time/s) and formant trajectories in F1/F2 space, with regions labeled "ay" and "ao"]

Why?
- information extraction
- conditional processing
- detect exceptions

Building a classifier

Define classes/attributes
- could state explicit rules
- better to define through training examples

Define the feature space

Define the decision algorithm
- set parameters from examples

Measure performance
- calculate (weighted) error rate

[Figure: Pols vowel formants, "u" (x) vs. "o" (o), scatter in F1/Hz vs. F2/Hz]

Classification system parts

Sensor → signal
Pre-processing / segmentation (e.g. STFT, locate vowels) → segment
Feature extraction (e.g. formant extraction) → feature vector
Classification → class
Post-processing (context constraints, costs/risk)

Feature extraction

The right features are critical
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications

Theoretically equivalent features may act very differently in practice
- representations make important aspects explicit
- remove irrelevant information

Feature design incorporates domain knowledge
- more data → less need for cleverness?

Smaller feature space (fewer dimensions)
→ simpler models (fewer parameters)
→ less training data needed
→ faster training

Minimum distance classification

[Figure: Pols vowel formants, "u" (x), "o" (o), "a" (+) in F1/F2 space, with a new observation marked *]

Find the closest match (nearest neighbor):
$\min_i D(x, y_{\omega,i})$
- the choice of distance metric $D(x, y)$ is important

Find a representative match (class prototypes):
$\min_\omega D(x, z_\omega)$
- class data $\{y_{\omega,i}\}$ → class model $z_\omega$
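As a concrete illustration of the two matching rules above, here is a minimal sketch (hypothetical 2-D formant-like data, numpy only) contrasting nearest-neighbor matching with per-class prototype (mean) matching; the data values and names are illustrative, not from the lecture.

```python
import numpy as np

# Hypothetical labeled training data: rows are feature vectors (e.g. F1, F2),
# labels name the class each row belongs to.
train_x = np.array([[300., 800.], [320., 850.], [500., 900.], [520., 950.]])
train_y = np.array(["u", "u", "o", "o"])

def nearest_neighbor(x, train_x, train_y):
    """Label x with the class of the single closest training example."""
    d = np.linalg.norm(train_x - x, axis=1)       # Euclidean distances D(x, y_i)
    return train_y[np.argmin(d)]

def min_distance_to_prototype(x, train_x, train_y):
    """Label x with the class whose mean (prototype z_w) is closest."""
    classes = np.unique(train_y)
    protos = np.array([train_x[train_y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(protos - x, axis=1)        # distances D(x, z_w)
    return classes[np.argmin(d)]

x_new = np.array([480., 870.])
print(nearest_neighbor(x_new, train_x, train_y))           # -> 'o'
print(min_distance_to_prototype(x_new, train_x, train_y))  # -> 'o'
```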

2 Linear and nonlinear classifiers

Minimum Euclidean distance is equivalent to a linear discriminant function:
- examples $\{y_{\omega,i}\}$ → template $z_\omega = \text{mean}\{y_{\omega,i}\}$
- observed feature vector $x = [x_1\ x_2\ \ldots]^T$
- Euclidean distance metric with $D^2[x, y] = (x - y)^T (x - y)$
- minimum distance rule: class $i^* = \arg\min_i D[x, z_i] = \arg\min_i D^2[x, z_i]$
- i.e. choose class O over U if $D^2[x, z_O] < D^2[x, z_U]$ ...

Linear classifier (cont'd)

Decision rule: choose class O over U if
$$D^2[x, z_O] < D^2[x, z_U]$$
$$(x - z_O)^T (x - z_O) < (x - z_U)^T (x - z_U)$$
$$x^T x - 2 z_O^T x + z_O^T z_O < x^T x - 2 z_U^T x + z_U^T z_U$$
$$2\,(z_O - z_U)^T x > z_O^T z_O - z_U^T z_U$$
$$\underbrace{(z_O - z_U)^T (x - z_U)}_{\text{projection}} \;>\; \underbrace{\tfrac{1}{2}\,(z_O - z_U)^T (z_O - z_U)}_{\text{threshold}}$$
i.e. choose O when the projection of $x - z_U$ onto the line from $z_U$ to $z_O$ exceeds half the distance between the two centers.
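To make the equivalence concrete, this small sketch (hypothetical class centers, numpy) checks that the linear rule $w^T x > b$ with $w = 2(z_O - z_U)$ and $b = z_O^T z_O - z_U^T z_U$, read off the derivation above, makes exactly the same decisions as comparing squared distances directly.

```python
import numpy as np

rng = np.random.default_rng(0)
z_u = np.array([310., 825.])     # class-U prototype (hypothetical)
z_o = np.array([510., 925.])     # class-O prototype (hypothetical)

# Linear discriminant derived from the minimum-distance rule:
w = 2.0 * (z_o - z_u)
b = z_o @ z_o - z_u @ z_u

for _ in range(5):
    x = rng.uniform([200., 700.], [600., 1000.])                 # random test point
    by_distance = np.sum((x - z_o)**2) < np.sum((x - z_u)**2)    # closer to z_O?
    by_discriminant = w @ x > b                                  # linear rule
    print(x, by_distance, by_discriminant)   # the two decisions always agree
```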

Linear classification boundaries

Minimum distance divides the space along a boundary normal to the line joining the class centers, crossing it at the midpoint:

[Figure: decision boundary perpendicular to $z_O - z_U$; a point $x$ is assigned by comparing the projection $(x - z_U)^T (z_O - z_U) / |z_O - z_U|$ to the threshold $\tfrac{1}{2}|z_O - z_U|$]

Scaling the axes changes the boundary:

[Figure: the same data squeezed horizontally or vertically gives different decision boundaries]

Decision boundaries

Linear functions give linear boundaries
- planes and hyperplanes as the number of dimensions increases

Linear boundaries can become curves under feature transformations
- e.g. a linear discriminant in $x' = [x_1\ x_2\ x_1^2\ x_2^2\ \ldots]^T$
- support vector machines ...

What about more complex boundaries?
- i.e. if a linear discriminant is $\sum_i w_i x_i$, how about a nonlinear function $F[\sum_i w_i x_i]$?
- $F[\cdot]$ changes only the threshold, but $\sum_j F[\sum_i w_{ij} x_i]$ offers more flexibility, and $F[\sum_j w_{jk} F[\sum_i w_{ij} x_i]]$ has even more ...
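Returning to the feature-transformation bullet above, here is a small sketch (made-up circular data, least-squares linear fit with numpy; none of it is from the lecture) showing that data a linear boundary cannot separate in $(x_1, x_2)$ becomes separable once squared features are appended, i.e. the boundary is linear in $x'$ but curved in $x$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up two-class data that is NOT linearly separable in (x1, x2):
# class -1 lies inside a circle of radius 1, class +1 in a ring well outside it.
n = 200
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
theta = rng.uniform(0, 2 * np.pi, 2 * n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
t = np.concatenate([-np.ones(n), np.ones(n)])            # targets -1 / +1

def linear_fit_accuracy(feats, t):
    """Least-squares linear discriminant w'x + b; return training accuracy."""
    A = np.column_stack([feats, np.ones(len(feats))])    # append bias term
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return np.mean(np.sign(A @ w) == t)

print(linear_fit_accuracy(X, t))                          # ~0.5: a linear boundary fails
X2 = np.column_stack([X, X[:, 0]**2, X[:, 1]**2])         # add squared features
print(linear_fit_accuracy(X2, t))                         # ~1.0: linear in x', curved in x
```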

Neural networks

Sums over nonlinear functions of sums → a large range of decision surfaces

e.g. a multi-layer perceptron (MLP) with one hidden layer:
$$y_k = F\Big[\sum_j w_{jk}\, F\Big[\sum_i w_{ij}\, x_i\Big]\Big]$$

[Figure: network diagram with input layer $x_1 \ldots x_3$, hidden layer $h_j$, output layer $y_k$; each unit applies the nonlinearity $F[\cdot]$ to a weighted sum]

The problem is finding the weights $w_{ij}$ ... (training)

Back-propagation training

Find $w_{ij}$ to minimize e.g.
$$E = \sum_n \big|y(x_n) - t_n\big|^2$$
for a training set of patterns $\{x_n\}$ and desired (target) outputs $\{t_n\}$.

Differentiate the error with respect to the weights
- contribution from each pattern $x_n$, $t_n$, with $y = F[\sum_j w_{jk} h_j]$:
$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial w_{jk}} = 2\,(y - t)\, F'\Big[\sum_j w_{jk} h_j\Big]\, h_j$$
$$w_{jk} \leftarrow w_{jk} - \alpha\,\frac{\partial E}{\partial w_{jk}}$$
i.e. gradient descent with learning rate $\alpha$.
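A minimal numpy sketch of these update rules, assuming a 2-input, 3-hidden, 1-output sigmoid MLP trained on XOR (a problem no single linear unit can solve); the architecture, seed and learning rate are illustrative choices, not the lecture's 2:5:3 vowel network.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # inputs
T = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)          # input -> hidden weights
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)          # hidden -> output weights
alpha = 1.0                                              # learning rate

for epoch in range(5000):
    # forward pass: h = F[W1'x], y = F[W2'h]
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # backward pass: gradients of E = sum |y - t|^2 (F' = F(1-F) for the sigmoid)
    dY = 2 * (Y - T) * Y * (1 - Y)          # error term at the output units
    dH = (dY @ W2.T) * H * (1 - H)          # back-propagated to the hidden units
    # gradient-descent updates  w <- w - alpha * dE/dw
    W2 -= alpha * H.T @ dY;  b2 -= alpha * dY.sum(0)
    W1 -= alpha * X.T @ dH;  b1 -= alpha * dH.sum(0)

print(np.round(Y.ravel(), 2))   # should approach [0, 1, 1, 0]
                                # (an unlucky initialization may need more epochs)
```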

Neural net example

2 input units (normalized F1, F2), 5 hidden units, 3 output units ("U", "O", "A")

[Figure: 2:5:3 network mapping F1, F2 to outputs A, O, U]

Sigmoid nonlinearity:
$$F[x] = \frac{1}{1 + e^{-x}} \qquad \frac{dF}{dx} = F(x)\,\big(1 - F(x)\big)$$

[Figure: sigmoid(x) and its derivative]

Neural net training

[Figure: 2:5:3 net, mean squared error by training epoch; vowel-space decision contours at two stages of training]

(example ...)

Support vector machines (SVMs)

Principled nonlinear decision boundaries ...

Key idea: the boundary can be linear in a higher-dimensional projected space $\Phi(x)$
- depends only on the training points at the edges (the support vectors)
- unique optimal solution for a given projection $\Phi$
- the solution involves only inner products in the projected space,
  e.g. via a kernel: $k(x, y) = \Phi(x) \cdot \Phi(y)$

[Figure: separating hyperplane $(w \cdot x) + b = 0$ with margin hyperplanes $(w \cdot x) + b = \pm 1$ passing through the support vectors; class labels $y_i = \pm 1$]

3 Statistical interpretation

Observations are random variables whose distribution depends on the class:

class $\omega_i$ (hidden, discrete) → source distribution $p(x \mid \omega_i)$ → observation $x$ (continuous)

Source distributions $p(x \mid \omega_i)$
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)

[Figure: example class-conditional densities $p(x \mid \omega_i)$ and priors $\Pr(\omega_i)$ for classes $\omega_1 \ldots \omega_4$]
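To illustrate the "inner products only" point, here is a small sketch (hypothetical 2-D vectors, numpy) checking that a degree-2 polynomial kernel $k(x,y) = (x \cdot y + 1)^2$ equals an explicit inner product $\Phi(x) \cdot \Phi(y)$ in a 6-dimensional expanded space, so a classifier can work in that space without ever forming $\Phi$ explicitly.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def kernel(x, y):
    return (x @ y + 1.0) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)
print(kernel(x, y))          # kernel value computed directly from x, y
print(phi(x) @ phi(y))       # identical value via the explicit 6-D projection
```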

Random variables review

Random variables have joint distributions (pdfs) $p(x, y)$:

[Figure: joint density $p(x,y)$ with marginals $p(x)$, $p(y)$, means $\mu_x$, $\mu_y$ and spreads $\sigma_x$, $\sigma_y$]

- marginal: $p(x) = \int p(x, y)\, dy$
- covariance: $\Sigma = E\big[(o - \mu)(o - \mu)^T\big] = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix}$

Conditional probability

Knowing one value in a joint distribution constrains the remainder:
$$p(x \mid y = Y) = \frac{p(x, y = Y)}{p(y = Y)}$$

[Figure: slice through the joint density $p(x,y)$ at $y = Y$]

Bayes' rule:
$$p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x) \quad\Rightarrow\quad p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$
- can reverse the conditioning given the priors/marginals
- either term can be discrete or continuous
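A tiny numeric check of the relations above, using a made-up 2x3 discrete joint distribution (numpy): the conditional obtained by slicing and renormalizing the joint matches the one obtained by reversing the conditioning with Bayes' rule.

```python
import numpy as np

# Made-up joint distribution p(x, y): rows index y (2 values), columns index x (3 values).
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
assert np.isclose(p_xy.sum(), 1.0)

p_y = p_xy.sum(axis=1)            # marginal p(y)
p_x = p_xy.sum(axis=0)            # marginal p(x)

p_x_given_y = p_xy / p_y[:, None]       # p(x|y): each row renormalized
p_y_given_x = p_xy / p_x[None, :]       # p(y|x): each column renormalized

# Bayes' rule: p(y|x) = p(x|y) p(y) / p(x)
bayes = p_x_given_y * p_y[:, None] / p_x[None, :]
print(np.allclose(bayes, p_y_given_x))   # True
```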

Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information $x$:
$$\Pr(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, \Pr(\omega_i)}{\sum_j p(x \mid \omega_j)\, \Pr(\omega_j)}$$
with prior probability $\Pr(\omega_i)$, likelihood $p(x \mid \omega_i)$, evidence $p(x)$, and posterior probability $\Pr(\omega_i \mid x)$.

The posterior is the prior scaled by the likelihood and normalized by the evidence (so the posteriors sum to 1).

Objection: priors are often unknown
- ... but omitting them amounts to assuming they are all equal

Bayesian (MAP) classifier

Minimize the probability of error by choosing the maximum a posteriori (MAP) class:
$$\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i \mid x)$$

Intuitively right: choose the most probable class in light of the observation.

Formally, the expected probability of error is
$$E[\Pr(\text{err})] = \int_X p(x)\, \Pr(\text{err} \mid x)\, dx = \sum_i \int_{X_{\hat\omega_i}} p(x)\, \big(1 - \Pr(\omega_i \mid x)\big)\, dx$$
where $X_{\hat\omega_i}$ is the region of $X$ where $\omega_i$ is chosen
... but $\Pr(\omega_i \mid x)$ is largest in that region, so $\Pr(\text{err})$ is minimized.
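A minimal sketch of the MAP rule, assuming two classes with made-up 1-D Gaussian likelihoods and unequal priors (numpy only); it returns the class with the largest posterior for a given observation.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Made-up class-conditional models p(x|w_i) and priors Pr(w_i).
classes = ["omega1", "omega2"]
mus     = np.array([0.0, 2.0])
sigmas  = np.array([1.0, 0.5])
priors  = np.array([0.7, 0.3])

def map_classify(x):
    likelihoods = gauss_pdf(x, mus, sigmas)          # p(x|w_i) for each class
    posteriors = likelihoods * priors                # proportional to Pr(w_i|x)
    posteriors /= posteriors.sum()                   # normalize by the evidence p(x)
    return classes[int(np.argmax(posteriors))], posteriors

print(map_classify(0.5))    # prior favors omega1
print(map_classify(1.9))    # likelihood pulls the decision to omega2
```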

Practical implementation

The optimal classifier is $\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i \mid x)$, but we don't know $\Pr(\omega_i \mid x)$.

We can model it directly, e.g. train a neural net / SVM to map from inputs $x$ to a set of outputs $\Pr(\omega_i \mid x)$
- a discriminative model

Often it is easier to model the conditional distributions $p(x \mid \omega_i)$, then use Bayes' rule to find the MAP class:

labeled training examples $\{x_n, \omega_n\}$ → sort according to class → estimate a conditional pdf for each class, e.g. $p(x \mid \omega_1)$

Likelihood models

Given models for each distribution $p(x \mid \omega_i)$, the search
$$\hat\omega = \arg\max_{\omega_i} \Pr(\omega_i \mid x) = \arg\max_{\omega_i} \frac{p(x \mid \omega_i)\, \Pr(\omega_i)}{\sum_j p(x \mid \omega_j)\, \Pr(\omega_j)}$$
but the denominator $p(x)$ is the same over all $\omega_i$, hence
$$\hat\omega = \arg\max_{\omega_i} p(x \mid \omega_i)\, \Pr(\omega_i) = \arg\max_{\omega_i} \big[\log p(x \mid \omega_i) + \log \Pr(\omega_i)\big]$$

Choose the parameters of each $p(x \mid \omega_i)$ to maximize the likelihood of the training data $x_{\text{train}}$.

Sources of error

[Figure: two scaled class-conditional densities $p(x \mid \omega_1)\Pr(\omega_1)$ and $p(x \mid \omega_2)\Pr(\omega_2)$ with decision regions $X_{\hat\omega_1}$, $X_{\hat\omega_2}$ and the shaded error probabilities $\Pr(\text{err})$ in each region]

Suboptimal threshold / regions (bias error)
- use a Bayes classifier!

Incorrect distributions (model error)
- better distribution models / more training data

Misleading features ("Bayes error")
- irreducible for a given feature set, regardless of the classification scheme

4 Gaussian models

The easiest way to model distributions is via a parametric model
- assume a known form, estimate a few parameters

The Gaussian model is simple and useful. In 1-D:
$$p(x \mid \omega_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, \exp\!\left[-\frac{1}{2}\,\frac{(x - \mu_i)^2}{\sigma_i^2}\right]$$
(the normalization makes it integrate to 1)

Parameters: mean $\mu_i$ and variance $\sigma_i^2$ → fit to the data

[Figure: 1-D Gaussian $p(x \mid \omega_i)$ with peak $P_M$ at $\mu_i$ and value $P_M\,e^{-1/2}$ at $\mu_i \pm \sigma_i$]
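A short sketch of fitting this parametric model by maximum likelihood (sample mean and variance per class, on made-up 1-D data) and then classifying with $\log p(x \mid \omega_i) + \log \Pr(\omega_i)$ as in the likelihood-model slide above; numpy only, all numbers illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D training data for two classes.
data = {"omega1": rng.normal(0.0, 1.0, 500),
        "omega2": rng.normal(2.5, 0.7, 200)}

# ML estimates: sample mean and variance per class; priors from the class counts.
n_total = sum(len(x) for x in data.values())
models = {c: (x.mean(), x.var(), len(x) / n_total) for c, x in data.items()}

def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

def classify(x):
    """argmax over classes of log p(x|w) + log Pr(w)."""
    scores = {c: log_gauss(x, mu, var) + np.log(prior)
              for c, (mu, var, prior) in models.items()}
    return max(scores, key=scores.get)

print(models)
print(classify(0.2), classify(2.4))   # expected: 'omega1', 'omega2'
```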

Gaussians in d dimensions:
$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, \exp\!\left[-\tfrac{1}{2}\,(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]$$

Described by a $d$-dimensional mean vector $\mu_i$ and a $d \times d$ covariance matrix $\Sigma_i$.

[Figure: 2-D Gaussian density surface and its elliptical contours]

Classify by maximizing the log likelihood, i.e.
$$\arg\max_{\omega_i}\; -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log \Pr(\omega_i)$$

Multivariate Gaussian covariance

Eigen-analysis of the inverse covariance gives
$$\Sigma_i^{-1} = (SV)^T (SV), \qquad S = \Lambda^{-1/2} \text{ diagonal}$$
$$\text{Gaussian} \sim \exp\!\left[-\tfrac{1}{2}\,(x - \mu_i)^T (SV)^T (SV)\,(x - \mu_i)\right]$$
i.e. $(SV)$ transforms $x$ into an uncorrelated, normalized-variance space.

[Figure: original $x$ space vs. the transformed $SVx$ space; a vector $v$ maps to $v' = SVv$]

$(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)$ is the Mahalanobis distance
- Euclidean distance in the normalized space

(example ...)
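A brief numeric check of that last equivalence (made-up 2-D covariance, numpy): the Mahalanobis distance computed with $\Sigma^{-1}$ equals the squared Euclidean distance after transforming by $SV$, where $V$ holds the eigenvectors of $\Sigma$ and $S = \Lambda^{-1/2}$.

```python
import numpy as np

# Made-up mean and covariance for a 2-D Gaussian class model.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigen-analysis: Sigma = V diag(lam) V^T (columns of V are eigenvectors).
lam, V = np.linalg.eigh(Sigma)
SV = np.diag(lam ** -0.5) @ V.T          # S = Lambda^{-1/2}, the whitening transform

x = np.array([2.5, 1.0])
d_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)    # Mahalanobis distance
d_white = np.sum((SV @ (x - mu)) ** 2)                  # Euclidean distance after SV
print(d_mahal, d_white)                                 # the two values agree
```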

Gaussian mixture models (GMMs)

Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlation

What about a weighted sum? i.e.
$$p(x) = \sum_k c_k\, p(x \mid m_k)$$
where $\{c_k\}$ is a set of weights and $\{p(x \mid m_k)\}$ is a set of Gaussian components
- can fit anything given enough components

Interpretation: each observation is generated by one of the Gaussians, chosen at random with priors $c_k = \Pr(m_k)$.

Gaussian mixtures (2)

e.g. nonlinear correlation:
[Figure: original data with a nonlinearly correlated shape, the individual Gaussian components, and the resulting modeled density surface]

Problem: finding the $c_k$ and $m_k$ parameters
- easy if we knew which $m_k$ generated each $x$
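A small sketch of the generative interpretation (made-up 1-D mixture, numpy): pick a component with probability $c_k$, then draw from that Gaussian; the density of the resulting data is the weighted sum of the component densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D mixture: weights c_k, means mu_k, standard deviations sd_k.
c  = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sd = np.array([0.5, 1.0, 0.8])

# Generative view: choose a component k ~ c, then draw x ~ N(mu_k, sd_k^2).
k = rng.choice(len(c), size=10000, p=c)
samples = rng.normal(mu[k], sd[k])

def gmm_pdf(x):
    """Mixture density p(x) = sum_k c_k N(x; mu_k, sd_k^2)."""
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)
    return (comp * c).sum(axis=1)

# The histogram of the samples should follow the analytic mixture density.
hist, edges = np.histogram(samples, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - gmm_pdf(centers))))   # small (sampling noise only)
```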

Expectation-maximization (EM)

A general procedure for estimating a model when some parameters are unknown
- e.g. which component generated a value in a Gaussian mixture model

Procedure: iteratively update the model parameters $\Theta$ to maximize $Q$, the expected value of the log-probability of the known training data $x_{\text{trn}}$ and the unknown parameters $u$:
$$Q(\Theta, \Theta^{\text{old}}) = \sum_u \Pr(u \mid x_{\text{trn}}, \Theta^{\text{old}})\, \log p(u, x_{\text{trn}} \mid \Theta)$$
- iterative because $\Pr(u \mid x_{\text{trn}}, \Theta^{\text{old}})$ changes each time we change $\Theta$
- can prove this increases the data likelihood, hence a maximum-likelihood (ML) model
- local optimum - depends on the initialization

Fitting GMMs with EM

Want to find the component Gaussians $m_k = \{\mu_k, \Sigma_k\}$ plus the weights $c_k = \Pr(m_k)$ that maximize the likelihood of the training data $x_{\text{trn}}$.

If we could assign each $x$ to a particular $m_k$, estimation would be direct.

Hence, treat the $k$s (mixture indices) as unknown, form $Q = E[\log p(x, k \mid \Theta)]$, and differentiate with respect to the model parameters → equations for $\mu_k$, $\Sigma_k$ and $c_k$ that maximize $Q$ ...

GMM EM update equations

Solving for the parameters that maximize $Q$ gives:
$$\mu_k = \frac{\sum_n p(k \mid x_n, \Theta)\, x_n}{\sum_n p(k \mid x_n, \Theta)}$$
$$\Sigma_k = \frac{\sum_n p(k \mid x_n, \Theta)\, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_n p(k \mid x_n, \Theta)}$$
$$c_k = \frac{1}{N} \sum_n p(k \mid x_n, \Theta)$$

Each involves $p(k \mid x_n, \Theta)$, the "fuzzy membership" of $x_n$ in Gaussian $k$.

Each parameter is just a sample average, weighted by the fuzzy membership.

GMM examples

Vowel data fit with different mixture counts:
[Figure: fits with 1, 2, 3 and 4 Gaussian components; the data log-likelihood improves as components are added]

(example ...)
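A compact sketch of these updates for a 1-D, two-component mixture (made-up data, numpy); the E-step computes the fuzzy memberships $p(k \mid x_n, \Theta)$ and the M-step applies the three weighted-average equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 1-D data drawn from two clusters.
x = np.concatenate([rng.normal(-2.0, 0.6, 300), rng.normal(1.5, 1.0, 700)])
N, K = len(x), 2

# Initial guesses for the parameters Theta = {c_k, mu_k, var_k}.
c   = np.full(K, 1.0 / K)
mu  = np.array([-1.0, 1.0])
var = np.ones(K)

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for it in range(100):
    # E-step: fuzzy memberships p(k|x_n, Theta), shape (N, K).
    joint = c * gauss(x[:, None], mu, var)
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M-step: membership-weighted averages (the three update equations).
    nk  = resp.sum(axis=0)
    mu  = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    c   = nk / N

print(c, mu, np.sqrt(var))   # should be near (0.3, 0.7), (-2.0, 1.5), (0.6, 1.0)
```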

5 Methodological remarks: training and test data

A rich model can learn every training example ("overtraining"):

[Figure: error rate vs. training epochs or number of parameters; training-set error keeps falling while test-set error eventually rises (overfitting)]

But the goal is to classify new, unseen data, i.e. generalization
- sometimes a cross-validation set is used to decide when to stop training

For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly) ...

Model complexity

More model parameters → better fit
- more Gaussian mixture components
- more MLP hidden units

More training data → can use larger models

For a fixed training set size, there will be some optimal model size (to avoid overfitting):

[Figure: word error rate (WER%) for PLPN-8k nets vs. hidden-layer size and hours of training data; curves of constant training time show an optimal parameter/data ratio]
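A minimal sketch of the "don't test with training data" point (made-up 2-D data, numpy): a 1-nearest-neighbor classifier scores perfectly on its own training set simply by memorizing it, but noticeably worse on a held-out split, so only the held-out number says anything about generalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-D data: two overlapping Gaussian classes.
n = 400
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (n, 2)),
               rng.normal([1.5, 1.5], 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Random split into training and held-out test sets.
idx = rng.permutation(len(X))
tr, te = idx[:600], idx[600:]

def knn_predict(Xq, Xtr, ytr):
    """1-nearest-neighbor: label each query with its closest training point's class."""
    d = np.linalg.norm(Xq[:, None, :] - Xtr[None, :, :], axis=2)
    return ytr[np.argmin(d, axis=1)]

acc_train = np.mean(knn_predict(X[tr], X[tr], y[tr]) == y[tr])   # 1.0: memorized
acc_test  = np.mean(knn_predict(X[te], X[tr], y[tr]) == y[te])   # lower: the honest number
print(acc_train, acc_test)
```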

Model combination

No single model is always best, but different models have different strengths
→ benefits from combining several models, e.g.
$$\Pr(\omega \mid x) = \sum_k \Pr(\omega \mid x, m_k)\, \Pr(m_k \mid x)$$
where each $\Pr(\omega \mid x, m_k)$ comes from a different model / feature set, and $w_k = \Pr(m_k \mid x)$ estimates the probability that model $k$ is correct.

Should be able to incorporate everything into one model, but it may be easier to combine in practice.

The trick is to have good estimates for the $w_k$:
- static, from the training data
- dynamic, from the test data ...

Summary

Classification is making discrete (hard) decisions.

The basis is comparison with known examples
- explicitly, or via a model

A probabilistic interpretation allows principled decisions.

Ways to build models:
- neural nets can learn posteriors $\Pr(\omega \mid x)$
- Gaussians can model pdfs $p(x \mid \omega)$
- other approaches ...

EM allows estimation of complex models.

Parting thought: how do you know what's going on?
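A last small sketch of the combination formula (two made-up models over three classes, numpy): the combined posterior is just the per-model posteriors weighted by estimates of each model's reliability.

```python
import numpy as np

# Posteriors Pr(w|x, m_k) from two hypothetical models for one observation x,
# over three classes; each row sums to 1.
post_model = np.array([[0.60, 0.30, 0.10],    # model 1 (e.g. one feature set)
                       [0.20, 0.70, 0.10]])   # model 2 (e.g. another feature set)

# Estimated weights w_k = Pr(m_k|x), e.g. from held-out performance; they sum to 1.
w = np.array([0.4, 0.6])

combined = w @ post_model           # Pr(w|x) = sum_k Pr(w|x, m_k) Pr(m_k|x)
print(combined, combined.argmax())  # -> [0.36 0.54 0.10], class index 1
```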