Proc. Digital del Continguts Musicals (Digital Processing of Musical Content)
Session 1: Pattern Recognition

1. Music Content Analysis
2. Pattern Classification
3. The Statistical Approach
4. Distribution Models
5. Singing Detection

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York
Spring 2003

Musical Content - Dan Ellis - 1: Pattern Recognition - 2003-03-18
1  Music Content Analysis

Music contains information at many levels - what is it?
We'd like to get this information out automatically:
- fine-level transcription of events
- broad-level classification of pieces
Information extraction can be framed as pattern classification / recognition, or machine learning:
- build systems based on (labeled) training data
Music analysis

What might we want to get out of music?
Instrument identification
- different levels of specificity
- registers within instruments
Score recovery
- transcribe the note sequence
- extract the performance
Ensemble performance
- gestalts: chords, tone colors
Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
Course Outline

Session 1: Pattern Classification
- MLP neural networks
- GMM distribution models
- Singing detection
Session 2: Sequence Recognition
- The hidden Markov model
- Chord sequence recognition
Session 3: Musical Applications
- Note transcription
- Music summarization
- Music information retrieval
- Similarity browsing
Outline

1. Music Content Analysis
2. Pattern Classification
   - classification
   - decision boundaries
   - neural networks
3. The Statistical Approach
4. Distribution Models
5. Singing Detection
2  Pattern Classification

Classification means finding categorical (discrete) labels for real-world (continuous) observations.
[Figure: spectrogram (freq/Hz vs. time/s) and F1/F2 formant scatter separating the vowels "ay" and "ao"]
Why?
- information extraction
- conditional processing
- detect exceptions
Building a classifier

Define classes/attributes
- could state explicit rules
- better to define through training examples
Define feature space
Define decision algorithm
- set parameters from examples
Measure performance
- calculate (weighted) error rate
[Figure: Pols vowel formants, "u" (x) vs. "o" (o), F1 vs. F2 in Hz]
Classification system parts

Sensor → signal
Pre-processing/segmentation (e.g. STFT, locate vowels) → segment
Feature extraction (e.g. formant extraction) → feature vector
Classification → class
Post-processing (context constraints, costs/risk)
Feature extraction

Right features are critical
- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications
Theoretically equivalent features may act very differently in practice
- representations make important aspects explicit
- remove irrelevant information
Feature design incorporates domain knowledge
- more data → less need for cleverness?
Smaller feature space (fewer dimensions)
→ simpler models (fewer parameters)
→ less training data needed
→ faster training
Minimum distance classification

[Figure: Pols vowel formants "u" (x), "o" (o), "a" (+) in the F1/F2 plane, with a new observation x marked]
Find closest match (nearest neighbor):
  min_i D(x, y_{ω,i})
- choice of distance metric D(x, y) is important
Find representative match (class prototypes):
  min_ω D(x, z_ω)
- class data {y_{ω,i}} → class model z_ω
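The two matching rules above can be sketched in a few lines of numpy. The 2-D "formant" values below are invented for illustration (not the Pols data):

```python
import numpy as np

# Toy 2-D "formant" data for two classes (hypothetical values, not the Pols data)
class_data = {
    "u": np.array([[300.0, 700.0], [320.0, 650.0], [280.0, 720.0]]),
    "o": np.array([[500.0, 900.0], [480.0, 950.0], [520.0, 880.0]]),
}

def nearest_neighbor(x, data):
    """Label x by its single closest training example (min over all y_{w,i})."""
    best, best_d = None, np.inf
    for label, pts in data.items():
        d = np.min(np.linalg.norm(pts - x, axis=1))  # Euclidean metric D(x, y)
        if d < best_d:
            best, best_d = label, d
    return best

def nearest_prototype(x, data):
    """Label x by the closest class prototype z_w = mean{y_{w,i}}."""
    protos = {label: pts.mean(axis=0) for label, pts in data.items()}
    return min(protos, key=lambda label: np.linalg.norm(protos[label] - x))

x_new = np.array([310.0, 690.0])
print(nearest_neighbor(x_new, class_data))
print(nearest_prototype(x_new, class_data))
```

For this well-separated toy data both rules agree; they diverge when class shapes are irregular and a single prototype is a poor summary.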
Linear classifiers

Minimum Euclidean distance is equivalent to a linear discriminant function:
- examples {y_{ω,i}} → template z_ω = mean{y_{ω,i}}
- observed feature vector x = [x_1 x_2 ...]^T
- Euclidean distance metric D(x, y) = sqrt[(x − y)^T (x − y)]
- minimum distance rule:
  class i = argmin_i D(x, z_i) = argmin_i D²(x, z_i)
- i.e. choose class O over U if D²(x, z_O) < D²(x, z_U) ...
Linear classification boundaries

Min-distance divides normal to the class centers: the decision boundary is where
  (x − z_U)^T (z_O − z_U) / |z_O − z_U| = ½ |z_O − z_U|
i.e. the projection of (x − z_U) onto the center-to-center direction (z_O − z_U) lies half way between the centers.
[Figure: boundary perpendicular to the line joining z_U and z_O]
Scaling the axes changes the boundary:
[Figure: squeezing the data horizontally or vertically rotates the boundary]
Decision boundaries

Linear functions give linear boundaries
- planes and hyperplanes as dimensions increase
Linear boundaries can become curves under feature transformations
- e.g. a linear discriminant in x' = [x_1² x_2² ...]^T
What about more complex boundaries?
- i.e. if a linear discriminant is Σ_i w_i x_i, how about a nonlinear function?
  F[Σ_i w_i x_i] changes only the threshold, but
  Σ_j w_j F[Σ_i w_ij x_i] offers more flexibility, and
  F[Σ_j w_jk F[Σ_i w_ij x_i]] has even more...
Neural networks

Sums over nonlinear functions of sums → large range of decision surfaces
e.g. multi-layer perceptron (MLP) with 1 hidden layer:
  y_k = F[ Σ_j w_jk F[ Σ_i w_ij x_i ] ]
[Figure: input layer x_1, x_2, x_3; weights w_ij; hidden units h_j; weights w_jk; outputs y_1, y_2]
Problem is finding the weights w_ij ... (training)
Back-propagation training

Find w_ij to minimize e.g.
  E = Σ_n |y(x_n) − t_n|²
for a training set of patterns {x_n} and desired (target) outputs {t_n}.
Differentiate the error with respect to the weights
- contribution from each pattern x_n, t_n, with y_k = F[Σ_j w_jk h_j]:
  ∂E/∂w_jk = (∂E/∂y)(∂y/∂w_jk) = 2(y − t) · F'[Σ_j w_jk h_j] · h_j
  w_jk ← w_jk − α ∂E/∂w_jk
i.e. gradient descent with learning rate α
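The update rule above can be sketched as a tiny batch-gradient-descent loop in numpy. This is an illustrative 2:3:1 net on XOR targets (sizes, rates, and data are invented, not the lecture's 2:5:3 vowel net):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR inputs and targets (a standard toy problem, not the vowel data)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights w_ij
W2 = rng.normal(scale=0.5, size=(3, 1))   # hidden -> output weights w_jk
alpha = 0.5                               # learning rate

errors = []
for epoch in range(5000):
    H = sigmoid(X @ W1)                   # hidden activations h_j
    Y = sigmoid(H @ W2)                   # outputs y_k
    errors.append(np.sum((Y - T) ** 2))   # E = sum |y - t|^2
    # Back-propagate: dE/dw = 2(y - t) F'(.) h, with F' = F(1 - F) for the sigmoid
    dY = 2 * (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= alpha * H.T @ dY                # w_jk <- w_jk - alpha dE/dw_jk
    W1 -= alpha * X.T @ dH

print(errors[0], errors[-1])   # squared error should fall with training
```

Gradient descent only finds a local optimum, so the final error depends on the random initialization, just as with the GMM training later in this session.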
Neural net example

2 input units (normalized F1, F2)
5 hidden units, 3 output units ("U", "O", "A")
[Figure: 2:5:3 network mapping F1, F2 to A, O, U]
Sigmoid nonlinearity:
  F[x] = 1 / (1 + e^(−x))
  dF/dx = F(1 − F)
[Figure: sigm(x) and its derivative over x = −5 ... 5]
Neural net training

[Figure: 2:5:3 net, mean squared error by training epoch (0–100)]
[Figure: decision contours after 10 and after 100 iterations]
example...
Outline

1. Music Content Analysis
2. Pattern Classification
3. The Statistical Approach
   - conditional distributions
   - Bayes rule
4. Distribution Models
5. Singing Detection
3  The Statistical Approach

Observations are random variables whose distribution depends on the class:
- class ω_i (hidden, discrete) → observation x (continuous)
[Figure: source distributions p(x|ω_i) and posteriors Pr(ω_i|x) for classes ω_1 ... ω_4]
Source distributions p(x|ω_i)
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
Random variables review

Random variables have joint distributions (pdfs) p(x, y):
[Figure: joint pdf p(x,y) with marginals p(x), p(y), means µ_x, µ_y, and spreads σ_x, σ_y, σ_xy]
- marginal: p(x) = ∫ p(x, y) dy
- covariance: Σ = E[(x − µ)(x − µ)^T] = [ σ_x²  σ_xy ; σ_xy  σ_y² ]
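The mean and covariance definitions above can be checked numerically by drawing correlated samples and recomputing both from the data (the particular µ and Σ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw correlated 2-D samples, then recover the mean vector and the
# covariance matrix Sigma = E[(x - mu)(x - mu)^T] as sample averages
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.8],
                     [0.8, 1.0]])
samples = rng.multivariate_normal(true_mean, true_cov, size=200_000)

mu_hat = samples.mean(axis=0)
centered = samples - mu_hat
cov_hat = centered.T @ centered / len(samples)   # sample estimate of Sigma

print(mu_hat)    # close to [1, -2]
print(cov_hat)   # off-diagonal entries estimate sigma_xy
```

With 200k samples the estimates land within a couple of percent of the true parameters, which is why distribution models can be "estimated from data" at all.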
Conditional probability

Knowing one value in a joint distribution constrains the remainder:
[Figure: slicing p(x,y) at y = Y gives the conditional p(x|y = Y)]
Bayes rule:
  p(x, y) = p(x|y) p(y)
  ⇒ p(y|x) = p(x|y) p(y) / p(x)
→ can reverse the conditioning given priors/marginal
- either term can be discrete or continuous
Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information x.
Bayes rule:
  Pr(ω_i|x) = p(x|ω_i) Pr(ω_i) / Σ_j p(x|ω_j) Pr(ω_j)
- Pr(ω_i): prior probability
- p(x|ω_i): likelihood
- Σ_j p(x|ω_j) Pr(ω_j) = p(x): evidence
- Pr(ω_i|x): posterior probability
Posterior is the prior scaled by the likelihood & normalized by the evidence (so Σ(posteriors) = 1)
Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal
Bayesian (MAP) classifier

Minimize the probability of error by choosing the maximum a posteriori (MAP) class:
  ω̂ = argmax_{ω_i} Pr(ω_i|x)
Intuitively right
- choose the most probable class in light of the observation
Given models for each distribution p(x|ω_i), the search becomes
  ω̂ = argmax_{ω_i} p(x|ω_i) Pr(ω_i) / Σ_j p(x|ω_j) Pr(ω_j)
but the denominator p(x) is the same over all ω_i, hence
  ω̂ = argmax_{ω_i} p(x|ω_i) Pr(ω_i)
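A minimal numeric sketch of the MAP rule, using two classes with 1-D Gaussian likelihoods (all the priors, means, and variances here are invented for illustration):

```python
import numpy as np

# Two classes with known priors and 1-D Gaussian likelihoods; compute the
# posterior Pr(w_i|x) via Bayes rule and pick the MAP class
priors = np.array([0.7, 0.3])                  # Pr(w_1), Pr(w_2)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def map_class(x):
    likelihoods = gauss_pdf(x, means, stds)    # p(x|w_i)
    joint = likelihoods * priors               # numerator of Bayes rule
    posteriors = joint / joint.sum()           # divide by the evidence p(x)
    return np.argmax(posteriors), posteriors

cls, post = map_class(0.5)
print(cls, post)
```

Note that `joint.sum()` plays the role of the denominator p(x): it rescales the posteriors to sum to 1 but never changes which class wins, which is exactly why the slide drops it.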
Practical implementation

Optimal classifier is ω̂ = argmax_{ω_i} Pr(ω_i|x), but we don't know Pr(ω_i|x).
Can model it directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ω_i|x)
- a discriminative model
Often easier to model the conditional distributions p(x|ω_i), then use Bayes rule to find the MAP class:
- labeled training examples {x_n, ω_{x_n}}
- sort according to class
- estimate the conditional pdf for each class, e.g. p(x|ω_1)
Outline

1. Music Content Analysis
2. Pattern Classification
3. The Statistical Approach
4. Distribution Models
   - Gaussian models
   - Gaussian mixtures
5. Singing Detection
4  Distribution Models

Easiest way to model distributions is via a parametric model
- assume a known form, estimate a few parameters
Gaussian model is simple & useful; in 1 dimension:
  p(x|ω_i) = [1 / (√(2π) σ_i)] · exp[ −½ ((x − µ_i) / σ_i)² ]
(the leading factor normalizes the pdf to integrate to 1)
Parameters: mean µ_i and variance σ_i²
[Figure: 1-D Gaussian p(x|ω_i) centered at µ_i; height falls to e^(−1/2) of the peak at µ_i ± σ_i]
Gaussians in d dimensions:
  p(x|ω_i) = [1 / ((2π)^(d/2) |Σ_i|^(1/2))] · exp[ −½ (x − µ_i)^T Σ_i^(−1) (x − µ_i) ]
Described by a d-dimensional mean vector µ_i and a d × d covariance matrix Σ_i
[Figure: surface and contour plots of a 2-D Gaussian]
Classify by maximizing the log likelihood, i.e.
  argmax_{ω_i} [ −½ (x − µ_i)^T Σ_i^(−1) (x − µ_i) − ½ log|Σ_i| + log Pr(ω_i) ]
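The log-likelihood score above translates directly into code. A sketch with two 2-D classes (means, covariances, and priors invented for illustration):

```python
import numpy as np

# Classify by the per-class score from the slide:
# -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log Pr(w_i)
classes = {
    "A": {"mu": np.array([0.0, 0.0]),
          "cov": np.array([[1.0, 0.0], [0.0, 1.0]]),
          "prior": 0.5},
    "B": {"mu": np.array([3.0, 3.0]),
          "cov": np.array([[2.0, 0.5], [0.5, 1.0]]),
          "prior": 0.5},
}

def log_score(x, p):
    diff = x - p["mu"]
    inv = np.linalg.inv(p["cov"])
    return (-0.5 * diff @ inv @ diff
            - 0.5 * np.log(np.linalg.det(p["cov"]))
            + np.log(p["prior"]))

def classify(x):
    return max(classes, key=lambda c: log_score(x, classes[c]))

print(classify(np.array([0.5, 0.2])))   # near mu_A
print(classify(np.array([2.8, 3.1])))   # near mu_B
```

Working in the log domain avoids underflow from the exponential and turns the product of likelihood and prior into a sum.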
Gaussian mixture models (GMMs)

Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlation
What about a weighted sum? i.e.
  p(x) ≈ Σ_k c_k p(x|m_k)
where {c_k} is a set of weights and {p(x|m_k)} is a set of Gaussian components
- can fit anything given enough components
Interpretation: each observation is generated by one of the Gaussians, chosen at random with priors c_k = Pr(m_k)
Gaussian mixtures (2)

e.g. nonlinear correlation:
[Figure: original data with nonlinear correlation, the Gaussian components, and the resulting mixture surface]
Problem: finding the c_k and m_k parameters
- easy if we knew which m_k generated each x
Expectation-maximization (EM)

General procedure for estimating a model when some parameters are unknown
- e.g. which component generated a value in a Gaussian mixture model
Procedure: iteratively update fixed model parameters Θ to maximize Q, the expected log-probability of the known training data x_trn and the unknown parameters u:
  Q(Θ, Θ_old) = Σ_u Pr(u|x_trn, Θ_old) · log p(u, x_trn|Θ)
- iterative because Pr(u|x_trn, Θ_old) changes each time we change Θ
- can prove this increases data likelihood, hence maximum-likelihood (ML) model
- local optimum - depends on initialization
Fitting GMMs with EM

Want to find the component Gaussians m_k = {µ_k, Σ_k} plus weights c_k = Pr(m_k) to maximize the likelihood of the training data x_trn
If we could assign each x to a particular m_k, estimation would be direct
Hence, treat the ks (mixture indices) as unknown, form Q = E[log p(x, k|Θ)], and differentiate with respect to the model parameters
→ equations for µ_k, Σ_k and c_k to maximize Q...
GMM EM update equations

Solving for the parameters that maximize Q:
  µ_k = Σ_n p(k|x_n, Θ) x_n / Σ_n p(k|x_n, Θ)
  Σ_k = Σ_n p(k|x_n, Θ) (x_n − µ_k)(x_n − µ_k)^T / Σ_n p(k|x_n, Θ)
  c_k = (1/N) Σ_n p(k|x_n, Θ)
Each involves p(k|x_n, Θ), the "fuzzy membership" of x_n in Gaussian k
Each parameter is just a sample average, weighted by fuzzy membership
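The update equations above can be sketched directly for a 1-D, 2-component mixture; the synthetic data and initialization below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1-D data from two Gaussians (30% near -2, 70% near 3)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# Arbitrary (but distinct) initialization of means, variances, weights
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
c = np.array([0.5, 0.5])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: fuzzy memberships p(k|x_n, Theta)
    joint = c * gauss(x[:, None], mu, var)          # shape (N, K)
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M-step: membership-weighted sample averages, per the update equations
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    c = nk / len(x)

print(np.sort(mu), np.sort(c))   # means near -2 and 3; weights near 0.3 / 0.7
```

Each iteration provably does not decrease the data likelihood, but as the slide notes the result is only a local optimum: a bad initialization (e.g. both means equal) can leave the components stuck on top of each other.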
GMM examples

Vowel data fit with different mixture counts:
[Figure: fits with 1 Gaussian, log p(x) = −1911; 2 Gaussians, −1864; 3 Gaussians, −1849; 4 Gaussians, −1840]
example...
Methodological remarks: training and test data

A rich model can learn every training example (overtraining)
[Figure: error rate vs. training time or parameters - training error keeps falling while test error turns back up (overfitting)]
But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a cross-validation set to decide when to stop training
For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
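The overtraining effect is easy to reproduce. In this sketch, polynomial regression stands in for "a rich model" and the synthetic data are invented; as the degree grows, training error keeps shrinking while held-out error does not follow:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic regression data: a smooth function plus noise
def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_trn, y_trn = make_data(15)     # small training set
x_tst, y_tst = make_data(200)    # held-out test set

def fit_error(degree):
    """Mean squared error on training and test data for a degree-d polynomial."""
    coeffs = np.polyfit(x_trn, y_trn, degree)
    trn = np.mean((np.polyval(coeffs, x_trn) - y_trn) ** 2)
    tst = np.mean((np.polyval(coeffs, x_tst) - y_tst) ** 2)
    return trn, tst

for d in (1, 3, 12):
    print(d, fit_error(d))   # training error shrinks with degree
```

The degree-12 fit nearly memorizes the 15 training points, so its training error is tiny while its test error reflects the noise it has learned; monitoring the held-out error (cross-validation) is what tells you when to stop.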
Model complexity

More model parameters → better fit
- more Gaussian mixture components
- more MLP hidden units
More training data → can use larger models
For best generalization (no overfitting), there will be some optimal model size
[Figure: word error rate (WER%) for PLP12N-8k nets vs. hidden-layer size (500–4000 units) and training set size (9.25–74 hours); at constant training time there is an optimal parameter/data ratio]
Outline

1. Music Content Analysis
2. Pattern Classification
3. The Statistical Approach
4. Distribution Models
5. Singing Detection
   - motivation
   - features
   - classifiers
5  Singing Detection (Berenzweig et al. 2001)

Can we automatically detect when singing is present?
[Figure: spectrogram of aimee.wav (0–28 s) hand-labelled into alternating "mus" and "vox" segments]
- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?
Singing detection: requirements

Labelled training examples
- 60 x 15 sec. radio excerpts
- hand-mark the sung phrases
[Figure: spectrograms of four training excerpts (trn/mus/3, 8, 19, 28) with hand-labelled vox segments]
Labelled test data
- several complete tracks from CDs, hand-labelled
Feature choice
- Mel-frequency cepstral coefficients (MFCCs): popular for speech; maybe for sung voice too?
- separation of voices?
- temporal dimension
Classifier choice
- MLP neural net
- GMMs for singing / music
- SVM?
MLP neural net

Directly estimate p(singing|x)
[Diagram: music → MFCC calculation (C0, C1, ..., C12) → MLP → singing / not singing]
- net has 26 inputs (MFCCs + Δs), 15 hidden units, 2 outputs (26:15:2)
How many hidden units?
- depends on data amount, boundary complexity
Feature context window?
- useful in speech
Delta features?
- useful in speech
Training parameters...
MLP results

Raw net outputs on a CD track, Aimee Mann: "Build That Wall" (frame accuracy 74.1%):
[Figure: spectrogram with hand-labelled vox, and raw 26:15:1 netlab p(singing) on 16 ms frames (FA = 74.1% @ threshold 0.5)]
Smoothed for continuity: best FA = 90.5%
[Figure: smoothed p(singing) over the same 30 s]
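The jump from 74.1% to 90.5% comes entirely from smoothing the frame-level outputs before thresholding. A sketch of that post-processing step on synthetic posteriors (the "net outputs" below are made-up noisy values, not real MLP outputs):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic frame labels (music / voice / music) and noisy per-frame p(singing)
truth = np.repeat([0, 1, 0], 100)
post = 0.5 * truth + 0.25 + rng.normal(0, 0.15, truth.size)

# Average with a normalized Hann window before thresholding, as in the slides
win = np.hanning(61)
win /= win.sum()
smoothed = np.convolve(post, win, mode="same")   # weighted moving average

raw_acc = np.mean((post > 0.5) == truth)
smooth_acc = np.mean((smoothed > 0.5) == truth)
print(raw_acc, smooth_acc)
```

Smoothing trades temporal resolution for accuracy: isolated noisy frames are voted down by their neighbors, at the cost of blurring the exact segment boundaries.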
GMM system

Separate models for p(x|singing) and p(x|no singing)
- combined via a likelihood ratio test:
  log [ p(x|singing) / p(x|not singing) ] >? threshold
[Diagram: music → MFCC calculation (C0 ... C12) → GMM1 p(x|singing) and GMM2 p(x|no singing) → log likelihood ratio test → singing?]
How many Gaussians for each?
- say 20; depends on data & complexity
What kind of covariance?
- diagonal (spherical?)
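A minimal sketch of the likelihood-ratio decision: single 1-D Gaussians stand in here for the 20-mixture GMMs of the slides, and all parameters are invented for illustration:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log pdf of a 1-D Gaussian (stand-in for a full GMM log likelihood)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def llr(x, sing=(1.0, 0.8), nosing=(-1.0, 1.2)):
    """Log likelihood ratio: log p(x|singing) - log p(x|not singing)."""
    return log_gauss(x, *sing) - log_gauss(x, *nosing)

# Score a few frame features and threshold at 0 (i.e. equal priors)
frames = np.array([1.2, -0.9, 0.1, 2.0])
decisions = llr(frames) > 0.0
print(decisions)
```

Thresholding at 0 assumes equal priors; moving the threshold (as the slides do with −0.8 on the next page) trades false alarms against misses without retraining either model.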
GMM results

Raw and smoothed results (best FA = 84.9%):
[Figure: Aimee Mann "Build That Wall"; raw 26-d 20-mixture GMM log likelihood ratio on 16 ms frames (FA = 65.8% @ threshold 0); smoothed by a 61-point (1 sec) Hann window (FA = 84.9% @ threshold −0.8)]
MLP has the advantage of discriminant training
Each GMM trains only on its data subset
→ faster to train? (2 x 10 min vs. 20 min)
Summary

Music content analysis: pattern classification
Basic machine learning methods: neural nets, GMMs
Singing detection: a classic application, but... what about the time dimension?
References

A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart, D.G. Stork (2001), Pattern Classification, 2nd ed., Wiley.

E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/scheis97-mussp.pdf