JHU CLSP Summer School
Pattern Recognition Applied to Music Signals

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York
July 1st, 2003

Dan Ellis, Pattern Recognition, 2003-07-01
1 Music Content Analysis

Music contains information at many levels
- what is it?
We'd like to get this information out automatically
- fine-level transcription of events
- broad-level classification of pieces
Information extraction can be framed as pattern classification / recognition, or machine learning
- build systems based on (labeled) training data
Music analysis

What information can we get from music?

[Figure: spectrogram of a music excerpt, Frequency vs. Time]

Score recovery
- extract the performance
Instrument identification
Ensemble performance
- gestalts: chords, tone colors
Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification
Outline

1 Music Content Analysis
2 Classification and Features
- classification
- spectrograms
- cepstra
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
2 Classification and Features

Classification means: finding categorical (discrete) labels for real-world (continuous) observations

[Figure: spectrogram (f/Hz vs. time/s) and a scatter plot of vowel tokens "ay" and "ao" in the formant plane (F1/Hz vs. F2/Hz)]

Problems
- parameter tuning
- feature overlap
Classification system parts

Sensor -> (signal) -> Pre-processing / segmentation -> (segment) -> Feature extraction -> (feature vector) -> Classification -> (class) -> Post-processing
- pre-processing / segmentation: e.g. STFT, locate vowels
- feature extraction: e.g. formant extraction
- post-processing: e.g. context constraints, costs/risk

Right features are critical
- place an upper bound on classifier performance
- should make important aspects visible
- should be invariant under irrelevant modifications
The Spectrogram

Short-time Fourier transform:
$$X[k,m] = \sum_{n=0}^{N-1} x[n]\, w[n - mL]\, \exp\!\left(-j\,\frac{2\pi k\,(n - mL)}{N}\right)$$

Plot STFT $X[k,m]$ as a grayscale image:

[Figure: spectrogram, freq / Hz vs. time / s, with intensity / dB scale; full view and zoomed detail]
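A minimal sketch of the short-time Fourier transform above, using only the Python standard library. The Hann window, the window length of 32 samples, the hop of 16 samples, and the test tone are all assumptions chosen for illustration; the slide does not specify them. The direct DFT sum is slow but mirrors the formula term by term.

```python
import cmath
import math

def hann(n_win):
    # Periodic Hann window w[n], n = 0 .. n_win-1 (an assumed window choice).
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_win) for n in range(n_win)]

def stft(x, n_win, hop):
    """Return a list of frames; each frame holds bins k = 0 .. n_win/2.

    Frame m covers samples m*hop .. m*hop + n_win - 1, and the phase is
    measured relative to the frame start, as in exp(-j 2 pi k (n - mL) / N).
    """
    w = hann(n_win)
    frames = []
    for start in range(0, len(x) - n_win + 1, hop):
        seg = [x[start + n] * w[n] for n in range(n_win)]
        spec = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_win)
                for n in range(n_win))
            for k in range(n_win // 2 + 1)
        ]
        frames.append(spec)
    return frames

# Illustrative input: a cosine that falls exactly on bin 4 of a 32-point DFT.
N_WIN, HOP = 32, 16
x = [math.cos(2 * math.pi * 4 * n / N_WIN) for n in range(128)]
frames = stft(x, N_WIN, HOP)

# The spectrogram image is the log magnitude of each bin, in dB.
spectrogram_db = [[20 * math.log10(abs(v) + 1e-12) for v in f] for f in frames]
peak_bins = [max(range(len(f)), key=lambda k: abs(f[k])) for f in frames]
```

Each row of `spectrogram_db` is one time frame; plotting it with frequency on the vertical axis gives the grayscale image described on the slide.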
Cepstra

Spectrograms are good for visualization; the cepstrum is preferred for classification
- DCT of the log-magnitude STFT: $c_k = \mathrm{idft}(\log |X[k,m]|)$
Cepstra capture coarse information in fewer dimensions with less correlation:

[Figure: auditory spectrum vs. cepstral coefficients, each shown as features over frames, covariance matrix, and an example joint distribution of two dimensions]
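As a rough illustration of the cepstrum definition above, the sketch below computes a real cepstrum for one frame as the inverse DFT of the log magnitude spectrum. It is not the MFCC front end used later in the talk: there is no mel warping or auditory filterbank here, and the `floor` guard against log(0) is an added assumption.

```python
import cmath
import math

def dft(x, inverse=False):
    # Direct O(N^2) DFT; the inverse includes the 1/N scaling.
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame, floor=1e-12):
    """c = idft(log |X|) for one frame of samples."""
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    # log|X| is real and even for a real frame, so the result is real.
    return [v.real for v in dft(log_mag, inverse=True)]

# Illustrative frame (hypothetical values).
frame = [1.0, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
cep = real_cepstrum(frame)
# Scaling the signal only changes c[0], the overall level term.
cep_scaled = real_cepstrum([3.0 * v for v in frame])
```

Keeping only the first few coefficients retains the coarse spectral shape, which is the "fewer dimensions" property the slide refers to.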
Outline

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
- Priors and posteriors
- Bayesian classifier
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
3 Statistical Pattern Recognition

Observations are random variables whose distribution depends on the class:
- class $\omega_i$: hidden, discrete
- observation $x$: continuous
Source distributions $p(x|\omega_i)$
- reflect variability in feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

[Figure: class-conditional densities $p(x|\omega_i)$ for classes $\omega_1 \ldots \omega_4$ plotted against $x$, and the posterior $\Pr(\omega_i|x)$ inferred from an observation]
Priors and posteriors

Bayesian inference can be interpreted as updating prior beliefs with new information, $x$.
Bayes' rule:
$$\Pr(\omega_i|x) = \frac{p(x|\omega_i)\,\Pr(\omega_i)}{\sum_j p(x|\omega_j)\,\Pr(\omega_j)}$$
- $\Pr(\omega_i|x)$: posterior probability
- $p(x|\omega_i)$: likelihood
- $\Pr(\omega_i)$: prior probability
- $\sum_j p(x|\omega_j)\,\Pr(\omega_j) = p(x)$: evidence
Posterior is prior scaled by likelihood and normalized by evidence (so the posteriors sum to 1)
Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal
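A small worked example of Bayes' rule for two classes may help here. The prior and likelihood values are made up for illustration; only the arithmetic follows the slide.

```python
def posteriors(priors, likelihoods):
    """Pr(w_i | x) = p(x | w_i) Pr(w_i) / sum_j p(x | w_j) Pr(w_j)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)          # p(x)
    return [j / evidence for j in joint]

# Hypothetical numbers: class 0 is more common a priori,
# but the observation x is three times more likely under class 1.
priors = [0.7, 0.3]
likelihoods = [0.2, 0.6]           # p(x | w_0), p(x | w_1)
post = posteriors(priors, likelihoods)

# Dropping the priors is the same as assuming they are equal:
post_equal = posteriors([0.5, 0.5], likelihoods)
```

With these numbers the joint terms are 0.14 and 0.18, so the evidence is 0.32 and class 1 ends up slightly more probable despite its smaller prior. With equal priors the posteriors are simply the likelihoods rescaled to sum to 1.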
Bayesian (MAP) classifier

Optimal classifier is
$$\hat{\omega} = \operatorname*{argmax}_{\omega_i} \Pr(\omega_i|x)$$
but we don't know $\Pr(\omega_i|x)$
Can model conditional distributions $p(x|\omega_i)$, then use Bayes' rule to find the MAP class
- labeled training examples $\{x_n, \omega_{x_n}\}$
- sort according to class
- estimate conditional pdf for class $\omega_1$: $p(x|\omega_1)$ (and likewise for the other classes)
Or, can model $\Pr(\omega_i|x)$ directly, e.g. train a neural net to map from inputs $x$ to a set of outputs $\Pr(\omega_i|x)$
- discriminative model
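The generative recipe above (sort by class, estimate each conditional pdf, pick the MAP class) can be sketched as follows. This is a minimal sketch under stated assumptions: features are scalar, each class is modeled with a single Gaussian, priors are taken from class frequencies, and the variance floor and the toy data are illustrative additions.

```python
import math
from collections import defaultdict

def fit_class_gaussians(examples, var_floor=1e-6):
    """examples: list of (x, label) pairs with scalar x.

    Returns {label: (mean, variance, prior)}.
    """
    by_class = defaultdict(list)
    for x, label in examples:          # sort according to class
        by_class[label].append(x)
    models = {}
    for label, xs in by_class.items():
        mean = sum(xs) / len(xs)
        var = max(sum((x - mean) ** 2 for x in xs) / len(xs), var_floor)
        models[label] = (mean, var, len(xs) / len(examples))
    return models

def map_classify(x, models):
    """Return the label maximizing log p(x | w) + log Pr(w)."""
    def log_posterior_unnormalized(label):
        mean, var, prior = models[label]
        log_lik = -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var
        return log_lik + math.log(prior)
    return max(models, key=log_posterior_unnormalized)

# Toy labeled training data (hypothetical).
train = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (3.0, "b"), (3.3, "b"), (2.7, "b")]
models = fit_class_gaussians(train)
```

The evidence p(x) is the same for every class, so it can be left out of the argmax.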
Outline

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
- Gaussians
- Gaussian mixtures
- Multi-layer perceptrons (MLPs)
- Training and test data
5 Singing Detection
4 Gaussian Mixtures and Neural Nets

Gaussians as parametric distribution models:
$$p(x|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right)$$
Described by $d$-dimensional mean vector $\mu_i$ and $d \times d$ covariance matrix $\Sigma_i$

[Figure: 2-D Gaussian density shown as a surface and as contours]

Classify by maximizing log likelihood, i.e.
$$\operatorname*{argmax}_{\omega_i} \left[ -\tfrac{1}{2}\,(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \tfrac{1}{2}\log|\Sigma_i| + \log \Pr(\omega_i) \right]$$
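To make the log-likelihood discriminant concrete, here is a sketch for the 2-D case with a full covariance matrix, written out by hand so that it needs no linear algebra library. The class parameters in the example are hypothetical.

```python
import math

def gaussian_discriminant_2d(x, mean, cov, prior):
    """-1/2 (x-mu)^T inv(Sigma) (x-mu) - 1/2 log|Sigma| + log Pr(w).

    cov is ((a, b), (c, d)); the constant -(d/2) log(2 pi) is dropped
    because it is the same for every class.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # inv(Sigma) = (1/det) * [[d, -b], [-c, a]]
    maha = (dx0 * (d * dx0 - b * dx1) + dx1 * (-c * dx0 + a * dx1)) / det
    return -0.5 * maha - 0.5 * math.log(det) + math.log(prior)

def classify(x, classes):
    """classes: {label: (mean, cov, prior)}; returns the argmax label."""
    return max(classes, key=lambda lab: gaussian_discriminant_2d(x, *classes[lab]))

# Two hypothetical classes with equal priors.
identity = ((1.0, 0.0), (0.0, 1.0))
classes = {
    "A": ((0.0, 0.0), identity, 0.5),
    "B": ((3.0, 3.0), identity, 0.5),
}
label = classify((0.5, 0.2), classes)
```

With equal covariances and priors this reduces to picking the nearest mean; unequal covariances bend the decision boundary into a quadratic curve.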
Gaussian Mixture models (GMMs)

A weighted sum of Gaussians can approximate any PDF, given enough components:
$$p(x) \approx \sum_k c_k\, p(x|m_k)$$
with weights $c_k$ and Gaussians $p(x|m_k)$
- each observation from random single Gaussian?

[Figure: original data, the fitted Gaussian components, and the resulting surface]

Find $c_k$ and $m_k$ parameters via EM
- easy if we knew which $m_k$ generated each $x$
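The EM idea can be sketched for a 1-D mixture: since we do not know which component generated each point, the E-step computes soft assignments (responsibilities), and the M-step re-estimates weights, means, and variances from them. This is a minimal sketch, not the training code behind the slides; the data, starting values, iteration count, and variance floor are all assumptions.

```python
import math

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    n_comp = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component k for each point x.
        resp = []
        for x in data:
            joint = [weights[k] * gauss_pdf(x, means[k], variances[k])
                     for k in range(n_comp)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted re-estimates of c_k, mean, and variance.
        for k in range(n_comp):
            n_k = sum(r[k] for r in resp)
            weights[k] = n_k / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            variances[k] = max(var, var_floor)
    return weights, means, variances

# Toy data with two clusters (hypothetical values).
data = [-0.3, -0.1, 0.0, 0.1, 0.3, 4.7, 4.9, 5.0, 5.1, 5.3]
weights, means, variances = em_gmm_1d(
    data, means=[1.0, 4.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
```

If the generating component of each point were known, the responsibilities would be 0 or 1 and the M-step alone would give the answer, which is the "easy if we knew" case on the slide.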
GMM examples

Vowel data fit with different mixture counts:

[Figure: four panels showing the vowel data fit with 1, 2, 3 and 4 Gaussians; each panel is titled with the mixture count and the resulting log p(x)]
Neural networks

Don't model distributions $p(x|\omega_i)$; instead, model posteriors $\Pr(\omega_i|x)$
Sums over nonlinear functions of sums -> large range of decision surfaces
e.g. Multi-layer perceptron (MLP) with 1 hidden layer:
$$y_k = F\Big[\sum_j w_{jk}\, F\Big[\sum_i w_{ij}\, x_i\Big]\Big]$$

[Figure: MLP diagram with an input layer $x$, weights $w_{ij}$, a hidden layer $h$, weights $w_{jk}$, and an output layer $y$; each unit sums its inputs and applies $F[\cdot]$]

Train the weights $w_{ij}$ with back-propagation
Neural net example

2 input units (normalized F1, F2)
5 hidden units, 3 output units ("U", "O", "A")

[Figure: network diagram with inputs F1, F2 and outputs "A", "O", "U"]

Sigmoid nonlinearity:
$$F[x] = \frac{1}{1 + e^{-x}}, \qquad \frac{dF}{dx} = F\,(1 - F)$$

[Figure: sigm(x) and d sigm(x)/dx plotted for x from -5 to 5]
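A forward pass through a 2:5:3 network with the sigmoid nonlinearity can be sketched as below. The weight values are arbitrary placeholders rather than trained parameters, and the bias terms are an assumption added for generality.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # dF/dx = F (1 - F), the identity used by back-propagation.
    f = sigmoid(x)
    return f * (1.0 - f)

def layer(inputs, weights, biases):
    """weights[j][i] connects input i to unit j; returns F[sum + bias]."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    h = layer(x, w_hidden, b_hidden)     # hidden activations
    return layer(h, w_out, b_out)        # output activations y_k

# Placeholder weights for a 2:5:3 net (hypothetical, untrained).
w_hidden = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2], [0.4, 0.4], [-0.2, -0.6]]
b_hidden = [0.0, 0.1, -0.1, 0.0, 0.2]
w_out = [[0.3, -0.5, 0.2, 0.1, -0.4],
         [-0.6, 0.4, 0.3, -0.2, 0.5],
         [0.2, 0.1, -0.3, 0.6, -0.1]]
b_out = [0.0, 0.0, 0.0]

y = mlp_forward([0.4, -1.2], w_hidden, b_hidden, w_out, b_out)
```

Each output lies between 0 and 1. Treating the three outputs as class posteriors for "U", "O", "A" relies on training with appropriate targets, which this sketch does not cover.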
Neural net training

2:5:3 net: MS error by training epoch

[Figure: mean squared error vs. training epoch, and decision contours over the vowel data at two stages of training]

example...
Aside: Training and test data

A rich model can learn every training example (overtraining)

[Figure: error rate vs. amount of training or number of parameters, with curves for training data and test data and the overfitting region marked]

But, goal is to classify new, unseen data, i.e. generalization
- sometimes use "cross validation" set to decide when to stop training
For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
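One common way to use a held-out validation set to decide when to stop is sketched below: training halts once the validation error has failed to improve for a set number of epochs. The `patience` parameter and the list of error values are illustrative assumptions; in practice the errors would come from evaluating the model after each epoch.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the best validation error seen before stopping.

    val_errors stands in for per-epoch validation error; scanning stops once
    `patience` epochs pass without a new best.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break   # validation error has stopped improving
    return best_epoch

# Hypothetical validation errors: they fall, then rise as the net overfits.
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.30]
stop_at = early_stopping_epoch(val_errors, patience=3)
```

The validation set must stay separate from the final test set; otherwise the stopping decision amounts to training on test data indirectly.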
Outline

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
- Motivation
- Features
- Classifiers
5 Singing Detection (Berenzweig et al. 2001)

[Figure: spectrogram of aimee.wav (frequency in Hz vs. time), with hand-marked segments alternating mus / vox]

Can we automatically detect when singing is present?
- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?
Singing Detection: Requirements

[Figure: spectrograms (freq / kHz vs. time) of four training excerpts with hand-labeled vox regions]

Labeled training examples
- 6 x 5 sec. radio excerpts
- hand-mark sung phrases
Labeled test data
- several complete tracks from CDs, hand-labelled
Feature choice
- Mel-frequency Cepstral Coefficients (MFCCs) popular for speech; maybe sung voice too?
- separation of voices?
- temporal dimension?
Classifier choice
- MLP Neural Net
- GMMs for singing / music
- SVM?
GMM System

Separate models for $p(x|\text{sing})$, $p(x|\text{no sing})$
- combined via likelihood ratio test

music -> MFCC calculation -> (C0, C1, ..., C12) -> GMM1: $p(x|\text{"singing"})$ and GMM2: $p(x|\text{"no singing"})$ -> log l'hood ratio test $\log \dfrac{p(x|\text{"singing"})}{p(x|\text{"not"})}$ -> singing?

How many Gaussians for each?
- say 2; depends on data & complexity
What kind of covariance?
- diagonal (spherical?)
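The likelihood ratio test between two GMMs can be sketched per frame as follows, assuming diagonal covariances. The model parameters, feature dimension, and threshold of zero are placeholders for illustration; real models would be trained on the labeled MFCC frames.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    # Log density of a Gaussian with diagonal covariance.
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, weights, means, variances):
    """log sum_k c_k p(x | m_k), computed with the log-sum-exp trick."""
    terms = [
        math.log(c) + diag_gauss_logpdf(x, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def log_likelihood_ratio(x, gmm_sing, gmm_nosing):
    """log p(x | singing) - log p(x | no singing) for one feature frame."""
    return gmm_loglik(x, *gmm_sing) - gmm_loglik(x, *gmm_nosing)

# Hypothetical 2-D models: (weights, means, variances).
gmm_sing = ([0.5, 0.5], [[1.0, 1.0], [2.0, 2.0]], [[1.0, 1.0], [1.0, 1.0]])
gmm_nosing = ([1.0], [[-1.0, -1.0]], [[1.0, 1.0]])

threshold = 0.0   # assumed operating point
is_singing = log_likelihood_ratio([1.5, 1.5], gmm_sing, gmm_nosing) > threshold
```

Moving the threshold trades missed singing frames against false alarms, and a nonzero threshold can also absorb the class priors.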
GMM Results

Raw and smoothed results (Best FA=84.9%):

[Figure: Aimee Mann, "Build That Wall" + handvox: spectrogram (freq / kHz), then log(lhood) vs. time / sec in two panels titled "26d 2mix GMM on 6ms frames (FA=65.8% @ thr=)" and "26d 2mix GMM smoothed by 6 pt ( sec) hann (FA=84.9% @ thr=-.8)"]

MLP has advantage of discriminant training
Each GMM trains only on data subset -> faster to train? (2 x min vs. 2 min)
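Smoothing the frame-level scores with a Hann window before thresholding, as in the results above, might look like the sketch below. The window length, the score values, and the threshold are illustrative assumptions, not the settings used for the reported numbers.

```python
import math

def hann_window(n_points):
    # Symmetric Hann window without zero end points, normalized to sum to 1.
    w = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (n_points + 1))
         for i in range(n_points)]
    total = sum(w)
    return [v / total for v in w]

def smooth(scores, n_points):
    """Weighted moving average of per-frame scores, centered on each frame.

    Near the edges the window is truncated and renormalized.
    """
    w = hann_window(n_points)
    half = n_points // 2
    out = []
    for t in range(len(scores)):
        acc, norm = 0.0, 0.0
        for i, wi in enumerate(w):
            idx = t + i - half
            if 0 <= idx < len(scores):
                acc += wi * scores[idx]
                norm += wi
        out.append(acc / norm)
    return out

# Hypothetical log-likelihood ratios: mostly "no singing" with one stray spike.
scores = [-1.0] * 10 + [3.0] + [-1.0] * 10
smoothed = smooth(scores, n_points=11)

threshold = 0.0
raw_decisions = [s > threshold for s in scores]
smoothed_decisions = [s > threshold for s in smoothed]
```

The isolated spike flips the raw decision for one frame but not the smoothed one, which is the kind of continuity constraint that helps frame accuracy when singing occurs in phrases lasting many frames.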
MLP Neural Net

Directly estimate $p(\text{singing}|x)$

music -> MFCC calculation -> (C0, C1, ..., C12) -> MLP -> "singing" / "not singing"
- net has 26 inputs (+ ), 5 HUs, 2 o/ps (26:5:2)

How many hidden units?
- depends on data amount, boundary complexity
Feature context window?
- useful in speech
Delta features?
- useful in speech
Training parameters...
MLP Results

Raw net outputs on a CD track (FA 74.%):

[Figure: Aimee Mann, "Build That Wall" + handvox: spectrogram (freq / kHz), then p(singing) vs. time / sec in a panel titled "26:5: netlab on 6ms frames (FA=74.% @ thr=.5)"]

Smoothed for continuity: best FA = 9.5%

[Figure: smoothed p(singing) vs. time / sec]
Artist Classification (Berenzweig et al. 2002)

Artist label as available stand-in for genre
Train MLP to classify frames among 21 artists
Using only "voice" segments:
- song-level accuracy improves 56.7% -> 64.9%

[Figure: per-frame artist classifications vs. time / sec, with true voice segments marked, for two tracks: "Track 7 - Aimee Mann (dynvox=aimee, unseg=aimee)" and "Track 4 - Arto Lindsay (dynvox=arto, unseg=oval)". Rows, one per artist: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, Dj Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada]
Summary

Music content analysis:
- pattern classification
Basic machine learning methods:
- Neural Nets, GMMs
Singing detection:
- classic application
but... the time dimension?
References

A.L. Berenzweig and D.P.W. Ellis (2001) "Locating Singing Voice Segments within Music Signals", Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa-singing.pdf

R.O. Duda, P. Hart, D. Stork (2001) "Pattern Classification", 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997) "Construction and evaluation of a robust multifeature speech/music discriminator", Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e682/papers/scheis97-mussp.pdf