Pattern Recognition Applied to Music Signals

JHU CLSP Summer School
Pattern Recognition Applied to Music Signals

1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/muscontent/
Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York
July 1st, 2003

1 Music Content Analysis
Music contains information at many levels - what is it?
We'd like to get this information out automatically:
- fine-level transcription of events
- broad-level classification of pieces
Information extraction can be framed as pattern classification / recognition, or machine learning:
- build systems based on (labeled) training data

Music analysis
What information can we get from music?
[Figure: spectrogram, frequency vs. time]
Score recovery - extract the performance
Instrument identification
Ensemble performance - "gestalts": chords, tone colors
Broader timescales
- phrasing & musical structure
- artist / genre clustering and classification

Outline
1 Music Content Analysis
2 Classification and Features
  - classification
  - spectrograms
  - cepstra
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

2 Classification and Features
Classification means: finding categorical (discrete) labels for real-world (continuous) observations.
[Figure: spectrogram (frequency vs. time) and vowel tokens plotted in F1/F2 space, classes "ay" and "ao"]
Problems:
- parameter tuning
- feature overlap

Classification system parts
[Diagram] Sensor → (signal) → Pre-processing / segmentation (e.g. STFT, locate vowels) → (segment) → Feature extraction (e.g. formant extraction) → (feature vector) → Classification → (class) → Post-processing (context constraints, costs/risk)
The right features are critical:
- they place an upper bound on classifier performance
- should make important aspects visible
- invariance under irrelevant modifications

The Spectrogram
Short-time Fourier transform:
  X[k, m] = Σ_{n=0}^{N-1} x[n] · w[n - mL] · exp( -j 2πk (n - mL) / N )
Plot the STFT as a grayscale image:
[Figure: waveform detail and spectrogram, frequency (Hz) vs. time (s), intensity in dB]
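
A minimal numerical sketch of the STFT above; the window length N = 1024, hop L = 256, and Hann window are illustrative assumptions, not values from the slide:

import numpy as np

def stft(x, N=1024, L=256):
    # X[k, m] = sum_n x[n] w[n - mL] exp(-j 2 pi k (n - mL) / N), with a Hann window w
    w = np.hanning(N)
    n_frames = (len(x) - N) // L + 1
    frames = np.stack([x[m * L : m * L + N] * w for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)           # one row of X[k, m] per frame m

# Spectrogram image = log magnitude, e.g.:
# S_dB = 20 * np.log10(np.abs(stft(x)) + 1e-10)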

Cepstra
Spectrograms are good for visualization; the cepstrum is preferred for classification.
- DCT of the log STFT magnitude: c_k = IDFT( log |X[k, m]| )
Cepstra capture coarse spectral information in fewer dimensions, with less correlation between dimensions.
[Figure: auditory spectrum features vs. cepstral coefficients over frames, their covariance matrices, and an example joint distribution of dimensions (1, 5)]
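
A small sketch of the cepstral computation c_k = IDFT(log |X[k, m]|). Real music/speech systems usually insert a Mel filterbank first to get MFCCs, which is omitted here, and the 13-coefficient cutoff is an assumption:

import numpy as np

def cepstra(X, n_ceps=13):
    # X: complex STFT matrix, one frame per row
    log_mag = np.log(np.abs(X) + 1e-10)          # log |X[k, m]|
    c = np.fft.irfft(log_mag, axis=1)            # inverse DFT along the frequency axis
    return c[:, :n_ceps]                         # keep only the coarse, low-order terms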

Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
  - Priors and posteriors
  - Bayesian classifier
4 Gaussian Mixtures and Neural Nets
5 Singing Detection

3 Statistical Pattern Recognition
Observations are random variables whose distribution depends on the class:
- class ω_i (hidden, discrete) → observation x (continuous)
Source distributions p(x|ω_i):
- reflect variability in the feature
- reflect noise in the observation
- generally have to be estimated from data (rather than known in advance)
[Figure: class-conditional likelihoods p(x|ω_i) and posteriors Pr(ω_i|x) for classes ω_1 ... ω_4 along x]

Priors and posteriors
Bayesian inference can be interpreted as updating prior beliefs with new information, x.
Bayes' rule:
  Pr(ω_i | x) = p(x | ω_i) · Pr(ω_i) / Σ_j p(x | ω_j) Pr(ω_j)
where Pr(ω_i) is the prior probability, p(x | ω_i) the likelihood, Pr(ω_i | x) the posterior probability, and the denominator Σ_j p(x | ω_j) Pr(ω_j) = p(x) is the evidence.
The posterior is the prior scaled by the likelihood and normalized by the evidence (so Σ_i posteriors = 1).
Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal
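
A tiny numerical illustration of Bayes' rule, with made-up numbers for two classes:

import numpy as np

priors      = np.array([0.7, 0.3])                 # Pr(ω_1), Pr(ω_2)
likelihoods = np.array([0.05, 0.20])               # p(x|ω_1), p(x|ω_2) at the observed x

evidence   = np.sum(likelihoods * priors)          # p(x) = Σ_j p(x|ω_j) Pr(ω_j)
posteriors = likelihoods * priors / evidence       # Pr(ω_i|x)
print(posteriors, posteriors.sum())                # [0.368 0.632], sums to 1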

Bayesian (MAP) classifier
The optimal classifier is
  ω̂ = argmax_{ω_i} Pr(ω_i | x)
but we don't know Pr(ω_i | x).
Can model the conditional distributions p(x | ω_i), then use Bayes' rule to find the MAP class:
- labeled training examples {x_n, ω_{x_n}} → sort according to class → estimate the conditional pdf p(x | ω_1) for class ω_1, and so on
Or, can model Pr(ω_i | x) directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ω_i)
- a discriminative model

Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
  - Gaussians
  - Gaussian mixtures
  - Multi-layer perceptrons (MLPs)
  - Training and test data
5 Singing Detection

4 Gaussian Mixtures and Neural Nets
Gaussians as parametric distribution models:
  p(x | ω_i) = 1 / ( (2π)^{d/2} |Σ_i|^{1/2} ) · exp( -1/2 (x - µ_i)^T Σ_i^{-1} (x - µ_i) )
Described by a d-dimensional mean vector µ_i and a d x d covariance matrix Σ_i.
[Figure: surface plot of a 2-D Gaussian density]
Classify by maximizing the log likelihood, i.e.
  ω̂ = argmax_{ω_i} [ -1/2 (x - µ_i)^T Σ_i^{-1} (x - µ_i) - 1/2 log|Σ_i| + log Pr(ω_i) ]
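
A hedged sketch of the Gaussian classifier above: score each class by its log likelihood plus log prior and take the argmax (parameter names are illustrative):

import numpy as np

def gaussian_map_classify(x, means, covs, priors):
    # means[i]: (d,) mean µ_i; covs[i]: (d, d) covariance Σ_i; priors[i]: Pr(ω_i)
    scores = []
    for mu, Sigma, prior in zip(means, covs, priors):
        diff = x - mu
        maha = diff @ np.linalg.solve(Sigma, diff)      # (x - µ_i)^T Σ_i^{-1} (x - µ_i)
        logdet = np.linalg.slogdet(Sigma)[1]            # log|Σ_i|
        scores.append(-0.5 * maha - 0.5 * logdet + np.log(prior))
    return int(np.argmax(scores))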

Gaussian Mixture Models (GMMs)
A weighted sum of Gaussians can fit any PDF:
  p(x) = Σ_k c_k · p(x | m_k)
with weights c_k and component Gaussians p(x | m_k)
- each observation comes from one randomly chosen Gaussian
[Figure: original data, fitted Gaussian components, and the resulting mixture surface]
Find the c_k and m_k parameters via EM
- easy if we knew which m_k generated each x
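
One possible way to estimate the c_k and m_k parameters via EM, here using scikit-learn's GaussianMixture as a stand-in for a hand-written EM loop; the data and component count are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.randn(500, 2)                      # placeholder 2-D feature vectors
gmm = GaussianMixture(n_components=3, covariance_type='full').fit(X)   # EM fit

print(gmm.weights_)       # mixture weights c_k
print(gmm.means_)         # component means m_k
print(gmm.score(X))       # mean log p(x) over the data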

GMM examples
Vowel data fit with different mixture counts:
[Figure: the same vowel data fit with 1, 2, 3, and 4 Gaussians; the training log-likelihood log p(x) increases with the number of components (e.g. -864 with 2 Gaussians, -849 with 3)]

Neural networks
Don't model distributions p(x | ω_i); instead, model the posteriors Pr(ω_i | x) directly.
Sums over nonlinear functions of sums → a large range of decision surfaces.
e.g. a multi-layer perceptron (MLP) with one hidden layer:
  y_k = F[ Σ_j w_jk · F[ Σ_i w_ij · x_i ] ]
[Figure: network diagram - input layer x_i, weights w_ij, hidden layer h_j, weights w_jk, output layer y_k; each node sums its inputs and applies F[·]]
Train the weights w_ij with back-propagation.
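
A minimal forward pass for the one-hidden-layer MLP above, with a sigmoid nonlinearity F; the weight shapes and the absence of bias terms are simplifying assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # F[x] = 1 / (1 + e^{-x})

def mlp_forward(x, W_ih, W_ho):
    # W_ih: (n_inputs, n_hidden) weights w_ij; W_ho: (n_hidden, n_outputs) weights w_jk
    h = sigmoid(x @ W_ih)                        # hidden activations h_j
    y = sigmoid(h @ W_ho)                        # outputs y_k = F[ Σ_j w_jk h_j ]
    return y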

Neural net example
2 input units (normalized F1, F2), 5 hidden units, 3 output units ("U", "O", "A").
Sigmoid nonlinearity:
  F[x] = 1 / (1 + e^{-x}),   dF/dx = F · (1 - F)
[Figure: network diagram F1, F2 → hidden units → outputs A, O, U; plot of sigm(x) and d sigm(x)/dx]

Neural net training
2:5:3 net: mean-squared error by training epoch.
[Figure: MS error falling over training epochs; decision-boundary contours after different numbers of iterations]

Aside: Training and test data
A rich model can learn every training example (overtraining).
[Figure: error rate vs. amount of training or number of parameters; training error keeps falling while test error turns upward (overfitting)]
But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a cross-validation set to decide when to stop training
For evaluation results to be meaningful:
- don't test with training data!
- don't train on test data (even indirectly...)
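
A sketch of the cross-validation early-stopping idea: keep training while held-out error improves, stop when it turns upward. train_epoch and error_rate are hypothetical stand-ins for whatever training loop and scoring function you use:

def train_with_early_stopping(model, train_set, val_set, patience=5):
    best_err, epochs_since_best = float('inf'), 0
    while epochs_since_best < patience:
        train_epoch(model, train_set)            # one pass over the training data
        err = error_rate(model, val_set)         # error on held-out data, never trained on
        if err < best_err:
            best_err, epochs_since_best = err, 0
        else:
            epochs_since_best += 1               # validation error stopped improving
    return model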

Outline
1 Music Content Analysis
2 Classification and Features
3 Statistical Pattern Recognition
4 Gaussian Mixtures and Neural Nets
5 Singing Detection
  - Motivation
  - Features
  - Classifiers

5 Singing Detection (Berenzweig et al. 2001)
Can we automatically detect when singing is present?
[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav with alternating "mus" / "vox" segment labels over time]
- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?

Singing Detection: Requirements
Labeled training examples
- 6 x 5 sec. radio excerpts
- hand-mark sung phrases
[Figure: spectrograms of training excerpts trn/mus/3, 8, 9, 28 with hand-labeled vox regions]
Labeled test data
- several complete tracks from CDs, hand-labelled
Feature choice
- Mel-frequency cepstral coefficients (MFCCs): popular for speech; maybe sung voice too?
- separation of voices? temporal dimension?
Classifier choice
- MLP neural net
- GMMs for singing / music
- SVM?

GMM System
Separate models for p(x|singing) and p(x|no singing)
- combined via a likelihood ratio test
[Diagram: music → MFCC calculation (C0, C1, ...) → GMM1 for p(x|singing) and GMM2 for p(x|no singing) → log likelihood ratio test: log p(x|singing) / p(x|not singing) → singing?]
How many Gaussians for each?
- say 2; depends on data & complexity
What kind of covariance?
- diagonal (spherical?)
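
A sketch of this two-GMM likelihood-ratio detector using scikit-learn; the mixture count, diagonal covariance, and zero threshold are illustrative choices, not necessarily the slide's exact settings:

import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(X_sing, X_nosing, n_mix=2):
    gmm_sing   = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(X_sing)
    gmm_nosing = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(X_nosing)
    return gmm_sing, gmm_nosing

def detect(gmm_sing, gmm_nosing, X, threshold=0.0):
    # per-frame log p(x|singing) - log p(x|not singing)
    llr = gmm_sing.score_samples(X) - gmm_nosing.score_samples(X)
    return llr > threshold, llr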

GMM Results
Raw and smoothed results (best frame accuracy FA = 84.9%):
[Figure: Aimee Mann, "Build That Wall", with hand-labeled vox; 26-d MFCC GMM log-likelihood ratio per frame (FA = 65.8%), and the same ratio smoothed by a ~1 sec Hann window (FA = 84.9%)]
The MLP has the advantage of discriminant training.
Each GMM trains only on its own data subset → faster to train?
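
The smoothing step can be a simple windowed average of the per-frame log-likelihood ratio before thresholding; a sketch, with the window length (roughly one second of frames) as an assumption:

import numpy as np

def smooth_llr(llr, win_len=61):
    w = np.hanning(win_len)
    w /= w.sum()                                  # normalized Hann window
    return np.convolve(llr, w, mode='same')       # local weighted average of the ratio

# decisions = smooth_llr(llr) > threshold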

MLP Neural Net
Directly estimate p(singing | x).
[Diagram: music → MFCC calculation (C0, C1, ...) → MLP → "singing" / "not singing" outputs]
- the net has 26 inputs, 5 HUs, 2 o/ps (26:5:2)
How many hidden units?
- depends on data amount, boundary complexity
Feature context window?
- useful in speech
Delta features?
- useful in speech
Training parameters...
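
A sketch of a discriminative detector in this spirit, using scikit-learn's MLPClassifier with one small hidden layer; the 5-unit hidden layer echoes the slide, while the data, labels, and other settings are placeholders:

import numpy as np
from sklearn.neural_network import MLPClassifier

# X: (n_frames, 26) MFCC(+delta) features; y: 1 = singing, 0 = not singing
X = np.random.randn(1000, 26)                    # placeholder features
y = (np.random.rand(1000) > 0.5).astype(int)     # placeholder labels

net = MLPClassifier(hidden_layer_sizes=(5,), activation='logistic', max_iter=500)
net.fit(X, y)
p_singing = net.predict_proba(X)[:, 1]           # per-frame p(singing | x)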

MLP Results
Raw net outputs on a CD track (FA ≈ 74%):
[Figure: Aimee Mann, "Build That Wall", with hand-labeled vox; per-frame p(singing) from the 26:5:2 net, threshold 0.5]
Smoothed for continuity: best FA ≈ 90%.
[Figure: smoothed p(singing) over the same track]

Artist Classification (Berenzweig et al. 2002)
Artist label as an available stand-in for genre.
Train an MLP to classify frames among 21 artists.
Using only the voice segments, song-level accuracy improves from 56.7% to 64.9%.
[Figure: per-frame artist posteriors over time for Track 7 - Aimee Mann (dynvox=aimee, unseg=aimee) and Track 4 - Arto Lindsay (dynvox=arto, unseg=oval), across the 21 artists: Michael Penn, The Roots, The Moles, Eric Matthews, Arto Lindsay, Oval, Jason Falkner, Built to Spill, Beck, XTC, Wilco, Aimee Mann, The Flaming Lips, Mouse on Mars, DJ Shadow, Richard Davies, Cornelius, Mercury Rev, Belle & Sebastian, Sugarplastic, Boards of Canada]

Summary
Music content analysis: pattern classification.
Basic machine learning methods: neural nets, GMMs.
Singing detection: a classic application, but... the time dimension?

References
A.L. Berenzweig and D.P.W. Ellis (2001), "Locating Singing Voice Segments within Music Signals," Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa-singing.pdf
R.O. Duda, P.E. Hart, and D.G. Stork (2001), Pattern Classification, 2nd ed., Wiley.
E. Scheirer and M. Slaney (1997), "Construction and evaluation of a robust multifeature speech/music discriminator," Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e682/papers/scheis97-mussp.pdf