CS229 Project: Musical Alignment Discovery

Woodley Packard
December 16, 2005

Introduction

Logical representations of musical data are widely available in varying forms (for instance, MIDI files are ubiquitous on the internet). Matching acoustic recordings of musical performances are also available from CD stores, online music stores, and other repositories. However, because performances vary in tempo and dynamics, the time-series alignment between the logical and acoustic data is usually not obvious. My project aims to discover that alignment automatically.

I chose to restrict my explorations to the music of Mozart, because I have access to a large, homogeneous library of both logical and acoustic data for that composer. One advantage of this restriction is that the set of musical instruments is limited to a few dozen. I take advantage of this by extracting a model of the sound of each relevant instrument (overtone coefficients).

[Figure: high-level outline of the system's architecture. The audio CD yields raw digital audio, which becomes the acoustic spectrogram; the NMA score yields note data, which (together with learned instrument parameters) becomes the predicted spectrogram; the two spectrograms are matched by Viterbi alignment, followed by smoothing to produce the results.]

Data Sets

The logical note data input to my project is an accurate textual representation of a physical printed page of musical score, extracted (by a different project) from the comprehensive Neue Mozart Ausgabe (a complete edition of the scores to Mozart's music). These data are handled by an external interpreter.

Input Note Data

By modifying the playback facilities of the interpreter, I preprocessed the music into a table of note data containing, for each note in the piece, the logical start time, the pitch, the volume, the instrument, the duration, and the measure number.

The acoustic data came from a commercial compact-disc recording of the symphony. I extracted a spectrogram (see image) from the digital samples using a Gaussian window of 2048 samples and a step of 512 samples. The audio was sampled at 44100 Hz.

[Figure: Acoustic Spectrogram]

Generative Model

Using the note data table, I constructed an estimate of a logical spectrogram by computing the frequency f_m for each note and contributing an exponentially decaying rectangle at each overtone of the fundamental. I used an equal-tempered scale, with the following equation for the fundamental frequency [1]:

    f_m = 440 \cdot 2^{(m - M_{440})/12}

where M_{440} is the MIDI encoding of middle A.
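To make the construction concrete, here is a minimal Python/NumPy sketch of the two steps just described: the equal-tempered fundamental and the overtone-rectangle predicted spectrogram. The constants, the note tuple layout, the uniform decay factor, and the function names are assumptions made for illustration rather than details taken from the project code.

```python
import numpy as np

# Illustrative constants matching the description above (assumptions, not the actual code).
SR = 44100          # sample rate (Hz)
HOP = 512           # spectrogram step (samples)
NFFT = 2048         # Gaussian window length (samples)
MIDI_A440 = 69      # MIDI encoding of the 440 Hz A
# The acoustic spectrogram itself could be computed with, e.g.,
# scipy.signal.stft(x, fs=SR, window=('gaussian', 256), nperseg=NFFT, noverlap=NFFT - HOP).

def fundamental_hz(midi_pitch):
    """Equal-tempered fundamental: f_m = 440 * 2^((m - M_440)/12)."""
    return 440.0 * 2.0 ** ((midi_pitch - MIDI_A440) / 12.0)

def predicted_spectrogram(notes, n_frames, n_bins=NFFT // 2 + 1,
                          n_overtones=5, decay=0.5):
    """Build a logical spectrogram from (start_sec, dur_sec, midi_pitch, volume) notes.

    Each note contributes an exponentially decaying rectangle at every
    overtone of its fundamental, as described in the text.
    """
    spec = np.zeros((n_bins, n_frames))
    frames_per_sec = SR / HOP
    hz_per_bin = SR / NFFT
    for start, dur, pitch, volume in notes:
        t0 = int(start * frames_per_sec)
        t1 = min(n_frames, int((start + dur) * frames_per_sec))
        f0 = fundamental_hz(pitch)
        for k in range(1, n_overtones + 1):
            b = int(round(k * f0 / hz_per_bin))
            if b < n_bins:
                spec[b, t0:t1] += volume * decay ** (k - 1)
    return spec
```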

[Figure: Left: Uniform Predicted Spectrogram; Right: With Learned Instrument Spectra]

For the initial alignment, I assumed the spectral properties of all instruments were identical, namely five exponentially decaying, evenly spaced overtones. Then, assuming that alignment was correct, I applied a linear regression to find the maximum-likelihood spectral characteristics of each instrument given the observed spectrogram. The parameters my system learned were the coefficients of the first five overtones for each instrument. These parameters also implicitly included an overall scaling factor accounting for the difference between the suggested dynamics (loudness) in the score and the average recorded dynamics for each instrument.

After estimating the instruments' spectral properties, I used them to build a superior predicted spectrogram from the logical note data. Then I realigned, producing new training data for the instrument property estimation, and so on. By repeating this process (similar to EM), my system was able to learn the properties of the instruments with higher accuracy. The parameters converged after three such iterations.

[Figure: Spectral overtone coefficients learned for several instruments]

I assumed the existence of a similarity metric f(x, y) comparing a logical spectrogram slice to an acoustic spectrogram slice. I tried several metrics for this purpose, including L2 distance, a Gaussian kernel, and a simple dot product (representing the cosine of the angle between the two spectra). The simple dot product performed noticeably better than the other candidates.
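As a rough illustration of these two ingredients, the sketch below fits the per-instrument overtone coefficients by ordinary least squares and computes a normalized dot-product (cosine) similarity between spectrogram slices. The design-matrix layout and the function names are assumptions of mine; the report does not specify how the regression was actually set up.

```python
import numpy as np

def fit_overtone_coefficients(design, observed):
    """Least-squares fit of the overtone coefficients.

    design:   (n_bins * n_frames, n_instruments * 5) matrix whose columns hold,
              for each (instrument, overtone) pair, the energy that a unit
              coefficient would contribute under the current alignment.
    observed: flattened acoustic spectrogram, shape (n_bins * n_frames,).
    Returns the coefficients as one row of 5 overtone weights per instrument.
    """
    coeffs, *_ = np.linalg.lstsq(design, observed, rcond=None)
    return coeffs.reshape(-1, 5)

def slice_similarity(logical_slice, acoustic_slice):
    """Normalized dot product (cosine) between two spectrogram slices."""
    num = float(np.dot(logical_slice, acoustic_slice))
    den = np.linalg.norm(logical_slice) * np.linalg.norm(acoustic_slice) + 1e-12
    return num / den
```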

Alignment

The literature contains many records of experiments of this type using hidden Markov models (HMMs), e.g. [2], [3], [4]. Alternative approaches have been studied with varying degrees of success (e.g. [5] investigates a discriminative learning approach), but most other models require supervised learning. By contrast, HMMs can learn to align time series in an unsupervised fashion. Treating the predicted logical spectrogram as a linear HMM with output likelihoods given by the similarity metric, I used the Viterbi algorithm to compute the highest-likelihood alignment between the logical spectrogram and the acoustic spectrogram (a code sketch of this decoding step appears below).

[Figure: Top: Typical Alignment; Bottom: A Bad Spot]

In the example pictured, there were roughly 7 times more spectral slices than logical slices. The aligned image shows the original acoustic spectrogram in blue, with the hypothesized alignment of the logical spectrogram overlaid in green. The red lines on the aligned image indicate transitions between logical spectrogram slices. As can be seen, the alignment found in the top case is reasonable.

One problem with HMM alignment is that there is no good way to specify a preferred number of times to stay in a given state. As a result, the alignment tends to be locally uneven even when it is a good fit globally. This can be seen in the lower segment above: on average only 1/7 of the slices are transitions (red), but in this particularly bad segment the variance is very high.

To address this problem, I built a confidence metric to discern which segments of the alignment were smooth and which were not. The confidence metric m(t) observes the deviation d(t) between the instantaneous tempo of the alignment and the local average tempo:

    d(t) = \frac{f(t+n) - f(t-n)}{2n} - f'(t), \qquad m(t) = \exp\left(-\frac{d(t)^2}{2\sigma^2}\right)

where f is the Viterbi alignment, n is a parameter controlling the breadth of the local average, and σ² is a normalizing constant that puts the metric in a useful range. Where the variance is high, the deviation is large and the confidence metric is near 0; where the variance is low, the confidence metric is near 1.
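Returning to the Viterbi decoding step described above, here is a minimal sketch of one way such a linear, left-to-right alignment HMM could be decoded. Using the log of the cosine similarity as the emission score and fixed stay/advance probabilities of 0.5 each are simplifications assumed for the example; the report does not give these details.

```python
import numpy as np

def viterbi_align(logical_spec, acoustic_spec,
                  stay_logp=np.log(0.5), advance_logp=np.log(0.5)):
    """Monotone Viterbi alignment of logical slices (states) to acoustic frames.

    logical_spec:  (n_bins, n_states) predicted spectrogram slices.
    acoustic_spec: (n_bins, n_frames) observed spectrogram frames.
    Assumes at least as many acoustic frames as logical slices.
    Returns an array mapping each acoustic frame to a logical slice index.
    """
    n_states = logical_spec.shape[1]
    n_frames = acoustic_spec.shape[1]

    # Emission scores: log cosine similarity between each state and each frame
    # (an assumption standing in for "output likelihoods given by the similarity metric").
    sims = logical_spec.T @ acoustic_spec
    norms = (np.linalg.norm(logical_spec, axis=0)[:, None] *
             np.linalg.norm(acoustic_spec, axis=0)[None, :]) + 1e-12
    emit = np.log(np.clip(sims / norms, 1e-12, None))

    score = np.full((n_states, n_frames), -np.inf)
    back = np.zeros((n_states, n_frames), dtype=int)
    score[0, 0] = emit[0, 0]                      # must start in the first slice
    for t in range(1, n_frames):
        stay = score[:, t - 1] + stay_logp        # remain in the same slice
        adv = np.full(n_states, -np.inf)
        adv[1:] = score[:-1, t - 1] + advance_logp  # move to the next slice
        choose_stay = stay >= adv
        score[:, t] = np.where(choose_stay, stay, adv) + emit[:, t]
        back[:, t] = np.where(choose_stay, np.arange(n_states),
                              np.arange(n_states) - 1)

    # Backtrack from the final slice at the final frame.
    path = np.empty(n_frames, dtype=int)
    path[-1] = n_states - 1
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path
```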

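A simple reading of this confidence metric might look like the sketch below, which estimates f'(t) numerically, forms the central-difference local tempo, and maps the squared deviation through the Gaussian. The default half-width n and scale sigma are placeholder values, not those used in the project; low-confidence frames would then be candidates for the smoothing described next.

```python
import numpy as np

def alignment_confidence(f, n=20, sigma=2.0):
    """Confidence m(t) from the deviation between instantaneous and local tempo.

    f:     Viterbi alignment, one logical-slice index per acoustic frame
           (assumed much longer than 2n frames).
    n:     half-width (in frames) of the local tempo average.
    sigma: scale that maps deviations into a useful (0, 1] range.
    """
    f = np.asarray(f, dtype=float)
    inst = np.gradient(f)                        # numerical estimate of f'(t)
    local = np.zeros_like(f)
    t = np.arange(n, len(f) - n)
    local[t] = (f[t + n] - f[t - n]) / (2 * n)   # central-difference local tempo
    local[:n], local[-n:] = local[n], local[-n - 1]
    d = local - inst
    return np.exp(-d ** 2 / (2 * sigma ** 2))    # near 1 when smooth, near 0 when uneven
```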
The term f'(t) in the equation for d(t) above requires a brief explanation, since the available representation of f(t) is not a continuous function. Instead, it is a discrete mapping from audio CD time slices to note data time slices. I used a numerical differentiation technique to estimate f'(t) for the underlying latent continuous alignment function.

Smoothing is applied to areas with low confidence by mixing the aligned values with linear predictions from nearby high-confidence areas.

Results and Applications

The result of the entire process is an alignment function mapping audio CD time codes to note data time codes. Without a hand-aligned test set, I am unable to provide a numerical accuracy measurement; [3] and [4] also found it difficult to measure accuracy numerically, although [2] invented a (somewhat discouraging) metric and [5] has a metric based on hand-aligned test data. However, I built a tool to visually inspect the learned alignment between the spectrograms. The visualization made it clear that the alignment is generally quite accurate. More specifically, it is quite good at lining up passages with quickly changing tonalities and moving notes. Slow-changing segments with little tonal (and hence spectral) variation sometimes cause reduced accuracy, but even in those cases the proposed alignment tends to be within a small fraction of a second of the intuitively best alignment.

One possible application of this technique is, as I have already explored, automatically learning the characteristics of musical instruments. Abstract spectral characteristics of musical instruments could be used in musical synthesis or fully automatic music recognition. Other potential applications include enhancing playback while studying and browsing scores, and augmenting a musical listening experience with visual information about the music.

References

[1] Campbell, Jim: The Equal Tempered Scale and Peculiarities of Piano Tuning. http://www.precisionstrobe.com/apps/pianotemp/temper.html
[2] Raphael, Christopher (2002) Automatic Transcription of Piano Music. Proceedings of ISMIR.
[3] Orio, N., Déchelle, F. (2001) Score Following Using Spectral Analysis and Hidden Markov Models. Proceedings of the ICMC, 151-154.
[4] Cano, P., Loscos, A., Bonada, J. (1999) Score-Performance Matching Using HMMs. Proceedings of the ICMC, 441-444.
[5] Shalev-Shwartz, S., Keshet, J., Singer, Y. (2004) Learning to Align Polyphonic Music. Proceedings of ISMIR.