
University of Colorado at Boulder ECEN 4/5532 Lab 2. Lab report due on February 16, 2015.

This is a MATLAB only lab, and therefore each student needs to turn in her/his own lab report and own programs.

1 Introduction

This is the second lab on digital signal processing methods for content-based music information retrieval. In this lab we will explore higher-level musical features such as rhythm and tonality. In the next lab you will be asked to classify tracks according to these features.

2 Rhythm

Musical rhythm can be loosely defined as the presence of repetitive patterns in the temporal structure of music. We will implement several techniques to detect the presence of rhythmic structures in an audio track. Given an audio segment of duration T = 24 seconds, the goal is to compute a score that measures how often spectral patterns are repeated. Formally, we will compute a vector B, such that B(lag) quantifies the presence of similar spectral patterns for frames that are lag frames apart. The lag associated with the largest entry in the array B is a good candidate for the period of the rhythmic structure. We explain below two approaches to compute the array B.

2.1 Spectral decomposition

We first compute the mfcc coefficients, as explained in lab 1. We modify the mfcc function to add a nonlinearity. Indeed, the perception of loudness by the human auditory system is nonlinear: loudness does not increase linearly with sound intensity (measured, for instance, in Watts/m^2). A good approximation is obtained by taking the logarithm of the energy. The mfcc measured in decibels (dB) is thus defined by

    10 log_10(mfcc).   (1)

Your function should return an array mfcc(k, n) of size

    nbanks × N_f,   (2)

where N_f = f_s T / K is the number of frames, K = 256 is the number of frequencies, and nbanks = 40 is the number of mel bands. The entire analysis is performed at the level of the frames.
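As a point of reference, the following minimal MATLAB sketch shows the decibel conversion of equation (1). It assumes a lab-1 function that returns the nbanks × N_f array of mel-filterbank energies; the function name mfcc_linear and the file name are placeholders, not part of the assignment.

    % Minimal sketch of the dB conversion in equation (1).
    % mfcc_linear is a placeholder name for your lab-1 mel-filterbank function.
    [x, fs] = audioread('track396.wav');   % one of the supplied tracks (name assumed)
    K = 256;                               % number of frequencies per frame
    nbanks = 40;                           % number of mel bands
    M = mfcc_linear(x, fs, K, nbanks);     % nbanks x Nf array of mel energies
    M = max(M, eps);                       % guard against log10(0)
    mfcc_db = 10 * log10(M);               % equation (1)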

2.2 Spectrum histogram

We propose to construct a visual representation of a musical track by counting how often a given note (corresponding to one index of the mfcc array) was played at a given amplitude, measured in dB.

1. Modify your function that computes the mfcc to return the mfcc in decibels.

2. Construct a two-dimensional spectrum histogram (a matrix of counts), with 40 columns (one for each mfcc index) and 50 rows (one for each amplitude level, measured in dB, from -20 to 60 dB). You will normalize the histogram such that the sum along each column (for each "note") is one.

3. Display, using the MATLAB function imagesc, the spectrum histogram for the 12 audio tracks supplied for the first lab: http://ecee.colorado.edu/~fmeyer/class/ecen4532/audio-files.zip

2.3 Similarity matrix

The first stage involves computing a similarity matrix defined as follows,

    S(i, j) = \frac{\langle mfcc(:, i), mfcc(:, j) \rangle}{\| mfcc(:, i) \| \, \| mfcc(:, j) \|} = \frac{\sum_{k=1}^{nbanks} mfcc(k, i) \, mfcc(k, j)}{\| mfcc(:, i) \| \, \| mfcc(:, j) \|}.   (3)

S(i, j) measures the cosine of the angle between the spectral signatures of frames i and j. We note that mfcc is always positive, and therefore 0 ≤ S(i, j) ≤ 1. The similarity S(i, j) is large, S(i, j) ≈ 1, if the notes being played in frame i are similar to the notes being played in frame j. We note that this similarity is independent of the loudness of the music.

4. Implement the computation of the similarity matrix, and display the similarity matrices (color-coded) for the 12 audio tracks supplied for the first lab. See Fig. 1 for an example of a similarity matrix.
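A minimal MATLAB sketch of item 4 follows; it assumes mfcc_db is the nbanks × N_f array from Section 2.1 (the variable name is an assumption; use whichever array equation (3) prescribes in your implementation).

    % Sketch of the cosine-similarity matrix of equation (3).
    nrm = sqrt(sum(mfcc_db.^2, 1));                    % norm of each frame (column)
    nrm(nrm == 0) = eps;                               % avoid division by zero
    U = mfcc_db ./ repmat(nrm, size(mfcc_db, 1), 1);   % unit-norm columns
    S = U' * U;                                        % S(i,j) = cosine similarity
    imagesc(S); axis image; colorbar;                  % color-coded display
    title('Similarity matrix');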

Figure 1: Similarity matrix associated with the sample sample1.wav. The matrix is symmetric: S(i, j) = S(j, i).

2.4 A first estimate of the rhythm

We note that the entries S(n, n + l) in the similarity matrix (3) are always at a distance (time lag) of l frames. In order to detect the presence of patterns that are repeated every l frames we compute

    B(l) = \frac{1}{N_f - l} \sum_{n=1}^{N_f - l} S(n, n + l),   l = 0, ..., N_f - 1.   (4)

B(l) is the average of the entries on the l-th upper diagonal. Indeed, B(0) is the average of the entries on the main diagonal. Similarly, B(1) is the average of the entries on the first upper diagonal, and so on and so forth. We note that

    0 ≤ B(l) ≤ 1.   (5)

The index l associated with the largest value of B(l) corresponds to the presence of a rhythmic period of l K/f_s seconds.

5. Implement the computation of the rhythm index B(l) defined in (4). Plot the vector B as a function of the lag l = 0, ..., N_f - 1 for the 12 tracks. Comment on the presence, or absence, of a strong rhythmic pattern. Hint: to debug your function, you can use track-396, which should display a strong rhythmic structure.
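One way to read equation (4) is that B(l) is the mean of the l-th upper diagonal of S, which makes the computation a one-liner per lag; the sketch below assumes S from Section 2.3.

    % Sketch of item 5: B(l) of equation (4) as the mean of the l-th upper
    % diagonal of the similarity matrix S.
    Nf = size(S, 1);
    B = zeros(1, Nf);
    for lag = 0:Nf-1
        B(lag + 1) = mean(diag(S, lag));   % average of the entries S(n, n+lag)
    end
    plot(0:Nf-1, B);
    xlabel('lag (frames)'); ylabel('B(lag)');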

2.5 Periodicity Histograms

Many entries in the similarity matrix are small, and are not useful to quantify the presence of rhythm. Furthermore, it would be interesting to be able to quantify the rhythm as a function of tempo. We can define a notion of tempo, measured in beats per minute, by taking the inverse of the rhythmic period l K/f_s,

    BPM(l) = \frac{60 f_s}{l K}.   (6)

We describe an algorithm to compute, for each tempo BPM(l), the probability of finding a large similarity at that tempo.

6. Compute the median \bar{S} of the matrix S.

7. Using a range of 40 to 240 BPM with bins of size 5 BPM, compute the number of times S(i, i + l) is greater than \bar{S}, for each value of the tempo 40, 45, ..., 240 BPM. Your function should return an array of counts of size 80.

8. Construct and display a normalized histogram PH that counts, for each value of the tempo 40, 45, ..., 240 BPM, the relative number of times that the similarity at that tempo was greater than the median similarity \bar{S}. This is basically the array that you computed in the previous question, normalized by the total number of counts. Display the histograms for the 12 tracks that epitomize the six different musical genres.

2.6 A better estimate of the rhythm

Rather than compounding the similarity between frames that are a fixed lag apart in time, we can try to detect more interesting patterns in the similarity matrix. In general, if two frames i and j are similar, we would like to know if this similarity is repeated later in the segment, at time j + l. A lag l will be a good candidate for a rhythmic period if there are many i and j such that if S(i, j) is large then S(i, j + l) is also large. This leads to the computation of the autocorrelation

    AR(l) = \frac{1}{N_f (N_f - l)} \sum_{i=1}^{N_f} \sum_{j=1}^{N_f - l} S(i, j) \, S(i, j + l),   l = 0, ..., N_f - 1.   (7)
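The double sum in equation (7) can be evaluated by multiplying S element-wise with a copy of itself shifted by l columns; a minimal sketch, assuming S from Section 2.3, is given below.

    % Sketch of the autocorrelation AR(l) of equation (7).
    Nf = size(S, 1);
    AR = zeros(1, Nf);
    for lag = 0:Nf-1
        P = S(:, 1:Nf-lag) .* S(:, 1+lag:Nf);     % S(i,j) .* S(i,j+lag)
        AR(lag + 1) = sum(P(:)) / (Nf * (Nf - lag));
    end
    plot(0:Nf-1, AR);
    xlabel('lag (frames)'); ylabel('AR(lag)');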

Figure 2: Rhythm index AR(l) as a function of the lag l.

9. Implement the computation of the rhythm index AR(l) defined in (7). Plot the vector AR as a function of the lag l = 0, ..., N_f - 1 for the 12 tracks. Comment on the presence, or absence, of a strong rhythmic pattern. See Fig. 2 for an example of a plot of the rhythm index.

2.7 Rhythmic variations over time

Here we are interested in studying the dynamic changes in the rhythm. For this purpose, we consider short time windows formed by 20 frames (approximately 1 second) over which we compute a vector AR of size 20,

    AR(l, m) = \frac{1}{20 (20 - l)} \sum_{i=1}^{20} \sum_{j=1}^{20 - l} S(i + 20m, j + 20m) \, S(i + 20m, j + 20m + l),   (8)

    for l = 0, ..., 19, and m = 0, ..., N_f / 20 - 1.   (9)
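The windowed autocorrelation of equation (8) is the same computation as equation (7), restricted to consecutive 20-frame blocks of S. The sketch below (assuming S from Section 2.3) stores AR(l, m) as a 20 × N_f/20 image ready for a false-color display.

    % Sketch of the windowed rhythm index AR(l,m) of equation (8), computed on
    % non-overlapping blocks of W = 20 frames of the similarity matrix S.
    W  = 20;
    Nf = size(S, 1);
    Nw = floor(Nf / W);                           % number of time windows
    ARlm = zeros(W, Nw);
    for m = 0:Nw-1
        Sw = S(m*W + (1:W), m*W + (1:W));         % 20 x 20 block starting at frame m*20
        for lag = 0:W-1
            P = Sw(:, 1:W-lag) .* Sw(:, 1+lag:W); % S(i+20m, j+20m) .* S(i+20m, j+20m+lag)
            ARlm(lag + 1, m + 1) = sum(P(:)) / (W * (W - lag));
        end
    end
    imagesc(0:Nw-1, 0:W-1, ARlm); axis xy;        % false-color display (item 10)
    xlabel('window index m'); ylabel('lag (frames)');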

10. Implement the computation of the rhythm index AR(l, m) defined in (8). Plot the image AR in false color as a function of the lag l = 0, ..., 19 (y-axis) and the time window index m (x-axis) for the 12 tracks. Comment on the variation of rhythm patterns for the different tracks.

Figure 3: Rhythm index AR(l, m) as a function of the lag l (y-axis) and frame index m (x-axis).

3 Tonality and Chroma

3.1 The equal-tempered scale

We now investigate the time-frequency structure related to the concepts of tonality and chroma. The tonality is concerned with the distribution of notes at a given time. In contrast, the melody characterizes the variations of the spectrum as a function of time. We first observe that the auditory sensation that is closely related to frequency, also known as pitch, corresponds to a logarithmic scale of the physical frequency. We have seen examples of such scales in the previous lab: the bark and the mel scales. In this lab, we define another logarithmic scale, known as the tempered scale.

First, we define a reference frequency f_0 associated with a reference pitch. While it is customary to use the frequency 440 Hz associated with the pitch A4, we will also use f_0 = 27.5 Hz (this is known as A0, the lowest note on a piano). The tempered scale introduces 12 frequency intervals, known as semitones, between this reference frequency f_0 and the next frequency also perceived as an A, 2 f_0. As a result, a frequency f_sm that is sm semitones away from f_0 is given by

    f_sm = f_0 2^{sm/12}.   (10)

An interval of 12 semitones, [f_sm, f_{sm+12}), corresponds to an octave. The same notes (e.g. A, or C#) are always separated by an octave. For instance, the two As above, A0 and A4, are separated by 4 octaves, since 440 = 2^4 × 27.5. The notion of note can be formally modeled by the concept of chroma.
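As a quick check of equation (10), the short MATLAB sketch below lists four octaves of tempered frequencies starting from f_0 = 27.5 Hz; every twelfth semitone is again an A, ending at A4 = 440 Hz.

    % Sketch: equal-tempered frequencies of equation (10) with f0 = 27.5 Hz (A0).
    f0 = 27.5;
    sm = 0:48;                        % four octaves of semitones
    f_sm = f0 * 2.^(sm / 12);         % equation (10)
    disp(f_sm(1:12:end));             % 27.5  55  110  220  440  (the successive As)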

3.2 Chroma

The chroma is associated with the relative position of a note inside an octave. It is a relative measurement that is independent of the absolute pitch. We describe an algorithm to compute a chroma feature vector for a frame n. To lighten the notation, we drop the dependency on the frame index n in this discussion. We first present the idea informally, and then we make our statement precise.

We are interested in mapping a given frequency f to a note. Given a reference frequency f_0, we can use (10) to compute the distance sm between f and f_0, measured in semitones,

    sm = round(12 log_2(f/f_0)).   (11)

We note that f is usually not exactly equal to f_0 2^{sm/12}, and this is why we round 12 log_2(f/f_0) to the closest integer. We can then map this index sm to a note, or chroma, c within an octave, using

    c = sm mod 12,   (12)

or

    sm = 12 q + c, with 0 ≤ c ≤ 11, and q ∈ N.   (13)

In other words, f_sm and f_{sm+12} given by (10) correspond to the same note c, which is exactly what you see if you look at the keys of a piano.

The problem with this approach is that we do not have an algorithm to extract the frequencies f of the notes that are being played at a given time. Rather, we obtain a distribution of the spectral energy provided by the Fourier transform. We need to identify the main peaks in the Fourier transform, and compute the chroma associated with these frequencies, using the equations above. In the following we describe the different steps of the algorithm in detail.

3.2.1 Step 1: spectral analysis and peak detection

The goal is to identify the main notes being played, and remove the noise and the harmonics coming from the timbre of each instrument. For each frame n, we compute the windowed FFT as explained in the first lab. We obtain K = N/2 + 1 (where N is even) Fourier coefficients, given by

    Y = fft(w .* x_n);  K = N/2 + 1;  X_n = Y(1:K);   (14)

We detect the local maxima of the magnitude of the Fourier transform X_n, and count the number of maxima, or peaks, n_peaks. We denote by f_k, k = 1, ..., n_peaks, these peak frequencies.

3.2.2 Step 2: mapping of the peak frequencies to semitones

For each peak frequency f_k, we find the semitone sm and the corresponding frequency f_sm = f_0 2^{sm/12} closest to f_k, so that

    sm = round(12 log_2(f_k/f_0)).   (15)
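A minimal sketch of Steps 1 and 2 for a single frame follows. It assumes x_n (a column vector of N samples for frame n), the analysis window w, and the sampling rate fs come from your lab-1 framing code; those variable names, and the simple three-point maximum test used for peak picking, are assumptions (the Signal Processing Toolbox function findpeaks is an alternative).

    % Sketch of Step 1 (windowed FFT and peak detection, equation (14)) and
    % Step 2 (mapping of the peak frequencies to semitones, equation (15)).
    N   = length(x_n);
    Y   = fft(w .* x_n);
    K   = N/2 + 1;                            % N is even
    X_n = Y(1:K);
    mag = abs(X_n);
    % interior bins that are larger than both of their neighbors
    ispeak = [false; mag(2:end-1) > mag(1:end-2) & mag(2:end-1) > mag(3:end); false];
    kpk = find(ispeak);                       % bin indices of the n_peaks maxima
    f_k = (kpk - 1) * fs / N;                 % peak frequencies in Hz
    f0  = 27.5;                               % reference pitch A0
    sm  = round(12 * log2(f_k / f0));         % equation (15), one semitone per peak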

Finally, we map the index sm to the note c defined by

    c = sm mod 12.   (16)

For computational efficiency we use f_0 = 27.5 Hz instead of 440 Hz.

3.2.3 Step 3: the Pitch Class Profile: a weighted sum of the semitones

Instead of assigning each peak frequency f_k to a single semitone sm, we spread the effect of f_k to the two neighboring semitones sm and sm + 1 using a raised cosine function, encoded in a weight function w(k, c). We define the Pitch Class Profile associated with the note, or chroma, c = 0, ..., 11 as the weighted sum of all the peak frequencies that are mapped to the note c, irrespective of the octave they fall in,

    PCP(c) = \sum_k w(k, c) |X_n(k)|^2, where k is such that c = round(12 log_2(f_k/f_0)) mod 12,   (17)

and the weight w(k, c) is given by

    w(k, c) = cos^2(\pi r/2)   if -1 < r = 12 log_2(f_k/f_0) - sm < 1, where c = sm mod 12 and sm = round(12 log_2(f_k/f_0)),
    w(k, c) = 0                otherwise.   (18)

In plain English, PCP(c) tells us how much energy is associated with frequencies that are very close (as defined by the weight w(k, c)) to the note c, across all the octaves.

3.2.4 Step 4: Normalize the Pitch Class Profile

We want our definition of chroma to be independent of the loudness of the music. We therefore need to normalize each PCP by the total PCP computed over all the semitones. The normalized pitch class profile (NPCP) is defined by

    NPCP(c) = \frac{PCP(c)}{\sum_{q=1}^{12} PCP(q)}.   (19)
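A minimal sketch of Steps 3 and 4 is given below, reusing f_k, kpk, X_n, and f0 from the previous sketch. It applies the weight of equation (18) as written, i.e. a single raised-cosine weight attached to the rounded semitone; if you also spread the energy onto the adjacent semitone, as the text suggests, a second, complementary term would be added. The loop structure and variable names are assumptions.

    % Sketch of Step 3 (equations (17)-(18)) and Step 4 (equation (19)).
    PCP = zeros(12, 1);
    for p = 1:numel(f_k)
        s   = 12 * log2(f_k(p) / f0);        % position of the peak in semitones
        smp = round(s);                      % nearest semitone
        r   = s - smp;                       % distance to that semitone, |r| <= 0.5
        wkc = cos(pi * r / 2)^2;             % raised-cosine weight, equation (18)
        c   = mod(smp, 12);                  % chroma (0..11)
        PCP(c + 1) = PCP(c + 1) + wkc * abs(X_n(kpk(p)))^2;   % equation (17), 1-indexed
    end
    NPCP = PCP / max(sum(PCP), eps);         % equation (19)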

11. Implement the computation of the Normalized Pitch Class Profile, defined by (19). You will compute a vector of 12 entries for each frame n.

12. Evaluate and plot the NPCP for the 12 audio tracks supplied for the first lab: http://ecee.colorado.edu/~fmeyer/class/ecen4532/audio-files.zip See Fig. 4 for an example.

Figure 4: Chroma as a function of time (seconds).