Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization


Fei Sha and Lawrence K. Saul
Dept. of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104

Abstract

An auditory scene, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources. Pitch is known to be an important cue for auditory scene analysis. In this paper, with the goal of building agents that operate in human environments, we describe a real-time system to identify the presence of one or more voices and compute their pitch. The signal processing in the front end is based on instantaneous frequency estimation, a method for tracking the partials of voiced speech, while the pattern-matching in the back end is based on nonnegative matrix factorization, an unsupervised algorithm for learning the parts of complex objects. While supporting a framework to analyze complicated auditory scenes, our system maintains real-time operability and state-of-the-art performance in clean speech.

1 Introduction

Nonnegative matrix factorization (NMF) is an unsupervised algorithm for learning the parts of complex objects [11]. The algorithm represents high dimensional inputs ("objects") by a linear superposition of basis functions ("parts") in which both the linear coefficients and basis functions are constrained to be nonnegative. Applied to images of faces, NMF learns basis functions that correspond to eyes, noses, and mouths; applied to handwritten digits, it learns basis functions that correspond to cursive strokes. The algorithm has also been implemented in real-time embedded systems as part of a visual front end [10]. Recently, it has been suggested that NMF can play a similarly useful role in speech and audio processing [16, 17]. An auditory scene, composed of overlapping acoustic sources, can be viewed as a complex object whose constituent parts are the individual sources.
Pitch is known to be an extremely important cue for source separation and auditory scene analysis [4]. It is also an acoustic cue that seems amenable to modeling by NMF. In particular, we can imagine the basis functions in NMF as harmonic stacks of individual periodic sources (e.g., voices, instruments), which are superposed to give the magnitude spectrum of a mixed signal. The pattern-matching computations of NMF are reminiscent of longstanding template-based models of pitch perception [6]. Our interest in NMF lies mainly in its use for speech processing. In this paper, we describe a real-time system to detect the presence of one or more voices and determine their pitch.

Learning plays a crucial role in our system: the basis functions of NMF are trained offline from data to model the particular timbres of voiced speech, which vary across different phonetic contexts and speakers. In related work, Smaragdis and Brown used NMF to model polyphonic piano music [17]. Our work differs in its focus on speech, real-time processing, and statistical learning of basis functions. A long-term goal is to develop interactive voice-driven agents that respond to the pitch contours of human speech [15]. To be truly interactive, these agents must be able to process input from distant sources and to operate in noisy environments with overlapping speakers. In this paper, we have taken an important step toward this goal by maintaining real-time operability and state-of-the-art performance in clean speech while developing a framework that can analyze more complicated auditory scenes. These are inherently competing goals in engineering. Our focus on actual system-building also distinguishes our work from many other studies of overlapping periodic sources [5, 9, 19, 20, 21]. The organization of this paper is as follows. In section 2, we describe the signal processing in our front end that converts speech signals into a form that can be analyzed by NMF. In section 3, we describe the use of NMF for pitch tracking, namely, the learning of basis functions for voiced speech and the nonnegative deconvolution for real-time analysis. In section 4, we present experimental results on signals with one or more voices. Finally, in section 5, we conclude with plans for future work.

2 Signal processing

A periodic signal is characterized by its fundamental frequency, f0. It can be decomposed by Fourier analysis as the sum of sinusoids, or partials, whose frequencies occur at integer multiples of f0. For periodic signals with unknown f0, the frequencies of the partials can be inferred from peaks in the magnitude spectrum, as computed by an FFT.
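As a quick illustration of this idea (our own sketch, not code from the paper), the partials of a synthetic periodic signal can be read off as local peaks of a windowed FFT magnitude spectrum:

```python
# Illustrative sketch: for a periodic signal with unknown f0, the partials
# appear as peaks of the FFT magnitude spectrum at integer multiples of f0.
import numpy as np

fs = 8000                              # sample rate in Hz (illustrative)
t = np.arange(0, 0.2, 1 / fs)
f0 = 100.0                             # fundamental frequency of the toy signal
s = sum(np.cos(2 * np.pi * f0 * k * t) / k for k in (1, 2, 3))

mag = np.abs(np.fft.rfft(s * np.hanning(len(s))))
freqs = np.fft.rfftfreq(len(s), 1 / fs)

# local maxima of the magnitude spectrum above a small threshold
peaks = [float(freqs[i]) for i in range(1, len(mag) - 1)
         if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]
         and mag[i] > 0.05 * mag.max()]
print(peaks)                           # -> [100.0, 200.0, 300.0]
```

For clean, stationary signals this works; the next paragraphs explain why it is unreliable for real voiced speech.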
Voiced speech is perceived as having a pitch at the fundamental frequency of vocal cord vibration. Perfect periodicity is an idealization, however; the waveforms of voiced speech are non-stationary, quasiperiodic signals. In practice, one cannot reliably extract the partials of voiced speech by simply computing windowed FFTs and locating peaks in the magnitude spectrum. In this section, we review a more robust method, known as instantaneous frequency (IF) estimation [1], for extracting the stable sinusoidal components of voiced speech. This method is the basis for the signal processing in our front end. The starting point of IF estimation is to model the voiced speech signal, s(t), by a sum of amplitude- and frequency-modulated sinusoids:

s(t) = Σ_i α_i(t) cos( ∫^t dt' ω_i(t') + θ_i ).   (1)

The arguments of the cosines in eq. (1) are called the instantaneous phases; their derivatives with respect to time yield the so-called instantaneous frequencies ω_i(t). If the amplitudes α_i(t) and frequencies ω_i(t) are stationary, then eq. (1) reduces to a weighted sum of pure sinusoids. For nonstationary signals, ω_i(t) intuitively represents the instantaneous frequency of the ith partial at time t. The short-time Fourier transform (STFT) provides an efficient tool for IF estimation [2]. The STFT of s(t) with windowing function w(t) is given by:

F(ω, t) = ∫ dτ s(τ) w(τ - t) e^{-jωτ}.   (2)

Let z(ω, t) = e^{jωt} F(ω, t) denote the analytic signal of the Fourier component of s(t) with frequency ω, and let a = Re[z] and b = Im[z] denote its real and imaginary parts. We

can define a mapping from the time-frequency plane of the STFT to another frequency axis λ(ω, t) by:

λ(ω, t) = ∂/∂t arg[z(ω, t)] = (a ∂b/∂t - b ∂a/∂t) / (a² + b²).   (3)

The derivatives on the right hand side can be computed efficiently via STFTs [2]. Note that the right hand side of eq. (3) differentiates the instantaneous phase associated with a particular Fourier component of s(t). IF estimation identifies the stable fixed points [7, 8] of this mapping, given by

λ(ω*, t) = ω*  and  (∂λ/∂ω)|_{ω=ω*} < 1,   (4)

as the instantaneous frequencies of the partials that appear in eq. (1). Intuitively, these fixed points occur where the notions of energy at frequency ω in eqs. (1) and (2) coincide; that is, where the IF and STFT representations appear most consistent.

[Figure 1: Top: instantaneous frequencies of estimated partials for the utterance "The north wind and the sun were disputing". Bottom: f0 contour derived from a laryngograph recording.]

The top panel of Fig. 1 shows the IFs of partials extracted by this method for a speech signal with sliding and overlapping analysis windows. The bottom panel shows the f0 contour. Note that in regions of voiced speech, indicated by nonzero f0 values, the IFs exhibit a clear harmonic structure, while in regions of unvoiced speech, they do not. In summary, the signal processing in our front end extracts partials with frequencies ω*_i(t) and nonnegative amplitudes F(ω*_i(t), t), where t indexes the time of the analysis window and i indexes the extracted partials. Further analysis of the signal is performed by the NMF algorithm described in the next section, which is used to detect the presence of one or more voices and to estimate their f0 values. Similar front ends have been used in other studies of pitch tracking and source separation [1, 2, 7, 13].

3 Nonnegative matrix factorization

For mixed signals of overlapping speakers, our front end outputs the mixture of partials extracted from several voices.
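The fixed-point computation of eqs. (3) and (4) can be sketched numerically. The snippet below is our illustration, not the paper's implementation: the one-sample hop and all parameters are simplifications. It estimates λ(ω, t) from the phase advance of the STFT and picks the energetic bin closest to a fixed point λ(ω) ≈ ω:

```python
# Numerical sketch of the IF mapping lambda(omega, t): the per-bin phase
# advance over a one-sample hop approximates d/dt arg[z(omega, t)].
import numpy as np

fs = 8000
t = np.arange(0, 0.3, 1 / fs)
s = np.cos(2 * np.pi * 210.0 * t)       # a single partial at 210 Hz

n = 512
win = np.hanning(n)
f1 = np.fft.rfft(s[1000:1000 + n] * win)        # frame at time t
f2 = np.fft.rfft(s[1001:1001 + n] * win)        # frame one sample later
freqs = np.fft.rfftfreq(n, 1 / fs)

# lambda(omega, t): time derivative of the instantaneous phase, per bin
lam = np.angle(f2 * np.conj(f1)) * fs / (2 * np.pi)

# keep bins that carry energy, then pick the one closest to a fixed point
cand = np.where(np.abs(f1) > 0.1 * np.abs(f1).max())[0]
best = cand[np.argmin(np.abs(lam[cand] - freqs[cand]))]
print(round(float(lam[best]), 1))       # close to the true partial, 210 Hz
```

Note how λ lands near 210 Hz even though 210 Hz falls between FFT bins (203.1 and 218.8 Hz at this resolution), which is the appeal of IF estimation over raw peak-picking.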
How can we analyze this output by NMF? In this section, we show: (i) how to learn nonnegative basis functions that model the characteristic timbres of voiced speech, and (ii) how to decompose mixed signals in terms of these basis functions. We briefly review NMF [11]. Given observations y_t, the goal of NMF is to compute basis functions W and linear coefficients x_t such that the reconstructed vectors ỹ_t = W x_t

best match the original observations. The observations, basis functions, and coefficients are constrained to be nonnegative. Reconstruction errors are measured by the generalized Kullback-Leibler divergence:

G(y, ỹ) = Σ_α [ y_α log(y_α/ỹ_α) - y_α + ỹ_α ],   (5)

which is lower bounded by zero and vanishes if and only if y = ỹ. NMF works by optimizing the total reconstruction error Σ_t G(y_t, ỹ_t) in terms of the basis functions W and coefficients x_t. We form three matrices by concatenating the column vectors y_t, ỹ_t and x_t separately and denote them by Y, Ỹ and X respectively. Multiplicative updates for the optimization problem are given in terms of the elements of these matrices:

W_αβ ← W_αβ [Σ_t X_βt (Y_αt/Ỹ_αt)] / [Σ_t X_βt],   X_βt ← X_βt [Σ_α W_αβ (Y_αt/Ỹ_αt)] / [Σ_γ W_γβ].   (6)

These alternating updates are guaranteed to converge to a local minimum of the total reconstruction error; see [11] for further details. In our application of NMF to pitch estimation, the vectors y_t store vertical time slices of the IF representation in Fig. 1. Specifically, the elements of y_t store the magnitude spectra F(ω*_i(t), t) of extracted partials at time t; the instantaneous frequency axis is discretized on a log scale so that each element of y_t covers 1/36 octave of the frequency spectrum. The columns of W store basis functions, or harmonic templates, for the magnitude spectra of voiced speech with different fundamental frequencies. (An additional column in W stores a non-harmonic template for unvoiced speech.) In this study, only one harmonic template was used per fundamental frequency. The fundamental frequencies range from 50 Hz to 400 Hz, spaced and discretized on a log scale. We constrained the harmonic templates for different fundamental frequencies to be related by a simple translation on the log-frequency axis. Tying the columns of W in this way greatly reduces the number of parameters that must be estimated by a learning algorithm.
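To see why tying templates by translation works, consider a hypothetical log-frequency binning at 1/36 octave per bin (the helper below is ours, with an assumed 50 Hz reference): doubling the fundamental shifts every partial by exactly 36 bins, so a single template can be shifted along the axis to represent any pitch.

```python
# Hypothetical helper: map a frequency in Hz to a log-frequency bin index,
# with 36 bins per octave. On this axis, a change of pitch is a translation.
import numpy as np

def log_bin(freq_hz, f_ref=50.0, bins_per_octave=36):
    return int(round(bins_per_octave * np.log2(freq_hz / f_ref)))

partials_100 = [log_bin(100 * k) for k in (1, 2, 3, 4)]  # partials of 100 Hz
partials_200 = [log_bin(200 * k) for k in (1, 2, 3, 4)]  # partials of 200 Hz
shifts = [b - a for a, b in zip(partials_100, partials_200)]
print(shifts)                          # -> [36, 36, 36, 36]
```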
Finally, the elements of x_t store the coefficients that best reconstruct y_t by linearly superposing the harmonic templates of W. Note that only partials from the same source form harmonic relations. Thus, the number of nonzero elements in x_t indicates the number of periodic sources at time t, while the indices of the nonzero elements indicate their fundamental frequencies. It is in this sense that the reconstruction y_t ≈ W x_t provides an analysis of the auditory scene.

3.1 Learning the basis functions of voiced speech

The harmonic templates in W were estimated from the voiced speech of (non-overlapping) speakers in the Keele database [14]. The Keele database provides aligned pitch contours derived from laryngograph recordings. The first halves of all utterances were used for training, while the second halves were reserved for testing. Given the vectors y_t computed by IF estimation in the front end, the problem of NMF is to estimate the columns of W and the reconstruction coefficients x_t. Each x_t has only two nonzero elements (one indicating the reference value for f0, the other corresponding to the non-harmonic template of the basis matrix W); their magnitudes must still be estimated by NMF. The estimation was performed by iterating the updates in eq. (6). Fig. 2 (left) compares the harmonic template at 100 Hz before and after learning. While the template is initialized with broad spectral peaks, it is considerably sharpened by the NMF learning algorithm. Fig. 2 (right) shows four examples from the Keele database (from snippets of voiced speech with f0 = 100 Hz) that were used to train this template. Note that even among these four partial profiles there is considerable variance. The learned template is derived to minimize the total reconstruction error over all segments of voiced speech in the training data.
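The learning step can be sketched with the multiplicative updates of eq. (6) on random nonnegative data; this is a minimal illustration of the algorithm, not the trained system, and the dimensions are made up:

```python
# Minimal NMF under the generalized KL divergence of eq. (5), using the
# multiplicative updates of eq. (6); the error is nonincreasing by design.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random((20, 30)) + 1e-3        # nonnegative observations (columns y_t)
W = rng.random((20, 5)) + 1e-3         # basis functions ("parts")
X = rng.random((5, 30)) + 1e-3         # nonnegative coefficients

def kl(Y, Z):
    return float(np.sum(Y * np.log(Y / Z) - Y + Z))

err = [kl(Y, W @ X)]
for _ in range(200):
    X *= (W.T @ (Y / (W @ X))) / W.sum(axis=0)[:, None]
    W *= ((Y / (W @ X)) @ X.T) / X.sum(axis=1)[None, :]
    err.append(kl(Y, W @ X))

# the total reconstruction error never increases under the updates
assert all(a >= b - 1e-6 for a, b in zip(err, err[1:]))
print(err[-1] < err[0])                # -> True
```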

[Figure 2: Left: harmonic template before and after learning for voiced speech at f0 = 100 Hz; the learned template (bottom) has a much sharper spectral profile. Right: observed partials from four speakers with f0 = 100 Hz (female: "cloak"; male: "stronger"; male: "travel"; male: "the").]

3.2 Nonnegative deconvolution for estimating f0 of one or more voices

Once the basis functions in W have been estimated, computing x such that y ≈ Wx under the measure of eq. (5) simplifies to the problem of nonnegative deconvolution. Nonnegative deconvolution has been applied to problems in fundamental frequency estimation [16], music analysis [17] and sound localization [12]. In our model, nonnegative deconvolution of y ≈ Wx yields an estimate of the number of periodic sources in y as well as their f0 values. Ideally, the number of nonzero reconstruction weights in x reveals the number of sources, and the corresponding columns in the basis matrix W reveal their f0 values. In practice, the index of the largest component of x is found, and its corresponding f0 value is deemed to be the dominant fundamental frequency. The second largest component of x is then used to extract a secondary fundamental frequency, and so on. A thresholding heuristic can be used to terminate the search for additional sources. Unvoiced speech is detected by a simple frame-based classifier trained to make voiced/unvoiced distinctions from the observation y and its nonnegative deconvolution x. The pattern-matching computations in NMF are reminiscent of well-known models of harmonic template matching [6]. Two main differences are worth noting. First, the templates in NMF are learned from labeled speech data. We have found this to be essential for their generalization to unseen cases. It is not obvious how to craft a harmonic template by hand that manages the variability of partial profiles in Fig. 2 (right). Second, the template matching in NMF is framed by nonnegativity constraints.
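A toy sketch of this nonnegative deconvolution (a hand-built dictionary of harmonic stacks; all names and parameters are ours, not the paper's): W is held fixed and only the x-update of eq. (6) is iterated, and the two largest coefficients recover the two pitches in the mixture.

```python
# Nonnegative deconvolution: fixed dictionary W, multiplicative updates on x.
import numpy as np

n_bins, n_pitch = 60, 12
W = np.zeros((n_bins, n_pitch))
for p in range(n_pitch):            # toy harmonic stacks: 3 partials per pitch
    for h in (1, 2, 3):
        W[(p + 1) * h - 1, p] = 1.0 / h

y = 1.0 * W[:, 3] + 0.5 * W[:, 7]   # mixture of "pitches" 3 and 7

x = np.ones(n_pitch)
for _ in range(500):                # x-update of eq. (6) with W held fixed
    x *= (W.T @ (y / (W @ x + 1e-12))) / (W.sum(axis=0) + 1e-12)

top2 = sorted(np.argsort(x)[-2:].tolist())
print(top2)                         # -> [3, 7]
```

Because G(y, Wx) is convex in x for fixed W, these updates settle on the generating combination even though the two stacks share a partial (bin 7).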
Specifically, the algorithm models observed partials by a nonnegative superposition of harmonic stacks. The cost function in eq. (5) also diverges if ỹ_α = 0 when y_α is nonzero; this useful property ensures that minima of eq. (5) must explain each observed partial by attributing it to one or more sources. This property does not hold for traditional least-squares linear reconstructions.

4 Implementation and results

We have implemented both the IF estimation in section 2 and the nonnegative deconvolution in section 3.2 in a real-time system for pitch tracking. The software runs on a laptop computer with a visual display that shows the contour of estimated f0 values scrolling in real time. After the signal is downsampled to 4900 Hz, IF estimation is performed in 10 ms shifts with an analysis window of 50 ms. Partials extracted from the fixed points of eq. (4) are discretized on a log-frequency axis. The columns of the basis matrix W provide harmonic templates for f0 = 50 Hz to f0 = 400 Hz with a step size of 1/36 octave. To achieve real-time performance and reduce system latency, the system does not postprocess the f0 values obtained in each frame from nonnegative deconvolution: in particular, there is no dynamic programming to smooth the pitch contour, as is commonly done in many pitch tracking algorithms [18]. We have found that our algorithm performs well and yields smooth pitch contours (for non-overlapping voices) even without this postprocessing.

[Table 1: Comparison between our algorithm and RAPT [18] on the test portion of the Keele database (see text) and the full Edinburgh database, in terms of the percentages of voiced errors (VE), unvoiced errors (UE), and gross pitch errors (GPE), as well as the root-mean-square (RMS) deviation in Hz.]

4.1 Pitch determination of clean speech signals

Table 1 compares the performance of our algorithm on clean speech from a single speaker to RAPT [18], a state-of-the-art pitch tracker based on autocorrelation and dynamic programming. Four error types are reported: the percentage of voiced frames misclassified as unvoiced (VE), the percentage of unvoiced frames misclassified as voiced (UE), the percentage of voiced frames with gross pitch errors (GPE), where predicted and reference f0 values differ by more than 20%, and the root-mean-square (RMS) difference between predicted and reference f0 values when there are no gross pitch errors. The results were obtained on the second halves of utterances reserved for testing in the Keele database, as well as the full set of utterances in the Edinburgh database [3]. As shown in the table, the performance of our algorithm is comparable to that of RAPT.

4.2 Pitch determination of overlapping voices and noisy speech

We have also examined the robustness of our system to noise and overlapping speakers. Fig. 3 shows the f0 values estimated by our algorithm from a mixture of two voices, one with ascending pitch, the other with descending pitch. Each voice spans one octave. The dominant and secondary f0 values extracted in each frame by nonnegative deconvolution are shown. The algorithm recovers the f0 values of the individual voices almost perfectly, though it does not currently make any effort to track the voices through time. (This is a subject for future work.) Fig. 4 shows in more detail how IF estimation and nonnegative deconvolution are affected by interfering speakers and noise. A clean signal from a single speaker is shown in the top row of the plot, along with its log power spectra, partials extracted by IF estimation, estimated f0, and reconstructed harmonic stack. The second and third rows show the effects of adding white noise and an overlapping speaker, respectively. Both types of interference degrade the harmonic structure in the log power spectra and extracted partials. However, nonnegative deconvolution is still able to recover the pitch of the original speaker, as well as the pitch of the second speaker. On larger evaluations of the algorithm's robustness, we have obtained results comparable to RAPT over a wide range of SNRs (as low as 0 dB).
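The error measures reported in Table 1 are straightforward to compute per frame; the sketch below uses made-up reference and predicted f0 tracks (0 marks an unvoiced frame) and the conventional 20% gross-error threshold:

```python
# Illustrative computation (our sketch) of GPE and RMS deviation: a gross
# pitch error is a voiced frame whose prediction is off by more than 20%;
# RMS is taken over the remaining voiced frames.
import numpy as np

ref  = np.array([110.0, 120.0, 130.0, 0.0, 140.0])   # 0 = unvoiced frame
pred = np.array([112.0, 118.0, 260.0, 0.0, 141.0])   # one octave error

voiced = ref > 0
gross = voiced & (np.abs(pred - ref) > 0.2 * ref)
gpe = 100.0 * gross.sum() / voiced.sum()             # percent of voiced frames
ok = voiced & ~gross
rms = np.sqrt(np.mean((pred[ok] - ref[ok]) ** 2))

print(gpe, round(float(rms), 2))                     # -> 25.0 1.73
```

The octave error at frame 3 is counted as a gross error and excluded from the RMS, which is why the two measures are reported separately.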

[Figure 3: Left: spectrogram of a mixture of two voices with ascending and descending f0 contours. Right: dominant and secondary f0 values estimated by NMF.]

[Figure 4: Effect of white noise (middle row) and an overlapping speaker (bottom row) on clean speech (top row); columns show the waveform, log power spectra, extracted partials Y, deconvolved X, and reconstructed Y. Both types of interference degrade the harmonic structure in the log power spectra (second column) and the partials extracted by IF estimation (third column). The results of nonnegative deconvolution (fourth column), however, are fairly robust: both the pitch of the original speaker at f0 = 200 Hz and that of the overlapping speaker at f0 = 300 Hz are recovered. The fifth column displays the reconstructed profile of extracted partials from the activated harmonic templates.]

5 Discussion

There exists a large body of related work on fundamental frequency estimation of overlapping sources [5, 7, 9, 19, 20, 21]. Our contributions in this paper are to develop a new framework based on recent advances in unsupervised learning and to study the problem under the constraints imposed by real-time system building. Nonnegative deconvolution is similar to EM algorithms [7] for harmonic template matching, but it does not impose normalization constraints on spectral peaks as if they represented a probability distribution. Important directions for future work are to train a richer set of harmonic templates by NMF, to incorporate the frame-based computations of nonnegative deconvolution into a dynamical model, and to embed our real-time system in interactive agents that respond to the pitch contours of human speech. All these directions are being actively pursued.

References

[1] T. Abe, T. Kobayashi, and S. Imai. Harmonics tracking and pitch extraction based on instantaneous frequency. In Proc. of ICASSP, 1995.
[2] T. Abe, T. Kobayashi, and S. Imai. Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. In Proc. of ICSLP, 1996.
[3] P. Bagshaw, S. M. Hiller, and M. A. Jack. Enhanced pitch tracking and the processing of f0 contours for computer aided intonation teaching. In Proc. of 3rd European Conference on Speech Communication and Technology, 1993.
[4] A. S. Bregman. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 2nd edition.
[5] A. de Cheveigné and H. Kawahara. Multiple period estimation and pitch perception model. Speech Communication, 27:175-185, 1999.
[6] J. Goldstein. An optimum processor theory for the central formation of the pitch of complex tones. J. Acoust. Soc. Am., 54:1496-1516, 1973.
[7] M. Goto. A robust predominant-F0 estimation method for real-time detection of melody and bass lines in CD recordings. In Proc. of ICASSP, June 2000.
[8] H. Kawahara, H. Katayose, A. de Cheveigné, and R. D. Patterson. Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity. In Proc. of EuroSpeech, 1999.
[9] A. Klapuri, T. Virtanen, and J.-M. Holm. Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals. In Proc. of COST-G6 Conference on Digital Audio Effects, Verona, Italy, 2000.
[10] D. D. Lee and H. S. Seung. Learning in intelligent embedded systems. In Proc. of USENIX Workshop on Embedded Systems, 1999.
[11] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788-791, 1999.
[12] Y. Lin, D. D. Lee, and L. K. Saul. Nonnegative deconvolution for time of arrival estimation. In Proc. of ICASSP, 2004.
[13] T. Nakatani and T. Irino. Robust fundamental frequency estimation against background noise and spectral distortion. In Proc. of ICSLP, 2002.
[14] F. Plante, G. F. Meyer, and W. A. Ainsworth. A pitch extraction reference database. In Proc. of EuroSpeech, 1995.
[15] L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun. Real time voice processing with audiovisual feedback: toward autonomous agents with perfect pitch. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15. MIT Press, 2003.
[16] L. K. Saul, F. Sha, and D. D. Lee. Statistical signal processing with nonnegativity constraints. In Proc. of EuroSpeech, 2003.
[17] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.
[18] D. Talkin. A robust algorithm for pitch tracking (RAPT). In W. B. Kleijn and K. K. Paliwal, editors, Speech Coding and Synthesis, chapter 14. Elsevier Science B.V., 1995.
[19] T. Tolonen and M. Karjalainen. A computationally efficient multipitch analysis model. IEEE Trans. on Speech and Audio Processing, 8(6):708-716, 2000.
[20] T. Virtanen and A. Klapuri. Separation of harmonic sounds using multipitch analysis and iterative parameter estimation. In Proc. of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 83-86, New Paltz, NY, USA, Oct 2001.
[21] M. Wu, D. Wang, and G. J. Brown. A multipitch tracking algorithm for noisy speech. IEEE Trans. on Speech and Audio Processing, 11:229-241, 2003.


More information

AN INVERTIBLE DISCRETE AUDITORY TRANSFORM

AN INVERTIBLE DISCRETE AUDITORY TRANSFORM COMM. MATH. SCI. Vol. 3, No. 1, pp. 47 56 c 25 International Press AN INVERTIBLE DISCRETE AUDITORY TRANSFORM JACK XIN AND YINGYONG QI Abstract. A discrete auditory transform (DAT) from sound signal to

More information

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator

Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator 1 Optimal Speech Enhancement Under Signal Presence Uncertainty Using Log-Spectral Amplitude Estimator Israel Cohen Lamar Signal Processing Ltd. P.O.Box 573, Yokneam Ilit 20692, Israel E-mail: icohen@lamar.co.il

More information

Lecture 7: Pitch and Chord (2) HMM, pitch detection functions. Li Su 2016/03/31

Lecture 7: Pitch and Chord (2) HMM, pitch detection functions. Li Su 2016/03/31 Lecture 7: Pitch and Chord (2) HMM, pitch detection functions Li Su 2016/03/31 Chord progressions Chord progressions are not arbitrary Example 1: I-IV-I-V-I (C-F-C-G-C) Example 2: I-V-VI-III-IV-I-II-V

More information

1 Introduction. 2 Data Set and Linear Spectral Analysis

1 Introduction. 2 Data Set and Linear Spectral Analysis Analogical model for self-sustained sounds generated by organ pipe E. DE LAURO, S. DE MARTINO, M. FALANGA Department of Physics Salerno University Via S. Allende, I-848, Baronissi (SA) ITALY Abstract:

More information

Dominant Feature Vectors Based Audio Similarity Measure

Dominant Feature Vectors Based Audio Similarity Measure Dominant Feature Vectors Based Audio Similarity Measure Jing Gu 1, Lie Lu 2, Rui Cai 3, Hong-Jiang Zhang 2, and Jian Yang 1 1 Dept. of Electronic Engineering, Tsinghua Univ., Beijing, 100084, China 2 Microsoft

More information

Single Channel Signal Separation Using MAP-based Subspace Decomposition

Single Channel Signal Separation Using MAP-based Subspace Decomposition Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,

More information

t Tao Group Limited, Reading, RG6 IAZ, U.K. wenwu.wang(ieee.org t Samsung Electronics Research Institute, Staines, TW1 8 4QE, U.K.

t Tao Group Limited, Reading, RG6 IAZ, U.K.   wenwu.wang(ieee.org t Samsung Electronics Research Institute, Staines, TW1 8 4QE, U.K. NON-NEGATIVE MATRIX FACTORIZATION FOR NOTE ONSET DETECTION OF AUDIO SIGNALS Wenwu Wangt, Yuhui Luol, Jonathon A. Chambers', and Saeid Sanei t Tao Group Limited, Reading, RG6 IAZ, U.K. Email: wenwu.wang(ieee.org

More information

University of Colorado at Boulder ECEN 4/5532. Lab 2 Lab report due on February 16, 2015

University of Colorado at Boulder ECEN 4/5532. Lab 2 Lab report due on February 16, 2015 University of Colorado at Boulder ECEN 4/5532 Lab 2 Lab report due on February 16, 2015 This is a MATLAB only lab, and therefore each student needs to turn in her/his own lab report and own programs. 1

More information

Environmental Sound Classification in Realistic Situations

Environmental Sound Classification in Realistic Situations Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon

More information

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm

More information

Introduction Basic Audio Feature Extraction

Introduction Basic Audio Feature Extraction Introduction Basic Audio Feature Extraction Vincent Koops (with slides by Meinhard Müller) Sound and Music Technology, December 6th, 2016 1 28 November 2017 Today g Main modules A. Sound and music for

More information

Improved Method for Epoch Extraction in High Pass Filtered Speech

Improved Method for Epoch Extraction in High Pass Filtered Speech Improved Method for Epoch Extraction in High Pass Filtered Speech D. Govind Center for Computational Engineering & Networking Amrita Vishwa Vidyapeetham (University) Coimbatore, Tamilnadu 642 Email: d

More information

Constrained Nonnegative Matrix Factorization with Applications to Music Transcription

Constrained Nonnegative Matrix Factorization with Applications to Music Transcription Constrained Nonnegative Matrix Factorization with Applications to Music Transcription by Daniel Recoskie A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the

More information

ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA. Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley

ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA. Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley ORTHOGONALITY-REGULARIZED MASKED NMF FOR LEARNING ON WEAKLY LABELED AUDIO DATA Iwona Sobieraj, Lucas Rencker, Mark D. Plumbley University of Surrey Centre for Vision Speech and Signal Processing Guildford,

More information

Gaussian Processes for Audio Feature Extraction

Gaussian Processes for Audio Feature Extraction Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline

More information

Single-channel source separation using non-negative matrix factorization

Single-channel source separation using non-negative matrix factorization Single-channel source separation using non-negative matrix factorization Mikkel N. Schmidt Technical University of Denmark mns@imm.dtu.dk www.mikkelschmidt.dk DTU Informatics Department of Informatics

More information

Application of the Tuned Kalman Filter in Speech Enhancement

Application of the Tuned Kalman Filter in Speech Enhancement Application of the Tuned Kalman Filter in Speech Enhancement Orchisama Das, Bhaswati Goswami and Ratna Ghosh Department of Instrumentation and Electronics Engineering Jadavpur University Kolkata, India

More information

NON-NEGATIVE SPARSE CODING

NON-NEGATIVE SPARSE CODING NON-NEGATIVE SPARSE CODING Patrik O. Hoyer Neural Networks Research Centre Helsinki University of Technology P.O. Box 9800, FIN-02015 HUT, Finland patrik.hoyer@hut.fi To appear in: Neural Networks for

More information

Computational Auditory Scene Analysis Based on Loudness/Pitch/Timbre Decomposition

Computational Auditory Scene Analysis Based on Loudness/Pitch/Timbre Decomposition Computational Auditory Scene Analysis Based on Loudness/Pitch/Timbre Decomposition Mototsugu Abe and Shigeru Ando Department of Mathematical Engineering and Information Physics Faculty of Engineering,

More information

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES Kedar Patki University of Rochester Dept. of Electrical and Computer Engineering kedar.patki@rochester.edu ABSTRACT The paper reviews the problem of

More information

Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement Patrick J. Wolfe Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK pjw47@eng.cam.ac.uk Simon J. Godsill

More information

CONSTRAINS ON THE MODEL FOR SELF-SUSTAINED SOUNDS GENERATED BY ORGAN PIPE INFERRED BY INDEPENDENT COMPONENT ANALYSIS

CONSTRAINS ON THE MODEL FOR SELF-SUSTAINED SOUNDS GENERATED BY ORGAN PIPE INFERRED BY INDEPENDENT COMPONENT ANALYSIS CONSTRAINS ON THE MODEL FOR SELF-SUSTAINED SOUNDS GENERATED BY ORGAN PIPE INFERRED BY INDEPENDENT COMPONENT ANALYSIS E. De Lauro S. De Martino M. Falanga G. Sarno Department of Physics, Salerno University,

More information

Sound 2: frequency analysis

Sound 2: frequency analysis COMP 546 Lecture 19 Sound 2: frequency analysis Tues. March 27, 2018 1 Speed of Sound Sound travels at about 340 m/s, or 34 cm/ ms. (This depends on temperature and other factors) 2 Wave equation Pressure

More information

Improved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR

Improved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR Improved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR Bengt J. Borgström, Student Member, IEEE, and Abeer Alwan, IEEE Fellow Abstract This paper

More information

Proc. of NCC 2010, Chennai, India

Proc. of NCC 2010, Chennai, India Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay

More information

2D Spectrogram Filter for Single Channel Speech Enhancement

2D Spectrogram Filter for Single Channel Speech Enhancement Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 007 89 D Spectrogram Filter for Single Channel Speech Enhancement HUIJUN DING,

More information

LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION.

LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION. LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION. Romain Hennequin Deezer R&D 10 rue d Athènes, 75009 Paris, France rhennequin@deezer.com

More information

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll

More information

Analysis of audio intercepts: Can we identify and locate the speaker?

Analysis of audio intercepts: Can we identify and locate the speaker? Motivation Analysis of audio intercepts: Can we identify and locate the speaker? K V Vijay Girish, PhD Student Research Advisor: Prof A G Ramakrishnan Research Collaborator: Dr T V Ananthapadmanabha Medical

More information

Sparse and Shift-Invariant Feature Extraction From Non-Negative Data

Sparse and Shift-Invariant Feature Extraction From Non-Negative Data MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Sparse and Shift-Invariant Feature Extraction From Non-Negative Data Paris Smaragdis, Bhiksha Raj, Madhusudana Shashanka TR8-3 August 8 Abstract

More information

SPEECH ANALYSIS AND SYNTHESIS

SPEECH ANALYSIS AND SYNTHESIS 16 Chapter 2 SPEECH ANALYSIS AND SYNTHESIS 2.1 INTRODUCTION: Speech signal analysis is used to characterize the spectral information of an input speech signal. Speech signal analysis [52-53] techniques

More information

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT.

CEPSTRAL ANALYSIS SYNTHESIS ON THE MEL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITHM FOR IT. CEPSTRAL ANALYSIS SYNTHESIS ON THE EL FREQUENCY SCALE, AND AN ADAPTATIVE ALGORITH FOR IT. Summarized overview of the IEEE-publicated papers Cepstral analysis synthesis on the mel frequency scale by Satochi

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

A METHOD OF ICA IN TIME-FREQUENCY DOMAIN

A METHOD OF ICA IN TIME-FREQUENCY DOMAIN A METHOD OF ICA IN TIME-FREQUENCY DOMAIN Shiro Ikeda PRESTO, JST Hirosawa 2-, Wako, 35-98, Japan Shiro.Ikeda@brain.riken.go.jp Noboru Murata RIKEN BSI Hirosawa 2-, Wako, 35-98, Japan Noboru.Murata@brain.riken.go.jp

More information

Robust Speaker Identification

Robust Speaker Identification Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }

More information

COMP 546, Winter 2018 lecture 19 - sound 2

COMP 546, Winter 2018 lecture 19 - sound 2 Sound waves Last lecture we considered sound to be a pressure function I(X, Y, Z, t). However, sound is not just any function of those four variables. Rather, sound obeys the wave equation: 2 I(X, Y, Z,

More information

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data

Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Joint Filtering and Factorization for Recovering Latent Structure from Noisy Speech Data Colin Vaz, Vikram Ramanarayanan, and Shrikanth Narayanan Ming Hsieh Department of Electrical Engineering University

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

Feature Extraction for ASR: Pitch

Feature Extraction for ASR: Pitch Feature Extraction for ASR: Pitch Wantee Wang 2015-03-14 16:55:51 +0800 Contents 1 Cross-correlation and Autocorrelation 1 2 Normalized Cross-Correlation Function 3 3 RAPT 4 4 Kaldi Pitch Tracker 5 Pitch

More information

NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING. Dylan Fagot, Herwig Wendt and Cédric Févotte

NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING. Dylan Fagot, Herwig Wendt and Cédric Févotte NONNEGATIVE MATRIX FACTORIZATION WITH TRANSFORM LEARNING Dylan Fagot, Herwig Wendt and Cédric Févotte IRIT, Université de Toulouse, CNRS, Toulouse, France firstname.lastname@irit.fr ABSTRACT Traditional

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Machine Recognition of Sounds in Mixtures

Machine Recognition of Sounds in Mixtures Machine Recognition of Sounds in Mixtures Outline 1 2 3 4 Computational Auditory Scene Analysis Speech Recognition as Source Formation Sound Fragment Decoding Results & Conclusions Dan Ellis

More information

CS229 Project: Musical Alignment Discovery

CS229 Project: Musical Alignment Discovery S A A V S N N R R S CS229 Project: Musical Alignment iscovery Woodley Packard ecember 16, 2005 Introduction Logical representations of musical data are widely available in varying forms (for instance,

More information

GAUSSIANIZATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS

GAUSSIANIZATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS GAUSSIANIATION METHOD FOR IDENTIFICATION OF MEMORYLESS NONLINEAR AUDIO SYSTEMS I. Marrakchi-Mezghani (1),G. Mahé (2), M. Jaïdane-Saïdane (1), S. Djaziri-Larbi (1), M. Turki-Hadj Alouane (1) (1) Unité Signaux

More information

Machine Learning for Signal Processing Non-negative Matrix Factorization

Machine Learning for Signal Processing Non-negative Matrix Factorization Machine Learning for Signal Processing Non-negative Matrix Factorization Class 10. 3 Oct 2017 Instructor: Bhiksha Raj With examples and slides from Paris Smaragdis 11755/18797 1 A Quick Recap X B W D D

More information

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 Chapter 9 Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 1 LPC Methods LPC methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification

More information

ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES. Sebastian Ewert Mark D. Plumbley Mark Sandler

ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES. Sebastian Ewert Mark D. Plumbley Mark Sandler ACCOUNTING FOR PHASE CANCELLATIONS IN NON-NEGATIVE MATRIX FACTORIZATION USING WEIGHTED DISTANCES Sebastian Ewert Mark D. Plumbley Mark Sandler Queen Mary University of London, London, United Kingdom ABSTRACT

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Elec4621 Advanced Digital Signal Processing Chapter 11: Time-Frequency Analysis

Elec4621 Advanced Digital Signal Processing Chapter 11: Time-Frequency Analysis Elec461 Advanced Digital Signal Processing Chapter 11: Time-Frequency Analysis Dr. D. S. Taubman May 3, 011 In this last chapter of your notes, we are interested in the problem of nding the instantaneous

More information

Frequency Domain Speech Analysis

Frequency Domain Speech Analysis Frequency Domain Speech Analysis Short Time Fourier Analysis Cepstral Analysis Windowed (short time) Fourier Transform Spectrogram of speech signals Filter bank implementation* (Real) cepstrum and complex

More information

10ème Congrès Français d Acoustique

10ème Congrès Français d Acoustique 1ème Congrès Français d Acoustique Lyon, 1-16 Avril 1 Spectral similarity measure invariant to pitch shifting and amplitude scaling Romain Hennequin 1, Roland Badeau 1, Bertrand David 1 1 Institut TELECOM,

More information

Deep NMF for Speech Separation

Deep NMF for Speech Separation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Deep NMF for Speech Separation Le Roux, J.; Hershey, J.R.; Weninger, F.J. TR2015-029 April 2015 Abstract Non-negative matrix factorization

More information

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 I am v Master Student v 6 months @ Music Technology Group, Universitat Pompeu Fabra v Deep learning for acoustic source separation v

More information

Machine Learning for Signal Processing Non-negative Matrix Factorization

Machine Learning for Signal Processing Non-negative Matrix Factorization Machine Learning for Signal Processing Non-negative Matrix Factorization Class 10. Instructor: Bhiksha Raj With examples from Paris Smaragdis 11755/18797 1 The Engineer and the Musician Once upon a time

More information

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540 REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ 84 fscott.rickard,radu.balan,justinian.roscag@scr.siemens.com

More information

FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION?

FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION? FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION? Tiago Fernandes Tavares, George Tzanetakis, Peter Driessen University of Victoria Department of

More information

c Springer, Reprinted with permission.

c Springer, Reprinted with permission. Zhijian Yuan and Erkki Oja. A FastICA Algorithm for Non-negative Independent Component Analysis. In Puntonet, Carlos G.; Prieto, Alberto (Eds.), Proceedings of the Fifth International Symposium on Independent

More information

Non-Negative Tensor Factorisation for Sound Source Separation

Non-Negative Tensor Factorisation for Sound Source Separation ISSC 2005, Dublin, Sept. -2 Non-Negative Tensor Factorisation for Sound Source Separation Derry FitzGerald, Matt Cranitch φ and Eugene Coyle* φ Dept. of Electronic Engineering, Cor Institute of Technology

More information

MANY digital speech communication applications, e.g.,

MANY digital speech communication applications, e.g., 406 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 2, FEBRUARY 2007 An MMSE Estimator for Speech Enhancement Under a Combined Stochastic Deterministic Speech Model Richard C.

More information

Robust Sound Event Detection in Continuous Audio Environments

Robust Sound Event Detection in Continuous Audio Environments Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University

More information

SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS

SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS 204 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS Hajar Momeni,2,,

More information

Cochlear modeling and its role in human speech recognition

Cochlear modeling and its role in human speech recognition Allen/IPAM February 1, 2005 p. 1/3 Cochlear modeling and its role in human speech recognition Miller Nicely confusions and the articulation index Jont Allen Univ. of IL, Beckman Inst., Urbana IL Allen/IPAM

More information

L6: Short-time Fourier analysis and synthesis

L6: Short-time Fourier analysis and synthesis L6: Short-time Fourier analysis and synthesis Overview Analysis: Fourier-transform view Analysis: filtering view Synthesis: filter bank summation (FBS) method Synthesis: overlap-add (OLA) method STFT magnitude

More information

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals

Zeros of z-transform(zzt) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals Zeros of z-transformzzt representation and chirp group delay processing for analysis of source and filter characteristics of speech signals Baris Bozkurt 1 Collaboration with LIMSI-CNRS, France 07/03/2017

More information

Sparseness Constraints on Nonnegative Tensor Decomposition

Sparseness Constraints on Nonnegative Tensor Decomposition Sparseness Constraints on Nonnegative Tensor Decomposition Na Li nali@clarksonedu Carmeliza Navasca cnavasca@clarksonedu Department of Mathematics Clarkson University Potsdam, New York 3699, USA Department

More information

Matrix Factorization for Speech Enhancement

Matrix Factorization for Speech Enhancement Matrix Factorization for Speech Enhancement Peter Li Peter.Li@nyu.edu Yijun Xiao ryjxiao@nyu.edu 1 Introduction In this report, we explore techniques for speech enhancement using matrix factorization.

More information

The effect of speaking rate and vowel context on the perception of consonants. in babble noise

The effect of speaking rate and vowel context on the perception of consonants. in babble noise The effect of speaking rate and vowel context on the perception of consonants in babble noise Anirudh Raju Department of Electrical Engineering, University of California, Los Angeles, California, USA anirudh90@ucla.edu

More information

Non-negative Matrix Factorization: Algorithms, Extensions and Applications

Non-negative Matrix Factorization: Algorithms, Extensions and Applications Non-negative Matrix Factorization: Algorithms, Extensions and Applications Emmanouil Benetos www.soi.city.ac.uk/ sbbj660/ March 2013 Emmanouil Benetos Non-negative Matrix Factorization March 2013 1 / 25

More information

David Weenink. First semester 2007

David Weenink. First semester 2007 Institute of Phonetic Sciences University of Amsterdam First semester 2007 Definition (ANSI: In Psycho-acoustics) is that auditory attribute of sound according to which sounds can be ordered on a scale

More information

Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting

Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting Recovery of Compactly Supported Functions from Spectrogram Measurements via Lifting Mark Iwen markiwen@math.msu.edu 2017 Friday, July 7 th, 2017 Joint work with... Sami Merhi (Michigan State University)

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Relative-Pitch Tracking of Multiple Arbitrary Sounds

Relative-Pitch Tracking of Multiple Arbitrary Sounds Relative-Pitch Tracking of Multiple Arbitrary Sounds Paris Smaragdis a) Adobe Systems Inc., 275 Grove St. Newton, MA 02466, USA Perceived-pitch tracking of potentially aperiodic sounds, as well as pitch

More information

MULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES. Mingzi Li,Israel Cohen and Saman Mousazadeh

MULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES. Mingzi Li,Israel Cohen and Saman Mousazadeh MULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES Mingzi Li,Israel Cohen and Saman Mousazadeh Department of Electrical Engineering, Technion - Israel

More information

Bayesian Regularization and Nonnegative Deconvolution for Room Impulse Response Estimation

Bayesian Regularization and Nonnegative Deconvolution for Room Impulse Response Estimation IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 3, MARCH 2006 839 Bayesian Regularization and Nonnegative Deconvolution for Room Impulse Response Estimation Yuanqing Lin and Daniel D. Lee Abstract

More information