Harmonic Structure Transform for Speaker Recognition
Harmonic Structure Transform for Speaker Recognition
Kornel Laskowski & Qin Jin
Carnegie Mellon University, Pittsburgh PA, USA
KTH Speech, Music & Hearing, Stockholm, Sweden
29 August 2011
Laskowski & Jin, INTERSPEECH 2011, Firenze, Italy

Spectral Transforms in General
Given x, the energy spectrum of a speech frame,

    y = F^{-1}( log( M^T x ) )

up to a normalization term. The matrix M is a filterbank; M defines the number of filters, and their central frequencies, widths, and general shapes. Importantly here, the filters of all such filterbanks integrate energy across frequencies related by adjacency.

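The general transform above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the inverse Fourier transform F^{-1} is approximated here by a DCT-II, as in standard MFCC practice, and all names and sizes are illustrative.

```python
import numpy as np

def filterbank_cepstra(x, M, n_coeffs=20):
    """Generic spectral transform y = DCT(log(M^T x)).

    x : (n_bins,) energy spectrum of one speech frame
    M : (n_bins, n_filters) filterbank matrix; each column is one filter
    """
    energies = M.T @ x                           # integrate energy per filter
    log_e = np.log(np.maximum(energies, 1e-10))  # floor to avoid log(0)
    # DCT-II as the decorrelating transform (stands in for F^{-1})
    n = M.shape[1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct @ log_e
```

With a mel-shaped M this yields MFCC-like coefficients; the point of the talk is that the same pipeline accepts any filterbank matrix.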
The Harmonic Structure Transform (HST)
In contrast, the HST is implemented by a matrix H whose columns are comb filters. Each filter integrates energy across frequencies related by harmonicity, not adjacency.
- novel for speaker recognition (Laskowski & Jin, 2010)
- related to work on pitch detection (Liénard, Barras & Signol, 2008)
- unknown: the number of filters, and their fundamental frequencies, tooth widths, and individual tooth shapes

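A comb-filter matrix H of this kind can be sketched as follows. This is a hedged illustration, not the paper's construction: the function name, the number of harmonics, and the tooth width are all assumptions.

```python
import numpy as np

def comb_filterbank(freqs, f0_list, n_harmonics=20, tooth_width=20.0):
    """Build a comb-filter matrix H: one column per candidate fundamental f0.

    Each column places a triangular tooth at every harmonic k*f0, so the
    filter integrates energy across frequencies related by harmonicity.
    freqs: FFT bin center frequencies in Hz.
    """
    H = np.zeros((len(freqs), len(f0_list)))
    for j, f0 in enumerate(f0_list):
        for k in range(1, n_harmonics + 1):
            center = k * f0
            if center > freqs[-1]:
                break
            # triangular tooth of half-width tooth_width around the harmonic
            tooth = np.maximum(0.0, 1.0 - np.abs(freqs - center) / tooth_width)
            H[:, j] += tooth
    return H
```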
Outline of this Talk
1. Baseline performance: what is known?
2. Experiments in HSCC filterbank design: linear spacing in fundamental frequency; piecewise linear spacing in fundamental frequency; logarithmic spacing in fundamental frequency; fundamental frequency range and density
3. Score-level fusion with standard MFCCs
4. Generalization
5. Conclusions

HST Processing
(figure: frame FFT x_t alongside an idealized FFT, i.e. a comb filter h, as a function of filter index i, with fundamentals f_h[i-1], f_h[i], f_h[i+1])
- analysis every 8 ms, frames 32 ms wide
- comb filter teeth triangular (global width parameter)
- filters linearly spanning 50 Hz to 450 Hz
- logarithm at each filter output, then normalization
- decorrelation using LDA yields harmonic structure cepstral coefficients (HSCCs)

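The per-frame steps listed above (framing, FFT, comb-filter integration, logarithm) can be sketched as below. This is an assumption-laden sketch, not the authors' code: the Hann window and the `hscc_frontend` name are illustrative, and the LDA decorrelation, which needs labeled training data, is left as a separate step.

```python
import numpy as np

def hscc_frontend(signal, sr, H, frame_ms=32, step_ms=8):
    """Per-frame HST analysis prior to LDA: 32 ms frames every 8 ms,
    FFT energy spectrum, comb-filter outputs, then log.

    H must have one row per rfft bin (frame_len // 2 + 1 rows).
    """
    frame_len = int(sr * frame_ms / 1000)
    step = int(sr * step_ms / 1000)
    window = np.hanning(frame_len)  # window choice is an assumption
    feats = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2        # energy spectrum x
        feats.append(np.log(np.maximum(H.T @ spec, 1e-10)))
    return np.array(feats)                            # (n_frames, n_filters)
```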
HSCC Modeling for Classification
As simple as possible: one GMM per speaker.
1. assume one Gaussian element
2. determine the optimal number N_D of LDA dimensions
3. hold N_D fixed
4. determine the optimal number N_G of Gaussians
Maximum-likelihood closed-set classification (MAP under a uniform prior).

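The classification recipe can be sketched with the simplest case on the slide, a single diagonal-covariance Gaussian per speaker. This is an illustration, not the paper's implementation; growing to N_G mixture components would replace the closed-form fit with EM (e.g. scikit-learn's GaussianMixture).

```python
import numpy as np

def train_speaker_models(feats_by_speaker):
    """One diagonal-covariance Gaussian per speaker: mean and variance
    per dimension, with a small floor on the variance."""
    return {spk: (X.mean(axis=0), X.var(axis=0) + 1e-6)
            for spk, X in feats_by_speaker.items()}

def classify(X, models):
    """Maximum-likelihood closed-set decision: sum per-frame Gaussian
    log-likelihoods under each speaker model and take the argmax
    (equivalent to MAP under a uniform prior)."""
    best, best_ll = None, -np.inf
    for spk, (mu, var) in models.items():
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        if ll > best_ll:
            best, best_ll = spk, ll
    return best
```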
Available Results (Laskowski & Jin, ODYSSEY 2010)
- Wall Street Journal data, mostly read speech
- 100-way closed-set classification, per gender
- second trials, per gender and dataset
- matched channel and matched multi-session conditions
(table: Dev and Test accuracies for the HST/LDA, MEL/DCT, and MEL/LDA systems, female and male; numeric values not preserved in this transcription)

Session Mismatch
- MIXER5 data, various speaking styles
- 66-way closed-set classification
- second trials, per dataset
- matched channel and matched session: accuracies of 100%
- matched channel but mismatched session:
(table: Dev and Test accuracies for the HST/LDA, MEL/DCT, and MEL/LDA systems; numeric values not preserved in this transcription)

Linear Spacing of Fundamental Frequencies
(figure, built up over several slides: candidate fundamentals placed at equal intervals along a 50 Hz to 850 Hz axis)

Piecewise Linear Spacing of Fundamental Frequencies
(figure, built up over several slides: candidate fundamentals spaced linearly within segments of a 50 Hz to 850 Hz axis)

Logarithmic Spacing of Fundamental Frequencies
In the linear case, the N_h fundamental frequencies f_h between f_h^min and f_h^max are

    f_h[i] = f_h^min + ((i - 1) / (N_h - 1)) (f_h^max - f_h^min),   1 <= i <= N_h.

In the logarithmic case,

    f_h[i] = f_h^min (f_h^max / f_h^min)^((i - 1) / (N_h - 1)),   1 <= i <= N_h.

(figure: frequency f_h versus filter index i for the Baseline Linear, PW Linear, and Logarithmic spacings; Baseline Linear accuracy 34.4)

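The two spacing formulas translate directly into a small helper (a minimal sketch; the function name and `spacing` flag are illustrative, and indices run from 0 rather than 1):

```python
import numpy as np

def f0_grid(f_min, f_max, n, spacing="linear"):
    """Candidate fundamentals f_h[i] for i = 1..n, per the two formulas:
    equal differences (linear) or equal ratios (logarithmic) between
    successive fundamentals."""
    i = np.arange(n)
    if spacing == "linear":
        return f_min + i * (f_max - f_min) / (n - 1)
    return f_min * (f_max / f_min) ** (i / (n - 1))
```

For example, `f0_grid(50, 450, 5)` spaces fundamentals 100 Hz apart, while the logarithmic grid keeps the ratio between neighbours constant.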
Range and Density of Fundamental Frequencies
Three manipulations:
- f_h^min: raise to 62.5 Hz
- f_h^max: raise to maximize accuracy
- N_h: lower to maximize accuracy (to 1129)
(figure, built up over several slides: frequency f_h versus filter index i for the Baseline Linear, PW Linear, Logarithmic, Raise LB, Raise UB, and Lower N configurations)

Interpolation with Standard MFCCs
- linear interpolation with DCT-decorrelated log-mel energies
- 20 coefficients; 25 coefficients (always worse)
(figure: accuracy versus interpolation parameter for MEL/DCT 20, MEL/DCT 25, HSCC, and MEL/DCT + HSCC)

Interpolation with LDA-Rotated MFCCs
- linear interpolation with LDA-decorrelated log-mel energies
- 20 coefficients (better in combination); 25 coefficients (better alone)
- gain of 10.6% absolute
(figure: accuracy versus interpolation parameter for MEL/LDA 20, MEL/LDA 25, HSCC, and MEL/LDA + HSCC)

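Score-level fusion of this kind reduces to a one-parameter linear interpolation between the two systems' per-speaker scores, with the parameter tuned on the development set. A minimal sketch (names illustrative):

```python
def fuse_scores(s_a, s_b, alpha):
    """Linear score-level interpolation between two systems:
    s = alpha * s_a + (1 - alpha) * s_b, per speaker.
    alpha = 1 recovers system A alone; alpha = 0 recovers system B."""
    return {spk: alpha * s_a[spk] + (1.0 - alpha) * s_b[spk] for spk in s_a}
```

In practice one sweeps alpha over [0, 1] on the dev set and keeps the value that maximizes closed-set accuracy.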
Summary of Accuracy on DevSet
(figure: DevSet and EvalSet accuracies for the Baseline Linear, PW Linear, Logarithmic, Raise LB, Raise UB, and Lower N configurations, plus MEL/DCT + HSCC and MEL/LDA + HSCC fusion; annotated gains of 10.7% absolute / 33.5% relative for filterbank optimization, 27.6% relative for MEL/DCT + HSCC, and 31.8% relative for MEL/LDA + HSCC)

Accuracy on EvalSet
(same figure, built up over several slides, with the EvalSet results added)

Conclusions
1. evaluated the baseline transform under session mismatch; viable: twice as many errors as an equivalent MFCC system
2. errors can be reduced by a third (33.5% relative) by optimizing the number of filters in the filterbank and the fundamental frequency corresponding to each filter
3. logarithmic spacing of fundamental frequencies is better than linear spacing: more filters for low fundamental frequencies, fewer filters for high fundamental frequencies
4. in an equivalent MFCC system, errors can be reduced by almost a third (31.8% relative) via score-level fusion with the improved HST system

Future Directions
1. change the framing policy from 8 ms / 32 ms to something longer: intonation and voice quality are supra-segmental, and larger temporal support gives greater spectral resolution
2. optimize the tooth shape of the comb filters
3. find a data-independent decorrelation transform leading to a compact (< 25 coefficients) representation
4. explore adaptation from a universal background model
5. generalize to binary speaker verification (and NIST SREs)

Potential Impact
1. a new general representation of the spectrum
2. deliberately orthogonal to spectral-envelope features (MFCCs, LPCCs, etc.), but computed in an identical manner
3. likely beneficial not only for speaker recognition, but also for online speaker diarization, classification of emotional speech, and clinical voice quality assessment

Some Interesting Insights
1. currently in the HST, spectral energy below 300 Hz is zeroed out; this improves closed-set speaker classification, but the fundamental (zeroth harmonic) is ignored, and the fundamental is thought to play a role in emotional expression
2. that the optimal filter spacing is logarithmic is curious: it is independent of the logarithmic tonotopicity of the basilar membrane, and points to greater acuity in discriminating among harmonic sounds (not just pure tones)

THANK YOU
More informationSession 1: Pattern Recognition
Proc. Digital del Continguts Musicals Session 1: Pattern Recognition 1 2 3 4 5 Music Content Analysis Pattern Classification The Statistical Approach Distribution Models Singing Detection Dan Ellis
More informationImproving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data
Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2
More informationAround the Speaker De-Identification (Speaker diarization for de-identification ++) Itshak Lapidot Moez Ajili Jean-Francois Bonastre
Around the Speaker De-Identification (Speaker diarization for de-identification ++) Itshak Lapidot Moez Ajili Jean-Francois Bonastre The 2 Parts HDM based diarization System The homogeneity measure 2 Outline
More informationL8: Source estimation
L8: Source estimation Glottal and lip radiation models Closed-phase residual analysis Voicing/unvoicing detection Pitch detection Epoch detection This lecture is based on [Taylor, 2009, ch. 11-12] Introduction
More informationMaximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems
Maximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems Chin-Hung Sit 1, Man-Wai Mak 1, and Sun-Yuan Kung 2 1 Center for Multimedia Signal Processing Dept. of
More informationLecture 5: GMM Acoustic Modeling and Feature Extraction
CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic
More informationGain Compensation for Fast I-Vector Extraction over Short Duration
INTERSPEECH 27 August 2 24, 27, Stockholm, Sweden Gain Compensation for Fast I-Vector Extraction over Short Duration Kong Aik Lee and Haizhou Li 2 Institute for Infocomm Research I 2 R), A STAR, Singapore
More informationNon-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology
Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation
More informationEnvironmental Sound Classification in Realistic Situations
Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon
More informationUncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition
Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann Odyssey Speaker and Language Recognition Workshop
More informationSPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS
SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS by Jinjin Ye, B.S. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for
More informationPHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS
PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu
More informationRobust Sound Event Detection in Continuous Audio Environments
Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University
More informationAUDIO-VISUAL RELIABILITY ESTIMATES USING STREAM
SCHOOL OF ENGINEERING - STI ELECTRICAL ENGINEERING INSTITUTE SIGNAL PROCESSING LABORATORY Mihai Gurban ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE EPFL - FSTI - IEL - LTS Station Switzerland-5 LAUSANNE Phone:
More informationVoice Activity Detection Using Pitch Feature
Voice Activity Detection Using Pitch Feature Presented by: Shay Perera 1 CONTENTS Introduction Related work Proposed Improvement References Questions 2 PROBLEM speech Non speech Speech Region Non Speech
More informationPattern Recognition Applied to Music Signals
JHU CLSP Summer School Pattern Recognition Applied to Music Signals 2 3 4 5 Music Content Analysis Classification and Features Statistical Pattern Recognition Gaussian Mixtures and Neural Nets Singing
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationUnifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication
Unifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication Aleksandr Sizov 1, Kong Aik Lee, Tomi Kinnunen 1 1 School of Computing, University of Eastern Finland, Finland Institute
More informationFuzzy Support Vector Machines for Automatic Infant Cry Recognition
Fuzzy Support Vector Machines for Automatic Infant Cry Recognition Sandra E. Barajas-Montiel and Carlos A. Reyes-García Instituto Nacional de Astrofisica Optica y Electronica, Luis Enrique Erro #1, Tonantzintla,
More informationNonnegative Matrix Factor 2-D Deconvolution for Blind Single Channel Source Separation
Nonnegative Matrix Factor 2-D Deconvolution for Blind Single Channel Source Separation Mikkel N. Schmidt and Morten Mørup Technical University of Denmark Informatics and Mathematical Modelling Richard
More informationDeep Learning for Speech Recognition. Hung-yi Lee
Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New
More informationLECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES
LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch
More informationCorrespondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure
Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars
More informationMonaural speech separation using source-adapted models
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications
More informationMultimedia Networking ECE 599
Multimedia Networking ECE 599 Prof. Thinh Nguyen School of Electrical Engineering and Computer Science Based on lectures from B. Lee, B. Girod, and A. Mukherjee 1 Outline Digital Signal Representation
More informationSpeaker Identification Based On Discriminative Vector Quantization And Data Fusion
University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Speaker Identification Based On Discriminative Vector Quantization And Data Fusion 2005 Guangyu Zhou
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models
Statistical NLP Spring 2010 The Noisy Channel Model Lecture 9: Acoustic Models Dan Klein UC Berkeley Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions Language model: Distributions
More information