Harmonic Structure Transform for Speaker Recognition
Harmonic Structure Transform for Speaker Recognition
Kornel Laskowski & Qin Jin
Carnegie Mellon University, Pittsburgh PA, USA
KTH Speech, Music & Hearing, Stockholm, Sweden
29 August 2011
Laskowski & Jin, INTERSPEECH 2011, Firenze, Italy

Spectral Transforms in General
Given x, the energy spectrum of a speech frame,

    y = F^{-1}( log( M^T x ) )

up to a normalization term. The matrix M is a filterbank; M defines the number of filters, and their central frequencies, widths, and general shapes. Importantly here, the filters of all such filterbanks integrate energy across frequencies related by adjacency.

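The general transform above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the inverse Fourier transform F^{-1} is approximated here by a DCT-II, as in standard MFCC practice, and all names and sizes are illustrative.

```python
import numpy as np

def filterbank_cepstra(x, M, n_coeffs=20):
    """Generic spectral transform y = DCT(log(M^T x)).

    x : (n_bins,) energy spectrum of one speech frame
    M : (n_bins, n_filters) filterbank matrix; each column is one filter
    """
    energies = M.T @ x                           # integrate energy per filter
    log_e = np.log(np.maximum(energies, 1e-10))  # floor to avoid log(0)
    # DCT-II as the decorrelating transform (stands in for F^{-1})
    n = M.shape[1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return dct @ log_e
```

With a mel-shaped M this yields MFCC-like coefficients; the point of the talk is that the same pipeline accepts any filterbank matrix.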
The Harmonic Structure Transform (HST)
In contrast, the HST is implemented by a matrix H whose columns are comb filters. Each filter integrates energy across frequencies related by harmonicity, not adjacency.
- novel for speaker recognition (Laskowski & Jin, 2010)
- related to work on pitch detection (Liénard, Barras & Signol, 2008)
- unknown: the number of filters, and their fundamental frequencies, tooth widths, and individual tooth shapes

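A comb-filter matrix H of this kind can be sketched as follows. This is a hedged illustration, not the paper's construction: the function name, the number of harmonics, and the tooth width are all assumptions.

```python
import numpy as np

def comb_filterbank(freqs, f0_list, n_harmonics=20, tooth_width=20.0):
    """Build a comb-filter matrix H: one column per candidate fundamental f0.

    Each column places a triangular tooth at every harmonic k*f0, so the
    filter integrates energy across frequencies related by harmonicity.
    freqs: FFT bin center frequencies in Hz.
    """
    H = np.zeros((len(freqs), len(f0_list)))
    for j, f0 in enumerate(f0_list):
        for k in range(1, n_harmonics + 1):
            center = k * f0
            if center > freqs[-1]:
                break
            # triangular tooth of half-width tooth_width around the harmonic
            tooth = np.maximum(0.0, 1.0 - np.abs(freqs - center) / tooth_width)
            H[:, j] += tooth
    return H
```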
Outline of this Talk
1. Baseline performance: what is known?
2. Experiments in HSCC filterbank design: linear spacing in fundamental frequency; piecewise linear spacing in fundamental frequency; logarithmic spacing in fundamental frequency; fundamental frequency range and density
3. Score-level fusion with standard MFCCs
4. Generalization
5. Conclusions

HST Processing
(figure: frame FFT x_t alongside an idealized FFT, i.e. a comb filter h, as a function of filter index i, with fundamentals f_h[i-1], f_h[i], f_h[i+1])
- analysis every 8 ms, frames 32 ms wide
- comb filter teeth triangular (global width parameter)
- filters linearly spanning 50 Hz to 450 Hz
- logarithm at each filter output, then normalization
- decorrelation using LDA yields harmonic structure cepstral coefficients (HSCCs)

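The per-frame steps listed above (framing, FFT, comb-filter integration, logarithm) can be sketched as below. This is an assumption-laden sketch, not the authors' code: the Hann window and the `hscc_frontend` name are illustrative, and the LDA decorrelation, which needs labeled training data, is left as a separate step.

```python
import numpy as np

def hscc_frontend(signal, sr, H, frame_ms=32, step_ms=8):
    """Per-frame HST analysis prior to LDA: 32 ms frames every 8 ms,
    FFT energy spectrum, comb-filter outputs, then log.

    H must have one row per rfft bin (frame_len // 2 + 1 rows).
    """
    frame_len = int(sr * frame_ms / 1000)
    step = int(sr * step_ms / 1000)
    window = np.hanning(frame_len)  # window choice is an assumption
    feats = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        spec = np.abs(np.fft.rfft(frame)) ** 2        # energy spectrum x
        feats.append(np.log(np.maximum(H.T @ spec, 1e-10)))
    return np.array(feats)                            # (n_frames, n_filters)
```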
HSCC Modeling for Classification
As simple as possible: one GMM per speaker.
1. assume one Gaussian element
2. determine the optimal number N_D of LDA dimensions
3. hold N_D fixed
4. determine the optimal number N_G of Gaussians
Maximum-likelihood closed-set classification (MAP under a uniform prior).

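The classification recipe can be sketched with the simplest case on the slide, a single diagonal-covariance Gaussian per speaker. This is an illustration, not the paper's implementation; growing to N_G mixture components would replace the closed-form fit with EM (e.g. scikit-learn's GaussianMixture).

```python
import numpy as np

def train_speaker_models(feats_by_speaker):
    """One diagonal-covariance Gaussian per speaker: mean and variance
    per dimension, with a small floor on the variance."""
    return {spk: (X.mean(axis=0), X.var(axis=0) + 1e-6)
            for spk, X in feats_by_speaker.items()}

def classify(X, models):
    """Maximum-likelihood closed-set decision: sum per-frame Gaussian
    log-likelihoods under each speaker model and take the argmax
    (equivalent to MAP under a uniform prior)."""
    best, best_ll = None, -np.inf
    for spk, (mu, var) in models.items():
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        if ll > best_ll:
            best, best_ll = spk, ll
    return best
```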
Available Results (Laskowski & Jin, ODYSSEY 2010)
- Wall Street Journal data, mostly read speech
- 100-way closed-set classification, per gender
- second trials, per gender and dataset
- matched channel and matched multi-session conditions
(table: Dev and Test accuracies for the HST/LDA, MEL/DCT, and MEL/LDA systems, female and male; numeric values not preserved in this transcription)

Session Mismatch
- MIXER5 data, various speaking styles
- 66-way closed-set classification
- second trials, per dataset
- matched channel and matched session: accuracies of 100%
- matched channel but mismatched session:
(table: Dev and Test accuracies for the HST/LDA, MEL/DCT, and MEL/LDA systems; numeric values not preserved in this transcription)

Linear Spacing of Fundamental Frequencies
(figure, built up over several slides: candidate fundamentals placed at equal intervals along a 50 Hz to 850 Hz axis)

Piecewise Linear Spacing of Fundamental Frequencies
(figure, built up over several slides: candidate fundamentals spaced linearly within segments of a 50 Hz to 850 Hz axis)

Logarithmic Spacing of Fundamental Frequencies
In the linear case, the N_h fundamental frequencies f_h between f_h^min and f_h^max are

    f_h[i] = f_h^min + ((i - 1) / (N_h - 1)) (f_h^max - f_h^min),   1 <= i <= N_h.

In the logarithmic case,

    f_h[i] = f_h^min (f_h^max / f_h^min)^((i - 1) / (N_h - 1)),   1 <= i <= N_h.

(figure: frequency f_h versus filter index i for the Baseline Linear, PW Linear, and Logarithmic spacings; Baseline Linear accuracy 34.4)

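The two spacing formulas translate directly into a small helper (a minimal sketch; the function name and `spacing` flag are illustrative, and indices run from 0 rather than 1):

```python
import numpy as np

def f0_grid(f_min, f_max, n, spacing="linear"):
    """Candidate fundamentals f_h[i] for i = 1..n, per the two formulas:
    equal differences (linear) or equal ratios (logarithmic) between
    successive fundamentals."""
    i = np.arange(n)
    if spacing == "linear":
        return f_min + i * (f_max - f_min) / (n - 1)
    return f_min * (f_max / f_min) ** (i / (n - 1))
```

For example, `f0_grid(50, 450, 5)` spaces fundamentals 100 Hz apart, while the logarithmic grid keeps the ratio between neighbours constant.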
Range and Density of Fundamental Frequencies
Three manipulations:
- f_h^min: raise to 62.5 Hz
- f_h^max: raise to maximize accuracy
- N_h: lower to maximize accuracy (to 1129)
(figure, built up over several slides: frequency f_h versus filter index i for the Baseline Linear, PW Linear, Logarithmic, Raise LB, Raise UB, and Lower N configurations)

Interpolation with Standard MFCCs
- linear interpolation with DCT-decorrelated log-mel energies
- 20 coefficients; 25 coefficients (always worse)
(figure: accuracy versus interpolation parameter for MEL/DCT 20, MEL/DCT 25, HSCC, and MEL/DCT + HSCC)

Interpolation with LDA-Rotated MFCCs
- linear interpolation with LDA-decorrelated log-mel energies
- 20 coefficients (better in combination); 25 coefficients (better alone)
- gain of 10.6% absolute
(figure: accuracy versus interpolation parameter for MEL/LDA 20, MEL/LDA 25, HSCC, and MEL/LDA + HSCC)

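Score-level fusion of this kind reduces to a one-parameter linear interpolation between the two systems' per-speaker scores, with the parameter tuned on the development set. A minimal sketch (names illustrative):

```python
def fuse_scores(s_a, s_b, alpha):
    """Linear score-level interpolation between two systems:
    s = alpha * s_a + (1 - alpha) * s_b, per speaker.
    alpha = 1 recovers system A alone; alpha = 0 recovers system B."""
    return {spk: alpha * s_a[spk] + (1.0 - alpha) * s_b[spk] for spk in s_a}
```

In practice one sweeps alpha over [0, 1] on the dev set and keeps the value that maximizes closed-set accuracy.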
Summary of Accuracy on DevSet
(figure: DevSet and EvalSet accuracies for the Baseline Linear, PW Linear, Logarithmic, Raise LB, Raise UB, and Lower N configurations, plus MEL/DCT + HSCC and MEL/LDA + HSCC fusion; annotated gains of 10.7% absolute / 33.5% relative for filterbank optimization, 27.6% relative for MEL/DCT + HSCC, and 31.8% relative for MEL/LDA + HSCC)

Accuracy on EvalSet
(same figure, built up over several slides, with the EvalSet results added)

Conclusions
1. evaluated the baseline transform under session mismatch; viable: twice as many errors as an equivalent MFCC system
2. errors can be reduced by a third (33.5% relative) by optimizing the number of filters in the filterbank and the fundamental frequency corresponding to each filter
3. logarithmic spacing of fundamental frequencies is better than linear spacing: more filters for low fundamental frequencies, fewer filters for high fundamental frequencies
4. in an equivalent MFCC system, errors can be reduced by almost a third (31.8% relative) via score-level fusion with the improved HST system

Future Directions
1. change the framing policy from 8 ms / 32 ms to something longer: intonation and voice quality are supra-segmental, and larger temporal support gives greater spectral resolution
2. optimize the tooth shape of the comb filters
3. find a data-independent decorrelation transform leading to a compact (< 25 coefficients) representation
4. explore adaptation from a universal background model
5. generalize to binary speaker verification (and NIST SREs)

Potential Impact
1. a new general representation of the spectrum
2. deliberately orthogonal to spectral-envelope features (MFCCs, LPCCs, etc.), but computed in an identical manner
3. likely beneficial not only for speaker recognition, but also for online speaker diarization, classification of emotional speech, and clinical voice quality assessment

Some Interesting Insights
1. currently in the HST, spectral energy below 300 Hz is zeroed out; this improves closed-set speaker classification, but the fundamental (zeroth harmonic) is ignored, and the fundamental is thought to play a role in emotional expression
2. that the optimal filter spacing is logarithmic is curious: it is independent of the logarithmic tonotopicity of the basilar membrane, and points to greater acuity in discriminating among harmonic sounds (not just pure tones)

THANK YOU
More informationSession 1: Pattern Recognition
Proc. Digital del Continguts Musicals Session 1: Pattern Recognition 1 2 3 4 5 Music Content Analysis Pattern Classification The Statistical Approach Distribution Models Singing Detection Dan Ellis
More informationImproving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data
Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2
More informationAround the Speaker De-Identification (Speaker diarization for de-identification ++) Itshak Lapidot Moez Ajili Jean-Francois Bonastre
Around the Speaker De-Identification (Speaker diarization for de-identification ++) Itshak Lapidot Moez Ajili Jean-Francois Bonastre The 2 Parts HDM based diarization System The homogeneity measure 2 Outline
More informationL8: Source estimation
L8: Source estimation Glottal and lip radiation models Closed-phase residual analysis Voicing/unvoicing detection Pitch detection Epoch detection This lecture is based on [Taylor, 2009, ch. 11-12] Introduction
More informationMaximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems
Maximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems Chin-Hung Sit 1, Man-Wai Mak 1, and Sun-Yuan Kung 2 1 Center for Multimedia Signal Processing Dept. of
More informationLecture 5: GMM Acoustic Modeling and Feature Extraction
CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic
More informationGain Compensation for Fast I-Vector Extraction over Short Duration
INTERSPEECH 27 August 2 24, 27, Stockholm, Sweden Gain Compensation for Fast I-Vector Extraction over Short Duration Kong Aik Lee and Haizhou Li 2 Institute for Infocomm Research I 2 R), A STAR, Singapore
More informationNon-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology
Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation
More informationEnvironmental Sound Classification in Realistic Situations
Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universistat Ramon
More informationUncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition
Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann Odyssey Speaker and Language Recognition Workshop
More informationSPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS
SPEECH RECOGNITION USING TIME DOMAIN FEATURES FROM PHASE SPACE RECONSTRUCTIONS by Jinjin Ye, B.S. A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL IN PARTIAL FULFILLMENT OF THE REQUIREMENTS for
More informationPHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS
PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu
More informationRobust Sound Event Detection in Continuous Audio Environments
Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University
More informationAUDIO-VISUAL RELIABILITY ESTIMATES USING STREAM
SCHOOL OF ENGINEERING - STI ELECTRICAL ENGINEERING INSTITUTE SIGNAL PROCESSING LABORATORY Mihai Gurban ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE EPFL - FSTI - IEL - LTS Station Switzerland-5 LAUSANNE Phone:
More informationVoice Activity Detection Using Pitch Feature
Voice Activity Detection Using Pitch Feature Presented by: Shay Perera 1 CONTENTS Introduction Related work Proposed Improvement References Questions 2 PROBLEM speech Non speech Speech Region Non Speech
More informationPattern Recognition Applied to Music Signals
JHU CLSP Summer School Pattern Recognition Applied to Music Signals 2 3 4 5 Music Content Analysis Classification and Features Statistical Pattern Recognition Gaussian Mixtures and Neural Nets Singing
More informationSpeaker Representation and Verification Part II. by Vasileios Vasilakakis
Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation
More informationUnifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication
Unifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication Aleksandr Sizov 1, Kong Aik Lee, Tomi Kinnunen 1 1 School of Computing, University of Eastern Finland, Finland Institute
More informationFuzzy Support Vector Machines for Automatic Infant Cry Recognition
Fuzzy Support Vector Machines for Automatic Infant Cry Recognition Sandra E. Barajas-Montiel and Carlos A. Reyes-García Instituto Nacional de Astrofisica Optica y Electronica, Luis Enrique Erro #1, Tonantzintla,
More informationNonnegative Matrix Factor 2-D Deconvolution for Blind Single Channel Source Separation
Nonnegative Matrix Factor 2-D Deconvolution for Blind Single Channel Source Separation Mikkel N. Schmidt and Morten Mørup Technical University of Denmark Informatics and Mathematical Modelling Richard
More informationDeep Learning for Speech Recognition. Hung-yi Lee
Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New
More informationLECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES
LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch
More informationCorrespondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure
Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars
More informationMonaural speech separation using source-adapted models
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications
More informationMultimedia Networking ECE 599
Multimedia Networking ECE 599 Prof. Thinh Nguyen School of Electrical Engineering and Computer Science Based on lectures from B. Lee, B. Girod, and A. Mukherjee 1 Outline Digital Signal Representation
More informationSpeaker Identification Based On Discriminative Vector Quantization And Data Fusion
University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Speaker Identification Based On Discriminative Vector Quantization And Data Fusion 2005 Guangyu Zhou
More informationThe Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models
Statistical NLP Spring 2010 The Noisy Channel Model Lecture 9: Acoustic Models Dan Klein UC Berkeley Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions Language model: Distributions
More information