Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

1 Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates. Dima Ruinskiy, Niv Dadush, Yizhar Lavner. Department of Computer Science, Tel-Hai College, Israel

2 Outline: Phoneme Spotting; Applications and Approaches; Fricatives and Affricates; Discriminating Features; Our Algorithms (Cepstrogram Matching, Linear Discriminant Analysis); Results and Discussion

3 Phoneme Spotting. Definition: locating all occurrences of a phoneme (or set of phonemes) in continuous speech. Not to be confused with phoneme classification, where the individual phonemes are already marked.

4 Phoneme Spotting Applications: speech recognition; spoken term detection; smart audio filtering; multimedia synchronization (audio/video, lyrics/music); aesthetic purposes; professional audio recording.

5 Traditional Approaches: pattern matching; classifiers (GMM, SVM); hierarchical approach; time-domain and frequency-domain features. The choice of features is key and depends on the phonemes in question.

6 Fricatives and Affricates. The largest group of phonemes in English. Fricatives: /s/ (sound), /sh/ (wash), /f/ (food), /th/ (math); /z/ (zebra), /zh/ (mirage), /v/ (victory), /dh/ (this). Affricates: /ch/ (chess), /ts/ (pizza); each is a concatenation of a stop and a fricative.

7 Fricatives and Affricates. Characteristics: noise-like consonants; concentration of energy in high frequencies; weak formant structure. Problems: excessive accentuation; hearing-impaired.

8 Discriminating Features: Short-Time Energy
$$E_{dB} = 10 \log_{10}\left(\frac{1}{N} \sum_{n=n_0}^{n_0+N} x^2[n]\right)$$
Filters out silence and vowel-like phonemes. Not useful in filtering out most consonants.
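A minimal sketch of the short-time energy computation, in plain Python. The function name, the `eps` guard against log(0), and the choice of a window of exactly N samples starting at n0 are assumptions made for illustration, not details taken from the slides.

```python
import math

def short_time_energy_db(x, n0, N, eps=1e-12):
    """Mean-square energy of the N samples x[n0 : n0 + N], in decibels.

    eps is an illustrative guard so that an all-zero frame does not hit log10(0).
    """
    frame = x[n0:n0 + N]
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + eps)
```

A frame of constant amplitude 0.1 has a mean square of 0.01, so the function returns about -20 dB for it.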

9 Discriminating Features: Zero-Crossing Rate
$$ZCR = \frac{1}{N} \sum_{n=n_0+1}^{n_0+N} 0.5\,\left|\mathrm{sgn}(x[n]) - \mathrm{sgn}(x[n-1])\right|$$
Indicative of high frequencies (noise-like signal). Significantly higher for certain fricatives (/s/, /f/, /sh/) than for other phonemes.
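The zero-crossing rate can be sketched directly from its definition. Treating a sample equal to zero as positive is an assumption made here, since the slides do not say how sgn(0) is handled.

```python
def sgn(v):
    """Sign function; zero is treated as positive (an assumption)."""
    return 1 if v >= 0 else -1

def zero_crossing_rate(x, n0, N):
    """Fraction of sign changes among samples x[n0 + 1] .. x[n0 + N]."""
    total = 0.0
    for n in range(n0 + 1, n0 + N + 1):
        # Each sign change contributes 0.5 * |(+1) - (-1)| = 1.
        total += 0.5 * abs(sgn(x[n]) - sgn(x[n - 1]))
    return total / N
```

A signal that alternates in sign on every sample gives a rate of 1, and a signal that never changes sign gives 0.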

10 Discriminating Features: Band Energy Ratio
Measure the total spectral energy in two different bands: $B_1 = [5\,\mathrm{kHz}, 10\,\mathrm{kHz}]$ and $B_2 = [500\,\mathrm{Hz}, 3\,\mathrm{kHz}]$.
Spectral energy from the DFT:
$$E_B = \sum_{k \in B} |X[k]|^2$$
$$BER = 10 \log_{10}\left(E_{B_1} / E_{B_2}\right)$$
The ratio is large when the phoneme contains mostly high frequencies.
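A rough sketch of the band energy ratio using a direct DFT from the standard library. The function names, the `eps` guard, and the requirement that the sampling rate `fs` exceed 20 kHz (so that the 5 to 10 kHz band is fully represented) are assumptions; a real implementation would normally use an FFT.

```python
import cmath
import math

def band_energy(x, fs, f_lo, f_hi):
    """Sum of |X[k]|^2 over DFT bins whose frequency lies in [f_lo, f_hi]."""
    N = len(x)
    energy = 0.0
    for k in range(N // 2 + 1):          # non-negative frequencies only
        f = k * fs / N                    # frequency of bin k in Hz
        if f_lo <= f <= f_hi:
            X_k = sum(x[n] * cmath.exp(-2j * math.pi * k * n / N)
                      for n in range(N))
            energy += abs(X_k) ** 2
    return energy

def band_energy_ratio_db(x, fs, high=(5000.0, 10000.0),
                         low=(500.0, 3000.0), eps=1e-12):
    """10*log10(E_high / E_low); eps is an illustrative guard against 0/0."""
    e_high = band_energy(x, fs, *high)
    e_low = band_energy(x, fs, *low)
    return 10.0 * math.log10((e_high + eps) / (e_low + eps))
```

For a pure 7 kHz tone the ratio is strongly positive, and for a pure 1 kHz tone it is strongly negative, which is the behavior the feature relies on.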

11 Discriminating Features [Figure] Top: speech signal with instances of /s/ and /sh/. Middle: band energy ratio in decibels. Bottom: zero-crossing rate.

12 Discriminating Features
Spectral peak locations: the two dominant peaks in the LPC envelope.
Mel-frequency cepstral coefficients: triangular ideal band-pass filters $V_k$ (logarithmically spaced and sized); total spectral energy $E(i)$ in each filter.
$$MFCC_L = \frac{1}{M} \sum_{i=0}^{M-1} \log\left(E(i)\right) \cos\left(\frac{2\pi}{M}\left(i + \frac{1}{2}\right) L\right)$$
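The cosine-transform step of the MFCC computation can be sketched as follows, assuming the filter-bank energies E(i) have already been computed. The function name and the `num_coeffs` parameter are illustrative, and the filter bank itself is not shown.

```python
import math

def mfcc_from_energies(E, num_coeffs):
    """Cosine transform of log filter-bank energies.

    E is a list of M strictly positive band energies; returns the
    coefficients for L = 0 .. num_coeffs - 1.
    """
    M = len(E)
    log_E = [math.log(e) for e in E]
    coeffs = []
    for L in range(num_coeffs):
        acc = sum(log_E[i] * math.cos(2.0 * math.pi / M * (i + 0.5) * L)
                  for i in range(M))
        coeffs.append(acc / M)
    return coeffs
```

With a flat spectrum (all energies equal), only the L = 0 coefficient is nonzero, since the cosine terms for L = 1 sum to zero over a full period.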

13 Discriminating Features: Lacunarity
Textural measure of translational invariance. For sliding windows of length $r$, compute the mass $S(r)$ across the signal. Define the lacunarity $\Lambda(r)$:
$$\Lambda(r) = \frac{\mathrm{Var}(S(r))}{\mu^2(S(r))} + 1$$
Apply a least-squares approximation to get a best-fitting function of the form $\alpha / x^{\beta} + \gamma$.
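A minimal sketch of the lacunarity of a one-dimensional signal for a single window length r. The slides do not define the "mass" of a window; taking it as the sum of absolute sample values is an assumption made here, and the least-squares curve fit over several values of r is not shown.

```python
def lacunarity(x, r):
    """Var(S) / mean(S)^2 + 1 over all sliding windows of length r.

    The window mass S is taken as the sum of |x[n]| (an assumption).
    """
    masses = [sum(abs(v) for v in x[i:i + r])
              for i in range(len(x) - r + 1)]
    mu = sum(masses) / len(masses)
    var = sum((m - mu) ** 2 for m in masses) / len(masses)
    return var / (mu * mu) + 1.0
```

A constant signal has identical window masses and therefore a lacunarity of exactly 1, the minimum; signals with gaps give larger values.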

14 Discriminating Features: Lacunarity [Figure] Blue = fricatives/affricates; red = other phonemes.

15 Cepstrogram-Matching Algorithm. MFCC vectors are computed for consecutive short windows and arranged in a matrix to form a cepstrogram. [Figure: a frame divided into sub-frames $S_1, \ldots, S_7$, each yielding an MFCC vector $V_1, \ldots, V_7$.]

16 Cepstrogram-Matching Algorithm: Training
Compute the cepstrograms $M_i$ of several ($N$) known fricative/affricate phonemes. Compute the mean (template) matrix $T$ and the variance matrix $V$.
$$T = \frac{1}{N} \sum_{i=1}^{N} M_i$$

17 Cepstrogram-Matching Algorithm: Testing
Compute the cepstrogram of the analysis frame: $M(X_k)$. Difference matrix (element-wise division): $D = (M(X_k) - T) / V$. Distance measure:
$$d = \sum_i \sum_j D_{ij}^2$$
The frame is a candidate if $d$ is below a threshold. Candidates are further checked using short-time energy, zero-crossing rate, and band energy ratio.
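The training and testing steps can be sketched together, with cepstrograms represented as nested lists. The function names, the use of the population variance for V, and the `eps` guard against division by zero are assumptions for illustration; the threshold and the follow-up feature checks are not shown.

```python
def train_template(cepstrograms):
    """Element-wise mean T and variance V over a list of equal-sized matrices."""
    n = len(cepstrograms)
    rows, cols = len(cepstrograms[0]), len(cepstrograms[0][0])
    T = [[sum(M[i][j] for M in cepstrograms) / n for j in range(cols)]
         for i in range(rows)]
    V = [[sum((M[i][j] - T[i][j]) ** 2 for M in cepstrograms) / n
          for j in range(cols)]
         for i in range(rows)]
    return T, V

def template_distance(M, T, V, eps=1e-12):
    """d = sum over i, j of ((M - T) / V)^2, with element-wise division."""
    return sum(((M[i][j] - T[i][j]) / (V[i][j] + eps)) ** 2
               for i in range(len(T)) for j in range(len(T[0])))
```

A frame whose cepstrogram equals the template has distance 0; a frame would be flagged as a candidate when `template_distance` falls below a chosen threshold.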

18 Template-Matching Algorithm (flowchart)
Divide the input signal into consecutive analysis frames ($\ldots, X_{k-2}, X_{k-1}, X_k, X_{k+1}, X_{k+2}, \ldots$). For each frame, compute the MFCC matrix and the supporting feature set, then calculate the distance measure against the template matrix (labeled "Breath Template Matrix" in the diagram). Distance below threshold? NO: preliminary classification as non-fricative. YES: preliminary classification as fricative, followed by detection refinement on all frames in the vicinity. Still classified as fricative? NO: discard. YES: classify as fricative and demarcate boundaries.

19 Cepstrogram-Matching Algorithm. Achieved good results in breath detection (Ruinskiy-Lavner, 2006). Results for fricatives/affricates are good but not good enough. Biggest problem: false positives.

20 Linear Discriminant Analysis (LDA). Transforms multi-dimensional feature vectors (of two or more classes) into a one-dimensional representation. Aims to maximize inter-class difference while minimizing intra-class variance.

21 Linear Discriminant Analysis (LDA)
Classes $C_0, C_1$; class means $m_0, m_1$; $S_B$, $S_W$: inter-/intra-class variances.
Maximize
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
Differentiating $J(w)$, we obtain the extremum (up to a scale factor):
$$w = S_W^{-1} (m_1 - m_0)$$
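A small sketch of the projection direction $w = S_W^{-1}(m_1 - m_0)$ for two classes of feature vectors, using only the standard library. All function names are illustrative, $S_W$ is taken as the sum of the two per-class scatter matrices (an assumption about normalization, which only affects the scale of w), and the linear system is solved by Gaussian elimination instead of forming the inverse.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def scatter(vectors, m):
    """Scatter matrix: sum of (v - m)(v - m)^T over the class."""
    dim = len(m)
    return [[sum((v[a] - m[a]) * (v[b] - m[b]) for v in vectors)
             for b in range(dim)] for a in range(dim)]

def fisher_direction(class0, class1):
    """w = S_W^{-1} (m1 - m0), with S_W the sum of both class scatters."""
    m0, m1 = mean(class0), mean(class1)
    S0, S1 = scatter(class0, m0), scatter(class1, m1)
    dim = len(m0)
    S_W = [[S0[a][b] + S1[a][b] for b in range(dim)] for a in range(dim)]
    diff = [m1[d] - m0[d] for d in range(dim)]
    return solve(S_W, diff)
```

Projecting a feature vector onto w (a dot product) gives the one-dimensional representation on which the two classes are best separated in the Fisher sense.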

22 LDA Classifier Training. Training data: several hundred phonemes (from TIMIT); 28% fricatives/affricates, 72% other phonemes. Short overlapping frames (8-15 ms). Feature vector consisting of: short-time energy, zero-crossing rate, band energy ratio, lacunarity, spectral peaks.

23 LDA Classifier Training
Classes $C_1$ (fricatives/affricates) and $C_0$ (others), represented by matrices of data.
Subtract the global mean vector $m$ from the columns: $c_i \leftarrow c_i - m$, $c_i \in C_i$, $i \in \{0, 1\}$.
Calculate the covariance matrices $COV_i = C_i C_i^T$ and the joint covariance matrix
$$C = \frac{COV_0 + COV_1}{N}$$

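24 LDA Classification
Discrimination function ($p_0$, $p_1$: class prior probabilities):
$$f(x) = \begin{cases} 1, & m_0^T C^{-1} x - \frac{1}{2} m_0^T C^{-1} m_0 + \log p_0 < m_1^T C^{-1} x - \frac{1}{2} m_1^T C^{-1} m_1 + \log p_1 \\ 0, & \text{otherwise} \end{cases}$$
Post-processing: silence threshold, median filtering.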
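The discrimination function can be sketched as a comparison of two linear scores. The function names are illustrative, and the inverse joint covariance `C_inv` is assumed to be precomputed and symmetric, so that $m^T C^{-1} x$ can be evaluated as $(C^{-1} m) \cdot x$.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in A]

def lda_score(x, m, C_inv, prior):
    """m^T C^{-1} x - 0.5 * m^T C^{-1} m + log(prior), for symmetric C_inv."""
    Cm = matvec(C_inv, m)
    return dot(Cm, x) - 0.5 * dot(Cm, m) + math.log(prior)

def lda_classify(x, m0, m1, C_inv, p0, p1):
    """Return 1 (fricative/affricate) if class 1 scores higher, else 0."""
    return 1 if lda_score(x, m0, C_inv, p0) < lda_score(x, m1, C_inv, p1) else 0
```

With unequal priors the decision boundary shifts toward the mean of the less likely class, so a point midway between the two means is assigned to the more frequent class.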

25 Linear Discriminant Algorithm (flowchart)
Divide the input signal into consecutive analysis frames ($\ldots, X_{k-2}, X_{k-1}, X_k, X_{k+1}, X_{k+2}, \ldots$). For each frame, compute the feature vector and apply LDA classification. Classified as fricative? NO: discard. YES: apply post-processing (median/energy filtering), then classify as fricative.
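The median-filtering part of the post-processing can be sketched as a sliding median over the per-frame binary decisions. The window width and the handling of the sequence edges (shorter windows near the ends) are assumptions; the slides do not specify either, and the energy filtering is not shown.

```python
def median_filter(labels, width=3):
    """Sliding median over a sequence of 0/1 frame decisions.

    Near the edges the window is truncated instead of padded (an assumption).
    """
    half = width // 2
    out = []
    for i in range(len(labels)):
        window = sorted(labels[max(0, i - half):i + half + 1])
        out.append(window[len(window) // 2])
    return out
```

Isolated single-frame detections and single-frame dropouts inside a detected segment are both removed, which smooths the frame-level decisions into contiguous segments.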

26 Results

Algorithm          Test data                         Detections   False alarms   False alarms (excl. breath)
Ceps.              6 speakers (2M+4F), 10 minutes    95.1%        26.2%          14.6%
Ceps. (/s/ only)   (same as above)                   93.2%        2.7%           2.7%
LDA                TIMIT, 25 speakers                96.4%        15.7%          6.1%

The breath detector can eliminate most breath-related false positives. The most common false positives after breath are stop consonants (/t/, /k/).

27 Thank you! Q&A
