Stress detection through emotional speech analysis

Stress detection through emotional speech analysis

INMA MOHINO inmaculada.mohino@uah.edu.es
ROBERTO GIL-PITA roberto.gil@uah.es
LORENA ÁLVAREZ PÉREZ loreduna88@hotmail

Abstract: Stress is a reaction or response of the subject to the daily mental, emotional or physical challenges they face. Continuous monitoring of a subject's stress level is a key point in understanding and controlling personal stress. Stress is expressed through physiological changes, emotional reactions and behavioral changes. Some of the physiological changes are an increase in adrenaline, produced to intensify concentration, or a rise in heart rate and an acceleration of the reflexes. Concerning emotional reactions, they can be expressed by changes in the prosody of speech. In this paper we study the design of a classification system for stress levels based on emotional speech analysis. For this purpose, linear discriminants combined with bootstrapping techniques are useful tools for implementing classifiers. Results demonstrate the feasibility of the proposed system, obtaining error rates lower than 33%.

Key Words: emotional speech, speech processing, stress detection.

1 Introduction

In recent years a tremendous amount of work has been devoted to studying the parameters of emotions in the human voice, fundamentally divided into two different research lines: the artificial production of emotional sounds [1, 2], and the classification of emotional states [3, 4, 5, 6, 7]. In the first line, researchers focus on the study of the characteristics of speech signals produced under different emotional states of the subject, and their relationship with the language. Several features and algorithms are proposed in the literature [1], and a review of the state of the art can be found in [2].

Concerning the classification of emotional states, the objective is to determine the emotional state of the subject given a speech signal, from a limited set of available states. From the results presented in the literature, the features with the greatest classification capability are related to the pitch, and they are widely studied in [3] and [4]. Furthermore, a comprehensive study of the features most used in the recognition of emotions can be found in [5], where those based on the pitch are again shown to be the most discriminative. Other papers focus on selecting a suitable reduced set of features in order to improve the generalization capability of the classifiers, like [6], in which automatic classification is used to select a minimum set of features, or [7], in which a detailed study of a huge number of parameters is included, with the purpose of selecting features with linear independence. This last paper concludes that with only 6 features a high rate of classification success can be achieved.

In this paper, the objective is not simply to classify the different emotional states, but to distinguish the level of excitation of the emotions, with the aim of predicting stress levels. For this study, we use the public database The Berlin Database of Emotional Speech, described in [3], and we carry out a set of experiments aimed at studying combinations of features extracted from the literature and their effect on classification performance.

2 Materials and Methods

This section includes a brief description of the classification method (the least-squares linear classifier) and a description of the database.

2.1 Standard MSE minimization of a diagonal linear discriminant

Linear classifiers are characterized by the use of linear decision boundaries, which implies that they cannot discriminate classes distributed in very complex shapes.
Let us consider a set of training patterns x = [x_1, x_2, \ldots, x_L]^T, where each of these patterns is assigned to one of the possible classes denoted C_i, i = 1, \ldots, K. In a linear classifier, the decision rule is obtained using a set of K linear combinations of the training patterns, as shown in equation (1).

y_k = w_{k0} + \sum_{n=1}^{L} w_{kn} x_n    (1)

where w_{kn} are the weighting values and w_{k0} is the threshold. Furthermore, equation (1) can be expressed in matrix notation as equation (2):

y = w_0 + W^T x    (2)

where W is the weight matrix that contains the values w_{kn}. The design of the classifier consists of finding the best values of W and w_0 to minimize the classification error. The output of the linear combinations, y, is used to determine the decision rule. For instance, if the component y_k gives the maximum value of the vector, then the k-th class is assigned to the pattern.

In order to determine the values of the weights, it is necessary to minimize the mean squared error. Let us define the matrix V = [w_0, W]^T containing the weight matrix W and the threshold vector w_0. Then the pattern matrix Q, which contains the inputs for classification, is expressed in (3):

Q = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ x_{11} & x_{12} & x_{13} & \cdots & x_{1N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{L1} & x_{L2} & x_{L3} & \cdots & x_{LN} \end{bmatrix}    (3)

So, the output of the linear classifier is obtained as a linear combination of the inputs according to (4):

Y = V Q    (4)

Let us now define the target matrix containing the labels of each pattern as:

T = \begin{bmatrix} t_{11} & t_{12} & t_{13} & \cdots & t_{1N} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{K1} & t_{K2} & t_{K3} & \cdots & t_{KN} \end{bmatrix}    (5)

where N is the number of data samples, and t_{kn} = 1 if the n-th pattern belongs to class C_k, and 0 otherwise. Then, the error is the difference between the outputs of the classifier and the true values contained in the target matrix:

E = Y - T = V Q - T    (6)

Consequently, the mean square error is computed according to equation (7):

MSE = \frac{1}{N} \| Y - T \|^2 = \frac{1}{N} \| V Q - T \|^2    (7)

In the least-squares approach, the weights are adjusted in order to minimize the mean squared value of this error (MSE). The MSE is minimized by differentiating expression (7) with respect to V and, using the Wiener-Hopf equations [9], the following expression for the weight values is obtained:

V = T Q^T (Q Q^T)^{-1}    (8)

This expression allows us to determine the values of the coefficients that minimize the mean squared error for a given set of features.

2.2 Database description

For this study, we use the public database The Berlin Database of Emotional Speech, described in [3]. This database consists of 535 sound files (patterns), produced by 10 persons: 5 males and 5 females. A key point to note is that, since the sound database is not excessively large, and with the aim of investigating the robustness, the generalization of the classification, and the significance of the results, we have made use of several different subdivisions of the database into design and test subsets by means of bootstrapping [10]. Bootstrapping is a method for estimating error probabilities in those cases in which few data are available. It consists of iteratively selecting the design data and the test data from the available data, implementing as many classification systems as there are bootstrap iterations [11]. In our case, with the aim of maximizing the generalization capability of the results, we have selected all the possible configurations of test sets, selecting one male and one female speaker each time.
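As a concrete illustration of the method described in Sections 2.1 and 2.2, the following minimal NumPy sketch computes the least-squares weights of equation (8) and evaluates them with leave-two-speakers-out splits (one male and one female per test configuration). It assumes that feature vectors, class labels and speaker identities have already been extracted as NumPy arrays; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np
from itertools import product

def train_ls_classifier(X, labels, n_classes):
    """Least-squares linear classifier, V = T Q^T (Q Q^T)^-1 (Eq. 8).
    X: (n_patterns, n_features) array, labels: integer class indices."""
    n = X.shape[0]
    Q = np.vstack([np.ones((1, n)), X.T])        # pattern matrix with bias row, Eq. (3)
    T = np.zeros((n_classes, n))
    T[labels, np.arange(n)] = 1.0                # one-hot target matrix, Eq. (5)
    return T @ Q.T @ np.linalg.pinv(Q @ Q.T)     # weight matrix V, Eq. (8)

def classify(V, X):
    """Assign each pattern to the class with the maximum linear output (Eq. 4)."""
    Q = np.vstack([np.ones((1, X.shape[0])), X.T])
    return np.argmax(V @ Q, axis=0)

def leave_two_speakers_out(X, labels, speakers, males, females, n_classes):
    """Average error over all test configurations with one male and one
    female speaker held out, in the spirit of Section 2.2."""
    errors = []
    for m, f in product(males, females):
        test = np.isin(speakers, [m, f])
        V = train_ls_classifier(X[~test], labels[~test], n_classes)
        errors.append(np.mean(classify(V, X[test]) != labels[test]))
    return float(np.mean(errors))
```

The pseudo-inverse is used instead of a plain matrix inverse so that the sketch also behaves sensibly when Q Q^T is ill-conditioned, for example with highly correlated features.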
3 Features containing emotional information

The measurements selected for the study of the emotion detection problem are the Mel-Frequency Cepstral Coefficients (MFCCs), the Short Term Energy (STE), the pitch, the jitter, the Harmonics to Noise Ratio (HNR), the Amplitude Perturbation Quotient (APQ) and the Pitch Perturbation Quotient (PPQ). We have also evaluated a novel measurement that has proven very useful for determining the emotion.

Once these measurements are determined, different statistics (mean, variance, kurtosis, etc.) are evaluated in order to obtain the features. Table 1 includes a description of the features and the statistics determined from each measurement.

Table 1: Description of the set of features used in the paper (47 features in total).

  MFCCs (5 coef.) [standard]: mean (features 1-5), std (6-10), delta MFCC (11-15) — 15 features
  Energy (e) [standard]: mean(e) (16), std(e) (17), kurtosis(e) (18), skewness(e) (19), median(e) (20) — 5 features
  Pitch (p) [standard]: mean(p) (21), std(p) (22), kurtosis(p) (23), skewness(p) (24), median(p) (25) — 5 features
  Energy and Pitch [proposed]: mean(p·e) (26), std(p·e) (27) — 2 features
  Jitter (j) [standard]: mean(j) (28), mean(log(j)) (29), median(j) (30) — 3 features
  HNR [standard]: mean (31), std (32), geomean (33), var (34), kurtosis (35), skewness (36), median (37) — 7 features
  APQ [standard]: APQ (38) — 1 feature
  PPQ [standard]: PPQ (39) — 1 feature
  Proposed feature family (x) [proposed]: mean(x) (40), std(x) (41), var(x) (42), geomean(x) (43), mean(log(x)) (44), kurtosis(x) (45), skewness(x) (46), median(x) (47) — 8 features

3.1 MFCCs

The MFCCs are a set of perceptual parameters calculated from the STFT [8] that have been widely used in speech recognition. They provide a compact representation of the spectral envelope, such that most of the signal energy is concentrated in the first coefficients. Perceptual analysis emulates the nonlinear frequency response of the human ear by creating a set of filters on non-linearly spaced frequency bands. Mel cepstral analysis uses the Mel scale and cepstral smoothing in order to obtain the final smoothed spectrum.

The process used for obtaining the MFCCs is as follows. First, the short-term spectrum of the vocal segment is evaluated. This spectrum is integrated over gradually widening frequency intervals on the Mel scale; the bandwidth of each band in the frequency domain depends on the filter central frequency, so the higher the frequency, the wider the bandwidth. Next, a vector with the log energies of each filter is evaluated. Finally, a cosine transform converts the log energies into a set of uncorrelated cepstral coefficients, the MFCCs.

The first cepstral coefficient describes the shape of the log spectrum independently of its overall level, the second coefficient measures the balance between the upper and lower halves of the spectrum, and higher-order coefficients describe increasingly fine detail in the spectrum. In this paper, 5 MFCCs have been evaluated for each file-pattern, and several statistics of these coefficients were considered as features.

3.2 Short term energy

The Short Term Energy is obtained by evaluating the energy of the signal in 20 ms time frames.

3.3 Pitch

The pitch, or fundamental frequency, gives information about the vibration rate of the vocal folds when a voiced sound is produced, generated by their rapid opening and closing.
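To make the measurement-plus-statistics scheme of Table 1 and Sections 3.1-3.3 concrete, here is a small sketch of how per-file features could be derived from frame-level MFCCs, short-term energy and pitch using librosa and scipy. The frame length, hop size and F0 search range are assumptions made for the sketch, not values stated in the paper; jitter, HNR, APQ and PPQ (Sections 3.4-3.7) would be handled analogously.

```python
import numpy as np
import librosa
from scipy import stats

def frame_measurements(path, sr=16000):
    """Frame-level MFCCs, short-term energy and pitch for one sound file."""
    y, sr = librosa.load(path, sr=sr)
    frame, hop = int(0.020 * sr), int(0.010 * sr)    # assumed 20 ms frames, 10 ms hop
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5, hop_length=hop)
    energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0] ** 2
    pitch = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)  # assumed F0 range
    return mfcc, energy, pitch

def measurement_statistics(x, prefix):
    """Per-file statistics of one frame-level measurement (cf. Table 1)."""
    return {f"{prefix}_mean": np.mean(x),
            f"{prefix}_std": np.std(x),
            f"{prefix}_kurtosis": stats.kurtosis(x),
            f"{prefix}_skewness": stats.skew(x),
            f"{prefix}_median": np.median(x)}

def extract_features(path):
    """Collect statistics of all measurements into one feature dictionary."""
    mfcc, energy, pitch = frame_measurements(path)
    feats = {}
    for i, coef in enumerate(mfcc):                  # mean and std of each MFCC
        feats[f"mfcc{i + 1}_mean"] = np.mean(coef)
        feats[f"mfcc{i + 1}_std"] = np.std(coef)
    feats.update(measurement_statistics(energy, "energy"))
    feats.update(measurement_statistics(pitch, "pitch"))
    return feats
```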

3.4 Jitter

Jitter, or frequency perturbation, is defined as the small cycle-to-cycle changes of the period that occur during phonation and are not accounted for by voluntary changes in frequency. The more the jitter deviates from zero, the more it correlates with erratic vibratory patterns of the vocal folds. It depends on the voice, the sex of the speaker and voluntary intonation.

3.5 HNR

The HNR (Harmonics to Noise Ratio) is a measurement of voice pureness. It is based on calculating the ratio of the energy of the harmonics to the noise energy present in the voice (both measured in dB). The measurement is carried out on the speech spectrum by filtering out the energy present at the harmonics; the resulting filtered spectrum provides a noise spectrum, which is subtracted from the original log spectrum to obtain what is termed here a source-related spectrum. After performing a baseline correction procedure on this spectrum, the modified noise spectrum is subtracted from the original log spectrum in order to provide the HNR estimate.

3.6 PPQ

The PPQ (Pitch Perturbation Quotient) computes the relative period-to-period variability of the fundamental frequency, with a smoothing factor of M periods. Specifically, it averages the differences between each period and its neighbouring previous and following periods, so the whole signal is analyzed with a moving window containing M periods.

3.7 APQ

The APQ (Amplitude Perturbation Quotient) is useful when there are voluntary amplitude changes in the voice, since it measures the period-to-period amplitude variability averaged over M periods.

3.8 Proposed measurement: normalized harmonic energy variation

The normalized harmonic energy variation relates the low-frequency variation of the harmonic energy (between 0.1 Hz and 4 Hz) to the high-frequency energy. First, the harmonic energy is evaluated in 20 ms time frames. Then this sequence of values is filtered in order to select the modulation components between 0.1 Hz and 4 Hz. Finally, the proposed measurement normalizes this value using the high-frequency energy, measured between 2.5 kHz and 5.8 kHz.
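The description above admits more than one reading, so the following sketch is only one possible interpretation of the proposed measurement: the harmonic energy is estimated per 20 ms frame, its trajectory is band-pass filtered in the 0.1-4 Hz modulation band, and the result is normalized by the energy between 2.5 kHz and 5.8 kHz. The harmonic-energy estimate, the filter order and all names are assumptions, not details given in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def normalized_harmonic_energy_variation(y, sr, f0_track, frame_s=0.020,
                                          mod_band=(0.1, 4.0), hf_band=(2500.0, 5800.0)):
    """One possible reading of the proposed measurement (Section 3.8).
    f0_track: fundamental frequency (Hz) per 20 ms frame, 0 for unvoiced frames."""
    frame = int(frame_s * sr)
    n_frames = min(len(y) // frame, len(f0_track))
    spectra = np.abs(np.fft.rfft(y[:n_frames * frame].reshape(n_frames, frame), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / sr)

    # Harmonic energy per frame: power at the bins nearest to multiples of F0 (assumption).
    harmonic = np.zeros(n_frames)
    for i, f0 in enumerate(f0_track[:n_frames]):
        if f0 > 0:
            for k in range(1, int(freqs[-1] // f0) + 1):
                harmonic[i] += spectra[i, np.argmin(np.abs(freqs - k * f0))]

    # Band-pass the harmonic-energy sequence in the 0.1-4 Hz modulation band.
    frame_rate = 1.0 / frame_s
    sos = butter(2, mod_band, btype="bandpass", fs=frame_rate, output="sos")
    variation = sosfiltfilt(sos, harmonic)

    # Normalize by the high-frequency energy between 2.5 kHz and 5.8 kHz.
    hf_energy = spectra[:, (freqs >= hf_band[0]) & (freqs <= hf_band[1])].sum()
    return np.mean(variation ** 2) / (hf_energy + 1e-12)
```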
4 Results

This section presents the results of the experiments carried out in this paper. Table 2 shows the confusion matrix for the proposed set of features. The emotions listed are Neutral (N), Boredom (B), Sadness (S), Disgust (D), Fear (F), Happiness (H) and Anger (A). Each element represents the probability of assigning the emotion of its row to a pattern whose real emotion is given by its column, so each column represents the distribution of the decisions for the patterns belonging to one emotion.

Table 2: Confusion matrix (%) for the proposed set of features. The emotions listed are Neutral (N), Boredom (B), Sadness (S), Disgust (D), Fear (F), Happiness (H) and Anger (A).

                              Real emotion
                    Low stress       Medium stress    High stress
  Assigned          N     B     S     D     F         H     A
  N                53%   15%    0%    0%    6%        8%    0%
  B                 5%   69%   17%    0%    0%        0%    0%
  S                 3%    5%   70%    0%    0%        0%    0%
  D                 2%    5%    7%   63%    0%        0%    1%
  F                24%    0%    7%   25%   78%        0%    6%
  H                 9%    0%    0%   13%   17%       50%   18%
  A                 3%    5%    0%    0%    0%       42%   75%

As we can see, there exist three groups of emotions that tend to be confused with each other. Due to this result, we propose to group the original emotions into three classes:

- Low stressed emotions: neutral, boredom and sadness patterns.
- Medium stressed emotions: disgust and fear.
- High stressed emotions: happiness and anger.

Table 3 shows the classification errors (%) for the standard and proposed sets of features, as a function of the number of classes.

Table 3: Classification errors (%) for the standard and proposed sets of features, as a function of the number of classes.

  Number of classes   Standard set of features   Standard and proposed features
  7 classes           37.54%                     32.61%
  3 classes           19.05%                     15.93%

As we can see, the use of the proposed set of features increases the performance of the classifier, both in the 7-class problem and in the 3-class problem.

5 Conclusion

Stress is a reaction or response of the subject to the daily mental, emotional or physical challenges they face. Continuous monitoring of a subject's stress levels is a key point in understanding and controlling personal stress. Stress is expressed through physiological changes, emotional reactions and behavioral changes. Some of the physiological changes are an increase in adrenaline, produced to intensify concentration, or a rise in heart rate and an acceleration of the reflexes. Concerning emotional reactions, they can be expressed by changes in the prosody of speech. Therefore, it is important for society to find solutions that determine the instantaneous stress level, since stress can undermine both mental and physical health.

In this work, a set of features is proposed for minimizing the error of emotion detectors, improving the performance of current systems. Results demonstrate the feasibility of the proposed system, obtaining error rates lower than 33% with seven types of emotions. It is important to highlight that subjective tests with native listeners have been carried out, obtaining error rates of around 40%. This fact makes the proposed family of measurements useful for implementing emotion classification systems, even in those cases in which there is no prior knowledge about the speaker.

6 Acknowledgments

This work has been funded by the Spanish Ministry of Education and Science (TEC2009-14414-C03-03) and by project UAH2011/EXP-028.

References:

[1] P.Y. Oudeyer, The production and recognition of emotions in speech: features and algorithms, International Journal of Human-Computer Studies 59 (2003), pp. 157-183.
[2] R. Barra, J. Macias-Guarasa, J.M. Montero, C. Rincon, F. Fernandez and R. Cordoba, In Search of Primary Rubrics for Language Independent Emotional Speech Identification, International Symposium on Intelligent Signal Processing (2007), pp. 1-6.
[3] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier and B. Weiss, A database of German emotional speech, Proceedings of Interspeech (2005), pp. 1517-1520.
[4] A. Paeschke, Global trend of fundamental frequency in emotional speech, Proceedings of Speech Prosody (2004), pp. 671-674.
[5] D. Ververidis and C. Kotropoulos, Emotional speech recognition: Resources, features, and methods, Speech Communication 48 (2006), no. 9, pp. 1162-1181.
[6] D. Ververidis, C. Kotropoulos and I. Pitas, Automatic emotional speech classification, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2004), vol. 1, pp. 593-596.
[7] K. Hammerschmidt and U. Jürgens, Acoustical Correlates of Affective Prosody, Journal of Voice 21 (2007), pp. 531-540.
[8] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing 28 (1980), pp. 357-366.
[9] H.L. Van Trees, Detection, Estimation, and Modulation Theory, vol. 1, Wiley, 1968.
[10] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, Wiley-Interscience, 2001.
[11] A.C. Davison and D.V. Hinkley, Bootstrap Methods and their Application, Cambridge University Press, 1997.