Stress detection through emotional speech analysis

INMA MOHINO, ROBERTO GIL-PITA, LORENA ÁLVAREZ PÉREZ

Abstract: Stress is a reaction of a subject when facing daily mental, emotional or physical challenges. Continuous monitoring of a subject's stress level is a key point in understanding and controlling personal stress. Stress is expressed through physiological changes, emotional reactions and behavioral changes. Physiological changes include the increase of adrenaline produced to intensify concentration, the rise in heart rate and the acceleration of reflexes. Emotional reactions, in turn, can be expressed through changes in the prosody of speech. In this paper we study the design of a classification system for stress levels based on emotional speech analysis. For this purpose, linear discriminants combined with bootstrapping techniques are used to implement the classifiers. Results demonstrate the feasibility of the proposed system, with error rates lower than 33%.

Key Words: emotional speech, speech processing, stress detection.

1 Introduction

In recent years a tremendous amount of work has been devoted to studying the parameters of emotion in the human voice, fundamentally divided into two research lines: the artificial production of emotional sounds [1, 2], and the classification of emotional states [3, 4, 5, 6, 7]. In the first line, researchers focus on the study of the characteristics of speech signals produced under different emotional states of the subject, and on their relationship with the language. Several features and algorithms have been proposed in the literature [1], and a review of the state of the art can be found in [2]. Concerning the classification of emotional states, the objective is to determine the emotional state of the subject from a speech signal, given a limited set of possible states. According to the results presented in the literature, the features with the greatest classification capability are those related to the pitch, and they are studied extensively in [3] and [4]. Furthermore, a comprehensive study of the features most used in the recognition of emotions can be found in [5], where those based on the pitch are again shown to be the most discriminative. Other papers focus on selecting a suitable reduced set of features in order to improve the generalization capability of the classifiers, like [6], in which automatic classification is used to select a minimum feature set, or [7], which includes a detailed study of a huge number of parameters with the purpose of selecting linearly independent features. This last paper concludes that a high classification success rate can be achieved with only 6 features.

In this paper, the objective is not simply to classify the different emotional states, but to distinguish the level of excitation of the emotions, with the aim of predicting stress levels. For this study, we use the public database The Berlin Database of Emotional Speech, described in [3], and we carry out a set of experiments aimed at studying the combination of features extracted from the literature and their effect on the classification performance.

2 Materials and Methods

This section includes a brief description of the classification method (the least-squares linear classifier) and a description of the database.

2.1 Standard MSE minimization of a diagonal linear discriminant

Linear classifiers are characterized by the use of linear decision boundaries, which implies that they cannot discriminate classes arranged in very complex shapes.
Let us consider a set of training patterns $\mathbf{x} = [x_1, x_2, \ldots, x_L]^T$, where each of these patterns is assigned to one of the possible classes, denoted $C_i$, $i = 1, \ldots, K$. In a linear classifier, the decision rule is obtained using a set of $K$ linear combinations of the training patterns, as shown in equation (1).

$y_k = w_{k0} + \sum_{n=1}^{L} w_{kn} x_n$   (1)

where $w_{kn}$ are the weighting values and $w_{k0}$ is the threshold. Equation (1) can also be expressed in matrix notation, as in equation (2):

$\mathbf{y} = \mathbf{w}_0 + \mathbf{W}^T \mathbf{x}$   (2)

where $\mathbf{W}$ is the weight matrix that contains the values $w_{kn}$. The design of the classifier consists of finding the values of $\mathbf{W}$ and $\mathbf{w}_0$ that minimize the classification error. The output $\mathbf{y}$ of the linear combinations is used to determine the decision rule: for instance, if the component $y_k$ gives the maximum value of the vector, then the $k$-th class is assigned to the pattern.

In order to determine the values of the weights, the mean squared error must be minimized. Let us define the matrix $\mathbf{V} = [\mathbf{w}_0, \mathbf{W}]^T$, containing the weight matrix $\mathbf{W}$ and the threshold vector $\mathbf{w}_0$, and the pattern matrix $\mathbf{Q}$, which contains the inputs for classification augmented with a row of ones that accounts for the threshold, as expressed in (3):

$\mathbf{Q} = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{12} & \cdots & x_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{L1} & x_{L2} & \cdots & x_{LN} \end{bmatrix}$   (3)

The output of the linear classifier is then obtained as a linear combination of the inputs, according to (4):

$\mathbf{Y} = \mathbf{V}\mathbf{Q}$   (4)

Let us now define the target matrix containing the labels of each pattern as:

$\mathbf{T} = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ t_{K1} & t_{K2} & \cdots & t_{KN} \end{bmatrix}$   (5)

where $N$ is the number of data samples and $t_{kn} = 1$ if the $n$-th pattern belongs to class $C_k$, and $0$ otherwise. The error is the difference between the outputs of the classifier and the true values contained in the target matrix:

$\mathbf{E} = \mathbf{Y} - \mathbf{T} = \mathbf{V}\mathbf{Q} - \mathbf{T}$   (6)

Consequently, the mean squared error (MSE) is computed according to equation (7):

$\mathrm{MSE} = \frac{1}{N}\,\|\mathbf{Y} - \mathbf{T}\|^2 = \frac{1}{N}\,\|\mathbf{V}\mathbf{Q} - \mathbf{T}\|^2$   (7)

In the least-squares approach, the weights are adjusted to minimize this MSE. Differentiating expression (7) with respect to $\mathbf{V}$ and using the Wiener-Hopf equations [9], the following expression for the weight values is obtained:

$\mathbf{V} = \mathbf{T}\mathbf{Q}^T (\mathbf{Q}\mathbf{Q}^T)^{-1}$   (8)

This expression determines the values of the coefficients that minimize the mean squared error for a given set of features.
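To make equations (3)-(8) concrete, the following Python sketch implements the least-squares design and the maximum-output decision rule. It is our illustration, not the authors' code: the function names are ours, and the small ridge term added to keep $\mathbf{Q}\mathbf{Q}^T$ invertible is an assumption, not part of the paper.

```python
import numpy as np

def train_ls_classifier(X, labels, n_classes, ridge=1e-8):
    """Least-squares weight design of eq. (8): V = T Q^T (Q Q^T)^(-1).

    X: (L, N) matrix with one pattern per column.
    labels: length-N integer array of class indices in [0, n_classes).
    """
    L, N = X.shape
    Q = np.vstack([np.ones((1, N)), X])       # eq. (3): row of ones models w_0
    T = np.zeros((n_classes, N))
    T[labels, np.arange(N)] = 1.0             # eq. (5): one-hot targets
    # eq. (8); the ridge term is our addition, for numerical stability
    return T @ Q.T @ np.linalg.inv(Q @ Q.T + ridge * np.eye(L + 1))

def classify(V, X):
    """Decision rule: assign the class whose output y_k is maximum (eq. (4))."""
    Q = np.vstack([np.ones((1, X.shape[1])), X])
    return np.argmax(V @ Q, axis=0)
```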
2.2 Database description

For this study, we use the public database The Berlin Database of Emotional Speech, described in [3]. This database consists of 535 sound files (patterns) produced by 10 speakers: 5 males and 5 females. A key point to note is that, since the sound database is not excessively large, and with the aim of investigating the robustness, the generalization capability and the significance of the results, we have used several different subdivisions of the database into design and test subsets by means of bootstrapping [10]. Bootstrapping is a method for estimating error probabilities in cases where few data are available. It consists in iteratively selecting the design data and the test data from the available data, implementing as many classification systems as there are bootstrap iterations [11]. In our case, with the aim of maximizing the generalization capability of the results, we have selected all the possible configurations of the test set, choosing one male and one female speaker each time.
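The speaker-wise resampling described above can be sketched as follows: every (male, female) pair of speakers is held out once as the test set, a classifier is designed on the remaining eight speakers, and the error rates are averaged. This reuses train_ls_classifier and classify from the previous listing; the speaker annotation arrays are hypothetical, not part of the database.

```python
from itertools import product
import numpy as np

def speaker_pair_bootstrap(X, labels, speaker, males, females, n_classes):
    """Leave one male and one female speaker out: 5 x 5 = 25 configurations.

    X: (L, N) feature matrix; labels: (N,) class indices;
    speaker: (N,) speaker id per pattern; males, females: lists of ids.
    """
    errors = []
    for m, f in product(males, females):      # every possible test pair
        test = np.isin(speaker, [m, f])
        V = train_ls_classifier(X[:, ~test], labels[~test], n_classes)
        pred = classify(V, X[:, test])
        errors.append(np.mean(pred != labels[test]))
    return float(np.mean(errors))             # error averaged over the splits
```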

3 Features containing emotional information

The measurements selected for the study of the emotion detection problem are the Mel-Frequency Cepstral Coefficients (MFCCs), the Short Term Energy (STE), the pitch, the jitter, the Harmonics to Noise Ratio (HNR), the Aperture Perturbation Quotient (APQ) and the Pitch Perturbation Quotient (PPQ). We have also evaluated a novel measurement that has proven to be very useful for determining the emotion. Once these measurements are determined, different statistics (mean, variance, kurtosis, etc.) are evaluated over each measurement sequence in order to obtain the features. Table 1 describes the features and the statistics determined from each measurement.

Table 1: Description of the set of features used in the paper

Measurement                      Features        Index    Total
MFCCs (5 coef.)                  mean            1-5      15
                                 std             6-10
                                 delta MFCC      11-15
Energy (e)                       mean(e)         16       5
                                 std(e)          17
                                 kurtosis(e)     18
                                 skewness(e)     19
                                 median(e)       20
Pitch (p)                        mean(p)         21       5
                                 std(p)          22
                                 kurtosis(p)     23
                                 skewness(p)     24
                                 median(p)       25
Energy and pitch (proposed)      mean(p·e)       26       2
                                 std(p·e)        27
Jitter (j)                       mean(j)         28       3
                                 mean(log(j))    29
                                 median(j)       30
HNR                              mean            31       7
                                 std             32
                                 geomean         33
                                 var             34
                                 kurtosis        35
                                 skewness        36
                                 median          37
APQ                              APQ             38       1
PPQ                              PPQ             39       1
Proposed feature family (x)      mean(x)         40       8
                                 std(x)          41
                                 var(x)          42
                                 geomean(x)      43
                                 mean(log(x))    44
                                 kurtosis(x)     45
                                 skewness(x)     46
                                 median(x)       47

3.1 MFCCs

The MFCCs are a set of perceptual parameters calculated from the STFT [8] that have been widely used in speech recognition. They provide a compact representation of the spectral envelope, such that most of the signal energy is concentrated in the first coefficients. Perceptual analysis emulates the nonlinear frequency response of the human ear by creating a set of filters over non-linearly spaced frequency bands. Mel cepstral analysis uses the Mel scale and cepstral smoothing to obtain the final smoothed spectrum. The process used for obtaining the MFCCs is as follows. First, the short-term spectrum of the vocal segment is evaluated. This spectrum is then integrated over gradually widening frequency intervals on the Mel scale; the bandwidth of each band in the frequency domain depends on the central frequency of the filter (the higher the frequency, the wider the bandwidth). Next, a vector of log energies, one per filter, is evaluated. Finally, a cosine transform converts the log energies into a set of uncorrelated cepstral coefficients, the MFCCs. The first cepstral coefficient describes the shape of the log spectrum independently of its overall level, the second coefficient measures the balance between the upper and lower halves of the spectrum, and higher-order coefficients capture increasingly fine detail in the spectrum. In this paper, 5 MFCCs have been evaluated for each file pattern, and several statistics of these coefficients were used as features.
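As an illustration of the feature extraction, the sketch below computes 5 MFCCs per frame and summarizes them as in Table 1 (features 1-15); sequence_stats applies the same kind of summarization to a measurement sequence such as the energy or the pitch. The use of librosa, the 20 ms non-overlapping frames and the mean of the delta coefficients for features 11-15 are our assumptions, not details given in the paper.

```python
import numpy as np
import librosa
from scipy import stats

def mfcc_features(y, sr):
    """Features 1-15 of Table 1: statistics of 5 MFCCs per 20 ms frame."""
    n = int(0.020 * sr)                       # 20 ms frames, no overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=5,
                                n_fft=n, hop_length=n, n_mels=40)
    delta = librosa.feature.delta(mfcc)       # frame-to-frame variation
    return np.concatenate([mfcc.mean(axis=1),    # features 1-5: mean
                           mfcc.std(axis=1),     # features 6-10: std
                           delta.mean(axis=1)])  # features 11-15 (assumed mean)

def sequence_stats(x):
    """Statistics applied to a measurement sequence (energy, pitch, ...)."""
    return np.array([np.mean(x), np.std(x), stats.kurtosis(x),
                     stats.skew(x), np.median(x)])
```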

3.2 Short term energy

The Short Term Energy (STE) is obtained by evaluating the energy of the signal in 20 ms time frames.

3.3 Pitch

The pitch, or fundamental frequency, carries information about the vibration rate of the vocal folds when a sound is produced, generated by their rapid opening and closing.

3.4 Jitter

Jitter, or frequency perturbation, is defined as the small cycle-to-cycle changes of period that occur during phonation and are not accounted for by voluntary changes in frequency. The more jitter deviates from zero, the more it correlates with erratic vibratory patterns of the vocal folds. It depends on the voice, the sex of the speaker and the voluntary intonation.

3.5 HNR

The HNR (Harmonics to Noise Ratio) is a measurement of voice pureness. It is based on the ratio between the energy of the harmonics and the noise energy present in the voice (both measured in dB). The measurement is carried out on the speech spectrum: the energy present at the harmonics is removed by filtering, and the resulting filtered spectrum provides a noise spectrum, which is subtracted from the original log spectrum to yield what is termed here a source-related spectrum. After performing a baseline correction procedure on this spectrum, the modified noise spectrum is subtracted from the original log spectrum in order to provide the HNR estimate.

3.6 PPQ

The PPQ (Pitch Perturbation Quotient) computes the relative period-to-period variability of the fundamental frequency, with a smoothing factor of M periods. Specifically, it averages the differences between each period and the Z previous and Z next periods, so the whole signal is analyzed with a moving window containing M periods.

3.7 APQ

The APQ (Aperture Perturbation Quotient) is useful when there are voluntary amplitude changes in the voice, since it measures the period-to-period amplitude variability, averaged over M periods.
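The perturbation measures of Sections 3.4, 3.6 and 3.7 can be illustrated on a sequence of estimated pitch periods (jitter, PPQ) or peak amplitudes (APQ). The paper gives no closed formulas, so the sketch below follows the usual textbook definitions, with M = 2Z + 1 values per smoothing window as an assumption.

```python
import numpy as np

def jitter(periods):
    """Mean absolute cycle-to-cycle period change, relative to the mean period."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def perturbation_quotient(values, Z=2):
    """PPQ/APQ-style quotient over moving windows of M = 2Z + 1 values.

    For PPQ, `values` are pitch periods; for APQ, peak amplitudes.
    Measures the deviation of each value from its local M-point average,
    relative to the overall mean.
    """
    v = np.asarray(values, dtype=float)
    M = 2 * Z + 1
    local = np.convolve(v, np.ones(M) / M, mode="valid")  # moving average
    center = v[Z:len(v) - Z]                  # values aligned with the windows
    return np.mean(np.abs(center - local)) / np.mean(v)
```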
3.8 Proposed measurement: normalized harmonic energy variation

The proposed measurement is the variation of the harmonic energy between 0.1 Hz and 4 Hz, normalized by the high frequency energy. First, the harmonic energy is evaluated in time frames of 20 ms. This sequence of values is then filtered in order to keep the variations between 0.1 Hz and 4 Hz. Finally, the resulting value is normalized by the high frequency energy, measured between 2.5 kHz and 5.8 kHz.
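A possible implementation of the proposed measurement is sketched below. The band edges (0.1-4 Hz and 2.5-5.8 kHz) and the 20 ms frames follow the text; the estimator of the per-frame harmonic energy (here simply the frame energy below 2.5 kHz), the Butterworth band-pass filter and the final averaging are our assumptions, since the paper does not specify them. A sampling rate of at least 11.6 kHz is assumed so that the 5.8 kHz band exists.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def normalized_harmonic_energy_variation(y, sr, frame_ms=20):
    """Sketch of the Section 3.8 measurement (assumptions noted above)."""
    n = int(frame_ms / 1000 * sr)
    frames = y[: len(y) // n * n].reshape(-1, n)

    # per-frame energies: low band as harmonic-energy proxy, high band 2.5-5.8 kHz
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    e_harm = spec[:, freqs < 2500].sum(axis=1)
    e_high = spec[:, (freqs >= 2500) & (freqs <= 5800)].sum(axis=1)

    # keep only the 0.1-4 Hz variation of the harmonic-energy sequence
    frame_rate = sr / n                       # 50 frames per second
    sos = butter(2, [0.1, 4.0], btype="bandpass", fs=frame_rate, output="sos")
    variation = sosfiltfilt(sos, e_harm)

    # normalize the variation energy by the high-frequency energy
    return float(np.mean(variation ** 2) / (np.mean(e_high) + 1e-12))
```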

4 Results

This section presents the results of the experiments carried out in this paper. Table 2 shows the confusion matrix for the proposed set of features. The emotions listed are Neutral (N), Boredom (B), Sadness (S), Disgust (D), Fear (F), Happiness (H) and Anger (A). Each element represents the probability of assigning a given emotion; each column shows the distribution of the decisions for the patterns of one real emotion.

Table 2: Confusion matrix (%) for the proposed set of features. Columns (real emotion) are grouped into low stress (N, B, S), medium stress (D, F) and high stress (H, A).

                          Real emotion
              Low stress       Medium stress   High stress
Assigned     N      B      S      D      F      H      A
N           53%    15%     0%     0%     6%     8%     0%
B            5%    69%    17%     0%     0%     0%     0%
S            3%     5%    70%     0%     0%     0%     0%
D            2%     5%     7%    63%     0%     0%     1%
F           24%     0%     7%    25%    78%     0%     6%
H            9%     0%     0%    13%    17%    50%    18%
A            3%     5%     0%     0%     0%    42%    75%

As we can see, there exist three groups of emotions that tend to be confused with one another. Due to this result, we propose to group the original emotions into three sets:

Low stressed emotions. This set includes the neutral, boredom and sadness patterns.
Medium stressed emotions. This set includes disgust and fear.
High stressed emotions. This set includes happiness and anger.

Table 3 shows the classification errors (%) for the standard and the proposed sets of features, as a function of the number of classes. As we can see, the use of the proposed set of features improves the performance of the classifier, both in the 7-class problem and in the 3-class problem.

Table 3: Classification errors (%) for the standard and proposed sets of features, as a function of the number of classes

Number of classes    Standard set of features    Standard and proposed features
7 classes                   37.54%                        32.61%
3 classes                   19.05%                        15.93%

5 Conclusion

Stress is a reaction of a subject when facing daily mental, emotional or physical challenges. Continuous monitoring of a subject's stress level is a key point in understanding and controlling personal stress. Stress is expressed through physiological changes, emotional reactions and behavioral changes. Physiological changes include the increase of adrenaline produced to intensify concentration, the rise in heart rate and the acceleration of reflexes. Emotional reactions, in turn, can be expressed through changes in the prosody of speech. It is therefore important for society to find solutions for determining the instantaneous stress level, since stress can undermine both mental and physical health.

In this work, a set of features has been proposed that reduces the error of emotion detectors and improves the performance of current systems. Results demonstrate the feasibility of the proposed system, with error rates lower than 33% for seven types of emotion. It is important to highlight that subjective tests with native listeners have been carried out, yielding error rates of around 40%. This makes the proposed family of measurements useful for implementing emotion classification systems, even in cases where there is no prior knowledge about the speaker.

Acknowledgments

This work has been funded by the Spanish Ministry of Education and Science (TEC C03-03) and under project UAH2011/EXP-028.

References:

[1] P.Y. Oudeyer, The production and recognition of emotions in speech: features and algorithms, International Journal of Human-Computer Studies 59 (2003).
[2] R. Barra, J. Macias-Guarasa, J.M. Montero, C. Rincon, F. Fernandez and R. Cordoba, In search of primary rubrics for language independent emotional speech identification, International Symposium on Intelligent Signal Processing (2007).
[3] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier and B. Weiss, A database of German emotional speech, Proceedings of Interspeech (2005).
[4] A. Paeschke, Global trend of fundamental frequency in emotional speech, Proceedings of Speech Prosody (2004).
[5] D. Ververidis and C. Kotropoulos, Emotional speech recognition: Resources, features, and methods, Speech Communication 48 (2006), no. 9.
[6] D. Ververidis, C. Kotropoulos and I. Pitas, Automatic emotional speech classification, Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2004), vol. 1.
[7] K. Hammerschmidt and U. Jürgens, Acoustical correlates of affective prosody, Journal of Voice 21 (2007).
[8] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing 28 (1980).
[9] H.L. Van Trees, Detection, Estimation, and Modulation Theory, vol. 1, Wiley.
[10] R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, Wiley-Interscience.
[11] A.C. Davison and D.V. Hinkley, Bootstrap Methods and their Application, Cambridge University Press.
