A Survey on Voice Activity Detection Methods

e-ISSN 2455-1392, Volume 2, Issue 4, April 2016, pp. 668-675
Scientific Journal Impact Factor: 3.468
http://www.ijcter.com

Shabeeba T. K.¹, Anand Pavithran²
¹,² Department of Computer Science and Engineering, MES College of Engineering, Kuttippuram, Kerala, 679573, India

Abstract
Voice Activity Detection (VAD) is a speech processing technique that detects the presence or absence of human speech. It facilitates speech processing and can also be used to deactivate some processes during the non-speech sections of an audio session. Various VAD algorithms have been developed that offer different features and different compromises between latency, sensitivity, accuracy, and computational cost. VAD methods formulate the decision rule on a frame-by-frame basis using instantaneous measures of the divergence distance between speech and noise. The measures used in VAD methods include spectral slope, correlation coefficients, log-likelihood ratio, cepstral, weighted cepstral, and modified distance measures. Statistical and machine learning methods have recently been used for VAD. This paper surveys several VAD methods.

Keywords
Voice Activity Detection, Deep Belief Network

I. INTRODUCTION
Determining the beginning and the end of speech in the presence of background noise is a complicated problem. A voice activity detector (VAD) tries to separate speech signals from background noise. The result of a VAD decision is a binary value that indicates the presence of speech in the input signal (for example, an output value of 1) or the presence of noise only (for example, an output value of 0). For automatic speech recognition, endpoint detection is required to isolate the speech of interest so that a speech pattern or template can be created.
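As a minimal illustration of the binary, frame-by-frame output described above, the sketch below labels each frame 1 (speech) or 0 (noise only) by comparing its log energy with a fixed threshold. The energy feature, the threshold of -30 dB, the frame length of 160 samples, and the function name `frame_vad` are assumptions chosen for illustration; this is not one of the surveyed methods.

```python
import math

def frame_vad(samples, frame_len=160, threshold_db=-30.0):
    """Return one binary decision per frame: 1 = speech, 0 = noise only.

    Assumption: log energy with a fixed threshold stands in for a real
    VAD feature and decision rule.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean power of the frame, converted to decibels.
        energy = sum(x * x for x in frame) / frame_len
        energy_db = 10.0 * math.log10(energy + 1e-12)
        decisions.append(1 if energy_db > threshold_db else 0)
    return decisions

# Synthetic input: one very quiet frame followed by one loud 200 Hz tone frame
# (8 kHz sampling rate assumed).
quiet = [0.001 * math.sin(0.1 * i) for i in range(160)]
loud = [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(160)]
print(frame_vad(quiet + loud))
```

A fixed threshold is only reasonable when the noise level is known in advance; the methods surveyed in Section II replace both the feature and the decision rule with more robust alternatives.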
A VAD algorithm is an integral part of a variety of speech communication systems, such as speech recognition and speech coding in mobile phones and IP telephony. In telecommunication systems an effective VAD algorithm plays an important role, especially in automatic speech recognition (ASR) systems. It can be used to deactivate some processes during the non-speech sections of an audio session. It also reduces computation by eliminating unnecessary transmission and processing of non-speech segments, and it reduces potential misrecognition errors in non-speech segments.

In several speech communication scenarios, it is useful to use discontinuous transmission (DTX). In the wireless (cell phone) case, avoiding transmission during speech pauses prolongs battery life in portable units and reduces interference to other wireless users (users in nearby cells using the same frequencies) [7]. For conversational speech, each side normally talks less than 50% of the time.

The typical design of a VAD algorithm is as follows:
1) There may be a noise reduction stage.
2) Some features or quantities are calculated from a section of the input signal.
3) A classification rule is applied to classify the section as speech or nonspeech.

VADs are classified as follows [1]:
VADs in standard speech processing systems.

@IJCTER-2016, All rights Reserved 668

Statistical signal processing based VADs.
Supervised machine learning based VADs.
Unsupervised machine learning based VADs.

This paper pays particular attention to the study of different VAD methods and their performance evaluation. Five VAD methods are considered, and their advantages and disadvantages are listed.

II. VAD METHODS
Research is continually conducted to improve the efficiency and accuracy of voice activity detection. This section briefly presents some effective approaches to voice activity detection.

A. Discriminative Training for Multiple Observation Likelihood Ratio Based Voice Activity Detection
Compared with decisions made from a single instantaneous observation, VAD decisions made from multiple observations reduce the miss-hit rate in the speech offset region or the false-alarm rate in the nonstationary noise region, because they take advantage of the strong correlation between consecutive time-frames of speech. The paper [2] proposes a supervised machine learning based VAD in which two discriminative training methods are studied for the effective combination of multiple observation likelihood ratios (LRs), in terms of misclassification errors and receiver operating characteristic (ROC) curves.

1) Signal Model and Single Observation LLR: Assume that the speech is degraded by uncorrelated additive noise. Under the two hypotheses $H_0$ (speech-pause) and $H_1$ (speech-active), the observation in the short-time Fourier transform (STFT) domain can be written as

$$H_0 \text{ (speech-pause)}: \; x_{k,t} = n_{k,t}$$
$$H_1 \text{ (speech-active)}: \; x_{k,t} = s_{k,t} + n_{k,t} \qquad (2.1)$$

where $k$ and $t$ are the frequency-bin and time-frame indices, respectively.

2) Multiple Observation LLRs: Incorporating contextual information into the decision rule increases the robustness of VAD in noisy conditions.
Suppose that a collection of $M$ sequential log-likelihood ratios (LLRs) ending at the current time-frame $t$, denoted as $\mathbf{l}_t = \{l_t, l_{t-1}, \ldots, l_{t-M+1}\}^T$, is used to make the VAD decision for the current time-frame $t$. A new statistic that reflects the dependence on the current time-frame as well as on its previous $M-1$ time-frames can be expressed as

$$\tilde{l}_t = \mathbf{w}^T \mathbf{l}_t = \sum_{i=1}^{M} w_i \, l_{t-i+1}$$

where $\mathbf{w} = \{w_1, w_2, \ldots, w_M\}^T$ is a vector of the combination weights for the different time-frames. The decision rule can then be established as

$$\tilde{l}_t \underset{H_0}{\overset{H_1}{\gtrless}} \eta$$

where $\eta$ is the decision threshold.

3) Discriminative Training: In discriminative training, VAD performance is directly associated with a designed objective function, which can be optimized on the training data.
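The multiple-observation decision described in 2) can be sketched as follows. The example is a hedged illustration only: it assumes the combined statistic is a weighted sum of the current and previous LLRs, and it uses hand-picked weights and a threshold of 0 rather than the discriminatively trained weights of [2]. The function name `mo_llr_decisions` and the LLR values are hypothetical.

```python
def mo_llr_decisions(llrs, weights, threshold=0.0):
    """Combine the current and previous M-1 LLRs with weights, then threshold.

    weights[0] applies to the current frame and weights[i] to frame t - i.
    Assumption: frames before the start of the signal are simply skipped.
    """
    decisions = []
    for t in range(len(llrs)):
        stat = sum(w * llrs[t - i] for i, w in enumerate(weights) if t - i >= 0)
        decisions.append(1 if stat > threshold else 0)
    return decisions

# Hypothetical per-frame LLRs with a brief dip (frame 4) inside a speech burst.
llrs = [-1.0, -0.5, 2.0, 1.5, -0.2, 1.0, -1.5, -2.0]

single = mo_llr_decisions(llrs, [1.0])           # M = 1: single observation
multi = mo_llr_decisions(llrs, [0.5, 0.3, 0.2])  # M = 3: multiple observations

print(single)
print(multi)
```

With these values the single-observation rule drops frame 4 from the speech burst, while the three-frame combination keeps it, which is the kind of miss-hit reduction the text attributes to multiple observations.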

1) Minimal Classification Error Training
Suppose there is a set of labeled LLRs for training, denoted as $L = \{L_s, L_p\}$, where $L_s = \{l_i^s, \; i = 1, 2, \ldots, n_s\}$ and $L_p = \{l_j^p, \; j = 1, 2, \ldots, n_p\}$ represent the portions of the training set containing all the LLRs labeled as speech-active and speech-pause, respectively. Minimum classification error (MCE) training is a well-known discriminative training approach that aims at minimizing the misclassification errors over the entire training set. The MCE loss function is defined over this labeled training set [2]. Minimizing the MCE loss improves VAD performance by reducing both types of errors (i.e., miss-hit errors and false-alarm errors).

2) Maximal Area Under the ROC Curve Training
ROC curves are frequently used to describe VAD performance completely. A ROC curve is drawn by varying the decision threshold to show the relationship between the speech-hit rate (HR1), defined as the fraction of all actual speech frames that are correctly classified as speech-active frames, and the false-alarm rate (FAR0), defined as the fraction of all actual speech-pause (i.e., noise-only) frames that are incorrectly classified as speech frames. As illustrated in Figure 1, the closer the ROC curve is to the upper left corner, the better the classifier's ability to discriminate between the two classes. Thus, the area under the ROC curve (AUC) is a general, robust measure of classifier discrimination performance, regardless of the decision threshold, which may be unknown, may change over time, or may depend on how the classifier will be used in practical applications.

Fig. 1. Illustration of ROC curve and AUC

B. Support Vector Machine Based VAD
The SVM based VAD [3] employs three effective feature vectors as its principal parameters: the a posteriori SNR, the a priori SNR, and a predicted SNR.

1) Feature Vector Extraction: A noise signal d is added to a speech signal s, with their sum being denoted by x.
Taking the discrete Fourier transform (DFT) gives the noise spectrum $D$, the clean speech spectrum $S$, and the noisy speech spectrum $X$ such that

$$X_k(n) = S_k(n) + D_k(n)$$

where $k$ is the frequency-bin index ($k = 0, 1, \ldots, M-1$) and $n$ is the frame index ($n = 0, 1, \ldots$). Assuming that the speech is degraded by uncorrelated additive noise, there are two hypotheses for each frame:

$$H_0 \text{ (speech absent)}: \; \mathbf{X}(n) = \mathbf{D}(n)$$
$$H_1 \text{ (speech present)}: \; \mathbf{X}(n) = \mathbf{S}(n) + \mathbf{D}(n) \qquad (2.6)$$

in which $\mathbf{X}(n)$, $\mathbf{D}(n)$, and $\mathbf{S}(n)$ denote the DFT coefficients at the $n$-th frame of the noisy speech, noise, and clean speech, respectively.

The first feature is the a posteriori SNR $\gamma_k(n)$, which is derived from the ratio of the input signal $X_k(n)$ to the variance $\lambda_{d,k}(n)$ of the noise signal $D_k(n)$, updated in periods of speech absence. The second feature is the a priori SNR, which is calculated using a decision-directed approach. The third feature is the predicted SNR, which is estimated from the long-term smoothed power spectra of the background noise and speech. The noise and speech power spectra used for the predicted SNR, written here as $\hat{\lambda}_{d,k}(n)$ and $\hat{\lambda}_{s,k}(n)$, are estimates of $\lambda_{d,k}(n)$ and $\lambda_{s,k}(n)$. The parameters $\delta_d$ (= 0.98) and $\delta_s$ (= 0.98) are the experimentally chosen smoothing values for $D_k(n)$ and $S_k(n)$.

2) VAD based on SVM: The SVM makes it possible to build an optimal hyperplane that separates the classes without error and maximizes the distance between the hyperplane and the closest vectors. Given training data consisting of $N$-dimensional patterns $\mathbf{x}_i$ and the corresponding class labels $z_i$, $(\mathbf{x}_1, z_1), \ldots, (\mathbf{x}_l, z_l)$, with $\mathbf{x} \in \mathbb{R}^N$ and $z \in \{+1, -1\}$, the equation for the hyperplane is given by

$$\langle \mathbf{w}, \mathbf{x} \rangle + b = 0$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\langle \mathbf{u}, \mathbf{v} \rangle$ represents the inner product between $\mathbf{u}$ and $\mathbf{v}$. The SVM inherently offers support vectors $\mathbf{x}_i^*$ ($i = 1, \ldots, k$) and an optimized bias $b^*$ from the training data, and the output function of the linear SVM for an input vector $\mathbf{x}$ is

$$f(\mathbf{x}) = \sum_{i=1}^{k} \alpha_i^* z_i \langle \mathbf{x}_i^*, \mathbf{x} \rangle + b^*$$

where $\alpha_i^*$ are the optimized Lagrange multipliers of the support vectors.
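The linear SVM output function described above can be sketched in a few lines. This is an illustrative toy, not the trained model of [3]: the two support vectors, their multipliers (written `alphas`), and the bias are chosen by hand so that the margin constraints hold exactly, and the three-element inputs merely stand in for the three SNR features. All names are hypothetical.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def svm_output(x, support_vectors, labels, alphas, bias):
    """Linear SVM output: sum_i alpha_i * z_i * inner(x_i, x) + bias."""
    total = sum(a * z * inner(sv, x)
                for sv, z, a in zip(support_vectors, labels, alphas))
    return total + bias

def vad_decision(x, support_vectors, labels, alphas, bias):
    """Map the SVM output to a binary VAD decision (1 = speech, 0 = noise)."""
    return 1 if svm_output(x, support_vectors, labels, alphas, bias) > 0 else 0

# Hand-built toy model (assumption): one speech and one noise support vector.
support_vectors = [[3.0, 3.0, 3.0], [1.0, 1.0, 1.0]]
labels = [+1, -1]
alphas = [1.0 / 6.0, 1.0 / 6.0]
bias = -2.0

print(svm_output([3.0, 3.0, 3.0], support_vectors, labels, alphas, bias))
print(vad_decision([2.5, 2.5, 2.5], support_vectors, labels, alphas, bias))
print(vad_decision([1.2, 1.2, 1.2], support_vectors, labels, alphas, bias))
```

The output evaluates to +1 and -1 on the two support vectors, as expected for points lying on the margin. Swapping `inner` for a kernel function would give the nonlinear variant.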

Kernel functions are introduced in place of the linear kernel in order to handle a nonlinear input space. Evaluating the kernel is sometimes cheaper than computing the full feature mapping explicitly.

C. Maximum Margin Clustering Based Statistical VAD With Multiple Observation Compound Feature
Maximum margin clustering based VAD (MMC based VAD) [4] extends the idea of the SVM, which aims at finding a maximum margin hyperplane. A maximum margin hyperplane can be found in the feature space that leads to the minimal classification error.

1) Feature Extraction: A new feature called the multiple observation compound feature (MO-CF) is proposed. It takes advantage of both the statistical model and the multiple observation technique. Specifically, it consists of two sub-features. The first sub-feature of MO-CF is the multiple observation signal-to-noise ratio (MO-SNR) feature $\rho_{MO}$, which is derived from the single-observation SNR (SO-SNR). It has better control over the randomness of the SNR estimation and leads to a better speech detection rate (SD) than SO-SNR. However, MO-SNR also increases the false alarm rate (FA). To overcome this drawback, the multiple observation maximum probability (MO-MP) feature $\varphi$ is included as the second sub-feature. The $\varphi$ vector is derived from the revised MO-LRT (RMO-LRT) and inherits RMO-LRT's good FA performance. The major difference between MO-MP and RMO-LRT is that MO-MP consists of the LRT scores of all DFT bins under the maximum probabilistic global hypotheses, while RMO-LRT is a sum of the LRT scores. The former is therefore more informative than the latter. Although MO-MP can yield a higher SD than RMO-LRT, it is still inferior to MO-SNR on SD. In order to combine the merits of the two proposed sub-features, the MO-CF is defined as a combination of $\rho_{MO}$ and $\varphi$ in which a parameter $\beta$ balances the contributions of the two sub-features for the best overall performance.

D. VAD Based on Unsupervised Learning Framework
VADs are generally characterized by acoustic features and classifiers.
The paper [5] selects the smoothed subband logarithmic energy as the acoustic feature. The input signal is grouped into several Mel subbands in the frequency domain. The logarithmic energy is then calculated as the logarithm of the sum of the absolute magnitudes in each subband. Finally, it is smoothed to form an envelope for classification. Two Gaussian models are employed as the classifier to describe the logarithmic energy distributions of speech and nonspeech, respectively. These two models are incorporated into a two-component Gaussian mixture model (GMM), whose parameters are estimated in an unsupervised way. Speech/nonspeech classification is first conducted in each subband. Then, the decisions of all subbands are combined by a voting procedure.

1) Modeling Logarithmic Energy Distribution With GMM: Assuming that both speech and nonspeech log energies obey a Gaussian distribution, the bimodal distribution can be fitted by a two-component GMM, where the component with the smaller mean is identified as the nonspeech mode and the other component as the speech mode. This model is described by the following equations. Let $x_k$ denote the logarithmic energy of a subband at time $k$, and let $z$ be the speech/nonspeech label, $z \in \{0, 1\}$, where 0 denotes nonspeech and 1 denotes speech. According to Bayes' rule, we have the equation

$$p(x_k, z \mid \lambda) = p(x_k \mid z, \lambda) \, p(z)$$

where $p(z)$ is the prior probability of speech/nonspeech, which is equal to the weight coefficient $w_z$ ($w_0 + w_1 = 1$). $p(x_k \mid z, \lambda)$ represents the likelihood of $x_k$ given the speech/nonspeech model:

$$p(x_k \mid z, \lambda) = \frac{1}{\sqrt{2 \pi K_z}} \exp\left( -\frac{(x_k - \mu_z)^2}{2 K_z} \right)$$

where $\mu_z$ and $K_z$ denote the mean and variance, respectively, and $\lambda = \{\mu_z, K_z, w_z\}_{z=0,1}$ is the parameter set of the GMM. An interesting point is that the mean difference $\mu_1 - \mu_0$ represents the a posteriori SNR, because $\mu_1$ and $\mu_0$ are the averaged logarithmic energies of speech and nonspeech, respectively.

E. Deep Belief Networks based VAD
The DBN-based VAD first concatenates multiple acoustic features of an observation serially into a long feature vector, which is used as the visible layer (i.e., the input) of the DBN [1]. Then, a new feature is extracted by passing the long feature vector through multiple nonlinear hidden layers. Finally, the class of the observation is predicted by a linear classifier (i.e., the softmax output layer) of the DBN with the new feature as its input. Because VAD contains only two classes (i.e., $K = 2$), the prediction function of the DBN-based VAD chooses between the speech hypothesis $H_1$ and the noise hypothesis $H_0$ using a tunable decision threshold $\varepsilon$, usually set to 0. It is expressed in terms of $s_k$, which is in turn defined through $d_k$. In these definitions, $g^{(L)}(\cdot)$ is the activation function of the $L$-th hidden layer, the weights between two adjacent layers are indexed by $i$, the $i$-th unit of the $L$-th layer, and $j$, the $j$-th unit of the $(L-1)$-th layer, and $\{x_r\}_r$ is the input feature vector.

1) Deep Belief Networks: A DBN is a type of deep neural network; if trained successfully, it can achieve strong generalization ability with little training data. It is a probabilistic generative model that consists of multiple hidden layers of stochastic latent variables. The top two layers of the DBN have undirected, symmetric connections and form an associative memory. The other hidden layers form a top-down directed acyclic graph [6]. The units in the lowest layer are called visible units and represent an input feature vector. Each pair of successively connected layers forms a constituent module of the DBN, called a restricted Boltzmann machine (RBM); a DBN is therefore a stack of RBMs.

The training process of a DBN consists of two phases [1]. First, a greedy layer-wise unsupervised pre-training phase of the stacked RBMs finds initial parameters that are close to a good solution of the deep neural network. Then, a supervised back-propagation training phase fine-tunes the initial parameters. The key to the success of the DBN is the greedy layer-wise unsupervised pre-training of the RBM models. It acts as a regularizer of the supervised training phase that prevents the DBN from over-fitting to the training set.

Fig. 2. An RBM with l visible units and J hidden units.

Because the layer-wise unsupervised pre-training of the RBM models contributes to the success of the DBN, this training process is introduced below. An RBM is an energy-based, two-layer, bipartite, undirected stochastic graphical model, as shown in Figure 2. Specifically, one layer of the RBM is composed of visible units $v$, and the other layer is composed of hidden units $h$. There are symmetric connections between the two layers and no connections within each layer. The connection weights can be represented by a weight matrix $W$. Only the Bernoulli (visible)-Bernoulli (hidden) RBM is considered here, which means $v_i \in \{0, 1\}$ and $h_j \in \{0, 1\}$.
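The RBM tries to find a model that maximizes the likelihood of $v$, which is equivalent to maximizing the log-likelihood $\log P(v; W)$ with respect to the parameters, where the marginal distribution $P(v; W)$ is defined as

$$P(v; W) = \frac{1}{Z} \sum_{h} \exp\left( -\mathrm{Energy}(v, h; W) \right)$$

with $Z = \sum_{v} \sum_{h} \exp\left( -\mathrm{Energy}(v, h; W) \right)$ denoted as the partition function or normalization factor, and the energy model is given by

$$\mathrm{Energy}(v, h; W) = -b^T v - c^T h - h^T W v \qquad (2.22)$$

where $b$ and $c$ are the bias terms of the visible layer and the hidden layer, respectively.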
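To make the energy model (2.22) concrete, the sketch below evaluates the energy of a tiny Bernoulli-Bernoulli RBM and obtains the marginal $P(v; W)$ by brute-force enumeration of all binary states. It is an illustration under stated assumptions: the weight and bias values are arbitrary, the function names are hypothetical, and exhaustive enumeration is only feasible for a handful of units. Practical RBM training does not compute the partition function this way.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """Energy(v, h; W) = -b^T v - c^T h - h^T W v, with W[j][i] linking h_j and v_i."""
    bv = sum(bi * vi for bi, vi in zip(b, v))
    ch = sum(cj * hj for cj, hj in zip(c, h))
    hWv = sum(h[j] * W[j][i] * v[i]
              for j in range(len(h)) for i in range(len(v)))
    return -bv - ch - hWv

def binary_states(n):
    """All 2**n binary vectors of length n."""
    return list(itertools.product([0, 1], repeat=n))

def unnormalized(v, W, b, c):
    """Sum of exp(-Energy) over all hidden states for a fixed visible vector."""
    return sum(math.exp(-energy(v, h, W, b, c)) for h in binary_states(len(c)))

def marginal(v, W, b, c):
    """P(v; W): unnormalized probability divided by the partition function Z."""
    Z = sum(unnormalized(u, W, b, c) for u in binary_states(len(b)))
    return unnormalized(v, W, b, c) / Z

# Arbitrary toy parameters (assumption): 2 visible units, 1 hidden unit.
W = [[2.0, -1.0]]
b = [0.5, 0.0]
c = [0.1]

print(energy((1, 0), (1,), W, b, c))
print(sum(marginal(v, W, b, c) for v in binary_states(2)))
```

Lower energy corresponds to higher probability, and the marginals over all visible states sum to 1 up to floating-point rounding, which is a quick sanity check on the partition function.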

III. PERFORMANCE ANALYSIS
The previous section covers several voice activity detection methods, including statistical, supervised machine learning, unsupervised machine learning, and DBN based methods. This section compares those methods and lists their advantages and disadvantages.

The Multiple Observation Likelihood Ratio Based VAD is robust in noisy conditions, but its disadvantage is its high computational complexity. The SVM based VAD makes use of a time-varying signal-to-noise ratio and the kernel trick, but choosing the kernel is a complex task. In MMC based VAD the training data are not labeled; as a result, its computational complexity is high, although it performs well at low SNR levels. The VAD Based on Unsupervised Learning Framework does not rely on a nonspeech beginning, but it uses only a simple acoustic feature for classification, so its output is less accurate. The DBN based VAD combines the advantages of multiple acoustic features so that the variations of the features can be described. However, as the number of features increases, the complexity of the network also increases, and voice activity detection takes more time.

IV. CONCLUSION
A voice activity detector (VAD) tries to separate speech signals from background noise. There are various VAD methods, including statistical, supervised machine learning, and unsupervised machine learning based methods. In this work several VAD methods are studied and their performance is evaluated. Among them, the DBN based VAD outperforms the others. The DBN-based VAD aims to extract a new feature that can fully express the advantages of all the acoustic features. The complexity of the DBN grows with the number of features, so it would be advantageous to achieve the same accuracy with a less complex network.
REFERENCES
[1] Xiao-Lei Zhang and Ji Wu, Deep Belief Networks Based Voice Activity Detection, IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, April 2013.
[2] Tao Yu and John H. L. Hansen, Discriminative Training for Multiple Observation Likelihood Ratio Based Voice Activity Detection, IEEE Signal Processing Letters, vol. 17, no. 11, November 2010.
[3] Ji Wu and Xiao-Lei Zhang, VAD based on statistical models and machine learning approaches, Computer Speech and Language, Elsevier, 2010.
[4] Ji Wu and Xiao-Lei Zhang, Maximum Margin Clustering Based Statistical VAD With Multiple Observation Compound Feature, IEEE Signal Processing Letters, vol. 18, no. 5, May 2011.
[5] Dongwen Ying, Yonghong Yan, Jianwu Dang, and Frank K. Soong, Voice Activity Detection Based on an Unsupervised Learning Framework, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, November 2011.
[6] D. Yu and L. Deng, Deep learning and its applications to signal and information processing, IEEE Signal Processing Magazine, vol. 28, no. 1, pp. 145-154, January 2011.
[7] Lawrence R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Pearson Education, January 2003.