Using Noise to Speed Up Video Classification with Recurrent Backpropagation


Olaoluwa Adigun, Department of Electrical Engineering, University of Southern California, Los Angeles, CA

Bart Kosko, Department of Electrical Engineering, University of Southern California, Los Angeles, CA

Abstract—Carefully injected noise can speed the convergence and accuracy of video classification with recurrent backpropagation (RBP). This noise boost uses the recent results that backpropagation is a special case of the generalized expectation-maximization (EM) algorithm and that careful noise injection can always speed the average convergence of the EM algorithm to a local maximum of the log-likelihood surface. We extend this result to the time-varying case of recurrent backpropagation and prove sufficient noise-benefit conditions for both classification and regression. Injecting noise that satisfies the noisy-EM positivity condition (NEM noise) speeds up RBP training. The classification simulations used eleven categories of sports videos based on standard UCF YouTube sports-action video clips. Training RBP with NEM noise in just the output neurons led to 60% fewer iterations in training as compared with noiseless training. This corresponded to a 20.6% maximum decrease in training cross entropy. NEM noise injection also outperformed simple blind noise injection: RBP training with NEM noise gave a 5.6% maximum decrease in training cross entropy compared with RBP training with blind noise. Injecting NEM noise also improved the relative classification accuracy by 5% over noiseless RBP training. NEM noise improved the classification accuracy from 81% to 83% on the test set.

I. NOISE BOOSTING RECURRENT BACKPROPAGATION

We show how carefully injected noise can speed up the convergence of recurrent backpropagation (RBP) for classifying time-varying patterns such as videos. Simulations used the 11 categories of sampled UCF YouTube sports videos [1], [2]. We prove a related result for speeding up RBP for regression. These new noise-benefit results turn on the recent result that backpropagation (BP) is a special case of the generalized expectation-maximization (EM) algorithm [3]. BP is the benchmark algorithm for training deep neural networks [4]–[6]. EM is the standard algorithm for finding maximum-likelihood parameters in the case of missing data or hidden parameters [7]. BP finds the network weights and parameters that minimize some error or other performance measure. It equivalently maximizes the network likelihood, and this permits the EM connection. The error function is usually cross entropy for classification and squared error for regression.

The Noisy EM (NEM) Theorem [8]–[10] gives a sufficient condition for noise to speed up the average convergence of the EM algorithm. It holds for additive or multiplicative or any other type of noise injection [10]. The NEM Theorem ensures that NEM noise improves training with the EM algorithm by ensuring that at each iteration NEM takes a larger step on average up the nearest hill of probability or log-likelihood. The BP algorithm is a form of maximum-likelihood estimation [11] and a special case of the generalized EM algorithm. So NEM noise also improves the convergence of BP [3]. It also tends to improve classification accuracy because the likelihood lower-bounds the accuracy [12]. The NEM noise benefit is a type of stochastic resonance where a small amount of noise improves the performance of a nonlinear system while too much noise harms the performance [13]–[21].
Figure 2 shows that noise-boosted RBP took only 8 iterations to reach the same cross-entropy value that noiseless RBP reached after 34 iterations. The noise benefit became more pronounced when we added more internal memory neurons. Figure 3 shows that the benefit reached diminishing returns with about 200 memory neurons. This 200-memory-neuron case produced a 5% improvement in the relative classification accuracy.

A recurrent neural network (RNN) is a neural network whose directed graph contains cycles or feedback. This feedback allows an RNN to sustain an internal memory state that can hold and reveal information about recent inputs. So the RNN output depends not only on the input but also on the recent output and on the internal memory state. This internal dynamical structure makes RNNs suitable for time-varying tasks such as speech processing [5], [22] and video recognition [23], [24]. RBP is the most common method for training RNNs [25], [26]. We apply the NEM-based RBP algorithm below to video recognition. The NEM noise boost applies in principle to any time-varying task of classification or regression.

The next section reviews the recent result that BP is a special case of generalized EM. Then Section III presents the new noisy RBP training theorems for classification and for regression with RNNs. We also present the noisy RBP algorithm for training long short-term memory (LSTM) networks [26] or recurrent learning from time-series data [27]. Section IV presents the results of NEM-noise training of an LSTM RNN for video classification.

II. BACKPROPAGATION AS GENERALIZED EXPECTATION MAXIMIZATION

This section states the recent result that backpropagation is a special case of the generalized expectation-maximization (EM) algorithm and that careful noise injection can always speed up EM convergence.

The next section states the companion result that carefully chosen noise can speed convergence of the EM algorithm on average and thereby noise-boost BP.

The BP-as-EM theorem casts BP in terms of maximum likelihood. Both classification and regression have the same BP (EM) learning laws but they differ in the type of neurons in their output layer. The output neurons of a classifier network are softmax or Gibbs neurons. The output neurons of a regression network are identity neurons. But the regression case assumes that the output training vectors are the vector population means of a multidimensional normal probability density with an identity population covariance matrix. We now restate the recent theorem that BP is a special case of the generalized EM (GEM) algorithm [3].

Theorem 1: Backpropagation as Generalized Expectation Maximization

The backpropagation update equation for a differentiable likelihood function $p(\mathbf{y}|\mathbf{x},\Theta)$ at epoch $n$

$$\Theta^{n+1} = \Theta^{n} + \eta\, \nabla_{\Theta} \ln p(\mathbf{y}|\mathbf{x},\Theta)\Big|_{\Theta=\Theta^{n}} \tag{1}$$

equals the GEM update equation at epoch $n$

$$\Theta^{n+1} = \Theta^{n} + \eta\, \nabla_{\Theta} Q(\Theta|\Theta^{n})\Big|_{\Theta=\Theta^{n}} \tag{2}$$

where GEM uses the differentiable Q-function

$$Q(\Theta|\Theta^{n}) = \mathbb{E}_{p(\mathbf{h}|\mathbf{y},\mathbf{x},\Theta^{n})}\big\{ \ln p(\mathbf{h},\mathbf{y}\,|\,\mathbf{x},\Theta) \big\}.$$

The next section shows how selective noise injection can speed up the average convergence of the EM algorithm.

A. Noise-boosting Backpropagation via EM

The Noisy EM Theorem shows that noise injection can only speed up the EM algorithm on average if the noise obeys the NEM positivity condition [8], [10]. The Noisy EM Theorem for additive noise states that a noise benefit holds at each iteration $n$ if the following positivity condition holds:

$$\mathbb{E}_{\mathbf{x},\mathbf{h},\mathbf{n}|\Theta^{*}}\left[ \ln \frac{p(\mathbf{x}+\mathbf{n},\mathbf{h}\,|\,\Theta^{n})}{p(\mathbf{x},\mathbf{h}\,|\,\Theta^{n})} \right] \geq 0. \tag{3}$$

Then the EM noise benefit

$$Q(\Theta^{n}|\Theta^{*}) \leq Q_{N}(\Theta^{n}|\Theta^{*}) \tag{4}$$

holds on average at iteration $n$:

$$\mathbb{E}_{\mathbf{x},\mathbf{n}|\Theta^{n}}\big[ Q(\Theta^{*}|\Theta^{*}) - Q_{N}(\Theta^{n}|\Theta^{*}) \big] \leq \mathbb{E}_{\mathbf{x}|\Theta^{n}}\big[ Q(\Theta^{*}|\Theta^{*}) - Q(\Theta^{n}|\Theta^{*}) \big]$$

where $\Theta^{*}$ denotes the maximum-likelihood vector of parameters. The NEM positivity condition (3) has a simple form for Gaussian mixture models [28] and for classification and regression networks [3].

The intuition behind the NEM sufficient condition (3) is that some noise realizations $\mathbf{n}$ make a signal $\mathbf{x}$ more probable: $f(\mathbf{x}+\mathbf{n}|\Theta) \geq f(\mathbf{x}|\Theta)$. Taking logarithms gives $\ln \frac{f(\mathbf{x}+\mathbf{n}|\Theta)}{f(\mathbf{x}|\Theta)} \geq 0$. Taking expectations gives a NEM-like positivity condition. The actual proof of the NEM Theorem uses Kullback-Leibler divergence to show that the noise-boosted likelihood is closer on average at each iteration to the optimal likelihood function than is the noiseless likelihood [10].

The NEM positivity inequality (3) is not vacuous because the expectation conditions on the converged parameter vector $\Theta^{*}$. Vacuity would result in the usual case of averaging a log-likelihood ratio. Take the expectation of the log-likelihood ratio $\ln \frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}$ with respect to the probability density function $g(\mathbf{x}|\Theta)$ to give $\mathbb{E}_{g}\big[\ln \frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}\big]$. Then Jensen's inequality and the concavity of the logarithm give

$$\mathbb{E}_{g}\Big[\ln \frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}\Big] \leq \ln \mathbb{E}_{g}\Big[\frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}\Big] = \ln \int_{X} \frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}\, g(\mathbf{x}|\Theta)\, d\mathbf{x} = \ln \int_{X} f(\mathbf{x}|\Theta)\, d\mathbf{x} = \ln 1 = 0.$$

So $\mathbb{E}_{g}\big[\ln \frac{f(\mathbf{x}|\Theta)}{g(\mathbf{x}|\Theta)}\big] \leq 0$ and thus in this case strict positivity is impossible [10]. But the expectation in (3) does not in general lead to this cancellation of probability densities because the integrating density in (3) depends on the optimal maximum-likelihood parameter $\Theta^{*}$ rather than on just $\Theta^{n}$. So density cancellation occurs only when the NEM algorithm has converged to a local likelihood maximum because then $\Theta^{n} = \Theta^{*}$.

The NEM Theorem simplifies for a classifier network with softmax output neurons. Then the additive noise must lie above the defining NEM hyperplane [3].
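This hyperplane picture admits a short numerical sketch. The Python snippet below is not from the paper: the zero-mean uniform noise source, its 0.1 scale, and the reject-to-zero fallback are illustrative assumptions. It screens a candidate additive noise vector for one softmax output activation and keeps it only if it lies on the nonnegative side of the hyperplane whose normal is the vector of log output activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def nem_noise_softmax(a_y, scale=0.1):
    """Draw additive output noise for one softmax activation vector and keep it
    only if it satisfies the NEM hyperplane condition n . log(a_y) >= 0.
    Otherwise fall back to zero (noiseless) noise."""
    n = rng.uniform(-scale, scale, size=a_y.shape)   # zero-mean uniform: an illustrative choice
    log_a = np.log(a_y + 1e-12)                      # log activations give the hyperplane normal
    return n if float(n @ log_a) >= 0.0 else np.zeros_like(n)

# toy usage with a 3-class softmax activation vector
a_y = np.array([0.7, 0.2, 0.1])
print(nem_noise_softmax(a_y))
```

Because the noise is symmetric about zero, roughly half of the candidate draws pass the screen; the rest default to a noiseless step.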
A similar NEM result holds for regression except that the noise-benefit region is a hypersphere. NEM noise can also inject into the hidden-neuron layers in accord with their likelihood functions [12]. The next section extends these results to recurrent BP for classification and regression.

III. NOISE-BOOSTED RECURRENT BACKPROPAGATION

We next develop the recurrent backpropagation (RBP) algorithm and then show how to noise-boost it using the NEM Theorem. NEM noise can inject in all neurons [12]. This paper focuses only on noise injection into the output neurons.

A. Recurrent Backpropagation

This section presents the learning algorithm for training a Long Short-Term Memory (LSTM) network with RBP. The LSTM is a deep RNN architecture suitable for problems with long time-lag dependencies [26]. The LSTM network consists of $I$ input neurons, $J$ hidden neurons, and $K$ output neurons. We also have control gates for the input, hidden, and output layers. The input gate controls the input activation. The forget gate controls the hidden activation. And the output gate controls the output activation. These gates each have $J$ neurons. The LSTM network captures the dependency of the input data with respect to time. The time $t$ can take on any value in $\{1, 2, \ldots, T\}$.

The $J \times I$ matrix $\mathbf{W}^{x}$ connects the hidden neurons to the input neurons. The $K \times J$ matrix $\mathbf{U}^{y}$ connects the output neurons to the hidden neurons. Superscripts $x$, $h$, and $y$ denote the respective input, hidden, and output layers. The $J \times I$ matrix $\mathbf{W}^{\gamma}$ connects the input gates to the input neurons and the $J \times K$ matrix $\mathbf{V}^{\gamma}$ connects the input gates to the output neurons.

The $J \times I$ matrix $\mathbf{W}^{\phi}$ connects the forget gates to the input neurons and the $J \times K$ matrix $\mathbf{V}^{\phi}$ connects the forget gates to the output neurons. The $J \times I$ matrix $\mathbf{W}^{\omega}$ connects the output gates to the input neurons and the $J \times K$ matrix $\mathbf{V}^{\omega}$ connects the output gates to the output neurons. Superscripts $\gamma$, $\phi$, and $\omega$ denote the respective input, forget, and output gates.

The training involves the forward pass of the input vector and the backward pass of the error. The forward pass propagates the input vector from the input layer to the output layer that has output activation $\mathbf{a}^{y}$. The backward pass propagates the error from the output and hidden layers back to the input layer and updates the network's weights. We alternate these two passes starting with the forward pass until the network converges. RBP seeks those weights that minimize the error function $E(\Theta)$ or maximize the likelihood $L(\Theta)$.

The forward pass initializes the internal memory $h_{j}^{(0)}$ to 0 for all values of $j$. The output activation $a_{k}^{y(0)}$ is 0 for all values of $k$. We then compute $a_{k}^{y(t)}$ for $t = 1, 2, \ldots, T$. The input vector $\mathbf{x}^{(t)}$ has $I$ input neuron values. The input neurons have identity activations: $a_{i}^{x(t)} = x_{i}^{(t)}$ where $a_{i}^{x(t)}$ denotes the activation of the $i$-th input neuron due to the input vector $\mathbf{x}^{(t)}$ at time $t$. The term $o_{j}^{x(t)}$ denotes the $j$-th value of the transformation from the input space to the hidden state:

$$o_{j}^{x(t)} = \sum_{i=1}^{I} w_{ji}^{x} a_{i}^{x(t)} + b_{j}^{x} \tag{5}$$

where $b_{j}^{x}$ is the bias for the $j$-th hidden neuron due to the input activation. This transformation uses the logistic sigmoid activation function. The term $a_{j}^{x(t)}$ denotes the activation of the $j$-th hidden neuron with respect to the input-space transformation of $\mathbf{x}^{(t)}$:

$$a_{j}^{x(t)} = \frac{1}{1 + e^{-o_{j}^{x(t)}}}. \tag{6}$$

The term $o_{j}^{\gamma(t)}$ is the input to the $j$-th input gate at time $t$:

$$o_{j}^{\gamma(t)} = \sum_{i=1}^{I} w_{ji}^{\gamma} a_{i}^{x(t)} + \sum_{k=1}^{K} v_{jk}^{\gamma} a_{k}^{y(t-1)} + b_{j}^{\gamma} \tag{7}$$

where $b_{j}^{\gamma}$ is the bias for the $j$-th input gate and $a_{k}^{y(t-1)}$ is the output activation for time $t-1$. The activation $a_{j}^{\gamma(t)}$ for the $j$-th input gate at time $t$ also uses the logistic sigmoid:

$$a_{j}^{\gamma(t)} = \frac{1}{1 + e^{-o_{j}^{\gamma(t)}}}. \tag{8}$$

The term $o_{j}^{h(t)}$ denotes the input to the $j$-th hidden neuron at time $t$:

$$o_{j}^{h(t)} = a_{j}^{x(t)} \circ a_{j}^{\gamma(t)} \tag{9}$$

where $\circ$ denotes the Hadamard product. The term $o_{j}^{\phi(t)}$ denotes the input to the $j$-th forget gate. It has the form

$$o_{j}^{\phi(t)} = \sum_{i=1}^{I} w_{ji}^{\phi} a_{i}^{x(t)} + \sum_{k=1}^{K} v_{jk}^{\phi} a_{k}^{y(t-1)} + b_{j}^{\phi} \tag{10}$$

where $b_{j}^{\phi}$ is the bias for the $j$-th forget gate. The term $s_{j}^{(t)}$ denotes the state of the $j$-th internal memory at time $t$:

$$s_{j}^{(t)} = o_{j}^{h(t)} + s_{j}^{(t-1)} \circ a_{j}^{\phi(t)} \tag{11}$$

where $s_{j}^{(t-1)}$ is the state of the $j$-th internal memory at time $t-1$. The activation $a_{j}^{s(t)}$ of the $j$-th internal memory is also logistic:

$$a_{j}^{s(t)} = \frac{1}{1 + e^{-s_{j}^{(t)}}} \tag{12}$$

where the superscript $s$ denotes the internal memory. The term $o_{j}^{\omega(t)}$ denotes the input to the $j$-th output gate. It has the form

$$o_{j}^{\omega(t)} = \sum_{i=1}^{I} w_{ji}^{\omega} a_{i}^{x(t)} + \sum_{k=1}^{K} v_{jk}^{\omega} a_{k}^{y(t-1)} + b_{j}^{\omega} \tag{13}$$

where $b_{j}^{\omega}$ is the bias for the $j$-th output gate. The term $a_{j}^{\omega(t)}$ is the activation of the $j$-th output gate at time $t$. The output gate also uses the logistic sigmoid:

$$a_{j}^{\omega(t)} = \frac{1}{1 + e^{-o_{j}^{\omega(t)}}}. \tag{14}$$

The activation $a_{j}^{h(t)}$ of the $j$-th hidden neuron at time $t$ obeys

$$a_{j}^{h(t)} = a_{j}^{s(t)} \circ a_{j}^{\omega(t)} \tag{15}$$

where $a_{j}^{s(t)}$ and $a_{j}^{\omega(t)}$ are from (12) and (14). The term $o_{k}^{y(t)}$ is the input to the $k$-th output neuron at time $t$:

$$o_{k}^{y(t)} = \sum_{j=1}^{J} u_{kj}^{y} a_{j}^{h(t)} + b_{k}^{y} \tag{16}$$

where $b_{k}^{y}$ is the bias for the $k$-th output neuron. The output-layer neurons use softmax activations because this is a classification network [3], [11]. The $k$-th output activation $a_{k}^{y(t)}$ has the softmax form

$$a_{k}^{y(t)} = \frac{\exp\big(o_{k}^{y(t)}\big)}{\sum_{l=1}^{K} \exp\big(o_{l}^{y(t)}\big)}. \tag{17}$$

The backward pass computes the error and its derivatives with respect to the weights. The error function $E^{(t)}$ is the cross entropy because this is a classifier network [11]:

$$E(\Theta) = \sum_{t=1}^{T} E^{(t)} = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{k}^{(t)} \ln a_{k}^{y(t)}. \tag{18}$$
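As a concreteness check, here is a minimal numpy sketch of one forward time step that follows equations (5)–(17) as reconstructed above. It is not the authors' code: the parameter names, the small random initialization, and the toy sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def init_params(I, J, K):
    """Weights named after the text: W* are J x I, V* are J x K, U_y is K x J."""
    p = {}
    for name in ("x", "gamma", "phi", "omega"):
        p["W_" + name] = 0.1 * rng.standard_normal((J, I))
        p["b_" + name] = np.zeros(J)
    for name in ("gamma", "phi", "omega"):
        p["V_" + name] = 0.1 * rng.standard_normal((J, K))
    p["U_y"] = 0.1 * rng.standard_normal((K, J))
    p["b_y"] = np.zeros(K)
    return p

def lstm_step(p, x_t, a_y_prev, s_prev):
    """One forward pass through equations (5)-(17)."""
    a_x = sigmoid(p["W_x"] @ x_t + p["b_x"])                                      # (5)-(6)
    a_gam = sigmoid(p["W_gamma"] @ x_t + p["V_gamma"] @ a_y_prev + p["b_gamma"])  # (7)-(8) input gate
    o_h = a_x * a_gam                                                             # (9)
    a_phi = sigmoid(p["W_phi"] @ x_t + p["V_phi"] @ a_y_prev + p["b_phi"])        # (10) forget gate (logistic)
    s = o_h + s_prev * a_phi                                                      # (11) internal memory
    a_s = sigmoid(s)                                                              # (12)
    a_om = sigmoid(p["W_omega"] @ x_t + p["V_omega"] @ a_y_prev + p["b_omega"])   # (13)-(14) output gate
    a_h = a_s * a_om                                                              # (15)
    o_y = p["U_y"] @ a_h + p["b_y"]                                               # (16)
    a_y = np.exp(o_y - o_y.max()); a_y /= a_y.sum()                               # (17) numerically stable softmax
    return a_y, s, a_h

# toy usage: I = 4 inputs, J = 3 hidden neurons, K = 2 output classes
p = init_params(4, 3, 2)
a_y, s, a_h = lstm_step(p, rng.random(4), np.zeros(2), np.zeros(3))
```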
The derivative of the error term $E^{(t)}$ with respect to $u_{kj}^{y}$ for $t = T, T-1, \ldots, 1$ is

$$\frac{\partial E^{(t)}}{\partial u_{kj}^{y}} = \frac{\partial E^{(t)}}{\partial a_{k}^{y(t)}} \frac{\partial a_{k}^{y(t)}}{\partial o_{k}^{y(t)}} \frac{\partial o_{k}^{y(t)}}{\partial u_{kj}^{y}} = \big(a_{k}^{y(t)} - y_{k}^{(t)}\big)\, a_{j}^{h(t)}. \tag{19}$$

The partial derivative of $E^{(t)}$ with respect to $b_{k}^{y}$ is

$$\frac{\partial E^{(t)}}{\partial b_{k}^{y}} = \frac{\partial E^{(t)}}{\partial a_{k}^{y(t)}} \frac{\partial a_{k}^{y(t)}}{\partial o_{k}^{y(t)}} \frac{\partial o_{k}^{y(t)}}{\partial b_{k}^{y}} = a_{k}^{y(t)} - y_{k}^{(t)}. \tag{20}$$

Let $E_{j}'^{(t)}$ denote the derivative of $E^{(t)}$ with respect to $a_{j}^{h(t)}$. Then

$$E_{j}'^{(t)} = \frac{\partial E^{(t)}}{\partial a_{j}^{h(t)}} = \sum_{k=1}^{K} \big(a_{k}^{y(t)} - y_{k}^{(t)}\big)\, u_{kj}^{y}. \tag{21}$$

We now list the many other partial derivatives in the RBP algorithm. The derivative of $E^{(t)}$ with respect to $w_{ji}^{\omega}$ is

$$\frac{\partial E^{(t)}}{\partial w_{ji}^{\omega}} = \frac{\partial E^{(t)}}{\partial a_{j}^{h(t)}} \frac{\partial a_{j}^{h(t)}}{\partial a_{j}^{\omega(t)}} \frac{\partial a_{j}^{\omega(t)}}{\partial o_{j}^{\omega(t)}} \frac{\partial o_{j}^{\omega(t)}}{\partial w_{ji}^{\omega}} = E_{j}'^{(t)}\, a_{j}^{s(t)}\, a_{j}^{\omega(t)}\big(1 - a_{j}^{\omega(t)}\big)\, a_{i}^{x(t)}. \tag{22}$$

The derivative of $E^{(t)}$ with respect to $v_{jk}^{\omega}$ is

$$\frac{\partial E^{(t)}}{\partial v_{jk}^{\omega}} = E_{j}'^{(t)}\, a_{j}^{s(t)}\, a_{j}^{\omega(t)}\big(1 - a_{j}^{\omega(t)}\big)\, a_{k}^{y(t-1)}. \tag{23}$$

The derivative of $E^{(t)}$ with respect to $b_{j}^{\omega}$ is

$$\frac{\partial E^{(t)}}{\partial b_{j}^{\omega}} = E_{j}'^{(t)}\, a_{j}^{s(t)}\, a_{j}^{\omega(t)}\big(1 - a_{j}^{\omega(t)}\big). \tag{24}$$

The derivative of $E^{(t)}$ with respect to $s_{j}^{(t)}$ is

$$\frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} = \frac{\partial E^{(t)}}{\partial a_{j}^{h(t)}} \frac{\partial a_{j}^{h(t)}}{\partial a_{j}^{s(t)}} \frac{\partial a_{j}^{s(t)}}{\partial s_{j}^{(t)}} = E_{j}'^{(t)}\, a_{j}^{\omega(t)}\, a_{j}^{s(t)}\big(1 - a_{j}^{s(t)}\big). \tag{25}$$

Let $a_{j}'^{\phi(t)}$ denote the derivative of $a_{j}^{\phi(t)}$ with respect to $o_{j}^{\phi(t)}$. The derivative of $a_{j}^{\phi(t)}$ with respect to $o_{j}^{\phi(t)}$ is

$$a_{j}'^{\phi(t)} = \frac{\partial a_{j}^{\phi(t)}}{\partial o_{j}^{\phi(t)}} = a_{j}^{\phi(t)}\big(1 - a_{j}^{\phi(t)}\big). \tag{26}$$

The derivative of $E^{(t)}$ with respect to $w_{ji}^{\phi}$ unrolls the memory recursion (11) back through time:

$$\frac{\partial E^{(t)}}{\partial w_{ji}^{\phi}} = \frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} \sum_{n=1}^{t} \Big( \prod_{m=n+1}^{t} a_{j}^{\phi(m)} \Big)\, s_{j}^{(n-1)}\, a_{j}'^{\phi(n)}\, a_{i}^{x(n)}. \tag{27}$$

The derivative of $E^{(t)}$ with respect to $v_{jk}^{\phi}$ is

$$\frac{\partial E^{(t)}}{\partial v_{jk}^{\phi}} = \frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} \sum_{n=1}^{t} \Big( \prod_{m=n+1}^{t} a_{j}^{\phi(m)} \Big)\, s_{j}^{(n-1)}\, a_{j}'^{\phi(n)}\, a_{k}^{y(n-1)}. \tag{28}$$

The derivative of $E^{(t)}$ with respect to $b_{j}^{\phi}$ is

$$\frac{\partial E^{(t)}}{\partial b_{j}^{\phi}} = \frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} \sum_{n=1}^{t} \Big( \prod_{m=n+1}^{t} a_{j}^{\phi(m)} \Big)\, s_{j}^{(n-1)}\, a_{j}'^{\phi(n)}. \tag{29}$$

So the derivative of $E^{(t)}$ with respect to $a_{j}^{\gamma(t)}$ is

$$\frac{\partial E^{(t)}}{\partial a_{j}^{\gamma(t)}} = \frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} \frac{\partial s_{j}^{(t)}}{\partial o_{j}^{h(t)}} \frac{\partial o_{j}^{h(t)}}{\partial a_{j}^{\gamma(t)}} = E_{j}'^{(t)}\, a_{j}^{\omega(t)}\, a_{j}^{s(t)}\big(1 - a_{j}^{s(t)}\big)\, a_{j}^{x(t)}. \tag{30}$$

The derivative of $E^{(t)}$ with respect to $w_{ji}^{\gamma}$ is

$$\frac{\partial E^{(t)}}{\partial w_{ji}^{\gamma}} = \frac{\partial E^{(t)}}{\partial a_{j}^{\gamma(t)}} \frac{\partial a_{j}^{\gamma(t)}}{\partial o_{j}^{\gamma(t)}} \frac{\partial o_{j}^{\gamma(t)}}{\partial w_{ji}^{\gamma}} = \frac{\partial E^{(t)}}{\partial a_{j}^{\gamma(t)}}\, a_{j}^{\gamma(t)}\big(1 - a_{j}^{\gamma(t)}\big)\, a_{i}^{x(t)}. \tag{31}$$

The derivative of $E^{(t)}$ with respect to $v_{jk}^{\gamma}$ is

$$\frac{\partial E^{(t)}}{\partial v_{jk}^{\gamma}} = \frac{\partial E^{(t)}}{\partial a_{j}^{\gamma(t)}}\, a_{j}^{\gamma(t)}\big(1 - a_{j}^{\gamma(t)}\big)\, a_{k}^{y(t-1)}. \tag{32}$$

The derivative of $E^{(t)}$ with respect to $b_{j}^{\gamma}$ is

$$\frac{\partial E^{(t)}}{\partial b_{j}^{\gamma}} = \frac{\partial E^{(t)}}{\partial a_{j}^{\gamma(t)}}\, a_{j}^{\gamma(t)}\big(1 - a_{j}^{\gamma(t)}\big). \tag{33}$$

The derivative of $E^{(t)}$ with respect to $a_{j}^{x(t)}$ is

$$\frac{\partial E^{(t)}}{\partial a_{j}^{x(t)}} = \frac{\partial E^{(t)}}{\partial s_{j}^{(t)}} \frac{\partial s_{j}^{(t)}}{\partial o_{j}^{h(t)}} \frac{\partial o_{j}^{h(t)}}{\partial a_{j}^{x(t)}} = E_{j}'^{(t)}\, a_{j}^{\omega(t)}\, a_{j}^{s(t)}\big(1 - a_{j}^{s(t)}\big)\, a_{j}^{\gamma(t)}. \tag{34}$$

The derivative of $E^{(t)}$ with respect to $w_{ji}^{x}$ is

$$\frac{\partial E^{(t)}}{\partial w_{ji}^{x}} = \frac{\partial E^{(t)}}{\partial a_{j}^{x(t)}} \frac{\partial a_{j}^{x(t)}}{\partial o_{j}^{x(t)}} \frac{\partial o_{j}^{x(t)}}{\partial w_{ji}^{x}} = \frac{\partial E^{(t)}}{\partial a_{j}^{x(t)}}\, a_{j}^{x(t)}\big(1 - a_{j}^{x(t)}\big)\, a_{i}^{x(t)}. \tag{35}$$

The derivative of $E^{(t)}$ with respect to $b_{j}^{x}$ is

$$\frac{\partial E^{(t)}}{\partial b_{j}^{x}} = \frac{\partial E^{(t)}}{\partial a_{j}^{x(t)}}\, a_{j}^{x(t)}\big(1 - a_{j}^{x(t)}\big). \tag{36}$$

These partial derivatives form the update rules for training the RNN with the RBP algorithm.
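The output-layer and output-gate derivatives translate directly into vectorized code. The numpy sketch below is illustrative only: it reuses the array names from the earlier forward-pass sketch and covers equations (19)–(25); the forget-gate and input-gate gradients (26)–(36) extend the same pattern with the unrolled products of (27)–(29).

```python
import numpy as np

def output_and_gate_grads(a_y, y, a_h, a_s, a_om, x_t, a_y_prev, U_y):
    """Gradients from equations (19)-(25): the softmax output layer, the
    output gate, and the memory-state derivative."""
    d_y = a_y - y                              # softmax cross-entropy error (a_y - y)
    grad_U = np.outer(d_y, a_h)                # (19) dE/dU_y
    grad_b_y = d_y                             # (20) dE/db_y
    E_prime = U_y.T @ d_y                      # (21) dE/da_h
    d_om = E_prime * a_s * a_om * (1.0 - a_om) # common factor in (22)-(24)
    grad_W_om = np.outer(d_om, x_t)            # (22) dE/dW_omega
    grad_V_om = np.outer(d_om, a_y_prev)       # (23) dE/dV_omega
    grad_b_om = d_om                           # (24) dE/db_omega
    dE_ds = E_prime * a_om * a_s * (1.0 - a_s) # (25) dE/ds
    return grad_U, grad_b_y, E_prime, grad_W_om, grad_V_om, grad_b_om, dE_ds
```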

B. Noisy EM Theorems for RBP

We can now state and prove the new NEM RBP theorems for a classifier RNN and for a regression RNN. We start with the classifier result.

Theorem 2: RBP Noise Benefit for RNN Classification

The NEM positivity condition (3) holds for the maximum-likelihood training of a classifier recurrent neural network with noise injection in the output softmax neurons if the following hyperplane condition holds:

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \mathbf{n}^{(t)T} \ln \mathbf{a}^{y(t)} \Big\} \geq 0. \tag{37}$$

Proof: We first show that the cross entropy $E(\Theta)$ equals the sum of the negative log-likelihood functions over time $t = 1, 2, \ldots, T$. The error $E(\Theta)$ is the cross entropy for a classifier network:

$$E(\Theta) = \sum_{t=1}^{T} E^{(t)} \tag{38}$$
$$= -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{k}^{(t)} \ln a_{k}^{y(t)} \tag{39}$$
$$= -\sum_{t=1}^{T} \ln \Big[ \prod_{k=1}^{K} \big(a_{k}^{y(t)}\big)^{y_{k}^{(t)}} \Big] \tag{40}$$
$$= -\sum_{t=1}^{T} \ln \Big[ \prod_{k=1}^{K} p_{k}\big(y_{k}^{(t)}\,\big|\,\mathbf{x},\Theta\big) \Big] \tag{41}$$
$$= -\sum_{t=1}^{T} \ln p\big(\mathbf{y}^{(t)}\,\big|\,\mathbf{x},\Theta\big) \tag{42}$$
$$= -\ln \prod_{t=1}^{T} p\big(\mathbf{y}^{(t)}\,\big|\,\mathbf{x},\Theta\big) \tag{43}$$
$$= -L(\Theta). \tag{44}$$

So minimizing the cross entropy $E(\Theta)$ maximizes the log-likelihood function $L(\Theta)$: $p(\mathbf{y}|\mathbf{x},\Theta) = e^{-E(\Theta)}$.

The NEM sufficient condition (3) simplifies to a product of exponentiated output activations over the time steps $t$, each with $K$ output neurons. Then additive noise injection gives

$$\frac{p(\mathbf{n}+\mathbf{y},\mathbf{h}\,|\,\mathbf{x},\Theta)}{p(\mathbf{y},\mathbf{h}\,|\,\mathbf{x},\Theta)} = \prod_{t=1}^{T} \frac{p\big(\mathbf{n}^{(t)}+\mathbf{y}^{(t)},\mathbf{h}\,|\,\mathbf{x},\Theta\big)\, p(\mathbf{h}\,|\,\mathbf{x},\Theta)}{p(\mathbf{h}\,|\,\mathbf{x},\Theta)\, p\big(\mathbf{y}^{(t)},\mathbf{h}\,|\,\mathbf{x},\Theta\big)} = \prod_{t=1}^{T} \frac{p\big(\mathbf{n}^{(t)}+\mathbf{y}^{(t)}\,|\,\mathbf{h},\mathbf{x},\Theta\big)}{p\big(\mathbf{y}^{(t)}\,|\,\mathbf{h},\mathbf{x},\Theta\big)} \tag{45}$$
$$= \prod_{t=1}^{T} \Bigg[ \frac{\prod_{k=1}^{K} \big(a_{k}^{y(t)}\big)^{n_{k}^{(t)}+y_{k}^{(t)}}}{\prod_{k=1}^{K} \big(a_{k}^{y(t)}\big)^{y_{k}^{(t)}}} \Bigg] \tag{46}$$
$$= \prod_{t=1}^{T} \prod_{k=1}^{K} \big(a_{k}^{y(t)}\big)^{n_{k}^{(t)}}. \tag{47}$$

Then the NEM positivity condition in (3) becomes

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \ln \prod_{t=1}^{T} \prod_{k=1}^{K} \big(a_{k}^{y(t)}\big)^{n_{k}^{(t)}} \Big\} \geq 0. \tag{48}$$

This simplifies to the inequality

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} n_{k}^{(t)} \ln a_{k}^{y(t)} \Big\} \geq 0. \tag{49}$$

The inner summation represents the inner product of $\mathbf{n}^{(t)}$ with $\ln \mathbf{a}^{y(t)}$. Then we can rewrite (49) as the hyperplane inequality

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \mathbf{n}^{(t)T} \ln \mathbf{a}^{y(t)} \Big\} \geq 0. \tag{50}$$
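Condition (50) suggests a simple screening step over a whole output sequence. The sketch below is a hedged illustration, not the paper's implementation: it draws one zero-mean uniform noise vector per time step and keeps the sequence only if the time-summed hyperplane condition holds, reducing to the earlier single-step check when T = 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def nem_sequence_noise(a_y_seq, scale=0.1):
    """Additive target noise for a length-T softmax output sequence (shape (T, K)).
    Keep the whole noise sequence only if it satisfies the Theorem-2 condition
    sum_t n(t) . log a_y(t) >= 0; otherwise fall back to zeros."""
    n = rng.uniform(-scale, scale, size=a_y_seq.shape)
    if float(np.sum(n * np.log(a_y_seq + 1e-12))) >= 0.0:
        return n
    return np.zeros_like(n)

# toy usage: T = 3 time steps, K = 2 output neurons
a_y_seq = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
print(nem_sequence_noise(a_y_seq))
```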

We next state and prove the NEM RBP theorem for regression networks.

[Fig. 1: The first 5 frames from a UCF YouTube video sample of juggling a soccer ball. Such soccer videos formed the 6th of the 11 categories in the classification experiment. So the target output for this sampled video clip was the 1-in-11-coded bit vector with a 1 in the 6th slot and 0s elsewhere.]

Theorem 3: RBP Noise Benefit for RNN Regression

The NEM positivity condition (3) holds for the maximum-likelihood training of a regression RNN with Gaussian target vector $\mathbf{y}^{(t)} \sim N\big(\mathbf{y}^{(t)}\,\big|\,\mathbf{a}^{y(t)}, \mathbf{I}\big)$ if the following inequality condition holds:

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \big\| \mathbf{y}^{(t)} + \mathbf{n}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \Big\} - \mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \big\| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \Big\} \leq 0. \tag{51}$$

Proof: The error function $E(\Theta)$ is the squared error for a regression network [3]. We first show that minimizing the squared error $E(\Theta)$ is the same as maximizing the likelihood function $L(\Theta)$:

$$E(\Theta) = \sum_{t=1}^{T} E^{(t)} \tag{52}$$
$$= \sum_{t=1}^{T} \tfrac{1}{2} \big\| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \tag{53}$$
$$= -\sum_{t=1}^{T} \ln \Big[ \exp\Big\{ -\tfrac{1}{2} \big\| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \Big\} \Big] \tag{54}$$
$$= -\Big[ \sum_{t=1}^{T} \ln p\big(\mathbf{y}^{(t)}\,\big|\,\mathbf{x},\Theta\big) \Big] - \frac{TK}{2}\ln(2\pi) \tag{55}$$
$$= -L(\Theta) - \frac{TK}{2}\ln(2\pi) \tag{56}$$

where the network's output probability density function $p(\mathbf{y}^{(t)}\,|\,\mathbf{x},\Theta)$ is the vector normal density $N(\mathbf{y}^{(t)}\,|\,\mathbf{a}^{y(t)}, \mathbf{I})$. So minimizing the squared error $E(\Theta)$ maximizes the likelihood $L(\Theta)$ because the last term in (56) does not depend on $\Theta$.

Injecting additive noise into the output neurons gives

$$\frac{p(\mathbf{n}+\mathbf{y},\mathbf{h}\,|\,\mathbf{x},\Theta)}{p(\mathbf{y},\mathbf{h}\,|\,\mathbf{x},\Theta)} = \prod_{t=1}^{T} \frac{p\big(\mathbf{n}^{(t)}+\mathbf{y}^{(t)},\mathbf{h}\,|\,\mathbf{x},\Theta\big)\, p(\mathbf{h}\,|\,\mathbf{x},\Theta)}{p(\mathbf{h}\,|\,\mathbf{x},\Theta)\, p\big(\mathbf{y}^{(t)},\mathbf{h}\,|\,\mathbf{x},\Theta\big)} = \prod_{t=1}^{T} \frac{p\big(\mathbf{n}^{(t)}+\mathbf{y}^{(t)}\,|\,\mathbf{h},\mathbf{x},\Theta\big)}{p\big(\mathbf{y}^{(t)}\,|\,\mathbf{h},\mathbf{x},\Theta\big)} \tag{57}$$
$$= \prod_{t=1}^{T} \frac{\exp\big( -\tfrac{1}{2} \| \mathbf{y}^{(t)} + \mathbf{n}^{(t)} - \mathbf{a}^{y(t)} \|^{2} \big)}{\exp\big( -\tfrac{1}{2} \| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \|^{2} \big)}. \tag{58}$$

Then the NEM positivity condition in (3) simplifies to

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Bigg\{ \ln \prod_{t=1}^{T} \frac{\exp\big( -\tfrac{1}{2} \| \mathbf{y}^{(t)} + \mathbf{n}^{(t)} - \mathbf{a}^{y(t)} \|^{2} \big)}{\exp\big( -\tfrac{1}{2} \| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \|^{2} \big)} \Bigg\} \geq 0. \tag{59}$$

This positivity condition simplifies to the hyperspherical inequality

$$\mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \big\| \mathbf{y}^{(t)} + \mathbf{n}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \Big\} - \mathbb{E}_{\mathbf{y},\mathbf{h},\mathbf{n}|\mathbf{x},\Theta^{*}}\Big\{ \sum_{t=1}^{T} \big\| \mathbf{y}^{(t)} - \mathbf{a}^{y(t)} \big\|^{2} \Big\} \leq 0. \tag{60}$$

These NEM theorems extend directly to other RNN architectures such as those with gated recurrent units and Elman RNNs [29]. Equation (61) below gives the update rule for training an RNN with BP through time. It applies to both classification and regression networks. The update rule for the $n$-th training iteration is

$$\Theta^{(n+1)} = \Theta^{(n)} + \eta\, \nabla_{\Theta} L(\Theta)\Big|_{\Theta=\Theta^{(n)}} \tag{61}$$
$$= \Theta^{(n)} - \eta\, \nabla_{\Theta} E(\Theta)\Big|_{\Theta=\Theta^{(n)}} \tag{62}$$

for $\Theta \in \{u_{kj}^{y}, b_{k}^{y}, w_{ji}^{\omega}, v_{jk}^{\omega}, b_{j}^{\omega}, w_{ji}^{\phi}, v_{jk}^{\phi}, b_{j}^{\phi}, w_{ji}^{\gamma}, v_{jk}^{\gamma}, b_{j}^{\gamma}, w_{ji}^{x}, b_{j}^{x}\}$ where $\eta$ is the learning rate. The partial derivatives in the gradient of (62) come from (19), (20), (22)–(24), (27)–(29), (31)–(33), (35), and (36).

Algorithm 1 states the details of the noisy RBP algorithm with an LSTM architecture and with softmax output activations. A related algorithm injects noise into the hidden units as well [3]. This produces further speed-ups in general.
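The hypersphere condition of Theorem 3 also reduces to a short screening step. The numpy sketch below is illustrative only: the Gaussian noise source and the reject-to-zero fallback are assumptions rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)

def nem_regression_noise(y_seq, a_y_seq, scale=0.1):
    """Additive target noise for a regression RNN. Keep the noise sequence only
    if it satisfies the Theorem-3 hypersphere condition
    sum_t ||y(t)+n(t)-a(t)||^2 <= sum_t ||y(t)-a(t)||^2; otherwise use zeros."""
    n = rng.normal(0.0, scale, size=y_seq.shape)      # illustrative Gaussian noise
    noisy = np.sum((y_seq + n - a_y_seq) ** 2)
    clean = np.sum((y_seq - a_y_seq) ** 2)
    return n if noisy <= clean else np.zeros_like(n)

# toy usage: T = 2 time steps, K = 3 output neurons
y_seq = np.array([[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]])
a_y_seq = y_seq + 0.3                                  # stand-in network outputs
print(nem_regression_noise(y_seq, a_y_seq))
```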

IV. VIDEO CLASSIFICATION SIMULATION RESULTS

We injected additive NEM noise during the RBP training of an LSTM-RNN video classifier. We used 80:20 splits for training and test sets. We extracted 26,000 training instances from the standard UCF sports-action YouTube videos [1], [2]. We used 6,700 video-clip samples for testing the network after training. Each training instance had 2 image frames sampled uniformly from a sports-action video at the rate of 6 frames per second. Each image frame in the dataset had a fixed pixel resolution. Each pixel value was between 0 and 1. We fed the pixel values into the input neurons of the LSTM-RNN and tried different numbers of hidden neurons. The input, forget, and output gates all had the same number of neurons. The output layer had 11 neurons that represented the 11 categories of sports actions in the dataset. These sports actions were basketball shooting, biking, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking. Figure 1 shows 5 sampled consecutive frames from a video in the soccer-juggling category. We fed the frames into the LSTM network as a sequence of data over time. The RNN used softmax neurons at the output layer because the LSTM-RNN was a classification network. The other layers and the gates used logistic neurons as discussed above.

We used uniform noise from $U(0, 0.1)$ with an annealing factor that scaled down the noise as the iteration number $n$ grew. The Adaptive Moment Estimation (Adam) optimizer picked the learning rate as $n$ grew from an initial learning rate $\eta$. We compared the training performance over 200 iterations and found that the best noise benefit occurred with 200 internal memory neurons. Then the noise benefit tapered off.

[Figure 2 plots training cross entropy versus training iterations for RBP with no noise, blind noise, and NEM noise.]

Fig. 2: Noise-benefit comparisons. NEM noise benefit from noise injection into the output layer of an LSTM RNN. Adding NEM noise to the output neurons of the LSTM RNN improved the performance of RBP training. Training RBP with NEM noise in just the output softmax neurons led to 60% fewer iterations to reach a common low value of cross entropy. Noise-boosted RBP took 8 iterations to reach the cross-entropy value 2.7 while ordinary noiseless RBP took 34 iterations to reach the same cross-entropy value. There was a 20.6% maximum decrease in cross entropy. NEM noise can also inject into the hidden neurons. Blind noise gave little improvement.

Figure 2 shows the cross entropy for different RBP training regimens and the resulting NEM-RBP speed-up of convergence. Simply injecting faint blind (uniform) noise into the output neurons performed slightly better than no noise injection but not nearly as well as NEM noise injection. The figure also shows that noiseless RBP took 26 more iterations to reach the same cross-entropy value of 2.7. So NEM-boosted RBP needed 60% fewer training iterations to converge to the same low cross-entropy value as did noiseless RBP.

The injection of NEM noise also improved the classification accuracy on the video-clip test set. This apparently occurred because the network likelihood lower-bounds the classification accuracy as we have shown elsewhere [12]. We compared the classification accuracy of NEM-noise RBP training with the accuracy for noiseless RBP training.

Algorithm 1: Noisy RBP algorithm for training a classifier RNN with an LSTM architecture.
  Data: N input sets {X_1, ..., X_N} where each input set X_n = {x_n^(1), ..., x_n^(T)}; N target sets {Y_1, ..., Y_N} where each target set Y_n = {y_n^(1), ..., y_n^(T)}; the number of epochs R for running backpropagation through time.
  Result: Trained RNN weights.
  Initialize the network weights Θ.
  for epoch r = 1, ..., R:
    for training input set n = 1, ..., N:
      FEEDFORWARD
      for t = 1, ..., T:
        - Compute the activations of the input, forget, and output gates from x_n^(t) and a_n^{y(t-1)}.
        - Compute the memory-cell activation a_n^{s(t)} and the output activation a_n^{y(t)}.
      ADD NOISE
      for t = 1, ..., T:
        - Generate a noise vector n^(t).
        - If the NEM condition holds, add the noise: y_n^(t) <- y_n^(t) + n^(t).
        - Else do nothing.
      BACKPROPAGATE ERROR THROUGH TIME
      Initialize ∇E(Θ) = 0.
      for t = T, ..., 1:
        - Compute the cross entropy E^(t)(Θ).
        - Compute the cross-entropy gradient ∇E^(t)(Θ).
        - Update the gradient: ∇E(Θ) <- ∇E(Θ) + ∇E^(t)(Θ).
      - Update the parameters Θ using gradient descent: Θ <- Θ - η∇E(Θ).
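Algorithm 1 maps naturally onto a modern autodiff framework if we let the framework handle the backward pass of Section III-A. The PyTorch sketch below is only a hedged illustration of that mapping: the LSTMClassifier module, the 0.1 noise scale, the 1/epoch annealing, and the Adam learning rate are assumptions, and the NEM screen adds noise to the one-hot targets per Theorem 2 rather than reproducing the paper's exact noise distribution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMClassifier(nn.Module):
    def __init__(self, n_in, n_hidden, n_classes):
        super().__init__()
        self.lstm = nn.LSTM(n_in, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                      # x: (batch, T, n_in)
        h, _ = self.lstm(x)                    # h: (batch, T, n_hidden)
        return self.out(h)                     # logits: (batch, T, n_classes)

def nem_target_noise(log_a, scale):
    """Per-sequence additive target noise kept only if it satisfies the
    Theorem-2 hyperplane condition sum_t n(t) . log a(t) >= 0."""
    n = torch.empty_like(log_a).uniform_(-scale, scale)
    keep = (n * log_a).sum(dim=(1, 2), keepdim=True) >= 0.0   # one test per sequence
    return n * keep                            # zero out rejected noise

def train(model, loader, n_classes, epochs=10, scale=0.1):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for epoch in range(1, epochs + 1):
        for x, labels in loader:               # labels: (batch,) class indices
            logits = model(x)
            log_a = F.log_softmax(logits, dim=-1)
            y = F.one_hot(labels, n_classes).float()
            y = y.unsqueeze(1).expand_as(log_a)                  # same target at every time step
            n = nem_target_noise(log_a.detach(), scale / epoch)  # annealed NEM noise
            loss = -((y + n) * log_a).sum(dim=-1).mean()         # noisy cross entropy
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```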
The accuracy measure counted all true positives. NEM noise produced a 2% relative gain in classification accuracy. The noiseless RBP training gave 81% classification accuracy on the test set. The injection of NEM noise gave 83% classification accuracy on the test set. This represents 134 more correct classifications than we found for noiseless RBP training. We ran the simulation on a Google cloud instance with 8 vCPUs, 30 GB RAM, and a 50 GB system disk.

[Figure 3 shows three panels of training cross entropy versus training iterations for RBP with no noise, blind noise, and NEM noise at increasing numbers of hidden memory neurons.]

Fig. 3: Hidden-neuron memory-size effects on injecting noise into the output neurons of the RNN. The NEM noise benefit became more pronounced as the number of hidden neurons increased. The best noise effect occurred with 200 memory neurons and tapered off thereafter.

V. CONCLUSION

Careful noise injection speeded up convergence of the recurrent backpropagation (RBP) algorithm for video samples. The key insight in the noise boost is that ordinary BP is a special case of the generalized EM algorithm. Then noise-boosting EM leads to noise-boosting BP and its recurrent versions for both classification and regression. Simulations showed that noise-boosted RBP training substantially outperformed noiseless RBP for classifying the UCF sports-action YouTube video clips. Further noise benefits should result from noise injection into the hidden neurons and the use of convolutional masks [3].

REFERENCES

[1] M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[2] K. Soomro and A. R. Zamir, "Action recognition in realistic sports videos," in Computer Vision in Sports. Springer, 2014.
[3] K. Audhkhasi, O. Osoba, and B. Kosko, "Noise-enhanced convolutional neural networks," Neural Networks, vol. 78, pp. 15–23, 2016.
[4] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, 1986.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, 2015.
[6] M. Jordan and T. Mitchell, "Machine learning: trends, perspectives, and prospects," Science, vol. 349, pp. 255–260, 2015.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
[8] O. Osoba, S. Mitaim, and B. Kosko, "The noisy expectation maximization algorithm," Fluctuation and Noise Letters, vol. 12, no. 3, 2013.
[9] O. Osoba and B. Kosko, "Noise-enhanced clustering and competitive learning algorithms," Neural Networks, Jan. 2013.
[10] ——, "The noisy expectation-maximization algorithm for multiplicative noise injection," Fluctuation and Noise Letters, 2016.
[11] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[12] B. Kosko, K. Audhkhasi, and O. Osoba, "Noise can speed backpropagation learning and deep bidirectional pretraining," in review.
[13] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence. Prentice Hall, 1991.
[14] P. Hänggi, "Stochastic resonance in biology," ChemPhysChem, vol. 3, no. 3, 2002.
[15] B. Kosko, Noise. Penguin, 2006.
[16] A. Patel and B. Kosko, "Stochastic resonance in continuous and spiking neuron models with Levy noise," IEEE Transactions on Neural Networks, vol. 19, no. 12, 2008.
[17] M. McDonnell, N. Stocks, C. Pearce, and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge University Press, 2008.
[18] M. Wilde and B. Kosko, "Quantum forbidden-interval theorems for stochastic resonance," Journal of Physics A: Mathematical and Theoretical, vol. 42, no. 46, 2009.
[19] A. Patel and B. Kosko, "Error-probability noise benefits in threshold neural signal detection," Neural Networks, vol. 22, no. 5, 2009.
[20] B. Franzke and B. Kosko, "Using noise to speed up Markov chain Monte Carlo estimation," Procedia Computer Science, vol. 53, pp. 113–120, 2015.
[21] L. Gammaitoni, P. Hänggi, P. Jung, and F. Marchesoni, "Stochastic resonance," Reviews of Modern Physics, vol. 70, no. 1, pp. 223–287, January 1998.
[22] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
[23] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[24] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[25] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[26] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[27] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2013.
[28] K. Audhkhasi, O. Osoba, and B. Kosko, "Noisy hidden Markov models for speech recognition," in Proceedings of the 2013 International Joint Conference on Neural Networks. IEEE, 2013, pp. 1–6.
[29] P. Rodriguez, J. Wiles, and J. L. Elman, "A recurrent neural network that learns to count," Connection Science, vol. 11, no. 1, pp. 5–40, 1999.


Lecture 13 Back-propagation

Lecture 13 Back-propagation Lecture 13 Bac-propagation 02 March 2016 Taylor B. Arnold Yale Statistics STAT 365/665 1/21 Notes: Problem set 4 is due this Friday Problem set 5 is due a wee from Monday (for those of you with a midterm

More information

Neural Networks. Intro to AI Bert Huang Virginia Tech

Neural Networks. Intro to AI Bert Huang Virginia Tech Neural Networks Intro to AI Bert Huang Virginia Tech Outline Biological inspiration for artificial neural networks Linear vs. nonlinear functions Learning with neural networks: back propagation https://en.wikipedia.org/wiki/neuron#/media/file:chemical_synapse_schema_cropped.jpg

More information

A Novel Activity Detection Method

A Novel Activity Detection Method A Novel Activity Detection Method Gismy George P.G. Student, Department of ECE, Ilahia College of,muvattupuzha, Kerala, India ABSTRACT: This paper presents an approach for activity state recognition of

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Course 10. Kernel methods. Classical and deep neural networks.

Course 10. Kernel methods. Classical and deep neural networks. Course 10 Kernel methods. Classical and deep neural networks. Kernel methods in similarity-based learning Following (Ionescu, 2018) The Vector Space Model ò The representation of a set of objects as vectors

More information

Artificial Neural Networks. Historical description

Artificial Neural Networks. Historical description Artificial Neural Networks Historical description Victor G. Lopez 1 / 23 Artificial Neural Networks (ANN) An artificial neural network is a computational model that attempts to emulate the functions of

More information

The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection

The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection The Noisy Expectation-Maximization Algorithm for Multiplicative Noise Injection Osonde Osoba, Bart Kosko We generalize the noisy expectation-maximization (NEM) algorithm to allow arbitrary modes of noise

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Machine Learning for Computer Vision 8. Neural Networks and Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Machine Learning for Computer Vision 8. Neural Networks and Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group INTRODUCTION Nonlinear Coordinate Transformation http://cs.stanford.edu/people/karpathy/convnetjs/

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Jeff Clune Assistant Professor Evolving Artificial Intelligence Laboratory Announcements Be making progress on your projects! Three Types of Learning Unsupervised Supervised Reinforcement

More information

Topic 3: Neural Networks

Topic 3: Neural Networks CS 4850/6850: Introduction to Machine Learning Fall 2018 Topic 3: Neural Networks Instructor: Daniel L. Pimentel-Alarcón c Copyright 2018 3.1 Introduction Neural networks are arguably the main reason why

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information