Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

1 Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. Kaiyang Zhou, Yu Qiao, Tao Xiang. AAAI 2018.

2 What is video summarization? Goal: to automatically summarize videos into keyframes or key-clips. Pipeline: video frames, then a model, then keyframes or key-clips. We want the summary to be diverse and representative.

3 Application of video summarization: e.g. YouTube video preview. A short preview is displayed when the mouse is over the video.

4 Unsupervised video summarization. Idea: to analyze correlations between frames in feature space. Steps: 1. Feature extraction. 2. Clustering. 3. Keyframe extraction. [Figure: keyframes in feature space.] De Avila et al. Pattern Recognition Letters 2011.
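The three steps on this slide can be sketched in plain Python. This is a minimal illustration, not the method of De Avila et al.: it assumes frame features are already extracted as lists of floats, uses plain k-means with evenly spaced initial centroids, and the name `kmeans_keyframes` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_keyframes(features, k, iters=20):
    """Cluster frame features with k-means; return one keyframe index per cluster."""
    # Assumption: deterministic initialisation from evenly spaced frames.
    step = len(features) / k
    centroids = [list(features[int(i * step)]) for i in range(k)]
    assign = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid for every frame.
        assign = [min(range(k), key=lambda c: sq_dist(f, centroids[c]))
                  for f in features]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [f for f, a in zip(features, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Keyframe extraction: the frame closest to each centroid.
    keyframes = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            keyframes.append(min(idx, key=lambda i: sq_dist(features[i], centroids[c])))
    return sorted(keyframes)
```

For two well-separated groups of frames, the function returns one frame index from each group.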

5 Supervised video summarization. Idea: to exploit human labels, either scores y = {0.1, 0.8, 1.0, 0.2, ...} or keyframes y = {0, 1, 1, 0, ...}. Training: loss $= (y - w^\top X)^2$. Inference: $p = w^\top X'$, where $X$ and $X'$ are feature vectors. Temporal relations are hard to capture by linear models. Gygli et al. ECCV 2014.
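The linear model on this slide can be sketched as follows. This is only an illustration under assumptions: the names `predict` and `gradient_step`, the learning rate, and averaging the squared error over frames are choices made here, not details from Gygli et al.

```python
def predict(w, x):
    """Importance score p = w^T x for one frame feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gradient_step(w, X, y, lr=0.1):
    """One gradient-descent step on the mean of (y_i - w^T x_i)^2."""
    n = len(X)
    grad = [0.0] * len(w)
    for x, target in zip(X, y):
        err = predict(w, x) - target
        for j, xj in enumerate(x):
            # d/dw_j of (w^T x - y)^2 is 2 * err * x_j; averaged over frames.
            grad[j] += 2.0 * err * xj / n
    return [wj - lr * gj for wj, gj in zip(w, grad)]


# Toy usage: labels are exactly y = 1.0 * x, so w should approach [1.0].
w = [0.0]
for _ in range(50):
    w = gradient_step(w, [[1.0], [2.0]], [1.0, 2.0])
```

Each frame is scored independently of its neighbours, which is why such a model cannot capture temporal relations.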

6 Recurrent neural network with supervised learning. Idea: use an RNN to capture temporal relations. The RNN, unrolled over the frames, outputs scores $p_1, p_2, p_3, p_4$ and is trained with the regression loss $\sum_i (y_i - p_i)^2$. Collecting labels here is much more expensive than for other tasks. Labels may not provide good supervision signals, because they are subjective. Zhang et al. ECCV.

7 Main idea: to mimic how humans summarize videos. An agent turns a video into a summary, and the reward asks: is the summary diverse and representative?

8 Model. An RNN runs over the video frames and outputs $p_1, p_2, p_3, p_4$; each $p_i$ yields an action $a_i \in \{0, 1\}$. The actions define the summary, which is scored by the diversity-representativeness reward.

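9 Diversity reward: $R_{\text{div}} = \frac{1}{|Y|(|Y|-1)} \sum_{t \in Y} \sum_{t' \in Y,\, t' \neq t} d(x_t, x_{t'})$, where $Y$ is the set of selected frames. Representativeness reward: $R_{\text{rep}} = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \min_{t' \in Y} \lVert x_t - x_{t'} \rVert_2\right)$.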
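Both rewards can be computed directly from the frame features. The sketch below is illustrative only: the function names are hypothetical, the slide leaves $d$ unspecified so cosine dissimilarity is assumed here, and returning 0 for too-small selections is also an assumption.

```python
import math


def cosine_dissimilarity(a, b):
    """Assumed choice for d: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_reward(features, selected, d=cosine_dissimilarity):
    """Mean pairwise dissimilarity among the selected frames."""
    n = len(selected)
    if n < 2:
        return 0.0  # assumption: no pairs to compare, so no diversity reward
    total = sum(d(features[t], features[u])
                for t in selected for u in selected if u != t)
    return total / (n * (n - 1))


def representativeness_reward(features, selected):
    """exp of minus the mean distance from each frame to its nearest selected frame."""
    if not selected:
        return 0.0  # assumption: an empty summary earns nothing
    total = sum(min(math.dist(x, features[u]) for u in selected)
                for x in features)
    return math.exp(-total / len(features))


# Toy usage: three frames, of which frames 0 and 1 are selected.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
r_div = diversity_reward(features, [0, 1])
r_rep = representativeness_reward(features, [0, 1])
```

In this toy case the two selected frames are orthogonal and every frame coincides with a selected one, so both rewards reach their maximum of 1.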

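10 Optimization. Reward: $R = R_{\text{div}} + R_{\text{rep}}$. Objective function: $J(\theta) = \mathbb{E}[R]$. Approximate gradients via REINFORCE: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} (R_n - b)\, \nabla_\theta \log \pi_\theta(a_t \mid h_t)$. Williams. Machine Learning 1992.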
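The REINFORCE estimate above can be illustrated on a simplified policy. The sketch assumes the parameters are one logit per frame rather than RNN weights, and that each action is drawn from a Bernoulli distribution with probability sigmoid(logit); under that assumption the gradient of the log-probability with respect to the logit is a_t - p_t. The name `reinforce_gradient` and the argument `reward_fn` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reinforce_gradient(logits, reward_fn, n_episodes=8, baseline=0.0, rng=None):
    """Monte Carlo estimate of dJ/dlogit_t, averaged over N sampled episodes."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    probs = [sigmoid(z) for z in logits]
    grad = [0.0] * len(logits)
    for _ in range(n_episodes):
        # Sample one action sequence a_1..a_T from the current policy.
        actions = [1 if rng.random() < p else 0 for p in probs]
        advantage = reward_fn(actions) - baseline  # (R_n - b)
        for t, (a, p) in enumerate(zip(actions, probs)):
            # For a Bernoulli policy, d log pi(a_t) / d logit_t = a_t - p_t.
            grad[t] += advantage * (a - p)
    return [g / n_episodes for g in grad]


# Toy usage: the reward is 1 when frame 0 is selected, so its logit should go up.
g = reinforce_gradient([0.0, 0.0], lambda a: float(a[0]), n_episodes=2000)
```

Subtracting the baseline b leaves the expected gradient unchanged and reduces its variance; when every episode earns exactly b, the estimate is zero.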

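11 Inference. The video is split into clips $S_1, S_2, S_3, S_4$. Score prediction: $\{p_i\}_{i=1}^{T} = \text{RNN}(\{x_i\}_{i=1}^{T})$. Compute clip-level scores: $I(S_k) = \frac{1}{|S_k|} \sum_{i \in S_k} p_i$. Select clips (0/1 Knapsack): $\arg\max_{\mu} \sum_k \mu_k I(S_k)$, subject to $\sum_k \mu_k |S_k| \le$ (summary length limit), $\mu_k \in \{0, 1\}$. Song et al. CVPR 2015.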
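The clip scoring and the 0/1 knapsack selection can be sketched with a standard dynamic program. This is an illustration under assumptions: clips are given as half-open frame ranges, clip lengths and the length limit `budget` are integers, and the function names are hypothetical.

```python
def clip_scores(probs, clips):
    """Mean frame score per clip; clips are (start, end) half-open frame ranges."""
    return [sum(probs[s:e]) / (e - s) for s, e in clips]


def select_clips(scores, lengths, budget):
    """0/1 knapsack: maximise total clip score with total length <= budget."""
    n = len(scores)
    # best[k][c] = best total score using the first k clips within capacity c.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for k in range(1, n + 1):
        for c in range(budget + 1):
            best[k][c] = best[k - 1][c]  # option 1: skip clip k-1
            if lengths[k - 1] <= c:
                cand = best[k - 1][c - lengths[k - 1]] + scores[k - 1]
                if cand > best[k][c]:
                    best[k][c] = cand  # option 2: take clip k-1
    # Backtrack to recover which clips were taken.
    chosen, c = [], budget
    for k in range(n, 0, -1):
        if best[k][c] != best[k - 1][c]:
            chosen.append(k - 1)
            c -= lengths[k - 1]
    return sorted(chosen)


# Toy usage: four clips of two frames each, room for two clips.
probs = [0.9, 0.9, 0.1, 0.1, 0.8, 0.8, 0.2, 0.2]
clips = [(0, 2), (2, 4), (4, 6), (6, 8)]
picked = select_clips(clip_scores(probs, clips), [2, 2, 2, 2], 4)
```

Here `picked` contains the indices of the two highest-scoring clips, since any two clips fit within the limit.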

12 Evaluation. [Figure: human summary compared with machine summary, with regions labelled true positive, false positive, and false negative.] Metric: F-score = (2 × precision × recall) / (precision + recall). Datasets (table columns: dataset, # videos, length in mins, description, # annotators per video): SumMe, user videos; TVSum, YouTube videos, 20 annotators per video. Gygli et al. ECCV 2014, Song et al. CVPR 2015.
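The F-score on this slide can be computed from two binary frame-selection vectors. The sketch assumes both summaries are given as equal-length lists of 0/1 flags, one per frame; the name `f_score` is hypothetical, and returning 0 when there is no overlap is an assumption.

```python
def f_score(machine, human):
    """F-score between a machine summary and a human summary (0/1 per frame)."""
    true_pos = sum(1 for m, h in zip(machine, human) if m and h)
    if true_pos == 0:
        return 0.0  # assumption: no overlap scores zero
    precision = true_pos / sum(machine)  # TP / (TP + FP)
    recall = true_pos / sum(human)       # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)


# Toy usage: one of two machine-selected frames overlaps the human summary.
score = f_score([1, 1, 0, 0], [0, 1, 1, 0])
```

In the toy case, precision and recall are both 0.5, so the F-score is 0.5 as well.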

13 Quantitative Results. Table: Comparison with other unsupervised approaches. Columns: Method, SumMe (%), TVSum (%). Methods: Video-MMR, Uniform sampling, K-medoids, Vsumm, Web image, Dictionary selection, Online sparse coding, Co-archetypal, GAN_dpp, Ours. Marked improvements for Ours: ↑ 6% (SumMe), ↑ 11% (TVSum).

14 Quantitative Results. Table: Comparison with other supervised approaches. Columns: Method, SumMe (%), TVSum (%). Methods: Interestingness, Submodularity, Summary transfer, Bi-LSTM, DPP-LSTM, GAN_sup, Ours. Marked improvement for Ours: ↑ 2%.

15 Quantitative Results. Table: Comparison with other supervised approaches. Columns: Method, SumMe (%), TVSum (%). Methods: Interestingness, Submodularity, Summary transfer, Bi-LSTM, DPP-LSTM, GAN_sup, Ours, Ours (supervised). For more experiments and details, please see our paper.

16 Qualitative Results. [Figure: manual scores, prediction by RNN with RL, and prediction by RNN with supervised learning, for video #10 in TVSum.]

17 Summary. 1. Proposed a label-free reward, i.e. the diversity-representativeness reward. 2. Outperformed other unsupervised methods and was competitive with supervised ones. 3. Extended the unsupervised method to a supervised version. Code and data available at:

18 Thanks! Any questions? Please feel free to contact me at:
