Modeling Complex Temporal Composition of Actionlets for Activity Prediction

1 Modeling Complex Temporal Composition of Actionlets for Activity Prediction ECCV 2012 Activity Recognition Reading Group

2 Framework of activity prediction

3 What is an Actionlet? A long sequence of human activity video is segmented into multiple atomic actions that have specific semantic meanings. These atomic actions have causal relationships that can be used for prediction. We call these atomic actions actionlets.

4 Datasets Requirement: the activity should have multiple steps where each step constitutes a meaningful action unit (actionlet). Two datasets: Tennis games from YouTube. Maryland Human-Object Interaction (MHOI) dataset.

5 Dataset: Tennis Game 160 video clips (80 winning, 80 losing). Each video clip is one point with an exchange of several strokes. Prediction problem: who will win?

6 Dataset: Maryland Human-Object Interaction Five activity categories: Answering phone call Making phone call Drinking water Lighting a flashlight Pouring water

7 Dataset: Maryland Human-Object Interaction Five activity categories. These activities have about 3 to 5 actionlets each. Constituent actionlets share similar human movements: 1) reaching for an object, 2) grasping the object, 3) manipulating the object, and 4) putting back the object. 44 video clips. Prediction problem: what action is it?

8 Algorithm outline Given a set of training videos: Step 1: Actionlets Detection Segment each video into actionlets. Step 2: Activity Encoding Cluster actionlets into meaningful groups. Step 3: Activity Prediction Model Generate a next symbol probability function for each activity class. Given an ongoing test sequence: Compute the probability of the sequence under each prediction model, and select the class with the largest probability as the predicted class.

9 Step 1: Actionlet Detection Input: a set of videos. Goal: find the frame indices that can segment the video into actionlets. Method: velocity based. For each atomic action, the start frame and the end frame always have the lowest movement velocity. And the velocity reaches the peak at the intermediate stage of each action.

11 Step 1: Actionlet Detection Input: a set of videos. Goal: find the frame indices that can segment the video into actionlets. Method: velocity based. S1. Key points: Harris Corner. S2. Trajectories: Lucas-Kanade optical flow. S3. Generate the overall motion velocity $V_t$ for each frame $F_t$.
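The velocity-based idea can be sketched roughly as follows. The exact formula for $V_t$ is not reproduced in this transcript, so the sketch assumes the per-frame velocities are already available as a plain list. The function names `smooth` and `segment_boundaries`, and the `window` and `min_gap` parameters, are hypothetical choices for illustration, not part of the original method.

```python
def smooth(values, window=3):
    """Moving average with a centered window (edges use the samples available)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def segment_boundaries(velocity, window=3, min_gap=5):
    """Return frame indices at local minima of the smoothed velocity curve.

    Assumption: actionlet boundaries coincide with velocity minima, and two
    boundaries closer than `min_gap` frames are treated as one.
    """
    v = smooth(velocity, window)
    boundaries = []
    for t in range(1, len(v) - 1):
        is_minimum = v[t] <= v[t - 1] and v[t] < v[t + 1]
        far_enough = not boundaries or t - boundaries[-1] >= min_gap
        if is_minimum and far_enough:
            boundaries.append(t)
    return boundaries


# Toy velocity curve: two bursts of motion separated by a pause at frame 6.
velocity = [0.1, 0.5, 1.0, 1.5, 1.0, 0.5, 0.1, 0.5, 1.0, 1.5, 1.0, 0.5, 0.1]
print(segment_boundaries(velocity, window=1))  # prints [6]
```

The first and last frames are never reported, since a video's start and end are boundaries by definition.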

12 Step 1: Actionlet Detection Result Evaluation Evaluation: Manually label the segmentation points in the video. A target window of 15 frames around each human-labeled segmentation point is the ground truth. MHOI: Manually labeled 113 actionlets for all 44 videos. Accuracy is [value missing in source]. Tennis: Manually labeled 253 actionlets for 40 (out of 160) videos. Accuracy is 0.82.
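One possible reading of this evaluation is sketched below. It assumes that the 15-frame window is centered on the labeled point (so a detection within 7 frames counts as a hit) and that accuracy is the fraction of labeled points that receive at least one hit. Both are assumptions; the slide does not spell out either detail, and `detection_accuracy` is a hypothetical helper name.

```python
def detection_accuracy(detected, labeled, window=15):
    """Fraction of labeled segmentation points with a detection inside the window."""
    half = window // 2  # a 15-frame window is read here as +/- 7 frames
    hits = sum(
        1 for g in labeled
        if any(abs(d - g) <= half for d in detected)
    )
    return hits / len(labeled) if labeled else 0.0


labeled = [30, 75, 120, 160]   # hand-labeled boundary frames (toy values)
detected = [28, 83, 118]       # boundaries found automatically (toy values)
print(detection_accuracy(detected, labeled))  # prints 0.5
```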

13 Step 2: Activity Encoding Input: a set of segmented videos. Goal: Cluster actionlets into meaningful groups, then represent each video by a sequence of actionlets. Method: S1. Compute video descriptors for each actionlet; S2. Quantize the descriptors by k-means; S3. Learn the actionlet categories.

14 Step 2: Activity Encoding S1: Video descriptors Input: a set of segmented videos. MHOI: large scale person + static background. Interest points: 3-D Harris corner. Descriptor: HOG and HOF. Tennis: small scale person. Dense trajectories [18]: Feature points in dense grid -> Trajectories by optical flow + median filtering -> HOG, HOF and MBH features of each trajectory. [18] H. Wang, A. Kläser, C. Schmid, C.L. Liu, Action recognition by dense trajectories.

15 Step 2: Activity Encoding S2: Descriptor quantization Input: actionlets represented as descriptors. 1. Cluster all descriptors by k-means; 2. Generate the descriptor codebook; 3. Compute memberships with respect to the codebook. Codebook size: MHOI: 50. Tennis: 1000.
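The membership step (item 3) can be illustrated with a small sketch. It assumes the codebook has already been produced by k-means and that membership means hard assignment to the nearest codeword under Euclidean distance. The helper names `nearest_codeword` and `word_histogram` and the toy 2-D descriptors are hypothetical.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def word_histogram(descriptors, codebook):
    """Normalized histogram of codeword memberships for one actionlet."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy codebook with 3 codewords (the slides use 50 for MHOI and 1000 for Tennis).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
actionlet = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]]
print(word_histogram(actionlet, codebook))  # prints [0.25, 0.5, 0.25]
```

Each actionlet then becomes one fixed-length histogram, which is the input to the category learning step.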

16 Step 2: Activity Encoding S3: Actionlet categories Input: Actionlets represented as a collection of words from the codebook (word histogram). Learn the actionlet categories from the histograms in an unsupervised way [17]. [17] J. Niebles, H. Wang, L. Fei-Fei, Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words.

18 Step 2: Activity Encoding Result Evaluation Evaluate the actionlet clustering result. Measurement: Rand Index. Given a set of elements S and two partitions X and Y, define: a: the number of pairs of elements in S that are in the same set in X and in the same set in Y; b: the number of pairs in different sets in X and in different sets in Y; c: the number of pairs in the same set in X and in different sets in Y; d: the number of pairs in different sets in X and in the same set in Y. R = (a + b) / (a + b + c + d). The value is between 0 and 1, with 1 denoting that the two clusterings are exactly the same.
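The Rand index definition translates directly into code. The sketch below assumes each partition is given as a list of cluster labels, one per element; `rand_index` is a hypothetical helper name. It counts agreeing pairs (a + b) over all pairs (a + b + c + d).

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Rand index between two partitions given as per-element cluster labels."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both partitions must label the same elements")
    agree = 0  # a + b: pairs treated the same way by both partitions
    total = 0  # a + b + c + d: all pairs
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        agree += same_x == same_y
        total += 1
    return agree / total if total else 1.0


x = [0, 0, 1, 1, 2]  # e.g. manual grouping of five actionlets
y = [0, 0, 1, 2, 2]  # e.g. automatic clustering of the same actionlets
print(rand_index(x, y))  # prints 0.8
```

In this toy case 8 of the 10 pairs agree; the two disagreeing pairs are (2, 3) and (3, 4).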

19 Step 2: Activity Encoding Result Evaluation Evaluate the actionlets clustering result. Measurement: Rand Index. MHOI: Manually group actionlets into 5 categories where each category represents a meaningful action unit, e.g. reach the object, grab the object, release the object. Rand index is 0.7. Tennis: Group 253 actionlets from 40 videos into 10 categories. Rand index is 0.73.

20 Step 3: Activity Prediction Model Input: videos represented by actionlet sequences. $\Sigma$: actionlet alphabet. $D_{training} = \{r^1, r^2, \ldots, r^m\}$: training sample set, where $r^i = r^i_1 r^i_2 \ldots r^i_{l_i}$ and $r^i_j \in \Sigma$. Goal: to learn a model P (for each activity) that provides a probability assignment $p(t)$ for the ongoing actionlet sequence $t = t_1, t_2, \ldots, t_{\|t\|}$. Tool: Probabilistic Suffix Tree (PST)

21 Step 3: Activity Prediction Model Probabilistic Suffix Tree (PST) Given training data $D_{training}$ of an activity, capture the temporal dependency between actionlets (next symbol probability function). The dependency is modeled as a Variable order Markov Model (VMM), represented by a Probabilistic Suffix Tree.

22 Step 3: Activity Prediction Model Probabilistic Suffix Tree (PST) Next symbol probability function: $\gamma_s(\sigma)$, where $\sigma \in \Sigma$ is the next symbol and $s \in \Sigma^*$ is the context.
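A next symbol probability function can be approximated with simple context counts, as in the sketch below. This is not a full PST: it stores every context up to a fixed `max_order` without pruning, backs off to the longest suffix seen in training, and uses add-one smoothing. The bounded order, the smoothing, and the names `train_contexts` and `gamma` are all assumptions made for illustration.

```python
from collections import defaultdict


def train_contexts(sequences, max_order=2):
    """Count next-symbol occurrences for every context of up to max_order symbols."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for j, symbol in enumerate(seq):
            for order in range(min(j, max_order) + 1):
                context = tuple(seq[j - order:j])
                counts[context][symbol] += 1
    return counts


def gamma(counts, context, symbol, alphabet):
    """Next-symbol probability using the longest suffix of context seen in training."""
    context = tuple(context)
    while context and context not in counts:
        context = context[1:]  # back off to a shorter suffix
    seen = counts[context]
    total = sum(seen.values())
    # Add-one smoothing (an assumption) keeps unseen symbols at nonzero probability.
    return (seen.get(symbol, 0) + 1) / (total + len(alphabet))


alphabet = ["reach", "grasp", "drink", "put"]
training = [
    ["reach", "grasp", "drink", "put"],
    ["reach", "grasp", "drink", "drink", "put"],
]
counts = train_contexts(training)
print(gamma(counts, ["reach", "grasp"], "drink", alphabet))  # prints 0.5
```

After the context "reach grasp", "drink" followed in both training sequences, so its smoothed probability is (2 + 1) / (2 + 4).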

23 Step 3: Activity Prediction Model Probabilistic Suffix Tree (PST)

24 Step 3: Activity Prediction Model Input: videos represented by actionlet sequences. $\Sigma$: actionlet alphabet. $D_{training} = \{r^1, r^2, \ldots, r^m\}$: training sample set, where $r^i = r^i_1 r^i_2 \ldots r^i_{l_i}$ and $r^i_j \in \Sigma$. Goal: to learn a model P for each activity that provides a probability assignment $p(t)$ for the ongoing actionlet sequence $t = t_1, t_2, \ldots, t_{\|t\|}$. Tool: Probabilistic Suffix Tree (PST). $p^T(t) = \sum_{j=1}^{\|t\|} \log \gamma_{s^{j-1}}(t_j)$, where $s^{j-1}$ is the context preceding $t_j$.

25 Step 3: Activity Prediction Model Predictive Accumulative Function (PAF) Predictability of different activities. Observations: Tennis game: late predictable. Only the last several strokes impact the winning or losing result. Drinking water: early predictable. As soon as the first actionlet "grab a cup" is observed, we can guess the intention. How to use: weighted log likelihood. $p^T(t) = \sum_{j=1}^{\|t\|} f_p\left(\frac{\|t_1 t_2 \ldots t_j\|}{\|t\|}\right) \log \gamma_{s^{j-1}}(t_j)$
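The weighted log likelihood can be sketched as a short function. It assumes `gamma(context, symbol)` is any callable returning a next symbol probability (it receives the full prefix and is expected to do its own back-off), and that `f_p` maps the observed fraction j / ‖t‖ to a weight. The names and the toy weighting functions below are hypothetical.

```python
import math


def weighted_log_likelihood(sequence, gamma, f_p):
    """Sum of f_p(j / len(sequence)) * log gamma(prefix, symbol) over the sequence."""
    n = len(sequence)
    score = 0.0
    for j, symbol in enumerate(sequence, start=1):
        context = tuple(sequence[:j - 1])  # symbols t_1 ... t_{j-1} seen so far
        score += f_p(j / n) * math.log(gamma(context, symbol))
    return score


def uniform_gamma(context, symbol):
    """Stand-in for a learnt model: every symbol gets probability 0.5."""
    return 0.5


def late_weight(k):
    """Hypothetical PAF for a late-predictable activity: weight grows with k."""
    return k


def flat_weight(k):
    """Constant weight, which recovers the unweighted log likelihood."""
    return 1.0


strokes = ["serve", "return"]
print(weighted_log_likelihood(strokes, uniform_gamma, late_weight))
print(weighted_log_likelihood(strokes, uniform_gamma, flat_weight))
```

With a rising weight, later actionlets dominate the score, which matches the observation that only the last strokes decide a tennis point.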

27 Step 3: Activity Prediction Model Predictive Accumulative Function (PAF)
Notations: for $k \in [0,1]$, $D = \{r^1, r^2, \ldots, r^m\}$, where $r^i = r^i_1 r^i_2 \ldots r^i_{l_i}$, and $D_k = \{r^1_{pre(k)}, \ldots, r^m_{pre(k)}\}$, where $r^i_{pre(k)} = r^i_1 r^i_2 \ldots r^i_{\lfloor k l_i \rfloor}$.
With the learnt PST we can compute: $p^T(r) = \prod_{j=1}^{\|r\|} \gamma_{s^{j-1}}(r_j)$ and $p^T(r_{pre(k)}) = \prod_{j=1}^{\|r_{pre(k)}\|} \gamma_{s^{j-1}}(r_j)$.
Given the first fraction $k$ of the sequence observed, the information we gain is: $y_k = \frac{H(D) - H(D \mid D_k)}{H(D)}$, where $H(D) = -\sum_{r \in D} p^T(r) \log p^T(r)$ and $H(D \mid D_k) = -\sum_{r_{pre(k)} \in D_k} \sum_{r \in D} p^T(r, r_{pre(k)}) \log p^T(r \mid r_{pre(k)})$.
Fit a function: $y = f_p(k)$.

28 Step 3: Activity Prediction Model Input: videos represented by actionlet sequences. $\Sigma$: actionlet alphabet. $D_{training} = \{r^1, r^2, \ldots, r^m\}$: training sample set, where $r^i = r^i_1 r^i_2 \ldots r^i_{l_i}$ and $r^i_j \in \Sigma$. Goal: to learn a model P for each activity that provides a probability assignment $p(t)$ for the ongoing actionlet sequence $t = t_1, t_2, \ldots, t_{\|t\|}$. Weighted log likelihood: $p^T(t) = \sum_{j=1}^{\|t\|} f_p\left(\frac{\|t_1 t_2 \ldots t_j\|}{\|t\|}\right) \log \gamma_{s^{j-1}}(t_j)$

29 Algorithm outline Given a set of training videos: Step 1: Actionlets Detection Segment each video into actionlets. Step 2: Activity Encoding Cluster actionlets into meaningful groups. Step 3: Activity Prediction Model Generate a next symbol probability function for each activity class. Given an ongoing test sequence $t$: Compute $p^T(t)$ for each activity, i.e. the probability of the sequence under each prediction model, and select the activity with the largest value as the predicted class.
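The final decision rule is an argmax over per-activity scores. In the sketch below each model is just a callable that scores a sequence. The unigram scorer is a deliberately simple stand-in for a learnt PST, and `make_unigram_scorer`, `predict_activity`, the probabilities, and the `floor` value are all hypothetical.

```python
import math


def make_unigram_scorer(probs, floor=1e-6):
    """Build a toy log-likelihood scorer; unseen symbols get a small floor probability."""
    def score(sequence):
        return sum(math.log(probs.get(s, floor)) for s in sequence)
    return score


def predict_activity(sequence, models):
    """Return the activity whose model gives the observed prefix the highest score."""
    return max(models, key=lambda name: models[name](sequence))


models = {
    "drinking water": make_unigram_scorer(
        {"reach": 0.3, "grasp": 0.3, "drink": 0.3, "put": 0.1}
    ),
    "pouring water": make_unigram_scorer(
        {"reach": 0.3, "grasp": 0.3, "pour": 0.3, "put": 0.1}
    ),
}

ongoing = ["reach", "grasp", "drink"]  # partially observed actionlet sequence
print(predict_activity(ongoing, models))  # prints drinking water
```

Because the rule only needs a score for the observed prefix, it can be applied at any point while the activity is still in progress.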

30 Experiment Results Middle-Level Complex Activity Prediction: On MHOI High-Level Complex Activity Prediction: On Tennis

31 Experiment Results Middle-Level Complexity (MHOI) Dataset: 5 activities, each has 9 or 10 samples. Supervised classification problem: For each activity: Positive set: all the samples of this activity. Negative set: an equal number of samples randomly selected from the other activities. Evaluate accuracy by leave-one-out. Repeat the above 10 times and average the performance. Other methods for comparison: Dynamic Bag-of-Words, Integral Bag-of-Words, a basic SVM-based approach.
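The evaluation protocol above might be organized as in the following sketch. It assumes `data` maps each activity name to its list of samples and that `train` and `predict` are supplied by the method under test. All function names, the `seed` parameter, and the toy classifier are hypothetical.

```python
import random


def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and score the held-out prediction."""
    if not samples:
        return 0.0
    correct = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


def balanced_binary_set(data, activity, rng):
    """All samples of `activity` as positives plus equally many random negatives."""
    positives = data[activity]
    others = [s for name, seqs in data.items() if name != activity for s in seqs]
    negatives = rng.sample(others, min(len(positives), len(others)))
    return positives + negatives, [1] * len(positives) + [0] * len(negatives)


def evaluate(data, activity, train, predict, repeats=10, seed=0):
    """Average leave-one-out accuracy over several random negative draws."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        samples, labels = balanced_binary_set(data, activity, rng)
        scores.append(leave_one_out_accuracy(samples, labels, train, predict))
    return sum(scores) / len(scores)


# Toy data and a trivial rule-based "classifier" to exercise the protocol.
data = {
    "drinking": [["drink"], ["drink"], ["drink"]],
    "pouring": [["pour"], ["pour"]],
    "calling": [["dial"], ["dial"]],
}
train = lambda samples, labels: None
predict = lambda model, sample: 1 if "drink" in sample else 0
print(evaluate(data, "drinking", train, predict))  # prints 1.0
```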

32 Experiment Results Middle-Level Complexity (MHOI)

33 Experiment Results Middle-Level Complexity (MHOI)

34 Experiment Results Middle-Level Complexity (MHOI)

35 Experiment Results High-Level Complexity (Tennis) Dataset: 2 categories (winning/losing), each has 80 samples. The length of the actionlet sequences varies from 1 to more than 20. Supervised classification problem: Evaluate accuracy by leave-one-out. Other methods for comparison: Dynamic Bag-of-Words, Integral Bag-of-Words, a basic SVM-based approach.

36 Experiment Results High-Level Complexity (Tennis)

37 Experiment Results High-Level Complexity (Tennis)
