Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities


1 Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. Bangpeng Yao and Li Fei-Fei, Computer Science Department, Stanford University

2 Human-Object Interaction: robots interact with objects; automatic sports commentary ("Kobe is dunking the ball."); medical care.


3 Human-Object Interaction: holistic image-based classification (previous talk: Grouplet) vs. detailed understanding and reasoning. Example images: playing bassoon, playing saxophone, playing saxophone; HOI activity: tennis forehand. Grouplet is a generic feature for structured objects, or interactions of groups of objects. Caltech101 results: Berg & Malik, 2005: 48%; Grauman & Darrell, 2005: 59%; Gehler & Nowozin, 2009: 77%; ours: 62%.

4 Human-Object Interaction: holistic image-based classification; detailed understanding and reasoning. Human pose estimation: head, torso.

5 Human-Object Interaction: holistic image-based classification; detailed understanding and reasoning. Human pose estimation; object detection: tennis racket.

6 Human-Object Interaction: holistic image-based classification; detailed understanding and reasoning. Human pose estimation and object detection: head, tennis racket, torso. HOI activity: tennis forehand.

7 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

8 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

9 Human pose estimation & Object detection. Human pose estimation is challenging: difficult part appearance; self-occlusion; image regions that look like a body part. Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari,

10 Human pose estimation & Object detection. Human pose estimation is challenging. Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari,

11 Human pose estimation & Object detection. Pose estimation is facilitated, given that the object is detected.

12 Human pose estimation & Object detection. Object detection is challenging: objects are small, low-resolution, or partially occluded; image regions can be similar to the detection target. Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al,

13 Human pose estimation & Object detection. Object detection is challenging. Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al,

14 Human pose estimation & Object detection. Object detection is facilitated, given that the pose is estimated.

15 Human pose estimation & Object detection: mutual context.


16 Context in Computer Vision. Previous work: use context cues to facilitate object detection. Helpful, but detection with context only moderately outperforms detection without context (~3-4% better). Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010; Viola & Jones, 2001; Lampert et al,

17 Context in Computer Vision. Previous work: use context cues to facilitate object detection. Helpful, but detection with context only moderately outperforms detection without context (~3-4% better). Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010. Our approach: two challenging tasks serve as mutual context for each other. [Figure labels: with mutual context; without context.]

18 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

19 Mutual Context Model Representation. A: activity (tennis forehand, croquet shot, volleyball smash). O: object (tennis racket, croquet mallet, volleyball). H: human pose. Intra-class variations: more than one H for each A; H is unobserved during training. $P_1, P_2, \dots, P_N$: body parts, each with $l_P$: location; $\theta_P$: orientation; $s_P$: scale. $f_O, f_1, f_2, \dots, f_N$: image evidence. f: shape context [Belongie et al, 2002].

20 Mutual Context Model Representation. The model is a Markov random field: $\Psi = \sum_{e \in E} w_e \psi_e$, where $w_e$ is the clique weight and $\psi_e$ is the clique potential. $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H.
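The weighted sum on this slide can be illustrated with a minimal sketch. The edge names, potential values, and weights below are hypothetical placeholders, not numbers from the talk; only the form $\Psi = \sum_{e \in E} w_e \psi_e$ comes from the slide.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def model_score(potentials: Dict[Edge, float], weights: Dict[Edge, float]) -> float:
    """Return Psi = sum over edges e of w_e * psi_e."""
    return sum(weights[e] * potentials[e] for e in potentials)


# Hypothetical potential values for the three co-occurrence edges (A,O), (A,H), (O,H).
potentials = {("A", "O"): 0.5, ("A", "H"): 0.25, ("O", "H"): 1.0}
# Hypothetical clique weights; in the talk these are learned.
weights = {("A", "O"): 2.0, ("A", "H"): 4.0, ("O", "H"): 1.0}

score = model_score(potentials, weights)
print(score)
```

Each edge contributes independently, so adding the spatial and detection potentials from the later slides would only add more entries to the two dictionaries.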

21 Mutual Context Model Representation. Markov random field: $\Psi = \sum_{e \in E} w_e \psi_e$ ($w_e$: clique weight; $\psi_e$: clique potential). $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H. $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: spatial relationship among object and body parts, $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size).

22 Mutual Context Model Representation. Markov random field: $\Psi = \sum_{e \in E} w_e \psi_e$ ($w_e$: clique weight; $\psi_e$: clique potential). $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H. $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: spatial relationship among object and body parts, $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size). Learn structural connectivity among the body parts and the object (obtained by structure learning).

23 Mutual Context Model Representation. Markov random field: $\Psi = \sum_{e \in E} w_e \psi_e$ ($w_e$: clique weight; $\psi_e$: clique potential). $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: frequency of co-occurrence between A, O, and H. $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: spatial relationship among object and body parts, $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size). Learn structural connectivity among the body parts and the object. $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: discriminative part detection scores. Shape context + AdaBoost [Andriluka et al, 2009] [Belongie et al, 2002] [Viola & Jones, 2001].
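A rough sketch of the spatial-relationship potential may help make the three factors concrete. Everything here is an assumption made for illustration: the operators inside the terms are not legible in the slide text, so the sketch uses differences for location and orientation and a ratio for scale; the histogram tables `loc_hist` and `ori_hist`, the grid size `cell`, the number of orientation bins `n_ori`, and the Gaussian parameters are hypothetical stand-ins for quantities that would be learned.

```python
import math
from typing import Dict, Tuple


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def spatial_potential(
    obj: Dict[str, float],
    part: Dict[str, float],
    loc_hist: Dict[Tuple[int, int], float],
    ori_hist: Dict[int, float],
    cell: float = 20.0,        # hypothetical grid cell size in pixels
    n_ori: int = 8,            # hypothetical number of orientation bins
    scale_mean: float = 1.0,   # hypothetical Gaussian parameters
    scale_std: float = 0.5,
) -> float:
    """Product of a location-bin term, an orientation-bin term, and a size term."""
    # Location term: bin the (dx, dy) offset on a grid of cell-sized squares.
    loc_bin = (
        math.floor((obj["x"] - part["x"]) / cell),
        math.floor((obj["y"] - part["y"]) / cell),
    )
    # Orientation term: bin the wrapped angle difference into n_ori sectors.
    dtheta = (obj["theta"] - part["theta"]) % (2.0 * math.pi)
    ori_bin = int(dtheta / (2.0 * math.pi / n_ori)) % n_ori
    # Size term: Gaussian over the scale ratio.
    size_term = gaussian_pdf(obj["scale"] / part["scale"], scale_mean, scale_std)
    # Bins never seen in the (hypothetical) training histograms score zero.
    return loc_hist.get(loc_bin, 0.0) * ori_hist.get(ori_bin, 0.0) * size_term


racket = {"x": 130.0, "y": 90.0, "theta": 0.0, "scale": 1.0}
torso = {"x": 100.0, "y": 100.0, "theta": 0.0, "scale": 1.0}
loc_hist = {(1, -1): 0.5}   # hypothetical relative-location frequencies
ori_hist = {0: 0.5}         # hypothetical relative-orientation frequencies

value = spatial_potential(racket, torso, loc_hist, ori_hist)
unseen = spatial_potential(racket, torso, {}, ori_hist)
```

With these made-up tables, the racket falls into a location bin and an orientation bin that were both observed, so the potential is positive; an empty location table drives it to zero.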

24 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

25 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Input: example images (cricket shot, cricket bowling). Goals: hidden human poses.

26 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Input: example images (cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity.

27 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Input: example images (cricket shot, cricket bowling). Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

28 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Input: example images (cricket shot, cricket bowling). Goals: hidden human poses (hidden variables); structural connectivity (structure learning); potential parameters and potential weights (parameter estimation).

29 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Approach (example image: croquet shot). Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

30 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Approach: hill-climbing. $\max_{E} \sum_{e \in E} w_e \psi_e - \frac{(|E| - \mu)^2}{2\sigma^2}$, where the first term is the joint density of the model and the second term is a Gaussian prior on the number of edges. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.
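The hill-climbing step can be sketched as a greedy search over edge sets. This is an illustrative sketch under stated assumptions, not the authors' procedure: the `fit` callable is a hypothetical stand-in for the model's joint-density term, the per-edge gains are made up, the move set (toggling one edge at a time) is an assumption, and the edge-count term is written as the log of a Gaussian prior, $-(|E| - \mu)^2 / (2\sigma^2)$.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

Edge = Tuple[str, str]


def objective(edges: FrozenSet[Edge], fit: Callable[[FrozenSet[Edge]], float],
              mu: float, sigma: float) -> float:
    """Model fit plus the log of a Gaussian prior on the number of edges."""
    return fit(edges) - (len(edges) - mu) ** 2 / (2.0 * sigma ** 2)


def hill_climb(candidates: List[Edge], fit: Callable[[FrozenSet[Edge]], float],
               mu: float, sigma: float) -> Tuple[FrozenSet[Edge], float]:
    """Toggle one edge at a time, keeping any change that raises the objective."""
    current: FrozenSet[Edge] = frozenset()
    best = objective(current, fit, mu, sigma)
    improved = True
    while improved:
        improved = False
        for edge in candidates:
            neighbor = current ^ {edge}  # add the edge if absent, remove it if present
            score = objective(neighbor, fit, mu, sigma)
            if score > best:
                current, best, improved = neighbor, score, True
    return current, best


# Hypothetical per-edge gains; edges not listed are assumed to hurt the fit.
gains = {("O", "P1"): 2.0, ("P1", "P2"): 1.5, ("P2", "P3"): 1.0}


def toy_fit(edges: FrozenSet[Edge]) -> float:
    return sum(gains.get(e, -1.0) for e in edges)


nodes = ["O", "P1", "P2", "P3"]
edges, best = hill_climb(list(combinations(nodes, 2)), toy_fit, mu=3.0, sigma=1.0)
```

Because every accepted move strictly increases the objective and the number of edge sets is finite, the loop terminates, though only at a local optimum.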

31 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Approach: maximum likelihood for $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$, $\psi_e(H, P_n)$, $\psi_e(O, P_n)$, $\psi_e(P_m, P_n)$; standard AdaBoost for $\psi_e(O, f_O)$, $\psi_e(P_n, f_{P_n})$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights.

32 Model Learning. $\Psi = \sum_{e \in E} w_e \psi_e$. Goals: hidden human poses; structural connectivity; potential parameters; potential weights. Approach: max-margin learning. $\min_{w, \xi} \frac{1}{2} \sum_r \|w_r\|_2^2 + \beta \sum_i \xi_i$ s.t. $\forall i, r$ where $y(r) \neq y(c_i)$: $w_{c_i} \cdot x_i - w_r \cdot x_i \geq 1 - \xi_i$; $\forall i$: $\xi_i \geq 0$. Notation: $x_i$: potential values of the i-th image. $w_r$: potential weights of the r-th pose. $y(r)$: activity of the r-th pose. $\xi_i$: a slack variable for the i-th image.
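To make the max-margin objective concrete, the sketch below evaluates it for fixed weights rather than optimizing it. It assumes that $c_i$ denotes the pose assigned to the i-th image (the slide does not define it) and that each slack takes its smallest feasible value, $\xi_i = \max(0, \max_r [1 - (w_{c_i} \cdot x_i - w_r \cdot x_i)])$ over poses $r$ with $y(r) \neq y(c_i)$. All numbers are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def slack(x: Sequence[float], c: int, weights: List[List[float]],
          activity_of_pose: List[int]) -> float:
    """Smallest xi_i satisfying every margin constraint for one image."""
    violations = [
        1.0 - (dot(weights[c], x) - dot(weights[r], x))
        for r in range(len(weights))
        if activity_of_pose[r] != activity_of_pose[c]  # only poses of other activities compete
    ]
    return max([0.0] + violations)


def objective(xs: List[List[float]], cs: List[int], weights: List[List[float]],
              activity_of_pose: List[int], beta: float) -> float:
    """0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i."""
    regularizer = 0.5 * sum(dot(w, w) for w in weights)
    total_slack = sum(slack(x, c, weights, activity_of_pose) for x, c in zip(xs, cs))
    return regularizer + beta * total_slack


# Hypothetical setup: poses 0 and 1 belong to activity 0, pose 2 to activity 1.
activity_of_pose = [0, 0, 1]
weights = [[1.0, 0.0], [0.5, 0.0], [0.0, 1.0]]
xs = [[2.0, 0.0], [0.0, 0.5]]   # potential values of two images
cs = [0, 2]                     # assumed pose index of each image

value = objective(xs, cs, weights, activity_of_pose, beta=2.0)
```

Here the first image clears the margin and contributes no slack, while the second falls short by 0.5; poses that share the image's activity are never treated as competitors, which is what lets one activity have several pose models.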

33 Learning Results: cricket defensive shot; cricket bowling; croquet shot.

34 Learning Results: tennis forehand; tennis serve; volleyball smash.

35 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

36 Model Inference. Image I; the learned models.

37 Model Inference. Image I; the learned models. Head detection, torso detection, tennis racket detection. Compositional inference [Chen et al, 2007] gives the layout of the object and body parts: $\Psi(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n)$.

38 Model Inference. Image I; the learned models. Output: $\Psi(A_1, H_1, O_1^*, \{P_{1,n}^*\}_n), \dots, \Psi(A_K, H_K, O_K^*, \{P_{K,n}^*\}_n)$.
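The output of inference is one score per learned model. A minimal sketch of how a final answer could be read off those K scores follows; taking the highest-scoring model is an assumption here, since the slide lists only the scores. The `Hypothesis` type and all values are hypothetical.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    activity: str   # A_k
    pose: str       # H_k (hypothetical pose-model name)
    score: float    # Psi(A_k, H_k, O_k*, {P_kn*}) after per-model inference


def best_hypothesis(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the largest model score."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    return max(hypotheses, key=lambda h: h.score)


# Hypothetical scores from three learned pose models.
candidates = [
    Hypothesis("tennis forehand", "forehand pose 1", 3.2),
    Hypothesis("tennis serve", "serve pose 1", 4.7),
    Hypothesis("volleyball smash", "smash pose 1", 1.9),
]

winner = best_hypothesis(candidates)
```

The winning hypothesis would carry the activity label together with the object and body-part layout found for that model.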

39 Outline: Background and Intuition; Mutual Context of Object and Human Pose; Model Representation; Model Learning; Model Inference; Experiments; Conclusion

40 Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: object detection; pose estimation; activity classification.

41 Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training images (supervised with object and part locations) and 120 testing images. Tasks: object detection; pose estimation; activity classification.


42 Object Detection Results. [Figure: precision-recall curves (axes: recall, precision) for cricket bat, cricket ball, croquet mallet, tennis racket, and volleyball, with the valid region marked. Methods compared: sliding window, pedestrian context, our method. Cited: Andriluka et al, 2009; Dalal & Triggs, 2006.]

43 Object Detection Results. [Figure: precision-recall curves for sliding window, pedestrian context, and our method. Cricket ball: small object. Volleyball: background clutter.]
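For readers less familiar with the curves on these slides, here is a small sketch of how precision-recall points are typically computed from scored detections. The slides do not state the criterion for counting a detection as correct, so correctness is taken as a given input; the scores and labels are hypothetical.

```python
from typing import List, Tuple


def precision_recall(detections: List[Tuple[float, bool]],
                     num_ground_truth: int) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each detection, in descending score order."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    curve = []
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        true_positives += int(is_correct)
        precision = true_positives / rank
        recall = true_positives / num_ground_truth
        curve.append((precision, recall))
    return curve


# Hypothetical detections: (confidence score, whether it matches a ground-truth object).
detections = [(0.9, True), (0.4, False), (0.7, True), (0.8, False)]
curve = precision_recall(detections, num_ground_truth=2)
```

Each point corresponds to one score threshold, so sweeping down the ranked list traces the curve from high precision toward high recall.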

44 Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: object detection; pose estimation; activity classification.

45 Human Pose Estimation Results
Method                 Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006          .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009  .50    .31  .30    …    …      …    …      …    …      …
Our full model         …      …    …      …    .34    .44  .40    .27  .29    .58
(… = value missing in the source text)

46 Human Pose Estimation Results. Methods compared: Ramanan, 2006; Andriluka et al, 2009; our full model. Body parts: torso, upper leg, lower leg, upper arm, lower arm, head. [Figure: tennis serve model: our estimation result vs. Andriluka et al, 2009. Volleyball smash model: our estimation result vs. Andriluka et al,]

47 Human Pose Estimation Results
Method                 Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006          .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009  .50    .31  .30    …    …      …    …      …    …      …
Our full model         …      …    …      …    .34    .44  .40    .27  .29    .58
One pose per class     .63    .40  .36    .41  .31    .38  .35    .21  .23    …
(… = value missing in the source text)
[Figure: four estimation results.]

48 Dataset and Experiment Setup. Sport data set [Gupta et al, 2009]: 6 classes (cricket defensive shot, cricket bowling, croquet shot, tennis forehand, tennis serve, volleyball smash); 180 training and 120 testing images. Tasks: object detection; pose estimation; activity classification.

49 Activity Classification Results. Classification accuracy: our model: 83.3%; Gupta et al, 2009: 78.9%; bag-of-words (SIFT+SVM): 52.5%. Slide annotations: "No scene information"; "Scene is critical!!". Example images: cricket shot, tennis forehand.

50 Conclusion. Human-Object Interaction: Grouplet representation vs. mutual context model. Next steps: pose estimation & object detection on PPMI images; modeling multiple objects and humans.

51 Acknowledgment. Stanford Vision Lab reviewers: Barry Chai, Juan Carlos Niebles, Hao Su. Silvio Savarese, U. Michigan. Anonymous reviewers.


53 Human-Object Interaction: holistic image-based classification; detailed understanding and reasoning. Human pose estimation and object detection: head, tennis racket, torso. How to beat this???

54 Mutual Context Model Representation. Hierarchical representation of images: human-object interaction activity (A); object (O) and human pose (H); body parts ($P_1, P_2, \dots, P_N$); image patches ($f_O, f_1, f_2, \dots, f_N$).
