End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

1 End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation. Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Department of Electronic Engineering, The Chinese University of Hong Kong.

2 Outline: Introduction, Methodology, Experiments

3 What is Human Pose Estimation? Results are generated by the proposed methods (without temporal constraints). Video credit: Peter Jasko solo - M-idzomer 2013

4 Human pose estimation recovers the joint positions of the articulated limbs of the human body from an image or a video.

5 Applications: Activity Recognition; Game and Animation; Clothing Parsing [Yamaguchi et al., CVPR'14]

6 Challenges: Articulation; Foreshortening; Clothing. Dantone et al. CVPR

7 Structure Modeling. Figure Drawing: cylinders for each body part; join up the cylinders. Pictorial Structures: unary templates; pairwise springs; unable to handle large variations. (Figure labels: unary template, springs.) Image credit: artintegrity.wordpress.com. Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher

8 Structure Modeling. Figure Drawing: cylinders for each body part; join up the cylinders. Pictorial Structures: unary templates; pairwise springs; unable to handle large variations. Mixtures of each part: unary template for each mixture type; clustering on $(\Delta x, \Delta y)$. Graph Structures: trees; latent trees; loopy graphs; fully connected graphs. Image credit: artintegrity.wordpress.com. Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005; Yang & Ramanan

9 Deep Learning Based Methods. Heat map prediction. Benefits from better neural network architectures: VGG, GoogLeNet, ResNet, Fully Convolutional Networks. Structures are only used for post-processing.

10 Gap Between Deep Models and Structure Modeling. (Figure label: Gap.)

11 Gap Between Deep Models and Structure Modeling: End-to-End Learning

12 Motivation: Geometric Constraints Among Body Parts Help in Learning Better Representations. Local appearance is ambiguous; global consistency helps training. (Figure labels: CNN, Forward, Backward.)

13 Outline: Introduction, Methodology, Experiments

14 Graph Models: $G = (V, E)$. Vertices $V$: locations and mixture types of body parts, modeled by a front-end CNN. Edges $E$: pairwise spatial relationships between body parts, modeled by message passing layers.

15 Framework. Pipeline blocks: CNN, Softmax, Logarithm, Message Passing Layers. (Figure labels: head, neck, l.shou, r.shou, r.knee, r.ankle.)

16 Framework. Score of a pose configuration:
$$F(l, t \mid I; \theta, \omega) = \sum_{i \in V} \phi(l_i, t_i \mid I; \theta) + \sum_{(i,j) \in E} \psi(l_i, l_j, t_i, t_j \mid I; \omega_{i,j}^{t_i,t_j})$$

17 Framework. $l = \{l_i\} = \{(x_i, y_i)\}$: location of part $i$.

18 Framework. $t = \{t_i\}$: mixture type of part $i$.

19 Front-End CNN: Part Appearance Terms, $\sum_{i \in V} \phi(l_i, t_i \mid I; \theta)$ in $F(l, t \mid I; \theta, \omega)$.

20 Message Passing Layers: Spatial Relationship Terms, $\sum_{(i,j) \in E} \psi(l_i, l_j, t_i, t_j \mid I; \omega_{i,j}^{t_i,t_j})$ in $F(l, t \mid I; \theta, \omega)$.
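
A minimal sketch of how the score F decomposes into part appearance terms (summed over vertices) and spatial relationship terms (summed over edges). It is an illustration only: the three-part graph, the two states per part, and every number are made-up assumptions, and each "state" stands in for one (location, mixture type) pair.

```python
# Toy sketch of F(l, t | I) = sum of unary terms + sum of pairwise terms.
# The graph, the states, and all scores are hypothetical, not values from the slides.

def total_score(graph_edges, unary, pairwise, assignment):
    """Sum unary scores over parts and pairwise scores over edges.

    assignment maps a part name to a state index; one state stands for
    one (location, mixture type) pair.
    """
    # Part appearance terms: one phi per vertex.
    score = sum(unary[part][state] for part, state in assignment.items())
    # Spatial relationship terms: one psi per edge.
    for i, j in graph_edges:
        score += pairwise[(i, j)][(assignment[i], assignment[j])]
    return score

edges = [("head", "neck"), ("neck", "r.shou")]
unary = {
    "head": {0: -0.1, 1: -2.3},
    "neck": {0: -0.7, 1: -0.7},
    "r.shou": {0: -1.2, 1: -0.4},
}
pairwise = {
    ("head", "neck"): {(0, 0): -0.5, (0, 1): -0.1, (1, 0): -1.0, (1, 1): -0.2},
    ("neck", "r.shou"): {(0, 0): -0.3, (0, 1): -0.9, (1, 0): -0.6, (1, 1): -0.1},
}
assignment = {"head": 0, "neck": 1, "r.shou": 1}
print(total_score(edges, unary, pairwise, assignment))
```

Inference then amounts to searching for the assignment with the highest total score.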

21 Front-End CNN. (Architecture figure labels: CNN, norm, pool, 1x1, dropout, 4096, 4096, P×K.) Local confidence of the appearance of part $i$ with mixture type $t_i$:
$$\phi(l_i, t_i \mid I; \theta) = \log p(l_i, t_i \mid I; \theta) = \log \sigma(f(l_i, t_i \mid I; \theta))$$

22 Front-End CNN (build). $f(l_i, t_i \mid I; \theta)$: output of the front-end CNN.

23 Front-End CNN (build). Same content as slide 21.

24 Front-End CNN (build). $\sigma$: softmax function.

25 Front-End CNN (build). $\log$: logarithm.
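
A small sketch of the appearance term as a log-probability: raw front-end scores f are normalized and then passed through a logarithm. It assumes that σ is a softmax and that the normalization runs over a flat list of candidate states; both are assumptions made for illustration, and the raw scores are invented.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a flat list of raw scores.

    Assumption: sigma is a softmax taken over all entries of `scores`.
    """
    m = max(scores)  # subtract the max to avoid overflow in exp
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

raw = [2.0, 0.5, -1.0, 0.0]   # hypothetical outputs f of the front-end CNN
phi = log_softmax(raw)        # appearance terms, all <= 0
print(phi)
```

Working in the log domain turns products of probabilities into sums, which matches the additive form of the score F.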

26 Message Passing Layers: Spatial Relationship Terms, $\sum_{(i,j) \in E} \psi(l_i, l_j, t_i, t_j \mid I; \omega_{i,j}^{t_i,t_j})$ in $F(l, t \mid I; \theta, \omega)$.

27 Spatial Relationship Terms: quadratic deformation constraints.
$$\psi(l_i, l_j, t_i, t_j \mid I; \omega_{i,j}^{t_i,t_j}) = \langle \omega_{i,j}^{t_i,t_j}, d(l_i, l_j) \rangle$$
$$d(l_i, l_j) = [\Delta x, \Delta x^2, \Delta y, \Delta y^2]^T, \quad \Delta x = x_i - x_j, \quad \Delta y = y_i - y_j$$
Yang Y, Ramanan D. Articulated pose estimation with flexible mixtures-of-parts. CVPR,
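
A short sketch of the quadratic deformation term as an inner product between a weight vector and the features (Δx, Δx², Δy, Δy²). The weight values and coordinates are hypothetical; in the model there is one weight vector per edge and per pair of mixture types.

```python
def deformation_features(li, lj):
    """Return d(l_i, l_j) = [dx, dx^2, dy, dy^2] with dx = x_i - x_j, dy = y_i - y_j."""
    dx = li[0] - lj[0]
    dy = li[1] - lj[1]
    return [dx, dx * dx, dy, dy * dy]

def pairwise_score(w, li, lj):
    """Inner product <w, d(l_i, l_j)> for one edge and one mixture-type pair."""
    return sum(wk * dk for wk, dk in zip(w, deformation_features(li, lj)))

# Hypothetical weights: negative quadratic coefficients penalize large offsets.
w = [0.0, -0.01, 0.0, -0.01]
print(pairwise_score(w, (10, 20), (13, 24)))
```

With negative weights on the squared terms the score behaves like a spring: it is highest at a preferred offset and falls off quadratically.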

28 Message Passing: Max-Sum Algorithm. (Figure: nodes 1 and 2 with message $m_{21}$.) The message passed from part $i$ to part $j$:
$$m_{ij}(l_j, t_j) \leftarrow \alpha_m \max_{l_i, t_i} \big( u_i(l_i, t_i) + \psi(l_i, l_j, t_i, t_j) \big)$$

29 Message Passing: Max-Sum Algorithm. The belief of part $i$:
$$u_i(l_i, t_i) \leftarrow \alpha_u \Big( \phi(l_i, t_i) + \sum_{k \in N(i)} m_{ki}(l_i, t_i) \Big)$$

30 Message Passing: Max-Sum Algorithm. Estimated location and mixture type of part $i$:
$$(l_i^*, t_i^*) = \arg\max_{l_i, t_i} u_i(l_i, t_i)$$
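
A hedged sketch of max-sum inference on a three-part chain. This is the textbook leaf-to-root variant with back-pointers, not the layer shown on the slides: it sends each message once, and it omits the normalizers α_m and α_u. The scores are invented, and a brute-force search is included only to check the result.

```python
import itertools

def max_sum_chain(unary, pairwise):
    """Exact best configuration on a chain via max-sum with back-pointers.

    unary[i][s]: score of part i in state s.
    pairwise[i][s][t]: score of part i in state s with part i+1 in state t.
    """
    belief = list(unary[0])
    backptr = []
    for i in range(len(unary) - 1):
        msg, arg = [], []
        for t in range(len(unary[i + 1])):
            # message m(t) = max over s of (belief_i(s) + psi(s, t))
            cands = [belief[s] + pairwise[i][s][t] for s in range(len(belief))]
            best = max(range(len(cands)), key=cands.__getitem__)
            msg.append(cands[best])
            arg.append(best)
        backptr.append(arg)
        # belief of the next part = its unary score + the incoming message
        belief = [unary[i + 1][t] + msg[t] for t in range(len(msg))]
    last = max(range(len(belief)), key=belief.__getitem__)
    states = [last]
    for arg in reversed(backptr):
        states.append(arg[states[-1]])
    states.reverse()
    return belief[last], states

def brute_force(unary, pairwise):
    """Enumerate every configuration; only feasible for toy problems."""
    best = None
    for states in itertools.product(*[range(len(u)) for u in unary]):
        score = sum(unary[i][s] for i, s in enumerate(states))
        score += sum(pairwise[i][states[i]][states[i + 1]]
                     for i in range(len(states) - 1))
        if best is None or score > best[0]:
            best = (score, list(states))
    return best

unary = [[-0.1, -2.3], [-0.7, -0.7], [-1.2, -0.4]]
pairwise = [
    [[-0.5, -0.1], [-1.0, -0.2]],
    [[-0.3, -0.9], [-0.6, -0.1]],
]
score, states = max_sum_chain(unary, pairwise)
print(states)
```

On a tree this single pass is exact. On loopy graphs messages are typically passed for several iterations instead, and the result is approximate.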

31 Diagnosis on Message Passing Layers (1st)

32 Diagnosis on Message Passing Layers (2nd)

33 Diagnosis on Message Passing Layers (3rd)

34 Outline: Introduction, Methodology, Experiments

35 Datasets. LSP: sports, 1000 training, 1000 testing. FLIC: films, 3987 training, 1016 testing.

36 Qualitative Results on the LSP Dataset

38 PCP (Percentage of Correct Parts) on the LSP Dataset

39 Strict PCP on the LSP Dataset (bar chart; values not recoverable). Methods compared: Yang & Ramanan, CVPR'11; Eichner & Ferrari, ACCV'13; Pose Machines, ECCV'14; Pishchulin et al., ICCV'13; Chen & Yuille, NIPS'14; Pishchulin et al., CVPR'13; Kiefel & Gehler, ECCV'14; Ouyang et al., CVPR'14; DeepPose, CVPR'14; Ours. Parts: Torso, Head, Upper Arms, Lower Arms, Upper Legs, Lower Legs, Mean.
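
A minimal sketch of how strict PCP is commonly computed, assuming the usual definition: a limb counts as correct only if both predicted endpoints lie within half the ground-truth limb length of their true positions. The slides do not spell the definition out, so the 0.5 threshold and the toy coordinates are assumptions.

```python
import math

def strict_pcp(pred_limbs, gt_limbs, thresh=0.5):
    """Fraction of limbs whose two endpoints are both within thresh * limb length.

    Each limb is a pair of (x, y) endpoints. The limb length is measured on
    the ground truth (an assumption of this sketch).
    """
    correct = 0
    for (p_a, p_b), (g_a, g_b) in zip(pred_limbs, gt_limbs):
        limit = thresh * math.dist(g_a, g_b)
        if math.dist(p_a, g_a) <= limit and math.dist(p_b, g_b) <= limit:
            correct += 1
    return correct / len(gt_limbs)

# Hypothetical ground truth and predictions for two limbs.
gt = [((0, 0), (0, 10)), ((0, 10), (0, 20))]
pred = [((1, 0), (0, 12)), ((0, 10), (8, 20))]
print(strict_pcp(pred, gt))
```

The first limb passes because both endpoints are within 5 pixels; the second fails because one endpoint is 8 pixels away.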

40 Percentage of Detected Joints (PDJ). Plots for LSP and FLIC (curves not recoverable; legend: Our method).
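
A minimal sketch of the PDJ metric, assuming the common definition: a joint is detected when its predicted position is within a given fraction of the torso diameter from the ground truth. The torso diameter is passed in as a parameter here, and the fraction and coordinates are hypothetical.

```python
import math

def pdj(pred_joints, gt_joints, torso_diameter, fraction=0.2):
    """Fraction of joints predicted within fraction * torso_diameter of the truth."""
    limit = fraction * torso_diameter
    hits = sum(1 for p, g in zip(pred_joints, gt_joints)
               if math.dist(p, g) <= limit)
    return hits / len(gt_joints)

# Hypothetical joints; with torso_diameter=50 the threshold is 10 pixels.
gt_joints = [(0, 0), (10, 0), (20, 0), (30, 0)]
pred_joints = [(1, 0), (10, 3), (20, 12), (30, 0)]
print(pdj(pred_joints, gt_joints, torso_diameter=50))
```

Sweeping `fraction` over a range of values produces a curve of detection rate against normalized distance.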

41 Datasets. LSP: sports, 1000 training, 1000 testing. FLIC: films, 3987 training, 1016 testing. Image Parse: activities, 100 training, 205 testing (cross-dataset validation).

42 Generalization: Image Parse Dataset (bar chart; values not recoverable). Methods compared: Ours; Ouyang et al., CVPR'14; Yang & Ramanan, TPAMI'13; Pishchulin et al., ICCV'13; Pishchulin et al., CVPR'13; Pishchulin et al., CVPR'12; Johnson & Everingham, CVPR'11; Yang & Ramanan, CVPR'11. Parts: Mean, Lower Legs, Upper Legs, Lower Arms, Upper Arms, Head, Torso.

43 Component Analysis

44 Unary Term vs. Full Model. Strict PCP on the LSP Dataset (VGG-LG) (bar chart; values not recoverable). Variants compared: unary term only; full model (separately trained); full model (jointly trained). Parts: Torso, Head, Upper Arms, Lower Arms, Upper Legs, Lower Legs, Mean.

45 Tree-Structured Model vs. Loopy Model (bar chart; values not recoverable). Models compared: Loopy Model; Tree Model. Parts: Mean, Lower Legs, Upper Legs, Lower Arms, Upper Arms, Head, Torso.

46 Human Pose Estimation in This CVPR. Chu et al., Structured Feature Learning for Pose Estimation. Carreira et al., Human Pose Estimation With Iterative Error Feedback. Wei et al., Convolutional Pose Machines.

47 Summary. (Figure labels: CNN, Structure, End-to-End Learning.) Bridges the gap between structure modeling and deep learning. A new message passing layer that is flexible enough to handle both tree-structured and loopy relational graphs.

48 Thank you. Welcome to our poster session #5. Scan for more details.
