Tennis player segmentation for semantic behavior analysis
Vito Renò, Nicola Mosca, Massimiliano Nitti, Tiziana D'Orazio, Donato Campagnoli, Andrea Prati, Ettore Stella
{reno, stella}@ba.issia.cnr.it
Institute of Intelligent Systems for Automation, Via G. Amendola 122 D/O, Bari, www.issia.cnr.it
National Research Council of Italy (CNR)
Outline: Introduction; Methodology (BG initialization, BG update, energy processing, variance processing, one-step frame differencing, fine tuning); Experiments and results; Conclusions and future works.
Introduction: Computer vision & sports. "Sports is said to be the social glue of society. […] Technology is therefore becoming more and more crucial […] Since the use of sensors or other devices fixed to players or equipment is generally not possible, a rich set of opportunities exist for the application of computer vision techniques to help the competitors, trainers and audience." (Computer Vision in Sports, Preface, ISBN 978-3-319-09396-3)
Introduction: Motivation. Why another background model? The results obtained with the Mixture of Gaussians show a tendency to break the players' silhouettes. The goal is to investigate how to produce better low-level outputs while avoiding post-processing.
Introduction: Background (BG) modeling in artificial vision systems. [Figure: cameras deliver raw data to an artificial vision system (BG process, object segmentation, object tracking, scene understanding), whose output serves the users.] BG modeling is one of the first low-level computational tasks executed, and its output is the input for other high-level software modules (e.g. object tracking, scene understanding). It must keep up with the very high throughputs achieved by state-of-the-art cameras.
Introduction: System overview. This work is a preliminary step of a system aimed at addressing coaching needs. Four cameras cover all game areas, each area with at least two views. The design seeks a trade-off between computational complexity and reliable results in real time.
Introduction: Globally Intrinsic VariancE for BACKground (GIVEBACK). It segments the active entities in the tennis context (balls, players) and models the intrinsic sensor variance of each returned value in the range [0, 255]. Goals: scalability (single or multi-camera), robustness, and reliable results in real time.
Methodology: Algorithm description
Background initialization
for each frame:
    variance process
    one-step frame differencing
    if (background is learned):
        foreground extraction
        fine tuning process
    background update
    energy process
Methodology: BG initialization. The BG image is set to half intensity ("all gray" logic), since there is no a priori knowledge of the scene.
Methodology: BG update.
$$BG_t(u,v) = \begin{cases} BG_{t-1}(u,v) - \kappa & \text{if } BG_{t-1}(u,v) > I_t(u,v) \\ BG_{t-1}(u,v) & \text{if } BG_{t-1}(u,v) = I_t(u,v) \\ BG_{t-1}(u,v) + \kappa & \text{if } BG_{t-1}(u,v) < I_t(u,v) \end{cases}$$
Each BG pixel value is increased or decreased by κ.
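The initialization and update rules above can be sketched in C++, the language of the implementation mentioned in the performance slide. This is a minimal sketch, not the authors' code: it assumes 8-bit grayscale frames stored row-major in a `std::vector`, takes 128 as "half intensity", and clamps results to [0, 255]. The names `Image`, `initBackground` and `updateBackground` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: 8-bit grayscale image, row-major.
using Image = std::vector<std::uint8_t>;

// BG initialization: every pixel starts at half intensity ("all gray"),
// since nothing is known about the scene yet. 128 is an assumed value.
Image initBackground(std::size_t width, std::size_t height) {
    return Image(width * height, 128);
}

// BG update: each BG pixel moves by kappa towards the current frame value;
// pixels equal to the frame are left unchanged.
void updateBackground(Image& bg, const Image& frame, int kappa) {
    assert(bg.size() == frame.size());
    for (std::size_t i = 0; i < bg.size(); ++i) {
        int b = bg[i];
        if (b > frame[i]) {
            b -= kappa;
        } else if (b < frame[i]) {
            b += kappa;
        }
        // Clamping is an assumption of this sketch to avoid 8-bit wrap-around.
        if (b < 0) b = 0;
        if (b > 255) b = 255;
        bg[i] = static_cast<std::uint8_t>(b);
    }
}
```

Because each pixel moves by at most κ per frame, a larger κ adapts faster to lighting changes but also absorbs slow-moving foreground into the BG sooner.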
Methodology: Energy process.
$$\varepsilon = \lVert BG_{t-1} - I_t \rVert_1$$
The L1 norm is used for speed reasons. The energy is used to stop the learning phase when it reaches its minimum. During the learning phase, the BG is substituted with the last captured frame.
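The energy and a stop criterion can be sketched as follows. The L1 norm follows the slide; the stopping rule (declare the BG learned at the first frame whose energy does not decrease) is a hypothetical stand-in, since the slide only says that learning stops at the minimum. The names `energy` and `LearningMonitor` are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

using Image = std::vector<std::uint8_t>;

// Energy: L1 norm of the difference between the previous BG and the current frame.
long long energy(const Image& bgPrev, const Image& frame) {
    assert(bgPrev.size() == frame.size());
    long long sum = 0;
    for (std::size_t i = 0; i < frame.size(); ++i) {
        sum += std::abs(static_cast<int>(bgPrev[i]) - static_cast<int>(frame[i]));
    }
    return sum;
}

// Hypothetical stopping rule: the BG is considered learned as soon as the
// energy fails to drop below the lowest value seen so far.
struct LearningMonitor {
    long long best = std::numeric_limits<long long>::max();
    bool learned = false;

    // Feed the energy of the current frame; returns true once learning has stopped.
    bool update(long long e) {
        if (learned) return true;
        if (e >= best) {
            learned = true;
        } else {
            best = e;
        }
        return learned;
    }
};
```

On noisy sequences a rule this strict may stop too early; a tolerance or a smoothed energy curve would likely be needed in practice.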
Methodology: Variance process. The variance is not related to the observations of a single pixel over time; instead, it is a function of the gray level returned by the sensor, so it models the sensor's different responses to different light intensities.
$$Obs(\gamma) = \{\, k = (u,v) \mid BG(u,v) = \gamma \,\}$$
$$V_t(\gamma) = \frac{V_{t-1}(\gamma)\, N_{t-1}(\gamma) + \sum_{k \in Obs(\gamma)} \big( I_t(k) - BG(k) \big)^2}{N_t(\gamma)}$$
where $N_t(\gamma)$ is the frequency of the γ-th gray level over time (the number of observations accumulated for γ up to time t).
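A per-gray-level running variance following the formula above might look like this. It is a sketch under assumptions: `bg` plays the role of $BG$ in the formula, $N_t(\gamma)$ is taken as the cumulative count of observations of γ, and gray levels with no observations in the current frame keep their previous variance. `GrayLevelVariance` is a hypothetical name.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::uint8_t>;

// V[g]: running variance for BG gray level g; N[g]: observations accumulated for g.
struct GrayLevelVariance {
    std::array<double, 256> V{};
    std::array<double, 256> N{};

    // Each pixel k contributes (I_t(k) - BG(k))^2 to gamma = BG(k), i.e. k is in Obs(gamma).
    void update(const Image& frame, const Image& bg) {
        assert(frame.size() == bg.size());
        std::array<double, 256> sumSq{};
        std::array<double, 256> count{};
        for (std::size_t k = 0; k < frame.size(); ++k) {
            const int g = bg[k];
            const double d = static_cast<double>(frame[k]) - static_cast<double>(bg[k]);
            sumSq[g] += d * d;
            count[g] += 1.0;
        }
        for (int g = 0; g < 256; ++g) {
            if (count[g] == 0.0) continue;  // no new observations: keep V_{t-1}(g)
            const double nNew = N[g] + count[g];  // N_t = N_{t-1} + |Obs(g)|
            V[g] = (V[g] * N[g] + sumSq[g]) / nNew;
            N[g] = nNew;
        }
    }
};
```

Since this model has only 256 entries regardless of resolution, its memory cost is independent of the frame size, unlike per-pixel statistics.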
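Methodology: One-step frame differencing.
$$AD = |I_t - I_{t-1}|, \qquad M_{os}(u,v) = \begin{cases} 0 & \text{if } AD(u,v) \le \tau\big(I_{t-1}(u,v)\big) \\ 255 & \text{if } AD(u,v) > \tau\big(I_{t-1}(u,v)\big) \end{cases}$$
$$\tau(\gamma) = 3.5\,\sigma(\gamma), \qquad \sigma(\gamma) = \sqrt{V(\gamma)}$$
A binary mask is calculated at each iteration, and the threshold is also a function of the specific gray value. This makes the approach robust (e.g. no BG update on moving players).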
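The thresholding can be sketched as one generic function: $M_{os}$ uses $I_{t-1}$ as the reference image, and foreground extraction could reuse the same function with $BG_{t-1}$ as the reference. The variance array is assumed to come from the variance process; the names `thresholds`, `differenceMask` and `oneStepMask` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

using Image = std::vector<std::uint8_t>;

// tau(gamma) = 3.5 * sigma(gamma), with sigma(gamma) = sqrt(V(gamma)).
std::array<double, 256> thresholds(const std::array<double, 256>& variance) {
    std::array<double, 256> tau{};
    for (int g = 0; g < 256; ++g) tau[g] = 3.5 * std::sqrt(variance[g]);
    return tau;
}

// 255 where |current - reference| exceeds the threshold of the reference
// pixel's gray level, 0 otherwise.
Image differenceMask(const Image& current, const Image& reference,
                     const std::array<double, 256>& tau) {
    assert(current.size() == reference.size());
    Image mask(current.size(), 0);
    for (std::size_t i = 0; i < current.size(); ++i) {
        const int ad = std::abs(static_cast<int>(current[i]) - static_cast<int>(reference[i]));
        if (ad > tau[reference[i]]) mask[i] = 255;
    }
    return mask;
}

// One-step frame differencing: compare I_t with I_{t-1}.
Image oneStepMask(const Image& frame, const Image& prevFrame,
                  const std::array<double, 256>& variance) {
    return differenceMask(frame, prevFrame, thresholds(variance));
}
```

Precomputing the 256 thresholds once per frame keeps the per-pixel work to one subtraction, one table lookup and one comparison.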
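Methodology: Foreground extraction.
$$AD = |I_t - BG_{t-1}|, \qquad M_{fg}(u,v) = \begin{cases} 0 & \text{if } AD(u,v) \le \tau\big(BG_{t-1}(u,v)\big) \\ 255 & \text{if } AD(u,v) > \tau\big(BG_{t-1}(u,v)\big) \end{cases}$$
$$\tau(\gamma) = 3.5\,\sigma(\gamma), \qquad \sigma(\gamma) = \sqrt{V(\gamma)}$$
This is similar to the one-step frame differencing, but the current frame is compared with the BG model.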
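A worked example with hypothetical numbers shows how the gray-level threshold decides a single pixel: assume $BG_{t-1}(u,v) = 80$ and $V(80) = 9$.

```latex
\begin{aligned}
\sigma(80) &= \sqrt{V(80)} = \sqrt{9} = 3, \qquad \tau(80) = 3.5 \cdot 3 = 10.5 \\
I_t(u,v) = 92 &: \quad AD(u,v) = |92 - 80| = 12 > 10.5 \;\Rightarrow\; M_{fg}(u,v) = 255 \\
I_t(u,v) = 88 &: \quad AD(u,v) = |88 - 80| = 8 \le 10.5 \;\Rightarrow\; M_{fg}(u,v) = 0
\end{aligned}
```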
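Methodology: Fine tuning process. Blob analysis is performed on both $M_{os}$ and $M_{fg}$ to obtain two sets of connected regions, which are used to build the update mask:
$$B_{os} = \{ b_1, b_2, \dots, b_n \}, \qquad B_{fg} = \{ b_1, b_2, \dots, b_m \}$$
$$M_{upd} = \{\, R(b_i, b_j) \mid b_i \in B_{os},\ b_j \in B_{fg},\ b_i \cap b_j \neq \emptyset \,\}$$
where $R(b_i, b_j)$ is the minimum circumscribed rectangle of the overlapping blobs $b_i$ and $b_j$.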
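Blob analysis and the update mask can be sketched as follows. Several details are assumptions not stated on the slide: blobs are 8-connected components, the minimum circumscribed rectangle is axis-aligned (the bounding box of the union of the two blobs), and every qualifying rectangle is painted with 255 in $M_{upd}$. How $M_{upd}$ gates the BG update is not detailed here; one plausible reading of "no BG update on moving players" is that pixels inside the mask are excluded from the update. `Blob`, `findBlobs` and `updateMask` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

using Image = std::vector<std::uint8_t>;

struct Blob {
    std::vector<std::size_t> pixels;  // linear indices of the blob's pixels
    int x0, y0, x1, y1;               // inclusive bounding box
};

// Blob analysis: 8-connected components of the non-zero pixels (BFS flood fill).
std::vector<Blob> findBlobs(const Image& mask, int w, int h) {
    assert(mask.size() == static_cast<std::size_t>(w) * h);
    std::vector<Blob> blobs;
    std::vector<bool> seen(mask.size(), false);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            const std::size_t start = static_cast<std::size_t>(y) * w + x;
            if (!mask[start] || seen[start]) continue;
            Blob b{{}, x, y, x, y};
            std::queue<std::pair<int, int>> q;
            q.push({x, y});
            seen[start] = true;
            while (!q.empty()) {
                auto [cx, cy] = q.front();
                q.pop();
                b.pixels.push_back(static_cast<std::size_t>(cy) * w + cx);
                b.x0 = std::min(b.x0, cx); b.y0 = std::min(b.y0, cy);
                b.x1 = std::max(b.x1, cx); b.y1 = std::max(b.y1, cy);
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = cx + dx, ny = cy + dy;
                        if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
                        const std::size_t n = static_cast<std::size_t>(ny) * w + nx;
                        if (mask[n] && !seen[n]) { seen[n] = true; q.push({nx, ny}); }
                    }
            }
            blobs.push_back(std::move(b));
        }
    return blobs;
}

// For every pair (b_i in B_os, b_j in B_fg) sharing at least one pixel,
// paint the axis-aligned rectangle circumscribing both blobs.
Image updateMask(const Image& mOs, const Image& mFg, int w, int h) {
    const std::vector<Blob> bOs = findBlobs(mOs, w, h);
    const std::vector<Blob> bFg = findBlobs(mFg, w, h);
    std::vector<int> fgLabel(mFg.size(), -1);  // pixel -> index into bFg
    for (std::size_t j = 0; j < bFg.size(); ++j)
        for (std::size_t p : bFg[j].pixels) fgLabel[p] = static_cast<int>(j);
    Image mUpd(mOs.size(), 0);
    for (const Blob& bi : bOs) {
        std::vector<bool> hit(bFg.size(), false);  // fg blobs overlapping bi
        for (std::size_t p : bi.pixels)
            if (fgLabel[p] >= 0) hit[fgLabel[p]] = true;
        for (std::size_t j = 0; j < bFg.size(); ++j) {
            if (!hit[j]) continue;
            const Blob& bj = bFg[j];
            const int x0 = std::min(bi.x0, bj.x0), y0 = std::min(bi.y0, bj.y0);
            const int x1 = std::max(bi.x1, bj.x1), y1 = std::max(bi.y1, bj.y1);
            for (int y = y0; y <= y1; ++y)
                for (int x = x0; x <= x1; ++x) mUpd[static_cast<std::size_t>(y) * w + x] = 255;
        }
    }
    return mUpd;
}
```

Labeling $B_{fg}$ once and scanning each $B_{os}$ blob's pixels finds all overlapping pairs in time linear in the number of foreground pixels, instead of comparing every pair of blobs.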
Experiments and results: Dataset description. Video sequences of a tennis training session. The raw videos were acquired by AVT Prosilica GT1920C cameras (frame size 1920×1024 at 50 fps) equipped with auto-iris lenses. They include typical situations of a tennis training session, e.g. players with colors similar to the ground, fast balls, and stripes on the red ground.
Experiments and results: Test case. GIVEBACK, MoGv2, GMG and a Kalman-filter-based BG are evaluated. Starting from frame f0, 10 images are taken every 500 frames. The GMG and MoGv2 implementations are available online (https://github.com/andrewssobral/bgslibrary); the Kalman-filter-based BG is available in MVTec Halcon 12. Precision, Recall and F-Measure are used as metrics for the quantitative results, while qualitative results are presented as ground-truth and foreground masks.
Experiments and results: Qualitative results
Experiments and results: Qualitative results. Parts of the player are considered as BG.
Experiments and results: Qualitative results. Ghosting issues.
Experiments and results: Qualitative results. Well-cut player silhouette (ghost reduction while preserving the shape).
Experiments and results: Quantitative results. $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$. [Figure: Precision vs. Recall scatter plot for GMG, MOG2, MOG2 FILT, KALMAN, KALMAN FILT, GIVEBACK and GIVEBACK FINE TUNE; x-axis: Precision, y-axis: Recall. Each point refers to a run; the ground truth lies in the upper-right position (1, 1).]
Experiments and results: Quantitative results. [Figure: box plot of the F-Measure, $F = \frac{2PR}{P + R}$, for GMG, MOG2, MOG2 FILT, KALMAN, KALMAN FILT, GIVEBACK and GIVEBACK FINE TUNE; each box shows the median F and the 25th to 75th percentile, + marks outliers.]
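Pixel-wise Precision, Recall and F-Measure for one foreground mask against its ground truth follow directly from the formulas above. This sketch treats any non-zero pixel as foreground and reports 0 when a denominator is empty; both are assumptions, as is scoring a single image rather than pooling counts over a run. `Scores` and `evaluate` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::uint8_t>;

struct Scores {
    double precision, recall, fMeasure;
};

// Compare a foreground mask with a ground-truth mask (non-zero = foreground).
Scores evaluate(const Image& mask, const Image& truth) {
    assert(mask.size() == truth.size());
    double tp = 0, fp = 0, fn = 0;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        const bool m = mask[i] != 0, g = truth[i] != 0;
        if (m && g) ++tp;        // correctly detected foreground
        else if (m && !g) ++fp;  // background marked as foreground
        else if (!m && g) ++fn;  // foreground missed
    }
    // Assumed convention: a ratio with an empty denominator is reported as 0.
    const double p = (tp + fp > 0) ? tp / (tp + fp) : 0.0;
    const double r = (tp + fn > 0) ? tp / (tp + fn) : 0.0;
    const double f = (p + r > 0) ? 2.0 * p * r / (p + r) : 0.0;
    return {p, r, f};
}
```

The F-Measure is the harmonic mean of P and R, so a method that fragments silhouettes (low R) cannot compensate with high P alone.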
Experiments and results: Quantitative results. [Figure: the same F-Measure box plot ($F = \frac{2PR}{P + R}$, median F, 25th to 75th percentile, + outliers) for GMG, MOG2, MOG2 FILT, KALMAN, KALMAN FILT, GIVEBACK and GIVEBACK FINE TUNE, annotated "Reproducible results over time".]
Experiments and results: Performance. C++ implementation on a PC with an Intel Xeon E5-2603 @ 1.60 GHz, 32 GB RAM and Windows 7 64-bit OS. The algorithm can run at 50 fps.
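As a rough check on what 50 fps implies, assuming a single stream at the acquisition frame size of 1920×1024:

```latex
\begin{aligned}
t_{\text{frame}} &= \frac{1}{50\ \text{fps}} = 20\ \text{ms} \\
1920 \times 1024 &= 1\,966\,080\ \text{pixels per frame} \\
1\,966\,080 \times 50 &= 98\,304\,000\ \text{pixels/s} \approx 98.3\ \text{Mpixel/s} \\
\frac{20\ \text{ms}}{1\,966\,080} &\approx 10.2\ \text{ns per pixel}
\end{aligned}
```

These figures describe one camera; processing several of the four cameras on the same machine would multiply the pixel rate accordingly.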
Conclusions and future works: An efficient method to segment active entities in the tennis context while preserving the players' silhouettes. The BG is modeled as a mean image, while the variance is related to the specific gray level captured by the sensor (enriched by a selective update mask). Future work: implementations and optimizations directly on smart cameras (e.g. FPGA or ARM architectures), and high-level analyses such as posture recognition and semantic analysis.