Statistical Filters for Crowd Image Analysis

Ákos Utasi, Ákos Kiss and Tamás Szirányi
Distributed Events Analysis Research Group, Computer and Automation Research Institute
H-1111 Budapest, Kende utca 13-17, Hungary
{utasi, akos.kiss, sziranyi}@sztaki.hu

Abstract

A mass of people can behave like a random moving swarm, whose complexity can be described by statistical features. This paper gives solutions for recognising unusual motion patterns from overall motion statistics. The resulting system is tested on the PETS2009 dataset, scenarios S3: Event Recognition and S1: Person Count and Density Estimation, with convincing results.

1. Introduction

While there is a wide range of approaches, many of them cannot be applied in outdoor surveillance due to unreliable observation data. Surveillance applications face numerous problems, and as discussed in several papers (e.g. [1]) there is a significant gap between laboratory testing and real-life applications, where there are several sources of noise. This paper describes a statistics-based evaluation of special motion events of masses of people, tested on the PETS outdoor video dataset. Special events, such as walking, running, rapid dispersion, local dispersion, crowd formation and splitting, are estimated from statistics over global and local probabilistic models. We give a meaningful solution for unifying models of global motion statistics and local spatiotemporal flow estimation. A 4-dimensional Mixture of Gaussians (MOG) model is used to characterise the usual motion patterns depending on the location in the pedestrians' area. A likelihood function gives the probability of a flow map, together with the global motion probability. The paper avoids the ambiguous definition of object shape or connectivity, since these change abruptly over time at any position. We show that relatively simple statistical models can provide adequate answers to the given questions.
2. Our approach

Our event recognition method is based on low-level motion statistics. We created several low-level detectors to model different properties of the dense optical flow vector field. Our method performs the following steps:

1. Preprocessing: background-foreground separation, optical flow calculation and filtering;
2. Low-level detectors: produce properties for a given observation (properties of the dense optical flow vector field);
3. Event recognition: uses the output of the low-level detectors to categorise the event and provides a membership probability;
4. State (event) determination: the event category with the highest membership probability is selected.

3. Preprocessing

3.1. Background-foreground separation

For background modelling we used the CIE L*U*V* uniform colour space. We trained MOGs for each pixel using EM [2]. For background-foreground separation we used the method proposed in [3], but we omitted the update procedure. Moreover, the variances of the L* channel were increased to handle lighting condition changes more effectively. A match in pixel (i, j) is defined as

    Σ_c (y^c_{i,j} − μ^c_{i,j})² / Σ^c_{i,j} < T    (1)

where y^c_{i,j} is the value of pixel (i, j) in channel c ∈ {L, U, V}, μ and Σ denote the expected value and variance, respectively, and T = 6.0 is a constant [3]. The first B components are chosen as the background model:

    B = arg min_b ( Σ_{l=1}^{b} w_l > T )    (2)

where in our case the parameter T was set to 1.0 to select all Gaussians in the model. Finally, morphological operators were used to clean the noisy foreground mask. Further improvements could be achieved by removing shadows from the foreground mask (e.g. using the technique of [4]).
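The match test of Eq. (1) and the background selection of Eq. (2) can be sketched as follows. This is a minimal illustration that assumes the per-pixel mixture parameters are already trained; the function names and array layout are ours, not the paper's implementation.

```python
import numpy as np

def matches(y, mu, var, T=6.0):
    """Eq. 1: pixel value y matches a Gaussian component if the
    variance-normalised squared distance, summed over the L*, U*, V*
    channels, stays below the threshold T (T = 6.0 as in [3])."""
    y, mu, var = np.asarray(y), np.asarray(mu), np.asarray(var)
    return np.sum((y - mu) ** 2 / var) < T

def background_components(weights, T=1.0):
    """Eq. 2: pick the first B components (sorted by weight, descending)
    whose cumulative weight exceeds T; T = 1.0 keeps every component."""
    w = np.sort(np.asarray(weights))[::-1]
    cum = np.cumsum(w)
    B = int(np.searchsorted(cum, T)) + 1
    return min(B, len(w))
```

With T = 1.0 and weights summing to one, every component is selected, matching the paper's choice of keeping all Gaussians in the background model.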
3.2. Optical flow calculation and filtering

We used the method of [5] to calculate the optical flow, which was smoothed by a spatial median filter of radius r = 1. The optical flow vectors were transformed to polar coordinates, followed by several simple filters: unusually small and large magnitudes were dropped, and vectors outside the foreground mask were dropped.

4. Low-level detectors

In this section we present the detectors we used to model different features of the optical flow field. The models of the detectors are trained on the PETS Regular flow training data.

4.1. Detecting unusual optical flow

Mixtures of Gaussians have been successfully used in several previous works for motion segmentation [7, 9, 8]. Having the training set of regular flows, we extracted the optical flow vectors and trained [2] a 4-dimensional (x, y, vx, vy) MOG model (location + velocities) with 64 components in the mixture. The model learns the location, the speed and the direction of the regular activity from the training set. Fig. 1 represents the means of the Gaussians of the mixture: the solid red line represents the mean direction and magnitude at the mean location, while the radii of the ellipses are proportional to the location variances. For an incoming optical flow field O = {o_1, ..., o_K} (o_k = (x_k, y_k, vx_k, vy_k)) the probability can be expressed as

    P(O) = ( Π_{k=1}^{K} P_MOG(o_k) )^{1/K}    (3)

where P_MOG(o) = Σ_{l=1}^{M} w_l N(o | μ_l, Σ_l). During detection the optical flow vectors are collected from the video frames in a time window of size W = 5.

4.2. Detecting unusual magnitudes

In order to describe the typical magnitudes of the usual activity, a 1-dimensional Gaussian model was estimated from the training dataset. Before training the model, the optical flow vectors were transformed into 3D space in order to normalise the magnitudes. For calculating the probability of a set of magnitudes we used the formula of Eq. 3 with the Gaussian probability in the product.

5. Event categorisation

Our method currently recognises three types of events: regular activity, running and splitting.
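The unusual-flow detector of Eq. (3) can be sketched as follows. This is an illustration only: scikit-learn's GaussianMixture stands in for the EM training of [2], and synthetic flow vectors stand in for the PETS training data; Eq. (3) is evaluated in the log domain, where the K-th root of the product becomes the mean of the per-vector log densities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 4-D flow samples (x, y, vx, vy) standing in for the
# regular-flow training set of Sec. 4.1.
rng = np.random.default_rng(0)
regular_flow = rng.normal(loc=[100.0, 50.0, 1.0, 0.0],
                          scale=[30.0, 20.0, 0.3, 0.3],
                          size=(5000, 4))

# 64-component MOG over location + velocities, as in Sec. 4.1.
mog = GaussianMixture(n_components=64, covariance_type='full',
                      random_state=0).fit(regular_flow)

def flow_field_probability(O, model):
    """Eq. 3: P(O) = (prod_k P_MOG(o_k))^(1/K), i.e. the geometric
    mean of the mixture densities of the K flow vectors."""
    log_p = model.score_samples(O)      # log P_MOG(o_k) for each vector
    return float(np.exp(log_p.mean()))  # mean log = log of the K-th root

usual = flow_field_probability(regular_flow[:200], mog)
unusual = flow_field_probability(regular_flow[:200] + [0, 0, 5, 5], mog)
assert usual > unusual  # flow with shifted velocities is less probable
```

The geometric mean makes the score independent of the number K of collected vectors, so fields gathered over the W = 5 frame window remain comparable.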
The recognition can easily be extended with other low-level feature detector plugins.

5.1. Event recognition

Using the low-level feature detectors presented in Sec. 4, we calculated the mean probabilities (or log-probabilities) of the training data. Let P̄_i denote the mean probability of the i-th low-level detector, and D_i its standard deviation. Then, to express that a given low-level feature f_i with probability p_i is similar to the training dataset, we can define the membership similarity measure

    M_sim(p_i) = N_R(p_i | P̄_i, D_i)    (4)

where N_R is a right-truncated Gaussian. Similarly, we can define the membership dissimilarity measure as

    M_dissim(p_i) = N_L(p_i | P̄_i − K·D_i, D_i)    (5)

where N_L denotes the left-truncated Gaussian; the value of K is typically 2.5-4.0 (in our case it was set to 3.5). Fig. 2 demonstrates the two functions.

Figure 1: Mixture of Gaussians (MOG) ellipses in 4 dimensions (x, y, vx, vy), represented by the cut at 2.5σ; the velocities are represented by small red vectors at the centre points.

Our event categorisation algorithm is based on the above membership functions. Regular activity, for example, is constructed from two similarity functions on the two low-level feature detectors (unusual flow and unusual magnitude) presented in Sec. 4. We defined the following three recognisers:

Regular activity (R_reg): usual flow (M_sim) and usual magnitudes (M_sim), see Fig. 3 top;
Figure 2: Membership similarity (black) and dissimilarity (red) functions.

Running (R_run): unusual flow (M_dissim) and unusual magnitudes (M_dissim), see Fig. 3 middle;

Split (R_split): unusual flow (M_dissim) and usual magnitudes (M_sim), see Fig. 3 bottom.

Each recogniser calculates the product of its membership similarity and dissimilarity values. Finally, the most probable (highest value) case is selected to define the output state.

6. Experiments

We tested our event recognition system on the PETS Event Recognition dataset, selecting the videos containing the running and splitting events. The false alarm ratio was extremely low; the confusion matrix is shown below.

    Event     Regular   Run   Split
    Regular   160       1     0
    Run       1         115   0
    Split     24        0     45

Please note that the end of the split event cannot be clearly defined, hence the true performance might be higher. The detected state sequences are shown in Fig. 4.

Figure 4: Most probable state sequences. Top: S3.L3 Sequence 1 (running) with timestamp 14-16; Bottom: S3.L3 Sequence 3 (split) with timestamp 14-31. States: 0 - regular, 1 - running, 2 - split.

7. Person Count and Density Estimation

For person count and density estimation we manually trained an estimator from the training sequences (S0 Regular flow), similarly to [10]. Each image frame in the training set of M frames was segmented into N = 40 regions (40 equal columns), and for each region we collected the number of foreground pixels (Sec. 3.1), resulting in an N × M matrix denoted by F. Moreover, let p = [p_1, ..., p_M]^T denote the number of pedestrians for the training set. Using the ground truth data of F and p we can estimate the probable number of pedestrians per foreground pixel for each region, denoted by r and computed as the solution of

    F r = p.    (6)

For an unknown image frame i we collect the foreground pixel counts in each region as the feature vector x_i; then the number of pedestrians is estimated as

    p_i = x_i r.    (7)
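The estimator of Eqs. (6)-(7) amounts to a linear least-squares fit. A minimal sketch on synthetic data follows; the matrix orientation (frames × regions) and all numbers other than N = 40 regions and the 800 training frames are illustrative, not taken from the paper.

```python
import numpy as np

# Sketch of the person-count estimator: solve F r = p in the
# least-squares sense for the pedestrians-per-foreground-pixel
# ratios r, then estimate the count of a new frame as x_i . r.
rng = np.random.default_rng(1)
N, M = 40, 800                        # regions, training frames
true_r = rng.uniform(0.001, 0.01, N)  # hidden per-region ratios

# F: per-region foreground-pixel counts for each training frame;
# p: ground-truth pedestrian counts (noiseless here for illustration).
F = rng.integers(0, 2000, (M, N)).astype(float)
p = F @ true_r

# Eq. 6: least-squares solution of F r = p
r, *_ = np.linalg.lstsq(F, p, rcond=None)

# Eq. 7: estimate the count of an unseen frame from its feature vector
x_new = rng.integers(0, 2000, N).astype(float)
p_hat = float(x_new @ r)
```

With real data, occlusion adds noise to F, so r only approximates the true ratios; this is consistent with the reliability limits discussed below.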
Please note that only a subset of the regions hold usable information; the others might be skipped, or their components might be computed from the nearby important regions. We used 800 frames from the S0 Regular flow training sequences to train our estimator, and the remaining 421 frames were used for testing. The result is shown in Fig. 5. The algorithm is fast, but occlusion greatly reduces its reliability. This is the reason for the deviation of ca. 5-6 detected persons in Fig. 5, which leads to the error diagrams in Fig. 6. The explanation of the high relative error values is that the error band has approximately the same width across the range, indicating that with this model occlusion acts as additive rather than multiplicative noise.

8. Summary and Conclusions

In this paper we presented a probabilistic event classification system. The design of the proposed system allows us to easily integrate new low-level detector plugins to recognise other complex event classes. In the future we plan to use probabilistic models that take into account the duration of the events (e.g. hidden semi-Markov models [6]). Moreover, our person count estimator can be improved by including temporal information in the model (e.g. using temporal filters).
Figure 5: S1: Guessed number of people grouped by the S0 ground truth.

Acknowledgements

This work has been supported by the Hungarian Research Fund OTKA 76159 and the European Defense Agency in the MEDUSA project.

References

[1] Anthony R. Dick and Michael J. Brooks, "Issues in Automated Visual Surveillance," in Proc. 7th International Conference on Digital Image Computing: Techniques and Applications, pp. 195-204, Sydney, 2003.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Stat. Soc., vol. 39, pp. 1-38, 1977.
[3] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 246-252, Fort Collins, CO, USA, 23-25 June 1999.
[4] Cs. Benedek and T. Szirányi, "Bayesian foreground and shadow detection in uncertain frame rate surveillance videos," IEEE Transactions on Image Processing, vol. 17, no. 4, pp. 608-621, 2008.
[5] J. R. Bergen and R. Hingorani, "Hierarchical Motion-Based Frame Rate Conversion," Technical report, David Sarnoff Research Center, Princeton, NJ 08540, 1990.
[6] J. Ferguson, "Variable duration models for speech," in Proc. Symposium on the Application of HMMs to Text and Speech, pp. 143-179, 1980.
[7] Y. Weiss and E. H. Adelson, "A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 321-326, 1996.
[8] Wei Zhang, Xiangzhong Fang, and Xiaokang Yang, "Spatiotemporal Gaussian mixture model to detect moving objects in dynamic scenes," J. Electron. Imaging, vol. 16, 2007.
[9] Roland Wilson and Andrew Calway, "Multiresolution Gaussian mixture models for visual motion estimation," in Proc. IEEE International Conference on Image Processing, pp. 921-924, Oct. 2001.
[10] J. Yin, S. Velastin, and A. Davies, "Image processing techniques for crowd density estimation using a reference image," in Proc. 2nd Asia-Pacific Conference on Computer Vision, pp. 610, 1995.
Figure 3: Probability values of the low-level detectors for the regular activity (top), running event (middle) and split event (bottom). Left column: output of the unusual flow detector; right column: output of the unusual magnitude detector. A logarithmic scale is used.
Figure 6: Histograms of error (left) and relative error (right) on the training (top) and test (bottom) data.