Content-based Decomposition of Gesture Videos

Nikolaos D. Doulamis
National Technical University of Athens, Dept. of Electrical and Computer Engineering

Anastasios D. Doulamis
National Technical University of Athens, Dept. of Electrical and Computer Engineering

Dimitrios I. Kosmopoulos
National Centre for Scientific Research Demokritos, Institute of Informatics & Telecommunications

Abstract - In this paper we present a novel method for decomposing gesture videos based on the depicted content. Key-frames are extracted from the initial content, and the neighboring frames are assigned to the key-frames of similar content. The resulting frame groups are decomposed into binary trees, based on the energy of the depicted gestures. When bandwidth is reduced, we keep the original timeline but send only a dynamically adapted video summary. The respective frames are obtained by moving appropriately across the hierarchy layers of the constructed tree. The hierarchically structured video can be used for purposes such as efficient video browsing and transmission of dynamically generated summaries over low-bandwidth networks, for communication or human-computer interface applications.

I INTRODUCTION

Many interesting approaches have been proposed to improve the quality of life of disabled people and aid their inclusion in society. This also applies to people with hearing and speech impairments, who use sign language to communicate. Video depicting communicative signs requires special analysis (compared with conventional video content) to derive the semantic meaning of the content. At the same time, there is a tremendous demand for multimedia data transmitted over networks, and this demand is expected to increase rapidly in the coming years due to the development of low-cost devices for capturing and generating multimedia information.
Despite recent progress in coding and compression algorithms, such as the MPEG-4 standard [1], transmitting multimedia information, and especially digital video, in a cost-effective and quality-guaranteeing manner over low-bandwidth networks such as the Internet remains one of the most challenging problems. This is because digital video, even in the compressed domain, imposes very large bandwidth requirements. Navigating and accessing huge video databases is also difficult, despite the recent advances in video coding and the proposal of hierarchical coding schemes such as Fine Granularity Scalability (FGS) [2, 3]. This concept can be extended to video sequences by initially transmitting a coarse (low) resolution and then decomposing particular video segments into a higher (finer) resolution. In this way, and taking bandwidth constraints into account, the respective video content is transmitted.

The works of [4-8] can be regarded as conventional video adaptation algorithms. The goal of these algorithms is to extract a small "video abstract" by discarding visual information, similarly to text summaries of documents. In a video delivery system, by contrast, the user is allowed to see visual content at various resolutions. In this direction, algorithms for progressive retrieval of images or video sequences, such as the ones presented in [9, 10], can be considered. These techniques initially transmit a coarse (low) "resolution view" of an image, followed by additional residual information. However, these approaches decompose (downsample) visual information linearly in the spatial and temporal directions, at fixed units of pixels or frames. Hierarchical video summarization has also been adopted in the framework of the MPEG-7 standard [11], through the HierarchicalSummary description scheme. The technique uses only key-frame organization and clusters video segments according to visual content and temporal coherency.
In [12], a content-based video decomposition scheme was presented to organize visual information at different levels of content hierarchy. The method represents video information in a tree structure, each level of which corresponds to a content resolution, while the tree nodes correspond to the video segments partitioned at that level. Shots and frames are considered the basic elements for the hierarchical video decomposition.

The main drawback of all the above-mentioned methods, however, is that a) generic low-level features are used for modeling video content and b) content decomposition is not constrained with respect to the current bandwidth requirements. The second issue means that the amount of transmitted video content is not restricted by the network channel characteristics, which change dynamically over time. For this reason, a video content decomposition scheme is required that adapts the transmitted video content to the network characteristics (adaptive video transmission).

In this paper, we propose a novel video decomposition scheme for applications such as efficient browsing and adaptive delivery and transmission of gesture videos such as sign language streams. Gesture videos are also useful for human-computer interaction, or even robot guidance over networks. In these types of videos, the semantic content is mainly defined by the gesture variation, i.e., the movement of the hands and head. Adaptation of gesture videos allows for efficient streaming of such videos over low-bandwidth networks, for navigation of sign language streams in huge media databases, for delivery of the salient content in robotic guidance scenarios, and for searching and indexing particular content of interest in gesture videos.

The proposed architecture consists of the gesture segmentation-representation module and the adaptive hierarchical decomposition module. Gesture segmentation is performed in our case using skin color information. The segmented regions are then represented using Zernike moments [13]. Zernike moments are adopted instead of other types of moments, such as Cartesian or Hu moments, because they are orthogonal and thus have stronger representation capabilities. Due to orthogonality, the sum of the squared coefficients of the Zernike moments expresses the energy of the gesture shape [13].

The next step of the proposed architecture is the adaptive hierarchical content decomposition algorithm. More specifically, the shape energy is used for estimating the key-frames, based on the curvature of the shape energy trajectory. Next, the remaining video frames are temporally classified with respect to the estimated key-frames.
Then, an adaptive decomposition algorithm is proposed, which dynamically calculates the number of frames that must be transmitted with respect to content variations and bandwidth constraints. The algorithm clusters video frames using a binary tree classification scheme.

II GESTURE SEGMENTATION AND SUMMARIZATION

The goal of the gesture segmentation and summarization procedure is to segment the head and hand regions in the image and create a content-based representation for them, and then, based on this representation, to select only those frames that adequately describe the performed gestures. As described in the following, the segmentation is performed using skin color, the head and hand areas are described by Zernike moments, and the key-frames are extracted by identifying the maximum temporal variations of the gesture energy.

We assume that the depicted person (target) faces the camera with his/her upper body captured in the image. The hands may disappear from the image, they may occlude each other, and they may occlude the head, but we assume that the target is dressed. To perform color modeling, we first locate the face in the image using a face detector from a readily available vision library. We train a color classifier on the face color and then use the modeled color to find the hand regions in the image. To solve this target-specific, and thus reduced, skin color modeling problem, we assume a single multivariate Gaussian model. The color probability density function in the Hue, Saturation and Intensity color space (selected for its similarity to human perception) is given by:

$$ p(c \mid \text{skin}) = \frac{1}{(2\pi)^{3/2} \, |\Sigma|^{1/2}} \exp\Bigl( -\tfrac{1}{2} (c - \mu)^{T} \Sigma^{-1} (c - \mu) \Bigr) \tag{1} $$

where $c$ is the color vector and $\mu$, $\Sigma$ are the mean vector and the covariance matrix, respectively. The skin regions are obtained after calculating the skin probability for all pixels.
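As an illustration of the single-Gaussian skin model, the following minimal Python sketch fits a mean and covariance to color samples and evaluates the density for one pixel. The function names (`fit_gaussian`, `skin_pdf`), the use of plain nested lists for 3-component color vectors, and the biased covariance estimate are assumptions made for this example; the paper does not specify an implementation.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate (assumes a non-zero determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = det3(m)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]

def fit_gaussian(samples):
    """Mean vector and (biased) covariance matrix of 3-component color samples."""
    n = len(samples)
    mu = [sum(s[j] for s in samples) / n for j in range(3)]
    cov = [[sum((s[r] - mu[r]) * (s[q] - mu[q]) for s in samples) / n
            for q in range(3)] for r in range(3)]
    return mu, cov

def skin_pdf(color, mu, cov):
    """Evaluate the multivariate Gaussian density p(color | skin)."""
    d = [color[j] - mu[j] for j in range(3)]
    inv = inv3(cov)
    # Mahalanobis term (c - mu)^T Sigma^-1 (c - mu)
    maha = sum(d[r] * inv[r][q] * d[q] for r in range(3) for q in range(3))
    norm = (2 * math.pi) ** 1.5 * math.sqrt(det3(cov))
    return math.exp(-0.5 * maha) / norm
```

In such a sketch, a pixel would be labeled as skin when `skin_pdf` exceeds a threshold; the threshold value is application dependent and is not given in the text.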
Depending on the application context, the Intensity (and sometimes also the Saturation) channel may be excluded from the color probability calculation. Using the color model, we segment the image into skin and non-skin regions, resulting in a binary mask (filtered to remove noise, and keeping up to the three biggest connected components). The mask is applied to the Intensity channel of the image to obtain a masked gray-level image that includes only the head and the two hands. The gray-level image provides a richer representation of the current gesture than the binary masks; using the latter can lead to loss of information, especially for different gestures with similar silhouettes, e.g., the front and back image of the hand.

We use Zernike moments to represent the activity state, as it is expressed by the relative position and shape of the head and the two hands, because of their noise resiliency, their reduced information redundancy and their reconstruction capability. The complex Zernike moments of order $p$ are defined as [13]:

$$ A_{pq} = \frac{p+1}{\pi} \int_{0}^{2\pi} \int_{0}^{1} R_{pq}(r) \, e^{-jq\theta} f(r, \theta) \, r \, dr \, d\theta \tag{2a} $$

$$ r = \sqrt{x^{2} + y^{2}}, \quad \theta = \tan^{-1}(y/x), \quad -1 < x, y < 1 \tag{2b} $$

$$ R_{pq}(r) = \sum_{s=0}^{(p-|q|)/2} (-1)^{s} \frac{(p-s)!}{s! \left( \frac{p+|q|}{2} - s \right)! \left( \frac{p-|q|}{2} - s \right)!} \, r^{p-2s} \tag{2c} $$

with $p - |q|$ even and $0 \le |q| \le p$.
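The Zernike definition can be approximated on a pixel grid as in the hedged Python sketch below. It assumes a square gray-level image stored as a list of rows, maps pixel centers to the square [-1, 1] x [-1, 1], ignores pixels outside the unit disk, and replaces the integral by a sum; these discretization choices and the function names are assumptions of this example, not details taken from the paper.

```python
import cmath
import math
from math import factorial

def radial_poly(p, q, r):
    """Zernike radial polynomial R_pq(r); requires |q| <= p and p - |q| even."""
    q = abs(q)
    if q > p or (p - q) % 2:
        raise ValueError("need |q| <= p and p - |q| even")
    total = 0.0
    for s in range((p - q) // 2 + 1):
        coeff = ((-1) ** s * factorial(p - s)
                 / (factorial(s)
                    * factorial((p + q) // 2 - s)
                    * factorial((p - q) // 2 - s)))
        total += coeff * r ** (p - 2 * s)
    return total

def zernike_moment(image, p, q):
    """Discrete approximation of A_pq for a square image given as a list of rows."""
    n = len(image)
    step = 2.0 / n                      # pixel size after mapping to [-1, 1]
    acc = 0j
    for row in range(n):
        for col in range(n):
            x = -1.0 + (col + 0.5) * step
            y = -1.0 + (row + 0.5) * step
            r = math.hypot(x, y)
            if r > 1.0:
                continue                # pixel lies outside the unit disk
            theta = math.atan2(y, x)
            acc += image[row][col] * radial_poly(p, q, r) * cmath.exp(-1j * q * theta)
    return (p + 1) / math.pi * acc * step * step
```

Centering the coordinate system on the image center is a simplification; the normalization around the center of gravity described in the text would shift `x` and `y` accordingly.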

For the purposes of this work we chose to normalize the moments around the center of gravity, because it is the relative pose of the head with respect to the hands that signifies a change of activity. Normalization to scale, on the contrary, is undesired, because scale variations within the same shot signify interesting gestural events. To represent the gesture state we use as signature the sum of the squared coefficients of the Zernike moments, which corresponds to the gesture energy [13]:

$$ J = \sum_{p=0}^{Q} \; \sum_{\substack{|q| \le p \\ p - |q| \text{ even}}} \lVert A_{pq} \rVert^{2} \tag{3} $$

where $Q$ is the selected order of the moments and $\lVert \cdot \rVert$ denotes the $L_2$ norm.

The energy is plotted for each frame of the shot, forming a trajectory that expresses the temporal variation of the shape energy. The second derivative of $J$ with respect to time, over all frames within a shot, is used as a curvature measure. Local maxima correspond to time instances of peak shape variation, while local minima indicate low shape variation. Let $J(k)$ denote the energy of the shape coefficients of the $k$-th frame of the examined shot. Initially, the first derivative of the signal $J(k)$, say $J'(k)$, is evaluated, approximated as the difference between two successive frames, $J'(k) = J(k+1) - J(k)$. To minimize the influence of noise, a weighted average of the first derivative over a window, say $J'_w(k)$, is used, given by:

$$ J'_w(k) = \sum_{l=\alpha(k)}^{\beta(k)} w_{l-k} \bigl( J(l+1) - J(l) \bigr), \quad k = 0, \ldots, M-2 \tag{4} $$

where $\alpha(k) = \max(0, k - N)$, $\beta(k) = \min(M-2, k + N)$, and $2N+1$ is the length of the window, centered at frame $k$. The variable $M$ denotes the number of frames of the shot.
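The windowed first derivative can be sketched in a few lines of Python. The example assumes equal weights of 1/(2N+1) for every window position, which is only one admissible choice of weights, and the function name is hypothetical.

```python
def weighted_first_derivative(J, N):
    """Smoothed first difference of the energy sequence J over a window of 2N+1 frames."""
    M = len(J)
    w = 1.0 / (2 * N + 1)               # equal weights (an assumption of this sketch)
    out = []
    for k in range(M - 1):              # k = 0, ..., M-2
        lo = max(0, k - N)              # alpha(k)
        hi = min(M - 2, k + N)          # beta(k)
        out.append(sum(w * (J[l + 1] - J[l]) for l in range(lo, hi + 1)))
    return out
```

Because the window is truncated at the shot boundaries while the weights stay fixed, the first and last few values are attenuated relative to the interior; a practical implementation might renormalize the weights there.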
The weights $w_l$ are defined for $l \in \{-N, \ldots, N\}$; in the simplest case all weights are equal, meaning that the derivatives of all frames within the window interval are equally important:

$$ w_l = \frac{1}{2N+1}, \quad l = -N, \ldots, N \tag{5} $$

Similarly, the weighted second derivative $J''_w(k)$ for the $k$-th frame is defined as:

$$ J''_w(k) = \sum_{l=\alpha(k)}^{\beta(k)} w_{l-k} \, J''(l) \tag{6} $$

where

$$ J''(k) = J'(k+1) - J'(k), \quad k = 0, \ldots, M-3 \tag{7} $$

and $\alpha(k) = \max(0, k - N)$, $\beta(k) = \min(M-3, k + N)$.

As explained previously, the local maxima and minima of $J''_w$ are taken as the appropriate curve points, i.e., as the time instances of the selected key-frames. Since $J''_w$ is a discrete-time sequence, the set of local extrema is the union of two sets, $X = X_M \cup X_m$, where $X_M$ contains the time instances of the local maxima of $J''_w$ and $X_m$ those of its local minima. The sets $X_M$ and $X_m$ are estimated as follows:

$$ \begin{aligned} X_M &= \{ k : J''_w(k-1) < J''_w(k) \ \text{and} \ J''_w(k) > J''_w(k+1) \} \\ X_m &= \{ k : J''_w(k-1) > J''_w(k) \ \text{and} \ J''_w(k) < J''_w(k+1) \} \end{aligned} \tag{8} $$

III VIDEO DECOMPOSITION AND ADAPTIVE TRANSMISSION

The key-frames extracted as described in Section II preserve the semantics of the gesture. Video content decomposition is therefore performed in our case by exploiting the key-frame information. In particular, the remaining frames are temporally grouped around the extracted key-frames. These groups are composed in a spatiotemporal fashion, considering both gesture energy (spatial) and time proximity (temporal). The problem is to divide the frames on the timeline between two adjacent key-frames into two groups, one represented by the first key-frame and the other by the second. This is equivalent to finding the optimal frame index $\hat{m}$ that minimizes the following objective function:

$$ \hat{m} = \arg\min_{m} \Bigl\{ \sum_{k=1}^{m} \lvert J_k - g_1 \rvert + \sum_{k=m+1}^{N} \lvert J_k - g_2 \rvert \Bigr\} \tag{9} $$

where $g_1$, $g_2$ are the shape energies of the two key-frames and $J_k$ is the shape energy of the $k$-th of the $N$ frames on the timeline between the two key-frames.
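Both discrete searches described above, the local extrema of the smoothed second derivative and the boundary index between two adjacent key-frames, reduce to short scans. The Python sketch below is illustrative only: the function names are hypothetical, strict inequalities are used for the extrema, and the boundary search also allows the degenerate splits in which one group receives no intermediate frames, which the text does not explicitly address.

```python
def local_extrema(d2):
    """Indices k at which the sequence d2 has a strict local maximum or minimum."""
    maxima = [k for k in range(1, len(d2) - 1) if d2[k - 1] < d2[k] > d2[k + 1]]
    minima = [k for k in range(1, len(d2) - 1) if d2[k - 1] > d2[k] < d2[k + 1]]
    return sorted(maxima + minima)

def best_split(energies, g1, g2):
    """Number of leading frames assigned to the first key-frame (energy g1).

    `energies` holds the shape energies of the frames lying between the two
    key-frames, in temporal order; the rest go to the second key-frame (g2).
    """
    best_m, best_cost = 0, float("inf")
    for m in range(len(energies) + 1):
        cost = (sum(abs(j - g1) for j in energies[:m])
                + sum(abs(j - g2) for j in energies[m:]))
        if cost < best_cost:
            best_m, best_cost = m, cost
    return best_m
```

For example, with intermediate energies `[1.0, 1.1, 0.9, 5.0, 5.2]` and key-frame energies 1.0 and 5.0, the first three frames join the first key-frame.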
The frames with time index $k \le \hat{m}$ are then assigned to the group of the first key-frame, while the remaining ones are assigned to the second group. For the first (last) key-frame of the video sequence, the frames located in time between the beginning (end) of the timeline and the key-frame are directly assigned to the group of that key-frame.

After defining the groups, say $G_i$, $i = 1, \ldots, K$, the frames in each group $G_i$ are processed to compose a binary tree based on the difference of their shape energies. Each node of the tree corresponds to a video frame. Each node is in turn decomposed into two frames, say $f_r$, $f_s$, whose respective shape energies $J_r$, $J_s$ are selected so that:

$$ \lvert J_r - J_s \rvert \ge \lvert J_u - J_v \rvert \quad \text{for all } f_u, f_v \in G_i \tag{10} $$

All the remaining nodes of $G_i$ are then grouped around the two frames $f_r$, $f_s$ by minimizing (9). The frames $f_r$, $f_s$ summarize (represent), at a more detailed level, all the frames belonging to their groups. The video decomposition procedure is repeated for all tree branches until the tree cannot be decomposed further, i.e., until the tree nodes represent single frames. In our representation, each node n in the tree is characterized by a vector of three integers (a, b, c), where:

    a is the minimum frame index in the group summarized by node n.
    b is the maximum frame index in the group summarized by node n.
    c is the index of the frame represented by node n.

In the sequel, video decomposition is used for transmitting visual information over low-bandwidth networks. Let C(n) denote the set of all children of node n, and let K(t) denote the root node of the tree that corresponds to the key-frame at time instance t. If N is the total number of frames, t the current time counted in frame periods, and B the current bandwidth, then the proposed adaptive algorithm for video transmission is defined as follows:

    for t = 1:N {
        GetBandwidth(B);
        n = K(t);
        while (B^-1 < n(b) - n(a))
            n = { n' in C(n) : t in [a', b'] };   /* a single node */
        if n(c) == t
            transmit frame(n)
        else
            transmit control signal
    }

In the algorithm above, recall that a, b are the minimum/maximum frame indices of the group represented by node n. In addition, B^-1 is the inverse of the bandwidth, and n(a), n(b), n(c) are the a, b, c fields of the node vector. In each time slot the bandwidth is read to determine the new state of the network/client media device. If the bandwidth is insufficient to transmit even the key-frames, some higher-level clustering has to be performed, but in that case significant visual information will be lost.

IV EXPERIMENTAL RESULTS

The applicability of the method has been verified through content-based video decomposition of sign language videos.
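Before turning to the detailed results, the tree construction and transmission loop of Section III can be made concrete with a small Python sketch. It is one possible reading of the procedure, not the authors' implementation: the `Node` class, the choice of the minimum- and maximum-energy frames as the most dissimilar pair, the fallback for groups with constant energy, and the `bandwidth` callback are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Node:
    a: int                      # minimum frame index summarized by the node
    b: int                      # maximum frame index summarized by the node
    c: int                      # index of the frame representing the node
    children: List["Node"] = field(default_factory=list)

def build_tree(J: Dict[int, float], a: int, b: int, c: int) -> Node:
    """Recursively decompose frames a..b (shape energies J) into a binary tree."""
    node = Node(a, b, c)
    if a == b:
        return node                         # single frame: leaf
    frames = range(a, b + 1)
    r = min(frames, key=lambda k: J[k])     # the min/max energy pair has the
    s = max(frames, key=lambda k: J[k])     # largest energy difference in the group
    if r == s:                              # constant energy: fall back to end frames
        r, s = a, b
    r, s = min(r, s), max(r, s)

    def cost(m: int) -> float:
        # Frames r+1..m go with f_r, frames m+1..s-1 go with f_s.
        return (sum(abs(J[k] - J[r]) for k in range(r + 1, m + 1))
                + sum(abs(J[k] - J[s]) for k in range(m + 1, s)))

    m = min(range(r, s), key=cost)          # boundary between the two subgroups
    node.children = [build_tree(J, a, m, r), build_tree(J, m + 1, b, s)]
    return node

def frames_to_send(roots: List[Node], n_frames: int,
                   bandwidth: Callable[[int], float]) -> List[int]:
    """Simulate the transmission loop; bandwidth(t) returns B > 0 for time slot t."""
    sent = []
    for t in range(1, n_frames + 1):
        inv_b = 1.0 / bandwidth(t)
        n = next(root for root in roots if root.a <= t <= root.b)    # K(t)
        while n.children and inv_b < n.b - n.a:
            n = next(ch for ch in n.children if ch.a <= t <= ch.b)
        if n.c == t:
            sent.append(t)      # in all other slots only a control signal is sent
    return sent
```

With toy energies `{1: 0.0, 2: 0.1, 3: 1.0, 4: 0.9}` and key-frame 2, a high bandwidth descends to the leaves and sends every frame, whereas a low bandwidth stays at the root and sends only the key-frame.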
The presented scheme for gesture-based video decomposition is evaluated using video files of sign language. Figure 1 illustrates a shot of the sign for the word "assume". In Figure 1a, one out of every 5 frames is presented for clarity. Figure 1b shows the second derivative of the shape energy and the respective key-frames. As can be observed, the salient parts of the shot are detected as key-frames, i.e., the 8, 3, 39, 60, 89,, 7 frame indices. The remaining frames are grouped together with respect to these key-frames. For example, the first group includes frames 1-10, the second frames 11-31, etc. For each group, the two most dissimilar frames are selected for further decomposition, e.g., the second group, represented by the key-frame 3, is further decomposed into groups represented by frames 6 and 3, which in the sequel are decomposed into groups represented by frames with indices of and 6 and 7 and 3 respectively. This process is repeated until all the tree nodes represent single frames, as illustrated in Figure 2. The frames that are transmitted for time-variant bandwidth are displayed in the same figure. It becomes clear that the frame transmission adapts to the network channel capacity and to variations in its characteristics. It is also evident that the temporal order of the frame sequence, which is important for content such as gestures, is preserved.

V CONCLUSIONS

In this paper we present an adaptive algorithm for gesture video decomposition over low-bandwidth networks. Initially, key-frames are extracted using the time variation of the gesture shape energy. This energy is calculated as the sum of the squared coefficients of the Zernike moments. When bandwidth is reduced, we keep the original timeline but send only a dynamically adapted video summary. The respective frames are obtained by moving appropriately across the hierarchy layers of the constructed tree.
Experimental results on sign language videos indicate that the proposed scheme a) decomposes video based on content characteristics and b) provides adaptive content delivery in time-varying networks.

VI REFERENCES

[1] ISO/IEC JTC1/SC29/WG11 N3156, "MPEG-4 Overview," Doc. N3156, Maui, Hawaii, December 1999.
[2] W. Li, "Overview of Fine Granularity Scalability in MPEG-4 Video Standard," IEEE Trans. on CSVT, Vol. 11, No. 3, Mar. 2001.
[3] "Coding of Audio-Visual Objects, Part-2 Visual, Amendment 4: Streaming Video Profile," ISO/IEC 14496-2/FPDAM4, July 2000.
[4] N. Doulamis, A. Doulamis, Y. Avrithis, K. Ntalianis and S. Kollias, "Efficient summarization of stereoscopic video sequences," IEEE Trans. on CSVT, Vol. 10, No. 4, June 2000.
[5] A. Hanjalic and H. Zhang, "An integrated scheme for automated abstraction based on unsupervised cluster-validity analysis," IEEE Trans. on CSVT, Vol. 9, No. 8, pp. 1280-1289, December 1999.
[6] A. Ekin, A. M. Tekalp, and R. Mehrotra, "Automatic soccer video analysis and summarization," IEEE Trans. on Image Processing, Vol. 12, July 2003.
[7] Tianming Liu, Hong-Jiang Zhang and Feihu Qi, "A novel video key-frame-extraction algorithm based on perceived motion energy model," IEEE Trans. on CSVT, Vol. 13, pp. 1006-1013, Oct. 2003.
[8] Jae-Ho Lee, Gang-Gook Lee, and Whoi-Yul Kim, "Automatic video summarizing tool using MPEG-7 descriptors for personal video recorder," IEEE Trans. on Cons. Elect., Vol. 49, Aug. 2003.
[9] J. R. Smith, "VideoZoom: Spatio-temporal video browser," IEEE Trans. on Multimedia, Vol. 1, No. 2, pp. 157-172, June 1999.
[10] J. R. Smith, V. Castelli and C.-S. Li, "Adaptive storage and retrieval of large compressed images," in Storage & Retrieval for Image and Video Databases VII, M. M. Yeung, B. L. Yeo and C. A. Bouman, Eds., Proc. SPIE, Vol. 3656, Jan. 1999.
[11] ISO/IEC JTC 1/SC 29/WG 11/N3964, N3966, Multimedia Description Schemes (MDS) Group, March 2001, Singapore.
[12] A. Doulamis and N. Doulamis, "Optimal Content-based Video Decomposition for Interactive Video Navigation over IP-based Networks," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 6, June 2004.
[13] R. Mukundan, K. R. Ramakrishnan, Moment Functions in Image Analysis: Theory and Applications, World Scientific, Singapore.

[Figure 1: (a) sampled frames of the "assume" sign; (b) plot of the magnitude of the second derivative versus frame index, with local minima/maxima marked, and the key-frames of the "assume" sign.]

Figure 1. Some summarization results of sign language images. (a) One out of every 5 frames for the "assume" sign. (b) The second derivative, its local maxima/minima, and the key-frames for the "assume" sign.

Figure 2. The binary tree decomposition for the group of frames 11-31 and the frames on the media device for time-variable bandwidth.
