Capturing Interactions in Meetings with Omnidirectional Cameras


Journal of Distance Education Technologies, 3(3), 32-45, July-September 2005

Rainer Stiefelhagen, Universität Karlsruhe (TH), Germany
Xilin Chen, Carnegie Mellon University, USA
Jie Yang, Carnegie Mellon University, USA

ABSTRACT

Human interaction is one of the most important characteristics of meetings. To explore complex human interactions in meetings, we must understand them and their components in detail. In this paper, we present our efforts in capturing human interactions in meetings using omnidirectional cameras. We present algorithms for person tracking, head pose estimation, and face recognition from omnidirectional images. We also discuss an approach for estimating who was talking to whom, based on the tracked head poses of the participants. Finally, we address the problem of activity modeling, based on the moving trajectories of people in a meeting room. We report experimental results to demonstrate the feasibility of the presented technologies and discuss future work.

Keywords: classroom technology; computer-mediated communication; human machine systems; indexing; multimedia

INTRODUCTION

Meetings are an important part of daily life in governments, companies, universities, and other organizations. Most people find it impossible to attend all relevant meetings or to retain all the salient points raised in meetings. In the past few years, many researchers have been attempting to find various ways to facilitate the recording, annotation, and analysis of meetings. Xerox PARC has developed a media-enabled conference room, equipped with cameras and microphones to capture audio-visual content (Chiu et al., 1999). In the NIST Smart Space Lab project, another smart meeting room has been set up (Rosenthal, 2000). At Microsoft Research, some work has been conducted on capturing small group meetings using a ring camera (Rui et al., 2001). The University of California, San Diego has also developed a meeting system equipped with several fixed calibrated cameras, some active cameras, and several omnidirectional cameras; the system is able to track people in the room, recognize their faces, and identify the current speaker (Mikic et al., 2000).
At the Interactive Systems Laboratory of Carnegie Mellon University and at the Universität Karlsruhe, we have been developing technologies for an intelligent meeting room since 1997 (Stiefelhagen et al., 1999; Waibel et al., 1998; Waibel et al., 2003; Yang et al., 1999). Meetings among people are events that encode a large amount of social and communicative information. Decoding such information requires understanding multimedia information from multiple cues. In this research, we focus on how to use visual information to understand human interactions in meetings.

Humans obtain a wealth of information by observing their environment in a meeting. For example, we simultaneously:

- identify people;
- understand what they are doing and why they are doing it;
- see to whom and with whom they are interacting, and what their mutual relationships are;
- infer their social relationships, roles, and styles, as well as their feelings, concerns, and interests;
- follow how they are carrying out tasks over the period of time.

For machine perception, human interactions have to be understood and described at multiple levels and in terms of multiple functionalities and perspectives. Loosely speaking, to fully understand human interactions, a machine must provide answers to all related questions of human interaction: Who was there? Where were the people? Why was something happening? When did it happen? To whom did someone speak, and how did something take place? Such information can then be used, for example, to annotate multimedia protocols of meetings or to obtain better automatic analyses of such meetings.

In this paper, we present our efforts to capture human interaction in a meeting using omnidirectional cameras. We can put an omnidirectional camera on the meeting table and/or mount it on the ceiling. Figure 1 shows an example of omnidirectional cameras in a meeting room. An omnidirectional camera has about a 180-degree viewing angle. From the camera on the meeting table, the system is able to capture the faces of all participants. By further processing the captured faces, the system can obtain information on who is in the meeting, where people are located, and who is looking at whom. From the camera on the ceiling, the system can observe activities in a larger area around the meeting table.

Figure 1. A scene captured from omnidirectional cameras in a meeting room: (a) on the table; (b) on the ceiling
Because it is above the action, the ceiling camera is less prone to problems of occlusion. Although an omnidirectional camera has a limited resolution, we can nevertheless use it to capture much useful information for modeling human interactions in meetings.

The remainder of this paper is organized as follows. We first discuss robust algorithms for tracking meeting participants and their head poses using omnidirectional cameras. Next, we address the questions of who, what (activity analysis), and who is talking to whom in a meeting. The paper then reports experimental results on the described technologies. We conclude the paper with a discussion of future work.

PERSON TRACKING AND IDENTIFICATION

Robust tracking of meeting participants, their head poses, and their identities is essential for a system to provide information on who is in the meeting, where they are located in the room, what they are doing, who is talking to whom, and when interactions happen. In this section, we discuss technologies for tracking people and their head poses, and for recognizing their faces. These use omnidirectional cameras on the meeting table and on the ceiling of the meeting room.

Tracking People from an Omnidirectional Camera Mounted on the Ceiling

An omnidirectional camera mounted on the ceiling is less prone to problems of occlusion than a table- or wall-mounted camera. However, there are still many challenges for robustly tracking people using such a camera. These include the problems of a changing background and of a nonuniform view.

One of the fundamental problems in people tracking systems is to extract all objects in a scene from the background. Background subtraction has been used widely for this task. Many different models have been proposed to characterize the background. These include pixel interval estimation (Haritaoglu et al., 1998), the single Gaussian model (Karmann et al., 1990; Toyama et al., 1999; Wren et al., 1997), and the Gaussian Mixture Model (GMM) (Friedman & Russell, 1997; Grimson et al., 1998). All these models, however, cannot evolve over time.

An omnidirectional camera mounted on the ceiling covers a much larger area than a normal perspective camera does, which makes the scene more complex. In addition, many factors may cause the background to change during a meeting, such as:

1. sudden scene changes, such as lights being turned on or off during a meeting and at the system starting point;
2. slow environment changes caused, for example, by a meeting room with windows at different times of the day;
3. fade-in/fade-out lighting changes caused, for example, by the shadow of a moving object; and
4. problems of partial background updates, because, for example, a chair has been moved during a meeting.
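To make the idea of adaptation concrete, the sketch below shows the simplest adaptive scheme: a selective running-average background. It is a common baseline, not the MRF model used in this work. It assumes grayscale frames stored as NumPy arrays. The function names, the 25-gray-level threshold, and the blending rate `alpha` are illustrative choices. Pixels judged to be foreground are frozen, so a slowly drifting background (factor 2) is absorbed while objects in view are not blended in.

```python
import numpy as np


def foreground_mask(background, frame, threshold=25.0):
    """Flag pixels whose absolute difference from the background exceeds `threshold`."""
    diff = np.abs(frame.astype(np.float64) - background.astype(np.float64))
    return diff > threshold


def update_background(background, frame, mask, alpha=0.05):
    """Selective running average: blend non-foreground pixels toward the frame.

    Pixels where `mask` is True keep their previous background value, so an
    object that is currently in view is not absorbed into the background.
    """
    background = background.astype(np.float64)
    blended = (1.0 - alpha) * background + alpha * frame.astype(np.float64)
    return np.where(mask, background, blended)


if __name__ == "__main__":
    bg = np.full((4, 4), 100.0)      # current background estimate
    frame = np.full((4, 4), 110.0)   # slight global brightening (slow change)
    frame[1, 1] = 200.0              # one bright "object" pixel
    mask = foreground_mask(bg, frame)
    new_bg = update_background(bg, frame, mask, alpha=0.5)
    print(int(mask.sum()), new_bg[0, 0], new_bg[1, 1])  # expected: 1 105.0 100.0
```

A sudden change, such as switching on the lights (factor 1), would push every pixel past the threshold. The whole background would then freeze and never recover. This is one reason a richer, explicitly adaptive model is needed.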
To handle such a dynamic environment, a sophisticated model is needed. A good background model should be capable of handling all or most of the above situations. One solution is to use an adaptive model: instead of assuming that the background is known before tracking, we can build the background gradually by adaptation. From a mathematical point of view, a background can be considered as a field that changes over time, and the Markov Random Field (MRF) is a natural way to describe the evolution of the background. We have proposed using MRF models to represent both foreground and background (Chen & Yang, 2002). The basic assumption that supports this method is that the background is statistically stable. Unlike some of the previous methods, we do not assume that the background is known before the tracking process; the background model can be generated gradually during tracking. We describe the model in detail as follows.

A background can first be regarded as a 2D field with a limited support set that evolves over time, as illustrated in Figure 2.

Figure 2. 2D background grid evolving over time

An image within a sequence of images can be regarded as the background image covered with some objects, with noise added. Suppose the support set of an image is $\Lambda = \{(1,1), (1,2), \ldots, (1,n), (2,1), \ldots, (m,n)\}$, where m and n are the height and width of the image, respectively. The support set of the objects at time t is $\Lambda_t^O$, and the background support set at time t is $\Lambda_t^B = \Lambda \setminus \Lambda_t^O$. Let $B_t$ be the ideal background image at time t, $O_t$ the objects in the 2D image at time t, and $I_t$ the observed image at time t. The relationship among them is given by equation (1):

$$
I_t(X) =
\begin{cases}
B_t(X) + n_t(X), & X \in \Lambda_t^B \\
O_t(X) + n_t(X), & X \in \Lambda_t^O
\end{cases}
\qquad (1)
$$

where $n_t(X)$ is the noise at position X and time t. The visual surveillance problem can therefore be stated as follows. Given an observed image $I_t(X)$, the background $B_{t-1}(X)$, and the object set $\{O_{t-1}^j\}$ at time t-1, how can we obtain the best estimate of the background $B_t(X)$ and the object set $\{O_t^j\}$ at time t? This goal can be achieved by:

$$
\{\hat{B}_t, \{\hat{O}_t^j\}\} = \arg\max_{B_t,\,\{O_t^j\}} P\left(B_t, \{O_t^j\} \mid I_t, B_{t-1}, \{O_{t-1}^j\}\right)
\qquad (2)
$$

Note that the object set at time t can be different from the object set at time t-1. Under the assumption that the objects are independent of the background, we have:

$$
\begin{aligned}
&\arg\max_{B_t,\,\{O_t^j,\Lambda_t^j\}} P\left(B_t, \{O_t^j,\Lambda_t^j\} \mid I_t, B_{t-1}, \{O_{t-1}^j,\Lambda_{t-1}^j\}\right) \\
&\quad = \left\{ \arg\max_{B_t,\,\Lambda_t^B} P\left(B_t, \Lambda_t^B \mid I_t, B_{t-1}, \{O_{t-1}^j,\Lambda_{t-1}^j\}\right),\;
\arg\max_{\{O_t^j,\Lambda_t^j\}} P\left(\{O_t^j,\Lambda_t^j\} \mid I_t, B_{t-1}, \{O_{t-1}^j,\Lambda_{t-1}^j\}\right) \right\}
\end{aligned}
\qquad (3)
$$

The first term is the estimated background, and the second is the estimate of the objects. If we apply a first-order MRF model, the tracking problem in Equation (3) can be formulated as the minimization in Equation (4):

$$
\hat{B}_t = \arg\min_{B_t} \sum_{X \in \Lambda_t^B} \left[ E_N^B(X) + \frac{E_P^B(X,\dot{X})}{T_B} \right],
\qquad
\{\hat{O}_t^j, \hat{\Lambda}_t^j\} = \arg\min_{\{O_t^j,\Lambda_t^j\}} \sum_{X \in \Lambda_t^O} \left[ E_N(X,\dot{X}) + \frac{E_P(X,\dot{X})}{T_O} \right]
\qquad (4)
$$

In Equation (4), $\dot{X}$ is a neighbor of pixel X, and $T_B$ and $T_O$ are update speed factors. The data term $E_N^B(X) = [B_t(X) - I_t(X)]^2$ measures the agreement of the background with the observed image. $E_P^B$ and $E_P$ are the first-order neighborhood energies: they compare $B_t(X)$ with the previous background $B_{t-1}(\dot{X})$ along the x and y directions, normalized by σ, the larger of the two directional deviations. The background neighborhood term applies where both X and $\dot{X}$ lie in $\Lambda_t^B$.

One of the basic assumptions of this model is that the pixel mesh is uniform. Therefore, if we apply this model to tracking people with an omnidirectional camera, an adaptation between the uniform mesh and the omnidirectional view is needed. There are two possible ways to adapt:

- The first is to convert the image into a uniform view, for example by transforming it into a panoramic view. In this case, however, the system would lose some useful information near the center of the image.
- The second is to adapt the model to the image.

Based on the characteristics of the camera we used, we can compensate for the nonlinearity using the mapping shown in Figure 3. Figure 3(a) shows the factor in the X and Y directions, and Figure 3(b) shows the factor in the Z direction, which is the direction of the optical axis. We assume that, in most cases, objects move only in the X and Y directions. The captured scene in the image is located within a circle of radius 2f, where f is the focal length of the paraboloid. An object's dimension in the X and Y directions is maximal when the object is located on the optical axis and vanishes at the circle. An object's dimension in the Z direction is invisible at the center and at the circle, and reaches its maximum at radius 0.869f. We can estimate the change in size of each object using such a match factor map.

Figure 3. An illustration of the horizontal and vertical match factor maps: (a) factor in the X-Y direction; (b) factor in the Z direction

Simultaneous Head Pose Tracking

To track the head poses of meeting participants simultaneously, we use an omnidirectional camera to capture the scene around a meeting table. In the panoramic view of the meeting scene (see Figure 4), we detect the participants' faces by searching for skin-colored regions, and we use some heuristics to distinguish skin-colored hands from faces (Stiefelhagen et al., 2000). For each detected participant, a rectified (perspective) view is computed (see Figure 5). Faces extracted from these views are then used to estimate each participant's head pose. We use neural networks to estimate head pan and tilt from such facial images (Stiefelhagen et al., 2000).
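A panoramic view such as the one in Figure 4 is typically obtained by unwarping the circular omnidirectional image. The sketch below shows a minimal nearest-neighbor version of that step, and it rests on several assumptions. The mirror center `center` and the usable ring between `r_inner` and `r_outer` are taken as known, for example from calibration. The ring is sampled along straight radial lines, ignoring the mirror's nonlinear radial profile. The function name and parameters are hypothetical, so this illustrates the idea rather than the rectification used in this system.

```python
import numpy as np


def unwarp_to_panorama(omni, center, r_inner, r_outer, out_width=360, out_height=None):
    """Resample a circular omnidirectional image into a panoramic strip.

    Each output column corresponds to one viewing direction (angle) and each
    row to one radius, from r_outer (top row) to r_inner (bottom row).
    Uses nearest-neighbor sampling; coordinates are clamped to the image border.
    Works for grayscale (H, W) and color (H, W, C) arrays.
    """
    cx, cy = center
    if out_height is None:
        out_height = int(round(r_outer - r_inner)) + 1   # about one row per pixel of radius
    thetas = np.linspace(0.0, 2.0 * np.pi, out_width, endpoint=False)
    radii = np.linspace(r_outer, r_inner, out_height)
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")   # both (out_height, out_width)
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, omni.shape[1] - 1)
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, omni.shape[0] - 1)
    return omni[ys, xs]
```

For example, a single bright pixel 30 pixels to the right of the center lands in column 0 (angle 0), at the row whose radius is 30. Whether the outer ring should map to the top or the bottom of the strip depends on how the camera is mounted. That orientation is a free choice in this sketch.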
In our approach, preprocessed facial images are used as input to the neural networks, and the networks are trained to estimate the horizontal (pan) or vertical (tilt) head orientation of the input images. Separate networks were trained to estimate head pan and tilt. These networks contained one hidden layer and one output unit that encodes the head orientation in degrees. By training multi-user networks on images from 1 users, we achieved average estimation errors as low as three degrees for pan and tilt. On images from new users, head orientation could be estimated with an average error of 10 degrees for pan and tilt. More details can be found in Stiefelhagen et al. (2000).

Figure 4. Panoramic view of a meeting scene

Figure 5. Perspective views of three participants

Face Recognition from a Panoramic View

Face recognition has been an active research area in the last two decades. A good overview of the progress in this area can be found in review papers (Chellappa et al., 1995; Samal & Iyengar, 1992) and in the proceedings of the last six International Conferences on Automatic Face and Gesture Recognition. A major challenge for face recognition from a panoramic view is the difficulty of face alignment caused by the low resolution of the images. Holistic template matching approaches, such as Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), commonly use facial features such as the eyes to align faces. In the panoramic view obtained from the omnidirectional camera that we used for meeting capture (640x480 pixels), we cannot robustly detect facial features for aligning a face. Therefore, we employ a detection-based method for face alignment.

We detect faces using a PCA-based method at different scales in the panoramic image. To train the face subspace, we use 400 face images that were cropped from some training sequences. Figure 6 depicts the first 4 eigenfaces of the subspace. In the detection process, we project the candidate area onto the face space and use these projection values to reconstruct a new image. We then measure the distance between the reconstructed image and the original one; for a non-face image, this distance is larger. Figure 7 shows an example of a face and a non-face and their reconstructions. To obtain an accurate position of a face, we use a two-step method. In the first step, we determine a rough position of the face. In the second step, we search at a sub-pixel level to obtain the optimal position and orientation.

Continuously identifying people in a meeting room is a challenging task (Yang et al., 1999). We have previously developed technologies for face recognition in a meeting room (Gross et al., 2000a, 2000b; Yang et al., 1999, 2000).
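The PCA-based detection step described above has three parts: project a candidate window onto the face space, reconstruct it, and measure the distance to the original. It can be sketched with NumPy as follows. This is an illustration under simplifying assumptions rather than the system's detector. It uses a single scale and an exhaustive integer-pixel search instead of the two-step coarse-to-sub-pixel search. It uses Euclidean reconstruction error as the distance. The function names are hypothetical.

```python
import numpy as np


def train_face_subspace(faces, k):
    """Learn a k-dimensional face space from vectorized face crops.

    faces: array of shape (num_faces, num_pixels). Returns the mean face and
    the top-k principal directions ("eigenfaces") as rows of a (k, num_pixels) array.
    """
    faces = np.asarray(faces, dtype=np.float64)
    mean = faces.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, vt[:k]


def reconstruction_error(patch, mean, eigenfaces):
    """Distance between a patch and its reconstruction from the face space."""
    x = np.asarray(patch, dtype=np.float64).ravel() - mean
    coeffs = eigenfaces @ x                  # projection values
    reconstruction = eigenfaces.T @ coeffs   # image rebuilt from those values
    return float(np.linalg.norm(x - reconstruction))


def detect_face(image, mean, eigenfaces, window):
    """Slide a window over `image`; return (best_error, (row, col)) of the most face-like patch."""
    h, w = window
    best_err, best_pos = np.inf, None
    for r in range(image.shape[0] - h + 1):
        for c in range(image.shape[1] - w + 1):
            err = reconstruction_error(image[r:r + h, c:c + w], mean, eigenfaces)
            if err < best_err:
                best_err, best_pos = err, (r, c)
    return best_err, best_pos
```

A patch that lies in the learned subspace reconstructs almost perfectly, while an arbitrary non-face patch leaves a large residual. In practice, the search would be repeated over several scales and refined below pixel level. A threshold on the error would then separate faces from non-faces.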
Under a constrained scenario with only a few people around a meeting table, PCA-based methods (also called the Eigenface approach; Turk & Pentland, 1991) can perform well, even with low-resolution images captured with an omnidirectional camera. In our approach, we have used PCA both for face alignment and for face recognition. We present the evaluation results in the next section.

Figure 6. The first 4 eigenfaces for face detection

Figure 7. Face, non-face, and their reconstructions (the first row shows the original images; the second row shows the reconstructed ones)

TRACKING AND MODELING INTERACTION

In the previous sections, we have discussed technologies that can help answer questions such as Who is in the meeting? (person identification), Where are they in the room? (person locating and tracking), and Where did someone look? (focus of attention). Given answers to (or better, hypotheses about) these basic questions, it is possible to speculate about meeting actions and interactions, and the ways in which they are performed.

From Head Pose to Focus of Attention

Knowing who is talking to whom is an important cue, both for the understanding and indexing of meetings and for videoconferencing applications. In our research, we have addressed the problem of tracking who is looking at whom during meetings. Two factors contribute to where a person looks: head orientation and eye orientation. In this work, head orientation is considered a sufficient cue to detect a person's direction of attention. The relevant psychological literature offers a number of convincing arguments for this approach (Argyle & Cook, 1976; von Cranach, 1971; Emery, 2000), and its feasibility has previously been demonstrated experimentally (Stiefelhagen, 2002; Stiefelhagen & Zhu, 2002).

Our approach to tracking at whom participants look (i.e., their focus of attention) is the following:

(1) detect all participants in the scene;
(2) estimate each participant's head orientation; and
(3) map each estimated head orientation to its likely targets, using a probabilistic framework.

We have developed a Bayesian approach to estimate at which target a person is looking, based on the person's observed head orientation (Stiefelhagen et al., 2001, 2002).
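Step (3), mapping a head pan to a likely target, can be sketched as follows, using the unsupervised scheme detailed in the text that follows. First, EM fits a one-dimensional Gaussian mixture to a participant's pan observations (Equation 6). Then Bayes' rule (Equation 5) is applied, with the fitted components as class-conditionals and the mixture weights as focus priors. In the paper, the participants' relative positions are used to assign components to targets. This sketch instead assumes each component is initialized at an expected target direction (`init_means`), so that component j stays associated with target j. The function names, the initial width, and the variance floor are illustrative.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Univariate normal density, broadcasting over x, mu, and sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def fit_pan_mixture(pans, init_means, n_iter=100, init_sigma=15.0, min_sigma=1.0):
    """EM for a 1-D Gaussian mixture over head-pan observations (degrees).

    Component j starts at init_means[j]. Returns (priors, means, std devs).
    """
    x = np.asarray(pans, dtype=np.float64)[:, None]          # (n, 1)
    mu = np.asarray(init_means, dtype=np.float64).copy()     # (M,)
    sigma = np.full(mu.shape, init_sigma)
    prior = np.full(mu.shape, 1.0 / mu.size)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        weighted = prior * gaussian_pdf(x, mu, sigma)        # (n, M)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and (floored) standard deviations.
        nk = resp.sum(axis=0)
        prior = nk / x.shape[0]
        mu = (resp * x).sum(axis=0) / nk
        var = (resp * (x - mu) ** 2).sum(axis=0) / nk
        sigma = np.sqrt(np.maximum(var, min_sigma ** 2))
    return prior, mu, sigma


def focus_posterior(pan, prior, mu, sigma):
    """Bayes' rule: P(Focus = target j | pan) for every target j."""
    joint = prior * gaussian_pdf(pan, mu, sigma)
    return joint / joint.sum()   # joint.sum() is the mixture density p(x)
```

Here the normalizer p(x) is taken from the fitted mixture rather than from a histogram, which keeps the posteriors summing to one. With three other participants, `init_means` might be rough pan angles toward the left, opposite, and right neighbors. The arg max of the posterior vector is then the frame's focus target.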
More precisely, we wish to find $P(\mathrm{Focus}_s = T \mid x_s)$, the probability that a subject s is looking toward a certain target person T, given the subject's observed horizontal head orientation $x_s$, which is the output of the neural network for head pan estimation. Using Bayes' formula, this can be decomposed into

$$
P(\mathrm{Focus}_s = T \mid x_s) = \frac{p(x_s \mid \mathrm{Focus}_s = T)\, P(\mathrm{Focus}_s = T)}{p(x_s)},
\qquad (5)
$$

where $x_s$ denotes the head pan of person s in degrees and T is one of the other persons around the table. To compute $P(\mathrm{Focus}_s \mid x_s)$, it is necessary to learn three quantities for each person: the class-conditional probability density function $p(x_s \mid \mathrm{Focus}_s = T)$, the class prior $P(\mathrm{Focus}_s = T)$, and $p(x_s)$. Finding $p(x_s)$ is trivial and can be done by building a histogram of the observed head orientations of a person over time.

We have developed an unsupervised learning approach to find the class-conditional head pan distributions of each participant. In our approach, we assume that the class-conditional head pan distributions can be modeled as Gaussian distributions. The distribution p(x) of all head pan observations from a person is then a mixture of Gaussians,

$$
p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j),
\qquad (6)
$$

where the individual component densities $p(x \mid j)$ are Gaussian distributions. The number of Gaussians M is set to the number of other participants detected around the table. The parameters of the mixture model can be adapted to maximize the likelihood of the pan observations given the mixture model. This can be done using the EM algorithm (for further details, see Stiefelhagen et al. [2001]).

After adaptation of the mixture model (6), we use the resulting individual Gaussian components as an approximation of the class-conditionals of our focus of attention model described in equation (5). We furthermore use the priors of the mixture model as the focus priors. To assign the individual Gaussian components and the priors to their corresponding target persons, we use the relative positions of the participants around the table. Figure 8 depicts the adaptation result for one user. On the left side, the true class-conditional head pan distributions are depicted, together with the learned class-conditionals. On the right side, the resulting learned posterior distributions are shown.

Activity and Scene Modeling from Moving Trajectories

Although an omnidirectional camera has a limited resolution, we can use it to analyze some simple activities and interactions.
The basic idea is to analyze human activities and interactions using moving trajectories. We define a hierarchical behavior model. The lowest level of the model contains essential information, such as moving, stopping, sitting, or standing, which can be observed from the tracking sequence. At a higher level, we can distinguish different activities, such as working alone or having a meeting, which cannot be observed directly; instead, they can be inferred from the moving trajectories of people in a scene. For example, we can define a meeting as two or more trajectories that come from the same or different directions and stay in the scene for a period of time.

Figure 8. Learned class-conditional head pan distributions (left) and resulting posteriors (right)

We have tested this idea on a limited dataset. We collected one hour of video data from an omnidirectional camera mounted on the ceiling of our meeting room. We input the video into the people tracking system and obtained the moving trajectories of people. We then used a time-spatial window to analyze individual trajectories. A duration of five seconds was used as the time window, and the trajectory within this time window formed a spatial window. Consecutive clips used overlapping time windows, with an overlap of 2.5 s. If an object stayed at a spot for a period of time, its moving trajectories were accumulated into histograms. These histograms could have different patterns corresponding to different activities, from which we could then infer human activities. Figure 9 shows some patterns for different activities. Each group has four panels:

- top left: the spatial histogram;
- top right and bottom left: horizontal and vertical views of the histogram;
- bottom right: the top view of the trajectory.

Figure 9. Examples of activity patterns: (a) walking-through clip; (b) looking-for-something clip

EXPERIMENTAL RESULTS

We have performed various experiments to evaluate the technologies. We present some experimental results in this section.

Experiments for People Tracking

We tested the system's ability to do three things: initialize the background model with objects in the scene, track and monitor an object whose size and lighting change, and track multiple objects. Figure 10 is an example of background building with an object in the scene. In Figures 10(c)-(e), some black areas within the ellipses are background patches under construction. We dynamically update the background while tracking the object and obtain a complete background, as shown in Figure 10(f). Figures 10(a) and (b) are the original sequences corresponding to Figures 10(c) and (f), respectively.

Figure 10. Background evolution over time, panels (a)-(f)

Figure 11 is an example of multi-object tracking. Only the top-left image is shown together with the background; the others show only the moving objects. Although one of the objects passes through the blind zone, as depicted in frame 155, the system tracks it continuously. In this example, we track both stable and moving objects at the same time. It can be seen that the size and lighting of the moving object in the tracking sequence change with its position.

Figure 11. Example of multi-object tracking (frames 135, 151, 155, and 159)

Experiments on Face Recognition

In our experiments, we measure the average response time for detection and the recognition accuracy. We define the response time as the time from when a participant sits on his or her chair to the time when his or her face is detected. The average response time is 1.5 s, based on 3 participants' sequences. Because people in a meeting room can be tracked, each recognized person's ID can be attached to his or her continuous trajectory. We therefore define the recognition rate as follows:

$$
\gamma = \frac{n}{N},
$$

where n is the number of people correctly recognized within a duration T from the time they sit on their chairs, and N is the total number of participants. We have 10 classes in our experiment, and the recognition results for different time durations are shown in Figure 12.

Figure 12. Recognition accuracy over time (recognition rate from 0 to 1, plotted against durations of 1 s, 5 s, 10 s, 15 s, and 20 s)

Experimental Results for Focus of Attention Tracking

We evaluated our approach on several meetings that we recorded. In each of the meetings, four participants were sitting around a table, discussing a freely chosen topic. Video was captured with a panoramic camera, and audio was recorded using several microphones. In each frame, we manually labeled at whom each participant was looking. These labels could be one of Left, Right, or Straight, meaning that a person was looking at the person to his left, the person to his right, or the person on the opposite side. If the person was not looking at one of these targets (i.e., the person was looking down at the table or staring up at the ceiling), the label Other was assigned. In addition, labels indicating whether a person was speaking or not were manually assigned for each participant and each video frame.

Table 1 shows the evaluation results on the four recorded meetings. The table gives the average accuracy over the four participants in each meeting. During the evaluation, the faces of the participants were tracked automatically. Head pan was then computed using the neural network-based system for head orientation estimation. For each of the meeting participants, three distributions were adapted as described in the previous section: the class-conditional head pan distribution p(x|Focus), the class priors P(Focus), and the observation distributions p(x). The posterior probabilities for each person were then computed. In each frame, the target with the highest posterior probability was chosen as the person's focus of attention target.

Table 1. Percentage of correctly assigned focus targets

Meeting     A       B       C       D       Avg.
Accuracy    68.8%   73.4%   79.5%   69.8%   72.9%

For the evaluation, we manually marked frames where a subject's face was occluded or where the face was not correctly tracked. These frames were not used for evaluation. Face occlusion occurred in 1.6% of the captured images. Occlusion sometimes happened when a user covered his or her face, for example with his or her arms or with a coffee mug; sometimes, a face was occluded by one of the posts of the camera. In another 4.2% of the frames, the face was not correctly tracked. We also did not use frames where a subject did not look at one of the other persons at the table.
This happened in 3.8% of the frames. Overall, 8.2% of the frames were not used for evaluation, since at least one of the above conditions applied.

CONCLUSION

We have presented our efforts in capturing interactions in meetings using omnidirectional cameras. We have discussed approaches that answer the questions who (face recognition), what (activity classification), where (person tracking), and to/with whom (focus of attention) in meeting situations, and we have presented some experimental results. In our future work, we aim to improve the robustness of the individual technologies. In addition, we plan to combine omnidirectional cameras, perspective cameras, and actively controlled cameras for the analysis of meetings.

ACKNOWLEDGMENTS

We would like to thank our colleagues in the Interactive Systems Laboratories for their helpful discussions and technical support. This research was partially supported by the European Union within the IST project FAME under Grant No. IST-000-833, the National Science Foundation (USA) under Grant No. IIS-011560, and the Department of Defense (USA) through award number N41756-03-C404. The second author would also like to gratefully acknowledge support from the Natural Science Foundation of China under Contract No. 6033010.

REFERENCES

Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge, MA: Cambridge University Press.
Chellappa, R., Wilson, C.L., & Sirohey, S. (1995). Human and machine recognition of faces: A survey. Proceedings of the IEEE.
Chen, X., & Yang, J. (2002). Towards monitoring human activities using an omnidirectional camera. Proceedings of the ICMI 2002.
Chiu, P., Kapuskar, A., Reitmeier, S., & Wilcox, L. (1999). Meeting capture in a media enriched conference room. Proceedings of the Second International Workshop on Cooperative Buildings (CoBuild '99).
Emery, N. (2000). The eyes have it: The neuroethology, function and evolution of social gaze. Neuroscience and Biobehavioral Reviews, 24, 581-604.
Friedman, N., & Russell, S. (1997). Image segmentation in video sequences: A probabilistic approach. Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence.
Grimson, W.E.L., Stauffer, C., & Romano, R. (1998). Using adaptive tracking to classify and monitor activities in a site. Proceedings of the CVPR 1998.
Gross, R., Yang, J., & Waibel, A. (2000a). Face recognition in a meeting room. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition.
Gross, R., Yang, J., & Waibel, A. (2000b). Growing Gaussian mixture model for pose invariant face recognition. Proceedings of the International Conference on Pattern Recognition, Barcelona, Spain.
Haritaoglu, I., Harwood, D., & Davis, L.S. (1998). W4: A real time system for detection and tracking people and their parts. Proceedings of the International Conference on Automatic Face and Gesture Recognition, Nara, Japan.
Karmann, K.-P., Brandt, A.V., & Gerl, R. (1990). Moving object segmentation based on adaptive reference images. In (Ed.), Signal processing V: Theories and application. Amsterdam, The Netherlands: Elsevier.
Mikic, I., Huang, K., & Trivedi, M. (2000). Activity monitoring and summarization for intelligent environments. Proceedings of the Workshop on Human Motion.
Rosenthal, L., & Stanford, V. (2000). NIST information technology laboratory pervasive computing initiative. Proceedings of the IEEE Ninth International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises.
Rui, Y., Gupta, A., & Cadiz, J.J. (2001). Viewing meetings captured by an omnidirectional camera. Proceedings of the Human Factors in Computing Systems, Seattle, Washington.

44 Journal of Dsance Educaon Technologes, 3(3), 3-45, July-Sepember 005 Samal, A., & Iyengar, P. (199). Auomac recognon and analyss of human faces and facal expressons: A survey. Paern Recognon, 5(1), 65-77. Sefelhagen, R. (00). Trackng focus of aenon n meengs. Proceedngs of he Inernaonal Conference on Mulmodal Inerfaces, Psburgh, Pennsylvana. Sefelhagen, R., Yang, J., & Wabel, A. (1999). Modelng focus of aenon for meeng ndexng. Proceedngs of he ACM Mulmeda 1999. Sefelhagen, R., Yang, J., & Wabel, A. (000). Smulaneous rackng of head poses n a panoramc vew. Proceedngs of he Inernaonal Conference on Paern Recognon. Sefelhagen, R., Yang, J., & Wabel, A. (001). Esmang focus of aenon based on gaze and sound. Proceedngs of he Workshop on Percepve User Inerfaces (PUI 01), Orlando, Florda. Sefelhagen, R., Yang, J., & Wabel, A. (00). Modelng focus of aenon for meeng ndexng based on mulple cues. IEEE Transacons on Neural Neworks, 13(4), 98-938. Sefelhagen, R., & Zhu, J. (00). Head orenaon and gaze drecon n meengs. Proceedngs of he Conference on Human Facors n Compung Sysems (CHI00), Mnneapols, Mnnesoa. Toyama, K., Krumm, J., Brum, B., & Meyers, B. (1999). Wallflower: Prncples and pracce of background manenance. Proceedngs of he 7h Inernaonal Conference on Compuer Vson. Turk, M., & Penland, A. (1991). Egenfaces for recognon. Journal of Cognve Neuroscence, 3(1), 7-86. von Cranach, M. (1971). The role of orenng behavour n human neracon. In A.H. Esser (Ed.), Envronmenal space and behavour. New York: Plenum Press. Wabel, A., Be, M., Fnke, M., & Sefelhagen, R. (1998). Meeng browser: Trackng and summarzng meengs. Proceedngs of he DARPA Broadcas News Transcrpon and Undersandng Workshop, Lansdowne, Vrgna. Wabel, A., e al. (003). SMaRT: The smar meeng room ask a ISL. Proceedngs of he ICASSP 003. Wren, C.R., Azarbayejan, A., Darrell, T., & Penland, A.P. (1997). Pfnder: Realme rackng of human body. IEEE Trans. Paern Anal. Machne Inell., 19, 780-785. Yang, J., e al. (1999). 
Mulmodal people ID for a mulmeda meeng browser. Proceedngs of he ACM Mulmeda 99. Yang, J., Yu, H., & Kunz, W. (000). An effcen LDA algorhm for face recognon. Proceedngs of he Inernaonal Conference on Auomaon, Robocs, and Compuer Vson (ICARCV 000).

Rainer Stiefelhagen received his diploma degree and PhD (Dr-Ing) in computer science from Universität Karlsruhe (TH), Germany (1996 and 2002, respectively). From 1995 to 1996, he was a visiting researcher at the School of Computer Science at Carnegie Mellon University, Pittsburgh, PA. He is currently an assistant professor (wissenschaftlicher Assistent) at the Faculty of Computer Science at Universität Karlsruhe (TH), where he leads research on audio-visual perception of people and multimodal human-computer interaction. His research interests include pattern recognition, vision-based perception of people and their activities, multimodal interfaces, and smart rooms.

Xilin Chen received the BS, MS, and PhD degrees in computer science from Harbin Institute of Technology, China (1988, 1991, and 1994, respectively). He has been a professor at Harbin Institute of Technology since 1999. He was a visiting scholar at Carnegie Mellon University from 2001 to 2004. Dr. Chen has served as a program committee member for several international and national conferences. He has received several awards for his research work, including the National Scientific and Technological Progress Award twice, in 2000 and 2003. His research interests are image processing, pattern recognition, computer vision, and multimodal interfaces.

Jie Yang is currently a senior systems scientist at the School of Computer Science at Carnegie Mellon University. He received a PhD from the University of Akron in 1994 and joined the Interactive Systems Laboratories, where he has been leading research efforts to develop visual tracking and recognition systems for multimodal human-computer interaction. He developed adaptive skin color modeling techniques and demonstrated a software-based real-time face tracking system in 1995. He has been involved in the development of many multimodal systems for both intelligent working spaces and mobile platforms. These include a gaze-based interface, a lipreading system, an image-based multimodal translation agent, multimodal people ID, and automatic sign translation systems. His current research interests include multimodal interfaces, computer vision, and pattern recognition.