ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005


Selected Papers from the ACM Multimedia Conference 2003

This special issue comprises some of the outstanding work originally presented at the ACM Multimedia Conference 2003 (ACM MM 2003). The conference received 255 submissions, of which 43 high-quality papers were accepted for presentation. Of these papers, the Technical Program Chairs invited a dozen authors to submit enhanced versions of their papers to this special issue. These papers went through a rigorous review process, and we are happy to present four truly outstanding papers in this special issue. Due to the highly competitive evaluation process and limited space, many excellent papers could not be accepted for this special issue. However, some of them are being forwarded for consideration as future regular papers in this journal.

The four featured papers in this special issue span research related to (1) multimedia analysis, processing, and retrieval, (2) multimedia networking and systems support, and (3) multimedia tools, end-systems, and applications. The papers are as follows:

• "Real-Time Multidepth Stream Compression" authored by Sang-Uok Kum and Ketan Mayer-Patel,
• "Panoptes: Scalable Low-Power Video Sensor Networking Technologies" authored by Wu-chi Feng, Brian Code, Ed Kaiser, Wu-chang Feng, and Mickael Le Baillif,
• "Semantics and Feature Discovery via Confidence-based Ensemble" authored by Kingshy Goh, Beitao Li, and Edward Y. Chang, and
• "Understanding Performance in Coliseum, an Immersive Videoconferencing System" authored by H. Harlyn Baker, Nina Bhatti, Donald Tanguay, Irwin Sobel, Dan Gelb, Michael E. Goss, W. Bruce Culbertson, and Thomas Malzbender.

We hope the readers of this special issue find these papers truly interesting and representative of some of the best work in the field of multimedia in 2003! The Guest Editors would like to thank the many authors for their hard work in submitting and preparing the papers for this special issue. We would also like to thank the many reviewers for their important feedback and help in selecting outstanding papers in the field of multimedia from 2003. Lastly, we would like to thank Larry Rowe, Chair of ACM Multimedia 2003, and Ramesh Jain, Chair of SIGMM, for their support and guidance in preparing this special issue.

THOMAS PLAGEMANN
PRASHANT SHENOY
JOHN R. SMITH
Guest Editors of the Special Issue and ACM Multimedia 2003 Program Chairs


Real-Time Multidepth Stream Compression

SANG-UOK KUM and KETAN MAYER-PATEL
University of North Carolina

The goal of tele-immersion has long been to enable people at remote locations to share a sense of presence. A tele-immersion system acquires the 3D representation of a collaborator's environment remotely and sends it over the network, where it is rendered in the user's environment. Acquisition, reconstruction, transmission, and rendering all have to be done in real-time to create a sense of presence. With added commodity hardware resources, parallelism can increase the acquisition volume and reconstruction data quality while maintaining real-time performance. However, this is not as easy for rendering, since all of the data need to be combined into a single display. In this article, we present an algorithm to compress data from such 3D environments in real-time to address this imbalance. We present a compression algorithm which scales comparably to the acquisition and reconstruction, reduces network transmission bandwidth, and reduces the rendering requirement for real-time performance. This is achieved by exploiting the coherence in the 3D environment data and removing redundant data in real-time. We have tested the algorithm using a static office data set as well as a dynamic scene, the results of which are presented in the article.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Clustering; I.3.2 [Computer Graphics]: Graphics Systems—Distributed/network graphics; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Virtual reality; I.3.7 [Computer Graphics]: Applications

General Terms: Algorithms

Additional Key Words and Phrases: Real-time compression, tele-immersion, tele-presence, augmented reality, virtual reality, k-means algorithm, k-means initialization

1. INTRODUCTION

Recently, there has been increasing interest in tele-immersion systems that create a sense of presence with distant individuals and situations by providing an interactive 3D rendering of remote environments [Kauff and Schreer 2002; Towles et al. 2002; Baker et al. 2002; Gross et al. 2003]. The 3D tele-immersion research group at the University of North Carolina, Chapel Hill [Office of the Future Project], together with collaborators at the University of Pennsylvania [University of Pennsylvania GRASP Lab], the Pittsburgh Supercomputing Center [Pittsburgh Supercomputing Center], and Advanced Network and Services, Inc. [Advanced Network and Services, Inc.], has been actively developing tele-immersion systems for several years.

The four major components of a tele-immersion system are scene acquisition, 3D reconstruction, transmission, and rendering. Figure 1 shows a block diagram relating these components to each other and the overall system. For effective, interactive operation, these four components must accomplish their tasks in real-time.

This work was supported in part by the Link Fellowship and the National Science Foundation (ANI-0219780, IIS-0121293).
Authors' address: University of North Carolina at Chapel Hill, CB# 3175, Sitterson Hall, Chapel Hill, NC 27599-3175; email: {kumsu,kmp}@cs.unc.edu.


Fig. 1. Tele-immersion system.

Scene acquisition is done using multiple digital cameras and computers. Multiple digital cameras are placed around the scene to be reconstructed. The cameras are calibrated and registered to a single coordinate system called the world coordinate system. The computers are used to control the cameras for synchronized capture and to control 2D image stream transfer to the 3D reconstruction system. Using current commodity hardware, we are able to capture images with a resolution of 640×480 at 15 frames/sec. The 15 frames/sec limit is a result of the gen-lock synchronization mechanism employed by the particular cameras we have; faster capture performance may be achievable using other products.

The 3D reconstruction system receives the captured 2D image streams from the acquisition system and creates a 3D representation of the scene in real-time. The reconstructed 3D scene is represented by depth streams. A depth stream is a video stream augmented with per-pixel depth information in the world coordinate system. Multiple input images are used to create a depth stream. The images are rectified and correspondences between the images are found. Using this correspondence information, disparities at each pixel are computed. The computed disparities and the calibration matrices of the cameras are used to compute the world coordinates of each 3D point. The major bottleneck of the reconstruction is the correspondence search between images, which is computationally expensive. Fortunately, this process can be parallelized to achieve real-time performance, since each depth stream computation is independent of the others.

The acquired remote scene must be transmitted to the rendering system. At 640×480 resolution, each uncompressed depth stream running at 15 frames/sec needs—assuming 3 bytes for color and 2 bytes for depth—about 184 Mbits/sec of network bandwidth. For 10 depth streams, without data compression, the total bandwidth required would be 1.84 Gbits/sec.

Finally, the transmitted depth streams are rendered and displayed in head-tracked passive stereo by the rendering system [Chen et al. 2000]. Since the depth streams are in world coordinates, and thus view-independent, they can be rendered from any new viewpoint. The user's head is tracked to render the depth streams from precisely the user's current viewpoint to provide a sense of presence. At a resolution of 640 × 480, each frame of each depth stream is comprised of approximately 300K 3D points. A system with 10 depth streams would require 90 Mpts/sec rendering performance to achieve 30 frames/sec view-dependent rendering, which is difficult with currently available commodity hardware. Also, rendering is not as easily parallelized as 3D reconstruction since all of the depth streams must be rendered into a single view.
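As a quick check on these figures, the following Python sketch (ours, for illustration only; the constants are the values quoted above) reproduces the per-stream bandwidth and the aggregate rendering load.

```python
# Back-of-the-envelope sketch of the uncompressed transmission and rendering load.
# Values (640x480, 3 bytes color + 2 bytes depth, 15 fps capture, 30 fps rendering,
# 10 streams) come from the text; the helper names are illustrative only.

WIDTH, HEIGHT = 640, 480
BYTES_PER_POINT = 3 + 2          # RGB color + 16-bit depth
CAPTURE_FPS = 15
RENDER_FPS = 30
NUM_STREAMS = 10

def stream_bandwidth_mbps(width=WIDTH, height=HEIGHT,
                          bytes_per_point=BYTES_PER_POINT, fps=CAPTURE_FPS):
    """Bandwidth of one uncompressed depth stream in Mbits/sec."""
    bits_per_frame = width * height * bytes_per_point * 8
    return bits_per_frame * fps / 1e6

def rendering_load_mpts(width=WIDTH, height=HEIGHT,
                        streams=NUM_STREAMS, fps=RENDER_FPS):
    """Points per second the renderer must draw, in millions."""
    return width * height * streams * fps / 1e6

if __name__ == "__main__":
    per_stream = stream_bandwidth_mbps()
    print(f"one depth stream: {per_stream:.0f} Mbits/sec")                         # ~184 Mbits/sec
    print(f"ten depth streams: {per_stream * NUM_STREAMS / 1000:.2f} Gbits/sec")   # ~1.84 Gbits/sec
    print(f"rendering load: {rendering_load_mpts():.0f} Mpts/sec")                 # ~92 Mpts/sec (~90 in the text)
```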


While the scene acquisition and 3D reconstruction processes can be parallelized by adding additional hardware resources, experience with our initial prototypes indicates that rendering performance and transmission bandwidth are likely to remain a bottleneck. Our work concentrates on this possible bottleneck between the reconstruction process and the rendering process. As such, we are not concerned with the 2D image streams captured during acquisition. Instead, we are concerned with the reconstructed 3D depth streams. Each of these depth streams is constructed from a particular viewpoint as if captured by a 3D camera, although no actual 3D camera exists. In the rest of this article, the terms image, stream, camera, and viewpoint all refer to the 3D information produced by the reconstruction process, which includes both color and depth on a per-pixel basis.

One way to alleviate the network and rendering bottleneck is to exploit coherence between the reconstructed depth streams and remove redundant points. Since multiple cameras acquire a common scene, redundant points exist between the reconstructed depth streams. By identifying and removing these redundant points, the total number of points transmitted to the rendering system is reduced, which reduces network bandwidth and rendering demand while maintaining the quality of the reconstruction.

Since the reconstruction process needs to be distributed over many computers in order to achieve real-time performance, each depth stream is created at a different computer. In order to remove redundant points between two depth streams, at least one of the streams must be transmitted to the computer where the other stream resides. Because of this, we must be careful to distinguish between two different network resources that must be managed. The first is internal network bandwidth. This refers to the bandwidth between computers involved in the reconstruction process. We expect these computers to be locally connected, and thus this bandwidth resource is expected to be fairly plentiful (i.e., on the order of 100 Mb/s to 1 Gb/s) but still finite and limited. In managing this resource, we must be careful about how many of the depth streams need to be locally transmitted in order to remove redundant points. The second network resource is external network bandwidth, which refers to the bandwidth available between the reconstruction process and the rendering process. These two processes will not generally be locally connected and will probably traverse the Internet or Internet-2. In this case, bandwidth is expected to be more limited, and the concern is removing as many redundant points as possible in order to reduce the amount of data transmitted to the renderer.

This article presents a modified technique based on our earlier work [Kum et al. 2003] for exploiting coherence between depth streams in order to find and eliminate redundant points. Our contributions include:

—A real-time depth stream compression technique. The Group-Based Real-Time Compression algorithm presented in this article finds and eliminates redundant points between two or more depth streams.
—A depth stream coherence metric. In order to efficiently employ Group-Based Real-Time Compression, we must be able to compute which depth streams are most likely to exhibit strong coherence. We present an efficient algorithm for partitioning depth streams into coherent groups.
—An evaluation of our methods, which shows that we can remove a large majority of redundant points and thereby reduce external bandwidth and rendering requirements while at the same time limiting the amount of internal bandwidth required to match what is locally available.

Furthermore, since each depth stream is compared against at most one other depth stream, real-time performance is achievable.

This article is organized as follows: Section 2 describes background and related work. Section 3 provides an overview of our approach and a comparison with other possible approaches. In Section 4, we present the compression algorithm in detail. Section 5 explains how streams are partitioned into coherent groups. The results are presented in Section 6, and conclusions and future work are in Section 7.


2. BACKGROUND AND RELATED WORK

There have been multiple tele-immersion systems built recently. The VIRTUE system [Kauff and Schreer 2002] uses stereo-based reconstruction for modeling, and the user is tracked for view-dependent rendering. However, the display is not in stereo, which reduces the effect of immersion. The Coliseum [Baker et al. 2002] uses an Image-Based Visual Hulls [Matusik et al. 2000] method for reconstruction and is designed to support a large number of users. However, it uses one server for each participant to handle the rendering for all clients, which increases latency as the number of users increases. As with the VIRTUE system, it is also not displayed in stereo. The blue-c system [Gross et al. 2003] uses a CAVE [Cruz-Neira et al. 1993] environment for rendering and display to create an impression of total immersion. The reconstruction is done using a shape-from-silhouette technique that creates a point-based model.

McMillan and Bishop [1995] proposed using a depth image (i.e., an image with color and depth information) to render a scene from new viewpoints by warping the depth image. One major problem with this method is disocclusion artifacts caused when a portion of the scene not visible in the depth image is visible from the new viewpoint. Using multiple depth images from multiple viewpoints can reduce these disocclusion artifacts. Layered Depth Images (LDI) merge multiple depth images into a single depth image by keeping multiple depth values per pixel [Shade et al. 1998]. However, the fixed resolution of an LDI imposes limits on sampling multiple depth images. An LDI tree, an octree with a single LDI in each node, can be used to overcome this limitation [Chang et al. 1999]. Grossman and Dally [1998] create multiple depth images to model an arbitrary synthetic object. The depth images are divided into 8 × 8 blocks and redundant blocks are removed. QSplat [Rusinkiewicz and Levoy 2000] uses a bounding sphere hierarchy to group 3D scanned points for real-time progressive rendering of large models. Surfels [Pfister et al. 2000] represent objects using a tree of three orthogonal LDIs called a Layered Depth Cube (LDC) tree. All of these approaches handle only static data, for which compression is done only once as a preprocessing step. Therefore, these techniques are not suitable for real-time dynamic environments in which the compression has to be done for every frame.

The video fragments used in the blue-c system [Würmlin et al. 2004] are a point-based representation for dynamic scenes. The approach exploits spatio-temporal coherence by identifying differential fragments in 2D image space and updating the 3D point representation of the scene.

There have also been efforts to develop special scalable hardware for compositing images with depth information [Molnar et al. 1992; Stoll et al. 2001]. The rendering system can be parallelized using this special hardware by connecting each 3D camera to a rendering PC and then compositing all of the rendered images. Unfortunately, these systems are not commonly available and are expensive to build.

3. OVERVIEW AND DESIGN GOALS

This section outlines our design goals for the compression algorithm, examines several possible approaches to the problem, and gives an overview of the modified Group-Based Real-Time Compression algorithm from Kum et al. [2003].

3.1 Design Goals

To ensure a high quality rendering, we will require that the depth stream that most closely matches the user's viewpoint at any given time is not compressed. We will call this depth stream the main stream. All points of the main stream are transmitted to the rendering process. Furthermore, a subset of the depth streams is identified as the set of reference streams. The reference streams form a predictive base for detecting and eliminating redundant points and are distributed among the depth streams.


Fig. 2. Examples of different compression algorithms and their reference stream transfers. The main stream is in bold, and the arrows show the direction of reference stream movement.

Every stream except for the main stream is compared to one or more of the reference streams, and redundant points are eliminated. The result is called a differential stream. These differential streams and the main stream are sent to the rendering system.

Our design goals for the compression algorithm include:

—Real-Time Performance. The compression algorithm needs to be at least as fast as the 3D reconstruction so there is no delay in processing the streams.
—Scalability. The algorithm needs to scale with the number of depth streams, so that as the number of depth streams increases the number of data points does not overwhelm the rendering system.
—Data Reduction. In order to alleviate the rendering bottleneck, the algorithm needs to reduce the number of data points by eliminating as many redundant points as possible.
—Tunable Network Bandwidth. Distributing reference streams to the reconstruction processes will require additional network bandwidth. The algorithm should be tunable to limit the network bandwidth used even as the total number of depth streams increases.

3.2 General Approaches

Given the restrictions and design goals outlined above, there are a number of general approaches that may be incorporated into our solution.

3.2.1 Stream Independent Temporal Compression. One possible approach is to compress each stream independently using temporal coherence. With such an approach, each stream acts as its own reference stream. Exploiting temporal coherence for traditional video types is known to result in good compression for real-time applications. This compression scheme scales well and requires no additional network bandwidth since there is no need to communicate reference streams among the reconstruction processes. However, this compression scheme does not reduce the number of data points that the renderer must render each frame. The renderer must render all redundant points from the previous frame with the nonredundant points of the current frame.

3.2.2 Best-Interstream Compression. The best possible interstream compression would be to remove all redundant points from all streams by using every stream as a possible reference stream. This could be accomplished in the following way. The first stream sends all of its data points to the rendering system and to all other reconstruction processes as a reference stream. The second stream uses the first stream as a reference stream, creating a differential stream which it also distributes to the other reconstruction processes as a reference stream. The third stream receives the first two streams as reference streams in order to create its differential stream, and so on, continuing until the last stream uses all other streams as reference streams (Figure 2(a)). This is the best possible interstream compression since it has no redundant points. The drawbacks to this approach, however, are severe.


Most streams in this approach require multiple reference streams, with at least one stream using all other streams as references. This dramatically increases computation requirements and makes realizing a real-time implementation very difficult. Also, the number of reference streams broadcast is dependent on the number of streams. Thus, the network bandwidth required will increase as the number of streams increases, limiting scalability of the 3D cameras.

3.2.3 Single Reference Stream Compression. Another approach is to use the main stream as the reference stream for all other streams (Figure 2(b)). This does not require additional network bandwidth as more streams are added since there is always only one reference stream. Real-time operation is feasible since all other streams are compared against only one reference stream. A main disadvantage of this approach is possibly poor data compression. The coherence between the main stream and the depth streams that use it as a reference stream will diminish as the viewpoints of the streams diverge. Furthermore, the depth streams from two nearby viewpoints may contain redundant points which are not removed by using the main stream as the only reference.

3.2.4 Nearest Neighbors as Reference Stream Compression. Another approach is for each depth stream to select the closest neighboring depth stream as the reference stream to achieve better compression. The streams can be linearly sorted such that neighboring streams in the list have viewpoints that are close to each other. From this sorted list of streams, the streams left of the main stream use the right neighboring stream as their reference stream, and the streams right of the main stream use the left neighboring stream as their reference stream (Figure 2(c)). With this scheme, every stream has one reference stream regardless of the total number of streams. The compression rate depends on the number of points that appear in nonneighboring streams but not in neighboring streams, since these points will be redundant in the final result. Since the streams are sorted by viewpoint, the number of such points should be small, which makes the compression comparable to the previously mentioned Best-Interstream Compression method. However, the network bandwidth demand for this compression scheme is high. For n streams there are generally n − 2 reference streams to distribute, again limiting scalability of 3D cameras.

3.3 Overview of Group-Based Real-Time Compression

The Group-Based Real-Time Compression tries to balance compression efficiency and network bandwidth requirements by limiting the number of reference streams to a configurable limit and grouping streams together based on which of these streams serves as the best reference stream to use. All streams are divided into groups such that each stream is part of only one group. Each group has a center stream that is a representative of the group and substreams (i.e., all other streams in the group). Stream partitioning and center stream selection is done as a preprocessing step since the acquisition cameras do not move. The main stream and the center streams comprise the set of possible reference streams. Thus the number of reference streams distributed equals the number of groups created plus one—the main stream. A differential stream is created for each stream using the reference stream that will most likely yield the best compression.
Since the number of reference streams is limited to the number of groups, new streams can be added without increasing the reference stream network traffic as long as the number of groups remains the same. Because the number of groups is a configurable system parameter, the amount of network traffic generated by distributing reference streams can be engineered to match the available network bandwidth. Also, each stream only uses one reference stream to create its differential frame, which makes real-time operation feasible. The difference between the algorithm presented in this article and the one from our earlier work [Kum et al. 2003] is that the center streams use the closest neighboring center stream as the reference stream, not the center stream of the main stream's group. This results in better compression for the center streams since, most of the time, closer streams have


more redundant points. The compression algorithm is described in more detail in Section 4. Section 5 details how streams are partitioned into groups and how the center stream for each group is selected.

4. STREAM COMPRESSION

This section details how depth streams are compressed in real-time. First, we detail how reference streams are selected for each stream, and then discuss how these streams are compressed using the selected reference stream.

4.1 Reference Stream Selection

In the Group-Based Real-Time Compression, all depth streams are partitioned into disjoint groups. The number of groups created is determined by the network bandwidth. Each group has a center stream, which best represents the group, and substreams—depth streams in a group that are not the center stream. Furthermore, one stream is selected as the main stream, for which no compression is done. The depth stream with the viewpoint at the shortest Euclidean distance to the user is chosen as the main stream since it best represents the user's viewpoint. The group containing the main stream is called the main group, and all other groups are referred to as subgroups. Once the main stream has been selected, the reference stream for each stream is selected as follows:

—For the main stream, no reference stream is needed.
—For the center stream of the main group, the main stream is used as the reference stream.
—For the center streams of the subgroups, the nearest center stream is used as the reference stream. The center streams can be linearly sorted such that neighboring center streams in the list have viewpoints that are close to each other. From this sorted list of center streams, the center streams left of the main stream use the right neighboring center stream as their reference stream, and the center streams right of the main stream use the left neighboring center stream as their reference stream. This differs from the algorithm given in Kum et al. [2003] since it does not use the center stream of the main group as the reference stream. It also compresses better than Kum et al. [2003] because two neighboring streams usually have more redundant points than two non-neighboring streams.
—For any other substream, the center stream of its group is used as the reference stream.

Figure 3 shows an example with 12 streams and 4 groups. Stream 5 is the main stream, which makes Group 2 the main group. Streams 1, 4, 7, and 10 are the center streams for their groups, and are numbered in sequential order. Since Stream 5 is the main stream, it does not have any reference stream. Stream 4 is the center stream of the main group and uses the main stream (Stream 5) as its reference stream. The center streams of the subgroups—Streams 1, 7, and 10—use the nearest center stream as the reference stream: Streams 1 and 7 use Stream 4, and Stream 10 uses Stream 7. All substreams use their group's center stream as their reference stream—Streams 2 and 3 of Group 1 use Stream 1, Stream 6 of Group 2 uses Stream 4, Streams 8 and 9 of Group 3 use Stream 7, and Streams 11 and 12 of Group 4 use Stream 10. The arrows show the direction of the reference stream distribution.
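The selection rules above are mechanical enough to write down directly. The Python sketch below is our own illustration of them, not code from the system; the group lists, center list, and function name are assumptions, and the example reproduces the Figure 3 configuration of 12 streams in 4 groups.

```python
def select_reference_streams(groups, centers, main_stream):
    """Return {stream_id: reference_stream_id or None} following the rules of Section 4.1.

    groups      -- one list of stream ids per group, groups ordered by viewpoint
    centers     -- the center stream of each group, in the same order as groups
    main_stream -- the uncompressed stream closest to the user's viewpoint
    """
    main_group = next(g for g, members in enumerate(groups) if main_stream in members)
    refs = {}
    for g, members in enumerate(groups):
        center = centers[g]
        if center == main_stream:
            refs[center] = None                 # the main stream is never compressed
        elif g == main_group:
            refs[center] = main_stream          # center of the main group uses the main stream
        elif g < main_group:
            refs[center] = centers[g + 1]       # centers left of the main group look right
        else:
            refs[center] = centers[g - 1]       # centers right of the main group look left
        for s in members:                       # substreams use their group's center
            if s == main_stream:
                refs[s] = None
            elif s != center:
                refs[s] = center
    return refs

# Figure 3: 12 streams in 4 groups, main stream 5, centers 1, 4, 7, 10.
groups  = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
centers = [1, 4, 7, 10]
print(select_reference_streams(groups, centers, main_stream=5))
# Stream 5 -> None, 4 -> 5, 1 -> 4, 7 -> 4, 10 -> 7, and each substream -> its group's center.
```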


Fig. 3. An example of reference stream distribution for Group-Based Real-Time Compression.

4.2 Differential Stream Construction

To construct a differential stream, the data points of a depth stream are compared to the data points within the reference stream. Points that are within some given distance threshold are removed from the depth stream.

The format of the differential stream is different from the original stream format. The original stream has five bytes, three bytes for color and two bytes for depth, for each data point. The differential stream has five bytes for only the nonredundant points (i.e., points not removed) and a bitmask to indicate which points have been retained and which points have been eliminated. If the bit value is '0', then the data point represented by the bit is a redundant point and is removed. If the bit value is '1', the corresponding point is included. The order of data for nonredundant points is the same as the order in which they appear in the bitmask. This format reduces the size of a frame in the differential stream by 39 bits, five bytes minus one bit, for each redundant point and adds one bit for each nonredundant point. So for a depth stream of 640 × 480 resolution with a 5 to 1 redundancy ratio (i.e., 80% of data points are deemed redundant), the size of a frame for the stream is reduced from 1.536 MB to 346 KB—approximately 5 to 1.
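The following Python sketch is a plausible rendition of this frame format, written by us for illustration only—the exact byte layout of the system is not specified here. It packs the retained-point bitmask followed by the five-byte records and reproduces the size arithmetic above.

```python
# Sketch of the differential-frame encoding described in Section 4.2.
# The byte order is an assumption; only the bitmask-plus-payload idea and the
# size arithmetic come from the text.

WIDTH, HEIGHT = 640, 480
BYTES_PER_POINT = 5  # 3 bytes color + 2 bytes depth

def encode_differential_frame(points, redundant):
    """points: list of 5-byte records, one per pixel in scan order.
    redundant: list of bools, True if the point was found in the reference stream."""
    bitmask = bytearray((len(points) + 7) // 8)
    payload = bytearray()
    for i, (record, is_redundant) in enumerate(zip(points, redundant)):
        if not is_redundant:
            bitmask[i // 8] |= 1 << (7 - i % 8)   # '1' = point retained
            payload += record                      # retained points keep their 5 bytes
    return bytes(bitmask) + bytes(payload)

def frame_size_bytes(num_points=WIDTH * HEIGHT, redundancy=0.8):
    """Size of one encoded differential frame, given a redundancy ratio."""
    bitmask = num_points / 8
    payload = num_points * (1 - redundancy) * BYTES_PER_POINT
    return bitmask + payload

if __name__ == "__main__":
    raw = WIDTH * HEIGHT * BYTES_PER_POINT           # 1,536,000 bytes = 1.536 MB
    diff = frame_size_bytes()                        # 345,600 bytes ~= 346 KB
    print(f"uncompressed frame: {raw/1e6:.3f} MB, differential frame: {diff/1e3:.0f} KB")
```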

5. STREAM PARTITION

In this section, we present an algorithm for stream partitioning and center stream selection. As discussed in Section 4, the streams need to be partitioned into groups and the center stream of each group selected before runtime. Since reference stream selection is dependent on this stream partitioning process, it also affects stream compression efficiency. Therefore, streams should be partitioned into groups such that the most redundant points are removed. In Section 5.1, we present effective criteria to partition n streams into k groups and to find the appropriate center stream in each group. We show how these metrics can be used to partition the streams and to select center streams in Section 5.2. In Section 5.3, the metrics are used to develop an efficient approximate algorithm for stream partitioning and center stream selection when n is too large for an exhaustive approach.

5.1 Coherence Metrics

Stream partitioning and selection of center streams have a direct impact on compression since all substreams of a group use the center stream of the group as the reference stream. Therefore, the partitioning should ensure that each stream belongs to a group where the volume overlap between the stream and the group center stream is maximized. However, exact calculation of the volume overlap between two streams is expensive.


Fig. 4. Percentage of redundant points of a stream in the reference stream vs. the angle between the two streams. The streams are from the 3D camera configurations of Figure 9.

Thus, in this article, we use the angle between the view directions of two depth streams as an approximation of the overlapped volume. Empirically, the angle between the view directions of two streams is a good estimate of how much the two stream volumes overlap. The smaller the angle, the bigger the overlap. This is shown in Figure 4.

The local squared angle sum (LSAS) is defined for stream S_i as the sum of the squared angles between stream S_i and all other streams in its group (Eq. (1)). This is used as the center stream selection criterion. The stream with the lowest LSAS of the group is chosen to be the center stream.

    LSAS_i = \sum_{j=1}^{n_k} [\mathrm{angle}(S_i, S_j)]^2    (1)

where streams S_i and S_j are in group k, and n_k is the number of streams in group k.

The group squared angle sum (GSAS), defined for a given group, is the sum of the squared angles between the group's center stream and every substream in the group (Eq. (2)). This is used as the partitioning criterion for partitioning n streams into k groups. The sum of all GSASs for a particular partition (Eq. (3)) is defined as the total squared angle sum (TSAS). We are seeking the partition that minimizes TSAS.

    GSAS_j = \sum_{i=1}^{n_j} [\mathrm{angle}(C_j, S_{ji})]^2    (2)

where C_j is the center stream in group j, S_{ji} is a substream in group j, and n_j is the number of substreams in group j.

    TSAS = \sum_{i=1}^{k} GSAS_i    (3)

where k is the number of groups.

Finally, the central squared angle sum (CSAS) is defined as the sum of the squared angles between all center streams (Eq. (4)). The streams should be partitioned such that CSAS is also minimal, since all center streams use each other as references. However, it should be noted that minimizing TSAS is


much more important than minimizing CSAS, since TSAS affects compression much more than CSAS.

    CSAS = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} [\mathrm{angle}(C_i, C_j)]^2    (4)

where C_i and C_j are the center streams for groups i and j, and k is the number of groups.

The following sections present how these criteria are used to partition the streams into groups and select center streams for each group.

5.2 Exhaustive Partition

One way to partition n streams into k groups is an exhaustive method where all possible grouping combinations are tested. First, k streams are selected from the n streams. The selected streams are chosen as center streams, and all other streams are assigned to the group with which the absolute angle between the stream and the group's center stream is the smallest. This is done for all possible combinations of selecting k streams from n streams—a total of nCk. For each stream partitioning, the TSAS is calculated, and the stream partitioning with the lowest TSAS is the partitioning solution. If there are multiple stream partitions with the same TSAS, the stream partition with the lowest CSAS is chosen as the solution. Unless n is small, this method is not practical.

5.3 Approximate Partition

The k-means framework [Jain et al. 1999], originally developed as a clustering algorithm, can be used to partition the streams for an approximate solution. The k-means framework is used to partition n data points into k disjoint subsets such that a criterion is optimized. It is a robust and fast iterative method that finds locally optimal solutions for the given criteria. It is done in the following three steps.

—Initialization: Initial centers for the k partitions are chosen.
—Assignment: All data points are placed in the partition with the center that best satisfies the given criteria. Usually the criteria are given as a relationship between a data point and the center.
—Centering: For each partition, the cluster centers are reassigned to optimize the criteria.

The assignment and centering steps are repeated until the cluster centers do not change or an error threshold is reached.

5.3.1 Iterative Solution for Approximate Partitioning. Given initial center streams, the criterion used to assign each substream to a group is the absolute angle between the stream and the group's center stream. The substreams are assigned such that this absolute angle is the smallest. This gives groupings where the TSAS is minimized. After every substream has been assigned to a group, each group recalculates its center stream: the stream with the lowest LSAS is the new center stream of the group. The process of grouping and finding the center stream is repeated until the center streams converge and do not change between iterations.
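A minimal sketch of this assignment and centering iteration, written by us under the assumption that view directions are given as unit vectors, is shown below; the angle, LSAS, and TSAS helpers follow Eqs. (1)–(3).

```python
import math

def angle(u, v):
    """Unsigned angle in degrees between two unit view-direction vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(dot))

def lsas(i, members, dirs):
    """Eq. (1): sum of squared angles between stream i and the other streams in its group."""
    return sum(angle(dirs[i], dirs[j]) ** 2 for j in members if j != i)

def partition(dirs, centers, max_iters=100):
    """Iterative partition of Section 5.3.1: assign to the nearest center, then
    re-center each group by lowest LSAS, until the centers stop changing.
    dirs is a list of unit view-direction vectors indexed by stream id."""
    centers = list(centers)
    for _ in range(max_iters):
        groups = [[] for _ in centers]
        for s in range(len(dirs)):
            g = min(range(len(centers)), key=lambda g: angle(dirs[s], dirs[centers[g]]))
            groups[g].append(s)
        new_centers = [min(members, key=lambda s: lsas(s, members, dirs)) if members else centers[g]
                       for g, members in enumerate(groups)]
        if new_centers == centers:          # converged: centers did not change
            break
        centers = new_centers
    return groups, centers

def tsas(groups, centers, dirs):
    """Eq. (3): sum over groups of squared angles between each substream and its center."""
    return sum(angle(dirs[c], dirs[s]) ** 2
               for members, c in zip(groups, centers) for s in members if s != c)
```

In the full method, this iteration is run from each good starting condition produced in Section 5.3.2, and the ending condition with the lowest TSAS (ties broken by CSAS) is kept.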


Table I. Stream Partition (Stream viewpoints for (a) were generated randomly; (b) is the stream configurations used in the static office scene (Figure 9); (c) is the camera configuration for the dynamic scene (Figure 10(a)).)

(a) Randomly placed streams
                                       n = 10, k = 5    n = 22, k = 5        n = 25, k = 5        n = 60, k = 5
# of possible initial center streams   252              26334                53130                5461512
# of created initial center streams    144              240                  529                  352
Total # of possible center streams     46               10443                25907                5170194
Optimal center streams                 {2, 3, 6, 8, 9}  {4, 15, 16, 17, 20}  {3, 11, 12, 13, 17}  {2, 8, 22, 34, 58}
Optimal solution's TSAS                246              574.25               686.5                2224
Found center streams solution          {2, 3, 6, 8, 9}  {9, 15, 16, 17, 20}  {3, 11, 12, 13, 21}  {22, 34, 36, 47, 54}
Found solution's TSAS                  246              734                  730.5                3984.5
# of solutions with less TSAS          0                6                    1                    20517

(b) Static office configurations (Figure 9)
                                       n = 13, k = 3    n = 22, k = 5
# of possible initial center streams   286              26334
# of created initial center streams    20               240
Total # of possible center streams     5                612
Optimal center streams                 {3, 7, 11}       {2, 6, 10, 15, 19}
Optimal solution's TSAS                550              432.032
Found center streams solution          {3, 7, 11}       {2, 6, 10, 15, 19}
Found solution's TSAS                  550              432.032
# of solutions with less TSAS          0                0

(c) Dynamic scene configuration (Figure 10(a))
                                       n = 8, k = 2
# of possible initial center streams   28
# of created initial center streams    14
Total # of possible center streams     3
Optimal center streams                 {2, 6}
Optimal solution's TSAS                820.601
Found center streams solution          {2, 6}
Found solution's TSAS                  820.601
# of solutions with less TSAS          0

5.3.2 Center Stream Initialization. The performance of this approximate approach heavily depends on the initial starting conditions (initial center streams and stream order) [Peña et al. 1999]. Therefore, in practice, to obtain a near optimal solution, multiple trials are attempted with several different instances of initial center streams.

The full search space for the iterative method could be investigated when all possible starting conditions—a total of nCk—are explored. A starting condition is given as a set of k initial center streams. As seen in Table I(a), such an exhaustive method will find all possible ending conditions. For example, if n = 10 and k = 5, there would be a total of 10C5 = 252 possible starting conditions, which in the example in Table I(a) will lead to one of 46 distinct ending conditions. An ending condition is one in which the center streams do not change from one iteration to the next. The optimal answer is obtained by examining the TSAS of all ending conditions. If there are multiple ending conditions with the same TSAS, the CSAS is used.

However, for numbers of n and k where this becomes impractical, we can examine only a small sample of all the possible starting conditions. The chances of finding the same optimal solution as the exhaustive method will increase by intelligently selecting the starting conditions to examine. Theoretically, the best way to initialize the starting center streams is to have the initial center streams as close to the optimal answer as possible. This means that, in general, the initial centers should be dispersed throughout the data space. In this section, we discuss a method for finding starting center stream assignments that are well dispersed in the data space. Once good starting conditions have been identified, it is straightforward to examine all corresponding ending conditions to find a near-optimal solution. The basic approach to identifying all possible good starting conditions is as follows:

(1) Sort the given streams using the global squared angle sum (Section 5.3.2.1).
(2) Group the streams in all possible "reasonable" ways (Section 5.3.2.2).
(3) Find all possible "reasonable" candidate center streams (Section 5.3.2.3).
(4) Generate all possible combinations of the candidate center streams for each possible grouping as good starting conditions, with all duplicates removed (Section 5.3.2.4).

5.3.2.1 Stream Sorting. Although the streams cannot be strictly ordered due to the three-dimensional nature of the viewpoint locations, an approximate sorting using the global squared angle sum is sufficient for our purpose.


Fig. 5. Dominant stream.

The global squared angle sum (GlSAS) for stream S_i is the sum of the squared angles between stream S_i and all other given streams (Eq. (5)).

    GlSAS_i = \sum_{j=1}^{n} [\mathrm{angle}(S_i, S_j)]^2    (5)

where S_i and S_j are streams, and n is the total number of streams.

Given a total of n streams, the pivot stream is chosen as the stream with the lowest GlSAS. Next, all other streams are divided into three groups—streams with a negative angle with respect to the pivot stream, streams with a positive angle, and streams with zero angle. Any stream that has zero angle with the pivot stream either covers the pivot stream or is covered by the pivot stream. All such streams are represented by one stream, the dominant stream—the stream that covers all other streams. Points present in the nondominant streams should be present in the dominant stream, so for stream partitioning the dominant stream can be used to represent the nondominant streams. The nondominant streams are removed from the stream list and added back as substreams to the dominant stream's group after the solution has been found. Figure 5 illustrates the notion of a dominant stream. Streams S1, S2, and S3 all have zero angle with each other. Stream S1 covers streams S2 and S3 since any data point in S2 and S3 is also in S1. Therefore, S1 is the dominant stream. The arrow indicates the stream view direction.

The positive angle and negative angle groups are each sorted using the GlSAS. The negative angle streams are sorted in descending order and the positive angle streams are sorted in ascending order. Placing the sorted negative angle streams on the left and the sorted positive angle streams on the right of the pivot stream creates the final sorted list.

5.3.2.2 Initial Group Partitions. After the n streams are sorted, they are partitioned into k initial groups. To ensure that we include all possible good starting conditions, we consider all reasonable groupings. Next, we describe our heuristics for creating reasonable groupings; a short sketch enumerating them is given below. The k initial groups are created such that:

—If stream Si is in group Gk, then all streams right of stream Si in the sorted stream list are in group Gk.
—If stream Si is in group Gj, stream Si+1 is either in group Gj or Gj+1, where 1 ≤ i < n, 1 ≤ j < k, and stream Si is left of stream Si+1 in the sorted stream list.
—If n is not an exact multiple of k, every group is assigned either ⌊n/k⌋ or ⌈n/k⌉ streams and every stream is assigned to a group. If n is an exact multiple of k, one group is assigned n/k − 1 streams, another is assigned n/k + 1 streams, and every other group is assigned n/k streams.

Streams are grouped into every possible combination that meets the above reasonableness criteria. Figure 6(a) shows the three possible groupings of ten sorted streams into three groups. The special case when n is an exact multiple of k is treated differently because the normal conditions would only allow one possible partition—Figure 6(b) shows the one possible grouping of nine sorted streams into three groups. Initial starting conditions are critical for a good solution to the approximate stream partition, so generating multiple starting conditions is desirable. To generate multiple partitions for the special case, one group is assigned n/k − 1 streams, another is assigned n/k + 1 streams, and every other group is assigned n/k streams.
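The sketch below (our illustration, not the authors' implementation) enumerates the contiguous groupings permitted by these rules, including the special case where n is an exact multiple of k; it reproduces the counts shown in Figures 6(a) and 6(c).

```python
from itertools import permutations

def contiguous_groupings(streams, k):
    """All 'reasonable' ways to split the sorted stream list into k contiguous groups
    (Section 5.3.2.2). Group sizes are floor(n/k)/ceil(n/k) when k does not divide n;
    otherwise one group gets n/k - 1, another n/k + 1, and the rest n/k."""
    n = len(streams)
    if n % k:
        small, big = n // k, n // k + 1
        n_big = n - small * k                      # how many groups must take the larger size
        size_patterns = set(permutations([big] * n_big + [small] * (k - n_big)))
    else:
        q = n // k
        size_patterns = set(permutations([q - 1, q + 1] + [q] * (k - 2)))
    groupings = []
    for sizes in sorted(size_patterns):
        groups, start = [], 0
        for s in sizes:
            groups.append(streams[start:start + s])
            start += s
        groupings.append(groups)
    return groupings

# Figure 6(a): ten sorted streams into three groups -> three possible groupings.
for g in contiguous_groupings(list(range(1, 11)), 3):
    print(g)
# Figure 6(c): nine streams into three groups -> six groupings under the special-case rule.
print(len(contiguous_groupings(list(range(1, 10)), 3)))   # 6
```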




Fig. 6. Initial Group Partitions: (a) All possible initial group partitions for 10 streams into 3 groups. (b) Only one grouping is possible for 9 streams and 3 groups with the normal conditions. (c) All possible initial group partitions for 9 streams into 3 groups.

Fig. 7. Initial Center Streams: (a) 10 streams partitioned into 3 groups; the candidate centers are in bold. (b) Initial center streams generated from group partition (a): {1, 4, 8}, {1, 4, 9}, {1, 5, 8}, {1, 5, 9}, {1, 6, 8}, {1, 6, 9}, {2, 4, 8}, {2, 4, 9}, {2, 5, 8}, {2, 5, 9}, {2, 6, 8}, {2, 6, 9}, {3, 4, 8}, {3, 4, 9}, {3, 5, 8}, {3, 5, 9}, {3, 6, 8}, {3, 6, 9}.

An example of grouping nine streams into three groups is shown in Figure 6(c). One group has two streams, another has four streams, and all other groups have three streams.

5.3.2.3 Candidate Centers. Again, to ensure all possible good starting conditions are included, multiple center stream candidates are chosen for each group. If the group has an even number of streams, the two streams in the middle are chosen as candidates. If it has an odd number of streams, the middle stream and its two neighboring streams are chosen as candidates. In Figure 7(a), Streams 1, 2, and 3 are candidate centers for Group 1, Streams 4, 5, and 6 are candidate centers for Group 2, and Streams 8 and 9 are candidate centers for Group 3.

5.3.2.4 Generating Initial Starting Points. Finally, we select a set of k center streams, one from each group, to construct a starting condition. All possible combinations of the candidate centers are generated as good beginning conditions (Figure 7(b)). Note that duplicate starting conditions can be created, and these are removed. The distinct starting conditions generated will be the good starting conditions explored. If the number of starting conditions is still too large, the desired number of initial sets can be stochastically sampled from all identified good starting conditions. In this case, the duplicates are not removed when sampling in order to provide those conditions with a better chance of being sampled.
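Continuing the illustration with our own hypothetical helpers (not the authors' code), the candidate-center rule and the generation of starting conditions can be sketched as follows; the example uses the Figure 7(a) grouping described above and reproduces the 18 combinations of Figure 7(b).

```python
from itertools import product

def candidate_centers(group):
    """Middle two streams for an even-sized group; middle stream and its two
    neighbors for an odd-sized group (Section 5.3.2.3)."""
    m = len(group) // 2
    lo = max(m - 1, 0)
    return group[lo:m + 1] if len(group) % 2 == 0 else group[lo:m + 2]

def starting_conditions(groupings):
    """All distinct sets of k initial center streams, one candidate per group,
    over every reasonable grouping, with duplicates removed (Section 5.3.2.4)."""
    seen = set()
    for groups in groupings:
        for combo in product(*(candidate_centers(g) for g in groups)):
            seen.add(frozenset(combo))
    return seen

# Figure 7(a): the grouping {1,2,3}, {4,5,6}, {7,8,9,10} yields 3 * 3 * 2 = 18 combinations.
print(len(starting_conditions([[[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]])))  # 18
```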

6. RESULTS

In this section, we present the results of the compression algorithm applied to two different data sets. The first is a static office scene captured using a 3D scanning laser rangefinder and a color camera. The second is a dynamic scene captured via multiple digital cameras. The results of stream partition, compression rate, network bandwidth, rendering speed, and rendered image quality are compared for the different algorithms.

6.1 Data

The Best-Interstream Compression, Nearest Neighbors as Reference Stream Compression, and the Group-Based Real-Time Compression from Kum et al. [2003]—referred to here as the MM03 Group-Based Real-Time Compression—are compared with the Group-Based Real-Time Compression for the static office data.


Fig. 8. Static office.

Fig. 9. Virtual depth camera configurations for the static office. (a) Top view of the 13 depth camera configuration partitioned into 3 groups. (b) Frontal view of the 22 virtual depth camera configuration partitioned into 5 groups.

For the dynamic scene, only Best-Interstream Compression and Nearest Neighbors as Reference Stream Compression are compared with the Group-Based Real-Time Compression, since the results of the MM03 Group-Based Real-Time Compression and the Group-Based Real-Time Compression are the same. The dynamic scene only has two groups, so the different algorithms for reference stream selection of a subgroup's center stream between the MM03 Group-Based Real-Time Compression and the Group-Based Real-Time Compression do not affect the result. The Best-Interstream Compression was chosen since it is the best compression achievable, and Nearest Neighbors as Reference Stream Compression was chosen because it is the best compression achievable in real-time.

6.1.1 Static Office. The stream compression algorithm was tested on a static office scene presented in Chen et al. [2000]. The scene was acquired using a 3D scanning laser rangefinder and a high-resolution color camera, with which a polygonal model was created. Figure 8(a) shows the layout of the static office, and Figure 8(b) is a rendered image of the static office scene. Multiple depth streams were generated by placing virtual depth cameras, cameras that create color and depth per pixel, in the scene. The virtual depth cameras were placed in two different camera configurations similar to the ones used in previous systems [Towles et al. 2002; Kelshikar et al. 2003]. They were partitioned using the algorithm presented in Section 5.3. Figure 9 shows the camera groupings with the center camera circled.

The first configuration placed 13 depth cameras around the scene in a semi-circle. The depth cameras were placed about 2 m from the mannequin at its eye height and were pointed at the mannequin's head. The depth cameras were placed about 20 cm apart [Towles et al. 2002]. Figure 9(a) is the top view of the 13 depth camera configuration partitioned into 3 groups.


Fig. 10. Dynamic Scene.

The second configuration placed 2 rows of 11 depth cameras on a wall 2.25 m from the mannequin. The two rows were parallel and 20 cm apart, with the bottom row at the height of the mannequin's eye. Cameras in each row were placed at 20 cm intervals and pointed at the mannequin's head [Kelshikar et al. 2003]. The frontal view of the 22 virtual depth camera configuration partitioned into 5 groups is shown in Figure 9(b).

The depth stream created from each depth camera was at a resolution of 640 × 480 with background. The horizontal field of view for all of the cameras was 42 degrees. A sequence of frames was generated by moving around the scene. The path starts by facing the mannequin directly, and the view moves toward the left looking at the mannequin. After passing the last virtual depth camera, the view moves back to the right until the first depth camera is reached. Then, the view is moved back to the center, where it is moved in the vicinity to simulate a real situation.

6.1.2 Dynamic Sequence. A sequence of images of a person sitting behind a desk talking was captured with 8 Point Grey Dragonfly color cameras. The cameras captured synchronized images at 15 frames/sec using the Point Grey Sync Units. The cameras were placed on a wall about 1.5 m from the person, 20 cm apart from each other, and at 1.1 m from the floor. The cameras were pointed toward the face of the person. Figure 10(a) shows the 8 camera configuration partitioned into 2 groups using the method in Section 5.3. The center camera for each group is circled. The image resolution was 640 × 480. The depth streams were created by doing stereo-based 3D reconstruction on the sequence of images [Pollefeys et al. 2004]. A rendered image of the dynamic scene is shown in Figure 10(b).

6.2 Stream Partition

Table I shows the result of running the stream partitioning algorithm on several different examples. n is the total number of streams and k is the number of groups. The total number of possible solutions and the optimal solution are from running all possible initial center streams. The total squared angle sum (TSAS) is given with each center stream solution. The last row shows the number of solutions that have a smaller TSAS than the approximate partitioning solution. The streams in Table I(a) were placed randomly with the only constraint that the largest possible angle between two streams is 120 degrees. The streams for Table I(b) are the same configurations used for the static office scene (Figure 9). Table I(c) is the camera configuration for the dynamic scene (Figure 10(a)).

Table I(a) shows the algorithm works well for different numbers of streams. The first example (n = 10, k = 5) shows that for a relatively small n and k an exhaustive search is plausible. The third (n = 25, k = 5) is an example of where n is an exact multiple of k. The final example (n = 60, k = 5) is a case where doing an exhaustive search is not practical. Table I(b) and I(c) demonstrate that the algorithm works well for streams placed fairly uniformly, which better represents the application.


Fig. 11. Compression Rates of Different Algorithms.

Table II. Average Compression Rate

                                                    Static Scene               Dynamic Scene
                                                    13 Streams    22 Streams   (8 Streams)
Best-Interstream Compression                        7.79          13.29        5.21
Nearest Neighbors as Reference Stream Compression   7.70          10.00        4.74
Group-Based Real-Time Compression                   5.78          7.64         4.44
MM03 Group-Based Real-Time Compression              5.58          7.56         —

One thing to note is that there were two instances of center streams—{3, 7, 11} and {3, 8, 12}—with the same TSAS of 550 for the first configuration (n = 13, k = 3). However, {3, 7, 11} was selected on the basis of the CSAS. The last row of the table shows that for all cases tested, the solution generated by the approximate partitioning method is either the same as or very close to the optimal solution. All solutions were in the top 0.4% of the total possible solutions. Also, for camera configurations better suited to the application (i.e., roughly uniform placement), the approximate partitioning method found the optimal solution. This was achieved while exploring less than 1% of the total possible initial center streams when the number of streams was larger than 22.

6.3 Compression Rate

Figure 11 shows a comparison of compression rates for the different algorithms, and the average compression rate for each data set is shown in Table II. For the 13 stream configuration, where the cameras were placed in a linear semi-circle, Best-Interstream Compression and Nearest Neighbors as Reference Stream Compression show about the same compression ratio. However, for the 22 stream configuration,


where the cameras were placed in a 2D array, the Best-Interstream Compression has a better compression ratio than the Nearest Neighbors as Reference Stream Compression. This is due to the fact that in the 13 stream linear configuration, the definition of neighbor is clear. However, in the 22 stream 2D array configuration there are three possible neighbors to select from. Therefore, when linearly sorting the cameras, there will be cameras that are not neighbors but which have many redundant points in common.

The results also show that the Group-Based Real-Time Compression does better than the MM03 Group-Based Real-Time Compression, especially when the user's viewpoint (main stream) is near the end cameras of the camera configuration—that is, when the main group is one of the groups at the ends. The MM03 Group-Based Real-Time Compression uses the center stream of the main group as the reference stream for a subgroup's center stream, while the Group-Based Real-Time Compression uses the nearest neighboring center stream as the reference stream. So when the user's viewpoint is near the end cameras of the camera configuration, the center stream of the subgroup on the opposite end should have more redundant points with the nearest neighboring center stream (Group-Based Real-Time Compression) than with the main group's center stream (MM03 Group-Based Real-Time Compression). For example, in Figure 9(b), if the main stream is Stream 3, Stream 2 is the center stream of the main group. Stream 10—a center stream of a subgroup—uses Stream 19 as its reference stream in the Group-Based Real-Time Compression, while the MM03 Group-Based Real-Time Compression uses Stream 2 as the reference stream for Stream 10. Stream 10 should have more redundant points with Stream 19 than with Stream 2, since the angle between Stream 10 and Stream 19 is smaller than the angle between Stream 10 and Stream 2 (Figure 4), resulting in better compression for the Group-Based Real-Time Compression.

The results from the dynamic scene (Figure 11(c)) indicate that the difference between the algorithms is not as big as for the static office data set. We believe this is mostly due to the inaccuracy of depth values from reconstruction and the lack of background reconstruction. Since a stereo-based 3D reconstruction algorithm was used to create the dynamic scene, no depth values (i.e., points) were generated for parts of the scene where good correspondence could not be found—such as the wall of the office. This is shown as black in Figure 10(b). However, the static office data set has depth values for most of the scene including the background, where a significant number of redundant points exist. Therefore, the total number of redundant points was significantly reduced in the dynamic scene compared to the static scene. Also, the inaccuracy of depth values, due to noise and other real-world effects in the dynamic scene, causes some redundant points to be incorrectly identified as nonredundant.

6.4 Network Bandwidth

Network bandwidth usage was compared using a target frame rate of 15 frames/sec. All differential streams for the static office scene were encoded using the method described in Section 4.2. The streams for the dynamic scene do not have much background, so they were also encoded with the same method since it saves network bandwidth.
Internal network bandwidth, used for reference stream transfer, and external network bandwidth, used for transmission to the rendering system, were compared between the compressed streams created by the different algorithms and the original noncompressed streams. The results show that Group-Based Real-Time Compression and MM03 Group-Based Real-Time Compression use much less internal network bandwidth than Best-Interstream Compression and Nearest Neighbors as Reference Stream Compression, but use about the same external network bandwidth.

6.4.1 Internal Network Bandwidth. The internal network bandwidth is the bandwidth needed locally to transmit reference streams for creating differential streams. The number of reference streams needed for the Best-Interstream Compression is n − 1 if there are a total of n streams.


Fig. 12. Local network bandwidth usage.

For Nearest Neighbors as Reference Stream Compression, a total of n − 2 reference streams are needed, since the two streams at the ends are not used as reference streams. The Group-Based Real-Time Compression and MM03 Group-Based Real-Time Compression differ in which reference stream is used for creating differential streams, but they use the same number of reference streams. No internal network bandwidth is needed for the original noncompressed streams since no reference streams are used. Figure 12 shows that Group-Based Real-Time Compression and MM03 Group-Based Real-Time Compression use much less internal network bandwidth than Best-Interstream Compression and Nearest Neighbors as Reference Stream Compression.

6.4.2 External Network Bandwidth. The external network bandwidth refers to the network bandwidth needed to transfer the data streams from the reconstruction system to the rendering system. The original noncompressed streams for the static scene have constant bandwidth because neither the number of streams nor the number of transmitted points per stream changes; that is, the number of bytes needed for transmission of each frame stays the same. For the dynamic scene, the bandwidth of the original noncompressed streams is nearly, but not exactly, constant because the total number of points transmitted per frame changes slightly from frame to frame even though the number of streams remains the same. Differential streams, that is, streams with redundant points removed, are encoded using a bitmask, so their network bandwidth demand changes with the number of points sent. The results, shown in Figure 13, exhibit behavior similar to the compression rate results, since the size of the encoded differential streams is directly related to the number of points sent.
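To make the dependence on the number of transmitted points concrete, the sketch below packs a one-bit-per-pixel mask together with the surviving samples. It is only an illustration of the idea: the 4-byte-per-point payload and all names are assumptions, not the actual encoding of Section 4.2.

    import numpy as np

    BYTES_PER_POINT = 4  # assumed payload per transmitted point (e.g., color plus depth)

    def encode_differential_frame(points, redundant):
        """Encode one differential frame as a bitmask plus the nonredundant points.

        points:    (num_pixels, BYTES_PER_POINT) uint8 array of per-pixel samples
        redundant: (num_pixels,) bool array, True where a reference stream already
                   covers the point and nothing needs to be sent
        """
        keep = ~redundant
        bitmask = np.packbits(keep.astype(np.uint8))  # 1 bit per pixel
        payload = points[keep].tobytes()              # only the surviving points
        return bitmask.tobytes() + payload

    # Frame size = ceil(num_pixels / 8) + BYTES_PER_POINT * (points kept), so the
    # external bandwidth tracks the number of nonredundant points, as in Figure 13.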


Fig. 13. External network bandwidth usage.

6.5 Frame Rate

The streams were rendered using an NVIDIA Quadro FX3000 graphics card, and the results are shown in Figure 14. For the static office scene, the rendering system is able to render the compressed streams at an interactive frame rate, which is essential for a tele-immersion system, but it cannot do so for the original noncompressed streams. The original noncompressed streams of the dynamic scene can be rendered at an interactive frame rate, but not fast enough for a smooth rendering of the scene, which requires at least 20–30 frames/sec. However, the compressed streams can be rendered at a frame rate high enough for smooth rendering.

6.6 Rendered Image Quality

6.6.1 Static Scene. The quality of rendered images was measured by the peak signal-to-noise ratio (PSNR). The gold standard reference image to compare against was rendered using the original polygonal model of the office. The PSNR of the original noncompressed streams and of the differential streams from the different algorithms for 13 streams (Figure 15(a)) and 22 streams (Figure 15(b)) shows no significant differences, which indicates that redundant points can be removed without much loss in rendered image quality.

6.6.2 Dynamic Scene. For the dynamic scene, since there is no original model from which to create a gold standard reference image, the reference image was rendered using the original noncompressed streams. Since the reconstructed depth streams do not cover the whole background, rendered reference images have pixels with no color values. Therefore, the PSNR was calculated using only the valid pixels (pixels with color values) of the reference image. For pixels that had a valid color in the reference image but not in the rendered image, the color difference was set to the maximum, 255.


Fig. 14. Frame rate for different compression schemes on an NVIDIA Quadro FX3000.

The opposite case, a valid pixel in the rendered image but not in the reference image, cannot happen since the rendered image uses a subset of the points used to render the reference image. The results, shown in Figure 15(c), indicate that the PSNR values for the various algorithms are similar. However, the absolute PSNR values are much lower than those of the static office scene. Examining the color differences for each pixel explains the lower absolute PSNR values and demonstrates the shortcomings of using PSNR to compare image quality. Figure 16 shows the ratio of valid pixels with a given color difference. Figure 16(a) shows the minimum, maximum, average, cumulative minimum, cumulative maximum, and cumulative average over the 100-frame sequence for the Group-Based Real-Time Compression. Figure 16(b) is for the Best-Interstream Compression and Figure 16(c) for the Nearest Neighbors as Reference Stream Compression. The ratio of valid pixels with a given color difference for frame 25 of Group-Based Real-Time Compression, Best-Interstream Compression, and Nearest Neighbors as Reference Stream Compression, with cumulative values for each, is shown in Figure 16(d). About 90% of the valid pixels have a color difference of less than 50. Most of the remaining pixels, about 10%, have a color difference of 255; these are pixels that appear in the reference image but not in the rendered image. They usually lie on object silhouettes, which are perceptually indistinguishable between the reference and rendered images but have a large effect on PSNR.
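The valid-pixel PSNR described above can be written down in a few lines. The sketch assumes 8-bit color images and boolean validity masks; the 255 substitution for pixels missing from the rendered image follows the rule stated in the text, while the function and variable names are ours.

    import numpy as np

    def valid_pixel_psnr(reference, rendered, valid_ref, valid_out):
        """PSNR computed only over pixels that are valid in the reference rendering.

        reference, rendered: (H, W, 3) uint8 images
        valid_ref, valid_out: (H, W) bool masks of pixels that received a color value
        """
        diff = np.abs(reference.astype(np.float64) - rendered.astype(np.float64))
        # A pixel valid in the reference but missing in the rendered image counts as
        # the maximum difference, 255.
        diff[valid_ref & ~valid_out] = 255.0
        mse = np.mean(diff[valid_ref] ** 2)
        return 10.0 * np.log10(255.0 ** 2 / mse)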

7. CONCLUSIONS AND FUTURE WORK

In this article, we have presented a scalable real-time compression algorithm for 3D environments used in a tele-immersion system. We have modified the algorithm of our earlier work [Kum et al. 2003] for better performance, especially for a large number of depth streams and groups.


Fig. 15. The peak signal-to-noise ratio (PSNR) of valid pixels for different compression schemes.

The Group-Based Real-Time Compression algorithm balances compression efficiency against resource limitations by partitioning the streams into coherent groups. By limiting the number of groups, which determines the total number of reference streams, the internal network bandwidth is controlled. This also allows the rendering system to keep pace with the scalability of the acquisition and 3D reconstruction systems, since more streams can be added without increasing the internal network bandwidth. The algorithm also achieves efficient compression by selecting the best reference stream for each stream when creating its differential stream. The removal of redundant points reduces the external network bandwidth demand and enables the rendering system to run in real-time while maintaining rendered image quality. The results show that the compression algorithm performs in real-time, is scalable, and stays within network bandwidth limits for multiple configurations. Furthermore, an approximate stream partitioning algorithm is presented, which efficiently groups streams with high coherence for an effective Group-Based Real-Time Compression algorithm. The algorithm produces a near-optimal partition of the streams into coherent groups when it is not practical to compute the best stream partition, that is, for a high number of streams.

As part of future work, we would like to make the following improvements to the algorithm:
—Use temporal coherence to increase compression and to reduce the number of comparisons made with the reference stream for faster compression.
—Develop a better metric for main stream selection. Instead of using only Euclidean distance, the angle between the user's view and the view directions of the 3D cameras could be used in conjunction.
—As the main stream changes, major portions of the point data set change abruptly, causing a popping effect. This is worse if the main stream changes from one group to another. A gradual change in the point data set as the main stream changes would be desirable.


Fig. 16. Ratio of valid pixels with a given color difference. The rendered image of original noncompressed streams is used as the reference image.

—Develop an efficient representation of the depth streams for better compression.
—Improve depth stream compression by incorporating temporal coherence.
—Examine the relationship between internal bandwidth and external bandwidth. If internal bandwidth is increased, more reference streams can be used, which should reduce external bandwidth. As mentioned before, the internal network typically has greater capacity, so trading internal bandwidth for reduced external bandwidth is usually advantageous. However, there should be a point beyond which increasing internal bandwidth no longer helps.

Finally, we would like to explore the proposed stochastic sampling approach for generating initial starting points for approximate stream partitioning.

ACKNOWLEDGMENTS

We would like to thank Hye-Chung (Monica) Kum for all her helpful discussions and feedback; Herman Towles and Henry Fuchs for helpful suggestions and comments; and Travis Sparks, Sudipta Sinha, Jason Repko, and Marc Pollefeys for their help in creating the dynamic scene data. We would also like to thank the Katholieke Universiteit Leuven in Belgium for providing the software used to create the dynamic scene data.

REFERENCES

ADVANCED NETWORK AND SERVICES, INC. http://www.advanced.org.
BAKER, H. H., TANGUAY, D., SOBEL, I., GELB, D., GOSS, M. E., CULBERTSON, W. B., AND MALZBENDER, T. 2002. The Coliseum immersive teleconferencing system. In Proceedings of the International Workshop on Immersive Telepresence 2002 (Juan Les Pins, France).


CHANG, C.-F., BISHOP, G., AND LASTRA, A. 1999. LDI tree: A hierarchical representation for image-based rendering. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., Los Angeles, CA, 291–298.
CHEN, W.-C., TOWLES, H., NYLAND, L., WELCH, G., AND FUCHS, H. 2000. Toward a compelling sensation of telepresence: Demonstrating a portal to a distant (static) office. In Proceedings of the 11th IEEE Visualization Conference (Salt Lake City, Utah). 327–333.
CRUZ-NEIRA, C., SANDIN, D. J., AND DEFANTI, T. A. 1993. Surround-screen projection-based virtual reality: The design and implementation of the CAVE. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (Anaheim, Calif.). ACM, New York, 135–142.
GROSS, M., WÜRMLIN, S., NAEF, M., LAMBORAY, E., SPAGNO, C., KUNZ, A., KOLLER-MEIER, E., SVOBODA, T., VAN GOOL, L., LANG, S., STREHLKE, K., MOERE, A. V., AND STAADT, O. 2003. Blue-c: A spatially immersive display and 3D video portal for telepresence. ACM Transactions on Graphics, Special issue (Proceedings of ACM SIGGRAPH 2003) 22, 3 (July), 819–827.
GROSSMAN, J. AND DALLY, W. J. 1998. Point sample rendering. In Proceedings of the 9th Eurographics Workshop on Rendering Techniques (Vienna, Austria). Springer-Verlag, New York, 181–192.
JAIN, A. K., MURTY, M. N., AND FLYNN, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3 (Sept.), 264–323.
KAUFF, P. AND SCHREER, O. 2002. An immersive 3D video-conferencing system using shared virtual team user environments. In Proceedings of the 4th ACM International Conference on Collaborative Virtual Environments (Bonn, Germany). ACM, New York, 105–112.
KELSHIKAR, N., ZABULIS, X., MULLIGAN, J., DANIILIDIS, K., SAWANT, V., SINHA, S., SPARKS, T., LARSEN, S., TOWLES, H., MAYER-PATEL, K., FUCHS, H., URBANIC, J., BENNINGER, K., REDDY, R., AND HUNTOON, G. 2003. Real-time terascale implementation of tele-immersion. In Proceedings of the Terascale Performance Analysis Workshop at the International Conference on Computational Science (Melbourne, Australia).
KUM, S.-U., MAYER-PATEL, K., AND FUCHS, H. 2003. Real-time compression for dynamic 3D environments. In Proceedings of the 11th ACM International Conference on Multimedia (Berkeley, Calif.). ACM, New York, 185–194.
MATUSIK, W., BUEHLER, C., RASKAR, R., GORTLER, S. J., AND MCMILLAN, L. 2000. Image-based visual hulls. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (New Orleans, La.). ACM Press/Addison-Wesley Publishing Co., 369–374.
MCMILLAN, L. AND BISHOP, G. 1995. Plenoptic modeling: An image-based rendering system. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (Los Angeles, Calif.). ACM, New York, 39–46.
MOLNAR, S., EYLES, J., AND POULTON, J. 1992. PixelFlow: High-speed rendering using image composition. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques (Chicago, Ill.). ACM, New York, 231–240.
OFFICE OF THE FUTURE PROJECT. http://www.cs.unc.edu/~ootf.
PEÑA, J. M., LOZANO, J. A., AND LARRAÑAGA, P. 1999. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recogn. Lett. 20, 10 (Oct.), 1027–1040.
PITTSBURGH SUPERCOMPUTING CENTER. http://www.psc.edu.
POLLEFEYS, M., VAN GOOL, L., VERGAUWEN, M., VERBIEST, F., CORNELIS, K., TOPS, J., AND KOCH, R. 2004. Visual modeling with a hand-held camera. Int. J. Comput. Vis. 59, 3, 207–232.
RUSINKIEWICZ, S. AND LEVOY, M. 2000. QSplat: A multiresolution point rendering system for large meshes. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (New Orleans, La.). ACM Press/Addison-Wesley Publishing Co., 343–352.
SHADE, J., GORTLER, S., HE, L., AND SZELISKI, R. 1998. Layered depth images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (Orlando, Fla.). ACM, New York, 231–242.
STOLL, G., ELDRIDGE, M., PATTERSON, D., WEBB, A., BERMAN, S., LEVY, R., CAYWOOD, C., TAVEIRA, M., HUNT, S., AND HANRAHAN, P. 2001. Lightning-2: A high-performance display subsystem for PC clusters. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques (Los Angeles, Calif.). ACM, New York, 141–148.
TOWLES, H., CHEN, W.-C., YANG, R., KUM, S.-U., FUCHS, H., KELSHIKAR, N., MULLIGAN, J., DANIILIDIS, K., HOLDEN, L., ZELEZNIK, B., SADAGIC, A., AND LANIER, J. 2002. 3D tele-immersion over Internet2. In Proceedings of the International Workshop on Immersive Telepresence 2002 (Juan Les Pins, France).
UNIVERSITY OF PENNSYLVANIA GRASP LAB. http://www.grasp.upenn.edu.
WÜRMLIN, S., LAMBORAY, E., AND GROSS, M. 2004. 3D video fragments: Dynamic point samples for real-time free-viewpoint video. Computers & Graphics, Special Issue on Coding, Compression and Streaming Techniques for 3D and Multimedia Data 28, 1 (Feb.), 3–14.

Received January 2005; accepted January 2005

Panoptes: Scalable Low-Power Video Sensor Networking Technologies

WU-CHI FENG, ED KAISER, WU-CHANG FENG, and MIKAEL LE BAILLIF
Portland State University

Video-based sensor networks can provide important visual information in a number of applications, including environmental monitoring, health care, emergency response, and video security. This article describes the Panoptes video-based sensor networking architecture, including its design, implementation, and performance. We describe two video sensor platforms that can deliver high-quality video over 802.11 networks with a power requirement of less than 5 watts. In addition, we describe the streaming and prioritization mechanisms that we have designed to allow the sensors to survive long periods of disconnected operation. Finally, we describe a sample application and bitmapping algorithm that we have implemented to show the usefulness of our platform. Our experiments include an in-depth analysis of the bottlenecks within the system as well as power measurements for the various components of the system.

Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers—Portable devices
General Terms: Design, Measurement, Performance
Additional Key Words and Phrases: Video sensor networking, video collection, adaptive video

1. INTRODUCTION

There are many sensor networking applications that can significantly benefit from the presence of video information. These applications include both video-only sensor networks and sensor networking applications in which video-based sensors augment traditional scalar sensors. Examples of such applications include environmental monitoring, health-care monitoring, emergency response, robotics, and security/surveillance applications. Video sensor networks, however, pose a formidable challenge to the underlying infrastructure due to the relatively large computational and bandwidth requirements of the resulting video data. The amount of video generated can consume orders of magnitude more resources than the data from scalar sensors. As a result, video sensor networks must be carefully designed to be both low power and flexible enough to support a broad range of applications and environments. To understand the flexibility required in the way video sensors are configured, we briefly outline three example applications below.

This work is based upon work supported by the National Science Foundation (NSF) under grants ANI-0087761 and EIA-0130344. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Authors' addresses: Department of Computer Science, Portland State University, P.O. Box 751, Portland OR 97207-0751; email: {wuchi,edkaiser,wuchang}@cs.pdx.edu; mikael.le [email protected].


—Environmental Observation Systems: For oceanographers who want to understand the development of sandbars beneath the water's surface, video sensors can provide an array of observation points. Using image processing techniques on the video, the oceanographers can determine the evolution of such sandbars over time. The tetherless nature of the application requires video sensors that are entirely self-sufficient. In particular, the sensors must be equipped with power that is generated dynamically, via solar panels or wind-powered generators, and managed appropriately. In addition, network connectivity may be at a premium, including possibly intermittent or "programmed" network disconnection. For this application, keeping the sensor running indefinitely while collecting, storing, and transmitting only the most important video is the primary goal.
—Video Security and Surveillance Applications: In these applications, the video sensors should filter as much of the data at the sensor as possible in order to maximize scalability, minimize the amount of network traffic, and minimize the storage space required at the archive to hold the sensor data. The sensors themselves may have heterogeneous power and networking requirements. In outdoor security applications, the sensor power may be generated by solar panels, with wireless networking used to connect to the archive. In indoor security applications, the sensors will most likely have power access and will be connected via wireless or wireline networks.
—Emergency Response Systems: A video-based sensor network may be deployed to help emergency response personnel assess the situation and take appropriate action. The video sensors may be required to capture and transmit high-quality video for a specified period of time (i.e., the duration of the emergency). The goal in these situations might be to meet a target operating time with minimal power adaptation, in order to provide emergency response personnel with the critical information they need throughout the incident.

In this article, we describe the Panoptes video sensor networking project at Portland State University. In particular, we describe the design, implementation, and performance of the Panoptes sensor node, a low-power video-based sensor. The sensor software consists of a component-based infrastructure whose functionality can be altered on the fly through Python-connected components. We also describe an adaptive video delivery mechanism that manages a buffer of data so that it supports intermittent and disconnected operation. This buffering mechanism allows the user to specify how to gracefully degrade the quality of the video in the event that the sensor is unable to transmit all of the video data. Finally, we describe a video sensor application that we have developed. For this application, we have designed an efficient algorithm that allows video data to be queried without analyzing pixel data directly.

In the following section, we provide a description of the embedded sensor platform, including the systems software and architecture of the sensor. Following the description of the Panoptes video sensor, we describe a scalable video sensor application that has been designed to show some of the features of the video sensors. The experimentation section provides an in-depth analysis of the performance of the video sensor and its subcomponents. In Section 5, we describe some of the work related to ours and how it differs.
Finally, we conclude with some of our future work and a summary.

2. VIDEO SENSOR PLATFORM

In designing a video-sensor platform, we had a number of design goals that we were trying to accomplish, including:
—Low Power. Whether power is scarce or plentiful, minimizing the amount of power required to capture the video is important. For environments where power is scarce, minimizing power usage can significantly increase the time that the sensors can operate. For environments where power is plentiful, minimizing power usage can significantly increase the number of sensors that can be economically deployed.
For example, homeowners may be willing to deploy a large number of 5-watt video sensors (equivalent to a night light) while on vacation. However, they may be unwilling to use laptop or desktop counterparts that can easily consume two orders of magnitude more power.
—Flexible Adaptive Buffering Techniques. We expect that the video sensors will need to support a variety of latency and networking configurations, with a buffer on the sensor acting as the intermediate store for the data. Of course, the buffer can hold only a finite amount of data and may need to balance storing old data against storing new data. For some applications, data older than some prespecified time may be useless, while in other applications the goal will be to transmit as much captured data as possible (no matter how old it is) over the network. Two such applications might be commuter traffic monitoring for the former case and coastal monitoring for the latter. Thus, we require a flexible mechanism by which applications can specify both latency and a mapping of priorities for the data that is being captured.
—Power Management. A low-power video platform is just one component of the video sensor. The video sensor also needs to be able to adapt the amount of video that is being captured to the amount of power that is available. Just as with the flexible adaptive buffering techniques, power management needs to be flexible. For example, in one scenario, the application requirement might be to have the sensor turn on and capture as much video as it can before the battery dies. In another scenario, it might be necessary for the sensor to keep itself alive using only self-generated power (such as from a solar panel or a wind-powered generator).
—Adaptive Functionality. The functionality of the sensor may need to change from time to time. Changing the functionality should not require the sensor to be stopped and reconfigured. Rather, the sensor should be able to add new functionality while running and should also minimize the amount of code transferred through the network.

In the following section, we describe the hardware platform that serves as the basis of our video sensor technology. Following that, we describe the software that we have developed to help address some of the design requirements above.

2.1 Panoptes Sensor Hardware

In designing the video sensor, we had a number of options available to us. The most prevalent platform at the outset was the StrongARM-based Compaq IPAQ PDA. This platform has been used for a number of research projects, including some at MIT and ISI. As we will describe in the experimentation section, we found that the popular Winnov PC-Card video capture device was slow in capturing video and also required a large amount of power. The alternative was to find an embedded device with different input capabilities. Our initial investigation into embedded devices uncovered a number of limitations unique to embedded processors that are not generally found in their laptop or desktop counterparts.
—Limited I/O Bandwidth. Many of the low-power devices today have either PCMCIA or USB as their primary I/O interconnects. Using PCMCIA-based devices typically requires significant power. For USB, low-power embedded devices do not support USB 2.0, which provides 480 Mb/sec; the main reason is that a fairly large processor would be required to consume the incoming data. For USB 1.0, the aggregate bandwidth is 12 Mb/sec, which cannot support the uncompressed movement of 320 × 240 pixel video at full frame rate.
—Floating Point Processing. The Intel-based embedded processors such as the StrongARM and Xscale processors do not support floating point operations. For video compression algorithms, this means that floating point operations either need to be converted to integer equivalents or need to be emulated.


Fig. 1. The Panoptes Video Sensors. (a) The Applied Data Bitsy platform. The sensor board is approximately 5 inches long and 3 inches wide. (b) The Crossbow Stargate platform. The Stargate platform is approximately 3.5 inches by 2.5 inches in size.

—Memory Bandwidth. The bandwidth available for moving data in and out of memory is relatively small compared to desktop systems. Image capture, processing, and compression, however, can involve moving a large amount of image data in real time.

The initial video sensor that we developed is an Applied Data Bitsy board utilizing the Intel StrongARM 206-MHz embedded processor. The device is approximately 5 inches long and approximately 3 inches wide. The sensor has a Logitech 3000 USB-based video camera, 64 Mbytes of memory, the Linux 2.4.19 operating system kernel, and an 802.11-based networking card. Note that while 802.11 is currently being used, it is possible to replace it with a lower-powered, lower-frequency RF radio device. By switching to a USB-based camera platform, we were able to remove the power required to drive the PC-Card. After reporting our initial findings [Feng et al. 2003], we have ported the code base to work with the Crossbow Stargate platform. There are a number of advantages to this platform. First, it is made by the company that provides many of the motes to the sensor community; the Stargate was originally meant for use as a data aggregator for the motes. Second, while it has twice the processing power of the Bitsy board, it also consumes less energy. The video sensors are shown in Figure 1. As far as we know, these are the first viable video sensors that can capture video at a reasonable frame rate (i.e., greater than 15 frames per second) while using a small amount of power. The other platforms that we are aware of are described in the related work section.

2.2 Panoptes Sensor Software Architecture

There are a number of options in architecting the software on the video sensor. The Panoptes video sensor that we have developed uses the Linux operating system. We chose Linux because it provides the flexibility necessary to modify parts of the system for specific applications. Furthermore, accessing the devices is simpler than in other operating systems. The functionality of the video sensing itself is split into a number of components, including capture, compression, filtering, buffering, adaptation, and streaming. The major components of the system are shown in Figure 2. In the rest of this section, we briefly describe the individual components.

2.2.1 Video Capture. As previously mentioned, we chose a USB-based (USB 1.0) video camera. We are using the Philips Web Camera interface with Video for Linux. Decompression of the data from the USB device occurs in the kernel before the data is passed to user space, which allows memory-mapped access to decompressed frames. Polling indicates when a frame is ready to be read and further processed through a filtering algorithm, a compressor, or both.


Fig. 2. Panoptes sensor software components.

2.2.2 Compression. The compression of video frames, both spatially and temporally, allows for a reduction in the cost of network transmission. We have currently set up JPEG, differential JPEG, and conditional replenishment as the compression formats on the Panoptes platform. Although JPEG itself does not allow for temporal compression of data, it saves on computational cost (relative to formats such as MPEG), and thus power. Compression on the Panoptes sensors is CPU bound. As will be shown in the experimentation section, we have taken advantage of Intel's Performance Primitives for multimedia data, available for the StrongARM and Xscale processors, to make higher frame rates possible. While low-power video coding techniques are not the focus of this article, we expect that other compression technologies can be incorporated into the video sensor easily.

2.2.3 Filtering. The main benefit of a general-purpose video sensor is that it allows application-specific video handling and transformation to be accomplished at the edge of the sensor network, allowing more sensors to be included in the system. For example, in a video security application, having the video sensor filter uninteresting data without compressing or transmitting it upstream allows the sensor network to be more scalable than if it just transmitted all data upstream. For environmental observation, the filter may create a time-elapsed image, allowing the data to be compressed as it is needed by the application as well as minimizing the amount that needs to be transmitted [Stockdon and Holman 2000]. Finally, in applications that require meta-information about the video (e.g., image tracking), the filtering component can be set up to run the vision algorithms on the data. The filtering subcomponent in our system allows a user to specify how and what data should be filtered. Because of the relatively high cost of DCT-based video compression, we believe that fairly complex filtering algorithms can be run if they reduce the number of frames that need to be compressed. For this article, we have implemented a brute-force, pixel-by-pixel algorithm that detects whether or not the video has changed over time. Frames that are similar enough (not exceeding a certain threshold) can be dropped at this stage if desired.

2.2.4 Buffering and Adaptation. Buffering and dynamic adaptation are important for a number of reasons. First, we need to be able to manage transmitting video in the presence of network congestion. Second, for long-lived, low-power scenarios, the network may be turned off in order to save precious battery life; 802.11 networking consumes approximately one-third of the power. Finally, in the event that the buffer within the video sensor fills up, efficient mechanisms need to be in place that allow the user to specify which data should be discarded first. For our sensor, we employ a priority-based streaming mechanism. The algorithm presented here is different from traditional video streaming algorithms that have appeared in the literature (e.g., Ekici et al. [1999] and Feng et al. [1999]). The main difference is that in traditional video-streaming algorithms, the video data is known in advance but needs to be delivered in time for display. For most non-real-time video-sensor applications, the video is generated on the fly, at potentially varying frame rates chosen to save power.


Fig. 3. A dynamic priority example.

While traditional video streaming algorithms can be employed for live streaming, we focus on adaptive video collection in this article.

Priority-Based Adaptation. We have defined a flexible priority-based streaming mechanism for the buffer management system. Incoming video data is mapped to a number of priorities defined by the applications. The priorities can be used to manage both frame rate and frame quality. The mapping of the video into priorities is similar to that in Feng et al. [1999] or Krasic et al. [2003]. The buffer is managed through two main parameters: a high-water mark and a low-water mark. If the buffered data grows beyond the high-water mark (i.e., the buffer is getting full), the algorithm starts discarding data from the lowest priority layer to the highest priority layer until the amount of buffered data is less than the low-water mark. Within a priority level, data is dropped in order from the oldest to the newest. This allows the video data to be smoothed as much as possible. It is important to note that the priority mapping can change over time. For example, in the environmental monitoring application, the scientist may be interested in higher quality video data during low and high tides but may still require video at other times. The scientist can then incrementally increase the quality of the video during the important periods by increasing the priority levels. Figure 3 shows one such dynamic mapping. Data is sent across the network in priority order (highest priority, oldest frame first). This allows the sensor to transfer its highest priority information first. We believe that this is particularly important for low-power scenarios, where the sensor will disconnect from the network to save power, and for scenarios where the network is intermittent. As shown in the example, the areas labeled (a) and (c) have been given higher priority than the frames in (b) and (d). Thus, the frames from the regions labeled (a) and (c) are delivered first. Once the highest priority data are transmitted, the streaming algorithm then transmits the frames from regions (a), (c), and (d). Note that the buffering and streaming algorithm can accept any number of priority layers and arbitrary application-specific mappings from video data to priority levels.

2.2.5 Providing Adaptive Sensor Functionality. In our initial implementation of the sensor, we simply modularized the code and connected the modules via function calls. With this approach, changing one of the parts, such as the compression algorithm or the way filtering is accomplished, requires all of the code to be recompiled and sent to the sensor. In addition, the video sensor needs to be stopped and restarted with the new code. Thus, our goals for providing adaptive sensor functionality are to minimize the amount of code required to be compiled and transmitted to the sensor and to allow the sensor to dynamically alter its functionality without having to be manually stopped and restarted. As it turns out, the language Python allows these goals to be met.
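Returning to the buffer management of Section 2.2.4, the high/low-water-mark policy can be sketched as follows. The frame records, the numeric priority convention, and the byte accounting are hypothetical; only the drop order (lowest priority first, oldest first within a priority) reflects the description above.

    # frames: list of (priority, capture_time, size_bytes); lower priority number =
    # less important data in this sketch.

    def trim_buffer(frames, buffered_bytes, high_water, low_water):
        """Discard low-priority, old data once the buffer crosses the high-water mark."""
        if buffered_bytes <= high_water:
            return buffered_bytes
        for level in sorted({priority for priority, _, _ in frames}):
            candidates = sorted((f for f in frames if f[0] == level),
                                key=lambda f: f[1])        # oldest first
            for frame in candidates:
                if buffered_bytes <= low_water:
                    return buffered_bytes
                frames.remove(frame)
                buffered_bytes -= frame[2]
        return buffered_bytes

    # Transmission runs in the opposite order: highest priority first, and oldest
    # frame first within a priority level.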


Fig. 4. The Little Sister Sensor software components.

Python is an interpreted programming language, similar in vein to TCL or Perl, that allows natively compiled functions written in high-level languages such as C to be assembled through a scripting language. As a result, it allows one to take advantage of the speed of optimized compiled code (e.g., JPEG compression routines, networking, etc.) while retaining the flexibility of a scripting language. For our video sensor software, each component is compiled into its own object code with an appropriate Python interconnect. A Python script can then be constructed to stitch the components together. To change the functionality of the video sensor, such as its compression algorithm, one need only compile the object for the new compression algorithm, load the object onto the sensor, and update the script that the sensor is using. We have set up the video sensor code to automatically re-read the script every 10 seconds, so a change in the script changes the functionality of the sensor on the fly. Our performance measurements of the Python-based interconnects show an overhead of approximately 0.5 frames per second for the system, or approximately 5%. We believe that the additional flexibility gained by such a system is worth the overhead.
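As an illustration of the scripting approach, a sensor script in the spirit of Section 2.2.5 might look like the sketch below. The module names capture, changefilter, jpeg, and streamer are hypothetical stand-ins for the natively compiled components; the actual Panoptes script is not shown in this article.

    # Hypothetical component-wiring script; each imported module wraps compiled C code.
    import capture, changefilter, jpeg, streamer

    def process_one_frame():
        frame = capture.next_frame()             # blocks until the camera delivers a frame
        if not changefilter.interesting(frame):  # application-specific filtering (Section 2.2.3)
            return
        packet = jpeg.compress(frame, quality=75)
        streamer.enqueue(packet, priority=1)     # hand off to the buffering/streaming component

    # Swapping in a different compressor only requires uploading its object file and
    # editing this script; the sensor re-reads the script periodically and picks up
    # the change without being stopped and restarted.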

3. THE LITTLE SISTER SENSOR NETWORKING APPLICATION

Video-sensor networking technologies must be able to provide useful information to the applications; otherwise, they are merely capturing data that will never be used. To demonstrate the usefulness of video-based sensor-networking applications, we have implemented a scalable video surveillance system using the Panoptes video sensor. The system allows video sensors to connect to it automatically and allows the sensors to be controlled through the user interface. The video surveillance system consists of a number of components, including the video sensors, a video aggregating node, and a client interface. The components of the system are shown in Figure 4 and are described in the rest of this section.

3.1 The User Interface

The user interface for the Little Sister Sensor Networking application that we have deployed in our lab is shown in Figure 5. In the bottom center of the application window is a list of the video sensors that are available to the user. The list on the right is a list of events that the video sensors have captured. The cameras are controlled by a number of parameters, which are described in the next section. The video window on the left allows events to be played back. In addition, it allows basic queries to be run on the video database. We describe the queries that our system can run in Section 3.3.

3.2 Video Sensor Software

In this application, the video sensors are fully powered and employ 802.11 wireless networking to connect the sensors to the aggregating node. To maximize the scalability of the system, we have implemented a simple change detection filtering algorithm.


Fig. 5. The Little Sister Sensor Networking client interface.

The basic goal of the motion filtering is to identify events of interest and to capture video for each such event. The algorithm does a pixel-by-pixel comparison in the luminance channel. If enough pixels within a macroblock differ from the reference frame by more than a threshold, then the image is marked as different and recording of the video data begins. The video is then recorded until motion stops for a user-defined period, the event end time. The event end time allows us to continue recording in the event that the object being recorded stops for a while and then continues moving. For example, a person walking into the room, sitting down to read a few Web pages, and then leaving may produce 5-second periods where no motion is perceived (i.e., the person is just reading without moving).

In addition to the event recognition component, we propose a simple bitmapping algorithm for efficient querying of and access to the stored video data. We create a map of the video data as an event is being recorded. For each image in the event, an image bitmap is created in which each bit represents whether or not the corresponding luminance block has changed from the first image of the event. This image bitmap indicates where the interesting areas of the video are. Furthermore, as will be described in the next section, the video aggregation node can use this to expedite queries for the users.

Upon activation, the sensors read their configuration file to set up the basic parameters by which they should operate, including frame rate, video quality, video size, IP address of the video aggregator, etc. While we statically define the parameters by which they operate, one can easily imagine incorporating other techniques for managing the sensors automatically.

3.3 Video Aggregation Software

The video aggregation node is responsible for the storage and retrieval of the video data for the video sensors and the clients. It can be located at any IP-connected facility. There are a number of components within the video aggregation node; the three principal parts are the camera manager, the query manager, and the stream manager.

The camera manager is responsible for dealing with the video sensors. Upon activation, the video sensors register themselves with the camera manager, providing information such as the name of the video sensor. The camera manager also handles all the incoming video from the video sensors. To maximize the scalability of the sensor system, multiple camera managers can be used. One important task of the camera manager is to create an event overview map using the bit-mapped information that is passed from the video sensor. The purpose of the event overview map is to summarize the entire event to aid in the efficient querying of the video data. The event overview map can be constructed in a number of ways. In this paper, we describe one relatively simple technique.
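A minimal sketch of the per-frame image bitmap described in Section 3.2 is given below. The 8 × 8 block size matches the query grid used by the client, but both thresholds are hypothetical values chosen only for illustration.

    import numpy as np

    BLOCK = 8          # block size; the client query grid also uses 8 x 8 regions
    PIXEL_THRESH = 20  # hypothetical per-pixel luminance difference threshold
    COUNT_THRESH = 16  # hypothetical number of changed pixels for a block to be marked

    def image_bitmap(luma, first_luma):
        """One bit per block: has the block changed relative to the event's first image?"""
        h, w = luma.shape
        bits = np.zeros((h // BLOCK, w // BLOCK), dtype=bool)
        for by in range(h // BLOCK):
            for bx in range(w // BLOCK):
                cur = luma[by*BLOCK:(by+1)*BLOCK, bx*BLOCK:(bx+1)*BLOCK].astype(int)
                ref = first_luma[by*BLOCK:(by+1)*BLOCK, bx*BLOCK:(bx+1)*BLOCK].astype(int)
                changed = np.abs(cur - ref) > PIXEL_THRESH
                bits[by, bx] = int(changed.sum()) >= COUNT_THRESH
        return bits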


Fig. 6. Event bitmapping example.

Fig. 7. Example of bit-map query. The right window shows the result of a query for events that relate to the area of the computer in the foreground.

Other techniques that track motion over time and create vectors of motion could also be integrated into the system. Union maps take all the image bitmaps for a single event and combine them into the event overview map using a bitwise OR. This allows the system to quickly find events of interest (e.g., Who took the computer that was sitting here?). An example of the union map for someone walking through our lab (Figure 6(a)) is shown in Figure 6(b).

The query manager is responsible for handling requests from the clients. Queries are entered into the video window. The user can left-click to highlight 8 × 8 pixel regions within the video screen and can select any arbitrary shape of regions of interest. Upon receiving the query, the query manager finds all events within the system that have one of the selected regions set in their event overview map. The list of matching events is then returned to the user. As an example, we show a sample query in which the user highlighted part of the computer at the bottom of the image (see Figure 7). The query manager responded with only three events. Compared with the large list of events from the same camera in Figure 5, the simple bitmapping algorithm has reduced the number of events considerably. Note that the last event on the list is a video clip capturing a student moving the computer to his cube.

The stream manager is responsible for streaming events of interest to the clients.


We have implemented the camera, query, and stream managers as separate components in order to maximize the scalability of the system. Although we run all three components on a single host, it is possible to have them on geographically separated hosts.
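The event overview map and the region query of Section 3.3 reduce to a bitwise OR and an AND test over the block grid. The sketch below reuses the image_bitmap layout from the earlier sketch; the event records are hypothetical.

    import numpy as np

    def event_overview_map(image_bitmaps):
        """Union map for one event: bitwise OR of all of its per-frame bitmaps."""
        overview = np.zeros_like(image_bitmaps[0], dtype=bool)
        for bitmap in image_bitmaps:
            overview |= bitmap
        return overview

    def matching_events(events, query_mask):
        """Return the ids of events whose overview map overlaps the highlighted regions.

        events:     list of (event_id, overview_map) pairs held by the query manager
        query_mask: bool array over the same block grid, set where the user clicked
        """
        return [event_id for event_id, overview in events
                if np.any(overview & query_mask)]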

4. EXPERIMENTATION

In the first part of this section, we describe the experimental results that we obtained from the various components of the video sensor, including metrics such as power consumption, frame rate, and adaptability to networking resources.

4.1 USB Performance

One of the interesting limitations of using standard USB to receive the video data from the camera is that its internal bandwidth is limited to 12 megabits per second. This 12 megabits includes USB packet header overhead, so the actual usable bandwidth is less. For a typical web camera capturing 4:2:0 YUV data at 320 × 240 pixel resolution, the theoretical maximum sustainable frame rate is only 13 frames per second over USB. Fortunately, or unfortunately, most USB cameras provide primitive forms of compression over the USB bus using mostly proprietary algorithms. This compression has a number of implications, however. First, the quality may be degraded for the application. Second, it may require additional computation cycles on the host to which the camera is connected. For the Logitech cameras that we are using, the compression ratio from the USB camera is small enough that the data is not suitable for wireless network transmission, requiring it to be decompressed and recompressed into a more network-friendly format. The alternatives to standard USB are FireWire and USB 2.0. Most of the low-power embedded processors do not support either technology because the manufacturers feel that the processors are unable to fully utilize the bandwidth or would spend a significant amount of their power and processing dealing with such devices.

To test the video capture capabilities of the sensor, we set it up to grab video frames from the camera as quickly as possible and then simply discard the data. For each resolution and USB compression setting, we recorded the frame rate as well as the amount of load that doing so puts on the sensor. We measured two metrics for a variety of parameters over 3,000 captured frames: (i) the average frame rate captured and (ii) the amount of load placed on the system. To measure frame rate, we took the total number of frames captured and divided it by the time required to capture all of the frames. The second metric shows us the load that the driver places on the system. To measure this, we ran the experiment to capture 3,000 frames and then used the getrusage() system call to find the user, system, and total time of the experiment. We then calculated system load by summing the user and system times and dividing by the total time.

Table I lists the performance of the video sensor using the various compression settings and frame sizes. The Philips-based video camera can only be set to three resolutions: 160 × 120, 320 × 240, and 640 × 480. As shown in the table, the sensors are easily able to capture 160 × 120 video. This is not unexpected, as the total bandwidth required to transmit 160 × 120 video at 30 frames per second is only 6.9 megabits per second, well beneath the USB bus bandwidth limit. For the various compression levels (1 being a higher quality stream with less compression and 3 being the lowest quality stream with high compression), we found that the system load introduced can be quite significant for the lightweight sensor. At the lowest compression setting, 22% of the CPU capacity is needed to decompress the video data from the USB camera for the Bitsy and 12% of the CPU for the Stargate.
We believe that much of this time is spent touching memory and moving it around, rather than running a complex algorithm such as an IDCT. Using higher compression for the video data from the USB camera reduces the amount of system load introduced. We suspect that this is due to the smaller memory footprint of the compressed frame.
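The uncompressed-over-USB limits quoted in Section 4.1 follow directly from the 12-bits-per-pixel budget of 4:2:0 video; the quick check below ignores USB packet header overhead.

    # Back-of-the-envelope check of the USB 1.0 numbers quoted above.
    USB1_BITS_PER_SEC = 12_000_000  # nominal USB 1.0 bandwidth, headers ignored
    BITS_PER_PIXEL = 12             # 4:2:0 YUV: 8 bits luma + 4 bits chroma per pixel

    for width, height in [(160, 120), (320, 240), (640, 480)]:
        frame_bits = width * height * BITS_PER_PIXEL
        rate_30fps = frame_bits * 30 / 1e6
        max_fps = USB1_BITS_PER_SEC / frame_bits
        print(f"{width}x{height}: {rate_30fps:.1f} Mb/s at 30 fps, "
              f"at most {max_fps:.1f} fps uncompressed")

    # 160x120 needs about 6.9 Mb/s at 30 fps, 320x240 tops out near 13 fps, and
    # 640x480 near 3 fps, matching the figures in the text.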


Table I. Effect of USB Compression on Frame Rate and System Usage

Image Size   Compression   Bitsy Frame Rate   Bitsy %System CPU   Stargate Frame Rate   Stargate %System CPU
160 × 120    0             29.64              4.48                30.17                 7.48
             1             29.77              22.29               30.14                 12.56
             3             29.88              15.71               30.38                 9.96
320 × 240    0             4.88               2.85                4.97                  3.51
             1             28.72              67.17               30.76                 63.85
             3             29.98              44.50               30.00                 42.01
640 × 480    0             —                  —                   —                     —
             1             14.14              83.66               13.29                 99.43
             3             14.73              77.65               14.93                 100.00

This table shows the ability of the sensors to capture video from the Logitech web camera and the amount of CPU required for each.

At 320 × 240, we encounter the Achilles' heel of the USB-based approach. Using uncompressed data from the camera, we are only able to achieve a frame rate of 5 frames per second (similar to the PC-Card-based approaches). With higher overhead (i.e., more time spent on decompression), we can achieve full frame rate video capture. In addition, we see that the amount of system load introduced is less than that required for the 160 × 120 stream. We suspect that this is again due to I/O being relatively slow on the video sensor. At 640 × 480, the video camera driver will not allow the uncompressed mode to be selected at all. Theoretically, one could achieve about 3 frames per second across the USB bus, but we suspect that if this mode were available, only 1 frame per second would be achievable. Using compression, we are able to achieve 14 frames per second, but we pay a significant penalty in having the video decompressed in the driver. As an aside, we are currently working on obtaining an NDA with Philips so that the decompression within the driver can be optimized, as well as possibly allowing us to stay in the compressed domain.

4.2 Compression Performance

We now focus on the ability of the video sensor to compress data for transmission across the network. Recall that we are interested in using general-purpose software, so that algorithms such as filtering or region-of-interest coding can be applied on an application-specific basis. Software compression also gives us control over the algorithms that are used for compression (e.g., nv, JPEG, H.261, or MPEG). To measure the performance of compression on the 206-MHz Intel StrongARM processor, we measure the performance of an off-the-shelf JPEG compression algorithm (ChenDCT) and of a JPEG compression algorithm that we implemented to take advantage of Intel's Performance Primitives (IPP) for the StrongARM and Xscale processors. In particular, there are hand-coded assembly routines that use the architecture to speed up multimedia algorithms, including routines for the DCT, quantization, and Huffman encoding. For test data, we use a sample image in 4:2:0 YUV planar form taken from our lab and use it to test just the compression component. For each test, we compressed the image 300 times in a loop and averaged the measured compression times.

As shown in Table II, we are able to achieve real-time compression of 320 × 240 pixel video using the Intel Performance Primitives. More importantly, it takes approximately 1/3 or 1/6 the time using the primitives compared with using a non-Intel-specific, software-optimized algorithm on the Bitsy and Stargate, respectively. As shown in the second row of the table, compressing a larger image scales linearly in the number of pixels; that is, the 640 × 480 pixel image takes approximately four times as long to compress as the 320 × 240 pixel image.
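The compression benchmark methodology (compress the same frame 300 times and average) is easy to reproduce on a desktop with any JPEG encoder. The sketch below uses Pillow purely as a stand-in; it is not the IPP- or ChenDCT-based code that runs on the sensors, and the file name is hypothetical.

    import io
    import time
    from PIL import Image

    def average_jpeg_time(image_path, iterations=300, quality=90):
        """Average per-frame JPEG compression time over repeated encodings."""
        img = Image.open(image_path).convert("RGB")
        start = time.perf_counter()
        for _ in range(iterations):
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=quality)
        return (time.perf_counter() - start) / iterations

    # Example: seconds_per_frame = average_jpeg_time("lab_320x240.png")
    # 1.0 / seconds_per_frame gives the sustainable compression-only frame rate.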

Table II. Standalone Optimized vs. Unoptimized Software Compression Routines

Image Size   Bitsy IPP (ms)   Bitsy ChenDCT (ms)   Stargate IPP (ms)   Stargate ChenDCT (ms)
320 × 240    26.65            73.69                20.18               124.55
640 × 480    105.84           291.28               85.82               518.85

This table shows the performance of the sensors compressing a single image repeatedly with no video capture.

Table III. Standalone Optimized vs. Unoptimized Software Capture and Software Compression Routines

Image Size   Bitsy IPP (ms)   Bitsy ChenDCT (ms)   Stargate IPP (ms)   Stargate ChenDCT (ms)
320 × 240    29.20            80.63                41.05               171.46
640 × 480    115.42           319.71               164.30              725.72

This table shows the additional overhead incurred by the sensor in both capturing and compressing video.

Furthermore, we are able to achieve approximately 10 frames per second using a high-quality image. It should be noted that the compression times using the IPP depend on the actual video content. In comparing the two platforms, it appears that the Stargate platform is able to outperform the Bitsy platform using the Intel Performance Primitives but cannot outperform it using the software compression algorithm. We believe that this is because (i) the Xscale device has a faster processor and can take advantage of it when the working set is relatively small and (ii) the memory accesses on the Stargate seem to be a little slower than on the Bitsy. Finally, we note that using grayscale images reduces all the figures by approximately 1/3. This is not entirely surprising, as the U and V components make up one-third of the data in a color image.

4.3 Component Interaction

Having described the performance of individual video sensor components, we now focus on how the various components come together. Because the capture and compression routines make up a large portion of the overall computing requirement for the video sensor, we are interested in understanding the interaction between them. Table III shows the performance of the sensor in capturing and compressing video data. Interestingly, capture and compression with the Intel Performance Primitives results in approximately 4 milliseconds of overhead per frame captured for the Bitsy sensor. This scales linearly as we move to 640 × 480, requiring an additional 16 milliseconds per frame. For the ChenDCT algorithm, using either 320 × 240 or 640 × 480 video, the capture of data introduces a 24-millisecond overhead per frame. This seems to indicate that, because the ChenDCT algorithm cannot keep up with the rate at which video can be captured, the I/O cost is amortized during compression. To fully understand what is going on, we instrumented a version of the code to measure the major components of the system. To do this, we inserted gettimeofday() calls within the source code and recorded the amount of time spent within each major code segment over 500 frames. The time spent in each of these components is shown in Table IV. For the 320 × 240 pixel images, nearly all the time is spent in the USB decompression module and in compressing the video data. Our expectation is that, with an appropriately optimized USB decompression module, we will be able to achieve near real-time performance.


Table IV. Average Time per Software Component

Software Component     320 × 240 (ms)   640 × 480 (ms)
PWC Decode                  16.96            55.99
JPEG Encode                 21.08            85.85
Bitmap Compare               4.05            16.45
Image Copy                   1.09             6.29
Create Message               0.43             1.27
Other                       10.35            30.49

This table shows the time spent in each of the various components within the Bitsy sensor.

For applications where video quality, and not video rate, is important, we see that at 640 × 480 pixel video we are able to achieve on the order of 5 frames a second. Finally, we note that this frame rate is better than the IPAQ-based device results for 320 × 240 video data.

For the Stargate sensor, we see that capturing and compressing video adds more overhead to the system when moving to performing multiple tasks. From the compression-only results in Table II to the capture-and-compression results in Table III, we see that the Bitsy board incurs only a few additional milliseconds per frame, while the Stargate device nearly doubles its time, adding 20 milliseconds of overhead over the compression-only results. We believe that both of these observations can be explained by what seems to be a slower memory subsystem on the Stargate sensor. The Bitsy-embedded device has an extra I/O processor for all of the I/O, resulting in lower overhead in capturing data.

4.4 Power Measurements

To determine how much power is being drawn by the video sensor, we instrumented the sensor with an HP-3458A digital multimeter connected to a PC. This setup allows us to log the amount of current (and thus power) being consumed by the video sensor. To measure the amount of power required for the various components, we have run the various components in isolation or layered on top of another subsystem that we have already measured. The results of these measurements can be applied to power management algorithms (e.g., Kravets and Krishnan [2000]). Due to the recent move of the Systems Software faculty from the Oregon Graduate Institute to Portland State University, we were unable to set up our multimeter for testing of the Stargate sensor.

The results of the experiments for the Bitsy board are shown in Figure 8. From the beginning of the trace until about 6 seconds into the trace, the video sensor is turning on and configuring itself. During this time, the power being drawn by the sensor is highly variable as it configures and tests the various hardware components on the board. Seconds 6–10 show the power being drawn by the system when it is completely idle (approximately 1.5 watts). Seconds 10–13 show the video camera turned on without capturing. As shown by the differential from the previous step, the camera requires approximately 1.5 watts of power to operate. Seconds 13–16 show the camera sleeping. Thus, over a watt of power can be saved if the sensor is incorporated with other low-power video sensor technologies that notify it when to turn on. In seconds 19–22, we show the power required to have just the network card on in the system but not transmitting any data (approximately 2.6 watts). In seconds 22–27, we added the camera back into the system. Here we see that the power for the various components is pretty much additive, making it easier to manage power; that is, the jump in power required to add the camera with and without the network card is approximately the same. In seconds 27–38, we show the entire system running. As one would expect with a wireless network, the amount of power being drawn is fairly variable over time. Between seconds 38–40, we removed the camera and the network card, returning the system to idle. We then ran the CPU in a tight computational loop to show the power requirements while fully burdened. Here, we see that the system by itself draws no more than 2.5 watts of power.


Fig. 8. Power consumption profile.

Table V. Average Power Requirement in Watts

System State                   Bitsy Power (Watts)
Sleep                                0.058
Idle                                 1.473
CPU Loop                             2.287
Camera with CPU                      3.049
Camera in sleep with CPU             1.617
Networking on with CPU               2.557
Camera, Networking, CPU              4.280
Capture Running                      5.268
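Because the per-state draws in Table V are roughly additive, a coarse energy budget for a duty-cycled sensor can be estimated by weighting each state's power by the fraction of time spent in it. The following sketch uses the Bitsy figures from Table V; the duty-cycle fractions are hypothetical and only illustrate the calculation.

```python
# Average power for a duty-cycled Bitsy sensor, using the measurements in Table V (watts).
STATE_POWER_W = {
    "sleep": 0.058,
    "idle": 1.473,
    "capture_running": 5.268,  # camera + networking + CPU + capture
}

# Hypothetical duty cycle: capture 10% of the time, idle 10%, sleep the remaining 80%.
DUTY_CYCLE = {"capture_running": 0.10, "idle": 0.10, "sleep": 0.80}

avg_power_w = sum(STATE_POWER_W[state] * share for state, share in DUTY_CYCLE.items())
daily_energy_wh = avg_power_w * 24.0
print(f"average power: {avg_power_w:.2f} W, daily energy: {daily_energy_wh:.1f} Wh")
```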

Finally, we put the sensor in sleep mode (seconds 50–55). In the sleep state, the sensor requires very little power (approximately 0.05 watts). We have summarized the results in Table V.

The most important thing to draw from the experiments is that, in a given state, the power consumed by the sensor is relatively constant over time. The only exception comes when performing network transmission. As a result, we expect that the power management algorithms being worked on by others might fit into this framework without much modification. We suspect that the Stargate sensor will have approximately 1–2 watts less power dissipation than the Bitsy board. This will be entirely attributable to the lower power requirement of the board and the CPU; the networking and the video camera are expected to require the same amount of power.

4.5 Buffering and Adaptation

To test the ability of the sensor to deal with disconnected operation, we have run experiments to show how the video rate is adapted over time. In these experiments, we used a sensor buffer of 4 megabytes with high- and low-water marks of 3.8 and 4 megabytes, respectively. For these experiments, we first turned on the sensor and had it capture, compress, and stream data. The experiment then turned the network card on and off for the times shown in Figure 9(a). The "on" times are indicated by a value of 1 in the graph, while the "off" state is shown as a value of 0.


Fig. 9. Dynamic adaptation example.

As shown by Figures 9(b) and 9(c), the video sensor is able to cope with large amounts of disconnected time while managing the video buffer properly. During down times, we see that the buffer reaches its high-water mark and then runs the algorithm to remove data, resulting in the sawtooth graph shown in Figure 9(c). Once reconnected, we see that the buffer begins to drain as the networking bandwidth becomes more plentiful relative to the rate at which the video is being captured. Had the network been constrained, instead of off, the algorithm would converge to the appropriate level of video. Larger video sensor buffers behave similarly to the example in Figure 9; the only difference is that a larger buffer allows the system to be disconnected for longer periods of time.
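A minimal sketch of the watermark-driven buffering behavior described above: frames queue up while the network is unavailable, and once occupancy crosses the high-water mark, lower-priority frames are discarded until the low-water mark is reached, producing the sawtooth of Figure 9(c). The drop policy shown here (keep the highest-priority frames that fit) is a simplification, not the paper's exact prioritizing algorithm.

```python
from collections import deque

class FrameBuffer:
    """Watermark-based frame buffer for disconnected operation (simplified)."""

    def __init__(self, high_water_bytes: int, low_water_bytes: int):
        self.high = high_water_bytes
        self.low = low_water_bytes
        self.frames = deque()   # entries are (priority, encoded_frame_bytes)
        self.used = 0

    def add(self, frame: bytes, priority: int) -> None:
        """Queue a newly compressed frame; trigger a drop pass at the high-water mark."""
        self.frames.append((priority, frame))
        self.used += len(frame)
        if self.used >= self.high:
            self._drop_to_low_water()

    def _drop_to_low_water(self) -> None:
        # Keep the highest-priority frames that fit under the low-water mark.
        # (A real policy would also preserve temporal ordering and coverage.)
        kept, size = [], 0
        for priority, frame in sorted(self.frames, key=lambda pf: pf[0], reverse=True):
            if size + len(frame) <= self.low:
                kept.append((priority, frame))
                size += len(frame)
        self.frames = deque(kept)
        self.used = size

    def drain(self, budget_bytes: int) -> list:
        """While connected, send as many queued frames as the link budget allows."""
        sent = []
        while self.frames and len(self.frames[0][1]) <= budget_bytes:
            _, frame = self.frames.popleft()
            budget_bytes -= len(frame)
            self.used -= len(frame)
            sent.append(frame)
        return sent
```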

5. RELATED WORK

There are a number of technologies related to the system detailed in this article.

5.1 Video Streaming and Capture Cameras

A number of available technologies capture video data and either store the data to the local hard disk or stream the data across the network. For example, web cameras such as the Logitech 3000 come with software to allow motion-activated capture of video data. The camera, however, is not programmable and cannot be networked for storage or retrieval. Other cameras, such as the D-Link DCS-1000W, are IP streaming video cameras. These cameras capture data and stream it to the network. They were designed specifically for video streaming and capture; thus, they are not programmable and would not work for situations such as environmental monitoring, where power is extremely important.


5.2 Sensor Networking Research

A tremendous number of technologies are being developed for sensor networking applications [Estrin et al. 1992]. From the hardware perspective, there are two important sensors: the Berkeley Mote [Hill et al. 2000] and the PC-104-based sensor developed at UCLA [Bulusu et al. 2003]. The Berkeley Mote is perhaps the smallest sensor within the sensor networking world at the moment. These sensors are extremely low powered and have a very small networking range. As a result, they are really useful for collecting small amounts of simple information. The PC-104-based sensor from UCLA is the next logical progression in sensor technologies, providing slightly more computing power. We believe the Panoptes platform is the next logical platform within the hierarchy of sensor network platforms. We expect hybrid technologies to emerge, in which Motes and the PC-104-based sensors are used to trigger higher-powered sensors such as ours. This would allow the sensor network's power consumption to be minimized.

In addition to hardware sensors, there are a large number of sensor networking technologies that sit on top of the sensors themselves. These include technologies for ad hoc routing, location discovery, resource discovery, and naming. Clearly, advances in these areas can be incorporated into our video sensor technology.

5.3 Mobile Power Management

Mobile power management is another key problem for long-lived video sensors. There have been many techniques focused on overall system power management. Examples include the work being done by Kravets at UIUC [Kravets and Krishnan 2000], Noble at the University of Michigan [Corner et al. 2001], and Satyanarayanan at CMU [Flinn and Satyanarayanan 1999]. We have not yet implemented power management routines within the video platform. We expect that the work presented in the literature can be used to control the frame rate of the video being captured, as well as when the networking should be turned on and off to save power.

5.4 Video Streaming Technologies

There have been a large number of efforts focused on video streaming across both reservation-based and best-effort networks, including our own. As previously mentioned, the work proposed and developed here is different in that traditional streaming technologies focus on the continuity requirements of playback, while streaming from video sensors does not have this restriction. For video streaming across wireless networks, there have been a number of efforts focused on maximizing the quality of the video data in the event of network loss. These schemes are either retransmission based (e.g., Rhee [1998]) or forward error correction based (e.g., Tan and Zakhor [1999]). These techniques can be directly applied to the Panoptes sensor.

6. CONCLUSION

In this article, we have described our initial design and implementation of the Panoptes video sensor networking platform. This article makes a number of significant contributions. First, we have developed a low-power, high-quality video capture platform that can serve as the basis of video-based sensor networks, as well as of other application areas such as virtual reality or robotics. Second, we have designed a prioritizing buffer management algorithm that can effectively deal with intermittent network connectivity or disconnected operation to save power. Third, we have designed a bit-mapping algorithm for the efficient querying and retrieval of video data.

Our experiments show that we are able to capture fairly high-quality video running on low amounts of power, approximately the same amount of power required to run a standard night light.


In addition, we have shown how the buffering and adaptation algorithms deal with being disconnected from the network. We have also discovered that, for a low-power video sensor, the actual performance of the system depends not only on CPU speed but also on other critical components, including the I/O architecture and the memory subsystem. While we entirely expected the Stargate-embedded device to outperform the Bitsy board, we found that its memory system made it slower. The Stargate does, however, consume less power than the Bitsy boards.

Although we have made significant strides in creating a viable video sensor network platform, we are far from done. We are currently in the process of assembling a sensor with a wind-powered generator for deployment along the coast of Oregon. Our objective is to use a directed 802.11 network to have a remote video sensor capture video data for the oceanographers at Oregon State. We have an operational goal of having the sensor stay alive for a year without power or wireline services. We are also working on creating an open source platform that can be used by researchers to include the fruits of their research. The goal is to have the sensors in use for research areas such as robotics and computer vision.

ACKNOWLEDGMENT

We would like to thank Dr. Edward Epp of Intel for getting the USB drivers working on the Stargate device.

REFERENCES

BULUSU, N., HEIDEMANN, J., ESTRIN, D., AND TRAN, T. 2003. Self-configuring localization systems: Design and experimental evaluation. ACM Trans. Embed. Comput. Syst. (May).
CORNER, M., NOBLE, B., AND WASSERMAN, K. 2001. Fugue: Time scales of adaptation in mobile video. In Proceedings of the ACM/SPIE Multimedia Computing and Networking Conference (Jan.).
EKICI, E., RAJAIE, R., HANDLEY, M., AND ESTRIN, D. 1999. RAP: An end-to-end rate-based congestion control mechanism for real time streaming in the internet. In Proceedings of INFOCOM 1999.
ESTRIN, D., CULLER, D., PISTER, K., AND SUKHATME, G. 1992. Connecting the physical world with pervasive networks. IEEE Perv. Comput. (Jan.), 59–69.
FENG, W., LIU, M., KRISHNASWAMI, B., AND PRABHUDEV, A. 1999. A priority-based technique for the best effort delivery of stored video. In Proceedings of the ACM/SPIE Multimedia Computing and Networking Conference. ACM, New York, Jan.
FENG, W.-C., CODE, B., KAISER, E., SHEA, M., FENG, W.-C., AND BAVOIL, L. 2003. Panoptes: Scalable low-power video sensor networking technologies. In Proceedings of ACM Multimedia 2003. ACM, New York, Nov.
FLINN, J. AND SATYANARAYANAN, M. 1999. Energy-aware adaptation for mobile applications. In Proceedings of the Symposium on Operating Systems Principles. pp. 48–63.
HILL, J., SZEWCZYK, R., WOO, A., HOLLAR, S., CULLER, D. E., AND PISTER, K. 2000. System architecture directions for networked sensors. In Architectural Support for Programming Languages and Operating Systems. pp. 83–104.
KRASIC, B., WALPOLE, J., AND FENG, W. 2003. Quality-adaptive media streaming by priority drop. In Proceedings of the 13th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2003), June.
KRAVETS, R. AND KRISHNAN, P. 2000. Application driven power management for mobile communication. Wireless Netw. 6, 4, 263–277.
RHEE, I. 1998. Error control techniques for interactive low-bit-rate video transmission over the internet. In Proceedings of SIGCOMM 1998. ACM, New York.
STOCKDON, H. AND HOLMAN, R. 2000. Estimation of wave phase speed and nearshore bathymetry from video imagery. J. Geophys. Res. 105, 9 (Sept.).
TAN, W. AND ZAKHOR, A. 1999. Real-time internet video using error resilient scalable compression and TCP-friendly transport protocol. IEEE Trans. Multimed. 1, 2 (June).

Received January 2005; accepted January 2005


Semantics and Feature Discovery via Confidence-Based Ensemble

KINGSHY GOH, BEITAO LI, and EDWARD Y. CHANG
University of California, Santa Barbara

Providing accurate and scalable solutions to map low-level perceptual features to high-level semantics is essential for multimedia information organization and retrieval. In this paper, we propose a confidence-based dynamic ensemble (CDE) to overcome the shortcomings of the traditional static classifiers. In contrast to the traditional models, CDE can make dynamic adjustments to accommodate new semantics, to assist the discovery of useful low-level features, and to improve class-prediction accuracy. We depict two key components of CDE: a multi-level function that asserts class-prediction confidence, and the dynamic ensemble method based upon the confidence function. Through theoretical analysis and empirical study, we demonstrate that CDE is effective in annotating large-scale, real-world image datasets.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning—Concept learning; I.5.1 [Pattern Recognition]: Models—Statistical
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Classification confidence, image annotation, semantics discovery

1. INTRODUCTION

Being able to map low-level perceptual features to high-level semantics with high accuracy is critical for supporting effective multimedia information organization and retrieval. Several recent studies have proposed using either a generative statistical model (such as the Markov model [Li and Wang 2003]) or a discriminative approach (such as SVMs [Goh et al. 2001] and BPMs [Chang et al. 2003]) to annotate images. These studies make three assumptions. First, they assume that the set of semantic categories (or keywords) is known a priori and is fixed. Second, they assume that the low-level features are a fixed set of attributes. The third assumption is that once a classifier is trained with a set of training instances, the classifier remains unchanged (i.e., the training data is unchanged). When retraining is possible, it is not clear how the training data should be selected for improving class-prediction or annotation accuracy.

Let C denote the set of semantics, P the set of low-level features, and L the set of training instances that a classifier uses. The fixed C, P, and L assumptions make a classifier static, and thereby likely to suffer from the following three shortcomings:

(1) Inability to Recognize an Instance of New Semantics. When the label of an unlabeled instance u does not belong to C, u is forced to be classified into one of the categories in C.

Authors' addresses: Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106; email: [email protected]; beitao [email protected]; [email protected].


(2) Inability to Provide a Way to Realize a Potential Misprediction Automatically. The trained classifier predicts a label in C to annotate u. However, there is no way to assess whether the prediction is accurate.

(3) Inability to Comprehend the Causes of a Misprediction. Suppose a misclassified instance is eventually discovered by a user. A traditional static classifier cannot provide adequate clues to explain why the misprediction has taken place.

In this work, we propose a confidence-based dynamic ensemble (CDE) to alleviate the above shortcomings. In contrast to the traditional static ensemble schemes, CDE aims to make intelligent adjustments to C, P, and L. CDE uses multilevel indicators to assert the class-prediction confidence at the base classifier (SVM) level, the ensemble level, and the bag level. When the confidence in a prediction is low, CDE employs two methods for improvement.

—Disambiguating Conflicts. When ambiguity exists between two or more classes, CDE narrows down the candidate classes to a subset of C, say C′, and dynamically generates an ensemble for C′ to improve prediction accuracy.

—Discovering New Semantics and/or New Features. If the resulting prediction confidence is still low, CDE diagnoses the causes and provides remedies. There are three potential causes (and hence remedies) of a misprediction: (1) new semantics (remedy: adding f(u) to C), (2) under-representative low-level features (remedy: adding new features to P), and (3) under-representative training data (remedy: adding u to L).

The core of CDE is its employment of multilevel confidence factors (CFs). At the base-classifier level, we compute a binary CF to indicate the confidence of a binary decision. When |C| > 2, that is, when the number of classes is greater than two, we compute an inter-classifier CF to assert a |C|-nary class prediction. Finally, we combine CF with bagging [Breiman 1996] and use CF to select training instances judiciously and to aggregate votes intelligently among bags. Our empirical study shows that the confidence-based approach of CDE not only significantly improves class-prediction accuracy, but also provides an intuitive avenue for improving C, P, and L for knowledge discovery.

The rest of the article is organized as follows: Section 2 discusses related work. Section 3 presents the working mechanism of our dynamic ensemble; we depict the hierarchy of confidence factors, and show how CDE uses the confidence factors to improve classification accuracy and perform knowledge discovery. In Section 4, we present our empirical results. We offer our conclusions in Section 5.

2. RELATED WORK

The methods for extracting semantic information from multimedia objects (primarily images) can be divided into two main categories:

(1) Text-Based Methods. The text surrounding multimedia objects is analyzed and the system extracts the terms that appear to be relevant. Shen et al. [2000] explore the context of web pages as potential annotations for the images in the same pages. Srihari et al. [2000] propose extracting named entities from the surrounding text to index images. Benitez and Chang [2002] present a method to extract semantic concepts by disambiguating word senses with the help of the lexical database WordNet. In addition, the relationships between keywords can be extracted using relations established in WordNet. The major constraint of text-based methods is that they require the presence of high-quality textual information in the surroundings of the multimedia objects. In many situations, this requirement may not be satisfied.


(2) Content-Based Methods. More methods focus on extracting semantic information directly from the content of multimedia objects. An approach proposed by Chang et al. [1998] uses Semantic Visual Templates (SVTs), a collection of regional objects within a video shot, to express the semantic concept of a user's query. The templates can be further refined by a two-way interaction between the user and the system. Wang et al. [2001] proposed SIMPLIcity, a system that captures semantics using the "robust" Integrated Region Matching metric. The semantics are used to classify images into two broad categories, which are then used to support semantics-sensitive image retrievals. IBM Research has developed VideoAnn [Naphade et al. 2002], a semi-automatic video annotation tool that provides an easy-to-use interface for the user to annotate different regions of a video shot. More recently, Fan et al. [2004] used a two-level scheme to annotate images: at the first level, salient objects are extracted from the image and classified using SVMs; at the next level, a finite mixture model is used to map the annotated objects to high-level semantic labels. Most of the annotation approaches rely heavily on local features, which in turn rely on high-quality segmentations or regions with semantic meaning. However, segmentation can hardly be done reliably, especially on compressed images. Our prior work [Goh et al. 2001; Chang et al. 2003] uses both local and global perceptual features to annotate images. Finally, Zhang et al. suggest the use of a "semantic feature vector" to model images and incorporate the semantic classification into the relevance feedback for image retrieval [He et al. 2002; Wenyin et al. 2001; Wu et al. 2002].

All these approaches assume that C, P, and L are fixed. The ability to improve C, P, and L is where our proposed CDE approach differs from these static ones. The core of CDE is its ability to assess classification confidence. The concept of classification confidence is not new in the field of pattern recognition. Chow [1970] proposed using the "rejection" option to refuse classifying a low-confidence instance. To measure confidence, various methods [Bouchaffra et al. 1999; Platt 1999] have been proposed to map the output of a binary classifier into a posterior probability value, which represents the confidence in a "yes" prediction. Most of the studies have focused on the confidence value of binary classifiers. In this study, we identify important parameters for the confidence estimation of an M-category classifier, and also of an ensemble of M-category classifiers.

To make the classification process more adaptive to data, Ho et al. [1994] selected the most relevant classifier(s) from a pool of classifiers for each query instance. The DAG-SVM scheme proposed by Platt et al. [2000] constructs M(M − 1)/2 one-versus-one binary classifiers, and then uses the Decision Directed Acyclic Graph (DDAG) to classify a query instance. Different instances may follow different decision paths in the acyclic graph to reach their most likely class. Another approach for improving multi-category classification is to perform a multistage classification [Poddar and Rao 1993; Rodriguez et al. 1998]. This approach uses inexpensive classifiers to obtain an initial, coarse prediction for a query instance, and then selects more relevant and localized classifiers to refine the prediction. Our proposed CDE is a two-stage classification method.
In addition to its capability to adjust C, P, and L, CDE is more adaptive than the above methods in constructing relevant and localized classifiers to refine a low-confidence prediction. CDE dynamically composes an ensemble of the most relevant classifiers for an instance that would have been rejected by Chow's scheme [Chow 1970]. Usually, the dynamic composition of classifiers for query instances is discouraged due to concerns about the computational overhead. In this study, through theoretical analysis and empirical studies, we will demonstrate that a dynamic composition can achieve significantly higher annotation accuracy, and also be affordable.

3. CONFIDENCE-BASED DYNAMIC ENSEMBLE SCHEME

We employ a hierarchy of classifiers with confidence factors to produce class prediction (or semantic annotation) for images. This hierarchy of classifiers is designed for achieving three goals:


(1) Providing quantitative values for assessing the class-prediction confidence at different levels.
(2) Disambiguating a confusing class-prediction by constructing a dynamic ensemble for improving the prediction accuracy.
(3) Assisting semi-automatic knowledge discovery for enhancing the quality of low-level features and the descriptiveness of high-level semantics.

Our hierarchy of classifiers is organized into three levels:

(1) Binary-Class Level. We use Support Vector Machines (SVMs) as our base-classifier in a binary classification setting. Each base-classifier is responsible for performing the class prediction of one semantic label. We map the SVM output of the base-classifier to a posterior probability for characterizing the likelihood that a query instance belongs to the semantic category that the classifier controls.
(2) Multi-Class Level. The posterior probabilities from multiple base-classifiers are aggregated to provide a single class-prediction. A confidence factor is estimated for this aggregated prediction.
(3) Bag Level. To reduce classification variance, we employ the bagging scheme [Breiman 1996], which combines multiple bags of multi-class classifiers to make an overall prediction. An overall confidence factor is also estimated at this level.

For a prediction with low overall confidence, a new hierarchy of classifiers is dynamically constructed to improve prediction accuracy. (We discuss the dynamic ensemble scheme in Section 3.2.) If the overall confidence of the new ensemble is still low, we flag the instance as a potential candidate for new-information discovery. (We discuss new semantics discovery in Section 3.3.) In the following sections, we first depict how we estimate prediction confidence at each level. Then, we present the dynamic ensemble scheme for annotation enhancement. We discuss semantics discovery at the end of this section.

3.1 Multilevel Predictions and Confidence

We employ Support Vector Machines (SVMs) as our base-classifier. SVMs are a core machine learning technique with strong theoretical foundations and excellent empirical successes. SVMs have been applied to tasks such as handwritten digit recognition [Vapnik 1998], image retrieval [Tong and Chang 2001], and text classification. We present SVMs here to set up the context for discussion in the subsequent sections. For details of SVMs, please consult [Vapnik 1982, 1998].

We shall consider SVMs in the binary classification setting. SVMs learn a decision boundary between two classes by mapping the training examples onto a higher-dimensional space and then determining the optimal separating hyperplane within that space. Given a test example x, SVMs output a score f(x) that provides the distance of x from the separating hyperplane. While the sign of the SVM output determines the class prediction, the magnitude of the SVM output can indicate the confidence level of that prediction. However, the SVM output is an uncalibrated value, and it might not translate directly to a probability value useful for estimating confidence. Hastie and Tibshirani [1998] suggest a possible way of mapping an SVM output to a probability value by using Gaussians to model the class-conditional probability p(f | y = ±1), where y is a semantic label. Bayes' rule can then be used to compute the posterior probability as

    P(y = 1 \mid x) = \frac{p(f(x) \mid y = 1)\, P(y = 1)}{\sum_{i=-1,1} p(f(x) \mid y = i)\, P(y = i)},    (1)



where P(y = i) is the prior probability of y = i calculated from the training dataset. They further infer that the posterior probability function is a sigmoid with an analytic form:

    P(y = 1 \mid x) = \frac{1}{1 + \exp(a f(x)^2 + b f(x) + c)}.    (2)

However, this function is nonmonotonic, which contradicts the observation that P(y = 1 | x) is a strongly monotonic function of the SVM output. The reason for this contradiction could be the assumption of Gaussian class-conditional densities, an assumption that may not always be valid. For this reason, Platt [1999] suggests using a parametric model to fit the posterior P(y = 1 | x) directly, without having to estimate the conditional density p(f | y) for each y value. The Bayes' rule from Eq. (1) applied to two exponentials suggests using a parametric form of a sigmoid:

    P(y = 1 \mid x) = \frac{1}{1 + \exp(A \cdot f(x) + B)}.    (3)
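The fitting of A and B described in the next paragraph (maximum likelihood on held-out SVM outputs) can be sketched as follows. This is a minimal illustration assuming NumPy/SciPy, without the regularized targets and model-trust refinements of Platt's original procedure.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f: np.ndarray, y: np.ndarray):
    """Fit P(y=1|x) = 1 / (1 + exp(A*f + B)) by minimizing the negative log likelihood.

    f: SVM decision values on a held-out set; y: labels in {-1, +1}.
    """
    t = (y + 1) / 2.0  # map labels to {0, 1} targets

    def neg_log_likelihood(params):
        A, B = params
        z = np.clip(A * f + B, -500, 500)          # guard against overflow in exp()
        p = 1.0 / (1.0 + np.exp(z))
        eps = 1e-12
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    result = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]), method="BFGS")
    return float(result.x[0]), float(result.x[1])

def posterior(f: np.ndarray, A: float, B: float) -> np.ndarray:
    """Calibrated posterior P(y=1|x) for new SVM outputs, Eq. (3)."""
    return 1.0 / (1.0 + np.exp(np.clip(A * f + B, -500, 500)))
```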

This model assumes that the SVM outputs are proportional to the log odds of a positive example. The parameters A and B of Eq. (3) are fitted using maximum likelihood estimation from a training set. More precisely, A and B are obtained by minimizing the negative log likelihood of the sigmoid training data using a model-trust minimization algorithm.

3.1.1 Multi-Class Level Integration. To label a query instance (e.g., an image) with one out of M possible categories, we construct M one-per-class (OPC) SVM classifiers. We use the OPC scheme because our prior work [Goh et al. 2001] has validated that OPC is both effective and efficient. (Please refer to Goh et al. [2001] for details.) Let x denote a query instance, and let P(c|x) denote the probability estimate that x belongs to class c. The class of x is predicted as

    \omega = \arg\max_{1 \le c \le M} P(c \mid x).    (4)

To estimate the confidence of this prediction, we first introduce two parameters we find useful:

Definition 1. Top Posterior Probability t_p = P(ω|x).

Definition 2. Multi-class Margin t_m = t_p − max_{1≤c≤M, c≠ω} P(c|x).

Based on the definition of posterior probability, the confidence in a prediction is proportional to t_p. However, t_p alone may not be sufficient for an accurate estimation of the confidence. To illustrate this, we present a scatterplot of misclassified and correctly classified images in Figure 1(a). In Figure 1(a), it is easy to see that quite a few instances with high top posterior probability are misclassified. We observe that there is a better separation of correctly classified and misclassified instances if we use the multiclass margin t_m [Goh et al. 2001; Schapire and Singer 1998] as a supplemental criterion. The larger the t_m, the less likely it is that an instance is misclassified. It is very unlikely that an instance with both a high t_p and a large t_m can be misclassified. Figure 1(b) displays the relationship between the class-prediction accuracy and the multiclass margin t_m. We found the following general form of functions useful in modeling the relationship between the class-prediction accuracy and t_m:

    g(t_m) = A + \frac{B}{1 + \exp(-C \cdot t_m)}.    (5)
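On validation data, the parameters A, B, and C of Eq. (5) can be obtained by binning predictions on their margin, measuring per-bin accuracy, and fitting the sigmoid to those points (the empirical fit shown in Figure 1(b)). A minimal sketch, assuming SciPy's curve_fit rather than the authors' own fitting procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def g(t_m, A, B, C):
    """Eq. (5): expected accuracy as a sigmoid function of the multi-class margin."""
    return A + B / (1.0 + np.exp(-C * t_m))

def fit_accuracy_vs_margin(margins, correct, n_bins: int = 20):
    """Bin validation predictions by margin, compute per-bin accuracy, and fit g()."""
    margins = np.asarray(margins, dtype=float)
    correct = np.asarray(correct, dtype=float)   # 1.0 if the prediction was right, else 0.0
    edges = np.quantile(margins, np.linspace(0.0, 1.0, n_bins + 1))
    centers, accuracy = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (margins >= lo) & (margins <= hi)
        if mask.any():
            centers.append(margins[mask].mean())
            accuracy.append(correct[mask].mean())
    params, _ = curve_fit(g, np.array(centers), np.array(accuracy),
                          p0=[0.5, 0.5, 5.0], maxfev=10000)
    return tuple(params)  # (A, B, C)
```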


Fig. 1. Using Margin as Part of CF. (a) Scatterplot of misclassified and correctly classified images, and (b) classification accuracy vs. multi-class margin with sigmoid fit.

Parameters A, B, and C can be determined through empirical fitting, as shown in Figure 1(b). With the margin t_m, we formulate the confidence level of an annotation at the multi-class level through the following function:

    CF_m = \sqrt{t_p \cdot g(t_m)}.    (6)

CF_m includes the consideration of multiple factors in determining the confidence of a prediction. At the same time, it retains a linear relationship with the expected prediction accuracy. A higher CF_m implies that the OPC classifier has a higher confidence in its prediction.

3.1.2 Bag Level Integration. To reduce class-prediction variance, we use the majority voting result of multiple bags [Breiman 1996] as the overall class prediction. Each bag is a multiclass classifier, which deals with a subset of the training data. Suppose we use B bags to determine the class of an instance. With the help of confidence factors, not only is the prediction of the bth bag ω_b (b = 1, ..., B) known, but also the confidence level of that prediction (given by CF_m(ω_b)). For the bags with higher confidence, their votes should be given greater consideration during the final tally. Thus, we weigh each bag's vote by the confidence factor CF_m(ω_b). The final prediction is formulated as

    \hat{\omega} = \arg\max_{1 \le c \le M} \sum_{\omega_b = c} CF_m(\omega_b).    (7)

To evaluate the confidence level of the overall prediction, we identify two useful parameters:

Definition 3. Top Voting Score V_p = Σ_{ω_b = ω̂} CF_m(ω_b).

Definition 4. Voting Margin V_m = V_p − max_{1≤c≤M, c≠ω̂} Σ_{ω_b = c} CF_m(ω_b). Under the situation of unanimous voting, V_m = V_p.
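Eq. (6), Eq. (7), and Definitions 3 and 4 combine into the confidence-weighted vote sketched below; CF_b follows Eq. (8), which is given immediately after. The helper assumes that each bag already returns per-class posteriors and that g() has been fitted as above; it is an illustration, not the authors' implementation.

```python
import numpy as np

def multiclass_confidence(probs: np.ndarray, g):
    """One bag: predicted class and CF_m = sqrt(t_p * g(t_m)) from Eq. (6)."""
    order = np.argsort(probs)[::-1]
    omega = int(order[0])
    t_p = probs[order[0]]
    t_m = probs[order[0]] - probs[order[1]]        # multi-class margin (Definition 2)
    return omega, float(np.sqrt(t_p * g(t_m)))

def bag_vote(bag_probs: list, g):
    """Confidence-weighted voting over B bags (Eq. (7)) plus CF_b (Eq. (8))."""
    B = len(bag_probs)
    M = len(bag_probs[0])
    scores = np.zeros(M)                           # summed CF_m per class
    for probs in bag_probs:
        omega_b, cf_m = multiclass_confidence(np.asarray(probs, dtype=float), g)
        scores[omega_b] += cf_m
    ranked = np.argsort(scores)[::-1]
    omega_hat = int(ranked[0])
    v_p = scores[ranked[0]]                        # top voting score (Definition 3)
    v_m = v_p - scores[ranked[1]]                  # voting margin (Definition 4); = v_p if unanimous
    cf_b = np.sqrt(v_p * g(v_m)) / B
    return omega_hat, float(cf_b)
```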


Finally, we define the confidence of the overall prediction as

    CF_b = \frac{\sqrt{V_p \cdot g(V_m)}}{B}.    (8)

The denominator in the equation above normalizes the confidence factor to be within the range [0, 1]. A prediction with a high CF is more likely to be accurate. The formal description of the multilevel annotation algorithm is presented in Figure 2. When the class prediction of an instance is assigned a high overall level of confidence, we output the predicted class. For those with a low CF, we enhance their accuracy with the dynamic ensemble algorithm, which we describe in Section 3.2.

3.2 Dynamic Ensemble Scheme

For the instances with low annotation confidence, our system makes an extra effort to diagnose and enhance their annotations. The system dynamically builds an ensemble of OPC classifiers for each instance with a low-confidence annotation. Our aim is to reduce the number of classes in the dynamic ensemble without losing any classes to which the instance semantically belongs.

The principle of the dynamic ensemble is best explained by the theory of Structural Risk Minimization [Vapnik 1998]. Let C be the set of all classes, and let Ω be the set of classes considered in the dynamic ensemble. The difference between the two sets, Δ = C − Ω, is the set of excluded classes. The elimination of Δ from the classification can result in a gain or pose a risk. When an instance x belongs to a class in the set Δ, we risk misclassifying x, since its true class has been excluded from the classification process. Conversely, when x belongs to a class in the set Ω, an improvement in the expected classification accuracy is likely when we exclude Δ from the dynamic ensemble. The gain in the classification accuracy results from two sources: (1) x will not be misclassified into the classes in Δ, and (2) the decision boundaries will be more accurate since we are only considering a subset of more relevant classes.

Formally, let P(Δ|x) be the probability that the instance x belongs to a class in Δ, and let E^−(x, C) or E^+(x, C) be the expected classification accuracy when all classes in C are considered during classification. The "−" sign represents the situation when x belongs to a class in Δ, while the "+" sign is for the situation when x does not belong to any class in Δ. When only the classes in Ω are considered, the expected classification accuracy is denoted by E^+(x, Ω). We can then express the overall expected gain in classification accuracy, denoted as ξ, as

    \xi = (E^+(x, \Omega) - E^+(x, C))(1 - P(\Delta \mid x)) - E^-(x, C)\, P(\Delta \mid x).    (9)

For each instance x, the goal of the dynamic ensemble is to maximize ξ. From Eq. (9), we observe that this goal can be accomplished by selecting an appropriate subset of classes Ω that keeps P(Δ|x) low. The posterior probabilities derived in Section 3.1 provide critical support for the class selection task. To keep P(Δ|x) low, the classes in Δ should be selected such that the probability of x's true class belonging to Δ is low. It is more logical to exclude those semantic classes which are unlikely to include x. In other words, classes with a lower posterior probability are considered less relevant to the annotation of x. Formally, we select the set of candidate classes Ω = {c | P(c|x) ≥ θ_m}, where c = 1, ..., M, and θ_m is the thresholding parameter for the posterior probabilities. The higher the θ_m, the higher the value of the term E^+(x, Ω) − E^+(x, C), which leads to a higher ξ. However, at the same time, P(Δ|x) will also be higher, which leads to a lower ξ. Through the selection of an optimal θ_m, we can maximize the expected accuracy gain ξ. In Section 4, we examine the relationship between θ_m and annotation accuracy.

Once the candidate classes in Ω are identified, the system dynamically composes an ensemble of SVM classifiers to enhance the annotation. The dynamic ensemble includes |Ω| binary SVM classifiers, each of which compares one candidate class against the other candidate classes.
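A minimal sketch of the candidate-class selection that drives the dynamic ensemble: classes whose posterior falls below θ_m are excluded, and a smaller one-per-class ensemble is retrained over the remaining classes only. The train_opc_ensemble callable is a placeholder for whatever OPC SVM training routine is in use; it is not part of the paper.

```python
import numpy as np

def select_candidates(posteriors, theta_m: float) -> list:
    """Omega = {c | P(c|x) >= theta_m}: the classes kept in the dynamic ensemble."""
    posteriors = np.asarray(posteriors, dtype=float)
    omega = [c for c, p in enumerate(posteriors) if p >= theta_m]
    if len(omega) < 2:                       # guard: keep at least the top two classes
        omega = list(np.argsort(posteriors)[::-1][:2])
    return omega

def dynamic_ensemble_predict(x, posteriors, theta_m, training_data, train_opc_ensemble):
    """Re-classify a low-confidence instance using only the candidate classes.

    training_data maps class id -> training instances; train_opc_ensemble(data, classes)
    stands in for training |Omega| one-per-class SVMs restricted to those classes.
    """
    omega = select_candidates(posteriors, theta_m)
    restricted = {c: training_data[c] for c in omega}   # roughly |Omega|/M of the full set
    ensemble = train_opc_ensemble(restricted, omega)
    return ensemble.predict(x)
```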


Fig. 2. Algorithm for multilevel predictions with confidence factors.


Fig. 3. Algorithm for DynamicEnsemble.

By excluding the influence of less relevant classes in Δ, the dynamic ensemble can significantly enhance the annotation accuracy. To illustrate this with a simple example, suppose we are very confident that an instance should be labeled as either architecture or landscape. It would be counter-productive to include images from irrelevant classes such as flowers or fireworks when training the classifiers. Instead, by focusing on the relevant classes, the dynamic ensemble can reduce the noise from less relevant classes, thus producing a more accurate annotation. The formal algorithm of the dynamic ensemble scheme is presented in Figure 3. In our empirical studies, we confirmed that our dynamic ensemble is capable of improving annotation accuracy (see Section 4).

By dynamically composing an ensemble of SVM classifiers for low-confidence instances, we incur additional computational overhead. However, the overhead is not too serious a concern for the following reasons:

(1) For most annotation applications, the annotation time is less of a concern than the annotation quality. Frequently, the annotation process is carried out offline.


(2) A dynamic ensemble is applied only to low-confidence annotations, and the percentage of such annotations is usually small. Furthermore, we can control that percentage by tuning the threshold θ_b.
(3) The dynamic ensemble considers only the relevant classes. Thus the training dataset for the dynamic ensemble is just about |Ω|/M as large as the entire training dataset. The ratio |Ω|/M is usually small (especially for datasets with a large number of classes). Besides, we can control this ratio through the parameter θ_m. Based on the study of Collobert and Bengio [2001], the training time for a binary SVMTorch classifier is about O(N^1.8), where N stands for the size of the training data. In addition, the dynamic ensemble needs to train only |Ω| binary SVM classifiers rather than M. To sum up, the total overhead is on the order of O(N_l (|Ω|/M)^2.8) of the training time for the entire dataset, where N_l is the number of low-confidence annotations.

For most annotation applications, M is usually large; as a result, the overhead costs of a dynamic ensemble usually come within an acceptable range. Our empirical studies show that when we pick an appropriate θ_m and θ_b, the average dynamic ensemble training time for a low-confidence annotation is affordable. For a small dataset with 1,920 images from 15 classes, the average training time is 0.3 seconds, and for a larger dataset with 25,000 images from 116 classes, the training time is 1.1 seconds. (These studies assure us that the cost is in the order of seconds, not minutes.)

If the confidence of an annotation remains low after the dynamic ensemble scheme has been applied, there are several possible reasons for that low confidence level. The instance could be semantically ambiguous and thus better characterized by multiple classes, or the existing features could be insufficient for describing the semantics in that instance. Another possible reason might be the presence of completely new semantics in the instance. A full exploration of the above scenarios is beyond the scope of this article. However, we will address some of the issues in the following section, especially those associated with the discovery of new semantics.

3.3 Knowledge Discovery

The annotation process assigns an image one or multiple semantic labels. As new images are added to the dataset, however, some of them cannot be characterized by the existing semantic labels. When this situation occurs, the system must alert the researchers so that proper human actions can be taken to maintain the annotation quality (e.g., creating new semantic labels or researching new low-level features). In the remainder of this section, we discuss how we utilize our multilevel confidence factors to facilitate the detection of new or ambiguous semantics. We also preview some experimental results and use them as illustrative examples.

3.3.1 New Semantics (L and C) Discovery. We make use of the binary classifier's posterior probability to discover completely new semantics. There are two possible courses of action we can take when a query instance contains new semantics:

(1) Enhancing C (Class Labels) for New Semantics Outside Existing Categories. Suppose we already have the following classes: architecture, flower, and vehicle. If an image of a panda appears, it presents completely new semantics outside of the existing classes.
(2) Enhancing L (Training Data) for Under-represented Semantics within Existing Categories.
Suppose we have an animals annotation system that was trained only with images of these land-based animals: tigers, elephants, bears, and monkeys. If we are given an image of camels to annotate, our system will likely make a wrong prediction even though the broad concept of animals is present in our system. A more extreme example would be a query image of whales swimming in the ocean.


Fig. 4. Insufficient training data example. The query instance is a picture of a lighthouse (frame (a)). The 5-nearest support vectors (frames (b) to (f)) are mostly from the wave category.

In these scenarios, the system contains the high-level semantic concept, but the representation of the concept in the training data is inadequate.

To conduct knowledge discovery, a well-defined ontology is necessary for two reasons. First, the ontology can help to determine the best course of action to follow when an image is singled out as one with new semantics. For example, the bears category in our image dataset contains mostly images of brown and polar bears. We often get mispredictions when the query image contains black bears. Although we can say that the training data for bear are not representative enough, it can also be argued that black bear should be a new semantic class. An ontology can dictate how specific keywords should be for describing a particular semantic. For the example above, the ontology will decide whether the bear category encompasses only polar and brown bears, or whether it includes black bears as well. Thus, the ontology will make it clear whether we need to add the mispredicted image to the training dataset L, or create a new semantic label in set C. The second reason for needing an ontology is to help determine which labels are most applicable to a query instance. In many cases, the query instance is likely to contain multiple semantics. However, not all of those semantics have corresponding labels in the semantic set, and sometimes only a few labels are allowed to be associated with the image. With an ontology at hand, we can quickly decide what labels to assign to the query instance.

Figure 4 presents an example where an ontology is useful for determining the best remedy for a misprediction. The query instance (in frame (a)) shows an image of a lighthouse; its true label is landscape and the predicted label is wave. The other five images (frames (b) to (f)) show the five nearest neighbors of the query instance; four belong to the wave category and one belongs to landscape. In this example, the vast expanse of the sky and the sea in the query instance causes it to bear more resemblance to the wave images, hence the misprediction. The only lighthouse image in the training dataset is the second nearest neighbor. As we have discussed above, there are two possible remedies: add the query instance to the landscape training dataset and retrain the classifiers so as to avoid future mispredictions for images with lighthouses, or create a new semantic category lighthouse. With an ontology as a guide, it would be clear which remedy is preferable.


Fig. 5. Example of ambiguous class predictions. Frame (a) shows the query image; the original class label is bear and the mispredicted label is wave. Frames (b) to (d) show the nearest neighbor from the three classes with the highest CF.

If the keyword lighthouse is present in the ontology, then it would be preferable to create a new label in the semantic set. If the keyword is absent, we need to add the query image to the training data of the label (landscape) that best describes the image.

We employ a simple approach to discover completely new semantics. By examining the binary classifiers' posterior probabilities, we determine whether the query instance belongs to the positive class of each binary setting. If all the probabilities are very low, we consider this instance to be an example containing new semantics. More formally, the set of instances containing new semantics can be represented as {x | max_{1≤c≤M} P(c|x) ≤ θ_n}. The higher the θ_n value, the more likely it is for instances to be regarded as new knowledge. In Section 4.3, we report on the tradeoff between recall and precision of the new-semantics discovery with varying θ_n.

3.3.2 Ambiguity Discovery (P). There are two main causes of ambiguous predictions: (1) the query instance contains multiple semantics, which makes multiple labels applicable, and/or (2) the existing feature set is unable to distinguish between two or more classes. By including the multiclass and voting margins in our confidence factors, CF_b will be low when there is a close competition between classifiers of different classes. If an instance x satisfies the following three conditions, the system deems it to be ambiguous:

—There are no new semantics, that is, max_{1≤c≤M} P(c|x) > θ_n.
—The confidence of the final prediction, CF_b, is low.
—After the dynamic ensemble technique is applied, the ambiguity is still unresolved.

When the final class-prediction confidence level is still low after applying CDE, we examine the candidates chosen to form the dynamic ensemble and their multi-class confidence factors (CF_m). If all the confidence factors are close to each other, we use the query instance as a possible candidate for feature discovery.

Figure 5 shows an example where a mispredicted query instance may be classified into multiple classes. By having better features, we may avoid this misprediction.


Fig. 6. Inadequate low-level features example. Query instance is an image of an elephant that is misclassified into the tiger category. Images (b)–(d) show the three nearest training images from the elephant category, and images (e)–(g) show the three nearest training images from the tiger category.

Frame (a) shows the query instance, a picture with snowcapped mountains, a vast meadow, and a tiny bear. The correct class label is bear and the prediction given by CDE is wave. Frames (b) to (d) show images from the three classes that have the highest CF_m. The features used in our empirical study are mainly based on color and texture information. If we had features that could describe shapes in the foreground and background, we could potentially avoid mispredicting the query as a wave image. Since it would not be wrong to classify the query as either a bear or a landscape image, we require additional features to distinguish between these two competing classes. CDE is able to identify such ambiguous instances so that in-depth studies can be conducted to discover new, better features.

Figure 6 shows an example where the existing features are not sufficient to distinctly separate two semantic classes (the second cause of misprediction). The query instance shown in frame (a) is that of an elephant against a mostly brownish background. Using CDE, the prediction for this query is the tiger class. Frames (b) to (d) show the three nearest training images from the elephant class. These images contain elephants of different sizes with a variety of backgrounds, including the sky and greenish land. The three nearest training images from the tiger class (frames (e) to (g)) mostly show a single tiger against a predominantly brownish background. From this example, it is clear that our existing global color and texture features are insufficient.


If we had local features, or feature-weighting schemes that could assign less importance to background information, we could potentially avoid this sort of misprediction. The task of feature discovery remains one of the most challenging problems. Again, with CDE, we can provide an effective channel for identifying useful images for feature discovery.
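The new-semantics test of Section 3.3.1 and the ambiguity conditions of Section 3.3.2 can be summarized as a small triage routine over the quantities CDE already computes. The thresholds and return labels below are illustrative assumptions, not values from the paper.

```python
def triage(posteriors, cf_b, candidate_cf_m, theta_n=0.1, theta_b=0.5, eps=0.05):
    """Route a prediction to a knowledge-discovery action.

    posteriors: per-class binary posteriors P(c|x); cf_b: overall confidence after the
    dynamic ensemble; candidate_cf_m: CF_m values of the candidate classes in Omega.
    """
    if max(posteriors) <= theta_n:
        return "new-semantics"       # candidate for a new label in C or new training data in L
    if cf_b >= theta_b:
        return "accept"              # confident prediction: keep the assigned label
    if max(candidate_cf_m) - min(candidate_cf_m) <= eps:
        return "ambiguous"           # candidate for low-level feature (P) discovery
    return "review"                  # low confidence but not clearly ambiguous
```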

4. EMPIRICAL STUDY

Our empirical studies can be divided into three parts:

(1) Multilevel Prediction Scheme Evaluation. We first studied the effectiveness of using the multiclass confidence factor (CF_m) for bagging in terms of annotation accuracy. In addition, we studied the effectiveness of the confidence factor at the bag level (CF_b).
(2) Dynamic Ensemble Scheme Evaluation. We examined whether our dynamic ensemble (DE) scheme could improve the annotation of instances with low CF_b, and thereby improve the overall annotation accuracy.
(3) Knowledge Discovery. We investigated some ways in which CDE could identify instances that contain new semantics and also diagnose mispredictions due to under-representative training data.

For our empirical studies, we used the following two datasets:

2K-image dataset. This dataset consists of 1,920 images collected from the Corel Image CDs. The dataset contains representative images from fifteen Corel categories: architecture, bears, clouds, elephants, fabrics, fireworks, flowers, food, landscape, people, objectionable images, textures, tigers, tools, and waves. Each category is made up of 90 to 180 images. These Corel images have been widely used by the computer vision and image-processing research communities. We used this dataset to evaluate the effectiveness of our multilevel prediction scheme and our dynamic ensemble scheme, as well as to conduct knowledge discovery experiments.

25K-image dataset. These images were compiled from both the Corel CDs and the Internet, after which we manually classified them into 116 categories. This dataset is more diversified and thus more challenging for learning.¹

We characterized each image in our datasets by two main features: color and texture. The color features include color histograms, color means, color variances, color spreadness, and color-blob elongation. Texture features were extracted from three orientations (vertical, horizontal, and diagonal) at three resolutions (coarse, medium, and fine). A total of 144 features [Tong and Chang 2001] were extracted: 108 from colors and 36 from textures.

For each of the datasets, we set aside 20% of the data for the test set and used the remaining 80% for the training set. We further subdivided the training set into two parts: the SVM-training set (80% of the data) for training the SVM binary classifiers, and the validation set (20% of the data) for learning the confidence-factor functions described in Section 3.1. We used cross-validation² to obtain the optimal number of bags for each dataset. For both datasets, using five bags of classifiers produced the lowest annotation error rate.

In Table I, we summarize the annotation error rates for both datasets. The results show that we can reduce the error rate at each step of our proposed CDE scheme. First, bagging reduces the error rate by 4.4% for the 2K-image dataset, and by 3.1% for the 25K-image dataset. The dynamic ensemble scheme further reduces the error rates by 2.7% and 4.0%, respectively.

¹ To accurately classify a dataset with a large number of categories, the base classifier itself must be able to deal with the imbalanced-dataset situation [Wu and Chang 2003]. We do not discuss imbalanced-dataset classification in this article.
² Since cross-validation techniques have been used extensively in numerous domains, including classification and machine learning, we do not go into their details in this article.

Table I. Summary of Annotation Error Rates

                      1 Bag    5 Bags    After DE
2K-image Dataset      19.1%    14.7%     12.0%
25K-image Dataset     70.8%    67.7%     63.7%

Fig. 7. Examples of annotation results (the annotation label and CF are shown).

Figure 7 shows twelve frames of qualitative annotation results. The labels show the categories and confidence factors (CFs) for each frame. Frames (a) to (f) each show an example with high prediction confidence, where the label is an accurate description of the content. Frames (g) to (i) show examples with low annotation CFs. For these images, we observe that the labels can partially describe what our human eyes perceive. The low CFs provide an indication of the need for better perceptual features, or point out the inadequacy of our training pool. For instance, Frame (h) could easily be misclassified as clouds. Frames (j) to (l) display examples where misclassification might have


Fig. 8. Effectiveness of CFm.

been caused by under-represented training data. Frame (j) is an architecture image wrongly predicted to be landscape, (k) is a picture of a black bear mistakenly assumed to belong to the elephant class with low confidence, and lastly, (l) is a landscape image wrongly labeled as waves. In the rest of this section, we present our experimental results in greater detail. 4.1 Evaluation of Multilevel Prediction Scheme Our classification scheme makes use of confidence factors in two areas: (1) During bagging, the multi-class confidence factor CFm (Eq. (6)) is used, instead of the top posterior probability, to aggregate the prediction of each bag and produce the final prediction of a query instance. (2) Once bagging has been completed, a bag-level confidence factor CFb (Eq. (8)), is assigned to the final prediction. CFb is used to identify query instances that will undergo further annotation refinement with the dynamic ensemble scheme. We present the results of using the 2K-image dataset to illustrate the effectiveness of our confidence factors. We applied 50% sampling on the SVM-training set to form the bags for training the classifiers. The bags of classifiers were then applied to both the test set and the validation set. The parameters for the confidence functions were learned using the validation set. The resulting functions were used to compute class-prediction confidence factors for testing data. To demonstrate the effectiveness of CFm , Figure 8 compares the annotation results using CFm with those using top posterior probability. For both the validation and the test set, the method using CFm outperforms that using top posterior probability for various numbers of bags. The figure also shows that the CFm function trained by the validation set generalizes well to the test set. Using CFm lowers the error rate by 1.6% for the validation dataset, and 3.2% for the test dataset. Next, we evaluated the usefulness of the bag level confidence factor CFb. Ideally, when a prediction is correct, we expect to assign a high confidence level to it. In Figure 9, we plot a curve showing the prediction accuracy at each CFb value. The figure shows that when CFb is high, the prediction accuracy is also high, and at lower CFb values, the accuracy also tends to be low. There is a clear correlation between CFb and the classification accuracy of the testing data. This indicates that the CFb we formulated can be generalized to track the prediction accuracy of testing data. ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.
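The correlation between CFb and accuracy reported in Figure 9 can be summarized with a simple reliability-style curve: bin the test predictions by their CFb values and compute the accuracy within each bin. The sketch below is illustrative only; it assumes NumPy, CFb values already produced by the learned confidence functions and scaled to [0, 1], and arrays of predicted and true labels.

```python
# Illustrative sketch: estimate prediction accuracy as a function of the
# bag-level confidence factor CFb (cf. Figure 9). The CFb values themselves
# are assumed to come from the confidence functions learned on the
# validation set; Eq. (8) is not reproduced here.
import numpy as np

def accuracy_by_confidence(cfb, y_pred, y_true, n_bins=10):
    cfb = np.asarray(cfb, dtype=float)          # assumed to lie in [0, 1]
    correct = np.asarray(y_pred) == np.asarray(y_true)
    bins = np.minimum((cfb * n_bins).astype(int), n_bins - 1)
    curve = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            center = (b + 0.5) / n_bins
            curve.append((center, correct[mask].mean(), int(mask.sum())))
    return curve  # (CFb bin center, accuracy, number of predictions)
```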


Fig. 9. Effectiveness of bag-level CFb.

Fig. 10. Error rates for low-confident predictions after dynamic ensemble.

4.2 Evaluation of the Dynamic Ensemble Scheme This experiment examined the impact on annotation quality after applying the dynamic ensemble (DE) scheme. More specifically, we tried to improve the annotation accuracy by disambiguating the original low-confidence annotation. In Figure 10, we plot the error rates of the low-confidence predictions against the threshold θb for bag-level CFb. When a prediction’s CFb is below θb, we deem the prediction to be a low-confidence one. The figure shows plots for both the 2K- and 25K-image datasets. The posterior probability threshold θm was set empirically to 0.05. In Figures 10(a) and (b), we observe that the error rates are higher at lower θbs. The main function of a confidence factor is to assign a low CFb to potentially wrong annotations. We have shown earlier that CFb provides a good model of prediction accuracy. When θb is set lower, most of the low-confidence predictions are likely to have wrong annotations. Hence, the error rate for low-confidence predictions tends to be high at low θb. In addition, the DE scheme achieves a greater error rate reduction at lower thresholds: for the 2K-image dataset, a reduction in error of 12% at θb = 0.1, and for the 25K-image dataset, a 6.9% reduction at θb = 0.05. This trend of error reduction shows that DE is able to disambiguate conflicts for the low-confidence predictions, while leaving the high-confidence predictions intact. As shown in ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 11. Relationship between posterior probability threshold θm and overall annotation accuracy for the 2K-image dataset.

Table I at the beginning of this section, the use of DE can further reduce the error rate of bagging by 2.7% for the 2K-image dataset, and 4.0% for the 25K-image dataset. Next, we report the influence of the posterior probability threshold θm on the overall prediction accuracy. As noted in Section 3.2, the expected classification accuracy changes with the value of θm. The empirical relationship between them for the 2K-image dataset is shown in Figure 11. When the value of θm increases from 0.03 to 0.06, the overall prediction error rate continues to decrease. The lowest error rate attained is 12.0%. Further increments of θm will only result in higher annotation error rates. This phenomenon agrees well with our theoretical analysis in Section 3.2. 4.3 Knowledge Discovery After applying CDE to disambiguate conflicts, we may still end up with a low confidence level for the query instance's prediction. As discussed in Section 3.3, the low confidence level can be attributed to the presence of new semantics, the lack of representative training data, or insufficient low-level features. In this section, we illustrate the capability of CDE in identifying these three scenarios. 4.3.1 New Semantics Discovery. The posterior probabilities produced by the base-level binary classifiers provide a good indication of the types of semantics present in the query instance. If all the posterior probabilities are low, it is highly likely that the query instance contains new semantics. Our objective is to automatically detect those new semantics when given a new query instance. For our new semantics discovery experiment, we first constructed a new dataset that contained the following new semantics: frog, windmill, butterfly, cactus, and snake. The total number of images in this new dataset was 349. We added this new dataset to the existing test set to form a larger test set. We then made use of five bags of classifiers (with 50% sampling) from Section 4.1 to classify the enlarged test set. By using different θn thresholds as the criterion for judging whether posterior probabilities are low, we are able to plot the precision/recall (PR) curves in Figure 12. Recall refers to the percentage of images with new semantics that we recover, and precision refers to the percentage of images from the new dataset in the entire pool of images that are considered new knowledge by our system. When θn is low, the precision is high at the expense of recall. Conversely, high θn results in low precision but high recall. The presence of query instances from the original test set with under-represented training data makes the task of new semantics discovery even more difficult. However, our PR results show that CDE remains effective in identifying instances from the new semantic categories. At the recall


Fig. 12. Performance of CDE in discovering new semantics.

Fig. 13. Insufficient training data example. The query instance is a white flower (Frame (a)), but the 5-nearest support vectors (frames (b) to (f)) have various other colors.

rate of about 8%, the precision is 100%. This means that the 25 images with the lowest CFb are all from categories of new semantics, presenting further evidence that our confidence factors are useful. Although the images from new semantic categories make up less than 50% of the new test set, we are able to achieve a precision of 85% when recall is 50%, indicating that a high percentage of images extracted by our system does contain new semantics. Once images with potentially new semantics are isolated, a user needs to inspect these images and manually annotate them if necessary. Once the labels for these images are assigned, they can be added to the existing set of semantics, where they can be used to retrain new classifiers. 4.3.2 Discovery of Insufficient Training Data. Insufficient training data is another cause for a low confidence prediction. To remedy this situation, we want to add the under-represented instance to the training pool and retrain the relevant classifiers for future predictions. In Figure 13, we show an example in which the misclassified query instance, a white flower (frame (a)), should belong to an existing class. Due to space limitations, we show only the five support vectors ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 14. Inadequate low-level features example II. The query is an image of an architectural design (frame (a)) that has been assigned the bear label. Frames (b) to (d) show the 3-NNs from the architecture category, and frames (e) to (g) show the 3-NNs from the bear category.

nearest to the query image (in frames (b) to (f)). It is evident that the lack of white flowers in the training data has caused the misclassification of the query instance. Hence, by adding the query instance into the training data and retraining the flower classifier, we will be able to classify flowers more accurately in the future. 4.3.3 Discovery of Insufficient Features. In Figure 14, we show a case where insufficient low-level features cause a misprediction. The query instance is shown in frame (a); its true label is architecture, but it has been assigned the bear label. Frames (b) to (d) show the 3-NNs from the architecture category. Frames (e) to (g) show the 3-NNs from the bear category, where the dark-colored bears resemble the darkened doorway of the query instance. We observe that the three architecture images have colors that are visually similar to the query instance, but their brightness and semantic content are not. If we had features able to characterize the shape of the objects in the images, or features that contain spatial information about the color blobs in the image, we could potentially avoid this misprediction. Similar to the previous example, CDE will identify this mispredicted query instance as a suitable candidate for feature discovery. 5.

CONCLUSIONS

In this article, we have proposed a confidence-based dynamic ensemble (CDE) scheme to overcome the shortcomings of traditional static classifiers. In contrast to traditional models, CDE makes dynamic adjustments to accommodate new semantics, to assist the discovery of useful low-level features, and to improve class-prediction accuracy. The key components of CDE include a multilevel prediction scheme that uses confidence factors to assert the class-prediction confidence, and a dynamic ensemble scheme that uses the confidence factors to form new classifiers adaptively for low-confidence predictions. Our empirical results have shown that our confidence factors are able to improve the bag-level prediction accuracy and effectively identify potential mispredictions. We have also illustrated the ability of our ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


dynamic ensemble scheme to enhance the annotations of low-confidence predictions. And lastly, we have demonstrated that CDE is able to identify images that might contain new semantics, and also to pick out images with semantics that may be under-represented in the existing training data. For future work, we plan to explore the use of CDE for feature discovery more extensively. We also plan to study other methods for characterizing class-prediction confidence, the core of CDE. (Please see Li et al. [2003] for preliminary work and Goh and Chang [2004] for additional results.)

REFERENCES

BENITEZ, A. B. AND CHANG, S.-F. 2002. Semantic knowledge construction from annotated image collection. In Proceedings of the IEEE International Conference on Multimedia. IEEE Computer Society Press, Los Alamitos, Calif.
BOUCHAFFRA, D., GOVINDARAJU, V., AND SRIHARI, S. N. 1999. A methodology for mapping scores to probabilities. IEEE Trans. Patt. Anal. Mach. Intell. 21, 9, 923–927.
BREIMAN, L. 1996. Bagging predictors. Mach. Learn. 24, 2, 123–140.
CHANG, E., GOH, K., SYCHAY, G., AND WU, G. 2003. Content-based soft annotation for multimodal image retrieval using Bayes point machines. IEEE Trans. Circ. Syst. Video Tech. (Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description) 13, 1, 26–38.
CHANG, S.-F., CHEN, W., AND SUNDARAM, H. 1998. Semantic visual templates: Linking visual features to semantics. In Proceedings of the IEEE International Conference on Image Processing. IEEE Computer Society Press, Los Alamitos, Calif., 531–535.
CHOW, C. K. 1970. On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theory 16, 1, 41–46.
COLLOBERT, R. AND BENGIO, S. 2001. SVMTorch: Support vector machines for large-scale regression problems. J. Mach. Learn. Res. 1, 143–160.
FAN, J., GAO, Y., AND LUO, H. 2004. Multi-level annotation of natural scenes using dominant image components and semantic concepts. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, 540–547.
GOH, K. AND CHANG, E. 2004. One, two class SVMs for multi-class image annotation. UCSB Technical Report.
GOH, K., CHANG, E., AND CHENG, K. T. 2001. SVM binary classifier ensembles for image classification. In Proceedings of the ACM CIKM. ACM, New York, 395–402.
HASTIE, T. AND TIBSHIRANI, R. 1998. Classification by pairwise coupling. Adv. Neural Inf. Proc. Syst. 10, 507–513.
HE, X., MA, W.-Y., KING, O., LI, M., AND ZHANG, H. 2002. Learning and inferring a semantic space from user's relevance feedback for image retrieval. In Proceedings of the ACM Multimedia. ACM, New York, 343–347.
HO, T. K., HULL, J., AND SRIHARI, S. 1994. Decision combination in multiple classifier systems. IEEE Trans. Patt. Anal. Mach. Intell. 16, 1, 66–75.
LI, B., GOH, K., AND CHANG, E. 2003. Confidence-based dynamic ensemble for image annotation and semantics discovery. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, 195–206.
LI, J. AND WANG, J. Z. 2003. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. Patt. Anal. Mach. Intell. 25, 9, 1075–1088.
NAPHADE, M. R., LIN, C.-Y., SMITH, J., TSENG, B., AND BASU, S. 2002. Learning to annotate video databases. SPIE Electronic Imaging 2002—Storage and Retrieval for Media Databases 4676, 264–275.
PLATT, J. 1999. Probabilistic outputs for SVMs and comparisons to regularized likelihood methods. Adv. Large Margin Class. 61–74.
PLATT, J., CRISTIANINI, N., AND SHAWE-TAYLOR, J. 2000. Large margin DAGs for multiclass classification. Adv. Neural Inf. Proc. Syst. 12, 547–553.
PODDAR, P. AND RAO, P. 1993. Hierarchical ensemble of neural networks. In Proceedings of the International Conference on Neural Networks 1, 287–292.
RODRIGUEZ, C., MUGUERZA, J., NAVARRO, M., ZARATE, A., MARTIN, J., AND PEREZ, J. 1998. A two-stage classifier for broken and blurred digits in forms. In Proceedings of the International Conference on Pattern Recognition 2, 1101–1105.
SCHAPIRE, R. E. AND SINGER, Y. 1998. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the 11th Annual Conference on Computational Learning Theory. 80–91.
SHEN, H. T., OOI, B. C., AND TAN, K. L. 2000. Giving meanings to WWW images. In Proceedings of ACM Multimedia. ACM, New York, 39–48.
SRIHARI, R., ZHANG, Z., AND RAO, A. 2000. Intelligent indexing and semantic retrieval of multimodal documents. Inf. Retriev. 2, 245–275.


TONG, S. AND CHANG, E. 2001. Support vector machine active learning for image retrieval. In Proceedings of the ACM International Conference on Multimedia. ACM, New York, 107–118.
VAPNIK, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.
VAPNIK, V. 1998. Statistical Learning Theory. Wiley, New York.
WANG, J., LI, J., AND WIEDERHOLD, G. 2001. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Trans. Patt. Anal. Mach. Intell. 23, 9, 947–963.
WENYIN, L., DUMAIS, S., SUN, Y., ZHANG, H., CZERWINSKI, M., AND FIELD, B. 2001. Semi-automatic image annotation. In Proceedings of Interact 2001: Conference on Human-Computer Interaction. 326–333.
WU, G. AND CHANG, E. 2003. Adaptive feature-space conformal transformation for learning imbalanced data. In Proceedings of the International Conference on Machine Learning. 816–823.
WU, H., LI, M., ZHANG, H., AND MA, W.-Y. 2002. Improving image retrieval with semantic classification using relevance feedback. Vis. Datab. 327–339.

Received January 2005; accepted January 2005


Understanding Performance in Coliseum, An Immersive Videoconferencing System H. HARLYN BAKER, NINA BHATTI, DONALD TANGUAY, IRWIN SOBEL, DAN GELB, MICHAEL E. GOSS, W. BRUCE CULBERTSON, and THOMAS MALZBENDER Hewlett-Packard Laboratories Coliseum is a multiuser immersive remote teleconferencing system designed to provide collaborative workers the experience of face-to-face meetings from their desktops. Five cameras are attached to each PC display and directed at the participant. From these video streams, view synthesis methods produce arbitrary-perspective renderings of the participant and transmit them to others at interactive rates, currently about 15 frames per second. Combining these renderings in a shared synthetic environment gives the appearance of having all participants interacting in a common space. In this way, Coliseum enables users to share a virtual world, with acquired-image renderings of their appearance replacing the synthetic representations provided by more conventional avatar-populated virtual worlds. The system supports virtual mobility—participants may move around the shared space—and reciprocal gaze, and has been demonstrated in collaborative sessions of up to ten Coliseum workstations, and sessions spanning two continents. Coliseum is a complex software system which pushes commodity computing resources to the limit. We set out to measure the different aspects of resource, network, CPU, memory, and disk usage to uncover the bottlenecks and guide enhancement and control of system performance. Latency is a key component of Quality of Experience for video conferencing. We present how each aspect of the system—cameras, image processing, networking, and display—contributes to total latency. Performance measurement is as complex as the system to which it is applied. We describe several techniques to estimate performance through direct light-weight instrumentation as well as use of realistic end-to-end measures that mimic actual user experience. We describe the various techniques and how they can be used to improve system performance for Coliseum and other network applications. This article summarizes the Coliseum technology and reports on issues related to its performance—its measurement, enhancement, and control. Categories and Subject Descriptors: H.4.3 [Information Systems Applications]: Communications Applications—Computer conferencing, teleconferencing, and videoconferencing General Terms: Algorithms, Design, Experimentation, Measurement, Performance Additional Key Words and Phrases: Telepresence, videoconferencing, view synthesis, 3D virtual environments, performance measurement, streaming media, network applications

1.

INTRODUCTION

For decades, videoconferencing has been sought as a replacement for travel. Bandwidth limitations and the accompanying issue of quality of the enabled experience have been central to its delayed arrival. Resolution and latency lead the way in objectionable factors but, were these resolved, close behind would come the issues that separate mediated from direct communication: the sense of co-presence, access to
Authors' addresses: Hewlett-Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304-1120; email: {harlyn.baker, nina.bhatti, donald.tanguay, irwin.sobel, dan.gelb, mike.goss, bruce.culbertson, tom.malzbender}@hp.com.


Fig. 1. The Coliseum immersive videoconferencing system.

shared artifacts, the feeling of communication that comes from the passing of subtle through glaring signals that characterize face-to-face meetings. In the Coliseum project, we are working toward establishing a facility to meet these communication needs through a thorough analysis of the computational, performance, and interaction characteristics demanded for universally acceptable remote collaboration and conferencing. Our goal has been to demonstrate, on a single desktop personal computer, a cost-effective shared environment that meets the collaboration needs of its users. The solution must provide for multiple participants—from two to tens—and support them with the required elements of person-to-person interaction. These elements include: —Acceptable video and audio quality, including resolution, latency, jitter, and synchronization; —Perceptual cueing such as motion parallax and consistent reciprocal gaze; —Communicating with words, gestures and expressions over ideas, documents and objects; —Joining and departing as easy as entering a room. Traditional telephony and videoconferencing provide some of these elements, including ease of use and audio quality, yet fail on most others. Our Coliseum effort aims to advance the state of videoconferencing by applying recent advances in image-based modeling and computer vision to bring these other elements of face-to-face realism to remote collaboration. Scene reconstruction, the task of building 3D descriptions using the information contained in multiple views of a scene, is an established challenge in computer vision [Longuet-Higgins 1981]. It has seen remarkable progress over the last few years due to faster computers and improved algorithms (such as Seitz and Dyer [1997], Narayanan et al. [1998], and Pollefeys [1999]). The Coliseum system is built upon the Image-Based Visual Hulls (IBVH) scene rendering technology of MIT [Matusik et al. 2000]. Our Coliseum efforts have shown that the IBVH method can operate at video rates from multiple camera streams hosted by a single personal computer [Baker et al. 2002]. Each Coliseum participant works on a standard PC with LCD monitor and a rig housing five video cameras spaced at roughly 30 degree increments (shown in Figure 1). During a teleconferencing session, Coliseum builds 3D representations of each participant at video rates. The appropriate views of each participant are rendered for all others and placed in their virtual environments, one view of which is ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 2. Two Coliseum users in a shared virtual environment, as seen by a third.

shown in Figure 2. The impression of a shared space results, with participants free to move about and express themselves in natural ways, such as through gesture and gaze. Handling five video streams and preparing 3D reprojection views for each of numerous coparticipating workstations at video rates has been a formidable task on current computers. Tight control must be exercised on computation, process organization, and inter-machine communication. At project inception, we determined that we needed an effective speedup of about one hundred times over the MIT IBVH processing on a single PC to reach utility. Our purpose in this article is to detail some of the major issues in attaining this performance. 2.

RESEARCH CONTEXT

The pursuit of videoconferencing has been long and accomplished [Wilcox 2000]. While available commercially for some time, such systems have, in large part, been met with less than total enthusiasm. Systems rarely support more than two participating sites, and specially equipped rooms are often required. Frame rates and image quality lag expectations, and the resulting experience is of blurry television watching rather than personal interchange. Our intention in Coliseum has been to push the envelope in all dimensions of this technology—display frame rate and resolution, response latency, communication sensitivity, supported modalities, etc.—to establish a platform from which, in partnership with human factors and remote collaboration experts, we may better understand and deliver on the requirements of this domain. Two efforts similar to ours in their aim for participant realism are Virtue [Schreer et al. 2001] and the National Tele-Immersion Initiative [Lanier 2001]. Both use stereo reconstruction methods for user modeling, and embed their participants in a synthetic environment. As in traditional videoconferencing, these systems are designed to handle two or three participating sites. Neither supports participant mobility. Prince et al. [2002] use Image-Based Visual Hulls for reconstruction and transmission of a dynamic scene to a remote location, although not applying it to multiway communication. Chen [2001] and Gharai et al. [2002] present videoconferencing systems supporting large numbers of users situated individually and reorganized into classroom lecture settings. While both demonstrate some elements we seek—the first examining perceptual issues such as gaze and voice localization and the second including image segmentation to place participants against a virtual environmental backdrop—neither reaches for perceptual realism and nuanced communication in the participant depictions they present. ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 3. The simplified Coliseum processing pipeline: image acquisition from synchronized cameras, 2D image analysis, reconstruction, and display of the rendering for a particular viewpoint.

3.

THE COLISEUM SYSTEM

Coliseum is designed for desktop use serving an individual conference participant. Five VGA-resolution cameras on a single IEEE 1394 FireWire bus provide video, and a microphone and speaker (or ear bud) provide audio. Coliseum participants are connected over either an Ethernet Local Area Network or an Internet Wide Area Network. Being a streaming media application, Coliseum has media flowing through a staged dataflow structure as it is processed. This computational pipeline is expressed within the Nizza architectural and programming framework developed at Hewlett-Packard Laboratories [Tanguay et al. 2004]. Figure 3 depicts the simplified processing pipeline for Coliseum, showing the four stages of image acquisition, 2D image analysis, reconstruction and rendering, and display. First, the cameras each simultaneously acquire an image. Second, 2D image analysis (IA) identifies the foreground of the scene and produces silhouette contours (Section 3.1). Third, IBVH constructs a shape representation from the contours and renders a new viewpoint using the acquired video and current visibility constraints (Section 3.2). Finally, the image is rendered and sent for display at the remote site. Coliseum’s viewer renders conference participants within a VRML virtual environment and provides a graphical user interface to the virtual world for use during a session. This allows participants to look around and move through the shared space, with others able to observe those movements. The Coliseum viewer has features intended to enhance the immersive experience. Consistent display of participants is achieved through their relative placement in the virtual world. An experimental facility for head tracking allows alignment of gaze with the placement of those addressed. In this way, as in the real world, a user can make eye contact with at most one other participant at a time. Head tracking permits the use of motion parallax (Section 3.3), which can further reinforce the immersive experience by making an individual’s view responsive to his movements. Critical to metric analysis of video imagery is acquiring information about the optical and geometric characteristics of the imaging devices. Section 3.4 describes our methods for attaining this through camera calibration. This method is meant to be fast, easy to use, and robust. Sections 3.5 and 3.6 describe the session management and system development aspects of Coliseum. 3.1 Image Processing The image processing task in Coliseum is to distinguish the pixels of the participant from those of the background and present these to a rendering process that projects them back into the image—deciding which pixels constitute the user and should be displayed. Foreground pixels are distinguished from background pixels through a procedure that begins with establishing a background model, acquired with no one in the scene. Color means and variances computed at each pixel permit a decision on whether a pixel has changed sufficiently to be considered part of the foreground. The foreground is ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 4. Left: background image; Center: foreground contours; Right: foreground after shadow suppression.

represented as a set of regions, delineated by their bounding elements and characterized by properties such as area, perimeter, and variation from the background they cover. Ideally, these foreground computations would occur at 30 frames per second on all five cameras of our Coliseum system. Sustaining an acceptable frame rate on VGA imagery with this amount of data calls for careful algorithmic and process structuring. In aiming for this, a few principles have guided our low-level image processing: —Focus on efficiency (i.e., touch a pixel as few times as necessary—once, if possible—and avoid data copying), using performance measuring tools to direct effort; —Use lazy evaluation [Henderson and Morris 1976] to eliminate unnecessary computation; —Provide handles for trading quality for speed, so host capability can determine display/interaction characteristics. Following these guidelines, we have made several design choices to attain high performance: (1) Acquire the Raw Bayer Mosaic. Avoiding explicit color transmission from the cameras enables us to run five full VGA streams simultaneously at high frame rate on a single IEEE 1394 bus. Imagers generally acquire color information with even scan lines of red and green pixels followed by odd scan lines of green and blue pixels (the Bayer mosaic), which are converted to color pixels, typically in YUV422 format. This conversion doubles the bandwidth and halves the number of cameras or the frame rate on an IEEE 1394 bus. (2) Employ a Tailored Foreground Contour Extractor. In one pass over the image, our method determines the major foreground objects, parameterizes them by shape and extent, ranks them by integrated variation from the background and, accommodating luminance changes—both shadows and gradual light-level fluctuations—delivers candidate silhouettes for hull construction. With adjustable sampling of the image, it finds the subject rapidly while retaining access to the high-quality texture of the underlying imagery. Detecting image foreground contours at reduced resolution by increasing the sampling step allows greater image throughput without the loss of image information that accompanies use of a reduced-resolution data source—throughput increases with the square of the sampling. Contour localization doesn't suffer as much as it might with decimated sampling since our method relocalizes using the full image resolution in the vicinity of each detected foreground contour element. Figure 4 demonstrates the illumination adaptation, and Figure 5 shows sampling variations. (3) Reduce Foreground Contour Complexity through Piecewise Linear Approximation. The cost of constructing the visual hull increases with the square of the number of contour elements, so fewer is better. Figure 6 shows this processing. (4) Correct Lens Distortion on Foreground Contours Rather than on the Acquired Camera Imagery. This means we transform tens of vertices rather than 1.5 million pixels on each imaging cycle.


Fig. 5. Various image contour samplings: sampling steps of 1, 4, and 8 examine 100%, 6%, and 1.5% of the image.

Fig. 6. A 525-segment contour and its piecewise linear approximations; (max pixel error, segments) = (4, 66), (8, 22), (16, 14).

(5) Resample Color Texture for Viewpoint-Specific Rendering only as Needed (on Demand). With color not explicit (as (1), above), and lens correction postponed (as (4), above), image data for display must be resampled. The on-demand means that only those pixels contributing to other participants’ view images will be resampled. (6) Parameterize Expensive Operations to Trade Quality for Speed. For example, rendering a typical 300 by 300 IBVH resultant image for each participant would require 90000 complex ray-space intersections at each time step across all cameras. For efficiency, we parameterize this computation through variable sampling and interpolation of the interior and boundary intersecting hull rays. This and other dialable optimizations can be used to balance processing load and visual quality to meet performance requirements. 3.2 Reconstruction We use IBVH to render each participant from viewpoints appropriate for each other participant. IBVH back projects the contour silhouettes into three space and computes the intersection of the resulting frusta. The intersection, the visual hull, approximates the geometry of the user. Rendering this geometry with view-dependent texture mapping creates convincing new views. While we could send 3D models of users across the network and render them with the environment model in the Coliseum viewer, less bandwidth is required if we render all the needed viewpoints of a user locally and then send only 2D video and alpha maps. We use MPEG4 to compress the video. Since the majority of displayed pixels comes from the environment model and is produced locally, the video bandwidth requirements are low (about 1.2 Mb/sec). Figure 7 shows the results of foreground contouring, displayed with the visual hull they produce, in the space of the five Coliseum cameras. While the IBVH algorithm is fast when compared with other reconstruction methods, it has shortcomings. The quality of scene geometry represented depends on the number of acquiring cameras, and surface concavities are not modeled. This geometric inaccuracy can cause artifacts when new views are synthesized. To address this issue, we employed the extension to IBVH of Slabaugh et al. [2002] called Image-Based Photo Hulls (IBPH), which refines the visual hull geometry by matching ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 7. View of user in Coliseum space: Five cameras surround the rendered user. Each camera shows its coordinate system (in RGB), video frame, and foreground contour.

colors across the images. While resulting in a tighter fit to the true scene geometry and therefore better face renderings, the computational cost was significant, and we did not run the system in real-time sessions with this enhancement. Details may be found in our earlier paper [Baker et al. 2003]. 3.3 Motion Parallax A successful tele-immersion system will make its users feel part of a shared virtual environment. Since our world is three dimensional and presents differing percepts as we move, head movement before the display should induce a corresponding change in view. To achieve this, we developed the capability to track user head position and update the display as appropriate. Unfortunately, the cost of this computation prevented us from employing it in real-time sessions. Baker et al. [2003] provides details of this head-tracking capability. 3.4 Camera Calibration Our scene reconstruction requires knowledge of the imaging characteristics and pose of each camera. These parameters include: —Lens Distortion, to remove image artifacts produced by each camera’s lens (our use of wide-angle lenses exacerbates this). —Intrinsics, that describe how an image is formed at each camera (focal length, aspect ratio, and center of projection). —Extrinsics, relating the pose (3D position and orientation) of each camera to some global frame of reference. —Color Transforms, to enable color-consistent combination of data from multiple cameras in producing a single display image. All of these parameters must be computed before system use and, in a deployable system such as ours, any of them may need to be recomputed when conditions change. Figure 8 shows the target we use for parameter estimation—a 10-inch cube with four colored squares on each face (totaling 24 colors plus black and white). A differential operator detects contour edges ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 8. Calibration target.

in luminance versions of the images, and then a classifier verifies that a detected face contains four squares. The large size and color of the squares make them easier to detect and match, while the multiple faces provide both enough color for good colorimetric modeling and the opportunity for all of the cameras to acquire geometric calibration data at the same time. The face components supply the elements for determining the calibration parameters. Lens distortion correction is computed by determining the radial polynomial that straightens the target faces' black boundaries [Devernay and Faugeras 1995]. Intrinsic parameters are derived from the homographies that rectify the colored squares from their projected shapes [Zhang 2000]. Camera extrinsics are estimated in a two-stage process that starts with initial adjacent-pair pose estimates using a nonlinear variant of a stereo solver from Longuet-Higgins [1981] applied to matched square vertices. These poses are chained together and iteratively solved in pairs to minimize error. A bundle adjustment minimizes the total calibration error. The correspondences are implied when observed faces are matched to the target faces, with this matching made more robust by the simultaneous visibility of several faces to a single camera. The color of each square is known—they resemble those of a Macbeth Color Chart—so the colors observed can be used to determine each camera's color transform. 3.5 Session Management Session management is performed through the Hub subsystem, built using the Microsoft DirectPlay API. A Hub host process for each session runs on a central server and processes connect and disconnect events, notifying session members when other users join or leave. A new user may connect to any existing session, or initiate a new session by starting a new host process. Communications among users during a session are peer to peer. When a new user connects to a session, the local portion of the Hub subsystem determines compatible media types between itself and other users, and notifies the local and remote media transmission and reception modules. These media modules communicate directly using datagram protocols. A multistream UDP protocol allows coordination of different media-type network transmissions. Figure 9 illustrates the dynamic structure of a Coliseum application with session management. 3.6 Software Framework Streaming media applications are difficult to develop: —Digital media processing is complex, requiring orchestration among multiple developers.


Fig. 9. Coliseum scalable processing pipeline: On a single participant’s Coliseum station, the shaded sub-pipelines are added and subtracted from the application as remote participants enter and leave the conferencing session.

—Simultaneous processing of multiple streams requires multithreading and synchronization. —Real-time user experience requires optimal performance on limited computational resources, demanding flow control and buffer management. In implementing Coliseum, we have created Nizza [Tanguay et al. 2004], a flexible, multiplatform software framework that simplifies the development of such streaming media applications. This framework allows an application's processing to be decomposed into a task dependency network and automates exploitation of both task and data parallelism, partitioning operations across as many symmetric multiprocessors as are available. A dataflow architecture is designed around the movement of data. By observing the data lifecycle throughout the application, one may define a pipeline of distinct processing stages that can be clearly expressed as a directed graph. Our framework addresses all three of the difficulties of developing streaming media applications: —The application is decomposed into well-defined, connected task modules. —A scheduler analyzes the task decomposition and then schedules execution. —The task scheduler achieves real-time performance via automated flow control. This design simplifies development at all stages, from prototyping to maintenance. A dataflow API hides details of multithreading and synchronization and improves modularity, extensibility, and reusability. We have implemented the framework in C++ on Windows, Windows Mobile, and Linux platforms. Using this framework, we have developed a library of reusable components for the Windows platform (e.g., audio recording and playback, video playback, network connectivity). The streaming media aspects of Coliseum were built using Nizza and the reusable components. The framework has three main abstractions: Media (data unit), Task (computation unit), and Graph (application unit):


—Media objects represent time-stamped samples of a digital signal, such as audio or video. A memory manager reuses memory whenever possible. The new Media abstraction inherits an automatic serialization mechanism for writing into pipes, such as a file or network buffer. —Task objects represent an operation on Media. The abstraction is a black box of processing with input and output pins, each specified with supported types of Media. The Task completely encapsulates the processing details so that the user knows only the functional mapping of input pins to output pins. —Graph objects are implicitly defined by the connectivity of multiple Tasks. Several commands can be issued to a graph including those to start and stop the flow of Media among its Tasks. Each Graph has its own scheduler, which orders the parallel execution of Tasks. This infrastructure provides three distinct benefits: —It supports an incremental development strategy for building complex applications—a Graph with a multistage pipeline structure can undergo testing of functional subsets hooked to real or synthesized data sources and sinks. —The framework allows for dynamic graph construction. When stopped, an application can add and remove Tasks, then start again with a new graph while keeping the unchanged portions intact. We use this technique in the large, dynamic graph of Coliseum, adding and removing portions of the graph as participants enter and depart sessions. —The graph structure supports instrumentation. Individual nodes in a Nizza graph can be instructed to append timing and related information to data packets, facilitating data-progression assessment. Keeping a functioning Graph intact, a new Task can connect to any of its output pins to monitor activity. This ability to listen and report is useful for gathering statistics in a performance monitor or to effect feedback control in modifying system parameters on the fly. A designer of a real-time rich media application can choose from several componentized, dataflowstyle architectures for media processing. Microsoft DirectShow, for example, may be a good approach for simple applications that use only prepackaged modules (e.g., compression, video capture). It also has a graphical interface for constructing and configuring a dataflow application. However, constructing new modules is difficult, performance metrics are not automatic, it is complex to learn and use, is dependent on other layers beyond our control (such as COM), and is not supported on the Linux platform. In addition, its use of processes rather than threads makes it poor for debugging multi-stream video applications, and its lack of a media scheduler means it discards untimely work rather than suppressing it, wasting capacity in resource-critical applications. The Network-Integrated Multimedia Middleware (NMM) project [Lohse et al. 2003] is an open-source C++ framework designed for distributed computing. NMM makes the network transparent from the application graph, but does not have performance features like Nizza, and is not available for Windows platforms. The Java Media Framework [Gordon and Talley 2003] is multiplatform and has integrated networking via Remote Transport Protocol, but its performance on heavy media (e.g., video) is not competitive with Nizza’s. Signal processing software environments such as Ptolemy [Buck et al. 
2001], while operating efficiently on 1-D signals such as audio, are not appropriate for our “heavy” media, and often don’t support cyclic application graphs, which we have found useful for incorporating user feedback into a processing pipeline. VisiQuest (formerly known as Khoros and now available from AccuSoft1 ) is a commercial visual programming environment for image processing and visualization. While an impressive visual environment, it lacks performance enhancements and metrics. 1 http://www.accusoft.com
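The following sketch illustrates the Media/Task/Graph decomposition and per-node instrumentation described above in a schematic way. It is emphatically not the Nizza API (a C++ framework whose interfaces are not given in this article); the class names, the connect/process methods, and the simple breadth-first scheduler are hypothetical stand-ins used only to show how a staged dataflow with piggybacked timing information can be expressed.

```python
# Hypothetical sketch of a Media/Task/Graph-style dataflow (not the Nizza API).
import time
from collections import deque

class Media:
    """A time-stamped data sample; control points append timing info."""
    def __init__(self, payload):
        self.payload = payload
        self.timestamps = [("created", time.perf_counter())]

class Task:
    """A black box with input and output pins; here reduced to a function."""
    def __init__(self, name, fn):
        self.name, self.fn, self.downstream = name, fn, []

    def connect(self, other):
        # Wire this task's output pin to another task's input pin.
        self.downstream.append(other)
        return other

    def process(self, media):
        media.payload = self.fn(media.payload)
        media.timestamps.append((self.name, time.perf_counter()))
        return media

class Graph:
    """Implicitly defined by task connectivity; pushes Media through it."""
    def __init__(self, source):
        self.source = source

    def push(self, payload):
        media = Media(payload)
        queue = deque([(self.source, media)])
        while queue:
            task, m = queue.popleft()
            m = task.process(m)
            for nxt in task.downstream:
                queue.append((nxt, m))
        return media

# Example mirroring Figure 3's stages: acquisition -> analysis -> rendering.
acquire = Task("acquire", lambda frames: frames)
analyze = Task("analyze", lambda frames: [f.lower() for f in frames])
render = Task("render", lambda frames: " | ".join(frames))
acquire.connect(analyze).connect(render)
print(Graph(acquire).push(["CAM0", "CAM1"]).timestamps)
```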


4.




PERFORMANCE

Coliseum is a multi-way, immersive remote collaboration system that runs on modest through advanced commodity hardware. We have run sessions with up to ten users (all the Coliseum systems we have available), and between North America and Europe. Success depends on our ability to provide videoconference functions with sufficient responsiveness, audio and video quality, and perceptual realism to take and hold an audience. We evaluate Coliseum's performance in terms of its computational and networking characteristics. Since we aim to support large numbers of participants in simultaneous collaboration, we also review the implications of these measures for the system's scalability. All the measurements presented were collected with the following equipment and conditions. The application runs on dual Xeon-based PCs with speeds of 2.0, 2.4, 2.8, or 3.06 GHz, each with 1, 2, or 4 GB of memory, and running Windows 2000 or XP. Machines shared a single 1000SX gigabit Ethernet connected by Cisco Catalyst 4003 10/100/1000 switches and were, typically, two to three hops apart for local area tests. The local network was in use at the time for other HP Lab activities. We also conducted wide area network measurements on Internet2.2 Imagery was acquired through Point Grey Dragonfly VGA IEEE 1394 (FireWire) cameras operating at 15 Hz. The PC used for data collection had dual 2.4 GHz Xeon CPUs and 1 GB of memory. 4.1 What to Measure and How to Measure When approaching a large-scale system like Coliseum, it is important to frame the questions of what will be measured, why, and how. There are a number of techniques that can be used, and the selection among them should be based on the intended use. Our first approach was the common one for measuring performance—using profiling tools such as prof, gprof, VTune,3 etc., which collect run-time, fine-grained performance data. These tools collect resource usage data on a function and module basis through sampling, resource counter monitoring, or call graph instrumentation. The sampling method periodically probes the current instruction addresses and provides an accounting of resource use by the function sample bin. Some processors support continuous counter monitoring, which tracks hardware and software resources over a specified time interval. However, neither tracks the call graph. The relationships among functions are not preserved so, while a routine's total CPU computation is identified, the call sequence is not. Call graph instrumentation is available, but this is accomplished at a cost—instrumentation of the code forcing each function to execute an accounting preamble. This changes the execution profile and, for an application such as Coliseum that is already running near system capacity, it is not practical as full frame rates cannot be maintained. We experimented with these traditional measurement methods but found them to be unsatisfactory. The profiling tools do not show how much time was spent in synchronization wait states and in the run state for a particular frame set. They only give this information for a call graph as a whole. For example, we identified that the silhouette functions consumed the most CPU time, but this did not provide a picture of the data flow latency. We determined that we needed a top-down flow-based view of application performance.
This called for two further types of measures (see Figure 10): —Application measures that can be used to tune performance and understand each component’s contribution to overall latency, frame rate and resource consumption. —End-to-end measures that can confirm the application instrumentation and capture the user experience that is beyond the application control points. 2 http://www.internet2.edu/ 3 http://www.intel.com/software/products/vtune/


The application measures are accomplished through instrumentation of our system to collect timing data at flow control points. Nizza’s modular architecture allowed these points to be easily accessed (subsequently, provision was made for this information to be provided automatically at computational nodes in the data chain). Each collection of “frames” is marked with a unique timestamp. As these frames move through the system, timestamps are appended when the frame passes a control point. For example, once contours are extracted for each 5-camera set, their data are available on the five input “data-pins” of the next processing stage. Timing information can be appended here. Similarly, “camera data available” is a synchronization point whose times are tracked and propagated with the data. Clearly, these are the timing values we want, since their integration tells us everything about the system’s behavior as its data moves through. The observations include those of latency, and processing and synchronization wait times. This application instrumentation, while powerful, does not give a complete picture of the userexperienced end-to-end latency. What about the components outside of the application? The latency in the camera hardware and its drivers, and the latency from image composition until it appears on the display are characteristics not accessible from inside the application. What are the effects of these components on performance? 4.2 Use of Performance Data The previous section described three ways that we evaluated performance: profiling, application instrumentation, and end-to-end measurements; in this section, we describe how these measurements can be used to guide system understanding and improvement. One essential component of performance analysis is repeatability. We repeated experiments to confirm that measurements were consistent. If measurements change from run to run, then the system metrics cannot provide conclusive evidence. Once the system is stable, performance data can be used as a system diagnostic to: —Evaluate Different Sections of the Pipeline. Components can be replaced in our modular architecture and we can evaluate them before and after to assess the “cost” of each component. Cost can be assessed by network traffic, CPU load, maximum frame rate, latency, memory usage, etc. For example, we removed MPEG (encode and decode), and compared pre- and post-numbers to let us calculate the cost of the MPEG modules in the pipeline. —Evaluate a “Fix” or “Performance Improvement” Added to the System. Improving visual computing algorithms can be complex, with both subjective and objective components. An absolute performance timing measure assures that this part of the processing has been improved. —Identify Large Resource Users as Targets for Improvement. Originally, we believed that the network was a bottleneck in the system and were prepared to expend effort to reduce data transmission. After measurement, we realized that this would not have resulted in significant latency reduction. —As drivers, hardware, or other non-application components change, we use the measures to quantify the effects. Through profiling, we determined that Coliseum was compute bound. To improve computational efficiency, we systematically evaluated compilers and compile options under identical test conditions and tracked the frame rate metric. This brought a 30% increase in frame rate.
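The control-point timestamping described in Section 4.1 reduces to simple bookkeeping once the (control point, time) pairs have been appended to a frame set. The sketch below is not the Coliseum instrumentation itself; it assumes the timestamps have already been collected (e.g., via a high-resolution counter such as QueryPerformanceCounter) and simply reports per-stage elapsed times and the total latency.

```python
# Illustrative sketch (not the Coliseum code): given the (control point, time)
# pairs appended to a frame set as it moves through the pipeline, report the
# elapsed time between consecutive control points and the total latency.
def stage_times(timestamps):
    """timestamps: list of (label, seconds) in traversal order."""
    stages = []
    for (prev_label, prev_t), (label, t) in zip(timestamps, timestamps[1:]):
        stages.append((f"{prev_label} -> {label}", (t - prev_t) * 1000.0))
    total_ms = (timestamps[-1][1] - timestamps[0][1]) * 1000.0
    return stages, total_ms

# Example with made-up values (seconds from an arbitrary origin):
ts = [("camera data available", 0.000), ("contours extracted", 0.021),
      ("hull rendered", 0.054), ("MPEG encoded", 0.066)]
for name, ms in stage_times(ts)[0]:
    print(f"{name}: {ms:.1f} ms")
```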

4.3 End-to-End Measurements Latency has a major impact on the usability of a communication system. There are numerous contributors to overall system latency, and we have measured various stages to assemble a picture of the delays ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 10. Application and end-to-end performance measurement.

between actions at one site and observations at the other. Coliseum’s latency is composed of the time for: —Camera Latency. An event to be imaged and delivered by the camera driver. —Processing Latency. Receiving the image data from the camera system to process, create the visual hull, render a requested view, and encode this in MPEG4. —Network Latency. Processing by UDP protocol stack, and network transmission. —Display Latency. Dataset reassembly, MPEG decoding, image composition, and return from providing the data to the display drivers. Measuring camera system latency requires either fancy synchronization gear or an externally observing capture device. Choosing the latter, we used a field-interlaced digital video camera to image both an event and the after-processing display of that event simultaneously. If an event is instantaneous, it is visible at the origination and the display. If it is not instantaneous, then we count the number of frames before it is visible in the output display. We tried several events but needed one that would be fairly atomic—coming “on” in one frame. An incandescent light proved to be inappropriate since it took several frames to reach full illumination. Our event was the illumination from a laser pointer, directed at a Coliseum camera. The laser and the camera’s display were simultaneously visible to the observing video camera (see Figure 11). Manual frame-by-frame analysis of the acquired video provided the numbers we sought (Figure 12). We captured several such events, and our tables in Figures 13 and 14 indicate average values. We measured end-to-end latency in four situations: —Simple camera driver demonstration program (TimeSliceDemo). [camera and driver latencies] —Stand-alone version of Coliseum with no network or VRML activity (Carver). [camera, drivers, image processing, and simple display latency] —Coliseum test of two users with live networking, with and without MPEG encoding (Coliseum). [complete system test and measure of MPEG impact] —Coliseum test where the subject is both the sender and receiver of the view (Glview Loopback—single person loop-back conference). [end-to-end latency minus negligible network transmission] Both the Coliseum and Carver measurements reflect round-trip frame counts, so the one-way latency is half the observed figure. The third test was done to see the effect of MPEG processing on latency. Figure 13 gives the average video frame count (at 33 ms per frame) for each test. The observing video camera captured 30 frames per second, permitting us to calculate latencies and standard deviations. Some of the time intervals we measured were about a dozen frames, while others were low single digits. Since images are acquired with units of 33ms delay, estimation precision is better for the former than ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 1, No. 2, May 2005.


Fig. 11. Capture of laser light onset and propagation to display.

Fig. 12. Latency measurement: 1) no light, 2) light on, 3) propagates to display, 4) saturates.

Fig. 13. Absolute user-perceived latency tests.

Some of the time intervals we measured were about a dozen frames, while others were low single digits. Since images are acquired in units of 33 ms delay, estimation precision is better for the former than the latter, but our interest has been first in ballpark numbers. Refinement could be obtained, where needed, by measuring on the fields rather than the frames of the interlaced video, and by performing linear interpolation on the observed illuminant brightness, but this we did not do.

TimeSliceDemo gives us an estimate of the latency that lies beyond our control: the time it takes the camera to acquire a frame and store it in the computer. Of course, this includes time for the camera to integrate the frame (on average, one half of a frame, or 16 ms, for the event), to charge-transfer, digitize, and ship the frame to the PC (one frame), and to buffer and DMA the data to memory, plus the time for the PC to display the frame after it has arrived (observed as perhaps one frame cycle of the observing camera). The latter period should be discounted.


Fig. 14. Instrument measure of latency (ms).

Figure 14 indicates measures of the instrumented latency of this same system version. Comparing the Coliseum-with-MPEG user-perceived tests to the instrumented latency figures, we find differences of 38 and 42 milliseconds. This difference represents the latency that should be added to the instrumented measures to derive an estimate of end-user experienced latency. We observe that the absolute user-experienced latency in Coliseum ranges from 244 to 298 milliseconds. Enabling MPEG encoding and decoding increases latency by 33 milliseconds. MPEG encoding reduces the amount of data each participant sends, but does so at the cost of additional processing; this is a tradeoff we must consider when balancing the system. There is a 77-millisecond difference between Coliseum and our stand-alone Carver application, attributable to the VRML viewer and network activity. We will see that the network load is minimal and that the addition is due to the VRML control, which currently (and inappropriately) uses a busy-wait loop for its user interface.

4.4 Application Instrumentation

We instrumented the code both for the Carver application (non-networked) and for Coliseum itself (see Figure 15). The instrumentation collects continuous timing data, which are sent with each frame set to the corresponding host. Timing data were collected using a lightweight system call (Windows XP's QueryPerformanceCounter) that sampled the processor clock. Each resulting data set contained timestamps indicating when the camera frame set was first available (t0). After the image analysis is complete, another timestamp is taken (t1). The receiving host records the time it received the frame (t2′) and the time after decoding, compositing, and displaying the resultant image (t3′). The quantity t′ is the time the system waits to piggyback timing information onto the outgoing frame set of the receiving host, indicated by "Wait for Camera Data." This piggybacking avoids the introduction of additional network traffic and brings only nominal overhead. The bottom of the figure shows the return part of the journey.

Processor clocks are not synchronized, so we cannot directly compare timestamp values across machines. Machine clock rates are stable, however, and elapsed-time values can be used across host boundaries. We make extensive use of this fact to derive the timing contributions of each component of the timeline. The round time (RT), from camera data at one host, to its display at the other, and back from the next camera data to the originating display, is calculated as the time from the camera data to when a corresponding frame was received from the other host, minus the time spent waiting to piggyback the data:

RT = (t3 − t0) − t′

The total network time (NT), comprising protocol stack processing and transmission, can be determined from the round trip less the time spent in each host for processing, MPEG, composition, and display:

NT = RT − (t1 − t0) − (t3 − t2) − (t3′ − t2′) − (t1′ − t0′)

One-way network time is half NT, since NT is a round-trip measure.
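The following sketch shows how these quantities can be computed from the collected timestamps. Only the formulas for RT, NT, and the one-way network time come from the text; the structure and function names are hypothetical, and the reading of t2/t3 as receipt and display of the returned frame on the originating host is an interpretation consistent with the NT formula.

```cpp
// Timestamps for one round trip, in milliseconds on each host's own clock.
// Unprimed values are taken on the originating host, primed values on the
// receiving host; absolute values are not comparable across hosts, but
// differences taken on the same host are.
struct RoundTripTimestamps {
    double t0, t1;     // camera data available / image analysis done (origin)
    double t2p, t3p;   // frame received / displayed (receiver)
    double t0p, t1p;   // camera data available / analysis done (receiver)
    double t2, t3;     // returned frame received / displayed (origin)
    double t_wait;     // t': wait to piggyback timing onto the outgoing frame set
};

// Round time: camera to remote display and back to the originating display,
// excluding the piggyback wait.  RT = (t3 - t0) - t'.
double round_time_ms(const RoundTripTimestamps& ts) {
    return (ts.t3 - ts.t0) - ts.t_wait;
}

// Total network time: round trip less per-host processing, MPEG,
// composition, and display.
// NT = RT - (t1 - t0) - (t3 - t2) - (t3' - t2') - (t1' - t0').
double network_time_ms(const RoundTripTimestamps& ts) {
    return round_time_ms(ts)
           - (ts.t1 - ts.t0) - (ts.t3 - ts.t2)
           - (ts.t3p - ts.t2p) - (ts.t1p - ts.t0p);
}

// One-way network time is half of NT, since NT covers a round trip.
double one_way_network_time_ms(const RoundTripTimestamps& ts) {
    return network_time_ms(ts) / 2.0;
}
```

Note that only differences of timestamps taken on the same host appear in these expressions, consistent with the observation that the clocks are unsynchronized but rate-stable.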


Fig. 15. Performance measurement: data routing and measurement taps.

Fig. 16. Analysis of latency (ms).

Figure 16 presents data for the Coliseum and Carver tests. We tested each application with four different subjects: a user, a prerecorded dataset of a user, a phantom 5-gallon water bottle, and a prerecorded dataset of the bottle (see Figure 17). The bottle subject is placed at approximately head height but is stationary and several times larger than a user's head. The prerecorded data allow us to exercise the systems without the camera frame-rate limitation, although memory and disk accesses can similarly affect performance. For each subject, we give the one-way latency, time to generate the image, network delay, and display time. The table also shows the achieved average frame rate and CPU


Fig. 17. Bottle used as phantom for performance measurements.

utilization. Note that these data were compiled from a later release of the system and therefore should not be directly compared to the data for the user-perceived latency results.

4.5 Networking Requirements

Though video usually requires considerable network bandwidth, Coliseum's bandwidth needs are quite modest. This is because the virtual environment usually occupies the overwhelming majority of the area on a Coliseum screen and, being maintained locally, is not part of the video stream. MPEG4 further reduces the bandwidth requirement. At 15 frames/sec, we measured a typical Coliseum video stream to be 616 Kb/sec. Although Coliseum can use TCP or UDP as a transport, all tests were conducted with UDP. Using UDP, there could be hundreds of Coliseum video streams before overloading a gigabit network (a rough check of this figure is sketched after Section 4.6 below).

We measured latency in our local area network, where the two participant hosts were two network-switched hops apart. The average latency was 3 ms, so the network contributes about 2% to overall latency. To characterize the wide-area performance of the system, we measured latency on Internet2 from HP Labs Palo Alto to a site at the University of Colorado in Boulder. Our tests showed an average network latency of 25 ms.

4.6 Other Measures of Performance

While much of the above discussion centers on latency, we include other measures of performance as well; we focused much of our effort on latency because of the challenges in its measurement. The Carver and Coliseum applications provided continuous monitoring of frame rates and network traffic. In the profiling and application instrumentation tests, we measured CPU and memory usage. We did not vary image quality or size in response to system load, so we did not measure these metrics. Our system is designed to provide the maximum frame rate possible. In the tests with large numbers of users, the frame rate degrades as CPU load rises to accommodate the increase in simultaneous renderings.
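Returning to the bandwidth figures of Section 4.5, the back-of-the-envelope check below uses only the 616 Kb/sec per-stream measurement quoted above; the nominal gigabit rate ignores protocol and switching overhead, so real capacity would be somewhat lower.

```cpp
#include <cstdio>

int main() {
    // Figures from the text: a typical Coliseum stream is 616 Kb/sec at 15 fps.
    const double stream_kbps  = 616.0;
    const double gigabit_kbps = 1000000.0;   // nominal 1 Gb/s link

    // Upper bound on concurrent streams a gigabit network could carry,
    // consistent with the "hundreds of streams" claim above.
    std::printf("nominal streams per gigabit link: ~%.0f\n",
                gigabit_kbps / stream_kbps);  // prints ~1623
    return 0;
}
```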


Fig. 18. Change in fps and cpu utilization for increasing Coliseum session sizes.

Fig. 19. Aggregate bandwidth for different size Coliseum sessions.

4.7 Scalability

Since a major goal of Coliseum is to support videoconferencing among large groups of people, scalability is an important system characteristic. We measured the system's scalability by conducting sessions of increasing population, from 2 to 10 participants (10 being the number of Coliseum systems on site). Figure 18 shows that, as session size increased, system performance (frames/sec) dipped due to the increased workload of creating images and MPEG streams for the expanding number of view renderings required. While frame rate degraded, the total aggregate bandwidth sent by one user remained fairly constant, which means that the system adapts to more users in a work-conserving manner. Figure 19 shows that bandwidth climbed from 616 Kb/sec to 1178 Kb/sec as CPU utilization saturated, and then leveled off after the 6-user session. All in all, the bandwidth varied 16% over the course of these session sizes. At least this much variation is expected over any collection of runs, as the bandwidth is sensitive to user movement and image size.

4.8 Performance Summary

In two-way sessions, we have achieved a rate of 15 frames per second, the maximum the FireWire bus can support (running five cameras at higher rates on FireWire is possible only with reduced image size). Our throughput to date indicates that we have achieved about a thirty-five-times speedup from algorithmic and architectural innovations and a three-times speedup through processor evolution, meeting our initial requirement of a hundred-fold speedup. From tests on larger numbers of users, we find that the computational complexity of the system dominates performance. There are a number of parameters that can be used to reduce computation at the expense of visual quality, and adjustment of these would


allow support of more users while maintaining interactive frame rates. The current system reduces frame rate but maintains image quality. As the number of users grows, performance stabilizes, with the bandwidth served remaining relatively constant. Our extensive measurements of Coliseum provide a clear breakdown of latency:

—Camera Latency: 20%
—Processing Latency: 50%
—Network Latency: 2% to 10% (local or wide area)
—Display Latency: 25%

These measures can direct strategies for controlling delay and improving system performance for large numbers of users. Since Coliseum is a highly compute-intensive application, we have the potential to control end-node behavior and therefore overall system performance. With the facility for graph-level performance monitoring (Section 3.6) and control parameters for adjusting the quality, and therefore the speed, of image processing and display computations (Section 3.1), we have the tools we need for balancing throughput with user needs. While statically configured for the evaluations we report here (an image sampling step of 2 pixels, a four-pixel maximum deviation for linear approximations, and a hull-ray sampling step of 4 pixels), these parameters may be adjusted over time and across cameras to meet bandwidth and throughput demands.
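As an illustration of what such adjustable quality/speed parameters might look like as a per-camera control block, here is a minimal sketch; the default values are the statically configured settings reported above, but the structure, field names, and the adapt() policy are hypothetical rather than Coliseum's actual interface.

```cpp
// Hypothetical per-camera quality/speed knobs for balancing throughput
// against visual quality.  Defaults match the statically configured values
// used in the evaluations reported above.
struct QualityControl {
    int image_sampling_step_px = 2;     // subsample factor for image analysis
    int max_linear_deviation_px = 4;    // tolerance for contour linear approximation
    int hull_ray_sampling_step_px = 4;  // sampling step along visual-hull rays
};

// Illustrative policy: coarsen sampling when measured frame rate falls below
// a target, trading image quality for speed (thresholds are not from the text).
QualityControl adapt(QualityControl c, double measured_fps, double target_fps) {
    if (measured_fps < target_fps) {
        c.image_sampling_step_px    += 1;
        c.hull_ray_sampling_step_px += 1;
    }
    return c;
}
```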

5. CONCLUSIONS

Coliseum creates an immersive experience by building dynamic, 3D user models, embedding them in a shared, virtual space through which users are free to move, and generating unique views of the space for each user. The views convey reciprocal gaze and can be made responsive to head movements. Interactive performance is achieved through streamlined image processing and a software framework tuned for streaming media applications. This represents the first implementation of an immersive collaboration system supporting an arbitrary number of users and aimed at three-dimensional realism. While the possibility of such systems has often been discussed, actual implementations have been incomplete, operating only one-way, using cartoon avatars, or requiring substantial special-purpose hardware. Employing commodity PCs and simple video cameras, we have run fully symmetric Coliseum sessions with as many as ten users.

Instrumenting the system with timing recorders enabled precise post hoc as well as on-the-fly performance measurement. This tooling permits review of system performance, as described, for computational assessment and restructuring and, as proposed here, dynamic system adjustment to attain required levels of service.

6. FUTURE WORK

Coliseum continues to evolve to meet a new set of goals. As it stands, it presents an effective mechanism for the proverbial "talking heads" videoconferencing, with the twist that the participants are 3D renderings of themselves and the environment is synthetic. Realistically, there are numerous developments that remain before this could be considered a viable alternative to travel for collaborative remote conferencing. Obvious improvements include increasing the frame rate, reducing latency, raising the quality at which people are displayed, and reconfiguring computation to enable more advanced features (such as head tracking). We are addressing these advancements in several ways:

—Frame Rate. Development of a multicamera VGA capture system that streams synchronized video from two dozen or more cameras at 30 Hz [Baker and Tanguay 2004].


—Latency. A more effective camera interface (for the above) that simplifies frame-set organization and reduces camera latency to its minimum.
—Display Quality. Higher-resolution capture through multicamera integration [Baker and Tanguay 2004]. In addition, we may raise the camera imaging resolution from VGA to XGA in the near future.
—Computation. Migration of compute-intensive image-display operations to PC graphics processors where possible. This will free up resources to let us do more of the operations that will increase the quality of the user's experience (e.g., obtaining better geometry, tracking heads, or correcting gaze direction).

Objects, documents, and elements of the personal environment (such as whiteboards) also play an important role in collaborative interactions. A recent extensive survey of videoconferencing [Hirsh et al. 2004] indicates the importance of providing a high-quality experience, one that will lift acceptance above that of audio conferencing for remote-participation meetings. The survey emphasizes several needs: ease of use, system reliability, high video quality, and provision of a general environmental context, including the ability to share work-related objects. Although these have all been goals of our Coliseum work, it is the latter point, most particularly, that has been influencing our current direction. While always aiming to integrate imaged artifacts (documents, prototype boards, etc.) into the shared virtual space of our heads-and-shoulders depiction, we are now moving toward providing full body and workspace coverage as well. At the same time, we choose to support the inclusion of multiple participants at a site.

These changes have major implications for computational performance and for the imaging and display sides of our conferencing: we need larger images, with higher resolution in capture and display, and will move to include projection to accommodate the increased scale of presentation needed. This direction will continue to put heavy demand on computational facilities, and it increases the demand for a thorough understanding of performance issues and careful allocation of overworked resources. Advancing the technology base of our collaborative videoconferencing effort must proceed on a foundation that includes both innovative design of algorithms and devices and transparent mechanisms for instrumenting, monitoring, and adapting the system, while maintaining constant attention on the needs and preferences of the user. We have found that augmenting a networked interactive application like Coliseum with monitoring instrumentation is critical to understanding its behavior and dynamic structure.

ACKNOWLEDGMENTS

Mike Harville, John MacCormack, the late David Marimont, Greg Slabaugh, Kei Yuasa, and Mat Hans made important contributions to this project. Sandra Hirsh and Abigail Sellen directed our user-studies enquiries. Wojciech Matusik, Chris Buehler, and Leonard McMillan provided guidance on the original IBVH system from MIT.

REFERENCES

BAKER, H. H., TANGUAY, D., SOBEL, I., GELB, D., GOSS, M. E., CULBERTSON, W. B., AND MALZBENDER, T. 2002. The Coliseum immersive teleconferencing system. In Proceedings of the International Workshop on Immersive Telepresence (Juan Les Pins, France, Dec.). ACM, New York.
BAKER, H. H., BHATTI, N., TANGUAY, D., SOBEL, I., GELB, D., GOSS, M. E., MACCORMICK, J., YUASA, K., CULBERTSON, W. B., AND MALZBENDER, T. 2003. Computation and performance issues in Coliseum, an immersive teleconferencing system. In Proceedings of the 11th ACM International Conference on Multimedia (Berkeley, Calif., Nov.). ACM, New York, 470–479.
BAKER, H. H. AND TANGUAY, D. 2004. Graphics-accelerated panoramic mosaicking from a video camera array. In Proceedings of the Vision, Modeling, and Visualization Workshop (Stanford, Calif., Nov.). IOS Press, 133–140.
BUCK, J., HA, S., LEE, E. A., AND MESSERSCHMITT, D. G. 2001. Ptolemy: A framework for simulating and prototyping heterogeneous systems. In Readings in Hardware/Software Co-Design, The Morgan-Kaufmann Systems on Silicon Series. Morgan-Kaufmann, San Francisco, Calif., 527–543.


CHEN, M. 2001. Design of a virtual auditorium. In Proceedings of the ACM International Conference on Multimedia (Ottawa, Ont., Canada, Sept.). ACM, New York, 19–28.
DEVERNAY, F. AND FAUGERAS, O. 1995. Automatic calibration and removal of distortion from scenes of structured environments. In Proceedings of SPIE 2567 (San Diego, Calif., July), 62–72.
GHARAI, L., PERKINS, C., RILEY, R., AND MANKIN, A. 2002. Large scale video conferencing: A digital amphitheater. In Proceedings of the 8th International Conference on Distributed Multimedia Systems (San Francisco, Calif., Sept.).
GORDON, R. AND TALLEY, S. 1999. Essential JMF: Java Media Framework. Prentice-Hall, Englewood Cliffs, N.J.
HENDERSON, P. AND MORRIS, J. H., JR. 1976. A lazy evaluator. In Proceedings of the 3rd ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. ACM, New York, 95–103.
HIRSH, S., SELLEN, A., AND BROKOPP, N. 2004. Why HP people do and don't use videoconferencing systems. External Tech. Rep. HPL-2004-140R1, Hewlett-Packard Laboratories.
LANIER, J. 2001. Virtually there. Scientific American (Apr.), 66–75.
LOHSE, M., REPPLINGER, M., AND SLUSALLEK, P. 2003. An open middleware architecture for network-integrated multimedia. Lecture Notes in Computer Science, vol. 2515/2002. Springer-Verlag, Heidelberg, Germany, 327–338.
LONGUET-HIGGINS, H. C. 1981. A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135.
MATUSIK, W., BUEHLER, C., RASKAR, R., GORTLER, S., AND MCMILLAN, L. 2000. Image-based visual hulls. In Proceedings of SIGGRAPH 2000. ACM, New York, 369–374.
NARAYANAN, R., RANDER, P., AND KANADE, T. 1998. Constructing virtual worlds using dense stereo. In Proceedings of the International Conference on Computer Vision (Bombay, India), 3–10.
PESCE, M. 2002. Programming Microsoft DirectShow for Digital Video, Television, and DVD. Microsoft Press.
POLLEFEYS, M. 1999. Self-calibration and metric 3D reconstruction from uncalibrated image sequences. Ph.D. dissertation, ESAT-PSI, K.U. Leuven.
PRINCE, S., CHEOK, A., FARBIZ, F., WILLIAMSON, T., JOHNSON, N., BILLINGHURST, M., AND KATO, H. 2002. Real-time 3D interaction for augmented and virtual reality. In Proceedings of SIGGRAPH 2002. ACM, New York, 238.
SCHREER, O., BRANDENBURG, N., ASKAR, S., AND TRUCCO, E. 2001. A virtual 3D video-conferencing system providing semi-immersive telepresence: A real-time solution in hardware and software. In Proceedings of the International Conference on eWork and eBusiness (Venice, Italy), 184–190.
SEITZ, S. AND DYER, C. 1997. Photorealistic scene reconstruction by voxel coloring. In Proceedings of the Computer Vision and Pattern Recognition Conference (Puerto Rico), 1067–1073.
SLABAUGH, G., SCHAFER, R., AND HANS, M. 2002. Image-based photo hulls. In Proceedings of the 1st International Symposium on 3D Processing, Visualization, and Transmission (Padua, Italy), 704–708.
TANGUAY, D., GELB, D., AND BAKER, H. H. 2004. Nizza: A framework for developing real-time streaming multimedia applications. Tech. Rep. HPL-2004-132, Hewlett-Packard Laboratories.
WILCOX, J. 2000. Videoconferencing: The Whole Picture. Telecom Books, New York. ISBN 1-57820-054-7.
ZHANG, Z. 2000. A flexible new technique for camera calibration. IEEE Trans. Patt. Anal. Mach. Intell. 22, 11, 1330–1334.

Received January 2005; accepted January 2005
