ACM Transactions on Database Systems (March) [Volume 30, Number 1]


Exchanging Intensional XML Data

TOVA MILO, INRIA and Tel-Aviv University
SERGE ABITEBOUL, INRIA
BERND AMANN, Cedric-CNAM and INRIA-Futurs
OMAR BENJELLOUN and FRED DANG NGOC, INRIA

XML is becoming the universal format for data exchange between applications. Recently, the emergence of Web services as standard means of publishing and accessing data on the Web introduced a new class of XML documents, which we call intensional documents. These are XML documents where some of the data is given explicitly while other parts are defined only intensionally by means of embedded calls to Web services. When such documents are exchanged between applications, one has the choice of whether or not to materialize the intensional data (i.e., to invoke the embedded calls) before the document is sent. This choice may be influenced by various parameters, such as performance and security considerations. This article addresses the problem of guiding this materialization process. We argue that—like for regular XML data—schemas (à la DTD and XML Schema) can be used to control the exchange of intensional data and, in particular, to determine which data should be materialized before sending a document, and which should not. We formalize the problem and provide algorithms to solve it. We also present an implementation that complies with real-life standards for XML data, schemas, and Web services, and is used in the Active XML system. We illustrate the usefulness of this approach through a real-life application for peer-to-peer news exchange.

Categories and Subject Descriptors: H.2.5 [Database Management]: Heterogeneous Databases

General Terms: Algorithms, Languages, Verification

Additional Key Words and Phrases: Data exchange, intensional information, typing, Web services, XML

This work was partially supported by EU IST project DBGlobe (IST 2001-32645). This work was done while T. Milo, O. Benjelloun, and F. D. Ngoc were at INRIA-Futurs.
Authors' current addresses: T. Milo, School of Computer Science, Tel Aviv University, Ramat Aviv, Tel Aviv 69978, Israel; email: [email protected]; S. Abiteboul and B. Amann, INRIA-Futurs, Parc Club Orsay-University, 4 Rue Jean Monod, 91893 Orsay Cedex, France; email: {serge,abiteboul, bernd.amann}@inria.fr; O. Benjelloun, Gates Hall 4A, Room 433, Stanford University, Stanford, CA 94305-9040; email: [email protected]; F. D. Ngoc, France Telecom R&D and LRI, 38–40, rue du Général Leclerc, 92794 Issy-Les Moulineaux, France; email: Frederic.dangngoc@rd.francetelecom.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0001 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 1–40.


1. INTRODUCTION

XML, a self-describing semistructured data model, is becoming the standard format for data exchange between applications. Recently, the use of XML documents where some parts of the data are given explicitly, while others consist of programs that generate data, started gaining popularity. We refer to such documents as intensional documents, since some of their data are defined by programs. We term materialization the process of evaluating some of the programs included in an intensional XML document and replacing them by their results. The goal of this article is to study the new issues raised by the exchange of such intensional XML documents between applications, and, in particular, how to decide which parts of the data should be materialized before the document is sent and which should not.

This work was developed in the context of the Active XML system [Abiteboul et al. 2002, 2003b] (also see the Active XML homepage at http://www-rocq.inria.fr/verso/Gemo/Projects/axml). The latter is centered around the notion of Active XML documents, which are XML documents where parts of the content are explicit XML data whereas other parts are generated by calls to Web services. In the present article, we are only concerned with certain aspects of Active XML that are also relevant to many other systems. Therefore, we use the more general term of intensional documents to denote documents with such features.

To understand the problem, let us first highlight an essential difference between the exchange of regular XML data and that of intensional XML data. In frameworks such as those of Sun1 or PHP,2 intensional data is provided by programming constructs embedded inside documents. Upon request, all the code is evaluated and replaced by its result to obtain a regular, fully materialized HTML or XML document, which is then sent. In other terms, only extensional data is exchanged.

This simple scenario has recently changed due to the emergence of standards for Web services such as SOAP, WSDL,3 and UDDI.4 Web services are becoming the standard means to access, describe, and advertise valuable, dynamic, up-to-date sources of information over the Web. Recent frameworks such as Active XML, but also Macromedia MX5 and Apache Jelly,6 started allowing for the definition of intensional data by embedding calls to Web services inside documents. This new generation of intensional documents has a property that we view here as crucial: since Web services can essentially be called from everywhere on the Web, one does not need to materialize all the intensional data before sending a document. Instead, a more flexible data exchange paradigm is possible, where the sender sends an intensional document, and gives the receiver the freedom

1 See Sun's Java server pages (JSP) online at http://java.sun.com/products/jsp.
2 See the PHP hypertext preprocessor at http://www.php.net.
3 See the W3C Web services activity at http://www.w3.org/2002/ws.
4 UDDI stands for Universal Description, Discovery, and Integration of Business for the Web. Go online to http://www.uddi.org.
5 Macromedia Coldfusion MX. Go online to http://www.macromedia.com/.
6 Jelly: Executable xml. Go online to http://jakarta.apache.org/commons/sandbox/jelly.



to materialize the data if and when needed. In general, one can use a hybrid approach, where some data is materialized by the sender before the document is sent, and some by the receiver.

As a simple example, consider an intensional document for the Web page of a local newspaper. It may contain some extensional XML data, such as its name, address, and some general information about the newspaper, and some intensional fragments, for example, one for the current temperature in the city, obtained from a weather forecast Web service, and a list of current art exhibits, obtained, say, from the TimeOut local guide. In the traditional setting, upon request, all calls would be activated, and the resulting fully materialized document would be sent to the client. We allow for more flexible scenarios, where the newspaper reader could also receive a (smaller) intensional document, or one where some of the data is materialized (e.g., the art exhibits) and some is left intensional (e.g., the temperature). A benefit that can be seen immediately is that the user is now able to get the weather forecast whenever she pleases, just by activating the corresponding service call, without having to reload the whole newspaper document.

Before getting to the description of the technical solution we propose, let us first see some of the considerations that may guide the choice of whether or not to materialize some intensional data:

— Performance. The decision of whether to execute calls before or after the data transfer may be influenced by the current system load or the cost of communication. For instance, if the sender's system is overloaded or communication is expensive, the sender may prefer to send smaller files and delegate as much materialization of the data as possible to the receiver. Otherwise, it may decide to materialize as much data as possible before transmission, in order to reduce the processing on the receiver's side.

— Capabilities. Although Web services may in principle be called remotely from everywhere on the Internet, it may be the case that the particular receiver of the intensional document cannot perform them, for example, a newspaper reader's browser may not be able to handle the intensional parts of a document. And even if it does, the user may not have access to a particular service, for example, because of the lack of access rights. In such cases, it is compulsory to materialize the corresponding information before sending the document.

— Security. Even if the receiver is capable of invoking service calls, she may prefer not to do so for security reasons. Indeed, service calls may have side effects. Receiving intensional data from an untrusted party and invoking the calls embedded in it may thus lead to severe security violations. To overcome this problem, the receiver may decide to refuse documents with calls to services that do not belong to some specific list. It is then the responsibility of a helpful sender to materialize all the data generated by such service calls before sending the document.

— Functionalities. Last but not least, the choice may be guided by the application. In some cases, for example, for a UDDI-like service registry, the origin of the information is what is truly requested by the receiver, and hence service


Fig. 1. Data exchange scenario for intensional documents.

calls should not be materialized. In other cases, one may prefer to hide the true origin of the information, for example, for confidentiality reasons, or because it is an asset of the sender, so the data must be materialized. Finally, calling services might also involve some fees that should be paid by one or the other party.

Observe that the data returned by a service may itself contain some intensional parts. As a simple example, TimeOut may return a list of 10 exhibits, along with a service call to get more. Therefore, the decision of materializing some information or not is inherently a recursive process. For instance, for clients who cannot handle intensional documents, the newspaper server needs to recursively materialize the whole document before sending it.

How can one guide the materialization of data? For purely extensional data, schemas (like DTD and XML Schema) are used to specify the desired format of the exchanged data. Similarly, we use schemas to control the exchange of intensional data and, in particular, the invocation of service calls. The novelty here is that schemas also entail information about which parts of the data are allowed to be intensional and which service calls may appear in the documents, and where. Before sending information, the sender must check if the data, in its current structure, matches the schema expected by the receiver. If not, the sender must perform the required calls for transforming the data into the desired structure, if this is possible.

A typical such scenario is depicted in Figure 1. The sender and the receiver, based on their personal policies, have agreed on a specific data exchange schema. Now, consider some particular data t to be sent (represented by the grey triangle in the figure). In fact, this document represents a set of equivalent, increasingly materialized, pieces of information—the documents that may be obtained from t by materializing some of the service calls (q, g, and f).


Among them, the sender must find at least one document conforming to the exchange schema (e.g., the dashed one) and send it.

This schema-based approach is particularly relevant in the context of Web services, since their input parameters and their results must match particular XML Schemas, which are specified in their WSDL descriptions. The techniques presented in this article can be used to achieve that.

The contributions of the article are as follows:

(1) We provide a simple but flexible XML-based syntax to embed service calls in XML documents, and introduce an extension of XML Schema for describing the required structure of the exchanged data. This consists in adding new type constructors for service call nodes. In particular, our typing distinguishes between accepting a concrete type, for example, a temperature element, and accepting a service call returning some data of this type, for example, () → temperature.

(2) Given a document t and a data exchange schema, the sender needs to decide which data has to be materialized. We present algorithms that, based on schema and data analysis, find an effective sequence of call invocations, if such a sequence exists (or detect a failure if it does not). The algorithms provide different levels of guarantee of success for this rewriting process, ranging from "sure" success to a "possible" one.

(3) At a higher level, in order to check compatibility between applications, the sender may wish to verify that all the documents generated by its application may be sent to the target receiver, which involves comparing two schemas. We show that this problem can be easily reduced to the previous one.

(4) We illustrate the flexibility of the proposed paradigm through a real-life application: peer-to-peer news syndication. We will show that Web services can be customized by using and enforcing several exchange schemas.

As explained above, our algorithms find an effective sequence of call invocations, if one exists, and detect failure otherwise. In a more general context, an error may arise because of type discrepancies between the caller and the receiver. One may then want to modify the data and convert it to the right structure, using data translation techniques such as those provided by Cluet et al. [1998] and Doan et al. [2001]. As a simple example, one may need to convert a temperature from Celsius degrees to Fahrenheit. In our context, this would amount to plugging (possibly automatically) intermediary external services to perform the needed data conversions. Existing data conversion algorithms can be adapted to determine when conversion is needed. Our typing algorithms can be used to check that the conversions lead to matching types. Data conversion techniques are complementary and could be added to our framework. But the focus here is on partially materializing the given data to match the specified schema.

The core technique of this work is based on automata theory. For presentation reasons, we first detail a simplified version of the main algorithm. We then describe a more dynamic, optimized one, that is based on the same core idea and is used in our implementation.


Although the problems studied in this article are related to standard typing problems in programming languages [Mitchell 1990], they differ here due to the regular expressions present in XML schemas. Indeed, the general problem that will be formalized here was recently shown to be undecidable by Muscholl et al. [2004]. We will introduce a restriction that is practically founded, and leads to a tractable solution.

All the ideas presented here have been implemented and tested in the context of the Active XML system [Abiteboul et al. 2002] (see also the Active XML homepage at http://www-rocq.inria.fr/verso/Gemo/Projects/axml). This system provides persistent storage for intensional documents with embedded calls to Web services, along with active features to automatically trigger these services and thus enrich/update the intensional documents. Furthermore, it allows developers to declaratively specify Web services that support intensional documents as input and output parameters. We used the algorithms described here to implement a module that controls the types of documents being sent to (and returned by) these Web services. This module is in charge of materializing the appropriate data fragments to meet the interface requirements.

In the following, we assume that the reader is familiar with XML and its typing languages (DTD or XML Schema). Although some basic knowledge about SOAP and WSDL might be helpful to understand the details of the implementation, it is not necessary.

The article is organized as follows: Section 2 describes a simple data model and schema specification language and formalizes the general problem. Additional features for a richer data model that facilitate the design of real-life applications are also introduced informally. Section 3 focuses on difficulties that arise in this context, and presents the key restriction that we consider. It also introduces the notions of "safe" and "possible" rewritings, which are studied in Sections 4 and 5, respectively. The problem of checking compatibility between intensional schemas is considered in Section 6. The implementation is described in Section 7. Then, we present in Section 8 an application of the algorithms to Web services customization, in the context of peer-to-peer news syndication. The last section studies related works and concludes the article.

2. THE MODEL AND THE PROBLEM

To simplify the presentation, we start by formalizing the problem using a simple data model and a DTD-like schema specification. More precisely, we define the notion of rewriting, which corresponds to the process of invoking some service calls in an intensional document, in order to make it conform to a given schema. Once this is clear, we explain how things can be extended to provide the features ignored by the first simple model, and in particular we show how richer schemas are taken into account.

2.1 The Simple Model

We first define documents, then move to schemas, before formalizing the key notion of rewritings, and stating the results obtained in this setting, which will be detailed in the following sections.


Fig. 2. An intensional document before/after a call.
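The figure can be summarized with a small sketch. Assuming a hypothetical concrete syntax in which a function node is an element carrying the three SOAP-related attributes described in Section 7.1 (all element and attribute names below are illustrative, not the system's exact syntax), a document before and after the invocation of an embedded call might look as follows:

<!-- Before the call: the temperature is intensional (a function node). -->
<newspaper>
  <title>Local News</title>
  <date>2005-03-01</date>
  <call methodName="getTemperature"
        endpointURL="http://forecast.example.com/soap"
        namespaceURI="http://forecast.example.com/weather">
    <params><city>Paris</city></params>
  </call>
</newspaper>

<!-- After the call: the function node has been replaced by its result. -->
<newspaper>
  <title>Local News</title>
  <date>2005-03-01</date>
  <temp>15</temp>
</newspaper>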

2.1.1 Simple Intensional XML Documents. We model intensional XML documents as ordered labeled trees consisting of two types of nodes: data nodes and function nodes. The latter correspond to service calls. We assume the existence of some disjoint domains: N of nodes, L of labels, F of function names,7 and D of data values. In the sequel we use v, u, w to denote nodes, a, b, c to denote labels, and f, g, q to denote function names.

Definition 2.1. An intensional document d is an expression (T, λ), where T = (N, E,



Function nodes have three attributes that provide the necessary information to call a service using the SOAP protocol: the URL of the server, the method name, and the associated namespace. These attributes uniquely identify the called function, and are isomorphic to the function name in the abstract model. In order to define schemas for intensional documents, we use XML Schemaint , which is an extension of XML Schema. To describe intensional data, XML Schemaint introduces functions and function patterns. These are declared and used like element definitions in the standard XML Schema language. In particular, it is possible to declare functions and function patterns globally, and reference them inside complex type definitions (e.g., sequence, choice, all). We give next the XML representation of function patterns that are described by a combination of five optional attributes and two optional subelements: params and return:

Contents: (params?, return?)
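Spelled out, a global function pattern declaration has roughly the following shape; the element name and namespace prefix are illustrative, but the five attributes and the two subelements are the ones discussed in the text:

<!-- Global declaration: all five attributes and both subelements are optional. -->
<xsi:functionPattern
    id="somePattern"
    methodName="checkCandidate"
    endpointURL="http://matcher.example.com/soap"
    namespaceURI="http://matcher.example.com/ns">
  <params>
    <!-- expected input types, themselves written in XML Schemaint -->
  </params>
  <return>
    <!-- expected result type, possibly again with intensional parts -->
  </return>
</xsi:functionPattern>

<!-- Used inside a complex type definition, by reference: -->
<xsi:functionPattern ref="somePattern"/>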

The id attribute identifies the function pattern, which can then be referenced by another function pattern using the ref attribute. Attributes methodName, endpointURL, and namespaceURI designate the SOAP Web service that implements the Boolean predicate used to check whether a particular function matches the function pattern. It takes as input parameter the SOAP identifiers of the function to validate. As a convention, when these parameters are omitted, the predicate returns true for all functions. The Contents detail the function signature, that is, the expected types for the input parameters and the result of the function. These types are also defined using XML Schemaint , and may contain intensional parts. To illustrate this syntax, consider the function pattern Forecast, which captures any function with one input parameter of element type city, returning an element of type temp. It is simply described by
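A sketch of this declaration, in the same illustrative syntax (only the signature is given, so by the convention above the matching predicate accepts every function), is:

<xsi:functionPattern id="Forecast">
  <params>
    <xsd:element ref="city"/>
  </params>
  <return>
    <xsd:element ref="temp"/>
  </return>
</xsi:functionPattern>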







Functions are declared in a similar way to function patterns, by using elements of type function. The main difference is that the three attributes methodName, endpointURL, and namespaceURI directly identify the function that can be used. As mentioned already, function and function pattern declarations may be used at any place where regular element and type declarations are allowed. For example, a newspaper element with structure title.date.(Forecast | temp). (TimeOut | exhibit ∗ ) may be defined in XML Schemaint as
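A sketch of this declaration is given below; the intensional constructs use the same illustrative prefix as above, TimeOut is assumed to be a globally declared function, and the remaining parts are plain XML Schema:

<xsd:element name="newspaper">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="title"/>
      <xsd:element ref="date"/>
      <xsd:choice>
        <xsi:functionPattern ref="Forecast"/>
        <xsd:element ref="temp"/>
      </xsd:choice>
      <xsd:choice>
        <xsi:function ref="TimeOut"/>
        <xsd:element ref="exhibit" minOccurs="0" maxOccurs="unbounded"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>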









Note that just as for documents, we use a different namespace (embodied here by the use of the prefix xsi) to differentiate the intensional part of the schema from the rest of the declarations. Similarly to XML Schema, we require definitions to be unambiguous (see footnote 10)—namely, when parsing a document, for each element and each function node, the subelements can be sequentially assigned a corresponding type/function pattern in a deterministic way by looking only at the element/function name.

One of the major features of the WSDL language is to describe the input and output types of Web service functions using XML Schema. We extend WSDL in the obvious way, by simply allowing these types to describe intensional data, using XML Schemaint. Finally, XML Schemaint allows WSDL or WSDLint descriptions to be referenced in the definition of a function or function pattern, instead of defining the signature explicitly (using the WSDLSignature attribute).

7.2 The ActiveXML System

ActiveXML is a peer-to-peer system that is centered around intensional XML documents. Each peer contains a repository of intensional documents, and provides some active features to enrich them by automatically triggering the function calls they contain. It also provides some Web services, defined declaratively as queries/updates on top of the repository documents. All the exchanges


between the ActiveXML peers, and with other Web service providers/consumers use the SOAP protocol.

The important point here is that both the services that an ActiveXML peer invokes and those that it provides potentially accept intensional input parameters and return intensional results. Calls to "regular" Web services should comply with the input and output types defined in their WSDL description. Similarly, when calling an ActiveXML peer, the parameters of the call should comply with its WSDL. The role of the Schema Enforcement module is (i) to verify whether the call parameters conform to the WSDLint description of the service, (ii) if not, to try to rewrite them into the required structure, and (iii) if this fails, to report an error. Similarly, before an ActiveXML service returns its answer, the module performs the same three steps on the returned data.

7.3 The Schema Enforcement Module

To implement this module, we needed a parser of XML Schemaint. We had the choice between extending an existing XML Schema parser based on DOM level 3 or developing an implementation from scratch [Ngoc 2002]. Whereas the first solution seems preferable, we followed the second one because, at the time we started the implementation, the available (free) software we tried (Apache Xerces16 and Oracle Schema Processor17) appeared to have limited extensibility. Our parser relies on a standard event-based SAX parser.16 It does not cover all the features of XML Schema, but implements the important ones such as complex types, element/type references, and schema import. It does not check the validity of all simple types, nor does it deal with inheritance or keys. However, these features could be added rather easily to our code.

The schema enforcement algorithm we implemented in the module follows the main lines of the algorithm in Section 4, and in particular the same three stages: (1) checking function parameters recursively, starting from the innermost ones and going out, (2) traversing, in each iteration, the tree top down, and (3) rewriting the children of every node encountered in this traversal.

Steps (1) and (2) are done as described in Section 4. For step (2), recall from above that XML Schemaint schemas are deterministic. This is precisely what enables the top-down traversal, since the possible type of elements/functions can be determined locally. For step (3), our implementation uses an efficient variant of the algorithm of Section 4. While the latter starts by constructing all the required automata and only then analyzes the resulting graph, our implementation builds the automaton A× in a lazy manner, starting from the initial state, and constructing only the needed parts. The construction is pruned whenever a node can be marked directly, without looking at the remaining, unexplored,

16 The Xerces Java parser. Go online to http://xml.apache.org/xerces-j/.
17 The Oracle XML developer's kit for Java. Go online to http://otn.oracle.com/tech/xml/.



Fig. 15. The pruned automaton.

branches. The two main ideas that guide this process are the following:

— Sink nodes. Some accepting states in A are "sink" nodes: once you get there, you cannot get out (e.g., p6 in Figures 5 and 7). For the Cartesian product automaton A×, this means that all paths starting from such nodes are marked. When such a node is reached in the construction of A×, we can immediately mark it and prune all its outgoing branches. For example, in Figure 15, the top left shaded area illustrates which parts of the Cartesian product automaton of Figure 6 can be pruned. Nodes [q3, p6] and [q7, p6] contain the sink node p6. They can immediately be declared as marked, and the rest of the construction (the left shaded area) need not be constructed.

— Marked nodes. Once a node is known to be marked, there is no point in exploring its outgoing branches any further. To continue with the above example, once the node [q7, p6] gets marked, so does [q7, p3] that points to it. Hence, there is no need to explore the other outgoing branches of [q7, p3] (the shaded area on the right).

While this dynamic variant of the algorithm has the same worst-case complexity as the algorithm of Figure 3, it saves a lot of unnecessary computation in practice. Details are available in Ngoc [2002].

8. PEER-TO-PEER NEWS SYNDICATION

In this section, we will illustrate the exchange of intensional documents, and the usefulness of our schema-based rewriting techniques through a real-life application: peer-to-peer news syndication. This application was recently demonstrated in Abiteboul et al. [2003a].

The setting is the one shown in Figure 16. We consider a number of news sources (newspaper Web sites, or individual "Weblogs") that regularly publish news stories. They share this information with others in a standard XML format, called RSS.18 Clients can periodically query/retrieve news from the sources they are interested in, or subscribe to news feeds. News aggregators are special peers that know of several news sources and let other clients ask queries to and/or discover the news sources they know.

18 RSS 1.0 specification. Go online to http://purl.org/rss/1.0.


Fig. 16. Peer-to-peer news exchange.

All interactions between news sources, aggregators, and clients are done through calls to Web services they provide. Intensional documents can be exchanged both when passing parameters to these Web services, and in the answers they return. These exchanges are controlled by XML schemas, and documents are rewritten to match these schemas, using the safe/possible rewriting algorithms detailed in the previous sections.

This mechanism is used to provide several versions of a service, without changing its implementation, merely by using different schemas for its input parameters and results. For instance, the same querying service is easily customized to be used by distinct kinds of participants, for example, various client types or aggregators, with different requirements on the type of its input/output. More specifically, for each kind of peer we consider (namely, news sources and aggregators), we propose a set of basic Web services, with intensional output and input parameters, and show how they can be customized for different clients via schema-based rewriting. We first consider the customization of intensional outputs, then the one of intensional inputs.

8.1 Customizing Intensional Outputs

News sources provide news stories, using a basic Web service named getStory, which retrieves a story based on its identifier, and has the following signature:
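A schematic rendering of such a signature (input: a story identifier; output: a fully extensional story), using illustrative wrapper element names rather than the service's actual WSDL, is:

<!-- Input: the identifier of the requested story. -->
<xsd:element name="getStoryRequest">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="storyId" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<!-- Output: a plain story element; no function patterns, hence fully extensional. -->
<xsd:element name="getStoryResponse">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="story"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>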






Note that the output of this service is fully extensional. News sources also allow users to search for news items by keywords,19 using the following service:
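A schematic signature in the same style (again with illustrative wrapper names), taking a keyword and returning a list of type ItemList2, could be:

<xsd:element name="getNewsAboutRequest">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="keyword" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<xsd:element name="getNewsAboutResponse">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="items" type="ItemList2"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>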







This service returns an RSS list of news items, of type ItemList2, where the items are given extensionally, except for the story, which can be intensional. The definition of the corresponding function pattern, intensionalStory is omitted.
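A sketch of what the ItemList2 type could look like, assuming RSS-style item fields (the field names and the exact placement of the intensionalStory pattern are illustrative), is:

<xsd:complexType name="ItemList2">
  <xsd:sequence>
    <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="link" type="xsd:string"/>
          <!-- the story is either given extensionally or as a matching call -->
          <xsd:choice>
            <xsd:element ref="story"/>
            <xsi:functionPattern ref="intensionalStory"/>
          </xsd:choice>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </xsd:sequence>
</xsd:complexType>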











A fully extensional variant of this service, aimed for instance at PDAs that download news for offline reading, is easily provided by employing the Schema Enforcement module to rewrite the previous output to one that complies with a fully extensional ItemList3 type, similar to the one above, except for the story that has to be extensional.

19 More complex query languages, such as the one proposed by Edutella could also be used (go online to http://edutella.jxta.org).


A more complex scenario allows readers to specify a desired output type at call time, as a parameter of the service call. If there exists a rewriting of the output that matches this schema, it will be applied before sending the result, otherwise an error message will be returned. Aggregators act as “superpeers” in the network. They know a number of news sources they can use to answer user queries. They also know other aggregators, which can relay the queries to additional news sources and other aggregators, transitively. Like news sources, they provide a getNewsAbout Web service, but allow for a more intensional output, of type ItemList, where news items can be either extensional or intensional. In the latter case they must match the intensionalNews function pattern, whose definition is omitted.
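A sketch of the ItemList type in the same spirit (illustrative names): each entry is either an extensional item or an embedded call matching intensionalNews:

<xsd:complexType name="ItemList">
  <xsd:choice minOccurs="0" maxOccurs="unbounded">
    <xsd:element ref="item"/>
    <xsi:functionPattern ref="intensionalNews"/>
  </xsd:choice>
</xsd:complexType>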





When queried by simple news readers, the answer is rewritten, depending on whether the reader is an RSS client or a PDA, into an ItemList2 or ItemList3 version, respectively. On the other hand, when queried by other aggregators, which prefer compact intensional answers that can easily be forwarded to other aggregators, no rewriting is performed: the answer remains as intensional as possible, preferably complying with the type below, which requires the information to be intensional.
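A sketch of such a type (its name is invented for the sketch), which admits only calls matching the intensionalNews pattern and no extensional items, is:

<xsd:complexType name="IntensionalItemList">
  <xsd:sequence>
    <xsi:functionPattern ref="intensionalNews" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>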



Note also that aggregators may have different capabilities. For instance, some of them may not be able to recursively invoke the service calls they get in intensional answers. This is captured by having them supply, as an input parameter, a precise type for the answer of getNewsAbout, that matches their capabilities (e.g., return me only service calls that return extensional data).

8.2 Intensional Input

So far, we considered the intensional output of services. To illustrate the power of intensional input parameters, we define a continuous version of the getNewsAbout service provided by news sources and aggregators. Clients call this service only once, to subscribe to a news feed. Then, they periodically get new information that matches their query (a dual service exists, to unsubscribe). Here, the input parameter is allowed to be given intensionally,


so that the service provider can probe it, adjusting the answer to the parameter’s current value. For instance, consider a mobile user whose physical location changes, and wants to get news about the town she is visiting. The zip code of this town can be provided by a Web service running on her device, namely a GPS service. A call to this service will be passed as an intensional query parameter, and will be called by the news source in order to periodically send her the relevant local information. This continuous news service is actually implemented using a wrapper around a noncontinuous getNewsAbout service, calling the latter periodically with the keyword parameter it received in the subscription. Since getNewsAbout doesn’t accept an intensional input parameter, the schema enforcement module rewrites the intensional parameter given in the subscription every time it has to be called. 8.3 Demonstration Setting To demonstrate this application [Abiteboul et al. 2003a], news sources were built as simple wrappers around RSS files provided by news websites such as Yahoo!News, BBC Word, the New York Times, and CNN. The news from these sources could also be queried through two aggregators providing the GetNewsAbout service, but customized with different output schemas. The customization of intensional input parameters was demonstrated using a continuous service, as explained above, by providing a call to a getFavoriteKeyword service as a parameter for the subscription. 9. CONCLUSION AND RELATED WORK As mentioned in the Introduction, XML documents with embedded calls to Web services are already present in several existing products. The idea of including function calls in data is certainly not a new one. Functions embedded in data were already present in relational systems [Molina et al. 2002] as stored procedures. Also, method calls form a key component of object-oriented databases [Cattell 1996]. In the Web context, scripting languages such as PHP (see footnote 2) or JSP (see footnote 1) have made popular the integration of processing inside HTML or XML documents. Combined with standard database interfaces such as JDBC and ODBC, functions are used to integrate results of queries (e.g., SQL queries) into documents. A representative example for this is Oracle XSQL (see footnote 17). Embedding Web service calls in XML documents is also done in popular products such as Microsoft Office (Smart Tags) and Macromedia MX. While the static structure of such documents can be described by some DTD or XML Schema, our extension of XML Schema with function types is a first step toward a more precise description of XML documents embedding computation. Further work in that direction is clearly needed to better understand this powerful paradigm. There are a number of other proposals for typing XML documents, for example, Makoto [2001], Hosoya and Pierce [2000], and Cluet et al. [1998]. We selected XML Schema (see footnote 10) for several reasons. First, it is the standard recommended by the W3C for describing the structure of XML documents. Furthermore, it is the typing language used in WSDL ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


to define the signatures of Web services (see footnote 3). By extending XML Schema, we naturally introduce function types/patterns in WSDL service signatures. Finally, one aspect of XML Schema simplifies the problem we study, namely, the unambiguity of XML Schema grammars. In many applications, it is necessary to screen queries and/or results according to specific user groups [Candan et al. 1996]. More specifically for us, embedded Web service calls in documents that are exchanged may be a serious cause of security violation. Indeed, this was one of the original motivations for the work presented here. Controlling these calls by enforcing schemas for exchanged documents appeared to us as useful for building secure applications, and can be combined with other security and access models that were proposed for XML and Web services, for example, in Damiani et al. [2001] and WS-Security.20 However, further work is needed to investigate this aspect. The work presented here is part of the ActiveXML [Abiteboul et al. 2002, 2003b] (see also the Active XML homepage of the Web site: http://www.rocq. inria.fr/verso/Gemo/Projects/axml) project based on XML and Web services. We presented in this article what forms the core of the module that, in a peer, supports and controls the dialogue (via Web services) with the rest of the world. This particular module may be extended in several ways. First, one may introduce “automatic converters” capable of restructuring the data that is received to the format that was expected, and similarly for the data that is sent. Also, this module may be extended to act as a “negotiator” who could speak to other peers to agree with them on the intensional XML Schemas that should be used to exchange data. Finally, the module may be extended to include search capabilities, for example, UDDI style search (see footnote 4) to try to find services on the Web that provide some particular information. In the global ActiveXML project, research is going on to extend the framework in various directions. In particular, we are working on distribution and replication of XML data and Web services [Abiteboul et al. 2003a]. Note that when some data may be found in different places and a service may be performed at different sites, the choice of which data to use and where to perform the service becomes an optimization issue. This is related to work on distributed database systems [Ozsu and Valduriez 1999] and to distributed computing at large. The novel aspect is the ability to exchange intensional information. This is in spirit of Jim and Suciu [2001], which considers also the exchange of intensional information in a distributed query processing setting. Intensional XML documents nicely fit in the context of data integration, since an intensional part of an XML document may be seen as a view on some data source. Calls to Web services in XML data may be used to wrap Web sources [Garcia-Molina et al. 1997] or to propagate changes for warehouse maintenance [Zhuge et al. 1995]. Note that the control of whether to materialize data or not (studied here) provides some flexible form of integration that is a hybrid of the warehouse model (all is materialized) and the mediator model (nothing is). 20 The

WS-Security specification. Go online to http://www.ibm.com/webservices/library/ ws-secure/.



On the other hand, this is orthogonal to the issue of selecting the views to materialize in a warehouse, studied in, for example, Gupta [1997] and Yang et al. [1997].

To conclude, we mention some fundamental aspects of the problem we studied. Although the k-depth/left-to-right restriction is not limiting in practice and the algorithm we implemented is fast enough, it would be interesting to understand the complexity and decidability barriers of (variants of) the problem. As we mentioned already, many results were found by Muscholl et al. [2004]. Namely, they proved the undecidability of the general safe rewriting problem for a context-free target language, and provided tight complexity bounds for several restricted cases. We already mentioned the connection to type theory and the novelty of our work in that setting, coming from the regular expressions in XML Schemas. Typing issues in XML Schema have recently motivated a number of interesting works such as Milo et al. [2000], which are based on tree automata.

REFERENCES

ABITEBOUL, S., AMANN, B., BAUMGARTEN, J., BENJELLOUN, O., NGOC, F. D., AND MILO, T. 2003a. Schema-driven customization of Web services. In Proceedings of VLDB.
ABITEBOUL, S., BENJELLOUN, O., MANOLESCU, I., MILO, T., AND WEBER, R. 2002. Active XML: Peer-to-peer data and Web services integration (demo). In Proceedings of VLDB.
ABITEBOUL, S., BONIFATI, A., COBENA, G., MANOLESCU, I., AND MILO, T. 2003b. Dynamic XML documents with distribution and replication. In Proceedings of ACM SIGMOD.
CANDAN, K. S., JAJODIA, S., AND SUBRAHMANIAN, V. S. 1996. Secure mediated databases. In Proceedings of ICDE. 28–37.
CATTELL, R., Ed. 1996. The Object Database Standard: ODMG-93. Morgan Kaufman, San Francisco, CA.
CLUET, S., DELOBEL, C., SIMÉON, J., AND SMAGA, K. 1998. Your mediators need data conversion! In Proceedings of ACM SIGMOD. 177–188.
DAMIANI, E., DI VIMERCATI, S. D. C., PARABOSCHI, S., AND SAMARATI, P. 2001. Securing XML documents. In Proceedings of EDBT.
DOAN, A., DOMINGOS, P., AND HALEVY, A. Y. 2001. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of ACM SIGMOD. ACM Press, New York, NY, 509–520.
GARCIA-MOLINA, H., PAPAKONSTANTINOU, Y., QUASS, D., RAJARAMAN, A., SAGIV, Y., ULLMAN, J., AND WIDOM, J. 1997. The TSIMMIS approach to mediation: Data models and languages. J. Intel. Inform. Syst. 8, 117–132.
GUPTA, H. 1997. Selection of views to materialize in a data warehouse. In Proceedings of ICDT. 98–112.
HOPCROFT, J. E. AND ULLMAN, J. D. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
HOSOYA, H. AND PIERCE, B. C. 2000. XDuce: A typed XML processing language. In Proceedings of WebDB (Dallas, TX).
JIM, T. AND SUCIU, D. 2001. Dynamically distributed query evaluation. In Proceedings of ACM PODS. 413–424.
MAKOTO, M. 2001. RELAX (Regular Language description for XML). ISO/IEC Tech. Rep. ISO/IEC, Geneva, Switzerland.
MILO, T., SUCIU, D., AND VIANU, V. 2000. Typechecking for XML transformers. In Proceedings of ACM PODS. 11–22.
MITCHELL, J. C. 1990. Type systems for programming languages. In Handbook of Theoretical Computer Science: Volume B: Formal Models and Semantics, J. van Leeuwen, Ed. Elsevier, Amsterdam, The Netherlands, 365–458.


MOLINA, H., ULLMAN, J., AND WIDOM, J. 2002. Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs, NJ.
MUSCHOLL, A., SCHWENTICK, T., AND SEGOUFIN, L. 2004. Active context-free games. In Proceedings of the 21st Symposium on Theoretical Aspects of Computer Science (STACS '04; Le Comm, Montpellier, France, Mar. 25–27).
NGOC, F. D. 2002. Validation de documents XML contenant des appels de services. M.S. thesis. CNAM, DEA SIR (in French), University of Paris VI, Paris, France.
OZSU, T. AND VALDURIEZ, P. 1999. Principles of Distributed Database Systems (2nd ed.). Prentice-Hall, Englewood Cliffs, NJ.
SEGOUFIN, L. 2003. Personal communication.
YANG, J., KARLAPALEM, K., AND LI, Q. 1997. Algorithms for materialized view design in data warehousing environment. In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases. Morgan Kaufman Publishers, San Francisco, CA, 136–145.
ZHUGE, Y., GARCÍA-MOLINA, H., HAMMER, J., AND WIDOM, J. 1995. View maintenance in a warehousing environment. In Proceedings of ACM SIGMOD. 316–327.

Received October 2003; accepted March 2004


Progressive Skyline Computation in Database Systems

DIMITRIS PAPADIAS, Hong Kong University of Science and Technology
YUFEI TAO, City University of Hong Kong
GREG FU, JP Morgan Chase
BERNHARD SEEGER, Philipps University

The skyline of a d-dimensional dataset contains the points that are not dominated by any other point on all dimensions. Skyline computation has recently received considerable attention in the database community, especially for progressive methods that can quickly return the initial results without reading the entire database. All the existing algorithms, however, have some serious shortcomings which limit their applicability in practice. In this article we develop branch-and-bound skyline (BBS), an algorithm based on nearest-neighbor search, which is I/O optimal, that is, it performs a single access only to those nodes that may contain skyline points. BBS is simple to implement and supports all types of progressive processing (e.g., user preferences, arbitrary dimensionality, etc). Furthermore, we propose several interesting variations of skyline computation, and show how BBS can be applied for their efficient processing.

Categories and Subject Descriptors: H.2 [Database Management]; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Skyline query, branch-and-bound algorithms, multidimensional access methods

This research was supported by the grants HKUST 6180/03E and CityU 1163/04E from Hong Kong RGC and Se 553/3-1 from DFG. Authors’ addresses: D. Papadias, Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; email: [email protected]; Y. Tao, Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Hong Kong; email: [email protected]; G. Fu, JP Morgan Chase, 277 Park Avenue, New York, NY 10172-0002; email: [email protected]; B. Seeger, Department of Mathematics and Computer Science, Philipps University, Hans-Meerwein-Strasse, Marburg, Germany 35032; email: [email protected]. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.  C 2005 ACM 0362-5915/05/0300-0041 $5.00 ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 41–82.


Fig. 1. Example dataset and skyline.

1. INTRODUCTION

The skyline operator is important for several applications involving multicriteria decision making. Given a set of objects p1, p2, . . . , pN, the operator returns all objects pi such that pi is not dominated by another object pj. Using the common example in the literature, assume in Figure 1 that we have a set of hotels and for each hotel we store its distance from the beach (x axis) and its price (y axis). The most interesting hotels are a, i, and k, for which there is no point that is better in both dimensions. Borzsonyi et al. [2001] proposed an SQL syntax for the skyline operator, according to which the above query would be expressed as: [Select *, From Hotels, Skyline of Price min, Distance min], where min indicates that the price and the distance attributes should be minimized. The syntax can also capture different conditions (such as max), joins, group-by, and so on. For simplicity, we assume that skylines are computed with respect to min conditions on all dimensions; however, all methods discussed can be applied with any combination of conditions.

Using the min condition, a point pi dominates1 another point pj if and only if the coordinate of pi on any axis is not larger than the corresponding coordinate of pj. Informally, this implies that pi is preferable to pj according to any preference (scoring) function which is monotone on all attributes. For instance, hotel a in Figure 1 is better than hotels b and e since it is closer to the beach and cheaper (independently of the relative importance of the distance and price attributes). Furthermore, for every point p in the skyline there exists a monotone function f such that p minimizes f [Borzsonyi et al. 2001].

Skylines are related to several other well-known problems, including convex hulls, top-K queries, and nearest-neighbor search. In particular, the convex hull contains the subset of skyline points that may be optimal only for linear preference functions (as opposed to any monotone function). Böhm and Kriegel [2001] proposed an algorithm for convex hulls, which applies branch-and-bound search on datasets indexed by R-trees. In addition, several main-memory

1 According to this definition, two or more points with the same coordinates can be part of the skyline.


algorithms have been proposed for the case that the whole dataset fits in memory [Preparata and Shamos 1985]. Top-K (or ranked) queries retrieve the best K objects that minimize a specific preference function. As an example, given the preference function f (x, y) = x + y, the top-3 query, for the dataset in Figure 1, retrieves < i, 5 >, < h, 7 >, < m, 8 > (in this order), where the number with each point indicates its score. The difference from skyline queries is that the output changes according to the input function and the retrieved points are not guaranteed to be part of the skyline (h and m are dominated by i). Database techniques for top-K queries include Prefer [Hristidis et al. 2001] and Onion [Chang et al. 2000], which are based on prematerialization and convex hulls, respectively. Several methods have been proposed for combining the results of multiple top-K queries [Fagin et al. 2001; Natsev et al. 2001]. Nearest-neighbor queries specify a query point q and output the objects closest to q, in increasing order of their distance. Existing database algorithms assume that the objects are indexed by an R-tree (or some other data-partitioning method) and apply branch-and-bound search. In particular, the depth-first algorithm of Roussopoulos et al. [1995] starts from the root of the R-tree and recursively visits the entry closest to the query point. Entries, which are farther than the nearest neighbor already found, are pruned. The best-first algorithm of Henrich [1994] and Hjaltason and Samet [1999] inserts the entries of the visited nodes in a heap, and follows the one closest to the query point. The relation between skyline queries and nearest-neighbor search has been exploited by previous skyline algorithms and will be discussed in Section 2. Skylines, and other directly related problems such as multiobjective optimization [Steuer 1986], maximum vectors [Kung et al. 1975; Matousek 1991], and the contour problem [McLain 1974], have been extensively studied and numerous algorithms have been proposed for main-memory processing. To the best of our knowledge, however, the first work addressing skylines in the context of databases was Borzsonyi et al. [2001], which develops algorithms based on block nested loops, divide-and-conquer, and index scanning. An improved version of block nested loops is presented in Chomicki et al. [2003]. Tan et al. [2001] proposed progressive (or on-line) algorithms that can output skyline points without having to scan the entire data input. Kossmann et al. [2002] presented an algorithm, called NN due to its reliance on nearest-neighbor search, which applies the divide-and-conquer framework on datasets indexed by R-trees. The experimental evaluation of Kossmann et al. [2002] showed that NN outperforms previous algorithms in terms of overall performance and general applicability independently of the dataset characteristics, while it supports on-line processing efficiently. Despite its advantages, NN has also some serious shortcomings such as need for duplicate elimination, multiple node visits, and large space requirements. Motivated by this fact, we propose a progressive algorithm called branch and bound skyline (BBS), which, like NN, is based on nearest-neighbor search on multidimensional access methods, but (unlike NN) is optimal in terms of node accesses. We experimentally and analytically show that BBS outperforms NN (usually by orders of magnitude) for all problem instances, while ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 2. Divide-and-conquer.

incurring less space overhead. In addition to its efficiency, the proposed algorithm is simple and easily extendible to several practical variations of skyline queries. The rest of the article is organized as follows: Section 2 reviews previous secondary-memory algorithms for skyline computation, discussing their advantages and limitations. Section 3 introduces BBS, proves its optimality, and analyzes its performance and space consumption. Section 4 proposes alternative skyline queries and illustrates their processing using BBS. Section 5 introduces the concept of approximate skylines, and Section 6 experimentally evaluates BBS, comparing it against NN under a variety of settings. Finally, Section 7 concludes the article and describes directions for future work. 2. RELATED WORK This section surveys existing secondary-memory algorithms for computing skylines, namely: (1) divide-and-conquer, (2) block nested loop, (3) sort first skyline, (4) bitmap, (5) index, and (6) nearest neighbor. Specifically, (1) and (2) were proposed in Borzsonyi et al. [2001], (3) in Chomicki et al. [2003], (4) and (5) in Tan et al. [2001], and (6) in Kossmann et al. [2002]. We do not consider the sorted list scan, and the B-tree algorithms of Borzsonyi et al. [2001] due to their limited applicability (only for two dimensions) and poor performance, respectively. 2.1 Divide-and-Conquer The divide-and-conquer (D&C) approach divides the dataset into several partitions so that each partition fits in memory. Then, the partial skyline of the points in every partition is computed using a main-memory algorithm (e.g., Matousek [1991]), and the final skyline is obtained by merging the partial ones. Figure 2 shows an example using the dataset of Figure 1. The data space is divided into four partitions s1 , s2 , s3 , s4 , with partial skylines {a, c, g }, {d }, {i}, {m, k}, respectively. In order to obtain the final skyline, we need to remove those points that are dominated by some point in other partitions. Obviously all points in the skyline of s3 must appear in the final skyline, while those in s2 ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


are discarded immediately because they are dominated by any point in s3 (in fact s2 needs to be considered only if s3 is empty). Each skyline point in s1 is compared only with points in s3 , because no point in s2 or s4 can dominate those in s1 . In this example, points c, g are removed because they are dominated by i. Similarly, the skyline of s4 is also compared with points in s3 , which results in the removal of m. Finally, the algorithm terminates with the remaining points {a, i, k}. D&C is efficient only for small datasets (e.g., if the entire dataset fits in memory then the algorithm requires only one application of a main-memory skyline algorithm). For large datasets, the partitioning process requires reading and writing the entire dataset at least once, thus incurring significant I/O cost. Further, this approach is not suitable for on-line processing because it cannot report any skyline until the partitioning phase completes. 2.2 Block Nested Loop and Sort First Skyline A straightforward approach to compute the skyline is to compare each point p with every other point, and report p as part of the skyline if it is not dominated. Block nested loop (BNL) builds on this concept by scanning the data file and keeping a list of candidate skyline points in main memory. At the beginning, the list contains the first data point, while for each subsequent point p, there are three cases: (i) if p is dominated by any point in the list, it is discarded as it is not part of the skyline; (ii) if p dominates any point in the list, it is inserted, and all points in the list dominated by p are dropped; and (iii) if p is neither dominated by, nor dominates, any point in the list, it is simply inserted without dropping any point. The list is self-organizing because every point found dominating other points is moved to the top. This reduces the number of comparisons as points that dominate multiple other points are likely to be checked first. A problem of BNL is that the list may become larger than the main memory. When this happens, all points falling in the third case (cases (i) and (ii) do not increase the list size) are added to a temporary file. This fact necessitates multiple passes of BNL. In particular, after the algorithm finishes scanning the data file, only points that were inserted in the list before the creation of the temporary file are guaranteed to be in the skyline and are output. The remaining points must be compared against the ones in the temporary file. Thus, BNL has to be executed again, this time using the temporary (instead of the data) file as input. The advantage of BNL is its wide applicability, since it can be used for any dimensionality without indexing or sorting the data file. Its main problems are the reliance on main memory (a small memory may lead to numerous iterations) and its inadequacy for progressive processing (it has to read the entire data file before it returns the first skyline point). The sort first skyline (SFS) variation of BNL alleviates these problems by first sorting the entire dataset according to a (monotone) preference function. Candidate points are inserted into the list in ascending order of their scores, because points with lower scores are likely to dominate a large number of points, thus rendering the pruning more effective. SFS exhibits progressive behavior because the presorting ensures that a point p dominating another p′ must be visited before p′ ; hence we can immediately ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Table I. The Bitmap Approach

id   Coordinate   Bitmap Representation
a    (1, 9)       (1111111111, 1100000000)
b    (2, 10)      (1111111110, 1000000000)
c    (4, 8)       (1111111000, 1110000000)
d    (6, 7)       (1111100000, 1111000000)
e    (9, 10)      (1100000000, 1000000000)
f    (7, 5)       (1111000000, 1111110000)
g    (5, 6)       (1111110000, 1111100000)
h    (4, 3)       (1111111000, 1111111100)
i    (3, 2)       (1111111100, 1111111110)
k    (9, 1)       (1100000000, 1111111111)
l    (10, 4)      (1000000000, 1111111000)
m    (6, 2)       (1111100000, 1111111110)
n    (8, 3)       (1110000000, 1111111100)

output the points inserted to the list as skyline points. Nevertheless, SFS has to scan the entire data file to return a complete skyline, because even a skyline point may have a very large score and thus appear at the end of the sorted list (e.g., in Figure 1, point a has the third largest score for the preference function 0 · distance + 1 · price). Another problem of SFS (and BNL) is that the order in which the skyline points are reported is fixed (and decided by the sort order), while as discussed in Section 2.6, a progressive skyline algorithm should be able to report points according to user-specified scoring functions. 2.3 Bitmap This technique encodes in bitmaps all the information needed to decide whether a point is in the skyline. Toward this, a data point p = ( p1 , p2 , . . . , pd ), where d is the number of dimensions, is mapped to an m-bit vector, where m is the total number of distinct values over all dimensions. Let ki be the total number of distinct values on the ith dimension (i.e., m = i=1∼d ki ). In Figure 1, for example, there are k1 = k2 = 10 distinct values on the x, y dimensions and m = 20. Assume that pi is the ji th smallest number on the ith axis; then it is represented by ki bits, where the leftmost (ki − ji + 1) bits are 1, and the remaining ones 0. Table I shows the bitmaps for points in Figure 1. Since point a has the smallest value (1) on the x axis, all bits of a1 are 1. Similarly, since a2 (= 9) is the ninth smallest on the y axis, the first 10 − 9 + 1 = 2 bits of its representation are 1, while the remaining ones are 0. Consider that we want to decide whether a point, for example, c with bitmap representation (1111111000, 1110000000), belongs to the skyline. The rightmost bits equal to 1, are the fourth and the eighth, on dimensions x and y, respectively. The algorithm creates two bit-strings, c X = 1110000110000 and cY = 0011011111111, by juxtaposing the corresponding bits (i.e., the fourth and eighth) of every point. In Table I, these bit-strings (shown in bold) contain 13 bits (one from each object, starting from a and ending with n). The 1s in the result of c X & cY = 0010000110000 indicate the points that dominate c, that is, c, h, and i. Obviously, if there is more than a single 1, the considered point ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Table II. The Index Approach

List 1                      List 2
a (1, 9)     minC = 1       k (9, 1)             minC = 1
b (2, 10)    minC = 2       i (3, 2), m (6, 2)   minC = 2
c (4, 8)     minC = 4       h (4, 3), n (8, 3)   minC = 3
g (5, 6)     minC = 5       l (10, 4)            minC = 4
d (6, 7)     minC = 6       f (7, 5)             minC = 5
e (9, 10)    minC = 9

is not in the skyline.2 The same operations are repeated for every point in the dataset to obtain the entire skyline. The efficiency of bitmap relies on the speed of bit-wise operations. The approach can quickly return the first few skyline points according to their insertion order (e.g., alphabetical order in Table I), but, as with BNL and SFS, it cannot adapt to different user preferences. Furthermore, the computation of the entire skyline is expensive because, for each point inspected, it must retrieve the bitmaps of all points in order to obtain the juxtapositions. Also the space consumption may be prohibitive, if the number of distinct values is large. Finally, the technique is not suitable for dynamic datasets where insertions may alter the rankings of attribute values. 2.4 Index The index approach organizes a set of d -dimensional points into d lists such that a point p = ( p1 , p2 , . . . , pd ) is assigned to the ith list (1 ≤ i ≤ d ), if and only if its coordinate pi on the ith axis is the minimum among all dimensions, or formally, pi ≤ p j for all j = i. Table II shows the lists for the dataset of Figure 1. Points in each list are sorted in ascending order of their minimum coordinate (minC, for short) and indexed by a B-tree. A batch in the ith list consists of points that have the same ith coordinate (i.e., minC). In Table II, every point of list 1 constitutes an individual batch because all x coordinates are different. Points in list 2 are divided into five batches {k}, {i, m}, {h, n}, {l }, and { f }. Initially, the algorithm loads the first batch of each list, and handles the one with the minimum minC. In Table II, the first batches {a}, {k} have identical minC = 1, in which case the algorithm handles the batch from list 1. Processing a batch involves (i) computing the skyline inside the batch, and (ii) among the computed points, it adds the ones not dominated by any of the already-found skyline points into the skyline list. Continuing the example, since batch {a} contains a single point and no skyline point is found so far, a is added to the skyline list. The next batch {b} in list 1 has minC = 2; thus, the algorithm handles batch {k} from list 2. Since k is not dominated by a, it is inserted in the skyline. Similarly, the next batch handled is {b} from list 1, where b is dominated by point a (already in the skyline). The algorithm proceeds with batch {i, m}, computes the skyline inside the batch that contains a single point i (i.e., i dominates m), and adds i to the skyline. At this step, the algorithm does 2 The

result of “&” will contain several 1s if multiple skyline points coincide. This case can be handled with an additional “or” operation [Tan et al. 2001].
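To make the bit-slice test of Section 2.3 concrete, the following is a minimal, illustrative Python sketch (not the implementation of Tan et al. [2001]). Python integers stand in for the juxtaposed bit-slices, which a real system would precompute and store rather than rebuild for every query; the hotel coordinates are those of Table I.

```python
def bitmap_dominators(points, target):
    """Sketch of the bitmap test: bit j of the slice for (dim, v) is set iff
    the j-th point has value <= v on that dimension.  ANDing the two slices
    selected by the target's own coordinates yields the points that are at
    least as good on both axes."""
    names = sorted(points)
    def bit_slice(dim, v):
        bits = 0
        for j, name in enumerate(names):
            if points[name][dim] <= v:
                bits |= 1 << j
        return bits
    x, y = points[target]
    result = bit_slice(0, x) & bit_slice(1, y)
    return [names[j] for j in range(len(names)) if (result >> j) & 1]

hotels = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}
print(bitmap_dominators(hotels, 'c'))   # ['c', 'h', 'i']
```

For point c, the AND of the two selected slices sets the bits of c, h, and i, so c is not a skyline point, matching the walkthrough above.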


Fig. 3. Example of NN.

not need to proceed further, because both coordinates of i are smaller than or equal to the minC (i.e., 4, 3) of the next batches (i.e., {c}, {h, n}) of lists 1 and 2. This means that all the remaining points (in both lists) are dominated by i, and the algorithm terminates with {a, i, k}. Although this technique can quickly return skyline points at the top of the lists, the order in which the skyline points are returned is fixed, not supporting user-defined preferences. Furthermore, as indicated in Kossmann et al. [2002], the lists computed for d dimensions cannot be used to retrieve the skyline on any subset of the dimensions because the list that an element belongs to may change according the subset of selected dimensions. In general, for supporting queries on arbitrary dimensions, an exponential number of lists must be precomputed. 2.5 Nearest Neighbor NN uses the results of nearest-neighbor search to partition the data universe recursively. As an example, consider the application of the algorithm to the dataset of Figure 1, which is indexed by an R-tree [Guttman 1984; Sellis et al. 1987; Beckmann et al. 1990]. NN performs a nearest-neighbor query (using an existing algorithm such as one of the proposed by Roussopoulos et al. [1995], or Hjaltason and Samet [1999] on the R-tree, to find the point with the minimum distance (mindist) from the beginning of the axes (point o). Without loss of generality,3 we assume that distances are computed according to the L1 norm, that is, the mindist of a point p from the beginning of the axes equals the sum of the coordinates of p. It can be shown that the first nearest neighbor (point i with mindist 5) is part of the skyline. On the other hand, all the points in the dominance region of i (shaded area in Figure 3(a)) can be pruned from further consideration. The remaining space is split in two partitions based on the coordinates (ix , i y ) of point i: (i) [0, ix ) [0, ∞) and (ii) [0, ∞) [0, i y ). In Figure 3(a), the first partition contains subdivisions 1 and 3, while the second one contains subdivisions 1 and 2. The partitions resulting after the discovery of a skyline point are inserted in a to-do list. While the to-do list is not empty, NN removes one of the partitions 3 NN

(and BBS) can be applied with any monotone function; the skyline points are the same, but the order in which they are discovered may be different.


Fig. 4. NN partitioning for three-dimensions.

from the list and recursively repeats the same process. For instance, point a is the nearest neighbor in partition [0, ix ) [0, ∞), which causes the insertion of partitions [0, ax ) [0, ∞) (subdivisions 5 and 7 in Figure 3(b)) and [0, ix ) [0, a y ) (subdivisions 5 and 6 in Figure 3(b)) in the to-do list. If a partition is empty, it is not subdivided further. In general, if d is the dimensionality of the data-space, a new skyline point causes d recursive applications of NN. In particular, each coordinate of the discovered point splits the corresponding axis, introducing a new search region towards the origin of the axis. Figure 4(a) shows a three-dimensional (3D) example, where point n with coordinates (nx , n y , nz ) is the first nearest neighbor (i.e., skyline point). The NN algorithm will be recursively called for the partitions (i) [0, nx ) [0, ∞) [0, ∞) (Figure 4(b)), (ii) [0, ∞) [0, n y ) [0, ∞) (Figure 4(c)) and (iii) [0, ∞) [0, ∞) [0, nz ) (Figure 4(d)). Among the eight space subdivisions shown in Figure 4, the eighth one will not be searched by any query since it is dominated by point n. Each of the remaining subdivisions, however, will be searched by two queries, for example, a skyline point in subdivision 2 will be discovered by both the second and third queries. In general, for d > 2, the overlapping of the partitions necessitates duplicate elimination. Kossmann et al. [2002] proposed the following elimination methods: —Laisser-faire: A main memory hash table stores the skyline points found so far. When a point p is discovered, it is probed and, if it already exists in the hash table, p is discarded; otherwise, p is inserted into the hash table. The technique is straightforward and incurs minimum CPU overhead, but results in very high I/O cost since large parts of the space will be accessed by multiple queries. —Propagate: When a point p is found, all the partitions in the to-do list that contain p are removed and repartitioned according to p. The new partitions are inserted into the to-do list. Although propagate does not discover the same ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


skyline point twice, it incurs high CPU cost because the to-do list is scanned every time a skyline point is discovered. —Merge: The main idea is to merge partitions in to-do, thus reducing the number of queries that have to be performed. Partitions that are contained in other ones can be eliminated in the process. Like propagate, merge also incurs high CPU cost since it is expensive to find good candidates for merging. —Fine-grained partitioning: The original NN algorithm generates d partitions after a skyline point is found. An alternative approach is to generate 2d nonoverlapping subdivisions. In Figure 4, for instance, the discovery of point n will lead to six new queries (i.e., 23 – 2 since subdivisions 1 and 8 cannot contain any skyline points). Although fine-grained partitioning avoids duplicates, it generates the more complex problem of false hits, that is, it is possible that points in one subdivision (e.g., subdivision 4) are dominated by points in another (e.g., subdivision 2) and should be eliminated. According to the experimental evaluation of Kossmann et al. [2002], the performance of laisser-faire and merge was unacceptable, while fine-grained partitioning was not implemented due to the false hits problem. Propagate was significantly more efficient, but the best results were achieved by a hybrid method combining propagate and laisser-faire. 2.6 Discussion About the Existing Algorithms We summarize this section with a comparison of the existing methods, based on the experiments of Tan et al. [2001], Kossmann et al. [2002], and Chomicki et al. [2003]. Tan et al. [2001] examined BNL, D&C, bitmap, and index, and suggested that index is the fastest algorithm for producing the entire skyline under all settings. D&C and bitmap are not favored by correlated datasets (where the skyline is small) as the overhead of partition-merging and bitmaploading, respectively, does not pay-off. BNL performs well for small skylines, but its cost increases fast with the skyline size (e.g., for anticorrelated datasets, high dimensionality, etc.) due to the large number of iterations that must be performed. Tan et al. [2001] also showed that index has the best performance in returning skyline points progressively, followed by bitmap. The experiments of Chomicki et al. [2003] demonstrated that SFS is in most cases faster than BNL without, however, comparing it with other algorithms. According to the evaluation of Kossmann et al. [2002], NN returns the entire skyline more quickly than index (hence also more quickly than BNL, D&C, and bitmap) for up to four dimensions, and their difference increases (sometimes to orders of magnitudes) with the skyline size. Although index can produce the first few skyline points in shorter time, these points are not representative of the whole skyline (as they are good on only one axis while having large coordinates on the others). Kossmann et al. [2002] also suggested a set of criteria (adopted from Hellerstein et al. [1999]) for evaluating the behavior and applicability of progressive skyline algorithms: (i) Progressiveness: the first results should be reported to the user almost instantly and the output size should gradually increase. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


(ii) Absence of false misses: given enough time, the algorithm should generate the entire skyline. (iii) Absence of false hits: the algorithm should not discover temporary skyline points that will be later replaced. (iv) Fairness: the algorithm should not favor points that are particularly good in one dimension. (v) Incorporation of preferences: the users should be able to determine the order according to which skyline points are reported. (vi) Universality: the algorithm should be applicable to any dataset distribution and dimensionality, using some standard index structure. All the methods satisfy criterion (ii), as they deal with exact (as opposed to approximate) skyline computation. Criteria (i) and (iii) are violated by D&C and BNL since they require at least a scan of the data file before reporting skyline points and they both insert points (in partial skylines or the self-organizing list) that are later removed. Furthermore, SFS and bitmap need to read the entire file before termination, while index and NN can terminate as soon as all skyline points are discovered. Criteria (iv) and (vi) are violated by index because it outputs the points according to their minimum coordinates in some dimension and cannot handle skylines in some subset of the original dimensionality. All algorithms, except NN, defy criterion (v); NN can incorporate preferences by simply changing the distance definition according to the input scoring function. Finally, note that progressive behavior requires some form of preprocessing, that is, index creation (index, NN), sorting (SFS), or bitmap creation (bitmap). This preprocessing is a one-time effort since it can be used by all subsequent queries provided that the corresponding structure is updateable in the presence of record insertions and deletions. The maintenance of the sorted list in SFS can be performed by building a B+-tree on top of the list. The insertion of a record in index simply adds the record in the list that corresponds to its minimum coordinate; similarly, deletion removes the record from the list. NN can also be updated incrementally as it is based on a fully dynamic structure (i.e., the R-tree). On the other hand, bitmap is aimed at static datasets because a record insertion/deletion may alter the bitmap representation of numerous (in the worst case, of all) records. 3. BRANCH-AND-BOUND SKYLINE ALGORITHM Despite its general applicability and performance advantages compared to existing skyline algorithms, NN has some serious shortcomings, which are described in Section 3.1. Then Section 3.2 proposes the BBS algorithm and proves its correctness. Section 3.3 analyzes the performance of BBS and illustrates its I/O optimality. Finally, Section 3.4 discusses the incremental maintenance of skylines in the presence of database updates. 3.1 Motivation A recursive call of the NN algorithm terminates when the corresponding nearest-neighbor query does not retrieve any point within the corresponding ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 5. Recursion tree.

space. Lets call such a query empty, to distinguish it from nonempty queries that return results, each spawning d new recursive applications of the algorithm (where d is the dimensionality of the data space). Figure 5 shows a query processing tree, where empty queries are illustrated as transparent cycles. For the second level of recursion, for instance, the second query does not return any results, in which case the recursion will not proceed further. Some of the nonempty queries may be redundant, meaning that they return skyline points already found by previous queries. Let s be the number of skyline points in the result, e the number of empty queries, ne the number of nonempty ones, and r the number of redundant queries. Since every nonempty query either retrieves a skyline point, or is redundant, we have ne = s + r. Furthermore, the number of empty queries in Figure 5 equals the number of leaf nodes in the recursion tree, that is, e = ne · (d − 1) + 1. By combining the two equations, we get e = (s + r) · (d − 1) + 1. Each query must traverse a whole path from the root to the leaf level of the R-tree before it terminates; therefore, its I/O cost is at least h node accesses, where h is the height of the tree. Summarizing the above observations, the total number of accesses for NN is: NANN ≥ (e + s + r) · h = (s + r) · h · d + h > s · h · d . The value s · h · d is a rather optimistic lower bound since, for d > 2, the number r of redundant queries may be very high (depending on the duplicate elimination method used), and queries normally incur more than h node accesses. Another problem of NN concerns the to-do list size, which can exceed that of the dataset for as low as three dimensions, even without considering redundant queries. Assume, for instance, a 3D uniform dataset (cardinality N ) and a skyline query with the preference function f (x, y, z) = x. The first skyline point n (nx , n y , nz ) has the smallest x coordinate among all data points, and adds partitions Px = [0, nx ) [0, ∞) [0, ∞), P y = [0, ∞) [0, n y ) [0, ∞), Pz = [0, ∞) [0, ∞) [0, nz ) in the to-do list. Note that the NN query in Px is empty because there is no other point whose x coordinate is below nx . On the other hand, the expected volume of P y (Pz ) is 1/2 (assuming unit axis length on all dimensions), because the nearest neighbor is decided solely on x coordinates, and hence n y (nz ) distributes uniformly in [0, 1]. Following the same reasoning, a NN in P y finds the second skyline point that introduces three new partitions such that one partition leads to an empty query, while the volumes of the other two are 1/4. P is handled similarly, after which the to-do list contains four partitions z with volumes 1/4, and 2 empty partitions. In general, after the ith level of recursion, the to-do list contains 2i partitions with volume 1/2i , and 2i−1 empty ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 6. R-tree example.

partitions. The algorithm terminates when 1/2^i < 1/N (i.e., i > log N) so that all partitions in the to-do list are empty. Assuming that the empty queries are performed at the end, the size of the to-do list can be obtained by summing the number e of empty queries at each recursion level i:

Σ_{i=1}^{log N} 2^(i−1) = N − 1.

The implication of the above equation is that, even in 3D, NN may behave like a main-memory algorithm (since the to-do list, which resides in memory, is the same order of size as the input dataset). Using the same reasoning, for arbitrary dimensionality d > 2, e = ((d −1)log N ), that is, the to-do list may become orders of magnitude larger than the dataset, which seriously limits the applicability of NN. In fact, as shown in Section 6, the algorithm does not terminate in the majority of experiments involving four and five dimensions. 3.2 Description of BBS Like NN, BBS is also based on nearest-neighbor search. Although both algorithms can be used with any data-partitioning method, in this article we use R-trees due to their simplicity and popularity. The same concepts can be applied with other multidimensional access methods for high-dimensional spaces, where the performance of R-trees is known to deteriorate. Furthermore, as claimed in Kossmann et al. [2002], most applications involve up to five dimensions, for which R-trees are still efficient. For the following discussion, we use the set of 2D data points of Figure 1, organized in the R-tree of Figure 6 with node capacity = 3. An intermediate entry ei corresponds to the minimum bounding rectangle (MBR) of a node Ni at the lower level, while a leaf entry corresponds to a data point. Distances are computed according to L1 norm, that is, the mindist of a point equals the sum of its coordinates and the mindist of a MBR (i.e., intermediate entry) equals the mindist of its lower-left corner point. BBS, similar to the previous algorithms for nearest neighbors [Roussopoulos et al. 1995; Hjaltason and Samet 1999] and convex hulls [B¨ohm and Kriegel 2001], adopts the branch-and-bound paradigm. Specifically, it starts from the root node of the R-tree and inserts all its entries (e6 , e7 ) in a heap sorted according to their mindist. Then, the entry with the minimum mindist (e7 ) is “expanded”. This expansion removes the entry (e7 ) from the heap and inserts ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Table III. Heap Contents

Action        Heap Contents                                      S
Access root   <e7, 4> <e6, 6>                                    ∅
Expand e7     <e3, 5> <e6, 6> <e5, 8> <e4, 10>                   ∅
Expand e3     <i, 5> <e6, 6> <h, 7> <e5, 8> <e4, 10> <g, 11>     {i}
Expand e6     <h, 7> <e5, 8> <e1, 9> <e4, 10> <g, 11>            {i}
Expand e1     <a, 10> <e4, 10> <g, 11> <b, 12> <c, 12>           {i, a}
Expand e4     <k, 10> <g, 11> <b, 12> <c, 12> <l, 14>            {i, a, k}

Fig. 7. BBS algorithm.
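To complement the pseudocode of Figure 7, here is a minimal, illustrative Python sketch of the BBS loop on the running example. The grouping of points into leaf nodes (N1 = {a, b, c}, N2 = {d, e, f}, N3 = {g, h, i}, N4 = {k, l}, N5 = {m, n}) is an assumption made for illustration, since Figure 6 itself is not reproduced here; a real implementation would operate on a disk-based R-tree rather than a hard-coded dictionary.

```python
import heapq

# Data points of Figure 1 (x = distance, y = price).
POINTS = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}

# Hand-built tree standing in for the R-tree of Figure 6 (node capacity 3);
# the leaf grouping is assumed for illustration.
TREE = {'root': ['e6', 'e7'],
        'e6': ['e1', 'e2'], 'e7': ['e3', 'e4', 'e5'],
        'e1': ['a', 'b', 'c'], 'e2': ['d', 'e', 'f'],
        'e3': ['g', 'h', 'i'], 'e4': ['k', 'l'], 'e5': ['m', 'n']}

def lower_left(entry):
    """Lower-left corner: the point itself, or the MBR corner of a node."""
    if entry in POINTS:
        return POINTS[entry]
    corners = [lower_left(c) for c in TREE[entry]]
    return (min(x for x, _ in corners), min(y for _, y in corners))

def mindist(entry):
    """L1 mindist from the origin, as in the running example."""
    return sum(lower_left(entry))

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def bbs():
    skyline = []                              # stands in for the main-memory R-tree
    heap = [(mindist(c), c) for c in TREE['root']]
    heapq.heapify(heap)
    while heap:
        _, entry = heapq.heappop(heap)
        if any(dominates(p, lower_left(entry)) for _, p in skyline):
            continue                          # dominance check before expansion
        if entry in POINTS:                   # leaf entry: a skyline point
            skyline.append((entry, POINTS[entry]))
        else:                                 # intermediate entry: expand it
            for child in TREE[entry]:
                if not any(dominates(p, lower_left(child)) for _, p in skyline):
                    heapq.heappush(heap, (mindist(child), child))
    return skyline

print(bbs())   # [('i', (3, 2)), ('a', (1, 9)), ('k', (9, 1))]
```

The sketch reports i, a, and k in ascending mindist order, as in Table III; the exact heap contents at intermediate steps depend on when the two dominance checks are applied.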

its children (e3 , e4 , e5 ). The next expanded entry is again the one with the minimum mindist (e3 ), in which the first nearest neighbor (i) is found. This point (i) belongs to the skyline, and is inserted to the list S of skyline points. Notice that up to this step BBS behaves like the best-first nearest-neighbor algorithm of Hjaltason and Samet [1999]. The next entry to be expanded is e6 . Although the nearest-neighbor algorithm would now terminate since the mindist (6) of e6 is greater than the distance (5) of the nearest neighbor (i) already found, BBS will proceed because node N6 may contain skyline points (e.g., a). Among the children of e6 , however, only the ones that are not dominated by some point in S are inserted into the heap. In this case, e2 is pruned because it is dominated by point i. The next entry considered (h) is also pruned as it also is dominated by point i. The algorithm proceeds in the same manner until the heap becomes empty. Table III shows the ids and the mindist of the entries inserted in the heap (skyline points are bold). The pseudocode for BBS is shown in Figure 7. Notice that an entry is checked for dominance twice: before it is inserted in the heap and before it is expanded. The second check is necessary because an entry (e.g., e5 ) in the heap may become dominated by some skyline point discovered after its insertion (therefore, the entry does not need to be visited). Next we prove the correctness for BBS. LEMMA 1. BBS visits (leaf and intermediate) entries of an R-tree in ascending order of their distance to the origin of the axis. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 8. Entries of the main-memory R-tree.

PROOF. The proof is straightforward since the algorithm always visits entries according to their mindist order preserved by the heap. LEMMA 2. Any data point added to S during the execution of the algorithm is guaranteed to be a final skyline point. PROOF. Assume, on the contrary, that point p j was added into S, but it is not a final skyline point. Then p j must be dominated by a (final) skyline point, say, pi , whose coordinate on any axis is not larger than the corresponding coordinate of p j , and at least one coordinate is smaller (since pi and p j are different points). This in turn means that mindist( pi ) < mindist( p j ). By Lemma 1, pi must be visited before p j . In other words, at the time p j is processed, pi must have already appeared in the skyline list, and hence p j should be pruned, which contradicts the fact that p j was added in the list. LEMMA 3. Every data point will be examined, unless one of its ancestor nodes has been pruned. PROOF. The proof is obvious since all entries that are not pruned by an existing skyline point are inserted into the heap and examined. Lemmas 2 and 3 guarantee that, if BBS is allowed to execute until its termination, it will correctly return all skyline points, without reporting any false hits. An important issue regards the dominance checking, which can be expensive if the skyline contains numerous points. In order to speed up this process we insert the skyline points found in a main-memory R-tree. Continuing the example of Figure 6, for instance, only points i, a, k will be inserted (in this order) to the main-memory R-tree. Checking for dominance can now be performed in a way similar to traditional window queries. An entry (i.e., node MBR or data point) is dominated by a skyline point p, if its lower left point falls inside the dominance region of p, that is, the rectangle defined by p and the edge of the universe. Figure 8 shows the dominance regions for points i, a, k and two entries; e is dominated by i and k, while e′ is not dominated by any point (therefore is should be expanded). Note that, in general, most dominance regions will cover a large part of the data space, in which case there will be significant overlap between the intermediate nodes of the main-memory ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


R-tree. Unlike traditional window queries that must retrieve all results, this is not a problem here because we only need to retrieve a single dominance region in order to determine that the entry is dominated (by at least one skyline point). To conclude this section, we informally evaluate BBS with respect to the criteria of Hellerstein et al. [1999] and Kossmann et al. [2002], presented in Section 2.6. BBS satisfies property (i) as it returns skyline points instantly in ascending order of their distance to the origin, without having to visit a large part of the R-tree. Lemma 3 ensures property (ii), since every data point is examined unless some of its ancestors is dominated (in which case the point is dominated too). Lemma 2 guarantees property (iii). Property (iv) is also fulfilled because BBS outputs points according to their mindist, which takes into account all dimensions. Regarding user preferences (v), as we discuss in Section 4.1, the user can specify the order of skyline points to be returned by appropriate preference functions. Furthermore, BBS also satisfies property (vi) since it does not require any specialized indexing structure, but (like NN) it can be applied with R-trees or any other data-partitioning method. Furthermore, the same index can be used for any subset of the d dimensions that may be relevant to different users. 3.3 Analysis of BBS In this section, we first prove that BBS is I/O optimal, meaning that (i) it visits only the nodes that may contain skyline points, and (ii) it does not access the same node twice. Then we provide a theoretical comparison with NN in terms of the number of node accesses and memory consumption (i.e., the heap versus the to-do list sizes). Central to the analysis of BBS is the concept of the skyline search region (SSR), that is, the part of the data space that is not dominated by any skyline point. Consider for instance the running example (with skyline points i, a, k). The SSR is the shaded area in Figure 8 defined by the skyline and the two axes. We start with the following observation. LEMMA 4. Any skyline algorithm based on R-trees must access all the nodes whose MBRs intersect the SSR. For instance, although entry e′ in Figure 8 does not contain any skyline points, this cannot be determined unless the child node of e′ is visited. LEMMA 5. If an entry e does not intersect the SSR, then there is a skyline point p whose distance from the origin of the axes is smaller than the mindist of e. PROOF. Since e does not intersect the SSR, it must be dominated by at least one skyline point p, meaning that p dominates the lower-left corner of e. This implies that the distance of p to the origin is smaller than the mindist of e. THEOREM 6.

The number of node accesses performed by BBS is optimal.


PROOF. First we prove that BBS only accesses nodes that may contain skyline points. Assume, to the contrary, that the algorithm also visits an entry (let it be e in Figure 8) that does not intersect the SSR. Clearly, e should not be accessed because it cannot contain skyline points. Consider a skyline point that dominates e (e.g., k). Then, by Lemma 5, the distance of k to the origin is smaller than the mindist of e. According to Lemma 1, BBS visits the entries of the R-tree in ascending order of their mindist to the origin. Hence, k must be processed before e, meaning that e will be pruned by k, which contradicts the fact that e is visited. In order to complete the proof, we need to show that an entry is not visited multiple times. This is straightforward because entries are inserted into the heap (and expanded) at most once, according to their mindist. Assuming that each leaf node visited contains exactly one skyline point, the number NABBS of node accesses performed by BBS is at most s · h (where s is the number of skyline points, and h the height of the R-tree). This bound corresponds to a rather pessimistic case, where BBS has to access a complete path for each skyline point. Many skyline points, however, may be found in the same leaf nodes, or in the same branch of a nonleaf node (e.g., the root of the tree!), so that these nodes only need to be accessed once (our experiments show that in most cases the number of node accesses at each level of the tree is much smaller than s). Therefore, BBS is at least d (= s·h·d /s·h) times faster than NN (as explained in Section 3.1, the cost NANN of NN is at least s · h · d ). In practice, for d > 2, the speedup is much larger than d (several orders of magnitude) as NANN = s · h · d does not take into account the number r of redundant queries. Regarding the memory overhead, the number of entries nheap in the heap of BBS is at most ( f − 1) · NABBS . This is a pessimistic upper bound, because it assumes that a node expansion removes from the heap the expanded entry and inserts all its f children (in practice, most children will be dominated by some discovered skyline point and pruned). Since for independent dimensions the expected number of skyline points is s = ((ln N )d −1 /(d − 1)!) (Buchta [1989]), nheap ≤ ( f − 1) · NABBS ≈ ( f − 1) · h · s ≈ ( f − 1) · h · (ln N )d −1 /(d − 1)!. For d ≥ 3 and typical values of N and f (e.g., N = 105 and f ≈ 100), the heap size is much smaller than the corresponding to-do list size, which as discussed in Section 3.1 can be in the order of (d − 1)log N . Furthermore, a heap entry stores d + 2 numbers (i.e., entry id, mindist, and the coordinates of the lowerleft corner), as opposed to 2d numbers for to-do list entries (i.e., d -dimensional ranges). In summary, the main-memory requirement of BBS is at the same order as the size of the skyline, since both the heap and the main-memory R-tree sizes are at this order. This is a reasonable assumption because (i) skylines are normally small and (ii) previous algorithms, such as index, are based on the same principle. Nevertheless, the size of the heap can be further reduced. Consider that in Figure 9 intermediate node e is visited first and its children (e.g., e1 ) are inserted into the heap. When e′ is visited afterward (e and e′ have the same mindist), e1′ can be immediately pruned, because there must exist at least a (not yet discovered) point in the bottom edge of e1 that dominates e1′ . A ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 9. Reducing the size of the heap.

similar situation happens if node e′ is accessed first. In this case e1′ is inserted into the heap, but it is removed (before its expansion) when e1 is added. BBS can easily incorporate this mechanism by checking the contents of the heap before the insertion of an entry e: (i) all entries dominated by e are removed; (ii) if e is dominated by some entry, it is not inserted. We chose not to implement this optimization because it induces some CPU overhead without affecting the number of node accesses, which is optimal (in the above example e1′ would be pruned during its expansion since by that time e1 will have been visited). 3.4 Incremental Maintenance of the Skyline The skyline may change due to subsequent updates (i.e., insertions and deletions) to the database, and hence should be incrementally maintained to avoid recomputation. Given a new point p (e.g., a hotel added to the database), our incremental maintenance algorithm first performs a dominance check on the main-memory R-tree. If p is dominated (by an existing skyline point), it is simply discarded (i.e., it does not affect the skyline); otherwise, BBS performs a window query (on the main-memory R-tree), using the dominance region of p, to retrieve the skyline points that will become obsolete (i.e., those dominated by p). This query may not retrieve anything (e.g., Figure 10(a)), in which case the number of skyline points increases by one. Figure 10(b) shows another case, where the dominance region of p covers two points i, k, which are removed (from the main-memory R-tree). The final skyline consists of only points a, p. Handling deletions is more complex. First, if the point removed is not in the skyline (which can be easily checked by the main-memory R-tree using the point’s coordinates), no further processing is necessary. Otherwise, part of the skyline must be reconstructed. To illustrate this, assume that point i in Figure 11(a) is deleted. For incremental maintenance, we need to compute the skyline with respect only to the points in the constrained (shaded) area, which is the region exclusively dominated by i (i.e., not including areas dominated by other skyline points). This is because points (e.g., e, l ) outside the shaded area cannot appear in the new skyline, as they are dominated by at least one other point (i.e., a or k). As shown in Figure 11(b), the skyline within the exclusive dominance region of i contains two points h and m, which substitute i in the final ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 10. Incremental skyline maintenance for insertion.

Fig. 11. Incremental skyline maintenance for deletion.

skyline (of the whole dataset). In Section 4.1, we discuss skyline computation in a constrained region of the data space. Except for the above case of deletion, incremental skyline maintenance involves only main-memory operations. Given that the skyline points constitute only a small fraction of the database, the probability of deleting a skyline point is expected to be very low. In extreme cases (e.g., bulk updates, large number of skyline points) where insertions/deletions frequently affect the skyline, we may adopt the following “lazy” strategy to minimize the number of disk accesses: after deleting a skyline point p, we do not compute the constrained skyline immediately, but add p to a buffer. For each subsequent insertion, if p is dominated by a new point p′ , we remove it from the buffer because all the points potentially replacing p would become obsolete anyway as they are dominated by p′ (the insertion of p′ may also render other skyline points obsolete). When there are no more updates or a user issues a skyline query, we perform a single constrained skyline search, setting the constraint region to the union of the exclusive dominance regions of the remaining points in the buffer, which is emptied afterward. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
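A minimal sketch of the insertion case, assuming the skyline points are kept in a plain Python list rather than a main-memory R-tree:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def insert_point(skyline, p):
    """Skyline maintenance under insertion of a new point p."""
    if any(dominates(q, p) for q in skyline):
        return skyline                       # p is dominated: nothing changes
    # p enters the skyline; points in its dominance region become obsolete
    return [q for q in skyline if not dominates(p, q)] + [p]

skyline = [(3, 2), (1, 9), (9, 1)]           # {i, a, k} from the running example
print(insert_point(skyline, (2, 1)))         # dominates i and k: [(1, 9), (2, 1)]
print(insert_point(skyline, (5, 5)))         # dominated by i: skyline unchanged
```

The deletion case would additionally trigger the constrained skyline search described above (see Section 4.1).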


Fig. 12. Constrained query example.

4. VARIATIONS OF SKYLINE QUERIES In this section we propose novel variations of skyline search, and illustrate how BBS can be applied for their processing. In particular, Section 4.1 discusses constrained skylines, Section 4.2 ranked skylines, Section 4.3 group-by skylines, Section 4.4 dynamic skylines, Section 4.5 enumerating and K -dominating queries, and Section 4.6 skybands. 4.1 Constrained Skyline Given a set of constraints, a constrained skyline query returns the most interesting points in the data space defined by the constraints. Typically, each constraint is expressed as a range along a dimension and the conjunction of all constraints forms a hyperrectangle (referred to as the constraint region) in the d -dimensional attribute space. Consider the hotel example, where a user is interested only in hotels whose prices ( y axis) are in the range [4, 7]. The skyline in this case contains points g , f , and l (Figure 12), as they are the most interesting hotels in the specified price range. Note that d (which also satisfies the constraints) is not included as it is dominated by g . The constrained query can be expressed using the syntax of Borzsonyi et al. [2001] and the where clause: Select *, From Hotels, Where Price∈[4, 7], Skyline of Price min, Distance min. In addition, constrained queries are useful for incremental maintenance of the skyline in the presence of deletions (as discussed in Section 3.4). BBS can easily process such queries. The only difference with respect to the original algorithm is that entries not intersecting the constraint region are pruned (i.e., not inserted in the heap). Table IV shows the contents of the heap during the processing of the query in Figure 12. The same concept can also be applied when the constraint region is not a (hyper-) rectangle, but an arbitrary area in the data space. The NN algorithm can also support constrained skylines with a similar modification. In particular, the first nearest neighbor (e.g., g ) is retrieved in the constraint region using constrained nearest-neighbor search [Ferhatosmanoglu et al. 2001]. Then, each space subdivision is the intersection of the original subdivision (area to be searched by NN for the unconstrained query) and the constraint region. The index method can benefit from the constraints, by ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Table IV. Heap Contents for Constrained Query

Action        Heap Contents                     S
Access root   <e7, 4> <e6, 6>                   ∅
Expand e7     <e3, 5> <e6, 6> <e4, 10>          ∅
Expand e3     <e6, 6> <e4, 10> <g, 11>          ∅
Expand e6     <e4, 10> <g, 11> <e2, 11>         ∅
Expand e4     <g, 11> <e2, 11> <l, 14>          {g}
Expand e2     <f, 12> <d, 13> <l, 14>           {g, f, l}
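The constrained result {g, f, l} of Figure 12 and Table IV can be cross-checked directly from the definition; the following brute-force Python sketch is an illustration only and ignores the R-tree:

```python
def constrained_skyline(points, price_range):
    """Skyline restricted to the points whose price (second coordinate)
    lies in price_range; brute force over the example data."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q
    lo, hi = price_range
    cand = {n: p for n, p in points.items() if lo <= p[1] <= hi}
    return sorted(n for n, p in cand.items()
                  if not any(dominates(q, p) for q in cand.values()))

hotels = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}
print(constrained_skyline(hotels, (4, 7)))   # ['f', 'g', 'l']
```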

starting with the batches at the beginning of the constraint ranges (instead of the top of the lists). Bitmap can avoid loading the juxtapositions (see Section 2.3) for points that do not satisfy the query constraints, and D&C may discard, during the partitioning step, points that do not belong to the constraint region. For BNL and SFS, the only difference with respect to regular skyline retrieval is that only points in the constraint region are inserted in the self-organizing list. 4.2 Ranked Skyline Given a set of points in the d -dimensional space [0, 1]d , a ranked (top-K ) skyline query (i) specifies a parameter K , and a preference function f which is monotone on each attribute, (ii) and returns the K skyline points p that have the minimum score according to the input function. Consider the running example, where K = 2 and the preference function is f (x, y) = x + 3 y 2 . The output skyline points should be < k, 12 >, < i, 15 > in this order (the number with each point indicates its score). Such ranked skyline queries can be expressed using the syntax of Borzsonyi et al. [2001] combined with the order by and stop after clauses: Select *, From Hotels, Skyline of Price min, Distance min, order by Price + 3·sqr(Distance), stop after 2. BBS can easily handle such queries by modifying the mindist definition to reflect the preference function (i.e., the mindist of a point with coordinates x and y equals x + 3 y 2 ). The mindist of an intermediate entry equals the score of its lower-left point. Furthermore, the algorithm terminates after exactly K points have been reported. Due to the monotonicity of f , it is easy to prove that the output points are indeed skyline points. The only change with respect to the original algorithm is the order of entries visited, which does not affect the correctness or optimality of BBS because in any case an entry will be considered after all entries that dominate it. None of the other algorithms can answer this query efficiently. Specifically, BNL, D&C, bitmap, and index (as well as SFS if the scoring function is different from the sorting one) require first retrieving the entire skyline, sorting the skyline points by their scores, and then outputting the best K ones. On the other hand, although NN can be used with all monotone functions, its application to ranked skyline may incur almost the same cost as that of a complete skyline. This is because, due to its divide-and-conquer nature, it is difficult to establish the termination criterion. If, for instance, K = 2, NN must perform d queries after the first nearest neighbor (skyline point) is found, compare their results, and return the one with the minimum score. The situation is more complicated when K is large where the output of numerous queries must be compared. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
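Ignoring the R-tree for brevity, the effect of ordering by the preference function and stopping after K reported skyline points can be sketched as follows; this is a flat-file illustration of the same principle, not the BBS implementation, and it assumes a scoring function that is strictly increasing in every attribute:

```python
def top_k_skyline(points, k, score):
    """Return the k skyline points with the smallest score.  Visiting the
    points in ascending score order mirrors BBS keyed on the preference
    function; correctness relies on the score being strictly increasing
    in every attribute."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q
    result = []
    for p in sorted(points, key=score):
        if not any(dominates(s, p) for s in result):
            result.append(p)
            if len(result) == k:
                break
    return result

hotels = [(1, 9), (2, 10), (4, 8), (6, 7), (9, 10), (7, 5), (5, 6),
          (4, 3), (3, 2), (9, 1), (10, 4), (6, 2), (8, 3)]
print(top_k_skyline(hotels, 2, lambda p: p[0] + 3 * p[1] ** 2))
# [(9, 1), (3, 2)]: hotels k and i, with scores 12 and 15
```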


4.3 Group-By Skyline Assume that for each hotel, in addition to the price and distance, we also store its class (i.e., 1-star, 2-star, . . . , 5-star). Instead of a single skyline covering all three attributes, a user may wish to find the individual skyline in each class. Conceptually, this is equivalent to grouping the hotels by their classes, and then computing the skyline for each group; that is, the number of skylines equals the cardinality of the group-by attribute domain. Using the syntax of Borzsonyi et al. [2001], the query can be expressed as Select *, From Hotels, Skyline of Price min, Distance min, Class diff (i.e., the group-by attribute is specified by the keyword diff). One straightforward way to support group-by skylines is to create a separate R-tree for the hotels in the same class, and then invoke BBS in each tree. Separating one attribute (i.e., class) from the others, however, would compromise the performance of queries involving all the attributes.4 In the following, we present a variation of BBS which operates on a single R-tree that indexes all the attributes. For the above example, the algorithm (i) stores the skyline points already found for each class in a separate main-memory 2D R-tree and (ii) maintains a single heap containing all the visited entries. The difference is that the sorting key is computed based only on price and distance (i.e., excluding the group-by attribute). Whenever a data point is retrieved, we perform the dominance check at the corresponding main-memory R-tree (i.e., for its class), and insert it into the tree only if it is not dominated by any existing point. On the other hand the dominance check for each intermediate entry e (performed before its insertion into the heap, and during its expansion) is more complicated, because e is likely to contain hotels of several classes (we can identify the potential classes included in e by its projection on the corresponding axis). First, its MBR (i.e., a 3D box) is projected onto the price-distance plane and the lower-left corner c is obtained. We need to visit e, only if c is not dominated in some main-memory R-tree corresponding to a class covered by e. Consider, for instance, that the projection of e on the class dimension is [2, 4] (i.e., e may contain only hotels with 2, 3, and 4 stars). If the lower-left point of e (on the price-distance plane) is dominated in all three classes, e cannot contribute any skyline point. When the number of distinct values of the group-by attribute is large, the skylines may not fit in memory. In this case, we can perform the algorithm in several passes, each pass covering a number of continuous values. The processing cost will be higher as some nodes (e.g., the root) may be visited several times. It is not clear how to extend NN, D&C, index, or bitmap for group-by skylines beyond the na¨ıve approach, that is, invoke the algorithms for every value of the group-by attribute (e.g., each time focusing on points belonging to a specific group), which, however, would lead to high processing cost. BNL and SFS can be applied in this case by maintaining separate temporary skylines for each class value (similar to the main memory R-trees of BBS). 4A

3D skyline in this case should maximize the value of the class (e.g., given two hotels with the same price and distance, the one with more stars is preferable).
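A flat-file sketch of the per-class bookkeeping, where plain Python lists stand in for the per-class main-memory R-trees; the tuple layout (price, distance, stars) and the sample values are invented for illustration:

```python
def group_by_skyline(hotels):
    """hotels: iterable of (price, distance, stars).  Returns one skyline
    (over price and distance) per class value."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q
    skylines = {}
    for price, distance, stars in hotels:
        sky = skylines.setdefault(stars, [])
        p = (price, distance)
        if any(dominates(s, p) for s in sky):
            continue                          # dominated within its own class
        sky[:] = [s for s in sky if not dominates(p, s)] + [p]
    return skylines

sample = [(100, 3, 4), (90, 5, 4), (120, 2, 4), (60, 8, 3), (55, 9, 3)]
print(group_by_skyline(sample))
```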


4.4 Dynamic Skyline Assume a database containing points in a d -dimensional space with axes d 1 , d 2 , . . . , d d . A dynamic skyline query specifies m dimension functions f 1 , f 2 , . . . , f m such that each function f i (1 ≤ i ≤ m) takes as parameters the coordinates of the data points along a subset of the d axes. The goal is to return the skyline in the new data space with dimensions defined by f 1 , f 2 , . . . , f m . Consider, for instance, a database that stores the following information for each hotel: (i) its x and (ii) y coordinates, and (iii) its price (i.e., the database contains three dimensions). Then, a user specifies his/her current location (ux , u y ), and requests the most interesting hotels, where preference must take into consideration the hotels’ proximity to the user (in terms of Euclidean distance) and the price. Each point p with coordinates ( px , p y , pz ) in the original 3D space is transformed to a point p′ in the 2D space with coordinates ( f 1 ( px , p y ), f 2 ( pz )), where the dimension functions f 1 and f 2 are defined as

f1(px, py) = √((px − ux)² + (py − uy)²),   and   f2(pz) = pz.

The terms original and dynamic space refer to the original d -dimensional data space and the space with computed dimensions (from f 1 , f 2 , . . . , f m ), respectively. Correspondingly, we refer to the coordinates of a point in the original space as original coordinates, while to those of the point in the dynamic space as dynamic coordinates. BBS is applicable to dynamic skylines by expanding entries in the heap according to their mindist in the dynamic space (which is computed on-the-fly when the entry is considered for the first time). In particular, the mindist of a leaf entry (data point) e with original coordinates (ex , e y , ez ), equals 

√((ex − ux)² + (ey − uy)²) + ez. The mindist of an intermediate entry e whose MBR has ranges [ex0, ex1] × [ey0, ey1] × [ez0, ez1] is computed as mindist([ex0, ex1] × [ey0, ey1], (ux, uy)) + ez0, where the first term equals the mindist between point (ux, uy) and the 2D rectangle [ex0, ex1] × [ey0, ey1]. Furthermore, notice that the concept of dynamic skylines can be employed in conjunction with ranked and constraint queries (i.e., find the top five hotels within 1 km, given that the price is twice as important as the distance). BBS can process such queries by appropriate modification of the mindist definition (the z coordinate is multiplied by 2) and by constraining the search region (f1(x, y) ≤ 1 km). Regarding the applicability of the previous methods, BNL still applies because it evaluates every point, whose dynamic coordinates can be computed on-the-fly. The optimizations of SFS, however, are now useless since the order of points in the dynamic space may be different from that in the original space. D&C and NN can also be modified for dynamic queries with the transformations described above, suffering, however, from the same problems as the original algorithms. Bitmap and index are not applicable because these methods rely on pre-computation, which provides little help when the dimensions are defined dynamically.
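A brute-force sketch of a dynamic skyline with the two dimension functions above; every point is transformed explicitly instead of transforming mindists of R-tree entries, and the sample hotels (x, y, price) and user location are invented for illustration:

```python
import math

def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def dynamic_skyline(hotels, user):
    """Skyline in the dynamic space (Euclidean distance to the user, price)."""
    ux, uy = user
    result = []                               # list of (dynamic, original) pairs
    for x, y, price in hotels:
        dyn = (math.hypot(x - ux, y - uy), price)
        if any(dominates(d, dyn) for d, _ in result):
            continue
        result = [(d, o) for d, o in result if not dominates(dyn, d)]
        result.append((dyn, (x, y, price)))
    return [orig for _, orig in result]

sample = [(1, 9, 100), (4, 3, 80), (3, 2, 120), (9, 1, 70)]
print(dynamic_skyline(sample, user=(2, 2)))
```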


4.5 Enumerating and K -Dominating Queries Enumerating queries return, for each skyline point p, the number of points dominated by p. This information provides some measure of “goodness” for the skyline points. In the running example, for instance, hotel i may be more interesting than the other skyline points since it dominates nine hotels as opposed to two for hotels a and k. Let’s call num( p) the number of points dominated by point p. A straightforward approach to process such queries involves two steps: (i) first compute the skyline and (ii) for each skyline point p apply a query window in the data R-tree and count the number of points num( p) falling inside the dominance region of p. Notice that since all (except for the skyline) points are dominated, all the nodes of the R-tree will be accessed by some query. Furthermore, due to the large size of the dominance regions, numerous R-tree nodes will be accessed by several window queries. In order to avoid multiple node visits, we apply the inverse procedure, that is, we scan the data file and for each point we perform a query in the main-memory R-tree to find the dominance regions that contain it. The corresponding counters num( p) of the skyline points are then increased accordingly. An interesting variation of the problem is the K -dominating query, which retrieves the K points that dominate the largest number of other points. Strictly speaking, this is not a skyline query, since the result does not necessarily contain skyline points. If K = 3, for instance, the output should include hotels i, h, and m, with num(i) = 9, num(h) = 7, and num(m) = 5. In order to obtain the result, we first perform an enumerating query that returns the skyline points and the number of points that they dominate. This information for the first K = 3 points is inserted into a list sorted according to num( p), that is, list = < i, 9 >, < a, 2 >, < k, 2 >. The first element of the list (point i) is the first result of the 3-dominating query. Any other point potentially in the result should be in the (exclusive) dominance region of i, but not in the dominance region of a, or k(i.e., in the shaded area of Figure 13(a)); otherwise, it would dominate fewer points than a, or k. In order to retrieve the candidate points, we perform a local skyline query S ′ in this region (i.e., a constrained query), after removing i from S and reporting it to the user. S ′ contains points h and m. The new skyline S1 = (S − {i}) ∪ S ′ is shown in Figure 13(b). Since h and m do not dominate each other, they may each dominate at most seven points (i.e., num(i) − 2), meaning that they are candidates for the 3-dominating query. In order to find the actual number of points dominated, we perform a window query in the data R-tree using the dominance regions of h and m as query windows. After this step, < h, 7 > and < m, 5 > replace the previous candidates < a, 2 >, < k, 2 > in the list. Point h is the second result of the 3-dominating query and is output to the user. Then, the process is repeated for the points that belong to the dominance region of h, but not in the dominance regions of other points in S1 (i.e., shaded area in Figure 13(c)). The new skyline S2 = (S1 − {h}) ∪ {c, g } is shown in Figure 13(d). Points c and g may dominate at most five points each (i.e., num(h) − 2), meaning that they cannot outnumber m. Hence, the query terminates with < i, 9 >< h, 7 >< m, 5 > as the final result. 
In general, the algorithm can be thought of as skyline “peeling,” since it computes local skylines at the points that have the largest dominance.
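The counts used above (num(i) = 9 and num(a) = num(k) = 2) can be reproduced with a brute-force enumerating query over the example points; the following Python sketch illustrates the definition, not the R-tree based counting procedure:

```python
def dominates(p, q):
    return all(a <= b for a, b in zip(p, q)) and p != q

def enumerate_skyline(points):
    """For every skyline point p, count num(p), the number of points it
    dominates (brute force; the article instead scans the data file once
    and probes the main-memory R-tree of dominance regions)."""
    skyline = [n for n, p in points.items()
               if not any(dominates(q, p) for q in points.values())]
    return {n: sum(dominates(points[n], q) for q in points.values())
            for n in skyline}

hotels = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}
print(enumerate_skyline(hotels))   # {'a': 2, 'i': 9, 'k': 2}
```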


Fig. 13. Example of 3-dominating query.

Figure 14 shows the pseudocode for K -dominating queries. It is worth pointing out that the exclusive dominance region of a skyline point for d > 2 is not necessarily a hyperrectangle (e.g., in 3D space it may correspond to an “L-shaped” polyhedron derived by removing a cube from another cube). In this case, the constraint region can be represented as a union of hyperrectangles (constrained BBS is still applicable). Furthermore, since we only care about the number of points in the dominance regions (as opposed to their ids), the performance of window queries can be improved by using aggregate R-trees [Papadias et al. 2001] (or any other multidimensional aggregate index). All existing algorithms can be employed for enumerating queries, since the only difference with respect to regular skylines is the second step (i.e., counting the number of points dominated by each skyline point). Actually, the bitmap approach can avoid scanning the actual dataset, because information about num( p) for each point p can be obtained directly by appropriate juxtapositions of the bitmaps. K -dominating queries require an effective mechanism for skyline “peeling,” that is, discovery of skyline points in the exclusive dominance region of the last point removed from the skyline. Since this requires the application of a constrained query, all algorithms are applicable (as discussed in Section 4.1). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 14. K -dominating BBS algorithm.

Fig. 15. Example of 2-skyband query.

4.6 Skyband Query Similar to K nearest-neighbor queries (that return the K NNs of a point), a K -skyband query reports the set of points which are dominated by at most K points. Conceptually, K represents the thickness of the skyline; the case K = 0 corresponds to a conventional skyline. Figure 15 illustrates the result of a 2skyband query containing hotels {a, b, c, g, h, i, k, m}, each dominated by at most two other hotels. A na¨ıve approach to check if a point p with coordinates ( p1 , p2 , . . . , pd ) is in the skyband would be to perform a window query in the R-tree and count the number of points inside the range [0, p1 ) [0, p2 ) . . . [0, pd ). If this number is smaller than or equal to K , then p belongs to the skyband. Obviously, the approach is very inefficient, since the number of window queries equals the cardinality of the dataset. On the other hand, BBS provides an efficient way for processing skyband queries. The only difference with respect to conventional skylines is that an entry is pruned only if it is dominated by more than K discovered skyline points. Table V shows the contents of the heap during the processing of the query in Figure 15. Note that the skyband points are reported ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Table V. Heap Contents of 2-Skyband Query

Action        Heap Contents                                              S
Access root   <e7, 4> <e6, 6>                                            ∅
Expand e7     <e3, 5> <e6, 6> <e5, 8> <e4, 10>                           ∅
Expand e3     <i, 5> <e6, 6> <h, 7> <e5, 8> <e4, 10> <g, 11>             {i}
Expand e6     <h, 7> <e5, 8> <e1, 9> <e4, 10> <g, 11> <e2, 11>           {i, h}
Expand e5     <m, 8> <e1, 9> <e4, 10> <g, 11> <e2, 11>                   {i, h, m}
Expand e1     <a, 10> <e4, 10> <g, 11> <e2, 11> <b, 12> <c, 12>          {i, h, m, a}
Expand e4     <k, 10> <g, 11> <e2, 11> <b, 12> <c, 12>                   {i, h, m, a, k, g, b, c}
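The final set in Table V, that is, the 2-skyband of Figure 15, can be verified directly from the definition; the following brute-force Python sketch is for illustration only:

```python
def k_skyband(points, k):
    """Points dominated by at most k other points (k = 0 gives the skyline).
    Brute force over a small in-memory dataset."""
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and p != q
    return sorted(name for name, p in points.items()
                  if sum(dominates(q, p) for q in points.values()) <= k)

hotels = {'a': (1, 9), 'b': (2, 10), 'c': (4, 8), 'd': (6, 7), 'e': (9, 10),
          'f': (7, 5), 'g': (5, 6), 'h': (4, 3), 'i': (3, 2), 'k': (9, 1),
          'l': (10, 4), 'm': (6, 2), 'n': (8, 3)}
print(k_skyband(hotels, 2))   # ['a', 'b', 'c', 'g', 'h', 'i', 'k', 'm']
```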

Table VI. Applicability Comparison

               D&C   BNL   SFS   Bitmap   Index   NN    BBS
Constrained    Yes   Yes   Yes   Yes      Yes     Yes   Yes
Ranked         No    No    No    No       No      No    Yes
Group-by       No    Yes   Yes   No       No      No    Yes
Dynamic        Yes   Yes   Yes   No       No      Yes   Yes
K-dominating   Yes   Yes   Yes   Yes      Yes     Yes   Yes
K-skyband      No    Yes   Yes   No       No      No    Yes

in ascending order of their scores, therefore maintaining the progressiveness of the results. BNL and SFS can support K -skyband queries with similar modifications (i.e., insert a point in the list if it is dominated by no more than K other points). None of the other algorithms is applicable, at least in an obvious way. 4.7 Summary Finally, we close this section with Table VI, which summarizes the applicability of the existing algorithms for each skyline variation. A “no” means that the technique is inapplicable, inefficient (e.g., it must perform a postprocessing step on the basic algorithm), or its extension is nontrivial. Even if an algorithm (e.g., BNL) is applicable for a query type (group-by skylines), it does not necessarily imply that it is progressive (the criteria of Section 2.6 also apply to the new skyline queries). Clearly, BBS has the widest applicability since it can process all query types effectively. 5. APPROXIMATE SKYLINES In this section we introduce approximate skylines, which can be used to provide immediate feedback to the users (i) without any node accesses (using a histogram on the dataset), or (ii) progressively, after the root visit of BBS. The problem for computing approximate skylines is that, even for uniform data, we cannot probabilistically estimate the shape of the skyline based only on the dataset cardinality N . In fact, it is difficult to predict the actual number of skyline points (as opposed to their order of magnitude [Buchta 1989]). To illustrate this, Figures 16(a) and 16(b) show two datasets that differ in the position of a single point, but have different skyline cardinalities (1 and 4, respectively). Thus, instead of obtaining the actual shape, we target a hypothetical point p such that its x and y coordinates are the minimum among all the expected coordinates in the dataset. We then define the approximate skyline using the two ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 16. Skylines of uniform data.

line segments enclosing the dominance region of p. As shown in Figure 16(c), this approximation can be thought of as a “low-resolution” skyline. Next we compute the expected coordinates of p. First, for uniform distribution, it is reasonable to assume that p falls on the diagonal of the data space (because the data characteristics above and below the diagonal are similar). Assuming, for simplicity, that the data space has unit length on each axis, we denote the coordinates of p as (λ, λ) with 0 ≤ λ ≤ 1. To derive the expected value for λ, we need the probability P{λ ≤ ξ} that λ is no larger than a specific value ξ. To calculate this, note that λ > ξ implies that all the points fall in the dominance region of (ξ, ξ) (i.e., a square with length 1 − ξ). For uniform data, a point has probability (1 − ξ)^2 to fall in this region, and thus P{λ > ξ} (i.e., the probability that all points are in this region) equals [(1 − ξ)^2]^N. So, P{λ ≤ ξ} = 1 − (1 − ξ)^{2N}, and the expected value of λ is given by

E(λ) = \int_0^1 ξ · \frac{dP(λ ≤ ξ)}{dξ} \, dξ = 2N \int_0^1 ξ (1 − ξ)^{2N−1} \, dξ.        (5.1)

Solving this integral, we have

E(λ) = 1/(2N + 1).        (5.2)
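The step from (5.1) to (5.2) is a standard Beta-integral evaluation; spelling it out:

\int_0^1 ξ (1 − ξ)^{2N−1} \, dξ = B(2, 2N) = \frac{1 · (2N − 1)!}{(2N + 1)!} = \frac{1}{2N(2N + 1)},
\qquad\text{so}\qquad
E(λ) = 2N · \frac{1}{2N(2N + 1)} = \frac{1}{2N + 1}.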

Following similar derivations for d-dimensional spaces, we obtain E(λ) = 1/(d · N + 1). If the dimensions of the data space have different lengths, then


Fig. 17. Obtaining the approximate skyline for nonuniform data.

the expected coordinate of the hypothetical skyline point on dimension i equals ALi /(d · N +1), where ALi is the length of the axis. Based on the above analysis, we can obtain the approximate skyline for arbitrary data distribution using a multidimensional histogram [Muralikrishna and DeWitt 1988; Acharya et al. 1999], which typically partitions the data space into a set of buckets and stores for each bucket the number (called density) of points in it. Figure 17(a) shows the extents of 6 buckets (b1 , . . . , b6 ) and their densities, for the dataset of Figure 1. Treating each bucket as a uniform data space, we compute the hypothetical skyline point based on its density. Then the approximate skyline of the original dataset is the skyline of all the hypothetical points, as shown in Figure 17(b). Since the number of hypothetical points is small (at most the number of buckets), the approximate skyline can be computed using existing main-memory algorithms (e.g., Kung et al. [1975]; Matousek [1991]). Due to the fact that histograms are widely used for selectivity estimation and query optimization, the extraction of approximate skylines does not incur additional requirements and does not involve I/O cost. Approximate skylines using histograms can provide some information about the actual skyline in environments (e.g., data streams, on-line processing systems) where only limited statistics of the data distribution (instead of individual data) can be maintained; thus, obtaining the exact skyline is impossible. When the actual data are available, the concept of approximate skyline, combined with BBS, enables the “drill-down” exploration of the actual one. Consider, for instance, that we want to estimate the skyline (in the absence of histograms) by performing a single node access. In this case, BBS retrieves the data R-tree root and computes by Equation (5.2), for every entry MBR, a hypothetical skyline point (i) assuming that the distribution in each MBR is almost uniform (a reasonable assumption for R-trees [Theodoridis et al. 2000]), and (ii) using the average node capacity and the tree level to estimate the number of points in the MBR. The skyline of the hypothetical points constitutes a rough estimation of the actual skyline. Figure 18(a) shows the approximate skyline after visiting the root entry as well as the real skyline (dashed line). The approximation error corresponds to the difference of the SSRs of the two skylines, that is, the area that is dominated by exactly one skyline (shaded region in Figure 18(a)). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
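A small sketch of the histogram-based construction follows (the bucket representation and all names are illustrative; any main-memory skyline algorithm can be used for the final step, here a brute-force dominance test):

def hypothetical_point(bucket_lo, bucket_len, density, d):
    """Expected 'low-resolution' skyline point of a bucket treated as a uniform data space:
    on each axis, lower bucket boundary plus bucket extent / (d * density + 1)."""
    return tuple(lo + length / (d * density + 1)
                 for lo, length in zip(bucket_lo, bucket_len))

def approximate_skyline(buckets, d):
    """buckets: list of (lower corner, extents, density) triples. Returns the hypothetical
    points that are not dominated by another hypothetical point."""
    hyp = [hypothetical_point(lo, length, n, d) for lo, length, n in buckets]
    return [p for p in hyp
            if not any(q != p
                       and all(a <= b for a, b in zip(q, p))
                       and any(a < b for a, b in zip(q, p))
                       for q in hyp)]

# e.g., two illustrative 2D buckets (not the buckets of Figure 17):
print(approximate_skyline([((0.0, 0.0), (4.0, 4.0), 5),
                           ((4.0, 0.0), (4.0, 4.0), 9)], d=2))

The same routine applies to the drill-down case: each root entry of the R-tree is treated as a bucket whose density is estimated from the average node capacity and the tree level.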


Fig. 18. Approximate skylines as a function of node accesses.

The approximate version of BBS maintains, in addition to the actual skyline S, a set HS consisting of points in the approximate skyline. HS is used just for reporting the current skyline approximation and not to guide the search (the order of node visits remains the same as the original algorithm). For each intermediate entry found, if its hypothetical point p is not dominated by any point in HS, it is added into the approximate skyline and all the points dominated by p are removed from HS. Leaf entries correspond to actual data points and are also inserted in HS (provided that they are not dominated). When an entry is deheaped, we remove the corresponding (hypothetical or actual) point from HS. If a data point is added to S, it is also inserted in HS. The approximate skyline is progressively refined as more nodes are visited, for example, when the second node N7 is deheaped, the hypothetical point of N7 is replaced with those of its children and the new HS is computed as shown in Figure 18(b). Similarly, the expansion of N3 will lead to the approximate skyline of Figure 18(c). At the termination of approximate BBS, the estimated skyline coincides with the actual one. To show this, assume, on the contrary, that at the termination of the algorithm there still exists a hypothetical/actual point p in HS, which does not belong to S. It follows that p is not dominated by the actual skyline. In this case, the corresponding (intermediate or leaf) entry producing p should be processed, contradicting the fact that the algorithm terminates. Note that for computing the hypothetical point of each MBR we use Equation (5.2) because it (i) is simple and efficient (in terms of computation cost), ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 19. Alternative approximations after visiting root and N7 .

(ii) provides a uniform treatment of approximate skylines (i.e., the same as in the case of histograms), and (iii) has high accuracy (as shown in Section 6.8). Nevertheless, we may derive an alternative approximation based on the fact that each MBR boundary contains a data point. Assuming a uniform distribution on the MBR projections and that no point is minimum on two different dimensions, this approximation leads to d hypothetical points per MBR such that the expected position of each point is 1/((d − 1) · N + 1). Figure 19(a) shows the approximate skyline in this case after the first two node visits (root and N7 ). Alternatively, BBS can output an envelope enclosing the actual skyline, where the lower bound refers to the skyline obtained from the lower-left vertices of the MBRs and the upper bound refers to the skyline obtained from the upper-right vertices. Figure 19(b) illustrates the corresponding envelope (shaded region) after the first two node visits. The volume of the envelope is an upper bound for the actual approximation error, which shrinks as more nodes are accessed. The concepts of skyline approximation or envelope permit the immediate visualization of information about the skyline, enhancing the progressive behavior of BBS. In addition, approximate BBS can be easily modified for processing the query variations of Section 4 since the only difference is the maintenance of the hypothetical points in HS for the entries encountered by the original algorithm. The computation of hypothetical points depends on the skyline variation, for example, for constrained skylines the points are computed by taking into account only the node area inside the constraint region. On the other hand, the application of these concepts to NN is not possible (at least in an obvious way), because of the duplicate elimination problem and the multiple accesses to the same node(s). 6. EXPERIMENTAL EVALUATION In this section we verify the effectiveness of BBS by comparing it against NN which, according to the evaluation of Kossmann et al. [2002], is the most efficient existing algorithm and exhibits progressive behavior. Our implementation of NN combined laisser-faire and propagate because, as discussed in Section 2.5, it gives the best results. Specifically, only the first 20% of the to-do list was searched for duplicates using propagate and the rest of the duplicates were ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 20. Node accesses vs. dimensionality d (N = 1M).

handled with laisser-faire. Following the common methodology in the literature, we employed independent (uniform) and anticorrelated5 datasets (generated in the same way as described in Borzsonyi et al. [2001]) with dimensionality d in the range [2, 5] and cardinality N in the range [100K, 10M]. The length of each axis was 10,000. Datasets were indexed by R*-trees [Beckmann et al. 1990] with a page size of 4 kB, resulting in node capacities between 204 (d = 2) and 94 (d = 5). For all experiments we measured the cost in terms of node accesses since the diagrams for CPU-time are very similar (see Papadias et al. [2003]). Sections 6.1 and 6.2 study the effects of dimensionality and cardinality for conventional skyline queries, whereas Section 6.3 compares the progressive behavior of the algorithms. Sections 6.4, 6.5, 6.6, and 6.7 evaluate constrained, group-by skyline, K -dominating skyline, and K -skyband queries, respectively. Finally, Section 6.8 focuses on approximate skylines. Ranked queries are not included because NN is inapplicable, while the performance of BBS is the same as in the experiments for progressive behavior. Similarly, the cost of dynamic skylines is the same as that of conventional skylines in selected dimension projections and omitted from the evaluation. 6.1 The Effect of Dimensionality In order to study the effect of dimensionality, we used the datasets with cardinality N = 1M and varied d between 2 and 5. Figure 20 shows the number of node accesses as a function of dimensionality, for independent and anticorrelated datasets. NN could not terminate successfully for d > 4 in case of independent, and for d > 3 in case of anticorrelated, datasets due to the prohibitive size of the to-do list (to be discussed shortly). BBS clearly outperformed NN and the difference increased fast with dimensionality. The degradation of NN was caused mainly by the growth of the number of partitions (i.e., each skyline point spawned d partitions), as well as the number of duplicates. The degradation of BBS was due to the growth of the skyline and the poor performance of R-trees 5 For

anticorrelated distribution, the dimensions are linearly correlated such that, if pi is smaller than p j on one axis, then pi is likely to be larger on at least one other dimension (e.g., hotels near the beach are typically more expensive). An anticorrelated dataset has fractal dimensionality close to 1 (i.e., points lie near the antidiagonal of the space). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 21. Heap and to-do list sizes versus dimensionality d (N = 1M).

in high dimensions. Note that these factors also influenced NN, but their effect was small compared to the inherent deficiencies of the algorithm. Figure 21 shows the maximum sizes (in kbytes) of the heap, the to-do list, and the dataset, as a function of dimensionality. For d = 2, the to-do list was smaller than the heap, and both were negligible compared to the size of the dataset. For d = 3, however, the to-do list surpassed the heap (for independent data) and the dataset (for anticorrelated data). Clearly, the maximum size of the to-do list exceeded the main-memory of most existing systems for d ≥ 4 (anticorrelated data), which explains the missing numbers about NN in the diagrams for high dimensions. Notice that Kossmann et al. [2002] reported the cost of NN for returning up to the first 500 skyline points using anticorrelated data in five dimensions. NN can return a number of skyline points (but not the complete skyline), because the to-do list does not reach its maximum size until a sufficient number of skyline points have been found (and a large number of partitions have been added). This issue is discussed further in Section 6.3, where we study the sizes of the heap and to-do lists as a function of the points returned. 6.2 The Effect of Cardinality Figure 22 shows the number of node accesses versus the cardinality for 3D datasets. Although the effect of cardinality was not as important as that of dimensionality, in all cases BBS was several orders of magnitude faster than NN. For anticorrelated data, NN did not terminate successfully for N ≥ 5M, again due to the prohibitive size of the to-do list. Some irregularities in the diagrams (a small dataset may be more expensive than a larger one) are due to the positions of the skyline points and the order in which they were discovered. If, for instance, the first nearest neighbor is very close to the origin of the axes, both BBS and NN will prune a large part of their respective search spaces. 6.3 Progressive Behavior Next we compare the speed of the algorithms in returning skyline points incrementally. Figure 23 shows the node accesses of BBS and NN as a function of the points returned for datasets with N = 1M and d = 3 (the number of points in the final skyline was 119 and 977, for independent and anticorrelated datasets, ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 22. Node accesses versus cardinality N (d = 3).

Fig. 23. Node accesses versus number of points reported (N = 1M, d = 3).

respectively). Both algorithms return the first point with the same cost (since they both apply nearest neighbor search to locate it). Then, BBS starts to gradually outperform NN and the difference increases with the number of points returned. To evaluate the quality of the results, Figure 24 shows the distribution of the first 50 skyline points (out of 977) returned by each algorithm for the anticorrelated dataset with N = 1M and d = 3. The initial skyline points of BBS are evenly distributed in the whole skyline, since they were discovered in the order of their mindist (which was independent of the algorithm). On the other hand, NN produced points concentrated in the middle of the data universe because the partitioned regions, created by new skyline points, were inserted at the end of the to-do list, and thus nearby points were subsequently discovered. Figure 25 compares the sizes of the heap and to-do lists as a function of the points returned. The heap reaches its maximum size at the beginning of BBS, whereas the to-do list reaches it toward the end of NN. This happens because before BBS discovered the first skyline point, it inserted all the entries of the visited nodes in the heap (since no entry can be pruned by existing skyline points). The more skyline points were discovered, the more heap entries were pruned, until the heap eventually became empty. On the other hand, the to-do list size is dominated by empty queries, which occurred toward the late phases of NN when the space subdivisions became too small to contain any points. Thus, NN could still be used to return a number of skyline points (but not the complete skyline) even for relatively high dimensionality. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 24. Distribution of the first 50 skyline points (anticorrelated, N = 1M, d = 3).

Fig. 25. Sizes of the heap and to-do list versus number of points reported (N = 1M, d = 3).

6.4 Constrained Skyline Having confirmed the efficiency of BBS for conventional skyline retrieval, we present a comparison between BBS and NN on constrained skylines. Figure 26 shows the node accesses of BBS and NN as a function of the constraint region volume (N = 1M, d = 3), which is measured as a percentage of the volume of the data universe. The locations of constraint regions were uniformly generated and the results were computed by taking the average of 50 queries. Again BBS was several orders of magnitude faster than NN. The counterintuitive observation here is that constraint regions covering more than 8% of the data space are usually more expensive than regular skylines. Figure 27(a) verifies the observation by illustrating the node accesses of BBS on independent data, when the volume of the constraint region ranges between 98% and 100% (i.e., regular skyline). Even a range very close to 100% is much more expensive than a conventional skyline. Similar results hold for NN (see Figure 27(b)) and anticorrelated data. To explain this, consider Figure 28(a), which shows a skyline S in a constraint region. The nodes that must be visited intersect the constrained skyline search region (shaded area) defined by S and the constraint region. In this example, all four nodes e1 , e2 , e3 , e4 may contain skyline points and should be accessed. On the other hand, if S were a conventional skyline, as in Figure 28(b), nodes ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 26. Node accesses versus volume of constraint region (N = 1M, d = 3).

Fig. 27. Node accesses versus volume of constraint region 98–100% (independent, N = 1M, d = 3).

e2 , e3 , and e4 could not exist because they should contain at least a point that dominates S. In general, the only data points of the conventional SSR (shaded area in Figure 28(b)) lie on the skyline, implying that, for any node MBR, at most one of its vertices can be inside the SSR. For constrained skylines there is no such restriction and the number of nodes intersecting the constrained SSR can be arbitrarily large. It is important to note that the constrained queries issued when a skyline point is removed during incremental maintenance (see Section 3.4) are always cheaper than computing the entire skyline from scratch. Consider, for instance, that the partial skyline of Figure 28(a) is computed for the exclusive dominance area of a deleted skyline point p on the lower-left corner of the constraint region. In this case nodes such as e2 , e3 , e4 cannot exist because otherwise they would have to contain skyline points, contradicting the fact that the constraint region corresponds to the exclusive dominance area of p. 6.5 Group-By Skyline Next we consider group-by skyline retrieval, including only BBS because, as discussed in Section 4, NN is inapplicable in this case. Toward this, we generate datasets (with cardinality 1M) in a 3D space that involves two numerical dimensions and one categorical axis. In particular, the number cnum of categories is a parameter ranging from 2 to 64 (cnum is also the number of 2D skylines returned by a group-by skyline query). Every data point has equal probability ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 28. Nodes potentially intersecting the SSR.

Fig. 29. BBS node accesses versus cardinality of categorical axis cnum (N = 1M, d = 3).

to fall in each category, and, for all the points in the same category, their distribution (on the two numerical axes) is either independent or anticorrelated. Figure 29 demonstrates the number of node accesses as a function of cnum . The cost of BBS increases with cnum because the total number of skyline points (in all 2D skylines) and the probability that a node may contain qualifying points in some category (and therefore it should be expanded) is proportional to the size of the categorical domain. 6.6 K -Dominating Skyline This section measures the performance of NN and BBS on K -dominating queries. Recall that each K -dominating query involves an enumerating query (i.e., a file scan), which retrieves the number of points dominated by each skyline point. The K skyline points with the largest counts are found and the top-1 is immediately reported. Whenever an object is reported, a constrained skyline is executed to find potential candidates in its exclusive dominance region (see Figure 13). For each such candidate, the number of dominated points is retrieved using a window query on the data R-tree. After this process, the object with the largest count is reported (i.e., the second best object), another constrained query is performed, and so on. Therefore, the total number of constrained queries is K − 1, and each such query may trigger multiple window queries. Figure 30 demonstrates the cost of BBS and NN as a function of K . The overhead of the enumerating and (multiple) window queries dominates the total cost, and consequently BBS and NN have a very similar performance. Interestingly, the overhead of the anticorrelated data is lower (than the independent distribution) because each skyline point dominates fewer points ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 30. NN and BBS node accesses versus number of objects to be reported for K -dominating queries (N = 1M, d = 2).

Fig. 31. BBS node accesses versus “thickness” of the skyline for K -skyband queries (N = 1M, d = 3).

(therefore, the number of window queries is smaller). The high cost of K -dominating queries (compared to other skyline variations) is due to the complexity of the problem itself (and not the proposed algorithm). In particular, a K -dominating query is similar to a semijoin and could be processed accordingly. For instance a nested-loops algorithm would (i) count, for each data point, the number of dominated points by scanning the entire database, (ii) sort all the points in descending order of the counts, and (iii) report the K points with the highest counts. Since in our case the database occupies more than 6K nodes, this algorithm would need to access 36E+6 nodes (for any K ), which is significantly higher than the costs in Figure 30 (especially for low K ). 6.7 K -Skyband Next, we evaluate the performance of BBS on K -skyband queries (NN is inapplicable). Figure 31 shows the node accesses as a function of K ranging from 0 (conventional skyline) to 9. As expected, the performance degrades as K increases because a node can be pruned only if it is dominated by more than K discovered skyline points, which becomes more difficult for higher K . Furthermore, the number of skyband points is significantly larger for anticorrelated data, for example, for K = 9, the number is 788 (6778) in the independent (anticorrelated) case, which explains the higher costs in Figure 31(b). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 32. Approximation error versus number of minskew buckets (N = 1M, d = 3).

6.8 Approximate Skylines This section evaluates the quality of the approximate skyline using a hypothetical point per bucket or visited node (as shown in the examples of Figures 17 and 18, respectively). Given an estimated and an actual skyline, the approximation error corresponds to their SSR difference (see Section 5). In order to measure this error, we used a numerical approach: (i) we first generated a large number α of points (α = 104 ) uniformly distributed in the data space, and (ii) counted the number β of points that are dominated by exactly one skyline. The error equals β/α, which approximates the volume of the SSR difference divided by the volume of the entire data space. We did not use a relative error (e.g., volume of the SSR difference divided by the volume of the actual SSR) because such a definition is sensitive to the position of the actual skyline (i.e., a skyline near the origin of the axes would lead to higher error even if the SSR difference remains constant). In the first experiment, we built a minskew [Acharya et al. 1999] histogram on the 3D datasets by varying the number of buckets from 100 to 1000, resulting in main-memory consumption in the range of 3K bytes (100) to 30K bytes (1000 buckets). Figure 32 illustrates the error as a function of the bucket number. For independent distribution, the error is very small (less than 0.01%) even with the smallest number of buckets because the rough “shape” of the skyline for a uniform dataset can be accurately predicted using Equation (5.2). On the other hand, anticorrelated data were skewed and required a large number of buckets for achieving high accuracy. Figure 33 evaluates the quality of the approximation as a function of node accesses (without using a histogram). As discussed in Section 5, the first rough estimate of the skyline is produced when BBS visits the root entry and then the approximation is refined as more nodes are accessed. For independent data, extremely accurate approximation (with error 0.01%) can be obtained immediately after retrieving the root, a phenomenon similar to that in Figure 32(a). For anti-correlated data, the error is initially large (around 15% after the root visit), but decreases considerably with only a few additional node accesses. Particularly, the error is less than 3% after visiting 30 nodes, and close to zero with around 100 accesses (i.e., the estimated skyline is almost identical to the actual ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 33. BBS approximation error versus number of node accesses (N = 1M, d = 3).

one with about 25% of the node accesses required for the discovery of the actual skyline). 7. CONCLUSION The importance of skyline computation in database systems increases with the number of emerging applications requiring efficient processing of preference queries and the amount of available data. Consider, for instance, a bank information system monitoring the attribute values of stock records and answering queries from multiple users. Assuming that the user scoring functions are monotonic, the top-1 result of all queries is always a part of the skyline. Similarly, the top-K result is always a part of the K -skyband. Thus, the system could maintain only the skyline (or K -skyband) and avoid searching a potentially very large number of records. However, all existing database algorithms for skyline computation have several deficiencies, which severely limit their applicability. BNL and D&C are not progressive. Bitmap is applicable only for datasets with small attribute domains and cannot efficiently handle updates. Index cannot be used for skyline queries on a subset of the dimensions. SFS, like all above algorithms, does not support user-defined preferences. Although NN was presented as a solution to these problems, it introduces new ones, namely, poor performance and prohibitive space requirements for more than three dimensions. This article proposes BBS, a novel algorithm that overcomes all these shortcomings since (i) it is efficient for both progressive and complete skyline computation, independently of the data characteristics (dimensionality, distribution), (ii) it can easily handle user preferences and process numerous alternative skyline queries (e.g., ranked, constrained, approximate skylines), (iii) it does not require any precomputation (besides building the R-tree), (iv) it can be used for any subset of the dimensions, and (v) it has limited main-memory requirements. Although in this implementation of BBS we used R-trees in order to perform a direct comparison with NN, the same concepts are applicable to any datapartitioning access method. In the future, we plan to investigate alternatives (e.g., X-trees [Berchtold et al. 1996], and A-trees [Sakurai et al. 2000]) for highdimensional spaces, where R-trees are inefficient). Another possible solution for ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


high dimensionality would include (i) converting the data points to subspaces with lower dimensionalities, (ii) computing the skyline in each subspace, and (iii) merging the partial skylines. Finally, a topic worth studying concerns skyline retrieval in other application domains. For instance, Balke et al. [2004] studied skyline computation for Web information systems considering that the records are partitioned in several lists, each residing at a distributed server. The tuples in every list are sorted in ascending order of a scoring function, which is monotonic on all attributes. Their processing method uses the main concept of the threshold algorithm [Fagin et al. 2001] to compute the entire skyline by reading the minimum number of records in each list. Another interesting direction concerns skylines in temporal databases [Salzberg and Tsotras 1999] that retain historical information. In this case, a query could ask for the most interesting objects at a past timestamp or interval. REFERENCES ACHARYA, S., POOSALA, V., AND RAMASWAMY, S. 1999. Selectivity estimation in spatial databases. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Philadelphia, PA, June 1–3). 13–24. BALKE, W., GUNZER, U., AND ZHENG, J. 2004. Efficient distributed skylining for Web information systems. In Proceedings of the International Conference on Extending Database Technology (EDBT; Heraklio, Greece, Mar. 14–18). 256–273. BECKMANN, N., KRIEGEL, H., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Atlantic City, NJ, May 23–25). 322–331. BERCHTOLD, S., KEIM, D., AND KRIEGEL, H. 1996. The X-tree: An index structure for highdimensional data. In Proceedings of the Very Large Data Bases Conference (VLDB; Mumbai, India, Sep. 3–6). 28–39. ¨ , C. AND KRIEGEL, H. 2001. Determining the convex hull in large multidimensional BOHM databases. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery (DaWaK; Munich, Germany, Sep. 5–7). 294–306. BORZSONYI, S., KOSSMANN, D., AND STOCKER, K. 2001. The skyline operator. In Proceedings of the IEEE International Conference on Data Engineering (ICDE; Heidelberg, Germany, Apr. 2–6). 421–430. BUCHTA, C. 1989. On the average number of maxima in a set of vectors. Inform. Process. Lett., 33, 2, 63–65. CHANG, Y., BERGMAN, L., CASTELLI, V., LI, C., LO, M., AND SMITH, J. 2000. The Onion technique: Indexing for linear optimization queries. In Proceedings of the ACM Conference on the Management of data (SIGMOD; Dallas, TX, May 16–18). 391–402. CHOMICKI, J., GODFREY, P., GRYZ, J., AND LIANG, D. 2003. Skyline with pre-sorting. In Proceedings of the IEEE International Conference on Data Engineering (ICDE; Bangalore, India, Mar. 5–8). 717–719. FAGIN, R., LOTEM, A., AND NAOR, M. 2001. Optimal aggregation algorithms for middleware. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS; Santa Barbara, CA, May 21–23). 102–113. FERHATOSMANOGLU, H., STANOI, I., AGRAWAL, D., AND ABBADI, A. 2001. Constrained nearest neighbor queries. In Proceedings of the International Symposium on Spatial and Temporal Databases (SSTD; Redondo Beach, CA, July 12–15). 257–278. GUTTMAN, A. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Boston, MA, June 18–21). 47– 57. 
HELLERSTEIN, J., AVNUR, R., CHOU, A., HIDBER, C., OLSTON, C., RAMAN, V., ROTH, T., AND HAAS, P. 1999. Interactive data analysis: The control project. IEEE Comput. 32, 8, 51–59.


HENRICH, A. 1994. A distance scan algorithm for spatial access structures. In Proceedings of the ACM Workshop on Geographic Information Systems (ACM GIS; Gaithersburg, MD, Dec.). 136–143. HJALTASON, G. AND SAMET, H. 1999. Distance browsing in spatial databases. ACM Trans. Database Syst. 24, 2, 265–318. HRISTIDIS, V., KOUDAS, N., AND PAPAKONSTANTINOU, Y. 2001. PREFER: A system for the efficient execution of multi-parametric ranked queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; May 21–24). 259–270. KOSSMANN, D., RAMSAK, F., AND ROST, S. 2002. Shooting stars in the sky: An online algorithm for skyline queries. In Proceedings of the Very Large Data Bases Conference (VLDB; Hong Kong, China, Aug. 20–23). 275–286. KUNG, H., LUCCIO, F., AND PREPARATA, F. 1975. On finding the maxima of a set of vectors. J. Assoc. Comput. Mach., 22, 4, 469–476. MATOUSEK, J. 1991. Computing dominances in En . Inform. Process. Lett. 38, 5, 277–278. MCLAIN, D. 1974. Drawing contours from arbitrary data points. Comput. J. 17, 4, 318–324. MURALIKRISHNA, M. AND DEWITT, D. 1988. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; Chicago, IL, June 1–3). 28–36. NATSEV, A., CHANG, Y., SMITH, J., LI., C., AND VITTER. J. 2001. Supporting incremental join queries on ranked inputs. In Proceedings of the Very Large Data Bases Conference (VLDB; Rome, Italy, Sep. 11–14). 281–290. PAPADIAS, D., TAO, Y., FU, G., AND SEEGER, B. 2003. An optimal and progressive algorithm for skyline queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; San Diego, CA, June 9–12). 443–454. PAPADIAS, D., KALNIS, P., ZHANG, J., AND TAO, Y. 2001. Efficient OLAP operations in spatial data warehouses. In Proceedings of International Symposium on Spatial and Temporal Databases (SSTD; Redondo Beach, CA, July 12–15). 443–459. PREPARATA, F. AND SHAMOS, M. 1985. Computational Geometry—An Introduction. Springer, Berlin, Germany. ROUSSOPOULOS, N., KELLY, S., AND VINCENT, F. 1995. Nearest neighbor queries. In Proceedings of the ACM Conference on the Management of Data (SIGMOD; San Jose, CA, May 22–25). 71–79. SAKURAI, Y., YOSHIKAWA, M., UEMURA, S., AND KOJIMA, H. 2000. The A-tree: An index structure for high-dimensional spaces using relative approximation. In Proceedings of the Very Large Data Bases Conference (VLDB; Cairo, Egypt, Sep. 10–14). 516–526. SALZBERG, B. AND TSOTRAS, V. 1999. A comparison of access methods for temporal data. ACM Comput. Surv. 31, 2, 158–221. SELLIS, T., ROUSSOPOULOS, N., AND FALOUTSOS, C. 1987. The R+-tree: A dynamic index for multidimensional objects. In Proceedings of the Very Large Data Bases Conference (VLDB; Brighton, England, Sep. 1–4). 507–518. STEUER, R. 1986. Multiple Criteria Optimization. Wiley, New York, NY. TAN, K., ENG, P., AND OOI, B. 2001. Efficient progressive skyline computation. In Proceedings of the Very Large Data Bases Conference (VLDB; Rome, Italy, Sep. 11–14). 301–310. THEODORIDIS, Y., STEFANAKIS, E., AND SELLIS, T. 2000. Efficient cost models for spatial queries using R-trees. IEEE Trans. Knowl. Data Eng. 12, 1, 19–32. Received October 2003; revised April 2004; accepted June 2004


Advanced SQL Modeling in RDBMS ANDREW WITKOWSKI, SRIKANTH BELLAMKONDA, TOLGA BOZKAYA, NATHAN FOLKERT, ABHINAV GUPTA, JOHN HAYDU, LEI SHENG, and SANKAR SUBRAMANIAN Oracle Corporation

Commercial relational database systems lack support for complex business modeling. ANSI SQL cannot treat relations as multidimensional arrays and define multiple, interrelated formulas over them, operations which are needed for business modeling. Relational OLAP (ROLAP) applications have to perform such tasks using joins, SQL Window Functions, complex CASE expressions, and the GROUP BY operator simulating the pivot operation. The designated place in SQL for calculations is the SELECT clause, which is extremely limiting and forces the user to generate queries with nested views, subqueries and complex joins. Furthermore, SQL query optimizers are preoccupied with determining efficient join orders and choosing optimal access methods and largely disregard optimization of multiple, interrelated formulas. Research into execution methods has thus far concentrated on efficient computation of data cubes and cube compression rather than on access structures for random, interrow calculations. This has created a gap that has been filled by spreadsheets and specialized MOLAP engines, which are good at specification of formulas for modeling but lack the formalism of the relational model, are difficult to coordinate across large user groups, exhibit scalability problems, and require replication of data between the tool and RDBMS. This article presents an SQL extension called SQL Spreadsheet, to provide array calculations over relations for complex modeling. We present optimizations, access structures, and execution models for processing them efficiently. Special attention is paid to compile time optimization for expensive operations like aggregation. Furthermore, ANSI SQL does not provide a good separation between data and computation and hence cannot support parameterization for SQL Spreadsheets models. We propose two parameterization methods for SQL. One parameterizes ANSI SQL view using subqueries and scalars, which allows passing data to SQL Spreadsheet. Another method presents parameterization of the SQL Spreadsheet formulas. This supports building stand-alone SQL Spreadsheet libraries. These models are then subject to the SQL Spreadsheet optimizations during model invocation time. Categories and Subject Descriptors: H.2.3. [Database Management]: Languages—Data manipulation languages (DML); query languages; H.2.4. [Database Management]: Systems—Query processing General Terms: Design, Languages Additional Key Words and Phrases: Excel, analytic computations, OLAP, spreadsheet

1. INTRODUCTION One of the most successful analytical tools for business data is the spreadsheet. A user can enter business data, define formulas over it using two-dimensional Authors’ addresses: Oracle Corporation, 500 Oracle Parkway, Redwood Shores, CA 94065; email: {andrew.witkowski,srikanth.bellamkonda,tolga.bozkaya,nathan.folkert,abhinav.gupta,john. haydu,lei.sheng,sankar.subramanian}@oracle.com. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.  C 2005 ACM 0362-5915/05/0300-0083 $5.00 ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 83–121.


array abstractions, construct simultaneous equations with recursive models, pivot data and compute aggregates for selected cells, apply a rich set of business functions, etc. Spreadsheets also provide flexible user interfaces like graphs and reports. Unfortunately, analytical usefulness of the RDBMS has not measured up to that of spreadsheets [Blattner 1999; Simon 2000] or specialized MOLAP tools like Microsoft Analytical Services [Peterson and Pinkelman 2000; Thomsen et al. 1999], Oracle Analytic Workspaces [OLAP Application Developer’s Guide 2004], and others [Balmin et al. 2000; Howson 2002]. It is cumbersome and in most cases inefficient to perform array calculations in SQL—a fundamental problem resulting from lack of language constructs to treat relations as arrays and lack of efficient random access methods for their access. To simulate array computations on a relation SQL users must resort to using multiple self-joins to align different rows, must use ANSI SQL Window functions to reach from one row into another, or must use ANSI SQL GROUP BY operator to pivot a table and simulate interrow with intercolumn computations. None of the operations is natural or efficient for array computations with multiple formulas found in spreadsheets. Spreadsheets, for example Microsoft Excel [Simon 2000], provide an excellent user interface but have their own problems. They offer two-dimensional “row-column” addressing. Hence, it is hard to build a model where formulas reference data via symbolic references. In addition, they do not scale well when the data set is large. For example, a single sheet in a spreadsheet typically supports up to 64K rows with about 200 columns, and handling terabytes of sales data is practically impossible even when using multiple sheets. Furthermore, spreadsheets do not support the parallel processing necessary to process terabytes of data in small windows of time. In collaborative analysis with multiple spreadsheets, it is nearly impossible to get a complete picture of the business by querying multiple, inconsistent spreadsheets each using its own layout and placement of data. There is no standard metadata or a unified abstraction interrelating them akin to RDBMS dictionary tables and RDBMS relations. This article proposes spreadsheet-like computations in RDBMS through extensions to SQL, leaving the user interface aspects to be handled by OLAP tools. Here is a glimpse of our proposal: —Relations can be viewed as n-dimensional arrays, and formulas can be defined over the cells of these arrays. Cell addressing is symbolic, using dimensional columns. — The formulas can automatically be ordered based on the dependencies between cells. — Recursive references and convergence conditions are supported, providing for a recursive execution model. — Densification (filling gaps in sparse data) can be easily performed. — Formulas are encapsulated in a new SQL query clause. Their result is a relation and can be further used in joins, subqueries, etc. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


— The new clause supports logical partitioning of the data providing a natural mechanism of parallel execution. —Formulas support INSERT and UPDATE semantics as well as correlation between their left and right sides. This allows us to simulate the effect of multiple joins and UNIONs using a single access structure. Furthermore, our article addresses lack of parameterization models in ANSI SQL. The issue is critical for model building as this ANSI SQL shortcoming prevents us from constructing parameterized libraries of SQL Spreadsheet. We propose two new parameterization methods for SQL. One parameterizes ANSI SQL views with subqueries and scalars allowing passing data to inner query blocks and hence to SQL Spreadsheet. The second model is a parameterization of the SQL Spreadsheet formulas. We can declare a named set of formulas, called SQL Spreadsheet Procedure, operating on an N-dimensional array that can be invoked from an SQL Spreadsheet. The array is passed by reference to the SQL Spreadsheet Procedure. We support merging of formulas from SQL Spreadsheet Procedure to the main body of SQL Spreadsheet. This allows for global formula optimizations, like removal of unused formulas, etc. SQL Spreadsheet Procedures are useful for building standalone SQL Spreadsheet libraries. This article is organized as follows. Section 2 provides SQL language extensions for spreadsheets. Section 3 provides motivating examples. Section 4 presents an overview of the evaluation of spreadsheets in SQL. Section 5 describes the analysis of the spreadsheet clause and query optimizations with spreadsheets. Section 6 discusses our execution models. Section 7 describes our parameterization models. Section 8 reports results from performance experiments on spreadsheet queries, and Section 9 contains our conclusions. The electronic appendix explains parallel execution of SQL Spreadsheets and presents our experimental results; it also discusses our future research in this area. 2. SQL EXTENSIONS FOR SPREADSHEETS 2.1 Notation In the following examples, we will use a fact table f (t, r, p, s, c) representing a data-warehouse of consumer-electronic products with three dimensions: time (t), region (r), and product ( p), and two measures: sales (s) and cost (c). 2.2 Spreadsheet Clause OLAP applications divide relational attributes into dimensions and measures. To model that, we introduce a new SQL query clause, called the spreadsheet clause, which identifies, within the query result, PARTITION, DIMENSION, and MEASURES columns. The PARTITION (PBY) columns divide the relation into disjoint subsets. The DIMENSION (DBY) columns identify a unique row within each partition, and this row is called a cell. The MEASURES (MEA) columns identify expressions computed by the spreadsheet and are referenced by DBY columns. Following this, there is a sequence of formulas, each describing ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


a computation on the cells. Thus the structure of the spreadsheet clause is

SPREADSHEET PBY (cols) DBY (cols) MEA (cols)

(<formula 1>, <formula 2>, .., <formula n>) It is evaluated after joins, aggregations, window functions, and final projection, but before the ORDER BY clause. Cells are referenced using an array notation in which a measure is followed by square brackets holding dimension values. Thus s[‘vcr’, 2002] is a reference to the cell containing sales of the ‘vcr’ product in 2002. If the dimensions are uniquely qualified, the cell reference is called a single cell reference, for example, s[p=‘dvd’, t=2002]. If the dimensions are qualified by general predicates, the cell reference refers to a set of cells and is called a range reference, for example, s[p=‘dvd’, t, x SAMPLE PERIOD 1s

In this query, the maximum of 8 s worth of light readings will be computed, but only light readings from sensors whose magnetometers read greater than x will be considered. Interestingly, it turns out that, unless the mag > x predicate is very selective, it will be cheaper to evaluate this query by checking to see if


each new light reading is greater than the previous reading and then applying the selection predicate over mag, rather than first sampling mag. This sort of reordering, which we call exemplary aggregate pushdown can be applied to any exemplary aggregate (e.g., MIN, MAX). Similar ideas have been explored in the deductive database community by Sudarshan and Ramakrishnan [1991]. The same technique can be used with nonwindowed aggregates when performing in-network aggregation. Suppose we are applying an exemplary aggregate at an intermediate node in the routing tree; if there is an expensive acquisition required to evaluate a predicate (as in the query above), then it may make sense to see if the local value affects the value of the aggregate before acquiring the attribute used in the predicate. To add support for exemplary aggregate pushdown, we need a way to evaluate the selectivity of exemplary aggregates. In the absence of statistics that reflect how a predicate changes over time, we simply assume that the attributes involved in an exemplary aggregate (such as light in the query above) are sampled from the same distribution. Thus, for MIN and MAX aggregates, the likelihood that the second of two samples is less than (or greater than) the first is 0.5. For n samples, the likelihood that the nth is the value reported by the aggregate is thus 1/.5n−1 . By the same reasoning, for bottom (or top)-k aggregates, assuming k < n, the nth sample will be reported with probability 1/.5n−k−1 . Given this selectivity estimate for an exemplary aggregate, S(a), over attribute a with acquisition cost C(a), we can compute the benefit of exemplary aggregate pushdown. We assume the query contains some set of conjunctive predicates with aggregate selectivity P over several expensive acquisitional attributes with aggregate acquisition cost K . We assume the values of S(a), C(a), K , and P are available in the catalog. Then, the cost of evaluating the query without exemplary aggregate pushdown is K + P ∗ C(a)

(1)

and with pushdown it becomes

C(a) + S(a) ∗ K.

(2)
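To make the comparison concrete, here is a minimal sketch of the decision (S(a), C(a), K, and P are the catalog quantities defined above; the numeric values below are made up for illustration):

def pushdown_beneficial(C_a, S_a, K, P):
    """True when the pushed-down plan, cost (2), is expected to be cheaper than plan (1)."""
    cost_without = K + P * C_a    # (1): acquire the predicate attributes, then a with prob. P
    cost_with = C_a + S_a * K     # (2): acquire a, then the predicate attributes with prob. S(a)
    return cost_with < cost_without

# Shape of the light/mag example above: a is the cheap aggregate attribute (light) and K is
# the cost of the expensive predicate attribute (mag). S_a is an illustrative selectivity of
# the exemplary aggregate; by the 0.5-per-comparison reasoning above it shrinks geometrically
# with the number of samples seen so far.
print(pushdown_beneficial(C_a=0.5, S_a=0.125, K=90.0, P=0.5))   # True: push the aggregate down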

When (2) is less than (1), there will be an expected benefit to exemplary aggregate pushdown, and it should be applied. 4.3 Technique 2: Event Query Batching to Conserve Power As a second example of the benefit of power-aware optimization, we consider the optimization of the query

ON EVENT e(nodeid)
SELECT a1
FROM sensors AS s
WHERE s.nodeid = e.nodeid
SAMPLE PERIOD d FOR k

This query will cause an instance of the internal query (SELECT ...) to be started every time the event e occurs. The internal query samples results every d seconds for a duration of k seconds, at which point it stops running.
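These semantics can be sketched as follows (illustrative Python, not TinyDB code); note that each instance samples at its own phase, t + d, t + 2d, ..., relative to its triggering event:

def naive_on_event_samples(event_times, d, k):
    """Naive ON EVENT semantics: every occurrence of e at time t starts an independent
    instance of the internal query, which acquires a sample at t + d, t + 2d, ... for k
    seconds and then stops."""
    samples = []
    for t in event_times:
        s = t + d
        while s <= t + k:
            samples.append((t, s))   # (triggering event, sample time): one acquisition each
            s += d
    return samples

# Three overlapping events with d = 1 s and k = 5 s: three instances run concurrently for
# part of the time and 15 samples are acquired in total, whereas the stream-join rewrite
# described next samples only once per sample period no matter how many events are active.
print(len(naive_on_event_samples([0, 2, 4], d=1, k=5)))   # 15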


Fig. 7. The cost of processing event-based queries as asynchronous events versus joins.

Note that, according to this specification of how an ON EVENT query is processed, it is possible for multiple instances of the internal query to be running at the same time. If enough such queries are running simultaneously, the benefit of event-based queries (e.g., not having to poll for results) will be outweighed by the fact that each instance of the query consumes significant energy sampling and delivering (independent) results. To alleviate the burden of running multiple copies of the same identical query, we employ a multiquery optimization technique based on rewriting. To do this, we convert external events (of type e) into a stream of events, and rewrite the entire set of independent internal queries as a sliding window join between events and sensors, with a window size of k seconds on the event stream, and no window on the sensor stream. For example:

SELECT s.a1
FROM sensors AS s, events AS e
WHERE s.nodeid = e.nodeid
  AND e.type = e
  AND s.time - e.time <= k
  AND s.time > e.time
SAMPLE PERIOD d
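The following paragraph describes how such a rewritten query is executed; a minimal sketch of that buffer-based strategy is given here (the tuple layout and all names are illustrative):

def stream_join_step(event_buffer, sensor_tuple, now, k):
    """One iteration of the rewritten query, run every d seconds: expire events older than
    k seconds, then join the freshly acquired sensor tuple with the remaining events."""
    event_buffer[:] = [e for e in event_buffer if now - e["time"] <= k]
    return [{"nodeid": sensor_tuple["nodeid"],
             "a1": sensor_tuple["a1"],
             "event_time": e["time"]}
            for e in event_buffer
            if e["nodeid"] == sensor_tuple["nodeid"]
            and sensor_tuple["time"] > e["time"]]

# Events are appended to event_buffer as they arrive; the (cheap) emptiness check happens
# before the sensor is sampled:
#   if event_buffer:
#       results = stream_join_step(event_buffer, acquire_sensor_tuple(), now, k)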

We execute this query by treating it as a join between a materialization point of size k on events and the sensors stream. When an event tuple arrives, it is added to the buffer of events. When a sensor tuple s arrives, events older than k seconds are dropped from the buffer and s is joined with the remaining events. The advantage of this approach is that only one query runs at a time no matter how frequently the events of type e are triggered. This offers a large potential savings in sampling and transmission cost. At first it might seem as though requiring the sensors to be sampled every d seconds irrespective of the contents of the event buffer would be prohibitively expensive. However, the check to see if the event buffer is empty can be pushed before the sampling of the sensors, and can be done relatively quickly. Figure 7 shows the power tradeoff for event-based queries that have and have not been rewritten. Rewritten queries are labeled as stream join and nonrewritten queries as async events. We measure the cost in mW of the two approaches


Table IV. Parameters Used in Asynchronous Events Versus Stream-Join Study

Parameter      Description                                                    Value
tsample        Length of sample period                                        1/8 s
nevents        Number of events per second                                    0–5 (x axis)
durevent       Time for which events are active (FOR clause)                  1, 3, or 5 s
mWproc         Processor power consumption                                    12 mW
mssample       Time to acquire a sample, including processing and ADC time    0.35 ms
mWsample       Power used while sampling, including processor                 13 mW
mJsample       Energy per sample                                              Derived
mWidle         Milliwatts used while idling                                   Derived
tidle          Time spent idling per sample period (in seconds)               Derived
mJidle         Energy spent idling                                            Derived
mscheck        Time to check for enqueued event                               0.02 ms (80 instrs)
mJcheck        Energy to check if an event has been enqueued                  Derived
mWevents       Total power used in asynchronous event mode                    Derived
mWstreamJoin   Total power used in stream-join mode                           Derived

using a numerical model of power costs for idling, sampling and processing (including the cost to check if the event queue is nonempty in the event-join case), but excluding transmission costs to avoid complications of modeling differences in cardinalities between the two approaches. The expectation was that the asynchronous approach would generally transmit many more results. We varied the sample rate and duration of the inner query, and the frequency of events. We chose the specific parameters in this plot to demonstrate query optimization tradeoffs; for much faster or slower event rates, one approach tends to always be preferable. In this case, the stream-join rewrite is beneficial when events occur frequently; this might be the case if, for example, an event is triggered whenever a signal goes above or below a threshold with a signal that is sampled tens or hundreds of times per second; vibration monitoring applications tend to have this kind of behavior. Table IV summarizes the parameters used in this experiment; “derived” values are computed by the model below. Power consumption numbers and sensor timings are drawn from Table III and the Atmel 128 data sheet (see the Atmel Corporation reference cited in the footnotes to Table III). The cost in milliwatts of the asynchronous events approach, mWevents, is modeled via the following equations:

tidle = tsample − nevents × durevent × mssample/1000
mJidle = mWidle × tidle
mJsample = mWsample × mssample/1000
mWevents = (nevents × durevent × mJsample + mJidle)/tsample

The cost in milliwatts of the stream-join approach, mWstreamJoin, is then

tidle = tsample − (mscheck + mssample)/1000
mJidle = mWidle × tidle
mJcheck = mWproc × mscheck/1000
mJsample = mWsample × mssample/1000
mWstreamJoin = (mJcheck + mJsample + mJidle)/tsample
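The model transcribes directly into code; a sketch with the fixed values of Table IV follows (the idle-power figure is a placeholder, since mWidle is derived from measurements not reproduced in this excerpt):

def mw_async_events(nevents, durevent, mw_idle, tsample=0.125, ms_sample=0.35, mw_sample=13.0):
    """Power (mW) of the asynchronous-events approach, per the model above."""
    t_idle = tsample - nevents * durevent * ms_sample / 1000.0
    mj_idle = mw_idle * t_idle
    mj_sample = mw_sample * ms_sample / 1000.0
    return (nevents * durevent * mj_sample + mj_idle) / tsample

def mw_stream_join(mw_idle, tsample=0.125, ms_sample=0.35, mw_sample=13.0,
                   ms_check=0.02, mw_proc=12.0):
    """Power (mW) of the stream-join rewrite, per the model above."""
    t_idle = tsample - (ms_check + ms_sample) / 1000.0
    mj_idle = mw_idle * t_idle
    mj_check = mw_proc * ms_check / 1000.0
    mj_sample = mw_sample * ms_sample / 1000.0
    return (mj_check + mj_sample + mj_idle) / tsample

# Example: 2 events/s, each active for 3 s; 0.1 mW idle power is only a placeholder.
print(mw_async_events(nevents=2, durevent=3, mw_idle=0.1))
print(mw_stream_join(mw_idle=0.1))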


For very low event rates (fewer than one per second), the asynchronous events approach is sometimes preferable due to the extra overhead of empty-checks on the event queue in the stream-join case. However, for faster event rates, the power cost of this approach increases rapidly as independent samples are acquired for each event every few seconds. Increasing the duration of the inner query increases the cost of the asynchronous approach as more queries will be running simultaneously. The maximum absolute difference (of about 0.8 mW) is roughly comparable to one-quarter the power cost of the CPU or radio. Finally, we note that there is a subtle semantic change introduced by this rewriting. The initial formulation of the query caused samples in each of the internal queries to be produced relative to the time that the event fired: for example, if event e1 fired at time t, samples would appear at time t + d , t + 2d , . . . . If a later event e2 fired at time t + i, it would produce a different set of samples at time t + i + d , t + i + 2d , . . . . Thus, unless i were equal to d (i.e., the events were in phase), samples for the two queries would be offset from each other by up to d seconds. In the rewritten version of the query, there is only one stream of sensor tuples which is shared by all events. In many cases, users may not care that tuples are out of phase with events. In some situations, however, phase may be very important. In such situations, one way the system could improve the phase accuracy of samples while still rewriting multiple event queries into a single join is via oversampling, or acquiring some number of (additional) samples every d seconds. The increased phase accuracy of oversampling comes at an increased cost of acquiring additional samples (which may still be less than running multiple queries simultaneously). For now, we simply allow the user to specify that a query must be phase-aligned by specifying ON ALIGNED EVENT in the event clause. Thus, we have shown that there are several interesting optimization issues in ACQP systems; first, the system must properly order sampling, selection, and aggregation to be truly low power. Second, for frequent event-based queries, rewriting them as a join between an event stream and the sensors stream can significantly reduce the rate at which a sensor must acquire samples. 5. POWER-SENSITIVE DISSEMINATION AND ROUTING After the query has been optimized, it is disseminated into the network; dissemination begins with a broadcast of the query from the root of the network. As each node hears the query, it must decide if the query applies locally and/or needs to be broadcast to its children in the routing tree. We say a query q applies to a node n if there is a nonzero probability that n will produce results for q. Deciding where a particular query should run is an important ACQP-related decision. Although such decisions occur in other distributed query processing environments, the costs of incorrectly initiating queries in ACQP environments like TinyDB can be unusually high, as we will show. If a query does not apply at a particular node, and the node does not have any children for which the query applies, then the entire subtree rooted at that node can be excluded from the query, saving the costs of disseminating, ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


executing, and forwarding results for the query across several nodes, significantly extending the node’s lifetime. Given the potential benefits of limiting the scope of queries, the challenge is to determine when a node or its children need not participate in a particular query. One situation arises with constant-valued attributes (e.g., nodeid or location in a fixed-location network) with a selection predicate that indicates the node need not participate. We expect that such queries will be very common, especially in interactive workloads where users are exploring different parts of the network to see how it is behaving. Similarly, if a node knows that none of its children currently satisfy the value of some selection predicate, perhaps, because they have constant (and known) attribute values outside the predicate’s range, it need not forward the query down the routing tree. To maintain information about child attribute values (both constant and changing), we propose a data structure called a semantic routing tree (SRT). We describe the properties of SRTs in the next section, and briefly outline how they are created and maintained. 5.1 Semantic Routing Trees An SRT is a routing tree (similar to the tree discussed in Section 2.3 above) designed to allow each node to efficiently determine if any of the nodes below it will need to participate in a given query over some constant attribute A. Traditionally, in sensor networks, routing tree construction is done by having nodes pick a parent with the most reliable connection to the root (highest link quality). With SRTs, we argue that the choice of parent should include some consideration of semantic properties as well. In general, SRTs are most applicable when there are several parents of comparable link quality. A link-quality-based parent selection algorithm, such as the one described in Woo and Culler [2001], should be used in conjunction with the SRT to prefilter parents made available to the SRT. Conceptually, an SRT is an index over A that can be used to locate nodes that have data relevant to the query. Unlike traditional indices, however, the SRT is an overlay on the network. Each node stores a single unidimensional interval representing the range of A values beneath each of its children. When a query q with a predicate over A arrives at a node n, n checks to see if any child’s value of A overlaps the query range of A in q. If so, it prepares to receive results and forwards the query. If no child overlaps, the query is not forwarded. Also, if the query also applies locally (whether or not it also applies to any children) n begins executing the query itself. If the query does not apply at n or at any of its children, it is simply forgotten. Building an SRT is a two-phase process: first the SRT build request is flooded (retransmitted by every mote until all motes have heard the request) down the network. This request includes the name of the attribute A over which the tree should be built. As a request floods down the network, a node n may have several possible choices of parent, since, in general, many nodes in radio range may be closer to the root. If n has children, it forwards the request on to them and waits until they reply. If n has no children, it chooses a node p ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 8. A semantic routing tree in use for a query. Gray arrows indicate flow of the query down the tree; gray nodes must produce or forward results in the query.

from available parents to be its parent, and then reports the value of A to p in a parent selection message. If n does have children, it records the child's value of A along with its id. When it has heard from all of its children, it chooses a parent and sends a selection message indicating the range of values of A which it and its descendents cover. The parent records this interval with the id of the child node and proceeds to choose its own parent in the same manner, until the root has heard from all of its children. Because children can fail or move away, nodes also have a timeout, which is the maximum time they will wait to hear from a child; after this period has elapsed, the child is removed from the child list. If the child reports after this timeout, it is incorporated into the SRT as if it were a new node (see Section 5.2 below). Figure 8 shows an SRT over the X coordinate of each node on a Cartesian grid. The query arrives at the root, is forwarded down the tree, and then only the gray nodes are required to participate in the query (note that node 3 must forward results for node 4, despite the fact that its own location precludes it from participation). SRTs are analogous to indices in traditional database systems; to create one in TinyDB, the CREATE SRT command can be used—its syntax is similar to the CREATE INDEX command in SQL: CREATE SRT loc ON sensors (xloc,yloc) ROOT 0,

where the ROOT annotation indicates the nodeid at which the SRT should be rooted—by default, the value is 0, but users may wish to create SRTs rooted at other nodes to facilitate event-based queries that frequently radiate from a particular node.
5.2 Maintaining SRTs
Even though SRTs are limited to constant attributes, some SRT maintenance must occur. In particular, new nodes can appear, link qualities can change, and existing nodes can fail.
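Before moving on, the sketch below makes the forwarding decision of Section 5.1 concrete: a node compares a query's range predicate over A against the interval it stores for each child and forwards only where the ranges overlap. The class, method names, and example values are hypothetical illustrations, not TinyDB's actual data structures.

class SRTNode:
    """Minimal SRT node sketch: each child is summarized by a closed interval
    [lo, hi] of the constant attribute A covered by that child's subtree."""

    def __init__(self, local_value, child_intervals):
        self.local_value = local_value          # this node's own value of A
        self.child_intervals = child_intervals  # {child_id: (lo, hi)}

    def handle_query(self, q_lo, q_hi):
        """Return (run_locally, children_to_forward_to) for a predicate
        q_lo <= A <= q_hi. If both are empty/false, the query is simply
        forgotten and the whole subtree stays idle for it."""
        run_locally = q_lo <= self.local_value <= q_hi
        forward_to = [cid for cid, (lo, hi) in self.child_intervals.items()
                      if lo <= q_hi and q_lo <= hi]   # interval overlap test
        return run_locally, forward_to

# Hypothetical example in the spirit of Figure 8: a node whose own value lies
# outside the range still forwards the query to an overlapping child.
node = SRTNode(local_value=720, child_intervals={4: (410, 480), 7: (900, 950)})
print(node.handle_query(400, 500))   # (False, [4])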


Both node appearances and changes in link quality can require a node to switch parents. To do this, the node sends a parent selection message to its new parent, n. If this message changes the range of n's interval, it notifies its parent; in this way, updates can propagate to the root of the tree. To handle the disappearance of a child node, parents associate an active query id and last epoch with every child in the SRT (recall that an epoch is the period of time between successive samples). When a parent p forwards a query q to a child c, it sets c's active query id to the id of q and sets its last epoch entry to 0. Every time p forwards or aggregates a result for q from c, it updates c's last epoch with the epoch on which the result was received. If p does not hear from c for some number of epochs t, it assumes c has moved away, and removes its SRT entry. Then, p sends a request asking its remaining children to retransmit their ranges. It uses this information to construct a new interval. If this new interval differs in size from the previous interval, p sends a parent selection message up the routing tree to reflect this change. We study the costs of SRT maintenance in Section 5.4 below. Finally, we note that, by using these maintenance rules, it is possible to support SRTs over nonconstant attributes, although if those attributes change quickly, the cost of propagating interval-range changes could be prohibitive.
5.3 Evaluation of Benefit of SRTs
The benefit that an SRT provides is dependent on the quality of the clustering of children beneath parents. If the descendents of some node n are clustered around the value of the index attribute at n, then a query that applies to n will likely also apply to its descendents. This can be expected for location attributes, for example, since network topology is correlated with geography. We simulate the benefits of an SRT because large networks of the type where we expect these data structures to be useful are just beginning to come online, so only a small number of fixed real-world topologies are available. Our simulation experiments include one that uses a connectivity data file collected from one such real-world deployment. We evaluate the benefit of SRTs in terms of the number of active nodes; inactive nodes incur no cost for a given query, expending energy only to keep their processors in an idle state and to listen to their radios for the arrival of new queries. We study three policies for SRT parent selection. In the first, random approach, each node picks a random parent from the nodes with which it can communicate reliably. In the second, closest-parent approach, each parent reports the value of its index attribute with the SRT-build request, and children pick the parent whose attribute value is closest to their own. In the clustered approach, nodes select a parent as in the closest-parent approach, except that, if a node hears a sibling node send a parent selection message, it snoops on the message to determine its sibling's parent and value. It then picks its own parent (which could be the same as one of its siblings) to minimize the spread of attribute values underneath all of its available parents. We studied these policies in a simple simulation environment—nodes were arranged on an n × n grid and were asked to choose a constant attribute value


from some distribution (which we varied between experiments). We used a perfect (lossless) connectivity model where each node could talk to its immediate neighbors in the grid (so routing trees were n nodes deep), and each node had eight neighbors (with three choices of parent, on average). We compared the total number of nodes involved in range queries of different sizes for the three SRT parent selection policies to the best-case approach and the no SRT approach. The best-case approach would only result if exactly those nodes that overlapped the range predicate were activated, which is not possible in our topologies but provides a convenient lower bound. In the no SRT approach, all nodes participate in each query. We experimented with several sensor value distributions. In the random distribution, each constant attribute value was randomly and uniformly selected from the interval [0, 1000]. In the geographic distribution, (one-dimensional) sensor values were computed based on a function of a node's x and y position in the grid, such that a node's value tended to be highly correlated to the values of its neighbors. Finally, for the real distribution, we used a network topology based on data collected from a network of 54 motes deployed throughout the Intel-Research, Berkeley lab. The SRT was built over the node's location in the lab, and the network connectivity was derived by identifying pairs of motes with a high probability of being able to successfully communicate with each other.14 Figure 9 shows the number of nodes that participate in queries over variably sized query intervals (where the interval size is shown on the x axis) of the attribute space in a 20 × 20 grid. The interval for queries was randomly selected from the uniform distribution. Each point in the graph was obtained by averaging over five trials for each of the three parent selection policies in each of the sensor value distributions (for a total of 30 experiments). For each interval size s, 100 queries were randomly constructed, and the average number of nodes involved in each query was measured. For all three distributions, the clustered approach was superior to the other SRT algorithms, beating the random approach by about 25% and the closest parent approach by about 10% on average. With the geographic and real distributions, the performance of the clustered approach is close to optimal: for most ranges, all of the nodes in the range tend to be colocated, so few intermediate nodes are required to relay information for queries in which they themselves are not participating. The fact that the results from the real topology closely match the geographic distribution, where sensors' values and topology are perfectly correlated, is encouraging and suggests that SRTs will work well in practice. Figure 10 shows several visualizations of the topologies that are generated by the clustered (Figure 10(a)) and random (Figure 10(b)) SRT generation approaches for an 8×8 network. Each node represents a sensor, labeled with its ID and the distribution of the SRT subtree rooted underneath it. Edges represent the routing tree. The gray nodes represent the nodes that would participate in

14 The probability threshold in this case was 25%, which is the same as the probability the TinyOS/TinyDB routing layer uses to determine if a neighboring node is of sufficiently high quality to be considered as a candidate parent.


Fig. 9. Number of nodes participating in range queries of different sizes for different parent selection policies in a semantic routing tree (20 × 20 grid, 400 nodes, each point average of 500 queries of the appropriate size). The three graphs represent three different sensor-value distributions; see the text for a description of each of these distribution types.

the query 400 < A < 500. On this small grid, the two approaches perform similarly, but the variation in structure which results is quite evident—the random approach tends to be of more uniform depth, whereas the clustered approach leads to longer sequences of nodes with nearby values. Note that the labels in this figure are not intended to be readable—the important point is the overall pattern of nodes that are explored by the two approaches. 5.4 Maintenance Costs of SRTs As the previous results show, the benefit of using an SRT can be substantial. There are, however, maintenance and construction costs associated with SRTs, as discussed above. Construction costs are comparable to those in conventional sensor networks (which also have a routing tree), but slightly higher due to ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 10. Visualizations of the (a) clustered and (b) random topologies, with a query region overlaid on top of them. Node 0, the root in Figures 10(a) and 10(b), is at the center of the graph.

the fact that parent selection messages are explicitly sent, whereas parents do not always require confirmation from their children in other sensor network environments. We conducted an experiment to measure the cost of selecting a new parent, which requires a node to notify its old parent of its decision to move and send its attribute value to its new parent. Both the new and old parent must then update their attribute interval information and propagate any changes up the tree to the root of the network. In this experiment, we varied the probability with which any node switches parents on any given epoch from 0.001 to 0.2. We did not constrain the extent of the query in this case—all nodes were assumed to participate. Nodes were allowed to move from their current parent to an arbitrary new parent, and multiple nodes could move on a given epoch. The experimental parameters were the same as above. We measured the average number of maintenance messages generated by movement across the whole network. The results are shown in Figure 11. Each point represents the average of five trials, and each trial consists of 100 epochs. The three lines represent the three policies; the amount of movement varies along the x axis, and the number of maintenance messages per epoch is shown on the y axis. Without maintenance, each active node (within the query range) sends one message per epoch, instead of every node being required to transmit. Figure 11 suggests that for low movement rates, the maintenance costs of the SRT approach are small enough that it remains attractive—if 1% of the nodes move on a given epoch, the cost is about 30 messages, which is substantially less than ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 11. Maintenance costs (in measured network messages) for different SRT parent selection policies with varying probabilities of node movement. Probabilities and costs are per epoch. Each point is the average of five runs, where each run is 100 epochs long.

the number of messages saved by using an SRT for most query ranges. If 10% of the nodes move, the maintenance cost grows to about 300, making the benefit of SRTs less clear. To measure the amount of movement expected in practice, we measured movement rates in traces collected from two real-world monitoring deployments; in both cases, the nodes were stationary but employed a routing algorithm that attempted to select the best parent over time. In the 3-month, 200-node Great Duck Island Deployment nodes switched parents between successive result reports with a 0.9% (σ = 0.9%) chance, on average. In the 54 node Intel-Berkeley lab dataset, nodes switched with a 4.3% (σ = 3.0%) chance. Thus, the amount of parent switching varies markedly from deployment to deployment. One reason for the variation is that the two deployments use different routing algorithms. In the case of the Intel-Berkeley deployment, the algorithm was apparently not optimized to minimize the likelihood of switching. Figure 11 also shows that the different schemes for building SRTs result in different maintenance costs. This is because the average depth of nodes in the topologies varies from one approach to the other (7.67 in Random, 10.47 in Closest, and 9.2 in Clustered) and because the spread of values underneath a particular subtree varies depending on the approach used to build the tree. A deeper tree generally results in more messages being sent up the tree as path lengths increase. The closest parent scheme results in deep topologies because no preference is given towards parents with a wide spread of values, unlike the clustered approach which tends to favor selecting a parent that is a member of a pre-existing, wide interval. The random approach is shallower still because nodes simply select the first parent that broadcasts, resulting in minimally deep trees. Finally, we note that the cost of joining the network is strictly dominated by the cost of moving parents, as there is no old parent to notify. Similarly, a node ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


disappearing is dominated by this movement cost, as there is no new parent to notify.
5.5 SRT Observations
SRTs provide an efficient mechanism for disseminating queries and collecting query results for queries over constant attributes. For attributes that are highly correlated amongst neighbors in the routing tree (e.g., location), SRTs can reduce the number of nodes that must disseminate queries and forward the continuous stream of results from children by nearly an order of magnitude. SRTs have the substantial advantage over a centralized index structure in that they do not require complete topology and sensor value information to be collected at the root of the network, which would be quite expensive to collect and difficult to keep consistent as connectivity and sensor values change. SRT maintenance costs appear to be reasonable for at least some real-world deployments. Interestingly, unlike traditional routing trees in sensor networks, there is a substantial cost (in terms of network messages) for switching parents in an SRT. This suggests that one metric by which routing layer designers might evaluate their implementations is the rate of parent switching. For real-world deployments, we expect that SRTs will offer substantial benefits. Although there are no benchmarks or definitive workloads for sensor network databases, we anticipate that many queries will be over narrow geographic areas—looking, for example, at single rooms or floors in a building, or nests, trees, or regions in outdoor environments as on Great Duck Island; other researchers have noted the same need for constrained querying [Yao and Gehrke 2002; Mainwaring et al. 2002]. In a deployment like the Intel-Berkeley lab, if queries are over individual rooms or regions of the lab, Figure 9 shows that substantial performance gains can be had. For example, 2 of the 54 motes are in the main conference room; 7 of the 54 are in the seminar area; both of these queries can be evaluated using less than 30% of the network. We note two promising future extensions to SRTs. First, rather than storing just a single interval at every subtree, a variable number of intervals could be kept. This would allow nodes to more accurately summarize the range of values beneath them, and increase the benefit of the approach. Second, when selecting a parent, even in the clustered approach, nodes do not currently have access to complete information about the subtree underneath a potential parent, particularly as nodes move in the network or come and go. It would be interesting to explore a continuous SRT construction process, where parents periodically broadcast updated intervals, giving current and potential children an option to move to a better subtree and improve the quality of the SRT.
6. PROCESSING QUERIES
Once queries have been disseminated and optimized, the query processor begins executing them. Query execution is straightforward, so we describe it only briefly. The remainder of the section is devoted to the ACQP-related issues of prioritizing results and adapting sampling and delivery rates. We present simple schemes for prioritizing data in selection queries, briefly discuss prioritizing


data in aggregation queries, and then turn to adaptation. We discuss two situations in which adaptation is necessary: when the radio is highly contended and when power consumption is more rapid than expected.
6.1 Query Execution
Query execution consists of a simple sequence of operations at each node during every epoch: first, nodes sleep for most of an epoch; then they wake, sample sensors, apply operators to data generated locally and received from neighbors, and then deliver results to their parent. We (briefly) describe ACQP-relevant issues in each of these phases. Nodes sleep for as much of each epoch as possible to minimize power consumption. They wake up only to sample sensors and relay and deliver results. Because nodes are time synchronized, parents can ensure that they awake to receive results when a child tries to propagate a message.15 The amount of time, tawake, that a sensor node must be awake to successfully accomplish the latter three steps above is largely dependent on the number of other nodes transmitting in the same radio cell, since only a small number of messages per second can be transmitted over the single shared radio channel. We discuss the communication scheduling approach in more detail in the next section. TinyDB uses a simple algorithm to scale tawake based on the neighborhood size, which is measured by snooping on traffic from neighboring nodes. Note, however, that there are situations in which a node will be forced to drop or combine results as a result of either tawake or the sample interval being too short to perform all needed computation and communication. We discuss policies for choosing how to aggregate data and which results to drop in Section 6.3. Once a node is awake, it begins sampling and filtering results according to the plan provided by the optimizer. Samples are taken at the appropriate (current) sample rate for the query, based on lifetime computations and information about radio contention and power consumption (see Section 6.4 for more information on how TinyDB adapts sampling in response to variations during execution). Filters are applied and results are routed to join and aggregation operators further up the query plan. Finally, we note that in event-based queries, the ON EVENT clause must be handled specially. When an event fires on a node, that node disseminates the query, specifying itself as the query root. This node collects query results, and delivers them to the basestation or a local materialization point.
6.1.1 Communication Scheduling and Aggregate Queries. When processing aggregate queries, some care must be taken to coordinate the times when parents and children are awake, so that parent nodes have access to their children's readings before aggregating. The basic idea is to subdivide the epoch into a number of intervals, and assign nodes to intervals based on their position in the routing tree. Because this mechanism makes relatively efficient use of the

15 Of course, there is some imprecision in time synchronization between devices. In general, we can tolerate a fair amount of imprecision by introducing a buffer period, such that parents wake up several milliseconds before and stay awake several milliseconds longer than their children.


Fig. 12. Partial state records flowing up the tree during an epoch using interval-based communication.

radio channel and has good power consumption characteristics, TinyDB uses this scheduling approach for all queries (not just aggregates). In this slotted approach, each epoch is divided into a number of fixed-length time intervals. These intervals are numbered in reverse order such that interval 1 is the last interval in the epoch. Then, each node is assigned to the interval equal to its level, or number of hops from the root, in the routing tree. In the interval preceding their own, nodes listen to their radios, collecting results from any child nodes (which are one level below them in the tree, and thus communicating in this interval). During a node’s interval, if it is aggregating, it computes the partial state record consisting of the combination of any child values it heard with its own local readings. After this computation, it transmits either its partial state record or raw sensor readings up the network. In this way, information travels up the tree in a staggered fashion, eventually reaching the root of the network during interval 1. Figure 12 illustrates this in-network aggregation scheme for a simple COUNT query that reports the number of nodes in the network. In the figure, time advances from left to right, and different nodes in the communication topology are shown along the y axis. Nodes transmit during the interval corresponding to their depth in the tree, so H, I, and J transmit first, during interval 4, because they are at level 4. Transmissions are indicated by arrows from sender to receiver, and the numbers in circles on the arrows represent COUNTs contained within each partial state record. Readings from these three nodes are combined, via the COUNT merging function, at nodes G and F, both of which transmit new partial state records during interval 3. Readings flow up the tree in this manner until they reach node A, which then computes the final count of 10. Notice that motes are idle for a significant portion of each epoch so they can enter a low power sleeping state. A detailed analysis of the accuracy and benefit of this approach in TinyDB can be found in Madden [2003]. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
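A minimal sketch of this slotted schedule, with COUNT as the merging function, appears below. The number of intervals per epoch, the epoch length, and the function names are illustrative assumptions, not the values or APIs used by TinyDB.

def slot_schedule(level, num_intervals, epoch_len_s):
    """Return (listen_window, transmit_window), in seconds from the start of
    the epoch, for a node `level` hops from the root. Intervals are numbered
    in reverse (interval 1 is last), so interval i occupies
    [(num_intervals - i) * slot, (num_intervals - i + 1) * slot). A node
    transmits in the interval equal to its level and listens in the preceding
    interval (level + 1), when its children transmit."""
    slot = epoch_len_s / num_intervals

    def window(interval):
        start = (num_intervals - interval) * slot
        return (start, start + slot)

    return window(level + 1), window(level)

def merge_count(child_counts):
    """COUNT merging function: children's partial counts plus this node."""
    return 1 + sum(child_counts)

# Example loosely following Figure 12: a level-3 node listens while its
# level-4 children transmit, then sends its merged partial state record.
listen, transmit = slot_schedule(level=3, num_intervals=5, epoch_len_s=30.0)
print(listen, transmit)        # (6.0, 12.0) (12.0, 18.0)
print(merge_count([1, 1]))     # a node with two leaf children reports 3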


6.2 Multiple Queries We note that, although TinyDB supports multiple queries running simultaneously, we have not focused on multiquery optimization. This means that, for example, SRTs are shared between queries, but sample acquisition is not: if two queries need a reading within a few milliseconds of each other, this will cause both to acquire that reading. Similarly, there is no effort to optimize communication scheduling between queries: transmissions of one query are scheduled independently from any other query. We hope to explore these issues as a part of our long-term sensor network research agenda. 6.3 Prioritizing Data Delivery Once results have been sampled and all local operators have been applied, they are enqueued onto a radio queue for delivery to the node’s parent. This queue contains both tuples from the local node as well as tuples that are being forwarded on behalf of other nodes in the network. When network contention and data rates are low, this queue can be drained faster than results arrive. However, because the number of messages produced during a single epoch can vary dramatically, depending on the number of queries running, the cardinality of joins, and the number of groups and aggregates, there are situations when the queue will overflow. In these situations, the system must decide if it should discard the overflow tuple, discard some other tuple already in the queue, or combine two tuples via some aggregation policy. The ability to make runtime decisions about the value of an individual data item is central to ACQP systems, because the cost of acquiring and delivering data is high, and because of these situations where the rate of data items arriving at a node will exceed the maximum delivery rate. A simple conceptual approach for making such runtime decisions is as follows: whenever the system is ready to deliver a tuple, send the result that will most improve the “quality” of the answer that the user sees. Clearly, the proper metric for quality will depend on the application: for a raw signal, root-mean-square (RMS) error is a typical metric. For aggregation queries, minimizing the confidence intervals of the values of group records could be the goal [Raman et al. 2002]. In other applications, users may be concerned with preserving frequencies, receiving statistical summaries (average, variance, or histograms), or maintaining more tenuous qualities such as signal “shape.” Our goal is not to fully explore the spectrum of techniques available in this space. Instead, we have implemented several policies in TinyDB to illustrate that substantial quality improvements are possible given a particular workload and quality metric. Generalizing concepts of quality and implementing and exploring more sophisticated prioritization schemes remains an area of future work. There is a large body of related work on approximation and compression schemes for streams in the database literature (e.g., Garofalakis and Gibbons [2001]; Chakrabarti et al. [2001]), although these approaches typically focus on the problem of building histograms or summary structures over the streams rather than trying to preserve the (in order) signal as best as possible, which ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


is the goal we tackle first. Algorithms from signal processing, such as Fourier analysis and wavelets, are likely applicable, although the extreme memory and processor limitations of our devices and the online nature of our problem (e.g., choosing which tuple in an overflowing queue to evict) make them tricky to apply. We have begun to explore the use of wavelets in this context; see Hellerstein et al. [2003] for more information on our initial efforts.
6.3.1 Policies for Selection Queries. We begin with a comparison of three simple prioritization schemes, naive, winavg, and delta, for simple selection queries, turning our attention to aggregate queries in the next section. In the naive scheme, no tuple is considered more valuable than any other, so the queue is drained in a FIFO manner and tuples are dropped if they do not fit in the queue. The winavg scheme works similarly, except that instead of dropping results when the queue fills, the two results at the head of the queue are averaged to make room for new results. Since the head of the queue is now an average of multiple records, we associate a count with it. In the delta scheme, a tuple is assigned an initial score relative to its difference from the most recent (in time) value successfully transmitted from this node, and at each point in time, the tuple with the highest score is delivered. The tuple with the lowest score is evicted when the queue overflows. Out of order delivery (in time) is allowed. This scheme relies on the intuition that the largest changes are probably interesting. It works as follows: when a tuple t with timestamp T is initially enqueued and scored, we mark it with the timestamp R of the most recently delivered tuple r. Since tuples can be delivered out of order, it is possible that a tuple with a timestamp between R and T could be delivered next (indicating that r was delivered out of order), in which case the score we computed for t as well as its R timestamp are now incorrect. Thus, in general, we must rescore some enqueued tuples after every delivery. The delta scheme is similar to the value-deviation metric used in Garofalakis and Gibbons [2001] for minimizing deviation between a source and a cache, although value-deviation does not include the possibility of out of order delivery. We compared these three approaches on a single mote running TinyDB. To measure their effect in a controlled setting, we set the sample rate to be a fixed factor K faster than the maximum delivery rate (such that 1 of every K tuples was delivered, on average) and compared their performance against several predefined sets of sensor readings (stored in the EEPROM of the device). In this case, delta had a buffer of 5 tuples; we performed reordering of out of order tuples at the basestation. To illustrate the effect of winavg and delta, Figure 13 shows how delta and winavg approximate a high-periodicity trace of sensor readings generated by a shaking accelerometer. Notice that delta is considerably closer in shape to the original signal in this case, as it tends to emphasize extremes, whereas average tends to dampen them. We also measured RMS error for this signal as well as two others: a square wave-like signal from a light sensor being covered and uncovered, and a slow sinusoidal signal generated by moving a magnet around a magnetometer. The error for each of these signals and techniques is shown in Table V.
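The sketch below illustrates the winavg and delta policies on a bounded queue, under simplifying assumptions (one numeric reading per tuple, scores computed against the last delivered value, and no rescoring after out-of-order delivery); the class and method names are illustrative rather than TinyDB's actual implementation.

class WinAvgQueue:
    """winavg: when the queue is full, average the two head tuples (keeping a
    count of how many raw readings each entry represents)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []                      # list of (value, count) pairs

    def enqueue(self, value):
        if len(self.queue) == self.capacity:
            (v1, c1), (v2, c2) = self.queue[0], self.queue[1]
            self.queue[0:2] = [((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)]
        self.queue.append((value, 1))

    def deliver(self):
        return self.queue.pop(0) if self.queue else None

class DeltaQueue:
    """delta: score each tuple by its distance from the last delivered value;
    deliver the highest-scoring tuple, evict the lowest-scoring on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []                      # list of (score, value) pairs
        self.last_delivered = 0.0            # assumed initial baseline

    def enqueue(self, value):
        self.queue.append((abs(value - self.last_delivered), value))
        if len(self.queue) > self.capacity:
            self.queue.remove(min(self.queue))        # evict lowest score

    def deliver(self):
        if not self.queue:
            return None
        best = max(self.queue)
        self.queue.remove(best)
        self.last_delivered = best[1]
        return best[1]

# Example: with a full queue, delta delivers the reading that changed most.
dq = DeltaQueue(capacity=5)
for v in [10, 11, 30, 12, 13, 29, 10]:
    dq.enqueue(v)
print(dq.deliver())   # 30, the largest deviation from the initial baseline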


Fig. 13. An acceleration signal (top) approximated by a delta (middle) and an average (bottom), K = 4.

Table V. RMS Error for Different Prioritization Schemes and Signals (1000 Samples, Sample Interval = 64 ms)

          Accel.   Light (Step)   Magnetometer (Sinusoid)
Winavg      64         129                  54
Delta       63          81                  48
Naive       77         143                  63

Although delta appears to match the shape of the acceleration signal better, its RMS value is about the same as average’s (due to the few peaks that delta incorrectly merges together). Delta outperforms either other approach for the fast changing step-functions in the light signal because it does not smooth edges as much as average. We now turn our attention to result prioritization for aggregate queries. 6.3.2 Policies for Aggregate Queries. The previous section focused on prioritizing result collection in simple selection queries. In this section, we look instead at aggregate queries, illustrating a class of snooping based techniques first described in the TAG system [Madden et al. 2002a] that we have implemented for TinyDB. We consider aggregate queries of the form SELECT f agg (a1 ) FROM sensors GROUP BY a2 SAMPLE PERIOD x

Recall that this query computes the value of f agg applied to the value of a1 produced by each device every x seconds. Interestingly, for queries with few or no groups, there is a simple technique that can be used to prioritize results for several types of aggregates. This technique, called snooping, allows nodes to locally suppress local aggregate values by listening to the answers that neighboring nodes report and exploiting the semantics of aggregate functions, and is also used in [Madden et al. 2002a]. Note that this snooping can be done for free due to the broadcast nature of the radio channel. Consider, for example, a MAX query over some attribute a—if a node n ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 14. Snooping reduces the data nodes must send in aggregate queries. Here node 2’s value can be suppressed if it is less than the maximum value snooped from nodes 3, 4, and 5.

hears a value of a greater than its own locally computed partial MAX, it knows that its local record is low priority, and assigns it a low score or suppresses it altogether. Conversely, if n hears many neighboring partial MAXs over a that are less than its own partial aggregate value, it knows that its local record is more likely to be a maximum, and assigns it a higher score. Figure 14 shows a simple example of snooping for a MAX query—node 2 can score its own MAX value very low when it hears a MAX from node 3 that is larger than its own. This basic technique applies to all monotonic, exemplary aggregates: MIN, MAX, TOP-N, etc., since it is possible to deterministically decide whether a particular local result could appear in the final answer output at the top of the network. For dense network topologies where there is ample opportunity for snooping, this technique produces a dramatic reduction in communication, since at every intermediate point in the routing tree, only a small number of nodes' values will actually need to be transmitted. It is also possible to glean some information from snooping in other aggregates as well—for example, in an AVERAGE query, nodes may rank their own results lower if they hear many siblings with similar sensor readings. For this approach to work, parents must cache a count of recently heard children and assume children who do not send a value for an average have the same value as the average of their siblings' values, since otherwise outliers will be weighted disproportionately. This technique of assuming that missing values are the same as the average of other reported values can be used for many summary statistics: variance, sum, and so on. Exploring more sophisticated prioritization schemes for aggregate queries is an important area of future work. In the previous sections, we demonstrated how prioritization of results can be used to improve the overall quality of the data that are transmitted to the root when some results must be dropped or aggregated. Choosing the proper policies to apply in general, and understanding how various existing approximation and prioritization schemes map into ACQP, is an important future direction.
6.4 Adapting Rates and Power Consumption
We saw in the previous sections how TinyDB can exploit query semantics to transmit the most relevant results when limited bandwidth or power is


Fig. 15. Per-mote sample rate versus aggregate delivery rate.

available. In this section, we discuss selecting and adjusting sampling and transmission rates to limit the frequency of network-related losses and fill rates of queues. This adaptation is the other half of the runtime techniques in ACQP: because the system can adjust rates, significant reductions can be made in the frequency with which data prioritization decisions must be made. These techniques are simply not available in non-acquisitional query processing systems. When initially optimizing a query, TinyDB’s optimizer chooses a transmission and sample rate based on current network load conditions, and requested sample rates and lifetimes. However, static decisions made at the start of query processing may not be valid after many days running the same continuous query. Just as adaptive query processing techniques like eddies [Avnur and Hellerstein 2000], Tukwila [Ives et al. 1999], and Query Scrambling [Urhan et al. 1998] dynamically reorder operators as the execution environment changes, TinyDB must react to changing conditions—however, unlike in previous adaptive query processing systems, failure to adapt in TinyDB can cripple the system, reducing data flow to a trickle or causing the system to severely miss power budget goals. We study the need for adaptivity in two contexts: network contention and power consumption. We first examine network contention. Rather than simply assuming that a specific transmission rate will result in a relatively uncontested network channel, TinyDB monitors channel contention and adaptively reduces the number of packets transmitted as contention rises. This backoff is very important: as the four motes line of Figure 15 shows, if several nodes try to transmit at high rates, the total number of packets delivered is substantially less than if each of those nodes tries to transmit at a lower rate. Compare this line with the performance of a single node (where there is no contention)—a single node does not exhibit the same falling off because there is no contention (although the percentage of successfully delivered packets does fall off). Finally, the four motes adaptive line does not have the same precipitous performance because it is able to monitor the network channel and adapt to contention. Note that the performance of the adaptive approach is slightly less than the nonadaptive approach at four and eight samples per second as backoff begins ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 16. Comparison of delivered values (bottom) versus actual readings from two motes (left and right) sampling at 16 packets per second and sending simultaneously. Four motes were communicating simultaneously when this data was collected.

to throttle communication in this regime. However, when we compared the percentage of successful transmission attempts at eight packets per second, the adaptive scheme achieved twice the success rate of the nonadaptive scheme, suggesting the adaptation is still effective in reducing wasted communication effort, despite the lower utilization. The problem with reducing the transmission rate is that it will rapidly cause the network queue to fill, forcing TinyDB to discard tuples using the semantic techniques for victim selection presented in Section 6.3 above. We note, however, that had TinyDB not chosen to slow its transmission rate, fewer total packets would have been delivered. Furthermore, by choosing which packets to drop using semantic information derived from the queries (rather than losing some random sample of them), TinyDB is able to substantially improve the quality of results delivered to the end user. To illustrate this in practice, we ran a selection query over four motes running TinyDB, asking them each to sample data at 16 samples per second, and compared the quality of the delivered results using an adaptive-backoff version of our delta approach to results over the same dataset without adaptation or result prioritization. We show here traces from two of the nodes on the left and right of Figure 16. The top plots show the performance of the adaptive delta, the middle plots show the nonadaptive case, and the bottom plots show the original signals (which were stored in EEPROM to allow repeatable trials). Notice that the delta scheme does substantially better in both cases.
6.4.1 Measuring Power Consumption. We now turn to the problem of adapting tuple delivery rates to meet specific lifetime requirements in response to incorrect sample rates computed at query optimization time (see Section 3.6).
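A minimal sketch of the contention-driven rate adaptation described above, assuming the radio layer can report the fraction of recent transmission attempts that failed; the constants and function name are illustrative, not TinyDB's actual parameters.

def adapt_transmission_rate(current_rate, failure_fraction,
                            min_rate=0.5, max_rate=16.0,
                            backoff=0.5, recovery=1.25,
                            high_contention=0.3, low_contention=0.1):
    """Multiplicatively back off the per-node transmission rate (packets/s)
    when many recent sends failed, and cautiously recover when the channel is
    quiet. Tuples that cannot be sent at the reduced rate accumulate in the
    queue and are dropped or combined by the policies of Section 6.3."""
    if failure_fraction > high_contention:
        current_rate *= backoff
    elif failure_fraction < low_contention:
        current_rate *= recovery
    return max(min_rate, min(max_rate, current_rate))

# Example: a node sending 8 packets/s that sees 40% of its attempts fail
# drops to 4 packets/s, then slowly recovers once the channel quiets down.
rate = 8.0
rate = adapt_transmission_rate(rate, failure_fraction=0.4)    # -> 4.0
rate = adapt_transmission_rate(rate, failure_fraction=0.05)   # -> 5.0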


We first note that, using the computations shown in Section 3.6, it is possible to compute a predicted battery voltage for a time t seconds into processing a query. The system can then compare its current voltage to this predicted voltage. By assuming that voltage decays linearly, we can reestimate the power consumption characteristics of the device (e.g., the costs of sampling, transmitting, and receiving) and then rerun our lifetime calculation. By reestimating these parameters, the system can ensure that this new lifetime calculation tracks the actual lifetime more closely. Although this calculation and reoptimization are straightforward, they serve an important role by allowing TinyDB motes to satisfy occasional ad hoc queries and relay results for other nodes without compromising lifetime goals of long-running monitoring queries. Finally, we note that incorrect measurements of power consumption may also be due to incorrect estimates of the cost of various phases of query processing, or may result from incorrect selectivity estimation. We cover both by tuning the sample rate. As future work, we intend to explore adaptation of optimizer estimates and ordering decisions (in the spirit of other adaptive work [Hellerstein et al. 2000]) and the effect of frequency of reestimation on lifetime.
7. SUMMARY OF ACQP TECHNIQUES
This completes our discussion of the novel issues and techniques that arise when taking an acquisitional perspective on query processing. In summary, we first discussed important aspects of an acquisitional query language, introducing event and lifetime clauses for controlling when and how often sampling occurs. We then discussed query optimization with the associated issues of modeling sampling costs and ordering of sampling operators. We showed how event-based queries can be rewritten as joins between streams of events and sensor samples. Once queries have been optimized, we demonstrated the use of semantic routing trees as a mechanism for efficiently disseminating queries and collecting results. Finally, we showed the importance of prioritizing data according to quality and discussed the need for techniques to adapt the transmission and sampling rates of an ACQP system. Table VI lists the key new techniques we introduced, summarizing what queries they apply to and when they are most useful.
8. RELATED WORK
There have been several recent publications in the database and systems communities on query processing in sensor networks [Intanagonwiwat et al. 2000; Madden et al. 2002a; Bonnet et al. 2001; Madden and Franklin 2002; Yao and Gehrke 2002]. These articles noted the importance of power sensitivity. Their predominant focus to date has been on in-network processing—that is, the pushing of operations, particularly selections and aggregations, into the network to reduce communication. We too endorse in-network processing, but believe that, for a sensor network system to be truly power sensitive, acquisitional issues of when, where, and in what order to sample and which samples to process must be considered. To our knowledge, no prior work addresses these issues.


Table VI. Summary of Acquisitional Query Processing Techniques in TinyDB

Technique (Section): Summary
Event-based queries (3.5): Avoid polling overhead
Lifetime queries (3.6): Satisfy user-specified longevity constraints
Interleaving acquisition/predicates (4.2): Avoid unnecessary sampling costs in selection queries
Exemplary aggregate pushdown (4.2.1): Avoid unnecessary sampling costs in aggregate queries
Event batching (4.3): Avoid execution costs when a number of event queries fire
SRT (5.1): Avoid query dissemination costs or the inclusion of unneeded nodes in queries with predicates over constant attributes
Communication scheduling (6.1.1): Disable node's processors and radios during times of inactivity
Data prioritization (6.3): Choose most important samples to deliver according to a user-specified prioritization function
Snooping (6.3.2): Avoid unnecessary transmissions during aggregate queries
Rate adaptation (6.4): Intentionally drop tuples to avoid saturating the radio channel, allowing most important tuples to be delivered

There is a small body of work related to query processing in mobile environments [Imielinski and Badrinath 1992; Alonso and Korth 1993]. This work has been concerned with laptop-like devices that are carried with the user, can be readily recharged every few hours, and, with the exception of a wireless network interface, basically have the capabilities of a wired, powered PC. Lifetime-based queries, notions of sampling and its associated costs, and runtime issues regarding rates and contention were not considered. Many of the proposed techniques, as well as more recent work on moving object databases (such as Wolfson et al. [1999]), focus on the highly mobile nature of devices, a situation we are not (yet) dealing with, but which could certainly arise in sensor networks. Power-sensitive query optimization was proposed in Alonso and Ganguly [1993], although, as with the previous work, the focus was on optimizing costs in traditional mobile devices (e.g., laptops and palmtops), so concerns about the cost and ordering of sampling did not appear. Furthermore, laptop-style devices typically do not offer the same degree of rapid power-cycling that is available on embedded platforms like motes. Even if they did, their interactive, user-oriented nature makes it undesirable to turn off displays, network interfaces, etc., because they are doing more than simply collecting and processing data, so there are many fewer power optimizations that can be applied. Building an SRT is analogous to building an index in a conventional database system. Due to the resource limitations of sensor networks, the actual indexing implementations are quite different. See Kossman [2000] for a survey of relevant research on distributed indexing in conventional database systems. There is also some similarity to indexing in peer-to-peer systems [Crespo and Garcia-Molina 2002]. However, peer-to-peer systems differ in that they are inexact and not subject to the same paucity of communications or storage


infrastructure as sensor networks, so algorithms tend to be storage and communication heavy. Similar indexing issues also appear in highly mobile environments (like Wolfson et al. [1999] or Imielinski and Badrinath [1992]), but this work relies on centralized location servers for tracking recent positions of objects. The observation that it can be beneficial to interleave the fetching of attributes and the application of operators also arises in the context of compressed databases [Chen et al. 2001], as decompression effectively imposes a penalty for fetching an individual attribute, so it is beneficial to apply selections and joins on already decompressed or easy-to-decompress attributes. The ON EVENT and OUTPUT ACTION clauses in our query language are similar to constructs present in event-condition-action/active databases [Chakravarthy et al. 1994]. There is a long tradition of such work in the database community, and our techniques are much simpler in comparison, as we have not focused on any of the difficult issues associated with the semantics of event composition or with building a complete language for expressing and efficiently evaluating the triggering of composite events. Work on systems for efficiently determining when an event has fired, such as Hanson [1996], could be useful in TinyDB. More recent work on continuous query systems [Liu et al. 1999; Chen et al. 2000] has described languages that provide for query processing in response to events or at regular intervals over time. This earlier work, as well as our own work on continuous query processing [Madden et al. 2002b], inspired the periodic and event-driven features of TinyDB. Approximate and best effort caches [Olston and Widom 2002], as well as systems for online aggregation [Raman et al. 2002] and stream query processing [Motwani et al. 2003; Carney et al. 2002], include some notion of data quality. Most of this other work has been focused on quality with respect to summaries, aggregates, or staleness of individual objects, whereas we focus on quality as a measure of fidelity to the underlying continuous signal. Aurora [Carney et al. 2002] mentioned a need for this kind of metric, but proposed no specific approaches. Work on approximate query processing [Garofalakis and Gibbons 2001] has included a scheme similar to our delta approach, as well as a substantially more thorough evaluation of its merits, but did not consider out of order delivery.
9. CONCLUSIONS AND FUTURE WORK
Acquisitional query processing provides a framework for addressing issues of when, where, and how often data is sampled and which data is delivered in distributed, embedded sensing environments. Although other research has identified the opportunities for query processing in sensor networks, this work is the first to discuss these fundamental issues in an acquisitional framework. We identified several opportunities for future research. We are currently actively pursuing two of these: first, we are exploring how query optimizer statistics change in acquisitional environments and studying the role of online reoptimization in sample rate and operator orderings in response to bursts of data or unexpected power consumption. Second, we are pursuing more sophisticated


prioritization schemes, like wavelet analysis, that can capture salient properties of signals other than large changes (as our delta mechanism does) as well as mechanisms to allow users to express their prioritization preferences. We believe that ACQP notions are of critical importance for preserving the longevity and usefulness of any deployment of battery powered sensing devices, such as those that are now appearing in biological preserves, roads, businesses, and homes. Without appropriate query languages, optimization models, and query dissemination and data delivery schemes that are cognizant of semantics and the costs and capabilities of the underlying hardware, the success of such deployments will be limited.
APPENDIX
A. POWER CONSUMPTION STUDY
This appendix details an analytical study of power consumption on a mote running a typical data collection query. In this study, we assume that each mote runs a very simple query that transmits one sample of (light, humidity) readings every minute. We assume each mote also listens to its radio for 2 s per 1-min period to receive results from neighboring devices and obtain access to the radio channel. We assume the following hardware characteristics: a supply voltage of 3 V, an Atmega128 processor (see the footnote to Table III for data on the processor) that can be set into power-down mode and runs off the internal oscillator at 4 MHz, the use of the Taos Photosynthetically Active Light Sensor [TAOS, Inc. 2002] and Sensirion Humidity Sensor [Sensirion 2002], and a ChipCon CC1000 Radio (see text footnote 6 for data on this radio) transmitting at 433 MHz with 0-dBm output power and −110-dBm receive sensitivity. We further assume the radio can make use of its low-power sampling16 mode to reduce reception power when no other radios are communicating, and that, on average, each node has 10 neighbors, or other motes, within radio range, with one of those neighbors being a child in the routing tree. Radio packets are 50 bytes each, with a 20-byte preamble for synchronization. This hardware configuration represents real-world settings of motes similar to values used in deployments of TinyDB in various environmental monitoring applications. The percentage of total energy used by various components is shown in Table VII. These results show that the processor and radio together consume the majority of energy for this particular data collection task. Obviously, these numbers change as the number of messages transmitted per period increases; doubling the number of messages sent increases the total power utilization by about 19% as a result of the radio spending less time sampling the channel and more time actively receiving. Similarly, if a node must send five packets per sample period instead of one, its total power utilization rises by about 10%.

16 This mode works by sampling the radio at a low frequency—say, once every k bit-times, where k is on the order of 100—and extending the synchronization header, or preamble, on radio packets to be at least k + ε bits, such that a radio using this low-power listening approach will still detect every packet. Once a packet is detected, the receiver begins packet reception at the normal rate. The cost of this technique is that it increases transmission costs significantly.


Table VII. Expected Power Consumption for Major Hardware Components, for a Query Reporting Light and Humidity Readings Once Every Minute

  Hardware                                                               Current (mA)   Active Time (s)   % Total Energy
  Sensing, humidity                                                          0.50            0.34               1.43
  Sensing, light                                                             0.35            1.30               3.67
  Communication, sending (70 bytes @ 38.4 kbps × 2 packets)                 10.40            0.03               2.43
  Communication, receive packets (70 bytes @ 38.4 kbps × 10 packets)         9.30            0.15              11.00
  Communication, sampling channel                                            0.07            0.86               0.31
  Processor, active                                                          5.00            2.00              80.68
  Processor, idle                                                            0.001          58.00               0.47

  Average current draw per second: 0.21 mA
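The arithmetic behind the last column and the average current draw is easy to reproduce. The following minimal Python sketch is not part of TinyDB; it simply copies the per-component currents and active times from Table VII and recomputes each component's share of energy over one 60-s period (small differences from the table are rounding).

```python
# Per-component (current in mA, active time in s) over one 60 s period, from Table VII.
components = {
    "sensing, humidity":       (0.50, 0.34),
    "sensing, light":          (0.35, 1.30),
    "radio, transmit":         (10.40, 0.03),
    "radio, receive":          (9.30, 0.15),
    "radio, channel sampling": (0.07, 0.86),
    "processor, active":       (5.00, 2.00),
    "processor, idle":         (0.001, 58.00),
}

# Charge (mA * s) drawn by each component; at a fixed supply voltage this is
# proportional to energy, so the ratios give the "% Total Energy" column.
charge = {name: current_ma * seconds for name, (current_ma, seconds) in components.items()}
total = sum(charge.values())

for name, q in charge.items():
    print(f"{name:24s} {100 * q / total:5.1f} % of energy")
print(f"average current draw: {total / 60:.2f} mA")   # ~0.21 mA, as in Table VII
```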

This table does not tell the entire story, however, because the processor must be active during sensing and communication, even though it has very little computation to perform.17 For example, in Table VII, 1.3 s are spent waiting for the light sensor to start and produce a sample,18 and another 0.029 s are spent transmitting. Furthermore, the media access control (MAC) layer on the radio introduces a delay proportional to the number of devices transmitting. To measure this delay, we examined the average delay between 1700 packet arrivals on a network of 10 time-synchronized motes attempting to send at the same time. The minimum interpacket arrival time was about 0.06 s; subtracting the expected transmit time of a packet (0.007 s) suggests that, with 10 nodes, the average MAC delay will be at least (0.06 − 0.007) × 5 = 0.265 s. Thus, of the 2 s each mote is awake, about 1.6 s of that time is spent waiting for the sensors or radio. The total 2-s waking period is selected to allow for variation in MAC delays on individual sensors.

Application computation is almost negligible for basic data collection scenarios: we measured application processing time by running a simple TinyDB query that collects three data fields from the RAM of the processor (incurring no sensing delay) and transmits them over an uncontested radio channel (incurring little MAC delay). We inserted into the query result a measure of the elapsed time from the start of processing until the moment the result begins to be transmitted. The average delay was less than 1/32 (0.03125) s, which is the minimum resolution we could measure. Thus, of the 81% of energy spent on the processor, no more than 1% of its cycles are spent in application processing. For the example given here, at least 65% of this 81% is spent waiting for sensors, and another 8% is spent waiting for the radio to send or receive. The remaining 26% of processing time allows for multihop forwarding of messages and provides slack in the event that MAC delays exceed the measured minimums given above.

17 The requirement that the processor be active during these times is an artifact of the mote hardware. Bluetooth radios, for example, can negotiate channel access independently of the processor. These radios, however, have significantly higher power consumption than the mote radio; see Leopold et al. [2003] for a discussion of Bluetooth as a radio for sensor networks.
18 On motes, it is possible to start and sample several sensors simultaneously, so the delays for the light and humidity sensors are not additive.
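To make the time budget above concrete, the short Python sketch below redoes the arithmetic of the preceding paragraph. The constants are the measured values quoted in the text, and the factor of neighbors/2 for the expected MAC delay is the same approximation the text uses.

```python
# Constants quoted in the text and in Table VII (all times in seconds).
min_interarrival = 0.060     # minimum inter-packet arrival time with 10 senders
expected_tx_time = 0.007     # expected transmit time of one packet
neighbors = 10

# Average MAC delay approximated as in the text: on average half the neighbors
# transmit ahead of a given node.
mac_delay = (min_interarrival - expected_tx_time) * (neighbors / 2)

light_sensor_startup = 1.30  # waiting for the light sensor to start and sample
transmit_time = 0.03         # time spent transmitting (~0.029 s)

waiting = light_sensor_startup + mac_delay + transmit_time
print(f"estimated MAC delay: {mac_delay:.3f} s")                     # ~0.265 s
print(f"waiting for sensors/radio: {waiting:.2f} s of the 2 s waking period")
```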


Summing the processor time spent waiting to send or sending with the percentage of energy used by the radio itself, we get

  (0.26 + 0.08) × 0.80 + 0.02 + 0.11 + 0.003 ≈ 0.41.

This indicates that about 41% of power consumption in this simple data collection task is due to communication. Similarly, the percentage of energy devoted to sensing can be computed by summing the energy spent waiting for samples with the energy costs of sampling:

  0.65 × 0.81 + 0.01 + 0.04 ≈ 0.58.

Thus, about 58% of the energy in this case is spent sensing. Obviously, the total percentage of time spent in sensing could be less if sensors that powered up more rapidly were used. When we discussed query optimization in TinyDB in Section 4, we saw a range of sensors with varying costs that would alter the percentages shown here.

B. QUERY LANGUAGE

This appendix provides a complete specification of the syntax of the TinyDB query language, as well as pointers to the parts of the text where these constructs are defined. We will use {} to denote a set, [] to denote optional clauses, <> to denote an expression, and italicized text to denote user-specified tokens such as aggregate names, commands, and arithmetic operators. The separator "|" indicates that one or the other of the surrounding tokens may appear, but not both. Ellipses ("...") indicate a repeating set of tokens, such as fields in the SELECT clause or tables in the FROM clause.

B.1 Query Syntax

The syntax of queries in the TinyDB query language is as follows:

  [ON [ALIGNED] EVENT event-type[{paramlist}]
      [boolop event-type{paramlist} ... ]]
  SELECT [NO INTERLEAVE] <expr> | agg(<expr>) | temporal agg(<expr>), ...
      FROM [sensors | storage-point], ...
      [WHERE {<pred>}]
      [GROUP BY {<expr>}]
      [HAVING {<pred>}]
      [OUTPUT ACTION [ command |
                       SIGNAL event({paramlist}) |
                       (SELECT ... ) ] |
       [INTO STORAGE POINT bufname]]
  [SAMPLE PERIOD seconds
      [[FOR n rounds] |
       [STOP ON event-type [WHERE <pred>]]]
      [COMBINE {agg(<expr>)}]
      [INTERPOLATE LINEAR]] |
  [ONCE] |
  [LIFETIME seconds [MIN SAMPLE RATE seconds]]
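As a quick illustration of the SELECT/FROM/SAMPLE PERIOD skeleton above, the Python fragment below matches one example query against a tiny regular-expression approximation of that fragment. Both the query text (in the spirit of the examples in Section 3) and the checker are purely illustrative; this is not TinyDB's parser, and it covers only a small part of the grammar.

```python
import re

# A toy check for the SELECT ... FROM ... [WHERE ...] SAMPLE PERIOD skeleton only.
SKELETON = re.compile(
    r"^SELECT\s+(?P<select>.+?)\s+"
    r"FROM\s+(?P<source>\S+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"\s+SAMPLE PERIOD\s+(?P<period>\d+)$",
    re.IGNORECASE)

query = "SELECT nodeid, light FROM sensors WHERE light > 400 SAMPLE PERIOD 10"
match = SKELETON.match(query)
print(match.groupdict() if match else "not in the fragment handled here")
# {'select': 'nodeid, light', 'source': 'sensors', 'where': 'light > 400', 'period': '10'}
```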

Table VIII. References to Sections in the Main Text Where Query Language Constructs are Introduced

  Language Construct      Section
  ON EVENT                Section 3.5
  SELECT-FROM-WHERE       Section 3
  GROUP BY, HAVING        Section 3.3.1
  OUTPUT ACTION           Section 3.7
  SIGNAL                  Section 3.5
  INTO STORAGE POINT      Section 3.2
  SAMPLE PERIOD           Section 3
  FOR                     Section 3.2
  STOP ON                 Section 3.5
  COMBINE                 Section 3.2
  ONCE                    Section 3.7
  LIFETIME                Section 3.6

Each of these constructs is described in more detail in the sections shown in Table VIII.

B.2 Storage Point Creation and Deletion Syntax

The syntax for storage point creation is

  CREATE [CIRCULAR] STORAGE POINT name
      SIZE [ntuples | nseconds]
      [(fieldname type [, ... , fieldname type])] |
      [AS SELECT ... ]
      [SAMPLE PERIOD nseconds]

and for deletion is

  DROP STORAGE POINT name

Both of these constructs are described in Section 3.2.

REFERENCES

ALONSO, R. AND GANGULY, S. 1993. Query optimization in mobile environments. In Proceedings of the Workshop on Foundations of Models and Languages for Data and Objects. 1–17.
ALONSO, R. AND KORTH, H. F. 1993. Database system issues in nomadic computing. In Proceedings of the ACM SIGMOD (Washington, DC).
AVNUR, R. AND HELLERSTEIN, J. M. 2000. Eddies: Continuously adaptive query processing. In Proceedings of ACM SIGMOD (Dallas, TX). 261–272.
BANCILHON, F., BRIGGS, T., KHOSHAFIAN, S., AND VALDURIEZ, P. 1987. FAD, a powerful and simple database language. In Proceedings of VLDB.
BONNET, P., GEHRKE, J., AND SESHADRI, P. 2001. Towards sensor database systems. In Proceedings of the Conference on Mobile Data Management.
BROOKE, T. AND BURRELL, J. 2003. From ethnography to design in a vineyard. In Proceedings of the Design User Experiences (DUX) Conference. Case study.
CARNEY, D., ÇETINTEMEL, U., CHERNIACK, M., CONVEY, C., LEE, S., SEIDMAN, G., STONEBRAKER, M., TATBUL, N., AND ZDONIK, S. 2002. Monitoring streams—a new class of data management applications. In Proceedings of VLDB.


CERPA, A., ELSON, J., ESTRIN, D., GIROD, L., HAMILTON, M., AND ZHAO, J. 2001. Habitat monitoring: Application driver for wireless communications technology. In Proceedings of the ACM SIGCOMM Workshop on Data Communications in Latin America and the Caribbean.
CHAKRABARTI, K., GAROFALAKIS, M., RASTOGI, R., AND SHIM, K. 2001. Approximate query processing using wavelets. VLDB J. 10, 2-3 (Sep.), 199–223.
CHAKRAVARTHY, S., KRISHNAPRASAD, V., ANWAR, E., AND KIM, S. K. 1994. Composite events for active databases: Semantics, contexts and detection. In Proceedings of VLDB.
CHANDRASEKARAN, S., COOPER, O., DESHPANDE, A., FRANKLIN, M. J., HELLERSTEIN, J. M., HONG, W., KRISHNAMURTHY, S., MADDEN, S. R., RAMAN, V., REISS, F., AND SHAH, M. A. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the First Annual Conference on Innovative Database Research (CIDR).
CHEN, J., DEWITT, D., TIAN, F., AND WANG, Y. 2000. NiagaraCQ: A scalable continuous query system for internet databases. In Proceedings of ACM SIGMOD.
CHEN, Z., GEHRKE, J., AND KORN, F. 2001. Query optimization in compressed database systems. In Proceedings of ACM SIGMOD.
CRESPO, A. AND GARCIA-MOLINA, H. 2002. Routing indices for peer-to-peer systems. In Proceedings of ICDCS.
DELIN, K. A. AND JACKSON, S. P. 2000. Sensor web for in situ exploration of gaseous biosignatures. In Proceedings of the IEEE Aerospace Conference.
DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D. A., BRICKER, A., HSIAO, H. I., AND RASMUSSEN, R. 1990. The Gamma database machine project. IEEE Trans. Knowl. Data Eng. 2, 1, 44–62.
GANERIWAL, S., KUMAR, R., ADLAKHA, S., AND SRIVASTAVA, M. 2003. Timing-sync protocol for sensor networks. In Proceedings of ACM SenSys.
GAROFALAKIS, M. AND GIBBONS, P. 2001. Approximate query processing: Taming the terabytes! (tutorial). In Proceedings of VLDB.
GAY, D., LEVIS, P., VON BEHREN, R., WELSH, M., BREWER, E., AND CULLER, D. 2003. The nesC language: A holistic approach to networked embedded systems. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI).
GEHRKE, J., KORN, F., AND SRIVASTAVA, D. 2001. On computing correlated aggregates over continual data streams. In Proceedings of the ACM SIGMOD Conference on Management of Data (Santa Barbara, CA).
HANSON, E. N. 1996. The design and implementation of the Ariel active database rule system. IEEE Trans. Knowl. Data Eng. 8, 1 (Feb.), 157–172.
HELLERSTEIN, J., HONG, W., MADDEN, S., AND STANEK, K. 2003. Beyond average: Towards sophisticated sensing with queries. In Proceedings of the First Workshop on Information Processing in Sensor Networks (IPSN).
HELLERSTEIN, J. M. 1998. Optimization techniques for queries with expensive methods. ACM Trans. Database Syst. 23, 2, 113–157.
HELLERSTEIN, J. M., FRANKLIN, M. J., CHANDRASEKARAN, S., DESHPANDE, A., HILDRUM, K., MADDEN, S., RAMAN, V., AND SHAH, M. 2000. Adaptive query processing: Technology in evolution. IEEE Data Eng. Bull. 23, 2, 7–18.
HILL, J., SZEWCZYK, R., WOO, A., HOLLAR, S., CULLER, D., AND PISTER, K. 2000. System architecture directions for networked sensors. In Proceedings of ASPLOS.
IBARAKI, T. AND KAMEDA, T. 1984. On the optimal nesting order for computing n-relational joins. ACM Trans. Database Syst. 9, 3, 482–502.
IMIELINSKI, T. AND BADRINATH, B. 1992. Querying in highly mobile distributed environments. In Proceedings of VLDB (Vancouver, B.C., Canada).
INTANAGONWIWAT, C., GOVINDAN, R., AND ESTRIN, D. 2000. Directed diffusion: A scalable and robust communication paradigm for sensor networks. In Proceedings of MobiCOM (Boston, MA).
INTERSEMA. 2002. MS5534A barometer module. Tech. rep. (Oct.). Go online to http://www.intersema.com/pro/module/file/da5534.pdf.
IVES, Z. G., FLORESCU, D., FRIEDMAN, M., LEVY, A., AND WELD, D. S. 1999. An adaptive query execution system for data integration. In Proceedings of ACM SIGMOD.
KOSSMANN, D. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (Dec.), 422–469.


KRISHNAMURTHY, R., BORAL, H., AND ZANIOLO, C. 1986. Optimization of nonrecursive queries. In Proceedings of VLDB. 128–137.
LEOPOLD, M., DYDENSBORG, M., AND BONNET, P. 2003. Bluetooth and sensor networks: A reality check. In Proceedings of the ACM Conference on Sensor Networks (SenSys).
LIN, C., FEDERSPIEL, C., AND AUSLANDER, D. 2002. Multi-sensor single actuator control of HVAC systems. In Proceedings of the International Conference for Enhanced Building Operations (Austin, TX, Oct. 14–18).
LIU, L., PU, C., AND TANG, W. 1999. Continual queries for internet-scale event-driven information delivery. IEEE Trans. Knowl. Data Eng. (special issue on Web technology) 11, 4 (July), 610–628.
MADDEN, S. 2003. The design and evaluation of a query processing architecture for sensor networks. Ph.D. dissertation. University of California, Berkeley, Berkeley, CA.
MADDEN, S. AND FRANKLIN, M. J. 2002. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of ICDE.
MADDEN, S., FRANKLIN, M. J., HELLERSTEIN, J. M., AND HONG, W. 2002a. TAG: A Tiny AGgregation service for ad-hoc sensor networks. In Proceedings of OSDI.
MADDEN, S., HONG, W., FRANKLIN, M., AND HELLERSTEIN, J. M. 2003. TinyDB Web page. Go online to http://telegraph.cs.berkeley.edu/tinydb.
MADDEN, S., SHAH, M. A., HELLERSTEIN, J. M., AND RAMAN, V. 2002b. Continuously adaptive continuous queries over data streams. In Proceedings of ACM SIGMOD (Madison, WI).
MAINWARING, A., POLASTRE, J., SZEWCZYK, R., AND CULLER, D. 2002. Wireless sensor networks for habitat monitoring. In Proceedings of the ACM Workshop on Sensor Networks and Applications.
MELEXIS, INC. 2002. MLX90601 infrared thermopile module. Tech. rep. (Aug.). Go online to http://www.melexis.com/prodfiles/mlx90601.pdf.
MONMA, C. L. AND SIDNEY, J. 1979. Sequencing with series parallel precedence constraints. Math. Oper. Res. 4, 215–224.
MOTWANI, R., WIDOM, J., ARASU, A., BABCOCK, B., BABU, S., DATAR, M., OLSTON, C., ROSENSTEIN, J., AND VARMA, R. 2003. Query processing, approximation and resource management in a data stream management system. In Proceedings of the First Annual Conference on Innovative Database Research (CIDR).
OLSTON, C. AND WIDOM, J. 2002. Best effort cache synchronization with source cooperation. In Proceedings of SIGMOD.
PIRAHESH, H., HELLERSTEIN, J. M., AND HASAN, W. 1992. Extensible/rule based query rewrite optimization in Starburst. In Proceedings of ACM SIGMOD. 39–48.
POTTIE, G. AND KAISER, W. 2000. Wireless integrated network sensors. Commun. ACM 43, 5 (May), 51–58.
PRIYANTHA, N. B., CHAKRABORTY, A., AND BALAKRISHNAN, H. 2000. The Cricket location-support system. In Proceedings of MOBICOM.
RAMAN, V., RAMAN, B., AND HELLERSTEIN, J. M. 2002. Online dynamic reordering. VLDB J. 9, 3.
SENSIRION. 2002. SHT11/15 relative humidity sensor. Tech. rep. (June). Go online to http://www.sensirion.com/en/pdf/Datasheet_SHT1x_SHT7x_0206.pdf.
SHATDAL, A. AND NAUGHTON, J. 1995. Adaptive parallel aggregation algorithms. In Proceedings of ACM SIGMOD.
STONEBRAKER, M. AND KEMNITZ, G. 1991. The POSTGRES next-generation database management system. Commun. ACM 34, 10, 78–92.
SUDARSHAN, S. AND RAMAKRISHNAN, R. 1991. Aggregation and relevance in deductive databases. In Proceedings of VLDB. 501–511.
TAOS, INC. 2002. TSL2550 ambient light sensor. Tech. rep. (Sep.). Go online to http://www.taosinc.com/images/product/document/tsl2550.pdf.
UC BERKELEY. 2001. Smart buildings admit their faults. Web page. Lab notes: Research from the College of Engineering, UC Berkeley. Go online to http://coe.berkeley.edu/labnotes/1101.smartbuildings.html.
URHAN, T., FRANKLIN, M. J., AND AMSALEG, L. 1998. Cost-based query scrambling for initial delays. In Proceedings of ACM SIGMOD.
WOLFSON, O., SISTLA, A. P., XU, B., ZHOU, J., AND CHAMBERLAIN, S. 1999. DOMINO: Databases fOr MovINg Objects tracking. In Proceedings of ACM SIGMOD (Philadelphia, PA).



WOO, A. AND CULLER, D. 2001. A transmission control scheme for media access in sensor networks. In Proceedings of ACM MobiCom.
YAO, Y. AND GEHRKE, J. 2002. The Cougar approach to in-network query processing in sensor networks. SIGMOD Rec. 31, 3 (Sept.), 9–18.

Received October 2003; revised June 2004; accepted September 2004


Data Exchange: Getting to the Core RONALD FAGIN, PHOKION G. KOLAITIS, and LUCIAN POPA IBM Almaden Research Center

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. Given a source instance, there may be many solutions to the data exchange problem, that is, many target instances that satisfy the constraints of the data exchange problem. In an earlier article, we identified a special class of solutions that we call universal. A universal solution has homomorphisms into every possible solution, and hence is a “most general possible” solution. Nonetheless, given a source instance, there may be many universal solutions. This naturally raises the question of whether there is a “best” universal solution, and hence a best solution for data exchange. We answer this question by considering the well-known notion of the core of a structure, a notion that was first studied in graph theory, and has also played a role in conjunctive-query processing. The core of a structure is the smallest substructure that is also a homomorphic image of the structure. All universal solutions have the same core (up to isomorphism); we show that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of a universal solution together with its minimality make the core an ideal solution for data exchange. We investigate the computational complexity of producing the core. Well-known results by Chandra and Merlin imply that, unless P = NP, there is no polynomial-time algorithm that, given a structure as input, returns the core of that structure as output. In contrast, in the context of data exchange, we identify natural and fairly broad conditions under which there are polynomialtime algorithms for computing the core of a universal solution. We also analyze the computational complexity of the following decision problem that underlies the computation of cores: given two graphs G and H, is H the core of G? Earlier results imply that this problem is both NP-hard and coNP-hard. Here, we pinpoint its exact complexity by establishing that it is a DP-complete problem. Finally, we show that the core is the best among all universal solutions for answering existential queries, and we propose an alternative semantics for answering queries in data exchange settings. Categories and Subject Descriptors: H.2.5 [Heterogeneous Databases]: Data Translation; H.2.4 [Systems]: Relational Databases; H.2.4 [Systems]: Query Processing General Terms: Algorithms, Theory Additional Key Words and Phrases: Certain answers, conjunctive queries, core, universal solutions, dependencies, chase, data exchange, data integration, computational complexity, query answering

P. G. Kolaitis is on leave from the University of California, Santa Cruz, Santa Cruz, CA; he is partially supported by NSF Grant IIS-9907419. A preliminary version of this article appeared on pages 90–101 of Proceedings of the ACM Symposium on Principles of Database Systems (San Diego, CA). Authors’ addresses: Foundation of Computer Science, IBM Almaden Research Center, Department K53/B2, 650 Harry Road, San Jose, CA 95120; email: {fagin,kolaitis,lucian}@almaden.ibm.com. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.  C 2005 ACM 0362-5915/05/0300-0174 $5.00 ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 174–210.


1. INTRODUCTION AND SUMMARY OF RESULTS 1.1 The Data Exchange Problem Data exchange is the problem of materializing an instance that adheres to a target schema, given an instance of a source schema and a specification of the relationship between the source schema and the target schema. This problem arises in many tasks requiring data to be transferred between independent applications that do not necessarily adhere to the same data format (or schema). The importance of data exchange was recognized a long time ago; in fact, an early data exchange system was EXPRESS [Shu et al. 1977] from the 1970s, whose main functionality was to convert data between hierarchical schemas. The need for data exchange has steadily increased over the years and, actually, has become more pronounced in recent years, with the proliferation of Web data in various formats and with the emergence of e-business applications that need to communicate data yet remain autonomous. The data exchange problem is related to the data integration problem in the sense that both problems are concerned with management of data stored in heterogeneous formats. The two problems, however, are different for the following reasons. In data exchange, the main focus is on actually materializing a target instance that reflects the source data as accurately as possible; this can be a serious challenge, due to the inherent underspecification of the relationship between the source and the target. In contrast, a target instance need not be materialized in data integration; the main focus there is on answering queries posed over the target schema using views that express the relationship between the target and source schemas. In a previous paper [Fagin et al. 2003], we formalized the data exchange problem and embarked on an in-depth investigation of the foundational and algorithmic issues that surround it. Our work has been motivated by practical considerations arising in the development of Clio [Miller et al. 2000; Popa et al. 2002] at the IBM Almaden Research Center. Clio is a prototype system for schema mapping and data exchange between autonomous applications. A data exchange setting is a quadruple (S, T, st , t ), where S is the source schema, T is the target schema, st is a set of source-to-target dependencies that express the relationship between S and T, and t is a set of dependencies that express constraints on T. Such a setting gives rise to the following data exchange problem: given an instance I over the source schema S, find an instance J over the target schema T such that I together with J satisfy the source-to-target dependencies st , and J satisfies the target dependencies t . Such an instance J is called a solution for I in the data exchange setting. In general, many different solutions for an instance I may exist. Thus, the question is: which solution should one choose to materialize, so that it reflects the source data as accurately as possible? Moreover, can such a solution be efficiently computed? In Fagin et al. [2003], we investigated these issues for data exchange settings in which S and T are relational schemas, st is a set of tuple-generating dependencies (tgds) between S and T, and t is a set of tgds and equality-generating dependencies (egds) on T. We isolated a class of solutions, called universal solutions, possessing good properties that justify selecting them as the semantics ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


of the data exchange problem. Specifically, universal solutions have homomorphisms into every possible solution; in particular, they have homomorphisms into each other, and thus are homomorphically equivalent. Universal solutions are the most general among all solutions and, in a precise sense, they represent the entire space of solutions. Moreover, as we shall explain shortly, universal solutions can be used to compute the “certain answers” of queries q that are unions of conjunctive queries over the target schema. The set certain(q, I ) of certain answers of a query q over the target schema, with respect to a source instance I , consists of all tuples that are in the intersection of all q(J )’s, as J varies over all solutions for I (here, q(J ) denotes the result of evaluating q on J ). The notion of the certain answers originated in the context of incomplete databases (see van der Meyden [1998] for a survey). Moreover, the certain answers have been used for query answering in data integration [Lenzerini 2002]. In the same data integration context, Abiteboul and Duschka [1998] studied the complexity of computing the certain answers. We showed [Fagin et al. 2003] that the certain answers of unions of conjunctive queries can be obtained by simply evaluating these queries on some arbitrarily chosen universal solution. We also showed that, under fairly general, yet practical, conditions, a universal solution exists whenever a solution exists. Furthermore, we showed that when these conditions are satisfied, there is a polynomial-time algorithm for computing a canonical universal solution; this algorithm is based on the classical chase procedure [Beeri and Vardi 1984; Maier et al. 1979]. 1.2 Data Exchange with Cores Even though they are homomorphically equivalent to each other, universal solutions need not be unique. In other words, in a data exchange setting, there may be many universal solutions for a given source instance I . Thus, it is natural to ask: what makes a universal solution “better” than another universal solution? Is there a “best” universal solution and, of course, what does “best” really mean? If there is a “best” universal solution, can it be efficiently computed? The present article addresses these questions and offers answers that are based on using minimality as a key criterion for what constitutes the “best” universal solution. Although universal solutions come in different sizes, they all share a unique (up to isomorphism) common “part,” which is nothing else but the core of each of them, when they are viewed as relational structures. By definition, the core of a structure is the smallest substructure that is also a homomorphic image of the structure. The concept of the core originated in graph theory, where a number of results about its properties have been established (see, for instance, Hell and Neˇsetˇril [1992]). Moreover, in the early days of database theory, Chandra and Merlin [1977] realized that the core of a structure is useful in conjunctive-query processing. Indeed, since evaluating joins is the most expensive among the basic relational algebra operations, one of the most fundamental problems in query processing is the join-minimization problem: given a conjunctive query q, find an equivalent conjunctive query involving the smallest possible number of joins. In turn, this problem amounts to computing ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


the core of the relational instance Dq that is obtained from q by putting a fact into Dq for each conjunct of q (see Abiteboul et al. [1995]; Chandra and Merlin [1977]; Kanellakis [1990]). Consider a data exchange setting (S, T, st , t ) in which st is a set of sourceto-target tgds and t is a set of target tgds and target egds. Since all universal solutions for a source instance I are homomorphically equivalent, it is easy to see that their cores are isomorphic. Moreover, we show in this article that the core of a universal solution for I is itself a solution for I . It follows that the core of the universal solutions for I is the smallest universal solution for I , and thus an ideal candidate for the “best” universal solution, at least in terms of the space required to materialize it. After this, we address the issue of how hard it is to compute the core of a universal solution. Chandra and Merlin [1977] showed that join minimization is an NP-hard problem by pointing out that a graph G is 3-colorable if and only if the 3-element clique K3 is the core of the disjoint sum G ⊕ K3 of G with K3 . From this, it follows that, unless P = NP, there is no polynomial-time algorithm that, given a structure as input, outputs its core. At first sight, this result casts doubts on the tractability of computing the core of a universal solution. For data exchange, however, we give natural and fairly broad conditions under which there are polynomial-time algorithms for computing the cores of universal solutions. Specifically, we show that there are polynomial-time algorithms for computing the core of universal solutions in data exchange settings in which st is a set of source-to-target tgds and t is a set of target egds. It remains an open problem to determine whether this result can be extended to data exchange settings in which the target constraints t consist of both egds and tgds. We also analyze the computational complexity of the following decision problem, called CORE IDENTIFICATION, which underlies the computation of cores: given two graphs G and H, is H the core of G? As seen above, the results by Chandra and Merlin [1977] imply that this problem is NP-hard. Later on, Hell and Neˇsetˇril [1992] showed that deciding whether a graph G is its own core is a coNP-complete problem; in turn, this implies that CORE IDENTIFICATION is a coNP-hard problem. Here, we pinpoint the exact computational complexity of CORE IDENTIFICATION by showing that it is a DP-complete problem, where DP is the class of decision problems that can be written as the intersection of an NP-problem and a coNP-problem. In the last part of the article, we further justify the selection of the core as the “best” universal solution by establishing its usefulness in answering queries over the target schema T. An existential query q(x) is a formula of the form ∃yφ(x, y), where φ(x, y) is a quantifier-free formula.1 Perhaps the most important examples of existential queries are the conjunctive queries with inequalities =. Another useful example of existential queries is the setdifference query, which asks whether there is a member of the set difference A − B. Let J0 be the core of all universal solutions for a source instance I . As discussed earlier, since J0 is itself a universal solution for I , the certain answers 1 We

shall also give a safety condition on φ.


of conjunctive queries over T can be obtained by simply evaluating them on J0 . In Fagin et al. [2003], however, it was shown that there are simple conjunctive queries with inequalities = such that evaluating them on a universal solution always produces a proper superset of the set of certain answers for I . Nonetheless, here we show that evaluating existential queries on the core J0 of the universal solutions yields the best approximation (that is, the smallest superset) of the set of the certain answers, among all universal solutions. Analogous to the definition of certain answers, let us define the certain answers on universal solutions of a query q over the target schema, with respect to a source instance I , to be the set of all tuples that are in the intersection of all q(J )’s, as J varies over all universal solutions for I ; we write u-certain(q, I ) to denote the certain answers of q on universal solutions for I . Since we consider universal solutions to be the preferred solutions to the data exchange problem, this suggests the naturalness of this notion of certain answers on universal solutions as an alternative semantics for query answering in data exchange settings. We show that if q is an existential query and J0 is the core of the universal solutions for I , then the set of those tuples in q(J0 ) whose entries are elements from the source instance I is equal to the set u-certain(q, I ) of the certain answers of q on universal solutions. We also show that in the LAV setting (an important scenario in data integration) there is an interesting contrast between the complexity of computing certain answers and of computing certain answers on universal solutions. Specifically, Abiteboul and Duschka [1998] showed that there is a data exchange setting with t = ∅ and a conjunctive query with inequalities = such that computing the certain answers of this query is a coNP-complete problem. In contrast to this, we establish here that in an even more general data exchange setting (S, T, st , t ) in which st is an arbitrary set of tgds and t is an arbitrary set of egds, for every existential query q (and in particular, for every conjunctive query q with inequalities =), there is a polynomial-time algorithm for computing the set u-certain(q, I ) of the certain answers of q on universal solutions. 2. PRELIMINARIES This section contains the main definitions related to data exchange and a minimum amount of background material. The presentation follows closely our earlier paper [Fagin et al. 2003]. 2.1 The Data Exchange Problem A schema is a finite sequence R = R1 , . . . , Rk  of relation symbols, each of a fixed arity. An instance I (over the schema R) is a sequence R1I , . . . , RkI  that associates each relation symbol Ri with a relation RiI of the same arity as Ri . We shall often abuse the notation and use Ri to denote both the relation symbol and the relation RiI that interprets it. We may refer to RiI as the Ri relation of I . Given a tuple t occurring in a relation R, we denote by R(t) the association between t and R, and call it a fact. An instance I can be identified with the set of all facts arising from the relations RiI of I . If R is a schema, then a dependency over R is a sentence in some logical formalism over R. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Let S = S1 , . . . , Sn  and T = T1 , . . . , Tm  be two schemas with no relation symbols in common. We refer to S as the source schema and to the Si ’s as the source relation symbols. We refer to T as the target schema and to the T j ’s as the target relation symbols. We denote by S, T the schema S1 , . . . , Sn , T1 , . . . , Tm . Instances over S will be called source instances, while instances over T will be called target instances. If I is a source instance and J is a target instance, then we write I, J  for the instance K over the schema S, T such that SiK = SiI and T jK = T jJ , when 1 ≤ i ≤ n and 1 ≤ j ≤ m. A source-to-target dependency is, in general, a dependency over S, T of the form ∀x(φS (x) → χT (x)), where φS (x) is a formula, with free variables x, of some logical formalism over S, and χT (x) is a formula, with free variables x, of some logical formalism over T (these two logical formalisms may be different). We use the notation x for a vector of variables x1 , . . . , xk . We assume that all the variables in x appear free in φS (x). A target dependency is, in general, a dependency over the target schema T (the formalism used to express a target dependency may be different from those used for the source-to-target dependencies). The source schema may also have dependencies that we assume are satisfied by every source instance. While the source dependencies may play an important role in deriving source-to-target dependencies [Popa et al. 2002], they do not play any direct role in data exchange, because we take the source instance to be given. Definition 2.1. A data exchange setting (S, T, st , t ) consists of a source schema S, a target schema T, a set st of source-to-target dependencies, and a set t of target dependencies. The data exchange problem associated with this setting is the following: given a finite source instance I , find a finite target instance J such that I, J  satisfies st and J satisfies t . Such a J is called a solution for I or, simply, a solution if the source instance I is understood from the context. For most practical purposes, and for most of the results of this article (all results except for Proposition 2.7), each source-to-target dependency in st is a tuple generating dependency (tgd) [Beeri and Vardi 1984] of the form ∀x(φS (x) → ∃yψT (x, y)), where φS (x) is a conjunction of atomic formulas over S and ψT (x, y) is a conjunction of atomic formulas over T. We assume that all the variables in x appear in φS (x). Moreover, each target dependency in t is either a tgd, of the form ∀x(φT (x) → ∃yψT (x, y)), or an equality-generating dependency (egd) [Beeri and Vardi 1984], of the form ∀x(φT (x) → (x1 = x2 )). In these dependencies, φT (x) and ψT (x, y) are conjunctions of atomic formulas over T, where all the variables in x appear in φT (x), and x1 , x2 are among the variables in x. The tgds and egds together comprise Fagin’s (embedded) implicational dependencies [Fagin 1982]. As in Fagin et al. [2003], we will drop ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


the universal quantifiers in front of a dependency, and implicitly assume such quantification. However, we will write down all the existential quantifiers. Source-to-target tgds are a natural and powerful language for expressing the relationship between a source schema and a target schema. Such dependencies are automatically derived and used as representation of a schema mapping in the Clio system [Popa et al. 2002]. Furthermore, data exchange settings with tgds as source-to-target dependencies include as special cases both local-asview (LAV) and global-as-view (GAV) data integration systems in which the views are sound and defined by conjunctive queries (see Lenzerini’s tutorial [Lenzerini 2002] for a detailed discussion of LAV and GAV data integration systems and sound views). A LAV data integration system with sound views defined by conjunctive queries is a special case of a data exchange setting (S, T, st , t ), in which S is the source schema (consisting of the views, in LAV terminology), T is the target schema (or global schema, in LAV terminology), the set t of target dependencies is empty, and each source-to-target tgd in st is of the form S(x) → ∃y ψT (x, y), where S is a single relation symbol of the source schema S (a view, in LAV terminology) and ψT is a conjunction of atomic formulas over the target schema T. A GAV setting is similar, but the tgds in st are of the form φS (x) → T (x), where T is a single relation symbol over the target schema T (a view, in GAV terminology), and φS is a conjunction of atomic formulas over the source schema S. Since, in general, a source-to-target tgd relates a conjunctive query over the source schema to a conjunctive query over the target schema, a data exchange setting is strictly more expressive than LAV or GAV, and in fact it can be thought of as a GLAV (global-and-local-as-view) system [Friedman et al. 1999; Lenzerini 2002]. These similarities between data integration and data exchange notwithstanding, the main difference between the two is that in data exchange we have to actually materialize a finite target instance that best reflects the given source instance. In data integration no such exchange of data is required; the target can remain virtual. In general there may be multiple solutions for a given data exchange problem. The following example illustrates this issue and raises the question of which solution to choose to materialize. Example 2.2. Consider a data exchange problem in which the source schema consists of two binary relation symbols as follows: EmpCity, associating employees with cities they work in, and LivesIn, associating employees with cities they live in. Assume that the target schema consists of three binary relation symbols as follows: Home, associating employees with their home cities, EmpDept, associating employees with departments, and DeptCity, associating departments with their cities. We assume that t = ∅. The source-to-target tgds and the source instance are as follows, where (d 1 ), (d 2 ), (d 3 ), and (d 4 ) are labels for convenient reference later: st : (d 1 ) EmpCity(e, c) → ∃HHome(e, H), (d 2 ) EmpCity(e, c) → ∃D(EmpDept(e, D) ∧ DeptCity(D, c)), (d 3 ) LivesIn(e, h) → Home(e, h), ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


(d 4 ) LivesIn(e, h) → ∃D∃C(EmpDept(e, D) ∧ DeptCity(D, C)), I = {EmpCity(Alice, SJ), EmpCity(Bob, SD) LivesIn(Alice, SF), LivesIn(Bob, LA)}. We shall use this example as a running example throughout this article. Since the tgds in st do not completely specify the target instance, there are multiple solutions that are consistent with the specification. One solution is J0 = {Home(Alice, SF), Home(Bob, SD) EmpDept(Alice, D1 ), EmpDept(Bob, D2 ) DeptCity(D1 , SJ), DeptCity(D2 , SD)}, where D1 and D2 represent “unknown” values, that is, values that do not occur in the source instance. Such values are called labeled nulls and are to be distinguished from the values occurring in the source instance, which are called constants. Instances with constants and labeled nulls are not specific to data exchange. They have long been considered, in various forms, in the context of incomplete or indefinite databases (see van der Meyden [1998]) as well as in the context of data integration (see Halevy [2001]; Lenzerini [2002]). Intuitively, in the above instance, D1 and D2 are used to “give values” for the existentially quantified variable D of (d 2 ), in order to satisfy (d 2 ) for the two source tuples EmpCity(Alice, SJ) and EmpCity(Bob, SD). In contrast, two constants (SF and SD) are used to “give values” for the existentially quantified variable H of (d 1 ), in order to satisfy (d 1 ) for the same two source tuples. The following instances are solutions as well: J = {Home(Alice, SF), Home(Bob, SD) Home(Alice, H1 ), Home(Bob, H2 ) EmpDept(Alice, D1 ), EmpDept(Bob, D2 ) DeptCity(D1 , SJ), DeptCity(D2 , SD)}, J0′ = {Home(Alice, SF), Home(Bob, SD) EmpDept(Alice, D), EmpDept(Bob, D) DeptCity(D, SJ), DeptCity(D, SD)}. The instance J differs from J0 by having two extra Home tuples where the home cities of Alice and Bob are two nulls, H1 and H2 , respectively. The second instance J0′ differs from J0 by using the same null (namely D) to denote the “unknown” department of both Alice and Bob. Next, we review the notion of universal solutions, proposed in Fagin et al. [2003] as the most general solutions. 2.2 Universal Solutions We denote by Const the set (possibly infinite) of all values that occur in source instances, and as before we call them constants. We also assume an infinite set Var of values, called labeled nulls, such that Var ∩ Const = ∅. We reserve ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


the symbols I, I′, I1, I2, . . . for instances over the source schema S and with values in Const. We reserve the symbols J, J′, J1, J2, . . . for instances over the target schema T and with values in Const ∪ Var. Moreover, we require that solutions of a data exchange problem have their values drawn from Const ∪ Var. If R = ⟨R1, . . . , Rk⟩ is a schema and K is an instance over R with values in Const ∪ Var, then Const(K) denotes the set of all constants occurring in relations in K, and Var(K) denotes the set of labeled nulls occurring in relations in K.

Definition 2.3. Let K1 and K2 be two instances over R with values in Const ∪ Var.

1. A homomorphism h: K 1 → K 2 is a mapping from Const(K 1 ) ∪ Var(K 1 ) to Const(K 2 ) ∪ Var(K 2 ) such that (1) h(c) = c, for every c ∈ Const(K 1 ); (2) for every fact Ri (t) of K 1 , we have that Ri (h(t)) is a fact of K 2 (where, if t = (a1 , . . . , as ), then h(t) = (h(a1 ), . . ., h(as ))). 2. K 1 is homomorphically equivalent to K 2 if there are homomorphisms h: K 1 → K 2 and h′ : K 2 → K 1 . Definition 2.4 (Universal Solution). Consider a data exchange setting (S, T, st , t ). If I is a source instance, then a universal solution for I is a solution J for I such that for every solution J ′ for I , there exists a homomorphism h : J → J ′. Example 2.5. The instance J0′ in Example 2.2 is not universal. In particular, there is no homomorphism from J0′ to J0 . Hence, the solution J0′ contains “extra” information that was not required by the specification; in particular, J0′ “assumes” that the departments of Alice and Bob are the same. In contrast, it can easily be shown that J0 and J have homomorphisms to every solution (and to each other). Thus, J0 and J are universal solutions. Universal solutions possess good properties that justify selecting them (as opposed to arbitrary solutions) for the semantics of the data exchange problem. A universal solution is more general than an arbitrary solution because, by definition, it can be homomorphically mapped into that solution. Universal solutions have, also by their definition, homomorphisms to each other and, thus, are homomorphically equivalent. 2.2.1 Computing Universal Solutions. In Fagin et al. [2003], we addressed the question of how to check the existence of a universal solution and how to compute one, if one exists. In particular, we identified fairly general, yet practical, conditions that guarantee that universal solutions exist whenever solutions exist. Moreover, we showed that there is a polynomial-time algorithm for computing a canonical universal solution, if a solution exists; this algorithm is based on the classical chase procedure. The following result summarizes these findings. THEOREM 2.6 [FAGIN ET AL. 2003]. Assume a data exchange setting where st is a set of tgds, and t is the union of a weakly acyclic set of tgds with a set of egds. (1) The existence of a solution can be checked in polynomial time. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


(2) A universal solution exists if and only if a solution exists. (3) If a solution exists, then a universal solution can be produced in polynomial time using the chase. The notion of a weakly acyclic set of tgds first arose in a conversation between the third author and A. Deutsch in 2001. It was then independently used in Deutsch and Tannen [2003] and in Fagin et al. [2003] (in the former article, under the term constraints with stratified-witness). This class guarantees the termination of the chase and is quite broad, as it includes both sets of full tgds [Beeri and Vardi 1984] and sets of acyclic inclusion dependencies [Cosmadakis and Kanellakis 1986]. We note that, when the set t of target constraints is empty, a universal solution always exists and a canonical one is constructible in polynomial time by chasing I, ∅ with st . In the Example 2.2, the instance J is such a canonical universal solution. If the set t of target constraints contains egds, then it is possible that no universal solution exists (and hence no solution exists, either, by the above theorem). This occurs (see Fagin et al. [2003]) when the chase fails by attempting to identify two constants while trying to apply some egd of t . If the chase does not fail, then the result of chasing I, ∅ with st ∪ t is a canonical universal solution. 2.2.2 Certain Answers. In a data exchange setting, there may be many different solutions for a given source instance. Hence, given a source instance, the question arises as to what the result of answering queries over the target schema is. Following earlier work on information integration, in Fagin et al. [2003] we adopted the notion of the certain answers as the semantics of query answering in data exchange settings. As stated in Section 1, the set certain(q, I ) of the certain answers of q with respect to a source instance I is the set of tuples that appear in q(J ) for every solution J ; in symbols,  certain(q, I ) = {q(J ) : J is a solution for I }. Before stating the connection between the certain answers and universal solutions, let us recall the definitions of conjunctive queries (with inequalities) and unions of conjunctive queries (with inequalities). A conjunctive query q(x) over a schema R is a formula of the form ∃yφ(x, y) where φ(x, y) is a conjunction of atomic formulas over R. If, in addition to atomic formulas, the conjunction φ(x, y) is allowed to contain inequalities of the form z i = z j , where z i , z j are variables among x and y, we call q(x) a conjunctive query with inequalities. We also impose a safety condition, that every variable in x and y must appear in an atomic formula, not just in an inequality. A union of conjunctive queries (with inequalities) is a disjunction q(x) = q1 (x) ∨ · · · ∨ qn (x) where q1 (x), . . . , qn (x) are conjunctive queries (with inequalities). If J is an arbitrary solution, let us denote by q(J )↓ the set of all “null-free” tuples in q(J ), that is the set of all tuples in q(J ) that are formed entirely of constants. The next proposition from Fagin et al. [2003] asserts that null-free evaluation of conjunctive queries on an arbitrarily chosen universal solution gives precisely the set of certain answers. Moreover, universal solutions are the only solutions that have this property. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
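The homomorphism test underlying Definitions 2.3 and 2.4 is easy to run by brute force on toy instances. The Python sketch below is our own illustration (it is not an algorithm from this article): facts are encoded as tuples, constants are fixed by every homomorphism, and the check reconfirms the observation of Example 2.5 that J0′ admits no homomorphism into J0 and hence is not universal.

```python
from itertools import product

def hom_exists(src, dst, constants):
    """Brute-force test for a homomorphism from instance src into dst.
    Facts are tuples (relation, value, value); constants map to themselves,
    labeled nulls may map to any value occurring in dst."""
    nulls = sorted({v for f in src for v in f[1:]} - constants)
    dst_vals = sorted({v for f in dst for v in f[1:]})
    for image in product(dst_vals, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all((r, *[h.get(v, v) for v in vs]) in dst for (r, *vs) in src):
            return True
    return False

CONSTS = {"Alice", "Bob", "SF", "SD", "SJ"}
J0 = {("Home", "Alice", "SF"), ("Home", "Bob", "SD"),
      ("EmpDept", "Alice", "D1"), ("EmpDept", "Bob", "D2"),
      ("DeptCity", "D1", "SJ"), ("DeptCity", "D2", "SD")}
J0_prime = {("Home", "Alice", "SF"), ("Home", "Bob", "SD"),
            ("EmpDept", "Alice", "D"), ("EmpDept", "Bob", "D"),
            ("DeptCity", "D", "SJ"), ("DeptCity", "D", "SD")}

print(hom_exists(J0, J0_prime, CONSTS))   # True: J0 maps homomorphically into J0'
print(hom_exists(J0_prime, J0, CONSTS))   # False: J0' is not universal (Example 2.5)
```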


PROPOSITION 2.7 [FAGIN ET AL. 2003]. Consider a data exchange setting with S as the source schema, T as the target schema, and such that the dependencies in the sets st and t are arbitrary. (1) Let q be a union of conjunctive queries over the target schema T. If I is a source instance and J is a universal solution, then certain(q, I ) = q(J )↓ . (2) Let I be a source instance and J be a solution such that, for every conjunctive query q over T, we have that certain(q, I ) = q(J )↓ . Then J is a universal solution. 3. DATA EXCHANGE WITH CORES 3.1 Multiple Universal Solutions Even if we restrict attention to universal solutions instead of arbitrary solutions, there may still exist multiple, nonisomorphic universal solutions for a given instance of a data exchange problem. Moreover, although these universal solutions are homomorphically equivalent to each other, they may have different sizes (where the size is the number of tuples). The following example illustrates this state of affairs. Example 3.1. We again revisit our running example from Example 2.2. As we noted earlier, of the three target instances given there, two of them (namely, J0 and J ) are universal solutions for I . These are nonisomorphic universal solutions (since they have different sizes). We now give an infinite family of nonisomorphic universal solutions, that we shall make use of later. For every m ≥ 0, let Jm be the target instance Jm = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, X 0 ), EmpDept(Bob, Y 0 ), DeptCity(X 0 , SJ), DeptCity(Y 0 , SD), ... EmpDept(Alice, X m ), EmpDept(Bob, Y m ), DeptCity(X m , SJ), DeptCity(Y m , SD)}, where X 0 , Y 0 , . . . , X m , Y m are distinct labeled nulls. (In the case of m = 0, the resulting instance J0 is the same, modulo renaming of nulls, as the earlier J0 from Example 2.2. We take the liberty of using the same name, since the choice of nulls really does not matter.) It is easy to verify that each target instance Jm , for m ≥ 0, is a universal solution for I ; thus, there are infinitely many nonisomorphic universal solutions for I . It is also easy to see that every universal solution must contain at least four tuples EmpDept(Alice, X ), EmpDept(Bob, Y ), DeptCity(X , SJ), and DeptCity(Y, SD), for some labeled nulls X and Y , as well as the tuples Home(Alice, SF) and Home(Bob, SD). Consequently, the instance J0 has the smallest size among all universal solutions for I and actually is the unique (up to isomorphism) universal solution of smallest size. Thus, J0 is a rather special universal solution and, from a size point of view, a preferred candidate to materialize in data exchange. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
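The fact that all of the universal solutions Jm above agree on null-free query answers (Proposition 2.7) can be checked directly on this family. The Python sketch below is only an illustration under the same fact encoding as the previous snippet; the query evaluated is q(e, c) :- EmpDept(e, D), DeptCity(D, c), and the answer set is the same for every m.

```python
def eval_cq(instance, constants):
    """Evaluate q(e, c) :- EmpDept(e, D), DeptCity(D, c) and keep only
    null-free answers, mimicking the q(J)-followed-by-dropping-nulls
    evaluation of Proposition 2.7."""
    answers = set()
    for (_, e, d1) in (f for f in instance if f[0] == "EmpDept"):
        for (_, d2, c) in (f for f in instance if f[0] == "DeptCity"):
            if d1 == d2 and e in constants and c in constants:
                answers.add((e, c))
    return answers

def J(m):
    """The universal solution J_m of Example 3.1, with nulls X0..Xm, Y0..Ym."""
    facts = {("Home", "Alice", "SF"), ("Home", "Bob", "SD")}
    for i in range(m + 1):
        facts |= {("EmpDept", "Alice", f"X{i}"), ("EmpDept", "Bob", f"Y{i}"),
                  ("DeptCity", f"X{i}", "SJ"), ("DeptCity", f"Y{i}", "SD")}
    return facts

CONSTS = {"Alice", "Bob", "SF", "SD", "SJ"}
print({m: eval_cq(J(m), CONSTS) for m in range(3)})
# every J_m yields {('Alice', 'SJ'), ('Bob', 'SD')}, the certain answers
```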


Motivated by the preceding example, in the sequel we introduce and study the concept of the core of a universal solution. We show that the core of a universal solution is the unique (up to isomorphism) smallest universal solution. We then address the problem of computing the core and also investigate the use of cores in answering queries over the target schemas. The results that we will establish make a compelling case that cores are the preferred solutions to materialize in data exchange. 3.2 Cores and Universal Solutions In addition to the notion of an instance over a schema (which we defined earlier), we find it convenient to define the closely related notion of a structure over a schema. The difference is that a structure is defined with a universe, whereas the universe of an instance is implicitly taken to be the “active domain,” that is, the set of elements that appear in tuples of the instance. Furthermore, unlike target instances in data exchange settings, structures do not necessarily have distinguished elements (“constants”) that have to be mapped onto themselves by homomorphisms. More formally, a structure A (over the schema R = R1 , . . . , Rk ) is a sequence A, R1A , . . . , RkA , where A is a nonempty set, called the universe, and each RiA is a relation on A of the same arity as the relation symbol Ri . As with instances, we shall often abuse the notation and use Ri to denote both the relation symbol and the relation RiA that interprets it. We may refer to RiA as the Ri relation of A. If A is finite, then we say that the structure is finite. A structure B = (B, R1B , . . . , RkB ) is a substructure of A if B ⊆ A and RiB ⊆ RiA , for 1 ≤ i ≤ k. We say that B is a proper substructure of A if it is a substructure of A and at least one of the containments RiB ⊆ RiA , for 1 ≤ i ≤ k, is a proper one. A structure B = (B, R1B , . . . , RkB ) is an induced substructure of A if B ⊆ A and, for every 1 ≤ i ≤ k, we have that RiB = {(x1 , . . . , xn ) | RiA (x1 , . . . , xn ) and x1 , . . . , xn are in B}. Definition 3.2. A substructure C of structure A is called a core of A if there is a homomorphism from A to C, but there is no homomorphism from A to a proper substructure of C. A structure C is called a core if it is a core of itself, that is, if there is no homomorphism from C to a proper substructure of C. Note that C is a core of A if and only if C is a core, C is a substructure of A, and there is a homomorphism from A to C. The concept of the core of a graph has been studied extensively in graph theory (see Hell and Neˇsetˇril [1992]). The next proposition summarizes some basic facts about cores; a proof can be found in Hell and Neˇsetˇril [1992]. PROPOSITION 3.3.

The following statements hold:

— Every finite structure has a core; moreover, all cores of the same finite structure are isomorphic.
— Every finite structure is homomorphically equivalent to its core. Consequently, two finite structures are homomorphically equivalent if and only if their cores are isomorphic.


—If C is the core of a finite structure A, then there is a homomorphism h: A → C such that h(v) = v for every member v of the universe of C. — If C is the core of a finite structure A, then C is an induced substructure of A. In view of Proposition 3.3, if A is a finite structure, there is a unique (up to isomorphism) core of A, which we denote by core(A). We can similarly define the notions of a subinstance of an instance and of a core of an instance. We identify the instance with the corresponding structure, where the universe of the structure is taken to be the active domain of the instance, and where we distinguish the constants. That is, we require that if h is a homomorphism and c is a constant, then h(c) = c (as already defined in Section 2.2). The results about cores of structures will then carry over to cores of instances. Universal solutions for I are unique up to homomorphic equivalence, but as we saw in Example 3.1, they need not be unique up to isomorphism. Proposition 3.3, however, implies that their cores are isomorphic; in other words, all universal solutions for I have the same core up to isomorphism. Moreover, if J is a universal solution for I and core(J ) is a solution for I , then core(J ) is also a universal solution for I , since J and core(J ) are homomorphically equivalent. In general, if the dependencies st and t are arbitrary, then the core of a solution to an instance of the data exchange problem need not be a solution. The next result shows, however, that this cannot happen if st is a set of tgds and t is a set of tgds and egds. PROPOSITION 3.4. Let (S, T, st , t ) be a data exchange setting in which st is a set of tgds and t is a set of tgds and egds. If I is a source instance and J is a solution for I , then core(J ) is a solution for I . Consequently, if J is a universal solution for I , then also core(J ) is a universal solution for I . PROOF. Let φS (x) → ∃yψT (x, y) be a tgd in st and a = (a1 , . . . , an ) a tuple of constants such that I |= φS (a). Since J is a solution for I , there is a tuple b = (b1 , . . . , bs ) of elements of J such that I, J  |= ψT (a, b). Let h be a homomorphism from J to core(J ). Then h(ai ) = ai , since each ai is a constant, for 1 ≤ i ≤ n. Consequently, I, core(J ) |= ψT (a, h(b)), where h(b) = (h(b1 ), . . . , h(bs )). Thus, I, core(J ) satisfies the tgd. Next, let φT (x) → ∃yψT (x, y) be a tgd in t and a = (a1 , . . . , an ) a tuple of elements in core(J ) such that core(J ) |= φT (a). Since core(J ) is a subinstance of J , it follows that J |= φT (a), and since J is a solution, it follows that there is a tuple b = (b1 , . . . , bs ) of elements of J such that J |= ψT (a, b). According to the last part of Proposition 3.3, there is a homomorphism h from J to core(J ) such that h(v) = v, for every v in core(J ). In particular, h(ai ) = ai , for 1 ≤ i ≤ n. It follows that core(J ) |= ψT (a, h(b)), where h(b) = (h(b1 ), . . . , h(bs )). Thus, core(J ) satisfies the tgd. Finally, let φT (x) → (x1 = x2 ) be an egd in t . If a = (a1 , . . . , as ) is a tuple of elements in core(J ) such that core(J ) |= φT (a), then J |= φT (a), because core(J ) is a subinstance of J . Since J is a solution, it follows that a1 = a2 . Thus, core(J ) satisfies every egd in t . ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
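For very small instances, the core can be found by exhaustive search directly from Definition 3.2: look for the smallest subinstance that the whole instance maps into homomorphically. The Python sketch below is our own exponential-time illustration, not one of the polynomial-time algorithms discussed in this article; applied to the universal solution J1 of Example 3.1 it returns a six-fact instance isomorphic to J0.

```python
from itertools import combinations, product

def hom_exists(src, dst, constants):
    """True iff there is a homomorphism from src into dst fixing every constant."""
    nulls = sorted({v for f in src for v in f[1:]} - constants)
    dst_vals = sorted({v for f in dst for v in f[1:]})
    for image in product(dst_vals, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        if all((r, *[h.get(v, v) for v in vs]) in dst for (r, *vs) in src):
            return True
    return False

def core(instance, constants):
    """Smallest subinstance that `instance` maps into homomorphically
    (Definition 3.2). Exhaustive search; usable only on tiny examples."""
    facts = sorted(instance)
    # Facts built from constants alone must survive in any homomorphic image.
    keep = {f for f in facts if all(v in constants for v in f[1:])}
    for size in range(len(keep), len(facts) + 1):
        for subset in combinations(facts, size):
            sub = set(subset)
            if keep <= sub and hom_exists(instance, sub, constants):
                return sub
    return set(facts)

CONSTS = {"Alice", "Bob", "SF", "SD", "SJ"}
J1 = {("Home", "Alice", "SF"), ("Home", "Bob", "SD"),
      ("EmpDept", "Alice", "X0"), ("EmpDept", "Bob", "Y0"),
      ("DeptCity", "X0", "SJ"), ("DeptCity", "Y0", "SD"),
      ("EmpDept", "Alice", "X1"), ("EmpDept", "Bob", "Y1"),
      ("DeptCity", "X1", "SJ"), ("DeptCity", "Y1", "SD")}

print(sorted(core(J1, CONSTS)))   # six facts, isomorphic to J0 of Example 3.1
```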


COROLLARY 3.5. Let (S, T, st , t ) be a data exchange setting in which st is a set of tgds and t is a set of tgds and egds. If I is a source instance for which a universal solution exists, then there is a unique (up to isomorphism) universal solution J0 for I having the following properties: — J0 is a core and is isomorphic to the core of every universal solution J for I . — If J is a universal solution for I , there is a one-to-one homomorphism h from J0 to J . Hence, |J0 | ≤ |J |, where |J0 | and |J | are the sizes of J0 and J . We refer to J0 as the core of the universal solutions for I . As an illustration of the concepts discussed in this subsection, recall the data exchange problem of Example 3.1. Then J0 is indeed the core of the universal solutions for I . The core of the universal solutions is the preferred universal solution to materialize in data exchange, since it is the unique most compact universal solution. In turn, this raises the question of how to compute cores of universal solutions. As mentioned earlier, universal solutions can be canonically computed by using the chase. However, the result of such a chase, while a universal solution, need not be the core. In general, an algorithm other than the chase is needed for computing cores of universal solutions. In the next two sections, we study what it takes to compute cores. We begin by analyzing the complexity of computing cores of arbitrary instances and then focus on the computation of cores of universal solutions in data exchange. 4. COMPLEXITY OF CORE IDENTIFICATION Chandra and Merlin [1977] were the first to realize that computing the core of a relational structure is an important problem in conjunctive query processing and optimization. Unfortunately, in its full generality this problem is intractable. Note that computing the core is a function problem, not a decision problem. One way to gauge the difficulty of a function problem is to analyze the computational complexity of its underlying decision problem. Definition 4.1. CORE IDENTIFICATION is the following decision problem: given two structures A and B over some schema R such that B is a substructure of A, is core(A) = B? It is easy to see that CORE IDENTIFICATION is an NP-hard problem. Indeed, consider the following polynomial-time reduction from 3-COLORABILITY: a graph G is 3-colorable if and only if core(G ⊕ K3 ) = K3 , where K3 is the complete graph with 3 nodes and ⊕ is the disjoint sum operation on graphs. This reduction was already given by Chandra and Merlin [1977]. Later on, Hell and Neˇsetˇril [1992] studied the complexity of recognizing whether a graph is a core. In precise terms, CORE RECOGNITION is the following decision problem: given a structure A over some schema R, is A a core? Clearly, this problem is in coNP. Hell and Neˇsetˇril’s [1992] main result is that CORE RECOGNITION is a coNPcomplete problem, even if the inputs are undirected graphs. This is established by exhibiting a rather sophisticated polynomial-time reduction from NON-3COLORABILITY on graphs of girth at least 7; the “gadgets” used in this reduction are pairwise incomparable cores with certain additional properties. It follows ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


that CORE IDENTIFICATION is a coNP-hard problem. Nonetheless, it appears that the exact complexity of CORE IDENTIFICATION has not been pinpointed in the literature until now. In the sequel, we will establish that CORE IDENTIFICATION is a DP-complete problem. We present first some background material about the complexity class DP. The class DP consists of all decision problems that can be written as the intersection of an NP-problem and a coNP-problem; equivalently, DP consists of all decision problems that can be written as the difference of two NP-problems. This class was introduced by Papadimitriou and Yannakakis [1982], who discovered several DP-complete problems. The prototypical DP-complete problem is SAT/UNSAT: given two Boolean formulas φ and ψ, is φ satisfiable and ψ unsatisfiable? Several problems that express some “critical” property turn out to be DP-complete (see Papadimitriou [1994]). For instance, CRITICAL SAT is DP-complete, where an instance of this problem is a CNF-formula φ and the question is to determine whether φ is unsatisfiable, but if any one of its clauses is removed, then the resulting formula is satisfiable. Moreover, Cosmadakis [1983] showed that certain problems related to database query evaluation are DP-complete. Note that DP contains both NP and coNP as subclasses; furthermore, each DP-complete problem is both NP-hard and coNP-hard. The prevailing belief in computational complexity is that the above containments are proper, but proving this remains an outstanding open problem. In any case, establishing that a certain problem is DP-complete is interpreted as signifying that this problem is intractable and, in fact, “more intractable” than an NP-complete problem. Here, we establish that CORE IDENTIFICATION is a DP-complete problem by exhibiting a reduction from 3-COLORABILITY/NON-3-COLORABILITY on graphs of girth at least 7. This reduction is directly inspired by the reduction of NON-3COLORABILITY on graphs of girth at least 7 to CORE RECOGNITION, given in Hell and Neˇsetˇril [1992]. THEOREM 4.2. CORE IDENTIFICATION is DP-complete, even if the inputs are undirected graphs. In proving the above theorem, we make essential use of the following result, which is a special case of Theorem 6 in [Hell and Neˇsetˇril 1992]. Recall that the girth of a graph is the length of the shortest cycle in the graph. THEOREM 4.3 (HELL AND NESˇ ETRˇ IL 1992). For each positive integer N , there is a sequence A1 , . . . A N of connected graphs such that (1) each Ai is 3-colorable, has girth 5, and each edge of Ai is on a 5-cycle; (2) each Ai is a core; moreover, for every i, j with i ≤ n, j ≤ n and i = j , there is no homomorphism from Ai to A j ; (3) each Ai has at most 15(N + 4) nodes; and (4) there is a polynomial-time algorithm that, given N , constructs the sequence A1 , . . . A N . We now have the machinery needed to prove Theorem 4.2. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


PROOF OF THEOREM 4.2. CORE IDENTIFICATION is in DP, because, given two structures A and B over some schema R such that B is a substructure of A, to determine whether core(A) = B one has to check whether there is a homomorphism from A to B (which is in NP) and whether B is a core (which is in coNP). We will show that CORE IDENTIFICATION is DP-hard, even if the inputs are undirected graphs, via a polynomial-time reduction from 3-COLORABILITY/ NON-3-COLORABILITY. As a stepping stone in this reduction, we will define CORE HOMOMORPHISM, which is the following variant of CORE IDENTIFICATION: given two structures A and B, is there a homomorphism from A to B, and is B a core? There is a simple polynomial-time reduction of CORE HOMOMORPHISM to CORE IDENTIFICATION, where the instance (A, B) is mapped onto (A ⊕ B, B). This is a reduction, since there is a homomorphism from A to B with B as a core if and only if core(A⊕B) = B. Thus, it remains to show that there is a polynomial-time reduction of 3-COLORABILITY/NON-3-COLORABILITY to CORE HOMOMORPHISM. Hell and Neˇsetˇril [1992] showed that 3-COLORABILITY is NP-complete even if the input graphs have girth at least 7 (this follows from Theorem 7 in Hell and Neˇsetˇril [1992] by taking A to be a self-loop and B to be K3 ). Hence, 3COLORABILITY/NON-3-COLORABILITY is DP-complete, even if the input graphs G and H have girth at least 7. So, assume that we are given two graphs G and H each having girth at least 7. Let v1 , . . . , vm be an enumeration of the nodes of G, let w1 , . . . , wn be an enumeration of the nodes of H, and let N = m + n. Let A1 , . . . , A N be a sequence of connected graphs having the properties listed in Theorem 4.3. This sequence can be constructed in time polynomial in N ; moreover, we can assume that these graphs have pairwise disjoint sets of nodes. Let G∗ be the graph obtained by identifying each node vi of G with some arbitrarily chosen node of Ai , for 1 ≤ i ≤ m (and keeping the edges between nodes of G intact). Thus, the nodes of G∗ are the nodes that appear in the Ai ’s, and the edges are the edges in the Ai ’s, along with the edges of G under our identification. Similarly, let H∗ be the graph obtained by identifying each node w j of H with some arbitrarily chosen node of A j , for m + 1 ≤ j ≤ N = m + n (and keeping the edges between nodes of H intact). We now claim that G is 3colorable and H is not 3-colorable if and only if there is a homomorphism from G∗ ⊕ K3 to H∗ ⊕ K3 , and H∗ ⊕ K3 is a core. Hell and Neˇsetˇril [1992] showed that CORE RECOGNITION is coNP-complete by showing that a graph H of girth at least 7 is not 3-colorable if and only if the graph H∗ ⊕ K3 is a core. We will use this property in order to establish the above claim. Assume first that G is 3-colorable and H is not 3-colorable. Since each Ai is a 3-colorable graph, G∗ ⊕ K3 is 3-colorable and so there a homomorphism from G∗ ⊕ K3 to H∗ ⊕ K3 (in fact, to K3 ). Moreover, as shown in Hell and Neˇsetˇril [1992], H∗ ⊕ K3 is a core, since H is not 3-colorable. For the other direction, assume that there is a homomorphism from G∗ ⊕ K3 to H∗ ⊕ K3 , and H∗ ⊕ K3 is a core. Using again the results in Hell and Neˇsetˇril [1992], we infer that H is not 3-colorable. It remains to prove that G is 3-colorable. Let h be a homomorphism from G∗ ⊕ K3 to H∗ ⊕ K3 . We claim that h actually maps G∗ to K3 ; hence, G is 3-colorable. Let us consider the image of each graph Ai , with 1 ≤ i ≤ m, under the homomorphism h. 
Observe that Ai cannot be mapped to some A j , when


m + 1 ≤ j ≤ N = m + n, since, for every i and j such that 1 ≤ i ≤ m and m + 1 ≤ j ≤ N = m + n, there is no homomorphism from Ai to A j . Observe also that the image of a cycle C under a homomorphism is a cycle C ′ of length less than or equal the length of C. Since H has girth at least 7 and since each edge of Ai is on a 5-cycle, the image of Ai under h cannot be contained in H. For the same reason, the image of Ai under h cannot contain nodes from H and some A j , for m + 1 ≤ j ≤ N = m + n; moreover, it cannot contain nodes from two different A j ’s, for m + 1 ≤ j ≤ N = m + n (here, we also use the fact that each A j has girth 5). Consequently, the homomorphism h must map each Ai , 1 ≤ i ≤ m, to K3 . Hence, h maps G∗ to K3 , and so G is 3-colorable. It should be noted that problems equivalent to CORE RECOGNITION and CORE IDENTIFICATION have been investigated in logic programming and artificial intel¨ ligence. Specifically, Gottlob and Fermuller [1993] studied the problem of removing redundant literals from a clause, and analyzed the computational complexity of two related decision problems: the problem of determining whether a given clause is condensed and the problem of determining whether, given ¨ two clauses, one is a condensation of the other. Gottlob and Fermuller showed that the first problem is coNP-complete and the second is DP-complete. As it turns out, determining whether a given clause is condensed is equivalent to CORE RECOGNITION, while determining whether a clause is a condensation of another clause is equivalent to CORE IDENTIFICATION. Thus, the complexity of CORE RECOGNITION and CORE IDENTIFICATION for relational structures (but not for undi¨ rected graphs) can also be derived from the results in Gottlob and Fermuller ¨ [1993]. As a matter of fact, the reductions in Gottlob and Fermuller [1993] give easier proofs for the coNP-hardness and DP-hardness of CORE RECOGNITION and CORE IDENTIFICATION, respectively, for undirected graphs with constants, that is, undirected graphs in which certain nodes are distinguished so that every homomorphism maps each such constant to itself (alternatively, graphs with constants can be viewed as relational structures with a binary relation for the edges and unary relations each of which consists of one of the constants). For instance, the coNP-hardness of CORE IDENTIFICATION for graphs with constants can be established via the following reduction from the CLIQUE problem. Given an undirected graph G and a positive integer k, consider the disjoint sum G ⊕ Kk , where Kk is the complete graph with k elements. If every node in G is viewed as a constant, then G ⊕ Kk is a core if and only if G does not contain a clique with k elements. We now consider the implications of the intractability of CORE RECOGNITION for the problem of computing the core of a structure. As stated earlier, Chandra and Merlin [1977] observed that a graph G is 3-colorable if and only if core(G⊕K3 ) = K3 . It follows that, unless P = NP, there is no polynomial-time algorithm for computing the core of a given structure. Indeed, if such an algorithm existed, then we could determine in polynomial time whether a graph is 3-colorable by first running the algorithm to compute the core of G ⊕ K3 and then checking if the answer is equal to K3 . Note, however, that in data exchange we are interested in computing the core of a universal solution, rather than the core of an arbitrary instance. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Consequently, we cannot assume a priori that the above intractability carries over to the data exchange setting, since polynomial-time algorithms for computing the core of universal solutions may exist. We address this next. 5. COMPUTING THE CORE IN DATA EXCHANGE In contrast with the case of computing the core of an arbitrary instance, computing the core of a universal solution in data exchange does have polynomial-time algorithms, in certain natural data exchange settings. Specifically, in this section we show that the core of a universal solution can be computed in polynomial time in data exchange settings in which st is an arbitrary set of tgds and t is a set of egds. We give two rather different polynomial-time algorithms for the task of computing the core in data exchange settings in which st is an arbitrary set of tgds and t is a set of egds: a greedy algorithm and an algorithm we call the blocks algorithm. Section 5.1 is devoted to the greedy algorithm. In Section 5.2 we present the blocks algorithm for data exchange settings with no target constraints (i.e., t = ∅). We then show in Section 5.3 that essentially the same blocks algorithm works if we remove the emptiness condition on t and allow it to contain egds. Although the blocks algorithm is more complicated than the greedy algorithm (and its proof of correctness much more involved), it has certain advantages for data exchange that we will describe later on. In what follows, we assume that (S, T, st , t ) is a data exchange setting such that st is a set of tgds and t is a set of egds. Given a source instance I , we let J be the target instance obtained by chasing I, ∅ with st . We call J a canonical preuniversal instance for I . Note that J is a canonical universal solution for I with respect to the data exchange setting (S, T, st , ∅) (that is, no target constraints). 5.1 Greedy Algorithm Intuitively, given a source instance I , the greedy algorithm first determines whether solutions for I exist, and then, if solutions exist, computes the core of the universal solutions for I by successively removing tuples from a canonical universal solution for I , as long as I and the instance resulting in each step satisfy the tgds in st . Recall that a fact is an expression of the form R(t) indicating that the tuple t belongs to the relation R; moreover, every instance can be identified with the set of all facts arising from the relations of that instance. Algorithm 5.1 (Greedy Algorithm). Input: source instance I . Output: the core of the universal solutions for I , if solutions exist; “failure,” otherwise. (1) Chase I with st to produce a canonical pre-universal instance J . (2) Chase J with t ; if the chase fails, then stop and return “failure”; otherwise, let J ′ be the canonical universal solution for I produced by the chase. (3) Initialize J ∗ to be J ′ . (4) While there is a fact R(t) in J ∗ such that I, J ∗ − {R(t)} satisfies st , set J ∗ to be J ∗ − {R(t)}. (5) Return J ∗ ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
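To make the pruning loop of Algorithm 5.1 concrete, here is a small Python sketch of Steps (3)-(5). It is an illustration under simplifying assumptions, not the paper's implementation: the chase of Steps (1)-(2) is taken as given, a source-to-target tgd is encoded as a pair of query functions (its body evaluated over the source, its witnessed head evaluated over the target), and the (relation, tuple) fact encoding is chosen for the sketch. Only the source-to-target tgds are re-checked inside the loop; as noted in the discussion of the greedy algorithm, egds are preserved when passing to subinstances, so they cannot be violated by deleting facts.

```python
# Sketch of Steps (3)-(5) of the greedy algorithm; the tgd encoding below is an
# assumption of this sketch: each tgd is a pair (body, head) of functions, and
# <I, J> satisfies the tgd exactly when body(I) is a subset of head(J).
def satisfies_st_tgds(source, target, st_tgds):
    for body, head in st_tgds:
        if not body(source) <= head(target):
            return False
    return True

def greedy_core(source, canonical_universal, st_tgds):
    """Drop facts one at a time, as long as the source-to-target tgds stay satisfied."""
    J = set(canonical_universal)
    changed = True
    while changed:
        changed = False
        for fact in sorted(J, key=repr):
            if satisfies_st_tgds(source, J - {fact}, st_tgds):
                J.remove(fact)
                changed = True
                break
    return J

# Toy run: a single tgd P(x) -> exists y Q(x, y); the "canonical" solution below
# contains a redundant fact, which the loop removes.
I = {("P", ("a",))}
J = {("Q", ("a", "_N1")), ("Q", ("a", "_N2"))}
tgd = (lambda inst: {t[0] for r, t in inst if r == "P"},
       lambda inst: {t[0] for r, t in inst if r == "Q"})
print(greedy_core(I, J, [tgd]))   # exactly one of the two Q-facts remains
```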


THEOREM 5.2. Assume that (S, T, st , t ) is a data exchange setting such that st is a set of tgds and t is a set of egds. Then Algorithm 5.1 is a correct, polynomial-time algorithm for computing the core of universal solutions. PROOF. As shown in Fagin et al. [2003] (see also Theorem 2.6), the chase is a correct, polynomial-time algorithm for determining whether, given a source instance I , a solution exists and, if so, producing the canonical universal solution J ′ . Assume that for a given source instance I , a canonical universal solution J ′ for I has been produced in Step (2) of the greedy algorithm. We claim that each target instance J ∗ produced during the iterations of the while loop in Step (4) is a universal solution for I . To begin with, I, J ∗  satisfies the tgds in st by construction. Furthermore, J ∗ satisfies the egds in t , because J ∗ is a subinstance of J ′ , and J ′ satisfies the egds in t . Consequently, J ∗ is a solution for I ; moreover, it is a universal solution, since it is a subinstance of the canonical universal solution J for I and thus it can be mapped homomorphically into every solution for I . Let C be the target instance returned by the algorithm. Then C is a universal solution for I and hence it contains an isomorphic copy J0 of the core of the universal solutions as a subinstance. We claim that C = J0 . Indeed, if there is a fact R(t) in C − J0 , then C − {R(t)} satisfies the tgds in st , since J0 satisfies the tgds in st and is a subinstance of J0 − {R(t)}; thus, the algorithm could not have returned C as output. In order to analyze the running time of the algorithm, we consider the following parameters: m is the size of the source instance I (number of tuples in I ); a is the maximum number of universally quantified variables over all tgds in st ; b is the maximum number of existentially quantified variables over all tgds in st ; finally, a′ is the maximum number of universally quantified variables over all egds in t . Since the data exchange setting is fixed, the quantities a, b, and a′ are constants. Given a source instance I of size m, the size of the canonical preuniversal instance J is O(ma ) and the time needed to produce it is O(ma+ab ). Indeed, the canonical preuniversal instance is constructed by considering each tgd (∀x)(ϕS (x) → (∃y)ψT (x, y)) in st , instantiating the universally quantified variables x with elements from I in every possible way, and, for each such instantiation, checking whether the existentially quantified variables y can be instantiated by existing elements so that the formula ψT (x, y) is satisfied, and, if not, adding null values and facts to satisfy it. Since st is fixed, at most a constant number of facts are added at each step, which accounts for the O(ma ) bound in the size of the canonical preuniversal instance. There are O(ma ) possible instantiations of the universally quantified variables, and for each such instantiation O((ma )b) steps are needed to check whether the existentially quantified variables can be instantiated by existing elements, hence the total time required to construct the canonical preuniversal instance is O(ma+ab ). The size of the canonical universal solution J ′ is also O(ma ) (since it is at ′ most the size of J ) and the time needed to produce J ′ from J is O(maa +2a ). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Indeed, chasing with the egds in t requires at most O((ma )2 ) = O(m2a ) chase steps, since in the worst case every two values will be set equal to each other. ′ Moreover, each chase step takes time O((ma )a ), since at each step we need to instantiate the universally quantified variables in the egds in every possible way. The while loop in Step (4) requires at most O(ma ) iterations each of which takes O(ma+ab ) steps to verify that st is satisfied by I, J ∗ − {R(t)}. Thus, Step (4) takes time O(m2a+ab ). It follows that the running time of the greedy ′ algorithm is O(m2a+ab + m2a+aa ). Several remarks are in order now. First, it should be noted that the correctness of the greedy algorithm depends crucially on the assumption that t consists of egds only. The crucial property that holds for egds, but fails for tgds, is that if an instance satisfies an egd, then every subinstance of it also satisfies that egd. Thus, if the greedy algorithm is applied to data exchange settings in which t contains at least one tgd, then the output of the algorithm may fail to be a solution for the input instance. One can consider a variant of the greedy algorithm in which the test in the while loop is that I, J ∗ − {R(t)} satisfies both st and t . This modified greedy algorithm outputs a universal solution for I , but it is not too hard to construct examples in which the output is not the core of the universal solutions for I . Note that Step (4) of the greedy algorithm can also be construed as a polynomial-time algorithm for producing the core of the universal solutions, given a source instance I and some arbitrary universal solution J ′ for I . The first two steps of the greedy algorithm produce a universal solution for I in time polynomial in the size of the source instance I or determine that no solution for I exists, so that the entire greedy algorithm runs in time polynomial in the size of I . Although the greedy algorithm is conceptually simple and its proof of correctness transparent, it requires that the source instance I be available throughout the execution of the algorithm. There are situations, however, in which the original source I becomes unavailable, after a canonical universal solution J ′ for I has been produced. In particular, the Clio system [Popa et al. 2002] uses a specialized engine to produce a canonical universal solution, when there are no target constraints, or a canonical preuniversal instance, when there are target constraints. Any further processing, such as chasing with target egds or producing the core, will have to be done by another engine or application that may not have access to the original source instance. This state of affairs raises the question of whether the core of the universal solutions can be produced in polynomial time using only a canonical universal solution or only a canonical pre-universal instance. In what follows, we describe such an algorithm, called the blocks algorithm, which has the feature that it can start from either a canonical universal solution or a canonical pre-universal instance, and has no further need for the source instance. We present the blocks algorithms in two stages: first, for the case in which there are no target constraints (t = ∅), and then for the case in which t is a set of egds. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


5.2 Blocks Algorithm: No Target Constraints

We first define some notions that are needed in order to state the algorithm as well as to prove its correctness and polynomial-time bound. For the next two definitions, we assume K to be an arbitrary instance whose elements consist of constants from Const and nulls from Var. We say that two elements of K are adjacent if there exists some tuple in some relation of K in which both elements occur.

Definition 5.3. The Gaifman graph of the nulls of K is an undirected graph in which (1) the nodes are all the nulls of K, and (2) there exists an edge between two nulls whenever the nulls are adjacent in K. A block of nulls is the set of nulls in a connected component of the Gaifman graph of the nulls. If y is a null of K, then we may refer to the block of nulls that contains y as the block of y.

Note that, by the definition of blocks, the set Var(K) of all nulls of K is partitioned into disjoint blocks. Let K and K′ be two instances with elements in Const ∪ Var. Recall that K′ is a subinstance of K if every tuple of a relation of K′ is a tuple of the corresponding relation of K.

Definition 5.4. Let h be a homomorphism of K. Denote the result of applying h to K by h(K). If h(K) is a subinstance of K, then we call h an endomorphism of K. An endomorphism h of K is useful if h(K) ≠ K (i.e., h(K) is a proper subinstance of K).

The following lemma is a simple characterization of useful endomorphisms that we will make use of in proving the main results of this subsection and of Section 5.3.

LEMMA 5.5. Let K be an instance, and let h be an endomorphism of K. Then h is useful if and only if h is not one-to-one.

PROOF. Assume that h is not one-to-one. Then there is some x that is in the domain of h but not in the range of h (here we use the fact that the instance is finite). So no tuple containing x is in h(K). Therefore, h(K) ≠ K, and so h is useful. Now assume that h is one-to-one. So h is simply a renaming of the members of K, and so an isomorphism of K. Thus, h(K) has the same number of tuples as K. Since h(K) is a subinstance of K, it follows that h(K) = K (here again we use the fact that the instance K is finite). So h is not useful.

For the rest of this subsection, we assume that we are given a data exchange setting (S, T, Σst, ∅) and a source instance I. Moreover, we assume that J is a canonical universal solution for this data exchange problem. That is, J is such that ⟨I, J⟩ is the result of chasing ⟨I, ∅⟩ with Σst. Our goal is to compute core(J), that is, a subinstance C of J such that (1) C = h(J) for some endomorphism h of J, and (2) there is no proper subinstance of C with the same property (condition (2) is equivalent to there being no endomorphism of C onto a proper subinstance of C). The central idea of the algorithm, as we shall see, is to show that the above-mentioned endomorphism h of J can be found as the composition of a polynomial-length sequence of "local" (or "small") endomorphisms, each of which can be found in polynomial time.
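As a concrete reading of Definition 5.3, the following Python fragment partitions the nulls of an instance into blocks using a union-find structure over the Gaifman graph of the nulls. It is illustrative only; the (relation, tuple) fact encoding and the "_N" naming convention for nulls are assumptions of the sketch.

```python
from collections import defaultdict

# Facts are (relation_name, tuple) pairs; values starting with "_N" play the
# role of labeled nulls (a representation choice made for this sketch).
def is_null(v):
    return isinstance(v, str) and v.startswith("_N")

def blocks_of_nulls(instance):
    """Partition the nulls into blocks, i.e., connected components of the
    Gaifman graph of the nulls (Definition 5.3), via union-find."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry

    # Every null starts in its own block; nulls co-occurring in a fact are merged.
    for _, tup in instance:
        nulls = [v for v in tup if is_null(v)]
        for v in nulls:
            parent.setdefault(v, v)
        for u, v in zip(nulls, nulls[1:]):
            union(u, v)

    groups = defaultdict(set)
    for v in parent:
        groups[find(v)].add(v)
    return list(groups.values())

# Example: _N1 and _N2 share a fact, so they form one block; _N3 is alone.
J = {("R", ("a", "_N1")), ("R", ("_N1", "_N2")), ("S", ("_N3", "b"))}
print(blocks_of_nulls(J))   # e.g., [{'_N1', '_N2'}, {'_N3'}]
```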


We next define what "local" means.

Definition 5.6. Let K and K′ be two instances such that the nulls of K′ form a subset of the nulls of K, that is, Var(K′) ⊆ Var(K). Let h be some endomorphism of K′, and let B be a block of nulls of K. We say that h is K-local for B if h(x) = x whenever x ∉ B. (Since all the nulls of K′ are among the nulls of K, it makes sense to consider whether or not a null x of K′ belongs to the block B of K.) We say that h is K-local if it is K-local for B, for some block B of K.

The next lemma is crucial for the existence of the polynomial-time algorithm for computing the core of a universal solution.

LEMMA 5.7. Assume a data exchange setting where Σst is a set of tgds and Σt = ∅. Let J′ be a subinstance of the canonical universal solution J. If there exists a useful endomorphism of J′, then there exists a useful J-local endomorphism of J′.

PROOF. Let h be a useful endomorphism of J′. By Lemma 5.5, we know that h is not one-to-one. So there is a null y that appears in J′ but does not appear in h(J′). Let B be the block of y (in J). Define h′ on J′ by letting h′(x) = h(x) if x ∈ B, and h′(x) = x otherwise. We show that h′ is an endomorphism of J′. Let (u1, . . . , us) be a tuple of the R relation of J′; we must show that (h′(u1), . . . , h′(us)) is a tuple of the R relation of J′. Since J′ is a subinstance of J, the tuple (u1, . . . , us) is also a tuple of the R relation of J. Hence, by definition of a block of J, all the nulls among u1, . . . , us are in the same block B′. There are two cases, depending on whether or not B′ = B. Assume first that B′ = B. Then, by definition of h′, for every ui among u1, . . . , us, we have that h′(ui) = h(ui) if ui is a null, and h′(ui) = ui = h(ui) if ui is a constant. Hence (h′(u1), . . . , h′(us)) = (h(u1), . . . , h(us)). Since h is an endomorphism of J′, we know that (h(u1), . . . , h(us)) is a tuple of the R relation of J′. Thus, (h′(u1), . . . , h′(us)) is a tuple of the R relation of J′. Now assume that B′ ≠ B. So for every ui among u1, . . . , us, we have that h′(ui) = ui. Hence (h′(u1), . . . , h′(us)) = (u1, . . . , us). Therefore, once again, (h′(u1), . . . , h′(us)) is a tuple of the R relation of J′, as desired. Hence, h′ is an endomorphism of J′. By construction, h′ is J-local for B. Moreover, y is not in the range of h′: if x ∈ B, then h′(x) = h(x) ≠ y, since y does not appear in h(J′), and if x ∉ B, then h′(x) = x ≠ y, since y ∈ B. Since y appears in J′, it follows that h′(J′) ≠ J′, and so h′ is useful.

We now present the blocks algorithm for computing the core of the universal solutions, when Σt = ∅.

Algorithm 5.8 (Blocks Algorithm: No Target Constraints).
Input: source instance I.
Output: the core of the universal solutions for I.
(1) Compute J, the canonical universal solution, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and initialize J′ to be J.
(3) Check whether there exists a useful J-local endomorphism h of J′. If not, then stop with result J′.
(4) Update J′ to be h(J′), and return to Step (3).
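The next sketch assembles Algorithm 5.8 end to end under the same illustrative conventions as above: facts are (relation, tuple) pairs, values prefixed with "_N" play the role of nulls, and candidate J-local endomorphisms are found by brute-force enumeration over one block at a time, which mirrors the n^b search bound discussed in the proof of Theorem 5.9 below. This is a sketch of the idea, not the paper's implementation.

```python
from itertools import product
from collections import defaultdict

def is_null(v): return isinstance(v, str) and v.startswith("_N")

def nulls_of(inst): return {v for _, t in inst for v in t if is_null(v)}

def blocks(inst):
    """Blocks of nulls = connected components of the Gaifman graph of the nulls."""
    adj = defaultdict(set)
    for _, t in inst:
        ns = [v for v in t if is_null(v)]
        for u in ns:
            adj[u].update(ns)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        result.append(comp)
    return result

def apply(h, inst):
    return {(r, tuple(h.get(v, v) for v in t)) for r, t in inst}

def useful_local_endomorphism(J, Jp):
    """Step (3): look for a map that is the identity outside one block of J,
    sends Jp into itself, and shrinks it (i.e., is useful, cf. Lemma 5.5)."""
    domain = sorted({v for _, t in Jp for v in t}, key=repr)
    for B in blocks(J):
        vars_in_B = sorted(B & nulls_of(Jp))
        if not vars_in_B:
            continue
        for image in product(domain, repeat=len(vars_in_B)):
            h = dict(zip(vars_in_B, image))
            hJp = apply(h, Jp)
            if hJp <= Jp and hJp != Jp:      # endomorphism of Jp, and useful
                return h
    return None

def core_of_universal_solution(J):
    """Blocks algorithm for a canonical universal solution J (no target constraints):
    repeatedly shrink J by useful J-local endomorphisms."""
    Jp = set(J)
    while True:
        h = useful_local_endomorphism(J, Jp)
        if h is None:
            return Jp
        Jp = apply(h, Jp)

# Toy run: the second fact is redundant, so only the first one survives.
J = {("R", ("a", "b")), ("R", ("a", "_N1"))}
print(core_of_universal_solution(J))   # {('R', ('a', 'b'))}
```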


THEOREM 5.9. Assume that (S, T, Σst, Σt) is a data exchange setting such that Σst is a set of tgds and Σt = ∅. Then Algorithm 5.8 is a correct, polynomial-time algorithm for computing the core of the universal solutions.

PROOF. We first show that Algorithm 5.8 is correct, that is, that the final instance C at the conclusion of the algorithm is the core of the given universal solution. Every time we apply Step (4) of the algorithm, we are replacing the instance by a homomorphic image. Therefore, the final instance C is the result of applying a composition of homomorphisms to the input instance, and hence is a homomorphic image of the canonical universal solution J. Also, since each of the homomorphisms found in Step (3) is an endomorphism, we have that C is a subinstance of J. Assume now that C is not the core; we shall derive a contradiction. Since C is not the core, there is an endomorphism h such that when h is applied to C, the resulting instance is a proper subinstance of C. Hence, h is a useful endomorphism of C. Therefore, by Lemma 5.7, there must exist a useful J-local endomorphism of C. But then Algorithm 5.8 should not have stopped in Step (3) with C. This is the desired contradiction. Hence, C is the core of J.

We now show that Algorithm 5.8 runs in polynomial time. To do so, we need to consider certain parameters. As in the analysis of the greedy algorithm, the first parameter, denoted by b, is the maximum number of existentially quantified variables over all tgds in Σst. Since we are taking Σst to be fixed, the quantity b is a constant. It follows easily from the construction of the canonical universal solution J (by chasing with Σst) that b is an upper bound on the size of a block in J. The second parameter, denoted by n, is the size of the canonical universal solution J (number of tuples in J); as seen in the analysis of the greedy algorithm, n is O(m^a), where a is the maximum number of the universally quantified variables over all tgds in Σst and m is the size of I. Let J′ be the instance in some execution of Step (3). For each block B, to check if there is a useful endomorphism of J′ that is J-local for B, we can exhaustively check each of the possible functions h on the domain of J′ such that h(x) = x whenever x ∉ B: there are at most n^b such functions. To check that such a function is actually a useful endomorphism requires time O(n). Since there are at most n blocks, the time to determine if there is a block with a useful J-local endomorphism is O(n^(b+2)). The updating time in Step (4) is O(n). By Lemma 5.5, after Step (4) is executed, there is at least one less null in J′ than there was before. Since there are initially at most n nulls in the instance, it follows that the number of loops that Algorithm 5.8 performs is at most n. Therefore, the running time of the algorithm (except for Step (1) and Step (2), which are executed only once) is at most n (the number of loops) times O(n^(b+2)), that is, O(n^(b+3)). Since Step (1) and Step (2) take polynomial time as well, it follows that the entire algorithm executes in polynomial time.

The crucial observation behind the polynomial-time bound is that the total number of endomorphisms that the algorithm explores in Step (3) is at most n^b for each block of J. This is in strong contrast with the case of minimizing arbitrary instances with constants and nulls for which we may need to explore


a much larger number of endomorphisms (up to nn , in general) in one minimization step. 5.3 Blocks Algorithm: Target Egds In this subsection, we extend Theorem 5.9 by showing that there is a polynomial-time algorithm for finding the core even when t is a set of egds. Thus, we assume next that we are given a data exchange setting (S, T, st , t ) where t is a set of egds. We are also given a source instance I . As with the greedy algorithm, let J be a canonical preuniversal instance, that is, J is the result of chasing I with st . Let J ′ be the canonical universal solution obtained by chasing J with t . Our goal is to compute core(J ′ ), that is, a subinstance C of J ′ such that C = h(J ′ ) for some endomorphism h of J ′ , and such that there is no proper subinstance of C with the same property. As in the case when t = ∅, the central idea of the algorithm is to show that the above mentioned endomorphism h of J ′ can be found as the composition of a polynomial-length sequence of “small” endomorphisms, each findable in polynomial time. As in the case when t = ∅, “small” will mean J -local. We make this precise in the next lemma. This lemma, crucial for the existence of the polynomial-time algorithm for computing core(J ′ ), is a nontrivial generalization of Lemma 5.7. LEMMA 5.10. Assume a data exchange setting where st is a set of tgds and t is a set of egds. Let J be the canonical preuniversal instance, and let J ′′ be an endomorphic image of the canonical universal solution J ′ . If there exists a useful endomorphism of J ′′ , then there exists a useful J -local endomorphism of J ′′ . The proof of Lemma 5.10 requires additional definitions as well as two additional lemmas. We start with the required definitions. Let J be the canonical preuniversal instance, and let J ′ be the canonical universal solution produced from J by chasing with the set t of egds. We define a directed graph, whose nodes are the members of J , both nulls and constants. If during the chase process, a null u gets replaced by v (either a null or a constant), then there is an edge from u to v in the graph. Let ≤ be the reflexive, transitive closure of this graph. It is easy to see that ≤ is a reflexive partial order. For each node u, define [u] to be the maximal (under ≤) node v such that u ≤ v. Intuitively, u eventually gets replaced by [u] as a result of the chase. It is clear that every member of J ′ is of the form [u]. It is also clear that if u is a constant, then u = [u]. Let us write u ∼ v if [u] = [v]. Intuitively, u ∼ v means that u and v eventually collapse to the same element as a result of the chase. Definition 5.11. Let K be an instance whose elements are constants and nulls. Let y be some element of K . We say that y is rigid if h( y) = y for every homomorphism h of K . (In particular, all constants occurring in K are rigid.) A key step in the proof of Lemma 5.10 is the following surprising result, which says that if two nulls in different blocks of J both collapse onto the same element z of J ′ as a result of the chase, then z is rigid, that is, h(z) = z for every endomorphism h of J ′ . ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


LEMMA 5.12 (RIGIDITY LEMMA). Assume a data exchange setting where st is a set of tgds and t is a set of egds. Let J be the canonical preuniversal instance, and let J ′ be the result of chasing J with the set t of egds. Let x and y be nulls of J such that x ∼ y, and such that [x] is a nonrigid null of J ′ . Then x and y are in the same block of J . PROOF. Assume that x and y are nulls in different blocks of J with x ∼ y. We must show that [x] is rigid in J ′ . Let φ be the diagram of the instance J , that is, the conjunction of all expressions S(u1 , . . . , us ) where (u1 , . . . , us ) is a tuple of the S relation of J . (We are treating members of J , both constants and nulls, as variables.) Let τ be the egd φ → (x = y). Since x ∼ y, it follows that t |= τ . This is because the chase sets variables equal only when it is logically forced to (the result appears in papers that characterize the implication problem for dependencies; see, for instance, Beeri and Vardi [1984]; Maier et al. [1979]). Since J ′ satisfies t , it follows that J ′ satisfies τ . We wish to show that [x] is rigid in J ′ . Let h be a homomorphism of J ′ ; we must show that h([x]) = [x]. Let B be the block of x in J . Let V be the assignment to the variables of τ obtained by letting V (u) = h([u]) if u ∈ B, and V (u) = [u] otherwise. We now show that V is a valid assignment for φ in J ′ , that is, that for each conjunct S(u1 , . . . , us ) of φ, necessarily (V (u1 ), . . . , V (us )) is a tuple of the S relation of J ′ . Let S(u1 , . . . , us ) be a conjunct of φ. By the construction of the chase, we know that ([u1 ], . . . , [us ]) is a tuple of the S relation of J ′ , since (u1 , . . . , us ) is a tuple of the S relation of J . There are two cases, depending on whether or not some ui (with 1 ≤ i ≤ s) is in B. If no ui is in B, then V (ui ) = [ui ] for each i, and so (V (u1 ), . . . , V (us )) is a tuple of the S relation of J ′ , as desired. If some ui is in B, then every ui is either a null in B or a constant (this is because (u1 , . . . , us ) is a tuple of the S relation of J ). If ui is a null in B, then V (ui ) = h([ui ]). If ui is a constant, then ui = [ui ], and so V (ui ) = [ui ] = ui = h(ui ) = h([ui ]), where the third equality holds since h is a homomorphism and ui is a constant. Thus, in both cases, we have V (ui ) = h([ui ]). Since ([u1 ], . . . , [us ]) is a tuple of the S relation of J ′ and h is a homomorphism of J ′ , we know that (h[u1 ], . . . , h[us ]) is a tuple of the S relation of J ′ . So again, (V (u1 ), . . . , V (us )) is a tuple of the S relation of J ′ , as desired. Hence, V is a valid assignment for φ in J ′ . Therefore, since J ′ satisfies τ , it follows that in J ′ , we have V (x) = V ( y). Now V (x) = h([x]), since x ∈ B. Further, V ( y) = [ y], since y ∈ B (because y is in a different block than x). So h([x]) = [ y]. Since x ∼ y, that is, [x] = [ y], we have h([x]) = [ y] = [x], which shows that h([x]) = [x], as desired. The contrapositive of Lemma 5.12 says that if x and y are nulls in different blocks of J that are set equal (perhaps transitively) during the chase, then [x] is rigid in J ′ . LEMMA 5.13. Let h be an endomorphism of J ′ . Then every rigid element of J is a rigid element of h(J ′ ). ′

PROOF. Let u be a rigid element of J′. Then h(u) is an element of h(J′), and so u is an element of h(J′), since h(u) = u by rigidity. Let ĥ be a homomorphism of h(J′); we must show that ĥ(u) = u. But ĥ(u) = ĥh(u), since h(u) = u.


Now ĥh is also a homomorphism of J′, since the composition of homomorphisms is a homomorphism. By rigidity of u in J′, it follows that ĥh(u) = u. So ĥ(u) = ĥh(u) = u, as desired.

We are now ready to give the proof of Lemma 5.10, after which we will present the blocks algorithm for the case of target egds.

PROOF OF LEMMA 5.10. Let h be an endomorphism of J′ such that J′′ = h(J′), and let h′ be a useful endomorphism of h(J′). By Lemma 5.5, there is a null y that appears in h(J′) but does not appear in h′h(J′). Let B be the block in J that contains y. Define h′′ on h(J′) by letting h′′(x) = h′(x) if x ∈ B, and h′′(x) = x otherwise. We shall show that h′′ is a useful J-local endomorphism of h(J′).

We now show that h′′ is an endomorphism of h(J′). Let (u1, . . . , us) be a tuple of the R relation of h(J′); we must show that (h′′(u1), . . . , h′′(us)) is a tuple of the R relation of h(J′). We first show that every nonrigid null among u1, . . . , us is in the same block of J. Let up and uq be nonrigid nulls among u1, . . . , us; we show that up and uq are in the same block of J. Since (u1, . . . , us) is a tuple of the R relation of h(J′), and h(J′) is a subinstance of J′, we know that (u1, . . . , us) is a tuple of the R relation of J′. By construction of J′ from J using the chase, we know that there is u′i where ui ∼ u′i for 1 ≤ i ≤ s, such that (u′1, . . . , u′s) is a tuple of the R relation of J. Since up and uq are nonrigid nulls of h(J′), it follows from Lemma 5.13 that up and uq are nonrigid nulls of J′. Now u′p is not a constant, since u′p ∼ up and up is a nonrigid null. Similarly, u′q is not a constant. So u′p and u′q are in the same block B′ of J. Now [up] = up, since up is in J′. Since u′p ∼ up and [up] = up is nonrigid, it follows from Lemma 5.12 that u′p and up are in the same block of J, and so up ∈ B′. Similarly, uq ∈ B′. So up and uq are in the same block B′ of J, as desired.

There are now two cases, depending on whether or not B′ = B. Assume first that B′ = B. For those ui's that are nonrigid, we showed that ui ∈ B′ = B, and so h′′(ui) = h′(ui). For those ui's that are rigid (including nulls and constants), we have h′′(ui) = ui = h′(ui). So for every ui among u1, . . . , us, we have h′′(ui) = h′(ui). Since h′ is a homomorphism of h(J′), and since (u1, . . . , us) is a tuple of the R relation of h(J′), we know that (h′(u1), . . . , h′(us)) is a tuple of the R relation of h(J′). Hence (h′′(u1), . . . , h′′(us)) is a tuple of the R relation of h(J′), as desired. Now assume that B′ ≠ B. For those ui's that are nonrigid, we showed that ui ∈ B′, and so ui ∉ B. Hence, for those ui's that are nonrigid, we have h′′(ui) = ui. But also h′′(ui) = ui for the rigid ui's. Thus, (h′′(u1), . . . , h′′(us)) = (u1, . . . , us). Hence, once again, (h′′(u1), . . . , h′′(us)) is a tuple of the R relation of h(J′), as desired. So h′′ is an endomorphism of h(J′).

By definition, h′′ is J-local. We now show that h′′ is useful. Since y appears in h(J′), Lemma 5.5 tells us that we need only show that the range of h′′ does not contain y. If x ∈ B, then h′′(x) = h′(x) ≠ y, since the range of h′ does not include y. If x ∉ B, then h′′(x) = x ≠ y, since y ∈ B. So the range of h′′ does not contain y, and hence h′′ is useful. Therefore, h′′ is a useful J-local endomorphism of h(J′).


We now present the blocks algorithm for computing the core when Σt is a set of egds. (As mentioned earlier, when the target constraints include egds, it may be possible that there are no solutions and hence no universal solutions. This case is detected by our algorithm, and "failure" is returned.)

Algorithm 5.14 (Blocks Algorithm: Target egds).
Input: source instance I.
Output: the core of the universal solutions for I, if solutions exist, and "failure", otherwise.
(1) Compute J, the canonical preuniversal instance, from ⟨I, ∅⟩ by chasing with Σst.
(2) Compute the blocks of J, and then chase J with Σt to produce the canonical universal solution J′. If the chase fails, then stop with "failure." Otherwise, initialize J′′ to be J′.
(3) Check whether there exists a useful J-local endomorphism h of J′′. If not, then stop with result J′′.
(4) Update J′′ to be h(J′′), and return to Step (3).
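Step (2) of Algorithm 5.14 chases the preuniversal instance with the target egds. The sketch below illustrates the standard idea: repeatedly equate the two values bound by a violated egd, preferring to keep constants, and fail when two distinct constants would have to be equated. The fact encoding, the (body, (x, y)) encoding of egds, and the restriction to variable-only bodies are assumptions made for this illustration.

```python
# An egd is modelled as (body, (x, y)): a list of atoms over variables whose every
# match in the instance forces the values bound to x and y to be equal.
def is_null(v): return isinstance(v, str) and v.startswith("_N")

def matches(body, inst):
    """Naively enumerate all variable assignments mapping every body atom to a fact."""
    def extend(atoms, env):
        if not atoms:
            yield dict(env)
            return
        (rel, args), rest = atoms[0], atoms[1:]
        for r, t in inst:
            if r != rel or len(t) != len(args):
                continue
            env2, ok = dict(env), True
            for a, v in zip(args, t):
                if a in env2 and env2[a] != v:
                    ok = False
                    break
                env2[a] = v
            if ok:
                yield from extend(rest, env2)
    yield from extend(list(body), {})

def chase_with_egds(inst, egds):
    """Equate values as dictated by the egds; return None ("failure") if two
    distinct constants are forced to be equal."""
    inst = set(inst)
    changed = True
    while changed:
        changed = False
        for body, (x, y) in egds:
            for env in matches(body, inst):
                a, b = env[x], env[y]
                if a == b:
                    continue
                if not is_null(a) and not is_null(b):
                    return None                               # chase failure
                old, new = (a, b) if is_null(a) else (b, a)   # keep the constant
                inst = {(r, tuple(new if v == old else v for v in t)) for r, t in inst}
                changed = True
                break
            if changed:
                break
    return inst

# Toy example: the egd E(u,v) ∧ E(u,w) -> v = w collapses the two nulls below.
J = {("E", ("a", "_N1")), ("E", ("a", "_N2"))}
egds = [([("E", ("u", "v")), ("E", ("u", "w"))], ("v", "w"))]
print(chase_with_egds(J, egds))   # one fact remains; which null survives may vary
```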

THEOREM 5.15. Assume that (S, T, st , t ) is a data exchange setting such that st is a set of tgds and t is a set of egds. Then Algorithm 5.14 is a correct, polynomial-time algorithm for computing the core of the universal solutions. PROOF. The proof is essentially the same as that of Theorem 5.9, except that we make use of Lemma 5.10 instead of Lemma 5.7. For the correctness of the algorithm, we use the fact that each h(J ′′ ) is both a homomorphic image and a subinstance of the canonical universal solution J ′ ; hence it satisfies both the tgds in st and the egds in t . For the running time of the algorithm, we also use the fact that chasing with egds (used in Step (2)) is a polynomial-time procedure. We note that it is essential for the polynomial-time upper bound that the endomorphisms explored by Algorithm 5.14 are J -local and not merely J ′ -local. While, as argued earlier in the case t = ∅, the blocks of J are bounded in size by the constant b (the maximal number of existentially quantified variables over all tgds in st ), the same is not true, in general, for the blocks of J ′ . The chase with egds, used to obtain J ′ , may generate blocks of unbounded size. Intuitively, if an egd equates the nulls x and y that are in different blocks of J , then this creates a new, larger, block out of the union of the blocks of x and y. 5.4 Can We Obtain the Core Via the Chase? A universal solution can be obtained via the chase [Fagin et al. 2003]. What about the core? In this section, we show by example that the core may not be obtainable via the chase. We begin with a preliminary example. Example 5.16. We again consider our running example from Example 2.2. If we chase the source instance I of Example 2.2 by first chasing with the dependencies (d 2 ) and (d 3 ), and then by the dependencies (d 1 ) and (d 4 ), neither of which add any tuples, then the result is the core J0 , as given in Example 2.2. If, however, we chase first with the dependency (d 1 ), then with the dependencies ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


(d 2 ) and (d 3 ), and finally with the dependency (d 4 ), which does not add any tuples, then the result is the target instance J , as given in Example 2.2, rather than the core J0 . In Example 5.16 , the result of the chase may or may not be the core, depending on the order of the chase steps. We now give an example where there is no chase (that is, no order of doing the chase steps) that produces the core. Example 5.17. Assume that the source schema consists of one 4-ary relation symbol R and the target schema consists of one 5-ary relation symbol S. There are two source-to-target tgds d 1 and d 2 , where d 1 is R(a, b, c, d ) → ∃x1 ∃x2 ∃x3 ∃x4 ∃x5 (S(x5 , b, x1 , x2 , a) ∧S(x5 , c, x3 , x4 , a) ∧S(d , c, x3 , x4 , b)) and where d 2 is R(a, b, c, d ) → ∃x1 ∃x2 ∃x3 ∃x4 ∃x5 (S(d , a, a, x1 , b) ∧S(x5 , a, a, x1 , a) ∧S(x5 , c, x2 , x3 , x4 )). The source instance I is {R(1, 1, 2, 3)}. The result of chasing I with d 1 only is {S(N5 , 1, N1 , N2 , 1), S(N5 , 2, N3 , N4 , 1), S(3, 2, N3 , N4 , 1)},

(1)

where N1 , N2 , N3 , N4 , N5 are nulls. The result of chasing I with d 2 only is {S(3, 1, 1, N1′ , 1), S(N5′ , 1, 1, N1′ , 1), S(N5′ , 2, N2′ , N3′ , N4′ )},

(2)

where N1′ , N2′ , N3′ , N4′ , N5′ are nulls. Let J be the universal solution that is the union of (1) and (2). We now show that the core of J is given by the following instance J0 , which consists of the third tuple of (1) and the first tuple of (2): {S(3, 2, N3 , N4 , 1), S(3, 1, 1, N1′ , 1)}. First, it is straightforward to verify that J0 is the image of the universal solution J under the following endomorphism h: h(N1 ) = 1; h(N2 ) = N1′ ; h(N3 ) = N3 ; h(N4 ) = N4 ; h(N5 ) = 3; h(N1′ ) = N1′ ; h(N2′ ) = N3 ; h(N3′ ) = N4 ; h(N4′ ) = 1; and h(N5′ ) = 3. Second, it is easy to see that there is no endomorphism of J0 into a proper substructure of J0 . From these two facts, it follows immediately that J0 is the core. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
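The claim that h maps J onto J0 in Example 5.17 is easy to check mechanically. In the sketch below, the tuples of (1) and (2) are transcribed as Python tuples, with primes written as a trailing "p" (so N1′ becomes "N1p"); this encoding is only for the check.

```python
# Mechanical check for Example 5.17: applying the listed endomorphism h to the
# universal solution J (the union of (1) and (2)) yields exactly J0, and h fixes
# the constants 1, 2, 3.
J = {
    ("N5", 1, "N1", "N2", 1), ("N5", 2, "N3", "N4", 1), (3, 2, "N3", "N4", 1),
    (3, 1, 1, "N1p", 1), ("N5p", 1, 1, "N1p", 1), ("N5p", 2, "N2p", "N3p", "N4p"),
}
J0 = {(3, 2, "N3", "N4", 1), (3, 1, 1, "N1p", 1)}
h = {"N1": 1, "N2": "N1p", "N3": "N3", "N4": "N4", "N5": 3,
     "N1p": "N1p", "N2p": "N3", "N3p": "N4", "N4p": 1, "N5p": 3}

image = {tuple(h.get(v, v) for v in t) for t in J}
print(image == J0)   # True: h is an endomorphism of J whose image is J0
```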


Since the result of chasing first with d 1 has three tuples, and since the core has only two tuples, it follows that the result of chasing first with d 1 and then d 2 does not give the core. Similarly, the result of chasing first with d 2 and then d 1 does not give the core. Thus, no chase gives the core, which was to be shown. This example has several other features built into it. First, it is not possible to remove a conjunct from the right-hand side of d 1 and still maintain a dependency equivalent to d 1 . A similar comment applies to d 2 . Therefore, the fact that no chase gives the core is not caused by the right-hand side of a source-to-target tgd having a redundant conjunct. Second, the Gaifman graph of the nulls as determined by (1) is connected. Intuitively, this tells us that the tgd d 1 cannot be “decomposed” into multiple tgds with the same left-hand side. A similar comment applies to d 2 . Therefore, the fact that no chase gives the core is not caused by the tgds being “decomposable.” Third, not only does the set (1) of tuples not appear in the core, but even the core of (1), which consists of the first and third tuples of (1), does not appear in the core. A similar comment applies to (2), whose core consists of the first and third tuples of (2). So even if we were to modify the chase by inserting, at each chase step, only the core of the set of tuples generated by applying a given tgd, we still would not obtain the core as the result of a chase. 6. QUERY ANSWERING WITH CORES Up to this point, we have shown that there are two reasons for using cores in data exchange: first, they are the smallest universal solutions, and second, they are polynomial-time computable in many natural data exchange settings. In this section, we provide further justification for using cores in data exchange by establishing that they have clear advantages over other universal solutions in answering target queries. Assume that (S, T, st , t ) is a data exchange setting, I is a source instance, and J0 is the core of the universal solutions for I . If q is a union of conjunctive queries over the target schema T, then, by Proposition 2.7, for every universal solution J for I , we have that certain(q, I ) = q(J )↓ . In particular, certain(q, I ) = q(J0 )↓ , since J0 is a universal solution. Suppose now that q is a conjunctive query with inequalities = over the target schema. In general, if J is a universal solution, then q(J )↓ may properly contain certain(q, I ). We illustrate this point with the following example. Example 6.1. Let us revisit our running example from Example 2.2. We saw earlier in Example 3.1 that, for every m ≥ 0, the target instance Jm = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, X 0 ), EmpDept(Bob, Y 0 ), DeptCity(X 0 , SJ), DeptCity(Y 0 , SD), ... EmpDept(Alice, X m ), EmpDept(Bob, Y m ), DeptCity(X m , SJ), DeptCity(Y m , SD)} ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


is a universal solution for I; moreover, J0 is the core of the universal solutions for I. Consider now the following conjunctive query q with one inequality:
∃D1∃D2 (EmpDept(e, D1) ∧ EmpDept(e, D2) ∧ (D1 ≠ D2)).
Clearly, q(J0) = ∅, while if m ≥ 1, then q(Jm) = {Alice, Bob}. This implies that certain(q, I) = ∅, and thus evaluating the above query q on the universal solution Jm, for arbitrary m ≥ 1, produces a strict superset of the set of the certain answers. In contrast, evaluating q on the core J0 coincides with the set of the certain answers, since q(J0) = ∅ = certain(q, I).

This example can also be used to illustrate another difference between conjunctive queries and conjunctive queries with inequalities. Specifically, if J and J′ are universal solutions for I, and q∗ is a conjunctive query over the target schema, then q∗(J)↓ = q∗(J′)↓. In contrast, this does not hold for the above conjunctive query q with one inequality. Indeed, q(J0) = ∅ while q(Jm) = {Alice, Bob}, for every m ≥ 1.

In the preceding example, the certain answers of a particular conjunctive query with inequalities could be obtained by evaluating the query on the core of the universal solutions. As shown in the next example, however, this does not hold true for arbitrary conjunctive queries with inequalities.

Example 6.2. Referring to our running example, consider again the universal solutions Jm, for m ≥ 0, from Example 6.1. In particular, recall the instance J0, which is the core of the universal solutions for I, and which has two distinct labeled nulls X0 and Y0, denoting unknown departments. Besides their role as placeholders for department values, the role of such nulls is also to "link" employees to the cities they work in, as specified by the tgd (d2) in Σst. For data exchange, it is important that such nulls be different from constants and different from each other. Universal solutions such as J0 naturally satisfy this requirement. In contrast, the target instance
J0′ = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, X0), EmpDept(Bob, X0), DeptCity(X0, SJ), DeptCity(X0, SD)}
is a solution for I (this is the same instance, modulo renaming of nulls, as the earlier instance J0′ of Example 2.2), but not a universal solution for I, because it uses the same null for both source tuples (Alice, SJ) and (Bob, SD) and, hence, there is no homomorphism from J0′ to J0. In this solution, the association between Alice and SJ as well as the association between Bob and SD have been lost. Let q be the following conjunctive query with one inequality:
∃D∃D′ (EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)).
It is easy to see that q(J0) = {(Alice, SD), (Bob, SJ)}. In contrast, q(J0′) = ∅, since in J0′ both Alice and Bob are linked with both SJ and SD. Consequently, certain(q, I) = ∅, and thus certain(q, I) is properly contained in q(J0)↓.


Let J be a universal solution for I . Since J0 is (up to a renaming of the nulls) the core of J , it follows that q(J0 ) ⊆ q(J )↓ . (We are using the fact that q(J0 ) = q(J0 )↓ here.) Since also we have the strict inclusion certain(q, I ) ⊂ q(J0 ), we have that certain(q, I ) ⊂ q(J )↓ , for every universal solution J . This also means that there is no universal solution J for I such that certain(q, I ) = q(J )↓ . Finally, consider the target instance: J ′ = {Home(Alice, SF), Home(Bob, SD), EmpDept(Alice, X 0 ), EmpDept(Bob, Y 0 ), DeptCity(X 0 , SJ), DeptCity(Y 0 , SD), DeptCity(X ′ , SJ)}. It is easy to verify that J ′ is a universal solution and that q(J ′ ) = {(Alice, SJ), (Alice, SD), (Bob, SJ) }. Thus, the following strict inclusions hold: certain(q, I ) ⊂ q(J0 )↓ ⊂ q(J ′ )↓ . This shows that a strict inclusion hierarchy can exist among the set of the certain answers, the result of the null-free query evaluation on the core and the result of the null-free query evaluation on some other universal solution. We will argue in the next section that instead of computing certain(q, I ) a better answer to the query may be given by taking q(J0 )↓ itself! 6.1 Certain Answers on Universal Solutions Although the certain answers of conjunctive queries with inequalities cannot always be obtained by evaluating these queries on the core of the universal solutions, it turns out that this evaluation produces a “best approximation” to the certain answers among all evaluations on universal solutions. Moreover, as we shall show, this property characterizes the core, and also extends to existential queries. We now define existential queries, including a safety condition. An existential query q(x) is a formula of the form ∃yφ(x, y), where φ(x, y) is a quantifier-free formula in disjunctive normal form. Let φ be ∨i ∧ j γij , where each γij is an atomic formula, the negation of an atomic formula, an equality, or the negation of an equality. As a safety condition, we assume that for each conjunction ∧ j γij and each variable z (in x or y) that appears in this conjunction, one of the conjuncts γij is an atomic formula that contains z. The safety condition guarantees that φ is domain independent [Fagin 1982] (so that its truth does not depend on any underlying domain, but only on the “active domain” of elements that appear in tuples in the instance). We now introduce the following concept, which we shall argue is fundamental. Definition 6.3. Let (S, T, st , t ) be a data exchange setting and let I be a source instance. For every query q over the target schema T, the set of the certain answers of q on universal solutions with respect to the source instance I , ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


denoted by u-certain(q, I), is the set of all tuples that appear in q(J) for every universal solution J for I; in symbols,

u-certain(q, I) = ⋂{q(J) : J is a universal solution for I}.

Clearly, certain(q, I) ⊆ u-certain(q, I). Moreover, if q is a union of conjunctive queries, then Proposition 2.7 implies that certain(q, I) = u-certain(q, I). In contrast, if q is a conjunctive query with inequalities, it is possible that certain(q, I) is properly contained in u-certain(q, I). Concretely, this holds true for the query q and the source instance I in Example 6.2, since certain(q, I) = ∅, while u-certain(q, I) = {(Alice, SD), (Bob, SJ)}. In such cases, there is no universal solution J for I such that certain(q, I) = q(J)↓. Nonetheless, the next result asserts that if J0 is the core of the universal solutions for I, then u-certain(q, I) = q(J0)↓. Therefore, q(J0)↓ is the best approximation (that is, the least superset) of the certain answers for I among all choices of q(J)↓ where J is a universal solution for I.

Before we prove the next result, we need to recall some definitions from Fagin et al. [2003]. Let q be a Boolean (that is, 0-ary) query over the target schema T and I a source instance. If we let true denote the set with one 0-ary tuple and false denote the empty set, then each of the statements q(J) = true and q(J) = false has its usual meaning for Boolean queries q. It follows from the definitions that certain(q, I) = true means that for every solution J of this instance of the data exchange problem, we have that q(J) = true; moreover, certain(q, I) = false means that there is a solution J such that q(J) = false.

PROPOSITION 6.4. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is a set of tgds and Σt is a set of tgds and egds. Let I be a source instance such that a universal solution for I exists, and let J0 be the core of the universal solutions for I.

(1) If q is an existential query over the target schema T, then u-certain(q, I) = q(J0)↓.

(2) If J∗ is a universal solution for I such that for every existential query q over the target schema T we have that u-certain(q, I) = q(J∗)↓, then J∗ is isomorphic to the core J0 of the universal solutions for I. In fact, it is enough for the above property to hold for every conjunctive query q with inequalities ≠.

PROOF. Let J be a universal solution, and let J0 be the core of J. By Proposition 3.3, we know that J0 is an induced substructure of J. Let q be an existential query over the target schema T. Since q is an existential query and J0 is an induced substructure of J, it is straightforward to verify that q(J0) ⊆ q(J) (this is a well-known preservation property of existential first-order formulas). Since J0 is the core of every universal solution for I up to a renaming of the nulls, it follows that q(J0)↓ ⊆ ⋂{q(J) : J universal for I}. We now show the reverse inclusion. Define J0′ by renaming each null of J0 in such a way that J0 and J0′ have no nulls in common. Then ⋂{q(J) : J universal for I} ⊆ q(J0) ∩ q(J0′).


But it is easy to see that q(J0) ∩ q(J0′) = q(J0)↓. This proves the reverse inclusion and so

u-certain(q, I) = ⋂{q(J) : J universal for I} = q(J0)↓.

For the second part, assume that J∗ is a universal solution for I such that for every conjunctive query q with inequalities ≠ over the target schema,

q(J∗)↓ = ⋂{q(J) : J is a universal solution for I}. (3)

Let q∗ be the canonical conjunctive query with inequalities associated with J∗, that is, q∗ is a Boolean conjunctive query with inequalities that asserts that there exist at least n∗ distinct elements, where n∗ is the number of elements of J∗, and describes which tuples from J∗ occur in which relations in the target schema T. It is clear that q∗(J∗) = true. Since q∗ is a Boolean query, we have q(J∗)↓ = q(J∗). So from (3), where q∗ plays the role of q, we have

q∗(J∗) = ⋂{q∗(J) : J is a universal solution for I}. (4)

Since q∗(J∗) = true, it follows from (4) that q∗(J0) = true. In turn, q∗(J0) = true implies that there is a one-to-one homomorphism h∗ from J∗ to J0. At the same time, there is a one-to-one homomorphism from J0 to J∗, by Corollary 3.5. Consequently, J∗ is isomorphic to J0.

Let us take a closer look at the concept of the certain answers of a query q on universal solutions. In Fagin et al. [2003], we made a case that the universal solutions are the preferred solutions to the data exchange problem, since in a precise sense they are the most general possible solutions and, thus, they represent the space of all solutions. This suggests that, in the context of data exchange, the notion of the certain answers on universal solutions may be more fundamental and more meaningful than that of the certain answers. In other words, we propose here that u-certain(q, I) should be used as the semantics of query answering in data exchange settings, instead of certain(q, I), because we believe that this notion should be viewed as the “right” semantics for query answering in data exchange. As pointed out earlier, certain(q, I) and u-certain(q, I) coincide when q is a union of conjunctive queries, but they may very well be different when q is a conjunctive query with inequalities. The preceding Example 6.2 illustrates this difference between the two semantics, since certain(q, I) = ∅ and u-certain(q, I) = {(Alice, SD), (Bob, SJ)}, where q is the query ∃D∃D′(EmpDept(e, D) ∧ DeptCity(D′, c) ∧ (D ≠ D′)). We argue that a user should not expect the empty set ∅ as the answer to the query q after the data exchange between the source and the target (unless, of course, further constraints are added to specify that the nulls must be equal). Thus, u-certain(q, I) = {(Alice, SD), (Bob, SJ)} is a more intuitive answer to q than certain(q, I) = ∅. Furthermore, this answer can be computed as q(J0)↓.
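To make this concrete, here is a small Python sketch. It is ours, not the article's code; the relation encoding, the Null class, and the helper names are illustrative assumptions. It evaluates the query q of Example 6.2 on the core J0 and on a larger universal solution obtained by adding one redundant DeptCity fact, then keeps only the null-free tuples: the core yields exactly u-certain(q, I), while the larger solution overshoots.

```python
# Sketch (not the authors' code): evaluate
#   q(e, c) = ∃D ∃D' ( EmpDept(e, D) ∧ DeptCity(D', c) ∧ D ≠ D' )
# on target instances with labeled nulls, then keep only null-free tuples.

class Null:
    """A labeled null; a null equals only itself and differs from every constant."""
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"Null({self.name})"

def distinct(a, b):
    # Semantics of ≠ over constants and labeled nulls.
    if isinstance(a, Null) or isinstance(b, Null):
        return a is not b
    return a != b

X0, Y0, X1 = Null("X0"), Null("Y0"), Null("X1")

# The core J0 of the universal solutions (Example 6.2).
J0 = {
    "Home":     {("Alice", "SF"), ("Bob", "SD")},
    "EmpDept":  {("Alice", X0), ("Bob", Y0)},
    "DeptCity": {(X0, "SJ"), (Y0, "SD")},
}
# A larger universal solution: J0 plus one redundant DeptCity fact.
J1 = {rel: set(tups) for rel, tups in J0.items()}
J1["DeptCity"].add((X1, "SJ"))

def q(J):
    return {(e, c)
            for (e, d) in J["EmpDept"]
            for (d2, c) in J["DeptCity"]
            if distinct(d, d2)}

def null_free(answers):
    """q(J)↓ : keep only tuples built entirely from constants."""
    return {t for t in answers if not any(isinstance(v, Null) for v in t)}

print(null_free(q(J0)))  # {('Alice', 'SD'), ('Bob', 'SJ')} = u-certain(q, I)
print(null_free(q(J1)))  # additionally contains ('Alice', 'SJ')
```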


We now show that for conjunctive queries with inequalities, it may be easier to compute the certain answers on universal solutions than to compute the certain answers. Abiteboul and Duschka [1998] proved the following result.

THEOREM 6.5 [ABITEBOUL AND DUSCHKA 1998]. There is a LAV setting and a Boolean conjunctive query q with inequalities ≠ such that computing the set certain(q, I) of the certain answers of q is a coNP-complete problem.

By contrast, we prove the following result, which covers not only LAV settings but even broader settings.

THEOREM 6.6. Let (S, T, Σst, Σt) be a data exchange setting in which Σst is a set of tgds and Σt is a set of egds. For every existential query q over the target schema T, there is a polynomial-time algorithm for computing, given a source instance I, the set u-certain(q, I) of the certain answers of q on the universal solutions for I.

PROOF. Let q be an existential query, and let J0 be the core of the universal solutions. We see from Proposition 6.4 that u-certain(q, I) = q(J0)↓. By Theorem 5.2 or Theorem 5.15, there is a polynomial-time algorithm for computing J0, and hence for computing q(J0)↓.

Theorems 6.5 and 6.6 show a computational advantage for certain answers on universal solutions over simply certain answers. Note that the core is used in the proof of Theorem 6.6 but does not appear in the statement of the theorem and does not enter into the definitions of the concepts used in the theorem. It is not at all clear how one would prove this theorem directly, without making use of our results about the core.

We close this section by pointing out that Proposition 6.4 is very dependent on the assumption that q is an existential query. A universal query is taken to be the negation of an existential query. It is a query of the form ∀x φ(x), where φ(x) is a quantifier-free formula, with a safety condition that is inherited from existential queries. Note that each egd and full tgd is a universal query (and in particular, satisfies the safety condition). For example, the egd ∀x(A1 ∧ A2 → (x1 = x2)) satisfies the safety condition, since its negation is ∃x(A1 ∧ A2 ∧ (x1 ≠ x2)), which satisfies the safety condition for existential queries since every variable in x appears in one of the atomic formulas A1 or A2. We now give a data exchange setting and a universal query q such that u-certain(q, I) cannot be obtained by evaluating q on the core of the universal solutions for I.

Example 6.6. Referring to our running example, consider again the universal solutions Jm, for m ≥ 0, from Example 6.1. Among those universal solutions, the instance J0 is the core of the universal solutions for I. Let q be the following Boolean universal query (a functional dependency): ∀e∀d1∀d2(EmpDept(e, d1) ∧ EmpDept(e, d2) → (d1 = d2)).


It is easy to see that q(J0) = true and q(Jm) = false, for all m ≥ 1. Consequently, certain(q, I) = false = u-certain(q, I) ≠ q(J0).

7. CONCLUDING REMARKS

In a previous article [Fagin et al. 2003], we argued that universal solutions are the best solutions in a data exchange setting, in that they are the “most general possible” solutions. Unfortunately, there may be many universal solutions. In this article, we identified a particular universal solution, namely, the core of an arbitrary universal solution, and argued that it is the best universal solution (and hence the best of the best). The core is unique up to isomorphism, and is the universal solution of the smallest size, that is, with the fewest tuples. The core gives the best answer, among all universal solutions, for existential queries. By “best answer,” we mean that the core provides the best approximation (among all universal solutions) to the set of the certain answers. In fact, we proposed an alternative semantics where the set of “certain answers” is redefined to be those that occur in every universal solution. Under this alternative semantics, the core gives the exact answer for existential queries.

We considered the question of the complexity of computing the core. To this effect, we showed that the complexity of deciding if a graph H is the core of a graph G is DP-complete. Thus, unless P = NP, there is no polynomial-time algorithm for producing the core of a given arbitrary structure. On the other hand, in our case of interest, namely, data exchange, we gave natural conditions where there are polynomial-time algorithms for computing the core of universal solutions. Specifically, we showed that the core of the universal solutions is polynomial-time computable in data exchange settings in which Σst is a set of source-to-target tgds and Σt is a set of egds.

These results raise a number of questions. First, there are questions about the complexity of constructing the core. Even in the case where we prove that there is a polynomial-time algorithm for computing the core, the exponent may be somewhat large. Is there a more efficient algorithm for computing the core in this case and, if so, what is the most efficient such algorithm? There is also the question of extending the polynomial-time result to broader classes of target dependencies. To this effect, Gottlob [2005] recently showed that computing the core may be NP-hard in the case in which Σt consists of a single full tgd, provided a NULL “built-in” target predicate is available to tell labeled nulls from constants in target instances; note that, since NULL is a “built-in” predicate, it need not be preserved under homomorphisms. Since our formalization of data exchange does not allow for such a NULL predicate, it remains an open problem to determine the complexity of computing the core in data exchange settings in which the target constraints are egds and tgds. On a slightly different note, and given the similarities between the two problems, it would be interesting to see whether our techniques for minimizing universal solutions can be applied to the problem of minimizing the chase-generated universal plans that arise in the comprehensive query optimization method introduced in Deutsch et al. [1999].


Finally, the work reported here addresses data exchange only between relational schemas. In the future we hope to investigate to what extent the results presented in this article and in Fagin et al. [2003] can be extended to the more general case of XML/nested data exchange. ACKNOWLEDGMENTS

Many thanks to Marcelo Arenas, Georg Gottlob, Renée J. Miller, Arnon Rosenthal, Wang-Chiew Tan, Val Tannen, and Moshe Y. Vardi for helpful suggestions, comments, and pointers to the literature.

REFERENCES

ABITEBOUL, S. AND DUSCHKA, O. M. 1998. Complexity of answering queries using materialized views. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). 254–263.
ABITEBOUL, S., HULL, R., AND VIANU, V. 1995. Foundations of Databases. Addison-Wesley, Reading, MA.
BEERI, C. AND VARDI, M. Y. 1984. A proof procedure for data dependencies. Journal Assoc. Comput. Mach. 31, 4, 718–741.
CHANDRA, A. K. AND MERLIN, P. M. 1977. Optimal implementation of conjunctive queries in relational data bases. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 77–90.
COSMADAKIS, S. 1983. The complexity of evaluating relational queries. Inform. Contr. 58, 101–112.
COSMADAKIS, S. S. AND KANELLAKIS, P. C. 1986. Functional and inclusion dependencies: A graph theoretic approach. In Advances in Computing Research, vol. 3. JAI Press, Greenwich, CT, 163–184.
DEUTSCH, A., POPA, L., AND TANNEN, V. 1999. Physical data independence, constraints and optimization with universal plans. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 459–470.
DEUTSCH, A. AND TANNEN, V. 2003. Reformulation of XML queries and constraints. In Proceedings of the International Conference on Database Theory (ICDT). 225–241.
FAGIN, R. 1982. Horn clauses and database dependencies. Journal Assoc. Comput. Mach. 29, 4 (Oct.), 952–985.
FAGIN, R., KOLAITIS, P. G., MILLER, R. J., AND POPA, L. 2003. Data exchange: Semantics and query answering. In Proceedings of the International Conference on Database Theory (ICDT). 207–224.
FRIEDMAN, M., LEVY, A. Y., AND MILLSTEIN, T. D. 1999. Navigational plans for data integration. In Proceedings of the National Conference on Artificial Intelligence (AAAI). 67–73.
GOTTLOB, G. 2005. Cores for data exchange: Hard cases and practical solutions. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS).
GOTTLOB, G. AND FERMÜLLER, C. 1993. Removing redundancy from a clause. Art. Intell. 61, 2, 263–289.
HALEVY, A. 2001. Answering queries using views: A survey. VLDB J. 10, 4, 270–294.
HELL, P. AND NEŠETŘIL, J. 1992. The core of a graph. Discr. Math. 109, 117–126.
KANELLAKIS, P. C. 1990. Elements of relational database theory. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier, Amsterdam, The Netherlands, and MIT Press, Cambridge, MA, 1073–1156.
LENZERINI, M. 2002. Data integration: A theoretical perspective. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). 233–246.
MAIER, D., MENDELZON, A. O., AND SAGIV, Y. 1979. Testing implications of data dependencies. ACM Trans. Database Syst. 4, 4 (Dec.), 455–469.
MILLER, R. J., HAAS, L. M., AND HERNÁNDEZ, M. 2000. Schema mapping as query discovery. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 77–88.


PAPADIMITRIOU, C. AND YANNAKAKIS, M. 1982. The complexity of facets and some facets of complexity. In Proceedings of the ACM Symposium on Theory of Computing (STOC). 229–234.
PAPADIMITRIOU, C. H. 1994. Computational Complexity. Addison-Wesley, Reading, MA.
POPA, L., VELEGRAKIS, Y., MILLER, R. J., HERNÁNDEZ, M. A., AND FAGIN, R. 2002. Translating Web data. In Proceedings of the International Conference on Very Large Data Bases (VLDB). 598–609.
SHU, N. C., HOUSEL, B. C., TAYLOR, R. W., GHOSH, S. P., AND LUM, V. Y. 1977. EXPRESS: A data EXtraction, Processing, and REStructuring System. ACM Trans. Database Syst. 2, 2, 134–174.
VAN DER MEYDEN, R. 1998. Logical approaches to incomplete information: A survey. In Logics for Databases and Information Systems. Kluwer, Dordrecht, The Netherlands, 307–356.

Received October 2003; revised May 2004; accepted July 2004


Concise Descriptions of Subsets of Structured Sets
KEN Q. PU and ALBERTO O. MENDELZON
University of Toronto

We study the problem of economical representation of subsets of structured sets, which are sets equipped with a set cover or a family of preorders. Given a structured set U , and a language L whose expressions define subsets of U , the problem of minimum description length in L (LMDL) is: “given a subset V of U , find a shortest string in L that defines V .” Depending on the structure and the language, the MDL-problem is in general intractable. We study the complexity of the MDL-problem for various structures and show that certain specializations are tractable. The families of focus are hierarchy, linear order, and their multidimensional extensions; these are found in the context of statistical and OLAP databases. In the case of general OLAP databases, data organization is a mixture of multidimensionality, hierarchy, and ordering, which can also be viewed naturally as a cover-structured ordered set. Efficient algorithms are provided for the MDLproblem for hierarchical and linearly ordered structures, and we prove that the multidimensional extensions are NP-complete. Finally, we illustrate the application of the theory to summarization of large result sets and (multi) query optimization for ROLAP queries. Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design—Data models; normal forms; H.2.3 [Database Management]: Languages General Terms: Algorithms, Theory Additional Key Words and Phrases: Minimal description length, OLAP, query optimization, summarization

1. INTRODUCTION Consider an OLAP or multidimensional database setting [Kimball 1996], where a user has requested to view a certain set of cells of the datacube, say in the form of a 100 × 20 matrix. Typically, the user interacts with a front-end query tool that ships SQL queries to a back-end database management system (DBMS). After perusing the output, the user clicks on some of the rows of the matrix, say 20 of them, and requests further details on these rows. Suppose each row represents data on a certain city. A typical query tool will translate the user request to a long SQL query with a WHERE clause of the form city = city1 OR city = city2 ... OR city = city20. However, if the set of cities happens to include every city in Ontario except Toronto, an equivalent but much This work was supported by the Natural Sciences and Engineering Research Council of Canada. Authors’ address: Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ont., Canada M5S 3H5; email: {kenpu,mendel}@cs.toronto.edu. Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.  C 2005 ACM 0362-5915/05/0300-0211 $5.00 ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 211–248.


shorter formulation would be province = ‘Ontario’ AND city <> ‘Toronto’. Minimizing the length of the query that goes to the back end is advantageous for two reasons. First, many systems[1] have difficulty dealing with long queries, or even hard limits on query length. Second, the shorter query can often be processed much faster than the longer one (even though an extra join may be required, e.g., if there is no Province attribute stored in the cube). With this problem as motivation, we study the concise representations of subsets of a structured set. By “structured” we simply mean that we are given a (finite) set, called the universe, and a (finite) set of symbols, called the alphabet, each of which represents some subset of the universe. We are also given a language L of expressions on the alphabet, and a semantics that maps expressions to subsets of the universe. Given a subset V of the universe, we want to find a shortest expression in the given language that describes V. We call this the L-MDL (minimum description length) problem. In the example above, the universe is the set of city names, the alphabet includes at least the city name Toronto plus a set of province names, and the semantics provides a mapping from province names to sets of cities. This is the simplest case, where the symbols in the alphabet induce a partition of the universe. The most general language we consider, called L, is the language of arbitrary Boolean set expressions on symbols from the alphabet. In Section 2.1 we show that the L-MDL problem is solvable in polynomial time when the alphabet forms a partition of the universe. In particular, when the partition is granular, that is, every element of the universe is represented as one of the symbols in the alphabet, we obtain a normal form for minimum-length expressions, leading to a polynomial time algorithm. Of course, in addition to cities grouped into provinces, we could have provinces grouped into regions, regions into countries, etc. That is, the subsets of the universe may form a hierarchy. We consider this case in Section 2.2 and show that the normal forms of the previous section can be generalized, leading again to a polynomial time L-MDL problem. In the full OLAP context, elements of the universe can be grouped according to multiple independent criteria. If we think of a row in our initial example as a tuple ⟨city, product, date⟩, and the universe is the set of such tuples, then these tuples can be grouped by city into provinces, or by product into brands, or by date into years, etc. In Section 2.3 we consider the multidimensional case. In particular, we focus on the common situation in which each of the groupings is a hierarchy. We consider three increasingly powerful sublanguages of L, including L itself, and show that the MDL-problem is NP-complete for each of them. In many cases, the universe is naturally ordered, such as the TIME dimension. In Section 3, we define order-structures to capture such ordering. A language L(≤) is defined to express subsets of the ordered universe. The

[1] Many commercial relational OLAP engines naively translate user selections into simple SELECT SQL queries; large enough user selections are known to be executed as several SQL queries.


Fig. 1. A structured set.

MDL-problem is in general NP-complete, but in the case of one linear ordering, it can be solved in polynomial time. Section 4 focuses on two areas of application of the theory: summarization of query answers and optimization of SELECT queries in OLAP. We consider the scenario of querying a relational OLAP database using simple SELECT queries, and show that it is advantageous to rewrite the queries into the corresponding compact expressions. In Section 5.1, we describe some related MDL-problem and they are related to various languages presented in this article. We also present some existing OLAP query optimization techniques and how they are related to our approach. Finally we summarize our findings and outline the possibilities of future research in Section 6. 2. COVER STRUCTURES, LANGUAGES, AND THE MDL PROBLEM In this section we introduce our model of structured sets and descriptive languages for subsets of them, and state the minimum description length problem. Definition 1 (Cover Structured Set). A structured set is a pair of finite sets (U, ) together with an interpretation  function [·] :  → Pwr(U ) : σ → [σ ] which is injective, and is such that σ ∈ [σ ] = U . The set U is referred to as the universe, and  the alphabet. Intuitively the cover2 structure of the set U is modeled by the grouping of its elements; each group is labeled by a symbol in the alphabet . The interpretation of a symbol σ is the elements in U belonging to the group labeled by σ . Example 1. Consider a cover structured set depicted in Figure 1. The universe U = {1, 2, 3, 4, 5}. The alphabet  = {A, B, C}. The interpretation function is [A] = {1, 2}, [B] = {2, 3, 5}, and [C] = {4, 5}. Elements of the alphabet can be combined in expressions that describe other subsets of the universe. The most general language we will consider for these expressions is the propositional language that consists of all expressions composed of symbols from the alphabet and operators that stand for the usual set operations of union, intersection and difference. Definition 2 (Propositional Language). Given a structured set (U, ), its propositional language L(U, ) is defined as ǫ ∈ L(U, ), σ ∈ L(U, ) for all σ ∈ , and if α, β ∈ L(U, ), then (α + β), (α − β) and (α · β) are all in L(U, ). 2 The term cover refers to the fact that the universe U

is covered by the interpretation of the alphabet Σ. Later, in Section 3, we introduce the order-structure, in which the universe is ordered.
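Since the rest of the article builds on this notion, a tiny Python sketch may help fix intuitions. It is ours, not the authors' (the dictionary encoding of the interpretation function is an assumption); it simply checks the two conditions of Definition 1 on the structure of Example 1 and Figure 1.

```python
# Sketch (not the authors' code) of Definition 1: a cover-structured set is a
# universe U, an alphabet, and an injective interpretation whose images
# together cover U.

U = {1, 2, 3, 4, 5}
interp = {            # the interpretation function of Example 1 / Figure 1
    "A": {1, 2},
    "B": {2, 3, 5},
    "C": {4, 5},
}

def is_cover_structure(universe, interpretation):
    """Check injectivity of the interpretation and the cover condition."""
    images = list(interpretation.values())
    injective = len({frozenset(s) for s in images}) == len(images)
    covers = set().union(*images) == universe
    return injective and covers

assert is_cover_structure(U, interp)
```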


Definition 3 (Semantics and Length). The evaluation of L(U, Σ) is a function [·]∗ : L(U, Σ) → Pwr(U), defined as
— [ǫ]∗ = ∅,
— [σ]∗ = [σ] for any σ ∈ Σ, and
— [α + β]∗ = [α]∗ ∪ [β]∗, [α − β]∗ = [α]∗ − [β]∗, and [α · β]∗ = [α]∗ ∩ [β]∗.
The string length of L(U, Σ) is a function ‖·‖ : L(U, Σ) → N, given by
— ‖ǫ‖ = 0,
— ‖σ‖ = 1 for any σ ∈ Σ, and
— ‖α + β‖ = ‖α − β‖ = ‖α · β‖ = ‖α‖ + ‖β‖.

Remark. We abuse the definitions in a number of harmless ways. For instance, we may refer to U as a structured set, implying that it is equipped with an alphabet Σ and an interpretation function [·]. The language L(U, Σ) is sometimes written simply as L when the structured set (U, Σ) is understood from the context. The evaluation function [·]∗ supersedes the single-symbol interpretation function [·], so the latter is omitted from discussions and the simpler form [·] is used in place of [·]∗. Two expressions s and t in L are equivalent if they evaluate to the same set: that is, [s] = [t]. (Note that this means equivalence with respect to a particular structured set (U, Σ) and thus does not coincide with propositional equivalence.) In case they are equivalent, we say that s is reducible to t if ‖s‖ ≥ ‖t‖. The expression s is strictly reducible to t if they are equivalent and ‖s‖ > ‖t‖. An expression is compact if it is not strictly reducible to any other expression in the language. Given a sublanguage K ⊆ L, an expression is K-compact if it belongs to K and is not strictly reducible to any other expression in K. A language K ⊆ L(U, Σ) is granular if it can express every subset, or equivalently, every singleton, that is, (∀a ∈ U)(∃s ∈ K) [s] = {a}. We say that a structure Σ is granular if the propositional language L(U, Σ) is granular. If L(U, Σ) is not granular, then certain subsets (specifically singletons) of U cannot be expressed by any expression. The solution is then to augment the alphabet Σ to include sufficiently more symbols until it becomes granular.

Definition 4 (K-Descriptive Length). Given a structured set (U, Σ), consider a sublanguage K ⊆ L(U, Σ), and a subset V ⊆ U. The language K(V) is all expressions s ∈ K such that [s] = V, and the K-descriptive length of V, written ‖V‖K, is defined as ‖V‖K = min{‖α‖ : α ∈ K(V)} if K(V) ≠ ∅, and

‖V‖K = ∞ otherwise. In case K = L(U, Σ), we write ‖V‖K simply as ‖V‖.
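The following Python sketch (ours, not the authors'; the tuple encoding of expressions and the brute-force enumeration are illustrative assumptions) spells out the evaluation function, the length, and a naive computation of the descriptive length on a tiny instance such as Example 1. It is meant only to make the definitions concrete; the rest of the article is about computing these lengths efficiently.

```python
# Sketch (not the authors' code) of Definitions 3 and 4 on the structure of
# Example 1: evaluate expressions, count symbol occurrences, and find the
# shortest expression denoting a given subset by brute force.

from itertools import product

interp = {"A": {1, 2}, "B": {2, 3, 5}, "C": {4, 5}}

def evaluate(expr):
    """[·]: expr is a symbol (str) or a tuple (op, left, right)."""
    if isinstance(expr, str):
        return set(interp[expr])
    op, a, b = expr
    x, y = evaluate(a), evaluate(b)
    return x | y if op == "+" else x - y if op == "-" else x & y

def length(expr):
    """‖·‖: number of symbol occurrences; operators are not counted."""
    if isinstance(expr, str):
        return 1
    _, a, b = expr
    return length(a) + length(b)

def descriptive_length(V, max_symbols=3):
    """min ‖s‖ over expressions s with [s] = V (brute force, tiny cases only)."""
    exprs = list(interp)                       # all length-1 expressions
    for n in range(1, max_symbols):
        exprs += [(op, a, b) for op in "+-·"
                  for a, b in product(exprs, repeat=2)
                  if length(a) + length(b) == n + 1]
    best = min((length(s) for s in exprs if evaluate(s) == V), default=None)
    return best if best is not None else float("inf")

s1 = ("-", ("-", "A", "B"), "C")               # (A - B) - C, length 3
assert evaluate(s1) == {1} and length(s1) == 3
assert descriptive_length({1}) == 2            # A - B also denotes {1}
```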


Fig. 2. A partition.

The K-descriptive length of a subset V is just the minimal length needed to express it in the language K. Example 2. Continuing with the example of the structure shown in Figure 1, the language L(U, ) includes expressions like s1 = (A − B) − C, s2 = A − B, and s3 = (B − A) − C, with [s1 ] = [A − B] − [C] = ([A] − [B]) − [C] = {1} = [s2 ] and [s1 ] = [B− A]−[C] = ([B]−[A])−[C] = {3}. The first two strings s1 and s2 are equivalent, but s2 is shorter in length; therefore s1 is strictly reducible to s2 . It’s not difficult to check that s2 is L(U, )-compact, so {1} = 2. Our first algorithmic problem is: what is the complexity of determining the minimum length of a subset in the language K. We pose it as a decision problem. Definition 5 (The K-MDL Decision Problem). — INSTANCE: A structured set (U, ), a subset V ⊆ U , and a positive integer k > 0. — QUESTION: V K ≤ k? PROPOSITION 1.

The L-MDL decision problem is NP-complete.

The proof of Proposition 1 requires the simple observation that for any structured set (U, ), there is a naturally induced set cover, written U/, on U given by U/ = {[σ ] : σ ∈ }. The general minimum set-cover problem [Garey and Johnson 1979] easily reduces to the general L-MDL problem. The next few sections will focus on some specific structures that are relevant to realistic databases. 2.1 Partition is in P In this section we focus our attention on the simple case where the symbols in  form a partition of U . Definition 6 (Partition). A structured set (U, ) is a partition if the induced set cover U/ partitions U . Example 3. Consider these streets: Grand, Canal, Broadway in the city NewYork, VanNess, Market, Mission in SanFrancisco, and Victoria, DeRivoli in Paris. The street names form the universe, which is partitioned by the alphabet consisting of the three city names, as shown in Figure 2. PROPOSITION 2. The L-MDL decision problem for a partition (U, ) can be solved in O(|U | · log |U |). The L-MDL decision problem for partitions is particularly easy because, given a subset V , V L is simply the number of cells that cover V exactly. Given the partition and V , computing the number of cells that cover V exactly ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


can be done in O(|U | log |V |), and can in fact be further optimized to O(|V |) if special data structures are used. Of course, in general not all subsets of street names can be expressed only by using city names—that is, the propositional language L(U, ) for a partition is not, in general, granular. We therefore extend the alphabet  to be granular; this requires having additional symbols in , one for each element of U . Definition 7 (Granular Partition). A structured set (U, ) is a granular ˙ where (U, 0 ) is a partition. The interpretation funcpartition if  = 0 ∪U tion [·] :  → Pwr(U ) is extended such that [u] = {u} for any u ∈ U . The L-MDL decision problem for granular partitions is also solvable in polynomial time. We first define a sublanguage Npar ⊆ L consisting of expressions which we refer to as normal, and show that all expressions in L are reducible to ones in Npar , and use this to constructively show that the Npar -MDL decision problem is solvable in polynomial time. − → Let A = {a1 , a2 , . . . , an } ⊆  be a set of symbols. We write A = a1 + a2 + · · · + an . The ordering of the symbols ai does not change the semantic evaluation nor − → its length, so A can be any of the strings that are equivalent to a1 + a2 + · · · + an − → up to the permutations of {ai }. Furthermore, we write [A] to mean [ A]. For a set of expressions {si }, i si is the expression formed by concatenating si by the + operator. ˙ ) be a Definition 8 (Normal Form for Granular Partitions). Let (U, 0 ∪U granular partition, and its propositional language be L. An expression s ∈ L is → − → − → − in normal form if it is of the form (  + A+ ) − A− where  ⊆ 0 and A+ and A− are elements in U interpreted as symbols in . The normal expression s is trim if A+ = [s] − [] and A− = [] − [s]. Let Npar (U, ) be all the normal expressions in L(U, ) that are trim. Intuitively, a normal form expression consists of taking the union of some set of symbols  from the alphabet, adding to it some elements from the universe, and subtracting some others. The expression is trim if we only add and subtract exactly those symbols that we need to express a particular subset. Note that all normal and trim expressions s ∈ Npar are uniquely determined by their semantics [s] and the high-level symbols  used. Therefore can →we− → write − → − π (V / ) to mean the normal and trim expression of the form  + A+ − A− where A+ = V − [] and A− = [] − V . With the interest of compact expressions, we only need to be concerned with normal expressions that are trim for the following reasons. → − → − → − PROPOSITION 3. A normal expression s =  + A+ − A− is L-compact only if A+ ∩ [] = A− − [] = ∅. + PROOF. If A− ∩ []− is nonempty, say a ∈ A+ ∩ [], then define A′+ = A+ − {a}, → → − → ′+ ′ and s =  + A − A− . It is clear that [s′ ] = [s] but s′ < s , so s cannot be L-compact. Similarly if A− − [] = ∅, we can reduce s strictly as well.
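Before turning to Proposition 4, here is a short Python sketch of the trim normal form of Definition 8 on the partition of Example 3 and Figure 2. It is ours, not the authors' code; the triple (Θ, A+, A−) is simply our encoding of the expression that sums the symbols of Θ and the elements of A+ and then subtracts the elements of A−. Different choices of Θ give different normal expressions for the same set; the lemma proved later in this section characterizes which choices are compact.

```python
# Sketch (not the authors' code) of the trim normal form of Definition 8
# for a granular partition: for a chosen set Θ of partition symbols, the
# normal expression adds A+ = V − [Θ] and subtracts A− = [Θ] − V.

partition = {                        # Example 3 / Figure 2
    "NewYork":      {"Grand", "Canal", "Broadway"},
    "SanFrancisco": {"VanNess", "Market", "Mission"},
    "Paris":        {"Victoria", "DeRivoli"},
}

def trim_normal_form(V, theta):
    covered = set().union(*(partition[c] for c in theta)) if theta else set()
    return theta, V - covered, covered - V        # (Θ, A+, A−)

def length(normal_form):
    theta, a_plus, a_minus = normal_form
    return len(theta) + len(a_plus) + len(a_minus)

V = {"Grand", "Canal"}
print(length(trim_normal_form(V, {"NewYork"})))   # 2: NewYork − Broadway
print(length(trim_normal_form(V, set())))         # 2: Grand + Canal
```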



PROPOSITION 4. A normal expression for a granular partition is L-compact only if it is trim.





→ − → − → − PROOF. Let s =  + A+ − A− be a normal expression. Say that it is not trim. Then either A+ = [s] − [] or A− = [] − [s]. We show that in either case, the expression s can be strictly reduced. Say A+ = [s] − []. There are two possibilities: — A+ − ([s] − []) = ∅: Since A+ ∩ [] = ∅ by Proposition we have −−−−→ 3,−− − −− −−→ that − → −−− + − + + ′ A − [s] = ∅. Let a ∈ A − [s]. Define s =  + (A − {a}) − (A − {a}). It’s easy to see that [s′ ] = [s] but s′ < s . — ([s] − []) − A+ = ∅: Recall that [s] = ([] ∪ A+ ) − A− ; we have [] ∪ A+ ⊇ [s], ([s] − []) − A+ = ∅ always, making this case impossible. The second case of A− = [] − [s] implies that s is reducible by similar arguments. LEMMA 1 (NORMALIZATION).

Every expression in L is reducible to one in Npar .

PROOF. The proof is by induction on the construction of the expression s in L. The base case of s = ǫ and s = σ are trivially reducible to Npar . The expression − → − → − → s = ǫ is reducible to ∅ + ∅ − − ∅ ,→ which also has a length zero.− The expression → − → of−→ −→ → − s = σ is reducible to {σ } + ∅ − ∅ if σ ∈ 0 , and to ∅ + {σ } − ∅ if σ ∈ U . The inductive step has three cases: (i) Suppose that s = s1 + s2 where si ∈ Npar . We show that s is reducible to Npar . Write si = i + Ai+ − Ai− . Define  = 1 ∪ 2 . Then, by Definition 8, we have the following, A+ = [s] − [] = ([s1 ] ∪ [s2 ]) − [] = ([s1 ] − []) ∪ ([s2 ] − []) ⊆ ([s1 ] − [1 ]) ∪ ([s2 ] − [2 ]) A−

+ = A+ 1 ∪ A2 , and = [] − [s] = ([1 ] − [s]) ∪ ([2 ] − [s])

⊆ ([1 ] − [s1 ]) ∪ ([2 ] − [s2 ]) − = A− 1 ∪ A2 .

→ − → − → − So the normal expression π ([s]/ ) =  + A+ − A− is equivalent to s, and has its length → − → − → −

π ([s]/ ) =  + A+ − A− = || + |A+ | + |A− | − + − ≤ |1 | + |A+ 1 | + |A1 | + |2 | + |A2 | + |A2 | = s1 + s2 = s . (ii) Suppose that s = s1 · s2 . Let si be as in (i), and define  = 1 ∩ 2 . By standard set manipulations similar to those in (i), we once again get + A+ ⊆ A+ 1 ∪ A2

and

− A− ⊆ A− 1 ∪ A2 .

Hence s is reducible to π ([s]/ ). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.





(iii) Finally consider the case that s = s1 − s2 with si in normal form as before. Let  = 1 − 2 . Then one can show that − A+ ⊆ A+ 1 ∪ A2

and

+ A− ⊆ A− 1 ∪ A2 .

Again s is reducible to π ([s]/ ). This concludes the proof. Lemma 1 immediately implies the following. THEOREM 1.

For all V ⊆ U , we have V Npar = V L .

By Theorem 1, one only needs to focus on the Npar -MDL problem for granular partitions. The necessary and sufficient condition for Npar -compactness can be easily stated in terms of the symbols used. Suppose V ⊆ U ; let us denote  + (V ) = {σ ∈  : |[σ ] ∩ V | > |[σ ] − V | + 1}, and very similarly  # (V ) = {σ ∈  : |[σ ] ∩ V | ≥ |[σ ] − V | + 1}. Intuitively, the interpretation of a symbol in  + (V ) includes more elements in V than elements not in V —by a difference of at least two. Similarly for a symbol in  # (V ), the difference is at least one. We say that symbols in  # (V ) are efficient with respect to V and ones in +  (V ) are strictly efficient. Symbols that are not in  # (V ) are inefficient with respect to V . Example 4. Consider the partition in Figure 2. Let V1 = {Victoria, DeRivoli}, and V2 = {Grand, Canal}.  # (V1 ) =  + (V1 ) = {Paris},  # (V2 ) = {NewYork}, and  + (V2 ) = ∅. → − → − → − LEMMA 2. Let s = (  + A+ ) − A− be an expression in Npar representing V . It is Npar -compact if and only if  + (V ) ⊆  ⊆  # (V ). PROOF (ONLY IF). We show that s is Npar -compact implies that  + (V ) ⊆  ⊆  (V ) by contradiction. #

(i) Suppose  + (V ) ⊆ , then there exists an symbol σ ∈  + (V ) but σ ∈ . ˙ }, and s′ = π (V / ′ ). We have that Define ′ = ∪{σ ˙ ]) = (V − []) − [σ ] = (V − [])−(V ˙ A′+ = V − [′ ] = V − ([]∪[σ ∩ [σ ]) ′−

A

˙ ∩ [σ ]), and = A+ −(V ′ ˙ ]) − V = ([] − V )∪([σ ˙ ] − V) = [ ] − V = ([]∪[σ −˙ = A ∪([σ ] − V ).

So

s′ = |′ | + |A′+ | + |A′− | = s + (|[σ ] − V | + 1 − |V − [σ ]|) < s . This contradicts with the assumption that s is Npar -compact. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Concise Descriptions of Subsets of Structured Sets

219



˙ (ii) Say that  ⊆  # (V ). Let ω ∈  but ω ∈  # (V ). Define ′ = −{ω} and s′ = π (V / ′ ). We then have, ˙ A′+ = V − ([]−[ω]) A′−

= A+ ∪ (V ∩ [ω]), and ˙ ˙ = [′ ] − V = ([]−[ω]) − V = ([] − V )−([ω] − V) −˙ = A −([ω] − V ).

It follows then,

s′ = |′ | + |A′+ | + |A′− | = s + (|V ∩ [ω]| − |[ω] − V | − 1) < s . Again a contradiction. (IF). It remains to be shown that  + (V ) ⊆  ⊆  # (V ) implies that s is Npar compact. Let 0 =  + (V ) and s0 = π(V / 0 ) We are going to prove the following fact: (∀ ⊆ 0 )  + (V ) ⊆  ⊆  # (V ) =⇒ s0 = π (V / ) . +

(∗) #

Therefore by Equation (*), all expressions in Npar with  (V ) ⊆  ⊆  (V ) have the same length, and since one must be Npar -compact by the necessary condition and the guaranteed existence of a Npar -compact expression, all must be Npar -compact. → − → − → − Now we prove (*). Consider any s =  + A+ − A− with  + (V ) ⊆  ⊆  # (V ). ˙ and Define Ŵ =  −  + (V ). Then  = 0 ∪Ŵ, +˙ ˙ A+ 0 = V − [] = (V − [0 ])−(V ∩ [Ŵ]) = A −([Ŵ] ∩ V ), and −˙ A− 0 = [] − V = A ∪([Ŵ] − V ).

It then follows that  

s0 = s + |V ∩ [Ŵ]| − |[Ŵ] − V | − |Ŵ|.

(∗∗)

˙ γ ∈Ŵ [γ ], we conclude Furthermore, since Ŵ ⊆  # (V ), and [Ŵ] = ∪   V ∩ [Ŵ]| − |[Ŵ] − V | = (|V ∩ [γ ]| − |[γ ] − V |) = 1 = |Ŵ|. γ ∈Ŵ

Ŵ

Substitute into Equation (**), we have the desired result: s0 = s . Intuitively Lemma 2 tells us that an expression is Npar -compact if and only if it uses all strictly efficient symbols, and never uses any inefficient ones. COROLLARY 1. Let (U, ) be a granular partition. Given any V ⊆ U , π (V / # (V )) is L-compact. Computing π (V / # (V )) is certainly in polynomial time. THEOREM 2. The L-MDL problem for granular partitions can be solved in polynomial time. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
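Corollary 1 and Theorem 2 translate directly into a short procedure. The Python sketch below is ours, not the authors' code (the dictionary encoding and the function names are assumptions); it computes the efficient symbols Σ#(V) and the corresponding trim normal form for the partition of Example 3 and Figure 2, which by Corollary 1 is a compact expression.

```python
# Sketch (not the authors' code) of the polynomial-time algorithm behind
# Corollary 1 / Theorem 2 for a granular partition.

partition = {                       # Example 3 / Figure 2
    "NewYork":      {"Grand", "Canal", "Broadway"},
    "SanFrancisco": {"VanNess", "Market", "Mission"},
    "Paris":        {"Victoria", "DeRivoli"},
}

def efficient_symbols(V):
    """Σ#(V): cells with |cell ∩ V| ≥ |cell − V| + 1."""
    return {c for c, cell in partition.items()
            if len(cell & V) >= len(cell - V) + 1}

def compact_expression(V):
    """The trim normal form built from Σ#(V), encoded as (Θ, A+, A−)."""
    theta = efficient_symbols(V)
    covered = set().union(*(partition[c] for c in theta)) if theta else set()
    return theta, V - covered, covered - V

V = {"Victoria", "DeRivoli", "Grand", "Canal"}
theta, plus, minus = compact_expression(V)
print(theta, plus, minus)
# ({'NewYork', 'Paris'}, set(), {'Broadway'}),
# i.e. (NewYork + Paris) − Broadway, of length 3.
```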





Fig. 3. The STORE dimension as a tree.

Example 5. Consider V1 and V2 as defined in the previous example. By Lemma 2, both of the following expressions of V1 ∪ V2 = { Victoria, DeRivoli, Grand, Canal} are compact: s1 = (NewYork + Paris) − Broadway and s2 = Paris + (Grand + Canal). Note that π (V / # (V )) is s1 . 2.2 Hierarchy is in P Partition has the nice property that its MDL problem is simple. However it does not adequately express many realistic structures. We shall generalize the notion of (granular) partitions to (granular) multilevel hierarchies. Definition 9 (Hierarchy).

A structured set (U, ) is a hierarchy if

˙ 2 ∪ ˙ 3 . . . ∪ ˙ N ,  = 1 ∪ such that for any i ≤ N , (U, i ) is a partition; furthermore, for any i, j ≤ N , we have i < j =⇒ U/i refines U/ j . The integer N is referred as the number of levels or the height of the hierarchy, and (U, i ) the ith level. Example 6. We extend the partition in Figure 2 to form a hierarchy with three levels (N = 3) shown in Figure 3. The first level has 1 being the street names, the second has 2 being the city names, and finally the third level has 3 having only one symbol STORE. ˙ 2 · · · ∪ ˙  N ). First note that it is granular if Consider a hierarchy (U, 1 ∪ ˙ 2 ) is a granular partition. and only if in the first level  1 = U , that is, (U, 1 ∪ N For i < N , we define i = k=i+1 k . The alphabet i contains all symbols in levels higher than the ith level of the hierarchy. We may view i as a universe, and consider (i , i ) as a new hierarchy, with the interpretation function given by [·]i : i → Pwr(i ) : λ → {σ ∈ i : [σ ] ⊆ [λ]}. Let Li denote the propositional language L(i , i ). Much of the discussion regarding partitions naturally applies to hierarchies with some appropriate generalization. Definition 10 (Normal Forms). An expression s ∈ Li is in normal form for − → − → + − the hierarchy if it is of the form s = sˆ + Ai − Ai , where sˆ ∈ Li+1 is the leading subexpression of s, and Ai+ , Ai− ⊆ i . It is trim if sˆ is Li+1 -compact and Ai+ = [s]i − [ˆs]i and Ai− = [ˆs]i − [s]i . We denote (Nhie )i = Nhie (, i ) to be the set of all normal and trim expressions of the hierarchy (i , i ), and let Nhie ≡ (Nhie )1 . ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Concise Descriptions of Subsets of Structured Sets



221

Fig. 4. The filled circles are the selected elements.

Here are some familiar results. PROPOSITION 5.

A normal expression in Li is Li -compact only if it is trim.

The proof of Proposition 5 mirrors that of Proposition 4 exactly. LEMMA 3 (NORMALIZATION). (Nhie )i .

Every expression in Li can be reduced to one in

PROOF. We prove by induction on the construction of expressions in (Nhie )i . The base cases of s = ǫ and s = σ are trivial. − → − → − Suppose that s = s1 + s2 ∈ Li , where sk = sˆk + A+ k − Ak for k = 1, 2. Then let tˆ be an Li+1 -compact expression that sˆ1 + sˆ2− reduces → − → to. Consider the normal expression s′ = tˆ + A+ − A− where A+ = [s]i − [tˆ ]i and + A− = [tˆ ]i − [s]i . Repeating the proof of Lemma 1, we have that A+ ⊆ A+ 1 ∪ A2 − − − ′ and A ⊆ A1 ∪ A2 . Therefore s reduces to s . The cases for s = s1 · s2 and s = s1 − s2 are handled similarly. THEOREM 3.

Let (U, ) be a hierarchy, then for any V ⊆ U , V L = V Nhie .

Theorem 3 follows immediately from Lemma 3. As in the case for partitions, one only needs to focus on the expressions in Nhie since Nhie -compactness implies L-compactness. LEMMA 4 (NECESSARY CONDITION). Let s ∈ (Nhie )i , and V = [s]i . It is (Nhie )i + + # # compact only if i+1 (V ) ⊆ [ˆs]i+1 ⊆ i+1 (V ), where i+1 (V ) and i+1 (V ) are, respectively, the strictly efficient and efficient alphabets in i+1 with respect to V . The (only if) half of the proof of Lemma 2 applies with minimal modifications. Note that Lemma 4 mirrors Lemma 2. It states that the expression s is compact only when sˆ expresses all the efficient symbols in i+1 with respect to V , and never any inefficient ones. It is also worth noting that this condition is not sufficient, unlike the case in Lemma 2, as demonstrated in the following example. Example 7.

Consider the hierarchical structure shown in Figure 4.

Let V = {1, 2, 4, 5}. The expression s = 1 + 2 + 4 + 5 expresses V is normal. Note that 1+ (V ) is empty, so s is also trim, but it is not compact as it can be reduced to s′ = D − (3 + 6). For any i ≤ N , define a partial order  over (Nhie )i , such that for any two expressions s, t ∈ (Nhie )i , s  t ⇐⇒ [s]i = [t]i

and [ˆs]i+1 ⊇ [tˆ ]i+1 .

ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

222



K. Q. Pu and A. O. Mendelzon

PROPOSITION 6. Let s, t be two equivalent expressions in Nhie which satisfy the necessary condition of Lemma 4. Then s  t =⇒ s ≤ t . In other words,

· : (Nhie , ) → (N, ≤) is order preserving. − → − → −→ −→ PROOF. Write s = sˆ + A+ − A− and t = tˆ + B+ − B− , and let V = [s]i = [t]i . # ˙ tˆ ]i+1 By assumption, [ˆs]i+1 and [tˆ ]i+1 are subsets of i+1 (V ). Define Ŵ = [ˆs]i+1 −[ # which is also a subset of i+1 (V ). Recall that A+ = V − [ˆs]i and B+ = V − [tˆ ]i . s  t =⇒ [ˆs]i ⊇ [tˆ ]i =⇒ A+ ⊆ B+ . Furthermore, ˙ (V ∩ [Ŵ]i ) B+ = V − [tˆ ]i = V − [ˆs − Ŵ]i = (V − [ˆs]i ) ∪ + ˙ = A ∪ (V ∩ [Ŵ]i ). ˙ ([Ŵ]i − V ). Therefore, Similarly, we can show that A− = B− ∪ |B+ − A+ | = |B+ | − |A+ | = |V ∩ [Ŵ]i |, |A− − B− | = |A− | − |B− | = |[Ŵ]i − V |. So, s − t = ( ˆs − tˆ ) + (|A+ | − |B+ |) + (|A− | − |B− |) = ( ˆs − tˆ ) − |Ŵ|. − → Observe that sˆ is equivalent to tˆ + Ŵ , so ˆs ≤ tˆ + |Ŵ|. Therefore s ≤ t . Therefore by minimizing with respect to , we are effectively minimizing the length. It is immediate from the definition of  that minimization over  # in (Nhie )i yields maximization of [ˆs]i+1 which is bounded by i+1 ([s]). This leads to the following recursive description of a minimal expression of a set V . COROLLARY 2.

Let minexpi : Pwr(i ) → Li be defined as − → —minexp N (V ) = V , # — for 0 ≤ i < N , minexpi (V ) = πi (V /minexpi+1 (i+1 (V ))) where πi (V /t) denotes −−−−−−→ −−−−−−→ the expression t + (V − [t]i ) − ([t]i − V ). Then for any subset V ⊆ U , minexp0 (V ) is an Nhie -compact expression for V . Here is a bottom-up decomposition procedure to compute a minimal expression in Nhie for a given subset V ⊆ U . Definition 11 (Decomposition Operators). for each i ≤ N :

Define the following mappings

# — i : Pwr(i ) → Pwr(i+1 ) : V → i+1 (V ).

— i+ : Pwr(i ) → Pwr(i ) : V → V − [ i (V )]i , and  [ i (V )]i − V . —i− : Pwr(i ) → Pwr(i ) : V → With these operators and Corollary 2, we can construct a Nhie -compact expression for a given set V with respect to a hierarchy in an iterative fashion. THEOREM 4.

Suppose V ⊆ U . Let

— V1 = V , — Vi+1 = i (Vi ), Wi+ = i+ (Vi ) and Wi− = i− (Vi ), for 1 < i ≤ N . ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Concise Descriptions of Subsets of Structured Sets



223

Fig. 5. The decomposition algorithm for hierarchy.

Define the expressions −→ —sN = VN , −−+ → −−− →  — si−1 = si + Wi−1 − Wi−1 for 1 ≤ i < N .

Then s1 is a Nhie -compact expression expressing V . Corollary 2 follows from simple induction on the number of levels of the hierarchy and showing that at each level the constructed expression satisfies the sufficient condition stated in Proposition 6. Clearly the complexity of construction of s1 is in polynomial time, in fact can be done in O(|| · |V | · log |V |). The algorithm is illustrated in Figure 5. Example 8. Consider the hierarchy in Figure 3. Let V1 = {Victoria, DeRivoli, Grand, Broadway, Market}. The algorithm produces: V2 = {Paris, NewYork}, and W1+ = {Market}, W1− = {Canal}, V3 = {STORE}, W2+ = ∅. W2− = {SanFrancisco} The expressions produced by the algorithm are

—s3 = STORE, — s2 = STORE − SanFrancisco, — s1 = (STORE − SanFrancisco) + Market − Canal. Since s1 is guaranteed compact, V1 = s1 = 4. Note that s1 is not the only compact expression; (NewYork − Canal) + Market + Paris, for instance, is another expression with length 4. 2.3 Multidimensional Partition and Hierarchy An important family of structures is the multidimensional structures. The simplest is the multidimensional partition. Definition 12 (Multidimensional Partition). A cover structure (U, ) is a ˙ N where for ev˙ 2 · · · ∪ multidimensional partition if the alphabet  = 1 ∪ ery i, (U, i ) is a partition as defined in Definition 6. The integer N is the dimensionality of the structure. The hierarchy (U, i ) is the ith dimension. Note the subtle difference between a multidimensional partition and a hierarchy. A hierarchy has the additional constraint that U/i are ordered by granularity, and is in fact a special case of the multidimensional partition, but ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

224



K. Q. Pu and A. O. Mendelzon

as one might expect, and we shall show, the relaxed definition of multidimensional partition leads to a NP-hard MDL-problem. A simple extension of the multidimensional partition is the multidimensional hierarchy. Definition 13 (Multidimensional Hierarchy). A cover structure (U, ) is a ˙ 2 · · · ∪ ˙ N where, for multidimensional hierarchy if the alphabet  = 1 ∪ every i, (U, i ) is a hierarchy as defined in Definition 9. The integer N is the dimensionality of the structure. In this section, we will consider three languages which express subsets of the universe, with successively more grammatic freedom. It will be shown that the MDL decision problem is NP-complete for all three languages. In fact, we will show this on a specific kind of structures  that we call product structures. Intuitively, multidimensional partitions and multidimensional hierarchies make sense when the elements of the universe can be thought of as N -dimensional points, and each of the partitions or hierarchies operates along one dimension. Most of our discussion will focus on the two-dimensional (2D) case (N = 2), which is enough to yield the NP-completeness results. We next define product structures for the 2D case. Definition 14 (2D Product Structure). We say that (U, ) is a 2D product structure if universe U is the cartesian product of two disjoint sets X and Y : ˙ Y . The U = X × Y , and the alphabet  is the union of X and Y :  = X ∪ interpretation function is defined as, for any z ∈ ,  {z} × Y if z ∈ X , [z] = X × {z} if z ∈ Y . Note that the 2D product structure is granular, since the language L(X × Y, ) can express every singleton {(x, y)} ∈ Pwr(U ) by the expression (x · y). The 2D product structure admits two natural expression languages, both requiring the notion of product expressions. Definition 15 (Product Expressions). pression if it is of the form − → − → s = ( A · B ) where A ⊆ X and B ⊆ Y .

An expression s ∈ L is a product ex-

We build up two languages using product expressions. Definition 16 (Disjunctive Product Language). language L P + is defined as

The

disjunctive

product

— ǫ ∈ LP +, — any product expression s belongs to L P + , —if s, t ∈ L P + , then (s + t) ∈ L P + .  It is immediate that any expression s ∈ L P + can be written in the form i∈I si where, for any i, si is a product expression. A generalization of the disjunctive product language is to allow other operators to connect the product expressions. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Concise Descriptions of Subsets of Structured Sets

Definition 17 (Propositional Product Language). uct language L P is defined as



225

The propositional prod-

— ǫ ∈ LP , —any product expression s belongs to L P , —if s, t ∈ L P , then (s + t), (s − t), (s · t) ∈ L P . Obviously L P +  L P  L. Example 9.

Consider a 2D product structure with CITY = { New York, San Francisco, Paris},

and PRODUCT = { Clothing, Beverage, Automobile}. The universe U = CITY × PRODUCT consists of the nine pairs of city name and product family:  U= (NewYork, Clothing), (NewYork, Beverage), (NewYork, Automobile), (SanFrancisco, Clothing), (SanFrancisco, Beverage), (SanFrancisco, Automobile),



(Paris, Clothing), (Paris, Beverage), (Paris, Automobile) .

The alphabet  consists of six symbols ˙ PRODUCT  = CITY ∪ = {NewYork, SanFrancisco, Paris, Clothing, Beverage, Automobile}. The interpretation of a symbol is the pairs in U in which the symbol occurs.   ( NewYork, Beverage)   For instance, [ Beverage] = ( SanFrancisco, Beverage) .   ( Paris, Beverage) Consider the following expressions in L(U, ):

—s1 = ((NewYork + Paris) · Clothing) + (NewYork · Beverage), — s2 = ((NewYork + Paris) · (Clothing + Beverage)) − (NewYork · Clothing), —s3 = NewYork − Beverage. The expressions s1 ∈ L P + , s2 ∈ L P − L P + , and s3 ∈ L − L P . They are evaluated to  [s1 ] = {(NewYork, Clothing), (Paris, Clothing), (NewYork,Beverage)}, and [s2 ] = (NewYork, Beverage), (Paris, Clothing), (Paris, Beverage) . The last expression s3 is a bit tricky—it contains all tuples of NewYork that are not Beverage, so [s3 ] = {(NewYork, Clothing), (NewYork, Automobile)}. We will see that the MDL decision problem for each of these languages is NP-complete. 2.4 The L P -MDL Decision Problem is NP-Complete In this section, we prove that the MDL problems for L P + and L P are NPcomplete. It’s obvious that they are all in NP. The proof of NP-hardness is by a reduction from the minimal three-set cover problem. Recall that an instance of minimal three-set cover problem consists of a set cover C = {C1 , C2 , . . . , Cn } ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

226



K. Q. Pu and A. O. Mendelzon

Fig. 6. A set cover with three cells.

Fig. 7. The transformed instance of the MDL problem of 2D product structure.

where (∀C ∈ C)|C| = 3 and anintegerk > 0. The question is if there exists a subcover D ⊆ C such that D = C and |D| ≤ k. This is known to be NP-complete [Garey and Johnson 1979]. From this point on, we fix the instance ofthe minimum cover problem (C, k). Write C = {C1 , C2 , . . . , Cn }. Define X = C, and for each i ≤ n, let Y i be a set such that |Y i | = m > 3. The family {Y i }n is made disjoint. Let Y = ˙ i≤n Y i ) ∪ ˙ { y ∗ }, where y ∗ does not belong to any Y i . The structure is the 2D (∪ product structure of X × Y . The subset to be represented is given by V = ∪i≤n (Ci × Y i ) ∪ ( X × { y ∗ }). It is not difficult to see that this is a polynomial time reduction. Example 10. Consider a set X = {A, B, C, D, E}, and a cover C = {C1 , C2 , C3 } where C1 = {A, B, C}, C2 = {C, D, E} and C3 = {A, C, D}, as shown in Figure 6. It is transformed by first constructing Y 1 , Y 2 , and Y 3 , all disjoint and each ˙ Y2 ∪ ˙ Y3 ∪ ˙ { y ∗ }. The structure is the 2D with four elements. Then let Y = Y 1 ∪ ˙ (C2 × Y 2 ) ∪ ˙ (C3 × product structure of X and Y . The subset V = (C1 × Y 1 ) ∪ ˙ (X × { y ∗ }). It is shown as the shaded boxes in Figure 7. Y 3) ∪ It turns out that for this very specific subset V , one can characterize the form of the compact expressions that express V in L P . LEMMA 5. Let V be a subset resulted from the reduction from a set cover problem (depicted in Figure 7). Then all L P -compact expressions of V are in the form of − − → → − → → − s= ( Ci · Y i ) + (C j · Y j∗ ), i∈I

j ∈J

˙ { y ∗ }, and I ∩ J = ∅, and I ∪ ˙ J = {1, 2, . . . , n}. where Y j∗ = Y j ∪

ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Concise Descriptions of Subsets of Structured Sets



227

Note that, by Lemma 5, the L P -compact expressions of V do not make use of the negation “−” and conjunction “·” operators between product expressions; hence they belong to L P + . Example.

For subset V in Figure 7, the expression → → − → − − → − − → − → s = ( C1 · Y 1∗ ) + ( C2 · Y 2∗ ) + ( C3 · Y 3 )

is both L P and L P + -compact. Therefore V L P = V L P + = (3 + 5) + (3 + 4) + (3 + 5) = 23. The proof of Lemma 5 is by ruling out all other possible forms. Before delving into the details of the proof of Lemma 5, let’s use it to prove the NP-hardness of the L P + -MDL and L P -MDL problem. THEOREM 5. L P + -MDL and L P -MDL’s are NP-complete for multidimensional partitions. PROOF. This follows from Lemma 5. As we mentioned, V L P + = V L P . Let s be a L P -compact expression of V . Since − − → → − → → − s= ( Ci · Y i ) + (C j · Y j∗ ), i∈I

j ∈J

its length is s = i≤n (|Ci | + |Y i |) + |J | = (3 + m)n + |J |. Since [s] = V , it is  → − → − necessarily the case that X × { y ∗ } ⊆ [ j ∈J (C j · Y j∗ )], or that {C j } j ∈J covers X . Minimizing s with s in the given form is equivalent to minimization of |J |, or finding a minimal cover of X , which is of course the objective of the minimum set cover problem. 

The proof of Lemma 5 makes use of the following results. Definition 18 (Expression Rewriting). Let σ be a symbol, and t an expression. The rewriting, written · : σ → t is a function L → L : s → s : σ → t, defined inductively as — ǫ : σ → t = ǫ, — for any symbol σ ′ ∈ , σ ′ : σ → t =



t σ′

if σ ′ = σ , else,

— for any two strings s, s′ ∈ L, s  s′ : σ → t = s : σ → ts′ : σ → t, where  can be +, −, or ·. Basically s : σ → t replaces all occurrences of σ in s by the expression t. Definition 19 (Extended Expression Rewriting). Given a set of symbols 0 ⊆ , and t an expression that does not make use of symbols in 0 , then s : 0 → t is the expression of replacing every occurrence of symbols in 0 by the expression t. PROPOSITION 7 (SYMBOL REMOVAL).

PROPOSITION 7 (SYMBOL REMOVAL). For any expression s ∈ L_P and any symbol z ∈ X ∪ Y, [s : z → ǫ] = [s] − [z]. In other words, s : z → ǫ ≡ s − z.


PROOF. We prove the claim by induction on the number of product expressions in s. Suppose s = A⃗ · B⃗, where A ⊆ X and B ⊆ Y. Without loss of generality, say z ∈ A; then [s : z → ǫ] = (A − {z}) × B = A × B − [z] = [s] − [z]. The induction step goes as follows:

[(t + t′) : z → ǫ] = [(t : z → ǫ) + (t′ : z → ǫ)] = ([t] − [z]) ∪ ([t′] − [z]) = ([t] ∪ [t′]) − [z] = [t + t′] − [z].

Similar arguments apply to the cases t − t′ and t · t′.

We need to emphasize that Proposition 7 does not apply to expressions in L in general. For instance, if s = x and z = y, we have that x : y → ǫ = x ≢ x − y.

PROPOSITION 8 (SYMBOL ADDITION). Let s ∈ L_P and x, x′ ∈ X, where x′ does not occur in s. Then

[s : x → x + x′] = [s] ∪̇ ({x′} × [s](x)),

where [s](x) = {y ∈ Y : (x, y) ∈ [s]}. Similarly,

[s : y → y + y′] = [s] ∪̇ ([s](y) × {y′}).

PROOF. As a notational convenience, let us fix x, x′ ∈ X and write ↑s = s : x → x + x′ and d(s) = {x′} × [s](x). Let ◦ be +, −, or ·, and let x′ occur in neither s nor s′; then, by simple arguments, [s ◦ s′](x) = [s](x) ◦ [s′](x). It follows that

d(s ◦ s′) = {x′} × [s ◦ s′](x) = ({x′} × [s](x)) ◦ ({x′} × [s′](x)) = d(s) ◦ d(s′).

So d(·) distributes over +, −, and ·.

We now prove Proposition 8 by induction on the number of product expressions in s. For s = ǫ or s = A⃗ · B⃗, it is obvious. Suppose that s = t + t′; then

[↑s] = [↑t] ∪ [↑t′] = ([t] ∪̇ d(t)) ∪ ([t′] ∪̇ d(t′)) = [t + t′] ∪ d(t + t′).

This is not sufficient yet, since we need to show that the union of [t + t′] and d(t + t′) is a disjoint one. It is not too difficult: recall that d(t + t′) = {x′} × [t + t′](x), but x′ occurs neither in t nor in t′, and therefore not in s. Since t, t′ ∈ L_P, we get [t + t′] ∩ [x′] = ∅.

The cases s = t − t′ and s = t · t′ are handled similarly. We only wish to remark that, for these two cases, it is important to have the disjointness of d(t) from both [t] and [t′].

Again, Proposition 8 does not generalize to L. As a counterexample, take s = x + y. Then ↑s = x + x′ + y, so [s](x) = Y. Indeed [↑s] = [s] ∪ d(s), but the union is not disjoint, since d(s) ∩ [s] = {(x′, y)}.

We now prove Lemma 5 using the results in Proposition 7 and Proposition 8.


PROOF. Let us first define #z(s) to be the number of occurrences of the symbol z in the expression s. Consider the reduction from an instance of the minimum three-set cover problem: we have the instance C = {C_i}_{i∈I} with |C_i| = 3 for all i ∈ I. The reduction produces the universe X × Y, where X = ∪_i C_i and Y = (∪̇_{i∈I} Y_i) ∪̇ {y*}, the Y_i being disjoint and |Y_i| > 3. The subset to be represented is V = (∪̇_i (C_i × Y_i)) ∪̇ (X × {y*}). Let s be an L_P-compact expression of V.

Claim I: (∀i ≤ n)(∃y ∈ Y_i) #y(s) = 1.

By contradiction, suppose that (∃i)(∀y ∈ Y_i) #y(s) > 1. Then let s′ = (s : Y_i → ǫ) + (C⃗_i · Y⃗_i). By Proposition 7,

[s′] = ([s] − [Y_i]) ∪ (C_i × Y_i) = ([s] − [Y_i]) ∪ ([s] ∩ [Y_i]) = [s].

So s′ is equivalent to s, but it is shorter in length:

‖s′‖ = ‖s‖ − Σ_{y∈Y_i} #y(s) + |C_i| + |Y_i|
     ≤ ‖s‖ − Σ_{y∈Y_i} 2 + |C_i| + |Y_i|
     = ‖s‖ − 2|Y_i| + |C_i| + |Y_i|
     = ‖s‖ + (|C_i| − |Y_i|) < ‖s‖.

Therefore s strictly reduces to s′, which contradicts the compactness of s.

Claim II: (∀i)(∀y ∈ Y_i) #y(s) = 1.

For contradiction, assume (∃i)(∃y ∈ Y_i) #y(s) ≥ 2. By Claim I, for this i there exists at least one z ∈ Y_i such that #z(s) = 1. Define s_1 = s : y → ǫ and s′ = s_1 : z → z + y. We show that s reduces strictly to s′. First note that [s_1] = [s] − [y], and [s′] = [s_1] ∪̇ ([s_1](z) × {y}). However, [s_1](z) = ([s] − [y])(z) = [s](z) − [y](z) = [s](z), since [y](z) = ∅. So, using [s](z) = C_i = [s](y) (which holds because [s] = V),

[s′] = ([s] − [y]) ∪ ([s](y) × {y}) = ([s] − [y]) ∪ ([s] ∩ [y]) = [s].

In terms of its length, ‖s′‖ = ‖s_1‖ + 1 = ‖s‖ − #y(s) + 1 < ‖s‖, again a contradiction.

Since each y ∈ ∪_i Y_i must occur exactly once in s, s must then be of the form claimed. This proof works for both L_P- and L_P^+-compactness.

2.5 The General L-MDL Problem is NP-Complete

As mentioned, the symbol removal and addition rules do not hold in general for expressions in L and, as a result, it is not guaranteed that the minimal expression for V is of the form prescribed in Lemma 5. Here is an example.

Example. Consider once again the subset V in Figure 7, and an expression in L but not in L_P:

s = (A + B) · Y⃗_1 + (D + E) · Y⃗_2 + (A + D) · Y⃗_3 + C + y*.


Note that [s] = V , but certainly s is not of the form given in Lemma 5. Its length is s = (2 + 4) + (2 + 4) + (2 + 4) + 1 + 1 = 20. Therefore in this case we have that V L < V L P . The richness of L prevents us from using Lemma 5 to arrive at the NPhardness of the L-MDL decision problem. We have to modify the reduction from the minimum three-set cover problem, and deal with the expressions in greater detail. Definition 20 (Domain Dependency). Let X 0 = X and Y 0 = Y as defined in the reduction from a minimum cover problem. Define a sequence of sets ˙ {αk } X 0 , X 1 , X 2 , . . . , and Y 0 , Y 1 , Y 2 , . . . , such that for all k ≥ 0, X k+1 = X k ∪ ˙ and Y k+1 = Y k ∪ {βk }, where αk and βk are two symbols that do not belong to X k and Y k , respectively. We therefore have a family of 2D product structures {X k × Y k } with the propositional languages L0  L1  L2 · · ·. Let s ∈ Lk , for k ′ ≥ k, and write [s]k ′ to be the evaluation of the expression s in the language Lk ′ . For any k ≥ 0, we say that s ∈ Lk is domain independent if ∀k ′ > k · [s]k ′ = [s]k . If s ∈ Lk is not domain independent, then it’s domain dependent. The notion of domain dependency naturally bipartitions the languages. Let LkI = {s ∈ Lk : s is domain independent.}, and LkD = {s ∈ Lk : s is domain dependent.}. Given an expression s, whether it is domain dependent or not depends on the set of unbounded symbols, defined below. Definition 21 (Bounded Expressions). Let s be an expression in a propositional language. The set of unbounded symbols of s, U(s) is a set of symbols that appear in s, defined as — U(ǫ) = ∅, — U(σ ) = {σ }, and — U(t + t ′ ) = U(t) ∪ U(t ′ ), U(t − t ′ ) = U(t) − U(t ′ ), U(t · t ′ ) = U(t) ∩ U(t ′ ). In case U(s) = ∅, we say that s is a bounded expression, or that it is bounded; otherwise s is unbounded. An expression s ∈ Lk can be demoted to an expression in Lk−1 by erasing the symbols αk−1 and βk−1 so the resulting expression is one in Lk−1 . Let’s write ↓ kk−1 s = s : αk−1 → ǫ : βk−1 → ǫ. Therefore ↓ kk−1 : Lk → Lk−1 . The following is a useful fact. PROPOSITION 9. X k−1 × Y k−1 .

For s ∈ Lk , [↓

k k−1 s]k−1

= [s]k ∩ Uk−1 where Uk−1 =

The proof of Proposition 9 is by straightforward induction on s in Lk . While s ∈ Lk can be demoted to Lk−1 , it can also be promoted to Lk+1 without any syntactic modification. Of course, when treated as an expression in Lk+1 , it has a different evaluation. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


˙ (U X (s) × {βk }) ∪ ˙ ({αk } × UY (s)), PROPOSITION 10. For s ∈ Lk , [s]k+1 = [s]k ∪ where U X (s) = U(s) ∩ X k and UY (s) = U(s) ∩ Y k . ˙ ({αk } × UY (s)). So the result reads PROOF. We write δk (s) = (U X (s) × {βk }) ∪ k+1 ˙ [s]k+1 = [s]k ∪ δk (s). Note [s]k = [↓ k s]k = [s]k+1 ∩ Uk , which is disjoint from δk (s); hence the union is disjoint. The union inside δk (s) is disjoint for the obvious reason that αk ∈ X k and βk ∈ Y k . By straightforward set manipulations, we can show that δk (t t ′ ) = δk (t)δ(t ′ ) for any t, t ′ ∈ Lk and  be +, − or ·. The rest of the proof is by induction on the construction of s mirroring exactly that in Proposition 8 with δk (s) in the place of d (s). COROLLARY 3. An expression is domain independent if and only if it is bounded, that is, ∀k.s ∈ LkI ⇐⇒ U(s) = ∅. PROOF.

[s]k+1 = [s]k ⇐⇒ U X (s) = UY (s) = ∅ ⇐⇒ U(s) = ∅.

Another result that follows from Proposition 9 and Corollary 3 is the following. COROLLARY 4. If s is domain-independent in Lk , then for all (x, y) ∈ [s]k , both x and y must appear in s. PROOF. Let s ∈ LkI . We show the contrapositive statement: if x or y does not appear in s, then (x, y) ∈ [s]k . Let’s say x does not appear in s (the case for y is by symmetry). Since s is domain independent, and U(↓ kk−1 s) ⊆ U(s) = ∅, ↓ kk−1 s is also domain independent. We can make the arbitrarily removed symbols αk−1 and βk−1 to be x and some z which does not appear in s either, respectively. This means that ↓ kk−1 s = s, and (x, y) ∈ Uk ⊇ [↓ kk−1 s]k−1 = [s]k by Proposition 9. The importance of domain dependency of expressions is demonstrated by the following results. LEMMA 6. Let V ⊆ X 0 × Y 0 , V LkI and V LkD be the lengths of its shortest domain-independent and domain-dependent expressions in Lk , respectively. We have I ∀k ≥ 0. V LkI ≥ V Lk+1 , and D . ∀k ≥ 0. V LkD  V Lk+1 I PROOF. It’s easy to see why V LkI ≥ V Lk+1 : let s be a LkI -compact expresI sion of V . Since s ∈ Lk+1 and [s]k+1 = [s]k = V , it also is the case that s ∈ Lk+1 ; I hence V Lk+1 ≤ s = V LkI . D D , let s be a L To show V LkD  V Lk+1 k+1 -compact expression for V . By

Proposition 9, [↓ kk+1 s]k = [s]k+1 ∩ Uk = V . It’s not difficult to see that αk+1 and k+1 βk+1 do not appear in s, so U(↓ k+1 k s) = U(s) = ∅. That is, ↓ k s expresses V and is domain dependent. k+1 Next we show that ↓ k+1 k s < s by contradiction: if ↓ k s = s , then k+1 k+1 ↓ k s = s since ↓ k s is formed by removing symbols from s. Therefore we k+1 have that [↓ k+1 k s]k+1 = [s]k+1 = [↓ k s]k . But by Proposition 10, this means ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


that U(↓^{k+1}_k s) = ∅, which is a contradiction. Therefore,

‖V‖_{L_k^D} ≤ ‖↓^{k+1}_k s‖ < ‖s‖ = ‖V‖_{L_{k+1}^D}.

This concludes the proof.

COROLLARY 5. For any V ⊆ X_0 × Y_0, ∀k > 2|V| : ‖V‖_{L_k^I} < ‖V‖_{L_k^D}.

In other words, ∀k > 2|V| : ‖V‖_{L_k} = ‖V‖_{L_k^I}. Therefore, by enlarging the dimensions X and Y by adding 2|V| new symbols to each, we are guaranteed that all L-compact expressions are domain independent, and hence bounded. The reason to force the compact expressions to be domain independent is so that we can reuse the symbol removal and addition rules of Proposition 7 and Proposition 8. From this point on, it is understood that the domain has been enlarged to U_k for some k > 2|V|, and the subscript k is dropped; for instance, we write L for L_k.

PROPOSITION 11. Let s ∈ L^I. Then,

(1) [s : z → ǫ] = [s] − [z]. ˙ (z ′ × [s](z)). (2) If z ′ does not occur in s, then [s : z → z + z ′ ] = [s] ∪ PROOF. For (1), suppose z is the symbol to be replaced with ǫ. One can show that for all subexpressions s′ of s, for all x ∈ X and y ∈ Y , if x = z and y = z then (x, y) ∈ [s′ ] ⇐⇒ (x, y) ∈ [s′ : z → ǫ] by induction on the subexpressions of s. Therefore we immediately have [s] − [z] ⊆ [s : z → ǫ]. For the other containment, observe that U(s : z → ǫ) ⊆ U(s) = ∅, so s : z → ǫ is also domain indenpendent by Corollary 3. It follows, then, that every point (x, y) ∈ [s : z → ǫ] cannot be in [z] by Corollary 4, so x = z and y = z; therefore (x, y) ∈ [s] − [z]. For (2), z is to be replaced with z +z ′ where z ′ does not occur in s. Without loss of generality, say z ∈ X . By induction, we can show that for all subexpressions ′ ′ s′ of s, for all y ∈ Y , we have (z, y) ∈ [s′ ] ⇐⇒ (z ′ , y)  ∈ [s : z → z + z ]. It ′ ′ then follows that [s : z → z + z ] = [s] ∪ [s](z) × z . The disjointness of the union comes the fact that, since s is domain independent, [s] ∩ [z ′ ] = ∅ since z ′ does not appear in s. This allows us to repeat the arguments as in Lemma 5 to obtain the following. LEMMA 7.

LEMMA 7. There exists an L^I-compact expression for V of the form

s = Σ_{i∈I} (C⃗_i · Y⃗_i) + Σ_{j∈J} (C⃗_j · Y⃗_j*).    (∗)

SKETCH OF PROOF. Let s be an L^I-compact expression for V. Following the arguments presented in the proof of Lemma 5, using the symbol addition and removal rules of Proposition 11, we obtain (∀i)(∀y ∈ Y_i) #y(s) = 1.

It is still possible that s is not of the form (∗), for L^I is flexible enough that there is no guarantee that, for each i, all y ∈ Y_i occur consecutively to form Y⃗_i.


But we can always rewrite s to that form. For each i, pick some y_i ∈ Y_i and let Y_i′ = Y_i − {y_i}. First rewrite s to s′ = s : Y_i′ → ǫ; thus s′ results from s by replacing all occurrences of symbols y ∈ Y_i with y ≠ y_i by the empty string ǫ. Then construct s″ = s′ : y_i → Y⃗_i. One can easily show that [s″] = [s] and ‖s″‖ = ‖s‖, and that in s″ all occurrences of symbols of Y_i appear grouped as Y⃗_i or as the sum over Y_i ∪ {y*}. Each of these is necessarily individually bounded by C⃗_i. Therefore s″ is of the form (∗).

Finally we arrive at the more or less expected result:

THEOREM 6. The L-MDL decision problem is NP-complete for multidimensional partitions.

3. THE ORDER-STRUCTURE AND LANGUAGES

So far, all aforementioned structures are cover structures, namely, structures characterized by a set cover on the universe. Another important family of structures is the order-structure, where structures are characterized by a family of partial orders on the universe.

Definition 22 (Order-Structured Set and Its Language). An order-structured set is a set equipped with partial order relations (U, ≤_1, ≤_2, ..., ≤_N). The language L(U, ≤_1, ..., ≤_N) is given by
— ǫ is an expression in L(U, ≤_1, ..., ≤_N),
— for any a ∈ U, a is an expression in L(U, ≤_1, ..., ≤_N),
— for any a, b ∈ U and 1 ≤ i ≤ N, (a →_i b) is an expression in L(U, ≤_1, ..., ≤_N),
— (s + t), (s − t), and (s · t) are all expressions in L(U, ≤_1, ..., ≤_N), given that s, t ∈ L(U, ≤_1, ..., ≤_N), and
— nothing else is in L(U, ≤_1, ..., ≤_N).

When no ambiguity arises, we write L(U, ≤_1, ..., ≤_N) as L. Similar to the propositional language for cover-structured sets, we define expression evaluation and length for the language L(U, ≤_1, ..., ≤_N).

Definition 23 (Semantics and Length). The evaluation function [·] : L(U, ≤_1, ..., ≤_N) → Pwr(U) is defined as
— [ǫ] = ∅,
— [a] = {a} for any a ∈ U,
— [a →_i b] = {c ∈ U : a ≤_i c and c ≤_i b},
— [s + t] = [s] ∪ [t], [s − t] = [s] − [t], and [s · t] = [s] ∩ [t].

The length ‖·‖ : L(U, ≤_1, ..., ≤_N) → N is given by ‖ǫ‖ = 0, ‖a‖ = 1, ‖a →_i b‖ = 2, and ‖s + t‖ = ‖s − t‖ = ‖s · t‖ = ‖s‖ + ‖t‖.

Example 11. Consider a universe of names for cities: Toronto (TO), San Francisco (SF), New York City (NYC), and Los Angeles (LA); U = {TO, SF, NYC, LA}. We consider three orders.


Fig. 8. A set cover.

First, they are ordered from east to west: NYC ≤_1 TO ≤_1 LA ≤_1 SF. Independently, they are also ordered from south to north: LA ≤_2 SF ≤_2 NYC ≤_2 TO. Finally, we know that San Francisco (SF) is much smaller in population than Toronto (TO) and Los Angeles (LA), which are comparable, and in turn New York City (NYC) has the largest population by far. Therefore, by population, we order them partially as

SF ≤_3 TO, SF ≤_3 LA, TO ≤_3 NYC, LA ≤_3 NYC,

but TO and LA are incomparable with respect to ≤_3. The following are expressions in L(U, ≤_1, ≤_2, ≤_3):
— s_1 = LA →_2 TO: the cities north of LA and south of TO, inclusively; [s_1] = U.
— s_2 = (SF →_3 NYC) − (SF + NYC): the cities larger than SF but smaller than NYC, so [s_2] = {TO, LA}.
— s_3 = (NYC →_1 LA) · (LA →_2 NYC) − (NYC + LA): the cities strictly between NYC and LA in both latitude and longitude; [s_3] = ∅.

The notions of compactness and the MDL problem naturally extend to expressions over order structures. Unfortunately, the general L(U, ≤)-MDL problem is intractable even with one order relation.

PROPOSITION 12. Even with one partial order ≤, the L(U, ≤)-MDL decision problem is NP-complete.

SKETCH OF PROOF. We reduce from the minimum set cover problem. Let C = {C_i}_{i∈I}, where, without loss of generality, we assume that each C_i has at least five elements that are not covered by the other sets {C_j : j ≠ i}. This can always be ensured by duplicating each element of the ground set into five distinct copies. The universe of our order-structured set is U = ∪_{i∈I} (C_i ∪̇ {⊤_i, ⊥_i}): for each cover set C_i we introduce two new symbols ⊤_i and ⊥_i. The ordering ≤ is defined as (∀i ∈ I)(∀c ∈ C_i) c < ⊤_i and ⊥_i < c; nothing else is comparable.

Consider the instance of a set-cover problem shown in Figure 8. We first duplicate each element into five copies, obtaining the instance shown in Figure 9. The resulting order-structure is shown in Figure 10. The subset to be expressed is ∪_{i∈I} C_i, and its L(U, ≤)-compact expression is always of the form

s = Σ_{j∈J} ((⊥_j → ⊤_j) − {⊤_j, ⊥_j}⃗).


Fig. 9. Each element is duplicated.

Fig. 10. The transformed order-structure.

It will not mention individual elements of any C_i: by the symmetry of the problem, if one copy were mentioned then all of its copies would be, and that would use five symbols, which is longer than (⊥_i → ⊤_i) − {⊤_i, ⊥_i}⃗. The length of s is then 4|J|, where |J| is the number of cover sets needed to cover ∪_{i∈I} C_i. Minimizing |J| is thus equivalent to minimizing ‖s‖.

3.1 Linear Ordering Is in P

We say that an order-structure (U, ≤) is linear if there is only one ordering and it is linear, that is, if every two elements u, u′ ∈ U are comparable. In this case (U, ≤) forms a chain, and, not surprisingly, the MDL problem is solvable in polynomial time. The formal argument for this statement is analogous to that for partitions. In this section, we fix the structure (U, ≤) to be linear.

Definition 24 (Closure and Segments). Let A ⊆ U. Its closure Ā is defined as Ā = {u ∈ U : (∃a, b ∈ A) a ≤ u and u ≤ b}. A segment is a subset A of U such that A = Ā. The length of a segment is simply |A|.

Segments are particularly easy to express: if A is a segment of length greater than 2, then ‖A‖_{L(U,≤)} = 2 always, since it can be expressed by the expression (min A → max A) using only two symbols. A segment of V is simply a segment A such that A ⊆ V. We denote the set of maximal segments in V by SEG(V); note that maximal segments are pairwise disjoint. The set SEG(V) also has a natural compact expression, Σ_{A∈SEG(V)} (min A → max A), which from now on we call SEG⃗(V).

Example. Consider a universe U with 10 elements linearly ordered by ≤. We simply call them 1 to 10, and ≤ is the ordering of the natural numbers. Let V be {2, 4, 5, 7, 8}, shown in Figure 11. The segments of V are {2}, {4, 5}, and {7, 8}, and SEG⃗(V) = (2 → 2) + (4 → 5) + (7 → 8).

PROPOSITION 13. For any two subsets A and B, and for ◦ being any of ∪, ∩, or −, we have |SEG(A ◦ B)| ≤ |SEG(A)| + |SEG(B)|. Therefore,

‖SEG⃗(A ◦ B)‖ ≤ ‖SEG⃗(A)‖ + ‖SEG⃗(B)‖.
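As a small illustration (ours, not from the paper; it assumes the universe is given as a list in increasing order), the maximal segments of the subset V from Figure 11 can be computed by scanning for runs of consecutive elements.

def maximal_segments(V, universe):
    """Return SEG(V): the maximal segments (runs) of V, as (min, max) pairs,
    for a universe listed in increasing order."""
    segments, current = [], []
    for u in universe:
        if u in V:
            current.append(u)
        elif current:
            segments.append((current[0], current[-1]))
            current = []
    if current:
        segments.append((current[0], current[-1]))
    return segments

U = list(range(1, 11))
V = {2, 4, 5, 7, 8}
print(maximal_segments(V, U))   # [(2, 2), (4, 5), (7, 8)], i.e. (2->2)+(4->5)+(7->8)

Each pair corresponds to one term (min A → max A), so the expression above has length 6 for this V.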


Fig. 11. A subset V of the universe: filled elements belong to V .

One might at first be tempted to just express a set V by its segment de−−→ composition SEG(V ). But we can in general do better than that. For instance, consider the previous example with V shown in Figure 11. The expression −−→ SEG(V ) has a length of 6, but V can be expressed by only four symbols by s = (4 → 8) + 2 − 6 or (2 → 8) − (3 + 6). For the remainder of this section, we fix the subset V and assume that V does not contain the extrema max U, min U of U . This restriction on V relieves us from considering some trivial cases, and can be lifted without loss of generality. Definition 25 (Normal Form for Linear Order—Structures). The normal form is the sublanguage Nlin of L(U, ≤) consisting of expressions of the form − → − → s = t + A+ − A− ,  ′ + where the subexpression t = i (ai → ai ) is a union of segments and A = − [s] − [t] and A = [t] − [s]. LEMMA 8. For the linear order-structure (U, ≤), every expression of L(U, ≤) can be reduced to an expression in Nlin . OUTLINE OF PROOF. The proof is very similar to Lemma 1. It is by induction. The base cases of s = ǫ and s = u for u ∈ U are trivial. − → − → − → − → − + − Let s1 = t1 + A+ 1 − A1 and s2 = t2 + A2 − A2 be two expressions already in Nlin . We need to show that s1 + s2 , s1 − s2 and s1 · s2 are all reducible to Nlin . s = s1 + s2 : Let t = SEG([t1 ] ∪ [t2 ]), A+ = [s] − [t], and A− = [t] − [s]. Then − + + − we have that t ≤ t1 + t2 , A+ ≤ A+ 1 + A2 , and A ≤ A1 + A2

(as was the case in the proof of Lemma 1). The other two cases are handled similarly.

COROLLARY 6. ‖V‖_L = ‖V‖_{N_lin}.

Therefore the L(U, ≤) MDL-problem reduces to the Nlin MDL-problem when the ordering is linear. We only need to show that the latter is tractable. Definition 26 (Neighbors, Holes, Isolated, Interior, and Exterior Points). Consider an element u in the universe U . We define  max{u′ ∈ U : u′ < u} if u = min U , u−1 = undefined if u = min U .,  ′ ′ min{u ∈ U : u > u} if u = max U , u+1 = undefined if u = max U ., to be the immediate predecessor and the immediate successor, respectively. We say that u ∈ U is a hole in V if u ∈ V − V but {u − 1, u + 1} ⊆ V . The set of all holes of V is denoted by Hol(V ). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


An element u ∈ U is an isolated point in V if u ∈ V but u − 1, u + 1 ∈ V . The set of all isolated points is denoted by Pnt(V ). An interior point u of V is when u ∈ V and at least one of u − 1 or u + 1 is also in V . All the interior points of V is Int(V ). Conversely, an exterior point of V is an element u ∈ U such that u ∈ V and {u − 1, u + 1} − V = ∅. Ext(V ) are all the exterior points of V . Example. Consider the subset V in the universe in Figure 11. Observe that Hol(V ) = {3, 6}, Pnt(V ) = {2}, Int(V ) = {4, 5, 7, 8} and Ext(V ) = {1, 9, 10}. Note that the universe is partitioned into holes, isolated, interior, and exterior points of V . These concepts allow us to define extended segments of V which are very useful in constructing a compact expression of V . Definition 27 (Extended Segments). A subset A is an extended segment of V if A ⊆ V ∪ Hol(V ), A = A, and A ∩ Int(V ) = ∅. So an extended segment is a segment that can only contain elements of V and holes in V , and must contain at least one interior point of V . Observe that the maximally extended segments of V are pairwise disjoint. The set of the maximally extended segments is denoted by XSEG(V ). The expression  −−−−→ A∈XSEG(V ) (min A → max A) is denoted by XSEG(V ).

Example. Again, consider V Figure 11. The extended segments in V are {2, 3, 4, 5}, {4, 5, 6, 7, 8}, {5, 6, 7} · · ·. In general, there could be many maximally extended segments, but in this case there is only one : {2, 3, 4, 5, 6, 7, 8}. −−−−→ Therefore XSEG(V ) = (2 → 8). − → − → −−−−→ − THEOREM 7. An expression s∗ = t∗ + A+ ∗ − A∗ , where t∗ = XSEG(V ), − A+ ∗ = V − [t] and A∗ = [t] − V is compact for V in Nlin . SKETCH OF PROOF. We show that any expression s ∈ Nlin for V can be reduced to s∗ . The proof is by describing explicitly a set of rewrite procedures that take any expressions of V in the normal form and reduce it to s∗ . Without loss of generality, we assume that all segments in t are of length of at least two. (1) First we make sure that all the segments a → a′ in t are such that a, a′ ∈ V , and all segments are disjoint: this can be done without increasing the length of the expression. (2) Remove exterior points from [t]: if there is an exterior point u in [t], then it appears in some a → a′ in t. Since u ∈ Ext(V ), at least one of its neighbors u′ ∈ {u − 1, u + 1} must also be exterior to V and appear in a → a′ . They must then appear in A− . Rewrite a → a′ to at most two segments a → b and b′ → a′ such that u and its neighbor u′ are no longer included in t. This increases the length of t by at most 2. We then remove u, u′ from A− . The overall expression length is not increased. (3) Add all interior points to [t]: if there is an interior point u that is not in [t], then it must appear in A+ . Since u ∈ Int(V ), there is a neighbor u′ ∈ {u − 1, u + 1} ∩ V . If u′ ∈ [t], then it is in A+ as well. In this case, create a new segment u → u′ (or u′ → u if u′ = u − 1) in t, and delete u, u′ from A+ . If u′ ∈ A+ , then it must appear in a segment a → a′ in t. Extend the segment to include u, and delete u from A+ . ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

238



K. Q. Pu and A. O. Mendelzon

(4) Remove all segments in [t] not containing interior points: if there is a segment a → a′ in t with [a → a′] ∩ Int(V) = ∅, then it must contain only isolated points and holes of V, but no exterior points (by step 2). Furthermore, since the end points a, a′ ∈ V (by step 1), there is one more isolated point than there are holes in [a → a′]. The holes appear in A−. Delete a → a′ from t, the holes from A−, and add the isolated points to A+. The overall expression length is then reduced by 1. At this point, observe that all segments in t contain some interior points and no exterior points, and hence are extended segments of V. Therefore, [t] ⊆ ∪XSEG(V).

(5) Add ∪XSEG(V) − [t] to [t]: consider u ∈ ∪XSEG(V) − [t], and let u ∈ A ∈ XSEG(V). The segment A must contain an interior point v, which must appear in some segment [a → b] in t. It is always possible to extend a → b (and possibly merge it with neighboring segments in t) to cover u. The extension will include some holes and isolated points, which need to be added to A− and removed from A+, respectively. This can always be done without increasing the length of the expression.

By the end of the rewriting we have [t] = ∪XSEG(V), and clearly the minimal expression for [t] is XSEG⃗(V).

COROLLARY 7. The L(U, ≤) MDL-problem can be solved in linear time for linear order-structures.

3.2 Multilinear Ordering Is "Hard"

It is not terribly realistic to consider only a single ordering of the universe. There are often many: we may order people by age, or by their names, or by some other attributes. In this section, we introduce multiorder structures and the corresponding language. In this case the MDL-problem is hard even when we only have two linear orders.

Definition 28 (2-Linear Order-Structure). Consider the universe U = X × Y, where both X and Y are linearly ordered by ≤_1 and ≤_2. We define two orderings ≤_X and ≤_Y over the universe U as the lexicographical orderings along X and Y, respectively. Formally, (x, y) ≤_X (x′, y′) ⟺ (x ≤_1 x′) ∧ ((x =_1 x′) → (y ≤_2 y′)).

(n/k)/(n + n/k) = 1/(k + 1), and so x will be output. Hence, we can extract the set S, and so the space stored must be Ω(m), since, by an information-theoretic argument, the space needed to store an arbitrary subset S is m bits.


Table I. Summary of Previous Results on Insert-Only Methods (LV (Las Vegas) and MC (Monte Carlo) are types of randomized algorithms. See Motwani and Raghavan [1995] for details.)

Algorithm                                   Type                           Time per item            Space
Lossy Counting [Manku and Motwani 2002]     Deterministic                  O(log(n/k)) amortized    Ω(k log(n/k))
Misra-Gries [Misra and Gries 1982]          Deterministic                  O(log k) amortized       O(k)
Frequent [Demaine et al. 2002]              Randomized (LV)                O(1) expected            O(k)
Count Sketch [Charikar et al. 2002]         Approximate, randomized (MC)   O(log(1/δ))              Ω((k/ǫ²) log n)

This also applies to randomized algorithms. Any algorithm which guarantees to output all hot items with probability at least 1 − δ, for some constant δ, must also use (m) space. This follows by observing that the above reduction corresponds to the Index problem in communication complexity [Kushilevitz and Nisan 1997], which has one-round communication complexity (m). If the data structure stored was o(m) in size, then it could be sent as a message, and this would contradict the communication complexity lower bound. This argument suggests that, if we are to use less than (m) space, then we must sometimes output items which are not hot, since we will endeavor to include every hot item in the output. In our guarantees, we will instead guarantee that (with arbitrary probability) all hot items are output and no items which are far from being hot will be output. That is, no item which has 1 frequency less than k+1 − ǫ will be output, for some user-specified parameter ǫ. 2.1 Prior Work Finding which items are hot is a problem that has a history stretching back over two decades. We divide the prior results into groups: those which find frequent items by keeping counts of particular items; those which use a filter to test each item; and those which accommodate deletions in a heuristic fashion. Each of these approaches is explained in detail below. The most relevant works mentioned are summarized in Table I. 2.1.1 Insert-Only Algorithms with Item Counts. The earliest work on finding frequent items considered the problem of finding an item which occurred more than half of the time [Boyer and Moore 1982; Fischer and Salzberg 1982]. This procedure can be viewed as a two-pass algorithm: after one pass over the data, a candidate is found, which is guaranteed to be the majority element if any such element exists. A second pass verifies the frequency of the item. Only a constant amount of space is used. A natural generalization of this method to find items which occur more than n/k times in two passes was given by Misra and Gries [1982]. The total time to process n items is O(n log k), with space O(k) ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


(recall that we assume throughout that any item label or counter can be stored in constant space). In the Misra and Gries implementation, the time to process any item is bounded by O(k log k) but this time is only incurred O(n/k) times, giving the amortized time bound. The first pass generates a set of at most k candidates for the hot items, and the second pass computes the frequency of each candidate exactly, so the infrequent items can be pruned out. It is possible to drop the second pass, in which case at most k items will be output, among which all hot items are guaranteed to be included. Recent interest in processing data streams, which can be viewed as onepass algorithms with limited storage, has reopened interest in this problem (see surveys such as those by Muthukrishnan [2003] and Garofalakis et al. [2002]). Several authors [Demaine et al. 2002; Karp et al. 2003] have rediscovered the algorithm of Misra and Gries [1982], and using more sophisticated data structures have been able to process each item in expected O(1) time while still keeping only O(k) space. As before, the output guarantees to include all hot items, but some others will be included in the output, about which no guarantee of frequency is made. A similar idea was used by Manku and Motwani [2002] with the stronger guarantee of finding all items which occur more than n/k times and not reporting any that occur fewer than n( k1 − ǫ) times. The space required is bounded by O( 1ǫ log ǫn)—note that ǫ ≤ k1 and so the space is effectively (k log(n/k)). If we set ǫ = kc for some small c then it requires time at worst O(k log(n/k)) per item, but this occurs only every 1/k items, and so the total time is O(n log(n/k)). Another recent contribution was that of Babcock and Olston [2003]. This is not immediately comparable to our work, since their focus was on maintaining the top-k items in a distributed environment, and the goal was to minimize communication. Counts of all items were maintained exactly at each location, so the memory space was (m). All of these mentioned algorithms are deterministic in their operation: the output is solely a function of the input stream and the parameter k. All the methods discussed thus far have certain features in common: in particular, they all hold some number of counters, each of which counts the number of times a single item is seen in the sequence. These counters are incremented whenever their corresponding item is observed, and are decremented or reallocated under certain circumstances. As a consequence, it is not possible to directly adapt these algorithms to the dynamic case where items are deleted as well as inserted. We would like the data structure to have the same contents following the deletion of an item, as if that item had never been inserted. But it is possible to insert an item so that it takes up a counter, and then later delete it: it is not possible to decide which item would otherwise have taken up this counter. So the state of the algorithm will be different from that reached without the insertions and deletions of the item. 2.1.2 Insert-Only Algorithms with Filters. An alternative approach to finding frequent items is based on constructing a data structure which can be used as a filter. This has been suggested several times, with different ways ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


to construct such filters being suggested. The general procedure is as follows: as each item arrives, the filter is updated to reflect this arrival and then the filter is used to test whether this item is above the threshold. If it is, then it is retained (for example, in a heap data structure). At output time, all retained items can be rechecked with the filter, and those which pass the filter are output. An important point to note is that, in the presence of deletions, this filter approach cannot work directly, since it relies on testing each item as it arrives. In some cases, the filter can be updated to reflect item deletions. However, it is important to realize that this does not allow the current hot items to be found from this: after some deletions, items seen in the past may become hot items. But the filter method can only pick up items which are hot when they reach the filter; it cannot retrieve items from the past which have since become frequent. The earliest filter method appears to be due to Fang et al. [1998], where it was used in the context of iceberg queries. The authors advocated a second pass over the data to count exactly those items which passed the filter. An article which has stimulated interest in finding frequent items in the networking community was by Estan and Varghese [2002], who proposed a variety of filters to detect network addresses which are responsible for a large fraction of the bandwidth. In both these articles, the analysis assumed very strong hash functions which exhibit “perfect” randomness. An important recent result was that of Charikar et al. [2002], who gave a filter-based method using only limited (pairwise) independent hash functions. These were used to give an algorithm to find k items whose frequency was at least (1−ǫ) times the frequency of the kth most frequent item, with probability 1−δ. If we wish to only find items with count greater than n/(k + 1) then the space used is O( ǫk2 log(n/δ)). A heap of frequent items is kept, and if the current items exceed the threshold, then the least frequent item in the heap is ejected, and the current item inserted. We shall return to this work later in Section 4.1, when we adapt and use the filter as the basis of a more advanced algorithm to find hot items. We will describe the algorithm in full detail, and give an analysis of how it can be used as part of a solution to the hot items problem. 2.1.3 Insert and Delete Algorithms. Previous work that studied hot items in the presence of both of inserts and deletes is sparse [Gibbons and Matias 1998, 1999]. These articles have proposed methods to maintain a sample of items and count of the number of times each item occurs in the data set, and focused on the harder problem of monitoring the k most frequent items. These methods work provably for the insert-only case, but provide no guarantees for the fully dynamic case with deletions. However, the authors studied how effective these samples are for the deletion case through experiments. Gibbons et al. [1997] presented methods to maintain various histograms in the presence of inserts and deletes using a “backing sample,” but these methods too need access to large portion of the data periodically in the presence of deletes. A recent theoretical work presented provable algorithms for maintaining histograms with guaranteed accuracy and small space [Gilbert et al. 2002a]. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


The methods in this article can yield algorithms for maintaining hot items, but the methods are rather sophisticated and use powerful range summable random variables, resulting in k log O(1) n space and time algorithms where the O(1) term is quite large. We draw some inspiration from the methods in this article—we will use ideas similar to the “sketching” developed in Gilbert et al. [2002a], but our overall methods are much simpler and more efficient. Finally, recent work in maintaining quantiles [Gilbert et al. 2002b] is similar to ours since it keeps the sum of items in random subsets. However, our result is, of necessity, more involved, involving a random group generation phase based on group testing, which was not needed in [Gilbert et al. 2002b]. Also, once such groups are generated, we maintain sums of deterministic sets (in contrast to the random sets as in Gilbert et al. [2002b]), given again by error correcting codes. Finally, our algorithm is more efficient than the (k 2 log2 m) space and time algorithms given in Gilbert et al. [2002b]. 2.2 Our Approach We propose some new approaches to this problem, based on ideas from group testing and error-correcting codes. Our algorithms depend on ideas drawn from group testing [Du and Hwang 1993]. The idea of group testing is to arrange a number of tests, each of which groups together a number of the m items in order to find up to k items which test “positive.” Each test reports either “positive” or “negative” to indicate whether there is a positive item among the group, or whether none of them is positive. The familiar puzzle of how to use a pan balance to find one “positive” coin among n good coins, of equal weight, where the positive coin is heavier than the good coins, is an example of group testing. The goal is to minimize the number of tests, where each test in the group testing is applied to a subset of the items (a group). Our goal of finding up to k hot items can be neatly mapped onto an instance of group testing: the hot items are the positive items we want to find. Group testing methods can be categorized as adaptive or nonadaptive. In adaptive group testing, the members of the next set of groups to test can be specified after learning the outcome of the previous tests. Each set of tests is called a round, and adaptive group testing methods are evaluated in terms of the number of rounds, as well as the number of tests, required. By contrast, nonadaptive group testing has only one round, and so all groups must be chosen without any information about which groups tested positive. We shall give two main solutions for finding frequent items, one based on nonadaptive and the other on adaptive group testing. For each, we must describe how the groups are formed from the items, and how the tests are performed. An additional challenge is that our tests here are not perfect, but have some chance of failure (reporting the wrong result). We will prove that, in spite of this, our algorithms can guarantee finding all hot items with high probability. The algorithms we propose differ in the nature of the guarantees that they give, and result in different time and space guarantees. In our experimental studies, we were able to explore these differences in more detail, and to describe the different situations which each of these algorithms is best suited to. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


3. NONADAPTIVE GROUP TESTING

Our general procedure is as follows: we divide all items up into several (overlapping) groups. For each transaction on an item x, we determine which groups it is included in (denoting these G(x)). Each group is associated with a counter; for an insertion we increment the counters of all groups in G(x), and for a deletion we correspondingly decrement these counters. The test is whether the count for a group exceeds a certain threshold: this is evidence that there may be a hot item within the group. Identifying the hot items is then a matter of putting together the information from the different tests to find an overall answer. There are a number of challenges involved in following this approach: (1) bounding the number of groups required; (2) finding a concise representation of the groups; and (3) giving an efficient way to go from the results of the tests to the set of hot items. We shall be able to address all of these issues.

To give greater insight into this problem, we first give a simple solution to the k = 1 case, which is to find an item that occurs more than half of the time. Later, we consider the more general problem of finding k > 1 hot items, which uses the procedure given below as a subroutine.

3.1 Finding the Majority Item

If an item occurs more than half the time, then it is said to be the majority item. While finding the majority item is mostly straightforward in the insertions-only case (it is solved in constant space and constant time per insertion by the algorithms of Boyer and Moore [1982] and Fischer and Salzberg [1982]), in the dynamic case it looks less trivial: we might have identified an item which is very frequent, only for this item to be the subject of a large number of deletions, meaning that some other item is now in the majority.

We give an algorithm that solves this problem by keeping ⌈log₂ m⌉ + 1 counters. The first counter, c_0, merely keeps track of n(t) = Σ_x n_x(t), which is how many items are "live": in other words, we increment this counter on every insertion and decrement it on every deletion. The remaining counters are denoted c_1 ··· c_{⌈log₂ m⌉}. We make use of the function bit(x, j), which reports the value of the jth bit of the binary representation of the integer x, and gt(x, y), which returns 1 if x > y and 0 otherwise. Our procedures are as follows:

Insertion of item x: increment each counter c_j such that bit(x, j) = 1, in time O(log m).
Deletion of x: decrement each counter c_j such that bit(x, j) = 1, in time O(log m).
Search: if there is a majority, then it is given by Σ_{j=1}^{log₂ m} 2^j · gt(c_j, n/2), computed in time O(log m).

The arrangement of the counters is shown graphically in Figure 1. The two procedures of this method (one to process updates, another to identify the majority element) are given in Figure 2 (where trans denotes whether the transaction is an insertion or a deletion).

THEOREM 3.1. The algorithm in Figure 2 finds a majority item, if there is one, with time O(log m) per update and search operation.


Fig. 1. Each test includes half of the range [1 · · · m], corresponding to the binary representation of values.

Fig. 2. Algorithm to find the majority element in a sequence of update.
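As a concrete rendering of the procedures of Figure 2, the following is a minimal Python sketch (ours; it indexes bits from 0, whereas the text numbers the counters from 1, and class and method names are our own) of the update and search operations.

import math

class MajorityTracker:
    """Track a possible majority item over inserts and deletes using
    ceil(log2 m) bit counters plus a counter of live items."""
    def __init__(self, m):
        self.bits = max(1, math.ceil(math.log2(m)))
        self.c = [0] * self.bits      # c[j] counts live items whose j-th bit is 1
        self.n = 0                    # number of live items

    def update(self, x, delta):       # delta = +1 for an insertion, -1 for a deletion
        self.n += delta
        for j in range(self.bits):
            if (x >> j) & 1:
                self.c[j] += delta

    def search(self):
        """Return the candidate majority item (correct whenever a majority exists)."""
        return sum((1 << j) for j in range(self.bits) if 2 * self.c[j] > self.n)

t = MajorityTracker(m=16)
for x in [5, 5, 7, 5, 7, 5]:
    t.update(x, +1)
t.update(7, -1)                        # item 5 now occurs 4 times out of 5 live items
print(t.search())                      # 5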

PROOF. We make two observations: first, that the state of the data structure is equivalent to that following a sequence of c0 insertions only, and second, that in the insertions only case, this algorithm identifies a majority element. For the first point, it suffices to observe that the effect of each deletion of an element x is to precisely cancel out the effect of a prior insertion of that element. Following a sequence of I insertions and D deletions, the state is precisely that obtained if there had been I − D = n insertions only. The second part relies on the fact that if there is an item whose count is greater than n/2 (that is, it is in the majority), then for any way of dividing the elements into two sets, the set containing the majority element will have weight greater than n/2, and the other will have weight less than n/2. The tests are arranged so that each test determines the value of a particular bit of the index of the majority element. For example, the first test determines whether its index is even or odd by dividing on the basis of the least significant bit. The log m tests with binary outcomes are necessary and sufficient to determine the index of the majority element. Note that this algorithm is completely deterministic, and guarantees always to find the majority item if there is one. If there is no such item, then still some item will be returned, and it will not be possible to distinguish the difference based on the information stored. The simple structure of the tests is standard in group testing, and also resembles the structure of the Hamming single errorcorrecting code. 3.2 Finding k Hot Items When we perform a test based on comparing the count of items in two buckets, we extract from this a single bit of information: whether there is a hot item present in the set or not. This leads immediately to a lower bound on the number ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


of tests necessary: to locate k items among m locations requires log₂ (m choose k) ≥ k log(m/k) bits.

We make the following observation: suppose we selected a group of items to monitor which happened to contain exactly one hot item. Then we could apply the algorithm of Section 3.1 to this group (splitting it into a further log m subsets) and, by keeping log m counters, identify which item was the hot one. We would simply have to "weigh" each bucket and, provided that the total weight of the other items in the group was not too much, the hot item would always be in the heavier of the two buckets.

We could choose each group as a completely random subset of the items and apply the algorithm for finding a single majority item described at the start of this section. But for a completely random selection of items, in order to store the description of the groups we would have to list every member of every group explicitly. This would consume a very large amount of space, at least linear in m. So instead, we shall look for a concise way to describe each group, so that given an item we can quickly determine which groups it is a member of. We shall make use of hash functions, which map items onto the integers 1 ··· W, for some W that we shall specify later. Each group will consist of all items which are mapped to the same value by a particular hash function. If the hash functions have a concise representation, then this describes the groups in a concise fashion. It is important to understand exactly how strong the hash functions need to be to guarantee good results.

3.2.1 Hash Functions. We will make use of universal hash functions derived from those given by Carter and Wegman [1979]. We define a family of hash functions f_{a,b} as follows: fix a prime P > m > W, and draw a and b uniformly at random in the range [0 ··· P − 1]. Then set

f_{a,b}(x) = ((ax + b mod P) mod W).

Using members of this family of functions will define our groups. Each hash function is defined by a and b, which are integers less than P. P itself is chosen to be O(m), and so the space required to represent each hash function is O(log m) bits.

Fact 3.2 (Proposition 7 of Carter and Wegman [1979]). Over all choices of a and b, for x ≠ y, Pr[f_{a,b}(x) = f_{a,b}(y)] ≤ 1/W.
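A small sketch (ours) of one member of this hash family, together with an empirical check of the pairwise collision bound of Fact 3.2; the prime P = 2^31 − 1 and the sample parameters are our choices, not values from the paper.

import random

def make_hash(P, W, rng=random):
    """One member f_{a,b} of the family: x -> ((a*x + b) mod P) mod W."""
    a = rng.randrange(0, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % W

# Empirical check of Fact 3.2: Pr[f(x) = f(y)] <= 1/W for x != y.
P, W = 2147483647, 16          # P is a prime larger than m and W
x, y, trials, collisions = 12345, 6789, 20000, 0
for _ in range(trials):
    f = make_hash(P, W)
    collisions += (f(x) == f(y))
print(collisions / trials, "vs bound", 1 / W)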

We can now describe the data structures that we will keep in order to allow us to find up to k hot items.

3.2.2 Nonadaptive Group Testing Data Structure. The group testing data structure is initialized with two parameters W and T, and has three components:
— a three-dimensional array of counters c, of size T × W × (log(m) + 1);
— T universal hash functions h, defined by a[1 ··· T] and b[1 ··· T], so that h_i = f_{a[i],b[i]};
— the count n of the current number of items.


Fig. 3. Procedures for finding hot items using nonadaptive group testing.

The data structure is initialized by setting all the counters, c[1][0][0] to c[T][W − 1][log m], to zero, and by choosing values for each entry of a and b uniformly at random in the range [0 ··· P − 1]. The space used by the data structure is O(T W log m). We shall specify values for W and T later. We will write h_i for the ith hash function, so h_i(x) = ((a[i] · x + b[i]) mod P) mod W. Let G_{i,j} = {x | h_i(x) = j} be the (i, j)th group. We will use c[i][j][0] to keep the count of the current number of items within G_{i,j}. For each such group, we shall also keep counts for log m subgroups, defined as G_{i,j,l} = {x | x ∈ G_{i,j} ∧ bit(x, l) = 1}. These correspond to the groups we kept for finding a majority item. We will use c[i][j][l] to keep the count of the current number of items within subgroup G_{i,j,l}. This leads to the following update procedure.

3.2.3 Update Procedure. Our procedure for processing an input item x is to determine which groups it belongs to, and to update the log m counters for each of these groups based on the bit representation of x, in exactly the same way as the algorithm for finding a majority element. If the transaction is an insertion, then we add one to the appropriate counters, and we subtract one for a deletion. The current count of items is also maintained. This procedure is shown in pseudocode as PROCESSITEM(x, trans, T, W) in Figure 3. The time to perform an update is the time taken to compute the T hash functions and to modify O(T log m) counters.

At any point, we can search the data structure to find hot items. Various checks are made to avoid including in the output any items which are not hot. In group testing terms, the test that we use is whether the count for a group or subgroup exceeds the threshold needed for an item to be hot, which is n/(k + 1). Note that any group which contains a hot item will pass this test, but it is possible that a group which does not contain a hot item also passes it. We will later analyze the probability of such an event, and show that it can be made quite small.
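The data structure and the update procedure translate directly into code. The following minimal sketch (ours; the class and parameter names are hypothetical, and it is not the PROCESSITEM pseudocode of Figure 3) keeps the T × W × (log m + 1) counters and applies the updates described above.

import math, random

class HotItemsSketch:
    """Nonadaptive group-testing structure: T x W groups, each with a group
    count and log2(m) subgroup counts based on the bits of the identifier."""
    def __init__(self, m, W, T, P=2147483647, seed=0):
        rng = random.Random(seed)
        self.m, self.W, self.T, self.P = m, W, T, P
        self.bits = max(1, math.ceil(math.log2(m)))
        self.a = [rng.randrange(0, P) for _ in range(T)]
        self.b = [rng.randrange(0, P) for _ in range(T)]
        # c[i][j][0] is the count of group (i, j); c[i][j][l] that of its l-th bit subgroup
        self.c = [[[0] * (self.bits + 1) for _ in range(W)] for _ in range(T)]
        self.n = 0

    def _h(self, i, x):
        return ((self.a[i] * x + self.b[i]) % self.P) % self.W

    def update(self, x, delta):            # delta = +1 (insertion) or -1 (deletion)
        self.n += delta
        for i in range(self.T):
            g = self.c[i][self._h(i, x)]
            g[0] += delta
            for l in range(self.bits):
                if (x >> l) & 1:
                    g[l + 1] += delta

sk = HotItemsSketch(m=1 << 20, W=8, T=3)
for x in [42, 42, 7, 42]:
    sk.update(x, +1)
sk.update(7, -1)                           # the insert and delete of item 7 cancel out
print(sk.n, sk.c[0][sk._h(0, 42)][0])      # 3 live items; the group holding 42 has count 3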


3.2.4 Search Procedure. For each group, we use the information about the group and its subgroups to test whether there is a hot item in the group and, if so, to extract the identity of that hot item. We process each group G_{i,j} in turn. First, we test whether there can be a hot item in the group: if c[i][j][0] ≤ n/(k + 1), then there cannot be a hot item in the group, and the group is rejected. Otherwise, we compare the count of every subgroup, and of its complement within the group, to the threshold, and consider the four possible cases:

c[i][j][l] > n/(k+1)?   c[i][j][0] − c[i][j][l] > n/(k+1)?   Conclusion
No                      No                                   Cannot be a hot item in the group, so reject the group
No                      Yes                                  If a hot item x is in the group, then bit(l, x) = 0
Yes                     No                                   If a hot item x is in the group, then bit(l, x) = 1
Yes                     Yes                                  Not possible to identify the hot item, so reject the group
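Summarizing the case analysis above, a single group can be decoded from its counters as in the following sketch (ours; the cross-group verification described next is omitted here, and the example counts are made up).

def decode_group(counts, n, k):
    """Decode one group from its counters.

    counts[0] is the group count; counts[l] (l >= 1) is the count of the
    subgroup of items whose (l-1)-th bit is 1.  Returns the candidate hot
    item's identifier, or None if the group is rejected.
    """
    threshold = n / (k + 1)
    if counts[0] <= threshold:
        return None                      # no hot item can be in this group
    x = 0
    for l in range(1, len(counts)):
        one_heavy = counts[l] > threshold
        zero_heavy = counts[0] - counts[l] > threshold
        if not one_heavy and not zero_heavy:
            return None                  # neither half is heavy: reject the group
        if one_heavy and zero_heavy:
            return None                  # both halves heavy: cannot identify, reject
        if one_heavy:
            x |= 1 << (l - 1)            # the hot item's (l-1)-th bit is 1
        # if only zero_heavy, the bit is 0 and x is left unchanged
    return x

# Example: n = 100, k = 4 (threshold 20); a group holding about 30 copies of
# item 5 (binary 101) plus a little noise, with subgroup counts for bits 0..3.
print(decode_group([33, 31, 2, 30, 1], n=100, k=4))   # 5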

If the group is not rejected, then the identity of the candidate hot item, x, can be recovered from the tests. Some verification of the hot items can then be carried out:
— The candidate item must belong to the group it was found in, so check that h_i(x) = j.
— If the candidate item is hot, then every group it belongs to should be above the threshold, so check that c[i][h_i(x)][0] > n/(k + 1) for all i.

The time to find all hot items is O(T² W log m). There can be at most T W candidates returned, and checking them all takes worst-case time O(T) each. The full algorithms are illustrated in Figure 3.

We now show that, for appropriate choices of T and W, we can first ensure that all hot items are found, and second ensure that no items are output which are far from being hot.

LEMMA 3.3. Choosing W ≥ 2k and T = log₂(k/δ) for a user-chosen parameter δ ensures that the probability of all hot items being output is at least 1 − δ.

PROOF. Consider each hot item x in turn, remembering that there are at most k of these. Using Fact 3.2 about the hash functions, the probability of any other item falling into the same group as x under the ith hash function is at most 1/W ≤ 1/(2k). Using linearity of expectation, the expectation of the total frequency of the other items which land in the same group as item x is

E[ Σ_{y≠x, h_i(y)=h_i(x)} f_y ] = Σ_{y≠x} f_y · Pr[h_i(y) = h_i(x)] ≤ Σ_{y≠x} f_y / (2k) = (1 − f_x)/(2k) ≤ 1/(2(k + 1)).      (1)

Our test cannot fail if the total weight of the other items which fall in the same bucket is less than 1/(k + 1). This is because, each time we compare the counts of items in the group, we conclude that the hot item is in the half with the greater count. If the total frequency of the other items is less than 1/(k + 1), then the hot item will always be in the heavier half, and so, using a similar argument to that for the majority case, we will be able to read off the index of the hot item using the results of the log m subgroups.


The probability of failing due to the weight of the other items in the same bucket being more than 1/(k + 1) is bounded, by the Markov inequality, by 1/2, since this is at least twice the expectation. So the probability that we fail on every one of the T independent tests is less than (1/2)^{log(k/δ)} = δ/k. Using the union bound over all hot items, the probability of any of them failing is less than δ, and so every hot item is output with probability at least 1 − δ.

LEMMA 3.4. For any user-specified fraction ǫ ≤ 1/(k + 1), if we set W ≥ 2/ǫ and T = log₂(k/δ), then the probability of outputting any item y with f_y < 1/(k + 1) − ǫ is at most δ/k.

PROOF. This lemma follows because of the checks we perform on every item before outputting it. Given a candidate item, we check that every group it is a member of is above the threshold. Suppose the frequency of the item y is less than 1/(k + 1) − ǫ. Then the total frequency of the other items which fall in the same group under hash function i must be at least ǫ, to push the count for the group over the threshold and make the test return positive. By the same argument as in the lemma above, the probability of this event is at most 1/2. So the probability that this occurs in all groups is bounded by (1/2)^{log(k/δ)} = δ/k.

Putting these two lemmas together allows us to state our main result on nonadaptive group testing: THEOREM 3.5. With probability at least 1 − δ, then we can find all hot items 1 1 whose frequency is more than k+1 , and, given ǫ ≤ k+1 , with probability at least 1 − ǫ using space 1 − δ/k each item which is output has frequency at least k+1 1 O( ǫ log(m) log(k/δ)) words. Each update takes time O(log(m) log(k/δ)). Queries take time no more than O( 1ǫ log2 (k/δ) log m). PROOF. This follows by setting W = 2ǫ and T = log(k/δ), and applying the above two lemmas. To process an item, we compute T hash functions, and update T log m counters, giving the time cost. To extract the hot items involves a scan over the data structure in linear time, plus a check on each hot item found that takes time at most O(T ), giving total time O(T 2 W log m). Next, we describe additional properties of our method which imply its stability and resilience. COROLLARY 3.6. The data structure created with T = log(k/δ) can be used to find hot items with parameter k ′ for any k ′ < k with the same probability of success 1 − δ. PROOF. Observe in Lemma 3.3 that, to find k ′ hot items, we required W ≥ 2k . If we use a data structure created with W ≥ 2k, then W ≥ 2k > 2k ′ , and so the data structure can be used for any value of k less than the value it was created for. Similarly, we have more tests than we need, which can only ′


help the accuracy of the group testing. All other aspects of the data structure are identical. So, if we run the procedure with a higher threshold, then with probability at least 1 − δ, we will find the hot items. This property means that we can fix k to be as large as we want, and are then able to find hot items with any frequency greater than 1/(k + 1) determined at query time.

COROLLARY 3.7. The output of the algorithm is the same for any reordering of the input data.

PROOF. During any insertion or deletion, the algorithm takes the same action and does not inspect the contents of the memory. It just adds or subtracts values from the counters, as a function solely of the item value. Since addition and subtraction commute, the corollary follows.

3.2.5 Estimation of Count of Hot Items. Once the hot items have been identified, we may wish to additionally estimate the count, n_x, of each of these items. One approach would be to keep a second data structure enabling the estimation of the counts to be made. Such data structures are typically compact, fast to update, and give accurate answers for items whose count is large, that is, hot items [Gilbert et al. 2002b; Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, note that the data structure that we keep embeds a structure that allows us to compute an estimate of the weight of each item [Cormode and Muthukrishnan 2004a].

COROLLARY 3.8. Computing min_i c[i][h_i(x)][0] gives a good estimate for n_x with probability at least 1 − (δ/k).

PROOF. This follows from the proofs of Lemma 3.3 and Lemma 3.4. Each estimate c[i][h_i(x)][0] = n_x + Σ_{y≠x, h_i(y)=h_i(x)} n_y. But by Lemma 3.3, this additional noise is bounded by ǫn with constant probability at least 1/2, as shown in Equation (1). Taking the minimum over all estimates amplifies this probability to 1 − (δ/k).

3.3 Time-Space Tradeoff

In certain situations when transactions are occurring at very high rates, it is vital to make the update procedure as fast as possible. One of the drawbacks of the current procedure is that it depends on the product of T and log m, which can be slow for items with large identifiers. For reducing the time dependency on T, note that the data structure is intrinsically parallelizable: each of the T hash functions can be applied in parallel, and the relevant counts modified separately. In the experimental section we will show that good results are observed even for very small values of T; therefore, the main bottleneck is the dependence on log m. The dependency on log m arises because we need to recover the identifier of each hot item, and we do this 1 bit at a time. Our observation here is that we can find the identifier in different units, for example, 1 byte at a time, at the expense of extra space usage. Formally, define dig(x, i, b) to be the ith digit

in the integer x when x is written in base b ≥ 2. Within each group, we keep (b − 1) × log_b m subgroups: the (i, j)th subgroup counts how many items have dig(x, i, b) = j, for i = 1 · · · log_b m and j = 1 · · · b − 1. We do not need to keep a subgroup for j = 0 since this count can be computed from the other counts for that group. Note that b = 2 corresponds to the binary case discussed already, and b = m corresponds to the simple strategy of keeping a count for every item.

THEOREM 3.9. Using the above procedure, with probability at least 1 − δ we can find all hot items whose frequency is more than 1/(k + 1), and with probability at least 1 − (δ/k), each item which is output has frequency at least 1/(k + 1) − ǫ, using space O((b/ǫ) log_b(m) log(k/δ)) words. Each update takes time O(log_b(m) log(k/δ)) and queries take O((b/ǫ) log_b(m) log^2(k/δ)) time.

PROOF. Each subgroup now allows us to read off one digit in the base-b representation of the identifier of any hot item x. Lemma 3.3 applies to this situation just as before, as does Lemma 3.4. This leads us to set W and T as before. We have to update one counter for each digit in the base-b representation of each item for each transaction, which corresponds to log_b m counters per test, giving an update time of O(T log_b(m)). The space required is for the counters to record the subgroups of TW groups, and there are (b − 1) log_b(m) subgroups of every group, giving the space bounds.

For efficient implementations, it will generally be preferable to choose b to be a power of 2, since this allows efficient computation of indices using bit-level operations (shifts and masks). The space cost can be relatively high for speedups: choosing b = 2^8 means that each update operation is eight times faster than for b = 2, but requires 32 times more space. A more modest value of b may strike the right balance: choosing b = 4 doubles the update speed, while the space required increases by 50%. We investigate the effects of this tradeoff further in our experimental study.

4. ADAPTIVE GROUP TESTING

The more flexible model of adaptive group testing allows conceptually simpler choices of groups, although the data structures required to support the tests become more involved. The idea is a very natural "divide-and-conquer" style approach, and as such may seem straightforward. We give the full details here to emphasize the relation between viewing this as an adaptive group testing procedure and the above nonadaptive group testing approach. Also, this method does not seem to have been published before, so we give the full description for completeness.

Consider again the problem of finding a majority item, assuming that one exists. Then an adaptive group testing strategy is as follows: test whether the count of all items in the range {1 · · · m/2} is above n/2, and also whether the count of all items in the range {m/2 + 1 · · · m} is over the threshold. Recurse on whichever half contains more than half the items, and the majority item is found in ⌈log2 m⌉ rounds. The question is: how to support this adaptive strategy as transactions are seen? As counts increase and decrease, we do not know in advance which queries

Fig. 4. Adaptive group testing algorithms.

will be posed, and so the solution seems to be to keep counts for every test that could be posed—but there are Ω(m) such tests, which is too much to store. The solution comes by observing that we do not need to know counts exactly, but rather it suffices to use approximate counts, and these can be supported using a data structure that is much smaller, with size dependent on the quality of approximation. We shall make use of the fact that the range of items can be mapped onto the integers 1 · · · m. We will initially describe an adaptive group testing method in terms of an oracle that is assumed to give exact answers, and then show how this oracle can be realized approximately.

Definition 4.1. A dyadic range sum oracle returns the (approximate) sum of the counts of items in the range l = (i·2^j + 1) · · · r = (i + 1)·2^j for 0 ≤ j ≤ log m and 0 ≤ i ≤ m/2^j.

Using such an oracle, which reflects the effect of items arriving and departing, it is possible to find all the hot items, with the following binary search divide-and-conquer procedure. For simplicity of presentation, we assume that m, the range of items, is a power of 2. Beginning with the full range, recursively split in two. If the total count of any range is less than n/(k + 1), then do not split further. Else, continue splitting until a hot item is found. It follows that O(k log(m/k)) calls are made to the oracle. The procedure is presented as ADAPTIVEGROUPTEST on the right in Figure 4.

In order to implement dyadic range sum oracles, define an approximate count oracle to return the (approximate) count of the item x. A dyadic range sum oracle can be implemented using j = 0 · · · log m approximate count oracles: for each item in the stream x, insert ⌊x/2^j⌋ into the jth approximate count oracle, for all j. Recent work has given several methods of implementing the approximate count oracle, which can be updated to reflect the arrival or departure of any item. We now list three examples of these and give their space and update time bounds:
— The "tug of war sketch" technique of Alon et al. [1999] uses space and time O((1/ǫ^2) log(1/δ)) to approximate any count up to ǫn with a probability of at least 1 − δ.
— The method of random subset sums described in Gilbert et al. [2002b] uses space and time O((1/ǫ^2) log(1/δ)).
— The method of Charikar et al. [2002] builds a structure which can be used to approximate the count of any item correct up to ǫn in space O((1/ǫ^2) log(1/δ)) and time per update O(log(1/δ)).
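The recursive search just described can be sketched in a few lines of C. The oracle rangecount() and the reporting callback below are hypothetical stand-ins for the dyadic range sum oracle of Definition 4.1 and for whatever output mechanism is used; the pruning rule follows the description above (do not split a range whose approximate count is at most n/(k + 1)).

/* Recursive hot-item search over dyadic ranges (illustrative sketch). */
extern long n;                 /* current total count                          */
extern int  k;                 /* threshold parameter: hot = frequency > 1/(k+1) */
extern long rangecount(unsigned long lo, unsigned long hi);   /* dyadic range sum oracle (stand-in) */
extern void report_hot(unsigned long item);                   /* output callback (stand-in)         */

void adaptive_group_test(unsigned long lo, unsigned long hi) {
    if (rangecount(lo, hi) <= n / (k + 1))   /* no hot item can hide here: prune */
        return;
    if (lo == hi) {                          /* singleton range above threshold  */
        report_hot(lo);
        return;
    }
    unsigned long mid = lo + (hi - lo) / 2;  /* split the dyadic range in two    */
    adaptive_group_test(lo, mid);
    adaptive_group_test(mid + 1, hi);
}

/* Called as adaptive_group_test(1, m) for a universe 1..m, with m a power of 2. */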

The fastest of these methods is that of Charikar et al. [2002], and so we shall adopt this as the basis of our adaptive group testing solution. In the next section we describe and analyze the data structure and algorithms for our purpose of finding hot items.

4.1 CCFC Count Sketch

We shall briefly describe and analyze the CCFC count sketch.¹ This is a different and shorter analysis compared to that given in Charikar et al. [2002], since here the goal is to estimate each count to within an error in terms of the total count of all items rather than in the count of the kth most frequent item, as was the case in the original article.

4.1.1 Data Structure. The data structure used consists of a table of counters t, with width W and height T, initialized to zero. We also keep T pairs of universal hash functions: h_1 · · · h_T, which map items onto 1 · · · W, and g_1 · · · g_T, which map items onto {−1, +1}.

4.1.2 Update Routine. When an insert transaction of item x occurs, we update t[i][h_i(x)] ← t[i][h_i(x)] + g_i(x) for all i = 1 · · · T. For a delete transaction, we update t[i][h_i(x)] ← t[i][h_i(x)] − g_i(x) for all i = 1 · · · T.

4.1.3 Estimation. To estimate the count of x, compute median_i(t[i][h_i(x)] · g_i(x)).

4.1.4 Analysis. Use the random variable X_i to denote t[i][h_i(x)] · g_i(x). The expectation of each estimate is

$$E(X_i) = n_x + \sum_{y \ne x} n_y \,\Pr[h_i(y) = h_i(x)]\,\big(\Pr[g_i(x) = g_i(y)] - \Pr[g_i(x) \ne g_i(y)]\big) = n_x,$$

since Pr[g_i(x) = g_i(y)] = 1/2. The variance of each estimate is

$$\begin{aligned}
\mathrm{Var}(X_i) &= E\big(X_i^2\big) - E(X_i)^2 && (2)\\
&= E\big(g_i(x)^2\,(t[i][h_i(x)])^2\big) - n_x^2 && (3)\\
&= n_x^2 + 2\sum_{y \ne x,\,z} n_y n_z \Pr[h_i(y) = h_i(z)]\,\big(\Pr[g_i(x) = g_i(y)] - \Pr[g_i(x) \ne g_i(y)]\big) && (4)\\
&\quad + \sum_{y \ne x} g_i^2(y)\, n_y^2 \,\Pr[h_i(y) = h_i(x)] - n_x^2 && (5)\\
&= \sum_{y \ne x} \frac{n_y^2}{W} \;\le\; \frac{n^2}{W}. && (6)
\end{aligned}$$

Using the Chebyshev inequality, it follows that Pr[|X_i − n_x| > √2·n/√W] < 1/2. Taking the median of T estimates amplifies this probability to 2^{−T/4}, by a standard Chernoff bounds argument [Motwani and Raghavan 1995].

¹ CCFC denotes the initials of the authors of Charikar et al. [2002].
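The update and estimation routines of the CCFC count sketch translate almost directly into code. The following C fragment is an illustrative sketch (not the authors' implementation); the helpers hash_h and sign_g, assumed to return h_i(x) in 0..W−1 and g_i(x) in {−1, +1} from universal hash families, are hypothetical names.

/* Minimal CCFC count sketch: update and point estimate (illustrative sketch). */
#include <stdlib.h>

extern int   T, W;
extern long **t;                                    /* T x W table of counters, zero-initialized */
extern int   hash_h(int i, unsigned long x);        /* h_i(x) in 0..W-1 (hypothetical helper)    */
extern int   sign_g(int i, unsigned long x);        /* g_i(x) in {-1,+1} (hypothetical helper)   */

/* diff = +1 for an insertion of x, -1 for a deletion. */
void ccfc_update(unsigned long x, int diff) {
    for (int i = 0; i < T; i++)
        t[i][hash_h(i, x)] += diff * sign_g(i, x);
}

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Estimate n_x as the median over i of t[i][h_i(x)] * g_i(x). */
long ccfc_estimate(unsigned long x) {
    long *est = malloc(T * sizeof(long));
    for (int i = 0; i < T; i++)
        est[i] = t[i][hash_h(i, x)] * sign_g(i, x);
    qsort(est, T, sizeof(long), cmp_long);
    long med = est[T / 2];
    free(est);
    return med;
}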

4.1.5 Space and Time. The space used is for the WT counters and the 2T hash functions. The time taken for each update is the time to compute the 2T hash functions, and update T counters.

THEOREM 4.2. By setting W = 2/ǫ^2 and T = 4 log(1/δ), we can estimate the count of any item up to error ±ǫn with probability at least 1 − δ.

4.2 Adaptive Group Testing Using CCFC Count Sketch

We can now implement an adaptive group testing solution to finding hot items. The basic idea is to apply the adaptive binary search procedure using the above count sketch to implement the dyadic range sum oracle. The full procedure is shown in Figure 4.

THEOREM 4.3. Setting W = 2/ǫ^2 and T = log(k log(m)/δ) allows us to find every item with frequency greater than 1/(k + 1) + ǫ, and report no item with frequency less than 1/(k + 1) − ǫ, with a probability of at least 1 − δ. The space used is O((1/ǫ^2) log(m) log(k log(m)/δ)) words, and the time to perform each update is O(log(m) log(k log(m)/δ)). The query time is O(k log(m) log(k log(m)/δ)) with a probability of at least 1 − δ.

PROOF. We set the probability of failure to be low (δ/(k log m)), so that for the O(k log m) queries that we pose to the oracle, there is probability at most δ of any of them failing, by the union bound. Hence, we can assume that with a probability of at least 1 − δ, all approximations are within the ±ǫn error bound. Then, when we search for hot items, any range containing a hot item will have its approximate count reduced by at most ǫn. This will allow us to find the hot item, and output it if its frequency is at least 1/(k + 1) + ǫ. Any item which is output must pass the final test, based on the count of just that item, which will not happen if its frequency is less than 1/(k + 1) − ǫ. Space is needed for log(m) sketches, each of which has size O(TW) words. For these settings of T and W, we obtain the space bounds listed in the theorem. The time per update is that needed to compute 2T log(m) hash values, and then to update up to this many counters, which gives the stated update time.
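One way to wire the count sketch into the dyadic range sum oracle, sketched below under our own naming and with a 0-based universe 0..m−1 for simplicity, is to keep one sketch per level j and feed it the prefix ⌊x/2^j⌋ on every transaction; a dyadic range of length 2^j starting at i·2^j is then answered by a point estimate for i at level j. The per-level helpers are assumed to be thin wrappers around the CCFC routines above.

/* Dyadic range sum oracle from log(m)+1 CCFC sketches (a sketch of the idea;
   items are treated as 0-based, so the level-j prefix of x is simply x >> j). */
extern int  logm;
extern void ccfc_update_level(int j, unsigned long key, int diff);  /* hypothetical per-level wrappers */
extern long ccfc_estimate_level(int j, unsigned long key);

/* Process one transaction on item x (diff = +1 insert, -1 delete). */
void oracle_update(unsigned long x, int diff) {
    for (int j = 0; j <= logm; j++)
        ccfc_update_level(j, x >> j, diff);      /* insert floor(x / 2^j) at level j */
}

/* Approximate count of the dyadic range of length 2^j starting at i*2^j.
   Mapping an explicit (lo, hi) dyadic range to its (j, i) form is immediate. */
long oracle_rangecount(int j, unsigned long i) {
    return ccfc_estimate_level(j, i);
}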

4.2.1 Hot Item Count Estimation. Note that we can immediately extract the estimated counts for each hot item using the data structure, since the count of item x is given by using the lowest-level approximate count. Hence, the count n_x is estimated with error at most ǫn in time O(log(m) log(k log(m)/δ)).

4.3 Time-Space Tradeoffs

As with the nonadaptive group testing method, the time cost for updates depends on T and log m. Again, in practice we found that small values of T could be used, and that computation of the hash functions could be parallelized for extra speedup. Here, the dependency on log m is again the limiting factor. A similar trick to the nonadaptive case is possible, to change the update time dependency to log_b m for arbitrary b: instead of basing the oracle on dyadic ranges, base it on b-adic ranges. Then only log_b m sketches need to be updated for each transaction. However, under this modification, the same guarantees

do not hold. In order to extract the hot items, many more queries are needed: instead of making at most two queries per hot item per level, we make at most b queries per hot item per level, and so we need to reduce the probability of making a mistake to reflect this. One solution would be to modify T to give a guarantee—but this can lose the point of the exercise, which is to reduce the cost of each update. So instead we treat this as a heuristic to try out in practice, and to see how well it performs. A more concrete improvement to space and time bounds comes from observing that it is wasteful to keep sketches for high levels in the hierarchy, since there are very few items to monitor. It is therefore an improvement to keep exact counts for items at high levels in the hierarchy. 5. COMPARISON BETWEEN METHODS AND EXTENSIONS We have described two methods to find hot items after observing a sequence of insertion and deletion transactions, and proved that they can give guarantees about the quality of their output. These are the first methods to be able to give such guarantees in the presence of deletions, and we now go on to compare these two different approaches. We will also briefly discuss how they can be adapted when the input may come in other formats. Under the theoretical analysis, it is clear that the adaptive and nonadaptive methods have some features in common. Both make use of universal hash functions to map items to counters where counts are maintained. However, the theoretical bounds on the adaptive search procedure look somewhat weaker than those on the nonadaptive methods. To give a guarantee of not outputting items which are more than ǫ from being hot items, the adaptive group testing depends on 1/ǫ 2 in space, whereas nonadaptive testing uses 1/ǫ. The update times look quite similar, depending on the product of the number of tests, T , and the bit depth of the universe, logb(m). It will be important to see how these methods perform in practice, since these are only worst-case guarantees. In order to compare these methods in concrete terms, we shall use the same values of T and W for adaptive and nonadaptive group testing in our tests, so that both methods are allocated approximately the same amount of space. Another difference is that adaptive group testing requires many more hash function evaluations to process each transaction compared to nonadaptive group testing. This is because adaptive group testing computes a different hash for each of log m prefixes of the item, whereas nonadaptive group testing computes one hash function to map the item to a group, and then allocates it to subgroups based on its binary representation. Although the universal hash functions can be implemented quite efficiently [Thorup 2000], this extra processing time can become apparent for high transaction rates. 5.1 Other Update Models In this work we assume that we modify counts by one each time to model insertions or deletions. But there is no reason to insist on this: the above proofs work for arbitrary count distributions; hence it is possible to allow the counts to be modified by arbitrary increments or decrements, in the same update time ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

bounds. The counts can even include fractional values if so desired. This holds for both the adaptive and nonadaptive methods. Another feature is that it is straightforward to combine the data structures for the merge of two distributions: providing both data structures were created using the same parameters and hash functions, then summing the counters coordinatewise gives the same set of counts as if the whole distribution had been processed by a single data structure. This should be contrasted to other approaches [Babcock and Olston 2003], which also compute the overall hot items from multiple sources, but keep a large amount of space at each location: instead the focus is on minimizing the amount of communication. Immediate comparison of the approaches is not possible, but for periodic updates (say, every minute) it would be interesting to compare the communication used by the two methods.

6. EXPERIMENTS

6.1 Evaluation

To evaluate our approach, we implemented our group testing algorithms in C. We also implemented two algorithms which operate on nondynamic data, the algorithm Lossy Counting [Manku and Motwani 2002] and Frequent [Demaine et al. 2002]. Neither algorithm is able to cope with the case of the deletion of an item, and there is no obvious modification to accommodate deletions and still guarantee the quality of the output. We instead performed a "best effort" modification: since both algorithms keep counters for certain items, which are incremented when that item is inserted, we modified the algorithms to decrement the counter whenever the corresponding item was deleted. When an item without a counter was deleted, then we took no action.² This modification ensures that when the algorithms encounter an inserts-only dataset, then their action is the same as the original algorithms. Code for our implementations is available on the Web, from http://www.cs.rutgers.edu/~muthu/massdal-code-index.html.

6.1.1 Evaluation Criteria. We ran tests on both synthetic and real data, and measured time and space usage of all four methods. Evaluation was carried out on a 2.4-GHz desktop PC with 512-MB RAM. In order to evaluate the quality of the results, we used two standard measures: the recall and the precision.

Definition 6.1. The recall of an experiment to find hot items is the proportion of the hot items that are found by the method. The precision is the proportion of items identified by the algorithm which are hot items.

It will be interesting to see how these properties interact. For example, if an algorithm outputs every item in the range 1 · · · m then it clearly has perfect recall (every hot item is indeed included in the output), but its precision is very poor. At the other extreme, an algorithm which is able to identify only the

² Many variations of this theme are possible. Our experimental results here that compare our algorithms to modifications of Lossy Counting [Manku and Motwani 2002] and Frequent [Demaine et al. 2002] should be considered proof-of-concept only.

Fig. 5. Experiments on a sequence of 10^7 insertion-only transactions. Left: testing recall (proportion of the hot items reported). Right: testing precision (proportion of the output items which were hot).

most frequent item will have perfect precision, but may have low recall if there are many hot items. For example, the Frequent algorithm gives guarantees on the recall of its output, but does not strongly bound the precision, whereas, for Lossy Counting, the parameter ǫ affects the precision indirectly (depending on the properties of the sequence). Meanwhile, our group testing methods give probabilistic guarantees of perfect recall and good precision.

6.1.2 Setting of Parameters. In all our experiments, we set ǫ = 1/(k + 1) and hence set W = 2(k + 1), since this keeps the memory usage quite small. In practice, we found that this setting of ǫ gave quite good results for our group testing methods, and that smaller values of ǫ did not significantly improve the results. In all the experiments, we ran both group testing methods with the same values of W and T, which ensured that on most base experiments they used the same amount of space. In our experiments, we looked at the effect of varying the value of the parameters T and b. We gave the parameter ǫ to each algorithm and saw how much space it used to give a guarantee based on this ǫ. In general, the deterministic methods used less space than the group testing methods. However, when we made additional space available to the deterministic methods equivalent to that used by the group testing approaches, we did not see any significant improvement in their precision and we saw a similar pattern of dependency on the Zipf parameter.
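For concreteness, the two measures of Definition 6.1 can be computed directly from the reported set and the set of truly hot items (determined offline from exact counts); the small C helper below is ours and purely illustrative.

/* Recall and precision of a reported set against the true hot set (illustrative). */
extern int is_truly_hot(unsigned long item);        /* membership test, computed offline */

void score(const unsigned long *reported, int num_reported,
           int num_truly_hot, double *recall, double *precision) {
    int correct = 0;
    for (int i = 0; i < num_reported; i++)
        if (is_truly_hot(reported[i])) correct++;
    *recall    = num_truly_hot ? (double) correct / num_truly_hot : 1.0;
    *precision = num_reported  ? (double) correct / num_reported  : 1.0;
}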

6.2 Insertions-Only Data

Although our methods have been designed for the challenges of transaction sequences that contain a mix of insertions and deletions, we first evaluated a sequence of transactions which contained only insertions. These were generated by a Zipf distribution, whose parameter was varied from 0 (uniform) to 3 (highly skewed). We set k = 1000, so we were looking for all items with frequency 0.1% and higher. Throughout, we worked with a universe of size m = 2^32. Our first observation on the performance of group testing-based methods is that they gave good results with very small values of T. The plots in Figure 5 show the precision and recall of the methods with T = 2, meaning that each item was placed in two groups in nonadaptive group testing, and two estimates were computed for each count in adaptive group testing. Nonadaptive group

Fig. 6. Experiments on synthetic data consisting of 10^7 transactions.

testing is denoted as algorithm “NAGT,” and adaptive group testing as algorithm “Adapt.” Note that, on this data set, the algorithms Lossy Counting and Frequent both achieved perfect recall, that is, they returned every hot item. This is not surprising: the deterministic guarantees ensure that they will find all hot items when the data consists of inserts only. Group testing approaches did pretty well here: nonadaptive got almost perfect recall, and adaptive missed only a few for near uniform distributions. On distributions with a small Zipf parameter, many items had counts which were close to the threshold for being a hot item, meaning that adaptive group testing can easily miss an item which is just over the threshold, or include an item which is just below. This is also visible in the precision results: while nonadaptive group testing included no items which were not hot, adaptive group testing did include some. However, the deterministic methods also did quite badly on precision, frequently including many items which were not hot in its output while, for this value of ǫ, Lossy Counting did much better than Frequent, but consistently worse than group testing. As we increased T , both nonadaptive and adaptive group testing got perfect precision and recall on all distributions. For the experiment illustrated, the group testing methods both used about 100 KB of space each, while the deterministic methods used a smaller amount of space (around half as much). 6.3 Synthetic Data with Insertions and Deletions We created synthetic datasets designed to test the behavior when confronted with a sequence including deletes. The datasets were created in three equal parts: first, a sequence of insertions distributed uniformly over a small range; next, a sequence of inserts drawn from a Zipf distribution with varying parameters; last, a sequence of deletes distributed uniformly over the same range as the starting sequence. The net effect of this sequence should be that the first and last groups of transactions would (mostly) cancel out, leaving the “true” signal from the Zipf distribution. The dataset was designed to test whether the algorithms could find this signal from the added noise. We generated a dataset of 10,000,000 items, so it was possible to compute the exact answers in order to compare, and searched for the k = 1000 hot items while varying the Zipf parameter of the signal. The results are shown in Figure 6, with the recall plotted on the left and the precision on the right. Each data point comes from one trial, rather than averaging over multiple repetitions. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

The purpose of this experiment was to demonstrate a scenario where insert-only algorithms would not be able to cope when the dataset included many deletes (in this case, one in three of the transactions was a deletion). Lossy Counting performed worst on both recall and precision, while Frequent managed to get good recall only when the signal was very skewed, meaning the hot items had very high frequencies compared to all other items. Even when the recall of the other algorithms was reasonably good (finding around three-quarters of the hot items), their precision was very poor: for every hot item that was reported, around 10 infrequent items were also included in the output, and we could not distinguish between these two types. Meanwhile, both group testing approaches succeeded in finding almost all hot items, and outputting few infrequent items. There is a price to pay for the extra power of the group testing algorithm: it takes longer to process each item under our implementation, and requires more memory. However, these memory requirements are all very small compared to the size of the dataset: both group testing methods used 187 kB; Lossy Counting allocated 40 kB on average, and Frequent used 136 kB.³ In a later section, we look at the time and space costs of the group testing methods in more detail.

6.4 Real Data with Insertions and Deletions

We obtained data from one of AT&T's networks for part of a day, totaling around 100 MB. This consisted of a sequence of new telephone connections being initiated, and subsequently closed. The duration of the connections varied considerably, meaning that at any one time there were huge numbers of connections in place. In total, there were 3.5 million transactions. We ran the algorithms on this dynamic sequence in order to test their ability to operate on naturally occurring sequences. After every 100,000 transactions we posed the query to find all (source, destination) pairs with a current frequency greater than 1%. We were grouping connections by their regional codes, giving many millions of possible pairs, m, although we discovered that geographically neighboring areas generated the most communication. This meant that there were significant numbers of pairings achieving the target frequency. Again, we computed recall and precision for the three algorithms, with the results shown in Figure 7: we set T = 2 again and ran nonadaptive group testing (NAGT) and adaptive group testing (Adapt). The nonadaptive group testing approach is shown to be justified here on real data. In terms of both recall and precision, it is nearly perfect. On one occasion, it overlooked a hot item, and a few times it included items which were not hot. Under certain circumstances this may be acceptable if the items included are "nearly hot," that is, are just under the threshold for being considered hot. However, we did not pursue this line. In the same amount of space, adaptive group testing did almost as well, although its recall and precision were both

³ These reflected the space allocated for the insert-only algorithms based on upper bounds on the space needed. This was done to avoid complicated and costly memory allocation while processing transactions.

Fig. 7. Performance results on real data.

Fig. 8. Choosing the frequency level at query time: the data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% to 0.01%.

less good overall than nonadaptive. Both methods reached perfect precision and recall as T was increased: nonadaptive group testing achieved perfect scores for T = 3, and adaptive for T = 7. Lossy Counting performed generally poorly on this dynamic dataset, its quality of results swinging wildly between readings but on average finding only half the hot items. The recall of the Frequent algorithm looked reasonably good, especially as time progressed, but its precision, which began poorly, appeared to degrade further. One possible explanation is that the algorithm was collecting all items which were ever hot, and outputting these whether they were hot or not. Certainly, it output between two to three times as many items as were currently hot, meaning that its output necessarily contained many infrequent items. Next, we ran tests which demonstrated the flexibility of our approach. As noted in Section 3.2, if we create a set of counters for nonadaptive group testing for a particular frequency level f = 1/(k + 1), then we can use these counters to answer a query for a higher frequency level without any need for recomputation. To test this, we computed the data structure for the first million items of the real data set based on a frequency level of 0.5%. We then asked for all hot items for a variety of frequencies between 10% and 0.5%. The results are shown in Figure 8. As predicted, the recall level was the same (100% throughout), and precision was high, with a few nonhot items included at various points. We then examined how much below the designed capability we could push the group testing algorithm, and ran queries asking for hot items with progressively lower frequencies. For nonadaptive group testing with T = 1, the quality of the recall began deteriorating after the query frequency descended below 0.5%, but ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Fig. 9. Timing results on real data.

for T = 3 the results maintained an impressive level of recall down to around the 0.05% level, after which the quality deteriorated (around this point, the threshold for being considered a hot item was down to having a count in single figures, due to deletions removing previously inserted items). Throughout, the precision of both sets of results were very high, close to perfect even when used far below the intended range of operation. 6.5 Timing Results On the real data, we timed how long it took to process transactions, as we varied certain parameters of the methods. We also plotted the time taken by the insert-only methods for comparison. Timing results are shown in Figure 9. On the left are timing results for working through the whole data set. As we would expect, the time scaled roughly linearly with the number of transactions processed. Nonadaptive group testing was a few times slower than for the insertion-only methods, which were very fast. With T = 2, nonadaptive group testing processed over a million transactions per second. Adaptive group testing was somewhat slower. Although asymptotically the two methods have the same update cost, here we see the effect of the difference in the methods: since adaptive group testing computes many more hash functions than nonadaptive (see Section 5), the cost of this computation is clear. It is therefore desirable to look at how to reduce the number of hash function computations done by adaptive group testing. Applying the ideas discussed in Sections 3.3 and 4.3, we tried varying the parameter b from 2. The results for this are shown on the right in Figure 9. Here, we plot the time to process two million transactions for different values of b against T , the number of repetitions of the process. It can be seen that increasing b does indeed bring down the cost of adaptive and nonadaptive group testing. For T = 1, nonadaptive group testing becomes competitive with the insertion methods in terms of time to process each transaction. We also measured the output time for each method. The adaptive group testing approach took an average 5 ms per query, while the nonadaptive group testing took 2 ms. The deterministic approaches took less than 1 ms per query. 6.6 Time-Space Tradeoffs To see in more detail the effect of varying b, we plotted the time to process two million transactions for eight different values of b (2, 4, 8, 16, 32, 64, 128, and ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

Fig. 10. Time and space costs of varying b.

Fig. 11. Precision and recall on real data as b and T vary.

256) and three values of T (1, 2, 3) at k = 100. The results are shown in Figure 10. Although increasing b does improve the update time for every method, the effect becomes much less pronounced for larger values of b, suggesting that the most benefit is to be had for small values of b. The benefit seems strongest for adaptive group testing, which has the most to gain. Nonadaptive group testing still computes T functions per item, so eventually the benefit of larger b is insignificant compared to this fixed cost. For nonadaptive group testing, the space must increase as b increases. We plotted this on the right in Figure 10. It can be seen that the space increases quite significantly for large values of b, as predicted. For b = 2 and T = 1, the space used is about 12 kB, while for b = 256, the space has increased to 460 kB. For T = 2 and T = 3, the space used is twice and three times this, respectively. It is important to see the effect of this tradeoff on accuracy as well. For nonadaptive group testing, the precision and recall remained the same (100% for both) as b and T were varied. For adaptive group testing, we kept the space fixed and looked at how the accuracy varied for different values of T . The results are given in Figure 11. It can be seen that there is little variation in the recall with b, but it increases slightly with T , as we would expect. For precision, the difference is more pronounced. For small values of T , increasing b to speed up processing has an immediate effect on the precision: more items which are not hot are included in the output as b increases. For larger values of T , this effect is reduced: increasing b does not affect precision by as much. Note that the transaction processing time is proportional to T/ log(b), so it seems that good tradeoffs are achieved for T = 1 and b = 4 and for T = 3 and b = 8 or 16. Looking at Figure 10, we see that these points achieve similar update times, of approximately one million items per second in our experiments. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

7. CONCLUSIONS We have proposed two new methods for identifying hot items which occur more than some frequency threshold. These are the first methods which can cope with dynamic datasets, that is, the removal as well as the addition of items. They perform to a high degree of accuracy in practice, as guaranteed by our analysis of the algorithm, and are quite simple to implement. In our experimental analysis, it seemed that an approach based on nonadaptive group testing was slightly preferable to one based on adaptive group testing, in terms of recall, precision, and time. Recently, we have taken these ideas of using group testing techniques to identify items of interest in small space, and applied them to other problems. For example, consider finding items which have the biggest frequency difference between two datasets. Using a similar arrangement of groups but a different test allows us to find such items while processing transactions at very high rates and keeping only small summaries for each dataset [Cormode and Muthukrishnan 2004b]. This is of interest in a number of scenarios, such as trend analysis, financial datasets, and anomaly detection [Yi et al. 2000]. One point of interest is that, for that scenario, it is straightforward to generalize the nonadaptive group testing approach, but the adaptive group testing approach cannot be applied so easily. Our approach of group testing may have application to other problems, notably in designing summary data structures for the maintenance of other statistics of interest and in data stream applications. An interesting open problem is to find combinatorial designs which can achieve the same properties as our randomly chosen groups, in order to give a fully deterministic construction for maintaining hot items. The main challenge here is to find good “decoding” methods: given the result of testing various groups, how to determine what the hot items are. We need such methods that work quickly in small space. A significant problem that we have not approached here is that of continuously monitoring the hot items—that is, to maintain a list of all items that are hot, and keep this updated as transactions are observed. A simple solution is to keep the same data structure, and to run the query procedure when needed, say once every second, or whenever n has changed by more than k. (After an item is inserted, it is easy to check whether it is now a hot item. Following deletions, other items can become hot, but the threshold of n/(k + 1) only changes when n has decreased by k + 1.) In our experiments, the cost of running queries is a matter of milliseconds and so is quite a cheap operation to perform. In some situations this is sufficient, but a more general solution is needed for the full version of this problem. ACKNOWLEDGMENTS

We thank the anonymous referees for many helpful suggestions.

REFERENCES

AHO, A. V., HOPCROFT, J. E., AND ULLMAN, J. D. 1987. Data Structures and Algorithms. Addison-Wesley, Reading, MA.

ALON, N., GIBBONS, P., MATIAS, Y., AND SZEGEDY, M. 1999. Tracking join and self-join sizes in limited storage. In Proceedings of the Eighteenth ACM Symposium on Principles of Database Systems. 10–20.
ALON, N., MATIAS, Y., AND SZEGEDY, M. 1996. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing. 20–29. Journal version in J. Comput. Syst. Sci. 58, 137–147, 1999.
BABCOCK, B. AND OLSTON, C. 2003. Distributed top-k monitoring. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
BARBARA, D., WU, N., AND JAJODIA, S. 2001. Detecting novel network intrusions using Bayes estimators. In Proceedings of the First SIAM International Conference on Data Mining.
BOYER, B. AND MOORE, J. 1982. A fast majority vote algorithm. Tech. Rep. 35. Institute for Computer Science, University of Texas at Austin, Austin, TX.
CARTER, J. L. AND WEGMAN, M. N. 1979. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2, 143–154.
CHARIKAR, M., CHEN, K., AND FARACH-COLTON, M. 2002. Finding frequent items in data streams. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). 693–703.
CORMODE, G. AND MUTHUKRISHNAN, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the ACM Conference on Principles of Database Systems. 296–306.
CORMODE, G. AND MUTHUKRISHNAN, S. 2004a. An improved data stream summary: The count-min sketch and its applications. J. Algorithms. In press.
CORMODE, G. AND MUTHUKRISHNAN, S. 2004b. What's new: Finding significant differences in network data streams. In Proceedings of IEEE Infocom.
DEMAINE, E., LÓPEZ-ORTIZ, A., AND MUNRO, J. I. 2002. Frequency estimation of Internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 2461. Springer, Berlin, Germany, 348–360.
DU, D.-Z. AND HWANG, F. 1993. Combinatorial Group Testing and Its Applications. Series on Applied Mathematics, vol. 3. World Scientific, Singapore.
ESTAN, C. AND VARGHESE, G. 2002. New directions in traffic measurement and accounting. In Proceedings of ACM SIGCOMM. Journal version in Comput. Commun. Rev. 32, 4, 323–338.
FANG, M., SHIVAKUMAR, N., GARCIA-MOLINA, H., MOTWANI, R., AND ULLMAN, J. D. 1998. Computing iceberg queries efficiently. In Proceedings of the International Conference on Very Large Data Bases. 299–310.
FISCHER, M. AND SALZBERG, S. 1982. Finding a majority among n votes: Solution to problem 81-5. J. Algorith. 3, 4, 376–379.
GAROFALAKIS, M., GEHRKE, J., AND RASTOGI, R. 2002. Querying and mining data streams: You only get one look. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
GIBBONS, P. AND MATIAS, Y. 1998. New sampling-based summary statistics for improving approximate query answers. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Journal version in ACM SIGMOD Rec. 27, 331–342.
GIBBONS, P. AND MATIAS, Y. 1999. Synopsis structures for massive data sets. DIMACS Series in Discrete Mathematics and Theoretical Computer Science A.
GIBBONS, P. B., MATIAS, Y., AND POOSALA, V. 1997. Fast incremental maintenance of approximate histograms. In Proceedings of the International Conference on Very Large Data Bases. 466–475.
GILBERT, A., GUHA, S., INDYK, P., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002a. Fast, small-space algorithms for approximate histogram maintenance. In Proceedings of the 34th ACM Symposium on the Theory of Computing. 389–398.
GILBERT, A., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2001. QuickSAND: Quick summary and analysis of network data. DIMACS Tech. Rep. 2001-43. Available online at http://dimacs.rutgers.edu/TechnicalReports/.
GILBERT, A. C., KOTIDIS, Y., MUTHUKRISHNAN, S., AND STRAUSS, M. 2002b. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases. 454–465.

278



G. Cormode and S. Muthukrishnan

IOANNIDIS, Y. E. AND CHRISTODOULAKIS, S. 1993. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Trans. Database Syst. 18, 4, 709–748.
IOANNIDIS, Y. E. AND POOSALA, V. 1995. Balancing histogram optimality and practicality for query result size estimation. In Proceedings of the ACM SIGMOD International Conference on the Management of Data. 233–244.
KARP, R., PAPADIMITRIOU, C., AND SHENKER, S. 2003. A simple algorithm for finding frequent elements in sets and bags. ACM Trans. Database Syst. 28, 51–55.
KUSHILEVITZ, E. AND NISAN, N. 1997. Communication Complexity. Cambridge University Press, Cambridge, U.K.
MANKU, G. AND MOTWANI, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases. 346–357.
MISRA, J. AND GRIES, D. 1982. Finding repeated elements. Sci. Comput. Programm. 2, 143–152.
MOTWANI, R. AND RAGHAVAN, P. 1995. Randomized Algorithms. Cambridge University Press, Cambridge, U.K.
MUTHUKRISHNAN, S. 2003. Data streams: Algorithms and applications. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms. Available online at http://athos.rutgers.edu/~muthu/stream-1-1.ps.
THORUP, M. 2000. Even strongly universal hashing is pretty fast. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms. 496–497.
YI, B.-K., SIDIROPOULOS, N., JOHNSON, T., JAGADISH, H., FALOUTSOS, C., AND BILIRIS, A. 2000. Online data mining for co-evolving time sequences. In Proceedings of the 16th International Conference on Data Engineering (ICDE '00). 13–22.

Received October 2003; revised June 2004; accepted September 2004


XML Stream Processing Using Tree-Edit Distance Embeddings MINOS GAROFALAKIS Bell Labs, Lucent Technologies and AMIT KUMAR Indian Institute of Technology

We propose the first known solution to the problem of correlating, in small space, continuous streams of XML data through approximate (structure and content) matching, as defined by a general tree-edit distance metric. The key element of our solution is a novel algorithm for obliviously embedding tree-edit distance metrics into an L1 vector space while guaranteeing a (worst-case) upper bound of O(log^2 n log* n) on the distance distortion between any data trees with at most n nodes. We demonstrate how our embedding algorithm can be applied in conjunction with known random sketching techniques to (1) build a compact synopsis of a massive, streaming XML data tree that can be used as a concise surrogate for the full tree in approximate tree-edit distance computations; and (2) approximate the result of tree-edit-distance similarity joins over continuous XML document streams. Experimental results from an empirical study with both synthetic and real-life XML data trees validate our approach, demonstrating that the average-case behavior of our embedding techniques is much better than what would be predicted from our theoretical worst-case distortion bounds. To the best of our knowledge, these are the first algorithmic results on low-distortion embeddings for tree-edit distance metrics, and on correlating (e.g., through similarity joins) XML data in the streaming model.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; G.2.1 [Discrete Mathematics]: Combinatorics—Combinatorial algorithms

General Terms: Algorithms, Performance, Theory

Additional Key Words and Phrases: XML, data streams, data synopses, approximate query processing, tree-edit distance, metric-space embeddings

1. INTRODUCTION

The Extensible Markup Language (XML) is rapidly emerging as the new standard for data representation and exchange on the Internet. The simple,

A preliminary version of this article appeared in Proceedings of the 22nd Annual ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (San Diego, CA, June) [Garofalakis and Kumar 2003].
Authors' addresses: M. Garofalakis, Bell Labs, Lucent Technologies, 600 Mountain Ave., Murray Hill, NJ 07974; email: [email protected]; A. Kumar, Department of Computer Science and Engineering, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; email: [email protected].
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2005 ACM 0362-5915/05/0300-0279 $5.00
ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005, Pages 279–332.

self-describing nature of the XML standard promises to enable a broad suite of next-generation Internet applications, ranging from intelligent Web searching and querying to electronic commerce. In many respects, XML documents are instances of semistructured data: the underlying data model comprises an ordered, labeled tree of element nodes, where each element can be either an atomic data item or a composite data collection consisting of references (represented as edges) to child elements in the XML tree. Further, labels (or tags) stored with XML data elements describe the actual semantics of the data, rather than simply specifying how elements are to be displayed (as in HTML). Thus, XML data is tree-structured and self-describing. The flexibility of the XML data model makes it a very natural and powerful tool for representing data from a wide variety of Internet data sources. Of course, given the typical autonomy of such sources, identical or similar data instances can be represented using different XML-document tree structures. For example, different online news sources may use distinct document type descriptor (DTD) schemas to export their news stories, leading to different node labels and tree structures. Even when the same DTD is used, the resulting XML trees may not have the same structure, due to the presence of optional elements and attributes [Guha et al. 2002]. Given the presence of such structural differences and inconsistencies, it is obvious that correlating XML data across different sources needs to rely on approximate XML-document matching, where the approximation is quantified through an appropriate general distance metric between XML data trees. Such a metric for comparing ordered labeled trees has been developed by the combinatorial pattern matching community in the form of tree-edit distance [Apostolico and Galil 1997; Zhang and Shasha 1989]. In a nutshell, the treeedit distance metric is the natural generalization of edit distance from the string domain; thus, the tree-edit distance between two tree structures represents the minimum number of basic edit operations (node inserts, deletes, and relabels) needed to transform one tree to the other. Tree-edit distance is a natural metric for correlating and discovering approximate matches in XML document collections (e.g., through an appropriately defined similarity-join operation).1 The problem becomes particularly challenging in the context of streaming XML data sources, that is, when such correlation queries must be evaluated over continuous XML data streams that arrive and need to be processed on a continuous basis, without the benefit of several passes over a static, persistent data image. Algorithms for correlating such XML data streams would need to work under very stringent constraints, typically providing (approximate) results to user queries while (a) looking at the relevant XML data only once and in a fixed order (determined by the stream-arrival pattern) and (b) using a small amount of memory (typically, logarithmic or polylogarithmic in the size of the stream) [Alon et al. 1996, 1999; 1 Specific

semantics associated with XML node labels and tree-edit operations can be captured using a generalized, weighted tree-edit distance metric that associates different weights/costs with different operations. Extending the algorithms and results in this article to weighted tree-edit distance is an interesting open problem.

Fig. 1. Example DTD fragments (a) and (b) and XML Document Trees (c) and (d) for autonomous bibliographic Web sources.

Dobra et al. 2002; Gilbert et al. 2001]. Of course, such streaming-XML techniques are more generally applicable in the context of huge, terabyte XML databases, where performing multiple passes over the data to compute an exact result can be prohibitively expensive. In such scenarios, having single-pass, space-efficient XML query-processing algorithms that produce good-quality approximate answers offers a very viable and attractive alternative [Babcock et al. 2002; Garofalakis et al. 2002]. Example 1.1. Consider the problem of integrating XML data from two autonomous, bibliographic Web sources WS1 and WS2 . One of the key issues in such data-integration scenarios is that of detecting (approximate) duplicates across the two sources [Dasu and Johnson 2003]. For autonomously managed XML sources, such duplicate-detection tasks are complicated by the fact that the sources could be using different DTD structures to describe their entries. As a simple example, Figures 1(a) and 1(b) depict the two different DTD fragments employed by WS1 and WS2 (respectively) to describe XML trees for academic publications; clearly, WS1 uses a slightly different set of tags (i.e., article instead of paper) as well as a “deeper” DTD structure (by adding the type and authors structuring elements). Figures 1(c) and 1(d) depict two example XML document trees T1 and T2 from WS1 and WS2 , respectively; even though the two trees have structural differences, it is obvious that T1 and T2 represent the same publication. In fact, it is easy to see that T1 and T2 are within a tree-edit distance of 3 (i.e., one relabel and two delete operations on T1 ). Approximate duplicate detection across WS1 and WS2 can be naturally expressed as a tree-edit distance similarity join operation that returns the pairs of trees (T1 , T2 ) ∈ WS1 × WS2 that are within a tree-edit distance of τ , where the user/application-defined similarity threshold τ is set to a value ≥ 3 to perhaps account for other possible differences in the joining tree structures (e.g., missing or misspelled coauthor ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.

names). A single-pass, space-efficient technique for approximating such similarity joins as the document trees from the two XML data sources are streaming in would provide an invaluable data-integration tool; for instance, estimates of the similarity-join result size (i.e., the number of approximate duplicate entries) can provide useful indicators of the degree of overlap (i.e., “content similarity”) or coverage (i.e., “completeness”) of autonomous XML data sources [Dasu and Johnson 2003; Florescu et al. 1997]. 1.1 Prior Work Techniques for data reduction and approximate query processing for both relational and XML databases have received considerable attention from the database research community in recent years [Acharya et al. 1999; Chakrabarti et al. 2000; Garofalakis and Gibbons 2001; Ioannidis and Poosala 1999; Polyzotis and Garofalakis 2002; Polyzotis et al. 2004; Vitter and Wang 1999]. The vast majority of such proposals, however, rely on the assumption of a static data set which enables either several passes over the data to construct effective data synopses (such as histograms [Ioannidis and Poosala 1999] or Haar wavelets [Chakrabarti et al. 2000; Vitter and Wang 1999]); clearly, this assumption renders such solutions inapplicable in a data-stream setting. Massive, continuous data streams arise naturally in a variety of different application domains, including network monitoring, retail-chain and ATM transaction processing, Web-server record logging, and so on. As a result, we are witnessing a recent surge of interest in data-stream computation, which has led to several (theoretical and practical) studies proposing novel one-pass algorithms with limited memory requirements for different problems; examples include quantile and order-statistics computation [Greenwald and Khanna 2001; Gilbert et al. 2002b]; distinct-element counting [Bar-Yossef et al. 2002; Cormode et al. 2002a]; frequent itemset counting [Charikar et al. 2002; Manku and Motwani 2002]; estimating frequency moments, join sizes, and difference norms [Alon et al. 1996, 1999; Dobra et al. 2002; Feigenbaum et al. 1999; Indyk 2000]; and, computing one- or multidimensional histograms or Haar wavelet decompositions [Gilbert et al. 2002a; Gilbert et al. 2001; Thaper et al. 2002]. All these articles rely on an approximate query-processing model, typically based on an appropriate underlying stream-synopsis data structure. (A different approach, explored by the Stanford STREAM project [Arasu et al. 2002], is to characterize subclasses of queries that can be computed exactly with bounded memory.) The synopses of choice for a number of the above-cited data-streaming articles are based on the key idea of pseudorandom sketches which, essentially, can be thought of as simple, randomized linear projections of the underlying data item(s) (assumed to be points in some numeric vector space). Recent work on XML-based publish/subscribe systems has dealt with XML document streams, but only in the context of simple, predicate-based filtering of individual documents [Altinel and Franklin 2000; Chan et al. 2002; Diao et al. 2003; Gupta and Suciu 2003; Lakshmanan and Parthasarathy 2002]; more recent work has also considered possible transformations of the XML documents in order to produce customized output [Diao and Franklin 2003]. Clearly, the ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


problem of efficiently correlating XML documents across one or more input streams gives rise to a drastically different set of issues. Guha et al. [2002] discussed several different algorithms for performing tree-edit distance joins over XML databases. Their work introduced easier-to-compute bounds on the tree-edit distance metric and other heuristics that can significantly reduce the computational cost incurred due to all-pairs tree-edit distance computations. However, Guha et al. focused solely on exact join computation and their algorithms require multiple passes over the data; this obviously renders them inapplicable in a data-stream setting. 1.2 Our Contributions All earlier work on correlating continuous data streams (through, e.g., join or norm computations) in small space has relied on the assumption of flat, relational data items over some appropriate numeric vector space; this is certainly the case with the sketch-based synopsis mechanism (discussed above), which has been the algorithmic tool of choice for most of these earlier research efforts. Unfortunately, this limitation renders earlier streaming results useless for directly dealing with streams of structured objects defined over a complex metric space, such as XML-document streams with a tree-edit distance metric. In this article, we propose the first known solution to the problem of approximating (in small space) the result of correlation queries based on tree-edit distance (such as the tree-edit distance similarity joins described in Example 1.1) over continuous XML data streams. The centerpiece of our solution is a novel algorithm for effectively (i.e., “obliviously” [Indyk 2001]) embedding streaming XML and the tree-edit distance metric into a numeric vector space equipped with the standard L1 distance norm, while guaranteeing a worst-case upper bound of O(log2 n log∗ n) on the distance distortion between any data trees with at most n nodes.2 Our embedding is completely deterministic and relies on parsing an XML tree into a hierarchy of special subtrees. Our parsing makes use of a deterministic coin-tossing process recently introduced by Cormode and Muthukrishnan [2002] for embedding a variant of the string-edit distance (that, in addition to standard string edits, includes an atomic “substring move” operation) into L1 ; however, since we are dealing with general trees rather than flat strings, our embedding algorithm and its analysis are significantly more complex, and result in different bounds on the distance distortion.3 We also demonstrate how our vector-space embedding construction can be combined with earlier sketching techniques [Alon et al. 1999; Dobra et al. 2002; Indyk 2000] to obtain novel algorithms for (1) constructing a small sketch synopsis of a massive, streaming XML data tree that can be used as a concise 2 All

log’s in this article denote base-2 logarithms; log∗ n denotes the number of log applications required to reduce n to a quantity that is ≤ 1, and is a very slowly increasing function of n. 3 Note that other known techniques for approximating string-edit distance based on the decomposition of strings into q-grams [Ukkonen 1992; Gravano et al. 2001] only give one-sided error guarantees, essentially offering no guaranteed upper bound on the distance distortion. For instance, it is not difficult to construct examples of very distinct strings with nearly identical q-gram sets (i.e., arbitrarily large distortion). Furthermore, to the best of our knowledge, the results in Ukkonen [1992] have not been extended to the case of trees and tree-edit distance. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


surrogate for the full tree in tree-edit distance computations, and (2) estimating the result size of a tree-edit-distance similarity join over two streams of XML documents. Finally, we present results from an empirical study of our embedding algorithm with both synthetic and real-life XML data trees. Our experimental results offer some preliminary validation of our approach, demonstrating that the average-case behavior of our techniques over realistic data sets is much better than what our theoretical worst-case distortion bounds would predict, and revealing several interesting characteristics of our algorithms in practice. To the best of our knowledge, ours are the first algorithmic results on oblivious tree-edit distance embeddings, and on effectively correlating continuous, massive streams of XML data. We believe that our embedding algorithm also has other important applications. For instance, exact tree-edit distance computation is typically a computationally-expensive problem that can require up to O(n4 ) time (for the conventional tree-edit distance metric [Apostolico and Galil 1997; Zhang and Shasha 1989]), and is, in fact, N P-hard for the variant of tree-edit distance considered in this article (even for the simpler case of flat strings [Shapira and Storer 2002]). In contrast, our embedding scheme can be used to provide an approximate tree-edit distance (to within a guaranteed O(log2 n log∗ n) factor) in near-linear, that is, O(n log∗ n), time. 1.3 Organization The remainder of this article is organized as follows. Section 2 presents background material on XML, tree-edit distance and data-streaming techniques. In Section 3, we present an overview of our approach for correlating XML data streams based on tree-edit distance embeddings. Section 4 presents our embedding algorithm in detail and proves its small-time and low distance-distortion guarantees. We then discuss two important applications of our algorithm for XML stream processing, namely (1) building a sketch synopsis of a massive, streaming XML data tree (Section 5), and (2) approximating similarity joins over streams of XML documents (Section 6). We present the results of our empirical study with synthetic and real-life XML data in Section 7. Finally, Section 8 outlines our conclusions. The Appendix provides ancillary lemmas (and their proofs) for the upper bound result. 2. PRELIMINARIES 2.1 XML Data Model and Tree-Edit Distance An XML document is essentially an ordered, labeled tree T , where each node in T represents an XML element and is characterized by a label taken from a fixed alphabet of string literals σ . Node labels capture the semantics of XML elements, and edges in T capture element nesting in the XML data. Without loss of generality, we assume that the alphabet σ captures all node labels, literals, and atomic values that can appear in an XML tree (e.g., based on the underlying DTD(s)); we also focus on the ordered, labeled tree structure of the XML data and ignore the raw-character data content inside nodes with string ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 2. Example XML tree and tree-edit operation.

labels (PCDATA, CDATA, etc.). We use |T | and |σ | to denote the number of nodes in T and the number of symbols in σ , respectively. Given two XML document trees T1 and T2 , the tree-edit distance between T1 and T2 (denoted by d (T1 , T2 )) is defined as the minimum number of tree-edit operations to transform one tree into another. The standard set of tree-edit operations [Apostolico and Galil 1997; Zhang and Shasha 1989] includes (1) relabeling (i.e., changing the label) of a tree node v; (2) deleting a tree node v (and moving all of v’s children under its parent); and (3) inserting a new node v under a node w and moving a contiguous subsequence of w’s children (and their descendants) under the new node v. (Note that the node-insertion operation is essentially the complement of node deletion.) An example XML tree and tree-edit operation are depicted in Figure 2. In this article, we consider a variant of the tree-edit distance metric, termed tree-edit distance with subtree moves, that, in addition to the above three standard edit operations, allows a subtree to be moved under a new node in the tree in one step. We believe that subtree moves make sense as a primitive edit operation in the context of XML data—identical substructures can appear in different locations (for example, due to a slight variation of the DTD), and rearranging such substructures should probably be considered as basic an operation as node insertion or deletion. In the remainder of this article, the term tree-edit distance assumes the four primitive edit operations described above, namely, node relabelings, deletions, insertions, and subtree moves.4 2.2 Data Streams and Basic Pseudorandom Sketching In a data-streaming environment, data-processing algorithms are allowed to see the incoming data records (e.g., relational tuples or XML documents) only once as they are streaming in from (possibly) different data sources [Alon et al. 1996, 1999; Dobra et al. 2002]. Backtracking over the stream and explicit access to past data records are impossible. The data-processing algorithm is also allowed a small amount of memory, typically logarithmic or polylogarithmic in the data-stream size, in order to maintain concise synopsis data structures for the input stream(s). In addition to their small-space requirement, these synopses should also be easily computable in a single pass over the data and with small per-record processing time. At any point in time, the algorithm can combine the maintained collection of synopses to produce an approximate result. 4 The

problem of designing efficient (i.e., “oblivious”), guaranteed-distortion embedding schemes for the standard tree-edit distance metric remains open; of course, this is also true for the much simpler standard string-edit distance metric (i.e., without “substring moves”) [Cormode and Muthukrishnan 2002]. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
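To make these primitives concrete, here is a minimal Python sketch of an ordered, labeled tree together with the four edit operations considered in this article (node relabeling, node deletion, node insertion, and subtree move). The Node class and the function names are our own illustrative choices, not code from the article or its system.

```python
# Illustrative sketch only: a tiny ordered, labeled tree and the four
# primitive edit operations assumed by the tree-edit distance variant above.

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children if children is not None else []

def relabel(node, new_label):
    # (1) relabeling: change the label of a tree node
    node.label = new_label

def delete(parent, i):
    # (2) deletion: remove the ith child and splice its children, in order,
    # into its former position under the parent
    victim = parent.children[i]
    parent.children[i:i + 1] = victim.children

def insert(parent, i, j, label):
    # (3) insertion: create a new node under 'parent' that adopts the
    # contiguous child subsequence parent.children[i:j]
    parent.children[i:j] = [Node(label, parent.children[i:j])]

def move_subtree(old_parent, i, new_parent, k):
    # (4) subtree move: detach the subtree rooted at old_parent.children[i]
    # and reattach it as the kth child of new_parent, in one step
    new_parent.children.insert(k, old_parent.children.pop(i))
```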


We focus on one particular type of stream synopses, namely, pseudorandom sketches; sketches have provided effective solutions for several streaming problems, including join and multijoin processing [Alon et al. 1996, 1999; Dobra et al. 2002], norm computation [Feigenbaum et al. 1999; Indyk 2000], distinct-element counting [Cormode et al. 2002a], and histogram or Haar-wavelet construction [Gilbert et al. 2001; Thaper et al. 2002]. We describe the basics of pseudorandom sketching schemes using a simple binary-join cardinality estimation query [Alon et al. 1999]. More specifically, assume that we want to estimate Q = COUNT(R1 ✶A R2 ), that is, the cardinality of the binary equijoin of two streaming relations R1 and R2 over a (numeric) attribute (or, set of attributes) A, whose values we assume (without loss of generality) to range over {1, . . . , N }. (Note that, by the definition of the equijoin operator, the two join attributes have identical value domains.) Letting f k (i) (k = 1, 2; i = 1, . . . , N ) denote the frequency of the ith value in Rk , it is easy to see that Q = Σi f 1 (i) f 2 (i). Clearly, estimating this join size exactly requires at least Ω(N ) space, making an exact solution impractical for a data-stream setting. In their seminal work, Alon et al. [Alon et al. 1996, 1999] proposed a randomized technique that can offer strong probabilistic guarantees on the quality of the resulting join-size estimate while using space that can be significantly smaller than N . Briefly, the key idea is to (1) build an atomic sketch X k (essentially, a randomized linear projection) of the distribution vector for each input stream Rk (k = 1, 2) (such a sketch can be easily computed over the streaming values of Rk in only O(log N ) space) and (2) use the atomic sketches X 1 and X 2 to define a random variable X Q such that (a) X Q is an unbiased (i.e., correct on expectation) randomized estimator for the target join size, so that E[X Q ] = Q, and (b) X Q ’s variance (Var[X Q ]) can be appropriately upper-bounded to allow for probabilistic guarantees on the quality of the Q estimate. More formally, this random variable X Q is constructed on-line from the two data streams as follows:
— Select a family of four-wise independent binary random variates {ξi : i = 1, . . . , N }, where each ξi ∈ {−1, +1} and P [ξi = +1] = P [ξi = −1] = 1/2 (i.e., E[ξi ] = 0). Informally, the four-wise independence condition means that, for any 4-tuple of ξi variates and for any 4-tuple of {−1, +1} values, the probability that the values of the variates coincide with those in the {−1, +1} 4-tuple is exactly 1/16 (the product of the equality probabilities for each individual ξi ). The crucial point here is that, by employing known tools (e.g., orthogonal arrays) for the explicit construction of small sample spaces supporting four-wise independence, such families can be efficiently constructed on-line using only O(log N ) space [Alon et al. 1996].
— Define X Q = X 1 · X 2 , where the atomic sketch X k is defined simply as X k = Σi f k (i)ξi , for k = 1, 2. Again, note that each X k is a simple randomized linear projection (inner product) of the frequency vector of Rk .A with the vector of ξi ’s that can be efficiently generated from the streaming values of A as follows: start a counter with X k = 0 and simply add ξi to X k whenever the ith value of A is observed in the Rk stream.
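As an illustration of this construction, the following is a minimal Python sketch of an atomic AMS sketch and of the basic join-size estimator X Q = X 1 · X 2. The particular four-wise independent family used below (a random degree-3 polynomial over a prime field, mapped to ±1 by its low-order bit) is one standard realization and is our own choice here; class and function names are likewise illustrative.

```python
import random

# Illustrative atomic AMS sketch (not the article's code). Four-wise
# independence is obtained from a random degree-3 polynomial modulo a prime.
P = (1 << 61) - 1

class AMSSketch:
    def __init__(self, seed):
        rng = random.Random(seed)
        self.coef = [rng.randrange(P) for _ in range(4)]
        self.x = 0                      # the atomic sketch X_k

    def xi(self, i):
        # four-wise independent +/-1 variate for domain value i
        # (the O(1/P) bias from the odd modulus is ignored in this sketch)
        h = 0
        for c in self.coef:
            h = (h * i + c) % P
        return 1 if h & 1 else -1

    def update(self, i):
        # called once per streaming tuple whose join-attribute value is i
        self.x += self.xi(i)

def estimate_join_size(stream1, stream2, seed=0):
    # X_Q = X_1 * X_2 is an unbiased estimator of COUNT(R1 join_A R2)
    x1, x2 = AMSSketch(seed), AMSSketch(seed)   # same xi family on both sides
    for v in stream1:
        x1.update(v)
    for v in stream2:
        x2.update(v)
    return x1.x * x2.x
```

Averaging several such estimates built from independent seeds, and taking medians of the averages, yields the boosted guarantees discussed next.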


The quality of the estimation guarantees can be improved using a standard boosting technique that maintains several independent identically distributed (iid) instantiations of the above process, and uses averaging and median-selection operators over the X Q estimates to boost accuracy and probabilistic confidence [Alon et al. 1996]. (Independent instances can be constructed by simply selecting independent random seeds for generating the families of four-wise independent ξi ’s for each instance.) We use the term (atomic) AMS sketch to describe a randomized linear projection computed in the above-described manner over a data stream. Letting SJk (k = 1, 2) denote the self-join size of Rk .A (i.e., SJk = Σi f k (i)², summing over i = 1, . . . , N ), the following theorem [Alon et al. 1999] shows how sketching can be applied for estimating binary-join sizes in limited space. (By standard Chernoff bounds [Motwani and Raghavan 1995], using median-selection over O(log(1/δ)) of the averages computed in Theorem 2.1 allows the confidence in the estimate to be boosted to 1 − δ, for any pre-specified δ < 1.)

THEOREM 2.1 [ALON ET AL. 1999]. Let the atomic AMS sketches X 1 and X 2 be as defined above. Then, E[X Q ] = E[X 1 X 2 ] = Q and Var(X Q ) ≤ 2 · SJ1 · SJ2 . Thus, averaging the X Q estimates over O(SJ1 · SJ2 /(Q² ǫ²)) iid instantiations of the basic scheme guarantees an estimate that lies within a relative error of at most ǫ from Q with constant probability > 1/2.

It should be noted that the space-usage bounds stated in Theorem 2.1 capture the worst-case behavior of AMS-sketching-based estimation—empirical results with synthetic and real-life data sets have demonstrated that the average-case behavior of the AMS scheme is much better [Alon et al. 1999]. More recent work has led to improved AMS-sketching-based estimators with provably better space-usage guarantees (that actually match the lower bounds shown by Alon et al. [1999]) [Ganguly et al. 2004], and has demonstrated that AMS-sketching techniques can be extended to effectively handle one or more complex multijoin aggregate SQL queries over a collection of relational streams [Dobra et al. 2002, 2004].

Indyk [2000] discussed a different type of pseudorandom sketches which are, once again, defined as randomized linear projections X k = Σi f k (i)ξi of a streaming input frequency vector for the values in Rk , but using random variates {ξi } drawn from a p-stable distribution (which can again be generated in small space, i.e., O(log N ) space) in the X k computation. The class of p-stable distributions has been studied for some time (see, e.g., Nolan [2004]; Uchaikin and Zolotarev [1999])—they are known to exist for any p ∈ (0, 2], and include well-known distribution functions, for example, the Cauchy distribution (for p = 1) and the Gaussian distribution (for p = 2). As the following theorem demonstrates, such p-stable sketches can provide accurate probabilistic estimates for the L p -difference norm of streaming frequency vectors in small space, for any p ∈ (0, 2].

THEOREM 2.2 [INDYK 2000]. Let p ∈ (0, 2], and define the p-stable sketch for the Rk stream as X k = Σi f k (i)ξi (i = 1, . . . , N ), where the {ξi } variates are drawn from a p-stable distribution (k = 1, 2). Assume that we have built l = O(log(1/δ)/ǫ²) iid pairs of p-stable sketches {X 1^j , X 2^j } ( j = 1, . . . , l ), and define X = median{ |X 1^1 − X 2^1 |, . . . , |X 1^l − X 2^l | }. Then, X lies within a relative error of at most ǫ of the L p -difference norm || f 1 − f 2 || p = [ Σi | f 1 (i) − f 2 (i)|^p ]^(1/p) with probability ≥ 1 − δ.

More recently, Cormode et al. [2002a] have also shown that, with small values of p (i.e., p → 0), p-stable sketches can provide very effective estimates for the Hamming (i.e., L0 ) norm (or, the number of distinct values) over continuous streams of updates.

3. OUR APPROACH: AN OVERVIEW

The key element of our methodology for correlating continuous XML data streams is a novel algorithm that embeds ordered, labeled trees and the tree-edit distance metric as points in a (numeric) multidimensional vector space equipped with the standard L1 vector distance, while guaranteeing a small distortion of the distance metric. In other words, our techniques rely on mapping each XML tree T to a numeric vector V (T ) such that the tree-edit distances between the original trees are well-approximated by the L1 vector distances of the tree images under the mapping; that is, for any two XML trees S and T , the L1 distance ||V (S) − V (T )||1 = Σj |V (S)[ j ] − V (T )[ j ]| gives a good approximation of the tree-edit distance d (S, T ). Besides guaranteeing a small bound on the distance distortion, to be applicable in a data-stream setting, such an embedding algorithm needs to satisfy two additional requirements: (1) the embedding should require small space and time per data tree in the stream; and, (2) the embedding should be oblivious, that is, the vector image V (T ) of a tree T cannot depend on other trees in the input stream(s) (since we cannot explicitly store or backtrack to past stream items). Our embedding algorithm satisfies all these requirements.

There is an extensive literature on low-distortion embeddings of metric spaces into normed vector spaces; for an excellent survey of the results in this area, please see the recent article by Indyk [2001]. A key result in this area is Bourgain’s lemma proving that an arbitrary finite metric space is embeddable in an L2 vector space with logarithmic distortion; unfortunately, Bourgain’s technique is neither small space nor oblivious (i.e., it requires knowledge of the complete metric space), so there is no obvious way to apply it in a data-stream setting [Indyk 2001]. To the best of our knowledge, our algorithm gives the first oblivious, small space/time vector-space embedding for a complex tree-edit distance metric.

Given our algorithm for approximately embedding streaming XML trees and tree-edit distance in an L1 vector space, known streaming techniques (like the sketching methods discussed in Section 2.2) now become relevant. In this article, we focus on two important applications of our results in the context of streaming XML, and propose novel algorithms for (1) building a small sketch synopsis of a massive, streaming XML data tree, and (2) approximating the size of a similarity join over XML streams. Once again, these are the first results on


correlating (in small space) massive XML data streams based on the tree-edit distance metric. 3.1 Technical Roadmap The development of the technical material in this article is organized as follows. Section 4 describes our embedding algorithm for the tree-edit distance metric (termed TREEEMBED) in detail. In a nutshell, TREEEMBED constructs a hierarchical parsing of an input XML tree by iteratively contracting edges to produce successively smaller trees; our parsing makes repeated use of a recently proposed label-grouping procedure [Cormode and Muthukrishnan 2002] for contracting chains and leaf siblings in the tree. The bulk of Section 4 is devoted to proving the small-time and low distance-distortion guarantees of our TREEEMBED algorithm (Theorem 4.2). Then, in Section 5, we demonstrate how our embedding algorithm can be combined with the 1-stable sketching technique of Indyk [2000] to build a small sketch synopsis of a massive, streaming XML tree that can be used as a concise surrogate for the tree in approximate tree-edit distance computations. Most importantly, we show that the properties of our embedding allow us to parse the tree and build this sketch in small space and in one pass, as nodes of the tree are streaming by without ever backtracking on the data (Theorem 5.1). Finally, Section 6 shows how to combine our embedding algorithm with both 1-stable and AMS sketching in order to estimate (in limited space) the result size of an approximate treeedit-distance similarity join over two continuous streams of XML documents (Theorem 6.1). 4. OUR TREE-EDIT DISTANCE EMBEDDING ALGORITHM 4.1 Definitions and Overview In this section, we describe our embedding algorithm for the tree-edit distance metric (termed TREEEMBED) in detail, and prove its small-time and low distancedistortion guarantees. We start by introducing some necessary definitions and notational conventions. Consider an ordered, labeled tree T over alphabet σ , and let n = |T |. Also, let v be a node in T , and let s denote a contiguous subsequence of children of node v in T . If the nodes in s are all leaves, then we refer to s as a contiguous leaf-child subsequence of v. (A leaf child of v that is not adjacent to any other leaf child of v is called a lone leaf child of v.) We use T [v, s] to denote the subtree of T obtained as the union of all subtrees rooted at nodes in s and node v itself, retaining all node labels. We also use the notation T ′ [v, s] to denote exactly the same subtree as T [v, s], except that we do not associate any label with the root node v of the subtree. We define a valid subtree of T as any subtree of the form T [v, s], T ′ [v, s], or a path of degree-2 nodes (i.e., a chain) possibly ending in leaf node in T . At a high level, our TREEEMBED algorithm produces a hierarchical parsing of T into a multiset T (T ) of special valid subtrees by stepping through a number of edge-contraction phases producing successively smaller trees. A key component ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


of our solution (discussed later in this section) is the recently proposed deterministic coin tossing procedure of Cormode and Muthukrishnan [2002] for grouping symbols in a string—TREEEMBED employs that procedure repeatedly during each contraction phase to merge tree nodes in a chain as well as sibling leaf nodes. The vector image V (T ) of T is essentially the “characteristic vector” for the multiset T (T ) (over the space of all possible valid subtrees). Our analysis shows that the number of edge-contraction phases in T ’s parsing is O(log n), and that, even though the dimensionality of V (T ) is, in general, exponential in n, our construction guarantees that V (T ) is also very sparse: the total number of nonzero components in V (T ) is only O(n). Furthermore, we demonstrate that our TREEEMBED algorithm runs in near-linear, that is, O(n log∗ n) time. Finally, we prove the upper and lower bounds on the distance distortion guaranteed by our embedding scheme. 4.2 The Cormode-Muthukrishnan Grouping Procedure Clearly, the technical crux lies in the details of our hierarchical parsing process for T that produces the valid-subtree multiset T (T ). A basic element of our solution is the string-processing subroutine presented by Cormode and Muthukrishnan [2002] that uses deterministic coin tossing to find landmarks in an input string S, which are then used to split S into groups of two or three consecutive symbols. A landmark is essentially a symbol y (say, at location j ) of the input string S with the following key property: if S is transformed into S ′ by an edit operation (say, a symbol insertion) at location l far away from j (i.e., |l − j | >> 1), then the Cormode-Muthukrishnan stringprocessing algorithm ensures that y is still designated as a landmark in S ′ . Due to space constraints, we do not give the details of their elegant landmarkbased grouping technique (termed CM-Group in the remainder of this article) in our discussion—they can be found in Cormode and Muthukrishnan [2002]. Here, we only summarize a couple of the key properties of CM-Group that are required for the analysis of our embedding scheme in the following theorem. THEOREM 4.1 [CORMODE AND MUTHUKRISHNAN 2002]. Given a string of length k, the CM-Group procedure runs in time O(k log∗ k). Furthermore, the closest landmark to any symbol x in the string is determined by at most log∗ k + 5 consecutive symbols to the left of x, and at most five consecutive symbols to the right of x. Intuitively, Theorem 4.1 states that, for any given symbol x in a string of length k, the group of (two or three) consecutive symbols chosen (by CM-Group) to include x depends only on the symbols lying in a radius of at most log∗ k + 5 to the left and right of x. Thus, a string-edit operation occurring outside this local neighborhood of symbol x is guaranteed not to affect the group formed containing x. As we will see, this property of the CM-Group procedure is crucial in proving the distance-distortion bounds for our TREEEMBED algorithm. Similarly, the O(k log∗ k) complexity of CM-Group plays an important role in determining the running time of TREEEMBED. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
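For what follows, TREEEMBED uses CM-Group purely as a black box that partitions a sequence of labels into groups of two or three consecutive symbols, subject to the locality property of Theorem 4.1. The stand-in below only mimics that input/output interface, so that the role CM-Group plays in the next section is concrete; it is a greedy chunker of our own devising, not the landmark-based deterministic coin-tossing procedure of Cormode and Muthukrishnan [2002], and it does not provide the locality guarantee.

```python
# Stand-in with the same interface as CM-Group: partition a label sequence
# into groups of two or three consecutive symbols. This is NOT the actual
# landmark-based procedure of Cormode and Muthukrishnan [2002]; it only
# illustrates the output shape that the TREEEMBED parsing relies on.

def group_2_3(labels):
    groups, i, n = [], 0, len(labels)
    while i < n:
        remaining = n - i
        # take three symbols when possible, but never leave a lone trailing one
        take = 3 if remaining >= 3 and remaining != 4 else 2
        take = min(take, remaining)        # degenerate case: a length-1 input
        groups.append(tuple(labels[i:i + take]))
        i += take
    return groups

# e.g., group_2_3(list("abcdefg")) -> [('a','b','c'), ('d','e'), ('f','g')]
```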


4.3 The TREEEMBED Algorithm As mentioned earlier, our TREEEMBED algorithm constructs a hierarchical parsing of T in several phases. In phase i, the algorithm builds an ordered, labeled tree T i that is obtained from the tree of the previous phase T i−1 by contracting certain edges. (The initial tree T 0 is exactly the original input tree T .) Thus, each node v ∈ T i corresponds to a connected subtree of T —in fact, by construction, our TREEEMBED algorithm guarantees that this subtree will be a valid subtree of T . Let v(T ) denote the valid subtree of T corresponding to node v ∈ T i . Determining the node label for v uses a hash function h() that maps the set of all valid subtrees of T to new labels in a one-to-one fashion with high probability; thus, the label of v ∈ T i is defined as the hash-function value h(v(T )). As we demonstrate in Section 7.1, such a valid-subtree-naming function can be computed in small space/time using an adaptation of the Karp-Rabin string fingerprinting algorithm [Karp and Rabin 1987]. Note that the existence of such an efficient naming function is crucial in guaranteeing the small space/time properties for our embedding algorithm since maintaining the exact valid subtrees v(T ) is infeasible; for example, near the end of our parsing, such subtrees are of size O(|T |).5 The pseudocode description of our TREEEMBED embedding algorithm is depicted in Figure 3. As described above, our algorithm builds a hierarchical parsing structure (i.e., a hierarchy of contracted trees T i ) over the input tree T , until the tree is contracted to a single node (|T i | = 1). The multiset T (T ) of valid subtrees produced by our parsing for T contains all valid subtrees corresponding to all nodes of the final hierarchical parsing structure tagged with a phase label to distinguish between subtrees in different phases; that is, T (T ) comprises all < v(T i ), i > for all nodes v ∈ T i over all phases i (Step 18). Finally, we define the L1 vector image V (T ) of T to be the “characteristic vector” of the multi-set T (T ); in other words, V (T )[< t, i >] := number of times the < t, i > subtree-phase combination appears in T (T ). (We use the notation Vi (T ) to denote the restriction of V (T ) to only subtrees occurring at phase i.) A small example execution of the hierarchical tree parsing in our embedding algorithm is depicted pictorially in Figure 4. The L1 distance between the vector images of two trees  S and T is defined in the standard manner, that is, V (T ) − V (S)1 = x∈T (T )∪T (S) |V (T )[x] − V (S)[x]|. In the remainder of this section, we prove our main theorem on the near-linear time complexity of our L1 embedding algorithm and the logarithmic distortion bounds that our embedding guarantees for the tree-edit distance metric. 5 An

implicit assumption made in our running-time analysis of TREEEMBED (which is also present in the complexity analysis of CM-Group in Cormode and Muthukrishnan [2002]—see Theorem 4.1) is that the fingerprints produced by the naming function h() fit in a single memory word and, thus, can be manipulated in constant (i.e., O(1)) time. If that is not the case, then an additional multiplicative factor of O(log |T |) must be included in the running-time complexity to account for the length of such fingerprints (see Section 7.1). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.
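As a concrete illustration of such a naming function h(), the sketch below serializes a valid subtree into a parenthesized label string and fingerprints it with a Karp-Rabin-style polynomial hash modulo a large prime, so that identical subtrees always receive the same name while distinct subtrees collide only with small probability. This is our own minimal rendering, assuming the Node representation sketched earlier; the article's actual adaptation is the one described in Section 7.1.

```python
import random

# Illustrative Karp-Rabin-style naming function h() for valid subtrees.
# A Node is assumed to carry .label and an ordered .children list; a label
# of None models the unlabeled root of a T'[v, s] subtree.

_P = (1 << 61) - 1
_B = random.Random(42).randrange(2, _P)     # random base, fixed once per run

def serialize(node):
    # canonical parenthesized serialization of an ordered, labeled subtree
    inner = "".join(serialize(c) for c in node.children)
    return "(" + (node.label or "*") + inner + ")"

def h(node):
    # polynomial rolling fingerprint of the serialization, modulo a prime:
    # equal subtrees get equal names; distinct subtrees collide only with
    # probability O(length / P) over the random choice of the base
    fp = 0
    for ch in serialize(node):
        fp = (fp * _B + ord(ch)) % _P
    return fp
```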


Fig. 3. Our tree-embedding algorithm.

Fig. 4. Example of hierarchical tree parsing.
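With such names in hand, the vector image V (T ) is simply a sparse counter over (subtree name, phase) pairs, and the L1 distance of two images is a sum of absolute count differences, exactly as defined above. A minimal sketch of this bookkeeping follows; the 'phases' argument is an assumed stand-in for the hierarchy of contracted trees T i produced by the algorithm of Figure 3, yielding the multiset of subtree names for each phase.

```python
from collections import Counter

# V(T): sparse characteristic vector of the multiset T(T), keyed by
# (subtree-name, phase). 'phases' is assumed to be an iterable that yields,
# for phase i = 0, 1, ..., the list of names h(v(T)) of the nodes of T^i.

def vector_image(phases):
    V = Counter()
    for i, names in enumerate(phases):
        for name in names:
            V[(name, i)] += 1
    return V

def l1_distance(V_T, V_S):
    # ||V(T) - V(S)||_1 = sum over all coordinates x of |V(T)[x] - V(S)[x]|
    return sum(abs(V_T[x] - V_S[x]) for x in set(V_T) | set(V_S))
```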

THEOREM 4.2. The TREEEMBED algorithm constructs the vector image V (T ) of an input tree T in time O(|T | log∗ |T |); further, the vector V (T ) contains at most O(|T |) nonzero components. Finally, given two trees S and T with n = max{|S|, |T |}, we have d (S, T ) ≤ 5 · ||V (T ) − V (S)||1 = O(log2 n log∗ n) · d (S, T ).

It is important to note here that, for certain special cases (i.e., when T is a simple chain or a “star”), our TREEEMBED algorithm essentially degrades to


Fig. 5. Example of parsing steps for the special case of a full binary tree.

the string-edit distance embedding algorithm of Cormode and Muthukrishnan [2002]. This, of course, implies that, for such special cases, their even tighter O(log n log∗ n) bounds on the worst-case distance distortion are applicable. As another special-case example, Figure 5 depicts the initial steps in the parsing of a full binary tree T ; note that, after two contraction phases, our parsing essentially reduces a full binary tree of depth h to one of depth h − 1 (thus decreasing the size of the tree by a factor of about 1/2).

As a first step in the proof of Theorem 4.2, we demonstrate the following lemma, which bounds the number of parsing phases. The key here is to show that the number of tree nodes goes down by a constant factor during each contraction phase of our embedding algorithm (Steps 3–16).

LEMMA 4.3. The number of phases for our TREEEMBED algorithm on an input tree T is O(log |T |).

PROOF. We partition the node set of T into several subsets as follows. First, define

A(T ) = {v ∈ T : v is a nonroot node with degree 2 (i.e., with only one child) or v is a leaf child of a nonroot node of degree 2}, and

B(T ) = {v ∈ T : v is a node of degree ≥ 3 (i.e., with at least two children) or v is the root node of T }.

Clearly, A(T ) ∪ B(T ) contains all internal (i.e., nonleaf) nodes of T ; in particular, A(T ) contains all nodes appearing in (degree-2) chains in T (including potential leaf nodes at the end of such chains). Thus, the set of remaining nodes of T , say L(T ), comprises only leaf nodes of T which have at least one sibling or are children of the root. Let v be a leaf child of some node u, and let sv denote the maximal contiguous set of leaf children of u which contains v. We further partition the leftover set of leaf nodes L(T ) as follows: L1 (T ) = {v ∈ L(T ) : |sv | ≥ 2}, L2 (T ) = {v ∈ L(T ) : |sv | = 1 and v is the leftmost such child of its parent}, and L3 (T ) = L(T ) − L1 (T ) − L2 (T ) = {v ∈ L(T ) : |sv | = 1 and v is not the leftmost such child of its parent}.
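The partition just defined is straightforward to compute, and small examples can help when following the counting arguments below. The sketch here is our own illustration (reusing the Node representation assumed earlier) and returns the sets A, B, L1, L2, and L3 for a rooted, ordered tree.

```python
# Compute the node partition A(T), B(T), L1(T), L2(T), L3(T) of a rooted,
# ordered tree directly from the definitions above (illustrative only).

def all_nodes(root):
    out = [root]
    for c in root.children:
        out.extend(all_nodes(c))
    return out

def partition(root):
    A, B, L1, L2, L3 = set(), set(), set(), set(), set()
    parent = {root: None}
    for v in all_nodes(root):
        for c in v.children:
            parent[c] = v

    for v in all_nodes(root):
        p = parent[v]
        if p is None or len(v.children) >= 2:
            B.add(v)                              # the root, or >= two children
        elif len(v.children) == 1:
            A.add(v)                              # nonroot degree-2 (chain) node
        elif p is not root and len(p.children) == 1:
            A.add(v)                              # leaf ending a chain

    for v in all_nodes(root):
        # group the leaf children of v into maximal contiguous runs s_v
        runs, run = [], []
        for c in v.children:
            if not c.children:
                run.append(c)
            else:
                if run:
                    runs.append(run)
                run = []
        if run:
            runs.append(run)
        seen_lone = False
        for r in runs:
            for u in r:
                if u in A or u in B:
                    continue                      # already classified above
                if len(r) >= 2:
                    L1.add(u)                     # has an adjacent leaf sibling
                elif not seen_lone:
                    L2.add(u); seen_lone = True   # leftmost lone leaf child
                else:
                    L3.add(u)                     # remaining lone leaf children
    return A, B, L1, L2, L3
```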

For notational convenience, we also use A(T ) to denote the set cardinality |A(T )|, and similarly for other sets. We first prove the following ancillary claim. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


CLAIM 4.4. For any rooted tree T with at least two nodes, L3 (T ) ≤ L2 (T ) + A(T )/2 − 1.

PROOF. We prove this claim by induction on the number of nodes in T . Suppose T has only two nodes. Then, clearly, L3 (T ) = 0, L2 (T ) = 1, and A(T ) = 0. Thus, the claim is true for the base case. Suppose the claim is true for all rooted trees with fewer than n nodes. Let T have n nodes and let r be the root of T . First, consider the case when r has only one child node (say, s), and let T ′ be the subtree rooted at s. By induction, L3 (T ′ ) ≤ L2 (T ′ ) + A(T ′ )/2 − 1. Clearly, L3 (T ) = L3 (T ′ ). Is L2 (T ) equal to L2 (T ′ )? It is not hard to see that the only case when a node u can occur in L2 (T ′ ) but not in L2 (T ) is when s has only one child, u, which also happens to be a leaf. In this case, obviously, u ∈ L2 (T ′ ) (since it is the sole leaf child of the root), whereas in T u is the end-leaf of a chain, so it is counted in A(T ) and, thus, u ∉ L2 (T ). On the other hand, it is easy to see that both s and r are in A(T ) − A(T ′ ) in this case, so that L2 (T ) + A(T )/2 = L2 (T ′ ) + A(T ′ )/2. Thus, the claim is true in this case as well.

Now, consider the case when the root node r of T has at least two children. We construct several smaller subtrees, each of which is rooted at r (but contains only a subset of r’s descendants). Let u1 , . . . , uk be the leaf children of r such that sui = {ui } (i.e., have no leaf siblings); thus, by definition, u1 ∈ L2 (T ), whereas ui ∈ L3 (T ) for all i = 2, . . . , k. We define the subtrees T1 , . . . , Tk+1 as follows. For each i = 1, . . . , k + 1, Ti is the set of all descendants of r (including r itself) that lie to the right of leaf ui−1 and to the left of leaf ui (as special cases, T1 is the subtree to the left of u1 and Tk+1 is the subtree to the right of uk ). Note that T1 and Tk+1 may not contain any nodes (other than the root node r), but, by the definition of the ui ’s, all other Ti subtrees are guaranteed to contain at least one node other than r. Now, by induction, we have that L3 (Ti ) ≤ L2 (Ti ) + A(Ti )/2 − 1 for all subtrees Ti , except perhaps for T1 and Tk+1 (if they only comprise a sole root node, in which case, of course, the L2 , L3 , and A subsets above are all empty). Adding all these inequalities, we have

Σi L3 (Ti ) ≤ Σi L2 (Ti ) + Σi A(Ti )/2 − (k − 1),     (1)

where we only have k − 1 on the right-hand side since T1 and Tk+1 may not contribute a −1 to this summation.

Now, it is easy to see that, if u ∈ A(Ti ), then u ∈ A(T ) as well; thus, A(T ) = Σi A(Ti ). Suppose u ∈ L2 (Ti ), and let w denote the parent of u. Note that w cannot be the root node r. Indeed, suppose that w = r; then, since u ∉ {u1 , . . . , uk }, su contains a leaf node other than u which is also not in Ti (since u ∈ L2 (Ti )). But then, it must be the case that u is adjacent to one of the leaves u1 , . . . , uk , which is impossible; thus, w ≠ r, which, of course, implies that u ∈ L2 (T ) as well. Conversely, suppose that u ∈ L2 (T ); then, either u = u1 or the parent of u is in one of the subtrees Ti . In the latter case, u ∈ L2 (Ti ). Thus, L2 (T ) = Σi L2 (Ti ) + 1.


Finally, we can argue in a similar manner that, for each i = 1, . . . , k + 1, L3 (Ti ) ⊂ L3 (T ). Furthermore, if u ∈ L3 (T ), then either u ∈ {u2 , . . . , uk } or u ∈ L3 (Ti ) for some i. Thus, L3 (T ) = Σi L3 (Ti ) + k − 1. Putting everything together, we have

L3 (T ) = Σi L3 (Ti ) + k − 1 ≤ Σi L2 (Ti ) + Σi A(Ti )/2     (by Inequality (1))
       = L2 (T ) + A(T )/2 − 1.

This completes the inductive proof argument.

With Claim 4.4 in place, we now proceed to show that the number of nodes in the tree goes down by a constant factor after each contraction phase of our parsing. Recall that T i is the tree at the beginning of the (i + 1)th phase, and let L′ (T i+1 ) ⊆ L(T i+1 ) denote the subset of leaf nodes in L(T i+1 ) that are created by contracting a chain in T i . We claim that B(T i+1 ) ≤ B(T i ) and

B(T i+1 ) + A(T i+1 ) + L′ (T i+1 ) ≤ B(T i ) + A(T i )/2.     (2)

Indeed, it is easy to see that all nodes with degree at least three (i.e., ≥ two children) in T i+1 must have had degree at least three in T i as well; this obviously proves the first inequality. Furthermore, note that any node in B(T i+1 ) corresponds to a unique node in B(T i ). Now, consider a node u in A(T i+1 ) ∪ L′ (T i+1 ). There are two possible cases depending on how node u is formed. In the first case, u is formed by collapsing some degree-2 (i.e., chain) nodes (and, possibly, a chain-terminating leaf) in A(T i )—then, by virtue of the CM-Group procedure, u corresponds to at least two distinct nodes of A(T i ). In the second case, there is a node w ∈ B(T i ) and a leaf child of w that is collapsed into w to get u—then, u corresponds to a unique node of B(T i ). The second inequality follows easily from the above discussion.

During the (i + 1)th contraction phase, the number of leaves in L1 (T i ) is clearly reduced by at least one-half (again, due to the properties of CM-Group). Furthermore, note that all leaves in L2 (T i ) are merged into their parent nodes and, thus, disappear. Now, the leaves in L3 (T i ) do not change; so, we need to bound the size of this leaf-node set. By Claim 4.4, we have that L3 (T i ) ≤ L2 (T i ) + A(T i )/2—adding 2 · L3 (T i ) on both sides and multiplying across with 1/3, this inequality gives

L3 (T i ) ≤ L2 (T i )/3 + (2/3) · L3 (T i ) + A(T i )/6.

Thus, the number of leaf nodes in L(T i+1 ) − L′ (T i+1 ) can be upper-bounded as follows:

L(T i+1 ) − L′ (T i+1 ) ≤ L1 (T i )/2 + L2 (T i )/3 + (2/3) · L3 (T i ) + A(T i )/6 ≤ (2/3) · L(T i ) + A(T i )/6.


Combined with Inequality (2), this implies that the total number of nodes in T i+1 is

A(T i+1 ) + B(T i+1 ) + L(T i+1 ) ≤ A(T i )/2 + B(T i ) + (2/3) · L(T i ) + A(T i )/6 ≤ B(T i ) + (2/3) · (A(T i ) + L(T i )).

Now, observe that B(T i ) ≤ A(T i ) + L(T i ) (the number of nodes of degree more than two is at most the number of leaves in any tree)—the above inequality then gives

A(T i+1 ) + B(T i+1 ) + L(T i+1 ) ≤ (5/6) · B(T i ) + (2/3) · (A(T i ) + L(T i )) + (1/6) · B(T i ) ≤ (5/6) · (A(T i ) + B(T i ) + L(T i )).

Thus, when going from tree T i to T i+1 , the number of nodes goes down by a constant factor (i.e., |T i+1 | ≤ (5/6) · |T i |). This obviously implies that the number of parsing phases for our TREEEMBED algorithm is O(log |T |), and completes the proof.

The proof of Lemma 4.3 immediately implies that the total number of nodes in the entire hierarchical parsing structure for T is only O(|T |). Thus, the vector image V (T ) built by our algorithm is a very sparse vector. To see this, note that the number of all possible ordered, labeled trees of size at most n that can be built using the label alphabet σ is O((4|σ |)^n ) (see, e.g., Knuth [1973]); thus, by Lemma 4.3, the dimensionality needed for our vector image V () to capture input trees of size n is O((4|σ |)^n log n). However, for a given tree T , only O(|T |) of these dimensions can contain nonzero counts. Lemma 4.3, in conjunction with the fact that the CM-Group procedure runs in time O(k log∗ k) for a string of size k (Theorem 4.1), also implies that our TREEEMBED algorithm runs in O(|T | log∗ |T |) time on input T . The following two subsections establish the distance-distortion bounds stated in Theorem 4.2. An immediate implication of the above results is that we can use our embedding algorithm to compute the approximate (to within a guaranteed O(log2 n log∗ n) factor) tree-edit distance between T and S in O(n log∗ n) (i.e., near-linear) time. The time complexity of exact tree-edit distance computation is significantly higher: conventional tree-edit distance (without subtree moves) is solvable in O(|T ||S| d T d S ) time (where d T (d S ) is the depth of T (respectively, S)) [Apostolico and Galil 1997; Zhang and Shasha 1989], whereas in the presence of subtree moves the problem becomes NP-hard even for the simple case of flat strings [Shapira and Storer 2002].

4.4 Upper-Bound Proof

Suppose we are given a tree T with n nodes and let Δ denote the quantity log∗ n + 5. As a first step in our proof, we demonstrate that showing the upper-bound result in Theorem 4.2 can actually be reduced to a simpler problem, namely, that of bounding the L1 distance between the vector image of T and the vector image of a 2-tree forest created when removing a valid subtree from


T . More formally, consider a (valid) subtree of T of the form T ′ [v, s] for some contiguous subset of children s of v (recall that the root of T ′ [v, s] has no label). Let us delete T ′ [v, s] from T , and let T2 denote the resulting subtree; furthermore, let T1 denote the deleted subtree T ′ [v, s]. Thus, we have broken T into a 2-tree forest comprising T1 = T ′ [v, s] and T2 = T − T1 (see the leftmost portion of Figure 8 for an example). We now compare the following two vectors. The first vector V (T ) is obtained by applying our TREEEMBED parsing procedure to T . For the second vector, we apply TREEEMBED to each of the trees T1 and T2 individually, and then add the corresponding vectors V (T1 ) and V (T2 ) component-wise—call this vector V (T1 + T2 ) = V (T1 ) + V (T2 ). (Throughout this section, we use (T1 + T2 ) to denote the 2-tree forest composed of T1 and T2 .) Our goal is to prove the following theorem. THEOREM 4.5. The L1 distance between vectors V (T ) and V (T1 + T2 ) is at most O(log2 n log∗ n). Let us first see how this result directly implies the upper bound stated in Theorem 4.2. PROOF OF THE UPPER BOUND IN THEOREM 4.2. It is sufficient to consider the case when the tree-edit distance between S and T is 1 and show that, in this case, the L1 distance between V (S) and V (T ) is ≤ O(log2 n log∗ n). First, assume that T is obtained from S by deleting a leaf node v. Let the parent of v be w. Define s = {v}, and delete S ′ [w, s] from S. This splits S into T and S ′ [w, s]— call this S1 . Theorem 4.5 then implies that V (S) − V (T + S1 )1 = V (S) − (V (T ) + V (S1 ))1 ≤ O(log2 n log∗ n). But, it is easy to see that the vector V (S1 ) only has three nonzero components, all equal to 1; this is since S1 is basically a 2-node tree that is reduced to a single node after one contraction phase of TREEEMBED. Thus, V (S1 )1 = (V (T ) + V (S1 )) − V (T )1 ≤ 3. Then, a simple application of the triangle inequality for the L1 norm gives V (S) − V (T )1 ≤ O(log2 n log∗ n). Note that, since insertion of a leaf node is the inverse of a leafnode deletion, the same holds for this case as well. Now, let v be a node in S and s be a contiguous set of children of v. Suppose T is obtained from S by moving the subtree S ′ [v, s], that is, deleting this subtree and making it a child of another node x in S.6 Let S1 denote S ′ [v, s], and let S2 denote the tree obtained by deleting S1 from S. Theorem 4.5 implies that V (S) − V (S1 + S2 )1 ≤ O(log2 n log∗ n). Note, however, that we can also picture (S1 + S2 ) as the forest obtained by deleting S1 from T . Thus, V (T ) − V (S1 + S2 )1 is also ≤ O(log2 n log∗ n). Once again, the triangle inequality for L1 easily implies the result. Finally, suppose we delete a nonleaf node v from S. Let the parent of v be w. All children of v now become children of w. We can think of this process as follows. Let s be the children of v. First, we move S ′ [v, s] and make it a child of w. At this point, v is a leaf node, so we are just deleting a leaf node now. Thus, 6 This

is a slightly “generalized” subtree move, since it allows for a contiguous (sub)sequence of sibling subtrees to be moved in one step. However, it is easy to see that it can be simulated with only three simpler edit operations, namely, a node insertion, a single-subtree move, and a node deletion. Thus, our results trivially carry over to the case of “single-subtree move” edit operations. ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


the result for this case follows easily from the arguments above for deleting a leaf node and moving a subtree. As a consequence, it is sufficient to prove Theorem 4.5. Our proof proceeds along the following lines. We define an influence region for each tree T i in our hierarchical parsing (i = 0, . . . , O(log n))—the intuition here is that the influence region for T i captures the complete set of nodes in T i whose parsing could have been affected by the change (i.e., the splitting of T into (T1 + T2 )). Initially (i.e., tree T 0 ), this region is just the node v at which we deleted the T1 subtree. But, obviously, this region grows as we proceed to subsequent phases in our parsing. We then argue that, if we ignore this influence region in T i and the corresponding region in the parsing of the (T1 +T2 ) forest, then the resulting sets of valid subtrees look very similar (in any phase i). Thus, if we can bound the rate at which this influence region grows during our hierarchical parsing, we can also bound the L1 distance between the two resulting characteristic vectors. The key intuition behind bounding the size of the influence region is as follows: when we effect a change at some node v of T , nodes far away from v in the tree remain unaffected, in the sense that the subtree in which such nodes are grouped during the next phase of our hierarchical parsing remains unchanged. As we will see, this fact hinges on the properties of the CM-Group procedure used for grouping nodes during each phase of TREEEMBED (Theorem 4.1). The discussion of our proof in the remainder of this section is structured as follows. First, we formally define influence regions, giving the set of rules for “growing” such regions of nodes across consecutive phases of our parsing. Second, we demonstrate that, for any parsing phase i, if we ignore the influence regions in the current (i.e., phase-(i + 1)) trees produced by TREEEMBED on input T and (T1 + T2 ), then we can find a one-to-one, onto mapping between the nodes in the remaining portions of the current T and (T1 + T2 ) that pairs up identical valid subtrees. Third, we bound the size of the influence region during each phase of our parsing. Finally, we show that the upper bound on the L1 distance of V (T ) and V (T1 + T2 ) follows as a direct consequence of the above facts. We now proceed with the proof of Theorem 4.5. Define (T1 + T2 )i as the 2-tree forest corresponding to (T1 + T2 ) at the beginning of the (i + 1)th parsing phase. We say that a node x ∈ T i+1 contains a node x ′ ∈ T i if the set of nodes in T i which are merged to form x contains x ′ . As earlier, any node w in T i corresponds to a valid subtree w(T ) of T ; furthermore, it is easy to see that if w and w′ are two distinct nodes of T i , then the w(T ) and w′ (T ) subtrees are disjoint. (The same obviously holds for the parsing of each of T1 , T2 .) For each tree T i , we mark certain nodes; intuitively, this node-marking defines the influence region of T i mentioned above. Let M i be the set of marked nodes (i.e., influence region) in T i (see Figure 6(a) for an example). 
The generic structure of the influence region M i satisfies the following: (1) M i is a connected subtree of T i that always contains the node v (at which the T1 subtree was removed), that is, the node in T i which contains v (denoted by vi ) is always in M i ; (2) there is a center node ci ∈ M i , and M i may contain some ancestor nodes of ci —but all such ancestors (except perhaps for ci itself) must be of degree 2 only, and should form a connected path; and (3) M i may also contain some ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 6. (a) The subtree induced by the bold edges corresponds to the nodes in M i . (b) Node z becomes the center of N i .

descendants of the center node ci . Finally, certain (unmarked) nodes in T i − M i are identified as corner nodes—intuitively, these are nodes whose parsing will be affected when they are shrunk down to a leaf node. Once again, the key idea is that the influence region M i captures the set of those nodes in T i whose parsing in TREEEMBED may have been affected by the change we made at node v. Now, in the next phase, the changes in M i can potentially affect some more nodes. Thus, we now try to determine which nodes M i can affect; that is, assuming the change at v has influenced all nodes in M i , which are the nodes in T i whose parsing (during phase (i + 1)) can change as a result of this. To capture this newly affected set of nodes, we define an extended influence region N i in T i —this intuitively corresponds to the (worstcase) subset of nodes in T i whose parsing can potentially be affected by the changes in M i . First, add all nodes in M i to N i . We define the center node z of the extended influence region N i as follows. We say that a descendant node u of vi (which contains v) in T i is a removed descendant of vi if and only if its corresponding subtree u(T ) in the base tree T is entirely contained within the removed subtree T [v, s]. (Note that, initially, v0 = v is trivially a removed descendant of v0 .) Now, let w be the highest node in M i —clearly, w is an ancestor of the current center node ci as well as the vi node in T i . If all the descendants of w are either in M i or are removed descendants of vi , then define the center z to be the parent of node w, and add z to N i (see Figure 6(b)); otherwise, define the center z of N i to be same as ci . The idea here is that the grouping of w’s parent in the next phase can change only if the entire subtree under w has been affected by the removal of the T ′ [v, s] subtree. Otherwise, if there exist nodes under w in T i whose parsing remains unchanged and that have not been deleted by the subtree removal, then the mere existence of these nodes in T i means that it is impossible for TREEEMBED to group w’s parent in a different manner during the next phase of the (T1 + T2 ) parsing in any case. Once the center node z of N i has been fixed, we also add nodes to N i according to the following set of rules (see Figures 7(a) and (b) for examples). ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


Fig. 7. (a) The nodes in dotted circles get added to N i due to Rules (i), (ii), and (iii). (b) The nodes in the dotted circle get added to N i due to Rule (iv)—note that all descendants of the center z which are not descendants of u are in M i . (c) Node u moves up to z, turning nodes a and b into corner nodes.

(i) Suppose u is a leaf child of the (new) center z or the vi node in T i ; furthermore, assume there is some sibling u′ of u such that the following conditions are satisfied: u′ ∈ M i or u′ is a corner leaf node, the set of nodes s(u, u′ ) between u and u′ are leaves, and |s(u, u′ )| ≤ . Then, add u to N i . (In particular, note that any leaf child of z which is a corner node gets added to N i .) (ii) Let u be the leftmost lone leaf child of the center z which is not already in M i (if such a child exists); then, add u to N i . Similarly, for the vi node in T i , let u be a leaf child of vi such that one of the following conditions is satisfied: (a) u is the leftmost lone leaf child of vi when considering only the removed descendants of vi ; or (b) u is the leftmost lone leaf child of vi when ignoring all removed descendants of vi . Then, add u to N i. (iii) Let w be the highest node in M i ∪ {z} (so it is an ancestor of the center node z). Let u be an ancestor of w. Suppose it is the case that all nodes between u and w, except perhaps w, have degree 2, and the length of the path joining u and w is at most ; then, add u to N i . (iv) Suppose there is a child u of the center z or the vi node in T i such that one of the following conditions is satisfied: (a) u is not a removed descendant of vi and all descendants of all siblings of u (other than u itself) are either already in M i or are removed descendants of vi ; or (b) u is a removed descendant of vi (and, hence, a child of vi ) and all removed descendants of vi which are not descendants of u are in M i . Then, let u′ be the lowest descendant of u which is in M i . If u′′ is any descendant of u′ such that the path joining them contains degree-2 nodes only (including the end-points), and has length at most , then add u′′ to N i . Let us briefly describe why we need these four rules. We basically want to make sure that we include all those nodes in N i whose parsing can potentially ACM Transactions on Database Systems, Vol. 30, No. 1, March 2005.


be affected if we delete or modify the nodes in M^i (given, of course, the removal of the T′[v, s] subtree). The first three rules, in conjunction with the properties of our TREEEMBED parsing, are easily seen to capture this fact. The last rule is a little more subtle. Suppose u is a child of z (so that we are in clause (a) of Rule (iv)); furthermore, assume that all descendants of z except perhaps those of u are either already in M^i or have been deleted with the removal of T′[v, s]. Remember that all nodes in M^i have been modified due to the change effected at v, so they may not be present at all in the corresponding picture for (T1 + T2) (i.e., the (T1 + T2)^i forest). But, if we just ignore M^i and the removed descendants of v^i, then z becomes a node of degree 2 only, which would obviously affect how u and its degree-2 descendants are parsed in (T1 + T2)^i (compared to their parsing in T^i). Rule (iv) is designed to capture exactly such scenarios; in particular, note that clauses (a) and (b) in the rule are meant to capture the potential creation of such degree-2 chains in the remainder subtree T2^i and the deleted subtree T1^i, respectively.

We now consider the rule for marking corner nodes in T^i. Once again, the intuition is that certain (unaffected) nodes in T^i − M^i (actually, in T^i − N^i) are marked as corner nodes so that we can "remember" that their parsing will be affected when they are shrunk down to a leaf. Suppose the center node z has at least two children, and a leftmost lone leaf child u (note that, by Rule (ii), u ∈ N^i). If any of the two immediate siblings of u are not in N^i, then we mark them as corner nodes (see Figure 7(c)). The key observation here is that, when parsing T^i, u is going to be merged into z and disappear; however, we need to somehow "remember" that a (potentially) affected node u was there, since its existence could affect the parsing of its sibling nodes when they are shrunk down to leaves. Marking u's immediate siblings in T^i as corner nodes essentially achieves this effect.

Having described the (worst-case) extended influence region N^i in T^i, let us now define M^{i+1}, that is, the influence region at the next level of our parsing of T. M^{i+1} is precisely the set of those nodes in T^{i+1} which contain a node of N^i. The center of M^{i+1} is the node which contains the center node z of N^i; furthermore, any node in T^{i+1} which contains a corner node is again marked as a corner node.

Initially, define M^0 = {v} (and, obviously, v^0 = c^0 = v). Furthermore, if v has a child node immediately on the left (right) of the removed child subsequence s, then that node as well as the leftmost (respectively, rightmost) node in s are marked as corner nodes. The reason, of course, is that these ≤ 4 nodes may be parsed in a different manner when they are shrunk down to leaves during the parsing of T1 and T2.

Based on the above set of rules, it is easy to see that M^i and N^i are always connected subtrees of T^i. It is also important to note that the extended influence region N^i is defined in such a manner that the parsing of all nodes in T^i − N^i cannot be affected by the changes in M^i. This fact should become clear as we proceed with the details of the proofs in the remainder of this section.

Fig. 8. Example of TREEEMBED parsing phases for T and (T1 + T2) in the case of a full binary tree, highlighting the influence regions M^i in T^i and the corresponding P^i regions in (T1 + T2)^i ("o" denotes an unlabeled node).

Example 4.6. Figure 8 depicts the first three phases of a simple example parsing for T and (T1 + T2), in the case of a 4-level full binary tree T that is split by removing the right subtree of the root (i.e., T1 = T′[x3, {x6, x7}], T2 = T − T1). We use subscripted x's and y's to label the nodes in T^i and (T1 + T2)^i to emphasize the fact that these tree nodes are parsed independently by TREEEMBED; furthermore, we employ the subscripts to capture the original subtrees of T and (T1 + T2) represented by nodes in later phases of our parsing. Of course, it should be clear that x and y nodes with identical subscripts refer to identical (valid) subtrees of the original tree T; for instance, both x_{4,8,9} ∈ T^2 and y_{4,8,9} ∈ T2^2 represent the same subtree T[x4, {x8, x9}] = {x4, x8, x9} of T.

As depicted in Figure 8, the initial influence region of T is simply M^0 = {x3} (with v^0 = c^0 = x3). Since, clearly, all descendants of x3 are removed descendants of v^0, the center z for the extended influence region N^0 moves up to the parent node x1 of x3 (and none of our other rules are applicable); thus, N^0 = {x1, x3} and, obviously, M^1 = {x1, x3}. This is crucial since (as shown in Figure 8), due to the removal of T1, nodes y1 and y3 are processed in a very different manner in the remainder subtree T2^0 (i.e., y3 is merged up into y1 as its leftmost lone leaf child). Now, for T^1, none of our rules for extending the influence region apply and, consequently, N^1 = M^2 = {x1, x3}.

The key thing to note here is that, for each parsing phase i, ignoring the nodes in the influence region M^i (and the "corresponding" nodes in (T1 + T2)^i), the remaining nodes of T^i and (T1 + T2)^i have been parsed in an identical manner by TREEEMBED (and correspond to an identical subset of valid subtrees in T); in other words, their corresponding characteristic vectors in our embedding are exactly the same. We now proceed to formalize these observations.

Given the influence region M^i of T^i, we define a corresponding node set, P^i, in the (T1 + T2)^i forest. In what follows, we prove that the nodes in T^i − M^i and (T1 + T2)^i − P^i can be matched in some manner, so that each pair of matched nodes corresponds to identical valid subtrees in T and (T1 + T2),

respectively. The node set P^i in (T1 + T2)^i is defined as follows (see Figure 8 for examples). P^i always contains the root node of T1^i. Furthermore, a node u ∈ (T1 + T2)^i is in P^i if and only if there exists a node u′ ∈ M^i such that the intersection u(T1 + T2) ∩ u′(T) is nonempty (as expected, u(T1 + T2) denotes the valid subtree corresponding to u in (T1 + T2)). We demonstrate that our solution always maintains the following invariant.

INVARIANT 4.7. Given any node x ∈ T^i − M^i, there exists a node y = f(x) in (T1 + T2)^i − P^i such that x(T) and y(T1 + T2) are identical valid subtrees on the exact same subset of nodes in the original tree T. Conversely, given a node y ∈ (T1 + T2)^i − P^i, there exists a node x ∈ T^i − M^i such that x(T) = y(T1 + T2).

Fig. 9. f maps from T^i − M^i to (T1 + T2)^i − P^i.

Thus, there always exists a one-to-one, onto mapping f from T^i − M^i to (T1 + T2)^i − P^i (Figure 9). In other words, if we ignore M^i and P^i from T^i and (T1 + T2)^i (respectively), then the two remaining forests of valid subtrees in this phase are identical.

Example 4.8. Continuing with our binary-tree parsing example in Figure 8, it is easy to see that, in this case, the mapping f : T^i − M^i −→ (T1 + T2)^i − P^i simply maps every x node in T^i − M^i to the y node in (T1 + T2)^i − P^i with the same subscript, which obviously corresponds to exactly the same valid subtree of T; for instance, y_{10,11} = f(x_{10,11}), and both nodes correspond to the same valid subtree T′[x5, {x10, x11}]. Thus, the collections of valid subtrees for T^i − M^i and (T1 + T2)^i − P^i are identical (i.e., the L1 distance of their corresponding characteristic vectors is zero); this implies that, for example, the contribution of T^1 and (T1 + T2)^1 to the difference of the embedding vectors V(T) and V(T1 + T2) is upper-bounded by |M^1| = 2.
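To make the characteristic-vector bookkeeping concrete, the short Python sketch below (our own illustration; the node keys and helper names are hypothetical, and the TREEEMBED parsing itself is not implemented here) encodes each phase-i forest as a multiset of valid-subtree keys and computes the L1 distance between the resulting sparse vectors. As in Example 4.8, nodes outside the influence regions match one-to-one, so only the region nodes can contribute to the distance.

from collections import Counter

def characteristic_vector(parse_nodes):
    # Each parse node is represented by a hashable key that canonically
    # encodes the valid subtree of the original tree it stands for.
    return Counter(parse_nodes)

def l1_distance(v, w):
    # L1 distance between two sparse characteristic vectors.
    return sum(abs(v[k] - w[k]) for k in set(v) | set(w))

# Hypothetical phase-i node multisets for T^i and (T1 + T2)^i: keys "A", "B",
# "C" stand for valid subtrees parsed identically on both sides, while "m1",
# "m2" lie in M^i and "p1" lies in P^i.
T_i   = ["A", "B", "C", "C", "m1", "m2"]
T12_i = ["A", "B", "C", "C", "p1"]

diff = l1_distance(characteristic_vector(T_i), characteristic_vector(T12_i))
print(diff)  # prints 3, which is at most |M^i| + |P^i| = 3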

Clearly, Invariant 4.7 is true in the beginning (i.e., M^0 = {v}, P^0 = {v, root(T1)}). Suppose our invariant remains true for T^i and (T1 + T2)^i. We now need to prove it for T^{i+1} and (T1 + T2)^{i+1}. As previously, let N^i ⊇ M^i be the extended influence region in T^i. Fix a node w in T^i − N^i, and let w′ be the corresponding node in (T1 + T2)^i − P^i (i.e., w′ = f(w)). Suppose w is contained in node q ∈ T^{i+1} and w′ is contained in node q′ ∈ (T1 + T2)^{i+1}.

LEMMA 4.9. Given a node w in T^i − N^i, let q, q′ be as defined above. If q(T) and q′(T1 + T2) are identical subtrees for any node w ∈ T^i − N^i, then Invariant 4.7 holds for T^{i+1} and (T1 + T2)^{i+1} as well.


PROOF. We have to demonstrate the following facts. If x is a node in T^{i+1} − M^{i+1}, then there exists a node y ∈ (T1 + T2)^{i+1} − P^{i+1} such that y(T1 + T2) = x(T). Conversely, given a node y ∈ (T1 + T2)^{i+1} − P^{i+1}, there is a node x ∈ T^{i+1} − M^{i+1} such that x(T) = y(T1 + T2).

Suppose the condition in the lemma holds. Let x be a node in T^{i+1} − M^{i+1}. Let x′ be a node in T^i such that x contains x′. Clearly, x′ ∉ N^i, otherwise x would be in M^{i+1}. Let y′ = f(x′), and let y be the node in (T1 + T2)^{i+1} which contains y′. By the hypothesis of the lemma, x(T) and y(T1 + T2) are identical subtrees. It remains to check that y ∉ P^{i+1}. Since y(T1 + T2) = x(T), y(T1 + T2) is disjoint from z(T) for any z ∈ T^{i+1}, z ≠ x. By the definition of the P^i node sets, since x ∉ M^{i+1}, we have that y ∈ (T1 + T2)^{i+1} − P^{i+1}.

Let us prove the converse now. Suppose y ∈ (T1 + T2)^{i+1} − P^{i+1}. Let y′ be a node in (T1 + T2)^i such that y contains y′. If y′ ∈ P^i, then (by definition) there exists a node x′ ∈ M^i such that x′(T) ∩ y′(T1 + T2) ≠ ∅. Let x be the node in T^{i+1} which contains x′. Since x′ ∈ N^i, x ∈ M^{i+1}. Now, x(T) ∩ y(T1 + T2) ⊇ x′(T) ∩ y′(T1 + T2) ≠ ∅. But then y should be in P^{i+1}, a contradiction. Therefore, y′ ∉ P^i. By the invariant for T^i, there is a node x′ ∈ T^i − M^i such that y′ = f(x′). Let x be the node in T^{i+1} containing x′. Again, if x′ ∈ N^i, then x ∈ M^{i+1}. But then x(T) ∩ y(T1 + T2) ⊇ x′(T) ∩ y′(T1 + T2), which is nonempty because x′(T) = y′(T1 + T2). This would imply that y ∈ P^{i+1}. So, x′ ∉ N^i. But then, by the hypothesis of the lemma, x(T) = y(T1 + T2). Further, x cannot be in M^{i+1}, otherwise y would be in P^{i+1}. Thus, the lemma is true.

It is, therefore, sufficient to prove that, for any pair of nodes w ∈ T^i − N^i, w′ = f(w) ∈ (T1 + T2)^i − P^i, the corresponding encompassing nodes q ∈ T^{i+1} and q′ ∈ (T1 + T2)^{i+1} map to identical valid subtrees, that is, q(T) = q′(T1 + T2). This is what we seek to do next. Our proof uses a detailed, case-by-case analysis of how node w gets parsed in T^i. For each case, we demonstrate that w′ will also get parsed in exactly the same manner in the forest (T1 + T2)^i. In the interest of space and continuity, we defer the details of this proof to the Appendix.

Thus, we have established the fact that, if we look at the vectors V(T) and V(T1 + T2), the nodes corresponding to phase i of V(T) which are not present in V(T1 + T2) are guaranteed to be a subset of M^i. Our next step is to bound the size of M^i.

LEMMA 4.10. The influence region M^i for tree T^i consists of at most O(i log* n) nodes.

PROOF. Note that, during each parsing phase, Rule (iii) adds at most  nodes of degree at most 2 to the extended influence region N^i. It is not difficult to see that Rule (iv) also adds at most 4 nodes of degree at most 2 to N^i during each phase; indeed, note that, for instance, there is at most one child node u of z which is not in M^i and satisfies one of the clauses of Rule (iv). So, adding over the first i stages of our algorithm, the number of such nodes in M^i can be at most O(i log* n). Thus, we only need to bound the number of nodes that get added to the influence region due to Rules (i) and (ii).

We now want to count the number of leaf children of the center node c^i which are in M^i. Let k_i be the number of children of c^i which become leaves for the first time in T^i and are marked as corner nodes. Let C^i be the nodes in M^i which were added as the leaf children of the center node of T^{i′}, for some i′ < i. Then, we claim that C^i can be partitioned into at most 1 + Σ_{j=1}^{i−1} k_j contiguous sets such that each set has at most 4 elements.

We prove this by induction on i. So, suppose it is true for T^i. Consider such a contiguous set of leaves in C^i, call it C_1^i, where |C_1^i| ≤ 4. We may add up to  consecutive leaf children of c^i on either side of C_1^i to the extended influence region N^i. Thus, this set may grow to a size of 6 contiguous leaves. But when we parse this set (using CM-Group), we reduce its size by at least half. Thus, this set will now contain at most 3 leaves (which is at most 4). Therefore, each of the 1 + Σ_{j=1}^{i−1} k_j contiguous sets in C^i corresponds to a contiguous set in T^{i+1} of size at most 4. Now, we may add other leaf children of c^i to N^i. This can happen only if a corner node becomes a leaf. In this case, at most  consecutive leaves on either side of this node are added to N^i (by Rule (i)); thus, we may add k_i more such sets of consecutive leaves to N^i. This completes our inductive argument.

But note that, in any phase, at most two new corner nodes (i.e., the immediate siblings of the center node's leftmost lone leaf child) can be added. (And, of course, we also start out with at most four nodes marked as corners inside and next to the removed child subsequence s.) So, Σ_{j=1}^{i} k_j ≤ 2i + 2. This shows that the number of nodes in C^i is O(i log* n). The contribution toward M^i of the leaf children of the v^i node can also be upper bounded by O(i log* n) using a very similar argument. This completes the proof.

We now need to bound the nodes in (T1 + T2)^i which are not in T^i. But this can be done in an exactly analogous manner if we switch the roles of T and T1 + T2 in the proofs above. Thus, we can define a subset Q^i of (T1 + T2)^i and a one-to-one, onto mapping g from (T1 + T2)^i − Q^i to a subset of T^i such that g(w)(T) = w(T1 + T2) for every w ∈ (T1 + T2)^i − Q^i. Furthermore, we can show in a similar manner that |Q^i| ≤ O(i log* n). We are now ready to complete the proof of Theorem 4.5.

PROOF OF THEOREM 4.5. Fix a phase i. Consider those subtrees t such that Vi(T)[<t, i>] ≥ Vi(T1 + T2)[<t, i>]. In other words, t appears more frequently in the parsed tree T^i than in (T1 + T2)^i. Let the set of such subtrees be denoted by S. We first observe that

|M^i| ≥ Σ_{t∈S} ( Vi(T)[<t, i>] − Vi(T1 + T2)[<t, i>] ).

Indeed, consider a tree t ∈ S. Let V1 be the set of vertices u in T^i such that u(T) = t. Similarly, define the set V2 in (T1 + T2)^i. So, |V1| − |V2| = Vi(T)[<t, i>] − Vi(T1 + T2)[<t, i>]. Now, the function f must map a vertex in V1 − M^i to a vertex in V2. Since f is one-to-one, V1 − M^i can have at most |V2| nodes. In other words, M^i must contain at least |V1| − |V2| nodes from V1. Adding this up for all such subtrees in S gives us the inequality above.


We can write a similar inequality for Q^i. Adding these up, we get

|M^i| + |Q^i| ≥ Σ_t | Vi(T)[<t, i>] − Vi(T1 + T2)[<t, i>] |,

where the summation is over all subtrees t. Adding over all parsing phases i, we have

||V(T) − V(T1 + T2)||_1 ≤ Σ_{i=1}^{O(log n)} O(i log* n) = O(log^2 n log* n).

This completes our proof argument.

4.5 Lower-Bound Proof

Our proof follows along the lower-bound proof of Cormode and Muthukrishnan [2002], in that it does not make use of any special properties of our hierarchical tree parsing; instead, we only assume that the parsing structure built on top of the data tree is of bounded degree k (in our case, of course, k = 3). The idea is then to show how, given two data trees S and T, we can use the "credit" from the L1 difference of their vector embeddings ||V(T) − V(S)||_1 to transform S into T. As in Cormode and Muthukrishnan [2002], our proof is constructive and shows how the overall parsing structure for S (including S itself at the leaves) can be transformed into that for T; the transformation is performed level-by-level in a bottom-up fashion (starting from the leaves of the parsing structure). (The distance-distortion lower bound for our embedding is an immediate consequence of Lemma 4.11 with k = 3. It is probably worth noting at this point that the subtree-move operation is needed only to establish the distortion lower-bound result in this section; that is, the upper bound shown in Section 4.1 holds for the standard tree-edit distance metric as well.)

LEMMA 4.11. Assuming a hierarchical parsing structure with degree at most k (k ≥ 2), the overall parsing structure for tree S can be transformed into exactly that of tree T with at most (2k − 1)·||V(T) − V(S)||_1 tree-edit operations (node inserts, deletes, relabels, and subtree moves).

PROOF. As in Cormode and Muthukrishnan [2002], we first perform a top-down pass over the parsing structure of S, marking all nodes x whose subgraph appears in both parse-tree structures, making sure that the number of marked x nodes at level (i.e., phase) i of the parse tree does not exceed Vi(T)[x] (we use x instead of v(x) to also denote the valid subtree corresponding to x in order to simplify the notation). Descendants of marked nodes are also marked. Marked nodes are "protected" during the parse-tree transformation process described below, in the sense that we do not allow an edit operation to split a marked node.

We proceed bottom-up over the parsing structure for S in O(log n) rounds (where n = max{|S|, |T|}), ensuring that after the end of round i we have created an Si such that ||Vi(T) − Vi(Si)||_1 = 0. The base case (i.e., level 0) deals with simple node labels and creates S0 in a fairly straightforward way: for each label a, if V0(S)[a] > V0(T)[a], then we delete (V0(S)[a] − V0(T)[a]) unmarked copies of a; otherwise, if V0(S)[a] < V0(T)[a], then we add (V0(T)[a] − V0(S)[a]) leaf nodes labeled a at some location of S. In each case, we perform |V0(S)[a] − V0(T)[a]| edit operations, which is exactly the contribution of label a to ||V0(T) − V0(S)||_1. It is easy to see that, at the end of the above process, we have ||V0(T) − V0(S0)||_1 = 0.

Inductively, assume that, when we start the transformation at level i, we have enough nodes at level i − 1; that is, ||V_{i−1}(T) − V_{i−1}(S_{i−1})||_1 = 0. We show how to create Si using at most (2k − 1)·||Vi(T) − Vi(Si)||_1 subtree-move operations. Consider a node x at level i (again, to simplify the notation, we also use x to denote the corresponding valid subtree). If Vi(S)[x] > Vi(T)[x], then we have exactly Vi(T)[x] marked x nodes at level i of S's parse tree that we will not alter; the remaining copies will be split to form other level-i nodes as described next. If Vi(S)[x] < Vi(T)[x], then we need to build an extra (Vi(T)[x] − Vi(S)[x]) copies of the x node at level i. We demonstrate how each such copy can be built by using ≤ (2k − 1) subtree-move operations in order to bring together ≤ k level-(i − 1) nodes to form x (note that the existence of these level-(i − 1) nodes is guaranteed by the fact that ||V_{i−1}(T) − V_{i−1}(S_{i−1})||_1 = 0). Since (Vi(T)[x] − Vi(S)[x]) is exactly the contribution of x to ||Vi(T) − Vi(Si)||_1, the overall transformation for level i requires at most (2k − 1)·||Vi(T) − Vi(Si)||_1 edit operations.

Fig. 10. Forming a level-i node x.

To see how we form the x node at level i, note that, based on our embedding algorithm, there are three distinct cases for the formation of x from level-(i − 1) nodes, as depicted in Figures 10(a)–10(c). In case (a), x is formed by "folding" the (no-siblings) leftmost leaf child v2 of a node v1 into its parent; we can create the scenario depicted in Figure 10(a) easily with two subtree moves: one to remove any potential subtree rooted at the level-(i − 1) node v2 (we can place it under v2's original parent at the level-(i − 1) tree), and one to move the (leaf) v2 under the v1 node. Similarly, for the scenarios depicted in cases (b) and (c), we basically need at most k subtree moves to turn the nodes involved into leaves, and at most k − 1 additional moves to move these leaves into the right formation around one of these ≤ k nodes. Thus, we can create each copy of x with ≤ (2k − 1) subtree-move operations. At the end of this process, we have ||Vi(T) − Vi(Si)||_1 = 0. Note that we do not care where in the level-i tree we create the x node; the exact placement will be taken care of at higher levels of the parsing structure. This completes the proof.
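Combining the two directions gives the distortion guarantee behind the approximate distance computations that follow: Lemma 4.11 with k = 3 (so 2k − 1 = 5) shows that d(S, T) ≤ 5·||V(S) − V(T)||_1, while the upper-bound argument of Section 4.4, applied once per edit operation, bounds ||V(S) − V(T)||_1 from above. As a summary sketch (our own restatement, assuming the per-operation upper bound of Theorem 4.5 for n-node trees):

d(S, T) / 5 ≤ ||V(S) − V(T)||_1 ≤ O(log^2 n log* n) · d(S, T).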


5. SKETCHING A MASSIVE, STREAMING XML DATA TREE

In this section, we describe how our tree-edit distance embedding algorithm can be used to obtain a small, pseudorandom sketch synopsis of a massive XML data tree in the streaming model. This sketch synopsis requires only small (logarithmic) space, and it can be used as a much smaller surrogate for the entire data tree in approximate tree-edit distance computations, with error guarantees that follow from the distortion bounds of our embedding. Most importantly, as we show in this section, the properties of our embedding algorithm are the key that allows us to build this sketch synopsis in small space as the nodes of the tree are streaming by, without ever backtracking on the data.

More specifically, consider the problem of embedding a data tree T of size n into a vector space, but this time assume that T is truly massive (i.e., n far exceeds the amount of available storage). Instead, we assume that we see the nodes of T as a continuous data stream in some a priori determined order. In the theorem below, we assume that the nodes of T arrive in the order of a preorder (i.e., depth-first and left-to-right) traversal of T. (Note, for example, that this is exactly the ordering of XML elements produced by the event-based SAX parsing interface (sax.sourceforge.net/).) The theorem demonstrates that the vector V(T) constructed for T by our L1 embedding algorithm can then be constructed in space O(d log^2 n log* n), where d denotes the depth of T. The sketch of T is essentially a sketch of the V(T) vector (denoted by sketch(V(T))) that can be used for L1 distance calculations in the embedding vector space. Such an L1 sketch of V(T) can be obtained (in small space) using the 1-stable sketching algorithms of Indyk [2000] (see Theorem 2.2).

THEOREM 5.1. A sketch sketch(V(T)) to allow approximate tree-edit distance computations can be computed over the stream of nodes in the preorder traversal of an n-node XML data tree T using O(d log^2 n log* n) space and O(log d log^2 n (log* n)^2) time per node, where d denotes the depth of T. Then, assuming sketch vectors of size O(log(1/δ)) and for an appropriate combining function f(), f(sketch(V(S)), sketch(V(T))) gives an estimate of the tree-edit distance d(S, T) to within a relative error of O(log^2 n log* n) with probability of at least 1 − δ.

The proof of Theorem 5.1 hinges on the fact that, based on our proof in Section 4.4, given a node v on a root-to-leaf path of T and for each of the O(log n) levels of the parsing structure above v, we only need to retain a local neighborhood (i.e., influence region) of nodes of size at most O(log n log* n) to determine the effect of adding an incoming subtree under T. The O(d) multiplicative factor is needed since, as the tree is streaming in preorder, we do not really know where a new node will attach itself to T; thus, we have to maintain O(d) such influence regions. Given that most real-life XML data trees are reasonably "bushy," we expect that, typically, d