Internet Research Volume 14, Number 5, 2004
ISSN 1066-2243
Papers from the Fourth International Network Conference (INC 2004), 6-9 July 2004, Plymouth, United Kingdom. Guest Editor: Dr Steven Furnell
Contents
334 Access this journal online
335 Abstracts & keywords
337 Guest editorial
339 Efficient resource discovery in grids and P2P networks
    Nick Antonopoulos and James Salter
347 Mechanisms for controlling access in the global grid environment
    George Angelis, Stefanos Gritzalis and Costas Lambrinoudakis
353 Non-business use of the WWW in three Western Australian organisations
    Craig Valli
360 PIDS: a privacy intrusion detection system
    Hein S. Venter, Martin S. Olivier and Jan H.P. Eloff
366 Web services: measuring practitioner attitude
    P. Joshi, H. Singh and A.D. Phippen
372 Using the Web Graph to influence application behaviour
    Michael P. Evans and Andrew Walker
379 Multi-dimensional-personalisation for location and interest-based recommendation
    Steffen W. Schilke, Udo Bleimann, Steven M. Furnell and Andrew D. Phippen
386 Awards for Excellence
387 Note from the publisher
Abstracts & keywords
Efficient resource discovery in grids and P2P networks Nick Antonopoulos and James Salter Keywords Computer networks, Resources, Computer software Presents a new model for resource discovery in grids and peer-to-peer networks designed to utilise efficiently small numbers of messages for query processing and building of the network. Outlines and evaluates the model through a theoretical comparison with other resource discovery systems and a mathematical analysis of the number of messages utilised in contrast with Chord, a distributed hash table. Shows that through careful setting of parameter values the model is able to provide responses to queries and node addition in fewer messages than Chord. The model is shown to have significant benefits over other peer-to-peer networks reviewed. Uses a case study to show the applicability of the model as a methodology for building resource discovery systems in peer-to-peer networks using different underlying structures. Shows a promising new method of creating a resource discovery system by building a timeline structure on demand, which will be of interest to both researchers and system implementers in the fields of grid computing, peer-to-peer networks and distributed resource discovery in general.
Mechanisms for controlling access in the global grid environment George Angelis, Stefanos Gritzalis and Costas Lambrinoudakis Keywords Computer networks, Control systems, Message authentication, Data security The Grid is widely seen as the next generation Internet. Aims to share dynamic collections of individuals, institutions and resources by providing consistent, easy and inexpensive access to high-end computational capabilities. Studies Grid security and specifically users' access control. It has been proved that the viability of these heterogeneous environments is highly dependent on their security design. Solutions trying to address all aspects of security were proposed by most existing Grid projects and collaborations; however, the results were not always satisfactory. Reviews some of the most widely-accepted security solutions, and collects the most efficient. Emphasizes access control procedures and the solutions addressing authentication and authorization issues. Identifies the most successful security schemes implemented and illustrates their effectiveness. Collects these mechanisms to form the backbone of a security mechanism, addressing authentication and authorization Grid-specific problems. The proposed schemes can constitute the backbone of an effective Grid security architecture.
Non-business use of the WWW in three Western Australian organisations Craig Valli Keywords Worldwide web, User studies, Cost accounting, Australia This paper is an outline of findings from a research project investigating the non-business use of the World Wide Web in organisations. The study uncovered high non-business usage in the selected organisations. Pornography and other traditionally identified risks were found to be largely non-issues. MP3 and other streaming media and potential copyright infringement were found to be problematic. All organisations had end-users displaying behaviours indicating significant, deliberate misuse that often used a variety of covert techniques to hide their actions.
PIDS: a privacy intrusion detection system Hein S. Venter, Martin S. Olivier and Jan H.P. Eloff Keywords Computer networks, Privacy, Safety devices It is well known that the primary threat of misuse of private data about individuals comes from within the organisation; proposes a system that uses intrusion detection system (IDS) technologies to help safeguard such private information. Current IDSs attempt to detect intrusions on a low level, whereas the proposed privacy IDS (PIDS) attempts to detect intrusions on a higher level. Contains information about information privacy and privacy-enhancing technologies, the role that a current IDS could play in a privacy system, and a framework for a privacy IDS. The system works by identifying anomalous behaviour and reacts by throttling access to the data
and/or issuing reports. It is assumed that the private information is stored in a central networked repository. Uses the proposed PIDS on the border between this repository and the rest of the organisation to identify attempts to misuse such information. A practical prototype of the system needs to be implemented in order to determine and test the practical feasibility of the system. Provides a source of information and guidelines on how to implement a privacy IDS based on existing IDSs.
Web services: measuring practitioner attitude P. Joshi, H. Singh and A.D. Phippen Keywords Internet, Worldwide web, Servicing, Function evaluation Distributed computing architecture has been around for a while, but not all of its benefits could be leveraged due to issues such as inter-operability, industry standards and cost efficiency that could provide agility and transparency to business process integration. Web services offer a cross-platform solution that provides a wrapper around any business object and exposes it over the Internet as a service. Web services typically work outside of private networks, offering developers a non-proprietary route to their solutions. The growth of this technology is imminent; however, there are various factors that could impact its adoption rate. This paper provides an in-depth analysis of various factors that could affect the adoption rate of this new technology by the industry. Various advantages, pitfalls and future implications of this technology are considered with reference to a practitioner survey conducted to establish the main concerns affecting the adoption rate of Web services.
Using the Web Graph to influence application behaviour Michael P. Evans and Andrew Walker Keywords Worldwide web, Computer software, Graphical programming The Web’s link structure (termed the Web Graph) is a richly connected set of Web pages. Current
applications use this graph for indexing and information retrieval purposes. In contrast the relationship between Web Graph and application is reversed by letting the structure of the Web Graph influence the behaviour of an application. Presents a novel Web crawling agent, AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process. Shows the effectiveness of such unorthodox coupling to both the playability of the game and the heuristics of the Web crawler. In addition, presents the results of the sample of Web pages collected by the crawling process. In particular, shows: how AlienBot was able to identify the power law inherent in the link structure of the Web; that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML.
Multi-dimensional-personalisation for location and interest-based recommendation Steffen W. Schilke, Udo Bleimann, Steven M. Furnell and Andrew D. Phippen Keywords Internet, Information services, Mobile communication systems During the dot com era the word “personalisation” was a hot buzzword. With the fall of the dot com companies the topic has lost momentum. As the killer application for UMTS has yet to be identified, the concept of multi-dimensional-personalisation (MDP) could be a candidate. Using this approach, a recommendation of online content, as well as offline events, can be offered to the user based on their known interests and current location. Instead of having to request this information, the new service concept would proactively provide the information and services – with the consequence that the right information or service could therefore be offered at the right place, at the right time. Following an overview of the literature, the paper proposes a new approach for MDP, and shows how it extends the existing implementations.
The 4th International Network Conference (INC 2004)
Guest editorial
About the Guest Editor Steven Furnell is with the Network Research Group, School of Computing, Communications and Electronics, University of Plymouth, Plymouth, UK.
Internet Research Volume 14 · Number 5 · 2004 · pp. 337-338 © Emerald Group Publishing Limited · ISSN 1066-2243
The papers in this issue of Internet Research are based on a selection of submissions from INC 2004, the 4th International Network Conference, which was held in Plymouth, UK, from 6-9 July 2004. INC events have been running since 1998, and provide a forum for sharing the latest research in computer networks and related technologies. As regular readers of Internet Research may recall, papers from previous events have provided the basis for previous themed issues of the journal.

In common with previous events, INC 2004 drew a truly international audience, with authors from 24 countries. These included a diverse mixture of academic and industrial participants, ensuring that a number of different perspectives were represented in both the presentations and the subsequent discussions. The main themes addressed by the 2004 conference were Web Technologies and Applications, Network Technologies, Security and Privacy, Mobility, and Applications and Impacts. The full conference proceedings include a total of 70 papers, with coverage ranging from discussion of applications and services, down to details of specific underlying technologies[1]. The papers selected for this issue have been chosen to be representative of the broad range of topics covered by the conference, whilst at the same time addressing areas of relevance to the journal readership.

To begin the discussion, Antonopoulos and Salter focus on the grid and other distributed computing environments, proposing a new model for resource discovery. Their paper, which was the recipient of the INC 2004 Best Paper Prize (sponsored by Emerald and Internet Research),
presents an outline of the proposed approach, and offers a comparative evaluation against existing alternatives. The grid-related discussion is maintained by Angelis et al., who consider the environment from the perspective of its security requirements. Specific attention is given to access control issues, with the authors proposing core mechanisms to address authentication and authorization issues in the grid context.

The security theme is continued by Valli, who considers the problem that can be posed by legitimate users who have not had sufficient control placed over their activities. An investigation conducted within three Western Australian organisations revealed a high incidence of the available technology being misused for non-business purposes, and of end-users employing covert techniques to hide their actions. Such findings demonstrate the need to devote more attention towards combating the insider threat, and a relevant contribution is therefore provided by Venter et al., who propose a means of safeguarding private information against end-user misuse, using a development of intrusion detection system (IDS) technologies. Their discussion introduces the issues of privacy and privacy-enhancing technologies, before considering the opportunity for applying IDS approaches and proposing the framework for a privacy IDS.

Moving on from the security issues, the remaining papers consider other challenges within distributed network environments. Joshi et al. examine the domain of Web services, which are regarded as a key technology for distributed software development and business integration. However, several barriers may impede adoption, and their study measures practitioner attitudes in order to reveal the nature of the challenges to be overcome. Insights into the adoption of Web-related standards are also one of the themes to emerge from the work of Evans and Walker, who
present a novel Web crawling approach in the guise of a computer game, with the Web's link structure being used to influence the game play. The authors examine the results obtained from a sample of Web pages collected by this crawling process, and discuss the extent to which different Web technologies are represented within them. The distribution of such a vast range of content across the Web is obviously a great advantage, but a key issue for many end-users will be to gain access to the material that is of interest to them. To this end, Schilke et al. introduce readers to the concept of multi-dimensional-personalisation, and suggest how recommendations of online content, as well as offline events, can be offered to the user, based on their known interests and current geographic location. The article considers the potential impact of the approach for the delivery of future mobile services.

The papers in this issue have been updated from the versions in the conference proceedings, giving the authors the opportunity to provide greater detail, as well as to reflect any feedback received during the conference, and any further developments in their research since the submission of the original paper.

Following the success of the 2004 conference, the next INC event has been scheduled for July 2005. However, in a departure from the previous events, INC 2005 will be held on the Greek island of Samos. Readers interested in attending, or submitting a paper, are asked to look at www.inc2005.org for further details.

Thanks are due to Dr Paul Dowland, Jayne Garcia, Denise Horne, and the various members of the Network Research Group without whose support and involvement the conference would not have taken place. Additionally, the guest editor is very pleased to acknowledge the support of the various conference co-sponsors: the British Computer Society, the Institution of Electrical Engineers, Orange, Symantec, and, of course, Emerald (the publishers of Internet Research). Acknowledgements are also due to the numerous members of the International Programme Committee who assisted with the original review process. Finally, particular thanks must be given to David Schwartz, editor of Internet Research and a valued member of the INC programme committee, for his continued support of the conference and for again offering the opportunity for us to compile this special issue.

Steven Furnell
Note
1 Full copies of the conference proceedings can be obtained from the Network Research Group at the University of Plymouth, UK ([email protected]).
Efficient resource discovery in grids and P2P networks Nick Antonopoulos and James Salter
The authors Nick Antonopoulos is University Lecturer and James Salter is a PhD Student, both in the Department of Computing, University of Surrey, Guildford, UK.
Keywords Computer networks, Resources, Computer software

Abstract
Presents a new model for resource discovery in grids and peer-to-peer networks designed to utilise efficiently small numbers of messages for query processing and building of the network. Outlines and evaluates the model through a theoretical comparison with other resource discovery systems and a mathematical analysis of the number of messages utilised in contrast with Chord, a distributed hash table. Shows that through careful setting of parameter values the model is able to provide responses to queries and node addition in fewer messages than Chord. The model is shown to have significant benefits over other peer-to-peer networks reviewed. Uses a case study to show the applicability of the model as a methodology for building resource discovery systems in peer-to-peer networks using different underlying structures. Shows a promising new method of creating a resource discovery system by building a timeline structure on demand, which will be of interest to both researchers and system implementers in the fields of grid computing, peer-to-peer networks and distributed resource discovery in general.

Internet Research Volume 14 · Number 5 · 2004 · pp. 339-346 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566926

Introduction
Grid computing and peer-to-peer (P2P) networks are emerging technologies enabling widespread sharing of distributed resources. However, efficient discovery of resources is an ongoing problem in both fields. Solutions must be scalable, fault-tolerant and able to deliver high levels of performance. In this paper, we present an initial model for resource discovery in which we aim to reduce the number of messages required to resolve queries while providing a scalable environment without a centralised structure that could lead to a single point of failure. By utilising a simple learning mechanism on top of a distributed inverted index structure we are able to provide guarantees that matching resources will be found wherever they exist in the network, within a small number of messages.

Related work
Resource discovery in grid environments shares many elements in common with resource discovery in P2P networks. Although there are some differences between the two approaches, several researchers believe there will be an eventual convergence (Foster and Iamnitchi, 2003), so it is valid to compare our work with other models from both approaches. Centralised resource discovery schemes such as Napster (Saroiu et al., 2002) and Globus’ original MDS implementation (Fitzgerald et al., 1997) are based around a cluster of central servers hosting a directory of resources. These introduce issues such as scalability, since the entire index must be stored on a single cluster. Centralised schemes can be vulnerable to attack, such as denial-of-service attacks, because there is a single point of failure. Distributed solutions such as Gnutella (Ripeanu, 2001) discover resources by broadcasting queries to all connected nodes. When a node receives a query, its list of local resources is checked for matches, and any results are back propagated to the requester. Regardless of whether a match has been found a time-to-live (TTL) counter is decremented and, if greater than zero, the query is forwarded to neighbouring nodes. Freenet (Clarke et al., 2000) is a distributed information storage system designed to ensure anonymity and make it infeasible to determine the origin or final destination of a resource passing through the network by utilising hash functions and public key encryption. Queries are forwarded across the network in a serial fashion, based on similarity of the query to keys stored in routing tables. Successful matches cause the resource to be
cached in the local file store of each node along the path between the original requestor and the node answering the query. Routing tables are also updated to point to the node that provided the resource. The primary motivation behind Freenet is anonymity across a decentralised data store, rather than performance or long-term storage capability.

Routing indices (RI) (Crespo and Garcia-Molina, 2002) provide a framework for nodes to forward queries to neighbouring nodes that are most likely to provide answers or be able to forward the query to a node with matching resources. Each node maintains a routing index pointing to each neighbouring node, together with a count of the total number of resources available and the number of resources available on each topic (group of keywords) by following the path from each neighbour.

Distributed hash tables (DHTs) (Kelaskar et al., 2002) such as Chord (Stoica et al., 2003) have become the dominant methodology for resource discovery in structured P2P networks, typically providing query routing in O(log N) messages and updates following a node join in O(log² N) messages. Each node in the network hosts part of the index, and queries are hashed to create a key that is mapped to the node with the matching identifier. In Chord, nodes are organised in a ring. Each node maintains a small finger table that is used to forward queries around the ring until the correct node is located.
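To make the DHT key-to-node mapping concrete, the following is a minimal, hypothetical sketch (not taken from Chord or any of the cited systems; node addresses and the ring size are illustrative). It shows only how a keyword is hashed onto the identifier ring and assigned to its successor node; the finger-table routing that locates that node in O(log N) hops is omitted.

```python
import hashlib
from bisect import bisect_left

M = 16                 # illustrative identifier space of 2**M positions
RING = 2 ** M

def ring_id(text: str) -> int:
    """Hash a node address or keyword onto the identifier ring."""
    return int(hashlib.sha1(text.encode()).hexdigest(), 16) % RING

def successor(key: int, node_ids: list[int]) -> int:
    """Return the first node identifier clockwise from the key (wrapping around)."""
    ids = sorted(node_ids)
    i = bisect_left(ids, key)
    return ids[i % len(ids)]

nodes = [ring_id(f"node{i}.example.org") for i in range(8)]   # hypothetical node addresses
print(successor(ring_id("cpu_speed"), nodes))  # identifier of the node responsible for the keyword
```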
Motivations
Our approach to resource discovery describes a distributed architecture that efficiently processes queries and updates across the network using a small number of messages. Several resource discovery mechanisms, such as Gnutella, which use broadcasting techniques for querying, employ TTL counters to limit the number of messages generated. TTL counters are decremented each time a query is sent to a new node. When the TTL counter reaches zero, the query expires and is not forwarded any further through the network. Since queries in such systems do not reach every node, resource discovery cannot be guaranteed. If a resource exists but is outside the query "horizon", it will not be discovered. One of the criteria of our system was that it should guarantee to provide an answer to a query if one exists, without the need for a query to be sent to every node in the network.
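As an illustration of why TTL-limited flooding cannot guarantee discovery, here is a small, hypothetical simulation (our own sketch, not code from Gnutella itself): a query is broadcast with a TTL, and any resource held beyond the resulting horizon is simply never reached.

```python
from collections import deque

def flood_search(graph, start, keyword, ttl):
    """Breadth-first broadcast with a TTL.
    graph: node -> {"links": neighbours, "resources": set of keywords held locally}."""
    found, seen = [], {start}
    queue = deque([(start, ttl)])
    messages = 0
    while queue:
        node, hops_left = queue.popleft()
        if keyword in graph[node]["resources"]:
            found.append(node)
        if hops_left == 0:
            continue                      # query expires here: this is its horizon
        for neighbour in graph[node]["links"]:
            messages += 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops_left - 1))
    return found, messages

# A chain A-B-C-D where only D holds the resource: with TTL=2 the query never reaches it.
g = {n: {"links": [], "resources": set()} for n in "ABCD"}
g["A"]["links"], g["B"]["links"] = ["B"], ["A", "C"]
g["C"]["links"], g["D"]["links"] = ["B", "D"], ["C"]
g["D"]["resources"].add("mpeg_encoder")
print(flood_search(g, "A", "mpeg_encoder", ttl=2))  # ([], 3): the resource exists but lies outside the horizon
```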
System model
Centralised architectures arguably provide the best performance in terms of the number of messages required to give an answer to a query, but suffer weaknesses such as scalability and single point-of-failure as described above. The benefits of guaranteed answers to queries and elimination of query broadcasting provided by the inverted index nature of centralised architectures are utilised by our system. We gain improvements over this by distributing the inverted index over many network nodes, replicating information and utilising preference lists.

Our resource discovery mechanism comprises a series of nodes connected together in an arbitrary manner, each running a software agent. Figure 1 shows the model of a typical node.

Figure 1 Node model showing data structures and processes of a typical node

Each node represents a group of one or more machines connected to the network that may have resources, such as software, databases or CPU cycles, available for use. These resources are registered against a set of one or more keywords describing them in a local resource table on their local node. A node may become a supernode responsible for one or more keywords. A supernode maintains a node keyword table listing each node providing resources matching the keyword(s) for which it is responsible. Every node hosts a local node lookup table that contains a list of which supernode is responsible for which keyword. If each node had to store information for each supernode and keyword, a message broadcast would be needed every time a new supernode or keyword was added. Instead, we introduce a pointer in each supernode pointing to the newest supernode created. Only the local node lookup table of the newest supernode must be updated when new keywords are added.
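The data structures just described can be pictured with a small, hypothetical sketch (class and field names are ours, not the paper's): every node holds a local resource table and a local node lookup table, and a node acting as a supernode additionally holds a node keyword table and a pointer towards the newest supernode.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    address: str
    # keyword -> local resources registered under that keyword
    local_resource_table: dict[str, set[str]] = field(default_factory=dict)
    # keyword -> (responsible supernode address, preference list of node addresses)
    local_node_lookup_table: dict[str, tuple[str, list[str]]] = field(default_factory=dict)

@dataclass
class Supernode(Node):
    # keyword(s) this supernode is responsible for: keyword -> provider node addresses
    node_keyword_table: dict[str, set[str]] = field(default_factory=dict)
    newest_supernode: str | None = None   # pointer followed to reach the newest supernode
    is_checkpoint: bool = False

# A provider registers a resource locally and is listed on the keyword's supernode.
provider = Node("node17")
provider.local_resource_table.setdefault("temperature", set()).add("sensor-feed-3")
sn = Supernode("node45")
sn.node_keyword_table.setdefault("temperature", set()).add(provider.address)
```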
Keyword information not available in a node's local node lookup table can be found by contacting a supernode and following its pointer to the newest supernode, which will hold the latest list of keywords. Certain supernodes along this timeline are designated as checkpoints (Figure 2). This prevents the need to update the pointer to the newest supernode on every node in the timeline whenever a new one is formed. Instead, only the pointer on the latest checkpoint is updated. If a supernode further back needs to find the newest supernode it will contact the checkpoint preceding itself, which will in turn point to the next checkpoint in the timeline. By following the sequence of checkpoint pointers, the newest supernode will be found within a few hops.

To overcome the continuous increase in size of the local node lookup table on the newest supernode, the timeline is split into segments. When the size of the table grows over a set limit, the current segment is closed and a new segment created. The local node lookup table is cleared at the start of each segment, meaning it will only list information on supernodes within the current segment. A mesh of segments is created to enable queries to be multicast to other segments if they are not resolvable in the current segment.

The timeline can support name-value pair style keywords in addition to standard single-word keywords. These are useful for defining the semantics of a keyword. For example, a query for "cpu_speed=1,600" is split into name (cpu_speed) and value (1,600) components, giving more meaning than a single-word keyword "1,600". The name component is treated as a
standard keyword. However, rather than listing all resource providers with resources matching that keyword, the supernode responsible for the keyword acts as the root of a second timeline. This value timeline is constructed of a series of keyword value supernodes, which act in a similar way to the supernodes described above, but index the value component of the name-value pair for the single keyword.

Figure 2 Timeline of supernodes connected in order of joining

Searching for resources
When a query is made using our resource discovery mechanism, the local node's software agent first checks the local resource table for matching resources. This allows matches to be provided within the node's local environment, without the need to send any messages across the network. If no matching resources are found, or those that match are unavailable to the user, the search will proceed to find resources on remote nodes. The local node's local node lookup table (Table I) is searched to see if keywords in the query are listed.

Table I Sample local node lookup table
Keyword       Supernode   Preference list
Temperature   45          3, 7, 45, 863, 23
Rs232         757         757
Turing        235         235, 343, 1, 55, 34, 76
...           ...         ...

If the keyword is not listed, the most recent supernode listed in the table is contacted. If searching this supernode's local node lookup table does not list the required
supernode, the software agent running the query will be sent through the timeline. Each supernode has a pointer to the previous checkpoint in the timeline. In closed segments, the checkpoint hosts a pointer directly to the newest checkpoint (the closing supernode) in the segment. In the open segment, checkpoints forward queries to the next checkpoint in the timeline. The query is sent across checkpoint supernodes (or directly to the newest segment checkpoint), checking each one's local node lookup table as it goes, until either the keyword is found or the newest supernode is reached. If the keyword is not found, the query is multicast to the closing node of each other segment and their local node lookup tables searched.

When the address of the supernode responsible for the keyword is found, the supernode is contacted. If the query is for a single-word keyword, the supernode's node keyword table gives a list of all nodes providing resources matching the keyword. If the keyword is a name-value pair, the value timeline relating to the name is traversed to find the keyword value supernode responsible for the value component of the pair, which will have a node keyword table listing nodes providing matching resources. The list of nodes is attached to the query, which is then forwarded to each node's agent in turn until an available matching resource is found through searching of each node's local resource table. In addition to the address of the machine with the matching resource, the list of nodes from the supernode is sent back to the original querying node. The node can then use this list to build its preference list.

Preference lists are maintained in local node lookup tables at each node, independently of supernode/keyword lists stored at other nodes. The lists are sorted in order of the node most likely to be able to match the query, based on successful matches in the past. They act as a learning mechanism to enhance the performance of future similar queries. When a preference list for a keyword exists on a node's local node lookup table, it can be used to begin the search directly without the need to contact the keyword's supernode, which helps to spread the query load across multiple servers for popular queries. We attach the preference list to the query, so if the first node to be sent the query cannot provide matching resources the query will be forwarded directly to the next node. Using this scheme, the node most likely to answer the query is always contacted first, then other nodes are contacted in serial until the query is answered. Another approach is to weight each preference list entry and randomly select a node, to enable alternative nodes
to be tested periodically and avoid potentially overloading the most popular resource providers. If the supernode is not listed in the preference list, it is attached to the end of the preference list prior to sending it out from the local node. When the query reaches the supernode for the keyword, any other nodes listed on the supernode but not currently in the preference list can be added to the end of the list. This allows for a complete search, without the need to broadcast the query to nodes that do not provide any matching resources. Periodically, full searches are conducted for keywords which have associated preference lists, to enable the preference list to be refreshed and new resource providers discovered.

Preference lists are important as they form the first stage in amalgamating simple keyword matching with more complex matching of users to resources in the context of availability and access policies (Antonopoulos and Shafarenko, 2001). For example, the supernode for a keyword may offer resources matching a user's query, but its security policy may prevent that user from exploiting the resources because of the relationship with the owner of the node on which they are connected. Therefore, it would be more beneficial for a user to contact directly a different node with matching resources and a compatible security policy, rather than first contacting the supernode. By introducing the concept of preference lists into our system we are able to minimise the number of messages required to provide an answer to a query, while still providing guarantees that if a matching resource exists we will be able to find it. Searching for a keyword's supernode as described above is the worst-case scenario in terms of the number of messages required to process a query. This activity, which uses only a small number of additional messages compared to a standard query, is a one-time occurrence. The keyword's supernode will be listed in the node's local node lookup table for subsequent queries and preference lists will be deployed to minimise the message count further.

Adding new nodes and resources
When a new node wishes to join the network, it contacts the software agent running on an existing node in order to inherit that node's local node lookup table. A blank local resource table is also created on the node to list any local resources that will be introduced. If the node does not offer any resources, this is all it needs to do to begin participation within the network. If a resource is to be introduced into the system, it is first listed against each keyword with which it is associated in the local resource table on its local node. If keywords not previously associated with the local node are being introduced, the keyword
information must also be published on the relevant supernodes to allow the new resource to be found by any node in the network. The address of a supernode is chosen from the local node’s local node lookup table, which will in turn provide a pointer to the newest supernode (possibly via hops across checkpoints along the timeline). The local node lookup tables of each segment are consulted to see whether keywords associated with the new resource already exist. If they do, the address of the new node can simply be registered with the relevant supernodes responsible for each keyword as providing resources matching that keyword. A broadcast message is not necessary, as information about the new resources can be found by contacting the relevant supernode. If one or more keywords associated with the resource did not previously exist within the network, the node hosting the resources will be designated as the supernode for those keywords. The local node lookup table for this new supernode will be inherited from the previous newest supernode on the currently open segment and then updated with entries showing the new supernode as the supernode for the new keywords. The pointer on the current checkpoint will be updated to reflect the presence of the new supernode. Again, a broadcast of information relating to the new node is unnecessary. If the size of the local node lookup table has exceeded the maximum limit (configured prior to the timeline being built), the current segment will be closed and the next supernode added will begin a new one. The supernode will not inherit the current local node lookup table, but will begin a blank one. On closure of a segment, a message is sent to each checkpoint in the segment to update their pointers to point to the final node of the segment, allowing the complete segment local node lookup table to be quickly accessed.
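A compact, hypothetical sketch of the registration path just described (function and variable names are ours, not the paper's, and segments are omitted for brevity): registering a resource either adds the provider to an existing keyword's supernode or turns the providing node into the newest supernode on the timeline, with the checkpoint pointer updated as it joins.

```python
def make_node(address):
    """Minimal node record holding the tables described in the text."""
    return {"address": address,
            "local_resource_table": {},   # keyword -> set of local resources
            "node_keyword_table": {},     # keyword -> provider addresses (supernodes only)
            "newest_supernode": None,     # pointer towards the newest supernode
            "is_checkpoint": False}

class Timeline:
    """Hypothetical registry: supernodes are appended in joining order and a checkpoint
    is marked every `checkpoint_interval` supernodes."""

    def __init__(self, checkpoint_interval=3):
        self.checkpoint_interval = checkpoint_interval
        self.supernodes = []     # the timeline, in order of joining
        self.keyword_owner = {}  # stands in for the newest supernode's local node lookup table

    def register(self, node, keyword, resource):
        node["local_resource_table"].setdefault(keyword, set()).add(resource)
        owner = self.keyword_owner.get(keyword)
        if owner is not None:
            # Keyword already known: simply list the new provider on its supernode.
            owner["node_keyword_table"][keyword].add(node["address"])
            return owner
        # New keyword: the providing node becomes the newest supernode on the timeline.
        node["node_keyword_table"][keyword] = {node["address"]}
        node["is_checkpoint"] = (len(self.supernodes) % self.checkpoint_interval == 0)
        if self.supernodes:
            self.supernodes[-1]["newest_supernode"] = node["address"]  # no broadcast needed
        self.supernodes.append(node)
        self.keyword_owner[keyword] = node
        return node

tl = Timeline()
a, b = make_node("node3"), make_node("node7")
tl.register(a, "temperature", "sensor-feed-3")   # node3 becomes supernode for "temperature"
tl.register(b, "temperature", "sensor-feed-9")   # node7 is merely listed on node3
print(tl.keyword_owner["temperature"]["node_keyword_table"])  # {'temperature': {'node3', 'node7'}}
```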
Multiple keyword queries
Queries may be submitted comprising multiple keywords ("distribution = binomial AND factor = 34 AND fractal", for example). To resolve these queries efficiently using a minimal number of messages, for keywords with preference list entries the preference lists can be intersected to find common resources. Where preference lists for each pair do not exist, a query is executed for a single keyword that is selected as the one most likely to be resolved in the fewest messages (if one of the keywords had been previously queried for, then this pair should be selected, as the supernode for the keyword would already be listed in the local node lookup table). If the set of results returned is large, a query is run for another keyword and the sets of results intersected. If the set of results is small, each resource provider is individually queried to check whether the remaining keywords match any of their resources.
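The intersection step can be sketched as follows (a hypothetical illustration; the selection heuristics in the paper are richer than this): providers are taken from whichever preference lists are already cached locally, intersected across keywords, and only the surviving candidates need to be queried individually for the remaining keywords.

```python
def candidate_providers(query_keywords, lookup_table):
    """lookup_table: keyword -> (supernode, preference_list). Returns providers common to
    every keyword for which a preference list is already cached locally, or None."""
    cached = [set(lookup_table[k][1]) for k in query_keywords if k in lookup_table]
    if not cached:
        return None              # fall back to a full supernode/timeline lookup
    common = set.intersection(*cached)
    return common or None

lookup = {"fractal": ("sn42", ["3", "7", "45"]),
          "distribution=binomial": ("sn9", ["7", "45", "88"])}
print(candidate_providers(["fractal", "distribution=binomial", "factor=34"], lookup))
# {'7', '45'}: only these two nodes need to be contacted for the remaining keyword
```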
Comparison with other systems
By describing the main architecture and algorithms of our system we have shown that we have eliminated message broadcasting while still providing guarantees that an answer to a query will be found if one exists in the network, without the need for a centralised architecture.

The original Gnutella employed both message broadcasting and TTL counters, meaning it was not able to guarantee that resources requested could be found in the network, even though it used many more messages than our system. To search every node in the system would require approximately one message to be generated for every node, resulting in scalability concerns in large networks (Ritter, 2001). Since Gnutella provides no facility for improving query performance based on past experience, every time a popular query is submitted a similar number of messages will be generated.

Freenet's architecture employs a steepest-ascent hill-climbing search with backtracking, meaning queries are forwarded to nodes in a serial fashion, eliminating message broadcast. However, unlike our system, it cannot guarantee to find matching resources as it utilises TTL counters (called hops-to-live limits). Query performance improves over time, as successful queries cause matching resources to be cached and routing tables updated. However, Freenet consumes a lot of bandwidth during this process as resources are copied between locations.

RI presents an improvement over Freenet by always returning a result if one exists in the network. Rather than using a TTL counter, RI uses a stop condition to specify the number of matching resources that must be found before a query will be terminated. Summarisation techniques are employed to provide keyword groupings, which may mean that resources related to rare keywords may not be found, because RI uses a frequency threshold that discards topics with very few documents. Indices are not modified depending on the success or failure of a query, so there is no scope for improvement in query performance through learning. RI relies on a network architecture where every node can be reached (possibly via a path consisting of several intermediate nodes) by every other node. If this level of connectivity is not maintained, resources may not be found because they are unreachable from a point in the network. There is also the potential for a high number of update messages to be generated, because when resources are
introduced or removed, routing indices on every node in a path pointing to the node hosting the resource must be updated.
Comparison with Chord
Chord and similar single-layer DHTs do not directly provide any mechanism for lookup of name-value pair keywords where the name and value are split into separate components. Therefore, to provide a fair comparison with Chord we only discuss a timeline indexing standard single-word keywords. Chord provides a mechanism for resource discovery with log(n) messages required for query processing and log²(n) messages for updating routing information following a node joining the network in the worst case. The number of messages to process a query in our system is dependent on the number of checkpoints in the timeline and the number of segments. For a query originating in a closed segment, the worst-case number of messages required is s + 2, where s is the number of segments. The constant 2 messages are required to locate a checkpoint and then hop to the newest (closing) checkpoint in the segment. For queries originating in the open segment, the worst-case number of messages required is c + s + 2, where c is the number of checkpoints in each segment, since the query must be sent across each checkpoint in the open segment, then multicast to all other segments. In both cases we assume the query is not resolvable in the originating segment. To show improvement over Chord in the worst case we must have:

c + s + 2 < log₂(n)    (1)
where n is the total number of supernodes in the timeline. We define latency as a measure of the time taken to resolve a query, which is influenced by the number of steps (hops) involved in the lookup process. Chord's routing algorithm requires routing to take place sequentially. However, in our approach, since each timeline segment is connected within a mesh, a query can be multicast to each segment simultaneously. Reduced latency over Chord is possible providing the number of hops required to be completed sequentially (hopping across the timeline) is less than the total number of hops required to resolve the query in an equivalent Chord ring. This can be achieved in the worst case where log₂(n) > 3 for queries originating in closed segments, and where log₂(n) > c + 3 for queries originating in the open segment (by substituting 1 for s in the above formulae to measure the number of sequential steps).

Each time a supernode is added, a message is sent to the most recent checkpoint to update its pointer to the newest supernode. In total mc of these will be sent in a segment, where m is the number of supernodes between checkpoints. On creation of a checkpoint supernode, update messages are sent to the final checkpoint in each closed segment (s − 1 messages). Finally, when a segment is closed, each checkpoint is updated to point to the final checkpoint in the segment, using c messages. In total, the number of update messages to create a segment is c(m + (s − 1)) + c = c(m + s). It follows that the average number of messages to create a supernode is c(m + s)/(nodes in segment) = c(m + s)/(cm) = (m + s)/m. To show improvement over Chord for adding nodes in the worst case, we must have:

(m + s)/m < log₂²(n)    (2)
Utilising inequalities (1) and (2), together with the constraint c ≥ 1 (there cannot be fewer than one checkpoint per segment) and the definition of n as the total number of supernodes in the network, we can derive parameter values for which the timeline can show lower message costs for queries and updates than Chord. As an example, a timeline consisting of n = 5,000 supernodes split across s = 3 segments with c = 7 checkpoints and m = 239 supernodes between each checkpoint would satisfy the above conditions, therefore yielding lower numbers of messages than Chord (<1 to insert a node and 12 for worst-case query routing, compared to <151 in the worst case for Chord to insert a node and 13 to route a query). Many such combinations of parameters are possible, but the expected size of the timeline (n) must be known before the network is built so they can be set correctly. In many scenarios it is impossible to know this, so the timeline must be adapted dynamically as it is built to ensure the message costs remain low. Developing a set of intelligent algorithms for determining when checkpoints should be created and the timeline segmented is a topic for future work.

The worst-case scenario of hopping across each checkpoint in a segment to process a query only occurs for nodes that believe a supernode prior to the first checkpoint in the segment is the newest supernode. When queries are sent across the timeline, the querying node learns the address of a newer supernode. Timeline traversals begin from that supernode in future queries, so nodes which submit frequent queries will experience the smallest hop counts across the timeline. Although a query from a node that has not submitted a query recently will potentially have a greater number of
hops, by its nature the node will not frequently send out queries into the timeline and therefore not add substantial numbers of messages to the overall network traffic. Additionally, once the node has queried, it will learn the address of the newest supernode and thus its queries will be routed more quickly in future.
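As a quick arithmetic check of the example above (our own illustration, using the paper's parameter values), the quantities on both sides of inequalities (1) and (2) can be computed directly:

```python
import math

n, s, c, m = 5000, 3, 7, 239            # parameter values from the example above

timeline_query_worst = c + s + 2         # query originating in the open segment
chord_query_worst = math.log2(n)         # ~log2(n) hops for Chord routing

timeline_insert_avg = (m + s) / m        # average messages to create a supernode
chord_insert_worst = math.log2(n) ** 2   # ~log2^2(n) update messages for a Chord join

print(timeline_query_worst, round(chord_query_worst, 1))            # 12 vs ~12.3
print(round(timeline_insert_avg, 3), round(chord_insert_worst, 1))  # 1.013 vs ~150.9
```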
Case study
We have shown above how our model can be implemented using a timeline structure. This model can be generalised to become a methodology for building efficient resource discovery mechanisms. The construction of our model follows a different approach from previous methods of inserting nodes into the overlay as they join the network. The topology is instead built on demand, with nodes being added when a new keyword is registered in the system. In this case study we use a multi-tiered Chord architecture as an example of how the methodology could be deployed using other topologies, further details of which are provided separately (Salter and Antonopoulos, 2004).

Our timeline could be substituted for a topology similar in structure to other multi-tier DHTs. At the core of this overlay (Figure 3) is a central super ring that organises a Chord ring of indexed keywords, splitting the index over the supernodes in the ring. Supernodes point to one or more keyword rings. Each keyword ring is a Chord ring of indexed values associated with a single keyword. Each node in the keyword ring holds a list of IP addresses of machines hosting resources matching the values and keyword for which the node has responsibility.

Figure 3 The multi-ring topology

Searching for resources involves a two-step process. First, the super ring is traversed to find the supernode responsible for the keyword. The relevant keyword ring is then entered and the relevant node hosting the addresses of resource providers relating to the specified value is located. Searches in both rings are conducted using the standard Chord algorithms. By introducing similar information to that contained in local node lookup tables, queries can be directed straight into keyword rings, shortcutting the super ring. Preference lists can also be used in similar ways to those described above.

Benefits in terms of the number of messages required to resolve a query and update information can be realised by building the overlay on demand, thereby creating fewer, smaller-sized rings than in other multi-tiered DHT topologies which assume secondary-level rings are either already built or are created based on node locality (Garcés-Erice et al., 2003; Mislove and Druschel, 2004). In these, representative nodes are added to the primary ring, effectively building the network upwards. The system itself has no control over how many subrings exist or how many nodes each contains. Following our methodology, the network is built downwards from the primary ring, with decisions on the size and number of rings being made by the system.

When a resource is registered under a new keyword in our topology, rather than immediately creating a keyword ring, the resource information is held on the supernode responsible for the keyword. A keyword ring is only built once either the number of resources registered against the keyword or the number of queries for the keyword has exceeded preset trigger levels. When one of these has been exceeded, nodes with available capacity are selected to become members of the ring from those listed in a separate ring containing all available nodes. The supernode responsible for the keyword will determine the number of nodes needed for the keyword ring and the capacity (primarily in terms of query processing load) each one will need. Due to rings being built on demand and the size of the rings being controlled, fewer update/maintenance and query lookup messages are required compared with either a single Chord ring or other multi-tiered Chord architectures.
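A hypothetical sketch of the on-demand keyword-ring creation described above (class names, trigger values and the ring-sizing rule are ours, and the Chord indexing inside the ring is omitted): resource registrations and queries accumulate on the responsible supernode until a trigger level is crossed, at which point a dedicated keyword ring is formed from spare nodes.

```python
class SuperRingEntry:
    """State kept by a supernode for one keyword in the two-tier overlay
    (super ring of keywords, keyword rings of values)."""

    def __init__(self, keyword, resource_trigger=100, query_trigger=500):
        self.keyword = keyword
        self.index = {}              # value -> provider addresses (held locally until a ring exists)
        self.keyword_ring = None     # member nodes of the dedicated keyword ring, once built
        self.queries = 0
        self.resource_trigger = resource_trigger
        self.query_trigger = query_trigger

    def register(self, value, provider, spare_nodes):
        self.index.setdefault(value, set()).add(provider)
        self._maybe_build_ring(spare_nodes)

    def query(self, value, spare_nodes):
        self.queries += 1
        self._maybe_build_ring(spare_nodes)
        return self.index.get(value, set())

    def _maybe_build_ring(self, spare_nodes):
        # Build the keyword ring only on demand, once a trigger level is exceeded.
        if self.keyword_ring is None and (
            len(self.index) >= self.resource_trigger or self.queries >= self.query_trigger
        ):
            # Size the ring to the current load: here one member per 50 indexed values (arbitrary).
            self.keyword_ring = spare_nodes[: max(1, len(self.index) // 50)]

entry = SuperRingEntry("cpu_speed", resource_trigger=3)
spare = ["nodeA", "nodeB", "nodeC"]
for value, provider in [("1600", "n1"), ("2000", "n2"), ("2400", "n3")]:
    entry.register(value, provider, spare)
print(entry.keyword_ring)   # ['nodeA']: the ring was created once three values were indexed
```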
Future work
Failure of certain nodes could currently break the timeline, so extra routing information must be added to give each supernode increased knowledge of its environment. This will give the side-benefit of traversal across the timeline in fewer messages. We envisage this will follow a similar scheme to Chord's successor lists.
Pointers to other segments will be distributed throughout the timeline, reducing the reliance on the closing node of each segment. Since the message cost of inserting a supernode into the timeline is small, an increase in update traffic as a result of maintaining extra routing information should not have a substantial impact on the system. We will create intelligent algorithms that will determine when checkpoints should be inserted and new segments created, rather than relying on static parameters defining, for example, how many supernodes should exist between checkpoints. This will enhance the adaptivity of the model to different network sizes and conditions, improving scalability, message costs and fault tolerance. In addition to simple resource discovery, the model has potential uses as a distributed access control architecture. Segments can be viewed as autonomous administrative domains. Within each segment are a number of progressively higher security levels, with checkpoint supernodes controlling access to nodes further along the segment timeline. Methods for combining discovery and access control will be explored.
Conclusions
In our work so far we have demonstrated a method of providing a scalable resource discovery mechanism without the need for a single centralised index. We have shown that through selection of key parameters, nodes can be inserted and queries resolved in fewer hops than in DHT architectures such as Chord. Through a case study, we have explored how our model can be generalised into a methodology for developing efficient multi-tiered peer-to-peer networks.

Keeping indexes of individual resources on the nodes hosting the resources provides a more fault-tolerant and scalable solution than centralised approaches such as Napster. By imposing a certain amount of structure, in the form of supernodes connected together on a timeline, we are able to remove the need for flooding the network with queries and the associated TTL counters, a key requirement in systems such as Gnutella. We can guarantee to find all available resources (if necessary) as nodes always point towards the answer and there is a single point of contact for each keyword.

We now continue our work to improve the fault tolerance of the model by adding routing information and distributing pointers to multiple nodes, to define algorithms that dynamically adapt the parameters involved in timeline construction, and to explore the model's suitability as a distributed access control architecture.
References

Antonopoulos, N. and Shafarenko, A. (2001), "An active organisation system for customised, secure agent discovery", The Journal of Supercomputing, Vol. 20 No. 1, pp. 5-35.

Clarke, I., Sandberg, O., Wiley, B. and Hong, T.W. (2000), "Freenet: a distributed anonymous information storage and retrieval system", Lecture Notes in Computer Science, Vol. 2009, pp. 46-66.

Crespo, A. and Garcia-Molina, H. (2002), "Routing indices for peer-to-peer systems", Proceedings of the International Conference on Distributed Computing Systems (ICDCS'02), Vienna, 2-5 July.

Fitzgerald, S., Foster, I., Kesselman, C., von Laszewski, G., Smith, W. and Tuecke, S. (1997), "A directory service for configuring high-performance distributed computations", Proceedings of the 6th IEEE Symposium on High Performance Distributed Computing, Portland, OR, pp. 365-76.

Foster, I. and Iamnitchi, A. (2003), "On death, taxes, and the convergence of peer-to-peer and grid computing", Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS'03), Berkeley, CA.

Garcés-Erice, L., Biersack, E.W., Ross, K.W., Felber, P.A. and Urvoy-Keller, G. (2003), "Hierarchical peer-to-peer systems", Parallel Processing Letters, Vol. 13 No. 4, December, pp. 643-57.

Kelaskar, M., Matossian, V., Mehra, P., Paul, D. and Parashar, M. (2002), "A study of discovery mechanisms for peer-to-peer applications", 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, Workshop on Global and Peer-to-Peer Computing on Large Scale Distributed Systems, Berlin, May, pp. 444-5.

Mislove, A. and Druschel, P. (2004), "Providing administrative control and autonomy in structured peer-to-peer overlays", paper presented at the 3rd International Workshop on Peer-to-Peer Systems (IPTPS '04), San Diego, CA, 26-27 February.

Ripeanu, M. (2001), Peer-to-Peer Architecture Case Study: Gnutella Network, Technical Report TR-2001-26, University of Chicago, Chicago, IL.

Ritter, J. (2001), "Why Gnutella can't scale", available at: www.darkridge.com/~jpr5/doc/gnutella.html

Salter, J. and Antonopoulos, N. (2004), An Efficient Fault Tolerant Approach to Resource Discovery in P2P Networks, Computing Sciences Report CS-04-02, University of Surrey, Guildford.

Saroiu, S., Gummadi, P.K. and Gribble, S.D. (2002), "A measurement study of peer-to-peer file sharing systems", Proceedings of Multimedia Computing and Networking 2002 (MMCN'02), San Jose, CA, 18-25 January.

Stoica, I., Morris, R., Liben-Nowell, D., Karger, D., Kaashoek, M.F., Dabek, F. and Balakrishnan, H. (2003), "Chord: a scalable peer-to-peer lookup protocol for Internet applications", IEEE/ACM Transactions on Networking, Vol. 11 No. 1, February, pp. 17-32.
Mechanisms for controlling access in the global grid environment George Angelis Stefanos Gritzalis and Costas Lambrinoudakis The authors George Angelis is a PhD Candidate, Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece, and a member of the Information and Communication Systems Security Laboratory. Stefanos Gritzalis is Associate Professor, Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece, and an Associate Director of the Information and Communication Systems Security Laboratory. Costas Lambrinoudakis is Assistant Professor, Department of Information and Communication Systems, University of the Aegean, Samos, Greece and a Senior Researcher of the Information and Communication Systems Security Laboratory.
Keywords Computer networks, Control systems, Message authentication, Data security
Abstract The Grid is widely seen as the next generation Internet. Aims to share dynamic collections of individuals, institutions and resources by providing consistent, easy and inexpensive access to high-end computational capabilities. Studies Grid security and specifically users' access control. It has been proved that the viability of these heterogeneous environments is highly dependent on their security design. Solutions trying to address all aspects of security were proposed by most existing Grid projects and collaborations; however, the results were not always satisfactory. Reviews some of the most widely-accepted security solutions, and collects the most efficient. Emphasizes access control procedures and the solutions addressing authentication and authorization issues. Identifies the most successful security schemes implemented and illustrates their effectiveness. Collects these mechanisms to form the backbone of a security mechanism, addressing authentication and authorization Grid-specific problems. The proposed schemes can constitute the backbone of an effective Grid security architecture.
Internet Research Volume 14 · Number 5 · 2004 · pp. 347-352 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566935
Introduction
Recent rapid advances in high-speed networking technologies, along with the rise of the Internet and the emergence of e-business, have made it possible to construct computations that integrate resources located at multiple geographically distributed locations. They have led to a growing awareness that an enterprise's information technology (IT) infrastructure also encompasses external networks, resources and services, therefore generating new requirements for distributed applications development and deployment. The Grid is one of several mechanisms to exploit the highly connected sea of networked computers, sensors and data repositories. The idea has been built around the following fact: in a distributed environment, there are a large number of applications starved of computation resources, whereas an overwhelming majority of computers are often idle. This gap can be bridged by allowing computation-intensive applications to be executed on otherwise idle resources, no matter where the latter are located. Based on this, the Grid could be described as an infrastructure that tightly integrates computations and storage devices, software, databases, specialized instruments, displays and people from widespread locations, and under different management authorities. It is the move from the existing Internet, offering to everyone easy, inexpensive and consistent access to enormous portions of shared information, to the next generation where processing power and access to specialized instruments will also be provided to everyone in a secure and effective manner.

In a Grid computing environment, the security problem can be divided into two main concerns: the ability to assert and confirm the identity of a participant in an operation or request; and determining whether a client should have access to an object or resource and to what extent this access should be allowed. The former is known as authentication, and it is the association of each entity with a unique identifier; the latter is known as authorization, and is actually based on a match between the identity of the requester and some notion of who should have access to a resource. These two main aspects of Grid security constitute what is known as access control. Although there are other sections of security that also need to be considered in a Grid environment, in this document we focus on access control, since we believe that it provides the base security mechanism around which all others are developed.

In the next section of this paper we describe the advantages of a well-defined access control model by analyzing authentication and authorization
characteristics. We also explain why these two factors are considered to be the backbone of security in Grid computing environments. After reviewing the Grid security models that are currently implemented in major Grid projects and have been proposed to address authentication and authorization issues, we choose among these, and we argue that the selected authentication and authorization solutions are capable of effectively answering almost all security questions that have been raised so far on Grid security. Combined with other contributions and enhancements, they may compose a complete and effective Grid security architecture.
Access control in the Grid

Authentication and authorization are the baseline of the security considerations for any system or environment, obliging the participants to obey a set of rules – a policy – that defines the security subjects (e.g. users), the security objects (e.g. resources) and the relationships among them. In most of the existing Grid computing environments, security policies could be described by a list of valid entities (authentication), granted a set of access rights (authorization) over each available resource. However, the specification and enforcement of a security architecture should not omit research and risk analysis in some other non-trivial areas. Confidentiality ensures non-disclosure of data exchanged over the network to unauthorized entities. Integrity ensures that content is received unaltered, or that changes can be detected. Administration translates to granting permissions to, and revoking them from, participating entities. Logging ensures that all access to resources, whether successful or unsuccessful and including highly privileged access, is logged and reviewed by trusted third-party entities. Finally, accountability introduces a method of tracking resource usage. The reason that the above are not treated as equally significant with authentication and authorization by Grid security administrators is that almost all their functionality, and the protection they provide, is already covered by the two major factors. For example, most authentication mechanisms in place are based on implementations which in turn are based on encryption keypairs. As a result, confidentiality and integrity are also preserved. In some other cases, the issue in question does not offer a field for exhaustive analysis and further research (e.g. in the case of logging). Accountability will play an extremely significant role in the near future, when Grid research reaches conclusions and Grid environments are implemented for production purposes. For the time being, it is beyond the scope of this paper.
Authentication

Authentication is the process of verifying the identity of a participant in an operation or request and may be required for any entity in the Grid, including users, resources, and services. Kerberos tickets and PKI are two common examples that one can often meet in Grid authentication implementations. Because of the particularity of the Grid computing environment, several extensions to the already-known conventional schema need to also be included under the scope of Grid authentication. These are the following:
. Delegation of identity. In other words, the granting of authority to assume the identity of another principal for completion of some task. For example, a user must be able to grant to a program the ability to run on his/her behalf, so that the program is able to access the resources on which the user is authorized. Furthermore, this program should be able to delegate identity further to another program or spawned processes. The concept of proxies is introduced here.
. Single sign-on. This addresses the problem of the user signing-on to all diverse individually controlled domains while his/her process is traversing the Grid to locate resources. Users should be able to authenticate only once when accessing the Grid and then access resources that they are authorized to, throughout the whole environment without any further user intervention.
. Identity mapping. In order to achieve single sign-on, user Grid identities should be mapped to local credentials. The user must have a local ID at the sites that need to be accessed, and the site administrator must agree with the Grid administrator on the mapping to be used.
. Mutual authentication. Not only does a service need to authenticate a user's identity to assure that resources are not accessed by unauthorized users, but also the user must authenticate a server to assure that data and resources received can be trusted and that submitted data are sent where intended (a minimal sketch of mutual authentication follows this list).
. Certification. The heterogeneity of the Grid computing environment imposes the existence of independent certificate authorities (CAs) employed to provide the binding between identity credentials and principals. Certification duration and revocation are also issues that need to be resolved in the dynamic Grid user population.
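To make the mutual authentication requirement concrete outside any particular Grid toolkit, the following minimal Python sketch configures TLS contexts in which each side must present a certificate issued by a CA the other side trusts. The file names, host name and port are placeholders invented for illustration; this is not GSI code.

    import socket
    import ssl

    # Server side: require a client certificate signed by a trusted CA
    # (mutual authentication). All file names below are placeholders.
    server_ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    server_ctx.load_cert_chain(certfile="service-cert.pem", keyfile="service-key.pem")
    server_ctx.load_verify_locations(cafile="grid-ca.pem")
    server_ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid certificate

    # Client side: verify the server against the same CA and present the
    # user's (or proxy's) certificate so the server can authenticate the user.
    client_ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="grid-ca.pem")
    client_ctx.load_cert_chain(certfile="user-cert.pem", keyfile="user-key.pem")

    with socket.create_connection(("resource.example.org", 443)) as sock:
        with client_ctx.wrap_socket(sock, server_hostname="resource.example.org") as tls:
            print("mutually authenticated peer:", tls.getpeercert()["subject"])

Either handshake fails if the peer cannot present a certificate chaining to the trusted CA, which is exactly the property mutual authentication requires.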
Authorization

Authorization is the process of determining whether a specific client may use or access a resource. Traditionally, both local access lists and security administrators have been used to describe the set of principals authorized to access a particular object. In the Grid computing environment local lists must be kept separately in each local participating domain. Local administrators or resource owners must be able to control which subjects can access the resource, and under what conditions. Control should also be exercised over delegated identities, in order to minimize exposure from compromised or misused delegated credentials, since along with identity, authorization rights are also delegated. In an access control decision there are usually global and local rules and policies to take into account. In the Grid computing environment, it is not feasible to administer authorization information for all individual users at every site. Users normally have direct administrative deals with their own local site and with the collaborations they participate in, but not generally with other sites. Grid authorization is established by enforcing agreements between local site rules and the Grid security policy, since resource access is controlled by both parties. Authorization is usually the step that follows successful authentication. The security architecture for authorization involves the following steps (sketched in code below):
(1) The service gathers additional information associated with the user or the actual session (group membership, role, period of validity, etc.).
(2) The service gathers additional information associated with the protected resource (e.g. file permissions).
(3) Any local policy applicable to the request is checked (e.g. a temporarily disabled user).
(4) An authorization decision is made based on the identity of the user and the additional information.
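As sketched below, the four steps can be read as a small decision pipeline. The attribute names, policy structure and example data are assumptions made for illustration only; they do not come from the paper or from any specific Grid toolkit.

    # Hypothetical authorization pipeline following the four steps above.
    def authorize(user, resource, local_policy, acl):
        # (1) Additional information associated with the user/session.
        session = {"groups": set(user["groups"]), "role": user["role"], "valid": user["valid"]}

        # (2) Additional information associated with the protected resource.
        required = acl.get(resource, {"roles": set(), "groups": set()})

        # (3) Local policy applicable to the request (e.g. a disabled user).
        if user["name"] in local_policy.get("disabled_users", set()) or not session["valid"]:
            return False

        # (4) Decision based on identity plus the gathered information.
        return session["role"] in required["roles"] or bool(session["groups"] & required["groups"])

    acl = {"storage-element-1": {"roles": {"admin"}, "groups": {"physics"}}}
    alice = {"name": "alice", "groups": ["physics"], "role": "analyst", "valid": True}
    print(authorize(alice, "storage-element-1", {"disabled_users": set()}, acl))  # True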
Proposed implementations

An urgent need for research in the field of Grid security has been evident almost since the Grid idea's infancy. Below we present some of the solutions proposed by the most widespread Grid projects.

Globus Security Infrastructure
Globus Security Infrastructure (GSI) is the service that configures the basic security mechanisms in Globus, which is the best known and probably the most widely used end-to-end Grid infrastructure available today (Foster and Kesselman, 1998; Tuecke, 2001). GSI mainly focuses on authentication, which also constitutes the base for other services like authorization and encryption. Authentication in GSI is based on proxies. A special process called a user proxy is created by the user on his/her local Globus host and is given permission to act on behalf of the user for authentication purposes. A temporary certificate is also created for the user proxy. To enable authentication on the resource side, a resource proxy, responsible for scheduling access to a resource, maps global to local credentials/identities (e.g. Kerberos tickets). In order to implement the above, authentication is based on the TLS and SSL protocols, which provide public key-based authentication, message integrity and confidentiality, and on the X.509 certificate PKI mechanism. GSI authentication is analyzed further later in this paper.
Datagrid security
Datagrid is a European Community supported Grid project in which many European institutions participate, either for research and academic purposes or for business purposes (Cancio et al., 2001). The implemented security policy has an extensive rationale to address authorization issues. The proposed authorization model suggests a role-based community in which the virtual organization (VO), defined as an "abstract entity grouping Users, Institutions and Resources in a single administrative domain", decides the roles of its users, the rights associated with each role and the degree of delegation allowed. It also classifies the information managed, and defines roles depending on this classification (e.g. only a number of resources are allowed to store and handle confidential data). The Datagrid authorization model is described in more detail below.

Legion
Legion is an architectural model for the metacomputing environment, implemented in an object-safe language; therefore all entities within the model are represented by independent, active objects. Security in Legion is based on a public key infrastructure for authentication, and access control lists (ACLs) for authorization (Ferrari et al., 1998). However, it can be retargeted to other authentication mechanisms such as Kerberos. Identity, and consequently authentication, in Legion are based on LOIDs. The LOID of each object contains its credential, which is an X.509 certificate with an RSA public key. The concept of PKI is completed by the existence of a CA.
Authorization in Legion follows the local access list model. The ACL associated with any object encodes the permissions for that object. When any method of a Legion object is invoked, the protocol stack associated with the object ensures that the security layer is invoked to check the defined permissions, before the request is forwarded to the method itself. Integrity in Legion is provided at the level of Legion messages. Public keys are used either for encryption of the messages or for hashing (message digest computation), depending on the needs and the criticality of the communication. The whole message or part of it (e.g. only the credentials) can be encrypted or hashed.
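The pattern described above, in which an ACL attached to each object is consulted by the security layer before a method call is forwarded to the method itself, can be mimicked with a small wrapper. This is a hypothetical Python sketch for illustration, not Legion code; the class, method and ACL contents are invented.

    import functools

    def acl_protected(permission):
        # Stand-in for the security layer: consult the object's ACL before
        # forwarding the request to the method itself.
        def decorator(method):
            @functools.wraps(method)
            def wrapper(self, caller, *args, **kwargs):
                if permission not in self.acl.get(caller, set()):
                    raise PermissionError(f"{caller} lacks '{permission}' on this object")
                return method(self, caller, *args, **kwargs)
            return wrapper
        return decorator

    class DataObject:
        def __init__(self, acl):
            self.acl = acl  # e.g. {"alice": {"read"}, "bob": {"read", "write"}}

        @acl_protected("read")
        def read(self, caller):
            return "object contents"

    obj = DataObject({"alice": {"read"}})
    print(obj.read("alice"))   # permitted
    # obj.read("mallory")      # would raise PermissionError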
CRISIS
CRISIS is the security subsystem of WebOS – an effort to simplify the development of wide-area applications while providing efficient global resource utilization (Belani et al., 1998). It emphasizes a number of design principles for highly secure distributed systems, such as redundancy to eliminate single points of attack, lightweight control of permissions, strict process control for each access decision, caching of credentials for performance, as well as short-term identity certificates. Public keys signed by a CA compose the basic authentication mechanism of CRISIS. The signed certificates are attached to all subsequent messages originated by the specific user. In terms of authorization, CRISIS uses the security manager approach. In other words, as all programs execute in the context of a security domain, each domain runs a security manager responsible for granting or refusing access to all local resources. This is an approach that respects the local security policies of the Grid participant institutions.

Selection of an authentication and authorization mechanism

As was mentioned in the previous section, GSI attempts to address other security issues like confidentiality, integrity, authorization, etc. based on its authentication mechanism. The current Globus toolkit implements authorization by the mapping of a global name (e.g. the DN in the X.509 certificate) to a local UNIX user account and then by the use of standard UNIX access control mechanisms (a sketch of this mapping follows at the end of this section). Consequently, if evaluated in terms of authorization and security logging, GSI can be rated as cumbersome to manage, which makes it hard to preserve autonomy of local security policies. On the other hand, Datagrid does not propose its own authentication mechanism, but rather adopts the GSI authentication solution. Security mechanisms in Legion are hard-coded into the security architecture, thus making incorporation of new standards difficult. Legion certificates do not have a time-out; therefore the period of time during which a certificate is vulnerable to attack is not limited. Also, multiple sign-on, which constitutes one of the major problems in Grid environments, seems not to be addressed by Legion's security. Finally, within each domain participating in the Legion Grid, it is Legion and not the domain administrators that dictates protection measures. CRISIS, although proposing a more complete security architecture than Legion, still has some deficiencies. Local administrators are not allowed to choose the security mechanism used, and single sign-on has not been efficiently addressed. From this review, it becomes obvious that the most complete and robust security solutions proposed are GSI authentication and Datagrid authorization.
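The global-name-to-local-account mapping mentioned above is commonly expressed as grid-mapfile-style entries pairing a certificate distinguished name with a local user name. The sketch below parses such entries; the entries themselves are invented, and details of the real file format and location can vary between installations, so treat this only as an illustration of the idea.

    # Hypothetical grid-mapfile-style content: quoted DN, then a local account.
    GRIDMAP = '''
    "/O=Grid/OU=Example/CN=Alice Analyst" aanalyst
    "/O=Grid/OU=Example/CN=Bob Builder"   bbuilder
    '''

    def parse_gridmap(text):
        mapping = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            dn_part, _, local = line.rpartition('"')
            mapping[dn_part.strip('"').strip()] = local.strip()
        return mapping

    mapping = parse_gridmap(GRIDMAP)
    print(mapping["/O=Grid/OU=Example/CN=Alice Analyst"])  # aanalyst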
GSI authentication
The GSI developed for the initial Globus toolkit focuses on just one problem, authentication. GSI provides inter-domain security protocols that bridge the gap between the different local authentication solutions, by mapping authenticated Globus credentials into locally recognized ones at a particular site. In GSI, credentials are represented by certificates, which specify the name of the entity and other additional information, like the public key, that can be used to identify the entity. Certificates in GSI follow the standard X.509 format. The concept of the CAs creating, maintaining, distributing and revoking entities' certificates also integrates authentication in GSI. However, the accuracy of the entity's identity depends on the trust placed on the CA that issued that certificate in the first place. Thus the authentication algorithm must first validate the identity of the CA as part of the authentication protocol. This is implemented by the authentication algorithm defined by the TLS protocol. GSI addresses efficiently the single sign-on problem that concerns all Grid security managers, using the notion of proxies. An entity may delegate a subset of its rights to another entity (another program, a spawned child-process, etc.) by creating a temporary identity. Possession of a temporary identity allows the proxy to impersonate the original entity. A proxy is identified by an impersonation certificate, which is an X.509 certificate signed by the original entity or a previous proxy of the original entity. In this way it is possible to create a chain of signatures and delegations terminating with the CA that issued
the initial certificate. By checking the certificate chain, processes started on separate sites by the same user can authenticate to one another and to required local resources, without the user needing to send his/her initial credentials to the site. A proxy may also have associated with its certified credentials a specification of what operations it is allowed to perform, in which case we are talking about a restricted proxy. Another feature of delegated credentials that contributes to security enhancement is short-term certificates. These are normal X.509 certificates with a short expiration time, typically 12-24 hours, which minimize the risk of a stolen certificate and eliminate the need to maintain complex and huge certificate revocation lists. GSI's ability to address the above issues is enhanced by coding all security algorithms in terms of the Generic Security Service (GSS) standard. GSS defines a standard procedure and API for obtaining credentials (passwords or certificates), for mutual authentication (requestor and resource), and for message-oriented signature, encryption and decryption. The important point about GSS is that it is independent of any particular security mechanism and can be layered on top of different security methods. For example, the GSS standard defines how GSS functionality should be implemented on top of Kerberos and public key infrastructure. A negotiation mechanism is also supported, which allows support for different security mechanisms simultaneously in the Globus environment.
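A consequence of the chain-of-delegation and short-lifetime design is that a relying party can make two cheap checks before full path validation: that each certificate in the chain names the next one as its issuer, and that the leaf proxy is short-lived. The sketch below shows only those two checks, using the third-party cryptography package; it deliberately omits signature verification and the proxy-specific X.509 extensions, so it is an illustration of the idea rather than a GSI implementation.

    from datetime import timedelta
    from cryptography import x509

    def plausible_delegation_chain(pem_chain, max_proxy_life=timedelta(hours=24)):
        # pem_chain: PEM-encoded certificates, leaf (proxy) first, CA last.
        certs = [x509.load_pem_x509_certificate(pem) for pem in pem_chain]

        # Each certificate should be issued by the subject of the next one,
        # mirroring the chain of signatures and delegations described above.
        for child, parent in zip(certs, certs[1:]):
            if child.issuer != parent.subject:
                return False

        # Short-term proxy credential: typically 12-24 hours.
        leaf = certs[0]
        lifetime = leaf.not_valid_after - leaf.not_valid_before
        return lifetime <= max_proxy_life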
Datagrid authorization
As was previously mentioned, in an access control decision there are usually global and local rules and policies to take into account. Datagrid authorization takes both pieces of information into account in order to make the decision. Global rules and policies are managed by the Virtual Organization Membership Service (VOMS), while each local authorization system manages the local rules and makes the decision to grant or reject access requested to any of its administered resources. The latter is mostly based on access control lists. In such a computing environment it is not feasible to maintain authorization information for all individual users at every site. Therefore, in order to define global authorization it is convenient to define the concept of the VO. A VO is an abstract entity grouping users, institutions and resources in a single administrative domain. A resource provider (RP) is an institution offering resources (e.g. computing elements, storage elements, and network) to other parties (VOs) according to specific service level agreements.

From the authorization point of view, a Grid is established by enforcing agreements between RPs and VOs, where resource access is controlled by both parties. In order to separate the different roles that the two parties have, authorization information is classified into three categories:
(1) Information regarding the relationship of a user with the VO he/she belongs to: groups, roles, capabilities.
(2) The local general security policy of the RP (general rules, algorithms and protocols that this RP has selected to protect its resources).
(3) Access control lists imposed by the local RP security manager to restrict the access of users. Restrictions can be imposed on the specific VO the user belongs to, on the user him/herself, on the specific group of the VO the user belongs to, etc.
In order to fulfil the above, the user must be provided with a certificate which proves both his/her identity and his/her membership of a specific VO. This job is accomplished by VOMS, which is considered in the Grid computing environment as a trusted service. VOMS signs a set of information (the user's identity and the groups he/she belongs to) with its private key, so that its validity can be checked at the requested services using the public key of this trusted service, which has already been distributed. This new credential is then included in the user's proxy certificate, which is signed by the user's normal certificate. A user can become a member of as many VOs as he/she needs in order to gain the necessary access and accomplish his/her tasks. From the RP's side, authorization is based on LCAS and LCMAPS. LCAS stands for Local Center Authorization Service and allows authorization decisions based on user credential information and service request characteristics. LCMAPS stands for Local Credentials Mapping Service and is a library providing a policy-driven framework for acquiring and enforcing local credentials, based on the complete security context. LCMAPS also proposes a policy language whose main advantage is simplicity for the system administrators, as opposed to other similar policy languages.
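The sign-then-verify flow described for VOMS can be illustrated with a toy assertion. Real VOMS credentials are X.509 attribute certificates embedded in the proxy; the sketch below substitutes a JSON blob and an Ed25519 key pair purely to keep the example short, so all names and fields are invented.

    import json
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Stand-in for the VOMS service key pair.
    voms_key = Ed25519PrivateKey.generate()
    voms_public = voms_key.public_key()   # assumed to be pre-distributed to RPs

    # VOMS-style assertion: the user's identity plus VO groups and role.
    assertion = json.dumps({
        "subject": "/O=Grid/OU=Example/CN=Alice Analyst",
        "vo": "biomed",
        "groups": ["/biomed/analysis"],
        "role": "analyst",
    }).encode()
    signature = voms_key.sign(assertion)

    # A resource provider holding the VOMS public key verifies the assertion
    # before feeding the attributes into its local authorization decision.
    try:
        voms_public.verify(signature, assertion)
        print("assertion accepted:", json.loads(assertion)["groups"])
    except InvalidSignature:
        print("assertion rejected")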
Conclusions

The GSI authentication scheme, implemented initially in Globus and later in many other Grids, as well as the authorization solution proposed by the European Datagrid project, are both considered complete, well-defined, up-to-date and consistent mechanisms. Of course, there are other things that a well-defined security
architecture should not omit, in order to further reduce the security risk. Network security issues should be addressed in a security architecture of a distributed Grid environment with so many heterogeneous participants. Proper placement and configuration of the services/ports of firewalls and intrusion detection systems should be considered. Encryption and digital signatures always have a place in interconnected worlds to preserve information confidentiality and integrity, but also trust. Effective and cooperative security administration should be a key factor in a security architecture. Finally, effective security logging could result in intrusion detection and the identification of malicious or accidental security violation attempts, and could help security administrators investigate suspicious activities. There should be a specific process for reviewing security logs, and reviewers from different domains participating in the Grid could exchange their results whenever needed, to achieve better overall protection. All the above systemic or procedural measures should surround authentication and authorization to form the security architecture for the Grid in question. As the Grid matures, security problems are more clearly defined, emerging technologies are standardized and solutions are proposed. Current Grid environments are making efforts to incorporate security within their functionality and to implement architectures which are continuously updated to address the arising security problems. The drawback is that there is not enough experience in the peculiar Grid environment to help perfect the existing security mechanisms, and experience from conventional stand-alone and centrally managed environments could leave a lot of gaps. Perhaps the most obvious direction for immediate future work is further testing and deployment of the existing systems. This paper can be used as an indicative help for the Grid participant institutions which would like to improve their security architecture or build a new
one from scratch. A wide-ranging discussion of security issues in the Grid environment is presented here, based on the knowledge that has been acquired from the security architecture configurations of existing Grids, as these were described earlier in this paper. Since this is an ongoing effort, we intend to continue optimizing the security architecture for Grid computing environments by combining the feedback received from real-world experience with the contribution of emerging technologies.
References
Belani, E., Vahdat, A., Anderson, T. and Dahlin, M. (1998), "The CRISIS wide area security architecture", Proceedings of the USENIX Security Symposium.
Cancio, G., Fisher, S., Folkes, T., Giacomini, F., Hoschek, W., Kelsey, D. and Tierney, B. (2001), The DataGrid Architecture, Version 2, CERN, Geneva.
Ferrari, A., Knabe, F., Humphrey, M., Chapin, S. and Grimshaw, A. (1998), A Flexible Security System for Metacomputing Environments, Technical Report CS-98-36, Department of Computer Science, University of Virginia, Charlottesville, VA.
Foster, I. and Kesselman, C. (1998), "The Globus Project: a status report", paper presented at the IPPS/SPDP '98 Heterogeneous Computing Workshop.
Tuecke, S. (2001), "Grid security infrastructure (GSI) roadmap", Internet draft.
Further reading
Foster, I., Kesselman, C., Tsudik, G. and Tuecke, S. (1998), "A security architecture for computational grids", Proceedings of the 5th ACM Conference on Computer and Communications Security, pp. 83-92.
Humphrey, M. and Thompson, M. (2001), "Security implications of typical computing usage scenarios", Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing.
Non-business use of the WWW in three Western Australian organisations Craig Valli
The author Craig Valli is Senior Lecturer – Computer and Network Security, School of Computer and Information Science, Edith Cowan University, Mount Lawley, Australia.
Keywords Worldwide web, User studies, Cost accounting, Australia
Abstract This paper is an outline of findings from a research project investigating the non-business use of the World Wide Web in organisations. The study uncovered high non-business usage in the selected organisations. Pornography and other traditionally identified risks were found to be largely non-issues. MP3 and other streaming media and potential copyright infringement were found to be problematic. All organisations had end-users displaying behaviours indicating significant, deliberate misuse that often used a variety of covert techniques to hide their actions.
Introduction

This paper is based on the findings of a research project that examined the non-business usage of the World Wide Web (WWW) within three selected organisations in Western Australia. The cases were examined as multiple interpretive case studies, using the Klein and Myers (1999) article as a guiding instrument for the conduct of the research. Each organisation was investigated using the same conceptual framework, which was:
(1) Interview and survey of the selected organisation's key stakeholders.
(2) Document review of the selected organisation's policies.
(3) Analysis of log files (cyclical): macro analysis of organisational use of the WWW; generic category analysis of organisational use of the WWW; and individual misuser analysis.
(4) Reporting details of analysis back to the organisation.
(5) Post-analysis interviews.
The three research questions that were examined as a result of the study are:
(1) To what extent is the WWW service used by employees for non-business-related activities in selected organisations?
(2) What is the perceived versus measured reality of non-business use in an organisation?
(3) Does the enforcement of countermeasures such as policy affect the behaviours of users within an organisation?
This paper will outline the method of analysis, including the case organisation profiles, the major findings and the outcomes of the cases and the research.
Method of analysis
The WWW log files for each case were analysed using several log file analysis tools: Cyfin (Wavecrest, 2002), Squid Analysis Report Generator (SARG) (Orso, 2001), Webalizer (Barrett, 2002), pwebstats (Gleeson, 2002) and Analog (Turner, 2003). At least three tools were used on any case to verify the statistics generated and also to provide different lenses for the research situation. Cyfin (Wavecrest, 2002) is a commercially available tool that allows for extensive analysis of a wide range of log files and the adaptation of reports and outputs for analysis. Cyfin allows for high-level graphical reporting of traffic usage patterns. It also categorises traffic into 55 preset
categories and ten customisable categories of URLs. The author used all of the preset categories and created custom categories as necessitated. SARG (freeware) and pwebstats (freeware for educational institutions) produce general statistics with a high level of granularity, making it possible to determine, down to the user and file level, any activity that is generated. In particular, each produces a detailed 24-hour histogram of users' traffic usage, which allowed the author to readily analyse the hours of Web usage by users. Webalizer allowed for a reasonable degree of granularity, but it was not as comprehensive as the SARG outputs. It did, however, confirm the general statistics generated by the other log file processors on a user and overall basis. All interviews conducted were open-ended and were recorded for later analysis by the author. Document review of existing organisational documents relating to the use of network and WWW resources was conducted before analysis of the WWW access logs. Organisation 1 was a large university department with 846 users who consumed 121 Gbytes of traffic in four calendar months. Organisation 2 was a large State Government agency with 1,995 users who accessed 142 Gbytes of material in six months. Organisation 3 was a medium-sized state government agency with 309 users who downloaded 33 Gbytes of material in eight calendar months.
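The per-user volume figures that underpin this kind of analysis can be derived with a few lines of code. The sketch below assumes Squid's native access-log layout (time, duration, client, result code, bytes, method, URL, user, hierarchy, content type); field positions can differ on customised installations, and the sample line is invented.

    from collections import defaultdict

    def volume_per_user(log_lines):
        # Aggregate downloaded bytes per authenticated proxy user.
        totals = defaultdict(int)
        for line in log_lines:
            fields = line.split()
            if len(fields) < 8:
                continue
            size, user = fields[4], fields[7]
            if size.isdigit():
                totals[user] += int(size)
        return totals

    sample = ["1066036250.321 100 10.0.0.5 TCP_MISS/200 4512 GET http://example.com/ alice DIRECT/10.0.0.9 text/html"]
    print(dict(volume_per_user(sample)))  # {'alice': 4512}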
Measurement of non-business-related activities

The concept of non-business usage is not capable of global definition; it is one that is developed through the social interactions that determine the management and use of the WWW system within each organisational context. The definition and delineation of non-business usage are often specified within the organisational usage policy for the WWW. This usage policy for each organisation was used as the basis for the conduct of the investigation of the organisation's WWW access logs. When the prescribed policy was applied, all of the cases had significant levels of non-business usage. Table I quantifies the level of non-business usage for each case. The level of unacceptable usage was determined from the final optimised output from the log file analysis tools. This metric measured the level of usage and was based on each Cyfin category's allocation as acceptable, neutral or unacceptable as a result of applying that case organisation's policy. The allocation of a category status as acceptable, unacceptable or neutral had been confirmed with the organisational stakeholders before any log file analysis began. This allocation allowed for the calculation of the unacceptable volume usage by each case organisation, based on the total downloaded volume for each case. The base rate of AU$200 per Gbyte was calculated based on pricing from Internet service providers (ISPs) at the time of the access by each organisation. During the period examined the average weekly wage for a person working in Australia was $853.40 per week, or $44,376 per annum (ABS, 2002). Based on this average wage, Case 1 could justify a full salary, and Case 2 a 78 per cent salary, to reduce the non-business usage on direct costs alone. This justifiable salary in Case 1 and Case 2, based on the evidence, is contrary to the findings of Mirchandani and Motwani (2002, p. 55):
Some practitioners we interviewed estimated the average cost of Internet abuse to their company (in terms of lost productivity) did not exceed the annual salary of one employee.
The costs in the table above do not attempt to quantify intangibles such as lost productivity. A costing based on wages alone would be a complex measure and would need to take into account information that the organisations were unwilling to divulge. It is reasonable, however, to assume that the costs from non-business usage and productivity losses would be significant. A simple example can be given to highlight this problem, based on evidence from Case 2, which had the organisational wide area network (WAN) links becoming congested with non-business-related material in the afternoon. This congestion of the WAN links for transportation of WWW traffic was affecting the ability of other workers to complete business-related tasks in a timely fashion, thereby causing the productivity losses cited by management.
Table I Levels of non-business use

Case   No. of users   Level of unacceptable usage (%)   Volume (Gb)   Cost of volume ($)   Cost per month ($)   Cost per year ($)   Cost per user ($)
1      846            75                                90.4          18,080               4,254.12             51,049              21.37
2      1,995          56                                80.1          16,020               2,912.73             34,953              8.03
3      309            21                                6.4           1,284                160.50               1,926               4.16
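The direct-cost columns in Table I follow from the unacceptable volume at the stated AU$200 per Gbyte, divided by the observation period and the number of users. The sketch below reproduces that arithmetic for Case 3; the exact observation periods behind the published per-month figures are an assumption, and small differences from the table reflect rounding of the published volume figure.

    RATE_PER_GB = 200  # AU$ per Gbyte, as stated in the text

    def direct_costs(unacceptable_gb, users, months_observed):
        cost = unacceptable_gb * RATE_PER_GB
        per_month = cost / months_observed
        return {
            "cost_of_volume": round(cost),
            "cost_per_month": round(per_month, 2),
            "cost_per_year": round(per_month * 12),
            "cost_per_user": round(cost / users, 2),
        }

    # Case 3: 6.4 Gb unacceptable volume, 309 users, eight-month period.
    print(direct_costs(6.4, 309, 8))
    # {'cost_of_volume': 1280, 'cost_per_month': 160.0, 'cost_per_year': 1920, 'cost_per_user': 4.14}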
If a simple atomic measure of legitimate business activity, such as a database transaction, takes three seconds longer to complete as a result of the congestion caused by non-business usage, then it appears comparatively small. However, extrapolating this scenario to 120 employees who perform an average of 20 database transactions per hour, as cited by management, gives 1 minute lost per employee per hour, or 120 minutes lost per hour across the group. Taken at an hourly rate of pay of $18 this equates to a $36 loss per hour; at 4 hours a day this costs $3,168 per month (based on 22 working days a month) in lost productivity. This simple example highlights the magnitude of the problem, when for Case 2 this 3-second-per-transaction loss outstrips the cost of downloaded unacceptable bandwidth by 9 per cent. Not all non-business usage was a result of user action via direct request and download of content. In all of the case organisations examined, the banners/advertisements category consumed significant bandwidth. Much of this material serves no purpose other than to advertise products; such items are typically graphical in nature and are non-cacheable objects. In some pages, banners and advertisements can account for up to 80-90 per cent of downloaded page size (Valli, 2001). Misusers in all case organisations investigated displayed behaviours or Web activity patterns that would indicate they were deliberately avoiding detection. In all case organisations some frequent misusers had conducted forward intelligence and environment scanning and had gained target Web site links in the form of URLs or IP addresses via other channels, possibly via home computers or targeted e-mails. Evidence for this was indicated by users who, on accessing the WWW for the first time on a given day, entered selected links into their browsers. During the studied period there was no attempt made to search for this content via search engines to find the Web site address or content. These links were unique; they were not recorded in any previous activity by the user on the WWW and were highly focused (often down to specific files). In addition, the URLs were often complex and long in structure, making short-term memorisation by users a very difficult if not impossible task. Many of the pre-researched sites accessed by these users were no longer accessible by the author when examining and verifying the data contained in the log files. Some of the accessed sites had notices from Web site providers saying that the site had been removed because it contained inappropriate content. This situation would indicate that these sites are rotated through in
order to avoid detection or prosecution (McCandless, 1997). In Case 1 users were trying to avoid detection by accessing the material early in the morning, for example between 1 a.m. and 4 a.m., when the probability of detection through natural surveillance, such as the use of active audit trail monitoring or a "walk-in", would be low. This behaviour was a result of the users having intimate knowledge of the workings of the organisation, and socially engineering their activities to reduce the probability of detection. In Case 3 users were trying to mask non-business use by hiding their activity amongst large volumes of legitimate traffic. Unlike the previous cases of detection avoidance, the users actively researched the sites they were visiting, but did so in a stilted and controlled manner. The threads of evidence were there to be found, but they were buried under considerable amounts of legitimate traffic generated by the misuser. This avoidance is simple and exploits a weakness in the modus operandi of Web log file analysis tools. The role of a Web log file analyser is to reduce the thousands, sometimes millions, of log file entries into a neat, reduced executive report that a system administrator analyses to interpret trends and patterns in traffic consumption. Typically, most log file analysers will list the most frequently accessed pages from the log file, or they will employ some pre-determined limit, like the top 100 URLs, set typically by the administrator. Similarly, log file analysers will often list the heaviest consumers of Web traffic. So, this form of avoidance is effective, as the pages accessed by the misuser will have a low frequency count for the abusive activity and, consequently, will not be output into the log file analyser's report. The content categorisation ability of Cyfin, using policy guidelines as the arbiter for unacceptable or acceptable content, is a potentially more effective means of measuring non-business usage.
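The blind spot described above can be made concrete with a toy comparison between a frequency-based top-URL report and a policy-category aggregation. The URLs, categories and byte counts below are invented; the point is only that a single-instance URL carrying a large download never reaches a top-100 list, while its volume still dominates a category view.

    from collections import Counter, defaultdict

    # (url, category, bytes) records distilled from a proxy log.
    records = [(f"http://intranet.example/page{i}", "business", 20_000)
               for i in range(200) for _ in range(5)]
    records.append(("http://files.example/one-off/big.avi", "streaming-media", 700_000_000))

    top_100 = Counter(url for url, _, _ in records).most_common(100)
    print(any(url.endswith("big.avi") for url, _ in top_100))   # False: invisible to the report

    bytes_by_category = defaultdict(int)
    for _, category, size in records:
        bytes_by_category[category] += size
    print(dict(bytes_by_category))   # the one-off download dominates its category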
What is the perceived versus measured reality of non-business usage?

Each case organisation's key stakeholders held views and opinions that were not consistent with the activity that was discovered in the log files and documented as policy. In all cases no organisation demonstrated any awareness of levels of usage of the WWW. When asked what levels of non-business usage they believed were occurring, staff did not know or, at best, speculated. This situation further confirms suspicion about the accuracy of management surveys that are conducted on non-business usage or misuse of the
WWW. The three cases by no means provide a conclusive finding, but they do concur with the reservations of Holtz (2001) about the validity of such claims. All cases cited the cost of monitoring software and its deployment as an impediment to the use of ongoing systemic monitoring. This observation concurs with research conducted by Mirchandani and Motwani (2002, p. 26), where "Most companies already have the means to track the Internet usage of their employees but choose not to do so because of the effort involved." It is apparent in all cases that the initial cost of purchasing monitoring software could see a return on investment realised within two to 12 months, based on the direct cost of the Internet bandwidth arising from non-business usage alone. This situation indicated a large gap between management perception of what was occurring and the reality of the WWW log file data. The reasons for this perception gap could be seen as the management being unaware of the true monetary loss that non-business usage represented, but the reasons may go deeper into the organisational psyche than this. The implementation of countermeasures may simply be perceived as "yet another job" for which the management becomes responsible. The reluctance to deploy these measures may be a fear of uncovering a greater evil to cope with, or it may be that management is uncomfortable about monitoring employees' activities. A major suspicion the author had was that much of the literature regarding the accessing of pornographic content in the workplace suffered from hyperbole and exaggeration. In each case organisation there was a belief by management that pornography was the organisation's biggest risk exposure and most likely source of non-business usage. The three case studies showed minimal or nil use of pornography within the organisations. The first case study had 3.1 per cent of downloaded material as pornography, with 20 users downloading 72.6 per cent of this content. In the second case study only 0.15 per cent of total downloaded content was pornography and, similarly, in the third case study only 0.13 per cent. In the latter two cases this usage could easily be explained by unsolicited pornographic e-mails. If we were to believe some of the existing body of literature, such as the articles by Greengard (2000) or Hickins (1999), it would have been reasonable to find significantly higher download volumes and levels of participation in accessing and viewing of pornography. The first and second case studies demonstrated a high download of MP3 and streaming movie or audio type files such as AVI, MOV and WMV. In
most cases the audio-based content downloaded by the misusers was in breach of copyright and organisational policy. All the case organisations operated firewalls and countermeasures that made it difficult for users to access streaming media via the standard file-sharing tools facilitating such purposes, like Napster, Gnutella and Kazaa. The only real alternative for users was to acquire this content through the Web browser, via the organisational http proxy server, from either http or ftp sites providing this type of content. This situation has significant implications for the use of the WWW in organisations. The main perceived threat of pornography was largely not realised in two out of the three cases. The illegal accessing of streaming content such as MP3, AVI and WMA and other file types that contain copyright material has been shown to be a greater risk exposure in all cases. The risk of prosecution is now becoming a reality as a result of the activism of the Recording Industry Association of America (RIAA) and similar bodies in pursuing transgressors of copyright law (Cox, 2003; Naraine, 2003). In some recent cases that the RIAA is prosecuting it is seeking damages of US$125,000 per breach of copyright. The RIAA has even gone to the point of advocating for the right to attack systems that hold copyright material wherever they are located on the Internet (McCullagh, 2001). The cases found nexus points where the perception of management and the reality of usage of the WWW did not intersect, and these were credible and substantial. The first nexus point is that of actual usage: humans are poor estimators, and the examined cases have been shown to be no exception. Organisational management was surprised by the level of non-business usage uncovered by the study of their WWW log files. The second nexus was the reality of the costs associated with non-business usage. On the simple measure of raw bandwidth cost alone, Case 1 and Case 2 could justify a salary for the reduction of Internet bandwidth cost, yet all organisations were reluctant to measure usage or expend effort in doing so. The third nexus was that, due to a lack of diligence or of the business intelligence that would be generated from proper analysis of WWW traffic, many of the risk exposures for the organisations were badly aligned with the reality found in the WWW log files.
Are countermeasures effective at reducing non-business activity?

Each case used policy with varying levels of awareness and transparency to achieve
communication of the organisation's wishes with regard to employee access and use of the WWW. Case 1 had the most pornography even though it had a policy that clearly stated penalties for accessing such materials. Furthermore, Case 2 applied policy and a content-filtered approach for their WWW traffic, which resulted in low access of pornography. However, Case 3 used no content filtering and achieved similar levels of end-user compliance regarding downloaded pornographic content as Case 2. The main difference between Case 3 and Case 1 was that the former actually had cases where policy and sanctions had been fully enforced as per the published policy, with transgressors having punitive sanctions applied against them. This would indicate that there is at least a possible causal link between policy effectiveness and enforcement. In Case 1, there were instances of users having been caught grossly transgressing the acceptable usage policy with no subsequent exercise of the punitive threats eventuating. This inaction would have sent a signal to existing users that transgression of the policy resulted in no penalty being applied. The IS manager of that case organisation, in interview, also cited a lack of enforcement as a reason for not seeing monitoring as an important or necessary event. No case organisation was actively monitoring WWW usage with sufficient windows of opportunity for detection and remediation of end-user activity via organisational policy. Case 3 was the only organisation that regularly analysed log files on a monthly basis, with minimal analysis on the part of the IT services manager. All cases lacked the basic business intelligence resulting from the simple and timely analysis of WWW proxy server log files. There was no formal reporting of usage or subsequent follow-up based on any anomalies discovered when analysis was performed. This was a serious deficiency in all cases. All cases had instances of users who left the particular organisation within the time frames examined by the author. In the cases investigated, a marked, large increase in baseline usage by the identified users was noted during the notice period. This issue is significant and potentially shows that policy will only work when the threat of a credible and enforceable sanction is realisable (Harrington, 1996). For the departing employee, some of the most powerful methods of sanction had become greatly reduced in potency. The threat of dismissal and any apparent financial penalty for such action is a paramount threat used by many organisations to maintain control. The power of this threat is largely removed, as the departing employees had another job/position to take up and any potential loss of income is at most four weeks'
pay. An additional sanction would be the threat of losing a favourable reference. This threat is of limited potency, as the user has in all probability secured employment elsewhere and the consequences of misuse are lessened by the employee's impending departure. The problem is a balance between maintaining control and human relations issues such as trust, perceived justice and job satisfaction. If the organisation adopts countermeasures that are perceived by staff to be draconian in approach, then employee distrust of the organisation could occur. An extreme, although not entirely implausible, scenario could see the consequence of the tendering of a resignation by an employee being the automatic loss or severe restriction of Internet-based access. This action could be perceived as a lack of management trust by the employee and could result in end-user resentment and users seeking to rectify the injustice by taking more liberties with the system. These behaviours are already apparent in other studies dealing with perceived injustice in the workplace (Greenberg, 1990; Lim, 2002). Another possibility would be to increase the level of surveillance of the departing employee's Internet activity to monitor it for anomalous behaviour; this too is not without its problems. A study by Chalykoff and Kochan (1989) found that computer-based monitoring had a direct and significant influence on the overall job satisfaction of employees. In an article, George (1996, p. 463) outlines several cases and cites studies where employee health had been severely affected by such overt monitoring. Many WWW filtering products are often advertised as a panacea for organisational control of incoming content. It has been found in previous studies that the product rhetoric often does not meet reality when it comes to management of content (George, 1996; Hunter, 2000; Nunberg, 2001). The tool used to conduct the preliminary investigation into all of the case organisations was Cyfin; it could be reasonably argued that this attempts a static examination of content that an active content filtering tool such as Cyber Patrol would perform. In Case 1, the organisation did not have content filtering installed, but the initial analysis had a very high percentage of sites in the unknown/unclassified category. The percentage of unclassified sites was greatly reduced when it was realised that many of these sites were foreign language sites. The classification tool used in the study was developed in the USA, as are many available content filtering tools. It is reasonable to assume that many of the customers of these products would have English as a first or second language. Therefore, the developers would spend much of
the time developing the detection of English-based content, as opposed to content in other languages, for example Mandarin Chinese or Hindi. To complicate matters further, the foreign language sites often used colloquial terms for the content they were supplying to users. Unless a person has lived in a particular culture, they often do not pick up many of the subtle colloquial nuances. This lack of enculturation is possibly a result of developers having learned or acquired their language skills by conversing with other second-language speakers or from textbooks. This weakness in linguistic diversity could easily translate to content filtering systems development, if companies employ second-language speakers to aid in judging content. Many of the sites accessed by the users in Case Organisation 1 were oriented towards Indian nationals and expatriates, and under Indian law were plainly illegal. The same sites under Australian and American law would hardly rate a mention as pornographic sites, due to cultural and religious bias. The Indian sites often only displayed partial nudity and no graphic depictions of sex acts. The sites would have mostly received lower than restricted ratings in Australia. If companies are selling these content filtering products, how do they cope with these extremes of cultural or religious bias? The simple answer is that they cannot cope, and herein lies a dangerous trap for organisations utilising filtering, if they have employees from diverse cultural backgrounds. What may seem inoffensive and acceptable to a content filtering system may offend and discriminate against an employee due to culturally dichotomous views of acceptability, making for some potentially interesting industrial arbitration or civil court cases. Similar cases have already been heard in the US legal system, with Microsoft and Chevron as defendants and involving US$2.2 million in payouts to the plaintiffs in the cases heard (Greengard, 2000). Filtering proved ineffective against determined non-business users in Case 2. The users employed pre-researched or known URLs to access information as a method of avoiding detection by the content filtering system. By using URLs gleaned from research conducted externally to the organisation, or received in e-mail from communities of practice, no extended audit trail is left in the organisation's log files, as many of these URLs are unique and single instance. By obtaining URLs in this manner users can keep ahead of a content filter manufacturer's ability to research, classify and distribute a block for an inappropriate site. Under most statistical analysis a single-instance URL will not appear in the report unless the download itself was comparatively large in size. Case 1 and Case 3 also showed users who were
taking the same approach even though their connections were not filtered. In all cases examined where there was significant misuse of the system, a small number of users were responsible for downloading large percentages of the total download. In some extreme cases, one or two users were consuming 80-90 per cent of bandwidth for a particular unacceptable category. Mitigation of this type of usage could easily be accomplished with some human resource management, or individual counselling of the users concerned about their usage profiles.
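The observation that one or two users can account for 80-90 per cent of an unacceptable category's bandwidth suggests a simple supplementary report: each user's share of each category's total. The sketch below is a hypothetical illustration with invented numbers.

    def user_shares(usage):
        # usage: {category: {user: bytes}} -> {category: [(user, share_percent), ...]}
        shares = {}
        for category, per_user in usage.items():
            total = sum(per_user.values()) or 1
            ranked = sorted(per_user.items(), key=lambda kv: kv[1], reverse=True)
            shares[category] = [(user, round(100 * size / total, 1)) for user, size in ranked]
        return shares

    usage = {"streaming-media": {"user17": 41_000_000_000,
                                 "user02": 3_000_000_000,
                                 "user55": 1_000_000_000}}
    print(user_shares(usage)["streaming-media"][0])   # ('user17', 91.1)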
Summary

A significant level of non-business usage was encountered in the organisations examined. Case 1 had 74.6 per cent, and Case 2 had 56.4 per cent, of bandwidth consumed by non-business usage if policy was applied. Case 3, while not having high non-business bandwidth consumption, still had 20.6 per cent of non-business-related use occurring. Even Case 3, with the lowest non-business usage volume, is seeing one in five kilobytes being wasted by non-business use; this is a significant level of misuse. Case 1 and Case 2 could be argued to have high to very high levels of non-business usage by the simple metric of bandwidth consumed. All cases highlighted that the issue of non-business use and its true cost goes beyond the technical metrics of bandwidth measurement, reaching into a range of organisational areas and issues. For example, Case 2 had issues with non-business usage of the WWW impacting on the organisation's ability to deliver information in a timely manner to stakeholders, due to congested intra-organisational network links. Case 1 had IT staff reluctant to perform any measurement of misuse due to past inaction on the part of management when offenders were identified. All cases had staff becoming deliberately deceptive and covert in their actions to access non-business-related material. Policy was used in all cases, with varying degrees of success, as a countermeasure to non-business activity. Case 1 had problems with effective enforcement of policy, which meant that it was largely ignored by stakeholder groups. Case 2 had policy but it was not overt, and it relied on a content filtering approach as a substitute for effective management. Case 3 appeared to be an exemplar in the use of policy to reduce non-business activity. The cases demonstrated that policy, or the lack of policy enforcement, has some effect on end-user behaviour. Cases 2 and 3 highlighted a policy issue with staff members who have resigned and have a
period to work in the existing organisation before commencing their new employment. Evidence in these cases would suggest that, as a countermeasure, policy is greatly reduced in efficacy once a person has resigned. All cases had users attempting to mask behaviour as a result of the countermeasures in place. Many of the identified high-level non-business users of the WWW in the organisations had adapted and developed mechanisms that would defeat countermeasures such as log file analysis or content filtering. These cases apart, it would appear that even the existence of policy had a positive effect on user behaviours. There was a lack of alignment between the actual non-business usage occurring and management perceptions of the level of non-business usage. Each of the examined cases had significant non-business usage of the WWW by end-users. In the initial interview sessions, the key stakeholders within the organisations at best appeared to speculate about the level of non-business usage occurring. Case 3 was the exception, where the IT services manager reviewed a monthly report on WWW usage, but when asked about bandwidth issues he said he was unsure of the exact amount of bandwidth consumed and had little or no idea of any misuse. In the post-analysis presentations, the reaction from management was typically surprise and amazement at the depth and severity of the level of actual non-business usage discovered. This disparity again indicates a large gap in the cases between actual activity and perceived activity. It could be said that the key stakeholders were unaware of the WWW modus operandi within their respective organisations. Management, perceiving no real problem with WWW usage or being unwilling to address the issue, did not encourage or develop appropriate feedback and reporting mechanisms to deal with non-business use of the WWW. Even the simplistic use of log file analysis tools to provide basic day-to-day monitoring was not occurring within the examined organisations, a dangerous practice indeed.
References
ABS (2002), "Measuring Australia's economy – section 6. Prices and income – average weekly earnings", available at: www.abs.gov.au/Ausstats/ (accessed 30 June 2003).
Barrett, B.L. (2002), "Webalizer", available at: http://directory.fsf.org/webauth/misc/Webalizer.html
Chalykoff, J. and Kochan, T. (1989), "Computer-aided monitoring: its influence on employee job satisfaction and turnover", Personnel Psychology, Vol. 42 No. 4, pp. 807-34.
Cox, B. (2003), "RIAA trains anti-piracy guns on universities", available at: www.internetnews.com/bus-news/article.php/1577101 (accessed 6 June 2003).
George, J.F. (1996), "Computer-based monitoring: common perceptions and empirical results", MIS Quarterly, Vol. 20 No. 4, pp. 459-81.
Gleeson, M. (2002), "pwebstats (version 1.38)", available at: http://martin.gleeson.com/pwebstats/
Greenberg, J. (1990), "Employee theft as a reaction to underpayment inequity: the hidden costs of pay cuts", Journal of Applied Psychology, Vol. 75 No. 5, pp. 561-8.
Greengard, S. (2000), "The high cost of cyberslacking", Workforce, Vol. 79, December, pp. 22-4.
Harrington, S.J. (1996), "The effect of codes of ethics and personal denial of responsibility on computer abuse judgments and intentions", MIS Quarterly, Vol. 20 No. 3, pp. 257-78.
Hickins, M. (1999), "Fighting surf abuse", Management Review, Vol. 88 No. 6, p. 8.
Holtz, S. (2001), "Employees online: the productivity issue", Communication World, Vol. 18 No. 2, pp. 17-23.
Hunter, C.D. (2000), "Social impacts: Internet filter effectiveness – testing over and under-inclusive blocking decisions of four popular Web filters", Social Science Computer Review, Vol. 18 No. 2, pp. 214-22.
Klein, H.K. and Myers, M.D. (1999), "A set of principles for conducting and evaluating interpretive field studies in information systems", MIS Quarterly, Vol. 23 No. 1, pp. 67-94.
Lim, V.K.G. (2002), "The IT way of loafing on the job: cyberloafing, neutralizing and organizational justice", Journal of Organizational Behavior, Vol. 23 No. 5, p. 675.
McCandless, D. (1997), "Warez wars", available at: www.wired.com (accessed 10 May 2003).
McCullagh, D. (2001), "RIAA wants to hack your PC", available at: www.wired.com/ (accessed 6 June 2003).
Mirchandani, D. and Motwani, J. (2002), "Reducing Internet abuse in the workplace", SAM Advanced Management Journal, Vol. 68 No. 1, pp. 22-26, 55.
Naraine, R. (2003), "RIAA targets file-sharing in the workplace", available at: www.atnewyork.com/news/article.php/2112521 (accessed 6 June).
Nunberg, G. (2001), "The Internet filter farce: why blocking software doesn't – and can't – work as promised", American Prospect, Vol. 12 No. 1, pp. 28-33.
Orso, P. (2001), "Squid Analysis Report Generator (Version 1.21)", available at: http://sarg.sourceforge.net/
Turner, S. (2003), Analog (Version 5.24), ClickTracks.
Valli, C. (2001), "A byte of prevention is worth a terabyte of remedy", paper presented at the We-B Conference 2001, Scarborough.
Wavecrest (2002), Cyfin Reporter (Version 4.01), WaveCrest Computing, Melbourne, FL.
PIDS: a privacy intrusion detection system Hein S. Venter, Martin S. Olivier and Jan H.P. Eloff
The authors Hein S. Venter is Senior Lecturer, and Martin S. Olivier and Jan H.P. Eloff are both Professors, all in the Department of Computer Science, University of Pretoria, Pretoria, South Africa.
Keywords Computer networks, Privacy, Safety devices
Abstract It is well known that the primary threat of misuse of private data about individuals comes from within the organisation; proposes a system that uses intrusion detection system (IDS) technologies to help safeguard such private information. Current IDSs attempt to detect intrusions on a low level whereas the proposed privacy IDS (PIDS) attempts to detect intrusions on a higher level. Contains information about information privacy and privacy-enhancing technologies, the role that a current IDS could play in a privacy system, and a framework for a privacy IDS. The system works by identifying anomalous behaviour and reacts by throttling access to the data and/or issuing reports. It is assumed that the private information is stored in a central networked repository. Uses the proposed PIDS on the border between this repository and the rest of the organisation to identify attempts to misuse such information. A practical prototype of the system needs to be implemented in order to determine and test the practical feasibility of the system. Provides a source of information and guidelines on how to implement a privacy IDS based on existing IDSs.
Electronic access The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister The current issue and full text archive of this journal is available at www.emeraldinsight.com/1066-2243.htm
Internet Research Volume 14 · Number 5 · 2004 · pp. 360-365 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566953
Introduction
Personal privacy is often defined in terms of control: one has privacy to the extent that one exerts control over one's personal information. While such a definition of privacy is far from perfect, it highlights a fundamental issue: if control over the use of private information is not actively enforced and monitored, a loss of privacy will occur. Over many years privacy-enhancing technologies (PETs) have been developed that help individuals to retain such control. Examples include Crowds (Reiter and Rubin, 1999), LPWA (Gabber et al., 1999), P3P (Reagle and Cranor, 1999) and PrivGuard (Lategan and Olivier, 2002). Even encryption, when used in a privacy context, is about ensuring that only the intended recipient has access to the information – another example of controlling who gets access to that shared information.
In more recent years, a substantial amount of work has been done to protect personal information after it has been collected by an organisation. This includes work in the legal, policy and technical areas. Examples of laws that restrict use of personal information and require protection of such information include the (well-known) US Privacy Act of 1974 and the EU Data Protection Directive (Pfleeger and Pfleeger, 2003). Privacy policies on Web sites and in other contexts have also become common, whereas a few years ago such policies existed in only a limited number of places.
This paper is specifically interested in work done in the technical area. Examples of technologies proposed or developed in this area include Hippocratic databases (Agrawal et al., 2002), E-P3P (Karjoth et al., 2003; Ashley et al., 2003) and work on decision making in this context (Olivier, 2003a). Clearly, after one's information has been collected by an organisation, personal control becomes much harder to enforce. Some of the proposed solutions in this context do store individual preferences that are taken into account before such information is accessed. In a more general sense they limit access to personal information in the organisational context. This is indeed necessary given the fact that some of the major known breaches of privacy in the past occurred when individuals who had access to personal data misused their privileges to obtain access to that information (GAO, 1993, 1997).
This material is based on work supported by the National Research Foundation (NRF) in South Africa under Grant number 2054024, as well as by Telkom and IST through THRIP. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and therefore the NRF, Telkom and IST do not accept any liability thereto.
Access rules might, however, not be sufficient. In the case of more traditional access control, it is common knowledge that intrusion detection systems (IDS) augment such systems in a natural manner. These systems are not only able to thwart attacks that might otherwise have breached the normal access control system, but also lead to an improved understanding of the strategies used by attackers, as well as improved knowledge about the frequency and severity of actual attacks launched against an organisation. This raises the question: Is it possible to use an IDS that is specifically tailored to detect attacks against a collection of personal information stored by an organisation, and then react to such attacks? This paper addresses this question by proposing a privacy IDS (PIDS) that does exactly that. The application of IDS techniques to enhance privacy offers interesting challenges and opportunities. A challenge is that it is extremely difficult to distinguish between legitimate access to private information and access by someone who, under slightly different circumstances, should have been allowed access but is actually “just browsing” when access is made. It is therefore necessary to cater for a very high number of false positives and false negatives. We contend that it is possible to react in a manner that makes the impact of a false positive or negative tolerable, but still improves the privacy of stored data. An opportunity that arises is that, given the specific domain of application, it becomes possible to take more direct steps to deal with (possible) attacks that are in progress. This paper is structured as follows. The next section contains further information about information privacy and privacy-enhancing technologies. After that, the role that a current IDS could play in a privacy system is examined. The section thereafter proposes a framework for a PIDS, and the last section concludes the paper.
Privacy
Perhaps the most widely accepted principles for parties processing personal information are the OECD guidelines (www.oecd.org). The guidelines consist of the following principles: collection limitation, data quality, purpose specification, use limitation, security safeguards, openness, individual participation and accountability. A lack of space precludes a detailed discussion of all these principles here. We will therefore only consider those that are directly related to the specific area this paper considers; the names of the remaining guidelines have been listed for the sake of completeness and to give the context for those that are discussed.
The three principles that apply directly to the problem at hand are the use limitation, security safeguards and accountability principles. The first of these is central to the problem at hand: only when a valid reason exists and the intended use is compatible with the reasons for collecting the data in the first place should the action to use the information be allowed to proceed. The only exceptions are when the law allows the intended use, or where the subject has given permission for such use. IDS technology promises to detect the use of information outside these boundaries.
The second principle singled out above requires that security safeguards be in place to protect private data. Given the knowledge that a specific domain implies about the data to be protected, an IDS can use such knowledge to be better able to identify possible intrusions. It also has a greater range of available actions to take in the case of possible intrusions that it detects. The following section explores this possibility in more depth.
The final principle is that of accountability. In an organisation this will typically imply that the organisation is accountable for misuse of private information by any of its employees. This clearly places an obligation on the organisation to identify possible misuses of such information by employees. While sophisticated logging of access requests can solve some aspects of this problem, logging does not normally provide all the functionality that an IDS does. Moreover, it has been reported that logging leaves loopholes, for example when information is retrieved via a system other than that providing the logging information (GAO, 1997). In addition, current approaches to logging are implemented at too high a level and do not give details of the specific access to information contents, but rather to the containers, such as databases, in which the information is held. Logging is also done in a static manner, whereas IDSs are able to detect intrusions dynamically and in real time.
Since the PIDS to be proposed in this paper is a form of privacy-enhancing technology, it is useful to consider other forms of privacy-enhancing technologies that the IDS can interact with or complement. The Layered Privacy Architecture (LaPA) (Olivier, 2003b) provides a useful framework for the discussion of privacy-enhancing technologies. LaPA classifies PETs into four layers, namely the personal control layer (PCL), the organisational safeguards layer (OSL), the private (confidential) communication layer (CCL) and the identity management layer (IML). The PCL includes technologies such as P3P (Reagle and Cranor, 1999) that allow individuals to express their personal preferences about how data should be processed.
The CCL uses encryption, such as PGP (Garfinkel, 1995), to ensure that communicated information remains confidential. The IML allows individuals, where applicable, to remain anonymous or to use a pseudonym. Examples of technologies in this category include LPWA (Gabber et al., 1999) and Onion-routing (Goldschlag et al., 1999). For the purposes of this paper, the OSL is the most important layer. Examples of technologies that can be used on this layer have been given above. As a technology used by the organisation to protect information collected by the organisation, the proposed PIDS will itself form part of this layer.
All existing state-of-the-art IDS implementations attempt to identify intrusions in network and host domains originating from the outside world. Although privacy intrusion can originate from the outside world, the main threat of compromising the privacy of a system originates from inside the organisation. In other words, existing approaches to IDS mainly view this technology as a perimeter defence, similar to firewalls. PIDS, operating on the OSL, is in contrast proposed as a technology that can be employed to detect privacy-compromising behaviour linked to network-attached storage, from both internal and external sources. We assume that private data is best stored in a central repository within the organisation, connected to the rest of the organisation via a network. This makes it possible to monitor access requests to the repository carefully. Often such central repositories are implemented as network-attached storage (NAS) units.
IDS functionality applied to privacy
Sundaram (1996) classifies intrusions into six main categories, namely attempted break-ins, masquerade attacks, penetration of the security control system, leakage, denial of service and malicious use. Often atypical behaviour is used to detect a specific type of intrusion. In the normal security context, attempted break-ins, for example, are detected when the behaviour of a subject differs from the typical behaviour profile, or if a subject violates specified security constraints. These aspects of an IDS can clearly apply to an IDS in the privacy context with only minor modifications: atypical access of private information may indicate misuse of such information. Similarly, constraints that apply specifically to the use of private data can be specified, and violations of such constraints could indicate misuse of private data.
Masquerading, a common vulnerability identified by currently available IDS technology, is also a concern in the privacy context. It is likely to occur when a user attempts to assume a role that the user is not authorised to assume, or when a user attempts to act in a work context that is not expected. While role-based access control (RBAC) and workflow security mechanisms should ensure that this does not happen, IDS technology can identify attempts to bypass these mechanisms and react if these systems are indeed compromised.
Leakage of private information in the privacy context can be associated with identity theft. Trojan horses have, for example, been used to determine the identifying attributes (such as user identifiers and passwords) of individuals and so gain access to their personal information. Once this information has been compromised, the normal mechanisms used to protect access become ineffective. IDS technology could help to identify attempts to acquire such personal information.
Model for PIDS
From the previous sections it is clear that there is potential in combining IDS functionality with privacy to form a PIDS. In order to demonstrate how this is feasible, a model for a PIDS is introduced in this section. Before this is shown, it is, however, necessary to consider the architecture of an IDS.
It should be noted that there are two main types of IDS: anomaly detection and misuse detection (Bace, 2000). The IDS architecture for these two types is essentially the same, except that where an anomaly-based IDS architecture has an anomaly detector and an anomaly profile, a misuse-based IDS has a pattern matcher and policy rules. A misuse-based IDS thus differs from an anomaly-based IDS in that, instead of looking for anomalies, it attempts to match specific patterns in the audit data using the policy rules. These patterns are known in advance and hence specified by the policy rules. For the purpose of the PIDS the architecture of the anomaly-based IDS is adopted.
The main components of an anomaly-detection IDS are a data source, a profile engine, an anomaly detector, a profile database and a report generator. An anomaly-detection IDS performs intrusion detection as follows. Each piece of source data is carefully grouped by the profile engine to form sets of related user or system behaviour. Such a set of behaviour is referred to as a profile. A profile database contains profiles of normal user or system behaviour. The profile database can be set up manually by a human expert who defines the profiles; alternatively, profiles can be compiled by computer using statistical techniques and updated automatically. The anomaly detector then compares each profile compiled from the
source data by the profile engine to the normal user and system behaviour profiles from the profile database. When the anomaly detector finds a profile that appears to be abnormal or unusual compared to a specific user or system profile in the profile database, the behaviour is labelled as intrusive and a report or alarm is generated.
A model for a PIDS is shown in Figure 1. This model is based on a model defined by Denning (1986) and the anomaly-detection IDS architecture discussed above. The scope of the PIDS in Figure 1 is depicted by the dark black line. A data request first travels through a number of perimeter security and privacy technologies before it arrives at the PIDS for further evaluation; this happens only if it has passed successfully through all the perimeter security and privacy technologies. The components of the PIDS map directly to the original anomaly-based IDS architecture except for a new component called the "privacy enforcer". This component ensures that the request is either routed successfully to retrieve data from the database or not, depending on whether a privacy intrusion was detected. The privacy enforcer then sends a reply indicating whether the request was granted or rejected.
Since it is clear how the PIDS components map to the original IDS components, the components of the PIDS can now be discussed in more detail. In order to understand these components, consider the following case scenario. Suppose a representative of a government's revenue service is querying the revenue database. As he has legitimate access to the database, he can retrieve only information he has been granted access to. He might though be able to use this information for
unethical purposes. Suppose he retrieves the information of persons over the age of 65 years who are millionaires. A safe assumption is that the revenue service has a privacy policy that does not allow an employee of the revenue service to disclose personal information of any taxpayer to third parties. Suppose, though, that the representative retrieved the personal and contact details of the selected group of people and supplied the list to his wife, who happens to be a property marketing agent for an exclusive retirement village. The request that the representative made may have been allowed based on his access rights and role, but an invasion of people's privacy has occurred! It is this type of privacy intrusion that a PIDS will attempt to detect.
The case scenario described above will be examined by the PIDS as follows. First, consider the data request "Get contact details of people with age larger than 65 and financial income of more than ZAR1 million" as input to the PIDS. This request will have to pass through the normal perimeter security and privacy technologies. These technologies can range from low-level perimeter security, such as a firewall and simple access control, to higher-level security such as a rule-based access control system, a secure workflow system, or any other PET. In the PIDS component the data requests of the user are carefully examined by the anomaly profile engine and compared to the PIDS anomaly profile database by the anomaly detector in a bid to find a profile that appears to be abnormal or unusual. In this particular case, an interest in people who are over 65 might be normal if they, for example, qualify for a specific tax rebate. Similarly, an interest in the contact
Figure 1 Model for a privacy IDS
details of taxpayers and, in some cases, even an interest in those who are relatively wealthy are likely to be normal queries. However, it is the combined interest that should be noted as anomalous. If one were to classify all queries that have not previously and regularly been executed by those associated with the current profile as possible intrusions, one would end up with an extremely high rate of false positives. This problem can, however, be solved in two ways. First, the normal range of parameters for features of a profile is often easy to determine. Normal working hours are, for example, well known, and activities outside these hours could indicate suspicious behaviour. Such a determined range on its own does not solve the problem: if an employee decides to work a few minutes late it should not trigger alarms, and if employees are not allowed to work outside normal working hours at all, this would be simple to address using normal access control measures. It is therefore necessary for this to be combined with the second part of the solution – throttling. We use the term throttling to refer to the dynamic adaptation of the parameters used for the profile and/or the level of service that the system provides to the user. If the number of "unusual" queries, perhaps involving certain sensitive fields such as contact details, forms one of the monitored features, the threshold for reporting or halting such activity could be lowered with each such query entered during a given period. Alternatively or simultaneously, the speed at which records are returned could be lowered with each record retrieved. This would not prevent our example of the tax inspector from obtaining any contact details at all, but could limit the number of such records he can get and could also lead to his activities being reported to management if he persists.
Examples of facets that could be monitored and throttled are shown in Table I. The association refers to a specific PIDS anomaly profile "feature", meaning that "association" is one of a few thresholds that are checked when detecting privacy anomalies. Examples of other such features in a privacy anomaly profile are also shown in Table I. To illustrate the concept, fictitious threshold values have been allocated to each feature in the table. In addition, the scope of what is monitored, as well as what action to take when a specific value is found to be outside of the valid threshold range, is shown in Table I. All such values will differ extensively from one organisational environment to another. It is important to realise that each request made to the database is linked to a specific person or system, referred to as a subject, and so a PIDS anomaly profile will have to exist for each subject, but would be derived from the subject's role. The specific features identified for this example include the time of day, duration, number of records
accessed, number of records edited, association of records and frequency of usage. There are ten entries shown in Table I for subject Bob; more entries could, however, exist for other subjects. The "time of day" feature for Bob specifies that Bob is normally supposed to access the database between 08:00 and 17:00; Alice, however, might be a night worker and would normally access the database between 18:00 and 02:00. In order to prevent false positives from occurring, Bob may request special permission to work late when he anticipates the need to do so. Likewise, the features are set up with specific valid threshold values for each feature for each different subject. It is also possible to take some appropriate action in order to throttle the normal behaviour of the subject accordingly when a specific feature's threshold is breached.
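To make the interplay between profile features and throttling more concrete, the following is a minimal sketch of how a privacy enforcer might evaluate a request against a subject's anomaly profile. It is not the authors' implementation (none is given in the paper); the feature names and threshold ranges are taken from Table I, while the class and function names, the request format and the throttling arithmetic are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Feature:
    # One row of a PIDS anomaly profile (cf. Table I); names are illustrative.
    name: str
    low: float                  # lower bound of the valid threshold range
    high: float                 # upper bound of the valid threshold range
    actions: list = field(default_factory=lambda: ["report"])

@dataclass
class Profile:
    subject: str
    features: dict              # feature name -> Feature
    delay_seconds: float = 0.0  # current throttling level for this subject

def evaluate(profile: Profile, request: dict) -> list:
    """Compare an observed request against the subject's profile and
    return the list of actions the privacy enforcer should take."""
    triggered = []
    for name, observed in request.items():
        feature = profile.features.get(name)
        if feature is None:
            continue
        if not (feature.low <= observed <= feature.high):
            triggered.extend(feature.actions)
            if "throttle" in feature.actions:
                # Dynamic adaptation: each violation slows the subject down
                # and narrows the acceptable range (illustrative arithmetic).
                profile.delay_seconds += 5.0
                feature.high = max(feature.low, feature.high * 0.9)
    return triggered or ["allow"]

# Example: a profile for Bob, loosely based on entries 1 and 4 of Table I.
bob = Profile(
    subject="Bob",
    features={
        "hour_of_day": Feature("hour_of_day", 8, 17, ["report", "close_connection"]),
        "records_accessed": Feature("records_accessed", 1, 10, ["report", "throttle"]),
    },
)

request = {"hour_of_day": datetime.now().hour, "records_accessed": 250}
print(evaluate(bob, request))  # e.g. ['report', 'throttle']; delay_seconds has also increased
```

Whether throttling adjusts the response delay, narrows the thresholds or both, as sketched here, is an implementation choice; the paper leaves the exact mechanism to the prototype proposed as future work.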
Comparison with other work
The idea of using an IDS approach to protect privacy is not new. In a Hippocratic database (Agrawal et al., 2002) a query intrusion detector (QID) is proposed, but few details are given. PIDS differs from the QID in three significant respects. First, PIDS considers queries while QID considers the results of queries (before data is released). Second, PIDS uses an intrusion detection model based on the expected activities of a user, derived from the role of the user as well as individual traits; in contrast, QID builds a profile from past queries. Third, QID apparently only flags suspect queries, while PIDS attempts to limit damage by using throttling.
Conclusion
This paper described the concept of a PIDS and showed how it could be used to augment normal security and privacy-enhancing technologies to better safeguard private data. It should be emphasised that no such system could ever be perfect: it takes only one person to disclose one piece of information that that individual was perfectly authorised to see to violate privacy, and clearly this would not be noticed by any automated means. We believe, however, that any system that helps to eliminate some violations of privacy that might otherwise have gone unnoticed is a worthwhile effort. Future research on this system will focus on the construction of a prototype that will allow experimentation, particularly to determine the effectiveness of our application of throttling.
Table I Features in a PIDS anomaly profile database
Entry | Subject | Feature | Valid threshold range | Storage component | Action(s) taken on threshold violation
1 | Bob | Time of day | 08:00-17:00 (override: 08:00-20:00) | Database | 1) Report 2) Close database connection
2 | Bob | Duration | 0-10 minutes | Database | 1) Report 2) Close database connection 3) Throttle threshold range
3 | Bob | Duration | 0-3 minutes | Records | 1) Report 2) Throttle threshold range
4 | Bob | No. of records accessed | 1-10 records | Records | 1) Report 2) Reduce threshold range with 2
5 | Bob | No. of records accessed | 11-100 records | Records | 1) Report 2) Reduce threshold range with 20
6 | Bob | No. of records accessed | >100 records | Records | 1) Report 2) Close database connection
7 | Bob | No. of records edited | 0 | Specific record | 1) Report 2) Close database connection for the remainder of the day
8 | Bob | Association of records | 0-2 records associated | Records | 1) Report 2) Close database connection
9 | Bob | Usage frequency | 0-10 times per day | Database | 1) Report 2) Close database connection for the remainder of the day
10 | Bob | Usage frequency | 0-3 times per day | Specific record | 1) Report 2) Disallow access to specific record for remainder of the day
11 | Alice | Time of day | 18:00-02:00 | Database | 1) Report 2) Throttle threshold range
... | ... | ... | ... | ... | ...
References
Agrawal, R., Kiernan, J., Srikant, S. and Xu, Y. (2002), "Hippocratic databases", paper presented at the International Conference on Very Large Databases (VLDB), Hong Kong.
Ashley, P., Hada, S., Karjoth, G. and Schunter, M. (2003), "E-P3P privacy policies and privacy authorization", Proceedings of the ACM Workshop on Privacy in the Electronic Society, ACM Press, New York, NY, pp. 103-9.
Bace, R.G. (2000), Intrusion Detection, Macmillan Technical Publishing, New York, NY.
Denning, D.E. (1986), "An intrusion detection model", Proceedings of the 1986 IEEE Symposium on Security and Privacy, Oakland, CA.
Gabber, E., Gibbons, P.B., Kristol, D.M., Matias, Y. and Mayer, A. (1999), "Consistent, yet anonymous, Web access with LPWA", Communications of the ACM, Vol. 42 No. 2, pp. 42-7.
GAO (1993), National Crime Information Center: Legislation Needed to Deter Misuse of Criminal Justice Information, Document GAO/T-GGD-93-41, United States General Accounting Office, Washington, DC.
GAO (1997), IRS Systems Security: Tax Processing Operations and Data Still at Risk Due to Serious Weaknesses, Document GAO/AIMD-97-49, United States General Accounting Office, Washington, DC.
Garfinkel, S. (1995), PGP: Pretty Good Privacy, O'Reilly, Sebastopol, CA.
Goldschlag, D.M., Reed, M.G. and Syverson, P.F. (1999), "Onion routing", Communications of the ACM, Vol. 42 No. 2, pp. 39-41.
Karjoth, G., Schunter, M. and Waidner, M. (2003), "Platform for enterprise privacy practices: privacy-enabled management of customer data", in Dingledine, R. and Syverson, P. (Eds), Privacy Enhancing Technologies: 2nd International Workshop, PET 2002, San Francisco, CA, revised papers, Springer, New York, NY.
Lategan, F.A. and Olivier, M.S. (2002), "PrivGuard: a model to protect private information based on its usage", South African Computer Journal, Vol. 29, pp. 58-68.
Olivier, M.S. (2003a), "Using organisational safeguards to make justifiable decisions when processing personal data", in Eloff, J.H.P., Kotzé, P., Engelbrecht, A.P. and Eloff, M.M. (Eds), IT Research in Developing Countries (SAICSIT 2003), Sandton, South Africa, pp. 275-84.
Olivier, M.S. (2003b), "A layered architecture for privacy-enhancing technologies", in Eloff, J.H.P., Venter, H.S., Labuschagne, L. and Eloff, M.M. (Eds), Proceedings of the 3rd Annual Information Security South Africa Conference (ISSA2003), Sandton, South Africa, pp. 113-26.
Pfleeger, C.P. and Pfleeger, S.L. (2003), Security in Computing, 3rd ed., Prentice-Hall, Englewood Cliffs, NJ.
Reagle, J. and Cranor, L.F. (1999), "The platform for privacy preferences", Communications of the ACM, Vol. 42 No. 2, pp. 48-55.
Reiter, M.K. and Rubin, A.D. (1999), "Anonymous Web transactions with crowds", Communications of the ACM, Vol. 42 No. 2, pp. 32-48.
Sundaram, A. (1996), "An introduction to intrusion detection", ACM Crossroads, available at: www.acm.org/crossroads/xrds2-4/intrus.html
Web services: measuring practitioner attitude P. Joshi, H. Singh and A.D. Phippen
The authors P. Joshi is a Research Student and H. Singh was a Senior Lecturer, both at the School of Computing Science, Middlesex University, Tunbridge Wells, UK. A.D. Phippen is Senior Lecturer, Network Research Group, University of Plymouth, Plymouth, UK.
Keywords Internet, Worldwide web, Servicing, Function evaluation
Abstract Distributed computing architectures have been around for a while, but not all of their benefits could be leveraged because of issues such as inter-operability, industry standards and cost efficiency, which are needed to provide agility and transparency to business process integration. Web services offer a cross-platform solution that provides a wrapper around any business object and exposes it over the Internet as a service. Web services typically work outside of private networks, offering developers a non-proprietary route to their solutions. The growth of this technology is imminent; however, there are various factors that could impact its adoption rate. This paper provides an in-depth analysis of various factors that could affect the adoption rate of this new technology by industry. Various advantages, pitfalls and future implications of this technology are considered with reference to a practitioner survey conducted to establish the main concerns affecting the adoption rate of Web services.
Electronic access The Emerald Research Register for this journal is available at www.emeraldinsight.com/researchregister The current issue and full text archive of this journal is available at www.emeraldinsight.com/1066-2243.htm
Introduction
Web services have the potential to transform traditional distributed computing by exposing software functions or applications as services over the Internet. They use the basic Internet infrastructure to query services, publish services and to carry transactions across various services. This infrastructure allows services from distinct vendors to communicate or interact with each other. A Web service could be a real-time stock quote service, a weather advisory, a hotel and airline booking service, or a combination of multiple such services forming an entire business process. This communication can take place through an Internet browser, or any other independent application can use SOAP over the hypertext transfer protocol (HTTP) to invoke a Web service. A service requester can query the registry for any particular Web service, get its description and invoke the Web service over the Internet.
The growth of electronic business depends on business-to-business (B2B), application-to-application (A2A) and business-to-consumer (B2C) interaction over the Web. This requires a technology that supports inter-operability, cross-platform transactions and integration of software components written in any language with legacy applications. No previous distributed computing architecture (for example, Sun Java Remote Method Invocation (RMI), OMG Common Object Request Broker Architecture (CORBA), Microsoft Distributed Component Object Model (DCOM)) can deliver such benefits (Lim and Wen, 2003). All of these technologies are heavily dependent on the vendor platform and on tight coupling of client and server.
This paper reviews the Web service architecture and considers the benefits and major concerns reported in the literature that are faced by enterprises in implementing this technology. Following on from this review, we evaluate these claims against a practitioner-focussed survey, which considers the literature against a sample of people who use the technologies.
A brief review of Web services
There are numerous detailed technical reviews of Web services (for example Lim and Wen (2003), Gottschalk et al. (2002) and Walsh (2002)).
Internet Research Volume 14 · Number 5 · 2004 · pp. 366-371 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566962
This paper is dedicated to the memory of Dr Harjit Singh, who was the motivation behind all the work that went into this paper. His absence is felt as a mentor, guide and as the most wonderful person the other authors ever came across.
While we do not feel there is any benefit in providing another in this paper, it is worth reviewing the technologies briefly for the purposes of discussion. Web services are based on XML standards and can be written in any programming language and on any platform. The Web services programming model is based on service definition through Web service definition language (WSDL) documents. These WSDL templates define the logical structure of a Web service, including the input and output parameters and a reference to the location of the service. Universal description, discovery, and integration (UDDI) registries publish these templates. Clients across the globe can access UDDI to browse for a suitable service by reading the various WSDL templates. Simple object access protocol (SOAP) is a communication protocol for XML-based Web services. SOAP messaging runs over HTTP, which makes it globally acceptable as most operating systems support HTTP. SOAP enables applications to communicate directly without the need for custom binaries, runtime libraries, or other platform-specific information that has plagued cross-platform data transfer in the past. By using the Web service architecture, applications can share data or invoke methods and properties of other remote applications without any knowledge of the other application's architecture. This process is expected to be much more transparent and easy to integrate in a heterogeneous environment. Despite the growth in popularity of Web services, their true commercial exploitation depends on further development of standards in areas such as security, reliable messaging, transaction support and workflow (Gottschalk et al., 2002).
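As a concrete illustration of the SOAP-over-HTTP mechanism described above, the following is a minimal sketch of a client invoking a hypothetical stock-quote Web service. The endpoint URL, namespace, operation name (GetQuote) and parameter are invented for illustration; a real service's WSDL document would supply the actual values, and production code would normally use a SOAP toolkit rather than hand-built XML.

```python
import urllib.request

# Hypothetical service details; in practice these come from the WSDL template.
ENDPOINT = "http://example.com/stockquote"
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

def get_quote(symbol: str) -> str:
    # Hand-built SOAP 1.1 envelope carrying a single GetQuote operation.
    envelope = f"""<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>
    <GetQuote xmlns="http://example.com/quotes">
      <symbol>{symbol}</symbol>
    </GetQuote>
  </soap:Body>
</soap:Envelope>"""

    request = urllib.request.Request(
        ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": "http://example.com/quotes/GetQuote",  # assumed action URI
        },
    )
    # The reply is itself a SOAP envelope; parsing it is omitted here.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    print(get_quote("IBM"))
```

Because the message is plain XML carried over HTTP, the same request can be issued from any platform or language, which is the interoperability property probed by the survey later in the paper.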
Factors impacting Web services adoption
Web services will impact the way business objects communicate across intranets, across applications, and even across enterprises. They provide an interface for bridging gaps between distinct technical infrastructures by promoting inter-operability and flexibility. There could potentially be no more boundaries for information exchange. The joint industrial push towards standardisation of Web service components has widely influenced early adoption of this technology. If we examine the literature in these areas, we can identify a number of purported driving forces for rapid adoption of Web services:
. Industry support. All major technology vendors, including Microsoft, Oracle, IBM and Sun, have adopted the Web service architecture, and each is working to integrate support for Web services development. Sun is building Web service support into its Java 2 Enterprise Edition (Sun, 2004) and Microsoft has built it into its .NET framework (Wolter, 2001).
. Loose coupling. Loose coupling allows Web services to connect any application or data source to any target application. The Web services world has informally adopted loose coupling because tightly coupled applications that maintain constant contact with one another do not operate well on unpredictable networks or when two applications are on opposite sides of a firewall (Shirky, 2002).
. Interoperability. Web services allow developers to use various programming languages, such as Java, C++, VBScript, JavaScript or Perl, for development. Once developed, they can be invoked from any platform, regardless of the programming language used for their development. With the use of standards-based communications methods, Web services are virtually platform-independent (Shirky, 2002).
. Integration. Web services allow developers to easily integrate business processes across various different enterprises. They provide a much-needed interface to the software industry to integrate legacy business applications. Enterprises with distinct technical infrastructures can integrate via a Web services implementation (Shirky, 2002).
. Simplicity. Implementation of Web services is much simpler to carry out than other distributed systems approaches. Web services help to hide back-end implementation details through the use of standardised or well-known interface definitions. The programming model for Web services separates interfaces from implementation, which makes it easier to understand and simple to deploy (Scribner and Stiver, 2002). Another added advantage is that many SOAP implementation tools are freely available.
. Business expansion. Companies would be able to access unexplored marketplaces via UDDI public registries. Web services also help companies outsource business processes: using Web services a company can turn over the management of a process or function, such as manufacturing, logistics or human resources, to an outside provider (Hagel, 2002).
Web services obstacles
Every technology comes with some pitfalls – it is impossible to design a loosely coupled distributed
architecture, which must inter-operate with various distinct systems, without any problematic areas. Using the Internet as a base technology carries with it issues of security, availability, reliability and so forth. The concept of inter-operability can make any distributed system quite complex, as the entire industry needs to agree upon standards and norms. Major obstacles in the early adoption of Web services can be identified as:
. Security. Executing applications or calling objects over the Internet could pose large problems for any IT project manager, and implementing security for transactions carried over the Internet can be challenging. Security is a fundamental requirement of Web services: they could be exposing business methodologies, business applications or secure business details, and this data needs to be protected from unauthorised users. Major security concerns for Web service implementation are confidentiality of data, authentication or authorisation of participating parties, integrity, non-repudiation and availability of the service (Turban, 2002).
. Accountability. How can the service be paid for, and who is responsible for the service? This has been a long-debated issue with regard to payment infrastructure and access to Web services. A company can either levy a one-time charge for accessing a Web service or charge a periodical subscription (Ratnasingam, 2002).
. Performance. Performance is one of the major parameters for measuring the quality of service (QoS) of a Web service. Web services can encounter performance bottlenecks due to the limitations of the underlying messaging and transport protocols (Mani and Nagarajan, 2002).
Measuring Web service adoption and attitudes
While the above identifies a number of issues hampering the adoption of Web services drawn from the literature in the area, it has been identified in previous studies (for example, Phippen, 2001; Kunda and Brooks, 2000) that the reality of technology adoption sometimes differs from that reported. In particular this can be true in areas where the literature is industrially driven, with very little rigour or empiricism in the evidencing of analysis or results. In order to investigate the adoption issues in the Web service domain further, it was decided that a survey sampling the people who use Web service technologies would greatly strengthen investigative work to date.
The survey took samples from developer groups focussing on Web service development issues. By drawing from these groups we were able to ensure respondents with an interest in Web service development and a good level of knowledge in the area. We acknowledge that we cannot consider this sample to be truly representative, as the sampling method was not completely random; however, we feel it is an indicative sample that can be used for effective measurement. Questionnaire formulation was based on the claims made by the literature discussed above. There were three main areas of focus in the questionnaire:
(1) Building a profile of the respondent to ascertain suitability in answering the survey and determining the depth of developer experience with Web service technologies.
(2) The technology choices used in developing Web service solutions.
(3) Evaluating opinion based on questions drawn from literature related to adoption issues.
Respondent profile
Of the 100 targeted, we achieved 46 respondents, giving a response rate of 46 per cent. The respondents represented a number of different organisational roles: chief technology officer (CTO) (45 per cent), project manager (17 per cent), IT analyst (19 per cent) and other (19 per cent) – the last of which included business consultants, technical consultants and software developers. The significant response from CTOs demonstrates the strategic level of importance afforded to Web services; this is not a technology used by the IT department with little impact on the organisation as a whole. The primary function of the companies represented by the respondents was research and development (41 per cent), e-commerce/e-business (25 per cent), ERP/CRM (6 per cent) and other (29 per cent), including banking, payroll, food production and government. Respondents were fairly evenly distributed across the main countries considered to be actively involved in Web services, such as the USA, the UK, France, Germany, Canada and Australia.
The level of knowledge regarding Web services was also encouraging when considering the validity of the sample set. The majority of respondents (77 per cent) had been familiar with Web services for over two years, whereas 20 per cent had been familiar with them for at least a year. Only 3 per cent of respondents considered themselves beginners. Overall the respondents provided over 100
years of experience using Web service technologies. We also determined the type of experience the respondents had with Web services – were they consumers or providers. Initial results showed over 75 per cent of respondents were already consuming Web services and over 75 per cent were also developing Web services. An interesting result came from comparing those consuming Web services against those developing their own, detailed in Table I. Almost 100 per cent of those respondents who consumed Web services were also developing their own.
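The comparison in Table I is a simple cross-tabulation of two survey questions. The snippet below shows one way such a contingency table could be built from coded questionnaire records; the field values and sample records are invented for illustration and are not the survey data itself.

```python
from collections import Counter

# Hypothetical coded survey records: (consuming status, developing status).
records = [
    ("Already started", "Already started"),
    ("Already started", "Testing various"),
    ("Considering", "Thinking about it"),
    ("Do not know", "No idea"),
    # ... one tuple per respondent
]

def crosstab(pairs):
    """Count how often each (row, column) combination occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

for row, cells in crosstab(records).items():
    print(row, cells)
```

Row and column totals of the kind reported in Table I follow by summing the cell counts.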
Technology choice
While the argument for Web services is that they are essentially technology-neutral, the fact remains that practitioners have to make a choice regarding which technology to adopt to develop their own Web service strategy. The two dominant approaches to building Web services applications are either to go with Microsoft's .NET architecture or with the rest of the industry's Java-based approaches. Coupled with this choice is the option of development tools. While Microsoft, unsurprisingly, packages its Web service technology in Visual Studio .NET, the options are more diverse for J2EE: a number of companies, such as IBM, BEA, Cape Clear and Oracle, have built some or all of their Web services offerings on top of J2EE.
The responses from our survey do not give too many surprises when examined independently (see Table II). There is a fairly even split between Sun and Microsoft technologies and a few "other" responses (most of those use Perl's Web service technologies). Again, with the choice of development technology there is a reasonably straightforward split. However, the "open source" response is more interesting when compared against the choice of platform in cross-tabular analysis (see Table III). There is a significant number of developers using open source technologies to develop .NET solutions. This certainly highlights the standards-focused approach to Web services. The literature states that it should make no difference which application server platform is used to develop these services (Clabby, 2003), and our survey indicates that this may certainly be the case.
Table II Platform and development technology choices (per cent of respondents)
Platform | Per cent
Sun J2EE | 49
Microsoft .NET | 41.2
Other | 9.8
Development technology | Per cent
Open source | 29.0
Sun Microsystems | 17.4
Microsoft | 23.2
BEA | 11.6
IBM | 7.2
HP | 2.9
CapeClear | 2.9
Other | 5.8
Table I Planning to use Web services vs planning to develop Web services (rows: planning to use; columns: planning to develop)
 | Already started | Testing various | Thinking about it | No idea | Total
Planning to use Web services | 32 | 2 | 0 | 0 | 34
Considering | 1 | 1 | 1 | 0 | 3
Do not know | 1 | 1 | 1 | 6 | 9
Total | 34 | 4 | 2 | 6 | 46

Evaluation of adoption issues
When evaluating adoption issues we decomposed the benefits and barriers drawn from the literature into three broad areas:
(1) Standards – generally viewed as one of the main reasons to adopt Web services, providing numerous benefits for interoperability and integration.
(2) Complexity – or lack thereof. The literature suggests that Web service adoption is possible due to the simplicity of development and integration. This certainly differs from findings relating to previous distributed software technologies (Phippen, 2001; Kunda and Brooks, 2000). However, this simplicity also hampers performance and perhaps reliability.
(3) Security – possibly the biggest barrier to Web service adoption and to distributed, loosely coupled systems in general. Issues such as accountability and reliability also draw on this security domain.
The questionnaire focussed on these three core issues by presenting a number of statements and requesting closed, bipolar responses (strongly disagree, disagree, no opinion, agree, strongly agree). These responses were coded from 1 to 5 in order to ascertain the mean response, standard deviation and skewness of the data. The statements presented were:
(a) Web services have the potential to transform traditional distributed computing.
(b) Web services are easy to code and deploy compared to other distributed solutions such as CORBA and DCOM.
(c) Web services offer interoperability between applications by supporting code development using any software language.
(d) Support from major technology vendors such as Microsoft, Sun, IBM and HP is a basic reason for Web service adoption.
(e) Web services allow developers to easily integrate business processes across various enterprises.
(f) There are enough security standards available for Web services.
(g) Adoption rates for Web services would improve with further development of security standards.

Table III Platform choices vs development choices (per cent of cases; columns are development technologies)
Platform choice | Open source | Sun | Microsoft | BEA | IBM | HP | CapeClear | Other
Sun J2EE | 40.5 | 29.7 | 18.9 | 21.6 | 13.5 | 5.4 | 2.7 | 8.1
Microsoft .NET | 29.7 | 10.8 | 40.5 | 10.8 | 8.1 | 2.7 | 2.7 | 5.4
Other | 10.8 | 10.8 | 10.8 | 5.4 | 5.4 | 2.7 | 0 | 2.7

The core responses are detailed in Tables IV and V and discussed below. Results show a positive response for the complexity issues (statements a-c) within the Web service domain. In particular, issues such as the potential of Web services to realise distributed computing goals, and the use of Web services to achieve interoperability, are very positively represented. One point of interest is the level of "no opinion" on statement b. Given the responses for other issues surrounding development complexity in Web services, we put forward the assumption that respondents perhaps did not have experience using older distributed software technologies such as CORBA and DCOM. However, it seems from the results that we can confirm that the development of distributed software solutions using Web service approaches does offer significant potential without the complex knowledge requirements traditionally associated with this type of development (Brown and Wallnau, 1996).
The issue of standards is less clear from the immediate results. Statement d in particular does not show such a strong positive response (although its mean value and skewness both veer towards an overall positive response). As this is put forward in the literature as one of the most positive aspects of Web services, this is an interesting outcome. Further cross-tabular analysis also fails to identify any obvious trends in developer feeling on this issue – open source developers are split fairly equally on whether it is a positive or negative issue, and those who view business process integration as an important issue also do not show any trends.
The one area where it was very apparent that participants were not confident regarding Web services was the issue of security standards. The majority of respondents believed that the adoption rate of Web services would drastically improve with further development of security standards. This was as expected, since security has already been identified as a major threat to the growth of Web services. In order to develop the issues surrounding security further, we posed a final question: what was the most important issue for Web service security? Table VI details the outcomes.

Table IV Core statistics for adoption issues
Statistic | a | b | c | d | e | f | g
Mean | 3.96 | 3.93 | 3.93 | 3.46 | 3.74 | 2.80 | 3.76
Std deviation | 0.868 | 1.020 | 0.975 | 1.130 | 0.976 | 1.222 | 0.993
Skewness | -0.766 | -0.522 | -1.368 | -0.516 | -0.491 | 0.239 | -0.630

Table V Response outcomes for adoption issues (percentages)
Response | a | b | c | d | e | f | g
Strongly disagree | 0 | 0 | 4.3 | 4.3 | 0 | 13 | 2.2
Disagree | 8.7 | 10.9 | 4.3 | 21.7 | 15.2 | 37 | 8.7
No opinion | 13 | 21.7 | 10.9 | 13 | 17.4 | 15.2 | 23.9
Agree | 52.2 | 30.4 | 54.3 | 45.7 | 45.7 | 26.1 | 41.3
Strongly agree | 26.1 | 37 | 26.1 | 15.2 | 21.7 | 8.7 | 23.9

Table VI Most important security issue affecting Web services
Issue | Response (%)
Authentication | 43.5
Confidentiality | 28.3
Integrity | 8.7
Non-repudiation | 2.2
Other | 17.4
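For readers who want to reproduce the kind of summary given in Tables IV and V from coded responses, the following is a minimal sketch. The figures used are invented and are not the survey data, and the paper does not state which skewness estimator was applied; the sketch uses the simple moment-based definition, so its output would only approximate Table IV.

```python
def summarise(codes):
    """Mean, sample standard deviation and moment-based skewness for
    responses coded 1 (strongly disagree) to 5 (strongly agree)."""
    n = len(codes)
    mean = sum(codes) / n
    # Sample variance (n - 1 denominator) for the reported standard deviation.
    variance = sum((x - mean) ** 2 for x in codes) / (n - 1)
    std_dev = variance ** 0.5
    # Moment-based skewness: third central moment over the 1.5 power of the
    # second central moment (one of several common estimators).
    m2 = sum((x - mean) ** 2 for x in codes) / n
    m3 = sum((x - mean) ** 3 for x in codes) / n
    skewness = m3 / (m2 ** 1.5) if m2 > 0 else 0.0
    return mean, std_dev, skewness

def distribution(codes):
    """Percentage breakdown across the five response categories (cf. Table V)."""
    labels = ["Strongly disagree", "Disagree", "No opinion", "Agree", "Strongly agree"]
    n = len(codes)
    return {label: 100 * codes.count(i + 1) / n for i, label in enumerate(labels)}

# Illustrative (invented) responses for one statement from 46 respondents.
example = [4] * 24 + [5] * 12 + [3] * 6 + [2] * 4
print(summarise(example))
print(distribution(example))
```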
Discussion and conclusion
Web services represent a technology migration from vendor-specific software development to an Internet-based, cross-platform and vendor-neutral environment. The use of SOAP over HTTP for communication makes them easily available over the Internet. Most software companies are already investing resources in developing or implementing Web services environments; for others it is only a matter of time before they have to decide whether to jump on the bandwagon. Web services are worth paying attention to even for organisations that do not plan an immediate launch. The literature has stated that Web services offer low transaction cost, improved return on investment, extensibility of legacy software and a move of business processes to the Internet. If the potential expressed in the literature could be realised in reality, the flexibility and power afforded by such approaches could offer major advantages to a company.
The online survey established Web services as a solution for distributed software development and interoperability and, to a slightly lesser extent, business integration. However, one major issue arising from our study is the non-availability of security standards for Web services, and security concerns remain high on the agenda of companies looking to adopt them. Authentication standards need to be developed further to support user as well as device authentication models. This would not only provide a higher level of authentication but also support better accountability management solutions in a distributed computing architecture, which would lead to cost efficiency and resolve issues of non-repudiation. Nevertheless, considering the results from our survey, the outlook for the adoption of Web services is positive, and in the coming years further developments will be made in this field. It is arguably the first distributed software technology whose complexity will not hamper its development. Looking at the progress made by Web services since their start a few years back, and considering the benefits they offer to the business world, it would seem that Web services will have a positive impact in the future.
References
Brown, A.W. and Wallnau, K.C. (1996), "A framework for evaluating software technology", IEEE Software, Vol. 13 No. 5, September, pp. 39-49.
Clabby, J. (2003), Web Services Explained: Solution and Applications for the Real World, Prentice-Hall, Englewood Cliffs, NJ.
Gottschalk, K., Graham, H., Kreger, J. and Snell, J. (2002), "Introduction to Web services architecture", IBM System Journal, Vol. 41 No. 2.
Hagel, J. (2002), "Edging into Web services", McKinsey Quarterly, Vol. 4, special edition: technology issue, p. 29.
Kunda, D. and Brooks, L. (2000), "Assessing obstacles to component based development: a case study approach", Information and Software Technology, Vol. 42, pp. 715-25.
Lim, B. and Wen, H. (2003), "Web services: an analysis of the technology, its benefits, and implementation difficulties", Information System Journal, Vol. 20 No. 2, p. 49.
Mani, A. and Nagarajan, A. (2002), "Understanding quality of service for Web services: improving the performance of your Web services", available at: www-106.ibm.com/developerworks/library/ws-quality.html
Phippen, A. (2001), "Practitioner perception of component technologies", Proceedings of Euromedia, Society for Computer Simulation, Munich.
Ratnasingam, P. (2002), "The importance of technology trust in Web services security", Information Management & Computer Security, Vol. 10 No. 5, pp. 255-60.
Scribner, K. and Stiver, M. (2002), Applied SOAP: Implementing .NET XML Web Services, SAMS Publication, Indianapolis, IN.
Shirky, C. (2002), Planning for Web Services: Obstacles and Opportunities, O'Reilly, Sebastopol, CA.
Sun (2004), "Java 2 Platform, Enterprise Edition (J2EE)", available at: http://java.sun.com/j2ee/
Turban, E. (2002), Electronic Commerce: A Managerial Perspective, Prentice-Hall, Englewood Cliffs, NJ.
Walsh, A. (2002), UDDI, SOAP, and WSDL: The Web Services Specification Reference Book, Pearson Education Asia.
Wolter, R. (2001), "XML Web services basics", available at: http://msdn.microsoft.com/library/?url=/library/en-us/dnwebsrv/html/webservbasics.asp?frame=true
Using the Web Graph to influence application behaviour Michael P. Evans and Andrew Walker
The authors Michael P. Evans is Lecturer, Applied Informatics and Semiotics Laboratory, University of Reading, Reading, UK. Andrew Walker is a Research Student, School of Mathematics, Kingston University, Kingston-upon-Thames, UK.
Keywords Worldwide Web, Computer software, Graphical programming
Abstract The Web’s link structure (termed the Web Graph) is a richly connected set of Web pages. Current applications use this graph for indexing and information retrieval purposes. In contrast the relationship between Web Graph and application is reversed by letting the structure of the Web Graph influence the behaviour of an application. Presents a novel Web crawling agent, AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process. Shows the effectiveness of such unorthodox coupling to both the playability of the game and the heuristics of the Web crawler. In addition, presents the results of the sample of Web pages collected by the crawling process. In particular, shows: how AlienBot was able to identify the power law inherent in the link structure of the Web; that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML.
Internet Research Volume 14 · Number 5 · 2004 · pp. 372-378 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566971
Introduction
Since its inception, the World Wide Web has evolved into an information system comprising billions of individual Web pages linked together via hyperlinks. Because the architecture of the Web places no restrictions on the method of publication, authorship or linking, its underlying link structure has emerged without any guiding direction. However, that is not to say the structure is random. For example, many studies have revealed a power-law distribution in the number of connections per node (Barabási and Albert, 1999; Broder et al., 2000; Kleinberg et al., 1999). Albert et al. (1999) showed that the Web is a small world network (i.e. the "diameter" of the Web in terms of any contiguously linked Web pages is small relative to the size of the Web as a whole). More significantly, Broder et al. (2000) revealed that the Web resembled a "bow-tie" structure, with a large core of strongly connected Web pages; an IN component, whose pages link into the core, but are not necessarily linked to by the core; an OUT component, whose pages are linked to from the core, but which do not generally link to the core; and a smaller set of pages that are largely disconnected from the core. Thus, the Web has a structure that is by no means random.
It is the Web's link structure that Web crawlers follow when retrieving information. A Web crawler is an application that parses a page, indexing or retrieving information from it, and identifying the page's links to other pages, which the crawler will then follow to start the process once more on the new page. In this way, these applications literally crawl the Web by following its link structure, collating, indexing and retrieving information as they go. Web crawlers themselves are well understood, with well-defined heuristics guiding them across the Web's link structure. However, by looking at the crawling process in a different way, the behaviour of a crawler can be seen to be governed by the Web's link structure. Effectively, the crawler cannot go outside the bounds of the links it finds, and so its pattern of movements is constrained by the structure of the links. In this way, the structure guides the movement of the crawler (at least in the case of naïve crawling heuristics).
This idea of the link structure guiding the behaviour of an application does not have to be limited to a Web crawler. There are other applications that can benefit from the unique characteristics of the link structure. One such application is the computer game. Specifically, computer games rely on heuristics when controlling the enemy (or enemies) a player must fight. Getting the heuristics right is critical to the playability of the game, as they determine the challenge it poses. If the heuristics are too simple, the game will be easily conquered; too difficult,
and the player will swiftly become fed up at his/her lack of progress, and will stop playing. As such, we orthogonally coupled the link structure of the Web to the heuristics of a computer game such that the link structure guided the generation of enemies in the computer game, leading to a playable game that changes over time as the link structure itself continues to evolve. We present the results of such coupling in this paper, which proceeds as follows. First, we discuss the principle behind coupling a computer game to the link structure of the Web; we then move on to present the architecture of AlienBot: a Web crawler coupled to a computer game. The following section presents the results of our design, validating the heuristics used by the AlienBot Web crawl, and also revealing some interesting statistics gained from a crawl of 160,000 Web pages. Finally, the paper concludes by discussing some of the issues we faced, and some suggestions for further work.
Orthogonally coupling a computer game and the Web Graph
Computer game heuristics
A simple example of a computer game's heuristics can be seen in a "shoot-em-up" game, in which the player controls a lone spacecraft (usually at the bottom of the screen), and attempts to defeat a number of enemy spacecraft that descend on him/her. Early games, such as Namco's 1979 Galaxians, relied on heuristics in which the enemies would initially start at the top of the screen, but would intermittently swoop down towards the player in batches of two or three. Modern shoot-em-ups, such as Treasure's 2003 hit Ikaruga, are more sophisticated and offer superb visuals, but still rely on a fixed set of heuristics, in which the enemies attack in a standard pattern that can be discerned and predicted after only a few plays. As such, we recognised that such games could be made more enjoyable by introducing a stochastic process into the enemy's heuristics, thus generating a measure of surprise, and making the game subtly different each time.
Randomly walking the Web Graph
We found that such a process could be obtained by letting the Web's link structure guide the behaviour of the enemy generation routines. The Web's link structure can be represented as a graph, G, called the Web Graph. This is the (directed or undirected) graph G = (V, E), where V = {v1, v2, . . ., vn} is the set of vertices representing Web pages, and E the collection of edges representing hyperlinks ("links") that connect the Web pages. Thus, G represents the structure of the Web in terms of its pages and links. A random walk across G is therefore a stochastic process that iteratively visits the vertices of the graph G (Bar-Yossef et al., 2000). However, as various experiments have discovered, the Web Graph G has unusual properties that introduce considerable bias into a traditional random walk. A random walk across G should generate a finite-state discrete time Markov chain, in which each variable vn (where vn ∈ V) in the chain is independent of all earlier variables in the chain, given the current state. Thus, the probability of reaching the next Web page should be independent of the previous Web page, given the current state (Baldi et al., 2003). However, G is neither undirected nor regular, and a straightforward walk will have a heavy bias towards pages with high in-degree (i.e. links pointing to them) (Bar-Yossef et al., 2000). This leads to a dependence between pages, in which a page on the walk affects the probability that another page is visited. In particular, some pages that are close to one another may be repeatedly visited in quick succession due to the nature of the links between them and any intermediate pages (Henzinger et al., 2000).
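To make the idea of a random walk over G concrete, the following sketch (not taken from the paper; the toy graph and the function names are illustrative only) walks an adjacency-list representation of a small Web graph by repeatedly choosing an outgoing link uniformly at random. This is exactly the kind of naive walk whose visit counts become biased towards pages with a high in-degree.

import random

# Toy Web graph as an adjacency list: page -> list of pages it links to.
# (Hypothetical URLs; a real walk would parse live pages instead.)
web_graph = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "b.html", "d.html"],
    "d.html": ["c.html"],
}

def random_walk(graph, start, steps, seed=42):
    """Perform a naive random walk, returning how often each page was visited."""
    rng = random.Random(seed)
    visits = {page: 0 for page in graph}
    current = start
    for _ in range(steps):
        visits[current] += 1
        out_links = graph.get(current)
        if not out_links:                      # dead end: jump to a random page
            current = rng.choice(list(graph))
        else:
            current = rng.choice(out_links)    # follow one outgoing link at random
    return visits

print(random_walk(web_graph, "a.html", 10000))
# Pages with many in-links (here c.html and b.html) accumulate visits fastest,
# illustrating the bias discussed above.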
Coupling the computer game to the Web Graph As described, a random walk across the Web is a stochastic process that can contain discernible patterns. Although unwelcome for sampling the Web, such a process is ideal for our computer game. In addition, the Web Graph is constantly evolving, with nodes being created and deleted all the time (Baldi et al., 2003). As such, the dynamics of the structure are such that novelty is virtually guaranteed. This is the reason we chose to couple the two applications. We achieved this through the use of a Web crawler, which performs the required random walk by parsing a Web page for hyperlinks, and following one at random. We coupled the two applications by mapping each hyperlink parsed by the crawler to the generation of one enemy in our computer game. In this way, the exact number of enemies that appear cannot be predicted in advance, but patterns may be discerned, as the sampling bias inherent within the Web Graph is reflected in the pattern of enemies generated from it. Furthermore, we couple the two applications tighter still by making each enemy represent its associated hyperlink, and sending this link back to the crawler when the enemy is shot. In this way, the choice of enemy shot by the player is used to determine the next Web page crawled by the crawler, as the enemy represents the hyperlink. As each enemy is indistinguishable from its neighbour, the player has no reason to shoot one over another, and thus implicitly adds true randomness into the URL selection process. The player therefore blindly selects the hyperlink on
behalf of the crawler, and thus drives the random walk blindly across the Web while he/she plays the game. The Web crawler and the computer game, traditionally seen as orthogonal applications, are therefore tightly coupled to one another, a process we term “orthogonal coupling”.
AlienBot – a novel design for a Web crawler
A general overview of AlienBot
Our crawler, called AlienBot, comprises two major components: a client-side game (Figure 1) produced using Macromedia Flash, and a server-side script using PHP. The client-side program handles the interaction with the user, whereas the server-side program is responsible for the bulk of the crawling work (see subsection "Processing a new Web page"). AlienBot is based on the game Galaxians. It runs in the user's Web browser (see Figure 1), and interacts with the server-side script through HTTP requests. In the game, URLs (i.e. hyperlinks) are associated with aliens in a one-to-one mapping.

Figure 1 AlienBot

When an alien is shot by the user its physical representation on screen is removed and the URL it represents is added to a queue of URLs within the client. The client works through this list one at a time, on a first in, first out basis, sending each URL to the server (Figure 2). Upon receiving a new URL, the server retrieves the referenced page, and parses it for new links. Meanwhile, the client listens for a response from the server, which is sent once the server-side script has finished processing the page. The response sent back from the server consists of a list of one or more URLs retrieved from the hyperlinks on the page being searched. The game can then create new aliens to represent the URLs it has received.
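The client-side behaviour described above (shot aliens queue their URLs and send them to the server one at a time, first in, first out) can be summarised in a small sketch. The original client was written in Macromedia Flash and the server in PHP; the Python below, with its invented submit_url_to_server callable and alien bookkeeping, only illustrates the queueing logic and is not the authors' code.

from collections import deque

class AlienBotClient:
    """Illustrative client-side loop: map aliens to URLs, queue shot URLs (FIFO)."""

    def __init__(self, submit_url_to_server):
        self.submit = submit_url_to_server   # callable that sends a URL, returns new URLs
        self.pending_urls = deque()          # first in, first out queue of shot URLs
        self.aliens = {}                     # alien id -> URL it represents
        self._next_id = 0

    def spawn_aliens(self, urls):
        # One enemy per URL received from the server (one-to-one mapping).
        for url in urls:
            self.aliens[self._next_id] = url
            self._next_id += 1

    def on_alien_shot(self, alien_id):
        # Remove the alien and queue the URL it represented.
        url = self.aliens.pop(alien_id)
        self.pending_urls.append(url)

    def pump(self):
        # Send queued URLs one at a time; the returned links become new aliens.
        while self.pending_urls:
            url = self.pending_urls.popleft()
            new_urls = self.submit(url)
            self.spawn_aliens(new_urls)

# Example with a stub server that always returns two links:
client = AlienBotClient(lambda url: ["http://example.com/x", "http://example.com/y"])
client.spawn_aliens(["http://example.com/start"])
client.on_alien_shot(0)
client.pump()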
Figure 2 Overview of the AlienBot process

Processing a new Web page
When the client sends a URL to the server indicating that an alien has been shot, the server performs the following operations (see Figure 3):
(1) Step 1:
. Download the page referenced by the URL supplied by the client.
. Search the page for links and other information about the page.
. URL resolver – resolve local addresses used in links to absolute URLs. That is, converting links to the form http://www.hostname.com/path/file.html
. Remove links deemed to be of no use (e.g. undesired file types, such as exe files).
. Database checks – check the database to see if the page has already been crawled.
(2) Step 2:
. Record in the database all URLs found on the page.
. Select a random sample (where the randomness is generated by a random number generator) of the resolved URLs to send back to the client. The exact number, and the method used to determine when to send the sample, is determined according to the enemy throttle heuristics used (see the following subsection).
. If there are no links of use (or N = 0 in the previous step), a random jump is performed.
(3) Step 3. Each URL sent to the client is used to represent an enemy. When the player shoots an enemy, its associated link is returned to the server, where its page is crawled, and the process begins again. Thus, AlienBot selects its URLs using a combination of programmatic and user interaction.

Figure 3 AlienBot's random walk

Enemy throttling
Each time the player shoots one enemy, the associated URL is sent to the server, the Web page referenced by the URL is downloaded, and all URLs on that page are retrieved ready for submitting back to the client to generate new enemies. Accordingly, each time one enemy is shot, many more will be generated, thereby completely overwhelming the player. To prevent this, some form of enemy throttling heuristic is used, where only some small subset of URLs is returned. In our prototype of AlienBot, a random subset of the links found is returned, with the number returned calculated by R = (N mod 5) + 1, where R is the number of links sent back to the game and N the number of resolved links on the page that remain after database checks have been made; a sketch of this server-side step is given after the list below. However, other throttling heuristics include:
. URL queue. The server returns all URLs parsed from a page back to the client, which stores them in a queue. Once the queue is full, the server can be told to wait, or existing URLs in the queue can be deleted and replaced by newer ones.
. URL delay. URLs are only returned to the client every m minutes, or every h hits, etc., where m and h can be tuned according to the playability characteristics required.
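The Python below is a simplified stand-in for the original PHP script, following the operations listed above: download the page, resolve and filter its links, skip pages already crawled, then return R = (N mod 5) + 1 randomly chosen links. The regular-expression link extraction, the UNWANTED_EXTENSIONS list and the in-memory crawled set are assumptions made for illustration, not the authors' implementation.

import random
import re
from urllib.parse import urljoin
from urllib.request import urlopen

UNWANTED_EXTENSIONS = (".exe", ".zip", ".pdf", ".jpg", ".gif")
crawled = set()   # stands in for the database of already-crawled URLs

def process_shot_url(page_url):
    """Return the list of links sent back to the game for one shot alien."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    # Step 1: find links and resolve them to absolute URLs.
    hrefs = re.findall(r'href=["\'](.*?)["\']', html, flags=re.IGNORECASE)
    links = [urljoin(page_url, h) for h in hrefs]
    # Remove links deemed to be of no use and pages already crawled.
    links = [u for u in links
             if not u.lower().endswith(UNWANTED_EXTENSIONS) and u not in crawled]
    # Step 2: record the URLs, then select a random throttled sample.
    crawled.update(links)
    n = len(links)
    if n == 0:
        return []            # caller performs a random jump instead
    r = (n % 5) + 1          # enemy throttle: R = (N mod 5) + 1
    return random.sample(links, min(r, n))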
Results
Analyzing the Web crawler's performance
In order to test the Web crawler, we ran it between 28 April 2003 and 29 July 2003, with some 31 players playing the game intermittently. In all, some 160,000 URLs were collected. After the testing process was complete, we analysed the Web pages referenced by these URLs, and used the statistics obtained to compare the results generated from AlienBot with those of other Web crawlers. Figures 4 and 5 show the distribution of out-links (i.e. those on a Web page that reference another Web page) across the different Web pages crawled by AlienBot, and give a good indication of the underlying structure of the Web Graph. Both results clearly show the power law that exists in the Web Graph, and compare well with similar results by Broder et al. (2000), Henzinger et al. (2000), Barabási and Bonabeau (2003), and Adamic and Huberman (2001). In particular, the line of best fit for Figure 5 reveals an exponent of 3.28, which compares well with Kumar et al.'s (2000) value of 2.72, obtained with a crawl of some 200 million Web pages. These results therefore validate the effectiveness of our Web crawling heuristics in accurately traversing the Web Graph.
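The exponent of such a line of best fit can be estimated, in principle, by a least-squares fit to the log-log plot of the out-link distribution. The sketch below is illustrative only: the histogram is synthetic data, not the AlienBot sample, and the fitting routine is a generic one rather than the exact method used for Figure 5.

import math

def power_law_exponent(degree_counts):
    """Estimate alpha in P(k) ~ k^(-alpha) from {out-degree: number of pages}."""
    xs = [math.log(k) for k, c in degree_counts.items() if k > 0 and c > 0]
    ys = [math.log(c) for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope   # the slope of the log-log line of best fit is -alpha

# Made-up illustrative histogram: number of pages with k out-links.
sample = {1: 8000, 2: 1000, 4: 125, 8: 16, 16: 2}
print(round(power_law_exponent(sample), 2))   # roughly 3.0 for this synthetic data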
Figure 4 Distribution of OutLinks
Figure 5 Log-log plot revealing power law

Analyzing the game's performance
The design of the AlienBot architecture introduced no detectable latency to the gameplay from the round-trips to the server, and the unpredictability of the number of aliens to be faced next certainly added to the game's playability. In particular, revealing the URL associated with an alien that the user had just shot added a new and interesting dimension to the game, as it gave a real sense of the crawling process that the user was inadvertently driving (see Figure 6).

Figure 6 Randomly walking the world

From the game's perspective, the power law of the distribution of links crawled by AlienBot added to the surprise factor in terms of the number of enemies generated. Power law distributions are characterised by scale-free properties, in which a high percentage of one variable (e.g. number of links) gives rise to a low percentage of another (e.g. Web pages), with the ratio of the percentages remaining (roughly) constant at all scales. Thus, a high number of links will be found on a small number of Web pages, which produces the surprise (in terms of enemy generation) at unknown intervals. This was the stochastic process we aimed to capture by using the results from the crawl to drive the game's enemy generation strategy. The validation of the crawler shows that we accurately captured the stochastic process, while the (albeit subjective) playability of the game revealed the benefit of the whole approach. An additional advantage of using the Web Graph to drive the heuristics of the enemy generation process will only become evident over time. Specifically, the Web Graph is not static; rather, it is dynamically evolving as new Web pages are added and others deleted. The patterns that have been discovered in this structure will not necessarily persist over time. We have already discovered evidence for this in the link structure of Web logs or Blogs (online journals), which are much more richly connected than traditional Web pages. As the structure of the Web Graph itself begins to change, so too will the behaviour of the application it drives. From AlienBot's perspective, it means the enemy generation heuristics will change over time. Thus the same game you play today may play very differently in five years' time.
Analysis of the URL sample set
In addition to validating the crawling heuristics, we also used the data from our URL sample to provide some results from the Web Graph, and to take some Web metrics.
Comparison of the AlienBot data set with the Google index
Results from random walks have been used in other studies to compare the coverage of various search engines by calculating the percentage of pages crawled by a random walk that have not been indexed by the search engine (Henzinger et al., 1999). As such, we compared the results of AlienBot with those of Google, and discovered that 36.85 per cent of the pages we located are not indexed by Google, suggesting that Google's coverage represents 63.15 per cent of the publicly available Web (see Table I). This compares with an estimate of 18.6 per cent that Lawrence and Giles (1999) found in 1999. Given that Google's coverage extends to 3,307,998,701 pages (as of November 2003), we estimate the size of the Web to be 5,238,319,400 pages, or 6.5 times its estimated size in 1998 (Lawrence and Giles, 1999).
Table I AlienBot pages indexed in Google

                    Number of pages    Percentage of pages
In Google           15,296             63.15
Not in Google       8,927              36.85
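The size estimate follows from simple arithmetic: if the random walk suggests that Google covers about 63.15 per cent of the publicly available Web, the total is Google's index size divided by that coverage. A minimal sketch of the calculation, using the figures quoted above, is:

# Figures quoted above (November 2003).
in_google, not_in_google = 15_296, 8_927
google_index_size = 3_307_998_701

coverage = in_google / (in_google + not_in_google)      # ~0.6315
estimated_web_size = google_index_size / coverage       # ~5.24 billion pages

print(f"coverage ~ {coverage:.2%}, estimated Web size ~ {estimated_web_size:,.0f} pages")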
Miscellaneous page statistics
Of the 160,000 URLs we collected, we downloaded 25,000 at random to sample the technologies that are currently prevalent on the Web. As can be seen in Table II, 61.74 per cent of all Web pages use a scripting language. Of these pages, 99.11 per cent use JavaScript, and only 0.89 per cent use Microsoft's VBScript. Furthermore, we found only 1.91 per cent of pages use Macromedia's Flash.

Table II Web page statistics

                                Number of pages    Percentage of pages
JavaScript                      15,214             60.86
VBScript                        222                0.89
Total pages with scripting      15,436             61.74
Total pages using Flash         477                1.91

Page validation statistics
Web pages are marked up using some variant of hypertext markup language (HTML), of which there are several versions up to the most current, version 4.01. More recently, the markup language has been reformulated into eXtensible HTML (XHTML), an XML-formatted version of HTML 4.01. XHTML brings with it many features that can be leveraged by programs such as AlienBot that attempt to gain information from Web pages. For example, its syntax is much stricter than HTML, leaving less room for ambiguity; it can be easily validated with a wide variety of existing XML validation tools; and it shifts presentational elements of the page into separate stylesheet files, and so cleans out much of the clutter from the content of the page. However, XHTML is both more difficult to write manually, and also more recent than the widely adopted HTML 4.01, two factors that could significantly retard its adoption. Any application that attempts to parse a Web page must therefore be able to cope with the variety of markup language variants that now exist across the Web. Clearly this makes writing such an application more complex than if there were just one standard markup language in existence. However, the situation is exacerbated by the fact that many Web pages are not even valid according to the version of (X)HTML they claim to be compliant with. This was a fact we identified early on in the development of AlienBot. In order to see the extent of the problem, we used the pages we had downloaded to determine how many were valid, and which variants of (X)HTML were most
compliant with the standard they claimed to adhere to. To do this, we used the W3C's HTML Validator tool (W3C, 2004) to validate those pages found by AlienBot that contained a doctype (a declaration used in an (X)HTML page to specify the version of (X)HTML being used). Using a random sample of 34,158 pages from the total set found by AlienBot, we found 15,313 (44.83 per cent of the total sample) with a doctype. From these pages it was possible to get some response from the HTML Validator for 14,125 of them. However, as can be seen from Table III, only 973 pages, or 6.89 per cent of the pages with a doctype, were found to be valid according to their specified (X)HTML version.

Table III Page validation statistics

                             Number of pages    Percentage of pages
Valid                        973                6.89
Invalid                      5,966              42.24
Errors prevent validation    7,186              50.87

Table IV breaks down the valid Web pages according to the version of (X)HTML they comply with. As can be seen, pages using both XHTML 1.0 Transitional and HTML 4.01 Transitional were the most likely to be valid, with 40.08 per cent and 38.95 per cent respectively being valid. XHTML 1.0 Strict and HTML 4.01 Strict scored a poor 11 per cent and 1.64 per cent respectively, which is particularly disappointing, as the term "Strict" requires absolute conformance to the specification. Clearly this is far from the case.

Table IV Doctypes used on valid pages

                          Number of pages    Percentage of valid pages
XHTML 1.0 Transitional    390                40.08
HTML 4.01 Transitional    379                38.95
XHTML 1.0 Strict          107                11.0
HTML 4.0 Transitional     48                 4.93
HTML 4.01 Strict          16                 1.64
XHTML 1.1                 15                 1.54
HTML 3.2                  14                 1.449
XHTML 1.1 Strict          2                  0.21
HTML 4.0 Frameset         1                  0.10
HTML 2.0                  1                  0.10

From these statistics, it is clear that Web page parsing applications, be they browsers or Web crawlers such as AlienBot, have a difficult challenge ahead of them in trying to interpret correctly the largely poorly-formed Web pages they must work with. With (X)HTML still evolving, and further versions of the markup language being continuously developed, the situation can only get more complex. It remains to be seen how such loose compliance with standards will affect the
semantic Web, with its new markup languages (specifically OWL, the Web Ontology Language) designed for machine inference rather than simple presentation of content.
Issues and further work
Ideally, AlienBot would have gathered a larger sample of documents to prevent any bias towards pages in the neighbourhood of the seed set of URLs. Furthermore, AlienBot's simple content filter, based on a list of unwanted file extensions, is not ideal, and has allowed non-HTML pages to make it into the data set, thereby adding some unwanted noise to the results. A future version could make use of HTTP's Content-Type header in order to filter out pages that are not created in some HTML-based mark-up, as sketched below. However, the results from the crawl show that these limitations did not introduce a significant bias into the crawler when compared with results from other studies, particularly considering the relatively small number of Web pages in our study.
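A minimal sketch of the suggested Content-Type check is given below. It is written in Python rather than the original PHP and is purely illustrative: before parsing a page, the crawler requests only the response headers and skips anything not served as HTML or XHTML.

from urllib.request import Request, urlopen

HTML_TYPES = ("text/html", "application/xhtml+xml")

def looks_like_html(url):
    """Return True if the response's Content-Type header declares an (X)HTML page."""
    request = Request(url, method="HEAD")          # headers only, no body download
    with urlopen(request) as response:
        content_type = response.headers.get("Content-Type", "")
    return content_type.split(";")[0].strip().lower() in HTML_TYPES

# Example (hypothetical caller): skip anything that is not HTML before parsing.
# if looks_like_html(url):
#     parse_page_for_links(url)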
Conclusion
We have presented a novel application of the Web Graph, using it to influence the behaviour of an application, rather than simply traversing it. The Web Graph is traversed by a Web crawling agent we developed called AlienBot, the output of which is orthogonally coupled to the enemy generation strategy of a computer game. The Web Graph guides AlienBot, causing it to generate a stochastic process, which is used in the enemy generation strategy of the game to add novelty and long-term dynamic game play. We have validated the crawling process by recovering the power law inherent in the Web Graph that has been revealed by other, much larger crawls, and shown the effectiveness of such unorthodox coupling to the playability of the game and the heuristics of the Web crawler. In addition, we have presented the results obtained from crawling the Web using AlienBot. Our results show that 61.74 per cent of Web pages use some form of scripting technology; that the size of the Web can be estimated at just over 5.2 billion pages; and that less than 7 per cent of Web pages fully comply with some variant of (X)HTML. From the perspective of the game, we envisage other game genres employing AlienBot's principles just as effectively, as the results from the stochastic process can be fed into any property of an enemy, not just its heuristics. For example, the enemy's size, strength, colour, even the terrain of the playing landscape, can all be generated stochastically from the Web Graph. Such an environment would allow the user to explore the Web in a completely new way, with the environment changing and evolving as it reflects the underlying link structure of the Web Graph, and changing over longer periods of time, too, as the Web Graph itself evolves. By reversing the relationship between the Web Graph and application, we have demonstrated how the Web Graph can become a potent source of evolving data from which many novel game ideas can be developed.
References
Adamic, L.A. and Huberman, B.A. (2001), "The Web's hidden order", Communications of the ACM, Vol. 44 No. 9, September, pp. 55-60.
Albert, R., Jeong, H. and Barabási, A.-L. (1999), "The diameter of the World Wide Web", Nature, Vol. 401, p. 130.
Baldi, P., Frasconi, P. and Smyth, P. (2003), Modeling the Internet and the Web, John Wiley & Sons, London.
Barabási, A.-L. and Albert, R. (1999), "Emergence of scaling in random networks", Science, Vol. 286, pp. 509-12.
Barabási, A.-L. and Bonabeau, E. (2003), "Scale-free networks", Scientific American, Vol. 288 No. 5, May, pp. 50-9.
Bar-Yossef, Z., Berg, A., Chien, S., Fakcharoenphol, J. and Weitz, D. (2000), "Approximating aggregate queries about Web pages via random walks", Proceedings of the 26th International Conference on Very Large Databases, September, pp. 535-44.
Broder, A.Z., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J. (2000), "Graph structure in the Web", Ninth International World Wide Web Conference, Computer Networks, Vol. 33 No. 1-6, pp. 309-20.
Henzinger, M.R., Heydon, A., Mitzenmacher, M. and Najork, M. (1999), "Measuring index quality using random walks on the Web", Proceedings of the 8th International World Wide Web Conference, Toronto, May, pp. 213-25.
Henzinger, M.R., Heydon, A., Mitzenmacher, M. and Najork, M. (2000), "On near-uniform URL sampling", Proceedings of the 9th International World Wide Web Conference, Amsterdam, May.
Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A. (1999), "The Web as a graph: measurements, models, and methods", Proceedings of the International Conference on Combinatorics and Computing, Lecture Notes in Computer Science, Vol. 1627, Springer, Berlin.
Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A.S. and Upfal, E. (2000), "The Web as a graph", Proceedings of the 19th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS).
Lawrence, S. and Giles, C.L. (1999), "Accessibility of information on the Web", Nature, Vol. 400, 8 July.
W3C (2004), "The W3C markup validation service", available at: http://validator.w3.org/
Multi-dimensional-personalisation for location and interest-based recommendation
Steffen W. Schilke, Udo Bleimann, Steven M. Furnell and Andrew D. Phippen
The authors
Steffen W. Schilke is a PhD Student, Member and Researcher, Network Research Group, University of Plymouth, Plymouth, UK, and the Institute of Applied Informatics Darmstadt (AIDA), University of Applied Sciences Darmstadt, Darmstadt, Germany. Udo Bleimann is a Managing Director, AIDA, Darmstadt, Germany. Steven M. Furnell is Head and Andrew D. Phippen is a Senior Lecturer, both at the Network Research Group, University of Plymouth, Plymouth, UK.
Keywords Internet, Information services, Mobile communication systems
Abstract During the dot com era the word “personalisation” was a hot buzzword. With the fall of the dot com companies the topic has lost momentum. As the killer application for UMTS has yet to be identified, the concept of multi-dimensional-personalisation (MDP) could be a candidate. Using this approach, a recommendation of online content, as well as offline events, can be offered to the user based on their known interests and current location. Instead of having to request this information, the new service concept would proactively provide the information and services – with the consequence that the right information or service could therefore be offered at the right place, at the right time. Following an overview of the literature, the paper proposes a new approach for MDP, and shows how it extends the existing implementations.
Internet Research Volume 14 · Number 5 · 2004 · pp. 379-385 © Emerald Group Publishing Limited · ISSN 1066-2243 DOI 10.1108/10662240410566980
1. Introduction
This paper introduces a next-generation personalisation approach which continues the approaches of most previous personalisation projects. Besides the personalisation efforts, the location of the user and a temporal component will be taken into account. These can be considered to be the main dimensions used for this new personalisation approach – hence the name "multi-dimensional-personalisation" (MDP). Other issues have to be taken into account and will be mentioned later.
"Personalisation" was a hot term during the dot com era. In 1999, at a Gartner Group symposium, it was predicted that ". . . by 2003, nearly 85 percent of global 1,000 Web sites will use some form of personalization (0.7 probability)" (Abrams et al., 1999). It seems that this prediction did not come true. Through the meltdown of the dot com companies, a lot of the hype faded. Nowadays personalisation is seen in the broader context of an "adaptive interface", of which the first two levels (i.e. the conceptual and semantic levels) represent the personalisation; the other two levels are called syntactical and lexical (adapted from Foley and van Dam, 1990).
This paper initially shows how the terms have been defined and used. Their relationship and samples of the application of these approaches are presented. It then presents how the new approach is extended to combine the dimensions of user interest, location and time for recommending and presenting the right information or service proactively. The approach permits the offering of personalised views on information and services in the online as well as the offline world. Additionally, it allows a user to have one specific site personalised and extends this approach to all participants (online as well as offline). One of the key issues is to organise information in a way that a system can use to support the user and pass a selection/recommendation to the user, depending on the active user model/personality.
2. Dimensions for personalisation
The proposed system applies three main dimensions: interest (on which personalisation is usually based), location and time. It is necessary to relate these to previous approaches in order to show the extended reach of this idea fully. The first dimension is interest, which is the main factor used for personalisation nowadays. The interests of the user are taken into account to
present the information or services the user should be interested in. It is based on various pieces of information about the user, for example his/her past behaviour, to which different filtering and recommendation techniques are applied. The second dimension is the location of the user, which allows the extension of the reach of this approach beyond a desktop-bound application to the mobile world. Events and services from both the online and offline world can be taken into account. This makes it necessary to take the end-user device and its capabilities into consideration. For example, the device might have a low-bandwidth connection or other technical limitations. The third main dimension is time. This dimension can be applied in different ways. The system could synchronise with the schedule of the user to recommend events that fit the (future) schedule of the user. Another application could be to apply the knowledge of a timeline in a curriculum of a degree programme to be able to recommend information or events which suit the progress of the student. From the perspective of a marketing approach this dimension could be used by evaluating the behaviour of the user (e.g. the daily way to and from work) to offer, for example, a happy hour special or a concert in a bar or club which fits the usual time the user passes a certain location (see Figure 1). Additionally the user might want to have different "personalities" (known in standard mobile phones nowadays as profiles, i.e. for the local personalisation of the mobile phone itself) which are easily available to him/her (e.g. to switch between a private or business profile). Within such an MDP profile the user can distinguish between his/her "private" personality and interests and the
“business person” with the matching interests. Naturally a user can define more than these profiles. These profiles would keep the information of the interests of this profile to allow targeted delivery of recommendation to the active profile. Obviously there would be some kind of intersection of active and inactive profiles, e.g. the time and location component would apply to all profiles of a user at the same time. A user might want to choose that even if he/she “is” a business person certain recommendations be based on the “private” profile and should be brought to his/her attention. This could be based on a scoring/ prioritisation system or time frames (e.g. during lunch time the recommendations for the “private” personality will get promoted). Besides the context, the user should be able to define a kind of “Mood” or “Situation” (user model/personality/profile) which acts as a threshold or prioritisation, e.g. to prevent being disturbed during an important business meeting. In addition these profiles can contain information about the device used, the bandwidth available to the device and the capabilities of the device to distinguish between the types of media provided. As the different technologies come closer together there might has to be an automation which adapts to the environment where the user is in. For example, if the user is in the countryside his/her device can function only with a GPRS-based connection resulting in limitations in bandwidth. By moving towards a bigger city the device can log on into a UMTS cell to have a higher bandwidth available. In another case the user might be located inside a building that allows him/her to utilize, for example, a blue tooth or WLAN connection for his/her device that provides an higher bandwidth. Another issue which has to be taken into account is how the person is moving at the time, i.e. a person walking around as a pedestrian has a greater resistance to walking more than a mile to a recommended event then a person riding a bicycle, whereas if the person is using a car the temptation to go to a recommendation which is five miles away would probably work. These cell sizes are a factor which is important for the recommender as well as the user of such a system (see Figure 2). In the next sections the definitions of the terms will be presented and extended to fit the new MDP approach.
Figure 1 The extended form of personalisation based on interest, location and time
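To illustrate the cell-size idea (see Figure 2), the sketch below varies the acceptable distance to a recommended event with the way the user is moving. The radii, coordinates and function names are invented for the example and are not part of the proposed system.

import math

# Assumed recommendation radii (in km) for different ways of moving.
RECOMMENDATION_RADIUS_KM = {"pedestrian": 1.0, "bicycle": 3.0, "car": 8.0}

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def within_cell(user_pos, event_pos, mode):
    """Is an event close enough to recommend, given how the user is moving?"""
    return distance_km(*user_pos, *event_pos) <= RECOMMENDATION_RADIUS_KM[mode]

# A concert roughly 2.5 km away is recommended to a cyclist but not to a pedestrian.
user, concert = (51.45, -4.15), (51.4725, -4.15)
print(within_cell(user, concert, "bicycle"), within_cell(user, concert, "pedestrian"))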
2.1 Personalisation as a concept
Personalisation should not be confused with customisation. Customisation usually deals with the appearance of a Web site (e.g. colours, fonts, or which information goes where and how it will be displayed), for example, "In customization, user
Figure 2 Cell-based MDP recommendation – cell size depends on way of moving
controls and customizes the site or the product based on his/her preferences" (Mobasher et al., 2001). A summary by Allen et al. (2001) states: "On the Web, the difference between customisation and personalisation usually comes down to who is in control of the content". In customisation the user is in control of the appearance and not in control of the content. The user usually can only control what content goes where by customising (e.g. my.yahoo.com). Nowadays there are groups referring to personalisation and customisation as "interface adaptation" instead. The term adaptive is used quite a lot today, but the differentiation between personalisation and customisation is still a valid approach:
. . . the Web is ultimately a personal medium in which every user's experience is different than any other's (Schwartz, 1997).
The term can be defined as ". . . the task of making Web-based information systems adaptive to the needs and interests of individual users, or groups of users. Typically, a personalized Web site recognizes its users, collects information about their preferences and adapts its services, in order to match the users' needs" (Pierrakos et al., 2001). In 1998 the idea of adaptive Web sites was defined as sites which ". . . automatically improve their organization and presentation . . ." (Perkowitz and Etzioni, 1998). Pierrakos et al. (2003) stated that:
One way to expand the personalization of the Web is to automate the adaptation of Web-based services to their users.
Kahabka et al. (1997) defined:
The aim of personalisation is to select data whose content are most relevant to the user from a greater volume of information and to present them in a suitable way for the user.
For the Internet community or industry "Personalisation is increasingly considered to be an important ingredient of Web applications. In most cases personalization techniques are used for tailoring information services to personal user needs". Personalisation ". . . is done automatically based on the user's actions, the user's profile, and (possibly) the profiles of others with 'similar' profiles" (Mobasher et al., 2001). Unfortunately most personalisation systems are mainly driven by a one-dimensional approach. This is expressed by a statement from Abowd and Mynatt (2000) which states that:
Most context-aware systems still do not incorporate knowledge about time, history (recent or long past), other people than the user, as well as many other pieces of information often available in our environment.
In this statement one issue is left out, as the system could use information about the future of the user by actively using information from a diary or schedule. Common sense leads to the thought that a system that is offering the most relevant information for the user in a given situation (e.g. determined by location, time, interests of the user, etc.) will be more successful than another system offering only a standard view on the information. In addition it seems necessary to expand the reach even further from the online to the offline world. This is where the new concept called MDP can provide significant benefits to the user. It is an approach to support the user in coping with massive information overflow. The online world as well as the offline world provides a vast array of opportunities, information and services or events that might be relevant to the user. The main problem nowadays is to get the right information at the right time, at the right place and in the right format.
2.2 The dimension location
Mobility has become a buzzword, much like personalisation. It has to be taken into account that "mobile distributed environments applications often need to dynamically obtain information that is relevant to their current location" (José and Davies, 1999). The ". . . location awareness is a key factor for mobile commerce's success, because it can contribute to a system's ease of use in many ways" (Zipf, 2002). Bob Egan, vice president Mobile & Wireless, from the Gartner Group, has said that:
The Internet will not be successfully translated to the mobile world without location awareness, which is a significant enabler in order to translate the Internet into a viable mobile economy.
Unfortunately ". . . the current design of the Internet does not provide any conceptual models for addressing this issue" (José and Davies, 1999). As stated above, the location can be an important parameter which is necessary to offer the user the appropriate information or services. Even when used from a desktop system it can be necessary to provide location information to use the offer of a Web site successfully. For example, a price comparison Web site allows the user to search for a DVD from a catalogue and request a price comparison which includes the shipping information. In order to do so the system needs information about the location of the user to identify the shipping costs from each shop.
2.3 Time as dimension
This temporal dimension has not been recognised much in earlier approaches. One well-known scenario is a television guide Web site that offers information on television shows ahead of time by combining the interest of the user with a given time frame. Other examples are an event guide or a database with "calls for papers". The time dimension is an important component for this approach as it allows the tracking of typical user behaviour together with the location dimension. Extending the reach into the future is possible by using the schedule and appointments listed to offer information about events which will happen around another event in the "neighbourhood" of this location. This is especially true if you consider offline events. It seems obvious that the dimension time does not make sense applied alone. It is usually used directly or indirectly combined with one or both of the other dimensions.
2.4 Examples of the application of the different dimensions
My Yahoo! is a mixture of customisation and personalisation, whereas Amazon has a personalisation system for book recommendations which depends on the user's profile and the purchasing patterns of users with a similar purchasing and interest history. Personalisation of book recommendations has been performed in the past by bookshop staff, who remember the preferences of the customer and proactively offer books which suit the taste of that customer. My Yahoo! ". . . is a customized personal copy of Yahoo! Users can select from hundreds of modules, such as news, stock prices, weather, and sports scores, and place them on one or more Web pages. The actual content for each module is then updated automatically, so users can see what they want to see in the order they want to see it. This provides users with the latest information on every subject, but with only the specific items they want to know about" (Manber et al., 2000). It allows a user to customise his/her page depending on his/her interests. Some of the personalisation happens ". . . inside the modules. For example, users can choose which TV channels they want to include in their TV guide in addition to which local cable system they use" (Manber et al., 2000). By choosing the local cable system a second dimension, location, will be included by considering the television interests and the location of the user. A similar approach was taken in the PTV project:
The basic idea behind PTV [Personalised Television] is the ‘online personalised TV guide’. That is, PTV is a television guide, listing programme viewing details just like any other guide, but with one important difference, the listed programmes are carefully selected to match the viewing preferences of individual subscribers. In short, every subscriber sees a different guide, a guide that has been carefully constructed just for them, taking account of their programme preferences, their preferred viewing times, and their available channels. Crucially, PTV can inform users about programmes that they may be interested in watching (Smyth et al., 1998).
3. MDP concept Most context-aware systems still do not incorporate knowledge about time, history (recent or long past), other people than the user, as well as many other pieces of information often available in our environment (Abowd and Mynatt, 2000).
MDP is an approach to support the user in coping with massive information overflow. There are the main dimensions (time, interest and location) and minor issues like bandwidth (e.g. GPRS, HSCSD, UMTS, LAN, WLAN), format/medium (from plain text format to rich media formats, depending on the client, available bandwidth or hardware), priority (how important a piece of information is) and cost (costs associated with information or an event). Besides these dimensions and issues there are security and trust concerns which have to be considered. As mentioned before, the main dimensions for such a new personalisation approach are:
. The time dimension. This is comparable to a calendar or schedule. The user has a certain repeating behaviour (always in a similar time frame, e.g. the way to/from work, lunchtime, etc.) or plans some trips ahead. The MDP would build on this information and would allow a permission-based recommendation
taking into account the interest of the user and the location of the user. This would permit recommendations for future events as well as events which fit the regular schedule of the user.
. The location dimension. This takes the "moving" pattern of the user into account. Regardless of whether the user is using a desktop PC, a notebook, or a mobile device like a PDA or smart phone, he/she will be "somewhere". Whether at home, at work or on the road, there will always be interesting things or information related to this user. Combined with the other dimensions it is possible to offer recommendations "just in time" at the right place. Even planning ahead in time would be possible. Recurring moving patterns of a user can be tracked and used for recommendation based on the user's location.
. The interest dimension (or "personalisation" in prior approaches). This addresses what the user is interested in. This can range from business or commercial interests which are related to the job or studies, to private interests like hobbies. These interests can be grouped in profiles to allow switching and prioritising between them (see the sketch after this list).
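The following data-structure sketch is entirely illustrative; the field names, weights and the simple scoring rule are assumptions rather than part of the proposal. It shows one way the switchable profiles and the three dimensions could be combined into a single recommendation score.

from dataclasses import dataclass, field

@dataclass
class Profile:
    """One user 'personality' (e.g. private or business) with its interests."""
    name: str
    interests: dict = field(default_factory=dict)   # topic -> weight in [0, 1]

@dataclass
class Event:
    topic: str
    distance_km: float   # distance from the user's current cell
    start_hour: int      # simplified time dimension: hour of day

def score(event, profile, current_hour, max_distance_km=3.0):
    """Combine the interest, location and time dimensions into one score (0 = do not recommend)."""
    interest = profile.interests.get(event.topic, 0.0)       # interest dimension
    if event.distance_km > max_distance_km:                  # location dimension (cell size)
        return 0.0
    time_factor = max(0.0, 1.0 - abs(event.start_hour - current_hour) / 12)  # time dimension
    return interest * time_factor

# A private profile at 18:00: a nearby concert at 20:00 scores well, a distant workshop does not.
private = Profile("private", {"concert": 0.9, "workshop": 0.6})
print(score(Event("concert", 1.2, 20), private, current_hour=18))   # ~0.75
print(score(Event("workshop", 7.5, 19), private, current_hour=18))  # 0.0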
Minor issues as mentioned above can be taken care of during the implementation of the system, for example issues like the bandwidth of the communication, the technical capabilities of the device used to participate, etc. It seems that in existing systems, and in previous work or literature, such an approach has not been taken before. There are usually two main approaches – interest and location – used in such existing personalisation systems. The interest-based personalisation usually uses filtering techniques like content filtering, collaborative filtering, rule-based filtering, content mining, monitoring of the surf behaviour, or topics of interest selected by the user, for the personalisation or recommendation to the user. These methods have to be extended to be applied in the MDP context and have to be taken into account for the proof-of-concept or implementation phase. The user shall be provided ". . . with the information they want or need, without expecting from them to ask for it explicitly" (Mulvenna et al., 2000). Besides this the content and services should be ". . . actively tailored to individuals based on rich knowledge about their preferences and behaviour" (Hagen et al., 1999). Nielsen (2002) writes on his Web site that the ". . . bottom line is that for enabling Smart Personalization techniques the application needs to recognize individuals, not computers". By taking this into account and considering that
personalisation usually happens only on one Web site or within a portal (e.g. in an intranet), the requirements should be clear. This new approach proposes services which will provide the user with the possibility to use his/her profile across all participating Web sites. As Schafer et al. (2001) write, "One classification of delivery methods is pull, push, and passive". In order to provide good services for the user it shall be an interactive solution providing the result of the MDP in the form of a proactive push to the user. This requirement was described in Chavez et al. (1998) as "an optimal assistant provides the required information autonomously and independently, without requiring the user to ask for it explicitly". The user shall get ". . . the right information at the right time and place – with minimal interaction" (Chavez et al., 1998). The location-based approach is nowadays generally used for mobile devices like mobile or smart phones. In such scenarios the information is generally used to navigate the user to a service or information provider. This is connected to a certain need or demand of the user (e.g. a pizza restaurant, a hotel or such things). This is mainly an "on demand" scenario, i.e. the user requests/pulls the information and has to select "what" he/she wants. The location-based personalisation provides the "where" information for the "what". A literature search has not found a proposed system which really combines these two dimensions in a personalisation engine. At present, there is no approach known to the authors that combines the third dimension, time, with the two other main dimensions. Another issue is that there are no real approaches known that try to bridge personalisation between the online and the offline world.
4. Research outlook
In order to deploy the MDP approach the building blocks have to be established. To do so, components have to be combined to offer this service. The classification of the interest is an important issue for the personalisation. A meta-hierarchy approach for the controlled vocabulary that describes the interest has to be investigated. The interest can be gathered from the user directly as well as collected from the usage behaviour of the user (e.g. by evaluating the Web usage, favourites, Web searches, output from recommendation systems, etc.). The geographical and temporal behaviour is another important issue for the MDP. This is a specific issue as data protection laws can cause problems in certain countries. One solution is to
store the profile anonymously and pass these data without a real reference to the participating and requesting server. This type of middleman can act as a "Chinese wall" in between the user and the service provider – i.e. the organisation that wants to offer a service or recommendation only deals with an anonymous profile (see Figure 3).

Figure 3 MDP multi-tier architecture to foster a trusted relation

Nowadays similar trust relationships already exist, e.g. the trust that a mobile phone user has in the mobile phone service provider, which has access to several data protection and security relevant data, such as location/moving behaviour (by evaluating the cells with which a mobile phone is registered), the numbers called, the services paid for or used via the mobile phone (e.g. toll numbers, WAP, iMode pay services or the use of premium SMS), and the address of the user (even if this is limited with prepaid mobile phone cards), to name a few. A Microsoft (2001) white paper stated this problem:
People are not in control of the technology that surrounds them. We have important data and personal information scattered in hundreds of places across the technology landscape, locked away in applications, product registration databases, cookies, and Web site user tracking databases.
By providing an open interface and data security a cross-platform and cross-Web site personalisation could be achieved. In order to use the geographical information about the user, the devices used have to be capable of gathering data about their location. Possible technology scenarios are GPS/Galileo, or location information via the mobile phone service provider (e.g. GSM cellular location, CellID, observed time difference of arrival (OTDOA), i.e. location information about a cell where the device is located). One approach/experiment which will be undertaken is to monitor volunteers with a GPS data logger device to gather the necessary data to analyse whether, to a great extent, there is a regular moving pattern which could be used for recommendations, even if the event is in the future (e.g. along the daily way from or to work). Whereas the location is important for the geographical recommendation, the temporal dimension has to be considered as well. In combination with the location, e.g. the daily way from and to work or a part-time lecture at a university can be identified. Future events would have to rely on a combination of time and location in a schedule (e.g. two weeks' holiday in Las Vegas from 13 January, or a business trip to London from 11 January). One issue for this is an extended calendar event description in digital form which not only gives information about the date and time, but also gives information about the location (e.g. the iCalendar extension SKiCAL, which allows the description of an event including location and time). Additionally it should cover a form of classification of the topic of the event in order to map it to matching interests. In order to reduce the payload on the device used, regarding bandwidth and storage, a multi-tier architecture has to be defined which allows the system to work even with low-bandwidth devices like a standard smart phone. This scenario would also allow transparent switching between different devices without having to synchronise the profiles of the user, as they would be kept and maintained on a server instead of on the device used. Besides that, the device could be implemented as a slim client to access the MDP system. It would also allow the filtering of recommendations for the user on the server side, which would otherwise take a lot of bandwidth and processing power on a mobile device. In addition it would allow the privacy of the user to be kept by providing recommending servers with only an anonymous profile, i.e. they would not have access to the user or his/her data/profile directly. This new approach of a personalisation engine which extends its reach from the online world towards the offline world seems to be very promising as it connects the separated worlds. As the user's interests would be considered in conjunction with his/her location, even when on the road (i.e. a roaming user profile), the system could provide a new convenience for users world-wide. By considering the regular/daily schedule and prescheduled events in the calendar, the MDP system is able to look ahead to recommend interesting events. The service would not only be applicable to the user. In addition it gives opportunities for service providers,
marketing and learning providers. Especially in the life-long learning field, a school could suggest to a busy student presentations and workshops along his/her daily and future schedule, even when he/she is away from his/her "home" location. In this scenario the time dimension applies twice – once in the schedule of the user and once in the progress along the timeline for the curriculum/study programme.
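To make the trusted-middleman idea concrete, the sketch below shows a broker that keeps the mapping from anonymous profile identifiers to real users on its own side, so a provider announcing an event only ever receives anonymous identifiers back. The class, its methods and the topic-only matching are invented for illustration and deliberately ignore the location and time dimensions.

import uuid

class MdpBroker:
    """'Chinese wall' between users and providers: providers see only anonymous profiles."""

    def __init__(self):
        self._users = {}      # anonymous id -> full user record (kept server-side)

    def register_user(self, contact, interests):
        anon_id = uuid.uuid4().hex
        self._users[anon_id] = {"contact": contact, "interests": set(interests)}
        return anon_id

    def match_event(self, event_topic):
        """Return the anonymous ids of profiles interested in the event's topic."""
        return [anon_id for anon_id, user in self._users.items()
                if event_topic in user["interests"]]

    def deliver(self, anon_id, message):
        # Only the broker can resolve the anonymous id back to a real contact.
        print(f"push to {self._users[anon_id]['contact']}: {message}")

broker = MdpBroker()
student = broker.register_user("student@example.org", {"e-learning", "concert"})
for anon in broker.match_event("e-learning"):          # a provider sees only anon ids
    broker.deliver(anon, "Workshop tonight near your route home")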
References

Abowd, G.D. and Mynatt, E.D. (2000), “Charting past, present, and future research in ubiquitous computing”, ACM Transactions on Computer-Human Interaction, Vol. 7 No. 1, pp. 29-58.
Abrams, C., Bernstein, M., de Sisto, R., Drobik, A. and Herschel, G. (1999), “E-business: the business tsunami”, Proceedings of Gartner Group Symposium/ITxpo, Cannes.
Allen, C., Kania, D. and Yaeckel, B. (2001), One-to-One Web Marketing: Build a Relationship Marketing Strategy One Customer at a Time, 2nd ed., John Wiley & Sons, New York, NY.
Chavez, E., Ide, R. and Kirste, T. (1998), “SAMoA: an experimental platform for situation-aware mobile assistance”, Proceedings of the Workshop on Interactive Applications of Mobile Computing.
Foley, J. and van Dam, A. (1990), Computer Graphics, Principles and Practice, 2nd ed., Addison-Wesley, Reading, MA.
Hagen, P.R., Manning, H. and Souza, R. (1999), Smart Personalization, The Forrester Report, Forrester Research, Cambridge, MA, July, p. 8.
José, R. and Davies, N. (1999), “Scalable and flexible location-based services for ubiquitous information access”, Proceedings of the 1st International Symposium on Handheld and Ubiquitous Computing (HUC’99).
Kahabka, T., Korkea-aho, M. and Specht, G. (1997), “GRAS: an adaptive personalization scheme for hypermedia databases”, Proceedings of the 2nd Conference on Hypertext-Information Retrieval – Multimedia (HIM ’97), Universitätsverlag Konstanz (UVK), Konstanz, pp. 279-92.
Manber, U., Patel, A. and Robison, J. (2000), “Experience with personalization on Yahoo!”, Communications of the ACM, Vol. 43 No. 8, August.
Microsoft (2001), “Building user-centric experiences – an introduction to Microsoft HailStorm”, Microsoft White Paper, March, available at: http://msdn.microsoft.com/theshow/Episode014/
Mobasher, B., Berendt, B. and Spiliopoulou, M. (2001), “PKDD 2001 tutorial: ‘KDD for personalization’”, paper presented at the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, Freiburg, 6 September.
Mulvenna, M.D., Anand, S.S. and Buchner, A.G. (2000), “Personalization on the Net using Web mining”, Communications of the ACM, Vol. 43 No. 8, August, pp. 123-5.
Nielsen, J. (2002), “Supporting multiple-location users”, Jakob Nielsen’s Alertbox, 26 May, available at: www.useit.com/alertbox/20020526.html
Perkowitz, M. and Etzioni, O. (1998), “Adaptive sites: automatically synthesizing Web pages”, Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI.
Pierrakos, D., Paliouras, G., Papatheodorou, C. and Spyropoulos, C.D. (2001), “KOINOTITES: a Web usage mining tool for personalization”, Proceedings of the Panhellenic Conference on Human Computer Interaction.
Pierrakos, D., Paliouras, G., Papatheodorou, C. and Spyropoulos, C.D. (2003), “Web usage mining as a tool for personalization: a survey”, User Modeling and User-Adapted Interaction, Vol. 13, Kluwer Academic Publishers, New York, NY, pp. 311-72.
Schafer, J.B., Konstan, J.A. and Riedl, J. (2001), Data Mining and Knowledge Discovery, Vol. 5 No. 1-2, pp. 115-53.
Schwartz, E.I. (1997), Webonomics, Broadway Books, New York, NY.
Smyth, B., Cotter, P. and O’Hare, G.M.P. (1998), “Let’s get personal: personalised TV listings on the Web”, paper presented at the 9th Irish Conference on Artificial Intelligence and Cognitive Science (AICS-98), Dublin, August.
Zipf, A. (2002), “Adaptive context-aware mobility support for tourists”, IEEE Intelligent Systems, Section on “Intelligent Systems for Tourism”, November/December.
Literati Club
Awards for Excellence

June Lu and Chung-Sheng Yu
University of Houston-Victoria, Victoria, Texas, USA
Chang Liu
Northern Illinois University, De Kalb, Illinois, USA
James E. Yao
Montclair State University, Upper Montclair, New Jersey, USA
are the recipients of the journal's Outstanding Paper Award for Excellence for their paper
“Technology acceptance model for wireless Internet”, which appeared in Internet Research: Electronic Networking Applications and Policy, Vol. 13 No. 3, 2003.

June Lu is an Assistant Professor in the Department of Management and Marketing, School of Business Administration at the University of Houston-Victoria, Victoria, Texas, USA. She teaches MBA-level MIS and e-commerce classes. Recently she has been an active researcher studying wireless mobile technology acceptance and mobile commerce in different cultural settings, besides research on the implementation of online learning systems. Her work has been published in Information & Management, Journal of Internet Research, International Journal of Mobile Communications, Journal of Delta Pi Epsilon, Journal of Computer Information Systems and other journals. June Lu received her doctoral degree from the University of Georgia.

Chung-Sheng Yu is an Assistant Professor in the School of Business at the University of Houston-Victoria. He received his Doctor of Business Administration from Mississippi State University. His research has mainly been in cross-cultural management, quality management, and mobile commerce. His articles have appeared in Current Topics in Management, Quality Management Journal, Management World, Scientific Management Review, Information & Management, International Journal of Mobile Communications, Journal of Internet Research and other journals.

Chang Liu is Associate Professor of Management Information Systems at Northern Illinois University. He received his Doctor of Business Administration from Mississippi State University. His research interests are electronic commerce, Internet computing, and telecommunications. His research has been published in Information & Management, Journal of Global Information Technology Management, Journal of Internet Research, International Journal of Electronic Commerce and Business Media, Journal of Computer Information Systems, Mid-American Journal of Business, and Journal of Informatics Education Research. He teaches database and electronic commerce courses.

James E. Yao is an Associate Professor of MIS in the Department of Information and Decision Sciences, School of Business, Montclair State University, New Jersey, USA. He received his PhD from Mississippi State University. He teaches database, management information systems, e-commerce, and other MIS courses. His research interests are in the areas of IT innovation adoption, diffusion, and infusion in organizations, e-business, e-commerce, and m-commerce. His research has been published in Journal of Internet Research, Journal of Computer Information Systems and other journals.
Note from the publisher
Helping academic institutions and business schools perform better
Management journals and more: Emerald Management Xtra launches in 2005

For 2005, Emerald Group Publishing Limited will be launching the world’s largest collection of peer-reviewed business and management journals, Emerald Management Xtra. As well as more than 100 research journals, the Emerald Management Xtra collection also features independent reviews of papers published in 300 of the world’s major English-language business journals, plus a newly developed range of online support services for researchers, teaching faculty, academic authors and others.

Emerald Management Xtra will be sold on subscription, primarily to libraries serving business schools and management departments, with online access to research papers and faculty services networked throughout an institution. Designed in consultation with an international working group of business academics, Emerald Management Xtra is intended to consolidate the library’s position at the heart of a university’s research and information needs.

In addition to nearly 40,000 research papers published in Emerald journals over the past ten years, Emerald Management Xtra features help for aspiring authors of scholarly papers, advice on research funding, reading lists and lecture plans for academic teaching faculty, and an online management magazine for business students and teachers. An online, moderated information exchange will also allow interaction with other researchers internationally.
Input and advice from heads of research and deans and directors of business schools have helped focus Emerald Management Xtra on academic productivity improvement and time-saving, which is seen as particularly useful given the increasing pressures on academic time today. Building on the success of the Emerald Fulltext collection, available in more than 1,000 business schools and universities worldwide, Emerald Management Xtra puts the latest thinking, research and ideas at users’ fingertips.

“Already, 95 of the FT Top 100 business schools in the world subscribe to Emerald journals or the Emerald Fulltext collection,” said John Peters, editorial director of Emerald. “We will find over the coming years that Emerald Management Xtra will be an essential part of any credible business school or management faculty. You can’t get all Emerald management and library information journals in full and in their current volumes from anywhere else; you can’t get the support services we offer to faculty and the academic community anywhere else. We can even help you plan your conference travel through the world’s largest independent register of academic conferences in business and management, Conference Central. We developed Emerald Management Xtra by asking academics, librarians and directors around the world what their biggest challenges were; we then put solutions in place to address them.”
Integrated search and publishing innovation

An important part of the Emerald Management Xtra collection is an integrated search facility covering Emerald-published papers together with Emerald Management Reviews, which features nearly 300,000 independently written summary abstracts from a further 300 of the world’s top business and management journals. This allows a quick and easy overview of the relevant management literature.

From 2005, both Emerald journals and Emerald Management Reviews will feature Structured Abstracts. A first in management publishing, structured abstracts provide a consistent structured summary of a paper’s aims, methodology and content. Used successfully in
many medical and scientific journals, structured abstracts allow a quicker assessment of an article’s content and relevance, thus saving researchers time and effort.
Value

Emerald is already recognised as a world leader in offering value for money from its online collections. Research undertaken by a large US library purchasing consortium showed that the Emerald portfolio was “best value” of any publisher collection, based on the ratio of usage and downloads to subscription price.

Unlike most other publishers’ collections, the Emerald Management Xtra portfolio is entirely subject-specific – focusing just on business and management (whereas other publisher collections contain a diverse range of subjects in sciences, social sciences, arts and so on). This means that Emerald Management Xtra gives superb value to business schools and management departments, with no subject redundancy.
Emerald Group Publishing Limited

Established for more than 30 years, Emerald Group Publishing has its headquarters in the north of England, with sales offices in the USA,
Malaysia, Australia and Japan. Emerald was founded as Management Consultants Bradford, later MCB University Press, as a spin-off from the University of Bradford business school, by faculty and alumni, and has maintained a tradition of having academics, former and current, as key decision-makers and advisers. Emerald journals are almost exclusively international in scope, with around 75 per cent of all authors coming from outside the UK.

More than 100 of the journals in the Emerald Management Xtra collection are traditionally peer-reviewed. The collection is completed by a small number of secondary digest titles and practitioner-focused magazine titles. Emerald Management Xtra includes long-established titles such as the European Journal of Marketing, Journal of Documentation, International Journal of Operations & Production Management, Management Decision, Personnel Review, Accounting, Auditing & Accountability Journal, and the Journal of Consumer Marketing.

Emerald Management Xtra is available now. Contact Emerald for more information.
E-mail: [email protected]
Tel: +44 1274 777700
Mail: Emerald Group Publishing Limited, Toller Lane, Bradford BD8 9BY, UK.

Rebecca Marsh
Head of Editorial