English Pages 443 [452] Year 2009
Knowledge & Information
Studies in Information Science

Edited by Wolfgang G. Stock (Düsseldorf, Germany) and Ronald E. Day (Bloomington, Indiana, U.S.A.)

Sonja Gust von Loh (Düsseldorf, Germany) – Associate Editor
Richard J. Hartley (Manchester, U.K.)
Robert M. Hayes (Los Angeles, California, U.S.A.)
Peter Ingwersen (Copenhagen, Denmark)
Michel J. Menou (Les Rosiers sur Loire, France, and London, U.K.)
Stefano Mizzaro (Udine, Italy)
Christian Schlögl (Graz, Austria)
Sirje Virkus (Tallinn, Estonia)

Knowledge and Information (K&I) is a peer-reviewed information science book series. Information science covers the representation, provision, searching and finding of relevant knowledge, including all activities of information professionals (e.g., indexing and abstracting) and users (e.g., their information behavior). An important research area is information retrieval, the science of search engines and their users. Topics of knowledge representation include metadata as well as the methods and tools of knowledge organization systems (folksonomies, nomenclatures, classification systems, thesauri, and ontologies). Informetrics is empirical information science and includes, among other things, domain-specific metrics (e.g., scientometrics, webometrics), user and usage research, and the evaluation of information systems. The sharing and distribution of internal and external information in organizations are research topics of knowledge management. The information market can be defined as the exchange of digital information on networks, especially the World Wide Web. Further important research areas of information science are information ethics, information law, and information sociology.
De Gruyter · Saur
Isabella Peters
Folksonomies
Indexing and Retrieval in Web 2.0
Translated from German by
Paul Becker
D 61
Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.

ISBN 978-3-598-25179-5
ISSN 1868-842X

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© Copyright 2009 by Walter de Gruyter GmbH & Co. KG, 10785 Berlin, www.degruyter.com
All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.
Printed in Germany
Printing: Strauss GmbH, Mörlenbach
Contents

Introduction
    Current State of Research
    Open Questions in Folksonomy Research
    Notes on the Book’s Structure
    Bibliography

Chapter 1 Collaborative Information Services
    Web 2.0 vs Social Software vs Collaborative Information Services
    Social Bookmarking Services
    E-Commerce
    Commercial Information Services
    Music-Sharing Services
    Libraries 2.0 – Museums
    Photosharing Services
    Videosharing Services
    Social Networks
    Blogs and Blog Search Engines
    Games with a Purpose (GWAP) – Tagging Games
    Summary
    Bibliography

Chapter 2 Basic Terms in Knowledge Representation and Information Retrieval
    Introduction to Knowledge Representation
    Paradigmatic and Syntagmatic Relations
    Ontologies
    Thesauri
    Classification Systems
    Nomenclatures
    Text-Word Method
    Citation Indexing
    Knowledge Representation in the World Wide Web
    Introduction to Information Retrieval
    Relevance Distributions
    Retrieval Models
    Text Statistics
    Vector Space Model
    Probabilistic Model
    Link Topology – Kleinberg Algorithm and PageRank
    Information Linguistics – NLP
    Similarity Coefficients and Cluster Analysis
    Network Model
    Bibliography

Chapter 3 Knowledge Representation in Web 2.0: Folksonomies
    Definition of the Term ‘Folksonomy’
    Tags – Users – Resources
    Cognitive Skills
    Broad vs Narrow Folksonomies
    Collective Intelligence
    Tag Distributions
    Users’ Tagging Behavior
    Tag Categories
    Tag Recommender Systems
    Advantages and Disadvantages of Folksonomies in Knowledge Representation
    Problem-Solving and Structuring Endeavors in Folksonomies
    Tag Gardening in Knowledge Representation
    Traditional Methods of Knowledge Representation vs Folksonomies
    Outlook
    Bibliography

Chapter 4 Information Retrieval with Folksonomies
    The Relation between Knowledge Representation and Information Retrieval
    Searching vs Browsing vs Retrieving
    Information Filters – Information Filtering – Collaborative Filtering
    Folksonomy-Based Recommender Systems in Information Retrieval
    Retrieval Effectiveness of Folksonomies
    Visualizations of Folksonomies
    Disadvantages of Folksonomies in Information Retrieval
    Query Tags as Indexing Tags
    Relevance Ranking in Folksonomies
    Power Tags
    Tag Gardening in Information Retrieval
    Outlook
    Bibliography

Conclusion
    Bibliography

Index of Names
Subject Index
Introduction
We used to rely on philosophers to put the world in order. Now we’ve got information architects. But they’re not doing the work – we are. --Bruce Sterling, Wired Magazine, 2005.
Five years ago, most librarians, archivists, documentalists and information architects would hardly have dared to dream that one of the main areas of their daily work would gain the recognition it now enjoys in theory and practice, as well as among internet users. Nor could they have predicted that its trail would be blazed not by themselves, the experts, but mainly by laypeople. I am talking about the indexing of digital information resources via ‘tags,’ or user-generated descriptors. A study conducted by the Pew Research Center (Rainie, 2007) revealed that 28% of internet users have already indexed online content with tags, and 7% stated that they do so several times over the course of a typical day online. But why do folksonomies, the collections of user-generated tags, attract such attention? “One cannot help but wonder whether such enthusiasm for metadata would be the same if people were asked to use only prescribed and standardized vocabularies” (Spiteri, 2005, 85).

Folksonomies are part of a new generation of tools for the retrieval, deployment, representation and production of information, commonly termed ‘Web 2.0.’ In Web 2.0 it is no longer just journalists, authors, web designers or companies who generate content – every user can do so via numerous online services and through various media, such as photos, videos or text. eMarketer (2007) estimates that by 2011, more than 200 million users worldwide will be contributing to the internet’s content. But even today, the growth rates of certain online services are impressive: the social bookmarking service del.icio.us has about 90,000 registered users (Al-Khalifa, Davis, & Gilbert, 2007) and holds around 115m posts linking to anywhere between 30 and 50m URLs – and the data pool grows by 120,000 URLs every day.
The photo-sharing service Flickr hosts around 2 billion pictures (Oates, 2007), the German social networking service studiVZ has over 5m registered users, and Technorati was able to index more than 70m blogs at the beginning of 2007 (Sifry, 2007).

The heavy growth of user-generated content increases the demand for suitable methods and facilities for the storage and retrieval of said content. In order to meet those demands, companies and computer scientists have developed collaborative information services like social bookmarking, photosharing and videosharing, which enable users to store and publish their own information resources as well as to index these with their own customised tags. Thus the indirect co-operation of users creates a folksonomy for each collaborative information service, composed of each individual user’s tags. Using this folksonomy, all users may then access the resources of the information service in question.

The production of user-generated content, the development of collaborative information services and the usage of folksonomies, as well as the popularity of these three aspects, are mutually dependent. For one thing, this means that the more collaborative information services there are, the more user-generated content and the more tags in folksonomies there will be (see Figure I). At the same time, the growing number of user-generated information resources necessitates ever more collaborative information services in order to store them, and folksonomies to index them and make them retrievable. Also, a collaborative information service’s success, and by the same token its usefulness, increases proportionally to the number of users producing and indexing their own information resources, since this creates more varied access paths as well as greater quantities of different resources.
Figure I: The Production of User-generated Content, the Development of Collaborative Information Resources and the Growing Usage of Folksonomies are Mutually Dependent.
The task of folksonomies is to create access paths to information resources using tags. They take a different perspective on classification, and proponents see this as the crucial difference from the knowledge organization systems and documentation languages of library science and professional information services. To clarify, let’s look at some examples from everyday life: suppose you’ve just moved to a new flat and want to organise your bookshelves systematically. First you must choose the classification system that’s right for you – you can arrange the books alphabetically by title or by author, stack them according to size or spine colour, or separate them thematically, by genre. You could create different sections for novels with male and female protagonists, or create an order based on the date of purchase. The same goes for recipes, which can be labelled as hors d’œuvres, main courses or desserts, or classified according to ingredients, such as meat and fish. You could think of many analogous examples for photos, clothes, e-mails or CDs, and would find that they all share a common feature: once a classification
system is settled on, the book, recipe or photo is allowed exactly one place on the shelf or in the folder – whatever the taxonomy – where the resource is to be found. The problem which arises here is that a systematic order of any physical or digital resources always requires a decision on the part of the “guardian of order” as to which of the resource’s properties is to be used as a classification criterion. And only one criterion may be selected, since you cannot arrange books on a shelf alphabetically by author and by title at the same time. The guardian must also take care to shape the classification system in such a way that the required resources can be found quickly when needed. Here we encounter a problem touching upon the system’s usability: an order that appears logical and practical to one user might cause utter confusion for somebody else. A solution for both problems might be the multiple storage of one and the same resource, but that would be neither efficient nor practicable – apart from costing money and space, since most books would have to be bought and shelved at least twice.

Folksonomies take a different approach to the classification and structuring of digital information resources. Instead of choosing a classification criterion and filling it with resources, it is now the resources that are allocated the criteria. Folksonomies turn the classification system from a criteria-centric into a resource-centric approach. This means that multiple storage no longer refers to the resources but to the multiple allocations of the ‘folders,’ ‘drawers’ or ‘shelves’ that are the tags of the folksonomy. Pinned to the information resources are as many tags as are necessary to adequately describe and retrieve them. Thus tags enable the most diverse criteria to be allocated to the resources and in this way guarantee a much broader access to them, which, due to the collaborative construction of the folksonomy, is also independent of the guardian.
In the digital world, however, this approach always requires an indexing and retrieval system to render the folksonomy-based classification system manageable. The user may have created numerous access paths to the information resources, but a system will be needed to aggregate the tags and so provide links to the desired resources. So in order to structure and classify resources, folksonomies sidestep onto a meta-level, which represents the resource via (a whole lot of) tags. This approach is neither new nor especially innovative, however. In the physical world, this meta-level is mainly developed and implemented by libraries, with their methods of knowledge representation. Subject catalogues, classification systems and thesauri are all somewhat antiquated testaments to this endeavour. They are to be distinguished from folksonomies in that they use a controlled vocabulary and a term-based approach, to be applied to the resources by trained professionals. This means that the indexing and retrieval of the resources is only possible using this prescribed vocabulary which, however, represents and finds the resource independently of its representation in language. The user of such a controlled vocabulary must learn the permitted terms by heart before gaining access to the resources. Folksonomies, however, allow users to apply their own terminology for indexing and representing content, as well as for retrieving resources, which at first sight enormously facilitates the information retrieval process. Since the allocation of tags to resources is not bound by a unified set of rules, however, a suitable retrieval system is needed in order to efficiently look for and find the resources. Here, too, librarians were the pioneers with their card indexes, which were later replaced by electronic search and data storage facilities, not least on the internet.
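The resource-centric allocation of tags, and the need for a system that aggregates them into access paths, can be illustrated with a minimal Python sketch. It is not taken from any real service: the users, tags and resource identifiers are invented, and `build_indexes` is a hypothetical helper name. Each tagging act is modelled as a (user, tag, resource) triple, and two lookup tables are derived from the triples: the tags pinned to each resource, and the resources reachable from each tag.

```python
from collections import defaultdict

# Hypothetical tag assignments as (user, tag, resource) triples.
# All users, tags and resource identifiers are invented for illustration.
assignments = [
    ("anna", "novel", "book:moby-dick"),
    ("anna", "whales", "book:moby-dick"),
    ("ben", "classic", "book:moby-dick"),
    ("ben", "novel", "book:emma"),
    ("cleo", "dessert", "recipe:tiramisu"),
    ("cleo", "italian", "recipe:tiramisu"),
]


def build_indexes(assignments):
    """Return (tags per resource, resources per tag) as dicts of sets."""
    tags_of = defaultdict(set)       # resource -> tags pinned to it
    resources_of = defaultdict(set)  # tag -> resources it gives access to
    for _user, tag, resource in assignments:
        tags_of[resource].add(tag)
        resources_of[tag].add(resource)
    return dict(tags_of), dict(resources_of)


tags_of, resources_of = build_indexes(assignments)

# One resource carries several criteria at once, without being stored twice.
print(sorted(tags_of["book:moby-dick"]))
# One tag acts as a 'shelf' that several resources share.
print(sorted(resources_of["novel"]))
```

In this sketch the book is stored once but is reachable via three different tags contributed by two users, which is the point of the shelf analogy: the 'folders' multiply, not the resources.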
Both collaborative information services and folksonomies mainly serve the user to personally store and arrange digital information resources, which are either self-produced or found online, e.g. videos or photos. The coverage and indexing of all resources available on the internet is not the user’s professed goal. But little by little it transpires that folksonomies have reached a new level of feasibility in indexing web resources, and now take their place next to web catalogues as well as the full text storage and analysis of websites’ link-topological properties as practiced by search engines. The low entry barriers for folksonomies make this form of mass indexing possible, since they do not require any specialised knowledge on the part of the user. Added together, the numerous activities of single users then lead to a database of web resources indexed by human hands. Thus the burden of mass indexing is no longer shouldered by single institutions, but carried by the many component parts of the internet community, where each user contributes their share:

“Collaborative tagging is most useful when there is nobody in the ‘librarian’ role or there is simply too much content for a single authority to classify; both of these traits are true of the web, where collaborative tagging has grown popular” (Golder & Huberman, 2005, 198).

“The hierarchical Yahoo directory was developed for the purpose of browsing, but categorization by a limited number of professionals cannot practically deal with the huge number of web pages. [...] Folksonomy seems to be able to deal with the large amount of content” (Ohkura, Kiyota, & Nakagawa, 2006).

“As a result, it leads to an emergent categorization of web resources in terms of tags, and creates a different kind of web directory” (Choy & Lui, 2006).
It transpires that folksonomies are a method of knowledge representation in exactly the same way that libraries’ traditional controlled vocabularies are – minus a centralised administration for the vocabulary. Like them, folksonomies aim for an optimised representation and retrievability of information resources. But unlike the established, term-based documentation languages and knowledge organization systems, folksonomies have a few weak spots. Since they forego a restricted terminology, they are confronted by all of language’s problems and idiosyncrasies (such as synonyms and homonyms), which come to bear on the process of information retrieval especially. A search for information resources on the subject of ‘disco’ might lead to fewer relevant hits, since synonymous terms such as ‘night club’ are omitted from the search results. Similarly, whoever searches for ‘club’ will be swamped with useless information if they are interested in partying and not in golfing equipment or sports teams. In certain cases, the user’s search will even result in no hits at all, which has immense repercussions: “It’s impossible to create knowledge from information that cannot be found or retrieved” (Feldman & Sherman, 2001, 4).

The goal of knowledge representation – and thereby, of folksonomies – is to provide access to information resources; non-textual resources such as photos and videos are particularly dependent on them. If there is no access path, the resources are deemed unimportant: “The digital consumer is highly pragmatic – their attitude is that if the information is not found immediately, in one place, it is not worth looking for” (Nicholas & Rowlands, 2008). In today’s information gathering, which mainly relies on internet search engines, folksonomies are a weak tool for retrieval since they hardly represent a departure from search engines’ full text storage. “Users of tagging systems can quickly label (tag) large numbers of objects, but these labels are much less informative – tags tell us little more than the free-form string that they present,” is Heymann and Garcia-Molina’s (2006) fitting summary. Thus folksonomies lag far behind the potential of methods for knowledge representation and information retrieval that stem from library science (Peters, 2006).

Why then are folksonomies so popular and successful? Could they offer advantages and access paths to information resources that controlled vocabularies cannot, and that are not visible at first sight? Or could folksonomies still learn from the traditional methods of knowledge representation and the established approaches to information retrieval, and vice versa?
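The effect of synonyms and homonyms on retrieval described above can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the photo identifiers, the tags, the set of ‘relevant’ resources and the helper names `search` and `precision_recall` are invented. The search is a plain exact-match lookup without any vocabulary control, and precision and recall are the standard ratios of relevant hits to all hits and to all relevant resources, respectively.

```python
# Hypothetical tagged resources; every identifier and tag is invented.
tagged = {
    "photo:dancefloor": {"disco", "party"},
    "photo:dj-set": {"night club", "party"},
    "photo:nine-iron": {"club", "golf"},
    "photo:team-photo": {"club", "football"},
    "photo:bar-queue": {"club", "party"},
}


def search(tag):
    """Exact-match tag search without any vocabulary control."""
    return {resource for resource, tags in tagged.items() if tag in tags}


def precision_recall(hits, relevant):
    """Return (precision, recall) of a hit set against a relevant set."""
    true_hits = hits & relevant
    precision = len(true_hits) / len(hits) if hits else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed information need: photos about going out at night.
nightlife = {"photo:dancefloor", "photo:dj-set", "photo:bar-queue"}

# Synonym problem: 'disco' misses resources tagged 'night club' or 'club'.
print(precision_recall(search("disco"), nightlife))
# Homonym problem: 'club' also returns golf and football photos.
print(precision_recall(search("club"), nightlife))
# A term nobody used as a tag yields no hits at all.
print(search("nightlife"))
```

With this invented data, the query ‘disco’ finds only one of the three relevant photos, while ‘club’ returns three hits of which only one is relevant. A controlled vocabulary would address both cases by mapping synonyms to one preferred term and by disambiguating homonyms.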
Current State of Research

Outside of information science, folksonomy research is mainly carried out in IT. Here, in particular, folksonomy-based database and retrieval systems are constructed and evaluated (e.g. BibSonomy), and characteristic frequency distributions for tags, users and resources are calculated. Social and media science mainly deal with the network characteristics of folksonomies, user behaviour in folksonomy-based systems and the effect of folksonomies on media consumption, while linguistics investigates the semantic and pragmatic properties of tags as well as their possibilities in the computational processing of language. Library science researches folksonomies with a particular focus on their usefulness for the indexing and retrieval of mostly physical information resources, as well as their effects on the (self-)image of libraries. Information science’s research is located at the intersection of all these disciplines. Its basis is IT, which is complemented by the other disciplines in order to meet the particular demands of information resources adequately. In folksonomy research, information science’s task is to adopt a holistic point of view, to apply the results of the above disciplines to the processing, communication and classification of digital information resources, and to develop concepts for improved methods of knowledge representation and information retrieval.

All these disciplines are at the beginning of their research endeavours, since most folksonomy-based web services have been developed only from 2003 onwards (e.g. del.icio.us) as part of Web 2.0 – the very term ‘folksonomy’ was coined as late as 2004 (Smith, 2004; Vander Wal, 2004; Mathes, 2004). Characteristically, interest in folksonomies first manifested itself on the internet, and at first did not appear in scientific publications.
Only in 2006 did Golder & Huberman, Guy & Tonkin and Marlow et al., amongst others, publish studies on the subject that are still highly relevant today and can be deemed essential literature. From that point on, the number of publications on folksonomies has grown exponentially, but they are still mainly to be found in the blogosphere¹ or in conference literature. Today we can assume that around 1,000 scientific publications on the subject have been published. A search for ‘folksonomy’ or ‘folksonomies’ in scientifically oriented databases yields the following picture:

• Web of Science shows 38 results after a title and keyword search,
• Scopus has 198 hits after a title, abstract and keyword search, and 454 results after a search in all text fields,
• the ACM Portal boasts 435 results after a search in all text fields, and 105 results after a title and abstract search, and
• InfoData finds 29 results after a search in all text fields.

¹ ‘Blogosphere’ describes the totality of all blogs and blog entries available on the internet.

This book summarises the research findings of more than 700 publications. Monographic publications are rare, however. Exceptions are the popular science books “Tagging. People-powered Metadata” by Gene Smith (2008), “Everything Is Miscellaneous: The Power of the New Digital Disorder” by David Weinberger (2007) and, soon, “Understanding Folksonomy. Catalyzing Users to Enrich Information” by Thomas Vander Wal (2009), as well as publications originating in diploma or doctoral theses such as “Social Tagging. Schlagwortvergabe durch User als Hilfsmittel zur Suche im Web” by Sascha A. Carlin (2007) and “Tagging, Rating, Posting. Studying Forms of User Contribution for Web-based Information Management and Information Retrieval” by Markus Heckner (2009).

Smith’s book is directed at readers who wish to implement a tagging system within (non-)commercial and in-house company websites. It shows the advantages and disadvantages of folksonomies and system features, and gives advice on the realisation and implementation of such systems. Moreover, Smith presents three case studies for different sorts of tagging systems: Social Bookmarking, Media Sharing and Personal Information Management. Citing many examples, Weinberger (2007) gives an overview of the different category and classification systems that exist in the real world and discusses their philosophical as well as their (im-)practical backgrounds. Above all, the shortcomings of such systems with regard to digital information are discussed. Hence, Weinberger grapples intensively with the new possibilities for creating metadata in Web 2.0, tags and folksonomies, and their implications for companies and users. Carlin (2007) uses his diploma thesis mainly to provide a literature survey on the subject and does not offer any original ideas or solutions.
Heckner (2009) investigates the tagging behaviour of users within a tagging system and across platform borders, and additionally creates a model for the categorisation of tags via their functional and linguistic properties. The results feed into conceptual considerations on the architecture and implementation of a folksonomy-based online help system. Even though Heckner (2009) provides the first monographic publication on the subject of folksonomies from the perspective of information science, there is as yet no scientific work investigating folksonomies as both a method of knowledge representation and a tool for information retrieval. This book aims to close this gap.
Open Questions in Folksonomy Research

Although the number of writings on the subject of folksonomies increases steadily, some subareas, particularly from an information science point of view, remain uncharted. Unavailable as yet is a taxonomy of services in Web 2.0 that differentiates between collaborative information services (which use folksonomies and aim to manage information resources) and social software (which also includes resource management, but additionally includes the construction of a knowledge base and the communication with other internet users). Such a taxonomy is sorely needed if the definitional parameters for the observation and usage of folksonomies in Web 2.0 are to be provided and standardised. Moreover, neither a critical analysis of the applicability of folksonomies within specific collaborative information services, nor an evaluation with regard to the methods of knowledge representation and information retrieval currently in use has so far been carried out.

Summarising statements concerning the applicability, as well as the strengths and weaknesses, of folksonomies in knowledge representation have been published on several occasions, and will be cited here, yet comprehensive proposals on how to avoid or neutralise the disadvantages are few and far between. A possibility might be the use of semiautomatic processes for cleansing and unifying folksonomies during and after the indexing of information resources (e.g. via natural language processing), as well as the projection of documentation languages and knowledge organization systems onto folksonomies. Here the question arises as to how preexisting knowledge organization systems can profit from the terminology and the term relations applied in folksonomies, and how this knowledge can in turn reenter the folksonomies.

Collaborative information services and folksonomies are basking in success at the moment; the number of users keeps increasing. Is it conceivable that folksonomies or collaborative information services might one day collapse under the sheer weight of users, tags or resources? No research has yet been expended on this problem, perhaps due to the topic’s youth. Answers might be found in network economics and adjacent areas. In the scientific debate it is almost always assumed that the allocation of tags to resources follows an informetric distribution or power law. Is this correct, or might other forms of distribution apply to the tags? The area of information retrieval via folksonomies has only entered the scientific debate in the past three years.
In what ways resource gathering via folksonomies may be implemented, which retrieval strategies play a role in this process, and how a search via folksonomies differs from information retrieval using traditional search tools are therefore questions that have been examined but not yet summarised. A generally accepted algorithm for relevance ranking in folksonomy-based retrieval systems is missing altogether. There is also no overview of folksonomies’ disadvantages in information retrieval, with its subareas of retrieval effectiveness, search interface design and search result visualisation, or in relevance ranking, and no proposal on how to avoid or neutralise the identified disadvantages. Possible options here would be the use of query tags as indexing tags, a relevance ranking that takes into account resource-inherent properties as well as tags and user activities, the observation of empirical tag distributions on a resource level to increase folksonomies’ retrieval effectiveness, or semiautomatic user support during searches that use tools for information retrieval and knowledge representation.

This book will aim to provide conceptual answers to these open questions in folksonomy research, and in so doing comprehensively describe the close relation between the endeavours of knowledge representation and information retrieval. Particular attention will be paid to the reciprocity of these two research areas of information science and to the effects on folksonomy-based indexing methods and retrieval systems.
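The question of whether the tags of a resource follow a power law can at least be inspected descriptively. The following Python sketch is an illustration under stated assumptions, not a method proposed in this book: the tag list is invented, the helper names `rank_frequency` and `loglog_slope` are hypothetical, and a least-squares line through the log-log rank-frequency plot is only a rough diagnostic rather than a statistical test. It rests on the observation that a power law f(r) = C · r^(−a) becomes a straight line with slope −a when both frequency f and rank r are plotted logarithmically.

```python
import math
from collections import Counter


def rank_frequency(tags):
    """Return tag frequencies in descending order (rank 1 first)."""
    return sorted(Counter(tags).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two ranks; a slope near -a is compatible with
    f(r) = C * r**(-a), but does not prove a power law.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented tag assignments for a single resource, built so that the
# frequency at rank r is exactly 60 / r (an idealised power law, a = 1).
tags = (["web"] * 60 + ["design"] * 30 + ["css"] * 20
        + ["tools"] * 15 + ["blog"] * 12)

freqs = rank_frequency(tags)
print(freqs)
print(round(loglog_slope(freqs), 6))
```

Real tag data will not lie exactly on a line, so the slope alone cannot decide between a power law and other distribution forms; it merely gives a first impression of how steeply tag usage falls off from the most popular tags.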
Notes on the Book’s Structure
Chapter one will provide an introduction to the context of folksonomies. First, a definition of terms will be carried out in order to differentiate ‘Web 2.0,’ ‘social software’ and ‘collaborative information services,’ which are predominantly used synonymously. Folksonomies will be defined as one of the essential components of collaborative information services. Then, several known collaborative information services will be introduced and their functionalities discussed. Here particular attention will be paid to those of the systems’ properties that make indexing and research via folksonomies possible. Since collaborative information services are heavily represented online, I will take this opportunity to discuss the totality of information resources, from photos and links up to and including videos. The chapter will close with a summary of all results, which will register the functionalities of information services and their folksonomies as well as their cardinality with regard to the retrieval of information resources. Chapter two will then provide a short commentary on the basic terms of knowledge representation and information retrieval. I will focus especially on terms which are invaluable for the understanding of the following chapters and of the discussion of folksonomies in knowledge representation and information retrieval. Where possible, folksonomies will already be classified according to their respective contexts as methods of knowledge representation and tools for information retrieval. Chapters three and four give the book its name, since they deal exclusively with the areas of folksonomies in knowledge representation and folksonomies in information retrieval. Both follow a three-part structure: first the subject area will be introduced and any relevant research that will familiarise the reader with the mode of operation and particularities of folksonomies in the particular subject area cited.
The first part then closes with a critical evaluation of folksonomies, discussing their advantages and disadvantages as a tool in knowledge representation and information retrieval. The second part of each chapter sets out to solve the problems and disadvantages of folksonomies identified there. These proposed solutions are of a conceptual nature; for the most part they have not yet been implemented in practice, or have been investigated only cursorily. Where possible, however, available research results will be cited to support the argumentation. Both chapters then close on a forecast that introduces further relevant areas of research and identifies research questions left unanswered. The third chapter is wholly dedicated to folksonomies and their task as a method for knowledge representation. First the term ‘folksonomy’ will be more closely defined and the difficulty concerning the competing terms for this idea discussed. Then follow a description of the three intrinsic parts of a folksonomy (‘users,’ ‘tags’ and ‘resources’), and an observation on the cognitive effort that indexing via folksonomies demands of users. The differentiation of folksonomies into three variants (‘broad,’ ‘narrow’ and ‘extended narrow folksonomy’), as well as illustrations on collective intelligence in groups of people are necessary to explain the formation of different tag distributions on a resource level. While tag distributions arise mainly through the statistical accumulation of user activity after indexing, the section ‘Tagging Behaviour’ will address actual user behaviour during indexing. The properties of the tags themselves and a possible allocation to different sorts of tags will be examined afterwards. While the three component parts of a folksonomy and their interconnections can be exploited for different recommender systems, here light will be shed on the proposal of tags during indexing. On the basis of the preceding observations,
advantages and disadvantages of folksonomies will be deduced and summarised. This critical evaluation forms the basis of the chapter’s latter sections, in which folksonomies are compared with traditional methods of knowledge representation, and where semiautomatic strategies meant to increase folksonomies’ effectiveness in indexing are presented. Then methods that will minimise disadvantages and optimise advantages are developed and presented. The goal is an improved form of knowledge representation that will feed on folksonomies’ strengths without however forgetting about the virtues of established indexing methods (see ‘tag gardening’). The chapter ends on a forecast on the further possibilities of knowledge representation via folksonomies and research questions left unanswered. The fourth chapter will discuss folksonomies as tools for information retrieval. First, the relation between knowledge representation and information retrieval will be examined, with a particular focus on folksonomies as access vocabulary during research. After that, the different retrieval strategies (‘searching,’ ‘browsing’ and ‘retrieving’) that can be implemented via folksonomies are illustrated. In the same context, the differentiation between active and passive information retrieval will be discussed. Active information retrieval distinguishes itself through its use of tags as information filters and thus as search terms during retrieval. Information filtering and collaborative filtering stand for passive information retrieval, which uses tags to formulate search requests or as the basis of a recommender system. As opposed to the active retrieval strategies discussed in the preceding section, the user is ‘delivered’ possible relevant resources by the retrieval or tagging system. The recommender systems (collaborative filtering) making use of folksonomies and their three-part structure of users, tags and resources will be discussed in particular detail.
An explanation will be given as to the role folksonomies play in resource gathering, and what sorts of information might further be retrieved in this way. A whole section will deal with folksonomies’ retrieval effectiveness, since it can elucidate whether folksonomies as a method of knowledge representation can also serve as a tool for information gathering. What follows is a discussion of several of folksonomies’ visualisation techniques, especially tag clouds, as well as their usefulness as facilities for efficient information retrieval. After this comes a summary of folksonomies’ disadvantages in information retrieval as localised in the preceding sections. This will form the basis of the four following subject areas that deal with neutralising these disadvantages. The first suggestion is to supplement the tags added to the resources during indexing by more tags drawn from search requests. Relevance ranking makes the decision whether a resource is relevant for a search or not easier for the user. Folksonomies offer query-, user- and resource-specific properties that may be exploited for relevance ranking and which are introduced in this section as ranking algorithms. Users’ indexing activities may be statistically analysed and equally exploited for information retrieval. Popular tags, that is tags with a high indexing frequency, are in this context considered ‘power tags’ and localised in a fictitious retrieval system, where they can be used as tools for restricting the number of search results. The conception and procedure of this approach will be explained in this section. Chapter three will have already discussed tag gardening for knowledge representation, so at this point the concept will be projected onto information retrieval. Research questions still unanswered, as well as possible further uses for folksonomies in information retrieval, will be considered in the forecast. 
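The idea of restricting search results with ‘power tags’ can be sketched in a few lines. The sketch below assumes, purely for illustration, that the power tags of a resource are simply its `top_n` most frequently assigned tags; this cut-off rule, the sample data and the function names are hypothetical and do not anticipate the conception developed in chapter four.

```python
from collections import Counter, defaultdict


def power_tags(assignments, top_n=2):
    """Map each resource to its top_n most frequently assigned tags.

    The fixed cut-off is an assumption made for this sketch only.
    """
    counts = defaultdict(Counter)
    for _user, tag, resource in assignments:
        counts[resource][tag] += 1
    return {res: {tag for tag, _ in c.most_common(top_n)}
            for res, c in counts.items()}


def search(assignments, query_tag, only_power_tags=False, top_n=2):
    """Return the resources indexed with query_tag, optionally
    matching against the power tags of each resource only."""
    if only_power_tags:
        index = power_tags(assignments, top_n)
    else:
        index = defaultdict(set)
        for _user, tag, resource in assignments:
            index[resource].add(tag)
    return sorted(res for res, tags in index.items() if query_tag in tags)


# Hypothetical (user, tag, resource) assignments.
ASSIGNMENTS = [
    ("u1", "python", "r1"), ("u2", "python", "r1"), ("u3", "python", "r1"),
    ("u1", "tutorial", "r1"), ("u2", "tutorial", "r1"), ("u4", "snake", "r1"),
    ("u1", "snake", "r2"), ("u2", "snake", "r2"),
    ("u3", "zoo", "r2"), ("u4", "zoo", "r2"), ("u5", "python", "r2"),
]

all_hits = search(ASSIGNMENTS, "python")
power_hits = search(ASSIGNMENTS, "python", only_power_tags=True)
print(all_hits, power_hits)
```

In this toy data the tag ‘python’ was assigned to the second resource only once, so the resource drops out of the hit list as soon as the search is limited to power tags.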
In the conclusion, the points made in the preceding chapters will be examined, and the book’s contribution to folksonomy research considered.
Bibliography
Al-Khalifa, H. S., Davis, H. C., & Gilbert, L. (2007). Creating Structure from Disorder: Using Folksonomies to Create Semantic Metadata. In Proceedings of the 3rd International Conference on Web Information Systems and Technologies, Barcelona, Spain.
Carlin, S. A. (2007). Social Tagging. Schlagwortvergabe durch User als Hilfsmittel zur Suche im Web - Ansatz, Modelle, Realisierungen. Boizenburg: Hülsbusch.
Choy, S., & Lui, A. K. (2006). Web Information Retrieval in Collaborative Tagging Systems. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong (pp. 352–355).
eMarketer (2007). UGC Users Outnumber Creators: Are Enough People Recording Their Cats? from http://www.emarketer.com/Article.aspx?id=1005081.
Feldman, S., & Sherman, C. (2001). The High Cost of Not Finding Information, from http://web.archive.org/web/20070320063457/http://www.viapoint.com/doc/IDC+on+The+High+Cost+Of+Not+Finding+Information.pdf.
Golder, S., & Huberman, B. (2006). Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), 198–208.
Guy, M., & Tonkin, E. (2006). Folksonomies: Tidying Up Tags? D-Lib Magazine, 12(1).
Heckner, M. (2009). Tagging, Rating, Posting. Studying Forms of User Contribution for Web-based Information Management and Information Retrieval. Boizenburg: Hülsbusch.
Heymann, P., & Garcia-Molina, H. (2006). Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems: InfoLab Technical Report, from http://dbpubs.stanford.edu/pub/2006-10.
Heymann, P., Koutrika, G., & Garcia-Molina, H. (2008). Can Social Bookmarking Improve Web Search? In Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA (pp. 195–206).
Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication Through Shared Metadata, from www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.
Marlow, C., Naaman, M., d. boyd, & Davis, H. (2006). HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 31–40).
Nicholas, D., & Rowlands, I. (2008). In Praise of Google. Library & Information Update, December, 44–45.
Oates, G. (2007). Heiliger Bimbam!, from http://blog.flickr.net/de/2007/11/13/heiliger-bimbam/.
Ohkura, T., Kiyota, Y., & Nakagawa, H. (2006). Browsing System for Weblog Articles based on Automated Folksonomy. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Peters, I. (2006). Against Folksonomies: Indexing Blogs and Podcasts for Corporate Knowledge Management. In Preparing for Information 2.0. Proceedings of Online Information Conference, London, Great Britain (pp. 93–97). London: Learned Information Europe Ltd.
Rainie, L. (2007). 28% of Online Americans Have Used the Internet to Tag Content. Forget Dewey and His Decimals, Internet Users are Revolutionizing the Way We Classify Information – and make Sense of It, from http://www.pewinternet.org/~/media//Files/Reports/2007/PIP_Tagging.pdf.pdf.
Sifry, D. (2007). The State of the Live Web, April 2007, from http://technorati.com/weblog/2007/04/328.html.
Smith, G. (2004). Folksonomy: Social Classification, from http://atomiq.org/archives/2004/08/folksonomy_social_classification.html.
Smith, G. (2008). Tagging. People-powered Metadata for the Social Web. Berkeley: New Riders.
Spiteri, L. (2005). Controlled Vocabulary and Folksonomies, from http://www.termsciences.fr/IMG/pdf/Folksonomies.pdf.
Sterling, B. (2005). Order Out of Chaos. Wired Magazine, 13(4), from http://www.wired.com/wired/archive/13.04/view.html?pg=4.
Vander Wal, T. (2004). Feed On This, from http://www.vanderwal.net/random/entrysel.php?blog=1562.
Vander Wal, T. (2009). Understanding Folksonomy: Catalyzing Users to Enrich Information. Beijing [et al.]: O'Reilly Media [to appear].
Weinberger, D. (2007). Everything is Miscellaneous. The Power of the New Digital Disorder. New York, NY: Times Books [et al.].
All online sources cited here and over the course of this book have last been accessed on March 25th, 2009.
Chapter 1
Collaborative Information Services
This chapter will provide a systematic description of collaborative information services and their particular tagging and search functionalities. Before describing the web services, however, it is absolutely necessary to differentiate the various concepts within this subject area from one another.
Web 2.0 vs Social Software vs Collaborative Information Services
The context that houses collaborative information services is termed ‘Web 2.0.’ Web 2.0 spans all activities and technical requirements that allow users of the World Wide Web to self-publish content, in the form of profiles, bookmarks, photos, videos, posts etc. and to make it accessible to other users, as well as to communicate with them. Furthermore, ‘Web 2.0’ describes a state of affairs in which the internet distinguishes itself through the continuous development and combination (‘mashup’) of known online services and incorporates users in this development process via feedback loops (Lange, 2006). This state is also called ‘perpetual beta’ (Cox, Clough, & Marlow, 2008). The term ‘Web 2.0’ was created as the name for a conference, held by O’Reilly Media Inc.2, in which the latest developments online were to be discussed (Notess, 2006a). Since then, the suitability of this term has been the subject of ample and contentious debate (Dvorak, 2006; Shaw, 2005; Notess, 2006a; Millard & Ross, 2006). Nevertheless, Web 2.0 has established itself as a concise buzzword in daily language (Sixtus, 2006; Braun & Weber, 2006; Röttgers, 2007; Schachner & Tochtermann, 2008, 23ff; Gissing & Tochtermann, 2007). O’Reilly (2005) himself presented the ideas behind the concept of Web 2.0 in a meme map (see Figure 1.1), in which both innovations in users’ online behaviour as well as the continuing development of the technical basis were illustrated. The most frequently cited idea is the ‘architecture of participation,’ which puts the main emphasis on user-generated content as well as online communication. Ankolekar, Krötzsch and Vrandecic (2007) concentrate on software development: “It is notable that the term Web 2.0 was actually not introduced to refer to a vision, but to characterise the current state of the art in web engineering” (Ankolekar, Krötzsch, & Vrandecic, 2007).
2 http://www.oreilly.com.
Figure 1.1: Meme Map of the Concept ‘Web 2.0’ after O’Reilly. Source: O’Reilly (2005, Fig. 1).
To declare the existence of a second version of the World Wide Web presupposes the existence of a previous version (Tredinnick, 2006; Notess, 2006a). Cormode and Krishnamurthy (2008) extensively discuss the differences between ‘Web 1.0’ and ‘Web 2.0’ (see also O’Reilly, 2005). They locate these differences along three axes: 1. technology, 2. structure, 3. social aspects. Online services and websites in ‘Web 1.0’ are rather static on a technological and structural level, cannot be adjusted by their users and mainly serve as sources of information, while Web 2.0 services focus especially on communication and the exchange of resources between users. They demand little technological know-how on the part of the user, since Web 2.0 services separate content, navigation elements and technology from each other (Tredinnick, 2006) and adapt to the user’s needs with regard to the design, structure, processing and creation of their content; in other words, they put the user at the center of the online service:
The key to Web 2.0 is harnessing the ways in which users use information to add value to information (either through direct or indirect user-participation) in creating the information sources they use. In other words, Web 2.0 reflects collective use over time, rather than reflecting an organization’s preferred view of itself. Web 2.0 is built out of real use and need, not idealized use and need (Tredinnick, 2006, 232).
That is why Danowski and Heller (2006) also speak of a change in the World Wide Web from an object-centric to a person-centric network (Danowski & Heller, 2006; Schmitz et al., 2007). This book will proceed from a definition of the term ‘Web 2.0’ that is represented in Figure 1.2. Here the term ‘Web 2.0’ is composed of three areas: technology (in particular AJAX and RSS), licences and social software. From a technological perspective, Web 2.0 facilitates the creation and publication of content, which is why Coates (2003) calls Web 2.0 “the Mass Amateurisation of (Nearly) Everything”: “Updating a website on a daily basis is no longer an activity that only a trained professional (or a passionate hobbyist) can accomplish. It’s now open to pretty much everyone, cost-free and practically effortlessly” (Coates, 2003). For one thing, the technology of Web 2.0 tools recedes so far into the background behind an intuitive user interface that even the most technically unversed user may create and publish information resources in this way.
Figure 1.2: Classification of Services, Technologies and Licences in Web 2.0.
For another, in the spirit of ‘perpetual beta,’ the technology is made accessible to users for processing in so-called ‘mash-ups’ or for corrections (van Veen, 2006; Zang, Rosson, & Nasser, 2008; Lackie & Terrio, 2007). Both processes are implemented via so-called ‘Application Programming Interfaces’ (‘APIs’ for short). Weiss (2005) explains the concept of APIs with the example of the photosharing service Flickr:
The API, or Application Programming Interface, is a set of technical documentation which tells a developer how to interact with a software engine. [...] Flickr published its API so that Web developers anywhere could write their own applications to leverage the Flickr photo and tag database (Weiss, 2005, 22).
Zang, Rosson and Nasser (2008) report on interviews conducted with programmers of mash-ups, in which they found that 77.8% of all mash-ups draw on map materials, 40.7% on photos and 14.8% on news items. The most commonly used APIs are Google Maps’3 (96.4%), Flickr’s (39.3%) and Amazon’s (14.3%).
Figure 1.3: The AJAX Concept. Source: Garrett (2005, Fig. 2).
Another important technological component of Web 2.0 is its capacity for asynchronous data transmission using AJAX (short for ‘Asynchronous JavaScript and XML’; Garrett, 2005; Crane, Pascarello, & James, 2006; Clark, 2006). AJAX does not signify a new programming language in this context, but describes instead a novel connection of familiar programming techniques, as Garrett (2005) emphasizes:
Ajax isn’t a technology. It’s really several technologies, each flourishing in its own right, coming together in powerful new ways. Ajax incorporates: standards-based presentation using XHTML and CSS; dynamic display and interaction using the Document Object Model; data interchange and manipulation using XML and XSLT; asynchronous data retrieval using XMLHttpRequest; and JavaScript binding everything together (Garrett, 2005).
3 http://maps.google.com.
This connection affects users’ interaction with online services in particular and conveys the impression of a desktop application that processes the user’s requests almost without delay (see also Figure 1.3):
The Ajax engine allows the user’s interaction with the application to happen asynchronously – independent of communication with the server. So the user is never staring at a blank browser and an hourglass icon, waiting around for the server to do something (Garrett, 2005).
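The principle of asynchronous interaction can be illustrated outside the browser as well. The following sketch is an analogy only: it uses Python’s `asyncio` rather than JavaScript and XMLHttpRequest, and the server round trip is simulated by a short pause instead of a real network request. The function names and the logged actions are hypothetical.

```python
import asyncio


async def fetch_from_server(request, delay=0.05):
    """Stand-in for a server round trip (no real network access)."""
    await asyncio.sleep(delay)
    return f"response to {request}"


async def session():
    log = []
    # The request is started in the background ...
    pending = asyncio.create_task(fetch_from_server("load tags"))
    # ... while the interface keeps reacting to the user in the meantime.
    for action in ("scroll", "type", "click"):
        log.append(f"user: {action}")
        await asyncio.sleep(0)  # hand control back to the event loop
    # Only now is the server's answer collected and displayed.
    log.append(await pending)
    return log


log = asyncio.run(session())
print(log)
```

The point of the design is that the user actions are logged without waiting for the simulated server; a synchronous version would have to block until the response had arrived.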
AJAX transforms the desktop into a ‘Webtop’ (Lange, 2006) that grants users decentralized storage of resources as well as access to these resources from every computer with an internet connection (Notess, 2006b). Storage does not require any local hard drive capacity, and searching for and installing software updates is no longer necessary since changes to the program are implemented online. But the greatest advantage of these online applications is surely that multiple users can access and edit the same resource. The different versions are often saved during the process itself, so that earlier versions of the resource can always be restored. So-called RSS4 or Atom feeds (Wusteman, 2004; Hammersley, 2005) allow for a new form of information distribution and platform-independent exchange of data:
RSS is a way of syndicating web content through the use of content feeds, which consists of XML marked-up files. RSS feeds usually combine either the lead paragraph, or a summary of an article published on the web or on a blog, and a hyperlink back to its resource (Tredinnick, 2006, 230).
Feeds reduce the information to be transmitted to a minimum and allow it to be sent directly from its creator to the users, who may decide for themselves which feeds they want to receive at all and which information in particular. Wusteman (2004) explains the difference between RSS and Atom:
All versions of RSS and Atom are written in XML. The main distinction between them is whether or not they are based on the W3C [World Wide Web Consortium, author’s note] standard RDF (Resource Description Framework). RSS 1.0 is based on RDF; all the others are not (Wusteman, 2004, 408).
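What such an XML marked-up feed looks like, and how little is needed to read it, may be shown with a short sketch. The feed below is a hypothetical, minimal RSS 2.0 document with invented titles and example.org addresses; the parsing uses only `xml.etree.ElementTree` from the Python standard library and is not meant as a complete feed reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: each item carries a title, a summary and a link
# back to the full resource.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.org/</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.org/posts/1</link>
      <description>Lead paragraph of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/posts/2</link>
      <description>Summary of the second post.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Extract title, summary and link of every item in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title"),
            "summary": item.findtext("description"),
            "link": item.findtext("link"),
        }
        for item in root.iter("item")
    ]


items = read_items(FEED)
for entry in items:
    print(entry["title"], "->", entry["link"])
```

An RDF-based RSS 1.0 feed or an Atom feed would use XML namespaces and different element names, so this sketch would have to be adapted for them.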
Feeds are received via a feed reader (e.g. Bloglines5 or NewsGator6), which bundles all incoming feeds, not unlike an e-mail program, and displays them for the user. Selecting one such message will then lead directly to the original website, where the full text may be accessed. RSS or Atom feeds may also be integrated into other websites, so that with each visit to the site the new messages will be bundled and displayed anew. Another decisive factor for the (further) development of mash-ups and software is the legal side of Web 2.0. Licences such as ‘Copy Left’ (Stallman, 2004) or ‘Open Source’ (Grassmuck, 2004) as well as ‘Creative Commons’ (Lessig, 2003; Dobusch & Forsterleitner, 2007) provide for the collaborative editing and subsequent use of intellectual property which would otherwise be protected by copyright law (Federal Ministry of Justice, 2008). In this context, open-source licences apply to software with a disclosed source code, whereas Creative Commons licences may be used on any resources. The modular structure of Creative Commons licences in particular allows authors to determine precisely how their work may be used further, e.g. to allow it to be edited but prohibit any commercial use. The social aspect is much more pronounced in the world of Web 2.0 as well, since the exchange of information between users and the accumulation of extensive contacts is heavily valued. It must, however, be stressed that for the first time since the creation of the internet, the user ego emphatically takes center stage: blogging software allows users to create and publish diary-like commentaries, they can post updates of their current status around the clock in social networks (e.g. “Kate is… on her way to the gym!”), online merchants count on users’ readiness to actively improve and complete product information, and homemade videos, published on video platforms, attain cult status and propel their makers to great fame. The consumer press sees in this the change from average media consumers to self-determined ‘prosumers,’ to cite Toffler’s (1980) fusion of ‘producer’ and ‘consumer’:
Viewed most optimistically, the new sharing driving today’s Internet evolution could lead the way to a truly democratic network, where producers and consumers are one and the same (Weiss, 2005, 23).
The new ‘I-media’ have visibly changed the face of the World Wide Web in a few short years. Everybody is a potential media producer now: as columnist, diarist, critic, expert, like-minded individual, product tester (Wegner, 2005, 94).
The internet has become a chaotic and colorful marketplace that’s open to everyone, where you can sit in the stands or play on the stage according to your whims […] Their concept is wholly different from that of earlier internet pioneers. They do not regard their audience as passive ‘users,’ but as creative, communicative authors and creators who constantly want to swap ideas. They produce a previously rare and expensive product for free: content (Hornig, 2006, 62f.).
4 RSS is short for ‘Really Simple Syndication.’
5 http://www.bloglines.com.
6 http://www.newsgator.com.
This is the reason the term ‘social software’ (Bächle, 2006; Stegbauer & Jäckel, 2008; Tepper, 2003; Alby, 2007) forms a certain competition to ‘Web 2.0.’ Bächle (2006) defines ‘social software’ as follows: “Social Software is the name given to software systems which support human communication and collaboration” (Bächle, 2006, 121). Lange (2006) differentiates the terms ‘Web 2.0’ and ‘social software’ thus: Social Software describes programs and applications that allow users to create social networks. A dating site is social software, and weblogs and Wikipedia are. […] They are all programmed for one purpose: they encourage communication, interaction and collaboration between users. That is why they should be easy to get to grips with; users should be able to act as intuitively as possible (Lange, 2006, 16).
This book supports these definitions of ‘Social Software,’ yet does not see the term as synonymous with Web 2.0; rather, it defines social software as a component part of Web 2.0. Alby (2007) suggests a dichotomy of the term:
Social Software, in which communication is prized above all (and which generally does not keep records). Social Software, in which there is still communication, but which also focuses on content, user-made or user-enriched in some way – the idea of the community is crucial (Alby, 2007, 90f.).
This recommendation will not be investigated further at this point, since it seems insufficiently differentiated to paint a picture of the diversity of applications in the area of social software. Instead, three of social software’s main functions will be defined in order to adopt another course in the conceptualization of social software:
• communication and socializing,
• the building of a knowledge base and
• resource management.
The first of these aims for the creation of social networks and the communicative exchange with other users of the same software. An important aspect of this is that such communication is not private, as in a one-on-one conversation, but either public (e.g. in message boards) or directed simultaneously to multiple interlocutors (as in microblogging7 or instant messaging). Social networks provide a public platform where users can make new contacts or deepen pre-existing ones (Alby, 2007; Schmidt, 2007). The second function is a vital part of social software and Web 2.0, since it is a direct reflection of ‘architecture of participation.’ Users create a knowledge base by publishing so-called ‘user-generated content’ and making it accessible to other users. This can be text resources (e.g. blogs, wikis or rating services), but also audio or video resources (e.g. podcasts). Here users make their knowledge available to other users, provide help and advice, give explicit recommendations (as in rating services such as Ciao8) or speak their mind (Reinmann, 2008; Efimova, 2004; Efimova & de Moor, 2005; Doctorow, 2002). The aggregation and condensation of content from other media (as in news portals such as digg9), and of users’ own knowledge, as well as the collaboration in the recording of encyclopaedic knowledge (as in wikis such as Wikipedia10), can be located within this area. The third function concerns resource management, where users may manage, structure and edit their own resources.
Here it is important to emphasize that the three functions do not have clearly outlined boundaries, and that several functions, or at least their attributes, may be found in a specific application. Thus blogs also support communication with other users and personal resource management through their commentary function, which may be enriched by links or photos – nevertheless, their main function remains the archiving of text posts and thus the creation of a knowledge base. The allocations applied in Figure 1.2 only take into account the respective main function of each single application and should not be seen as too rigid. Social software with an emphasis on resource management may again be divided into two sections: personal resource management and collaborative information services. Personal resource management occurs when the structuring and management of a user’s own personal resources takes place in private (e.g. Google’s webmail program Gmail, which allows for the tagging of e-mails but can only be maintained privately). Collaborative information services mainly serve the management of personal resources, but also allow for the collaborative creation of a public database, accessible to each (where necessary, registered) user. Furthermore, collaborative information services’ resources are indexed by the users, that is, represented on the platform via tags11. That is why collaborative information services may also be termed ‘tagging systems’ (Smith, 2008b, 6). Smith (2008b) includes personal resource management in his usage of the term, which is not the case for the definition used here. Users collaborate to create an information service which provides access to different resources and is organized by the users themselves. It is these two aspects which distinguish collaborative information services from social software applications whose main function is the creation of a knowledge base. Wikis are created collaboratively, but they are not indexed by users via tags; blogs are represented via tags, but they are not created collaboratively. Common to all collaborative information services is the fact that users can profit from one another’s activities (e.g. the publishing of photos), without having to contribute anything towards the information service themselves. However, the information service’s usefulness for each individual user increases if he or she participates in its collaborative construction (McFedries, 2006):
Each contributor gains more from the system than she puts into it (Ankolekar, Krötzsch, & Vrandecic, 2007).
The commitment of many individuals guarantees the services’ success (Elbert et al., 2005).
7 Using twitter, for example: http://www.twitter.com.
8 http://www.ciao.de.
9 http://www.digg.com.
10 http://de.wikipedia.org.
Millen, Feinberg and Kerr (2006) mention another property shared by all collaborative information services and which mainly comes to bear on information gathering using such platforms – ‘pivot browsing’: “We call this ability to reorient the view by clicking on tags or user names, ‘pivot browsing’; it provides a lightweight mechanism to navigate the aggregated bookmark collection” (Millen, Feinberg, & Kerr, 2006, 112). The term ‘pivot’ stems from spreadsheet processing, where it describes the function that collects homogeneous data sets and displays them in new charts. Here the user may freely choose the data sets as well as the output fields. In collaborative information services, this functionality is achieved via links – all elements in collaborative information services (user – tag – resource) can generally be clicked on by the user. Selecting an element, e.g. the user, leads to a hit list, which can be changed by clicking on another element, e.g. a tag. In this way the user browses through the collaborative information service’s database and discovers resources relevant to him or her, more or less by accident. Sinha (2006) uses a metaphor:
The metaphor that comes to mind for pivot browsing is walking in the forests or some other open space, stopping to smell the pine, taking a break. You get the lay of the land as you walk around. The point is not just the destination, the point is the journey itself (anyone who has wasted time looking around on one of the tagging system will know what I mean) (Sinha, 2006).
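The mechanism behind pivot browsing can be made concrete with a small sketch. It assumes, for illustration only, that a collaborative information service stores its data as user – tag – resource triples and that a click on any element simply filters these triples; the names `Assignment` and `pivot` as well as the sample data are hypothetical.

```python
from collections import namedtuple

# One triple per tagging action: who assigned which tag to which resource.
Assignment = namedtuple("Assignment", ["user", "tag", "resource"])

ASSIGNMENTS = [
    Assignment("anna", "folksonomy", "http://example.org/a"),
    Assignment("anna", "web2.0", "http://example.org/b"),
    Assignment("ben", "folksonomy", "http://example.org/b"),
    Assignment("ben", "tagging", "http://example.org/c"),
]


def pivot(assignments, **selected):
    """Return the hit list for the clicked element,
    e.g. pivot(ASSIGNMENTS, tag="folksonomy")."""
    return [
        a for a in assignments
        if all(getattr(a, field) == value for field, value in selected.items())
    ]


# Click on the user 'anna', then reorient the view via one of her tags.
annas_view = pivot(ASSIGNMENTS, user="anna")
tag_view = pivot(ASSIGNMENTS, tag=annas_view[0].tag)
resources = sorted({a.resource for a in tag_view})
print(resources)
```

Starting from one user, the second click already leads to a resource that was tagged by someone else, which is the kind of accidental discovery described above.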
Generally, two sorts of collaborative information services may be distinguished: 1. social bookmarking services and 2. sharing services. This dichotomy will be adhered to throughout this book. Smith (2008b) differentiates between five different tagging systems: 1. Managing Personal Information (e.g. Email), 2. Social Bookmarking, 3. Collecting and Sharing Objects, 4. Improving the E-Commerce Experience (e.g. Amazon), 5. Other Uses (e.g. ESP Game12) (Smith, 2008b, 7ff.). Since the definition of collaborative information services at hand precludes the purely personal use of tags as a means for resource management, item 1 is excluded from the further discussion of collaborative information services in this book. The aspects ‘Social Bookmarking’ and ‘Improving the E-Commerce Experience’ are summarized under ‘social bookmarking services.’ The aspect ‘Other Uses’ will also be illustrated, but not allocated to collaborative information services, since the constituent property of the creation of a knowledge base accessible to all users is not in evidence here.
11 Tags are descriptors which are freely chosen by the users. For the detailed description, see chapter three.
Figure 1.4: The Structure of a Collaborative Information Service, Using the Example of a Social Bookmarking System. Source: Heymann, Koutrika, & Garcia-Molina (2007, 37, Fig. 1).
Social bookmarking services allow users to store (web) resources in their personal user profile, put tags on them and thus render the resources retrievable. The resources were either found online by the users (e.g. links on del.icio.us) or provided by the social bookmarking service. Examples of the latter variant are the electronic marketplace Amazon, where users may mark the products with tags, and library catalogs, where users may tag books. This procedure allows social bookmarking services to accumulate the tags thus attached to the resources, so that they can display all tags for a book, as well as their frequency of allocation, for the user’s benefit.
12 http://www.gwap.com.
Heymann, Koutrika and Garcia-Molina (2007) describe the structure of a social bookmarking service by way of example (see Figure 1.4). The creation of an account is the first step a user must take in order to join the information service and to be able to manage resources (see Figure 1.4 ‘user registration,’ upper right-hand corner). Then the user may transfer the desired resource into the system (‘submit bookmarks’) and perhaps index it by describing the resource via tags or with further information such as title, content, location etc. (‘Annotations’). Initially, these are the only actions a user may or must take, respectively, as ‘content creator.’ Further user activities mainly concern the retrieval of resources (see Figure 1.4, upper left-hand corner), since users mostly gain access to the resources via the interfaces ‘Tag cloud,’ ‘Recent bookmarks,’ ‘Popular bookmarks’ or ‘Tag bookmarks.’ This means that the user – located in the information platform’s information space (see Figure 1.4 bottom center) – can reach the following resources via the interfaces:
Tag cloud: a list of the most commonly occurring tags in the system;
Recent bookmarks: a list of the most recent URLs posted to the system by any user;
Popular bookmarks: a list of the most popular URLs ordered as a function of number of postings and the amount of recent activity for each URL;
Tag bookmarks: for a given tag t, a list of the most recently posted URLs annotated with tag t;
User bookmarks [not shown in Figure 1.4, A/N]: for a given user u, a list of the most recent URLs that user posted (Heymann, Koutrika, & Garcia-Molina, 2007, 38).
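The five interfaces quoted above can be read as five views on one and the same collection of postings. The following sketch makes this explicit under simplifying assumptions: the record type `Post`, the sample data and all function names are hypothetical, and ‘Popular bookmarks’ is ranked by number of postings only, whereas the source also mentions recent activity as a factor.

```python
from collections import Counter, namedtuple

# Hypothetical posting record: who saved which URL with which tags, and when.
Post = namedtuple("Post", ["user", "url", "tags", "time"])

POSTS = [
    Post("anna", "http://example.org/a", ("search", "web"), 1),
    Post("ben", "http://example.org/a", ("search",), 2),
    Post("ben", "http://example.org/b", ("design",), 3),
    Post("cara", "http://example.org/c", ("search", "design"), 4),
]


def tag_cloud(posts, n=10):
    """Most commonly occurring tags in the system, with their counts."""
    return Counter(t for p in posts for t in p.tags).most_common(n)


def _newest_first(posts):
    return [p.url for p in sorted(posts, key=lambda p: p.time, reverse=True)]


def recent_bookmarks(posts):
    """URLs posted by any user, newest first."""
    return _newest_first(posts)


def popular_bookmarks(posts):
    """Simplification: ranks by number of postings only, ignoring recent activity."""
    return [url for url, _ in Counter(p.url for p in posts).most_common()]


def tag_bookmarks(posts, tag):
    """URLs annotated with the given tag, newest first."""
    return _newest_first(p for p in posts if tag in p.tags)


def user_bookmarks(posts, user):
    """URLs posted by the given user, newest first."""
    return _newest_first(p for p in posts if p.user == user)
```

A URL saved by several users appears once per posting in the recency-based lists; a real service would presumably collapse such duplicates.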
Sharing services also enable users to create their own profile, but additionally allow them to upload their own resources of various kinds (e.g. photos and videos) to the servers. These resources are then made accessible to the other users, who may view and edit them. The user can provide his or her own resources with tags and thus render them searchable. Other developments in the area of social software that should not go unmentioned are varieties of collaborative information services that make heavy use of tags and folksonomies and incorporate another main aspect of social software besides resource management. Tagging-based social networks (such as 43things13) combine the features of collaborative information services and social software and may be termed ‘goal-sharing services.’ These platforms are meant to bring like-minded people together and to encourage their communication. This is implemented via tags which the users publish on the platform itself. However, they do not upload any resources but merely describe their goals in life (hence the term) via the tags in their profiles. Social software has not gone without problems or without facing criticism. With regard to social networks, there are serious misgivings about data and user privacy (Gross, Acquisti, & Heinz, 2005; Krishnamurthy & Wills, 2008; Gissing & Tochtermann, 2007 a.o.), wikis and especially Wikipedia must answer questions about suitable criteria of quality (Stvilia et al., 2005; Hammwöhner, 2007 a.o.) and collaborative information services all too often slip into the legal mire due to alleged violations of intellectual property (Leistner & Spindler, 2005; Lohrmann, 2008 a.o.). Even the collaborative creation and publication of great quantities of data in the World Wide Web can lead to problems that mainly come to bear on the user’s information gathering: “But simply adding more information isn’t virtually free. It comes with a cost: noise” (Kalbach, 2008, 36). To find the relevant information in the mass of available information becomes ever more difficult. Schütt (2006) summarizes this problem using the example of blogs: But it’s hardly begun and already there’s talk of a ‘blog overflow’ because every Tom, Dick and Harry has his own blog, and so the quality of the content seems to be drowning once again in the ocean of information. This cries for artistic limitations or specialized blog search engines that help find the diamond (Schütt, 2006, 33).
13 http://www.43things.com.
This is why tools have been outlined in Figure 1.4 that provide access to the information resources in social software, and which will be explained over the course of this chapter. Technorati14 would be one such search engine that searches blogs and thus makes heavy use of tags. The flood of information may also be met by improved access paths to the resources. In information science, this access is generally warranted through the coverage and indexing of the resources; for collaborative information services, this is done via tags and folksonomies, respectively. Figure 1.2 mentions tagging games, which help during the indexing of mainly non-textual resources. Principally, however, the indexing via tags is implemented by the users of collaborative information services themselves, so that once again the amalgamation of producing and consuming users in Web 2.0 becomes apparent: “[Folksonomy, A/N] represents another example of the fuzziness separating consumers and creators on the Web today” (Godwin-Jones, 2006, 10). After all, users use tags (their own as well as others’) to gain access to the information resources: It may be more accurate, therefore, to say that folksonomies are created in an environment where, although people may not actively collaborate in their creation and assignation of tags, they may certainly access and use tags assigned by other (Spiteri, 2007).
Below, I will introduce various collaborative information services, explain their functionalities and discuss their applicability in the World Wide Web. Filtered from the mass of all information services available online will be the most popular and thus basically the prototypical representatives, where provisions will be made for all distinct resource types, such as photos, videos, music, links etc. In the presentation of collaborative information services, particular attention will be paid to the question of to what degree they use folksonomies for indexing and research purposes. The goal of this analysis is to provide an overview of the applicabilities of folksonomies in collaborative information services.
14 http://www.technorati.com.
Social Bookmarking Services
Social bookmarking services have upgraded a function of conventional desktop applications for Web 2.0 by enabling the user to save the web browser’s list of favorites online, independently of the desktop, and thus make resources already found more easily retrievable (Gordon-Murnane, 2006, 28): “One of the greatest challenges facing people who use large information spaces is to remember and retrieve items that they have previously found and thought to be interesting” (Millen, Feinberg, & Kerr, 2006, 111). The user does not even have to use his own browser anymore in order to access his bookmarks or favorites, but can instead use any browser to connect to his social bookmarking service of choice and save, manage and retrieve his links (Noruzi, 2006): “Social Bookmarking systems are a class of collaborative applications that allow users to save, access, share and describe shortcuts to web resources” (Braly & Froh, 2006). Thus bookmark collections are booming once more after having been nearly replaced by search engines and their fast search algorithms (Hammond et al., 2005). The use of corporate social bookmarking services in-house by companies is also widespread (Pan & Millen, 2008; Hayman, 2007; Hayman & Lothian, 2007; John & Seligmann, 2006; Damianos, Griffith, & Cuomo, 2006; Millen, Feinberg, & Kerr, 2006; Farrell & Lau, 2006). Spiteri (2006b) tells an anecdote about the problems caused by desktop-dependent lists of favorites and their limited organization possibilities: Our bookmarks or favorite lists are mushrooming out of control. Many of us have folders within folders within folders. We find ourselves bookmarking the same site a dozen times because we can’t remember where we filed it. Alternatively, we simply ‘Google it’ to save time. […] The problem is exacerbated when people use different computers (e.g., one at work, one at home, a laptop, etc.); they do not keep the same information across the different computers they use (Spiteri, 2006b, 78).
Feinberg (2006) defines the following main functions for social bookmarking services: a) “keeping frequently used resources handy,” b) “grouping resources needed for a particular project,” c) “identifying potentially interesting but non-critical resources” and d) “locating occasionally used resources that are difficult to recall” (Feinberg, 2006).
Figure 1.5: Bookmark Manager of the Browser ‘Firefox’ with its Folder Structure.
Apart from their decentralized bookmark storage facilities, social bookmarking services distinguish themselves mainly through another aspect: collaboration. They enable users to not only manage their own links privately, but also to make their collections of favorites publicly accessible to other users of the same service (Heller, 2007b). Thus the entire online community can share in and profit from a single user’s activity – only then does the actual meaning of the term ‘social’ manifest itself: Just as long as those hyperlinks (or let’s call them plain old links) are managed, tagged, commented upon, and published onto the Web, they represent a user’s own personal library placed on public record, which - when aggregated with other personal libraries - allows for rich, social networking opportunities (Hammond et al., 2005).
Regulski (2007) reports on a magazine that uses social bookmarking services specifically in order to raise the profile of its own articles and thus increase the number of their citations and of access paths leading to these articles. The activities that users may perform on their lists of links mainly encompass the organization and categorization of URLs. In contrast to traditional browser-oriented classification systems, the URL no longer has to be saved in a folder (see Figure 1.5), but can be described and at the same time rendered retrievable via metadata, in this case via tags. To achieve this, the user attaches as many tags as needed to the URL. The community’s other users may then use the tags or the link address to search for resources. If they do so using tags, their search results will be all URLs that were indexed with the requested tags, as well as any available information as to what other tags the displayed resources were allocated, and which users have stored the URL in their profile. A search via link address will also lead to these users and, furthermore, to all tags the resource was indexed with by the community. Thus the social bookmarking system changes from a personal URL management program to an information service that uses the main features of social networks and thus enables the community to search more effectively. It now serves not only as a storage space (as described above by Feinberg (2006)), but also as a provider of interesting or relevant web resources – after all, these searches lead the user to URLs that he might never have found browsing through the net, and which have been indexed intellectually by the community using tags.
This is where social bookmarking services have the upper hand over search engines: “Social bookmarking is already used by many as an alternative to search engines, or at least as an addition – an addition which has the advantage that the information has been pre-filtered by like-minded people” (Reinmann, 2008, 54). At this point Skusa and Maaß (2008) and Graefe, Maaß and Heß (2007) argue that social bookmarking services’ search results can never be as up to date as those of algorithm-based search engines, and that services such as news websites are, moreover, seldom included in social bookmarking services and thus seldom made searchable. A detailed description of the functionalities of social bookmarking services will be provided by the example of two popular services: del.icio.us15 and BibSonomy16. Del.icio.us represents social bookmarking services that are directed at the broad mass of users and explicitly do not specialize in any subject area (as do, for example, Blinklist17, Furl18, Ma.gnolia19 or Simpy20) – even though six of the ten most popular bookmarks in del.icio.us are about web programming (Wetzker, Zimmermann, & Bauckhage, 2008) and thus directed at a technology-savvy usership, while BibSonomy mainly caters to the scientific community and manages scientific links as well as references on scientific publications (as do CiteULike21 and Connotea22, a.o.; Lund et al., 2005). All platforms cited are free of charge. Del.icio.us was created in 2003 by Joshua Schachter as a social bookmarking service (Alby, 2007), and as of September, 2007, could boast more than 5m registered users and in excess of 150m saved bookmarks (Arrington, 2007; Heymann, Koutrika, & Molina, 2008). Del.icio.us was taken over by Yahoo! (Schachter, 2005).
15 http://del.icio.us.
16 http://www.bibsonomy.org.
17 http://www.blinklist.com.
18 http://www.furl.net.
19 http://www.ma.gnolia.com.
20 http://www.simpy.com.
Figure 1.6: User Interface for Manually Adding Bookmarks in del.icio.us.
After creating an account on del.icio.us’ homepage, users can directly begin collecting and managing their bookmarks (Orchard, 2006). To do so, they either enter the URLs manually (see Figure 1.6) or use a browser add-on which enables them to add bookmarks with the click of a button. Apart from the URL, they can also add a title for and notes on the bookmark and attach tags to it. The tags attached to a bookmark may comprise at most 1,000 characters in total, the maximum length of a single tag is 128 characters, and tags are separated from one another by a blank. This way of separating multiple tags is described by Smith (2008b, 124ff.) as ‘character-delimited,’ since the user uses a system-prescribed symbol (e.g. blank, comma or semicolon) as a separator and can thus index several tags for a single resource at once. Choosing ‘Do Not Share’ keeps the bookmark private, so that other members of the community cannot access it. Figure 1.6 displays the user interface for saving the first bookmark, which is why none of the user’s previously added tags are being recommended for indexing at the bottom (blue tab ‘Tags’). In this case, the URL http://www.phil-fak.uni-duesseldorf.de/infowiss has not yet been saved in del.icio.us’ database either, so that none of its other users’ tags are available for recommendation.
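A character-delimited tag entry of this kind can be processed as in the following minimal sketch. The function name `parse_tags` and its error handling are hypothetical. Reading the 1,000-character limit as a limit on the whole entry, and silently dropping repeated tags, are assumptions made for illustration only.

```python
MAX_TAG_LENGTH = 128      # per-tag limit named in the text
MAX_TOTAL_LENGTH = 1000   # assumed to apply to the whole tag entry


def parse_tags(raw, delimiter=None):
    """Split a character-delimited tag entry; delimiter=None splits on blanks."""
    if len(raw) > MAX_TOTAL_LENGTH:
        raise ValueError("tag entry exceeds 1,000 characters")
    tags = []
    for tag in raw.split(delimiter):
        tag = tag.strip()
        if not tag:
            continue  # skip empty pieces caused by doubled delimiters
        if len(tag) > MAX_TAG_LENGTH:
            raise ValueError("tag longer than 128 characters")
        if tag not in tags:  # keep the first occurrence, drop repeats (an assumption)
            tags.append(tag)
    return tags
```

With a blank as delimiter, a phrase such as ‘information science’ inevitably becomes two tags, which is why users of such systems resort to compounds like ‘information_science.’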
21 http://www.citeulike.org.
22 http://www.connotea.org.
The user interface is constructed differently when the user and the about-to-be-saved bookmark both already have several tags (see Figure 1.7). In this case, the user is shown his or her tags for indexing, so that he or she can compile a personal and fairly consistent indexing vocabulary. The most popular tags for this bookmark are also displayed. If the user enters a tag that has been used for this bookmark before (in this case ‘search’), a selection of tags as well as information as to their frequency of usage for this bookmark will be displayed via a ‘Type Ahead’ functionality.
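How such a ‘Type Ahead’ suggestion list might be computed can be sketched as follows. This is not del.icio.us’ actual implementation; the function `type_ahead`, the prefix-matching rule and the sample counts are hypothetical.

```python
from collections import Counter


def type_ahead(prefix, tag_counts, limit=5):
    """Return (tag, count) pairs whose tag starts with the typed prefix, most used first."""
    prefix = prefix.lower()
    matches = [(t, c) for t, c in tag_counts.items() if t.lower().startswith(prefix)]
    # Sort by descending frequency, then alphabetically to break ties.
    return sorted(matches, key=lambda tc: (-tc[1], tc[0]))[:limit]


# Hypothetical counts of the tags other users attached to one bookmark.
counts = Counter({"search": 40, "searchengine": 12, "seo": 7, "google": 25})
```

Typing ‘sea’ would then offer ‘search’ before ‘searchengine,’ each with its frequency of usage for this bookmark.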
Figure 1.7: Adding an Already Indexed Bookmark in del.icio.us.
Figure 1.8 shows a user account. Here the user is shown information regarding his saved URLs (date of entry, name, number of users who have also saved the URL, tags attached to the URL) and his used tags in a sort of list. Clicking on one of the tags in the list or adding a tag in ‘Username → Type a Tag’ displays all bookmarks that have been indexed with this tag. Entering tags in the search field has the advantage that tags can be linked with an AND. This interface also enables the user to apply different editing functions to the bookmarks and tags. Thus information regarding the URL can be changed or the URL deleted altogether, and the way the tags are displayed can be adjusted via ‘tag options’ (see Figure 1.8).
Figure 1.8: Interface of a User Account in del.icio.us.
Additionally, the user can personally change the tags: they are renameable and deletable, and they can also be compiled into so-called ‘tag bundles’ in order to create a hierarchical tag order. This functionality’s greatest use is in helping users manage their own tags.
Figure 1.9: Options for the Display of Tags via ‘Tag Options’ in del.icio.us.
Helping users index their own bookmarks is not the tags’ sole purpose, however – they are equally employed during searches for information or new URLs. They can be used to order ‘subscriptions’ (see Figure 1.10), in which the user is automatically sent all URLs indexed with the searched-for tag (in this case ‘folksonomy’) directly into his or her account. The subscription may also be limited to one particular user. Thus the user is kept up to date on the latest postings or developments within his or her area of interest, or on those of a favored community member, without having to actively search for the information. The automatic reception of new URLs can also be implemented by subscribing to RSS feeds, which then react to either tags or users.
Figure 1.10: Subscriptions for the Tag ‘Folksonomy’ in del.icio.us.
Also, the user can use tags to search for a URL, either by entering a tag into the search mask on the del.icio.us homepage or by browsing and clicking through tag clouds (see Figure 1.11). After a search, both variants yield all bookmarks within del.icio.us that were indexed with the searched-for tag. The tag search field again has the advantage that the search terms can be linked with an AND. However, the search tags must be entered into the different fields one by one (see Figure 1.12). This form of user guidance enables users to not only restrict search results with further tags, and thus limit the number of results, but also to remove already entered tags with a click of the mouse and so re-adjust the search results. In Figure 1.12, an initial search for the tags ‘web,’ ‘design,’ ‘graphic,’ ‘journals’ and ‘css’ was executed, which yielded one result. If the user removes the tag ‘css’ from the AND link, the number of results will increase to three. The exclusion of search terms is not an option; an OR link can be activated via tag bundles. The use of upper- and lower-case letters for search terms is ignored; only compounds must be spelled correctly. Search requests via search field as well as by clicking on tags in the tag cloud are supplemented by del.icio.us with ‘related tags,’ that is tags similar to the search term, which can serve to refine the request (see Figure 1.13). The search result is re-adjusted by clicking on a related tag. These similar tags are calculated through co-occurrence analyses of tags in del.icio.us.
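The AND search and the co-occurrence-based ‘related tags’ described above can be sketched in a few lines. The index `BOOKMARKS`, its contents and both function names are hypothetical; the sample data are merely chosen so that the counts mirror the example of Figure 1.12 (one result with ‘css,’ three without). How del.icio.us actually weights co-occurrences is not stated in the text, so plain counts are used here.

```python
from collections import Counter

# Hypothetical index: URL -> set of tags assigned to it.
BOOKMARKS = {
    "http://example.org/1": {"web", "design", "graphic", "journals", "css"},
    "http://example.org/2": {"web", "design", "graphic", "journals"},
    "http://example.org/3": {"web", "design", "graphic", "journals", "typography"},
    "http://example.org/4": {"web", "css"},
}


def search_and(bookmarks, query_tags):
    """URLs carrying every query tag; upper and lower case are not distinguished."""
    wanted = {t.lower() for t in query_tags}
    return sorted(url for url, tags in bookmarks.items()
                  if wanted <= {t.lower() for t in tags})


def related_tags(bookmarks, tag, n=5):
    """Tags that co-occur most often with the given tag on the same bookmark."""
    co = Counter()
    for tags in bookmarks.values():
        if tag in tags:
            co.update(tags - {tag})
    return co.most_common(n)
```

Removing a tag from the query can only enlarge the result set, since every bookmark that satisfied the longer AND link also satisfies the shorter one.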
Figure 1.11: Search Field and Tag Cloud for Information Retrieval in del.icio.us.
Tag-based search is not del.icio.us’ default setting, but it can be activated via the tab ‘Explore Tags’ on the homepage or, after a search, via the tab ‘Tags.’ Searching for and browsing through URLs and usernames is also an option. Likewise, the search can be manually limited to one’s own bookmarks, one’s own network (provided the user is linked to others via the ‘friendship’ function) or one’s own tags. The list of search results is sorted by the date of saving or indexing – the latest entry will be displayed at the top of the page.
Figure 1.12: Linking Tags in a Search on del.icio.us.
Figure 1.13: Related Tags for the Refinement of Search Requests in del.icio.us.
BibSonomy (Hotho et al., 2006a; Hotho et al., 2006b; Jäschke et al., 2007; Schmitz et al., 2006; Regulski, 2007) has committed itself to the storage of scientifically oriented URLs and scientific references, and in 2007 reported more than 5,000 registered users (Jäschke et al., 2007). This social bookmarking service thus offers many further functionalities which cannot be found in del.icio.us and which are mainly directed at scientists, while considering their ways of working. BibSonomy serves as a decentralized URL storage facility on the one hand, and on the other hand as a storage facility for references and publications in the BibTeX format (Patashnik, 1988). Furthermore, BibSonomy facilitates the automatic compilation of bibliographies from saved references.
Figure 1.14: Manually Entering a Publication into the Designated Fields Equipped with Tagging Functionality in BibSonomy.
After creating an account, users can directly begin managing their bookmarks and publications. The information (such as author name, title etc.) can be either entered directly into the designated fields or imported from a BibTeX program or added via the browser button ‘post publication.’ Here the user can choose manually what sort of publication the resource to be saved is to be classified as. In a second step, an enhanced user interface appears (see Figure 1.14) which enables the user to add even more information to the resource, such as personal remarks, abstracts or tags. The fields are heavily influenced by the parameters set by the two literature management programs BibTeX and EndNote, in order to facilitate the import and export of the saved publications. To enter a URL (see Figure 1.15), the user is only required to fill out the fields ‘URL,’ ‘title’ and ‘tags.’ While entering the tags, the user receives tag suggestions in two different ways: a) ‘suggested’ tags are displayed during text input via a ‘Type Ahead’ functionality and calculated from the user’s available tags, b) ‘recommendation’ tags are calculated from the user’s available tags or, as in Figure 1.15, extracted from the title of a resource. Individual tags are separated by a blank. Using a drop-down menu, the user can also decide whether the resource (bookmark or publication) should be made accessible to the entire BibSonomy community, just to his friends or only to him- or herself: These three accessibility options for an entry facilitate the platform’s usage as a private, portable information basis, as an instrument for collaborative research, or as a forum for the exchange of recommendations and information regarding research-relevant literature (Regulski, 2007, 180).
The user can access their account via a permanent URL. As displayed in Figure 1.15, the user is here provided with an overview of his saved URLs and publications as well as of the tags used for indexing these resources. The way the tags are represented can be adjusted: they can be displayed as a tag cloud (as in the picture) or as a list, sorted by frequency of occurrence or alphabetically, or restricted to tags with a minimum indexing frequency. The user can also use the field ‘Filter’ to browse through their tags.
Figure 1.15: User Account with Information on Saved URLs and Publications and Used Tags in BibSonomy.
For the bookmarks, the user can edit or delete details on all URLs either at the same time or one by one. Furthermore, he can export the bookmarks via RSS or XML feeds. For the publications, additional functions are at the user’s disposal. Thus publications can be exported using different formats (RSS, BibTeX, RDF, a.o.) or deposited in a ‘basket’ using the ‘pick’ button. This last function is available to the user not only in the user account, but on the entire BibSonomy platform, so that he can compile a bibliography while browsing and export it using various formats. BibSonomy, like del.icio.us, also offers a tag editing function (see Figure 1.16). This enables the user to either delete their tags or replace them with others, and to link them via unspecific relations and thus create a tag hierarchy (Hotho et al., 2006b) – an example would be a link between the tags ‘knowledge representation’ and ‘folksonomy.’ While doing so, it is important to enter the superordinate tag into the right field and the subordinate tag into the left field.
Figure 1.16: Editing Functions for the User’s Tags and Relations in BibSonomy.
The greatest effect of these relations is on searches. If the user clicks on the superordinate tag in the ‘relations’ menu, he is not only shown all publications and bookmarks indexed with this ‘supertag’ but also all resources indexed with its ‘subtag’ (see Figure 1.17). However, this functionality is limited to the user’s own personal resources and tags and cannot be applied to the platform at large.
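The effect of such relations on a search can be sketched as a query expansion over the user’s own posts. The data structures, the resource names and both function names below are hypothetical. That relations are followed across several levels is an assumption; the text only describes a single supertag and its subtag.

```python
# Hypothetical per-user data: relations map a supertag to its subtags.
RELATIONS = {"knowledge representation": {"folksonomy", "thesaurus"}}
POSTS = {
    "paper-1": {"knowledge representation"},
    "paper-2": {"folksonomy"},
    "link-3": {"web2.0"},
}


def subtags_of(tag, relations):
    """All direct and indirect subtags (assumes relations may be chained)."""
    found, stack = set(), [tag]
    while stack:
        for sub in relations.get(stack.pop(), ()):
            if sub not in found:  # also guards against circular relations
                found.add(sub)
                stack.append(sub)
    return found


def search_concept(tag, relations, posts):
    """Resources tagged with the supertag or with any of its subtags."""
    wanted = {tag} | subtags_of(tag, relations)
    return sorted(r for r, tags in posts.items() if tags & wanted)
```

A search for the subtag alone is not expanded upwards, so ‘folksonomy’ still returns only the resources indexed with that tag.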
Figure 1.17: Subtags and Supertags in BibSonomy.
Research within BibSonomy is possible in various ways: either by clicking on the tags in the tag cloud (see Figure 1.18) or by entering terms into the search field. The search field’s advantage is that the user can directly limit his search to a tag, a user, a group, an author, a concept, a BibTeX key23 or his own account. Additionally, he can link multiple search terms with an AND or an OR (the same goes for tags, users etc.).
Figure 1.18: Search Functions in BibSonomy.
A tag cloud search and a traditional search via search field (set to ‘tags’) both yield a list of resources indexed with the tag in question (see Figure 1.19).
Figure 1.19: List of Search Results for the Tag ‘Folksonomy’ in BibSonomy.
Furthermore, on the right-hand side the user is offered the possibility of limiting the search to his own account or of changing the tag to a concept24 and then repeating the search, either on the whole platform or in his account. Also displayed on the right-hand side are ‘related tags’ and ‘similar tags,’ which can be activated as an alternative to the search term by clicking on them and do not serve to refine the search result via an AND link: Related tags are those tags which were assigned together to a post. If e.g. a user has tagged a post with java and programming, then those two tags are related. Similar tags on the other side are computed by a more complex similarity measure coming from the research on information retrieval, namely cosine similarity in the vector of the popular tags. Similar tags are in many cases synonym tags25.
23 The BibTeX key is a resource-specific key that serves to identify the data set precisely in a literature management program.
24 “In order to distinguish between simple tag queries and those involving subtags, we call the latter one a query for java as a concept [occurs when ‘Java’ was designated as the hyponym of another term ‘X,’ A/N]” (Jäschke et al., 2007).
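The difference between related and similar tags can be made concrete with a small sketch. It assumes, as one possible reading of the quoted description, that each tag is represented by a vector of its co-occurrence counts with other tags and that similarity is the cosine of two such vectors. BibSonomy’s actual computation may differ, and the sample posts and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical posts, each reduced to its set of tags.
POSTS = [
    {"java", "programming"},
    {"java", "programming", "code"},
    {"python", "programming"},
    {"python", "programming", "code"},
    {"cooking", "recipes"},
]


def cooccurrence_vectors(posts):
    """Map each tag to a Counter of the tags it was assigned together with."""
    vectors = defaultdict(Counter)
    for tags in posts:
        for tag in tags:
            vectors[tag].update(tags - {tag})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(POSTS)
related_to_java = set(vectors["java"])                    # assigned together with 'java'
similarity = cosine(vectors["java"], vectors["python"])   # the two never co-occur
```

In this sample, ‘java’ and ‘python’ are never assigned to the same post and are therefore not related, yet they are used in the same contexts and come out as highly similar. This matches the observation that similar tags often behave like synonyms.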
The tags’ font size reflects the degree to which they resemble the search term. The user can structure the list of results according to the date the resources were recorded or according to their FolkRank value. FolkRank is an adaptation of the idea of PageRank and aims to find the most relevant resources for each tag. To achieve this, it examines the link structure between users, tags and resources. BibSonomy has also implemented several shortcuts for search purposes, so that the required information can be entered directly into the browser address field and searched. Attach the following to www.bibsonomy.org:
a) tag/SEARCHTERM in order to retrieve all resources indexed with the search term.
b) user/USERNAME in order to retrieve all of the user’s resources.
c) group/NAME in order to retrieve all of the group’s resources.
d) relations/USERNAME in order to retrieve all of the user’s relations.
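The four shortcuts listed above follow one pattern, which the following sketch turns into a small helper. The function name `shortcut_url` is hypothetical, and percent-encoding the search term is an assumption about how special characters should be handled, not something the source specifies.

```python
from urllib.parse import quote

BASE = "http://www.bibsonomy.org"
SHORTCUTS = {"tag", "user", "group", "relations"}  # the four path prefixes listed above


def shortcut_url(kind, value):
    """Build a shortcut address such as http://www.bibsonomy.org/tag/folksonomy."""
    if kind not in SHORTCUTS:
        raise ValueError(f"unknown shortcut: {kind}")
    # Percent-encode the value so that blanks and slashes cannot break the path.
    return f"{BASE}/{kind}/{quote(value, safe='')}"
```

Calling `shortcut_url("tag", "folksonomy")` yields the address a user would otherwise type by hand into the browser address field.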
These search options can also be accessed via the drop-down menu ‘myBibSonomy.’ Here the user can also gain access to the PDF documents he uploaded to the BibSonomy server, to any duplicates and to the advanced search functions, called ‘mySearch.’ Advanced search offers the user more elaborate search functionalities for the publications in his account (see Figure 1.20). Here he can link several tags or authors with ANDs or ORs for search purposes, and use the operators to link tags and authors with each other. Additionally, he can enter search terms that do not correspond with the tags into the free text search field. The search is run over the publications’ title fields. Advanced search is not available for URLs.
Figure 1.20: Advanced Search ‘mySearch’ for a User’s Publications in BibSonomy.
25 See http://www.bibsonomy.org/faq.
E-Commerce
Tagging functionalities are growing ever more popular in e-commerce (Tschetschonig et al., 2008; Hayman, 2007; Hayman & Lothian, 2007), where they are used to channel users’ knowledge into the presentation, classification26 and marketing of products.
Figure 1.21: Product with Tags, Tag Search Option and the Option for Adding Tags in Amazon.
The online merchant Amazon27 has been making use of the (implicit) shopping and browsing behavior of its customers and of their readiness to share their impressions of products with other users ever since its foundation in 1994 (Spalding, 2007a). Every user’s transactions and viewed products are saved in Amazon’s system, evaluated and used to interest him in other products (Linden, Jacobi, & Benson, 1998; Linden, Smith, & York, 2003; Gaul et al., 2002). Typical examples for this procedure are the recommendations underneath the product description: “Customers Who Bought This Item Also Bought Items B, C, D etc.” or “Frequently Bought Together: Items A, B and C.” While these are calculate purely from the products’ sales statistics, registered users receive product recommendations based on their own shopping and browsing behavior. Furthermore, registered users can write and publish customer reviews to the purchased products on Amazon, as well as rate the items with stars (‘x out of five stars’). These reviews can also be commented on by other users, thus facilitating not only the exchange of communication between Amazon and its customers but also between customers. Especially active users with a 26
26 See also the online marketplace for handmade products www.etsy.com, the product category system of which is based on user-generated tags.
27 http://www.amazon.com.
large number of reviews under their belt are rated ‘Top Reviewers’ by Amazon, creating an incentive for users to keep publishing intelligent comments. Since 2007, Amazon has enabled users to exchange their knowledge via a product wiki, Amapedia28 (Gilbertson, 2007). The success of this collaborative trading platform based on customers’ and manufacturers’ product information has led to the introduction of tags to describe and categorize products. Since late 2005 Amazon customers have been able to attach tags to products (Arrington, 2005), where the actual allocation of tags is the prerogative of registered users – anonymous users can only use tags to search for products (see Figure 1.21). After registration, the user may annotate29 each product with a tag, a comment or a review (see Figure 1.21). Underneath the formal product description, the user finds tags that have already been used by other users. On display are the six most commonly attached tags. Should more tags be available, the user can call these up and arrange them by their popularity, alphabetically or by their indexing date (see Figure 1.22). Additionally, one can see which other users have tagged the product (e.g. that Mariam "meme05" was the first to add the tag ‘modern romance’ and that the last added tag was ‘smoochbook’).
Figure 1.22: Sorting Options for Tags and Access Path to Other Users in Amazon.
To use already available tags for his own indexing purposes, the user only has to click on them (exemplified by the checked tags); to add a new tag (e.g. ‘smoochbook’), he can use the input field (see Figure 1.23). Apart from the tags, the user is given information on how often each tag has been used for this particular resource. Each user may add as many tags to the resource as he wants. The input field for tags is interesting because it is designed for entering single tags only, which means that tags separated by a blank will be combined into one single tag if they are entered into the field one after the other (e.g. ‘grandma_xmas smoochbook’) before the ‘add’ button is clicked. This may result in vast tag chains if the user is unfamiliar with this system. On the other hand, Amazon tries to avoid the forming of compounds – the linking of tags via symbols such as dashes or underscores etc. is not corrected, like the tag ‘grandma_xmas’ in Figure 1.23. Any capitalization, however, is immediately changed to lower-case letters by Amazon.
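The behavior of the single-tag input field can be illustrated with a minimal Python sketch. It is not Amazon's implementation: the function name `normalize_single_tag_input` is hypothetical, and the collapsing of repeated blanks is an assumption. Only the rules stated in the text are modeled, namely that the whole field content becomes one tag, that capitalization is lowered, and that dashes and underscores are left as typed.

```python
def normalize_single_tag_input(field_text: str) -> str:
    """Treat the entire content of the input field as ONE tag."""
    # Assumption: runs of whitespace are collapsed to a single blank.
    collapsed = " ".join(field_text.split())
    # Capitalization is changed to lower case; dashes and underscores stay untouched.
    return collapsed.lower()


# Two intended tags typed with a blank between them end up as one tag chain.
print(normalize_single_tag_input("Grandma_Xmas Smoochbook"))
```

A user who expects two tags here would have to submit them separately, which is the pitfall the paragraph above describes.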
28 http://amapedia.amazon.com or http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=43568011.
29 First, however, he must decide whether he wants to use a pseudonym or his ‘real’ name, generated from his credit card / bank account information.
Figure 1.23: Adopting Tags and Adding Personal Tags in Amazon.
Figure 1.24: Type-Ahead Functionality and Frequency of Tag Usage Displayed on Amazon.
Figure 1.25: Quick Tagging Function by Double-Typing ‘t’ in Amazon.
While manually entering tags, the user is supported by Amazon via a Type-Ahead functionality, which additionally shows how frequently the tag in question has been used (see Figure 1.24). A function unique in the manner of its implementation is quick tagging. Typing ‘t’ twice in a row opens a browser window (see Figure 1.25) allowing the user to add or edit tags, as well as showing him tags already used for this resource. With regard to adding tags, the user is here alerted to the fact that multiple tags in the input field must be separated by a comma. Additionally, the indexing process is
again enhanced by the Type-Ahead functionality. Editing tags, in this window and in general, comprises deleting and changing them. Changes to one’s own tags, and thus for all products tagged with these, are applied automatically. Since used tags are visible to all users by default (provided the user doing the tagging has placed an order with Amazon at least once), Amazon recommends guidelines for the use of tags demanding that users tag without resorting to any of the following:
• Profane or obscene language, inciting or spiteful tags,
• Tags that might harass, abuse or threaten other members of the community,
• Tags that may reveal any personal information about children under age 13,
• Tags that promote illegal or immoral conduct,
• We recommend that you do not use tags which might reveal your phone number or e-mail address30.
Figure 1.26: Tag Cloud of the Most Popular Tags on Amazon. Source: http://www.amazon.de/gp/tagging/cloud.
A product search can be implemented by either clicking on a tag cloud (see Figure 1.26) or via the search field on the product pages. Clicking on a tag in the tag cloud or underneath the product description leads the user to all products indexed with this tag (e.g. ‘historical fiction’) (see Figure 1.27). He can also see how often a product has been indexed with this tag and which users have done so. Usernames link to user profiles which contain information about the individual, any reviews they have posted, their lists of favorites and tags used as well as their frequency. Using the plus and minus symbols above the product depictions, the user can either add the tag (here ‘historical fiction’) himself (using the plus symbol) or recommend to the community not to use this tag to index this product anymore (using the minus symbol). This functionality cedes quality control to the users.
30 See http://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=16238571.
Figure 1.27: List of Results after Clicking on a Tag Cloud or after a Tag Search on Amazon.
The user can also add further products to a tag and is thus requested to index. The tag cloud that comes up informs the user which co-occurrences exist for the tag ‘romance novels’ and what his browsing options are. If he clicks on one of the tags here, he will be shown all products indexed with that tag – there will be no AND link between the two search tags. The user can also be directed to different product categories; clicking on the tag ‘romance’ will also lead to CDs and DVDs. Linking tags with an AND can be implemented in a tag search via search field, but only after the list of search results for the first tag has been displayed (see Figure 1.28). Here the user can limit the search results to only the most popular tags, the ones most frequently used, or search for all tags (displayed in alphabetical order). Both options are implemented and adjusted via co-occurrence analyses. A link with the Boolean operator AND, a phrase search or the entering of operators directly into the search field are impossible. The user can still discover related products in this list of search results; the tags do not limit the results but instead lead to other products. What’s more, the user can check the correct allocation of tags and products. The search results can be arranged according to ‘recently tagged’ items, ‘popularity’ and ‘recently added.’
Figure 1.28: List of Search Results after Tag Search via Search Field on Amazon.
Neither tag search nor tag clouds are accessible from Amazon’s homepage. The user can limit his search to certain categories in the search field, or browse within categories, but neither an advanced search nor a search for tags only is possible. Tag
search is only offered on an actual product’s page, and the tag cloud is reached via a link31. One’s own profile and tags as well as one’s tagged products are accessible after registration via the button ‘My Profile’ (see Figure 1.29). Here the user can manage his personal information accessible to other users during his tagging and commenting activities (depending on whether this option is activated – a private profile is also possible). Furthermore, the user is told how many tags he has already used and how often, which tags he has allocated to which products, and what other tags have been used by other users for the same product. Additionally, he can manage his tags, that is, search for his own products and tags, rename or delete his tags, and make them accessible to other users or keep them private.
Figure 1.29: A User Profile with Access to Tags on Amazon.
The German internet store BonPrix makes heavy use of tags for its catalog32 (Tschetschonig et al., 2008). Registered users can index the various items with tags and are thus recommended further items. Non-registered users cannot use tags but they can search for them. In this case, as Figure 1.30 shows, a conspicuous number of tags are adjectives (e.g. ‘weiblich’ = ‘feminine’). Spalding (2007a) analyzes tagging patterns in e-commerce and does not rate the success of this approach very highly. He estimates that Amazon currently has about 1.3m tags, compared to the online book community LibraryThing, which has more than 13m. Spalding (2007a) concludes that tagging is only successful in areas where users index their own resources: “You can’t get your customers to organize your products, unless you give them a very good incentive. We all make our beds, but nobody volunteers to fluff pillows at the local Sheraton” (Spalding, 2007a). Besides, it is not possible on Amazon to compile tags on various different editions of one book, to index the same content in one work and to cumulate the tags. This also
31 http://www.amazon.com/gp/tagging/cloud.
32 http://www.bonprix.de.
means that some resources are seldom or never indexed. Furthermore, he establishes that users with fewer than fifty books on LibraryThing tag very little or not at all; tags are only used for indexing purposes from 200 books upwards. He doubts that Amazon users have to keep track of that many resources – not to mention the fact that Amazon offers sufficient alternative products for managing one’s purchases (e.g. the wish list function) and hardly advertises the tagging function or the creation of a social network around its products.
Figure 1.30: Tags in E-Commerce. Source: http://www.bonprix.de.
Commercial Information Services
Even some commercial information service providers and hosts have by now embraced the indexing of their resources by users. The German business information host GBI Genios33 with its database WISO34 and Elsevier’s Engineering Village35 both offer a tagging function (Stock, 2007). At this point only the specific functions involving tagging will be discussed. A full introduction to the functional scope of these hosts cannot be provided in this context. WISO bundles in excess of 6.4m full texts from more than 340 German-language magazines as well as more than 4.4m references in different languages under a search interface, for the benefit of companies, universities and private individuals. The database’s core subject areas are economics, including among others industrial management, political and credit economy, as well as social science, with disciplines such as politics, social pedagogy and international relations. Moreover, WISO offers access to, among others, 90 daily and weekly newspapers, company profiles, annual accounts and entries in the Register of Companies in its ‘Press’ area. There are also plans for eBooks of selected textbooks from business and social science to be added to the database in 2009 (see footnote 36).
Figure 1.31: WISO’s Search Interface.
The tagging of resources is only possible after registering and acquiring access data. Since users cannot upload any personal resources to the platform, they first have to perform a search before they can index a resource with tags. To do so, they can use the input fields in Advanced Search or use Simple Search, which also displays a tag cloud reflecting the popularity of each tag (see Figure 1.31). Clicking on a tag leads the user to all resources that have been indexed with it. However, it is not possible to limit a search to tags only. After selecting a resource, users can add a comment, a rating (positive, neutral, negative) or original tags using the ‘Rate’ button (see Figures 1.32 and 1.33). Furthermore, users must select a ‘reading room’ to
33 http://www.genios.de.
34 http://www.wiso-net.de.
35 http://www.engineeringvillage2.org.
36 See http://www.zbw.eu/ueber_uns/aktuelles/veranstaltungen/frt08/patzer.ppt.
which the resource is then allocated37. These reading rooms comprise various subject areas (e.g. e-business or tourism) and can be accessed in the menu bar to the left. There they will also find their indexed resources.
Figure 1.32: Display of a Resource with Tags and Rating Functionality in WISO.
Figure 1.33: Tagging and Rating Interface in WISO.
37 See http://www.wiso-net.de/downloads/tipps-und-tricks-wiso-2008-neu.pdf.
Users cannot rate or tag the same resource more than once, and neither is there any way of deleting or editing the tags. After rating, the comments, ratings and tags are allocated to a resource and only become visible after they have been selected from a hit list; the comments may be viewed and designated ‘unacceptable,’ and tags can be used to browse through further resources indexed with the same tag. However, users are not told how often the tag in question has been allocated to the resource; only new tags are attached to resources. Furthermore, users are not offered the option to view and edit their tags and indexed resources in WISO. Even in the reading rooms, which contain all rated resources, it is not possible to search; only browsing is permitted.
Engineering Village is mainly directed at researchers and professionals in engineering science. The various databases (Compendex and Inspec, among others) account for bibliographical references from more than 175 engineering disciplines, patents, monographs and news items. Engineering Village has been allowing its users to index the resources in its database with tags since early 2007.
Figure 1.34: Tag Cloud and Tag Search on Engineering Village.
Figure 1.35: Indexing a Resource via the Tag Widget on Engineering Village.
After registering and acquiring access data, the user can begin indexing Engineering Village’s resources. Here, too, he must first perform a search to find resources for indexing. These are found using either Simple Search or Advanced Search. In that case, however, the search cannot be limited to tags. A specialized tag search as well as tag clouds can be accessed via the tab ‘Tags + Groups’ (see Figure 1.34). The user can adjust the display of the tag cloud and arrange the tags either alphabetically, by their frequency of usage or by their indexing date38. Tag search can be limited to public and private tags as well as group or institutional tags. Linking multiple tags via the Boolean AND is also an option: “Documents can then be retrieved by searching for specific tags or sets of tags” (Graber, 2007). Both search options lead the user to a hit list displaying resources indexed with that tag. Using the ‘Tag Widget,’ the user can add tags to the resource as required (see Figure 1.35, upper right-hand corner). Multiple tags are separated by commas. Here the user can directly decide whether all users of Engineering Village should be allowed access to this tag, or whether it should be kept private, or restricted to members of the user’s own group or institution. During tagging, the user is shown the tags he has already used. Here he can also edit or delete his tags. Once added, tags are attached to the resource and can immediately be found among the controlled terms (added by professional indexers) in the ‘uncontrolled terms’ area. Since the tags serve as links, the user can click on them to browse through related resources. However, he is not told at this point how often a tag has been allocated to the resource.
38 See http://www.ei.org/documents/Tagging.ppt.
Music-Sharing Services
Particularly dependent on tags are the kinds of collaborative information services that offer a platform to collect and exchange non-textual resources (Turnbull, Barrington, & Lanckriet, 2008). Apart from photos, these are mainly videos, music files and podcasts, where the term ‘podcast’ is an amalgamation of ‘iPod,’ the name of Apple’s popular MP3 player, and ‘broadcasting.’ They are defined as follows:
Podcasting is the publication of aural content online. Its crucial innovation is the option of automating the reception of the content by subscribing to it. […] Podcasting is not a completely new technology, but a development of existing trends and standards. Podcasting has emulated the success of weblogs by upgrading the latter’s format with an audiovisual component (Bacigalupo & Ziehe, 2005, 81ff.).
Generally, podcasts are offered in an MP3 or MPEG-4 format easily playable on mobile listening devices, such as iPods. Examples of collaborative information services dedicated to collecting and publishing podcasts are podcast.de39 or podcast.com40. Producers of podcasts can authorize the publication of their files on both online portals and let them be entered into their database. The services then subscribe to the audio files using RSS or Atom feeds and are thus always up to date on the podcasts’ current status. Podcasts do not have to be audio files, but can also contain a visual element. These video podcasts are also called ‘vodcasts’ (Quain, 2005; Meng, 2005). As opposed to videos on YouTube, podcasts and vodcasts are freely downloadable and can be saved on a user’s computer indefinitely (Weiss, 2005). Podcasts are indexed and subscribed to via tags.
The music-sharing service Last.fm41 also offers its users a tagging functionality. However, users cannot upload any personal resources to the platform or download others, unless they have been authorized for download by the record labels or artists. This gives Last.fm the character of an internet radio; however, this music platform must not go unmentioned in this context, since it highly respects the principles of collaborative information services and produces considerable added value for each individual user through the collaboration of numerous users. Last.fm was created in 2002 by Felix Miller, Martin Stiksel and Richard Jones, and as of early 2008 listed 21m users from over 200 countries visiting its sites each month (Kiss, 2008; Kaiser, 2008). The platform is London-based but uses the Micronesian country domain ‘fm’ in order to establish its link to frequency-modulated radio programming for the purpose of disturbance-free transmission. Most music albums and titles can be streamed and listened to three times free of charge on Last.fm. The artists are remunerated for each playback of their songs (Jones, 2008).
Last.fm distinguishes itself by offering its users free online access to several internet radio stations, music videos (from YouTube, among others) and songs. Furthermore, it uses a collaborative rating system to get to know each user’s taste in music and places titles on his playlist that he doesn’t know yet, but might like (Wegner, 2005). These recommendations are based on the user’s listening habits (which songs does the user play on the platform? on his PC? what internet radios does he listen to?) and are created using the ‘Audioscrobbler.’ This recommendation system has one decisive advantage over the process used by online merchants such as Amazon, for example: the suggestions are based not on the user’s shopping behavior, since any transaction might be a gift for another person, but on his actual listening habits (Alby, 2007). By registering and thus identifying himself, the user also counteracts the corruption of the suggestions, since other users might also listen to music on Last.fm, using the same computer. The user receives his recommendations on a profile page, called ‘The Dashboard,’ where he is not only alerted to music, but also to events, news stories and similar users and shown direct recommendations from other users.
39 http://www.podcast.de.
40 http://www.podcast.com.
41 http://www.last.fm.
Figure 1.36: User Profile with Favorite Artists and Recommendations on Last.fm.
Access to the radio stations and music titles is also possible without registration, but users must create a profile in order to take advantage of the system’s music recommendations. Registered users can also post on Last.fm’s message boards, add tags to resources, receive and send private messages, and use the Last.fm music player. Last.fm enables its registered users to tag artists, albums and songs and thus to create a musical folksonomy. After registration, the user can state some of his
favorite artists in order to provide the recommendation system with a few pointers to his taste in music. Adding personal data or a photo, as well as changing the privacy settings is also an option. Figure 1.36 displays the favorite artists as well as recommendations for artists, events and videos. Searching for tags leads the user to an internet radio that plays mostly songs tagged accordingly. Linking search terms with Boolean operators is not possible in this case. The radio station provides the user with ample information about the music (see Figure 1.37).
Figure 1.37: The Radio Station on Last.fm Displays Additional Information to the Music Titles.
Thus the user is provided current information on the artist whose song is playing, gleaned from his profile page. As soon as an artist has added their first title to Last.fm or been played by a user, Last.fm automatically creates an artist profile, which collects and makes available information on the number of listeners, call-ups, on popular songs, linked groups, on events and blogs, similar artists and frequently used tags. Official music videos or videos imported from YouTube are also on display. Using a wiki, users can add biographical details on the artist themselves, which
will then be displayed on the radio station. Homonyms are still a problem, though. Last.fm has no way of separating artists of the same name yet, and collects them on an artist page42. Furthermore, the radio station offers users similar artists (to the one being played at that moment) and also displays tags attached to the resource. These similar artists are calculated via all users’ listening habits. If many users listen to artist X and also to artists Y and Z, Y and Z will be displayed as artists similar to X.
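The co-listening idea in the last two sentences can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Last.fm's actual algorithm: the function `similar_artists`, the toy `listening` data and the simple count of shared listeners are all hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
from collections import Counter


def similar_artists(listening, artist, top_n=2):
    """Rank other artists by the number of users who listen to them AND to `artist`."""
    co_listeners = Counter()
    for artists in listening.values():
        if artist in artists:
            for other in artists:
                if other != artist:
                    co_listeners[other] += 1
    # Highest count first; alphabetical order breaks ties (an assumption).
    ranked = sorted(co_listeners.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical listening habits: user -> set of artists played.
listening = {
    "user1": {"X", "Y", "Z"},
    "user2": {"X", "Y"},
    "user3": {"X", "Z"},
    "user4": {"W"},
    "user5": {"X", "W", "Y", "Z"},
}

print(similar_artists(listening, "X"))  # ['Y', 'Z']
```

In this toy data Y and Z each share three listeners with X, while W shares only one, so Y and Z are shown as similar to X.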
Figure 1.38: Tagging Functionality on Last.fm.
Clicking on a tag leads the user to another radio station, playing music associated with that tag. Selecting an artist leads to his profile page. For every song, the user is told how many times it has been played before and how many other users are listening to it. Additionally, the user can ban the song from his playlist if he doesn’t like it, thus eliminating it from the recommendation system, directly submit it to ‘friends’ in the network, mark it as a favorite song in order to give it more weight in the recommendation system, buy the song on Apple’s iTunes43 (Wegner, 2005) and tag it. Using a tag widget (see Figure 1.38) the user can allocate as many tags as required, where multiple tags are separated by commas. Tags must be no longer than 256 characters and may contain only letters, numbers, dashes, blanks and commas. Last.fm also recommends tags for indexing, which are calculated through co-occurrence analyses. If the user has already indexed several songs or artists, his own tags will also be suggested. Any editing of the tags, such as renaming them, is not an option; the user can merely delete them. This is possible either for each individual tag or for all tags at once. In the latter case, Last.fm offers a ‘Reset’ option. After being saved, the tags are attached to the songs and appear in a tag cloud as part of the background information on the music and artists. Users are also shown the tags they have used on their profile page (see Figure 1.39), as well as in their music collection (see Figure 1.40). Here they can also choose between displaying the tags in a tag cloud or a tag list with information on their frequency of usage. Clicking on a tag leads users to all music they have indexed with that tag. Furthermore, they are told what other music has been indexed in the same way, how often the tag has been used and how many other users also chose this tag to classify the same song.
42 See http://www.lastfm.de/help/faq.
43 See http://www.apple.com/de/itunes.
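The input rules described for the tag widget (comma-separated tags, a maximum length of 256, a restricted character set) can be made concrete with a small Python sketch. It is an illustration only, not Last.fm's code: the names `parse_tag_field` and `MAX_TAG_LENGTH` are hypothetical, and treating the comma purely as a separator while allowing letters, digits, dashes and blanks inside a tag is an assumption.

```python
MAX_TAG_LENGTH = 256  # length limit stated in the text


def parse_tag_field(field_text):
    """Split a comma-separated tag field and sort the tags into accepted and rejected."""
    accepted, rejected = [], []
    for raw in field_text.split(","):
        tag = raw.strip()
        if not tag:
            continue  # ignore empty entries such as a trailing comma
        # Assumption: letters, digits, dashes and blanks are the allowed characters.
        valid_chars = all(ch.isalnum() or ch in "- " for ch in tag)
        if valid_chars and len(tag) <= MAX_TAG_LENGTH:
            accepted.append(tag)
        else:
            rejected.append(tag)
    return accepted, rejected


print(parse_tag_field("indie rock, lo-fi, best_of!"))
# (['indie rock', 'lo-fi'], ['best_of!'])
```

Because renaming is not offered, a rejected or mistyped tag in such a scheme could only be corrected by deleting it and entering a new one.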
Figure 1.39: Personal Tags in a User Profile on Last.fm.
Figure 1.40: Personal Tags in a User Profile on Last.fm: Display as Tag Cloud or List.
On Last.fm, users can browse through or perform active searches for music, events, tags, users, groups or record labels. During active search, only one Boolean AND link is allowed, which is implemented automatically. Browsing options are: Top Artists, Hot Artists, Top Songs, Hot Songs and Top Tags (see tag cloud in Figure 1.40). Tags and musical genres are also directly accessible by typing them into the browser address field: http://www.last.fm/tag/country for example shows the user all
songs, albums, radio stations, videos, events and artists that have been tagged ‘country.’ While manually entering a URL, it is also possible to enhance the search with a Boolean operator: the URL http://www.lastfm.de/tag/country+rock+folk takes into account resources with the tags ‘country,’ ‘rock’ and ‘folk.’ The display of similar users is based either on the user’s listening habits or on the songs and artists. The user is shown his ‘musical neighbors’ on his profile, where those users most similar are displayed in a larger font (Alby, 2007). All listeners are displayed on the artists’ pages, and clicking on the usernames will lead directly to their profiles. These provide information on their musical tastes, favorite songs, artists and tags used. Additionally, Last.fm calculates the degree of musical compatibility with this user and displays it on a scale. Thus the user can estimate whether it might be worth it to contact this other user on his ‘wall’ or via personal message, and to add him as a ‘friend’ in his personal network. These Last.fm friendships again influence the musical recommendation system.
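The URL pattern mentioned at the start of this paragraph, in which several tags are joined with a plus sign, can be reproduced with a short Python sketch. The helper `tag_url` is hypothetical, and percent-encoding characters such as blanks inside a single tag is an assumption; the text only documents the single-word examples.

```python
from urllib.parse import quote


def tag_url(tags, base="http://www.lastfm.de/tag/"):
    """Join several tags with '+' to form a combined tag URL."""
    # Assumption: characters that are unsafe in a URL path are percent-encoded.
    return base + "+".join(quote(tag) for tag in tags)


print(tag_url(["country", "rock", "folk"]))
# http://www.lastfm.de/tag/country+rock+folk
```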
Figure 1.41: Tag Cloud on Last.fm. Source: http://www.lastfm.de/charts/toptags.
So far only very few scientific publications have analyzed and investigated music-sharing services, even though the results would play a notable role in music information retrieval (Downie, 2003) and the determination of genres and states of mind. Abbey Thompson (2008) set herself the task of finding out how users describe their own music collection in collaborative information services and what effect this might have on the arrangement of songs and artists. To this end, she investigated the 100 most used tags on Last.fm and the tags allocated to five number-one hit singles on the Billboard Music Charts44. The most popular tag in Last.fm’s database is ‘rock45.’ The analysis shows that the 100 most-used tags serve mainly to describe the genre, while individual tags attached to songs generally express the user’s attitude towards the resource. This leads Thompson (2008) to the conclusion:
44 See http://www.billboard.com/bbcom/charts.jsp.
45 On November 13th, 2008, 198,003 users used the tag ‘rock’ a grand total of 2,118,857 times. See http://www.last.fm/tag/rock.
Most importantly, the findings also show that tagging data is more reliable in representing musical genre and subject than previously speculated, indicating that with proper analysis and coding, social data could be harvested to provide genre-level metadata for popular music titles (Thompson, 2008, 28).
Geleijnse, Schedl and Knees (2007) also investigated the tags in Last.fm’s database, but base their analysis on a quantity of tags gleaned from a random sample of 1,995 artists. The tags are thus not attached to a particular song, as above, but to an artist. The authors find that 56% of tags had only been attached to an artist once. To ascertain whether tags serve to indicate similar artists, another sample was taken to compare different artists with each other on the basis of their tags. To this end, the relative tag similarity was calculated by dividing the number of tags shared by the number of tags belonging to the artist to be compared. The result of these calculations shows that “the number of tags shared by similar artists is indeed much larger than that shared by randomly chosen artists” (Geleijnse, Schedl, & Knees, 2007, 528). Levy and Sandler (2007) confirm this. It is to be concluded that users’ tags can be deemed credible and are also able to collect similar artists and songs. This makes tags a useful tool for retrieving music resources and browsing through music databases.
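The relative tag similarity described above, shared tags divided by the tags of the artist to be compared, can be written out as a minimal Python sketch. The function name and the example tag sets are hypothetical, and reading ‘the artist to be compared’ as the second argument is an assumption; the measure is asymmetric under this reading.

```python
def relative_tag_similarity(reference_tags, compared_tags):
    """Number of shared tags divided by the number of tags of the compared artist."""
    reference_tags, compared_tags = set(reference_tags), set(compared_tags)
    if not compared_tags:
        return 0.0  # assumption: an untagged artist has similarity 0
    return len(reference_tags & compared_tags) / len(compared_tags)


# Hypothetical tag sets for three artists.
artist_a = {"rock", "indie", "british", "90s"}
artist_b = {"rock", "indie", "pop"}
artist_c = {"jazz", "piano"}

print(relative_tag_similarity(artist_a, artist_b))  # 2 shared tags out of 3
print(relative_tag_similarity(artist_a, artist_c))  # no shared tags
```

With these toy sets, the related artist scores about 0.67 and the unrelated one 0.0, which mirrors the finding that similar artists share more tags than randomly chosen ones.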
Libraries 2.0 – Museums
Libraries – and, more broadly speaking, also archives and museums – are the elder statesmen among institutions devoted to the collection of resources and, with regard to folksonomies, to the creation of metadata. For centuries, they were the classic information services supporting users in their search for relevant information. For a long time, though, they did not take the opportunity to incorporate the users in their activities, as collaborative information services do today. “The library gives, the user takes” (Figge & Kropf, 2007, 142) was their motto. In particular, control over the creation of metadata in the form of controlled vocabularies rested firmly in the libraries’ hands (Arch, 2007; Peterson, 2008; Heller, 2007b): “The idea that users would be allowed to modify the catalog is about as far from the mentality of the cataloging rules as you can possibly get” (Coyle, 2007, 290). With the great success of Web 2.0 and the growing number of user-generated resources, libraries more and more ask themselves the question: “Do libraries matter?” (Chad & Miller, 2005). Chad and Miller (2005) respond unequivocally:
Libraries provide unique value. We believe that a list of links in a search engine, while useful, does not have the same value as the knowledge that a library can provide. […] Yet, the staggering success of sites such as Amazon and Google has shown that, to meet the expectations of the modern world, libraries do have to change quite dramatically (Chad & Miller, 2005, 5).
Thus libraries have been trying to integrate elements of Web 2.0 (Miller, 2005; Miller, 2006), particularly folksonomies (Spiteri, 2006a; Spiteri, 2006b; Heller, 2007a; Heller, 2006; Arch, 2007; Cosentino, 2008; West, 2007), into their services – ultimately, to stay up to date, fulfil users’ demands and to keep and expand their user base: The catalog is still the centerpiece of libraries’ services; to adjust it to the demands created by Web 2.0 would enhance and complement the core product
and further contribute to the use of libraries as well as improving their image (Plieninger, 2008, 222f.)*.
This tagged catalog is an open catalog, a customized, user-centered catalog. It is library science at its best (Maness, 2006).
These endeavors are often collectively termed ‘Library 2.0’ (Casey & Savastinuk, 2007; Maness, 2006; Chad & Miller, 2005) even though this term is controversial (Kaden, 2008; Crawford, 2006; Casey, 2005). From this discussion, Danowski and Heller (2006) derive the following principles that constitute Library 2.0 (see also Maness, 2006):
• OPAC + Browser + Web 2.0 Properties + Openness to Third-Party Applications = OPAC 2.0.
• To let library users take part in the design and implementation of the services.
• Library users should be enabled to use the services provided and to tailor them to their specific needs.
• Openness: Library 2.0 is not a closed concept.
• Permanent improvement instead of upgrade cycles (‘perpetual beta’).
• To copy and integrate third-party ideas into the library services.
• To constantly check and improve services; to be ready to replace them with newer and better services at any time (Danowski & Heller, 2006, 1261f.).
Libraries 2.0 often follow two approaches. On the one hand, library staff are actively involved in various collaborative information services, such as social networks (Breeding, 2007) or blogs (Bar-Ilan, 2004), and act as experts in information retrieval and the selection of valuable sources (Furner, 2007). The other approach is the integration of users into the creation of metadata. Here again, libraries are seen to follow two strategies (Heuwing, 2008; Figge & Kropf, 2007; Plieninger, 2008; Peterson, 2008): a) the integration of tagging functionalities into their own platform and OPAC (short for ‘Online Public Access Catalogue’), b) the use of external internet platforms, such as LibraryThing46 or BibSonomy (Heller, 2006). Generally, the libraries’ users are asked to index all information resources in the catalog with tags. Thus resources are kept up to date and indexed according to the users’ wishes.
Spiteri (2007) recommends drawing up a set of rules for this purpose, to avoid ambiguities in the tags (such as how to form compounds, or whether to use singular or plural forms). The two strategies differ only in the location where the collaboration takes place, but each brings advantages and disadvantages for the libraries (Figge & Kropf, 2007), as displayed in Figure 1.42. Heller (2006) mentions another advantage of combining the OPAC with outside tagging systems: avoiding an ‘island effect.’ Open tagging platforms naturally attract a larger number of users, who then tag and thus provide access to a larger number of resources, while the OPAC is ‘only’ used by registered library users and maintained by library staff. Combining the OPAC with outside tagging systems provides greater added value for the users, since it achieves greater reach and attracts many bibliophiles.
* All quotations marked with an asterisk are translated from German.
46 http://www.librarything.com.
Tagging on own platform
Advantages: User data can be collected; creation of metadata profits; users are bound to the library catalog; library is ‘trendy’.
Disadvantages: Effort of creation and implementation must be borne by library; librarians feel ‘excluded’.

Tagging on outside platforms
Advantages: Effort of creation and implementation is left to others; library profits from third-party know-how; potentially greater target group through third party’s users; library is ‘trendy’.
Disadvantages: User behavior can be neither observed nor exploited; creation of metadata does not profit from users.

Figure 1.42: Advantages and Disadvantages of Tagging Functionalities on Own and Outside Platforms.
The first method, the integration of tagging functionality into the library’s own platform, is practiced, among others, by the university libraries of Pennsylvania47, Cologne48 and Hildesheim49. Ann Arbor District Library50 (short: ‘AADL’) also lets users comment on, rate and tag resources in its catalog, which is why it calls its OPAC ‘SOPAC,’ short for ‘Social OPAC’ (AADL goes Social, 2007). The university library of Pennsylvania’s PennTags (Sweda, 2006) enables users to deposit and manage all of the library’s resources via a social bookmarking tool:

PennTags is a social bookmarking tool for locating, organizing, and sharing your favorite online resources. Members of the Penn Community can collect and maintain URLs, links to journal articles, and records in Franklin, our online catalog and VCat, our online video catalog. Once these resources are compiled, you can organize them by assigning tags (free-text keywords) and/or by grouping them into projects, according to your specific preferences. PennTags can also be used collaboratively, because it acts as a repository of the varied interests and academic pursuits of the Penn community, and can help you find topics and users related to your own favorite online resources51.
The user has direct access to the resources via the tags (see Figure 1.43). Browsing through ‘projects,’ collections of posts under their own URL (such as annotated bibliographies), and users (‘owners’) is also possible (see also Figure 1.44). The tags and comments, as well as the indexing users, are tied directly to the resource, so that all users may profit from others’ metadata and an exchange can take place (see Figure 1.44). The suggestion of similar tags for a resource is also based on publicly indexed tags. Linking the search terms with an AND is possible via these similar tags, limiting the search results. Users cannot restrict the search to tags in the search field itself; a search will consider all fields by default. Linking search arguments with Boolean operators is not an option. However, users can call up an alphabetical list of all previously used tags, including their frequency of usage, and begin their search there.

47 http://tags.library.upenn.edu.
48 http://kug.ub.uni-koeln.de.
49 http://www.uni-hildesheim.de/mybib.
50 http://www.aadl.org.
51 http://tags.library.upenn.edu/help.
Figure 1.43: Tag Cloud of PennTags, University Library of Pennsylvania.
Figure 1.44: List of Search Results for the Tag ‘medieval_studies’ in PennTags.
The university library of Hildesheim’s MyBib system (Heuwing, 2008) basically utilizes the same functionalities as PennTags. However, here the user is directly shown how often each tag has been used before – the relative popularity indicated by font sizes in tag clouds is replaced by factual data (see Figure 1.45).
Figure 1.45: List of Tags in the University Library of Hildesheim’s MyBib System.
Since PennTags and MyBib both link the indexing users to the resources, they are able to exploit a typical feature of folksonomies (Spiteri, 2006a; Spiteri, 2006b): the tripartite graph (see chapter three). Sweda (2006) reports:

Users can tag items in our OPAC and the keywords will be viewable along with authorized subject headings assigned by professional catalogers. Tags and users are hyperlinked so that users can investigate other resources sharing those terms/names [...] (Sweda, 2006).
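The tripartite relation of user, tag and resource can be pictured as a set of triples. The following Python sketch is only an illustration under that assumption; it is not the data model of PennTags or MyBib, and the class name `Folksonomy` as well as the sample users, tags and resources are hypothetical.

```python
from collections import defaultdict


class Folksonomy:
    """Stores tag assignments as (user, tag, resource) triples."""

    def __init__(self):
        self.assignments = set()
        self.by_tag = defaultdict(set)       # tag -> resources
        self.by_resource = defaultdict(set)  # resource -> (user, tag) pairs

    def add(self, user, tag, resource):
        """Record that `user` indexed `resource` with `tag`."""
        self.assignments.add((user, tag, resource))
        self.by_tag[tag].add(resource)
        self.by_resource[resource].add((user, tag))

    def resources_for_tag(self, tag):
        """All resources that carry the tag, regardless of who assigned it."""
        return set(self.by_tag.get(tag, set()))

    def users_sharing_tag(self, tag):
        """All users who have used the tag at least once."""
        return {user for (user, t, _resource) in self.assignments if t == tag}


# Hypothetical sample data.
f = Folksonomy()
f.add("alice", "medieval_studies", "book-1")
f.add("bob", "medieval_studies", "book-2")
f.add("bob", "paleography", "book-1")
```

Because every assignment keeps all three components, a system built this way can move from a tag to its resources, from a resource to its taggers, and from a tagger to further resources, which is the kind of hyperlinked exploration Sweda describes.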
The university library of Cologne also offers users a tagging functionality in the library’s full catalog. Here users can attach tags to the resources, without however becoming visible themselves. While entering tags, users are suggested tags that have previously been used for the resource (see Figure 1.46). Additionally, they can determine whether the tags should be visible to the public or kept private. The whole catalog is searchable via the tags. Furthermore, users are shown co-occurring tags from BibSonomy, which they can also use to browse the university’s resources. To this end all BibSonomy resources indexed with this tag are displayed and checked for their availability in the university’s library. Users can also search for authors on Wikipedia, or subscribe to the author, the citation and the tag (see Figure 1.46) via RSS feed, in order to be automatically kept up to date on new publications of the same author or in relevant subject areas (Danowski & Heller, 2006). The resource can also be exported to BibSonomy, placed on the watch list, printed etc. Users are notified if the resource on display is one of the top 20 resources in the catalog. Furthermore, the resource page shows whether the resource is represented on Wikipedia. A recommendation system for other resources is also implemented, alerting users to potentially interesting resources. These recommendations are based on the user’s borrowing habits and online behavior52. It is not possible to limit a search to tags; only to display the most frequently used tags per catalog section. 52
http://kug.ub.uni-koeln.de → Help.
Figure 1.46: Resource Display and Tagging Functionality of the University Library of Cologne.
The second method, utilizing tagging functionalities on outside platforms, is tested for example by the university library of Heidelberg53 and the public library of Nordenham54 (tag clouds and interesting links are extracted from del.icio.us and displayed on the blog). A third-party provider would be LibraryThing55, which will be further examined below. The university library of Heidelberg enables its users to deposit and index hit lists in several social bookmarking services, such as BibSonomy or del.icio.us. The tags themselves are not visible on the library’s pages.

Independently of the methods libraries choose to incorporate elements from Web 2.0 into their program, it should have become sufficiently clear that librarians are not about to become obsolete just because users take over one part of the indexing process by using tags. There is still an enormous demand for professional know-how, which is incorporated into the catalogs by librarians, be it via the inclusion of neologisms made visible by tags into the controlled vocabulary or through a professionally completed field schema. Hänger (2008) also establishes that for the further development of metadata, e.g. via automatic information extraction, the fundamentals must be of sufficiently high quality before folksonomies can be used for this purpose. Hänger (2008) identifies the actual usefulness of tags in the prompt and timely indexing of previously unindexed library resources while allowing for a contemporary, user-friendly language. He also mentions a rating system for tags which is used for both user-generated tags and professionally indexed keywords. Users can rate the tags on a scale from one to five and thus express whether they think the tags accurately describe the book’s content or subject area. With such a points system in place, it is easily conceivable that tags and subject areas could be divided into main and auxiliary aspects. Since tags put users into direct contact with the resources, aspects might be exploited that could aid users primarily in information gathering. The university library of Karlsruhe uses the tripartite relation of tag, resource and user for a recommendation system (Mönnich & Spiering, 2007): on the basis of previously borrowed books and their tags, users are recommended further books of potential interest via BibTip56.

The limits of folksonomies are defined as the arbitrariness of tags and the resulting lack of specificity in some searches (Hänger, 2008). However, the common consensus is: tags are no replacement for controlled vocabularies in the context of libraries, but do serve their purpose as a useful supplement (Hänger, 2008). Maness (2006) emphasizes that libraries are in a transitional period that forces them to adjust but does not make them redundant. Rather, Library 2.0 should be seen as an opportunity for libraries to realize their strengths, such as the communication of information literacy:

The potential for this dramatic change is very real and immediate, a fact that places an incredible amount of importance on information literacy. In a world where no information is inherently authoritative and valid, the critical thinking skills of information literacy are paramount to all other forms of learning. […] The library’s services will change, focusing more on the facilitation of information transfer and information literacy rather than providing controlled access to it. […] While Library 2.0 is a change, it is of a nature close to the tradition and mission of libraries. It enables the access to information across society, the sharing of that information, and the utilization of it for the progress of the society. […] Library 2.0 is not about searching, but finding; not about access, but sharing (Maness, 2006).

53 http://katalog.ub.uni-heidelberg.de.
54 http://www.stadtbuecherei-nordenham.de/wordpress.
55 http://www.librarything.com/forlibraries.
LibraryThing57 is an internet platform for the public managing and saving of mostly private bibliographies and library resources (Voß, 2007). This collaborative information service was created by Tim Spalding in 2005. As of now, LibraryThing counts 553,427 registered users, 33,410,876 cataloged books and 43,604,686 indexed tags58. Registered users can use their free account to catalog up to 200 books, keep them, as a whole, private or open them up to public access and so communicate with other users who read the same books or share the same taste. While cataloguing books, or uploading books to one’s profile, respectively, one of LibraryThing’s functionalities turns out to be particularly useful: users can choose which sources they want to draw all available book information from in order to upload it to their profile (see Figure 1.47).
56 http://www.bibtip.org/.
57 http://www.librarything.com.
58 All information was taken from http://www.librarything.de/zeitgeist on November 21st, 2008.
Figure 1.47: Adding Books to Personal Profile on LibraryThing.
Amazon.com, the Library of Congress and about 690 other libraries, but also LibraryThing itself, may be selected as data sources. The bibliographical metadata are automatically added to the user’s profile after saving. Since LibraryThing checks more than one source while searching for a title, a specific syntax must be adhered to during the search. Single search terms must be separated by commas, since LibraryThing interprets search requests of the form ‘search term 1 [blank] search term 2’ as phrases, which must occur in exactly that way in a bibliographical field. As seen in Figure 1.48, LibraryThing asks users to check their syntax if no search results have been obtained.
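The effect of this syntax can be sketched in a few lines of Python. This is a simplified illustration, not LibraryThing’s implementation: the function names and the sample record are hypothetical, and the assumption that every comma-separated phrase must match some field is made here only for the sake of the example.

```python
def split_source_query(query):
    """Split a query at commas; blanks stay inside a phrase."""
    return [" ".join(part.split()) for part in query.split(",") if part.strip()]


def matches(record_fields, query):
    """True if every phrase occurs verbatim (ignoring case) in some field."""
    phrases = [phrase.lower() for phrase in split_source_query(query)]
    fields = [field.lower() for field in record_fields]
    return all(any(phrase in field for field in fields) for phrase in phrases)


# Hypothetical bibliographical record: a title field and an author field.
record = ["The Name of the Rose", "Umberto Eco"]
with_commas = matches(record, "rose, eco")   # two separate terms
without_commas = matches(record, "rose eco")  # read as one phrase
```

With the comma, the two terms are looked up independently and the record is found; without it, the phrase ‘rose eco’ occurs in no single field and the search comes back empty, which is the situation in which LibraryThing asks users to check their syntax.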
Figure 1.48: Searching for Books in Various Sources on LibraryThing Requires a Specific Syntax.
After the user has found the book and added it to his library, he can do several things to better manage the book information (see Figure 1.49). First, he sees how many other users of LibraryThing have deposited the book in their library; clicking on that number will lead the user to all these other members of the community. Then he is told how many and which reviews are dedicated to the book in question, what its average rating is, and how many discussions have been started on its subject. Furthermore, he has access to both the ratings and the discussions. He is also provided with further links to online mail-order services, filesharing sites or libraries, where he can acquire the book. Selecting from several different available covers is also a possibility. As for editing the book details, the user has many options. He can
• edit the title and author’s name,
• add tags to make the book retrievable in his personal library (maximum of 30 characters per tag, no commas),
• mark whether the book is currently being read or still on the ‘to-read’ list,
• rate the book (0-4 stars),
• write a review,
• add further individuals involved with the book and determine their exact role (e.g. translator, publisher etc.),
• add or change bibliographical data such as publishing house, location etc.,
• edit the year of publication,
• add the ISBN,
• edit the Library of Congress’ classification,
• add a Dewey Decimal notation,
• pinpoint the book’s subject areas,
• state the language it is written in, and its original language (for translations),
• add public or private comments,
• note how many copies of the book he owns,
• state the book’s Bookcrossing59 ID-number (where applicable),
• state the date of purchase,
• state the respective dates he started and finished reading the book.
The user is also shown the date he added the book to his library and which source he used. All book information, but particularly his own tags, is displayed in his personal library (see Figure 1.50). The tags are also attached directly to the books, which means that all users’ tags for this book are collected and displayed in the book description. Book information can also be edited in the personal library, partly even in the displayed fields: for example, ratings can be selected via double-click and then changed. The user can also search within single fields or in the entire database. Worth mentioning at this point is the option of multiple-editing books. Using this functionality, the user can add tags to or remove them from all or several books at the same time, delete books or change the language. The search for duplicates (using the ISBN) is also supported.
59 http://bookcrossing.com is an online book club.
Figure 1.49: Editing Functions for Book Information on LibraryThing.
Figure 1.50: Display of Personal Library with Editing Options on LibraryThing.
LibraryThing offers powerful functions for the retrieval of books and book information. Searches can be limited to several fields, via a user-guided search on the one hand (see Figure 1.51) and via the search syntax on the other. The user-guided search enables the user to search within his personal library; he can also, however, search through LibraryThing’s database for works60 (via title, author or ISBN), authors, tags, board messages, groups, members and locations. Additionally, he can search for members in other social and filesharing networks. Search terms can be linked via several operators, the default being AND. An OR link is not available; a NOT link between two search terms is signified by a minus sign in front of the term that is to be excluded. Phrase searches are activated by quotation marks.
Figure 1.51: Search Options and Tips on LibraryThing.
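The operators just described (implicit AND, a leading minus sign for NOT, quotation marks for phrases) can be illustrated with a small parser. The sketch below is written in Python under simplifying assumptions: the function names and sample titles are hypothetical, and plain substring matching stands in for whatever matching LibraryThing actually performs.

```python
import re

# One token: optional minus sign, then either a quoted phrase or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Return (required, excluded) lists of lower-cased terms or phrases."""
    required, excluded = [], []
    for minus, phrase, word in TOKEN.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # skip an empty pair of quotation marks
        (excluded if minus else required).append(term)
    return required, excluded


def search(texts, query):
    """Keep texts containing every required term and no excluded term."""
    required, excluded = parse_query(query)
    hits = []
    for text in texts:
        lowered = text.lower()
        has_all = all(term in lowered for term in required)
        has_none = not any(term in lowered for term in excluded)
        if has_all and has_none:
            hits.append(text)
    return hits


# Hypothetical titles.
titles = ["The Ancient Economy", "Ancient Rome", "Economy of Ancient Rome"]
```

Under these assumptions, `search(titles, 'ancient economy')` returns both titles that contain the two words in any order, while `search(titles, '"ancient economy"')` returns only the title containing the exact phrase. Since no OR operator exists, alternatives have to be covered by separate searches.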
At this stage of the search, the separation of search terms with commas is not necessary. Usable truncation symbols are asterisks (replace anywhere between 0 and an infinite number of symbols) and question marks (replace exactly one symbol) for right truncation. Linking the single search fields is not possible in user-guided searches, but the search syntax allows several different fields to be searched at the same time. The following restrictions61 are at the user’s disposal:
• all: title, author, tags, ISBN, date, lcc, publication, dewey, source, subjects, comments, review, other authors, private comment
• most: title, author, tags, ISBN, date, lcc, publication, dewey, source, comments, other authors, private comment
• titleandauthor: title, author, other authors
• title: title
• author: author, other authors
• tag: or tags: tags
• isbn: ISBN
• date: date
• lccn: LCCN
• dewey: dewey
• source: book data source (e.g. Amazon)
• subject: subjects
• comment: or comments: comments
• review: or reviews: review
• titleandauthorandsubjects: title, author, subjects, other authors.

60 A ‘work’ is defined as follows by LibraryThing: “The purpose of works is social. Books that a library catalog considers distinct can nevertheless be a single LibraryThing “work.” A work brings together all different copies of a book, regardless of edition, title variation, or language. This works system will provide improved shared cataloging, recommendations and more. For example, if you wanted to discuss M. I. Findley's The Ancient Economy, you wouldn't really care whether someone else had the US or the British edition, the first edition or the second” (http://www.librarything.com/concepts). This definition is not at odds with libraries’ term definitions, which deem a creative act on the part of an author or an artist a ‘work,’ regardless of any ‘expressions’ (e.g. translations), ‘manifestations’ (e.g. different printings) or ‘items’ (concrete single copies) (IFLA, 1998).
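How the truncation symbols and the field restrictions listed above might interact can be shown with Python’s standard `fnmatch` module, whose `*` and `?` wildcards happen to behave like the asterisk and question mark described here. This is a sketch under assumptions, not LibraryThing’s code: the prefix table copies only a subset of the list, the word-by-word matching is a simplification, and the function names and the sample record are hypothetical.

```python
from fnmatch import fnmatchcase

# Prefix -> record fields searched (a subset of the restriction list).
FIELD_PREFIXES = {
    "titleandauthor": ["title", "author", "other authors"],
    "title": ["title"],
    "author": ["author", "other authors"],
    "tag": ["tags"],
    "tags": ["tags"],
}


def parse_restriction(term):
    """Split 'prefix: pattern' into (fields, pattern); no known prefix -> (None, term)."""
    prefix, separator, rest = term.partition(":")
    key = prefix.strip().lower()
    if separator and key in FIELD_PREFIXES:
        return FIELD_PREFIXES[key], rest.strip()
    return None, term.strip()


def field_matches(record, term):
    """True if a word in one of the selected fields matches the pattern."""
    fields, pattern = parse_restriction(term)
    if fields is None:
        fields = list(record)  # no restriction: look at every field
    pattern = pattern.lower()
    for name in fields:
        value = record.get(name, "")
        values = value if isinstance(value, list) else [value]
        for entry in values:
            if any(fnmatchcase(word, pattern) for word in entry.lower().split()):
                return True
    return False


# Hypothetical record.
book = {
    "title": "The Name of the Rose",
    "author": "Umberto Eco",
    "tags": ["fiction", "medieval"],
}
```

Here `title: ros*` finds the book through the word ‘Rose’, whereas `author: ros*` does not, because the restriction limits the comparison to the author fields. One caveat of this shortcut: `fnmatch` also gives square brackets a special meaning, which the search syntax described in the text does not mention.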
LibraryThing reveals a particularity after a search for tags has been performed. The search results include all books that have been indexed with the tag in question, and the user is also told how often and by how many users the tag has been used for indexing purposes on the platform. At the same time, he is recommended further tags to search for. The user can change his search to related tags (the search is then repeated for the new search term), or restrict it to related subject areas. Furthermore, he can use tag combinations, or ‘tagmashes’ (Spalding, 2007b; Smith, 2008a), to continue his search. A tagmash is a user-created combination of tags that links several tags with the Boolean operators AND and NOT and searches through the database for resources that meet these demands (Smith, 2008a, 16). Instead of entering these combinations time and again for every single search, the user can treat the tagmashes as tags, and use them to index books.
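A tagmash, as described here, amounts to set arithmetic over the resources attached to each tag: intersection for AND, difference for NOT. The following Python sketch illustrates the idea; the function name, the tag names and the book identifiers are hypothetical, and nothing is claimed about how LibraryThing stores or evaluates tagmashes internally.

```python
def tagmash(index, include, exclude=()):
    """Resources carrying every tag in `include` and none of the tags in `exclude`."""
    if not include:
        return set()
    # AND: intersect the resource sets of all included tags (returns a new set).
    result = set.intersection(*(index.get(tag, set()) for tag in include))
    # NOT: remove everything that carries an excluded tag.
    for tag in exclude:
        result -= index.get(tag, set())
    return result


# Hypothetical tag index: tag -> set of book identifiers.
tag_index = {
    "france": {"b1", "b2", "b3"},
    "wwii": {"b2", "b3", "b4"},
    "fiction": {"b3"},
}
non_fiction_france_wwii = tagmash(tag_index, ["france", "wwii"], ["fiction"])
```

Storing the result under a name of its own would correspond to treating the tagmash like a tag, so that the combination does not have to be re-entered for every search.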
Figure 1.52: Summary of Synonyms in a Tag on LibraryThing.
In the related-tags area, the user can also combine tags: should the user notice that several tags are synonymous with each other, he can save them as one single tag62. The most popular tag is then noted as the preferred tag. Figure 1.52 shows the effects of these endeavors.

There are more options, however, than simply searching directly for books and tags. LibraryThing offers its users a ‘zeitgeist’ functionality for browsing (see Figure 1.53). Here several tag clouds are displayed that serve users as starting points for browsing. Thus LibraryThing can be explored via
• the 50 largest libraries of LibraryThing users,
• the 25 most-discussed books,
• authors that use LibraryThing,
• the top books,
• the top 75 authors,
• the 50 most active reviewers,
• the 50 highest-rated books,
• the top 75 tags,
• the top 50 taggers,
• the 50 highest-rated authors,
• the top 50 longest tags (tags with 20 or more characters),
• the 50 lowest-rated authors,
• the top 25 languages,
• the most recently registered users and
• the 50 ‘completest’ (“Average number of different books held by people who have any books by the author”) authors.
The advantage here is that the tag clouds are arranged by the popularity of the single elements, and that the user is told the actual numbers that make up the rankings.

61 See http://www.librarything.de/wiki/index.php/Search.
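As for the combining of synonymous tags mentioned above, the rule that the most popular variant becomes the preferred tag can be written down compactly. The Python sketch below is an illustration only; the function name and the usage figures are invented for the example, and the alphabetical tie-break is an assumption not taken from the source.

```python
from collections import Counter


def combine_tags(usage, synonyms):
    """Merge synonymous tags; the most-used variant becomes the preferred tag."""
    counts = Counter({tag: usage.get(tag, 0) for tag in synonyms})
    # Ties are broken alphabetically (an assumption made for this sketch).
    preferred = max(sorted(counts), key=lambda tag: counts[tag])
    return preferred, sum(counts.values())


# Hypothetical usage counts.
tag_usage = {"sci-fi": 120, "science fiction": 300, "sf": 80}
preferred_tag, combined_count = combine_tags(tag_usage, ["sci-fi", "science fiction", "sf"])
```

After combining, a search for any of the variants can be answered with the resources of all of them, while only the preferred form is displayed.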
Figure 1.53: A Part of LibraryThing’s ‘Zeitgeist’ Functionality.
Apart from searches and tag clouds, users can also call up book information on their profile (see Figure 1.54), since LibraryThing displays users that have saved the same books in their profile. If a user has saved a sufficient number of books (10-20) in his profile, LibraryThing will also recommend books to him. These recommendations are based on co-occurrence analyses of books saved together in various user libraries. By clicking on the ‘clouds’ button the user is shown several tag clouds which also serve to facilitate his search for interesting titles. On the one hand, he can consult his own tags and saved authors, and on the other hand he has access to a ‘tag mirror.’ This mirror displays those tags used by other users to index their personal libraries. To encourage the exchange between users, LibraryThing has created another interesting function, ‘memes.’ Here the user is told which books he has saved twice and which books only he and one other member of the community have saved to their profile.

But not only bibliophiles use LibraryThing as a platform for communicating and information gathering. Since early 2007 (Voß, 2007), libraries have had access to LibraryThing’s various services via “LibraryThing for Libraries”63, and have been able to integrate these into their OPAC (Westcott, Chappell, & Lebel, 2009). Thus tags and reviews for single books created on LibraryThing can be displayed in the OPAC, and the OPAC’s users can themselves write reviews for or rate the books. Library users can also profit from book recommendations on the basis of user habits on LibraryThing.

62 See http://www.librarything.de/wiki/index.php/Tag_combining.
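The co-occurrence-based recommendations mentioned above can be approximated as follows. This Python sketch is a deliberately simple stand-in and makes no claim about LibraryThing’s actual algorithm: a candidate book is scored by how many of the user’s own books it shares a library with, summed over all other libraries. The function name and the sample libraries are hypothetical.

```python
from collections import Counter


def recommend(libraries, user, top_n=3):
    """Rank unseen books by co-occurrence with the user's books in other libraries."""
    own = libraries[user]
    scores = Counter()
    for other, books in libraries.items():
        if other == user:
            continue
        overlap = len(own & books)  # how many of the user's books this library shares
        if overlap == 0:
            continue
        for book in books - own:
            scores[book] += overlap
    # Highest score first; equal scores in alphabetical order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [book for book, _score in ranked[:top_n]]


# Hypothetical user libraries.
user_libraries = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A", "D"},
    "u4": {"E"},
}
suggestions = recommend(user_libraries, "u1")
```

The sketch also hints at why a minimum number of saved books is needed: with very few books in a profile, the overlap with other libraries is too small to rank candidates meaningfully.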
Figure 1.54: Display of a User Profile on LibraryThing.
Museums not only create textual resources but are particularly dependent on the indexing of non-textual objects via metadata. Trant (2006b) describes the problems of indexing museum objects as follows:

The content of art museum collections is visual, but we work with the ideas represented in them in a primarily textual mode. This produces the major paradox in the documentation and retrieval of art museum collections. What is searched is not the work of art, or even a reproduction (however faithful) of the work of art, but a textual representation of those characteristics of the work of art that were seen as salient by its custodian and/or descriptor. […] what we are able to search in digitized art museum collections is a limited representation of their content, transformed into another media. What is retrieved is not the original or even a facsimile […]: what is retrieved is a surrogate (Trant, 2006b).

63 See http://www.librarything.com/forlibraries.
A whole new area of application thus opens up for folksonomies. According to Kellog Smith (2006), Trant and Wyman (2006), Bearman and Trant (2005) and Trant (2006a; 2006b; 2009), many prominent museums, such as the Metropolitan Museum of Art, the Guggenheim Museum and the Cleveland Museum of Art (Chun et al., 2006), already allow their users to index works of art and exhibits with tags64 in the context of the Steve Project65. The reasons for the implementation of folksonomies into the museum catalog are mainly derived from folksonomies’ general advantages:

The motivation for having online viewers - presumably largely non-art experts - describe art images are a) to generate keywords for image and object records in museum information retrieval system in a cost-effective way and b) to engage online visitors with the artworks and with each other by inviting visitors to express themselves and share their descriptions of artwork. […] The assumption is that pictorial and emotional subject description, particularly phrased in non-specialist terminology, is especially useful and appealing for generalist system user to access artwork images and information (Kellog Smith, 2006).

Trant (2006b) goes further in analyzing user behavior during tagging and argues that “[s]ocial tagging offers a direct way for museums to learn what museum-goers see in works of art, what they judge as significant and where they find or make meaning.” Kellog Smith (2006) reports that visitors have responded favorably to the tagging functionality and that they use it actively. However, since the tags’ quality is very uneven, she recommends that museums educate their users to become better taggers by providing or recommending their own specific tags. This would enhance users’ understanding of the works of art and communicate additional knowledge (Kellog Smith, 2006). Trant (2006b) summarizes the advantages of tagging systems in museums as follows: “Social tagging offers new way for museums to engage user communities and, through the resulting folksonomy, to assist them in their use of collections”.
Photosharing Services

Photosharing services offer a platform to store mainly photographs, but possibly also short videos (as on Flickr66), on the World Wide Web in a decentralized way, manage them and make them accessible to other users in manifold ways. The rapid rise in sales of digital cameras and the plummeting prices for storage media make it easier and easier for users to produce great quantities of photographs; but large amounts of data in turn require elaborate structuring methods and facilities. Photosharing services assume this task, and what’s more, they attach the community aspect to it. The photos are supposed to facilitate the exchange and communication between the platform’s users, and above all provide added value for the individual user. The resource ‘photo’ is particularly enticing in this respect: photographs provide glimpses of other lives, show other points of view, “say more than a thousand words” and are personal, even if no person should be visible on them. At the same time though, these aspects contain the potential for conflict with regard to user privacy (Kramer-Duffield & Hank, 2008; Palen & Dourish, 2003). “Sharing photographs is a locus of engagement, a moment through which common ground is built and shared meanings established. Because they are polysemic, images draw out differences of interpretations” (Cox, Clough, & Marlow, 2008, 4). These properties also lead users to upload photos to the platform and to search for photos. Users can then contact each other using various commentary functions and discuss the photographs. There are also diverse systems of notification that keep users up to date on other users’ commenting and publishing activities. Thus photosharing services combine two types of information system: Pull and Push services (see chapter four). Accordingly, Cox, Clough and Marlow (2008) establish that photosharing services are situated “somewhere between an information system [...] and a mass medium” (Cox, Clough, & Marlow, 2008, 6).

I will introduce Flickr as a prototypical example of a photosharing service and its functionalities, since it is the most popular service of its kind and offers multiple tagging functionalities. Flickr: “Share your photos. Watch the world.”67 Flickr was founded in 2004 by Stewart Butterfield and Caterina Fake and was at that time still conceived as an online game called ‘Game Neverending,’ where users could snap pictures on their cameras or phones and then upload them to the platform (Graham, 2006; Hornig, 2006; Wegner, 2005).

64 For an example, see the section below called ‘Games with a Purpose.’
65 The Steve Project is a collaboration of several museums that aims to increase access to museum catalogs through folksonomies (see http://www.steve.museum).
66 http://www.flickr.com.
The photo platform developed so rapidly that the gaming functionality was soon abolished so that Flickr could concentrate exclusively on the exchange and publishing of photographs. In November, 2007, Oates (2007) reported the two billionth picture uploaded to Flickr; van Zwol (2007) mentions 8.5m users and a record upload rate of 2m photos in one day. In 2005, Flickr was bought by Yahoo!, which means that new users can only register on Flickr via a Yahoo! user account (Campbell, 2007). Using Flickr is free with a simple account; Flickr can turn a profit, however, by selling storage space in ‘Pro-Accounts,’ via Google AdSense, or by printing and processing photos into calendars and postcards (Hornig, 2006). Flickr’s success is marked by the fact that users can collectively manage the pictures and index them with any tag they choose (Wegner, 2005; Campbell, 2007; Lerman, Plangprasopchok, & Wong, 2008; Marlow et al., 2006a; Marlow et al., 2006b; Cox, Clough, & Marlow, 2008; Sturtz, 2004; Winget, 2006). Van House (2007) pinpoints the medium photo as Flickr’s factor of success, compared to blogs, for example: “Flickr is mostly about images, not images as adjuncts to text” (van House, 2007, 2719). Thus Flickr is on its way to becoming a “collective photo agency” (Wegner, 2005, 94), which is the reason Butterfield describes Flickr as “the eyes of the world” (Hornig, 2006, 65).

After registering on Yahoo!, the user can begin archiving and publishing his photos. Before that, he can personalize his profile with a photo, generate his own personal URL and add further personal information to his profile. He can select several photos and upload them to Flickr at the same time. This is done either via Flickr’s homepage or through camera phones or any other mobile terminal devices with internet access. The user can also choose different privacy settings for his photos. These settings concern the photographs’ visibility, the posting of comments and the adding of tags and notes. The user can choose from the following options:
• private (only the user himself can view, tag and comment on the photos),
• friends (only the user’s confirmed contacts, marked as ‘friends,’ can view, tag and comment on the photos),
• family (only the user’s confirmed contacts, marked as ‘family,’ can view, tag and comment on the photos),
• public (all of Flickr’s users can view, tag and comment on the photos) (see Figure 1.X).
The default setting is ‘public.’ After an upload, Flickr enables the user to name his photographs: he can either name them individually as well as add tags and descriptions, or use a ‘mass editing’ function to tag all photos at the same time. The pictures can also be allocated to one or more photo albums (see Figure 1.55).

67 See http://www.flickr.com.
Figure 1.55: Editing Options after Photo Upload on Flickr.
Figure 1.56: Photo Editor ‘Picnik’ on Flickr.
Clicking on a picture opens the photo editor ‘Picnik,’ where the photos can be edited (add note, add comment, send to a group, add to an album, add to a blog, display different picture sizes, order prints, rotate the photo, delete) (see Figure 1.56). Here the user can also delete or add tags as well as view and edit content-describing tags and so-called geotags (“add to your map”), as well as the camera’s automatically adopted EXIF data68 (here: “Casio EX-Z10.” and “October 31, 2007”). The photo’s privacy settings and access statistics can also be checked. The function ‘add notes’ allows the user to add detailed information to his photos by attaching these notes directly to the photo itself (see “Congratulations Red Sox!”). This additional information is visible to all users.

Figure 1.57 shows a user’s photostream, which is displayed after uploading and describing the photos. Here the user is provided with some information about his photos as well as several editing options. For example, he is told how often his photostream has been viewed and how many photos it contains. Additionally, he is shown his uploaded photos, including description, title and date of upload, as well as the number of comments left by other users, the photos’ privacy settings, and under what license the photos have been published. The following traditional Creative Commons licenses are at the user’s disposal:
• None (all rights reserved),
• Attribution License,
• Attribution-NoDerivs License,
• Attribution-NonCommercial-NoDerivs License,
• Attribution-NonCommercial License,
• Attribution-NonCommercial-ShareAlike License,
• Attribution-ShareAlike License69.

68 Digital cameras save JPEG files in the EXIF format (“Exchangeable Image File”). Camera settings and information on the conditions under which the picture was taken are saved in the image file by the camera. This includes information on shutter speed, time and date, focal length, exposure compensation, flash measurement and whether a flash was used (see also http://www.digicamhelp.com/what-is-exif).
Figure 1.57: A User’s Photostream on Flickr Displays Various Information and Editing Options.
Furthermore, the user can edit his photos, which includes deleting them, changing the privacy or creative commons settings, adding descriptions to the photos, searching through his photostream and the display and publishing of his photos as a slide show. This means that the user can make his photostream accessible to other users by either entering the username or an e-mail address or by copying the stream’s URL and sending it to the individual in question. Using the buttons directly below ‘Your Photostream,’ the user can access his photo albums, his tags (displayable as a tag cloud or as a tag list with usage information), his archive (with information on upload data and photos’ creation dates), his favorites (photos by other users marked as favorites), his most popular pictures (displayable according to interestingness (see Figure 1.58), number of comments, number of favorites and number of visits) and his profile (with access to his photostreams, contacts and recommendations).
69 See http://www.flickr.com/creativecommons/.
Figure 1.58: Flickr Displays the Most Popular Photos – Here According to Interestingness.
Figure 1.59: Flickr Displays Tags as Tag Cloud and as Tag List.
Particularly interesting for the purpose of this book is the tag editing function (see Figure 1.59). As shown in Figure 1.59, Flickr automatically normalizes the spelling of each tag for representation in the system – the user himself is not shown these changes while tagging his own photos, for example. The capitalization of initials is reversed, special characters such as the German letter ‘ß’ are simplified to their nearest equivalent (in this case ‘s’) and compounds containing quotation marks, dashes or underscores are now written together. If a tag has deliberately been constructed out of two tags, these components must be bookended by quotation marks (e.g. “süßer Hund” = “cute dog”) in order to be viewable and searchable by the community. Users can edit and allocate tags to photos using this editing function. At the
moment, a maximum of 75 tags may be added to each photo by its owner or, according to their privacy setting, by friends/family/the community. Photo searches on Flickr are dependent on the user’s status: registered users are presented a different search interface than anonymous guests. The former can search for uploads by every member of the community, as well as for those of their contacts and their friends, they can search through their own photostreams and groups, they can search other Flickr users and look for photos based on location (see Figure 1.60); the latter cannot search for uploads by contacts and friends, or in their own photostream (see Figure 1.61). Instead, anonymous users can choose between • a simple search, which they can limit to photos, persons and groups, as well as to full text (that is photo titles, descriptions, comments, notes and tags) and tags, and which enables them to link search terms with the Boolean operators AND (preset in the search field), OR (type OR into the Flickr search field), NOT (preclude a search term by putting a minus sign in front of it) and phrase (phrase components are linked by quotation marks), • an advanced search, which they can limit to content types (photos/videos, screencasts/screenshots, illustration/art/animation/CGI), media types (photos & videos, only photos, only videos), creation or upload date, creative commons licenses (content with creative commons license, content with permission to exploit commercially, content with permission to edit, change and adjust) as well as full text and tags, and which enables them to link search terms with AND, OR, NOT and phrase, • and a camera search, which they can limit to the type of camera used.
Figure 1.60: Search Options for Registered Users on Flickr.
Figure 1.61: Search Options for Anonymous Users on Flickr.
A search for tags reveals the intrasystem standardization of the tags. As shown in Figure 1.62, a search for the tags “i-know,” “i_know,” “iknow” and “i know” leads to the same number of results, since these tags are all represented as “iknow” in Flickr’s system. Searching for “i” and “know,” however, leads to another result, since this search request is processed via an AND link of the two terms.
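The normalization described here can be illustrated with a short sketch. It is an approximation built only from the rules named in the text (lower-casing, ‘ß’ to ‘s’, removal of quotation marks, dashes, underscores and blanks); Flickr's actual routine is not documented in this chapter, and the function name `normalize_tag` is hypothetical.

```python
# Characters dropped during normalization, per the description in the text.
# Assumption: blanks inside a quoted multi-word tag are dropped as well,
# since "i know" and "iknow" are reported to yield the same results.
REMOVED = set(' "“”-_')


def normalize_tag(raw):
    """Return the system-internal form of a tag as the text describes it."""
    tag = raw.lower().replace("ß", "s")
    return "".join(ch for ch in tag if ch not in REMOVED)


variants = ["i-know", "i_know", "iknow", '"i know"']
print({normalize_tag(v) for v in variants})
```

All four spellings collapse to the single internal form, which is why they return the same number of hits, whereas the two separate terms “i” and “know” remain two tags that are combined with AND.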
Figure 1.62: Processing of Search Requests on Flickr.
The user can arrange the search results according to relevance, recency and interestingness70 after a full-text search, and only according to recency and interestingness after a tag search; likewise, he can choose between a thumbnail display and a display of the photos with detailed information. Clicking on a photo reveals the complete title and description. Furthermore, the user is told which album the photograph belongs to, who created it, which tags, geotags and EXIF data have been attached to it and how often it was viewed, saved as a favorite and commented on. The comments are on display as well. Usernames, album descriptions and tags are links that lead the user, with one click, to the photographer’s profile (or, respectively, to the photostream with his most recently uploaded photos, the user profile with contacts and groups, or the photos that user has marked as favorites), to the complete album, or to the other photos indexed with that tag. The user can also subscribe to albums, photostreams, favorites, group pools, group discussions, recent activities, the Flickr blog and tags via RSS or Atom feeds. This means that he no longer has to actively search for this information, but automatically receives these updates via his feed reader after creating a search profile or subscribing to a feed, respectively – without having to visit Flickr.com. Users can also browse via the ‘discover’ option (see Figure 1.63).
70 “Interestingness” is defined by Flickr as follows: “There are lots of elements that make something ‘interesting’ (or not) on Flickr. Where the clickthroughs are coming from; who comments on it and when; who marks it as a favorite; its tags and many more things which are constantly changing. Interestingness changes over time, as more and more fantastic content and stories are added to Flickr.” (http://www.flickr.com/explore/interesting). Dennis (2006) deems the interestingness function the most valuable access path to the photos: “The presentation of these interesting photos can be considered an easily navigable, attractive, visual gateway into the massive Flickr photographic community” (Dennis, 2006). The ranking based on interestingness is described and analyzed in detail in the chapter on information retrieval.
Figure 1.63: Browsing Through Flickr’s Photo Database via the ‘Discover’ Option.
Here the focus is not on a target-specific search, but on the random finding, or discovering, of photos. The user can choose between several different display options: • items of interest from the last seven days or the last few months (from August, 2004), • popular tags (popular tags of the past 24 hours, popular tags of the last week, most popular tags of all time), • calendar (display of the most interesting photos of one month as a calendar), • recently uploaded content, • video on Flickr, • discover analog photography (digitized analog photos), • world map (full-text and photo search for content created in one particular location, displayed on a map), • places (locations with many photos and geotags), • a year ago today (display of the most interesting photos from one year ago), • albums, • groups (photo-based collections of topics with discussion pages: “They are primarily buckets for collecting related types of photos rather than social groups” (Cox, Clough, & Marlow, 2008, 5)). The display option ‘world map’ is of particular interest for browsing purposes. Here the photos’ geotags (Amitay et al., 2004; Naaman, 2006; Torniai, Battle, & Cayzer, 2007; Jaffe et al., 2006) are used to arrange and display them on a map. The user can explore this map via either the tags (see Figure 1.64 at the head of the picture) or the photo clusters (bottom of Figure 1.64): clicking on a tag or a cluster opens a photo menu containing all pictures that possess the same tags or geotags. It is also possible to search the map by entering towns, regions or sights (see Figure 1.65).
Figure 1.64: Exploring the World Map via Tags (Above) and Photo Clusters (Below) on Flickr.
Figure 1.65: Searching the World Map via Entering Towns/Regions/Sights on Flickr.
By now, Flickr has become a popular subject of research activity. Thus Campbell (2007) suggests Flickr, amongst other examples, as a didactic tool for language classes (see also Godwin-Jones, 2006). He places particular emphasis on the use of tags: “Instead of having students break out their dictionaries, the teacher can do a tag search to show a photo of the vocabulary word in question” (Campbell, 2007). Van Zwol (2007) analyzes Flickr’s logfile data and concludes that users’ clicking habits follow a power law: 7% of photographs make up for 50% of page views. Furthermore, van Zwol (2007) presents hard numbers on the photos’ visibility: “we see that 65% of the most popular images get discovered within 3 hours after being uploaded and that within 48 hours these images get almost 45% of the views they will generate during the 50 day window” (van Zwol, 2007, 189; see also Lerman & Jones, 2006). According to the author, the reason for this is the strong link between Flickr users via contacts and photo pools. Lerman and Jones (2006) confirm this thesis: “for the images [...] the views and favorites they receive correlate most strongly with the number of reverse contacts the photographer has“ (Lerman & Jones, 2006). Popular photos are also more heavily requested from all over the world, whereas less popular photos’ access is generally restricted to a smaller geographical area. Van House (2007) surveys several users on their Flickr habits. She initially finds out that photos are mainly used for four purposes: 1) Memory, narrative, and identity: Photos record memories and describe the different phases of a life – they tell a person’s story. 2) Relationships: Photos create and strengthen interpersonal relationships. 3) Self-representation: Photos reflect a person the way they want themselves to be seen. 4) Self-expression: Photos allow the photographers to realize their potential and express themselves (van House, 2007, 2718). 
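The concentration of page views that van Zwol (2007) reports above (7% of photographs accounting for 50% of views) is a statistic that can be computed for any list of view counts. The following sketch shows one way to do so; the function name and the sample numbers are invented for illustration and are not van Zwol's data.

```python
def share_of_items_for_views(view_counts, view_share=0.5):
    """Smallest fraction of items whose views add up to view_share of all views."""
    ranked = sorted(view_counts, reverse=True)  # most viewed first
    total = sum(ranked)
    if total == 0:
        return 0.0
    running = 0
    for n, views in enumerate(ranked, start=1):
        running += views
        if running >= view_share * total:
            return n / len(ranked)
    return 1.0


# Hypothetical view counts for ten photos (100 views in total).
sample = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
print(share_of_items_for_views(sample, 0.5))
```

In this invented sample a single photo, that is 10% of the collection, already accounts for half of all views; a heavily skewed distribution of this kind is what the power-law finding describes.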
Flickr mainly serves its members to communicate with other users and acquaintances rather than as a photo archive. This is the reason photographs are frequently tagged not for the creator’s benefit but in order to steer the platform’s other users
towards these photos. In this way the users of one’s own community, friends and acquaintances, can be kept up to date on the events in one’s life: Many respondents log on daily or several times a day to view their contacts’ newest images. The Flickr image stream is often seen as a substitute for more direct forms of interaction like email. Several have given up on blogging because it is too much work, and upload images to show friends what they are up to (van House, 2007, 2720).
Lerman and Jones (2006) also view the social networking aspect on Flickr as the formula for the platform’s success. A user accesses other users’ photos via his contacts – it is no longer active searches via search fields, but the discovery of interesting photographs by browsing that determines user behavior: “every username, every group name, every descriptive tag is a hyperlink that can be used to navigate the site” (Lerman & Jones, 2006). Here the authors describe tags as being of little use for photo searches: users are generally alerted to new pictures via their contacts or Flickr’s ‘discover’ function. Cox, Clough and Marlow (2008) analyze user behavior with regard to social networking and interview 100 Flickr users (50 users are registered members, 50 aren’t) by telephone. They find out that the average user is 27 years old and reveals little personal information on his profile. However, it is still the platform’s networking character that most attracts users: “Most interviewees also logged on frequently, indeed some were continuously logging on during the day because they wanted to look at new activity” (Cox, Clough, & Marlow, 2008, 10). This is also evident for photographs. Users aren’t concerned with arranging, editing or archiving their pictures, but with the effect these pictures have on other users: “Interviewees managed the bulk of their photos on their own computer. Only one interviewee used it as a store for photos. The collection on Flickr was usually a selection of the best or most appropriate to be shared” (Cox, Clough, & Marlow, 2008, 10). Plangprasopchok & Lerman (2008) mention a function on Flickr that allows users to place their photos into a context. For one thing, they can collect photos in albums (‘photo sets’), and then allocate these to superordinate albums (‘collections’). This creates a photo hierarchy and simplifies the search for and allocation of photographs. The ‘collection’ function is limited to users of Flickr Pro, however.
Videosharing Services
Like music and photos, videos have limited textual metadata, such as titles or descriptions. That is why they need tags – not only to provide a wider access, but to provide access at all, as Loasby (2006) emphasizes: “There is no text to search so metadata is a necessity, not a nice-to-have” (Loasby, 2006, 26). Collaborative videosharing services rely on users’ readiness to index videos with tags. Videosharing services differ from photosharing services only in the medium they work with and in the upload and display options that go with it. That is why I will start directly with the most popular video portal, YouTube71, even though the World Wide Web offers plenty of other such services, e.g. Vimeo72 or Clipfish73.
71 http://www.youtube.com. 72 http://www.vimeo.com. 73 http://www.clipfish.de.
YouTube was created in 2005 by three friends, Chad Hurley, Steve Chen and Jawed Karim, who were looking for an alternative to the sending of videos as email attachments, and thus came up with a platform for viewing video files (Hornig, 2006). In late 2006, the company was bought by Google. YouTube’s greatest advantage is that it supports uploads of videos in nearly all existing formats (.WMV, .AVI, .MOV, .MPEG, .DivX, .FLV and .OGG, as well as .3GP for videos shot with a camera phone) and then uniformly plays them in the Macromedia Flash Player format .FLV (Webb, 2007; Cheng, Dale, & Liu, 2007). Downloading videos from YouTube is not an option; the clips are played back in real time in the browser. The maximum upload volume per video file is set at 1GB and ten minutes’ length74. Yen (2008) reports that YouTube.com was visited by 79m users in 2008, who viewed 3bn videos. In 2006, YouTube stored 6.1m videos and roughly 500,000 user accounts on its servers (Gomes, 2006). Hurley (2007) summarizes YouTube’s success: “YouTube is more than a library of clips” (Hurley, 2007).
Figure 1.66: The Profile Page (‘Channel’) of the User ‘isabellapeters’ on YouTube.
74 This time limit for videos was only introduced in March, 2006 – previously uploaded videos can also run longer. Up to that point, a ‘Director Account’ was available, enabling users to upload unlimited video lengths and volumes. Users with this form of access are still able to upload longer videos to the platform today (http://help.youtube.com/support/youtube/bin/answer.py?answer=71673&ctx=sibling).
After registering, the user can upload single videos to the platform or save videos from other members in his own videolog, mark videos as favorites and communicate with other users. Furthermore, he can recruit other users as subscribers of his own videos or his own channel, respectively. The ‘channel’ is a user’s profile page (see Figure 1.66), which he can adjust to his liking (Paolillo, 2008): he can upload a profile picture as well as change the color display and the type of channel (YouTuber, director, musician, comedian, reporter, guru). Personal information (such as home town and favorite movies) and tags describing the channel and allowing it to be accessed by other users can also be entered here. The user can also use this page to send messages or add comments. Another option is the user’s recommendation. The channel contains information on the user’s date of registration, last login and number of viewed videos, subscribers and channel views. Before uploading, the user must fill out some fields. He must specify the video’s title and add a description, both of which can be changed later (see Figure 1.67). The video must also be allocated to one of the 15 available categories: None, Cars & Vehicles, Comedy, Education, Entertainment, Film & Animation, Gaming, Howto & Style, Music, News & Politics, People & Blogs, Pets & Animals, Science & Technology, Sport, Travel & Events. The video must also be tagged; the tag field may hold at most 120 characters including blanks, which separate the tags from each other. The ‘owner’ of the video is the only user who can attach tags to it. Changing the remaining settings is optional. ‘Broadcast Options’ regulates access to the video (public or private), ‘Date and Map Options’ allows the user to state the video’s creation date and location and ‘Share Options’ allows or excludes communication channels (comments, comment polls, video replies, ratings, embedding and broadcasting).
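The rule for the tag field mentioned above (a limit of 120 including blanks, with blanks acting as separators) can be sketched as follows. The function and constant names are hypothetical, and how YouTube itself reacts to an over-long field is not stated in the text; raising an error is merely an assumption made for this illustration.

```python
MAX_TAG_FIELD_LENGTH = 120  # limit given in the text, blanks included


def split_tag_field(field):
    """Split the content of the tag field into single tags.

    Blanks separate tags, so a multi-word expression becomes several tags.
    """
    if len(field) > MAX_TAG_FIELD_LENGTH:
        # Assumed behavior for the sketch; not taken from the source.
        raise ValueError("tag field exceeds 120 characters")
    return field.split()


print(split_tag_field("cern rap particle physics"))
```

One consequence of the blank as separator is visible in the usage line: “particle physics” cannot be entered as a single tag this way but is split into two.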
Figure 1.67: Description Options for Video Uploads on YouTube.
After the upload, the user is provided with all entered information as well as the options to play the video, edit the video information, add notes to the video, replace the video’s audio track (‘AudioSwap’), consult the video’s viewing statistics (‘Insight’), embed it in a blog or send it to a cellphone (Webb, 2007). He can also re-arrange his videos and display them according to title, length, upload date, number of views and ratings. This makes it easier for users to access their videos if they do not perform a search. There are several search options on YouTube. For one thing, users can restrict their searches to videos, channels and groups. Advanced Search further offers the possibility to limit searches to video length (short, medium, long), language, category, location and upload date (anytime, this month, this week, today), as well as the option to filter videos that may not be suitable for minors (SafeSearch) (see Figure 1.68). Search terms may also be linked via the Boolean operators AND (AND in the search field), OR (OR in the search field), NOT (minus sign in the search field) and phrase (search terms in quotation marks in the search field).
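The query syntax described here, which the chapter also reports for Flickr, can be made concrete with a small parser. This is a sketch under stated assumptions, not either platform's search engine: terms without an operator are treated as AND, OR is assumed to bind more loosely than AND, and matching is done by simple substring comparison. All names are hypothetical.

```python
import re

# A token is either a (possibly negated) quoted phrase or a run of non-blanks.
TOKEN = re.compile(r'-?"[^"]*"|\S+')


def parse_query(query):
    """Return a list of OR-groups; each group is a list of (term, negated) pairs."""
    groups, current = [], []
    for token in TOKEN.findall(query):
        if token == "OR":
            groups.append(current)
            current = []
            continue
        if token == "AND":  # explicit AND equals the default link
            continue
        negated = token.startswith("-") and len(token) > 1
        term = token[1:] if negated else token
        current.append((term.strip('"').lower(), negated))
    groups.append(current)
    return [group for group in groups if group]


def matches(query, text):
    """True if text satisfies at least one OR-group of the query."""
    haystack = text.lower()
    return any(
        all((term in haystack) != negated for term, negated in group)
        for group in parse_query(query)
    )


print(matches('dog -puppy', "A cute little dog"))
```

Because of the substring comparison, a term such as “cat” would also match “category”; a real retrieval system would compare against indexed tokens instead.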
Figure 1.68: Advanced Search on YouTube.
Figure 1.69: Browsing Options on YouTube: Videos, Channels, Categories.
Search results are displayed according to relevance, upload date (today, this week, this month, anytime), number of views, ratings or video type (all, partner videos, annotations, closed captions). The Type-Ahead functionality for entering search terms can be manually deactivated. Restricting the search to tags only is not an option – YouTube searches all mandatory fields by default. Browsing options are accessible via the tabs ‘Videos,’ ‘Channels’ and ‘Categories’ (see Figure 1.69). The video display can be altered by either selecting spotlight videos, rising videos, most discussed, recent videos, most responded, top favorited, top rated or as seen on. Channels are searchable by most viewed and most subscribed.
Furthermore, any user can subscribe to another user’s channel, favorites or tag via RSS feed. That user’s videos/favorites or the videos indexed with that tag, respectively, are then displayed on the user’s homepage (Webb, 2007). If after a search the user selects a video, YouTube automatically directs him to a page where the video is played back using an Adobe Flash Player Plugin (see Figure 1.70).
Figure 1.70: Video Display after a Search on YouTube.
Detailed information includes the video’s title, description, allocated category, owner, upload date and tags. Clicking on a tag, the category or the username leads
the user to more videos indexed with the same tag or allocated to the same category, or to the video owner’s channel, respectively. The user is also shown the video’s permanent URL as well as the data he would need in order to embed the clip in another website, and what other videos this user hosts on YouTube. Promoted videos are provided by YouTube Partners, such as ZDF75 or Financial Times Germany (Bertram, 2008; Hurley, 2007). The user is shown a maximum of 20 similar videos (Cheng, Dale, & Liu, 2007). Information on the number of views as well as the number and type of ratings is on display beneath the video plugin. The user can also consult the video’s awards and the websites linked to it under ‘Statistics.’ There is also the possibility of forwarding the clip to various social networks (e.g. MySpace76 or Facebook77), social bookmarking services (e.g. del.icio.us) or collaborative news portals (e.g. Reddit78 or Digg79). Registered users can post comments and video replies, add the video to their playlist or list of favorites, or flag it for containing inappropriate content. YouTube, too, has increasingly become the subject of scientific endeavors. These are directed mainly at the platform’s social networking aspect (Paolillo, 2008; Cheng, Dale, & Liu, 2007), its applicability for educational purposes (Micolich, 2008), the video and metadata properties (Cheng, Dale, & Liu, 2007; Geisler & Burns, 2007; Meeyoung et al., 2007) as well as the segmentation of videos for more effective information gathering (Sack & Waitelonis, 2008). Paolillo (2008) uses the clips’ textual metadata (title, description and tags) to analyze the network of YouTube users. He finds that user networks are strongly concentrated around certain video clusters, meaning that users are only ‘friends,’ or linked via comments, favorite lists etc., if they share the same interests with regard to the videos.
Tags have an important function here: they serve to alert other users, or other like-minded users, to one’s own video and thus to generate contacts: “these tags in some cases identify cohesive subgroups of authors exchanging similar content” (Paolillo, 2008). The author arrives at the following conclusion: “YouTube is thus a social networking site, with the added feature of hosting video content” (Paolillo, 2008). He further observes that most users only watch the videos without producing and uploading any of their own. Active, content-generating users use YouTube to publish their own content, and also to add outside content to their own videos and thus create a sort of filter for the huge mass of clips. Geisler and Burns (2007) analyze the tags of more than 500,000 YouTube users and find that each video is indexed with six tags on average. These tags provide real added value – especially for users searching – since 66% of them do not appear in the other metadata (title, description or author). Due to the range of tags and additional metadata, access to the videos is duly enhanced and guaranteed. This observation coincides with Paolillo’s (2008), who also describes this function of tags as access vocabulary.
75 ‘Zweites Deutsches Fernsehen/Second German Television’ – a Mainz-based public-service television channel. See http://www.zdf.de. 76 http://www.myspace.com. 77 http://www.facebook.com. 78 http://www.reddit.com. 79 http://www.digg.com.
Figure 1.71: Distribution of Category Allocations on YouTube. Source: Cheng, Dale, & Liu (2007, Table 2 and Fig. 1).
Cheng, Dale and Liu (2007) compare YouTube to other video platforms (e.g. ClipShack80 or VSocial81) and Peer-to-Peer networks (e.g. BitTorrent82). YouTube’s distinction vis-à-vis Peer-to-Peer networks is that videos are not just placed next to each other at random, but are put into a context that users can exploit for information retrieval by being linked in groups or displayed next to similar videos. The possibility to rate or comment on videos is another innovation. Cheng, Dale and Liu (2007) find that 97.8% of videos are shorter than 600 seconds, and that 99.1% of videos are shorter than 700 seconds, which means that the greater part of clips either approaches the maximum video length of ten minutes83 or exceeds it only minimally. 98.8% of clips are smaller than 30MB; the average data volume is 8.4MB. “Considering there are over 42.5 million YouTube videos, the total disk space required to store all the videos is more than 357 terabytes!” (Cheng, Dale, & Liu, 2007). The authors also examine the distribution of the videos’ category allocations. The result is displayed in Figure 1.71. Almost 23% of videos are classified as ‘music,’ almost 18% belong to the category ‘entertainment’ and around 12% of all clips are to be found under ‘comedy.’ The table lists two other categories: ‘unavailable’ and ‘removed.’ The category ‘unavailable’ (0.9%) compiles all videos that have been flagged as private or inappropriate and thus removed from public access; ‘removed’ contains those videos that have been deleted by either the creator himself or a YouTube administrator, but are still linked to by other videos.
Meeyoung et al. (2007) use YouTube as a prototypical example for user-generated video content and compare it with commercial providers for professionally generated film and video content (e.g. Lovefilm84 or the Internet Movie Database85) – their conclusion is that user-generated videos are created and published much faster than professionally shot film: “YouTube enjoys 65,000 daily new uploads – which means that it only takes 15 days in YouTube to produce the same number of videos as all IMDb movies” (Meeyoung et al., 2007). If this result is none too surprising, the next comparison quite accurately mirrors the activities of amateur and professional filmmakers: 90% of professionals publish fewer than 10 films, and 90% of amateurs upload fewer than 20 videos. The authors also analyze characteristics specific to YouTube: according to their findings, 54% of videos have been rated, and 47% of videos from the category ‘Science & Technology’ (termed ‘Howto & DIY’ in Figure 1.71) are accessed via external links (mostly through MySpace, Blogspot86, Orkut87, Qooqle88 and Friendster89). The most popular videos are distributed according to a power law, since roughly 10% of videos account for almost 80% of page views. With regard to the chronology of page views it must be noted that 90% of clips are viewed at least once within a day of being uploaded, while 40% are viewed more than ten times. Meeyoung et al. (2007) conclude that “if a video did not get enough requests during its first days, then it is unlikely that they will get many requests in the future” (Meeyoung et al., 2007).
As opposed to photographs, which are static, video clips are dynamic and furthermore possess a certain temporal dimension. However, so far it is only possible to tag videos as a whole, even though they may consist of different sections and even handle various subject areas. Sack and Waitelonis (2008)90 use tags to provide for content searches of video sequences by initially letting users segment the clips and then combining these sections with tags and time-referenced metadata. Thus videos can be precisely and accurately indexed and searched. Micolich (2008) reports on the application of YouTube videos for learning and teaching purposes in physics lessons at school, particularly emphasizing their use for representing physical phenomena (e.g. cloud formations) chronologically. He also discusses the novelty of communicating knowledge via video clips. Noteworthy at this point is the “Large Hadron Rap91,” in which an employee of the European Organization for Nuclear Research (‘CERN’ for short) explains particle physics via a rap performance (netzeitung.de, 2008). Czardybon et al. (2008) use clips with a duration of at most five minutes as a video glossary for university education. These glossaries provide a short summary of the most important terms for the study of information science and serve students for the purposes of revision and exam preparation.
80 http://www.clipshack.com. 81 http://www.vsocial.com. 82 http://www.bittorrent.com. 83 This duration limit was only introduced in March, 2006, which means that previously uploaded videos come to bear on this result. 84 http://www.lovefilm.com. 85 http://www.imdb.com. 86 http://www.blogspot.com. 87 http://www.orkut.com. 88 http://www.qooqle.jp. 89 http://www.friendster.com. 90 A Beta Version of the system can be tested at http://www.yovisto.com. 91 http://de.youtube.com/watch?v=f6aU-wFSqt0.
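Two of the figures quoted in the preceding discussion of YouTube can be checked with simple arithmetic: the storage estimate by Cheng, Dale and Liu (2007) and the upload comparison by Meeyoung et al. (2007). The sketch below uses only the numbers given in the quotations; that the authors counted in decimal units (1 terabyte = 1,000,000 megabytes) is an assumption, suggested by the fact that the result then matches their figure.

```python
# Storage estimate: number of videos times average size.
videos = 42_500_000            # "over 42.5 million YouTube videos"
avg_size_tenths_of_mb = 84     # average of 8.4 MB, kept as an integer

total_mb = videos * avg_size_tenths_of_mb / 10
total_tb = total_mb / 1_000_000   # decimal terabytes (assumption)
print(total_tb)

# Upload comparison: daily uploads times number of days.
daily_uploads = 65_000
days = 15
videos_in_15_days = daily_uploads * days
print(videos_in_15_days)
```

The first product gives 357 decimal terabytes for the rounded inputs, in line with the “more than 357 terabytes” of the quotation. The second gives 975,000 videos, which by the authors' statement is the order of magnitude of all IMDb movies at the time.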
Social Networks
Social networks, such as Facebook92, StudiVZ93, XING94, LinkedIn95 or MySpace96, primarily serve the networking and exchange between contacts, friends, acquaintances or business partners. They build on the ‘six degrees of separation’ or ‘small worlds’ principle (Watts & Strogatz, 1998; Milgram, 1967), which holds that everybody on the planet is linked with everybody else through a chain of at most six people (Udell, 2004; Schmidt, 2007). Social networks can take shape for a variety of reasons. Gruppe (2007) lists four motives for the creation of ‘communities’ or networks: Communities distinguish themselves through a common interest in a topic. [...] There are various dimensions to these commonalities: 1. geographical: defined by a physical location, such as a town or a region; 2. demographical: defined by age, sex, race or nationality; 3. subject-specific: common interests, such as fan club, association or company; 4. action-specific: shopping, speculating, gaming, making music (Gruppe, 2007).
Social networks have an enormous appeal at the moment (Thompson, C., 2008; Huberman, Romero, & Wu, 2009): Many young people regard the use of Social Media and any digital content as essential; e-mail is already losing popularity with teens. [...] by now studiVZ has over 5m users [...], 50 per cent of whom log in daily and send one million messages per day (Hamm, 2008).
These networking services generally do not offer users the possibility of using tags to ‘categorize’ contacts. XING, the networking service for professionals, is an exception to this rule; however, the tags are not visible to the public. Tagalag97 enables its users to publicly tag profiles, but so far only exists as a beta version (Hayman, 2007; Hayman & Lothian, 2007). Tags are used to identify personnel in corporate environment, however. Muller, Ehrlich and Farrell (2006; see also Farrell et al., 2007 and Farrell & Lau, 2006) report on the management of personal information via tags. The goal is this: “If we provided a system in which people could tag people, could we use these tags to create a folksonomy of employees?“ (Farrell et al., 2007, 91). This ‘people-tagging’ allows employees to complete the register of persons and the personal profiles with their own ideas on and ‘table of contents’ of that person: “Tagging people is also a way to capture limited social-network data. One can see a little bit of who someone knows and who knows them, and more usefully, who the user and another individual have in common“ (Farrell & Lau, 2006). As opposed to other tagging systems, this system facilitates the use of reciprocal linking, which means that a resource is not only linked to a user via a tag, but that the resource – the employee, in this case – can reciprocate the tag. Thus personal profiles not only display ingoing links, that is
92 http://www.facebook.com.
93 http://www.studivz.net.
94 http://www.xing.com.
95 http://www.linkedin.com.
96 http://www.myspace.com.
97 http://www.tagalag.com.
Collaborative Information Services
tags describing the employee, but also those tags the employee uses to describe colleagues (Farrell & Lau, 2006). An analysis of all used tags (Muller, Ehrlich, & Farrell, 2006) revealed that these could be divided into 14 different categories, such as ‘hobby,’ ‘skill,’ ‘sport,’ ‘division-group’ or ‘knowledge-domain.’ Another result of this analysis is that 79% of users have indexed their own profile with tags and have even spent more time doing so than tagging colleagues. Muller, Ehrlich and Farrell (2006) regard this as positive behavior and conclude:

While this kind of impression management might appear to be a matter of vanity or self-promotion, we understand it as a kind of organizationally responsible behavior for people in a knowledge-intensive company. Knowledge work often involves finding opportunities to contribute to collaborative work, and one way of finding those opportunities is to “advertise” one’s skills to other members of the organization. Self-tagging would support the creation and refinement of such a public persona (Muller, Ehrlich, & Farrell, 2006).
A further study of the same system (Farrell et al., 2007) revealed that tagging is particularly useful for keeping employees’ profiles up to date while at the same time relieving users of the task of maintaining their own profiles. Since tags are attached to a profile by one’s colleagues, the individual user is spared the annoying task of constantly updating his profile – and so, “only 40% of profiles have been updated in the past nine months and 22% have been updated in the past three months” (Farrell et al., 2007, 91). Employees are shown three tag clouds on their profile: the first one displays the tags their colleagues have indexed them with, the second one displays the tags that they have indexed their colleagues with, and the third tag cloud displays the tags that the user has used for the in-house social bookmarking service ‘dogear.’ This results in a fairly comprehensive profile of the employee via his interests, contacts and responsibilities. An additional functionality of the tag clouds is provided by the tags themselves: moving the mouse over any tag triggers a display of how many other users have used that tag, as well as of their names. The second tag cloud has a special task – it guards against spam and inappropriate tags by mirroring the tags the employee has used for colleagues’ profiles. Should a user feel the need to insult another employee with a tag, that tag would automatically be displayed on their own profile, thus exposing them. These methods seem to be effective: “86% indicated that they were comfortable with the tags they had received and only 14% expressed a desire to have tags removed” (Farrell et al., 2007, 97). Additionally, the reasons for removing tags are mostly that these are no longer relevant, since, for example, the employee is no longer working on the project the tag refers to.
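The three tag clouds and the mirroring safeguard can be pictured with a small data model. The following is only a sketch under stated assumptions: the class `PeopleTagging`, its method names and the lower-casing of tags are hypothetical and are not taken from the system described by Farrell et al. (2007). Each tagging act is stored as a triple of tagger, tagged person and tag; the incoming cloud, the outgoing (mirrored) cloud and the mouse-over list of taggers are then simple views on these triples.

```python
from collections import Counter


class PeopleTagging:
    """Toy model of reciprocal people-tagging (all names are hypothetical)."""

    def __init__(self):
        # Each entry is a (tagger, tagged person, tag) triple.
        self.assignments = set()

    def tag(self, tagger, person, tag):
        # Assumption: tags are compared case-insensitively.
        self.assignments.add((tagger, person, tag.strip().lower()))

    def incoming_cloud(self, person):
        """Tags that colleagues have attached to `person`, with counts."""
        return Counter(
            t for who, whom, t in self.assignments
            if whom == person and who != person
        )

    def outgoing_cloud(self, person):
        """Tags `person` has given to colleagues; mirrored on the tagger's own profile."""
        return Counter(
            t for who, whom, t in self.assignments
            if who == person and whom != person
        )

    def taggers_of(self, tag):
        """Names shown on mouse-over: everyone who has used `tag`."""
        return sorted({who for who, _, t in self.assignments if t == tag})
```

Because the outgoing cloud is derived from the same triples as the incoming one, an insulting tag cannot be given without also appearing on the profile of the person who gave it.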
The general aim of tagging employees is to locate relevant colleagues or experts in particular areas by ranking employees based on the quantity of tags. Farrell et al. (2007) realized, however, that this assumption is only partly viable. It would seem that high rankings are earned by those employees who are very active in discussing a particular topic on blogs or message boards: “Perhaps tag-based ranking brings out the ‘hubs’ rather than the subject matter experts, and perhaps is more useful for finding the right person” (Farrell et al., 2007, 100). Farrell et al. (2007) then sum up their findings on people-tagging: “Our results suggest that the people-tagging feature is useful for contact management, supports collective maintenance of the employee directory,
and appears to have built-in safeguards against undesired tags” (Farrell et al., 2007, 92).

The social network 43Things takes a slightly different approach and openly relies on tags to create a network of people in the first place. This platform’s goal is to get users to communicate with each other via their ‘To Do’ lists, which they publish online. These lists are typically made up of at most 43 things that they hope to accomplish in life. These goals are represented as tags (see Figure 1.72). Thus meeting people is no longer a matter of becoming ‘friends’ with them by clicking on a button, as with most other social networks, but of evaluating the tags that bring together people with similar goals in life. 43Things is mainly interested in getting users to communicate and exchange ideas, and thus in providing them with inspiration for their own lifestyles: “Other people often have great ideas. You can get inspiration from others”98.
Figure 1.72: Zeitgeist Tag Cloud of the Most Popular Tags on 43Things.
43Things was created in 2005 by several individuals, some of whom had previously worked for Amazon, where they were in charge of programming that platform’s ratings system. A large part of the platform is financed by Amazon99 to this day (Wegner, 2005). After registering, the user is initially given an overview (visualized as a tag cloud) of what the other members want to do in life. He is also told that at the time of this writing, 1,638,798 persons from 15,032 towns are registered and engaged in discussing 1,374,738 goals (see Figure 1.73). In order to personalize his profile, the user is given the possibility of stating his geographical location100. Thus 43Things is able to bring together persons and goals that are also linked geographically.
98 See http://www.43things.com/about/view/faq#why-someone-else.
99 See also http://www.43things.com/about/view/faq#more-about-us.
100 This is achieved by its sister website, 43places.com, which collects comments, photos and ratings on towns and countries.
The user can add his goals in life to his To-Do list via the input field. He is told (see Figure 1.73) how many other users pursue the same goal – one other user is seen to have used the same exact phrase, ‘eat more vegetables and fruit,’ 37 other users have the same goal but use the phrase ‘eat more fruits & vegetables,’ and 5 users have decided that ‘eat more vegetables and fruit’ and ‘eat more fruits & vegetables’ mean the same thing. If the user now wants to adjust his goals to the community’s parlance, he can add the desired phrase to his To-Do list. He can also post a comment on his goal, complement it with a photograph or post it on a blog. Furthermore, the user can delete the entire entry, or add a reminder for resubmission (see Figure 1.74). He can also state whether the goal has been accomplished, and post a report on how it happened.
Figure 1.73: Input Fields for Goals in Life and Display of Personal To-Do List on 43Things.
The transmission of semantic information with regard to the goals’ formulation can also be carried out by the user. He can enter alternative phrasings for one and the same goal under ‘Report a very similar goal.’ At the bottom of the page there is another tag cloud that presents other goals and is meant to serve the user as inspiration for his own To-Do list. In order to better describe the goals, the user can also add tags to the entries. Multiple tags must be separated by commas. Researching, or rather browsing through other users’ To-Do lists and profiles, is possible via either the tag cloud ‘Zeitgeist’ or a search. If the user enters a search term, he is shown a list of results, as seen in Figure 1.75. Here he is shown which goals contain the search term, how many users follow these goals, as well as which usernames and tags contain the search term. A click on one of the search results will lead the user to either the goals (see Figure 1.75), the personal profile or the entries and goals indexed with that tag, respectively. Selecting a tag will show the user related tags and facilitate a pure tag search.
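The rule that multiple tags must be separated by commas implies a small parsing step on the input field. The following sketch shows one plausible way to do it; the function name `parse_tags` is hypothetical, and the trimming, lower-casing and removal of duplicates are assumptions for illustration, not documented behavior of 43Things.

```python
def parse_tags(raw):
    """Split a comma-separated tag string into a clean list without duplicates."""
    seen = set()
    tags = []
    for part in raw.split(","):
        # Trim the tag and collapse inner whitespace; lower-casing is an assumption.
        tag = " ".join(part.split()).lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return tags
```

Since the comma is the only separator, a tag may itself consist of several words, as in ‘eat better.’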
Figure 1.74: Editing and Commenting Functions for Personal Goals on 43Things.
Figure 1.75: List of Search Results for the Term ‘Travel’ on 43Things.
In the general search area, but not in the tag search, search terms can be linked with the Boolean operators AND and OR, and a phrase search can be initiated with quotation marks. Tag searches as well as related tags are immediately accessible by clicking on a tag and via tag clouds. The selection of a goal leads the user to the display seen in Figure 1.76. This tells him how many other users follow the same goal, how many have already reached it, how long this has taken them on average, and how many of them believe it to have been worthwhile. Furthermore, he can consult posts by the people who have reached that goal and have chosen to share their experience. Clicking on a username will lead the user to the other member’s profile and To-Do list. He is also told which tags the goal has been indexed with, which personal tags he has used for this goal (if applicable), and which goals are closely related to the selected one (e.g. ‘get an article published this year’). There is also a message board where he can communicate with other users who follow the same goal or get advice on his current status in achieving his goal. Lastly, he can find out which users subscribe to his own phrasing of the goal and which, if any, believe that it can be formulated differently.

Wordie101 creates a similar social network via tags and words, or phrases. The user can create word lists, tag them and make them accessible to other users:

Wordie lets you make lists of words and phrases. Words you love, words you hate, words on a given topic, whatever. Lists are visible to everyone but can be added to by just you, a group of friends, or anyone, as you wish102.
The user can discuss the words with other people, add quotes or comments, or find out which other users have a special relationship to any given word. Research is possible via tags or words.
101 http://wordie.org.
102 http://wordie.org/words/index/faq.
Figure 1.76: Display of a Goal on 43Things.
Blogs and Blog Search Engines

Technorati is a (blog) search engine and thus not a prototypical example of a collaborative information service. However, since blog search engines make heavy use of folksonomies and tags in their search algorithms, and are meant to cover user-generated content exclusively, they must not go unmentioned. Examples of other search engines that also utilize tags for their searches are Blogpulse103 and Blogoscoop104. Technorati was founded in late 2002 by David Sifry (Sifry, 2002) and indexes, by its own account, 112.8m blogs and 250m ‘pieces of social media,’ which include photos, videos, ratings etc.105 Weblogs (short: blogs) are easily manageable websites which are distinguished by the reverse-chronological arrangement of their content or entries (‘posts’) (Pikas, 2005; Chopin, 2008; Efimova, 2004; Efimova & de Moor, 2005; Röll, 2003; Röll, 2004; Eck, 2007; Nichani & Rajamanickam, 2001; Schmidt, 2006; Picot & Fischer, 2006; Schnake, 2007). The most recent update is displayed at the top of the page (see Figure 1.77).
Figure 1.77: Typical Blog with Tag Cloud. Source: http://jkontherun.com/.
This form of visual display is also the basis of the term ‘blog’: it is made up of the words ‘web’ and ‘log book’106 and was first used to describe this genre of website by Jorn Barger (Blood, 2002). Blogs are heavily linked amongst each other: as is common practice for scientific publications, bloggers may refer to the source text by providing direct links. The blog thus cited is automatically notified of its referral via
103 http://www.blogpulse.com.
104 http://www.blogoscoop.net.
105 See http://technoratimedia.com/about.
106 A log book is a ship captain’s chief means of recording information and serves to note daily events. It can be likened to a diary, except the most recent entries are at the top of the page.
‘Trackback.’ All referring blogs are also cited in the source blog. This creates a tight web of links and referrals, which can be explored by the blog operators as well as their readership (Kumar et al., 2004; Cayzer, 2004): “We’ve observed that bloggers in a community often link to and cross-reference one another’s postings, so we can infer community structure by analyzing the linkage patterns among blog entries” (Kumar et al., 2004, 38). This way of linking is facilitated by a new sort of link: the Perma-Link. Perma-Links are permanently assigned to a blog post and remain valid regardless of whether the post is to be found on the blog’s homepage or in its archive. The categorization tools blogs use are tags, which can be allocated to individual blog posts and are generally displayed as a tag cloud on the blog’s homepage. The blog as a whole, single posts, and single tagged resources of user-generated content (such as photos from Flickr or videos from YouTube) can be subscribed to via RSS or Atom feed using their respective tags. Technorati exploits this feature for its searches.

Technorati searches user-generated content for:
• resources concerning the subject X,
• blog posts mentioning X (keyword search) and
• blogs, blog posts and any other resources tagged with X.
The first two types of search are executed via a full-text sweep of the resources, the last via tags only. Technorati differentiates between two types of tags for tag search. First, there are ‘Blog Tags,’ which serve to describe the subject area of a blog as a whole and to file it in the ‘Blog Directory’ (see Figure 1.78). There the user can search for relevant blogs according to subject area. Second, Technorati’s database can be searched via ‘Post Tags’ in the tag search or by clicking through the Technorati tag cloud. Post Tags do not refer to the blog as a whole, but represent single blog posts, as well as photos and videos.
In order to search for them effectively, Technorati must index Post Tags in the same way as the full text of single posts. For this purpose, Technorati places the following demands on tags:
• the tags must be retrievable in the body of the post, since Technorati’s crawlers only consider the posts’ title, footer and body,
• the tags must be contained in the post’s RSS or Atom feed: “This will probably mean displaying the full posts in your feed”107,
• the tag code must have the following format if the tag is to be indexed: <a href="http://technorati.com/tag/ipod" rel="tag">ipod</a>.
To this end, Technorati provides a ‘Tag Generator108.’ Technorati’s advice for the creation of tags: “The [tagname] can be anything, but it should be descriptive. Please only use tags that are relevant to the post. You do not need to include the brackets, just the descriptive keyword for your post”109. At this point the problem arises that tags, once indexed by Technorati, cannot be changed. The indexing process can be accelerated by activating a ‘Ping’ update every time a new post is published. This can happen automatically via the blog
107 See http://support.technorati.com/faq/topic/48.
108 See http://www.technorati.com/tools/tag-generator/index.html.
109 See http://support.technorati.com/support/siteguide/tags.
software (e.g. WordPress110 or Movable Type111), or by manually notifying Technorati at http://technorati.com/ping.
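The demands placed on Post Tags suggest how a crawler might pick tags out of a post body. The sketch below is an illustration only and does not reproduce Technorati’s crawler: it assumes that tag links are ordinary HTML anchors carrying rel="tag" and that the tag name is the last path segment of the link target. The names `RelTagParser` and `extract_post_tags` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class RelTagParser(HTMLParser):
    """Collects tag names from anchors marked with rel="tag" (assumed format)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").split()
        href = attr.get("href") or ""
        if "tag" in rel_values and href:
            # Assumption: the tag name is the last path segment of the URL.
            name = href.rstrip("/").rsplit("/", 1)[-1]
            self.tags.append(unquote(name))


def extract_post_tags(html):
    """Return the tag names found in the body of a post, in document order."""
    parser = RelTagParser()
    parser.feed(html)
    return parser.tags
```

Links without the rel="tag" marker are ignored, which is why a tag that only appears as plain text or as an ordinary link would not be indexed under this scheme.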
Figure 1.78: Blog Directory on Technorati with Sign-In Option. Source: http://www.technorati.com/blogs/directory.
110 http://de.wordpress.com.
111 http://www.movabletype.org.
Figure 1.79: Display of Search Results and Advanced Search Options on Technorati.
Tag search is accessible either via Advanced Search or at the top of a hit list after a simple search. There the user can limit his search results to posts, blogs, photos or videos (see Figure 1.79). In the ‘Posts’ area he can search through either the full text of the blog posts or the tags. Furthermore, he can restrict the blogs’ language and take into account the blogs’ authority values (‘any,’ ‘a little,’ ‘some,’ ‘a lot of’) for the display. Technorati also recommends similar tags that the user can use to change the search request. These recommendations are based on co-occurrence analyses. The area ‘Blogs’ displays blogs tagged with the search term. Here too, the user is referred to similar search terms. The area ‘Photos’ shows photographs from Flickr tagged with the search term – the same goes for the area ‘Videos’ and YouTube. Figure 1.80 displays the search results for a photo tagged with ‘süßer Hund’ (‘cute dog’); Technorati correctly retrieves my own example photo from Figure 1.57.
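Tag recommendations based on co-occurrence can be illustrated with a few lines of code. This is a minimal sketch and not Technorati’s actual procedure: it assumes that two tags count as related whenever they have been assigned to the same post, and it ranks candidates by the raw number of shared posts. The function names and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(posts):
    """Count how often each unordered pair of tags appears on the same post."""
    counts = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


def related_tags(posts, tag, top_n=3):
    """Return the tags that co-occur most often with `tag`."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(posts).items():
        if a == tag:
            scores[b] += n
        elif b == tag:
            scores[a] += n
    # Sort by descending count, then alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_n]]
```

For the posts [["dog", "cute", "pets"], ["dog", "cute"], ["dog", "training"]], the tag ‘dog’ would be related first to ‘cute’ and then to ‘pets’ and ‘training.’ Raw counts favor very frequent tags; a real system would probably normalize them.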
Figure 1.80: Technorati Retrieves Flickr Photos.
Search terms can be linked via the Boolean operators AND, OR and NOT as well as phrase construction, either in the Advanced Search menu or by directly entering AND, OR, NOT or quotation marks in the search field. The list of search results can be further restricted to all blogs, blogs on a specific topic or blogs with a specific
URL in Advanced Search. Another option is the search for blogs that link to a certain website (‘find posts that link to’), or that have been indexed with a certain tag (‘find posts tagged’). Clicking on a tag in the Technorati tag cloud112 yields the same result. Of particular usefulness is the option to restrict the search to a particular blog, since most blogs do not have a search function – in this way, the user can search all posts on a blog (Bates, 2008). The display of search results in a post search is preset to reverse-chronological order, which means that the most recent post is at the top of the list, whether the user searched for keywords or tags. The hit list in a blog search is automatically arranged according to the blogs’ ‘Authority.’ This Authority value is defined by Technorati as the number of individual blogs that have linked to the blog in question over the last six months113. To counteract link spamming, Technorati only cites the blog as a whole as the referring blog – independently of whether the blog XY is cited in 1,000 posts or only in one. Clicking on the ‘Authority’ button displays all posts and blogs that link to the blog in question.
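The definition of the Authority value lends itself to a short worked example. The sketch below follows the description given above (the number of individual blogs linking to a blog within the last six months, each blog counted once); the function name `authority`, the approximation of six months as 182 days and the exclusion of self-links are assumptions made for illustration.

```python
from datetime import date, timedelta


def authority(links, blog, today, window_days=182):
    """Count the distinct blogs that linked to `blog` within the time window.

    `links` is an iterable of (source_blog, target_blog, link_date) triples.
    """
    cutoff = today - timedelta(days=window_days)
    sources = {
        src for src, dst, when in links
        # Assumption: a blog linking to itself does not raise its own value.
        if dst == blog and src != blog and when >= cutoff
    }
    return len(sources)
```

Because the linking blogs are collected in a set, a blog that cites another blog in 1,000 posts contributes exactly as much as a blog that cites it once, which is the safeguard against link spamming mentioned above.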
Games with a Purpose (GWAP) – Tagging Games

‘Games with a Purpose’ refers to online games that encourage users to participate while serving a specific purpose. In other words, the game is not only a way of killing time, but is meant to accomplish a task. Von Ahn (2006) regards these games as particularly appropriate when they help computers ‘understand’ people. After all, and “[d]espite colossal advances over the past 50 years, computers still don’t possess the basic conceptual intelligence or perceptual capabilities that most humans take for granted” (von Ahn, 2006, 96). (Online) games also reach a great number of participants, which distributes the burden of finding a solution onto many shoulders, thus achieving better results in less time. The gaming aspect also provides an impetus to participate and slyly hides the game’s main purpose, which is the solving of problems:

Any game designed to address these and other problems must ensure that game play results in a correct solution and, at the same time, is enjoyable. People will play such games to be entertained, not to solve a problem – no matter how laudable the objective (von Ahn, 2006, 98).
The website http://www.gwap.com compiles several such games. For this context, I will introduce a few GWAPs that make heavy use of tagging resources as a main functionality.
112 See http://www.technorati.com/tag.
113 See http://support.technorati.com/faq/topic/71.
Figure 1.81: Game Interface of the ESP Game. Source: http://www.gwap.com.
The ESP Game114 was developed at Carnegie Mellon University (von Ahn & Dabbish, 2004; von Ahn & Dabbish, 2005; von Ahn & Dabbish, 2008; von Ahn, 2005; von Ahn, 2006) and tries to playfully introduce users to tagging: “the people who play the game label images for us” (von Ahn & Dabbish, 2004). Two players index several photos at the same time, where they must each guess which tags the other player uses – certain words are ‘taboo’ and cannot be used. Points are awarded if both players have used the same tag for a photo (see Figure 1.81); a maximum of 15 matches can be rated (von Ahn, 2006). So the ESP Game makes use of the wisdom of the masses in one small respect – it suffices that two users agree in their indexing terminology: “agreement by a pair of independent players implies that the label is probably meaningful” (von Ahn & Dabbish, 2004). This method is supposed to help the game solve the problem of indexing all pictures and photos available on the World Wide Web with tags. This is meant to widen the access paths to the photos, also for the purpose of disability-friendly accessibility, and thus create better search functionalities and more precise search results:

Having proper labels associated to each image on the Internet would allow for very accurate image search, would improve the accessibility of the Web (by providing word descriptions of all images to visually impaired individuals), and would help users block inappropriate (e.g., pornographic) images from their computers115.
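The agreement rule of the ESP Game can be sketched in a few lines. The following is a simplified illustration and not the implementation by von Ahn and Dabbish: it assumes that both players’ guesses arrive in alternating order, that labels are compared case-insensitively, and that the round ends with the first label both players have entered. The function name `first_agreement` is hypothetical.

```python
def first_agreement(guesses_a, guesses_b, taboo=()):
    """Return the first label that both players have entered, or None."""
    taboo_set = {word.lower() for word in taboo}
    seen_a, seen_b = set(), set()
    # Interleave the guesses to mimic both players typing at the same time.
    for i in range(max(len(guesses_a), len(guesses_b))):
        for guesses, mine, other in (
            (guesses_a, seen_a, seen_b),
            (guesses_b, seen_b, seen_a),
        ):
            if i >= len(guesses):
                continue
            word = guesses[i].strip().lower()
            if not word or word in taboo_set:
                continue  # taboo words are rejected and never match
            if word in other:
                return word  # both players agree: the label is accepted
            mine.add(word)
    return None
```

With the taboo word ‘dog,’ the guesses ['dog', 'puppy', 'grass'] and ['animal', 'grass', 'puppy'] would agree on ‘grass,’ since that is the first label entered by both players.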
According to von Ahn and Dabbish (2008), more than 50m photos were indexed in this way by 200,000 players up until July, 2008. Google has acquired the licence for the ESP Game from von Ahn (O’Reilly, 2006) and now hosts its own game, the Google Image Labeler116, to use tagging in order to improve the search results on Google Image Search (see Figure 1.82). The Image Labeler even awards a different number of points for more or less specific
114 The game can be joined at http://www.gwap.com.
115 http://www.espgame.org. Unfortunately, the website is no longer available at this URL.
116 http://images.google.com/imagelabeler.
tags: if the players agree on a highly specific tag, the users gain more points than for a fairly regular tag. In order to play this game, users have to set up a Google account. A similar principle as the ESP Game’s is used by TagaTune117 (Law, von Ahn, & Dannenberg, 2007). Instead of trying to find matching tags for a photograph, here two players must index a sound file in three minutes using free tags (see Figure 1.83). Each matching pair is rewarded with points.
Figure 1.82: Game Interface of Google Image Labeler.
Figure 1.83: Game Interface of TagaTune.
The project Steve Museum118 has slightly changed the ESP Game in order to index pictures and photos of specific museum objects. Here the task is not to achieve a match with a fellow player, but to describe the resource with as many tags as possible. For the example displayed in Figure 1.84, 582 different tags have already been
117 http://www.gwap.com/gwap/gamesPreview/tagatune.
118 See http://www.steve.museum.
added to the resource. The members of this project hope to achieve as complete a representation of the resource as possible and thus widen access paths within museum catalogs.
Figure 1.84: Game Interface of the Steve Tagger. Source: http://steve.thinkdesign.com/steve.php.
The Dogear Game (Dugan et al., 2007) is another offshoot of the ESP Game. Dogear is an in-house social bookmarking system at IBM. Its goal: The goal of the Dogear Game is to generate human-sourced bookmark recommendations for a user of a social bookmark system, written by someone who knows that user – including that user’s interests, skills, etc – rather than using a more automated approach, such as machine learning (Dugan et al., 2007, 387).
The Dogear Game may not try to generate further tags for particular resources, but it does make use of the social bookmarking system’s folksonomy by displaying a bookmark, together with its annotated tags, which the players must then allocate to the right user. Correct allocations are rewarded with points. False allocations go unrewarded, but in such a case the player can forward the bookmark to the user he had assumed to be the correct answer. This encourages the exchange of knowledge between colleagues. The Dogear Game’s charm is not only in the way players can use their knowledge about their colleagues, but also in the playful getting to know other employees and their areas of interest. Dugan et al. (2007) point out four advantages: Value to the individual player (e.g., Dogear Game players learn about their colleagues), Value to at least one colleague of the player (e.g., the colleagues of the Dogear Game player receive well-informed, human-sourced recommendations), Value to the enterprise (e.g., the enterprise in which people play the Dogear Game develops a better set of tagged resources (through recommendations) and additional metadata about employee relatedness), Value to the player’s community of practice (Dugan et al., 2007, 390).
Summary

Collaborative information services are as manifold as the resources they are based on. Equally diverse are their ways of using folksonomies, which range from a weak (as in the case of WISO) to a pronounced usage (e.g. LibraryThing) for both knowledge representation and information retrieval. An overview of the tagging and retrieval functionalities, as well as a classification of the collaborative information services introduced here in terms of the type of folksonomy they use, is displayed in Figure 1.85. The columns list the services’ properties (e.g. what tag editing functions they provide or which users are allowed to tag) and the rows list the individual information services. In summary, it can be said that there are collaborative information services which allow their users to provide the resources that are to be tagged and collaborative information services that provide the content themselves. In particular, information service providers such as GBI Genios, as well as libraries, provide their own resources in order to complement professional indexing with user-generated tags and thus enhance access to the information resources. Almost all information services introduced in this chapter (excepting Last.fm and WISO) allow their users to delete and rename tags once they are indexed, which emphasizes the great demand for this functionality. Collaborative information services vary in terms of the extent to which they allow the tagging of information resources, however. It transpires that there are three variants, which also determine the type of folksonomy119:
• type 1, termed ‘broad,’ allows each user to add tags to every resource, where the multiple allocation of individual tags is taken into consideration by that resource’s folksonomy,
• type 2, termed ‘extended narrow,’ allows the resource’s owner and specified other users (e.g. friends) to add tags, where each tag may only be indexed once per resource and
• type 3, termed ‘narrow,’ only allows the resource’s owner to add tags.
This observation is of enormous relevance to the following chapters.
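The three variants differ in who may tag a resource and in whether repeated allocations of the same tag are counted. The sketch below expresses these rules as a small data structure; it is an illustration under stated assumptions, not a description of any of the services discussed. The names `FolksonomyType` and `Resource` are hypothetical, and two points go beyond the text: in the broad variant each user is assumed to count at most once per tag, and in the narrow variant each tag is assumed to be stored only once, as in the extended narrow variant.

```python
from collections import Counter
from enum import Enum


class FolksonomyType(Enum):
    BROAD = "broad"
    EXTENDED_NARROW = "extended narrow"
    NARROW = "narrow"


class Resource:
    """A tagged resource whose tagging rules depend on the folksonomy type."""

    def __init__(self, owner, kind, friends=()):
        self.owner = owner
        self.kind = kind
        self.friends = set(friends)
        self.tags = Counter()    # tag -> number of allocations
        self._assigned = set()   # (user, tag) pairs, used for the broad type

    def may_tag(self, user):
        if self.kind is FolksonomyType.BROAD:
            return True
        if self.kind is FolksonomyType.EXTENDED_NARROW:
            return user == self.owner or user in self.friends
        return user == self.owner

    def add_tag(self, user, tag):
        """Try to add a tag; return True if the folksonomy changed."""
        if not self.may_tag(user):
            return False
        if self.kind is FolksonomyType.BROAD:
            # Multiple allocation is counted, once per user (assumption).
            if (user, tag) in self._assigned:
                return False
            self._assigned.add((user, tag))
            self.tags[tag] += 1
            return True
        # Narrow variants: each tag is recorded only once per resource.
        if tag in self.tags:
            return False
        self.tags[tag] = 1
        return True
```

Only the broad variant yields tag frequencies per resource; in the two narrow variants every tag has the count one, so frequency information can at most be gathered across resources.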
119 A detailed description of the different types of folksonomy will follow in chapter three.
Figure 1.85: Overview of the Tagging and Search Functionalities of the Collaborative Information Services.
[Table: one row per service (del.icio.us, BibSonomy, Flickr, LibraryThing, Last.fm, YouTube, Amazon, 43things, Engineering Village, WISO, Technorati). Columns: Service; Who provides the resources to be tagged? (provider, author, author + other users); Who is allowed to tag the resource? (only author, author + friends, all users); Tag editing functions (rename, delete); Folksonomy type (broad, extended narrow, narrow); Search operators (AND, OR, - (NOT), “Phrase”); Browsing (tag cloud; Explore function; videos, channels, community); Tag bundles / Tagmash.]
The same range is represented in the search functions collaborative information services offer their users. Almost all services allow for the linking of search terms with the Boolean AND (excepting WISO), but only six out of eleven information services allow OR links. Excluding search terms and combining several terms into a phrase are among the more elaborate retrieval options and are only offered by five services – interestingly, by the same ones. Almost all collaborative services support the retrieval strategy of browsing by providing tag clouds, and Flickr offers an additional ‘Explore’ function. YouTube does without tag clouds but allows a limited search for videos, channels and in the community. The representation of tags in hierarchical relations via so-called tag bundles is only offered during the indexing of information resources by the social bookmarking services del.icio.us and BibSonomy. LibraryThing, however, offers so-called tagmashes, which allow users to summarize synonyms. It becomes apparent that collaborative information services make use of folksonomies for knowledge representation and information retrieval in different ways and to a different extent and thus construct different tagging systems. The conception and structuring of collaborative information services is the task of information architecture, which finds itself caught between two poles: the information services are on the one hand the target of their users’ varying needs, and are on the other hand influenced by the system objectives.
Smith (2008b, 22) and McMullin (2003) call this area of tension a ‘value-centered design model,’ where the system objectives are referred to as ‘return on investment’ and the user demands as ‘return on experience.’ Having developed a taxonomy of Web 2.0 services and represented the application areas as well as the practical use of folksonomies using the example of popular collaborative information services, we can now turn to and discuss the use of folksonomies in the field of information science. What will come to the fore again and again is that the balance between ‘return on experience’ and ‘return on investment’ is a hugely influential factor for the success or failure of the folksonomies.
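The Boolean retrieval options compared in this summary can be made concrete with a small tag search. The sketch below is generic and not modelled on any one of the services: it assumes an index that maps each resource to its set of tags and treats a multi-word tag as a single term, which corresponds to a phrase search on the tag level. The function name `search` and its parameters are hypothetical.

```python
def search(index, all_of=(), any_of=(), none_of=()):
    """Return the ids of resources whose tags satisfy the Boolean conditions.

    all_of  -> terms linked with AND
    any_of  -> terms linked with OR
    none_of -> terms excluded with NOT
    """
    hits = []
    for resource_id, tags in index.items():
        tags = set(tags)
        if all_of and not set(all_of) <= tags:
            continue  # AND: every term must be present
        if any_of and not (set(any_of) & tags):
            continue  # OR: at least one term must be present
        if set(none_of) & tags:
            continue  # NOT: no excluded term may be present
        hits.append(resource_id)
    return sorted(hits)
```

A service that only supports AND corresponds to a call using `all_of` alone; each further operator widens the range of queries a user can express.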
Bibliography

AADL.org Goes Social (2007), from www.blyberg.net/2007/01/21/aadlorg-goessocial/.
Alby, T. (2007). Web 2.0: Konzepte, Anwendungen, Technologien (2. Aufl.). München: Hanser.
Amitay, E., Har’El, N., Sivan, R., & Soffer, A. (2004). Web-a-where: Geotagging Web Content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK (pp. 273–280).
Ankolekar, A., Krötzsch, M., & Vrandecic, D. (2007). The Two Cultures: Mashing up Web 2.0 and the Semantic Web: Position Paper. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 825–834).
Arch, X. (2007). Creating the Academic Library Folksonomy: Put Social Tagging to Work at Your Institution. College & Research Libraries News, 68(2), from http://www.ala.org/ala/mgrps/divs/acrl/publications/crlnews/2007/feb/libraryfolksonomy.cfm.
Arrington, M. (2005). Amazon Tags, from http://www.techcrunch.com/2005/11/14/amazon-tags.
Arrington, M. (2007a). Breaking: Yahoo To Shut Down Yahoo Photos In Favor Of Flickr, from http://www.techcrunch.com/2007/05/03/breaking-yahoo-toannounce-closure-of-yahoo-photos-tomorrow.
Arrington, M. (2007b). Exclusive: Screen Shots And Feature Overview of Delicious 2.0 Preview, from http://www.techcrunch.com/2007/09/06/exclusive-screenshots-and-feature-overview-of-delicious-20-preview.
Bächle, M. (2006). Social Software. Informatik Spektrum, 29(2), 121–124.
Bacigalupo, F., & Ziehe, M. (2005). Podcasting: Die Erweiterung der Medienlandschaft um Audiobeiträge von jedermann. Information Management & Consulting, 20(3), 81–87.
Bar-Ilan, J. (2004). Blogarians - A New Breed of Librarians. In Proceedings of the 67th ASIS&T Annual Meeting, Providence, RI, USA (pp. 119–128).
Bates, M. E. (2008). March 2008 InfoTip: Mining Technorati, from http://www.batesinfo.com/march-2008-infotip.html.
Bearman, D., & Trant, J. (2005). Social Terminology Enhancement Through Vernacular Engagement: Exploring Collaborative Annotation to Encourage Interaction with Museum Collections. D-Lib Magazine, 11(9), from www.dlib.org/dlib/september05/bearman/09bearman.html.
Bertram (2008). YouTube-Deutschland: Partner und Reichweite, from http://www.gugelproductions.de/blog/2008/youtube-deutschland-partner-undreichweite.html.
Blood, R. (2002). The Weblog Handbook: Practical Advice on Creating and Maintaining Your Blog. Cambridge, MA, USA: Perseus Publishing.
Braly, M., & Froh, G. (2006). Social Bookmarking in the Enterprise. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA, from http://enterprisetagging.org/2006asistsigcr-poster.pdf.
Braun, H., & Weber, V. (2006). Mehr als ein Hype - Web 2.0 im Praxiseinsatz. c't Magazin für Computertechnik, 14, 92, from http://www.heise.de/ct/06/14/092.
Breeding, M. (2007). Librarians Face Online Social Networks. Computers in Libraries, 27(8), 30–32. [Federal Ministry of Justice] Bundesministerium für Justiz (2008). Gesetz über Urheberrecht und verwandte Schutzrechte, from http://bundesrecht.juris.de/urhg/index.html. Campbell, A. (2007). Motivating Language Learners with Flickr. TESL-EJ, 11(2), from http://tesl-ej.org/ej42/m2.html. Casey, M. (2005). Librarians Without Borders, from http://www.librarycrunch.com/ 2005/09/librarians_without_borders.html. Casey, M. E., & Savastinuk, L. C. (2007). Library 2.0. A Guide to Participatory Library Service. Medford, NJ: Information Today. Cayzer, S. (2004). Semantic Blogging and Decentralized Knowledge Management. Communications of the ACM, 47(12), 47–52. Chad, K., & Miller, P. (2005). Do Libraries Matter? The Rise of Library 2.0, from www.talis.com/applications/downloads/white_papers/DoLibrariesMatter.pdf. Cheng, X., Dale, C., & Liu, J. (2007). Understanding the Characteristics of Internet Short Video Sharing: YouTube as a Case Study, from http://arxiv.org/PS_cache/ arxiv/pdf/0707/ 0707.3670v1.pdf. Chopin, K. (2008). Finding Communities: Alternative Viewpoints through Weblogs and Tagging. Journal of Documentation, 64(4), 552–575. Chun, S., Cherry, R., Hiwiller, D., Trant, J., & Wyman, B. (2006). Steve.museum: An Ongoing Experiment in Social Tagging, Folksonomy, and Museums. In Proceedings of Museums and the Web 2006, Albuquerque, NM, USA. Clark, J. A. (2006). AJAX. This Isn't the Web I'm Used To. Online - Leading Magazine for Information Professionals, Nov/Dec, 31–34. Coates, T. (2003). (Weblogs and) The Mass Amateurisation of (Nearly) Everything… from http://www.plasticbag.org/archives/2003/09/weblogs_and_the_ mass_amateurisation_of_nearly_everything/. Cormode, G., & Krishnamurthy, B. (2008). Key Differences between Web 1.0 and Web 2.0. First Monday, 13(6), from http://www.uic.edu/htbin/cgiwrap/bin/ojs/ index.php/fm/article/view/ 2125/1972. Cosentino, S. L. (2008). 
Folksonomies: Path to a Better Way? Public Libraries, 47(2), 42–47. Cox, A., Clough, P., & Marlow, J. (2008). Flickr: A First Look at User Behaviour in the Context of Photography as Serious Leisure. Information Research, 13(1), from http://informationr.net/ir/13-1/paper336.html. Coyle, K. (2007). The Library Catalog in a 2.0 World. The Journal of Academic Librarianship, 33(2), 289–291. Crane, D., Pascarello, E., & James, D. (2006). Ajax in Action. Das Entwicklerbuch für das Web 2.0. München: Addison-Wesley. Crawford, W. (2006). Library 2.0 and "Library 2.0". Cites & Insights: Crawford at Large, 6(2), from http://citesandinsights.info/civ6i2.pdf. Czardybon, A., Grün, P., Gust von Loh S., & Peters I. (2008). Gemeinschaftliches Selbstmarketing und Wissensmanagement in einem akademischen Rahmen. In Proceedings der 30. Online-Tagung der DGI, Frankfurt a.M., Germany (pp. 27– 42). Damianos, L., Griffith, J., & Cuomo, D. (2006). Onomi: Social Bookmarking on a Corporate Intranet. In Proceedings of the 15th International Conference on
World Wide Web, Edinburgh, Scotland, from http://www.semanticmetadata.net/ hosted/taggingws-www2006-files/28.pdf. Danowski, P., & Heller, L. (2006). Bibliothek 2.0: Die Zukunft der Bibliothek? Bibliotheksdienst, 40, 1259–1271. Dennis, B. M. (2006). Foragr: Collaboratively Tagged Photographs and Social Information Visualization. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland, from http://www.semanticmetadata.net/hosted/taggingws-www2006-files/3.pdf. Dobusch, L., & Forsterleitner, C. (Eds.) (2007). Freie Netze - freies Wissen: Ein Beitrag zum Kulturhauptstadtjahr Linz 2009. Wien: Echo Media Verl. Doctorow, C. (2002). My Blog, My Outboard Brain, from http://www.oreillynet.com/ pub/a/javascript/2002/01/01/cory.html. Downie, J. S. (2003). Music Information Retrieval. In B. Cronin (Ed.): Annual Review of Information Science and Technology (pp. 295-340). Medford, NJ: Information Today. Dugan, C., Muller, M. J., Millen, D. R., Geyer, W., Brownholtz, B., & Moore, M. (2007). The Dogear Game: A Social Bookmark Recommender System. In Proceedings of the International ACM Conference on Supporting Group Work, Sanibel Island, FL, USA (pp. 387–390). Dvorak, J. C. (2006). Web 2.0 Baloney, from http://www.pcmag.com/article2/ 0,2817,1931858,00.asp. Eck, K. (2007). Corporate Blogging. In T. Schwarz (Ed.), Leitfaden Online Marketing: Das kompakte Wissen der Branche (pp. 638–647). Waghäusel: marketingBörse. Efimova, L. (2004). Discovering the Iceberg of Knowledge Work: A Weblog Case. In Proceedings of the 5th European Conference on Organizational Knowledge, Learning and Capabilities, Innsbruck, Austria. Efimova, L., & de Moor, A. (2005). Beyond Personal Webpublishing: An Exploratory Study of Conversational Blogging Practices. In Proceedings of the 38th Hawaii International Conference on System Sciences 2005 (p. 107). Elbert, N. et al. (2005). Wie Blog, Flickr und Co. das Internet verändern. 
Handelsblatt online, from http://www.handelsblatt.com/technologie/it-internet/sozialerevolution-im-netz;923120. Farrell, S., & Lau, T. (2006). Fringe Contacts: People-Tagging for the Enterprise. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland. Farrell, S., Lau, T., Wilcox, E., & Muller, M. J. (2007). Socially Augmenting Employee Profiles with People-tagging. In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, Newport, Rhode Island, USA (pp. 91–100). Feinberg, M. (2006). An Examination of Authority in Social Classification Systems. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Figge, F., & Kropf, K. (2007). Chancen und Risiken der Bibliothek 2.0: Vom Bestandsnutzer zum Bestandsmitgestalter. Bibliotheksdienst, 41(2), 139–149. Furner, J. (2007). User Tagging of Library Resources: Toward a Framework for System Evaluation. In Proceedings of World Library and Information Congress, Durban, South Africa.
Garrett, J. J. (2005). Ajax: A New Approach to Web Applications, from http://www.adaptivepath.com/ideas/essays/archives/000385.php. Gaul, W., Geyer-Schulz, A., Hahsler, M., & Schmidt-Thieme, L. (2002). eMarketing mittels Recommendersystemen. Marketing - Zeitschrift für Forschung und Praxis, 24, 47–55. Geisler, G., & Burns, S. (2007). Tagging Video: Conventions and Strategies of the YouTube Community. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, Vancouver, BC, Canada (p. 480). Geleijnse, G., Schedl, M., & Knees, P. (2007). The Quest for Ground Truth in Musical Artist Tagging in the Social Web Era. In Proceedings of the 8th Conference on Music Information Retrieval, Vienna, Austria (pp. 525–530). Gilbertson, S. (2007). Amazon Wikis On The Moon, from http://blog.wired.com/ monkeybites/ 2007/01/amazon_wikis_on.html. Gissing, B., & Tochtermann, K. (2007). Corporate Web 2.0: Web 2.0 und Unternehmen - Wie passt das zusammen? Aachen: Shaker. Godwin-Jones, R. (2006). Tag Clouds in the Blogosphere: Electronic Literacy and Social Networking. Language Learning & Technology, 10(2), 8–15. Gomes, L. (2006). Will All of Us Get Our 15 Minutes On a YouTube Video? from http://online.wsj.com/public/article/SB115689298168048904.html. Gordon-Murnane, L. (2006). Social Bookmarking, Folksonomies, and Web 2.0 Tools. Searcher - The Magazine for Database Professionals, 14(6), 26–38. Graber, R. (2007). Engineering Village™ Introduces Record Tagging, from http://www.ei.org/news/news_2007_02_22.html. Graefe, G., Maaß, C., & Heß, A. (2007). Alternative Searching Services: Seven Theses on the Importance of “Social Bookmarking”, from http://ftp.informatik.rwthaachen.de/ Publications/CEUR-WS/Vol-301/Paper_1_Maas.pdf. Graham, J. (2006). Flickr of Idea on a Gaming Project Led to Photo Website, from http://www.usatoday.com/tech/products/2006-02-27-flickr_x.htm. Grassmuck, V. (2004). Freie Software. Zwischen Privat- und Gemeineigentum (Bd. 458). 
Bonn: Bundeszentrale für Politische Bildung. Gross, R., Acquisti, A., & Heinz, H. J. (2005). Information Revelation and Privacy in Online Social Networks. In Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society, Alexandria, VA, USA (pp. 71–80). Gruppe, M. (2007). Communities - alle Macht dem Kunden? from http://www.prportal.de/artikel/20070701-0858c880. Hamm, F. (2008). Web 2.0 Kongress: Das Web 2.0 wird erwachsen, from http://www.contentmanager.de/magazin/artikel_1824-print_web_2_0_kongress. html. Hammersley, B. (2005). Developing Feeds with RSS and Atom. Beijing: O'Reilly Media. Hammond, T., Hannay, T., Lund, B., & Scott, J. (2005). Social Bookmarking Tools (I). D-Lib Magazine, 11(4), from http://www.dlib.org/dlib/april05/hammond/ 04hammond.html. Hammwöhner, R. (2007). Qualitätsaspekte der Wikipedia. In C. Stegbauer, J. Schmidt & K. Schönberger (Eds.), Wikis: Diskurse, Theorien und Anwendungen. Sonderausgabe von kommunikation@gesellschaft, Jg. 8, from http://www.soz.uni-frankfurt.de/KG/B3_2007_ Hammwoehner.pdf. Hänger, C. (2008). Good tags or bad tags? Tagging im Kontext der bibliothekarischen Sacherschließung. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags
- Bad Tags: Social Tagging in der Wissensorganisation (pp. 63–71). Münster, New York, München, Berlin: Waxmann. Hayman, S. (2007). Folksonomies and Tagging: New Developments in Social Bookmarking, from http://www.educationau.edu.au/jahia/webdav/site/myjahiasite/shared/papers/arkhayman.pdf. Hayman, S., & Lothian, N. (2007). Taxonomy Directed Folksonomies: Integrating User Tagging and Controlled Vocabularies for Australian Education Networks. In Proceedings of the 73rd IFLA General Conference and Council, Durban, South Africa. Heller, L. (2006). Tagging durch Benutzer im OPAC: Einige Probleme und Ideen, from http://log.netbib.de/archives/2006/04/21/tagging-durch-benutzer-im-opaceinige-probleme-und-ideen. Heller, L. (2007a). Social Tagging in OPACs und Repositories: Weitere Ideen - und Details, from http://log.netbib.de/archives/2007/02/02/social-tagging-im-opacweitere-ideen-und-details. Heller, L. (2007b). Bibliographie und Sacherschließung in der Hand vernetzter Informationsbenutzer. Bibliothek. Forschung und Praxis, 31(2), 162–172. Heuwing, B. (2008). Tagging für das persönliche und kollaborative Informationsmanagement: Implementierung eines Social-Software Systems zur Annotation und Informationssuche in Bibliotheken, Magisterarbeit der Universität Hildesheim, from http://mybibproject.files.wordpress.com/2008/12/magisterarbeit_heuwing_tagging.pdf. Heymann, P., Koutrika, G., & Garcia-Molina, H. (2007). Fighting Spam on Social Websites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, 11(6), 36–45. Hornig, F. (2006). Du bist das Netz! Spiegel, 29, 60–74. Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006a). Bibsonomy: A Social Bookmark and Publication Sharing System. In Proceedings of the Conceptual Structure Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, Aalborg, Denmark (pp. 87–102). Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006b). Das Entstehen von Semantik in BibSonomy. 
In Social Software in der Wertschöpfung, Baden-Baden, Germany. Hubermann, B., Romero, D. M., & Wu, F. (2009). Social Networks That Matter: Twitter Under the Microscope. First Monday, 14(1), from http://www.uic.edu/ htbin/cgiwrap/bin/ojs/ index.php/fm/article/view/2317/2063. Hurley, C. (2007). YouToo. Forbes Magazine, from http://www.forbes.com/ free_forbes/2007/0507/068.html. IFLA (1998). Functional Requirements for Bibliographic Records. München: Saur. Jaffe, A., Naaman, M., Tassa, T., & Davis, M. (2006). Generating Summaries and Visualization for Large Collections of Geo-Referenced Photographs. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, USA (pp. 89–98). Jäschke, R., Grahl, M., Hotho, A., Krause, B., Schmitz, C., & Stumme, G. (2007). Organizing Publications and Bookmarks in BibSonomy. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada, from http://www2007.org/workshops/paper_25.pdf.
John, A., & Seligmann, D. (2006). Collaborative Tagging and Expertise in the Enterprise. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland, from http://www.semanticmetadata.net/hosted/taggingws-www2006-files/26.pdf. Jones, R. (2008). Free the Music, from http://blog.last.fm/2008/01/23/free-themusic. Kaden, B. (2008). Zu eng geführt: Debatte zur "Library 2.0". BuB - Forum für Bibliothek und Information, 60(2), 224–225. Kaiser, T. (2008). Schlechte Frisuren, miese Laune, gute Geschäfte. Welt Online, from http://www.welt.de/wams_print/article2661378/Schlechte-Frisuren-mieseLaune-gute-Geschaefte.html. Kalbach, J. (2008). Navigating the Long Tail. Bulletin of the ASIST, 34(2), 36–38. Kellog Smith, M. (2006). Viewer Tagging in Art Museums: Comparisons to Concepts and Vocabularies of Art Museum Visitors. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Kiss, J. (2008). Last.fm Widgets Boost User Numbers. Guardian, from http://www.guardian.co.uk/media/2008/feb/28/web20.digitalmedia. Kramer-Duffield, J., & Hank, C. (2008). Babies in Bathtubs: Public Views of Private Behaviors in the Flickr Domain. In Proceedings of the 71st ASIS&T Annual Meeting, Columbus, Ohio, USA. Krishnamurthy, B., & Wills, C. E. (2008). Characterizing Privacy in Online Social Networks. In Proceedings of the 1st Workshop on Online Social Networks, Seattle, WA, USA (pp. 37–42). Kumar, R., Novak, J., Raghavan, P., & Tomkins, A. (2004). Structure and Evolution of Blogspace. Communications of the ACM, 47(12), 35–39. Lackie, R. J., & Terrio, R. D. (2007). MASHUPS and Other New or Improved Collaborative Social Software Tools. MultiMedia & Internet@Schools, 14(4), 12–16. Lange, C. (2006). Web 2.0 zum Mitmachen: Die beliebtesten Anwendungen, from ftp://ftp.oreilly.de/pub/katalog/web20_broschuere.pdf. Law, E. L. M., von Ahn, L., & Dannenberg, R. (2007). Tagatune: A Game for Music and Sound Annotation.
In Proceedings of the 8th Conference on Music Information Retrieval, Vienna, Austria, from http://www.cs.cmu.edu/~elaw/papers/ISMIR2007.pdf. Leistner, M., & Spindler, G. (2005). Die Verantwortlichkeit für Urheberrechtsverletzungen im Internet – Neue Entwicklungen in Deutschland und in den USA. Gewerblicher Rechtsschutz und Urheberrecht - Internationaler Teil, 773–796. Lerman, K., & Jones, L. A. (2006). Social Browsing on Flickr, from http://arxiv.org/abs/cs/0612047. Lerman, K., Plangprasopchok, A., & Wong, C. (2007). Personalizing Image Search Results on Flickr, from http://arxiv.org/abs/0704.1676. Lessig, L. (2003). The Creative Commons. Florida Law Review, 55, 763–773. Levy, M., & Sandler, M. (2007). A Semantic Space for Music Derived from Social Tags. In Proceedings of the 8th Conference on Music Information Retrieval, Vienna, Austria (pp. 411–416). Linden, G. D., Jacobi, J. A., & Benson, E. A. (1998). Collaborative Recommendations Using Item-to-item Similarity Mappings, Patent-No. US6.266.649. Linden, G. D., Smith, B., & York, J. (2003). Amazon.com Recommendations. Item-to-item Collaborative Filtering. IEEE Internet Computing, 7(1), 76–80.
Loasby, K. (2006). Changing Approaches to Metadata at bbc.co.uk: From Chaos to Control and Then Letting Go Again. Bulletin of the ASIST, 33(1), 25–26. Lohrmann, J. (2008). Urheberrechtsverletzungen im Web 2.0: Die Suche nach Vergütungsalternativen in Zeiten von Youtube. München: GRIN Verlag GmbH. Lund, B., Hammond, T., Flack, M., & Hannay, T. (2005). Social Bookmarking Tools (II). A Case Study - Connotea. D-Lib Magazine, 11(4), from http://www.dlib.org/dlib/april05/lund/ 04lund.html. Maness, J. M. (2006). Library 2.0 Theory: Web 2.0 and Its Implications for Libraries. Webology, 3(2), from www.webology.ir/2006/v3n2/a25.html. Marlow, C., Naaman, M., d. boyd, & Davis, M. (2006a). HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 31–40). Marlow, C., Naaman, M., d. boyd, & Davis, M. (2006b). Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland. McFedries, P. (2006). Folk Wisdom. IEEE Spectrum, 43(Feb), 80. McMullin, J. (2003). Searching for the Center of Design, from http://www.boxesandarrows.com/view/searching_for_the_center_of_design. Meeyoung, C., Kwak, H., Rodriguez, P., Ahn, Y., & Moon, S. (2007). I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System. In Proceedings of the ACM Internet Measurement Conference, San Diego, CA, USA, from http://www.imconf.net/imc-2007/papers/imc131.pdf. Meng, P. (2005). Podcasting & Vodcasting. A White Paper. Definitions, Discussions & Implications, from http://www.tfaoi.org/cm/3cm/3cm310.pdf. Micolich, A. P. (2008). The Latent Potential of YouTube – Will It Become the 21st Century Lecturer's Film Archive? from http://arxiv.org/abs/0808.3441v1. Milgram, S. (1967). The Small World Problem. Psychology Today, 60–67. Millard, D. E., & Ross, M. (2006). Web 2.0: Hypertext by Any Other Name? 
In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 27–30). Millen, D. R., Feinberg, J., & Kerr, B. (2006). Dogear: Social Bookmarking in the Enterprise. In Proceedings of the Conference on Human Factors in Computing Systems, Montréal, Canada (pp. 111–120). Miller, P. (2005). Web 2.0: Building the New Library. Ariadne, 45, from http://www.ariadne.ac.uk/issue45/miller. Miller, P. (2006). Coming Together around Library 2.0: A Focus for Discussion and a Call to Arms. D-Lib Magazine, 12(4), from http://dlib.org/dlib/april06/miller/04miller.html. Mönnich, M., & Spiering, M. (2007). Bibtip – Recommendersystem für den Bibliothekskatalog. EUCOR-Bibliotheksinformationen - Informations des bibliothèques, 30, 4–8. Muller, M. J., Ehrlich, K., & Farrell, S. (2007). Social Tagging and Self-Tagging for Impression Management: IBM Research Technical Report TR 06-02, from http://domino.watson.ibm.com/cambridge/research.nsf/c9ef590d6d00291a85257 141004a5c19/27658d7dcf7e8cce852572330070244d/$FILE/TR2006-2.pdf. Naaman, M. (2006). Eyes on the World. Computers in Libraries, 39(10), 108–111, from http://infolab.stanford.edu/~mor/research/naamanComp06.pdf. netzeitung.de (2008). Wenn Physiker nicht dozieren, sondern rappen, from http://www.netzeitung.de/wissenschaft/1147457.html.
Nichani, M., & Rajamanickam, V. (2001). Grassroots KM Through Blogging, from http://www.elearningpost.com/articles/archives/grassroots_km_through_bloggin g. Noruzi, A. (2006). Folksonomies: (Un)Controlled Vocabulary? Knowledge Organization, 33(4), 199–203. Notess, G. R. (2006a). The Terrible Twos: Web 2.0, Library 2.0, and More. Online Leading Magazine for Information Professionals, 30(3), 40–42. Notess, G. R. (2006b). Web-based Software and the New Desktops on the Web. Online - Leading Magazine for Information Professionals, 30(4), 39–41. Oates, G. (2007). Heiliger Bimbam! from http://blog.flickr.net/de/2007/11/13/heiliger-bimbam. Orchard, L. M. (2006). Hacking del.icio.us. Indianapolis, IN: Wiley. O'Reilly, T. (2005). What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software, from http://www.oreillynet.com/pub/a/oreilly/tim/ news/2005/09/30/what-is-web-20.html. O'Reilly, T. (2006). Google Image Labeler, the ESP Game, and Human-Computer Symbiosis, from http://radar.oreilly.com/archives/2006/09/more-on-googleimage-labeler.html. Palen, L., & Dourish, P. (2003). Unpacking "Privacy" for a Networked World. In Proceedings of the Conference on Human Factors in Computing Systems, Ft. Lauderdale, Florida, USA. Pan, Y. X., & Millen D. R. (2008). Information Sharing and Patterns of Social Interaction in an Enterprise Social Bookmarking Service. In Proceedings of the 41st Hawaii International Conference on System Sciences. Paolillo, J. (2008). Structure and Network in the YouTube Core. In Proceedings of the 41st Hawaii International Conference on System Sciences. Patashnik, O. (1988). BibTeXing: Included in the BibTeX distribution. Peterson, E. (2008). Parallel Systems: The Coexistence of Subject Cataloging and Folksonomy. Library Philosophy & Practice, 10(1), http://digitalcommons.unl.edu/cgi/viewcontent.cgi?article=1182&context=libphilprac. Picot, A., & Fischer, T. (2006). Weblogs professionell. 
Grundlagen, Konzepte und Praxis im unternehmerischen Umfeld. Heidelberg: dpunkt.Verl. Pikas, C. K. (2005). Blog Searching for Competitive Intelligence, Brand Image, and Reputation Management. Online - Leading Magazine for Information Professionals, 29(4), 16–21. Plangprasopchok, A., & Lerman, K. (2008). Constructing Folksonomies from User-specified Relations on Flickr, from http://arxiv.org/PS_cache/arxiv/pdf/805/0805.3747v1.pdf. Plieninger, J. (2008). Bibliothek 2.0 und digitale Spaltung. BuB - Forum für Bibliothek und Information, 3, 220–223. Quain, J. R. (2005). Now, Audio Blogs for Those Who Aspire to Be D.J.’s. New York Times, from http://www.nytimes.com/2005/05/12/technology/circuits/12basics.html?ex=1183003200&en=3594cea139d051a5&ei=5070. Regulski, K. (2007). Aufwand und Nutzen beim Einsatz von Social-Bookmarking-Services als Nachweisinstrument für wissenschaftliche Forschungsartikel am Beispiel von BibSonomy. Bibliothek. Forschung und Praxis, 31(2), 177–184. Reinmann, G. (2008). Lehren als Wissensarbeit? Persönliches Wissensmanagement mit Weblogs. Information - Wissenschaft & Praxis, 59(1), 49–57.
Röll, M. (2003). Business Weblogs - A Pragmatic Approach to Introducing Weblogs in Medium and Large Enterprises. In T. N. Burg & R. Blood (Eds.), Proceedings of BlogTalks: European Conference on Weblogs (pp. 32–50). Wien: Cultural Research - Zentrum für Wiss. Forschung und Dienstleistung. Röll, M. (2004). Distributed KM - Improving Knowledge Workers' Productivity and Organisational Knowledge Sharing with Weblog-based Personal Publishing. In T. N. Burg (Ed.), Proceedings of BlogTalk 2.0: European Conference on Weblogs. Krems: Permalink - Zentrum für personenzentriertes Wissensmanagement. Röttgers, J. (2007). Am Ende der Flegeljahre: Das Web 2.0 wird erwachsen. c't Magazin für Computertechnik, 25, 148, from http://www.heise.de/ct/07/25/148. Sack, H., & Waitelonis, J. (2008). Zeitbezogene kollaborative Annotation zur Verbesserung der inhaltsbasierten Videosuche. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 107–118). Münster, New York, München, Berlin: Waxmann. Schachner, W., & Tochtermann, K. (2008). Web 2.0 und Unternehmen - das passt zusammen! (Bd. 2). Aachen: Shaker. Schachter, J. (2005). y.ah.oo! from http://blog.delicious.com/blog/2005/12/yahoo.html. Schmidt, J. (2006). Weblogs. Eine kommunikationssoziologische Studie. Konstanz: UVK Verl.-Ges. Schmidt, J. (2007). Social Software: Facilitating Information-, Identity- and Relationship Management. In T. N. Burg (Ed.), BlogTalks Reloaded: Social Software - Research & Cases (pp. 31–49). Norderstedt: Books on Demand. Schmitz, C., Hotho, A., Jäschke, R., & Stumme, G. (2006). Kollaboratives Wissensmanagement. In T. Pellegrini & A. Blumauer (Eds.), Semantic Web: Wege zur vernetzten Wissensgesellschaft (pp. 273–289). Berlin, Heidelberg: Springer. Schmitz, C., Grahl, M., Hotho, A., Stumme, G., Cattuto, C., & Baldassarri, A., et al. (2007). Network Properties of Folksonomies. AI Communications, 20(4), 245– 262. Schnake, A. (2007). Blogging Success Study. 
Corporate Blogs: Ein neuer Begriff von Dialog. Direkt Marketing, 4, 36–38. Schütt, P. (2006). Social Computing im Web 2.0. Wissensmanagement, (3), 30–33. Shaw, R. (2005). Web 2.0? It Doesn’t Exist, from http://blogs.zdnet.com/iptelephony/?p=805. Sifry, D. (2002). Technorati, from http://www.sifry.com/alerts/archives/000095.html. Sinha, R. (2006). Findability with Tags: Facets, Clusters, and Pivot Browsing, from http://rashmisinha.com/2006/07/27/findability-with-tags-facets-clusters-andpivot-browsing. Sixtus, M. (2006). Ein Streifzug durch das Web 2.0. c't - Magazin für Computertechnik, 5, 144, from http://www.heise.de/ct/06/05/144. Skusa, A., & Maaß, C. (2008). Suchmaschinen: Status Quo und Entwicklungstendenzen. In D. Lewandowski & C. Maaß (Eds.), Web-2.0-Dienste als Ergänzung zu algorithmischen Suchmaschinen (pp. 1–12). Berlin: Logos. Smith, G. (2008a). Tagging: Emerging Trends. Bulletin of the ASIST, 34(6), 14–17. Smith, G. (2008b). Tagging. People-powered Metadata for the Social Web. Berkeley: New Riders.
Spalding, T. (2007a). When Tags Work and When They Don't: Amazon and LibraryThing, from http://www.librarything.com/thingology/2007/02/when-tagsworks-and-when-they-dont.php. Spalding, T. (2007b). Tagmash: Book Tagging Grows Up, from http://www.librarything.com/thingology/2007/07/tagmash-book-tagging-growsup.php. Spiteri, L. (2006a). The Use of Collaborative Tagging in Public Library Catalogues. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Spiteri, L. (2006b). The Use of Folksonomies in Public Library Catalogues. The Serials Librarian, 51(2), 75–89. Spiteri, L. (2007). Structure and Form of Folksonomy Tags: The Road to the Public Library Catalogue. Webology, 4(2), from http://www.webology.ir/2007/v4n2/ a41.html. Stallman, R. M. (2004). Free Software, Free Society. Selected Essays of Richard M. Stallman (2nd ed.). Boston, MA: Free Software Foundation. Stegbauer, C., & Jäckel, M. (2008). Social Software - Herausforderungen für die mediensoziologische Forschung. In C. Stegbauer & M. Jäckel (Eds.), Social Software. Formen der Kooperation in computerbasierten Netzwerken (pp. 7–10). Wiesbaden: VS Verlag für Sozialwissenschaften / GWV Fachverlage GmbH Wiesbaden. Stock, W. G. (2007). Folksonomies and Science Communication: A Mash-Up of Professional Science Databases and Web 2.0 Services. Information Services & Use, (27), 97–103. Sturtz, D. (2004). Communal Categorization: The Folksonomy, from http://www.davidsturtz.com/drexel/622/sturtz-folksonomy.pdf. Stvilia, B., Twidale, M. B., Gasser, L., & Smith, L. (2005). Information Quality Discussions in Wikipedia. Technical Report ISRN UIUCLIS--2005/2+CSCW, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.3912&rep=rep1&t ype=pdf. Sweda, J. (2006). Using Social Bookmarks in an Academic Setting: Penn Tags. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Tepper, M. (2003). The Rise of Social Software. 
netWorker, 7(3), 19–23. Thompson, A. E. (2008). Playing Tag: An Analysis of Vocabulary Patterns and Relationships Within a Popular Music Folksonomy, Master’s Thesis from University of North Carolina at Chapel Hill, from http://hdl.handle.net/1901/535. Thompson, C. (2008). Brave New World of Digital Intimacy. The New York Times, from http://www.nytimes.com/2008/09/07/magazine/07awareness-t.html. Toffler, A. (1980). The Third Wave. New York, NY: Morrow. Torniai, C., Battle, S., & Cayzer, S. (2007). Sharing, Discovering and Browsing Geotagged Pictures on the Web, from http://www.hpl.hp.com/techreports/2007/ HPL-2007-73.pdf. Trant, J. (2006a). Exploring the Potential for Social Tagging and Folksonomy in Art Museums: Proof of Concept. New Review of Hypermedia and Multimedia, 12(1), 83–105. Trant, J. (2006b). Social Classification and Folksonomy in Art Museums: Early Data from the Steve.Museum Tagger Prototype. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Trant, J. (2009). Tagging, Folksonomy and Art Museums: Early Experiments and Ongoing Research. Journal of Digital Information, 10(1), from http://journals.tdl.org/jodi/article/view/270/277. Trant, J., & Wyman, B. (2006). Investigating Social Tagging and Folksonomy in Art Museums with Steve.museum. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland. Tredinnick, L. (2006). Web 2.0 and Business: A Pointer to the Intranets of the Future? Business Information Review, 23(4), 228–234. Tschetschonig, K., Ladengruber, R., Hampel, T., & Schulte, J. (2008). Kollaborative Tagging Systeme im Electronic Commerce. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 119–130). Münster, New York, München, Berlin: Waxmann. Turnbull, D., Barrington, L., & Lanckriet, G. (2008). Five Approaches to Collecting Tags for Music. In Proceedings of the 9th International Conference on Music Information Retrieval, Philadelphia, USA (pp. 225–230). Udell, J. (2004). Collaborative Knowledge Gardening: With Flickr and del.icio.us, Social Networking Goes Beyond Sharing Contacts and Connections, from http://www.infoworld.com/ article/04/08/20/34OPstrategic_1.html. van House, N. A. (2007). Flickr and Public Image-sharing: Distant Closeness and Photo Exhibition. In Proceedings of the Conference on Human Factors in Computing Systems, San Jose, CA, USA (pp. 2717–2722). van Veen, T. (2006). Serving Services in Web 2.0. Ariadne, 47, from http://www.ariadne.ac.uk/issue47/vanveen. van Zwol, R. (2007). Flickr: Who is Looking? In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Silicon Valley, California (pp. 184–190). Voß, J. (2007). LibraryThing – Web 2.0 für Literaturfreunde und Bibliotheken, from http://eprints.rclis.org/12663. von Ahn, L. (2005). Method for Labeling Images through a Computer Game. Patent No. US20050014118. von Ahn, L. (2006). Games with a Purpose. 
Computer, 96–98. von Ahn, L., & Dabbish, L. (2004). Labeling Images with a Computer Game. In Proceedings of the Conference on Human Factors in Computing Systems, Vienna, Austria (pp. 319–326). von Ahn, L., & Dabbish, L. (2005). ESP: Labeling Images with a Computer Game. In Proceedings of the AAAI Spring Symposium, Stanford, CA, USA (pp. 91–98). von Ahn, L., & Dabbish, L. (2008). Designing Games with a Purpose. Communications of the ACM, 51(8), 58–67. Watts, D. J., & Strogatz, S. H. (1998). Collective Dynamics of 'Small-world' Networks. Nature, 393, 440–442. Webb, P. L. (2007). YouTube and Libraries: It Could Be a Beautiful Relationship. C&RL News, 68(6), from http://www.ala.org/ala/mgrps/divs/acrl/publications/ crlnews/2007/jun/youtube. cfm. Wegner, J. (2005). Von Millionen für Millionen. Focus, 10, 92–100. Weiss, A. (2005). The Power of Collective Intelligence. netWorker, 9(3), 16–23. West, J. (2007). Subject Headings 2.0: Folksonomies and Tags. Library Media Connection, 25(7), 58–59. Westcott, J., Chappell, A., & Lebel, C. (2009). LibraryThing for Libraries at Claremont. Library Hi Tech, 27(1).
Wetzker, R., Zimmermann, C., & Bauckhage, C. (2008). Analyzing Social Bookmarking Systems: A del.icio.us Cookbook. In Proceedings of the 18th European Conference on Artificial Intelligence, Patras, Greece. Winget, M. (2006). User-Defined Classification on the Online Photo Sharing Site Flickr…or, How I Learned to Stop Worrying and Love the Million Typing Monkeys. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Wusteman, J. (2004). RSS: The Latest Feed. Library Hi Tech, 22(4), 404–413. Yen, Y. (2008). YouTube Looks for the Money Clip, from http://techland.blogs.fortune.cnn.com/2008/03/25/youtube-looks-for-the-money-clip. Zang, N., Rosson, M. B., & Nasser, V. (2008). Mashups: Who? What? Why? In Proceedings of the Conference on Human Factors in Computing Systems, Florence, Italy (pp. 3171–3176).
Chapter 2
Basic Terms in Knowledge Representation and Information Retrieval
This chapter will provide an overview of the most basic information-scientific terms in the areas of knowledge representation and information retrieval. Particular attention will be paid to terms vital for the comments on folksonomies in knowledge representation and information retrieval that will follow in later chapters. For a complete illustration, please consult the textbooks and standard works on knowledge representation (Stock & Stock, 2008; Bertram, 2005; Kuhlen, Seeger, & Strauch, 2004; Lancaster, 2003; Taylor, 1999; Aitchison, Gilchrist, & Bawden, 2000) and information retrieval (Manning, Raghavan, & Schütze, 2008; Stock, 2007b; Kuhlen, Seeger, & Strauch, 2004; Ferber, 2003; Frakes & Baeza-Yates, 1992; Sparck-Jones & Willett, 1997), as well as the further bibliographical notes within the definitions.
Introduction to Knowledge Representation

Knowledge representation in information science serves to display and arrange information resources (books, pictures, websites etc.) via classification systems, which previously had been used mainly as shelving orders in libraries, content-describing elements in scientific documentation or as means for modelling knowledge in artificial intelligence research. The resource is not always displayed immediately, but represented in the database through ‘placeholders’ or ‘metadata.’ The metadata serve as a ‘navigation layer’ (Kalbach, 2008, 36), since they allow the user to search the full text or the non-textual resources more quickly and more easily, and also support the user in rating the resource’s relevance:

“Metadata can enhance the process of resource discovery by disclosing sufficient information about a resource to enable users or intelligent agents to discriminate between what is relevant and what is irrelevant to a specific information need” (Macgregor & McCulloch, 2006, 291).

This proxy function is often assumed by terms that provide for the resource’s rendering, and that are not limited to labels or words. The terms provide the classification system’s users access to the information resources, which means that knowledge representation also serves to join the different user vocabularies of indexers and researchers. Only a match between the searching and indexing vocabularies results in a display of the resource as a search result. The retrieval of resources is knowledge representation’s primary goal. The classification system, or metadata, absolutely need a structure in order to effectively display the information resources’
content: “For metadata to be useful, it needs structure and context, even technically generated metadata” (Kalbach, 2008, 37). The methods of knowledge representation are thus normalized via different sets of regulations, such as DIN 1463/1 (1987) for the creation of thesauri. In addition, term classifications always require a consensus on the knowledge they display and on the terms they use before they can be structured at all. For folksonomies, the degree of consensus is kept small: “Tagging is more popular than Semantic Web taxonomies precisely because it uses a minimal amount of convention” (Halpin & Shepard, 2006).

The indexing of information resources is always a two-step process, consisting of: 1) content analysis, meaning the registration of the resource’s content, and 2) the allocation of terms according to the first step, i.e. the translation of the resource into the classification system’s terminology. Since the second step of indexing allocates terms from a classification system to the information resource, it is often referred to as categorization. Corresponding resources are summarized in the same categories and thus differentiated from resources belonging to other categories: “The cataloger is naming the work and distinguishing it from other works, yet is also grouping the work with similar entities” (Peterson, 2006). Both steps of knowledge representation always require an interpretation on the part of the indexer. The entire process of indexing content can be represented by a tripartite network consisting of indexer / information resource / method of knowledge representation (see Figure 2.1):

“Indexing is an act where an indexer in a particular context, goes through a process of analyzing a document for its significant characteristics, using some tools to represent those characteristics in an information system for a user” (Tennis, 2006).

On the indexer’s side, three different types of agent can be distinguished: authors, professional indexers and users (Mathes, 2004; Kipp, 2006a; Kipp, 2006b; Stock, 2007a). These agents walk different indexing paths and probably also highlight different characteristics of the respective resources. Authors appear in knowledge representation when they either index their own information resource themselves or when the resource is indexed via text-oriented methods of knowledge representation, such as the text-word method or citation indexing. Automatic indexing via terms extracted from the information resource can also be counted among these methods, since here, too, direct use is made of the author’s own specific terms. As opposed to these methods, folksonomies consider not only the author’s but also the users’ language – that is to say, the language of all agents with access to the resource (Sinclair & Cardew-Hall, 2008). Furthermore, folksonomies distinguish themselves from traditional controlled vocabularies by not being developed in advance but growing organically, and developing with increased usage (Weinberger, 2006; Quintarelli, 2005; Shirky, 2005a).

If the information resource is indexed using knowledge organization systems such as nomenclatures, thesauri or classification systems, interpreters or translators are needed to use it. For one thing, these are the experts that create the knowledge organization system or the controlled vocabulary, but they are also the ones using the vocabulary for indexing purposes. The difficulty here is that the developers of controlled vocabularies must first analyze the “literature, needs, actors, tasks, domains, activities, etc.” (Mai, 2006, 17) in order to then build the controlled vocabulary around them. Mathes (2004) sees this as the overriding problem of the traditional indexing methods and thus endorses the use of folksonomies to counteract it – “but both approaches share a basic problem: the intended and unintended users of the information are disconnected from the process” (Mathes, 2004).
Figure 2.1: Classification of the Methods of Knowledge Representation. Source: Peters & Stock (2007, Fig. 8).
Figure 2.2: Advantages and Disadvantages of Differently Created Metadata. Source: Kalbach (2008, 37, Table 1).
The problem of consistent indexing with different indexers (Inter-Indexer Consistency), or with the same indexer at different times (Intra-Indexer Consistency) is often mentioned in this context (Markey, 1984; Stock & Stock, 2008, 358ff.). Common to all agents, however, is that they interpret the resource’s content from within their own environment and translate it into a method of knowledge representation – the same goes for information retrieval, where the researcher must first formulate his information request and then transfer it into a controlled vocabulary. The advantages
and disadvantages created by the different agents are summarized by Kalbach (2008) in Figure 2.2.

Weller (2007) classifies the methods of knowledge representation with regard to their expressiveness and coverage of the knowledge domain (see Figure 2.3): a knowledge organization system’s expressiveness is measured by the number and specificity of the relations it uses, its coverage of the knowledge domain by the size of the represented subject area (Macgregor & McCulloch, 2006). Weak methods of knowledge representation, such as folksonomies, can be applied to a large knowledge domain but are relatively inexpressive since they make no use of semantic relations. Any extension of knowledge organization systems from the right side of the spectrum to the left increases the semantics and thus the expressiveness of the vocabulary; however, its applicability to large or separate knowledge domains is severely restricted, since the semantic relations would have to represent every single connection between the different terms, which would take enormous time and effort to accomplish.
Figure 2.3: Classification of the Methods of Knowledge Representation with regard to Expressiveness and Coverage of the Knowledge Domain. Source: Adapted from Peters & Weller (2008, 101, Fig. 1).
Voß (2007) also classifies the indexing methods according to their degree of terminological control and growing semantic structure:

“Term lists like authority files, glossaries, gazetteers, and dictionaries emphasize lists of terms often with definitions. Classifications and categories like subject headings and classification schemes (also known as taxonomies) emphasize the creation of subject sets. Relationship groups like thesauri, semantic networks, and ontologies emphasize the connections between concepts” (Voß, 2007).

A problem with controlled vocabularies (Aitchison, Gilchrist, & Bawden, 2000, 6) is that they often consider very few perspectives in knowledge representation and the indexing of content, since they only specify a limited number of terms for indexing and searching purposes, thus complicating the process of information gathering: “traditional knowledge organization systems look like stone age technique: effective but just too uncomfortable” (Voß, 2007). Furthermore, changes to user needs, information resources or to the terminology can often be registered only very slowly (Giunchiglia, Marchese, & Zaihrayeu, 2005; Medeiros, 2008). That is why a more flexible method of knowledge representation is often requested:

“For static information – entries in an encyclopedia, for instance, or books in a library – a taxonomy makes it easier to put everything in its place. Dynamic information – RSS feeds, blogs, or any of the other ways to stream content virtually – needs a more flexible categorization that can quickly adapt to change” (Dye, 2006, 42).

The classification of information resources into categories and the corresponding fallacy of the one-time-only allocation (you can only put a book on one shelf) is criticized (Shirky, 2005a). Crawford (2006) clarifies:

“There is nothing in classification and taxonomy systems that inherently requires that they be ‘folders’, that items have one and only one name. Ever seen a cataloging record with more than one subject heading?” (Crawford, 2006).

Quintarelli (2005) summarizes the criticism faced by controlled vocabularies:
• Items do not always fit exactly inside one and only one category.
• Hierarchies are rigid, conservative, and centralized. In a word, inflexible.
• Hierarchical classifications are influenced by the cataloguer's view of the world and, as a consequence, are affected by subjectivity and cultural bias.
• Rigid hierarchical classification schemes cannot easily keep up with an increasing and evolving corpus of items.
• Hierarchical classifications are costly, complex systems requiring expert cataloguers to guess the users' way of thinking and vocabulary (mind reading).
• Hierarchies require predictions on the future to be stable over time (fortune telling).
• Hierarchies tend to establish only one consistent authoritative structured vision. This implies a loss of precision, erases difference of expression, and does not take into account the variety of user needs and views.
• Hierarchies need expert or trained users to be applied consistently.

The task of knowledge representation is to provide clarity about the information content of a resource and thus to reduce the users’ uncertainty – both in rating the resource (relevant or not?) and in formulating their search request. These two aspects are becoming more important than ever, given the exponential growth of information and resources (Voß, 2007; Macgregor & McCulloch, 2006):

“Information creation has continued apace and methods for storing and transmitting material electronically, especially via the Internet, have only increased the average user’s thirst for access information. This rapid expansion of both information and access to it is rapidly outpacing attempts to enhance organisation and retrieval and creates a fresh need for new methods of information organisation. […] Information organisation is intended to reduce the difficult inherent in searching large document spaces for information” (Kipp, 2007, 63f.).
Paradigmatic and Syntagmatic Relations

The goal of information-scientific knowledge representation is to represent content in order to improve its retrievability. In order to guarantee this, it uses not only the full text, or the words, terms and labels of the resources to be indexed, but additionally selects terms to represent the content. A term here is the summary of objects or concepts under one category, which means that different words can verbalize the same meaning. In knowledge organization systems (short: KOS; Zeng, 2000; Tudhope & Nielsen, 2006), terms are connected with each other via relations and thus integrated into a semantic network:

“Knowledge Organization Systems […] attempt to model the underlying semantic structure of a domain for the purposes of retrieval. […] The presentation of concepts in hierarchies and other semantic structures helps both indexer and searcher choose the most appropriate concept for their purposes” (Tudhope & Nielsen, 2006, 4).

Generally, there are two sorts of relations: a) paradigmatic relations are hardwired links, i.e. the terms are connected via specific relations of the KOS, and b) syntagmatic relations are ‘ad hoc’ relations, which means that the connection of the terms is brought about by their ‘co-occurrence’ in a resource. Paradigmatic relations are generally hierarchical relations, equivalence relations and association relations, which can be further split up:
• Hierarchical Relation:
  o Hyponymy or abstraction relation: two terms are linked by an abstraction relation if term B comprises all characteristics of term A plus at least one additional characteristic, e.g. bird – parakeet.
  o Meronymy or part-of relation: two terms are linked by a part-of relation if term B is a part of term A, e.g. hand – finger.
  o Instance: is an individual term, e.g. Audi Q7.
• Equivalence Relation:
  o Synonymy: two terms are synonymous if they share a common meaning. Quasi-synonyms are summarized for better practicability, even when their meaning is not exactly the same.
  o Genidentity: two terms are genidentical if they refer to an entity which has possessed different extensions or intensions at various points in time (e.g. Russia and Soviet Union: different intension and extension).
• Association Relation:
  o “See Also”: terms can be connected with the “see also” link if they are related in some way, but neither hierarchically nor synonymously.
  o Various other relations: since association relations can display all manner of connection between terms, they are very diverse.

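The difference between the two sorts of relations can be illustrated with a minimal Python sketch. It assumes a toy KOS and toy indexed resources; the names `PARADIGMATIC` and `syntagmatic_relations` as well as the sample data are hypothetical and only serve the illustration. Paradigmatic relations are stored in advance, while syntagmatic relations are derived from co-occurrence.

```python
from collections import Counter
from itertools import combinations

# Paradigmatic relations: fixed links recorded in the KOS itself
# (toy data taken from the examples in the list above).
PARADIGMATIC = {
    ("bird", "parakeet"): "hyponymy",
    ("hand", "finger"): "meronymy",
    ("Russia", "Soviet Union"): "genidentity",
}


def syntagmatic_relations(resources):
    """Count how often each pair of terms co-occurs in the same resource."""
    pairs = Counter()
    for terms in resources:
        # Each unordered pair of distinct terms counts once per resource.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical resources, each represented by its indexing terms.
resources = [
    ["bird", "parakeet", "cage"],
    ["bird", "cage"],
    ["hand", "finger"],
]
cooccurrence = syntagmatic_relations(resources)
```

In this sketch the pair (bird, cage) is linked only syntagmatically, because both terms were assigned to the same resources; no relation between them is recorded in the KOS.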
Ontologies

Ontologies are the most detailed method of knowledge representation and are meant to serve the ‘semantic web,’ mainly to facilitate the interaction between man and computer, as well as between computer and computer. The term ‘ontology’ is from philosophy, where it describes ‘the nature of being.’ In knowledge representation, ontologies mainly serve the representation of terms and their relations to each other, with the objective of providing an exhaustive display of a limited knowledge domain. The classic definition of an ontology is as follows: “An ontology is an explicit specification of a conceptualization” (Gruber, 1993). This definition has been heavily modified and enhanced: “Ontologies are specifications of the conceptualizations at a semantic level” (Gruber, 2005). Ontologies make use of IT’s tools (e.g. formalized languages) in order to enable computers to access and edit the represented domain knowledge. The formalized relations between terms and the defined rules for their applicability should allow computers to draw automatic inferences about certain subjects and thus approximate the human way of thinking and working. Ontologies are thus also an important part of artificial intelligence research. Since ontologies make heavy use of relations, with the general association relation in particular being split up into a multitude of more specific relations (e.g. has_author, is_located_in, etc.), their expressiveness is enormous; if the degree of detail is high, however, it can only come to bear on a limited knowledge domain (e.g. on human genetic information, as in the Gene Ontology; see footnote 120):

“Ontologies, too, aim for the structured representation of a knowledge domain in the form of terms (mostly concepts, but also class) and their relations to one another. Other than with thesauri, these relations are freely definable: the association relation is pretty much dissolved, making explicit the single forms of association … This provides for a greater leeway in representing and managing knowledge” (Weller, 2006, 227).

The term ‘ontology’ is often used inconsistently, or as a blanket term for all structured knowledge organization systems. Stock and Stock (2008) recommend calling knowledge organization systems ontologies only if they have the following features:
• the use of a standardized ontology language (e.g. OWL),
• the option of automatic inference using terminological logic,
• the occurrence of and differentiation by common terms and instances (e.g. ‘director’ and ‘Steven Spielberg’),
• the use of specific relations (next to hierarchical relations).

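Three of these features (common terms versus instances, specific relations, automatic inference) can be sketched in plain Python. This is only an illustration under stated assumptions: it does not use a standardized ontology language such as OWL, and the triples, the relation names `subclass_of`, `instance_of` and `has_director`, and both function names are hypothetical.

```python
# Toy knowledge base of (subject, relation, object) triples.
# A real ontology would be written in a language such as OWL;
# all relation names here are assumptions made for this sketch.
TRIPLES = {
    ("director", "subclass_of", "filmmaker"),
    ("filmmaker", "subclass_of", "person"),
    ("Steven Spielberg", "instance_of", "director"),
    ("Jaws", "has_director", "Steven Spielberg"),  # a specific relation
}


def superclasses(concept, triples):
    """Return all concepts reachable from `concept` via subclass_of."""
    found = set()
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for subj, rel, obj in triples:
            if subj == current and rel == "subclass_of" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def inferred_types(instance, triples):
    """Infer every concept an instance belongs to, directly or by inheritance."""
    direct = {obj for subj, rel, obj in triples
              if subj == instance and rel == "instance_of"}
    inferred = set(direct)
    for concept in direct:
        inferred |= superclasses(concept, triples)
    return inferred


types_of_spielberg = inferred_types("Steven Spielberg", TRIPLES)
```

Only the fact that the instance is a ‘director’ is stored explicitly; that it is also a ‘filmmaker’ and a ‘person’ follows from the hierarchy, which is the kind of automatic inference the second feature refers to.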
Footnote 120: http://www.geneontology.org.

Thesauri

Thesauri are natural-language knowledge organization systems, which link their terms with one another via hierarchical, equivalence and association relations (Aitchison, Gilchrist, & Bawden, 2000). Furthermore, the thesaurus collects all versions of any given term in a descriptor set – “concepts instead of terms” (Voß, 2006) – and defines a preferred term (descriptor) that must be used to index and retrieve the resource. All other terms in the descriptor set are labelled non-descriptors, only refer to the preferred term and serve as an access vocabulary to the resource (see Figure 2.4). The terminological control clearly puts the terms in relation with one another by completely recording synonyms, marking homonyms and polysemes, and specifying a preferred term for each term. Term control then links the terms via paradigmatic relations. During the indexing of content, any number of descriptors can be allocated to the resource (within the framework of specified rules); i.e. each issue brought forward should be expressed via descriptors. The focus of thesauri in knowledge representation is the differentiation of resources: they use descriptors to isolate one resource from the others. A thesaurus’ look and structure are heavily normalized: in Germany by the DIN standard 1463/1 (1987), in the USA by ANSI/NISO Z39.19 (2003), and internationally by the ISO standard 2788 (1986).
Figure 2.4: A Thesaurus. Source: Redmond-Neal & Hlava (2005, 5).
The natural-language representation of terms is the advantage and the disadvantage of thesauri. The descriptors may make access to the resources easier for the user, but must be summarized into one descriptor for multilingual resources. Since natural language is in a perpetual state of flux, the thesaurus, too, must be adjusted to innovations in language use or delete terms no longer in use. Additionally, each descriptor set and each relation must be controlled and adjusted as needed. Since this is mainly an intellectual task, the procedure is demanding and expensive. In information retrieval, thesauri distinguish themselves through easy navigation using the alphabetical arrangement of the descriptors and their relations, thus providing for the finding of suitable search terms. However, the user must always search using the prescribed vocabulary, and can only incorporate hyponyms or hyperonyms into his search automatically when the retrieval system offers such functionalities.
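How a descriptor set works as an access vocabulary, and how a retrieval system might offer the optional inclusion of hyponyms, can be sketched as follows. The thesaurus entries, the names `DESCRIPTOR_SETS`, `NARROWER`, `to_descriptor` and `expand_query` are hypothetical and not taken from any real thesaurus or retrieval system.

```python
# Hypothetical descriptor sets: preferred term -> its non-descriptors.
DESCRIPTOR_SETS = {
    "vehicle": set(),
    "automobile": {"car", "motorcar", "passenger car"},
    "bicycle": {"bike", "cycle"},
}

# Hypothetical hierarchical relation: descriptor -> narrower descriptors.
NARROWER = {"vehicle": ["automobile", "bicycle"]}

# Access vocabulary: every term of a descriptor set points to the descriptor.
ACCESS = {}
for descriptor, non_descriptors in DESCRIPTOR_SETS.items():
    ACCESS[descriptor] = descriptor
    for variant in non_descriptors:
        ACCESS[variant] = descriptor


def to_descriptor(term):
    """Map a search or indexing term to its preferred term, if there is one."""
    return ACCESS.get(term.lower())


def expand_query(term, include_narrower=False):
    """Return the descriptor for `term`, optionally followed by all hyponyms."""
    descriptor = to_descriptor(term)
    if descriptor is None:
        return []  # term is not part of the prescribed vocabulary
    result = [descriptor]
    if include_narrower:
        queue = list(NARROWER.get(descriptor, []))
        while queue:
            current = queue.pop(0)
            if current not in result:
                result.append(current)
                queue.extend(NARROWER.get(current, []))
    return result


plain = expand_query("vehicle")
expanded = expand_query("vehicle", include_narrower=True)
```

A search for ‘Car’ is redirected to the descriptor ‘automobile’, whereas a term outside the vocabulary yields nothing, which mirrors the restriction to the prescribed vocabulary described above.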
Classification Systems

Unlike thesaurus and ontology, the classification system does not work with natural-language terms but with notations, which are often formed from a combination of numbers and letters. Furthermore, the classification system only displays hierarchical relations between the terms (see Figure 2.5). The classification system, too, is subject to strict standards in terms of look and structure: in Germany, by the DIN standard 32705 (1987).
Figure 2.5: Example of a Classification System: International Patent Classification. Source: European Patent Classification (http://v3.espacenet.com/eclasrch).
Classification systems were initially very popular at libraries, where they were used as shelving systems for books. Thus classification systems in knowledge representation focus particularly on the centralization of identical resources: “Traditionally, categorization has been viewed as defining some specified bounds that can serve as abstract containers to group items with certain shared properties” (Lin et al., 2006). Some well-known examples are the Dewey Decimal System (contains ten main classes which are meant to store all the world’s knowledge), the International Patent Classification (currently contains around 69,000 classes by which patents are classified) and the Open Directory Project for the classification of websites (with 16 main classes at the moment). Classification systems can be expanded via additional symbols, which attach further semantic information to the class, thus refining it. The classification system’s advantages are mainly its language-independent, notation-based design and the user-friendly navigation options through the different hierarchy levels. Furthermore, sources of semantic error, such as homonyms, are avoided by the classes. In information retrieval, systems with structure-displaying hierarchical relations can incorporate hyponyms into the search process through the use of truncations. The main disadvantage lies in the notations themselves, which can be hard for the user to comprehend and memorize. A good knowledge of the classification system is thus vital for knowledge representation as well as information retrieval. The prescribed structure with n classes (e.g. ten classes for decimal classification) makes it hard for classification systems to expand breadth-wise. Refining the classes in depth is, however, easily implementable.
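The use of truncation to include hyponyms can be sketched in a few lines of Python. The sketch assumes decimal-style notations in which every subclass notation begins with the notation of its superclass; the resource identifiers, the notations and the function name `search_truncated` are hypothetical.

```python
# Hypothetical resources indexed with decimal-style notations.
# Assumption: a subclass notation always starts with its superclass notation.
RESOURCES = {
    "doc1": "510",
    "doc2": "516",
    "doc3": "516.3",
    "doc4": "520",
}


def search_truncated(query, resources):
    """Return matching resource ids; a trailing '*' means right truncation."""
    if query.endswith("*"):
        prefix = query[:-1]
        # Right truncation: every notation starting with the prefix matches,
        # so all subclasses (hyponyms) are included automatically.
        return sorted(r for r, n in resources.items() if n.startswith(prefix))
    # Without truncation only the exact class matches.
    return sorted(r for r, n in resources.items() if n == query)


exact = search_truncated("516", RESOURCES)
with_subclasses = search_truncated("516*", RESOURCES)
```

The approach only works because the hierarchy is visible in the notation itself, which is what ‘structure-displaying’ refers to above.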
Nomenclatures

Nomenclatures are controlled vocabularies that are used for the indexing of content as well as information retrieval. They consist of natural-language terms, or descriptors, and equivalence as well as ‘see also’ relations between the descriptors. Nomenclatures make no use of hierarchical relations. The descriptors cannot be chosen freely for indexing and retrieval but are prescribed by the nomenclature, e.g. the Library of Congress Subject Headings. Homonyms are dissociated from each other via explanatory notations (e.g. two separate entries for ‘Bridge,’ each with its own qualifier), while synonyms are summarized, with a preferred term being specified. Nomenclatures also consider the genidentity relation. Nomenclatures or subject headings are known from libraries, where they provide natural-language access to the resources, and from documentation in the natural sciences, such as the CAS Registry in chemistry.
Text-Word Method

The text-word method aims to develop a KOS-independent yet natural-language method of knowledge representation. This is its fundamental difference to controlled vocabularies such as thesauri or classification systems, which prescribe a certain number of terms for knowledge representation as well as information retrieval. The text-word method only considers the ‘text-words’ that actually occur in the resources, and uses these for indexing. Paradigmatic relations between these words are not taken into consideration; however, their syntagmatic connection is formalized by a form of syntactical indexing (Henrichs, 1970a; 1970b). This also facilitates a weighted information retrieval via the text-word chains. The text-word method is applicable in areas that do not use a firm terminology, or that use a terminology inconsistently, or that do not have a knowledge organization system yet. The text-word method leaves little room for interpretation, because the authors’ language is directly adopted, which also makes it a suitable empirical source of terms for controlled vocabularies. However, since this indexing method does not use terms for indexing, the user must be familiar with all possible linguistic variants in order to perform effective searches.
Citation Indexing

Citation indexing can be applied wherever citation occurs, which is mainly in scientific literature, but also in legal and patent literature (Schmitz, 2007). The main assumption behind this indexing method is that references as well as citations both provide information on the resource’s content (Garfield, 1979). A resource A refers to another resource B if B is to be found in A’s bibliography; conversely, B is then cited by A. One can reconstruct the resource’s genesis, and thus demonstrate the connection of the resources’ content, via the references; citations show the resource’s effect on other resources, which in turn provides information on the thematic connection of both resources. These relations are exploited for knowledge representation and information retrieval. Starting with one resource, the user can navigate ‘backwards’ on a timeline and thus call up the older resources, or search ‘forwards,’ in order to see the resource’s echo in other publications. These links via citations and references also help the user to gain an overview of a subject or topic. Since citation indexing is not dependent on words, it facilitates a language-independent retrieval.
Figure 2.6: Bibliographic Coupling and Co-Citation. Source: Stock & Stock (2008, 334, Fig. 18.7).
Two special forms of linking references and citations are ‘bibliographic coupling’ (Kessler, 1963) and ‘co-citations’ (Small, 1973) (see Figure 2.6). Two resources D1 and D2 are bibliographically coupled if both refer to the same resource A, B or C (bottom half of Figure 2.6); two resources D1 and D2 are linked via co-citation if they are cited together by the resources X, Y or Z (top half of Figure 2.6). Knowledge of this particular connectedness between resources has enormous repercussions, particularly on the theories of information retrieval (see link topology).
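Both forms of linking can be computed directly from the bibliographies. The following Python sketch uses the constellation described above (D1 and D2 refer to A, B and C and are cited by X, Y and Z); the data structure `REFERENCES` and both function names are assumptions made for this illustration, and counting the shared references or citing resources as a measure of strength is one simple choice among several.

```python
# REFERENCES[x] = set of resources listed in x's bibliography
# (toy data following the constellation of Figure 2.6).
REFERENCES = {
    "D1": {"A", "B", "C"},
    "D2": {"A", "B", "C"},
    "X": {"D1", "D2"},
    "Y": {"D1", "D2"},
    "Z": {"D1", "D2"},
}


def bibliographic_coupling(d1, d2, references):
    """Number of references that d1 and d2 have in common."""
    return len(references.get(d1, set()) & references.get(d2, set()))


def co_citation(d1, d2, references):
    """Number of resources whose bibliography contains both d1 and d2."""
    return sum(1 for cited in references.values()
               if d1 in cited and d2 in cited)


coupling_strength = bibliographic_coupling("D1", "D2", REFERENCES)
co_citation_strength = co_citation("D1", "D2", REFERENCES)
```

Bibliographic coupling is fixed once both resources are published, since bibliographies do not change, whereas the co-citation count can grow whenever a new resource cites both D1 and D2.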
Knowledge Representation in the World Wide Web

The genesis of the World Wide Web led to a marked increase in available resources and information. This called for a suitable indexing in order to keep track of the mass of data or to at least provide access to the resources. Apart from the search engines’ simple full-text indexing, it was initially attempted to transfer categorization methods from libraries to the internet (Weinberger, 2006; Shirky, 2005a; Shirky, 2005b). Some web catalogs (e.g. the Open Directory Project or Yahoo! Web Catalog) were developed that tried to classify websites into simple categories while counting on input from the user community (Skusa & Maaß, 2008). However, since these allocations were mainly processed manually, problems quickly arose concerning employee capacity and the appropriate indexing and allocation of the digital resources. The internet grew too fast to find, consider, analyze and allocate every website. Thus web catalogs quickly became obsolete and were disregarded by users for both searching and indexing purposes. Standardized metadata meant to be indexed by the websites’ proprietors were supposed to fill the gap, the so-called Dublin Core
Metadata. Analogously to the author tags in traditional knowledge representation, these metadata would be available from the moment the website went online, thus helping search engines and researchers by facilitating a field-based search and the indexing of non-textual resources: “The central aim of the developers of the DC (Dublin Core) was to substantially improve resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects” (Bar-Ilan et al., 2006). These metadata would also minimize the weaknesses of full-text indexing, particularly in order to facilitate a more effective information retrieval: “The basic philosophy of the DC and of the librarian world on general is that field-based resource discovery is much more effective than free text search” (Bar-Ilan et al., 2006). But the Dublin Core Metadata’s susceptibility to spamming, due to the absence of a centralized quality control service on the internet, prevented its success. Ever since, full-text indexing has dominated as a method of knowledge representation on the World Wide Web; more specifically, it merely serves search engines as a basis for applying their search and ranking algorithms (see link topology), so that they can present the most relevant resource at the top of any user’s list of search results. Resources are no longer analyzed; a resource’s relevance is (largely) determined by external factors and not its semantics.

Only the advent of Web 2.0 has brought to the fore another form of knowledge representation: the folksonomy. With the breathtakingly fast technological developments (e.g. digital cameras, camera phones) and ever faster broadband connections allowing users to produce and publish content for the internet much more easily, an easily manageable knowledge organization system for personal resources is more than ever in demand. And again, users become indexers by attaching metadata to their resources, thereby giving them a meaning as well as providing access to them: “We are on the cusp of an exciting new stage of Web growth in which the users provide both meaning and a means of finding through tagging” (Kroski, 2005).
Introduction to Information Retrieval

The goal of information retrieval is to find all relevant resources to a search request or query from the sheer mass of resources in a database. The ballast of unsuitable resources should be avoided, since it leads to an unnecessary bloating of the search results. After all, the user only wants those resources that satisfy his thirst for knowledge. The effectiveness of retrieval systems is measured with the indicators ‘Recall’ and ‘Precision’ (Salton & MacGill, 1987, 174ff.). In order to calculate these values, the following factors are needed:
• a = relevant results obtained,
• b = non-relevant resources in the search results (ballast),
• c = relevant resources in the database not on display (loss).
The Precision of a retrieval system is calculated as follows:
\[ \text{precision} = \frac{a}{a + b} \]
while the Recall of a retrieval system is calculated in this way:
\[ \text{recall} = \frac{a}{a + c} \]
Ideal retrieval systems show values of 1 for both Recall and Precision, which is, however, the exception rather than the rule in practice. While the Precision can be calculated exactly, it is a bit more difficult for the Recall. The factor c is hard to determine or often unknown, particularly in very large databases. That is why the Recall value should be considered in its relation to the Precision value, since both are reciprocally connected. If a search result’s Precision increases, its ballast will decrease and vice versa. This means that users must approach the ideal relation of Precision and Recall, “researchers’ holy grail” (Evans, 1994), step by step.
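The two indicators can be illustrated with a minimal Python sketch. The function names and the example counts (8 relevant hits, 2 ballast hits, 12 missed relevant resources) are assumptions chosen purely for illustration; they do not come from the cited literature.

```python
def precision(a: int, b: int) -> float:
    """Share of the displayed results that are relevant: a / (a + b)."""
    return a / (a + b) if (a + b) else 0.0


def recall(a: int, c: int) -> float:
    """Share of all relevant resources that were actually found: a / (a + c)."""
    return a / (a + c) if (a + c) else 0.0


# Hypothetical search: a = 8 relevant hits, b = 2 non-relevant hits (ballast),
# c = 12 relevant resources in the database that were not displayed (loss).
p = precision(8, 2)
r = recall(8, 12)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In this made-up case the result list is fairly clean, but more than half of the relevant resources are lost. In practice the count c usually has to be estimated, which is why Recall is the harder value to determine.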
Relevance Distributions
Figure 2.7 sketches three well-known relevance distributions, which play a prominent role in determining relevant search results in information retrieval and in compiling a relevance ranking (Stock, 2007b, Ch. 6). Relevance distributions are developed in information retrieval from the resource hits for a particular search query; the ranking of the search results is determined by their retrieval status value (further elaborations will follow in the chapter Information Retrieval). In knowledge representation via folksonomies, the ranking is determined by the allocation rate of a tag, either per resource or per occurrence in the entire database. As seen in the chart, the onset of the distribution curves must not be neglected while considering and determining distributions, because it contains important information (Stock, 2006; Stock, 2007b, 76ff.). I will now use the example of a tagging system to explain relevance distributions:
• Binary relevance distribution mirrors the allocation of tags on a resource level, if only the resource’s author may index tags. A tag was either added, in which case it receives a value of 1, or it was not added, and its value is 0. Here no relevance ranking can be implemented, since each tag is weighted equally.
• Informetric distribution, or Power Law, shows that only the first n ranks (here: three) are occupied by the most relevant tags, while numerous further tags sit in the ‘Long Tail’ (Anderson, 2006) and might be less relevant for a search request in combination with a resource. This means that the user can concentrate on just the first few hits to answer his query. In knowledge representation, these tags could be used as candidates for a controlled vocabulary.
Power Laws (see Figure 2.7) are found in many areas (Newman, 2005), for example in text statistics, and describe the proportional connection between number and size of the objects to be measured: “The power law is among one of the most frequent scaling laws that describe the scale invariance found in many natural phenomena” (Huang, 2006). Shirky (2003) paraphrases this typical Power-Law characteristic as follows: “The basic shape is simple – in any system sorted by rank, the value for the Nth position will be 1/N” (Shirky, 2003, 78). There are various approaches towards formalizing Power-Law distributions: the Pareto distribution, Zipf’s Law or Lotka’s Law, amongst others (Egghe & Rousseau, 1990; Egghe, 2005). Common to all of these principles is that the Power-Law curve becomes a straight line when plotted on logarithmic axes (Albrecht, 2006): “When written in this [logarithmic] form, a fundamental property of power laws becomes apparent – when plotted in log-log space, power laws are straight lines” (Halpin, Robu, & Shepherd, 2007, 215).
Figure 2.7: Informetric, Inverse-logistic and Binary Relevance Distributions. Source: Stock (2006, 1127, Fig. 1).
The Pareto distribution describes the phenomenon of a small amount of high values in a value set contributing to their aggregate value to a higher degree than a large amount of small values in that set would. In other words, and in terms of its relevance to folksonomies: around 20% of a folksonomy’s tags make up for 80% of the allocation frequency. Zipf’s Law investigates the quantity of words in a text with regard to their ranking: “Zipf's version of the law states that the size y of the x'th largest object in a set is inversely proportional to its rank” (Voß, 2006). Lotka’s Law is based on the number of scientific publications. A distribution based on Lotka’s Law follows the equation
\[ f(x) = \frac{C}{x^{a}} \]
where C is a constant, x the ranking of the tag in question and a a constant value (usually between 1 and 2).
• The inverse logistic distribution (see Figure 2.7) shows a Long Tail, but also a ‘Long Trunk’ at the onset of the curve. In information retrieval, this means that in order to satisfy a search request, many more resources must be analyzed than would be the case for a Power Law. For knowledge representation, this means that there are now more candidates for a controlled vocabulary that must be evaluated. The inverse logistic distribution shows many probably relevant tags in the Long Trunk and a Long Tail to boot. It follows the formula
\[ f(x) = e^{-C'(x-1)^{b}} \]
where e is Euler’s number and x the tag’s ranking. C’ is a constant and the exponent b is always around 3. In many cases, the Long Trunk is shorter than the Long Tail (Stock, 2006).
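The difference between the two curve shapes can be made tangible with a short Python sketch. It assumes the readings f(x) = C / x^a for the Power Law and f(x) = e^(-C'(x-1)^b) for the inverse logistic distribution; the parameter values and function names are illustrative assumptions only.

```python
import math


def lotka(x: float, C: float = 1.0, a: float = 2.0) -> float:
    """Power-law value for rank x: C / x**a."""
    return C / x ** a


def inverse_logistic(x: float, C_prime: float = 0.1, b: float = 3.0) -> float:
    """Inverse-logistic value for rank x >= 1: exp(-C' * (x - 1)**b)."""
    return math.exp(-C_prime * (x - 1) ** b)


ranks = [1, 2, 4, 8]
power = [lotka(x, C=100.0, a=2.0) for x in ranks]

# In log-log space a power law is a straight line: the slope between any two
# neighbouring points equals -a.
slopes = [
    (math.log(power[i + 1]) - math.log(power[i]))
    / (math.log(ranks[i + 1]) - math.log(ranks[i]))
    for i in range(len(ranks) - 1)
]
print(slopes)

# The 'Long Trunk': at rank 2 the inverse-logistic curve has hardly dropped,
# whereas the power law (with C = 1) has already fallen to a quarter.
print(inverse_logistic(2), lotka(2))
```

The constant slope in log-log space is what the quoted authors mean by power laws being straight lines; the slowly falling onset of the second curve corresponds to the Long Trunk.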
Retrieval Models
Information retrieval is meant to help the user find those resources that will quench his thirst for knowledge, or in other words, which are relevant to his search request. In order to determine the relevance of a resource for the search request, several retrieval models have been developed, which will be discussed in the following. Most retrieval models are based on the assumption that a resource becomes more relevant the greater the degree of agreement between the terms in the search request and the terms in the resources, or the more often the search term occurs in the documents. That is why these models can also be applied to the determining of similar resources, the ranking of resources (in decreasing order of similarity), cluster-forming procedures and thus quasi-knowledge organization systems as well as recommendation systems for resources, users or tags. I will only briefly mention the Boolean retrieval model (Stock, 2007, 141-167), since Boolean operators are essential for information retrieval but the strict Boolean model has not established itself in the practice of web information retrieval in particular (Lewandowski, 2005b). The large numbers of search results on the internet require a relevance ranking, which the Boolean model cannot provide due to its True/False dichotomy. The Boolean operators (‘AND,’ ‘OR’ and ‘NOT,’ in some retrieval systems also ‘XOR’) facilitate the linking of several search arguments while formulating the search request. The AND aims to find the intersection of the terms, i.e. the results will include those resources that contain all search terms, and the OR aims for the set union, displaying search results containing either of the search terms. The NOT excludes the search argument placed after the operator from the search, and XOR, also termed ‘exclusive OR,’ prevents the display of the intersection of two terms.
Strict Boolean retrieval systems follow the ‘exact match’ procedure (Lewandowski, 2005b, 80), which means that resources are only displayed as search results if they contain the exact search terms. A search request with AND link and multiple search arguments must find all search terms in the resources; synonyms or spelling mistakes in the query are considered as new and are not rectified. This influences the Recall. However, it is not the greatest problem of the Boolean retrieval model. The Boolean methodology does not allow for a relevance ranking, since it is only meant to answer the question: does the search term occur in the resource exactly? An affirmative answer results in the display of the resource as a search result – several positive results mean that all results are equally relevant. This assumption is not sustainable, as the example of text statistics demonstrates below. To establish a relevance ranking in the Boolean model, enhanced Boolean retrieval models were developed – the geometrically oriented approach of the P-Norm (Salton, Fox, & Wu, 1983) and the (mixed) minimum-maximum model (Fox et al., 1992; Lee et al., 1993) based on Fuzzy Logic. Neither of these models caught on outside of the scientific community, however. The following retrieval models have established themselves in practice.
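The set-theoretical character of the Boolean operators can be sketched in a few lines of Python. The tiny inverted index below (term to set of resource IDs) is a made-up example, and NOT is modelled as the binary ‘AND NOT’ that is common in retrieval systems, which is an assumption of this sketch.

```python
# Hypothetical inverted index: term -> IDs of the resources containing it.
index = {
    "folksonomy": {1, 2, 4},
    "tagging": {2, 3, 4},
    "ontology": {4, 5},
}


def postings(term: str) -> set:
    """Exact match only: an unknown or misspelled term yields no hits."""
    return index.get(term, set())


and_hits = postings("folksonomy") & postings("tagging")   # intersection
or_hits = postings("folksonomy") | postings("tagging")    # set union
not_hits = postings("folksonomy") - postings("tagging")   # folksonomy NOT tagging
xor_hits = postings("folksonomy") ^ postings("tagging")   # union without intersection

print(and_hits, or_hits, not_hits, xor_hits)
print(postings("folksonomies"))  # plural variant: empty set under exact match
```

Every operator returns an unordered set. Nothing in the result says which of the hits is more relevant, which is precisely the missing relevance ranking criticized above; the empty result for the plural form shows how exact match affects the Recall.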
Text Statistics
Text statistics refer to a resource’s inherent factors, that is to say its terms (words as they occur in the text, information-linguistically processed words or N-grams). That is why text statistics can only be applied to textual resources, or resources with textual parts. The basic assumption here is that the more often a term occurs in any given resource, the more important it becomes for describing the resource’s content – however, it must not occur too often, because it would then lose its discriminatory power, i.e. its value for describing the resource’s content (Luhn’s Model; Luhn, 1958). A term’s importance is determined by two different values: the resource-specific term frequency and the inverse document frequency. The resource-specific absolute term frequency (short: TF) is determined by a term’s frequency of occurrence in a resource; in folksonomies, by a tag’s popularity. In traditional information retrieval, which generally focuses on textual resources and full-text, simple term frequency has a big flaw: longer texts are at an advantage, since the probability of a term occurring multiple times is simply greater for them. To remedy this, the relative term frequency is determined, which means that a term’s frequency of occurrence is divided by the total number of terms in the resource (Salton, 1968). The formula for calculating the relative TF is composed of freq(t,d) = frequency of occurrence of term t in the resource d and L = number of all terms in d:
\[ \text{rel.TF}(t, d) = \frac{\text{freq}(t, d)}{L} \]
A variant form of relative TF is the Within-Document Frequency (WDF), which uses logarithmic values to achieve a more compressed value set compared to the former method (Harman, 1992; Harman, 1986):
\[ \text{WDF}(t, d) = \frac{\text{ld}(\text{freq}(t, d) + 1)}{\text{ld}\, L} \]
In order to incorporate the discriminatory power of terms and tags into the calculation of the retrieval status value as well, the value from relative TF is multiplied with the inverse document frequency (IDF). The IDF value refers to the term’s frequency of occurrence in the entire database and states that a term becomes less relevant, or discriminatory, the more it occurs in different resources (Spärck Jones, 2004 [1972], 498f.). It is calculated by dividing the total number of resources in the database by the number of resources that contain at least one occurrence of the tag:
\[ \text{IDF}(t) = \text{ld}\left(\frac{N}{n}\right) + 1 \]
where N = total number of resources in the database, n = number of resources containing the term t at least once. Multiplying the relative term frequency with the inverse document frequency results in neither longer texts nor too frequent tags receiving preferred treatment in the calculation.
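A worked example may help to see how the three values interact. The following Python sketch reads ld as the logarithm to base 2 and uses invented counts (a term occurring 3 times in a resource of 8 terms, found in 4 of 16 resources); the function names are assumptions made for this illustration.

```python
import math


def relative_tf(freq: int, length: int) -> float:
    """rel.TF(t, d) = freq(t, d) / L."""
    return freq / length


def wdf(freq: int, length: int) -> float:
    """WDF(t, d) = ld(freq(t, d) + 1) / ld(L); requires L > 1."""
    return math.log2(freq + 1) / math.log2(length)


def idf(total_resources: int, resources_with_term: int) -> float:
    """IDF(t) = ld(N / n) + 1."""
    return math.log2(total_resources / resources_with_term) + 1


# Hypothetical counts: freq(t, d) = 3, L = 8, N = 16, n = 4.
rel = relative_tf(3, 8)       # 3 / 8
w = wdf(3, 8)                 # ld(4) / ld(8) = 2 / 3
i = idf(16, 4)                # ld(4) + 1 = 3
weight = w * i                # WDF * IDF as one possible term weight
print(rel, w, i, weight)

# A term that occurs in every resource keeps only the minimum IDF of 1.
print(idf(16, 16))
```

Whether relative TF or WDF is multiplied with IDF is a design choice; the logarithm in WDF merely compresses the value range, so that very frequent terms gain less additional weight.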
Vector Space Model
The vector space model (Salton & MacGill, 1987) localizes either the search request (in information retrieval) or the initial resource (while determining resource similarity) as well as the resources to be searched, or compared, as vectors in an n-dimensional space. This means that the terms generate the dimensions of the vector space in which the resources are localized as vectors. For the terms’ weighting values, the vector space model draws on the calculation method of text statistics. The similarity between search request or model resource and the resources in the database is calculated from the angle of their vectors: the smaller the angle, the more similar the resources are to one another. If an angle has a value of 0˚, this means that request vector / model resource vector and search result resource vector are a perfect match and thus one and the same – an angle value of 90˚ represents the maximum distance between the two vectors and request and resource could not be more dissimilar. The angle is calculated via the cosine:
\[ \text{Cos}(Doc_i, Query_j) = \frac{\sum_{k=1}^{l} (Term_{ik} \cdot QTerm_{jk})}{\sqrt{\sum_{k=1}^{l} (Term_{ik})^{2} \cdot \sum_{k=1}^{l} (QTerm_{jk})^{2}}} \]
where Term_ik = weighting value of term k in resource i, QTerm_jk = weighting value of search term k in request vector j and l = total number of terms (i.e. dimensions of the vector space). The comparison of search requests and resources in the database often leads to unsatisfactory results, since the search request contains few terms and thus only generates few dimensions, while the resources contain a vastly greater number of terms. Hence the processing of queries containing only one search argument is completely impossible, since there would be only one dimension and thus no possible angle between search request vector and resource vector. One solution would be the so-called relevance feedback, which enables the user to let the system add further terms to his query (Salton, 1968). The user selects the most relevant resources from the first number of hits, which are then evaluated by the retrieval system. Further relevant terms are added to the query, terms that match the search term are given a higher emphasis in the search and non-matches or terms that do not occur in the resource are given a lower emphasis or deleted. Thus the search request vector is ‘enhanced’ and able to find more relevant resources through multiple matches. This problem is less prevalent in the comparison of model resources and resources in a database, since the model resource contains enough terms for the comparison. The modification of the search request vector is expressed via the Rocchio Algorithm (Rocchio, 1971):
\[ \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j \]
where q_m = modified search request vector, D_r = set of document vectors marked relevant by the user, D_n = set of document vectors marked non-relevant by the user (dividing each sum by the size of its set yields the respective centroid), and α, β and γ = freely adjustable constants with the ability to weight the terms of the sum (Baeza-Yates & Ribeiro-Neto, 1999, 119). If α, β and γ are defined as 1, the weighting is neglected; if the positive relevance judgments are supposed to receive a higher weighting, then γ < β – if the negative relevance judgments are supposed to receive a higher weighting, then β < γ. The importance of a query can be regulated via α. A disadvantage of the vector space model is that the terms or dimensions must be viewed as independent, which is not always the case. In other words, the model requires a one hundred per cent morphological match between the terms in order to regard them as similar. Semantic similarities are neglected. Thus synonyms such as ‘flat’ and ‘apartment’ generate a 90° angle in the vector space and are dissimilar, which flies in the face of semantics as well as the user’s way of thinking. Here the incorporation of knowledge organization systems and the summarization of synonymous terms in one dimension (Billhardt, Borrajo, & Maojo, 2002; Kulyukin & Settle, 2001; Liddy, Paik, & Yu, 1993) provides help. An advantage of the vector space model is that the calculated angle values can always be used to arrange the resource similarities in a sequence and thus create a ranking of the search results.
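Both the cosine calculation and the Rocchio modification can be sketched compactly in Python. The three-dimensional term space (‘flat’, ‘apartment’, ‘rent’), the weights and all function names are assumptions for illustration; in particular, leaving negative weights unclipped is a simplification of this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return dot / norm if norm else 0.0


def centroid(vectors, dims):
    """Mean vector of a set of document vectors (zero vector if the set is empty)."""
    if not vectors:
        return [0.0] * dims
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dims)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant)."""
    rel = centroid(relevant, len(query))
    non = centroid(non_relevant, len(query))
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]


# Dimensions (hypothetical): 'flat', 'apartment', 'rent'.
query = [1.0, 0.0, 0.0]          # the user searched for 'flat' only
doc_apartment = [0.0, 1.0, 1.0]  # a resource that speaks of 'apartment' and 'rent'

before = cosine(query, doc_apartment)       # no shared dimension: 90 degree angle
modified = rocchio(query, relevant=[doc_apartment], non_relevant=[])
after = cosine(modified, doc_apartment)
print(before, modified, after)
```

The example reproduces the synonym problem: before the feedback step the query and the ‘apartment’ resource are orthogonal; after the user has marked that resource as relevant, the enhanced query vector also carries weight on the other two dimensions.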
Probabilistic Model
The probabilistic model calculates the probability value of whether a resource is relevant to a search request (Crestani et al., 1998). Historically speaking, the probabilistic model has established relevance ranking in information retrieval, meaning the output of a list of search results in descending order of relevance (Maron & Kuhns, 1960). The probabilistic model sets out from the principle of conditional probability, i.e. the probability that a resource is relevant depends on the search request. Probability values are always between 0 and 1, where 1 is the certain event and 0 the impossible event. Conditional probability is calculated via Bayes’ Theorem:
\[ P(D \mid Q) = \frac{P(Q \mid D) \cdot P(D)}{P(Q)} \]
where D and Q both consist of the terms t1 to tn.
• P(D|Q) is the probability of the relevance of a resource D in the search request Q;
• P(Q|D) is the conditional probability of the relevance of a search request Q in a resource D. For the calculation of P(Q|D), the system needs the user to provide relevance judgments in order to be able to add relevant terms to the search request or to weight the search terms anew;
• P(D) is the probability of the resource, calculated via the terms contained within it (using text-statistical procedures);
• P(Q) is the probability of the search request and is a given, which is why it is defined as having a value of 1.
The probabilistic model must take at least two steps in order to generate a hit list: it must first generate an initial number of hits, which is then rated by either the user or the system in a relevance feedback loop and used to modify the search request (by adding search terms and weighting values). For the calculation of the relative distribution values for modifying the search request, the Robertson-Sparck Jones formula is used (Robertson & Sparck Jones, 1976; Harman, 1992):
\[ w = \log \frac{\left( \dfrac{r}{R - r} \right)}{\left( \dfrac{n - r}{N - n - R + r} \right)} \]
where r = number of resources that contain the search term and were marked relevant by the user, R = number of resources relevant for the query, N = number of rated resources and n = number of resources containing the search term t. The distribution value w is calculated for all terms t of the initial search and, relative to the retrieval system, for the terms of the rated resources and then transferred to the retrieval system as a search request for the terms t. A weighting value is added to the hit resources for every term via a variant of TF*IDF, which is then multiplied with the distribution value w. The retrieval status value of a resource is composed of the sum of these multiplications and can be displayed in a ranking of the search results. Finally, the user receives the definite list of search results. Thus retrieval systems based on the probabilistic model learn together with the user and absolutely require a relevance judgment for the necessary calculations. However, it is also possible to generate the relevance judgment automatically, via a so-called pseudo-relevance feedback. This is accomplished with the help of the Croft-Harper formula (Croft & Harper, 1979):
\[ w = \log \left[ \frac{r}{R - r} \right] \]
where r = number of top-ranking resources containing the term t, R = number of rated resources and R-r = number of resources not containing the term t. The Croft-Harper formula is based on the following assumption:
“A major assumption made in these models is that relevance information is available. That is, some or all of the relevant and non-relevant documents have been identified. […] The documents at the top of the ranking produced by the initial search have a high probability of being relevant. The assumption underlying the intermediate search is that these documents are relevant, whether in actual fact they are or not” (Croft & Harper, 1979, 285ff.).
This would mean that the top-ranked hit resources in the initial number of search results are automatically considered relevant and their term material is adopted for further calculations in order to generate the final search results. Since no negative relevance judgments are available here, only the positive part of the Robertson-Sparck Jones formula is used for calculations; in other words, only pseudo-relevant resources are taken into consideration. Since the scientific debate on folksonomies in information retrieval largely neglects the probabilistic model, the above will suffice as a brief introduction to the topic. Nevertheless, the probabilistic model, as well as the vector space model, can be used as a theoretical basis for cluster calculations, similarity comparisons, recommendation systems etc., and it remains to be seen when the scientific community
will take up the thoughts behind the probabilistic model and apply them to folksonomies.
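The two term-weighting formulas of the probabilistic model can be tried out with a brief Python sketch. It implements the formulas as stated in this section, uses the natural logarithm as an assumed base, and works with invented counts; the function names are hypothetical. Degenerate counts (for example r = R or n = r) lead to a division by zero or to the logarithm of zero in this plain form, which the sketch deliberately does not hide.

```python
import math


def rsj_weight(r: int, R: int, n: int, N: int) -> float:
    """Robertson-Sparck Jones weight: log of (r/(R-r)) divided by ((n-r)/(N-n-R+r))."""
    relevant_odds = r / (R - r)
    non_relevant_odds = (n - r) / (N - n - R + r)
    return math.log(relevant_odds / non_relevant_odds)


def croft_harper_weight(r: int, R: int) -> float:
    """Pseudo-relevance variant as given in the text: log(r / (R - r))."""
    return math.log(r / (R - r))


# Hypothetical feedback round: N = 100 rated resources, R = 10 of them relevant,
# n = 20 contain the term t, and r = 8 of those 20 were marked relevant.
w_rsj = rsj_weight(r=8, R=10, n=20, N=100)

# Pseudo-relevance feedback: the top R = 10 hits are simply assumed to be
# relevant, and r = 8 of them contain the term t.
w_ch = croft_harper_weight(r=8, R=10)
print(w_rsj, w_ch)
```

A term that is concentrated in the relevant resources receives a clearly positive weight. With the Croft-Harper variant only the positive part remains, since no negative judgments are available.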
Link Topology – Kleinberg Algorithm and PageRank
Link-topological retrieval models do not refer to the terms of the resource, as opposed to the text-statistical models, but use the resources’ links (hyperlinks for internet resources and references for textual resources) to determine their relevance. Link topology is based on Garfield’s thoughts on citation analysis (Garfield, 1979). Scientific resources in particular are linked via references and citations: a resource A refers to another resource B, while B is cited by A. This idea is transferred to hyperlinks in the World Wide Web. To determine the relevance of resources, two algorithms are used primarily. The Kleinberg Algorithm (Kleinberg, 1999) operates on the following principle: initially, hubs and authorities are identified in the mass of websites. Sites are considered hubs if they have many outgoing links (outlinks), and authorities if they have many incoming links (inlinks). ‘Good’ hubs refer to ‘good’ authorities and vice versa, where the measurement ‘good’ is quantified via the number of links. The important thing is that authority value and hub value strengthen each other, since the relation between these two values is expressed with the help of an iterative calculation. In this way numerical weighting values are determined for each page and refreshed again and again, until they approach a boundary value and are finally registered as the website’s retrieval status values. As opposed to the PageRank Algorithm, the Kleinberg Algorithm bases its calculations on the search request [121], which means that new hub and authority values must be calculated, and new search results generated, for each query. This results in long calculation periods and thus in long waits for the user. The PageRank Algorithm (Brin & Page, 1998; Page et al., 1998) also calculates a website’s retrieval status value iteratively, but does not consider the search request and the website’s hub weight.
The algorithm’s main assumption can be paraphrased as follows: the PageRank of a website is the probability that any given user will stumble upon the site while surfing the web. The method concentrates on number and ‘quality’ of a website’s inlinks. The more links lead to a site, the greater the probability that a random user will find it. The quality of a link is also determined by the PageRank – if the links of an authority (website with lots of inlinks) lead to another website, these links weigh more than incoming links from a less popular site. This also means that a website’s PageRank always proportionally spreads to another site’s PageRank via the link (Stock, 2007b, 382-386). The PageRank’s formula is this:
\[ PR(A) = (1 - d) + d \cdot \left[ \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right] \]
where PR(A) = PageRank of a website A, d = damping factor, which represents the probability that a random surfer follows the website’s links, PR(T1) = PageRank of the website T1, PR(Tn) = PageRank of the website Tn, where all T1, …, Tn link to the website A, C(T1) = number of outgoing links of the website T1 and C(Tn) = number of outgoing links of the website Tn. The PageRank, too, supports the analogy to citation analysis, which expresses an author’s judgment on a website via a reference or a link to that site: “[Links] are indicative of the quality of the pages to which they point – when creating a page, an author presumably chooses to link to pages deemed to be of good quality” (Richardson, Prakash, & Brill, 2006, 708). It is important to stress at this point that the PageRank does not rate the quality of the website for the purpose of a content analysis, but merely performs a statistical rating of the website’s position in the resource space. Hence the ranking has no bearing whatsoever on the resource’s content. Link-topological rating procedures are not without their problems and criticism (Lewandowski, 2005a, 144f.). Thus the quality of a website is determined only via its incoming and outgoing links, to use the analogy to citation analysis. However, links can be placed for various reasons, e.g. for navigation, as ads for other websites, as spam or as negative examples for something, which means that the assumption that “links lead to good content” is untenable. The position of a link is also left unconsidered for the ranking and the quality judgment, even though it seems clear that prominently placed links are more important than others. Lewandowski (2005a) summarizes the problems of link-topological rating: “In conclusion, linkbased algorithms helped to improve the quality of search engine rankings to a high degree, but there is also an immanent kind of bias that is not considered enough yet” (Lewandowski, 2005a, 145).
[121] The query terms are initially aligned with the resource terms and form the first set of search results if they match. Only after receiving this set can the actual calculation of the Retrieval Status Value, via the Kleinberg algorithm, be performed (Stock, 2007b, 377).
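Both link-topological procedures can be sketched in Python on a toy link graph. The graph, the fixed number of iterations, the damping factor of 0.85 and the normalization of the hub and authority values are assumptions of this sketch and not prescriptions from the cited papers. In the Kleinberg case the graph would have to be rebuilt from the hit set of each query.

```python
import math

# Hypothetical link graph: page -> pages it links to (every page has outlinks).
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}


def pagerank(graph, d=0.85, iterations=50):
    """Iterates PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A."""
    pr = {page: 1.0 for page in graph}
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[t] / len(graph[t]) for t in graph if page in graph[t]
            )
            for page in graph
        }
    return pr


def hits(graph, iterations=50):
    """Mutually reinforcing hub and authority values in the spirit of Kleinberg."""
    hub = {page: 1.0 for page in graph}
    auth = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority value: sum of the hub values of all pages linking to the page.
        auth = {p: sum(hub[t] for t in graph if p in graph[t]) for p in graph}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub value: sum of the authority values of all pages the page links to.
        hub = {p: sum(auth[t] for t in graph[p]) for p in graph}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


ranks = pagerank(links)
hub_values, auth_values = hits(links)
print(ranks)
print(hub_values, auth_values)
```

Page C, which collects three inlinks, comes out on top in both procedures. Page A profits from the single link it receives from C, which shows how a link from a highly ranked page weighs more than a link from an unpopular one.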
Information Linguistics – NLP
Natural Language Processing (short: NLP) is a subarea of information linguistics that serves the disambiguation of natural language, in particular written language, and aims towards an automatic processing of language in computerized environments. It mainly draws upon the findings of linguistics and computational linguistics. A typical NLP processing algorithm concerning folksonomies is sketched in Figure 2.8. There are several stages: initially, comparisons with language patterns, word distributions or n-grams reveal the terms’ / tags’ language. Then the tags are parsed into words or n-grams. The favored approach for folksonomies is word-based, since this offers better processing options (see the three bottom levels in Figure 2.8). An alternative (not followed up on here) is the implementation of n-gram procedures, meaning the reduction of tags to artificially produced words with n characters (Stock, 2007b, Ch. 13). The next step is error detection and recovery as well as the discovery of ‘named entities,’ via the alignment of word and name lists, among others (Stock, 2007b, 255ff.). Even simple synonyms, such as acronyms and their complete spelling, are processed in this step. Orthographical as well as morphological variants, such as upper- and lower-case letters, British vs American English or singular and plural forms (Kipp & Campbell, 2006) are added together in the summary of word forms and unified via stemming (Stock, 2007b, 232ff.; van Damme, Hepp, & Siorpaes, 2007) or lemmatization (Stock, 2007b, 228ff.). Stemming reduces terms to their word stem by cutting off suffixes, while lemmatization reduces terms to their basic form and
has to draw upon knowledge from linguistics (Galvez, de Moya-Anegón, & Solana, 2005). Lemmatization is achieved either through alignment with dictionaries (which contain lexemes, inflection forms and derivations) such as IDX (Zimmermann, 1998) or rule-governed procedures that apply various deletion or replacement rules to the term (Kuhlen, 1977). Stemming concentrates on cutting off suffixes, where two procedures are mainly used: a) the Longest-Match stemmer and b) the iterative Stemmer. One Longest-Match stemmer, the so-called Lovins Stemmer (Lovins, 1968), will cut off the longest possible suffix of a term in one step; an iterative Stemmer, such as the Porter Stemmer (Porter, 1980), removes consecutive suffixes one after the other in several steps. The unification or ‘conflation’ of word forms makes it possible to join orthographical variants of the same term and thus increase the Recall in information retrieval. The processing of natural-language search requests also becomes possible. Search term and indexing term no longer have to show a complete match in their letter sequence; matching word components are recognized and incorporated into the alignment of search request and resources in the database. Up to this point, the NLP Algorithm views the terms as grammatical elements of the resources and merely unifies their format. But these terms are still semantically ambiguous – only the last three steps of the process address the semantic disambiguation of the terms and relate the terms to each other. In other words: their relations are rendered visible in order to heighten the precision of knowledge representation and to erase semantic ambiguities. To achieve this, the process must draw upon traditional methods of knowledge representation. During the detection of homonyms and synonyms, knowledge organization systems such as WordNet (Miller, 1998) must be used and comparisons made.
Knowledge organization systems must be used to link terms via relations: in the end, multilingual retrieval requires the existence of machine-readable dictionaries.
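The difference between the two stemming strategies can be illustrated with a deliberately tiny Python sketch. The suffix lists and the minimum stem length are hypothetical toy values; this is neither the Lovins nor the Porter stemmer, whose rule sets are far more extensive.

```python
# Hypothetical toy suffix list for the longest-match strategy.
SUFFIXES = ["ations", "ation", "ings", "ing", "s"]

# Hypothetical toy rule steps for the iterative strategy (applied in order).
ITERATIVE_STEPS = [["s"], ["ation", "ing"]]


def longest_match_stem(word: str, min_stem: int = 3) -> str:
    """Cuts off the longest matching suffix in a single step."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def iterative_stem(word: str, min_stem: int = 3) -> str:
    """Removes consecutive suffixes one after the other, step by step."""
    for step in ITERATIVE_STEPS:
        for suffix in step:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                break  # at most one suffix per step
    return word


for tag in ["tags", "tagging", "taggings", "classifications"]:
    print(tag, longest_match_stem(tag), iterative_stem(tag))
```

On these examples both strategies arrive at the same stems, so that ‘tagging’ and ‘taggings’ are conflated into one word form. The resulting stems (‘tagg’, ‘classific’) are not necessarily proper words, which is the main difference to lemmatization.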
Figure 2.8: NLP Processing Algorithm for Folksonomies. Source: Peters & Stock (2007, Fig. 9).
Similarity Coefficients and Cluster Analysis
Similarity coefficients and cluster analysis can be used in various areas: in knowledge representation, both procedures allow the creation of semantic clusters, which can then be used to build a quasi-KOS; alternatively, they can be used to determine recommendations for similar indexing terms or resources. In information retrieval, both procedures are mainly applied in the relevance ranking of search results or serve as the basis of recommendation systems. Co-occurrences, i.e. of terms, users or resources, are always the basis for calculating similarity values. The assumption here is that co-occurrence is an indication of a similarity relation between the factors under consideration. The basis for similarity calculations of the factors terms, users and resources are terms which are contained in either the tag sets of other indexed resources, their full-text or in user profiles. Since terms always contain ambiguities and variants, it makes sense to unify the term base before the calculation. This can be achieved via information-linguistic processing procedures, the removal of noise words and the summarization of phrases (Stock & Stock, 2008, 372-377). To calculate the similarity coefficients, the following values must be determined in the cleansed term base: 1) the number of resources containing term A, 2) the number of resources containing term B, and 3) the number of resources containing both terms. Absolute quantities (meaning the actual number of occurrences) or the weighting values calculated via TF*IDF (Rasmussen, 1992, 422) can be used for this determination. The following similarity coefficients are to be calculated with absolute quantities, where g = the number of common terms, a = the number of terms in resource 1 and b = the number of terms in resource 2:
• The Jaccard-Sneath Coefficient (Jaccard, 1901; Sneath, 1957):
\[ S_{D_i, D_j} = \frac{g}{a + b - g} \]
• The Dice Coefficient (Dice, 1945):
\[ S_{D_i, D_j} = \frac{2g}{a + b} \]
• The Cosine (Salton, Wong, & Yang, 1975):
\[ S_{D_i, D_j} = \frac{g}{\sqrt{a \cdot b}} \]
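For absolute quantities, the three coefficients can be computed directly from two tag sets, as the following Python sketch shows. The two tag sets and the function names are invented for illustration, and the Cosine is read with a square root in the denominator.

```python
import math


def jaccard_sneath(g: int, a: int, b: int) -> float:
    """g / (a + b - g)"""
    return g / (a + b - g)


def dice(g: int, a: int, b: int) -> float:
    """2g / (a + b)"""
    return 2 * g / (a + b)


def cosine(g: int, a: int, b: int) -> float:
    """g / sqrt(a * b)"""
    return g / math.sqrt(a * b)


# Hypothetical tag sets of two resources.
tags_1 = {"web2.0", "tagging", "folksonomy", "social"}
tags_2 = {"tagging", "folksonomy", "retrieval", "search"}

g = len(tags_1 & tags_2)   # number of common terms
a = len(tags_1)            # number of terms in resource 1
b = len(tags_2)            # number of terms in resource 2

print(jaccard_sneath(g, a, b), dice(g, a, b), cosine(g, a, b))
```

With two of four tags in common, Jaccard-Sneath yields the lowest value, because the common terms are counted only once in its denominator. For identical sets all three coefficients reach 1.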
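For the calculation with weighted values, the following formulas are to be applied (Rasmussen, 1992, 442):
• Jaccard-Sneath:
$$S_{D_i, D_j} = \frac{\sum_{k=1}^{L} (\mathit{weight}_{ik} \cdot \mathit{weight}_{jk})}{\sum_{k=1}^{L} \mathit{weight}_{ik}^2 + \sum_{k=1}^{L} \mathit{weight}_{jk}^2 - \sum_{k=1}^{L} (\mathit{weight}_{ik} \cdot \mathit{weight}_{jk})}$$
• Dice:
$$S_{D_i, D_j} = \frac{2 \sum_{k=1}^{L} (\mathit{weight}_{ik} \cdot \mathit{weight}_{jk})}{\sum_{k=1}^{L} \mathit{weight}_{ik}^2 + \sum_{k=1}^{L} \mathit{weight}_{jk}^2}$$
• Cosine:
$$S_{D_i, D_j} = \frac{\sum_{k=1}^{L} (\mathit{weight}_{ik} \cdot \mathit{weight}_{jk})}{\sqrt{\sum_{k=1}^{L} \mathit{weight}_{ik}^2 \cdot \sum_{k=1}^{L} \mathit{weight}_{jk}^2}}$$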
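The weighted variants differ from the counting variants only in that sums of weight products replace the counts. The sketch below assumes that each resource is given as a list of term weights (e.g. TF*IDF values) over the same term base of length L, so that position k refers to the same term in both lists. The function name and the example weights are hypothetical.

```python
from math import sqrt


def weighted_similarities(weights_i, weights_j):
    """Weighted Jaccard-Sneath, Dice and cosine for two aligned weight vectors."""
    if len(weights_i) != len(weights_j):
        raise ValueError("both resources must be described over the same L terms")

    # Sum over k of weight_ik * weight_jk
    dot = sum(w_i * w_j for w_i, w_j in zip(weights_i, weights_j))
    # Sums over k of the squared weights of each resource
    squares_i = sum(w * w for w in weights_i)
    squares_j = sum(w * w for w in weights_j)

    jaccard_denominator = squares_i + squares_j - dot
    return {
        "jaccard_sneath": dot / jaccard_denominator if jaccard_denominator else 0.0,
        "dice": 2 * dot / (squares_i + squares_j) if (squares_i + squares_j) else 0.0,
        "cosine": dot / sqrt(squares_i * squares_j) if squares_i and squares_j else 0.0,
    }


# Hypothetical term weights for two resources over a term base of L = 3 terms.
result = weighted_similarities([1.0, 2.0, 0.0], [2.0, 1.0, 1.0])
print(result)
```

In this example the sum of products is 4 and the sums of squares are 5 and 6, which gives 4/7 for Jaccard-Sneath, 8/11 for Dice and 4/√30 for the cosine. If all weights are 0 or 1, the three values coincide with those of the counting variants.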
These values can also be displayed in a similarity matrix (Stock & Stock, 2008, 374). This makes particular sense if one of the more elaborate methods of cluster formation is to be used in order to better determine similar resources.
Figure 2.9: Cluster Formation via the Single-Link Procedure. Source: Stock & Stock (2008, 375, Fig. 20.5).
Figure 2.10: Cluster Formation via the Complete-Link Procedure. Source: Stock & Stock (2008, 376, Fig. 20.6).
Based on the values calculated by the similarity coefficients, clusters of varying density, or sets of recommendable resources, can be compiled (Stock & Stock, 2008, 374-377):
1. K-Nearest-Neighbors Procedure: starting with one resource, the other resources are arranged by similarity value in descending order, and the first k resources are compiled into a cluster or recommended as similar resources. Cluster size is regulated via the quantity k.
2. Single-Link Procedure: starting with the most similar pair of resources, all resources are added whose similarity value with at least one resource already in the cluster exceeds a specific threshold value (see Figure 2.9). Here, too, cluster size can be adjusted via the threshold value; generally, though, Single-Link Procedures result in rather large, loosely connected clusters.
3. Complete-Link Procedure: here, resources that are similar to both initial resources are added to the cluster only if they also exceed a threshold value with regard to each other (see Figure 2.10). The Complete-Link Procedure generates rather dense and small clusters.
4. Group-Average-Link Procedure: here the starting point is the Single-Link Procedure. After it has been applied, the arithmetic mean of all similarity values in this cluster is calculated. “This mean now serves as a threshold value, so that all documents which are linked with documents in this class by a value below the mean are removed” (Stock & Stock, 2008, 376).
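The first three procedures can be sketched on a small similarity matrix. This is an illustrative reading of the descriptions above, not the procedure as given by Stock & Stock. The greedy growth of one cluster from the most similar pair, the function names and the similarity values are all assumptions. The Group-Average-Link Procedure is left out.

```python
from itertools import combinations


def k_nearest_neighbors(sim, start, k):
    """Return the k resources most similar to `start`, in descending order."""
    others = [resource for resource in sim if resource != start]
    others.sort(key=lambda resource: sim[start][resource], reverse=True)
    return others[:k]


def most_similar_pair(sim):
    """Return the pair of resources with the highest similarity value."""
    return max(combinations(sorted(sim), 2), key=lambda pair: sim[pair[0]][pair[1]])


def grow_cluster(sim, threshold, link):
    """Grow one cluster from the most similar pair.

    `link` condenses a candidate's similarities to all cluster members into
    one value: max for single link, min for complete link.
    """
    cluster = set(most_similar_pair(sim))
    changed = True
    while changed:
        changed = False
        for candidate in sorted(set(sim) - cluster):
            values = [sim[candidate][member] for member in cluster]
            if link(values) > threshold:
                cluster.add(candidate)
                changed = True
    return cluster


def single_link(sim, threshold):
    # One sufficiently similar cluster member is enough.
    return grow_cluster(sim, threshold, max)


def complete_link(sim, threshold):
    # Every cluster member must be sufficiently similar.
    return grow_cluster(sim, threshold, min)


# Hypothetical symmetric similarity matrix for four resources.
pairs = {("A", "B"): 0.9, ("A", "C"): 0.6, ("A", "D"): 0.2,
         ("B", "C"): 0.3, ("B", "D"): 0.1, ("C", "D"): 0.7}
sim = {}
for (x, y), value in pairs.items():
    sim.setdefault(x, {})[y] = value
    sim.setdefault(y, {})[x] = value

neighbors = k_nearest_neighbors(sim, "A", 2)
loose = single_link(sim, 0.5)
dense = complete_link(sim, 0.5)
print(neighbors, sorted(loose), sorted(dense))
```

With a threshold of 0.5, the single-link cluster absorbs all four resources via the chain A–C–D, whereas the complete-link cluster stays at the initial pair A and B. This matches the observation that single link yields larger clusters and complete link smaller, denser ones.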
Network Model
As mentioned above in the context of citation indexing and link topology, resources of various kinds are often related to one another: websites are interconnected via links, scientific publications use references to take up the ideas of other scientific publications, and even human beings are related to one another via friendship or family ties (Stock, 2007b, Ch. 26). In information retrieval, these organizational and social structures are used to arrive at relevance judgments on the resources and thus to create a relevance ranking for them. We have already encountered a tripartite network consisting of the elements indexer / information resource / method of knowledge representation; in information retrieval, this network is transferred to the agents authors / information resources / topics. The network model aims to reveal the structures of the relations between the different authors, information resources or topics, and to use this structural information to arrive at relevance judgments. Here the agents are termed ‘nodes,’ their relations ‘edges.’ The structure formed by connecting nodes via edges is called a ‘graph.’ Graphs can be divided into a) directed graphs and b) undirected graphs. Directed graphs represent an information flow, e.g. via the reference to another resource; undirected graphs do not define the relation between the agents in any great detail, e.g. for the syntagmatic relations of the topics. A graph’s density reflects the strength of the connection between the agents and is calculated as follows: a) for directed graphs:
$$\mathit{Density}(\mathit{graph}, \mathit{directed}) = \frac{L}{n \cdot (n - 1)}$$
b) for undirected graphs:
$$\mathit{Density}(\mathit{graph}, \mathit{undirected}) = \frac{L}{\frac{n \cdot (n - 1)}{2}}$$
where L = the number of all edges in the graph and n = the number of nodes in the graph. If every pair of nodes in a graph is connected by an edge, the density is 1; if there are no edges, it is 0. Based on these calculations, the position of an agent in the network can be determined and then exploited for information retrieval or relevance ranking (Yaltaghian & Chignell, 2002), e.g. an agent’s popularity or their exposed position in the network. The starting point is always the agent. An agent’s popularity is expressed via the degree of their centrality (Wasserman & Faust, 1994). The three measures used here are 1) ‘degree,’ 2) ‘closeness’ and 3) ‘betweenness’:
1) degree: the number of all edges to a node divided by the number of all nodes in the network; for directed graphs, a distinction must be made between ‘in-degree’ (edges that point towards a node) and ‘out-degree’ (edges that lead away from a node);
2) closeness: the proximity of an agent to all others is calculated as the quotient of the number of nodes in the network minus 1 and the sum of the shortest path lengths between the initial node and all other nodes in the network;
3) betweenness: some nodes are not directly connected via edges; their connection is provided by an ‘intermediary node.’ This node then has a betweenness value that is calculated by dividing the number of connections of one agent to all other agents that run via the intermediary node by the total number of connections of that agent to all other agents.
One concept that is often mentioned in the context of the network model is ‘small worlds’ (Watts & Strogatz, 1998), which describes the phenomenon of short path lengths in networks. These are distinguished by a high graph density and small graph diameters: agents that are far away from each other can be reached via ‘shortcuts,’ and are connected by them. If a network is represented as a graph, small worlds call attention to themselves by appearing as more or less self-contained subgraphs. For information retrieval, the small-world characteristic of networks can be used for relevance ranking or the visualization of agent clusters, among other things. The determination of similar resources for recommendation systems can also be performed on the basis of the network model.
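Density, degree and closeness as defined above can be computed on a small example graph. The sketch below rests on several assumptions: the undirected graph is stored as a dictionary of neighbor sets, it is connected (otherwise the closeness value only covers the reachable nodes), and the node names are hypothetical. Betweenness is not covered.

```python
from collections import deque


def density(num_edges, num_nodes, directed):
    """L / (n * (n - 1)) for directed graphs, L / (n * (n - 1) / 2) otherwise."""
    possible_edges = num_nodes * (num_nodes - 1)
    if not directed:
        possible_edges /= 2
    return num_edges / possible_edges if possible_edges else 0.0


def degree(adjacency, node):
    """Edges at a node divided by the number of all nodes, as defined in the text."""
    return len(adjacency[node]) / len(adjacency)


def closeness(adjacency, node):
    """(n - 1) divided by the sum of shortest path lengths from `node`."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search yields shortest paths in unweighted graphs
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    total = sum(distances.values())
    return (len(adjacency) - 1) / total if total else 0.0


# Hypothetical undirected graph: a chain A - B - C - D.
adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
num_edges = sum(len(neighbors) for neighbors in adjacency.values()) // 2

graph_density = density(num_edges, len(adjacency), directed=False)
print(graph_density, degree(adjacency, "B"), closeness(adjacency, "B"), closeness(adjacency, "A"))
```

The chain has 3 of 6 possible edges, so its density is 0.5. The inner node B reaches the other nodes over path lengths 1, 1 and 2 and thus has a closeness of 3/4, while the end node A, with path lengths 1, 2 and 3, only reaches 3/6.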
Bibliography
Aitchison, J., Gilchrist, A., & Bawden, D. (2000). Thesaurus Construction and Use: A Practical Manual. London: Aslib.
Albrecht, C. (2006). Folksonomy. Diplomarbeit, TU Wien, Wien.
Anderson, C. (2006). The Long Tail: Why the Future of Business Is Selling Less of More. New York, NY: Hyperion.
ANSI/NISO Z39.19 (2003). Guidelines for the Construction, Format and Management of Monolingual Thesauri.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: Addison-Wesley.
Bar-Ilan, J., Shoham, S., Idan, A., Miller, Y., & Shachak, A. (2006). Structured vs. Unstructured Tagging - A Case Study. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Bertram, J. (2005). Einführung in die inhaltliche Erschließung. Würzburg: Ergon.
Billhardt, H., Borrajo, D., & Maojo, V. (2002). A Context Vector Model for Information Retrieval. Journal of the American Society for Information Science and Technology, 53, 236–249.
Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia.
Crawford, W. (2006). Folksonomy and Dichotomy. Cites & Insights: Crawford at Large, 6(4), from http://citesandinsights.info/v6i4a.htm.
Crestani, F., Lalmas, M., van Rijsbergen, C. J., & Campbell, I. (1998). "Is this document relevant? … probably": A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys, 30(4), 528–552.
Croft, W. B., & Harper, D. J. (1979). Using Probabilistic Models of Document Retrieval without Relevance Information. Journal of Documentation, 35, 285–295.
Dice, L. R. (1945). Measures of the Amount of Ecologic Association between Species. Ecology, 26, 297–302.
DIN 32705 (1987). Klassifikationssysteme: Erstellung und Weiterentwicklung von Klassifikationssystemen.
DIN 1463/1 (1987). Erstellung und Weiterentwicklung von Thesauri: Einsprachige Thesauri.
Dye, J. (2006). Folksonomy: A Game of High-tech (and High-stakes) Tag. EContent, April, 38–43.
Egghe, L., & Rousseau, R. (1990). Introduction to Informetrics. Amsterdam: Elsevier.
Egghe, L. (2005). Power Laws in the Information Production Process: Lotkaian Informetrics. Amsterdam: Elsevier Academic Press.
Evans, R. (1994). Beyond Boolean: Relevance Ranking, Natural Language and the New Search Paradigm. In Proceedings of the 15th National Online Meeting, New York, USA (pp. 121–128).
Ferber, R. (2003). Information Retrieval. Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web. Heidelberg: dpunkt.verlag.
Fox, E. A., Betrabet, S., Koushik, M., & Lee, W. (1992). Extended Boolean Models. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval. Data Structure & Algorithms (pp. 393–418). Upper Saddle River, NJ: Prentice Hall PTR.
Frakes, W. B., & Baeza-Yates, R. (Eds.) (1992). Information Retrieval. Data Structure & Algorithms. Upper Saddle River, NJ: Prentice Hall PTR.
Galvez, C., de Moya-Anegón, F., & Solana, V. H. (2004). Term Conflation Methods in Information Retrieval. Journal of Documentation, 61, 520–547.
Garfield, E. (1979). Citation Indexing. New York, NY: Wiley.
Giunchiglia, F., Marchese, M., & Zaihrayeu, I. (2005). Towards a Theory of Formal Classification. In Proceedings of the AAAI-05 Workshop on Contexts and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, USA.
Gruber, T. (1993). A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2), 199–220.
Gruber, T. (2005). Ontology of Folksonomy: A Mash-Up of Apples and Oranges, from http://www.tomgruber.org/writing/mtsr05-ontology-of-folksonomy.htm.
Halpin, H., & Shepard, H. (2006). Evolving Ontologies from Folksonomies: Tagging as a Complex System, from http://www.ibiblio.org/hhalpin/homepage/notes/taggingcss.html.
Halpin, H., Robu, V., & Shepherd, H. (2007). The Complex Dynamics of Collaborative Tagging. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 211–220).
Harman, D. (1986). An Experimental Study of Factors Important in Document Ranking. In Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy (pp. 186–193).
Harman, D. (1992). Ranking Algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval. Data Structure & Algorithms (pp. 363–392). Upper Saddle River, NJ: Prentice Hall PTR.
Henrichs, N. (1970a). Philosophie-Datenbank. Bericht über das Philosophy Information Center an der Universität Düsseldorf. Conceptus, 4, 133–144.
Henrichs, N. (1970b). Philosophische Dokumentation. Literatur-Dokumentation ohne strukturierten Thesaurus. Nachrichten für Dokumentation, 21, 20–25.
Huang, H. (2006). Tag Distribution Analysis Using the Power Law to Evaluate Social Tagging Systems: A Case Study in the Flickr Database. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
ISO 2788 (1986). Guidelines for the Establishment and Development of Monolingual Thesauri.
Jaccard, P. (1901). Etude Comparative de la Distribution Florale dans une Portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
Kalbach, J. (2008). Navigating the Long Tail. Bulletin of the ASIST, 34(2), 36–38.
Kessler, M. M. (1963). Bibliographic Coupling between Scientific Papers. American Documentation, 14, 10–25.
Kipp, M. E. I. (2006a). Complementary or Discrete Contexts in Online Indexing: A Comparison of User, Creator, and Intermediary Keywords. Canadian Journal of Information and Library Science, 30(3), from http://dlist.sir.arizona.edu/1533.
Kipp, M. E. I. (2006b). Exploring the Context of User, Creator and Intermediary Tagging. In Proceedings of the 7th Information Architecture Summit, Vancouver, Canada.
Kipp, M. E. I. (2007). Tagging for Health Information Organisation and Retrieval. In Proceedings of the North American Symposium on Knowledge Organization, Toronto, Canada (pp. 63–74).
Kipp, M. E. I., & Campbell, D. (2006). Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices. In Proceedings of the 17th Annual Meeting of the American Society for Information Science and Technology, Austin, Texas, USA.
Kleinberg, J. (1999). Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), No. 5, from http://portal.acm.org/citation.cfm?id=345966.345982.
Kroski, E. (2005). The Hive Mind: Folksonomies and User-Based Tagging, from http://infotangle.blogsome.com/2005/12/07/the-hive-mind-folksonomies-and-user-based-tagging.
Kuhlen, R. (1977). Experimentelle Morphologie in der Informationswissenschaft. München: Verlag Dokumentation.
Kuhlen, R., Seeger, T., & Strauch, D. (Eds.) (2004). Grundlagen der praktischen Information und Dokumentation. München: Saur.
Kulyukin, V. A., & Settle, A. (2001). Ranked Retrieval with Semantic Networks and Vector Spaces. Journal of the American Society for Information Science and Technology, 52, 1224–1233.
Lancaster, F. (2003). Indexing and Abstracting in Theory and Practice (3rd ed.). University of Illinois: Champaign.
Lee, J. H., Kim, W. Y., Kim, M. H., & Lee, Y. J. (1993). On the Evaluation of Boolean Operators in the Extended Boolean Retrieval Framework. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, USA (pp. 291–297).
Lewandowski, D. (2005a). Web Searching, Search Engines and Information Retrieval. Information Services & Use, 25(3-4), 137–147.
Lewandowski, D. (2005b). Web Information Retrieval. Technologien zur Informationssuche im Internet. Frankfurt am Main: DGI.
Liddy, E. D., Paik, W., & Yu, E. S. (1993). Natural Language Processing System for Semantic Vector Representation which Accounts for Lexical Ambiguity. US 5.873.058.
Lin, X., Beaudoin, J., Bui, Y., & Desai, K. (2006). Exploring Characteristics of Social Classification. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Lovins, J. B. (1968). Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11(1-2), 22–31.
Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal, 2(2), 159–165.
Macgregor, G., & McCulloch, E. (2006). Collaborative Tagging as a Knowledge Organisation and Resource Discovery Tool. Library Review, 55(5), 291–300.
Mai, J. (2006). Contextual Analysis for the Design of Controlled Vocabularies. Bulletin of the ASIST, 33(1), 17–19.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge Univ. Press.
Markey, K. (1984). Inter-indexer Consistency Tests: A Literature Review and Report of a Test of Consistency in Indexing Visual Materials. Library & Information Science Research, 6, 155–177.
Maron, M. E., & Kuhns, J. L. (1960). On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM, 7, 216–244.
Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication Through Shared Metadata, from www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.
Medeiros, N. (2008). Screw Cap or Cork? Keeping Tags Fresh (and Related Matters). OCLC Systems & Services, 24(2), from http://eprints.rclis.org/14042/1/ELIS_OTDCF_v24no2.pdf.
Miller, G. (1998). Nouns in WordNet. In C. Fellbaum (Ed.), WordNet. An Electronic Lexical Database (pp. 23–46). Cambridge, Mass., London: MIT Press.
Newman, M. E. J. (2005). Power Laws, Pareto Distributions and Zipf's Law. Contemporary Physics, 46(5), 323–351.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web, from http://ilpubs.stanford.edu:8090/422.
Peters, I., & Stock, W. G. (2007). Folksonomies and Information Retrieval. In Joining Research and Practice: Social Computing and Information Science. Proceedings of the 70th ASIS&T Annual Meeting, Milwaukee, Wisconsin, USA (pp. 1510–1542).
Peters, I., & Weller, K. (2008). Paradigmatic and Syntagmatic Relations in Knowledge Organization Systems. Information - Wissenschaft & Praxis, 59(2), 100–107.
Peterson, E. (2006). Beneath the Metadata. D-Lib Magazine, 12(11), from http://dlib.org/dlib/november06/peterson/11peterson.html.
Porter, M. F. (1980). An Algorithm for Suffix Stripping. Program, 14(3), 130–137.
Quintarelli, E. (2005). Folksonomies: Power to the People. In Proceedings of the ISKO Italy-UniMIB Meeting, Milan, Italy.
Rasmussen, E. M. (1992). Clustering Algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval. Data Structure & Algorithms (pp. 419–442). Upper Saddle River, NJ: Prentice Hall PTR.
Redmond-Neal, A., & Hlava, M. M. K. (2005). ASIS&T Thesaurus of Information Science, Technology, and Librarianship (3rd ed.). Medford, NJ: Information Today.
Richardson, M., Prakash, A., & Brill, E. (2006). Beyond PageRank: Machine Learning for Static Ranking. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 707–715).
Robertson, S. E., & Sparck Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science and Technology, 27, 129–146.
Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. In G. Salton (Ed.), The SMART Retrieval System - Experiments in Automatic Document Processing (pp. 313–323). Englewood Cliffs, NJ: Prentice Hall PTR.
Salton, G. (1968). Automatic Information Organization and Retrieval. New York: McGraw-Hill.
Salton, G., & McGill, M. J. (1987). Information Retrieval: Grundlegendes für Informationswissenschaftler. Hamburg: McGraw-Hill.
Salton, G., Fox, E. A., & Wu, H. (1983). Extended Boolean Information Retrieval. Communications of the ACM, 26, 1022–1036.
Salton, G., Wong, A., & Yang, C. S. (1975). A Vector Space Model for Automatic Indexing. Communications of the ACM, 18, 613–620.
Schmitz, J. (2007). Möglichkeiten der Patentinformetrie: Ein Überblick. In Proceedings der 29. Online-Tagung der DGI, Frankfurt a. M., Germany (pp. 31–48).
Shirky, C. (2003). Power Laws, Weblogs and Inequality: Diversity plus Freedom of Choice Creates Inequality. In J. Engeström, M. Ahtisaari & A. Nieminen (Eds.), Exposure: From Friction to Freedom (pp. 77–81). USA: Aula.
Shirky, C. (2005a). Ontology is Overrated: Categories, Links, and Tags, from http://www.shirky.com/writings/ontology_overrated.html.
Shirky, C. (2005b). Semi-Structured Meta-Data Has a Posse: A Response to Gene Smith, from http://tagsonomy.com/index.php/semi-structured-meta-data-has-a-posse-a-response-to-gene-smith.
Sinclair, J., & Cardew-Hall, M. (2008). The Folksonomy Tag Cloud: When Is It Useful? Journal of Information Science, 34(1), 15–29.
Skusa, A., & Maaß, C. (2008). Suchmaschinen: Status Quo und Entwicklungstendenzen. In D. Lewandowski & C. Maaß (Eds.), Web-2.0-Dienste als Ergänzung zu algorithmischen Suchmaschinen (pp. 1–12). Berlin: Logos.
Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship between Two Documents. Journal of the American Society for Information Science and Technology, 24, 265–269.
Sneath, P. H. A. (1957). Some Thoughts in Bacterial Classification. Journal of General Microbiology, 17, 184–200.
Spärck Jones, K. (2004 [1972]). A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28(1), 11–21.
Sparck-Jones, K., & Willett, P. (Eds.) (1997). Readings in Information Retrieval. San Francisco: Morgan Kaufmann.
Stock, W. G. (2006). On Relevance Distributions. Journal of the American Society for Information Science and Technology, 57(8), 1126–1129.
Stock, W. G. (2007a). Folksonomies and Science Communication. A Mash-up of Professional Science Databases and Web 2.0 Services. Information Services & Use, 27, 97–103.
Stock, W. G. (2007b). Information Retrieval. Informationen suchen und finden. München, Wien: Oldenbourg.
Stock, W. G., & Stock, M. (2008). Wissensrepräsentation. Informationen auswerten und bereitstellen. München: Oldenbourg.
Taylor, A. G. (1999). The Organization of Information. Englewood: Libraries Unlimited.
Tennis, J. (2006). Social Tagging and the Next Steps for Indexing. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Tudhope, D., & Nielsen, M. L. (2006). Introduction to Knowledge Organization Systems and Services. New Review of Hypermedia and Multimedia, 12(1), 3–9.
van Damme, C., Hepp, M., & Siorpaes, K. (2007). FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies. In Proceedings of the European Semantic Web Conference, Innsbruck, Austria (pp. 71–85).
Voß, J. (2006). Collaborative Thesaurus Tagging the Wikipedia Way, from http://arxiv.org/abs/cs.IR/0604036.
Voß, J. (2007). Tagging, Folksonomy & Co - Renaissance of Manual Indexing? In Proceedings of the 10th International Symposium for Information Science, Cologne, Germany.
Wasserman, S., & Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
Watts, D. J., & Strogatz, S. H. (1998). Collective Dynamics of 'Small-world' Networks. Nature, 393, 440–442.
Weinberger, D. (2006). Taxonomies and Tags from Trees to Piles of Leaves, from http://hyperorg.com/blogger/misc/taxonomies_and_tags.html.
Weller, K. (2006). Kooperativer Ontologieaufbau. In Proceedings der 28. Online Tagung der DGI, Frankfurt a. M., Germany (pp. 108–115).
Weller, K. (2007). Folksonomies and Ontologies. Two New Players in Indexing and Knowledge Representation. In Applying Web 2.0. Innovation, Impact and Implementation. Proceedings of Online Information Conference, London, GB (pp. 108–115).
Yaltaghian, B., & Chignell, M. (2002). Re-ranking Search Results using Network Analysis: A Case Study with Google. In Proceedings of the Conference of the Centre for Advanced Studies on Collaborative Research, Toronto, Canada.
Zeng, M. L. (2000). Taxonomy of Knowledge Organization Sources/Systems, from http://nkos.slis.kent.edu/KOS_taxonomy.htm.
Zimmermann, H. H. (1998). Automatische Indexierung und elektronische Thesauri, from http://fiz1.fh-potsdam.de/volltext/saarland/03035.html.
Chapter 3
Knowledge Representation in Web 2.0: Folksonomies
As opposed to the other collaborative information services (see chapter 1), folksonomies do not provide any immediate services or offer any platform for the exchange or storage of information resources – rather, they represent a certain functionality of collaborative information services. Folksonomies have been developed out of the need for a better structure of the growing mass of information within these information services, the need to make them searchable and retrievable. By now, they have established themselves as a method of knowledge representation on the internet, and particularly in Web 2.0.
Definition of the Term ‘Folksonomy’
Folksonomies consist of freely selectable keywords, or tags, which can be attached at will to any information resource; hence the term ‘tag’, which might be defined as either an identifying label or the mark hung around a dog’s neck signifying ownership. These are considered “the electronic equivalent of Post-It notes” (Müller-Prove, 2008, 15)*. The user of the respective information – producer as well as recipient – becomes an indexer (Weinberger, 2005; Berendt & Hanser, 2007), as he decides which and how many tags he wishes to assign (Hayman, 2007; Hayman & Lothian, 2007). He is not bound by any guidelines. This process of free indexing is called tagging: “Collaborative tagging describes the process by which many users add metadata in the form of keywords to shared content” (Golder & Huberman, 2006, 198); the totality of all tags on any given information platform forms the folksonomy (Trant, 2009). Apart from ‘folksonomy’ there is a multitude of different terms for the same concept, which all emphasize different aspects: ‘democratic indexing’ (Kellog Smith, 2006), ‘social classification system’ (Feinberg, 2006), ‘collaborative classification system’ (Schmidt, 2007a), ‘ethnoclassification’ (Star, 1996), ‘collaborative tagging’ (Golder & Huberman, 2006), ‘grassroots taxonomy’ (Weinberger, 2005), ‘user-generated metadata’ (Dye, 2006), ‘communal categorization’ (Sturtz, 2004), ‘folk wisdom’ (McFedries, 2006) and ‘mob indexing’ (Morville, 2005) characterize the social component of tagging and the perspective of the indexer, whereas ‘tagosphere’ (Gruber, 2005), ‘tagsonomy’ (Hayes, Avesani, & Veeramachaneni, 2007), ‘metadata for the masses’ (Merholz, 2004), ‘people-powered metadata’ (Smith, 2008b), ‘lightweight knowledge representation’ (Schmitz et al., 2007), ‘tag soup’ (Hammond et al., 2005)
and ‘a revolutionary form of verbal indexing’ (Kipp & Campbell, 2006) all highlight the tags themselves. The term ‘folksonomy’ gained currency when Gene Smith (2004) quoted Thomas Vander Wal in his information-architecture-themed blog: Last week I asked the AIfIA [Asilomar Institute for Information Architecture, author’s note] member’s list what they thought about the social classification happening at Furl, Flickr and Del.icio.us. In each of these systems people classify their pictures/bookmarks/web pages with tags […], and then the most popular tags float on the top […]. Thomas Vander Wal, in his reply, coined the great name for these informal social categories: a folksonomy. […] Still, the idea of socially constructed classification schemes (with no input from an information architect) is interesting. Maybe one of these services will manage to build a social thesaurus (Smith, 2004).
Vander Wal (2007) himself defines folksonomies as follows: Folksonomy is the result of personal freetagging of information and objects (anything with a URL) for one’s own retrieval. The tagging is done in a social environment (shared and open to others). The act of tagging is done by the person consuming the information. The value in this external tagging is derived from people using their own vocabulary and adding explicit meaning, which may come from inferred understanding of the information/object. The people are not so much categorizing as providing a means to connect items and to provide their meaning in their own understanding (Vander Wal, 2007).
“Folksonomy” is a combination of the words ‘folk’ and ‘taxonomy’ and simply means “a taxonomy created by the people.” The adequacy of the component ‘taxonomy’ is the subject of fierce debate (Mathes, 2004); the term, it is argued, has misleading connotations and gives undue emphasis to the hierarchical character of any classification of terms. Smith (2004), above, uses this last word, ‘classification,’ as an alternative to ‘folksonomies.’ Hammond et al. (2005) also use the term ‘social classification’: While suitably folksy, and enjoying a certain cachet at the present time, it leans too much on the notion of taxonomy, which is not obviously present in social (or distributed) classification systems although may yet be derivable from them (Hammond et al., 2005). We would generally incline to the term “social classification”, or even ‘distributed classification,’ as this, to our minds, most closely describes the nature of the activity […] (Hammond et al., 2005).
Like “taxonomy,” this actually points in the wrong direction. Folksonomies are not classifications, since they use neither notations nor relations: One important fact about tagging is that it is ‘post-hoc’ categorization, not pre-optimized classification, and so is more likely to optimally characterize the data. Tagging represents a fundamental shift in the way the world is viewed from the expert-designed ontology. While ontologies classify data, tagging categorizes data (Halpin & Shepard, 2006).
Rather, folksonomies should be understood as a simple collection, or list, of all users’ keywords (Reamy, 2006; Mathes, 2004). But since folksonomies are mostly restricted to one platform and accessible to everyone, we can speak, if not of a controlled vocabulary, at the very least of a collective vocabulary: “The set of all
tags utilized by the user community represents the shared vocabulary” (Maass, Kowatsch, & Münster, 2007, 47). To lend further credence to this perspective, Spyns et al. coined the term ‘folksabularies’ to replace ‘folksonomies’: In fact, the term ‘folksonomies’ is semantically speaking too broad as there is no taxonomy involved, but only a vocabulary (a bag of words). ‘Folksabularies’ (in analogy with vocabularies) would thus be a more accurate term (Spyns et al., 2006, 744).
The connection of the tags with each other is a syntagmatic process by definition, that is to say it is a direct result of the tags’ co-occurrence. Paradigmatic relations, like hierarchy or equivalence relations, as used in controlled vocabularies, do not typically occur in folksonomies: Typically, users are not allowed to specify relations between tags. Instead, tags serve as a set of atomic symbols that are tied to a document. […] At first glance, this would seem like a very weak language for describing documents; users have no way to indicate that two tags are meant to be the same, or that a ‘contains’ relation exists between tags […] or that a set of tags form a complete enumeration of possible values (Brooks & Montanez, 2006a, 9f.).
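That syntagmatic relations arise purely from co-occurrence can be illustrated with a small count of tag pairs. The sketch assumes that a folksonomy is available as a mapping from resources to their tags. The resource names, the tags and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def tag_cooccurrences(tagged_resources):
    """Count how often two tags are attached to the same resource."""
    counts = Counter()
    for tags in tagged_resources.values():
        # Each unordered pair of distinct tags on one resource co-occurs once.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


# Hypothetical excerpt from a photo-sharing folksonomy.
resources = {
    "photo-1": ["duesseldorf", "rhine", "bridge"],
    "photo-2": ["rhine", "bridge"],
    "photo-3": ["duesseldorf", "carnival"],
}

cooccurrences = tag_cooccurrences(resources)
print(cooccurrences.most_common(1))
```

Here ‘bridge’ and ‘rhine’ co-occur on two resources and are therefore the most strongly connected pair. Nothing in the data says whether one tag is broader than, narrower than or equivalent to the other, which is exactly the missing paradigmatic information.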
The approaches of tag gardening (see also further below) are meant to combine methods of traditional knowledge representation with folksonomies, in order to enrich folksonomies with semantic content. Spyns et al. (2006) call the result of these endeavors ‘folksologies:’ It basically means that ontologies resp. taxonomies emerge on the fly thanks to individuals who autonomously create and use any tag they like to annotate any web page they deem worthwhile. […] There is a lot of confusion about and inconsistent use of both terms. Folksonomies relate to folksologies in the same way as taxonomies relate to ontologies, i.e. single vs. multiple types of relationships between concepts. Their commonality, however, is the focus on the social process leading to informal semantics (Spyns et al., 2006, 739).
Since the term ‘folksonomy’ has won the day in spite of the extensive debate on the correct name for the totality of tags, I will refer to folksonomies throughout this book. ‘Tagging’ or ‘social tagging’ are also used as synonyms in many publications, even though technically these terms refer to the act of indexing via tags rather than the mass of tags. Vander Wal (2005a) points out that the term ‘folksonomy’ always takes into consideration the social component of tagging; it incorporates the visibility of all tags to all users and the possibility of adding tags to all resources into its meaning. Also according to Vander Wal (2005a), the term ‘tagging’ rather refers to personal information and resource management that does not make the tags public. Tonkin et al. (2008) likewise advocate a differentiation of the concepts: “If members of a community do not tag differently for the community than they do for themselves, is it truly social tagging?” (Tonkin et al., 2008). The term ‘tagging system’ refers to the collaborative information services that use tagging as an indexing method. The task and the usefulness of folksonomies lie in structuring and representing the content of information resources (bookmarks, photos, videos, etc.): “Tagging systems are built to enable sharing and managing citations, photos, and web pages” (Tennis, 2006). Thus folksonomies, like other methods of knowledge representation, create content-descriptive metadata (Al-Khalifa & Davis,
2006). Accordingly tags assume a ‘proxy function’ for the resource, represent its content in the collaborative information service and provide an access path. Tags have already made a name for themselves as ‘metatags’ for the description of websites: “Meanwhile, there has also been a current of users attempting to annotate their own pages with metadata. This began with the tags which allowed for keywords on a web page to aid search engines” (Heymann, Koutrika, & Garcia-Molina, 2008, 197). Berendt and Hanser (2007) argue that tags are more than just metadata; they are “just more content.” This they deduce from their analysis of tags and blog entries, which shows that tags and blog entries complement each other rather than creating an overlap: “Author tags contain semantically different information than the body of a blog post. […] Author tags and the body of a blog post contain complementary information” (Berendt & Hanser, 2007). According to Berendt and Hanser (2007), the task of metadata is strictly limited to summarizing the resource’s content and to disambiguation; tags therefore cannot be viewed as metadata, since they contain other semantic information and instead provide complementary statements on the resource’s content: With regard to the question of how tags add content, we might say that they do not (or not always) add information that is missing in the text, but select or highlight aspects of meaning that for some readers may be less relevant or not even present in the body. This is related to the disambiguating function of metadata, but it is not the same (Berendt & Hanser, 2007).
So according to Berendt and Hanser (2007), tags should be viewed as complementary to the resources, not as proxies. This point is still being fiercely debated, as the following quote demonstrates: Despite all current hype about tags […] for the authors of this paper, tags are just one kind of metadata and are not a replacement for formal classification systems such as Dublin Core, MODS, etc. Rather, they are supplemental means to organize information and order search results (Hammond et al., 2005).
Hammond et al. (2005) localize folksonomies as metadata among the methods of knowledge representation, while stressing that tags can only ever complement and never replace these elaborate methods of indexing content. Tags are too weak and error-prone for anything more. However, it must be kept in mind that tags often represent the only available access path, or research option, particularly for nontextual information resources (Fichter, 2006). Folksonomies mainly serve the arrangement of resources. These structuring endeavors on the part of the users are often of a purely personal nature at the outset (Mathes, 2005b; Bates, 2006) and are arranged individually: Tagging systems are […] to be understood as software applications, in which the user is guided by rules while classifying objects according to his or her own classification criteria, on the basis of certain routines and expectations, and research with a view to specific information needs – initially, they represent individual information management (Schmidt, 2007b)*.
That is also the reason this personal tagging is mostly used to retrieve already found objects within the information services (Feinberg, 2006). The totality of a user’s used tags is also called a ‘personomy’ (Hotho et al., 2006b; Quintarelli, 2005).
Folksonomies’ social, or collective (Vander Wal, 2008), character only asserts itself during a folksonomy-based search, when the user works his way to his desired search result using other users’ tags. Folksonomies’ strength and attractiveness lie in their openness to the community’s influence, allowing multiple opinions, interpretations, definitions, but also spelling options. After all, “social tagging systems rely on shared and emergent social structures and behaviors, as well as related conceptual and linguistic structures of the user community” (Marlow et al., 2006a; 2006b). In order to accommodate this social versatility on a language level, Vander Wal (2008) suggests that this sort of tagging be called ‘collective tagging.’ The indexing user may choose his tags deliberately (Held & Cress, 2008), but does not originally aim to reach a consensus with the other users of the information platform concerning the different tags (Smith, 2008a). He just wants to retrieve his information resources: “For taggers, it’s not about the right or the wrong way to categorize something and it’s not about accuracy or authority, it’s about remembering” (Kroski, 2005). Hence Panke and Gaiser (2008b) talk about the “collective memory of the individual keyword allocators” (Panke & Gaiser, 2008b, 16)*. Collective tagging is used in the social bookmarking system del.icio.us, for example. Wikipedia also uses collective indexing to categorize its articles. According to Vander Wal (2008), however, this is ‘collaborative tagging’ and thus, strictly speaking, not a folksonomy. Here the users’ goal is to reach a consensus concerning the definition of a category (Vander Wal, 2008). Hammond et al. (2005) also stress this difference: “Wikipedia uses a shared taxonomy – albeit one generated by a community – not free-form tags.” As opposed to folksonomies, which are formed from the amalgamation of a great many personomies, or in other words, which are created directly from the public network of the users, categorization on Wikipedia is governed by the desire for an agreement between the users and based on jointly declared rules: “There is value in the collaborative tagging, but it is not as rich an understanding as the collective approach” (Vander Wal, 2008, 8).
Tags – Users – Resources
Folksonomies can also be defined as a tripartite hypergraph (Albrecht, 2006; Lambiotte & Ausloos, 2006; Wu, Zubair, & Maly, 2006), since they confront you with three different aspects at the same time (Marlow et al., 2006a; 2006b):
• with the resources to be described, which must be clearly referenceable (via a URL, for example; Müller-Prove, 2008; Steels, 2006),
• with the tags that are selected for the description,
• with the users that perform the indexing.
In some discussions on the topic a fourth component is introduced. Thus Maass, Kowatsch and Münster (2007), as well as John and Seligmann (2006), mention the factor time, that is to say the time of indexing. This fourth piece of information is particularly useful for the observation of changes within the tag, user or resource structures (Jäschke et al., 2006; Dellschaft & Staab, 2008). Van Damme, Hepp and Siorpaes (2007), however, localize the fourth factor within the tagging system, in the Web 2.0 platform itself. Over the course of this book, though, I will follow the general consensus and refer to the tripartite hypergraph. Here the users as well as the resources are
connected amongst each other in a social network, where the tags serve as paths (Lambiotte & Ausloos, 2006). Here the tags can also be described as hyperlinks with many destinations (Müller-Prove, 2008) or as ‘open hyperlinks’ (Steels, 2006), which allow for the dynamic linkage of resources. This means that the allocation of tags also allows for resources unknown to the user to be linked, thus providing access to them (Panke & Gaiser, 2008a). This connection between resources, users and tags can be exploited in many ways: “These connections define an implicit relationship between resources through the users that tag them; similarly, users are connected by the resources they tag” (Marlow et al., 2006a; 2006b). To wit, the resources are linked thematically if they have been indexed with the same tag. The resources 1 and 2 as well as 3 and 4 in Figure 3.1 are each thematically linked (resources 1 and 2 via tag 2; resources 3 and 4 via tag 4). Additionally, resources are connected by common users. Thus the documents 1 and 2, 3 and 4, but also 2 and 4 are linked through their users. Co-resources are created analogously to co-citations (Small, 1973) if the resources have been indexed via the same tag or by the same user (resources 3 and 4 by user 3). Users are thematically linked if they use the same tag (in the example, users 1 and 2 via tags 1 and 2); they are bibliographically coupled via common resources (Kessler, 1963) if they describe their respective content (users 1, 2 and 3 via resource 2). Van Damme, Hepp and Siorpaes (2007) form three bilateral networks to describe these relations: “Out of a tripartite model of tags, objects, and actors, three bipartite graphs were generated based on the cooccurrence of its elements: the AC (actor-tag) graph, AI (actor-object) graph, and the CI (tag-object) graph“ (van Damme, Hepp, & Siorpaes, 2007, 59).
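The relations described above can be sketched in a few lines of Python. This is only an illustrative sketch: the tag assignments are invented and do not reproduce Figure 3.1, and the helper names `project` and `linked_pairs` are hypothetical. Each assignment is a (user, tag, resource) triple; the bipartite views and the resulting links between resources and between users are derived from these triples.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tag assignments as (user, tag, resource) triples.
ASSIGNMENTS = {
    ("user1", "tag1", "res1"),
    ("user1", "tag2", "res1"),
    ("user2", "tag2", "res1"),
    ("user2", "tag2", "res2"),
    ("user3", "tag3", "res3"),
    ("user3", "tag4", "res3"),
    ("user3", "tag4", "res4"),
}

def project(assignments, key_index, value_index):
    """Bipartite view: map each key element to the set of value elements it co-occurs with."""
    graph = defaultdict(set)
    for triple in assignments:
        graph[triple[key_index]].add(triple[value_index])
    return dict(graph)

def linked_pairs(graph):
    """Pairs of values that share at least one key (e.g. two resources sharing a tag)."""
    pairs = set()
    for values in graph.values():
        pairs.update(combinations(sorted(values), 2))
    return pairs

# Resources linked thematically (same tag) and via common users (co-resources).
resources_by_tag = linked_pairs(project(ASSIGNMENTS, 1, 2))
resources_by_user = linked_pairs(project(ASSIGNMENTS, 0, 2))

# Users linked thematically (same tag) and coupled via common resources.
users_by_tag = linked_pairs(project(ASSIGNMENTS, 1, 0))
users_by_resource = linked_pairs(project(ASSIGNMENTS, 2, 0))

# A personomy is the set of tags used by one user.
personomies = project(ASSIGNMENTS, 0, 1)
```

Storing only the triples and deriving every other view from them keeps the three element types on an equal footing, which is the point of the tripartite model.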
Figure 3.1: The Tripartite Graph of Resource, User and Tag (relations labeled in the figure: bibliographic coupling, co-resources). Source: Peters & Stock (2008, 86, Fig. 15).
Furthermore, the resources can also be connected amongst each other via hyperlinks, users on the other hand via group affiliation, for example. Tags are related to each other through co-occurrences. The analysis of resource linkage is the task of link
topology, social network analysis explores group affiliations and the syntagmatic relations among tags are discussed in the debates on the semantic web or the creation of ontologies, amongst others (Marlow et al., 2006a; 2006b).
Figure 3.2: Tripartite Tagging System: Model ‘User – Tag – Information.’ Source: Tonkin et al. (2008, Fig. 2).
Figure 3.3: Tripartite Tagging System: Model ‘User – Tag – User.’ Source: Tonkin et al. (2008, Fig. 3).
Tonkin et al. (2008) ascertain three sorts of tripartite tagging systems. They use the term ‘Information’ for the naming of resources, which will be maintained at this point. Figure 3.2 shows the model ‘User – Tag – Information’ and places the emphasis on the relation between user and information. Hence the tag emphasizes how the user perceives the information. The second model, ‘User – Tag – User’ (see Figure 3.3), places the emphasis on the relations between users, demonstrating their connectedness via jointly used tags, that is to say the thematic coupling of users. This model is also useful for representing the problem of spam tags. Tags could be subject to false allocations, deliberate or not, thus leading users down the wrong track while browsing.
In the third model, ‘Information – Tag – Information’ (Figure 3.4), the relation between different pieces of information, or different ‘banks of data’ (Tonkin et al., 2008), becomes particularly important. Here tags act as metadata that make relations explicit, which is especially relevant to the creation of ontologies.
Figure 3.4: Tripartite Tagging System: Model ‘Information – Tag – Information.’ Source: Tonkin et al. (2008, Fig. 4).
The social networking website 43Things uses the relation between tags and users (see model 2, Fig. 3.3) in a special way. The members of 43Things use tags to compile a list of 43 things they want to accomplish in life. They make notes on how they plan to achieve these things, why they want to do them, etc. Clicking on a tag leads to the exact number of users, in other words, to the community, that also has this tag on their list: “For example, x on 43things wants to climb Grouse Mountain near Vancouver. He has this in common with y, z, and # of other users of the system“ (Tennis, 2006). This facilitates communication with like-minded people and shows how the thematic coupling of users can be exploited. “Conceptually, it is equal parts Friendster, folksonomy, and Oprah, aimed at giving the Internet-savvy a place to connect and collaborate over tagged ambitions” (Dye, 2006, 41). The trilateral network is always a part of collaborative information services with a folksonomy function; the user allocates a tag to a resource and thus makes it retrievable. Since the connection of tag – user – resource is publicly accessible, it is easy to generate knowledge on users and their tagging behavior, for example. It is this characteristic of folksonomies in particular that accounts for their attractiveness as a method of knowledge representation and as a tool in a retrieval system. The users can see which information resources have been indexed with what tags, or which users have used the same tags. “These relations made public between user, object and keyword thus lead to the users performing collective information management” (Schmidt, 2007b)*. This knowledge leads to a folksonomy-based information system, which is very well suited for browsing resources, tags and users. The relevance of single tags, and thus very often the relevance of the information resource as well, is a product of the frequency of its usage. 
Since a tag can only be allocated to a resource once by every user, the resource’s or the tag’s importance is generated from the collection of all annotated tags. Users determine the relevance of
a resource through their collective tagging and thus implicitly rate or recommend it: “Today, as the adopters of tagging systems enthusiastically label their bookmarks and photos, they are implicitly voting with their tags” (Gruber, 2005). This procedure can be compared to the PageRank’s functionality (see chapter 2) – a resource’s most frequently used tags reflect its content and serve as a recommendation for it: “This is critical to future usage and to the future retrieval in particular, in the same way that manually establishing links between Web pages confer some of these pages authority over others” (Maarek et al., 2006).
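The idea of users ‘voting with their tags’ can be illustrated with a small frequency count. The following is a minimal sketch under stated assumptions: the user names, tags and resources are invented, the function names `tag_votes` and `rank_resources` are hypothetical, and each user counts at most once per tag and resource, as described above.

```python
from collections import Counter

# Hypothetical (user, tag, resource) assignments; the repeated triple is ignored.
assignments = [
    ("anna", "python", "doc1"),
    ("anna", "python", "doc1"),
    ("ben", "python", "doc1"),
    ("cara", "tutorial", "doc1"),
    ("anna", "python", "doc2"),
]

def tag_votes(assignments, resource):
    """Count how many distinct users allocated each tag to the given resource."""
    user_tag_pairs = {(user, tag) for user, tag, res in assignments if res == resource}
    return Counter(tag for _, tag in user_tag_pairs)

def rank_resources(assignments, tag):
    """Order resources by the number of distinct users who allocated the given tag."""
    distinct = set(assignments)
    counts = Counter(res for _, t, res in distinct if t == tag)
    return counts.most_common()

votes_doc1 = tag_votes(assignments, "doc1")
ranking_python = rank_resources(assignments, "python")
```

In this toy data, `doc1` ranks ahead of `doc2` for the tag `python` because two users rather than one allocated it; a real service would combine such counts with further ranking factors.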
Cognitive Skills
One of folksonomies’ advantages is often identified as the fact that they are easy to use. The allocation of tags is meant to be easier, to require less cognitive effort, than categorization or indexing with controlled vocabularies (Shirky, 2005a). To substantiate this, there are numerous studies that explore cognitive efforts during tagging (Civan et al., 2008). Butler (2006), for instance, performs a ‘verbal protocol analysis’ for tagging objects for personal retrieval and for general retrieval. The goal here is to provide information on the “cognitive operations at work when tagging” (Butler, 2006). Sinha (2005) also analyzes the cognitive efforts at work during tagging, comparing them to the allocation of resources into a knowledge organization system. Professional indexers perform the latter task, and are furthermore guided by binding sets of regulations (indexing rules) (Stock & Stock, 2008, 342-364). Sinha determines that tagging and conceptualization do not differ quite as strongly in their cognitive execution as previously suspected. In Figure 3.5 the tagging process is displayed. In a first step, the user assesses the similarity between the object to be tagged and the terms stored in his memory (‘candidate concepts’), which mentally activates a quantity of similar or linked terms. The user then selects one or more of these and annotates it/them as tag(s). “Writing down some of these concepts is easy enough. With tagging there is no filtering involved at this stage, you can note as many of those associations as you want” (Sinha, 2005). Conceptualization generally follows the same procedure as tagging, but involves one further step, as sketched in Figure 3.5. After a quantity of concepts for the object has been activated, it must be decided which concept will be selected (‘categorization’).
This is achieved by assessing the similarity between the object and the possible concepts (‘candidate categories’) and can be explained via the Shepard-Luce formula. The Shepard-Luce formula describes a cognition-psychological approach to selecting concepts and terms (Luce, 1959; Shepard, 1964; Logan, 2004; Andrich, 2002). Attention: “Choice probability increases with strength of evidence that an object belongs to a category” (Andrich, 2002). This process seems difficult and elaborate, but categorization is an innate human skill and thus easily accomplished. Tagging and conceptualization would then seem to require a roughly equal amount of cognitive effort. Sinha (2005) sees the great difference between the two approaches in the nature of the objects to be categorized and in ‘post activation analysis paralysis:’ Cognitively, we are equipped to handle making category decisions. So, why do we find this so difficult, especially in the digital realm, to put our email into folders, categorize our bookmarks, sort our documents. Here are some
factors that lead to what I call the ‘post activation analysis paralysis’ (Sinha, 2005).
Figure 3.5: Cognitive Tagging Process. Source: Sinha (2005).
There is no cultural consensus or cultural knowledge on how digital resources had best be filed in the system of concepts. The system’s borders are often clearly demarcated for real objects like ‘apple’ or ‘potato,’ and the allocation of these has grown culturally for centuries. For digital objects, however, it is not only the correct allocation into concepts that is important, but also their future retrievability: We need to consider not just the most likely category, but also where we are most likely to look for the item at the time of finding. These two questions might lead to conflicting answers, and complicate the categorization process (Sinha, 2005).
Furthermore, the allocation into the system of concepts must be achieved while following the fundamental schema and taking care that the allocations are balanced. This includes answering such questions as “Are there too many objects under this concept?” and “Must I create a new concept for this object?” According to Sinha (2005), all these factors play a big role in the allocation of objects to concepts and manifest themselves in the fear – the ‘post activation analysis paralysis’ – of making a wrong decision, performing a false allocation and never finding the object again. The future retrieval effort for the object is thus already sized up during the conceptualization and taken into consideration for the allocation, a mental process which actually intensifies the cognitive effort for the conceptualization process. “Why is tagging simpler. In my opinion, tagging
eliminates the decision – (choosing the right category), and takes away the analysis paralysis stage for most people“ (Sinha, 2005).
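The categorization decision that tagging skips can be made concrete with the similarity-based choice rule associated with Shepard and Luce. The text does not spell the formula out, so the following sketch assumes a common textbook form: similarity decays exponentially with psychological distance, and the probability of choosing a category is its similarity divided by the summed similarity of all candidate categories (bias terms are omitted). The distances, the category names and the `sensitivity` parameter are invented for illustration.

```python
import math

def choice_probabilities(distances, sensitivity=1.0):
    """Similarity choice rule (assumed textbook form, without bias terms).

    similarity(c) = exp(-sensitivity * distance(c))
    P(c)          = similarity(c) / sum of similarities over all candidates
    """
    similarities = {c: math.exp(-sensitivity * d) for c, d in distances.items()}
    total = sum(similarities.values())
    return {c: s / total for c, s in similarities.items()}

# Hypothetical distances between one object and three candidate categories.
distances = {"fruit": 0.2, "vegetable": 1.5, "tool": 4.0}
probabilities = choice_probabilities(distances)
most_likely = max(probabilities, key=probabilities.get)
```

The closest category receives the highest choice probability, in line with the statement that choice probability increases with the strength of evidence. In tagging, all sufficiently activated candidates can simply be written down instead of selecting one.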
Figure 3.6: Tag Cloud of the User ‘Marmalade Today’ for 1,478 Bookmarks. Source: http://delicious.com/MarmaladeToday (11/9/2008).
So tagging is not as easy as generally assumed. Tag allocation may be a simple procedure, but a useful application of tags in a resource management system costs a lot of effort. When working with tags, it becomes clear that there is a proportional relation between the number of resources to be indexed and the cognitive effort made: the more resources must be indexed, and the more resources already are indexed, the more tags must be used and hence, the more cognitive skills are necessary (Müller-Prove, 2008). After all, the user must differentiate between the different tags and allocate them correctly – even consistently over a certain period of time. As sketched impressively in Figure 3.6, the clarity of tags and folksonomies is sometimes not what it should be. In this example from the social bookmarking
system del.icio.us, 1,149 single tags were used to structure 1,478 bookmarks. So if the user wants to register a new bookmark, he needs to have all 1,149 previously used tags in his head and then decide whether these are sufficient or whether a new tag must be created. Strictly speaking, the user must also know the tags used by the other members of the platform, if he wants to be visible with his resource in the folksonomy of the information resource as well as in the folksonomy of the collaborative information service. It is obvious that enormous cognitive efforts are required of the user here, which goes some way towards defusing the argument that folksonomies are an easy method of knowledge representation.
Broad vs Narrow Folksonomies
Generally, we can differentiate between two sorts of folksonomies with regard to ‘tag scope’ (Sen et al., 2006, 183): 1) folksonomies that allow for the multiple allocation of a tag to the same resource and 2) folksonomies that are only generated from the author’s tags and may allow the adding of new tags by other users. In discussing variant 1, Marlow et al. (2006a; 2006b) talk of a ‘bag model’ of folksonomies, which gathers and collects all tags and registers their frequency of allocation, and for variant 2, of a ‘set model,’ which only registers each tag once per resource. Smith (2008b, 55ff.) defines variant 1 as ‘collaborative tagging,’ since not the resource itself is tagged but a ‘pointer’ – a proxy of the resource in the database – which is created for each indexer. Smith (2008b) then labels variant 2 ‘simple tagging,’ since here it is the resource that is tagged; as a consequence, each tag can only be indexed once. The other prominent terms (‘Broad vs. Narrow Folksonomy’) for their differentiation were introduced by Vander Wal (2005a; see also Dye, 2006), however. In a Broad Folksonomy (Figure 3.7), many different users (A through F in the figure) can tag a resource, and then find the resource in the search mode using the tags. Thus the resource’s content is described with the same, similar or completely different labels (1 through 5) from many perspectives. This leads to the indexing user not only being shown his own tags during the indexing process, but also recommendations from the entire database: “Users […] can see what other tags have been created for certain content and use this to get a broad idea of associated terms and tags” (Dye, 2006, 40). In terms of users’ write permissions, Marlow et al. (2006a; 2006b) suggest the term ‘free-for-all tagging.’ Since a Broad Folksonomy also registers the tags’ indexing frequency per resource, statistical derivations can be generated here (see also the section Tag Distributions).
This possibility for tag accumulation is seen as the true strength of Broad Folksonomies and particularly comes to bear on the observations on Collective Intelligence: The power of folksonomy is connected to the act of aggregating, not simply to the creation of tags. Without a social distributed environment that suggests aggregation, tags are just flat keywords, only meaningful for the user that has chosen them. The power is people here. The term-significance relationship emerges by means of an implicit contract between the users (Quintarelli, 2005).
The most prominent service in Web 2.0 that uses a Broad Folksonomy for multiple tagging is the social bookmarking service del.icio.us. There are also similar services
for the scientific community, such as CiteULike, Connotea (Hammond et al., 2005; Lund et al., 2005) and BibSonomy (Hotho et al., 2006a).
Figure 3.7: Broad and Narrow Folksonomies. Source: Vander Wal (2005a).
In Narrow Folksonomies, e.g. Flickr, tags are only indexed and registered once per resource. New tags can be added to the resources, however. In Narrow Folksonomies, there is no possibility of counting tag frequency on a resource level or of observing distributions. The author, or ‘content creator,’ often indexes the first couple of tags – according to Marlow et al. (2006a; 2006b), this is called ‘self-tagging.’ Vander Wal (2005a), however, also provides for the adding of tags by other users in his model, e.g. by friends of the author as on Flickr. This procedure reminds one of indexing via controlled vocabularies, such as nomenclatures, thesauri or classification systems (Stock & Stock, 2008, ch. 11 to 13), by professional indexers – apart from the fact that only uncontrolled terms are ever used in folksonomies. In some collaborative information services, only the content creator is allowed to add tags to his folksonomy, e.g. in YouTube. Strictly speaking, cases like these are no longer folksonomies, since the social element of collaborative indexing is missing. After all, neither can new tags be added by other users, nor tag distribution frequencies be observed on a resource level; only the accumulation of all tags for all resources on the platform, and thus an overview of the most used tags, is possible. Furthermore, the choice of tags is limited to the content creator’s individual point of view. This circumscribes the advantage of folksonomies, the taking into consideration of many different interpretations: “narrow folksonomies and resulting tags lack the social cohesion of broad folksonomies” (Dye, 2006, 41). In scientific literature, the term ‘Narrow Folksonomy’ is often used for the result of indexing by the content creator (e.g. Dye, 2006), as well as for indexing by the content creator and other users, albeit with a single registration of all added tags. The latter application follows the intentions of Vander Wal (2005a), who tried to connect ‘self-tagging’ with ‘free-for-all tagging’ and thus maintain the social and public character of folksonomies. The adding of tags broadens access to the resources, thus neutralizing the author’s limited perspective. After all: “The weak point of author categorization is the lack of objective viewpoints, which are essential for good categorization” (Ohkura, Kiyota, & Nakagawa, 2006). However, no tag distribution frequencies on a resource level can be calculated here, which makes the exploiting of statistical evaluations much harder. To forestall any confusion about the terms, this book will differentiate between
• Broad Folksonomies: the author and other users can add tags to resources, and the same tag can be added to a resource more than once (e.g. del.icio.us),
• Extended Narrow Folksonomies: the author and other users can add new tags to resources, which are registered only once (e.g. Flickr), and
• Narrow Folksonomies: only the author can add tags to resources, but other users can search using these tags (e.g. YouTube).
Within these collaborative information services, the extended form of the Narrow Folksonomy has established itself right beside the Broad Folksonomy, e.g. for Flickr (Beaudoin, 2007); however, there is often an option to mark tags as ‘private’ and thus withdraw them from public access.
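The three variants distinguished above differ in who may add tags and in whether repeated allocations are counted. The following sketch models this with a bag (counter) for the Broad Folksonomy and a set-like registration for the two narrow variants. The class `TaggedResource`, its method names and the example users are hypothetical and are not taken from any of the services mentioned.

```python
from collections import Counter

KINDS = ("broad", "extended_narrow", "narrow")

class TaggedResource:
    """One resource together with the tagging rules of its folksonomy variant."""

    def __init__(self, author, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown folksonomy kind: {kind}")
        self.author = author
        self.kind = kind
        self.tags = Counter()   # tag -> number of registered allocations
        self._seen = set()      # (user, tag) pairs already counted (broad only)

    def add_tag(self, user, tag):
        """Try to register a tag; return True if the allocation was accepted."""
        if self.kind == "narrow" and user != self.author:
            return False        # only the content creator may tag
        if self.kind == "broad":
            if (user, tag) in self._seen:
                return False    # each user counts once per tag and resource
            self._seen.add((user, tag))
            self.tags[tag] += 1  # bag model: frequencies accumulate
            return True
        if tag in self.tags:
            return False        # set model: each tag is registered once
        self.tags[tag] = 1
        return True

# Hypothetical usage: the same two allocations under each variant.
results = {}
for kind in KINDS:
    resource = TaggedResource(author="author", kind=kind)
    first = resource.add_tag("author", "sunset")
    second = resource.add_tag("visitor", "sunset")
    third = resource.add_tag("visitor", "beach")
    results[kind] = (first, second, third, dict(resource.tags))
```

Only the broad variant ends up with a count above one for `sunset`, which is why tag distributions on a resource level can be observed there and not in the narrow variants.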
Collective Intelligence
Since the authors in collaborative information services tag their resources collectively (at least once), or correct and continue documents reciprocally, we can speak of ‘Collective Intelligence’ in this context: With content derived primarily by community contribution, popular and influential services like Flickr and Wikipedia represent the emergence of “collective intelligence” as the new driving force behind the evolution of the Internet (Weiss, 2005, 17).
Collective Intelligence is a product of the collaboration of authors and users and becomes particularly salient in folksonomies, most of all in Broad Folksonomies. The term ‘collective intelligence’ or ‘the wisdom of crowds’ was popularized by James Surowiecki’s book ‘The Wisdom of Crowds’ (2005) and describes the phenomenon that a group of very different people, who share neither comparable knowledge nor common goals, is more intelligent than an individual. Kroski (2005), referring to the behavior of bee colonies, characterizes this phenomenon as the ‘Hive Mind;’ in Artificial Intelligence research, it is also called ‘Swarm Intelligence’ (Beni & Wang, 1989); Galton (1907) calls it ‘Vox Populi,’ the voice of the people; and in a business context, the term ‘crowdsourcing’ has established itself (Howe, 2006). It was Galton who conducted the first experiment on the subject of the wisdom of crowds: a group of people was asked to estimate the weight of a bull. He compared the submitted guesses and found out that the average value of all estimates came closer to the bull’s actual weight than the best guess of any of the individual test subjects. Gissing and Tochtermann (2007) formulate the main assumption – the group is smarter than the individual – while taking into consideration business-organizational aspects in the following manner: To simplify, Collective Intelligence can be seen as the finding of solutions while incorporating decentralizedly scattered information, or knowledge, by
different people in different organizations (Gissing & Tochtermann, 2007, 27)*.
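The averaging idea behind the estimation experiment described above can be illustrated with a few lines of arithmetic. The numbers below are invented and are not Galton’s data; they are chosen so that the average of all guesses lands closer to the true value than any single guess. This outcome is a property of the contrived example, not a general guarantee. What does hold for any set of guesses is that the error of the average never exceeds the average of the individual errors.

```python
from statistics import mean

# Hypothetical values for illustration only.
true_weight = 1200
guesses = [1050, 1120, 1180, 1260, 1310, 1340]

crowd_estimate = mean(guesses)
crowd_error = abs(crowd_estimate - true_weight)

individual_errors = [abs(guess - true_weight) for guess in guesses]
best_individual_error = min(individual_errors)
mean_individual_error = mean(individual_errors)
```

Overestimates and underestimates partly cancel in the average, which is why the crowd error is small here even though most individual guesses are far off.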
McFedries (2006) summarizes the intention of Collective Intelligence in other words: “One person can be pretty smart, but 10,000 or 100,000 people are almost always going to be smarter” (McFedries, 2006, 80). For knowledge representation, this means that the collective creation of folksonomies and the indexing with their help ought to be semantically richer and thus more complete and better than indexing with a controlled vocabulary, since several opinions and perspectives are taken into consideration, and not just that of a centralized instance: The essence of the hive mind idea is that the combined knowledge of a group of people will be more accurate than the knowledge of any individual, even an expert individual. While the editor of a controlled vocabulary may miss a term that a particular user might associate with a concept, a wide user base constantly adding and applying terms will be more likely to include it. In addition, this broad user base will add new terms to the system quickly, bypassing the lag associated with formal vocabulary development (Feinberg, 2006).
Sterling (2005) describes this assumption more metaphorically: A folksonomy is nearly useless for searching out specific, accurate information, but that’s beside the point. It offers dirt-cheap, machine-assisted herd behaviour; common wisdom squared; a stampede toward the water holes of semantics (Sterling, 2005).
The concept of Collective Intelligence is represented in different ways in the scientific debate on folksonomies. The underlying assumption by Galton (1907) and Surowiecki (2005) views the average from many different valuations and opinions as Collective Intelligence. A modified version defines the consensus within a broad group of people as Collective Intelligence, i.e. the group settles on a term or formulation after a thorough debate. This understanding forms the basis of Wikipedia, for example: “In their belief that, with sufficient critical mass, truth would arise from consensus, Sanger and Wales [the founder of Wikipedia, A/N] have attracted many believers” (Weiss, 2005, 21). This characteristic of Collective Intelligence may also be termed ‘Collaborative Intelligence,’ after Vander Wal (2008). In folksonomies, the concept of Collective Intelligence is registered differently. Here the focus is not primarily on a consensus arrived at through discussion, but rather on the individual influence of users on tags and thus on the folksonomy as a whole. Tags are formulated according to the users’ desire, and added to a particular resource as often as required – only in Broad Folksonomies, however. This allows for statistical evaluations, which register frequently indexed tags on the one hand, but also the rarer tags, so-called niche tags. Narrow Folksonomies, like controlled vocabularies, cannot display any specific frequency distributions, since here all tags are of equal value (all occur exactly once). Here tag distributions can only be determined for the entire database. The most allocated tags per resource or per database (possibly starting from a threshold value) can be understood as an ‘implied consensus’ or a ‘statistical consensus,’ since the mass of users has pretty much settled on these tags as the best descriptors for that resource: “These user-generated classifications emerge through bottom-up consensus by users
assigning free form keywords to online resources for personal or social benefit“ (Quintarelli, Resmini, & Rosati, 2007, 10). The result is a user consensus with regard to the content of the information resource, expressed by the frequency of the allocated tags (Maass, Kowatsch, & Münster, 2007): “Agreement between users on a proper view of the world is critical for creating a useful shared context. In the tagging case, this translates into agreement on which tags are appropriate for a given subject“ (Heymann & Garcia-Molina, 2006). This way of thinking also forms the basis of the striking formula “from chaos comes order“ (Weiss, 2005, 22), which is frequently cited when discussing folksonomies. From the chaos created by many individual indexers and their tags arises an order based on the opinion of the majority of individuals and thus reflects the consensus within the community implicitly, or statistically: In traditional controlled vocabulary-based indexing, all terms assigned to a document carry more or less equal weight. In social tagging, this is certainly not the case. Certain tags will be much popular than others [...]. This behavior implies a degree of consensus among users regarding what they think a document is about, albeit without any coordinating effort by any user in particular or by any third party (Lin et al., 2006).
Jörgensen (2007) thus calls the indexing of information resources via folksonomies ‘consensus-based indexing.’ Independently of whether the consensus is consciously sought within the community or implicitly generated from the statistical count of tag distributions, the consensus is a sort of quality control. The fact that there is no centralized instance controlling the tags or checking their correct allocation means that the consensus reached by the community of users must be seen as quality control: Given that mechanisms exist for generating or suggesting tags without direct user intervention, and that no formal method of quality control is used in most tagging systems, the mechanism most widely used in ‘broad’ tagging systems, such as del.icio.us, is that of community consensus (Tonkin, 2006).
At this point in quality control, a social mechanism kicks in, built around the assumption that “if others use these tags for indexing, they must be the right ones,” validating the actions of each individual: “The principle of ‘social proof’ suggests that actions are viewed as correct to the extent that one sees others doing them” (Golder & Huberman, 2006, 206). The functioning of Collective Intelligence relies heavily on reaching a critical mass of users (Gordon-Murnane, 2006, 29), since only a sufficient number of users can guarantee the positive effects of the ‘wisdom of crowds:’ “the greater the number of persons, or users, interacting, the more probable it becomes that a result will be achieved that distinguishes itself through self-organization” (Gissing & Tochtermann, 2007, 32)*. The community’s composition also plays a big role in the workings of Collective Intelligence. In order to profit from as large a cross-section as possible, the members should come from different social backgrounds, with different standards of knowledge and motivations. Furthermore, they should feel independent within the group and not be constrained by any hierarchical dynamics (Surowiecki, 2005). Kipp and Campbell (2006) elaborate on the relevance of this idea to folksonomies:
Intellectually, the assumption echoes the widespread belief that the World Wide Web is a complex, adaptive system, of the sort described in complexity theory. In a complex system, there is no pacemaker, of the sort we see in libraries, with their centralized indexing and cataloguing systems. Instead, each unit, following a set of very simple rules, contributes to a spontaneously self-organizing pattern, in which the whole becomes greater than the sum of its parts (Kipp & Campbell, 2006).
Maass, Kowatsch and Münster (2007) also emphasize the importance of the self-organizing folksonomy system with regard to the creation of a collective vocabulary: The vocabulary – consisting of tags and generated within all indexing tasks by all users – is a part of this system, which organizes its structure by itself, without a centralized control mechanism. The users of a collaborative indexing system generate this vocabulary in a decentralized approach, not even aware of it. On its own this system evolves over time into a more stable state (Maass, Kowatsch, & Münster, 2007, 48).
However, Collective Intelligence does not seem limitless; that is to say, it should not be trusted blindly, for if a sufficient number of members of a community concur that one and one is three, that does not make it so. In crowd psychology, such phenomena are called ‘madness of the crowds’ (Le Bon, 1895; Mohelsky, 1985; Menschel, 2002); one need only think of the panicking masses when stock markets collapse, or of raging soccer fans. Here it seems as though the growth of the mass of people diminishes the collective intelligence of its members. So it must be emphasized once more: Collective Intelligence of the kind that is meant for use in Broad Folksonomies is exclusively generated on a statistical basis, from the aggregation of independent single values, or single personomies, and not from swayed users or users imitating others. Surowiecki (2005) emphasizes the following for the functioning of the ‘wisdom of crowds:’ Groups generally need rules to maintain order and coherence, and when they’re missing or malfunctioning, the result is trouble. Groups benefit from members talking to and learning from each other, but too much communication, paradoxically, can actually make the group as a whole less intelligent. While big groups are often good for solving certain kinds of problems, big groups can also be unmanageable and inefficient. […] Diversity and independence are important because the best collective decisions are the product of disagreement and contest, not consensus or compromise. An intelligent group, especially when confronted with cognition problems, does not ask its members to modify their positions in order to let the group reach a decision everyone can be happy with. Instead, it figures out how to use mechanisms […] to aggregate and produce collective judgments that represent not what any one person in the group thinks but rather, in some sense, what they all think.
Paradoxically, the best way for a group to be smart is for each person in it to think and act as independently as possible (Surowiecki, 2005, XIXf.).
Furthermore, he defines four fundamental principles that are necessary for the working of a Collective Intelligence and prevent the occurrence of a ‘madness of the crowds:’
Diversity of opinion (each person should have some private information, even if it’s just an eccentric interpretation of the known facts), independence (people’s opinions are not determined by the opinions of those around them), decentralization (people are able to specialize and draw on local knowledge), and aggregation (some mechanism exists for turning private judgments into a collective decision). If a group satisfies those conditions, its judgment is likely to be accurate. Why? At heart, the answer rests on a mathematical truism. If you ask a large enough group of diverse, independent people to make a prediction or estimate a probability, and then average those estimates, the errors each of them makes in coming up with an answer will cancel themselves out. Each person’s guess, you might say, has two components: information and error. Subtract the error, and you’re left with the information (Surowiecki, 2005, 10).
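The ‘mathematical truism’ in this quotation can be illustrated with a small simulation. The following is only a sketch under stated assumptions: every guess is modelled as the true value plus independent, zero-mean Gaussian noise, and the function name, the true value and the noise level are hypothetical choices made for the illustration, not figures taken from Galton or Surowiecki.

```python
import random
import statistics


def crowd_estimate(true_value, n_people, noise_sd, seed=0):
    """Compare the error of the averaged guess with the average individual error.

    Assumption: each guess = information (true_value) + independent error.
    """
    rng = random.Random(seed)
    guesses = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n_people)]
    # Error of the aggregated (averaged) judgment of the whole crowd.
    crowd_error = abs(statistics.fmean(guesses) - true_value)
    # Typical error of a single member of the crowd.
    mean_individual_error = statistics.fmean(abs(g - true_value) for g in guesses)
    return crowd_error, mean_individual_error


if __name__ == "__main__":
    crowd, individual = crowd_estimate(true_value=1000.0, n_people=800, noise_sd=100.0)
    print(f"error of the crowd average: {crowd:.2f}")
    print(f"mean individual error:      {individual:.2f}")
```

The cancellation depends on the independence assumption: if the guesses shared a common bias, as with users who imitate one another, the averaging step would not remove it.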
For folksonomies in knowledge representation, this means that Collective Intelligence can develop best if users tag the same resources multiple times independently of each other, and thus implicitly compile a collective vocabulary that reflects their collective opinion on the resources. However, we must also treat the opposite assumption with great caution: “Merely looking at frequencies of tag use is a too simplistic measure. […] E.g., one cannot simply state that a definition is incorrect only because it is hardly used” (Spyns et al., 2006, 745).
Tag Distributions
In folksonomies (narrow or broad), it is generally possible to determine the frequency of tag distributions based on the collaborative information service’s entire database (tag distribution on a database level). In Broad Folksonomies, it is also possible to determine the frequency of indexed tags per information resource (tag distribution on a resource level): Because each tag for a given resource is repeated a number of times by different users, for any given tagged resource, there is a distribution of tags and their associated frequencies. The collection of all tags and their frequencies ordered by rank frequency for a given resource is the tag distribution of that resource (Halpin et al., 2007, 212).
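The two levels of tag distribution described here can be sketched in a few lines. The example assumes that tagging actions are available as (user, resource, tag) triples; this data layout, the function names and the sample data are hypothetical and serve only to illustrate the counting.

```python
from collections import Counter


def resource_distribution(assignments, resource):
    """Tag distribution on a resource level: (tag, frequency) pairs by rank."""
    counts = Counter(tag for _, res, tag in assignments if res == resource)
    return counts.most_common()


def database_distribution(assignments):
    """Tag distribution on a database level, over all resources."""
    return Counter(tag for _, _, tag in assignments).most_common()


# Hypothetical tagging actions in a Broad Folksonomy: (user, resource, tag).
SAMPLE = [
    ("u1", "www.visitlondon.com", "london"),
    ("u1", "www.visitlondon.com", "travel"),
    ("u2", "www.visitlondon.com", "london"),
    ("u3", "www.visitlondon.com", "london"),
    ("u3", "www.visitlondon.com", "culture"),
    ("u2", "www.asis.org", "information"),
]

if __name__ == "__main__":
    print(resource_distribution(SAMPLE, "www.visitlondon.com"))
    print(database_distribution(SAMPLE))
```

In a Narrow Folksonomy every tag would occur at most once per resource, so only the database-level count would yield a meaningful frequency distribution.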
Kipp (2006c) is able to demonstrate that the number of users is in positive correlation with the number of tags on a resource level. The distributions provide information on what tags are used most frequently by single users or groups of users or the totality of users and thus reflect the implied consensus concerning a concept or a term: “semantics may be inferred from the sum of the tag corpus“ (Tonkin, 2006). Tag distributions can also be useful for the ranking of tags or information resources. Furthermore, the calculation of tag distributions allows for various probability calculations, for informetric analyses and the ascertainment of any regularities in users’ tagging behavior. These observations can then play a role in the creation of tagging tools.
Figure 3.8: Tag Distribution for the Website www.visitlondon.com. Source: http://del.icio.us.
Vander Wal (2005b), Shirky (2003; 2005a), Munk & Mork (2007a; 2007b), Heymann & Garcia-Molina (2006), Lux, Granitzer, & Kern (2007) and others determine that in folksonomies, the distribution of indexed tags on a resource level resembles a Power Law curve (see Figure 3.9). “Tags are usually represented by a power-law distribution of possible terms, of which a few are extremely popular, but the majority are used only infrequently” (Tonkin, 2006). Should this assumption be correct, very few tags with high frequencies would be placed at the left side of the tag distribution curve, while the right side would consist of a large number of tags with low and almost equal frequencies, thus forming the ‘Long Tail.’ The term ‘Long Tail’ was discussed by Chris Anderson (2006) from a business perspective, with a particular view to the returns from niche products. The validity of this assumption is examined in the following using a concrete example on a resource level and Lotka’s Law (Egghe, 2005). Figure 3.8 makes it plain that certain tags set themselves apart from the others through their frequency. The tags ‘London’ and ‘Travel’ on the left side of the curve together account for almost 80% of the tag frequency, while the right side, with tags such as ‘Culture,’ ‘Information’ and ‘Holiday,’ displays a greater variety of content description but is indexed far less often. This quantity of low-frequency tags with almost equal frequencies is what forms the Long Tail. After calling up the website, it becomes clear that the tags at the beginning of the curve (‘London’ and ‘Travel’) describe the site’s content adequately and generally, and that the Long-Tail tags describe it very specifically. The curve for the website www.visitlondon.com, and thus the tag distribution, follows the Power Law with an exponent of around a = 1. An ideal Power Law distribution is displayed in Figure 3.9.
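How an exponent such as a = 1 might be read off a rank-frequency list can be sketched as follows. The sketch assumes the form f(r) = C / r^a and estimates a by ordinary least squares on the log-log values; this is a rough, commonly used approximation and not necessarily the procedure applied to Figure 3.8. The function name and the test data are hypothetical.

```python
import math


def fit_power_law(frequencies):
    """Fit f(r) = C * r**(-a) to ranked frequencies; return (a, C).

    Assumption: all frequencies are positive; rank 1 is the most frequent tag.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept of log(frequency) against log(rank).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    ideal = [1000.0 / rank for rank in range(1, 21)]  # ideal curve with a = 1
    a, c = fit_power_law(ideal)
    print(f"a = {a:.3f}, C = {c:.1f}")
```

Because such a fit is dominated by the many low-frequency ranks, a small estimation error alone says little about the first few ranks of the curve.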
Figure 3.9: Possible Relevance Distributions (panels: Power Law; Inverse-logistic Distribution). Source: Modified from Stock (2006, 1127, Fig. 1).
A particularity of Power Law distributions is in their scale invariance, i.e. the character of the curve or of the distribution stays the same, even after changing the dimensions: One important feature of power laws produced by complex systems is that they can often be ‘scale-free’, such that regardless of how larger the system grows, the shape of the distribution remains the same, and thus ‘stable’ (Halpin, Robu, & Shepherd, 2007, 212).
In an analysis of the co-occurrence of tags (‘co-tags,’ in analogy to the ‘co-citations’ of citation analysis), Lux, Granitzer and Kern (2007) are able to demonstrate that on a database level, around 80% of tags co-occur with others in such a way that these other terms follow a Power Law. The same conclusion is drawn by Maass, Kowatsch and Münster (2007, 51): “There are many tags T ji with low frequency rates of co-occurring tags and few with very high frequency rates”. This uniformity of tag allocation can be a result of the similarity between the users’ social situation and their user personality: similar users index similar tags for the same resources (Maier & Thalmann, 2008, 81). In their analysis of the tagging behavior on del.icio.us, Golder and Huberman emphasize: “Accordingly, some documents may occupy roughly the same status in many of those users’ lives; since they make use of web documents in the same way, users may categorize them the same way, as well” (Golder & Huberman, 2006, 206). There are many different approaches towards explaining the generation of the Power Law, which are, however, always calculated using probabilities and viewed as analogous to the development of biological taxa, the growth of cities or the number of citations, for example (Newman, 2005; Adamic, 2002; Mitzenmacher, 2004). Cattuto (2006) as well as Cattuto, Loreto and Pietronero (2007) also present a very interesting theoretical approach. As in Lux, Granitzer, & Kern (2007), here too the focus is on tag occurrence on a database level (or experimentally severed parts thereof) and not, as in the example above (Figure 3.8), on the resource level, i.e. the allocation of certain tags to concrete resources. Cattuto, Loreto and Pietronero (2007) apply the approach of ‘Semiotic Dynamics’ to the creation of new tags and the reuse of older tags in a folksonomy.
“Folksonomies […] do exhibit dynamical aspects also observed in human language, such as the emergence of naming conventions, competition between terms, takeovers by neologisms, and more” (Cattuto, 2006, 33). The explanatory model is based on the Yule process as
well as on the Yule-Simon process and, in this modified form, addresses the probabilities for the occurrence of words in tags. The underlying Yule process describes the generation of different biological taxa and is the most widespread model for the explanation of Power Law development. In scientific debate, this approach is rated very highly: However, the important point is that the Yule process is a plausible and general mechanism that can explain a number of the power-law distributions observed in nature and can produce a wide range of exponents to match the observations by suitable adjustments in the parameters (Newman, 2005).
According to the Yule-Simon process, at every point in the text, a certain word has the probability p of being a new word (i.e., of not having occurred in the text so far), or it has the probability 1 – p, of being a copy of a pre-existing word. The value of 1 – p depends on how often the given word has already occurred in the text, where a simple relation is created: the more a word has already occurred, the greater the probability of its occurring again. The Yule-Simon process is thus a variant of the well-known ‘success breeds success’ (Egghe & Rousseau, 1995) or of the Matthew Effect, ‘everyone that hath shall be given’ (Merton, 1968). In order to apply this model to folksonomies, one only has to speak of ‘tags in folksonomies’ instead of ‘words in texts.’ The more a tag occurs in an information service, the greater the probability of its being used for indexing again. In their analysis of users’ tagging behavior, Sen et al. (2006) are able to confirm this assumption even for the resource level: “as users apply more tags, the chance that an applied tag will be new for them drops. In total, 51% of all tag applications are tags that the tagger has previously applied” (Sen et al., 2006, 185).
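The mechanism just described can be sketched as a short simulation. It is a minimal reading of the Yule-Simon process as paraphrased above, not the authors’ implementation: at every step a new tag is created with probability p, and otherwise an earlier tagging action is copied, so that a tag is chosen in proportion to its previous frequency. The function name and the parameter values are hypothetical.

```python
import random
from collections import Counter


def yule_simon(steps, p, seed=0):
    """Simulate `steps` tagging actions; return a Counter of tag frequencies."""
    rng = random.Random(seed)
    stream = [0]      # the very first tag is necessarily a new one
    next_tag = 1      # tags are represented by consecutive integers
    for _ in range(steps - 1):
        if rng.random() < p:
            # With probability p, a tag that has not occurred so far is used.
            stream.append(next_tag)
            next_tag += 1
        else:
            # With probability 1 - p, a past tagging action is copied; picking
            # uniformly from the stream favours tags that are already frequent.
            stream.append(rng.choice(stream))
    return Counter(stream)


if __name__ == "__main__":
    counts = yule_simon(steps=5000, p=0.1)
    print("distinct tags:", len(counts))
    print("five most frequent:", counts.most_common(5))
```

Sorting the resulting frequencies by rank should show a few very frequent tags followed by many rare ones, which is the ‘success breeds success’ pattern named in the text.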
Figure 3.10: The Yule-Simon Process Describes the Generation of a Power Law. Source: Cattuto, Loreto, & Pietronero (2007, 1464, Fig. 6).
Cattuto, Loreto, & Pietronero (2007) refine the Yule-Simon process by adding a time component, so that newer tags have a greater indexing probability than old ones: “It seems more realistic to assume that users tend to apply recently added tags more frequently than old ones, according to a skewed memory kernel” (Cattuto, 2006, 35). Figure 3.10 provides an overview of the explanatory model. The probability p expresses that a completely new tag is used for indexing, while 1 – p, accordingly, states that an older tag is used. The latter probability depends on the
time at which the (older) tag is used. The time statements are stored in the memory Q and govern a weighting along the lines of ‘the newer the more probable.’
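The effect of the time component can be sketched by weighting past tagging actions by their age. In the following illustration, an action performed x steps ago receives the weight 1 / (x + tau); this hyperbolic form and the parameter name `tau` are assumptions made for the sketch, since the text above only states that newer tags are more probable than older ones.

```python
import random


def yule_simon_with_memory(steps, p, tau, seed=0):
    """Yule-Simon variant in which recent tagging actions are copied more often."""
    rng = random.Random(seed)
    stream = [0]      # chronological list of all tagging actions so far
    next_tag = 1
    for _ in range(steps - 1):
        if rng.random() < p:
            stream.append(next_tag)   # a completely new tag
            next_tag += 1
        else:
            now = len(stream)
            # The action at position i lies x = now - i steps in the past;
            # the assumed memory weight 1 / (x + tau) decays with age.
            weights = [1.0 / ((now - i) + tau) for i in range(now)]
            stream.append(rng.choices(stream, weights=weights, k=1)[0])
    return stream


if __name__ == "__main__":
    history = yule_simon_with_memory(steps=2000, p=0.1, tau=10.0)
    print("distinct tags:", len(set(history)))
```

A large `tau` flattens the weights and brings the behavior closer to the plain Yule-Simon process, while a small `tau` concentrates copying on the most recent actions.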
Figure 3.11: The Shuffling Theory Describes the Generation of a Power Law. Source: Halpin, Robu, & Shepherd (2007, 214, Fig. 2).
Halpin, Robu and Shepherd (2007) suggest a similar procedure for explaining the generation of a Power Law on a resource level. However, they rely on a theoretical model of ‘preferential attachment’ (Barabási & Albert, 1999; Capocci et al., 2006), the ‘shuffling theory,’ to determine the probability with which a certain card is drawn from a hat or a deck. They also assume that it is more probable that a preexisting tag will be re-indexed than that a new one will be created: The notion of a feedback cycle is encapsulated in the simple idea that a tag that has already been used is likely to be repeated. This behaviour is a clear example of preferential attachment, known popularly as the ‘rich get richer’ model (Halpin, Robu, & Shepherd, 2007, 213).
Cattuto, Loreto, & Pietronero (2007) and Halpin, Robu, & Shepherd (2007), or the Yule-Simon process and shuffling theory, basically follow the same procedure. Only the relation for determining the probability of a new tag being indexed is swapped: Halpin, Robu, & Shepherd (2007) use 1-p, while Cattuto, Loreto, & Pietronero (2007) use p for their calculations. This difference has no effect on the generation of a Power Law, however. To clarify the probability calculations, Halpin et al. (2007) provide the following simple example (see also Figure 3.11): At time step 1 in our example, the user has no choice but to add a new tag ‘piano’, to the page. At the next stage, the user does not reinforce a new tag but chooses a new tag, ‘music’, and so P(piano)=1/2 and P(music)=1/2. At t=3, the user reinforces a previous ‘piano’ tag and so P(piano) increases to 2/3, while P(music) decreases to 1/3. At t=4, a new tag, ‘digital’, is chosen and so P(piano) goes up while P(music) decreases to 1/4 and P(digital) is 1/4. Taken to its conclusion, this process produces a power law distribution (Halpin, Robu, & Shepherd, 2007).
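The fractions in this example can be retraced step by step. The sketch assumes that the probability of a tag being reinforced equals its share of all tagging actions so far; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def reinforcement_probabilities(tag_sequence):
    """Return, after each tagging action, every tag's share of all actions."""
    counts = Counter()
    history = []
    for tag in tag_sequence:
        counts[tag] += 1
        total = sum(counts.values())
        # Exact fractions avoid rounding noise in such a small example.
        history.append({t: Fraction(c, total) for t, c in counts.items()})
    return history


if __name__ == "__main__":
    steps = reinforcement_probabilities(["piano", "music", "piano", "digital"])
    for t, probabilities in enumerate(steps, start=1):
        print(t, {tag: str(prob) for tag, prob in probabilities.items()})
```

Under this reading, the shares at t=4 are 2/4 for ‘piano’ and 1/4 each for ‘music’ and ‘digital’, which matches the values given in the quotation for the latter two tags.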
Both models, however, require a sufficient quantity of already indexed tags on which to base their explanations or calculations, as Newman (2005) emphasizes and illustrates with an example: “if new papers appear with no citations and garner citations in
proportion to the number they currently have, which is zero, then no paper will ever get any citations!“ Dellschaft and Staab (2008) criticize the above two models for failing to incorporate users’ background knowledge during tagging. Furthermore, the models put too strong an emphasis on the idea that users imitate other users during tagging. Another point they make is this: But the most obvious flaw of all previous models is their inability to explain the characteristic sub-linear tag growth. Instead of the continuous, but decaying growth of the set of distinct tags they all lead to a linear growth or, even worse, assume a fixed vocabulary size (Dellschaft & Staab, 2008, 75).
This is why Dellschaft and Staab (2008) use a model for explaining the generation of tag distributions on a resource and database level that incorporates users’ knowledge and active vocabulary. To simulate the user vocabulary (or background knowledge, short: BK), they use a text corpus consisting of the co-occurrences of different tags on del.icio.us. The probability BK of a user indexing a tag t from his active vocabulary equals the probability of this tag t occurring in the corpus. The factor I corresponds to the probability of an already used tag being re-indexed. Here the authors assume that popular tags recommended by the system (the 7 most frequent tags on del.icio.us) are reused most often. In their analysis of the model, the authors are able to demonstrate that in 60 to 90 per cent of cases, users imitate pre-existing tags and add the others from their vocabulary. Thus the model is also capable of explaining the characteristic form of distribution as well as the sublinear growth of tag distributions. Shirky (2003) locates the reason for the Power Law’s generation in the users’ many alternatives and explains this approach with the example of blog readerships. The more choice there is in the blogosphere, the more likely many readers are to subscribe to the most popular blogs rather than to follow the many unknown ones (see also Fu et al., 2006). This leads to a Power Law distribution with a Long Tail: “Diversity plus freedom of choice creates inequality, and the greater the diversity, the more extreme the inequality. […] The very act of choosing, spread widely enough and freely enough, creates a power law distribution” (Shirky, 2003, 77). For the blogosphere, Shirky (2003) observes that the gap between the top-ranked blogs and the less successful ones becomes greater as more blogs are created and made accessible. This also means that it becomes harder for new bloggers to gain any readers at all.
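The interplay of imitation and background knowledge in the model of Dellschaft and Staab (2008), as summarized above, can be sketched as follows. This is a simplified illustration and not the authors’ implementation: with probability `imitation_rate` (the factor I) one of the currently most frequent tags is repeated, otherwise a tag is drawn from a background vocabulary according to its weight (BK). The frequency-proportional choice among the recommended tags and the toy vocabulary are assumptions of the sketch.

```python
import random
from collections import Counter


def simulate_tagging(steps, imitation_rate, background, top_n=7, seed=0):
    """Simulate tagging of one resource; return a Counter of tag frequencies.

    `background` maps each word of the assumed active vocabulary to a weight.
    """
    rng = random.Random(seed)
    words = list(background)
    word_weights = [background[w] for w in words]
    counts = Counter()
    for _ in range(steps):
        popular = counts.most_common(top_n)   # the tags 'recommended' so far
        if popular and rng.random() < imitation_rate:
            # Imitation: repeat one of the most frequent existing tags.
            tags = [tag for tag, _ in popular]
            freqs = [freq for _, freq in popular]
            tag = rng.choices(tags, weights=freqs, k=1)[0]
        else:
            # Background knowledge: draw a word from the user vocabulary.
            tag = rng.choices(words, weights=word_weights, k=1)[0]
        counts[tag] += 1
    return counts


if __name__ == "__main__":
    vocabulary = {f"word{i}": 1.0 / (i + 1) for i in range(200)}  # toy corpus
    result = simulate_tagging(steps=2000, imitation_rate=0.8, background=vocabulary)
    print("distinct tags:", len(result))
    print("top tags:", result.most_common(5))
```

Because new tags can only enter through the background draw, the set of distinct tags in such a run keeps growing, but more slowly than the number of tagging actions.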
When applied to tags, this then means that it will at some point become more difficult for users to index new and unused tags and that they will tend to re-use preexisting ones. Shirky (2003) thinks that the reason for this is also to be found in the characteristics of a social system, where users are not uninfluenced by the opinions of other users: But people’s choices do affect one another. […] Think of this positive feedback as a preference premium. The system assumes that later users come into an environment shaped by earlier users; the thousand-and-first user will not be selecting blogs at random, but will be affected, even if unconsciously, by the preference premiums built up in the system previously (Shirky, 2003, 79).
The underlying assumption of both the Yule-Simon process and the theory of ‘preferential attachment’ is based on the proportional connection of already indexed
tags and the probability of their being indexed again. They are, however, not wholly suited to explain the generation of a Power Law, as they place their emphasis elsewhere. Both theories (as well as all other explanatory theories, see Newman, 2005) merely describe the development of the Long Tail typical of the Power Law distribution. The beginning of the Power Law curve is often excluded in the determination of a Power Law (see for example Newman, 2005, Fig. 4) and then neglected in the description. Evidence of a Long Tail is enough to brand a distribution as ‘Power Law’ (even for Cattuto, Loreto, & Pietronero, 2007). Scientific discussion thus often seems to start from the tacit assumption that the Power Law distribution is all there is. Evidence to the contrary is provided by an excursion into information retrieval and Figure 2.7 in the preceding chapter. Lux, Granitzer and Kern (2007) also confirm this for the database level: they were only able to label 80% of co-tag distributions as Power Law. The remaining 20% seem to follow other rules: “not all tags follow a power law distribution […]” (Lux, Granitzer, & Kern, 2007). The inverse-logistic distribution and its particularities shall thus be illustrated in the following using an example. Figure 3.12 shows an inverse-logistic distribution on a resource level.
Figure 3.12: Tag Distribution for the website www.asis.org. Source: http://del.icio.us.
As can clearly be seen, this particular distribution does not follow a Power Law. It seems as though in this case two Long Tails are developing: the known entity to the right and a ‘Long Trunk’ on the left-hand side. Some authors refer to the short segment at the start as the ‘short head’ (Turnbull, Barrington, & Lanckriet, 2008), in analogy to the Long Trunk. The tags do not differentiate themselves from each other strongly enough in their frequencies, causing many tags to occur almost equally often. On www.asis.org, the Long Trunk registers the tags ‘associations,’ ‘library,’ ‘information,’ ‘ia,’ ‘technology’ and ‘professional.’ After this, there is a turning point in the distribution to which the Long Tail attaches itself. The tags in the Long Trunk describe the content of the website – unlike the first two or three tags in a Power Law distribution – with insufficient adequacy. The tags ‘associations’ and ‘technology’ are extremely general, and the other terms, too, make only the vaguest
allusions to the content. Here Collective Intelligence seems to have trouble painting a ‘true’ picture of the resource, or agreeing on the relevant tags. Kipp and Campbell (2006) also observe this sort of tag distribution on a resource level in the social bookmarking service del.icio.us, but do not know how to classify this distribution (see also Figure 3.13): However, a classic power law shows a much steeper drop off than is apparent in all the tag graphs. In fact, some graphs show a much gentler slope at the beginning of the graph before the rapid drop off […]. Shirky notes that this drop off does not always occur immediately and suggests that this shows that user consensus views such a site as being about more things than a site whose tags experience a sharper drop off in frequency […]. A number of other graphs in the sample exhibit this less rapid drop off suggesting that users may tend to settle on a cluster of related terms which they feel best describes the aboutness of the URL (Kipp & Campbell, 2006).
Figure 3.13: Tag Distribution for the Website www.pocketmod.com. Source: Kipp & Campbell (2006, Fig. 1).
Knowing about the different relevance or frequency distributions is interesting not only for information retrieval, but also for knowledge representation with folksonomies – after all, these distributions, especially the Power Law, are used to explain the workings of Collective Intelligence in collaborative information services. In spite of that, the different forms of distribution are not yet present in scientific debate; only Paolillo (2008) talks of two distinct Power Law distributions, before and after the turning point of the distribution curve. Since both curves have a Long Tail, there is indeed the possibility of confusing the two. Hence particular attention must be paid to the beginning of the distribution. The decisive factor excluding the classification of a distribution as Power Law is the presence of a Long Trunk. If a curve is approximated without the first few ranks (as is most strikingly the case in Capocci and Caldarelli (2008, Fig. 3) or in Cattuto (2006, Fig. 5)), misattributions are eminently possible. For the record, then: typical tag distributions can be determined for the entire database as well as for single information resources. As the examples demonstrate, we cannot assume that tag distributions will conform to Power Law regularities per se. The inverse-logistic distribution must also be considered for both levels. Common to both distribution forms is, however, that tags at the beginning of the curve distinguish themselves from those in the Long Tail in their distribution
frequencies as well as the meaning allocated to them by the users. It is interesting that as soon as a critical mass of resources is provided with a sufficient number of tags, the tag allocation will stay mostly constant for longer periods of time: “the objects will stabilize once enough objects are tagged” (Maarek et al., 2006). Kipp and Campbell (2006) also observe this connection: “Furthermore, early research suggests that when a URL acquires a certain number of taggers, the most common terms tend to remain stable” (Kipp & Campbell, 2006; see also Halpin, Robu, & Shepherd, 2007; Maass, Kowatsch, & Münster, 2007; Maier & Thalmann, 2007). For them, these tags possess the characteristics of a controlled vocabulary, only ‘checked’ by the user community (Kipp & Campbell, 2006). Halpin, Robu, & Shepherd (2007) demonstrate, parallel to their explanation for the generation of a Power Law, that any form of tag distribution will remain mostly stable over time. The relative frequency of indexed tags, but not their absolute number, stays constant, or behaves scale-invariantly, once the form of distribution has been reached, in a Power Law as well as in an inverse-logistic distribution. Halpin, Robu and Shepherd (2007) confirm this via the Kullback-Leibler formula: If the Kullback-Leibler divergence between two consecutive time points or between each step and the final one becomes zero (or close to zero), it indicates that the shape of the distribution has stopped changing. This result suggests that the power law may form relatively early in the process for most sites and persist with remarkable consistence throughout (Halpin, Robu, & Shepherd, 2007, 217).
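The stability check described in this quotation can be sketched with the standard definition of the Kullback-Leibler divergence, D(P||Q) = sum over tags of p * log(p / q). The sketch assumes that the tag counts of one resource are available as dictionaries for two points in time; how Halpin, Robu and Shepherd treat tags that are missing at one of the two points is not stated here, so the handling below is an assumption, as are the function name and the sample counts.

```python
import math


def kl_divergence(p_counts, q_counts):
    """Kullback-Leibler divergence D(P||Q) between two tag distributions.

    Both arguments map tags to absolute counts; they are normalized to
    relative frequencies first. Assumption: if a tag occurs in P but not
    in Q, the divergence is reported as infinite.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for tag, count in p_counts.items():
        p = count / p_total
        if p == 0.0:
            continue  # terms with p = 0 contribute nothing
        q = q_counts.get(tag, 0) / q_total
        if q == 0.0:
            return math.inf
        divergence += p * math.log(p / q)
    return divergence


if __name__ == "__main__":
    earlier = {"london": 40, "travel": 10}
    later = {"london": 400, "travel": 100}    # ten times the tags, same shape
    shifted = {"london": 250, "travel": 250}  # same tags, different shape
    print(kl_divergence(earlier, later))      # shape unchanged
    print(kl_divergence(earlier, shifted))    # shape has changed
```

The first comparison illustrates the point made in the text: the absolute number of tags has grown tenfold, but the relative frequencies, and therefore the shape of the distribution, are unchanged.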
Thus the Long-Trunk tags, or the first n tags of the Power Law, reflect the implicit consensus of the user community with regard to the resource’s aboutness fairly consistently: “In effect, a text or object identifies itself over time“ (Godwin-Jones, 2006, 10). The Long-Tail tags illustrate the opinions of individual users and can provide insights into the semantic diversity of terms. These tags may be more fitting in their description of the resource than the most cited tags. Compared to Broad Folksonomies, Narrow Folksonomies are generally indexed with fewer tags per resource. This may be the reason why a Long Tail will not be particularly long in these cases. Since niche tags are often not even represented in a tag cloud (see chapter four), they remain invisible and have no influence on indexing users. Hence they are very unlikely to see an increase in their distribution frequencies. An inverting function for tag clouds could be of some help here and make the tags visible. Services that offer informetric analyses of tags on the basis of their distributions and their chronological development shall not go unmentioned at this point. Technorati for example shows time series of blog posts regarding topics (for an example, see Stock & Weber, 2006; Müller-Prove, 2008), while Cloudalicious generates time series of tags for particular URLs on del.icio.us (Russell, 2006). A research field is opening up to the analysis of regularities of chronological tag distributions (Dellschaft & Staab, 2008). A research question that has stayed mostly unanswered so far concerns the satiation of a tagging system. Could a folksonomy collapse through too many tags, too many users, too many resources? The attempts at a solution which are presented here for the first time are often contradictory. This is why they shall be discussed with regard to the different elements of folksonomies – users (Ossimitz & Russ,
2008), resources (Linde, 2008) and tags (de Solla Price, 1969). In the end, though, all attempts can be applied to all elements; only the reference values will require some adjusting. Ossimitz and Russ (2008) discuss the user aspect in their observations on the growth of online social networks. Their first discovery is that in social networks, so-called ‘online herds’ turn into ‘online crowds,’ which are instrumental for the growth or stagnation of user numbers. Online herds are defined thus: “They observe and imitate the behavior of other users and follow online trends without a rational reason” (Ossimitz & Russ, 2008, 150). Growth processes always affect two factors in a system: time and size of a reference value. Real-world systems are subject to particular conditions: For real-world systems absolute growth of some state variable over a finite time interval is always limited. [...] Even over a hypothetical infinite timespan the state of many systems would stay always finite. This applies due to the finiteness restriction of our world, which allows neither infinitely tall persons nor unlimited big populations or infinite big amounts of money. Typically this limitation to growth can be specified by some kind of specific limitation (carrying capacity) (Ossimitz & Russ, 2008, 151).
In most limited-growth models, a freely selectable constant K is used, which represents the capacity limit. This means that the growth will stop at once and for good once this limit has been reached (e.g. for the booking of plane tickets). Problems arise in cases described as ‘overshoot and collapse:’ “It happens whenever a system has a carrying capacity K, which is actually ignored by the growing forces, so that the overgrowth goes considerably beyond K” (Ossimitz & Russ, 2008, 152; for an example case of such a scenario, see Rasmussen, 1941). So if the capacity is exceeded, the system will collapse. For the development of user numbers in social networks, the theories of self-organization (e.g. Heylighen, 2002), and of self-organized criticality (e.g. Bak, Tang & Wiesenfeld, 1987) play a particular role, according to the authors. Analyzing the social networks Friendster and Facebook, Ossimitz and Russ (2008) find that the growth of both systems comes close to imitating the curve in Figure 3.14, and thus also passes through the four phases of a mass phenomenon: 1) ‘Initiation,’ 2) ‘Propagation,’ 3) ‘Amplification’ and 4) ‘Termination.’ The fourth aspect is particularly important for the satiation of tagging systems, since it states that the upturn cannot last forever, and that the growth will become unstable after breaching the critical factor K. The critical factor K here corresponds to the size of the human population on the one hand, i.e. the system can only accommodate as many users as there are people on the planet, or in other words, as many people as its system capacity can handle, and on the other hand, to the time the system is still useful to the individual. Should the tagging system work slowly or faultily as a result of too many users, its usefulness for the individual has come to an end and the growth phenomenon will turn from euphoria to hysteria - the system collapses or is replaced by another.
This is why Ossimitz and Russ (2008) finally assume that the phenomenon of online crowds, to which users of tagging systems doubtlessly belong, can succumb to an ‘overshoot and collapse’ effect and break down.
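The difference between growth that halts at the capacity limit K and growth that overshoots it can be sketched with a simple discrete model. The sketch below is not the model of Ossimitz and Russ (2008); it assumes, purely for illustration, that overshoot arises because the braking effect of the capacity limit is perceived with a delay (the `lag` parameter), and all numerical values are hypothetical.

```python
def simulate_growth(capacity, rate, lag, steps, start=10.0):
    """Discrete logistic growth with a delayed braking term.

    With lag=0 the growth reacts to the current system size. With lag>0
    it reacts to the size `lag` steps ago, i.e. the capacity limit is
    effectively ignored for a while. Both the delay mechanism and the
    parameter values are assumptions of this sketch.
    """
    sizes = [start]
    for t in range(steps):
        perceived = sizes[max(0, t - lag)]
        nxt = sizes[t] + rate * sizes[t] * (1.0 - perceived / capacity)
        sizes.append(max(nxt, 0.0))  # a user count cannot become negative
    return sizes


K = 1000.0  # hypothetical carrying capacity
limited = simulate_growth(K, rate=0.5, lag=0, steps=60)
overshooting = simulate_growth(K, rate=0.5, lag=4, steps=60)

print(f"peak without delay: {max(limited):.1f}")
print(f"peak with delay:    {max(overshooting):.1f}")
```

Without delay the simulated user numbers approach K from below; with delay they grow beyond K before the brake takes effect, which corresponds to the ‘overshoot’ part of the scenario.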
Figure 3.14: Growth Phases of Natural Systems. Source: Ossimitz & Russ (2008, 155, Fig. 2).
[Figure 3.15, left panel: Classical Economy; axes: Input Quantity, Output Quantity; marked: Lower Point of Profitability, Upper Point of Profitability, Law of Diminishing Returns. Right panel: Network Economy; axes: Input Quantity, Output Quantity; marked: Threshold of Profitability, Law of Increasing Returns.]
Figure 3.15: Marginal Profits in Classical and Network Economy. Source: Modified from Weiber (2002, 288).
Linde (2008) calls on the observations on economic network effects (Buxmann, 2002; Dietl & Royer, 2000) in order to describe the development of digital resources. A network, whether real or virtual, is made up of a quantity of objects and their interconnections. According to Linde (2008, 42ff.), a real-world network is, for example, one of telephone lines and telephone users; a virtual network might consist of the users of a particular software for the creation of text documents. Networks or network goods distinguish themselves by possessing a basic use on the
one hand, i.e. the use of the product in itself, and a network use which derives from the number of users of the network good (e.g. the network of one single telephone is pretty useless for the individual; however, the use for the individual rises, the more users join the network). Applied to tagging systems, this means that the system - and thus its use - becomes more attractive and better for the users, the more users offer resources for access in the collaborative information service. Since tagging systems deal with digital resources, it follows, according to Linde (2008, 123ff.), that the classical law of diminishing returns (see Figure 3.15, to the left), which Ossimitz and Russ (2008) apply to users above, does not hold for them; rather, the law of increasing returns applies (see Figure 3.15, to the right): In information science [...] networks are omnipresent by comparison. Here it can be observed that the development of returns shows an exponential course, i.e. we are dealing with increasing returns. With heightened input of one [...] or all factors [...], a disproportional amount can be gained. Even if, as hinted at in Fig. [3.15], it takes longer to reach the profitability threshold [...], the potential profit is distinctly larger and basically unlimited from the provider's perspective. It makes no difference to the information provider financially if his product is sold several hundred or several thousand times over, particularly if it is an online offer (Linde, 2008, 123ff.)*.
Linde (2008) bases his observations on the costs and returns of information goods. If we are to apply the law of increasing returns to the growth of resources in collaborative information services and the system's usefulness, this would mean that after a critical mass of entered resources has been exceeded, it is potentially possible to add an infinite number of resources to the tagging system (see Figure 3.15, right-hand side, ‘Input Quantity’) - especially since digital resources take up very little disk space and require very little of the computer - and thus increase the system's usefulness (see Figure 3.15, right-hand side, ‘Output Quantity’) ad infinitum as well. That is why, according to Linde (2008), there is no danger of tagging systems collapsing.
Figure 3.16: Common Form of the Logistic Curve. Source: de Solla Price (1969, 21, Fig. 5).
De Solla Price (1969) is deemed the founder of scientometrics and gained great publicity for his ‘science of science,’ i.e. the analysis of the publishing habits of
scientists. He examined the growth of scientific publications in different subject areas and was able to derive several characteristic distributions, which can also be applied to the satiation of tagging systems, or the growth of tags. He too argues, as do Ossimitz and Russ (2008), that natural systems cannot grow forever: In the real world things do not grow and grow until they reach infinity. Rather, exponential growth eventually reaches some limit, at which the process must slacken and stop before reaching absurdity (de Solla Price, 1969, 20).
As opposed to Ossimitz and Russ (2008), the growth curve in de Solla Price (1969) does not decrease after reaching a critical factor, but approaches a threshold value asymptotically (see Figure 3.16). This behavior is typical for logistic curves: In its typical pattern, growth starts exponentially and maintains this pace to a point almost halfway between floor and ceiling, where it has an inflection. After this, the pace of growth declines so that the curve continues toward the ceiling in a manner symmetrical with the way in which it climbed from the floor to the midpoint. […] we may deduce […] that the existence of a ceiling is plausible since we should otherwise reach absurd conditions at the end of another century. Given the existence of such a limit, we must conclude that our exponential growth is merely the beginning of a logistic curve in other guise (de Solla Price, 1969, 21ff.).
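The properties of the logistic curve named in the quotation (near-exponential start, inflection at the midpoint, symmetrical approach to the ceiling) can be checked numerically. The following sketch uses the standard symmetric logistic function; the floor, ceiling, growth rate and midpoint are hypothetical values chosen only for illustration and do not stem from de Solla Price (1969).

```python
import math


def logistic(t, floor, ceiling, rate, midpoint):
    """Value of a symmetric logistic curve at time t."""
    return floor + (ceiling - floor) / (1.0 + math.exp(-rate * (t - midpoint)))


# Hypothetical parameters, chosen only for illustration.
FLOOR, CEILING, RATE, MIDPOINT = 0.0, 100.0, 1.0, 20.0


def value(t):
    return logistic(t, FLOOR, CEILING, RATE, MIDPOINT)


# Far below the midpoint, each time step multiplies the value by about e**RATE.
early_ratio = value(1) / value(0)

# At the midpoint, this idealized curve is exactly halfway between floor and ceiling.
midpoint_value = value(MIDPOINT)

# The growth per time step shrinks once the midpoint has been passed.
step_at_midpoint = value(MIDPOINT + 0.5) - value(MIDPOINT - 0.5)
step_near_ceiling = value(MIDPOINT + 10.5) - value(MIDPOINT + 9.5)

# Symmetry: the climb above the midpoint mirrors the climb below it.
above = value(MIDPOINT + 3) - midpoint_value
below = midpoint_value - value(MIDPOINT - 3)

print(f"early growth factor per step: {early_ratio:.4f}")
print(f"value at midpoint: {midpoint_value:.1f}")
print(f"step at midpoint: {step_at_midpoint:.3f}, near ceiling: {step_near_ceiling:.5f}")
```

In the idealized function the inflection lies exactly at the halfway point, whereas de Solla Price speaks of a point ‘almost halfway’ for empirical curves.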
[Figure 3.17, labels: Value of a Network; Exponential Growth; “Significance precedes momentum”; Marginalization; Optionalization; Potentially Negative Lock-In; Potential Technology Jump.]
Figure 3.17: Growth Limits in Network Effects and Technology Jump. Source: Modified from Zerdick et al. (2001, 215).
Applied to tagging systems and tags, this would mean that after a period of heavy growth, i.e. after exponentially increasing indexing activity with tags, there will follow a phase in which tags are still used a lot, but the degree of newly introduced tags decreases - in other words, the network's usefulness is still high, but it approaches the satiation factor and generates no further use.
Neither de Solla Price (1969) nor Linde (2008) denies that in natural systems, phases that are distinguished by their approaching a degree of satiation (see Figure 3.16) as well as a diminishing growth (see Figures 3.14 and 3.15, left-hand side) can be followed by further phases resembling the former. Linde (2008) locates the reason for this renewed increase of the growth period in a technology jump (see Figure 3.17), e.g. the introduction of new computer capacity or new search options in folksonomies. De Solla Price (1969) notices self-repeating structures in the discovery of chemical elements and localizes ‘bunches of ogives’ (de Solla Price, 1969, 30) at points in time where it was rather physical achievements that led to the discovery of elements, or machines that facilitated research and thus led to a renewed increase in the number of known elements (see Figure 3.18): From this we are led to suggest a second basic law of the analysis of science: all the apparently exponential laws of growth must ultimately be logistic, and this implies a period of crisis extending on either side of the date of midpoint for about a generation. The outcome of the battle at the point of no return is complete reorganization or violent fluctuation or death of the variable (de Solla Price, 1969, 30).
Figure 3.18: Number of Known Chemical Elements as a Time Function. Source: de Solla Price (1969, 29, Fig. 11).
These three attempts at an explanation must be viewed as a first step towards answering the research question “When is a tagging system satiated?” and provide plausible clues to the chronology of growth processes. In the end, though, they can only be an introduction to the discussion on the appropriate behavior of tagging systems and for empirical investigations that cannot be performed in the context of this book.
Users’ Tagging Behavior
The relationship between users and tags in all its complexity is summarized under the keyword ‘tagging behavior.’ This aspect deals specifically with the user’s behavior during tagging and provides answers for questions such as: Why does the user tag? Or: What facilities can be used to make tagging easier or give it a structure? Thus Kipp and Campbell (2006, see also Kipp, 2007) observe that 65% of the users in a tagging system index their information resources with as few as one to three tags (see also Farooq et al., 2007). Heckner, Neubauer and Wolff (2008) are able to demonstrate that the average tag allocation frequency varies from one collaborative information service to the other (see also Heckner, Mühlbacher, & Wolff, 2007, and Heckner, 2009). Thus in the scientifically-minded social bookmarking service Connotea resources are indexed with 4.22 tags on average, resources in the social bookmarking service del.icio.us with 2.82 tags, photos on the photosharing site Flickr with 2.79 tags and the videos on its videosharing site with 4.81 tags. But there are also users who make heavy use of tagging options (see Figure 3.6 above). Here we can observe a trend that shows that the more resources a user manages, the more tags are used to index them (Kipp, 2007, 69) and the more users a tagging system has, the more different tags are used for indexing within the system (Farooq et al., 2007). Lee et al. (2009) are able to determine that the familiarity of users with the concept of tagging, the functionality of tagging systems and the use of web catalogs has a great effect on the user’s tagging behavior. The more familiar the user is with the three aspects stated above, the better and more frequent his tagging becomes. As is already the case for tag distributions to information resources, here too, according to Marlow et al.
(2006a; 2006b), the relation of tags per user can be represented in a Power Law distribution: “the most users have very few distinct tags while a small group has extremely large sets of tags” (Marlow et al., 2006a; 2006b; Lux, Granitzer, & Kern, 2007). Lux, Granitzer and Kern (2007) report that in their cross-section of the del.icio.us folksonomy, 57% of tags only occur once. Since the used tags frequently follow a Power Law distribution with regard to an information resource, we could assume that users often index several high-frequency tags together. This is not the case, according to Kipp and Campbell (2006): “While high frequency tags are used frequently, they are not necessarily used most frequently with other high frequency tags” (Kipp & Campbell, 2006). In Kipp (2007), the analysis of the social bookmarking service CiteULike shows that a large part of the resources is not indexed at all. This means that resources are saved, but not tagged. Kipp (2007) blames the system design. Since resources can be found by title, author or journal title, “it is possible that these users do not consider the need to think up useful tags to be worth the time required to do so” (Kipp, 2007, 71). Great user demand, on the other hand, is present for the indexing with multiple-word tags, i.e. in the formation and indexing of compounds. Collaborative information services often allow only one-word tags, and tags separated by a blank are indexed as separate tags. This is why in the folksonomies of information services different forms of tag connections can be found, which are meant to correct this flaw in the user interface, e.g. compounds in CamelCase[122],
[122] CamelCase means the individual tags’ initials are written in upper-case letters despite the compounding.
underscore or dash (Guy & Tonkin, 2006). The variation in compounds demonstrates that users regard this sort of tags as important and even vital for the adequate description of resources: That compound tags are seen frequently may be taken to imply that the user sees a legitimate purpose for them; one might speculate that a principal use of them is a descriptive metadata, more generally known as free-text annotation (Tonkin, 2006).
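Since compound tags occur in several spelling variants (CamelCase, underscore, dash), a tagging system that wishes to treat them as one tag has to normalize them first. The following is a minimal sketch of such a normalization; the function names and the example tags are hypothetical, and the rule set is a simplifying assumption: it only handles ASCII capitals and leaves acronym-led forms such as ‘XMLParser’ as a single word.

```python
import re
from collections import defaultdict


def split_compound_tag(tag):
    """Split a one-word compound tag into lower-case component words."""
    # Put a space at every boundary between a lower-case letter or digit
    # and a following upper-case letter (CamelCase).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", tag)
    # Treat underscores and dashes as word separators as well.
    spaced = re.sub(r"[_\-]+", " ", spaced)
    return [word.lower() for word in spaced.split()]


def canonical_form(tag):
    """Common spelling under which all variants of a compound are grouped."""
    return " ".join(split_compound_tag(tag))


def group_variants(tags):
    """Map each canonical form to the spelling variants found for it."""
    groups = defaultdict(list)
    for tag in tags:
        groups[canonical_form(tag)].append(tag)
    return dict(groups)


# Hypothetical tags as they might appear in a folksonomy.
tags = ["SocialBookmarking", "social_bookmarking", "social-bookmarking",
        "socialBookmarking", "folksonomy"]
groups = group_variants(tags)

for canonical, variants in groups.items():
    print(f"{canonical}: {variants}")
```

Whether such variants should actually be merged is a design decision of the respective service; the sketch merely shows that the different compound forms can be traced back to the same sequence of words.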
A frequently occurring particularity of tags within personomies is their syncategorematical form, i.e. separated from their context, they are barely useful at all for other users in knowledge representation and information retrieval. Examples of syncategoremata practically devoid of meaning are acronyms such as P1F1[123] on Flickr.com. Pluzhenskaia (2006) here discovers that “1/4 of the tags [...] does not make sense for any users other than their author.” This conclusion is supported by Marlow et al. (2006a; 2006b). They investigate the motivations of the users regarding tagging and define personal information management for the purpose of ‘future retrieval’ as the most important reason. Others include:
• contribution and sharing: to increase access to the resources via tagging and allow other users to access the content,
• attract attention: to attract attention to one’s own resource,
• play and competition: with regard to certain rules or competitions (e.g. the ESP Game) which absolutely require tagging,
• self-presentation: the presentation of one’s own personality and interests,
• opinion expression: the expression of one’s own opinion and the desire to publish this particular comment (Marlow et al., 2006a; 2006b).
Brooks and Montanez (2006a; 2006b) reduce the tagging motivations to three common aspects, but do not rate them: “annotating information for personal use, placing information into broadly defined categories, and annotating particular articles so as to describe their content” (Brooks & Montanez, 2006a, 11). Hammond et al. (2005) present a four-field schema (see Figure 3.19) to clarify the tagging motivations - particularly with a view to tag users and content creators, and with specific examples for their use, too. They also differentiate between tagging for personal reasons, or personal resource management (‘self’) and social tagging, i.e. tagging for other people (‘others’).
The field to the bottom left here represents the most ‘selfish’ sort of tagging via the example of Flickr: users as well as authors in this information service mainly use tags for their own information retrieval. The field at the top right-hand side then represents the opposite, altruistic motivation: Wiki authors mainly create content for the benefit of other users, and tag it so that they can find it. The field to the bottom right can be explained via the example of social bookmarking services: authors provide the bookmarks and tag them for their own resource management, in the same way that users tag for personal reasons. The field at the top left uses Technorati to describe the tagging of authors for their own information retrieval as well as the alignment of tags towards their general retrievability for other users.
[123] This acronym refers to the Flickr group ‘Post1Fave1.’
Figure 3.19: Tagging Motivations and Exemplary Applications. Source: Hammond et al. (2005, Fig. 3).
Ames and Naaman (2007) interview Flickr users on their reasons for tagging photos and also present the result as a four-field schema (see Figure 3.20). The chart to the left displays the ‘sociality’ aspect vertically, i.e. whether the tag is intended for the individual or for the community of friends, family or the public; the ‘function’ aspect, which represents the purpose as well as the use of the tags, is displayed horizontally. This continuum is where Ames and Naaman (2007) localize the following tagging motivations: a) top left - users tag in order to arrange and retrieve information resources for themselves; b) bottom left - users tag so that information resources can be found by all selected communities (e.g. family, friends) and perhaps in order for these other users to enhance their reputation by clicking or commenting on their tags in the collaborative information service. Further motivations within this category include the adjustment of tags to certain community conventions and the knowledge that other users can profit from the tags; c) top right - users tag in order to store information concerning the resource for themselves; d) bottom right - users tag in order to attach context information to the resource and make them public to the community, e.g. geographical information or ratings. The chart at the right-hand side in Figure 3.20 reflects the frequency distribution of the tagging motivations. A total of 13 users were interviewed by Ames and Naaman (2007); primary motivations are in bold letters, secondary ones in italics. Although the number of persons interviewed is not representative, the chart is able to suggest a tendency. Three trends can be discerned: Organization for oneself is a more common motivation than communication for oneself [...] communication with friends and family is a more common motivation than organization for friends and family [...] organization for the
general public is a much more common motivation than communication (Ames & Naaman, 2007, 978).
The four-field schema cannot, however, provide a clear delineation of tagging motivations. This is why the authors emphasize: Again, we emphasize that specific tags can play several roles in our motivation taxonomy. [...] However, we found that generally, most participants only considered one or two motivations for adding; in many cases, they had not considered the other possible benefits (Ames & Naaman, 2007, 978).
Figure 3.20: Tagging Motivations and Exemplary Applications. Source: Ames and Naaman (2007, 976f., Table 1 & 2).
Figure 3.21: Motivations for the Use of Tagging Services. Source: Modified from Panke & Gaiser (2008a, 28, Fig. 1).
Panke and Gaiser (2008a) take up this study and formulate the tagging motivations a little more generally, in order to apply them to all collaborative information services and not have to adopt the strong focus on photosharing services (see Figure 3.21). Thom-Santelli, Muller and Millen (2008) conduct very personal interviews with 33 users and ask them about their tagging motivations, especially regarding their tagging behavior in corporate tagging systems. Here they are able to determine five sorts of user:
a) community seeker: searches for and indexes with tags in order to join a preexisting community;
b) community builder: indexes with tags that are known to all members of a community, so that these members are able to easily retrieve the tagged resources;
c) evangelist: indexes with known tags and uses these consistently in several tagging systems, so that other users are able to retrieve the resources. Furthermore, he links the resources among each other and takes care to let all users know he is an expert in this area;
d) publisher: indexes with tags in order to purposely reach different demographics. Mostly it is his task in the company to carefully spread information and increase its visibility, without focusing on his own reputation;
e) small team leader: indexes and searches with tags that are only known to a small group of users and specifically serve the spreading of information within this group.
This categorization makes it clear that tagging in a company intranet is very different from personal tagging as a thing to do in one’s free time; personal organization strategies (as seen in Figure 3.19, 3.20 and 3.21) are almost nonexistent. On the other hand, it becomes clear that users of corporate tagging systems are highly aware of the other users and modify their behavior accordingly.
Figure 3.22: Impetus Factors in the Collaborative Information Service. Source: Fokker, Buntine, & Pouwelse (2006, Fig. 3).
Fokker, Buntine and Pouwelse (2006) present a model called ‘Taxonomy of cooperation inducing interface features,’ which describes the impetus factors for users to join a collaborative information service at all. The factors are displayed in Figure 3.22. The first factor (‘social distance’) describes the influence of the social and sometimes geographical distance to the members of the community concerning their readiness to join a collaborative information service or react to recommendations. The second factor (‘user profiling’) describes the measures which explain the usefulness of the information service to the members of the community, e.g. a complete user profile leads to better recommendations, since it facilitates the comparison with other user profiles or resources. Collective Intelligence is used in the third factor (‘power of collectivity’) and states that the system improves, the more users join it. The last factor (‘social visibility’) concerns the visibility and reputation of a community member in the sense of their social presence. The studies and models also show that the knowledge of a social presence (i.e. if users know that they have an audience and their behavior is being watched) can
influence tagging behavior (Zollers, 2007). The factors ‘social communication’ and ‘social organization’ from the chart explained above (see Figure 3.20) would not exist otherwise. Vuorikari and Ochoa (2009) are able to demonstrate that users formulate their tags in the native language as well as in a metalanguage (e.g. English), if they are aware of the tagging system’s multilingual users. A study by Nov, Naaman and Ye (2008) investigates tagging activity via two indicators of social presence: groups and contacts. Flickr users can join groups with a particular focus of interest or link up with other users directly via the contact function. The study’s hypotheses are: the degree of self, public and family+friends motivation (from Figure 3.20) increases proportionately to the growing number of allocated tags - the more contacts a user has, the more tags he will allocate, and the more groups he joins, the more tags he will allocate. Furthermore, the authors assume that the greater a user’s activity on Flickr, represented by his uploaded photos, the more tags he will allocate. The study arrives at the following conclusion: “We found that the levels of the Self and Public motivations as well as the social presence indicators and the number of photos, were positively correlated with tagging level” (Nov, Naaman, & Ye, 2008, 1099).
Figure 3.23: Influence Factors on Users’ Tagging Behavior. Source: Sen et al. (2006, 182, Fig. 1).
So users tag for different reasons. Here it can also be observed that tags, or their names, do not necessarily stay the same. Tags are relinquished, new tags are added or pre-existing ones, if possible, modified. The reasons for this are explained by Russell (2006; see also Birkenhake, 2008): “three reasons: 1) the content has changed, 2) the vocabulary used to describe the content has changed, or 3) the users have become aware of the social interactions of their tagging and have changed their behaviour.” The power of the community seems to have a great influence on users (Lee, 2006). Sen et al. (2006) investigate the factors ‘personal tendency,’ personal tagging preferences and habits, ‘community influence,’ awareness of the user community according to ‘social proof[124],’ and ‘tag selection algorithm,’ the visible
[124] For more on this, see the elucidations in chapter four, section ‘Searching vs Browsing vs Retrieving.’
tags’ presentation form regarding their influence on users’ tagging behavior. The influence factors as well as their effects are summarized in Figure 3.23. The user selects the tags for indexing according to his preferences, ideas and experiences from other tagging systems as well as his knowledge of the world. This choice of tags will lead to user habits which will then cement themselves into a personal tagging vocabulary for the user and hence become his preferred indexing choices. A change in personal habits is incredibly hard to effect; only the community can exert a strong enough influence on the individual to make him change his habits and thus his tagging vocabulary. This influence can assert itself through the imitation of other users and their indexing habits, for example. Furthermore, users’ indexing habits are directed by the system itself: “The method by which a system selects and displays tags serves as a user’s lens into the tagging community, and thus directly impacts the community’s influence on a user's choice of tags” (Sen et al., 2006, 183). These influencing factors act on the design of the tags in particular. Sen et al. (2006) determine that users often relate their tags to particular purposes (self-portrayal, organization, getting to know as well as resource retrieval and relevance determination). Hence: Different tag classes are useful for different tasks. Factual tags are useful for learning about and finding movies. Subjective tags are useful for self-expression. Personal tags are useful for organization. Both factual and subjective tags are moderately useful in decision support (Sen et al., 2006, 189).
Golder and Huberman (2006) assume that it is more probable for a user to re-index a previously used tag than it is for them to generate a new one. Sen et al. (2006) also assume that users imitate each other according to the ‘conformity theory’ and copy the tags’ quality as well. If a resource is indexed with many ‘good’ tags, users will also tend to add good tags: “Conformity theory predicts that the tags that users see from other users will influence the tags that they in turn assign” (Sen et al., 2007, 361). Maier and Thalmann (2007) confirm this thesis with their study on tagging students. If a rather content or subject-oriented quantity of tags was provided as the pre-existing tag collection, students mostly chose content or subject tags to index the resources; if the initial quantity consisted of context-defining tags, students largely chose context tags: “These observations uphold the thesis that the choice of the initial quantity can purposely influence the genre of the indexed tags” (Maier & Thalmann, 2007, 83)*. In their experimental, psychologically and sociologically motivated study, Rader and Wash (2008) observe that this assumption is correct, but must be restricted regarding the ‘tag source:’ Of our three explanatory variables, the strongest influence is users' previous tag choices. The coefficients on used.byuser [the user's tags, A/N] consistently indicate a much larger influence than that of used.onsite [tags in the resource folksonomy, A/N] or the interaction term [tags from del.icio.us’ recommender system, A/N] (Rader & Wash, 2008).
Farooq et al.’s analysis of the scientifically oriented social bookmarking system CiteULike also shows that most tags on the entire platform are not reused. According to these findings, users tend to imitate themselves rather than other users. Dye (2006), however, observes that personomies adapt to the structures, wishes and habits of the community over time, probably in order to make themselves visible within it, or keep themselves that way: “Most folksonomists pattern their tags after
others’. It makes sense, even for regular folks: for information to be found by the maximum number of searchers, its tags have to make sense to searchers” (Dye, 2006, 42). This development is also supported, or strengthened, by particular tag recommender systems[125]. The motivations ‘future retrieval’ and ‘awareness of social interaction’ are thus situated in an area of tension between personalization and general accessibility: “The latter suggests that the current stage of social tagging has a potential to evolve in two opposite directions: toward further socialization, or toward deeper individualization and disintegration of online communities” (Pluzhenskaia, 2006). It is a fact, however, that tags are not used without a reason. Tags are used consciously, just as an information resource is saved consciously: “Social tagging systems [...] are by essence intentional, users actively select a page or objects they want to keep or share” (Maarek et al., 2006). Numerous studies currently deal with the analysis of user behavior during tagging (e.g. Panke & Gaiser, 2008a; 2008b). Among other things, the added tags are compared to surrogates from controlled vocabularies (e.g. Lin et al., 2006; Kipp, 2006b; 2006c; 2007; Al-Khalifa & Davis, 2007c; Spiteri, 2007; Bruce, 2008; Peterson, 2008), structured tag entries are investigated via field prescriptions (Bar-Ilan et al., 2006), user demand in museum catalogs is defined (Trant, 2006a; 2006b; Kellog Smith, 2006) or users’ tagging behavior is compared beyond platform borders (Muller, 2007a). Lin et al. (2006) compare tags, automatically extracted terms from resource titles and descriptors from MeSH (Medical Subject Headings), in order to check the adequacy of the three methods regarding indexing quality. Here resources that have been tagged by users in Connotea, a social bookmarking tool, and at the same time been enriched with further bibliographical information, such as MeSH, by databases such as PubMed, serve as a foundation.
Furthermore, an automatic procedure (GATE, see Lin et al., 2006), is used to extract terms from the titles of these resources. Finally, the three keyword sets (tags, MeSH terms and title terms) are compared with each other. This comparison shows that only 11% (59 tags from 540 in total) of indexed tags match the MeSH terms. In her analysis of CiteULike resources (indexed with user tags, author keywords and thesaurus terms), Kipp (2006c; see also Kipp, 2007) is also able to show that users may use descriptions similar to thesauri for indexing purposes, but that there are very few 100% matches between tags and descriptors. A similar result is obtained by Peterson (2008). Lin et al. (2006) suspect that this is due to the indexing methods’ different goals. Tagging users seem to have other demands than professional indexers, who want to index and describe the resources absolutely using controlled vocabularies. Users seek out the subject they are interested in and add a tag rather than represent the resource completely: While MeSH terms are mostly used for content description, tagging terms are employed by individual users for content as well as for other reasons. It appears as if users do not attempt to tag all content of a document but instead they highlight specific content or facts most interesting to them (Lin et al., 2006).
125
Tag recommender systems, including their advantages and disadvantages, are addressed in more detail below.
Knowledge Representation in Web 2.0: Folksonomies
It is also interesting that tags are often redundant (described in Farooq et al. (2007) as ‘tag non-obviousness’) and do not provide any further information (see also Lux, Granitzer, & Kern, 2007; Jeong, 2008; Heckner, Mühlbacher, & Wolff, 2007). 19% (102/540) of tags match the automatically extracted title terms, i.e. the tags are sometimes adopted straight from the resource titles and, strictly speaking, do not provide any informational added value. The resource title is registered and indexed by the system by default anyway. Heymann, Koutrika and Garcia-Molina (2008; see also Heckner, Mühlbacher, & Wolff, 2007 for an analysis of the social bookmarking system Connotea with similar results) also arrive at this conclusion in their analysis of the social bookmarking system del.icio.us regarding the retrieval functionality: “Tags are present in the pagetext of 50% of the pages they annotate and in the titles of 16% of the pages they annotate. A substantial proportion of tags are obvious in context, and many tagged pages would be discovered by a search engine” (Heymann, Koutrika, & Garcia-Molina, 2008). Bischoff et al. (2008) confirm this result, while detecting a newness value of 98.5% for music resources; the reason for this is the low quantity of metadata normally available for music resources. In Lin et al. (2006), the smaller match between tags and MeSH terms can also be interpreted as a greater information content on the part of the controlled vocabulary. Peterson (2008) and Kipp (2007) subscribe to this view: “title keyword searching alone and controlled vocabulary searching alone led to failure find some articles” (Kipp, 2007, 66). Lin et al. (2006) refer to the artificiality of controlled vocabularies and thus summarize the result of their study as follows: [Results] suggest that users more likely select terms in titles for their tagging [...] 
[and that] the user's tagging language is more close to natural language than to controlled vocabularies. [...] Our data suggest that tags are more similar to automatic indexing than to controlled vocabulary indexing, particularly when the number of users increased (Lin et al., 2006).
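The overlap figures cited from Lin et al. (2006), 59 of 540 tags matching MeSH terms and 102 of 540 matching title terms, amount to set intersections between keyword sets. The following is a minimal sketch of such a comparison, not the procedure the authors actually used; the function name `overlap_share`, the case-insensitive exact matching and the toy data are assumptions made for illustration only.

```python
def overlap_share(tags, reference_terms):
    """Return the distinct tags that also occur in reference_terms,
    together with their share among all distinct tags.

    Matching is exact after trimming and lowercasing (an assumption;
    the cited studies may have matched terms differently).
    """
    tag_set = {t.strip().lower() for t in tags}
    ref_set = {t.strip().lower() for t in reference_terms}
    matches = tag_set & ref_set
    share = len(matches) / len(tag_set) if tag_set else 0.0
    return matches, share


# Hypothetical toy data, not taken from any of the cited studies.
tags = ["genomics", "Cancer", "toread", "protein folding"]
mesh_terms = ["Neoplasms", "Genomics", "Protein Folding"]
title_terms = ["cancer", "genomics", "review"]

mesh_matches, mesh_share = overlap_share(tags, mesh_terms)
title_matches, title_share = overlap_share(tags, title_terms)

# The percentages reported in the text follow from the raw counts.
reported_mesh_pct = round(59 / 540 * 100, 1)    # tags matching MeSH terms
reported_title_pct = round(102 / 540 * 100, 1)  # tags matching title terms
```

With exact matching, a tag such as ‘Cancer’ does not match the descriptor ‘Neoplasms,’ which illustrates why complete matches between tags and controlled vocabulary terms are rare even when both describe the same subject.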
Al-Khalifa and Davis (2007a; see also Al-Khalifa & Davis, 2007c) use the Yahoo Term Extractor to extract the most important terms from a corpus of different websites and then compare these to the tags in del.icio.us and an indexer's handpicked descriptors. Their results are that 1) “folksonomy tags cannot be replaced by automatically extracted keywords” (Al-Khalifa & Davis, 2007a) and 2) that “folksonomy tags agree more closely with human generated keywords than the automatically generated ones” (Al-Khalifa & Davis, 2007c). These findings contradict the results of the studies mentioned above. Bar-Ilan et al. (2006) use their study to examine the advantages and disadvantages of guided tagging via specific fields. The users are asked to add tags to pictures and then allocate these tags to the fields ‘general themes,’ ‘symbols,’ ‘personalities,’ ‘description of event,’ ‘location of event,’ ‘time of event,’ ‘object type,’ ‘object creation date,’ ‘creator,’ ‘relate links,’ ‘additional information about the image that did not match any of the predefined fields’ and ‘recommending additional fields to improve the form,’ where not all fields are necessarily relevant for each picture. For comparison, another group tags pictures freely. The study finds that field-based indexing produces more detailed tags, and thus more detailed descriptions of the pictures, than free tagging. According to Bar-Ilan et al. (2006), this is because the fields serve as a tagging aid for indexing, but also provide a necessary context for the tag: “The basic
role of the elements [fields, A/N] is to provide context” (Bar-Ilan et al., 2006). Furthermore, they observe that “the structured group felt obliged to fill in most of the fields” (Bar-Ilan et al., 2006). So the preset fields prompt users to add tags in all categories and thus to think more deeply about an adequate description of the picture than free tagging does. This results in higher-quality image descriptions. A problem can arise from the field labels, however. For example, the fields ‘description of event’ and ‘location of event’ often led to problems of demarcation and unambiguous allocation of the tags. Also, tags can be allocated more than once to fields like ‘symbol’ and ‘general themes.’ This not only leads to confusion, but also eradicates the added value provided by structured tagging, and the end result is again similar to full-text indexing, or free tagging. Bar-Ilan et al.’s (2006) final recommendation for tagging systems is to combine free tagging and field-based tagging: We believe that the use of elements has added-value, since overall, the ‘structured’ group provided higher quality descriptions. Probably, the system should suggest a list of elements from which the user can choose, and allow him/her to add new elements if none of the existing ones are appropriate for his/her needs (Bar-Ilan et al., 2006).
A similar conclusion is drawn by Trant (2006a) in her study on tagging behavior regarding museum objects and the use of controlled vocabularies. The controlled vocabulary is often insufficient for user needs, and also often too specific for amateur research in museum catalogs. This is why users embrace free tagging. Furthermore, it becomes evident that many of the user-allocated tags do not exist in the museum-specific controlled vocabularies yet, which provides real added value for the user, but also for the creation of vocabularies: Thus, volunteer taggers do provide many terms that are not in standard art object records in museum collection management systems. 88% of tagger terms were not already in the Metropolitan Museum’s object description system. ¾ of these terms were judged valid and ‘appropriate’ to these artworks by [the] museum’s Subject Cataloguing Committee (Trant, 2006a).
Veres (2006) pursues a different approach. She compares category descriptions from the Open Directory Project and the Yahoo Directory as well as tags from del.icio.us with a linguistics-based classification of categories following Wierzbicka (1984) and finds out that most categories from the directories have a functional character, i.e. they answer questions such as “What is it for? How can you use it? Where does it come from? Why is it there?” (Veres, 2006). The tags, on the other hand, do not concentrate exclusively on the functional character. This is mainly reflected in the word forms: “First, many tags in del.icio.us were not nominals. Sometimes people use adjectives to describe a resource, for example cool, social” (Veres, 2006, 65). Verbs are also used for resource descriptions. So it appears as though not only a great variety of semantic categories exists within folksonomies, but that grammatical variation regarding word form is also used extensively. Muller (2007a) investigates users’ tagging behavior within four different collaborative information services, each used in the intranet of IBM. The study’s goal is summarized as follows: We hope to learn if people expect each tag to have a stable, consistent meaning across multiple social-tagging services, and if their tag usage is
consistent across multiple types of objects stored in the multiple services. [...] Our studies will help us to understand the semantics of social-tagging services, and thus the extent to which these services can provide a set of common descriptors for an enterprise’s collective knowledge (Muller, 2007a, 342).
So do users of the different collaborative information services use their tags consistently, even beyond platform borders? The study concludes that only 13% of tags occur in all four information services. Even more surprising is the fact that not even one and the same user tags consistently: When we focused on personal re-use of tags across pairs of services, we found very low tags-in-common rates (mean of 1.06 tags-in-common per person, or mean of 2.65% of per-user opportunities to have tags in common). When we re-focused our analysis on the tags, and asked whether each tag was associated with the same users across services [...], we again found very low rates of commonality (Muller, 2007a, 346).
Muller (2007a) initially attributes this phenomenon to the different information services, which manage different resources (bookmarks, people, blog entries and activities). Thus resources would naturally be indexed with different tags. But the analysis of the study’s data material yields different results: users tag similar resources with the same tags, even beyond platform borders - but the tags are not directly recognizable as ‘the same,’ since they are often blurred by orthographical variation or compounding (Muller, 2007a, 347). Editing the tags using a normalization algorithm (Porter Stemmer and De-Prefixing) is meant to help, but is only able to achieve an improvement of 21.3% in tag similarity. The implementation of the semantical disambiguation between tags is meant to be explored in future papers. Thus Muller (2007a) is left to conclude: “We are left with the apparent paradox that people work hard to write tags, but seem not to do this consistently for related aspects of their work that are stored in different services” (Muller, 2007a, 347). In telephone interviews of Flickr users, Cox, Clough and Marlow (2008) gather information on the usage of the photosharing service and investigate the relation between amateur photography clubs and group activity on Flickr. With regard to the tagging of photos, they find out that the social component of the tags plays a big role in the photosharing community and is taken very seriously. Tags are mainly used to alert the audience to one’s own resources: Our interviewees were quite concerned with generating interest and feedback on their photos, and a key motivation for tagging their own photographs and being drawn into commenting and group activity was to increase activity on their own photos (Cox, Clough, & Marlow, 2008, 9).
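The tag normalization mentioned in connection with Muller (2007a) can be illustrated by a much simpler stand-in. The sketch below is not the Porter Stemmer and does not reproduce the de-prefixing step, whose details are not given here; the suffix list, the minimum stem length and the function names are assumptions chosen only to show how orthographic variants of a tag can be mapped to a common form.

```python
import re

# Illustrative suffix list only; a real Porter stemmer applies a much
# larger, ordered rule set. Longer suffixes are tried first.
_SUFFIXES = ("ing", "es", "s")


def normalize_tag(tag):
    """Map a tag to a crude canonical form.

    Steps (all assumptions for this sketch): lowercase the tag, remove
    whitespace and the separators '_', '-' and '.', then strip one
    common English suffix if at least three characters remain.
    """
    tag = tag.lower()
    tag = re.sub(r"[\s_\-\.]+", "", tag)
    for suffix in _SUFFIXES:
        if tag.endswith(suffix) and len(tag) - len(suffix) >= 3:
            return tag[: -len(suffix)]
    return tag


def shared_after_normalization(tags_a, tags_b):
    """Return the normalized tags that two tag collections have in common."""
    return {normalize_tag(t) for t in tags_a} & {normalize_tag(t) for t in tags_b}


# Two hypothetical services in which the same user spells tags differently.
bookmarks_tags = ["Wikis", "to_read", "Social_Bookmarks"]
blog_tags = ["wiki", "ToRead", "social-bookmark"]
common = shared_after_normalization(bookmarks_tags, blog_tags)
```

Without normalization, the two hypothetical tag lists have no entry in common; after normalization all three tags coincide. Such a procedure only removes surface variation and cannot resolve semantic differences between tags.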
Users’ comments on collaborative tagging are of particular interest. Flickr uses an Extended Narrow Folksonomy, i.e. the content creator may create tags for his picture, while other users can only add new tags. Since it is Collective Intelligence in particular which causes much of the uproar about folksonomies and in scientific discussion is generally presented as an advantage popular with users, the results of Cox, Clough and Marlow’s (2008) survey come as a surprise: Interviewees differed about tagging the photos of others, as opposed to tagging one’s own. Several saw it as rude, because it was an invasion of one’s
own space and also because whoever had added the tag was not visible (whereas for most other user behaviour in one's space, e.g., commenting, one can navigate to the profile of the person who did it). Others saw tagging photos of others, e.g., with corrected spellings, as a public good (Cox, Clough, & Marlow, 2008).
So users are generally aware of their actions in collaborative information services and act reticently, or appropriately. That here too there is the possibility for vandalism, e.g. spam, goes without saying. Winget (2006) also discovers, in her analysis of Flickr tags, that users do indeed deal with tag allocation on a very critical level: This research project further suggests that users, especially if they are invested in the ‘success’ of their images within the given system, at the very least have the best intentions in terms of description. Not only do they give their images multiple tags [...], they also provide specific geographical terms when appropriate, as well as multiple spellings, abbreviations and concatenations of those terms (Winget, 2006).
The study by Trant (2006b) mentioned above observes at this point, that the quality of the added tags is related to the user’s visibility. Non-registered users make more than twice as many spelling mistakes as registered members and also use inadequate tags more often, such as obscenities. Further advantages of folksonomies or tagging systems are identified as the easy applicability of indexing via tags. Cox, Clough and Marlow’s (2008) interviewees also provide unexpected perspectives on these points. On the one hand, they often regard the tagging of information resources as annoying; on the other hand, they make no statements at all on personal information management and the arrangement of their own pictures. They are aware that tags allow them to reach a larger audience and may facilitate searches, but they recognize no added value in the development of their own knowledge organization system: As an aspect of organizing the collection, tagging was recognised as important but also often seen to be boring or difficult. [...] However, another seemed to be concerned to generate the maximum number of tags. We suggest that tagging, though of interest theoretically for information studies, may not be very important in Flickr (at least not to users), because one is not generally searching for an image as such (Cox, Clough, & Marlow, 2008, 10).
The question arises of whether users must possess something like a certain tagging competence: once regarding the adequate selection and allocation of tags (e.g. form of expression and correct assignment), and then regarding their understanding of the applications of folksonomies (e.g. as a structuring instrument). The blogger Pind (2005) has the following to report on the subject of his tagging habits: One thing I have noticed is that I suck at tagging my links in delicious; I draw a blank, and frequently just repeat words from the title. [...] This is a problem, because, at least on my part, it is more an artificat of my shortcomings as a tagger, than of the links I am tagging (Pind, 2005).
Pind (2005) is aware of his tagging incompetence - many other users of tagging systems may never have thought about whether they use the tags ‘correctly,’ or to their own advantage, in order to create real added value for them. The problem-solving approaches for folksonomies pick up on this theme and are discussed below.
Tag Categories
Research into tag categories is currently very active. The focus is particularly on the linguistic quirks of tags and the way these differ from standard language, but also on regularities regarding certain genres or forms of tags. The default databases for this research are del.icio.us and Flickr. In the following, I will present a few selected research results. Dubinko et al. (2006), Winget (2006) and Schmitz (2006) derive tag categories from the Flickr folksonomy, Golder and Huberman (2006) as well as Al-Khalifa and Davis (2007b) do the same via del.icio.us and Mathes (2004) defines the categories from both services' tags. Tags are defined ad hoc as hyperonyms, or categories, by test subjects or the researchers themselves. Dubinko et al. (2006) develop an application which makes it possible to visualize the appearance and disappearance of Flickr tags as well as selected images on a timeline. In the context of this application, they identify three large tag categories under which most tags can be subsumed: 1) ‘events,’ recurring, yearly or one-time events and bank holidays, e.g. ‘Valentine’s Day,’ 2) ‘personalities,’ people, e.g. ‘Jeanne Claude’ or ‘pope,’ and 3) ‘social media tagging,’ special tags which serve to identify communities, e.g. ‘Faces_in_holes.’ Winget (2006) investigates the Flickr folksonomy with a view to its cluster functionality and finally arranges tags into five categories: 1) ‘date and time,’ e.g. ‘june11’ or ‘samsbirthday,’ 2) ‘geographical,’ e.g. ‘rome’ or ‘trevifountain,’ 3) ‘narrative,’ e.g. ‘buildings’ or ‘sunset,’ 4) ‘characterizations’ e.g. ‘mygirlfriend’ or ‘happy,’ and 5) ‘individually defined tags’ e.g. ‘mynecktie’ or ‘pickleproject.’ Sigurbjörnsson and van Zwol (2008) are also able to detect five categories in their analysis of Flickr tags (for a total of 52m photos). 
They compare the tags with WordNet categories and find out that 28% of tags can be allocated to the category ‘locations,’ 16% to ‘artifacts’ or ‘objects,’ 13% to ‘people’ or ‘groups,’ 9% to ‘actions’ or ‘events’ and 7% of tags to the category ‘time.’ Schmitz (2006) wants to use his study to automatically generate semantic information and relations from tags and determines the following ‘key facets’ (Schmitz, 2006): ‘place,’ geographical information such as locations, ‘activity/event,’ such as ‘skiing in the holidays,’ ‘depiction,’ tags which describe the object, e.g. ‘people’ or ‘flowers,’ and ‘emotion/response,’ tags which emphasize the community aspect of the platform, such as ‘funny’ or ‘toread.’ Golder and Huberman (2006) discover seven functions, or categories, of tags in their analysis of del.icio.us tags: the category ‘topics’ includes generic descriptions in varying detail and proper names, which explain what the resource is about, ‘type’ includes information on the resource's format, e.g. ‘blog’ and ‘article,’ ‘ownership’ states the author, ‘adjectives’ reflect the author’s opinion, e.g. ‘funny,’ ‘scary,’ ‘self reference’ explains the relation between the tagger and the resource, e.g. ‘mystuff,’ ‘task organizing’ summarizes those tags that describe future uses for the resource, e.g. ‘toread’ and finally, ‘refining tags’ are syncategoremata, i.e. tags which describe another tag in more detail and only derive their meaning from this relation. Al-Khalifa and Davis (2007b), on the other hand, reduce the number of tag categories to three (as do Sen et al., 2006): 1) personal tags for organizing personal resources, 2) subjective tags reflecting the user's opinion on the resource, and 3) factual tags describing facts regarding a resource. Al-Khalifa and Davis' categorization is extraordinary in that it regards abbreviations or compounds as
personal tags, “since no one knows what they do or mean, or why they were formed in this shape” (Al-Khalifa & Davis, 2007b). In his research on folksonomies in del.icio.us and Flickr, Mathes (2004) finds out that tags can be summarized in eight large categories: 1) ‘technical tags’ like ‘RSS’ or ‘Python,’ 2) ‘genre tags’ like ‘Humor’ or ‘Comic,’ 3) ‘self-organization,’ e.g. ‘todo,’ 4) ‘place names’ like ‘NewYork,’ 5) ‘years,’ e.g. ‘2007,’ 6) ‘colors’ such as ‘yellow,’ 7) ‘photographic terms’ like ‘cameraphone’ and 8) ‘ego,’ e.g. ‘me’ or ‘mystuff.’ The studies mentioned here demonstrate that a variety of possible tag categories can be defined. However, there is no complete match in their descriptions of tag categories; only two results will be the same, respectively. This allows for the definition of the most frequently recognized categories via co-occurrence: people, things, events, places, self or task organization and ego seem to lead the charge and are used most often by tagging users. This is merely an implicit process, since the user does not select these categories consciously in order to add tags to them. Only in retrospect can category allocation be manually defined. Bischoff et al. (2008) investigate tags and category allocations in various collaborative information services (del.icio.us, Flickr and Last.fm) and notice significant differences. These can be explained mainly via the resource categories (bookmarks, photos, music): Specifically, the most important category for Del.icio.us and Flickr is Topic, while for Last.fm, the Type category is the most prominent one […]. For pictures only, Location plays an important role […] while Last.fm […] exhibits a significantly higher amount of subjective/opinion tags. Time and Self-reference only represent a very small part of the tags studied here. Author/Owner is a little more frequent, though very rarely used in Flickr due to the fact that people mainly tag their own pictures (Bischoff et al., 2008, 206).
Heckner, Neubauer and Wolff (2008) make similar observations: photos are tagged for content; photos are tagged for location; photos are often untagged; photos are tagged with the camera device name; videos as well as photos are often tagged extensively; videos are tagged for persons; scientific articles are tagged for time and task (Heckner, Neubauer, & Wolff, 2008, 9).
Further differences in the tag categories can also be due to the respective purpose of the collaborative information services. Heckner, Mühlbacher and Wolff (2007) analyze the rather scientifically minded social bookmarking service Connotea regarding its tag categories and are able to determine, among other things, that 92% of tags are related either to the resource’s content or to the resource itself (e.g. resource type) and that emotional (‘funny’) or ‘time and task related’ tags are used minimally: Under the assumption that scientists use a special kind of language register when annotating bibliographies, this low number [of ‘time and task related’ tags, A/N] can possibly be ascribed to fundamental differences between scientific and standard language use (Heckner, Mühlbacher, & Wolff, 2007).
Tag allocations can be compared to the fields in a field schema and thus combine folksonomies with traditional knowledge representation. Fields, or categories, also make it possible during indexing and during retrieval to distinguish, for example, books by Shakespeare from books on Shakespeare, or to define the tag ‘Italy’ either as the user’s location or as the image’s photographic content (van Damme, Hepp, & Siorpaes, 2007, 58). This distinction is not possible in collaborative information services as of yet. If one wants to use this advantage in knowledge representation, a suitably designed system must be in place (Knautz, 2008) and the user must be enticed and educated to adequately fill the categories. The distinction or allocation to categories is one way of separating tags from one another in their setup. Another way is the allocation to two tag groups: one group contains all tags regarding the information resource’s aboutness, the other group all tags which exist independently of that aboutness. The categories mentioned above, like people or places, belong to the first group since they describe the content - what the resource is about. Tags in this group may be redundant, since oftentimes full-text comments, title information or other metadata of the information object is available for indexing or research. Nevertheless, aboutness tags provide a further access path to the resources and are, in a way, independent of them, as Pluzhenskaia (2006) emphasizes: “such tags are context-independent and can be easily shared by all users”. Golder and Huberman (2006) also recognize the user-independent character of these tags: “[They] are not necessarily explicitly personal. [...] the information is extrinsic to the tagger, so one can expect significant overlap among individuals” (Golder & Huberman, 2006, 204). Aboutness tags regard obvious themes of the resource, summarize resources with similar themes and can be defined by nearly all users. 
Tags that do not directly describe the resource’s aboutness but might still create added value for the user – be it for the personal arrangement of his resources – belong to the second group. Pluzhenskaia (2006) formulates the arbitrary character of these tags slightly more drastically: “user-specific tags that are virtually meaningless to anybody except their creators.” Hassan-Montero and Herrero-Solana (2006) present a weakened version of this point: “In other words, tags are not (always) topic or subject index terms.” These tags include ‘cool,’ ‘todo,’ ‘gtd126,’ etc. Brooks and Montanez (2006a) address another problem: “there’s no shared meaning that can emerge out of a tag like ‘todo’” (Brooks & Montanez, 2006a, 11). Users’ Collective Intelligence is thus inadequately exploitable via these tags. This category also includes tags that are intentionally distorted or obfuscated by their creators so that only they or a community of their choice can understand them: “In fact, some participants, in order to maintain privacy, purposefully obfuscated their tags so that friends or family would understand what the tags meant but the general public would not” (Ames & Naaman, 2007, 978). Heckner, Mühlbacher and Wolff (2007) call such tags understandable only to the user ‘exclusive label tags’ and tags used by a community to index resources ‘shared label tags.’ Tags that rate or express a feeling towards the resource, such as ‘cool,’ ‘fun,’ ‘sucks,’ are described by Kipp (2006a) as ‘affective tags’ or ‘emotional tags’. Tonkin, Tourte and Zollers (2008, see also Zollers, 2007) discuss so-called ‘performance tags,’ like ‘waste of time and money,’ which occur in tagging systems
126
The abbreviation ‘gtd’ stands for ‘getting things done.’
with a multiple-word tagging option. Performance tags are defined as follows on the example of the tag ‘makes me wish for the sweet release of death:’ Such a tag has an information dimension – it suggests that the resource to which it has been applied is better avoided. The second information dimension refers to the author themselves, by means of the provision of a dramatised reaction to the resource (Tonkin, Tourte, & Zollers, 2008).
Users who index such tags communicate with other users via those tags. At the same time they assume, during the indexing process, that there is an audience worth communicating with for the indexed resource and the tags (Tonkin, Tourte, & Zollers, 2008). In traditional knowledge representation, these emotionally charged keywords or allocations are not allowed, since they do not restrict themselves to the aboutness of the object to be indexed. Users of tagging systems seem to have a need for these annotations, however. Yanbe et al. (2007) analyze the Japanese social bookmarking service ‘Hatena Bookmarks’ and find out that there are multiple variants of ‘sentiment tags’ (see Figure 3.24), but purely descriptive tags are still used more often (to a ratio of 10:1). It is noteworthy that only one negative tag, ‘it’s awful,’ was used more than 100 times: “This means that social bookmarkers usually do not bookmark resources to which they have negative feelings” (Yanbe et al., 2007, 112).
Figure 3.24: Top 54 Emotional Tags of the Japanese Social Bookmarking Service ‘Hatena Bookmarks.’ Source: Yanbe et al. (2007, 112, Fig. 8).
So-called ‘geotags’ (Neal, 2007; Amitay et al., 2004; Ahern et al., 2007; Toyama, Logan, & Roseway, 2003; Jaffe et al., 2006; Naaman, 2006), as they can be indexed on Flickr, for example, consist of locations or GPS data and can be attached to the photo, or are registered while the photo is taken by a suitably equipped device (Toyama, Logan, & Roseway, 2003): Geotagging (geocoding) is the process of adding geographical identification metadata to resources (websites, RSS feed, images or videos). The metadata usually consist of latitude and longitude coordinates, but they may also include altitude, camera heading direction and place names (Torniai, Battle, & Cayzer, 2007, 1).
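The metadata listed in the definition above can be pictured as a small record with two mandatory and three optional fields. The following sketch is an illustration under stated assumptions, not the data model of Flickr or of any cited system: the class name `GeoTag`, its field names and the helper `distance_km` are hypothetical. The distance function uses the standard haversine formula on a spherical Earth, which is one simple way to express how close two geotagged resources are to each other.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


@dataclass(frozen=True)
class GeoTag:
    """Hypothetical geotag record following the quoted definition."""
    latitude: float                      # degrees, -90 to 90
    longitude: float                     # degrees, -180 to 180
    altitude_m: Optional[float] = None   # optional altitude in metres
    heading_deg: Optional[float] = None  # optional camera heading
    place_name: Optional[str] = None     # optional place name

    def __post_init__(self):
        # Reject coordinates outside the valid ranges.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude must lie between -90 and 90 degrees")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude must lie between -180 and 180 degrees")


def distance_km(a, b):
    """Great-circle distance between two geotags (haversine formula)."""
    phi1, phi2 = math.radians(a.latitude), math.radians(b.latitude)
    dphi = phi2 - phi1
    dlam = math.radians(b.longitude - a.longitude)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Approximate, illustrative coordinates for two locations in Rome.
trevi = GeoTag(41.9009, 12.4833, place_name="trevifountain")
colosseum = GeoTag(41.8902, 12.4922, place_name="colosseum")
separation = distance_km(trevi, colosseum)
```

Only latitude and longitude are required in this sketch, mirroring the statement that the metadata "usually consist of" coordinates while the other elements are optional.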
The location of the photo shoot is then connected to the picture resource. Geotags can be visualized above all via mash-ups, e.g. in connection with maps (see footnote 127), and thus facilitate new browsing options (Jaffe et al., 2006; Naaman, 2006; Ahern et al., 2007; Kennedy et al., 2007; Torniai, Battle, & Cayzer, 2007): “Geotagging is a mechanism by which users can define exact latitude and longitude of their images, allowing for tie-ins with mapping applications like GoogleMaps” (Winget, 2006). The (automatically provided) geotags can also be used to find photographic resources that have not been otherwise indexed, and they facilitate the recommendation of other tags for indexing via the geographical similarity of resources (Naaman, 2006). The combination of geotags and camera phones, which would support the photo’s upload to an internet platform and also facilitate indexing with (geo)tags during the uploading process, is being experimented with (e.g. Carboni, Sanna, & Zanarini, 2006; Naaman, 2006; Ames & Naaman, 2007). Similarly to geotags, time tags are also common on Flickr. They display the date the photo was taken, for example; often, they even do so automatically during uploading, via Exif data. Apart from these automatically inserted, or system-provided tags, there are often time statements added manually by the user. These time tags include information on the year or specific holidays. Analyses of time tags arrive at different results regarding their usage and application. Beaudoin (2007) investigates tagging behavior and tag properties on Flickr and arrives at the following conclusion: The limited use of time by Flickr users is an interesting discovery since it is a frequently used organizational principle associated with the management of images. This situation is possibly due to the fact that images field commonly receive a date stamp when they are created. Therefore, some Flickr users would see this information as redundant (Beaudoin, 2007, 28).
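System-provided time tags of the kind described above can be derived from the capture date stored in a photo's Exif data, where the date and time are written as a string of the form "YYYY:MM:DD HH:MM:SS". The sketch below shows one possible derivation and is not Flickr's actual procedure; the function name `time_tags_from_exif` and the choice of three granularities (year, month, day) are assumptions, and the date string is passed in directly instead of being read from an image file.

```python
from datetime import datetime

# Layout of Exif date-time strings such as DateTimeOriginal.
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def time_tags_from_exif(date_time_original):
    """Derive time tags of increasing precision from an Exif date string.

    Returns the year, the year and month, and the full date as strings.
    Raises ValueError if the string does not follow the Exif layout.
    """
    taken = datetime.strptime(date_time_original, EXIF_FORMAT)
    return [
        f"{taken:%Y}",        # e.g. a year tag
        f"{taken:%Y-%m}",     # year and month
        f"{taken:%Y-%m-%d}",  # full date
    ]


# Hypothetical capture date for illustration.
example_tags = time_tags_from_exif("2007:06:11 14:30:00")
```

A manually added tag such as the year then duplicates the first generated tag, which corresponds to the redundancy that Beaudoin (2007) suspects some users perceive.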
Lin et al. (2006) also analyze the tagging behavior of Flickr users and discover that even though the date the photos are taken is automatically registered by the system, many users still add a time tag, e.g. the year, to the resource. An explanation for the discrepancy between the two analyses can only be speculative at this point. Possibly it is easier and more practical for users to index a time tag than to search for a date in a folder structure. A new form of keywords appears with tags such as ‘todo,’ ‘to_read,’ ‘tobuy,’ ‘deleteme’ or ‘gtd.’ These tags describe a future action of the tagging user regarding the resource, e.g. to read the article later, print the image at some point or buy the book as a present. Thus the tags embody a performative act (Austin, 1972; Peters & Stock, 2007). Kroski (2005) labels these keywords as ‘functional tags,’ Dennis (2006) as ‘signalling tags’ and Kipp (2006a) as ‘time and task related tags,’ since they simultaneously describe an action and a point in time: They express a response from the user rather than a statement of the aboutness of the document; they are intrinsically time-sensitive; they suggest an active engagement with the text, in which the user is linking the perceived subject matter with a specific task or a specific set of interest (Kipp & Campbell, 2006).
127
Mash-ups of maps and photos can be seen on http://www.flickr.com/map, http://www.tagmaps.research.yahoo.com, http://www.jpgearth.com and http://www.de.zoomr.com/world, for example.
Knowledge Representation in Web 2.0: Folksonomies
In her study on unusual tags in del.icio.us, CiteULike and Connotea, Kipp (2006a) also observes that these performative tags make up 16% of the folksonomy within the collaborative information service and thus have great significance for the users. Equally often, information resources are assigned tags such as ‘me’ or ‘mystuff.’ Here the user brings himself, or ‘the tagging I,’ into the indexing process for the first time, and creates an explicit connection with the resources (Mathes, 2004): “The self is now contained various categories such as me, mine, and my_stuff” (Kroski, 2005). Further tags not concerning aboutness include keywords on the resource’s document type, such as ‘blog,’ ‘article,’ ‘book’ (Pluzhenskaia, 2006), or community tags, e.g. ‘German Photo Group’ on Flickr, which are used to mark group affiliations. In Flickr, this sort of community linking is called ‘tagography128’ (Dye, 2006, 41). Dennis (2006) holds, concerning these non-aboutness tags: “It is safe to say that tags are an important communication mechanism in addition to a categorization technique” (Dennis, 2006). In the discussion on tag characteristics, it becomes conspicuous that there are tag specifics apart from the creation of categories or the separation into aboutness and non-aboutness tags. Several forms of compounding, e.g. via underscore or dash, develop from the rudimentary tag entering options, which often do not allow multiple-word tags. The indexing frequency of tags connected in this way shows that there is a great user demand for multiple-word tags: “Their high occurrence illustrates that flickr users prefer to use composite expressions rather than single terms to describe their images” (Lin et al., 2006). Since there is such a great variety of compounds and no consistent, binding standard, folksonomies become confusing, and indexing and searching become harder.
Studies show that compound tags are often added only once and thus contribute little to Collective Intelligence: “A particular instance of a compound tag is likely to be unique; that is, the majority of compound terms are applied only once” (Tonkin, 2006). It is also striking that for tag allocations, no matter in what context, there is heavy use of so-called Basic-Level tags. The Basic-Level theory states that terms can be cognitively structured in a hierarchical system with three different levels of specificity: 1) the superordinate level, 2) the basic level and 3) the subordinate level (Rosch, 1975). The basic level often contains the one term in that group which is the most demonstrative, but none too specific (Iyer, 2007); the superordinate level is too abstract, i.e. it has less distinctive characteristics than the basic level, and the subordinate level is too specific, showing more distinctive characteristics than the basic level. In other words: the basic level is “the level at which objects within a category will share the maximum number of attribute and at which the objects in the category will be maximally different from objects in other categories” (Jörgensen, 2007). Rosch (1975) illustrates this relation with the following example: furniture (superordinate level) – chair (basic level) – kitchen chair (subordinate level).
The term ‘furniture’ is too general, since it specifies too few distinctive features to adequately separate its hyponyms (e.g. cupboard, table, chair) from each other; the term ‘kitchen chair’ is too specific, since it includes too many distinctive traits and thus excludes too many terms from an allocation. The term ‘chair’ is specific enough to distinguish itself from the other terms ‘table’ or ‘cupboard’ (e.g. through the purpose of sitting), but general enough to also include the term ‘kitchen chair.’ The basic level is also said to be the simplest and most familiar level in the cognitive allocation of terms (categorization) and in the cognitive specification of objects (Mack et al., 2008; Bowers & Jones, 2007). Furthermore, Basic-Level terms occur much more frequently in natural language than terms from the other levels. Goodrum and Spink (2001) also analyze the searching behavior during the retrieval of visual information and observe that often, users initially search for Basic-Level terms, but then restrict the search results using so-called ‘refiners’ (Goodrum & Spink, 2001, 306). Thus the more specific, subordinate level is called on during searches: Terms such as young, fat, blonde and beautiful may serve to refine a general term such as girl into a more specific visual request. Research examining image queries […] have consistently demonstrated the occurrence of such terms in image queries and image descriptions in non-web based image systems (Goodrum & Spink, 2001, 306).
128
See also: http://www.flickr.com/groups/central/discuss/2730.
Rorissa (2008) investigates use behavior regarding image indexing, but not in the context of collaborative information services. He finds out that the basic level is generally used for the description of single objects and the superordinate level for the description of object groups: “When describing images, people made references to objects (mainly named at the basic level) in images much more than other types of attributes” (Rorissa, 2008). The tag aspects mentioned so far refer to their character and the allocation of certain categories, or meanings, in particular. However, their features can be ascribed to four fundamental aspects of tagging systems: the user interface, the specific application, the usership and the user readiness to implement the system (Tonkin et al., 2008): We are left with several preoccupations: the context in which a tag is written, and the context in which it is read; the community interactions which underlie both processes; untangling the confusions that result from the many and varied uses of the terms ‘context’ and ‘community.’ A fourth expression can be added to this list, to form a quartet of interest: caution. Without investigation and analysis, seeing named social entities in a dataset may simply reflect our own preconceptions (Tonkin et al., 2008).
Heckner, Neubauer and Wolff (2008), as well as Heckner, Mühlbacher and Wolff (2007) present a tag categorization model, which is meant to provide a systematized overview of the tag categories (see Figure 3.25). Here they identify 1) the so-called ‘functional tags,’ i.e. tags, which refer to both the aboutness (‘subject related tags’) and the non-aboutness (‘non-subject related, personal tags’) of the resource, 2) the tags’ linguistic aspects (word form, orthography etc.) and 3) matches of tags and text components (title, abstract, full-text etc.) of the resource, i.e. the redundancy in tags. Through these distinctions, the authors achieve a good coverage of the different tag categories as illustrated above and frequently occurring in folksonomies.
Figure 3.25: Categorization Model for Tags. Source: Modified from Heckner, Mühlbacher, & Wolff (2007, Fig. 1, Fig. 2, Fig. 4, Fig. 6).
The note concerning ‘tag avoidance,’ i.e. abstaining from indexing with tags or using symbols other than tags, is very important in this model. In a comparison of the social bookmarking services Connotea and del.icio.us, the photosharing platform Flickr and the videosharing platform YouTube, Heckner, Neubauer and Wolff (2008) are able to discern several avoidance tags, e.g. ‘???,’ ‘-,’ ‘…’ or ‘::.’ They also observe that tag avoidance is nonexistent on Connotea, occurs to the rate of 49.32% on del.icio.us, to 98.69% on Flickr and to 30.77% on YouTube. The authors assume that the high number of unindexed photos on Flickr is due to the fact that photos can be shared and shown to other users when they are pointed to the right album via URL, retrieval is not critical, since the items in the album can easily be browsed and photos are instantly self-descriptive when viewed by a user (Heckner, Neubauer, & Wolff, 2008, 8).
On the other hand, videos on YouTube are tagged so extensively, because “users who tag videos do not want to organize their personal collection, but rather want the video to be retrieved and viewed by as many people as possible” (Heckner, Neubauer, & Wolff, 2008, 8). The other options for tag characterization have received short shrift so far. Apart from any superordinate categories, tags can also be assigned information, e.g. relevance ratings. In this way, users of Amazon can rate the allocated tags and thus influence their importance. Furthermore, tags can also be weighted for one’s personal profile, i.e. the user can determine which tag is of high relevance for him and how the single tags, or their ‘tag score,’ should influence the product recommendations.
Tag Recommender Systems
Some collaborative information services, e.g. del.icio.us or BibSonomy, experiment with tag recommender systems which during tagging suggest further tags for the user’s consideration (MacLaurin, 2005; Xu et al., 2006; Garg & Weber, 2008; Jäschke et al., 2007; Sigurbjörnsson & van Zwol, 2008; Wang & Davison, 2008; Subramanya & Liu, 2008). Flickr also cooperates with an application (‘ZoneTag’) that supports the user while uploading and tagging photos from his cellphone (Ames & Naaman, 2007; Naaman, 2006). Sigurbjörnsson and van Zwol (2008) observe that in Flickr, 64% of all photos in the database are suitable for tag recommendations, since these have only been indexed with two or three tags, leaving a tag recommender system to provide good service in improving indexing depth. There are different varieties of tag recommendations, each with different points of emphasis: “Recommending tags can serve various purposes, such as: increasing the chances of getting a resource annotated, reminding a user what a resource is about and consolidating the vocabulary across the users” (Jäschke et al., 2007, 507). Marlow et al. (2006a; 2006b) initially distinguish between three sorts of tagging systems regarding tag visibility: 1) systems with ‘blind tagging’ provide no assistance to users during tagging and any tags already added to a resource stay invisible to users while they are still allocating, 2) users of ‘viewable tagging’ systems can consult already added tags during indexing, and 3) ‘suggestive tagging’ systems actively propose tags to the users while they are still tagging. Tag recommendations are called ‘tagging support’ by Marlow et al. (2006). There are several variations of tag recommendations (Morrison, 2007). On the one hand, the tagging system can alert the user to any spelling mistakes and recommend orthographical variants for indexing.
This sort of tag recommender system is unproblematic, because it merely helps the user spell tags correctly without influencing him in the actual allocation. On the other hand, a tag recommender system can re-suggest the user’s already indexed tags for use on further resources. This has the advantage that the user can keep track of already indexed tags in his information management, and that he will also index consistently in the future without losing himself in an ocean of tag variants (Muller, 2007a; Sinclair & Cardew-Hall, 2008). This sort of recommendation is used by del.icio.us, for example (see Figure 3.26). Thus the personomy assumes the character of a controlled vocabulary, as also discussed by Neal (2007): “Could our own sets of tags be considered individualized controlled vocabularies?” (Neal, 2007, 9). Another option for tag recommendations is the suggestion of already used tags for the purpose of ‘viewable tagging:’ once from the perspective of the information platform’s folksonomy, and once from the folksonomy of the respective information resource (Sigurbjörnsson & van Zwol, 2008). The most high-frequency tags from the folksonomy or the resource are recommended very often (Jäschke et al., 2007).
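The two frequency-based variants just described, re-suggesting the user’s own tags and suggesting the most frequent tags of the resource, can be sketched in a few lines. This is only a minimal illustration under my own assumptions: the function name `most_frequent_tags`, the cut-off `k` and the toy data are hypothetical and are not taken from del.icio.us or any of the cited systems.

```python
from collections import Counter

def most_frequent_tags(tag_assignments, k=3):
    """Return the k most frequently assigned tags, most frequent first."""
    return [tag for tag, _count in Counter(tag_assignments).most_common(k)]

# Hypothetical toy data: every tag assignment made for one resource ...
resource_assignments = ["paris", "eiffel", "paris", "night", "paris", "eiffel"]
# ... and every tag the current user has assigned so far (the personomy).
personomy_assignments = ["travel", "france", "travel", "travel", "france", "food"]

# Suggestions from the resource's folksonomy and from the user's personomy.
resource_suggestions = most_frequent_tags(resource_assignments, k=2)
personal_suggestions = most_frequent_tags(personomy_assignments, k=2)
```

The same function can be applied to the platform-wide folksonomy by passing all tag assignments of the database instead of those of a single resource.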
Figure 3.26: Input Mask with Different Tag Recommendations on del.icio.us. Source: Rader & Wash (2008, Fig. 1).
The problem with this sort of recommender system is the threat of the self-fulfilling prophecy, the positive feedback loop (Guy & Tonkin, 2006) or the Matthew Effect. The Matthew Effect129 (Merton, 1968) has been discussed in the context of citation analysis and describes the phenomenon of the positive feedback loop: known authors and their popular articles are quoted more frequently than other authors or articles and thus get even more famous. The Matthew Effect is also expressed colloquially as ‘success breeds success.’ If the most frequently used tags of the folksonomy or of the information resource are recommended and re-recommended, these tags will very probably be re-indexed and re-re-recommended. This procedure will then spiral on and on and be able to influence, or prevent, the development of new tags: “It is of course possible that the selection of tags can be influenced by the user interface which can provide hints about how others have tagged the resource” (Veres, 2006, 60). This eradicates the advantages of folksonomies, like indexing through Collective Intelligence and the use of the user language. After all, this sort of recommender system artificially generates the implicit user agreement on certain tags for describing the resource, where it is normally created by actual user behavior. Hence we can no longer speak of a reflection of authentic user behavior. Weinberger (2005) talks of a ‘tyranny of the majority,’ since the sole recommendation of high-frequency tags suppresses niche tags, even though they might provide better or more appropriate content descriptions (Muller, 2007a): “Fine-grained metadata value comes from The Long Tail” (Al-Khalifa & Davis, 2007c). Apart from that, exaggerated usage of high-frequency tags dilutes their own expressiveness, or diminishes their discrimination strength: In fact, the over-use of popular tags has been shown to decrease their information value (Muller, 2007a, 347). Terms that might be considered too general to provide good distinguishability from other articles in the field (Kipp, 2006b). On the other hand there are the popular tags which do little to contribute to semantically differentiating content, but which many users are likely to come up with. Unfortunately, such tags are likely to be exposed in tag clouds (Paolillo & Penumarthy, 2007).
129
Following Matthew 25.29 (NT), ‘The Parable of the Talents,’ and, there not concerning money, Mt 13.12: “For unto every one that hath shall be given, and he shall have abundance: but from him that hath not shall be taken away even that which he hath.”
The degree of discrimination strength has particular repercussions on the retrieval of information resources (Chi & Mytkowicz, 2006). If many resources are indexed with the same tag, the user will be confronted with an equally large hit list when searching for that tag. Rarely indexed tags may have a higher degree of discrimination, but yield few search results due to their infrequent usage. This reciprocal relation is regulated by calculating TF*IDF in information retrieval (Xu et al., 2006; see chapter four). Figure 3.27 summarizes the possible degrees of discrimination in folksonomies.
                                TAGS IN THE DATABASE
TAGS OF THE RESOURCE            frequent                            seldom
frequent (‘Power Tags’)         low discrimination strength         high discrimination strength
seldom (‘Long Tail’)            very low discrimination strength    low discrimination strength
Figure 3.27: Tag Discrimination Strengths.
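The interplay of a tag’s frequency on the resource and in the database can be made concrete with a small TF*IDF sketch. It rests on assumptions of my own: term frequency is taken as the number of times a tag was assigned to the resource, inverse document frequency as the natural logarithm of the number of resources divided by the number of resources carrying the tag. The function name `tag_tfidf` and the toy database are hypothetical, and the variant discussed in chapter four may differ.

```python
import math
from collections import Counter

def tag_tfidf(resource_tags, all_resources):
    """Weight each tag of one resource by TF*IDF.

    resource_tags: tag assignments of the resource (one entry per assignment);
                   the resource is assumed to be part of all_resources.
    all_resources: list of such assignment lists, one per resource.
    """
    n_resources = len(all_resources)
    # Document frequency: number of resources on which a tag occurs at all.
    df = Counter(tag for tags in all_resources for tag in set(tags))
    # Term frequency: how often the tag was assigned to this resource.
    tf = Counter(resource_tags)
    return {tag: count * math.log(n_resources / df[tag])
            for tag, count in tf.items()}

# Hypothetical toy database of four resources.
database = [
    ["paris", "paris", "paris", "eiffel", "eiffel", "eiffel"],
    ["paris", "louvre"],
    ["paris", "seine"],
    ["paris", "food"],
]
weights = tag_tfidf(database[0], database)
```

In this toy example ‘paris’ is frequent on the resource but occurs on every resource of the database, so its weight drops to zero, whereas ‘eiffel’ is frequent on the resource and rare in the database and receives the highest weight.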
Sigurbjörnsson and van Zwol (2008) suggest a modified TF*IDF calculation in order to avoid the Matthew Effect for the recommendation of popular tags. They multiply the stability value (stability(u)) of a tag u with the description value (descriptive (c)) of a candidate tag c to be recommended:
$$\mathrm{stability}(u) := \frac{k_s}{k_s + \mathrm{abs}\left(k_s - \log(|u|)\right)}$$

where k_s is a parameter defined by system training and |u| the indexing frequency of the tag u. The function abs(x) serves to determine the absolute value of x. This is the goal of the calculation: “Considered that user-defined tags with very low collection frequency are less reliable than tags with higher collection frequency, we want to promote those tags for which the statistics are more stable” (Sigurbjörnsson & van Zwol, 2008, 331).
In

$$\mathrm{descriptive}(c) := \frac{k_d}{k_d + \mathrm{abs}\left(k_d - \log(|c|)\right)}$$

k_d is the training-defined parameter and |c| is the indexing frequency of the candidate tag c in the entire database. The goal here: “Tags with very high frequency are likely to be too general for individual photos. We want to promote the descriptiveness by damping the contribution of candidate tags with a very high frequency” (Sigurbjörnsson & van Zwol, 2008, 331). This calculation is used at the same time to recommend the most pertinent tags to the user, since Sigurbjörnsson and van Zwol (2008) hold that neither the high-frequency tags, which are too general, nor the Long-Tail tags, which are too specific or contain spelling mistakes etc., are suitable for recommendations. Golder and Huberman (2006) view the sort of recommender system based on high-frequency tags as a curtailment of the tagger’s power of judgment and independence, since in accepting these recommendations, he copies the behavior of other users – “[the user is] imitating the choices of previous users” (Golder & Huberman, 2006, 206) – and does not express his own opinion. Nevertheless, this ‘social proof’ effect can encourage and support tagging in general: One reason some users did not tag is because they could not think of any tags. […] To this end, designers may wish to take advantage of our finding that preexisting tags affect future tagging behaviour. […] Our results suggest that users would tend to follow the pre-seeded tag distribution (Sen et al., 2006, 190).
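The two damping functions can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the natural logarithm, the function names and above all the default values for k_s and k_d are assumptions of mine (placeholders, not trained parameters), and the product `promotion` only follows the description given above, omitting any further factors the original method may use.

```python
import math

def stability(user_tag_freq, k_s):
    """Damp user-defined tags whose frequency |u| is far from exp(k_s)."""
    return k_s / (k_s + abs(k_s - math.log(user_tag_freq)))

def descriptive(candidate_freq, k_d):
    """Damp candidate tags whose frequency |c| is far from exp(k_d)."""
    return k_d / (k_d + abs(k_d - math.log(candidate_freq)))

def promotion(user_tag_freq, candidate_freq, k_s=9.0, k_d=11.0):
    """Product of both values; k_s and k_d are hypothetical placeholders."""
    return stability(user_tag_freq, k_s) * descriptive(candidate_freq, k_d)

# Frequencies must be at least 1 so that the logarithm is defined.
rare_user_tag = stability(1, k_s=9.0)                 # very low collection frequency
very_general_candidate = descriptive(10**9, k_d=11.0)  # very high frequency
moderate_candidate = descriptive(10**4, k_d=11.0)
```

Both functions reach their maximum of 1 where the logarithm of the frequency equals the parameter and fall off on either side, which is how the very rare and the very general tags are both pushed down.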
Further tag recommendations during indexing can be generated from the cooccurrence of already indexed tags per information resource (Sigurbjörnsson & van Zwol, 2008). Xu et al. (2006) want to simplify the unloved chore of indexing in this way: We propose collaborative tagging techniques that suggests tags for an object based on what other users use to tag the object. This not only addresses the vocabulary divergence problem, but also relieves users the obnoxious task of having to come up with a good set of tags (Xu et al., 2006).
This method can also be used to suggest synonyms, quasi-synonyms or associated words, for example. At its basis is the “assumption that the co-occurrence of words [...] is a measure of the strength of the relationship between the co-occurring words” (Kipp & Campbell, 2006). Two tags that are frequently co-indexed for a resource probably have something in common and are thus recommended as alternate tags: A first step towards more structure within such systems is to discover knowledge that is already implicitly present by the way different users assign tags to the resources. This knowledge may be used for recommending both a hierarchy on the already existing tags, and additional tags, ultimately leading towards emergent semantics […] by converging use of the same vocabulary (Schmitz et al., 2006a, 261).
In (Extended) Narrow Folksonomies, this can be problematic, though, since it can only rarely be assumed that synonymous relations are connected syntagmatically (a user who has indexed his image with ‘wedding’ will hardly make the effort of also adding the tags ‘marriage’ and ‘nuptials’). Still, in (Extended) Narrow
Folksonomies tags can be added to other users’ resources, so synonymous terms may indeed appear – only not in as great a variety as in Broad Folksonomies. Knowledge organization systems (e.g. WordNet) can provide some relief here, since the recommendations could then be derived from the synonyms stored in them. In Broad and database-specific folksonomies, there is no such restriction, since they unite the variety of tagging users in their system and are thus able to display synonyms and quasi-synonyms; one only has to think of all the different compound forms of one and the same term. Kipp and Campbell (2006) also observe this in a cluster analysis of the del.icio.us folksonomy: Again, as for the spelling variations and emulated multi word groupings, this is intuitively logical as users of one synonymic term are less likely to include additional synonymic terms in their personal tags. It is only in concert that such synonyms are added to the full list of tags for a URL (Kipp & Campbell, 2006).
Wetzker, Zimmermann and Bauckhage (2008) object, however, that the automatic extraction of synonyms can be corrupted by the occurrence of spam in tagging systems, and that the relations between tags should not be adopted without being checked first. Schmitz et al. (2007) provide a short description of the application of co-occurrence analyses: The simplest way to study tag co-occurrence at the global level is to define a network of tags, where two tags t1 and t2 are linked if there exists a post where they have been associated by a user with the same resource (Schmitz et al., 2007).
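The network described in this quotation can be sketched as a simple pair count. The representation of a post as a (user, resource, tags) triple, the function name `cooccurrence_network` and the toy posts are my own assumptions; counting the posts per tag pair, instead of merely recording that a link exists, is one possible way to attach a weight to each link.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(posts):
    """Count, for every unordered tag pair, the posts in which both occur.

    posts: iterable of (user, resource, tags) triples.
    Returns a Counter mapping (tag1, tag2), sorted alphabetically, to a count.
    """
    links = Counter()
    for _user, _resource, tags in posts:
        # Each pair is counted once per post, regardless of duplicate tags.
        for t1, t2 in combinations(sorted(set(tags)), 2):
            links[(t1, t2)] += 1
    return links

# Hypothetical toy posts.
posts = [
    ("alice", "r1", ["paris", "eiffel"]),
    ("bob", "r1", ["paris", "eiffel", "night"]),
    ("carol", "r2", ["paris"]),
]
links = cooccurrence_network(posts)
```

Two tags are linked as soon as their count is at least one; a post with a single tag, like the third one here, contributes no link.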
Such a recommender system can work on two levels: suggestions on a syntagmatic level might consist of singular or plural variants, i.e. they might supplement the tag ‘picture’ with the tag ‘pictures’ (MacLaurin, 2005), and of relational notes, which show the user related, sub- or superordinate terms (Pind, 2005), i.e. the user indexes the tag ‘picture’ and the system suggests ‘chart,’ because the user has already used that tag before. Tag co-occurrence here reflects an association relation which has been implicitly generated by the users through the indexing of multiple tags. This is why Weiss (2005), on the subject of Flickr, speaks of a folksonomy generated from Collective Intelligence: “Flickr infers its knowledge from the tags entered by every other user in the system – creating a so-called ‘folksonomy,’ a group intelligence derived by association” (Weiss, 2005, 22). Schmitz et al. (2006a; see also Schmitz et al., 2006b) derive the candidate tags for a recommender system via knowledge discovery techniques, in particular the so-called ‘Association Rules.’ In order to apply these rules, the tripartite relation of folksonomies must be transformed into a bipartite one, i.e. all pairs (tag/user, user/resource, resource/tag) are considered individually (see also Jäschke et al., 2008 and Hotho et al., 2006c). So two possibilities for tag recommendations present themselves: 1) user A tags a resource with X and Y and user B tags the same resource with C and D; user A is then recommended the tags C and D; 2) if a resource has been frequently indexed with the tag pair A and B, it can be assumed that B is a hyponym of A and is then recommended automatically (Schmitz et al., 2006a, 267).
The last point is somewhat problematic for tag recommendations. Since the tag hierarchy is generated solely from statistical evaluations without first being compared with a knowledge organization system, the hyperonym/hyponym relation is purely speculative. Frequently co-occurring tags might also be two components of a multiple-word term, which cannot be co-indexed due to the system’s restrictions. Sigurbjörnsson and van Zwol (2008) test the effects of different co-occurrence calculations in Flickr and find that the similarity algorithm according to Jaccard (Jaccard, 1901) principally leads to tag recommendations that are place-bound or synonymous to the original tag (e.g. ‘Seine’ and ‘Tour Eiffel’ for the initial tag ‘Eiffel Tower;’ Sigurbjörnsson & van Zwol, 2008, 330), whereas the simple co-occurrence calculation, relativized to the initial tag’s indexing frequency, suggests a greater variety of similar tags (e.g. ‘Paris’ and ‘France’ for ‘Eiffel Tower’). In the same study, the authors provide clues as to which suggested tags are mainly indexed by the users. Apparently, the most accepted tags are those which conform to the WordNet categories ‘locations,’ ‘artifacts’ and ‘objects.’ Users only rarely adopt tags from the categories ‘people’ and ‘groups’. Furthermore, tag recommendations can also be extracted and suggested from other indexed metadata, such as title or full-text descriptions, or the full-text of the resource itself, either automatically or manually (Calefato, Gendarmi, & Lanubile, 2007; Byde, Wan, & Cayzer, 2007; MacLaurin, 2005). According to Xu et al. (2006), this procedure also enhances the tag quality and frequency of less tagged resources: “This not only solves the cold start problem, but also increases the tag quality of those objects that are less popular” (Xu et al., 2006).
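The difference between the two co-occurrence calculations compared by Sigurbjörnsson and van Zwol (2008) can be illustrated with a small sketch. The set-based formulation, the function names and the toy photo IDs are assumptions of mine: each tag is represented by the set of photos it was assigned to, the Jaccard coefficient divides the intersection by the union, and the relativized variant divides the intersection by the initial tag’s frequency only.

```python
def jaccard(photos_a, photos_b):
    """Symmetric measure: shared photos divided by all photos of either tag."""
    union = photos_a | photos_b
    return len(photos_a & photos_b) / len(union) if union else 0.0

def relative_cooccurrence(photos_initial, photos_candidate):
    """Asymmetric measure: shared photos divided by the initial tag's photos."""
    if not photos_initial:
        return 0.0
    return len(photos_initial & photos_candidate) / len(photos_initial)

# Hypothetical photo IDs per tag.
eiffel_tower = {1, 2, 3, 4}
paris = set(range(1, 21))   # a very frequent tag that covers all four photos
tour_eiffel = {3, 4, 21}    # a rare near-synonym

jaccard_scores = {
    "paris": jaccard(eiffel_tower, paris),
    "tour_eiffel": jaccard(eiffel_tower, tour_eiffel),
}
relative_scores = {
    "paris": relative_cooccurrence(eiffel_tower, paris),
    "tour_eiffel": relative_cooccurrence(eiffel_tower, tour_eiffel),
}
```

With these toy sets the Jaccard coefficient ranks the rare near-synonym above the frequent tag ‘paris,’ because the large union penalizes frequent tags, whereas the relativized count ranks ‘paris’ first. This is consistent with the tendency reported above, although the numbers themselves are invented.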
The strength of the co-occurrence between two tags can also be taken into consideration: “A link weight can be introduced by defining the weight of the link between t1 and t2, t2 ≠ t1, as the number of posts where they appear together” (Schmitz et al., 2007). Ohkura, Kiyota and Nakagawa (2006) try to add tags to resources automatically. Here they extract keywords from various blog entries and assign them the most appropriate tags using the Vector Space model. The users can then search for blog entries via these tags. The result of their experiments is this: “our method attaches too many tags. [...] Although the system can determine if a particular concept is contained or not, it cannot determine whether that concept is a peripheral one or one of the main subjects” (Ohkura, Kiyota, & Nakagawa, 2006). Ames and Naaman (2007) investigate the influence of tag recommendations on the tagging behavior of users in the context of the ‘ZoneTag’ application, which facilitates the uploading and indexing of photos from a cellphone. They find out that tag recommendations influence users in so far as to index tags for their photos in the first place, and to pay attention to their quality: “we showed that it is possible to motivate users to annotate content. [...] In some cases, the suggestions can inspire users to tag their photos and give them guidance for how best to annotate“ (Ames & Naaman, 2007, 980). But they also warn against too much euphoria. Tag recommender systems should be very well worked out, so as not to confuse users, and only be offered as an additional tool: However, suggestions should be used with caution. First, users may be confused or alarmed by inexplicable tags. Second, users may just choose these tags even if they are not immediately relevant to the content instead of manually entering more accurate tags (Ames & Naaman, 2007, 979).
The generating of tags can be based not only on the further metadata or full-text of the resources, but also on other resources (e.g. blog posts; Subramanya & Liu, 2008). Fokker, Pouwelse and Buntine (2006), as well as Fokker, Buntine and Pouwelse (2006), discuss the extraction of tags from Wikipedia URLs. But also resources that the user has stored on his computer can come into consideration. Chirita et al. (2007) introduce their system ‘P-Tag,’ which operates in this way and also facilitates personalized annotations. The system proceeds as follows: it extracts terms from the websites to be indexed and from the documents stored on the user’s desktop, in order to personalize the tags. The term extraction can happen in different ways (Chirita et al., 2007, 847ff.): a) ‘document oriented extraction,’ similar documents (web resource and desktop documents) are determined by the Vector Space model, relevant terms are extracted from the most similar documents via TF*IDF, b) ‘keyword oriented extraction,’ relevant terms from the website are compared to relevant terms from the desktop resources (both are determined via TF*IDF) via co-occurrence statistics or thesauri, c) ‘hybrid extraction’ combines both the above approaches. The extracted terms are recommended as tags for indexing, where the terms that occur both on the website and the desktop receive a higher weighting. P-Tag’s advantages are summarized as follows by Chirita et al. (2007): Our technique overcomes the burden of manual tagging and it does not require any manual definition of interest profiles. It generates personalized annotation tags for Web pages by building upon the implicit background knowledge residing on each user’s personal desktop (Chirita et al., 2007, 853).
Apart from the suggestion of co-occurring tags or other metadata, there is also the option of exploiting knowledge organization systems (Pammer, Ley, & Lindstaedt, 2008; Heß, Maaß, & Dierick, 2008) for the tag recommender system. Here part-of relations (meronymy, Weller & Peters, 2007) or the is-a relation (or abstraction relation), as well as many other specific relations can be used as tagging aids. Thus the image resource ‘chair’ on Flickr can be complemented with the tags ‘chair leg,’ ‘seat’ and ‘arm rest’ using the paradigmatic part-of relation, or with the tags ‘seating furniture’ and ‘furniture’ via the paradigmatic abstraction relation. For the case of the abstraction relation, Muller (2007a) talks of ‘tag inheritance’ and suggests an automatized adoption of tags: “automatic tag inheritance has been added to the Activities system130, such that any child object automatically receives (at the time of creation) any tags that had been associated with its parent object” (Muller, 2007a, 343). Both procedures are particularly sensible in order to increase the resources’ indexing depth, since the indexed tags are often redundant and extracted from the titles or descriptions of the resources (Heymann, Koutrika, & Garcia-Molina, 2008; Jeong, 2008; Heckner, Mühlbacher, & Wolff, 2007). As opposed to the recommendations of high-frequency tags, syntagmatic and relational suggestions are unproblematic. They merely alert the user to possibly more appropriate tags and to spelling or compounding variants. By adopting such tag recommendations, the user can markedly increase the visibility of his tagged resources. Xu et al. (2006) and Muller (2007a; 2007b) even recommend the auto-complete, or ‘Type-Ahead131’ function for entering the tags. This functionality also supports the user in knowledge representation and provides for consistent indexing: “This not only improves the usability of the system but also enables the convergence of tags” (Xu et al., 2006). Xu et al. (2006) also suggest the recommendation of a mixture of different tags by the system. They, too, have found out in their analysis that tags can have very different characteristics and may describe either the aboutness of the resource or wholly subjective properties. Hence a mixture of these tags is supposed to widen access paths to the resources and their recommendation should provide broader indexing options. The authors regard a mixture of ‘good’ tags as the most sensible option. Good tags are determined by the following characteristics: 1) High coverage of multiple facets: tags from all categories should be recommended in order to create a great range of terms. 2) High popularity: high-frequency tags appear more pertinent, since they reflect users’ Collective Intelligence. 3) Least-effort: nevertheless, the number of tags per resource should be kept small, in order to allow users to browse more quickly. 4) Uniformity (normalization): spelling or semantic variants of the tags should be put together out of the user’s sight, in order to avoid hit list ballast and increase Recall. 5) Exclusion of certain types of tags: highly personalized tags should be removed from the public folksonomy, since they often make no sense to other users. They can still occur in the individual user’s personomy, though. Xu et al. (2006) restrict their catalog of good-tag criteria to tag recommender systems, but it is evident that these criteria can also be applied to information retrieval (e.g. normalization of search tags and indexing tags for a better comparison of query and resource) and the hit lists’ relevance ranking (i.e. tags with criteria 1 and 2 have a higher retrieval status value).
130
The Activities System is a collaboration tool exclusive to IBM.
So the tripartite network of tag – resource – user can be exploited extensively for the purpose of tag recommendations. The relations of tag/user and tag/resource play a particular role here. Marlow et al. (2006a; 2006b) and Muller (2007a) summarize the possibilities for tag recommendations once more: The suggested tags may be based on existing tags by the same user, tags assigned to the same resource by other users. Suggested tags can also be generated from other sources of related tags such as automatically gathered contextual metadata, or machine-suggested tag synonyms (Marlow et al., 2006a; 2006b). Social recommendation provides another potential strategy for calculating related tags. This approach begins with one or more user-provided tags, and suggests other tags that were used by either (a) other people who used the same tags; (b) other people whose overall tagging patterns were similar to that of the current user; or (c) other people who are closely related to the user in a social network analysis (Muller, 2007a, 348).
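A small sketch may make the social variant (a) described by Muller (2007a) more tangible: starting from the tags a user has entered, it suggests tags used by other people who used the same tags. The assignment data, the stop list of personal tags and the function names are assumptions made purely for illustration; the lower-casing merely stands in for the normalization and the exclusion of personalized tags demanded by Xu et al. (2006) and is not their implementation.

```python
from collections import Counter

# Hypothetical (user, resource, tag) assignments, invented for illustration.
ASSIGNMENTS = [
    ("anna", "r1", "Folksonomy"),
    ("anna", "r1", "tagging"),
    ("ben", "r2", "tagging"),
    ("ben", "r2", "web2.0"),
    ("cara", "r2", "folksonomy"),
    ("cara", "r3", "web2.0"),
    ("cara", "r3", "toread"),
    ("dave", "r4", "folksonomy"),
    ("dave", "r4", "Tagging"),
]

# Assumed stop list of highly personalized tags (criterion 5).
PERSONAL_TAGS = {"toread", "todo"}


def normalize(tag):
    """Very rough normalization (criterion 4): trim and lower-case."""
    return tag.strip().lower()


def recommend(seed_tags, assignments, current_user=None, limit=3):
    """Suggest tags used by other people who also used the seed tags."""
    seeds = {normalize(tag) for tag in seed_tags}

    # Collect the normalized tag vocabulary of every user.
    vocabulary = {}
    for user, _resource, tag in assignments:
        vocabulary.setdefault(user, set()).add(normalize(tag))

    counts = Counter()
    for user, tags in vocabulary.items():
        # Skip the asking user and everybody who shares no seed tag.
        if user == current_user or not (tags & seeds):
            continue
        counts.update(tags - seeds - PERSONAL_TAGS)

    # Popular tags first (criterion 2); alphabetical order breaks ties.
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    return ranked[:limit]


print(recommend(["folksonomy"], ASSIGNMENTS))
```

Variants (b) and (c) would replace the simple "shares a seed tag" test by a similarity measure between users or by a distance in a social network.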
131
‘Auto-Complete’ or ‘Type-Ahead’ is what the functionality within the information service is called that displays already saved tags while the user is entering tags with the same word stem.
Advantages and Disadvantages of Folksonomies in Knowledge Representation
Folksonomies are a new method of indexing the content of information resources. Hence, they are a new tool that can be applied to knowledge representation. In the following, I will summarize and discuss the advantages and disadvantages of this indexing method (Spiteri, 2006a; Spiteri, 2006b; Gordon-Murnane, 2006; Guy & Tonkin, 2006; Macgregor & McCulloch, 2006; Mathes, 2004; Quintarelli, 2005; Merholz, 2005; Hayman, 2007; Hayman & Lothian, 2007; Tennis, 2006; Herget & Hierl, 2007). This illustration also aims to answer the following fundamental question: “Is it really possible that the problem of organizing and classifying can be solved as simply as allowing random users to contribute tags?” (Morrison, 2007, 12). Since the users of collaborative information services index the information resources themselves – in the sense of ‘user-generated indexing’ (Hassan-Montero & Herrero-Solana, 2006) – the folksonomy authentically reflects the users’ language, needs and knowledge (Quintarelli, 2005). Thus it chooses a democratic route towards the creation of a knowledge organization system, as opposed to controlled vocabularies:
The democratic approach of a folksonomy avoids many of the ethical and political concerns of top-down, centrally-imposed systems. It allows the users of the system to establish their own sense of balance within the system, to use their own vernacular for indexing and retrieval, and prevents exclusion by creating new categories as needed (Sturtz, 2004).
Broad and Extended Narrow Folksonomies not only reflect the language of the author, but also the language of the users: “folksonomies reflect the terms and concepts of the users of the information, rather than the creators. This reflects user needs and how they view the information” (Gordon-Murnane, 2006, 30). This form of content indexing leads to a variety of multiple (and sometimes mutually exclusive) interpretations of, as well as ‘multicultural views’ on, one and the same resource (Peterson, 2006; Kohlhase & Reichel, 2008). These ‘shared intersubjectivities’ allow the users “to benefit, not just from their own discoveries, but from those of others” (Campbell, 2006, 10). Users profit from the linguistic and semantic variation of the terms, both during indexing and during research later – after all, it cannot always be assumed that the user of this information, or of this tag, is also its creator, or privy to any expert knowledge (Schmidt, 2007b). The tags’ lack of clarity becomes their advantage, as they leave wiggle room, just like human language. This is exploited, among others, by Fuzzy Logic (Zadeh, 1994; Zadeh, 1996): “fuzzy logic – to mimic the ability of the human mind to effectively employ modes of reasoning that are approximate rather than exact” (Zadeh, 1994, 77). The tags’ diversity also generates a Long Tail of niche tags: tags that are not necessarily indexed very frequently but are still used by different users for knowledge representation. It is precisely these Long-Tail tags that reflect the democratic and open character of folksonomies. The many different users’ associations, opinions and ratings, as expressed by tags, increase the tagged resource’s information content (Schmidt, 2007b) and thus create added value for the individual user: “After all tags are, presumably, not plucked from thin air but instantiated because they have some cognitive import: they
capture some meaning in the web site” (Veres, 2006, 60). There is added value on a linguistic level, too. The mixture of author vocabulary, and thus perhaps even expert vocabulary, and user vocabulary is one of the great advantages of folksonomies, because it is in this way that a broad linguistic range as well as broad areas of knowledge are covered. This makes it possible to surmount the so-called ‘vocabulary problem’ (Furnas et al., 1987) and to bridge the ‘semantic gap’ between different user types (Jörgensen, 2007). Kellog Smith (2006) observes, in her investigation of tagging in museum catalogs, that “there is a ‘semantic gap’ between specialists’ artwork descriptions in standard museum records and what nonspecialists are familiar with that shuts out many information and image seekers” (Kellog Smith, 2006). Semantic subtleties in the indexed terms can also be better taken into consideration by folksonomies than by controlled vocabularies:
In a folksonomy the scheme is multi-faceted. [...] The Library of Congress subject heading for movies is ‘Motion Pictures’. By reducing terms such as movies, film, and cinema to one all-encompassing category, the distinctive meanings of each term gets lost in the translation (Kroski, 2005).
Apart from the semantic variation in tags, syntagmatic or orthographical variations can also be better registered in folksonomy terms. Plural and singular forms, different compounds or upper and lower-case spelling are indexed as entered by the user and then added to the folksonomy. There are tags which are sighted for the very first time in folksonomies, so-called neologisms. Mathes (2004) discusses the examples ‘sometaithurts’ and ‘flicktion,’ which quickly gained popularity in the Flickr community and could not have developed the same way in knowledge organization systems: “Although small, there is a quick formation of new terms to describe what is going on, and others adopting that term and the activity it describes” (Mathes, 2004). This unexpected and unpredictable use of language and tags reflects the user’s demand to be able to adequately describe contents or phenomena on the one hand, and on the other hand it exemplifies the dynamic character of language, or of indexing. Mathes (2004) summarizes: “[there is] communication and ad-hoc group formation facilitated through metadata“. A great range of possible terms provides for an equally great basis for the language of users of indexing systems, as well as for the language of their creators. Access paths to the information resources are widened (Kipp, 2007) and access is offered to different user types: “Tagging has the potential of increasing access to artwork images and records dramatically for searchers of all levels of expertise” (Kellog Smith, 2006). The linguistic and semantic variety of folksonomies can also be exploited for traditional knowledge representation. Thus the development and maintenance of preexisting controlled vocabularies can profit from folksonomies (Aurnhammer, Hanappe, & Steels, 2006a; Christiaens, 2006; Gendarmi & Lanubile, 2006; Macgregor & McCulloch, 2006; Mika, 2005; Wu, Zubair, Maly, 2006; Zhang, Wu, & Yu, 2006). 
The tags, their frequency and their distributions can serve as a source for new controlled terms, for term modifications and term deletions. The relations from the tripartite network of tag – resource – user can even already be used to create knowledge organization systems (van Damme, Hepp, & Siorpaes, 2007, 60; Zhang, Wu, & Yu, 2008). This ‘bottom-up categorization’ (Vander Wal, 2005b) guarantees quick reaction times regarding new subjects and innovations in the knowledge domain and is thus an important instrument for the construction and
maintenance of nomenclatures, classification systems and thesauri. Folksonomies can also be used to deal with the heterogeneity problem (Krause, 2003; 2006; Geisselmann, 2000). Heterogeneity refers on the one hand to the differences between the resources to be indexed (Krause, 1996) and on the other hand to the heterogeneity of the indexing methods, which make it harder for users to retrieve relevant resources as they place their respective emphases in different places and are not interchangeable: Today’s information world is characterized by strong decentralization, distributed data collections and heterogenous data structures and indexing procedures. […] The aim is to enable an integrated search from subject standpoint in distributed data collections (Geisselmann, 2000).
Krause (2006) accordingly concludes that it is not users who must adjust to the heterogeneous structure of the indexing methods, but the indexing itself that must become more flexible:
The user will want to access this information, regardless of the process by which the content is developed or according to which system it is catalogued. Whether considered right or wrong, the paradigm of forcing homogeneity by overarching standardization efforts is no longer sufficient (Krause, 2006, 100).
Folksonomies are strictly user-oriented and thus eminently capable of meeting this demand. Halpin and Shepard (2006) caution, however, that “each tagging system is stranded from interaction with greater Web, and the data itself is usually held hostage behind firewalls” (Halpin & Shepard, 2006). But to what extent can folksonomies contribute in practice to the development of controlled vocabularies, from ontologies to the semantic web (Specia & Motta, 2007; Halpin & Shepard, 2006)? The models developed so far are mainly co-occurrence-based (Schmitz, 2006), use simple cluster algorithms (Grahl, Hotho & Stumme, 2007) or the Vector Space model (dimensions: resources, vectors: tags, tag similarity: cosine) (Heymann & Garcia-Molina, 2006). The similarity values generated can serve as the foundation of a similarity graph, in which the tag position can provide information on the hierarchical localization in a ‘latent hierarchical taxonomy’ (Heymann & Garcia-Molina, 2006, 4).
Since the users of collaborative information services become indexers themselves, and attach their thoughts, associations and descriptions to the information resource in their own language, via tags, the tags then directly reflect the users’ wishes regarding the descriptions. Besides, the tags follow, in their conceptuality, the time-specific mindscape, so that the typical translation problems [the structures of the Dewey Decimal classification are hardly compatible with current knowledge, A/N] between professional indexers and recipients […] do not apply (Hänger, 2008, 66)*.
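The Vector Space variant named above (dimensions: resources, vectors: tags, tag similarity: cosine) can be illustrated with a minimal sketch. Each tag is represented as a vector over the resources it was assigned to, and two tags count as similar if they point in a similar direction. The toy assignments are invented, and using raw assignment counts as vector weights is an assumption of this sketch, not necessarily the weighting of Heymann and Garcia-Molina (2006).

```python
import math
from collections import defaultdict

# Hypothetical (user, resource, tag) assignments, invented for illustration.
ASSIGNMENTS = [
    ("u1", "r1", "film"),
    ("u2", "r1", "film"),
    ("u3", "r2", "film"),
    ("u1", "r1", "movie"),
    ("u4", "r2", "movie"),
    ("u5", "r3", "recipe"),
]


def tag_vectors(assignments):
    """Map each tag to a sparse vector {resource: number of assignments}."""
    vectors = defaultdict(lambda: defaultdict(int))
    for _user, resource, tag in assignments:
        vectors[tag][resource] += 1
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b[res] for res, weight in a.items() if res in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = tag_vectors(ASSIGNMENTS)
print(round(cosine(vectors["film"], vectors["movie"]), 3))
print(cosine(vectors["film"], vectors["recipe"]))
```

Pairs whose cosine exceeds a chosen threshold could then be connected by an edge, which yields the kind of similarity graph the text refers to.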
And so folksonomies provide immediate access to the users, but they, too, open their minds to the environment and express, via tags, what they really need, want or wish to discuss. Mathes (2004) and Kroski (2005) talk of ‘desire lines’ in this context: through tags, folksonomies follow their users’ wishes. Furthermore, the tags give folksonomies a ‘small world’ characteristic (Hotho, 2008; Schmitz et al. 2007; Schmitz, 2007; Watts & Strogatz, 1998; Shen & Wu, 2005). This is how Schmitz et al. (2007) define ‘small worlds:’
Recently, the term ‘small world’ has been defined more precisely as a network having a small characteristic path length comparable to that of a (regular or Erdös) random graph, while at the same time exhibiting a large degree of clustering (which a random graph does not). These networks show some interesting properties: while nodes are typically located in densely-knit clusters, there are still long-range connections to other parts of the network, so that information can spread quickly (Schmitz et al., 2007).
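The two quantities named in this definition can be computed in a few lines. The following sketch uses the classic measures for an ordinary undirected graph in the sense of Watts and Strogatz (1998); Schmitz et al. (2007) adapt them to the tripartite folksonomy structure, which is not reproduced here. The example graph (two tight triangles joined by a single ‘long-range’ edge) and all names are invented for illustration.

```python
from collections import deque

# Hypothetical undirected graph: two triangles joined by the edge C-D.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E", "F"},
    "E": {"D", "F"},
    "F": {"D", "E"},
}


def distances_from(graph, source):
    """Breadth-first search: shortest path length to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def characteristic_path_length(graph):
    """Mean shortest path length over all connected ordered node pairs."""
    total = pairs = 0
    for source in graph:
        for target, d in distances_from(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(graph):
    """Mean share of a node's neighbor pairs that are themselves linked."""
    coefficients = []
    for neighbors in graph.values():
        k = len(neighbors)
        if k < 2:
            coefficients.append(0.0)
            continue
        links = sum(1 for a in neighbors for b in neighbors
                    if a < b and b in graph[a])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


print(characteristic_path_length(GRAPH))
print(round(clustering_coefficient(GRAPH), 3))
```

A network would be called a ‘small world’ if its path length stays close to that of a random graph of the same size while its clustering coefficient is markedly higher.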
The tight network of users, resources and tags, with the accumulation of shortcuts within user, resource or tag clusters that is so typical of ‘small worlds,’ is what makes browsing possible, among other things. A few clicks on different tags lead the user from one subject to the next, or from resource to resource:
As folksonomies are triadic structures of (tag, user, resource) assignments, the user interface of such a folksonomy system will typically allow the user to jump from a given tag to (a) any resource associated with that tag, or (b) any user who uses that tag, and vice versa for users and resources (Schmitz et al., 2007).
These short path lengths between the user, tag and resource clusters are particularly advantageous in information retrieval (Mislove et al., 2008); in knowledge representation, the tripartite network can be exploited mainly for the generation of further indexing tags, e.g. via tag recommender systems. The tripartite graph can also be applied when it comes to the user perspective and the discovery of communities (Jäschke et al., 2006; Schmitz et al., 2006b) within collaborative information services, as Schmitz et al. (2007) emphasize: This can be used, for example, to make those communities explicit which already exist intrinsically in a folksonomy, e.g. to provide user recommendations and support new users in browsing and exploring the system (Schmitz et al., 2007).
Users are linked via commonly used tags and commonly tagged resources, respectively, which allows for the calculation of thematic group affiliation (Van Damme, Hepp, & Siorpaes, 2007). This knowledge about members of the community, or of their thematic areas of interest, can be exploited in different ways: “Collaborative tagging systems have the potential of becoming a technological infrastructure for harvesting social knowledge” (Wu, Zubair, & Maly, 2006, 114). On the basis of commonly used tags or commonly indexed resources, it becomes possible to use folksonomies for the purposes of implicit collaborative recommender systems (Diederich & Iofciu, 2006; Niwa, Doi, & Honiden, 2006; see also chapter four). Thus the system can suggest further resources potentially of interest to the user – via matching or similar tags – or other users to communicate with, via commonly tagged resources. The first variant is known from search engine hit lists and is also termed ‘More like this!,’ i.e. the user is asking for similar resources. The second variant can be called ‘More like me’ in analogy; the system shows the user other users similar to him, or users using similar tags and tagging similar resources. The main assumption: “Users may have similar interests if their annotations share many semantically related tags” (Wu, Zhang, & Yu, 2006). The advantages of the tripartite network in folksonomies – the short path length between tags, users and resources and its applicability for different recommender systems – lead Panke and Gaiser (2008a; see also Blank et al., 2008) to call folksonomies ‘a better search engine.’
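The ‘More like me’ variant can be sketched as follows. Users are compared via the overlap of their tags and of their tagged resources; the Jaccard coefficient and the equal weighting of both overlaps are assumptions of this sketch and are not taken from the cited studies. The assignment data and function names are likewise invented.

```python
# Hypothetical (user, resource, tag) assignments, invented for illustration.
ASSIGNMENTS = [
    ("anna", "r1", "folksonomy"),
    ("anna", "r2", "tagging"),
    ("ben", "r1", "folksonomy"),
    ("ben", "r3", "tagging"),
    ("cara", "r2", "cooking"),
    ("dave", "r4", "travel"),
]


def profiles(assignments):
    """Collect the tag set and the resource set of every user."""
    tags, resources = {}, {}
    for user, resource, tag in assignments:
        tags.setdefault(user, set()).add(tag)
        resources.setdefault(user, set()).add(resource)
    return tags, resources


def jaccard(a, b):
    """Share of common elements among all elements of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def more_like_me(user, assignments, limit=3):
    """Rank other users by shared tags and shared resources."""
    tags, resources = profiles(assignments)
    if user not in tags:
        return []
    scored = []
    for other in tags:
        if other == user:
            continue
        # Assumed weighting: tag overlap and resource overlap count equally.
        score = (0.5 * jaccard(tags[user], tags[other])
                 + 0.5 * jaccard(resources[user], resources[other]))
        if score > 0:
            scored.append((other, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


for name, score in more_like_me("anna", ASSIGNMENTS):
    print(name, round(score, 3))
```

A ‘More like this!’ function would work analogously, comparing resources by the tags assigned to them instead of comparing users.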
Collaborative information services and folksonomies are dependent on user participation, on user-generated content. As opposed to traditional print or broadcast media, they mainly forgo quality control by gatekeepers, as must be faced by journalists or scientists if they want their works to be published. But the question must be asked: are there such gatekeepers on the internet? Clay Shirky (2005a) states, on this subject: “The web has an editor, it’s everybody.” So there is quality control of the content – but only after its publication, and by the users themselves. The users can make their quality judgment explicitly or implicitly, but it is always done via tags. An implicit rating is submitted via tag quantity: the more users tag a resource, the more relevance the resource seems to have for them. It is questionable, though, whether pure quantity can be an indicator of quality. Does something become ‘proven’ quality, only because a majority thinks it is?¹³² Explicit ratings, on the other hand, can be counted and used as quality control without any problems. Tags such as ‘stupid’ or ‘cool’ carry explicit semantic information and thus provide clues as to the resource’s content. Naturally, any count of these explicit tags is not abuse-proof, either; here, too, quantity is no safe indicator of quality. Controlled vocabularies can often admit only very few perspectives in knowledge representation and indexing, since they provide only a limited number of terms to start with. They respond to the multitude of users by forcing them to adopt the vocabulary’s own (documentation) language. Folksonomies, on the other hand, thrive on the variety and the quantity of users. Hence in an economic sense, folksonomies would bear the character of typical network goods (Linde, 2008, 42ff.): the more users tag, the better it is for the system; if the number of tagging users breaches ‘critical mass,’ the system will lift off and establish itself as a standard.
One of folksonomies’ strengths thus lies in their ‘scalability,’ i.e. folksonomies belong to those “systems which actually become better as they are more heavily used” (Campbell, 2006). This strength leads to a further advantage, which has both a clear practical and economic character. Folksonomies divide the burden of their creation among many shoulders – “[they are] distributing the workload for metadata creation” (Marlow et al., 2006a; 2006b) – as opposed to controlled vocabularies, which are developed by a select few experts. In comparison, folksonomies are low-cost systems, since maintenance and indexing are achieved through the voluntary and free collaboration of many users (Shirky, 2005b). The time period between indexing and publishing the resource is kept very small to boot, since, in stark contrast to the indexing processes in libraries, for example, the tags enter the classification system at the same time as the resource itself (Hänger, 2008). “Tagging has dramatically lower costs because there is no complicated, hierarchically organized nomenclature to learn. Users simply create and apply tags on the fly” (Wu, Zubair, & Maly, 2006, 111). Furthermore, users show a high degree of enthusiasm for tagging, which stimulates and strengthens the tagging process within the community (Fichter, 2006). The high acceptance regarding tags can be put down to six factors (the first four from Derntl et al., 2008):
1. Inclusion: tags are easily maintained and have few access restrictions,
2. Customizability: users can design tags according to their own tastes,
3. Usability: collaborative information services simplify tag usage via tag recommendation or Auto-Complete systems,
4. User Involvement: all users can add tags and thus participate in the system design,
5. Membership and Exchange in a Community: “Tagging [...] provides immediate self and social feedback” (Sinha, 2005),
6. “Tagging is fun!”: users simply enjoy tagging (Neal, 2007, 7; also in Fichter, 2006).
132 Here again the example of the math problem: if many students provide the same, false solution, it does not gain in quality, but stays false.
For practical reasons, it seems impossible to index all websites, blog entries, photos and videos on the World Wide Web intellectually with controlled vocabularies. Algorithmic search engines like Google or Yahoo! can automatically index textual resources. A purely automatic, content-based indexing would be difficult for resources such as photos or videos, though, because they contain no metadata. Here folksonomies are the only solution for providing access to these resources via user-generated metadata, or tags. Goodrum and Spink (2001) emphasize this and put great faith in the advantage of content-based indexing, which enhances the visual information by an important component:
Most existing information retrieval (IR) systems are text-based, but the problem remains that images frequently have little or no accompanying textual information. Historically, the solution has been to develop text-based ontologies and classification schemes for image description. Text-based indexing has much strength, including the ability to represent both specific and general instantiations of an object at varying levels of complexity. Textual metadata can provide information not available from the image, e.g., names of people shown in the picture, or geographical location of the shot (Goodrum & Spink, 2001, 296).
Another advantage of folksonomies (even if there is no robust empirical evidence for it yet) is that they acquaint the user with the difficulties, but also with the general usefulness, of indexing and its methods, and thus sensitize him to them:
There are positive lessons to be learned from the interactivity and social aspects exemplified by collaborative tagging systems. Even if their utility for high precision information retrieval is minimal, they succeed in engaging users with information and online communities, and prove useful within PIM [personal information management, A/N.] contexts (Macgregor & McCulloch, 2006, 298).
It hasn’t always been easy to get users to take on such responsibility. But as more people understand what tags are, how they work and why they’re important, the number of participants in folksonomies has grown (Terdiman, 2005).
The central strengths of folksonomies are not only on display in knowledge representation, but also, and particularly, in information retrieval. Since that subject will be addressed in the following chapter, I will only mention it briefly, and for completeness’ sake at this point. Retrieval with folksonomies can occur in two ways: a) the user enters possible tags into the search field, or b) he uses the tags to work his way to the desired resource. That last point, called ‘serendipity,’ is the central feature and strength of research via folksonomies for Adam Mathes (2004): “The long tail paradigm is also about discovery of information, not just about finding it” (Quintarelli, 2005). Searching via tags is much faster and easier for the
layman than research using elaborate tools for information retrieval, such as the International Patent Classification (IPC). Some users even forego the entering of search arguments altogether and click their way to the desired information using the tag clouds (Sinclair & Cardew-Hall, 2008). Furthermore, folksonomies can complement controlled vocabularies in the practice of research. Hence, folksonomies and ontologies should not be regarded as rivals. Tom Gruber (2006, 994) comments that “it is time to embrace a unified view”. The fundamental idea here is to represent tags in their semantic environment, from which the user can then glean additional search arguments. Since folksonomies never explicitly specify paradigmatic environment terms, these must be taken from other knowledge organization systems. WordNet (Miller, 1998) was already successfully implemented for data from del.icio.us (Laniado, Eynard & Colombetti, 2007a; 2007b) and Flickr (Kolbitsch, 2007).
The following will briefly summarize the advantages of folksonomies. They
• authentically reflect the users’ language and thus solve the ‘vocabulary problem,’
• allow for different interpretations and thus bridge the ‘semantic gap,’
• broaden access to information resources,
• follow users’ ‘desire lines,’
• are an affordable form of indexing content,
• divide the burden of indexing among many shoulders,
• improve as more and more users participate – i.e., they are probably equipped with network effects,
• are the only possibility of indexing mass information on the internet,
• are term sources for the development and maintenance of ontologies and controlled vocabularies,
• relay quality control for the information resources to the users,
• allow specific searching and browsing,
• register neologisms,
• contribute towards the identification of communities and ‘small worlds,’
• provide a basis for recommender systems concerning tags, users and resources,
• sensitize users to the indexing of content.
The greatest strength of folksonomies, their linguistic and semantic variety, is also their greatest weakness. The lack of a controlled vocabulary leads to numerous problems (Guy & Tonkin, 2006; Choy & Lui, 2006; Macgregor & McCulloch, 2006; Bates, 2006; Furnas et al., 2006; Noruzi, 2006; Noruzi, 2007). These are particularly salient in later information retrieval, but also during indexing, if a user wants to make his resources available to as large a number of people as possible. “Lack of precision is a function of user behaviour, not the tags themselves” is Shirky’s (2005a) attempt to smooth over the issue. But Furnas et al. (1987) confirm that the language’s variability represents a great weakness for efficient indexing as well as information retrieval: “We studied spontaneous word choice for objects in five application-related domains, and found the variability to be surprisingly large. In every case two people favored the same term with probability <0.20”
[…]
tag>tag’ (Quintarelli, Resmini, & Rosati, 2007, 11). Brooks and Montanez (2006a; 2006b) recommend cluster formation – not automatically, though, but manually:
We argue that users should be able to cluster tags […] to specify relations (not just similarity) between tags, to use tags to associate documents with objects such as people. This is not necessarily inconsistent with the idea of folksonomy; there is no reason why hierarchical definitions can’t emerge from common usage (Brooks & Montanez, 2006a, 15).
In fact, the term “hierarchy” may be overly restrictive; what users really seem to need is a way to express relations between tags (Brooks & Montanez, 2006b).
An example of a popular collaborative information service that has already implemented this approach, and thus Tag Gardening, for its users, is del.icio.us (see above). Here the user can collect his own tags in ‘tag bundles’ and thus create a certain – hierarchical – structure (see Figure 3.34; Hammond et al., 2005). The search engine RawSugar also uses this kind of Tag Gardening: “With RawSugar taggers can specify tag hierarchies in their own accounts (saying that sushi is a subtag of food for example)” (Begelman, Keller, & Smadja, 2006). In the example in Figure 3.35, the tags ‘Düsseldorf’ and ‘Hildesheim’ were subordinated to the tag ‘Information Science.’
142
http://www.facetag.org.
Figure 3.35: Tag Bundles of the User ‘isa.bella83.’ Source: del.icio.us (http://delicious.com/isa.bella83).
Gruber (2005) already mentions the purpose of tagging tags, i.e. the meaning of metatags or ‘tag-on-tag’ (Gruber, 2005), and credits them with the ability to determine hyperonyms or synonyms for ontologies. However, since del.icio.us pays deliberate attention neither to known relations such as the synonym or abstraction relations, nor to a controlled vocabulary, this cannot, strictly speaking, be called Tag Gardening. Only the individual user’s personomy is structured for a better overview, but it keeps its ‘folksonomical’ and thus arbitrary character:
Certain tagging systems provide a method of employing limited hierarchy in classification. Del.icio.us ‘tag bundles’ are one such method. Tag Bundles, ‘tagging of tags’, permit a user to bundle a number of tags underneath an umbrella term [...]. This has applications for the process of disambiguation, as well as for neatness; nonetheless, tagging remains a flat namespace (Tonkin, 2006).
The community, however, cannot yet profit from this structure. BibSonomy, the scientifically oriented social bookmarking system, also allows users to manually create hierarchical relations between tags. These relations can then be used to structure the personomy’s tag cloud according to hierarchy levels and to incorporate hyponyms into the tag search (Hotho et al., 2006a). The book platform LibraryThing allows users to mark two terms as synonymous. The more popular tag in the platform is then marked as the preferred tag. It is also possible for each user to annul these equivalence relations, leaving the determination of synonyms entirely to the community. Thus the users of LibraryThing create “a bottom-up community-driven controlled vocabulary” (Smith, 2008a, 16). A weak form of Tag Gardening is presented by Calefato, Gendarmi and Lanubile (2007). They merely suggest recommending synonyms to the user during the tag allocation. These synonyms are extracted from the resource to be indexed, from dictionaries and ontologies as well as from the user’s personomy and the resource folksonomy. Since the authors factor the resource’s full-text, synonyms from dictionaries and ontologies into their tag recommendations, automatic suggestions are possible even for new and to-be-indexed resources. Brooks and Montanez (2006b) introduce a system for the automatic generation of hierarchical clusters based on folksonomies and full-texts (see also Hayes & Avesani, 2007 and Hayes, Avesani, & Veeramachaneni, 2006). First came an analysis of the tagging effectiveness in blog entries. Here the authors determined,
via similarity calculations, that the tags may be useful for allocating resources to general categories, but not for a reflection of the specific content of the resource. Terms extracted from the full-texts via TF*IDF even showed better similarity values than tags: Simply extracting the top three TFIDF words and using them as tags produces significantly better similarity scores than tagging does […] automated tagging produces more focused, topical clusters, whereas human-assigned tags produce broad categories (Brooks & Montanez, 2006b).
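What “extracting the top three TFIDF words and using them as tags” amounts to can be shown with a minimal sketch. It uses one common TF*IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and a naive whitespace tokenizer; both are assumptions of this sketch, and the exact weighting used by Brooks and Montanez (2006b) may differ. The three miniature ‘blog entries’ are invented.

```python
import math
from collections import Counter

# Hypothetical miniature blog entries, invented for illustration.
DOCUMENTS = [
    "tagging tagging folksonomy web",
    "sushi sushi rice web",
    "tagging thesaurus web",
]


def top_tfidf_terms(documents, k=3):
    """Return the k highest-weighted TF*IDF terms for every document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does a term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    result = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest weight first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:k])
    return result


for terms in top_tfidf_terms(DOCUMENTS, k=2):
    print(terms)
```

Terms that occur in every document, such as ‘web’ in the example, receive a weight of zero, which is why such automatically extracted ‘tags’ tend to be more specific than the broad categories users assign.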
Based on these observations, they create a tag hierarchy in six steps, almost exclusively from the information resources’ full-texts:
1) the 250 most popular tags are extracted from Technorati,
2) 20 blog entries are selected for each of those tags,
3) from these entries, tags are automatically extracted via TF*IDF,
4) each resulting cluster is compared with each other cluster and the pairwise similarity of the TF*IDF values is calculated via the cosine,
5) the two most similar clusters (with their blog entries) are merged into an ‘abstract tag cluster,’
6) that cluster receives an ‘abstract tag’ from a conjunction of all tags contained within it.
The tag hierarchy is then created from the ever more general clusters: “All grouping was done based on the similarity of the articles” (Brooks & Montanez, 2006b). The authors suggest using the system for tag recommendations, especially regarding synonyms and more specific, or more general tags. Van Damme, Coenen and Vandijck (2008) apply (semi-) automatic Tag Gardening to the tagging system of a European enterprise. They concentrate on 1) the discovery of similar tags and 2) the relations between the tags. The first step includes, among other things, tag stemming. They are confronted with various problems by the company-specific user vocabulary: “When stemming algorithms are used, there should be a way to determine the language of the tags and whether it involves corporate-specific language” (van Damme, Coenen, & Vandijck, 2008, 39). In order to determine the similarity of two tags, they apply the Levenshtein Metric (Levenshtein, 1966). Here the letters of a word that must be deleted, replaced or inserted in order to transform it into another word are counted. To spell-check the tags, the authors use Google, Wikipedia and other online dictionaries. Problems arise during the alignment with the company-specific language use.
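The Levenshtein Metric described above can be written down compactly. The first function counts the deletions, replacements and insertions needed to turn one tag into another; the second converts this count into a similarity value between 0 and 1 by dividing by the length of the longer tag. This normalization is an assumption of the sketch, since the text does not state how Van Damme, Coenen and Vandijck scale their values; the example tags are invented.

```python
def levenshtein(a, b):
    """Minimum number of single-letter edits that transform a into b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a letter of a
                current[j - 1] + 1,      # insert a letter of b
                previous[j - 1] + cost,  # replace (or keep) a letter
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("folksonomy", "folksonomies"))
print(similarity("folksonomy", "folksonomies"))
```

Under this normalization, ‘folksonomy’ and ‘folksonomies’ would count as variants of one another for any threshold up to 0.75, whereas unrelated tags of similar length score far lower.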
The second step consists of co-occurrence analyses, the application of cluster techniques and the calculation of the conditional probability of the co-occurrence of tag pairs (for the discovery of relations; Schmitz, 2006), of Social Network Analysis (for the discovery of group-specific vocabularies), of the alignment with the dictionaries mentioned above (for the discovery of relations) and of visualization techniques (for the clarification of tag relations). As the result of their study, Van Damme, Coenen and Vandijck record that a Levenshtein threshold value of 0.65 produces the most similar tag pairs and that a conditional probability of 0.7 upwards generates the most common relations between the tags.

Al-Khalifa and Davis (2007a) compare automatically extracted keywords, handpicked descriptors and tags regarding their semantic value. To do this, they check whether the indexing terms can be ascribed to one of the following thesaurus relations: 1) same (spelling variants, acronyms), 2) synonym, 3) broader term, 4) narrower term, 5) related term (of a descriptor), 6) related (unclear relation), 7) not related (Al-Khalifa & Davis, 2007a). The result of the analysis is that most folksonomy tags can be counted among the first six relations, while the majority of
Knowledge Representation in Web 2.0: Folksonomies
automatically extracted terms matches the seventh, i.e. is not relevant for the indexed resource. Tags can thus be exploited for the generation, or representation, of implicit relations. Tags of the seventh relation can be used particularly effectively for research into new paradigmatic relations (Peters & Weller, 2008; Weller & Peters, 2007). Al-Khalifa and Davis (2007a) report which tags were found to match this relation: “Folksonomy tags falling in the ‘not related’ category tend to be either time management tags e.g. ‘todo’, ‘toread’, ‘toblog’, etc., or expression tags e.g. ‘cool’, self-reference tags and sometimes unknown/uncommon abbreviations” (Al-Khalifa & Davis, 2007a).

In a further study, Al-Khalifa and Davis (2007d) check the suitability of folksonomies for the automatic generation of semantic metadata via the three aspects of quality, usefulness and representativeness. Here they start off with the following assumption: “Users have their own perspective when tagging a resource; they may add new contextual dimensions, for example to suggest its application or its relationship to neighboring domains” (Al-Khalifa & Davis, 2007d). Folksonomies should, then, be able to support pre-existing ontologies as a source of terms and relations. The generation of the metadata consists of two steps: 1) the extraction and processing of tags with the known methods of NLP, and 2) the semantic annotation, where each tag is aligned with a pre-existing ontology. Al-Khalifa and Davis (2007d) arrive at the following conclusions, which firmly corroborate the reasons for processing folksonomies for the purpose of Tag Gardening:
a) Folksonomies are eminently suitable as a term source for pre-existing ontologies. This is due largely to the “latent (implicit) semantics embedded in the tags” (Al-Khalifa & Davis, 2007d).
b) Folksonomies reflect users’ Collective Intelligence without necessarily having to achieve a consensus regarding any one descriptive term.
c) The Long-Tail tags contain the highest semantic value, thus complementing other metadata.
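Among the Tag Gardening techniques discussed above, the probability-based discovery of tag relations lends itself to a short sketch. It reads the co-occurrence probability of a tag pair as the conditional probability P(b | a), i.e. the share of resources tagged with a that also carry b. The function names, the sample data and the rule that a high P(b | a) together with a low P(a | b) suggests that b is the broader tag are assumptions for illustration, not the procedure of the studies cited.

```python
from collections import Counter
from itertools import combinations


def conditional_probabilities(resources):
    """Map (a, b) to P(b | a), estimated from the tag sets of the resources."""
    tag_count = Counter()
    pair_count = Counter()
    for tags in resources:
        unique = set(tags)
        tag_count.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_count[(a, b)] += 1
    probabilities = {}
    for (a, b), together in pair_count.items():
        probabilities[(a, b)] = together / tag_count[a]  # P(b | a)
        probabilities[(b, a)] = together / tag_count[b]  # P(a | b)
    return probabilities


def candidate_relations(resources, threshold=0.7):
    """Return (narrower, broader) candidates: a nearly always implies b,
    while b does not imply a (assumed heuristic)."""
    probabilities = conditional_probabilities(resources)
    return sorted((a, b)
                  for (a, b), p in probabilities.items()
                  if p >= threshold and probabilities[(b, a)] < threshold)


if __name__ == "__main__":
    # Hypothetical tagged resources.
    resources = [
        {"python", "programming"},
        {"python", "programming"},
        {"java", "programming"},
        {"programming"},
    ]
    print(candidate_relations(resources))
```

With so few resources the estimates are unstable; in practice a minimum tag frequency would probably be required before a pair is considered at all.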
Traditional Methods of Knowledge Representation vs Folksonomies

The approaches of Tag Gardening show that there is a great need for the structuring of folksonomies. The implementation of Tag Gardening also shows that folksonomies and other methods of knowledge representation should not be regarded as rivals, but rather as complementary tools (Voß, 2008). Gruber (2005) describes folksonomies and traditional methods of knowledge representation as ‘apples and oranges’ to express that from a taxonomical perspective, they have more similarities than differences. After all, both methods have the same goal: the indexing and representation of information resources.

Social tagging and subject cataloguing are examples of information organization frameworks. It can be argued that these frameworks are a type of indexing, because both of them are systems, methods, and work processes that analyze documents and create representations of significant characteristics for inclusion in information systems (Tennis, 2006).
They all contribute to the same overall goal, i.e. adding meta-data to the web and offering a means to share within a community information on the web (Spyns et al., 2006, 752).
Folksonomies complement the tools and methods of knowledge representation, as sketched in Figures 2.1 and 2.3. Folksonomies may be applied to large knowledge domains, they use very few paradigmatic relations – none at all, without Tag Gardening – but they factor in both the content creator’s language and that of the resource’s users. Hassan-Montero and Herrero-Solana (2006) thus see in folksonomies a solution to the problem of inter-indexer inconsistency (Markey, 1984), since tag distributions remain stable after a certain period: “It happens when different indexers use different indexing terms to describe the same document. This problem would be reduced when indexing is obtained by aggregation, such as occur in social tagging” (Hassan-Montero & Herrero-Solana, 2006). Furthermore, indexing with folksonomies is the method that most takes into consideration and reflects the agents’ personal associations with the resource: “Tagging seems intensely personal, whereas subject cataloguing is an act of delegation mediated by institutions” (Tennis, 2006). The disadvantage of controlled vocabularies and knowledge organization systems is that they only consider very few perspectives in knowledge representation, and only prescribe a limited number of terms. Furthermore, changes to user demands or to the terminology can often be registered only very slowly. Folksonomies do not have these problems; they can react flexibly and in a timely manner. Tennis (2006) comments:

Indexing, as the interpretation and representation of significant characteristics of documents for information systems is an act with many different manifestations. Social tagging is a manifestation of indexing based in the open – yet very personal – web (Tennis, 2006).
Again in contrast to controlled vocabularies, folksonomies do not define any strict borders, allowing for vagueness and redundancy in indexing:

Items often lie between categories or equally well in multiple categories. The lines one ultimately draws for oneself reflect one’s own experiences, daily practices, needs and concerns. [...] Collective tagging, then, has the potential to exacerbate the problems associated with the fuzziness of linguistic and cognitive boundaries. As all taggers’ contributions collectively produce a larger classification system, that system consists of idiosyncratically personal categories as well as those that are widely agreed upon (Golder & Huberman, 2006, 201).
That is also the reason why the efforts of the DMOZ and Yahoo! categorizations are regarded as failures: “What is different between folksonomies and the directories is that the former also includes information which is best expressed by adjectives, verbs and proper names whereas the latter do not” (Veres, 2006, 67). This indexing liberty with regard to language and semantics leads Halpin et al. (2007) and McFedries (2006) to go so far as to say that folksonomies are the single best method of knowledge representation: Tagging is able retrieve the data and share data more efficiently than classifying (Halpin et al., 2007, 211).
It tells us that hundreds millions of people can probably classify what they see and interact with on the Web more efficiently, more comprehensively, and more usefully than a small group of Yahoo managers (McFedries, 2006, 80).
Tennis (2006) also emphasizes the differences between folksonomies and controlled vocabularies rather than their similarities, and prophesies the latter’s impending doom: Social tagging is not built in the same context, with the same tools, by the same methods, or even for the same purposes as subject cataloguing. […] Likewise, the phenomenon of social tagging shows us that the modernist concept of indexing is no longer desirable because we see a very personal and constantly evolving set of systems support a framework that works with profiles, personal collections, and novel tagging combinations (Tennis, 2006).
While Veres (2006) sees advantages in the free indexing of content via tags, she issues caveats and thus brings folksonomies and controlled vocabularies a little closer to each other once more: “In fact, taggers are categorizing, but they are also supplementing the categories with descriptions. In other words they are creating an ontology of categories and a list of property relations” (Veres, 2006, 67). She thus wants to use folksonomies to enhance controlled vocabularies and tailor them to user demands: “In addition we showed the importance of non taxonomic categories for classification, which is an emerging field of study in the area of ontology engineering” (Veres, 2006, 68). The true difference between both methods of knowledge representation is localized in the bilateral and fast communication provided by folksonomies: “Instead, the novelty is the immediacy of the feedback from the community of users” (Veres, 2006, 59). Golder and Huberman (2006) take the side of traditional controlled vocabularies and recall a great advantage of the category system: “Unlike a keyword-based search, wherein the seeker cannot be sure that a query has returned all relevant items, a folder hierarchy assures the seeker that all the files it contains are in one stable place” (Golder & Huberman, 2006, 199). Spyns et al. (2006) sum up the discussion of the advantages and disadvantages of folksonomies and traditional methods of knowledge representation in a neutral way: In a nutshell, folksologies [folksonomies] have to differentiate between natural language words and a language-independent artificially created label indicating a concept or sense, and ontologies have to adopt strategies and tools as used for social bookmarking sites to make the meaning of concepts spontaneously emerge and converge (Spyns et al., 2006, 742).
The advantages of folksonomies are often used as arguments in favor of isolating the traditional methods of knowledge representation. This, however, disregards the greater added value that results from combining the two (Golub et al., 2008), as Ankolekar, Krötzsch and Vrandecic (2007) emphasize on the subject of the highly structured Semantic Web: “The Semantic Web can learn from Web 2.0’s focus on community and interactivity, while Web 2.0 can draw from the Semantic Web’s rich technical infrastructure for exchanging information across application boundaries” (Ankolekar, Krötzsch, & Vrandecic, 2007). Golder and Huberman (2006) also mention the synergy effect resulting from a combination of the two: “However, there is also opportunity to learn from one another through sharing and organizing information” (Golder & Huberman, 2006, 201).
The combination of traditional methods of knowledge representation and folksonomies can be effected in various ways and, in keeping with Tag Gardening, is labeled ‘fertilizing.’ On the one hand, folksonomies can be used as a term source for pre-existing controlled vocabularies (Al-Khalifa & Davis, 2006; Angeletou et al., 2007, 35; Merholz, 2004; Zheng, Wu, & Yu, 2008), in order to counteract the lack of any timely adjustments of language and semantics: “this inclusion of newer terms in the user tags can happen faster than it would in a traditional thesaurus” (Kipp, 2006b). Novel concepts or terms are then entered into the documentation language by experts. The users profit from the better searching and indexing options of the controlled vocabularies, but need not be afraid that the terms will quickly grow obsolete: “encourage authors and users to generate folksonomies, and use those terms as candidates for inclusion in richer, more current controlled vocabularies that can evolve to best support findability” (Rosenfeld, 2005). The Library of Congress, for example, encourages Flickr users to tag and comment on its photographs in order to broaden access to the pictures (Raymond, 2008):

The Library of Congress invites you to explore history visually by looking at interesting photos from our collections. Please add tags and comments, too! More words are needed to help more people find and use these pictures. By way of background, Library of Congress staff often make digital versions of our popular image collections available online as quickly as possible by relying primarily on the identifying information that came with the original photos. That text can be incomplete and is even inaccurate at times. We welcome your contribution of names, descriptions, locations, tags, and also your general reactions [143].
This way of combining folksonomies and KOS can also be reflected via a theoretical model that had not previously been applied to indexing with folksonomies: the shell model (Krause, 1996; 2006; 2007). Here Krause (2006) demands: “we need a way to directly integrate and easily re-use the thesaurus and classification based information resources of the libraries and the other traditional information providers” (Krause, 2006, 100). In the shell model, the resources of a database are indexed by professional indexers using KOS, and additionally tagged by users (Peters, 2006a; 2006b). Here it is important that the shell model respect the resource’s relevance during indexing. In other words: the selection of the method of knowledge representation is contingent on the resource’s relevance. The consequence of this is that not all resources in a database are indexed in the same way. The indexing tools differ in expressiveness just as the resources differ in their import for the database. Krause (1996) thus concludes that the indexer can adapt his indexing to the resource. High-value resources with a high relevance should be indexed with the most complex indexing method (e.g. thesaurus) in order to allow users to find the resource (again) as quickly as possible. Less relevant resources can also be represented less carefully (e.g. as full-text or via title indexing) in the database. The decision as to what resources are relevant and what method is to be used is made once by the indexer. In the environment of libraries or professional information providers, this means that the source (e.g. scientific journal) of the resources (e.g. single articles) is rated on the basis of its relevance – all articles entered subsequently are allocated to the same level of relevance and are indexed with the
[143] See http://www.flickr.com/photos/library_of_congress/collections.
fitting tool of knowledge representation. In tagging systems, the users, for example, could be ranked according to their relevance. The shell model is very flexible in spite of the allocation of resources to different relevance levels, since it can effortlessly adapt to the particularities of the underlying database. After all, both indexing method and resource can be freely selected and allocated. Krause (1996) uses the shell model to exemplify this approach (see Figure 3.36): The innermost shell contains the core of the relevant literature. This is indexed in as deep and high-value a manner as possible. Quality control is in the hands of the co-ordinating information service authority. […] The second shell loosens the relevance prescriptions and, in parallel, the requirements on indexing quality. […] It is determined by the particularities of a subject area how many shells are to be put in place and what features they define (Krause, 1996, 18f.)*.
It is obvious that the shell model can only be used in restricted user groups and resource stocks. The introduction of a centralized information point and the ranking of the sources’ relevance is almost unimaginable for tagging systems on the World Wide Web. The shell system can very well be used in companies or institutions, though, and render good service – particularly since there is often an indexer, knowledge manager or archivist already present, who knows his way around the data set and must professionally take care that information can be found quickly and reliably. An exemplary allocation of in-company resources and indexing methods is shown in Figure 3.36.
Figure 3.36: Shell Model for Indexing Heterogeneous Data Sets. Source: Modified from Krause (1996, 18).
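The allocation in Figure 3.36 can also be read as a simple lookup table. The following sketch is only an illustration: the class and function names are hypothetical, and the assignment of press reports and in-company blogs to the third and fourth shell, each with a single method, is an assumption, since the model only says that the two outer shells use less expressive methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shell:
    level: int              # 1 = core, higher numbers = outer shells
    resource_types: tuple   # resource types allocated to this shell
    methods: tuple          # indexing methods prescribed for this shell


# Allocation following the in-company example; the split of the two
# outer shells is an assumption made for this sketch.
SHELLS = (
    Shell(1, ("patent", "r&d report"), ("classification system", "thesaurus")),
    Shell(2, ("business report", "management memo"), ("thesaurus",)),
    Shell(3, ("press report",), ("title indexing",)),
    Shell(4, ("in-company blog",), ("full-text storage",)),
)


def indexing_methods(resource_type, user_tagging=True):
    """Return the shell level and indexing methods for a resource type.

    With user_tagging=True, user tags are added on top of the
    professional indexing, in every shell.
    """
    for shell in SHELLS:
        if resource_type in shell.resource_types:
            methods = list(shell.methods)
            if user_tagging:
                methods.append("user tags")
            return shell.level, methods
    raise KeyError(f"no shell defined for resource type {resource_type!r}")


if __name__ == "__main__":
    print(indexing_methods("patent"))
    print(indexing_methods("in-company blog", user_tagging=False))
```

Because the decision is taken once per source or resource type rather than per resource, a table of this kind is all the indexer has to maintain; how many shells it contains depends on the subject area.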
The most relevant resources of the company (e.g. patents or reports from R&D) are in the model’s core and thus the subject of high-value indexing (e.g. via classification system and thesaurus). This guarantees their retrievability. The second shell contains less relevant, but still important, resources (e.g. business reports and memos from
the management) and is thus indexed via a thesaurus. The two outer shells contain less relevant resources (e.g. press reports or in-company blogs) and are indexed using less expressive methods (e.g. full-text storage or title indexing). To incorporate the user into the indexing process, one could also let them tag the resources and thus counteract the lack of quality in the KOS and broaden access to the resources, e.g. via semantic query expansion.

On the other hand, controlled vocabularies can be used during tagging (e.g. via semantic tag recommender systems), i.e. support the user during indexing (Noruzi, 2007) and thus educate them in terms of ‘tag literacy’ (Guy & Tonkin, 2006; Mejias, 2005). Hayman (2007), as well as Hayman and Lothian (2007; see also Güntner, Sint, & Westenthaler, 2008), report on this fertilizing in an Australian portal for e-learning resources. During tagging, the user is confronted with other tags and descriptors of a thesaurus that are suggested by a Type-Ahead functionality. In addition, the user is shown hyponyms and hyperonyms of a term, which he can select. The user can freely choose what term to index with – he can keep his own tag or decide that a descriptor might be better suited. All tags and descriptors are displayed in a tag cloud, where they are treated as equals. However, this tag cloud is different from the tag cloud of other collaborative information services, such as Flickr or del.icio.us: “This collection of tags will be a folksonomy that has been directed by a taxonomy” (Hayman, 2007, 16). Hayman (2007), as well as Hayman and Lothian (2007), also discuss the display of further information regarding tags and descriptors. Again during indexing, they want to offer the users of their portal the Scope Notes [144] and other indexing guidelines for descriptors.
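The type-ahead support described above can be sketched in a few lines. Everything concrete here is hypothetical: the miniature thesaurus, the tag counts and the function name are invented for illustration and do not reproduce the portal’s actual implementation. The sketch treats user tags and thesaurus descriptors as equals and attaches broader and narrower terms to each descriptor.

```python
# Hypothetical miniature thesaurus: descriptor -> broader / narrower terms.
THESAURUS = {
    "mathematics": {"broader": [], "narrower": ["algebra", "geometry"]},
    "algebra": {"broader": ["mathematics"], "narrower": []},
    "geometry": {"broader": ["mathematics"], "narrower": []},
}

# Hypothetical folksonomy: tag -> number of times it has been assigned.
USER_TAGS = {"maths": 12, "algebra": 7, "mathe-homework": 3}


def suggest(prefix, limit=5):
    """Return tags and descriptors that start with the typed prefix.

    Each suggestion states whether the term is a thesaurus descriptor,
    how often it was used as a tag, and (for descriptors) its
    hyperonyms ('broader') and hyponyms ('narrower').
    """
    prefix = prefix.lower()
    suggestions = []
    for term in sorted(set(THESAURUS) | set(USER_TAGS)):
        if term.startswith(prefix):
            entry = THESAURUS.get(term)
            suggestions.append({
                "term": term,
                "is_descriptor": entry is not None,
                "tag_count": USER_TAGS.get(term, 0),
                "broader": list(entry["broader"]) if entry else [],
                "narrower": list(entry["narrower"]) if entry else [],
            })
    return suggestions[:limit]


if __name__ == "__main__":
    for suggestion in suggest("math"):
        print(suggestion)
```

The user remains free to ignore the suggestions and keep a tag of their own; the alphabetical ordering is only one option, and ranking by tag count would be another.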
On the other hand, the authors hope to observe the folksonomy’s characteristics, the results of which they would then share with the users:

Information about the tags
- it will indicate which thesaurus terms are useful to our users
- it will indicate new terms for existing concepts that should be considered for our thesauri (either as preferred or non-preferred terms)
- it will indicate new concepts and suggest terms for them
Information about the items tagged (resources)
- it will indicate which items are considered of value by our users simply because of the number of times they have been tagged, and, if ratings are included, which are valued by valued taggers
Information about the people doing the tagging
- it will indicate what tags and items a person has used, and each person will have profile about their learning journey (Hayman, 2007, 16).

This approach aims to facilitate the indexing process, or the selection of adequate indexing terms, by providing users with enhanced information on tags and descriptors (e.g. frequency distributions or tag ratings).

The third kind of combination is represented by the novel approaches of Tag Gardening. Here the pre-existing folksonomy is structured and disambiguated in retrospect. This is done by the users, who thus create a KOS bottom-up. The goal is
[144] “Scope Notes” are additional notes on a descriptor in a thesaurus that specify, among other things, the descriptor’s applicability or the exclusion of terms from the descriptor set.
to incorporate the context of the information resource into the indexing process (Jörgensen, 2007), i.e. to make the relations between tags explicit (van Damme, Coenen, & Vandijck, 2008). The question here is whether as weakly structured a system as a folksonomy can in reality be turned into a knowledge organization system, or whether it can be assumed that there is an implicit structure in folksonomies (Halpin & Shepard, 2006). Steels (2004) compares folksonomy structures with natural-language structures: One of the key questions to understand [is] how a communication system can arise [...] how distributed agents without a central authority and without specification can nevertheless arrive at a sufficiently shared language conventions to make communication possible (Steels, 2004, 131).
If weakly-structured KOS, such as folksonomies, are turned by users into a highly structured KOS, e.g. an ontology (Halpin & Shepard, 2006), this is called “emergent semantics” [145] (Zhang, Wu, & Yu, 2006; Fengel, Rebstock, & Nüttgens, 2008) or “Ontology Maturing” (Braun et al., 2008, 163). The combination of several different KOS (e.g. thesauri, ontologies and folksonomies) is also called ‘Ontology Mapping’ or ‘Ontology Merging’ (Fensel, 2004; Stuckenschmidt & van Harmelen, 2005; Staab & Studer, 2004). Not only pre-existing thesauri or classification systems can be used to enrich folksonomies, though; other sources, such as Wikipedia, Google or leo.org [146], can also be incorporated. Van Damme, Hepp and Siorpaes (2007, 69) suggest such an approach: “our approach aims at […] fully using a ‘mash-up’ of available lexical, semantic, and social data sources for producing and maintaining domain ontologies.” Numerous other studies also deal with the automatic creation of structured KOS from folksonomies at the moment (e.g. Schmitz et al., 2007; Hotho et al., 2006b; Grahl, Hotho, & Stumme, 2007; Wu, Zhang, & Yu, 2006; Heymann & Garcia-Molina, 2006; Plangprasopchok & Lerman, 2008; Candan, Di Caro, & Sapino, 2008; Li et al., 2007): “But in addition the way we characterize tag sets facilitates their translation into ontologies which suggests a new methodology for automated ontology extraction” (Veres, 2006, 68). This approach did not emerge with folksonomies, however, but was already suggested by Chen (1994), among others, for full-texts from brainstorming sessions, in order to solve the ‘vocabulary problem’ (Furnas et al., 1987): The concept space approach is an algorithmic solution for creating a vocabulary-rich dictionary/thesaurus. A concept space is generated by extracting concepts (terms) automatically from the texts produced during collaboration. Similar concepts are then linked through co-occurrence
[145] Wu, Zhang and Yu (2006) define ‘emergent semantics’ this way: “Desire lines are the foot-worn paths that sometimes appear in a landscape over time. The emergent semantics is like the desire lines. It emerges from the actual use of the tags and web resources and directly reflects the user’s vocabulary and can be used back immediately to serve the users that created them” (Wu, Zhang & Yu, 2006, 418). Spyns et al. (2006) present a similar definition of ‘emergent semantics:’ “The semantics emerge from the implicit but immediate feedback from the community in the form of usage frequencies of tags as well as the listing of tags per URL and URLs per tag” (Spyns et al., 2006, 741). In my opinion, though, neither definition goes far enough. The semantics of the folksonomies may be implicitly contained in the tags’ co-occurrence, but it only becomes meaningful for the users once it is made explicit by the users themselves or by an automatic system.
[146] http://www.leo.org is a German online dictionary featuring several different languages.
analysis. This approach guarantees that any terms brought out by group members will be captured and that terms with similar meanings will be linked (associated) by the system (Chen, 1994, 59).
This idea clearly reflects the demands of that time, especially the procedure of automatic thesaurus enhancement, which can be adopted for folksonomies without any changes and is indeed pursued in precisely this manner.

The methods of knowledge representation were already classified in chapter two (Figure 2.3). We can observe: “From a categorization perspective, folksonomy and taxonomy can be placed at the two opposite ends of the categorization spectrum” (Al-Khalifa & Davis, 2006). This can be explained by the following reflections: a documentation language’s expressiveness is measured via the number and specificity of the relations it uses, the coverage of the knowledge domain is measured via the size of the represented subject area. Any extension of the documentation languages from the right to the left-hand side of the continuum increases the semantics, i.e. creates ‘emergent semantics.’ By enriching folksonomies with relations and methods for semantic disambiguation, their arbitrariness is supposed to be counteracted, while at the same time their user-oriented language as well as their flexibility are kept.

The approaches ‘Folksonomies as Term Sources’ and ‘emergent semantics’ can be combined, particularly in order to compensate for the differences regarding coverage of the knowledge domain and expressiveness of indexing. Christiaens (2006) explains the possibilities for combination in Figure 3.37. Initially, Christiaens (2006) distinguishes between ‘free’ and ‘restricted’ methods of knowledge representation. The free form of indexing allows all agents (authors, professional indexers, users) to add descriptive keywords or tags. Christiaens counts folksonomies, both Broad and Narrow, as well as keywords, or metatags, as they are used for describing HTML documents, among this form.
Figure 3.37: Continual Feedback Loop between Controlled Vocabularies and Folksonomies. Source: Christiaens (2006, 201, Fig. 1).
In folksonomies, the problem of tag quality is particularly noticeable, but so is their strength of being able to index and thus cover a large number of resources. The restricted form of indexing ensures that a controlled vocabulary is established and that the content of the resource to be indexed is thus suitably described. This form comprises taxonomies, faceted classifications and ontologies, the so-called ‘heavy-weight mechanisms’ (Christiaens, 2006, 201), which take great effort to create, are of high quality and are heavily structured.
The continuous feedback loop guarantees the ongoing exchange between both types of knowledge representation. Here the objective is mainly to make explicit, combine and repurpose the specific characteristics, tags or relations etc. of both methods, so that they complement each other. This loop leads, according to Christiaens, to a new, improved indexing system possessing great expressiveness and also a large coverage:

In our view, these mechanisms should not be used in isolation, but rather as complementary solutions, in a continuous process wherein the strong points of one increase the semantic depth of the other. […] In our opinion, this will result in a system that receives the benefit from the free zone (quantity) as well as from the restricted zone (quality) (Christiaens, 2006, 1 & 202).
Thus it becomes clear that the traditional methods of knowledge representation and the knowledge of the creation of controlled vocabularies are not obsolete, but more up to date and in demand than ever before (Rosenfeld, 2001). As Figures 3.6 and 3.24 demonstrate, folksonomies can be used for indexing in a quick and easy manner, but are strictly inadequate for effective resource management. Loasby (2006) also emphasizes that even in the future, all agents of knowledge representation – professional indexers, authors and users – will have their role to play, but that one should not deny the new demands of Web 2.0:

We will have to harness the power of folksonomies while remembering there is stuff our audience will demand we know about our own content. Most of all, we have to ensure our choices of metadata systems are made with the user in mind (Loasby, 2006, 26).
Outlook

The outlook on the world of folksonomies is based mainly on their observed inadequacies and on users’ demands. The use of tags beyond collaborative information services is also discussed (Quasthoff, Sack, & Meinel, 2008). Campbell (2006) analyzes tagging and the Semantic Web from a phenomenological perspective, while Harrington (2006) checks, using the example of an online job portal, whether prescribed categories are better and provide a more adequate description of the content than classes taken from the user folksonomy. Tennis (2006) compares tags and keywords through a framework analysis via four factors: “purpose, predication, function, context.” The result of his analysis:

Finally, social tagging, in the context of the web, using links to personal collections and profiles, and with its focus on sharing highlights a novel kind of intertextuality in indexing. […] Furthermore, there is a reinvention of authorship and agency in the post-Fordist social tagging discourse. We no longer see a monolithic standard, we see individuals tagging personal collections, using ad hoc tools (Tennis, 2006).
Na and Yang (2006) investigate folksonomies using a ‘cultural cognitive theory,’ the basis of which is the following assumption: “people from different culture have differing cognitive process.” The idea is to observe whether this theory also applies during tagging (see also Capps, 2006).
Voß (2006) investigates the folksonomy of Wikipedia and finds that articles and tags, or in this case category descriptions, grow exponentially. Furthermore, he distinguishes the collectively created folksonomies of other collaborative information services from the collaboratively created Wikipedia folksonomy, since the latter does not provide any personal-tag or personal-category options – the tags are only attached to the resources, not the user: “In Wikipedia […] a specific category is only assigned to a specific article once for all users” (Voß, 2006). Voß (2006) also discovers a decisive difference between the folksonomies in Wikipedia and other collaborative information services:

An essential difference between known collaborative tagging systems and Wikipedia’s categories is that one can also assign categories to other categories. This way hierarchical relationships with supercategories and subcategories are defined. From these hierarchies one can derive tree structures like those of known classifications (Voß, 2006).
For him, Wikipedia is a thesaurus with no paradigmatic relations – even though, strictly speaking, this would no longer be a thesaurus. The association relation makes itself felt through links between the resources, i.e. the Wikipedia articles, but according to Voß (2006), this relation cannot be equated with that in a thesaurus: “Associations between categories are possible with normal links between category pages but in contrast to thesaurus relationships these links are not symmetric by default” (Voß, 2006). Nevertheless, semantic information can be derived from co-occurrences in the Wikipedia categories, just as in other folksonomies: “However semantic information can be derived: using co-occurrences of categories to calculate similarities between categories and create a ‘semantic map’ of English Wikipedia” (Voß, 2006).
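How similarities between categories can be derived from their co-occurrence may be illustrated with a small sketch. The choice of the cosine measure over binary article vectors is an assumption, since the quotation does not name a particular similarity measure; the function name and the sample articles are likewise hypothetical.

```python
from collections import defaultdict
from math import sqrt


def category_similarities(article_categories):
    """Compute the cosine similarity of every pair of categories that
    share at least one article.

    Each category is treated as a binary vector over the articles, so the
    cosine reduces to |A and B| / sqrt(|A| * |B|) (assumed measure).
    """
    articles_by_category = defaultdict(set)
    for article, categories in article_categories.items():
        for category in categories:
            articles_by_category[category].add(article)

    names = sorted(articles_by_category)
    similarities = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = len(articles_by_category[a] & articles_by_category[b])
            if shared:
                size_a = len(articles_by_category[a])
                size_b = len(articles_by_category[b])
                similarities[(a, b)] = shared / sqrt(size_a * size_b)
    return similarities


if __name__ == "__main__":
    # Hypothetical articles with their assigned categories.
    articles = {
        "Thesaurus": {"Indexing", "Knowledge organization"},
        "Folksonomy": {"Indexing", "Web 2.0"},
        "Tag cloud": {"Web 2.0"},
    }
    for pair, value in category_similarities(articles).items():
        print(pair, round(value, 3))
```

The resulting pair values could serve as edge weights of a ‘semantic map’; unlike the links between category pages, they are symmetric by construction.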
Figure 3.38: Exchange of Tags beyond Platform Borders. A Combination of Last.fm Tags and Flickr. Source: Last.fm.
Folksonomies largely consist of textual metadata. These tags can be attached to practically any information resource. But why do tags always have to be textual? It could be a good idea to attach photos, videos, persons, mp3s or other things to the resources. Indexing beyond platform borders would then be another option for
expansion. A first approach of this kind is being made by Last.fm at the moment (see Figure 3.38). Users of the music recommendation platform can add ‘event tags’ to photos of different events, published on Flickr, and then link these images with the corresponding concert reports on Last.fm (see Figure 3.39). This exchange of tags and resources between information services is also addressed by Gruber (2005): Tagging across various and varied applications, both existent and to be created, requires that we make it possible to exchange, compare, and reason about the tag data without any one application owning the ‘tag space’ or Folksonomy (Gruber, 2005).
Such an exchangeability requires a certain degree of folksonomical unity, which is somewhat implausible considering the open form of folksonomies.
Figure 3.39: Exchange of Tags beyond Platform Borders. A Combination of Last.fm Tags and Flickr. Source: Flickr.com.
A restricted fulfillment of this demand for exchangeability could be accomplished on the basis of personomies. Since we can assume that most users use more than one collaborative information service and tag resources there, there is obviously a multitude of different personomies per user – one in every information service. Muller (2007a; see also Thom-Santelli, Muller, & Millen, 2008) confirms that users in different information services tag differently and inconsistently, and De Chiara, Fish and Ruocco (2008), as well as Panke and Gaiser (2008a), recognize a great user demand for improved tag management systems. Vander Wal (2008) also bemoans the lack of a centralized tag collection point: “An unsolved issue hampering the personal benefit is the portability of tagging data between different web services, which would allow to aggregate and manage all my tags from different systems in one interface” (Vander Wal, 2008, 7). Oldenburg, Garbe and Cap (2008) ask themselves the question:
Nowadays users can interact with more than one tagging service in parallel. As these services typically provide non-interoperable tagging features, users cannot be sure to apply the same tagging behaviour in all services. […] How can a variety of services be efficiently used by users in such way, that they do not have to cope with service specific features, and that they can transparently apply their tagging behavior not regarding the service in background? (Oldenburg, Garbe, & Cap, 2008, 11f.).
They want to solve the problem by recommending tags to the user that do not stem from the currently used collaborative information service but from several different platforms. This would mitigate the disadvantages of tag recommender systems within single platforms, such as the restriction to and reinforcement of the preexisting tags and the resulting stagnation of the folksonomy’s expressiveness. The recommendation of similar tags from other platforms would in turn increase the folksonomy’s expressiveness. This procedure seems particularly advisable for Narrow Folksonomies. Oldenburg, Garbe and Cap (2008) determine the similarity between tags from two different information services via the Jaccard-Sneath Coefficient. So far, they have applied their method to the social bookmarking services del.icio.us, CiteULike, Connotea and BibSonomy. They found that the most popular tags of each information service do not match and that intersections of tags are mainly to be found in the Long Tails.
Another approach places the focus on the users’ personomies. In order to keep an overview of one’s tags and to be able to tag consistently, for one’s own resources at least, it would make sense to construct an application that consistently suggests to users the tags they have previously added themselves (Garg & Weber, 2008). That personomy, however, works independently of whatever information service the user is currently logged in to. The personomy thus becomes a centralized collection, archiving and management platform for the tags (Weller & Peters, 2008). ‘TagCare’¹⁴⁷ is a web-based application that aims to implement this idea (Golov, Weller, & Peters, 2008). It imports a user’s personal tags from different platforms (so far from del.icio.us, Flickr and BibSonomy) and enables him to manage these tags centrally.
The management of the tags comprises their creation, deletion and editing, as well as the establishing of relations (equivalence, part-of, abstraction and association relation) between tags. Besides, the user is given a consolidated overview of his tags and of how often he has used them within the different platforms. In time, the users of TagCare are also meant to be able to further define the association relation (e.g. ‘is_opposite_of’) and to incorporate additional pre-existing KOS into the indexing process. Regarding implementation, transferability and questions of detail, however, there is still a need for research in both cases.
The approach and success of the ESP Game (see chapter one) show that users’ enthusiasm for tagging can also be harnessed for large resource stocks. Besides, the advantages of folksonomies are intensified by this method, because the indexing here is fast, cheap and executed from different perspectives (Al-Khalifa & Davis, 2007a). According to Morrison (2007), this may even be the only way to index mass information at all: “When presented with a large and unorganized collection, games like this may just be the fastest and least expensive way to get a large amount of classification done” (Morrison, 2007, 15). Similar games or incentive systems can be used for this purpose by widely differing organizations or institutions.
147 http://www.tagcare.org.
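The comparison of tag vocabularies across information services via the Jaccard-Sneath Coefficient, mentioned above, can be sketched as follows. This is a minimal illustration that treats each service’s tags as a set and normalizes spelling by lower-casing; the exact procedure of Oldenburg, Garbe and Cap (2008) is not described here, and the two tag lists are invented for the example.

```python
def jaccard(tags_a, tags_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two tag collections."""
    a = {t.strip().lower() for t in tags_a}
    b = {t.strip().lower() for t in tags_b}
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty vocabularies
    return len(a & b) / len(union)


# Hypothetical tag vocabularies of two bookmarking services.
service_a = ["web2.0", "Tagging", "python", "folksonomy"]
service_b = ["tagging", "folksonomy", "music", "photo", "travel", "python"]

similarity = jaccard(service_a, service_b)  # 3 shared tags out of 7 distinct ones
```

Applied separately to the most popular tags and to the rarely used tags of two services, such a coefficient would allow the kind of comparison between head and Long Tail that the authors report.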
Another observation is that folksonomies are increasingly viewed and used as a means of information retrieval. The reach and popularity of the collaborative information service in question still play a role here, but the folksonomy becomes the place of information exchange. In an anecdote, Neal (2007) explains how this works:
I took this photo of my husband Jason standing by a sculpture located in front of the Fort Worth Museum of Modern Art. I added the tags “Vortex”, “Fort Worth” and “The Modern”. Someone else who viewed the picture tagged it with “Richard Serra”, which is the sculptor’s name. I was pleased that the person added the sculptor’s name, since Jason and I did not know it. This demonstrates the social utility of folksonomies (Neal, 2007, 8).
Due to the fact that Broad and Extended Narrow Folksonomies allow the adding of further tags to the author tags, informational added value is created – and not only for the other users, but potentially also for the owner of the resource, as the preceding example impressively demonstrates. By now, this form of information exchange has even become somewhat ‘professionalized.’ On Flickr, the group Tagr¹⁴⁸ has set itself the task of providing precisely this kind of social indexing: “Tagr The group all about tagging photos! Add just photos which should be tagged (and don’t have any or many tags)! If you add a photo, please add as many useful tags as possible to other people’s!” Van Hooland (2006) also reports on information exchange in image databases; here, though, the exchange takes place in the image database of the Dutch National Archive, and not via tags but via comments that can be added to the pictures: “A fraction of the comments contain questions or invitations toward the institution or other users to help identifying an image” (van Hooland, 2006).
Further functions that go far beyond ‘normal’ indexing and, strictly speaking, no longer belong to knowledge representation, can also be attached to tags. Thus user roles¹⁴⁹ (Quasthoff, Sack, & Meinel, 2008) can be used to link certain privacy settings or database protection guidelines to the tags. Another plausible scenario is this: the tag search for a specific image yields a result and the user wants to download the image. Since it is indexed with a specific tag, though, this is only possible after paying a fee.
The use of tagging for didactic purposes is also being heavily discussed.
Tagging procedures are suited to the forming of communities, to personal and collaborative knowledge management and to the reflection of the subject matter in E-Learning and teaching (Schiefner, 2008; Bateman et al., 2007; Godwin-Jones, 2006; Kohlhase & Reichel, 2008):
Shared tagging invites us to analyze texts and sum up their distinctiveness in keywords. […] You can’t “tag” a Web resource without being able to extract salient points the author makes, considering how to summarize in keywords what’s important, and placing that text in the context of others (Godwin-Jones, 2006, 8).
148 The group can be accessed at http://www.flickr.com/groups/tagr.
149 The user roles determine the allocation of licences in digital environments.
Schiefner (2008) investigates the use of social tagging in teaching and observes that folksonomies are not very widespread so far and often restrict themselves to the indexing of simple bibliographies. According to Schiefner (2008), tagging is particularly useful in virtual project work or teamwork, for network formation among students or teachers, and for epistemological term definitions. Here, though, the focus is mainly on the making of decisions that accompany the learning process:
They [the students, A/N] must decide what resource they are dealing with (what is it?), they must find out which concept is at hand (what is it about?) and choose a symbol, or one or several terms for it (set of tags representing the resource) (Schiefner, 2008, 75)*.
Special tags that serve the identification of seminars or communities – called ‘tagography’ in Flickr (Dye, 2006, 41) and ‘course code’ in Kipp (2006b) – facilitate the exchange between participants and make it easy to allocate teaching materials to virtual reading lists. A combination of tags and chronological metadata can also provide students with information on personal learning successes or the development of their own knowledge, like a ‘Tag Time-Machine’ (Schiefner, 2008, 79). “Tagging can also be a great asset in the formation of critical judgment concerning information – a faculty that is becoming increasingly important” (Schiefner, 2008, 81)*.
Harrer and Lohmann (2008) regard tags as a great teaching tool. Added tags can be used to draw conclusions regarding the transmitted knowledge. In this way questions can be answered such as: were there difficulties understanding what was said? Have I, as a teacher, expressed myself clearly enough? Was the technical context sufficiently elucidated? In this way any misunderstandings and unclear background knowledge can be made visible and finally addressed: “An analysis of the allocated tags thus provides an access path to the students’ learning experience, also enabling the restructuring or reorganization of the transmission of contents by the teacher” (Harrer & Lohmann, 2008, 99)*. The changing of the access path to the learning materials is of great advantage to the students, since a hierarchical access is no longer strictly necessary; access can be flexibly regulated via tags. Thus learning materials on a subject, or tag, can be specifically displayed. Harrer and Lohmann (2008) also found, via a student survey, that students regard the visual display of folksonomies and learning materials as very important.
Among the wishes are: “a graphical, network-like overview of relations between materials via tags as a fundamental criterion of the indexing of connections in the learning process” (Harrer & Lohmann, 2008, 104)*. Harrer and Lohmann (2008) recommend that students be clearly shown the added value (particularly as opposed to pen-and-paper annotations), and that too many media changes be avoided for a successful use of folksonomies in E-Learning and teaching. Bateman et al. (2007) also emphasize the great significance of an intrinsic motivation for the success of tagging in E-Learning. Nevertheless, Harrer and Lohmann (2008) also find that this success first requires reaching a critical mass.
In a study, Al-Khalifa and Davis (2006) present the system ‘FolksAnnotation’ that enriches a pre-existing ontology with folksonomies and thus improves the indexing of teaching and learning resources. Bateman et al. (2007) use their system ‘OATS’ (Open Annotation and Tagging System) to combine the annotation and tagging of learning resources. Text parts of the resource can be marked or highlighted by the users and tagged or commented on. If the same passages are marked frequently, the color of their highlighting changes in order to visually alert viewers to important quotes. Bateman et al. (2007) want to use their tool to facilitate the indexing and annotating of text passages, not only of resources as a whole.
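The frequency-dependent highlighting described for OATS can be pictured with a small sketch. It is not taken from Bateman et al. (2007): the thresholds, the color names, the function name and the input format (pairs of user and passage identifier) are all hypothetical and serve only to illustrate the principle that more frequently marked passages are shown in a more prominent color.

```python
from collections import Counter

# Hypothetical thresholds and color names; the values used by OATS are not given here.
COLOR_STEPS = [(1, "pale yellow"), (3, "yellow"), (6, "orange")]


def highlight_colors(highlights):
    """Map each passage to a color by the number of distinct users who marked it.

    `highlights` is an iterable of (user, passage_id) pairs.
    """
    # set() ensures that a user marking the same passage twice counts once.
    counts = Counter(passage for _user, passage in set(highlights))
    colors = {}
    for passage, n in counts.items():
        color = COLOR_STEPS[0][1]
        for threshold, name in COLOR_STEPS:
            if n >= threshold:
                color = name  # keep the highest step that has been reached
        colors[passage] = color
    return colors


marks = [("u1", "p1"), ("u2", "p1"), ("u3", "p1"), ("u1", "p2"), ("u1", "p2")]
colors = highlight_colors(marks)
```

In this toy data, passage p1 was marked by three users and reaches the second step, while p2 was marked by one user only and stays at the first.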
Regardless of the directions in which folksonomies develop and of which user demands for leisure and exchangeability are met, one thing is abundantly clear: folksonomies were and are being made by the users – as emphasized by Weinberger:
The tagging movement says, in effect, that we’re not going to wait for the experts to deliver a taxonomy from on high. We’re just going to build one ourselves. It’ll be messy and inelegant and inefficient, but it will be good enough. And, most important, it will be ours, reflecting our needs and our ways of thinking (Weinberger, 2005).
Bibliography
Adamic, L. A. (2002). Zipf, Power-laws, and Pareto: A Ranking Tutorial, from http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html.
Ahern, S., Naaman, M., Nair, R., & Yang, J. (2007). World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-referenced Collections. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, Vancouver, BC, Canada (pp. 1–10).
Albrecht, C. (2006). Folksonomy. Diplomarbeit der TU Wien. Wien.
Al-Khalifa, H. S., & Davis, H. C. (2006). FolksAnnotation: A Semantic Metadata Tool for Annotating Learning Resources Using Folksonomies and Domain Ontologies. In Proceedings of the 2nd International IEEE Conference on Innovations in Information Technology, Dubai, UAE.
Al-Khalifa, H. S., & Davis, H. C. (2007a). Exploring the Value of Folksonomies for Creating Semantic Metadata. International Journal of Semantic Web and Information Systems, 3(1), 12–38.
Al-Khalifa, H. S., & Davis, H. C. (2007b). Towards Better Understanding of Folksonomic Patterns. In Proceedings of the 18th Conference on Hypertext and Hypermedia, Manchester, UK (pp. 163–166).
Al-Khalifa, H. S., & Davis, H. C. (2007c). Folksonomies versus Automatic Keyword Extraction: An Empirical Study. In Proceedings of 7th Mensch & Computer Conference, Weimar, Germany.
Al-Khalifa, H. S., & Davis, H. C. (2007d). FAsTA: A Folksonomy-Based Automatic Metadata Generator. In Proceedings of the 2nd European Conference on Technology Enhanced Learning, Crete, Greece.
Ames, M., & Naaman, M. (2007). Why We Tag: Motivations for Annotation in Mobile and Online Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA (pp. 971–980).
Amitay, E., Har’El, N., Sivan, R., & Soffer, A. (2004). Web-a-where: Geotagging Web Content. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK (pp. 273–280).
Anderson, C. (2006). The Long Tail: Why the Future of Business Is Selling Less of More. New York, NY: Hyperion.
Andrich, D. (2002). Comparisons vs. Preferences, from http://www.rasch.org/rmt/rmt161d.htm.
Angeletou, S., Sabou, M., Specia, L., & Motta, E. (2007). Bridging the Gap between Folksonomies and the Semantic Web: An Experience Report. In Proceedings of the 4th European Semantic Web Conference, Innsbruck, Austria (pp. 30–43).
Ankolekar, A., Krötzsch, M., & Vrandecic, D. (2007). The Two Cultures: Mashing up Web 2.0 and the Semantic Web: Position Paper. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 825–834).
Aurnhammer, M., Hanappe, P., & Steels, L. (2006a). Augmenting Navigation for Collaborative Tagging with Emergent Semantics. Lecture Notes in Computer Science, 4273, 58–71.
Aurnhammer, M., Hanappe, P., & Steels, L. (2006b). Integrating Collaborative Tagging and Emergent Semantics for Image Retrieval. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Austin, J. L. (1972). Zur Theorie der Sprechakte. Stuttgart: Reclam.
Bak, P., Tang, C., & Wiesenfeld, K. (1987). Self-Organized Criticality: An Explanation of 1/f Noise. Physical Review Letters, 59, 381–384.
Bao, S. et al. (2007). Optimizing Web Search Using Social Annotations. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 501–510).
Barabási, A., & Albert, R. (1999). Emergence of Scaling in Random Networks. Science, 286(5439), 509–512.
Bar-Ilan, J., Shoham, S., Idan, A., Miller, Y., & Shachak, A. (2006). Structured vs. Unstructured Tagging - A Case Study. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Bateman, S., Brooks, C., Brusilovsky, P., & McCalla, G. (2007). Applying Collaborative Tagging to E-Learning. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada.
Bates, M. E. (2006). Tag - You're It. Online - Leading Magazine for Information Professionals, 30(6), 64.
Beal, A. (2006). Wink's Michael Tanne Discusses the Future of Tagging, from http://www.marketingpilgrim.com/2006/01/winks-michael-tanne-discusses-future.html.
Beaudoin, J. (2007). Flickr Image Tagging: Patterns Made Visible. Bulletin of the ASIST, 34(1), 26–29.
Begelman, G., Keller, P., & Smadja, F. (2006). Automated Tag Clustering: Improving Search and Exploration in the Tag Space. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Beni, G., & Wang, J. (1989). Swarm Intelligence. In Proceedings of the 7th Annual Meeting of the Robotics Society of Japan (pp. 425–428).
Berendt, B., & Hanser, C. (2007). Tags are not Metadata, but "Just More Content" to Some People. In Proceedings of ICWSM, Boulder, Colorado, USA.
Birkenhake, B. (2008). Semantic Weblog: Erfahrungen vom Bloggen mit Tags und Ontologien. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 153–162). Münster, New York, München, Berlin: Waxmann.
Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R. (2008). Can All Tags Be Used for Search? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA (pp. 193–202).
Blank, M., Bopp, T., Hampel, T., & Schulte, J. (2008). Social Tagging = Soziale Suche? In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 85–96). Münster, New York, München, Berlin: Waxmann.
Bowers, J. S., & Jones, K. W. (2007). Detecting Objects Is Easier Than Categorizing Them. The Quarterly Journal of Experimental Psychology, 16(4), 552–557.
Braun, S., Schmidt, A., & Zacharias, V. (2007). SOBOLEO: Vom Kollaborativen Tagging zur Leichtgewichtigen Ontologie. In Proceedings of 7th Mensch & Computer Conference, Weimar, Germany (pp. 209–218).
Braun, S., Schmidt, A., Walter, A., & Zacharias, V. (2008). Von Tags zu semantischen Beziehungen: Kollaborative Ontologiereifung. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 163–173). Münster, New York, München, Berlin: Waxmann.
Brooks, C., & Montanez, N. (2006a). An Analysis of the Effectiveness of Tagging in Blogs. In N. Nicolov, F. Salvetti, M. Liberman & J. Martin (Eds.), Computational Approaches to Analyzing Weblogs. Papers from the 2006 AAAI Spring Symposium (pp. 9–15).
Brooks, C., & Montanez, N. (2006b). Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Bruce, R. (2008). Descriptor and Folksonomy Concurrence in Education Related Scholarly Research. Webology, 5(3), from http://www.webology.ir/2008/v5n3/a59.html.
Butler, J. (2006). Cognitive Operations behind Tagging for One's Self and Tagging for Others. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Butterfield, D., Costello, E., Fake, C., Henderson-Begg, C., Mourachow, S., & Schachter, J. (2006). Media Object Metadata Association and Ranking. Patent No. US 2006/0242178A1.
Buxmann, P. (2002). Strategien von Standardsoftware-Anbietern. Eine Analyse auf der Basis von Netzeffekten. zfbf, 54, 442–457.
Byde, A., Wan, H., & Cayzer, S. (2007). Personalized Tag Recommendations via Tagging and Content-based Similarity Metrics. In Proceedings of ICWSM 2007 Boulder, Colorado, USA.
Calefato, F., Gendarmi, D., & Lanubile, F. (2007). Towards Social Semantic Suggestive Tagging. In Proceedings of the 4th Italian Semantic Web Workshop, Bari, Italy (pp. 101–109).
Campbell, D. (2006). A Phenomenological Framework for the Relationship between the Semantic Web and User-Centered Tagging Systems. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Candan, K. S., Di Caro, L., & Sapino, M. L. (2008). Creating Tag Hierarchies for Effective Navigation in Social Media. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 75–82).
Capocci, A., & Caldarelli, G. (2008). Folksonomies and Clustering in the Collaborative System CiteULike, from http://arxiv.org/abs/0710.2835.
Capocci, A., Servedio, V., Colaiori, F., Buriol, L. S., Donato, D., & Leonardi, S., et al. (2006). Preferential Attachment in the Growth of Social Networks: The Internet Encyclopedia Wikipedia. Physical Review E, 74.
Capps, J. (2006). Ranking Patterns: A Flickr Tagging System Pilot Study. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Carboni, D., Sanna, S., & Zanarini, P. (2006). GeoPix: Image Retrieval on the Geo Web: From Camera Click to Mouse Click. In Proceedings of the 8th Conference on Human-computer Interaction with Mobile Devices and Services, Helsinki, Finland (pp. 169–172).
Cattuto, C. (2006). Semiotic Dynamics in Online Social Communities. European Physical Journal C, 46(2), 33–37.
Cattuto, C., Loreto, V., & Pietronero, L. (2007). Semiotic Dynamics and Collaborative Tagging. Proceedings of the National Academy of Sciences of the USA, 104(5), 1461–1464.
Chen, H. (1994). Collaborative Systems: Solving the Vocabulary Problem. IEEE Computer, Special Issue on Computer-Supported Cooperative Work, 27(5), 58–66.
Chi, E. H., & Mytkowicz, T. (2006). Understanding Navigability of Social Tagging Systems, from http://www.viktoria.se/altchi/submissions/submission_edchi_0.pdf.
Chirita, P. A., Costache, S., Handschuh, S., & Nejdl, W. (2007). P-TAG: Large Scale Automatic Generation of Personalized Annotation Tags for the Web. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 845–854).
Chopin, K. (2008). Finding Communities: Alternative Viewpoints through Weblogs and Tagging. Journal of Documentation, 64(4), 552–575.
Choy, S., & Lui, A. K. (2006). Web Information Retrieval in Collaborative Tagging Systems. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong (pp. 352–355).
Christiaens, S. (2006). Metadata Mechanism: From Ontology to Folksonomy…and Back. Lecture Notes in Computer Science, 4277, 199–207.
Civan, A., Jones, W., Klasnja, P., & Bruce, H. (2008). Better to Organize Personal Information by Folders Or by Tags? The Devil Is in the Details. In Proceedings of the 71st ASIS&T Annual Meeting. People Transforming Information - Information Transforming People. Columbus, Ohio, USA.
Cox, A., Clough, P., & Marlow, J. (2008). Flickr: A First Look at User Behaviour in the Context of Photography as Serious Leisure. Information Research, 13(1), from http://informationr.net/ir/13-1/paper336.html.
de Chiara, R., Fish, A., & Ruocco, S. (2008). Eulr: A Novel Resource Tagging Facility Integrated with Flickr. In Proceedings of the Working Conference on Advanced Visual Interfaces, Napoli, Italy (pp. 326–330).
de Solla Price, D. J. (1969). Little Science, Big Science. New York: Columbia University Press.
Dellschaft, K., & Staab, S. (2008). An Epistemic Dynamic Model for Tagging Systems. In Proceedings of the 19th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA (pp. 71–80).
Dennis, B. M. (2006). Foragr: Collaboratively Tagged Photographs and Social Information Visualization. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Derntl, M., Hampel, T., Motschnig, R., & Pitner, T. (2008). Social Tagging und Inclusive Universal Access. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 51–62). Münster, New York, München, Berlin: Waxmann.
Diederich, J., & Iofciu, T. (2006). Finding Communities of Practice from User Profiles Based on Folksonomies. In Proceedings of the 1st International Workshop on Building Technology Enhanced Learning Solutions for Communities of Practice, Crete, Greece.
Dietl, H., & Royer, S. (2000). Management virtueller Netzwerkeffekte in der Informationsökonomie. zfo, 69(6), 324–331.
Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., & Tomkins, A. (2006). Visualizing Tags Over Time. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 193–202).
Dye, J. (2006). Folksonomy: A Game of High-tech (and High-stakes) Tag. EContent, April, 38–43.
Egghe, L. (2005). Power Laws in the Information Production Process: Lotkaian Informetrics. Amsterdam: Elsevier Academic Press.
Egghe, L., & Rousseau, R. (1995). Generalized Success-Breeds-Success Principle Leading to Time-Dependent Informetric Distributions. Journal of the American Society for Information Science and Technology, 46(6), 426–445.
Farooq, U., Kannampallil, T. G., Song, Y., Ganoe, C. H., Carroll, J. M., & Giles, C. L. (2007). Evaluating Tagging Behavior in Social Bookmarking Systems: Metrics and Design Heuristics. In Proceedings of the 2007 ACM Conference on Supporting Group Work, Sanibel Island, FL, USA (pp. 351–360).
Farrell, S., Lau, T., Wilcox, E., & Muller, M. J. (2007). Socially Augmenting Employee Profiles with People-tagging. In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, Newport, Rhode Island, USA (pp. 91–100).
Feinberg, M. (2006). An Examination of Authority in Social Classification Systems. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Fengel, J., Rebstock, M., & Nüttgens, M. (2008). Modell-Tagging zur semantischen Verlinkung heterogener Modelle. In EMISA 2008 - Auswirkungen des Web 2.0 auf Dienste und Prozesse, Bonn/ St. Augustin, Germany.
Fensel, D. (2004). Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce (2nd ed.). Berlin, Heidelberg, New York: Springer.
Fichter, D. (2006). Intranet Applications for Tagging and Folksonomies. Online - Leading Magazine for Information Professionals, 30(3), 43–45.
Fischer, G., Grudin, J., McCall, R., Ostwald, J., Redmiles, D., & Reeves, B., et al. (2001). Seeding, Evolutionary Growth and Reseeding: The Incremental Development of Collaborative Design Environments. In G. M. Olson, T. W. Malone & J. B. Smith (Eds.), Coordination Theory and Collaboration Technology (pp. 447–472). New Jersey, USA: Mahwah, Erlbaum Publishers.
Fokker, J., Buntine, W., & Pouwelse, J. (2006). Tagging in Peer-to-Peer Wikipedia. In Proceedings of the 2nd International Workshop on Open Source Information Retrieval, Seattle, USA.
Fokker, J., Pouwelse, J., & Buntine, W. (2006). Tag-Based Navigation for Peer-to-Peer Wikipedia. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Fu, F., Liu, L., Yang, K., & Wang, L. (2006). The Structure of Self-organized Blogosphere, from http://arxiv.org/abs/math/0607361.
Furnas, G., Landauer, T., Gomez, L., & Dumais, S. (1987). The Vocabulary Problem in Human-System Communication. Communications of the ACM, 30(11), 964–971.
Furnas, G., Fake, C., von Ahn, L., Schachter, J., Golder, S., & Fox, K., et al. (2006). Why Do Tagging Systems Work? In Proceedings of the Conference on Human Factors in Computing Systems, Montréal, Canada (pp. 36–39).
Galton, F. (1907). Vox Populi. Nature, 75, 450–451.
Gaiser, B., Hampel, T., & Panke, S. (Eds.) (2008). Good Tags - Bad Tags: Social Tagging in der Wissensorganisation. Münster, New York, München, Berlin: Waxmann. Garg, N., & Weber, I. (2008). Personalized Tag Suggestion for Flickr. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 1063–1064). Geisselmann, F. (2000). The Indexing of Electronic Publications - Ways out of Heterogeneity. In Proceedings of the 66th IFLA Council and General, Jerusalem, Israel. Gendarmi, D., & Lanubile, F. (2006). Community-Driven Ontology Evolution Based on Folksonomies. Lecture Notes in Computer Science, 4277, 181–188. Gissing, B., & Tochtermann, K. (2007). Corporate Web 2.0: Web 2.0 und Unternehmen - Wie passt das zusammen? Aachen: Shaker. Godwin-Jones, R. (2006). Tag Clouds in the Blogosphere: Electronic Literacy and Social Networking. Language Learning & Technology, 10(2), 8–15. Golder, S., & Hubermann, B. (2006). Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), 198–208. Golov, E., Weller, K., & Peters, I. (2008). TagCare: A Personal Portable Tag Repository. In Proceedings of the Poster and Demonstration Session at the 7th International Semantic Web Conference, Karlsruhe, Germany. Golub, K., Jones, C., Matthews, B., Moon, J., Nielsen, M. L., & Tudhope, D. (2008). Enhancing Social Tagging with a Knowledge Organization System. ALISS, 3(4), 13–16. Goodrum, A., & Spink, A. (2001). Image Searching on the Excite Web Search Engine. Information Processing and Management, 37(2), 295–311. Gordon-Murnane, L. (2006). Social Bookmarking, Folksonomies, and Web 2.0 Tools. Searcher - The Magazine for Database Professionals, 14(6), 26–38. Governor, J. (2006). On The Emergence of Professional Tag Gardeners, from http://www.redmonk.com/jgovernor/2006/01/10/on-the-emergence-ofprofessional-tag-gardeners. Grahl, M., Hotho, A., & Stumme, G. (2007). Conceptual Clustering of Social Bookmarking Sites. 
In Proceedings of I-KNOW, International Conference on Knowledge Management, Graz, Austria (pp. 356–364). Gruber, T. (2005). Ontology of Folksonomy: A Mash-Up of Apples and Oranges, from http://www.tomgruber.org/writing/mtsr05-ontology-of-folksonomy.htm. Gruber, T. (2006). Where the Social Web Meets the Semantic Web. Lecture Notes in Computer Science, 4273, 994. Güntner, G., Sint, R., & Westenthaler, R. (2008). Ein Ansatz zur Unterstützung traditioneller Klassifikation durch Social Tagging. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 187–199). Münster, New York, München, Berlin: Waxmann. Guy, M., & Tonkin, E. (2006). Folksonomies: Tidying Up Tags? D-Lib Magazine, 12(1), from http://www.dlib.org/dlib/january06/guy/01guy.html. Gyöngyi, P., Garcia-Molina, H., & Pedersen, J. (2004). Combating Web Spam with Trustrank. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada (pp. 576–587). Hänger, C. (2008). Good tags or bad tags? Tagging im Kontext der bibliothekarischen Sacherschließung. In B. Gaiser, T. Hampel, & S. Panke
(Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 63–71). Münster, New York, München, Berlin: Waxmann.
Halpin, H., & Shepherd, H. (2006). Evolving Ontologies from Folksonomies: Tagging as a Complex System, from http://www.ibiblio.org/hhalpin/homepage/notes/taggingcss.html.
Halpin, H., Robu, V., & Shepherd, H. (2007). The Complex Dynamics of Collaborative Tagging. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 211–220).
Hammond, T., Hannay, T., Lund, B., & Scott, J. (2005). Social Bookmarking Tools (I). D-Lib Magazine, 11(4), from http://www.dlib.org/dlib/april05/hammond/04hammond.html.
Hampel, T., Tschetschonig, K., & Ladengruber, R. (2008). Kollaborative Tagging Systeme im Electronic Commerce. Talk at the Good Tags & Bad Tags Workshop: Social Tagging in der Wissensorganisation. Institut für Wissensmedien, Tübingen.
Harrer, A., & Lohmann, S. (2008). Potenziale von Tagging als partizipative Methode für Lehrportale und E-Learning Kurse. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 97–105). Münster, New York, München, Berlin: Waxmann.
Harrington, K. (2006). Social Classification and Online Job Banks: Finding the Right Words to Find the Right Job. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Hassan-Montero, Y., & Herrero-Solana, V. (2006). Improving Tag-Clouds as Visual Information Retrieval Interfaces. In Proceedings of the International Conference on Multidisciplinary Information Science and Technologies, Mérida, Spain.
Hayes, C., & Avesani, P. (2007). Using Tags and Clustering to Identify Topic-Relevant Blogs. In Proceedings of ICWSM 2007, Boulder, Colorado, USA (pp. 67–75).
Hayes, C., Avesani, P., & Veeramachaneni, S. (2007). An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI 2007, Hyderabad, India (pp. 2772–2777).
Hayman, S. (2007). 
Folksonomies and Tagging: New Developments in Social Bookmarking, from http://www.educationau.edu.au/jahia/webdav/site/myjahiasite/shared/papers/arkhayman.pdf.
Hayman, S., & Lothian, N. (2007). Taxonomy Directed Folksonomies: Integrating User Tagging and Controlled Vocabularies for Australian Education Networks. In Proceedings of 73rd IFLA General Conference and Council, Durban, South Africa.
Heckner, M., Mühlbacher, S., & Wolff, C. (2007). Tagging Tagging. Analysing User Keywords in Scientific Bibliography Management Systems, from http://dlist.sir.arizona.edu/2053.
Heckner, M., Neubauer, T., & Wolff, C. (2008). Tree, funny, to_read, google: What are Tags Supposed to Achieve? A Comparative Analysis of User Keywords for Different Digital Resource Types. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 3–10).
Heckner, M. (2009). Tagging, Rating, Posting. Studying Forms of User Contribution for Web-based Information Management and Information Retrieval. Boizenburg: Hülsbusch.
Held, C., & Cress, U. (2008). Social Tagging aus Kognitionspsychologischer Sicht. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 37–49). Münster, New York, München, Berlin: Waxmann.
Herget, J., & Hierl, S. (2007). Top-down versus Bottom-up: Wissensorganisation im Wandel. Von der traditionellen Wissenserschließung zu Folksonomies. In Proceedings der 37. Jahrestagung der Gesellschaft für Informatik e.V., Bremen, Germany (pp. 503–508).
Heß, A., Maaß, C., & Dierick, F. (2008). From Web 2.0 to the Semantic Web: A Semi-Automated Approach. In Proceedings of the European Semantic Web Conference, Tenerife, Spain.
Heylighen, F. (2002). The Science of Self-Organisation and Adaptivity. In The Encyclopedia of Life Support Systems. Oxford: EOLSS [et al.].
Heymann, P., & Garcia-Molina, H. (2006). Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems: InfoLab Technical Report, from http://dbpubs.stanford.edu/pub/2006-10.
Heymann, P., Koutrika, G., & Garcia-Molina, H. (2007). Fighting Spam on Social Websites: A Survey of Approaches and Future Challenges. IEEE Internet Computing, 11(6), 36–45.
Heymann, P., Koutrika, G., & Garcia-Molina, H. (2008). Can Social Bookmarking Improve Web Search? In Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA (pp. 195–206).
Hotho, A. (2008). Analytische Methoden zur Nutzerunterstützung in Tagging-Systemen. Talk at the Good Tags & Bad Tags Workshop: Social Tagging in der Wissensorganisation. Institut für Wissensmedien, Tübingen.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006a). BibSonomy: A Social Bookmark and Publication Sharing System. In Proceedings of the Conceptual Structure Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, Aalborg, Denmark.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. 
(2006b). Information Retrieval in Folksonomies: Search and Ranking. Lecture Notes in Computer Science, 4011, 411–426.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006c). Das Entstehen von Semantik in BibSonomy. In Social Software in der Wertschöpfung, Baden-Baden, Germany.
Howe, J. (2006). The Rise of Crowdsourcing. Wired Magazine, 14(6), from http://www.wired.com/wired/archive/14.06/crowds_pr.html.
Ingwersen, P. (2002). Cognitive Perspectives of Document Representation. In Proceedings of the 4th International Conference on Conceptions on Library and Information Science, Seattle, WA, USA (pp. 285–300).
Iyer, H. (2007). Social Tagging: Social Computing, Folksonomies, and Image Tagging: Reports from the Research Front. Talk at the 70th ASIS&T Annual Meeting 2007: Joining Research and Practice. Social Computing and Information Science. Milwaukee, Wisconsin, USA.
Jaccard, P. (1901). Étude Comparative de la Distribution Florale dans une Portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
Jäschke, R., Hotho, A., Schmitz, C., & Stumme, G. (2006). Wege zur Entdeckung von Communities in Folksonomies. In Proceedings of the 18th Workshop Grundlagen von Datenbanken, Halle-Wittenberg, Germany (pp. 80–84).
Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2007). Tag Recommendations in Folksonomies. Lecture Notes in Artificial Intelligence, 4702, 506–514.
Jäschke, R., Hotho, A., Schmitz, C., Ganter, B., & Stumme, G. (2008). Discovering Shared Conceptualizations in Folksonomies. Web Semantics: Science, Services and Agents on the World Wide Web, 6(1), 38–53.
Jaffe, A., Naaman, M., Tassa, T., & Davis, M. (2006). Generating Summaries and Visualization for Large Collections of Geo-Referenced Photographs. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, USA (pp. 89–98).
Jeong, W. (2008). Does Tagging Really Work? In Proceedings of the 71st ASIS&T Annual Meeting. People Transforming Information - Information Transforming People. Columbus, Ohio, USA.
Jörgensen, C. (2007). Image Access, the Semantic Gap, and Social Tagging as a Paradigm Shift. In Proceedings of the 18th Workshop of the ASIS&T Special Interest Group in Classification Research, Milwaukee, Wisconsin, USA.
John, A., & Seligmann, D. (2006). Collaborative Tagging and Expertise in the Enterprise. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland.
Kalbach, J. (2008). Navigating the Long Tail. Bulletin of the ASIST, 34(2), 36–38.
Kellogg Smith, M. (2006). Viewer Tagging in Art Museums: Comparisons to Concepts and Vocabularies of Art Museum Visitors. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Kennedy, L., Naaman, M., Ahern, S., Nair, R., & Rattenbury, T. (2007). How Flickr Helps Us Make Sense of the World: Context and Content in Community-Contributed Media Collections. 
In Proceedings of the 15th International Conference on Multimedia, Augsburg, Germany (pp. 631–640).
Kessler, M. M. (1963). Bibliographic Coupling between Scientific Papers. American Documentation, 14, 10–25.
Kipp, M. E. I. (2006a). @toread and cool: Tagging for Time, Task, and Emotion. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Kipp, M. E. I. (2006b). Exploring the Context of User, Creator and Intermediary Tagging. In Proceedings of the 7th Information Architecture Summit, Vancouver, Canada.
Kipp, M. E. I. (2006c). Complementary or Discrete Contexts in Online Indexing: A Comparison of User, Creator, and Intermediary Keywords. Canadian Journal of Information and Library Science, 30(3), from http://dlist.sir.arizona.edu/1533.
Kipp, M. E. I. (2007). Tagging for Health Information Organisation and Retrieval. In Proceedings of the North American Symposium on Knowledge Organization, Toronto, Canada (pp. 63–74).
Kipp, M. E. I., & Campbell, D. (2006). Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices. In Proceedings of the 69th Annual Meeting of the American Society for Information Science and Technology, Austin, Texas, USA.
Knautz, K. (2008). Von der Tag-Cloud zum Tag-Cluster: Statistischer Thesaurus auf der Basis syntagmatischer Relationen und seine mögliche Nutzung in Web 2.0-Diensten. In Proceedings der 30. Online-Tagung der DGI, Frankfurt a. M., Germany (pp. 269–284).
Kohlhase, A., & Reichel, M. (2008). Embodied Conceptualizations: Social Tagging and E-Learning. International Journal of Web-based Learning and Teaching Technologies, 3(1), 58–67.
Kolbitsch, J. (2007). WordFlickr: A Solution to the Vocabulary Problem in Social Tagging Systems. In Proceedings of I-MEDIA and I-SEMANTICS, Graz, Austria (pp. 77–84).
Kome, S. (2005). Hierarchical Subject Relationships in Folksonomies. Master Thesis, University of North Carolina at Chapel Hill, from http://etd.ils.unc.edu:8080/dspace/handle/1901/238.
Koutrika, G., Effendi, F. A., Gyöngyi, Z., Heymann, P., & Garcia-Molina, H. (2007). Combating Spam in Tagging Systems. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, Banff, Alberta, Canada (pp. 57–64).
Krause, J. (1996). Informationserschließung und -bereitstellung zwischen Deregulation, Kommerzialisierung und weltweiter Vernetzung [Schalenmodell]: IZ-Arbeitsbericht Nr. 6, from http://www.gesis.org/fileadmin/upload/forschung/publikationen/gesis_reihen/iz_arbeitsberichte/ab6.pdf.
Krause, J. (2003). Standardisierung von der Heterogenität her denken - Zum Entwicklungsstand bilateraler Transferkomponenten für digitale Fachbibliotheken: IZ-Arbeitsbericht Nr. 28, from http://www.gesis.org/fileadmin/upload/forschung/publikationen/gesis_reihen/iz_arbeitsberichte/ab_28.pdf.
Krause, J. (2006). Shell Model, Semantic Web and Web Information Retrieval. In I. Harms, H. D. Luckhardt & H. W. Giessen (Eds.), Information und Sprache. Beiträge zu Informationswissenschaft, Computerlinguistik, Bibliothekswesen und verwandten Fächern. Festschrift für Harald H. Zimmermann (pp. 95–106). München: Saur.
Krause, J. (2007). 
The Concepts of Semantic Heterogeneity and Ontology of the Semantic Web as a Background of the German Science Portals vascoda and sowiport. In Proceedings of the International Conference on Semantic Web & Digital Libraries, Bangalore, India (pp. 13–24).
Krause, B., Hotho, A., & Stumme, G. (2008). The Anti-Social Tagger - Detecting Spam in Social Bookmarking Systems. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, Beijing, China.
Kroski, E. (2005). The Hive Mind: Folksonomies and User-Based Tagging, from http://infotangle.blogsome.com/2005/12/07/the-hive-mind-folksonomies-and-user-based-tagging.
Lambiotte, R., & Ausloos, M. (2006). Collaborative Tagging as a Tripartite Network. Lecture Notes in Computer Science, 3993, 1114–1117.
Lancaster, F. (2003). Indexing and Abstracting in Theory and Practice (3rd ed.). Champaign: University of Illinois.
Laniado, D., Eynard, D., & Colombetti, M. (2007a). A Semantic Tool to Support Navigation in a Folksonomy. In Proceedings of the 18th Conference on Hypertext and Hypermedia, Manchester, UK (pp. 153–154).
Laniado, D., Eynard, D., & Colombetti, M. (2007b). Using WordNet to Turn a Folksonomy into a Hierarchy of Concepts. In Proceedings of the 4th Italian Semantic Web Workshop, Bari, Italy (pp. 192–201).
Layne, S. (2002). Subject Access to Art Images. In M. Baca (Ed.), Introduction to Art Image Access (pp. 1–19). Los Angeles: Getty Research Institute.
Le Bon, G. (1895). Psychologie des foules. Paris.
Lee, K. J. (2006). What Goes Around Comes Around: An Analysis of del.icio.us as Social Space. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, Banff, Alberta, Canada (pp. 191–194).
Lee, C. S., Goh, D. H., Razikin, K., & Chua, A. Y. K. (2009). Tagging, Sharing and the Influence of Personal Experience. Journal of Digital Information, 10(1), from http://journals.tdl.org/jodi/article/view/275/275.
Levenshtein, V. I. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10(8), 707–710.
Li, R., Bao, S., Fei, B., Su, Z., & Yu, Y. (2007). Towards Effective Browsing of Large Scale Social Annotations. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 943–952).
Lin, X., Beaudoin, J., Bui, Y., & Desai, K. (2006). Exploring Characteristics of Social Classification. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Linde, F. (2008). Ökonomie der Information (2nd ed.). Göttingen: Univ.-Verl. Göttingen.
Loasby, K. (2006). Changing Approaches to Metadata at bbc.co.uk: From Chaos to Control and Then Letting Go Again. Bulletin of the ASIST, 33(1), 25–26.
Logan, G. D. (2004). Cumulative Progress in Formal Theories of Attention. Annual Review of Psychology, 55, 207–234.
Luce, R. D. (1959). Individual Choice Behavior. New York: Wiley.
Lund, B., Hammond, T., Flack, M., & Hannay, T. (2005). Social Bookmarking Tools (II). A Case Study - Connotea. 
D-Lib Magazine, 11(4), from http://www.dlib.org/dlib/april05/lund/04lund.html.
Lux, M., Granitzer, M., & Kern, R. (2007). Aspects of Broad Folksonomies. In Proceedings of the 18th International Conference on Database and Expert Systems Applications, Regensburg, Germany (pp. 283–287).
Maarek, Y., Marnasse, N., Navon, Y., & Soroka, V. (2006). Tagging the Physical World. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Maass, W., Kowatsch, T., & Münster, T. (2007). Vocabulary Patterns in Free-for-all Collaborative Indexing Systems. In Proceedings of International Workshop on Emergent Semantics and Ontology Evolution, Busan, Korea (pp. 45–57).
Macgregor, G., & McCulloch, E. (2006). Collaborative Tagging as a Knowledge Organisation and Resource Discovery Tool. Library Review, 55(5), 291–300.
MacLaurin, M. (2007). Selection-Based Item Tagging. Patent-No. US 2007/0028171 A1.
Mack, M. L., Gauthier, I., Sadr, J., & Palmeri, T. J. (2008). Object Detection and Basic-level Categorization: Sometimes You Know It Is There before You Know What It Is. Psychonomic Bulletin & Review, 15(1), 28–35.
Maier, R., & Schmidt, A. (2007). Characterizing Knowledge Maturing: A Conceptual Model Integrating E-Learning and Knowledge Management. In
Proceedings of the 4th Conference of Professional Knowledge Management, Potsdam, Germany (pp. 325–333).
Maier, R., & Thalmann, S. (2007). Kollaboratives Tagging zur inhaltlichen Beschreibung von Lern- und Wissensressourcen. In Proceedings of XML Tage, Berlin, Germany (pp. 75–86).
Maier, R., & Thalmann, S. (2008). Institutionalised Collaborative Tagging as an Instrument for Managing the Maturing Learning and Knowledge Resources. International Journal of Technology Enhanced Learning, 1(1), 70–84.
Marinho, L., Buza, K., & Schmidt-Thieme, L. (2008). Folksonomy-Based Collabulary Learning. In Proceedings of the 7th International Semantic Web Conference, Karlsruhe, Germany (pp. 261–276).
Markey, K. (1984). Inter-indexer Consistency Tests: A Literature Review and Report of a Test of Consistency in Indexing Visual Materials. Library & Information Science Research, 6, 155–177.
Markey, K. (1986). Subject Access to Visual Resource Collections. Westport: Greenwood.
Markkula, M., & Sormunen, E. (2000). End-user Searching Challenges Indexing Practices in the Digital Newspaper Photo Archive. Information Retrieval, 1, 259–285.
Marlow, C., Naaman, M., boyd, d., & Davis, M. (2006a). HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 31–40).
Marlow, C., Naaman, M., boyd, d., & Davis, M. (2006b). Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication Through Shared Metadata, from www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.
Mayr, P. (2006). Thesauri, Klassifikationen & Co - die Renaissance der Kontrollierten Vokabulare? In P. Hauke & K. Umlauf (Eds.), Vom Wandel der Wissensorganisation im Informationszeitalter. Festschrift für Walther Umstätter zum 65. Geburtstag (pp. 151–170). 
Bad Honnef: Bock + Herchen.
McFedries, P. (2006). Folk Wisdom. IEEE Spectrum, 43(Feb), 80.
Mejias, U. A. (2005). Tag Literacy, from http://blog.ulisesmejias.com/2005/04/26/tag-literacy.
Ménard, E. (2007). Image Indexing: How Can I Find a Nice Pair of Italian Shoes? Bulletin of the ASIST, 34(1), 21–25.
Menschel, R. (2002). Markets, Mobs & Mayhem: How to Profit from the Madness of the Crowds. Hoboken, NJ, USA: John Wiley & Sons Inc.
Merholz, P. (2004). Metadata for the Masses, from http://adaptivepath.com/ideas/essays/archives/000361.php.
Merholz, P. (2005). Clay Shirky's Viewpoints are Overrated, from http://www.peterme.com/archives/000558.html.
Merton, R. (1968). The Matthew Effect in Science. Science, 159(3810), 56–63.
Mika, P. (2005). Ontologies are us: A Unified Model of Social Networks and Semantics. Lecture Notes in Computer Science, 3729, 522–536.
Miller, G. (1998). Nouns in WordNet. In C. Fellbaum (Ed.), WordNet. An Electronic Lexical Database (pp. 23–46). Cambridge, Mass., London: MIT Press.
Mislove, A., Koppula, H. S., Gummadi, K. P., Druschel, P., & Bhattacharjee, B. (2008). Growth of the Flickr Social Network. In Proceedings of the 1st Workshop on Online Social Networks, Seattle, WA, USA (pp. 25–30).
Mitzenmacher, M. (2004). A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics, 1(2), 226–251.
Mohelsky, H. (1985). Madness as a Symbol of the Crowd and its Implications for Psychosocial Rehabilitation. Psychosocial Rehabilitation Journal, 8(4), 42–48.
Morrison, P. (2007). Why Are They Tagging, and Why Do We Want Them To? Bulletin of the ASIST, 34(1), 12–15.
Morville, P. (2005). Ambient Findability: What We Find Changes Who We Become. Beijing: O'Reilly.
Müller-Prove, M. (2008). Modell und Anwendungsperspektive des Social Tagging. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 15–22). Münster, New York, München, Berlin: Waxmann.
Muller, M. J. (2007a). Comparing Tagging Vocabularies among four Enterprise Tag-Based Services. In Proceedings of the International ACM Conference on Supporting Group Work, Sanibel Island, FL, USA (pp. 341–350).
Muller, M. J. (2007b). Anomalous Tagging Patterns Can Show Communities Among Users: Poster at ECSCW 2007, Limerick, Ireland, from http://www.ecscw07.org/poster-abstracts/11-Michael_MullerAnomalous%20Tagging%20Patterns.pdf.
Munk, T., & Mork, K. (2007a). Folksonomy, the Power Law and the Significance of the Least Effort. Knowledge Organization, 34(1), 16–33.
Munk, T., & Mork, K. (2007b). Folksonomies, Tagging Communities, and Tagging Strategies - An Empirical Study. Knowledge Organization, 34(3), 115–127.
Na, K., & Yang, C. (2006). Exploratory Study of Classification Tags in Terms of Cultural Influences and Implications for Social Classification. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Naaman, M. (2006). Eyes on the World. 
Computers in Libraries, 39(10), 108–111.
Neal, D. (2007). Introduction. Folksonomies and Image Tagging: Seeing the Future? Bulletin of the ASIST, 34(1), 7–11.
Newman, M. E. J. (2005). Power Laws, Pareto Distributions and Zipf's Law. Contemporary Physics, 46(5), 323–351.
Niwa, S., Doi, T., & Honiden, S. (2006). Web Page Recommender Systems Based on Folksonomy Mining for ITNG '06 Submissions. In Proceedings of the 3rd Conference on Information Technology: New Generations, Las Vegas, Nevada, USA (pp. 388–393).
Noruzi, A. (2006). Folksonomies: (Un)Controlled Vocabulary? Knowledge Organization, 33(4), 199–203.
Noruzi, A. (2007). Folksonomies: Why Do We Need Controlled Vocabulary? Webology, 4(2), from http://www.webology.ir/2007/v4n2/editorial12.html.
Nov, O., Naaman, M., & Ye, C. (2008). What Drives Content Tagging: The Case of Photos on Flickr. In Proceedings of the 26th Conference on Human Factors in Computing Systems, Florence, Italy (pp. 1097–1100).
Ohkura, T., Kiyota, Y., & Nakagawa, H. (2006). Browsing System for Weblog Articles Based on Automated Folksonomy. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Oldenburg, S., Garbe, M., & Cap, C. (2008). Similarity Cross-Analysis of Tag / Co-Tag Spaces in Social Classification Systems. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 11–18).
Ossimitz, G., & Russ, C. (2008). Growth and Sustainability in Online Social Networks. In Proceedings of I-Know and I-Media: The International Conferences on Knowledge Management and New Media Technology, Graz, Austria (pp. 150–160).
Ornager, S. (1995). The Newspaper Image Database. Empirical Supported Analysis of Users' Typology and Word Association Clusters. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA (pp. 212–218).
Pammer, V., Ley, T., & Lindstaedt, S. (2008). Tagr: Unterstützung in kollaborativen Tagging-Umgebungen durch semantische und assoziative Netzwerke. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 201–210). Münster, New York, München, Berlin: Waxmann.
Panke, S., & Gaiser, B. (2008a). "With my Head up in the Clouds" - Social Tagging aus Nutzersicht. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 23–35). Münster, New York, München, Berlin: Waxmann.
Panke, S., & Gaiser, B. (2008b). Nutzerperspektiven auf Social Tagging - Eine Online Befragung, from www.e-teaching.org/didaktik/recherche/goodtagsbadtags2.pdf.
Panofsky, E. (1975). Sinn und Deutung in der Bildenden Kunst. Köln: DuMont.
Panofsky, E. (2006). Ikonographie und Ikonologie. Köln: DuMont.
Paolillo, J. (2008). Structure and Network in the YouTube Core. In Proceedings of the 41st Hawaii International Conference on System Sciences, Hawaii, USA.
Paolillo, J., & Penumarthy, S. (2007). The Social Structure of Tagging Internet Video on del.icio.us. 
In Proceedings of the 40th Hawaii International Conference on System Sciences, Hawaii, USA.
Peters, I. (2006a). Inhaltserschließung von Blogs und Podcasts im betrieblichen Wissensmanagement. In Proceedings der 28. Online-Tagung der DGI, Frankfurt a. M., Germany (pp. 143–151).
Peters, I. (2006b). Against Folksonomies: Indexing Blogs and Podcasts for Corporate Knowledge Management. In Preparing for Information 2.0. Proceedings of Online Information Conference, London, GB (pp. 93–97).
Peters, I., & Stock, W. G. (2007). Folksonomies and Information Retrieval. In Joining Research and Practice: Social Computing and Information Science. Proceedings of the 70th ASIS&T Annual Meeting, Milwaukee, Wisconsin, USA (pp. 1510–1542).
Peters, I., & Stock, W. G. (2008). Folksonomies in Wissensrepräsentation und Information Retrieval. Information - Wissenschaft & Praxis, 59(2), 77–90.
Peters, I., & Weller, K. (2008a). Paradigmatic and Syntagmatic Relations in Knowledge Organization Systems. Information - Wissenschaft & Praxis, 59(2), 100–107.
Peters, I., & Weller, K. (2008b). Good Tags + Bad Tags. Tagging in der Wissensorganisation: Von "Baby-Tags" zu "Tag Gardening". Password, 5, 18–19.
Peters, I., & Weller, K. (2008c). Tag Gardening for Folksonomy Enrichment and Maintenance. Webology, 5(3), Article 58, from http://www.webology.ir/2008/v5n3/a58.html.
Peterson, E. (2006). Beneath the Metadata. D-Lib Magazine, 12(11).
Peterson, E. (2009). Patron Preferences for Folksonomy Tags: Research Findings When Both Hierarchical Subject Headings and Folksonomy Tags Are Used. Evidence Based Library and Information Practice, 4(1), from http://ejournals.library.ualberta.ca/index.php/EBLIP/article/view/4580.
Pind, L. (2005). Folksonomies: How We Can Improve the Tags, from http://pinds.com/2005/01/23/folksonomies-how-we-can-improve-the-tags.
Plangprasopchok, A., & Lerman, K. (2008). Constructing Folksonomies from User-specified Relations on Flickr, from http://arxiv.org/PS_cache/arxiv/pdf/0805/0805.3747v1.pdf.
Pluzhenskaia, M. (2006). Folksonomies or Fauxonomies: How Social is Social Bookmarking? In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Quasthoff, M., Sack, H., & Meinel, C. (2008). Nutzerfreundliche Internet-Sicherheit durch tag-basierte Zugriffskontrolle. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 211–221). Münster, New York, München, Berlin: Waxmann.
Quintarelli, E. (2005). Folksonomies: Power to the People. In Proceedings of the ISKO Italy-UniMIB Meeting, Milan, Italy, 2005.
Quintarelli, E., Resmini, A., & Rosati, L. (2007). FaceTag: Integrating Bottom-up and Top-down Classification in a Social Tagging System. Bulletin of the ASIST, 33(5), 10–15.
Rader, E., & Wash, R. (2006). Tagging with del.icio.us: Social or Selfish? In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, Banff, Alberta, Canada.
Rader, E., & Wash, R. (2008). Collaborative Tagging and Information Management: Influences on Tag Choices in del.icio.us, from http://bierdoctor.com/papers/delicious-chi-logistic-final.pdf.
Rasmussen, D. 
I. (1941). Biotic Communities of Kaibab Plateau Arizona. Ecological Monographs, 11(3), 230–275.
Rasmussen, E. (1997). Indexing Images. Annual Review of Information Science and Technology, 32, 169–196.
Raymond, M. (2008). My Friend Flickr: A Match Made in Photo Heaven, from http://www.loc.gov/blog/?p=233.
Reamy, T. (2006). Folksonomies and Complexity Theory: Evolved Information Structures. In Preparing for Information 2.0. Proceedings of Online Information Conference, London, GB (pp. 111–113).
Rorissa, A. (2008). User-generated Descriptions of Individual Images versus Labels of Groups of Images: A Comparison Using Basic Level Theory. Information Processing and Management, 44(5), 1741–1753.
Rosch, E. (1975). Cognitive Reference Points. Cognitive Psychology, 532–547.
Rosenfeld, L. (2001). Looking for Metadata in All the Wrong Places: Why a Controlled Vocabulary or Thesaurus is in Your Future, from http://www.webreference.com/authoring/design/information/cv.
Rosenfeld, L. (2005). Folksonomies? How about Metadata Ecologies? from http://louisrosenfeld.com/home/bloug_archive/000330.html.
Russell, T. (2006). Cloudalicious: Folksonomy Over Time. In Proceedings of the 6th ACM/IEEE-CS joint Conference on Digital Libraries, Chapel Hill, North Carolina, USA (p. 364).
Sack, H., & Waitelonis, J. (2008). Zeitbezogene kollaborative Annotation zur Verbesserung der inhaltsbasierten Videosuche. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 107–118). Münster, New York, München, Berlin: Waxmann.
Schiefner, M. (2008). Social Tagging in der universitären Lehre. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 73–83). Münster, New York, München, Berlin: Waxmann.
Schmidt, J. (2007a). Social Software: Facilitating Information-, Identity- and Relationship Management. In T. N. Burg (Ed.), BlogTalks reloaded. Social Software - Research & Cases (pp. 31–49). Norderstedt: Books on Demand.
Schmidt, J. (2007b). Tagging und Kollektive Verschlagwortungssysteme in der Organisationskommunikation: Talk at the workshop “Social Software in der Wertschöpfung”, Stuttgart, 18.7.2006, from http://www.schmidtmitdete.de/pdf/TaggingOrganisationskommunikation2007preprint.pdf.
Schmitz, P. (2006). Inducing Ontology from Flickr Tags. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Schmitz, C. (2007). Self-organized Collaborative Knowledge Management. Kassel: Kassel Univ. Press.
Schmitz, C., Hotho, A., Jäschke, R., & Stumme, G. (2006a). Mining Association Rules in Folksonomies. In Proceedings of the IFCS Conference, Ljubljana, Slovenia (pp. 261–270).
Schmitz, C., Hotho, A., Jäschke, R., & Stumme, G. (2006b). Kollaboratives Wissensmanagement. In T. Pellegrini & A. Blumauer (Eds.), Semantic Web: Wege zur vernetzten Wissensgesellschaft (pp. 273–289). Berlin, Heidelberg: Springer.
Schmitz, C., Grahl, M., Hotho, A., Stumme, G., Cattuto, C., Baldassarri, A., et al. (2007). Network Properties of Folksonomies. 
AI Communications, 20(4), 245–262.
Schulenburg, F., Raschka, A., & Jungierek, M. (2007). Der „McDonald’s der Informationen“? Ein Blick hinter die Kulissen des kollaborativen Wissensmanagements in der deutschsprachigen Wikipedia. Bibliothek. Forschung und Praxis, 31(2), 225–229.
Sen, S., Lam, S., Rashid, A., Cosley, D., Frankowski, D., Osterhouse, J., Harper, M., & Riedl, J. (2006). tagging, communities, vocabulary, evolution. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, Banff, Alberta, Canada (pp. 181–190).
Shatford, S. (1986). Analyzing the Subject of a Picture. A Theoretical Approach. Cataloguing and Classification Quarterly, 6(3), 39–62.
Shen, K., & Wu, L. (2005). Folksonomy as a Complex Network, from http://arxiv.org/PS_cache/cs/pdf/0509/0509072v1.pdf.
Shepard, R. N. (1964). On Subjectively Optimum Selection among Multi-Attribute Alternatives. In M. W. Shelley & G. L. Bryan (Eds.), Human Judgments and Optimality. New York: Wiley.
Shirky, C. (2003). Power Laws, Weblogs and Inequality: Diversity plus Freedom of Choice Creates Inequality. In J. Engeström, M. Ahtisaari & A. Nieminen (Eds.), Exposure: From Friction to Freedom (pp. 77–81). USA: Aula.
Shirky, C. (2004). Folksonomy, from http://many.corante.com/archives/2004/08/25/folksonomy.php.
Shirky, C. (2005a). Ontology is Overrated: Categories, Links, and Tags, from http://www.shirky.com/writings/ontology_overrated.html.
Shirky, C. (2005b). Semi-Structured Meta-Data Has a Posse: A Response to Gene Smith, from http://tagsonomy.com/index.php/semi-structured-meta-data-has-a-posse-a-response-to-gene-smith.
Sigurbjörnsson, B., & van Zwol, R. (2008). Flickr Tag Recommendation Based on Collective Knowledge. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 327–336).
Sinclair, J., & Cardew-Hall, M. (2008). The Folksonomy Tag Cloud: When Is It Useful? Journal of Information Science, 34(1), 15–29.
Sinha, R. (2005). A Cognitive Analysis of Tagging (Or How the Lower Cognitive Cost of Tagging Makes It Popular), from http://rashmisinha.com/2005/09/27/a-cognitive-analysis-of-tagging.
Small, H. (1973). Co-citation in the Scientific Literature: A New Measure of the Relationship between Two Documents. Journal of the American Society for Information Science, 24, 265–269.
Smith, G. (2004). Folksonomy: Social Classification, from http://atomiq.org/archives/2004/08/folksonomy_social_classification.html.
Smith, M. (2006). Viewer Tagging in Art Museums: Comparisons to Concepts and Vocabularies of Art Museum Visitors. In Proceedings of the 17th ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Smith, G. (2008a). Tagging: Emerging Trends. Bulletin of the ASIST, 34(6), 14–17.
Smith, G. (2008b). Tagging. People-powered Metadata for the Social Web. Berkeley: New Riders.
Specia, L., & Motta, E. (2007). Integrating Folksonomies with the Semantic Web. Lecture Notes in Computer Science, 4519, 624–639. 
Spiteri, L. (2006a). The Use of Collaborative Tagging in Public Library Catalogues. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA. Spiteri, L. (2006b). The Use of Folksonomies in Public Library Catalogues. The Serials Librarian, 51(2), 75–89. Spiteri, L. (2007). Structure and Form of Folksonomy Tags: The Road to the Public Library Catalogue. Webology, 4(2), from http://www.webology.ir/2007/v4n2/ a41.html. Spyns, P., de Moor, A., Vandenbussche, J., & Meersman, R. (2006). From Folksologies to Ontologies: How the Twain Meet. Lecture Notes in Computer Science, 4275, 738–755. Staab, S., & Studer, R. (2004). Handbook on Ontologies. Berlin, Heidelberg, New York: Springer. Star, S. (1996). Slouching toward Infrastructure. (Digital Libraries Conference Workshop). Steels, L. (2004). The Evolution of Communication Systems by Adaptive Agents. In E. Alonso, D. Kudenko & D. Kazakov (Eds.), Adaptive Agents and Multi-Agent Systems. LNAI, 2636 (pp. 125–140). Berlin: Springer.
Steels, L. (2006). Collaborative Tagging as Distributed Cognition. Pragmatics & Cognition, 14(2), 287–292.
Sterling, B. (2005). Order Out of Chaos. Wired Magazine, 13(4), from http://www.wired.com/wired/archive/13.04/view.html?pg=4.
Stock, W. G. (2006). On Relevance Distributions. Journal of the American Society for Information Science and Technology, 57(8), 1126–1129.
Stock, W. G. (2007a). Information Retrieval. Informationen suchen und finden. München, Wien: Oldenbourg.
Stock, W. G. (2007b). Folksonomies and Science Communication: A Mash-Up of Professional Science Databases and Web 2.0 Services. Information Services & Use, 27, 97–103.
Stock, W. G., & Weber, S. (2006). Facets of Informetrics. Information Wissenschaft & Praxis, 57(8), 385–389.
Stock, W. G., & Stock, M. (2008). Wissensrepräsentation. Informationen auswerten und bereitstellen. München: Oldenbourg.
Stuckenschmidt, H., & van Harmelen, F. (2005). Information Sharing on the Semantic Web. Berlin, Heidelberg, New York: Springer.
Sturtz, D. (2004). Communal Categorization: The Folksonomy, from http://www.davidsturtz.com/drexel/622/sturtz-folksonomy.pdf.
Subramanya, S. B., & Liu, H. (2008). SocialTagger - Collaborative Tagging for Blogs in the Long Tail. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 19–26).
Surowiecki, J. (2005). The Wisdom of Crowds: Why the Many are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies, and Nations. New York: Anchor Books.
Szekely, B., & Torres, E. (2005). Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development, from http://torrez.us/archives/2005/07/13/tagrank.pdf.
Tennis, J. (2006). Social Tagging and the Next Steps for Indexing. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Terdiman, D. (2005). Folksonomies Tap People Power, from http://www.wired.com/science/discoveries/news/2005/02/66456.
Thom-Santelli, J., Muller, M. J., & Millen, D. R. (2008). Social Tagging Roles: Publishers, Evangelists, Leaders. In Proceedings of the 26th Conference on Human Factors in Computing Systems, Florence, Italy (pp. 1041–1044).
Tonkin, E. (2006). Searching the Long Tail: Hidden Structure in Social Tagging. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Tonkin, E., Tourte, G. J. L., & Zollers, A. (2008). Performance Tags - Who's Running the Show? In Proceedings of the 19th Workshop of the American Society for Information Science and Technology Special Interest Group in Classification Research, Columbus, Ohio.
Tonkin, E., Corrado, E. M., Moulaison, H. L., Kipp, M. E. I., Resmini, A., & Pfeiffer, H. D., et al. (2008). Collaborative and Social Tagging Networks. Ariadne, 54, from http://www.ariadne.ac.uk/issue54/tonkin-et-al.
Torniai, C., Battle, S., & Cayzer, S. (2007). Sharing, Discovering and Browsing Geotagged Pictures on the Web, from http://www.hpl.hp.com/personal/Steve_Cayzer/downloads/papers/geospatial_final.pdf.
Toyama, K., Logan, R., & Roseway, A. (2003). Geographic Tags on Digital Images. In Proceedings of the 11th International Conference on Multimedia, Berkeley, CA, USA (pp. 156–166).
Trant, J. (2006a). Exploring the Potential for Social Tagging and Folksonomy in Art Museums: Proof of Concept. New Review of Hypermedia and Multimedia, 12(1), 83–105.
Trant, J. (2006b). Social Classification and Folksonomy in Art Museums: Early Data from the Steve.Museum Tagger Prototype. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Turnbull, D., Barrington, L., & Lanckriet, G. (2008). Five Approaches to Collecting Tags for Music. In Proceedings of the 9th International Conference on Music Information Retrieval, Philadelphia, USA (pp. 225–230).
Udell, J. (2004). Collaborative Knowledge Gardening: With Flickr and del.icio.us, Social Networking Goes Beyond Sharing Contacts and Connections, from http://www.infoworld.com/article/04/08/20/34OPstrategic_1.html.
van Damme, C., Hepp, M., & Siorpaes, K. (2007). FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies. In Proceedings of the European Semantic Web Conference, Innsbruck, Austria (pp. 71–85).
van Damme, C., Coenen, T., & Vandijck, E. (2008). Turning a Corporate Folksonomy into a Lightweight Corporate Ontology. In Proceedings of the 11th International Conference on Business Information Systems, Innsbruck, Austria (pp. 36–47).
van Hooland, S. (2006). From Spectator to Annotator: Possibilities Offered by User-Generated Metadata for Digital Cultural Heritage Collections. In Proceedings of the CILIP Cataloguing & Indexing Group Annual Conference, Norwich, Great Britain.
Vander Wal, T. (2005a). Explaining and Showing Broad and Narrow Folksonomies, from http://www.vanderwal.net/random/entrysel.php?blog=1635.
Vander Wal, T. (2005b). Folksonomy Explanations, from http://www.vanderwal.net/random/entrysel.php?blog=1622.
Vander Wal, T. (2007). Folksonomy: Folksonomy Coinage and Definition, from http://www.vanderwal.net/folksonomy.html.
Vander Wal, T. (2008). Welcome to the Matrix! In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 7–9). Münster, New York, München, Berlin: Waxmann.
Veres, C. (2006). The Language of Folksonomies: What Tags Reveal about User Classification. Lecture Notes in Computer Science, 3999, 58–69.
Vollmar, G. (2007). Knowledge Gardening: Wissensarbeit in intelligenten Organisationen. Bielefeld: Bertelsmann.
Voß, J. (2008). Vom Social Tagging zum Semantic Tagging. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 175–186). Münster, New York, München, Berlin: Waxmann.
Voß, J. (2006). Collaborative Thesaurus Tagging the Wikipedia Way, from http://arxiv.org/ftp/cs/papers/0604/0604036.pdf.
Vuorikari, R., & Ochoa, X. (2009). Exploratory Analysis of the Main Characteristics of Tags and Tagging of Educational Resources in a Multi-lingual Context. Journal of Digital Information, 10(2), from http://journals.tdl.org/jodi/article/view/447/284.
Wang, J., & Davison, B. D. (2008). Explorations in Tag Suggestion and Query Expansion. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 43–50).
Wash, R., & Rader, E. (2006). Collaborative Filtering with del.icio.us, from http://bierdoctor.com/papers/delicious_chi2006_wip_updated.pdf.
Watts, D. J., & Strogatz, S. H. (1998). Collective Dynamics of 'Small-world' Networks. Nature, 393, 440–442.
Weiber, R. (2002). Die empirischen Gesetze der Netzwerkökonomie. Auswirkungen von IT-Innovationen auf den ökonomischen Handlungsrahmen. Die Unternehmung, 56(5), 269–294.
Weinberger, D. (2005). Tagging and Why It Matters, from http://www.cyber.law.harvard.edu/home/uploads/507/07-whyTaggingMatters.pdf.
Weiss, A. (2005). The Power of Collective Intelligence. netWorker, 9(3), 16–23.
Weller, K., & Peters, I. (2007). Reconsidering Relationships for Knowledge Representation. In Proceedings of I-KNOW, International Conference on Knowledge Management, Graz, Austria (pp. 493–496).
Weller, K., & Peters, I. (2008). Seeding, Weeding, Fertilizing: Different Knowledge Gardening Activities for Folksonomy Maintenance and Enrichment. In Proceedings of I-Semantics, International Conference on Semantic Systems, Graz, Austria (pp. 110–117).
Weller, K., & Stock, W. G. (2008). Transitive Meronymy. Automatic Concept-based Query Expansion Using Weighted Transitive Part-whole Relations. Information Wissenschaft & Praxis, 59(3), 165–170.
Wierzbicka, A. (1984). Apples are not a "Kind of Fruit": The Semantics of Human Categorization. American Ethnologist, 313–328.
Winget, M. (2006). User-Defined Classification on the Online Photo Sharing Site Flickr…or, How I Learned to Stop Worrying and Love the Million Typing Monkeys. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Wu, X., Zhang, L., & Yu, Y. (2006). Exploring Social Annotations for the Semantic Web. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 417–426).
Wu, H., Zubair, M., & Maly, K. (2006). Harvesting Social Knowledge from Folksonomies. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 111–114).
Xu, C., & Chu, H. (2008). Social Tagging in China and the USA: A Comparative Study. In Proceedings of the 71st ASIS&T Annual Meeting. People Transforming Information - Information Transforming People. Columbus, Ohio, USA.
Xu, Z., Fu, Y., Mao, J., & Su, D. (2006). Towards the Semantic Web: Collaborative Tag Suggestions, from http://www.semanticmetadata.net/hosted/taggingws-www2006-files/13.pdf.
Yanbe, Y., Jatowt, A., Nakamura, S., & Tanaka, K. (2007). Can Social Bookmarking Enhance Search in the Web? In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, Vancouver, BC, Canada (pp. 107–116).
Zacharias, V., & Braun, S. (2007). SOBOLEO - Social Bookmarking and Lightweight Ontology Engineering. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada.
Zadeh, L. A. (1994). Fuzzy Logic, Neural Networks, and Soft Computing. Communications of the ACM, 37(3), 77–84.
Zadeh, L. A. (1996). Fuzzy Logic = Computing with Words. IEEE Transactions on Fuzzy Systems, 2, 103–111.
Zerdick, A. et al. (2001). Die Internet-Ökonomie. Strategien für die digitale Wirtschaft (3rd ed.). Berlin: European Communication Council.
Zhang, L., Wu, X., & Yu, Y. (2006). Emergent Semantics from Folksonomies: A Quantitative Study. Lecture Notes in Computer Science, 4090, 168–186.
Zheng, H., Wu, X., & Yu, Y. (2008). Enriching WordNet with Folksonomies. Lecture Notes in Computer Science, 5012, 1075–1080.
Zollers, A. (2007). Emerging Motivations for Tagging: Expression, Performance, and Activism. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada.
Chapter 4
Information Retrieval with Folksonomies
Collaborative information services generate a large quantity of information, spread out over different resources and integrated in different platforms. To make adequate use of this mass of information, and to extract it from the platforms, users must be provided with suitable tools and knowledge, as Brooks and Montanez (2006a) have stated with regard to the blogosphere: “As with any new source of information, as more people begin blogging, tools are needed to help users organize and make sense of all the blogs, blogger and blog entries in the blogosphere” (Brooks & Montanez, 2006a, 9). After all, the best information is useless if the users cannot find it:

Just as people have learned with internet search, the information has to be structured properly. The model of information consumption relies on the information being found. Today information is often found through search and information aggregators and these trends seem to be the foundation of information use of tomorrow (Vander Wal, 2004).
Folksonomies in information retrieval thus serve mainly to provide access to the information resources for all users, where the initial motivation for tagging is often of a personal nature: In my mind, at least initially, it really has do with retaining information that you, I, or anyone else has found and organizing it in a way so that people can re-find that information at some later date. Tagging bookmarks is really another way of trying to keep source […] once found on the Internet permanently found (Gordon-Murnane, 2006, 28).
Especially for non-textual information resources, such as images or videos, metadata in the form of tags, titles or descriptions are invaluable, since state-of-the-art search engine technology can currently process only textual information, rendering these resources irretrievable otherwise: “Not to mention the fact that a lot of content is made up of pictures, blog entries, personal Web sites, and other personal data that can’t be easily crawled by searchbots, which only understand text” (Dye, 2006, 38). In such cases, it is also important that the resources’ retrievability is guaranteed by a broad access vocabulary, which is where a folksonomy can provide excellent services: “Users may need access to images based on such primitive features such as color, texture or shape, or access to images based on abstract concepts and symbolic imagery” (Goodrum & Spink, 2001, 296). Due to their strong user orientation and broad access vocabulary, folksonomies can facilitate the retrieval of relevant information in company intranets or in ‘enterprise search’ as well (Schillerwein, 2008). Nevertheless, even information retrieval with folksonomies must focus on finding relevant resources and avoiding ballast; in other words, it should ideally
reach “the holy grail of researchers,” which lies between Recall and Precision, by using computer-guided tools: “We need technology to help us wade through all the information to find the items we really want and need, and to rid us of the things we do not want to be bothered with” (Shardanand & Maes, 1995, 210). Folksonomies are located in the area of tension between two poles: on the one hand, traditional methods of knowledge representation and information retrieval, which are distinguished mainly by intense structuring efforts and require structural information; on the other, general resource storage, which can provide only limited support to the user in his search for relevant information. At the same time it is evident that searching within collaborative information services, or more broadly speaking, in Web 2.0, sets (and has to set) other priorities, which requires an adjustment of the traditional search options. These priorities can be summarized under the keyword ‘social search’ (Brusilovsky, 2008), which is defined as follows by Evans and Chi (2008): “‘Social search’ is an umbrella term used to describe search acts that make use of social interactions with others. These interactions may be explicit or implicit, co-located or remote, synchronous or asynchronous” (Evans & Chi, 2008, 83). Since the term is so broad, pretty much all retrieval options that folksonomies support can be summarized by it. Smith et al. (2008) formulate six demands that information retrieval within collaborative information services and via folksonomies must meet:

1) Filter: the social context of the user as well as his search, clicking and buying histories have to be exploited for the compilation of hit lists.
2) Rank: the social context of the user as well as his search, clicking and buying histories should also be exploited for the search results’ ranking. Furthermore, the popularity and believability of single users should be used to balance the ranking.
3) Disambiguate: the social context of the user as well as his search, clicking and buying histories can be considered for the adjustment of the search results and thus provide relevant results.
4) Share: the user’s self-produced and -published content can be exploited for searches.
5) Recommend: the social context of the user as well as his search, clicking and buying histories and his content can be used to recommend products, friends, information etc. without the user having to actively search for it.
6) Match Make: the recommendation and finding of similar users serves not only to start romantic relationships, but also to make business contacts and communicate with team members, experts in certain areas etc. and should not be neglected.

The chapter at hand investigates the applicability of folksonomies in information retrieval, particularly their functionalities for relevance ranking, the purpose of their visualizations, and their advantages and disadvantages. Here I will check to what extent folksonomies meet the above demands of ‘social search’ and whether the following assumption concerning folksonomies is accurate: “The result is an uncontrolled and chaotic set of tagging terms that do not support searching as effectively as more controlled vocabularies do” (Guy & Tonkin, 2006). Since the area of ‘information retrieval’ comprises the searching and finding of information in a very general way, and since there are many search paths for finding information, I will address these first and then, building on this, observe the particularities of folksonomies in information retrieval. Another section is dedicated to the relation
between information retrieval and knowledge representation, in order to stress the necessity of indexing measures for information retrieval. Since information-scientific research in the area of folksonomies and information retrieval is not very advanced, recourse must often be made to findings from computer science (e.g. Hotho et al., 2006c; Zhou et al., 2008).
The Relation between Knowledge Representation and Information Retrieval

The goal and the task of information retrieval is to find relevant information in a suitable period of time and with a suitable work effort. This is not always easy due to the exponentially increasing mass of data and their circulation on the World Wide Web, and it is definitely not to be achieved without tools: “Since more than 10 years the internet offers ways to publish and store (hyper-)text in a distributed way and increases the problem of retrieving the right piece of content at the right time” (Lux, Granitzer & Kern, 2007). A truly effective information retrieval would exploit the groundwork laid down by knowledge representation. After all, in information retrieval we can only reap what we have sown in knowledge representation:

Content indexing means the totality of methods and tools for describing the content of documents. Here documents are enriched with single words and/or entire sentences that represent their content in a compressed way. This facilitates their retrievability, enables fast access and speeds up the relevance decision. Content indexing is thus never the end in itself but acquires its meaning by providing users with access to and orientation over the content of documents. In this way, it is an important sub-process of information retrieval (Bertram, 2005, 18f.)*.
The task of knowledge representation is to represent the information resource’s content via a vocabulary that is known to both users and indexers (Tudhope & Nielsen, 2006). This task is not unproblematic:

In an information retrieval system, there are at least two, and possibly many more vocabularies present […]. These could include that of the user of the system, the designer of the system, the author of the material, the creators of the classification system; translating between these vocabularies is often a difficult and defining issue in information systems (Mathes, 2004).
Merholz (2005) tells of an argument with Clay Shirky, the folksonomy proponent, on this subject. According to Merholz (2005), Shirky holds that a user will find any resource, as long as there is one other person who tags in the same way as that user. Merholz’s (2005) response: “If all I’m doing is trying to find people who tag the way I do, my exposure to the world of information is going to be awfully awfully constrained.” The alignment of the different vocabularies provides access to the resources and makes information retrieval possible in the first place (Chopin, 2008; Chi & Mytkowicz, 2006). The problem of linguistic variation applies to both approaches: in knowledge representation concerning the allocation of adequate concepts and terms, in information retrieval with regard to the finding of adequate search terms, or the knowledge of the used indexing terms. Reduced to a simple formula, the user’s problem is this: “Find what I mean and not what I say!” (Feldman, 2000).
In this context, Furnas et al. (1987) study the relation of the user-computer dialog concerning the entering of search terms or computer commands and identify the ‘vocabulary problem:’ “Many functions of most large systems depend on users typing in the right words. New or intermittent users often use the wrong words and fail to get the actions or information they want. This is the vocabulary problem” (Furnas et al., 1987, 964). Mayr (2006) summarizes this problem under the term ‘vagueness:’ “A central problem of information retrieval is the vagueness between a user’s request and the indexing terms that describe the content of the documents stored in the information system.” (Mayr, 2006, 152)*. Since the searching users do not know the right (indexing) terms, they cannot find access to the information resources. For the information retrieval of non-textual information, this difficulty becomes even more complex, as Goodrum and Spink (2001) emphasize: Problems arise when either documents or information needs cannot be expressed in a manner that will provide congruence between the representation and its referent. In the case of multimedia searching, there are problems in representing image information needs with textual queries, and with representing retrieved images as short abstracts (Goodrum & Spink, 2001, 308).
The mass and variety of the information resources do not make information retrieval easier either, since they result in an equally large variety of terms. Neither searcher nor indexer can know the precise terms in advance, unless of course a controlled vocabulary is being used. This, however, is often hard to learn for the average user. Furnas et al. (1987) summarize:

Simply stated, the data tell us there is no one good access term for most objects. The idea of an “obvious”, “self-evident”, or “natural” term is a myth! Since even the best possible name is not very useful, it follows that there can exist no rules, guidelines or procedures for choosing a good name, in the sense of “accessible to the unfamiliar user” (Furnas et al., 1987, 967).
Furthermore, controlled vocabularies often appear constructed and artificial and do not reflect users’ language use, as Chen (1994) stresses: “The designers’ chosen vocabularies, which are often quite different from the users’ preferred terms, can cause serious communication breakdown and an interaction bottleneck” (Chen, 1994, 58). This is where folksonomies enter the fray. Folksonomies can reflect a broad range of linguistic terms and concepts and thus broaden the access paths to information resources. They are capable of bridging the ‘semantic gap’ between different groups of users and thus improve the power of information retrieval in a fundamental way (Jörgensen, 2007). Furnas et al. (1987) pursue a quantity-oriented approach and encourage designers of information retrieval systems to allow for as many search terms as possible: Clearly the only hope for untutored vocabulary driven access is to provide many, many alternate entry terms. Thus aliases are, indeed, the answer, but only if used on a much larger scale than usually considered. The tendency to underestimate the need for so many aliases results, we conjecture, from the fact that any one person (for example, the designer) usually can think of only a handful of the terms that other people would consider appropriate. […] In terms of the designer’s usual system-oriented standpoint, the proposal here is to allow essentially unlimited numbers of aliases (Furnas et al., 1987, 968).
To a certain degree this approach runs counter to the information-scientific and, above all, documentary practice that used to be prevalent, and which rather “discourages indexing redundancy and favors conciseness and precision” (Chen, 1994, 59). It is important to emphasize that the ‘vocabulary problem’ is not new, nor did it first make itself felt with the creation of collaborative information services. The problem of the vagueness of adequate terms affects any attempt to introduce a meta-level meant to represent the referred objects. Mayr (2006) describes this general problem as follows:

If you juxtapose the databases and compare what terms have been used to index identical documents, or documents with similar content, you will note that, depending on the indexing vocabulary, different indexing terms, which can vary in their semantic precision, are being used for identical semantic concepts. […] (Mayr, 2006, 153)*.
Through their liberal character, folksonomies are supposed to represent an option for counter-acting this problem of limited description variety, and for broadening access paths to the information resources via their collective indexing.
Searching vs Browsing vs Retrieving

Information retrieval is about the searching and, ideally, the finding of information. The way towards determining any specific information is not prescribed (Jiang & Koshman, 2008), but the starting point of any search is fixed: the user always has a certain need for information, or, in Belkin, Oddy and Brooks’ (1982) terminology, an ‘Anomalous State of Knowledge’ (ASK). In order to satisfy his need, he can pursue different paths, which can roughly be divided into two approaches: 1) the pull approach and 2) the push approach. The pull approach requires the user to articulate his information need, translate it into a search formula and then actively search for the information within the information system via this formula. The push approach differs from the pull approach in that, whereas the user still has to articulate his need and formulate a search formula, the information system will search the incoming data stream for this information on its own and then pass the results on to the user, requiring no further action on his part. Both approaches of information retrieval are implemented within collaborative information services with the help of folksonomies (Begelman, Keller & Smadja, 2006). The pull approach is represented by active searches via tags or by the different visualizations of folksonomies, e.g. tag clouds, and the push approach by the subscription to RSS feeds on the basis of tags or users. An active search via tags, i.e. entering search tags into a search mask, works the same way as in online search engines or library catalogs: the search term is compared to the indexing terms or, depending on the case, to the terms in the full-text, and if there is a match, the resource in question is yielded as a search result.
Searches with folksonomies differ fundamentally from searches using controlled vocabularies when it comes to the tags’ design: there are no rules, and the users alone determine the tags. Folksonomies as search tools are located directly between the poles in the continuum of elaborate KOS and ‘loose’ full-texts, which is why Gruber (2005) considers them the perfect search tool: “For the task of finding information, taxonomies are too rigid and purely text-based search is too weak” (Gruber, 2005).
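The pull and push approaches described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular service: the class `TagIndex`, its method names and the lower-casing of tags are assumptions made purely for illustration. Pull is modelled as a lookup of search tags against the indexing tags, push as a notification to users who have subscribed to a tag (standing in for an RSS feed).

```python
from collections import defaultdict


class TagIndex:
    """Toy tag index (hypothetical): pull via search(), push via subscribe()."""

    def __init__(self):
        self._resources_by_tag = defaultdict(set)
        self._subscribers_by_tag = defaultdict(set)
        # user -> list of (tag, resource) notifications, a stand-in for a feed
        self.inbox = defaultdict(list)

    @staticmethod
    def _normalize(tag):
        # Assumption: tags are compared case-insensitively.
        return tag.strip().lower()

    def subscribe(self, user, tag):
        """Push: the user registers an interest once and then waits."""
        self._subscribers_by_tag[self._normalize(tag)].add(user)

    def add(self, resource, tags):
        """Index a resource and notify subscribers of each of its tags."""
        for tag in {self._normalize(t) for t in tags}:
            self._resources_by_tag[tag].add(resource)
            for user in sorted(self._subscribers_by_tag[tag]):
                self.inbox[user].append((tag, resource))

    def search(self, *tags):
        """Pull: return the resources indexed with all of the search tags."""
        sets = [self._resources_by_tag.get(self._normalize(t), set()) for t in tags]
        return set.intersection(*sets) if sets else set()


index = TagIndex()
index.subscribe("alice", "folksonomy")
index.add("doc1", ["Folksonomy", "tagging"])
index.add("doc2", ["tagging", "flickr"])

print(index.search("tagging", "flickr"))  # active (pull) search
print(index.inbox["alice"])               # results pushed to the subscriber
```

A search tag that no user has ever assigned simply yields an empty result, which is exactly the vocabulary mismatch discussed in the previous section.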
Folksonomies make use of the language and views of users, doing without a restricted indexer or KOS vocabulary. Thus they can broaden access paths to the resources in collaborative information services and, in the case of non-textual resources, such as images, make them accessible in the first place (Jörgensen, 2007). The general difficulty in information retrieval is in the aspect of ‘translating the information need into a search formula.’ The search vocabulary must match the indexing vocabulary, otherwise no resource will be found: In information retrieval systems, the keywords that are assigned by indexers are often at odds with those tried by searchers. The seriousness of the problem is indicated by the need for professional intermediaries between users and systems, and by disappointingly low performance (recall) rates (Furnas et al., 1987).
Folksonomies can solve this problem to a certain degree, due to their liberal design. Collaboration in Web 2.0 contributes to the problem-solving, as Hammond et al. (2005) emphasize: This provision of rich, structured metadata means that the user is provided with an accurate third-party identification of a document, which could be used to retrieve that document, but is also free to search on user-supplied terms so that documents of interest (or rather, references to documents) can be made discoverable and aggregated with other similar descriptions either recorded by the user or by other users (Hammond et al., 2005).
Weiss (2005) also regards the collaborative creation of folksonomies as well as the collaborative indexing of information resources as folksonomies’ greatest advantages, particularly for browsing: “Flickr demonstrates how aggregate sharing of fragments can create value, ultimately becoming more useful than the sum of its parts” (Weiss, 2005, 22). Crawford (2006) does not rate retrieval via tags quite as highly, and criticizes the very unpredictability of folksonomies, which comes to bear on the duration of searches in particular, as folksonomies “[are] lowering the cost for those who might wish to identify, but increasing the cost (in time) for those searching” (Crawford, 2006). Cox, Clough and Marlow (2008) argue that collaborative information services such as Flickr do not aim towards the intentional search for resources at all, and that tags are not necessary for research, since other textual information is available in abundance: Purposive searching would imply a specific information need, such as for a photo of a cat on a fence, and Flickr is probably more used for browsing. Furthermore, searches can also be in text, in titles and descriptions, so tagging is not critical to retrieval (Cox, Clough, & Marlow, 2008, 10).
If, on the other hand, tags are regarded as metadata providing an additional description of the resource’s content, thus re-approaching the basic idea of controlled vocabularies, then folksonomies are actually way ahead of online search engines and their full-text search. Wash and Rader (2006) summarize: The average site has only 26% of its tags appearing in the webpage at all. We believe that this evidence that the tags provide useful metadata that is not directly available in the webpage. […] This allows users to search and filter using words that do not appear in the target document, something normal internet search engines are not very good at (Wash & Rader, 2006).
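The kind of measurement reported by Wash and Rader (2006) can be sketched as follows. The source does not describe how they matched tags against page text, so the exact whole-word, case-insensitive comparison used here, as well as the function name `tag_overlap`, are assumptions for illustration only.

```python
import re


def tag_overlap(tags, page_text):
    """Share of a page's distinct tags that also occur as words in its text."""
    words = set(re.findall(r"\w+", page_text.lower()))
    distinct_tags = {t.lower() for t in tags}
    if not distinct_tags:
        return 0.0
    return len(distinct_tags & words) / len(distinct_tags)


share = tag_overlap(
    ["python", "tutorial", "programming", "toread"],
    "A short Python tutorial for beginners.",
)
print(share)  # 0.5: two of the four tags appear in the text
```

Tags that do not occur in the text (here `programming` and `toread`) are the ones that add access paths a full-text search cannot offer.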
The subject ‘searching with folksonomies’ is often approached from a different angle in scientific discussion. Tags cannot, of course, match the expressive power of terms taken from controlled vocabularies, but they are better suited to another retrieval strategy, browsing: “it therefore follows that the low-specific nature of tags makes them more suitable to support browsing than querying” (Hassan-Montero & Herrero-Solana, 2006). In information retrieval, browsing is the term given to searching for information by following hypertext structures, i.e. links on the web, or references in textual resources. Searching is not accomplished by entering a search term and viewing a hit list but, starting off from a resource, by clicking on the links provided: “Discovery is using the tag as an anchor in finding interesting resources” (Furnas et al., 2006, 37). In collaborative information services and tagging systems, the terms ‘pivot browsing’ (Millen, Feinberg, & Kerr, 2006; Sinha, 2006) and ‘exploratory search’ (Millen, Whittaker, & Feinberg, 2007) are repeatedly used for information retrieval. The concept of ‘serendipity’ (Mathes, 2004; Auray, 2007; Tonkin et al., 2008) is closely related to the retrieval strategy of browsing. Serendipity refers to ‘a happy accident,’ i.e. the finding of information that the user was not looking for but which is of interest to him anyway. The term was first introduced by Merton (Merton, 1949; Merton & Barber, 2004), referring to a Persian fairytale. In collaborative information services, pivot browsing can be performed via the tie points of tags (leading the user to all resources indexed with these tags), persons/users (leading the user to a person’s profile as well as to their resources and tags) or resources (leading the user to the resource itself, to the tags it has been indexed with and to the persons who have also saved the resource) (Hotho et al., 2006a; Hotho et al., 2006c; Choy & Lui, 2006).
Browsing is mainly supported by the various visualization methods of folksonomies, e.g. tag clouds. Feinberg (2006) compares browsing and clicking on tags within collaborative information services with the scanning or the following of citation chains in scientific documents for the purpose of finding further relevant resources. For information retrieval via browsing, folksonomies’ typical ‘small world’ feature (Watts & Strogatz, 1998) is particularly helpful (Hotho, 2008; Mislove et al., 2008), as Schmitz et al. (2007) observe in the social bookmarking service del.icio.us: In practice this means that on average, every user, tag, or resource within del.icio.us can be reached within 3.5 mouse clicks from any given del.icio.us page. This might help to explain why the concept of serendipitous discovery of contents plays such a large role in the folksonomy community – even if the folksonomy grows to millions of nodes, everything in it is still reachable within few hyperlinks (Schmitz et al., 2007).
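The idea of counting mouse clicks between nodes can be illustrated with a breadth-first search over a small graph. This is only a generic sketch of an average shortest path length, not the measurement procedure of Schmitz et al. (2007); the graph and all node names are invented.

```python
from collections import deque
from itertools import combinations

# Hypothetical undirected link graph; nodes may be users, tags or resources.
GRAPH = {
    "user:anna": {"tag:web", "res:r1"},
    "tag:web": {"user:anna", "res:r1", "res:r2"},
    "res:r1": {"user:anna", "tag:web"},
    "res:r2": {"tag:web", "user:ben"},
    "user:ben": {"res:r2"},
}


def clicks_between(graph, start, goal):
    """Breadth-first search: minimum number of link clicks from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # goal cannot be reached from start


def average_path_length(graph):
    """Mean number of clicks over all pairs of nodes that are connected."""
    distances = [
        clicks_between(graph, a, b) for a, b in combinations(sorted(graph), 2)
    ]
    reachable = [d for d in distances if d is not None]
    return sum(reachable) / len(reachable) if reachable else 0.0
```

In a small-world network this average stays low even when the number of nodes grows considerably, which is what makes long browsing paths unnecessary.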
The design of tagging systems in particular, and the representation of folksonomies within them, can influence and at best support browsing. Heymann and Garcia-Molina (2006) discuss the representations of folksonomies prevalent today and their options for starting a search:
Currently, users can browse the objects in a tagging system using three main views:
1. A list of all objects which are tagged with a given tag (or possibly a combination of two or more tags).
2. A list of the most popular tags in the system.
3. A list of tags which have a high degree of overlap with a tag the user is currently investigating (Heymann & Garcia-Molina, 2006).
Pivot browsing through the resources, tags and users of the collaborative information services leads the user to the desired information. This procedure can be relatively time-consuming, as the user keeps getting distracted by new resources or better tags, if he is not promptly directed to his goal. A direct click on tags in a tag cloud must be excluded here, though, since it is not fundamentally different from an active search via tags – the result of the search stays the same: in both cases, the user is provided with the resources that have been indexed with the tag in question. Searching and browsing differ in the cognitive effort required of the user in order to satisfy his information need. Xu et al. (2006) hold that active searches are less demanding: Tagging bridges some gap between browsing and search. Browsing enumerates all objects and finds the desirable one by exerting the recognition aspect of human brain, whereas search uses association and dives directly to the interested objects, and thus is mentally less obnoxious (Xu et al., 2006).
Li et al. (2007) want to apply aspects of ‘emergent semantics’ (see chapter three) from folksonomies to browsing and thus facilitate more efficient retrieval. They offer their users semantic browsing, in which similar tags are determined via the cosine measure, and hierarchical browsing via tag quasi-hierarchies created through cluster analysis. As addressed above, the tripartite network of resources, tags and users is the factor that can best be exploited for browsing within collaborative information services. The networks’ human factor in particular seems to make browsing attractive to users. They want to link up with other users, be a part of the community and find out what resources their contacts read, save, upload or rate (Mislove et al., 2008). This aspect comes to bear on the description of another variant of browsing: ‘social browsing’ (Lerman & Jones, 2006) or ‘social navigation’ (Dieberger, 1997; Svensson, 1998; Dieberger et al., 2000). The user no longer finds his information by actively searching, but through the users of his network: Instead of relying solely on abstracted representations of the space, social navigation utilises the fact that most information navigation in the real world is performed through talking to other people. When we need to find information about an illness, we talk to our relatives, friends and medical doctors, when we are lost in a city, we approach people walking by, etc. (Forsberg, Höök, & Svensson, 1998).
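How similar tags might be determined via the cosine can be sketched as follows. The sketch assumes that each tag is represented as a vector of counts over the resources it was attached to; this representation, the sample data and the function names are assumptions made for illustration and are not taken from Li et al. (2007).

```python
import math
from collections import Counter

# Hypothetical tag vectors: tag -> how often it was attached to each resource.
TAG_VECTORS = {
    "python": Counter({"r1": 3, "r2": 1}),
    "programming": Counter({"r1": 2, "r2": 2, "r3": 1}),
    "cooking": Counter({"r4": 5}),
}


def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_tags(tag, vectors, top_n=3):
    """Return the other tags ordered by descending cosine similarity to tag."""
    scores = [
        (other, cosine(vectors[tag], vector))
        for other, vector in vectors.items()
        if other != tag
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]
```

Tags that share no resource receive a similarity of zero, while tags that are attached to the same resources in similar proportions approach a value of one.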
Since the members of a community often share the same interests, the user can assume that resources saved by members he knows will also be interesting to him: “Tags can also be a powerful tool for social navigation, helping people to share and discover new information contributed by other community members” (Sen et al., 2006b, 181). In the most extreme scenario, the user is automatically sent news of his community, so that information retrieval shifts from the pull approach to the push approach. Social navigation depends on user collaboration and activity, since the behavior of one user will influence another, thus facilitating the exchange between the two (Blank et al., 2008; Sandusky, 2006; Freyne et al., 2007; Freyne & Smyth, 2004; Brusilovsky, 2008). This is where social navigation distinguishes itself from purely spatial navigation, in which users move through the information on offer via spatial
coordinates only (e.g. up, down, next to), and semantic navigation, where the user must orient himself on the underlying semantic order (e.g. similar, more general, more specific) of the information (Dourish & Chalmers, 1994). Information retrieval via social navigation is always motivated by the users’ influence (Dourish & Chalmers, 1994), where the motivation can be expressed directly or indirectly (Millen & Feinberg, 2006). Direct motivation distinguishes itself through an intentional act on the part of the user, i.e. a user directly alerts another user to an interesting resource, whereupon that resource is explored by the latter user. Indirect motivation is defined by Millen and Feinberg (2006) as follows: “indirect social navigation is when navigational advice is inferred from historical traces left by others. Indirect social navigation involves monitoring and analyzing the behavior of a group of people” (Millen & Feinberg, 2006). Forsberg, Höök and Svensson (1998) call these two forms of social navigation ‘intentional’ and ‘unintentional.’ Indirect social navigation can be compared to prototypical browsing, the following of links. Millen and Feinberg (2006; see also Millen, Whittaker, & Feinberg, 2007) observe, for their corporate social bookmarking system ‘dogear,’ that navigating and searching via linked members of the community is users’ preferred method: One kind of evidence for social navigation would be found in the number of times individuals looked at the bookmark collections of other people. In total, 2545 (98.7%) of dogear users used tag or people links at least one time to browse and explore the bookmark collection. [...] The most frequent way to browse bookmarks is by clicking on another person’s name, followed by browsing bookmarks by selecting a specific tag from the system-wide tag cloud. 
It is considerably less common for a user to select a tag from another user’s tag cloud and there is almost no use to date of the more advanced browsing of tag intersections (Millen & Feinberg, 2006).
The members of a community trust one another and draw the greatest possible benefit from similar users and interests, and thus resources relevant to them. This information retrieval procedure on the World Wide Web reflects the natural, human behavior of social navigation in the real world: “The decision that some information might be interesting as a result of seeing the clustering of like-minded individuals around is clearly exploiting a familiar real-world situation and is based on the use of space” (Dourish & Chalmers, 1994). Forsberg, Höök and Svensson (1998) compile a catalog of rules for the working of social navigation, which comprises the following six points: 1) Integration: the tools for social navigation must be directly integrated into the work flow and be understood by the users; 2) Presence: the system must make it clear that other users have already been there: “different information artefacts (articles, annotations, etc.) can communicate the existence of a person, even if s/he is not currently there;” 3) Trust: there must be notes informing users who the other members are and whether they are to be trusted; 4) Adequacy: the use of social navigation must be adequate, i.e. not all areas are suited to it (e.g. sensitive, private data); 5) Privacy: the users must know what information can be used for social navigation – users may have to be able to delete certain information, or keep it private;
6) Personalized Navigation: the navigating options should be adjustable to the user needs. Folksonomies can implement these design aspects very well – the factors ‘Presence’ and ‘Trust’ in particular are perfectly represented by them. Tags attached to resources make it clear that other users have been there before, and viewed or rated these resources. In the real world, the quality of the product can often be guessed from its appearance or environment. Shopworn book covers (Munro, 1998), empty shelves or even the sales location (shop vs market stall) provide implicit clues to what other users think of the product or how high its quality is, e.g. empty shelves can point to a particularly popular and rare item. In the digital world, appearances cannot be reproduced in this way, which makes it harder for users to make quality judgments: “In the digital realm, problem-solvers must approach situations as though they were the first and only people ever to make use of the information” (Wexelblat & Maes, 1999, 270). In order to support the user in rating resources and to make the presence of other users visible, a multitude of methods can be used, e.g. citations, comments, clickthrough rates (Chiang, 2006), page-view counts on websites (Forsberg, Höök, & Svensson, 1998) or the display of other users’ online status (Wexelblat & Maes, 1999). This is also where the consequences of the ‘social proof’ (Golder & Hubermann, 2006) become clear: users orient themselves on other users, follow their recommendations and believe that whatever a number of other users likes will therefore be good. The same goes for the factor ‘Trust.’ Frequently cited users, or users with many tags and comments, may enjoy a higher level of trust than others. Both factors have in common that their ‘interaction history’ (Wexelblat & Maes, 1999; Hill & Hollan, 1994) becomes visible, i.e. both users and resources reveal how often they have been used, how old they are, in what context they are located etc. 
Wexelblat and Maes (1999) explain the interaction history theory with the following example: Interaction history is the difference between buying and borrowing a book. Conventional information retrieval theory would say they were the same object, given the same words, same pictures, same organization, etc. However, the borrowed book comes with additional information such as notes in the margins, highlights and underlines, and dog-eared pages. Even the physical object reflects its history: a book opens more easily to certain places once it has been used (Wexelblat & Maes, 1999, 270).
Social navigation is not only used in collaborative information services, but also in scientific publications, beaten paths in nature (Wu, Zhang, & Yu, 2006; Mathes, 2004; Kroski, 2005; Sterling, 2005), online shopping (Svensson et al., 2000; Wexelblat & Maes, 1999) or when borrowing books (Munro, 1998), if users use the information left behind by others. Social navigation and browsing in general offer research options and options for starting a search in folksonomies and collaborative information services, in order to enable them to deal with the mass of information resources on a platform: This is a useful way to browse through the entire bookmark collection to see other information sources of interest. We call this ability to reorient the views by clicking in tags or user names, “pivot browsing”; it provides a lightweight mechanism to navigate the aggregated bookmark collection (Millen & Feinberg, 2006).
This form of information retrieval is very close to collaborative filtering. Nevertheless here, as with active searches, the problem of the variability of linguistic terms arises. The meaning of the tags must be known if one is to profit from social navigation and find relevant resources. Sen et al. (2006b) point out this difficulty: “Social navigation may be more powerful in communities, that share a common vocabulary. As an extreme example, people who speak different language will find little value in each others’ tags” (Sen et al., 2006b, 181). The problem of linguistic variability arises again, in a weakened form, for another way of using tags for retrieval: the finding of resources already known to a user. Studies on users’ tagging behavior (e.g. Pluzhenskaia, 2006; Marlow et al., 2006a; Marlow et al., 2006b; Brooks & Montanez, 2006a; Brooks & Montanez, 2006b) show that tags are mainly used for personal resource management. Thus tags mainly serve the tagging user who wants to find resources again that have been found before or were published by himself: “Recovery enables a user to recall content that was discovered before” (Xu et al., 2006). Here, too, there is the danger that the user may at one point forget what tags he used for which meaning or which resource (Muller, 2007), but an inter-indexer inconsistency does not enter into it and make retrieval harder. By predominantly dealing with their own resource management and the indexing via tags, users of collaborative information services serve them in two ways, as Dye (2006) explains: “Users are not only creating their content, they’re building their own infrastructure for making it easier to find” (Dye, 2006, 38). And this infrastructure, at first motivated by selfish reasons, helps all users in retrieval: the totality of personomies, the folksonomy of the platform, is what accounts for the strength of folksonomies in browsing, with their multitude of search entries. 
This form of collaboration in the creation of a retrieval system is regarded by Dye (2006) as the greatest advantage of folksonomies: “Collaboration through collective tagging gives members of these communities a chance to build their own search systems from the ground up, based on their own vocabularies, interests, and ideas” (Dye, 2006, 40).
Information Filters – Information Filtering – Collaborative Filtering
The terms ‘information filters,’ ‘information filtering’ and ‘collaborative filtering’ are strongly linked with information retrieval. As retrieval methods, their task is to restrict the mass of information, which they implement through their own different approaches. Information filtering and collaborative filtering can be allocated to the push approach. Since the three terms are sometimes used synonymously in information-scientific discussion, they must first be clarified. In knowledge representation, the terms used for indexing are called information filters if they represent concepts instead of words and thus attach semantic added value to the indexed resource (Stock, 2007a, ch. 5). This requires the resource to be interpreted in a first step and then allocated a meaning via the indexing term; this is the task of indexing. Simple words, and thus tags, can also be information filters. As semantic control is lacking in their case, however, they can only be viewed as very bad information filters. In information retrieval, the database is searched with the help of the search term, which ideally matches the indexing term. Then the result is displayed as a list of filtered resources – all resources containing the same concept,
independently of its linguistic realization. Retrieval via information filters differs from ‘normal’ searches, e.g. via search engines on the internet, in that the latter often only allow searching for keywords, which gives rise to the problems of homonym separation and synonym summarization. Search engines do not perform an analysis of the resources’ content, they merely access the websites reachable via links, copy their entire content and treat it as full-text during searches. The alignment of search requests and resources is only performed on a word level, or sometimes even only on a character level, i.e. a search for the term ‘dog’ finds resources that contain the exact sequence of letters ‘_dog_’ in the indexed full-text, but not the resources that have been indexed with ‘_dogs_.’ Lacking a content analysis of the context of the indexed terms, search engines further find resources that contain the sequence of letters ‘this text does not deal with the dog Lassie’ and yield them as search results – which is a clear error and represents ballast. In information retrieval, one speaks of information filtering when the user does not actively search for information in databases or on the web (pull approach), but is ‘delivered’ said information via SDI (‘Selective Dissemination of Information’) or RSS feed (‘Really Simple Syndication’) (push approach). Here the user’s information need is articulated once, saved in a search profile and the retrieval is then ‘commissioned.’ Thus the user is no longer provided all information as it appears daily, but only what has been filtered by his search profile.
The features of information filtering have been summarized by Belkin and Croft (1992) in six points: 1) an information filtering system is often confronted by unstructured or semistructured data; 2) information filtering is often used for textual data; 3) large amounts of data are typically checked by an information filtering system; 4) this data is equally typically located in a continually incoming data stream; 5) the information filtering system is based on a profile that has been designed out of either individual or a group-specific information needs. These needs are typically long-term; 6) information filtering systems can either extract desired information from the data stream or block certain information, as is already done by spam filters, for example. Belkin and Croft (1992) emphasize that information retrieval and information filtering are strongly linked – they even regard the two methods of information gathering as “two sides of the same coin” (Belkin & Croft, 1992, 37). The only difference is located in the users themselves, or in their information needs: Information filtering begins with people (the users of the filtering system) who have relatively stable, long-term, or periodic goals or desires (e.g., accomplishing a work task, or being entertained). Groups, as well as individuals, can be characterized by such goals. These then lead to regular information interests (e.g., keeping up-to-date on a topic) that may change slowly over time as conditions, goals, and knowledge change. Such information interests lead the people to engage in relatively passive forms of information-seeking behaviour, such as having texts brought to their attention (Belkin & Croft, 1992, 31f.).
Information retrieval, on the other hand, distinguishes itself by its active user behavior and an information need on the part of the user that occurs only once.
Collaborative filtering refers to the restriction of a quantity of information with the help of a group of users: “Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read” (Goldberg et al., 1992, 61). In other words: “By making other users’ actions visible we can take advantage of the work they have done to find their way around and to solve problems” (Svensson et al., 2000, 260). After all, the tripartite connection of indexing terms, resources and users, as set up in knowledge representation, is exploited in such a way that information streams are filtered on the basis of different profiles, which are gleaned from the ‘bibliographic coupling’ (Kessler, 1963) of users or matches between indexing terms. The underlying assumption here is that ‘a match means similarity.’ Collaborative filtering is also often used for recommender systems (Kwiatkowski & Höhfeld, 2007). The terms ‘collaborative filtering’ and ‘recommender system’ are rivals in a way.[150] Resnick and Varian (1997) prefer the term ‘recommender system’ for the following reasons: We prefer the more general term ‘recommender system’ for two reasons. First, recommenders may not explicitly collaborate with recipients, who may be unknown to each other. Second, recommendations may suggest particularly interesting items, in addition to indicating those that should be filtered out (Resnick & Varian, 1997, 56).
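The assumption that ‘a match means similarity’ can be illustrated with a minimal user-based sketch. It couples users through the resources they have both saved and uses the Jaccard coefficient as one possible overlap measure; the measure, the sample collections and the function names are assumptions chosen for illustration, not the method of any system cited here.

```python
from collections import Counter

# Hypothetical personal collections: user -> set of saved resources.
SAVED = {
    "anna": {"r1", "r2", "r3"},
    "ben": {"r2", "r3", "r4"},
    "cara": {"r3", "r5"},
    "dan": {"r6"},
}


def overlap(a, b):
    """Jaccard coefficient of two resource sets (one possible similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, saved, top_n=3):
    """Suggest resources saved by similar users that the given user lacks."""
    own = saved[user]
    scores = Counter()
    for other, items in saved.items():
        if other == user:
            continue
        similarity = overlap(own, items)
        if similarity == 0:
            continue  # users without any match contribute nothing
        for item in items - own:
            scores[item] += similarity
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]
```

A user whose collection matches nobody else's receives no suggestions at all, since there are no coupled users to draw on.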
Golovchinsky, Pickens and Back (2008) criticize, similarly to Vander Wal (2008), the term component ‘collaborative,’ since the users do not collaborate in recommending resources; it is only their behavior which is analyzed: “In some sense this is not strictly collaboration, but rather a coordination of people’s activities” (Golovchinsky, Pickens & Back, 2008). Xerox was one of the first companies to report on the use of collaborative filtering for incoming data streams of digital documents and its own system ‘Tapestry’ (Goldberg et al., 1992); its successors were ‘GroupLens,’ a personalized filtering system for web news (Resnick et al., 1994) and ‘Ringo,’ a recommender system for music albums and musicians (Shardanand & Maes, 1995). Goldberg et al. (1992) see the great advantage of collaborative filtering systems in their continuous applicability and time-independent usefulness: “When a Tapestry user installs a file that uses annotations, documents matching that filter are returned as soon as the document receives the specified annotations. Thus Tapestry filters can be thought of as running continuously” (Goldberg et al., 1992, 69). Kautz, Selman and Shah (1997) introduce the collaborative filtering system ‘ReferralWeb,’ which aims to visualize and reproduce social networks within the web. The fundamental idea here is of a ‘referral chain,’ i.e. the exploitation of a user’s relationships or contacts in order to obtain the desired information or to find an expert to solve the problem. ReferralWeb implements this idea for incoming data streams, the internet and searches as follows: The current ReferralWeb system uses the co-occurrence of names in close proximity in any documents publicly available on the Web as evidence of a direct relationship. […] The social network also prioritizes the answers, in that the user retrieves hits on people that are closest to him or herself, rather than
simply a long list of hundreds of names or documents (Kautz, Selman, & Shah, 1997, 64).
[150] This book will use both terms synonymously and orient itself on the cited authors and their terminology for elucidations.
So the user is provided with the information relevant to him. Through the quantity of analyzed relations, the filtering system is furthermore capable of supporting the user in information gathering. Kautz, Selman and Shah (1997) thus address a fundamental advantage of collaborative filtering systems: Typically a user is only aware of a portion of the social network to which he or she belongs. By instantiating the larger community, the user can discover connections to people and information that would otherwise lay hidden over the horizon (Kautz, Selman, & Shah, 1997, 65).
Recommender systems can make suggestions based on explicit recommendations, but also on implicit ones (Resnick & Varian, 1997, 56; Szomszor et al., 2007). A direct recommendation would be the function ‘Send a Friend,’ where a user can directly and personally send his recommendation to another user; indirect recommendations are often presented via user behavior, e.g. on the basis of articles bought or searched. Recommender systems have their problems, however: according to Resnick and Varian (1997) the biggest one is the ‘Free Rider Problem,’ since once a user has created a profile, he can profit from the system’s suggestions without actively having to contribute to it. This is also confirmed for del.icio.us: “One potential problem with del.icio.us’s inbox is the free rider problem. A user can reap all of the benefits of the collaborative filter without providing any metadata to the system” (Wash & Rader, 2006). Equally problematic is the so-called ‘vote early and often’ phenomenon, i.e. spamming in direct recommendations: “content owners may generate mountains of positive recommendations for their own materials and negative recommendations for their competitors” (Resnick & Varian, 1997, 57). Another disadvantage of recommender systems is the ‘ramp-up’ problem (Szomszor et al., 2007, 76). New users of the system do not have a detailed and finished profile yet, so they are able to profit from the suggestions only in a limited way, or even not at all. The same goes for new products or resources that have not yet been bought, saved, viewed etc.[151] The term ‘collaborative filtering’ is also used to delimit collaborative information services from the World Wide Web. The internet is not subject to quality control, so every user with ability and access can publish resources online and link up with other information providers.
Search engines copy nearly all websites that can be reached via links, but do not index or check the content, leaving a large number of questionable websites and spam accessible and searchable. For collaborative information services, the case is slightly different. Every user can still publish whatever he wants, but he is bound at the very least by the service’s terms of usage and the mandatory creation of a user account. The concentration of the information services on one specific resource format, e.g. images on Flickr or videos on YouTube, restricts users in a way. Especially in information retrieval, the linking of users and resources via tags represents a distinct advantage of collaborative information services vis-à-vis search engines, since searches in the former take place in an enclosed and ‘controlled’ space:
In short, Google yields search results that represent attention allocated by computers, while DCSs [distributed classification systems, author’s note] yield search results that represent attention allocated by humans. The former method (computer attention) is cheap, and hence ideal for indexing large amounts of information quickly; the latter method (human attention) is not so cheap, and not so quick, but it can yield more socially valuable information because it means a human being has made the association between a resource and a particular tag. Hence, this method is ideal for qualitative indexing (Mejias, 2005).
[151] This applies to collaborative information services only with reservations, if one assumes that they obtain their resources from the users’ active collaboration, via uploads etc. Thus every resource is ‘rated’ at least once.
The services’ users largely control the incoming resources and, in Broad Folksonomies, rate them via tags – good or relevant resources would seem to be indexed more frequently and are thus more easily retrievable, or more visible. Unusable resources are often not accepted by the information services at all (Graefe, Maaß, & Heß, 2007; Schiefner, 2008; Millen, Whittaker, & Feinberg, 2007). If information is automatically obtained via a push service, the source will at least have been checked and deemed trustworthy in advance. Search results thus run through a collaborative filtering mechanism of the community: Consequently, the search results are personalized and spam-filtered by the trusted networks (Xu et al., 2006). This facilitates a search in a hit list that has been compiled by human users, and not in results of a statistical evaluation procedure (Skusa & Maaß, 2008, 6)*. Users perceive the search results of social bookmarking systems as more trustworthy than those of algorithm-based search engines (Graefe, Maaß, & Heß, 2007).
In collaborative information services, all filtering methods named above are used for information retrieval. The tags serve as information filters, even if they represent no concept, as Golder and Hubermann (2005) observe: Looking at it another way, tagging is like filtering; out of all the possible documents (or other items) that are tagged, a filter (i.e. a tag) returns only those items tagged with that tag. Depending on the implementation and query, a tagging system can, instead of providing the intersection of tags (thus, filtering), provide the union of tags; that is, all the items tagged with any of the given tags, rather than all of them. From a user perspective, navigating a tag system is similar to conducting keyword-based searches; regardless of the implementation, users are providing salient, descriptive terms in order to retrieve a set of applicable items (Golder & Hubermann, 2005, 199).
Searching via tags works analogously to keyword searches; when search tag and indexing tag match, the resource is yielded as a search result. Here tags provide the same functionalities as keywords, descriptors or notations, such as the restriction of the search results via a combination of several tags: Tagging, when combined with search technology, becomes a powerful tool to discover interesting Web objects. [...] The way tags work is analogous to filters. They are treated as logical constraints to filter the objects. Refinement of results is done through strengthening the constraints whereas generalization is done by weakening them. E.g., tag combination (2006, calendar) strengthens tag (2006) and tag (calendar) (Xu et al., 2006).
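The difference between the intersection and the union of tags, and the strengthening of constraints through tag combination, can be sketched with plain set operations. The inverted index and the function names below are invented for illustration; only the tags ‘2006’ and ‘calendar’ are borrowed from the example by Xu et al. (2006).

```python
# Hypothetical inverted index: tag -> set of resources indexed with that tag.
TAG_INDEX = {
    "2006": {"r1", "r2", "r3"},
    "calendar": {"r2", "r3", "r4"},
    "photo": {"r3", "r5"},
}


def filter_all(index, tags):
    """Intersection: resources that carry every one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.intersection(*sets) if sets else set()


def filter_any(index, tags):
    """Union: resources that carry at least one of the given tags."""
    sets = [index.get(tag, set()) for tag in tags]
    return set.union(*sets) if sets else set()


# Each additional tag strengthens the constraint and narrows the result.
narrow = filter_all(TAG_INDEX, ["2006", "calendar"])
narrower = filter_all(TAG_INDEX, ["2006", "calendar", "photo"])
broad = filter_any(TAG_INDEX, ["2006", "calendar"])
```

Adding a tag to an intersection query can only shrink the hit list, whereas adding a tag to a union query can only enlarge it.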
Active searches in collaborative information services can also exploit the aspect of the ‘trusted network’ – after all, the user only searches in a collaboratively controlled area of the information platform, e.g. in sub-communities or even just within his own personomy, as in del.icio.us: “The tag also becomes a part of the community at large, and members can search for tags within their individual folders as well as the entire community” (Dye, 2006, 40). If one is to assume that users are more interested in information from their own network, the system can even manipulate the active search results in this direction, e.g. via weighting values, as Weinberger (2005) suggests: “If the application knows who is in one’s social group, it can weigh the tags that group uses more heavily when executing searches” (Weinberger, 2005). Information filtering in collaborative information services is realized mainly via RSS feeds. They make it possible to subscribe to certain tags, users or resources, and to receive the respective content automatically on an RSS feed reader (Begelman, Keller, & Smadja, 2006). This means that the user of a collaborative information service creates a search profile of desired tags, users or resources, which is then used to filter the incoming data stream of the entire information service. The user is then delivered only the information deemed relevant or interesting to him, without having to access the information service and actively having to look for updates. Hassan-Montero and Herrero-Solana (2006) delimit information retrieval from information filtering on the example of social bookmarking services: In IF [Information Filtering, author’s note] user plays a passive role, expecting that system pushes or sends toward him information of interest according to some previously defined profile.
Social bookmarking tools allow a simple IF access model, where user can subscribe to a set of specific tags via RSS/ Atom syndication, and thus be alerted when a new resource will be indexed with this set. On the other hand, IR user seeks actively information, pulling at it, by means of querying or browsing. In tag querying, user enters one or more tags in the search box to obtain an ordered list of resources which were in relation with these tags. When a user is scanning this list, the system also provides a list of related tags (i.e. tags with a high degree of co-occurrence with the original tag), allowing hypertext browsing (Hassan-Montero & Herrero-Solana, 2006).
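Weinberger’s suggestion, mentioned above, of weighting the tags of one’s own social group more heavily during active searches can be sketched as a simple ranking function. The weight of 3.0, the sample assignments and the function name are arbitrary assumptions made for illustration; the source does not specify how such a weighting would be computed.

```python
# Hypothetical tag assignments: (user, resource, tag). All values are invented.
ASSIGNMENTS = [
    ("anna", "r1", "jaguar"),
    ("ben", "r1", "jaguar"),
    ("cara", "r2", "jaguar"),
    ("dan", "r2", "jaguar"),
    ("eve", "r2", "jaguar"),
]


def ranked_search(tag, assignments, social_group, group_weight=3.0):
    """Rank resources for a tag; assignments by group members count more.

    group_weight is an arbitrary illustrative value, not one from the literature.
    """
    scores = {}
    for user, resource, assigned_tag in assignments:
        if assigned_tag != tag:
            continue
        weight = group_weight if user in social_group else 1.0
        scores[resource] = scores.get(resource, 0.0) + weight
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda resource: (-scores[resource], resource))


everyone = ranked_search("jaguar", ASSIGNMENTS, social_group=set())
personal = ranked_search("jaguar", ASSIGNMENTS, social_group={"anna", "ben"})
```

Without a social group, the resource tagged by more users ranks first; once the searcher’s contacts are weighted, the resource they tagged moves to the top of the hit list.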
Wash and Rader (2006) regard information filtering as a great advantage of Web 2.0 and of the social bookmarking service del.icio.us: We believe that del.icio.us is a novel form of collaborative filter for Internet websites. Users can learn about new websites through the use of the inbox, and can filter out the individually relevant sites by selectively subscribing to relevant tags and similar users (Wash & Rader, 2006).
They describe the implementation of information filtering in del.icio.us as follows: The del.icio.us inbox is the primary method for collaborative filtering. This inbox allows users to create subscriptions to other users and tags. Any new bookmarks that match these subscriptions will then be placed on the user’s inbox page for viewing. There are three possible types of subscriptions. A user can subscribe to all of the bookmarks from another user (a ‘user’ subscription). He or she can subscribe to all bookmarks that have a given tag (a ‘tag’ subscription). Or the subscription can be of the form ‘user/tag’, which
Information Retrieval with Folksonomies
only matched bookmarks by the specific user where that user applied the given tag (Wash & Rader, 2006).
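The three subscription types described by Wash and Rader (2006) can be sketched as a simple filter over an incoming bookmark stream. This is an illustrative model only, not the actual del.icio.us implementation; the class names `Bookmark` and `Subscription` and the function `inbox` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bookmark:
    user: str
    url: str
    tags: frozenset


@dataclass(frozen=True)
class Subscription:
    # user only -> 'user' subscription; tag only -> 'tag' subscription;
    # both set -> 'user/tag' subscription. None means "not restricted".
    user: Optional[str] = None
    tag: Optional[str] = None

    def matches(self, bookmark: Bookmark) -> bool:
        user_ok = self.user is None or self.user == bookmark.user
        tag_ok = self.tag is None or self.tag in bookmark.tags
        return user_ok and tag_ok


def inbox(subscriptions, stream):
    """Keep the bookmarks that match at least one subscription."""
    return [b for b in stream if any(s.matches(b) for s in subscriptions)]


stream = [
    Bookmark("anna", "u1", frozenset({"python"})),
    Bookmark("ben", "u2", frozenset({"python", "web"})),
    Bookmark("ben", "u3", frozenset({"cooking"})),
    Bookmark("cleo", "u4", frozenset({"web"})),
]
print([b.url for b in inbox([Subscription(user="ben", tag="python")], stream)])
```

Note that in this sketch a subscription with neither field set would match every bookmark; a real service would presumably reject such a profile.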
Information filtering within collaborative information services unites two advantages over searching online via search engines: a) the user no longer has to actively search, since he is delivered the requested information, b) this information is spam-free and probably relevant, since it has been extracted from the ‘trusted network.’ Hammond et al. (2005) summarize: This ability to sort out the wheat from the chaff is an important win over a web-based search engine. Search engines, at this point, tend to index and search a global space – not my local space. My space comprises the documents I am interested in and the documents of other users that I want to follow (Hammond et al., 2005).
This procedure can be compared to the SDIs from professionally compiled databases, which check and rate the sources of their database. Thus the customer can be sure that he is only delivered trustworthy information. Wash and Rader (2006) also describe the effects on the social structure of the tagging system; after all, each subscription to a resource, tag or user is a quality judgment and means that the subscription is deemed trustworthy: del.icio.us takes advantage of similarity between users interests by allowing users to subscribe to some or all of the tags of another specific user with similar interests. In this context, the act of bookmarking a site by that user serves as an endorsement of the site. Subscribing to a user is stating that those endorsements are worth trusting (Wash & Rader, 2006).
In an open tagging system, this quality judgment about an information resource does not stay hidden from the community, so that here, too, the Matthew Effect can be observed and can influence the rating. Since collaborative filtering within tagging systems touches on numerous aspects of recommender systems, it is addressed in the following section.
Folksonomy-Based Recommender Systems in Information Retrieval
“Folksonomies are essentially a development in information retrieval, an interesting variant on the keyword-search theme,” Szomszor et al. (2007, 74) observe, since folksonomies can serve as recommender systems (Wu, Zubair, & Maly, 2006). In collaborative information services, the user can be presented with recommendations for a successful information retrieval in two ways: 1) through collaborative filtering systems and 2) through tag recommender systems for formulating the search request. Collaborative filtering systems serve as recommender systems for relevant information, based on user behavior, the relations between users of an information platform (Lambiotte & Ausloos, 2006) and the deliberations on the push approach. Here recommender systems are used for two purposes in particular: 1) to help the user discover new resources and 2) to rate the degree of interest in a resource (Szomszor et al., 2007, 75). Generally, one can distinguish between three sorts of recommender systems: a) content-based recommender systems that exploit resource-immanent elements such as word frequency or degree of linking, b) collaborative
recommender systems that are described further below, and c) hybrid systems that combine the approaches of the first two systems and are thus able to dampen the effect of ramp-up problems (Szomszor et al., 2007, 75f.). In collaborative information services, certain resources are preferred for suggestions, citing ‘social proof’ or ‘interaction history:’ “One of the primary benefits of interaction history is to give newcomers the benefit of work done in the past. In fact, the slogan […] is: We all benefit from experience, preferably someone else’s” (Wexelblat & Maes, 1999, 271f.). More precisely put, the collaborative filtering systems exploit the networking function of folksonomies, which connects users with the information resources via tags (Paolillo & Penumarthy, 2007; Fokker, Pouwelse, & Buntine, 2006). Fokker, Pouwelse and Buntine (2006) here observe: “[Recommender systems, A/N] exploit the fact that people with similar tagging behaviour – also known as tag buddies – have related taste.” Users are bibliographically coupled if they index, save, edit etc. the same resources. Resources are thematically linked if they have been indexed with the same tags (see Figure 3.1). Folksonomy-based filtering or recommender systems work with exactly this information and are thus generally able to suggest two sorts of resources: “Recommendation systems may, in general, suggest either people or items” (John & Seligmann, 2006). Here they combine the features of content-based and collaborative recommender systems: [Folksonomy-based recommendation, A/N] is similar to collaborative filtering, since we use tags to represent agreement between users. It is also similar to content-based recommendation, because we represent image content by the tags […] that have been assigned to it by the user (Lerman, Plangprasopchok, & Wong, 2008).
For folksonomy-based filtering systems one must, however, adjust the factors that need to be observed in order to work successfully, as Jäschke et al. (2007) emphasize: Because of the ternary relational nature of folksonomies, traditional CF [collaborative filtering, A/N] cannot be applied directly, unless we reduce the ternary relation Y to a lower dimensional space. […] The projections preserve the user information, and lead to log-based like recommender systems based on occurrence or non-occurrence of resources or tags, resp., with the users. Notice that now we have two possible setups in which the k-neighborhood N_u^k of a user u can be formed, by considering either the resources or the tags as objects (Jäschke et al., 2007, 508).
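The reduction of the ternary relation Y to two binary projections, and the two resulting setups for the k-neighborhood N_u^k, can be sketched as follows. The quotation does not fix a similarity measure, so the Jaccard coefficient used here is an assumption, as are the function names and the toy data.

```python
def project(assignments):
    """Project (user, tag, resource) triples onto user-resource and user-tag."""
    by_resource, by_tag = {}, {}
    for user, tag, resource in assignments:
        by_resource.setdefault(user, set()).add(resource)
        by_tag.setdefault(user, set()).add(tag)
    return by_resource, by_tag


def jaccard(a, b):
    """Share of common objects among all objects of both users (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def k_neighborhood(user, profiles, k):
    """The k users most similar to `user` within one projection."""
    scored = [(jaccard(profiles[user], p), u) for u, p in profiles.items() if u != user]
    scored.sort(key=lambda su: (-su[0], su[1]))
    return [u for _, u in scored[:k]]


Y = [
    ("u1", "python", "r1"), ("u1", "web", "r2"),
    ("u2", "code", "r1"), ("u2", "net", "r2"),
    ("u3", "python", "r3"), ("u3", "web", "r4"),
]
by_resource, by_tag = project(Y)
print(k_neighborhood("u1", by_resource, 1), k_neighborhood("u1", by_tag, 1))
```

In the toy data, u1 shares its resources with u2 but its vocabulary with u3, so the two projections yield different neighborhoods for the same user.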
Diederich and Iofciu (2006) summarize what suggestions can be made on the basis of user profiles and tags:
1. Objects based on users: recommendation of other resources based on other users.
2. Users based on objects: recommendation of other users based on other resources.
3. Users based on co-tagging: recommendation of other users based on identically/similarly tagged resources.
4. Tags based on users: tag recommendation based on identical/similar users.
5. Users based on tags: recommendation of other users based on identical/similar tags.
The suggested resources are extracted via similarity calculations and cluster-forming procedures (Shardanand & Maes, 1995; see chapter two). Calculations on the basis of identical indexing tags, identical terms in full-texts and resource descriptions or similar link structures particularly lend themselves to the processing of textual resources and images, videos etc.: Similar (semantically related) annotations are usually assigned to similar (semantically-related) web pages by users with common interests. In the social annotation environment, the similarity among annotations in various forms can further be identified by the common web pages they annotated (Bao et al., 2007, 503).
Jäschke et al. (2006) also emphasize the meaning of folksonomies for collaborative filtering or recommender systems: The annotation procedure provides the user with added value for little effort, because on the one hand it facilitates the finding of one’s own resources, and on the other hand it becomes very easy to find similar new resources that could be of interest. As opposed to traditional search engines, this works equally well for text resources as it does for images, videos or other nontextual content (Jäschke et al., 2006)*.
Collaborative recommender systems are often used on online shopping platforms, such as Amazon (on the basis of other resources) and online grocery shopping, as described by Svensson et al. (2000). The community of the shopping platform is provided with suitable recipes for the food they bought and is alerted to other recipes via the analysis of users’ group affiliations and tagged recipes. Sen et al. (2006a) apply the idea of collaborative filtering to automatic sending of information, or ‘alerts,’ via RSS or Atom feeds, in their system ‘FeedMe,’ in order to limit the mass of incoming information for the user’s benefit. The similarity calculations are based on implicit factors, such as the opening of a link within an alert, on the one hand, and on explicit ones, such as a thumbs-up/thumbs-down rating on the part of the users, on the other. Much more important than the recommending of similar resources, though, would seem to be the recommending of similar or relevant users in collaborative information services (Wash & Rader, 2007; Diederich & Iofciu, 2006; Farrell et al., 2007): “As such, a collaborative tagging system helps users in not only retrieving information but also socializing with others” (Wu, Zubair, & Maly, 2006, 112). Here it is important that a critical mass be reached in collaborative information services in order to provide real usefulness: “The bigger the set of users, the more likely I am to find someone like me” (Resnick & Varian, 1997, 58). Users are mainly interested in other users, their resources and their opinions. Panke and Gaiser (2008) interviewed roughly 200 users on their tagging behavior and found that two thirds of the interviewees use the tags “to meet new people” (Panke & Gaiser, 2008, 21)*. Macaulay (1998) reports on information gathering among journalists and finds that they attach greater importance to the source’s trustworthiness than to the information itself (Svensson et al., 2000, 262). 
This behavior is a natural procedure for humans during information gathering and is mainly exploited in social navigation: When searching for information […] people often rely on the advice of other people rather than more abstract tools such as maps, search engines, etc.
Social interaction is (of course) basic to human behaviour, and therefore well learnt and efficient (Forsberg, Höök, & Svensson, 1998).
Collaborative information services and folksonomies are capable of delivering information in this way. Here is one explanation for the success of Web 2.0: communication and information gathering on the internet is no longer a one-way process, where company websites, for example, would provide updates on certain products statically, but the reciprocal exchange between users emulates real-world word-of-mouth propaganda (Edelman, 2008; Bender, 2008; Ketchum & University of Southern California Annenberg, 2009). The motto of information gathering is now: “More like me!” – find users that are similar to me, so I can find relevant information by watching them (Smith et al., 2008). It can be assumed that users are similar or have similar interests if they use the same tags to index resources (Chopin, 2008), or if they are connected to them via the same relations, or if they index the same resources. Diederich and Iofciu (2006) regard a user’s allocated tags as his interest profile and then compare users on that basis. The authors locate the advantage of this procedure in the combination of different recommender system types: This unique combination of the user profile aspect of collaborative recommender systems with the feature-based schema to describe user profiles (as used in content-based recommender systems) is intended to better capture the interests of the users in the recommendation process […] (Diederich & Iofciu, 2006).
The users’ similarity is calculated via the procedures described above. Here, though, two steps must be considered: the first step in recommending users consists of calculating the similarities between the initial user and all other users of the information platform via a coefficient. After all: “These persons, being relevant for the user, are potential candidates to collaborate with and, thus, to be added to the user’s Community of Practice” (Diederich & Iofciu, 2006). The calculation must be performed twice, in order to do justice to the users’ thematic linking on the one hand, and to their bibliographic coupling on the other. Van Damme, Hepp and Siorpaes (2007) also arrive at this conclusion and call this procedure the exploitation of two light-weight ontologies: the ‘sub-communities’ (bibliographic coupling) and the ‘object overlaps’ (same tags). Concerning the thematic linking, it must be observed for each similarity coefficient (e.g. the cosine; Diederich & Iofciu, 2006) that: a is the number of tags allocated by user 1, b is the number of tags allocated by user i and g is the number of tags used by both. For bibliographic coupling: a is now the number of resources indexed by user 1, b the number of resources indexed by user i and g the number of resources indexed by both. Both calculated values can then be combined and thus determine the degree of similarity between two users. The second step (if one follows the more elaborate variant and neglects the k-nearest-neighbors procedure, for example) now includes the formation of clusters or, in this case, of communities of users via the Single-Link, Complete-Link or Group-Average-Link procedures. The most similar users thus determined can then be suggested to user 1. An elaborate system for discovering similar users and communities via identical tags is introduced by Alani et al. (2003) with ‘Ontocopi.’ They use the relations of an ontology as tags and thus link resources and users:
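The first step described above can be sketched with the cosine coefficient in its set-based form, g / sqrt(a · b), applied once to the users’ tags (thematic linking) and once to their resources (bibliographic coupling). Combining the two values by an unweighted sum is an assumption made here for illustration; the function names are hypothetical.

```python
import math


def cosine(a, b, g):
    """Set-based cosine: g shared items, a and b items per user."""
    return g / math.sqrt(a * b) if a and b else 0.0


def user_similarity(tags_1, tags_i, resources_1, resources_i):
    # Thematic linking: overlap of the two users' tag vocabularies.
    thematic = cosine(len(tags_1), len(tags_i), len(tags_1 & tags_i))
    # Bibliographic coupling: overlap of the resources both users indexed.
    coupling = cosine(len(resources_1), len(resources_i),
                      len(resources_1 & resources_i))
    # Assumed combination: plain sum, so the result lies between 0 and 2.
    return thematic + coupling


score = user_similarity(
    {"w", "x", "y", "z"}, {"w"},   # a = 4, b = 1, g = 1 -> 0.5
    {"r1"}, {"r1"},                # a = 1, b = 1, g = 1 -> 1.0
)
print(score)
```

The resulting pairwise values could then serve as input for the second step, i.e. a Single-Link, Complete-Link or Group-Average-Link clustering of users.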
Ontocopi uses ontological relations to discover connections between objects that the ontology only implicitly represents. For example, the tool can discover that two people have similar patterns of interaction, work with similar people, go to the same conferences, and subscribe to the same journals (Alani et al., 2003, 18).
As displayed in Figure 4.1, user A and user B, for example, are linked in multiple ways: on the one hand, they are both ‘members of’ the same resource D (department, club, team etc.), and on the other hand they are both authors of the resource H (via the tag, or relation, ‘HasAuthor’). User B is also linked with user C via resource F.
Figure 4.1: Discovery of Communities or Similar Users via Ontologies. Source: Alani et al. (2003, 21, Fig. 3).
Although the relations for the discovery of similar users and communities used in Ontocopi are more differentiated and thus also facilitate relational recommendations, the procedure is still comparable to folksonomy-based filtering and recommender systems. You only have to replace, in the following quote, the terms ‘Ontocopi’ and ‘more formal relations’ with ‘Folksonomies’ and ‘tags:’ “Ontocopi lets you infer the informal relations that define a community of practice from the presence of more formal relations” (Alani et al., 2003, 18). Another approach for folksonomy-based person recommender systems is presented by John and Seligmann (2006). They use tags as expert finders in companies (see also Schillerwein, 2008): Tagging systems’ capability of creating or clarifying social networks is one reason why they wish to exploit folksonomies for their expert finder. Among the other reasons are: It represents user categorization of shared content that may be presumed to be representative of user interest and expertise – an “I tag, therefore I know” indication by the user. Users do not have to be authors of or be referenced in the content being tagged. Tags enable the automatic generation of expertise categories. It enables the formation of social networks around tags which facilitates identification of expertise communities.
It provides a way of keeping pace with the user’s changing interests without the user having to update skill profiles. The feedback loop leading to asymmetric communication is active in an enterprise environment because users know each other and are aware of reputations (John & Seligmann, 2006).
Employees are able to tag any form of in-company resource, e.g. e-mails or documents, in order to build up their own tagging profile. The recommendation of other users or the finding of experts works via the relation between the number of a user’s tags and the number of his resources. Figure 4.2 shows an overview of this model. The font size represents the tagging activity, i.e. the number of resources indexed with this tag; the dots within the tag circles represent the users, and their size in turn represents the users’ tagging activity. The lines between the circles stand for the semantic relations between individual tags. This visualization thus displays the popularity of a certain tag and the most active users of a tag. The employee can now find an expert for a particular subject area via a search function and a ranking algorithm, the so-called ‘ExpertRank.’ The more bookmarks an employee has tagged, the more likely it becomes that she will be classified as an expert for this subject area: “her expertise is strictly the number of bookmarks she contributes” (John & Seligmann, 2006). The visualization can then be used to determine either a single expert or a group of experts. Automatic recommendations of similar users are not mentioned by John and Seligmann (2006). A comparison with the Ontocopi system mentioned above makes it clear, however, that the idea of an expert finder can easily be transformed into an ‘expert recommender.’
Figure 4.2: Representation of a Social Network via Tags. Source: John & Seligmann (2006, Fig. 1).
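The counting principle quoted above (“her expertise is strictly the number of bookmarks she contributes”) can be illustrated with a few lines of code. This is only a stand-in for the idea, not the ExpertRank algorithm itself, whose details are not given here; the function name `rank_experts` and the data layout are assumptions.

```python
from collections import Counter


def rank_experts(bookmarks, tag):
    """Rank users by the number of their bookmarks that carry `tag`.

    bookmarks: iterable of (user, resource, tags) triples (assumed format).
    """
    counts = Counter(user for user, _, tags in bookmarks if tag in tags)
    # Most bookmarks first; ties are broken alphabetically by user name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


bookmarks = [
    ("ann", "d1", {"java"}),
    ("ann", "d2", {"java", "xml"}),
    ("bob", "d3", {"java"}),
    ("cy", "d4", {"xml"}),
]
print(rank_experts(bookmarks, "java"))
```

Turning the finder into an ‘expert recommender’ would then amount to running such a ranking automatically for the tags in a user’s own profile instead of waiting for an explicit search.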
Van Damme, Hepp and Siorpaes (2007) discuss the development of recommender systems based on different information platforms, in order to increase the range of analyzable data: “Consolidating the entire user-created data of similar kinds of objects, which is dispersed on several systems, may generate a more complete overview on the meta data of overlapping objects” (Van Damme, Hepp, & Siorpaes, 2007, 61). They point out, however, that the resources to be compared must in any case share a similar structure (e.g. compare resource-specific tag clouds from
different social bookmarking systems, but not tag clouds with full-texts). They also warn: “This means if the tags ‘size’ of the different systems are differing, the frequency of tags has to be adjusted in proportion” (Van Damme, Hepp, & Siorpaes, 2007, 65). The tripartite relation of tags – users – resources can further be exploited in order to generate automatic suggestions for the expansion of search requests or for indexing on the basis of a tag’s, a user’s or a resource’s placement in the network (Diederich & Iofciu, 2006; Kwiatkowski & Höhfeld, 2007, 272f.). The recommendation of search tags in information retrieval is, of course, less problematic than the recommendation of indexing tags in knowledge representation, since the folksonomy is not dependent on the ‘natural’ generation via user indexing in this case. Should the search tags resulting from recommendations be used as indexing tags, however (see further below), the problem of the Matthew Effect can very well falsify the indexing process or ‘natural’ searches. Recommendations of better or more similar search terms are also summarized under the keyword ‘query expansion’ (Efthimiadis, 1996; Efthimiadis, 2000). Here the term is used independently of whether the request is restricted by more specific or several terms, or whether it is enhanced by more general or synonymous terms (for an overview of query modification options, see Anick, 2003). Kome (2005) thus talks of ‘query refinement.’ Greenberg (2001) summarizes the use of query expansion thusly: “retrieval results associated with a particular query are often inadequate, and that they might be improved by reformulating the initial query through the addition and/ or deletion of search terms” (Greenberg, 2001, 402). 
Mayr (2006) also emphasizes that the goal of query expansion is the finding of the most adequate terms for representing the information need: “traditionally, the vagueness [or ‘vocabulary problem,’ A/N] between the query and document levels is resolved via term enhancement procedures [...]” (Mayr, 2006, 153)*. Query expansions can be performed via synonyms, hyperonyms, hyponyms or related terms. The effects of query expansions are described by Greenberg (2001): “QE via semantic relationships generally improves recall, although not always significantly” (Greenberg, 2001, 409). A more precise analysis of the different kinds of query expansion via the different semantic relations concerning the Recall and Precision values yielded the following results: Recall is heightened through expansion with related terms, hyperonyms (broader terms), hyponyms (narrower terms) and synonyms (in descending order), while Precision is heightened through query expansion via synonyms, hyponyms, hyperonyms and related terms (in descending order) (Greenberg, 2001, 409). The use of tag recommendations for query expansion and information retrieval is confirmed by Brooks and Montanez (2006a; 2006b). The candidates for tag recommendation and query expansion can be generated in various ways. The summarization by Stock (2007a, 486) is complemented here by two points (1 and 6): 1. co-occurrence in a resource, 2. co-occurrence in a cluster, 3. identical terminology, 4. relations, via references or citations, 5. neighborhood in a social network, 6. alignment with KOS. Simple recommender systems merely recommend the top n tags of the folksonomy’s co-occurring tags. An enhancement of this method is the formation of tag clusters
via the known similarity coefficients and cluster procedures in order to generate semantic webs of the indexed tags and to use these for recommending related tags (Begelman, Keller, & Smadja, 2006). Grahl, Hotho and Stumme (2007) form clusters from the del.icio.us folksonomy via the ‘KMeans-Cluster’ algorithm. This algorithm is used iteratively in order to create a quasi-term-hierarchy; i.e., the term hierarchy does not consist of paradigmatic relations but is formed of the syntagmatic relations of the tags (Stock & Stock, 2008, 372). The KMeans algorithm works with the Vector Space model in order to determine the similarity between two tags. This is why the dimensions and the vectors are both represented by the tags. The cluster algorithm’s operating principle is described as follows by Grahl, Hotho and Stumme (2007): The principle of KMeans is as follows: Let k be the number of desired clusters. The algorithm starts by choosing randomly k data points of D as starting centroids and assigning each data point to the closest centroid (with respect to the given similarity measure; in our case the cosine measure). Then it (re-)calculates all cluster centroids and repeats the assignment to the closest centroid until no reassignment is performed. The result is a non-overlapping partitioning of the whole dataset into k clusters (Grahl, Hotho, & Stumme, 2007, 358).
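The KMeans principle quoted above can be sketched in plain Python. The sketch follows the quoted steps (random starting centroids, assignment to the closest centroid by cosine, recalculation, repetition until no reassignment occurs). The tag vectors below are hypothetical co-occurrence counts, the `seed` and `max_iter` parameters are additions for reproducibility and safety, and this is not the code used by Grahl, Hotho and Stumme (2007).

```python
import math
import random


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def kmeans(points, k, seed=0, max_iter=100):
    """Return a cluster index for every point (non-overlapping partition)."""
    rng = random.Random(seed)
    # Start with k randomly chosen data points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the most similar centroid (cosine measure).
        new_assignment = [
            max(range(k), key=lambda c: cosine(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no reassignment was performed
        assignment = new_assignment
        # Recalculate every centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical tag vectors: each row holds co-occurrence counts with four tags.
tag_vectors = [[3, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]
clusters = kmeans(tag_vectors, k=2)
print(clusters)
```

Applying the function again to representatives of the resulting clusters would correspond to the iterative use that produces the quasi-term-hierarchy.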
The result of the cluster algorithm is displayed in Figure 4.3. The first step of clustering results in the lowest hierarchy level of the quasi-KOS (see Figure 4.3 ‘Tags’). For the medium level, one tag is taken from the bottom level and then clustered anew. The top hierarchy level consists of tag pairs, generated from the medium level’s clusters. It becomes clear that complex tag recommendations can be generated via this procedure, recommendations based not only on simple co-occurrence, but that can also incorporate different levels of expressiveness into the tag recommendations via the quasi-KOS. Capocci and Caldarelli (2008) work with cluster analysis and use it to check the probability of two hierarchically linked tags being annotated. This is what they find out: Thus, the behavior of clustering coefficient of the tag co-occurrence networks can be used as a test for models representing the tag semantical organization or, equivalently, how users choose tags when annotating a resource. […] users typically use tags hierarchically, labelling a resource by tags related to the same topics but with different generality, adding more specialized tags as the number of collected resources grows (Capocci & Caldarelli, 2008).
Figure 4.3: Quasi-Term-Hierarchy of del.icio.us as Generated via Cluster Procedures. Source: Grahl, Hotho, & Stumme (2007, 360, Fig. 1).
Brooks and Montanez (2006a) also investigate the expressiveness of tags in blog posts, but concentrate on the comparison of user-generated tags and tags automatically extracted from the blogs’ full-texts. The cluster analysis of both kinds of tags shows that the terms from the tag clusters are less similar to one another than the terms from tag clusters calculated with TF*IDF. Thus the authors conclude: Simply extracting the top three TFIDF-scored words and using them as tags produces significantly better similarity scores than tagging does [...]. The clusters themselves are typically smaller, indicating that automated tagging produces more focused, topical clusters, whereas human-assigned tags produce broad categories (Brooks & Montanez, 2006a, 14).
The remarks by Capocci and Caldarelli (2008) as well as Brooks and Montanez (2006a; 2006b) are important for the comparison of cluster and tag-based recommender systems and must be considered in the construction of tag recommender systems. If the recommender system can use full-texts or other
metadata for its analysis, the comparison of the resource terminology can occur within the search results (Cui et al., 2002). Thus at first similar resources are found and then used to extract tag suggestions via TF*IDF. This procedure can also be used to generate tags for previously un-tagged resources, as Graham, Eoff and Caverlee (2008) find out: “Tag recommendations for an untagged document are generated by finding the top-10 most similar documents, ranking their tags based on TF-IDF measures across the tag corpus and on the user’s tag profile” (Graham, Eoff & Caverlee, 2008, 1166). The system ‘AutoTag’ (Mishne, 2006) also works according to this principle. AutoTag recommends tags for blog posts, based on collaborative filtering methods. Mishne (2006) describes the workings of AutoTag in analogy to shopping platforms that use recommender systems: In AutoTag, the blog posts themselves take the role of users, and the tags assigned to them function as the products that the users expressed interest in. In traditional recommender systems, similar users are assumed to buy similar products; AutoTag makes the same assumption, and identifies useful tags for a post by examining tags assigned to similar posts (Mishne, 2006, 953).
The concrete workings of the system are as follows: at first, similar blog posts are needed to establish a basis for recommending a tag for a blog post. To achieve this, a search request is generated from the original post and sent on to the search engine. The search results are ranked by posts most similar (where the similarity is determined via known retrieval models). From these posts, then, the top terms are determined via simple frequency calculation and recommended as indexing tags. Added to that, the tags already used by the author receive a higher retrieval status value and are thus recommended more readily than the other extracted tags. This procedure is problematic, as the Matthew Effect can negatively influence the indexing process at this point. The recommender system ‘Plurality’ (Graham, Eoff, & Caverlee, 2008) offers its users personalized tag recommendations by incorporating the context into its calculations. This means that the user can choose from various sources (all resources from del.icio.us and all resources tagged by the user) and use different filters (chronological, geographical) for the resources comparison and the extraction of tag candidates. Thus the system only offers relevant recommendations. Graham, Eoff and Caverlee (2008) provide the following example for clarification: For example, the [...] blog entry about ‘the high price of oil’ was written in 2005. Plurality’s tag suggestions in this case are drawn from a recent crawl of del.icio.us, so some of the tag suggestions are temporally relevant to the original blog entry, e.g., ‘iraq’, ‘war’, and ‘bush’. One of the goals of the Plurality project is to tag archival content; hence, a 1970s document referencing the ‘high price of oil’ could be tagged ‘jimmy carter’ and ‘opec’ (Graham, Eoff, & Caverlee, 2008, 1166).
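The AutoTag procedure described above (retrieve similar posts, count their tags, favor tags the author has already used) can be sketched as follows. The word-overlap similarity stands in for the retrieval model of the search engine, and the `boost` factor stands in for the raised retrieval status value; both are assumptions, as are all names in the example.

```python
from collections import Counter


def suggest_tags(post_words, corpus, author_tags=frozenset(),
                 top_posts=2, n=3, boost=1.5):
    """Suggest tags for a post from the tags of its most similar posts.

    corpus: list of (words, tags) pairs for already tagged posts (assumed format).
    """
    def overlap(words):
        union = post_words | words
        return len(post_words & words) / len(union) if union else 0.0

    # Step 1: rank the tagged posts by similarity to the new post.
    similar = sorted(corpus, key=lambda post: overlap(post[0]), reverse=True)
    # Step 2: count the tags of the top posts (simple frequency).
    scores = Counter()
    for _, tags in similar[:top_posts]:
        scores.update(tags)
    # Step 3: tags the author has used before receive a higher score.
    for tag in scores:
        if tag in author_tags:
            scores[tag] *= boost
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


corpus = [
    ({"oil", "price", "opec"}, {"economy", "energy"}),
    ({"oil", "price", "war"}, {"energy", "politics"}),
    ({"cat", "photo"}, {"pets"}),
]
post = {"oil", "price", "rise"}
print(suggest_tags(post, corpus))
print(suggest_tags(post, corpus, author_tags={"politics"}))
```

The second call shows the self-reinforcing effect criticized in the text: a tag the author already uses moves up the list regardless of how well it fits the new post.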
Since the tagging of a resource can also be regarded as the referencing of a user to this resource, the tag-resource connections can be used for tag recommendations. This procedure resembles the recommending of co-occurring tags, but in this case it refers directly to the resource level and not, as in the first point, to the platform-specific folksonomy. The exploitation of neighborhood in a social network regards user similarity via co-tagged resources or co-used tags and can make suggestions for tag variants to the searching user. Kautz, Selman and Shah (1997) define social networks as follows: “A social network is modelled by a graph, where the nodes
represent individuals, and an edge between nodes indicates that a direct relationship between the individuals has been discovered” (Kautz, Selman, & Shah, 1997, 64). The edges between the nodes (users) are represented via tags or resources. Alani et al. (2003) use social networks to offer the users synonyms in particular: “For example, you might refer to the same object or concept with different names [...]. When the measure passes some threshold, it proves that the two instances, although represented by different names, are identical” (Alani et al., 2003, 24). Wang and Davison (2008) use a variant of the Pseudo-Relevance feedback in order to enrich the user’s search request with tags. They analyze the tags from an initial hit list of resources and offer these to the user for the purpose of query expansion. The authors suggest using the top 8 tags from this list for query expansion, where the tags’ ranking is calculated via their popularity per website. At the same time, Wang and Davison (2008) are alerted to possible homonyms etc. via the tags: “For instance, among the top 10 results of query tomato, if 5 documents are talking about movie reviews and another 5 are talking about vegetable and food, we can conclude that the query tomato has at least two meanings” (Wang & Davison, 2008, 47). In a small user study, Wang and Davison (2008) are able to demonstrate that 50% of query expansions can be performed with this method and that the users thus regard this approach as relevant. (Semi-)automatic query expansions can also serve to personalize the search results, or to adjust them to the searching user. Carman, Baillie and Crestani (2008) personalize search and search request in two different ways. They use both the user’s allocated tags and the saved bookmarks’ full-texts. On the one hand, the search request can be expanded via one’s own tags or terms from the saved bookmarks’ full-texts in order to specialize the search request. 
On the other hand, the search results can be re-ranked after retrieval by comparing the found resources with one’s own tags or terms from the saved bookmarks. The authors elect to use the first alternative, since it can push resources that would not even have been retrieved by the other method up to higher ranks. In evaluating their approach, Carman, Baillie and Crestani (2008) find that query expansion via terms from saved bookmarks achieves more relevant search results than query expansion via one’s own tags. In order to be able to exploit these tags for personalizing the search results anyway, they suggest this:
Good results were also achieved using profiles that combined tag data with the content of bookmarked documents. For these profiles, document content was used to populate the profile while tag similarity was used to weight the contribution of each bookmark (Carman, Baillie, & Crestani, 2008, 34).
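The tag-based query expansion described for Wang and Davison (2008) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `expand_query_with_tags` is hypothetical, and "popularity" is simplified here to the number of resources in the initial hit list that carry a tag, which may differ from the authors' actual ranking.

```python
from collections import Counter


def expand_query_with_tags(query_terms, hit_list_tags, k=8):
    """Suggest up to k tags from an initial hit list for query expansion.

    query_terms:   terms of the original search request
    hit_list_tags: one list of tags per resource in the initial hit list
    k:             number of suggestions (the text mentions the top 8 tags)
    """
    counts = Counter()
    for tags in hit_list_tags:
        # Count each tag once per resource (assumed popularity measure).
        counts.update(set(tags))
    # Most popular first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Do not suggest terms the user has already entered.
    return [tag for tag, _ in ranked if tag not in query_terms][:k]


hits = [
    ["tomato", "recipe", "food"],
    ["tomato", "movie", "review"],
    ["recipe", "food", "garden"],
]
suggestions = expand_query_with_tags(["tomato"], hits, k=2)
```

The suggestions would then be offered to the user, who decides which of them to add to the search request.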
The more elaborate variant for tag recommender systems is based on alignment with pre-existing KOS. Since the KOS reflect paradigmatic relations, it is possible to create relational recommendations for query expansion in this way. This procedure is without a doubt much more complex, since it involves constant alignment with and constant updating of the KOS. The user is, however, offered the option of performing semantically correct modifications of the search request, and to incorporate hyperonyms, hyponyms or synonyms via these recommender systems. Al-Khalifa and Davis (2006) suggest, for the use of relations in information retrieval, that the user can refine search requests via drop-down menus during actual searches by selecting the desired relation (e.g. equivalence relation), thus expanding the search with synonyms. Query expansion via hyponyms is also addressed by
Christiaens (2006). The connection of folksonomies and traditional methods of knowledge representation moves information retrieval closer to searches with controlled vocabularies, but does not create too much effort for the user. This approach, as in knowledge representation, can be called Tag Gardening and will thus be discussed in depth further below. So far, there are only a few empirical studies on the use of recommender systems in folksonomies and on their use for information retrieval. Noteworthy is the investigation by Jäschke et al. (2007), which observes the effects of recommended search tags on retrieval performance (via BibSonomy and Last.fm). The Recall-Precision graph shows steady increases of the retrieval quality when using recommendation methods. Thus the study confirms the thesis formulated at the beginning: “From the user’s perspective, goal-oriented and user-specific recommendations are a bonus for the system, as are an intelligent resource (or tag, or user) ranking or an enhanced display of the folksonomy” (Jäschke et al., 2007)*.
Retrieval Effectiveness of Folksonomies
The retrieval effectiveness of folksonomies or tagging systems has also not been researched exhaustively. According to Chopin (2008), this is due to the as yet insufficient indexing of information resources with tags; search engines still have the upper hand because of their range: “A key drawback to tagging which is shown in the tag search is the problem of scale. Simply put, not enough people tag enough […]” (Chopin, 2008, 571). Furner (2007) believes that the reason is the lack of an evaluation system, which has led him to develop conceptual thoughts for such a system. Morrison (2008) compares the retrieval effectiveness of search engines (Google, Microsoft Live, AltaVista), web catalogs (Yahoo, Open Directory Project) and folksonomies (del.icio.us, Furl, Reddit) via the Recall and Precision values as well as matches between hit lists. The test subjects submit their own queries to a meta search engine covering the services mentioned above and then determine the relevance of the search results (20 hits per retrieval tool) with a simple yes/no statement. The evaluation of the results yields the following: with regard to the search results’ precision, hits from del.icio.us were nearly equally relevant to hits from the search engine Live, “which shows that a folksonomy can be as precise as a major search engine” (Morrison, 2008, 1571); with regard to recall, it transpires that the search engines have a clear upper hand due to their automatic indexing of huge amounts of data, yet folksonomies and web catalogs have similar recall values. If both values are combined, the following picture emerges: search engines have the highest recall and precision values, web catalogs are more precise than folksonomies, but have similar recall values. Morrison (2008) also observes the correlation of retrieval effectiveness and query type. 
Here he finds out that folksonomies are less suited for questions about facts and searches for specific websites: “The folksonomies seemed to be least suited to searches for a specific item” (Morrison, 2008, 1574). On the other hand, folksonomies are well equipped for searches for current news, new products or neologisms, since they directly reflect user behavior and user language and are thus able to react flexibly. The alignment with search engines also reveals that folksonomies do not yet employ particularly elaborate search mechanisms. Thus folksonomies seem to be very strict in following
the AND link of search terms and only yield resources that actually contain all search terms as tags. Search engines often loosen up the AND link if there are too few hits and yield alternative results, with fewer matching search terms. Kipp (2007b; see also Kipp, 2008) conducts a small study in which the subjects are supposed to search for five resources, in a database with professionally indexed resources (PubMed and MeSH) on the one hand, and in a collaborative information service employing a folksonomy (CiteULike) on the other. Here Kipp (2007b) observes the users with screen capture software that records cursor movements and users’ entries, conducts a ‘Think Aloud protocol’ that records users’ verbal musings, and interviews the users on the different search terms’ usefulness after the search. The result is that subjects preferred the system they used first and needed 4 to 5 search terms on average to reach their research goal on CiteULike, and 1 to 4 on PubMed. Gruzd (2006) investigates whether folksonomy-based information retrieval systems yield more relevant search results than traditional web searches. Here he checks how well the indexing terms, i.e. tags or descriptors generated from full-text clusters, can differentiate between different resources and how specific they are. Gruzd (2006) concludes that there is no difference between tags and descriptors in terms of their discriminatory power, but that folksonomies index more specifically than a simple ‘Bag-of-words’ (Gruzd, 2006) from the full-text. Hassan-Montero and Herrero-Solana (2006) proceed similarly and analyze folksonomies with regard to their tagging effectiveness. They define tagging effectiveness as follows:
Tagging effectiveness can be measured by means of two related parameters: term/tag specifity and indexing/tagging exhaustivity. These two variables indicate the number of resources described by one tag, and the number of tags assigned to one resource respectively (Hassan-Montero & Herrero-Solana, 2006).
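The two parameters named in this definition can be counted directly from a set of tag assignments. The following is a minimal sketch; the function name `tagging_effectiveness` and the representation of the data as (resource, tag) pairs are assumptions made for illustration only.

```python
from collections import defaultdict


def tagging_effectiveness(assignments):
    """Count the two parameters from (resource, tag) pairs.

    Returns two dictionaries:
      resources_per_tag: tag -> number of resources described by that tag
                         (a low number indicates a specific tag)
      tags_per_resource: resource -> number of tags assigned to it
                         (a high number indicates exhaustive tagging)
    """
    tag_to_resources = defaultdict(set)
    resource_to_tags = defaultdict(set)
    for resource, tag in assignments:
        tag_to_resources[tag].add(resource)
        resource_to_tags[resource].add(tag)
    resources_per_tag = {t: len(r) for t, r in tag_to_resources.items()}
    tags_per_resource = {r: len(t) for r, t in resource_to_tags.items()}
    return resources_per_tag, tags_per_resource


pairs = [
    ("url1", "web"), ("url1", "design"),
    ("url2", "web"),
    ("url3", "web"), ("url3", "typography"), ("url3", "design"),
]
per_tag, per_resource = tagging_effectiveness(pairs)
```

In this toy data, "web" describes three resources and is therefore the broadest tag, while "typography" describes only one.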
On the basis of other studies (Brooks & Montanez, 2006b; Golder & Hubermann, 2006; Voß, 2006), Hassan-Montero and Herrero-Solana (2006) conclude that users do not index exhaustively enough – 90% of users index with fewer than five tags (Voß, 2006) – and that they tend, furthermore, to index with general tags (Brooks & Montanez, 2006b). Thus folksonomies seem to be better suited for browsing than for direct searches. Both retrieval options are, however, dependent on the tags used: “In terms of traditional IR [Information Retrieval, A/N], broader tags entail high recall and low precision, whereas narrower tags entail low recall and high precision. [...] In other words, browsing and querying respectively” (Hassan-Montero & Herrero-Solana, 2006). An elaborate study on users’ tagging and retrieval behavior within del.icio.us and web search engines is conducted by Heymann, Koutrika and Garcia-Molina (2008). They find that collaborative information services have the upper hand over web search engines in terms of internet coverage: “Approximately 25% of URLs posted by users are new, unindexed pages. [...] ‘new’ in the sense that they were not yet indexed by a search engine at the time they were posted to del.icio.us” (Heymann, Koutrika, & Garcia-Molina, 2008, 199). They identify the reasons for this as the fact that websites may be indexed under another URL by the search engines, be spam, have an unindexable format (e.g. Flash) or have not yet been discovered by the search engines (Heymann, Koutrika, & Garcia-Molina, 2008, 199). On the other hand, the authors discover that “roughly 9% of results for search queries are URLs
present in del.icio.us. del.icio.us URLs are disproportionately common in search results compared to their coverage” (Heymann, Koutrika, & Garcia-Molina, 2008, 199). According to them, del.icio.us has roughly 30 to 50 million URLs in stock at the moment. Another interesting result of their study concerns the tags themselves:
Popular query terms and tags overlap significantly [...]. One important question is whether the metadata attached to bookmarks is actually relevant to web searches. That is, if popular query terms often appear as tags, then we would expect the tags to help guide users to relevant queries. [...] while there was reasonable degree of overlap between query terms and tags, there was no positive correlation between popular tags and popular query terms (Heymann, Koutrika, & Garcia-Molina, 2008, 200).
This should mean that users may get results in collaborative information services while searching with popular search terms, but that tagging users do not consciously index for search purposes. This result again confirms the assumption that folksonomies are mainly used for personal resource management. Furner (2007) emphasizes the importance of indexer–searcher consistency for the effectiveness of retrieval systems in his conception of an evaluation method for tagging systems:
There appears to be broad consensus, nonetheless, that indexer–searcher consistency—the degree to which indexers and searchers agree on the subjects and concepts that given resources are considered to be “about,” and on the combinations of terms that are used to express given subjects and concepts— is a fairly robust indicator of retrieval effectiveness. The assumption is that, if indexers are able successfully to predict those terms that will be used by searchers in future queries to which the resources being indexed are relevant, then levels of retrieval effectiveness will be correspondingly high. Historically, observations of the correlation between indexer–searcher consistency and retrieval effectiveness have been used as evidence in support of the provision, both for indexers and searchers, of access to structured and controlled vocabularies of various kinds (Furner, 2007).
Thus Bischoff et al. (2008) also investigate whether tagging vocabulary and search vocabulary match. They use tags from the collaborative information services del.icio.us, Flickr and Last.fm as well as AOL’s query logs. The authors initially check how many search terms from a traditional web search match the information services’ tags and arrive at the following conclusion:
Regarding Del.icio.us, 71.22% of queries contain at least one Del.icio.us tag, while 30.61% of queries consist entirely of Del.icio.us tags. […] For Flickr and Last.fm the numbers are 64.54% and 12.66%, and 58.43% and 6%, respectively (Bischoff et al., 2008, 209).
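The two shares reported in this quotation (queries containing at least one tag, and queries consisting entirely of tags) can be computed along the following lines. This is a hedged sketch, not the authors' procedure: the function name `query_tag_overlap`, whitespace tokenization and lowercase matching are all assumptions.

```python
def query_tag_overlap(queries, tags):
    """Return (share with at least one tag, share consisting only of tags).

    queries: iterable of query strings (e.g. from a search engine log)
    tags:    iterable of tags from a tagging system
    Terms are compared in lowercase after splitting on whitespace
    (a simplifying assumption).
    """
    tag_set = {t.lower() for t in tags}
    total = at_least_one = entirely = 0
    for query in queries:
        terms = query.lower().split()
        if not terms:
            continue  # ignore empty queries
        total += 1
        matches = [term in tag_set for term in terms]
        at_least_one += any(matches)
        entirely += all(matches)
    if total == 0:
        return 0.0, 0.0
    return at_least_one / total, entirely / total


partial, full = query_tag_overlap(
    ["python tutorial", "cheap flights", "web design"],
    ["python", "web", "design"],
)
```

Here two of three queries contain at least one tag, and one query consists entirely of tags.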
Next they compare the search terms with the most commonly occurring tag categories per resource type (see chapter three). Here they observe many intersections. The most commonly indexed tag category on del.icio.us is ‘topic’, and tags of this category are searched most frequently. ‘Topic’ is the most frequently indexed and searched tag category on Flickr, too. There are some small differences in the tag categories ‘location’ and ‘author/owner’: persons are more often searched than indexed, while locations are more often indexed than searched. Bischoff et al. (2008) observe a particularity, however:
The biggest deviation between queries and tags occurs for music queries. While our tags in Last.fm are to a large extent genre names, user queries belong to the Usage context category (like “wedding songs” or “graduation songs”, or songs from movie or video games […]). […] An interesting and surprising observation is that searching by genre is rare: Users intensively use tags from this category, but do not use them to search for music (Bischoff et al., 2008, 209).
The authors assume that the large number of music resources indexed with ‘genre’ tags leads to overly large hit lists, which scares off users. A similar procedure for comparing search engines (MSN and Google) and folksonomies, or social bookmarking systems (del.icio.us), is chosen by Krause, Hotho and Stumme (2008). First they determine how far the search terms extracted from the search engines match the indexing tags of the social bookmarking service, in order to find out the usage similarity of both systems. Here they observe that far more terms, and more varied ones, are used for searches in search engines than for indexing in del.icio.us. Thus they conclude: “These numbers indicate that Del.icio.us users focus on fewer topics than search engine users […]” (Krause, Hotho & Stumme, 2008, 104). The overlap between search and indexing terms amounts to only a quarter of all of del.icio.us’ tags. The analysis of the non-matching terms revealed that this is mainly due to taggers’ unusual compounding habits (e.g. ‘artificial_intelligence’). Since it is not a requirement for search engines to enter one-word search tags, there can be no match here. Another interesting result is that search tags often represent parts of URLs that are not used for indexing in this precise way by the users. With regard to search engines’ and social bookmarking services’ degree of coverage, and to the resource ranking, Krause, Hotho and Stumme (2008) observed a relatively large match:
both traditional search engines and folksonomies focus on basically the same subset of the Web. […] the top entries of search engine rankings are – in comparison with the medium ranked entries – also those which are judged more relevant by the Del.icio.us users (Krause, Hotho, & Stumme, 2008, 108).
The authors assume that this may have to do with the fact that taggers often use search engines to find interesting resources, and then add these to the social bookmarking system. The investigation of different ranking algorithms (algorithms of AOL, Google and MSN, the Vector-Space model with TF*IDF and cosine as well as the FolkRank) revealed that “The two major search engines [MSN and Google, A/N] therefore seem to have more in common than folksonomies [Vector-Space model and FolkRank, A/N] with search engines” (Krause, Hotho & Stumme, 2008, 109). After the mass of del.icio.us data to be analyzed was restricted to rankings with at least 20 URLs in common, it was shown that the FolkRank’s top hits are closer to the search engines’ top hits than to the Vector-Space model’s. Chi and Mytkowicz (2006) conduct a study of the retrieval effectiveness of folksonomies on the basis of Claude Shannon’s (1948) information theory. This theory assumes that each digit possesses a certain degree of entropy influencing the uncertainty concerning a source: the more digits are received from a source, the larger is its information content and the less uncertainty prevails concerning its information content (Stock, 2007a, 17ff.). A digit’s entropy is calculated via its probability of occurring, where the information content of a digit
becomes larger the more seldom it occurs. The information-theoretical analysis of del.icio.us’ folksonomy revealed that 1) more and more tags are used in the folksonomy in order to describe more and more resources – i.e., the information content of the tags decreases, since the same tags must describe a rising mass of resources and thus produce too much recall; 2) it becomes more difficult for the users to describe a resource unambiguously, in order to guarantee its retrievability, since the growing folksonomy leads to the growing probability of re-indexing an already indexed tag; 3) the number of different users in the system and the total number of resources may rise, but that there are large intersections in the saved resources; i.e., that the information content of the resource stock decreases and it becomes more and more difficult to recommend resources to users; 4) the growing folksonomy and mass of resources makes localizing ‘topic experts’ harder, since too similar tagging behavior of the users leads to the tags’ information content decreasing; 5) more and more tags must be attached to a resource in order to clearly describe it. As their conclusion, Chi and Mytkowicz (2006) summarize: Also, the collective of users on del.icio.us is increasingly having a harder time in tagging documents in del.icio.us. They are less certain what tags should be used to describe documents. A piece of evidence that is consistent with this observation is that users appear to have responded to this evolutionary pressure by increasing the average number of tags per document over time. […] We can see from our analysis that the vocabulary that is emerging in del.icio.us is becoming less efficient. This is somewhat understandable, since the amount of information being introduced into the system is growing at an extremely fast pace according to a power law. 
As the tags that people use to describe these documents become more saturated, they need to either find new words or they need to use more words to describe the contents. Indeed, overall we are seeing more and more new tags being introduced into the system. However, since only so many words are applicable to describing the content of a document, users cope by increasing the average number of tags they use (Chi & Mytkowicz, 2006).
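The basic Shannon quantities underlying this analysis can be illustrated with a short sketch. It shows only the information content of a single event, −log2 p, and the entropy of a tag distribution; it is not the full analysis of Chi and Mytkowicz (2006), and the function names are hypothetical.

```python
import math
from collections import Counter


def information_content(probability):
    """Information content in bits: rarer events carry more information."""
    return -math.log2(probability)


def tag_entropy(tag_assignments):
    """Shannon entropy (in bits) of the distribution of tag assignments.

    tag_assignments: list with one entry per tagging action,
                     e.g. ["web", "web", "design"].
    """
    counts = Counter(tag_assignments)
    total = sum(counts.values())
    probabilities = [count / total for count in counts.values()]
    return sum(p * information_content(p) for p in probabilities)


uniform = tag_entropy(["a", "b", "c", "d"])   # four equally likely tags
skewed = tag_entropy(["a", "a", "a", "b"])    # one dominant tag
```

A tag that is applied to almost every resource has a probability close to 1 and thus an information content close to 0 bits, which corresponds to the observation that heavily used tags discriminate poorly.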
Visualizations of Folksonomies
Folksonomies’ ability to support browsing through information platforms rather than specific searches via search masks and queries is often seen as one of their great advantages. An effective and easily manageable browsing tool needs different design criteria than a simple search instrument. This is why, particularly in collaborative information services, possibilities of visualizing folksonomies have developed that go far beyond an alphabetical or list-oriented arrangement of the tags (Hassan-Montero & Herrero-Solana, 2006). In the visualization of folksonomies, it is important that the user get an impression of the information platform in its entirety and find entry points for his browsing activities. Graphical interfaces are one way of providing this, as Hassan-Montero and Herrero-Solana (2006) emphasize: “visual interfaces provide a global
view of tags or resources collection, a contextual view” (Hassan-Montero & Herrero-Solana, 2006). Furthermore, browsing-compatible visualizations should reflect an order (Storey, 2007) in order to allow users to start searching more easily: “The key in building an effective exploration space seems to be able to group and show related items and to explain how the items are related” (Begelman, Keller, & Smadja, 2006). The best-known visualization method for folksonomies is the tag cloud (Begelman, Keller, & Smadja, 2006; Sinclair & Cardew-Hall, 2008; Bateman, Gutwin, & Nacenta, 2008; Viégas & Wattenberg, 2008). The online merchant Amazon calls them ‘concordances’ (Dye, 2006, 41), the British newspaper ‘The Guardian’ terms them the ‘folksonomic zeitgeist’ (McFedries, 2006). Tag clouds consist of either the tags of a selected information resource or of the entire information platform’s tags. The particularity of tag clouds is that they are arranged alphabetically, yet some tags stick out due to differing font size (see Figure 4.4).
Figure 4.4: Tag Cloud of the Social Networking Website ‘43Things.com.’ Source: http://www.43things.com/zeitgeist (08/26/2008).
The size of a tag is determined by its popularity on the resource or platform level (Bateman, Gutwin, & Nacenta, 2008), i.e. the more the tag has been allocated the larger it is displayed: “The equation is pretty simple – as tags grow in popularity, they get bigger; as they fall out of favor, they become smaller, until they eventually disappear from the cloud” (Dye, 2006, 41). Tag clouds thus visualize three dimensions of the folksonomy at the same time: tags, tag relevance and the alphabetical arrangement of the folksonomy (Hearst & Rosner, 2008). This visualization method is capable of providing the user with access to the content of the information resource or platform at first glance, via the tags’ semantics and their gradation according to frequency: They provide a global contextual view of tags assigned to resources in the system (Quintarelli, Resmini, & Rosati, 2007a, 11).
These tag clouds function as aggregators of activity being carried out by thousands of users, summarizing the action that happens beneath the surface of socially oriented websites (Viégas & Wattenberg, 2008, 50).
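The construction principle of a tag cloud described above (alphabetical order, font size by popularity, restriction to the most popular tags) can be sketched as follows. The linear scaling between a minimum and a maximum font size is an assumption; actual services may scale differently, for instance logarithmically. The function name and parameters are hypothetical.

```python
def tag_cloud(tag_counts, max_tags=5, min_size=10, max_size=32):
    """Return (tag, font_size) pairs in alphabetical order.

    tag_counts: dict mapping tag -> number of times it was allocated
    max_tags:   only the most popular tags are displayed
    min_size, max_size: font sizes for the least and most popular
                        displayed tag (linear scaling is assumed)
    """
    # Select the most popular tags; ties are broken alphabetically.
    top = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_tags]
    if not top:
        return []
    lowest = min(count for _, count in top)
    highest = max(count for _, count in top)
    span = highest - lowest
    cloud = []
    # Display order is alphabetical, not by popularity.
    for tag, count in sorted(top, key=lambda kv: kv[0].lower()):
        if span == 0:
            size = (min_size + max_size) / 2
        else:
            size = min_size + (count - lowest) / span * (max_size - min_size)
        cloud.append((tag, round(size)))
    return cloud


cloud = tag_cloud({"web": 30, "art": 10, "music": 20, "zen": 1},
                  max_tags=3, min_size=10, max_size=30)
```

In this example the rarely used tag "zen" drops out of the cloud entirely, which mirrors the behavior described by Dye (2006) for tags that fall out of favor.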
Russell (2006) sees tag clouds as an analogy to classical book shelves: “Collectively, the users of the site have given a URL a space on the virtual bookshelf – classifying its content and what it seems to be about” (Russell, 2006). It is a great advantage of tag clouds that this collective determination of the content of an information resource is directly accessible. Furthermore, trends in language use (Hearst & Rosner, 2008) and points of interest (Fichter, 2006) within the information platform or a community can be observed via tag size: “Tag clouds as social information in terms of how the tag cloud shows the topics of interest to a community” (Hearst & Rosner, 2008). Since collaborative information services make more and more content accessible to users, Viégas and Wattenberg (2008) see a great user need for structured forms of representing textual resources in particular. Furthermore, tag clouds are a “marker of social or individual interaction with the contents of an information collection” (Hearst & Rosner, 2008). Viégas and Wattenberg (2008) assume that this is also the reason for the success of tag clouds. They provide a ‘friendly’ atmosphere, make visible the people behind the platform and invite users to browse through the website: “In a sense, these clouds may act as individual and group mirrors which are fun rather than serious and businesslike. Indeed, this all-world visualization is a diagram that even a mathophobe can love” (Viégas & Wattenberg, 2008, 52). Since tag clouds are mostly restricted to a certain number of tags, the user can quickly register the most important tags and use them as a browsing entry without having to peruse and scroll through long lists first (Hearst & Rosner, 2008). Tag clouds are thus an extremely fun and convenient browsing instrument (Hearst & Rosner, 2008). In particular the option of quickly registering the information resource’s context accounts for tag clouds’ attraction and popularity: “With a tag cloud [...] 
users won’t get a specific result, but what they will get is a very specific context of a subject. It saves a lot of reading. The content is the tag” (Dye, 2006, 42). Apart from these important advantages, tag clouds also have flaws. Thus the alphabetical arrangement of the tags often goes unnoticed by the users (Hearst & Rosner, 2008). The users mainly orient themselves on the popular, large-type tags and disregard others as browsing entries. On the other hand, tag clouds do not provide any visual constants, neglecting the ‘natural visual flow’ (Hearst & Rosner, 2008) that facilitates the registering of a text. Furthermore, the tags’ structure allows no conclusions to be drawn concerning relations or other semantic systematics between them, so that tag clouds ignore a potent retrieval functionality: “Alphabetical arrangements of displayed tags neither facilitate visual scanning nor enable to infer semantic relation between tags” (Hassan-Montero & Herrero-Solana, 2006). In the tag clouds, similar terms or descriptions can be located quite far apart from each other, so that important associations go unmade or even false links are drawn up (Hearst & Rosner, 2008): “‘East’ is close to ‘Easter’ but far from ‘west’” (Viégas & Wattenberg, 2008, 51). Quintarelli, Resmini and Rosati (2007a) summarize the disadvantages of tag clouds resulting from the lack of a systematic arrangement:
The problem with this approach is that flat tag clouds are not sufficient to provide a semantic, rich and multidimensional browsing experience over large tagging spaces. There are several reasons for this: Tag Clouds don’t help much to address the language variability issue, so the findability quotient and scalability of the system are very low. Choosing tags by frequency of use inevitably causes a high semantic density with very few well-known and stable topics dominating the scene. Providing only an alphabetical criterion to sort tags heavily limits the ability to quickly navigate, scan and extract and hence build a coherent mental model out of tags. A flat tag cloud cannot visually support semantic relationships between tags. We suggest that these relationships are needed to improve the user experience and general usefulness of the system (Quintarelli, Resmini, & Rosati, 2007a, 11).
Hassan-Montero and Herrero-Solana (2006) want to avoid these disadvantages by no longer arranging tags alphabetically but according to semantic aspects. The tags’ semantic closeness is also supposed to be represented through visual proximity:
The display method is similar to traditional Tag-Cloud layout, with the difference that tags are grouped with semantically similar tags, and likewise clusters of tags are displayed near semantically similar clusters. Similar tags are horizontally neighbours, whereas similar clusters are vertically neighbours (Hassan-Montero & Herrero-Solana, 2006).
The tags’ semantic similarity is determined via the Jaccard-Sneath coefficient, and the clusters are formed via similarity calculations in the Vector-Space model. This new tag arrangement is meant to facilitate browsing:
Alphabetical-based schemes are useful for know-item searching, i.e. when user knows previously what tag he is looking for, such as when user browses his personal Tag-Cloud. [...] Our proposal for Tag-Cloud layout is based on the assumption that clustering techniques can improve Tag-Clouds’ browsing experience (Hassan-Montero & Herrero-Solana, 2006).
Begelman, Keller and Smadja (2006) also see great advantages in the formation of tag clusters and the resulting improvements of browsing options: “Clustering makes it possible to present a guidepost, to provide the means that allow the user to explore the information space” (Begelman, Keller, & Smadja, 2006). Knautz (2008) also introduces an approach that evaluates syntagmatic relations in folksonomies and calculates co-occurrence relations between tags using the established similarity algorithms (e.g. Dice or Jaccard-Sneath; see also chapter two). The goal is to find tag clusters and thus improve the visualization of folksonomies by making visible the hidden semantic relations between tags. The basis of similarity algorithms is always the frequency of co-occurring tags within the folksonomies. The results of the calculations lie in the value range [0, 1], where the value 0 states that there is no similarity between the two tags and 1 states that there is a maximum match. It is important to stress at this point that the similarity of two tags is determined on a purely statistical basis. There is no alignment with KOS, so it cannot be assumed that maximum matches are also semantic synonyms. Based on
the calculated similarity values for all tag pairs of the database, Knautz (2008) now uses cluster-forming procedures, such as the Single-Link, Complete-Link or GroupAverage procedures.
Figure 4.5: Tag Cluster Generated via the Single-Link Procedure. Source: Knautz (2008, 276, Fig. 6).
Figure 4.5 displays a cluster example for a database with five resources that has been formed via the Single-Link procedure and connects all tag pairs possessing a similarity value of ≥0.8. The tag cluster consists of undirected graphs. The tags’ font size here represents the tags’ frequency of distribution within the database, and the thickness of the connecting lines between the tags represents their similarity (corresponding to the calculated similarity values): thicker lines mean greater similarity. Knautz (2008) further suggests the enhancement of the approach with weighting values for the tags and a form of syntactic indexing. In syntactic indexing, the user sets the tags into relations to one another during indexing; this can be performed via field suggestions. Not only the individual tags are then taken into consideration for the similarity calculations, but also their occurrence in the indexing chains thus formed.
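The combination of co-occurrence-based similarity and Single-Link clustering can be sketched as follows. This is an illustration under assumptions and not Knautz's (2008) implementation: each tag is represented by the set of resources it was assigned to, and Single-Link clusters at a fixed threshold are obtained as the connected components of the graph that links every tag pair reaching that threshold. All function names and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard-Sneath coefficient of two resource sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient of two resource sets, in [0, 1]."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def single_link_clusters(tag_resources, threshold=0.8, sim=jaccard):
    """Group tags that are linked by a chain of pairs with sim >= threshold."""
    parent = {tag: tag for tag in tag_resources}

    def find(tag):
        # Follow parent links to the cluster representative.
        while parent[tag] != tag:
            parent[tag] = parent[parent[tag]]
            tag = parent[tag]
        return tag

    for t1, t2 in combinations(tag_resources, 2):
        if sim(tag_resources[t1], tag_resources[t2]) >= threshold:
            parent[find(t1)] = find(t2)  # merge the two clusters

    clusters = {}
    for tag in tag_resources:
        clusters.setdefault(find(tag), set()).add(tag)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical data: tag -> set of resource ids it was assigned to.
tags = {
    "mars": {1, 2, 3, 4, 5},
    "nasa": {1, 2, 3, 4, 5},
    "phoenix": {1, 2, 3, 4},
    "cooking": {6},
}
clusters = single_link_clusters(tags, threshold=0.8)
```

As the surrounding text stresses, such clusters are purely statistical: "mars" and "nasa" reach the maximum value of 1 here without being synonyms.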
Figure 4.6: Tag Cloud and Tag List for the Bookmark http://www.phoenix.lpl.arizona.edu/ 06_19_pr.php. Source: del.icio.us (06/20/2008).
The representation of tags as dependent on their indexing frequency is another critical point of tag clouds. Tags with equal or similar font size are hard to compare or differentiate, so that the user can have difficulty determining the most relevant tags (Hearst & Rosner, 2008). A tag’s popularity can be stated in terms of numbers
to provide some relief to this problem, which would, however, turn the tag cloud back into a normal list or chart. Figure 4.6 shows an example from del.icio.us. Furthermore, the most popular tags are not always the most relevant, so that the tag cloud may reflect a false picture of the resource or the platform. Sen et al. (2007) emphasize that quantity must not be the sole design criterion:
Due to limited screen space, many systems can only display a few tags from among the many users have applied. With all these challenges, how should del.icio.us select the five tags it displays for the website digg from among the 10,688 unique tags users have applied to it? (Sen et al., 2007, 362).
The combination of relatively long words and large type can also make assessing a tag’s relevance more difficult, as Hearst and Rosner (2008) point out: “indicating a word’s importance by size causes problems. The length of the word is conflated with its size, thus making its importance seem to be a function part of the number of characters it contains” (Hearst & Rosner, 2008). The effects of font size on the users’ power of judgment become noticeable in the tag clouds’ subject coverage. Less popular subjects, such as the tags in the Long Tail, receive less attention from the users and are thus less often used for indexing and search (Hearst & Rosner, 2008). Due to the Matthew Effect, this procedure promotes the tags’ uniformity and obstructs the generation of semantic variety in the tags and descriptions. Furthermore, this effect makes the precise retrieval of information more difficult: “In terms of discrimination value, the most frequently-used terms are the worst discriminators” (Hassan-Montero & Herrero-Solana, 2006). Since the user can only ever search for one tag at a time in tag clouds, the clouds’ retrieval power is fairly restricted (Hearst & Rosner, 2008). If we consider all disadvantages of tag clouds, we are left to conclude that “spatial tag clouds are a poor layout compared to lists for information recognition and recall tasks” (Hearst & Rosner, 2008). Not for nothing do Viégas and Wattenberg (2008) object:
So there’s a puzzle: If tag clouds don’t provide quantifiable benefits and if people are unaware of how items are organized in the visualizations, how and why are tag clouds being used? [...] One might say that tag clouds work in practice, but not in theory (Viégas & Wattenberg, 2008, 52).
Hearst and Rosner (2008) therefore recommend spending time and effort for research and design guidelines to improve tag clouds as a retrieval tool: The limited research on the usefulness of tag clouds for understanding information and for other information processing tasks suggests that they are (unsurprisingly) inferior to a more standard alphabetical listing. This could perhaps be remedied by adjusting white space, font, and other parameters, or by more fundamentally changing the layout (Hearst & Rosner, 2008).
Sinclair and Cardew-Hall (2008), Rivadeneira et al. (2007) and Bateman, Gutwin and Nacenta (2008) follow up on this request. Sinclair and Cardew-Hall (2008) set out to investigate the retrieval effectiveness of tag clouds empirically. They initially had their subjects index several scientific articles with tags and then let them answer search requests of varying specificity in the resulting tagging system. The subjects could use either a search field or a tag cloud as a retrieval tool. Clicking on a tag in the cloud yielded a hit list with all articles indexed with that tag. On the hit list’s page
Information Retrieval with Folksonomies
then appeared another tag cloud, which summarized these articles’ other tags. Clicking on one of those tags then refined the hit list with an AND link. The results of this study show that the tag cloud is mainly used as a retrieval tool in two cases: 1) far-reaching and unspecific search requests and 2) search requests in which the search tag is displayed in the tag cloud. Thus the assumption that tag clouds are mainly suited for browsing through the information platform is confirmed. Also, the search via tag clouds can better point out related articles or tags and thus reduces the user’s cognitive effort with regard to generating search terms. Concerning this point, it also becomes clear that users are more likely to use a tag cloud for search purposes when the desired search tag is displayed in the tag cloud than when it is not. One user concludes: “clicking is fast[er] than typing” (Sinclair & Cardew-Hall, 2008, 25). Tag clouds are also well suited for non-native speakers, since they provide them with helpful tag suggestions for their search. Using the search field is, on the other hand, better suited for a concrete information need and also requires fewer steps to find the relevant article: “In general, it appears that answering a question using the tag cloud required more queries per question than the search box” (Sinclair & Cardew-Hall, 2008, 24). The authors interviewed the subjects directly as to why they preferred which search tool. The results are displayed in Figure 4.7.
Figure 4.7: Reasons for Preferring the Search Tool ‘Search Field’ or ‘Tag Cloud.’ Source: Sinclair & Cardew-Hall (2008, 26, Table 7).
Just like Hearst and Rosner (2008), Sinclair and Cardew-Hall (2008) identify three important advantages of tag clouds: It is particularly useful for browsing or non-specific information discovery. The tag cloud provides a visual summary of the contents of the database. It appears that scanning the tag cloud requires less cognitive load than formulating specific query terms (Sinclair & Cardew-Hall, 2008, 27).
They identify as important disadvantages their unsuitability for solving specific search requests and the inaccessibility, due to their design concept, of some resources. After all, tag clouds exclude some resources from their search from the outset. Clicking on a tag in the initial tag cloud leads to hits indexed with this tag. Clicking on the next tag cloud refines the results with the next tag, i.e. creates an AND link between the two tags. Resources that have only been tagged with tag 2, though, can no longer be found in this way.
Left unmentioned by Sinclair and Cardew-Hall (2008) is the fact that resources, or tags, are also excluded from the search if there is no space for them or if they are not popular enough. Sinclair and Cardew-Hall (2008) conclude: In light of these disadvantages, it is clear that the tag cloud is not sufficient as the sole means of navigating a folksonomy dataset. Some form of search facility or supplementary classification scheme is necessary to expose all the articles (Sinclair & Cardew-Hall, 2008, 28).
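The drill-down behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article ids, the tags and the function names `hits` and `refinement_cloud` are hypothetical and do not come from the study by Sinclair and Cardew-Hall (2008).

```python
from collections import Counter

# Hypothetical toy index: article id -> set of tags (not data from the study).
ARTICLES = {
    "a1": {"folksonomy", "tagging"},
    "a2": {"folksonomy", "retrieval"},
    "a3": {"retrieval"},
    "a4": {"tagging", "retrieval", "folksonomy"},
}


def hits(selected_tags, articles=ARTICLES):
    """Return ids of articles that carry ALL selected tags (AND link)."""
    wanted = set(selected_tags)
    return sorted(a for a, tags in articles.items() if wanted <= tags)


def refinement_cloud(selected_tags, articles=ARTICLES):
    """Count the other tags of the current hit list for the follow-up cloud."""
    wanted = set(selected_tags)
    counts = Counter()
    for article_id in hits(wanted, articles):
        counts.update(articles[article_id] - wanted)
    return counts
```

In this toy index, a user who first clicks on ‘folksonomy’ and then refines with ‘retrieval’ can no longer reach article `a3`, which carries only the second tag. This is the exclusion effect discussed above.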
Rivadeneira et al. (2007) also test the retrieval effectiveness of tag clouds, but with a particular focus on their different design elements. They first find that tag clouds are mainly suited for the following (retrieval) tasks:
• Search: direct searches by clicking on the tag within the tag cloud.
• Browsing: exploring the website by following the links.
• Impression Formation or Gisting: getting an impression of the website and subjects covered by it.
• Recognition/Matching: recognizing and allocating single tag clusters within the tag cloud.
For their analysis, Rivadeneira et al. (2007) distinguish between two groups of design elements: a) elements that change the tag itself (‘text features’) and b) elements that concern the tags’ arrangement in the cloud (‘word placement’). The first group incorporates aspects such as font size, color and weight; the second group comprises tag sorting, clustering and arrangement, e.g. arrangement of the tags in continuous lines or as clusters (see Figure 4.8).
Figure 4.8: Cluster-Forming Tag Cloud. Source: Rivadeneira et al. (2007, 995, Fig. 1).
Rivadeneira et al. (2007) initially checked the influence of the design criteria with the help of test subjects, who were shown a tag cloud for 20 seconds and then had to reproduce the tags in that cloud. The authors found out that tags with a large font size could be better reproduced than smaller-type tags. Also, tags from the upper left quadrant of the tag cloud were more easily memorized than those from the other quadrants. A second test focused on the human interpretation of tag clouds and incorporated the different sorts of tag arrangements. Here it could be observed that simple arrangements of the tags as lists, sorted by tag popularity, achieved the best results. The tag clouds’ ‘content’ was registered even better if the tags’ list-like arrangement was complemented by larger font sizes for popular tags. Rivadeneira et al. (2007) make the following suggestions for the design and use of tag clouds: (a) locate smaller font words here [in the upper-left quadrant, A/N] to compensate for font size, while locating bigger font words in other quadrants; (b) locate tags that you want to emphasize in this quadrant [in the upper-left
quadrant, A/N]. The results from Experiment 2 imply that a simple list ordered by frequency may provide a more accurate impression about the tagger [or the resource’s content, A/N] than other tagcloud layouts (Rivadeneira et al., 2007, 998).
Bateman, Gutwin and Nacenta (2008) respond to the study by Rivadeneira et al. (2007), but add further design criteria of tag clouds for checking the ‘visual influence.’ The design criteria’s visual influence is defined as follows: We define ‘visual influence’ as the visual characteristics of the tag that draw a user’s attention. We can further refine this definition by saying that the visual influence of tags does not include the semantics of the word itself, since once the tags is read, the meaning of the tag word has a strong influence on whether or not that tag is chosen (Bateman, Gutwin, & Nacenta, 2008, 194).
Thus the authors only use artificially generated tags with no meaning, so as not to influence their test subjects in their choice. The subjects were asked to choose the 10 most important tags from differently designed tag clouds by clicking on them. The design aspects mainly concerned font size and weighting, color and intensity of the writing, the tags’ pixel quantity (the letter m has more pixels than the letter l), tag width (the space filled out by the tag), any given tag’s number of letters, the tag area (“exchange the font size mapping with that of the area of the tag” (Bateman, Gutwin, & Nacenta, 2008, 196)) and the position of the tag within the cloud. The authors conclude that
• the font size has the greatest influence on the users and lets tags appear to be very important,
• the font weighting and intensity, with regard to contrast, also have an important bearing on a tag’s perceived relevance,
• the number of a tag’s pixels, as well as its width and area, have little influence on users,
• the color of the writing has practically no effect, and
• the tag’s positioning influences its clicking frequency – tags at the top and bottom edges of the tag cloud were selected less often than tags in the center.
Since tag clouds display a large number of shortcomings, more and more visualization approaches are being introduced that are meant to facilitate improved browsing or direct search. ‘Del.icio.us Soup’ (Albrecht, 2006) visualizes tag similarity and works as a tag recommender system. The most popular tags are here marked via balloon size (see Figure 4.9). The tag similarity calculation is based on co-occurrences. The ‘del.icio.us graph’ likewise visualizes tag similarities (Albrecht, 2006) by displaying a tag’s ‘related tags’ based on the co-occurrences of tags on del.icio.us. The user can explore the relations between different tags by clicking on the nodes (see Figure 4.10).
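How ‘related tags’ can be derived from co-occurrences may be sketched as follows. This is a minimal illustration, not the procedure actually used by ‘Del.icio.us Soup’ or the ‘del.icio.us graph,’ whose details are not given here; the bookmark data and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical bookmarks: each entry is the tag set of one resource.
BOOKMARKS = [
    {"web2.0", "tagging", "folksonomy"},
    {"tagging", "folksonomy"},
    {"web2.0", "ajax"},
    {"folksonomy", "tagging", "search"},
]


def cooccurrence_counts(bookmarks):
    """Count how often each unordered tag pair is assigned to the same resource."""
    pairs = Counter()
    for tags in bookmarks:
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return pairs


def related_tags(tag, bookmarks, top_n=3):
    """Rank the tags that co-occur most often with `tag`."""
    related = Counter()
    for (a, b), n in cooccurrence_counts(bookmarks).items():
        if a == tag:
            related[b] += n
        elif b == tag:
            related[a] += n
    return related.most_common(top_n)
```

Raw co-occurrence counts favor popular tags; a real system would probably normalize them, which this sketch deliberately leaves out.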
Figure 4.9: Visualization Using ‘Del.icio.us Soup.’ Source: http://www.zitvogel.com/delicioussoup/screenshots.html (06/20/2008).
Figure 4.10: Visualization of ‘Related Tags’ Using ‘del.icio.us Graph.’ Source: http://www.hublog.hubmed.org/archives/001049.html (06/20/2008).
Figure 4.11: Visualization of the Personomy of the del.icio.us User ‘kevan’ in Extispicious. Source: http://www.kevan.org/extispicious.cgi?name=kevan (06/20/2008).
The application ‘Extispicious’ visualizes a user’s personomy via his del.icio.us account (Guy & Tonkin, 2006; see Figure 4.11). The service describes itself in these terms: “extisp.icio.us text gives you a random textual scattering of a user’s tags, sized according to the number of times that they’ve used each of them, and leaves you to draw your own insights from the overlapping entrails.”
Figure 4.12: Visualization of the del.icio.us Folksonomy via ‘SpaceNav.’ Source: http://www.ivy.fr/revealicious/demo/spacenav.html (06/20/2008).
The visualization tool ‘SpaceNav’ represents relations between single tags rather empirically. As displayed in Figure 4.12, the tag ‘information’ was applied to 9% of
resources on del.icio.us and has co-occurred with 19% of the resource folksonomy’s other tags. ‘Grouper’ was developed for a better organization of the bookmarks and to provide a better overview of the personomy. The display (see Figure 4.13) is separated into three areas of tag popularity: 1) most used tags, 2) frequently used tags, 3) less frequently used tags. Furthermore, Grouper offers the possibility of visualizing co-occurrences. In the figure we can see that the tag ‘Mac’ (highlighted) is most often co-indexed with the bold tags (e.g. ‘Cool’ and ‘Tool’).
Figure 4.13: Visualization of the Folksonomy from del.icio.us via ‘Grouper.’ Source: http://www.ivy.fr/revealicious/demo/grouper.html (06/20/2008).
The visualization tool ‘TagsCloud’ also serves to make the personomy easy to survey. Co-occurring, or ‘related,’ tags are displayed if the cursor hovers over a tag (see Figure 4.14, top, tags in blue font). Another function is displayed in Figure 4.14, bottom: ‘non-related tags’ can be made invisible, which further improves the clarity of this optimized tag cloud.
Figure 4.14: Visualization of the Folksonomy from del.icio.us via ‘TagsCloud.’ Source: http://www.ivy.fr/revealicious/demo/tagscloud.html (06/20/2008).
The application ‘Flickr Graph’ (Albrecht, 2006) visualizes the social network of contacts on Flickr (see Figure 4.15).
Figure 4.15: Visualization with ‘Flickr Graph.’ Source: www.marumushi.com/apps/flickrgraph (06/20/2008).
Figure 4.16: Visualization of the Folksonomy for www.youtube.com for the time period of 09/11/2008 to 09/14/2008 via ‘Cloudalicious.’ Source: http://www.cloudalicio.us (09/14/2008).
‘Cloudalicious’ (Russell, 2006; Russell, 2007) illustrates the chronological development of tags by using a timeline (see Figure 4.16). Tags for a URL from del.icio.us are downloaded and visually represented. The y-axis here represents the relative weighting values (indexing frequency divided by number of authors) of the most popular tags for the URL, while the x-axis represents the time that has passed since the URL’s first indexing: The x-axis on the timeline can be represented as both time and as tagging events. The two are directly correlated, but will be variably related depending on the amount of tag activity being investigated. Popular items and prolific people will have more tagging events within a shorter amount of time than those that are not as active. Showing both time and tagging events on the x-axis allows the searcher more context in which to understand what is happening (Russell, 2007).
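The relative weighting on the y-axis can be illustrated with a short sketch. It assumes, as a simplification, that the weight of a tag after each tagging event is its indexing frequency so far divided by the number of distinct users who have tagged the URL so far; the event data and the function name `relative_weights` are hypothetical and not taken from Cloudalicious itself.

```python
from collections import Counter

# Hypothetical tagging events for one URL, in chronological order: (user, tag).
EVENTS = [
    ("u1", "video"),
    ("u1", "fun"),
    ("u2", "video"),
    ("u3", "video"),
    ("u3", "music"),
]


def relative_weights(events):
    """After each tagging event, yield tag -> frequency / number of authors."""
    counts = Counter()
    authors = set()
    for user, tag in events:
        counts[tag] += 1
        authors.add(user)
        yield {t: n / len(authors) for t, n in counts.items()}


# One dictionary per tagging event, i.e. the x-axis counted in tagging events.
timeline = list(relative_weights(EVENTS))
```

A tag whose weight stays roughly constant from event to event would appear as a horizontal graph in such a timeline.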
The graphs’ colors are allocated to the different tags. Horizontal graphs point to a stabilization of the users’ tagging vocabulary, diagonal graphs to a change in tagging behavior. The application ‘TagLines’ by Yahoo! (Dubinko et al., 2006) uses Flickr’s database and visualizes the most popular tags of the platform folksonomy on a day-to-day basis. Also added are photos indexed with the tags. Via a timeline, the user can enter the desired date; control keys allow him to change the display speed or even stop the display. Slow-moving tags are tags that have been in use for a long time and have appeared in the database over a long period of time. If the user clicks on a tag speeding by, he reaches all Flickr images indexed with this tag. The user can also choose between two display options: 1) the tags and photos move across the screen from right to left (see Figure 4.17, left), 2) the tags and images are vertically aligned and change, while keeping their position (see Figure 4.17, right).
Figure 4.17: Visualization of the Flickr Folksonomy via ‘TagLines.’ Source: http://research.yahoo.com/taglines/ (06/20/2008).
Figure 4.18: The Use of Emoticons in Tag Clouds. Source: Ganesan, Sundaresan & Deo (2008, 1181, Fig. 1).
Ganesan, Sundaresan and Deo (2008) want to index emoticons as markers for the tags’ tonality, in order to improve tag clouds. Emoticons in written communications provide information on the communicators’ mood and are formed with special symbols. The authors extract tags from full-text comments on an e-commerce platform and align these with an emoticon catalog, in order to determine the tags’ tonality and to assign emoticons to tags. The tag cloud (see Figure 4.18) displays tags and emoticons; the tags’ font size represents their frequency of usage. An emoticon key explaining the symbols’ exact meanings is also displayed to further support users. Ganesan, Sundaresan and Deo (2008) sum up their approach as follows: It mines and identifies tags that are more representative of a user among users in a community or tags that are more representative of a category among
categories in an application. […] We believe that this system combines fun and utility in a unique way (Ganesan, Sundaresan, & Deo, 2008, 1182).
Fujimura et al. (2008) use a ‘topigraphical’ approach to improve the tag clouds’ visualization. ‘Topigraphy,’ an amalgamation of ‘topic’ and ‘topography,’ is defined as follows: “Topigraphy uses a topographic image as the background against which the tag clouds are displayed; tag ‘height’ represents the centrality of the concept of the related tags while the 2-dimensional layout addresses tag similarity” (Fujimura et al., 2008, 1087). To determine tag similarity, the authors use the cosine. The result of these endeavors is displayed in Figure 4.19. The height of a tag, e.g. ‘Cooking,’ expresses its degree of abstraction, while spatial proximity between tags expresses their similarity. After a retrieval test, Fujimura et al. (2008) observe that topographical visualizations are particularly suited for getting an overview of a specific subject area, and thus should be selected by users who do not have a concrete search request in mind.
Figure 4.19: Topigraphical Visualization of Tag Clouds. Source: Fujimura et al. (2008, 1087, Fig. 1).
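The cosine mentioned above as a similarity measure can be sketched as follows. How Fujimura et al. (2008) construct their tag vectors is not described here, so the sketch assumes, purely for illustration, that each tag is represented by the resources it was assigned to; the data and the function name `cosine` are hypothetical.

```python
import math

# Hypothetical tag vectors: tag -> {resource id: how often the tag was assigned}.
TAG_VECTORS = {
    "cooking": {"r1": 3, "r2": 1, "r3": 2},
    "recipe": {"r1": 2, "r3": 1},
    "travel": {"r4": 5},
}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

In this toy data, ‘cooking’ and ‘recipe’ share resources and obtain a high cosine value, whereas ‘cooking’ and ‘travel’ share none and obtain 0. In a layout that maps similarity to proximity, the first pair would be placed close together.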
Kuo et al. (2007) discuss the use of font colors in tag clouds to express how up to date the tags are. Furthermore, they check the effectiveness of tag clouds for visualizing search results. They extract tags from the abstracts of the documents found by a query (the underlying database being PubMed) in order to improve user access to the results. The tags’ font size here represents their frequency of occurrence and the colors mark them as recent or old. The newest tags are colored bright red, the oldest ones dark grey. A tag’s recency is determined via the underlying resource’s date of publication. Evaluating their visualization approach in comparison with traditional hit lists, Kuo et al. (2007) found that: Tag clouds are not a panacea for the summarization of web search results. Although they do provide some improvements in terms of summarizing descriptive information, tag clouds do not help users in identifying relational concepts, and in fact, slow them down when they need to retrieve specific items (Kuo et al., 2007, 1204).
Figure 4.20: Flickr Clusters for the Tag ‘Jaguar.’ Source: http://www.flickr.com/photos/tags/jaguar/clusters/ (06/20/2008).
At the moment, more and more popular collaborative information services offer visualization or sorting functions for tag clouds in order to make searching and finding relevant resources simpler. On Flickr, it is possible to generate clusters for certain tags (see Figure 4.20). The cluster function is particularly relevant for the distinction between homonyms or the establishment of a more precise first impression of the tagged and retrieved resources: “Flickr-developed tools, like [...] ‘clustering’ algorithm, provide a service very much like hierarchical structure, which helps to disambiguate different types of images” (Winget, 2006). Del.icio.us and the search engine Rawsugar allow for the summing up of tags in ‘bundles.’ As seen in Figure 4.21, the user ‘tech_cool’ has pretty much tagged his tags and thus created a tag hierarchy. For instance, he superordinates the tag ‘Fun’ to the tag ‘Games.’ This functionality makes it easier for user and indexer qua searcher to ideally arrange the tags according to their interests and to find relevant resources. Begelman, Keller and Smadja emphasize this as well: “The system uses these hierarchies to provide a strong exploration and search experience” (Begelman, Keller, & Smadja, 2006). Del.icio.us also allows users to adjust the display of their tags according to their preferences and thus keep a better overview of their tags. This is only valid for a user’s personomy, however, and not for a resource’s folksonomy. The ‘Tag Options’ (see Figure 4.22) can change the display in four different ways: 1) all of a certain user’s tags can either be displayed as a tag cloud (see Figure 4.21, right) or as a tag list (see Figure 4.21, left). 
The individual tag’s popularity is represented via font size in the cloud and numerically in the list; 2) list and cloud both can be arranged alphabetically or by tag popularity; 3) another option is to only display those tags that have been used at least once, twice or five times; 4) the tags’ hierarchical structure via tag bundles can be either displayed or hidden and works for both tag cloud and tag list.
Figure 4.21: Tag Bundles of the User ‘tech_cool.’ Source: http://del.icio.us/tech_cool (06/27/2008).
Figure 4.22: Tag Options on del.icio.us. Source: http://del.icio.us.
To broaden access to resources through an improved tag cloud display is also the approach used by IBM’s in-house social bookmarking service ‘dogear.’ The tag cloud display can be changed by a slide control: more or fewer tags can be represented (Millen & Feinberg, 2006; Millen, Feinberg, & Kerr, 2006). Single tags’ popularity is represented by their color; the darker it is, the more frequently the tag has been used (Millen & Feinberg, 2006). If hierarchical structures have been formed in folksonomies via cluster algorithms or Tag Gardening activities (see chapter three), a display option can be freely chosen, as the ‘Subject Browser’ of the weekly newspaper ‘Die Zeit’ exemplifies. It visualizes hierarchical and thematic relations between tags152 (see Figure 4.23). The uppermost hierarchy level is represented by the tag ‘Subjects of Die Zeit.’ The tags “Gesellschaft” (=‘Society’), “Natur” (=‘Nature’) and “Technik” (=‘Technology’) are subordinated. Using the arrows (left, right), the user navigates through the current
152 The tags do not consist of user-generated keywords but probably reflect the category hierarchy of the online version of ‘Die Zeit.’ Nevertheless, this visualization can be very effectively applied to (hierarchically structured) folksonomies and shall not go unmentioned.
subjects on the same hierarchy level (see Figure 4.23, left); the arrow ‘up’ moves the slide control to the hyperonyms. The underlined tag, in this case ‘Natur,’ is complemented by a display of its subordinate tags. Clicking on the hyponym “Artenschutz” (=‘Species Conservation’) opens up another subject area, visible in Figure 4.23, on the right-hand side. Here, too, the single hierarchy levels of the tags are represented: ‘Nature’ is the hyperonym of ‘Species Conservation,’ “Tierschutz” (=‘Animal Protection’) is a hyponym of ‘Species Conservation’ and both “Naturkatastrophe” (=‘Natural Disaster’) and “Umwelt” (=‘Environment’) are sister terms of ‘Species Conservation.’
Figure 4.23: Subject Browser of ‘Die Zeit’ with Hierarchical and Thematic Arrangement of the Tags. Source: http://www.zeit.de/themen/index.
The different visualization approaches for tag clouds make it plain that more elaborate retrieval options can indeed be implemented in a purposeful and accessible manner while still keeping the browsing character that facilitates the exploration of the information platform by mouse, which is so typical for folksonomies.
Disadvantages of Folksonomies in Information Retrieval The disadvantages of information retrieval with folksonomies are mainly the same as in knowledge representation, resulting from the lack of a controlled vocabulary. The weaknesses of folksonomies are particularly apparent here, since the user may not be presented with any resources in response to his query or the hit list may consist of irrelevant resources. The immediate connection between the areas of knowledge representation and information retrieval is probably more evident for the user here than anywhere else. The linguistic variability within the indexing and search vocabulary makes information retrieval that much more difficult; resources are only yielded as results if there is a complete match between search term and indexing tag. It is also difficult to bring the different user groups onto a common linguistic, or conceptual, level in order to facilitate exchange between them, or the finding of resources. Aligning the different user vocabularies is the task of knowledge representation. Here there are several methods (e.g. KOS) to choose from. Guy and Tonkin (2006) point out, however, that: In tag-based systems, there are at least two stakeholder groups: those who contribute metadata in the form of tags, and the consumers of that metadata. These may overlap; however, there is no reason to assume that a metadata
consumer must be familiar with the metadata submission process. While the contributors’ choice of vocabulary may have been ‘trained’ by the various means […], the metadata consumer may not have had the benefit of that process (Guy & Tonkin, 2006).
The large number of tags within popular information services does not make searching and, above all, finding any easier: “The difficulty comes from the fact that several people usually use different tags for the same document. [...] This creates a noisy tagspace and thus makes it harder to find material tagged by other people” (Begelman, Keller, & Smadja, 2006). Just as in knowledge representation, methods for unifying search terms and creating semantic unambiguousness can also be used in information retrieval. The search tags can also be processed by means of information linguistics, for example, in order to generate a greater probability of matches between search and indexing vocabulary: “Optimisation of user tag input, to improve their quality for the purposes of later reuse as searchable keywords, would increase the perceived value of the folksonomic tag approach” (Guy & Tonkin, 2006). Query expansion via semantic relations can also support the finding of relevant resources, since it confronts the user with alternative search terms. More general search terms, i.e. tags on higher levels of the term ladder, provide for a higher Recall, while more specific search tags on lower term hierarchy levels restrict it, thus raising the search results’ Precision. Since folksonomies use neither of the two options for information retrieval yet, their retrieval functionality resembles the search in full-texts (Weinberger, 2005) and is not yet fully developed. The entire search effort is passed on to the user; after all, he is the one who must think about all linguistic variants of a term and manually enter them into the search field in order to achieve the most complete search result possible. Even then, the problem of homonyms has not been considered, which leads to possible ballast in the hit list. Expressed by information retrieval’s parameters, this means: Recall = 0 and Precision = 0.
Folksonomies are thus badly suited for both recall-oriented research and precision-oriented research. For this reason, Peterson (2006) advertises the superiority of KOS: “A traditional classification scheme will consistently provide better results to information seekers” (Peterson, 2006). In an interview, the Flickr executive Stewart Butterfield had this to say on the problem: Once again I will mention the concept of trade-offs and that although a user may not be able to locate every resource which has been organized in this fashion, the user will find nothing in a system which is too difficult or daunting to use. Flickr CEO Stewart Butterfield points out, “we’ll have a million photos of Tokyo, and if the TOKYO tag only gets you 400k of them, it’s okay. You’re only going to look at 20 of them anyway” (Kroski, 2005).
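The two parameters used above can be made concrete with a small sketch. It uses the standard definitions (Recall: relevant hits divided by all relevant resources; Precision: relevant hits divided by all hits) and a hypothetical mini index in which users chose different spellings for the same subject; the index, the relevance judgments and the function names are assumptions made for illustration only.

```python
def recall(relevant_retrieved, relevant_total):
    """Share of all relevant resources that the search actually returned."""
    return relevant_retrieved / relevant_total if relevant_total else 0.0


def precision(relevant_retrieved, retrieved_total):
    """Share of the returned resources that are relevant."""
    return relevant_retrieved / retrieved_total if retrieved_total else 0.0


def search(index, query_tags):
    """Exact-match tag search: a resource is a hit if it carries any query tag."""
    query = set(query_tags)
    return {rid for rid, tags in index.items() if tags & query}


# Hypothetical mini index in which users chose different tags for the same subject.
INDEX = {
    "p1": {"tokyo"},
    "p2": {"tokio"},
    "p3": {"tokyo", "japan"},
    "p4": {"jaguar"},
}
RELEVANT = {"p1", "p2", "p3"}  # assumed relevance judgments for the need "Tokyo"

plain = search(INDEX, ["tokyo"])
expanded = search(INDEX, ["tokyo", "tokio"])  # manual expansion with a spelling variant
```

In this sketch the plain query misses the photo tagged ‘tokio,’ so its Recall is 2/3, while the manually expanded query reaches a Recall of 1. Applied to Butterfield’s figures, and assuming that all one million photos are relevant, `recall(400_000, 1_000_000)` gives a Recall of 0.4.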
The popular collaborative information services’ motto thus appears to be ‘quantity, not quality.’ This is understandable to a certain degree, if we consider that they do not charge for their services, as opposed to professional information service providers such as Lexis Nexis153 or Web of Science154, which in turn offer enhanced search and display options. Morrison (2008) observes that the user (particularly the information professional) should accept that folksonomies are not a formidable retrieval tool: “It is important to note that Web sites that employ folksonomies [...]
153 http://www.lexisnexis.com. 154 http://www.isiwebofknowledge.com.
are not necessarily designed to have search as the primary goal” (Morrison, 2008, 1563). In consequence, collaborative information services and tagging systems have not been developed for better information gathering but seem to follow other goals primarily. The networking of users based on different resources and the decentralized storage of personal and already found resources are in the foreground; the finding of interesting information and resources is merely a pleasant side effect. Nevertheless, users have certain information needs they want to satisfy – and this is achieved through active searches in the web or information services. If the user restricts his searches to the folksonomy, thus neglecting all other available metadata, a problem arises that is also prevalent in the World Wide Web but not in professional information services: the user can only search for tags that have already been indexed: “We can’t rely on tags that aren’t there yet” (Dye, 2006, 43). Likewise, search engines on the internet can only find websites that have been linked before. Professional information services or libraries, on the other hand, have a central indexing point, which indexes all incoming resources and registers them in the system only afterwards, making them searchable and retrievable. Here it is impossible for a resource to lie slumbering in the system, undetected, only because it lacks metadata. If a resource in a tagging system has been indexed with tags, this does not mean that the resource is actually relevant for the tag (Koutrika et al., 2007). Since there is no quality control regarding the allocation of resources and tags in folksonomies, inadequate or simply false indexing can occur. This is also exploited by spammers for ‘spagging,’ in which they flood the tagging system with unsuitable tag-resource links (Szekely & Torres, 2005). Searching users then receive inordinate amounts of ballast in response to their queries.
Another problem arising from the lack of quality control for indexing is the negative usage of tags. Koutrika et al. (2007) wonder: “If users can also use negative tags, e.g., this document is not about ‘cars’, what would be the impact on searches?” (Koutrika et al., 2007, 64). Both of these aspects play an important role in the search results’ relevance ranking. The assumption that it is only the indexing frequency of a tag which reflects its relevance, or the resource’s, is not justified. Likewise, the greatest strength and weakness of folksonomies is their liberal design. Users can be as specific or as general as they like when indexing with tags. The users’ different motivations and perspectives can also be reflected in the tags. This can become a problem for information retrieval systems: “however, in the networked world of information retrieval, a display of all views can also lead to a breakdown of the system” (Peterson, 2006). Even display and space constraints can become a problem on the internet, immeasurable though it is. Retrieval systems that aim to represent the wealth of perspectives in its entirety necessarily hit a wall – be it merely the wall of legibility. Another firm limit of folksonomies is their dependence on the collaborative information service. Generally, folksonomies can only be searched for one information platform; no provisions are made for a ‘cross-platform’ or ‘universal search.’ More and more meta search engines appear, however, that make it possible for users to search the folksonomies of different platforms at the same time and get the different resources as search results. The meta search engine ‘MyTag155’ (Braun et al., 2008) allows parallel searches in Flickr, del.icio.us and YouTube. A personalized ranking of the search results via the user personomy is
155 http://mytag.uni-koblenz.de.
also possible. ‘TagsAhoy156’ proceeds in a similar manner and allows users to search only their own personomy within the services del.icio.us, Flickr, Gmail, Squirl, LibraryThing and Connotea. So far, there have been only a few studies on user searching behavior within and with folksonomies. To get an impression of folksonomies’ usability with regard to their retrieval functionality, one must consult related studies instead. Van Hooland (2006) analyzes users’ queries within the image database of the Dutch National Archive via the Shatford Classification (who, what, when, specific, generic, abstract). This classification concerns the indexing of images and photos and differentiates between several degrees of specificity. The author found out in his analysis that: The majority of users want to retrieve images related to a specific geographical location. Secondly, searches regarding specific individuals, groups or objects are also very popular. On the other hand we can state a total lack of use of abstract query terms (Van Hooland, 2006).
Van Hooland also refers to other studies: The larger part of queries refers to specific instances and unique items as object names and geographical locations, whereas more general and abstract concepts are not included. Studies focussing on requests within newspaper image archives reaffirm these results (Van Hooland, 2006).
The observations are in accord with the insights into tagging behavior in knowledge representation. There, too, it has been observed that users tend to choose Basic-Level terms as tags, thus indexing with a medium degree of specificity rather than allocating general tags. Van Hooland (2006) not only investigates users’ search requests but also regards user-generated content. In the Dutch National Archive the users are unable to tag photos, but they can add comments. These often contain somewhat more general terms, but do not yet belong to the Shatford Classification’s abstract category. Thus Van Hooland (2006) concludes: “Both queries and comments are highly motivated by interests in specific terms, use few generic terms and hardly any or no abstract notions” (Van Hooland, 2006). The results of this study have immense repercussions for the development of retrieval systems based on folksonomies. Rorissa (2008), for example, addresses the meaning of the Basic-Level theory for the design of tagging systems in the context of photo-sharing platforms: Several reasons and scenarios could be cited for using the basic level theory as a framework for designing these interfaces and other image indexing tools. For instance, the three levels of abstraction of concepts (subordinate, basic, and superordinate) provide a hierarchical taxonomy in order to build hierarchical classificatory structures which are often appropriate for building browsing interfaces and systems. […] it could reduce the number of ‘drill-down’ levels (unlike tools such as WordNet) while still providing the user with the basic classificatory structure for browsing and exploratory search (Rorissa, 2008).
This study shows that when it comes to the degree of tag specificity, there is mostly a match between indexing and search vocabulary. Small deviations are merely detectable in the fact that users tend to index on the specific, or Instance level (e.g. the tag ‘me’) and the Basic level, while conversely searching on all three levels (subordinate, basic level, superordinate). A complete overlap is only noticeable in the lack of indexing tags and query tags on the very abstract level. After the grammatical and semantic cleansing of both vocabularies it thus becomes possible for the user to find the desired resources even via folksonomies. The disadvantages of folksonomies in information retrieval are perfectly summed up by Guy and Tonkin (2006): “Still, possibly the real problem with folksonomies is not their chaotic tags but that they are trying to serve two masters at once; the personal collection, and the collective collection” (Guy & Tonkin, 2006).
156 http://www.tagsahoy.com.
Query Tags as Indexing Tags
Folksonomies depend on users’ tagging activity. Only when a critical mass of users has been reached can the many advantages of the user-generated indexing systems come to full fruition. Especially the exploitation of tag distributions for recommender systems or relevance ranking, for example, absolutely requires tagging users. In tagging systems, two kinds of problems can hinder the development of tags: 1) Narrow Folksonomies and 2) the so-called ‘cold start problem.’ Narrow Folksonomies cannot develop specific tag distributions for the indexed resource since every tag can only be indexed once per resource. Distributions can be determined on the platform level, just as in Broad Folksonomies. If one wants to use the resource-specific tag distribution for recommender systems, ranking or the creation of a user-controlled vocabulary, however, another possibility of tag generation must be found. Resource collections, as they are found in collaborative information services but also in libraries and on the World Wide Web, often suffer from the ‘cold start problem’ with regard to the indexing of their content. Several days or even weeks can pass between the time the information resource is registered in the database and the time it is indexed, particularly in the case of intellectual indexing. The resource thus ‘slumbers’ in the database, and since it has not been indexed can only be found via full-text search, title specifications or other metadata, which cannot be the ideal state of affairs in information retrieval. As a solution for both problems, query terms can be used as indexing tags, i.e. the retrieval system saves both the terms from the search request and the search results selected by the user and indexes the result with the search term (Krause, Hotho, & Stumme, 2008; Jäschke et al., 2008; Moffat et al., 2007; Baeza-Yates, 2007a; Baeza-Yates, 2007b).
Because: “Users were tagging items implicitly when they searched and then found an item satisfied their search” (Morrison, 2007, 14). Dmitriev et al. (2006) also suggest this procedure for indexing: “The basic idea is to use the queries users submit to the search engine as annotations for pages users click on” (Dmitriev et al., 2006, 811). The possibility of indexing via search terms has already existed for all retrieval systems working with full-texts, thesauri, classification systems and nomenclatures, but was often left unrecognized and not implemented (Stock, 2007b). In the area of folksonomies, this possibility of tag generation is equally given short shrift (Krause, Hotho, & Stumme, 2008; Jäschke et al., 2008). Wiggins (2002) calls the creation of a controlled vocabulary by noting and saving query terms an ‘accidental thesaurus.’ On the websites of Michigan State University,
Wiggins (2002) tested the creation of a thesaurus via search terms. Here he chose another procedure than would be recommendable for collaborative information services, though. Wiggins (2002) scoured the search log files of the internal search engine in order to find out what terms the users searched with. From these terms he generated a thesaurus, the MSU Keywords. To improve the search results, he then manually linked each term with the most appropriate website. This procedure is too elaborate for collaborative information services and large library inventories, though, and not strictly necessary. It suffices to attach the tags to the resource they have been found with and thus create tag distributions for further processing. Wiggins’ advice on when search terms stay constant is very important: “For any given Web presence, whether intranet or global, the top 500 unique search phrases entered by users represent at least 40 percent of the total searches performed” (Wiggins, 2002). After a certain time, i.e. after reaching critical mass, search terms form constant distributions and thus behave exactly like indexing tags from folksonomies. Jäschke et al. (2008) evaluated the search terms for the social bookmarking system BibSonomy and found out that these also follow a Power-Law distribution: “For the distribution of resources and tags this is also the case in logsonomies” (Jäschke et al., 2008; see also Krause, Hotho, & Stumme, 2008). They call the folksonomy thus developed from search tags a ‘logsonomy’ and describe the procedure as follows: “Queries or query terms represent tags, session IDs correspond to users, and the URLs clicked by users can be considered as the resources that they tagged with the query terms” (Jäschke et al., 2008). 
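The mapping quoted above (query terms as tags, session IDs as users, clicked URLs as resources) can be illustrated with a minimal sketch. The log format, the function names and the example URLs are assumptions made for illustration; they are not taken from BibSonomy or from Jäschke et al. (2008).

```python
from collections import Counter, defaultdict


def build_logsonomy(click_log):
    """Turn (session_id, query, clicked_url) records into tag assignments.

    Assumed mapping: every query term becomes a tag, the session ID plays
    the role of the user, and the clicked URL is the tagged resource.
    """
    assignments = set()
    for session_id, query, url in click_log:
        for term in query.lower().split():
            assignments.add((session_id, term, url))
    return assignments


def tag_distribution(assignments):
    """Count, per resource, how many sessions attached each search term."""
    distribution = defaultdict(Counter)
    for _session, term, url in assignments:
        distribution[url][term] += 1
    return distribution


# Hypothetical click log for illustration only.
click_log = [
    ("s1", "folksonomy ranking", "http://example.org/a"),
    ("s2", "Folksonomy", "http://example.org/a"),
    ("s3", "tag cloud", "http://example.org/b"),
]
logsonomy = build_logsonomy(click_log)
distribution = tag_distribution(logsonomy)
```

Because each session can contribute a term to a resource, the resulting counts form a resource-specific distribution even where the platform itself only permits one instance of a tag per resource.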
Wiggins (2002) is so enthused about his ‘accidental thesaurus’ that he considers any indexing of the content by information professionals superfluous: “For practical, everyday solutions, we do not need the rigor of an academic discipline examining the literature” (Wiggins, 2002). Since with search terms, too, problems occur due to ambiguities or synonyms, they should not be attached to the resource as indexing tags without being checked or processed first. It is thus unfeasible to do completely without the help of professional indexers. Millen and Feinberg (2006) also emphasize the importance of evaluating search terms, since they observed, in their study on social navigation in the in-house social bookmarking service ‘dogear,’ that only 6 out of 10 tags for the top search terms matched the top indexing tags (Millen & Feinberg, 2006, Table 1). Users thus search with other tags than they index with. Only a ‘learning’ retrieval and indexing system is capable of ascertaining these discrepancies and achieving better search results. Furnas et al. (1987) pointed this out as well: Basically an adaptive indexing system begins with an index obtained in an ordinary way (for example, by armchair naming). Users try their own words to access desired objects, and, on their first few tries, usually fail. But sometimes they will eventually find something they want. With user concurrence, the initially unsuccessful words are added to the system’s usage table. The next time someone seeks the object, the index is more knowledgeable (Furnas et al., 1987, 970).
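The adaptive indexing loop described by Furnas et al. (1987) might be sketched as follows. The class and method names are hypothetical, and the ‘usage table’ is reduced to a plain mapping from terms to resource identifiers.

```python
from collections import defaultdict


class AdaptiveIndex:
    """Toy adaptive index: failed search words are learned after a success."""

    def __init__(self, initial_index):
        # term -> set of resource ids, e.g. obtained by "armchair naming"
        self.index = defaultdict(set)
        for term, resources in initial_index.items():
            self.index[term].update(resources)

    def search(self, term):
        """Return the resources currently indexed with this term."""
        return set(self.index.get(term, set()))

    def record_success(self, failed_terms, resource, user_agrees=True):
        """Add the initially unsuccessful words, but only with user concurrence."""
        if not user_agrees:
            return
        for term in failed_terms:
            self.index[term].add(resource)


idx = AdaptiveIndex({"automobile": {"doc1"}})
first_try = idx.search("car")          # the user's own word fails at first
idx.record_success(["car"], "doc1")    # the user later finds doc1 and agrees
second_try = idx.search("car")         # the next searcher succeeds
```

The `user_agrees` flag reflects the checking step argued for above: terms are not attached to a resource unexamined.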
The indexing of search terms also offers the advantage that in-house or specific abbreviations of terms can be indexed with the resource and thus reflect the users’ language use just like user-generated tags (Dmitriev et al., 2006). Also, indexed search tags work like a time machine, since old terms for any newly structured sites are also collected – via a search request using an antiquated term while clicking on a new site – and thus keep access to the information intact (Dmitriev et al., 2006). Since search engines are always based on an alignment of search terms and only yield results if there is a match between search and indexing vocabularies, this method can of course only be used for a search with several search terms and a less strict AND link (Stock, 2007a, 328). In the case of a zero-result search, the search engine first replaces the AND link with an OR and/or expands the query via step-by-step adding of synonyms via an OR link in order to be able to offer the users a search list and then evaluate the clicking behavior. At the same time, search tags act like a time machine to the future as they can show trends in user terminology in order to adjust the indexing vocabulary more quickly: “When this dynamic information is monitored one can observe a historical or trend-setting development of the vocabulary based upon the time-stamp of selected indexing tasks” (Maass, Kowatsch, & Münster, 2007, 55). This trend information can be calculated via the tag distributions, and particularly for Power Laws. The n first tags of the search term distributions can here be regarded as candidates for controlled vocabularies, for example. Furthermore, the evaluation of search terms and the calculation of possibly empty hit lists can serve to make out gaps in indexing or in the controlled vocabulary, in order to fill them retroactively. Since the evaluation of search terms leads to similar results and points out characteristics that are equally valid for folksonomies in knowledge representation, this method can also be used for recommender systems for similar tags, resources or users. Jäschke et al. (2007) report: The analysis of the topological structure of logsonomies has shown, that the clicking behaviour of search engine users and the tagging behaviour of social bookmarking users is driven by similar dynamics: in both systems, power law and small world properties exist.
Hence, logsonomies can serve as a source of finding topic-oriented, community driven content either by a specific search along the three dimensions or by means of serendipitous browsing (Jäschke et al., 2007).
The evaluation and incorporation of search terms can thus provide valuable services for the indexing of information resources as well as the maintenance of controlled vocabularies. Since search terms have similar characteristics and distributions to the tags of a folksonomy, they can be used for the same tasks, such as recommender systems or ranking. In particular, it is possible to use search terms to generate resource-specific (search) tag distributions in Narrow Folksonomies, since search terms can be registered multiple times per resource. At the same time, the attachment of a multitude of further tags (e.g. extracted from other metadata or other controlled vocabularies) can broaden access to the information resources. Likewise, search terms allow for the solution of the cold start problem with regard to indexing tags, since resources might be found via full-text or title descriptions even though they are not yet represented by ‘tags,’ ‘keywords’ or ‘descriptors.’ The search terms can replace the keywords or descriptors, however, and thus at the same time entice other users to tag and heighten the resource’s retrievability.
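The handling of zero-result searches described in this section (first relax the AND link to OR, then add synonyms via OR) could look roughly like the following sketch. The inverted index, the synonym table and the function name are hypothetical; a real system would add synonyms step by step and log the subsequent clicks.

```python
def search_with_fallback(index, terms, synonyms=None):
    """Return (hits, strategy) for a multi-term query.

    index:    term -> set of resource ids (assumed inverted index)
    synonyms: term -> list of alternative terms (assumed synonym table)
    """
    synonyms = synonyms or {}
    postings = [index.get(term, set()) for term in terms]

    # 1) strict AND link: every term must match
    hits = set.intersection(*postings) if postings else set()
    if hits:
        return hits, "AND"

    # 2) relaxed OR link: any term may match
    hits = set().union(*postings)
    if hits:
        return hits, "OR"

    # 3) query expansion: add synonyms via OR
    expanded = set()
    for term in terms:
        for alternative in synonyms.get(term, []):
            expanded |= index.get(alternative, set())
    return expanded, "OR+synonyms"


index = {"tag": {"r1", "r2"}, "ranking": {"r2"}, "folksonomy": {"r3"}}
strict = search_with_fallback(index, ["tag", "ranking"])
relaxed = search_with_fallback(index, ["tag", "folksonomy"])
expanded = search_with_fallback(index, ["tagging"], {"tagging": ["tag"]})
```

Whatever resource the user then clicks in the relaxed or expanded hit list can be indexed with the original query terms, which is how an antiquated or in-house term ends up attached to a resource.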
Relevance Ranking in Folksonomies
Relevance ranking in collaborative information services has long been – compared to other research endeavors – neglected in scientific debate (with the exception of Hotho et al., 2006c). The services themselves also do not (yet) place any great value on the ranking of the search results: Connotea has no apparent ranking, del.icio.us ranks reverse-chronologically according to the date of the bookmark’s entry into the system and the tags according to popularity (Szekely & Torres, 2005; Sturtz, 2004), Flickr standardly sorts by most recent date of entry and on YouTube, the user can at least arrange the search results according to the criteria relevance, date of entry, rating and view count. Slowly, though, more and more researchers and developers, as well as information services, are becoming aware of the necessity of a ranking mechanism: “As these systems are growing, the currently implemented navigation by browsing tag clouds with subsequent lists of bookmarks that are represented in chronological order may not be the best arrangement for concise information retrieval” (Krause, Hotho, & Stumme, 2008, 101). There is a particular need for finding distinctions between the relevant and less relevant resources in the growing mass of information in the platforms – ideally, by getting the top result for the query to appear at the very top of the hit list: “Ranking should be of highest priority for any community interested in providing information access and archival features” (Szekely & Torres, 2005).
Generally, the possibilities of relevance ranking can be divided into two large areas, referring to searches on the World Wide Web in particular: the dynamic or query-dependent ranking, which calculates the similarity between query and resource anew for each search request, and the static or query-independent ranking, which determines the retrieval status value of an individual resource solely based on its inherent characteristics (Lewandowski, 2005a; Bao et al., 2007, 501). Agichtein, Brill and Dumais (2006, 20) call these areas ‘concept-based features’ and ‘query-independent page quality features.’ For the query-dependent ranking, factors such as
• WDF and IDF, and possibly the consideration of the search term’s placement in the resource (e.g. title field),
• the search terms’ word distance in the resource,
• the order of the search arguments in the resource,
• structural information of the resources (e.g. font size or distinctions of the search term),
• metatags,
• anchor texts,
• language,
• spatial proximity of user and resource (Lewandowski, 2005b, 143, Table 2; see also Stock, 2007a, chapter 22)
are taken into consideration in order to compare the resource (here: the website) with the search request. Several different models can be chosen from for the similarity calculations, e.g. the Vector-Space model (Salton & McGill, 1987) or the probabilistic model (Crestani et al., 1998), in order to be able to implement the ranking of the search results. Query-independent ranking factors include:
• the number of the resource’s inlinks and outlinks,
• the website’s position within the website hierarchy,
• clicking popularity,
• resource’s up-to-dateness,
• resource size (e.g. document length),
• resource format,
• the website’s size (Lewandowski, 2005b, 144, Table 3; see also Stock, 2007a, chapter 22).
These aspects influence the resource’s retrieval status value and thus its position in the ranking; they are, however, determined independently of the concrete search requests. That means that the resource keeps its ranking in the hit list for a certain period of time – until it is processed again, that is – even if one or more individual factors should change. The resource’s retrieval status value is thus static, as opposed to the dynamic ranking factors mentioned above, which change in accord with the search requests. The best-known static ranking algorithms are the Kleinberg algorithm (Kleinberg, 1999), used in the retrieval system HITS, and the PageRank algorithm, used in the search engine Google (Brin & Page, 1998; Page et al., 1998). Both algorithms exploit the findings of citation analysis (Garfield, 1979) and evaluate bibliographic coupling as well as the co-citations of websites. Here they regard incoming links as citations and outgoing links as references (Stock, 2007a, chapter 22). Since the resources of collaborative information services often have links, the link-topographical algorithms can be applied to them; they then inherit the same problems for their ranking, however. That is why new ways of ranking are tested in folksonomies, since they open up new vistas for search engines and relevance ranking in their capacity as Web 2.0 components with their own characteristic features. The current approaches often combine folksonomies with the known search engine methods and ranking algorithms, in order to be able to implement the greatest advantages for the determining of relevant resources. Bao et al. (2007) try to implement this combination with two different ranking algorithms within folksonomies.
The ‘SocialSimRank’ is a query-dependent ranking factor that makes use of tags as a basis for comparison: “These annotations provide a new metadata for the similarity calculation between a query and a web page” (Bao et al., 2007, 501). The similarity between search terms and tags is thus supposed to be determined via their semantic similarity (Bao et al., 2007, 503). This is done as follows: The social annotations are usually an effective, multi-faceted summary of a web page and provide a novel metadata for similarity ranking. A direct and simple use of the annotations is to calculate the similarity based on the count of shared terms between the query and annotations. [...] Similar to the similarity between query and anchor text, the term-matching based query-annotation similarity may serve as a good complement to the whole query-document similarity estimation (Bao et al., 2007, 504).
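The ‘direct and simple use’ mentioned in the quotation, counting the terms shared by query and annotations, can be written down in a few lines. This sketch covers only that term-matching baseline, not SocialSimRank itself; the function name and the whitespace tokenization are assumptions.

```python
def term_match_similarity(query, annotations):
    """Count the terms shared by a query and a resource's annotations (tags)."""
    query_terms = set(query.lower().split())
    tags = {annotation.lower() for annotation in annotations}
    return len(query_terms & tags)


shared = term_match_similarity("Java tutorial", ["java", "programming", "Tutorial"])
missed = term_match_similarity("ubuntu", ["linux"])
```

The second call returns zero although ‘ubuntu’ and ‘linux’ are clearly related, which is the kind of gap a semantic similarity between search terms and tags is meant to close.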
The second ranking factor, ‘SocialPageRank,’ is static in nature and calculates a resource’s retrieval status value not via its degree of linkage but via the number of tags: “Static ranking, the amount of annotations assigned to a page indicates its popularity and implies its quality in some sense. […] SocialPageRank (SPR) to measure the popularity of web pages using social annotations” (Bao et al., 2007, 501f.).
We propose a novel algorithm, namely SocialPageRank (SPR) to quantitatively evaluate the page quality (popularity) indicated by social annotations. The intuition behind the algorithm is the mutual enhancement relation among popular web pages, up-to-date web users and hot social annotations (Bao et al., 2007, 504).
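The ‘mutual enhancement relation’ among pages, users and annotations can be illustrated by letting weight circulate from pages to the users who annotated them, on to the tags those users assigned, and back to the pages carrying those tags. The sketch below is a deliberately simplified stand-in and not the published SocialPageRank algorithm; the propagation order, the normalization and all names are assumptions.

```python
from collections import defaultdict


def social_popularity(assignments, iterations=20):
    """Simplified mutual-reinforcement score over (user, tag, page) triples."""
    assignments = list(assignments)
    pages = sorted({page for _, _, page in assignments})

    # Association counts between the three dimensions
    page_user = defaultdict(int)
    user_tag = defaultdict(int)
    tag_page = defaultdict(int)
    for user, tag, page in assignments:
        page_user[(page, user)] += 1
        user_tag[(user, tag)] += 1
        tag_page[(tag, page)] += 1

    score = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        users = defaultdict(float)
        for (page, user), count in page_user.items():
            users[user] += count * score[page]   # pages -> users
        tags = defaultdict(float)
        for (user, tag), count in user_tag.items():
            tags[tag] += count * users[user]      # users -> tags
        new_score = defaultdict(float)
        for (tag, page), count in tag_page.items():
            new_score[page] += count * tags[tag]  # tags -> pages
        total = sum(new_score.values())
        score = {page: new_score[page] / total for page in pages}
    return score


# Hypothetical data: page A is annotated by three users, page B by one.
triples = [("u1", "web", "A"), ("u2", "web", "A"), ("u3", "web", "A"), ("u3", "misc", "B")]
popularity = social_popularity(triples)
```

In this toy data set the heavily annotated page A ends up with the larger share of the normalized score, which mirrors the idea that the amount of annotations indicates popularity.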
Bao et al. (2007) take a first step towards exploring ranking algorithms within folksonomies and collaborative information services; further approaches will be summarized in the following. Richardson, Prakash and Brill (2006) introduce ‘fRank,’ which is defined as a combination of RankNet and different features, where “RankNet is a straightforward modification to the standard neural network back-prop algorithm” (Richardson, Prakash, & Brill, 2006, 709). The feature set consists of the components PageRank, popularity (visits/page impressions of the website), anchor text and inlinks (amount of text in links, number of once-occurring words etc.) and domain (average number of outlinks etc.). The innovation of this ranking algorithm is that the combined features (which are otherwise mostly used for a dynamic ranking) are tried for static ranking in this case, i.e. the combination of RankNet and the feature set provides a query-independent ranking factor. The ranking algorithms ‘TagRank’ and ‘UserRank’ (Szekely & Torres, 2005) modify the PageRank algorithm and serve to locate the best user and the best tag within the folksonomy. The UserRank incorporates user activity into its calculations and rewards the user who first annotates a popular tag for a resource: “Even though u benefits from links from both v and w, u should gain more per link because u made the assertion first. In addition, both s(w,u) and s(w,v) should be penalized to avoid the ‘bandwagon’ effect” (Szekely & Torres, 2005). A problem with this procedure is that users could attach as many tags as they want to equally many resources in order to influence the ranking in their favor. There is also the possibility of using fake accounts to prefer artificially determined tags, again in order to push up the user’s ranking. The calculation of the UserRank has direct repercussions on the determining of the best tag.
The TagRank works as follows: “the rank of a tag is simply the sum of the UserRank of the tagger over all instances of the tag. This approach is based on the belief that the most relevant tags are those used by the best users” (Szekely & Torres, 2005). There is a great problem for both ranking algorithms, making them unsuitable for large collections of data: “The difficulty with this approach is that new ranking must be computed for each type of request. As the community grows, these rankings may take hours to compute” (Szekely & Torres, 2005). A similar ranking factor to the UserRank (Szekely & Torres, 2005) is Amir Michail’s ‘CollaborativeRank157.’ Here, too, the time of tagging is regarded and counted in the user’s favor: TopTaggers is a search engine for the social bookmarking service del.icio.us. The key novelty behind TopTaggers is a way to encourage del.icio.us users to bookmark helpful/timely URLs with meaningful tags. This is done by rewarding del.icio.us users who are among the first to bookmark helpful/timely URLs and give them meaningful tags. The reward for doing so comes in several forms: free advertising, influence on search rankings, and recognition as experts158.
157 The CollaborativeRank is mentioned in Guy & Tonkin (2006), amongst others. Unfortunately, neither documentation nor system is currently online. Descriptive information can be found on the website http://www.wired.com/science/discoveries/news/2005/10/69083.
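The TagRank definition quoted from Szekely and Torres (2005), the sum of the taggers’ UserRank over all instances of a tag, translates directly into a short sketch. The UserRank values are treated here as precomputed, hypothetical inputs; how they are derived is not reproduced.

```python
from collections import defaultdict


def tag_rank(assignments, user_rank):
    """Sum the UserRank of the tagger over all instances of each tag."""
    ranks = defaultdict(float)
    for user, tag, _resource in assignments:
        ranks[tag] += user_rank.get(user, 0.0)
    return dict(ranks)


# Hypothetical UserRank values and tag assignments for illustration.
user_rank = {"ann": 0.5, "bob": 0.25}
assignments = [
    ("ann", "python", "r1"),
    ("bob", "python", "r2"),
    ("bob", "snake", "r2"),
]
tag_ranks = tag_rank(assignments, user_rank)
```

The sketch also makes the manipulation problem visible: every additional instance of a tag adds the tagger’s UserRank again, so mass tagging or fake accounts raise a tag’s rank directly.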
The ranking according to best or most relevant user is also aimed for by John and Seligmann (2006) in introducing ‘ExpertRank.’ The ExpertRank also incorporates user activity into its calculations to determine the most important user of the tagging community. It can be assumed that a user becomes more important the more resources he tags. The ‘SBRank’ by Yanbe et al. (2007) also follows the approach of combining the PageRank algorithm with the users’ collaboration in order to be able to provide a better search engine and a better ranking: “We propose here utilizing consumer-provided metadata for extending capabilities of current search engines” (Yanbe et al., 2007, 108). The authors assume that only a combination of these two or other methods can guarantee the finding of relevant information and at the same time question the PageRank’s purpose. Since there are ever more systems automatically generating links (e.g. for blogs, trackbacks, pingbacks or spam), thus lessening the value of hyperlinks, they argue that further factors must be incorporated into the design of the search engine and taken into consideration for the ranking. Yanbe et al. (2007) concentrate especially on social bookmarking systems and implement the user collaboration at first via a simple ‘tags only’ search in their search engine. Subsequent improvements add the following aspects: “In result, our application allows users to search for Web pages by their content, associated metadata, temporal aspects in social bookmarking systems, user sentiment and other features” (Yanbe et al., 2007, 108).
The authors prefer social bookmarking services as sources for their ranking algorithm because the resources there are user-checked and intellectual instead of having been chosen via a purely mathematical calculation, and because they directly reflect user interest: “SBRank captures the popularity of resources among content consumers (page readers), while PageRank is in general a result of author-to-author evaluation of Web resources” (Yanbe et al., 2007, 107). They compare the saving of bookmarks and the popularity of single bookmarks with the established citing and referencing mechanisms and the thus growing reputation in scientific discourse. Yanbe et al. (2007) further find out, through the Google Toolbar, that more than 56% of bookmarks have a very low PageRank, meaning that these sites can only be found with great difficulty via conventional searches, and are thus unlikely to rise to the top spots of the ranking. They thus advocate the combination of PageRank and information from social bookmarking systems to bring ‘less known’ or not heavily linked sites to the foreground in ranking and search: “The high dynamics of social bookmarking services makes it superior over traditional link based page ranking metric from the temporal viewpoint, as it allows for a more rapid, and unbiased, popularity increase of pages” (Yanbe et al., 2007, 112). User rating via ‘affective tags’ (Kipp, 2006a; 2006b), such as ‘funny’ or ‘cool’ should also be used for search and ranking. Yanbe et al. (2007) call this factor ‘sentiment.’ One of the few ranking algorithms already implemented in a collaborative information service is the ‘FolkRank’ (Hotho et al., 2006c). It is used in the scientifically oriented social bookmarking service BibSonomy. Hotho et al. (2006c; see also Hotho et al., 2006b and Hotho et al., 2006d) were among the first to recognize the necessity of retrieval and ranking methods in tagging systems: 158
http://collabrank.elwiki.com/index.php/Introduction.
“However, the resources that are displayed are usually ordered by date, i.e., the resources entered last show up at the top. A more sophisticated notion of ‘relevance’ – which could be used for ranking – is still missing” (Hotho et al., 2006c, 413). The authors also orient themselves on the PageRank’s procedure (including Random Surfer) and start from the following assumption: “The basic notion is that a resource which is tagged with important tags by important users becomes important itself” (Hotho et al., 2006c, 417). Here the FolkRank can be applied to all parts of the tripartite graph of a folksonomy. Since the PageRank works on the basis of a directed bipartite hypergraph, however, and the folksonomies form an undirected tripartite hypergraph, the folksonomic structure must first be adjusted. Here the three factors are summarized in pairs of undirected hypergraphs: tags and users, users and resources, and tags and resources (Hotho et al., 2006c, 417). Thus users and resources and tags can all be processed with the FolkRank and furthermore be linked with weighting values in order to better be able to exemplify certain relations (Hotho, 2008). In BibSonomy, the FolkRank is being used for two applications so far: “The algorithm will be used for two purposes: determining an overall ranking, and specific topic-related rankings” (Hotho et al., 2006c, 412). The subject-specific ranking can also facilitate the determining of points of particular interest within the tagging community, so that the FolkRank can also be used for the discovering of communities and an improved information exchange within an information platform: This leads naturally to the idea of extracting communities of interest from the folksonomy, which are represented by their top tags and the most influential persons and resources.
If these communities are made explicit, interested users can find them and participate, and community members can more easily get to know each other and learn of others’ resources (Hotho et al., 2006b).
In conclusion, Hotho et al. (2006b) summarize: Overall, our experiments indicate that topically related items can be retrieved with FolkRank for any given set of highlighted tags, user and/or resources. As detailed above, our ranking is based on tags only without regarding any inherent features of the resources at hand (Hotho et al., 2006b). This allows to apply FolkRank to search for pictures (e.g., in Flickr) and other multimedia content, as well as for all other items that are difficult to search in a content-based fashion (Hotho et al., 2006c, 424)
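The procedure outlined above (pairwise undirected links between tags, users and resources, PageRank-style weight spreading, and topic-related rankings for highlighted items) can be sketched as follows. This is an illustration and not the BibSonomy implementation: the damping value, the fixed iteration count, the preference vector and the subtraction of an unbiased baseline are assumptions made for this example.

```python
from collections import defaultdict


def build_graph(assignments):
    """Link users, tags and resources pairwise for every tag assignment."""
    graph = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in assignments:
        u, t, r = ("user", user), ("tag", tag), ("resource", resource)
        for a, b in ((u, t), (t, r), (u, r)):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def spread(graph, preference, damping, iterations=50):
    """PageRank-style weight spreading with a preference (restart) vector."""
    nodes = list(graph)
    weights = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        nxt = {node: (1.0 - damping) * preference.get(node, 0.0) for node in nodes}
        for node in nodes:
            degree = sum(graph[node].values())
            for neighbour, weight in graph[node].items():
                nxt[neighbour] += damping * weights[node] * weight / degree
        weights = nxt
    return weights


def topic_ranking(assignments, highlighted, damping=0.7):
    """Biased ranking minus unbiased baseline for the highlighted nodes."""
    graph = build_graph(assignments)
    preference = {node: 1.0 / len(highlighted) for node in highlighted}
    baseline = spread(graph, {}, damping=1.0)      # no preference at all
    biased = spread(graph, preference, damping)    # preference for the topic
    return {node: biased[node] - baseline[node] for node in graph}


# Hypothetical folksonomy: two users who share only the tag 'music'.
triples = [
    ("u1", "jazz", "r1"), ("u1", "music", "r1"),
    ("u2", "rock", "r2"), ("u2", "music", "r2"),
]
scores = topic_ranking(triples, [("tag", "jazz")])
```

With the tag ‘jazz’ highlighted, resource r1 receives a higher score than r2 although both are equally connected in the unbiased graph. Since only tag assignments enter the calculation, the same sketch would apply to photos or other resources without searchable content.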
Also already implemented in a collaborative information service is the ‘Interestingness Ranking’ of the photo platform Flickr (Dennis, 2006). Winget (2006) may claim that “The qualities of ‘interesting’ images are defined by the Flickr system, and the algorithm for determining an image’s ‘interestingness’ is not publicly available” (Winget, 2006), but that is not strictly the case. In a patent application for Flickr by Yahoo!, Butterfield et al. (2006a; 2006b) describe the meaning and purpose of a folksonomy-based Interestingness Ranking for multimedia resources: “to make available a wider variety of user-derived information concerning media objects, and to develop more relevant rankings for media objects based upon that information” (Butterfield et al., 2006b). For a ranking according to the resource’s ‘interestingness’ in Narrow Folksonomies, six general (1-6) and two personalized (7 and 8) criteria for determining the retrieval status value of a resource are defined:
1) click rates: how often has a resource been called up, viewed or played back?
2) extent of metadata: how many tags, titles, descriptions, comments, annotations are there for the resource and how often has it been added to the list of favorites?
3) number of tagging users and tag popularity: what are the most frequently indexed tags for the resource and how many users have participated in indexing the resource (via every form of metadata)? This last factor can also be restricted to certain fields, groups or periods of time.
4) tag relevance: is the tag deemed relevant by the users or not (via active user ratings)?
5) chronology: the ranking algorithm assumes that a resource’s interestingness decreases over time. This is why the algorithm reduces a resource’s retrieval status value by 2% per tag each day after it has been entered.
6) relation between the metadata: what tags co-occur the most in indexing?
7) relation between author and user: the following basic assumption is made concerning registered users: “given the potentially higher likelihood of a similarity of interest between such a user and the poster relative to other users, this relationship may be weighted and summed into the interestingness score to increase it” (Butterfield et al., 2006a).
8) relation between location and residence: here too, registered users receive a personalized ranking based on the following assumption: a media object is more interesting to a particular user if the location associated with the object is associated with a residence of the user […] or associated with a location that is itself associated with a number of objects that have been assigned metadata by the user, […] e.g. the user has designated as favorites a large number of images associated with the Washington D.C. area (Butterfield et al., 2006a).
The ranking factors suggested by Yahoo! purposefully exploit the collaborative aspects of the tagging systems, but are far from exhausting all ranking possibilities: in particular, traditional retrieval models such as the Vector-Space model or link topology, as well as the users’ concrete search and indexing behavior, are not taken into consideration. Another problem for the Interestingness Ranking is the factor ‘time.’ A resource does not necessarily lose relevance over time. Under this assumption, the relevance of the Mona Lisa, for instance, would be near zero today, which might very well be disputed by art historians. Following Flickr’s Interestingness Ranking, the following can be observed for tagging-based ranking algorithms: “There are generally two ways to utilize the explored social features for dynamic ranking of web pages: (a) treating the social actions as independent evidence for ranking results, and (b) integrating the social features into the ranking algorithm” (Bao et al., 2007, 505).
Figure 4.24: Three Sets of Criteria for Relevance Ranking in and with Folksonomies. Source: Modified from Peters & Stock (2007, Fig. 10).
Thus the ranking possibilities for resources within folksonomies can generally be concentrated into three large areas or feature sets: 1) tags, 2) collaboration and 3) users (see Figure 4.24). Each area consists of individual ranking factors, which can each be weighted separately in order to give certain aspects stronger or weaker consideration in the ranking, i.e. to affect the resource’s retrieval status value (RSV) positively (rise) or negatively (drop). The feature set ‘tags’ is query-dependent, i.e. there must be a match between search term and indexing term, and the resource’s RSV must be calculated anew for each search request. The sets ‘collaboration’ and ‘users’ are query-independent, which means that the resource’s RSV is not calculated with regard to the search request but is kept the same for each resource. The retrieval status value of each resource can be built up either from the individual ranking factors or from the summarization of the individual feature sets. A schematic display of this procedure is presented in Figure 4.25, following Harman (1992). The individual components (‘a,’ ‘b’ and ‘c’) of a feature set are accumulated, e.g. by multiplication or addition, and these results are then summarized anew. This generates the RSV of each resource, which is used as the basis of the relevance ranking. Which form of accumulation will yield an adequate result in each respective case must be determined through empirical observation.
Figure 4.25: Schematic Display of the Calculation of a Resource’s Retrieval Status Value.
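The two-stage accumulation shown schematically in Figure 4.25 can be illustrated with a minimal Python sketch. The function names, the dictionary layout and all numeric values are hypothetical and serve only as an illustration; the text leaves open whether multiplication or addition is the adequate form of accumulation, so both are offered.

```python
from math import prod


def aggregate(values, mode="product"):
    """Accumulate a list of factor values by multiplication or addition."""
    if mode == "product":
        return prod(values)
    if mode == "sum":
        return sum(values)
    raise ValueError("mode must be 'product' or 'sum'")


def retrieval_status_value(feature_sets, mode="product"):
    """First accumulate the components of each feature set,
    then summarize the partial results into one RSV."""
    partial_values = [aggregate(factors, mode) for factors in feature_sets.values()]
    return aggregate(partial_values, mode)


# Hypothetical factor values for the three feature sets of Figure 4.24.
feature_sets = {
    "tags": [0.85, 2.0],
    "collaboration": [1.5, 1.2],
    "users": [1.1],
}

rsv_product = retrieval_status_value(feature_sets, mode="product")
rsv_sum = retrieval_status_value(feature_sets, mode="sum")
print(round(rsv_product, 3), round(rsv_sum, 3))
```

Which mode is appropriate remains an empirical question, as stated above; the sketch only shows that the choice can be isolated in a single parameter.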
The ranking factors introduced in the following are the result of conceptual thoughts on the basis of research literature, known ranking algorithms and personal reflection. The description of the possible ranking factors will be put up for discussion in the context of this book, but no empirical control of this conception can be performed. Note at this point that all preceding comments on the different ranking factors merely reflect the current status of scientific research and can in no way be regarded as established. The first feature set (see Figure 4.24) considers the concrete tags of the folksonomy and processes them for relevance ranking. Since the ranking refers to the resource level, it is important to stress at this point that the entire bundle of criteria ‘tags’ can only be applied in Broad Folksonomies since they allow for resource-specific tag distributions. In Narrow Folksonomies, a detour via the query tags must be taken if one wants to exploit the resource-specific distributions for the ranking and not work with a tag popularity value of 1. A primitive variant of relevance determination and the ranking can be generated via term frequency (1a) (Stock, 2007a, ch. 19; Sen et al., 2007; Beal, 2006). Hotho et al. (2006c) claim that “as the documents consists of short text snippets only [..], ordinary ranking schemes such as TF/IDF are not feasible [for folksonomies, A/N]” (Hotho et al., 2006c, 416). This is not true. The term frequency (TF) is determined via a term’s frequency of occurrence within a resource; in Broad Folksonomies via a tag’s popularity:
$$\mathrm{TF}(t, r) = \mathrm{freq}(t, r)$$
where freq = the indexing frequency of a tag t in the resource r.
We can say for folksonomies that in case of a match between search tag and indexing tag, the resource in which the indexing tag has been used most frequently will be placed at the top of the search results. Szekely and Torres (2005) point out, however, that the relevance of a resource cannot necessarily be derived from the number of tags. They mention a tagging-based restaurant rating system, in which they observe that a restaurant has been tagged very often, but exclusively with semantically negative tags. The restaurant is quite obviously bad and should not be listed at the top of the ranking, as the most relevant resource, due to its tag popularity. Sen et al. (2007) confirm the doubts concerning the ranking factor of tag popularity. They interviewed users about the relevance of individual tags in the film recommendation system ‘MovieLens’: Perhaps users apply higher quality tags more often than low quality tags. If so, then the number of times a tag has been applied might be a reasonable proxy for its quality. [...] We might expect the most often applied tags to be the highest rated, but this is not the case – users gave low average ratings for several of the most frequently applied tags (Sen et al., 2007, 365).
Thus absolute tag frequency is not the only ranking factor needed. From text statistics, we know of the use of relative term frequency (1b), which serves the relativization of frequency values for longer texts. For the relative frequency of a tag t, the tag’s frequency of occurrence is divided by the total number of tags L per resource r:
$$\mathrm{rel.TF}(t, r) = \frac{\mathrm{freq}(t, r)}{L}$$
A variant of relative term frequency is the Within Document Frequency (WDF), which works with the logarithmization of relative frequency and thus achieves a more compressed value area (Harman, 1992). The WDF value of a tag in a resource can also be calculated via the following formula:
$$\mathrm{WDF}(t, r) = \frac{\mathrm{ld}(\mathrm{freq}(t, r) + 1)}{\mathrm{ld}\,L}$$
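The three frequency variants can be compared in a short Python sketch. The function names and the tag counts are hypothetical; ld is read as the logarithm to base 2, and the sketch assumes a Broad Folksonomy in which a tag can be assigned to a resource more than once.

```python
from math import log2


def tf(freq):
    """Absolute term frequency: how often tag t was indexed for resource r."""
    return freq


def rel_tf(freq, total_tags):
    """Relative term frequency: tag frequency divided by all tags L of the resource."""
    return freq / total_tags


def wdf(freq, total_tags):
    """Within Document Frequency: ld(freq + 1) / ld(L)."""
    if total_tags <= 1:
        raise ValueError("WDF needs more than one tag assignment (ld(1) = 0)")
    return log2(freq + 1) / log2(total_tags)


# Hypothetical tag distribution of one resource in a Broad Folksonomy.
tag_counts = {"dog": 16, "cat": 8, "mouse": 4}
L = sum(tag_counts.values())

for tag, freq in tag_counts.items():
    print(tag, tf(freq), round(rel_tf(freq, L), 3), round(wdf(freq, L), 3))
```

The logarithm compresses the value range: the most popular tag is four times as frequent as the least popular one, but its WDF value is less than twice as high.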
For the calculation of tag frequencies within resources, all three variants can be used. The calculation of the WDF value has established itself, at least in text statistics (Stock, 2007a, 322). In order to incorporate the discriminatory power of the tags into the determination of a resource’s RSV, the value of a variant of term frequency is multiplied with Inverse Document Frequency (IDF) (1c) (Lux, Granitzer, & Kern, 2007; see chapter two). The IDF value is calculated from the quotient of the total number of resources in the database (N) and the number of resources containing the tag at least once (n):
$$\mathrm{IDF}(t) = \mathrm{ld}\left(\frac{N}{n}\right) + 1$$

In Van Damme, Hepp and Coenen (2008, 121), this variant of IDF is called ‘Inverse Resource Frequency’ or ‘IRF’ in order to express the provision for tags in the very name. The calculation of IDF and, later on, its multiplication with a variant of term frequency (TF, rel.TF or WDF) serve to favor tags in the RSV that occur neither too frequently nor too seldom. This thought has been formalized in ‘Luhn’s Thesis’ (Luhn, 1958). In the end, one only needs a reference value for satisfying this demand, where the IDF value has established itself in text statistics. Instead of the IDF, a reference to the total number of tags in the information platform would also be conceivable. The ‘Inverse Tag Frequency’ (ITF) can be calculated thus:
$$\mathrm{ITF}(t) = \mathrm{ld}\left(\frac{M}{m}\right) + 1$$

where M = total number of tags in the folksonomy (counting multiple mentions in one resource) and m = number of occurrences of the tag t in the entire folksonomy (multiple mentions in one resource also included). To exemplify this, we will consider the calculation of the ITF value. A collaborative information service works with a folksonomy consisting of 4 individual tags, where each tag has been indexed four times by the users. The results are an M value of 16, m of 4 and in consequence, an ITF value of 3 (2+1). The number of query tags leading to a resource can also be analyzed for the ranking (1d) (Sen et al., 2007; see also Freyne & Smith, 2004). In Narrow Folksonomies, one must take recourse to the search-tag distributions anyway, and the idea can be applied to Broad Folksonomies without any problem. To implement the determination of the query-determined ‘Search-Tag Weight’ STW(t,r), the frequency of co-occurrence with the search tag must be determined for all resources matching the search tag. Since tags are textual elements of the folksonomy, they can be processed with text-statistical means for the ranking, e.g. via simple term frequency TF, relative term frequency rel.TF or Within Document Frequency WDF. The calculation of WDF for search tags WDF(st,r) occurs analogously to the calculation with indexing tags:
$$\mathrm{WDF}(st, r) = \frac{\mathrm{ld}(\mathrm{freq}(st, r) + 1)}{\mathrm{ld}(L)} + 1$$
where freq(st,r) expresses the frequency of the resource r being successfully found by the search tag st, i.e. clicked on, and L is the total number of search tags for this resource. The addition of 1 counteracts the effect of not yet retrieved resources receiving an RSV of 0. The calculated value corresponds to the Search-Tag Weight STW(t,r) of the resource. It is also possible, however, not only to make relevance judgments concerning a retrieved resource, but also to attach ratings to the tags themselves, thus preparing them for the ranking. Sen et al. (2007) implemented a rating option for the film recommendation system ‘MovieLens’ in order to incorporate users into the tags’ rating: “In order to study explicit tag feedback [...] users should be able to rate tags with a single click, and the ratings interface should require minimal screen space” (Sen et al., 2007, 362). Thus Sen et al. (2007) decided that tags would only be ratable as positive or negative via a thumbs-up or thumbs-down command. Their motivation came from the absence or lack of sophistication of rating systems on the popular information platforms: While many commercial applications incorporate both thumbs up and thumbs down ratings, several only employ one or the other. For example, BoardGameGeek originally employed thumbs up and down moderation, but
shifted to only thumbs up moderation to ‘make it harder for people to gang up and reduce hurt feelings’. In sites such as YouTube, users provide positive feedback about items by marking items as ‘favorites’. Other sites allow solely negative feedback. Users of Google Video, for example, may mark tags as ‘spam’ but have no means of providing positive feedback (Sen et al., 2007, 363).
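Returning to the reference values defined before the discussion of tag ratings, the following Python sketch evaluates Inverse Document Frequency, Inverse Tag Frequency and the Search-Tag Weight directly from their formulas. The function and parameter names are hypothetical, ld is read as the logarithm to base 2, and apart from the ITF example (M = 16, m = 4) the input values are chosen for illustration only.

```python
from math import log2


def idf(total_resources, resources_with_tag):
    """Inverse Document Frequency: ld(N / n) + 1."""
    return log2(total_resources / resources_with_tag) + 1


def itf(total_tag_assignments, assignments_of_tag):
    """Inverse Tag Frequency: ld(M / m) + 1."""
    return log2(total_tag_assignments / assignments_of_tag) + 1


def search_tag_weight(hits_via_tag, total_search_tags):
    """Search-Tag Weight: ld(freq(st, r) + 1) / ld(L) + 1."""
    if total_search_tags <= 1:
        raise ValueError("L must be greater than 1 (ld(1) = 0)")
    return log2(hits_via_tag + 1) / log2(total_search_tags) + 1


# ITF example from the text: 4 tags, each indexed four times.
print(itf(16, 4))                              # 3.0

# Hypothetical values: 100 resources, 30 of them carry the tag.
print(round(idf(100, 30), 2))

# Hypothetical values: found 24 times via the tag, 48 search tags in total.
print(round(search_tag_weight(24, 48), 2))

# A resource that was never retrieved via the tag still receives a weight of 1.
print(search_tag_weight(0, 48))                # 1.0
```

The last call shows the effect of the added 1: a resource that has not yet been found via the search tag is not eliminated when the weights are multiplied later on.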
In the ranking, user ratings are then used for a personalized display of the search results, i.e. for personalized film recommendations. Analyzing the tag ratings, Sen et al. (2007) found out that the users rate the tags independently of their indexed resource, which means that such a quality judgment is well suited to be a ranking factor since it refers to the actual tag and not to its relation to the resource. The subjective indexing is thus not put into doubt – it is merely the tag’s expressiveness which is rated. The analysis also revealed that users tend to agree as to what tags are bad rather than which ones are good and that users are quicker to leave good ratings if the thumbs-down option also exists: “Overall, we find that the interface containing both up and down ratings widgets led to the greatest levels of contributions” (Sen et al., 2007, 364). Furthermore, the users do recognize the purpose of good tags, since they search for and click on good tags more often: As with the number of users who applied each tag, we see a gentle upward trend. [...] Num-searches performs well at the high end: 87% of the 128 tag survey ratings for the most searched-for tags were rated three, four or five stars (Sen et al., 2007, 366).
In contrast to Sen et al. (2007), the ‘Relevance Feedback Weight’ RFW(t,r) introduced here regards the adequacy of the tag’s indexing (1e). This means that the tag is rated in dependence on the resource. Here the user must be notified, in the tagging system, that the tag should be rated with regard to the resource and not in terms of its intrinsic value as a tag. The tags rated poorly by the users will not be included in the search result. In other words: if a tag that has been indexed for a resource is deemed unsuitable for that resource by the users, the resource will not be yielded as a search result in queries for this tag.
Figure 4.26: Calculation of the Relevance Feedback Weight.
If we assume that there is a binary relevance judgment of the tags via thumbs-up or thumbs-down rating, the tags are either given a Positive Relevance Feedback Weight PRFW(t,r) or a Negative Relevance Feedback Weight NRFW(t,r). For each of the resource’s tags, it must first be determined how often it is rated positively or
negatively, in dependence of the search request (particularly if there is more than one search term) (see Figure 4.26). The example in Figure 4.26 shows that ‘Tag 1’ has received 3 positive and 5 negative ratings, ‘Tag 2’ 5 positive and 10 negative and ‘Tag 3’ 12 positive and one negative. To determine the PRFW(t,r) or the NRFW(t,r), the weight of the negative rating must be subtracted from the weight of the positive rating. Possible results are then either a PRFW(t,r) of 0, a PRFW(t,r) of +x or an NRFW(t,r) of -x. For all tags of the entire platform folksonomy with a PRFW(t,r), the values are then normalized¹⁵⁹ to the interval [0,1]. Furthermore, 1 is added so that tags with a PRFW(t,r) of 0 can also be accommodated in the calculation and receive an RFW(t,r) of 1. All tags of the entire platform folksonomy with an NRFW(t,r) are taken as absolute values and equally normalized to the interval [0,1]; however, they are not added to but subtracted from 1. This subtraction means that the tag rated lowest (which receives a value of 1 in the normalization) is not considered for the ranking since its RFW(t,r) = 0. In a query with several search terms, the PRFW(t,r) or NRFW(t,r) of the tags are added to the RFW(t,r); in a search request with only one tag, the RFW(t,r) is equal to the PRFW(t,r) or to the NRFW(t,r) of the tag that matches the query.

| W(t,r) | Variants of Term Frequency | Calculation Factor | Variants of Inverse Reference Value | Calculation Factor | Search-Tag Weight | Calculation Factor | Relevance Feedback Weight |
|---|---|---|---|---|---|---|---|
| W(t,r) = | TF(t,r) | * | IDF(t) | * | STW(t,r) | * | RFW(t,r) |
| W(t,r) = | rel.TF(t,r) | * | IDF(t) | * | STW(t,r) | * | RFW(t,r) |
| W(t,r) = | WDF(t,r) | * | IDF(t) | * | STW(t,r) | * | RFW(t,r) |
| W(t,r) = | TF(t,r) | * | ITF(t) | * | STW(t,r) | * | RFW(t,r) |
| W(t,r) = | rel.TF(t,r) | * | ITF(t) | * | STW(t,r) | * | RFW(t,r) |
| W(t,r) = | WDF(t,r) | * | ITF(t) | * | STW(t,r) | * | RFW(t,r) |
Figure 4.27: Determining the Term Weight of Tags in Resources.
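The Relevance Feedback Weight that enters the last column of Figure 4.27 can be sketched in Python using the ratings of the example in Figure 4.26. The function name is hypothetical, and the sketch makes two assumptions: the three tags stand in for the entire platform folksonomy, and normalization means division by the highest value, as described in the footnote on normalizing.

```python
def relevance_feedback_weights(ratings):
    """Map each tag to its RFW from (thumbs_up, thumbs_down) counts.

    Non-negative differences are normalized to [0, 1] and added to 1,
    negative differences are normalized by absolute value and subtracted from 1.
    """
    diffs = {tag: up - down for tag, (up, down) in ratings.items()}
    max_positive = max((d for d in diffs.values() if d > 0), default=0)
    max_negative = max((-d for d in diffs.values() if d < 0), default=0)

    weights = {}
    for tag, diff in diffs.items():
        if diff >= 0:
            normalized = diff / max_positive if max_positive else 0.0
            weights[tag] = 1 + normalized
        else:
            weights[tag] = 1 - (-diff / max_negative)
    return weights


# Ratings from the example in Figure 4.26: (positive, negative).
ratings = {"Tag 1": (3, 5), "Tag 2": (5, 10), "Tag 3": (12, 1)}
rfw = relevance_feedback_weights(ratings)
for tag, weight in rfw.items():
    print(tag, round(weight, 2))
```

Under these assumptions, ‘Tag 2’ as the tag with the worst rating balance receives a weight of 0 and drops out of the ranking, while ‘Tag 3’ receives the maximum weight of 2.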
The multiplication of a variant of term frequency with either Inverse Document Frequency or Inverse Tag Frequency as well as the Search-Tag Weight STW(t,r) and the Relevance Feedback Weight RFW(t,r) leads to the term weight W(t,r) of the tag with regard to a resource (see Figure 4.27). Here it is important to mention that in a Narrow Folksonomy, simple Term Frequency TF(t,r) can only ever have a value of 1, since each tag can only be allocated once per resource. The determining of relative Term Frequency, or of WDF, makes little sense at this point, so that in Narrow Folksonomies, a general value of 1 is assumed for all variants of term frequency (column 2). It would be counter-intuitive for the value of a tag to sink only because a large number of other tags has been indexed for the resource and the tagging system does not allow the multiple allocation of tags.

159 The normalizing of a value means that the highest value of a value set receives the value of 1 and the lowest 0. A simple form of normalizing occurs via the rule of three: all values of the value set are divided by the highest value.
Once a text-statistical value has been determined for each resource (points 1a to 1e), we can use the Vector-Space model in a next step in order to determine the similarity between the resources and the search request and thus create a ranking of the resources (1f) (Stock, 2007a, chapter 20). The use of the Vector-Space model is only possible if the search request has more than one term, though, since each occurrence of the search tag in a resource will be taken into consideration otherwise. The resource is consequently deemed relevant and given a retrieval status value of 1. In other words: each resource containing the search tag is relevant; if more than one resource contains it, they are equally relevant for the query and no relevance ranking is possible. The reason for this is that the Vector-Space model weights the terms from the search request and also the terms of the resource. The query term can be weighted freely and the resource terms contain the value of a variant of TF*IDF. If these values are applied to the Cosine formula, we get the same values in both numerator and denominator, which, if divided, lead to a result of Cosine(Ri, Queryj) = 1. In other words: the query vector from a single query term (vector length equals the query term’s weighting value) and the resource vector from several terms, of which one equals the query term, are on the same dimension. This is the case for all resources containing this single query term, though, which in turn means that no relevance ranking is possible anymore. In the case of several search tags, however, the Vector-Space model can achieve a relevance gradation on the basis of the individual resources’ W(t,r). The dimensions here are the different tags in the database; the value of the dimension W(t)i is determined via the Term Weight W(t,r) (1c).
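The argument about single-tag queries can be checked numerically. The following Python sketch rests on one explicit assumption: the sums of the cosine measure run only over the search-tag dimensions (n = number of search tags), which is the reading under which a single search tag always yields a value of 1. The function name and all weights are hypothetical.

```python
from math import sqrt


def cosine(resource_weights, query_weights):
    """Cosine between resource and query, restricted to the search-tag dimensions.

    resource_weights: tag -> W(t, r) of the resource
    query_weights:    search tag -> freely chosen query weight
    """
    search_tags = list(query_weights)
    numerator = sum(resource_weights.get(t, 0.0) * query_weights[t] for t in search_tags)
    resource_norm = sum(resource_weights.get(t, 0.0) ** 2 for t in search_tags)
    query_norm = sum(query_weights[t] ** 2 for t in search_tags)
    denominator = sqrt(resource_norm * query_norm)
    return numerator / denominator if denominator else 0.0


# Hypothetical term weights W(t, r) of two resources.
resource_a = {"dog": 3.2, "cat": 1.1}
resource_b = {"dog": 0.4, "mouse": 2.0}

# One search tag: both resources receive the same value, no gradation is possible.
single = {"dog": 1.0}
print(cosine(resource_a, single), cosine(resource_b, single))

# Two search tags: the values differ and a ranking becomes possible.
double = {"dog": 1.0, "cat": 1.0}
print(round(cosine(resource_a, double), 3), round(cosine(resource_b, double), 3))
```

With the single search tag, both resources obtain a cosine of 1 regardless of how strongly they are tagged with ‘dog’; only the query with two tags separates them.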
The resources are represented as vectors in an n-dimensional space and the similarity between the resources and the search request W(st)j is finally determined via the cosine (1d):

$$\mathrm{Cosine}(R_i, \mathrm{Query}_j) = \frac{\sum_{i=1}^{n} \left( W(t)_i \cdot W(st)_j \right)}{\sqrt{\sum_{i=1}^{n} \left( W(t)_i \right)^2 \cdot \sum_{i=1}^{n} \left( W(st)_j \right)^2}}$$
where i = a resource from the resource collection, j = search tag, n = number of search tags. The result of the calculation consists of numerical values that state the similarity of search tag and indexing tag, or resource, and are either sorted in descending order for the ranking or processed further with the following ranking factors. The approach of using a time factor for the ranking has already been mentioned in the Interestingness Ranking by Flickr. There it concerned the date of entry of the resource. Xu et al. (2006), however, want the time factor to refer to the tags (1g): “Thus, a higher weight can be assigned to more recent tags than those introduced long time ago” (Xu et al., 2006). More current tags result in a more positive RSV than older ones. If we assume that older tags are more likely to have been used frequently, and thus to receive a larger display in the tag cloud, it can make sense to give new tags, or baby tags, preferred treatment. To calculate the ‘Time Weight’ TW(t), it is sensible to subtract the time of a tag’s first indexing within the entire platform from the time of the search request. The closer both times are to each other, the higher the value of the tag. Thus newer tags are preferred for the ranking. An elapsed time of one year, i.e. there are more than or
exactly 360 days between search date and first indexing date, has no effect on the tag’s Time Weight: the tag is simply too old and receives a value of 1. If there are fewer than or exactly 30 days between search and indexing date, the tag is deemed new and thus receives a Time Weight TW(t) of 2. For periods between 30 and 360 days, the value depends on the elapsed time. A distance of 180 days would thus result in a TW(t) of 2-(180/360), i.e. 1.5. There are approaches that implement a modified PageRank for the ranking (Hotho et al., 2006c, 417). A quality judgment is thus attached to the tag via the user and submitted to the resource for the ranking. The idea of a ‘super-user’ or ‘super-poster’ here takes effect (1h). To Hotho et al. (2006b; 2006c), super-posters are those users who publish large amounts of content and thus seem to be experts in a certain area; for Xu et al. (2006) and also for Gyöngyi, Garcia-Molina and Pedersen (2004) and their TrustRank, super-posters distinguish themselves through innovative and especially high-quality indexing vocabulary: People who introduce original and high quality tags should be assigned higher authority than those who follow, and similarly for people who are heavy users of the system. One way to handle this is to give the user who introduces an original tag some bonus credit each time the tag is reinforced by another user (Xu et al., 2006). We propose a reputation score for each user based on the quality of the tags contributed by the user (Xu et al., 2006).
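The Time Weight rule described earlier in this passage can be written as a small piecewise function. The following Python sketch uses a hypothetical function name and assumes the divisor 360 for the middle range, which is the divisor applied in this section’s worked example (2 - 60/360).

```python
def time_weight(days_since_first_indexing):
    """Time Weight TW(t) of a tag, based on the days between the tag's
    first indexing on the platform and the search request."""
    if days_since_first_indexing <= 30:
        return 2.0   # new tag ('baby tag'): preferred treatment
    if days_since_first_indexing >= 360:
        return 1.0   # old tag: neutral factor in the multiplication
    # Assumed linear rule for the middle range.
    return 2.0 - days_since_first_indexing / 360


for days in (10, 60, 180, 400):
    print(days, round(time_weight(days), 2))
```

Under this assumption the weight is not continuous at the 30-day boundary (a tag that is 31 days old receives about 1.91 instead of 2), which may or may not be intended.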
Koutrika et al. (2007) suggest a combination of super-posters and tag popularity for the ranking: Given a query tag t, coincidence factors [co-occurrence, A/N] can be taken into account for ranking documents returned for a specific query. [...] In words, a document’s importance with respect to a tag is reflected in the number and reliability of users that have associated t with d. d is ranked high if it is tagged with t by many reliable taggers. Documents assigned a tag by few less reliable users will be ranked low (Koutrika et al., 2007, 61).
Lee et al. (2009) are able to demonstrate that users who are familiar with tagging systems and web catalogs index better tags for the resources than beginners: Our overall finding suggests that experts (i.e. high familiarity [with the concept of tagging, Web directories and social tagging systems, A/N]) are likely to perform better than novices (i.e. low familiarity [with the concept of tagging, Web directories and social tagging systems, A/N]) in terms of using more effective tags for content sharing (Lee et al., 2009)
Thus it seems sensible to incorporate a factor with this emphasis into the ranking. The ranking factor ‘Super-Poster Weight’ SPW(t) refers to the tag within the entire folksonomy, just like the Time Weight. That means that the more super-posters index a certain tag, the greater its weight becomes for the ranking. The basic assumption for the SPW(t) is thus that super-posters have lots of experience indexing tags and hence (probably) index good tags more often. A super-poster can here be defined as a user who has tagged at least x (e.g. 100) resources. The Super-Poster Weight is calculated as follows:
$$\mathrm{SPW}(t) = \frac{\mathrm{SP}(t)}{\mathrm{P}(t)} + 1$$
where SP(t) = number of super-posters having indexed the tag t at least once and P(t) = number of users having indexed the tag at least once (including super-posters). The addition of 1 counteracts a result of SPW(t) = 0 if the tag has not yet been indexed by a super-poster. Dependent on their frequency distribution, some tags are marked as Power Tags¹⁶⁰ (1i). The weighting value of Power Tags, the ‘Power Tag Weight’ PTW(t,r), strengthens the tag’s term frequency value: tags that have already been indexed a lot for a resource are preferred in the ranking. This means that all Power Tags receive a PTW(t,r) of 2 and all other tags of the resource a PTW(t,r) of 1, so that the latter remain neutral in the ranking. For the resources’ relevance ranking after the first feature set, the single ranking factors must now be aggregated. Here all factors can be considered or only a selection. The aggregation of the factors occurs via multiplication to a partial Retrieval Status Value of the resource (W1(t,r)) via feature set 1. For search requests with only one query tag, the following holds:

$$W_1(t, r) = \mathrm{WDF}(t, r) \cdot \mathrm{IDF}(t) \cdot \mathrm{STW}(t, r) \cdot \mathrm{RFW}(t, r) \cdot \mathrm{TW}(t) \cdot \mathrm{SPW}(t) \cdot \mathrm{PTW}(t, r)$$

while search requests with two and more query tags use the following ranking algorithm:
$$W_1(t, r) = \mathrm{Cosine}(R_i, \mathrm{Query}_j) \cdot \mathrm{TW}(t) \cdot \mathrm{SPW}(t) \cdot \mathrm{PTW}(t, r)$$

The following example shall serve to clarify: a folksonomy consists of a total of 144 tags and 100 resources. The search request is ‘dog.’ A resource has been indexed with the tags ‘dog,’ ‘cat’ and ‘mouse,’ where ‘dog’ has been indexed 16 times, ‘cat’ 8 times and ‘mouse’ 4 times. Also, the resource has been retrieved 24 times via the tag ‘dog;’ in total, the resource has been found via 48 tags. The tag ‘dog’ has been rated good 5 times and bad 3 times. All in all, there are 30 resources on the platform that contain the tag ‘dog’ at least once.
• The WDF(t,r) of ‘dog’ results from ld(16+1)/ld(16+8+4) and is thus 0.85.
• The IDF(t) of ‘dog’ results from ld(100/30)+1 and is thus 2.74.
• The STW(t,r) of ‘dog’ results from ld(24+1)/ld(48)+1 and is thus 1.83.
• The RFW(t,r) of ‘dog’ results from the subtraction of 5-3 and the normalization of a PRFW(t,r) of 2 to the interval [0,1]. The addition of 1 after the normalization means that the RFW(t,r) of ‘dog’ equals 2.
• The tag ‘dog’ is relatively new and has first been indexed only 60 days before the search request. Thus the tag receives a TW(t) of 1.83 in the resource (because 2-60/360).
• 21 super-posters have indexed the tag ‘dog’ at least once, while it has been indexed by 30 users in total. The SPW(t) of ‘dog’ is thus 1.7 in the resource.
• The tag ‘dog’ is a Power Tag of the resource and thus gets a PTW(t,r) of 2.
• The calculation via the Vector-Space model cannot occur, since there is only one search argument (‘dog’).
The Retrieval Status Value W1(t,r) of the resource is thus 0.85*2.74*1.83*2*1.83*1.7*2, and thus approximately 53.04.

160 Power Tags will be discussed in more detail further below.

The second criteria bundle incorporates the users’ active collaboration, i.e. the individual users’ behavior added to the respective resource, into the ranking, as positive factors (Agichtein, Brill, & Dumais, 2006; Joachims, 2002; Xue et al., 2004). Tonkin et al. (2008) may still ask: “Amazon and Google use personal
information to generate popularity or relevance indicators, do non-subject tags offer any similar advantages?” (Tonkin et al., 2008); but the following remarks will show that very different kinds of collaboration, even from the area of search engine analysis and others, can be implemented for the ranking. The elements of this feature set work query-independently, meaning that they are formed from resource-inherent factors and ascribe a value to the resource. Thus the click rates (2a) can be called on for the ranking with regard to single resources (Culliss, 1997). For Jung, Herlocker and Webster (2007), Xue et al. (2004) and also Joachims (2002), the click rates are an implicit relevance feedback in web searches, since the user is not asked directly for a relevance judgment of the resource, but the resource’s relevance is inferred from user behavior: “Inferring relevance from implicit feedback is based on the assumption that users continuously make tacit judgements of value while searching for information” (Jung, Herlocker, & Webster, 2007, 792). The basic assumption here is that “a user is more likely to click on a link, if it is relevant to q” (Joachims, 2002, 134). Click data are a highly collaboration-oriented ranking criterion of Web 2.0. The more a certain search result is clicked on, the more important or relevant it appears to be for the query and the better it should be ranked. The click numbers are summed up for the resources and thus represent the collaboration of the users, who in turn enter the ranking as values themselves. Jung, Herlocker and Webster (2007) thus conclude, with regard to click data: When relevance feedback is used to benefit all users of the search engine, then it can be considered collaborative filtering. Relevance feedback from one user indicates that a document is considered relevant for their current need.
If that user’s information need can be matched to others’ information needs, then the relevance feedback can help improve the others’ search results (Jung, Herlocker, & Webster, 2007, 794).
To exploit click rates, one must first execute a search, get a display of the search results and then rank them on the basis of other ranking factors. Like a learning system (Joachims, 2002), the ranking factor ‘click rates’ is extremely dynamic and the effect of the analysis only becomes apparent with time. The click data can be extracted from the platform search engines’ log files: “any information that can be extracted from logfiles is virtually free and substantially more timely” (Joachims, 2002, 133). The time a user spends with the resource can also be registered, analyzed and incorporated into the ranking: They issue queries, follow some of the links in the results, click on ads, spend time on pages, reformulate their queries, and perform other actions. […] Our rich set of implicit feature, such as time on page and deviations from the average behaviour, provides advantages over using clickthrough alone as an indicator of interest (Agichtein, Brill, & Dumais, 2006, 19 & 25).
Xue et al. (2004) refine the click-rate method via a co-occurrence analysis in order to generate larger amounts of data for further similarity calculations: The basic assumption of the co-visited method is that two web pages are similar if they are co-visited by users with similar queries, and the associated queries of the two web pages can be taken (merged) as the metadata for each other (Xue et al., 2004, 118).
Richardson, Prakash and Brill (2006) go one step further and combine the click data evaluation with the time of day. Thus the resource ranking can be adjusted to user needs: Finally, the popularity data can be used in other interesting ways. The general surfing and searching habits of Web users varies by time of day. Activity in the morning, daytime, and evening are often quite different (e.g., reading the news, solving problems, and accessing entertainment, respectively). We can gain insight into these differences by using the popularity data, divided into segments of the day. When a query is issued, we would then use the popularity data matching the time of query in order to do the ranking of Web pages (Richardson, Prakash, & Brill, 2006, 714).
Two points are problematic in the usage of click rates for the ranking: on the one hand, Xue et al. (2004) point out that click rates can only be evaluated for a resource if that resource has already been called up at least once. On the other hand, users’ clicking behavior depends on the display of the search results to a large degree, which is why Joachims (2002) emphasizes: “This means that clickthrough data does not convey absolute relevance judgements, but partial relative relevance judgements for the links the user browsed through” (Joachims, 2002, 135). Furthermore it is as yet unclear what influence a ‘malicious’ user can have on the ranking if he repeatedly clicks on a certain link. If the click rates should still have an influence on the resource’s ranking, it is sensible to divide the click rate of the resource CR(r) over a certain period of time by the sum of the click rates of all resources within the platform CRP:
$$\mathrm{CRW}(r) = \left[ \frac{\mathrm{CR}(r)}{\mathrm{CRP}} \rightarrow [0,1] \right] + 1$$

To compress the result value area, a normalization of the CRW(r) value to the interval [0,1] is executed directly after the above calculation. This means that the resource with the highest Click-Rate Weight CRW(r) of the entire platform receives the value 1 and all other resources are adjusted to this highest value by having their values divided by the maximum Click-Rate Weight CRW(r). After the normalization, 1 is added to the new result values. This addition means that resources that have not been clicked yet – thus having a Click-Rate Weight CRW(r) of 0 – do not also receive a Retrieval Status Value of 0, which would lead to them being multiplied with 0 during the later aggregation of the ranking factors. The maximum CRW(r) of a resource is thus 2, and the minimum is 1. The number of different users who index a resource could also be a meaningful ranking factor (2b) (Sen et al., 2007). High-frequency discussions on the basis of certain resources again point to resources of immense importance for the community. A resource that is indexed by many users is probably relevant and can receive a higher RSV. The ‘User Weight’ UW(r) reflects user activity with regard to a resource: how many users index this resource? The division of the number of indexing users for a resource IU(r) by the total number of users of the platform N gives us the User Weight:
$$UW(r) = \left[ \frac{IU(r)}{N} \rightarrow [0,1] \right] + 1$$
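The Click-Rate Weight and the User Weight follow the same pattern: divide a raw count by a platform-wide total, rescale by the platform maximum, and add 1. The following is a minimal sketch of that pattern, not an implementation taken from the source; the function name `normalized_weight`, the `divisor` parameter and the click counts are illustrative assumptions.

```python
def normalized_weight(counts, divisor=None):
    """Map raw per-resource counts to weights in the interval [1, 2].

    counts  -- dict: resource id -> raw count (e.g. clicks or indexing users)
    divisor -- platform-wide total (e.g. CRP or N); defaults to the sum of counts
    """
    if divisor is None:
        divisor = sum(counts.values())
    if divisor == 0 or max(counts.values(), default=0) == 0:
        # No activity at all: every resource keeps the neutral weight 1.
        return {resource: 1.0 for resource in counts}
    shares = {resource: count / divisor for resource, count in counts.items()}
    top = max(shares.values())
    # Divide by the maximum share (normalization to [0, 1]), then add 1.
    return {resource: 1.0 + share / top for resource, share in shares.items()}


# Hypothetical click counts for three resources.
clicks = {"a": 50, "b": 25, "c": 0}
crw = normalized_weight(clicks)
```

Under these assumptions the most-clicked resource receives 2, the unclicked resource receives 1, and "b" lands at about 1.5. Because the values are divided by the maximum afterwards, the choice of divisor (CRP or N) cancels out and does not change the final weights.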
Information Retrieval with Folksonomies
Here, too, a normalization of the result value to the interval [0,1] must be performed via the procedure mentioned above in order to make the results comparable. After the normalization follows the addition of 1 to the new result value, which again means that non-indexed resources do not receive an RSV of 0 but of 1, and that the maximum Retrieval Status Value of a resource is 2. The number of commentators of a resource (2c) can determine its placement in the ranking, since here too, user interest is directly reflected. Analogously to the User Weight, we can here calculate a ‘Commentator Weight’ CoW(r) for the resource:
$$CoW(r) = \left[ \frac{Co(r)}{CoS} \rightarrow [0,1] \right] + 1$$
where Co(r) = the number of commentators for a resource and CoS = the sum of commentators within the platform. After the normalization to the interval [0,1] follows addition of 1 to the result value so that the maximum value for a resource is 2 and the minimum is 1.
The resources of the collaborative information services often have links and thus provide opportunities for the calculation of their link-topological placement (2d): either via the Kleinberg algorithm (Kleinberg, 1999) or the PageRank (Brin & Page, 1998). Resources with a lot of inlinks are referred to as authorities and are the center of interest of certain discussions or users and should thus receive a higher ranking. Resources that have a lot of outlinks, i.e. hubs, can be given consideration in the ranking (Marlow et al., 2006a; 2006b). It is important to emphasize at this point that resources, i.e. documents, as well as users can be made authorities/hubs. As mentioned before, ranking factors that are purely quantitative in nature always have to deal with certain problems. The assumption that ‘quantity equals quality’ is not always sustained. Thus one should never exclusively follow a quantity-oriented approach but always combine it with other methods, preferably ones that concentrate on quality.
In the case of linked resources, a ‘PageRank Weight’ PRW(r) can be determined for the resource. Here the resource’s PageRank is calculated iteratively, via the method described in chapter two. This value is then brought to the interval [0,1] and 1 is added, in order to guarantee the results’ comparability and not to give too great a weight to any one ranking factor. The maximum PageRank Weight is again 2, the minimum is 1. The aggregation of the ranking factors of the second feature set is also performed via a multiplication, the result of which reflects the second part of the Retrieval Status Value of the resource W2(r):
$$W2(r) = CRW(r) \cdot UW(r) \cdot CoW(r) \cdot PRW(r)$$
By normalizing all values of the ranking factors to the interval [0,1], and by adding 1, we arrive at a maximum value of 16 (2*2*2*2) for W2(r) and a minimum W2(r) value of 1. The third packet of ranking criteria refers to the user and his conscious acts. In contrast to click data, which only implicitly reflect use behavior, the following ranking factors are explicitly and actively designed by the users. The factors presented here are query-independent, just as in the previous feature set. Performative verbs not only describe an action, but also perform an action by virtue of being uttered (Austin, 1972). Typical performative verbs are ‘to baptize’ or
‘to bless.’ In folksonomies, too, performative verbs are used for indexing resources, e.g. ‘todo’ or ‘toread’ (3a). In Kipp (2006a; 2006b) these verbs are described as ‘emotional’ or ‘affective.’ In any case, they make a direct value judgment concerning the resource. If a user wants to ‘read’ a resource, it seems to be important and relevant and can thus be rated higher in the ranking. Performative tags can be identified via alignment with dictionaries. One problem is that tags might persist long after they have lost their relevance for the user; he might have already read the resource but neglected to delete the tag on the resource level. At this point, a combination with a time factor would make sense. The longer a performative tag persists on the resource level and per user, the less relevant it will be. The total value of the tag for the resource would then be calculated via the sum of all values per user.
The ranking factor ‘Performative Tag Weight’ PerTW(t,r) must be calculated via several steps, since many elements contribute to the calculation. The basic assumption when tagging with performative tags is the following: several users index a performative tag at different points in time. The tag’s relevance for the user decreases with time, so that the resource also loses relevance. If a user searches within the collaborative information service, resources that have only lately been indexed with a performative tag are ranked higher. (1) The first step towards determining the PerTW(t,r) is thus determining the performative tags via dictionary alignment. (2) Then follows the calculation of the average value of a tag. This shall be illustrated in Figure 4.28:
[Figure 4.28: users N1, N2 and N3 each assign Performative Tag A to the same resource, 90, 180 and 360 days before the search request; average: 210 days.]
Figure 4.28: Time Factor in Performative Tags.
Users 1-3 (‘N1,’ ‘N2’ and ‘N3’) have each indexed the resource with the same performative tag A, but at different times: user 1 did so 90 days, user 2 180 days and user 3 360 days before the specific search request. The arithmetic mean of these values is calculated, which gives tag A a mean age of 210 days. (3) The third step then consists of determining the mean age of every performative tag of the resource. (4) Since more recent performative tags are meant to provide a higher RSV for the resource, the mean ages are now aligned analogously to the procedure in determining the Time Weight: if a performative tag has a mean age of greater than or equal to 360 days, it receives a value of 1; if the mean age is less than or equal to 30 days, its value is 2. Mean ages x that lie between 30 and 360 days are calculated via 2 − x/330. (5) This calculation is performed for each performative tag PT(t,r) of the resource. (6) The values determined in (4) and (5) are added and the result is the Performative-Tag Weight PerTW(t,r):
$$PerTW(t,r) = \sum_{n=1}^{k} PT(t,n).$$
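Steps (2) to (6) can be traced in a short sketch. It follows the rule exactly as printed; the function names and the tag label `toread` attached to the ages from Figure 4.28 are illustrative assumptions. Taken literally, the linear rule 2 − x/330 does not meet the boundary values exactly (it gives about 1.91 at 30 days and about 0.91 at 360 days), and the sketch makes no attempt to smooth this.

```python
from statistics import mean


def performative_tag_value(mean_age_days):
    """Step (4): map a tag's mean age in days to its value PT."""
    if mean_age_days <= 30:
        return 2.0
    if mean_age_days >= 360:
        return 1.0
    return 2.0 - mean_age_days / 330.0


def performative_tag_weight(tag_ages):
    """Steps (2)-(6): tag_ages maps each performative tag of one resource
    to the list of ages (in days) of its assignments by individual users."""
    values = {tag: performative_tag_value(mean(ages))
              for tag, ages in tag_ages.items()}
    return sum(values.values()), values


# Ages from Figure 4.28: three users, 90, 180 and 360 days before the search.
weight, values = performative_tag_weight({"toread": [90, 180, 360]})
```

With a mean age of 210 days the single tag contributes 2 − 210/330, roughly 1.36, which is then also the PerTW of this resource, since it carries no other performative tag.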
Another quality judgment concerning the indexed resource can be made via the tag’s sentiment (3b). ‘Opinion-expressing tags,’ such as ‘cool,’ ‘waste of money’ or ‘stupid’ (Zollers, 2007), rate the resource directly and can influence the ranking. Negative value judgments thus lead, depending on their resource-specific frequency of occurrence, to a decrease of the Retrieval Status Value; tags with positive connotations increase the resource’s value. The correct allocation of the connotation can be executed via a dictionary alignment of positive and negative words or via ‘sentiment analysis’ (Pang & Lee, 2008; Ziegler, 2006). This is where folksonomies’ ranking options distinguish themselves from traditional KOS. In the latter, value judgments are not valid indexing vocabulary and, consequently, cannot be exploited for the ranking. This possibility of subjective indexing only exists in folksonomies, leading Golder and Hubermann (2006) to regard folksonomies as recommender systems: “Delicious functions as a recommendation system, even without explicitly providing recommendations” (Golder & Hubermann, 2006, 207).
For the ranking factor ‘Sentiment Weight’ SW(t,r), the positive and negative connotations of the tags are taken into consideration; neutral judgments have no influence on the SW(t,r). The goal of this ranking factor is to allocate a higher value to resources with (many) positive tags than to resources with (many) negative tags. In calculating the SW(t,r), one can proceed analogously to the determining of the Relevance Feedback Weight, but independently of the search request. Since this ranking factor makes no provision for the search terms, the value can be determined for all resources in the information platform. The example in Figure 4.29 will serve to illustrate the calculation:
Figure 4.29: Effects of the Sentiment Weight of a Tag on the Retrieval Status Value of the Resource.
A resource was indexed 5 times with the positive tag ‘cool,’ 8 times with the positive tag ‘great,’ 9 times with the negative tag ‘dumb’ and 8 times with the negative tag ‘lame.’ The addition of the resource’s individual tags results in a Positive Tag-Sentiment Weight PTSW(t,r) of 13 (5+8) and a Negative Tag-Sentiment Weight NTSW(t,r) of 17 (9+8). If we now subtract the NTSW(t,r) from the PTSW(t,r), the result is a Positive Sentiment Weight PSW(t,r) of the resource for values greater than or equal to 0, and a Negative Sentiment Weight NSW(t,r) for values smaller than 0. In the example, the result is a NSW(t,r) of -4, which states that the resource is generally rated negatively by the users. Since the Sentiment Weight SW(t,r) is calculated for all resources of the platform, the PSW(t,r) and NSW(t,r) values of the resources must now be normalized to the interval [0,1], where the resource with the highest PSW(t,r), or the lowest NSW(t,r), receives the value 1. After the normalization, 1 is added to the PSW(t,r) values; the normalized NSW(t,r) values, on the other hand, are subtracted from 1. This means that the best resource receives a Sentiment Weight SW(t,r) of 2 and the worst resource an SW(t,r) of 0:
$$
SW(t,r) =
\begin{cases}
\left[ PTSW(t,r) - NTSW(t,r) \right] \rightarrow [0,1] + 1 & \text{if } PSW > 0 \\
\left[ PTSW(t,r) - NTSW(t,r) \right] + 1 & \text{if } PSW = 0 \\
\left[ PTSW(t,r) - NTSW(t,r) \right] \rightarrow 1 - [0,1] & \text{if } NSW < 0
\end{cases}
$$
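The three cases can be sketched for a whole platform as follows. Only resource "a" (13 positive and 17 negative tag assignments) comes from the example above; the other resources, the function name and the data layout are assumptions made for illustration.

```python
def sentiment_weights(tag_sentiment_totals):
    """tag_sentiment_totals: resource id -> (PTSW, NTSW).

    Returns resource id -> SW in [0, 2]: above 1 for resources rated
    positively on balance, below 1 for those rated negatively, 1 for neutral.
    """
    diffs = {r: pos - neg for r, (pos, neg) in tag_sentiment_totals.items()}
    # Platform-wide maxima used for the normalization to [0, 1].
    max_pos = max((d for d in diffs.values() if d > 0), default=0)
    max_neg = max((-d for d in diffs.values() if d < 0), default=0)
    weights = {}
    for resource, diff in diffs.items():
        if diff > 0:
            weights[resource] = 1 + diff / max_pos
        elif diff < 0:
            weights[resource] = 1 - (-diff) / max_neg
        else:
            weights[resource] = 1.0
    return weights


platform = {
    "a": (13, 17),  # the resource from Figure 4.29: difference -4
    "b": (20, 0),   # hypothetical: best-rated resource
    "c": (3, 11),   # hypothetical: worst-rated resource
    "d": (10, 0),   # hypothetical
    "e": (4, 4),    # hypothetical: balanced ratings
}
sw = sentiment_weights(platform)
```

In this toy platform "b" receives 2, "c" receives 0, and the resource from the example ends up at 0.5, since its difference of -4 is half as large as the most negative difference of -8.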
Good resources are thus preferred in the ranking, while the worst resource in the entire database is no longer included, since its Sentiment Weight of 0 reduces the product of the ranking factors to 0.
The rating of the resources, tags or users can also occur via rating systems (3c). These systems concentrate on direct dialog with the users. As mentioned above, the three elements of the folksonomy can be rated either positively or negatively via thumbs-up or thumbs-down statements. A more differentiated rating can be executed via stars (see Amazon) or slide controls (Schmidt & Stock, 2009) and is thus able to express nuances. The concrete recommendation of resources via a ‘send a friend’ function and the printing or saving as a bookmark can also be viewed as positive ratings. The ratings in turn influence the resource’s RSV. The user is thus given the opportunity to actively participate in determining the respective Retrieval Status Value. This ranking factor uses the idea of the democratic relevance ranking once more. Here it can work its way into the ranking as either an implicit weighting factor for the RSV or explicitly (“Other users found this document to be very helpful” or “The document received 5 out of 6 relevance stars”).
The ranking factor ‘Rating-System Weight’ RSW(r) is determined via the number of rating points submitted per resource, e.g. 0-5 stars. This is done by cumulating the stars each resource receives. The total number of stars is brought to the interval [0,1] relative to all resources in the entire platform. 1 is added to the result, which leads to a maximum value of 2 per resource. The resource with the most stars within the entire information platform receives the value 2; all other resources are adjusted accordingly. The individual ranking factors of the feature set ‘users’ are accumulated into the Partial Retrieval Status Value W3(r) of the resource:
$$W3(r) = PerTW(t,r) \cdot SW(t,r) \cdot RSW(r)$$
The entire Retrieval Status Value of the resource RSV(r) results from the aggregation, e.g. multiplication, of the individual feature sets’ values:
$$RSV(r) = W1(r) \cdot W2(r) \cdot W3(r)$$
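A small sketch may make the multiplicative aggregation concrete. It assumes that the partial values W1(r) and W2(r) have already been computed elsewhere; all names and numbers below are hypothetical and serve only to show how the Rating-System Weight is scaled and how a single factor of 0 removes a resource from the ranking.

```python
def rating_system_weight(stars_per_resource):
    """Scale cumulated stars per resource to [1, 2]; the top resource gets 2."""
    top = max(stars_per_resource.values(), default=0)
    if top == 0:
        return {r: 1.0 for r in stars_per_resource}
    return {r: 1 + stars / top for r, stars in stars_per_resource.items()}


def retrieval_status_value(w1, per_tw, sw, rsw, w2):
    """RSV(r) = W1(r) * W2(r) * W3(r), with W3(r) = PerTW * SW * RSW."""
    w3 = per_tw * sw * rsw
    return w1 * w2 * w3


# Hypothetical cumulated star counts for three resources.
rsw = rating_system_weight({"a": 40, "b": 10, "c": 0})

# Hypothetical partial values for resource "a" and for a resource whose
# Sentiment Weight is 0 (the worst-rated resource of the platform).
rsv_a = retrieval_status_value(w1=1.2, w2=4.0, per_tw=1.5, sw=2.0, rsw=rsw["a"])
rsv_worst = retrieval_status_value(w1=1.2, w2=4.0, per_tw=1.5, sw=0.0, rsw=1.0)
```

Because the feature sets are multiplied, `rsv_worst` is 0 regardless of how well the resource scores on the other factors, which is the exclusion effect described for the Sentiment Weight.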
Furthermore, the ranking can be personalized for each feature set (3d) that refers directly to the user and takes into consideration his previous activities or explicitly stated wishes. Since a personalization greatly restricts the search results and displays
only a small selection of the possible resources – which is how it should be – we will decide in favor of Recall and go without a personalized filtering of the search result. The personalization of the three feature sets will therefore not be developed conceptually at this point. Nevertheless, in the following I will introduce the personalization approaches prevalent in the literature, to paint a complete picture of the range of folksonomies’ ranking factors.
Braun et al. (2008) observe that in collaborative information services, “Usually, the overall popularity of a resource is used for ranking search results. A personalized search is currently missing that takes the interest of a user into account” (Braun et al., 2008, 1031). Hence they developed their own meta search facilitating a personalized ranking. ‘MyTag’ optionally searches the resources of Flickr, YouTube and del.icio.us, offering several options for restricting the search and for the ranking. Thus the search can be restricted to one’s own personomy, and the ranking can focus on the personomy:
The personomy is automatically built based on the resources the user picks from the result set. It is modelled by a vector p of tag frequencies representing the previous search interests of the user. As it is based on the implicit feedback given by selecting from the search results, no additional user effort is required to gain personalization. [...] The tags of a resource are represented as a vector v of binary values indicating the presence of a tag. The rank r of a resource is then computed by the scalar product of the two vectors: r = v*p (Braun et al., 2008, 1031f.).
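The scalar product r = v*p described in the quotation can be sketched in a few lines. This is not MyTag's actual code; the function name and the tag frequencies are hypothetical.

```python
from collections import Counter


def personomy_rank(resource_tags, personomy):
    """Scalar product of the binary tag vector v of a resource and the
    tag-frequency vector p of the user's personomy."""
    # v is 1 for every tag present on the resource and 0 otherwise, so the
    # product reduces to summing the personomy frequencies of those tags.
    return sum(personomy[tag] for tag in set(resource_tags))


# Hypothetical personomy built from the user's earlier selections.
personomy = Counter({"python": 5, "folksonomy": 3, "web20": 2})

rank_match = personomy_rank(["python", "tutorial", "web20"], personomy)
rank_none = personomy_rank(["cooking"], personomy)
```

A resource sharing the tags ‘python’ and ‘web20’ with the personomy receives the rank 7, while a resource without any of the user's tags receives 0 and therefore sinks in the personalized ranking.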
Resources indexed with the user’s ‘own’ tags thus rise up in the ranking. Richardson, Prakash and Brill (2006) follow a slightly different path and mainly focus on the peculiarities of search and ranking in a corporate intranet. They regret that the PageRank algorithm cannot necessarily be applied to the intranet resources and thus propose the following:
These are domains where the link structure is particularly weak (or nonexistent), but there are other domain-specific features that could be just as powerful. For example, the author of an intranet page and his/her position in the organization (e.g., CEO, manager, or developer) could provide significant clues as to the importance of that page (Richardson, Prakash, & Brill, 2006, 708).
Similarly to the FolkRank (Hotho et al., 2006) or the TrustRank (Gyöngyi, Garcia-Molina, & Pedersen, 2004), the focus is on the user’s significance within the community. However, the relevance is not determined via his activity, e.g. the mass of content or indexing of good tags, but is preset from the beginning according to his position in the company (e.g. CEO). Resources of this user would then rise in the ranking. This ranking factor can only be used if the users, or the community, are known entities. Good resources by ‘low’ or ‘non-relevant’ users would by definition be disregarded in the ranking, which makes little sense. A combination of several ranking factors or the possibility for users, and thus their resources, to ‘rise socially,’ should be implemented to prevent this.
Under the motto ‘More like me!’ however, personalized rankings can also give preferred treatment to resources by users explicitly marked as ‘friends,’ or by similar users. At the same time, the geographical proximity of user and resource can be accounted for in the ranking – the closer, the higher up in the ranking (Butterfield et al., 2006a; 2006b).
The fact that the user can also actively participate in the ranking and get opportunities to leave feedback shall not go unmentioned at this point. This can occur via the traditional Relevance Feedback (e.g. Rocchio, 1971), for example. The traditional explicit Relevance Feedback (Rocchio, 1971; Jung, Herlocker, & Webster, 2007) occurs after the search results have been displayed: the user selects the resources that are relevant for his search request or filters out those that aren’t. Thus the retrieval system receives a direct relevance judgment from the user for further processing. This further processing leads to a second step, in which the selected resources are evaluated and used to generate new search terms. After this, a new query is submitted to the system, which checks the resource stock and generates a new hit list. Possible procedures here include the Vector-Space model (Rocchio algorithm) and the Relevance Feedback procedure after Robertson and Sparck-Jones (Robertson & Sparck-Jones, 1976) or – automatically, by the retrieval system – the Pseudo-Relevance Feedback of the Probabilistic model (Croft & Harper, 1979) (Stock, 2007a, 339-341, 356-360). The advantage of the Relevance Feedback is that the initial query (often consisting of very few tags) is enhanced by further tags from the positively marked resources. The Vector-Space model in particular profits from this query expansion, since it requires many dimensions for evaluation in order to provide an ideal search result. The Relevance Feedback occurs after an initial search and can be performed over several rounds on the basis of each new respective search term and the ranking factors mentioned above. From the user’s perspective, it is important that they be able to turn off such sophisticated ranking algorithms, either wholly or partially (Stock, 2007b). 
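The query expansion step of explicit Relevance Feedback can be illustrated with the classic Rocchio update, in which the new query moves towards the centroid of the resources marked as relevant and away from the centroid of the rejected ones. The sketch below represents queries and resources as tag-weight dictionaries; the parameter values for alpha, beta and gamma are commonly used defaults and, like the sample data, are assumptions that do not come from the source.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query vector (tag -> weight).

    query        -- dict: tag -> weight of the initial query
    relevant     -- list of dicts (tag -> weight) marked relevant by the user
    non_relevant -- list of dicts (tag -> weight) rejected by the user
    """
    new_query = defaultdict(float)
    for tag, weight in query.items():
        new_query[tag] += alpha * weight
    # Add the centroid of the relevant resources, subtract that of the others.
    for docs, factor in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for tag, weight in doc.items():
                new_query[tag] += factor * weight / len(docs)
    # Tags that end up with a non-positive weight are dropped from the query.
    return {tag: weight for tag, weight in new_query.items() if weight > 0}


expanded = rocchio(
    query={"folksonomy": 1.0},
    relevant=[{"folksonomy": 1, "tagging": 1}, {"tagging": 1, "web20": 1}],
    non_relevant=[{"taxonomy": 1}],
)
```

The one-tag query is enhanced by ‘tagging’ and ‘web20’ from the positively marked resources, which is the effect the text attributes to Relevance Feedback; ‘taxonomy’ receives a negative weight and is left out.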
A dropdown menu seems to be the ideal method for incorporating the users into the process of adjusting the ranking factors in order to arrange the search results. The users can then decide for themselves by what criteria they want the results to be arranged and displayed. Perhaps they prefer formal criteria (date of publication, author etc.) or only very specific criteria from relevance ranking’s toolbox. All user needs could be satisfied this way.
As has been demonstrated, tagging systems offer a wealth of ranking options that goes beyond the traditional methods of database providers and web searches. Hammond et al. (2005) thus conclude: “When it comes to ranking by relevance (as opposed to merely identifying potentially relevant resources), this kind of data might actually be more useful than a much smaller number of tightly controlled keywords” (Hammond et al., 2005). Morrison (2008) also concludes that a combination of folksonomies and traditional retrieval and ranking methods improves the retrieval of relevant resources. Morrison (2008) investigated the retrieval effectiveness of search engines, web catalogs and folksonomies and found out the following:
Documents that were returned by all three types were most likely to be relevant, followed closely by those returned by directories and search engines and those returned by folksonomies and search engines. This is very interesting because it suggests that meta-searching a folksonomy could significantly improve search engine results (Morrison, 2008, 1575f.).
This discovery can also be used for the ranking: if tagged resources pop up in catalogs and search engines, they will receive a higher ranking. Ranking algorithms that make use of folksonomies automatically incorporate a social component into their method, since they always point to the user, the collaboration of the users and the users’ knowledge, manifested in the tags. Thus
folksonomies, particularly with regard to their possibilities for relevance ranking, are an important component of social search (Brusilovsky, 2008). Egozi (2008) summarizes the ranking approaches in social search and prototypical proxies in a four-field schema (see Figure 4.30) and thus demonstrates that folksonomies can be located in each of the fields.
Figure 4.30: Ranking Approaches in Social Search. Source: Modified after Egozi (2008, Fig. 1).
Both the quadrant at the top left and the quadrant at the bottom right reflect the search engines’ and information providers’ ranking methods applied so far, which are the result of an aggregation of structural information on the resource to be ranked (see top left, e.g. via links) on the one hand, and on the other hand from users’ personal behavior for the purpose of collaborative filtering (see bottom right, e.g. on Amazon):
Personalized methods tailor results to each individual user’s social footprint, whereas Aggregated methods have all of the user’s footprints contribute to a central ranking value. Structure-based approaches take social context from explicit social graph structure, as opposed to behavior-based using implicit social hints, such as like-minded clicks and votes (Egozi, 2008).
The two quadrants that are left aggregate user activity (see bottom left) and use the platforms’ network structure in order to offer a personalized ranking (see top right). The ranking factors introduced in this section have also been localized in the four-field schema and thus confirm the good applicability of folksonomies in the social search ranking.
Ranking algorithms are not immune to attacks and manipulation by spammers. The deliberations on ranking factors in tagging systems presented here can, unfortunately, easily fall victim to spam – especially after they have been discussed here in detail. Particularly susceptible are the ranking factors that are based on the tags’ mere indexing frequency. A first step to avoiding spam, especially with regard to this aspect, is the saving of tagging users’ IP addresses (see footnote 161). The multiple indexing of the same tags from one IP address should thus be, if not prevented outright, at the very least made more difficult. Amazon already uses this form of spam obstruction and does not register a tag indexed more than once in the resource’s folksonomy; without, however, alerting the user that he has already indexed this tag for that resource.
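How such a check might look is sketched below. The source does not describe how Amazon implements it, so the function, the data structures and the sample addresses are purely hypothetical: a repeated assignment of the same tag to the same resource from the same IP address is silently ignored.

```python
def register_tag(seen, tag_counts, ip_address, resource, tag):
    """Count a tag assignment unless the same IP address already made it.

    seen       -- set of (ip_address, resource, tag) triples already stored
    tag_counts -- dict: (resource, tag) -> indexing frequency
    Returns True if the assignment was counted, False if it was ignored.
    """
    key = (ip_address, resource, tag)
    if key in seen:
        return False  # ignored without notifying the user
    seen.add(key)
    tag_counts[(resource, tag)] = tag_counts.get((resource, tag), 0) + 1
    return True


seen, tag_counts = set(), {}
results = [
    register_tag(seen, tag_counts, "192.0.2.1", "r1", "cool"),
    register_tag(seen, tag_counts, "192.0.2.1", "r1", "cool"),  # repeat
    register_tag(seen, tag_counts, "192.0.2.7", "r1", "cool"),
]
```

The second call leaves the indexing frequency unchanged, so the tag ‘cool’ is counted twice rather than three times. Several users behind one shared IP address would be treated as a single user by this sketch, which is one reason why the text speaks only of a first step.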
Power Tags
Tag distributions play an important role in collaborative information services as they can be applied for multiple tasks: they determine the size and structure of the tags in tag clouds, they are used for tag recommender systems or their chronological changes are visualized in timelines (e.g. Technorati). Here tag distributions can be localized on two different levels: on the resource level (which tags describe a resource and how often does each individual tag occur?) and on the platform level (which tags does the community use for indexing and how often does each individual tag occur?). It must be kept in mind, though, that resource-specific tag distributions can only occur in Broad Folksonomies; in Narrow Folksonomies, a resource-specific tag distribution must be generated via the search tags (see Figure 4.31).
                Narrow Folksonomy          Broad Folksonomy           KOS
Indexing Tags   no Power Tags              Power Tags ascertainable   no Power Tags
Search Tags     Power Tags ascertainable   Power Tags ascertainable   Power Tags ascertainable
Figure 4.31: How and where can Power Tags be ascertained? Source: Peters & Stock (2009, Table 1).
Tag distributions can further be used to determine ‘Power Tags.’ This approach will be conceptually developed at this point; no control or evaluation of specific applications has been implemented so far. Power Tags are tags that best describe the resource’s content, or the platform’s focal point of interest, according to Collective Intelligence (Vander Wal, 2005), since they reflect the implicit consensus of the user community (Van Damme, Hepp, & Coenen, 2008, 120; Shirky, 2003, 79f.): “We argue that those tags, which follow a power law […] are high quality tags (i.e. tags describing ressources with accuracy) for most of the users […]” (Lux, Granitzer, & Kern, 2007). Maier and Thalmann (2007) also recognize the distinctiveness of these tags:
While tagging, a mass of collectively recognized descriptions, carried by a large part of the community, is generated, and also a mass of specific descriptions carried by relatively few members of the community. The first mass seems significant for organizations, since this represents the collective consensus concerning the description of a resource and thus seems better suited for annotation and broad re-utilization. Thus the tags with the highest consensus can be regarded as a resource’s ‘semantic breadth’ (Maier & Thalmann, 2007, 79)*.
Footnote 161: IP addresses serve to identify computers in a network.
These semantically broad tags can best be used to create a controlled vocabulary or to extract descriptors, as Quintarelli (2005) determines: “Therefore, a broad folksonomy provides a tool to investigate trends in large groups of people describing a corpus of items and can be used to select preferred terms or extract a controlled vocabulary” (Quintarelli, 2005). Lux, Granitzer and Kern (2007) have found out in their study on the social bookmarking system del.icio.us that there is a higher consensus in user groups and in the high-frequency tags of the tag distribution than in the Long-Tail tags. This means that in the Long Tail one generally finds such tags that have only been indexed by small user groups and reflect the consensus within that group only. To determine the Power Tags, one must look to the different tag distributions. So far, two typical forms of tag distributions have been observed: a) the informetric distribution, or Power Law, and b) the inverse-logistic distribution. Both forms of distribution share the trait of displaying a Long Tail with seldom used tags that describe the resource/platform with more semantic detail and value but are only indexed by very few users: “the distribution of tags that are used for semantically annotating a web resource always yields a long tail shape” (Al-Khalifa & Davis, 2007). The scientific discussion on tag distributions erroneously assumes that each distribution that shows a Long Tail is also a Power Law. As Stock (2006) was able to demonstrate, this is not the case. Inverse-logistic distributions also have a Long Tail. The difference between both distributions is apparent at the beginning of the curve. 
The inverse-logistic distribution (see Figure 4.32) has, as opposed to the Power Law (see Figure 4.33), a ‘Long Trunk’ of similarly frequent tags at the beginning of the curve, which means that the users either cannot agree on a meaning for the resource or platform or that the resource/platform unites several aspects of meaning in itself and must be described with multiple tags (Kipp & Campbell, 2006). The example in Figure 4.32 shows that the resource ‘http://www.litmusapp.com/alkaline’ was indexed about equally with the tags ‘testing,’ ‘browser,’ ‘mac,’ ‘software,’ ‘osx’ and ‘webdesign.’ A glance at the website tells us that it is a test tool for Mac users, which displays the user-created website design in 7 different Microsoft Windows browsers and thus facilitates a verification of the source code. It is also shown that the tags are correct, but that several were necessary to describe the website’s content accurately. The second example in Figure 4.33 shows a Power Law for the resource ‘www.go2web20.net,’ and in this case the first three tags ‘web20,’ ‘directory’ and ‘tools’ seem to suffice for an adequate description. It is interesting that both tag distributions stabilize in time, and as enough indexing users converge, which means that the top tags at the beginning of the curves stay constant, while of course changing their absolute indexing frequency (Maier & Thalmann, 2007; Maier & Thalmann, 2008). Thus these tags almost form a controlled vocabulary for the community. Cattuto (2006) and Tonkin et al. (2008) arrive at the same conclusion in their studies:
The fact that tag fractions stabilize quickly allows the emergence of a clearly defined categorization of the resource in terms of tags, with a few top-ranked tags defining a semantic “fingerprint” of the resource they refer to (Cattuto, 2006, 35).
This suggests that consensus does exist in the aggregate. The terms that are most common tended to provide a reasonable description of the content of the site and remained constant over time. Data collection in November 2007, almost two years after the initial collection, showed that the seven most frequent terms remained constant over time (Tonkin et al., 2008).
Figure 4.32: An Inverse-Logistic Tag Distribution for the URL http://www.litmusapp.com/alkaline. Source: del.icio.us.
Figure 4.33: An Informetric Tag Distribution for the URL www.go2web20.net. Source: del.icio.us.
In their study on students’ tagging behavior (see footnote 162), Maier and Thalmann (2007; 2008) not only find that the relative tag distribution becomes stable over time, but also that the Power Tags differ significantly from the Long-Tail tags in the different distributions. To illustrate this obvious rupture between ‘commonly accepted tags,’ i.e. Power Tags, and ‘specific’ Long-Tail tags, Figure 4.34 reproduces the summary table from Maier and Thalmann (2007, 80, Table 2).
Footnote 162: A total of 174 students tagged 10 different resources, over 4 stages (Maier & Thalmann, 2007).
Figure 4.34: Tag Frequency Distributions. Source: Maier & Thalmann (2007, 80, Table 2).
The authors here propose that a 5% hurdle – and not the tag distribution curve’s turning point – be instituted as a criterion for separation, in order to distinguish Power and Long-Tail tags from each other: “If a keyword were to receive at least 5% of all keyword allocations for a resource, it would be considered a collectively recognized tag” (Maier & Thalmann, 2007, 81)*. The turning point method promises more precise results, or more precise markers for delineating both tag types, though, since it incorporates the entire course of the curve into its calculation and does not just adopt a statistical mean. These characteristics of tag distributions (top tags describe the resource adequately and generally, tag structure is stable and the distinction between Power and Long-Tail tags is clear) underline the exposed position of Power Tags. This is why in this book, they are applied as a retrieval tool and processed further for Tag Gardening. Depending on the underlying tag distribution, the tags that are positioned to the left of the curve’s turning point are selected as Power Tags. In Power-Law distributions, the number of Power Tags depends on the exponent and varies between 2 and 3 tags; in inverse-logistic distributions, no statement concerning the number of Power Tags can be made before the fact. Power Tags have an exclusive character, since they neglect the Long Tail with its more versatile tags and subscribe to the “tyranny of the majority” (Weinberger, 2005), thus adopting the most popular tags as a sufficiently adequate indexing vocabulary for the resource. The basic assumption of Collective Intelligence, “the mass will find a way,” is directly adopted. Diederich and Iofciu (2006) have already applied this idea to user recommender systems, but call Power Tags “the ‘stars’ in the power-law distribution” (Diederich & Iofciu, 2006): “As tags are typically power-law
distributed, removing the rarely-used tags can reduce the dimensionality of the user profiles by several orders of magnitude” (Diederich & Iofciu, 2006). Halpin, Robu and Shepherd (2007) have also noticed the distinctiveness of Power Tags and the purpose of the exclusive approach, but have not implemented it yet:
This is important in that a stable distribution is an essential aspect of what might be user consensus around the categorization of information driven by tagging behaviors. Furthermore, as shown by our empirical study of the tagging history of these items, this behavior depends on the number of users and to some extent on the temporal duration of the tagging process. [...] One might claim that the users have collectively discovered a collective categorization scheme. [...] By focusing on the tags which are most common in the tagging distribution, one should be able to understand the essence of the collective categorization scheme. One could then safely ignore the ‘long-tail’ of idiosyncratic and low frequency tags that are used by users to tweak their own results for personal benefit, or alternatively, treat the ‘long-tail’ as an object of examination for other reasons. As shown by our visualization graphs, insightful categorization and classification schemes can be gained by focusing on the high frequency ‘short head’ (as opposed to the long tail) of a stabilized tag distribution (Halpin, Robu, & Shepherd, 2007, 220).
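The simpler of the two separation criteria discussed above, the 5% hurdle proposed by Maier and Thalmann, can be sketched directly. The turning point method is not shown, since the text does not spell out its calculation. The tag names are borrowed from the example of Figure 4.33, but the counts are invented for illustration, as is the function name.

```python
def power_tags_by_share(tag_counts, threshold=0.05):
    """Return the tags that receive at least `threshold` (default 5%) of all
    tag assignments of one resource, ordered by descending frequency."""
    total = sum(tag_counts.values())
    if total == 0:
        return []
    ranked = sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, count in ranked if count / total >= threshold]


# Hypothetical indexing frequencies for a single resource.
counts = {"web20": 120, "directory": 60, "tools": 40,
          "ajax": 6, "list": 4, "cool": 2}
power_tags = power_tags_by_share(counts)
```

With these invented counts the three most frequent tags clear the hurdle, while ‘ajax,’ ‘list’ and ‘cool’ each fall below 5% of the 232 assignments and would be treated as Long-Tail tags.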
Furnas et al. (1987; see also Li et al., 2007, 949) also confirm that the most frequently used terms are indeed representative and suitable as access vocabulary for resources: From the standpoint of first-try success for the untrained, the best possible access term would be the word real users most often apply to an object. This empirically based naming approach has been a popular proposal in human factors circles: terms offered by a representative sample of potential users would be collected, and the most frequent term identified (Furnas et al., 1987).
As retrieval tools, the Power Tags serve mainly to increase the Precision of search results and to restrict Recall: “Thus, for retrieval tags can be seen as additional source of information, extending description and title as well as adding more precise information” (Lux, Granitzer, & Kern, 2007). Recall measures the completeness of the hit list, while Precision measures its freedom from ballast. Retrieval research has established an inversely proportional relationship between Recall and Precision, which Power Tags are meant to exploit: an increase in Precision is effected by a decrease in Recall (and vice versa). An ideal search option (see Figure 4.35) enables users to restrict their search to Power Tags. This decreases Recall, since only those resources are yielded whose Power Tags match the query tag, and increases Precision, due to the inversely proportional relationship between the two. Restricting the Recall gets the user more relevant results; the restriction to Power Tags mainly serves to avoid ballast: The problem is that while the disparate user vocabularies and terms enable some very interesting browsing and finding, the sheer multiplicity of terms and vocabularies may overwhelm the content with noisy metadata that is not useful or relevant to a user (Mathes, 2004).
We can see that compared with the previous attempt, in which we used the whole tag [set of a resource, A/N] as a search input, using the most popular keywords can achieve better results (Han et al., 2006).
Figure 4.35: The Retrieval Function ‘Power Tags Only.’ Source: Peters & Stock (2009, Fig. 6).
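The trade-off between the two measures can be made concrete with a small calculation. The following Python sketch uses invented hit lists and an invented set of relevant resources (all IDs are hypothetical) to show how a restriction to Power Tags might lower Recall while raising Precision. It illustrates the definitions only and is not an evaluation result.

```python
def precision(hits, relevant):
    """Share of the retrieved resources that are relevant (freedom from ballast)."""
    return len(hits & relevant) / len(hits) if hits else 0.0


def recall(hits, relevant):
    """Share of the relevant resources that were retrieved (completeness)."""
    return len(hits & relevant) / len(relevant) if relevant else 0.0


# Hypothetical data: resource IDs judged relevant for one query.
relevant = {1, 2, 3, 4}

# Hypothetical hit lists for the two search options.
hits_all_tags = {1, 2, 3, 5, 6, 7, 8, 9}
hits_power_tags_only = {1, 2, 5}

for label, hits in [("all tags", hits_all_tags),
                    ("Power Tags only", hits_power_tags_only)]:
    print(label, round(precision(hits, relevant), 3), round(recall(hits, relevant), 3))
```

In this invented case the restricted search loses one relevant resource but drops most of the ballast, which is the effect the ‘Power Tags only’ option is meant to achieve.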
The more important characteristic value in connection with Power Tags is Precision. Recall losses can be absorbed, as the user can bring the Recall back to its ‘normal value’ by switching off the ‘Power Tags only’ option. At the same time, the system’s search effort can be decreased via the restriction to Power Tags, as fewer tags have to be searched and aligned. This is what Cormode and Krishnamurty (2008) point out: “skewed distributions with long tails of object popularity means fewer cache hits, rendering caching to be not worthwhile”. Power Tags can be implemented as a retrieval tool as described in Figure 4.36. Figure 4.36 refers to a tagging system with a Narrow Folksonomy and thus shows the processing of Power Tags generated from resource-specific distributions of query tags. The processing always begins with the resource-specific tag distribution (either of indexing tags or of query tags). Then it must be determined whether enough users have searched for or indexed the resource in order to exploit the Power Tags’ stability. The threshold value for the attribute ‘sufficient’ must first be determined empirically. If there are too few users, the processing is stopped; if there are enough, it proceeds. Before calculating the TF*IDF values, it makes sense to normalize the tags, i.e. to unify them via NLP methods. Spelling variants, abbreviations or multilingual synonyms, which could otherwise distort the tag distributions through their multiple occurrences, are thus merged and represented by a single normalized tag. The TF value is determined via tag popularity (indexing frequency or, respectively, how often the resource was found with this tag), the IDF value via the frequency of occurrence within the entire database, i.e. the platform folksonomy. TF*IDF values are calculated for all tags of the distribution in order to relativize the absolute Term Frequency. Sen et al. 
(2006) justify this procedure for folksonomies: “A tag that is applied to very large proportion of items may be too general to be useful, while a tag that is applied very few times may be useless due to its obscurity” (Sen et al., 2006b, 190). In the next step, the calculated TF*IDF values are normalized to the interval [0,1] in order to make the tag distributions comparable.
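The weighting and normalization steps can be sketched as follows. The text does not fix a particular TF*IDF formula, so this Python sketch assumes the common variant TF · ln(N / n), where N is the number of resources on the platform and n the number of resources carrying the tag, and it assumes that normalization means dividing by the highest value. The function name and the sample figures are hypothetical.

```python
import math


def normalized_tfidf(tag_counts, resources_with_tag, total_resources):
    """Weight a resource's tags by TF*IDF and scale the result to [0, 1].

    tag_counts:          tag -> how often the tag was assigned to this resource (TF)
    resources_with_tag:  tag -> number of resources on the platform carrying the tag
    total_resources:     number of resources on the platform
    Assumption: IDF = ln(N / n); scaling = division by the highest TF*IDF value.
    """
    raw = {
        tag: tf * math.log(total_resources / resources_with_tag[tag])
        for tag, tf in tag_counts.items()
    }
    top = max(raw.values())
    if top == 0:
        return {tag: 0.0 for tag in raw}
    return {tag: value / top for tag, value in raw.items()}


# Hypothetical resource: 'cool' is popular here but occurs on every resource
# of the platform, so its weight collapses despite its high Term Frequency.
scores = normalized_tfidf(
    tag_counts={"web2.0": 40, "blog": 30, "cool": 35},
    resources_with_tag={"web2.0": 100, "blog": 100, "cool": 1000},
    total_resources=1000,
)
```

Under these assumptions the most heavily weighted tag always receives the value 1, and a tag that is applied to every resource on the platform receives 0.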
Figure 4.36: Exemplary Processing of Power Tags in a Retrieval System. Source: Peters & Stock (2009, Fig. 4).
The highest value a TF*IDF value can reach here is 1. Then the tags are sorted by their newly calculated TF*IDF values. Based on this resource-specific tag distribution, the distribution form is determined next. If the second tag of the distribution has a value of roughly less than half the first tag’s value, and the third tag has a value of less than a third of the first tag’s value, it can fairly safely be assumed that this is a Power Law distribution. If this is not the case, however, and there is a turning point in the curve, then it could be an inverse-logistic distribution. To determine the Power Tags, one must first define a threshold value that marks the boundary between Power Tags and Long-Tail tags. In a Power Law, this value depends on the exponent; in an inverse-logistic distribution, the threshold value is equal to the curve’s turning point. All tags whose normalized TF*IDF value is equal to or higher than the threshold value are extracted as Power Tags. If there is neither a Power Law nor an inverse-logistic tag distribution, no Power Tags can be determined and the processing of this resource must be aborted. As a last step, all Power Tags are saved in a second inverted file in order to guarantee a quicker search (see Figure 4.37). A search via the different retrieval options (sketched in Figure 4.37) would achieve the following results: if a user
searches via the option ‘all tags,’ the system will work with the normal inverted file; if a user searches with ‘Power Tags only,’ the system will work with the second inverted file and compare the search tags with the tags saved in it. Figure 4.37 shows an example both of determining Power Tags and of determining the search results. For the model resource (ID = 11), a tag distribution (keyword 1 through keyword n) was created from normalized TF*IDF values. Since the second tag’s value is only half as big as the first one’s, what we have is a Power Law distribution with an exponent of a ≈ 1. Thus the system extracts the first three tags as Power Tags and saves them in the second inverted file. If a user searches with the first (or second, or third) tag and marks the search option ‘Power Tags only,’ he will receive the resource with ID = 11 as a search result. If this user deactivates the ‘Power Tags only’ option, however, and searches with ‘all tags,’ he will receive the resources with ID = 11 and ID = 21. This example clearly shows that the Recall is restricted by the retrieval option ‘Power Tags only.’ After the search, one should perform a relevance ranking of the resources in order to make it easier for the user to assess them. The ranking can occur via the methods described above, but the option of an individual sorting function should not be excluded.
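The decision steps for the distribution form and the threshold can be sketched in Python. Several details are assumptions made only for illustration, since the text leaves them open: the tolerance applied to ‘roughly half’ and ‘a third,’ the approximation of the turning point by the steepest drop between neighbouring tags, the minimum size of that drop, and the fixed Power Law cut-off of three tags (taken from the example with a ≈ 1). All function and parameter names are hypothetical.

```python
def classify_distribution(values, tolerance=0.1, min_drop=0.2):
    """Guess the form of a tag distribution.

    values: normalized TF*IDF values, sorted in descending order.
    Returns 'power_law', 'inverse_logistic' or None.
    Assumptions: `tolerance` models the word 'roughly'; a drop of at least
    `min_drop` between two neighbouring tags counts as a turning point.
    """
    if len(values) < 3:
        return None
    first, second, third = values[0], values[1], values[2]
    if second <= first * (0.5 + tolerance) and third <= first * (1 / 3 + tolerance):
        return "power_law"
    drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    if max(drops) >= min_drop:
        return "inverse_logistic"
    return None


def extract_power_tags(scored_tags, power_law_cutoff=3):
    """Return the Power Tags of one resource, or [] if processing is aborted."""
    ranked = sorted(scored_tags.items(), key=lambda item: item[1], reverse=True)
    values = [value for _, value in ranked]
    form = classify_distribution(values)
    if form is None:
        return []  # neither distribution form: no Power Tags for this resource
    if form == "power_law":
        # Assumption: the cut-off rank stands in for the exponent-dependent threshold.
        threshold = values[min(power_law_cutoff, len(values)) - 1]
    else:
        # Assumption: the turning point is the last tag before the steepest drop.
        drops = [values[i] - values[i + 1] for i in range(len(values) - 1)]
        threshold = values[drops.index(max(drops))]
    return [tag for tag, value in ranked if value >= threshold]


power_law_example = {"k1": 1.0, "k2": 0.5, "k3": 0.33, "k4": 0.25, "k5": 0.2}
inverse_logistic_example = {"a": 1.0, "b": 0.95, "c": 0.9, "d": 0.3, "e": 0.25}
flat_example = {"a": 1.0, "b": 0.95, "c": 0.9, "d": 0.85}
```

A production system would more plausibly fit the curve parameters to the whole distribution; the heuristic above only mirrors the rule of thumb given in the text.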
Figure 4.37: Exemplary Processing of Power Tags in a Retrieval System – Inverted File of the Power Tags. Source: Peters & Stock (2009, Fig. 5).
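The interplay of the two inverted files can be sketched in a few lines of Python. The resource IDs 11 and 21 follow the example in the text; the concrete tag lists, the function names and the choice of a dictionary of sets as the index structure are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_file(resource_tags):
    """Map every tag to the set of resource IDs that carry it."""
    index = defaultdict(set)
    for resource_id, tags in resource_tags.items():
        for tag in tags:
            index[tag].add(resource_id)
    return index


# Hypothetical data: all normalized tags per resource ...
all_tags = {
    11: ["keyword1", "keyword2", "keyword3", "keyword4"],
    21: ["keyword7", "keyword8", "keyword1"],
}
# ... and only those tags that were extracted as Power Tags.
power_tags = {
    11: ["keyword1", "keyword2", "keyword3"],
    21: ["keyword7", "keyword8"],
}

normal_file = build_inverted_file(all_tags)
power_tag_file = build_inverted_file(power_tags)  # the second inverted file


def search(query_tag, power_tags_only=False):
    """Look the query tag up in the inverted file chosen by the search option."""
    index = power_tag_file if power_tags_only else normal_file
    return sorted(index.get(query_tag, set()))
```

With this data, `search("keyword1", power_tags_only=True)` yields only resource 11, while `search("keyword1")` yields resources 11 and 21, because ‘keyword1’ lies in the Long Tail of resource 21.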
Power Tags should only be offered as an optional retrieval functionality, so as not to restrict the search results prematurely. Even though this exclusive approach increases Precision by restricting Recall, the tag distribution’s Long Tail may contain valuable tags that should not be excluded from the search per se, since this would restrict access to the resources. Mayr (2006) discusses another approach, which can also be applied to Power Tags, namely the use of ‘Bradford’s Law of Scattering’ (Bradford, 1934) for information retrieval. Bradford’s Law stems from bibliometrics and deals with the distribution of articles (or subjects in articles) within professional journals (Stock, 2000, 133ff.). The journals’ articles are counted and the journals are ranked.
Then the journals are grouped into three sets (core zone, middle zone, peripheral zone), each of which contains the same number of articles. Bradford’s Law states that the numbers of journals in these groups stand in a relation of 1 : n : n². This means that the most valuable journals are situated in the core, since here a large number of articles is concentrated in very few journals, while the less valuable journals are at the edge, since the same number of articles is distributed over a far larger number of journals.
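The zoning can be illustrated with a short Python sketch. The article counts are invented so that the relation 1 : n : n² comes out exactly with n = 3; real journal data will only approximate it. The function name is hypothetical.

```python
def bradford_zones(article_counts):
    """Split journals, ranked by article count, into three zones that each
    hold about one third of all articles (core, middle, periphery)."""
    ranked = sorted(article_counts, reverse=True)
    target = sum(ranked) / 3
    zones = [[], [], []]
    zone, cumulative = 0, 0
    for count in ranked:
        zones[zone].append(count)
        cumulative += count
        # Move on once the current zone has collected its share of articles.
        if zone < 2 and cumulative >= target * (zone + 1):
            zone += 1
    return zones


# Invented example: 1 journal with 90 articles, 3 with 30 each, 9 with 10 each.
counts = [90] + [30] * 3 + [10] * 9
zone_sizes = [len(zone) for zone in bradford_zones(counts)]
print(zone_sizes)
```

Each zone holds 90 articles here, but the core consists of a single journal while the periphery needs nine.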
[Figure 4.38 labels: Power Tags (Resource Quantity 1); Power Tags I & II (Resource Quantity 2); All Tags (Total Resource Stock)]
Figure 4.38: Core areas of the Resource Stock, Represented via Different Power Tags.
Figure 4.39: The Slide Control Regulates the Number of Tags to be Aligned and thus Changes both Recall and Precision of the Search Results. This Facilitates the Manual Switching from Search to Browsing.
Parallel to the basic idea behind Power Tags, Mayr (2006) also assumes that a restricted search quantity increases the search results’ Precision: “For the purpose of research support, we assume that the distribution models of the BLS [Bradford’s Law of Scattering] (core zone with high article frequency) and have positive effects
for the user in information retrieval” (Mayr, 2006, 163)*. If the user only searches in core journals, he will get few but relevant hits; if he broadens his search to include the edge, the Recall increases but finding the most relevant articles gets more difficult. Mayr (2006) suggests initially restricting the user’s search to the core area, while offering him the edges for browsing. For Power Tags, this would mean that they represent the core area of the search (only the Power Tags are aligned with the search tag), thus increasing the search results’ Precision, i.e. restricting the Recall from the very beginning (see Figure 4.38). The Recall can then be increased again via a step-by-step expansion of the resource quantity to be searched (see Figure 4.39). The expansion can be performed either by incorporating all tags of the resource (‘all tags’) or by grouping together further Power Tags: ‘Power Tags only’ merely searches the core zone, ‘Power Tags only + Power Tags II’ searches the core zone and an adjacent area, etc. To generate Power Tags II through n, one must only define further threshold values. The user is offered the possibility of manually changing the Recall via a slide control, so that he can search through the resource collection with a high degree of specificity (via ‘Power Tags only’) or with a browsing functionality (via ‘all tags’).
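How a slide control could map to a series of threshold values is sketched below in Python. The mapping of slider positions to levels, the threshold values and all names are assumptions made for illustration; the text only states that further thresholds must be defined.

```python
def tags_for_slider_position(scored_tags, thresholds, position):
    """Return the tags of a resource that are aligned with the query tag.

    scored_tags: tag -> normalized TF*IDF value
    thresholds:  descending cut-off values, one per Power Tag level
    position:    0 = 'Power Tags only', 1 = plus Power Tags II, and so on;
                 any position beyond the last threshold means 'all tags'
    """
    if position >= len(thresholds):
        return set(scored_tags)
    cutoff = thresholds[position]
    return {tag for tag, score in scored_tags.items() if score >= cutoff}


# Hypothetical tag distribution and two hypothetical threshold values.
scored = {"web2.0": 1.0, "blog": 0.5, "web": 0.33, "ajax": 0.1, "misc": 0.02}
levels = [0.5, 0.3]

core = tags_for_slider_position(scored, levels, 0)
core_plus_adjacent = tags_for_slider_position(scored, levels, 1)
everything = tags_for_slider_position(scored, levels, 2)
```

Each step to the right enlarges the set of tags that can match a query and thus, under the assumptions above, trades Precision for Recall.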
Tag Gardening in Information Retrieval
The methods of Tag Gardening serve to semi-automatically or manually disambiguate and structure folksonomies in retrospect and are meant to make indexing and retrieving information resources easier for the user. The collaborative information service LibraryThing already offers its users the option of creating ‘TagMashes’ and using them for research (Spalding, 2007; Smith, 2008). The user here compiles several tags in a search request via the Boolean AND or NOT and saves these in the system. Thus the user can always reuse that request. In Tag Gardening, however, the objective is mainly to achieve a match in user vocabulary (indexers and searchers) and semantic clarity: The tag based exploration of resources in CTS [collaborative tagging systems, A/N] requires a social consensus on the meaning and cognitive boundaries of tags. To locate a particular resource in CTS, the seeker and the contributor must share the same view about the most suitable tags for the resource and the meaning of tags. […] Tag based exploration in CTS suffers from two types of inconsistencies: inconsistency between communities and inconsistency of word usage (Choy & Lui, 2006).
As early as 2006, Hotho et al. predicted that as folksonomies, or tagging systems, grow in size, more elaborate search methods and links with other methods of knowledge representation will become necessary: When folksonomy-based systems grow larger, user support has to go beyond enhanced retrieval facilities. Therefore, the internal structure has to become better organised. An obvious approach for this are semantic web technologies. The key question remains though how to exploit its benefits without bothering untrained users with its rigidity. We believe that this will become a fruitful research area for the Semantic Web community for the next years (Hotho et al., 2006c, 425).
Golub et al. (2008) call to mind that tagging systems are first and foremost tools for personal resource management: However, social tagging is less concerned with consistency than with making it easier for end-users to describe information items and to have access to other users’ descriptions. Existing social tagging applications have not been designed with information discovery and retrieval in mind (Golub et al., 2008, 16).
But Chen (1994) and Gordon-Murnane (2006) observe that the use of semantics is indispensable for an effective information retrieval: Without the assistance of a system-generated concept space, searchers of a large scientific database would have to perform a trial-and-error process of generating various search terms themselves, using their mental model of the subject domain (a painstaking and cognitively demanding process). In contrast, a concept space can serve as an on-line search aid and can be invoked by searchers for query refinement and concept exploration (Chen, 1994, 64). If one could combine metadata or tags with the power of search, then you could have very powerful and rewarding search tools that really would enable better searching and location of new and old resources. This is the ultimate goal of the semantic web (Gordon-Murnane, 2006, 30).
The implementation of enhanced methods of knowledge representation, such as synonym summarization, homonym disambiguation or tag editing via NLP (Ruge & Goeser, 1998), and of information retrieval, such as similarity measurements or relevance ranking, leads to improved search options that guide the user to the relevant information resources more quickly and easily. These retrieval options can reach far beyond simple tag searches or browsing and thus completely exploit the inherent traits of folksonomies for their own purposes, as Angeletou et al. (2007) emphasize: “Unfortunately, the simplistic tag-based search used by folksonomies is agnostic to the way tags relate to each other although they annotate the same or similar resources” (Angeletou et al., 2007, 30). Tag Gardening can be performed manually or semi-automatically, where the latter variant is to be preferred in view of the number of resources, tags and users to be processed. The application of Tag Gardening always concerns two areas: 1) knowledge representation (see chapter two) and 2) information retrieval, where the processing of both areas can build on each other or be performed separately. For the semi-automatic variant, it makes sense first to extract those tags from the folksonomy that can be considered for later manual processing. Two separate methods can be used to extract these tag candidates: a) Power Tags (Peters & Weller, 2008a) and b) similarity algorithms in combination with cluster-forming procedures. Both approaches are largely uncharted territory in research, having been neither implemented nor evaluated so far, but shall be presented for scientific discussion at this point. The first method starts off with Power Tags. 
Since these tags are gleaned from the Long Trunk and neglect all Long-Tail tags in the further stages of processing, this approach is termed ‘exclusive.’ The first step of this exclusive approach consists of determining the Power Tags for the resource level, according to the procedure described above. Since Tag Gardening refers to the entire folksonomy of the
information platform, Power Tags are accordingly determined for each resource. Since these Power Tags are processed even further, they will be called Power Tags I, to better keep track of them. The next step serves to observe the environment of Power Tags I, i.e. co-occurrence analyses are performed for these tags. This creates new tag distributions that reflect the semantic environment of Power Tags I. For these distributions, too, Power Tags are determined; the extracted tags will be called Power Tags II. Since the ‘strongest’ tags have now been determined for both the platform and the resource level, Power Tags II are the tag candidates for manual Tag Gardening. They and their connection to Power Tags I are meant to be exploited for ‘emergent semantics,’ since their relation promises the best insights (Halpin & Shepard, 2006; Li et al., 2007). Here synonyms or related terms can be discovered, or important paradigmatic relations extracted. Halpin and Shepard (2006), in a thesis paper on this subject, assume that ‘superclass relationships’ (abstraction relation), ‘synonym relationships’ (equivalence relation) and ‘facet relationships’ (bi- or tripartite relations following the pattern of ‘book’ → ‘author’ → ‘Zadie Smith’) can be found in folksonomies. An example shall illustrate the procedure of determining relations (see Figure 4.40).
Figure 4.40: Inverse-Logistic Tag Distribution for the Determination of Power Tags on the Resource Level. Source: Peters & Weller (2008, Fig. 4).
The tag distribution for the URL www.readwriteweb.com shows an inverse-logistic form where C’ = roughly 0.16 and b = 3. That is why the tags ‘Web2.0,’ ‘filtering,’ ‘socialmedia,’ ‘aggregation,’ ‘socialnetworking,’ ‘Social’ and ‘Technology’ are determined as Power Tags I. The platform folksonomy’s co-occurring tags are now determined for these tags. This results in the tag distribution sketched in Figure 4.41 for the tag ‘Web2.0,’ for instance. Since this tag distribution forms a Power Law where a = 1, only the tags ‘blog’ and ‘web’ are marked as Power Tags II. Together with ‘Web2.0,’ these two tags now form the basis for the manual Tag Gardening activities. Here we can not only observe that the tag ‘Web2.0’ is strongly connected to the tags ‘blog’ and ‘web,’ but also determine the form of relation in question. ‘Web’ is a hyperonym of ‘Web2.0’ while ‘blog’ is a part of ‘Web2.0’ and thus a meronym. These specifications could then be used in information retrieval for the enhancement of search requests, for example, or in knowledge representation to create and maintain controlled vocabularies and KOS.
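The co-occurrence step that leads from Power Tags I to the candidates for Power Tags II can be sketched in Python as follows. The platform data is invented; only the tag names ‘Web2.0,’ ‘blog’ and ‘web’ come from the example. Taking the two most frequent co-occurring tags is a simplification assumed here in place of a proper threshold calculation.

```python
from collections import Counter


def cooccurring_tags(resource_tags, tag):
    """Count how often every other tag shares a resource with `tag`."""
    counts = Counter()
    for tags in resource_tags.values():
        if tag in tags:
            counts.update(other for other in tags if other != tag)
    return counts


# Hypothetical platform folksonomy: resource -> set of normalized tags.
platform = {
    "r1": {"Web2.0", "blog", "web"},
    "r2": {"Web2.0", "blog", "web"},
    "r3": {"Web2.0", "blog", "ajax"},
    "r4": {"Web2.0", "blog", "design"},
    "r5": {"python", "blog"},
}

distribution = cooccurring_tags(platform, "Web2.0")
# Assumption: a fixed cut-off of two tags stands in for the Power Law threshold.
power_tags_2 = [tag for tag, _ in distribution.most_common(2)]
```

The resulting pairs (‘Web2.0,’ ‘blog’) and (‘Web2.0,’ ‘web’) are only candidates; which relation holds between them still has to be decided intellectually.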
Figure 4.41: Power Law for Co-occurring Tags for the Tag ‘Web2.0’ (Platform Level). Source: Peters & Weller (2008, Fig. 5).
The second method of extracting Tag Gardening candidates recalls Knautz’s (2008) recommendation to form tag clusters (for the creation of tag clusters, see also Hayes & Avesani, 2007; Hayes, Avesani, & Veeramachaneni, 2006; Brooks & Montanez, 2006a and 2006b). Since this procedure factors in all tags of the platform and does not restrict itself to Power Tags for the extraction of tag candidates, this approach is termed ‘inclusive.’ Initially, elaborate methods of similarity calculation (e.g. Dice or Cosine) are used to determine the similarity values of the single tag pairs for all tags in the entire database. On the basis of these values, tag clusters are then formed via the Single-Link, Complete-Link or Group-Link procedures. It may be necessary here to establish a threshold value that excludes some tags from the cluster. This value can be used to regulate the cluster’s size. After this process, the platform folksonomy can again be visualized as a tag cloud, this time incorporating the calculated similarity values. Knautz (2008) displays the tag pairs’ similarity values via their connecting lines: the stronger the line, the more similar the tags are to each other. Here we can, as above with Power Tags II, begin the manual Tag Gardening. Tag pairs with great similarity seem to have a strong relation towards each other, which is not necessarily an equivalence relation. So it is worth having a closer look, making the implicit connection explicit and exploiting it for further Tag Gardening or search request extensions etc. Furthermore, it is possible to combine the exclusive and inclusive approaches, in order for example to use the folksonomy to generate term candidates for a thesaurus. The manual construction of thesauri, which are natural-language KOS, always begins with a collection of terms, e.g. 
from scientific literature (López-Huertas, 1997, 144), and a terminological control. This means that the collected terms are distributed into a rough classification and then alphabetically sorted. Then follow the terminological and conceptual controls, which distribute the terms into equivalence classes and set these into relations to one another. This is a very time-consuming process. Here the combination of the exclusive and inclusive approaches can help by rendering visible both term candidates and relation candidates and thus supporting the construction of the thesaurus. Power Tags I and II can be extracted via the
principle explained above and set into relations via cluster-forming procedures. All tag pairs beyond a certain threshold value are viewed as descriptors of the thesaurus and thus form its structure, or the rough classification. Since Power Tags catch the eye in folksonomies due to their popularity, often representing general rather than specific concepts (Candan, Di Caro, & Sapino, 2008), they are well-suited for the construction of the thesaurus structure and form the context for the non-descriptors. The non-descriptors, the descriptors’ quasi-synonyms, are gleaned from the platform folksonomy. Here the inclusive approach comes into play, since all tags of the folksonomy and their similarities to each other are now observed. For each tag pair of the platform folksonomy similarity values are calculated and then, again via cluster-forming procedures, set into relation with one another (Jackson, 1970); it might be necessary to consult Cattuto et al. (2008a; 2008b) and van Damme, Hepp and Siorpaes (2008) (see below for both). Thus an initial, statistically calculated descriptor-non-descriptor construct is created (Rees-Potter, 1989; Schneider & Borlund, 2005), which can be used for further, possibly manual procedures and yield first hints of a KOS (Stock & Stock, 2008, 248ff.; 372ff.). At the same time, this statistical KOS can be used to recommend tags during indexing and research. 
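The inclusive approach can be sketched in Python with the Dice coefficient and a Single-Link grouping. Cutting a Single-Link clustering at a similarity threshold is equivalent to joining all tags that are connected by a chain of sufficiently similar pairs, which the sketch implements with a simple union-find structure. The tag data, the threshold of 0.6 and all names are assumptions made for illustration.

```python
from itertools import combinations


def dice(resources_a, resources_b):
    """Dice coefficient of two tags, based on the resources they were assigned to."""
    if not resources_a and not resources_b:
        return 0.0
    return 2 * len(resources_a & resources_b) / (len(resources_a) + len(resources_b))


def single_link_clusters(tag_resources, threshold):
    """Group tags so that every cluster is held together by at least one chain
    of tag pairs whose similarity reaches the threshold (Single-Link)."""
    parent = {tag: tag for tag in tag_resources}

    def find(tag):
        while parent[tag] != tag:
            parent[tag] = parent[parent[tag]]  # path compression
            tag = parent[tag]
        return tag

    for tag_a, tag_b in combinations(tag_resources, 2):
        if dice(tag_resources[tag_a], tag_resources[tag_b]) >= threshold:
            parent[find(tag_a)] = find(tag_b)

    clusters = {}
    for tag in tag_resources:
        clusters.setdefault(find(tag), set()).add(tag)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical platform folksonomy: tag -> IDs of the resources carrying it.
tag_resources = {
    "web2.0": {1, 2, 3},
    "web20": {1, 2},
    "blog": {3, 4},
    "weblog": {3, 4, 5},
    "jaguar": {9},
}
clusters = single_link_clusters(tag_resources, threshold=0.6)
```

With this data the spelling variants end up in common clusters, which makes them plausible candidates for a descriptor and its non-descriptors; whether they really are quasi-synonyms remains a matter for manual inspection.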
Jörgensen (2007) distinguishes between three sorts of indexing, which can also refer to Tag Gardening: 1) ‘concept-based indexing’ or ‘concept-based Tag Gardening’ refers to the use of human intelligence during the indexing of concepts (in a hierarchical structure); 2) ‘content-based indexing’ or ‘content-based Tag Gardening’ comprises the use of various computer algorithms for the automatic allocation of concepts; 3) ‘consensus-based indexing’ or ‘consensus-based Tag Gardening’ describes the construction of a KOS based on the (statistical) user ‘consensus,’ as is the case for folksonomies. The extraction and processing of tag candidates via Power Tags unites all three variations in one: the ‘consensus-based’ factor is directly reflected in the Power Tags, since they are derived from the statistical consensus; the ‘concept-based’ factor is addressed by the manual processing and allocation of tags and relations, or concepts; and the ‘content-based’ factor is covered by the automatic co-occurrence analysis. In general, many observations from knowledge representation and information retrieval, such as cluster calculations, similarity algorithms, automatically generated tag hierarchies (Heymann & Garcia-Molina, 2006) or field-based tagging (Bar-Ilan et al., 2006; Strohmaier, 2008), can be applied in Tag Gardening and are also used for separating homonyms or recommending tags in the query expansion. Passant (2007; see also Birkenhake, 2008) links tags with classes, or instances, from ontologies, in order to make information retrieval easier for the users and to navigate the problems of homonymy and synonymy while recommending similar resources or search terms. The separation of homonyms could be executed via the identification of groups or sub-communities and their respective tagging behavior (Van Damme, Hepp, & Siorpaes, 2007) or by using co-occurring tags from the resource folksonomy to disambiguate (Laniado, Eynard, & Colombetti, 2007b). 
Both options are recommended for Flickr (Butterfield et al., 2006a; 2006b; Lerman, Plangprasopchok, & Wong, 2008). On Flickr, users can come together in groups and exchange information about their area of interest, or connect to other users via the contact function. In the case of registered users, the system knows what groups the user belongs to or what contacts he has and can thus recommend tags used in his
group for the query expansion or adjust the search results to the user’s interests by aligning them with the tags from the group, or from the user’s contacts. Another possibility is to restrict the mass of photos to be searched to the photos of the user’s contacts and of their contacts, thus reducing the search results’ Recall and enhancing their Precision, due to the similar users, photos and tags (Lerman, Plangprasopchok, & Wong, 2008). If the user belongs to the group ‘autofans,’ for example, and if this group often uses tags such as ‘car,’ ‘BMW’ or ‘motor,’ it becomes very probable that in searching for the tag ‘Jaguar,’ the user is referring to the automobile, not the animal. The search tag’s co-occurring tags can act in the same way. If the system observes that a search tag belongs to different tag clusters, it can provide the user with this information and offer him different hit lists, so that he can decide for himself which meaning of the tag he meant (Butterfield et al., 2006a). Since Tag Gardening always requires an intellectual effort, it is important to make users realize how necessary the increased effort is (e.g. via more relevant results) while at the same time keeping the user’s work as effortless as possible. After all, users are accustomed to simple interfaces during research, as Kipp (2007a) observes: “The popularity of Google suggests that users prefer to be able to search for items in a more natural way using natural language vocabulary and a simple interface” (Kipp, 2007a, 72). Nevertheless, the intellectual processing of the tags is of great significance, not only for information retrieval, but also for knowledge representation. Since tags are always context-dependent, they require an interpretation before they can be applied adequately: “tags have to be interpreted in terms of their different contexts of use” (Paolillo & Penumarthy, 2007). 
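The ‘Jaguar’ example can be turned into a small Python sketch of group-based homonym disambiguation. The sense clusters, the overlap count used as a score and all names are assumptions made for illustration; they do not describe how Flickr actually works.

```python
def disambiguate(query_tag, sense_clusters, group_tags):
    """Pick the sense of a homonymous query tag whose tag cluster overlaps
    most with the tags used in the searcher's group.

    Returns None if no cluster overlaps, so that the system can instead
    offer the user one hit list per sense.
    """
    senses = sense_clusters.get(query_tag.lower(), {})
    group = {tag.lower() for tag in group_tags}
    scores = {sense: len(cluster & group) for sense, cluster in senses.items()}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical tag clusters in which the tag 'jaguar' occurs.
sense_clusters = {
    "jaguar": {
        "automobile": {"car", "bmw", "motor", "engine"},
        "animal": {"cat", "zoo", "wildlife"},
    }
}

autofans = ["car", "BMW", "motor"]   # tags frequently used by the group
hikers = ["mountain", "trail"]

sense_for_autofans = disambiguate("Jaguar", sense_clusters, autofans)
sense_for_hikers = disambiguate("Jaguar", sense_clusters, hikers)
```

For the group ‘autofans’ the automobile sense wins; for a group without any overlapping tags no decision is made and the choice is left to the user.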
The interpretation must be executed in particular for the areas of synonym summarization and homonym disambiguation (although here the alignment with dictionaries can also provide good services), the connection between pre-existing KOS and folksonomies and the analysis of the tags’ syntagmatic relations. The creation of new KOS on the basis of folksonomies is described as follows by Voß (2006): Collaborative thesaurus creation and tagging is a new method of information retrieval that combines thesauri and collaborative tagging. [...] instead of confrontating collaborative tagging and indexing by experts you should consider the conceptual properties of the different indexing systems (Voß, 2006).
Collective Intelligence should be provided access to the controlled vocabularies, but especially the allocation of new terms and the deletion of obsolete ones should be performed centrally. Generally, though, the combination of different KOS is to be recommended and can yield excellent results in information retrieval: Making it easy and painless to tag pages or other media and merging the collaborative tagging with more formal systems could really add power to search. Think of it this way: Combine the Library of Congress Subject Headings or the Dewey Decimal System with tagging and you could create both a hierarchical structure and flat taxonomy search engines could use to give you a really rich user experience. Now how cool would that be? (Gordon-Murnane, 2006, 38).
The insertion of pre-existing or, in folksonomies, newly developing paradigmatic relations should also be performed by a centralized instance. The fact that the use of paradigmatic relations facilitates improved searches and quicker retrieval of relevant
information is only slowly entering the consciousness of the system architects. The solution to the synonym/homonym problem has already been tackled through cluster displays (see for example Flickr Cluster), but the use of far more detailed relations (e.g. producer-product or antonymy) has frequently been neglected – even though Gordon-Murnane pointed out the possibilities of paradigmatic relations in information retrieval with folksonomies as early as 2006: The flat-system folksonomies lack parent-child relationships, categories, and subcategories. The lack of hierarchy can impact directly on searching and search results. Without hierarchy or synonym control, a search of a specific term will only yield results on that term and not provide the full body of related terms that might be relevant to the user’s information needs and goals (Gordon-Murnane, 2006, 30).
Sen et al. (2006) emphasize the significance of semantic relations for the visualization of folksonomies and hope to provide improved retrieval functionalities this way; at the same time, they regret the lack of research in this area: “deriving relationships and structure from the tags that are applied may provide additional guidance in how to display tags in ways that aid search and navigation” (Sen et al., 2006b, 190). Angeletou et al. (2007) have decided to rise to this challenge and analyze the syntagmatic relations in the folksonomy of a photo database, in order to develop improved search options. Here they align the folksonomies with preexisting ontologies and try to locate intersections. The authors know the following relations: ‘subsumption’ (hyponym), ‘disjointness’ (antonym), ‘generic’ (partitive relation), ‘sibling’ and ‘instance-of.’ They concentrate, however, merely on the hyponym and opposite relation, i.e. ‘subsumption’ and ‘disjointness,’ and arrive at the following result: “Finally, a broad range of semantic relations can exist between tags, including subsumption, disjointness, meronymy and many generic relations (e.g., location)” (Angeletou et al., 2007, 41). They find a wealth of possible connections, particularly for the partitive relation:
• partitive relations: location → ‘forest,’ ‘garden,’ seasons → ‘fall,’ ‘autumn,’ usage → ‘lunch,’ colors → ‘green,’ ‘blue,’ photo jargon → ‘canon ixus’ (Angeletou et al., 2007, 37, Table 1),
• partitive relations: attributes → ‘juicy,’ ‘jummy,’ cultivation → ‘tree,’ ‘plant’ (Angeletou et al., 2007, 39, Table 2),
• partitive relations: container → ‘mug,’ ‘can,’ event/place → ‘breakfast,’ ‘party,’ ingredient → ‘lemon,’ ‘cream’ (Angeletou et al., 2007, 39, Table 3),
• partitive relations: action → ‘eating’ (Angeletou et al., 2007, 41, Table 4).
Kipp (2006b) also conducted a study in which she went through author keywords, tags and descriptions in the social bookmarking system CiteULike and checked them for thesaurus relations. She found that “relationships in the world of folksonomies include relationships that would never appear in a thesaurus including the identity of the user (or users) who used the tag” (Kipp, 2006b). It thus appears worthwhile to describe the unspecific association relation in more detail, or to split up the partitive relation (Weller & Stock, 2008) further. If it were to transpire that certain relations are implicitly used by a majority of users during indexing, one might be able to prescribe fields on this basis for indexing and information retrieval, thus achieving more precise results. Query expansion or homonym disambiguation is also possible via these relations:
Information Retrieval with Folksonomies
We believe that content retrieval can be further improved by making the relations between tags explicit. […] The knowledge that Lions and Tigers are kinds of Mammals would expand the potential of folksonomies. Users could make generic queries such as ‘Return all mammals’ and obtain all the resources tagged with lion or tiger even if they are not explicitly tagged with mammal (Angeletou et al., 2007, 30f.).
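The generic query described in the quote can be illustrated with a minimal sketch. The tag hierarchy and the tagged resources below are invented for illustration and are not taken from Angeletou et al.; the function names are hypothetical as well. The sketch merely shows how an explicit broader-term relation lets a search for ‘mammal’ reach resources tagged only with ‘lion’ or ‘tiger.’

```python
from collections import defaultdict

# Hypothetical broader-term relations (child tag -> parent tag); not from any real KOS.
BROADER = {"lion": "mammal", "tiger": "mammal", "mammal": "animal"}

# Hypothetical tagged resources.
RESOURCES = {
    "photo1": {"lion", "zoo"},
    "photo2": {"tiger"},
    "photo3": {"mammal"},
    "photo4": {"eagle"},
}


def narrower_closure(tag, broader):
    """Return the tag plus all tags that are transitively narrower than it."""
    children = defaultdict(set)
    for child, parent in broader.items():
        children[parent].add(child)
    found, stack = {tag}, [tag]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def generic_search(tag, resources, broader):
    """Return the ids of resources carrying the tag or any narrower tag."""
    wanted = narrower_closure(tag, broader)
    return sorted(rid for rid, tags in resources.items() if tags & wanted)


print(generic_search("mammal", RESOURCES, BROADER))
```

Without the relation, the same query would match only the resource explicitly tagged ‘mammal.’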
It is important to note that these more precise retrieval functionalities should only be offered to the users as supplements, and not as default settings. A step-by-step change of the search, via different menus, for example, which would wholly conform to the user’s wishes, is the ideal variant in order to better process more complex information needs without asking too much of more inexperienced users. Angeletou et al. (2007) also point out that folksonomies should offer improved research options but let the users handle them: “As searchers rely on their own view about what inter-related tags best describe the resource they are looking for, it follows that content retrieval could be enhanced if folksonomies were aware of the relations between their tags” (Angeletou et al., 2007, 31). For a semi-automatic processing of the tags it is important to know how certain procedures (e.g. similarity algorithms or clustering methods) work and what effects they might have on tag suggestions or automatically created tag hierarchies, for example. Cattuto et al. (2008a; see also Cattuto et al., 2008b) investigate the different similarity measures (simple co-occurrence via number of connections, cosine via co-occurrences and FolkRank (Hotho et al., 2006b)), in order to describe their effects on tags. To be able to judge the semantic content of the created tag clusters or rankings, the authors align them with the WordNet Synsets. They find the following:
• the calculation via cosine tends to determine synonyms and related terms, such as orthographical variants and terms from the same WordNet Synset;
• the calculation via FolkRank and simple co-occurrence tends to determine more general terms and should thus be used to create a term hierarchy;
• the calculation via FolkRank is capable of re-summarizing multiple-word terms, if they were separated by the system design.
To account for the different summarizations of similar terms, Cattuto et al. (2008a) assume: A possible justification for these different behaviors is that the cosine measure is measuring the frequency of co-occurrence with other words in the global contexts, whereas the co-occurrence measure and – to a lesser extent – FolkRank measure the frequency of co-occurrence with other words in the same posts (Cattuto et al., 2008a).
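The difference between the two simpler measures can be made tangible with a small sketch. The posts below are invented, and the functions are a simplified reading of ‘simple co-occurrence’ and ‘cosine via co-occurrences,’ not the exact procedure of Cattuto et al.; FolkRank is left out. The example is constructed so that two orthographical variants never appear in the same post but share almost the same co-occurrence context.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical posts: each post is the set of tags one user assigned to one resource.
POSTS = [
    {"web", "design", "css"},
    {"web", "webdesign", "css"},
    {"web", "design", "html"},
    {"webdesign", "html"},
]


def cooccurrence_counts(posts):
    """Count, for every unordered tag pair, the posts in which both tags appear."""
    counts = Counter()
    for post in posts:
        for a, b in combinations(sorted(post), 2):
            counts[(a, b)] += 1
    return counts


def context_vector(tag, counts):
    """Describe a tag by its co-occurrence counts with every other tag."""
    vec = Counter()
    for (a, b), n in counts.items():
        if a == tag:
            vec[b] = n
        elif b == tag:
            vec[a] = n
    return vec


def cosine(tag1, tag2, counts):
    """Cosine similarity of two tags' co-occurrence context vectors."""
    v1, v2 = context_vector(tag1, counts), context_vector(tag2, counts)
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


COUNTS = cooccurrence_counts(POSTS)
print(COUNTS[("design", "webdesign")])          # direct co-occurrence in the same post
print(round(cosine("design", "webdesign", COUNTS), 3))  # similarity of global contexts
```

In this toy data, ‘design’ and ‘webdesign’ have a direct co-occurrence of zero, yet a high cosine value, which is in line with the observation that cosine tends to surface variants and synonyms while co-occurrence reflects use within the same post.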
Van Damme, Hepp and Siorpaes (2008) also investigate the different techniques for creating similar tags and present the result of their analysis in Figure 4.42. The authors understand ‘Ontology matching algorithms’ to refer to the ‘Formal Classification Theory’ presented in Giunchiglia, Marchese and Zaihrayeu (2005). The connection of the ‘Subsumption Model’ with co-occurrence analyses is discussed in Schmitz (2006) and adopted here. The same goes for Mika (2005), where the combination of social network analysis and a ‘Set Model’ is presented.
Figure 4.42: Similarity Algorithms and their Results. Source: Van Damme, Hepp, & Siorpaes (2008, 67, Table 3).
Some systems already try to incorporate semantic information into information retrieval – these generally do not go beyond an experimental or Beta version, though. The research project ‘tagCare163’ (Golov, Weller, & Peters, 2008) aims to provide the user with his own tags for indexing and search, as well as a managing option for the tags. This means that the user can import his tags from the different collaborative information services (e.g. Flickr, BibSonomy, del.icio.us) to tagCare, manage and edit these within tagCare, use tagCare to perform a meta search of several information services and then index the retrieved resources, with his own tags, from within tagCare. Managing the tags mainly refers to the summarization of synonyms and the disambiguation of homonyms, which both occurred on the different platforms, and the establishment of further relations between the tags (e.g. meronymy or hyponymy). In information retrieval, these procedures can be exploited, person-specifically, in the query expansion, e.g. by incorporating the defined synonyms into the search with an OR link, as tagCare concentrates on the user’s personomy. The social bookmarking service FaceTag (Quintarelli, Resmini, & Rosati, 2007a) uses a faceted thesaurus (with the facets ‘resource type,’ ‘themes,’ ‘people,’ ‘language,’ ‘date’ and ‘purposes’) to index and search the resources. Furthermore, the system enables the manual adding of tag hierarchies, thus instituting a hyponym-hyperonym structure. Quintarelli, Resmini and Rosati (2007a) summarize the spectrum of FaceTag: Optional tag hierarchies are possible. Users have the opportunity to organize their resources by means of parent-child relationships.
163 http://www.tagcare.org.
Tag hierarchies are semantically assigned to editorially established facets that can be leveraged later to flexibly navigate the resource domain. Tagging and searching can be mixed to maximize findability, browsability and user-discovery (Quintarelli, Resmini, & Rosati, 2007a, 11).
For information retrieval, this means that the users can restrict or broaden their search results either via the facets or via the different tag hierarchies and thus make their search more effective (Tonkin et al., 2008) (see Figure 4.43): In this kind of interface, users can navigate multiple faceted hierarchies at the same time. [...] For these reasons, faceted metadata can be used to support navigation along several dimensions simultaneously, allowing seamless integration between browsing and free text searching and an easy alternation between refining (zooming in) and broadening (zooming out) (Quintarelli, Resmini, & Rosati, 2007a, 12).
The combination of facets and tags has here proven successful and is also capable of meeting the different user demands.
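The alternation between refining and broadening along facets can be sketched as follows. The bookmarks and their facet values are invented, and the function is a hypothetical illustration rather than FaceTag’s implementation; only the facet names ‘resource type’ and ‘themes’ are taken from the description above.

```python
# Hypothetical bookmarks with values in two of the FaceTag facets named above.
BOOKMARKS = [
    {"url": "a.example", "resource type": "article", "themes": "tagging"},
    {"url": "b.example", "resource type": "article", "themes": "ontologies"},
    {"url": "c.example", "resource type": "video", "themes": "tagging"},
]


def search(items, filters):
    """Return the URLs of all items that match every active facet filter."""
    return [
        item["url"]
        for item in items
        if all(item.get(facet) == value for facet, value in filters.items())
    ]


# Zoomed out: only one facet is restricted.
broad = search(BOOKMARKS, {"themes": "tagging"})

# Zoomed in: a second facet is added to the same query.
narrow = search(BOOKMARKS, {"themes": "tagging", "resource type": "article"})

print(broad)
print(narrow)
```

Refining corresponds to adding a facet filter, broadening to removing one; the user never has to reformulate the query as a whole.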
Figure 4.43: The Search Functions of the Social Bookmarking Service FaceTag. Source: Quintarelli, Resmini, & Rosati (2007b, Fig. 1).
Kennedy et al. (2007) present an automated approach for extracting place and event tags from the Flickr folksonomy. They thus also set themselves the task of recognizing paradigmatic relations in the syntagmatic relations between tags. Here the authors not only draw from the purely syntagmatic tag relations, but also regard the connection of the tags to the image, or the displayed content, in order to form clusters. The goal of this approach is enhanced retrieval functionalities: Improving precision and breadth of retrieval for landmark and place-based queries. Soft annotation of photos, or suggesting tags to an un-annotate georeferenced photos uploaded by users. Generating summaries of large collections by selecting representative photos for places and identified landmarks (Kennedy et al., 2007, 632).
The World Explorer application enables users to easily explore and efficiently browse through this large-scale geo-referenced photo collection, in a manner that improves rather than degrades with the addition of more photos (Ahern et al., 2007, 1f.).
The extraction of place and event tags proceeds in two steps: 1) based on the photo data on geographical latitude and longitude, the photos are summarized in clusters (via the k-means cluster algorithm). In these clusters, the most representative tags are determined via TF*IDF, where TF represents the popularity of the tags in the cluster and IDF their popularity in the database. The value ‘user frequency’ is also multiplied with the TF*IDF value for each tag, which means that those tags that are indexed by many different users receive a higher rating. 2) “The method we use to identify place and event tags is Scale-structure Identification (or SSI). This method measure how similar the underlying distribution of metadata is to a single cluster at multiple scales. [...] The process for identifying event tags is equivalent, using the time distribution [...]” (Kennedy et al., 2007, 634). This procedure allows for the distinction between place and event tags, which is particularly helpful if the place and event both have the same name (e.g. ‘Bluegrass,’ the quarter in San Francisco and the festival). It becomes apparent that the syntagmatic relations are here used specifically in order to extract paradigmatic relations, i.e. place and event information, and to offer it up for information retrieval. Kennedy et al. (2007) implemented the system in the application ‘TagMaps164,’ or ‘World Explorer’ (Ahern et al., 2007; Jaffe et al., 2006; Naaman, 2006), and created a sort of live visualization of the world by combining Flickr and Yahoo Local Maps. In Figure 4.44 the search interface is displayed. In the left area we can see the satellite pictures of Berlin and the most common place tags. The popularity of a tag is represented by its font size, as in the tag clouds. By clicking on a tag (here ‘alexanderplatz’), the top 20 photos tagged with ‘alexanderplatz’ on Flickr are called up and appear at the right-hand side of the screen. 
Also, a window with further tags (here ‘alex’ and ‘fernsehturm’) that have often been allocated to this region opens up. These ‘also see here’ tags are generally the other top tags of the cluster calculated above (Ahern et al., 2007). If the user clicks on such a tag, the search result of the first query is not refined, but the system shows new pictures that match the tag and the location. By clicking on a photo, the user is either directed to Flickr or he can provide the system with the information that this photo has not been correctly allocated to the tag. Thus the application can improve its search results as well as its clustering procedure via user feedback. The goal of the visualization, i.e. of the combination of geographical maps, photos and semantic tags is to: Engage users in serendipitous discovery of interesting locations and photographs. Allow users to examine and understand a particular area of interest. Assist in exploration of areas of interest previously unknown to the users (Ahern et al., 2007, 5).
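The tag weighting in step 1 of the extraction described above can be sketched as follows. The photo data are invented, the cluster ids are assumed to have been produced by an earlier clustering step on latitude and longitude, and the concrete formulas (a logarithmic IDF, user frequency as the share of a cluster’s users who applied the tag) are assumptions made for this illustration; the exact definitions used by Kennedy et al. may differ.

```python
import math
from collections import defaultdict

# Hypothetical photos: (photo id, user id, geographic cluster id, tags).
# The cluster ids are assumed to come from an earlier k-means step.
PHOTOS = [
    ("p1", "u1", 0, {"alexanderplatz", "berlin"}),
    ("p2", "u2", 0, {"alexanderplatz", "berlin", "fernsehturm"}),
    ("p3", "u1", 0, {"berlin", "me"}),
    ("p4", "u1", 0, {"me"}),
    ("p5", "u3", 1, {"berlin", "reichstag"}),
    ("p6", "u4", 1, {"berlin", "reichstag"}),
]


def score_tags(photos, cluster_id):
    """Score each tag in one cluster as TF * IDF * user frequency (assumed formulas)."""
    global_count = defaultdict(int)    # photos carrying the tag in the whole collection
    cluster_count = defaultdict(int)   # photos carrying the tag inside the cluster
    cluster_users = defaultdict(set)   # distinct users who applied the tag in the cluster
    all_cluster_users = set()
    for _, user, cluster, tags in photos:
        for tag in tags:
            global_count[tag] += 1
        if cluster == cluster_id:
            all_cluster_users.add(user)
            for tag in tags:
                cluster_count[tag] += 1
                cluster_users[tag].add(user)
    scores = {}
    for tag, tf in cluster_count.items():
        idf = math.log(len(photos) / global_count[tag])
        uf = len(cluster_users[tag]) / len(all_cluster_users)
        scores[tag] = tf * idf * uf
    return scores


SCORES = score_tags(PHOTOS, 0)
for tag in sorted(SCORES, key=SCORES.get, reverse=True):
    print(tag, round(SCORES[tag], 3))
```

In this toy collection, ‘berlin’ is frequent in the cluster but also everywhere else and is therefore weighted down, while ‘me’ is penalized because only one of the cluster’s users applied it.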
164 http://tagmaps.research.yahoo.com.
Figure 4.44: The Application TagMaps. Source: http://www.tagmaps.research.yahoo.com.
A similar approach is pursued by Lee, Won and McLeod (2008a; see also Lee, Won, & McLeod, 2008b), who also put geotags and indexing tags into a relation with each other in order to solve the homonym and synonym problem in information retrieval, for example. They put down the latitude and longitude of each indexing tag by extracting the data from the geotags and are thus able to determine what tags might occur, in what places, possibly together. In searches via homonymous or synonymous tags, the user can decide for himself which search result he was expecting with his search request. Laniado, Eynard and Colombetti (2007a; 2007b) combine folksonomies from del.icio.us with WordNet in order to be able to execute semantically substantiated query expansions. In recommender systems that work with co-occurrences during both indexing and retrieval, they criticize the lack of ‘real’ semantic relations (Laniado, Eynard, & Colombetti, 2007a). Only with difficulty could relevant information be found in this way, according to the authors. The combination of tags and WordNet takes four steps: 1) ‘Tree Building:’ for each tag of a resource, the term chain is followed right up to the hyperonym, saved as a term tree and then summarized together with all the other term trees, 2) ‘Compression:’ unnecessary nodes are removed in order to simplify browsing along the term trees, 3) ‘Branch Sorting:’ the branches of the term trees are sorted according to the words’/tags’ popularity in the resource collection, 4) ‘Result Output:’ the term tree is presented during a search. The user interface for the search is displayed in Figure 4.45. The users can incorporate hyponyms or hyperonyms into their search by clicking or start new searches with better search terms. The system offers an additional functionality, the display of the WordNet definition of each term as well as the number of hits that are indexed with this tag. Even if query expansion via WordNet achieves good results,
the authors locate a problem in the combination of WordNet and tags: “only about 8% of the different tags used are contained in the lexicon” (Laniado, Eynard, & Colombetti, 2007b, 194). Even after processing the tags via the stemming procedure, this problem could not be solved (Laniado, Eynard, & Colombetti, 2007a, 153). This difficulty can probably be put down to the different forms of compounding and the strongly personal tag design. Cattuto et al. (2008a) also compare del.icio.us tags with the WordNet terms. They observe, however, that 61% of the 10,000 most popular tags are contained in WordNet.
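The first three steps of the procedure by Laniado, Eynard and Colombetti can be sketched with a toy example. The hyperonym links stand in for WordNet lookups and are invented, as are the tag counts; treating a node as ‘unnecessary’ when it is not itself a tag and has exactly one child is only one plausible reading of the compression step, and the branch weight (summed tag popularity of a subtree) is likewise an assumption.

```python
# Hypothetical hyperonym links (term -> broader term); a stand-in for WordNet lookups.
HYPERONYM = {
    "python": "snake", "snake": "reptile", "reptile": "animal",
    "cobra": "snake", "cat": "feline", "feline": "animal",
}

# Hypothetical tag popularity in the resource collection.
TAG_COUNTS = {"python": 5, "cobra": 2, "cat": 3}


def build_tree(tags, hyperonym):
    """Step 1: follow each tag's hyperonym chain to the root and merge the chains."""
    children = {}
    for tag in tags:
        node = tag
        children.setdefault(node, set())
        while node in hyperonym:
            parent = hyperonym[node]
            children.setdefault(parent, set()).add(node)
            node = parent
    return children


def compress(children, tag_counts, node):
    """Steps 2 and 3: splice out single-child non-tag nodes, sort branches by weight."""
    subtrees = [compress(children, tag_counts, kid) for kid in children.get(node, ())]
    if node not in tag_counts and len(subtrees) == 1:
        return subtrees[0]  # the node adds no choice while browsing, so it is dropped
    subtrees.sort(key=lambda tree: -tree["weight"])
    weight = tag_counts.get(node, 0) + sum(tree["weight"] for tree in subtrees)
    return {"term": node, "weight": weight, "children": subtrees}


TREE = compress(build_tree(TAG_COUNTS, HYPERONYM), TAG_COUNTS, "animal")
print([child["term"] for child in TREE["children"]])
```

In the resulting tree, ‘reptile’ and ‘feline’ have been removed, and the more popular ‘snake’ branch is listed before ‘cat.’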
Figure 4.45: Query Expansion on del.icio.us, Based on WordNet. Source: (Laniado, Eynard, & Colombetti, 2007b, 197, Fig. 3).
Kolbitsch (2007a) also uses WordNet for query expansion in his system ‘WordFlickr’ and aims to solve the ‘vocabulary problem’ during the search. The system attaches itself to Flickr and supports the user during his search. For example, the search requests can be expanded via semantic relations such as synonymy, antonymy, hyponymy and meronymy (for query expansion via synonyms, see also Lee & Yong, 2007a; 2007b). The enhanced search lets the user choose the relations (see Figure 4.46). The procedure itself is sketched in Figure 4.47. After the user has submitted his query to the system, the search terms are analyzed and processed with a stemming algorithm in order to then be forwarded to WordNet in their cleansed form. There, they are aligned with the WordNet Synsets and the terms for query expansion are selected according to the user’s commands and then forwarded to Flickr. Flickr is then searched via the expanded query and the search result is displayed for the user.
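The sequence of steps just described (cleansing the search terms, looking them up, expanding them via the relations chosen by the user, forwarding the expanded query) can be sketched as follows. The miniature lexicon stands in for WordNet, the plural stripping is a crude substitute for a real stemming algorithm, and the Boolean string is purely illustrative; none of this reproduces WordFlickr’s code or Flickr’s actual query syntax.

```python
# Hypothetical miniature lexicon standing in for WordNet; relation names follow the text.
LEXICON = {
    "shoe": {"synonymy": [], "hyponymy": ["boot", "sandal"], "meronymy": ["sole", "heel"]},
    "rock": {"synonymy": ["stone"], "hyponymy": ["boulder", "pebble"], "meronymy": []},
}


def normalize(term):
    """Crude stand-in for stemming: lowercase and strip a plural 's' if that helps."""
    term = term.strip().lower()
    if term.endswith("s") and term[:-1] in LEXICON:
        return term[:-1]
    return term


def expand_query(query, relations):
    """Expand each query term via the relations the user selected."""
    expanded = []
    for raw in query.split():
        term = normalize(raw)
        alternatives = [term]
        for relation in relations:
            alternatives.extend(LEXICON.get(term, {}).get(relation, []))
        expanded.append(alternatives)
    return expanded


def to_boolean_query(expanded):
    """Join the alternatives of one term with OR and different terms with AND."""
    return " AND ".join("(" + " OR ".join(alts) + ")" for alts in expanded)


print(to_boolean_query(expand_query("Rocks", ["synonymy", "hyponymy"])))
```

Terms that are not contained in the lexicon are passed on unchanged, so the expanded query can never match fewer terms than the original one.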
Figure 4.46: The Search Interface of WordFlickr with Query Expansion. Source: Kolbitsch (2007b).
Figure 4.47: The Workings of WordFlickr. Source: Kolbitsch (2007a, 80, Fig. 1).
Kolbitsch (2007a) also observes that the WordNet database is insufficient for the needs of folksonomy users. Its restriction to the English language is a problem, as are the lack of personal or technical words and the frequent occurrence of spelling mistakes in folksonomies, which make alignment with WordNet harder. Also, WordFlickr does not yet take into consideration the problems of homonymy and tag ambiguity. The evaluation of WordFlickr showed that the query expansion
via WordFlickr leads to fewer results that contain very personal tags, such as ‘2005,’ ‘party’ or ‘roadtrip’ (Kolbitsch, 2007a, 83): “Hence, WordFlickr is capable of offering tags in the results that are semantically closer to the user’s query” (Kolbitsch, 2007a, 83). A search request for ‘rock,’ a homonym, showed that a simple Flickr search yielded more musically-oriented results, while the query via WordFlickr tended to result in geologically-themed resources. The third search request, for ‘shoe,’ showed that Flickr and WordFlickr yielded almost identical results. Kolbitsch (2007a) thus concludes: “Hence, it can be concluded that WordFlickr yields results that are, at worst, as good as Flickr’s search results. Therefore the use of WordFlickr’s concept can be a valuable addition to tagging systems such as Flickr” (Kolbitsch, 2007a, 84). Al-Khalifa and Davis (2006) connect the user tags with terms and relations from different ontologies in their system ‘FolksAnnotation,’ and thus facilitate a more semantically rich indexing of resources. In Al-Khalifa, Davis and Gilbert (2007) they report on the use of their system for information retrieval. Through the combination of tags and ontologies, or the working of tags into the ontologies, the tags are treated not as keywords but as RDF structures with Property-Value pairs. Thus even searches that answer specific questions, execute a reasoning process or facilitate ‘Search by Difficulty, Search by Instructional level and Search by Resource type’ can be performed (Al-Khalifa, Davis, & Gilbert, 2007). To evaluate the system’s retrieval effectiveness, three kinds of search were performed: 1) Ontology Browsing, i.e. the scanning of the ontology for suitable query terms, 2) Ontology Querying, the direct entering of search terms and additional selection of different facets, such as resource type or subject, via drop-down menu, and 3) Semantic Search via the tag-enriched ontologies, as compared to a search only via folksonomy tags and manually created expert keywords. The result is summarized as follows: These results demonstrate that semantic search outperforms folksonomy search in our sample test, this is because folksonomy search, even if the folksonomy keywords were produced by humans, is analogous to keyword search and therefore limited (Al-Khalifa, Davis, & Gilbert, 2007).
If fields were implemented as Tag Gardening tools in a tagging system, and if users indexed with the help of these fields (Bar-Ilan et al., 2006), then these results could be exploited in information retrieval. Thus it is possible to search for tags field-specifically, on the basis of the fields (Heckner, Neubauer, & Wolff, 2008), where the search results can be restricted in advance and semantically disambiguated by the user, e.g. with regard to resource types. The use of fields in tagging systems for the purpose of knowledge representation and information retrieval would on the one hand significantly increase the expressiveness, and on the other hand the retrieval effectiveness, of folksonomies: “One step towards semantically enhanced tags which would allow for structured retrieval could be achieved by dividing the tag entry field into separate categories” (Heckner, Neubauer, & Wolff, 2008, 9). This approach is identical to the faceted search suggested by Quintarelli, Resmini and Rosati (2007b) above. Strohmaier (2008) also tries to provide an improved information retrieval by prescribing fields. He concentrates on ‘purpose tags,’ which are meant to make it easier for the user to select the desired resource during later searches. Purpose Tags are defined as follows: “Purpose tags aim to explicitly capture aspects of intent, i.e. the different contexts in which particular resources can be used. When
assigning purpose tags, users are assumed to tag a website with a specific purpose or goal in mind” (Strohmaier, 2008, 37). Strohmaier (2008) here builds on the following assumption: “The intuition behind that is that purpose tags 1) expand the vocabulary of traditional tags and 2) help bridging the gulf between user intent expressed in search queries and the resources users expect to retrieve” (Strohmaier, 2008, 37). Heymann, Koutrika and Garcia-Molina (2008) are able to confirm in a study that users generally use a different vocabulary for search than they do for indexing resources. This makes it a lot harder to retrieve relevant resources in collaborative information services with the help of folksonomies. Thus Strohmaier (2008) implements Purpose Tags in order to coerce the users of the resource (whether author or user) to formulate a definition of the resource’s use via tags, thus broadening the access path to the resource for searching users. During tagging, the users should no longer describe only the content of the resource, but also index its purpose. Here the users are asked, during indexing: “This resource helps me to...?” (Strohmaier, 2008, 37). In a free-text field, they can answer the question with a short sentence, e.g. “find a physician.” Strohmaier (2008) hopes to create a further tag variant in the users’ indexing vocabulary, thus improving the folksonomy’s retrieval effectiveness. In a retrieval test, the author is able to demonstrate that the Purpose Tags are well suited to restrict a hit list with regard to the searcher’s current information need, while an expansion of the query by related tags has generally been avoided by the users: While all users were comfortable using purpose tags to narrow down search results in the purpose tagging prototype, none of the users used del.icio.us tags (provided at the top of the search result page) to refine their search. 
Users reported that by being presented with a list of possibly related goals, they felt “guided” during search, and that purpose tags were helpful because they “felt natural” when trying to accomplish a search goal (Strohmaier, 2008, 39).
In a concluding study, Strohmaier (2008) also finds out that roughly 72% of Purpose Tags are new and cannot be found in the tagging vocabulary of del.icio.us. Thus he arrives at the same result as Heymann, Koutrika and Garcia-Molina (2008) and confirms that users of collaborative information services tag with a different vocabulary than the one they search with. Thus Strohmaier (2008) concludes that “purpose tags help to expand the vocabulary of existing kinds of tags in a useful way” (Strohmaier, 2008, 40). Furthermore, Purpose Tags can be processed with methods of co-occurrence or cluster analysis, just like traditional indexing tags, in order to clarify the relations between them. In analogy to ‘emergent semantics’ (Wu, Zhang, & Yu, 2006; Zhang, Wu, & Yu, 2006; Spyns et al., 2006), Strohmaier (2008) speaks of the developing tag network as “emergent pragmatics, where the focus is on the usage context of resources rather than their meaning” (Strohmaier, 2008, 41). As an outlook on retrieval research on ‘emergent pragmatics,’ Strohmaier (2008) sums up: “Mapping search queries onto such goal graphs to refine or disambiguate search queries, and using these graphs to mine associations with resources could represent an interesting problem for future research” (Strohmaier, 2008, 41). In this context, even the ‘seedlings’ or ‘baby tags’ (Peters & Weller, 2008b) that arise in information retrieval can be located and used. If it turns out that the users search with a tag but receive no hits, the tagging system is alerted to which neologisms arise in the users’ language use and to the fact that they may even need these search tags – their frequency of usage can even be exploited to find out how
great this demand is. The tagging system can then point out this particularity to the users in the context of the information resources’ indexing and then offer the new tags for indexing. Thus access to the resources is broadened and the close connection between knowledge representation and information retrieval again becomes evident. Another option for using the fields for information retrieval is syntactic indexing, the weighting of the search terms made possible by it and the adjustment of the ranking according to these aspects (Knautz, 2008). Syntactic indexing is the linking of indexing terms into fixed chains during indexing (DIN 31.623/3, 1988; Stock & Stock, 2008, 351ff.), in order to be able to reflect the exact content of the represented resource with more precision. Syntactic indexing clearly relates the indexing terms to one another. Knautz (2008) uses syntactic tagging to create tag chains and to use them for an improved visualization of folksonomies. As two tags co-occur in a chain more often, they appear to be in direct relation to each other, which can be displayed via spatial proximity in the tag cloud. This relation, made visible by syntactic tagging, can also be used for the optimization of hit lists, or for the ranking. If a user executes a search with several query terms, they are checked for occurrence in a syntactic chain and displayed in the ranking: if all search terms occur in an indexing chain, it can be assumed that the resource reflects the subject that was searched for and is then yielded as a top result. The indexing chains can furthermore be combined with weighting values, so the user can determine himself which search results are to be displayed or which terms he deems particularly important. Thus he makes sure that he receives the most appropriate resources by exploiting the tags’ context for his search via the indexing chains. 
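The use of indexing chains for ranking can be illustrated with a minimal sketch. The index below is invented and the scoring order (all query terms in one chain first, then the best partial chain match, then matches spread over several chains) is an assumption made for this illustration, not the procedure of Knautz (2008); user-defined weighting values are left out.

```python
# Hypothetical resources, each indexed with one or more tag chains (syntactic indexing).
INDEX = {
    "doc1": [("folksonomy", "retrieval"), ("tag", "cloud")],
    "doc2": [("folksonomy", "visualization"), ("retrieval", "evaluation")],
    "doc3": [("folksonomy",)],
}


def rank(query_terms, index):
    """Rank resources; those with all query terms in a single chain come first."""
    query = set(query_terms)
    scored = []
    for doc, chains in index.items():
        all_tags = set().union(*chains)
        if not query & all_tags:
            continue  # no query term matches this resource at all
        best_chain = max(len(query & set(chain)) for chain in chains)
        in_one_chain = best_chain == len(query)
        scored.append((in_one_chain, best_chain, len(query & all_tags), doc))
    scored.sort(key=lambda s: (-s[0], -s[1], -s[2], s[3]))
    return [doc for *_, doc in scored]


print(rank(["folksonomy", "retrieval"], INDEX))
```

Here ‘doc2’ contains both query terms as well, but in different chains, i.e. in different contexts, and is therefore ranked below ‘doc1.’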
The different approaches show that Tag Gardening endeavors make sense not only in knowledge representation but can also be fruitful for information retrieval and support the user in his searches. Here it is not strictly necessary that knowledge representation prepare the ground so the systematic summarization of tags can be used for retrieval. Query expansions or search-tag recommendations via pre-existing KOS build on the information services and alert the users to further search options, without changing the structure of the information service. The connection of Tag Gardening in knowledge representation and Tag Gardening in information retrieval would build on itself to formidable effect, though, and offer more elaborate retrieval functionalities.
Outlook
Unanswered research questions on the subjects of folksonomies and information retrieval mainly concern the design of search engines for an improved retrieval of resources from collaborative information services, the visualization of search results, the measuring of the success of folksonomies and tagging systems and the exploitation of user activity for information retrieval. In search engine development, for instance, the question arises of whether a ‘one fits all’ solution could be possible, i.e. the development of a search engine for all resource types throughout the different collaborative information services, or whether different specialized search engines meet user needs in searching for information more effectively. Hearst, Hurst and Dumais (2008) see an advantage for specialized search engines that are able to perfectly adapt to the resource-specific
characteristics and retrieval demands and suggest the creation of a blog search engine that works with facets. The facets should cover the following aspects and make them available for information retrieval:
• ‘blog genre,’ e.g. diary, group blog etc.,
• ‘blog style,’ e.g. informative, humorous etc.,
• ‘frequency of update,’ e.g. daily, weekly etc.,
• ‘proportion of original content,’ the amount of cited and original blog entries, respectively, and
• ‘detailed topic categories,’ e.g. politics, technology etc.
Figure 4.48: Search Engine Mash-Up of Image Editing and Flickr. Source: http://labs.systemone.at/retrievr/about.
In terms of Web 2.0, however, it is also thinkable to construct the search engines as mash-ups and thus offer new retrieval functionalities. Maness (2006) mentions a search engine mash-up of Flickr and an image editing program, such as Gimp165 or MS Paint166. Here the user can draw his search request, choosing between different shapes and colors, and is then shown results in the form of Flickr images that display visual characteristics similar to those of the drawn image (see Figure 4.48). The visualization of search results has not been discussed yet either. The advantages and disadvantages of tag clouds are sufficiently investigated and numerous improved display options are presented, but: Few Web applications do anything to organize or to summarize the contents of such responses beyond ranking the items in the list. Thus, users may be forced to scroll through many pages to identify the information they seek and are generally not provided with any way to visualize the totality of the results returned (Kuo et al., 2007, 1203).
165 http://www.gimp.org.
166 http://en.wikipedia.org/wiki/Microsoft_Paint.
Ask.com (Lewandowski, 2007) takes a few steps in this direction with its search result presentation and offers a combined display of traditional websites and resources from collaborative information services, such as photos and videos, as well as Wikis (see Figure 4.49). However, Ask.com does not go beyond a visualization of the results in the form of a list either. Furthermore, the question remains of how the users or authors of the videos, blogs or photos can be integrated into the hit list.
Figure 4.49: Results Display from Ask.com. Source: Lewandowski (2007, 85, Fig. 4).
Fichter and Wisniewski (2008) remark that the measuring of collaborative information services’ success has not been discussed yet. If one wants to determine their success via their folksonomies’ effectiveness, I would refer them to the conceptual approach by Furner (2007). Here the success criteria for folksonomies and tagging systems are discussed in great detail and suggestions for the development of an evaluation system are made. The joint venture Social Media (AG Social Media, 2009) is leading initial discussions towards establishing a ‘social media currency’ meant to consist of range and intensity measurements. Bertram (2009) criticizes the lack of maintenance in tagging systems and folksonomies on the part of the users: “While tagging, hardly anyone repairs […] sources once they are saved, hardly anyone deletes an invalid URL in social bookmarking, for example” (Bertram, 2009, 24)*. In chapter 3, the Tag Gardening tool ‘tagCare’ (Golov, Weller, & Peters, 2008) has been introduced, which is capable of meeting this criticism, at least in personal resource management and personal tag maintenance. Smith et al. (2008), cited above, formulate the six requirements (1) ‘filter,’ (2) ‘rank,’ (3) ‘disambiguate,’ (4) ‘share,’ (5) ‘recommend’ and (6) ‘matchmake’ for folksonomies. After the illustrations in this chapter, it can be observed that the first four are not being met in practice as of yet. An exception is Flickr’s Interestingness Ranking, which implements parts of the requirements in its algorithm. This book has expanded on the Interestingness Ranking and conceptually arranged the first four factors into a relevance ranking; the evaluation of these ranking factors’ practicability must be made elsewhere. The last two requirements, ‘recommend’ and ‘matchmake,’ are entirely met by folksonomies and collaborative information services. A prototypical example would be Amazon, but LibraryThing, for instance,
also makes heavy use of folksonomies’ recommendation options. Important at this point is the aspect of recommendations based on identical tags of two users not exclusively being generated automatically, but reflecting user participation. The significance for e-commerce is defined by Dye (2006): Automated search engines are constantly trying to think like people: how they search and how they say what they mean. Human-generated metadata, when applied correctly, can be more valuable than that generated by a robot. For marketing towards the Internet community, insight into the consumer’s thought process can be priceless. That’s why Amazon, Yahoo!, and other major Internet commerce sites are exploring how to incorporate tagging into their interfaces. [...] Amazon can generate recommendations for other items based on users’s tags (Dye, 2006, 43).
Over the course of this chapter, it has been shown that folksonomies are suited for social search, where their strengths lie mainly in relevance ranking and social browsing (Brusilovsky, 2008). Sherman (2006) characterizes social search as follows:
What is social search? […] Simply put, social search tools are internet wayfinding services informed by human judgement. Wayfinding, because they’re not strictly search engines in the sense that most people know them. And human judgement means that at least one, but more likely dozens, hundreds or more people have “consumed” the content and have decided it’s worthy enough to recommend to others (Sherman, 2006).
It becomes clear, then, that the users and their activity in collaborative information services and folksonomies play an important role. It is they who transform the traditional information retrieval of libraries and the internet into an ‘Information Retrieval 2.0.’ Evans and Chi (2008) confirm the increasing significance of user activity, manifested in ratings, resources, comments, folksonomies etc. They questioned users about their behavior before, during and after searching, in order to find out at which points interaction with other users makes sense. From their interviews, the authors conclude that social interaction before a search merely whets users’ appetite for searching in the first place, and that it is used during a search mainly when a user is conducting a self-motivated informational search. ‘Informational queries’ (Broder, 2002) are ‘normal’ search requests for literature or the facts contained therein, in contrast to navigational queries (where a specific URL is sought) and transactional queries, in which a planned transaction is in the foreground (generally in e-commerce, e.g. when buying a product). Informational searches thus aim at finding specific information that is either previously unknown to the user or retrieved through a sequence of known routines, e.g. looking up weather data for a known location. For informational queries, Evans and Chi posit that users exchange information with a particular focus on search terms and retrieval strategies. After the search, users tell others about their search results, either to get feedback or to share newly found knowledge about the search or its result. According to Evans and Chi (2008), social interaction thus plays an important role during all stages of a search. They summarize these results and the frequency of the different cases in a model, sketched in Figure 4.50.
On the basis of these findings, they legitimize the use of folksonomies in all their facets (such as social navigation or collaborative recommender systems) in information retrieval, since folksonomies keep the users of collaborative information services visible and available for social interaction at all times.
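The three query types distinguished by Broder (2002) and mentioned above can be written down as a small data structure. The sketch below is purely illustrative: Broder does not prescribe keyword rules, and the cue words, the URL pattern and the function name `classify` are assumptions introduced here only to show how the categories differ.

```python
import re
from enum import Enum


class QueryType(Enum):
    INFORMATIONAL = "informational"   # looking for literature or facts
    NAVIGATIONAL = "navigational"     # looking for a specific URL or site
    TRANSACTIONAL = "transactional"   # a planned transaction, e.g. a purchase


# Hypothetical cues; a real system would need far richer evidence.
TRANSACTION_CUES = {"buy", "order", "download", "price"}
URL_PATTERN = re.compile(r"(^www\.|\.(com|org|net|de)\b)", re.IGNORECASE)


def classify(query: str) -> QueryType:
    """Toy rule-based assignment of a query to one of the three types."""
    if URL_PATTERN.search(query):
        return QueryType.NAVIGATIONAL
    if set(query.lower().split()) & TRANSACTION_CUES:
        return QueryType.TRANSACTIONAL
    # Everything else is treated as the 'normal' informational case.
    return QueryType.INFORMATIONAL


for q in ("www.flickr.com", "buy digital camera", "weather data duesseldorf"):
    print(q, "->", classify(q).value)
```

Treating the informational type as the default mirrors the description of informational queries as the ‘normal’ kind of search request; it is a design choice of this sketch, not a claim about how search engines actually classify queries.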
Figure 4.50: Social Interaction before, during and after Research. Source: Evans & Chi (2008, 85, Fig. 1).
Bibliography
AG Social Media (2009). Ergebnisse des Measurement Summits, from http://agsm.de/?p=116.
Agichtein, E., Brill, E., & Dumais, S. (2006). Improving Web Search Ranking by Incorporating User Behaviour Information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, USA (pp. 19–26).
Ahern, S., Naaman, M., Nair, R., & Yang, J. (2007). World Explorer: Visualizing Aggregate Data from Unstructured Text in Geo-referenced Collections. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, Vancouver, BC, Canada (pp. 1–10).
Alani, H., Dasmahapatra, S., O'Hara, K., & Shadbolt, N. (2003). Identifying Communities of Practice through Ontology Network Analysis. IEEE Intelligent Systems, 18(2), 18–25.
Albrecht, C. (2006). Folksonomy. Diplomarbeit, TU Wien, from http://www.cheesy.at/download/Folksonomy.pdf.
Al-Khalifa, H. S., & Davis, H. C. (2006). FolksAnnotation: A Semantic Metadata Tool for Annotating Learning Resources Using Folksonomies and Domain Ontologies. In Proceedings of the 2nd International IEEE Conference on Innovations in Information Technology, Dubai, UAE.
Al-Khalifa, H. S., & Davis, H. C. (2007). FAsTA: A Folksonomy-Based Automatic Metadata Generator. In Proceedings of the 2nd European Conference on Technology Enhanced Learning, Crete, Greece.
Al-Khalifa, H. S., Davis, H. C., & Gilbert, L. (2007). Creating Structure from Disorder: Using Folksonomies to Create Semantic Metadata. In Proceedings of the 3rd International Conference on Web Information Systems and Technologies, Barcelona, Spain.
Ames, M., & Naaman, M. (2007). Why We Tag: Motivations for Annotation in Mobile and Online Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA (pp. 971–980).
Angeletou, S., Sabou, M., Specia, L., & Motta, E. (2007). Bridging the Gap between Folksonomies and the Semantic Web: An Experience Report. In Proceedings of the 4th European Semantic Web Conference, Innsbruck, Austria (pp. 30–43).
Anick, P. (2003). Using Terminological Feedback for Web Search Refinement: A Log-based Study. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada (pp. 88–95).
Auray, N. (2007). Folksonomy: The New Way to Serendipity. Communications & Strategies, 65, 67–89.
Austin, J. L. (1972). Zur Theorie der Sprechakte. Stuttgart: Reclam.
Baeza-Yates, R. (2007a). Graphs from Search Engine Queries. Lecture Notes in Computer Science, Theory and Practice of Computer Science, 4362, 1–8.
Baeza-Yates, R. (2007b). Mining Queries. Lecture Notes in Computer Science, Knowledge Discovery in Databases, 4702, 4.
Bao, S. et al. (2007). Optimizing Web Search Using Social Annotations. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 501–510).
Bar-Ilan, J., Shoham, S., Idan, A., Miller, Y., & Shachak, A. (2006). Structured vs. Unstructured Tagging - A Case Study. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Bateman, S., Gutwin, C., & Nacenta, M. (2008). Seeing Things in the Clouds: The Effect of Visual Features on Tag Cloud Selections. In Proceedings of the 19th ACM Conference on Hypertext and Hypermedia, Pittsburgh, PA, USA (pp. 193–202).
Beal, A. (2006). Wink's Michael Tanne Discusses the Future of Tagging, from http://www.marketingpilgrim.com/2006/01/winks-michael-tanne-discussesfuture.html.
Begelman, G., Keller, P., & Smadja, F. (2006). Automated Tag Clustering: Improving Search and Exploration in the Tag Space. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Belkin, N. J., & Croft, W. B. (1992). Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM, 35(2), 29–38.
Belkin, N. J., Oddy, R. N., & Brooks, H. M. (1982). ASK for Information Retrieval: Part I. Background and Theory. Journal of Documentation, 38(2), 61–71.
Bender, G. (2008). Kundengewinnung und -bindung im Web 2.0. In B. H. Hass, T. Kilian, & G. Walsh (Eds.), Web 2.0. Neue Perspektiven für Marketing und Medien (pp. 173–190). Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
Bertram, J. (2005). Einführung in die inhaltliche Erschließung. Würzburg: Ergon.
Bertram, J. (2009). Social Tagging - Zum Potential einer neuen Indexiermethode. Information - Wissenschaft & Praxis, 60(1), 19–26.
Birkenhake, B. (2008). Semantic Weblog: Erfahrungen vom Bloggen mit Tags und Ontologien. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 153–162). Münster, New York, München, Berlin: Waxmann.
Bischoff, K., Firan, C. S., Nejdl, W., & Paiu, R. (2008). Can All Tags Be Used for Search? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, CA, USA (pp. 193–202).
Blank, M., Bopp, T., Hampel, T., & Schulte, J. (2008). Social Tagging = Soziale Suche? In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 85–96). Münster, New York, München, Berlin: Waxmann.
Bradford, S. C. (1934). Sources of Information on Specific Subjects. Engineering, 137, 85–86.
Braun, M., Dellschaft, K., Franz, T., Hering, D., Jungen, P., & Metzler, H., et al. (2008). Personalized Search and Exploration with MyTag. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 1031–1032).
Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference (Vol. 7).
Broder, A. (2002). A Taxonomy of Web Search. ACM SIGIR Forum, 36(2), 3–10.
Brooks, C., & Montanez, N. (2006a). An Analysis of the Effectiveness of Tagging in Blogs. In N. Nicolov, F. Salvetti, M. Liberman, & J. Martin (Eds.), Computational Approaches to Analyzing Weblogs. Papers from the 2006 AAAI Spring Symposium (pp. 9–15). Menlo Park, CA: AAAI Press.
Brooks, C., & Montanez, N. (2006b). Improved Annotation of the Blogosphere via Autotagging and Hierarchical Clustering. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Brusilovsky, P. (2008). Social Information Access: The Other Side of the Social Web. Lecture Notes in Computer Science, 4910, 5–22, from http://www2.sis.pitt.edu/~peterb/papers/SocialInformationAccess.pdf.
Butterfield, D., Costello, E., Fake, C., Henderson-Begg, C., Mourachow, S., & Schachter, J. (2006a). Media Object Metadata Association and Ranking. Patent No. US 2006/0242178A1.
Butterfield, D., Costello, E., Fake, C., Henderson-Begg, C., & Mourachow, S. (2006b). Interestingness Ranking of Media Objects. Patent No. US 2006/0242139.
Candan, K. S., Di Caro, L., & Sapino, M. L. (2008). Creating Tag Hierarchies for Effective Navigation in Social Media. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 75–82).
Capocci, A., & Caldarelli, G. (2008). Folksonomies and Clustering in the Collaborative System CiteULike. Journal of Physics A: Mathematical and Theoretical, 41, from http://arxiv.org/abs/0710.2835.
Carman, M. J., Baillie, M., & Crestani, F. (2008). Tag Data and Personalized Information Retrieval. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 27–34).
Cattuto, C. (2006). Semiotic Dynamics in Online Social Communities. European Physical Journal C, 46(2), 33–37.
Cattuto, C., Benz, D., Hotho, A., & Stumme, G. (2008a). Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems, from http://www.citebase.org/abstract?id=oai:arXiv.org:0805.2045.
Cattuto, C., Benz, D., Hotho, A., & Stumme, G. (2008b). Semantic Grounding of Tag Relatedness in Social Bookmarking Systems. In Proceedings of the 7th International Semantic Web Conference, Karlsruhe, Germany (pp. 615–631).
Chen, H. (1994). Collaborative Systems: Solving the Vocabulary Problem. IEEE Computer, Special Issue on Computer-Supported Cooperative Work, 27(5), 58–66.
Chi, E. H., & Mytkowicz, T. (2006). Understanding Navigability of Social Tagging Systems, from http://www.viktoria.se/altchi/submissions/submission_edchi_0.pdf.
Chiang, K. (2006). Clicking Instead of Walking: Consumers Searching for Information in the Electronic Marketplace. Bulletin of the ASIST, 32(2), from http://www.asis.org/Bulletin/index.html.
Chopin, K. (2008). Finding Communities: Alternative Viewpoints through Weblogs and Tagging. Journal of Documentation, 64(4), 552–575.
Choy, S., & Lui, A. K. (2006). Web Information Retrieval in Collaborative Tagging Systems. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong (pp. 352–355).
Christiaens, S. (2006). Metadata Mechanism: From Ontology to Folksonomy…and Back. Lecture Notes in Computer Science, 4277, 199–207.
Cormode, G., & Krishnamurthy, B. (2008). Key Differences between Web 1.0 and Web 2.0. First Monday, 13(6), from http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2125/1972.
Cox, A., Clough, P., & Marlow, J. (2008). Flickr: A First Look at User Behaviour in the Context of Photography as Serious Leisure. Information Research, 13(1), from http://informationr.net/ir/13-1/paper336.html.
Crawford, W. (2006). Folksonomy and Dichotomy. Cites & Insights: Crawford at Large, 6(4), from http://citesandinsights.info/v6i4a.htm.
Crestani, F., Lalmas, M., van Rijsbergen, C. J., & Campbell, I. (1998). “Is this document relevant? … probably”: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys, 30(4), 528–552.
Croft, W. B., & Harper, D. J. (1979). Using Probabilistic Models of Document Retrieval without Relevance Information. Journal of Documentation, 35, 285–295.
Cui, H., Wen, J. R., Nie, J. Y., & Ma, W. Y. (2002). Probabilistic Query Expansion Using Query Logs. In Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA (pp. 325–332).
Culliss, G. A. (1997). Method for Organizing Information, US 6.006.222.
Dennis, B. M. (2006). Foragr: Collaboratively Tagged Photographs and Social Information Visualization. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Dieberger, A. (1997). Supporting Social Navigation on the World Wide Web. International Journal of Human-Computer Studies, 46(6), 805–825.
Dieberger, A., Dourish, P., Höök, K., Resnick, P., & Wexelblat, A. (2000). Social Navigation: Techniques for Building More Usable Systems. Interactions, 7(6), 36–45.
Diederich, J., & Iofciu, T. (2006). Finding Communities of Practice from User Profiles Based on Folksonomies. In Proceedings of the 1st International Workshop on Building Technology Enhanced Learning Solutions for Communities of Practice, Crete, Greece.
DIN 31.626/3 (1988). Indexierung zur inhaltlichen Erschließung von Dokumenten. Syntaktische Indexierung mit Deskriptoren.
Dmitriev, P., Eiron, N., Fontoura, M., & Shekita, E. (2006). Using Annotation in Enterprise Search. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 811–817).
Dourish, P., & Chalmers, M. (1994). Running Out of Space: Models of Information Navigation. In Proceedings of HCI, Glasgow, Scotland.
Dubinko, M., Kumar, R., Magnani, J., Novak, J., Raghavan, P., & Tomkins, A. (2006). Visualizing Tags Over Time. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 193–202).
Dye, J. (2006). Folksonomy: A Game of High-tech (and High-stakes) Tag. EContent, April, 38–43.
Edelman (2008). Edelman Trust Barometer 2008, from http://edelman.com/trust/2008/trustbarometer08_final.pdf.
Efthimiadis, E. N. (1996). Query Expansion. Annual Review of Information Science and Technology, 31, 121–187.
Efthimiadis, E. N. (2000). Interactive Query Expansion: A User-based Evaluation in a Relevance Feedback Environment. Journal of the American Society for Information Science and Technology, 51, 989–1003.
Egozi, O. (2008). A Taxonomy of Social Search Approaches, from http://blog.delver.com/index.php/2008/07/31/taxonomy-of-social-searchapproaches.
Evans, B. M., & Chi, E. H. (2008). Towards a Model of Understanding Social Search. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 83–86).
Farrell, S., Lau, T., Wilcox, E., & Muller, M. J. (2007). Socially Augmenting Employee Profiles with People-tagging. In Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology, Newport, Rhode Island, USA (pp. 91–100).
Feinberg, M. (2006). An Examination of Authority in Social Classification Systems. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Feldman, S. (2000). Find What I Mean, Not What I Say. Meaning-based Search Tools. Online - Leading Magazine for Information Professionals, 24(3), 49–56.
Fichter, D. (2006). Intranet Applications for Tagging and Folksonomies. Online - Leading Magazine for Information Professionals, 30(3), 43–45.
Fichter, D., & Wisniewski, J. (2008). Social Media Metrics: Making the Case for Making the Effort. Online - Leading Magazine for Information Professionals, Nov/Dec, 54–57.
Fokker, J., Pouwelse, J., & Buntine, W. (2006). Tag-Based Navigation for Peer-to-Peer Wikipedia. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Forsberg, M., Höök, K., & Svensson, M. (1998). Design Principles For Social Navigation Tools. In Proceedings of the 4th ERCIM Workshop on User Interfaces for All, Stockholm, Sweden.
Freyne, J., & Smyth, B. (2004). An Experiment in Social Search. Lecture Notes in Computer Science, 3137, 95–103, from http://www.csi.ucd.ie/UserFiles/publications/1125325052211.pdf.
Freyne, J., Farzan, R., Brusilovsky, P., Smyth, B., & Coyle, M. (2007). Collecting Community Wisdom: Integrating Social Search & Social Navigation. In Proceedings of the International Conference on Intelligent User Interfaces, Honolulu, Hawaii, USA (pp. 52–61).
Fujimura, K., Fujimura, S., Matsubayashi, T. Y. T., & Okuda, H. (2008). Topigraphy: Visualization for Large-scale Tag Clouds. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 1087–1088).
Furnas, G., Landauer, T., Gomez, L., & Dumais, S. (1987). The Vocabulary Problem in Human-System Communication. Communications of the ACM, 30(11), 964–971.
Furnas, G., Fake, C., Ahn, L. von, Schachter, J., Golder, S., & Fox, K., et al. (2006). Why Do Tagging Systems Work? In Proceedings of the Conference on Human Factors in Computing Systems, Montréal, Canada (pp. 36–39).
Furner, J. (2007). User Tagging of Library Resources: Toward a Framework for System Evaluation. In Proceedings of World Library and Information Congress, Durban, South Africa.
Ganesan, K. A., Sundaresan, N., & Deo, H. (2008). Mining Tag Clouds and Emoticons behind Community Feedback. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 1181–1182).
Garfield, E. (1979). Citation Indexing. New York, NY: Wiley.
Giunchiglia, F., Marchese, M., & Zaihrayeu, I. (2005). Towards a Theory of Formal Classification. In Proceedings of the AAAI-05 Workshop on Contexts and Ontologies: Theory, Practice and Applications, Pittsburgh, Pennsylvania, USA.
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM, 35(12), 61–70.
Golder, S., & Huberman, B. (2006). Usage Patterns of Collaborative Tagging Systems. Journal of Information Science, 32(2), 198–208.
Golov, E., Weller, K., & Peters, I. (2008). TagCare: A Personal Portable Tag Repository. In Proceedings of the Poster and Demonstration Session at the 7th International Semantic Web Conference, Karlsruhe, Germany.
Golovchinsky, G., Pickens, J., & Back, M. (2008). A Taxonomy of Collaboration in Online Information Seeking. In 1st International Workshop on Collaborative Information Retrieval, Pittsburgh, Pennsylvania, USA.
Golub, K., Jones, C., Matthews, B., Moon, J., Nielsen, M. L., & Tudhope, D. (2008). Enhancing Social Tagging with a Knowledge Organization System. ALISS, 3(4), 13–16.
Goodrum, A., & Spink, A. (2001). Image Searching on the Excite Web Search Engine. Information Processing and Management, 37(2), 295–311.
Gordon-Murnane, L. (2006). Social Bookmarking, Folksonomies, and Web 2.0 Tools. Searcher - The Magazine for Database Professionals, 14(6), 26–38.
Graefe, G., Maaß, C., & Heß, A. (2007). Alternative Searching Services: Seven Theses on the Importance of “Social Bookmarking”, from http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol301/Paper_1_Maas.pdf.
Graham, R., Eoff, B., & Caverlee, J. (2008). Plurality: A Context-Aware Personalized Tagging System. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 1165–1166).
Grahl, M., Hotho, A., & Stumme, G. (2007). Conceptual Clustering of Social Bookmarking Sites. In Proceedings of I-KNOW, Graz, Austria (pp. 356–364).
Greenberg, J. (2001). Automatic Query Expansion via Lexical-Semantic Relationships. Journal of the American Society for Information Science and Technology, 52(5), 402–415.
Gruber, T. (2005). Ontology of Folksonomy: A Mash-Up of Apples and Oranges, from http://www.tomgruber.org/writing/mtsr05-ontology-of-folksonomy.htm.
Gruzd, A. (2006). Folksonomies vs. Bag-of-Words: The Evaluation and Comparison of Different Types of Document Representations. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Guy, M., & Tonkin, E. (2006). Folksonomies: Tidying Up Tags? D-Lib Magazine, 12(1), from http://www.dlib.org/dlib/january06/guy/01guy.html.
Gyöngyi, Z., Garcia-Molina, H., & Pedersen, J. (2004). Combating Web Spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases, Toronto, Canada (pp. 576–587).
Halpin, H., & Shepherd, H. (2006). Evolving Ontologies from Folksonomies: Tagging as a Complex System, from http://www.ibiblio.org/hhalpin/homepage/notes/taggingcss.html.
Halpin, H., Robu, V., & Shepherd, H. (2007). The Complex Dynamics of Collaborative Tagging. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 211–220).
Hammond, T., Hannay, T., Lund, B., & Scott, J. (2005). Social Bookmarking Tools (I). D-Lib Magazine, 11(4).
Han, P., Wang, Z., Li, Z., Krämer, B., & Yang, F. (2006). Substitution or Complement: An Empirical Analysis on the Impact of Collaborative Tagging on Web Search. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Hong Kong (pp. 757–760).
Harman, D. (1992). Ranking Algorithms. In W. B. Frakes & R. Baeza-Yates (Eds.), Information Retrieval. Data Structures & Algorithms (pp. 363–392). Upper Saddle River, NJ: Prentice Hall PTR.
Hassan-Montero, Y., & Herrero-Solana, V. (2006). Improving Tag-Clouds as Visual Information Retrieval Interfaces. In Proceedings of the International Conference on Multidisciplinary Information Science and Technologies, Mérida, Spain.
Hayes, C., & Avesani, P. (2007). Using Tags and Clustering to Identify Topic-Relevant Blogs. In Proceedings of ICWSM 2007, Boulder, Colorado, USA (pp. 67–75).
Hayes, C., Avesani, P., & Veeramachaneni, S. (2007). An Analysis of the Use of Tags in a Blog Recommender System. In Proceedings of IJCAI 2007, Hyderabad, India (pp. 2772–2777).
Hearst, M., & Rosner, D. (2008). Tag Clouds: Data Analysis Tool or Social Signaller? In Proceedings of the 41st Hawaii International Conference on System Sciences.
Hearst, M. A., Hurst, M., & Dumais, S. T. (2008). What Should Blog Search Look Like? In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 95–98).
Heckner, M., Neubauer, T., & Wolff, C. (2008). Tree, funny, to_read, google: What are Tags Supposed to Achieve? A Comparative Analysis of User Keywords for Different Digital Resource Types. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 3–10).
Heymann, P., & Garcia-Molina, H. (2006). Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems: InfoLab Technical Report, from http://dbpubs.stanford.edu/pub/2006-10.
Heymann, P., Koutrika, G., & Garcia-Molina, H. (2008). Can Social Bookmarking Improve Web Search? In Proceedings of the International Conference on Web Search and Web Data Mining, Palo Alto, California, USA (pp. 195–206).
Hill, W. C., & Hollan, J. D. (1994). History-Enriched Digital Objects: Prototypes and Policy Issues. The Information Society, 10(2), 139–145.
Hotho, A. (2008). Analytische Methoden zur Nutzerunterstützung in Tagging-Systemen (Vortrag). Good Tags & Bad Tags - Workshop "Social Tagging in der Wissensorganisation". Institut für Wissensmedien Tübingen.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006a). Bibsonomy: A Social Bookmark and Publication Sharing System. In Proceedings of the Conceptual Structure Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures 2006 (pp. 87–102).
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006b). FolkRank: A Ranking Algorithm for Folksonomies. In M. Schaaf & K. D. Althoff (Eds.), Proceedings of FGIR 2006: Workshop Information Retrieval 2006 of the Special Interest Group Information Retrieval, Hildesheim, Germany.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006c). Information Retrieval in Folksonomies: Search and Ranking. Lecture Notes in Computer Science, 4011, 411–426.
Hotho, A., Jäschke, R., Schmitz, C., & Stumme, G. (2006d). Das Entstehen von Semantik in BibSonomy. In Social Software in der Wertschöpfung, Baden-Baden, Germany.
Jackson, D. M. (1970). The Construction of Retrieval Environments and Pseudoclassification Based on External Relevance. Information Storage and Retrieval, 6, 187–219.
Jaffe, A., Naaman, M., Tassa, T., & Davis, M. (2006). Generating Summaries and Visualization for Large Collections of Geo-Referenced Photographs. In Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, USA (pp. 89–98).
Jäschke, R., Hotho, A., Schmitz, C., & Stumme, G. (2006). Wege zur Entdeckung von Communities in Folksonomies. In S. Braß & A. Hinneburg (Eds.), Proceedings of the 18th Workshop Grundlagen von Datenbanken, Halle-Wittenberg, Germany (pp. 80–84).
Jäschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., & Stumme, G. (2007). Tag Recommendations in Folksonomies. Lecture Notes in Artificial Intelligence, 4702, 506–514.
Jäschke, R., Krause, B., Hotho, A., & Stumme, G. (2008). Logsonomy - A Search Engine Folksonomy. In Proceedings of the 2nd International Conference on Weblogs and Social Media, Seattle.
Jiang, T., & Koshman, S. (2008). Exploratory Search in Different Information Architectures. Bulletin of the ASIST, 34(6), from http://www.asis.org/Bulletin/index.html.
Joachims, T. (2002). Optimizing Search Engines Using Clickthrough Data. In Proceedings of SIGKDD, Edmonton, Alberta, Canada (pp. 133–142).
John, A., & Seligmann, D. (2006). Collaborative Tagging and Expertise in the Enterprise. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland.
Jörgensen, C. (2007). Image Access, the Semantic Gap, and Social Tagging as a Paradigm Shift. In Proceedings of the 18th Workshop of the ASIS&T Special Interest Group in Classification Research, Milwaukee, Wisconsin, USA.
Jung, S., Herlocker, J. L., & Webster, J. (2007). Click Data as Implicit Relevance Feedback in Web Search. Information Processing and Management, 43, 791–807.
Kautz, H., Selman, B., & Shah, M. (1997). Referral Web: Combining Social Networks and Collaborative Filtering. Communications of the ACM, 40(3), 63–65.
Kennedy, L., Naaman, M., Ahern, S., Nair, R., & Rattenbury, T. (2007). How Flickr Helps Us Make Sense of the World: Context and Content in Community-Contributed Media Collections. In Proceedings of the 15th International Conference on Multimedia, Augsburg, Germany (pp. 631–640).
Kessler, M. M. (1963). Bibliographic Coupling between Scientific Papers. American Documentation, 14, 10–25.
Ketchum, & University of Southern California Annenberg (2009). Media Myths & Realities, from http://ketchumperspectives.com/archives/2009_i1/index.php.
Kipp, M. E. I. (2006a). @toread and cool: Tagging for Time, Task, and Emotion. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Kipp, M. E. I. (2006b). Complementary or Discrete Contexts in Online Indexing: A Comparison of User, Creator, and Intermediary Keywords. Canadian Journal of Information and Library Science, 30(3), from http://dlist.sir.arizona.edu/1533.
Kipp, M. E. I. (2007a). Tagging for Health Information Organisation and Retrieval. In Proceedings of the North American Symposium on Knowledge Organization (pp. 63–74).
Kipp, M. E. I. (2007b). Tagging and Findability: Do Tags Help Users Find Things? from http://eprints.rclis.org/11769/1/asist2007poster.pdf.
Kipp, M. E. I. (2008). Searching with Tags: Do Tags Help Users Find Things? In Proceedings of the 71st ASIS&T Annual Meeting. People Transforming Information - Information Transforming People. Columbus, Ohio, USA.
Kipp, M. E. I., & Campbell, D. (2006). Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices. In Proceedings of the 17th Annual Meeting of the American Society for Information Science and Technology, Austin, Texas, USA.
Kleinberg, J. (1999). Hubs, Authorities, and Communities. ACM Computing Surveys, 31(4es), No. 5, from http://portal.acm.org/citation.cfm?id=345966.345982.
Knautz, K. (2008). Von der Tag-Cloud zum Tag-Cluster: Statistischer Thesaurus auf der Basis syntagmatischer Relationen und seine mögliche Nutzung in Web 2.0-Diensten. In Proceedings der 30. Online-Tagung der DGI, Frankfurt a.M., Germany (pp. 269–284).
Kolbitsch, J. (2007a). WordFlickr: A Solution to the Vocabulary Problem in Social Tagging Systems. In Proceedings of I-MEDIA and I-SEMANTICS, Graz, Austria (pp. 77–84).
Kolbitsch, J. (2007b). WordFlickr: Where Folksonomy Meets Taxonomy, from http://www.kolbitsch.org/research/papers/2007-WordFlickr-Presentation.pdf.
Kome, S. (2005). Hierarchical Subject Relationships in Folksonomies. Master Thesis, University of North Carolina, Chapel Hill, from http://etd.ils.unc.edu:8080/dspace/handle/1901/238.
Koutrika, G., Effendi, F. A., Gyöngyi, Z., Heymann, P., & Garcia-Molina, H. (2007). Combating Spam in Tagging Systems. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, Banff, Alberta, Canada (pp. 57–64).
Krause, B., Hotho, A., & Stumme, G. (2008). A Comparison of Social Bookmarking with Traditional Search. In Proceedings of the 30th European Conference on IR Research, ECIR, Glasgow, Scotland (pp. 101–113).
Kroski, E. (2005). The Hive Mind: Folksonomies and User-Based Tagging, from http://infotangle.blogsome.com/2005/12/07/the-hive-mind-folksonomies-anduser-based-tagging.
Kuo, B. Y., Hentrich, T., Good, B. M., & Wilkinson, M. D. (2007). Tag Clouds for Summarizing Web Search Results. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 1203–1204).
Kwiatkowski, M., & Höhfeld, S. (2007). Empfehlungssysteme aus informationswissenschaftlicher Sicht - State of the Art. Information - Wissenschaft & Praxis, 58(5), 265–276.
Lambiotte, R., & Ausloos, M. (2006). Collaborative Tagging as a Tripartite Network. Lecture Notes in Computer Science, 3993(1114), 1114–1117.
Laniado, D., Eynard, D., & Colombetti, M. (2007a). A Semantic Tool to Support Navigation in a Folksonomy. In Proceedings of the 18th Conference on Hypertext and Hypermedia 2007, Manchester, UK (pp. 153–154).
Laniado, D., Eynard, D., & Colombetti, M. (2007b). Using WordNet to Turn a Folksonomy into a Hierarchy of Concepts. In Proceedings of the 4th Italian Semantic Web Workshop, Bari, Italy (pp. 192–201).
Lee, S. S., & Yong, H. S. (2007a). Component based Approach to handle Synonym and Polysemy in Folksonomy. In Proceedings of the 7th IEEE International Conference on Computer and Information Technology, Fukushima, Japan (pp. 200–205).
Lee, S. S., & Yong, H. S. (2007b). TagPlus: A Retrieval System using Synonym Tag in Folksonomy. In Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering, Seoul, Korea (pp. 294–298).
Lee, S. S., Won, D., & McLeod, D. (2008a). Tag-Geotag Correlation in Social Networks. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 59–66).
Lee, S. S., Won, D., & McLeod, D. (2008b). Discovering Relationships among Tags and Geotags. In Proceedings of the 2nd International Conference on Weblogs and Social Media, Seattle, USA. AAAI Press.
Lee, C. S., Goh, D. H., Razikin, K., & Chua, A. Y. K. (2009). Tagging, Sharing and the Influence of Personal Experience. Journal of Digital Information, 10(1), from http://journals.tdl.org/jodi/article/view/275/275.
Lerman, K., & Jones, L. A. (2006). Social Browsing on Flickr, from http://arxiv.org/abs/cs/0612047.
Lerman, K., Plangprasopchok, A., & Wong, C. (2007). Personalizing Image Search Results on Flickr, from http://arxiv.org/abs/0704.1676.
Lewandowski, D. (2005a). Web Information Retrieval. Technologien zur Informationssuche im Internet. Frankfurt am Main: DGI.
Lewandowski, D. (2005b). Web Searching, Search Engines and Information Retrieval. Information Services & Use, 25(3-4), 137–147.
Lewandowski, D. (2007). Trefferpräsentation in Web-Suchmaschinen. In Proceedings der 29. Online-Tagung der DGI, Frankfurt am Main, Germany (pp. 83–90).
Li, R., Bao, S., Fei, B., Su, Z., & Yu, Y. (2007). Towards Effective Browsing of Large Scale Social Annotations. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada (pp. 943–952).
López-Huertas, M. J. (1997). Thesaurus Structure Design: A Conceptual Approach for Improved Interaction. Journal of Documentation, 53, 139–177.
Luhn, H. P. (1958). The Automatic Creation of Literature Abstracts. IBM Journal, 2(2), 159–165.
Lux, M., Granitzer, M., & Kern, R. (2007). Aspects of Broad Folksonomies. In Proceedings of the 18th International Conference on Database and Expert Systems Applications (pp. 283–287).
Maass, W., Kowatsch, T., & Münster, T. (2007). Vocabulary Patterns in Free-for-all Collaborative Indexing Systems. In Proceedings of International Workshop on Emergent Semantics and Ontology Evolution, Busan, Korea (pp. 45–57).
Macaulay, C. (1998). Information Navigation in "The Palimpsest". Unpublished PhD Thesis.
Maier, R., & Thalmann, S. (2007). Kollaboratives Tagging zur inhaltlichen Beschreibung von Lern- und Wissensressourcen. In R. Tolksdorf & J. Freytag (Eds.), Proceedings of XML Tage, Berlin, Germany (pp. 75–86). Berlin: Freie Universität Berlin.
Maier, R., & Thalmann, S. (2008). Institutionalised Collaborative Tagging as an Instrument for Managing the Maturing Learning and Knowledge Resources. International Journal of Technology Enhanced Learning, 1(1), 70–84.
Maness, J. M. (2006). Library 2.0 Theory: Web 2.0 and Its Implications for Libraries. Webology, 3(2), from www.webology.ir/2006/v3n2/a25.html.
Marlow, C., Naaman, M., boyd, d., & Davis, M. (2006a). HT06, Tagging Paper, Taxonomy, Flickr, Academic Article, To Read. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 31–40).
Marlow, C., Naaman, M., boyd, d., & Davis, M. (2006b). Position Paper, Tagging, Taxonomy, Flickr, Article, ToRead. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Mathes, A. (2004). Folksonomies - Cooperative Classification and Communication Through Shared Metadata, from www.adammathes.com/academic/computermediated-communication/folksonomies.html.
Mayr, P. (2006). Thesauri, Klassifikationen & Co - die Renaissance der Kontrollierten Vokabulare? In P. Hauke & K. Umlauf (Eds.), Beiträge zur Bibliotheks- und Informationswissenschaft: Vom Wandel der Wissensorganisation im Informationszeitalter (pp. 151–170). Bad Honnef: Bock + Herchen.
McFedries, P. (2006). Folk Wisdom. IEEE Spectrum, 43(Feb), 80.
Mejias, U. A. (2005). Tag Literacy, from http://blog.ulisesmejias.com/2005/04/26/tag-literacy.
Merholz, P. (2005). Clay Shirky's Viewpoints are Overrated, from http://www.peterme.com/archives/000558.html.
Merton, R. K. (1949). Social Theory and Social Structure. New York, NY: The Free Press.
Merton, R. K., & Barber, E. G. (2004). The Travels and Adventures of Serendipity. A Study in Sociological Semantics and the Sociology of Science. Princeton, NJ: Princeton Univ. Press.
Mika, P. (2005). Ontologies are us: A Unified Model of Social Networks and Semantics. Lecture Notes in Computer Science, 3729, 522–536.
Millen, D. R., & Feinberg, J. (2006). Using Social Tagging to Improve Social Navigation. In Proceedings of the AH2006 Workshop of Social Navigation and Community-based Adaption, Dublin, Ireland.
Millen, D. R., Feinberg, J., & Kerr, B. (2006). Dogear: Social Bookmarking in the Enterprise. In Proceedings of the Conference on Human Factors in Computing Systems, Montréal, Canada (pp. 111–120).
Millen, D. R., Whittaker, M. Y. S., & Feinberg, J. (2007). Social Bookmarking and Exploratory Search. In Proceedings of the 10th European Conference on Computer-Supported Cooperative Work, Limerick, Ireland (pp. 21–40).
Mishne, G. (2006). AutoTag: A Collaborative Approach to Automated Tag Assignment for Weblog Posts. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 953–954).
Mislove, A., Koppula, H. S., Gummadi, K. P., Druschel, P., & Bhattacharjee, B. (2008). Growth of the Flickr Social Network. In Proceedings of the 1st Workshop on Online Social Networks, Seattle, WA, USA (pp. 25–30).
Moffat, A., Webber, W., Zobel, J., & Baeza-Yates, R. (2007). A Pipelined Architecture for Distributed Text Query Evaluation. Information Retrieval, 10, 205–231.
Morrison, P. (2007). Why Are They Tagging, and Why Do We Want Them To? Bulletin of the ASIST, 34(1), 12–15.
Morrison, P. (2008). Tagging and Searching: Search Retrieval Effectiveness of Folksonomies on the World Wide Web. Information Processing & Management, 44(4), 1562–1579.
Muller, M. J. (2007). Comparing Tagging Vocabularies among four Enterprise Tag-Based Services. In Proceedings of the International ACM Conference on Supporting Group Work, Sanibel Island, FL, USA (pp. 341–350).
Munro, A. (1998). Judging Books by their Covers: Remarks on Information Representation in Real-world Information Spaces. In Proceedings of VRI Workshop Visual Representation and Interpretation, Liverpool, Great Britain.
Naaman, M. (2006). Eyes on the World. Computers in Libraries, 39(10), 108–111.
Page, L., Brin, S., Motwani, R., & Winograd, T. (1998). The PageRank Citation Ranking: Bringing Order to the Web, from http://ilpubs.stanford.edu:8090/422.
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Hanover, Massachusetts, USA: Now Publishers.
Panke, S., & Gaiser, B. (2008). Nutzerperspektiven auf Social Tagging – Eine Online Befragung, from www.e-teaching.org/didaktik/recherche/goodtagsbadtags2.pdf.
Paolillo, J., & Penumarthy, S. (2007). The Social Structure of Tagging Internet Video on del.icio.us. In Proceedings of the 40th Hawaii International Conference on System Sciences.
Passant, A. (2007). Using Ontologies to Strengthen Folksonomies and Enrich Information Retrieval in Weblogs: Theoretical Background and Corporate Usecase. In Proceedings of ICWSM 2007, Boulder, Colorado, USA.
Peters, I., & Stock, W. (2007). Folksonomies and Information Retrieval. In Proceedings of the 70th ASIS&T Annual Meeting. Joining Research and Practice: Social Computing and Information Science. Milwaukee, Wisconsin, USA (pp. 1510–1542).
Peters, I., & Stock, W. G. (2009). "Power Tags" in Information Retrieval. Library Hi Tech (accepted).
Peters, I., & Weller, K. (2008a). Tag Gardening for Folksonomy Enrichment and Maintenance. Webology, 5(3), Article 58, from http://www.webology.ir/2008/v5n3/a58.html.
Peters, I., & Weller, K. (2008b). Good Tags + Bad Tags. Tagging in der Wissensorganisation: Von "Baby-Tags" zu "Tag Gardening". Password, 5, 18–19.
Peterson, E. (2006). Beneath the Metadata. D-Lib Magazine, 12(11), from http://www.dlib.org/dlib/november06/peterson/11peterson.html.
Pluzhenskaia, M. (2006). Folksonomies or Fauxonomies: How Social is Social Bookmarking? In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Quintarelli, E. (2005). Folksonomies: Power to the People. In Proceedings of the ISKO Italy-UniMIB Meeting, Milan, Italy.
Quintarelli, E., Resmini, A., & Rosati, L. (2007a). Face Tag: Integrating Bottom-up and Top-down Classification in a Social Tagging System. Bulletin of the ASIST, 33(5), 10–15.
Quintarelli, E., Resmini, A., & Rosati, L. (2007b). Facetag: Integrating Bottom-up and Top-down Classification in a Social Tagging System. In Proceedings of the IA Summit, Las Vegas, Nevada, USA.
Rees-Potter, L. K. (1989). Dynamic Thesaural Systems. A Bibliometric Study of Terminological and Conceptual Changes in Sociology and Economics with the Application to the Design of Dynamic Thesaural Systems. Information Processing & Management, 25, 677–691.
Resnick, P., Iacovou, N., Sushak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, Chapel Hill, North Carolina, USA (pp. 175–186).
Resnick, P., & Varian, H. (1997). Recommender Systems. Communications of the ACM, 40(3), 56–58.
Richardson, M., Prakash, A., & Brill, E. (2006). Beyond PageRank: Machine Learning for Static Ranking. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 707–715).
Rivadeniera, A. W., Gruen, D. M., Muller, M. J., & Millen, D. R. (2007). Getting Our Head in the Clouds: Toward Evaluation Studies of Tagclouds. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, San Jose, California, USA (pp. 995–998).
Robertson, S. E., & Sparck Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science and Technology, 27, 129–146.
Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. In G. Salton (Ed.), The SMART Retrieval System - Experiments in Automatic Document Processing (pp. 313–323). Englewood Cliffs, N. J.: Prentice Hall PTR.
Rorissa, A. (2008). User-generated Descriptions of Individual Images versus Labels of Groups of Images: A Comparison Using Basic Level Theory. Information Processing and Management, 44(5), 1741–1753.
Ruge, G., & Goeser, S. (1998). Information Retrieval ohne Linguistik? nfd: Information - Wissenschaft und Praxis, 49(6), 361–369.
Russell, T. (2006). Cloudalicious: Folksonomy Over Time. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, North Carolina, USA (p. 364).
Russell, T. (2007). Tag Decay: A View into Aging Folksonomies: Poster at the ASIS&T Annual Meeting 2007, from http://www.terrellrussell.com/projects/tagdecayposter-asist07.pdf.
Salton, G., & MacGill, M. J. (1987). Information Retrieval - Grundlegendes für Informationswissenschaftler. McGraw-Hill-Texte. Hamburg: McGraw-Hill.
Sandusky, R. (2006). Shared Persistent User Search Paths: Social Navigation as Social Classification. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Schiefner, M. (2008). Social Tagging in der universitären Lehre. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 73–83). Münster, New York, München, Berlin: Waxmann.
Schillerwein, S. (2008). Der 'Business Case' für die Nutzung von Social Tagging in Intranets und internen Informationssystemen. In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 141–152). Münster, New York, München, Berlin: Waxmann.
Schmidt, S., & Stock, W. G. (2009). Collective Indexing of Emotions in Images. A Study in Emotional Information Retrieval (EmIR). Journal of the American Society for Information Science and Technology, 60(5), 863–876.
Schmitz, P. (2006). Inducing Ontology from Flickr Tags. In Proceedings of the Collaborative Web Tagging Workshop at WWW 2006, Edinburgh, Scotland.
Schmitz, C., Grahl, M., Hotho, A., Stumme, G., Cattuto, C., & Baldassarri, A., et al. (2007). Network Properties of Folksonomies. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada.
Schneider, J. W., & Borlund, P. A Bibliometric-based Semi-automatic Approach to Identification of Candidate Thesaurus Terms: Parsing and Filtering of Noun Phrases from Citation Contexts. Lecture Notes in Computer Science, 3507, 226–237.
Sen, S., Geyer, W., Muller, M. J., & Moore, M. (2006a). FeedMe: A Collaborative Alert Filtering System. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, Banff, Alberta, Canada (pp. 89–98).
Sen, S., Lam, S., Rashid, A., Cosley, D., Frankowski, D., & Osterhouse, J., et al. (2006b). tagging, communities, vocabulary, evolution. In Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, Banff, Alberta, Canada (pp. 181–190).
Sen, S., Harper, F., LaPitz, A., & Riedl, J. (2007). The Quest for Quality Tags. In Proceedings of the 2007 International ACM Conference on Supporting Group Work (pp. 361–370).
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379–423, from http://cm.belllabs.com/cm/ms/what/shannonday/shannon1948.pdf.
Shardanand, U., & Maes, P. (1995). Social Information Filtering: Algorithms for Automating “Word of Mouth”. In Proceedings on Human Factors in Computing Systems (pp. 210–217).
Sherman, C. (2006). What's the Big Deal With Social Search? from http://searchenginewatch.com/3623153.
Shirky, C. (2003). Power Laws, Weblogs and Inequality: Diversity plus Freedom of Choice Creates Inequality. In J. Engeström, M. Ahtisaari, & A. Nieminen (Eds.), Exposure: From Friction to Freedom (pp. 77–81). USA: Aula.
Sinclair, J., & Cardew-Hall, M. (2008). The Folksonomy Tag Cloud: When Is It Useful? Journal of Information Science, 34(1), 15–29.
Sinha, R. (2006). Findability with Tags: Facets, Clusters, and Pivot Browsing, from http://rashmisinha.com/2006/07/27/findability-with-tags-facets-clusters-andpivot-browsing.
Skusa, A., & Maaß, C. (2008). Suchmaschinen: Status Quo und Entwicklungstendenzen. In D. Lewandowski & C. Maaß (Eds.), Web-2.0-Dienste als Ergänzung zu algorithmischen Suchmaschinen (pp. 1–12). Berlin: Logos.
Smith, G. (2008). Tagging: Emerging Trends. Bulletin of the ASIST, 34(6), 14–17.
Smith, M., Barash, V., Getoor, L., & Lauw, H. L. (2008). Leveraging Social Context for Searching Social Media. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 91–94).
Spalding, T. (2007). Tagmash: Book Tagging Grows Up, from http://www.librarything.com/thingology/2007/07/tagmash-book-tagging-growsup.php.
Spyns, P., Moor, A. de, Vandenbussche, J., & Meersman, R. (2006). From Folksologies to Ontologies: How the Twain Meet. Lecture Notes in Computer Science, 4275, 738–755.
Sterling, B. (2005). Order Out of Chaos. Wired Magazine, 13(4), from http://www.wired.com/wired/archive/13.04/view.html?pg=4.
Stock, W. G. (2000). Informationswirtschaft. Management externen Wissens. München: Oldenbourg.
Stock, W. G. (2006). On Relevance Distributions. Journal of the American Society for Information Science and Technology, 57(8), 1126–1129.
Stock, W. G. (2007a). Information Retrieval. Informationen Suchen und Finden. München, Wien: Oldenbourg.
Stock, W. G. (2007b). Folksonomies and Science Communication. A Mash-up of Professional Science Databases and Web 2.0 Services. Information Services & Use, 27, 97–103.
Stock, W. G., & Stock, M. (2008). Wissensrepräsentation. Informationen auswerten und bereitstellen. München, Wien: Oldenbourg.
Storey, M. A. D. (2007). Navigating Documents Using Ontologies, Taxonomies and Folksonomies. In Proceedings of the 2007 ACM Symposium on Document Engineering, Winnipeg, Canada (pp. 2-2).
Strohmaier, M. (2008). Purpose Tagging: Capturing User Intent to Assist Goal-Oriented Social Search. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 35–42).
Sturtz, D. (2004). Communal Categorization: The Folksonomy, from http://www.davidsturtz.com/drexel/622/sturtz-folksonomy.pdf.
Svensson, M. (1998). Social Navigation. In N. Dahlbäck (Ed.), Exploring Navigation: Towards a Framework for Design and Evaluation of Navigation in Electronic Spaces. SICS Technical Report T98:01 (pp. 72–88). Swedish Institute of Computer Science.
Svensson, M., Laaksolahti, J., Höök, K., & Waern, A. (2000). A Recipe Based Online Food Store. In Proceedings of the 5th International Conference on Intelligent User Interfaces, New Orleans, Louisiana, USA (pp. 260–263).
Szekely, B., & Torres, E. (2005). Ranking Bookmarks and Bistros: Intelligent Community and Folksonomy Development, from http://torrez.us/archives/2005/07/13/tagrank.pdf.
Szomszor, M., Cattuto, C., Alani, H., O'Hara, K., Baldassarri, A., & Loreto, V., et al. (2007). Folksonomies, the Semantic Web, and Movie Recommendation. In Proceedings of the 4th European Semantic Web Conference, Innsbruck, Austria (pp. 71–84).
Tonkin, E., Corrado, E. M., Moulaison, H. L., Kipp, M. E. I., Resmini, A., & Pfeiffer, H. D., et al. (2008). Collaborative and Social Tagging Networks. Ariadne, 54, from http://www.ariadne.ac.uk/issue54/tonkin-et-al.
Tudhope, D., & Nielsen, M. L. (2006). Introduction to Knowledge Organization Systems and Services. New Review of Hypermedia and Multimedia, 12(1), 3–9.
van Damme, C., Hepp, M., & Siorpaes, K. (2007). FolksOntology: An Integrated Approach for Turning Folksonomies into Ontologies. In Proceedings of the European Semantic Web Conference, Innsbruck, Austria (pp. 71–85).
van Damme, C., Hepp, M., & Coenen, T. (2008). Quality Metrics for Tags of Broad Folksonomies. In Proceedings of I-Semantics, International Conference on Semantic Systems, Graz, Austria (pp. 118–125).
van Hooland, S. (2006). From Spectator to Annotator: Possibilities offered by User-Generated Metadata for Digital Cultural Heritage Collections. In Proceedings of the CILIP Cataloguing & Indexing Group Annual Conference, Norwich, Great Britain.
Vander Wal, T. (2004). Feed On This, from http://www.vanderwal.net/random/entrysel.php?blog=1562.
Vander Wal, T. (2005). Folksonomy Explanations, from http://www.vanderwal.net/random/entrysel.php?blog=1622.
Vander Wal, T. (2008). Welcome to the Matrix! In B. Gaiser, T. Hampel, & S. Panke (Eds.), Good Tags - Bad Tags: Social Tagging in der Wissensorganisation (pp. 7–9). Münster, New York, München, Berlin: Waxmann.
Viégas, F., & Wattenberg, M. (2008). Tag Clouds and the Case for Vernacular Visualization. Interactions, 15(4), 49–52.
Voß, J. (2006). Collaborative Thesaurus Tagging the Wikipedia Way, from http://arxiv.org/ftp/cs/papers/0604/0604036.pdf.
Wang, J., & Davison, B. D. (2008). Explorations in Tag Suggestion and Query Expansion. In I. Soboroff, E. Agichtein, & R. Kumar (Eds.), Proceedings of the 2008 ACM Workshop on Search in Social Media, Napa Valley, California (pp. 43–50).
Wash, R., & Rader, E. (2006). Collaborative Filtering with del.icio.us, from http://bierdoctor.com/papers/delicious_chi2006_wip_updated.pdf.
Wash, R., & Rader, E. (2007). Public Bookmarks and Private Benefits: An Analysis of Incentives in Social Computing. In Proceedings of the 70th ASIS&T Annual Meeting. Joining Research and Practice: Social Computing and Information Science. Milwaukee, Wisconsin, USA.
Watts, D. J., & Strogatz, S. H. (1998). Collective Dynamics of ‘Small-world’ Networks. Nature, 393, 440–442.
Weinberger, D. (2005). Tagging and Why It Matters, from http://www.cyber.law.harvard.edu/home/uploads/507/07-WhyTaggingMatters.pdf.
Weiss, A. (2005). The Power of Collective Intelligence. netWorker, 9(3), 16–23.
Weller, K., & Stock, W. G. (2008). Transitive Meronymy. Automatic Concept-based Query Expansion Using Weighted Transitive Part-whole Relations. Information Wissenschaft & Praxis, 59(3), 165–170.
Wexelblat, A., & Maes, P. (1999). Footprints: History-rich Tools for Information Foraging. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Pittsburgh, Pennsylvania, USA (pp. 270–277).
Wiggins, R. (2002). Beyond the Spider: The Accidental Thesaurus. Searcher - The Magazine for Database Professionals, 10(9), from http://www.infotoday.com/searcher/oct02/wiggins.htm.
Winget, M. (2006). User-Defined Classification on the Online Photo Sharing Site Flickr…or, How I Learned to Stop Worrying and Love the Million Typing Monkeys. In Proceedings of the 17th Annual ASIS&T SIG/CR Classification Research Workshop, Austin, Texas, USA.
Wu, X., Zhang, L., & Yu, Y. (2006). Exploring Social Annotations for the Semantic Web. In Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland (pp. 417–426). New York, NY: ACM Press.
Wu, H., Zubair, M., & Maly, K. (2006). Harvesting Social Knowledge from Folksonomies. In Proceedings of the 17th Conference on Hypertext and Hypermedia, Odense, Denmark (pp. 111–114).
Xu, Z., Fu, Y., Man, J., & Su, D. (2006). Towards the Semantic Web: Collaborative Tag Suggestions, from http://www.semanticmetadata.net/hosted/taggingwswww2006-files/13.pdf.
Xue, G., Zeng, H., Chen, Z., Yu, Y., Ma, W., & Xi, W., et al. (2004). Optimizing Web Search Using Click-through Data. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management, Washington D.C., USA (pp. 118–126).
Yanbe, Y., Jatowt, A., Nakamura, S., & Tanaka, K. (2007). Can Social Bookmarking Enhance Search in the Web? In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, Vancouver, BC, Canada (pp. 107–116).
Zhang, L., Wu, X., & Yu, Y. (2006). Emergent Semantics from Folksonomies: A Quantitative Study. Lecture Notes in Computer Science, 4090, 168–186.
Zhou, D., Bian, J., Zheng, S., Zha, H., & Giles, C. L. (2008). Exploring Social Annotations for Information Retrieval. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China (pp. 715–724).
Ziegler, C. (2006). Die Vermessung der Meinung. Sentiment Detection: Maschinelles Textverständnis. iX-Archiv, 10, 106–109.
Zollers, A. (2007). Emerging Motivations for Tagging: Expression, Performance, and Activism. In Proceedings of the 16th International WWW Conference, Banff, Alberta, Canada.
Conclusion
We are lazy, stupid, dishonest, and self-ignorant. And if you think about it, that’s largely true. We don’t follow instructions; sometimes we can’t spell and punctuate properly; and we often aren’t the best judges of our own information. These facts make all metadata somewhat suspect (tags, too). --Gene Smith, 2008.
This book set out to discuss the properties of folksonomies in detail, paying particular attention to the aspects relevant to knowledge representation and information retrieval for information resources. No clear-cut picture of folksonomies emerged. On the one hand, they offer users a strong tool for editing and gathering resources, and users make heavy use of it. On the other hand, so many of folksonomies’ inherent factors turn out to be fundamentally flawed that editing and retrieval via folksonomies cannot be called effective. The solution to this dilemma is an approach that strengthens the positive effects of folksonomies on indexing and information retrieval, minimizes the negative ones as far as possible and makes provision for the typical features of folksonomies and user behavior in Web 2.0. In particular, users’ great enthusiasm for participating in the indexing, rating, creation etc. of information resources should be neither forgotten nor reined in.
The advantages of folksonomies lie mainly in their linguistic flexibility: they take every user’s language seriously and refrain from any standardization during tagging. The Long Tail of little-used but more varied tags broadens access to the information resources and also reflects individual indexing needs. Tags that do not concern the resource’s aboutness create particular added value for the user, since they attach ratings, opinions etc. to the resource and thus express a quality judgment. Controlled vocabularies can also profit from folksonomies’ linguistic variability by being adjusted to the users’ language use. The close connection between resources, tags and users in folksonomies can be exploited for tag recommender systems during indexing, but also for the discovery of communities.
It is further possible to use folksonomies to let the burden of mass-indexing information resources be shouldered by many, and to playfully introduce users to the usefulness of indexing content and of knowledge representation (e.g. via Games with a Purpose).
Folksonomies display numerous disadvantages, however, which result mainly from the variability of human language and the lack of a controlled vocabulary. The disadvantages are summarized in Figure 1, within the ‘Problem-Tag Clouds.’ This variability, identified above as an advantage of folksonomies, is particularly noticeable in information retrieval, where it becomes visible to the user as increased search effort. (Multilingual) synonyms are not merged in folksonomies, homonyms are not separated, compounds appear in different forms, syncategoremata and performative tags are strongly person-oriented, and the tags are not spell-checked. As a result, the user must keep track of innumerable variants and enter or exclude them manually in order to conduct a successful recall-oriented search. The cognitive effort and the time required for searches of this kind are immense. A similar problem arises from the absence of paradigmatic relations or fields within folksonomies, which would allow the user to embed tags in a context and thus make indexing and information retrieval more precise.
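The effect of unmerged spelling variants on recall can be illustrated with a minimal sketch in Python. The resource identifiers, the tags and the helper names `normalize`, `exact_search` and `conflated_search` are hypothetical and serve only as an illustration; the crude normalization shown addresses spelling and compound variants only, not synonyms or homonyms.

```python
import re

# Hypothetical toy folksonomy: resource id -> set of user-assigned tags.
resources = {
    "r1": {"web2.0"},
    "r2": {"Web_2.0"},
    "r3": {"web-2.0", "tagging"},
    "r4": {"tagging"},
}

def normalize(tag):
    """Crude conflation: lowercase, then keep only ASCII letters and digits.

    Assumption for this sketch: non-ASCII letters are simply dropped.
    """
    return re.sub(r"[^a-z0-9]", "", tag.lower())

def exact_search(query):
    """Return resources carrying exactly the query string as a tag."""
    return {rid for rid, tags in resources.items() if query in tags}

def conflated_search(query):
    """Return resources carrying any tag that normalizes to the same form."""
    q = normalize(query)
    return {rid for rid, tags in resources.items()
            if q in {normalize(t) for t in tags}}

# Resources that are actually about Web 2.0 in this toy example.
relevant = {"r1", "r2", "r3"}
exact_recall = len(exact_search("web2.0") & relevant) / len(relevant)
conflated_recall = len(conflated_search("web2.0") & relevant) / len(relevant)
```

In this toy data the exact match finds one of the three relevant resources, whereas the conflated search finds all three; without such merging, the user would have to enter every variant by hand.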
Figure 1: Folksonomies Enriched with Traditional Methods of Knowledge Representation and Information Retrieval Protect the User from Information Overload.
The different perspectives on information resources can also have a negative effect on knowledge representation and information retrieval, since they influence the tags’ specificity. A user who is not aware of this distinct indexing method may struggle to retrieve resources efficiently. The same difficulty appears in determining the actual Documentary Reference Unit that the tags refer to, since it varies across the collaborative information services. Spam tags exploit the preferred placement of high-frequency tags, e.g. in tag clouds, and may lead the user to irrelevant results. The suitability of tag clouds as a visualization method for folksonomies is limited, since they, too, are subject to the ‘tyranny of the majority’ and mainly display the most popular tags. In knowledge representation and information retrieval, this leads to those tags being used more and more for both tasks, which lessens the folksonomy’s expressiveness. In most collaborative information services, a relevance ranking of the hit lists is non-existent or merely experimental. This makes it even harder for users to make relevance judgments concerning the search results.
The conceptual approaches to solving folksonomies’ problems are also sketched in Figure 1, where they are located in the ‘umbrella.’ This umbrella serves users as a tool for reducing information ballast, or avoiding ‘information overload,’ by applying the methods for indexing and search developed in this book to folksonomies. Indexing tags, search tags and search results can thus be better aligned with user needs. The suggestions discussed for avoiding the disadvantages of folksonomies include the exploitation of the factors ‘resources,’ ‘users’ and ‘tags’ for relevance ranking, the consideration of different tag distributions, the use of Power Tags and the combination of pre-existing KOS with folksonomies for Tag Gardening. They are summarized in the following.
To exemplify the practical implementation of folksonomies in Web 2.0, in the first chapter I developed a taxonomy that describes the term ‘Web 2.0’ intensionally and delimits the terms ‘social software’ and ‘collaborative information service’ from each other. ‘Collaborative information service’ was introduced as a new term and positioned as a hyponym of social software. The fundamental features of collaborative information services are folksonomies as an indexing method on the one hand, and resource management as the main purpose of the information service on the other. This was followed by a presentation of different collaborative information services and of how they make use of folksonomies.
One result was that we must differentiate between three types of folksonomy (‘broad,’ ‘narrow’ and ‘extended narrow’), which broadens the established terminology in folksonomy research. The introduction and explanation of Extended Narrow Folksonomies followed in chapter 3. A critical evaluation of folksonomies for the areas of knowledge representation and information retrieval closed the first chapter.
Chapter 2 served as a short introduction to the broad subject areas of knowledge representation and information retrieval and mainly explained the terms from information science necessary for the subsequent discussion of folksonomies.
Chapter 3 dealt with the features of folksonomies and with their advantages and disadvantages as a method of knowledge representation. The above quote by Gene Smith perfectly sums up the problem of folksonomies as an indexing tool: user-generated tags are ambiguous and variable, and they are neither formulated nor assigned consistently, not even by the same user. The last aspect in particular also appears as a disadvantage in established documentation languages and knowledge organization systems used by human indexers, where it is discussed under the heading of Intra-Indexer Consistency. Support can be provided by automatic indexing tools, but they too must grapple with the variability of human language, which makes a retroactive disambiguation by human hands unavoidable, even though the users’ activities are by no means free from the problems mentioned above. Especially when it comes to allocating metadata, in this case tags, to a context, every indexing and retrieval system needs human judgment. The same situation was observed in this book for folksonomies in knowledge representation. Apart from their numerous advantages, such as up-to-dateness, the passing on of the burden of indexing to many users and the possibilities for information distribution via recommender systems, folksonomies display all the disadvantages of natural language. Indexing with folksonomies to the exclusion of all other methods is therefore not to be recommended, particularly with a view to efficient and goal-oriented information retrieval:
Surviving an information deluge requires better metadata, better structure, and better keywords. In short, it requires indexing. […] Humans remain the most effective tool for pointing users to what they need. In other words, you can try to automate it, but you’ll never truly succeed (Maislin, 2005, 7f.).
Indexing content via folksonomies requires the human indexer; not only by definition during indexing, but particularly after indexing. That is why Tag Gardening was presented as a method for the retroactive, semi-automatic disambiguation and structuring of folksonomies. Since these structuring activities are passed on to the users, they are not viewed as the opposite of folksonomies’ liberal and user-oriented design, but as an equally user-oriented approach whose goal is first and foremost to adjust the Tag Garden individually to users’ personal needs in knowledge representation and information retrieval. The tasks ‘weeding,’ ‘seeding,’ ‘fertilizing’ and ‘landscape architecture’ were defined for Tag Gardening. The first two gardening activities can largely be accomplished automatically, since they comprise the removal of spam tags, the correction of typos, the addition of less frequently used tags etc., and do not strictly need user input. The latter two activities, however, do require it, since here the context and meaning of the tags must be taken into consideration. For weeding, it would make sense to implement a Type-Ahead functionality in the system, allowing the user to adopt (orthographically correct) tags, or to offer the user an editing function for his tags, allowing him to correct them afterwards.
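A Type-Ahead function of this kind might be sketched as follows. This is a minimal illustration under stated assumptions, not the design of any particular tagging system: the function names `build_tag_index` and `type_ahead`, the sample tags and the ranking by raw frequency are all hypothetical choices made for this example.

```python
from collections import Counter

def build_tag_index(taggings):
    """Count how often each tag occurs, ignoring case and surrounding spaces."""
    return Counter(tag.strip().lower() for tag in taggings if tag.strip())

def type_ahead(prefix, tag_counts, limit=5):
    """Suggest up to `limit` existing tags that start with `prefix`.

    Suggestions are ordered by descending frequency, then alphabetically,
    so that established spellings are offered before rare variants.
    """
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    matches = [(tag, n) for tag, n in tag_counts.items() if tag.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in matches[:limit]]

# Hypothetical tag assignments, including one misspelling ("folksonmy").
taggings = ["folksonomy", "folksonomy", "folksonomies", "folksonmy",
            "flickr", "Folksonomy", "tagging"]
tag_counts = build_tag_index(taggings)
suggestions = type_ahead("folks", tag_counts)
top_suggestion = type_ahead("folks", tag_counts, limit=1)
```

Because the frequent spelling is ranked first, a user typing the prefix is nudged toward the established form, while the misspelled variant still appears further down the list. Whether rare variants should be suppressed altogether is a design decision that this sketch leaves open.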
Editing the tags with methods from Natural Language Processing, such as stemming or word conflation, is also regarded as a weeding task and unifies the tags for better retrieval effectiveness. Seeding serves to increase the folksonomy’s expressiveness and is achieved by alerting users to seldom-used tags via ‘greenhouses’ or inverse tag clouds. Discreet tag recommender systems can further encourage users to adopt such tags. The allocation of tags to a context is ensured by landscape architecture. Here the user himself becomes active: he groups synonyms (multilingual ones, too), disambiguates homonyms or establishes semantic relations between tags, with the help of controlled vocabularies or KOS, among others. The result of these structuring endeavors is a more expressive and precise folksonomy, which can also be used for goal-oriented information retrieval, e.g. for query expansion via semantic relations. The gardening activity of fertilizing is performed in combination with KOS, by recommending terms from the controlled vocabularies to the user during indexing. The user can thus decide for himself whether he wants to add tags to his surrogate or use more specific or more general tags. Here, too, the objective is to increase the folksonomy’s expressiveness and the indexing quality of the information resources. Fertilizing does not only aim to enrich folksonomies, however; it can also be applied to controlled vocabularies and KOS. While folksonomies have a weak structure, KOS are heavily formalized and can adapt to new circumstances only with great difficulty. On the other hand, they are extremely expressive and able to describe the context of a resource with great precision. With the Shell Model, I presented an approach that exploits both the
flexibility of folksonomies and the expressiveness of controlled vocabularies, and which was applied to folksonomies for the first time. The Shell Model allows for indexing that takes the resource’s relevance into consideration, i.e. differently relevant resources are indexed with differently expressive methods of knowledge representation, and the most relevant resources are indexed with the most powerful indexing tool. The goal of this approach is to increase the retrieval probability of a resource. Folksonomies can be used as an additional method of knowledge representation and can complement any controlled vocabulary via the tags. Access to the resources is thus increased further. Besides that, changes in users’ language use can be detected via folksonomies, so that the controlled vocabulary can be adapted more quickly. In this way the folksonomy ‘fertilizes’ the Knowledge Organization System. Since the Shell Model requires a quality control of the information resources before indexing, it is best suited to closed domains, e.g. company intranets or libraries. As has been demonstrated, the user is able to increase folksonomies’ effectiveness in knowledge representation, and hence in information retrieval, via Tag Gardening, as natural language’s variability is restricted and becomes, in Gene Smith’s words, less ‘suspect.’ Performing gardening activities on the personomy level turns out to be unproblematic, since the user can change and use his tags as he sees fit. Still unaddressed is the transfer of Tag Gardening to the database level, which would then concern all users. Future scientific debate must show which approach is the most meaningful and pertinent. Just as open is the question of whether tagging systems and folksonomies can collapse due to rising numbers of users, tags or resources.
The approaches presented in this book, which stem from the growth processes of natural systems, network economics and scientometrics, yielded an inconsistent picture in answering this question, but were still able to provide plausible clues to the growth of tagging systems and folksonomies. The subject ‘Folksonomies’ Network Effects’ opens up an entirely new research area. In the research literature, it has been claimed that the distribution of tags per resource and within the database often follows a Power Law, and this claim was applied, in a generalized way, to all distributions within folksonomies. Here, though, it was assumed that the mere presence of a Long Tail confirms the presence of an informetric distribution. Other forms of distribution were left undiscussed in this context; the possibility of an inverse-logistic distribution in particular was neglected. The inverse-logistic distribution displays the characteristic Long Tail, just like the Power Law, but differs from it very clearly at the beginning of the distribution curve. While the Power Law declines steeply at the beginning and moves into the Long Tail quite quickly, the inverse-logistic distribution shows a more even curve progression with a turning point marking the beginning of the Long Tail. If the beginning of the curve is not analyzed, the inverse-logistic distribution can go undetected and hence be falsely termed an informetric distribution. This book was able to show, via a number of examples, that the inverse-logistic distribution is also to be found in folksonomies. Knowing about the different forms of distribution is important for the development and design of tag recommender systems based on high-frequency tags, for example, or for relevance ranking in information retrieval. Furthermore, they must be taken into account in the determination of Power Tags.
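The difference between the two curve shapes at the head of the distribution can be illustrated numerically. The following is a sketch under stated assumptions: the functional forms f(x) = C / x^a for the Power Law and f(x) = C · e^(−C′(x−1)^b) for the inverse-logistic distribution are common parametrizations assumed here for illustration, and the parameter values as well as the helper `head_ratio` are hypothetical, not results from this book.

```python
import math


def power_law(rank, c=100.0, a=1.0):
    """Power Law (assumed form): f(x) = C / x^a."""
    return c / rank ** a


def inverse_logistic(rank, c=100.0, c_prime=0.01, b=3.0):
    """Inverse-logistic (assumed form): f(x) = C * exp(-C' * (x - 1)^b)."""
    return c * math.exp(-c_prime * (rank - 1) ** b)


def head_ratio(frequencies):
    """Share of the top frequency that the second-ranked item retains.

    A hypothetical indicator: a low value points to the steep initial
    drop of a Power Law, a value near 1 to a flat curve beginning.
    """
    ordered = sorted(frequencies, reverse=True)
    return ordered[1] / ordered[0]


ranks = range(1, 11)
pl = [power_law(r) for r in ranks]
il = [inverse_logistic(r) for r in ranks]

print(head_ratio(pl))  # 0.5
print(head_ratio(il))
```

Both series end in a Long Tail, but only the Power Law loses half of its value between the first and second rank; looking at the tail alone would not separate the two.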
For the subarea of information retrieval, it could be observed that folksonomies represent a further access path to information resources on the World Wide Web; in the case of multimedia resources they are even, with the possible exception of the resources’ titles, the only access path allowing the user to find the data. Their tolerance vis-à-vis multilingual, syncategorematic, performative and judgmental tags can also be deemed an advantage and serves to increase the number of access paths. Folksonomies are thus well suited to solving the ‘vocabulary problem’ in information retrieval. Since the tags are actively added to the resources by the users, and the resources are thus subject to human quality control of whatever nature, folksonomies can further be seen as complementary to the online search engines’ full-text storage. Because tags do not necessarily define terms but are still consciously selected by the user during the indexing process, they sit midway between natural-language representations of information resources, such as words in full texts, and restrictive KOS, which only allow a controlled vocabulary for indexing and retrieval. There is also a quality control for the indexed resources themselves, since they are selected by the users and managed within the collaborative information services. This leads to fewer duplicates in collaborative information services than in the search engines’ hit lists. Folksonomies’ greatest strength in information retrieval is browsing, particularly Pivot Browsing. Because users, resources and tags can all be used as browsing entry points, the resources in collaborative information services can be discovered in a variety of ways. This high degree of linkage can also be used for collaborative filtering or recommender systems, which suggest interesting resources or even other users to the users. The strength of folksonomies for social search is particularly noticeable in this respect.
Browsing is supported by the different visualization methods for folksonomies, first and foremost by tag clouds. Even though tag clouds are no more than alphabetical lists of the most popular tags for a resource or database, they effectively combine browsing with active searching: the user gets an overview of the database via the cloud and performs a query by clicking on a tag. Since there are mostly tag clouds for both the database and the resource level, the user can refine, restrict or expand his query and hit list as he sees fit, which represents a huge advantage of tag clouds and folksonomies. More effective browsing and searching is sought in improved tag clouds, which reflect the semantic closeness of tags in their visualization. Information resources often have further metadata, such as titles or descriptions, but are sometimes not indexed with tags. This circumstance puts those resources at a disadvantage during browsing via tag clouds and in the ranking after an active search. In order to index these resources with tags anyway, this book recommends that successfully used search tags be used as indexing tags. This means that a resource found with a search tag is indexed with that tag. In this way a Broad Folksonomy is created for that information resource. The same procedure can therefore also be used to enrich Narrow Folksonomies with tag frequencies if one wants to make use of tag distributions. This book extensively described the necessity of a relevance ranking for folksonomy-based retrieval systems. According to Lewandowski (2005), “the particular strength of ranking procedures [...] comes to fruition in environments where the system is used by inexperienced users in particular, and where there is no
prior selection of sources” (Lewandowski, 2005, 90)*. The scenario described here is prototypical of the situation in collaborative information services and again clarifies how important a ranking is for supporting the user during information retrieval and rating. That is why both query-dependent and query-independent ranking factors were developed, which consider the features of folksonomies and exploit them for relevance ranking. The developed ranking factors can be allocated to the three large criteria sets ‘tags,’ ‘collaboration’ and ‘users,’ where the latter contain those ranking factors that can be determined independently of the search request. For the ranking algorithm that is ultimately used, a mixture of elements from both kinds of ranking factors is recommended, since the query-dependent factors do not provide for a disambiguation of duplicates in the hit list and a query-independent ranking would always yield the same results in the same arrangement. For the query-dependent ranking factors, Luhn’s Thesis was applied to folksonomies for the first time, and the use of text-statistical means for determining the Retrieval Status Value of a resource was discussed. Furthermore, a new reference value for Inverse Document Frequency (IDF) was introduced: Inverse Tag Frequency (ITF). The goal of text statistics here was to exclude from the determination of a resource’s relevance both tags with too high a frequency, due to their small discriminatory power, and tags with too low a frequency, due to their lack of representativeness (they might contain spelling errors etc.). At this point it is still an open question how to handle tags with the same meaning but different realizations of that meaning, which can distort the text-statistical values. Tags like ‘social_bookmarking,’ ‘social-bookmarking’ and ‘socialbookmarking’ are semantically synonymous, but text statistics treats them as separate words in its term frequency and Inverse Document Frequency.
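How strongly such spelling variants can distort text-statistical values may be illustrated with the three spellings of ‘social bookmarking’ named above. The following sketch makes several assumptions: the normalization rule in `normalize_tag` (lower-casing and dropping separators) is only one simple possibility, the IDF variant log2(N / n) + 1 is a textbook formula rather than the ITF introduced in this book, and the sample resources are invented.

```python
import math
import re
from collections import Counter


def normalize_tag(tag):
    """Lower-case a tag and drop separators ('_', '-', whitespace)."""
    return re.sub(r"[\s_\-]+", "", tag.lower())


def document_frequency(tagged_resources, normalize=False):
    """Count in how many resources each tag occurs at least once."""
    df = Counter()
    for tags in tagged_resources.values():
        df.update({normalize_tag(t) if normalize else t for t in tags})
    return df


def idf(df, n_resources):
    """Textbook IDF variant (assumed here): log2(N / n) + 1."""
    return {tag: math.log2(n_resources / n) + 1 for tag, n in df.items()}


# Invented sample: four resources and their tags.
resources = {
    "r1": ["social_bookmarking", "web2.0"],
    "r2": ["social-bookmarking", "tagging"],
    "r3": ["socialbookmarking", "web2.0"],
    "r4": ["folksonomy"],
}

raw = document_frequency(resources)
merged = document_frequency(resources, normalize=True)

print(raw["social_bookmarking"], merged["socialbookmarking"])  # 1 3
print(idf(raw, 4)["social_bookmarking"])  # 3.0
print(idf(merged, 4)["socialbookmarking"])
```

Without normalization each variant looks like a rare, highly discriminating tag; after normalization the same tag occurs in three of four resources and its IDF value drops accordingly.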
It is to be assumed that merging such tags would lead to values more suitable for the ranking (the tag’s popularity would be augmented, the number of tags for a resource decreased and the frequency of occurrence in the entire database increased), but this hypothesis has yet to be empirically confirmed. The merging could be implemented via alignment with pre-existing KOS or via Natural Language Processing (NLP) and should be situated in the Tag Gardening of knowledge representation as well as in that of information retrieval. Here it is important that any tag-changing measures be implemented before the text-statistical calculation. Also on the basis of statistical calculations, Power Tags can be determined for the resource level and exploited for information retrieval. Power Tags are distinguished by being allocated to a resource often and thus being generally accepted by the majority of users. It is absolutely necessary to consider the underlying tag distribution when determining Power Tags: a Power Law will yield fewer Power Tags than an inverse-logistic distribution. If one assumes that users’ Collective Intelligence is reflected in the popularity and generality of a tag, Power Tags as a retrieval option are capable of restricting the search to resources that are popular for the tag in question, and thus relevant. This would also mean that the resource most frequently indexed with the search tag occupies the top spot in the hit list. Given the inversely proportional relation between Recall and Precision, the retrieval option Power Tags leads to an increase in the hit list’s Precision because the Recall is decreased by the restricted number of resources to be searched. In this book, I presented an exemplary retrieval scenario with the option ‘Power Tags’ and developed a diagram charting the progress of the implementation. In chapter 3, Tag Gardening was explained in detail for knowledge representation.
Even though Tag Gardening is principally regarded as a retroactive
manual structuring endeavor after indexing, (semi-)automatic, system-side support for the user is required due to the many tags to be edited. An exclusive and an inclusive approach were suggested for this, both recommending tag candidates for the user to check and manage. The exclusive approach uses Power Tags and determines those tags that share a co-occurrence relation with them, alerting the user to this connection. The inclusive approach applies similarity and cluster algorithms to the tags and thus also informs users of certain tag-tag relations. This book is based purely on conceptual considerations that combine the methods of knowledge representation and information retrieval with folksonomies, or adapt them to folksonomies’ needs. Numerous detailed questions of folksonomy research were answered here, or approaches to solutions suggested, but this is certain to have raised just as many new research questions, which must be tackled in the future. An implementation of the approaches is also still outstanding, so their applicability and usefulness cannot yet be evaluated in practice. It is surely desirable, then, that the research results presented above be enthusiastically debated in IT and information science and be put to the test as soon as possible.
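As a starting point for such a practical test, the exclusive approach described above might be sketched as follows. Everything specific in this sketch is an assumption: Power Tags are simply taken to be the n most frequent tags of a resource, `min_cooc` is a hypothetical threshold for a co-occurrence relation, and the sample data (the tags several users assigned to one and the same resource) are invented.

```python
from collections import Counter
from itertools import combinations


def power_tags(tag_counts, n):
    """Assumed rule: the n most frequently assigned tags of a resource."""
    return [tag for tag, _ in Counter(tag_counts).most_common(n)]


def cooccurrence(posts):
    """Count how often two tags appear together in one user's tag set."""
    pairs = Counter()
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


def exclusive_candidates(posts, n_power=2, min_cooc=2):
    """For each Power Tag, list the tags co-occurring with it often enough.

    The result is meant as a list of candidates that the user checks
    and manages, not as an automatic decision.
    """
    counts = Counter(tag for tags in posts for tag in set(tags))
    pairs = cooccurrence(posts)
    result = {}
    for p in power_tags(counts, n_power):
        result[p] = sorted(
            (b if a == p else a)
            for (a, b), freq in pairs.items()
            if p in (a, b) and freq >= min_cooc
        )
    return result


# Invented sample: tag sets of five users for the same resource.
posts = [
    ["folksonomy", "tagging", "web2.0"],
    ["folksonomy", "tagging"],
    ["folksonomy", "taxonomy"],
    ["folksonomy", "taxonomy", "tagging"],
    ["tagging", "flickr"],
]
print(exclusive_candidates(posts))
```

Here ‘folksonomy’ and ‘tagging’ act as Power Tags, and the user would be alerted to the relation between ‘folksonomy’ and ‘taxonomy’; whether that relation is one of synonymy, hierarchy or mere association remains a judgment for the user.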
Bibliography
Lewandowski, D. (2005). Web Information Retrieval. Technologien zur Informationssuche im Internet. Frankfurt am Main: DGI.
Maislin, S. (2005). The Indexing Revival. intercom, Feb, 6–8, from http://taxonomist.tripod.com/ftp/IndexingRevival.pdf.
Smith, G. (2008). Tagging. People-powered Metadata for the Social Web. Berkeley: New Riders.
Index of Names
A Acquisti, A. ............................................22, 110 Adamic, L.A. .......................................172, 262 Agichtein, E. ............................. 339, 353f., 393 Ahern, S. ............ 199, 200, 229, 262, 270, 382, 393, 400 Aitchison, J. .........................119, 122, 125, 146 Alani, H............................. 302f., 309, 393, 408 Albert, R...............................................174, 263 Albrecht, C.. 131, 146, 157, 262, 322, 326, 393 Alby, T. ......................... 18, 19, 26, 50, 54, 107 Al-Khalifa, H. S. ............ 1, 10, 155, 191f., 196, 205, 230ff., 246, 247, 250, 254, 258, 260, 262, 309, 364, 386, 393 Ames, M.... 186f., 198, 200, 204, 209, 262, 393 Amitay, E. ..................... 78, 107, 199, 229, 262 Anderson, C. ........................131, 146, 171, 262 Andrich, D............................................161, 262 Angeletou, S.............. 250, 262, 373, 378f., 393 Anick, P................................................305, 393 Ankolekar, A....................13, 20, 107, 249, 262 Arch, X...................................................55, 107 Arrington, M. ...................................26, 38, 107 Auray, N...............................................289, 393 Aurnhammer, M........................ 213, 223, 262f. Ausloos, M........................ 157f., 271, 299, 402 Austin, J. L.................. 200, 263, 278, 356, 393 Avesani, P. .................. 153, 245, 268, 375, 399
B Bächle, M...............................................18, 107 Bacigalupo, F. ........................................49, 107 Back, M................................................295, 398 Baeza-Yates, R................. 119, 136, 146f., 149, 336, 393, 404 Baillie, M. ............................................309, 395 Bak, P. ..................................................179, 263 Baldassarri, A.......................115, 277, 406, 408
Bao, S. ....................... 229, 232, 263, 272, 301, 339ff., 344, 393, 402 Barabási, A...........................................174, 263 Barash, V......................................................407 Barber, E. G. ........................................289, 403 Bar-Ilan, J............................ 56, 107, 130, 146, 191ff., 228f., 263, 376, 386, 394 Barrington, L. ................ 49, 117, 176, 232, 280 Bateman, S. .............. 226, 259f., 263, 315, 319, 322, 394 Bates, M. E.................. 100, 107, 156, 218, 263 Battle, S. ...................... 78, 116, 199f., 229, 280 Bauckhage, C. ....................... 26, 118, 208, 235 Bawden, D........................... 119, 122, 125, 146 Beal, A................................. 232, 263, 346, 394 Bearman, D. ...........................................69, 107 Beaudoin, J.................. 148, 166, 200, 263, 272 Begelman, G....................... 244, 263, 287, 298, 306, 315, 317, 330, 333, 394 Belkin, N. J.................................. 287, 294, 394 Bender, G. ............................................302, 394 Beni, G. ................................................166, 263 Benson, E. A. .........................................37, 112 Benz, D.........................................................395 Berendt, B. .......................... 153, 156, 220, 263 Bergstrom, P.................................................405 Bertram, J. ..... 85, 107, 119, 146, 285, 390, 394 Betrabet, S. ...................................................147 Bhattacharjee, B. ..................................274, 404 Bian, J...........................................................409 Billhardt, H...........................................136, 146 Birkenhake, B...................... 189, 263, 376, 394 Bischoff, K. ......... 192, 197, 229, 263, 312, 394 Blank, M.............................. 215, 263, 290, 394 Blood, R. 
................................................96, 107 Bopp, T.................................................263, 394 Borlund, P. A. ......................................376, 406 Borrajo, D.............................................136, 146 Bowers, J. S..........................................202, 263
boyd, d....................................10, 113, 273, 403 Bradford, S. C. .................................. 370f., 394 Braly, M. ................................................24, 107 Braun, H.................................................13, 107 Braun, M. .....................................334, 360, 394 Braun, S................................243, 253, 263, 282 Breeding, M. ..........................................56, 108 Brill, E.................... 139, 149, 339, 341, 353ff., 360, 393, 405 Brin, S. ........ 138, 146, 149, 340, 356, 394, 404 Broder, A..............................................391, 394 Brooks, C. .......................... 155, 185, 198, 229, 232, 244ff., 263f., 283, 293, 305, 307, 311, 375, 394f. Brooks, H. M. ......................................287, 394 Brownholtz, B..............................................109 Bruce, H. ......................................................265 Bruce, R. ..............................................191, 264 Brusilovsky, P.....................263, 284, 290, 362, 391, 395, 397 Bul, Y. ..................................................148, 272 Buntine, W. ................. 188, 210, 266, 300, 397 Buriol, L. S...................................................264 Burns, S..................................................85, 110 Butler, J. ...............................................161, 264 Butterfield, D. ................. 231, 264, 333, 343f., 360, 376, 395 Buxmann, P..........................................180, 264 Buza, K. ...............................................231, 273 Byde, A. ...............................................209, 264
C Caldarelli, G...................... 177, 264, 306f., 395 Calefato, F....................................209, 245, 264 Campbell, A. ....................................70, 79, 108 Campbell, D. ................... 139, 148, 154, 168f., 177f., 184, 200, 207f., 212, 216, 219f., 228, 241, 255, 264, 270, 364, 401 Campbell, I...........................................146, 396 Candan, K. S. .......................253, 264, 376, 395 Cap, C. .............................................. 257f., 275 Capocci, A................. 174, 177, 264, 306f., 395 Capps, J. ...............................................255, 264 Carboni, D............................................200, 264 Cardew-Hall, M. ................ 120, 150, 204, 218, 278, 315, 319ff., 407 Carlin, S. A. ...............................................6, 10 Carman, M. J........................................309, 395 Carroll, J. M. ................................................266
Casey, M. E. ...........................................56, 108 Cattuto, C. ........... 115, 172ff., 176, 177, 264f., 277, 364f., 376, 379, 384, 395, 406, 408 Caverlee, J. ...........................................308, 398 Cayzer, S. .................... 78, 97, 108, 116, 199f., 209, 229, 264, 280 Chad, K. .............................................. 55f., 108 Chalmers, M.........................................291, 396 Chappell, A. ...........................................68, 118 Chen, H. ..................253f., 265, 286f., 373, 395 Chen, Z.........................................................409 Cheng, X. ...................................... 81, 85f., 108 Cherry, R. .....................................................108 Chi, E. H........................ 206, 265, 284f., 313f., 391f., 395, 397 Chiang, K. ............................................292, 395 Chignell, M. .........................................145, 151 Chirita, P. A. ........................................210, 265 Chopin, K. ................... 96, 108, 226, 230, 265, 285, 302, 310, 395 Choy, S.............. 4, 10, 218, 265, 289, 372, 395 Christiaens, S. ........... 213, 254f., 265, 310, 395 Chu, H. .................................................220, 281 Chua, A. Y. K.......................................272, 402 Chun, S...................................................69, 108 Civan, A. ..............................................161, 265 Clark, J. A. .............................................16, 108 Clough, P................. 13, 70, 78, 80, 108, 194f., 265, 288, 396 Coates, T. ...............................................15, 108 Coenen, T. ... 236, 246, 253, 280, 347, 363, 408 Colaiori, F. ...................................................264 Colombetti, M. ................. 218, 238, 271f., 376, 383f., 402 Cormode, G. .......................... 14, 108, 368, 396 Corrado, E. M.......................................279, 408 Cosentino, S. 
L.......................................55, 108 Cosley, D..............................................277, 406 Costache, S...................................................265 Costello, E. ...........................................264, 395 Cox, A. .................... 13, 70, 78, 80, 108, 194f., 265, 288, 396 Coyle, K. ................................................55, 108 Coyle, M.......................................................397 Crane, D. ................................................16, 108 Crawford, W.......... 56, 108, 123, 146, 288, 396 Cress, U. ...............................................157, 269 Crestani, F. ................ 136, 146, 309, 339, 395f. Croft, W. B.......... 137, 146, 294, 361, 394, 396
Cui, H. ..................................................308, 396 Culliss, G. A.........................................354, 396 Cuomo, D...............................................24, 108 Czardybon, A. ........................................87, 108
D Dabbish, L............................................101, 117 Dale, C. ......................................... 81, 85f., 108 Damianos, L...........................................24, 108 Dannenberg, R. ....................................102, 112 Danowski, P. ..............................15, 56, 59, 109 Dasmahapatra, S. .........................................393 Davis, H. C............................ 1, 10, 155, 191f., 196f., 205, 230ff., 246f., 250, 254, 258, 260, 262, 309, 364, 386, 393 Davis, M........ 10, 111, 113, 270, 273, 400, 403 Davison, B. D.......................204, 281, 309, 408 de Chiara, R. ........................................257, 265 de Moor, A. ......................19, 96, 109, 278, 407 de Moya-Anegón, F. ............................140, 147 de Solla Price, D. J................... 179, 181ff., 265 Dellschaft, K. .............. 157, 175, 178, 265, 394 Dennis, B. M. ....76f., 109, 200f., 265, 343, 396 Deo, H. .............................................. 328f., 398 Derntl, M......................................216, 228, 265 Desal, K................................................148, 272 Di Caro, L. ...........................253, 264, 376, 395 Dice, L. R.......................................... 142f., 146 Dieberger, A..........................................290,396 Diederich, J. .....215, 265, 300ff., 305, 366, 396 Dierick, F. ....................................210, 238, 269 Dietl, H.................................................180, 265 Dmitriev, P........................................ 336f., 396 Dobusch, L.............................................17, 109 Doctorow, C...........................................19, 109 Doi, T. ..................................................215, 274 Donato, D.....................................................264 Dourish, P. .............................70, 114, 291, 396 Downie, J. S. 
..........................................54, 109 Druschel, P...........................................274, 404 Dubinko, M. .........................196, 266, 327, 396 Dugan, C. .......................................... 103f., 109 Dumais, S.T. ................... 266, 339, 353f., 388, 393, 397, 399 Dvorak, J. C. ..........................................13, 109 Dye, J. ..................... 123, 146, 153, 160, 164f., 190f., 201, 260, 266, 283, 293, 298, 315f., 334, 391, 396
E Eck, K.....................................................96, 109 Edelman................................................302, 396 Effendi, F. A.........................................271, 401 Efimova, L. ..................................... 19, 96, 109 Efthimiadis, E. N..................................305, 396 Egghe, L. ..................... 131, 146, 171, 173, 266 Egozi, O................................................362, 397 Ehrlich, K. ........................................... 88f., 113 Eiron, N. .......................................................396 Elbert, N. ................................................20, 109 Eoff, B. .................................................308, 398 Evans, B. M............................... 284, 391f., 397 Evans, R. ..............................................131, 146 Eynard, D. ...... 218, 238, 271f., 376, 383f., 402
F Fake, C. ............................... 264, 266, 395, 397 Farooq, U. ........................... 184, 190, 192, 266 Farrell, S...................... 24, 88f., 109, 113, 234, 266, 301, 397 Farzan, R. .....................................................397 Faust, K. ...............................................145, 151 Fei, B. ...................................................272, 402 Feinberg, J. .................. 20, 24, 113, 289, 291f., 297, 331, 337, 403f. Feinberg, M. ...................... 24f., 109, 113, 153, 156, 167, 220, 225, 266, 289, 397 Feldman, S. ............................... 4, 10, 285, 397 Fengel, J. ..............................................253, 266 Fensel, D. .............................................253, 266 Ferber, R...............................................119, 146 Fichter, D. .........156, 216f., 266, 316, 390, 397 Figge, F. .............................................. 55f., 109 Firan, C. S. ...........................................263, 394 Fischer, G.. ...........................................242, 266 Fischer, T. ..............................................96, 114 Fish, A. .................................................257, 265 Flack, M. ..............................................113, 272 Fokker, J...................... 188, 210, 266, 300, 397 Fontoura, M..................................................396 Forsberg, M..............................290ff., 302, 397 Forsterleitner, C. ....................................17, 109 Fox, E. A. .................................... 133, 147, 150 Fox, K...................................................266, 397 Frakes, W. B................................ 119, 147, 149 Frankowski, D......................................277, 406 Franz, T. .......................................................394 Freyne, J. ..................................... 290, 348, 397
Froh, G. ..................................................24, 107 Fu, F. ....................................................175, 266 Fu, Y. ...................................................281, 409 Fujimura, K. .........................................329, 397 Fujimura, S...................................................397 Furnas, G................... 213, 218, 253, 266, 286, 288f., 337, 367, 397 Furner, J. ............... 56, 109, 310, 312, 390, 397
G Gaiser, B. ................ 157f., 187, 191, 215, 227, 257, 267, 275, 301, 404 Galton, F. .......................................... 166f., 266 Galvez, C..............................................140, 147 Ganesan, K. A. .................................. 328f., 398 Ganoe, C. H. ................................................266 Ganter, B. .....................................................270 Garbe, M. .......................................... 257f., 275 Garcia-Molina, H. ..................... 5, 10, 21f., 26, 111, 156, 168, 171, 192, 210, 214, 232ff., 235, 253, 267, 269, 271, 289f., 311f., 352, 360, 376, 387, 398f., 401 Garfield, E................... 128, 138, 147, 340, 398 Garg, N.........................................204, 258, 267 Garrett, J. J. ......................................... 16f., 110 Gasser, L. .....................................................116 Gaul, W. .................................................37, 110 Gauthier, I. ...................................................272 Geisler, G. ..............................................85, 110 Geisselmann, F.............................214, 242, 267 Geleijnse, G. ..........................................55, 110 Gendarmi, D................ 209, 213, 245, 264, 267 Getoor, L. .....................................................407 Geyer, W. .............................................109, 406 Geyer-Schulz, A...........................................110 Gilbert, L....................................1, 10, 386, 393 Gilbertson, S. .........................................38, 110 Gilchrist, A...........................119, 122, 125, 146 Giles, C. L. ...........................................266, 409 Gissing, B.....................13, 22, 110, 166ff., 267 Giunchiglia, F. .....................123, 147, 379, 398 Godwin-Jones, R.... 23, 79, 110, 178, 259, 267 Goeser, S. .............................................373, 405 Goh, D. 
H.............................................272, 402 Goldberg, D. ........................................295, 398 Golder, S. ...................... 4f., 10, 153, 168, 172, 190, 196, 198, 207, 220, 232, 248f., 266f., 292, 297, 311, 358, 397f. Golov, E. ..................... 258, 267, 380, 390, 398
Golovchinsky, G. .................................295, 398 Golub, K.............................. 249, 267, 373, 398 Gomes, L. .............................. 81, 110, 266, 397 Good, B. M...................................................402 Goodrum, A. ....... 202, 217, 267, 283, 286, 398 Gordon-Murnane, L. ................... 23, 110, 168, 212, 267, 283, 373, 377f., 398 Governor, J.................................. 231, 236, 267 Graber, R. ...............................................48, 110 Graefe, G. .............................. 25, 110, 297, 398 Graham, J. ..............................................70, 110 Graham, R. ...........................................308, 398 Grahl, M. ................... 111, 115, 214, 253, 267, 277, 306f., 398, 406 Granitzer, M. ................... 171f., 176, 184, 192, 272, 285, 347, 363f., 367, 403 Grassmuck, V.........................................17, 110 Greenberg, J. ........................................305, 398 Griffith, J. ...............................................24, 108 Gross, R..................................................22, 110 Gruber, T. .................. 125, 147, 153, 161, 218, 231, 243, 245, 247, 257, 267, 287, 398 Grudin, J.......................................................266 Gruen, D. M. ................................................405 Grün, P. ........................................................108 Gruppe, M. .............................................88, 110 Gruzd, A...............................................311, 398 Gummadi, K. P.....................................274, 404 Güntner, G................................... 244, 252, 267 Gust von Loh, S. ......................... 108, 178, 279 Gutwin, C. ........................... 315, 319, 322, 394 Guy, M. ............................ 5, 10, 185, 205, 212, 218ff., 228ff., 252, 267, 284, 324, 332f., 336, 341, 398 Gyöngyi, P. . 233, 267, 271, 352, 360, 398, 401
H Hahsler, M....................................................110 Halpin, H. .................. 120, 132, 147, 154, 170, 172, 174, 178, 214, 236, 239, 248, 253, 268, 367, 374, 399 Hamm, F.................................................88, 110 Hammersley, B.......................................17, 110 Hammond, T. .................. 24f., 110, 113, 153f., 156f., 165, 185f., 225, 232, 244, 268, 272, 288, 299, 361, 399 Hammwöhner, R. ...................................22, 110 Hampel, T... 117, 225, 263, 265, 267, 268, 394 Han, P...................................................368, 399
Hanappe, P. ............................... 213, 223, 262f. Handschuh, S. ..............................................265 Hänger, C. ................... 60f., 110, 214, 216, 267 Hank, C. .................................................70, 112 Hannay, T.................... 110, 113, 268, 272, 399 Hanser, C..............................153, 156, 220, 263 Har’El, N..............................................107, 262 Harman, D........... 134, 137, 147, 345, 347, 399 Harper, D. J. ........................137, 146, 361, 396 Harper, F. M.........................................277, 406 Harrer, A. .............................................260, 268 Harrington, K. ......................................255, 268 Hassan-Montero, Y............ 198, 212, 220, 248, 268, 289, 298, 311, 314, 316f., 319, 399 Hayes, C...................... 153, 245, 268, 375, 399 Hayman, S............... 24, 37, 88, 111, 153, 212, 252, 268 Hearst, M. A.................. 315f., 318ff., 388, 399 Heckner, M. .................. 6, 10, 184, 192, 197f., 202f., 210, 230, 268, 386, 399 Heinz, H. J..............................................22, 110 Held, C. ................................................157, 269 Heller, L. ................... 15, 24, 55f., 59, 109, 111 Henderson-Begg, C..............................264, 395 Henrichs, N. .........................................128, 147 Hentrich, T. ..................................................402 Hepp, M. ................. 139, 150, 157f., 198, 213, 215, 223, 228, 230f., 253, 280, 302, 304, 347, 363, 376, 379f., 408 Herget, J. ..............................................212, 269 Hering, D......................................................394 Herlocker, J. L. ............................354, 361, 400 Herrero-Solana, V. .... 198, 212, 220, 248, 268, 289, 298, 311, 314, 316f., 319, 399 Heß, A. .......... 25, 110, 210, 238, 269, 297, 398 Heuwing, B.
.....................................56, 58, 111 Heylighen, F.........................................179, 269 Heymann, P............... 5, 10, 21f., 26, 111, 156, 168, 171, 192, 210, 214, 232, 234f., 253, 269, 271, 289f., 311f., 376, 387, 399, 401 Hierl, S. ................................................212, 269 Hill, W. C.............................................292, 399 Hiwiller, D. ..................................................108 Hlava, M. M. K....................................126, 149 Höhfeld, S. ...................................295, 305, 402 Hollan, J. D. .........................................292, 399 Honiden, S............................................215, 274 Höök, K....................290ff., 302, 396, 397, 407 Hornig, F. ...................................18, 70, 81, 111
Hotho, A................ 31, 34, 111, 115, 156, 165, 208, 214, 225, 240, 245, 253, 267, 269ff., 277, 285, 289, 306f., 313, 336, 339, 342f., 346, 352, 360, 372, 379, 395, 398-401, 406 Howe, J.................................................166, 269 Huang, H. .............................................131, 147 Hubermann, B. ...................... 4, 5, 10, 88, 111, 153, 168, 172, 190, 196, 198, 207, 220, 232, 248, 249, 267, 292, 297, 311, 358, 398 Hurley, C. ........................................ 81, 85, 111 Hurst, M. ..............................................388, 399
I Iacovou, N. ...................................................405 Idan, A......................................... 146, 263, 394 Ingwersen, P.........................................222, 269 Iofciu, T........... 215, 265, 300ff., 305, 366, 396 Iyer, H. ........................................ 201, 220, 269
J Jaccard, P. ......................... 142f., 147, 209, 269 Jäckel, M. ...............................................18, 116 Jackson, D. M.......................................376, 400 Jacobi, J. A. ............................................37, 112 Jaffe, A. .......78, 111, 199f., 229, 270, 382, 400 James, D. ................................................16, 108 Jäschke, R...................... 31, 35, 111, 115, 157, 204, 208, 215, 270, 277, 300f., 310, 336, 338, 399f. Jatowt, A. .............................................282, 409 Jeong, W...................................... 192, 210, 270 Jiang, T.................................................287, 400 Joachims, T. .....................................353ff., 400 John, A. ....................... 24, 111, 157, 270, 273, 300, 303f., 342, 400 Jones, C. ...................................... 267, 290, 398 Jones, K. W. .........................................202, 263 Jones, L. A. ................. 79f., 112, 290, 361, 402 Jones, R. .................................................49, 112 Jones, W. ......................................................265 Jörgensen, C. ...................... 168, 201, 213, 253, 270, 286, 288, 376, 400 Jung, S. ........................................ 354, 361, 400 Jungen, P. .....................................................394 Jungierek, M.........................................237, 277
K Kaden, B.................................................56, 112 Kaiser, T.................................................49, 112 Kalbach, J. .....23, 112, 119-122, 147, 240, 270
Kannampallil, T. G. .....................................266 Kautz, H. ................................... 295f., 308, 400 Keller, P. ................... 244, 263, 287, 298, 306, 315, 317, 330, 333, 394 Kellog Smith, M. ................. 69, 112, 153, 191, 213, 219, 221f., 270 Kennedy, L........................ 200, 270, 381f., 400 Kern, R. ...................... 171, 172, 176, 184, 192, 272, 285, 347, 363f., 367, 403 Kerr, B..................... 20, 24, 113, 289, 331, 403 Kessler, M. M. .... 129, 147, 158, 270, 295, 401 Kim, M. H. ...................................................148 Kim, W. Y....................................................148 Kipp, M. E. I. ............................. 120, 123, 139, 147f., 154, 168ff., 177f., 184, 191f., 198, 200f., 206ff., 213, 219, 224, 238, 241, 250, 260, 270, 279, 311, 342, 357, 364, 377f., 401, 408 Kiss, J. ....................................................49, 112 Kiyota, Y............................4, 10, 166, 209, 274 Klasnja, P. ....................................................265 Kleinberg, J. ................ 138, 148, 340, 356, 401 Knautz, K. . 198, 240, 271, 317f., 375, 388, 401 Knees, P. ................................................55, 110 Kohlhase, A. ................................212, 259, 271 Kolbitsch, J. ...................... 218, 271, 384f., 401 Kome, S................................223, 271, 305, 401 Koppula, H. S.......................................274, 404 Koshman, S. .........................................287, 400 Koushik, M. .................................................147 Koutrika, G. .................. 10, 21f., 26, 111, 156, 192, 210, 232-235, 269, 271, 311f., 334, 352, 387, 399, 401 Kowatsch, T. ................... 155, 157, 168f., 172, 178, 272, 338, 403 Krämer, B.....................................................399 Kramer-Duffield, J.................................70, 112 Krause, B.................... 111, 225, 233, 271, 313, 336f., 339, 400f. 
Krause, J............................ 214, 242, 250f., 271 Krishnamurthy, B.... 14, 22, 108, 112, 368, 396 Kropf, K. ..........................................55, 56, 109 Kroski, E. ........................... 130, 148, 157, 166, 200f., 213f., 224, 271, 292, 333, 401 Krötzsch, M. ....................13, 20, 107, 249, 262 Kuhlen, R. ....................................119, 140, 148 Kuhns, J. L. ..........................................136, 149 Kulyukin, V. A.....................................136, 148 Kumar, R................................97, 112, 266, 396
Kuo, B. Y. ................................... 329, 389, 402 Kwak, H. ......................................................113 Kwiatkowski, M.......................... 295, 305, 402
L Laaksolahti, J. ..............................................407 Lackie, R. J.............................................15, 112 Ladengruber, R............................ 117, 225, 268 Lalmas, M. ...........................................146, 396 Lam, S. .................................................277, 406 Lambiotte, R...................... 157f., 271, 299, 402 Lancaster, F. ........................ 119, 148, 221, 271 Lanckriet, G. ................. 49, 117, 176, 232, 280 Landauer, T. .........................................266, 397 Lange, C. ....................................... 13, 17f., 112 Laniado, D...... 218, 238, 271f., 376, 383f., 402 Lanubile, F. ................. 209, 213, 245, 264, 267 LaPitz, A. .....................................................406 Lau, T. ............................. 24, 88, 109, 266, 397 Lauw, H. L. ..................................................407 Law, E. L. M. .......................................102, 112 Layne, S................................................221, 272 Le Bon, G. ............................................169, 272 Lebel, C. .................................................68, 118 Lee, C. S...................... 184, 220, 272, 352, 402 Lee, J. H. ..............................................133, 148 Lee, K. J. ..............................................189, 272 Lee, L. ..................................................358, 404 Lee, S. S. ........................................... 383f., 402 Lee, W. .........................................................147 Lee, Y. J. ......................................................148 Leistner, M. ............................................22, 112 Leonardi, S. ..................................................264 Lerman, K. .................. 70, 79f., 112, 114, 253, 276, 290, 300, 376, 402 Lessig, L.................................................17, 112 Levenshtein, V. I..................................246, 272 Levy, M. 
.................................................55, 112 Lewandowski, D. ............... 133, 139, 148, 150, 339, 340, 390, 402, 416f. Ley, T. ......................................... 210, 244, 275 Li, R..................... 253, 272, 290, 367, 374, 402 Li, Z..............................................................399 Liddy, E. D...........................................136, 148 Lin, X. .... 127, 148, 168, 191f., 200f., 228, 272 Linde, F. ...........................179ff., 183, 216, 272 Linden, G. D...........................................37, 112 Lindstaedt, S................................ 210, 244, 275 Liu, H. ......................................... 204, 210, 279
Liu, J. ............................................ 81, 85f., 108 Liu, L............................................................266 Loasby, K...............................80, 112, 255, 272 Logan, G. D. ........................................161, 272 Logan, R.......................................199, 229, 280 Lohmann, S. .........................................260, 268 Lohrmann, J. ..........................................22, 113 López-Huertas, M. J. ...........................375, 402 Loreto, V. .........................172ff., 176, 265, 408 Lothian, N. ... 24, 37, 88, 111, 153, 212, 252, 268 Lovins, J. B. .........................................140, 148 Luce, R. D. ...........................................161, 272 Luhn, H. P. ...........................134, 148, 348, 402 Lui, A. K. .......... 4, 10, 218, 265, 289, 372, 395 Lund, B. ........ 26, 110, 113, 165, 268, 272, 399 Lux, M...................... 171f., 176, 184, 192, 272, 285, 347, 363f., 367, 403
M Ma, W. Y..............................................396, 409 Maarek, Y. ...........................161, 178, 191, 272 Maaß, C. ...................... 25, 110, 115, 129, 150, 210, 238, 269, 297, 398, 407 Maass, W.................. 155, 157, 168f., 172, 178, 272, 338, 403 Macaulay, C. ........................................301, 403 MacGill, M. J. ......................130, 135, 149, 406 Macgregor, G. ......................... 119, 122f., 148, 212f., 217ff., 272 Mack, M. L. .........................................202, 272 MacLaurin, M. .................. 204, 208f., 228, 272 Maes, P.............. 284, 292, 295, 300f., 406, 409 Magnani, J............................................266, 396 Mai, J....................................................120, 148 Maier, R. ...............172, 178, 190, 242f., 272f., 363-366, 403 Maislin, S. ...........................................413, 417 Maly, K. ................... 157, 213, 215f., 235, 281, 299, 301, 409 Man, J...........................................................409 Maness, J. M. ...................56, 61, 113, 389, 403 Manning, C. D. ....................................119, 148 Mao, J...........................................................281 Maojo, V. .............................................136, 146 Marchese, M. .......................123, 147, 379, 398 Marinho, L. ..........................231, 270, 273, 400 Markey, K. ............... 121, 149, 221f., 248, 273 Markkula, M. .......................................222, 273
Marlow, C. .......... 5, 10, 70, 113, 157ff., 164f., 184, 204, 211, 216, 273, 293, 356, 403 Marlow, J. ...............13, 70, 78, 80, 108, 194f., 265, 288, 396 Marnasse, N. ................................................272 Maron, M. E. ........................................136, 149 Mathes, A. ....................... 5, 10, 120, 149, 154, 156, 196f., 201, 212ff., 217, 228, 230, 273, 285, 289, 292, 367, 403 Matsubayashi, T. Y. T..................................397 Matthews, B. ........................................267, 398 Mayr, P........... 241, 273, 286f., 305, 370f., 403 McCall, R. ....................................................266 McCalla, G. ..................................................263 McCulloch, E. ......................... 119, 122f., 148, 212f., 217ff., 272 McFedries, P. ...................... 20, 113, 153, 167, 248f., 273, 315, 403 McLeod, D. ..........................................383, 402 McMullin, J..........................................106, 113 Medeiros, N..........................................123, 149 Meersman, R. .......................................278, 407 Meeyoung, C......................................85ff., 113 Meinel, C............................. 232, 255, 259, 276 Mejias, U. .................... 228, 252, 273, 297, 403 Ménard, E.............................................221, 273 Meng, P. .................................................49, 113 Menschel, R. ........................................169, 273 Merholz, P. ...................... 153, 212, 227f., 242, 250, 273, 285, 403 Merton, R. K. .............. 173, 205, 273, 289, 403 Metzler, H. ...................................................394 Micolich, A. P. ................................ 85, 87, 113 Mika, P. ............................... 213, 273, 379, 403 Milgram, S. ............................................88, 113 Millard, D. E. .........................................13, 113 Millen, D. R. ............... 
20, 24, 109, 113f., 187, 257, 279, 289, 291f., 297, 331, 337, 403ff. Miller, G.............................. 140, 149, 218, 273 Miller, P............................. 49, 55, 56, 108, 113 Miller, Y...................................... 146, 263, 394 Mishne, G.............................................308, 404 Mislove, A......................... 215, 274, 289f., 404 Mitzenmacher, M. ................................172, 274 Moffat, A..............................................336, 404 Mohelsky, H.........................................169, 274 Mönnich, M............................................61, 113
Montanez, N....................... 155, 185, 198, 229, 232, 244ff., 264, 283, 293, 305, 307, 311, 375, 394f. Moon, J. ...............................110, 113, 267, 398 Moore, M. ............................................109, 406 Mork, K................................................171, 274 Morrison, P. .............. 204, 212, 219, 225, 228, 258, 274, 310, 333, 336, 361, 404 Morville, P. ..........................................153, 274 Motschnig, R................................................265 Motta, E................................214, 262, 278, 393 Motwani, R. .........................................149, 404 Moulaison, H. L. ..................................279, 408 Mourachow, S. .....................................264, 395 Mühlbacher, S. ........................ 184, 192, 197f., 202f., 210, 230, 268 Muller, M. J. ..................... 88f., 109, 113, 187, 191, 193f., 204ff., 210f., 229, 238, 257, 266, 274, 279, 293, 397, 404ff. Müller-Prove, M. ..... 153, 157f., 163, 178, 274 Munk, T................................................171, 274 Munro, A..............................................292, 404 Münster, T....................... 155, 157, 168f., 172, 178, 272, 338, 403 Mytkowicz, T............ 206, 265, 285, 313f., 395
N Na, K. ...........................................220, 255, 274 Naaman, M............ 10, 186f., 189, 198ff., 204, 209, 262, 270, 273f., 382, 393, 400, 403f. Nacenta, M...........................315, 319, 322, 394 Nair, R..................................262, 270, 393, 400 Nakagawa, H......................4, 10, 166, 209, 274 Nakamura, S.........................................282, 409 Nasser, V............................................. 15f., 118 Navon, Y. .....................................................272 Neal, D. ..................... 199, 204, 217, 222, 229, 231, 259, 274 Nejdl, W.......................................263, 265, 394 Neubauer, T. ..... 184, 197, 202f., 268, 386, 399 Newman, M. E. J. ... 131, 149, 172ff., 176, 274 Nichani, M. ............................................96, 114 Nicholas, D. ...............................................4, 10 Nichols, D. ...................................................398 Nie, J. Y. ......................................................396 Nielsen, M. L. ..... 124, 150, 267, 285, 398, 408 Niwa, S.................................................215, 274 Noruzi, A....................... 24, 114, 218, 252, 274 Notess, G. R. ..............................13, 14, 17, 114
Nov, O. .................................................189, 274 Novak, J....................................... 112, 266, 396 Nüttgens, M..........................................253, 266
O O’Reilly, T. ................................... 14, 101, 114 Oates, G....................................... 1, 10, 70, 114 Ochoa, X. .................................... 189, 219, 281 Oddy, R. N. ..........................................287, 394 O'Hara, K. ............................................393, 408 Ohkura, T. ......................... 4, 10, 166, 209, 274 Oki, B. M. ....................................................398 Okuda, H. .....................................................397 Oldenburg, S. .............................. 257, 258, 275 Orchard, L. M.........................................26, 114 Ornager, S. ...........................................222, 275 Ossimitz, G. ................................. 178-182, 275 Osterhouse, J. .......................................277, 406 Ostwald, J.....................................................266
P Page, L......... 138, 146, 149, 340, 356, 394, 404 Paik, W.................................................136, 148 Paiu, R. .................................................263, 394 Palen, L. .................................................70, 114 Palmeri, T. J. ................................................272 Pammer, V. ................................. 210, 244, 275 Pan, Y. X. ...............................................24, 114 Pang, B. ................................................358, 404 Panke, S................... 157f., 187, 191, 215, 227, 257, 267, 275, 301, 404 Panofsky, E. .........................................221, 275 Paolillo, J................ 81, 85, 114, 177, 206, 238, 275, 300, 377, 404 Pascarello, E...........................................16, 108 Passant, A.............................................376, 404 Patashnik, O. ..........................................31, 114 Pedersen, J................... 233, 267, 352, 360, 398 Penumarthy, S. .... 206, 238, 275, 300, 377, 404 Peters, I.......... 5, 10, 108, 121f., 141, 149, 158, 200, 210, 221, 223f., 231, 236-241, 243, 247, 250, 258, 267, 275f., 281, 345, 363, 368ff., 373ff., 380, 387, 390, 398, 404f. Peterson, E. ............. 55f., 114, 120, 149, 191f., 212, 225, 276, 333f., 405 Pfeiffer, H. D........................................279, 408 Pickens, J..............................................295, 398 Picot, A...................................................96, 114 Pietronero, L.............................172ff., 176, 265
Pikas, C. K. ............................................96, 114 Pind, L..........................................195, 208, 276 Pitner, T........................................................265 Plangprasopchok, A. ...................... 70, 80, 112, 114, 253, 276, 300, 376, 402 Plieninger, J. ..........................................56, 114 Pluzhenskaia, M................. 185, 191, 198, 201, 224, 276, 293, 405 Porter, M. F. .........................................140, 149 Pouwelse, J.................. 188, 210, 266, 300, 397 Prakash, A. .......... 139, 149, 341, 355, 360, 405
Q Quain, J. R..............................................49, 114 Quasthoff, M. .......................232, 255, 259, 276 Quintarelli, E...................... 120, 123, 149, 156, 164, 168, 212, 217, 227f., 242, 244, 276, 315ff., 364, 380f., 386, 405
R Rader, E...................... 190, 205, 218, 276, 281, 288, 296, 298f., 301, 408 Raghavan, P. ............... 112, 119, 148, 266, 396 Rainie, L.....................................................1, 10 Rajamanickam, V...................................96, 114 Raschka, A. ..........................................237, 277 Rashid, A..............................................277, 406 Rasmussen, D. I. ..................................179, 276 Rasmussen, E. M. ............. 142f., 149, 222, 276 Rattenbury, T. ......................................270, 400 Raymond, M. .......................................250, 276 Razikin, K. ...........................................272, 402 Reamy, T......................................154, 219, 276 Rebstock, M. ........................................253, 266 Redmiled, D. ................................................266 Redmond-Neal, A. ...............................126, 149 Rees-Potter, J. K. .................................376, 405 Reeves, B. ....................................................266 Regulski, K. ...............................25, 31, 33, 114 Reichel, M....................................212, 259, 271 Reinmann, G. ...................................19, 25, 114 Resmini, A. .........................168, 227, 244, 276, 279, 315ff., 380f., 386, 405, 408 Resnick, P. ........................ 295f., 301, 396, 405 Ribeiro-Neta, B....................................136, 146 Richardson, M..... 139, 149, 341, 355, 360, 405 Riedl, J. ............................................. 277, 405f. Rivadeniera, A. W. ................... 319, 321f., 405 Robertson, S. E. ...................137, 149, 361, 405
Robu, V. ..................... 132, 147, 172, 174, 178, 268, 367, 399 Rocchio, J. J. ....................... 135, 149, 361, 405 Rodriguez, P.................................................113 Röll, M. ..................................................96, 115 Romero, D. M. .......................................88, 111 Rorissa, A............................ 202, 276, 335, 405 Rosati, L. ............................ 168, 227, 244, 276, 315ff., 380f., 386, 405 Rosch, E. ..............................................201, 276 Rosenfeld, L. ....................... 225, 250, 255, 276 Roseway, A. ................................ 199, 229, 280 Rosner, D. ............................. 315f., 318ff., 399 Rosson, M. B........................ 13, 15f., 113, 118 Röttgers, J...............................................13, 115 Rousseau, R......................... 131, 146, 173, 266 Rowlands, I. ...............................................4, 10 Royer, S................................................180, 265 Ruge, G. ...............................................373, 405 Ruocco, S. ............................................257, 265 Russ, C. ........................................ 178-182, 275 Russell, T. .........178, 189, 277, 316, 327, 405f.
S Sabou, M. .............................................262, 393 Sack, H. ................. 85, 87, 115, 226, 232, 255, 259, 277 Sadr, J...........................................................272 Salton, G......................... 130, 133ff., 142, 149, 150, 339, 405f. Sandler, M. .............................................55, 112 Sandusky, R. ........................................290, 406 Sanna, S................................................200, 264 Sapino, M. L........................ 253, 264, 376, 395 Savastinuk, L. C. ....................................56, 108 Schachner, W. ........................................13, 115 Schachter, J. .................. 26, 115, 264, 266, 397 Schedl, M. ..............................................55, 110 Schiefner, M.............. 219, 259f., 277, 297, 406 Schillerwein, S. ............................ 283, 303, 406 Schmidt, A. ............................... 242f., 263, 272 Schmidt, J................ 19, 88, 96, 115, 153, 156, 160, 212, 277 Schmidt, S. ...........................................359, 406 Schmidt-Thieme, L. ... 110, 231, 270, 273, 400 Schmitz, C..................... 15, 31, 111, 115, 153, 207ff., 214f., 253, 270, 277, 289, 399f., 406 Schmitz, J. ............................................128, 150 Schmitz, P. .......... 196, 214, 246, 277, 379, 406
Schnake, A. ............................................96, 115 Schneider, J. W. ...................................376, 406 Schulenburg, F. ....................................237, 277 Schulte, J. .....................................117, 263, 394 Schütt, P. ................................................23, 115 Schütze, H. ...........................................119, 148 Scott, J..........................................110, 268, 399 Seeger, T. .............................................119, 148 Seligmann, D. .......................24, 111, 157, 270, 300, 303f., 342, 400 Selman, B.................................. 295f., 308, 400 Sen, S. ..................... 164, 173, 189f., 196, 207, 277, 290, 293, 301, 319, 346-349, 355, 368, 378, 406 Servedio, V. .................................................264 Settle, A................................................136, 148 Shachak, A. ..................................146, 263, 394 Shadbolt, N. .................................................393 Shah, M. .................................... 295f., 308, 400 Shannon, C. E. .....................................313, 406 Shardanand, U......................284, 295, 301, 406 Shatford, S............................................221, 277 Shaw, R. .................................................13, 115 Shekita, E. ....................................................396 Shen, K.................................................214, 277 Shepard, H......... 120, 132, 147, 154, 172, 174, 178, 214, 236, 239, 253, 268, 367, 374, 399 Shepard, R. N.......................................161, 277 Sherman, C.................................4, 10, 391, 406 Shirky, C. ................... 120, 123, 129, 131, 150, 161, 171, 175, 177, 216, 218, 227, 273, 278, 285, 363, 406 Shoham, S. ...................................146, 263, 394 Sifry, D.........................................1, 11, 96, 115 Sigurbjörnsson, B..... 196, 204, 206f., 209, 278 Sinclair, J.................... 
120, 150, 204, 218, 278, 315, 319ff., 407 Sinha, R....... 20, 115, 161f., 217, 278, 289, 407 Sint, R. .........................................244, 252, 267 Siorpaes, K...................... 139, 150, 157f., 198, 213, 215, 223, 228, 230f., 253, 280, 302, 304, 376, 379f., 408 Sivan, R................................................107, 262 Sixtus, M. ...............................................13, 115 Skusa, A. ............... 25, 115, 129, 150, 297, 407 Smadja, F. ................. 244, 263, 287, 298, 306, 315, 317, 330, 333, 394 Small, H. ........................... 129, 150f., 158, 278 Smith, B. ................................................37, 112
Smith, G. ... 5f., 11, 20f., 26, 66, 106, 115, 153f., 157, 164, 236, 238, 245, 278, 348, 372, 374, 407, 411, 417 Smith, L........................................................116 Smith, M...... 191, 221, 278, 284, 302, 390, 407 Smyth, B...............................................290, 397 Sneath, P. H. A.................................. 142f., 150 Soffer, A...............................................107, 262 Solana, V. H. ........................................140, 147 Song, Y.........................................................266 Sormunen, E.........................................222, 273 Soroka, V. ....................................................272 Spalding, T. ......... 37, 43, 61, 66, 116, 372, 407 Spärck Jones, K................ 119, 134, 137, 149f., 361, 405 Specia, L.............................. 214, 262, 278, 393 Spiering, M.............................................61, 113 Spindler, G. ............................................22, 112 Spink, A............... 202, 217, 267, 283, 286, 398 Spiteri, L. ...................1, 11, 23f., 55f., 59, 116, 191, 212, 223, 227, 229f., 278 Spyns, P................... 155, 170, 243, 248f., 253, 278, 387, 407 Staab, S............... 157, 175, 178, 253, 265, 278 Stallman, R. M. ......................................17, 116 Star, S. ..................................................153, 278 Steels, L....... 157f., 213, 223, 253, 262f., 278f. Stegbauer, C. ..........................................18, 116 Sterling, B. ................ 1, 11, 167, 279, 292, 407 Stock, M. ................ 119, 121, 125, 129, 142ff., 150, 161, 165, 221, 279, 306, 376, 388, 407 Stock, W. G. ..................... 45, 116, 119ff., 125, 129, 131ff., 138f., 141-144, 149f., 158, 161, 165, 177f., 200, 221, 223, 230, 275, 279, 281, 293, 305f., 313, 336, 338ff., 345ff., 351, 359, 361, 363f., 368ff., 376, 378, 388, 404, 406f., 409 Storey, M. A.
D....................................315, 407 Strauch, D.............................................119, 148 Strogatz, S. H. ..................... 88, 117, 145, 151, 214, 281, 289, 408 Strohmaier, M. .......................... 376, 386f., 407 Stuckenschmidt, H. ..............................253, 279 Studer, R...............................................253, 278 Stumme, G. ....................... 111, 115, 214, 225, 253, 267, 270f., 277, 306f., 313, 336, 339, 395, 398-401, 406 Sturtz, D. ..................... 70, 116, 153, 212, 225, 228, 279, 339, 407
Stvilia, B. ...............................................22, 116 Su, D. ...................................................281, 409 Su, Z. ....................................................272, 402 Subramanya, S. B.........................204, 210, 279 Sundaresan, N. .................................. 328f., 398 Surowiecki, J................................ 166-170, 279 Sushak, M. ...................................................405 Svensson, M......... 290ff., 295, 301f., 397, 407 Sweda, J. ..........................................57, 59, 116 Szekely, B. .. 232, 279, 334, 339, 341, 347, 408 Szomszor, M. ...............................296, 299, 408
T Tanaka, K.............................................282, 409 Tang, C.................................................179, 263 Tassa, T. .......................................111, 270, 400 Taylor, A. G. ........................................119, 150 Tennis, J. ................... 120, 150, 155, 160, 212, 247ff., 255, 279 Tepper, M...............................................18, 116 Terdiman, D. ........................................217, 279 Terrio, R. D. ...........................................15, 112 Terry, D........................................................398 Thalmann, S. .............. 172, 178, 190, 242, 273, 363-366, 403 Thompson, A. E. ................................. 55f., 116 Thompson, C..........................................88, 116 Thom-Santelli, J...........................187, 257, 279 Tochtermann, K. ..13, 22, 110, 115, 166ff., 267 Toffler, A. ..............................................18, 116 Tomkins, A. .................................112, 266, 396 Tonkin, E............. 5, 10, 155, 159f., 168, 170f., 185, 198f., 201f., 205, 212, 218ff., 228ff., 231, 245, 252, 267, 279, 284, 289, 324, 332f., 336, 341, 353, 364f., 381, 398, 408 Torniai, C. ................... 78, 116, 199f., 229, 280 Torres, E...... 232, 279, 334, 339, 341, 347, 408 Tourte, G. J. L. .......................... 198f., 230, 279 Toyama, K....................................199, 229, 280 Trant, J. ................ 68f., 107f., 116f., 153, 191, 193, 195, 280 Tredinnick, L. ..................................14, 17, 117 Tschetschonig, K. ............37, 43, 117, 225, 268 Tudhope, D. ........ 124, 150, 267, 285, 398, 408 Turnbull, D.................... 49, 117, 176, 232, 280 Twidale, M. B. .............................................116
U Udell, J. ..................................88, 117, 236, 280
V van Damme, C................. 139, 150, 157f., 198, 213, 215, 223, 228, 230, 231, 236, 246, 253, 280, 302, 304, 347, 363, 376, 379, 380, 408 van Harmelen, F...................................253, 279 van Hooland, S.................... 259, 280, 335, 408 van House, N. A............................ 70, 79f., 117 van Rijsbergen, C. J. ............................146, 396 van Veen, T. ...........................................15, 117 van Zwol, R................... 70, 79, 117, 196, 204, 206f., 209, 278 Vandenbussche, J.................................278, 407 Vander Wal, T..................... 5f., 11, 154f., 157, 164f., 167, 171, 213, 226f., 257, 280, 283, 295, 363, 408 Vandijck, E.......................... 236, 246, 253, 280 Varian, H. .................................. 295f., 301, 405 Veeramachaneni, S. .... 153, 245, 268, 375, 399 Veres, C.... 193, 205, 213, 228, 248f., 253, 280 Viégas, F. .................................. 315f., 319, 408 Vollmar, G. ..........................................236, 280 von Ahn, L. .......... 100ff., 112f., 117, 266, 397 Voß, J. .................61, 68, 117, 122f., 125, 132, 150f., 247, 256, 280, 311, 377, 408 Vrandecic, D. .................. 13, 20, 107, 249, 262 Vuorikari, R. ............................... 189, 219, 281
W Waern, A. .....................................................407 Waitelonis, J.................... 85, 87, 115, 226, 277 Walter, A. .....................................................263 Wan, H. ................................................209, 264 Wang, J................ 166, 204, 263, 281, 309, 408 Wang, L........................................................266 Wang, Z........................................................399 Wash, R. ..................... 190, 205, 218, 276, 281, 288, 296, 298f., 301, 408 Wasserman, S.......................................145, 151 Wattenberg, M. ......................... 315f., 319, 408 Watts, D. J. .................. 88, 117, 145, 151, 214, 281, 289, 408 Webb, P. L. ................................... 81f., 84, 117 Webber, W. ..................................................404 Weber, I...................................... 204, 258, 267 Weber, S...........................See Gust von Loh, S. Weber, V. ...............................................13, 107 Webster, J.................................... 354, 361, 400 Wegner, J. ..................... 18, 49, 52, 70, 90, 117 Weiber, R. ............................................180, 281
Weinberger, D.................. 6, 11, 120, 129, 151, 153, 205, 224, 261, 281, 298, 333, 366, 408 Weiss, A.................... 15f., 18, 49, 117, 166ff., 208, 281, 288, 408 Weller, K................... 122, 125, 149, 151, 210, 223f., 231, 236-241, 243, 247, 258, 267, 275f., 281, 373ff., 378, 380, 387, 390, 398, 404f., 409 Wen, J. R......................................................396 West, J....................................................55, 117 Westcott, J..............................................68, 117 Westenthaler, R............................244, 252, 267 Wetzker, R. ............................25, 118, 208, 235 Wexelblat, A. .......................292, 300, 396, 409 Whittaker, M. Y. S...............289, 291, 297, 404 Wierzbicka, A. .....................................193, 281 Wiesenfeld, K. .....................................179, 263 Wiggins, R. ..........................................336, 409 Wilcox, E. ....................................109, 266, 397 Wilkinson, M. D. .........................................402 Willett, P. .............................................119, 150 Wills, C. E..............................................22, 112 Winget, M. ................ 70, 118, 195f., 200, 228, 281, 330, 343, 409 Winograd, T. ........................................149, 404 Wisniewski, J. ......................................390, 397 Wolff, C. ...............184, 192, 197f., 202f., 210, 230, 268, 386, 399 Won, D. ................................................383, 402 Wong, A...............................................142, 150 Wong, C. ....................... 70, 112, 300, 376, 402 Wu, F......................................................88, 111 Wu, H. ..................... 133, 150, 157, 213, 215f., 235, 281, 299, 301, 409 Wu, L. ..................................................214, 277 Wu, X. .............. 213, 215, 250, 253, 281f., 292, 387, 409 Wusteman, J...........................................17, 118
Wyman, B. .....................................69, 108, 117
X Xi, W............................................................409 Xu, C. ...................................................220, 281 Xu, Z. ................... 204, 206f., 209ff., 228, 281, 290, 293, 297, 351f., 409 Xue, G. ............................................. 353ff., 409
Y Yaltaghian, B. ......................................145, 151 Yanbe, Y. .............................199, 282, 342, 409
Yang, C. S. .................. 142, 150, 220, 255, 274 Yang, F.........................................................399 Yang, J..................................................262, 393 Yang, K. .......................................................266 Ye, C. ...................................................189, 274 Yen, Y. ...................................................81, 118 Yong, H. S............................................384, 402 York, J. ...................................................37, 112 Yu, E. S. ..............................................136, 148 Yu, Y. ...............213, 215, 250, 253, 272, 281f., 292, 387, 402, 409
Z Zacharias, V. ............................... 243, 263, 282 Zadeh, L. A. .........................................212, 282 Zaihrayeu, I. ........................ 123, 147, 379, 398 Zanarini, P. ...........................................200, 264 Zang, N................................................ 15f., 118 Zeng, H.........................................................409 Zeng, M. L. ..........................................124, 151 Zerdick, A. ...........................................182, 282 Zha, H...........................................................409 Zhang, L. ...213, 215, 253, 281f., 292, 387, 409 Zheng, H...............................................250, 282 Zheng, S. ......................................................409 Zhou, D. ...............................................285, 409 Ziegler, C. ............................................358, 409 Ziehe, M. ................................................49, 107 Zimmermann, C. ................... 26, 118, 208, 235 Zimmermann, H. H. .............................140, 151 Zobel, J.........................................................404 Zollers, A. ............... 189, 198f., 230, 279, 282, 358, 409 Zubair, M. ............... 157, 213, 215f., 235, 281, 299, 301, 409
Subject Index
43Things .......................... 22, 90, 105, 160, 315
A aboutness.............................................. 221, 411 author aboutness ......................................222 indexer aboutness.....................................222 request aboutness .....................................222 user aboutness ..........................................222 aboutness tags ...............................................198 abstraction relation........................................124 access ........ 260, 283, 285-288, 316, 367, 387f., 411, 415f. accidental thesaurus ................................... 336f. active searching.... 287, 290, 298, 311, 334, 416 adjusting the ranking factors.........................361 affective tags ................................ 198, 342, 357 AG Social Media ................................. 390, 393 aggregation (of ranking factors) .......... 353, 356 AJAX ... See asynchronous javascript and XML alphabetical arrangement of the folksonomy ... 315 AltaVista .......................................................310 Amazon ......... 21, 37, 43, 62, 90, 105, 233, 301, 315, 359, 362, 390 quick tagging..............................................39 spam prevention.......................................363 anomalous state of knowledge......................287 ANSI/NISO Z39.19 ............................. 126, 146 AOL ..............................................................312 API .......See application programming interface application programming interface ................15 archives ...........................................................55 ASK............. See anomalous state of knowledge Ask.com ........................................................390 association relation .................................... 124f. association rules............................................208 asynchronous javascript and XML .............. 15f.
Atom feeds ............................................. 17, 301 Audioscrobbler................................................49 authorities............................................. 138, 356 Authority (Technorati)..................................100
automatic spammer detection ...................... 234 automatic tagging......................................... 232 AutoTag........................................................ 308
B baby tags....................................... 239, 351, 387 bag model ..................................................... 164 ballast ........................................... 294, 367, 413 Basic Level........................................... 202, 335 Basic-Level tags...................... 201, 220, 335 Basic-Level theory .................................. 201 Bayes’ Theorem ........................................... 136 betweenness ................................................. 145 bibliographic coupling ........ 129, 158, 295, 300, 302, 340 BibSonomy........25, 31, 56, 59f., 105, 165, 233, 245, 258, 310, 337, 342, 380, 400 BibTip ............................................................ 61 binary relevance distribution ....................... 131 Blinklist .......................................................... 25 blog search engines .................................. 15, 96 Bloglines ................................................... 17 Blogoscoop................................................ 96 Blogpulse .................................................. 96 Technorati ............... 1, 23, 96, 105, 178, 363 Bloglines ........................................................ 17 Blogoscoop..................................................... 96 Blogpulse ....................................................... 96 blog tags (Technorati) .................................... 97 blogs ..............................................15, 19, 96ff. BonPrix .......................................................... 43 Boolean operators ........................................ 133 Boolean retrieval models ............................. 133 (mixed) minimum-maximum model....... 133 bottom-up categorization ............................. 213 Bradford’s Law of Scattering ...................... 370 Broad Folksonomies ............104, 164-167, 170, 212, 259, 297, 346, 363, 413, 416
bag model ................................................ 164 collaborative tagging............................... 164
free-for-all tagging...................................164 browsing............ 288-293, 311, 314, 316f., 320, 371f., 416 Bundesministerium für Justiz ............... 17, 108
C categorization....................................... 161, 202 category system.............................................249 centrality .......................................................145 centralized tag collection point.....................257 Ciao .................................................................19 citations ................................ 128, 138, 292, 340 citation analysis.................................. 138f., 340 citation chains ...............................................289 citation indexing .......................... 120, 128, 144 CiteULike....................... 26, 165, 258, 311, 378 classification systems....... 2, 104, 120, 122, 126 Cleveland Museum of Art ..............................69 click rates ............................................. 354, 355 Click-Rate Weight ........................................355 clickthrough rates..........................................292 Clipfish............................................................80 Clipmarks......................................................226 closeness .......................................................145 Cloudalicious ....................................... 178, 327 cluster algorithms................. 137, 214, 301, 379 cluster analysis ..................................... 142, 306 clusters (Flickr) .............................................330 co-citations........................... 129, 158, 172, 340 cognitive effort searching ......................................... 290, 412 tag cloud...................................................320 tagging..................................... 161, 163, 225 cognitive skills (during tagging)...................163 cold start problem ................................ 336, 338 collaboration (ranking factor)...... 345, 353, 417 collaboration in the creation of a retrieval system ......................................................293 collaborative filtering................ 
293, 295f., 299, 301, 416 collaborative filtering mechanism of the community ...............................................297 collaborative indexing ......................... 165, 288 collaborative information services........ 1, 6, 15, 19f., 22, 104, 153, 155, 227, 255, 283f., 287, 292, 296f., 301, 333, 363, 391, 412f., 416f. classification ................................. 6, 15, 104 collaborative intelligence..............................167
collaborative recommender systems........... 215, 300f., 391 CollaborativeRank ....................................... 341 collaborative tagging............................ 157, 164 collabulary.................................................... 231 collective indexing ....................................... 287 Collective Intelligence .........164, 166-169, 177, 231, 363, 366, 417 critical mass............................................. 168 in folksonomies ...................................167ff. in knowledge representation ........... 167, 170 for quality control ................................... 168 collective tagging ......................................... 157 collective vocabulary ................................... 154 combination of the exclusive and inclusive approaches (tag gardening)..................... 375 Commentator Weight................................... 356 comments ..................................................... 292 commercial information services............. 15, 45 WISO........................................... 45, 47, 105 Engineering Village .......................45ff., 105 communication and socializing .............. 15, 19 communities ................... 290f., 298, 301ff., 411 community tags ............................................ 201 Complete-Link Procedure.................... 144, 302 compounds ........................................... 184, 230 concept-based Tag Gardening ..................... 376 concepts........................................................ 293 conceptualization ......................................... 161 concordances................................ See tag clouds concordances (Amazon)............................... 315 conflation ...........................................140f., 414 conformity theory......................................... 190 Connotea ........................ 26, 165, 258, 335, 339 consensus-based indexing............................
168 consensus-based Tag Gardening.................. 376 content-based recommender systems ........ 299f. content-based Tag Gardening ...................... 376 content creator.............................................. 165 controlled vocabularies .......... 3, 120, 122, 128, 178, 225, 227, 249, 255, 286ff., 310, 338, 364, 411 co-occurrence ....................................... 124, 142 co-occurrence analyses ..............208f., 214, 232 Copy Left ....................................................... 17 co-resources ................................................. 158 cosine.................................................... 135, 142 co-tags .......................................................... 172 course code................................................... 260
CoW ..........................See Commentator Weight creation of a knowledge base................... 15, 19 creation of KOS ........................... 364, 377, 414 Creative Commons .........................................17 criteria-centric approach ...................................3 criteria for relevance ranking........................345 critical mass ................................. 168, 216, 337 Collective Intelligence .............................168 recommender systems..............................301 success of tagging systems .....................260 tag distributions........................................178 users .........................................................336 Croft-Harper formula....................................137 cross-database tagging ............................... 256f. cross-platform search....................................334 crowd psychology .........................................169 crowdsourcing...............................................166 CRW ..............................See Click Rate Weight
D decomposition of compounds...........139ff., 230 degree ............................................................145 del.icio.us ................ 1, 25ff., 60, 105, 166, 258, 289, 296, 298f., 306ff., 310-314, 318f., 322, 324f., 327, 330f., 334, 339, 341, 360, 364f., 380, 383f., 387 del.icio.us graph......................................... 322f. del.icio.us Soup.................................... 322, 323 descriptors .................................... 125, 252, 311 descriptor set .................................................125 design (tag clouds) ..................................... 321f. desire lines ........................................... 214, 253 development of retrieval systems based on folksonomies ............................................335 Dewey Decimal System................................127 Dice Coefficient............................................142 Die Zeit .........................................................331 digg..................................................................19 DIN 1463/1 .................................. 120, 126, 146 DIN 31.626/3 ....................................... 388, 396 DIN 32705 ........................................... 127, 146 disadvantages of information retrieval with folksonomies ............................................332 discovery of communities.............................215 discriminatory power of tags .............. 134, 206, 238, 311, 347, 417 DMOZ...........................................................248 document (type) tags............................ 201, 241 documentary reference units................ 226, 412
documentation languages........... See knowledge organization systems dogear........................................... 291, 331, 337 Dogear Game ............................................... 103 DogmaBank ................................................. 243 DRU ...............See documentary reference units Dublin Core Metadata.................................. 130 Dutch National Archive ............................... 335
E e-commerce .................................................... 37 Amazon .......... 37, 43, 62, 90, 105, 233, 301, 315, 359, 362, 390 BonPrix ..................................................... 43 edges............................................................. 144 effectiveness of tag clouds for visualizing search results ........................................... 329 effectiveness of tagging systems (evaluation method) ................................................... 312 E-Learning .........................................226, 259f. electronic marketplaces.................................. 15 Elsevier........................ See Engineering Village eMarketer ................................................... 1, 10 emergent pragmatics .................................... 387 emergent semantics ............253f., 290, 374, 387 emoticons ..................................................... 328 emotional tags .............................................. 198 Engineering Village ............................45ff., 105 enterprise search........................................... 283 equivalence relations..................124f., 128, 241 error detection and correction (NLP)... 139, 141 ESP Game ...................................... 21, 101, 258 event tags...................................................... 257 exact match .................................................. 133 exchangeability of tags ................................ 257 exchangeable image file........... 72, 77, 200, 229 EXIF ......................See exchangeable image file expert finders........ 303, See also More like Me! expert recommender............ 304, See also More like Me! ExpertRank........................................... 304, 342 explicit user feedback .................................. 356 exploratory search .........289, See also browsing Extended Narrow Folksonomies......... 104, 166, 212, 242, 259, 413 Extispicious.................................................. 
324 extracting place and event tags .................... 381
F Facebook .........................................................88 FaceTag.............................................. 244, 380f. faceted knowledge organization system.......244 faceted thesaurus............................ See FaceTag Federal Ministry of Justice ......................... See Bundesministerium für Justiz FeedMe .........................................................301 feed reader.......................................................17 Bloglines ....................................................17 NewsGator .................................................17 feedback loop ....................................... 205, 254 fertilizing (tag gardening) ............ 241, 250, 414 field-based retrieval ......................................386 field-based tagging....................... 192, 229, 236 Flickr ............... 1, 69ff., 105, 165f., 229, 257ff., 288, 296, 312, 326ff., 330, 333f., 339, 343f., 351, 360, 376, 378, 380ff., 384, 386, 389f., 400 Flickr Cluster ................................................378 Flickr Graph ..................................................326 folder ............................................................ 24f. folder structure..............................................227 FolkRank......................36, 313, 342f., 360, 379 FolksAnnotation...................................
260, 386 folksonomic zeitgeist (The Guardian) ..........315 folksonomies .........1, 122, 153, 155, 167, 283f., 286f., 290ff., 391, 411, 413, 415, 418 definition ..................................................153 advantages discovery of communities ......................215 development & maintenance of KOS ....213 follow desire lines ..................................214 Fuzzy Logic............................................212 language of users....................................212 Long Tail ................................................212 more like this/me....................................215 neologisms..............................................213 network properties..................................216 provide broader access ...........................213 scalability................................................216 serendipity ..............................................217 small world .............................................214 solve heterogeneity problem ..................214 solve vocabulary problem ......................213 tag clouds................................................218 quality control ........................................216 disadvantages cognitive effort .......................................225
language variability ............................... 218 less Media Functions for Tags .............. 225 Long-Tail tags ....................................... 220 metadata problem .................................. 224 no segmentation of resources ................ 226 no use of paradigmatic relations ........... 222 spam tags ............................................... 224 tagging motivations ............................... 220 unstructured list of keywords ................ 219 various indexing levels.......................... 221 various tag specificity............................ 220 usability ................................................... 225 effectiveness............................................ 390 in information retrieval ........................... 284 folksonomy-based person recommender systems ................................................. 303, See also More like Me! folksonomy-based recommender systems .. 299, 303, 391, See also collaborative filtering font colors (tag clouds) ................................ 329 font size (tag clouds) .................................... 318 fRank ............................................................ 341 free-for-all tagging ....................................... 164 Free Rider Problem (problems of recommender systems)................................................... 296 from chaos comes order ............................... 168 functional tags ................. See performative tags Furl ......................................................... 25, 310 future retrieval.............................................. 162 Fuzzy Logic.................................................. 133
G Games with a Purpose.............. 15, 23, 100, 411 Dogear Game .......................................... 103 ESP Game ................................. 21, 101, 258 Google Image Labeler............................. 101 Steve Museum......................................... 102 TagaTune................................................. 102 GBI Genios ........................................See WISO genidentity............................................ 124, 128 generation of hierarchical clusters ............... 245 generation of semantic metadata.................. 247 genre-specific tags ....................................... 241 geotags......................... 72, 77f., 199f., 229, 383 Gibeo ............................................................ 226 Gimp............................................................. 389 Gmail...................................................... 19, 335 goal-sharing services.............................15, 22ff. 43things ............................... 22, 90, 160, 315
good tags .............................................. 211, 349 Google.......................................... 310, 313, 340 Google Image Labeler ..................................101 graphs............................................................145 greenhouses...................................................239 Group-Average-Link Procedure .......... 144, 302 Grouper .........................................................325 GroupLens.....................................................295 Guggenheim Museum.....................................69 GWAP.......................See games with a purpose
H heterogeneity problem ..................................214 heterogeneity treatments...............................242 hierarchical relations ..... 124f., 127f., 140f., 223, 241 part-of relation (meronymy) ....................124 abstraction relation (hyponymy)..............124 genidentity....................................... 124, 128 instance ....................................................124 HITS..............................................................340 Hive Mind .....................................................166 homonym disambiguation ................. 140f., 231 homonyms............ 125, 127f., 140f., 220, 239f., 294, 330, 380, 383 hubs ...................................................... 138, 356 human factor in browsing .............................290 hybrid recommender systems .......................300 hyperonyms.................................. 126, 305, 309 hypertext structures.......................................289 hyponyms................................... 126f., 305, 309 hyponymy .....................................................124
I iconographical level......................................221 iconological level..........................................221 iconology.................................................... 221f. identification of context-specific tags ..........141 IDF ................. See inverse document frequency IDX................................................................140 IFLA....................................................... 65, 111 images ...........................................................283 Image Notion ................................................243 implicit relevance feedback .........See click rates implied consensus .........................................167 indexer...........................................................153 indexer vocabulary........................................288 indexing.............................. 120f., 153, 338, 414 indexing levels ..............................................221 indexing mass information ...........................258
indexing with controlled vocabularies......... 161 information exchange via tags ..................... 259 information filtering...........................293f., 298 advantages ............................................... 299 information filters ................................ 293, 297 information flood ........................... 23, 227, 285 information linguistics ................................. 139 information need .......................... 294, 320, 324 information overload.................................. 412f. information retrieval ................... 119, 130, 160, 283ff., 294, 299, 411, 416, 418 information retrieval 2.0 .............................. 391 information theory ....................................... 313 informetric distribution ........................ 131, 364 inlinks.........................................138f., 340, 356 instance......................................................... 124 instant messaging ........................................... 15 interaction history ................................ 292, 300 Interestingness.......................................... 73, 76 Interestingness Ranking .............343f., 351, 390 inter-indexer consistency ..... 121, 221, 248, 293 International Patent Classification (IPC)..... 127 interoperability ............................................. 241 intra-indexer consistency ............. 121, 221, 413 inverse document frequency .............134, 347f., 350, 417 inverse logistic distribution...............132, 176f., 364, 415 inverse resource frequency .......................... 347 inverse tag clouds................................. 178, 239 inverse tag frequency ................... 348, 350, 417 IRF.................... See inverse resource frequency island effect .................................................... 56 isness .......................................................... 221f. ISO 2788 .............................................. 
126, 147 iterative stemmer.......................................... 140 Porter Stemmer ....................................... 140 ITF............................. See inverse tag frequency
J Jaccard-Sneath Coefficient .......... 142, 258, 317 for tag recommendations......................... 209 joke tags ....................................................... 225
K Ketchum ............................................... 302, 401 Kleinberg Algorithm.................... 138, 340, 356 KMeans-Cluster algorithm .......................... 306 K-Nearest-Neighbors Procedure.................. 144
knowledge base........................................ 19, 21 Knowledge Maturity Model ...................... 242f. knowledge organization systems ............ 2, 288, 358, 413 knowledge representation ........... 119, 155, 160, 167, 227, 247, 254f., 284f., 293, 295, 332, 411, 413, 418 KOS......... See knowledge organization systems Kullback-Leibler formula .............................178
L landscape architecture (tag gardening activity) 239, 414 language identification (NLP) ............. 139, 141 language variability ......................................218 large number of tags .....................................333 Last.fm ........................49ff., 105, 257, 310, 312 Audioscrobbler...........................................49 latent hierarchical taxonomy ........................214 lemmatization....................................139ff., 230 Levenshtein Metric .......................................246 Lexis Nexis ...................................................333 libraries .................................................... 15, 55 Library 2.0................................................ 56, 61 Library of Congress ............................... 62, 250 Library of Congress Subject Headings.........128 Library of Pennsylvania..................................57 LibraryThing ...................43, 56, 60f., 105, 238, 245, 335, 372, 390 LibraryThing for Libraries..............................68 licences............................................................15 Copy Left ............................................ 15, 17 Creative Commons ............................. 15, 17 Open Access ..............................................15 Open Source........................................ 15, 17 linguistic variability ............. 285, 293, 332, 411 link topology ........................ 138, 144, 159, 340 links........................................................ 15, 289 LinkedIn..........................................................88 logsonomy................... See accidental thesaurus Long Tail...................131, 171, 176f., 239, 258, 319, 364, 370, 411, 415 tags .................................................. 178, 365 Long Trunk ................................ 
132, 176f., 364 tags ...........................................................178 longest-match stemmer ................................140 Lovins Stemmer.......................................140 Lotka’s Law ......... 131, 171, See also Long Tail Lovins Stemmer............................................140
Luhn’s Thesis............................... 134, 348, 417
M Ma.gnolia ....................................................... 25 madness of the crowds ................................. 169 maintenance of controlled vocabularies ...... 338, 374 Marginalia .................................................... 226 mash-ups ................................... 15, 17, 200, 389 mass indexing........................................... 4, 411 mass information.................................. 283, 292 Matthew Effect............. 173, 205, 299, 305, 319 me ......................................... 201, 224, 230, 336 message boards .............................................. 15 measuring success of collaborative information services.................................................... 390 media functions for tags....................... 225, 238 meronymy .................................................... 124 MeSH ........................................................... 311 metadata ................55f., 68, 119, 121, 129, 155, 160, 224, 256, 283, 288, 416 meta noise .................................................... 225 meta search................................................... 360 meta search engine ....................................... 334 metatags........................................................ 156 Metropolitan Museum of Art......................... 69 microblogging ................................................ 15 Microsoft Live...................................... 244, 310 Microsoft Paint............................................. 389 More like me!............................... 215, 302, 360 More like this!.............................................. 215 motivations for tagging .............................. 185f. MovieLens ........................................... 347, 348 MSN ............................................................. 313 MSU Keywords........... See accidental thesaurus multimedia resources ................................... 416 multiple-word tags .... 
184, See also compounds museums................................................... 55, 68 music-sharing services ................................... 49 Last.fm .................. 49ff., 105, 257, 310, 312 MyBib .......................................................... 58f. MySpace......................................................... 88 MyTag .................................................. 334, 360
N named entities (NLP) ................................... 139 Narrow Folksonomies............. 104, 165ff., 242, 258, 336, 338, 346, 363, 413, 416 self-tagging.............................................. 165 set model ................................................. 165
simple tagging..........................................165 natural language processing ...... 139, 141, 230, 414, 417 NLP algorithms........................................238 natural systems..............................................182 natural visual flow (tag clouds) ....................316 Negative Relevance Feedback Weight.........349 Negative Tag-Sentiment Weight ..................359 negative usage of tags...................................334 network economics .................................. 7, 180 network effects..................................... 180, 415 network goods...............................................216 network model ..............................................144 networking function of folksonomies...........300 netzeitung.de .......................................... 87, 113 news portals ............................................. 15, 19 digg.............................................................19 NewsGator ......................................................17 n-grams ........................................ 134, 139, 141 NLP .................See natural language processing no metadata (problems information retrieval)...................................................334 no quality control (problems information retrieval)...................................................334 nodes .............................................................144 nomenclatures .............................. 120, 122, 128 non-aboutness tags........................................198 non-descriptors..............................................125 non-textual information resources ...............283, 286, 288 non-textual tags.............................................256 normalization ............................. 350, 355f., 359 NRFW ..........................See Negative Relevance Feedback Weight NTSW ..... See Negative Tag-Sentiment Weight notations........................................................126
O OATS ........................ See Open Annotation and Tagging System object overlaps ..............................................302 ofness ................................................... 221, 222 online crowds................................................179 online herds...................................................179 online public access catalogue........................56 Ontocopi....................................................302ff. ontologies................................... 122, 124f., 386 ontology maturing.........................................253 ontology merging..........................................253
OPAC ..........See online public access catalogue OPAC 2.0 ....................................................... 56 Open Annotation and Tagging System........ 260 Open Directory Project ................ 127, 129, 310 Open Source................................................... 17 opinion-expressing tags ........See sentiment tags outlinks.......................................138f., 340, 356 P PageRank .............138, 161, 233, 340, 352, 356 PageRank Weight......................................... 356 page-view counts ......................................... 292 paradigmatic relations................. 124, 126, 128, 155, 222, 224, 247, 309, 374, 377, 381 Pareto distribution ........................................ 131 parsing .......................................... 139, 141, 230 part-of relations ............................................ 124 Partial Retrieval Status Value ...... 353, 356, 359 PennTags .................................................. 57, 59 people-tagging.............................................. 88f. performance tags .................................. 198, 230 performative tags ......................... 200, 224, 241 Performative Tag Weight........................... 357f. performative verbs ....................................... 356 perma-link ...................................................... 97 perpetual beta ................................................. 15 personal information management .............. 195 personal resource management..... 19, 293, 312, 373, 390 personal tagging ........................................... 156 personalization of ranking ........................... 359 personomies ................ 156, 169, 190, 225, 237, 240, 245, ... 257f., 293, 298, 324f., 330, 334, 360, 415 PerTW .................See Performative Tag Weight photosharing services..................................... 69 Flickr .......... 
1, 69ff., 105, 165f., 229, 257ff., 288, 296, 312, 326ff., 330, 333f., 339, 343f., 351, 360, 376, 378, 380ff., 384, 386, 389f., 400 phrase recognition ........................................ 230 Picnik.............................................................. 72 pivot browsing..............................20, 289f., 416 persons/ users........................................... 289 resources .................................................. 289 tags........................................................... 289 platform level ............................... 165, 170, 363 Plurality ........................................................ 308 P-Norm......................................................... 133
podcasts.............................................. 15, 19, 49 polysemes......................................................125 Porter Stemmer .................................... 140, 230 Positive Relevance Feedback Weight ..........349 Positive Tag-Sentiment Weight....................358 post activation analysis paralysis..................161 Post Tags (Technorati)....................................97 Power Law ................131, 171, 177, 184, 337f., 364, 415 generation/ development..............172, 174ff. Kullback-Leibler formula ........................178 Matthew Effect ........................................173 preferential attachment ............................174 Semiotics Dynamics ................................172 Shuffling Theory......................................174 success breeds success.............................173 Yule process.............................................172 Yule-Simon process.................................173 Power Tag Weight ........................................353 Power Tags....... 363, 365ff., 370, 376, 415, 417 as retrieval tool.........................................368 in Tag Gardening .....................................373 Power Tags I .................................................374 Power Tags II................................................374 Power Tags Only ......................... 368, 370, 372 precision....................130, 238f., 241, 284, 305, 333, 367, 370f., 417 presence.........................................................292 preferential attachment .................................174 pre-iconographical level ...............................221 PRFW............................ See Positive Relevance Feedback Weight probabilistic model ...................... 
136, 339, 361 Problem-Tag Clouds .....................................411 prosumer .........................................................18 PRW ................................See PageRank Weight pseudo-relevance feedback.......... 137, 309, 361 PTSW ........ See Positive Tag-Sentiment Weight PTW .............................. See Power Tag Weight Public Library of Nordenham.........................60 PubMed ................................................ 311, 329 pull approach in retrieval ............. 287, 290, 294 purpose tags ............................................... 386f. push approach in retrieval.......... 287, 290, 293f. push services .................................................297
Q quality control .............................. 168, 216, 296 via Collective Intelligence .......................168
quality judgment .................................. 216, 299 quality judgment on information resources 299, 416 quality judgments on resources in the digital world ....................................................... 292 quality judgment on tags...................... 349, 352 quasi-KOS ..........................142, 246, 306, 375f. creation of quasi-KOS............................. 364 query expansion .................. 305, 309, 333, 374, 378, 383f. (semi-) automatic query expansion......... 309 query modification .... 305, See query expansion query refinement ....... 305, See query expansion query tags ..................................................... 348 query terms................................................... 336 query vector.................................................. 351 query-dependent ranking ............................. 345 query-independent ranking .......................... 345
R ramp-up problems ........................................ 300 problems of recommender systems ........ 296 ranking algorithms ....................................... 361 RankNet........................................................ 341 rating systems................................... 15, 19, 359 Ciao ........................................................... 19 Rating-System Weight ................................. 359 RawSugar ............................................. 244, 330 Really Simple Syndication....................... 15, 17 RSS feeds .................................................. 17 recall ............130, 133, 140, 238, 284, 305, 333, 360, 367, 372, 417 recall-precision graph of tag recommendations 310 recommender systems......... 137, 142, 145, 223, 295f., 299, 301ff., 308, 336, 416 collaborative filtering .............................. 299 content-based recommender systems...... 299 for similar tags ........................................ 338 hybrid systems ......................................... 299 references ..................................... 128, 138, 340 ReferralWeb ................................................. 295 related terms ................................................. 305 relational recommendations......................... 309 relative term frequency ........................ 134, 347 relevance .............. 119, 123, 131, 133, 144, 216 of resources ............ 216, 250, 283, 285, 293, 339, 354, 357, 361, 373, 387 of tags .............................. 160, 319, 315, 348
in KOS .....................................................358 relevance distributions ......................... 131, 172 relevance feedback............................... 135, 361 relevance feedback loop ...............................136 Relevance Feedback Weight ............. 349f., 358 relevance ranking..........7, 131, 133, 136f., 139, 142, 144f., 170, 284, 336, 338f., 351, 390f., 413, 415f. concept-based features .............................339 dynamic ranking.......................................339 query-dependent ranking................. 339, 417 query-independent page quality features .339 query-independent ranking ............. 339, 417 static ranking ............................................339 Researchers’ Holy Grail ...................... 131, 284 resources .......................................................157 resource-centric approach.................................3 resource-specific (search) tag distributions..338 resource level ...........................165f., 170f., 363 resource management ........ 15, 19, 21, 255, 293 retrievability..................................................338 retrieval effectiveness .......................... 361, 387 tag clouds ........................................ 319, 321 folksonomies ................................... 310, 313 retrieval models.............................................133 retrieval performance of recommended tags ...........................................................310 retrieval status value of the resource ...........131, 339f., 345, 359, 417 RFW ...............See Relevance Feedback Weight Robertson-Sparck Jones formula......... 137, 361 Rocchio algorithm................................ 135, 361 RSS....................See Really Simple Syndication RSS feeds ....................... 17, 287, 294, 298, 301 RSV.............................See retrieval status value RSV(r).................... See retrieval status value of the resource RSW ........................ 
See Rating-System Weight
S satiation of tagging systems......................178ff. concerning resources ...............................180 concerning tags ........................................182 concerning users ......................................179 SBRank .........................................................342 scalability ......................................................216 scale invariance.................................... 172, 178 Scope Notes ..................................................252 SDI .See selective dissemination of information
search effort.......................................... 333, 412 search engines ...................................... 129, 416 search engine for all collaborative information services.................................................... 388 search tags .................................................... 416 Search-Tag Weight .............................. 348, 350 search vocabulary......................................... 387 searching ...................................... 290, 297, 371 searching behavior ....................................... 335 seeding (tag gardening activity)........... 238, 414 Seeding, Evolutionary Growth and Reseeding Model ...................................................... 242 seedlings............................................... 239, 387 segmenting resources ................................... 226 selective dissemination of information....... 294, 299 self-fulfilling prophecy................................. 205 self-organization........................................... 179 self-organized criticality .............................. 179 self-tagging................................................... 165 semantic browsing ....................................... 290 semantic clusters .......................................... 142 semantic folksonomy ................................... 241 semantic gap......................................... 213, 286 semantic navigation ..................................... 291 semantic relations................. 122, 240, 305, 378 Semantic Web .............................. 124, 159, 249 Semiotics Dynamics..................................... 172 Send a Friend ....................................... 296, 359 sentiment analysis ........................................ 358 sentiment tags....................................... 199, 358 Sentiment Weight................................. 
358, 359 SER Model ...............See Seeding, Evolutionary Growth and Reseeding Model serendipity ............................................ 217, 289 set model ...................................................... 164 sharing services .................................. 15, 20, 22 Shatford Classification................................. 335 Shell Model ........................................250f., 414 Shepard-Luce formula ................................. 161 short head .................................... See long trunk Shuffling Theory .......................................... 174 signalling tags ................. See performative tags similarity algorithms (tag gardening) .......... 373 similarity coefficients.................. 142, 144, 301, 339, 379f. similarity comparisons ................................. 137 similarity matrix........................................... 143 similarity values ................................... 142, 214
simple tagging...............................................164 Simpy ..............................................................25 Single-Link Procedure ......................... 144, 302 slide controls ........................................ 331, 359 small worlds ......................................... 145, 214 Soboleo .........................................................243 social bookmarking services................. 15, 20f., 23ff., 298 BibSonomy .................. 25, 31, 56, 59f., 105, 165, 233, 245, 258, 310, 337, 342, 380, 400 Blinklist......................................................25 CiteULike...................................................26 Connotea ....................................................26 del.icio.us ........... 1, 25ff., 60, 105, 166, 258, 289, 296, 298f., 306ff., 310-314, 318f., 322, 324f., 327, 330f., 334, 339, 341, 360, 364f., 380, 383f., 387 Furl .............................................................25 Ma.gnolia ...................................................25 ranking .....................................................342 Simpy .........................................................25 social browsing ..................... 290, 391, See also social navigation social media currency ...................................390 social navigation .................. 290, 292, 301, 391 social networks .........15, 19, 22, 88, 295, 303f., 308, 326 43things............................... 22, 90, 160, 315 Facebook ............................................. 85, 88 LinkedIn.....................................................88 MySpace ............................................. 85, 88 StudiVZ......................................................88 XING..........................................................88 social network analysis .................................159 social OPAC ...................................................57 social proof........................... 
168, 207, 292, 300 social ranking algorithms..............................361 social search ......................... 284, 362, 391, 416 social search ranking.....................................362 social software ..................... 6, 15, 18f., 22, 413 social tagging ...................................See tagging SocialPageRank ............................................340 SocialSimRank..............................................340 SOPAC....................................... See social opac SpaceNav ......................................................324 spagging ....................................... See spam tags spam ..................................................... 296, 334 in ranking .................................................362 prevention ................................................234
spam tags............................159, 224, 232f., 412 spatial navigation ......................................... 291 speech recognition ....................................... 230 SPW............................ See Super Poster Weight Squirl ............................................................ 335 statistical consensus ..................................... 167 stemming .......................... 139ff., 246, 384, 414 Steve Museum.............................................. 102 Steve Project .................................................. 69 Cleveland Museum of Art......................... 69 Guggenheim Museum............................... 69 Metropolitan Museum of Art.................... 69 StudiVZ ...................................................... 1, 88 STW ..............................See Search Tag Weight sub-communities .......................................... 302 Subject Browser (Die Zeit) .......................... 331 success breeds success ..........173, 205, See also Matthew Effect super-posters ................................................ 352 Super-Poster Weight .................................... 352 SW...................................See Sentiment Weight swarm intelligence ....................................... 166 syncategorematic tags .................. 185, 224, 230 synonyms .................... 125, 128, 136, 140, 220, 239f., 294, 305, 309, 380, 383 synonymy ..................................................... 124 syntactic indexing ........................ 128, 318, 388 syntagmatic relations .......... 124, 128, 155, 159, 223, 317, 378, 381
T tag avoidance........................................ 203, 232 tag bundles .................................106, 244, 330f. del.icio.us .................................................. 28 tag categories ....................................... 196, 312 aboutness tags.......................................... 198 as fields.................................................... 198 Basic-Level tags...................................... 201 community tags ....................................... 201 dependent on collaborative information services.................................................... 197 dependent on resource type..................... 197 document type tags .................................. 201 emotional tags.......................................... 198 geotags ..................................................... 199 in del.icio.us............................................. 196 in Flickr.................................................... 196 me ............................................................ 201 non-aboutness tags .................................. 198
performative tags......................................200 tag categorization model ..........................202 time tags ...................................................200 tag categorization model...................... 202, 203 tag clouds ..................163, 218, 252, 287, 289f., 315f., 319, 321, 328ff., 331, 363, 389, 412, 416 advantages................................................320 disadvantages ................................ 316f., 320 tag clusters .................240, 305, 317f., 375, 377 tag clusters (tag gardening)...........................375 tag distributions....... 164, 166ff., 170, 220, 225, 231, 248, 336, 338, 363f., 366, 415ff. inverse-logistic distribution .................176ff. Long Tail..................................................171 platform level (database level) ....... 165, 177 Power Law .................................... 171, 177f. ranking .....................................................170 resource level .................................. 165, 177 tag editing..............See media functions for tags tag garden............................................. 237, 414 tag gardener........................................ 237, 240f. tag gardening..................................... 413f., 417 knowledge representation 155, 231, 235f., 247, 250, 252 information retrieval ........... 310, 366, 372f., 377, 388 tag gardening activities ................................236 fertilizing......................................... 241, 250 landscape architecture..............................239 seeding .....................................................238 weeding ....................................................238 tag hierarchies ..................................330f., 379f. tag literacy............................................ 228, 252 tag management systems ........................... 257f. 
tag mirror ........................................................68 tag options.....................................................331 in del.icio.us .............................................330 tag recommender systems................... 204, 215, 228, 241, 245f., 252, 258, 299, 305, 307, 322, 363, 379, 411, 414f. based on association rules .........................208 based on context.......................................308 based on co-occurrence............................207 based on frequent tags .............................204 based on knowledge organization systems 208, 210, 309 based on other metadata...........................209 based on other resources..........................210
based on own tags ................................... 204 tag-resource connection .......................... 308 during indexing ...................................... 204 methods ................................................... 305 to avoid typing errors .............................. 204 tag scope....................................................... 164 tag similarity ................................................ 322 tag size.......................................................... 318 tag specificity ....................................... 232, 335 Tag Time-Machine....................................... 260 TagaTune...................................................... 102 tagCare ......................................... 258, 380, 390 tagging.......................................... 153, 155, 161 tagging-based social networks .....................See goal-sharing services tagging behavior .................................. 184, 293 change ..................................................... 189 cross-database tagging ............................ 194 field-based tagging.................................. 192 grammar .................................................. 193 in Flickr ................................................... 194 in intranets............................................... 193 in museums .............................................. 193 indexing ................................................... 195 indexing frequency of tags ...................... 184 influence on ............................................. 189 motivations for tagging ........................... 185 registered vs. non-registered users .......... 195 tagging competence................................. 195 tagging effectiveness.................................... 311 tagging games ..........See Games with a Purpose tagging guidelines ................................ 229, 238 tagging systems ...................................... 
20, 155 TagLines..................................................... 327f. TagMaps..................................................... 382f. tagmashes ....................................... 66, 106, 372 tagography............................................ 201, 260 TagOntology ................................................ 243 tagr (Flickr) .................................................. 259 tagr................................................................ 244 TagRank ....................................................... 341 tags.............................20ff., 153, 156ff., 184, 189ff., 293, 297, 311, 315, 416 chronological development..................... 178 informetric analyses ................................ 178 tags vs. controlled vocabularies .............. 191 ranking factor ........................345f., 413, 417 TagsAhoy ..................................................... 335 TagsCloud ............................................ 325, 326
tapestry ..........................................................295 taxonomy.......................................................154 taxonomy of services in Web 2.0 ................. See collaborative information services (classification) Technorati .................... 1, 23, 96, 105, 178, 363 term control...................................................125 term frequency ............. 134, 346, 350, 353, 417 term weight of tags .......................................350 terminological control...................................125 text features (tag cloud) ................................321 text statistics......................................... 134, 417 text-word method................................. 120, 128 TF ........................................ See term frequency The Guardian ................................................315 thematic coupling..........................................160 thematical linking ........................ 158, 300, 302 thesauri................................. 120, 122, 125, 256 thumbs-down rating ............................. 349, 359 time....................................................... 157, 357 in tripartite graph .....................................157 in ranking .................................................351 time and task related tags.See performative tags time machine.............................................. 337f. Time Weight ...................................... 351f., 357 timelines........................................................363 time tags ........................................................200 Topigraphy....................................................329 trackback .........................................................97 translation......................................................232 trilateral network..................See tripartite graph tripartite graph ............ 
120, 144, 157, 160, 290, 295, 305 collaborative information services...........160 folksonomies ............................................157 tripartite hypergraph ............See tripartite graph tripartite network..................See tripartite graph translation (NLP) ....................................... 140f. truncations.....................................................127 trust................................................................292 trusted network .......................................... 298f. TrustRank..................................... 233, 352, 360 TW ..........................................See Time Weight Twitter.............................................................19 type-ahead function ......................................211 tyranny of the majority ................ 205, 366, 412
U umbrella ........................................................412 University Library of Cologne....................... 59 University Library of Heidelberg .................. 60 University Library of Hildesheim ................. 58 University Library of Karlsruhe .................... 61 BibTip ....................................................... 61 University of Southern California Annenberg... 302, 401 unstructured list of keywords....................... 219 un-tagged resources ..................................... 308 usability ................................................ 225, 227 user aboutness .............................................. 222 user roles .............................................. 232, 259 user-generated content ................2, 13, 19, 96f., 216, 335 user-generated indexing ............................... 212 users.............................................................. 157 as ranking factor.............. 345, 356, 413, 417 User Weight ................................................. 355 UserRank...................................................... 341 users’ online status....................................... 292 UW .......................................... See User Weight
V vagueness .....287, See also vocabulary problem value judgment (performative tags)............. 241 Vector Space model ............ 135, 214, 313, 317, 339, 351, 361 videos ........................................................... 283 Videosharing Services ................................... 80 Clipfish...................................................... 80 Vimeo ........................................................ 80 YouTube ........ 80ff., 105, 165, 296, 334, 339, 349, 360 Vimeo ............................................................. 80 visual display of folksonomies .................... 260 visual influence (tag clouds) ........................ 322 visualization of folksonomies............... 284, 289, 314, 378 of search results....................................... 389 vocabulary problem ..................213, 253, 286f., 384, 416 vodcasts .......................................................... 49 vote early and often phenomenon (problems of recommender systems)............................ 296 Vox Populi ................................................... 166
W
W ................................... See term weight of tags WDF ................ See within-document frequency
Web 1.0 ...........................................................14 Web 2.0 .... 1, 13ff., 18, 130, 284, 302, 411, 413 web catalogs..................................................129 Web of Science .............................................333 webtop...................................................... 15, 17 weeding (tag gardening activity) ......... 238, 414 Wikipedia...................15, 19f., 22, 59, 233, 256 folksonomy ..............................................256 wikis ................................................................22 wisdom of crowds.........................................166 WISO ............................................... 45, 47, 105 within-document frequency ................. 134, 347 word identification (NLP)................ See parsing Wordie.............................................................94 WordFlickr................................................. 384f. WordNet....................140, 218, 231, 379, 383ff. word-of-mouth propaganda ..........................302 word placement (tag cloud) ..........................321 words.................................................... 134, 293 work (LibraryThing) .......................................65 World Explorer .............................................382 world map .......................................................78
X XING...............................................................88 Xerox............................................... See tapestry
Y Yahoo! Local Maps ......................................382 Yahoo! Web Catalog ................... 129, 248, 310 YouTube ....................80ff., 105, 165, 296, 334, 339, 349, 360 Yule process..................................................172 Yule-Simon process............................. 173, 175
Z zeitgeist ...........................................................67 Zipf’s Law.....................................................131