The Social Semantic Web
John G. Breslin · Alexandre Passant · Stefan Decker
John G. Breslin, Electrical and Electronic Engineering, School of Engineering and Informatics, National University of Ireland, Galway, Nuns Island, Galway, Ireland, [email protected]
Alexandre Passant, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland, [email protected]
Stefan Decker, Digital Enterprise Research Institute (DERI), National University of Ireland, Galway, IDA Business Park, Lower Dangan, Galway, Ireland, [email protected]
ISBN 978-3-642-01171-9
e-ISBN 978-3-642-01172-6
DOI 10.1007/978-3-642-01172-6
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2009936149
ACM Computing Classification (1998): H.3.5, H.4.3, I.2, K.4
© Springer-Verlag Berlin Heidelberg 2009

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: KuenkelLopka GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Contents
1 Introduction to the book ... 1
  1.1 Overview ... 1
  1.2 Aims of the book, and who will benefit from it? ... 3
  1.3 Structure of the book ... 4
    1.3.1 Motivation for applying Semantic Web technologies to the Social Web ... 5
    1.3.2 Introduction to the Social Web (Web 2.0, social media, social software) ... 5
    1.3.3 Adding semantics to the Web ... 6
    1.3.4 Discussions ... 6
    1.3.5 Knowledge and information sharing ... 6
    1.3.6 Multimedia sharing ... 7
    1.3.7 Social tagging ... 7
    1.3.8 Social sharing of software ... 7
    1.3.9 Social networks ... 8
    1.3.10 Interlinking online communities ... 8
    1.3.11 Social Web applications in enterprise ... 8
    1.3.12 Towards the Social Semantic Web ... 9

2 Motivation for applying Semantic Web technologies to the Social Web ... 11
  2.1 Web 2.0 and the Social Web ... 11
  2.2 Addressing limitations in the Social Web with semantics ... 13
  2.3 The Social Semantic Web: more than the sum of its parts ... 15
  2.4 A food chain of applications for the Social Semantic Web ... 17
  2.5 A practical Social Semantic Web ... 19

3 Introduction to the Social Web (Web 2.0, social media, social software) ... 21
  3.1 From the Web to a Social Web ... 21
  3.2 Common technologies and trends ... 25
    3.2.1 RSS ... 25
    3.2.2 AJAX ... 27
    3.2.3 Mashups ... 28
    3.2.4 Advertising ... 30
    3.2.5 The Web on any device ... 32
    3.2.6 Content delivery ... 34
    3.2.7 Cloud computing ... 35
    3.2.8 Folksonomies ... 38
  3.3 Object-centred sociality ... 39
  3.4 Licensing content ... 42
  3.5 Be careful before you post ... 42
  3.6 Disconnects in the Social Web ... 44

4 Adding semantics to the Web ... 45
  4.1 A brief history ... 45
  4.2 The need for semantics ... 47
  4.3 Metadata ... 51
    4.3.1 Resource Description Framework (RDF) ... 52
    4.3.2 The RDF syntax ... 54
  4.4 Ontologies ... 56
    4.4.1 RDF Schema ... 59
    4.4.2 Web Ontology Language (OWL) ... 61
  4.5 SPARQL ... 62
  4.6 The ‘lowercase’ semantic web, including microformats ... 64
  4.7 Semantic search ... 66
  4.8 Linking Open Data ... 67
  4.9 Semantic mashups ... 69
  4.10 Addressing the Semantic Web ‘chicken-and-egg’ problem ... 71

5 Discussions ... 75
  5.1 The world of boards, blogs and now microblogs ... 75
  5.2 Blogging ... 76
    5.2.1 The growth of blogs ... 77
    5.2.2 Structured blogging ... 79
    5.2.3 Semantic blogging ... 81
  5.3 Microblogging ... 85
    5.3.1 The Twitter phenomenon ... 88
    5.3.2 Semantic microblogging ... 89
  5.4 Message boards ... 91
    5.4.1 Categories and tags on message boards ... 92
    5.4.2 Characteristics of forums ... 94
    5.4.3 Social networks on message boards ... 97
  5.5 Mailing lists and IRC ... 100

6 Knowledge and information sharing ... 103
  6.1 Wikis ... 103
    6.1.1 The Wikipedia ... 105
    6.1.2 Semantic wikis ... 105
    6.1.3 DBpedia ... 110
    6.1.4 Semantics-based reputation in the Wikipedia ... 111
  6.2 Other knowledge services leveraging semantics ... 112
    6.2.1 Twine ... 112
    6.2.2 The Internet Archive ... 115
    6.2.3 Powerset ... 117
    6.2.4 OpenLink Data Spaces ... 119
    6.2.5 Freebase ... 119

7 Multimedia sharing ... 121
  7.1 Multimedia management ... 121
  7.2 Photo-sharing services ... 122
    7.2.1 Modelling RDF data from Flickr ... 123
    7.2.3 Annotating images using Semantic Web technologies ... 125
  7.3 Podcasts ... 126
    7.3.1 Audio podcasts ... 127
    7.3.2 Video podcasts ... 129
    7.3.3 Adding semantics to podcasts ... 131
  7.4 Music-related content ... 133
    7.4.1 DBTune and the Music Ontology ... 133
    7.4.2 Combining social music and the Semantic Web ... 134

8 Social tagging ... 137
  8.1 Tags, tagging and folksonomies ... 137
    8.1.1 Overview of tagging ... 137
    8.1.2 Issues with free-form tagging systems ... 140
  8.2 Tags and the Semantic Web ... 142
    8.2.1 Mining taxonomies and ontologies from folksonomies ... 143
    8.2.2 Modelling folksonomies using Semantic Web technologies ... 144
  8.3 Tagging applications using Semantic Web technologies ... 148
    8.3.1 Annotea ... 148
    8.3.2 Revyu.com ... 149
    8.3.3 SweetWiki ... 151
    8.3.4 int.ere.st ... 151
    8.3.5 LODr ... 152
    8.3.6 Atom Interface ... 153
    8.3.7 Faviki ... 154
  8.4 Advanced querying capabilities thanks to semantic tagging ... 155
    8.4.1 Show items with the tag ‘semanticweb’ on any platform ... 155
    8.4.2 List the ten latest items tagged by Alexandre on SlideShare ... 155
    8.4.3 List the tags used by Alex on SlideShare and by John on Flickr ... 157
    8.4.4 Retrieve any content tagged with something relevant to the Semantic Web field ... 158

9 Social sharing of software ... 159
  9.1 Software widgets, applications and projects ... 159
  9.2 Description of a Project (DOAP) ... 160
    9.2.1 Examples of DOAP use ... 161
  9.3 Crawling and browsing software descriptions ... 164
  9.4 Querying project descriptions and related data ... 166
    9.4.1 Locating software projects from people you trust ... 166
    9.4.2 Locating a software project related to a particular topic ... 167

10 Social networks ... 169
  10.1 Overview of social networks ... 169
  10.2 Online social networking services ... 173
  10.3 Some psychology behind SNS usage ... 175
  10.4 Niche social networks ... 177
  10.5 Addressing some limitations of social networks ... 179
  10.6 Friend-of-a-Friend (FOAF) ... 181
    10.6.1 Consolidation of people objects ... 184
    10.6.2 Aggregating a person’s web contributions ... 186
    10.6.3 Inferring relationships from aggregated data ... 187
  10.7 hCard and XFN ... 189
  10.8 The Social Graph API and OpenSocial ... 190
    10.8.1 The Social Graph API ... 190
    10.8.2 OpenSocial ... 192
  10.9 The Facebook Platform ... 193
  10.10 Some social networking initiatives from the W3C ... 194
  10.11 A social networking stack ... 194

11 Interlinking online communities ... 197
  11.1 The need for semantics in online communities ... 197
  11.2 Semantically-Interlinked Online Communities (SIOC) ... 198
    11.2.1 The SIOC ontology ... 201
    11.2.2 SIOC metadata format ... 203
    11.2.3 SIOC modules ... 205
  11.3 Expert finding in online communities ... 206
    11.3.1 FOAF for expert finding ... 208
    11.3.2 SIOC for expert finding ... 209
  11.4 Connections between community description formats ... 211
  11.5 Distributed conversations and channels ... 212
  11.6 SIOC applications ... 215
  11.7 A food chain for SIOC data ... 216
    11.7.1 SIOC producers ... 218
    11.7.2 SIOC collectors ... 223
    11.7.3 SIOC consumers ... 224
  11.8 RDFa for interlinking online communities ... 231
  11.9 Argumentative discussions in online communities ... 234
  11.10 Object-centred sociality in online communities ... 236
  11.11 Data portability in online communities ... 238
    11.11.1 The DataPortability working group ... 238
    11.11.2 Data portability with FOAF and SIOC ... 240
    11.11.3 Connections between portability efforts ... 241
  11.12 Online communities for health care and life sciences ... 242
    11.12.1 Semantic Web Applications in Neuromedicine ... 243
    11.12.2 Science Collaboration Framework ... 244
    11.12.3 bio-zen and the art of scientific community maintenance ... 246
  11.13 Online presence ... 246
  11.14 Online attention ... 247
  11.15 The SIOC data competition ... 247

12 Social Web applications in enterprise ... 251
  12.1 Overview of Enterprise 2.0 ... 251
  12.2 Issues with Enterprise 2.0 ... 255
    12.2.1 Social and philosophical issues with Enterprise 2.0 ... 255
    12.2.2 Technical issues with Enterprise 2.0 ... 258
  12.3 Improving Enterprise 2.0 ecosystems with semantic technologies ... 262
    12.3.1 Introducing SemSLATES ... 262
    12.3.2 Implementing semantics in Enterprise 2.0 ecosystems ... 263
    12.3.3 SIOC for collaborative work environments ... 266

13 Towards the Social Semantic Web ... 269
  13.1 Possibilities for the Social Semantic Web ... 269
  13.2 A community-guided Social Semantic Web ... 271
    13.2.1 Wisdom of the crowds and the Semantic Web ... 272
    13.2.2 A grassroots approach ... 273
    13.2.3 The vocabulary onion ... 275
  13.3 Integrating with the Social Semantic Desktop ... 278
  13.4 Privacy and identity on the Social Semantic Web ... 279
    13.4.1 Keeping privacy in mind ... 279
    13.4.2 Identity fragmentation ... 280
  13.5 The vision of a Social Semantic Web ... 281

Acknowledgments ... 285
Dedication from John ... 287
Biographies ... 289
References ... 291
1 Introduction to the book

1.1 Overview

The Social Web - encompassing social networking services such as MySpace, Facebook and orkut, as well as content-sharing sites (that also offer social networking functionality) like Flickr, Last.fm and del.icio.us - has captured the attention of millions of users as well as billions of dollars in investment and acquisition. As more social websites form around the connections between people and their objects of interest (to avoid these sites becoming boring), and as these ‘object-centred networks’ (where people connect via these objects of interest) grow bigger and more diverse, more intuitive methods are needed for representing and navigating the content items in these sites, both within and across social websites. Also, to better enable user access to multiple sites and ultimately to content-creation facilities on the Web, interoperability among social websites is required in terms of both the content objects and the person-to-person networks expressed on each site. This requires representation mechanisms to interconnect people and objects on the Web in an interoperable and extensible way (Breslin and Decker 2007).

Semantic Web representation mechanisms are ideally suited to describing people and the objects that link them together in such object-centred networks, by recording and representing the heterogeneous ties that bind each to the other. By using agreed-upon Semantic Web formats to describe people, content objects, and the connections that bind them together, social networks can also interoperate by appealing to common semantics. Developers are already using Semantic Web technologies to augment the ways in which they create, reuse, and link content on social networking and social websites. These efforts include the Friend-of-a-Friend (FOAF) project [1] for describing people and relationships, the Nepomuk social semantic desktop [2], which is a framework for extending the desktop to a collaborative environment for information management and sharing, and the Semantically-Interlinked Online Communities (SIOC) initiative [3] for representing online discussions (Breslin et al. 2005). Some social networking services (SNSs), such as FriendFeed, are also starting to provide query interfaces to their data, which others can reuse and link to via the Semantic Web. The Semantic Web is a useful platform for linking and for performing operations on diverse person- and object-related data (as shown in Figure 1.1) gathered from heterogeneous social websites (in what is termed ‘Web 2.0’).

[1] http://www.foaf-project.org/ (URL last accessed 2009-06-09)
[2] http://nepomuk.semanticdesktop.org/ (URL last accessed 2009-06-09)
[3] http://sioc-project.org/ (URL last accessed 2009-06-09)
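To give a flavour of what such representation mechanisms look like, here is a minimal sketch in the Turtle syntax for RDF (all names and URIs are invented for illustration, not taken from the book) describing a person, her homepage and a friendship link using FOAF:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Alice's profile: her name, homepage and someone she knows (illustrative URIs)
<http://example.org/people/alice#me> a foaf:Person ;
    foaf:name     "Alice" ;
    foaf:homepage <http://example.org/~alice> ;
    foaf:knows    <http://example.org/people/bob#me> .

<http://example.org/people/bob#me> a foaf:Person ;
    foaf:name "Bob" .
```

Because both descriptions use the same agreed-upon terms, an application that finds them on two different sites can merge them into a single graph without any site-specific logic.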
Fig. 1.1. Interconnecting and reusing distributed Web 2.0 data with semantic technologies (diagram layers: Web 2.0; metadata and vocabularies; ontology languages; unified representation; unified queries)
In the other direction, object-centred networks and user-centric services for generating collaborative content can serve as rich data sources for Semantic Web applications (Figure 1.2).
Fig. 1.2. Powering semantic applications with rich community-created content and Web 2.0 paradigms (diagram labels: collaboration; architecture of participation; authoring; browsing interfaces; mash-ups; Semantic Web)
This linked data can provide an enhanced view of individual or community activity in localised or distributed object-centred social networks. In fact, since all this data can be semantically interlinked using well-defined semantics (e.g. using the FOAF and SIOC ontologies), in theory it makes no difference whether the content is distributed or localised. All of this data can be considered as a single interlinked, machine-understandable graph layer (with nodes representing users or related data, and arcs representing relationships) over the existing Web of documents and hyperlinks, i.e. a Giant Global Graph, a term recently coined by Tim Berners-Lee [4]. Moreover, such interlinked data allows advanced querying capabilities, for example, ‘show me all the content that Alice has acted on in the past three months in any SNS’.

[4] http://dig.csail.mit.edu/breadcrumbs/node/215 (URL last accessed 2009-06-09)
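As a rough sketch of how such a query could be phrased once the data is interlinked, the following SPARQL query asks for everything Alice's account has created in the three months before a given date. It is a simplified version that only covers content she has created, and it assumes data described with SIOC and Dublin Core terms; the account URI and date cut-off are invented for illustration.

```sparql
PREFIX sioc:    <http://rdfs.org/sioc/ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX xsd:     <http://www.w3.org/2001/XMLSchema#>

# Posts created by Alice's account since 2009-03-09,
# regardless of which site exposed the data
SELECT ?post ?date
WHERE {
  ?post a sioc:Post ;
        sioc:has_creator <http://example.org/user/alice> ;
        dcterms:created  ?date .
  FILTER (?date >= "2009-03-09T00:00:00Z"^^xsd:dateTime)
}
ORDER BY DESC(?date)
```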
In this book, we will begin with our motivations, followed by overviews of both the Social Web and the Semantic Web. Then we will describe some popular social media and social networking applications, list some of their strengths and limitations, and describe some applications of Semantic Web technologies that address current issues with social websites by enhancing them with semantics. Across these heterogeneous social websites, we will demonstrate a twofold approach towards integrating the Social Web and the Semantic Web: (1) by showing how the Semantic Web can serve as a useful platform for linking and for performing operations on diverse person- and object-related data gathered from these websites, and (2) by showing that, in the other direction, social websites can themselves serve as rich data sources for Semantic Web applications. We shall conclude with some observations on how the application of Semantic Web technologies to the Social Web is leading towards the ‘Social Semantic Web’, forming a network of interlinked and semantically-rich content and knowledge.
1.2 Aims of the book, and who will benefit from it?

Initially, we aim to educate readers on evolving areas from the world of collaboration and communication systems, social software and the Social Web. We shall also show connections with parallel developments in the Semantic Web effort. Then, we will illustrate how social software applications can be enhanced and interconnected with semantic technologies, including semantic and structured blogging, interconnecting community sites, semantic wikis, and distributed social networks. The goal of this book is that readers will be able to apply Semantic Web technologies to these and to other application areas in what is termed the Social Semantic Web.

This book is intended for computer science professionals, researchers, academics and graduate students interested in understanding the technologies and research issues involved in applying Semantic Web technologies to social software. Applications such as blogs, social networks and wikis require more automated ways for information distribution. Practitioners and developers interested in such application areas will also learn about methods for increasing the levels of automation in these forms of web communication.
For those who have background knowledge in the area of the Semantic Web, we envisage that this book will help you to develop application knowledge in relation to social software and other widely-used related Social Web technologies. For those who already have application knowledge in web engineering or in the development of systems such as wikis, social networks and blogs, we hope this book will inspire you to develop and create ideas on how to increase the usability of social software and other web systems using Semantic Web technologies.
1.3 Structure of the book

We shall now give an introduction to the chapters in this book and explain the logical chapter layout and flow (Figure 1.3). Following an overview of the motivation for combining the Social Web and the Semantic Web, we will proceed with an introduction to various technologies and trends in both the Social Web and the Semantic Web domains.
Fig. 1.3. Chapter flow for the book
This will be followed by a series of chapters whereby various Social Web application areas will be introduced, and semantic enhancements to these areas will be described. The areas we focus on are: online discussion systems such as forums, blogs and mailing lists; knowledge sharing services such as wikis and other sites for (mainly textual) information storage and recovery; multimedia services for sharing images, audio and video files; bookmarking sites and similar services organised around tagging functionality; sites for publishing and sharing community software projects; online social networking services; interlinked online communities; and enterprise applications. These chapters will have varying ratios of semantic implementations to non-semantic ones, since state-of-the-art semantic techniques have achieved more traction in some application areas than in others. Finally, in the last chapter we will describe approaches to integrate these social semantic applications in what we have termed ‘Social Semantic Information Spaces’.
1.3.1 Motivation for applying Semantic Web technologies to the Social Web

This part will focus on the motivation for applying Semantic Web technologies to the Social Web, as summarised in the introductory description just given.
1.3.2 Introduction to the Social Web (Web 2.0, social media, social software)

We shall begin with an overview of social websites, looking at common Social Web technologies and methods for collaboration, content sharing, data exchange and representation (enhancing interaction and exchange with AJAX and mashups, how content is being categorised via tagging and folksonomies, etc.). We shall also discuss existing structured content that is available from social websites, mainly via content syndication whereby people can keep up to date with published material using RSS, Atom and other subscription methods. Then we will introduce the notion of object-centred sociality (referencing the observations of Jyri Engeström and Karen Knorr-Cetina), where social websites are organised around the objects of interest that connect people together.
1.3.3 Adding semantics to the Web

In this chapter, we will examine the state of the art in the Semantic Web, such as metadata and ontology standards and mashups, as well as some efforts aimed at providing semantic search and leveraging linked data. We shall talk about why object-centred sociality provides a rationale for representing Social Web content using semantics. The chapter will focus not only on the ‘uppercase’ Semantic Web (where formal specifications such as OWL and RDF are used to represent ontologies and associated metadata), but will also look at the ‘lowercase’ semantic web (where developer-led efforts in the microformats community are creating simple semantic structures for use by ‘people first, machines second’).
1.3.4 Discussions

We shall describe the area of blogging, one of the most popular Social Web activities. Blogs are online journals or sets of chronological news entries that are maintained by individuals, communities or commercial entities, and can be used to publish personal opinions, diary-like articles or news stories relating to a particular interest or product. We shall begin by describing current approaches to blogging, and detail how semantic technologies improve both the processes of creating and editing blog posts, and of browsing and querying the data created by blogs (via structured blogging and semantic blogging). We shall also discuss forums, mailing lists, and other web-based discussion systems such as microblogging, a recent trend regarding lightweight and agile communication on the Web.
1.3.5 Knowledge and information sharing

Wikis are collaboratively-edited websites that can be updated or added to by anyone with an interest in the topic covered by the wiki site, and have been used to create online encyclopaedias, photo galleries and literature collections. We shall describe the Social Web application area of wikis, and describe how adding semantics to wikis can offer distinct benefits: augmenting the natural-language text in wiki articles with structured data and typed links enables advanced querying and browsing. We shall examine popular semantic wikis in use today (e.g. Semantic MediaWiki), and we will look at semantic services that leverage structured information from wikis (such as the DBpedia). We shall briefly detail how a reputation system with embedded semantics could be deployed in a large-scale community site like the Wikipedia. We shall also look at the latest wave of knowledge networking and information sharing services (including Twine, Freebase, and OpenLink Data Spaces).
1.3.6 Multimedia sharing

We shall begin by looking at Social Web applications for storing and sharing photographs and other images (Flickr, Zooomr, etc.), and describe an application called FlickRDF that exports semantic data from the Flickr service. We shall then describe both audio and video podcasting, and give some ideas for the application of semantics to this area (e.g. through metadata descriptions and applications like ZemPod). We shall finish the chapter with a description of how semantic technologies can be applied to social music services and websites like Last.fm, through projects such as DBTune and the Music Ontology.
1.3.7 Social tagging

This chapter will discuss social tagging and bookmarking services on the Web. We shall look at tagging and how semantics can assist the tagging process as well as enhancing related aspects such as tag clouds. We shall look at annotated social bookmarks, where sites like del.icio.us are allowing people to publicly publish textual descriptions of their favourite links along with associated annotations of use to others, and we will describe different issues related to tagging behaviours. We shall describe how semantics can be added to tagging systems, both by defining models to represent tagging activities or particular behaviours and by extracting a hierarchy of concepts or vocabularies from tags. Semantic social bookmarking and tagging applications (e.g. int.ere.st, Revyu, LODr) will also be described to emphasise how different aspects of tagging applications can be augmented thanks to Semantic Web technologies.
1.3.8 Social sharing of software

The Social Web allows us to not only share data or multimedia content, but also applications, especially free-software applications or lightweight add-ons to web pages such as widgets. We shall look at how interoperability among social websites is possible not just in terms of the expressed content but also in terms of the social applications in use (e.g. widgets) on each site. We shall give an overview of existing ways to share software on the Web, focussing on how a social aspect can be added to data such as software projects or widget descriptions.
We shall follow this with a description of methods for describing software projects using semantics, and we will see how applications can be identified and discovered on the Web thanks to these semantics. We shall also discuss how trust mechanisms for consuming applications can be leveraged via the distributed social graph, so that users can decide who to accept new data or applications from.
1.3.9 Social networks

We shall begin with an overview of social networks, and look at current developments regarding the ‘social graph’. We further describe the idea of object-centred sociality as introduced in Chapter 3. We shall then discuss initiatives from major Web companies to provide interoperability between social networking applications, such as Facebook Connect and Google’s OpenSocial and Social Graph APIs. We shall finish the chapter with a description of how open and distributed semantic social networks can be created through definitions such as Friend-of-a-Friend (FOAF) or XHTML Friends Network (XFN), enabling interoperability between different SNSs.
1.3.10 Interlinking online communities

We shall describe the usage of Semantic Web technologies for enhancing community portals and for connecting heterogeneous social websites - SIOC is currently being used for information structuring as well as for export and information dissemination. We shall describe current standardisation activities as well as research prototype applications and commercial implementations. We shall also show how SIOC can be combined with other ontologies (including FOAF, SKOS, and Dublin Core) in architectures for community site interoperability. We will look at current projects that enable one to query for topics or to browse distributed discussion content across various types of social websites (e.g. the SIOC Explorer, Sindice SIOC Widget).
1.3.11 Social Web applications in enterprise

We shall begin with an overview of Enterprise 2.0, looking at how Social Web applications are being used internally and externally by companies. We shall then examine the application of Semantic Web technologies to Enterprise 2.0 ecosystems. In particular, we will look at the usage of semantics in integrated enterprise social software suites as well as how the Semantic Web can help us to integrate
the various components that are being used in Enterprise 2.0 ecosystems. For example, we will show how collaborative work environments can be enhanced through the application of semantics (e.g. SIOC4CWE).
1.3.12 Towards the Social Semantic Web

Finally, we will discuss and present current approaches to realise the ideas of Vannevar Bush (Bush 1945) and Doug Engelbart (Engelbart 1962) on distributed collaboration infrastructures, towards both the Social Semantic Web and the Social Semantic Desktop (together, we term these Social Semantic Information Spaces). We can combine the semantically-enhanced social software applications described in previous chapters into a Social Semantic Information Space. In the spirit of seminal visions such as Bush’s Memex and Engelbart’s open hyperdocument system (OHS), this chapter will detail how previous perspectives on group forming, network modelling and algorithms, and innovative IT-based interaction with feedback are driving new initiatives for creating semantic connections within and between people’s information spaces.
2 Motivation for applying Semantic Web technologies to the Social Web

Many will have become familiar with popular Social Web applications such as blogging, social networks and wikis, and will be aware that we are heading towards an interconnected information space (through the blogosphere, inter-wiki links, mashups, etc.). At the same time, these applications are experiencing boundaries in terms of information integration, dissemination, reuse, portability, searchability, automation and more demanding tasks like querying. The Semantic Web is increasingly aiming at these application areas, and quite a number of Semantic Web approaches have appeared in recent years to overcome their boundaries, e.g. semantic wikis (Semantic MediaWiki), knowledge networking (Twine), embedded microcontent detection and reuse (Operator, Headup, Semantic Radar), social graph and data portability APIs (from Google and Facebook), etc. In an effort to consolidate and combine knowledge about existing efforts, we aim to educate readers about Social Web application areas and new avenues open to commercial exploitation in the Semantic Web. We shall give an overview of how the Social Web and Semantic Web can be meshed together.
2.1 Web 2.0 and the Social Web

One of the most visible trends on the Web is the emergence of the Web 2.0 technology platform. The term Web 2.0 refers to a perceived second generation of Web-based communities and hosted services. Although the term suggests a new version of the Web, it does not refer to an update of the World Wide Web technical specifications, but rather to new structures and abstractions that have emerged on top of the ordinary Web. While it is difficult to define the exact boundaries of what structures or abstractions belong to Web 2.0, there seems to be an agreement that services and technologies like blogs, wikis, folksonomies, podcasts, RSS feeds (and other forms of many-to-many publishing), social software and social networking sites, web APIs, web standards [1] and online web services are part of Web 2.0. Web 2.0 has not only been a technological but also a business trend: according to Tim O’Reilly [2]: ‘Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform’.

[1] http://www.webstandards.org/ (URL last accessed 2009-06-09)
[2] http://radar.oreilly.com/archives/2006/12/web-20-compact.html (accessed 2009-06-09)
Social networking sites such as Facebook (one of the world’s most popular SNSs), Friendster (an early SNS previously popular in the US, now widely used in Asia), orkut (Google’s SNS), LinkedIn (an SNS for professional relationships) and MySpace (a music and youth-oriented service) - where explicitly-stated networks of friendship form a core part of the website - have become part of the daily lives of millions of users, and have generated huge amounts of investment since they began to appear around 2002. Since then, the popularity of these sites has grown hugely and continues to do so. Boyd and Ellison (2007) have described the history of social networking sites, and suggested that in the early days of SNSs, when only the SixDegrees service existed, there simply were not enough users: ‘While people were already flocking to the Internet, most did not have extended networks of friends who were online’. A graph from Internet World Stats [3] shows the growth in the number of Internet users over time. Between 2000 (when SixDegrees shut down) and 2003 (when Friendster became the first successful SNS), the number of Internet users had doubled. Web 2.0 content-sharing sites with social networking functionality such as YouTube (a video-sharing site), Flickr (for sharing images) and Last.fm (a music community site) have enjoyed similar popularity.

The basic features of a social networking site are profiles, friend listings and commenting, often along with other features such as private messaging, discussion forums, blogging, and media uploading and sharing. In addition to SNSs, other forms of social websites include wikis, forums and blogs. Some of these publish content in structured formats, enabling them to be aggregated together.

A common property of Web 2.0 technologies is that they facilitate collaboration and sharing between users with low technical barriers – although usually on single sites (e.g. Technorati) or with a limited range of information (e.g. RSS, which we will describe later). In this book we will refer to this collaborative and sharing aspect as the ‘Social Web’, a term that can be used to describe a subset of Web interactions that are highly social, conversational and participatory. The term Social Web may also be used instead of Web 2.0, as it is clearer what feature of the Web is being referred to [4].

The Social Web has applications on intranets as well as on the Internet. On the Internet, the Social Web enables participation through the simplification of user contributions via blogs and tagging, and has unleashed the power of community-based knowledge acquisition, with efforts like Wikipedia demonstrating the collective ‘wisdom of the crowds’ in creating the largest encyclopaedia. One outcome of such websites, especially wikis, is that they can collectively produce more valuable knowledge than separated individuals can. In this sense, the Social Web can be seen as a way to create collective intelligence at a Web-scale level, following the ‘we are smarter than me’ principles [5] (Libert and Spector 2008). Similar technologies are also being used in company intranets as effective knowledge management, collaboration and communication tools between employees. Companies are also aiming to make social website users part of their IT ‘team’, e.g. by allowing users to have access to some of their data and by bringing the results into their business processes (Tapscott and Williams 2007).

[3] http://www.internetworldstats.com/emarketing.htm (URL last accessed 2009-06-09)
[4] http://en.wikipedia.org/wiki/Social_web (URL last accessed 2009-06-09)
[5] http://www.wearesmarter.org/ (URL last accessed 2009-06-09)
2.2 Addressing limitations in the Social Web with semantics

A limitation of current social websites is that they are isolated from one another like islands in the sea (Figure 2.1). For example, different online discussions may contain complementary knowledge and topics, segmented parts of an answer that a person may be looking for, but people participating in one discussion do not have ready access to information about related discussions elsewhere. As more and more social websites, communities and services come online, the lack of interoperability between them becomes obvious. The Social Web creates a set of single data silos or ‘stovepipes’, i.e. there are many sites, communities and services that cannot interoperate with each other, where synergies are expensive to exploit, and where reuse and interlinking of data is difficult and cumbersome.

The main reason for this lack of interoperation is that for most Social Web applications, communities and domains, there are still no common standards for knowledge and information exchange or interoperation available. RSS (Really Simple Syndication), a format for publishing recently-updated Web content such as blog entries, was the first step towards interoperability among social websites, but it has various limitations that make it difficult to use efficiently in such an interoperability context, as we will see later.

Another extension of the Web aims to provide the tools that are necessary to define extensible and flexible standards for information exchange and interoperability. The Scientific American article from Berners-Lee, Hendler and Lassila (Berners-Lee et al. 2001) defined the Semantic Web as ‘an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation’. The last couple of years have seen large efforts going into the definition of the foundational standards supporting data interchange and interoperation, and currently a quite well-defined Semantic Web technology stack exists, enabling the definition of metadata and associated vocabularies.
Fig. 2.1. Creating bridges between isolated communities of users and their data [6] (a four-panel illustration, panels i-iv)

[6] Images courtesy of Pidgin Technologies at http://www.pidgintech.com/
A number of Semantic Web vocabularies have achieved wide deployment – successful examples include RSS 1.0 for the syndication of information, FOAF for expressing personal profile and social networking information, and SIOC for interlinking communities and distributed conversations. These vocabularies share a joint property: they are small, but at the same time vertical, i.e. they are a part of many different domains. Each horizontal domain (e.g. e-health) would typically reuse a number of these vertical vocabularies, and when deployed the vocabularies would be able to interact with each other.

The Semantic Web effort is in an ideal position to make social websites interoperable by providing standards to support data interchange and interoperation between applications, enabling individuals and communities to participate in the creation of distributed interoperable information. The application of the Semantic Web to the Social Web is leading to the ‘Social Semantic Web’ (Figure 2.2), creating a network of interlinked and semantically-rich knowledge. This vision of the Web will consist of interlinked documents, data, and even applications created by the end users themselves as the result of various social interactions, and it is modelled using machine-readable formats so that it can be used for purposes that the
current state of the Social Web cannot achieve without difficulty. As Tim Berners-Lee said in a 2005 podcast [7], Semantic Web technologies can support online communities even as ‘online communities [...] support Semantic Web data by being the sources of people voluntarily connecting things together’. For example, social website users are already creating extensive vocabularies and semantically-rich annotations through folksonomies (Mika 2005a).

[7] http://esw.w3.org/topic/IswcPodcast (URL last accessed 2009-06-09)
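To make the shape of these small, vertical vocabularies concrete, the following hand-written Turtle sketch (URIs and literals invented for illustration) shows how a single forum post might be described by combining SIOC with Dublin Core terms, and how the posting account can be tied back to a FOAF person:

```turtle
@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# One forum post, the forum containing it, and the account that wrote it
<http://forum.example.org/post/42> a sioc:Post ;
    dcterms:title      "Getting started with semantic tagging" ;
    dcterms:created    "2009-05-20T10:30:00Z" ;
    sioc:has_container <http://forum.example.org/forum/semweb> ;
    sioc:has_creator   <http://forum.example.org/user/alice> .

<http://forum.example.org/forum/semweb> a sioc:Forum .

<http://forum.example.org/user/alice> a sioc:UserAccount ;
    sioc:account_of <http://example.org/people/alice#me> .   # the FOAF person behind the account
```

The division of labour is visible here: SIOC covers the discussion structure, Dublin Core the generic metadata, and FOAF the person, and the three vocabularies interlock through shared URIs.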
Fig. 2.2. The Social Semantic Web
Because a consensus of community users is defining the meaning, these terms are serving as the objects around which those users form more tightly-connected social networks. This goes hand-in-hand with solving the chicken-and-egg problem of the Semantic Web (i.e. you cannot create useful Semantic Web applications without the data to power them, and you cannot produce semantically-rich data without the interesting applications themselves): since the Social Web contains such semantically-rich content, interesting applications powered by Semantic Web technologies can be created immediately.
2.3 The Social Semantic Web: more than the sum of its parts

The combination of the Social Web and Semantic Web can lead to something greater than the sum of its parts: a Social Semantic Web (Auer et al. 2007, Blumauer and Pellegrini 2008) where the islands of the Social Web can be interconnected with semantic technologies, and Semantic Web applications are enhanced with the wealth of knowledge inherent in user-generated content.
In this book, we will describe various solutions that aim to make social websites interoperable, and which will take them beyond their current limitations to enable what we have termed Social Semantic Information Spaces [8].

[8] http://www2006.org/tutorials/#T13 (URL last accessed 2009-06-09)

Social Semantic Information Spaces are a platform for both personal and professional collaborative exchange with reusable community contributions. Through the use of Semantic Web data, searchable and interpretable content is added to existing Web-based collaborative infrastructures and social spaces, and intelligent use of this content can be made within these spaces - bringing the vision of semantics on the Web to its most usable and exploitable level. Some typical application areas for social spaces are wikis, blogs and social networks, but they can include any spaces where content is being created, annotated and shared amongst a community of users. Each of these can be enhanced with machine-readable data to not only provide more functionality internally, but also to create an overall interconnected set of Social Semantic Information Spaces. These spaces offer a number of possibilities in terms of increased automation and information dissemination that are not easily realisable with current social software applications:

- By providing better interconnection of data, relevant information can be obtained from related social spaces (e.g. through social connections, inferred links, and other references).
- Social Semantic Information Spaces allow you to gather all your contributions and profiles across various sites (‘subscribe to my brain’), or to gather content from your friend / colleague connections.
- These spaces allow the use of the Web as a clipboard for exchange between various collaborative applications (for example, by allowing readers to drag structured information from wiki pages into other applications, geographic data about locations on a wiki page could be used to annotate information on an event or a travel review in a blog post one is writing).
- Such spaces can help users to avoid having to express the same information repeatedly if they belong to different social spaces.
- Due to the rich semantic information available about users, their interests and their relationships to other entities, personalisation of content and interface input mechanisms can be performed, and innovative ways of presenting related information can be created.
- These semantic spaces will also allow the creation of social semantic mashups, combining information from distributed data sources that can themselves be enhanced with semantic information, for example, to provide the geolocations of friends in your social network who share similar interests with you.
- Fine-grained questions can be answered through such semantic social spaces, such as ‘show me all content by people both geographically and socially near to me on the topic of movies’ (a query sketch along these lines follows at the end of this section).
- Social Semantic Information Spaces can make use of emergent semantics to extract more information from both the content and any other embedded metadata.

There have been initial approaches in collaborative application areas to incorporate semantics in these applications with the aim of adding more functionality and enhancing data exchange - semantic wikis, semantic blogs and semantic social networks. These approaches require closer linkages and cross-application demonstrators to create further semantic integration both between and across application areas (e.g. not just blog-to-blog connections, but also blog-to-wiki exchanges). A combination of such semantic functionality with existing grassroots efforts such as OpenID [9] (a single sign-on mechanism) or OAuth [10] (an authorisation scheme) can bring the Social Web to another level. Not only will this lead to an increased number of enhanced applications, but an overall interconnected set of Social Semantic Information Spaces can be created.

[9] http://openid.net/ (URL last accessed 2009-06-09)
[10] http://oauth.net/ (URL last accessed 2009-06-09)
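As promised in the list above, here is a rough SPARQL sketch of such a fine-grained question. All URIs are invented for illustration, and matching people through an identical foaf:based_near place is a crude stand-in for real geospatial reasoning:

```sparql
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>
PREFIX sioc:    <http://rdfs.org/sioc/ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>

# Content about film created by people I know who are based near the same place as me
SELECT DISTINCT ?content
WHERE {
  <http://example.org/people/me#me> foaf:knows      ?person ;
                                    foaf:based_near ?place .
  ?person   foaf:based_near  ?place .
  ?account  sioc:account_of  ?person .
  ?content  sioc:has_creator ?account ;
            dcterms:subject  <http://dbpedia.org/resource/Film> .
}
```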
2.4 A food chain of applications for the Social Semantic Web

A semantic data ‘food chain’, as shown in Figure 2.3, consists of various producers, collectors and consumers of semantic data from social networks and social websites. Applying semantic technologies to social websites can greatly enhance the value and functionality of these sites. The information within these sites is forming vast and diverse networks which can benefit from Semantic Web technologies for representation and navigation. Additionally, in order to easily enable navigation and data exchange across sites, mechanisms are required to represent the data in an interoperable and extensible way. These are termed semantic data producers.

An intermediary step which may or may not be required is the collection of semantic data. In very large sites, this may not be an issue as the information in the site may be sufficiently linked internally to warrant direct consumption after production, but in general, many users make small contributions across a range of services which can benefit from an aggregate view through some collection service. Collection services can include aggregation and consolidation systems, semantic search engines or data lookup indexes.
Fig. 2.3. A food chain for semantic data on the Social Web
The final step involves consumers of semantic data. Social networking technologies enable people to articulate their social network via friend connections. A social network can be viewed as a graph where the nodes represent individuals and the edges represent relations. Methods from graph theory can be used to study these networks, and we refer to initial work by (Ereteo et al. 2008) on how social network analysis can consume semantic data from the food chain. Also, representing social data in RDF (Resource Description Framework), a language for describing web resources in a structured way, enables us to perform queries on a network to locate information relating to people and to the content that they create. RDF can be used to structure and expose information from the Social Web, allowing the simple generation of semantic mashups for both proprietary and public information. HTML content can also be made compatible with RDF through RDFa (RDF annotations embedded in XHTML attributes), thereby enabling effective semantic search without requiring one to crawl a new set of pages (e.g. the Common Tag [11] effort allows metadata and URIs for tags to be exposed using RDFa and shared with other applications). Interlinking social data from multiple sources may give an enhanced view of information in distributed communities, and we will describe applications to consume and exchange this interlinked data in future chapters.
11 http://commontag.org/ (URL last accessed 2009-07-07)
2.5 A practical Social Semantic Web

Applying Semantic Web technologies to social websites allows us to express different types of relationships between people, objects and concepts. By using common, machine-readable ways for expressing data about individuals, profiles, social connections and content, these technologies provide a way to interconnect people and objects on a Social Semantic Web in an interoperable, extensible way.

On the conventional Web, navigation of data across social websites can be a major challenge. Communities are often dispersed across numerous different sites and platforms. For example, a group of people interested in a particular topic may share photos on Flickr, bookmarks on del.icio.us and hold conversations on a discussion forum. Additionally, a single person may hold several separate online accounts, and have a different network of friends on each. The information existing on each of these websites is generally disconnected, lacking in semantics, and is centrally controlled by a single organisation. Individuals generally lack control or ownership of their own data.

Social websites are becoming more prevalent and content is more distributed. This presents new challenges for navigating such data. Machine-readable descriptions of people and objects, and the use of common identifiers, can allow for linking diverse information from heterogeneous social networking sites. This creates a starting point for easy navigation across the information in these networks. The use of common formats allows interoperability across sites, enabling users to reuse and link to content across different platforms. This also provides a basis for data portability, where users can have ownership and control over their own data and can move profile and content information between services as they wish. Recently there has been a push within the web community to make data portability (i.e. the ability for users to port their own data wherever they wish) a reality12.

Additionally, the Social Web and social networking sites can contribute to the Semantic Web effort. Users of these sites often provide metadata in the form of annotations and tags on photos, ratings, blogroll links, etc. In this way, social networks and semantics can complement each other. Already within online communities, common vocabularies or folksonomies for tagging are emerging through a consensus of community members.

In this book we will describe a variety of practical Social Semantic Web applications that have been enhanced with extra features due to the rich content being created in social software tools by users, including the following:

– The Twine application from Radar Networks is an example of a system that leverages both the explicit (tags and metadata) and implicit semantics (automatic tagging of text) associated with content items. The underlying semantic data can also be exposed as RDF by appending '?rdf' to any Twine URL.
12 http://www.dataportability.org/ (URL last accessed 2009-07-21)
– The SIOC vocabulary is powering an ecosystem of Social Semantic Web applications producing and consuming community data, ranging from individual blog exporters to interoperability mechanisms for collaborative work environments.
– DBpedia represents structured content from the collaboratively-edited Wikipedia in semantic form, leveraging the semantics from many social media contributions by multiple users. DBpedia allows you to perform semantic queries on this data, and enables the linking of this socially-created data to other data sets on the Web by exposing it via RDF (a minimal query sketch is given at the end of this section).
– Revyu.com combines Web 2.0-type interfaces and principles such as tagging with Semantic Web modelling principles to provide a reviews website that follows the principles of the Linking Open Data initiative (a set of best practice guidelines for publishing and interlinking pieces of data on the Semantic Web). Anyone can review objects defined on other services (such as a movie from DBpedia), and the whole content of the website is available in RDF, therefore it is available for reuse by other Social Semantic Web applications.

As Metcalfe's law states, the value of a network is proportional to the square of the number of nodes in the network. Metcalfe's law is strongly related to the network effect of the Web itself: by providing various links between people, social websites can benefit from that network effect, while at the same time the Semantic Web also provides links between various objects on the Web, thereby obeying this law (Hendler and Golbeck 2008). Therefore, by combining Web 2.0 and Semantic Web technologies, we can envisage better interaction between people and communities, as the global number of users will grow, and hence the value of the network. This will be achieved by (1) taking into account social interactions in the production of Semantic Web data, and (2) using Semantic Web technologies to interlink people and communities.
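Returning to the DBpedia example above, the sketch below shows the kind of semantic query it makes possible. It uses the SPARQLWrapper Python library (an assumed dependency, not something mandated by DBpedia itself), and the class and property names may differ between DBpedia releases:

from SPARQLWrapper import SPARQLWrapper, JSON  # third-party SPARQL client

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?film ?label WHERE {
        ?film a dbo:Film ;
              rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    # Each result links a socially-created Wikipedia topic to a dereferenceable URI.
    print(binding["label"]["value"], "-", binding["film"]["value"])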
3 Introduction to the Social Web (Web 2.0, social media, social software)

Web 2.0 is a widely-used and wide-ranging term (in terms of interpretations), made popular by Tim O'Reilly who wrote an article on the seven features or principles of Web 2.0. To many people, Web 2.0 can mean many different things. Most agree that it can be thought of as the second phase of architecture and application development for the Web, and that the related term 'Social Web' describes a Web where users can meet, collaborate, and share content on social spaces via tagged items, activity streams, social networking functionality, etc. There are many popular examples that work along this collaboration and sharing meme: MySpace, del.icio.us, Digg, Flickr, Upcoming.org, Technorati, orkut, 43 Things, and the Wikipedia.
3.1 From the Web to a Social Web

Since it was founded, the Internet has been used to facilitate communication not only between computers but also between people. Usenet mailing lists and bulletin boards allowed people to connect with each other and enabled communities to form, often around topics of interest. The social networks formed via these technologies were not explicitly stated, but were implicitly defined by the interactions of the people involved. Later, technologies such as IRC (Internet Relay Chat), web forums, instant messaging, blogging, social networking services, and even MMOGs or MMORPGs (massively multiplayer online [role playing] games) have continued the trend of using the Internet (and the Web) to build communities.

The structural and syntactic web put in place in the early 90s is still much the same as what we use today: resources (web pages, files, etc.) connected by untyped hyperlinks. By untyped, we mean that there is no easy way for a computer to figure out what a link between two pages means. Beyond links, the nature of the objects described in those pages (e.g. people, places, etc.) cannot be understood by software agents. In fact, the Web was envisaged to be much more (Figure 3.1). In Tim Berners-Lee's original outline for the Web in 1989, entitled 'Information Management: A Proposal'1, resources are connected by links describing the type of relationships between them, e.g. 'wrote', 'describes', 'refers to', etc. This is a precursor to the Semantic Web which we will come back to in the next chapter.
1 http://www.w3.org/History/1989/proposal.html (URL last accessed 2009-06-09)
Fig. 3.1. Adapted from ‘Information Management: A Proposal’ by Tim Berners-Lee
Over the last decade and a half, there has been a shift from just 'existing' or publishing on the Web to participating in a 'read-write' Web. There has been a change in the role of a web user from just a consumer of content to an active participant in the creation of content. For example, Wikipedia articles are written and edited by volunteers, Amazon.com uses information about what users view and purchase to recommend products to other users, and Slashdot moderation is performed by the readers.

Web 2.02 is a widely-used and wide-ranging term (certainly in terms of interpretations) made popular by Tim O'Reilly. O'Reilly defined Web 2.0 as 'a set of principles and practices that ties together a veritable solar system of sites that demonstrate some or all of those principles, at a varying distance from that core'. While this definition is quite vague, he defined seven features or principles of Web 2.0, to which some have added an eighth: the long tail phenomenon (i.e. many small contributors and sites outweighing the main players). Among these features, two points seem particularly important: 'the Web as a platform' and 'an architecture of participation'. Actually, in spite of the 2.0 numbering, this vision is close to the original idea of Berners-Lee for the Web, i.e. that it should be a participative medium.
2 http://tinyurl.com/7tcjz (URL last accessed 2009-06-09)
For example, the first Web browser, called WorldWideWeb3, was already a read-write browser, while current ones are generally read-only.

The first idea from O'Reilly, 'the Web as a platform', considers the Web and its principles as a way to provide services and value-added applications in addition to generally static content. In some cases, the Web can even be seen as a transit layer for information to the desktop or mobile devices, for example using RSS. We can also consider that 'the Web as a platform' refers to the migration of traditional desktop services such as e-mail and word processing to web-based applications, for example as provided by Google with Gmail and Google Docs. In that context, the vision of 'an architecture of participation' emphasises how applications can help to produce value-added content and synergies simply by being used, thanks to the way they were designed. As people begin to use Web 2.0 applications for their own needs (uploading pictures, writing blog posts, tagging content), they enhance the global activity of the system and this can be a benefit for everyone. O'Reilly hence makes a comparison with open-source development principles and peer-to-peer architectures in relation to how they provide the same kind of architecture of participation.

The evolution of the Web is - in our opinion - mostly a sociological and economic one, as referred to in the book 'Wikinomics' (Tapscott and Williams 2007). However, thanks to the strong interactions between services and users, it has led to interesting practices in terms of software development. O'Reilly in particular encourages application developers to go further than they would in traditional development processes and to constantly deliver new features, leading to 'the perpetual beta' and the view that 'users must be treated as co-developers'. Agile development methods are therefore becoming popular on the Web, as well as languages that adhere to such software development principles (e.g. Ruby on Rails).
Fig. 3.2. The Social Web in simple terms: users, content, tags and comments

3 http://www.w3.org/People/Berners-Lee/WorldWideWeb.html (URL last accessed 2009-06-09)
While some describe Web 2.0 simply as a second phase of architecture and application development for the World Wide Web, others mainly think of it as a place where 'ordinary' users can meet, collaborate, and share content using social software applications on the Web - via tagged items, social bookmarking, AJAX functionality, etc. - hence the term 'the Social Web'. The Social Web is a platform for social and collaborative exchange with reusable community contributions, where anyone can mass publish using web-based social software, and others can subscribe to desired information, news, data flows, or other services via syndication formats such as RSS. There are many popular examples that work along this collaboration and sharing meme: Twitter, del.icio.us, Digg, Flickr, Technorati, orkut, 43 Things, Wikipedia, etc.

It is 'social software' that is being used for this communication and collaboration, software that 'lets people rendezvous, connect or collaborate by use of a computer network. It results in the creation of shared, interactive spaces.'4 With the Social Web, all of us have become participants, often without realising the part we play on the Web - clicking on a search result, uploading a video or social network page - all of this contributes to and changes this Social Web infrastructure. There may be different motivations for leveraging social websites, from personal expression to political campaigning (e.g. Barack Obama's presidential campaign raised 87% of its funds through social websites5).

Social websites provide access to community-contributed content that is posted by some user and may be tagged and commented upon by others (Figure 3.2). That content (termed 'social media') can be virtually anything: blog entries, message board posts, videos, audio, images, wiki pages, user profiles, bookmarks, events, etc. Users post and share content items with others; they can annotate content with tags; they can browse related content via tags; they may often discuss content via comments; and they may connect to each other directly or via posted content.

Social websites that share content are covered in some part by the Digital Millennium Copyright Act (DMCA)6. It provides a safe harbour if a service cannot reasonably prevent anything and everything from being uploaded (and is unaware of it). The user agreements of most social websites usually request that users do not add other people's copyright material. Fair use is usually permitted, such that if one shares something copyrighted they should use an extract with a link to the main content.

There are a variety of figures available for the ratio of social media contributors versus casual browsers or lurkers7 on social websites. CNET's News.com8 site says: 'A recent Hitwise study indicates that as few as 4 percent of Internet users actually contribute to sites like YouTube and Flickr, and more than 55 percent are

4 http://en.wikipedia.org/w/index.php?title=Social_software&oldid=26231487 (accessed 2009-06-09)
5 http://tinyurl.com/4t2r6h (URL last accessed 2009-06-09)
6 http://www.copyright.gov/legislation/dmca.pdf (URL last accessed 2009-06-09)
7 http://www.tiara.org/blog/?p=272 (URL last accessed 2009-06-09)
8 http://tinyurl.com/lrw3l9 (URL last accessed 2009-06-09)
men. [...] To be the mainstream trend (that it [Web 2.0] deserves to be), it must evolve from the currently small group of people who are creating and filtering our content to a position where the ‘everyman’ is embraced.' The UK technology site vnunet.com9 mentions: 'Bill Tancer, general manager of Hitwise, said that the company's data showed that only a tiny fraction of users contributed content to community media sites. Just 0.16% of YouTube users upload videos, and only 0.2% of Flickr users upload photos. Wikipedia returned a more reasonable percentage, with 4.6% of visitors actually editing and adding information.' We can deduce that the percentage of contributors may include those who upload content items (videos, images, etc.) as well as those who comment on that content.
3.2 Common technologies and trends

We shall now describe some of the common technologies used and other trends in social websites, including RSS, AJAX, mashups, content delivery, and advertising models. Future chapters will describe typical usages of these features, including blogging, wiki-based collaborations, and social networking.
3.2.1 RSS

As we will see in this book, Social Web principles allow people to publish information more often and more easily. Consequently, there is a need for readers to know where to get new and pertinent information and how to consume it. Content syndication aims to solve that issue by providing a website with the means to automatically deliver the latest content from blogs, wikis, forums or news services in computer-readable feeds that can be reused and subscribed to by other people and systems. For example, news content from newspapers is often syndicated so that headlines can be read by people in their own feed reader programs or integrated into their own websites. Rather than mass spamming via e-mail, interested parties can subscribe to feeds to be notified about changes or updates to information (self service). A common syndication format can have many uses, including connecting services together, 'mashing' together of data, etc.

Before syndication, monitoring information relied on semi-regular visits to bookmarked sites, which was far from accurate. Now, feed aggregators or readers allow you to check multiple blog or news feeds on a regular basis, and you can choose to view only new or updated posts since your last access. You can pull information from sites and put it directly into your desktop (Thunderbird) or

9 http://tinyurl.com/ksdwkw (URL last accessed 2009-06-09)
browser application (Google Reader, Bloglines), allowing you to quickly scan a human-readable view of multiple feeds for relevant content. Intelligent pushing of feeds (e.g. with 'pingback') can also be facilitated to update content immediately on aggregator sites (e.g. PlanetPlanet) or other search and navigation applications. Content syndication is thus a first step towards the Semantic Web as it provides interoperability between applications. We shall see later on that it is somewhat limited in terms of achieving the complete goal.

In order to define a standard for modelling such information feeds, various formats such as NewsML10 were proposed in the late 90s. The most commonly adopted syndication format is 'RSS', which has various meanings (Really Simple Syndication, Rich Site Summary and RDF Site Summary) and comes in different flavours (currently there are eight variations). Some of the variations are from private organisations (0.9 by Netscape), some of them are closed (2.0), and some of them are from open consortiums (1.0). However, they all share the same basic principles: the latest articles, with hyperlinks, titles and summaries, are syndicated using a computer-readable format (XML or RDF). In general, one does not have to worry about which feed format a blog or website provides, because practically any aggregator or news reader will be able to read it anyway. From the Semantic Web perspective, the RSS 1.0 variant (in RDF) allows us to combine syndicated articles with metadata from other vocabularies such as FOAF or Dublin Core.
Fig. 3.3. Content on a blog being published as RSS
The RSS feed structure (as shown in Figure 3.3) is as follows:

Class 'channel':
– Properties 'title', 'link', 'description'
– Contains 'items'

Class 'item':
– Properties 'title', 'link', 'description', 'date', 'creator', etc.

10 http://newsml.org/ (URL last accessed 2009-06-09)
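As an illustration of this structure (and not part of any RSS specification text), the following sketch builds a small RSS 1.0 feed in RDF using the Python rdflib library; the feed URL, item and author are invented:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

RSS = Namespace("http://purl.org/rss/1.0/")
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
channel = URIRef("http://example.org/blog/rss")
post = URIRef("http://example.org/blog/2009/06/hello-world")

# The channel carries a title, link and description, and points to its items
# (simplified here: RSS 1.0 normally groups items in an rdf:Seq).
g.add((channel, RDF.type, RSS.channel))
g.add((channel, RSS.title, Literal("Example blog")))
g.add((channel, RSS.link, Literal("http://example.org/blog/")))
g.add((channel, RSS.description, Literal("Latest posts from an example blog")))
g.add((channel, RSS.items, post))

# Each item has its own title, link and description, plus Dublin Core metadata.
g.add((post, RDF.type, RSS.item))
g.add((post, RSS.title, Literal("Hello world")))
g.add((post, RSS.link, Literal(str(post))))
g.add((post, RSS.description, Literal("A first post")))
g.add((post, DC.date, Literal("2009-06-09")))
g.add((post, DC.creator, Literal("Alice")))

print(g.serialize(format="xml"))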
The strength of RSS is in its generality, but therein lies its weakness: when one is subscribed to multiple channels or items, there is no easy way to group different types of content based on the available metadata. RSS is used for more than just blog headlines and news syndication, having applications in libraries (e.g. to announce new book acquisitions), shared calendars (RSSCalendar.com), recipe clubs, etc. Executives in many corporations are also starting to mandate what RSS feeds they want their companies to provide.

Similar to RSS is the Atom Syndication Format11, an XML format and recent IETF standard that is also commonly used for syndicating web feeds (e.g. from Blogger.com). The lack of unification between RSS formats is one of the reasons that led to Atom being created. The Atom Publishing Protocol12 (APP or AtomPub for short) is related to this, being a simple HTTP-based protocol for creating and modifying web resources; the specification was edited by Joe Gregorio and Bill de hÓra.

One important aspect of content syndication is that it not only enables a user to control the consumption of information: through Web 2.0-type services, the user can also control its production (i.e. they can control when and where it must be delivered, contrary to traditional mailing list subscriptions, for example).
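A basic feed reader along the lines described above can be sketched in a few lines of Python with the feedparser library (an assumed third-party dependency; the feed URLs are placeholders):

import time
import feedparser  # third-party library that parses both RSS and Atom feeds

feeds = [
    "http://example.org/blog/rss",       # placeholder feed URLs
    "http://example.net/news/atom.xml",
]

# Only show entries newer than the reader's last visit.
last_checked = time.struct_time((2009, 6, 1, 0, 0, 0, 0, 0, 0))

for url in feeds:
    parsed = feedparser.parse(url)
    for entry in parsed.entries:
        published = entry.get("updated_parsed") or entry.get("published_parsed")
        if published and published > last_checked:
            print(parsed.feed.get("title", url), "-",
                  entry.get("title"), entry.get("link"))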
3.2.2 AJAX

AJAX, standing for Asynchronous JavaScript and XML, is a method for creating interactive web applications whereby data is retrieved from a web server asynchronously, without interrupting the display of the currently-viewed web page. AJAX has won over much of the Web due to the seamless interaction it provides, and website developers have voted with their feet by deploying it on their sites. As an example of AJAX in use, Google Maps retrieves surrounding map image tiles for a map being displayed on screen, so that when one moves in any direction, the new map can be displayed without reloading the browser window.

One of the challenges with AJAX is that the source code is often visible to anyone with a web browser. It is therefore crucial to ensure that an AJAX application never executes JavaScript code from external, untrusted sources. Since browsers were not initially designed for AJAX-type methods, it has also taken some years for browsers to become solid, productive AJAX containers. With the emergence of JIT (just-in-time) compilation technology, browsers running AJAX code will soon be able to operate at least two to three times faster.
11 http://tools.ietf.org/rfc/rfc4287.txt (URL last accessed 2009-06-09)
12 http://tinyurl.com/yscdv9 (URL last accessed 2009-06-09)
There are currently 50 or 60 AJAX development toolkits, but many believe that the Web industry should rally around a smaller number, especially open-source technologies which offer long-term portability across all the leading platforms. According to Scott Dietzen, president and chief technical officer of Zimbra, their web-based e-mail application is one of the largest AJAX-based web applications (with thousands of lines of JavaScript code)13, and there are more than 11,000 participants in the Zimbra open-source community.

There are some common techniques for speeding up AJAX applications. Firstly, code should be combined wherever possible. Then, pages are compressed to shrink the required bandwidth for smaller pipes. The next method is caching, which avoids browsers having to re-fetch the JavaScript and re-interpret it (e.g. by including dates for when the JavaScript files were last updated). The last and most useful technique is 'lazy loading': where a very large JavaScript application lives on a single page, it can be broken up into several modules that are loaded on demand, reducing the time from when one can first see an application to when one can start using it.

However, while AJAX generally aims to provide user-friendly and intuitive interfaces, it can also lead to some usability issues. For example, pages rendered via AJAX cannot generally be bookmarked, as they do not have a proper URL of their own and often reuse the homepage URL instead.
3.2.3 Mashups

'Mashups' (services that combine content from more than one source into an integrated experience, often with new browsing and visualisation capabilities such as geolocation) are also becoming more common in social websites, and the recent Pipes service14 from Yahoo! illustrates just some of the possibilities offered by combining RSS feeds with data and functionality from other sources. A mashup is a web application that combines data from multiple sources into a single integrated tool. The term mashup can apply to composite applications, gadgets, management dashboards, ad-hoc reporting mechanisms, spreadsheets, data migration services, social software applications and content aggregation systems.

In the mashup space, companies are either operating as mashup builders or mashup infrastructure players. According to ProgrammableWeb15, there are now around 400 to 500 mashup APIs available, but there are 140 million websites according to NetCraft, so there is a mismatch in terms of the number of services available to sites.
13 http://tinyurl.com/scottdietzen (URL last accessed 2009-06-09)
14 http://pipes.yahoo.com/ (URL last accessed 2009-06-09)
15 http://www.programmableweb.com/ (URL last accessed 2009-06-09)
The main value of mashups is in combining data. For example, HousingMaps, a mashup of Google Maps and data from craigslist (Figure 3.4), was one of the first really useful mashups. One of the challenges with mashups is that they are normally applied to all items in a data set, but if you are looking for a house, you may want a mashup that allows you to filter by things like school district ratings, fault lines, places of worship, or even by proximity to members of your Facebook or MySpace social network.
Fig. 3.4. The HousingMaps website integrates online accommodation data with a geographical mapping service
Mashups are also being used in business automation to automate internal processes, e.g. to counteract the time wasted by 'swivel-chair integration' where someone is moving from one browser on one computer to another window and back again to do something manually. Content migration via mashups has been found to be more useful than static migration scripts since they can be customised and controlled through a web interface. Rod Smith says16 that mashups allow content to be generated from a combination of rich interactive applications, do-it-yourself applications plus the current 'scripting renaissance' (e.g. as described in the previous section on AJAX).

According to Joe Keller, marketing officer with Kapow17, the three components of a mashup are the presentation layer, the logic layer, and the data layer - i.e. access to fundamental or value-added data. Fundamental data includes structured data, standard feeds and other data that can be subscribed to, basically, data that is open to everyone. The value-added

16 http://2006.blogtalk.net/Main/RodSmith (URL last accessed 2009-06-09)
17 http://www.slideshare.net/schee/s18 (URL last accessed 2009-06-09)
data is more niche: unstructured data, individualised data, vertical data, etc. The appetite for data collection is growing, especially around the area of automation to help organisations with this task. The amount of user-generated content available on the Social Web is a goldmine of potential mashup data, enabling one to create more meaningful time series that can be mashed up quickly into applications. However, Keller claims that the primary obstacle to the benefit of value-added data is the lack of standard feeds or APIs for this data. We shall discuss in Chapter 12 how the Semantic Web can help with this problem and can help to enhance mashup development.
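The core of the mashup idea - joining data from more than one source - can be shown with a small, self-contained Python sketch; the listings and district ratings below are invented, standing in for what a real mashup would fetch from separate services or feeds:

# House listings from one (invented) source...
listings = [
    {"address": "12 Oak St", "price": 250000, "district": "north"},
    {"address": "7 Elm Ave", "price": 310000, "district": "south"},
    {"address": "3 Pine Rd", "price": 280000, "district": "north"},
]

# ...and school-district ratings from another, keyed by a shared value.
district_ratings = {"north": 8.5, "south": 6.0}

# 'Mash' the two sources together and filter, as a house hunter might.
for home in listings:
    rating = district_ratings.get(home["district"])
    if rating is not None and rating >= 7.0 and home["price"] <= 300000:
        print(home["address"], home["price"], "school rating:", rating)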
3.2.4 Advertising

With the advent of Web 2.0, web-based advertising is often classified into three categories: banners and rich media, list-type advertisements, and mobile advertising (i.e. a combination of banner and ad lists grouped together on a mobile platform), according to Rie Yamanaka, director with Yahoo! Japan's commercial search subsidiary Overture KK18. Ad lists are usually quite accurate in terms of targeting since they are shown and ranked based on a degree of relevance to what a user is looking at or for. The focus has also shifted from TV and radio advertising towards Internet-based advertising, which is growing exponentially, primarily driven by ad lists and mobile ads.

In terms of metrics, Internet-based ads have traditionally been classified in terms of what one wants to achieve. For banner ads (which many think of as being very 'Web 1.0'-like), the number of impressions is key (e.g. if one is advertising a film, the volume of graphic ads displayed is most important) and charges are based on what is termed the CPM (cost per mille, or thousand). However, ad lists (as shown in search results where the aim is to get a full web page on the screen) are focussed more on rankings and the CPC (cost per click), and are often associated with Web 2.0 where the fields of SEO (search engine optimisation) and SEM (search engine marketing) come into play. Another term that is now becoming more important is the CPA (cost per acquisition), i.e. how much it costs to acquire a customer.

Four trends (with associated challenges) are quite important in the field of web-based advertising: the first is increased traceability (i.e. how one can track and keep a log of who did what); the next is behavioural or attribute-based targeting (over one-third of websites are now capable of behavioural targeting according to Advertising.com19); the third is APIs for advertising (interfacing with traditional business workflows); and the last is the integration between offline and online media (where the move to search for information online is becoming prevalent).

18 http://tinyurl.com/rieyamanaka (URL last accessed 2009-06-09)
19 http://www.docstoc.com/docs/1748110/publisher-survey-07 (last accessed 2009-06-09)
1. With traceability, one can get a list of important keywords in searches that result in subsequent clicks, with the ultimate aim of increasing revenues. Search engine marketing (SEM) can also be used to help eliminate the loss of opportunities that may occur through missed clicks. The greatest challenge in the world of advertising is figuring out how much in total, or how much extra, a company makes as a result of advertising (based on what form of campaign is used). If one can figure out a way to link sales to ads, e.g. through internet conversion where one can trace when a person moves onwards from an ad and makes a purchase, then one can get a measure of the CPA. On the Web, one can get a traceable link from an ad impression to an eventual deal or transaction (through clicking on something, browsing, getting a lead, and finding a prospect). One can also compare targeted results and what a customer did depending on whether they came from an offline reference (e.g. through custom URLs for offline ads) or directly online. For companies who are not doing business on the Web, it is harder to link a sale to an ad (e.g. if someone wants to buy a Lexus, and reads reference material on the Web, they may then go off and buy a BMW without any traceable link).
2. Behavioural targeted advertising, based on a user's search history, can give advertisers a lot of useful information. One can use, for example, information on gender (i.e. static details) or location (i.e. dynamic details, perhaps from an IP address) for attribute-based targeting. This can also be used to provide personalised communication methods with users, so that very flexible products can be deployed as a result. Spend on behavioural targeted advertising is continuing to grow at a significant rate due to a combination of greater advertiser acceptance and greater publisher support. By 2011, 'very large publishers will be selling 30% to 50% of their ad inventory using this [behavioural targeting] technique', according to Bill Gossman, CEO of Revenue Science20.
3. APIs for advertising can be combined with core business flows, especially when a company provides many products, e.g. Amazon.com or travel services. For a large online retailer, there can be logic that will match a keyword with the current inventory, and the system will hide certain keywords if associated items are not in stock. This is also important in the hospitality sector, where for example there should be a change in the price of a product when it goes past a best-before time or date (e.g. hotel rooms normally drop in price after 9 PM). With an API, one can provide highly optimised ads that cannot be created on-the-fly by people. Advertisers can therefore take a scientific approach to dynamically improving their offerings in terms of cost and sales.
4. Matching online information to offline ads, while not directly related to Web 2.0, is important too. Web 2.0 is about personalisation, and targeting internet-based ads towards segmented usergroups is of interest, and so there is a need to find the best format and media to achieve this. If one looks at TV campaigns, one can analyse information about how advertising the URL for a particular

20 http://www.emarketer.com/Article.aspx?R=1004989 (URL last accessed 2009-06-09)
brand can lead to people visiting the associated website. Some people may only visit a site after seeing an offline advertisement, so there can be a distinct message sent to these types of users. If a TV ad shows a web address, it can result in nearly 2.5 times more accesses than could be directly obtained via the Internet (depending on the type of products being advertised), so one can attract a lot more people to a website in this way. There is a lot of research being carried out into how to effectively guide people from offline ads to the Web, e.g. by combining campaigns in magazines with TV slots. It depends on what service or product a customer should get from a company, as this will determine the type of information to be sent over the Web and whether giving a good user experience is important (since you may not want to betray the expectations of users and what they are looking for). Those in charge of brands for websites need to understand how people are getting to a particular web page when there are many different entry points to a site. It is also important to understand why customers who watch TV are being invited onto the Web: whether it is for government information, selling products, etc. The purpose of a 30-second advert may actually be to guide someone to a website where they will read material online for more than five minutes. In the reverse direction (i.e. using online information to guide offline choices), there are some interesting statistics. According to comScore21, pre-shoppers on the Web will spend 41% more in a real store if they have seen internet-based ads for a product (and for every dollar that pre-shoppers spend online, they may spend an incremental $6 in-store22).

Since much of the information in social websites contains inherent semantic structures and links, advertising campaigns can be created that will focus on certain topics or profile information. The semantic graph categorises people, places, organisations, products, companies, events, and other objects, and defines the relationships among them. Users can define new profile categories and add metadata to these categories that can help improve the relevance of advertising engines. That metadata can then be used to personalise advertising content and provide targeted solutions to advertisers.
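The three pricing metrics introduced earlier in this section can be related through a small worked example (every figure below is invented purely for illustration):

impressions = 200_000        # times a banner ad was shown
clicks = 1_200               # clicks the ad received
acquisitions = 30            # customers won as a result
total_spend = 1_500.00       # advertiser's total spend

cpm = total_spend / (impressions / 1_000)   # cost per mille (thousand impressions)
cpc = total_spend / clicks                  # cost per click
cpa = total_spend / acquisitions            # cost per acquisition

print(f"CPM: {cpm:.2f}  CPC: {cpc:.2f}  CPA: {cpa:.2f}")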
3.2.5 The Web on any device

There has been a gradual move from the Web running as an application on various operating systems and hardware to the Web itself acting as a kind of operating system (e.g. Google Chrome OS23), where a variety of applications can now run within web browsers across a range of hardware platforms. Due to the range of

21 http://tinyurl.com/n2wbvp (URL last accessed 2009-06-09)
22 http://us.i1.yimg.com/us.yimg.com/i/adv/research/robo5_final.pdf (accessed 2009-06-09)
23 http://tinyurl.com/mkt6lv (accessed 2009-07-21)
web-based applications available, browser software is essential for a range of computing systems including the desktop, mobile phones and other devices (e.g. the Nintendo Wii, the iPhone and the One Laptop Per Child24 $100 laptop). For example, the Opera Mini browser for mobiles is a small (100 kB) Java-based browser, where processing of pages takes place via a proxy on a fixed network machine, and a compressed page is then sent to the browser.

As the Web has moved from text pages towards multimedia files that can be accessed via portable devices, there is a corresponding need to be able to handle these new media types on the Web. According to Håkon Wium Lie, one of the prime architects of CSS (Cascading Style Sheets) and the chief technical officer of Opera Software25, video needs to be made into a 'first-class citizen' of the Web. At the moment, it takes a lot of 'black magic', third-party plugins and object tags before you can get video to work in a browser for most users. There are two problems that need to be solved. The first is how videos are represented in markup. Some have proposed that a <video> element be added to the forthcoming HTML5 specification. The second problem is in relation to a common video format. Håkon Wium Lie says that the Web needs a baseline format based on an open standard (e.g. Ogg Theora, which is free of licensing fees), and in HTML5 there could be a soft requirement or recommendation to use this format. SVG effects (overlays, reflections, filters, etc. created in the Scalable Vector Graphics format) can also be combined with these video elements, and some browsers can access the 3-D engines of graphics card hardware to render bitmaps onto vector surfaces for rotations or other graphical animations. HTML5 will include new parsing rules, new media elements, some semantic elements (section, article, nav, aside), and some presentational elements will be removed (center, font).

CSS can also be used to control how the Web is displayed on different devices, where screen real estate may be more limited. The CSS Zen Garden26 allows people to take a boring document and test how their stylesheets will alter its look. CSS has a number of properties for handling fonts and text on the Web, and different styles may be more appropriate for different devices. Browsers have around ten fonts that can be viewed on most platforms (usually Microsoft's core free fonts), but there are many more fonts out there (e.g. there are 2500 font families available on Font Freak). In CSS2, you can import a library of fonts, so that fonts residing on the Web can be used across a range of devices in the future. There is also the Acid2 test which can be used to check how well web pages display on a variety of browsers and devices. Acid2 consists of a single web page where every element is positioned using some CSS or HTML code with some PNGs, and if a browser renders it correctly, it should show a smiley face (but rarely does).

24 http://laptop.org/ (URL last accessed 2009-07-16)
25 http://tinyurl.com/hakonwiumlie (URL last accessed 2009-06-09)
26 http://www.csszengarden.com/ (URL last accessed 2009-07-16)
A variety of convergences are taking place between other software applications and hardware systems. E-mail has made great progress in becoming part of the web experience (Hotmail, Gmail, Yahoo! Mail, etc.). The same thing is now happening to IM (instant messaging), to VoIP (voice over IP), to calendars, etc. For example, a presence indicator next to an e-mail inbox shows if each user is available for an IM or a phone call. In the reverse direction, if someone tries to call or IM you, you can push back and say that you just want them to e-mail you because you are not available right now. Being able to prioritise communications based on who your boss is, who your friends are, etc., is a crucial aspect of harnessing the power of ubiquitous computing. On voice, you often want to be able to see your call logs, using these to click on a person and call them again, but you may also want to forward segments from that voice call over e-mail or IM. Internet-enabled mobile phones and mobile applications of Web services, especially microblogging services such as Twitter – that we will describe later on in this book – also augment this process of ubiquitous computing, or more specifically ubiquitous social networking. More and more people tend to share their locations or their activities as a live stream of their life (‘lifestream’), and all in real time.
3.2.6 Content delivery

The Web has moved beyond a means for distributing just textual data and still images. As video becomes that aforementioned first-class citizen on the Web, new paradigms for sharing and accessing videos are required, due both to the expectations of users in terms of social website usage conventions and to the sheer volume of data involved. The market for IP video via the Web and the Internet is huge, and a recent Cisco report called the 'Exabyte Era' shows that P2P (peer to peer), which currently accounts for 1014 PB of traffic each month, will continue to rise with a 35% year-over-year growth rate. User-contributed computing has taken off, and is delivering over half of all Internet traffic today. P2P video is now being accessed via the Web, with companies like Joost (a multichannel online TV service) moving from standalone applications to browser-embedded solutions.

According to Eric Klinker, chief technical officer for content-delivery network BitTorrent Inc.27, a new order of magnitude has arrived: the exabyte (EB). One exabyte is 2^60 bytes, roughly one billion gigabytes. If you wanted to build a website that would deliver 1 EB per month, you would need to be able to transfer at a rate of 3.5 Tb/s (assuming 100% network utilisation). 1 EB corresponds to 292,000 years of online TV (stream encoded at 1 Mb/s), 5,412 years of Blu-ray video (maximum standard 54 Mb/s), 29 years of all online radio traffic, 1.7 years of YouTube traffic, or just one month of P2P traffic. If you had a central service

27 http://tinyurl.com/ericklinker (URL last accessed 2009-06-09)
and wanted to deliver 1 EB of data, you would need about 6.5 Tb/s of peak bandwidth, and 70,000 servers requiring about 60-70 megawatts in total. At a price of $20 per Mb/s, it would cost about $130 million to run per month! The 'Web 2.0' way is to use peers to deliver that exabyte of data. However, not every business is ready to be governed entirely by their userbase, and sometimes a hybrid-model approach is required (e.g. 55 major studios have made 10,000 titles available via BitTorrent.com).

By leveraging the Web 2.0 nature of distributed computing, we can enable many things that would not or could not be achieved otherwise. For example, Electric Sheep is a distributed computing application that renders a single frame on your machine for a 30-second long screensaver, which you can then use. Social networks also require many machines, but the best example of distributed computing is web search. Google has an estimated 500,000 to 1 million servers, corresponding to $4.5B in cumulative capex (capital expenditure) or 21% of their Q2 net earnings (according to Morgan Stanley). And yet, search is still not a great experience today, since you often have a hard time finding what you want. Search engines are not contextual, they do not see the whole Internet (e.g. the 'dark web' consists of content that is not indexed by search engines), they are not particularly well personalised or localised, and they sometimes are not dynamic enough (i.e. they cannot keep up with the content from social websites).

The best Web 2.0 applications have involved user participation, with users contributing to all aspects of the application (including infrastructure). 'Harness the power of participation, and multiply your ability to deliver a rich and powerful application', says Eric Klinker. Developers need to consider how users can do this (not only through contributed content or code, but also through computing power).
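These figures can be checked with a quick back-of-the-envelope calculation (a sketch only: it assumes a 30-day month and the decimal exabyte of 10^18 bytes):

exabyte_bits = 10**18 * 8                # 1 EB expressed in bits
seconds_per_month = 30 * 24 * 3600

sustained_bps = exabyte_bits / seconds_per_month
print(f"Sustained rate: {sustained_bps / 1e12:.1f} Tb/s")    # roughly 3 Tb/s

# At around 6.5 Tb/s of peak capacity priced at $20 per Mb/s per month:
peak_capacity_mbps = 6.5e6
monthly_cost = peak_capacity_mbps * 20
print(f"Monthly bandwidth bill: ${monthly_cost / 1e6:.0f} million")   # about $130 million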
3.2.7 Cloud computing

According to Kai-Fu Lee, Vice President of Engineering at Google and President of Google Greater China28, cloud computing can provide many of the features that users of the Web now expect: accessibility, shareability, freedom (i.e. their data wherever they are), simplicity, and security. Data is stored in the 'Cloud', on some server somewhere that is not necessarily known by the user, but it is 'just there' and accessible. As an analogy, banks too have become 'Clouds', allowing people to go to any ATM and remove money from their bank wherever they are. Electricity can be thought of in a similar fashion, as it can come from various places, and you do not have to know where it comes from: it just works. Software and services are also moving to the Cloud, usually accessible via a fully-featured web browser on a client device. Finally, the Cloud should be accessible from any device, especially from phones. When the

28 http://www.youtube.com/watch?v=gIjBrxbdSPY (URL last accessed 2009-06-09)
Apple iPhone was released, Google found that web usage from that device was 50 times greater than that from other web-capable phones.

Where the PC era was hardware centric and the client-server era was more software centric (making it suitable for enterprise computing), cloud computing is more service centric: abstracting the server, making it very scalable, hiding complexities, and allowing the server to be anywhere. Three key requirements for making cloud-based computing a reality are now in place: the falling cost of storage, ubiquitous broadband, and the democratisation of the tools of production. These forces are also allowing cloud-based computing to become more like a utility, and much of this is due to IBM and DEC's pioneering work in the 1990s on making computing itself a utility. Kai-Fu Lee says that there are six important properties of cloud computing: user centric; task centric; powerful; accessible; intelligent; and programmable.

1. User centric means that both the data and the application move with you. People do not want to reinstall their address books or their applications on different machines as it is painful to do it. If someone drops or breaks their laptop, they will be anxious about potentially losing data, and we also know how difficult it is to do something as simple as switching mobile phones. It is hard because synchronising data is usually complicated. For example, the infrared (IR) functionality on a mobile phone is not easy to use or user centric: how often do people use IR to back up data to their laptops? If the data is stored in the Cloud instead - images, messages, whatever - once you are connected to the Cloud, any new PC or mobile device that can access your data or that allows you to create data becomes yours, even if the device itself cannot store all of your data. Not only is the data yours, but you can share it with others: you do not have to worry about where the data is. PCs are normally our window to the world, but mobile devices can do more. There are around three billion mobile phone users worldwide, dwarfing the number of PCs that are Internet-accessible. Since mobile services know who you are and often where you are, they can give you more targeted content. Intelligent mobile search is extremely useful, giving local listings and results relevant to context. The most powerful and popular application is the aforementioned mapping with mashups - especially when people get lost or spontaneously want to go somewhere - with mapping applications allowing users to search for relevant attractions nearby or even see real-time traffic flows, etc. As there is a move from e-mail usage towards photo sharing or mapping applications, these are moving into the Cloud as well.
2. There is a move towards task-centric computing where the applications of the past - spreadsheets, e-mail, calendars - are becoming modules, and they can be composed and laid out in a task-specific manner. For example, the task may be multiple teachers collaborating on the creation of a departmental curriculum, where one can see a list of the users who are currently viewing the curriculum spreadsheet, and those teachers can have real-time debates via chat in parallel with the curriculum development. Spreadsheet editing allows collaboration and publishing between a selected group of people, with associated version control mechanisms.
3. Having many computers in the Cloud means that powerful computing tasks can be carried out that a single personal computer cannot perform. For example, search engines work faster than searching in desktop applications such as Thunderbird or Word. Of course, web search has to be much faster even though there are many more documents: for the storage, if there are 100 billion pages at 10 kB per page, this corresponds to about 1000 TB of disk space. Cloud computing should therefore have an infinite amount of disks and computation at its disposal. When a query is issued to a web search engine like Google, it queries at least 1000 machines (potentially accessing thousands of terabytes of data).
4. Accessibility to diverse data types in different clouds can be achieved through universal search. For example, if you want to do a specific type of search, for restaurants, images, etc. in a particular location, PageRank-type web search may not necessarily be the best option. It is difficult for most people to get to the right vertical search page in the first place since they usually cannot remember where to go. Universal search is basically a single search system that will access all of these vertical searches. This search requires simultaneously querying and searching over all of the specific databases: news, images, videos, tens of such sources today, with potentially hundreds or thousands of them in the future. These multiple simultaneous searches then get ranked, so it will be even more computing intensive than current web search methods.
5. Intelligent data mining and massive data analysis are required to generate some intelligence for the masses of data available in cloud computing applications. But this needs to be combined with people - via their collaboration and contributions - to change a mass (or mess!) of photos or facts or whatever into a very powerful combination. People and computing tools working together can create intelligent knowledge. Applications like Google Earth are much more useful when people can contribute to them, e.g. as National Geographic showed when they added many high-resolution photos to it. Reviews, 3-D buildings, etc. can turn a tool from a collection of pictures into something special. Creativity adds connections to data-centric applications, enabling intelligent combinations of content. But with all of this data comes the issue of server costs. In a choice between high-end servers or cheap PC-class servers, better cost efficiency and improved performance can be achieved by going for very many PC-class servers (as long as appropriate reliability mechanisms are put in place to deal with the higher failure rates).
6. With the many servers used for storing and accessing masses of data in cloud computing setups, there are associated new programming solutions required for fault tolerance, distributed shared memory, and other programming paradigms. For fault tolerance, Google uses the 'Google File System' which is a distributed disk storage architecture. Every piece of data is replicated three times. If one
machine dies, a master redistributes the data to a new server. There are around 200 clusters (some with over 5 PB of disk space on 500 machines). Google uses what they call the 'Big Table' for distributed memory. For example, if one is storing every web page from yahoo.com, no single machine can store that, so multiples are required. The largest cells in the Big Table are 700 TB, spread over 2000 machines. MapReduce is an example of a new programming paradigm for cloud computing, where a trillion records are cut into a thousand parts on a thousand machines. Each machine will then load a billion records and will run the same program over these records, and then the results are recombined. While in 2005 there were some 72,000 jobs being run on Google's MapReduce setup, in 2007 there were two million jobs (the usage seems to be increasing exponentially).

One criticism of cloud computing is that it makes many of the same claims as networked computing and networked desktops did in the 1990s (i.e. the theory that users would move to using computers with no hard drives, where all data would be stored elsewhere and would be available from any networked desktop that a user would choose to use). However, users still wanted control over their own desktops (in particular, having offline access to the data contained therein), and hence local storage is still a primary consideration when purchasing a computer.
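The MapReduce pattern mentioned above can be illustrated with a toy, single-machine Python sketch; word counting stands in for the 'same program' run over each partition, and nothing here reflects Google's actual implementation:

from collections import Counter
from functools import reduce

records = ["spam", "ham", "spam", "eggs", "ham", "spam"] * 1000  # invented input

def split(data, parts):
    # Cut the input into roughly equal partitions.
    size = (len(data) + parts - 1) // parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(partition):
    # Run the same program (here, word counting) over one partition.
    return Counter(partition)

def merge(counts_a, counts_b):
    # Recombine two partial results.
    return counts_a + counts_b

partials = [map_partition(p) for p in split(records, parts=10)]  # would run in parallel
totals = reduce(merge, partials, Counter())
print(totals.most_common(3))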
3.2.8 Folksonomies

As mentioned, a key feature of social websites is that contributed content may be tagged with keywords by the content creator or sometimes by community members. Tagging is common to many social websites - a tag is a keyword annotation that acts like a subject or category for the associated content. Folksonomies are constructed from these tags: they are collaboratively generated, open-ended labelling systems that enable users of social websites to categorise their content using tags, and thereby to visualise popular tag usages via 'tag clouds' (visual depictions of the tags used on a particular website, like a weighted list in visual design) as shown in Figure 3.5. Examples of systems that use tags are blogs, social bookmarking sites, photo or video sharing services and wikis.

Folksonomies are a step in the same direction as the Semantic Web (Specia and Motta 2007). The Semantic Web often uses top-down controlled vocabularies to describe various domains, but it can also utilise folksonomies and therefore develop more quickly, since folksonomies are a great big distributed classification system with low entry costs. We shall discuss social tagging and the connections between folksonomies, tagging and the Semantic Web in more detail in Chapter 8.
Fig. 3.5. A typical tag cloud
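A tag cloud such as the one shown in Figure 3.5 is essentially a weighted frequency list; the following sketch (with invented tag data) shows the basic computation behind it:

from collections import Counter

# Tags applied by community members to various content items (invented data).
tag_assignments = [
    "semanticweb", "rdf", "folksonomy", "semanticweb", "tagging",
    "rdf", "semanticweb", "web2.0", "tagging", "semanticweb",
]

counts = Counter(tag_assignments)
max_count = max(counts.values())

# Scale each tag's frequency to a display size between 10pt and 30pt.
for tag, count in counts.most_common():
    size = 10 + 20 * count / max_count
    print(f"{tag}: used {count} times, displayed at {size:.0f}pt")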
3.3 Object-centred sociality

Social networks exist all around us - at workplaces as well as within families and social groups. They are designed to help us work together over common activities or interests, but anecdotal evidence suggests that many online social networking services (SNSs) lack such common objectives (Irvine 2006). Instead, users often connect to others for no other reason than to boost the number of friends they have in their profiles29. Many more browse other users' profiles simply for curiosity's sake. These explicitly-established connections become increasingly meaningless because they are not backed up by common objects or activities. The act of connecting sometimes becomes a site's primary (only) activity. In fact, some sites act simply as enhanced address books: although potentially useful for locating or contacting someone, they provide little attraction for repeat visits. This is a flaw with the current theory. As Jyri Engeström, co-founder of the Jaiku.com microblogging site (microblogging is a lightweight form of blogging that consists of short message updates), puts it30, 'social network theory is good at representing links between people, but it doesn't explain what connects those particular people and not others.' Indeed, many are finding that social networking sites are becoming increasingly boring and meaningless.

Another problem is that the various SNSs do not usually work together. You thus have to re-enter your profile and redefine your connections from scratch when you register for each new site. Some of the most popular SNSs probably would not exist without this sort of 'walled garden' approach31, but some flexibility

29 http://www.russellbeattie.com/notebook/1008411.html (URL last accessed 2009-06-09)
30 http://www.zengestrom.com/blog/2005/06/speaking_on_obj.html (accessed 2009-06-09)
31 http://tinyurl.com/mnp7dn (URL last accessed 2009-06-09)
would be useful. Users often have many identities on different social networks. Reusable or distributed social networking profiles would let them import existing identities and connections (from their own homepage or another site they are registered on), thereby forming a single global identity with different views.

Engeström has theorised32 that the longevity of social websites is proportional to the 'object-centred sociality' (Bouman et al. 2008) occurring in these networks, i.e. the degree to which people are connecting via items of interest related to their jobs, workplaces, favourite hobbies, geographic locations, etc. Similarly, Ken Jordan and colleagues advocate augmented social networks, in which citizens form relationships and self-organise into communities around shared interests (Jordan 2003).
Fig. 3.6. Users form object-centred social networks (using their possibly multiple online accounts) around the content items they act on via social websites
32 http://www.zengestrom.com/blog/2005/04/why_some_social.html (URL last accessed 2009-06-09)
On the Web, social connections are formed through the actions of people, via the content they create together, comment on, link to, or for which they use similar annotations. Adding annotations to items in social networks (using topic tags, geographical pinpointing, etc.) is particularly useful for browsing and locating interesting items and people with similar interests. Content items such as blog entries, videos, and bookmarks serve as the lodestones for social networks, drawing people back to check for new items and for updates from others in their network. On Flickr, people can look for photos categorised using an interesting ‘tag’, or connect to photographers in a specific community of interest. On Upcoming, events are also tagged by interest, and people can connect to friends or likeminded others who are attending social or professional events in their own locality. Figure 3.6 is illustrative of an object-centred social network for three people, showing their various user accounts on different websites and the things that they create and do using these accounts. Rather than being connected simply through online social network relationships (i.e. by explicitly-defined friends-type contacts), these people are bound together (via their user profiles) through ‘social objects’ of common interest: the content they create together, co-annotate, or for which they use similar annotations. For example, Bob and Carol are connected through bookmarked websites that they both have annotated on musical keyboards and also through music-related events that they are both attending. Similarly, Alice and Bob are using matching tags on media items about pets and are subscribed to the same blog on birds. For many social websites, success has come from enabling communities formed around common interests, where the users are active participants who as well as consuming information also provide content and metadata. In this way, it is probable that people’s SNS methods will continue to move closer towards simulating their real-life social interaction, so that people will meet others via something they have in common, not by randomly approaching each other - eventually leading towards more realistic interaction methods with friends as online connections become intertwined with their real-world interests. Multiplayer online gaming has had groups (‘clans’) of people working towards common purposes for more than a decade, even though they may never have met each other in real life. We may start to see gaming social websites where real-time multiplayer online games will appear in browser-embedded windows just as YouTube does for videos, with user-to-user conversations and running commentaries being carried out in parallel to these games. Web interactions have not reached the level of real-world interaction just yet, but real-time microblogging and features like being able to respond to people’s videos with video comments of your own (e.g. as on YouTube) are going in this direction. These online activities can often lead to real-world group meetups (e.g. localised Tweetup events for microbloggers arising from updates on Twitter), with online activities reinforcing offline ones and vice versa.
Virtual worlds such as Second Life have already begun to provide a user experience which is more faithful to reality and where networks of friends are interacting in much more realistic ways. Users interact via avatars in a three-dimensional environment where they can move between different areas and socialise with other residents. An important aspect of Second Life is that the world is largely user-created. Residents can buy land, construct houses and create objects. It is also possible to trade with other users, as well as buy or sell using the world's internal currency, the Linden Dollar. Second Life's world encourages residents to meet and stay in touch with other users with similar interests via themed areas and events - a prime example of object-centred sociality.
3.4 Licensing content

As we previously mentioned, an important feature of social websites is the way they let users add content to the Web, no matter what format it is in. When publishing content on the Web, it is important to know how people will be able to access and reuse it. Should they be allowed to copy and paste someone else's blog post on a wiki? Can they use a picture someone else has taken for their next book cover? Or can they reuse some songs written by someone else that are only meant for non-commercial use? Contrary to common perception, the fact that something is freely accessible on the Web does not mean that it is free to reuse.
The Creative Commons33 (CC) project provides a legal framework that defines rights regarding how one can reuse content that users publish. It provides six different licences for defining if and how people can reuse content that has been published, if they can modify it and if it may be used for commercial purposes. In this way, content providers can decide exactly how people can use their content. For example, on Flickr, different licences are offered when adding a picture, and various bands publish their content using CC licences on Jamendo34. ccMixter35 is a Creative Commons-sponsored service that provides ways to create and publish mashups of CC-based content.
3.5 Be careful before you post

When filling out profiles on social websites or posting content, users are often oblivious to the fact that the content they post is not just available to their friends on the site, but by default, content is usually visible to everyone. Some are quite happy to post personal content online if their online presence is a significant component of their real life (e.g. for social media experts, online or offline celebrities, etc.). Others may be more private and therefore should be aware of the public nature of content they post online.
33 http://creativecommons.org/ (URL last accessed 2009-07-16)
34 http://jamendo.com/ (URL last accessed 2009-06-09)
35 http://ccmixter.org/ (URL last accessed 2009-06-09)
There are some basic guidelines that should be followed when adding information to either personal or public areas on social websites:
- Common sense should prevail, and you should not post anything that you would not give to a stranger in the street. That includes your phone number, your address, your birth date, etc.
- Try not to use your real name or your e-mail address in your online nickname or posting account. Keep your work e-mail details separate from accounts used for message boards or blogs where you post informally: get a Hotmail or Gmail account for such activities. Also, do not give any account password to your friends unless there is a very good reason to do so.
- Be careful about posting potentially damaging information about your relationships with professional colleagues and friends or family, or personal specifics about yourself (because even though you may be posting anonymously, it can be very easy for someone to put one and one together and figure out who you are).
- If you post inflammatory statements about something or somebody, be aware that doing so under your own name may lead to a campaign of retaliation against you. If you post defamatory statements, be prepared for legal action.
- There is effectively a permanent record of what you contribute to the (Social) Web (e.g. if you let slip something you should not about your workplace or family, sometimes even if the original site disappears). It may be on the original site you posted on, in Google's cache, in the Wayback Machine (a periodic caching of website content by the Internet Archive that stretches back to the beginnings of the Web), or someone may just save it to their own site or computer. Remember that when you post something sensitive: it could well be there forever, for your parents, your kids, your boss, your future employer to see (even after you have logged off from this mortal coil).
- Blogging is a powerful medium due to its open nature and public contributions, but it is this openness that means that whatever you say can be read by all and people can build up a picture of who you are and what you are doing (even if you do not realise that they are reading or actively following your blog). As with social networking sites, some people mistakenly think that their blog is only being read by a closed circle of friends, but of course if it is publicly accessible, anyone or any search engine can get access to it and forward content to others.
- Finally, you should not arrange to meet anyone you have only talked to online alone in the real world.
The above guidelines are not an attempt to make Social Web users paranoid, but it is prudent to be careful about what you contribute. There is already a huge amount of publicly-available information about individuals ranging from phone book entries to local government planning applications and objections, and it is becoming easier to link this to less formal information such as blog posts or photos taken (of you, by others) at parties or other events.
3.6 Disconnects in the Social Web

The Social Web is allowing people to connect and communicate via the Internet, resulting in the creation of shared, interactive spaces for communities and collaboration. There is currently a large disconnect in the online social space. Blogs, forums, wikis and social networking sites can all contain vibrant, active communities, but it is difficult to reuse and to identify common data across these sites. For example, Wikipedia contains a huge body of publicly-accessible knowledge, but reusing this knowledge outside of Wikipedia and incorporating it into other applications poses a significant challenge. As another example, a user may create content on several blogs, wikis and forums, but one cannot identify this user's contributions across all the different types of social software sites. We shall address this in future chapters by describing methods for connecting these sites. We shall identify core vocabularies for describing interlinked social spaces, and provide guidelines and tools for describing content in social software. Our use cases will provide examples of how adding semantic information to social websites will enable richer applications to be built.
4 Adding semantics to the Web

The 'Semantic Web' can be thought of as the next generation of the Web where computers can aid humans with their daily web-related tasks as more meaningful structured information is added to the Web (manually and automatically). For example, using a combination of facts like 'John works_at NUI Galway', 'Mary knows John', 'John is a Person', 'Mary is a Person', 'NUI Galway is an Organisation', 'A Person works_at an Organisation', and 'A Person knows a Person', you can allow computers to answer relatively straightforward questions like 'Find me all the people who know others who work at NUI Galway', which at the moment is quite difficult for us to do without some manual processing of information returned from search results.
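To make this concrete, the facts above could be written down as machine-readable statements along the lines of the following minimal sketch, which uses the Turtle notation introduced later in this chapter; the URIs and the ex: vocabulary are invented purely for illustration and are not part of any standard.

@prefix ex: <http://example.org/vocab#> .

<http://example.org/people/john> a ex:Person ;
    ex:works_at <http://example.org/orgs/nuigalway> .

<http://example.org/people/mary> a ex:Person ;
    ex:knows <http://example.org/people/john> .

<http://example.org/orgs/nuigalway> a ex:Organisation .

# A system that understands these statements can answer 'find me all the
# people who know others who work at NUI Galway' by following the
# ex:knows and ex:works_at links, rather than matching words in pages.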
4.1 A brief history

During the evolution of human civilisation, new technologies enabled the creation of more and more recordings of knowledge in various media forms. The invention of the Web and the evolution towards the Semantic Web can be explained by the need to cope with the ever-increasing amount of information and knowledge. Indeed, the creation and recording of knowledge began with cave drawings as early as 32000 BC. Some of the earliest written expressions were the Sumerian cuneiform scripts written on clay tablets around 3000 BC. Rapid progress in the creation and distribution of text and pictures was made by the invention of the mechanical printing press by Johannes Gutenberg in 1440. The invention of photography in 1839 by Louis Daguerre added another principal form of media, followed by the invention of the phonograph for sound recordings by Thomas Alva Edison in 1877 and the capability to effectively create movies developed by the Lumière brothers around 1895.
The proliferation of media created the need to collect and organise these media items. As early as 700 BC, media items were organised in libraries, and the Ancient Library of Alexandria, assumed to have up to 1 million scrolls, is one of the most well-known examples of an early large collection of media. However, those libraries were (and still are) centralised collections of media with strict organisation and indexing principles.
In 1945, Vannevar Bush, a science administrator for the US government, was one of the first people to realise that the proliferation of knowledge in various media forms had opened up new challenges that central repositories and the indexing mechanisms of conventional centralised libraries could not meet. That led Bush to
postulate the 'Memex' (Bush 1945), a device 'in which an individual stores all his books, records, and communications, and which is mechanised so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.' The idea of the Memex was picked up by Douglas Engelbart, a computer engineer at Stanford Research Institute, who was aiming towards augmenting human intellect (Engelbart 1962) to increase the capability of a person 'to approach a complex problem situation, to gain comprehension to suit his particular needs, and to derive solutions to problems.' In the course of working on augmenting human intellect, Engelbart invented many of the core technologies we are still using today, such as hypertext, display editing, and the mouse. However, the core idea of augmenting human intellect on a societal scale as formulated in (Engelbart 1962) remains largely unrealised. Ted Nelson, another information technology pioneer, founded the Xanadu project in 1960 with the goal of creating a computer network with a simple user interface.
Tim Berners-Lee, a programmer at the European Organisation for Nuclear Research (CERN) who was inspired by the visions of Engelbart and Nelson, was able to realise part of the idea: the hypertext system we know today as the World Wide Web, or the Web for short, which made global information access feasible. Computers on the Web use the HTTP (Hypertext Transfer Protocol) communications standard to transfer web pages containing display instructions in HTML (Hypertext Markup Language). However, Berners-Lee (1999) conceded that the original goal and motivation remained largely unfulfilled:
  There was a second part of the dream. [...] We could then use computers to help us analyse it, make sense of what we're doing, where we individually fit in, and how we can better work together. [...] The second part has yet to happen.
Realising the original idea of the Memex, the ‘augmentation of human intellect’, and the ‘second part of the dream’ has turned out to be a difficult problem. The Web has created a global common information space, but it has amplified the problem of information and knowledge overload by causing the creation of more and more hyperlinked web documents. The central problem is that computers are perfectly capable of rendering web documents. However, computers provide little support in helping people to understand, organise and manage the knowledge contained in these documents. To be able to augment the human intellect, something more than the HTML Web we use today is required: a ‘Semantic’ Web that machines can process for us, where computers help us to make sense of the information enabling us to work better together.
4.2 The need for semantics

Searching for information today is based on finding words within web pages and matching them. For example, if a person was searching for information on the former English rugby captain Martin Johnson, they would visit a site such as Google and type 'Martin Johnson' into the search box. The search engine will return not just web pages about the rugby player, but primarily pages relating to his more famous artist namesake Martin Johnson Heade (and many other Martin Johnsons besides). One way to improve the situation would be for a web page author to add some extra meaning to the information, for example, by telling the computer that Martin Johnson is indeed a rugby player and that every rugby player is a person. This is a simple example of annotation, where semantic meaning can be added to the Web. Now a computer can determine that this Martin Johnson is a rugby player, and that he may be the one that you are looking for. However, the HTML Web cannot express these annotations.
The principles of the HTML Web since its invention in the 1990s have essentially remained the same: resources (web pages, files, etc.) are connected by semantically-untyped hyperlinks. By untyped, we mean that there is no easy way for a computer to figure out what a link between two pages means, i.e. what the semantics of the relationship between the pages are. For example, on the UEFA football website, there are hundreds of links to the various organisations that are registered members of the association, but there is nothing explicitly saying that the link is to an organisation that is a 'member of' UEFA or what type of organisation is represented by the link. On a professor's work page, she may link to many papers that she has authored, but the page may not say that she is the author of those papers or that she wrote such-and-such when she was visiting a particular university. Moreover, while anyone can guess that she is a professor, and hence a person, a computer cannot extract any information about the nature of the objects (or in fact, any objects at all) described in the pages as the computer is only aware of web pages and hyperlinks. While a reader interprets these pages and hyperlinks as representing real-world concepts and properties, a computer cannot. There is hence a knowledge gap between what is on the Web and the interpretation we can make when compared to what a computer can deduce.
To close the knowledge gap, knowledge representation mechanisms on the Web beyond HTML are needed. Since the Web has as its basis a number of unique properties (such as distribution, diversity and heterogeneity), these properties serve as requirements for knowledge representation mechanisms. These requirements are:
- Entity identity. Entities on the Web have to be uniquely identifiable: not just documents, but all possible entities (e.g. persons). Object identity is a necessary prerequisite to enable computers to process and understand information.
- Relationships. Representing entities alone is not sufficient. Relationships between entities capture relevant knowledge and need to be expressed.
- Extensibility. Given the wide variety of communities and topics on the Web and the need for evolution and adaptability, a fixed schema would not be adequate for the rapid evolution of topics.
- Vocabularies (ontologies). In order to exchange data on a specific topic and represent it consistently, an agreement on which vocabulary to use is necessary.

The Semantic Web, put forward by the inventor of the current Web, Sir Tim Berners-Lee, addresses these issues and requirements by allowing one to provide metadata that is associated with web resources; behind this metadata there are associated vocabularies or 'ontologies' (we will not make any distinction between these terms in the book, hence using both) that describe what this metadata is and how it is all related. The Semantic Web, or as some have termed it 'Web 3.0'1, is 'an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation' (Berners-Lee et al. 2001). The word 'semantic' stands for 'the meaning of', and therefore the Semantic Web is one that is able to describe things in a way that computers can better understand: basically, adding more meaning to the Web. Computers can only do so much with the 'natural language' information that is on the Web at the moment: they are not evolved enough to understand what pages of text are about. The idea of a Semantic Web involves a move from unstructured pages of text to structured information that can not only be understood by people but can be interpreted by computers to present the information to people in new ways.
MIT's (Massachusetts Institute of Technology2) Stefan Marti summarised 'the Semantic Web for dummies' as3:
  XML customised tags, like: <dog>Nena</dog>
  + RDF relations, in triples, like: (Nena) (is_dog_of) (Kimiko/Stefan)
  + Ontologies / hierarchies of concepts, like: mammal -> canine -> Cotton de Tulear -> Nena
  + Inference rules, like: If (person) (owns) (dog), then (person) (cares_for) (dog)
  = Semantic Web!
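Rendered with the RDF machinery that the rest of this chapter introduces, this toy example might look roughly like the sketch below; the ex: namespace, the URIs and the class names are invented for illustration only.

@prefix ex: <http://example.org/pets#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Nena a ex:CotonDeTulear ;              # instance data about the dog Nena
    ex:is_dog_of ex:Kimiko, ex:Stefan .   # RDF relations, expressed as triples

ex:CotonDeTulear rdfs:subClassOf ex:Canine .   # an ontology: a hierarchy of concepts
ex:Canine rdfs:subClassOf ex:Mammal .

# An inference rule such as 'if a person owns a dog, then that person
# cares for the dog' would then be layered on top of these facts.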
The next step is the development of various ontologies. Ontologies, providing a vocabulary of terms in a certain area (for example, there would be separate ontologies for sports or soaps or science) are used to specify the meanings of the annotations added to web pages. For the rugby example, there may be a definition in an ontology that a rugby player is a member of a team, or that each team has 15
1 http://turing.cs.washington.edu/NYT-KnowItAll.htm (URL last accessed 2009-06-09)
2 http://web.mit.edu/ (URL last accessed 2009-07-16)
3 http://web.media.mit.edu/~stefanm/commonsense/SemanticWeb.ppt (URL last accessed 2009-06-09)
players. These ontologies are designed to be understandable by computers as part of the Semantic Web (using the Resource Description Framework, or RDF). Some of the more popular Semantic Web vocabularies include FOAF (Friend-of-a-Friend, for social networks), Dublin Core (for resources online or in libraries), SIOC (for online communities and content), the W3C Basic Geo Vocabulary4 (for the coordinates of geographic locations), and the Gene Ontology (for genes in organisms). (Bizer et al. 2007) also provides a list of popular and core vocabularies that people should use when publishing data on the Semantic Web5. People can also create custom vocabularies for their own information representation requirements (Figure 4.1).
Fig. 4.1. We can now describe lots of things semantically!6
Figure 4.2 shows the node types which exist in a typical Semantic Web data model, i.e. a vocabulary (similar to FOAF), and the relationship types which connect them together. The relation or predicate ‘maker’ is considered to be the inverse of the relation ‘made’, in other words, they represent the same relationship, but in opposite directions. While it is known that adding metadata to websites can often improve the percentage of relevant document hits in search engine results, it is difficult to persuade Web authors to add metadata to their pages in a consistent, reliable manner (either due to perceived high entry costs or because it is too time consuming). For example, few web authors make use of the simple Dublin Core metadata system, e.g. by indicating the creator or creation date of their pages, even though DC metadata tags can increase a page’s prominence in search results.
4 http://www.w3.org/2003/01/geo/ (URL last accessed 2009-07-16)
5 http://tinyurl.com/whichvocabs (URL last accessed 2009-06-09)
6 http://autrans.crao.net/gallery/MousseAutrans2004-01-07/1 (URL last accessed 2009-06-09)
Fig. 4.2. An example Semantic Web data model
The main power of the Semantic Web lies in interoperability and in combinations of vocabulary terms: increased connectivity is possible through a commonality of expression, and vocabularies can be combined and used together: e.g. a description of a book using Dublin Core metadata can be augmented with specifics about the book author using the FOAF vocabulary. Vocabularies can also be easily extended (using modules) in a distributed manner. A person can adapt an ontology published on the Web to their own needs and then republish the changes so that anyone can benefit from it. Of course, to be successful, the Semantic Web should rely on a set of core ontologies that are agreed upon and used by most people for adding semantics to their content. Through this, true intelligent search with more granularity and relevance is possible: e.g. a search can be personalised to an individual by making use of their identity profile and relationship information.
In later sections we will see that the Semantic Web also provides useful mechanisms for describing and leveraging the social objects that bind us together in social websites. Since more interesting social networks are being formed around the connections between people and their objects of interest, and as these object-centred social networks grow bigger and more diverse, more intuitive methods of navigating the information contained in these networks have become necessary - both within and across social networking sites. We have mentioned how individuals are connecting through these shared objects, but this can apply to whole communities as well (e.g. a community of interest for mountaineering may consist of both people and content distributed across photo-, bookmark- and event-centred social networks, Figure 4.3). Person- and object-related data can also be gathered from various social networks and linked together using a common representation format. This linked data can provide an enhanced view of individual or community activity in a localised or distributed object-centred social network(s) ('show me all the content that Alice has acted on in the past three months').
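As a concrete illustration of this vocabulary mixing, a book record expressed with Dublin Core terms might be enriched with FOAF statements about its author, roughly as in the following sketch (the book and author URIs are invented for illustration):

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/books/social-semantic-web>
    dcterms:title "The Social Semantic Web" ;
    dcterms:creator <http://example.org/people/author1> .

# FOAF adds person-specific detail about the same creator resource
<http://example.org/people/author1> a foaf:Person ;
    foaf:name "John Breslin" ;
    foaf:workplaceHomepage <http://www.nuigalway.ie/> .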
Fig. 4.3. Community groups are also connected through objects of interest
In the following sections we describe the necessary building blocks of the Semantic Web in more detail.
4.3 Metadata

Metadata has been with us since the first librarian made a list of the items on a shelf of handwritten scrolls. The term 'meta' comes from a Greek word that denotes 'alongside, with, after, next'. Metadata can be thought of as 'data about data', and it commonly refers to descriptive structured data about web resources that can be used to help support a wide range of operations. Metadata can be used for many purposes: to provide a structured description of characteristics such as the meaning (semantics), content, structure and purpose of a web resource; to facilitate information sharing; to enable more sophisticated search engines on the Web; to support intelligent agents and the pushing of data (e.g. from blog feeds); to minimise data loss or repetition; and to help with the discovery of resources by enabling field-based searches.
We can consider a library analogy to a Web without metadata, where every word in every page in every book must be indexed. Because such indexing will lag the growth and change in the Web, it often yields poor search results. With even some basic metadata, using the library analogy we have books with categories, titles, descriptions, ratings, yielding better retrieval. However, this also results in some extra work classifying things and assigning properties.
Many kinds of resources, objects, or things on the Web can be annotated with metadata: HTML documents, digital images, databases, books, museum objects, archival records, collections, services, physical places, people (e.g. using FOAF), abstract 'works', concepts, events and even metadata records themselves. This metadata can be used by people, for example, a collection owner managing or controlling access to resources, or a researcher seeking or interpreting resources, or by computerised services or agents, for example, aggregators (e.g. blog collec-
tions), web portals presenting a ‘landscape’ of data to users, or brokers performing query tasks on behalf of users. Metadata can be created by software tools (e.g. by indexing robots or web crawlers accessing resource content), and it can be created by people through descriptions added by a resource owner or by third parties (e.g. specialist cataloguers or resource users). However, creating (and maintaining) high-quality metadata is not always cheap, and there may be rights or copyright issues for metadata as well as for the underlying resources. Depending on which approach offers the most flexibility, metadata can be embedded within a resource and extracted from the resource itself (depending on its format), or it may simply be linked to a resource via an external file or a database of resource descriptions. One may also need to present different subsets of metadata in different contexts. To exchange metadata, metadata standards are required. These are agreed-on criteria for describing metadata for purposes of interoperability. As a simple example, a date (e.g. attached as metadata to a file) could be expressed as January 31, 2009, 31 janvier 2009, 2009-01-31, 01-31-2009, or 31012009 (amongst other variations), and it is obvious that we need some consistent forms for exchanging such metadata. There are already many metadata standards for different domains7, and sometimes mappings are required between these standards. For the Semantic Web, the common standard to describe metadata is RDF, and we will now describe in more detail what RDF is and how it can be used8.
4.3.1 Resource Description Framework (RDF)

The Resource Description Framework (RDF) is used to represent entities, referred to by their unique identifiers or URIs (Uniform Resource Identifiers), and binary relationships between those entities. RDF consists of two parts: the RDF data model specification and a serialisation syntax (RDF/XML is often used, but it can take other forms including Notation 3 and Turtle). The data model definition is the core of the specification, and the syntax is necessary to transport RDF data in a network.
In RDF, the combination of two entities and a binary relationship between them is called a statement, or a triple. Represented graphically, the source of the relationship is called the subject of the statement, the labelled arc itself is the predicate (also called the property) of that statement, and the destination of the relationship is called the object of that statement. The data model of RDF distinguishes between entities (also called resources), which have a URI identifier, and literals, which are just strings. The subject and the predicate of a statement are always resources,
7 http://en.wikipedia.org/wiki/Metadata_standards (URL last accessed 2009-07-17)
8 Some of the aforementioned formats, such as Dublin Core, can also be expressed using RDF.
while the object can be a resource or a literal. In RDF diagrams, resources are typically drawn as ovals, and literals are drawn as boxes. An example of a statement is given in Figure 4.4: the resource http://www.deri.ie/projects/#11 is a subject and has a property http://www.SemanticWeb.org/#hasHomepage and the value of the property (the object) is the http://www.sioc-project.org/ resource.
Fig. 4.4. The simplest possible RDF graph: two nodes and an arc
The statement can be read as: the resource http://www.deri.ie/projects/#11 has a homepage, which is the resource http://www.sioc-project.org/. At first glance it might look strange that predicates are also resources and thus have a URI as a label. However, to avoid confusion it is necessary to give the predicate a unique identifier. Simply ‘hasHomepage’ would not be sufficient, because different vocabulary providers might define different versions of the predicate hasHomepage with possibly different meanings. A set of statements forms a graph. Figure 4.5 shows an extension of Figure 4.4: the property http://purl.org/dc/terms/creator with value John Breslin (a literal) has been added to the graph.
Fig. 4.5. An extension of the previous example
This is the ‘core’ of RDF. To allow for a more convenient data representation, additional vocabularies and conventions need to be introduced. For example, predicate URIs are often abbreviated by using the XML-namespace syntax. Instead of writing the full URI form of the predicate http://www.SemanticWeb.org/#hasHomepage, the namespace form sw:hasHomepage is used with the assumption that the substitution of the namespace prefix ‘sw’ with ‘http://www.SemanticWeb.org/#’ is defined. The namespace prefix ‘rdf’ is commonly used to refer to the specification explaining how metadata should be produced according to the RDF model and syntax (Lassila and
Swick 1999). In this case, the ‘rdf’ prefix would be expanded to the URL of the RDF-specific vocabulary http://www.w3.org/1999/02/22-rdf-syntax-ns#.
4.3.1.1 Blank nodes

Sometimes it is not convenient to provide an explicit URI for a resource in RDF, if the URI is not really visible to the outside world. In these cases 'blank nodes' (also called anonymous resources or bnodes) are used. In RDF, a blank node is a resource, or a node in an RDF graph, which is not identified by a URI. A blank node can be used as the subject or the object in an RDF triple. Figure 4.6 shows an example of a blank node.
Fig. 4.6. A metadata instance often may not have a full URI but rather a blank node identifier
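In the Turtle notation introduced in the next section, a blank node can be written with square brackets (or with a local label such as _:someone). The following sketch, with invented URIs, describes a project whose creator is given a name but no URI of their own:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/projects/42> dcterms:creator [
    a foaf:Person ;            # the bracketed resource is a blank node
    foaf:name "Jane Doe"
] .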
4.3.2 The RDF syntax

In order to facilitate interchange of the data that is represented in RDF, a concrete serialisation syntax is needed. RDF/XML (Beckett 2004) is an obvious choice, but it is worth noting that the RDF data model is not tied to any particular syntax and can be expressed in any syntactic representation (or vice versa, extracted from other forms of data, e.g. Topic Maps (ISO/IEC 13250)). Furthermore, because of the XML serialisation, the RDF syntax definition is rather complicated. RDF APIs are used to shield developers from the details of any particular serialisation syntax, and can handle RDF data as graphs.
The RDF specification suggests two standard ways to serialise RDF data in XML: an abbreviated syntax and a standard syntax. Both serialisation possibilities use the XML namespace mechanisms (Bray et al. 1999) to abbreviate URIs as already described. An example is given below for the abbreviated syntax. The abbreviated syntax is very close to how one would intuitively model data in XML. The following XML is the serialisation of the RDF graph given in Figure 4.5.
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:sw="http://www.SemanticWeb.org/#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://www.deri.ie/projects/#11">
    <sw:hasHomepage>
      <rdf:Description rdf:about="http://www.sioc-project.org/">
        <dcterms:creator>John Breslin</dcterms:creator>
      </rdf:Description>
    </sw:hasHomepage>
  </rdf:Description>
</rdf:RDF>
XML documents can carry RDF code, which is mapped into an RDF data-model instance. The start and end of the RDF code is indicated by the tags <rdf:RDF> and </rdf:RDF>. Some other common serialisation syntaxes for RDF are N3 (Notation 3) (Berners-Lee 1998) and Turtle (Beckett and Berners-Lee 2008), the second one being a subset of the first one. The same RDF graph from Figure 4.5 is given below in Turtle9:
@prefix : <http://www.SemanticWeb.org/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://www.deri.ie/projects/#11> a :Project ;
    :hasHomepage <http://www.sioc-project.org/> .

<http://www.sioc-project.org/> a rdfs:Resource ;
    dcterms:creator "John Breslin" .
9 For future examples in this book, we will not list the namespaces and prefixes in our RDF code snippets. In this example, we also point out the shortcut 'a', which is equivalent to rdf:type.
The semicolon at the end of a line means that subsequent statements (until a period is encountered at the end of a line) will use the same subject as in the previous line. Therefore, in the above example, the semicolon indicates that the predicate ‘dcterms:creator’ and associated literal object ‘John Breslin’ applies to the subject ‘http://www.sioc-project.org/’. Another approach for serialising RDF is through the annotation of XHTML documents with RDFa (Resource Description Framework in Attributes) (Adida et al. 2008)10, which makes it possible to embed semantics in XHTML attributes in such a way that the data can be mapped to RDF and objects can be identified by URIs. This approach bridges the gap between the Semantic Web for humans and for machines since a single document with RDFa can contain information for both. This also prevents the repetition of information between an HTML document and an RDF/XML one. The same RDF graph from Figure 4.5 is given below in RDFa:
<div xmlns:sw="http://www.SemanticWeb.org/#"
     xmlns:dcterms="http://purl.org/dc/terms/"
     about="http://www.deri.ie/projects/#11">
  <p>The SIOC Project is hosted at
    <a rel="sw:hasHomepage"
       href="http://www.sioc-project.org/">http://www.sioc-project.org</a>.</p>
  <p about="http://www.sioc-project.org/">This page has been created by
    <span property="dcterms:creator">John Breslin</span>.</p>
</div>
4.4 Ontologies

Metadata elements are used to provide structure to the description of a resource, e.g. for a distance-learning course this could be title, description, keywords, author, educational level, version, location, language, date created, etc., and the basic data model for RDF has been introduced which allows us to express metadata about a particular resource. For practical purposes it is necessary to define schema information for this metadata, since a common vocabulary (also called an ontology)
10 http://www.w3.org/TR/xhtml-rdfa-primer/ (URL last accessed 2009-06-09)
needs to be agreed on in order to facilitate information and knowledge exchange. Figure 4.7 shows a representation of how metadata about people, their social connections and their interests is produced according to some ontology specifications.
Fig. 4.7. Metadata and ontologies
As another example, if there is metadata about a soccer team, an underlying ontology will say that a soccer team always has a goalkeeper and always has a manager, so each metadata entry for a soccer team should have that information. The term ‘ontology’ (from the Greek words on = being and logos = to reason) was originally coined in philosophy to denote the theory or study of being as such. The use of the term ontology in computer science has a more practical meaning than its use in philosophy. The study of metaphysics is not in the foreground in computer science, but rather what properties a machine must have to enable it to process data that is being questioned within a certain domain of discourse. Here ontology is used as the term for a certain artefact. Tom Gruber’s widely-cited answer to the question ‘what is an ontology?’ is: ‘An ontology is a specification of a conceptualization.’ In (Gruber 1993), this statement is elaborated on: A body of formally represented knowledge is based on a conceptualization: the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them (Genesereth and Nilsson 1987). A conceptualization is an abstract, simplified view of the world that we wish to represent for some purpose. Every knowledge base, knowledge-based system, or knowledge-level agent is committed to some conceptualization, explicitly or implicitly.
Specification in this context means an explicit representation by some syntactic means. Most approaches to ontology modelling agree on the following primitives for representation purposes:
Firstly, there must be a distinction between classes and instances, where classes are interpreted as a set of instances. Classes may be partially ordered using the binary relationship ‘subClassOf’, which can be interpreted as a subset relationship between two classes. The fact that an object is an element of a certain class is usually denoted with a binary relationship such as ‘type’. Secondly, a set of properties (also called attributes or slots) is required. Slots are binary relationships defined by classes, which usually have a certain domain and a range. Slots might be used to check if a certain set of instances with slots is valid with respect to a certain ontology. These are the modelling primitives of RDF Schema (RDFS), which we will introduce shortly. For example, classes can be thought of as general things or concepts in a domain and may be things like ‘Person’, ‘Document’, ‘Book’, or ‘WebPage’. There may also be relationships among these classes such as ‘Book’ and ‘WebPage’ have a ‘subClassOf’ relationship to ‘Document’, or a ‘Page’ is ‘containedIn’ a ‘Book’. Typical properties or attributes would be that a ‘Person’ has an ‘age’, or that a ‘WebPage’ has a ‘creationDate’.
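As a small illustration of these primitives, the examples just mentioned could be written down roughly as follows using the RDF Schema vocabulary introduced in the next section; the ex: names are invented for illustration.

@prefix ex: <http://example.org/docs#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Book rdfs:subClassOf ex:Document .       # classes ordered by subClassOf
ex:WebPage rdfs:subClassOf ex:Document .

ex:containedIn rdfs:domain ex:Page ;        # a relationship with a domain and a range
    rdfs:range ex:Book .

ex:age rdfs:domain ex:Person ;              # attribute-style properties (slots)
    rdfs:range xsd:integer .
ex:creationDate rdfs:domain ex:WebPage ;
    rdfs:range xsd:date .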
Fig. 4.8. From taxonomy to ontology to knowledge base
While the words vocabulary and ontology are often used interchangeably, a more strict definition is that a vocabulary is a collection of terms being used in a particular domain, that can be structured (e.g. hierarchically) as a taxonomy and combined with some relationships, constraints and rules to form an ontology. Typical constraints would be ‘the cardinality must be at least 1’ or ‘the maximum value is 300’ and some sample rules or axioms would be that ‘cows are larger than dogs’ or ‘cats cannot eat only vegetables’. A combination of an ontology together with a set of instances of classes constitutes a knowledge base (Figure 4.8). Implementing or creating ontologies consists of defining all the ontology components through an ontology-definition language. It is generally carried out in two
stages: an informal stage, where the ontology is sketched out using either natural language descriptions or some diagram technique, and a formal stage, where the ontology is encoded in a formal knowledge-representation language that is machine computable (e.g. RDF Schema or OWL, the Web Ontology Language). Different tools (e.g. Protégé11) may also be used in the implementation of an ontology. However, even more important than implementing the ontology is documenting it. There is a need to produce clear informal and formal documentation so that the ontology can be understandable by others. An ontology that cannot be understood will not be reused.
4.4.1 RDF Schema

The purpose of the RDF Schema (RDFS) specification (Brickley and Guha 2004) is to define the primitives required to describe classes, instances and relationships. RDF Schema is an RDF application, in that it is defined in RDF itself. The defined vocabulary is very similar to the usual modelling primitives available in frame-based languages12 (where entities in a domain are modelled as frames that have a set of associated slots or properties). In this section, the vocabulary used in the aforementioned examples is defined using RDF Schema. The namespace-prefix 'rdfs' is used as an abbreviation for http://www.w3.org/2000/01/rdf-schema#, the RDFS namespace identifier.
Fig. 4.9. An ontology represented in RDF Schema
11 http://protege.stanford.edu/ (URL last accessed 2009-06-09)
12 http://tinyurl.com/framebl (URL last accessed 2009-07-14)
Figure 4.9 depicts an RDF Schema-based ontology, defining the class sw:Project and two properties sw:hasHomepage and sw:hasMember. The class node is defined by typing the node with the resource rdfs:Class, which represents a meta-class in RDFS. sw:Project is also defined as a subclass of rdfs:Resource, which is the most general class in the class hierarchy defined by RDF Schema. The rdfs:subClassOf property is defined as transitive. Properties (predicates) are defined by typing them with the resource rdf:Property, which is the class of all properties. Furthermore, the domain and range of a property can be restricted by using the properties rdfs:range and rdfs:domain to define value restrictions on properties. For example, the property sw:hasHomepage has the domain sw:Project and a range rdfs:Resource (which is compliant with the use of sw:hasHomepage in Figure 4.4). Using these definitions, RDF data can be tested for compliance with a particular RDF Schema specification.
RDF Schema defines some further modelling primitives:
- The property rdfs:label allows one to define a human-readable form of a name.
- The property rdfs:comment enables comments.
- The property rdfs:subPropertyOf indicates that a property is subsumed by another property. For example, the fatherOf property is subsumed by the parentOf property, since every father is also a parent.
- The properties rdfs:seeAlso and rdfs:isDefinedBy are used to indicate related resources.
As a convention, an application can expect to find an RDF Schema document declaring the vocabulary associated with a particular namespace at the namespace URI. As another example, the following ontology snippet defines the fact that a research institute is a specific kind of organisation:

sw:ResearchInstitute rdf:type rdfs:Class ;
    rdfs:subClassOf sw:Organisation .
Therefore, thanks to inference principles related to this property, one can identify all Organisation(s) even if they are defined as instances of ResearchInstitute. Moreover, subclasses and subproperties can be defined for classes and properties appearing in existing external ontologies, so that one can extend any ontology available on the Semantic Web for his or her own needs in a distributed way. This also requires some agreement between people, as we will discuss in Section 13.2.2.
As mentioned, RDF Schema allows us to define the domain and range for each property. The following code identifies that the property hasMember links an Organisation to a Person:

sw:hasMember a rdf:Property ;
    rdfs:domain sw:Organisation ;
    rdfs:range sw:Person .
An interesting aspect of domains and ranges is that each instance linked to using a hasMember property does not have to be explicitly defined as an instance of a Person, but it rather becomes such an instance by inference as soon as this property exists. Hence, one statement might be enough to represent several facts about a single resource.
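For instance, assuming the hasMember definition above and two invented URIs, the single statement below would let a reasoner infer that DERI is an Organisation and that John is a Person, even though neither type is stated explicitly:

<http://example.org/deri> sw:hasMember <http://example.org/john> .

# Inferred from the rdfs:domain and rdfs:range of sw:hasMember:
#   <http://example.org/deri> rdf:type sw:Organisation .
#   <http://example.org/john> rdf:type sw:Person .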
4.4.2 Web Ontology Language (OWL)

The expressiveness of RDF Schema is somewhat limited. For example, it cannot be used to define that a property is symmetric (e.g. isNeighbourOf) or transitive (e.g. locatedIn). In order to model such advanced axioms within ontologies, the W3C started a working group on OWL (Web Ontology Language) in 2001, based on the work performed within the DAML+OIL project, itself based on OIL from Europe and DAML-ONT from the USA. OWL became a W3C Recommendation in 2004 for defining ontologies13, and goes beyond RDF Schema in terms of expressivity as its semantics are based on Description Logics. Since an introduction to Description Logics is beyond the scope of the book, we mostly rely on RDF Schema with some OWL extensions, most notably owl:sameAs.
The built-in OWL property owl:sameAs indicates that two URI references actually refer to the same thing: the individuals have the same 'identity'. This is useful to indicate when two entities are actually identical (the same) even when they have different identifiers14.
OWL extends the notion of classes and properties defined in RDF Schema, and it provides new axioms to define advanced characteristics and constraints regarding classes and properties. OWL actually provides three sublanguages with different degrees of expressivity:
- OWL-Lite extends RDFS and provides new axioms such as symmetry and cardinality constraints (however, cardinality can be only 0 or 1 in OWL-Lite).
- OWL-DL (DL being inherited from Description Logics) adds new axioms (and provides these axioms in OWL) including union, intersection and disjunction between classes, as well as extended OWL-Lite cardinality constraints.
- OWL-Full does not add new axioms but interprets them differently and thus becomes more powerful (for example, a URI can represent at the same time a class and an instance).
13 Here, we only refer to OWL 1 since OWL 2 is currently undergoing a standardisation process.
14 Identity on the Semantic Web is a complex issue, and this topic has been discussed during the 1st International Workshop on Identity and Reference on the Semantic Web (IRSW 2008), with more information available at http://ceur-ws.org/Vol-422
An important thing to keep in mind regarding these languages and the Semantic Web in general is that they refer to what is termed an ‘open-world assumption’. Therefore, if a fact is not defined, nothing can be assumed about it. For example, if no triples mention that ‘:John :worksWith :Alex’ and if someone asks ‘is John working with Alex’, the answer will not be ‘no’ but rather ‘there is no answer’ as there are not enough facts to answer that query.
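The following sketch, with invented URIs, illustrates both points: owl:sameAs states that two identifiers denote the same individual, while anything left unstated simply remains unknown under the open-world assumption rather than being treated as false.

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/people/john> owl:sameAs <http://another.example.org/id/jbreslin> ;
    foaf:name "John" .

# Because of owl:sameAs, the foaf:name above also holds for
# <http://another.example.org/id/jbreslin>.
# Nothing here says whether John works with Alex, so that query simply
# has no answer - it is not assumed to be false.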
4.5 SPARQL

As we have seen, RDF(S) and OWL are useful languages for representing ontologies and metadata on the Semantic Web. However, once this metadata has been published, query languages are required to make full use of it. SPARQL (SPARQL Protocol And RDF Query Language) aims to satisfy this goal and provides, as the name says, both a query language and a protocol for RDF data on the Semantic Web. SPARQL can be thought of as the SQL of the Semantic Web, and offers a powerful means to query RDF triples and graphs. As Tim Berners-Lee said15:
  Trying to use the Semantic Web without SPARQL is like trying to use a relational database without SQL. SPARQL makes it possible to query information from databases and other diverse sources in the wild, across the Web.
As RDF data is represented as a graph, SPARQL is therefore a graph-querying language, which means that the approach is different from SQL, where people deal with tables and rows. Moreover, it provides extensibility within the query patterns (based on the RDF graph model itself) and therefore advanced querying capabilities based on this graph representation, such as 'find every person who knows someone interested in Semantic Web technologies'.
SPARQL can be used to query independent RDF files as well as sets of RDF files, either loaded in memory by the SPARQL query engine or through the use of a SPARQL-compliant triple store (a triple store is a storage system for RDF data). Therefore, there is currently a need to know which files must be queried before running a query, which can be an issue in some cases and can be considered as a hurdle to overcome. However, approaches such as voiD, the Vocabulary of Interlinked Datasets16 (Alexander et al. 2009), can be used in addition to distributed SPARQL query engines in order to dynamically identify which RDF sources should be considered when querying information.
SPARQL offers four query forms that can be used to run different types of queries:
15 http://www.w3.org/2007/12/sparql-pressrelease.html.en (URL last accessed 2009-06-09)
16 http://rdfs.org/ns/void/html (URL last accessed 2009-07-16)
- SELECT, used to retrieve information based on a particular pattern,
- CONSTRUCT, used to create an RDF graph based on RDF input, which can be used as a translation service for RDF data (between different ontologies),
- ASK, used to identify if a particular query pattern can be matched on the queried RDF graph, and
- DESCRIBE, used to identify all triples related to the particular object that must be described.
For example, the following SPARQL query represents the question we posed earlier (identifying people that know someone interested in the Semantic Web), using the foaf:knows relationship to identify a relationship between two people, the foaf:name property to link a person to his or her name and the foaf:topic_interest property to model an interest of some person. In this query, we identify the notion of 'Semantic Web technologies' using the DBpedia URI for the Semantic Web, i.e. http://dbpedia.org/resource/Semantic_Web. As we mentioned earlier, to get results for the following query17, one must apply it to a set of RDF files or a triple store that contains relevant information.

SELECT DISTINCT ?who ?name
{
  ?who foaf:name ?name ;
       foaf:knows [ foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> ]
}
Moreover, SPARQL provides different modifiers (such as ORDER BY or LIMIT) to organise the various results. The following query therefore extends the previous one by ordering the people by name, and limiting the output to only two results.

SELECT DISTINCT ?who ?name
{
  ?who foaf:name ?name ;
       foaf:knows [ foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> ]
}
ORDER BY asc(?name)
LIMIT 2
While SPARQL is obviously a key component of the Semantic Web, it has some limits. At the time of writing, SPARQL does not provide any aggregate function, hence implying a need to use external languages (such as a Python or PHP script) to run aggregations, which can make the adoption of RDF technologies complicated in some cases. Various SPARQL engines have implemented this
17 We have omitted the vocabulary prefix definitions in these examples.
functionality however, for example, OpenLink Virtuoso18 (a hybrid triple store and RDBMS-based middleware platform) and ARC219 (an RDF solution for PHP and MySQL-based applications). Other relevant but specific extensions have been provided by other engines such as path-based or imprecise queries. Furthermore, SPARQL is a read-only language, in that it does not allow one to add or modify RDF statements. The SPARQL Update W3C Member Submission (Seaborne et al. 2008) provides some ways to update, add and delete RDF triples. To overcome these and other issues, the W3C SPARQL Working Group is currently working on an update of SPARQL, taking into account a variety of desired features including updates, aggregates and negation20. Also, as one may notice, there are some relationships between SPARQL and XQuery, and between the RDF and XML worlds in general. XSPARQL21 (Akhtar et al. 2008) aims to provide a way to bridge the gap between those two worlds by extending SPARQL and XQuery, thereby offering a way to query both XML and RDF data using the same query language. Finally, as we mentioned, SPARQL is both a query language and a protocol. By providing HTTP bindings for it, as well as normalised serialisation of the results (in XML or JSON), it can be efficiently used to provide open access to RDF databases. In this book, we will therefore present various applications that offer a SPARQL endpoint to their users, i.e. a way to run HTTP-based queries on RDF stores publicly available on the Web, thereby delivering open and structured data to customers. An example of such a SPARQL endpoint is the one provided by DBpedia22.
4.6 The 'lowercase' semantic web, including microformats

Microformats allow specific pieces of structured information to be embedded within the HTML markup code that makes up web pages. This information can then be discovered and reused by various applications. Microformats have been successful in bringing semantic metadata to the current Web through a vibrant developer community centred around a wiki-based website23 and a set of mailing lists. Through this community, several microformats have been created and are currently in widespread use by large companies such as Yahoo! and Automattic, in particular on social websites.
18 http://www.openlinksw.com/virtuoso/ (URL last accessed 2009-07-07)
19 http://arc.semsol.org/ (URL last accessed 2009-07-07)
20 http://www.w3.org/2009/sparql/docs/features/ (URL last accessed 2009-07-15)
21 http://xsparql.deri.org/ (URL last accessed 2009-07-16)
22 http://dbpedia.org/sparql (URL last accessed 2009-07-16)
23 http://microformats.org/ (URL last accessed 2009-06-09)
Fig. 4.10. The microformats logo
The range of available microformats includes hCard, XFN, hCalendar, hReview, rel-tag, etc. The hCard microformat can be used to describe information about a person such as their name and contact details (e.g. on social networking sites). The hReview microformat is used for describing information about reviews, and the hAtom microformat allows one to express information about content items available for syndication, such as blog posts and comments (derived from the Atom syndication format).
Microformats have adopted an approach to 'solve problems' for particular scenarios, rather than providing arbitrary Semantic Web data structures that can be used for any purpose. However, despite various arguments24, there is no reason that both the Semantic Web and microformats communities cannot work together. Both communities are trying to add semantics to the Web, and using mechanisms like GRDDL (Gleaning Resource Descriptions from Dialects of Languages)25 and poshRDF26, the existing work on both sides can be combined and reused.
As with the Semantic Web, microformats have many applications beyond social websites. In enterprise, typical usage scenarios relate to saving companies time in keeping third parties (such as customers or price comparison sites) updated27. An example is the use of microformats to power systems which show a customer's loan with the interest calculated daily on the outstanding amount based on an interest rate (taken from a microformat-enabled site) and the fixed amount. There are also discussions on how microformats can be used to represent financial data in documents ranging from online statements to e-commerce receipts, e.g. debit or credit figures28, a total of any kind, or an interest figure; on how currencies should be represented29; and on the use of hCalendar for investor relations event entries30. Finally, microformats can be added to Excel spreadsheets31 as a means to embed some 'reusable, stable semantics'.
24. http://chatlogs.planetrdf.com/swig/2006-09-04.html#T09-46-58 (URL last accessed 2009-06-09)
25. http://www.w3.org/TR/2007/REC-grddl-20070911/ (URL last accessed 2009-06-09)
26. http://esw.w3.org/topic/poshRDF (URL last accessed 2009-06-09)
27. http://tinyurl.com/3xk82g (URL last accessed 2009-06-09)
28. http://tinyurl.com/a2pu3a (URL last accessed 2009-06-09)
29. http://tinyurl.com/3ax8m6 (URL last accessed 2009-06-09)
30. http://tinyurl.com/2r3yrl (URL last accessed 2009-06-09)
31. http://tinyurl.com/2a3ohp (URL last accessed 2009-06-09)
There are some limitations with microformats in terms of representing the relationships between individual fragments of data, which limits the ability to properly describe the linked nature of data on the Web (e.g. hAtom is sometimes used to represent blog comments, but it does not have a property to indicate which blog post a comment is in reply to). Parsing of microformats can sometimes be difficult, as a significant number of exceptions and special cases have to be taken into account. References to objects (such as people, content items, etc.) can also be ambiguous. In addition, microformats cannot be extended as easily as RDF vocabularies, which are more flexible in terms of reusability and integration for different needs.

A generic approach for storing the information contained within microformats is needed if we are to store and query information about all the different kinds of Social Web objects in a uniform way. One option is to store microformats in their native HTML format, but this would make them difficult to process and query. Alternatively, domain-specific data stores and applications could be used for each particular kind of microformat object, but they may lack flexibility and limit the ability to perform universal search queries over links between different object types. The third option is to use RDF, which has advantages over the first two options as it is more generic and allows one to store and process information about various types of resources and the relations between them. GRDDL can serve as a means of moving from microformats to RDF, bridging the gap between the 'lowercase' semantic web and the Semantic Web.
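To make the third option concrete, a GRDDL transformation (or a custom parser) could expose an hCard like the one sketched earlier as RDF triples. One possible rendering, written in Turtle and mapped to FOAF terms purely for illustration (the URIs and the choice of mapping are assumptions, not a fixed convention), is:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://example.org/~jsmith#me> a foaf:Person ;
        foaf:name     "Joan Smith" ;
        foaf:mbox     <mailto:joan.smith@example.org> ;
        foaf:homepage <http://example.org/~jsmith> .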
4.7 Semantic search

Machine-readable Semantic Web data is now being crawled by Semantic Web search engines like Sindice32, SWSE33 or Swoogle34. These search engines can usually match keywords in any data that has been crawled or integrated into a semantic store. This could be structured information about people, places, dates, library documents, blog items or topics. In fact, there is no limit to the types of things that can be indexed and searched, since RDF (an open data model that can be adapted to describe pretty much anything) is used as the data format. Anyone can reuse existing RDF vocabularies like SIOC to publish data; they can publish data using their own custom vocabularies (e.g. to describe stamp collecting or Bollywood movie genres); or they can combine public and custom vocabularies (e.g. take FOAF and one's own vocabulary about soccer to describe players and managers on a soccer team).

Sindice (Tummarello et al. 2007) can be thought of as a big semantic index of the Web. It allows you to find pointers to relevant pages or URIs where particular
32. http://sindice.com/ (URL last accessed 2009-06-09)
33. http://swse.deri.org/ (URL last accessed 2009-06-09)
34. http://swoogle.umbc.edu/ (URL last accessed 2009-06-05)
keywords are mentioned, where certain property values are used (e.g. pages where a person says their e-mail address is
[email protected]), or where certain facts or semantic triples appear. Sindice gives you pointers to where the data is, whereas many other engines give you the data as well (without you having to go to the source page). Sindice also has an API that can provide results in a reusable (semantic) format that can be leveraged by other applications. Alternatively, SWSE (Semantic Web Search Engine) shows you semantic information about the object of interest (e.g. a person's phone number, their friends, etc.), which may be derived from multiple sources (i.e. the information on an object comes from tens of sources, consolidated together via unique identifiers for that object through what is called 'object consolidation'). Both SWSE and Swoogle allow query capabilities over their collections of Semantic Web statements, so if you search for Galway, they can show you the relevant statements as well as pointing you to the pages they were obtained from.

Geotemporal information is particularly useful for searching across a range of domains, and provides nice semantic linkages between things. For example, having geographic information and time information is useful for describing where people have been and when, for detailing historical events or TV shows, for timetabling and scheduling of events, etc., and for connecting all of these things together ('I'm travelling to Edinburgh next week: show me all the TV shows of relevance and any upcoming events I should be aware of according to my interests…').

A social search engine that makes use of semantic information is Tusavvy35, allowing users to search 'community knowledge without navigating the entire web'. Tusavvy reveals 'not easily linked-to pages' that are often buried in conventional search results. It was built by aligning human factors with search: using socially-annotated web data, leveraging a lexicon built via semantically-related tags, and utilising rankings selected through a user's accumulated interests.
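Returning to the idea of object consolidation: one common basis for it (shown here as a sketch with invented URIs, not as a description of SWSE's internals) is that FOAF declares properties such as foaf:mbox to be inverse functional, so two independently published descriptions sharing the same mailbox can be inferred to describe the same person:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    # From source A (e.g. a personal FOAF file)
    <http://example.org/foaf.rdf#me> a foaf:Person ;
        foaf:name "Joan Smith" ;
        foaf:mbox <mailto:joan.smith@example.org> .

    # From source B (e.g. exported by a social website)
    <http://social.example.com/user/4561> a foaf:Person ;
        foaf:mbox <mailto:joan.smith@example.org> ;
        foaf:homepage <http://example.org/~jsmith> .

    # Because foaf:mbox is an inverse functional property, a consolidating
    # engine may conclude that both URIs denote the same person:
    <http://example.org/foaf.rdf#me> owl:sameAs <http://social.example.com/user/4561> .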
4.8 Linking Open Data

In spite of various standardisation efforts during the last few years regarding languages to model and query data on the Semantic Web, a critical mass of RDF data did not exist on the Web until recently. While native exports of FOAF data from some social websites (e.g. LiveJournal) gave a glimpse of mainstream adoption of the Semantic Web, the provision of RDF data was still a domain restricted to a few early adopters. In parallel, a large amount of rich (semi-structured) data became publicly available on the (Social) Web, for example, using Creative Commons licences or GNU FDL (the GNU Free Documentation Licence) as in the
35. http://www.tusavvy.com/ (URL last accessed 2009-06-09)
Wikipedia. Based on these observations, the Linking Open Data (LOD)36 community effort started in mid-2007, supported by the W3C Semantic Web Education and Outreach group. The aim of this initiative is to expose the data already publicly available on the Web (in non-RDF forms) using RDF and to interlink it so as to emphasise the value of the Giant Global Graph. In order to achieve this pragmatic vision of the Semantic Web (pragmatic in that it is more focused on exposing large data sets in RDF rather than performing advanced reasoning), the project is based on the four tenets of Linked Data, as defined by (Berners-Lee 2006):

1. Use URIs as the names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information.
4. Include links to other URIs so that they can discover more things.
Fig. 4.11. The Linking Open Data dataset cloud from March 200937
Thanks to this effort, lots of RDF data is now available on the Web and can be used in various applications, from advanced data visualisations and querying systems to complex mashups. More importantly, this data is all linked together,
36. http://linkeddata.org/ (URL last accessed 2009-06-09)
37. http://tinyurl.com/loddatasets (URL last accessed 2009-06-09)
which means that one can easily navigate from one information source to another thanks to Semantic Web browsers such as the Tabulator38 (thereby breaking through the barriers that exist between traditional websites). Various strategies can be used to provide those links, from manual interlinking (Hausenblas et al. 2008) to advanced heuristics for resolving ambiguity and heterogeneity problems (Raimond et al. 2008).

As Figure 4.11 shows, the nature of currently-available data sets is quite varied. For example, DBpedia provides a complete Wikipedia export in RDF and hence acts as the nucleus for this 'Web of Data' (a re-branding of the Semantic Web), while GeoNames provides RDF information about millions of geographic entities. Social data can also benefit from this initiative. For instance, the Flickr profile exporter interlinks user information with GeoNames entities (we will describe this later on).

Another outcome of the Linking Open Data project is that, thanks to the billions of triples and links now available on the Web, making this data available in RDF has bootstrapped the Semantic Web in conjunction with projects like SIOC (more later). Companies are now gaining interest in the Semantic Web, some of them even becoming part of this Web of Data effort. Zemanta39 recently released an API that provides named-entity extraction for textual data based on entities from the LOD cloud, and Freebase's RDF data exports are now linked to DBpedia URIs. Moreover, we believe that this huge amount of data raises interesting research challenges, such as distributed querying and reasoning over large-scale distributed data, as well as the trustworthiness of information sources on the Web.
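To make the interlinking concrete, the sketch below (in Turtle, with illustrative statements and identifiers rather than triples copied from the actual datasets) shows the kind of links that tie a DBpedia resource to the corresponding GeoNames entity and back to the Wikipedia page it was derived from; following such links is exactly how a Semantic Web browser hops from one dataset to the next:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <http://dbpedia.org/resource/Galway>
        owl:sameAs            <http://sws.geonames.org/2964180/> ;
        foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Galway> .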
4.9 Semantic mashups

Although it is distributed and sometimes completely disconnected, data from the Semantic Web is represented using a common language, i.e. RDF. This makes it possible to produce semantic mashups in an easy way, by combining RDF data from different data sources. For example, geolocation data from GeoNames and information about personalities from DBpedia could be used for celebrity geolocation mashups. Social data can also be taken into account. For example, the FOAFMap application (Passant 2006) provided one of the first geolocation mashups for social semantic data, displaying a complete social network on a Google Map from a single entry point, i.e. the starter person's FOAF file, with further people queried on-the-fly. It is not just on the Social Web that these mashups are useful. In an organisational context, semantic mashups can make it easier for employees of an organisation to get at relevant data, to integrate it and to share it.
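A minimal sketch of the kind of SPARQL query that could sit behind such a geolocation mashup is shown below; the property choices (foaf:based_near pointing to a place with WGS84 coordinates) are one common way of modelling this, not necessarily the exact one used by FOAFMap, and the bindings returned would then be plotted on a map:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>

    SELECT ?name ?lat ?long
    WHERE {
      ?person foaf:name       ?name ;
              foaf:based_near ?place .
      ?place  geo:lat         ?lat ;
              geo:long        ?long .
    }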
38. http://www.w3.org/2005/ajar/tab (URL last accessed 2009-07-16)
39. http://www.zemanta.com/ (URL last accessed 2009-07-07)
With the Semantic Web, it is possible to reduce the costs for people who are interested in mixing together or mashing up data from many different sources while hiding much of the complexity that makes it happen. Most of the component pieces for these mashups exist: the parts just have to be combined together. According to Eric Miller from Zepheira40, the main actions required by such mashup components are: create, publish, and analyse. Open-source tools that can be used include Remix (create), Exhibit (publish) and Studio (analyse) from MIT’s SIMILE project41 (Semantic Interoperability of Metadata and Information in unLike Environments), and DERI Pipes42 from NUI Galway. Exhibit is a software service for rendering data. Data is fed into the system and a facetted-navigation interface is returned, without the need for a database or a business logic tier. Exhibit can style the data in different ways, and the data can then be viewed through different ‘lenses’. Remix is a tool that can provide semantically-mashed data for Exhibit. It combines visual interfaces, data transformation interfaces and data storage components. Remix also leverages persistent identifiers (for people, places, concepts, network objects, etc.) using purlz.org. For example, Remix can be used to ‘stitch’ together two related spreadsheets from different sources (organisations, groups, people, etc.). Fields can be mapped from one spreadsheet to the other and then you can see if it makes sense from a data perspective. Remix has tools for ‘simultaneous editing’ which allows editing over patterns of data, so by editing one entry you can edit all of them. This acts like a script which can change ‘last name, first name’ to ‘first name last name’ without any complicated programming. During each step, every piece of data has an identifier and therefore becomes a web resource in a framework that enables people to mash data together as part of a resource-oriented architecture. You can connect any fields together, but this may not necessarily make sense, so there is a need for interfaces to show users whether it does or not. In Exhibit, you can then take the stitched-together data and create an interface to it by customising facets and views, applying different themes, etc. This combination of tools enables a non-technical expert to not just produce a user interface to interact with semantic mashups of data, but to publish the information on the Web so that other people can benefit from it. As a final component in a semantic mashup, Studio can be used to analyse data, e.g. as reports with pattern analyses, which is particularly useful for organisational data. Because it is based on RDF and SPARQL, queries can be created that are relevant to a particular organisation, e.g. ‘show me the most popular or least popular reports’, or ‘show me any reports that used some of my data’. This can bring organisations into a ‘Linked Enterprise Data’ (LED) framework, a parallel idea to the Linking Open Data initiative described earlier. Miller says that LED is all
40. http://tinyurl.com/ericmiller (URL last accessed 2009-06-09)
41. http://simile.mit.edu/ (URL last accessed 2009-07-20)
42. http://pipes.deri.org/ (URL last accessed 2009-07-07)
about exposing and linking enterprise data, while showing that there are benefits in terms of solutions that can be made available immediately. From NUI Galway, the DERI Pipes application allows a variety of input data types (RDF, XML, JSON, microformats, etc.) to be ‘mashed up’ through a graphical interface (inspired by Yahoo! Pipes) or by using a command line tool. Pipes are basically simple commands (taking an input and transforming it) that can be combined together to create a certain desired output. Since it is also available as open source, DERI Pipes can be easily extended or customised and applied in use cases where a local deployment is required. The DERI Pipes GUI allows pipes to be graphically edited, debugged and invoked. The execution engine is also available as a standalone JAR file which is ideal for embedded use.
4.10 Addressing the Semantic Web 'chicken-and-egg' problem

The challenge for the Semantic Web is related to the chicken-and-egg problem: it is difficult to produce data without interesting applications, and vice versa. The Semantic Web cannot work all by itself, because if it did it would be called the 'Magic Web'. For example, it is not very likely that you will be able to sell your car just by putting your Semantic Web file on the Web. Society-scale applications are required, i.e. consumers and processors of Semantic Web data, Semantic Web agents or services, and more advanced collaborative applications that make real use of shared data and annotations.

The Web | The Social Web | The Social Semantic Web
Personal Websites | Blogs | Semantic Blogs: semiBlog, Haystack, Structured Blogging, Zemanta
Content Management Systems, Britannica Online | Wikis, Wikipedia | Semantic Wikis: Semantic MediaWiki, SemperWiki, Platypus, DBpedia, Rhizome
AltaVista, Google | Google Personalised, Searchles | Semantic Search: SWSE, Swoogle, Intellidimension, Powerset, Hakia
CiteSeer, Project Gutenberg | Google Scholar, Book Search | Semantic Digital Libraries: JeromeDL, BRICKS, Longwell
Message Boards | Community Portals | Semantic Forums and Community Portals: SIOC, OpenLink Data Spaces, Talis Engage
Buddy Lists, Address Books | Online Social Networks | Semantic Social Networks: FOAF, PeopleAggregator, Social Graph API

Table 4.1. From Web '1.0' to the Social Semantic Web
The Semantic Web effort is mainly directed towards producing standards and recommendations that will interlink applications, and the primary Social Web meme, as
already discussed, is about providing user applications. These are not mutually exclusive43: with a little effort, many Social Web applications can and do use Semantic Web technologies to great benefit. Table 4.1 shows some evolving areas where these two streams have come together and will continue to do so: semantic blogging, semantic wikis, semantic social networks (Mika 2007) and, in parallel with these, the Semantic Desktop. These all fall in the realm of what Nova Spivack (CEO of Semantic Web company Radar Networks) has termed the 'Metaweb'44, or Social Semantic Information Spaces. Semantic MediaWiki45, for example, has already been commercially adopted46 by Centiare (now MyWikiBiz).

There are also great opportunities for mashing together Social Web data or applications and Semantic Web technologies, which just require some imagination. Dermod Moore wrote47 of one such Social Web application mashup for a hobby project: a Scuttle48 + Gregarius49 + Feedburner50 + Grazr51 hybrid (these are, respectively, a web-based social bookmarking application, a web-based feed aggregator system, an RSS feed management service, and a provider of web-based aggregation widgets). This mashup allows one to aggregate one's favourite blogs or other content on a particular topic and then to annotate bookmarks to the most interesting content found. Bringing this a step further, we could have a 'semantic social collaborative resource aggregator'. In this hypothetical system:
- Social network members specify their favourite content sources,
- You and your friends specify any topics of interest,
- You specify friends whose topic lists you value,
- Metadata aggregator collects content from sites you and friends like (which may be human tagged, or could be automatically tagged),
- Highlights content that may be of interest to you or your friends,
- If nothing of interest is currently available, content sources may have semantically-related sources in other communities for secondary content acquisition and highlighting,
- You bookmark and tag the interesting content, and share!

In fact, the recent Twine application from Radar Networks (which will be discussed later) offers much of this functionality.
43. http://tinyurl.com/csk97b (URL last accessed 2009-06-09)
44. http://tinyurl.com/ljv7gj (URL last accessed 2009-06-09)
45. http://semantic-mediawiki.org/ (URL last accessed 2009-06-09)
46. http://www.sbwire.com/news/view/9912 (URL last accessed 2009-06-09)
47. http://bonhom.ie/2006/04/what-weeks-delay-can-produce.html (URL last accessed 2009-06-09)
48. http://sourceforge.net/projects/scuttle/ (URL last accessed 2009-07-07)
49. http://gregarius.net/ (URL last accessed 2009-07-07)
50. http://www.feedburner.com/ (URL last accessed 2009-07-07)
51. http://www.grazr.com/ (URL last accessed 2009-07-07)
There have been many announcements relating to commercial Semantic Web applications recently, with much attention being given to the startup companies in this space: Powerset (acquired by Microsoft), Metaweb (creators of Freebase) and Radar Networks (Twine), and also to the big companies who have announced what they are doing with semantic data: Reuters (Calais API), Yahoo! (semantically-enhanced search) and Google (Social Graph API and Rich Snippets52).

According to Marta Strickland on the Three Minds blog53, the Bintro social networking service (short for 'business introduction') uses semantic technologies to match profiles together for business opportunities. Unlike LinkedIn, it is less based on who you know and more based on what you know. Another related service, from BanyanLink54, allows college students to collaborate with career centres, and uses semantic-matching technologies to connect students to internship opportunities.

In the next few chapters, we will discuss some of the most popular Social Web application areas, and describe how each of these can be enhanced with semantics, not only to provide more functionality but also to create an overall interconnected set of social information spaces.
52. http://tinyurl.com/richsnippets (URL last accessed 2009-07-07)
53. http://tinyurl.com/67g3ab (URL last accessed 2009-06-09)
54. http://www.banyanlink.com/ (URL last accessed 2009-07-07)
5 Discussions

Discussions on different topics can take a variety of forms online, from bulletin boards to blogs to mailing lists. With the move from text content to multimedia content in the Social Web, discussions are now being attached to content items as well (e.g. lists of comments appear for many videos on YouTube or photos on Flickr). More recently, the phenomenon of microblogging has taken root, where people create short text entries (normally limited to 140 characters) about what they are doing, and can reply within their own activity streams to messages posted by others using a simple reply syntax. However, all of these discussion methods share a common format: someone begins a conversation (either with a text post or a multimedia item), and others weigh in with their views on the topic or item under discussion.
5.1 The world of boards, blogs and now microblogs

Although it is difficult to calculate the size of discussion spaces like the 'blogosphere' (the world of blogs) or the 'boardscape' (the world of message boards), we can make some estimates based on the State of the Blogosphere (now called the State of the Live Web) from Technorati's Dave Sifry and also based on some statistics from BoardTracker (Breslin et al. 2007a). According to Dave Sifry1, there are some 230 million posts that use tags. This accounts for 35% of all posts, which leads one to a figure of 657 million posts in the blogosphere (or at least those that Technorati tracks). However, this does not include comments. BoardTracker, the largest message board search engine, estimates that there are over 6 billion discussions and about 100 billion posts on message boards. A discussion 'thread' has on average 16 replies. Assuming a similar comment ratio on blogs, this could bring the number of blog discussion entries (starter posts and comments) to about 10 billion posts and comments. Based on this, the boardscape is roughly 10 times bigger than the blogosphere, but of course message boards have been around for longer than blogs.

Where blogging allowed people to send their thoughts online to an open audience, audioblogging or podcasting allowed people to record them, and videoblogging or 'vlogging' allowed them to deliver their messages via video. Now, microblogging enables anyone to exchange short text messages within their community or simply to write in brief to the general public. Twitter, the world's largest microblogging
1. http://www.sifry.com/alerts/archives/000493.html (URL last accessed 2009-06-09)
site, celebrated its one billionth ‘tweet’ (microblog post) in November 20082, and is nearing three billion posts as of July 2009. The total number of microblog posts in the ‘tweetosphere’ could be double that when the contributions of other microblogging sites like Jaiku, Identi.ca and Pownce (now closed) are taken into account. We will now give an overview of some popular discussion methods and detail how semantic technologies have been brought to bear on applications in these areas. While discussion systems also encompass Q&A sites, instant messaging and other services, we will not cover them here. However, we note that the methods we now describe can easily be deployed on other types of discussion systems, and we refer to the SIOC Types module3 (to be detailed later) that defines the required terms for such systems.
5.2 Blogging

A blog, or weblog, is a user-created website consisting of journal-style entries displayed in reverse-chronological order. Entries may contain text, links to other websites, and images or other media. Often there is a facility for readers to leave comments on individual entries, which makes blogs an interactive medium. Thanks to the use of trackbacks4, bloggers can also state that a blog post is a reply to another one, introducing a distributed system for conversations in the blogosphere. Bloggers often link to their favourite blogs or friends through 'blogroll' links on the sidebar (forming social networks of bloggers, as shown in Figure 5.1). Also, the latest headlines, with hyperlinks and summaries, are syndicated using the RSS or Atom formats (e.g. for reading one's favourite blogs with a feed reader) as described earlier.

Blogs may be written by individuals or by groups of contributors (ranging from the arms of political campaigns to corporations, e.g. the Google Blog). A blog may function as a personal journal, or it may provide news or opinions on a particular subject. They are also starting to cross the generation gap: teenagers might have a blog via a social networking service, their parents may blog themselves, and even their grandparents can be posting, reading or commenting on blogs! Compared to the web trend of the 90s, where people set up homepages on hosting services like Geocities, blogs are now one of the most popular methods by which people can acquire and maintain an online presence. However, because of the blogging trend towards spontaneous and more regular updates, as well as the reverse-chronological ordering of posts on blogs, these online presences have a completely different dynamic.
2. http://mashable.com/2008/11/12/twitter-one-billion-tweets-wow/ (URL last accessed 2009-06-09)
3. http://rdfs.org/sioc/types (URL last accessed 2009-07-07)
4. http://www.sixapart.com/pronet/docs/trackback_spec (URL last accessed 2009-06-09)
Fig. 5.1. A graph of a community of bloggers connected to each other via blogroll links (the mid-grey names are those linked to a single highlighted blog shown in dark-grey, 'Holy Shmoly')
5.2.1 The growth of blogs

The growth and take-up of blogs over the past four years has been dramatic, with a doubling in the size of the blogosphere every six or so months (according to statistics from Technorati5). Over 120,000 blogs are created every day, working out at about one a second. Nearly a million blog posts are being made each day, with over half of bloggers still contributing to their sites three months after the blog's creation. Technorati counted 70 million blogs at the beginning of 2007 and now estimates that there are 133 million of them available6.

The nature of blogs is quite varied, from teenagers' weblogs to technology experts' sites or political opinions. Many political or opinion-type blogs are considered to
5. http://technorati.com/weblog/2007/04/328.html (URL last accessed 2009-06-09)
6. http://technorati.com/blogging/state-of-the-blogosphere/ (URL last accessed 2009-06-09)
be a form of 'grassroots journalism' (Gilmor 2004), and they often have audiences larger than, or at least comparable to, those of mainstream media websites, as observed in some studies by Technorati7. Many bloggers also use Google AdSense advertising on their blogs to get extra revenue (therefore search engine optimisation for blogs becomes important for getting visitors). It is also interesting to study the temporality of information flow between the blogosphere and traditional media services (Cointet et al. 2007). As we will see when we introduce microblogging, bloggers are often at the forefront of information, where traditional media cannot act as fast as the online 'wisdom of crowds'.

Similar to accidentally wandering onto message boards and web-enabled mailing lists, when searching for something on the Web one often happens across a relevant entry on someone's blog. RSS feeds are also a useful way of accessing information from your favourite blogs, but they are usually limited to the last 15 or 20 entries, and do not provide much information on exactly who wrote or commented on a particular post, or what the post is talking about. Some approaches like SIOC (Semantically-Interlinked Online Communities, more later) aim to enhance the semantic metadata provided about blogs, comments and posts, but there is also a need for more information about what exactly a person is writing about. Blog entries often refer to resources on the Web, and these resources will usually have a context in which they are being used, and in terms of which they could be described. For example, a post which critiques a particular resource could incorporate a rating, or a post announcing an event could include start and end times.

When searching for particular information in or across blogs, it is often not that easy to get it, because of 'splogs' (spam blogs) and also because the virtue of blogs so far has been their simplicity: apart from the subject field, everything and anything is stored in one big text field for content. Keyword searches may give some relevant results, but useful questions such as 'find me all the Chinese restaurants that bloggers reviewed in Dublin with a rating of at least 5 out of 10' cannot be posed, and you cannot easily drag-and-drop events or people or anything (apart from Uniform Resource Locators - URLs) mentioned in blog posts into your own applications.

Blog posts are sometimes categorised (e.g. 'Scotland', 'Movies') by the post creator using pre-defined categories or tags, such that those on similar topics can be grouped together using free-form tags / keywords or hierarchical tree categories. Posts can also be tagged by others using social bookmarking services like del.icio.us or personal aggregators like Gregarius. Other services like Technorati can then use these tags or keywords as category names for linking together blog posts, photos, links, etc. in order to build what they call a 'tagged web'. Utilising Semantic Web technology, both tags and hierarchical categorisations of blog posts
7. http://technorati.com/weblog/2006/02/83.html (URL last accessed 2009-06-09)
can be further enriched and exposed in RDF8 via the SKOS (Simple Knowledge Organisation Systems) framework9. There have been some approaches to tackle the issue of adding more information to blog posts, so that advanced queries can be made regarding the posts’ content, and the things that people talk about can be reused in other posts or applications (because not everyone is being served well by the lowest common denominator that we currently have in blogs). One approach is called ‘structured blogging’10 (mainly using microformats to annotate blog content), and the other is ‘semantic blogging’ (using RDF to represent both blog structures and blog content): both approaches can also be combined together.
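Returning to the SKOS-based enrichment just mentioned, a sketch of what an exposed tag hierarchy might look like in Turtle is shown below (the URIs and labels are invented for illustration); a blog post can then point at the concept rather than at a bare keyword:

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix sioc: <http://rdfs.org/sioc/ns#> .

    <http://example.org/tags/semantic-web> a skos:Concept ;
        skos:prefLabel "Semantic Web" ;
        skos:altLabel  "semweb" ;
        skos:broader   <http://example.org/tags/web-technologies> .

    <http://example.org/blog/2009/06/a-post> a sioc:Post ;
        sioc:topic <http://example.org/tags/semantic-web> .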
5.2.2 Structured blogging

Structured blogging is an open-source community effort that has created tools to provide microcontent from popular blogging platforms such as WordPress and Movable Type. The term microcontent indicates a unit of data and associated metadata communicating one main idea. Sources of microcontent include microformats11, which enable semantic markup to be embedded directly within XHTML. Microformats therefore provide a simple method of expressing content in a machine-readable way, facilitating re-use and aggregation. An example of a microformat is hReview, which allows for the structured description of reviews within web pages. Although the original effort has tapered off, structured blogging is continuing through services like LouderVoice12, a review site which integrates reviews written on blogs and other websites.

In structured blogging, packages of structured data are becoming post components (Figure 5.2). Sometimes (not all of the time) a person will have a need for more structure in their posts - if they know a subject deeply, or if their observations or analyses recur in a similar manner throughout their blog - then they may best be served by filling in a form (which has its own metadata and model) during the post creation process. For example, someone may be writing a review of a film they went to see, or reporting on a sports game they attended, or creating a guide to tourist attractions they saw on their travels. Not only do people get to express themselves more clearly, but blogs can start to interoperate with enterprise applications through the microcontent that is being created in the background.
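For instance, a short restaurant review marked up with the hReview microformat mentioned above might be embedded as follows (the details are invented); the class names are what aggregators such as review sites look for when extracting the structured review data:

    <div class="hreview">
      <span class="item"><span class="fn">The Quay Bistro</span></span> -
      rated <span class="rating">4</span> out of 5.
      <span class="description">Great seafood and a friendly atmosphere.</span>
      Reviewed by <span class="reviewer vcard"><span class="fn">Joan Smith</span></span>
      on <abbr class="dtreviewed" title="2009-06-09">9 June 2009</abbr>.
    </div>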
8. http://tinyurl.com/lua587 (URL last accessed 2009-06-09)
9. http://www.w3.org/2004/02/skos/ (URL last accessed 2009-06-09)
10. http://structuredblogging.org/ (URL last accessed 2009-06-09)
11. http://microformats.org/ (URL last accessed 2009-06-09)
12. http://www.loudervoice.com/ (URL last accessed 2009-06-09)
Fig. 5.2. A typical structured blogging entry form, in this case for a restaurant review
Take the scenario where someone (or a group of people) is reviewing some soccer games that they watched. Their after-game soccer reports will typically include information on which teams played, where the game was held and when, who were the officials, what were the significant game events (who scored, when and how, or who received penalties and why, etc.) - it would be easier for these blog posters if they could use a tool that would understand this structure, presenting an editing form with the relevant fields, and automatically create both HTML and RSS with this structure embedded in it. Then, others reading these posts could choose to reuse this structure in their own posts, and their blog reading / writing application could make this structure available when the blogger is ready to write. As well as this, reader applications could begin to answer questions based on the form fields available – ‘show me all the matches from South Africa with more than two goals scored’, etc. At the moment, structured blogging tools (such as those from LouderVoice) provide a fixed set of forms that bloggers can fill in for things like reviews, events, audio, video and people - but there is no reason that people could not create custom structures, and news aggregators or readers could auto-discover an unknown structure, notify a user that a new structure is available, and learn the structure for reuse in the user’s future posts. Semantic Web technologies could also be used to ontologise any available post structures for more linkage and reuse. Some other past attempts at structured blogging include Qlogger13, the Lafayette Project14, and JemBlog15. To date, structured blogging tools have been provided for single-user blogging platforms16, but would be more suited for deployment in multi-user blogging communities powered by WordPress Multi-User or Drupal where they could better achieve critical mass. It would also be interesting if structured blogging tools could be integrated with Social Web reading lists or media consumption sites, e.g. All Consuming or Last.fm. Most of these produce RSS, which could be used as the basis for getting potential review items in a dropdown list.
5.2.3 Semantic blogging

Blog posts are usually only tagged using free-form keywords by the blog owner (i.e. the blog post creator). However, there is often much more to say about a blog post than simply what category it belongs in or what topics it relates to. Semantic Web technologies can also be used to enhance any available post structures in a
13. http://www.qlogger.com/ (URL last accessed 2009-06-09)
14. http://craphound.com/megstalk.txt (URL last accessed 2009-06-09)
15. http://sourceforge.net/projects/jemblog/ (URL last accessed 2009-06-09)
16. http://structuredblogging.org/download.php (URL last accessed 2009-06-09)
machine-readable way for more linkage and reuse. This is where semantic blogging comes in. (Cayzer 2004) envisioned an initial idea for semantic blogging with two main aspects that could improve blogging platforms: a richer structure, both for blog post metadata and their topics - using shared ontologies - and richer queries in terms of subscription, discovery and navigation. He later defined a Snippet Manager service implementing some of these features. (Karger and Quan 2004) gave some other ideas about 'what would it mean to blog on the Semantic Web'. They argued that such tools should be able to produce structured and machine-understandable content in an autonomous way, without any additional input from the users. They also provided a first prototype based on the Haystack platform (Quan et al. 2003) that showed new ways to navigate between content thanks to these techniques.

Traditional blogging is aimed at what can be called the 'eyeball Web' - i.e. text, images or video content that is targeted mainly at people (Möller et al. 2006). Semantic blogging aims to enrich traditional blogging with metadata about the structure (what relates to what and how) and the content (what is this post about - a person, event, book, etc.). Already RSS and Atom (a format for syndicating web content) are used to describe blog entries in a machine-readable way and enable them to be aggregated together. However, by augmenting this data with additional structural and content-related metadata, new ways of querying and navigating blog data become possible. In structured blogging, microcontent such as microformats or RDFa is positioned inline in the (X)HTML (and subsequent syndication feeds) and can be rendered via CSS. Structured blogging and semantic blogging do not compete, but rather offer metadata in slightly different ways (using microcontent and RDF respectively). There are already mechanisms such as GRDDL which can be used to move from one to the other, and which allow one to provide RDF data from embedded RDFa or microformats. Extracted RDF data can then be reused as one would any native RDF data, and as such it may be processed using common Semantic Web tools and services.

The question remains as to why one would choose to enhance their blogs and posts with semantics. Current blogging offers poor query possibilities (except for searching by keyword or seeing all posts labelled with a particular tag). There is little or no reuse of data offered (apart from copying URLs or text from posts). Some linking of posts is possible via direct HTML links or trackbacks, but again, nothing can be said about the nature of those links (are you agreeing with someone, linking to an interesting post, or are you quoting someone whose blog post is directly in contradiction with your own opinions?). Semantic blogging aims to tackle some of these issues, by facilitating better (i.e. more precise) querying when compared with keyword matching, by providing more reuse possibilities, and by creating 'richer' links between blog posts.
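As a sketch of what such enriched post metadata might look like (the URIs are invented, and the SIOC and Dublin Core terms shown are one plausible modelling rather than the output of any particular tool), a post, its creator, its topic and the post it replies to can all be described explicitly in RDF:

    @prefix sioc:    <http://rdfs.org/sioc/ns#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <http://example.org/blog/2009/06/semantic-web-in-galway> a sioc:Post ;
        dcterms:title    "The Semantic Web in Galway" ;
        dcterms:created  "2009-06-09T10:30:00Z" ;
        sioc:has_creator <http://example.org/blog/author/joan> ;
        sioc:topic       <http://dbpedia.org/resource/Semantic_Web> ;
        sioc:reply_of    <http://another.example.net/blog/2009/05/an-earlier-post> ;
        sioc:content     "Last week's meetup showed how much local interest there is..." .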
Fig. 5.3. Annotating a blog entry with an address book entry (Möller et al. 2006)
Fig. 5.4. Integrating a semantic blogging application with desktop data
It is not simply a matter of adding semantics for the sake of creating extra metadata, but rather a case of being able to reuse what data a person already has in their desktop or web space and making the resulting metadata available to others. People are already (sometimes unknowingly) collecting and creating large amounts of structured data on their computers, but this data is often tied into specific applications and locked within a user’s desktop (e.g. contacts in a person’s address book as in Figure 5.3, events in a calendaring application, author and title information in documents, audio metadata in MP3 files). Semantic blogging can
be used to ‘lift’ or release this data onto the Web, as in the semiBlog17 application (now called Shift) which allows users to reuse metadata from Apple Mac desktops in blog posts (see Figure 5.4). For example, looking at the picture in Figure 5.5 (Möller and Decker 2005), Ina writes a blog post which she annotates using content from her desktop calendaring and address book applications. She publishes this post onto the Web, and John, reading this post, can reuse the embedded metadata in his own desktop applications. In this picture, the semantic blog post is being created by annotating a part of the post text about a person with an address book entry that has extra metadata describing that person. Once a blog has semantic metadata, it can be used to perform queries such as ‘which blog posts talk about papers by Stefan Decker?’ It can also be used for browsing not only across blogs but also other kinds of discussion methods; or it can be used by blog readers for importing metadata into desktop applications (or using the Web as a clipboard).
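One way the example question above ('which blog posts talk about papers by Stefan Decker?') might be expressed as a SPARQL query is sketched below; the property choices (posts referencing papers via dcterms:references, with authorship given by foaf:maker) are an assumption made for illustration and depend entirely on how the blog metadata was actually modelled:

    PREFIX sioc:    <http://rdfs.org/sioc/ns#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX foaf:    <http://xmlns.com/foaf/0.1/>

    SELECT DISTINCT ?post
    WHERE {
      ?post   a sioc:Post ;
              dcterms:references ?paper .
      ?paper  foaf:maker ?author .
      ?author foaf:name  "Stefan Decker" .
    }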
Fig. 5.5. Lifting semantic data from the desktop to the Web and back again
Conversations can also span multiple blog sites in blog posts and their comments, and bloggers often respond to the entries of other users in their own blogs. The use of semantic technologies can enable the tracking of these distributed conversations. Links between units of conversation could even be enhanced to include sentiment information, e.g. who agrees or disagrees with the initial opinion.

SparqlPress18 is another prototype that leverages Semantic Web technologies in blogs. It is not a separate blogging system but rather an open-source plugin for the popular WordPress platform, and it aims to produce, integrate and reuse RDF data for an enhanced user experience. SparqlPress mainly relies on the FOAF, SIOC and SKOS Semantic Web vocabularies.
17. http://semiblog.semanticweb.org/ (URL last accessed 2009-06-09)
18. http://bzr.mfd-consult.dk/sparqlpress/ (URL last accessed 2009-06-08)
One interesting feature that SparqlPress provides is the way that it combines FOAF and OpenID. OpenID is a decentralised login system that allows people to register on different websites using the same login ID and password, with the login being a URL. From that URL, a person can link to their FOAF profile, i.e. an RDF representation of their persona, by adding a link in the web page header or by simply using RDFa. Via this link, SparqlPress can retrieve the FOAF file of a user when they are logging in, and it is then able to display extra information about the user on the blog. This could include their homepage and other accounts or blogs he or she may have on the Web, if this information is provided in the FOAF file. Connecting OpenID to one’s FOAF social networking profile19 can also be useful for the blocking of blog comment spam. Zemanta provides client-side and server-side tools that enrich the content being created by bloggers or publishers, allowing them to automatically add hyperlinks, choose appropriate tags, and insert images based on an analysis of the content being posted. Zemanta also automatically suggests Common Tags20 to publishers (more in Chapter 8), and allows them to embed these tags in their content. As well as the aforementioned semantic blogging systems, others have been developed by HP21 (Cayzer 2004), the National Institute of Informatics, Japan22 (Ohmukai and Takeda 2004), and MIT (Karger and Quan 2004).
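Returning to the FOAF/OpenID combination described above: a common convention for linking the page behind an OpenID URL to its owner's FOAF profile is an autodiscovery link in the XHTML header, along the following lines (the file location is of course illustrative); this is the kind of link a plugin such as SparqlPress can follow in order to fetch the profile:

    <link rel="meta" type="application/rdf+xml" title="FOAF"
          href="http://example.org/~jsmith/foaf.rdf" />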
5.3 Microblogging

Microblogging is a recent social phenomenon on the Social Web, with similar usage motivations (i.e. personal expression and social connection) to other applications like blogging. It can be seen as a hybrid of blogging, instant messaging and status notifications, allowing people to publish short messages (usually fewer than 140 characters) on the Web about what they are currently doing. These short messages, or microblog posts, are often called 'tweets' and have a focus on real-time information. As a simple and agile form of communication in a fluid network of subscriptions, it offers new possibilities regarding lightweight information updates and exchange. Twitter is now one of the largest microblogging services, and the value of microblogging is demonstrated by its popularity and by Google's acquisition of Jaiku, another leading microblogging service.

Individuals can publish their brief text updates using various communications channels such as text messages from mobile phones, instant messaging, e-mail and the Web. The simplicity of publishing such short updates in various situations or locations, and the creation of a more flexible social network based on subscriptions
19. http://esw.w3.org/topic/FoafOpenid (URL last accessed 2009-06-09)
20. http://www.commontag.com/ (URL last accessed 2009-07-13)
21. http://www.hpl.hp.com/personal/Steve_Cayzer/semblog.htm (URL last accessed 2009-06-09)
22. http://www.semblog.org/ (URL last accessed 2009-06-09)
and response posts makes microblogging an interesting communications method that has been studied from a social point of view (Java et al. 2007). Moreover, this means of publishing can be extended with multimedia in the form of short video recordings, e.g. as in Seesmic, which is considered to be the first video microblogging service. Those who are interested in what someone is doing can also receive updates through various means (Web, e-mail, IM, SMS).

Some people call microblogging 'lifestreaming', while others think it is just a lot of mundane, trivial stuff (e.g. 'having toast for breakfast'). Microblogging is addictive not because the content is interesting but rather because you may want to find out what someone is doing right now. Through microblogging services such as Twitter, you can know very minute things about someone's life: what they are thinking, that they are tired, etc. Historically, we have only known that kind of information for a very few people that we are close to (or celebrities).

Microblogging is quite useful for getting a snapshot of what is going on in, and for interacting with, your community or communities of interest. Similar to using a blog aggregator and scanning the titles and summaries of many blogs at once, thereby getting a feel for what is going on at a particular point in time, microblogging allows one to view status updates from many people in a compact (screen) space. Some microblogging services also have SMS integration, allowing one to send updates and receive microblog posts from friends via a mobile phone (although Twitter dropped outbound SMS notifications for non-US residents during 2008).

One of the advantages of microblogs is that people can talk about a greater range of subjects due to the ease of writing a short post, since they are more likely to talk about a variety of diverse topics in multiple microblog posts that are limited to 140 characters, as opposed to writing a longer, single blog post. In fact, this constraint also makes microblogging somewhat more interactive due to the back-and-forth conversations that result when someone looks for clarification of what is being said in those shorter status updates. Microblogging is also more conversational because everyone is using the same service, and there are fewer delays in logging on or filling in profile fields than when posting a comment on someone else's blog.

A disadvantage is that the momentum of microblogging sites like Twitter is now such that you have to keep checking back much more regularly to be kept up-to-date with everything that is going on or to find those hidden gems of information or knowledge. If you are subscribed to a few hundred people, it makes it difficult (impossible even) to see all that is relevant, since even the most interesting microbloggers will not be talking about stuff that is interesting to you all the time. However, Twitter clients like TweetDeck23 do allow various searches to be set up in separate columns, such that updates relevant to a certain keyword or
23. http://www.tweetdeck.com/ (URL last accessed 2009-07-21)
combination of keywords (e.g. 'galway OR ireland', 'semantic web') can be monitored quite easily, irrespective of whom one is following.

This communication method is also promising for corporate environments in facilitating informal communication, learning and knowledge exchange (e.g. Yammer24 is an enterprise microblogging platform). Its so far untapped potential can be compared to that of company-internal wikis some years ago. Microblogging can be characterised by rapid (almost real-time) knowledge exchange and fast propagation of new information. For a company, this can mean real-time Q&A and improved informal learning and communication, as well as status notifications, e.g. about upcoming meetings and deliveries. However, the potential for microblogging in corporate environments still has to be demonstrated with real use cases (e.g. IBM has recently deployed an internal beta microblogging service called Blue Twit25). We expect that a trend of corporate microblogging will emerge in the next few years, similar to what happened with blogging, wikis and other Enterprise 2.0 services, as we will describe in Chapter 12.

Traditionally, microblogging has been mainly used by technically-minded Web users and bloggers, but this is beginning to change, with newspapers and celebrities getting in on the phenomenon. Microblog-type publishing can also be set up on personal services: for example, the WordPress platform offers a dedicated template interface (Prologue) that lets people publish short and real-time updates, and SixApart have recently developed their own installable microblogging platform called Motion following their acquisition of Pownce. However, there is no aggregation for personal microblogs that would take into account its special characteristics as a new medium.

In the section on blogging, we discussed how it has led to 'grassroots journalism'. To that extent, microblogging is an interesting phenomenon, especially Twitter, as updates can be posted in many ways and from different devices (e.g. via text message from mobile phones). Hence, it was one of the first media to report the major earthquake in China and the terrorist attacks in Mumbai26. Of course, one must be careful about choosing who to trust or not, as in any social media service. By sharing their personal lifestreams, people sometimes expose themselves to privacy issues, voluntarily or not. Some services allow users to block public access to their tweets, but most publish them for all to see. For example, it appears that more than 10 posts every two minutes on Twitter are about meetings, many of them being professional or corporate ones. These may contain some facts that competitors can take into account.
24. http://www.yammer.com/ (URL last accessed 2009-07-21)
25. http://tinyurl.com/58cezv (URL last accessed 2009-06-09)
26. http://tinyurl.com/5levvl (URL last accessed 2009-06-09)
5.3.1 The Twitter phenomenon

Twitter was established by Evan Williams, formerly of Pyra Labs and Blogger. While working with Jack Dorsey and Biz Stone at Obvious Corp. on the podcasting directory service Odeo, they began developing the Twitter microblogging service in Ruby on Rails, and it took two weeks to build the first functional prototype of Twitter27. It has become hugely popular - now ranked at website number 15 in the world - with millions of users worldwide (20% of Twitter users are in Japan). As a result, there are many challenges in scaling, since Twitter is one of the largest Rails applications, and according to Williams various scaling problems have yet to be solved.

Like many social websites, Twitter has evolved as users of the application demanded it. For example, Twitter initially did not have a system for allowing people to comment on each others' tweets, so the users invented a convention of using the 'at' sign and a username (e.g. @johnbreslin) to comment on other people's tweets. Twitter also has an API that has enabled third-party services like Twittervision (a map of the world showing various random tweets taking place). Twitter's API has been quite successful, with dozens of desktop applications, others that extract data and present it in different ways, various bots that post information to Twitter (URLs, news, weather, etc.), and more recently a timer application that will send a message at a certain time in the future for reminders (e.g. via the SMS gateway). The API allows a simple service like Twitter to become more powerful and reusable in other applications. Although there is no official business model for Twitter beyond ads on their Japanese service, an example of third-party commercial usage of Twitter is Woot, a site offering a single special-offer item per day, which has a lot of followers on Twitter.

Twitter has evolved beyond being a haven for social media gurus like Robert Scoble or technology experts like Ward Cunningham (the creator of the wiki), as it now has its fair share of celebrity 'twitterers' or 'tweeple' with many followers (in fact, the first result from Google when you search for twitterer is actor and writer Wil Wheaton, or '@wilw' on Twitter). Some celebrities are twittering by proxy or via their 'social media directors', but many celebrities take the time out to engage with the public and with their fans by posting tweets themselves. Politicians have been using Twitter as part of their election campaigning, e.g. Barack Obama (with over 1.5 million followers in July 2009), John McCain and John Edwards. Actors have also embraced Twitter: from the popular science fiction drama Heroes, Greg Grunberg, Brea Grant and David H. Lawrence XVII are regular Twitter users. David Hewlett (Stargate Atlantis), movie star Luke Wilson, director Kevin Smith, comedian John Cleese and Stephen Fry also frequent Twitter, as do sportspeople Shaquille O'Neal, Lance Armstrong and Andy Murray. From the music world,
27. http://tinyurl.com/evanwilliams (URL last accessed 2009-06-09)
there is Britney Spears (or at least her social media staff), MC Hammer and Dave Matthews. Other famous twitterers include Virgin founder Richard Branson, illusionist Penn Jillette, and former US vice-president Al Gore.
5.3.2 Semantic microblogging

Michael Arrington wrote28 a post on the technology blog TechCrunch about the need for a 'decentralised Twitter' and for open alternative microblogging platforms, which was picked up by technologists Dave Winer, Marc Canter and Chris Saad amongst others. The SMOB29 or semantic microblogging prototype developed at DERI (a Semantic Web research institute in NUI Galway), and available as an open-source framework, is an example of how Semantic Web technologies can provide an open platform for decentralised / distributed publishing of microblogging content (see Figure 5.6), mainly using the FOAF and SIOC vocabularies.
Fig. 5.6. Global architecture of distributed semantic microblogging
An aim of SMOB was also to demonstrate how such technologies can provide users with a way to control, share and remix their own data as they want, not solely dependent on the facilities provided by a third-party service. In this way, SMOB-published data belongs to the user who created it. As soon as someone writes some microblog content using a SMOB client, the content is spread through
28. http://tinyurl.com/69n9o9 (URL last accessed 2009-06-09)
29. http://smob.sioc-project.org/ (URL last accessed 2009-07-20)
various microblogging servers, or aggregators, but remains available locally to the user who created it, as depicted in Figure 5.6. Hence, if one aggregator closes for some reason, the user can still use their data, as it really belongs to him or her and not to any third-party aggregation service. This goal is more globally shared by the DataPortability project that will be described later.
Fig. 5.7. Latest SMOB updates rendered in Exhibit
In order to represent microblogging data, SMOB uses FOAF and SIOC to model microbloggers, their properties, account and service information, and the microblog updates that users create. A multitude of publishing services can ping one or a set of aggregating servers as selected by each user, and it is important to note that users retain control of their own data through self-hosting, as we detailed previously. The aggregate view of microblogs uses ARC230 for storage and querying, and MIT's Exhibit31 faceted browser for the user interface, as shown in Figure 5.7. It therefore offers a user-friendly interface to display complex RDF data aggregated from distributed sources. Moreover, in order to further benefit from Semantic Web technologies, microblog posts can also embed semantic tags, e.g. geographical tags which can leverage the GeoNames database to power new visualisations such as the map view in Figure 5.8.
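A sketch of how a single update might be represented in RDF is shown below (the URIs are invented and SMOB's actual output may differ in detail); a SIOC Types microblog post is linked to the user account that created it, which in turn is linked to the author's FOAF description:

    @prefix sioc:    <http://rdfs.org/sioc/ns#> .
    @prefix sioct:   <http://rdfs.org/sioc/types#> .
    @prefix foaf:    <http://xmlns.com/foaf/0.1/> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <http://example.org/smob/post/20090720-1800> a sioct:MicroblogPost ;
        sioc:content     "Heading to the Semantic Web meetup in Galway tonight" ;
        dcterms:created  "2009-07-20T18:00:00Z" ;
        sioc:has_creator <http://example.org/smob/user/joan> .

    <http://example.org/smob/user/joan> a sioc:UserAccount ;
        sioc:account_of  <http://example.org/~jsmith/foaf.rdf#me> .

    <http://example.org/~jsmith/foaf.rdf#me> a foaf:Person ;
        foaf:name "Joan Smith" .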
30. http://arc.semsol.org/ (URL last accessed 2009-06-09)
31. http://www.simile-widgets.org/exhibit/ (URL last accessed 2009-06-09)
At the moment, the complete data set of updates is publicly available and can be browsed with any RDF browser such as Tabulator, but in the future privacy concerns could be addressed by requiring OpenID authentication. The SMOB client has also been adapted to allow cross-posting to StatusNet (another distributed and open microblogging platform) and to Twitter (via cURL and HTTP authentication).
Fig. 5.8. Map view of latest microblog updates with Exhibit
5.4 Message boards

The message board has been a popular feature of internet-based communication since the early days of mailing lists and Usenet newsgroups. Web-based message boards or 'forums', which originated around 1995, are online discussion areas that operate in a similar manner to the dial-up bulletin board services and Usenet newsgroups from the 1980s and 1990s. Since the late 1990s, message board systems have become more sophisticated, allowing multiple message boards, categorised by common parent headings, to be hosted on a single site.

One of the easiest ways of creating an online community is via the creation of a message board. Message boards allow discussions to be held by many Internet users in a community on a variety of subjects. Most forums on community sites employ some threaded display methods, where users post threads on a particular topic, and other users can then reply by posting on these threads. Those who share a common interest will discuss related topics of interest on the message board, forming a virtual sub-community. A message board normally contains a set of forums classified into categories, and may also integrate event meeting calendars. Message boards have evolved beyond the traditional admin-maintained structure into one where forums and categories can be created once a critical mass of user support has been received. The moderator of a forum has the responsibility for pruning undesirable threads and for banning unwanted users from the forum. A feedback forum can be used to raise useful suggestions or bug reports that can increase the usability of the underlying software.

These message boards are a thriving part of the current HTML Web. Posts on a message board can be referenced via a URI, which is required for Semantic Web applications. Some popular message board systems include vBulletin32, phpBB33, Invision Board, and the ezboard forum hosting service34. Websites of open-source projects such as those hosted on SourceForge.net include forum functionality to enable discussions between project members and software users. However, single message boards and multi-forum sites primarily exist as islands that are not connected together. Apart from some multi-function content management systems such as Drupal that offer a unified login or services using OpenID, there have been few efforts towards connecting various message boards together (Zoints35, Klostu36) despite the potential benefits that this may offer (e.g. linking complementary topics across forum sites, enabling distributed conversations, linking a user's distributed posts). By interconnecting these message boards together and viewing them as part of an overall 'boardscape'37, we can enable these potential uses. The concept of a boardscape can be thought of as the world of message boards: millions of users creating billions of posts across thousands of multi-forum sites on the Internet. It is the collection of all message boards and the potential aggregated power of all board communities and their member collectives.
32 http://www.vbulletin.com/ (URL last accessed 2009-07-16)
33 http://www.phpbb.com/ (URL last accessed 2009-07-16)
34 http://www.ezboard.com/ (URL last accessed 2009-07-16)
35 http://www.zoints.com/ (URL last accessed 2009-06-09)
36 http://www.klostu.com/ (URL last accessed 2009-06-09)
37 http://www.boardscape.com/ (URL last accessed 2009-07-21)

5.4.1 Categories and tags on message boards

Forum sites tend to be relatively narrow in scope, being dedicated to communities in niche areas (e.g. anime, sports, TV shows, etc.), and therefore are normally categorised according to specific aspects of that niche interest.
Other more general-purpose message boards may have more wide-ranging categories; for example, the Irish community website boards.ie uses categorisations similar to those found in the Open Directory Project or the Yahoo! Directory, with top-level categories ranging from Arts and Business to Sports and Technology. Many social networking sites have also incorporated community discussion features such as message boards. Unfortunately, social networks also suffer from poor categorisation of community areas. For example, on the orkut site, hundreds of thousands of community message boards are classified in just 28 top-level categories. This makes it difficult for users themselves to find communities matching their interests, or for machines to match users to communities.

As regards tagging content on message boards, a number of modules have recently been developed for board administrators that allow them to provide tagging functionality for users of their sites. There are also some newer message board systems (e.g. bbPress, vBulletin 3.7) that offer integrated tagging features. These tagging solutions usually provide a tag cloud of the most popular tags being used by message board posters on a particular site, and via the boardscape they can lead to an overall tag cloud view of tags used across all message board sites that participate in the tagging process. BoardTracker38, a message board search engine, provides an aggregate search across 40,000 message boards from around the world. This search can also be integrated by board administrators into their own discussion sites, providing results from the local site or from many other searchable sites on the boardscape. Threads retrieved through a BoardTracker search may already have been tagged by the original content creator, but they can also be tagged by the third party who performed the search (thereby creating new connections between content in the boardscape).

Some sites and services use a hierarchical structured categorisation for classifying message boards, so that a particular message board may be linked to a subcategory in a taxonomy (e.g. Sports & Recreation > Football > American). Such a structured categorisation can be used to match user-defined interests to other possible message boards of interest, either existing message boards on registration (in networks of forums such as CrowdGather39) or newly-created boards through a periodic matching system. New message boards to discuss a common interest could also be proposed or automatically created based on the level of demand for the topic of interest as expressed in user profiles. Using the tagging system already mentioned, and by analysing the tags used in message boards that have already been categorised, a simple way of matching users (via their most commonly used tags) and message boards (either with the highest occurrences of that tag or in a category where that tag is prevalent) becomes apparent.
38 http://www.boardtracker.com/ (URL last accessed 2009-06-09)
39 http://www.crowdgather.com/ (URL last accessed 2009-07-18)
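To make this concrete, once boards expose their data in RDF using SIOC (as described later in this chapter), such tag-based matching could be sketched as a simple query. The query below is illustrative only - it assumes tags are modelled with sioc:topic and uses an invented tag URI - rather than the algorithm of any particular deployed service:

PREFIX sioc: <http://rdfs.org/sioc/ns#>

# Forums containing posts tagged with one of a user's most commonly used tags
SELECT DISTINCT ?forum
WHERE {
  ?post a sioc:Post ;
        sioc:has_container ?forum ;
        sioc:topic <http://example.org/tags/anime> .
}

The same pattern could be run per tag in a user's profile, with the forums returning the most matches suggested as candidate communities.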
5.4.2 Characteristics of forums

Message boards have long been recognised as a place where the majority of discussions between people on the Web take place. However, it has not been detailed how much discussion actually takes place in the boardscape and what its characteristics are. Today's boards are descendants of the dial-up bulletin board systems (BBS) that developed in parallel to newsgroups during the 1980s. After making their appearance on the Web, boards diverged, evolved and were enhanced with some new concepts and technologies. However, boards retained their basic conceptual design as a communal discussion platform. Also, much of the psychological and motivational aspects of boards remained intact.

Much like their earlier versions, where different dial-up BBS systems were physically disconnected from one another and therefore had very little or no relationship to each other at all, today's message boards are mostly isolated and tend to have no connections with other message boards on the Web, despite the fact that the technical barriers have been removed. Message board owners tend to retain a prominent level of moderation and authentication for members and content. Almost all boards require that you register as a user before you can take part in discussions. A great degree of isolation and control over the flow of outbound traffic from the board is commonplace. Linking and quoting discussions (or parts of them) from other boards is uncommon. Communication with members of other boards is impossible. This aspect of the boardscape is what Ron Kass of BoardTracker refers to as the 'island mentality'.

A term which is often used in the boardscape is 'lurkers', which refers to individuals who follow (as viewers) the message board and the discussions in it, but do not post or get involved directly in the discussion or community. It is estimated that the number of lurkers in a specific community is much greater than the actual number of registered and active members.

While message boards vary greatly in most of their characteristics - be it the total number of members on the board site, their age, gender, language, the theme or main topics of the board, the site design, or even the board software - there are interesting aspects of the discussions taking place on these boards that arise from an aggregated view of all of the discussions generated on them. As evident from classifications performed by BoardTracker on message boards40, there are boards on virtually every subject in every niche. While some bias exists towards technology and towards recreation, there are board communities on even the most esoteric subjects. Another characteristic of message boards is their true global nature, whether in international coverage, languages used, age ranges or other demographic characteristics. Coverage of boards in different countries is of course influenced by the degree of penetration of the Internet in those countries. In some cases, however, cultural characteristics influence the popularity of using message boards in some countries.

40 http://boardtracker.com/all/1/1 (URL last accessed 2009-06-09)
As an example, the usage of message boards in the Chinese market is relatively higher than in the average European country. Estimates of the total number of message board members show the true multinational dimension of message boards. Altogether, it is estimated that over 450 million people are registered members of message boards in the boardscape. Figure 5.9 shows a summary of some geographical statistics gathered by BoardTracker. The total number of discussions accumulated in the boardscape is estimated to be over 6 billion, with over 100 billion posts. As of January 2009, BoardTracker had indexed over 60 million threads (approximately a billion posts) across about 40,000 forums and sub-forums worldwide. In a single day in 2009, the number of posts indexed by BoardTracker was 4 million, corresponding to a yearly total of nearly 1.5 billion posts (note that the current coverage of BoardTracker is estimated at about 4% of the total boardscape).
Fig. 5.9. Board members distribution worldwide
Big-Boards.com41, a statistics site for large message boards that covers about 1,800 sites, reports over 158 million registered accounts with 6.7 billion posts on those board sites. It is again important to note that Big-Boards only lists forums with over 500,000 posts; as well as the many more medium to small-sized forums in the 'long tail', there are also other large message board sites not listed in this service, including Yahoo! forums, MySpace forums, the ezboard forum network and others.

One key factor for a discussion is its length, i.e. how many exchanges are made in a thread. A single person starts a thread in a forum, and others in the forum will (in most cases) reply to that thread. Some threads get no replies: for example, because there is no call for a reply; there is no need for one (on board announcements, for example); the thread is looking for information that others cannot provide; the topic is locked; or for other reasons. Some threads, however, generate hundreds or thousands of replies. There are even extreme cases where the number of replies made by members of the forum exceeds 100,000.
41 http://www.big-boards.com/ (URL last accessed 2009-06-09)
Figure 5.10 shows a histogram of the number of threads with specific numbers of replies. As can be seen, the biggest group in this histogram is the set of threads without replies. However, it still only accounts for 13% of the total number of threads. This means that 87% of threads on message boards get at least one reply. It is obvious then that message boards are not a lonely place, unlike many other internet media. To emphasise this, we observe that a thread in the boardscape receives about 16 replies on average. About 2.5% of threads get between 51 and 100 replies, less than 1% receive between 101 and 200 replies, and about 0.5% receive more than 200 replies. While not a very high number, it does mean that on an average message board, one in two hundred threads receives about 200 replies or more.
Fig. 5.10. Percentage of threads ordered by number of replies
Fig. 5.11. Search results from Google showing number of posts and authors from boards
Google have implemented some message board parsing algorithms to determine how many posts are on a thread, how many users posted on that thread and when the last post was made. This can be seen in the search result for ‘irish pubs boards.ie’ shown in Figure 5.11. It is not complete, and probably relies on identifying certain HTML structures for non-Google discussion sites, e.g. there is a blog discussion and a forum thread in the middle of the results that do not display the total posts or commenters. However, this is moving towards the Semantic Web vision of providing more metadata about discussions on the Web to help you in finding more relevant information. Google’s ‘Rich Snippets’ effort (allowing webmasters to mark up their content with microformats and RDFa) is another move in this direction.
5.4.3 Social networks on message boards

Social networks have grown in popularity recently (more on this later), networking both social acquaintances and professional associates. Many social networking sites have incorporated community discussion features such as message boards. Rather than add a message board to a social network, we can also take advantage of the large number of message boards available to create parallel social networks. Some social websites are using FOAF to export social networking data from user profiles. However, for existing message board communities, incentives for users to provide this semantically-rich content are necessary to aid in the integration of these communities on the Semantic Web. One approach that can be used in combination with FOAF and other RDF export functionality is the development of a social networking component that provides a graphical view of the bidirectional links between user profiles (and parallel FOAF representations) to create a 'friend' connection.

In a message board-based community, profile information on each user is mainly gathered at registration through the use of required fields that must be completed by a user before an account can be fully registered. This can include interests, work details, and so on. Such information can form the basis of a FOAF profile for a particular user, assuming that they have made the information publicly available (an option can be added that would allow the automatic creation of a FOAF file from their profile if enabled by a user). A useful feature of some community-based message board systems is the friends list or 'buddy list'. This allows users to see when their friends are online, or to send all of their friends a private message at once. Each friends list is normally private to a particular user, but by allowing users to make a portion or the entire list publicly accessible, the public friends can be used as part of a FOAF export system. A FOAF exporter for the vBulletin message board system called
vBFOAF42 has been developed, and a similar module has also been developed for the phpBB system43. The user ID or hashed e-mail address can be used to create a unique URI for a FOAF profile, and the interest keywords can be mapped to URIs for corresponding categories in the Open Directory Project44 (ODP) taxonomy or appropriate resources in DBpedia. To encourage the installation of the FOAF exporter, corresponding social network visualisation modules were created for both vBulletin45 and phpBB46. The social networking module for vBulletin (called vBFriends) is shown in Figure 5.12. The friends display mode for a particular user ID shows who is linked to and from that user by analysing the user IDs harvested from all friends lists (similar to how the now-defunct Plink service dealt with foaf:knows entries from FOAF files). In 2007, Jelsoft Inc. provided integrated social networking functionality in their flagship message board product vBulletin.
Fig. 5.12. Original social networking module for vBulletin
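As a rough sketch of the kind of RDF such an exporter could emit (the property choices and URIs here are illustrative and do not reproduce vBFOAF's exact output), a board member with one public interest and one public 'buddy' might be described as follows:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .

# A board member, identified via their user ID; the e-mail address itself is
# not published, only a hash of it
<http://forum.example.org/foaf.php?u=42#person> a foaf:Person ;
    foaf:nick "boardfan42" ;
    foaf:mbox_sha1sum "da39a3ee5e6b4b0d3255bfef95601890afd80709" ;   # placeholder hash value
    foaf:topic_interest <http://dbpedia.org/resource/Anime> ;         # interest keyword mapped to a DBpedia resource
    foaf:holdsAccount <http://forum.example.org/member.php?u=42> ;
    foaf:knows <http://forum.example.org/foaf.php?u=57#person> .      # from the public part of the buddy list

# The corresponding user account on the board itself
<http://forum.example.org/member.php?u=42> a sioc:User ;
    sioc:creator_of <http://forum.example.org/showthread.php?t=1234> .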
The implicit social networking information that can be derived from message board interactions and explicitly-defined buddy lists can be used to create a social network in reverse: a boardscape with semantic elements that spans across all message boards, linking users, forums and posts. Having such a structure enables us to provide solutions to the limitations mentioned earlier with message board islands. A unified login, or links between various user accounts and the person who owns them, can be provided via the boardscape - this has the associated advantage of being able to connect all of a person's post content together (and that of their friends). Related topics can also be linked by topic tag, hierarchical category or directly by distributed reply-type hyperlinks (similar to trackbacks). By leveraging all of this content, as well as social networks with FOAF, we can envision a first definition of distributed social networking and online communities that we will later detail in Chapter 11.

42 http://www.vbulletin.org/forum/showthread.php?t=101472 (URL last accessed 2009-06-09)
43 http://tinyurl.com/kl4nuf (URL last accessed 2009-06-09)
44 http://www.dmoz.org/ (URL last accessed 2009-06-09)
45 http://www.vbulletin.org/forum/showthread.php?t=101470 (URL last accessed 2009-06-09)
46 http://www.phpbb.com/community/viewtopic.php?f=16&t=264201 (URL last accessed 2009-06-09)
Fig. 5.13. Sample of the boardscape and the interconnected nature of its users and posts
Figure 5.13 shows a partial view of the boardscape, and some of the connections that are contained therein. A person holds many user accounts on different sites, and content is created about similar topics across various message boards. Users will know other users in their social network, either on the same site or across a number of sites. A conversation may begin on one message board site, but eventually lead to and end up on a different message board elsewhere. By creating these connections between the users and posts on boards, we enable many interesting possibilities. We shall later discuss how SIOC can be used to represent the content on message boards and other social websites, and how this can be combined with social networking and personal profile information expressed in FOAF.
Pidgin Technologies are the developers of a service called Klostu47, a site that connects message boards from around the world through a central access portal and unified login system. Klostu allows one to find thousands of message boards in one place: people can make friends via their social networking functionality or browse and search through millions of forum topics. Klostu are also releasing modules for various message board systems, allowing board owners to integrate the Klostu single sign-on system into their own sites.
5.5 Mailing lists and IRC

A large number of systems preceding the current Web are still deployed and widely used on the Internet. E-mail is used for exchanging messages and files in an asynchronous way, Usenet is still used to exchange messages, and IRC is used for synchronous chat.

E-mail is still the most prevalent asynchronous one-to-many communication medium on the Internet. Mailing lists provide a quick method to set up communications features for an online community. They were also one of the first methods used to set up and support a closed-group online community. Unfortunately, e-mail and mailing lists can still be subject to abuse (e.g. through mail bombs, spam, or other unsolicited mail). Mailing lists still occupy a huge segment of online discussions, and along with the growth of the Web, mailing lists have moved towards web-based mechanisms and online archives, making them accessible to a wider audience. Although e-mail's main protocols are SMTP (for transport) and POP3 and IMAP4 (for access), and the message format is text-based (RFC 82248), the contents of mailing lists are also being made available on the Web in HTML format. For example, Yahoo! Groups (formerly eGroups) allows the creation of private or public community mailing lists, with messages either browsable via the Web or sent via individual or digest-type e-mails. Archives of mailing lists hosted on individual servers are often made available online in HTML, using tools such as GNU Mailman or MHonArc. Some mailing lists, such as DBWorld, already have message headers defined to include annotations in semi-structured format, e.g. metadata descriptions about calls for papers.

To capture this large amount of legacy data exchanged in online communities in a semantic form, these systems and protocols need to be considered for translation to the Semantic Web. In contrast to web-based systems, where we just need to translate the data, we may need to employ protocol wrappers to move from legacy protocols to the Semantic Web. For example, for e-mail, we may need to translate the data representation format from RFC 822 to RDF.
47 http://www.klostu.com/ (URL last accessed 2009-06-09)
48 http://www.ietf.org/rfc/rfc822.txt (URL last accessed 2009-06-09)
The SWAML (Semantic Web Archive of Mailing Lists) project (Fernandez et al. 2007) is an exporter for mailing list content in Semantic Web format. SWAML reads a collection of e-mail messages stored in a Unix-type mailbox (from a mailing list compatible with RFC 4155) and generates an RDF description of it. It is written in Python, using SIOC as the main ontology to represent a mailing list in RDF, and is also available as a Debian package. SWAML fulfils a much-needed requirement for the Semantic Web: to be able to refer to semantic versions of e-mail messages and their properties using resource URIs. By reusing the SIOC vocabulary for describing online discussions, SWAML allows users of SIOC to refer to e-mail messages from other discussions taking place on SIOC-enabled forums, blogs, etc., such that distributed conversations can eventually occur across these discussion media. Also, by providing e-mail messages in SIOC format, SWAML provides a rich source of data from mailing lists for use in SIOC applications.
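As a rough indication of what such an export contains (the URIs and the exact set of properties below are illustrative and simplified relative to SWAML's real output), a single message from a list archive might be described as:

@prefix sioc: <http://rdfs.org/sioc/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# The mailing list itself, modelled as a SIOC forum
<http://lists.example.org/semweb-list> a sioc:Forum ;
    dcterms:title "Example Semantic Web discussion list" .

# One message from the list archive
<http://lists.example.org/semweb-list/2009-Jun/0042.html> a sioc:Post ;
    dcterms:title "Re: Exporting archives to RDF" ;
    dcterms:created "2009-06-09T14:30:00Z" ;
    sioc:has_container <http://lists.example.org/semweb-list> ;
    sioc:has_creator <http://lists.example.org/semweb-list/users/alice> ;
    sioc:reply_of <http://lists.example.org/semweb-list/2009-Jun/0040.html> .

# The sender's account on the list, linked to a FOAF person
<http://lists.example.org/semweb-list/users/alice> a sioc:User ;
    sioc:account_of <http://example.org/people/alice#me> .

<http://example.org/people/alice#me> a foaf:Person ;
    foaf:name "Alice" .

Because each message gets its own URI, posts in blogs, forums or other mailing lists can point at it directly, which is what enables the distributed conversations mentioned above.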
Fig. 5.14. The Buxon browser for viewing semantic mailing list data
The SWAML creators have also developed their own applications that work with SIOC mailing list data. The Buxon browser (see Figure 5.14), developed in PyGTK for browsing SIOC forums (in this case mainly mailing lists), is an interesting example of a program using SIOC message data that can come from one or many sources (e.g. from a 'virtual forum' or container of posts from multiple sites and systems). For example, the 'archive.py' example script packaged with
python-libgmail can be used to download and convert an inbox from a Google Gmail account to Unix mailbox format, and then 'swaml.py' can be run on that mailbox to convert it to SIOC RDF. The resulting RDF can then be browsed with the Buxon application. Another interesting SIOC-enabled application is the Mailing List Explorer49. MLE allows the exploration of mailing lists via query, timeline view, etc. It provides RDF representations (including SIOC metadata) for any valid W3C public mailing list archive. A Java-based application for generating SIOC data from mailing list archives has also been developed50, leveraging RSS and Atom feeds from web-based message archives. The application uses the RDFReactor51 library for creating RDF APIs.

Finally, IRC (Internet Relay Chat) can also benefit from Semantic Web technologies. The sioclog52 project aims to record IRC conversations in a machine-readable way using RDF (in particular, by using the SIOC vocabulary). Hence, IRC conversations can be browsed using the same tools as for mailing lists, for example, the Buxon browser described before. Through the MicroTurtle IRC bot53, users can also define links to their FOAF profiles from IRC, thereby identifying IRC content as theirs and lifting this content into the world of Linked Data.
49 http://sw.joanneum.at/mle/xplore.php (URL last accessed 2009-06-09)
50 http://tinyurl.com/mla2sioc (URL last accessed 2009-06-09)
51 http://ontoware.org/projects/rdfreactor/ (URL last accessed 2009-06-09)
52 http://github.com/tuukka/sioclog/tree (URL last accessed 2009-07-07)
53 http://buzzword.org.uk/2009/mttlbot/#bot (URL last accessed 2009-07-07)
6 Knowledge and information sharing

'Universal access to all knowledge can be one of our greatest achievements', according to Brewster Kahle. There have been various efforts to categorise world knowledge and to leverage this using semantic technologies, e.g. through Wikipedia and the DBpedia, and with Cycorp (developers of the Cyc knowledge base of common-sense knowledge) joining the Linking Open Data initiative via their OpenCyc project. There have also been some successes with question-answering systems that mine and retrieve specific pieces of knowledge. These may be two extreme cases, but the popularity of social websites for organising knowledge shows that the answer lies somewhere in the middle at a sweet spot: some organisation, leveraged via semantics, but not too much. Wikipedia and the DBpedia are a positive step in this direction, and the question-answering approach can still be brought closer to the community-created knowledge approach of the Wikipedia.
6.1 Wikis

Many people are familiar with the Wikipedia1, but fewer know exactly what a wiki is. A wiki is a website which allows users to edit content through the same interface they use to browse it, usually a web browser, although some desktop-based wikis also exist. This facilitates collaborative authoring in a community, especially since editing a wiki does not require advanced technical skills.

A wiki consists of a set of web pages which can be connected together by links. Users can create new pages (e.g. if one for a certain topic does not exist), and they can also change (or sometimes delete) existing ones, even those created by other members. The WikiWikiWeb was the first wiki, established by Ward Cunningham in March 1995, and the name is based on the Hawaiian term wiki, meaning 'quick', 'fast', or 'to hasten'.

Wikis often act as informational resources, like a reference manual, encyclopaedia, or handbook. They amount to a group of web pages where users can add content and others can edit it, relying on cooperation, checks and balances among their members, and a belief in the sharing of ideas. This creates a community effort in resource and information management, disseminating the 'voice' amongst many instead of concentrating it upon a few people. Therefore, contrary to how blogs reflect the opinions of a pre-defined set of writers (or a single author), wikis use an open approach whereby anyone can contribute to the value of the community.

1 http://www.wikipedia.org/ (URL last accessed 2009-06-09)
Changing a wiki page is quite straightforward. For example, on the MediaWiki system employed by the Wikipedia, you simply click on the 'Edit' tab in an article, type in the new text, and then click on 'Save Page'. Wikis employ a simple markup system for linking to other articles or for formatting text, e.g. [[Ireland]] will provide a link to the article on Ireland, and putting three single quotes around a word (i.e. '''important''') will format the word in bold type.

One of the most well-known and highly-used wikis is the Wikipedia free-access online encyclopaedia. Wikis are also being used for free dictionaries, book repositories, event organisation, writing research papers, project proposals and even software development or documentation. In this way, the openness of wiki-based writing can be seen as a natural follow-up to the openness of source-code modification. Wikis have become increasingly used in enterprise environments for collaborative purposes: research projects, papers and proposals, coordinating meetings, etc. SocialText2 produced the first commercial open-source wiki solution, and many companies now use wikis as one of their main intranet collaboration tools. However, wikis may break some existing hierarchical barriers in organisations (due to a lack of workflow mechanisms, open editing by anyone with access, etc.), which means that new approaches towards information sharing must be taken into account when implementing wiki solutions. We shall discuss this more in the chapter dedicated to social software in enterprise environments.

There are hundreds of wiki software systems now available, ranging from MediaWiki3, the software used on the Wikimedia family of sites, and Eugene Eric Kim's PurpleWiki4, where fine-grained elements on a wiki page are referenced by purple numbers (a concept of Doug Engelbart), to Alex Schröder's OddMuse5, a single Perl script wiki install, and WikidPad6, a single-user desktop-based wiki for notes. Many are open source, free, and will often run on multiple operating systems. The differences between wikis are usually quite small but can include the development language used (Java, PHP, Python, Perl, Ruby, etc.), the database required (MySQL, flat files, etc.), whether attachment file uploading is allowed or not, spam prevention mechanisms, page access controls, RSS feeds, etc.

Personal wikis are often used on desktop systems for personal information management. Such wikis need to be very simple, very fast, and very usable (for 'note-taking on steroids'). As well as WikidPad, popular personal wikis include Tomboy and VoodooPad. Typical uses are for organising notes, links, to-do lists, and appointments.
2 http://www.socialtext.com/ (URL last accessed 2009-06-09)
3 http://www.mediawiki.org/ (URL last accessed 2009-06-09)
4 http://purplewiki.blueoxen.net/ (URL last accessed 2009-06-09)
5 http://www.oddmuse.org/ (URL last accessed 2009-06-09)
6 http://wikidpad.sourceforge.net/ (URL last accessed 2009-06-09)
6.1.1 The Wikipedia

The Wikipedia project consists of over 250 different wikis, corresponding to a variety of languages. The English-language one is currently the biggest, with nearly three million pages, but there are wikis in languages ranging from Gaelic to Chinese. A typical wiki page will have two buttons of interest: 'Edit' and 'History'. Normally, anyone can edit an existing wiki article, and if the article does not exist on a particular topic, anyone can create it. If someone messes up an article (either deliberately or erroneously), there is a revision history - as in most wiki engines - so that the contents can be reverted or fixed by the community. Thus, while there is no pre-defined hierarchy in most wikis, content is auto-regulated thanks to an emergent consensus within the community, ideally in a democratic way (for example, most wikis include discussion pages where people can discuss sensitive topics).

There is a certain amount of ego-related motivation in contributing to a wiki like the Wikipedia. People like to show that they know things, to fix mistakes and fill in gaps in underdeveloped articles (stubs), and to have a permanent record of what they have contributed via their registered account. By providing a template structure to input facts about certain things (towns, people, etc.), wikis also facilitate this user drive to populate wikis with information. As well as the Wikipedia, the Wikimedia Foundation has a family of sites including the Wiktionary and Wikibooks. Wikibooks also features some annotated texts; indeed, there is much public domain material available from free book sites such as Project Gutenberg7 that is ripe for annotation through such efforts.
6.1.2 Semantic wikis

We discussed semantic blogging in the previous chapter, but it is not just blog posts that are being enhanced by structured metadata and semantics - this is happening in many other Social Web application areas. Wikis such as the Wikipedia have contained structured metadata in the form of templates for some time now, and at least twenty 'semantic wikis'8 have also appeared to address a growing need for more structure in wikis. There has been a move from wikis as editors for web pages to semantic wikis that can act as sophisticated annotation systems (Figure 6.1). The sweet spot lies somewhere in the middle: some structure and annotations, but not so much that users would be discouraged from providing these semantics.

In his presentation on 'The Relationship Between Web 2.0 and the Semantic Web'9, Mark Greaves (from Vulcan Inc. and formerly with DARPA) said that semantic wikis are a promising answer to various issues associated with semantic authoring, by reducing the investment of time required for training on an annotation tool and by providing incentives required for users to contribute semantic markup (attribution, visibility and reuse by others).

7 http://www.gutenberg.org/ (URL last accessed 2009-06-09)
8 http://semanticweb.org/index.php/Semantic_Wiki_State_Of_The_Art (URL last accessed 2009-06-09)
9 http://tinyurl.com/markgreaves (URL last accessed 2009-06-09)
Fig. 6.1. The move from wikis to semantic wikis
Typical wikis usually enable the description of resources in natural language. By additionally allowing the expression of knowledge in a structured way, wikis can provide advantages in querying, managing and reusing information. Wikis such as the Wikipedia have contained structured metadata in the form of templates for some time now (to provide a consistent look to the content placed within article texts), but there is still a growing need for more structure in wikis (e.g. the Wikipedia page about Ross Mayfield links to about 25 pages, but it is not possible to answer a simple question such as 'find me all the organisations that Ross has worked with or for'). Templates can also be used to provide a structure for entering data, so that it is easy to extract metadata about the topic of an article (e.g. from a template field called 'population' in an article about London).

Semantic wikis bring this to the next level by allowing users to create semantic annotations anywhere within a wiki article's text for the purposes of structured access and finer-grained searches, inline querying, and external information reuse. Generally, those annotations are designed to create instances of domain ontologies and their related properties (either explicit ontologies or ontologies that will emerge from the usage of the wiki itself), whereas other wikis use semantic annotations to provide advanced metadata regarding wiki pages. Obviously, both layers of annotation can be combined to provide advanced representation capabilities, as shown in Figure 6.2. For example, the Semantic MediaWiki system allows people to add structured data into pages, such as typed links and attributes (relationships and number / text properties respectively). By allowing people to add such extra metadata, the system can then show related pages (either through common relationships or properties, or by embedding search queries in pages). These enhancements are powered by the metadata that people enter (aided by semantic wiki engines).

A semantic wiki should have an underlying model of the knowledge described in its pages, allowing one to capture or identify further information about the pages (metadata) and their relations. The knowledge model should be available in a formal language such as RDFS or OWL, so that machines can (at least partially)
process and reason on it. For example, a semantic wiki would be able to capture that an ‘apple’ article is a ‘fruit’ (through an inheritance relationship) and present you with further fruits when you look at the apple article. Articles will have a combination of semantic data about the page itself (the structure) and the object it is talking about (the content), as shown in Figure 6.2.
Fig. 6.2. The connection between structural (page) metadata and content (relating to the described concept) metadata
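The apple and fruit example above can be captured with a tiny RDFS fragment (class and page URIs are illustrative), from which a reasoner can infer that anything described as an apple is also a fruit:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://wiki.example.org/ontology/> .

# The class hierarchy behind the wiki pages (illustrative URIs)
ex:Apple rdfs:subClassOf ex:Fruit .

# A page annotated as describing an apple
<http://wiki.example.org/page/Braeburn> a ex:Apple .

# An RDFS-aware wiki can infer that the Braeburn page also describes a fruit,
# and so can list it alongside other fruit articles.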
Some semantic wikis also provide what is called inline querying. For example, in SemperWiki10, questions such as '?page dc:creator EyalOren' (find me all pages where the creator is Eyal Oren) or '?s dc:subject 'todo'' (show me all my to-do items, as shown in Figure 6.3) are processed as a query when the page is viewed, and the results are shown in the wiki page itself (Oren et al. 2006). Also, when defining some relationships and attributes for a particular article (e.g. 'foaf:gender male'), other articles with matching properties can be displayed along with the article. Moreover, some wikis such as IkeWiki (Schaffert 2006) feature reasoning capabilities, for example, retrieving all instances of foaf:Person when querying for a list of all foaf:Agent(s), since the latter class subsumes the former in the FOAF ontology.
10 http://www.eyaloren.org/semperwiki.html (URL last accessed 2009-06-09)
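For readers more used to SPARQL, the second of these inline queries corresponds roughly to the following pattern (a sketch only; SemperWiki uses the shorthand shown above rather than full SPARQL):

PREFIX dc: <http://purl.org/dc/elements/1.1/>

# 'Show me all my to-do items', i.e. all pages whose dc:subject is 'todo'
SELECT ?s
WHERE { ?s dc:subject "todo" . }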
Fig. 6.3. Performing inline queries for to-do items using the SemperWiki semantic wiki system
Finally, just as in the semantic blogging scenario, wikis can enable the Web to be used as a clipboard, by allowing readers to drag structured information from wiki pages into other applications (for example, geographic data about locations on a wiki page could be used to annotate information on an event or a person in one’s calendar application or address book software respectively).
6.1.2.1 Semantic MediaWiki
One of the most popular semantic wikis is Semantic MediaWiki (Krötzsch et al. 2006), an extension to the popular MediaWiki system. Semantic MediaWiki allows for the expression of semantic data describing the connection from one page to another, and attributes or data relating to a particular page. Let us take an example of providing structured access to information via a semantic wiki. There is a Wikipedia page about JK Rowling that has a link to ‘Harry Potter and the Deathly Hallows’ (and to other books that she has written), to Edinburgh because she lives there, and to Scholastic Press, her publisher. In a traditional wiki, you cannot perform fine-grained searches on the Wikipedia data set such as ‘show me all the books written by JK Rowling’, or ‘show me all authors that live in the UK’, or ‘what authors are signed to Scholastic’, because the type of links (i.e. the relationship type) between wiki pages are not defined. In Semantic MediaWiki, you can do this by linking with [[author of::Harry Potter and the Deathly Hallows]] rather than just the name of the novel. There may also be some
attribute such as [[birthdate:=1965-07-31]] which is defined in the JK Rowling article. Such attributes can be used for answering questions like 'show me authors over the age of 40' or for sorting articles, since this wiki syntax is translated into RDF annotations when saving the wiki page. Moreover, page categories are used to model the related class for the created instance. Indeed, in this tool, as in most semantic wikis that aim to model ontology instances, not only do the annotations make the link types between pages explicit, but they also make explicit the relationships between the concepts referred to in these wiki pages, thus bridging the gap from documents plus hyperlinks to concepts plus relationships. For instance, in the previous example, the annotation will not model that 'the page about JK Rowling is the author of the page about Harry Potter and the Deathly Hallows' but rather that 'the person JK Rowling is the author of the novel Harry Potter and the Deathly Hallows'.

Since Semantic MediaWiki is completely open regarding the terms used for annotating content, the underlying data model, i.e. the different ontologies used to model the instances, evolves according to user behaviour. For example, each 'Category' page leads to a new class in the ontology. However, extracted data may be subject to heterogeneity problems. For instance, some users will use [[author of::somebookname]] while others will prefer [[has written::somebookname]], leading to problems when querying data since the semantics of the two relationships are different despite the common ideas underlying them. Other wikis such as OntoWiki (Auer et al. 2006), IkeWiki or UfoWiki (Passant and Laublet 2008a) can assist the user when modelling semantic annotations, in order to avoid those heterogeneity issues and to provide data that is based on pre-defined ontologies so that it can be more easily and efficiently re-used for querying and navigating the wiki.
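To make the translation concrete, the two annotations above would produce RDF along the following lines when the page is saved (the namespace and property URIs shown are illustrative; Semantic MediaWiki mints its own URIs for user-defined properties):

@prefix swp: <http://example.org/wiki/property/> .   # illustrative namespace for wiki-defined properties
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# From [[author of::Harry Potter and the Deathly Hallows]] on the JK Rowling page
<http://example.org/wiki/JK_Rowling>
    swp:author_of <http://example.org/wiki/Harry_Potter_and_the_Deathly_Hallows> ;
    # From [[birthdate:=1965-07-31]]
    swp:birthdate "1965-07-31"^^xsd:date .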
6.1.2.2 OntoWiki

OntoWiki11 is a semantic wiki developed by the AKSW research group at the University of Leipzig that also acts as an agile ontology editor and distributed knowledge engineering application. Unlike other semantic wikis, OntoWiki relies more on form-based mechanisms for the input of structured data rather than using syntax-based or markup-based inputs. One of the advantages of such an approach is that complicated syntaxes for representing structured knowledge can be hidden from wiki users and therefore syntax errors can be avoided.

OntoWiki visually presents a knowledge base as an information map, with different views on available instance data. It aims to enable intuitive authoring of semantic content, and also features an inline editing mode for editing RDF content, similar to WYSIWYG for text documents. As with most wikis, it fosters social collaboration aspects by keeping track of changes and allowing users to discuss any part of the knowledge base, but OntoWiki also enables users to rate and measure the popularity of content, thereby honouring the activities of users. OntoWiki enhances the browsing and retrieval experience by offering semantically-enhanced search mechanisms. Such techniques can decrease the entrance barrier for domain experts and project members to collaborate using semantic technologies. OntoWiki is open source and is based on PHP and MySQL.

11 http://ontowiki.net/ (URL last accessed 2009-07-16)
6.1.3 DBpedia

While it is not a wiki per se, it is worth mentioning DBpedia in relation to wikis and semantic wikis. DBpedia12 provides an RDF export of the Wikipedia and can be seen as one of the core components of the Linking Open Data project. It has been created by exporting the 'infoboxes' (i.e. metadata entered on various articles for pre-defined template structures) from various language versions of Wikipedia and linking them together. By weaving Wikipedia articles and related objects into the Semantic Web, DBpedia defines URIs for many concepts so that people can use them in their semantic annotations. For example, one can define that they are interested in the Semantic Web and located in France by writing these two triples, using the corresponding DBpedia resource URIs as objects:

:me foaf:topic_interest <http://dbpedia.org/resource/Semantic_Web> .
:me geonames:locatedIn <http://dbpedia.org/resource/France> .
The DBpedia data set is freely available for download and it also provides a public SPARQL endpoint so that anyone can interact with it for advanced querying capabilities. For example, to identify all actors born in New York that starred in a movie directed by Quentin Tarantino, one would usually have to browse tens of web pages. With the DBpedia, a single SPARQL query can be used to find the answer to that question. The following query can be posed at the DBpedia SPARQL endpoint13 to identify the relevant movies and actors, leading to the answers shown in Figure 6.4.

SELECT DISTINCT ?movie ?person
WHERE {
  ?movie dbpedia-owl:director <http://dbpedia.org/resource/Quentin_Tarantino> .
  ?movie dbpedia-owl:starring ?person .
  ?person dbpedia-owl:birthplace <http://dbpedia.org/resource/New_York_City> .
}
12 http://dbpedia.org/ (URL last accessed 2009-06-09)
13 http://dbpedia.org/sparql (URL last accessed 2009-07-16)
Fig. 6.4. Results of a SPARQL query for movie information from DBpedia
An interesting application related to DBpedia is DBpedia Mobile14, a 'location-centric DBpedia client application for mobile devices' that consists of a map interface, the Marbles Linked Data Browser15 and a GPS-enabled launcher application. The application displays nearby DBpedia resources (from a set of 300,000) based on a user's geolocation as recorded through his or her mobile device. Efforts are also ongoing towards allowing DBpedia to feed new content back into the Wikipedia16 (e.g. by suggesting new values for infoboxes, or by contributing back new maps created via DBpedia Mobile). Other applications can make use of DBpedia to provide third-party services; for example, Faviki17 allows users to bookmark web pages using common terms from the Wikipedia (as extracted by DBpedia).
6.1.4 Semantics-based reputation in the Wikipedia

As a global, independent and neutral framework to which we can all contribute content, Wikipedia could serve as the basis for a de-facto global and open reputation system. At the moment, Wikipedia does not provide much information on people's reputations, i.e. those who make changes to articles are not very visible on Wikipedia and are not treated as experts as such. On the Wikipedia website, it is often the case that the contributor who may know the most about an article is not clearly identified in the Wikipedia article as being the foremost expert.

There have been various attempts to establish reputation sites on the Web, e.g. Naymz18, which may help a person to improve their visibility in search engines. However, there is a problem with these sites in that a person's reputation can only be truly reflected online if they regularly contribute to the site and maintain an up-to-date version of their profile with all of their achievements. Another issue is that people who already have a good reputation will most probably not join these sites, perhaps due to time constraints, or if reputation is related to the number of connections or endorsements one has (which may be by invitation).

14 http://wiki.dbpedia.org/DBpediaMobile (URL last accessed 2009-06-09)
15 http://marbles.sourceforge.net/ (URL last accessed 2009-07-16)
16 http://videolectures.net/www09_kobilarov_dbpldh/ (URL last accessed 2009-07-07)
17 http://faviki.com/ (URL last accessed 2009-07-14)
18 http://www.naymz.com/ (URL last accessed 2009-06-09)
Wikipedia can be improved by the addition of a global reputation system with embedded semantics. This could be achieved by placing greater emphasis on the discussion pages in the Wikipedia, and by introducing threaded structures in these pages from which expertise would emerge. For example, experts could emerge from their actions in discussion pages when their suggested changes have been accepted, highlighting those who made the best changes on the article page itself. If we include microcontent such as microformats or RDFa in these pages, we solve two problems at one stroke: (1) Wikipedia benefits from a richer reputation framework where people can be motivated to add contextual semantic information to make their content more searchable (directly benefiting their own reputations), and (2) this can also move the Semantic Web forward, by solving the issue of who will be motivated to add the semantics to the Semantic Web and why. This information can also be used to power services like Garlik's QDOS19 that aim to measure people's 'digital status' or estimated online rating. Related work on Wikipedia and trust or reputation measures has been described by (McGuinness et al. 2006) and (Adler and de Alfaro 2007).
6.2 Other knowledge services leveraging semantics

We shall now discuss some other knowledge services that are benefiting from their usage of semantic technologies, including the Twine service from Radar Networks, the Internet Archive, Freebase, and OpenLink Data Spaces.
6.2.1 Twine

Radar Networks is one of a number of startup companies that is practically applying Semantic Web technologies to social software applications. Radar's flagship product is called Twine, and the company is led by CEO Nova Spivack20. In 2003, Radar developed a desktop-based semantic tool called 'Personal Radar', a personal assistant for knowledge sharing. It was effectively a Java-based P2P version of Twine, powered by RDF and with some appealing visualisations. At the time, most venture capitalists were not interested, but Radar received angel funding from Vulcan Capital (whose founder, Paul Allen, is said to believe that adding structure to the Web is inevitable).
19 http://qdos.com/ (URL last accessed 2009-06-09)
20 http://tinyurl.com/novaspivack (URL last accessed 2009-06-09)
The Twine service allows people to share what they know and can be thought of as a knowledge networking application that allows users to share, organise, and find information with people they trust. People create and join ‘twines’ (community containers) around certain topics of interest, and items (documents, bookmarks, media files, etc., that can be commented on) are posted to these twines through a variety of methods. Twine has a number of novel and useful functions that elevate it beyond the social bookmarking sites to which it has been compared, including an extensive choice of twineable item types, twined item customisation (‘add detail’ allows user-chosen metadata fields to be attached to an item) and the ‘e-mail to a twine’ feature (enabling twines to be populated through messages sent to a custom e-mail address). The focus of Twine is these interests. Where Facebook is often used for managing one’s social relationships and LinkedIn is used for connections that are related to one’s career, Twine can be used for organising one’s interests. Spivack also calls this ‘interest networking’ as opposed to social networking. With Twine, one can share knowledge, track interests with feeds, carry out information management in groups or communities, build or participate in communities around one’s interests, and collaborate with others. The key activities are ‘organise, share and discover’. Twine allows people to find things that might be of interest to them based on what they are doing. The key ‘secret sauce’ according to Spivack is that everything in Twine is generated from an ontology. Even the site itself - user interface elements, sidebars, navigation bar, buttons, etc. - come from an application-definition ontology. Similarly, the Twine data is modelled on a custom ontology. However, Twine is not just limited to these internal ontologies, and Radar is beginning the process of bringing in other external ontologies and using them within Twine. At a later stage, they hope to allow people to make their own ontologies (e.g. to express domain-specific content) resulting in the Twine community having a more extensible infrastructure. Twine performs natural language processing on text, mainly providing automatic tagging with semantic capabilities. It has an underlying ontology with a million instances of thousands of concepts to generate these tags (at present, Twine is exposing just some of these). Radar are also working on statistical-analysis and machine-learning approaches for clustering of related content to show people, items and interests that are related to each other (for example, to give information to users such as ‘here are a selection of things that are all about movies you like’). Twine search also has semantic capabilities. For example, bookmarks can be filtered by the companies they are related to, or people can be filtered by the places they are from. Underneath Twine, a lot of research work on scaling has been carried out, but it is not trying to index the entire Web. However, Twine does pull in related objects (e.g. from links in an e-mail), thereby capturing information around the information that you bring in and that you think is important. Twine wants to bring semantics to the masses, and therefore it is not just aiming at Semantic Web enthusiasts but rather at mainstream users. The interface has
to be simple so that someone who knows nothing about structured data or automatic tagging should be able to figure out in a few minutes or even seconds how to use it. Individuals are Twine's first target market, allowing them to author and develop rich semantic content. For example, this could be a professional who has a particular interest in some technical subject that is outside the scope of what they are doing at the moment. However, such a service becomes more valuable when users are connected to other people, if they join groups, thereby giving a richer network effect. The main value proposition for these users is that they can keep track of things they like and people they know, and capture knowledge that they think is important. When groups start using Twine, collective intelligence begins to take place (by leveraging other people who are researching material, finding items, testing, commenting, etc.). It is a type of communal knowledge base similar to other services like Wikia or Freebase. However, unlike many public communal sites, in Twine more than half of the data and activities are private (60%). Therefore privacy and permission control is very important, and it is deeply integrated into the Twine data structures. Since Twine left beta, public twines have become visible to search engines and SEO has been applied to increase the visibility of this content.

Twine is powered by Java, PostgreSQL and WebDAV. Since relational databases are not optimised for the 'shape' of semantic data that is being stored in Twine, the data store had to be tweaked. Twine uses an eight-element tuple store (subject-predicate-object, provenance, time stamp, confidence value, and other statistics about the triple or item itself). Predicate inferencing can be performed across statements for access control, etc.

Some of the feature requests for Twine include import capabilities, interoperability with other applications, and the aforementioned ability to use other ontologies. At the moment, Twine works with e-mail (sending notifications out and allowing the twining of e-mails sent in), RSS (pushing feeds out), and browsers (e.g. for bookmarking). There have been various requests for interoperability with mind maps, various databases, and enterprise applications. Twine currently has a REST API which allows people to make their own add-ons. In terms of data interoperability, semantic data can be obtained from Twine in RDF for reuse elsewhere (by appending '?rdf' to the end of any Twine URL). Having already hardcoded some interoperability with services like Amazon.com and provided import functionality from del.icio.us, Radar are also looking at potential adaptors to other services including Digg, desktop bookmark files, Outlook contacts, Lotus Notes, Exchange and Freebase.

With such a service, there is a requirement for duplication detection. Most people submit similar bookmarks and it is reasonably straightforward to identify these, e.g. when the same item is arrived at through different paths on a website and has different URLs. However, some advanced techniques are required when the content is similar but comes from different locations on the Web.

Referring to our earlier discussion on object-centred sociality in Chapter 3, there is great potential in the community aspects of twines. These twines can act
as ‘social objects’ that will draw people back to the service in a much stronger manner than other social bookmarking sites currently do (in part, this is due to there being a more identifiable home for these objects and also due to the improved commenting facilities that Twine provides). Radar is focussing on advertising as the first revenue stream for Twine. Since Twine has semantic profiles for both users and groups, it can understand and leverage their interests quite effectively. Radar will be pilot testing sponsored content or advertisements in Twine based on these interests. According to Spivack, if something is extremely relevant to your interests, then it is almost as valuable as content (even if it is sponsored). When Radar began work on the Twine application, they also started working on a commercial version of the underlying platform. One of their aims is to allow non-Semantic Web savvy people to build applications that use the Semantic Web without having to do any programming.
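Twine’s internal store is not documented in more detail than the description above, but the general idea of keeping provenance, temporal and confidence data alongside each triple can be illustrated with standard RDF reification. The fragment below is purely a sketch: the ex: namespace, the property names and the values are invented for illustration and do not reflect Twine’s actual schema.

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/ns#> .

# The base triple: a bookmarked item is about the topic 'Semantic Web'.
ex:item42 ex:topic ex:SemanticWeb .

# A reified view of the same statement, carrying the kind of extra
# elements (provenance, time stamp, confidence) that a tuple store
# such as Twine's is described as keeping for every triple.
ex:statement42 a rdf:Statement ;
    rdf:subject     ex:item42 ;
    rdf:predicate   ex:topic ;
    rdf:object      ex:SemanticWeb ;
    dcterms:source  <http://example.org/users/alice> ;    # who asserted it
    dcterms:created "2009-06-09T10:15:00Z" ;               # when it was asserted
    ex:confidence   "0.87" .                                # how reliable it is judged to be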
6.2.2 The Internet Archive

The Internet Archive21 is home to various types of media archived on the Web, from books and web pages to audio and video content. They also host legally-downloadable software titles (e.g. old software that can be reused or replayed via virtual machines or emulators). Media can be reviewed, rated and bookmarked for sharing with other users. Due to the huge amount of both textual and multimedia content on the site, there are many advantages to leveraging semantics and interlinking the media in such a large data source.

According to Brewster Kahle, co-founder of the Internet Archive, one book is approximately 1 MB in size, so all of the books in the US Library of Congress (some 26 million books) would correspond to about 26 TB (with images, that figure would be somewhat larger)22. At present, it costs about $30 to scan a book in the US. For about 10 cents a page, books or microfilm can now be scanned at various centres around the United States and put online. 250,000 books have been scanned in so far and are held in eight online collections. Such books can also be made available to recipients of laptops through the OLPC project23. However, most people like having printed books, so bookmobiles for print-on-demand books are beginning to appear. Such a bookmobile charges just $1 to print and bind a short book. There are a number of issues related to putting audio or recorded sound works online. At best, there are two to three million discs that have been commercially distributed, but the issue with putting these online is in relation to rights. The Internet Archive has 100,000 items in 100 collections. Audio costs about $10 per
21 http://www.archive.org/ (URL last accessed 2009-06-09)
22 http://videolectures.net/iswc07_kahle_uahk/ (URL last accessed 2009-06-09)
23 http://laptop.org/ (URL last accessed 2009-06-09)
disk (roughly one hour) to digitise, so about a third of the price of a book. Rock ‘n’ roll concerts are the most popular category of the Internet Archive audio files (with 40,000 concerts so far); for ‘unlimited storage, unlimited bandwidth, forever, for free’, the Internet Archive offers bands their hosting service if they waive any issues with rights. There are various cultural materials that do not work well in terms of record sales, but there are many people who are very interested in having these published online via services such as the Internet Archive. Video makes up another large portion of the Internet Archive with 55,000 videos in 100 collections. Most people think of Hollywood films in relation to video, but at most there are 150,000 to 200,000 video items that are designed for movie theatres, and almost half of these are from India. Many films are locked up in copyright, and are therefore problematic. The Internet Archive has about 1,000 of these films (out of copyright or otherwise permitted). However, there are many other types of video materials that people want to see: thousands of archival films, advertisements, training films and government films that have been downloaded from the website millions of times. Academics can also put copies of their video lectures online at the Internet Archive. Video costs about $15 per hour of content for digitisation services. There are an estimated 400 channels of ‘original’ television content (ignoring duplicate rebroadcasts), but if you were to record a television channel for one year, it would require about 10 TB of data with a cost of $20,000 for that year. The Television Archive24 team from the Internet Archive have been recording 20 channels from around the world since 2000 (it is currently about a petabyte in size). This corresponds to about 1.5 million hours of TV, but little has been made publicly available due to copyright reasons (apart from video recorded during the week of the 9/11 attacks). The Internet Archive is probably best known for archiving web pages. Their ‘Wayback Machine’ archive25 started in 1996, by taking a snapshot of every accessible page on a website. It is now about 2 PB in size, with over 100 billion pages. Most people use this service to find their old materials again, since most people ‘don’t keep their own materials very well’ according to Kahle (e.g. Yahoo! came to the Internet Archive to get a 10-year-old version of their own homepage). Preservation or how to keep all of these materials available to the public is an important task for the Internet Archive. The Internet Archive in San Francisco has four employees and 1 PB of storage: including the power bill, bandwidth and people costs, their total costs are about $3 million per year; 6 GB bandwidth is used per second; and their storage hardware costs $700,000 for 1 PB. They have a backup of their book and web materials in Alexandria (somewhat unfortunately known for its ancient destroyed library), and also store audio material at the European Archive in Amsterdam. Also, their Open Content Alliance initiative allows
24 http://www.televisionarchive.org/ (URL last accessed 2009-06-09)
25 http://www.archive.org/web/web.php (URL last accessed 2009-06-09)
various people and organisations to come together to create joint collections for all to use. Search is now beginning to make inroads into time-based search, and this is particularly relevant to archives of content with strong temporal aspects like the Internet Archive. For example, one can examine how words and their usage change over time (e.g. ‘marine life’). Semantic Web applications for accessing and searching information in the Internet Archive can help people to deal with the huge onslaught of information on the site. There is a need to take large related subsets of the Internet Archive collections and to help make sense of them for people. Much work has been carried out on both wikis and search (and even on combining these services, e.g. Google SearchWiki26), but according to Brewster Kahle there is a need to ‘add something more to the mix’ to bring more structure to the Internet Archive project. This may involve combining the ease of access and authoring from the wiki world with computer-aided ways to incorporate the structures that we all know are in there. Such methods should be flexible enough so that people can add structure one item at a time or so that computers can be employed to help with this task. For example, in the recent OpenLibrary.org joint initiative27, the idea is to build one web page for every book ever published (not just ones still for sale) to include content, metadata, reviews, etc. The relevant concepts in this project include: creating Semantic Web concepts for authors, works and entities; having wiki-editable data and templates; using a tuple-based database with history; and making it all available in open source (both the data and the Python code). OpenLibrary.org has over 10 million book records, with 250,000 of them containing the full texts.
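OpenLibrary.org publishes its records using its own evolving schema, which is not reproduced here. Purely as a hedged sketch of the ‘one page (and one resource) per book’ idea, a record could be expressed with Dublin Core terms and FOAF as follows; the URIs and the ex:Edition class are assumptions for illustration only.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix ex:      <http://example.org/openlibrary/> .

# One resource per published edition of a book (identifiers are illustrative).
ex:edition123 a ex:Edition ;
    dcterms:title     "The Social Semantic Web" ;
    dcterms:creator   ex:author45 ;
    dcterms:issued    "2009" ;
    dcterms:publisher "Springer" .

ex:author45 a foaf:Person ;
    foaf:name "John G. Breslin" .

Keeping authors and works as first-class resources in this way is what allows wiki-style edits to a single author record to propagate to every book page that refers to it.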
6.2.3 Powerset

Natural language search company Powerset is another in a generation of startups that are employing semantic technologies to augment the way that people can access knowledge and information. The company’s first product was a semantic search and discovery tool for the Wikipedia social website, and Powerset was acquired by Microsoft in 2008. Barney Pell, chief technical officer of Powerset, believes that natural language can help with the realisation of the Semantic Web28, and especially with both sides of its chicken-and-egg problem. On one side, annotations can be created from unstructured text, and ontologies can be generated, mapped and linked. On the other side, natural language search can consume Semantic Web
26 http://tinyurl.com/5vyt36 (URL last accessed 2009-06-09)
27 http://openlibrary.org/ (URL last accessed 2009-06-09)
28 http://videolectures.net/iswc07_pell_nlpsw/ (URL last accessed 2009-06-09)
information, and can expose Semantic Web services in response to natural language queries. The self-stated goal of Powerset is to enable people to interact with information and services as naturally and effectively as possible, by combining natural language and scalable search technology. Natural language search interprets the Web, indexes it, interprets queries, searches and matches. Historically, search has matched query intents with document intents, and a change in the document model has driven the latest innovations. The first is proximity: there has been a shift from documents being a ‘bag of keywords’ to becoming a ‘vector of keywords’. The second is in relation to anchor text: adding off-page text to search is next. Documents are loaded with linguistic structure that is mostly discarded and ignored (due to cost and complexity), but it has immense value. A document’s intent is actually encoded in this linguistic structure, from which Powerset’s semantic indexer extracts meaning. Converging trends that are enabling this natural language search are emerging language technologies themselves, lexical and ontological knowledge resources, Moore’s law, open-source software, and commodity or cloud computing. Powerset integrates not just text from websites but diverse types of resources, e.g. newsfeeds, blogs, archives, metadata, video, and podcasts. It can also do real-time queries on databases, where a natural language query is converted into a database query to give results that can drive further engagement.

As an example of how Powerset works, when the query ‘Sir Edward Heath died from what’ is entered, the system parses each sentence; extracts entities and semantic relationships; identifies and expands these to similar entities, relationships and abstractions; and then indexes multiple facts for each sentence. The first fact returned from the Wikipedia says ‘Heath died from pneumonia’. Multiple queries on the same topic to Powerset will retrieve the same ‘facts’ (e.g. the query ‘what killed Edward Heath’ returns the same fact). The information on the various entities or relationships can also come from multiple sources, e.g. information on Edward Heath or Deng Xiaoping may be from Freebase and details on pneumonia can come from WordNet. Powerset can also handle more abstract queries that would be difficult to express or perform in conventional keyword search, such as ‘who said something about WMDs?’ Powerset have stated that they will provide various APIs to the developer community and will give access to their technologies to build mashups and other applications. Powerset’s other community contributions will be in the form of data sets, annotations, and open-source software.

Powerset’s language technologies are the result of commercialising the XLE work from PARC, leveraging their ‘multidimensional, multilingual architecture produced from long-term research’. Some of the main challenges for Powerset have been in the areas of scalability, systems integration, incorporating various data and knowledge resources, and enriching the user experience. According to chief operating officer Steve Newcomb29, it takes more computing power to parse
29 http://news.cnet.com/8301-17939_109-9738465-2.html (URL last accessed 2009-06-09)
semantics than to simply index, and nearly 20 percent of Powerset’s ongoing budget is spent on computing resources. Pre-acquisition, Powerset’s commercial model was based on advertising (like most search engines) and on licensing their technologies to other companies or search engines.
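Powerset’s internal fact representation has not been published, so the following is only a rough sketch of how an extracted fact such as ‘Heath died from pneumonia’ might be written down in RDF, reusing identifiers from the open data sets mentioned above. The ex:diedFrom property is invented for illustration.

@prefix ex:      <http://example.org/facts#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# The extracted fact itself (ex:diedFrom is an illustrative property).
dbpedia:Edward_Heath ex:diedFrom dbpedia:Pneumonia ;
    # Provenance: a pointer back to the article the fact was extracted from.
    dcterms:source <http://en.wikipedia.org/wiki/Edward_Heath> .

Once facts are stored in a form like this, the queries ‘Sir Edward Heath died from what’ and ‘what killed Edward Heath’ can both be resolved to the same triple, which is why Powerset returns the same fact for both.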
6.2.4 OpenLink Data Spaces

OpenLink Data Spaces (ODS)30 is a commercial semantically-powered collaboration platform that leverages popular Semantic Web vocabularies including FOAF, SIOC, SKOS and MOAT. ODS SPARQL endpoints provide access to semantic instance data from a range of ODS application instances, including blogs, wikis, aggregated feeds (RSS 1.0, 2.0 and Atom), shared bookmarks, discussions (i.e. comment threads), photo galleries, briefcases (e.g. WebDAV file servers), etc. The associated MyOpenLink.net31 service is an example of an ODS-based service that can expose semantic instance data to SPARQL query service clients. ODS exposes all its data in the form of real or virtual RDF graphs via its Virtuoso32-based quad store. There are a number of modules for the OpenLink Data Spaces (ODS) platform that each export semantic metadata (e.g. using the SIOC vocabulary), including ODS-Blog, ODS-Wiki, ODS-Bookmarks, ODS-AddressBook, ODS-Calendar, ODS-Polls, ODS-Gallery (for photos), ODS-Feeds (for feed aggregation and exposure via SIOC), and ODS-Discussion (for comments across blogs, wikis or any other data space that supports some form of commenting). OpenLink have also released an Amazon EC2 / S3 machine image version of their Virtuoso product, which includes semantic data support: ‘your blogs, wikis, bookmarks, etc. are based on the SIOC ontology (think open social graph++)’. We will introduce SIOC in more detail later on.
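To give a flavour of the kind of data such modules expose, a blog post in an ODS-Blog data space might be described along the following lines. This is a generic SIOC sketch rather than the literal output of any particular ODS instance, and the URIs are illustrative.

@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# The blog itself, modelled as a SIOC forum (a container of posts).
<http://example.org/dataspace/alice/blog> a sioc:Forum .

# A post in that blog, linked to its container and its author's user account.
<http://example.org/dataspace/alice/blog/post-1> a sioc:Post ;
    dcterms:title      "First post" ;
    dcterms:created    "2009-06-09" ;
    sioc:has_container <http://example.org/dataspace/alice/blog> ;
    sioc:has_creator   <http://example.org/dataspace/alice#user> ;
    sioc:content       "Hello from an ODS-style data space." .

<http://example.org/dataspace/alice#user> a sioc:User ;
    sioc:id "alice" .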
6.2.5 Freebase

The open collaborative knowledge database Freebase was launched by San Francisco-based Metaweb Technologies in 2007. Founded by Danny Hillis, co-founder of Thinking Machines and Applied Minds, and Robert Cook, a former video game developer, Metaweb has received nearly $60 million in funding. Freebase has been described by Metaweb as a ‘massive collaboratively-edited database of cross-linked data’, and aims to become ‘the world’s database, with all of the world’s information.’ At present, Freebase mainly incorporates
30 http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/Ods (URL last accessed 2009-06-09)
31 http://myopenlink.net:8890/sparql/ (URL last accessed 2009-06-09)
32 http://www.openlinksw.com/virtuoso/ (URL last accessed 2009-06-09)
community-created data combined with data imported from open access repositories including the Wikipedia and MusicBrainz (an open-content music database and associated set of tools for analysing patterns in music). However, the company have also said that Freebase could be used for proprietary or commercial data, thereby potentially providing an additional revenue stream from such a service (and mirroring similar intentions from Radar Networks, whose public Twine service may later be repackaged for organisational use). Freebase organises its data and categories of data in ontology-like structures called ‘Freebase Types’, based on a graph model. Any user can create and modify their own types and associated properties, and these can be promoted for adoption by administrators of the relevant domains that the type belongs to. Freebase data is licensed under the Creative Commons Attribution license. Data can be accessed via a JSON-based API such that third parties can develop remote applications to leverage Freebase data. Data can be queried using the Metaweb Query Language (MQL). Recently, Freebase announced the availability of all of its data in RDF33, thereby joining efforts in the Semantic Web community on the Linking Open Data project. Various projects, such as DBpedia, now provide links to Freebase concepts, since each Freebase concept has its own URI that can be referenced by external applications.
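As a small illustration of such interlinking, a DBpedia resource can point to the corresponding Freebase concept with an owl:sameAs link; the exact Freebase identifier below is hypothetical and only shows the general pattern.

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# DBpedia linking one of its concepts to the equivalent Freebase URI
# (the /ns/en.galway identifier is illustrative only).
dbpedia:Galway owl:sameAs <http://rdf.freebase.com/ns/en.galway> .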
33 http://rdf.freebase.com/ (URL last accessed 2009-07-20)
7 Multimedia sharing

As we have seen so far in this book, a key feature of the Social Web is the change in the role of a user from simply being a consumer of content. Furthermore, it is not just textual content that can be shared, annotated or discussed, but also any multimedia content such as pictures, videos, or even presentation slides. Moreover, this content can also benefit from Semantic Web technologies. In this chapter, we will describe various trends regarding multimedia sharing on the Social Web and we will focus on how Semantic Web technologies can help to provide better interlinking between multimedia content from different services.
7.1 Multimedia management

There is an ever-increasing amount of multimedia of various formats becoming available on the Social Web. Current techniques to retrieve, integrate and present these media items to users remain deficient. Semantic technologies make it possible to give rich descriptions to media, facilitating the process of locating and combining diverse media from various sources. Making use of online communities can give additional benefits. Two main areas in which social networks and semantic technologies can assist in multimedia management are annotation and recommendation. Some efforts such as DBTune1 already provide musical content exported to the Semantic Web for music-based recommendations. We shall describe these efforts in more detail later on in this chapter.

Social tagging systems such as Last.fm allow users to assign shared free-form tags to resources, thus generating annotations for objects with a minimum amount of effort. The informal nature of tagging means that semantic information cannot be directly inferred from an annotation, as any user can tag any resource with whatever strings they wish. However, studying the collective tagging behaviour of a large number of users allows emergent semantics to be derived (Wu et al. 2006). Through a combination of such mass collaborative ‘structural’ semantics (via tags, geo-temporal information, ratings, etc.) and extracted multimedia ‘content’ semantics (which can be used for clustering purposes, e.g. image similarities or musical patterns), relevant annotations can be suggested to users when they
1 http://dbtune.org/ (URL last accessed 2009-06-09)
contribute multimedia content to a community site, by comparing new items with related semantic items in their implicit and explicit networks. Another way in which the wisdom of crowds can be harnessed in semantic multimedia management is in providing personalised social network-based recommender systems. (Liu et al. 2006) presents an approach for semantic mining of personal tastes and a model for taste-based recommendation. (Ghita et al. 2005) explores how a group of people with similar interests can share documents and metadata, and can provide each other with semantically-rich recommendations. The same principles can be applied to multimedia recommendation, and these recommendations can be augmented with the semantics derived from the multimedia content itself (e.g. the information on those people depicted or carrying out actions in multimedia objects2).
7.2 Photo-sharing services As soon as people began to use digital cameras for taking pictures, they tended to publish them on the Web. However, installing dedicated applications such as Gallery3 or having one’s own storage space on the Web requires some technical expertise, thereby limiting the picture-sharing experience to only a few users. Similar to blogging platforms that provide simple mechanisms for people who want to publish their thoughts online without technical requirements, Social Web applications that let people easily publish, tag and share pictures began to appear, with Flickr being one of the most popular. Flickr, now owned by Yahoo!, allows you to upload pictures by selecting some images from your hard drive, to add text descriptions and tags, and to mark regions of interest on a photo by annotating them (‘add note’). As well as offering tagging and commenting mechanisms, Flickr allows users to organise their pictures into browsable sets. Pictures can be searched by date (i.e. by upload date or by the real ‘taken on’ date using EXIF metadata), tag, description, etc. Flickr offers control mechanisms for deciding who can access photos, and one can define each picture’s visibility (private, public, only friends, only family). As well as the web interface, pictures can be uploaded to Flickr by e-mail or using desktop utilities, and users can display thumbnails of pictures on their blog or website using ‘badges’. Millions of pictures are now available on Flickr, and upload statistics on the Flickr homepage show thousands of pictures being uploaded each minute. Thanks to camera phones and custom uploading applications, many of these incorporate automatic geolocation metadata, such that people can publish pictures as soon as
2 http://acronym.deri.org/ (URL last accessed 2009-06-09)
3 http://gallery.menalto.com/ (URL last accessed 2009-06-09)
they take them on the street, underground or anywhere and these are then automatically linked to a particular place on a map. Apart from the uploading and storage facilities that Flickr offers, an important feature of the service is its social aspect. Flickr offers social networking functionality in the form of adding friends and exchanging messages with them. Pictures can not only be seen by anyone but they can also be subject to conversation. Groups can even be created, to foster a community around a particular topic, following the idea of object-centred communities that we mentioned earlier on in this book. For example, the ‘Squared Circle’ group, dedicated to pictures of circular things, has nearly 6,500 members and 83,000 pictures, with related discussion threads4. There are some limitations with Flickr. You cannot export your data easily, and you cannot modify or edit your pictures (apart from rotation). You have to pay if you want: to allow higher resolution viewing of your images; to create more than three photo sets; to be able to post to more than 10 groups; or to upload many (large) pictures, since the free version is limited to 100 MB of data transfer per month. As a result, some other feature-rich services have become quite popular including Zooomr5.
7.2.1 Modelling RDF data from Flickr

While Flickr does not natively expose any data in RDF, various exporters have been written to provide semantically-enhanced data from this popular photo sharing service. As with many other social websites, Flickr provides an API for developers, and RDFizers (tools for converting from various data formats to RDF) can be written based on this API. For example, the FlickRDF exporter6 (Passant 2008a) provides a representation of Flickr social networks and related user-generated content in RDF, mainly using the FOAF and SIOC ontologies. It therefore allows users to export their Flickr connections in FOAF so that they can be related to connections from their personal FOAF profile or from other social websites providing data in RDF (such as Twitter and its related FOAF exporter), enabling the construction of a distributed social graph as detailed in Chapter 10. The exporter relies on FOAF and SIOC as follows: it uses FOAF to model people as instances of foaf:Person, as well as the various relationships between people using the foaf:knows relationship. Depending
4 http://flickr.com/groups/circle/ (URL last accessed 2009-06-09)
5 http://www.zooomr.com/ (URL last accessed 2009-06-09)
6 http://tinyurl.com/flickrdf (URL last accessed 2009-07-09)
on how much information is publicly available, it can provide more information, such as the person’s name using the foaf:name property. It uses SIOC to model the related user account (sioc:User) as well as the various user galleries that belong to it, using sioc:owner_of and sioc_t:ImageGallery. SIOC is also used to model the various groups a user belongs to, using the sioc:member_of property and the sioc:Usergroup class. Some sample metadata from the FlickRDF exporter is given below; the URIs of the person, the account, the contacts and the groups (which in the actual export are Flickr profile and group URLs) are shown here as placeholders:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix sioc: <http://rdfs.org/sioc/ns#> .

<http://example.org/flickr/alex> a foaf:Person ;
    foaf:name "Alexandre Passant" ;
    foaf:mbox_sha1sum "80248cbb1109104d97aae884138a6afcda688bd2" ;
    foaf:holdsAccount <http://example.org/flickr/alex/account> ;
    foaf:knows <http://example.org/flickr/contact1> ;
    foaf:knows <http://example.org/flickr/contact2> ;
    foaf:knows <http://example.org/flickr/contact3> ;
    foaf:knows <http://example.org/flickr/contact4> ;
    foaf:knows <http://example.org/flickr/contact5> ;
    foaf:knows <http://example.org/flickr/contact6> ;
    sioc:member_of <http://example.org/flickr/group1> ;
    sioc:member_of <http://example.org/flickr/group2> ;
    sioc:member_of <http://example.org/flickr/group3> .
Moreover, in order to provide global interlinking with other RDF data, the exporter also relies on other ontologies and data sources such as GeoNames7 to model the geolocation of a user (based on the Flickr information from the user profile). By providing such a complete export, one can for example identify all Flickr galleries owned by a friend-of-a-friend who lives in France. Other ways of providing semantically-enhanced data from Flickr also exist. The Flickr Wrappr8 provides information regarding pictures related to any
7 http://www.geonames.org/ (URL last accessed 2009-07-07)
8 http://www4.wiwiss.fu-berlin.de/flickrwrappr/ (URL last accessed 2009-06-09)
DBpedia URI. In this way, one can identify all pictures related to a particular monument, city, person, etc. This exporter combines the multilingual capacities of DBpedia and its geolocation features with the Flickr API so that it can identify all pictures related to a particular concept. The export is available both in HTML and RDF (thanks to content negotiation), so that human readers as well as software agents can benefit from it. Another service called Flickr2RDF also provides a method for extracting RDF information from any Flickr picture9. This API mainly uses FOAF and Dublin Core to represent such information in RDF, and also provides a way to export Flickr notes so that notes applied to particular image regions can also be represented in RDF (using the Image Region vocabulary10). We shall describe more ways to annotate image regions in the next section. (Maala et al. 2007) have presented a conversion process with linguistic rules for producing RDF descriptions of Flickr tags that helps users to understand picture tags and to find various relationships between them. Finally, as we will mention in Chapter 8, machine tags from Flickr can be translated into RDF using the Flickcurl API. This API also allows other information about Flickr pictures to be translated into RDF using Dublin Core (and the WGS84 Geo vocabulary11 if the picture has been geotagged).
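For instance, a geotagged Flickr picture could be described roughly as follows, combining Dublin Core and the WGS84 Geo vocabulary. This is a hand-written sketch rather than the literal output of Flickr2RDF or Flickcurl, and the photo URI, title and coordinates are illustrative.

@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://www.flickr.com/photos/exampleuser/1234567890/> a foaf:Image ;
    dc:title   "Galway Bay at dusk" ;
    dc:creator "Example User" ;
    dc:date    "2009-06-09" ;
    geo:lat    "53.27" ;    # latitude taken from the photo's geotag
    geo:long   "-9.05" .    # longitude taken from the photo's geotag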
7.2.3 Annotating images using Semantic Web technologies

While annotating Flickr pictures and extracting some RDF information from them requires the use of a service like Flickr2RDF, there are generic ways to add semantic information to images that can be applied to any picture. The ‘Image Annotation on the Semantic Web’ document12 from the W3C Multimedia Semantics Incubator Group references various vocabularies, applications and use cases that can be used for such tasks. A simple way to do this is to represent metadata related to a particular picture (such as the title, author, image data, etc.) using common Semantic Web vocabularies such as FOAF or Dublin Core (as performed by Flickr2RDF). This then provides a means to query for metadata about pictures in a unified way. Going further, the MPEG-7 (Moving Picture Experts Group) standard and its associated RDF(S)/OWL mappings can also be used to represent image regions and add particular annotations about them13. These annotations can be combined with other metadata, for example, modelling that a region depicts a person
9 http://www.kanzaki.com/works/2005/imgdsc/flickr2rdf (URL last accessed 2009-06-09)
10 http://www.bnowack.de/w3photo/pages/image_vocabs (URL last accessed 2009-06-09)
11 http://www.w3.org/2003/01/geo/wgs84_pos (URL last accessed 2009-06-09)
12 http://www.w3.org/TR/swbp-image-annotation/ (URL last accessed 2009-06-09)
13 http://www.w3.org/2005/Incubator/mmsem/XGR-mpeg7/ (URL last accessed 2009-07-16)
(identified using the FOAF vocabulary), a place (referring to DBpedia or GeoNames information), etc. Vocabularies such as Digital Media14 or the Image Region vocabulary can be used for a similar task. Applications such as M-OntoMat-Annotizer15 or PhotoStuff16 can be used to provide such annotations and to create the corresponding RDF files that can then be exchanged or shared on the Web. While many of these techniques usually require a separate RDF file for storing metadata information, annotations can sometimes be directly embedded into the image itself, for example, in SVG (Scalable Vector Graphics) images as described by the SVG Tiny specification17. At the moment, there is little agreement on what media vocabularies should be used across the board. One useful task would be to define a set of mappings between these various models, allowing us to efficiently combine the best parts of different ontologies for annotating multimedia content. This is one of the current tasks of the W3C Media Annotation Working Group18. As defined by the charter of the group, its goal ‘is to provide an ontology designed to facilitate cross-community data integration of information related to media objects in the Web, such as video, audio and images’. A first draft of this ontology was published in June 200919.
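A minimal illustration of this kind of combination is given below: an image is linked with foaf:depicts to the person it shows, who is in turn described with FOAF, while the place shown is identified by a DBpedia URI. Region-level annotation (saying where in the image the person appears) would additionally require one of the region vocabularies mentioned above and is omitted here; all URIs and names are illustrative.

@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<http://example.org/photos/concert.jpg> a foaf:Image ;
    dc:title     "On stage in Galway" ;
    dc:coverage  dbpedia:Galway ;                       # the place shown in the picture
    foaf:depicts <http://example.org/people/john#me> .  # the person shown in the picture

<http://example.org/people/john#me> a foaf:Person ;
    foaf:name "John Smith" .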
7.3 Podcasts

Podcasts are to radio what blogs are to newspapers or magazines - people can create and distribute audio content using podcasts for public consumption and playback on personal / portable media players, computers or other MP3-enabled devices. Video podcasts, also known as ‘vlogs’ from video blogs or ‘vodcasts’ from video podcasts, are a variation on audio podcasts where people can produce and publish video content on the Web for consumption on media-playing devices, and this content can range from individuals publishing home movies or their own news ‘interviews’, to studios releasing TV episodes or movies for a fee. We shall now describe these two areas in more detail, along with some ideas on how semantic metadata can be leveraged for this application area.
14 http://www.mindswap.org/2005/owl/digital-media (URL last accessed 2009-06-09)
15 http://tinyurl.com/kyyjl (URL last accessed 2009-06-09)
16 http://www.mindswap.org/2003/PhotoStuff/ (URL last accessed 2009-06-05)
17 http://tinyurl.com/svgtiny (URL last accessed 2009-06-05)
18 http://www.w3.org/2008/WebVideo/Annotations/ (URL last accessed 2009-06-05)
19 http://www.w3.org/TR/2009/WD-mediaont-10-20090618/ (URL last accessed 2009-07-07)
7.3.1 Audio podcasts Audio podcasting has become quite popular in the past few years, with podcast recordings ranging from interviews and music shows to comedies and radio broadcasts. One of most popular podcasts is by comedian Ricky Gervais for the Guardian Unlimited website. Although the concept of podcasting was suggested in 2000, the technical roots started to evolve in 2001, with the influence of blogs being a key aspect. The word ‘podcast’ itself is a portmanteau of ‘pod’ from iPod and ‘broadcast’, and the term came into popular use around 2004 with one of the first-known podcasts being produced by Adam Curry. Several technologies had to be in place for podcasting to take off: high-speed access to the Internet, MP3 technology, RSS, podcatching software, and digital media players. In 2005, the word ‘podcast’ already yielded over 100 million Google hits, and in 2006, the number of podcasts surpassed the number of radio stations worldwide. From simple origins, podcasting has become a major force for multimedia syndication and distribution. Much of the strength of podcasting lies in its relative simplicity, whereby casual users can create and publish what is effectively an online radio show and can distribute these shows to a wide audience via the Web. All a user needs to create a podcast is some recording equipment (e.g. a PC and microphone), an understanding of subscription mechanisms like RSS, and some hosting space. It is also easy for a consumer to listen to podcasts, either by using traditional feed-catching methods to subscribe to a podcast feed and thereby receive automatic intermittent updates, or by subscribing to a podcast discovered through the categorised podcast directories of Odeo or the iTunes20 music store on a desktop computer, iPod Touch or iPhone. However, it is not only individuals who are publishing podcasts, since larger organisations have leveraged the positive aspects of such technologies. Many companies now have regular podcasts, ranging from Oracle and NASA to General Motors and Disney. Also, many radio stations have begun making podcasts of their programmes available online (e.g. NPR’s Science Friday), although these usually are devoid of music or other copyright content. Many sites have offered downloads of audio files or streaming audio content (in MP3 or other format) for some time. Podcasts differ in that they can be downloaded automatically via ‘push’ technologies using syndication processes such as RSS described earlier. When a new audio file is added to a podcast channel, the associated syndication feed (usually RSS or Atom) is updated. The consumer’s podcasting application (e.g. iTunes) will periodically check for new audio files in the channels that a consumer is subscribed to, and will automatically
20 http://www.apple.com/itunes/ (URL last accessed 2009-06-05)
download them. Podcasts can also be accompanied by show notes, usually in PDF format. After recording a podcast using a computer with a line-in or USB microphone, editing can be performed using open-source utilities like Audacity21. The podcast can then be self-hosted using services like LoudBlog22 or WordPress.org23 with the PodPress24 extension, or hosted on other third-party services such as WordPress.com25, Blast26 or Blogger (e.g. by uploading a file to the Internet Archive and linking to a post on Blogger using their ‘Show Link Field’ option). As well as the iTunes application from Apple, a popular open-source tool for downloading podcasts is Juice27. There is also a legal aspect to podcasting. Copyright, the branch of law that protects creative expression, covers texts displayed or read aloud, music played during podcasts (even show intros or outros), audio content performed or displayed (e.g. in video podcasts, more later), and even the interviews of others may be protected under copyright. The solution is to try and use what is termed ‘podsafe’ content, i.e. Creative Commons-licensed works28, works in the public domain (e.g. from the Internet Archive29), or at the very least, material that adheres to fair use principles30. Universities are also publishing lectures or other educational content through podcasts31, allowing students to listen to or view their lectures on demand. Teachers can publish podcasts of their lectures and assignments for an entire class or for the public, e.g. to supplement physical lectures or to fully serve the needs of distance-learning students. Conversely, students can create and publish content and deliver it to their teachers or other students. Some popular educational podcasts are provided by Stanford32 and MIT33. Some more podcasting technologies and derivatives include: ‘autocasting’, the automatic generation of podcasts from text-only sources (e.g. from free books at Project Gutenberg); multimedia messaging service-based podcasts and ‘mobilecasting’, i.e. mobile podcasting and listening or viewing through mobile phones; 21
21 http://audacity.sourceforge.net/ (URL last accessed 2009-06-05)
22 http://www.loudblog.com/ (URL last accessed 2009-06-05)
23 http://wordpress.org/ (URL last accessed 2009-06-05)
24 http://www.podpress.org/ (URL last accessed 2009-06-05)
25 http://wordpress.com/ (URL last accessed 2009-06-05)
26 http://www.blastpodcast.com/ (URL last accessed 2009-06-05)
27 http://juicereceiver.sourceforge.net/ (URL last accessed 2009-06-05)
28 http://wiki.creativecommons.org/Podcasting_Legal_Guide (URL last accessed 2009-06-05)
29 http://www.archive.org/details/opensource_audio (URL last accessed 2009-06-05)
30 http://tinyurl.com/bkxtcs (URL last accessed 2009-06-05)
31 http://tinyurl.com/ln2v7t (URL last accessed 2009-06-05)
32 http://itunes.stanford.edu/ (URL last accessed 2009-06-05)
33 http://web.mit.edu/itunesu/ (URL last accessed 2009-06-08)
‘voicecasting’, or podcast delivery through a telephone call; and ‘Skypecasting34’ or phonecasting where podcasts are created by recording a Skype conference call or regular phone call. At the SDForum / SoftTECH Event on Architecting Community Solutions in 2005, Zack Rosen of CivicSpace Labs posed the idea for an evolutionary step in web-based discussions, whereby phone conversations could be recorded (via Asterisk, an open source Linux-based PBX application) and then streamed or downloaded as audio discussions that would augment the traditional text discussions on message board sites. We may also see mailing lists being linked to PBX phone numbers that you could ring up to leave audio comments for members of the list. Podcasting is moving in this direction: you can not only have text comments as replies to podcast postings but you can also add audio comments (this is a feature of the LoudBlog podcasting platform).
7.3.2 Video podcasts

Video podcasts (Felix and Stolarz 2006) are similar to audio podcasts, and can be downloaded to PCs or personal media players using many of the same tools and mechanisms. Known by a variety of terms (video blogging, vidblogging, vlogging, vodcasting from ‘video on demand’, video casting or vidcasting), video podcasting ranges from interviews and news to tutorials and behind-the-scenes documentaries. Some television stations are also making episodes of their series downloadable for free (e.g. via Channel 4’s 4OD35 player in the UK) or for a fee. With video podcasts, anyone can have their ‘own’ internet TV station: all they need is a camera and some effort. Some of the most popular video podcasts (from the Podcast Alley36 directory) include one offering woodworking advice, a gadget news show, digital video camera tutorials, discussions on real-life issues, and a Big Brother-type series. Video podcasters can make money from their podcasts through various means: by using Google Adsense for display ads or by having a PayPal ‘tip jar’ at the podcast download site, by manually inserting video advertisements or by using Revver37 ‘RevTags’ (a clickable advert at the end of each video). According to a story from the Guardian38, despite the relatively modest number of users who are watching online video, research indicates that video downloads are responsible for more than 50% of all internet traffic, and this may in the future cause gridlock on the Internet. Premium Internet video services will reach $2.6
34 http://www.voip-sol.com/15-apps-for-recording-skype-conversations/ (URL last accessed 2009-06-05)
35 http://www.channel4.com/programmes/4od (URL last accessed 2009-07-07)
36 http://www.podcastalley.com/ (URL last accessed 2009-07-07)
37 http://www.revver.com/ (URL last accessed 2009-06-05)
38 http://www.guardian.co.uk/technology/2007/feb/10/news.newmedia (URL last accessed 2008-05-01)
billion in 200939, and according to Forrester Research, with more than half of adults (53% of consumers 18 and older) stating that they view online video40, mainstream adoption of Internet video has arrived. Similar to the differences between audio downloads and podcasting, there are some distinctions that can be made between video downloads and podcasts. Both involve a content creation process, use codecs (coder-decoders) for media compression, may be transferred via multiple file formats, and can possibly leverage some streaming services. Like audio podcasting, video podcasting differs in that it includes some method for automated download of video files, e.g. using an RSS subscription mechanism or possibly some blogging or CMS (content management system) software. DRM (digital rights management) or restrictive transfer protocols are not usually a feature of video podcasts, otherwise nobody would bother downloading them. Video podcasts are normally created through a digital camera or camcorder, webcam, mobile phone, etc. Video files are then transferred from the recording device, or may be captured live via USB, TV card, etc. After conversion, editing and compression using processing tools like VirtualDub or Adobe Premiere, the videos are uploaded to the Web, including popular video sharing services like YouTube, blip.tv, etc. Video podcasts need to be fairly short: less than 5 minutes is good, 15 minutes is okay, but 30 minutes is too long. Since a lot of video podcasts are similar to ‘talk radio’, there can be a bit of a learning curve. As with audio podcasting, you should use ‘podsafe’ audio41 from sources like GarageBand or Magnatune in your videos. Seesmic42 is a ‘microvlogging’ application in the style of services like Twitter (such that it is being referred to as ‘the video Twitter’). However, if a picture is worth a thousand words (and a video contains many thousands of pictures), then Seesmic is quite different to Twitter in terms of expressivity and what can be conveyed through even a short video message (when compared to 140 characters). Seesmic has a simple but intuitive interface for creating content and viewing videos (from the public or from friends). The emphasis in Seesmic is mainly towards using one’s webcam for creating microvlogs, but it also encourages the uploading of short video files (e.g. in Flash video format). Another recent trend is that of ‘lifecasting’ or live video streaming, as exemplified by services such as Ustream43 (allowing video to be broadcast live from computers and mobiles) and Qik44 (for sharing live video from mobiles only).
39 http://www.instat.com/newmk.asp?ID=1478 (URL last accessed 2009-06-05)
40 http://tinyurl.com/dmff8l (URL last accessed 2009-06-09)
41 http://tinyurl.com/2jjtpu (URL last accessed 2009-06-05)
42 http://seesmic.tv/ (URL last accessed 2009-07-16)
43 http://www.ustream.tv/ (URL last accessed 2009-07-07)
44 http://qik.com/ (URL last accessed 2009-07-07)
7.3.3 Adding semantics to podcasts

Semantic metadata can be associated with both the overall structure and audio content of podcasts. Such metadata for podcasts can be attached to the channel and item descriptions in RSS 1.0 format, and may simply involve a reorganisation of pre-existing structured data (see Figure 7.1). For example, Apple has written a specification document45 describing their iTunes namespace46 (an extension for RSS 2.0) that details podcast metadata for use in iTunes listings and iPod displays. Yahoo! has also created a namespace for syndicating media items47, intended as a replacement for the RSS enclosure element. In fact, it may also be possible to explicitly define metadata in an RSS 1.0 extension for multimedia data where such metadata does not already exist (Hogan et al. 2005). Podcast content can also be annotated, mainly through automatic speech recognition, but people could also add annotations (e.g. URL references) or tags to parts of a recording as they listen to it. This could also be combined with the Music Ontology48 (more later).
Fig. 7.1. Some sources of metadata for a semantic representation of a podcast file
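Since RSS 1.0 is itself an RDF vocabulary, such a description can be produced simply by adding further statements about a feed item. The fragment below is a hand-made sketch of a podcast item carrying standard RSS 1.0 and Dublin Core metadata; the feed URI and the ex:durationSeconds property are assumptions rather than part of any existing specification.

@prefix rss: <http://purl.org/rss/1.0/> .
@prefix dc:  <http://purl.org/dc/elements/1.1/> .
@prefix ex:  <http://example.org/podcast-terms#> .

<http://example.org/podcasts/episode42> a rss:item ;
    rss:title  "Episode 42: The Social Semantic Web" ;
    rss:link   "http://example.org/podcasts/episode42.mp3" ;
    dc:creator "Example Radio" ;
    dc:date    "2009-06-09" ;
    dc:subject "semanticweb" ;
    ex:durationSeconds "1800" .   # extra metadata not covered by RSS 1.0 itself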
One possibility would be to extract and convert the metadata that is often embedded in multimedia files, for example when songs are played during the recording of a podcast. An example of such embedded metadata would be the ID3 / ID4 / APE tags often found in MP3 files and annotated via tools like the
45 http://www.apple.com/itunes/whatson/podcasts/specs.html (URL last accessed 2009-06-05)
46 http://www.itunes.com/DTDs/Podcast-1.0.dtd (URL last accessed 2009-06-05)
47 http://video.search.yahoo.com/mrss (URL last accessed 2009-06-08)
48 http://musicontology.com/ (URL last accessed 2009-06-05)
ID3 Tag Editor49. Such tags provide information relating to the file name, song or piece name, creator or artist, album, genre and year. Other multimedia metadata standards include the MPEG series of standards (e.g. MPEG-7, a means of expressing audio-visual metadata in XML). Upon parsing of such information, a pretemplated RSS 1.0 file can be filled in with the available supplemental information for further interpretation by podcasting tools. This metadata can then be used by tools such as the Podcast Pinpointer described by (Hogan et al. 2005), a prototype application for the intelligent location and retrieval of podcasts. Many sites have begun using word recognition technologies in the indexing of multimedia files, with one such popular site being the video site blinkx. Word recognition software has seen many advances in recent years, and is becoming more and more accurate. Services can use these technologies to create a transcript of spoken words contained in the audio of podcast files. This would be quite useful in keyword searches. Others are employing human transcription services to convert the content of audio podcasts to text files, especially since ‘content is king’ on the Web and podcasts can be a valuable source of new text content that may not be available elsewhere. As well as these transcripts, HLT (Human Language Technology) could be implemented to derive a structure from the prose. These structures could also be attached to RSS 1.0 documents thereby complementing existing metadata. An example of a semantically-enhanced podcast service is the ZemPod application described by (Celma and Raimond 2008). It uses both speech and music recognition algorithms in order to automatically split a podcast into different parts and then adds RDF metadata to each part of it in order to ease the way in which podcast files can be consumed and browsed. Metadata can be related to extracted keywords as well as to the recognised songs. Regarding the latter, additional information can be retrieved or interlinked from existing sources for a better user experience. For example, one could identify all podcasts containing a song that lasts less than two minutes and was written by an American band that played at least twice in the CBGB music club. We shall now describe in detail other initiatives related to adding semantics to music-related content on the Web, many of which can be used to semantically describe the content in both audio and video podcasts.
49 http://www.id3-tagit.de/ (URL last accessed 2009-06-05)
7.4 Music-related content
7.4.1 DBTune and the Music Ontology

A wide range of music-related data sources have been interlinked within the Linking Open Data initiative (Raimond et al. 2008). Some efforts such as DBTune50 already provide musical content exported to the Semantic Web, and recent work has been performed in order to reuse that interlinked musical content for music-based recommendations (Passant and Raimond 2008).
Fig. 7.2. Sources of music-related data interlinked with the Linked Open Data cloud
For example, the DBTune project exports the data sets depicted in Figure 7.2 in RDF, interlinked with other data. These data sets encompass detailed editorial information, geolocations of artists, social networking information amongst artists and listeners, listening habits, Creative Commons content, public broadcasting information, and content-based data (e.g. features extracted from the audio signal characterising structure, harmony, melody, rhythm or timbre, and content-based similarity measures derived from these). These data sets are linked to other ones. For example, Jamendo (a music platform and community for free downloadable music) is linked to GeoNames, therefore providing an easy-to-build
50 http://dbtune.org/ (URL last accessed 2009-06-05)
geolocation-based mashup for music data. Artists within MusicBrainz are linked to DBpedia artists, MySpace artists, and artists within the BBC’s playcount data. In order to represent the assorted types of information found in these music data sets, such as the distinction between bands and solo artists, the Music Ontology (MO) provides a complete vocabulary for music-related information modelling which ties in with well-known vocabularies such as FOAF. For example, the ‘Artist’ class in MO is a subclass of the ‘Agent’ class from FOAF.
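For example, a band could be described as follows; this is a hedged sketch in which the artist URI, the GeoNames identifier and the DBpedia resource are illustrative, but it makes the MO/FOAF relationship concrete:

@prefix mo:      <http://purl.org/ontology/mo/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

# A band described with the Music Ontology; since MO ties into FOAF,
# generic FOAF properties such as foaf:name and foaf:based_near apply directly.
<http://example.org/artists/band1> a mo:MusicGroup ;
    foaf:name       "Example Band" ;
    foaf:based_near <http://sws.geonames.org/2964180/> ;   # a GeoNames place URI (illustrative)
    owl:sameAs      dbpedia:Example_Band .                  # link to the same artist in DBpedia (illustrative)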
7.4.2 Combining social music and the Semantic Web

Information from DBpedia, music-related services and data sets described in the previous section can be efficiently combined with social information such as social networks, tagged blog posts, etc. to provide advanced services for end users to browse and find music-related information. Hence, (Passant and Raimond 2008) have detailed various ways for using Semantic Web technologies to enable the navigation of music-related data. For example, by modelling social network information from various platforms (Last.fm, MySpace, etc.) using FOAF (as we will describe later), information can be suggested to a user not just from his or her friends on a particular network but from friends-of-friends on any network. This is shown in Figure 7.3 and goes further than some generic collaborative filtering algorithms provided in most social music applications. A related project is ‘Foafing the Music’ (Celma et al. 2005) which uses FOAF-based distributed social networks as well as content-based data available in RDF to suggest related information in recommender systems.
Fig. 7.3. Combining social networks and musical interests across social websites (the original figure shows a small graph in which users such as :alex, :yves and :tom are connected by foaf:knows links and attached to musical interests such as dbpedia:Ramones and dbpedia:Rancid via foaf:topic_interest)
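This pattern can be written down directly in FOAF. The fragment below is one possible reading of the graph in Figure 7.3 (the user URIs are illustrative); it is the kind of data that lets a recommender walk from a user to friends-of-friends on other networks and to the musical interests attached to them.

@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .

<http://example.org/people/alex#me> a foaf:Person ;
    foaf:knows          <http://example.org/people/yves#me> ;
    foaf:topic_interest dbpedia:Ramones .

<http://example.org/people/yves#me> a foaf:Person ;
    foaf:knows <http://example.org/people/tom#me> .

<http://example.org/people/tom#me> a foaf:Person ;
    foaf:topic_interest dbpedia:Rancid .

Starting from :alex, an application can follow the foaf:knows links to reach :tom and then suggest :tom’s interest (dbpedia:Rancid) to :alex, even if the two users are registered on different platforms.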
Another way to benefit from user-generated tagged audio content is to leverage advanced semantic tagging capabilities such as MOAT (described in Chapter 8). For example, pictures of Joe Strummer or other former band members from the Clash could be displayed when browsing blog posts about the band, as depicted in Figure 7.4 (e.g. by leveraging the relationships that exist in DBpedia between the band and its members).
Fig. 7.4. Interlinking related music information from a content management system and photo-sharing service
Fig. 7.5. Browsing similar artists using information from the DBpedia
As we explained earlier, an aspect of Semantic Web data modelling is the presence of typed links between concepts rather than simple hypertext links between documents. These links can then be used when browsing content, so that one can decide to visit an artist page from another one because they are in the same musical genre or are signed to the same label. A first experiment based on artist information available in DBpedia is depicted in Figure 7.5.
8 Social tagging

Tagging has rapidly become a common and popular practice on social websites. It allows people to easily annotate the content they publish or share with free-form keywords in order to make the content more easily browsable and discoverable by others, leading to a social component of tagging. While tagging is a lightweight, agile and evolving way to annotate content, we believe it can be efficiently combined with formal modelling schemes such as ontologies to make it more powerful and to be part of the Semantic Web as a whole. In this chapter, we hope to give a comprehensive overview of the benefits of Semantic Web technologies for tagging activities both from a theoretical and practical point of view, as we describe both models and applications that can bridge the gap between social tagging and the Semantic Web.
8.1 Tags, tagging and folksonomies
8.1.1 Overview of tagging

Apart from providing a means to create discussions and to define or manage social networks, one of the most important features of social websites is the ability to upload and share content with one’s peers. That particular feature also reinforces the object-centred sociality aspect that was described earlier in this book: people share, interact and meet thanks to common interests related to particular objects. For example, a community may form around a particular movie, a technology or a place. On many social websites, this data can be shared either with whoever is subscribed to (or just browsing) the website or else within a restricted community. Furthermore, not only textual content can be shared, but also various media types such as videos and audio, as we saw in Chapter 7. In order to make this content more easily discoverable, users can add free-form keywords, or tags, that act like subjects or categories for anything that they upload or wish to share. For example, this book could be tagged with the keywords ‘semanticweb’ and ‘socialweb’ on a scientific bibliography management system such as Bibsonomy1. This is depicted in Figure 8.1, which shows how a related journal paper has been tagged. A tag is normally a single-word descriptor, so punctuation marks
1 http://bibsonomy.org/ (URL last accessed 2009-06-05)
are usually avoided, but some systems support phrases in quotation marks like ‘global warming’ and others use camelCase to distinguish between words. One of the most popular tagging systems is the social bookmarking service del.icio.us, which allows users to store their favourite bookmarks on the Web via quick buttons in a browser (instead of locking them into a single desktop browser installation). Bookmarks saved in del.icio.us become accessible from anywhere and are normally public. After bookmarking your favourite URL, e.g. ‘http://www.nuigalway.ie/’, you can then add tags, e.g. ‘university cool nuigalway courses students’. Users can subscribe to other users’ bookmarks, and bookmarks can be forwarded to other registered users using the custom ‘for:username’ tag syntax in del.icio.us.
Fig. 8.1. A journal article tagged with ‘semanticweb’ and ‘socialweb’ in Bibsonomy
On the microblogging service Twitter, people have been using what are called ‘hashtags’2 (i.e. tag keywords prefixed with the ‘#’ or hash symbol) to annotate their microblog posts. While the use of hashtags began in late 2007, Twitter only added hyperlink support for these tags in July 2009, such that clicking on a hashtag brings one to a search service where related microblog posts using the same tag are shown. While tags can generally be considered as a type of metadata, it is important to keep in mind that they are user-driven metadata. Indeed, while a blog engine may automatically assign a creation date to any blog post, or a photo sharing service such as Flickr will use embedded EXIF information to display the aperture of the camera with which a photo was taken, tags are added voluntarily by users themselves and a tag reflects the needs and the will of the user who assigns it. In this way, tags focus on what a user considers as important regarding the way he or she wants to share information. The main advantage of tagging for end users is that one does not have to learn a pre-defined vocabulary scheme (such as a hierarchy or taxonomy) and one can use the keywords that fit exactly with his or her needs
2 http://twitter.pbworks.com/Hashtags (URL last accessed 2009-07-07)
or ‘desire lines’3. Moreover, tags can be used for various purposes, and (Golder and Huberman 2006) have identified seven different functions that tags can play for end users, from topic definition to opinion forming and even self-reference. (Marlow et al. 2006) also identified that in some cases, tags can be social elements that a user wants to emphasise, e.g. ‘seen_in_concert’. As tags are useful only when used in combination with the resource they are related to, they are generally associated to tagging actions. A tagging action then represents the fact of assigning one or more keywords to online resources. Obviously, many tags can be assigned to the same resource, and on some services, different users can assign (the same or different) tags to the same resource, leading to a social feature in those tagging systems. For example, in del.icio.us, a bookmark can be saved by several users, each of them being able to assign his or her own tags to the item. In order to simplify the tagging process, websites generally provide auto-completion features or automatically suggest tags, typically by analysing tags already assigned by other users to the same resource. From a theoretical point of view, a tagging action is often represented as a tripartite model between a User, a Resource and a Tag as proposed amongst others by (Mika 2005a). Figure 8.2 represents three different tagging actions (T1, T2, T3) made by two different users (U1, U2) on a particular picture.
Fig. 8.2. Representing different tagging actions related to the same content
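To make the tripartite model concrete, the following minimal Turtle sketch mirrors the situation of Figure 8.2. It uses a purely illustrative namespace (ex:), and the class, property and tag names are assumptions made for exposition only; the ontologies discussed in Section 8.2.2 provide real vocabularies for this purpose.

```turtle
@prefix ex: <http://example.org/tagging#> .

# Three tagging actions (T1, T2, T3) by two users (U1, U2) on the same picture,
# each linking a user, a resource and a tag.
ex:T1 a ex:TaggingAction ;
    ex:taggedBy       ex:U1 ;
    ex:taggedResource <http://example.org/photos/1234.jpg> ;
    ex:usedTag        "galway" .

ex:T2 a ex:TaggingAction ;
    ex:taggedBy       ex:U1 ;
    ex:taggedResource <http://example.org/photos/1234.jpg> ;
    ex:usedTag        "university" .

ex:T3 a ex:TaggingAction ;
    ex:taggedBy       ex:U2 ;
    ex:taggedResource <http://example.org/photos/1234.jpg> ;
    ex:usedTag        "nuigalway" .
```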
Emerging from the use of tagging on a given platform, these actions lead to what is generally called a folksonomy, a term coined by (Vander Wal 2007) as a portmanteau of the words ‘folks’ and ‘taxonomy’. A folksonomy is hence a social,
collaboratively-generated, open-ended, evolving and user-driven categorisation scheme. In contrast to pre-defined classification schemes, users can use their own terms, which makes the folksonomy evolve quickly, based on users’ needs and benefiting from the ‘architecture of participation’ effect; websites that support tagging therefore also benefit from the ‘wisdom of the crowds’. Information retrieval from tags and folksonomies is simply carried out using tag-based search engines, which leads to some issues that we will describe in the following sections. Folksonomies also provide a way to fluently navigate between various related tags and content, leading to serendipitous discovery of items. For example, users can generally navigate from one tagged item to the list of all items tagged with a similar tag, and so on. A popular visualisation scheme for these tagging ecosystems is the use of tag clouds, where highly-used tags are bigger (or bolder) than the others (similar to a weighted list in visual design). Tag clouds also give an overview of the main categories or topics discussed in the related community website, as seen in Figure 8.3.
Fig. 8.3. Tag cloud of popular tags from del.icio.us
8.1.2 Issues with free-form tagging systems
In spite of its advantages when annotating content items, tagging leads to various issues regarding information retrieval. It makes the task of retrieving tagged content sometimes quite costly, especially when looking for information tagged by other people. (Mathes 2004) says that the ‘folksonomy represents simultaneously some of the best and worst in the organisation of information’.
As we mentioned previously, tag-dedicated search engines are simply based on plain-text strings, i.e. a user types a tag and gets all the content that has been tagged with that particular keyword. This leads to various issues that we will now describe. (Interestingly, these issues have parallels in the world of libraries, and are one reason why librarians use classification schemes such as thesauri and taxonomies to classify items, for example the Dewey Decimal Classification or the ACM taxonomy.)
8.1.2.1 Tag ambiguity
Since tags are simply text strings, without any semantics or obvious interpretation (other than as a set of characters) for a software program that reads them, ambiguity is an important issue. While a person knows that the tag ‘apple’ means something different when it is used in relation to content about a laptop or on a picture of a bag of fruit, a search engine will not be able to distinguish between them. It will retrieve both items for a search on ‘apple’ even if the user had the computer brand in mind. Consequently, the user will have to sort out what is relevant and what is not with regard to his or her expectations, which can be a costly step depending on the number of retrieved items. For example, Figure 8.4 shows the result of a search for the most relevant items tagged ‘apple’ on Flickr, which mixes pictures of fruit and Apple devices.
Fig. 8.4. Tag ambiguity in a Flickr search for pictures tagged ‘apple’
8.1.2.2 Tag heterogeneity
Tag ambiguity refers to the same tag being used to refer to different things, but a parallel issue is that different tags can also be used to refer to the same thing. In this case, a user must run various queries to get the content related to a particular concept or object. Such heterogeneity is mainly caused by the multilingual nature of tags (e.g. ‘semanticweb’ and ‘websemantique’), but also by the fact that people use acronyms or shortened versions (‘sw’ and ‘semweb’), as well as linguistic and morpho-syntactic variations (synonyms, plurals, case variations, etc.). For example, we observed that at least ten variations are used on del.icio.us to identify the term ‘semantic web’, not taking into account narrower tags, as we will now explain.
8.1.2.3 Lack of organisation between tags
Since a folksonomy is essentially a flat bundle of tags, the lack of relationships between them makes it difficult to find information if one is not directly looking at the right tag. This is clearly a problem in the practice of tagging, especially if, as noted by (Golder and Huberman 2006), users use different tags depending on their level of expertise, or if they search for broader or narrower ones. For example, while we mentioned the tags ‘semanticweb’ or ‘socialweb’ regarding this book, an expert may consider those terms too broad and instead prefer tags such as ‘sioc’, ‘rdfa’ or ‘sparql’, which help him or her to better classify the data. Then, if someone is simply looking at items tagged ‘semanticweb’, they will not be able to retrieve the book, even though the two tags are clearly related in terms of their technological domain. To overcome this issue, clustering algorithms can be used to identify related tags, as introduced by (Begelman et al. 2006). However, their success depends on the tagging distribution, i.e. whether or not there is a strong co-occurrence between tags, which may not be the case in some folksonomies, even for tags that identify related concepts. In some cases, these algorithms can also be combined with other approaches to identify related tags and communities around particular topics (Hayes and Avesani 2007).
8.2 Tags and the Semantic Web
In the past, folksonomies and ontologies have regularly been cited as opposite and mutually exclusive means for managing and organising information. A frequent point of view was to consider folksonomies as a bottom-up classification, while ontologies were seen as a centralised top-down approach. This way of thinking was also part of a larger set of opposing views between Web 2.0 and the Semantic Web.
However, as we have described in this book, we believe that this opposition is unjustified and should not exist, since these two fields are in fact complementary (and synergistic) paths towards enhancing the Web. This opposition has often been stated in posts on the blogosphere, and one of the main reasons may be a misunderstanding of the original Semantic Web article by (Berners-Lee et al. 2001). The common interpretation is that this initial vision would lead to a unique universal ontology for the Semantic Web, which was not the case, as Semantic Web pioneer Jim Hendler states on his blog (http://www.mindswap.org/blog/2007/11/21/shirkyng-my-responsibility/, accessed 2009-06-02) when answering various comments. Nevertheless, numerous works on the links between tags, related objects (tagging actions, folksonomies, tag clouds, etc.) and the Semantic Web have been published during the last couple of years. We can divide these works into two main areas: (1) those aiming to define, mine or automatically link to taxonomies or ontologies from existing folksonomies, and (2) works based on defining Semantic Web models for tags and related objects. The border between the two is sometimes fuzzy, since both approaches can be combined.
8.2.1 Mining taxonomies and ontologies from folksonomies
This first set of approaches is mainly based on the idea that emergent semantics naturally appear through the use of tags. As (Golder and Huberman 2006) report, there is generally a stable set of tags used for a given resource (or in a given tag space) after a certain amount of time. For example, on del.icio.us, the tags for an item stabilise after it has been tagged about a hundred times. Therefore, emergent semantics can be used to mine taxonomies or ontologies from folksonomies, and this research area has led to several works during the past few years.
Fig. 8.5. Mining hierarchical relationships from co-occurrence of tags, adapted from (Halpin et al. 2006)
Among others, (Halpin et al. 2006) used an approach based on related co-occurrences of tags to extract hierarchical relationships between concepts, as depicted in Figure 8.5. Based on the reflexive co-occurrence of tags, they extract broader and narrower relationships between concepts, which they model as an RDFS vocabulary using the rdfs:subClassOf property. (Mika 2005a) defined a socially-aware approach for automatically building ontologies by combining social network analysis and clustering algorithms based on folksonomies. One outcome of his work is that sub-communities of users can also be mapped to a hierarchy of tags: communities of experts use narrower tags than the broader communities they are included in. More recently, the FoLksonomy Ontology enRichment (FLOR) technique (http://flor.kmi.open.ac.uk/, URL last accessed 2009-06-05) provides a completely automated approach to semantically enrich tag spaces by mapping tags to Semantic Web entities (Angeletou 2008). By enriching tag spaces with semantic information about the meaning of each tag, some of the information retrieval issues with tagging (such as the tag ambiguity mentioned earlier) can be addressed.
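As a minimal sketch of the kind of output such mining can produce, the Turtle below expresses broader/narrower relationships with rdfs:subClassOf, following the modelling choice reported for (Halpin et al. 2006); the concept URIs themselves are illustrative assumptions, reusing the ‘rdfa’/‘semanticweb’ example from the previous section.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/concepts/> .

# Broader/narrower relationships mined from tag co-occurrence:
# 'rdfa' and 'sparql' are narrower than 'semanticweb', which is narrower than 'web'.
ex:rdfa        rdfs:subClassOf ex:semanticweb .
ex:sparql      rdfs:subClassOf ex:semanticweb .
ex:semanticweb rdfs:subClassOf ex:web .
```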
8.2.2 Modelling folksonomies using Semantic Web technologies
While the previous section described work on extracting and linking structured models based on tags and tagging activities, another approach to bridging folksonomies and the Semantic Web is to use RDF(S)/OWL modelling principles to represent tags, tagging actions and other related objects such as tag clouds. While tag-based search is the only way to retrieve tagged content at the moment (leading to the aforementioned problems), these new models allow advanced querying capabilities such as ‘which items are tagged “semanticweb” on any platform’, ‘what are the latest ten tags used by Stefan on del.icio.us’, ‘list all the tags commonly used by Alex on SlideShare and by John on Flickr’ or ‘retrieve any content tagged with something relevant to the Semantic Web field’; a SPARQL sketch of the first query is given below. Having tags and tagged content published in RDF also allows one to easily link this data to or from other Semantic Web data, and to reuse it across applications in order to achieve the goal of a global graph of knowledge.
While it has not been implemented, (Gruber 2007) defined one of the first approaches to model folksonomies and tagging actions using a dedicated ontology. This work considers the tripartite model of tagging and extends it with (1) a space attribute, aimed at modelling the website in which the tagging action occurred, and (2) a polarity value in order to deal with spam issues. His proposal provides a complete model to represent tagging actions, but also considers the idea of tag identity, such that various tags can refer to the same concept while being written differently, introducing the need to identify some common semantics in the tags themselves.
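To make the first query above concrete, here is a hedged SPARQL sketch. It assumes data expressed with the Tag Ontology introduced in the next paragraphs; the property names (tags:Tagging, tags:taggedResource, tags:associatedTag, tags:name) are given from memory rather than taken from the specification, so they should be checked against the ontology itself.

```sparql
PREFIX tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/>

# Items tagged 'semanticweb', regardless of the platform the data was exported from.
SELECT DISTINCT ?item
WHERE {
  ?tagging a tags:Tagging ;
           tags:taggedResource ?item ;
           tags:associatedTag  ?tag .
  ?tag tags:name "semanticweb" .
}
```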
The Tag Ontology (http://www.holygoat.co.uk/projects/tags/, URL last accessed 2009-06-05) was the first RDF-based model for representing tags and tagging actions, based on the initial ideas of Gruber and on the common theoretical model of tagging that we mentioned earlier. This ontology defines the ‘Tag’ and ‘Tagging’ classes with related properties to create the tripartite relationship of tagging. In order to represent the user involved in a tagging action, this ontology relies on the FOAF vocabulary that we will describe in more detail later. An important feature of this model is that it defines a Tag class, hence implying that each tag has a proper URI, so that tags can be used both as the subject and the object of RDF triples. Moreover, this class is defined as a subclass of skos:Concept, and the ontology introduces a ‘relatedTag’ property. The Simple Knowledge Organisation System (SKOS, http://www.w3.org/2004/02/skos/, URL last accessed 2009-06-05) is a lightweight RDFS vocabulary allowing people to define controlled vocabularies such as taxonomies and thesauri. In this way, tags can be linked together, for example to model that the ‘rdfa’ tag is more specific than ‘semanticweb’. However, the proposed property does not differentiate between two tags that are related because they represent the same concept but are spelled differently (‘websemantique’ and ‘semanticweb’) and two tags where one identifies a concept broader than the other (‘rdfa’ and ‘semanticweb’). Finally, while the Tag Ontology does not specifically consider the tagging space, it introduces a way to temporally define the tagging action thanks to a taggedOn property.
The Social Semantic Cloud of Tags (SCOT) ontology (Kim et al. 2007) (http://scot-project.org/, URL last accessed 2009-06-05) is focused on representing tag clouds and defines ways to describe the use and co-occurrence of tags on a given social platform, allowing one to move his or her tags from one service to another and to share tag clouds with others. While we will introduce the ideas of data portability later on in this book, it is important to mention that SCOT envisions this portability not for the content itself but for the tagging actions and the tags of a particular user. SCOT reuses the Tag Ontology as well as SIOC, and models tags, tagging actions and tag clouds. An important aspect of the SCOT model is that it considers the space where the tagging action happened (i.e. the social platform, e.g. Flickr or del.icio.us), as suggested by Gruber’s initial proposal. SCOT also provides various properties to define spelling variants between tags, using a main spellingVariant property and various subproperties such as acronym, plural, etc.
Another ontology related to tagging is Meaning Of A Tag (MOAT, http://moat-project.org/, URL last accessed 2009-06-05), which aims to represent the meaning of tags using URIs of existing domain ontology instances or resources from existing public knowledge bases (Passant and Laublet 2008b), such as those from the Linking Open Data project introduced in Chapter 4. The goal of MOAT is thus to create a bridge between folksonomies and existing ontologies or knowledge bases, so that the information retrieval issues of free-form tagging can be solved. For example, it allows us to model facts such as
‘In this blog post, I use the tag “apple” and by that I mean the computer brand identified by dbpedia:Apple_Inc., while the “apple” tag on that other picture means the fruit identified by dbpedia:Apple’, as depicted in Figure 8.6. To achieve this goal, it provides a lightweight OWL-DL ontology that reuses and extends the Tag Ontology. MOAT also relies on SIOC and FOAF to model the tagged resource and the user who assigned the tag to it, respectively. MOAT is more than a single model, as it also provides a framework (http://moat-project.org/architecture, URL last accessed 2009-06-05) based on the ontology, the goal of which is to let people easily bridge the gap between simple free-form tagging and semantic indexing. The latter is more powerful than the former in terms of information retrieval, but is certainly more complex in terms of annotating content. The proposed framework aims to reduce this gap by helping users to annotate their content with URIs of Semantic Web resources, starting from the tags that they have already used to annotate content. Furthermore, while it mainly consists of a model and a framework for augmented tagging, the MOAT approach can be automated, as applied by (Abel 2008) in the GroupMe! system (http://groupme.org, URL last accessed 2009-06-08). It therefore provides a nice bridge between approaches that extract tag meanings from folksonomies, as described in the previous section, and those that aim to model tags with Semantic Web technologies.
Fig. 8.6. Modelling the meaning of the ‘apple’ tag in a tagging action using MOAT
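A hedged Turtle sketch of the situation in Figure 8.6 follows: a blog post and a photo are both tagged ‘apple’, and MOAT attaches a different meaning to each tagging action. The namespaces are those of the Tag Ontology and MOAT, but the exact property names used here (tags:taggedResource, tags:associatedTag, tags:taggedBy, tags:name, moat:tagMeaning) are written from memory and should be verified against the ontologies; the post, photo and user URIs are invented for the example.

```turtle
@prefix tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/> .
@prefix moat: <http://moat-project.org/ns#> .
@prefix ex:   <http://example.org/> .

# Tagging action 1: a blog post tagged 'apple', meaning the computer brand.
ex:tagging1 a tags:Tagging ;
    tags:taggedResource ex:blog_post_42 ;
    tags:associatedTag  ex:tag_apple ;
    tags:taggedBy       ex:alice ;
    moat:tagMeaning     <http://dbpedia.org/resource/Apple_Inc.> .

# Tagging action 2: a photo tagged 'apple', meaning the fruit.
ex:tagging2 a tags:Tagging ;
    tags:taggedResource ex:photo_7 ;
    tags:associatedTag  ex:tag_apple ;
    tags:taggedBy       ex:bob ;
    moat:tagMeaning     <http://dbpedia.org/resource/Apple> .

# The shared tag has its own URI and a literal label.
ex:tag_apple a tags:Tag ;
    tags:name "apple" .
```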
More recently, the Common Tag initiative (http://commontag.org/, URL last accessed 2009-07-16), involving AdaptiveBlue, Faviki, Freebase, Yahoo!, Zemanta, Zigtag and DERI, NUI Galway, developed a lightweight vocabulary (http://commontag.org/ns, URL last accessed 2009-07-16) with a similar goal of linking tags to well-defined concepts (represented with their URIs) in order to make tagging more efficient and interconnected. In particular, it focuses on a simple approach allowing site owners to publish RDFa tag annotations, as well as providing a complete ecosystem of
producers and consumers of Common Tag data that can help end users to deploy applications based on this format, as depicted in Figure 8.7.
In addition, other models that can be used to represent tags include the Nepomuk Annotation Ontology (NAO, http://www.semanticdesktop.org/ontologies/nao/, URL last accessed 2009-06-05), SIOC, and the Annotea annotation (http://www.w3.org/2000/10/annotation-ns#) and bookmark (http://www.w3.org/2003/07/Annotea/BookmarkSchema-20030707) schemas. Both NAO and SIOC define a new ‘Tag’ class, with sioc:Tag defined as a subclass of skos:Concept. SIOC also defines a topic property to link a resource to some of its topics. While not explicitly using the word ‘tag’ in its definition, the Annotea bookmark model provides a ‘Topic’ class and a ‘hasTopic’ property to link an item to some related keywords. This model also defines a ‘subTopicOf’ property in order to model hierarchies of topics. However, in contrast to the main ontologies described previously, these three vocabularies do not provide any way to model the tagging action itself (i.e. the tripartite relation between a resource, a tag and a user). Hence, they cannot capture the complete representation of folksonomies, but simply focus on the relationship between a tagged resource and its related tags.
Fig. 8.7. The initial Common Tag ecosystem (from http://commontag.com/Home)
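The triples behind a Common Tag RDFa annotation can be sketched in Turtle roughly as follows. The class and property names (ctag:Tag, ctag:tagged, ctag:label, ctag:means) are given as recalled from the Common Tag vocabulary at http://commontag.org/ns and should be verified there; the tagged page URI is illustrative.

```turtle
@prefix ctag: <http://commontag.org/ns#> .

# A web page tagged 'semanticweb', with the tag linked to a well-defined concept.
<http://example.org/posts/social-semantic-web>
    ctag:tagged [
        a ctag:Tag ;
        ctag:label "semanticweb" ;
        ctag:means <http://dbpedia.org/resource/Semantic_Web>
    ] .
```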
Finally, machine tags (http://www.flickr.com/groups/mtags/, URL last accessed 2009-06-05) from Flickr can be used as a way to provide augmented tagging to end users. By defining tags in the form ‘prefix:property=value’ (such as ‘geo:lat=43.22’ or ‘lastfm:event=3544’), they allow users to add machine-readable metadata to annotated pictures. For example, the tag ‘geo:lat=43.22’ can be used to define the location where a picture was taken, especially as it can be automatically generated by camera phones and dedicated upload applications, while the ‘lastfm:event=3544’ tag can be used to automatically aggregate Flickr pictures related to a particular event on Last.fm. While they are not directly represented using Semantic Web technologies, machine tags can be mapped to RDF using the Flickcurl API (http://librdf.org/flickcurl/, URL last accessed 2009-06-05).
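As a sketch of how such a mapping might look, the Turtle below takes the two machine tags mentioned above and expresses them as RDF. This is one possible (assumed) mapping, not necessarily the output produced by Flickcurl: the geo: namespace is the real W3C WGS84 vocabulary, while the lastfm: namespace and the photo URI are invented for illustration.

```turtle
# Machine tags as entered on Flickr:
#   geo:lat=43.22
#   lastfm:event=3544

@prefix geo:    <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix lastfm: <http://example.org/ns/lastfm#> .   # invented namespace, for illustration only

<http://www.flickr.com/photos/exampleuser/1234567890/>
    geo:lat "43.22" ;
    lastfm:event <http://www.last.fm/event/3544> .
```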
8.3 Tagging applications using Semantic Web technologies
Various tools already provide advanced tagging features using Semantic Web technologies or RDF exports of tagged data based on the techniques introduced in the previous sections. We shall now describe some of them.
8.3.1 Annotea
As mentioned before, while not strictly defined as a ‘tagging’ system, Annotea (http://www.w3.org/2001/Annotea/, URL last accessed 2009-06-05) was almost certainly the first web-based social application that used Semantic Web technologies. Begun in 2001, it allowed people to simply add notes and comments to web pages that they browsed, and to bookmark and then share them through a community of users subscribed to a dedicated Annotea server. An important feature of Annotea is its openness and compliance with W3C standards. In particular, all data is available in RDF using the dedicated Annotea annotation and bookmark vocabularies mentioned before. Furthermore, Annotea relies on a simple user interface, so that the use of these technologies is completely transparent to the end user. As it was launched some years before the Web 2.0 meme and many of the works described in this book, Annotea can be seen as a precursor of many Social Semantic Web applications. Moreover, various clients can be used to interact with an Annotea server, from the W3C browser Amaya (http://www.w3.org/Amaya/, URL last accessed 2009-06-05), as shown in Figure 8.8, to Firefox plugins.
Fig. 8.8. Adding an annotation on Annotea using the Amaya browser (from http://www.w3.org/Amaya/screenshots/Overview.html)
8.3.2 Revyu.com
Revyu.com (http://revyu.com/, URL last accessed 2009-06-05) is an online service dedicated to creating reviews of all sorts of things, from conference papers to pubs or restaurants. It reuses some well-known principles and features of Social Web applications such as tags, tag clouds and star ratings, and it provides a JavaScript bookmarklet to ease the publication of new reviews for end users when browsing the Web. Most importantly, Revyu.com, winner of the Semantic Web Challenge in 2006, is completely RDF-based. Each review is modelled using the RDF Review vocabulary (http://vocab.org/review/terms.html, URL last accessed 2009-06-09), which is compatible with the hReview microformat, and tags as well as tagging actions are represented using the Tag Ontology (Figure 8.9). As Revyu.com also provides a SPARQL endpoint for querying its data, it effectively allows one to reuse tagged data
from the website in any other application, as well as enabling mashups with existing content. Two important features of Revyu.com regarding the use of Semantic Web technologies are:
Integration and interlinking with other data sets. Thanks to different heuristics, Revyu.com creates identity links (using owl:sameAs properties) to resources already defined on the Semantic Web, especially resources described in data sets from the Linking Open Data cloud that we described earlier. For example, most reviews of research papers are linked to the paper’s entry in the Semantic Web Dogfood project (http://data.semanticweb.org/, URL last accessed 2009-06-05), while reviews about movies can be automatically linked to their DBpedia URIs. Thus, it provides global interlinking of Semantic Web resources rather than defining new URIs for existing concepts.
The ability to consume FOAF-based user profiles. While many Social Web applications require users to fill in their personal details when subscribing, even though those details have already been filled in on other platforms, Revyu.com allows one to simply give his or her FOAF URI so that the information contained therein is automatically reused. As we will describe in more detail later in this book, consuming FOAF profiles in web-based applications provides a first step towards solving data portability issues between applications on the Social Web.
Fig. 8.9. Tagged data and related RDF annotations from Revyu.com
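A hedged Turtle sketch of the kind of data Revyu.com exposes: a reviewed item interlinked with DBpedia via owl:sameAs, a review with a rating and some text, and a tagging action using the Tag Ontology. The rev: property names (rev:hasReview, rev:rating, rev:text) and the tags: property names follow those vocabularies as recalled here, so they should be checked against the published schemas; all item, review and tagging URIs are invented for the example.

```turtle
@prefix rev:  <http://purl.org/stuff/rev#> .
@prefix tags: <http://www.holygoat.co.uk/owl/redwood/0.1/tags/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/revyu/> .   # invented namespace for the example

# A reviewed item, interlinked with its DBpedia description via owl:sameAs.
ex:the_matrix
    owl:sameAs    <http://dbpedia.org/resource/The_Matrix> ;
    rev:hasReview ex:review_001 .

# The review itself: a star rating and some text.
ex:review_001 a rev:Review ;
    rev:rating 4 ;
    rev:text   "A classic, still worth rewatching."@en .

# The tagging action that attached the tag 'scifi' to the reviewed item.
ex:tagging_001 a tags:Tagging ;
    tags:taggedResource ex:the_matrix ;
    tags:associatedTag  [ a tags:Tag ; tags:name "scifi" ] .
```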
8.3.3 SweetWiki
SweetWiki (Semantic WEb Enabled Technology Wiki) from (Buffa et al. 2007) is a semantic wiki prototype that offers augmented tagging features to end users. In contrast to the other wikis we mentioned in Chapter 6 of this book, it is not designed for creating and maintaining ontology instances, but rather uses Semantic Web technologies to augment the user experience and navigation between pages. One relevant feature of SweetWiki regarding the work described in this chapter is the ability to organise tags as a hierarchy of concepts. This hierarchy is modelled in RDFS so that it can be reused in other applications, while the wiki model itself is defined using a particular OWL ontology. Most importantly, this hierarchy of tags is not a personal one but is built and shared amongst all the users of the wiki. In this way, SweetWiki provides a social and collaborative approach to maintaining hierarchies of concepts that can be seen as lightweight ontologies. Moreover, users can define two tags as synonyms in order to solve heterogeneity issues. From a tagging point of view, tags can be assigned not only to web pages but also to pictures and embedded videos, and these are then used to retrieve or browse content, while similar and related tags are used to augment the navigation process by suggesting related pages. Finally, SweetWiki models all of its data using RDFa. Hence, an application that wants to reuse this data only needs to retrieve and parse an XHTML page, since all the required RDF annotations are embedded in it and can be extracted using GRDDL.
8.3.4 int.ere.st
Based on the SCOT ontology, int.ere.st is a web application dedicated to tag portability between applications. The main objective of int.ere.st (Kim et al. 2008) is to demonstrate how Semantic Web and Social Web technologies can be combined to support better tag sharing and creation across various online communities. Using int.ere.st, people can save, tag and bookmark items in their own as well as in other people’s tag clouds, represented using the SCOT ontology. The tag meta-search also allows one to look for similar patterns of tagging from other people based on their interests (as expressed using tags). Tag clouds can be imported into int.ere.st from various services, and a related SCOT exporter is available for the popular WordPress blogging platform. Some of the major functionalities provided by the application include: various options for tag searching, such as and (&), or (space), co-occurrence (+), broader (>), and narrower (