History in the Age of Abundance?: How the Web Is Transforming Historical Research 9780773558212

A guide to the World Wide Web and its archives for the contemporary historian.


English, 327 pages, 2019



HISTORY IN THE AGE OF ABUNDANCE?

HISTORY IN THE AGE OF ABUNDANCE? How the Web Is Transforming Historical Research

IAN MILLIGAN

McGill-Queen’s University Press Montreal & Kingston • London • Chicago

© McGill-Queen’s University Press 2019 ISBN 978-0-7735-5696-6 (cloth) ISBN 978-0-7735-5697-3 (paper) ISBN 978-0-7735-5821-2 (ePDF) ISBN 978-0-7735-5822-9 (ePUB) Legal deposit second quarter 2019 Bibliothèque nationale du Québec Printed in Canada on acid-free paper that is 100% ancient forest free (100% post-consumer recycled), processed chlorine free This book has been published with the help of a grant from the Canadian Federation for the Humanities and Social Sciences, through the Awards to Scholarly Publications Program, using funds provided by the Social Sciences and Humanities Research Council of Canada.

We acknowledge the support of the Canada Council for the Arts, which last year invested $153 million to bring the arts to Canadians throughout the country. Nous remercions le Conseil des arts du Canada de son soutien. L’an dernier, le Conseil a investi 153 millions de dollars pour mettre de l’art dans la vie des Canadiennes et des Canadiens de tout le pays.

Library and Archives Canada Cataloguing in Publication

Milligan, Ian, 1983–, author
History in the age of abundance? : how the web is transforming historical research / Ian Milligan.
Includes bibliographical references and index.
Issued in print and electronic formats.
ISBN 978-0-7735-5696-6 (cloth). – ISBN 978-0-7735-5697-3 (paper)
ISBN 978-0-7735-5821-2 (ePDF). – ISBN 978-0-7735-5822-9 (ePUB)
1. Web archives–Case studies. 2. History–Research–Case studies. 3. Case studies. I. Title.
ZA4197.M55 2019   027   C2018-906592-3   C2018-906593-1

This book was designed and typeset by Peggy & Co. Design Inc. in 10.5/14 Sabon.

For Auden

CONTENTS

Figures and Tables
Acknowledgments
Introduction
1 Exploding the Library
2 Web Archives and Their Collectors
3 Accessing the Records of Our Lives
4 Unexpected Needles in Big Haystacks
5 Welcome to GeoCities, Population Seven Million
6 The (Practical) Historian in the Age of Big Data
Conclusion
Notes
Bibliography
Index

FIGURES AND TABLES

Figures

i.1 A rack of servers at the Internet Archive in San Francisco. Photo by author, 2017.
1.1 Baran’s illustration of three network types from Baran, “On Distributed Communications.” Used with permission from the RAND Corporation.
2.1 Line-Mode Browser Emulator. Used with permission from the European Organization for Nuclear Research (CERN).
2.2 Internet Archive Wayback Machine screenshot. Used with permission from the Internet Archive.
3.1 The webpage ianmilligan.ca, with just the front HTML page preserved.
3.2 The webpage ianmilligan.ca, a screenshot taken the same day as it appeared with all resources showing.
3.3 uwaterloo.ca from 22 October 1997, viewed through an emulated Mosaic 2.2 browser from http://oldweb.today. Used with permission from Rhizome.
3.4 uwaterloo.ca from 22 October 1997, viewed in a modern Firefox browser. Used with permission from the Internet Archive.
3.5 uwaterloo.ca from 22 October 1997, viewed through an emulated Internet Explorer 4.01 browser from http://oldweb.today. Used with permission from Rhizome. My thanks to them for this high-resolution screenshot.
3.6 Frequency of the term public transit across three Canadian political parties from webarchives.ca.
3.7 All links within the Canadian Political Collection, 2005–2009.
3.8 Three major Canadian political parties and their inbound/outbound links, 2005–2009.
3.9 Link structures prior to the 2006 Canadian federal election.
4.1 Default Wayback Machine on Archive.org, 2015. Used with permission from the Internet Archive.
4.2 Exploring temporal violations in a web browser. Used with permission from the Internet Archive.
4.3 The Columbia University Human Rights Web Archive. Screenshot used with permission from Columbia University Libraries.
5.1 Winning page for the GeoCities “Homesteader of the Year” competition. Used with permission from the Internet Archive.
5.2 Link structure of the GeoCities EnchantedForest.
5.3 EnchantedForest/Glade/3891: The highest ranked site. Used with permission from the Internet Archive.
5.4 Mentions of GeoCities in a LexisNexis Media Survey, 1995–2013.
6.1 Webrecorder capturing the University of Waterloo’s webpage. Used with permission of Rhizome. My thanks to them for this high-resolution screenshot.
6.2 Archives Unleashed Cloud, in development ca. late 2018.
6.3 The command line versus the graphical user interface.
6.4 Plain text extracted from a collection of websites about the 1917 Halifax Explosion.
6.5 Voyant Tools “revealing” the Halifax Explosion Web Archive. Used with permission from Stéfan Sinclair.
6.6 Halifax Explosion loaded into the Gephi Visualization Platform.
6.7 A slightly more complicated Gephi visualization of the Nova Scotia Artist-Run Centres Collection.

Tables

2.1 Selection of long-deleted communities
3.1 Hypothetical hyperlink sources/targets
5.1 Link relationships in one GeoCities neighbourhood
6.1 Domain frequency in a web archive

ACKNOWLEDGMENTS

No book exists in a vacuum, and many thanks are necessary to those who made the publication of this book possible. I’m fortunate to work daily with a fantastic group of colleagues who share my passion for web archiving, making data accessible, and trying to push the conversation around historians and web archives forward. My colleague and long-time collaborator Nick Ruest at York University has done much to shape my understanding of data, stewarding digital cultural heritage, and how to make things actually useful. Jimmy Lin has also done much to develop web archiving and Big Data infrastructure, and his critical and incisive eye has helped hone much of my thinking about how a historian can actually use these sorts of data. Special thanks as well to Samantha Fritz, our project manager extraordinaire who has a finely honed sense of how to make things not only useful but fun to work with, as well as to my postdoctoral fellow Ryan Deschamps, who similarly helps to create useful and engaging tools and visualizations. If it wasn’t for Nick, Jimmy, Samantha, and Ryan, much of my thinking about interdisciplinary collaboration would be just that – wishful thinking. Colleagues and peers helped make the project what it is, from listening to conference presentations to providing much-needed time away from it. At the University of Waterloo’s Department of History, I’m lucky to work every day with a great gang of people. I’m tempted to list my entire department, but thanks especially to Gary Bruce, Dan Gorman, Donna Hayes, Geoffrey Hayes, Jane Nicholas, Julia Roberts, Susan Roy, John Sbardellati, Lynne Taylor, Ryan Touhey, and Jim Walker. All of you have let me bend your ears about things as varied as web archives, administrative tasks, and where to find the best beer or coffee in Waterloo. A great group of graduate students have worked with me as research
assistants on web archiving projects more generally: Shawn Dickinson, David Hussey, Katie Mackinnon, Sarah McTavish, Patrick O’Leary, Denée Renouf, Eric Vero, and Jeremy Wiebe. The web archiving community is a truly great one, and I’ve had the pleasure of being able to collaborate with or be inspired by many people. Conversations with Jefferson Bailey, Niels Brügger, Nathalie Casemajor, Olga Holownia, Mark Graham, Andrew Jackson, Martin Klein, Justin Littman, Emily Maemura, Michael Nelson, Anna Perricci, Matthew Weber, Peter Webster, Michele Weigle, Jane Winters, and Nicholas Worby have all informed the research found in this book. The Internet Archive as an organization is wonderful and deserves many thanks for being so hospitable. Tom Peace has been a great sounding board for many of these ideas as well. My sincerest thanks as well to William J. Turkel, who mentored me in digital history and started me down the roads that ultimately led to this book. Finally, thanks to the British Library, Royal Danish Library, Bibliothèque nationale de France, Library of Congress, and Library and Archives Canada for hosting me for visits, talks, or tours. I came away inspired by all of the wonderful work being done in every single one of your institutions. McGill-Queen’s University Press has been a fantastic venue to publish with. My sincerest thanks to Kyla Madden, senior editor, who stewarded the manuscript through the process and took such care in thinking about the major arguments at play as well as how to make them clear to readers! Her suggestions and insights helped make the manuscript far stronger than it would otherwise be. Thanks also to the anonymous peer reviewers who helped refine the book as well. Thanks as well to Ian MacKenzie for his astute copy edits. All errors or omissions in the manuscript, of course, are mine and mine alone. Elsewhere in the publishing world, my thanks to journals and editors that allowed some of the early dissemination of this work and gave their permission for portions of those works to appear in this book. Some early ideas for this book appeared in “Mining the Internet Graveyard: Rethinking the Historians’ Toolkit,” Journal of the Canadian Historical Association 23, no. 2 (2012): 21–64; and “Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives,” International Journal of Humanities and Arts Computing 10, no. 1–2 (2016): 87–94. Portions of chapter 5 were adapted from my open (CC-BY) chapter “Welcome to the Web: The Online Community of GeoCities and the Early Years of the World
Wide Web,” in The Web as History, edited by Ralph Schroeder and Niels Brügger, 137–58 (London: UCL , 2017). Similarly, my sincerest thanks to the European Centre for Nuclear Research (CERN ), RAND Corporation, the Internet Archive, Columbia University Libraries, Professor Stéfan Sinclair, and Rhizome for giving permission to use images in this book. Financial support for the research that underpins this book was provided by the Social Sciences and Humanities Research Council’s Insight Grant program and an Ontario Early Researcher Award. Our infrastructure and tools development work are currently supported by the Andrew W. Mellon Foundation. Other support has come from Compute Canada. Finally, thanks to my family. My parents, Cecile, John, Peter, and Terry probably know more about web archiving than they wish they did. Certainly my wife and partner Jennifer Bleakney does! Thanks Jenn, for being there, reading persnickety chapters, and enriching my life in every single way. I feel like the luckiest person in the world every morning when I wake up. Our son Auden Emerson Milligan was born as I wrote the first draft of this book. Two smartphone-wielding parents probably means that Auden has more pictures sitting in the cloud than anybody could have conceived of only a decade ago. His world, even more than ours, will be shaped by the forces of historical abundance. For that reason, at the end of the day, this work is for him.

HISTORY IN THE AGE OF ABUNDANCE?

INTRODUCTION

Our collective cultural heritage, the legacy that we leave behind for the generations to come, faces a serious problem in the digital age. We used to, as a rule, forget. Now we have the power of recall and retrieval at a scale that will decisively change how our society remembers. For historians, professionals who interpret and bring shape to narratives of the past, this is a dramatic shift. The digital age brings with it great power: the prospect of a more democratic history and of more voices included in the historical record, a realization of the social historian’s dream. Yet it also brings significant challenges: what does it mean to write histories with born-digital sources – from websites written in the mid-1990s to tweets posted today? How can we be ready, from a technical perspective as well as from a social or ethical one, to use the web as a historical source – as an archive? Historians with the training and resources are about to have far more primary sources, and the ability to process them, at their fingertips. What will this all mean for our understanding of the past? How can these sources be used responsibly? Finally, if historians cannot rise to the moment, what does this mean for the future of our profession? The problem can be summed up in something as innocuous as a personal homepage, hosted on the free GeoCities.com service. GeoCities. com, founded in 1994, provided free websites to anybody who wanted to create one. A user would visit GeoCities.com, enter an email address, and receive a free megabyte (later two, then ten) to stake an individual space on the burgeoning Information Superhighway. These sites took many shapes and sizes: a Buffy the Vampire Slayer fan site, a celebration of a favourite sports team, a family tree, even a young child’s tribute to Winnie the Pooh. Early web users flocked. By October 1995, the first
ten thousand users had created their sites. Two years later, a million had. And by 2009, some seven million users had created accounts on GeoCities.com. While just how to define a webpage itself is a complicated question – as we explore in this book, a webpage is really a collection of resources and files (from text, images, embedded widgets, movies, to style sheets, or other interactive components) that are pulled together by your web browser – it is clear that these 7 million accounts are emblematic of the growing scale of web archives. Indeed, if we count these unique webpages, we see that there were over 186 million of them. Yet if you visit GeoCities.com today, you will see only an advertisement for one of Yahoo!’s website hosting services – no evidence at all of the human activity that had once happened there. Without web archiving, these websites would have been, for the most part, irretrievably lost. Thanks to the Internet Archive, however, we can today “visit” GeoCities. The San Francisco–based Internet Archive, founded in April 1996 by Brewster Kahle and Bruce Gilliat, grew out of a recognition that people were increasingly living their lives online and that this culture was in danger.1 Legally designated a library in 2007 by the State of California, the non-profit Internet Archive now has expansive holdings, including hundreds of billions of webpages, millions of books, audio recordings, videos, and images, and over a hundred thousand software programs, all freely accessible at http://archive.org.2 When you go to the Internet Archive’s website and access one of their archived websites, you are accessing data stored on a hard drive within their repurposed Christian Science church in San Francisco’s Richmond District (see figure i.1). In many ways, the founding of the Internet Archive in 1996 was a visionary act. The library was a critical reaction to the fact that our society was undergoing a medium shift in how we document our lives, societies, and cultures. As the West rapidly transitioned from traditional print sources, which we had preservation strategies for, towards digital bits and bytes, the transitory nature of websites and content opened up an opportunity for the Internet Archive to help preserve our cultural memory. The Archive does so in part by collecting the world’s largest web archive, the generic term for the collections of web content for use by present and future historians, researchers, and even members of the public; web archiving is the process of gathering, storing, preserving, and making accessible that data.3 As we collectively move online,


Fig i.1 A rack of servers at the Internet Archive in San Francisco. The light in the middle of the left rack shows an access operation happening, presumably a user finding content amongst the billions of resources stored there.

we also produce far fewer traditional paper records than we did before, making this preservation process increasingly indispensable. You can visit the Internet Archive’s Wayback Machine at http:// archive.org/web and “go back in time” to visit the rudimentary, often broken, archived copies of personal (or corporate, or educational, or beyond) websites collected between 1996 and 2009. The Wayback Machine – the name comes from Sherman and Mr Peabody’s “WABAC machine” from the Rocky and Bullwinkle Show – is the Internet Archive’s portal to let you access their collections. Try it yourself by visiting the Wayback Machine and searching for http://GeoCities.com. In the case of GeoCities, it offers over 186 million distinct web addresses, some 32 billion words to index, search, and explore. This information is now at the fingertips of anybody with an internet connection and a web browser, to remotely access and explore information held on those hard drives in California. That the voices of so many could conceivably reside on a (big) hard drive speaks volumes about how the historical record is dramatically changing. GeoCities exemplifies the shifting scope of how we gather, preserve, and provide access to our culture’s historical
record. Yet it also raises questions around how we can responsibly use this content, as very few of those authors would have realized when writing that their words would one day end up in a large archive. The amount of data, in human terms, that the Internet Archive and other web archives have preserved is mind-boggling. Indeed, GeoCities is just one small part of the broader constellation of born-digital material being retained. At the time, GeoCities existed alongside competitor services like Angelfire or Tripod. Users subsequently left GeoCities for the greener fields of Friendster or MySpace before settling in places like Facebook or Instagram today.4 Data scientists Erez Lieberman Aiden and Jean-Baptiste Michel have compared the record left to literary scholars studying Edgar Allan Poe (422 letters) with the record the average American leaves behind today. Most of us generate thousands of emails, likes, Facebook and Instagram pictures, YouTube videos, Dropbox files, and beyond. As they note, “This material comprises an astonishingly detailed record of the lives of billions of people – a record that did not exist at all mere decades ago. It has no precedent in human history.”5 We see this everywhere, from the high-quality cameras many of us carry in our pockets to document the mundane, to the ways that many (myself included) live their lives online through social media. Yet having all this material saved does not mean that it will be accessible – not only does the data need to be stewarded, but it also needs to be discoverable – in many cases thanks to metadata (or “data about data,” discussed later in this book) that can describe or make sense of all of this data. Millions of webpages are useful only if you can make sense of all of that data. This book explores the medium shift that underpins web archives, arguing that our historical record is being and will be profoundly affected by the advent of these massive born-digital repositories. It does so in three main respects. First, it introduces foundational elements: what historians need to know to get themselves up to speed on how the web, its constituent parts, and the process of web archiving works. Historians tend to employ largely implicit methodologies, but as we move into new types of sources that our training has not equipped us to deal with, we need to be more explicit. Second, the book explores the new questions that historians are now able to ask, thanks to web archives, and what kinds of subjects they can now explore. These new questions stem from the twin dimensions of our shifting historical record brought by web archives: their larger scale (more data than ever before) but also
their much larger scope (much of that data is being produced by people traditionally left out of the historical record). Third, through methodological case studies, this book provides pathways for how historians can use web archives. What types of approaches, methods, tools, and search functions can a historian use to turn web documents into historical sources? Conscious of how specific technical details can date quickly, this book draws on abstract principles whenever possible, although the final chapter provides some concrete pathways for a historian wanting to get started in this field. Why does the book focus on historians? It may seem premature, in some respects, given that the number of historians focusing on post-1996 topics where they can directly use web archives is still quite small. Right now, librarians and archivists lead the conversation about the collection, access, and preservation of web archives. Indeed, those professionals are leading the extremely complicated conversation around how to preserve and make accessible digital material in perpetuity. My own work is informed by the great work of librarians and archivists, as my acknowledgments and the chapters that follow note, in particular my long-time collaborator Nick Ruest at York University. Yet historians will be among the primary future users of these materials, as the professionals who interpret and give shape to our understanding of the past. They are in danger of being left behind as research topics begin to consider the 1990s. Decisions are right now being made by information professionals, involving all phases of the collection, discovery, search, and preservation infrastructure of the web, that will have dramatic consequences for historians. More importantly, as critical scholarly users, historians need to wake up to the changes ahead, so that they can be ready for the algorithms that will shape their research for decades to come. This does not mean abandoning traditional research methods – historians will long continue to be masters of close reading and parsers of nuance and context – but it does mean that new skills to better contextualize and understand digital material are needed. It is not just historians who use web archives who are seeing their research methods transformed by the digital. Indeed, web archives may be just one prominent symptom of a much larger revolution in historical scholarship. While the focus of the book is squarely on born-digital resources like web archives, many of the lessons and experiences drawn from it could be extrapolated to the experience of historians
with digitized primary sources. Whereas historians had been obliged to travel to archives for most of their primary documents, an increasing number of resources are now available through digital search portals: from ProQuest to Google Books to the Hathi Trust. A historian who studies the nineteenth century is now increasingly confronted by source abundance as well: so much is digitized that it may be necessary to rely on technology to bring order to it before finding the “right” documents to read. Even historians who rely on traditional archives are increasingly taking photographs of documents, bringing cameras or cloud storage folders brimming with data, which is then analyzed from the comfort of their desks. Some of these commonalities are mentioned throughout this book. However, while there is overlap, web archives are unique in many important ways – the conditions of their formation, the ethical issues surrounding their creation and preservation, the role of traditional archives and other knowledge keepers, and the novelty surrounding them – that a volume focusing on them is warranted. Let us now turn to the issues that make web archives unique and challenging.

Our New Digital World: From the Suburbs of Toronto to the Crimea

Web archives represent a new means of knowledge acquisition, distribution, and beyond, in ways that affect how we might understand everything from the life of a child living in a North American suburb to geopolitical conflagrations around the world. Through two brief vignettes, I will shed light on some of the major issues that we will explore in this book. In the expansive suburbs of 1990s Toronto, Canada’s largest city, an eleven-year-old boy sat at his personal computer. Pulling up the Usenet discussion group rec.games.miniatures.warhammer, he posted a question about the popular board game Warhammer 40,000 (a tabletop game in which players manoeuvre small armies of models against each other, following intricate rules). He asked how people playing team games together should proceed when they decide to fight amongst themselves, a house rule invented by a group of middle school friends. The many responses ranged considerably. The details themselves do not matter, but their existence raises an interesting point – not least to me, the post’s author. It is my own first trace of a digital presence, preserved today in
multiple web archives for all to see if they are so inclined. It forms part of a chorus of historical voices that, while not terribly useful in isolation (my ruminations on board games may not be historically significant in and of themselves), can in aggregate shed light on human culture, activity, and interaction across cities, regions, and countries. This example illustrates several key characteristics of our changing historical record. First, the musings of an eleven-year-old child, published from home, are today available to historians sitting at any internet-connected computer. Even if it might be difficult to determine that I was a child when I posted my question, it represents the sort of behaviour and activity that before would have been preserved only rarely – and is now preserved as a matter of course. Diaries, even from the 1960s or 1970s, are rarities to be found in an archive with excitement. If I had been eleven in 1930 and had written extensively about the games I had played, it would have been a meaningful source for historians of that period (in the extremely unlikely event that it was preserved by a special collections unit, as opposed to ending up in a basement, attic, or garbage can). Such digital traces are now so ubiquitous that finding something specific among the available information is the real challenge. Rather than being scarce yet valuable, these posts are now so commonplace as to be a nuisance. Scarcity was frustrating, but super-abundance brings its own challenges. The inherent frustration in dealing with too much data, however, needs to be tempered with the realization of what a revolutionary shift we are witnessing as historians. As Jason Scott, a web archivist and historian who spearheaded the efforts to preserve GeoCities and other online communities, has noted, “At a time when full-color printing for the average person was a dollar-per-printed-page proposition and a pager was the dominant (and expensive) way to be reached anywhere, mid 1990s webpages offered both a worldwide audience and a near-unlimited palette of possibility. It is not unreasonable to say that a person putting up a webpage might have a farther reach and greater potential audience than anyone in the history of their genetic line.”6 Indeed, a young Dane or British resident posting a webpage today about a favourite board game is likely to be included in a respective national library’s legal deposit web archive – such musings accorded similar status to that of a published book. Some European countries have undertaken web archiving since the mid- to late 1990s, following
different models that we will explore later in the book. Other countries like Canada are beginning to explore legal deposit models that might see similar respect afforded to individual blog pages and posts. This underscores the transformation in the sheer amount of publication that can now take place. Second, my Warhammer 40,000 post raises questions that are endemic to web archives created by ordinary people: how can we ethically use sources like this? Did my eleven-year-old self expect privacy when I wrote that in 1995, let alone have any inkling that in twenty years an adult version of myself could find it in a publicly facing database? I am certain that it did not cross my mind. While the high schoolers with whom I and other scholars have spoken in Canada and the United States are now largely aware of online privacy (witness the rise of Snapchat, whose ephemerality is at least part of its appeal), these concerns are still with us. Old tweets and blog posts can be taken out of context and used years later, leading some journalists to speculate if users should proactively delete their own histories.7 We return to these ethical dilemmas later in this book, as they lie at the core of these sources. In their opportunity and abundance lurks risk. My second example operates at a different scale. If web crawlers – the internet robots programmed by projects such as Internet Archive to crawl the web and save its pages – have reached into suburban living rooms and added content to their web archives, so too can they reach across oceans and capture content from war zones and beyond. One notable example came during the 2014 occupation of the Crimea by the Russian military, when masked gunmen seized the Crimean Centre for Investigative Journalism, a local non-profit media organization. “From this building does not come true information,” one rifle-toting gunman declared to the staffers inside. While reporters were assured that they could continue to work if they were more “truthful,” they decided to escape. Yet they could not bring all of their equipment with them. As with most modern media institutions, the centre’s institutional memory was digitally stored: from back issues, video recordings, and beyond. Their news website, Investigator (http://investigator.org.ua/), a full-featured media site, with news, videos, and feature articles, was their main publication venue. What would happen with their computers, left behind? What if their offices were searched? Passwords cracked? Servers penetrated before access rules could be changed? As journalist Bob
Garfield noted, “Thirty goons break into your office and confiscate your computers, your hard drives, your files … and with them, a big chunk of your institutional memory.”8 With the journalists out of their office, preservation efforts would come from afar. Over ten thousand kilometres away, in their former Christian Science Church in San Francisco, the Internet Archive sprang into action. The Investigator had been visited by Internet Archive web crawlers a few times before, as part of their general attempt to harvest as much of the publicly accessible web as they could within their resources and other constraints. Yet the sheer number of websites scoped and gathered in those sweeps meant that it was a spotty collection and far from complete – when working at such scale, you cannot really download everything. There was one snapshot in 2009, four in 2010, and a few dozen throughout 2013, but web crawlers had not visited since February 2014, almost a month earlier. A lot had happened in those intervening weeks, however: the crisis in Kiev followed by the Crimean invasion. Librarians, information professionals, and archivists at the Internet Archive were able to quickly and decisively save this information. Backing up the Investigator’s back issues, websites, videos, and digital archives did not take a modern-day task force of soldiers, or somebody to sneak in and photocopy hundreds of pages, or any subterfuge at all – just a web crawler reaching across the ocean. The process happened quickly. The crawler visited the homepage of investigator.org.ua and saved a copy. It then made a list of all the links on that page and went to each of them, downloading a copy of each. This process continued until the entire website was saved. Between the first and nineteenth of March, the site was saved fourteen different times: fully searchable, videos preserved, and today we can see how news stories on this website evolved throughout the Ukrainian conflict.9 In an earlier age, the Crimean Centre documents might have been an unavoidable casualty of war. We can now reach across space and save information that is not only important in the present, but will be of significant value to historians in the future. The Investigator – an on-the-spot and independent media outlet – is just the kind of source that would be very useful for historians, the citizens of Ukraine, and others seeking to
understand a period or event. Moreover, all news now breaks first on the web – it is where the story spreads and where it can also be controlled and suppressed. The ability to capture and archive information before it disappears will allow historians to construct a more complete, if complex and even confusing, historical record. While not without its own pitfalls – the spectre of the Global North reaching to the South to archive materials without their consent has been raised by some scholars, a point I return to later in this book – the long and quick arm of the web crawler will make for a much richer and more representative historical record. Consider that it took years for Daniel Ellsberg to copy and leak the Pentagon Papers in 1971 – and only a few months for Chelsea Manning to copy hundreds of thousands of military logs, diplomatic cables, and videos in 2010.10 The burdensome and potentially risky process of copying documents at a machine, dozens at a time, can now be done with a few keystrokes and mouse clicks. We can now quickly and safely preserve and disseminate vast quantities of information – in the Crimea, this was a way of fighting back against the gunmen who sought to silence the free media by breaking into their offices. While we use similar technical approaches to deal with the White House or the Liberal Party of Canada and with a child’s board game blog, they introduce ethical questions. The president of the United States does not expect privacy when he or she posts anything on the web; however, you or other private citizens may. The lines between all of these sources can also be very fuzzy. All of these dimensions underpin the book you are about to read. I argue that web archives will force the reshaping of how we train, write, and disseminate our histories. Historians and our society more generally need to address this shift. All of this is coming sooner than we may think.
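The capture record described above is itself open to inspection: the Internet Archive exposes an index of its snapshots through a public CDX endpoint. As a minimal sketch (assuming the CDX server at http://web.archive.org/cdx/search/cdx and its url, from, to, and output parameters, and using only the Python standard library), a researcher could count the March 2014 captures of investigator.org.ua:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Query the Internet Archive's CDX index for captures of a single page.
# The endpoint and parameters follow the publicly documented CDX server
# interface; treat this as an illustrative sketch, not a guaranteed API.
params = urlencode({
    "url": "investigator.org.ua",  # the Crimean Centre's news site
    "from": "20140301",            # 1 March 2014
    "to": "20140319",              # 19 March 2014
    "output": "json",
})
with urlopen(f"http://web.archive.org/cdx/search/cdx?{params}") as response:
    rows = json.load(response)

# The first row is a header (urlkey, timestamp, original, mimetype, ...);
# every subsequent row describes one archived capture of the page.
captures = rows[1:]
print(f"{len(captures)} captures of investigator.org.ua, 1-19 March 2014")
for row in captures[:5]:
    timestamp, original_url = row[1], row[2]
    print(timestamp, original_url)
```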

Web Archives and the Writing of Inclusive Histories

Web archives are not a niche concern. Indeed, coming to grips with what they represent is an issue at the heart of the future of the historical profession. Some examples can help to bring this into relief. Think of the sources that we now encounter and that are created online. The traditional media increasingly reaches the majority of its audience through websites, which can in some cases evolve online and differ dramatically from their print counterparts – a historian studying the twenty-first
century, for example, should not rely simply on the print edition or database searches. Small and large businesses alike use websites to market, sell, and describe products, as well as to attract investors and excitement around corporate brands. Government departments at all levels engage the public through their websites, providing information on services, issues, and policy objectives; tracking how these have changed over time can shed light on the nature of the civil service and how they implement the whims of elected leaders. Political parties use websites and other web-based platforms, such as social media, to reach out to the general public during election campaigns and in between. Activists and social justice organizations bring large communities together online, such as #OccupyWallStreet in the United States, #IdleNoMore in Canada, and #MeToo globally, forming a treasure trove of information for historians and scholars seeking to understand social movements. The list goes on: charities, non-profit organizations, health websites, cultural organizations from museums to operas to local community theatres and local anime aficionados, and beyond. While much of this is speculative – as noted before, the number of historians engaged in post-1996 scholarship is still relatively small – it is easy to realize how web archives could transform historical scholarship. Political historians can use web archives to gather information about elections and the everyday process of making policy: from blogospheres surrounding judicial nominations, for example, to the establishment of news spheres around issues, politicians, and beyond. The Trump presidency, for example, has had a vivid online presence. From Trump’s Twitter persona to ongoing feuds with newspapers like the New York Times and the Washington Post, many of the high-profile moments of his administration have taken place online. It would be impossible to reconstruct this period of American history from printed sources alone; impossible to write a history of Trump without webpages and Twitter archives. Yet if political history is one of the obvious examples that we often turn to when studying web archives, all historians will find their studies affected by this medium shift. Examples can help illuminate the sheer range of historians who will be able to draw on these sources to revolutionize research projects. Military historians studying conflicts from the late 1990s onwards will have the voices of both rank-and-file soldiers, posting on discussion boards and other venues from increasingly connected overseas bases, and their families and friends establishing support networks at home.
Historians of capitalism or business can use corporate websites to explore how companies evolved; more still can use their advertising campaigns to see how society was reflected in them. Gender historians can explore changing conceptions of identity, playing out within online communities, as well as the toxic masculinities of the #GamerGate protest movement, for example. Historians of race can explore antecedents, reactions, and discussion about the events of Charlottesville, Virginia, or Ferguson, Missouri, or innumerable other cases that show the salience of that category to this day. Historians of children and youth will have information about what young people thought, or at least wanted others to think that they were thinking about. Cultural and social historians tackling topics in any period after the mid-1990s will have a profound insight into the thoughts, desires, decisions, and everyday activities of ordinary people. This medium shift offers profound opportunities for all kinds of historians, as well as challenges. All of this is to emphasize that no historian can ignore the phenomenal transformation of information and records of experience online, underscoring the necessity for the historical profession to engage with the impact of web archives. To neglect the web would be to ignore the main medium for communication, publishing, social interaction, commercial enterprise, and creative activity since the 1990s. Consider some examples. How could a historian understand a social movement like #MeToo without considering websites and social media? Using only print sources like the New York Times or the Globe and Mail would give a skewed perspective, especially as events occurred during a challenging time for traditional media. Or reaching further back, could a historian explore the terrorist attacks of 11 September 2001 without considering how they unfolded on the web and how information spread, to make sense of rapidly changing stories in the hours, days, and weeks after the attacks in New York, Washington, and Pennsylvania? Could a historian write a history of youth cultures that did not consider the narratives of freedom and oppression carved out online from the late 1990s onwards? To write these histories through traditional print sources – newspapers, government documents, other print ephemera preserved in archives – would be dishonest. Yet it also requires an ethics of care, as I discuss later in this book: people have rights to privacy online, both today and in the future when their archived material is used, requiring a careful balance of expectations of privacy and of scale to use responsibly.


One of my favourite examples underscores the scale shift in the historical record. The Old Bailey, which preserves the transcripts and proceedings of criminal trials in London’s criminal court, includes 197,745 trial transcripts published between 1674 and 1913. It is described without hyperbole on its website as “the largest body of texts detailing the lives of non-elite people ever published.”11 As a rule, everyday Londoners in the middle of the eighteenth century did not leave behind historical sources. We instead reconstruct their lives from when they appear in birth and death registers, or come into contact with the legal system, or appear in ecclesiastical records. My own historical work on Canada’s First World War and the 1960s bears this out: information about individuals, even those historically significant by contemporary standards (for example, activist newspaper editors, student leaders, high-profile union organizers) is fleeting at best.12 James Gleick explains this well in his masterful The Information: A History, a Theory, a Flood: “The information produced and consumed by humankind used to vanish – that was the norm, the default. The sights, the sounds, the songs, the spoken word just melted away. Marks on stone, parchment, and paper were the special case.”13 The 186 million documents in GeoCities, created by 7 million people, begin to hint at what this means for historians. In the 239 years that the Old Bailey collected, we have under 200,000 sources, which represent an exhaustive treasure trove of information; in the fifteen years of GeoCities, we have nearly 200 million documents created by 7 million individuals. The Old Bailey was the largest body of non-elite text ever preserved and presented – at least until the rise of web archiving. The Old Bailey versus GeoCities is just one example that helps underscore the rise of preservable, born-digital text. Before the web, most conversations took place via speech – now people are “writing in a growing number of social media, including emails, blogs, chats and texting on mobile phones.”14 This is a key difference between born-digital and digitized primary sources. More importantly, this writing is being preserved at scale. So too is the growing ability to transcribe speech, from computer systems to the accessibility of closed-caption television transcripts. Web content is a big part of this constellation of text that we can now preserve, alongside increasing numbers of images, videos, sounds, and beyond. In short, we are facing a step change in how we preserve and access historical content.


Who’s In and Who’s Out?

Web archiving will not produce a “complete” record of our world, nor should it – we do not live in an all-seeing surveillance state. No archival record has ever captured, nor probably could any representation capture, the essential humanness, richness, and complexity of the lived life around us in its entirety. But we can keep a few cautionary notes in mind when we use web archives as a mirror through which to explore society. First, as I discuss in this book, web archives are not only necessarily incomplete; these absences are not necessarily “bad” – there are serious ethical imperatives at the heart of using web archives in general. To do so, I explore the GeoCities archives in chapter 5 – hundreds of millions of pages created by everyday people – and explain how the historical use of these documents needs to be weighed against the ethical imperatives around the right of privacy, as well as the factors of informed consent and harm. I also explore the parts of the web that do not end up in web archives, such as Facebook or the record of Google searches. In the wake of the Cambridge Analytica scandal, which saw personal Facebook data being initially harvested for academic purposes and subsequently misused for political microtargeting, we are reminded that not only will this material not be accessible to academics – it also should not, as a result of the privacy expectations surrounding those platforms. But we need to account for those silences in archives, to realize that certain parts of the web will be included in repositories like the Internet Archive, and other parts (such as Facebook) will not. Second, and related, we need to recognize that we do not all use and publish on the web in the same way. This force shapes the web archives that we have and underscores the need to use web archives ethically and transparently. We need to understand the web as a publishing medium to understand the archives that stem from it. Big data does not mean magical, all-encompassing, and inclusive big data. As Safiya Noble notes, “Users live on Earth in myriad human conditions that make them anything but immune from privilege and prejudice, and human participation in the web is mediated by a host of social, political, and economic access points – both locally in the United States and globally.”15 We can see this borne out in statistics and contemporary research. Within Canada and the United States, for example, people use the web
in different ways as the result of lines of class, gender, and ethnicity. As Pew Research found in 2017, for example, while internet usage is nearly ubiquitous amongst “young adults, college graduates and those from high-income households,” an adoption gap remains “based on factors such as age, income, education and community type.”16 The gaps are sobering: 98 per cent of adults with an income above US$75,000 use the internet, compared to 79 per cent of those under US$30,000; even more glaringly, 98 per cent of college graduates are online, whereas only 68 per cent of high-school dropouts are. Beyond that, 15 per cent of black Americans and 23 per cent of American Hispanics use smartphones but have no home internet. As Pew notes in general, “Reliance on smartphones for online access is especially common among younger adults, non-whites and lower-income Americans.”17 From the World Bank, we can see how these disparities are magnified across the planet; whereas Canada has an 88.5 per cent internet usage rate, South Africa has only a 51.9 per cent, and many other countries are far lower.18 All of this serves as a critical reminder that despite the volume of data we are working with in web archives, it has significant gaps and omissions structured by categories of class, gender, ethnicity, and locale. Many will not be included in this record because they are not present on the web or are engaging with that platform in a different way. But of course the same has been true of all archival records – they have been shaped by the collectors’ decisions and reflect contemporary biases, policies, and beyond. Even if a Twitter or blog post makes it into a web archive, too, that does not mean that it accurately represents “the real world,” as anybody who has spent a minute on the platform knows – there is not just the problem of “fake news,” for many of us sculpt and curate our online presences. This all means that we need to be conscious of a web archive’s bias, just as we are conscious of how a national library or university special collection can be biased as well. No archive is a true reflection of the world. Even if people publish online, their contributions do not necessarily find their way into the Internet Archive or another global web archive. Web crawlers have no magical way to find all content on the web, and indeed they need to generally find some way to a personal website by following link after link after link from a large set of starting pages. A large Internet Archive crawl might begin by sending web crawlers to the million most popular webpages (as determined by Alexa, a company that
evaluates website popularity in order to help value advertisements on them). A crawler arrives at a page, saves a copy of it, notes all the links on that page, and then begins to follow those links. For each page it arrives at, it repeats this process. A large crawl of the web can accordingly take months. The 2011 Wide Web Crawl carried out by the Internet Archive began in March and ended only in December, for example. It is certain that by following links a web crawler will find a university or large corporate page; likely that it will discover an academic’s page; but less and less likely that it may find an individual’s blog. How web archives are shaped by the vagaries of the web archiving algorithm is a key part of this book. Web archives are collected by technologists, librarians, archivists, historians, and others who design, configure, and monitor the crawlers, which find and collect content. This human element means that the underlying algorithms – like all algorithms – unavoidably reflect people’s biases and decisions.19 Accordingly, sources are unavoidably gathered unevenly and unequally. This is not meant as a dig: there could be no complete and all-seeing archive of our time, and we cannot pretend that there could be. The selection bias found in all collections, whether digitized or not, needs to be understood, as it underpins any research done with this material. When historians use conventional archives, they know – or should be able to determine – the nature and origin of a given collection of documents and be alert to the subjectivities (and silences) that may characterize a collection and its cataloguing. The same holds true for web archives. In a world where everything appears to be digitized – but of course, everything is not – these cautionary notes are all the more important. To give an example, digital collections of webpages can be skewed by the unequal power of hyperlinks on the web. Crawlers that start finding links by following them from the top one million popular websites, for example, are far more likely to acquire IBM ’s website, or that of an academic at Harvard University, than they are to find a South African child experimenting with a personal website. This all shapes what and who is included in this new archival record, and what and who is not. Alongside these critical questions, we need to recognize that we as a society and a historical profession will still be left with an incomparable historical record in size, scope, and dimension. The challenge will then be to discover and represent the breadth and depth of the web’s historical record, and the full diversity of its creators and subjects.
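The crawl logic sketched above (fetch a page, save it, extract its links, follow them, and repeat) can be illustrated in a few dozen lines. The sketch below is not Heritrix or any Internet Archive software; it is a deliberately simplified, hypothetical breadth-first crawler, with none of the politeness rules, scoping, or deduplication a real system needs. Its point is simply to show why coverage depends entirely on the seed list and on which links happen to exist: a page that nothing links to is never found.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    """Breadth-first crawl: start from seed URLs and follow links outward.

    Everything reachable from the seeds within max_pages gets "archived"
    (here, simply recorded); everything else is invisible to the crawl.
    """
    queue = deque(seeds)
    seen = set(seeds)
    archived = {}
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead or unreachable pages simply drop out of the record
        archived[url] = html  # a real crawler would write a WARC file instead
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archived

# Example: pages never linked from the seeds will never be collected.
# collection = crawl(["https://example.com/"])
```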


Even if we have the content, it can be very difficult to use. Webpages themselves are technically difficult to use. Webpages – in the sense of a single document sitting in an archive box – do not exist. The complexity of these documents, comprising many constituent components such as images, videos, text, and beyond, brings with it considerable challenges we will discuss throughout this book. Writing history using the web does not mean that all historians need to change, or that previous methods of knowledge, research, writing, and beyond are suddenly facing obsolescence. This cannot be stressed enough. But it does mean that historians tackling a research question after 1996 need to approach things very differently from the way they did before. This is the era of Big Data, which I define as simply having more data than you could read yourself in a reasonable time – and which in turn lends itself well to computational intervention to make sense of it.20 Big Data is not better than earlier forms of information, simply different. The claim by Wired editor Chris Anderson – who argued in 2008 that we were entering a “world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear … with enough data, the numbers speak for themselves” – is false.21 The idea that, as the sample size increases, our analyses will reveal “truth” quickly falls apart as we realize how uneven our collections are, the unavoidable biases, the vagaries of the algorithm, and the librarian’s subjective decisions. We will never have a perfect archive. Web archives do not speak for themselves, nor can they be understood without recourse to the vast array of other sources that can contextualize the knowledge they contain: from oral interviews, to media reports, government documents, critical theory, and beyond.
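A back-of-the-envelope calculation makes the “more data than you could read yourself” definition concrete. Using the 32 billion words cited earlier for GeoCities, and an assumed reading speed of roughly 250 words per minute (the reading speed is my assumption, not a figure from this book):

```python
# Rough arithmetic only; 250 words per minute is an assumed reading speed.
words_in_geocities = 32_000_000_000  # ~32 billion words, as cited above
words_per_minute = 250

minutes = words_in_geocities / words_per_minute
years = minutes / 60 / 24 / 365

print(f"{years:,.0f} years of continuous, around-the-clock reading")
# -> roughly 240 years, with no sleep and no second readings
```

Even under these generous assumptions, a single reader would need more than two centuries of uninterrupted reading for this one collection alone, which is why computational help becomes unavoidable.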

Historians Need to Be Ready Sooner Than We Think – and We Aren’t Ready Now

The 1990s will soon be history, the province of professional historical inquiry. Depending on how a historian views contemporary history, it may already be. We have so far recounted the advantages of using these web-based sources: a larger historical record, generated by more actors, working with more information of all kinds. The disadvantages will be the complexity of managing and using these sources – both in practical respects, but also the philosophical and ethical dimensions
– and their relative obscurity within the historical profession to this point. Yet without using the web, histories of the 1990s will be incomplete for the most part. Ignoring the web would be like ignoring print culture. There is no firm point for when events and periods progress from simply being the “past” to being ripe for historical exploration. The best way to make sense of this is to look at other historical events and see when they became topics for historical inquiry. When do historians tackle an era? In other words, when does the past become history? Take 1968, the global year of revolutions. By the mid-1980s, the first historical monographs on the 1960s began to appear.22 These books set the first historical narratives into motion, becoming touchstones in the historiography. A decade later, by the mid-1990s, more literature had emerged and the 1960s became a largely uncontentious topic for doctoral study.23 The year 1968 had become history, with few batting an eye when a historian studies that period (indeed, it is not uncommon for job advertisements for tenure-track professors to emphasize post-1950 or even post-1960 periods today). It took roughly twenty years for the first drafts of 1960s history to be written, and only ten years after that for it to be an uncontroversial part of the profession. The web is roughly the same age today as 1968 was when historians wrote their first drafts of the events of that tumultuous decade. It is difficult to pin down an exact birthdate for the web, but if we place it in the summer of 1991 with its public announcement and widespread availability, it is well over twenty-five years old today.24 While carrying out “recent histories” is challenging – issues range from a lack of anchoring historiography to historical participants who can “talk back” to copyright and privacy legislation – it is increasingly well-travelled ground.25 But carrying out histories with born-digital resources brings new questions, largely unexplored by historians. Soon we will need to confront these resources – and historians are not ready. This book hopes to change that.

History in the Age of the Algorithm

Given that no single human can read every page in any web archive of even middling size – indeed, not even a fraction of them; recall the 32 billion words that underpin GeoCities.com alone – we will need to rely on computers to make sense of the data. In the past, historians relied
on archivists and librarians (often far more than we realized) to make sense of our archival documents and to organize this material for us. Yet the scale of this data means that we historians need to do more of that organizing, finding, sifting, and beyond. Pressing technical challenges include questions of how these systems will decide what to prioritize, what not to, and what critically aware algorithms we can develop to find the information that we need. Historians also, as I note especially in this book’s conclusion, need to begin to transform their professions in respects as varied as professional development, hiring, training, and promotion. Institutions like the American and Canadian Historical Associations need to continue this conversation amongst departments, facilitating digital scholarship. Right now, it is very difficult to develop the skills necessary to use digital sources in an undergraduate or graduate program in history. These skills are necessary, for historical scholarship and for basic citizenship. Algorithms lie at the heart of almost all research with digital resources (which, of course, these days means most research, period). We see this every day when we run our own Google searches and rarely progress past the first page or two of the results. Google may have indexed diverse voices, but if their algorithms consign them to the fifth, or fiftieth, or five hundredth page of results, they might as well be non-existent. As Siva Vaidhyanathan has noted, the PageRank algorithm that underpins Google has “historically favored highly motivated and Web-savvy interests over truly popular, important, or valid interests … Being popular or important on the Web is not the same as being popular or important in the real world.”26 While our first instinct may be to explore web archives using a traditional web search engine, such as Google or Bing – presenting results, ranked via some sort of metric, on individual search engine results pages – I hope to show in this book that alternative methods will be superior for all but the most specific searches aimed at narrow topics. Indeed, by moving away from keyword search and towards alternative ways of exploring – whether through hyperlinks, frequently occurring phrases or terms, or other forms of data that can describe what we are looking at – we can begin to ensure that historians can open up the “black boxes” of proprietary or opaque search engines.27 By “black box” I refer to a system where a question goes into it, and an answer comes out, without really understanding what decision processes were involved in
giving the answer. Every time we use a proprietary search engine, such as Google, a decision is made behind the scenes to show a certain result as #1 – and another as #1,000,000, virtually ensuring that you will not click on it. At the very least, open access software will let us know how these mechanisms work. In the case of Google, much of the underlying algorithm is based on PageRank, a system that ranks websites on how many times they are linked to (and in turn weighs the value of their links by the prominence of the site sending them). While we return to PageRank later in this book, this underscores that historians also need to understand the underlying principles of information retrieval. In the digital age, a historian really does need to know what PageRank is.

A tangible example can help show search’s limitations when working at scale with web archives. One of my favourite examples of the black box (for researchers, in any case) that we need to interrogate comes from a Canadian web archive of some fifty political parties and interest groups, created by the University of Toronto and the Internet Archive and covering 2005 to today. This collection includes all of Canada’s major federal political parties, most minor ones, as well as a somewhat nebulous assemblage of political interest groups: environmental foundations, campaigns to ban land mines, and the Assembly of First Nations, to name a few. It also has sites linked to those sites, such as Facebook pages, Twitter feeds, or newspaper sites. The collection is extremely useful, and, without these resources, there would be a gap in our understanding of Canada’s recent political history.

When an institution like the University of Toronto wants to create a web archive, it has a few different options in how it can curate, collect, and preserve this information. Some institutions may build their own technical capacity and collect the web archives themselves using open-source software like Heritrix, a program that goes out on to the web and takes snapshots of pages systematically. Many others turn to outside providers, such as the Internet Archive’s subscription service Archive-It, which gives a front-end “dashboard” for a curator to use. They can enter the sites they want preserved, set limitations and rules, and leverage the Internet Archive’s considerable expertise to grab this material. This also provides simple access tools for researchers to access the material. Archive-It is a well-designed and powerful service, enriching our cultural heritage in innumerable ways every single day. Yet historians need to use such interfaces with thoughtfulness and care.
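To give a sense of what happens under the hood when a curator sets those limitations and rules, consider the following sketch, written in Python. It is emphatically not Heritrix, which adds politeness delays, robots.txt checks, de-duplication, and standardized WARC output; it is only a minimal illustration of the loop that every crawler performs: fetch a page, store it, extract its links, and follow only those links that fall within a defined scope. The seed URL and scope here are placeholders.

# A deliberately minimal crawler: fetch, store, extract links, follow in-scope links.
# Real crawlers such as Heritrix add politeness, robots.txt handling, and WARC output.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the target of every <a href="..."> tag on the page.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, scope, limit=25):
    queue, seen, archived = [seed], {seed}, {}
    while queue and len(archived) < limit:
        url = queue.pop(0)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable pages are simply skipped in this sketch
        archived[url] = page  # a real crawler would write a WARC record here
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            # The scope rule: only follow links that stay on the chosen domain.
            if urlparse(absolute).netloc.endswith(scope) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archived

pages = crawl("https://example.org/", scope="example.org")

Even a toy like this makes the curatorial stakes visible: the scope rule and the page limit decide what ends up in the “archive” and what silently falls outside it.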


Consider the results when one searches using Archive-It’s simple search engine for Canadian Prime Minister Stephen Harper’s name. This can be found at https://archive-it.org/collections/227. The first result is his Facebook page from May 2009. Why May 2009? Why Facebook? Why not his political homepage? Even more interesting, the third result is an article by a Canadian journalist about his chief political advisor, a political scientist from Alberta. These are helpful results, and the search engine is extremely useful for very targeted queries, but as a starting point for serious research we are in trouble. With millions of results, we need to know how the search engine ranks results – and, just as importantly, historians need to be able to make sense of that method as well. It also underscores the scale and questions at play: over a million results are provided from this one web archive. The search engine is deciding what we see and read, not us. Just as importantly, the results would certainly be different if you – the reader – were to run them. More data has been continuously added, and, more importantly, the search algorithms and interfaces are continually refined, tweaked, and changed. The same search, run hours or months apart, might lead to very different results.

Of course, historians have long worked with archival records and documents mediated and made accessible thanks to the labour of others. The difference between that state of affairs and today is that historians cannot necessarily avail themselves of the fine-grained access and description work that they take for granted with conventional archival holdings. An example can help make this clear. Consider historians doing work at a national archive such as Library and Archives Canada. They discover that a given fonds, or a group of documents from a single source, exists, and then consult a finding aid to see box-level descriptions (or, if the archives have had sufficient resourcing, file or even document-level descriptions). For example, a box might be labelled “Correspondence and Publications,” and files might be broken down by branch offices or individuals. A box is ordered, and the historian works in a reading room to explore the files within. These finding aids were created by archivists, and the historian relies on their professional competency and ability to ensure that files are accurately described, and trusts that relevant documents were retained when records were accessioned and preserved in the archive. Archivists and librarians today are concerned with the evolution of these traditional information systems in the digital age.


While historians rarely receive formal training in archival practice, there is an implicitly assumed understanding of the process; the same is not true for born-digital records. With traditional access, archival and library labour fades into the background – it works so smoothly that historians often need to be reminded about the work that goes into making archival content so discoverable. With web archives, historians are suddenly without the professional framework and infrastructure they have long relied on, a topic I explore at greater length throughout the book and especially in chapter 6. To use the archived web as a historical source, we will need a fundamental understanding not just of web archives, but of the web, computational thinking, and the ways in which these currents fit into the much broader shift towards the digital humanities. This also requires renewed attention to interdisciplinary scholarship – working with computer scientists and librarians, for example, in order to garner new insights into human culture. As humanists begin to grapple with and explore the implications of the digital turn, in history as well as other professions, it is becoming clear that questions of “how to analyze, interpret, and exploit big data are big problems for the humanities.”28 As those potentially most affected by having many of their primary sources eventually become digital, historians need to become leaders in the digital humanities. Researching topics from the 1990s onwards requires it, and those topics will soon be firmly within the province of history – if they are not already.

The Scholarly Conversation and the Structure of This Book

This book makes an argument for the significance of web archives, as part of our cultural heritage and as an indispensable historical source. It is divided into six chapters, revolving around the central question of how web archives will fundamentally challenge, reshape, and enrich the historical profession: in training, research, and teaching. It tackles the question from several angles: a theoretical introduction to web archives, a technical overview, and explorations of how this material is preserved, stored, accessed, and analyzed. It touches on major issues of ethical concern as well, especially the implications of accessing the online lives of millions of users in both the GeoCities web archive and the legal deposit collections of national libraries.


Others have explored these questions before, though largely as practitioners.29 The Big UK Domain Data for the Arts and Humanities (BUDDAH) project brought together a group of humanists to explore self-selected research questions, unified only by their reliance on a common source: archived websites that fell within the United Kingdom’s “domain crawl.” By this, I mean all those resources that were on .uk domains (e.g., the British Library at bl.uk, or the Daily Mail newspaper at dailymail.co.uk) between 1996 and 2013. Their findings, now appearing in published form, provide a foundation for this project and others. Most scholarship has taken place within the web archival field, often without formal proceedings: from the International Internet Preservation Consortium annual meeting to the Web Archives as Scholarly Sources conferences that have been held in Aarhus, Denmark, at the University of Michigan, and at the University of London’s School of Advanced Study. As a review of the citations within this book shows, many of the conversations are taking place outside the conventional published literature: they happen on blogs, in library white papers, and in other forms of grey literature. What has been missing in much of this, however, has been engagement with “non-digital” or traditional historians.30

Scholars of internet studies and New Media, with their technical and critical understanding, have also been exploring the nature of the archived web. The implications of web archives from an archival and internet studies standpoint have been exhaustively studied by Niels Brügger, who has probably done the most to bring web archive researchers into a cohesive scholarly community.31 This book is very much in conversation with this scholarship, and it is woven throughout. The variety of perspectives at play in this emerging field of web history can perhaps best be seen in the SAGE Handbook of Web History, co-edited by Brügger and me, which brings together forty chapters from across disciplines.32 Brügger’s The Archived Web, released just before this book went to press, explores the web through the lens of media studies and historiography, providing an invaluable foundation for exploring these questions. As Brügger notes in his work, it maintains a “clear focus on the archived web as a semiotic, textual system.”33 This book, conversely, approaches similar questions in a very different way. First, it places more emphasis on the use and impact of the archived web; while I certainly discuss some of the foundations of web archives, the emphasis is on the transformation of historical research more generally. Accordingly, History in the Age of Abundance? is grounded
in my own perspective as a practising historian and in the sweep of professional historiography. The two books and their disparate perspectives complement each other well. With notable exceptions that we will explore in this book, historians have been relatively slow to adapt to this medium shift.34 This is both unfortunate and alarming. As scholars of history, we are (or should be) interested in what we can learn about the past from web archives. Some of this will be similar to a “web history,” if the internet or the web is our primary object of study, but increasingly much of this will simply be histories of all subjects, using websites and webpages as our new primary sources. In addition to reading these digital sources for content, historians need to pay greater attention to the origin and creation of the form of the source itself. Indeed, for web historians, the stories of the sources themselves may be of historical interest. But at the very least, all historians using web-based sources must understand that the origin and creation of the digital sources that they are using, and their modification over time, is an essential part of understanding the content and narratives themselves. These themes can be found in the chapters that follow. The first chapter, “Exploding the Library,” sets the stage by providing a conceptual understanding of the internet, the web, hypertext, and the contemporary shift towards the digital humanities. Historians who use web-based sources need to know what the web is, as the technical decisions made at the web’s birth continue to affect how we interpret, access, and ultimately make use of web-based and web-generated primary sources. It argues that the web presents a new form of media for historical researchers, requiring particular tools and approaches that do not easily map onto earlier, non-digital (or traditional) formats for research. Chapter 2, “Web Archives and Their Collectors,” explores web archives from the perspectives of the collectors. Who are they? How do they collect pages? What agenda and parameters do they follow? Just how big is the archive’s scope? The chapter begins with the fears, prompted by the irreversible loss of our very earliest web history, of a “digital dark age,” examining both what we can learn from that experience and how to grapple with the ever-present problem of digital preservation. The challenges range from technical issues (robots.txt, walled gardens) to social ones that systematically undervalue our digital heritage. It concludes with the twin case studies of AOL Hometown (a lost online community) and GeoCities (an archived community) and reflects on what we can learn from these cases today.


We then turn to the questions of archiving, accessing, and making use of this material. Chapter 3, “Accessing the Records of Our Lives,” explores the major issues of this field, such as the difficulty of capturing ever-changing content, the impact of changing standards, and how digital records will ultimately demand adjustments in the way historians approach their sources. Woven throughout the chapter are several case studies. First, we explore the “browser wars” between Netscape and Internet Explorer, ancestors of today’s Google Chrome, Mozilla Firefox, Apple’s Safari, or Microsoft Edge. As browsers bring together all the elements that make up a webpage, how they interpret code and instructions determines how we read the documents today. We then explore the data mining and metadata extraction carried out by the National Security Agency and others, before concluding with an extended exploration of the Canadian Political Parties and Interest Groups collection, which I study to compare metadata analysis with traditional content explorations. In this chapter, I argue that we need to move beyond traditional methods of close content reading to more technical analyses of metadata and similar new methodologies – this will mean developing our knowledge of Big Data and the tools to use it. This does not mean abandoning close reading of sources, but thinking about the analytical possibilities of abundant information. What questions can historians now ask that they never would have asked before, because the data simply was not there?

Chapter 4, “Unexpected Needles in Big Haystacks,” moves into the world of computer science and information retrieval to explore how we find the information that we need. By mobilizing search engines, high-performance computing, and national libraries, we can make sense of the vast array of cultural information contained in web archives. How can we scale our efforts so that web archives become accessible? In particular, I highlight the benefits of collaboration between historians, computer scientists, and librarians, drawing on the interdisciplinary work that I have been able to carry out with Nick Ruest (York University) and Jimmy Lin (University of Waterloo), amongst others. Wary of being too technical, as technology changes very rapidly in this field, I focus on underlying principles rather than code snippets or actual implementations.

Chapter 5, “Welcome to GeoCities, Population Seven Million,” explores the GeoCities.com web archive. It allows us to investigate a virtual ruin, evidence of a vibrant online community that was born, thrived, and declined between 1994 and 2009. Entering it, we also gain a sense of the ethical quagmire in which web archive researchers now find themselves when studying our recent digital past.
This chapter examines what we can learn and what we need to consider as we traverse the now archived GeoStreets and GeoAvenues of GeoCities, from the child-focused EnchantedForest to the festive BourbonStreet. In these places many users teased out their relationship with the web, building a foundation for the blogging and social networking explosion of the 2000s. GeoCities was a virtual city that users helped each other build. Our modern relationship with communications technology owes a great debt to this web service. In it, we can discover issues that will confront historians as they explore the web archives of everyday people at scale.

Finally, chapter 6, “The (Practical) Historian in the Age of Big Data,” looks at what a historian needs to know to begin to work with web archives at scale. Aware that providing overly specific technical details would quickly date the chapter, it instead focuses on broad trends and currents in the field of applied web history research. It begins by looking at the major trend in the field right now, the push towards greater accessibility of both collecting and analysis, before discussing how a historian can get started in this world. It concludes by exploring how these tools and their changes can have a dramatic impact on the sorts of historical questions that historians and others can ask of web archives.

The time for this is now. The stakes are high. Imagine a history of 2019 that draws primarily on print newspapers, approaching this period as “business as usual,” ignoring the revolution in communications technology that fundamentally affected how people share, interact, and leave historical traces behind. And when we do use web archives, we need to be knowledgeable about their functionalities, strengths, and weaknesses: we need to begin to theorize and educate ourselves about them, just as historians have been cognizant of analog archives since the cultural turn. The challenge is considerable, but the potential is even greater.



EXPLODING THE LIBRARY

In the 1976 American political thriller All the President’s Men, Bob Woodward and Carl Bernstein are intrepid Washington Post reporters investigating the Nixon administration in the wake of the Watergate break-in. Determined to figure out what is going on, they visit the Jefferson Reading Room at the Library of Congress, the three-story rotunda at the heart of the building. They are looking for the library records of E. Howard Hunt, one of the White House “plumbers” at the heart of the administration’s illegal activities, trying to see if he had been investigating Senator Teddy Kennedy. Initially rebuffed by a librarian, they find an ally in a younger clerk and ask for a year of library records. “I’m not sure if you want ’em, but I got ’em,” the clerk says. Bound stack of slips after bound stack of slips thud onto the reading room table. The script describes the scene: “WOODWARD AND BERNSTEIN seated at a table with from anywhere between 10 to 20 thousand slips of paper. In front of them, seated at a high desk, the bearded clerk looks down on them, shaking his head. It’s a staggering amount of work to thumb through.”1 The scene evokes information overload: dogged reporters facing down a massive record-producing institution of the American government. It is a reminder that too much information can be as harmful as too little. As a physical monument to knowledge, the Library of Congress is a useful touchstone. It is massive by almost every measure. Its 167 million items are housed across 1,349 kilometres of shelving; at a casual walking pace, it would take over eleven days of continuous walking to pass by every volume. In 1949, when information theory founder Claude Shannon reflected on sites that might store “information,” he put the Library of Congress at the top of a logarithmic scale: beginning with a single digit, through a page of paper, to the Encyclopaedia
Britannica, the Library of Congress represented the largest repository of information he could conceive.2 Yet today, the web archives of the Library of Congress may dwarf even its monumental physical collections in sheer quantity.

Libraries are embodiments of our cultures’ history and heritage. We use the destruction of the Library of Alexandria in antiquity as a symbol of collective cultural destruction and as an emblem of collective memory. Today, national libraries steward national heritage, from the US Library of Congress to the British Library, to the Bibliotheca Alexandrina in Egypt and the National Diet Library in Japan, to the Bibliothèque nationale de France, Library and Archives Canada, and many others. Others such as the Internet Archive fill a similar role, stewarding digital information, most of which does not have a counterpart in print. The scale of information produced today, all over the world, has meant that these libraries have had to adapt.

The amount of information being produced and stored daily is extraordinary. Indeed, there is a cottage industry in wowing consumers with all of the information assembled every single day. As of February 2016, every minute saw 400 hours of YouTube videos being uploaded; the Internet Archive had over 657 billion webpages in its Wayback Machine as of May 2018; Pinterest saw twenty terabytes (TB, or 1,000 gigabytes, or GB) of additional content being added every day in July 2014; and so forth.3 These figures are already dated, and the dates vary in the list above as a result of the proprietary nature of some of this information, but the precise figures do not actually matter – what matters is the overwhelming scale. We live in the most documented age ever. Simultaneous to this explosion of digital information, the amount of information stored in traditional, or analog, formats is dramatically shrinking as a percentage. We simply are not storing material today on paper, film, microfilm, tape, or other such media as we used to.4 This is relevant not just to those who study the web itself, but to historians who study any form of political, economic, social, or cultural phenomena. Our libraries are getting bigger, and they are changing.

Using web archives necessitates navigating through information repositories that are, by almost any standard, Big Data. We are all in Woodward and Bernstein’s shoes now, facing staggering amounts of data. What will this mean for historians? Will it be an insurmountable challenge, or will we be able to rise to the task?


This chapter introduces the basics a historian needs in order to get up to speed on how the web and web archiving are transforming historical scholarship. It begins with a crash course in the history of the internet and the web: where did these platforms come from, and how have they combined to produce this large new category of primary sources? It then explores the changing scale and scope of born-digital documents: just how many are being produced, but, more importantly, who is – and who is not – producing them? It also draws on historians’ experiences with digitized sources, arguing that our failure to transparently cite and engage with digitized newspapers bodes poorly for our ability to work with web archives at scale. The chapter then concludes by exploring the rise of the digital humanities and how historical debates about objectivity and subjectivity are both shaped and fundamentally unaltered by these changing notions of scale and scope. Ultimately, it argues that all of this has combined to position us within a new wave of historical scholarship – and that, as it is nearing three decades since the publication of the first webpage, we need to be ready sooner than we think.

“A Series of Tubes”: An Internet Crash Course

In June 2006, Senator Ted Stevens, a Republican from Alaska, stood in his country’s Senate and appeared to betray a shocking ignorance of how the internet worked. “I just the other day got … an Internet [that was] sent by my staff at 10 o’clock in the morning on Friday. I got it yesterday [Tuesday]. Why? Because it got tangled up with all these things going on the Internet commercially,” he declared, confusing his email client with the mechanisms of the broader internet. His next statement would become a punchline to a million jokes: “And again, the Internet is not something that you just dump something on. It’s not a big truck. It’s a series of tubes.”5 This emerged as symbolic confirmation that those with their hands on the levers of power did not understand how the internet works. As Googlers Eric Schmidt and Jared Cohen have argued, the “internet is among the few things humans have built that they don’t truly understand.”6 We today live in a society dominated by the power of networked communication but generally have not the faintest idea of how it came to be.

Studying web archives requires knowledge about what the web and the internet are today, and where it all came from: how does the internet work, how is information delivered, how does the web layer on top of the internet – and how has this changed since the internet was invented?
Answering these questions does not necessitate a degree in computer programming. Historians have long used newspapers, for example, without an intimate understanding of how printing presses or newsrooms have worked. But just as scholars have needed to know the basics of newspapers and their circulation – where they were published, what their political slant was, how they were distributed, how people would have read them (in their homes, or in cafés), how many people might have read them – they need to know the basics of the internet. It turns out that though we may not need an in-depth understanding of the TCP/IP protocol’s inner workings, we may at least need to know what TCP/IP stands for.

Some quick definitions are in order. The internet (in the sense of the global network that most of us are connected to) is the biggest internet (in the generic sense) or internetwork in existence. An internet, in a general sense, is a network of two or more networks. You might have a network at home or work. This can take several forms: for example, an Apple TV or gaming console hooked up to a television, a printer, two computers, and assorted mobile devices, all connected to a single router. I also have a network in my home. If my network connects to your network, sharing data and facilitating interconnections, we have an internet. The internet, then, that we mostly speak of today is the interconnection of millions of networks with a common set of standards to make sure all of the systems can exchange data seamlessly. It encompasses several different concepts. As Astra Taylor explains, the internet is best understood “as a series of layers: a physical layer, a code layer, and a content layer.” We usually think of the last of these – the finished information reaching our computers – but an awareness of the first two helps us truly understand the network.7 It is also important to recognize what the internet is not – it is not a synonym for the World Wide Web. The web is an information system that uses the internet.

The network of networks that comprises the internet physically exists. It is not quite a series of tubes, but as an analogy this is actually a good starting place. In an era of ever-connected cloud computing, with files saved to Dropbox and music streaming across wireless connections, the internet can sometimes seem to be “just out there.” Yet the internet is a series of interconnected cables, wires, and switching boxes – yes, often running in actual tubes to protect them from the outside world.
Fibre optic cables send bursts of light, billions of times a second, across oceans and continents, stretching between cities and towns, stitched together by massive datacentres, owned by private companies such as Cologix or Nokia’s Alcatel-Lucent Enterprise, governed by the national laws of the countries where the servers reside.8 The tangibility of the internet, and the physical location of cables and computer networks, actually matters a great deal. Fibre optic cables are tapped and analyzed by intelligence agencies; South Korean users are forbidden to view North Korean propaganda, and Germans face restrictions around Holocaust denial and neo-Nazi literature.9 Less seriously, Netflix catalogues differ on the basis of the IP addresses from which users access Netflix.com, as do the advertisements on the sites we visit. Americans and Canadians visit the same Netflix.com URL but have a very different experience. The nation-state continues to matter. As Schmidt and Cohen note, states “have power over the physical infrastructure connectivity requires – the transmission towers, the routers, the switches … they control the entry, exit and waypoints for Internet data.”10

The magic that underlies this network lies in the establishment of common standards that allow a computer in Canada, manufactured by Apple in Texas, to exchange information with a Korean-manufactured computer in the Republic of South Africa. That this happens seamlessly is a tremendous human achievement. This common standard, the Transmission Control Protocol/Internet Protocol, or TCP/IP, makes the internet possible. There used to be many other competing protocols, until TCP/IP was adopted by the American Department of Defence in the early 1980s (eventually being adopted by the Advanced Research Projects Agency Network, or ARPANET) and spread throughout the private sector, before being freely released as open source software in 1989. The most important part of the internet is that everybody has agreed on how to talk to each other, and a structured process has been constructed around how to make changes to these protocols.

What is TCP/IP? Think of the internet as an impossibly large postal system. Sometimes we send small bits of information to each other, a few bytes, but other times we exchange large amounts of information. Some of these are huge – the Large Hadron Collider at the European Organization for Nuclear Research provides datasets of over 300 TB, themselves a small fraction of the data it regularly generates and analyzes.11
All of this information is transmitted as digital data, a series of ones and zeros. The words we type, for example, are turned into digital information; they might be sent as a quick series of light pulses through a fibre optic cable, stretching under our streets, between our cities, and underneath our oceans.12 These files (big or small) are sent as packets, usually tiny segments of the overall file, making them durable and able to flow in different directions. Each packet has a source, a destination, an identification, and a checksum, which is a quick mathematical calculation that can be done to ensure that the packet information is being delivered properly. If something is corrupted along the way, which can happen, our computers can quickly discover the error by checking this “checksum.” There is also a prescribed limit to the number of hops a packet can make before being discarded – for example, if a network is very congested, a packet might eventually be dropped and then need to be resent (which slows overall data transmission). This means that if Computer A wants to send information to Computer B, it does not need a direct link: it can instead go through several intermediaries, or routers, on its way there. All of these locations have an address, or IP (Internet Protocol) address: a series of numbers like 129.97.128.116 (a server number; in this case, a server at the University of Waterloo) or 129.97.109.14 (a workstation on the same network). These messages thus follow physical networking paths, largely between computer networks linked by fibre optic cables. A message travelling between two computers at two Canadian universities, the University of Waterloo and the University of Saskatchewan (2,167 kilometres apart), takes the following route:

traceroute to 128.233.192.40 (128.233.192.40)
 1  * v438-aco-rt-ev1.uwaterloo.ca (129.97.143.129)  1.830 ms  1.692 ms
 2  gi1-12-dist-rt-mc.uwaterloo.ca (129.97.1.97)  0.573 ms  0.413 ms  0.360 ms
 3  te2-16-cn-rt-rac.ns.uwaterloo.ca (172.16.31.113)  0.364 ms  0.349 ms  0.327 ms
 4  te4-9-ext-rt-mc.ns.uwaterloo.ca (172.16.31.229)  0.414 ms  0.384 ms  0.382 ms
 5  66.97.28.65 (66.97.28.65)  0.742 ms  0.516 ms  0.547 ms
 7  be202.gw01-toro.orion.on.ca (66.97.16.26)  4.023 ms  3.902 ms  3.937 ms
 8  toro1rtr1.canarie.ca (205.189.32.41)  3.879 ms  3.857 ms  3.819 ms
 9  wnpg1rtr1.canarie.ca (205.189.32.178)  34.369 ms  34.292 ms  34.336 ms
10  sask1rtr1.canarie.ca (205.189.32.178)  42.331 ms  42.290 ms  42.290 ms
11  c4-srnet-sas.canet4.net (205.189.32.221)  42.568 ms  42.343 ms  42.351 ms
12  208.75.72.84 (208.75.72.84)  42.549 ms  42.403 ms  42.441 ms

The message began by first going to a Faculty of Arts router in the Environment 1 building at the University of Waterloo (ev1.uwaterloo), then to a router in the Mathematics and Computing building (mc.uwaterloo), before travelling on the Optical Regional Advanced Network for Ontario (ORANO) and the Ontario Research and Innovation Optical Network (ORION). It then used Canada’s Advanced Research and Innovation Network to make it from nearby Toronto (toro1rtr1.canarie), to Winnipeg (wnpg1rtr1), and then to Saskatchewan (sask1rtr1), where it was then transmitted to Saskatoon’s campus. The message travels extremely quickly: times are measured in milliseconds (ms), which are one-thousandths of a second. To move within my university campus, messages take a thousandth of a second or so, and to move between major cities they take a few hundredths of a second. Messages that leave North America travel along submerged cables and take longer.13 Bankers have invested millions of dollars in order to shave milliseconds off transmission times to facilitate high-frequency trading.14

Think of these IP addresses as library call numbers, letting us quickly locate things in an otherwise overwhelming sea of items. Everything connected to the internet has an IP address, from PlayStations, to computers, even to internet-connected televisions and fridges. Indeed, so many things now have addresses that we are exhausting them, even though the numbered format above has room for over four billion entries. This protocol, IPv4, is slowly giving way to a new protocol, IPv6, which will have room for 340 billion billion billion billion addresses, each looking like this: 2001:0db8:85a3:0042:1000:8a2e:0370:7334. As with call numbers, you can learn a few things from IP addresses: each can be roughly geolocated, and they are assigned en masse to large institutions and internet service providers.
Some IP addresses are fixed, while others are dynamic (assigned by the internet service provider), but they form an important component of the web’s architecture. IP addresses, like the Domain Name System discussed in the next paragraph, are overseen by the Internet Assigned Numbers Authority, or IANA, and implemented by the Internet Corporation for Assigned Names and Numbers, or ICANN. In 2016, ICANN transitioned towards a “multistakeholder model,” ending its contract with the United States Department of Commerce.15 The actual allocation and sale of IP addresses falls to regional bodies, such as the Chantilly, Virginia-based American Registry for Internet Numbers, which covers Canada, the United States, much of the Caribbean, and Antarctica.16

One final component completes the internet, and therefore the web: the Domain Name System, or DNS, which gives us names like cnn.com or Wikipedia.org. To prevent the internet from being navigable only by bookmarked IP addresses, direct links, and people with photographic memory, the DNS matches addresses like 173.194.70.103 with Google.com, creating a more memorable address and serving as a giant, automated Yellow Pages. Domain names can exist independently of the actual served webpage. A user can buy “janesmith.com” from a domain registrar (such as GoDaddy Inc., Tucows, or Netfirms) and then subsequently point that name at a WordPress site that is being hosted at Wordpress.com. All that the user has bought is the domain name itself, whereas the server is hosted independently. The early 2012 protests against the American Stop Online Piracy Act, which saw prominent websites like Wikipedia going “dark,” were in part against the bill’s provision to mandate that DNS servers stop referring people to infringing websites. The sites would still be up, but inaccessible to everybody who did not have the exact IP address. In October 2016, a major DNS hack saw websites such as Twitter, Spotify, GitHub, Reddit, and the New York Times brought down, despite using diverse hosts and being geographically dispersed.17 This underscores the importance of this seemingly innocuous system.

This crash course in basic internet theory helps us understand what we will find inside web archives. But where did it come from? What processes occurred to bring people together so that we could create such an interconnected global network?
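Before turning to that history, readers who want to see the address system at work can do so from any computer with Python installed. The snippet below is purely illustrative: it asks the local resolver to translate a few hostnames, chosen here simply as examples, into IP addresses. The answers will vary by time and place, a small reminder that the same name can point different users to different machines.

# A small illustration of the DNS described above: translating memorable
# hostnames into the numeric addresses that routers actually use.
# The hostnames are examples only; the results depend on when and where this runs.
import socket

for hostname in ["example.org", "uwaterloo.ca", "archive.org"]:
    try:
        address = socket.gethostbyname(hostname)
        print(f"{hostname} resolves to {address}")
    except socket.gaierror:
        print(f"{hostname} could not be resolved")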


From Chaos to Standard: A (Very) Short History of the Internet

Robert Taylor, a researcher with the Advanced Research Projects Agency (ARPA), surveyed his Pentagon office in 1966.18 He had three different computer terminals. One was connected to a defence contractor’s network in Santa Monica, California; another to the University of California’s Berkeley campus; and the third to a network at the Massachusetts Institute of Technology, near Boston. Each terminal had its own separate set of access commands and its own separate community, accessed only through physically different systems.19 While each was powerful in and of itself, he wondered if there was a way to bring all three computers together, to let different networks, users, and computer types interact with each other. Taylor recalled his thought process: “If you have these three terminals, there ought to be one terminal that goes anywhere you want to go.” Out of this conundrum would evolve the ancestors of today’s web.20

Three major technical and conceptual developments make today’s internet possible: time sharing, packet switching, and the adoption of common communications protocols. In the 1950s and 1960s, computers were big, expensive mainframes housed in specialized facilities. As they represented large capital investments, there was pressure to ensure that they were used to their fullest. Time sharing solved this problem, as it would let many users use many smaller terminals to connect to the big mainframe computer and share the resource. While time sharing has many different creators and origin stories – as with many inventions, it was as much the product of a widely recognized need as of a single individual’s ingenuity – work done at ARPA helped it reach fruition.21 Indeed, the Cold War’s research and funding infrastructure would play a large role in the building blocks that make today’s internet possible.22

The second element of the internet’s invention was packet switching, a method for data transmission in which a message was broken down into parts and sent independently over routers, then reassembled at their destination. This involved a new vision of how to run a distributed communications network. Here lies the root of the argument that the internet was created in order to survive a nuclear attack upon the United States, and thus to have a network that could survive the loss of several central communication nodes but still retain a cohesive network.
As with many myths, this has some basis in reality, although it overstates the direct link between RAND (a non-profit think tank with strong ties to the United States military) and the resulting network. In 1962 the RAND Corporation’s Paul Baran outlined reliability and survivability as two of the main benefits of distributed networks.23 While Baran’s network was not designed or implemented by RAND, his ideas would see germination in the ARPANET some years later.

Fig 1.1 Baran’s illustration of three network types from Baran, “On Distributed Communications.”

Figure 1.1 illustrates the three main ways that Baran identified to organize computer networks. The three types are centralized, decentralized, and distributed networks. Centralized networks rely on a central node, linked to several end points. Think of a transportation system where you always have to go to a central hub: a commuter system where you might have to go to a central station from the suburb, and then back out on another line to get to your destination. A decentralized network uses several main hubs, the hub-and-spoke model used by many airlines today. The third model is a distributed network, which looks like a developed road network or grid-like subway system: nodes have multiple connections to each other, and there are dozens or even more paths to get from any point A to any point B.
The network can survive the loss of many nodes as well as many connecting paths. In our road example, imagine an avenue or intersection being closed for construction – traffic is able to find alternate routes to get to its destination, sometimes with minimal or no delay. When we send messages over the internet, our “packets” go through a variety of nodes before ending up at their final destination. Packet switching was first adopted in 1966 by a team at the British National Physical Laboratory (NPL).

By 1966 packet switching and time sharing had produced the situation facing Taylor in his Pentagon office: different networks had emerged, letting users share the power of central mainframe computers, but these networks existed largely independently. Taylor wondered if they could be brought together. The solution came in a famous 1968 paper that he wrote with fellow ARPA researcher and time-sharing pioneer J.C.R. Licklider, “The Computer as a Communication Device.” It articulated the idea of an ARPANET, which would allow networks to talk to each other without having to translate code to work on each individual computer.24 In this lies the core of the internet as a standardized network of networks. The contract to construct ARPANET was awarded to the Massachusetts research firm Bolt, Beranek, and Newman (BBN) in early 1969, and the first four-node network (UCLA, UC Santa Barbara, Utah, and Stanford Research Institute) was soon operational, with the first message being sent at the end of October 1969. Their implementation of Taylor’s idea involved a series of interface message processors (IMPs) that would route the messages to their destinations. It would enable time sharing on a massive scale. You would send a packet of data to an IMP, and it would either use the message or continue to pass it on. These IMPs are the predecessors to today’s routers, the backbone of the modern internet.25

ARPANET was operational on a larger scale by 1973 and continued to develop and expand, connecting large American research institutes. It soon, however, faced problems of scalability. In France the Institut de Recherche en Informatique et en Automatique co-ordinated its own network, CYCLADES. With ARPANET, the network itself was responsible for data integrity. Each IMP router needed to ensure that each packet was being transmitted properly, and if a packet was unreliable, the IMP needed to handle it. This gave the IMPs a lot of work, leading to network congestion and traffic jams of packets as they waited in queues to be processed by each IMP.
Imagine a letter being sent through many different postal stations, each of which has to closely inspect the letter to ensure it is in perfect condition before sending it along. In contrast, CYCLADES adopted a “centrifugal approach in which data transmission was not regulated by the equipment of the network itself but by the computers sending and receiving the data at its edges.”26 In short, the onus to ensure healthy packets was placed on the senders and receivers themselves, rather than on the communications infrastructure. Traffic jams at critical junctures would end. This approach was critical for the scalability of the modern internet.

The final step was the common standard to unite all networks. This would be the Transmission Control Protocol, or TCP. Imagine the need to somehow make a hypothetical host, connected to the CYCLADES network, connect through IMPs in the ARPANET, through another gateway, through IMPs in the British National Physical Laboratory (NPL) network, and on to a terminal on that network. How could you make sure all the packets got from point A, say in France, to point C, say in Britain? A standard was needed that everybody could agree upon. TCP/IP, the Transmission Control Protocol/Internet Protocol, was that standard. ARPANET engineer Robert Kahn approached Vint Cerf, then a junior assistant professor at Stanford University, and the two set to work designing the protocol that would enable network intercommunication. In 1974 the specifications of the TCP program – authored by Cerf, Yogen Dalal, and Carl Sunshine – were released to the broader community for comments and input. Still available online, they make for engaging reading about the basic roots of the internet. Drawing an analogy with the real-world postal service, TCP was envisioned as a “way for processes to exchange letters with each other.” It adopted the CYCLADES model of making hosts rather than infrastructure responsible for ensuring data got where it was destined. A paragraph encapsulates the overall thrust of the idea: “Processes exchange finite length LETTERS as a way of communicating; thus, letter boundaries are significant. However, the length of a letter may be such that it must be broken into FRAGMENTS before it can be transmitted to its destination. We assume that the fragments will normally be reassembled into a letter before being passed to the receiving process … We specifically assume that fragments are transmitted from Host to Host through means of a PACKET SWITCHING NETWORK.”27
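The “letters and fragments” idea is simple enough to sketch in a few lines of Python. What follows is an illustration of the principle rather than of TCP itself, which adds acknowledgments, retransmission, flow control, and a great deal more: a message is split into fragments, each carrying a sequence number and a checksum, and the receiver verifies every fragment before reassembling the whole.

# A toy version of the "letters and fragments" idea, not a TCP implementation.
import zlib

def fragment(message, size=8):
    # Break the message into fixed-size fragments, each with a sequence
    # number and a checksum of its contents.
    packets = []
    for seq, start in enumerate(range(0, len(message), size)):
        chunk = message[start:start + size]
        packets.append({"seq": seq, "data": chunk, "checksum": zlib.crc32(chunk)})
    return packets

def reassemble(packets):
    # Verify each fragment, then stitch them back together in order.
    for p in packets:
        assert zlib.crc32(p["data"]) == p["checksum"], "corrupted fragment"
    return b"".join(p["data"] for p in sorted(packets, key=lambda p: p["seq"]))

message = b"Hello from Waterloo to Saskatoon"
assert reassemble(fragment(message)) == message

If a single byte were corrupted in transit, the checksum comparison would fail, and in a real protocol the offending fragment would simply be sent again.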


Impressed, the Defence Advanced Research Projects Agency (DARPA), ARPA’s successor, issued contracts to implement TCP/IP.28 ARPANET itself would eventually become defunct: it adopted TCP/IP, existed as part of the broader internet for a few years, and was then shut down. While there were competing standards, TCP/IP was incorporated into the Unix operating system – which today forms the foundation of the Linux and Mac operating systems – and was also adopted as the standard for defence communications in the United States.29 By the early to mid-1980s, it had been adopted on a wide scale, at least in the West.30 The internet had been born. The next step on the road to our popular explosion of information was to make the internet accessible. Enter the web.

The Internet’s Killer App: The World Wide Web

At 10 p.m. on 2 October 1997, in Saint John, New Brunswick, Canada, a user clicked a mouse button and a web milestone was reached. The millionth GeoCities personal home page was created.31 GeoCities had been founded in late 1994 and had begun to grow rapidly throughout 1997, doubling in size in six months. As the Financial News explained, “By offering anyone with access to the internet the ability to contribute their talent and ideas, meet others with similar interests, and participate in creating the electronic communities of the future, GeoCities has virtually exploded.”32 This New Brunswicker was part of a wave of people stepping onto the web and changing our historical record forever.

The web made the internet accessible to many more people – businesses, academics, and eventually everyday people around the world – hence the “killer app” moniker.33 The web is not a synonym for the internet but should rather be understood as a very large set of hypertext documents that users can access via the internet. While the web was made possible by the widespread networking potential of the internet, the idea behind an interconnected hypertext database had been around much longer.34 As with the broader internet, it was the adoption of a common standard and language that allowed the web to flourish and spread. The lingua franca of the web is HTML, or HyperText Markup Language. It was designed as a language for joining disparate people. The original idea was that if two users had the WordPerfect word processor program, for example, they could share data in that format – but in the absence
of a common format (if one had WordPerfect and one did not), HTML would work.35 It was a way to get around incompatible file formats. Two key components underpin HTML – hypertext, and what is meant by a markup language. Both concepts are fundamental to any understanding of web archives. Hypertext refers to a body of text that is interconnected by links – we use them frequently as we browse the web and move from page to page. It is a concept that evolved independently of the internet or modern computers. In 1945, in a famous Atlantic Monthly article, Vannevar Bush – former head of the American Office of Scientific Research and Development during the Second World War, which had overseen the Manhattan Project to invent the atomic bomb – outlined his concept of the Memex, a hypothetical information retrieval system that will seem familiar to users of today’s web: “The owner of the memex, let us say, is interested in the origin and properties of the bow and arrow … Next, in a history, he finds another pertinent item, and ties the two together. Thus he goes, building a trail of many items. … When it becomes evident that the elastic properties of available materials had a great deal to do with the bow, he branches off on a side trail which takes him through textbooks on elasticity and tables of physical constants.”36 In library terms, imagine being able to browse through stacks of books, but also being able to move not just between the books themselves but also laterally through the pages of related books. Or in a more identifiable example: Wikipedia today. Many of us get lost down Wikipedia worm holes, clicking from one page to the next, with games existing to see how many hyperlinks one must visit to reach a given page just by clicking links. The idea of hypertext was further developed in 1965 by Theodor Holm Nelson, who presented “A File Structure for the Complex, the Changing, and the Indeterminate” at the Association for Computing Machinery (ACM ) annual conference. Nelson, fresh from a graduate degree in Harvard’s Department of Sociology, outlined his idea. The way of communicating he imagined would “have every feature a novelist or absent-minded professor could want, holding everything he wanted in just the complicated way he wanted it held.”37 It would later become known as Xanadu, a global library that would contain all of human information, making explicit interconnections – hyperlinks – between
entries. His 1980 book Literary Machines outlined a model of shared information, navigated through an innovative method of hypertext that would allow the continual reworking of information. Nelson’s bold prediction that in 2020 there would be “hundreds of thousands of file servers – machines storing and dishing out materials,” with “hundreds of millions of simultaneous users, able to read from billions of stored documents, with trillions of links” was prescient.38 An early implementation of hypertext came in December 1968 with Douglas Engelbart’s “Mother of All Demos,” which showed off a windowed hypertext system (as well as a mouse, conference calling, document collaboration, and almost everything else we take for granted today in modern computing – hence its “Mother of All Demos” moniker).39

If the idea of hypertext can be traced back to Bush and Nelson, the web itself originated with Tim Berners-Lee, a research fellow at the European Organization for Nuclear Research (CERN). CERN, straddling the French-Swiss border near Geneva, is and was a large institution with collaborators from all over the world. As a result, it had many different kinds of hardware, operating systems, and languages, with people travelling back to their home countries but still being involved in projects. “In all this connected diversity,” Berners-Lee would later write in his autobiographical Weaving the Web, “CERN was a microcosm of the rest of the world, though several years ahead in time.”40 Faced with the problem of connecting all these disparate individuals, systems, and nationalities, Berners-Lee developed a program at CERN in 1980, ENQUIRE. This program was an interconnected database of people and software modules.41 While hypertext did not appear in the original specification, the idea of focusing “on the way the system is composed of parts, and how these parts are interrelated” was a portent of developments to come.42

Building on this, in 1989 Berners-Lee set out to implement ENQUIRE’s model of interconnected nodes on a much larger scale. His March 1989 document, “Information Management: A Proposal,” laid out his vision of what would become the web. High turnover, thousands of users from dozens of countries, physically sprawling facilities, all meant that the “actual observed working structure of the organisation is a multiply connected ‘web’ whose interconnections evolve with time.”43 Hypertext would bring everything together, running on a decentralized network. As Gillies and Cailliau note, “The proposal contained all the ideas that would eventually make the World Wide Web. It even anticipated the sort of problems the web would encounter as it spread about the globe.”44

anticipated the sort of problems the web would encounter as it spread about the globe.”44 “Information Management” outlined the problems with basing a system on keywords – notably that people never choose the same ones, a problem that persists with online “tags” on blogs today – explained hypermedia and hypertext, and asked for the support of two people for six to twelve months to realize the project.45 His boss granted approval with the famous marginal comment, “Vague, but exciting.” In 1990 the proposal was resubmitted to support the purchase of a NeXT computer (a particular – and expensive – desktop computer developed by Steve Jobs after he was forced out of Apple Computers), which brought together the power of the UNIX command line with the user-friendliness of an Apple-esque environment.46 It all came together quickly after that. A subsequent proposal in November 1990, “WorldWideWeb: Proposal for a HyperText Project,” was co-authored with Robert Cailliau and gave concrete details and timelines for implementation.47 The next hurdle was finding a way to let people who didn’t have one of the rare and expensive NeXT computers use the web (imagine if, today, only those with high-end iMacs could use a program). Responding to this problem, a CERN intern, Nicola Pellow, developed the first “line-mode browser.”48 As all platforms could display and enter text on a command prompt, this was the simplest solution to let the web run across all kinds of different computers. The web was now cross-platform. Finally, the next step was making sure that people outside of CERN – all around the world – could both know about and use the web. On 6 August 1991, the two browsers (WorldWideWeb and the line-mode one), as well as the basic code to run a web server, were placed onto the internet and publicized via Usenet newsgroups.49 With this step, the web was arguably born. The potential and strength of the platform was ensured on 30 April 1993 when CERN declared that the technology would be available freely to everybody, without fees or royalties. A new era of online communication had begun. If previous networks had been the province of academics and a few geeks here and there, the web had the potential to reach much further. Over the next decade, it would. In Canada, for example, 4 per cent had internet access in 1995; 25 per cent in 1998; 60 per cent in 2001; 71 per cent in 2005; and 88 per cent in 2015.50 In the United States, the numbers jumped from 52 per cent in 2000 to 84 per cent in 2015. While adoption lagged

in the United States, primarily along demographic lines of age, class, racial and ethnic differences, and urban versus rural communities, it has seen dramatic uptake.51 Today Canadians spend “an average of 36.3 hours online on their desktop computers every month.”52 In the United States, over one in five internet-using Americans report that they are online “almost constantly” (and another two-fifths report being online “several times a day”).53 Only 13 per cent of these American internet users reported being online less than daily. If it seems that the web is everywhere, that’s because for many of us in the Global North it is.

A New Form of Primary Document? The Nuts and Bolts of Hypertext

The most important component of the web documents that we use is HyperText Markup Language, or HTML. Hypertext matters for two reasons when studying archived web documents. First, as an instrumental element of the web, baked into it from the very start, it means that web documents cannot be understood on their own, for they were designed within a web of other hypertext documents. Second, the “markup language” part of the HTML acronym also matters to us, as we need to understand how the documents themselves are written. Before moving into the nuts and bolts of HTML, it is worth again underscoring the point that webpages do not exist in the sense of a single document sitting in an archive box – they are complicated documents that are made possible through the interplay of many different files (from the images within them to the documents that style the page or the music that plays).54 It is difficult – indeed, nearly impossible – to get every single component of a page so that it looks exactly the same as it did before. I return to this broader point in chapter 3 when we discuss what a webpage is in some technical detail. Just as parts of an archived webpage may be gone, preventing us from fully seeing what it looked like in the past, we also need to grapple with the distributed nature of the web as a series of documents connected by hyperlinks. The nature of hypertext means that documents can be difficult to contextualize without the wider web of documents that they are a part of. If you have archived a page, but not the pages that lie behind the links on that page, what are you missing? This becomes an inherent problem, given the selective archiving of web

material. As the result of storage limits and budgetary constraints for a given crawl, web archivists cannot follow every link on every page a crawler visits; thus the sources they attempt to reconstruct in an archive are necessarily incomplete. Consider the issues a Wikipedia web archive presents. While the pages themselves are hosted on http://wikipedia.org/ and are frequently archived (both by Wikipedia itself and outside organizations like the Internet Archive), they contain embedded links to other websites and are written with the assumption that a visitor has access to the broader web. However, many of these links, accessed through a web archive, will be broken – the page might not be archivally preserved – or will not synchronize in time with the original capture. A Wikipedia page might have been archived in February, but the other sites that the page links to might not have been collected until the following October. Even more problematically, a web archive that contained just Wikipedia pages without any of their external links would be a deeply incomplete source. It would be like a library with just a few books, full of footnotes to now empty shelves. The web is intended to be a connected assemblage of documents and loses its effectiveness when considered in small, isolated sections, but this is unavoidable when making small-scale web archives – if one designed a web crawler to follow and download every link it encountered, it would end up downloading the entire web. A common compromise is to allow a crawler to visit one link beyond the “seed” page (think one degree of separation), but even that increases costs exponentially. The “markup language” part of the HTML acronym also matters to us, as we need to understand how the webpages themselves are written. Indeed, learning to read this markup code is akin to classical scholars learning Greek to understand their historical sources. We can “translate” HTML using web browsers, but sometimes we might want to read the original. The requirement to learn another language has long been part of graduate programs, and to be able to read code like this may be considered a necessary part of a future historian’s training. Luckily, HTML is easier for most of us to learn than ancient Greek. Whenever you are reading a webpage online, an HTML document is providing the underlying instructions for how it should work. It is worth exploring this a bit further. Markup describes the way that an HTML document is crafted. Think of when you edit – or “mark up” – a document on

paper. Imagine when you want some text to be rendered in italics, you circle and write “italics” above it; if something should receive emphasis, perhaps you underline it. This is what a markup language entails. The HTML document contains two things: the content itself, but also basic instructions on how it should be displayed. Consider the following basic HTML code:

<html>
<title>Web Archives for Historians</title>
<body>
Hello World! <em>We're here!</em>
</body>
</html>

If you wrote that code in a text editor, saved it as “index.html,” and opened it up in a web browser such as Chrome, Safari, or Firefox, you would see the following line of text in a page with the title of “Web Archives for Historians”: Hello World! We’re here! In HTML all the instructions are enclosed in angled brackets like <em>, known as tags. Most are paired with each other. For example, you start an instruction for emphasis (or italics) <em>, and when you do not want to use emphasis anymore, you close it off with </em>. Paragraphs begin and end with <p> and </p> respectively, a top-level heading with <h1> and </h1> respectively, and so forth. A simple document might look like:

<html>
<title>A basic document</title>
<body>
<h1>Introduction</h1>
<p>This is the introductory paragraph of an essay.</p>
<p>This is a second paragraph for this essay.</p>
</body>
</html>

This means that as historians we may engage with these sources in many different ways. Depending on how we are using the web archive, we might be working with finished versions of pages, or with the underlying code itself. When interested in content analysis, for example, the first thing a historian often does is “strip” out the tags: remove the <html>, <title>, <body>, and <em> tags in the above example, so that our computers

can just read “Hello World! We’re here!” without stumbling over the stylistic commands. Indeed, when documents are indexed into search engines, this is an early step. Yet if historians are interested in reading content at the level of the document alone – in terms of how the page was assembled – those italics, and titles, and beyond matter. HTML does not exist in isolation. Well-written HTML does not alone dictate how a page should appear, but often gives only starting instructions. For more advanced stylistic options, websites also have cascading style sheets (CSS), which contain information concerning colours, typeface, and how the site should be laid out. HTML and CSS can be thought of as instructions, interpreted by the browser to create the experience on your screen. Sometimes they will look a bit different, depending on the browser or computer. If your system does not have a certain font, for example, it may have to use a substitute one. The use of standards by the twenty-first century largely means that modern browsers generally look the same (though not always – try comparing sites in Chrome, Firefox, Edge, and Safari, and you will see minor differences in how they are rendered). This is not necessarily the case for archived webpages, however, as we discuss later. What exactly does a historian retrieve when accessing a web archive? As we will see, it really depends on how content was crawled. First, in some cases, a scholar receives the full pages: a rendering of the site as it most likely appeared when it was originally captured. Yet it will still be a facsimile, depending on what was or was not captured. The page could be missing an image, the font could be different from what a user originally saw, or when various hyperlinks are clicked, they may generate error messages noting that the content has disappeared or was not archived. Alternatively, in other cases, the scholar may work with a “derivative” dataset of the original: the plain text extracted, or just the HTML file, or just a PDF file or image that was embedded in the site.
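When the goal is content analysis, that tag-stripping step can be done with a few lines of code. The following is a minimal sketch only: it uses the Python BeautifulSoup library as one possible tool (an assumption on my part, not a method prescribed by any particular archive), and the document it parses is simply the small example above rather than material from a real collection.

# A minimal sketch of stripping markup from an HTML document to leave the
# plain text, using the BeautifulSoup library (one tool among many).
from bs4 import BeautifulSoup

html = """
<html>
<title>Web Archives for Historians</title>
<body>
Hello World! <em>We're here!</em>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# The text alone, with the <html>, <title>, <body>, and <em> tags removed;
# this is roughly what a search engine indexes.
print(soup.get_text())

# The markup is still there if the presentation matters to the question at hand:
for emphasized in soup.find_all("em"):
    print("Emphasized text:", emphasized.get_text())

Run against a real archived page, the same handful of lines would discard the very italics and headings that, as noted above, can matter a great deal when reading at the level of the individual document, which is why the decision to strip or keep markup is worth recording.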

The Scale at Play

Imagine that earlier HTML example (Hello World! We’re here!), and then picture hundreds of millions or even billions of them, and you can begin to grasp the scale at play in web archives. Modern web development makes this even more complicated. Individually, webpages can still be

read in much the same way as traditional print sources have been. When somebody clicks on a link, however, or wants to think about where the page might have belonged within the larger archive, it gets a bit tougher. The large scale of the web requires a different methodology. The sheer amount of information generated on the web means that the cultural record that historians can draw on for the last twenty years is very different, both in scale and scope. Returning to our opening vignette, imagine Woodward and Bernstein with their stacks upon stacks of library slips as the incredulous library clerk watched them. In a way, they were lucky, compared to the scale of today’s data. What if they had been tackling a similar investigation during the 1990s Clinton administration? They would have been confronted with an exponential increase in the amount of information available to them. As historian Dan Cohen has aptly illustrated, “A single historian might have been able to read and analyze the 40,000 memos issued at the White House during the Johnson administration, but … could never handle the 4 million email memos sent while Clinton was in office.”55 This issue of TMI – too much information – is not confronting just political historians, of course, but historians and scholars of all subjects and areas. An example from my own background as a historian can help illuminate. My first book, Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada, was a traditional academic monograph.56 Addressing the interplay between young people, labour unions, and new social movements during Canada’s long sixties (roughly the late 1950s to the early 1970s), it grew out of several questions: why did unions seem so disconnected today from urban social activists, for instance, or what would it have been like to be a child growing up in the climate of the sixties but then thrust into a disciplined industrial environment? To research this project, I turned to traditional sources: written documents generated by the state or other large institutions with archives and special collections, deposited papers at a scattering of other sites, as well as around seventy oral history interviews. The last were especially important, as the movements that I studied did not keep formal records. This was unsurprising, as student radicals staying up late in a campus pub or communal living room did not often generate traditional written documents, except maybe a manifesto or two. And even those were often written by a self-selecting few, leaders or self-professed

leaders of various organizations. Similarly, young workers engaged in illegal wildcat strikes left almost no evidence: what remained for the most part were people’s imperfect memories, police surveillance documents, security or observer reports of graffiti, and a few newspaper accounts. Even though the events I was studying happened only forty or fifty years before, information was scant. Similar research inquiries today would produce different types of information, and much more of it, highlighting the shifting sense of scale. Indulge me in a thought experiment. What if the student radicals and labour activists whom I had been studying had Twitter accounts? Or kept blogs? Or communicated by Facebook? In some of these cases, the historical record may still not be there, especially in the case of Facebook and its attendant privacy concerns and “walled garden” nature. But the blogs and Twitter accounts are very real sources that historians will be confronted with: they are largely being preserved, and they will present a considerable challenge to sift through. They also exist on a scale that does not easily compare to what has come before, underscoring the novel nature of web archives and born-digital resources more generally. Modern social movements – from the Canadian #IdleNoMore protest focusing on the situation of First Nations peoples to the global #Occupy movement that grew out of New York City – leave the sorts of records that would rarely, if ever, have been kept by previous generations. On a single day of the Canadian Idle No More movement, 11 January 2013, there were 55,334 tweets using the hashtag #IdleNoMore. This is an important social movement in a relatively small Western country; one day, tens of thousands of tweets. Assuming the tweets were of average length, they would fill 1,300 pages at 300 words per page for that day alone. To use another tangible example, our research team collected around four million tweets made during the 2015 Canadian federal election, the thoughts of 318,176 unique users. Or consider one of the largest collections we have seen: the #WomensMarch hashtag between 12 and 28 January 2018, for which the Twitter dataset we collected amounted to over fourteen million tweets (and there were undoubtedly more tweets not collected as the result of limitations in the Twitter developer interface). Scholars of the 1984 Canadian federal election would use letters to the editors, polling results, and newspaper coverage as rough proxies of popular attitudes and voting intentions; those of 2015 will have the voices of ordinary people, albeit filtered through the various prisms

and cleavages of social media.57 These are all sources that are now being preserved, albeit unevenly. When a large-scale social movement erupts, grassroots and institutionally based web archivists are on the case to ensure some records are preserved for future historians. We have already seen this with the archiving of the Investigator website during the Crimean invasion, as well as concerted efforts to document the civil rights protests that took place in Ferguson, Missouri, and elsewhere after the police murder of the teenager Michael Brown.58 In Canada, Library and Archives Canada has begun targeted web archiving, choosing one thematic collection per quarter – one focused, for example, on the First Nations #IdleNoMore movement.59 Occupy Wall Street was extensively documented by George Mason University’s Center for History and New Media’s #OccupyWallStreet archive, continuing their tradition of digital archiving that stretched back to the attacks of 11 September 2001 and Hurricane Katrina.60 Not all movements receive the widespread recognition of the Women’s March, Occupy, #MeToo, or #IdleNoMore. The historical record has largely privileged extraordinary events over ordinary ones, and the availability of web archives may not change that. What of those events and movements that do not receive specific curatorial or media attention, but are nonetheless just as vital to understanding our contemporary age, either because they document ordinary life or because we do not understand their future significance today? Here we enter the world of large-scale web collecting. As we will discuss later in the book, countless blogs and other websites are now archived and preserved by large organizations such as the Internet Archive, which undertakes wide scrapes of the extant web, or the national-level domain efforts of national libraries in countries including France, Britain, and Australia. Tweets were archived by the Library of Congress until 2018, and many more scholarly groups continue Twitter collection today.61 In short, even for those events escaping official notice today, the historical record will be dramatically different from what I had to draw upon when studying the 1960s and 1970s. This process is accelerating as more and more web users continue to add content to the web. Evidence suggests that the rate of online writing is accelerating, with next-generation users who access the internet from multiple devices being almost 25 per cent more likely than

first-generation users to create websites or publish blog posts.62 Right now, a young student radical is probably tweeting what she had for lunch. Even something as banal as that, which may not animate historical scholarship save for the most detail-oriented biographer, is significant when scaled up. For example, food historians studying nutrition in periods as recent as the Second World War have had difficulty reconstructing the nutritional history of Canada.63 Over two billion users are currently connected to the web, and unless cataclysm strikes, it is not a stretch to foresee that most humans will be connected by 2025.64 Some of them will tweet about their lunch. The aggregate records of a hundred thousand people and their nutritional preferences could transform aspects of our understanding of the past. If we consider how users have multiple identities and avatars online, at some point in the next few years there will be more online identities than living humans.65 With serious investment being made in web archiving, this means that our libraries are exploding in size, capturing more information and at a previously unimagined scale. If the Library of Congress topped Claude Shannon’s 1949 list of information repositories, its conventional collection is now dwarfed by its electronic holdings. It is in turn – at least in sheer quantity – dwarfed by the Internet Archive with its over fifteen petabytes of web archives. All of our libraries have dramatically grown over the last two decades, thanks to expanded digital collections mandates, from the regional library with an Archive-It subscription, to Western national libraries. As we will discuss in detail in chapter 4, an expanded concept of legal deposit is swelling the holdings of national libraries such as the British Library and the Bibliothèque nationale de France. These countries, among others from Denmark to Spain to Portugal, now treat websites as “publications.” They are governed by legal deposit regulations, meaning that the amount of the web being preserved increases every year; legal deposit refers to the legal obligation to deposit publications with a central library (i.e., if you publish a book in Canada, you need to send a copy to Library and Archives Canada). Indeed, when British legal deposit was expanded on 6 April 2013 to cover “material published digitally and online, so that the Legal Deposit Libraries can provide a national archive of the UK ’s non-print published material, such as websites, blogs, e-journals and CD-ROM s,” the British Library suddenly found itself collecting far more material than ever before.66 Before that

period, a librarian would select a website to be preserved, an email message would be sent to its owners, and if they gave permission – roughly some 20 per cent did, most did not respond – then (and only then) would the site be preserved. It was, as Wired put it, a “legal nightmare.”67 One in five websites selected does not make for a comprehensive collection, especially given biases in who might consent and who might not. Would you click on a link in an email sent out of the blue by the “British Library”? Maybe, but probably not. The result of the expansion in legal authority, therefore, has been a dramatic change in the amount of information housed in these national libraries. They are decreasingly subject to the ability to connect with an individual webmaster or, in other regimes, operate on an opt-out basis. Researchers have demonstrated that if you took every national and institutional web archive and connected them, they would amount to yet another Internet Archive.68 Indeed, the rising adoption of legal deposit regimes is one of the most exciting developments on the road to ensuring that historians will have archived web material to access. Library acquisition policies have changed as well, becoming more democratic in some ways – including not only those who are wealthy or professionally connected enough to publish a book, for example – and in other ways less so. For the large-scale web crawls made by the Internet Archive, the British Library, or the Bibliothèque nationale de France, selection criteria are largely algorithmically driven, rather than being determined according to a source’s potential historical significance. Humans still play a role, of course, but not in the same way as previously. For social historians who seek to study the lives of everyday people, the abundance of information available offers profound research potential. Yet with this abundance, and the particular way it is being collected, come big questions: how can we make sense of all of this information, and how can we determine who and what it does not represent?

The Rise of the Digital Humanities: Making Sense of Abundance

The nearly fathomless amount of text and images found within these repositories is where the academic field of the digital humanities can come in. As algorithms begin to do the heavy lifting

in humanities research, this emerging discipline suggests new ways to responsibly parse collected data, subject them to critical analysis, and interpret the results produced by our computers. The digital humanities are a scholarly tradition that also grew in part out of the problem of too much text and needing tools to work with it. In 1946 Roberto Busa, an Italian Jesuit priest, wanted to create a comprehensive concordance of the works of St Thomas Aquinas, providing the context in which every single word he wrote appears. The result was a series of some thirteen million computer-readable punch cards.69 The quest to create this Index Thomisticus would bring him to the United States and International Business Machines Corporation (IBM ), an initial test in 1951, and the final published version in 1974 (online in 2005 at http://www.corpusthomisticum.org). By the 1960s and 1970s historians had also begun to turn to early computers – imagine stacks upon stacks of punch cards – to help them make sense of large datasets such as national censuses.70 Coded by armies of researchers, innovative studies helped social historians understand the degree to which social mobility occurred between censuses, or the economics underpinning American slavery.71 Yet this turn towards “Cliometrics” ended as quickly as it had begun for a variety of reasons, ranging from illegible handwriting in primary documents, to the difficulties of teaching numbers to humanists, to argumentative overreaches. It also led to an indelible association between computational methodology and quantitative histories.72 Historians, most of whom felt (fairly or unfairly) uncomfortable reducing human experiences to numbers, largely eschewed computers in the aftermath of this movement. But computers did not go away, even if their lustre had faded slightly within the ivory tower. Other historians became involved with computers, not to facilitate large-scale research, but to help collect, present, and disseminate materials: a vision of computers and the humanities grounded in the public history mission of reaching new audiences, working with teachers, and making history relevant.73 Indeed, the recent roots of contemporary digital history lie in this field of public history. If qualitative historical research largely eschewed computers, the same was not true for some trailblazing digital scholars in the field of English literature. While there are diverse origin stories for digital history and the broader digital humanities, literature scholars began focusing on the problem of TMI over a decade ago. Literary critic Franco

Moretti’s 2005 book Graphs, Maps, Trees made the provocative argument that literary scholars were not reading enough books, and most were studying the same set of around two hundred significant works. For an individual reader, two hundred books are a lot of books to read. Indeed, it roughly lines up with the size of graduate comprehensive exams in the humanities, the completion of which is seen as mastery of a field of advanced study. But to study only these two hundred books, Moretti argued, meant denying the explosion of mass publishing in the nineteenth century. In the collective products of this publishing revolution might lie profound insights about Victorian culture and history – a narrative produced by lesser-known authors and ordinary readers, rather than the elites defined by the canon. To come to grips with the breadth of writing and publication in the nineteenth century would mean reading tens of thousands of books. And therein lies the problem: “Close reading won’t help here, a novel a day every day of the year would take a century or so … And it’s not even a matter of time, but of method: a field this large cannot be understood by stitching together separate bits of knowledge about individual cases, because it isn’t a sum of individual cases: it’s a collective system, that should be grasped as such, as a whole.”74 Such abundant information would require new techniques of analysis. While Moretti was focusing on nineteenth-century novels, the core problem he faced is similar to that of researchers confronting modern web archives. Making sense of more text than you can read requires “distant reading,” as opposed to the “close reading” traditionally practised by literary scholars and historians (reading primary sources or novels one by one). In this, Moretti looked towards a historian for inspiration: Fernand Braudel.75 Braudel, a member of the historical Annales school, studied large stretches of historical time and space, from the histories of entire countries to global interconnections. When scholars trace the origins of Big History, Braudel looms large.76 His most famous work, The Mediterranean and the Mediterranean World in the Age of Philip II, used the Mediterranean Sea as a metaphorical canvas upon which people and cultures lived their lives.77 Before the term existed, Braudel was “distantly reading” landscapes and large numbers of sources. Just as Braudel evoked a vision of standing on the shores of the Mediterranean and considering it from the perspective of the longue durée – large stretches of time

and space – we can begin to similarly think of web archives and their attendant massive amounts of information. In this we see the world of web archiving colliding with the study of Big Data. In 2015 I co-authored Exploring Big Historical Data: The Historian’s Macroscope. In it we argued that while computer scientists might advance a definition of Big Data revolving around various V’s (volume, velocity, variety, and veracity is one common construction, for example), “big is in the eye of the beholder. If it’s more data than you could conceivably read yourself in a reasonable amount of time, or that requires computational intervention to make new sense of it, it’s big enough.”78 Historians have put ideas of “distant reading” to good use. For example, the Data Mining with Criminal Intent project used the transcribed records of over 197,000 trials between 1647 and 1913 to explore the structural shape of various criminal trials and to find similar trials; they discovered, for example, that there was a “significant rise in women taking on other spouses when their husbands had left them,” and that plea bargaining began to significantly rise by 1825.79 Another article that grew out of the project argued that violent crimes were both discussed and treated differently by the nineteenth century, representing a new discrete category of crime; it helped confirm the “civilizing process” hypothesis, allowing the project team to see the “decreasing acceptability of interpersonal violence as part of normal social relations.”80 Other projects have used distant reading to explore how commodities were traded around the Atlantic world (looking at six million pages of parliamentary records, for example, to see how global trade evolved).81 For those interested in the classical world, a research team at Stanford University assembled a database of Roman road networks (84,631 kilometres) and canals/rivers (28,272 kilometres) and combined them with weather data and financial information to let people calculate how long and how expensive it would be to travel from any point A to any point B.82 Big Data is a very real concern for historians, as it is for those in English, political science, and many other disciplines. All of the above projects were concerned with the importance of context, a concept that lies at the heart of distant reading. Just as Moretti was interested in contextualizing a non-famous nineteenth-century novel, we need to concern ourselves with questions of context with all forms of Big Data. Is a blogger’s post about an emerging scandal representative,

or an outlier? Was a particular sentiment generalized or an isolated one? Without context, I can use keyword searching to find evidence for almost anything when working with a corpus of billions of documents. Just because I can find hundreds of websites about ducks from 1996, that does not mean that it matters or is illustrative at all – whereas if I found hundreds of pamphlets about ducks preserved in archives and special collections, I would begin to think that they were significant somehow – the product of a conscious acquisitions policy rather than the whims of an algorithm. At the scale we are now working with in born-digital data, the mere fact of something existing – or even hundreds of something existing – may not signify something significant. Context is king. Distant reading is a necessity when working with web archives. We cannot read every webpage but can instead get a sense of what was being created and talked about, and what mattered to people, at scale. The distant reading of web archives also requires a minimum awareness of computational methods and principles. This is because web archives will be delivered to researchers in several different ways, as we discuss later in this book. They may be browsing old webpages page-by-page in the Wayback Machine, similarly to how we experience the web today. Or, alternatively, they may get a large file transfer of WebARChive, or WARC, files, large archival file formats that contain all of the crawl material (from images, to text, to PDF and Word Documents). This may require the development of custom tools to explore it. With this material, a first instinct might be to turn to the search engine. When there is robust full-text searching of websites with good descriptive data, we may decide to trust our search engines to find individual pages that are relevant to our tasks. And some digital scholarship will be similar to what historians do today, using the Google Scholar search portal and Google-esque smart search bars on library home pages.83 Yet we must be aware of the parameters and priorities of search engines, for they will not be the same as those of historians as a group and certainly not of individual historians doing their research. How does a search engine work? Why is a given result presented on the first page of results, rather than the hundredth page (where nobody will ever find it)? How were keywords selected and extracted? How was this web archive created? What might it be missing? Would it have been written in a form of HTML that is different from that of today, meaning that it may have had content visible in the 1990s, such as the <blink> tag, that

is absent today? One does not need to be able to write one’s own search engine, but historians need to understand how one works. Otherwise we will be at the mercy of systems that we do not understand and that may skew our research without our being aware of it. Whenever I think of search engines I think of the importance of handling with care. They are powerful tools that are reshaping and have reshaped how we approach research in fields as wide-ranging as early modern manuscripts, nineteenth-century newspapers, and web archives today.84 Critically using search engines means that one cannot simply search for documents within web archives – documents about Donald Trump, for example, or Winnie the Pooh, or Pokémon, or Standard Oil – without considering the underlying mechanics. In other words, we must always question exactly which results are surfaced and in what order. If there are ten thousand pages on Trump, for example, why is his campaign home page number one, and a grassroots opposition page the five-hundredth hit? These are cautionary notes for all search engine behaviour, but at the scale of the web age they are even more pressing. Context truly is king.
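To see why these questions matter, it helps to look at the simplest possible version of the machinery underneath a search box. The sketch below is illustrative only: the three “pages” are invented, and the ranking rule (raw term frequency) is far cruder than anything a real archive search engine would use, which is precisely the point, since changing that rule changes which page appears first.

# A toy inverted index over three invented "pages," ranked by raw term
# frequency. Real search engines layer on link analysis, date weighting,
# deduplication, and other choices that silently shape what surfaces first.
from collections import Counter, defaultdict

documents = {
    "campaign-homepage": "transit plan transit funding announcement transit",
    "opposition-blog": "why the transit plan fails riders",
    "duck-fanpage": "my favourite ducks of 1996",
}

index = defaultdict(Counter)
for page, text in documents.items():
    for term in text.lower().split():
        index[term][page] += 1

def search(term):
    # Most frequent use of the term first: one of many possible rules.
    return index[term.lower()].most_common()

print(search("transit"))
# [('campaign-homepage', 3), ('opposition-blog', 1)]
# Swap in a different scoring rule (recency, inbound links) and the order can flip.

The specific weighting hardly matters; what matters is that some weighting is always there, chosen by someone else, long before the historian types a query.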

Transparency, Humility, and Subjectivity

This all underscores the inherent subjectivity of source material from any time and place. Web archives and the data they contain do not represent any form of objective or complete knowledge about the past, no more so than any other inherently subjective historical method. Claims about truth were flirted with in the field’s first engagement with Cliometrics in the 1960s and 1970s and are again rearing their head.85 Peter Turchin’s idea of cliodynamics, published in Nature in 2008, argued for a study of patterns that “cut across patterns and regions,” to “collect quantitative data, construct general explanations and test them empirically on all the data … To truly learn from history, we must transform it into a science.”86 Here science refers to reproducible, empirical research, as opposed to the subjectivisms of humanistic inquiry. While Turchin does not speak directly to the digital turn, the idea that larger datasets somehow translate into objective, empirical research is not a rare one. This explicit idea underpinned the Wired editor’s aforementioned “end of theory” piece.87 Just as the debate between empiricists and relativists reared its head during the 1960s and 1970s engagement with computers, it is a recurrent

theme today. Earlier practitioners embraced objectivity as the rest of the historical profession was largely turning to an understanding of subjectivity, leading to estrangement.88 With these earlier lessons in mind, historians and other scholars using web archives need to recognize the inherent subjectivity of the tools they design and use. The results of a “distant reading” algorithm may appear to be objective, in that the same algorithm run on the same dataset will lead to the same answer. But if I designed the algorithm, my subjectivity is embedded in it: the weight given to various categories of analysis, the encoding decisions taken, the process by which unstructured text was turned into the structured data that computers make sense of, etc. And beyond that, the archives themselves are not perfect representations of the underlying reality. None of this is a neutral process: at almost every step it reflects or should reflect the user’s judgment as a researcher. All this comes before the critical process of making sense of the data. And even if the historian arrives at the same “answers,” in sources and results, it does not mean that she will draw the same conclusions. As always, results need that extra step of interpretation. This requires transparency in our methods, more so than historians have often been rigorous about and comfortable with.89 Yet here we can see overlap between the new skills and approaches needed with born-digital sources as well as those from earlier periods that have been digitized. Transparency requires a conscious effort, whether citing a website or a nineteenth-century newspaper article from a database. This can range from rethinking how historians cite material, to sweeping changes in our research methods. Abundance and scale, as well as technological change, are reshaping almost all aspects of historical research. Indeed, I have made similar arguments in an article that addressed how Canadian historians had used digitized historical newspapers. Just citing an article in the Canadian Globe and Mail, I argued, for example, is not enough. Our citations need to have more information, such as noting that it was found in a searchable database, a microfilm, or the original newspaper. We should make research processes somewhat reproducible by providing basic details of the search strings used and the medium studied. Was it accessed by a LexisNexis search, for example, or by ProQuest, or an archived newspaper copy?90 Even with non-digitized archives, historians increasingly travel with their digital cameras in hand; most archival reading rooms are now full of historians

bent over desks taking photos with cameras, smartphones, or even iPads. They then return to their home institutions or offices, sit at desks, and catalogue and make sense of the material they collected. Efficient, to be sure, but it also means that the ability to respond to sources, to pull additional boxes or change research questions on the fly, is considerably reduced. In both examples, the same documentary record can be mediated in very different ways by technology. The same stands true for web archives, where the pathways we take to data can be just as important as the source itself. Along these lines, scholars Trevor Owens and Fred Gibbs have argued that we need a new approach to historical writing that “will foreground the new historical methods to manipulate text/data coming online, including data queries and manipulation, and the production and interpretation of visualizations”; they point towards humanistic traditions of openness, helping to foreground methodological work so that it can be both verified and inspirational to other scholars.91 When working with new digital sources, methods need to be foregrounded, not relegated to footnotes or separate blogs. Transparency is imperative in the digital age. Historians operated on an implicit understanding of how the research process at a traditional library or archive worked. As we move into digital tools, the implicit needs to be made explicit. For example, a user needs to explicitly note the sorts of search terms used to query a database. If he used computational methods, the code or at the very least decision tree needs to be provided. For example, were decisions made by a researcher about filtering out images or the plain text from the HTML that creates it? If web archives are to enter the mainstream of historical research, historians need to understand how these sources work.
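What might that explicitness look like in practice? One low-tech possibility, sketched here with invented field names and figures rather than any established standard, is simply to write the details of each query to a log file at the moment it is run.

# An invented, minimal convention for recording how a set of sources was found.
# The field names and numbers are illustrative; the point is that the details
# live somewhere other than the researcher's memory.
import datetime
import json

search_record = {
    "query": "\"idle no more\" AND transit",
    "interface": "library discovery layer full-text search",  # or ProQuest, LexisNexis, the Wayback Machine...
    "date_run": datetime.date.today().isoformat(),
    "results_returned": 412,   # hypothetical figure
    "results_kept": 37,        # hypothetical figure
    "notes": "excluded captures crawled after 19 October 2015",
}

with open("search_log.jsonl", "a") as log:
    log.write(json.dumps(search_record) + "\n")

A few lines like these, kept alongside the research notes, are enough to let another scholar, or a future version of oneself, see exactly which search produced which sources.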

Conclusions

We are in the midst of a third revolution in computational history. The first wave in the 1960s and 1970s saw employment in demographic and economic histories, epitomized by Cliometrics. While producing knowledge that still underpins many studies of today, claims of a “scientific” (and implicitly “better”) history alienated many historians and ultimately saw Cliometrics relegated to a subfield. Then a second wave of computational history appeared in the 1990s, thanks to the personal

computing revolution: word processing, graphical user interfaces such as Windows and Apple’s OS , and the rise of the World Wide Web, which facilitated scholarly conversations on platforms such as H-Net. We are now in the midst of a computational history revolution thanks to three main factors: decreasing storage costs, the power of the internet and distributed cloud computing, and the rise of professionals dealing with both digital preservation and open-source tools. The processes described in this chapter are already part of history, and historical scholarship is beginning to draw on sources generated in the 1990s and beyond on the web and internet. In 2021 it will have been thirty years since Berners-Lee published the first website in August 1991 and launched the web. Professional historians need to be ready to study this period – or at the very least to mentor the next generation of scholars who will. In the chapters that follow, we will explore how this all necessitates profound rethinking.



WEB ARCHIVES AND THEIR COLLECTORS

Right now a web crawler is saving content. This happens every millisecond of every day. If you were to consult a website’s server logs to see who is visiting it, these crawlers would make up a substantial portion of the traffic. A Googlebot gathers content for Google’s monumental index. Ia_archiver collects information for the Internet Archive. Bl.uk_lddc_bot gathers material destined for the British Library’s legal deposit collection. Today these web crawlers are the primary agents for preserving the born-digital cultural heritage that makes up our lives. Thanks in part to these crawlers, we can “go back in time” with the Wayback Machine and see websites as they existed stretching back to 1996. Today hundreds of people around the world engage in the collaborative practice of web archiving. This ranges from the broad, global crawls of the Internet Archive to small, subject-specific web archiving carried out by smaller institutions or individual researchers. Some archiving is done with very particular purposes in mind: a research project, or saving material before it is rapidly deleted (during a change of government, for example, or an unfolding natural disaster or war); other archiving is done with nothing other than an eye to making sure people in the future have the broadest possible understanding of what the web – and by extension our world – looked like today. Collecting this information requires global effort. Every year delegates gather at meetings like that of the International Internet Preservation Consortium – an important organization formed in 2003 of libraries, archives, and researchers involved in preserving the web, which has an annual conference – or other summits, symposiums, and meetings, sharing experiences, thoughts, technical specifications, and research examples, all with an eye to improving the quality of our archival records. Web archivists know that not everything can be saved, not even a large

percentage of it. Yet as the decisions that they make dictate, in part, what is saved and what is not, they need to document and be conscious of the decisions that they are making. In the absence of an overarching code of principles, much is left to the individual memory professionals at each institution. With a line of code here or there, a website can be forgotten forever or included in the ever-growing global archive. It is a humbling task. But ultimately the volume of web material is too high to be selective on a granular level. It may be easier to try to grab it all, within user-implemented constraints, than to curate selectively. Just as importantly, what seems to be important today might not really matter tomorrow. Conversely, of course, the mundane might end up being significant. This chapter explores web archives from the perspectives of the collectors. Who are they and where does the data end up? What are they trying to do? How do they collect pages? What challenges do web archivists face? It is a story that begins from the fears of a “digital dark age” and a recognition that if our very earliest web history has been irrecoverably lost, we can learn from that lesson and grapple with the ever-present problem of preserving digital information today. It accordingly begins by introducing the concept of a web archive and its origins, before exploring the biggest threat to saving the web: our collective cultural apathy. It is a story of utopia – the impetus to save as much information about the world around us as possible – as well as dystopia, in the fears of a “digital dark age” where information will be irrecoverably lost. The reality is somewhere in the middle.
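To return to the server logs mentioned at the opening of this chapter: because crawlers announce themselves in the user-agent field of each request, a site owner can measure roughly how much of their traffic comes from these collecting bots. The sketch below is a rough illustration that assumes a standard combined-format access log; the file name and the resulting counts are invented.

# Rough sketch: tally requests from the crawlers named above in a web server
# access log. Assumes the common "combined" log format, where the user agent
# is the final quoted field on each line.
from collections import Counter

KNOWN_CRAWLERS = ("googlebot", "ia_archiver", "bl.uk_lddc_bot")

hits = Counter()
with open("access.log") as log:
    for line in log:
        try:
            user_agent = line.rsplit('"', 2)[-2].lower()
        except IndexError:
            continue  # malformed line; skip it
        for crawler in KNOWN_CRAWLERS:
            if crawler in user_agent:
                hits[crawler] += 1

print(hits)  # e.g. Counter({'googlebot': 1043, 'ia_archiver': 87}) -- illustrative numbers

Nothing about this is specific to any one institution’s crawler; the same few lines simply make visible how routine, and how constant, the collecting described in this chapter has become.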

The First Webpage

What does the web’s first page look like? We will never know for sure. Launching in December 1990, http://info.cern.ch served as the primary entry point for people first arriving on the web (it maintained a central list of servers on the web) and helped spur the critical mass of sites necessary to sustain such a large project. Berners-Lee has called the web a “bobsled,” where creators needed to ensure the sled got started before hopping in and letting momentum take it along its way.1 This first site, http://info.cern.ch, was the “pusher,” which made sure the sled took off. As far as the importance of a single site goes, it’s extremely significant. Thinking about this page is a useful way to begin grasping the monumental challenges that face the collectors in the age of web archives. Widespread archiving by the Internet Archive did not begin until 1996,

which meant that they could not preserve http://info.cern.ch, as it had already been removed. By 1996 the site had been converted into a museum exhibit.2 The address in the browser bar might have been the same, but the first website was gone. Other websites from this period have met a similar fate, such as the Economist magazine’s early 1994 site.3 However, all is not lost if you are willing to spend significant time and money in recovery – and in doing so, we can learn a bit about how to preserve and reconstruct fragile digital information, as well as the difficulties of scaling this process to meet the Big Data deluge. To celebrate the twentieth anniversary of CERN making the web’s technology freely available, the World Wide Web Consortium (W3C ) – the primary standards body of the web – began reconstructing and ultimately relaunching the original webpage http://info.cern.ch. They had some early backup copies at CERN , the first from 3 December 1992 at 08:37:20 GMT , but otherwise had to rely on screenshots that had been published in print material to know how the site originally appeared. These included a picture in the December 1991 CERN  Newsletter, and another from April 1992.4 Between those two dates, we can see considerable evolution, demonstrating that the site was changing during that time. At first glance, it seems odd to be consulting print books to reconstruct digital objects. But paper is a surprisingly durable material. Imagine a book published in December 1990 (or even 1790!) and placed on a shelf in your living room or office. Barring a flood or fire, it is almost certainly readable today. The pages might yellow a bit, there might be some dust, but the content would be intact. Paper is not perfect – especially paper susceptible to acid, like a pulp novel from the 1960s – but most books today are printed on acid-free paper and can survive for centuries. A website similarly created in December 1993, let alone 2018 or 2019, will not exist without active maintenance. This ranges from paying server fees, to ensuring backups in the case of a hard drive fault, to making sure it is safe from malicious intrusions. Neglect is far more dangerous to digital sources than it is to conventional print ones. Even today we know much more about how 1990s netizens built GeoCities pages, for example, from two books – the Creating GeoCities Webpages and Yahoo! for Dummies – than from the archived pages themselves (which were not fully preserved because they were interactive). In any case, in 2013 CERN restored the December 1992 copy of the first website to the original URL , rudimentary HTML code and all, to widespread media fanfare.5 There was enough interest that the website

was down for days in response to the sheer number of visitors. As the Washington Post explained, the page had been “dragged out of cyberspace and restored for today’s internet browsers as part of a project to celebrate 20 years of the Web.”6 For a historian, however, what does this mean? Had the first webpage been saved? Indeed, can an archived representation ever capture the original? What does it truly mean to be able to access a backed-up version of a site from 1992? Consider some complicating factors. Most early visitors to the CERN site would have arrived using Nicola Pellow’s line-mode browser, rather than a fully functional web browser, where one could click on links and engage with it like we would today. It would be folly to view CERN ’s page on a modern Chrome browser and consider it equivalent to what it was like to experience the site in 1991. A project at CERN worked to replicate this experience and created a simulation of it. Difficult decisions had to be made: they had to “fix” modern HTML to deliver pages that would better resemble their original appearance; they had to rewrite links so that they had reference numbers instead of just being standard hyperlinks, and several other decisions along those lines. This point is worth underscoring: to make a defunct site historically accurate from the perspective of a user, they needed to deform the source. This is because users navigated these websites not by clicking on links (as the most common interfaces were keyboard based), but by finding the reference next to a link and typing it on their screen (the link might be numbered “23,” and then you would type “23”).7 The archival team at CERN had to manually slow the drawing of characters: modern systems are so quick that text just all appears at once, but the early browsers often rendered them character by character. Extensive work went into recreating the typeface used: they took photographs of an old machine that was still running, recreated it manually, and in doing so realized that this first browser did not support lower-case accents, for example.8 The process of recovery and reconstruction of the original CERN page is a fascinating story of media archaeology, illustrating both what is possible as well as the expensive and time-consuming nature of doing it “right.” You can visit http://line-mode.cern.ch/ and judge the results for yourself (or see figure 2.1). The website does a good job in recreating the experience of browsing the web in 1992. It emulates the look of a monochrome cathode ray tube monitor, with low-resolution and high-contrast green characters appearing on a black background, appearing one

2.1 Line-Mode Browser Emulator.

character at a time. As you enter commands, your speakers emit the clacking sound of working with an old, heavy keyboard. It is a simulacrum of the first webpage. However, it also underscores the inherent limitations that web archivists face. We cannot do this for every single site, or even a tiny percentage of them. The saga of the first webpage demonstrates the difficulty of recreating a website, and whether it is even truly possible. Given the need to actively preserve content, decisions to preserve sites needed to be made as far back as 1991 in the case of CERN ’s first website. This means that archiving must take place far earlier in the content’s lifespan than with print material. It is a race against time. It also reminds us that the archived material that is preserved will be viewed through technology that is very different from the original: produced for flickering CRT  monitors, with attendant eyestrain and colour contrast, versus how we view them today on modern LCD screens – not to mention how different one would have been designed for line-mode browsers rather than graphical ones.


What Are Web Archives?

The term web archive is not straightforward. Given the many different communities involved in web archiving – professional archivists and librarians, technologists, and digital humanists – it is perhaps no surprise that each body of practice brings its own understanding of the term itself. Disentangling the different meanings of the concept will help us to understand what is involved in the collecting, sorting, cataloguing, and analyzing of these sources. A good starting point is a traditional archive. While every archive is different, for many historians there is a familiar rhythm and pattern to how they work from a user’s perspective. The first step is usually a finding aid, found either through a Google search, the bibliography of another book, or expertise in the subject matter; many of these are online (sometimes as PDFs and other times as interactive pages), while others require a visit to the archive itself. Finding aids have details on what one could find in an “archival fonds,” or the assemblage of documents from the same source (the records of the student government of the University of Waterloo, for example). These painstakingly assembled documents often have box-level information – a box of financial information here, or a box of activist pamphlets there – and some have file-level information (a file on the records from 1977, for example, or 1978). Once a historian knows what she might want to look at, she places a request with the archive for specific boxes. They arrive, the historian carefully removes file folders or records, and all of the document review is done in a specialized reading room. At all times, documents are carefully handled, and the original order of the boxes, files, and documents themselves is preserved. The historian benefits from the work and care of the archivist, which facilitates her research findings. What seems relatively simple from the perspective of a user is the tip of the iceberg of a rich tradition of archival theory, research, and practice, which I can only nod at here. It is important to remember that historians are not the sole users of archives; indeed, we might be best understood as expert users of these systems. John Ridener, a leading thinker in the world of archival science, notes that at a basic level the archive “has been conceived as the repository for institutional knowledge between its usefulness as active records and its source of information for
historians and researchers.”9 He is echoed by the Society of American Archivists’ (SAA) official definition, which holds that archives are “materials created or received by a person, family, or organization, public or private, in the conduct of their affairs and preserved because of the enduring value contained in the information they contain or as evidence of the functions and responsibilities of their creator, especially those materials maintained using the principles of provenance, original order, and collective control; permanent records.”10 In general, this means that archivists collect materials to assemble collections that have an organic relationship within them, with similar provenance. For example, a repository might have the archives of Bertrand Russell, or Hillary Clinton, or the Free the Children Charity, but would be less likely to have a thematic archive of similar documents selected from multiple donors. An archive of how Canadians understood 9/11, for example, or charitable giving in 2014 would not be an archive in this formal sense. By this standard, the Occupy Wall Street or 11 September web archives are not archives either – and are perhaps best understood as digital collections. They would be more at home in the institutional context of a library, perhaps a Special Collections department. The three core principles of archiving outlined in the SAA definition above help bring this into relief. They are provenance, original order, and collective control. Provenance refers to the history of an object or collection, identifying where it came from. Archives respect this and ensure that documents created or maintained by the same individual or organization are represented together and are kept separate from records with a different provenance. The principle of original order means that the physical or intellectual order of the records established by the creator is maintained. Finally, the principle of collective control refers to the practice of considering archives as “aggregate information,” rather than individual items, and stresses keeping documents together and in their original context.11 The core concept of respect des fonds, which is the normative standard for archival arrangement and description in modern archives, emphasizes origin and provenance over the kinship of subject matter; it emphasizes genealogy over similarity. Alongside these three principles is a rich theory of appraisal, the process of selecting what does and what does not belong in these institutional repositories of memory; understandably, these are critical decisions.12

Outside of research and scholarly contexts, people are most likely to see the word archive in the context of technology. We “archive” our email and we compress files into “archived” ZIP or TAR files. The usage of archive in this context stems from tape archives, an artifact of the previously high cost of storage. If you had data that you wanted to preserve or back up, you could write it to a tape archive and store it, often off-site. Magnetic tapes were cheap and allowed people to store GBs, but at the cost of long wait times if you needed to retrieve the information: the tape would continually rewind and fast forward to navigate various data blocks. Imagine watching a video and wanting to jump to a very specific moment, but being able to do so only with the fast forward and rewind buttons – that is a good comparison for how slow data retrieval on a tape can be. The TAR file format, or Tape Archive file, harkens back to this.13 To archive something is thus an indelible part of computing lingo, thanks to these earlier concepts. Indeed, the file format that comprises modern web archives today is at the heart of this. The WebARChive, or WARC, file is an ISO standard (ISO 28500:2009). The WARC file format, certified by the International Organization for Standardization, preserves web-archived information in a linked form.14 Imagine a modern website, and the hundreds of thousands or millions of files that make it happen: a university’s domain, for example, replete with HTML files, CSS style instructions, graphics, Microsoft Office files, PDF files, and beyond. Indeed, we will return to this point, but this hints at the complexity of even a single “page” so defined. A single webpage – a blog post, for example, or an author’s promotional page – is in reality witness to a complex interplay between hundreds of different files: a Twitter widget being called from Twitter.com, a JPG image, an HTML document providing text, a PDF embedded in the bottom of a page with a press release. It is not worth getting too granular at this point, but it underscores the point that on the web there is rarely a single document that you can reproduce: it all exists in complex orchestration with many other elements. In all of these cases, for each single file that you archive (from the JPG image to the PDF file) you would want to have some metadata – when it was captured, what captured it, what the full URL was, and beyond. To meet both the needs of metadata and the challenges of working with millions of files, the WARC file concatenates all individual files into much bigger files. Imagine if you had downloaded all the pages
from a university webpage: you would now have thousands of files and would want to keep track of (1) when you downloaded it, (2) what tool you used to do so, and (3) the title of the page, URL , and beyond so you could conceivably retrieve some files later on. A WARC fixes this problem by taking all these files and combining them with some of that metadata. It takes hundreds or thousands of files and combines them into one file, with metadata blocks before each individual entry. This is not to reduce web archiving to the file format, however, but to note that in the sense of TAR and ZIP files being archives, the web archive continues this process. Although the “digital archive” is a born-digital format, this does not explain why the digital humanities have taken to the concept with such gusto. The “digital archive,” in the sense of the Center for History and New Media Occupy Wall Street Archive or the September 11 Digital Archive, brings with it a definition encompassing the archive as a cultural record, but also the digital sense of being able to cheaply store large amounts of information over a long time. This model has been popular within the academic community, with hundreds if not thousands of projects, including “Our Marathon,” a Northeastern University project that collects memories and online resources pertaining to the 2013 Boston Marathon bombings.15 These critically important projects serve as sites of memory and steward our cultural memory both for scholars and for citizens, who may want to explore these traumatic events in depth. Such thematically organized collections do not meet the normative definition of an “archive,” yet they enthusiastically employ the term. In this, it perhaps grows out of a different vision of an archive, such as ethnographic field collections, as Trevor Owens has pointed out.16 Unless and until the SAA revises its own definitions and protocols for the creation and maintenance of archives, thinking of both their print and electronic collections of records, these digital archives should best be understood as “digital collections.” Where does this leave web archives as a term, then? The web archival field incorporates multiple experts, from archivists and other information professionals, to digital humanists and technologists. The term itself is unlikely to satisfy people, given the diverse lineages and communities at play in their creation. As Owens notes, the idea of a web archive is “much more in keeping with the computing usage of archive as a back-up copy of information than the disciplinary perspective of archives.”17 The
idea of an archive itself may be changing. In large-scale web archives, such as the UK Web Archive or the Internet Archive, the main characteristics of a formal archive are not present: different websites by different creators are bundled together, with the only shared provenance being that they were published on the web. They may be controlled collectively, but they consist of largely undescribed data, and the “original order” preserved is determined by the ebb and flow of the Heritrix web scraper as a result of the web’s intrinsically unordered nature. As we are finding, the unique nature of the web does not sit comfortably with traditional notions of archiving. Archivists are grappling with the impact of digital objects on their fundamental professional respect des fonds concept.18 In tackling definitional questions, Brügger suggests that we define web archives as “deliberate and purposive preserving of web material,”19 which encompasses a variety of collecting, curating, and preservation. This is a very good starting definition, although interrogating the term with respect to traditional archival practice and technological approaches is useful in teasing out definitional continuities and changes. Another important component of traditional archival work is archival description, or the process of “identify[ing] and explain[ing] the context and content of archival material in order to promote its accessibility.”20 Archival description aims to follow a set of common practices that help researchers and other interested parties find information. In Canada, for example, our rules – aptly named the Rules for Archival Description, or RAD – range from standardized capitalization and punctuation, to controlled vocabularies of document types, qualifiers, place names, and so forth.21 Things are not straightforward for web archives, or with many other forms of digital media (or traditional archives, for that matter). As Nick Ruest, Anna St-Onge, and I noted in an article about an archive we collected, “Archival descriptive standards still limited our provision of an accurate measuring of our materials. How does one provide a physical extent of scraped and archived files, for example? How many meters are a thousand tweets?”22 We were not alone with these issues. Allison Jai O’Dell, an archivist at the University of Miami, grappled with similar issues when accessioning a collection of websites. She wonders: Should websites be described as individual pages, or as collective sites? Would they be added into library catalogues or finding aids? Who is the creator – their library or
the original authors? – and so forth.23 All of these questions illustrate the awkwardness with which web collections fit into the archival tradition, but also the degree to which earlier traditions of description and other practices are evolving. Why then, does this book use the term web archives, as opposed to say digital collections? In some ways, it is a pragmatic decision, nodding towards contemporary practice and nomenclature. For better or for worse, we have the Internet Archive, we have the UK Web Archive, and we colloquially and largely professionally refer to web archives and web archivists. For historians, it provides a familiarizing point of reference: we are used to finding unstructured primary sources in archives, using finding aids and other search tools developed by archivists, and it lines up well with workflows. It is also a useful acknowledgment of the computational methods that underpin much of the work that we do. The three definitions, that of the digital humanist, technologist, and archivist, come together when using these materials. We do, however, need to be mindful that words have power, and appropriation comes with downsides. Web archives are not traditional archives – not in content, form, or conception.
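
To make the WARC format described above a little more concrete, the sketch below shows how its concatenated structure can be read programmatically. It is only a minimal illustration, assuming the open-source Python library warcio (from the Webrecorder project) and a hypothetical file named example.warc.gz; it walks through a WARC and prints the metadata – target URL and capture date – recorded ahead of each archived response.

from warcio.archiveiterator import ArchiveIterator

# Walk through every record in a (possibly gzipped) WARC file and print
# the metadata block that precedes each archived HTTP response.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, url)

Because each record carries its own headers, thousands of captured files – HTML pages, images, PDFs – can live inside a single WARC while remaining individually retrievable.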

A Brief Origin Story of Web Archiving: From Digital Dark Age to the Infinite Archive

In 1995 Michael Lesk, a Rutgers University computer scientist, noted the new challenges and opportunities facing librarians and archivists. On one level, more digital data should lead to a much richer historical record. Yet this hope for a better record needed to be tempered, Lesk noted: “We do not know today what Mozart sounded like on the keyboard … What will future generations know of our history? … But digital technology seemed to come to the rescue, allowing indefinite storage without loss. Now we find that digital information too, has its dark side.”24 The 1990s was an exciting time as the personal computer revolution continued to expand. Digital objects were becoming far more numerous than ever before. Files were increasingly sophisticated and complicated: images and text were combined within single files, presenting preservation challenges. Physical storage mediums were also rapidly evolving, as floppy disks now co-existed alongside CD-ROMs. Both had different, and to some degree unknown, lifespans. Compounding these technical
challenges were legal ones too, as copyright laws do not have exceptions for preserving digital material. For example, to save a particular file or multimedia experience might require copying it to a different format, which in many contexts would be illegal. The nascent field of digital preservation thus found itself facing a difficult proposition. These pressures were all compounded by the fact that as digital tools became more accessible, more and more people were using them – records were increasingly likely to be born-digital with no paper copy. All this meant that the ideal of a vastly improved historical record thanks to digital storage had to be tempered by the spectre of a digital dark age. In the most apocalyptic visions, the digital turn could mean that little would be preserved – even less than if the shift towards recording information in digital formats had never happened. As W. Daniel Hillis would later recall, there were fears that “historians will look back on this era and see a period of very little information. A ‘digital gap’ will span from the beginning of the wide-spread use of the computer until the time we eventually solve this problem.”25 It is worth recalling that this threat of a medium change was not unique to the digital turn. In 1929 Robert C. Binkley, a historian at New York University, raised the spectre of the then-recent cultural heritage residing on fragile, acidic pulp paper. Lisa Gitelman explains, in words that seem familiar when applied to the digital context today: “Cheap paper had enabled ‘the development of the culture’ over the previous half-century, supporting the institutions of a healthy civil society: a robust publishing industry and universal literacy as well as governmental and nongovernmental bureaucracies and scholarly subspecialisation. But the same cheap paper boded ill for future historians.” “The records of our time are written in dust,” Binkley warned in a talk at the First World Congress of Libraries and Bibliography in Rome in 1929.26 Binkley’s proposed solutions involved reproduction, from microfilming fragile documents to other forms of mechanical duplication.27 The concerns raised by Binkley and other scholars led to dramatic advances in technology, the widespread adoption of acid-free paper, and in general a concerted effort towards paper preservation. Crucially, the widespread adoption of microfilm facilitated research projects around the world – allowing scholars as well as members of the general public access to primary and secondary sources they could not otherwise personally consult.

The 1990s need to be understood in this longer tradition.28 They would similarly mark a turning point in how we preserved digital material. Just as cheap paper revolutionized cultural production and opened avenues for more democratic sources, so too could digital technology. Out of this came the Internet Archive, co-founded by technology entrepreneur Brewster Kahle. Kahle had developed the Wide Area Information Servers (WAIS) architecture, an early solution for navigating the internet that helped users find relevant documents and information quickly. Kahle had also co-founded Alexa Internet with Bruce Gilliat, a company that harvested the World Wide Web to track website rankings and traffic. Crucially for what was to develop, Alexa operated by downloading a copy of each website that it visited.29 Kahle’s two fields of information retrieval and web harvesting would come together fortuitously. After WAIS was sold to service provider America Online in June 1995, Kahle and Gilliat founded the Internet Archive in June 1996. As noted in the introduction, the Internet Archive grew out of the fears of a “digital dark age,” a recognition that as more and more human activity and culture was playing out online, it was in danger of being lost. Today, with the goal of “universal access to all knowledge,” the Internet Archive has expanded to collect not only old webpages, but books, music, videos, images, text, software, and beyond. Kahle was looking forward to much of this in 1996 as the media began to note this audacious undertaking forming in San Francisco. Looking to the future, Kahle told the Philadelphia Inquirer that the “archiving must continue as the Web evolves. I’m not going to read much of it myself, but someday, people will be glad it’s available. I’m glad someone is doing this.”30 The Internet Archive, as a centralized hub with lofty ambitions to preserve the extant web, would fill a critical gap. From 1996 onwards, they began to rapidly collect, relying for the first few years on Alexa crawls and later on their own explorations of the web. The Internet Archive collects data by sending web crawlers out into the web to download the webpages that they find. Depending on how the web develops and the limits placed on a crawler, this is a potentially infinite process – hence the “infinite archive.”31 Crawlers visit sites, download what they find, collect all outbound hyperlinks, visit those, download, collect all of their hyperlinks, and so on. A tangible example might help bring this into relief. Imagine a crawler visiting the news
site CNN.com. It downloads the page, finds a link to another article (say about the Republican Party’s new platform). It then follows the link to that article, downloads the page, and finds a link to the Republican Party’s platform on GOP.com. It then follows that link, downloads it, and so forth. You can probably imagine how this process is indeed potentially infinite. A crawler could archive forever unless it is given a finite “depth,” or limit of the number of links to follow from the starting page (for this reason, depths or limits are always imposed). Over the next few years, the Internet Archive continually preserved large swaths of the internet and World Wide Web: two TBs by May 1997, almost fourteen TBs by March 2000.32 This data was physically stored in the Internet Archive in San Francisco with off-site backups as well. The size, however, precluded human cataloguing or even traditional searching, until 2016 and the advent of limited keyword searching on site homepages. Scale is, of course, a double-edged sword: scale makes the Internet Archive’s collections invaluable but also hard to grapple with. Underscoring the importance of the rhetoric around digital preservation, the Internet Archive was not alone in its efforts to archive the web, although it was the most ambitious in both size and scope. Other organizations, from national libraries to large university libraries, were also coming to grips with the tensions identified by Hillis and Lesk, among others. The concurrent emergence of web archiving across three continents in 1996 reinforces the need not to see the Internet Archive as a lone hero – from its inception, web archiving had global support (although the Archive has often adopted a critical leadership role that should also not be minimized). It also underscores the point that web archiving is being carried out by a range of actors, from the non-profit Internet Archive to national libraries; if something were to happen to the former – a dramatic change in leadership, closure, bankruptcy, implementation of a paywall, or something else (none of which is plausible in 2018!) – web archives would continue, thanks to their dispersion. In 1995 the National Library of Australia “identified the issue of the growing amount of Australian information published in online format only as a matter needing attention” and struck a committee to begin work in January 1996.33 By late 1996 the library began to download websites, focusing on “selected Australian online publications, such as electronic journals, government publications, and web sites of
research or cultural significance.”34 In the United States, early projects included the Smithsonian’s archiving of the 1996 American presidential election, and the University of North Texas’ CyberCemetary (1997), which provided access to defunct federal government websites.35 In Sweden, an even more ambitious project was launched: Kulturarw3. Their Royal Library recognized that “in the future an increasing amount of material will be published on the Internet, and only on the Internet,” requiring an expanded mandate. Beginning in 1996, they began to carry out wide scrapes of the Swedish web, completing seven between 1996 and 2000. This required them to come to grips with what a national web is, a complicated question that continues to vex national libraries today. Kulturarw3 used three collecting metrics. First, they selected sites belonging to the Swedish top-level domains, or those ending in .se. Second, they selected websites registered with Swedish addresses/telephone numbers. Finally, they archived selected websites hosted under the .nu top-level domain, which means “now” in Swedish, and was a popular place for Swedes to develop websites.36 Kulturarw3 collected some 300 GB of data per web sweep. While necessarily more limited in scope than the Internet Archive, undertakings like those of the Australian National Library, the Swedish Royal Library, and North Texas showed recognition of the problem of saving digital cultural heritage. The year 1996 thus looms large as the time that the web began to be preserved en masse, both by national libraries as well as by the Internet Archive. Hillis’s “digital gap” lasted only five years or so, between the advent of the public web in 1991 and the commencement of widespread web archiving in 1996. Yet for that five-year period we are largely out of luck if we want to explore early websites. With parts of the web now being preserved, the next step to combating a digital dark age was to find a way to provide access. Some archives would become dark archives without public access, more would limit to on-site access only, but others would need users to make the case for their continued relevance. By 2000 the Internet Archive was granting researchers access to its data via secure command-line. While this allowed researchers to access the San Francisco–based data via the internet, using the command line to remotely work on servers certainly required (and would still today require, as discussed in chapter 6) technical skills beyond those of the average researcher or user. It would not be until
late 2001, with the Wayback Machine, that the Internet Archive would be accessible to a general public. The Wayback Machine is a web portal that allows a user to type in a web address – or, as of 2016, to run simple text searches on domain home pages – and explore archived versions of the pages. It was launched on 24 October 2001 to great fanfare at the University of California, Berkeley’s Bancroft Library. There Kahle pulled up the first demonstration website: a Clinton-era press release “proclaiming the prevention of hijacking and terrorist attacks in the air a priority.”37 This dovetailed with the Internet Archive’s ongoing attempts to preserve the digital footprint of the 11 September attacks, in conjunction with the Library of Congress. The Wayback Machine was overwhelmingly popular, and a flood of requests meant that service was intermittent into late November. The predominant paradigm for accessing web archives – the Wayback Machine – was upon us, a state of affairs that would continue until the present day. A new era of historical research had begun. Knowing what we do about the fragility of digital sources, the decisions taken in the early to mid-1990s to preserve this content were fortuitous. So too were subsequent technological advantages and policy changes, in particular the expansion of legal deposit to cover web archives – which we will discuss in detail in chapter 4. Yet we cannot be under the impression that most things are preserved, for they are not. Four major issues continue to shape the web archives being produced today: link rot, walled gardens, the robots.txt file, and the neglect of corporations to steward our user-generated content.
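
Before turning to those four issues, it may help to make the crawling process described in this section concrete. The toy sketch below is a breadth-first crawler written against Python’s standard library – emphatically not the Heritrix software the Internet Archive actually uses – with the seed URL and depth as arbitrary placeholders. The max_depth cap is what prevents the crawler from wandering forever, for instance down an endless chain of “next month” calendar links.

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_depth=2):
    """Breadth-first crawl from a seed URL, stopping after max_depth hops."""
    seen = set()
    frontier = [(seed, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # dead link, timeout, non-HTML content, and so on
        # A real crawler would write the downloaded page to a WARC here.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append((urljoin(url, link), depth + 1))
    return seen

pages = crawl("https://example.com", max_depth=1)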

404 Not Found: The Short-Lived Website

“Sorry! We can’t find the page you’re looking for.” A 404 Not Found page is a constant reminder of digital fragility. This usually minor inconvenience speaks to a major problem for our historical record, especially given the regularity with which one encounters 404s. And even if the content has simply been migrated to a new place, the changed URL can wreak havoc with our ability to find what we are looking for. Just how long does a website last on average? There is no easy answer. Oft-cited lifespans range from Scientific American’s 1997 figure of 44 days, to 100 days estimated by Nicholas Taylor, a leader in the field of web archives.38 Taylor notes that the very question of website lifespan is
difficult to tackle: “For instance, is a ‘webpage’ defined by its URI or by its contents? A non-resolving link doesn’t necessarily imply that the content once hosted there no longer exists (1); it may have been archived or simply exist at a new location (albeit, one mediated by a paywall) to which the web server was not configured to redirect page requests. … An automated link checker visiting a list of URI s and logging all ultimately successful and failed requests would miss these subtleties.”39 By exploring large collections of hyperlinks and measuring the incidence of 404s, we can begin to get a rough proxy for this question. One study of 100,826 computer science articles carried out in 2000 found that the percentage of invalid links varied from 23 per cent from articles published in 1999 to 53 per cent for those published in 1994.40 There have been at least seventeen similar studies, with valid link responses ranging between a high of 80 per cent to a low of 39 per cent.41 A 2013 study reviewed links in three legal journals, the Harvard Law Review, the Harvard Journal of Law and Technology, and the Harvard Human Rights Journal, as well as every Supreme Court of the United States ruling, and discovered a “serious problem of reference rot: more than 70% of the URL s within the above-mentioned journals, and 50% of the URL s within U.S. Supreme Court opinions suffer reference rot.”42 While these studies are not likely to represent broader web content, which is probably more fragile (having not been cited in a Supreme Court or academic article), they show the scale of the problem. Another example: within the Google Books database, there are over 247,000 references to http://geocities.com, almost all of which would no longer work, as GeoCities closed in 2009. Think of all the broken endnotes and footnotes! Scholarship suggests that this is an accelerating problem, given the rising importance of social media. A 2013 study noted that “after the first year of publishing, nearly 11% of shared resources will be lost and after that we will continue to lose 0.02% per day.”43 Grassroots sites – part of the revolutionary appeal of web archives – are particularly at risk, as a domain registered in the heat of a political movement may not be renewed the next year if momentum flags, or leaders disagree, or somebody just plain forgets to pay the server dues or even cannot remember who purchased it in the first place. We can see this in one high-profile social movement. In 2012, Archive-It created an archive of Occupy Wall Street. Two years later, in 2014, only 41 per cent of the sites that they had selected remained on the live World Wide Web
– the majority of sites had been taken down.44 For a future historian of the Occupy movement, this is a disaster narrowly averted only by the active preservation of relevant material by web archivists. Beyond sites simply disappearing, URL s also change. This problem is nearly as old as the web itself. In 1998 Berners-Lee himself argued that “Cool URI s don’t change,” encouraging developers to ensure consistent identifiers for their documents and pages.45 Despite the fragility of digital sources, we do collect quite a number of them – so much so that the Internet Archive has petabytes of old websites. The question then is, what do we end up preserving? What do we not capture? Are our archives representative in any way of the existing web? These are difficult questions. The dynamic, ever-changing nature of the web means that there is no answer to just how big it is, let alone which sites exist and which do not. When asked how many pages are out there in the web, Google answered that they “don’t know.” “Strictly speaking,” Google engineers declared in a 2008 post, “the number of pages out there is infinite – for example, web calendars may have a ‘next day’ link, and we could follow that link forever, each time finding a ‘new’ page.”46 Over a decade later, the problem of the almost literally infinite web is still with us. Just as Google engineers mused about the problem of the calendar, archival web crawlers literally fall into this classic trap. A person would never sit and click the “next year” link on a calendar forever, ending up millennia in the future, whereas a web crawler will happily do so – clicking “next month” until it is millennia into the future. It may appear that a collection has 300,000 snapshots, but if 290,000 of them are connected to a calendar, it is misleading. They “exist” only when called into existence by the web crawler. In short, we will never know the size and vitality of the web with precision, nor how much of it has been preserved. A team of researchers at Old Dominion University tackled this question in a 2013 paper. Drawing on extensive URL sample lists from several directories (the Open Directory Project, Delicious, Bit.ly, and search engines), they concluded that 35–90 per cent of publicly accessible sites have at least one archival copy.47 We have earlier seen discussions about link rot in specific bodies of literature – such as academic articles or legal judgments – but this Old Dominion paper is the only one to tackle the bigger question of the Internet Archive. A smaller but no less valuable
study by Oxford Internet Institute and Goldsmiths researchers compared TripAdvisor properties and reviews found on the live web with those found in the Internet Archive, discovering that only 24 per cent of pages were archived. More importantly, the team “also found that the archived pages do not resemble a random probability sample. There is a clear bias toward prominent, well-known and highly-rated webpages. Smaller, less well-known and lower-rated webpages are less likely to be archived.”48 This is the nature of crawls: they start on popular or important pages, the sorts of pages that appear in the lists that a librarian selects or that make sense for a global crawl of the web to begin on. The Internet Archive, for example, has started its crawls of the web (its “Wide Web Crawls”) by starting crawlers on the top one million pages as ranked by global traffic popularity service Alexa. Alexa ranks sites so that advertisers know how much they can charge to sell advertisements on pages, along with other services. While this is a useful place to start crawls – they need to begin somewhere! – it does build in an unavoidable bias towards the sorts of sites you can find by crawling out from a top million-ranked site. This is not a slight. As there’s no universal directory of websites that exist, inevitably decisions have to be made on where to start. There are also additional technical difficulties, meaning that there is almost a cat-and-mouse war between web developers and web crawlers. Some new web features require responses to be coded or else the content is lost. Consider a page like Twitter.com, which, at least at time of writing, relies on the “infinite scroll” (so do many other sites such as Facebook.com or Tumblr.com). When you visit a Twitter account it loads only the first few dozen tweets and, as you scroll down the page, it loads more. While this saves bandwidth for a user, it confuses a web crawler – it would get only the first dozen tweets. It was not until the introduction of the Umbra capture mechanism by the Internet Archive in 2014 that their main crawlers could simulate the user scrolling down.49 Later in this book, we discuss the technological limitations in archiving Flash that led to whole swaths of the web not being archived.50 All statistics and stories shared here underscore the need to collect and archive the pages of the web, ideally soon after their creation. This is an issue that begs our immediate attention. Whatever the exact numbers, two things need to be underscored. First, digital resources are fragile. Without active preservation, information disappears. Historians need to be engaged earlier in the process, to help identify, collect, catalogue, and
use our digital cultural heritage. Second, digital sources are preserved unevenly. Large corporate websites or mass media organizations are more likely to be collected, or even to maintain their own news archives. Grassroots organizations are less likely to do so. Save the concerted effort to document Occupy Wall Street as it was happening, the majority of the movement’s digital footprint would have been gone only two years later. It is a race against time. As scary as the realization was in the 1920s that cheap paper holding our historical records was rotting away, our digital heritage today is facing an even greater threat.
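
The link-rot studies cited in this section all rest on the same basic operation: taking a list of URLs and recording which ones still resolve. The sketch below is a minimal, illustrative version of such a check, using only Python’s standard library; the two URLs at the end are placeholders. As Taylor’s caveat above reminds us, a failed request does not prove the content is gone – it may simply have moved – so a script like this yields only a rough proxy for link rot.

import urllib.error
import urllib.request

def check_links(urls):
    """Return rough tallies of which URLs still resolve and which do not."""
    alive, dead = [], []
    for url in urls:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                alive.append((url, response.status))
        except urllib.error.HTTPError as error:  # e.g. 404 Not Found
            dead.append((url, error.code))
        except Exception:  # DNS failure, timeout, connection refused, etc.
            dead.append((url, None))
    return alive, dead

alive, dead = check_links(["https://example.com", "http://www.geocities.com/example_page"])
print(f"{len(dead)} of {len(alive) + len(dead)} links no longer resolve")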

The Question of Representativeness

We are not all online, and those of us who are do not interact with the web in the same way. This is an ever-evolving process, from the early 1990s when web access was limited largely to academics, certain industries, the government, and hobbyists who had the means and knowledge to connect their computers to the internet, to the contemporary situation, where internet access is more mainstream, and mobile platforms are supplanting and complementing traditional desktop or laptop access. Yet not everyone is connected, and that point needs to underpin our understanding of these resources. Just as historians read traditional print archives with caution – cognizant that the voices of the working class or minorities are missing or silenced – so too do we need to approach web archives with an understanding of who’s online and who is not. This can all underpin our ethical approaches to these sources as well. Even today, while 2.5 billion or so people are connected in some fashion, over 4 billion more are not. The “digital divide” is a critical question for internet studies scholars who seek to understand how lack of internet access exacerbates inequalities.51 The World Internet Project, which originated at the UCLA Center for Communication Policy, brings together data from over twenty countries to explore varying levels of access.52 Even if pundits and some scholars foresee the entire world becoming connected in the next few decades, we still need to look back as historians to understand levels of internet access in the period we are studying.53 Internet and web usage in 1995 was demographically different from that in 2005 or 2015. Early web historians can make the strongest case for unrepresentativeness. In 1996, for example, the first year of the Internet Archive’s
collections, only 11 per cent of Canadian households had a personal computer of any kind. While this would increase to 36 per cent in 1997, and to 60 per cent in 1998; and from 20 per cent with internet access in 1998 to 40 per cent in 2001, this rapid change also underscored just how many people were not online.54 In 2003 Statistics Canada noted that “households with high income, members active in the labour force, those with children still living at home and people with higher levels of education have been at the forefront of Internet adoption.”55 As late as 2010, the picture in Canada was still uneven: Among the one-fifth (21%) of households without home Internet access in 2010, over one-half (56%) reported they had no need for or interest in it. Other reasons for having no access included the cost of service or equipment (20%), or the lack of a device such as a computer (15%). About 12% of households reported they lacked confidence, knowledge or skills. Relatively more households in the lowest income quartile reported the cost of service or equipment (24%) as a reason.56 Researchers at Canada’s Western University, in a fascinating piece on the persistent digital divide, crunched this data and found clear demographic factors connected to internet connectivity – along lines of “socioeconomic status, education, immigration status, and age.” Through interviews and further studies, they found differences in usage (while both male and female respondents had access to the web, the former tended to be more active users) as well as substantial subgroups of people with lesser activity such as recent immigrants to Canada. In an era where internet usage is connected to both earnings and ability to participate in society, their study is a useful warning.57 Figures in the United States have followed a similar trajectory. Pew Research, the go-to source for demographic information on who’s using the internet and how, noted in 2017 that while almost all (98 per cent ) of adults with an income above $75,000 used the internet, only 79 per cent of those with an income of less than $30,000 did. The disparity also appears when levels of education are compared: 32 per cent of those who do not complete high school do not use the internet, compared to 2 per cent of college graduates who do not. One in five rural households
remains without internet use.58 Some of this lack of rural connectivity owes itself to the widespread lack of broadband connectivity, as well as the reality of an aging, less-educated, rural population in the United States.59 In very remote places like Canada’s North, high-speed internet is often almost entirely unavailable or far beyond the means of the average resident.60 Even these aggregate statistics betray significant differences in how people use the web. Just because two people have access does not mean they are on equal footing when engaging with the medium. Eszter Hargittai calls this a “second-level digital divide,” showing how age was a critical determinant in how people worked with the web. In her 2002 study she worked with a random assemblage of American internet users, assigned them search tasks, and found that “young people (late teens and twenties) have a much easier time getting around online than their older counterparts (whether people in their 30s or 70s).”61 While the study is dated, for a web historian, such investigations are critical to understanding the composition of online content! Other detailed studies have tried to explore who produces content on the web, as opposed to just having access. A remarkable 2011 survey of American adults found that producing content was positively associated with having “consistent and high quality online access at home, school or work and having a high-status information habitus and Internet-in-practice”; in short, class matters. The author, Jen Schradie, noted that “digital production inequality suggests that elite voices still dominate in the new digital commons.”62 Studies of young adults have shown that their propensity to contribute is closely linked to their parents’ educational achievement (another proxy for class).63 Complicating this fact, however, was a 2010 study of college students that found that white students were less likely to produce content than other ethnicities such as African Americans, Hispanics, and Asians.64 The statistics on who produces content can obscure the type of content being produced. In a study done in the United Kingdom, Grant Blank found a typology of contribution. The first was “skilled content,” or activities such as “maintaining a personal website, writing a blog, and posting writing, stories, poetry or other creative work” – in short, work that requires a high level of skill and time to produce. This sort of work tended to be produced by young, technically savvy people with multiple devices but of varying social status. The second was “social and entertainment
content,” such as using social media, posting photos, videos, music, and the like; these users tended to be lower-income, unmarried, young yet technically skilled people. Finally, “political content” was produced by highly educated students or those who used the internet at work.65 While these findings are not generalizable to other countries, they do suggest close attention is needed to who posts, blogs, and provides content on various topics. Finally, there is systematic bias in what can be collected and what cannot be collected from the web. For the largest web crawls, this comes from the sheer scale of the web and the rate at which webpages are created and can disappear before being archived – in this context, it is difficult for the web crawler to discover all pages. Importantly, some pages are easier to find! The Internet Archive’s Wide Web Crawls begin from the “top” one million Alexa sites. Internet Archive crawlers start on these top million pages and then fan out across the web. This gets them a good way towards much of the web. But it does mean that some sites are less likely to be found: the webpage of an academic at the University of Toronto in Canada, for example, will be found by a crawler. On the other hand, the homepage of a teenager, or of an activist outside the Global North, is less likely to be found – the further away a site is in hyperlinks from those top million pages, the less likely the crawler will land on it. Given the difficulty that researchers have had simply quantifying what is included in web archives and what is not, almost no work has been done on the biases within the collection. In 2015 researcher Kalev Leetaru demonstrated the idiosyncratic composition of pages within the Archive’s then-445 billion webpage collection, arguing that “far greater visibility is needed into their algorithms and … critical engagement is needed with the broader scholarly community,” and that the importance of web archiving meant that it could not be left to “blind algorithms that we have no understanding of how they function.”66 Yet the calls for better descriptive metadata, more transparency, and more outside involvement overlook the fundamental problem, which has been well encapsulated by David Rosenthal: “Web archives, and the Internet Archive in particular, are not adequately funded for the immense scale of their task … unless they are prepared to foot the bill for generating, documenting and storing this metadata, they get what they get and they don’t get upset.”67 Legal deposit web archives, such as those in Britain or France, can be much
more comprehensive, as they are able to get lists of registered domains from national registrars (the company that people use to register their website’s domain). Even more importantly, their mandates to crawl this web have been funded – financing and available resources is just as important as the technical expertise or access to lists of sites to capture. Questions of representativeness, or understanding selection bias, are even more pressing when working with smaller collections such as those curated by a library, archive, or other institution, such as those who use the Internet Archive’s subscription service Archive-It. These are collections that tackle an event (an election, for example, or international sporting event) or theme (the pages of political parties, or of activists, or beyond). As I have explored in other work, the selection criteria for these special collections are often opaque, undocumented, and can lead to surprising results when you begin to explore the underlying content.68 In the Canadian Political Parties and Interest Groups collection, for example, why was the Canadian Centre for Policy Alternatives – a left-wing think tank – included, but the right-wing National Citizens Coalition left out? Was it an oversight? The result of a deliberate collections policy? Some other factor when the collection was scoped in 2005? The real limitation likely lies not in agenda-driven exclusions or haphazardness but is similar to the problems facing the overall web archiving: lack of dedicated resources and time. Most web archiving is carried out by government documents librarians, records management professionals, or others who have several mandates; web archiving is simply one item on a long list. An opportunity thus lies in the more idiosyncratic and specialized, smaller web archival collections. While we do not know the precise nature of biases within the broad crawls, we know that they exist: skewing to the powerful and those most likely to be found by following links from the top million Alexa seeds. When documenting and thinking about the collections that archivists, librarians, historians, and others create at a more granular level, we can seek to counterbalance those forces. Intentionally including subaltern, activist, and other perspectives can help lead to a more inclusive overall record of existence. The first step is to be cognizant of the biases within a collection – recognizing, for example, that certain voices are privileged within a collection over others – and frame both collecting efforts and scholarly work on them accordingly. The second step is to begin to reassess the records that are being collected.

2.2 Internet Archive Wayback Machine screenshot.

Transparency is critical and complicated by the sheer scale of these collections. Given how big they can be, the human collectors seldom have the time to document all their decisions, or to describe why a certain website is included and why another is not. In a dream world, whenever a site was captured, somebody would describe why it was captured and how. We, of course, do not live in a dream world and are constrained by time and effort. But revealing the algorithm used to select sites, for example, or the top-level decision-making or objectives that may have led to a given site being collected can help us at least write informed histories. The field is already moving in this direction, as 2017 changes to the Wayback Machine demonstrate. In figure 2.2 I show the results of a Wayback Machine query for the government of Canada’s webpage. As my mouse hovers over the 01:48:50 snapshot from 22 June 2013, note that right underneath the time graph at the top of the screenshot, we can see why it was collected. In particular, we see the string: “Sat, 22 Jun
2013 01:48:50 GMT (why: archiveitpartners, archiveitdigtialcollection, ArchiveIt-Collection-3608, ArchiveIt-Partner-75).” In that cryptic string we can see the reason why the site was collected and put into the Internet Archive that day. In the above example, it was not a random wide crawl that grabbed this snapshot of the federal webpage, but rather the conscious decision of a collector at the University of Toronto (Archive-It Partner 75) as part of their Canadian Government Information collection (Archive-It Collection 3608). Further complicating matters, webpages can exist in an archive and online at the same time – they might be the same, slightly different, or completely divergent. The homepage of the University of Waterloo, for example, at http://uwaterloo.ca exists today (you can visit it yourself in your browser) and also exists in very different forms all the way back to the Internet Archive’s first capture of that site in 1997.69 While add-on programs to your web browser, variously called plug-ins or extensions, can help bridge the divide between a live website and an archived website, these sources exist in many different ways. For this reason, many citation formats require that you note when a webpage is accessed, for example. There are no simple solutions to overcoming the digital divide and the question of who is and who is not included in web archives. The two main starting points are for scholars to be aware of the issue when framing their research questions and to become knowledgeable about the construction of the web archives they use. On the former, this may be as simple as framing findings appropriately. Take a misleading headline on the CBC : “Sorry America: Study Shows Canadians Really Are More Polite.”70 Instead of arguing that “Canadians are politer than Americans,” for example, reorienting it as “our minority of largely white, technologically savvy males are politer than their minority of largely white, technology-savvy males” makes for a better stating of results. In addition, thinking about what is in the web archive and what is not – as a historian would do with a traditional archive – can help make for more rigorous histories. Some of this can be aided by technology, such as the Wayback Machine showing the provenance of some web archives, and others may simply be from reflection. In any case, a more self-reflective practice can enable better scholarship. This approach may not satisfy all people using web archives. Given the complexity of the web, its ever-evolving nature, and the difficulty
of capturing even a small percentage of it for preservation, we will never be able to draw scientifically reliable data from it. We can understand that there is bias in the collection and seek to understand some of its dimensions, but it is unlikely we will ever be able to quantify or truly grasp what is at play. At some point, researchers might be tempted to throw up their hands. Why then, despite the problems of representativeness, should we still use web archives and seek better means of preserving them? Even if we do not all use the web equally, and what we do put on the web is not captured systematically, without bias, we still have the potential to know far more about human culture and activity than ever before. Traditional archives are not free from these problems, either: their collections skew towards those who had their hands on the levers of power or who rubbed shoulders with those who did; they were literate, privileged, connected, usually men, and usually white. Historians recognize the systematic biases at play in collections like those of Library and Archives Canada, the National Archives of the United States, or the British Library, but these imperfect collections have long served as the basis of historiography. Ultimately, web archives need to be understood in the same way. This book argues that web archives are not superior to older archives by virtue of their size and scale; they are simply different. And while they may include voices that never before would have been part of the historical record, they still disproportionately favour the powerful. They still need to be used with care and caution, and historians and archivists need to create smaller, curated collections to help balance out the forces within the broad crawls and to ensure the widest possible representation on the historical record.
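
One practical way to become knowledgeable about the construction of a web archive, as suggested above, is to ask the Internet Archive directly what it holds for a given site. The sketch below queries the Wayback Machine’s public CDX index for a list of captures of a URL; it uses only Python’s standard library, uwaterloo.ca is simply an example, and the endpoint’s parameters and behaviour may of course change over time.

import json
from urllib.parse import urlencode
from urllib.request import urlopen

def list_captures(url, limit=10):
    """Ask the Wayback Machine's CDX index which captures it holds for a URL."""
    query = urlencode({
        "url": url,
        "output": "json",
        "limit": limit,
        "fl": "timestamp,original,statuscode,mimetype",
    })
    with urlopen(f"http://web.archive.org/cdx/search/cdx?{query}", timeout=30) as response:
        rows = json.load(response)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in captures]

# Each timestamp can be combined with the original URL to form a replay
# address of the form https://web.archive.org/web/TIMESTAMP/URL.
for capture in list_captures("uwaterloo.ca"):
    print(capture["timestamp"], capture["statuscode"], capture["original"])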

Walled Gardens and Robots.txt

Most things that happen in this world will not become part of recorded history, either because they are so commonplace as to escape attention or because they are carried out in an oral tradition and preserved (if at all) only in memory.71 Even if printed sources existed at one point, they may have been purged as the result of neglect, natural disaster, government decree, or even deliberate intervention. The same is true of web archives – most of what happens online will not enter them.

This is not a bad thing. Not everything that we write online should end up in a web archive, infinitely discoverable and traceable back to us – as I argue, Facebook and Google will largely be left out of our historical record, and there are increasingly good reasons for individuals to delete their public Twitter streams at risk that out-of-context or decadesold statements or jokes will be used against them. This is not to bemoan their absence – we should not expect Facebook to be in publicly accessible archives, and as a semi-public person and writer I occasionally nervously wonder how my own web presence might be interpreted one day – but to recognize the silences and account for them accordingly in historical analysis. Just as the vast majority of Twitter users will not delete their own Twitter histories, the vast majority of Facebook users will not take extra steps to ensure their future preservation. The future absences from the web archival record can be seen most vividly in the internet giants that have constructed walled ecosystems on the web. They number among them Facebook and Google, but also e-commerce or hosting providers like Amazon and Yahoo. Parts of their collections, such as aggregate anonymized user data (although perhaps not in the wake of the AOL search results controversy), may make it into the record – today, for example, historians can access friend relationship data from defunct websites like Friendster, although that is not necessarily or even likely a roadmap for the future. For the most part it is overwhelmingly unlikely, short of a WikiLeaks scenario, that we will ever have access to the vast majority of the communications and discussions that happen within these networks. Facebook walls, timelines, pictures, and the like will likely experience the same fate as most web sources (and, as noted before, sources more generally): lost to the public record. As big as the Internet Archive is, it is dwarfed by companies like Facebook and Google. Facebook does not consistently update its statistics, but its sheer size is evident. Even now very-dated figures – it is surprisingly difficult to find reliable figures from these large corporations – speak to the incredible scale of data that these corporations work with. As of late 2012, it was daily ingesting 300 million digital photographs, over 500 TB s of data, 2.5 billion individual content items, 2.7 billion “likes,” all contributing to a data centre that holds well over a hundred petabytes of information.72 That 500 TB of data per day represents a
growth equal to one thirtieth of the Internet Archive’s total size, each and every day, and the numbers are even greater today. While there is limited access to anonymized data through Facebook’s Data Science branch, it is not an open platform. In theory, users can download their information from Facebook and make it accessible, but it is probable that, barring dramatic shifts in how our culture values privacy, most of this flurry of activity will not become publicly accessible. Just as Facebook dwarfs the Internet Archive, Google is a colossus compared to Facebook. Google’s data centres are so big that images of them are shared as “server porn,” and photo galleries and descriptions are spread around so that we can see “where the Internet lives.”73 The exact amount of information that Google possesses and generates is not fully known, but one thought experiment – tracking capital expenditures, power consumption trends, overall hard drive manufacturing capability, and Google’s continued consumption of magnetic tapes – led to the rough figure of fifteen exabytes.74 An exabyte is 1,000 petabytes. Think of all the potential historical data that Google has: countless images of houses, apartments, buildings, and streets through the Streetview Project; digitized and fully searchable scanned versions of millions of books through Google Books; extensive satellite imagery in Google Earth; not to mention the exhaustive email holdings of Gmail, hosted documents of Google Drive, and beyond. The absence of Google from web archives also underscores a major problem of contemporary web technology – its inherent personalization in how we find information. One of Google’s core businesses for consumers (its real core business is providing advertising) is the provision of quick, powerful, and refined search results from the web. Its massive index, over 100 petabytes in size, is updated by the minute and is generated by an army of Googlebots. These crawlers navigate the web, following links and hopping from page to page, downloading copies as they go. The exact contours of how Google’s search algorithm ranks pages as it does so remains a proprietary secret, although the general contours are well established. Its PageRank algorithm roughly ranks websites on the basis of who links to a page, and in turn, who links to them, weighting each. While Google does create an archive of the web, it is short lived – researchers have convincingly noted that they are not a replacement for web archives, as they do not have a long-term preservation mandate.75
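
While Google’s production system remains proprietary, the general idea behind PageRank described above can be illustrated in a few lines. The example below is a toy power-iteration over a hypothetical three-page link graph – a demonstration of the principle that a page’s rank flows from the ranks of the pages that link to it, not a description of Google’s actual implementation.

def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank over a dict mapping each page to its outbound links."""
    pages = set(links) | {target for targets in links.values() for target in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Both a.com and b.com link to c.com, so c.com ends up with the highest rank.
print(pagerank({"a.com": ["c.com"], "b.com": ["c.com"], "c.com": ["a.com"]}))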


Search results are personalized for users, based on geographic location, demographics, and account information. If I search for taco in Waterloo, Ontario, I will receive results very different from those I would receive in Palo Alto, California.76 Search query logs give access to the everyday thoughts, concerns, and questions of people – which we can explore for ourselves at Google Trends (https://www.google.ca/trends/). This is critical web infrastructure that eludes preservation. If we all discover our information in different ways, imagine the difficulty ahead for historians seeking to reconstruct it all. It suggests that the nature of the modern web – personalized, dynamic, updated by the second – inherently eludes preservation. The absence of Facebook and Google also points to another issue noted above: ethics. While having Google or Facebook would enrich the historical record, of course their public archival availability would not be an unalloyed good. Coming out of a paradigm where openness is associated with good – much of the work in web archives occurs within an ecosystem of open-source software and open scholarship – my initial reaction was to bemoan the loss of these repositories of information to the web archiving record. Imagine what we could do with the thoughts of millions: reactions to public events, mourning of celebrity deaths, and the celebrations and tribulations of everyday life. Cambridge Analytica, however, has shown us the risks of researcher access. In March 2018 the British broadcaster Channel 4 revealed that Cambridge Analytica, a political consulting firm that had worked on high-profile campaigns, including those of Donald Trump and the Brexit campaign in the United Kingdom, had – among other allegations – used personal information from Facebook to facilitate the microtargeting of political advertisements. This data was obtained by a University of Cambridge academic, Aleksandr Kogan, who had collected it through a Facebook app. Users had signed up to take a personality quiz, which, unbeknownst to most of them, gave the researcher not only access to their own data but also that of their friends. As a result of the size of Facebook networks, this quickly snowballed to millions of accounts – each user who installed the app gave, on average, access to "at least 160 other people's profiles, none of whom would have known or had reason to suspect [that their information was being harvested]."77 While Kogan's research, and the fact that the content was given to Cambridge Analytica to theoretically manipulate electoral outcomes,

struck many as particularly egregious, he was, of course, not the only academic using Facebook material. Other researchers had convincingly demonstrated that just knowing a user's "likes" on Facebook was sufficient to identify information as varied as a user's "sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender."78 In the wake of the Cambridge Analytica scandal, Facebook tightened researcher access to its information, and as of writing the future of academic collaborations with the organization was unclear. In any case, Cambridge Analytica serves as a useful cautionary note – researchers need to be aware of the ethical pitfalls of their research. For web archiving researchers, it is a reminder that not all absences are to be bemoaned: the line between what is available for research and what is not needs to be continuously negotiated. I return to a discussion of ethics at length in chapter 5, when we use the real-world example of the GeoCities web archive to tease out some of these issues. The same power to elude crawling that Google, Facebook, and Amazon enjoy is actually available to anybody who has a web server. This is thanks to the robots.txt protocol. Robots.txt is a text file, placed on a server, that excludes visiting web crawlers. Using it, a website can prevent Google or Bing, or even the Internet Archive, from crawling its site and content. A site can essentially disappear from public view and preservation. While there are exceptions – with Internet Archive government crawls, .gov and .mil websites are collected, regardless of the robots.txt exclusion, and Archive-It partners have had latitude in this respect – in general the protocol remains fundamental to web archiving.79 This is largely the result of understandable privacy and ethical concerns, and respecting conventions on the web, rather than any imposed technical limitations. The robots.txt protocol, germane to the preservation of contemporary heritage, emerged from a very different mid-1990s context. The early web was then characterized by resource scarcity. Bandwidth was limited, and many servers accordingly restricted the number of visitors that a site could have or the amount of data that could be transferred. Once too much data had been transferred, visitors would see a "this site has reached its cap" message. This led to an "are robots good or bad?" debate. If one needed to limit access to a website, it is understandable that developers would want to preserve bandwidth for actual users rather than automated programs. With this problem in mind, Martijn

Koster – a Dutch software engineer who developed early search engines – developed the robots.txt, or robots exclusion protocol, in 1994.80 Given its outsized impact on cultural preservation today, the simplicity of the robots.txt protocol is surprising. A few lines of code exclude crawlers from a site. A website owner puts the robots.txt file in the base directory of the website, such as http://webarchives.ca/robots.txt. If the file contains the following code,

User-agent: *
Disallow: /

all robots that follow the protocol will stop exploring the website. Some rogue bots ignore it, and some legal deposit organizations have permission to override it, but web crawler traffic will essentially cease. While based on the honour system, it is largely followed. Relying as it does on web crawlers, the Internet Archive uses robots.txt as its opt-out protocol. Using the following code in a robots.txt file will not only stop the Internet Archive from crawling your website, it will retroactively remove all archived pages from its publicly available collection:

User-agent: ia_archiver
Disallow: /
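To make the mechanics concrete, here is a minimal sketch – my own illustration rather than anything drawn from the web archiving tools discussed in this book, and using a placeholder domain – of how a well-behaved crawler consults a site's robots.txt before fetching anything, using Python's standard urllib.robotparser module:

from urllib import robotparser

# Assumption: example.com stands in for any site a crawler might visit.
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the site's robots.txt, if one exists

# A polite crawler identifies itself and asks permission before fetching a page.
for agent in ["ia_archiver", "Googlebot", "mycrawler"]:
    allowed = rp.can_fetch(agent, "https://example.com/some-page.html")
    print(agent, "may fetch the page:", allowed)

If a site published the blanket exclusion shown earlier (User-agent: * with Disallow: /), can_fetch would return False for every crawler that honours the protocol; the ia_archiver rule above would block only the Internet Archive's crawler. Rogue bots, as noted, can simply ignore both.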

This means that robots.txt is exceptionally important for web archives – and for the historians who use them – to understand. In many ways, it is an imperfect solution to ensure privacy rights, especially since it is too technically difficult for most people to implement, meaning that we cannot construe the absence of a robots.txt protocol within a website as ethical consent from creators to have their content archived. Most web content publishers do not have access to their own servers. The creator of a Wordpress.com blog, for example, cannot access her robots.txt file – neither could a GeoCities site administrator, a hobbyist with a university website, or users of most other personal providers. In addition, the retrospective aspect introduces some wrinkles. If domains change hands, somebody could add a robots.txt file and exclude the domain – even content he had not generated – retroactively. To put this into perspective, if I forget to renew ianmilligan.ca and somebody buys it, she can destroy my

own digital memory. Or more vexingly, in the right political conditions government regulations could prompt its removal. Retroactive removals bring with them questions about how complete and permanent a web archive can be, and speak to our earlier question of whether a web archive is even an archive in the proper sense of the word. To make a necessarily imperfect comparison, consider contrasts between traditional and web archives. Traditional institutional archives seldom let a donor come into the stacks and retroactively remove a document or two that she feels depicts her in a poor light. This is especially important for public figures and organizations, such as politicians or large corporations. Indeed, Peter Webster – a UK web archiving researcher – notes that this is an inherent weakness of web archives. The more they are used, the more visible they become, and the more vulnerable they are to removal. In late 2013 the British Conservative Party restructured their website, moving old speeches to a portion of the site that then had the robots.txt exclusion protocol applied to it. Webster notes, "I mused on a possible 'Heisenberg principle of web archiving' – the idea that, as public consciousness of web archiving steadily grows, the consciousness of that fact begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don't think we're any closer to being able to do so. But the Conservative party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works."81 If a user changes the robots.txt file, the whole website is removed from the publicly facing index. Matters are different when the focus switches to private citizens who are not public figures. If a regular person gives personal information to an archive, he has generally done so with informed consent – the same, of course, is not true of web archives. Our historical record will always be full of holes. Some of these will be big – created by the media giants of Facebook and Google. Some will be small, individual webpages that have elected to add the robots.txt file. As awareness of web archiving grows, even more sources might be jeopardized. While this is all an integral part of an incomplete historical record, the omissions are still worth reflecting on. In some ways, however, they pale in comparison to the biggest threat to digital preservation: our own neglect.


The Apathy Threat: Racing against Time to Save Online Communities

Technical challenges around how to preserve web documents pale in comparison to the obstacles posed by social apathy and the tendency of corporations to engage in widespread active digital destruction. Large amounts of our web heritage have been destroyed, and more is threatened with destruction, as the result of simple neglect. Given their dominance of much of our engagement online, corporations now steward a significant part of our cultural heritage, a responsibility that many have failed to uphold. Our personal writing is more than likely to live on blogs, websites, and other cloud services, provided for "free" by corporations. When profit begins to wane, so too does their commitment to our online content. Digital preservation is at its core a human problem. In this, we are reminded that the Library of Alexandria did not crumble as the result of war or sudden catastrophe, but through a long period of decline – as Abby Smith Rumsey notes, "People stopped valuing its contents."82 This does not mean that all digital information must be preserved, but that people should have the option to have their online presence and content preserved. Collection is not always an unalloyed good, but the option to do so is. Consider an analogy. In 1963, despite community opposition, New York City's monumental Beaux-Arts Pennsylvania Station was demolished and replaced with a subterranean (and largely charmless) structure. While some perceive the demolition today as an act of modernist excess, during the 1960s it was part of a broader wave of replacing the old with the new. The destruction of the old Penn Station saw some of that enthusiasm wane, leading to renewed interest in architectural preservation. Structures have been imposed to slow the dismantling of built heritage in many jurisdictions, allowing for sober second thought. Penn Station serves as a cautionary reminder that in our rush to embrace the future we might make irreversible mistakes. Destruction is not always bad – we cannot live in cities full of nineteenth-century buildings, for example – but we must take care as we steward our cultural heritage. Flash forward fifty years to 2013, to the West African country of Mali in the midst of the Malian Civil War. As the French army and Mali government forces approached Timbuktu, Islamic rebels fled. During their withdrawal, they appeared to set fire to two libraries holding medieval manuscripts. The 700,000 manuscripts housed within provided

an invaluable account of life in the Abbasid Caliphate. When news of their destruction reached historians and the media, there was justifiable outrage.83 While it would later be discovered that one of the buildings had not been burned and the manuscripts had already been smuggled out, the prospect of losing these irreplaceable pieces of Islamic history and heritage resonated. In digital media, however, we accept the destruction of historical sources on a barely imaginable scale. I have argued throughout this book that this is a cultural heritage that could allow us – if ethically used, and with care and attention to privacy at scale – to reconstruct the daily routines, exchanges, thoughts, and feelings of people who live a large part of their lives connected to the web (even if it does not get granular to the level of the individual, we can get broad information on trends). And in any case, these sources serve as valuable personal time capsules for individuals who live their lives on the web and want the ability to retrieve those memories later. It is distressing to think that this memory could suddenly and thoughtlessly disappear. Deletion is a fraught term in the digital age. If a web archive finds a page and saves it before it is deleted, then there is a copy of it (although it may be an imperfect copy, as I note below, the content is often preserved). If a page is deleted before a web archive finds it, for all intents and purposes it is usually gone forever. Occasionally a page might be found on an old physical disk, or preserved on a hard drive, but unless that is in turn provided to an archive (the Internet Archive accepts uploads) it is also gone. Web archives themselves are not immune to deletion – one could imagine a hacker attack or natural disaster or beyond – but they take great care to ensure the continued stewardship of their heritage. In short, unless web content is in a web archive, it faces a vulnerable future. Several examples can show the fragility of our digital heritage. In June 2013 MySpace, a popular content platform that had been the dominant social network between 2005 and 2008, decided to relaunch and rebrand itself. In the process, it deleted all its user blogs overnight. There was no chance to save them, as it happened so quickly. Millions of contributions, critical personal records of events of the last decade or so, were lost in the blink of an eye. I wrote about this online and the comments were sobering: arcticaja bemoaned that she lost her "memories," and mblanco84 noted "inconsolable since this loss and [had] barely left the

house.” Another commenter pointed out the role that MySpace blogs had played for deployed American military personnel: “What deeply concerns me, is that before the Department of Defense cut off access to MySpace on their computers, American troops used MySpace as a means to effectively communicate with a large group of people back home. Our troops didn’t write letters that would take 6 weeks or longer to arrive; they chose a nearly instant means of delivery. As a result, we don’t have those fragile scraps of paper that might remain of a note or diary from past wars.”84 Only a few years earlier, MySpace had been the centre of many people’s online social world. This was destruction on the scale of Timbuktu earlier in the year, but there were far fewer protests. MySpace did all of this without a hint of apology, but instead with the corporate equivalent of a wink and a smile. In this it joined the pantheon of companies cheerfully destroying user data.85 “We’re focused on building the best My[S]pace possible. And to us, that means helping you discover[,] connect and share with others using the best tools available. Going forward we’re concentrating on building and maintaining the features that make those experience better” was the corporate response.86 Several outlets, including the Canadian Broadcasting Corporation (CBC ), picked up the story.87 Users had lost not only their blogs, but also their private messages, uploaded videos, comments, and posts that they had made on other users’ sites, their customized website backgrounds, and any games that they might have been playing. For those who lived their lives online, it was as if a virtual reset button had been hit: blogs are records of daily lives, and to lose them without warning is heart-breaking. Yet, for the most part, historians were unmoved. In some ways, this is understandable: there is a difference between the priceless medieval manuscripts of Timbuktu – of singular value – and MySpace pages. Yet it also illustrates lack of foresight. We do not know what will be considered valuable by a historian decades or centuries from now, and taking a cavalier approach to any historical source is a foolhardy route. In any event, when a physical repository is endangered, historians are outraged; when a digital one is, the reaction is far more muted. Indeed, in the case of MySpace, derision was the general reaction, from professionals and the public. “If you wanted to keep your website, don’t rely on a free service, host your own!” explained one commentator

on the CBC.88 This common objection is hardly reasonable. Most users need commercial providers to host their content, as they lack the financial or technical means to run their own web servers. Crucially, users have a reasonable expectation that their content will be stewarded responsibly. Yes, services will fold and collapse, but there needs to be ample warning, and these institutions have a responsibility to collaborate with web archives and libraries. Even if a user's content is hosted elsewhere, that does not mean that users have no moral ownership over it, or that their thoughts, posts, and content do not form part of our collective cultural heritage. For social and cultural historians, MySpace provides a record of the expressed ideas, thoughts, and feelings of millions of people from across the socio-economic spectrum. If used ethically, the deleted blogs could have been an amazing resource for historians, similar to the GeoCities collection, which we will explore in chapter 5. For three years, MySpace was the hub of the World Wide Web, and now a portion of it was gone, suddenly, without warning, all in the interests of serving their users "better." For social historians this should have been our wake-up call to the fact that the voices and heritage of our early digital age are being lost. MySpace is a symbol of what can happen when online digital heritage is not respected, as is all too common. Luckily there are teams of people online who fight this trend, preserving the information that will become the basis of future historiography. From their greatest victories, historians now have treasure troves of information, which can be used responsibly to tell representative histories of culture and society.

Who You Gonna Call? Rushing to Save GeoCities and Other Potential Internet Graveyards

In this book's introduction, we briefly encountered the Proceedings of the Old Bailey, 1674–1913, a primary source that British historians turn to when they attempt to reconstruct everyday life in London between the seventeenth and twentieth centuries. Rightfully described as "the largest body of texts detailing the lives of non-elite people ever published," these 197,745 criminal trials provide invaluable windows into the lives of those who came into contact with the court. Everyday people were largely

anonymous at this time, save when they encountered record-keeping institutions like the courts. In 239 years, around 200,000 documents were generated that are used in innumerable historical works. I compared it to GeoCities, the online community with seven million users and some 186 million "documents." Yet if it were not for the determined intervention of a band of guerrilla archivists, we would not have this content today. GeoCities will likely emerge as a main source through which we reconstruct the lives of everyday people online in the late 1990s and early 2000s. Yahoo!, which purchased GeoCities in 1999, unceremoniously deleted most of it in 2009 after its usage numbers had declined precipitously. It would be as if, in 1913, the landlord of a relatively empty warehouse had decided to throw the Proceedings of the Old Bailey out to save space, as they had outlived their immediate usefulness. In the story of how this came to be – how people decided in 2009 to make a concerted effort to save GeoCities with an eye to our future digital cultural heritage – we have a window on the precariousness of digital sources and the models for how this can be overcome. To understand how GeoCities was saved we need to go back to one of the first mass deletions. AOL Hometown had been a web-hosting platform offered by America Online. Within four years of its 1998 founding, it had grown to at least fourteen million home pages created by AOL and non-AOL users alike.89 In August 2008 word began to circulate that sites would be permanently shuttered on 31 October 2008, although formal news of this did not come from AOL until the last day of September, giving only a month's notice to some people that their personal sites would be destroyed.90 In comment threads, users expressed bewilderment about how they could download or move their information to another platform. It remains difficult for an average user to back up information from the web, requiring either that each individual page be tediously downloaded, or that one have knowledge of scripting (often on the command prompt) or other download options.91 People's online presence was being rapidly destroyed – without sufficient notice, consent, or compassion. Jason Scott, an online archivist and computer historian known for his preservation efforts and documentary on early Bulletin Board Systems, among other endeavours, reacted on his popular blog in December:


We’re talking about terabytes, terabytes of data, of hundreds of thousands of man-hours of work, crafted by people, an anthropological bonanza and a critical part of online history, wiped out because someone had to show that they were cutting costs this quarter. It’s an eviction; a mass eviction that happened under our noses and we let it happen.92 In his popular post, shared widely, Scott compared AOL Hometown’s shutdown to his own experience of being evicted from a physical apartment – which, while painful, at least required advance notice and consideration of his hardship – whereas when “we evict people from their webpages, fuck all is required.”93 These sites had profound significance for their users, and for our broader culture, as he noted in a follow-up post. “At a time when color printing could cost you a dollar a sheet of paper, you could have a full-color presentation available all over the world … this technology was amazing, vast, and falling into the hands of people who wouldn’t have ever composed a newsletter, or even a diary,” he declared.94 It was too late to save AOL Hometown, but archivists could be ready for the next time an online community was threatened with destruction. It would need to be an organized team effort. Individuals can seldom harvest large amounts of information themselves, as websites impose download or rate limits: a cap on how much you can download every hour or two, for example, or blocking the address of a computer that is making too many requests. Users do not encounter these restrictions in daily browsing, but if you are trying to download an entire site, these limitations emerge. This means that it can take a very long time for one system to download a site. If capped at fifteen megabytes per hour, for example, downloading the four TB of GeoCities would take around 266,667 hours, or thirty years. To circumvent that limitation, teamwork is necessary – lots of different users organizing to each download part of the site. Scott called this Archive Team, and their battle cry is worth quoting: “Fuck the EULA s and the clickthroughs. This is history, you bastards. We’re coming in, a team of multiples, and we will utilize Tor [a way to anonymously surf the web by concealing the origin and destination of web traffic] and scripting and all manner of chicanery and we will dupe the hell out of your dying, destroyed,

losing-the-big-battle website and save it for the people who were dumb enough to think you'd last."95 Archive Team was ready by January 2009, with its own chat channel, website, and mission to save endangered digital resources from destruction. It would soon be tested. In April 2009 news began to spread about another impending mass deletion: GeoCities. "GeoCities will close later this year," the help page noted, part of a broader process that would see Yahoo! "focus on helping our customers explore and build new relationships online in other ways."96 It was a relatively muted announcement, given the size of GeoCities. Seven million people were being told to pack their bags. Of course, the vast majority of users had already stopped updating their sites – the activities of GeoCities had dropped dramatically after 2004 or so, except in GeoCities Japan, which remained until March 2019 – but its pages remained, valuable artifacts of our early web history. Even if it made business sense for Yahoo! to deprecate GeoCities, it was handled as if it were just any service coming to an inglorious end. Mainstream sentiment, articulated in the media, ranged from the "good riddance" narrative of casting out the old to make way for the new to sober analyses of Yahoo!'s changing business models. Outrage was a minority opinion.97 Much of the discussion was defined by nostalgia, a dawning recognition that the web actually had a history.98 But it was a history that would prove hard to preserve. Users received no support to save their own material unless they wanted to upgrade to Yahoo!'s paid web-hosting service. The advice given online by experts was to use a command line tool to mirror the pages, which was a procedure well beyond the technical abilities or patience of most users.99 The only other option was the time-consuming path of right-clicking every page to download it. And that was even if a user knew that her site was closing: notices were sent to the email addresses on record, and many had changed their addresses since setting up their site in the mid-1990s. Archive Team sprang into action. True to the example above, Yahoo! limited each downloading computer to fifteen megabytes an hour.100 This meant that the team would need to coordinate the activities of many computers. The best way to circumvent this limitation was to have collaborators, known as Archive Team warriors, run virtual machines on their own computers – these pre-configured machines could run pre-configured programs, acting in concert with the hundreds of

other VMs. The Archive Team found themselves in a crash course in GeoCities history: figuring out the "neighbourhood" system (discussed in chapter 5), how sites were laid out, and what sections had already been destroyed. Popular technology blog Slashdot linked to Archive Team, the story was picked up by the mainstream media, and digital preservation issues splashed across technology websites.101 They saved most of GeoCities (it is always difficult to tell just how much they might have recovered, as there is no "master" copy of the site), and soon made it available for download as a BitTorrent file (a way to have multiple people host and share the same file, and parts of it, to dramatically speed up download times). Indeed, at the time of the torrent being made available in 2011, it was the largest torrent ever released, some 643 GB. As Scott noted on his blog, this was "a collection for historians, for researchers, for developers. For those who want to do study on the heritage on something so soon gone and yet so much of part of how we got here."102 Archive Team, in the end, was not alone in saving this material, testifying to growing interest in digital preservation and in GeoCities itself. GeoCities was also mirrored by ReoCities, the Internet Archive carried out one last big download, and another website, Internet Archaeology, also collected a subset. Chapter 5 relies on these datasets to explore what we can discover from the GeoCities web archive, as well as using it for an in-depth exploration of how such sources need to be used with ethical care. We know from the earlier MySpace example that the problem of mass deletion did not end with GeoCities. Yahoo! would continue their record of destruction on 1 April 2013 when they deleted their fifteen-year-old Yahoo! Messages platform, in order to "help focus [their] efforts on core Yahoo! product experiences." This cancelled another fifteen years of the largely non-commercialized voices of everyday people, on topics as varied as business, the internet, government, hobbies, science, education, and beyond. Table 2.1 suggests how widespread digital destruction has been, revealing that it has been a persistent problem across time and space. These examples, compiled from various technology blogs, the Archive Team deathwatch, and LexisNexis searches, all met early deaths. These are the challenges faced by our contemporary historical record. While there are still glimmers of them – a page saved in the Wayback Machine here, screenshots in a magazine there, write-ups in a magazine – they are for the most part irrecoverably gone. When pages are deleted, and not archived, they are more or less gone – perhaps a collector saved something, perhaps a major grant could reconstruct a single page – but in general, these options are not there.

Table 2.1  Selection of long-deleted communities

Name | Approximate user size | Start year | End year | Notes
Six Degrees | 1–3.5 million | 1997 | 2001 | One of the earliest social networks. Log-in required to get into pages so Wayback Machine is of very limited utility.
Whuddleworld | 76,000 | Unknown | 2006 | An online community targeted at children, which ran out of money
Wallop | Unknown | 2004 | 2007 | Microsoft's short foray into social networking
Flip | 300,000 | 2007 | 2008 | A social network focused on teen girls
Yahoo Kickstart | Unknown | 2007 | 2008 | Yahoo!'s short foray into social networking
AOL Hometown | Unknown, at least 14 million pages in 2002 | 1997 | 2008 | Discussed in this chapter
MingleNow | Unknown | 2005 | 2008 | A social network focused on nightlife activities. Purchased by Yahoo! and shut down.
Yahoo 360! | 10.5 million views | Unknown | 2009 | Another Yahoo! social networking effort, which later became Yahoo! Profiles
Splashcast | 100,000 | 2007 | 2009 | Another social network all about user-created channels. Is not in Wayback Machine due to robots.txt.
Brightfuse | 100,000 | 2009 | 2010 | A short-lived LinkedIn competitor
Star Wars Forums | Unknown (250GB in content) | 2001 | 2011 | A Star Wars–themed discussion board
Gamepro Forums | Unknown | 1998 | 2011 | A video games–themed discussion board
Posterous | Unknown | 2008 | 2013 | Allowed photo sharing between social networks


If the widespread closure of online communities is one issue, so too is something that happens on an equally wide scale: the death of individual users and what this means for their online content. As Larry Cebula, a public historian at Eastern Washington University, has noted in an argument that should seem familiar to readers of this book, our Facebook, Twitter feeds, MySpace profiles, WordPress blogs, and so forth are the diaries and records of today: “The real revolution in personal writing and documentation for our era, however, is the way that it will illuminate the lives of us peasants. Every fry cook at McDonald’s has a Facebook page.”103 Yet if a person passes away, how can we make sure that this material is preserved? Digital records need to be promptly preserved if we want to provide a voice for the dead, rather than just left to be “discovered” in the future. Twitter probably will not last two decades, let alone two generations. When it comes to deceased users, each company has their own smorgasbord of policies. Twitter simply allows for the deactivation of a dead user’s account upon receipt of a death certificate from a close family member or trustee estate.104 Dropbox did not have a deceased person policy, deleting all information for accounts not accessed within twelve consecutive months, although they now provide access if you can provide “a valid court order establishing that it was the deceased person’s intent that you have access to the files in their account after the person passed away and that Dropbox is compelled by law to provide the deceased person’s files to you.”105 Facebook converts the site of a deceased user into a memorial, which means that it is at the mercy of Facebook’s continued existence, unlike the downloadable Twitter archive. All of this, from the level of the large community like GeoCities to the individual who passes away, speaks to the inherent instability and transience of digital data. In light of its value, we need to preserve it today. Historians recognize that our ability to craft narrative depends upon the accessibility of sources, and with digital records the decision to retain and preserve needs to be taken much earlier in the source’s lifecycle.

Conclusions

The story of web archives and web archiving is at its base a human one: from those who created the websites in the first place and our obligations to them, to those who began preserving the web in the mid-1990s and

who execute and run the crawls today. As we increasingly live our lives online, these issues lie at the heart of our contemporary cultural heritage. The process of preserving and storing digital cultural heritage is a complicated one. This complexity includes the very definition of a web archive itself, as we see that this brings together the historian's archive, the technologist's archive, the archivist's archive, and the digital humanists' idiosyncratic combination of these. Emerging from a mid-1990s climate of digital skepticism – the "digital dark age" – as well as a climate of technological utopianism, the desire to preserve the web and make it available has inherent elements of both forces. On the one hand, the utopian asks, what if we could preserve everything? On the other hand, the dystopian responds, what if GeoCities had been completely deleted? We would have nothing. It is a conversation that continues today. This is a fruitful tension to leave the chapter on as we move into the process of digging into the raw material that makes up our cultural records of today. The utopian impulses help us explore the implications of having this much material, prompting us to think about the realization of social history and the records of those who would never have been preserved before – the children who wrote pages in the children's section of GeoCities, for example, or a lawyer's active blog. Yet it also hints at dystopia, as much of this will – or could be – read without the active consent or even knowledge of the donor. Similarly, the "digital dark age" narrative helps us realize that not everything will be preserved, and that we need to fight the widespread destruction of user-generated content. It also gives some sense that perhaps the internet may have a capacity to forget, after all. This book has been exploring the realm of the conceptual: the top-level, broad considerations, with little specific technical discussion beyond that of the critical and archive-shaping robots.txt file. In the next chapter, we shift gears and begin to discuss some of the nuts and bolts of web archives from the user perspective. How is it actually collected? How should it be? How can we make sense of literally millions of pages? It is a process that is fundamentally shaped by decisions discussed in this chapter.



ACCESSING THE RECORDS OF OUR LIVES

New forms of heritage require new methods of access, upending the traditional historian’s approach. Archives have long been closely associated with the historical profession, and vice versa. Archives are where many sources are made available for consultation, having been organized and catalogued by professional archivists. Among their many duties as stewards of cultural heritage, archivists provide the infrastructure for historians to do much of their work. The American Historical Association notes about this relationship: “As much as they depend on historical sources, [historians] rely on archivists to arrange, describe, preserve, and provide access to source collections.”1 This traditional relationship is being disrupted and reworked by the advent of web archives, which bring different technical, ethical, and epistemological challenges. The relationship between archive and historian is changing, but close interplay between archivists and historians will continue. Web archives may hold out the promise of being systematic and complete, generated by algorithms and keyword searches, yet the initial collection scoping and even much of the collecting process itself is inherently subjective. We have seen how they appear differently on different systems, and that archives can be retroactively altered with a file like robots.txt. All of this means that a firm understanding of their technical apparatus is necessary to use them effectively. Given the inherent fragility of digital sources, decisions made today will shape the future record, a conversation that historians have largely been absent from.2 Ultimately, historians will need to change how they approach historical sources: where to find them, how to read them, and – on a technical level – what these sources are made of. This will require that historians join and participate in the broader, ongoing conversation.


This chapter explores the major issues at play, including what a webpage is, the impact of changing web standards on how we see and interpret webpages, and how the work of historians involves turning away from traditional methods of closely reading content, to more distant large-scale analysis of metadata. It does so to begin to equip historians with substantive conceptual skills that explain how they can begin to use web archives: theoretically, but also tangibly, as they begin to break web archives apart to look at the information within. To do so, this chapter looks to research in the digital humanities, with an eye to adopting the critical methodologies pioneered in that field. Woven throughout are several case studies: the browser wars between Netscape and Internet Explorer, which highlight the way proprietary HTML affects our understanding of sources; the data mining and metadata extraction carried out by the National Security Agency and others; and, as a web archive research example, the Canadian Political Parties and Interest Groups collection, which I study to compare metadata analysis with traditional content explorations. However, moving beyond the traditional practice of painstakingly reading one document at a time – which has its place, of course, but does not scale to the sheer size of web archives – to thinking about sources the same way that the National Security Agency does brings with it limitations. By the end of the chapter, I hope to convince readers that these techniques can be used for good if applied with a critical and ethical eye.

What Is a Webpage?

Let us begin with the individual documents that most scholars are likely to concern themselves with in these collections. If we go to a traditional archive and consult a collection of personal letters, we have singular documents that we can hold in our hands and evaluate in their entirety. Logos, typefaces, images, and other such elements are all physically attached and embedded in the paper itself. This is not the case with a webpage. A webpage, in the sense of a single document sitting in a box in the archives, does not exist – it is not analogous to a piece of paper, unless generated as a PDF or other form of print document.3 It is, as Niels Brügger notes, "fragmented."4 A webpage is rather best understood as a document that comprises various, disparate web resources – images, stylistic information, videos, sounds, and beyond – that are assembled by the web browser following instructions.

3.1 The webpage ianmilligan.ca, with just the front HTML page preserved.


3.2 The webpage ianmilligan.ca, a screenshot taken the same day as it appeared with all resources showing.

Consider an individual user’s reasonably large WordPress site, a popular blog content management system. With a few dozen posts the site can amount to almost 20,000 files if you wanted to replicate or “mirror” it on your own system so you could have a copy of the site as it existed at a single point in time. This is because all of the individual pages that you can generate on even a relatively simple WordPress site are both numerous (from pages generated for each month, to search interfaces, and beyond) and complicated. Try it yourself if you are curious and find how many different pages you can generate at any given WordPress site. There are Twitter widgets in sidebars, dynamic comment management programs like Disqus facilitating conversation at the end of posts, images hosted on external websites, all stitched into the broader fabric of the web through hyperlinks and external calls. Many of the images that we see on webpages are hosted on external servers or are dynamic content generated elsewhere. There is often a content page, such as an index.html file, but it is bare without supporting multimedia content – it may just contain the text, or even just indications for where your browser should find even that. Consider figures 3.1 and 3.2.
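As a rough illustration of this fragmentation – a sketch of my own rather than a tool described in this book, with a placeholder filename – the short script below uses Python's standard html.parser to count the external resources (images, stylesheets, scripts) that a single saved HTML file asks a browser to fetch:

from html.parser import HTMLParser

class ResourceCounter(HTMLParser):
    """Collect the URLs of embedded resources referenced by one HTML page."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts arrive via src; stylesheets arrive via link href.
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])

# Assumption: index.html is a locally saved copy of a page's front HTML file.
with open("index.html", encoding="utf-8", errors="ignore") as f:
    parser = ResourceCounter()
    parser.feed(f.read())

print(len(parser.resources), "separate files would be requested to render this one page")

Run against even a modest blog front page, a count like this quickly climbs into the dozens: each entry is another file that a crawler must also capture, quite possibly at a different moment, for the archived copy to be complete.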


The difference between the two figures is the difference between using just one file (the index.html file of a WordPress blog) and using the hundreds of files that might make up a complicated "webpage." Just to name only a few components of a single page: Images are separate files that need to be downloaded in addition to basic HTML instructions. Style is dictated by cascading style sheets, or computer code that tells the web browser how to graphically display the content within a given page. Social media interaction buttons come from plug-ins, as do email subscription buttons and the like. In short, hundreds of individual files and downloads may be needed to make just one relatively simple page possible. This means that there is no one simple document entitled "ianmilligan.ca's front page on 13 March 2018." James Baker has noted how misleading the term page can be.5 Webpages are complex, interconnected, dynamic, and constantly changing documents. This means that a reproduced archived website will almost never be identical to the original.6 They are always facsimiles, captured to varying levels of cohesiveness and completeness. Moreover, different elements within a webpage may not be captured at the same time, leading to technically fictive sites being archived. Remember how webpages are crawled: a web spider visits a site, downloads a copy, and then begins to follow the links in each page. The text, images, widgets, and so forth may be captured at different times as well. The datestamp is misleading. To grab everything at exactly the same time would require thousands of simultaneous crawlers and would risk overloading a server (or at the very least being a bad web citizen). Some concrete examples can help illustrate. A crawl of a website carried out at 13:45 GMT on 15 July 2006 could lead to a snapshot of a website that never really existed. Consider this example from Niels Brügger, who was creating his own web archive:

During the Olympics in Sydney in 2000, I wanted to save the website of the Danish newspaper Jyllands-Posten. I began at the first level, the front page, on which I could read that the Danish badminton player Camilla Martin would play in the finals a half hour later. My computer took about an hour to save this first level, after which time I wanted to download the second level, "Olympics 2000." But on the front page of this section, I could already read the result of the badminton finals (she lost).


The website was – as a whole – not the same as when I had started; it had changed in the time it took to archive it.7 More worrying for the integrity of the pages, however, is that the temporal shift that Brügger notes occurs within the individual pages themselves, as well. This can be harder to detect. Indeed, research on this problem of temporal shifts within webpages has found widespread evidence of temporal incoherence. A study by researchers at Old Dominion University and Los Alamos National Laboratory uses an example of a Weather Underground webpage preserved in the Internet Archive: “The large radar image near the page center shows the weather in Varina, Iowa, USA was clear and sunny. A closer look tells a different story. The daily image for Thursday shows cloudy with a chance of rain and the rest of the daily images are partly cloudy. These discrepancies indicate temporal incoherence between the archived root and the embedded archived resources.”8 Their example is that despite rain being called for in the textual forecast, not a single cloud was displayed in the radar visualization. This page never existed as portrayed in the Wayback Machine. This is because images and other objects are not embedded or collected at the same time as the main page. In the example case, the image of the cloud-free Iowan plains was not crawled at the same time as the text on the page itself, which called for rain. Scott Ainsworth, Michael Nelson, and Herbert Van De Sompel give many similar examples in their work, noting that it was “common for the temporal spread between the oldest and newest captures to be weeks, months, and sometimes a year or more,” with some outliers being as big as five or even ten years.9 Only 17.9 per cent of web archive holdings that they surveyed were both complete and fully temporally coherent. The disquieting conclusion that the majority of archived pages might have the potential to be inventions of a web crawler is arresting, requiring a rethinking of how we approach these sources. It does not mean that they should all be abandoned as fictional creations of web crawlers, but it does mean that researchers need to be aware of the way the archive is assembled, especially if they are drawing time-sensitive data from it. Research is underway to determine how best to highlight these issues to researchers, although now that it is highlighted in the Wayback Machine (if you know where to look), the future is looking bright.
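One way to see this temporal spread for yourself – a minimal sketch of my own, using the Internet Archive's public CDX index and placeholder URLs – is to list the capture timestamps recorded for a root page and for one of its embedded resources and compare how close the nearest captures really are:

import json
from urllib.request import urlopen

def capture_times(url, limit=5):
    """Return the earliest capture timestamps (YYYYMMDDhhmmss) the CDX index records for a URL."""
    query = ("https://web.archive.org/cdx/search/cdx?url=" + url +
             "&output=json&limit=" + str(limit))
    with urlopen(query) as response:
        rows = json.load(response)
    # The first row is a header; the timestamp is the second field of each result row.
    return [row[1] for row in rows[1:]]

# Assumption: placeholders standing in for an archived page and one of its embedded images.
print(capture_times("example.com/"))
print(capture_times("example.com/logo.gif"))

If the nearest timestamps for the page and for its image are days or months apart, the composite shown in the Wayback Machine blends moments that never coexisted – precisely the incoherence the researchers above set out to measure.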


The National Library of Australia’s web archiving manager Paul Koerbin explains the discrepancy between original webpages and their archival versions this way: “To appreciate the significance of this web archiving outcome we should understand that the live web, like a living organism, always exists in the present. It may of course include content – and there is a great deal of such content – that was created and published online long ago; well long ago in digital terms at least, but it exists now and is published anew, like regenerating cells, every time it is accessed through your browser or device.”10 This process, however, can go awry at times, as we have seen. In chapter 4 we discuss some approaches to dealing with this temporal incoherence.

Browser Wars: &lt;blink&gt;, &lt;marquee&gt;, and the Sensory Experience of Web Archives

Another wrinkle is the inconsistency introduced by ageing software and hardware. While recent web browsers have standardized much but not all of our contemporary web browsing experiences, webpages from the first decade of the web were interpreted radically differently on the basis of the browser. Web browsers are computer programs that allow a user to essentially interact with the web: the browser retrieves the information from a given page, displays it on a screen, and lets a user engage with the material, interact with the site, and find further links or resources to explore. Chrome, Firefox, and Edge are all examples of modern browsers. The best example of how webpages appear differently to different users of browsers comes from the mid- to late 1990s "browser wars" waged between Netscape and Microsoft. While the earliest web browsers had been text-based, such as the Line-Mode Browser and the subsequent text-based Lynx browser, the 1993 release of the National Center for Supercomputing Applications (NCSA) Mosaic browser brought graphics to most users.11 The web's first "killer application" had arrived, the graphical web browser. In 1994 a company – Mosaic Communications Corporation – was formed. Comprising many coders recruited away from the NCSA, Mosaic soon released the Netscape Navigator Web browser. With Mosaic and Netscape, users had browsers that were functionally similar to contemporary ones. They had a toolbar at the top with


3.3 uwaterloo.ca from 22 October 1997, viewed through an emulated Mosaic 2.2 browser from http://oldweb.today.

commonly used functions (back, forward, home), graphics rendering in the HTML page alongside text, and were extremely easy to use as the user could click through pages as they desired. Mosaic also released a fully functioning beta version for users to use for free.12 You can see an emulated screenshot of Mosaic in figure 3.3. As users began to explore the web, they spent more time within their browsers. Prescient thinkers began to foresee browsers becoming general-purpose computing platforms. This laid the foundation for a clash between Netscape and Microsoft, the dominant operating system provider.13 The browser wars, waged between 1995 and 1999, saw rapid change in user preference. In the first year, Netscape had a 90 per cent user base. Four years later in 1999, Microsoft had 76 per cent of the share, and Netscape only 23 per cent.14 They competed not only in user interface respects, but also in how they would interpret and display the HTML that underlay the web itself.


The browser war is a useful entry into the issues that face web historians grappling with archived websites. Webpages, largely from the 1995 to 1999 era, but with ripples into the 2000s and even later, often proclaimed allegiance to a particular browser. "This site is best viewed with Netscape Navigator 4.0," a clickable image might declare on one page, while the next site over argued that their "site is best viewed with Internet Explorer 3.0." In an attempt to capture larger user shares, browsers began to diverge from accepted HTML standards and introduce proprietary elements. &lt;blink&gt; is one of the best examples. A ubiquitous element of 1990s personal homepages, the &lt;blink&gt; tag does to text what it suggests. The following line of HTML code shows &lt;blink&gt; in action:

I am &lt;blink&gt;blinking&lt;/blink&gt; at you.

Or, rather, it blinks, depending on the browser you are viewing the HTML code with. In Netscape, the word blinking would be flashing off and on in the sentence "I am blinking at you." However, viewed in Internet Explorer or any modern browser, you would not see any blinking text.15 The &lt;blink&gt; tag, concocted as a joke in a Mountain View, California, bar in 1994, was an undocumented feature that found its way into the Netscape code base. Discovered by users, it began to spread across the web. The tag was a bad idea for many reasons: blinking text can prompt epileptic seizures, obscures text for those with visual disabilities, violates accessibility regulations in many countries, takes control from users who could not decide if they wanted blinking text or not, and was seen to violate norms of good web taste. For many users, however, it was a very easy way to add visual pizzazz to a page. It became associated with large user-generated communities such as GeoCities. While it gave conniptions to web developers and tech-savvy users, it was a popular part of early web history. Even if the blink instruction had its origin as a barroom joke, it remained a proprietary element. Microsoft's Internet Explorer did not (and does not) support it – there would be no blinks for their users. Instead, they introduced their own proprietary alternative: &lt;marquee&gt;. It served a similar function:


3.4 uwaterloo.ca from 22 October 1997, viewed in a modern Firefox browser.

&lt;marquee&gt;Coming soon: My webpage!&lt;/marquee&gt;

would generate the text "Coming soon: My webpage!" scrolling onto the page from the left, slowly moving to the right, as if in a Times Square marquee. A page with marquees would be "best viewed" in Explorer, and a page with blinks in Netscape. Users would have visually different experiences, depending on what browser they viewed the site with. The story of &lt;blink&gt; and &lt;marquee&gt; showcases the difficulty of closely reading websites exactly as users viewed them at the time. If a group of historians went to a print archive together and viewed documents, they might encounter minute differences in how they physically perceived them – colour-blindness, for example – but in general the documents would appear the same to all the historians and the same as they did to the creator. Scholars exploring websites will have substantially different experiences when they use different browsers, and that has dramatic implications for the study of the archived web today. Consider what an early scrape of the University of Waterloo, from 22 October 1997, looks


3.5 uwaterloo.ca from 22 October 1997, viewed through an emulated Internet Explorer 4.01 browser from http://oldweb.today.

like in a modern Chrome browser via the Wayback Machine (figure 3.4). The image is rendered on a high-resolution monitor. I have shrunk it down to a tiny slice of my desktop, but we can see almost the entire webpage at a glance, columns render perfectly, and there is a crispness about the image. What if somebody had viewed this page in 1997 using a browser from only one year before, Netscape 1.0, on a screen at a much lower resolution? Indeed, 640 × 480 or 800 × 600 were common resolutions at the time. The user would have to scroll quite a bit to see what we, on a modern high-resolution screen, take for granted. Users, especially in the earliest days of the web, were much less likely to scroll down, leading to a different experience. Internet Explorer users would also have had a different experience, as figure 3.5 shows on a browser from 1996. We can use browser emulators, as seen in figure 3.5, to simulate these effects. One project, http://oldweb.today, is especially useful, letting you explore old Wayback Machine pages with a variety of Netscape Navigator, Internet Explorer, and old Apple Safari versions. Deja Vu, at http://www.dejavu.org/ provides a few more older browsers. Bear in mind, of course, that you are running an emulator in a small window


on your own desktop, probably on a modern LCD monitor, as opposed to a flickering CRT monitor. New media artists and curators Olia Lialina and Dragan Espenschied have asked these questions of the GeoCities collection. Espenschied explains the problems facing us today as an unavoidable trade-off between authenticity and accessibility. It is extremely difficult to generate a perfectly replicated and rendered page, as the complexity scales up with the desired level of faithfulness. Indeed, high levels of authenticity can militate against accessibility, as complicated hardware and software solutions are needed.16 Remember the steps taken to preserve CERN's first webpage: a custom emulator, navigating old technical manuals, interviews, and the like, to generate just one relatively faithful page. It does not scale. An understanding of the technological evolution of the web can help us make sense of design and content decisions. As Espenschied notes, one of the most significant changes "render[s] characters with smoothed out edges" as opposed to "historic aliased pixel text display." As for the sound that would play in the background as a visitor arrived at a webpage, "MIDI music files do not sound at all as they used to when they dominated web audio." Above all, the physical elements of accessing web material have changed: "All graphical output looks very different on CRT monitors and their special surface-to-pixel ratios than it does on contemporary flat screens. For example, when looking at historic webpages on an 800 × 600 pixel 14" CRT screen with a 60hz refresh rate, it becomes clear why many people decided to use dark backgrounds and bright text for their designs instead of emulating paper with black text on a white background."17 Like historical objectivity, authenticity in recovering web materials is best understood as a noble goal, even if it cannot truly be achieved.18 If we cannot make faithful replicas for each page, we can at least be cognizant of the authenticity dimensions at stake: recall the various browsers used to access material and how hardware and software considerations affected web design decisions and users. At least we have still "saved" the &lt;blink&gt; tags, even if they do not render properly in our browsers – we can at least look at the raw HTML code to see if it exists and reintroduce the functionality if we wish to. Other cases of changing standards and technology have unfortunately left us with few traces.
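In that spirit, even when a browser will not render them, such tags survive in the underlying files and can at least be counted. The short sketch below – my own illustration, with a placeholder folder name – walks a directory of archived HTML files and tallies how many still contain &lt;blink&gt; or &lt;marquee&gt;:

import os
import re

# Assumption: "archived_pages" is a placeholder folder of saved .html files.
counts = {"blink": 0, "marquee": 0}
for root, _dirs, files in os.walk("archived_pages"):
    for name in files:
        if not name.lower().endswith((".html", ".htm")):
            continue
        with open(os.path.join(root, name), encoding="utf-8", errors="ignore") as f:
            source = f.read().lower()
        for tag in counts:
            if re.search("<" + tag + r"[\s>]", source):
                counts[tag] += 1

print(counts)  # how many pages still carry each proprietary tag

Nothing blinks or scrolls when the script runs, of course, but it confirms that the proprietary markup – and the design choices it represents – is still there to be studied.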

Dead Formats: The Havoc of Technological Change

One of the most notable digital format losses is the Flash platform. Flash swept to prominence during the dot-com boom of the 1990s as, according to Megan Sapnar Ankerson, it "appealed to a new vision of the web, one vastly different from the static, silent, textual form that imitated the aesthetics of print."19 It allowed animations to greet viewers and provide navigational elements for the site. It became nearly ubiquitous on commercial websites, from the sites of large corporations to those of local restaurants, allowing for artistic flair and creativity. Flash has now largely disappeared from the web. Having garnered early criticism for being proprietary and slow, Flash saw its decline capped off by Apple's war against it and its omission from most mobile browsers.20 With Flash websites not appearing on phones and tablets, it has been largely consigned to the digital dustbin. Flash was very difficult to archive, meaning our record of its heyday is limited. Most Flash sites have been deleted, with only a few holdouts persisting. As Ankerson has noted, "Most Flash sites escape the archive because proprietary multimedia files are difficult for web crawlers to save; unlike HTML pages, one cannot use a browser to view the source code of a Flash file. If we hoped to access [a site] through the Wayback Machine, we would see only the launch page with a broken image icon where the enter button would be. Essentially, the Wayback Machine shows only that a Flash site once existed, but it gives no indication of the site's content, visual style, or purpose."21 To see what these sites looked like, we will need to resort to magazines and other print sources with screenshots. The surprising durability of print again rears its head. Flash is not an isolated example – new technologies in general wreak havoc with web archiving. The problem of disappearing content will probably never be completely resolved, as web development moves faster than our ability to save material. It is a game of cat and mouse, as archivists move to capture the newest form of web publishing, from dynamic content to the infinite scroll that currently characterizes sites like Facebook, WordPress, and Twitter. Just as web crawlers are updated to capture content, some new development comes along that transforms everything. To provide another example, one Canadian web archiving project that I was involved with ran into issues concerning comments posted

on popular websites using the popular Disqus comment platform.22 Disqus is a plug-in for pages that allows users to have conversations with each other: many newspapers and other websites use this system rather than trying to design their own. The issue occurs when you first load a webpage: Disqus does not immediately appear, taking a few seconds to pop into existence. While modern versions of the standard Heritrix web crawler – Heritrix being the program that goes out throughout the web and systematically takes snapshots of pages – will now access this material, older versions did not. In this example, the team fortunately downloaded JPGs, PDFs, and WARC file content from each website: comments were preserved in the PDFs, but not the WARCs, raising questions about which version was the "proper" one to include in the archival record (we ultimately included all three). As the project was interested in blogs and their comments – it concerned a libel case against a Canadian librarian where comments featured prominently – this was a dangerous omission from the WARC files. This speaks to the broader problem of digital preservation, or the long-term maintenance and accessibility of digital objects. There are a variety of solutions within this field: preserving old hardware systems, migrating file formats into modern ones (i.e., converting WordPerfect WPS files into archival PDFs or open-format DOCX files), or emulating old computers within modern hardware. Each has its merits and drawbacks, and there is no one perfect solution. Much of the trouble with digital preservation lies in the sheer conceptual complexity of digital objects. Matthew Kirschenbaum encourages us to think of such objects as having three different components: the physical object (the magnetic flux generated by a hard drive head, for example – websites are physically stored somewhere), the logical object (the file itself, such as index.html), and the concepts embodied in the actual content.23 Trevor Owens expanded upon this when contemplating the 1982 Atari video game Pitfall: "Is it the binary source code, is it the assembly code written on the wafer inside the cartridge, is it the cartridge and the packaging, is it what the game looks like on the screen? Any screen? Or is it what the game looked like on a cathode ray tube screen? What about an arcade cabinet that plays the game? The answer is, that these are all Pitfall."24 There is no single approach to preserving digital objects, from old video games to website comments. Different scholars will be interested

in different aspects of the same digital object: a historian of users may be interested in the CRT itself or the actual physical cabinet, a cultural historian might be curious about the content and what Pitfall, say, tells us about youth culture in the 1980s, and a scholar of science and technology might be more interested in the underlying code or framework. The catch with digital preservation is that we need to make these decisions now, creating protocols for how we save, preserve, and provide access to this material. As Owens correctly notes, "If humanists want to have the right kind of thing around to work from they need to be involved in pinning down what features of different types of objects matter for what circumstances."25 Historians have long tended to prioritize content over form, privileging the textual content over the visual. Witness the role of photographs in academic monographs, often playing a supporting role rather than being critically interrogated (there are exceptions, of course). Given the visual and sensory nature of the web, this is insufficient. The underlying mechanisms matter: how a page was archived dictates what content will be available, and what technology or browser was used at the time dramatically affects how it will be seen and understood. We need to be thoughtful and proactive in our approach to preserving digital scholarship and sources.

Farewell Wayback Machine, Hello Big Data: What a Crawl of the Web Looks Like

The Internet Archive reached their ten-petabyte milestone on 26 October 2012 – and celebrated with a party. Despite a power outage, speakers took to the stage to extoll the sheer amount of data now housed in their building (an old renovated Christian Science church in the west end of San Francisco). One speaker provided "some useless comparisons that are supposed to help": i.e., if a byte was a person, then the Internet Archive had 94,000 times the number of people who ever lived, or if a byte was a second, then it would last 317 million years.26 In any case, it represented a significant moment in the world of digital cultural heritage. To keep the celebration going, the Internet Archive announced that they would be releasing even more data: an entire crawl of the web from 2011. It was provided to researchers "warts and all" so that the Archive could explore "how others might be able to interact with or learn from this content if we make it available in bulk."27 For a historian like me,

this was a godsend: it would not only shed light on what a crawl of the entire web looked like, it could also help us look into the everyday lives of web users. Farewell, Wayback Machine. Hello, Big Data. This crawl of the web, known as wide00002, is a useful way to learn what a crawl looks like. It comes as a set of 85,570 WARC files, each one GB in size, for a total of around eighty-five TB. This is a lot of data: a few thousand dollars a month to store in a cloud storage host like Amazon, or over ten thousand dollars to purchase redundant storage (where data can survive the failure of hard drives as it is stored over multiple drives).28 Consumer hard drives could store it far more cheaply – a twelve TB hard drive can be purchased for around $1,000 – but their failure rates mean that without data management plans, that is not a feasible path forward. Putting eighty-five TB on consumer hard drives is a recipe for data loss. Fortunately, the Internet Archive has powerful and well-considered redundant storage infrastructure. Their "petaboxes," one of which was seen in the introduction (figure i.1), store and process up to a petabyte of information.29 So how can we work with such a crawl? Just as conventional archives have finding aids, so do web archives like wide00002. They have CDX files, which are a much smaller file format, as rather than providing the content or data of the files collected during a web crawl, they instead just describe each site on one line. A CDX file might have thousands of lines, where each line is similar to this one, for the website of the Canadian academic journal Just Labour. This is just one line:

ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz
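
That long string is easier to make sense of once it is split into its labelled fields. The following is a minimal sketch in Python; it assumes a common space-separated CDX field order (urlkey, timestamp, URL, MIME type, status code, digest, redirect, length, offset, and WARC filename), and real collections can vary in which fields they include.

# A minimal sketch: split one CDX line into labelled fields. The field names
# follow a common CDX convention (urlkey, timestamp, URL, MIME type, status,
# digest, redirect, length, offset, WARC filename); collections can differ.
cdx_line = (
    "ca,yorku,justlabour)/ 20110714073726 http://www.justlabour.yorku.ca/ "
    "text/html 302 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ "
    "http://www.justlabour.yorku.ca/index.php?page=toc&volume=16 "
    "462 880654831 WIDE-20110714062831-crawl416/WIDE-20110714070859-02373.warc.gz"
)

field_names = [
    "urlkey", "timestamp", "url", "mimetype", "status",
    "digest", "redirect", "length", "offset", "warc_filename",
]

record = dict(zip(field_names, cdx_line.split()))
print(record["url"], record["timestamp"], record["status"])

# The same approach scales to a whole CDX file, for example keeping only
# the lines describing the .ca domain:
# with open("crawl.cdx") as cdx_file:
#     ca_lines = [line for line in cdx_file if line.startswith("ca,")]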

The full specifications are available online, but the fields most important to researchers are the first few, which are separated by spaces above.30 The first field provides the URL in reversed form, starting with the top-level domain, in this case .ca, followed by the subsequent subdomains. Moving left to right through that field, we go from .ca, to yorku (York University in Toronto, Ontario), and finally to justlabour, the name of the journal. We can

see that the site was crawled on 14 July 2011 at 7:37 Greenwich Mean Time, and that it was redirected ("302") to a table of contents page for an individual issue of the journal. In one case where I wanted to work with a smaller amount of information, I was able to use these CDX files to extract specific subsets: just around 7 per cent of the .ca websites, for example, and 2.3 per cent of the .com websites. Without these finding aids, I would have been drowning in data. CDX files are not the most straightforward to read, but neither are regular finding aids. Historians receive training in finding aids – and web archive researchers need to be able to read CDX files. What can a researcher do with this raw material of a web scrape? While one could run a Wayback Machine, similar to what the Internet Archive has done, this reproduces the earlier limitations: in many cases we still need to know the URL. Instead, we can use tools to better search and sift through the material. In this, we garner insight into the decisions that need to be made when transforming complicated sources into usable historical information. CDX files can let you know what is in a collection: how many files and from what domains, when data was collected, the overall rate of successful crawls versus lost or missing files, content types, and beyond. Over time, we can use them to see how the web has evolved and disappeared. As lightweight files, roughly 1 per cent of the total size of a web archive, CDX files are an essential starting point for many historical questions, such as what information has been collected and when. How has the basic structure of the web – the kinds of files, for example, or the names we give them – changed? How big is a web domain, like .ca, versus .com? Yet, as the questions you can ask with CDX files are admittedly quite limited, most scholars will still want access to the underlying material. With wide00002, or any web scrape, the natural next step for analysis is to focus on the text – searching and reading the textual material captured. This is not necessarily a bad idea. While the web is multimedia at its core, a considerable amount of information is still delivered through text. But searching web archives is different and more complex than conducting the searches to which we are accustomed on the contemporary web. For example, Google and Bing are predicated on finding the information that you need right now: today, from where you are typing the query, and so forth. Finding archived web material is more difficult. An example can illustrate.

The most robust search engine for archived web material is the UK Web Archive’s Shine search tool. Created by the UK Web Archive and launched in late 2014, it was a robust and working prototype in 2015 and available at http://www.webarchive.org.uk/shine. It serves as inspiration for future projects and may yet be the foundation for future web collection access. The British team uses Shine to explore their 1996–2010 legal deposit collection, the product of the UK web sphere (the .uk top-level domain as well as other high-profile British websites). Several bursaries were awarded to researchers to explore and provide input for the development of the search portal. One of the Shine researchers, Gareth Millward, wrote about his experiences in a Washington Post article. It outlined a frustrating experience: “If I put ‘RNIB ’ [Royal National Institute of Blind People] into the British Library’s prototype search engine, I get half a million results, spread across three decades. They’re not sorted by date, or by relevance. I get RNIB press releases alongside random results on employment websites.”31 He singled out an RNIB -marketed talking watch that was advertised all over the web, meaning “that on news stories about soccer, international conflict and fashion – not directly and obviously related to the work done by the charity – I had thousands of repeated references to a timepiece.”32 This led him to suggest the creation of smaller corpuses that scholars could use to search within the much broader collection, rather than trying to work with the entirety of the UK collection. Is finding a multitude of hits on the RNIB watch, however, a bug or a feature? It depends on the question asked. As Andrew Jackson, the web archiving technical lead at the British Library, responded, exploring archived websites cannot be the same as for a consumer exploring the live web: When a user searches for “iPhone,” Google might guess that you care about the popular one, but perhaps a historian of technology might mean the late 1990’s Internet Phone by VocalTec. Terms change their meaning over time, and we must enable our researchers to discover and distinguish the different usages. … Moreover, in a very fundamental way, the historians we have worked with are not searching for the one top document, or a small set of documents about a specific topic. They look to the web archive as a refracting lens onto the society that built it, and

are using these documents as intermediaries, carrying messages from the past and about the past. In this sense, caring about the first few hits makes no sense. Every result is equally important.33 It is a new frontier, requiring new approaches to accessing and analyzing information. We need to pull our gazes back and see things with a macroscope, as Jackson and others have suggested. Indeed, the conceit lies at the heart of a 2015 book I co-authored, Exploring Big Historical Data: The Historian's Macroscope.34 Our book argued that if historians were to continue to be able to lead in studying the recent past, they needed to learn new skills and techniques to navigate the array of born-digital (like websites) and digitized (like newspaper databases) sources that are now baked into the core of our profession. The approaches discussed in the book argued that we needed to understand algorithms, design better tools, and engage with the various scales and types of sources we can now find. Ultimately, we can either generate custom solutions – such as a search engine that filters out advertisements of the RNIB watch, or highlights them (imagine a historian studying online advertising) – or we can begin to create more flexible approaches. Much textual analysis necessarily starts with extracting plain text from webpages. This is not a straightforward process and often requires specialized tools, which we will discuss later.

The Power of Text: From Plain Text to Beautiful Data

Text is gold for a data miner. With it, we can generate searchable indexes, use various criteria to select down to smaller bits of text that we can read with our own eyes (so instead of reading an entire web archive, perhaps just the pages that contain a certain word or mention a person's name), or explore more sophisticated data mining approaches. The irony is that deforming the primary source to generate simple plain text can yield far more access to content than working with the original, rich, multimedia formats. But all of these examples speak to the power of text: how a historian can leverage developments in the field of computer science and information retrieval to find new insights in large, vast arrays of text. There is too much text for a person to read in many of these archives – but not too much for an algorithm to make sense of it.

The first step in using text is to extract the "plain text" from a web archive. Recall our discussions of HTML and all of the other rich elements that make up a website. If we want to work with text, we often need to take all of that web content and strip it down to the text alone: take away the tags that adorn the text, extract text from PDF files, and beyond. This has the effect of also shrinking a web archive down, often to around 10 per cent or so of its original size – a one TB web archive might become a still large but more feasibly sized 100 GB collection. Once plain text is out, what to do with it? A search index, like one that can be used on Google or Bing, is often the starting point. While researchers cannot turn to these titans of industry, they can turn to open-source projects that seek to create similar search indexes over their own collections. One notable such program is Apache Solr, which lets you create a search index from your own array of material. With a search engine, you can then search for particular keywords, phrases, and beyond to find content within a web archive – much like you might with Google or Bing. But with web archives we have an additional variable: the date that a page was preserved in the archive. This allows us to read a temporal layer onto search results. Taking our running example of political party websites, then, we can see how often a given term appeared in a certain year within particular websites. For an example, see figure 3.6. This is a search for public transit across the archived websites of the Liberal Party of Canada, the Conservative Party of Canada, and the New Democratic Party of Canada – the three major federal political parties. Our team discovered that the Liberal Party of Canada embraced public transportation under the leadership of Stéphane Dion (December 2006 – December 2008), and then largely eliminated mention of this policy plank on their website after his defeat. Note the precipitous decline in the Liberal figure in 2009 under their new leader, Michael Ignatieff. In some ways this is more helpful than trying to use a search engine to find all pages containing given text. If we were to just look within the Liberal Party of Canada's web archive for public transit in 2008 alone we would see 8,536 results. While search engines and frequency charts, as seen in Shine, are very useful for the general public and more specialized researchers, they have one major downside: they still cannot always narrow results down to a number of pages that you can read. Those 8,536 results are a veritable needle in the haystack of a web archive – but still

3.6 Frequency of the term public transit across three Canadian political parties from webarchives.ca.

months, if not more, of time to read them all. A researcher cannot really read that many pages. Where even to start? But what if we could examine that text more closely? If you were researching public transportation, you might want to work with those results alone – discovering what the Liberals were saying about public transportation in 2008, without having to read every single page. That is doable. A subset of text can be downloaded – perhaps just the text containing the string public transit in 2008 on Liberal.ca – and textual analysis can then be used to parse the results. This might be looking at word clouds of the content extracted – what words tend to appear alongside public transit, for example, such as subways or buses; or it might be, as we saw above, looking at how the frequency of words or topics changes over time. We can also explore other elements of text. Named-entity extraction, or NER, is one promising approach for exploring web archives. In short, an NER program reads text and marks it up to identify people, organizations, locations, and other categories. We can see, for example, what cities and countries are mentioned in a particular web archive, or which names come up most often. Text is extremely useful. But it is hamstrung in many of the above cases by its sheer size. Apart from isolating specific subsets, we are not

able to read all of the content. We instead need to use computers to parse the information for us. We can use data mining and textual analysis to make sense of it all; much of this falls within the apparatus of the computer science field of "natural language processing," which tries to make sense of everyday language and text. The problem with techniques like natural language processing, however, is that they are computationally intensive (the system needs to do a lot of work to make sense of even relatively straightforward English sentences), difficult to use, and prone to producing a lot of occasionally misleading junk data. Rather than trying to get a computer to read texts, there may instead be opportunities in the structured data that underlies the web archive: hyperlinks, metadata headers, and machine-generated information. But how can this dovetail with our traditional historical predilection for the content of the sources themselves?
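
Before leaving text entirely, it may help to see how small the first steps of this workflow can be. The following is a minimal sketch only: it assumes the open-source BeautifulSoup library for stripping out HTML tags, and the two sample records are invented stand-ins for pages that would really be read out of WARC files with a dedicated tool.

# A rough sketch of the text-first workflow described above: strip the HTML
# tags out of archived pages and count how often a phrase appears by year.
# The "records" list is an invented stand-in for pages that would really be
# read out of WARC files; BeautifulSoup is an open-source HTML parser.
from collections import Counter
from bs4 import BeautifulSoup

records = [
    ("20080312000000", "<html><body><p>Invest in public transit now.</p></body></html>"),
    ("20090710000000", "<html><body><p>Our plan for the economy.</p></body></html>"),
]

mentions_by_year = Counter()
for timestamp, html in records:
    year = timestamp[:4]  # CDX/WARC timestamps begin with the four-digit year
    text = BeautifulSoup(html, "html.parser").get_text(" ").lower()
    mentions_by_year[year] += text.count("public transit")

print(mentions_by_year)  # e.g. Counter({'2008': 1, '2009': 0})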

A Web of Links: Why Metadata Might Matter More Than Content

It is tempting to think of web archives as we would traditional historical archives: page after page of content, flipping through them as we would in a physical archive. The discovery tools discussed above accelerate this process: they allow us to find the pages we want, or the clusters of pages, or to distantly read the content within. Yet even the most sophisticated approaches largely treat web archives as if they were any other large corpus of text, akin to the millions of books at the heart of Google Books or the HathiTrust research consortium, or the nearly two hundred thousand court transcripts of the Old Bailey.35 Web archives, however, are different in the nature of their underlying hypertext. The metadata contained within them may actually shed more light on historical questions than trying to parse the content itself. This section of the chapter explains the concepts behind metadata analysis, and the next one provides an example of questions that can be better answered through metadata than content analysis. Metadata is a surprisingly difficult concept to define, especially once we move beyond the shorthand and generally insufficient definition of "data about data."36 Yet it is an extremely important concept that lies at the heart of finding relevant information quickly. One of the biggest problems in finding information is making sure that it is well described,

from subject headings to classifications to author information. Imagine a library without a library catalogue, or an archive without finding aids, and you can begin to imagine a digital collection that does not have metadata. Information needs to be described, tagged (like on a blog, where posts might be tagged dogs if they are about dogs or cats if they are about cats), or indexed in order to help users make sense of this large amount of information. Yet metadata comes in many different shapes and sizes. In the wake of the 2013 Edward Snowden revelations, metadata became familiar to a large public audience. The American National Security Agency (NSA ) aptly defines it as “information about content (but not the content itself).”37 This definition, which also conveniently helps to deflect attention from the privacy implications of working with metadata, illuminates the power of metadata. The NSA works with metadata because it is often more useful than the content it accompanies. In 2013 the NSA argued that their widespread collection of American domestic telephone metadata – whom each person called, was called by, and the lengths of their calls – was not surveillance, as “it does not collect the content of any communication.”38 Part of the government’s defence was that it was “just” metadata. As President Barack Obama noted, “Nobody is listening to your telephone calls … what the intelligence community is doing is looking at phone numbers and durations of calls. They are not looking at people’s names, and they’re not looking at content.”39 As Wired sarcastically explained, “People breathed a sigh of relief since first learning of the surveillance because surely there’s nothing to worry about when it comes to such seemingly innocuous information – it’s just metadata.”40 As accomplished harvesters of personal information, the NSA knows that metadata can be far more revealing than content itself. In the NSA example, single calls or emails may not tell a story. Rather, recurring patterns of contact, the identity of the sender and receiver, and other such structured data can help us grasp the stories of individual relationships more quickly.41 Another apt example is the work of Christian Rudder, co-founder of the online dating website OkCupid. Through a combination of metadata and content analysis, he explored the “actual” preferences of users (in contrast to their stated preferences) in ages and ethnicities of those they identified as potential mates, reducing those

who used his site to basic statistics.42 For example, a survey of men found that they claimed to prefer younger women; in practice, when looking at whom they actually contacted on OkCupid, they tended to not message people more than nine years younger.43 Beyond online dating sites, a quick way to see the power of metadata in your own digital life is to visit the MIT Media Lab’s Immersion project at https://immersion.media.mit.edu/. If you have a Google Mail account, you can give it access to your own communications metadata from just the From, To, CC , and Timestamp fields. From this, the site is able to reconstruct your digital life. My own anonymized case illustrates the point (take note: even though it is “just” metadata, I would not be comfortable sharing this all online!). Since October 2004 I have participated in over 70,000 email threads within Gmail, each of which has between one and over a hundred individual messages. As I forward all my professional emails into my Gmail account as well, and respond via an alias server, this is a record of my personal and professional life. Just a look at my metadata can quickly identify clusters of people whom I email and who tend to CC each other or be included on similar messages: my colleagues at the University of Waterloo, or my research collaborators, or my undergraduate friends, or friends and other personal acquaintances. You can begin to reconstruct details of my personal and professional lives: those whom I consider my closest friends and confederates, those I know primarily through others and who introduced me to them. A time slider can reveal how my friendships emerge, dwindle, and strengthen. It is a snapshot of my life. Even though it is “just” metadata, it is far more illuminating than reading even hundreds of my email threads (or even applying text analysis techniques to them). This example vividly illustrates the power of metadata. Metadata can also be harnessed, of course, for historical research. In chapter 5 I explore how we have used hyperlinks to find sites of interest within the GeoCities web archive: viewing links as a “vote of confidence” in a site, you can find sites that reflect exemplar sites that users were interested in at the time. These can help find the individual sites to read, allowing a marrying of distant reading through metadata and close reading via the actual webpages themselves. Other historical projects can also use metadata fruitfully. The Old Bailey Online, for example, has made its metadata available: each trial has information

about locations within it, the offence committed, the gender of the accused or victim, and beyond. The portal London Lives 1690 to 1800, available at https://www.londonlives.org/, allows people to leverage this metadata to see a rich overview of crime and social relations in London during that period. The movement towards metadata has been a dramatic shift for my own methodology. When I started working with web archives, I saw metadata as an impediment. It was line after line in the header of the file, keeping me from the valuable content below. The HTML tags – links, stylistic information, and metadata fields – were similarly obstacles to be removed in order to lay the foundation for textual analysis. This was not without its merits, as the substantial insights previously discussed stem largely from that approach. However, as I worked more with the data, it appeared that metadata could be the critical element. What else could we learn from it? A lot, it turns out, by trying to look past the individual trees of content and towards the forest of metadata: links, code, and the like. You can discover, for example, who was the most central person in my own email correspondence in 2005; or, by looking at how websites linked to each other, what were the ten most popular sites (according to people linking to them) in 1996. Outside of networks, too, one could see how many images of a certain kind were used in 2004, how frequent PDFs were, or how websites grew and shrank over time, just by looking at the dates of publication. Speaking more generally about big data and the digital humanities, Rockwell and Sinclair have noted the technical conveniences that underlie its use: "Metadata has the added virtue that it is more efficient than the full contents of [phone] calls. The problem with big data is that there is too much of it, and that is even more so for images, audio, video, and other storage-intensive multimedia formats. In many cases we have convenient metadata that provides a surrogate for what is most important about a phenomenon."44 What they mean here is that metadata is a proxy for something – if we cannot read every single page to see what other sites inspire them, we could look at the easily extracted hyperlinks to get at least a pretty good sense of things.

that provide sufficient metadata to reconstruct the networks and basic characteristics of websites. This also lets them get around thorny questions of copyright or other restrictions on sharing the raw material itself. By stripping out the content, metadata makes data more portable, but also encourages us to look at the characteristics that make web archives special in the first place. We have already encountered one lightweight metadata format: CDX files. They provide only minimal information, and from them we can learn only basic crawl contours: counts of how many websites were part of each domain, content types, what websites may have failed or forced redirects, and the basic timeline of when items were crawled. Initially the only alternatives to these minimalist CDX files were the heavyweight WARC files themselves. A CDX file might not have enough information, but the WARC files themselves are far too large: they contain everything that a web crawler captured, often including full videos, high-resolution images, music files, and the like. In between these formats emerged the Web Archive Transformation, or WAT file. It consists of everything found in a WARC file except for content. It contains information including the page's title, any data placed within the <meta> field ("typically used to specify page description, keywords, author of the document, last modified, and other metadata"), and other tags specifying creation and modification dates.45 Most critically, it also includes all links and their anchor text. To understand this, an example can help. A hyperlink to another website in HTML would look like this:

<a href="http://www.google.com">Search here!</a>

Rendered in a web browser, this would have a link that read “Search here!” and when you clicked on it, you would be brought to Google.com. WAT files store both the outbound link destination and the text that a user clicks on to activate the link (in this case, “Search here!”). Given the sheer size of WARC files, WAT files are a reasonable starting point for scholars to begin to explore for data contextualization. Some research examples can help explain. Perhaps the most involved examples of web archived link analysis come from work done on the Dutch blogosphere by two new media scholars. Their longitudinal visualizations of link structures between 1999 and 2009

allowed them to trace where the blogosphere first arose, how it evolved and developed, and how it declined.46 Where content analysis might suggest a primary interest in issues or events, the approach taken by Anne Helmond and Esther Weltevrede instead took the form of "structural" analysis.47 In Digital Methods, University of Amsterdam professor Richard Rogers notes how a team at the Digital Methods Initiative conjured a similar map of the early global blogosphere – drawing on a list compiled when the blogosphere was still small enough that a comprehensive directory could exist – and was able to effectively use network analysis both to carry out their analysis and to find more sites to download and explore. The team used links not just to connect the sites in their archive, but also to locate, through outbound links, the influential sites (in terms of being often linked to, in any case) that were no longer in the web archive. Indeed, it is a useful way to see not only what they had collected but also, just as importantly, what was missing: absences in the archive came to life by showing up on the network diagram. It was a fruitful undertaking. "The map of the early blogosphere, showing interlinkings between archived and nonarchived sites, is a means of conjuring up a past state of the web, and appears to be a method of working with web archives (historical link analysis) that has stuck. Among other things, it shows a sense of the relevance of the site at the time, and thus also the significance of the sites in the collection (and those missing). Perhaps it also could put a value on the missing sites so as to aid with their recovery."48 Indeed, links can tell us a lot, as they are generally deliberate acts. Studies from the 1990s revealed the selective rather than capricious nature of hyperlinking – people are conscious about whom they are linking to, rather than random or thoughtless.49 These links are in many ways akin to scholarly footnotes, which are also deliberately deployed. As conscious acts of connectivity, they are very useful to historians uncovering user behaviour. The inclination, however, may still be to gravitate towards content rather than this metadata. In the next section I demonstrate that by using the Canadian political party collection we first encountered above, fruitful historical information can be found within metadata.
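
It is worth seeing, before turning to that collection, how little code is needed to harvest this raw material. The sketch below is illustrative only: it uses the open-source BeautifulSoup library, and the snippet of HTML is invented. It pulls each outbound link and its anchor text out of a single page, essentially the information that a WAT file records for every capture in a collection.

# A minimal sketch: pull every outbound link and its anchor text out of one
# archived page, roughly the information a WAT file records for each capture.
# The snippet of HTML is invented purely for illustration.
from bs4 import BeautifulSoup

html = """
<html><body>
  <a href="http://www.liberal.ca/platform">Read their platform</a>
  <a href="http://www.youtube.com/watch?v=abc123">Watch our latest ad</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for anchor in soup.find_all("a"):
    destination = anchor.get("href")
    anchor_text = anchor.get_text(strip=True)
    print(destination, "|", anchor_text)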

A Metadata Case Study: Finding Hyperlink Stories amongst Canadian Political Parties

In 2005 the University of Toronto began assembling a web archive of all Canadian major political parties, minor political parties, and political interest groups – First Nations activists, campaigns to ban landmines, groups fighting for same-sex marriage, and so forth. They did so using the subscription branch of the Internet Archive, Archive-It, which allows institutions to carry out web archiving without having to develop local infrastructure and technical expertise.50 In August 2014 I began corresponding with them to see how we could use their collection to explore methods of web archival access and analysis. We initially turned to metadata contained within the archive itself. By using metadata alone, we found a robust approach to web archiving analysis that complemented, and in many cases exceeded, what we were able to do with a conventional search portal (which we discuss in the next chapter when we turn to enabling citizen access). Hyperlinks are a critical part of the web.51 By looking at all the links on a site as a network, we can begin to see central nodes: which ones receive the most inbound links, for example, or which ones are positioned between most websites, giving a sense of the overall structure of the web connecting them. We often do so by drawing them on a board, or at least in our minds, assigning shapes to represent each website or domain, with lines drawn between them to represent links. This is called a network diagram. We can then generate queries, such as trying to find the most popular website by counting the number of links it receives (a rough proxy for popularity, of course). It might be worth thinking of these as "votes": the more votes that a website receives, the more important it might be. They do need to be used with considerable caution, however, as some websites abuse algorithms that rank websites on the basis of the number of links they receive. For example, early search engines used to rank results for search hits – the more links that a website received, the more prominent it would be. The reasoning was that a website about cats that 1,000 people link to is probably more important and useful than a website about cats that 10 people link to. It is not a bad idea, but it can be easily gamed: if you owned the website that only 10 people linked
to and wanted to have a higher position in the search engine, you could either try to make a better site and get more attention, or you could begin to explore ways to get another 1,000 people to link to your site. Sometimes this might be paying somebody, or starting to trade links in order to boost rankings without really having meaningful connections. These link counts are often misleading, then. In some cases, you see “link farms,” or websites that exist only to create a multitude of links to other websites in order to boost rankings. You can often tell they are a link farm because the website is just a seemingly random array of hyperlinks, without any novel or meaningful content. This raises issues with our network graphs. Imagine if I set up a random website that nobody ever visited except web archiving bots: should the links, or “votes,” that it sends to other websites be as important as the links or votes that real websites send to them? Surely not – a link to a website from the website of the New York Times should be more valuable than a link from my fabricated one. Accordingly, algorithms like PageRank – which resides at the core of Google’s search engine – go through a network graph (like the one of all the links between websites in a web archive) and try to solve this problem of weighting the “votes” accordingly. PageRank does this by scoring a website on the basis of the links that it receives, but these scores are weighted by virtue of the score that the website sending them has received. If this sounds complicated, that is because it is. The only way that PageRank knows that links from the New York Times should count for more than links from a fabricated one is for it to have a sense of what websites link to the New York Times, and then who in turn links to those websites. Suffice it to say, this requires a lot of computer power, and the entire network needs to be calculated several times. It is, like any other rough proxy for real-world influence, notoriously messy as well. In the introduction of this book, for example, we saw Siva Vaidhyanathan’s cautionary note that web popularity or importance does not necessarily correspond with real-world popularity or importance – something any historian has to keep in mind as they parse their data.52 In other cases, especially when we are looking at a small number of actors, such as political parties, simple aggregate page links matter.53 In figure 3.7, for example, we see all links from the University of Toronto Canadian Political Parties and Political Interest Groups collection.
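
Before looking at that figure, a toy example can make the two measures discussed above concrete. The sketch below uses the open-source networkx library; the handful of domain-to-domain links, including a fictitious link farm, is invented purely for illustration. It computes both the raw inbound link counts (the "votes") and the PageRank scores that weight those votes.

# A toy illustration of the two measures discussed above: raw inbound link
# counts (the "votes") and PageRank, which weights votes by the rank of the
# site casting them. The list of domain-to-domain links is invented, and
# networkx is an open-source network-analysis library.
import networkx as nx

links = [
    ("ndp.ca", "liberal.ca"),
    ("conservative.ca", "liberal.ca"),
    ("liberal.ca", "youtube.com"),
    ("ndp.ca", "youtube.com"),
    ("linkfarm.example", "liberal.ca"),
    ("linkfarm.example", "conservative.ca"),
]

graph = nx.DiGraph()
graph.add_edges_from(links)

print(dict(graph.in_degree()))  # simple link counts: one "vote" per inbound link
print(nx.pagerank(graph))       # votes weighted by the rank of the sender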

3.7 All links within the Canadian Political Collection, 2005–2009. The bigger the text, the more often it was linked to.

These are the links between all of the sites included in the collection, as well as all the links to pages that might not have been included in it (i.e., Twitter.com may not have been archived as part of the collection, but whenever a site linked to it we made note of it). As you can see in the figure, the links that are the most common are to social media platforms like Twitter and Facebook, the sorts of things that are included in the footers, headers, and content of pages as a matter of course. The more intentional links, such as those between political parties or to certain media outlets, are obscured in it (we will turn to those shortly). The sheer number of links can overwhelm. With more focused research questions – for example, how did the linking patterns of major political parties change over time? – we can begin to see a fuller picture of metadata's utility. Here is where we move beyond the big nodes in

Table 3.1 Hypothetical hyperlink sources/targets

Source            Target         Number of links
Conservative.ca   Liberal.ca     10
Liberal.ca        NDP.ca         15
Liberal.ca        YouTube.com    40

this network diagram (i.e., Twitter.com) and begin to look at more detailed research questions. Taking the three major political parties in Canada – the left-leaning New Democratic Party (NDP ), the centrist Liberal Party, and the right-leaning Conservative Party – I extracted their metadata to see how the parties related to each other. Drawing on five years of quarterly web scrapes, we generated tables much like the example in table 3.1. There we can see that the website of the Conservative Party of Canada linked ten times to the Liberals, who in turn linked fifteen times to the NDP , and also linked forty times to YouTube. This example is useful because it illustrates two things: first, that every page within each domain is counted (i.e., if Conservative.ca/about and Conservative.ca/contact both pointed towards the Liberals, it would count as two hits). Second, it demonstrates that domains outside of the crawl, such as YouTube, were captured as well. Such numbers need to be used with caution, as they are not inherently meaningful. All links are not equal in impact: some are quietly tucked away in footers, while others are trumpeted in large header font on the splash page of a website and generate quite a bit of noise. Yet within that noise we can sometimes find genuine signals: if every website in a domain suddenly starts linking to another website, we can infer that a concerted campaign is underway to attack, or support, or to draw information from, or perhaps we are seeing the adoption of a new social media platform. In politics, links are deliberate and usually meaningful – perhaps even more than on the web at large. Political parties rarely trivially link to each other, and evidence of their interaction can be very useful. Consider the example in figure 3.8, which plots the three major parties and the different websites that they are linking to. In this graph,

3.8 Three major Canadian political parties and their inbound/outbound links, 2005–2009.

3.9 Link structures prior to the 2006 Canadian federal election.

each circle above represents a node or a website domain – i.e., all the pages within the liberal.ca site, or domain, are clustered under that one circle – and each line represents the links between all pages in one domain and all pages in another domain. You can see what domains were linked only from the NDP, which ones were linked only from the Conservatives, and which were linked by all the parties. We can also see how relationships between these websites changed. Figure 3.9 shows how the three parties linked to each other in advance of the 2006 Canadian federal election. This was a pivotal election, which saw the centrist Liberal Party of Canada (which had governed for so much of the twentieth century that they were considered the country's "natural governing party") displaced by the newly formed Conservative Party of Canada. In figure 3.9 we can see that the NDP extensively linked to the Liberals, and the two parties were linked to by many of the same websites. If we are curious what is happening, we can at any time "zoom in" by looking at the individual pages to find out what they are linking to or what the context of that link is, but it is often useful to keep our gaze distant.

Why was this? They were competing over the same votes. The NDP helped to bring down the Liberals during this election, a narrative reflected in both the political literature and this link graph. This can be seen by thinking about the nature of the links at play here. When the NDP linked to the Liberals, they were striking against them – linking to inconsistencies and weak spots in their platforms. Yet when other sites linked to both parties, those links could of course have been favourable, neutral, or negative. This example shows that while metadata can show us the connections, to uncover the meaning behind them we still need to look at the sites themselves. A mixed method approach that brings together metadata and content is often the way forward.

Bringing It Together: Combining Content and Metadata Analysis through Topic Modelling

Link metadata illuminates more than individual websites do. While we will return to text analysis in the next chapter of this book, we can use this approach – the analysis of metadata, combined with content analysis – to extract and model subsets of collections. Rather than using keywords to find and isolate a collection, mapping link structures is suggestive. One experiment proved particularly interesting. Taking the two main political parties, the Liberals and Conservatives, over the period of study, we used their link structures to isolate the communities that grew out of their party websites. Finding the parts of the link structure that tended to link more internally with a certain network (i.e., websites that tend to link to each other) rather than externally can serve as a rough proxy for community. We found that the domain liberal.ca was in the same community as interest groups such as the National Association of Women and the Law and media organizations such as Maclean's magazine and the Canadian Broadcasting Corporation. Perhaps unsurprisingly, their left-wing competitor, the New Democratic Party of Canada, appeared in the same community, as they also had links to these websites. By contrast, the Conservatives – the governing party during the vast majority of the period we were studying – were grouped with many Cabinet ministers' pages, and with groups such as Consumers First, which fought for price parity between Canada and America. By extracting some of these pages and topic modelling the ensuing results, we can confirm existing narratives and raise new questions.

What is topic modelling? In brief, it is a text analysis technique that finds "topics" in unstructured text. For example, imagine that I am writing about women in a male-dominated labour movement. Given the subject matter, I may use words such as female, equity, differential, and women. Perhaps, when I write about something else like the factory, I use words like truck, assembly, whistle, and klaxon. Topic modelling, in a very rough nutshell, takes the text that I have written and finds groups of words that tend to appear together: in this case, parts of the document where I write about women and parts of the document where I talk about the factory itself. It is a quick way to get a bird's-eye view of what might be under discussion and important in a large amount of text.54 Taking the link communities that appeared around political parties, we then downloaded all the home pages within those communities from the Internet Archive and topic modelled the text within their websites. In December 2014 the websites in the Liberal community contained topics such as cuts to social programs, mental health issues, municipal issues, housing, and their new leader, Justin Trudeau. The most prevalent topics on Conservative websites included Ukraine, the economy, family and senior issues, and the high-profile stimulus-based Economic Action Plan. In short, we found roughly what we would have expected to find. When we conducted the same experiment for 2006, nearly a decade earlier and the year of a federal election, the results were more surprising. On the websites within the Liberal community the following topics were top of mind: community questions, electoral issues, universities, human rights, child-care support, and northern issues. For those within the Conservative community, important topics included education and governance and, notably, several topics relating to Canada's Indigenous peoples. While the Liberals had advanced a comprehensive piece of legislation to improve the conditions of Canada's Indigenous population, and hence the Liberal community's interest was unsurprising, Conservative interest in the topic was. Indigenous issues were not a stated priority of the party: perhaps it reflected the Conservative opposition to Liberal initiatives? As one commenter on an earlier presentation of this research suggested, the uptick on Conservative-friendly sites may have represented the influence of key advisors within the Conservative campaign itself – one of whom was a leading Conservative scholar of Native-newcomer relations – who were attempting to shape the Conservative

message and to not allow the Liberals to dominate this issue. Questions are raised, suggesting great promise in marrying content and metadata in such a manner. This example is just a matter of suggesting questions for researchers and how links and content can be joined together; we will return to tools for content analysis in a later chapter. In any case, this all suggests that we need to reassess our relationship with historical sources in the digital age. While textual analysis may come most naturally to those accustomed to reading sources for the information they contain, a combination of metadata and content analysis may ultimately be the most feasible and fruitful approach for web-based historical scholarship. As the structure of web links is intrinsic to how online sources are created and generated, they lend themselves well to this approach. Metadata allows us to pull our gaze back and “distantly read” millions of these websites. Yet the parallels with the National Security Agency also remind us of the ethical questions at the heart of much of this method.
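
For readers curious about what producing such a model actually involves, the sketch below gives a flavour. It uses scikit-learn's implementation of latent Dirichlet allocation, one common topic modelling algorithm (not necessarily the one used in the work described above), and the four toy documents are invented stand-ins for archived webpage text.

# A toy sketch of topic modelling using scikit-learn's latent Dirichlet
# allocation. The four "documents" are invented stand-ins for archived
# webpage text; a real run would use thousands of pages and more topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "public transit buses subways fares riders",
    "economy jobs stimulus action plan growth",
    "transit riders subway expansion fares",
    "economy growth jobs taxes stimulus plan",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

words = vectorizer.get_feature_names_out()  # requires scikit-learn 1.0 or newer
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-4:]]
    print("topic", i, ":", ", ".join(top_words))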

Conclusions

As we have seen repeatedly throughout this chapter, the raw data that make up web archives are very messy, meaning that our analyses also have an element of fuzziness about them. This chapter has been concerned with the more technical side of our explorations into these web archives, as well as conceptual ones. From how to define a webpage – hundreds or even thousands of objects and external calls that combine to form one seemingly simple "page" in your web browser – to the underlying index and web content creator formats that power the Internet Archive, a technical understanding of the archive is essential if we are to make sense of the results we gather. At the heart of the approaches presented in this chapter is the tension between traditional content exploration and the increasing importance of metadata as a tool for historians, and even as a source itself. Historians may see web sources simply as digital versions of print sources, and imagine that they will be able to work with content in the way they always have, but they will be forced through technical limitations and developments, and the sheer impossibility of reading all the sources they gather, to focus on metadata. Metadata is structured, relatively

straightforward, and accordingly easier to extract and visualize. When it comes to web-based sources, historians will need to reappraise what content means and expand their definition of it. This raises one last critical question: who is in control as we enter this new world? A fear in all of our work with web archives, as seen in this chapter, is that the focus on tools and what is technically possible might begin driving research, rather than the research questions driving tools development. Some of the research questions that I have asked in this and other projects have been influenced by the available technology – I cannot do a close reading of pages, so end up asking questions that lend themselves to structured data. How often are photos borrowed and shared across GeoCities, for example? How many different file types appear, and how does that change? How do link structures reflect the lived reality of Canadians? How does keyword frequency change over time? All of these questions are dictated to some degree by available technology. They are precise questions I feel confident I can answer with the data I am able to gather. It is worrying to think about the questions that may not be asked because they cannot be satisfactorily answered with the sources, and the tools, that are available. Of course, historians' questions have always been framed in part by the availability of sources and the infrastructure available to interpret them. So, turning to a different kind of record does not necessarily represent a dramatic transformation. Yet we still need to consciously avoid letting technology dictate the historical research agenda, especially when it does so insidiously. Historical questions that would have arisen anyway are essential to keeping this field healthy, and really signify the place of a historian at the table.



UNEXPECTED NEEDLES IN BIG HAYSTACKS

This chapter explores how we can find needles in haystacks, mobilizing search engines, computing clusters, and national libraries to find and make sense of the extraordinary wealth of cultural information contained in web archives. How can we, as historians and archivists, scale our efforts so that web archives become genuinely accessible and usable? After an overview of the basic infrastructure and work of various national endeavours, the chapter addresses the computational skills and pedagogical reforms necessary to tackle this new era of research. Wary of being too technical, as technology rapidly changes in this field, the chapter focuses on underlying principles rather than providing reams of code and specific technical implementations and approaches – indeed, the technology develops so quickly that it might be dated by the time this book appears on a shelf. As collections grow, new methods are necessary. Some of this work involves physically sitting down at a computer in a national library, exploring pages of a website one at a time. Other steps involve harnessing and deploying cutting-edge enterprise big data platforms to explore data at scale. All, however, share the common goal of preserving our cultural heritage and making it accessible. While the sheer amount of data can be overwhelming, technology can help bring it into relief.

The Deceptively Simple and Powerful Wayback Machine

All of the methods of access and exploration that we have discussed so far in this book, as well as the ones to come, ultimately rely on a Wayback Machine to give us access to the documents themselves. The program has a deceptively simple purpose – allowing a user to view archived


websites held in WARC and ARC files much as if they were viewing them at the time of their creation – but this means that it is the key to most of our explorations into archived websites. The Wayback Machine was introduced in 2001 to provide access to the Internet Archive’s holdings, freeing researchers from needing to know how to remotely access a server. The International Internet Preservation Consortium and other open-source developers have continued development on the open-source OpenWayback.1 There are many Wayback Machines, from the original one at the Internet Archive that provides access to their global collections, to smaller Wayback Machines providing access to Archive-It collections, the holdings of smaller institutions or national libraries, and beyond. The Wayback Machine’s seeming simplicity helps to conceal some very important decisions and assumptions made behind the scenes. Reconstructing old websites is not a straightforward process. The Internet Archive’s Wayback Machine is seen in figure 4.1, using the example of Wikipedia from 2001. It lets you look at pages as they existed at the time that they were captured, with the caveats we have already explored about technology and browsers. You can see how a page has evolved over the course of its life. Accordingly, the most important element of the Wayback Machine for the user is the bar along the top: the total number of captures, or times that a given website has been preserved, is listed, as well as a temporal graph that shows how frequently it has been collected since 1996. In the Wikipedia example illustrated in this figure, we see that the first crawl was in July 2001, with intermittent crawls in 2002 and 2003, before being continuously crawled daily or over the course of several days from 2004 onwards. This demonstrates the growing popularity of Wikipedia, as reflected in crawl frequency, while allowing us to see how a single page (in this case, Wikipedia.org) has evolved over almost fifteen years. We can begin to click through history. It also attempts to preserve temporal integrity, a topic we discussed at the beginning of the last chapter. Remember how webpages are crawled: a web spider visits a site, downloads a copy of a page, and then begins to follow the links it contains. This means that two pages are rarely, if ever, downloaded at the same time; and indeed, they might be crawled days or weeks apart. As we have discussed, the homepage of a newspaper cannot be crawled at the exact same time as the articles linked from it. Content may not have been crawled on the first attempt either, or a

4.1 Default Wayback Machine on Archive.org, 2015. Rendering Wikipedia from 2001, when it had “only” 6,000 articles.


decision may have been made to go deeper into a website as it became more popular. In figure 4.1, the first crawl was on 27 July 2001. Users who click on the link to “history” are brought forward to 11 September 2002, because that was the first time that that specific page was captured. This is made clear from the bar at the top of the page. Yet even a single website can mislead. Remember the complexity of webpages – the dozens or hundreds of files, from text, images, widgets, and so forth – that make up a “single” webpage. The example of the Weather Underground page we discussed in chapter 3 bears that out well. In short, Wayback users are often exploring webpages that never existed in the way that they were archived. Programs like oldweb.today are beginning to make this anachronism transparent, largely using the TimeMap protocol, which shows the different resources used to assemble each page. To test this procedure in the Internet Archive’s Wayback Machine, you can load a page and explore the details tucked away in the header. For example, this 13 February 1998 capture of GeoCities.com – found at http://web.archive.org/web/19980213154824/http://www13.geocities.com/ – contains an Amazon.com banner from 9 October 2000 (http://web.archive.org/web/20001009123431/http://pic.geocities.com/images/main/120x60amazon.gif). Figure 4.2 shows how a user can test for temporal violations, by selecting “About this capture” in the upper right corner of the screen. The Wayback Machine allows close reading of web archived content, but it is not an all-encompassing, flawless discovery tool. As noted, a researcher needs to know the URL or a keyword on a home page to begin, which is no small requirement. Yet its deceptive simplicity conceals many of the underlying technical decisions that have been made. It may look as if you have gone way back in time to see an old webpage, but there is a good chance that webpage never existed as displayed. Users need to be aware of this limitation, and the Wayback Machine (and other systems) now make that very straightforward to explore. It is a useful reminder, articulated several times throughout this book, that web archives are always imperfect facsimiles of the original: navigated using different hardware, HTML standards, and browsers, and with difficulties in displaying dynamic content as it would have originally existed. Just as paper documents are imperfect representations of reality, so too are web archives – but, as we have discussed, an archived paper document is in many ways a closer representation of its own reality than an archived webpage is of its. With these provisos in mind, let us look at various models that help us find our way to these Wayback Machine pages.

4.2 Exploring temporal violations in a web browser.
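Readers who want to verify these capture times outside the browser can also query the Internet Archive’s public CDX index directly. The short sketch below – written in the same Scala used for the scripts later in this chapter – is only an illustration under my own assumptions: it prints the first few capture timestamps for the GeoCities front page and for the Amazon banner embedded in it, so that the temporal drift between the two can be compared. The field and limit parameters are those documented for the CDX endpoint; the rest is scaffolding of my own.

import scala.io.Source

// Illustrative sketch: list capture timestamps for a page and for one of its
// embedded resources, so the drift between them can be inspected.
// Assumes the Internet Archive's public CDX endpoint at web.archive.org.
def timestamps(url: String): Seq[String] =
  Source.fromURL(s"https://web.archive.org/cdx/search/cdx?url=$url&fl=timestamp&limit=10")
    .getLines()
    .toSeq

println("Page captures:   " + timestamps("www13.geocities.com/").mkString(", "))
println("Banner captures: " + timestamps("pic.geocities.com/images/main/120x60amazon.gif").mkString(", "))

If the earliest timestamp for the page differs from the earliest timestamp for the banner by months or years, the assembled capture is, strictly speaking, a composite that never existed at any single moment.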

Avoiding the Tyranny of the Algorithm

Since 1996 we have been collecting large quantities of web archives: we now need to put them to good use. We have seen what is possible when we have direct access to the underlying files that drive web archives: extracting metadata from political parties over time, for example, or making fruitful use of GeoCities (as we will see in chapter 5). Using these sources, however, requires advanced technical skills. Where does this leave historians today, especially those who may not want to develop advanced computational skills? We need to start thinking about accessible research portals. The conviction that web archives are for everyone and should be accessible to everyone has been steadfast since their inception. The


Internet Archive’s 2001 launch of the Wayback Machine made their collections accessible.2 They were not satisfied with access flowing only to researchers who knew how to program. At this point in the history of web archiving, we are building on this earlier heritage as we develop new tools. Historians need to be involved in this development. It is too important to their work for it to be left to information retrieval experts and external vendors. When working with thousands of documents – let alone millions or billions – we will need understandable algorithms in order to sort and navigate them.3 We cannot read every source, and we cannot rely on either time sequencing or randomness to present them to us. We need to use relevance-ranking to make sense of it all. Yet if we do not know why the first result in a list of search engine results is ranked number one, and why another result is the ten-thousandth, we will be at the mercy of others. Indeed, in the worst-case scenario, the computer scientist, librarian, or other technologist who designed the algorithm (or an unseen hand behind any of these) may wish to direct the historian towards certain sources and away from others. In effect, they could choose what sources appear in a search. This limitation is unavoidable and not bad in and of itself, but historians need to be involved with these decisions so that they can understand algorithmic biases. To understand this point, examples can help, beginning with the most common form of institutional web archiving: the Archive-It web archive.

Special Collections: Exploring Archive-It Web Archives

The back end of web archiving is time consuming and requires technological expertise, and so many libraries and other collecting institutions outsource this aspect and focus instead on providing curatorial and subject expertise.4 One company they turn to is Archive-It, the subscription service of the Internet Archive, which helps institutions run their web archiving activities. Columbia University Libraries, for example, has been assembling its Human Rights Web Archive since 2008, a global collection covering NGOs, human rights organizations, notable individuals, and other relevant sites (see figure 4.3). Using the Archive-It web service, universities and institutions like them have a back-end portal that allows them to enter seeds – websites to begin their crawling


4.3 The Columbia University Human Rights Web Archive.

from – as well as to manage the path that the web crawler takes as it collects information. In several respects, topic-specific sites are ideal collections for researchers, compared to the wide-open scope of the broader web harvests. They are thematically focused, scoped by librarians or others who provide the initial list of websites to collect, and tied to institutions that can grant access. There are hundreds of Archive-It collections, covering diverse topics from the unrest in Ferguson, Missouri, to the Russian invasion of the Crimea, and the legacies of the British slave trade. They are a modern-day Archives and Special Collections, offering collections similar to what one would find on a visit to a university’s library to explore niche areas of specialty. Critically, most are accessible to the public. The Canadian Political Parties and Political Interest Groups (CPP ) collection, which we discussed in chapter 3, is a key example of a useful Archive-It collection. The CPP collection is of national interest in Canada, covering the websites of some fifty groups ranging from


the major political parties of Canada, to minor ones, and a sometimes nebulous assemblage of political interest groups. It continues to be collected quarterly. In 2015 our team at the University of Waterloo and York University – led by Nick Ruest – explored how we could provide greater public access to the archives. Despite the many scholarly use cases for the Canadian political collection, its actual access metrics remained surprisingly low – and it was not being cited in published research, even in studies where access to its content would have helped. Part of the fault for this limited usage lies in the standard Archive-It search interface, which, although very useful for many users, has limitations for a serious, scholarly user. For example, in 2018, I queried “Stephen Harper,” the Canadian prime minister between 2006 and 2015: 1,302,624 hits over 65,132 search results pages in the CPP collection. Harper’s Facebook page from May 2009 appears at the top of this list, but there is no real indication of why it should be ranked above his Twitter account, for example, or his official parliamentary page. Search results in the CPP archive are ranked by order of their likelihood to be “best matches,” an inherently subjective metric that is unsuited to academic researchers. A targeted search might help, but there are no further options to easily refine it on this page – one has to continue to make stabs in the dark with more sophisticated search strings. Compare it to the results of commercial search engines, such as Amazon’s results for a similar query. If you search for “Stephen Harper” on Amazon.com, the first hits are very relevant. Along the left-hand column, various “facets” or filters help users narrow down searches. A seeker may want works from the twentieth century, for example, or a work on First Nations. As Amazon sells hundreds of millions of products, this navigation mechanism is important. The United Kingdom’s legal deposit web archive faced a similar problem with the ranking of their results as they attempted to make sense of the sixty-five TB that comprised the .uk domain crawl between 1996 and 2013. Working with Big UK Domain Data for the Arts and Humanities (BUDDAH) researchers, staff at the British Library developed the Shine portal, which we discussed in the last chapter. Given the significance of the CPP collection, our team in Canada wondered if we could implement the Shine interface on this collection.


With the 2015 Canadian federal election a few months away, the time might be ripe to try to bring web archives to everyday voters. What would happen? What could we learn?

Enabling Citizen Access to Canadian Political Web Archives

Ten years is a long time on the web. Canada saw seismic shifts in its political culture between 2005 and 2015. Between 1993 and 2006, the Liberal Party of Canada had governed Canada in a series of majority governments, allowing them to imprint their particular brand of centrist, liberal governance on the country.5 This involved articulating a vision of Canada with a strong federal government at home and as a proponent of soft power abroad. The Liberals were arguably able to imprint their brand onto the country.6 All of this changed in early 2006 with the election of the federal Conservative Party of Canada, a new party that united the previous “Red Tory” Progressive Conservative Party (which married fiscal conservatives with social liberals) and western fiscal and social conservatives. This new government, led by Stephen Harper, was understood by many academic and journalistic commentators as heralding a vision of Canada based on nationalistic and military values, decentralization and provincial rights, and limited government intervention.7 The period between 2005 and 2015 also saw other substantial political shifts: the rise of the left-wing New Democratic Party of Canada (NDP) from its also-ran status to eclipsing the Liberals as the Opposition. Debates about the role of Quebec within Canada’s federal government resurfaced, as did substantial shifts in foreign policy approaches. Canada saw military interventions in Afghanistan and Iraq, and increasingly heated debates at home about Canada’s treatment of indigenous peoples. These shifts all played out online. In light of these trends, it is perhaps surprising that the CPP collection that the University of Toronto and Archive-It had been assiduously collecting over the years was not used more by journalists, academic researchers, and even members of the public interested in recent Canadian political history. One could use the collection to trace the rise and fall of various ideas (and promises) within the major political parties, for example, or how public platforms, policy positions, and


representations changed over time. Yet I was not able to find any citations in the scholarly literature. Why might usage, as represented in both scholarly citations and analytics, have been so low? The CPP collection suffered from privacy by obscurity, even if political parties may have benefited from having this material tucked away. The collection was hidden on Archive-It’s webpage, with its limited search engine, among hundreds of other collections. It does have a collection page itself (https://archive-it.org/collections/227), but it is a little hard to find, especially if you do not know that you really want to use something called a web archive. While it is a collection with considerable importance to Canadian observers, it is just too difficult to find and meaningfully access, even for those who are vaguely familiar with the concept of web archives. Our interdisciplinary team (again, I want to emphasize the collaboration with Nick Ruest that made this possible) adapted the Shine search portal to work on the political collection. One of our main considerations in implementing it for the CPP collection was the public nature of the collection – as a web archive of political parties and actors, there was little expectation of privacy within the collection. As we discuss in the next chapter, the same would not be true of a large web archive of the voices of everyday people (such as GeoCities). Given the public nature of the collection, however, we felt comfortable launching it with a media release and seeing it carried on the Canadian national broadcaster, CBC. Over two thousand people came to run searches on the site over the next three days, and within two weeks over 8,400 searches had been run on the service. A common theme in the media coverage, and in people’s own communications with me, was that most Canadians had never heard of web archives before. The Shine interface and WebArchives.ca allowed us to present several important findings, as a result of the incorporation of powerful full-text search and the ability to visualize “trends” over time (similar to the example of “public transportation” in political parties in the last chapter). First, political parties were an ideal platform for experimenting with web archives. One problem with Shine is that it is difficult to find trends when pages are not deleted. If we found an archived page with the keyword recession in 2015, there was always the chance that it was actually written in, say, 2010 or 2012, which tells us little about changes in 2015. Many domains, such as civil service departments, news websites, or


university websites do not delete content but instead archive it. Yet political parties delete content very frequently, as leaders or policies change (this is sometimes presented as a negative event for democracy, but it does leave little question about policies and leadership). This meant that web archives were useful – i.e., if they had not been archived, there would be no live version to turn to today – and also generated more useful results. If a political party website is talking a lot about “public transportation” in 2015, it is likely a party that is concerned about the issue. Deleted content thus told us quite a bit about the evolution of the Canadian political sphere. User-generated comments, for example, were at one point quite common. The Green Party of Canada ran a blogging platform on its website until 2012, with commentary, feedback, and lively conversation. When we ran searches for the term fascist, a provocative term that is outside the typical language of Canadian parliamentary democracy, GreenParty.ca kept appearing in the search list. Indeed, in early 2012, if you visited the URL http://www.greenparty.ca/blogs/365/2011-12-13/fascist-legislation-us-senate you would have found a blog post, on the federal Green Party’s website, referring to YouTube videos by a fringe American group about indefinite detention of American citizens. Today, if you visit that URL, you receive a “403 Access Denied” page. Perhaps in its bid to become a more significant political party, the Green Party had erased the majority of its user-generated content from its website? Or perhaps this was part of a broader trend away from user-generated content on sites like these. This was material that would have been previously inaccessible, and undiscoverable, and could now be plumbed using the archives. Similarly, the left-leaning NDP used to use the pejorative term tar sands to describe Canada’s Athabasca Oil Sands, eventually replacing it with the oil sands – perhaps in a play for votes as they moved towards the centre of the political spectrum. More importantly, we demonstrated what was possible with a publicly facing portal with intuitive user features and the right timing. Results were ranked in crawl order (i.e., the first site to appear in a list of results was the first to have been collected into the system, as opposed to the most relevant to a search), as the original BUDDAH researchers had been wary of having search results mediated through the black box of search engine technology.8 While a valid point – algorithms we do not understand too often shape research – the better solution is to develop


transparent ranking mechanisms and discovery practices.9 Given the unique nature of temporally based, fragmentary web archives, historians themselves need to be involved in this sort of work. Luckily computer scientists and information retrieval experts have been experimenting with web archives. It turned out that just as historians and humanists were beginning to think about accessing this material, computer scientists were hoping to find new users. In the synergy between these two groups, a fruitful pathway for humanistic research into web archives could be born. Enter the Archives Unleashed toolkit, the web archiving analytics platform.
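Before turning to the toolkit itself, it may help to make the idea of a transparent ranking mechanism concrete. The toy sketch below is my own illustration, not the algorithm used by Shine, Archive-It, or any other system: it scores each archived page by how often a query term appears in its extracted text and breaks ties by crawl date, so that a researcher can explain in a single sentence why one result sits above another. The Page class and the sample data are hypothetical stand-ins for archived records.

// A toy, fully inspectable ranking rule: score = number of times the query
// term appears in a page's extracted text; ties go to the earlier crawl date.
// Page and the sample below are hypothetical stand-ins for archived records.
case class Page(url: String, crawlDate: String, text: String)

def termCount(text: String, term: String): Int =
  text.toLowerCase.sliding(term.length).count(_ == term.toLowerCase)

def rank(pages: Seq[Page], term: String): Seq[Page] =
  pages.sortBy(p => (-termCount(p.text, term), p.crawlDate))

val sample = Seq(
  Page("http://example.ca/platform", "20090501", "public transit funding and new transit maps"),
  Page("http://example.ca/news", "20060301", "one passing mention of transit")
)

rank(sample, "transit").foreach(p => println(s"${p.crawlDate} ${p.url}"))

Whether such a rule is a good one matters less than the fact that it can be read, criticized, and changed – which is precisely the kind of involvement historians should be demanding.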

Archives Unleashed: Unleashing the Power of Cluster Computing to Distantly Read Web Archives

The GeoCities.com dataset, mentioned earlier, that initially seemed impossibly large – the 168 million multimedia documents – turned out to be trivial to computer scientists. In hindsight, this discovery is unsurprising. Google searches take milliseconds, and they are searching an index that is well over hundreds of GB.10 Web archives are big for humanists and social scientists, but for computer scientists and those working in the area of Big Data, they are data of a size that they are used to dealing with regularly. Companies like Google use cluster computing to make such large data problems computable. Rather than using expensive, purpose-built supercomputers, large arrays of mostly off-the-shelf commercial computers are put together in a network. It is both financially economical and fast. Assembling twenty, forty, or hundreds of computers together, working on a common task, allows problems to be distributed to each individual computer, or node. The individual results can then be stitched together once the project is done. Several open-source platforms can coordinate these nodes and distribute jobs, including Apache Hadoop and Apache Spark.11 While both are difficult to use on their own, more approachable layers can be placed on top of them, such as PySpark, an application programming interface that allows the simpler Python language to interface with Spark. Clusters are also now increasingly accessible. In Canada, for example, the federal government makes computing resources available for free to researchers,


albeit after a peer-reviewed competition. Even easier, Amazon rents clusters to users for cents or low dollars an hour with its EC2 (Elastic Compute Cloud) or EMR (Elastic Map Reduce) services. They are becoming ever-cheaper commodities, meaning users can have remote access to dramatic computing power at the click of a button. The Archives Unleashed Toolkit aims to harness the power of this technology in work with web archives. Rather than developing specific platforms for working with web archives, it integrates web archives into Big Data infrastructure. Created by Jimmy Lin, a computer scientist at the University of Waterloo, with masterful contributions from Nick Ruest (librarian/archivist at York University), as well as myself, the Archives Unleashed Toolkit provides an environment to both ingest web archive content and run sophisticated analytics on content gathered.12 In the summer of 2014, we began working on building out the Archives Unleashed Toolkit (then known as Warcbase) to meet the needs of historians and social scientists who want to use web archives. It is a flexible platform that can be fruitfully run on a powerful desktop, a laptop, or a computing cluster. It scales extremely well, meaning that the exact same scripts work on a million-dollar cluster or a thirty-dollar microcomputer.13 This is not the place to recount the technical details, but rather to showcase the possibilities behind interdisciplinary collaboration. Historians will not all become programmers. Rather, they must be able to implement – with understanding – algorithms designed by others. In the case of the Archives Unleashed Toolkit, our models can be transparently exported to other historians. Following steps documented in a publicly accessible manual, researchers can move from (1) assembling a collection of web archives, to (2) ingesting their web archives into the system, and finally (3) following our models to run analytics on them.14 I am wary of providing too much technical information here, as the underlying infrastructure is bound to change between writing and publication and beyond, but an example script can help illustrate the amount of technical knowledge a historian might need to leverage this technology. In the following example, we use the Archives Unleashed Toolkit to extract the plain readable text of the Green Party of Canada’s website from the Canadian politics web archive.


import io.archivesunleashed.spark.rdd.RecordRDD._
import io.archivesunleashed.spark.matchbox.{RemoveHTML, RecordLoader}

RecordLoader.loadArchives("/shared/collections/CanadianPoliticalParties/arc/", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("cpp.text-greenparty/")

In short, this script “loads” the web archive files stored in the directory /shared/collections/CanadianPoliticalParties/arc/, filters them down to documents that are just HTML (i.e., no pictures or PDFs), filters them by the URL (in this case the “greenparty.ca” host), and then extracts the raw text as plain text and stores it in the directory cpp.text-greenparty. This text extractor script could be reused for different hosts. One could change “greenparty.ca” to “conservative.ca” if a different political party were wanted. By following instructions, historians can do this work. Tweakable, modular commands can give just enough information and power to users to do the sophisticated analysis they need to perform. We can also use the toolkit to extract hyperlinks, facilitating the network approaches that we saw in chapter 3. Indeed, those examples were created using the Archives Unleashed Toolkit. The following script extracts all links from domain to domain by year and month from the Canadian political parties collection we have been using:

import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/shared/collections/CanadianPoliticalParties/arc/", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
  .saveAsTextFile("cpp.links-all")

Without going through every line, this script finds the valid HTML files by filtering everything else out, generates a string that has the date (yyyy-mm, the first six digits of the date string) and the links, filters out the preceding www from each domain so that www.conservative.ca is recorded simply as conservative.ca (otherwise you get bits of both), and finally counts and aggregates it all. Results look like:

200511  westernblockparty.com  zundelsite.org     69
200602  blocquebecois.org      bloc.org           154
200602  blocquebecois.org      blocquebecois.org  26437
200602  blocquebecois.org      go.microsoft.com   46
200602  conservative.ca        conservateur.ca    48
200602  greenparty.ca          cbc.ca             12

In this case, we can see that the greenparty.ca linked to the Canadian Broadcasting Corporation twelve times, the blocquebecois.org to a Microsoft server forty-six times (we can use links as a proxy to see the rise and fall of Flash, for example), and the Western Block Party to the right-wing Holocaust denier Ernst Zundel sixty-nine times. These findings can be explored via a graph, but also through web browser visualizations and in other network analysis software. I discuss this in a bit more detail in chapter 6. In a series of hackathons our research team (including the wonderful Matt Weber from Rutgers, Nathalie Casemajor from the Institut national de la recherche scientifique, Jimmy Lin from Waterloo, and Nicholas Worby at the University of Toronto) hosted in Toronto, Ontario, Washington, DC , and the Internet Archive, the Archives Unleashed Toolkit and other platforms saw further development. Some of this work was in the realm of visualization, and another portion was in the area of extending the Archives Unleashed Toolkit to work with large


Twitter collections or to identify popular and recurring images. At the Library of Congress in Washington, we were able to take Supreme Court nomination collections, finding links and extracting text, with an eye to unleashing their WARC files. We explored senate.gov (the US Senate) and house.gov (the US House of Representatives), to look at where their pages about the nominations of Justice Samuel Alito and Justice John Roberts were linking to. These were two nominations that both began in 2005, and from these Library of Congress collections we could see how blogs, government pages, and news organizations all reacted to their candidacies. Once the hackathon team had this data, the questions began to flow. As one onlooker from the Library of Congress noted:

The room bustled with people clacking laptop keys, poking at screens, bunching around whiteboards and scrawling rapidly on easel pads. At one table, a group queried web site data related to the Supreme Court nominations of Justice Samuel Alito and Justice John Roberts. They showed off a word cloud view of their results and pointed out that the cloud for the archived House of Representatives websites prominently displayed the words “Washington Post” and the word cloud for the Senate prominently displayed “The New York Times.” The group was clearly excited by the discovery. This was solid data, not conjecture. But what did it mean? And who cares? Well, it was an intriguing fact, one that invites further research. And surely someone might be curious enough to research it someday to figure out the “why” of it. And the results of that research, in turn, might open a new avenue of information that the researcher didn’t know was available or even relevant.15

Indeed, the Library of Congress is an ideal pivot point here. The hackathon showed the glimmers of what we could do with national library collections. However, despite the prominence that these institutions have had throughout this book, it turns out that just because national libraries can now collect – and in some cases are legally empowered to do so – it does not mean that we can have access to them.
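For readers curious what that hackathon-style analysis looks like in practice, the following is a hypothetical sketch in the style of the Canadian scripts above – not the code written that day – that tallies which outside domains the congressional pages linked to. The collection path is a placeholder, and whether keepDomains matches subdomains exactly as written here is an assumption.

import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RecordLoader}
import io.archivesunleashed.spark.rdd.RecordRDD._

// Hypothetical sketch: count the outside domains linked to from senate.gov and
// house.gov pages in a (placeholder) Supreme Court nominations collection.
RecordLoader.loadArchives("/shared/collections/SupremeCourtNominations/arc/", sc)
  .keepValidPages()
  .keepDomains(Set("senate.gov", "house.gov"))
  .map(r => (r.getDomain.replaceAll("^\\s*www\\.", ""), ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._2 != r._1)
  .countItems()
  .saveAsTextFile("loc.nomination-links/")

A tally like this is what surfaced the Washington Post and New York Times contrast noted above; the interesting historical work begins only after the counting is done.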


National Library Legal Deposit Access: The Double-Edged Sword

The Archives Unleashed Toolkit or Shine are great if you have the WARC files themselves, often from the Internet Archive or from some other cultural heritage organization. But what if you want to use one of the legal deposit collections we have encountered throughout this book, such as those in the great libraries of Europe that have been amassing large amounts of information about their national web domains and communities? The collections of the British Library, for example, or the Bibliothèque nationale de France (BnF) and the Danish Det Kongelige Bibliotek? These national libraries and archives have taken varying approaches to researcher access, which one research team has categorized in three ways: dark archives, which are totally inaccessible, largely as the result of current copyright or privacy concerns (Norway, for example) but may be available in the future; white or open archives, which are available to the public on the web (the Internet Archive’s Wayback Machine); and “grey” archives, which allow access only under specific circumstances.16 Grey archives, such as those of the British Library or the French BnF, are of particular interest. In some cases, despite being digital, the access regime is far more restrictive than historians are used to – even when using comparatively restricted and similarly classified on-site resources. National legal deposit, the legal obligation to deposit published materials with a national library or cultural organization, is already an uneasy bedfellow with web archiving in general. Hyperlinks span nationalities, underscoring the web’s inherently global nature. Nevertheless, the legal power, authority, and longevity of national libraries mean that they will be significant sites of web access.17 Additionally, while legal deposit regimes have traditionally sought to gather everything published within their realms, the algorithmic and data implications of web archiving mean that for the most part these institutions must resort to some degree of sampling. As we know, even the most diligent archivist cannot “get it all,” nor should anyone hold themselves to that unrealistic standard. The British Library offers perhaps the best example of the uneasy relationship between legal deposit and web archiving. Their 2003


Legal Deposit legislation initially required website owners to opt in, a shortcoming rectified by a piece of 2013 supplemental legislation that enshrined mandatory legal deposit. In a nutshell, this means that all creators of websites in the United Kingdom are subject to having their data preserved by the British Library: sites that are registered in the United Kingdom (when they get their domains) are collected, from individual blogs and homepages to large corporate websites. They also do not need to respect the robots.txt protocol, discussed earlier in this book, the file that can ask a crawler not to collect a website. In many respects, this was in recognition of the fact that born-digital sources are the documentary record of tomorrow, and it came with an institutional commitment to ensure their preservation and subsequent access. Legal deposit is a laudable idea and would have prevented widespread destruction like the loss of AOL Hometown or GeoCities. The library concept of legal deposit, however, does not always mesh well with web archives. Access regimes that made sense for print books coexist awkwardly with born-digital material, hampering the ability of researchers to gather material in a timely fashion. There is also considerable private material: people may have provided sensitive personal information online, and they may feel uncomfortable with a copy of it being saved by the government for all to access. Hence restrictions need to be imposed. Yet legal deposit can be an uneasy bedfellow with web archiving. As an instrument, legal deposit made sense when there were a limited number of publishers, but now that we are all potentially publishers on the web, the legal situation is far trickier. In short, the power of legal deposit comes with the regulations inherent in legal deposit. A case study can help bring these issues into relief. At the British Library, where in 2015 I was able to work with the web archival collection in their reading rooms, legal deposit is a double-edged sword. As the library’s legal guidelines explain, “Access to non-print works that have been delivered to the deposit libraries under the regulations is restricted to computer terminals on premises controlled by the deposit libraries,” and access “to the same non-print work is restricted to one computer terminal at any one time in each of the deposit libraries.”18 Explicit comparisons were drawn with traditional legal deposit and research practices. Unfortunately, the rules implemented for on-site access to online material are more draconian than if the researcher were to use the


same material in print form. I would have been able to research more quickly and efficiently if the material had been printed, placed in an archival box, and shelved for me to request through the British Library terminals. While this is understandable to some degree – I can imagine copying content from an archived webpage far more quickly than I could photocopy a book, for example – it also raises questions around how efficient research can be with these materials. None of this is meant as a criticism of the British Library or other national libraries, but merely to note how legal deposit legislation can awkwardly interact with web archives. An example of how a researcher finds information in the United Kingdom’s legal deposit Web Archive bears this out. To use it, you need to physically visit and work at one of their six legal deposit libraries: the British Library at King’s Cross in London, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries at Oxford University, the University Library of Cambridge University, or the Library of Trinity College, Dublin. Entrance to these libraries is not automatically guaranteed. At the British Library, one needs a reader pass to enter the reading room. This requires a valid research reason, being old enough, and proof of identity and address.19 As many a disappointed researcher has discovered, one does not simply arrive at the Bodleian Library or King’s Cross and get immediate access to the resources contained within. Once inside, however, one can sit down at a computer terminal and load up the legal deposit web archive. The British Library web archive search portal allows you to do faceted keyword searches on the collection, similar to Shine. You can enter your search terms; results can be sorted by crawl date, and you can decide to view either oldest or newest crawls first. Subsequently, you can choose to refine by a number of automatically or manually generated metadata fields: content type (HTML, PDF, etc.), crawl year (1996, 1997, etc.), domain (.uk, .com, etc.), domain suffix (i.e., bl.uk, useful for information from large institutions like the British Library or others), or author (a field automatically generated from metadata). Website content has been indexed into the library’s regular search engine, integrating websites as “publications” into the general-purpose search engine that users navigate to find content. This means that any user searching for information on “pirates” can find related websites listed alongside relevant books or other archival documents. So far so good.


The trouble comes when a user finds content of interest. How does a researcher get the material out of the British Library to consult in more depth? Speed is increasingly of the essence for historians, especially those who are visiting archives for a short period. The first sign that this might be a problem is emblazoned above each computer terminal screen: “No photography.” This is understandable, given the licensing agreements that would govern much of the content accessible on these terminals, but it also begins to suggest the difficulties with researching web archival content. For historians at most archives and libraries today largely conduct their research through digital photography, taking hundreds or thousands of photos on their cameras or phones for downloading and reading at a later time. The consultancy Ithaka S+R has noted the general transition for historians away from an analytical visit to the archives towards one of mass collection. Their visits have become “increasingly photographic and less serendipitous.”20 While exact statistics on the degree to which digital photography now dominates archival research are hard to come by, it is a trend stretching back into the 2000s.21 Research trips are no longer bound by extremely lengthy stays in foreign cities, nor by the high expense of photocopying fees. Even the British Library, long an outlier in not allowing digital photography of material, relented in 2015 and allowed personal use photography for many of their collections. These research repositories are now full of historians with bad postures, leaning over their documents with digital camera in hand. Specialized software is even available for this, speaking to at least a medium-sized market. While the British Library now allows photography of most of their print materials, this permission does not extend to born-digital resources.22 The notice on their website as of October 2018 reads, “Access is via Library terminals and digital copies of content cannot be taken.”23 This means that while historians undertaking traditional research will be able to take literally thousands of free photographs, those doing digital research are limited in their ability to capture digital records. Paradoxically, their work is far slower. They are limited to two options: taking notes as they work on the web archive-collection machines, either by hand with a pencil or on a laptop precariously perched on a relatively small desk, or they can print content page by page from a printer. The price of printing is twenty-six British pence a page, or at the time of writing around forty US cents or fifty-two Canadian cents. The


research process of using these collections would be slower – and far more expensive – than using any other traditional archive that allowed digital photography. Similarly much of the user interface of the legal deposit web archive is based on restricting easy access to the content. A paramount goal of the legal deposit web archive, even beyond researcher usability, is ensuring that the content remains on the British Library’s computer and is not downloaded. The library’s web archive actually runs within a virtual machine, an emulation of an actual machine. At least as of my site visit in 2015, text or other content cannot be copied and pasted outside the machine: you can copy text within it and paste it into the browser bar, but only within that confined environment. Imagine not being able to copy and paste in and out of a web browser that you are researching with, and you have a sense of what this means. Frustratingly, every time a new browser tab is opened, the user is confronted with a copyright message that takes a few seconds to click through. It is very slow. These access restrictions also make the pages less faithful than the original. The browser renders them statically: dynamic content is eliminated, videos cannot play, slideshows do not work. Pages are generated as images, with only hyperlinks remaining. Yet you cannot mouse over the links to see where they lead, meaning you can discover their destination only by clicking on them. Users who select a link, follow it, and receive a 404 Not Found error do not know what they are missing. Researchers cannot take the URL to the Internet Archive to find missing content. Crucially, the source of the HTML cannot be read, meaning that the primary source is no longer intact as originally designed. Curiously, two users cannot access the same content at the same time. This is thanks to the print legal deposit regime, which means that only “one” copy of each website was collected. The access restrictions placed on the non-print legal deposit collection mean that meaningful research projects cannot be realistically carried out on this corpus. If the websites had been printed, catalogued, placed in archival boxes, and available to be consulted as paper documents at either the British Library or National Archives, they would be researchable far more quickly. It would permit free reproduction for quotations and research at home, thanks to digital photography, as well as optical character recognition for full-text search and even incorporation into a researcher’s own database. Scholars in cities without libraries


with access to the non-print legal deposit collection – British historians based in the United States, for example, or Australia – would find it prohibitively difficult to access the web archive. Indeed, if a doctoral candidate working under me in Canada wanted to do a PhD on a topic involving this collection, I would warn against it. Given the need to quickly complete a PhD, print sources can sometimes be much easier to use than these digital ones.

Legal Deposit around the World: Differing Access Regimes and Inspirations

The British Library is not representative of all legal deposit access regimes, of course, although it does help us begin to understand the tensions between legal deposit and accessing content. Some other libraries have taken a similar approach of on-site access only, albeit with somewhat lesser restrictions. In France, the Bibliothèque nationale de France, which has had web legal deposit since 2006 (strengthened and clarified in 2011), harvests the .fr domain, as well as sites within other domains that are “created by individuals based on French territory and which contents are made in France.”24 This appears to be done largely through finding domains hosted in France. This mission is carried out through several crawls: an annual broad crawl, consisting of (as of 2015) about 120 TB a year, starting with the 4.5 million seed domains obtained from registrars. There are also selective crawls, driven by approximately a hundred subject-specific librarians who select websites of interest to be collected. The key criteria are to find material that is “public” – so that it can be truly considered a “publication” under the auspices of legal deposit – and to collect a comprehensive snapshot of French online activity. As with the UK, the aim is to capture not just the .fr domain but places all over the web where French citizens gather, discuss, and promulgate information. When visiting the BnF in 2015, I had the opportunity to go on a tour and meet with many of their personnel – from the technical staff who ran the crawls, to the highly autonomous curators who assembled collections. One curator I met, who assembled material on philosophy and history, outlined his approaches. For example, he would visit the websites of every academic Department of Philosophy across France, trying to find the blogs of professors; he would search for book reviews of popular French books; and he would follow social media communities to


find links of interest. In short, he had developed a highly autonomous collection strategy that melded the talents of a collections librarian with the modern technology of a web archive. Access came with similar restrictions as the British Library, if slightly more relaxed. Perhaps given the more centralized role of the French state, access is available only at the main BnF site at Paris, as well as a satellite library in Avignon in southeastern France. The workstations had some restrictions – no photography, and no saving of documents as PDF s – but one could take screenshots and copy-and-paste selected quotes to another document on the station. More recently, since my visit, researchers seem to be able to access web archives using their own laptop (albeit only when physically on site).25 The BnF web archives were found either by searching by URL , or searching by keyword, presented in an attractive and easy-to-use French-language interface. Behind the scenes, they have a research unit exploring more sophisticated ways of providing access: WAT files for all collections, for example, and other forms of network analysis and cluster access. While the on-site limitations mean that there are few researchers – estimated to be thirty to forty users a month as of late 2015 – the less restrictive regulations may help spur further use. This interest in researcher access carries over to other national libraries. In Denmark, the national authorities have moved towards an access-centred model. Denmark, a small country of under six million people, has a world-leading web archive run as a joint collaboration between the Det Kongelige Bibliotek (the royal national library) and the Statsbiblioteket (state and university library). This may be due in part to the small researcher community. In the summer of 2015 they estimated that there were about thirty research users, allowing for close interplay between them and the librarians and technical staff who run the archive.26 Legal deposit of websites began in Denmark with their December 2004 Act on Legal Deposit of Published Material. Covering all Danish material “published in electronic communication networks,” it considered eligible materials to include two criteria: (1) it is published from internet domains, etc., which are assigned specifically to Denmark, or (2) it is published from other internet domains, etc., and is directed at a public in Denmark.27 Within the next ten years, it collected 600 TB of information – the majority from the .dk top-level domain, as researchers struggled to find


Danish information elsewhere.28 It saw slow but steady evolution. The Danish library’s first crawl in 2005 took about a year to complete, but by 2012 they were carrying out quarterly crawls of the web. With the cooperation of Facebook, they began to gather open Danish profiles, as well as a special collection of Danish YouTube videos.29 Thematic crawls complement annual snapshots, such as for elections, sporting events, or a national teachers’ lockout. In 2015, to celebrate ten years of legal deposit, they announced the development and release of a fulltext index for the collection. An outlier compared to the British and French, the Danish web archive does not require researchers to be physically on site in Copenhagen or Aarhus. Remote access can be granted through a one-page form, and if the project does not involve sensitive personal information, access can be allowed within a day or two. My own experience bears out this relatively painless experience. I was able to log in from my personal laptop, in Canada, through a virtual private network, and run a full-text search on the Danish web. Another exception is their fellow Nordic country of Iceland, which has had legal deposit since 2003 and has made the move towards allowing fully online access to their legal deposit web archive. Three times a year, and also during specific event crawls, the entire Iceland (.is) web domain is crawled and archived. Providing access through a custom Wayback Machine implementation, researchers and interested parties can gain access to the instance from home at http://vefsafn.is/. The only limitation is on material that requires payment to view, which is restricted to on-site consultation. Otherwise the entirety of the web archives is accessible online.30 They plan to release full-text indexes and more content from outside the Icelandic domain. These are only a few examples of legal deposit, found mostly within Europe: in addition to the examples provided here, Austria, Croatia, Estonia, Finland, Germany, Portugal, New Zealand, Norway, Spain, Slovenia, and Sweden all have enacted laws, although researcher access varies from openly accessible to on-site consultation only. Outside Europe, the situation is considerably less promising from a researcher access perspective. In Africa only one national library is a member of the International Internet Preservation Consortium (the Bibliotheca Alexandrina in Alexandria, Egypt); it has a partnership with the Internet Archive, notably hosting a significantly sized


copy, or mirror, of the Archive’s content.31 Yet it does not appear to carry out widespread work on its own web domain. Similarly, while South Africa has a legal deposit law that covers electronic publications, websites and online material are not included.32 Indeed, the lack of web archiving initiatives throughout the continent of Africa has raised questions about the South-North information flow, as institutions in the Global North such as the Internet Archive preserve their websites without permission.33 In a provocative paper, Peter Lor and Johannes Britz lay out the moral perspectives on South-North web archiving, arguing for a model of permissions, not playing into restrictive regimes by either embargoing information or restricting access (for, say, opposition groups), ensuring original creators have access to their archives, and helping to build the technical capabilities of Global South institutions themselves. In South America, the National Library of Chile has a fantastic web archiving program, providing on-site access to thematic collections of websites ranging from electronic media, to constitutional debates and political elections.34 In Asia, Singapore carries out domain crawls of their national domain (.sg) but provides access only to those sites that have provided explicit permission.35 The Japanese Web Archiving Project (WARP ) began archiving websites in 2002, and in 2010 received the power to “archive Japanese official institutions’ websites: the government, the Diet, the courts, local governments, independent administrative organizations, and universities.” Electronic magazines and special events are also archived with permission.36 China, while devoted to extensive preservation of print culture, also does not have a web archive.37 South Korea carries out selective web archiving.38 North America is, in some ways, farther behind in terms of national web archiving. This is not a factor of not having dedicated personnel and development teams, but rather a reflection of fewer dedicated state policies and resources. Library and Archives Canada, a founding member of the International Internet Preservation Consortium, has the power via the 2004 Library and Archives Canada Act to gather a “representative sample of the documentary material of interest to Canada that is accessible to the public without restriction through the Internet or any similar medium.”39 Their first major collection, the Government of Canada Web Archive, comprised three annual crawls of the federal government between 2005 and 2007. Internal priorities shifted, however,


under new political masters and library leadership, and no material was collected between 2007 and 2013, when the process restarted. While some spot crawling was done, none of the material collected over these six years is likely to be accessible, as it was carried out without an official mandate. Given the pivotal shifts in Canadian politics discussed earlier in this chapter, this is a travesty. However, the content that is accessible is available through simple full-text search and the Wayback Machine. Library and Archives Canada has returned to active collection with quarterly thematic collections, including on the Olympics in Vancouver, and the First Nations protest movement Idle No More. Yet while Library and Archives Canada might, under their legislative authority, have the power of widespread legal deposit of websites, they do not exercise it outside of these subject-specific collections. The US Library of Congress carries out national web archiving for the United States, albeit without the power of legal deposit for websites. It is probably most famous in this space for its high-profile 2010 agreement with Twitter to archive every public tweet, and the subsequent public relations and research issues that ensued when public access turned out to be far too difficult to achieve under the agreement’s terms, and collection ceased at the end of 2017.40 Its web archiving attempts go back to 2000 and the Mapping the Internet Electronic Resources Virtual Archive (MINERVA) project, which saw it partnering with the Internet Archive to preserve material around the 2000 election (and subsequent ones until the present). They were thus in place when the 11 September 2001 attacks occurred, allowing them to collect over thirty thousand websites in their aftermath.41 While it does not have legal deposit powers, the Library of Congress does notify site owners when it crawls and ignores robots.txt in some cases. Website owners have the opportunity to opt out of the collection, but on-site access at the library is preserved and, in the online search engine, it is made clear when something might require a visit to the on-site collection. We can thus see that web archives have many different homes: the large collection in the Internet Archive, smaller collections assembled by universities, libraries, and other institutions through Archive-It, and the extensive holdings of national libraries, accessible in a variety of different ways. Stitching them all together is the Memento Time Travel service. Led by researchers at Los Alamos and Old Dominion University, Memento provides common search access across dozens of web archives

Unexpected Needles in Big Haystacks

169

– and all of the ones mentioned in this chapter.42 Quickly integrating with web browsers, thanks to accessible plug-ins, Mementos incorporates web archives seamlessly into the browsing experience; rather than being relegated to websites you need to specifically visit (à la archive.org), they are baked into every page that you visit.43 When on a page, you can click on the Memento icon and be brought back in time. The multiple archives are also very helpful: a search for the Canadian Department of Agriculture within Memento for those scrapes closest to New Year’s Day, 2006, finds a crawl three days before that from Library and Archives Canada, eleven days afterwards from the Internet Archive, almost two years later from Archive-It, and a few more years afterwards from the Library of Congress. If we began with the problem of fragmentation, of having so many sources spread out over many collections, we now have movements towards combination, synthesis, and finding fruitful connections. Yet the movement towards legal deposit, and the placement of web archives within national libraries, is an unsettled one. Currently, research using these collections is no more efficient than using print collections, and in many cases – especially the British Library – are far less efficient. Technical problems, however, can be surmounted. The cultural ones of the historical profession will ultimately prove far tougher.

Conclusion Web archives can be accessed with tools of varying types and degrees of precision and power: from the traditional Wayback Machine, sufficient for limited explorations, to more computational inquiries with search engine front ends and even computer code itself. As the price of digital storage plummets and communications are increasingly disseminated via digitized data, humanities and social science researchers have seen dramatic transformations of their primary sources. We need to begin to lay the groundwork to ensure that they can adequately manage borndigital sources. By doing so, web archives can serve a diverse body of researchers, including historians, digital humanists, and social scientists from political science, sociology, and public health, and beyond the academy, within government, journalism, and public advocacy. It is clear that new methods of access are necessary. Even using the search portals at WebArchives.ca, the British Library, or the BnF

170

History in the Age of Abundance?

requires some knowledge: knowledge of how crawls were conducted, the underlying infrastructure that began to generate results, and the legal frameworks that were at play in even collecting this material. Were robots.txt obeyed? How are sites ranked and presented to you? Are you playing the big role in this research, or is the black box of the algorithm at play? And if the latter, who designed the algorithm, and how? Tools like the Archives Unleashed Toolkit, Shine, and others aim to democratize this process and bring historians into the fold. By letting users run simple scripts, adjusting the code that they are running, and exploring results, the complexity of the underlying archive can be harnessed and turned into an advantage. Communities can be found, hyperlinks traced, and full-text extracted and mined. The technology is difficult, but can be learned by anyone. The true barrier, as discussed in the conclusion, is one of culture. Libraries have recognized the importance of web archives, with legal deposit and other access changes. Now historians must do their part. Within the broader movement towards the digital humanities, historians should be leading the charge; indeed, we may have the most at stake in this transition. Yet historians are largely laggards who have not been the most active in conversations about digital heritage, archival practice, and beyond. The next time that a computer scientist approaches a group of historians, the problem of access will be resolved when the historian is receptive: open to teamwork, mutual cooperation, sharing of research findings, and willing to accept true interdisciplinary work. In those meetings of minds might lie the true solution to the problem of access. Taking this caution to heed, the next chapter moves into one of the largest domain-level repositories of everyday thoughts, beliefs, and behaviour: the GeoCities archive, an amalgamation of some thirty-eight million webpages generated by everyday people between 1994 and 2009. What can we learn from this database if we start with a traditional historical research question: how was community lived and enacted on the web during this period? And just as importantly, in addition to discussing how these sources can be technically accessed, the chapter explores the ethical conundrums that researchers quickly face. With the raw material and technical components in mind, we are now positioned to ask serious questions of this rich material and begin to tap into the potential of web archives.



WELCOME TO GEOCITIES, POPULATION SEVEN MILLION

GeoCities was a virtual boom town. Founded in late 1994, the service that allowed web users to quickly and easily build their own websites for free saw a meteoric increase in the number of users. People wanted to not only surf the web for other people’s content but add their voices to the mix as well. By mid-1998, as excitement about the web percolated, 18,000 new users a day were creating accounts and websites.1 In 1999 it was the third most visited website on the entire web, eventually encompassing at least 186 million “documents,” spread across seven or so million webpages.2 Today, however, a visitor to GeoCities.com finds only an advertisement for Yahoo!’s website hosting service. No trace of the once vibrant community remains at that site, nor has it since its 2009 closure. The GeoCities web archive allows us to explore a virtual ruin, and all of the technical and ethical quandaries that such a place presents for historical research. Within these archives, we can see evidence of a popular online community that was born, thrived, and declined between 1994 and 2009, as well as detect a sense of the ethical quagmire web archive researchers now find themselves in when studying our recent digital past. This chapter uses GeoCities as a case study of the technical, humanistic, and ethical issues in web archiving. What can we learn from these archives? What should we try to learn? In what areas should we practise caution and perhaps retrain inquiry, given the ethical issues at stake? Can we – should we? – realize the dream of a more democratic history through the TB s of GeoCities, now housed in WARC files by the Internet Archive or the Archive Team torrent? As the web was propelled into the mainstream of North American society by the mid- to late 1990s, GeoCities was one of the first sites

172

History in the Age of Abundance?

to welcome users and facilitate their first steps into digital reality. For the first time, users could create their own webpages without needing programming skills or knowledge of FTP , Telnet, HTML , Usenet, and so forth. It was in places like GeoCities where users became part of virtual communities, held together by volunteers, neighbourhood watches, webrings, and guest books. These methods, grounded in rhetoric of both place and community, made the web accessible to millions of people. This thus makes GeoCities an invaluable resource for a historian of the 1990s. Jason Scott, who spearheaded the 2009 effort to preserve GeoCities in its closing moments, is eloquent: To browse among these artifacts is to find a cross-section of humanity. A mother’s emotional memories of the loss of her two year old son, sixteen years earlier. A self-described alien abductee’s recounting of 25 years of unusual memories and ufo sightings. A proud owner of a parrot. … At a time when full-color printing for the average person was a dollar-per-printed-page proposition and a pager was the dominant (and expensive) way to be reached anywhere, mid 1990s web pages offered both a worldwide audience and a near-unlimited palette of possibility. It is not unreasonable to say that a person putting up a web page might have a farther reach and greater potential audience than anyone in the history of their genetic line.3 As GeoCities exploded in size – eventually growing to seven million users who created hundreds of thousands of documents – it is today an incomparable source of the thoughts, recollections, arguments, reflections, and beyond of everyday people between 1994 and 2009. This chapter uses GeoCities as a case study of both why historians should use web archives – it is an exemplar case of the changing scope of information produced by everyday people – and how they can access them. It asks what we can learn, and what we need to consider, as we explore the now archived GeoStreets and GeoAvenues of GeoCities, from the children-focused zone of the EnchantedForest to the festiveness of BourbonStreet. It was here that many users teased out their relationship with the web, building a foundation for the blogging and social networking explosion of the 2000s. It was a virtual city that users built together. Our modern relationship with communications technology

Welcome to GeoCities, Population Seven Million

173

owes a debt of gratitude to this web service. In it, we can see the issues that will confront historians as they begin to explore web archives of everyday people. After discussing GeoCities and its archives, we conclude with the more general ethical issues created by web archives like these – they offer a lot of new voices but need to be used with exceptional care.

The Voices of Millions: What was GeoCities? To understand GeoCities, we need to situate it in the broader scope of networked communication between everyday internet users and enthusiasts. In this sense, it is part of a continuity that stretches back to the 1978 origins of bulletin board systems (BBS s), grassroots networks run by people out of their basements, garages, and living rooms.4 People would run their own BBS s, and others could call in to play games, share stories, swap messages, and socialize from the comfort of their homes (indeed, the idea for BBS s was born in a Chicago snowstorm). Long-distance calling fees meant that these were largely local networks, with long-distance networking really being available only to those who were either quite wealthy or the small band of hackers who had mastered the ability to “phreak,” or manipulate, the long-distance telephone system. By the mid- to late 1980s BBS s were becoming more accessible, with articles such as a 1986 spread in the Toronto Star (Canada’s largest circulation newspaper) that sought to give a step-by-step introduction to them.5 Similar networks appeared by the 1980s, known as FreeNets – beginning with a 1984 Cleveland BBS – which developed into a sense that internet access should be community based and geographically framed as such. Those who had been able to access the internet or other networked communications had been largely doing so through a network generally run by somebody in their community and on a medium that made it relatively easy to contribute, as opposed to just consume. Indeed, the infrastructure of BBS s was explicitly geared towards user-generated content: the system was the board, and the users contributed the “notes” that would adorn it. By the 1990s and the early days of the web, things had changed. While the web was a read-write medium (i.e., designed to host content as well as let people create it), hosting a website was far more difficult than posting content on a BBS . Designing your own site required specialized knowledge of HTML , as well as the technical issues in actually

174

History in the Age of Abundance?

operating your own web server to host a site. It was free services like GeoCities that truly democratized online access, by building on the contributions that BBS s had made. GeoCities was not the only service that helped bring people online – its competitors tripod.com (late 1994) and Angelfire.com (1996) were similar – but it was by far the largest. What would eventually grow to be a place for seven million accounts had relatively mundane beginnings in November 1994. In the southern California city of Beverly Hills, a web server flickered to life. David Bohnett, fresh from the software industry and heartbroken from the recent death of his partner, launched Beverly Hills Internet, a company that would let users create their own free sites. The company’s name, grounded in geography, spoke to the desire for community that lay at the heart of the undertaking – and continued when Beverly Hills Internet changed its name to GeoCities shortly after its founding. As Bohnett recalled in 2012, “We all have something to share with each other, which enriches both their lives and ours as well.”6 Some of the impetus for this service came from Bohnett’s own background. As he told the New York Times in 1998, a lot of what he did had “to do with being gay and part of a minority that had not had an equal voice in society.”7 Compared to competitors like Tripod and Angelfire, GeoCities’ unique focus on community gave it a distinctive early web presence. The community metaphor resonated, given the personal imperative of Bohnett and other GeoCities founders to bring people together. Moreover, the idea of place fit with the then dominant “cyberspace” metaphor. Users wanted to situate themselves in a place. The web was the new “frontier,” sermonized by Wired magazine and exalted by technological utopians across the political spectrum, from Newt Gingrich to anarchical socialists to the Electronic Frontier Foundation.8 The geographical community metaphor of GeoCities connected with a public that was conditioned to think of the web as an ever-expanding geographical space. A brief description of how GeoCities appeared to users can help put it into relief. Users would surf to GeoCities.com and could either browse the website to find websites that they might want to read, or decide to create their own site. The “geo” part of GeoCities would be immediately evident: sites were clustered in “neighbourhoods,” or topic-based sections of the site. One might focus on gay and lesbian issues, another on websites by and for children, another on family-related sites, and so forth. These were presented as maps: the EnchantedForest would have a

Welcome to GeoCities, Population Seven Million

175

series of huts, for example, that a user could move into – each assigned with a four-digit number. Within that four-digit address, users could create as many pages as they could fit within their size cap of one or two megabytes (depending on when they signed up). As of late 1996, there were twenty-nine neighbourhoods. The read-write nature of GeoCities struck a chord with web users. Five weeks after GeoCities opened, there had been over 600,000 “hits,” and by late 1995 there were already 1,400 websites.9 The first 10,000 user milestone was reached in October 1995, the first 100,000 by August of 1996, and the first million users by October 1997. By mid-1998 the site was easily one of the top ten draws on the web and was growing by 18,000 new users a day.10 The growth meant that by November 1995 GeoCities could move out of Bohnett’s apartment and into a formal office with three paid employees. The media began to notice. Echoing the corporation’s marketing rhetoric, coverage emphasized place and spatial metaphors. “What if you want to do more than just look at live images from Hollywood?” asked Roger Ridey in the English newspaper the Independent. “What if you want to live there? Now you can.”11 The web was not just something to be consumed passively, it was something that you could contribute to. For everyday users, the dream of the read-write web articulated by pioneers like Berners-Lee was realized through services like this. In GeoCities’ exponential growth lay the seeds of its undoing. Much of this experiment in user-generated community would come to an end in 1999, when Yahoo! corporation purchased GeoCities. Tom Evans, publisher of US News & World Report, was hired by GeoCities in April 1998 to make it more user friendly and turn it into a lucrative business, in anticipation of becoming a publicly traded corporation. When the IPO came in August 1998, near the height of the dot-com boom, the markets responded very positively: the initial offer of $17 per share soon soared to the $40s within a few months. Yahoo!, which had been establishing its own Clubs section to create online community, began inquiring about acquiring GeoCities in the fall of 1998. Formal meetings began in January. The deal came together quickly and was announced on 27 January. GeoCities was purchased by Yahoo! for $4.6 billion, or $117 a share. As John Motavalli, author of Bamboozled at the Revolution, notes, “At the time of the Yahoo! deal, GeoCities was getting 55 million page views a day, and it was the number-three site, according to Media

176

History in the Age of Abundance?

Metrix. Yahoo! was number one, and AOL was number two. GeoCities called the final sale price a ‘kingmaker premium.’”12 Changes to GeoCities after Yahoo!’s acquisition were dramatic, and much of the analysis in this chapter accordingly ends before that acquisition. Yahoo! changed GeoCities’ corporate culture, according to Motavalli. The exact reasons for GeoCities’ decline are outside of the scope of this book, and this chapter is concerned primarily with the infrastructure of GeoCities constructed between 1994 and 1999.

Moving into GeoCities: Reconstructing First Steps The first step when coming to grips with GeoCities is to think of it from the perspective of potential users. What was it like when they showed up at GeoCities.com, interested in creating a new website? GeoCities was an experiment in accessible, user-generated content. Compared to earlier methods of publishing content – from posting on Usenet, to creating your own web server – this was easy. Users could fill out a straightforward template, or a series of forms, make a few clicks here and there, all without worrying about credit card payments or maintenance. First, the elephant in the room (at least to those readers who remember GeoCities or have encountered such pages elsewhere). The pages’ reputation unfairly precedes them. GeoCities pages were not works of art, especially by our standards: they were clunky, text heavy, with repetitive backgrounds and garish clipart. Yet it is important to remember that a GeoCities page offered a powerful publishing platform, the ability to reach a large audience, and in many ways helped realize the original vision of a read-write web. While the first browser that Tim Berners-Lee developed and used had been both a browser and an editor, subsequent browsers had depreciated the simultaneous ability to contribute and write to the web. This meant that, according to Berners-Lee, “editing webpages became difficult and complicated for people.”13 Before Blogger, Facebook, Twitter, and Wikipedia, GeoCities offered some of that functionality. Whereas Facebook and Twitter offer subdued, templated designs, the user flexibility offered by GeoCities resulted in garish designs but also the realization of nearly infinite customization. Figure 5.1 provides a high-profile example of one such website, XTM ’s Homepage, the winner of the Homesteader of the Year award (discussed later in this chapter).

Welcome to GeoCities, Population Seven Million

177

5.1 Winning page for the GeoCities “Homesteader of the Year” competition.

Anybody could freely create a GeoCities page with an initial size limit of one megabyte. The platform’s free nature helped break the potentially vicious cycle that might mitigate against widespread web adoption: if people were going to use the web, they had to have meaningful content to view. But if there was going to be meaningful content, people needed to have the opportunity, space, and tools to create it. “Locations on the Internet become easier to relate to when they are rich with content,” explained GeoCities management in 1996 of their decision to develop the free program. They continued by noting that GeoCities was “just the first step in building World Wide Web based communities that are destined to become a vital part of the NET  … societies of the New Frontier.”14 The site offered a lot to users: early website creators, free bandwidth, and sufficient disk space. Most importantly, GeoCities offered a central well-trafficked position on the web that would help sites get visitors – the site had a high profile in web portals and searches, meaning that many users went on GeoCities.com to look for content. The connective tissue provided by GeoCities could help people find each other, making the wide-open scale of the web just a bit more human.

178

History in the Age of Abundance?

When visitors landed on the GeoCities home page, they entered the site through a neighbourhood structure, as everyone who created a free GeoCities page had to belong to a neighbourhood. I will discuss these specifically more in depth, but in brief the process involved sifting through a list, reading up on what sorts of individual GeoCities pages would be welcome within them (i.e., the Area 51 neighbourhood included “Fanzines for Star Trek, the X-Files, the Twilight Zone,” amongst other examples), to give a sense of the neighbourhood’s flavour to potential settlers. Users then created a page. The process of creating a site presents interesting methodological problems to the researcher studying GeoCities through its web archive. For future GeoCitizens building a site, they could use a simple templatedriven creator, or if they knew HTML they could use the advanced editor to create a more sophisticated site. More significantly, the network effects inherent in GeoCities manifested themselves to users even at this early stage. Users who wanted to learn how to use HTML were sent to other users to learn the basics, specifically to the http://GeoCities.com/ Athens/2090 site (hereafter, sites will be referred to in shorthand by their neighbourhood and address). Athens/2090, the “Home Pages’ Home Page,” provided straightforward instructions on how to code basic HTML , as well as making astute comparison to the then-dominant WordPerfect word processing program, which also used a markup format.15 From this we can begin to see the degree to which GeoCities presented itself as an alternative to other web development options at the tim, and fulfilled some of the web’s original vision – albeit in a corporate, advertisement-supported space. Millions of users came through these gates and created GeoCities. From the scant traces of the past that remain, we can begin to see the broad contours of a community emerge.

Using Web Archives to Explore a Specific Research Question: Online Communities Exploring a collection of dead websites can be eerie. Websites are frozen in time: old guest books, dead links, static hit counters, animated GIF s that have been long pulled from the live web. Yet in these frozen artifacts are keys to contemporary historical research. Take one question for GeoCities: was it actually a community?16 It was a central aspect of the corporation’s marketing rhetoric and stated goals, but was it actually

Welcome to GeoCities, Population Seven Million

179

achieved? By endeavouring to explore this question, we can begin to see some of our distant reading mechanisms at work. Community is difficult to define. Howard Rheingold advanced a definition of virtual communities in his 2000 The Virtual Community as “social aggregations that emerge from the Net when enough people carry on those public discussions long enough, with sufficient human feeling, to form webs of personal relationships in cyberspace.”17 He described in particular the emergence of a gift-based economy, where people give their time without direct reward (although, perhaps, down the road somebody will help them out). It is not enough to simply declare that community exists, in a website splash page or a press release, it must be enacted, received, and perceived as such by members. In short, community requires effort. As Stephen Doheney-Farina has noted, “A community is bound by place, which always includes complex social and environmental necessities. It is not something you can easily join. You can’t subscribe to a community as you subscribe to a discussion group of the net. It must be lived. It is entwined, contradictory, and involves all our sense.”18 The ease of joining GeoCities has led some scholars to dismiss the notion of its being a community out of hand. Christos J.P. Moschovitis declares that simply offering web space and email was insufficient, noting that many “members signed up for the free web space, rather than to make new friends.”19 Certainly many GeoCities users did just that: signed up and created websites without interacting with their digital neighbours. Some first-hand evidence bears this isolation out. In salon.com, Stephanie Zacharek describes arriving at her new online home in 1999: “Welcome to my home at GeoCities. I live at 9258 Fashion Avenue, in a neighborhood appropriately called Salon. I moved in here earlier last week because I was told that ‘Design, Beauty and Glamour are the toast of Fashion Avenue,’ but so far there’s not a whiff of glamour to be seen – my neighborhood is a ghost town of hundreds of empty pages, halfstarted Web sites and vacant lots; only a handful of the members seem to be at all interested in fashion.”20 While Zacharek was a bit late for the heyday of community, her point is valid. Many users never did get past the “Under Construction” stage of a brand-new site, as Jason Scott’s collection of construction clipart aptly reveals.21

180

History in the Age of Abundance?

Yet there are clear traces of virtual community. New users were generously welcomed by existing users, who aimed to facilitate their first steps onto the web: sending a welcome wagon to help them learn the ropes of HTML , maintaining their site, and in some cases “stopping by” to send an email or leave a guest book comment to give them a welcoming “badge” image that they could embed on their website – and also to make sure that they knew about HTML resources that were available for their consultation. In that spirit, users were encouraged to reach out to their new neighbours and were expected to keep up appearances by continually improving their website. None of this was required. Just as in a real-world community, many people did not throw themselves into associational life. For those who wanted community, however, GeoCities was a place to find it. The consciously taken decision to develop GeoCities along the neighbourhood model all helped to create a sense of virtual community. As Cliff Figallo noted in his 1998 book Hosting Web Communities, “One outstanding example of users being put at the center of community creation and growth is GeoCities. … In fact, their entire concept revolves around providing neighborhoods of common interest in which members can build and locate their homepages. … By aggregating member presence according to their interests, they also please advertisers and bolster their revenue potential.”22 This was not hyperbole: community would be found at the neighbourhood level, from the family-focused Heartland, to the LGBT haven of WestHollywood, or the philosophical Athens. A new arrival would generally be met by a local “community leader,” another user who volunteered – in exchange for some relatively minor perks – to help others learn basic web design. These new users could apply to their community leaders for site reviews, which could lead to awards. They would be encouraged to link with fellow sites using webrings, or even in some cases pay an active role in the governance of GeoCities itself. This community structure largely endured between 1995 and 1999, when Yahoo! acquired GeoCities and rearranged the community structure; users moved towards “vanity” websites (such as http://geocities.com/ ~ianmilligan) rather than neighbourhoods. But in those early years GeoCities sought to be a new kind of web host. It was a place where many learned how to make first websites, with friendly neighbours and helpful advice. The web might have seemed infinitely bigger than the BBS s or local chat rooms of yesterday, but that did not mean it could not be home.

Welcome to GeoCities, Population Seven Million

181

The Homesteaders and Neighbourhoods of GeoCities The central metaphor of GeoCities was homesteading. As Creating GeoCities Websites – a print book that explained the process of creating websites in 1999 – noted, members were not “simply customers; they’re Homesteaders. Because GeoCities is more a community than simply a place to store a few Webpages, the goal is to make all members feel at home.”23 “Homesteading” was a conscious metaphor, keeping with the then-frontier spirit of the web.24 GeoCities defined a homestead in a four-fold way: “1. a dwelling with its land and buildings occupied by the owner as a home. 2. any dwelling with its land and buildings where a family makes its home. – v.t. 3. to acquire or settle on (land) as a homestead. – v.i. 4. to acquire or settle on a homestead.”25 Homesteads were grouped in neighbourhoods. This fit well with Bohnett’s view, as well as his later co-lead John Rezner, who saw “neighbourhoods, and the people that live in them, [as providing] the foundation of community.”26 People with similar interests would assemble together in a circumscribed area. The neighbourhood system was associated with the rhetoric of community in news coverage and press releases. When the neighbourhood system was dismantled, references to community similarly quickly disappeared. As Olia Lialina, a professor of New Media and co-author of the One Terabyte of Kilobyte Age blog has noted, the end of neighbourhoods after Yahoo!’s purchase represented the end: “Users became isolated.”27 Within a few years, by 2003, new users were asked what topic they were interested in – from alternative lifestyles, computers, the military, pets, romance, science, or women’s issues – but not for community, but “to determine the type of ads that will appear on your site.”28 In short, post-1999 GeoCities was very different. We have already encountered a few of these neighbourhoods. Those writing about “education, literature, poetry, philosophy” would be encouraged to settle in Athens; political wonks to CapitolHill; small businesspeople or those working from home in Eureka; and beyond. Some neighbourhoods came with restrictions and explicit guidance, such as the protective EnchantedForest for children. Others were more expansive in scope, such as the largest neighbourhood, “Heartland,” which focused on “families, pets, hometown values.” Each enjoined users to settle in their place and gave lists of example topics that users could

182

History in the Age of Abundance?

write about, as well as other example GeoCities websites that they might be interested in reading themselves. Popular neighbourhoods began to fill up quickly, necessitating sprawl into the “suburbs”: users soon had to move to Heartland/Plains or Heartland/Hills. Each neighbourhood was limited to 9,000 sites (1,000–9,999); Heartland would eventually have forty-one suburbs by 1999. Each neighbourhood had its own community leaders, guidelines, webrings, and so forth. Content standards were maintained by the “Neighbourhood Watch,” which was centrally managed by GeoCities: “If you notice any of your neighbors not following our policies, please let us know,” the administrators asked.29 These policies, which were identical for all neighbourhoods except the more restrictive child-focused EnchantedForest, prohibited offensive pages (bigotry, racism, hatred, or profanity), illegal activity, hosting commercial services, providing pornography, creating hidden or password-protected pages, infringing copyright, or linking to material that was prohibited by the rules as well.30 Belonging to a neighbourhood was mandatory, unless you were one of the few users who paid for their site. This feature was baked into the fundamental structure of GeoCities. These addresses comprised the spatial dimension of GeoCities. Upon joining, new users received an encouraging invitation that echoed the homesteading idea. “In order to keep the neighbourhoods a lively and enjoyable place, we would like you to move in within a week after you have received your password and confirmation E-mail,” the administrators noted. “Your neighbors would prefer to live next door to someone who has moved in rather than a vacant lot.”31 Money could buy more storage, but not a second address. There was only one way to do that: continual improvement. “Part of your responsibility as a resident of GeoCities is to keep your home page fresh and exciting,” the webpage explained to those seeking a second site. “If your original page is kept current, and is consistent with the theme of the neighborhood, you may apply for a second GeoCities address.”32 In that case, all you would need was a second email address and your digital real estate holdings could expand. John Logie’s provocative comparison of GeoCities homesteading with that of the American western frontier helps us understand why the site linked expansion to user improvement rather than user payment. His Rhetoric Society Quarterly article articulates the popular mythology

Welcome to GeoCities, Population Seven Million

183

of the American 1862 Homestead Act: “emblematic of a kind of pioneer grittiness, best exemplified by blurry photographic images of settlers standing proudly before their sod huts, fighting, and ostensibly winning a battle against nature.”33 In this mythology – not to be confused with the historical narrative itself – pioneers opened up the West, opening up territory and improving it for the rest of the (white) country. A similar idea was at work within GeoCities: users would develop sites, drive traffic to their territory, making it a hub for increased web traffic. Users were not encouraged to be voyeurs. On the contrary, GeoCities depended upon the industry of users and their generation of content; new content is what brought people onto the site and kept them there. Even although advertisements were not initially embedded in personal sites, GeoCities was able to put them into their connective tissue: neighbourhood pages, administrative sites, and so forth.

Filter, Aggregate, Analyze, and Visualize: A Process Cycle for Working with Web Archives Our work with web archives, showcased at events like the Archives Unleashed hackathons as well as other events such as the Society of American Archivists, the International Internet Preservation Consortium annual meetings, and specialist conferences on web archives and history led our project team to develop a process cycle for working with web archives like GeoCities. This was subsequently published in the ACM Journal of Computing and Cultural Heritage.34 The lead author and architect of this was my computer science colleague Jimmy Lin, who as an outsider to the historical profession was able to pull his gaze back and begin to systematically see the steps that many of us historians took as implicit. The model, the filter-aggregate-analyze-visualize, or FAAV cycle, provides a process model for working with web archives at scale. The FAAV cycle begins with a question that a historian might want to ask from a web archive and then moves through the four stages. For example, was there community within GeoCities? Given the scale of most web archives, the first step is filtering out the information that is of interest: perhaps a given range of crawl dates (sites crawled in 2015, for example), pages from a particular domain (information from the Liberal Party of Canada at liberal.ca); it could also be content-based, such as pages that contain the phrase climate change or

184

History in the Age of Abundance?

only those that link to the United Nations at un.org. All of these could be combined (i.e., Liberal.ca pages that link to un.org that talk about climate change and that were crawled in 2015). In the case of GeoCities, we begin with 186 million pages. Our first step is to filter it down, in the first case to the several million pages that make up the child-focused EnchantedForest. We do this using a URL filter (filtering on only those that contain the string http://geocities.com/EnchantedForest). Second, researchers then want to analyze the pages that they filtered out. This is still largely a computational step. They might decide to extract all outbound links from the pages they have found, with an eye to tracking influencers – or seeing how they changed over time (i.e., how often a given political party linked to certain kinds of news sites, or vice versa). In this case, we might be exploring how often certain pages mentioned climate change over time, or how their link patterns changed – from linking to certain academics for evidence, or news sites, or none at all. At this stage, a researcher is often interested in the text of a page: extracting named entities, such as people, organizations, or places; calculating the underlying emotional sentiment of documents; extracting plain text to read later on; or trying to establish the various topics at play in a given part of a web archive. In our running GeoCities example, we then decide to analyze this web archive by running the Archives Unleashed Toolkit to extract the plain text (just the text that is visible), the hyperlink network, and the images that appear on the page. Third, once an analysis has been run, scholars may wish to aggregate or summarize the last step. Much of this is counting: how often were links made between given sites? How many times was an entity named? What were the maximums? Minimums? Averages? For GeoCities, we take the plain text and decide to see how often given topics appear, we take the hyperlinks and see what other websites these pages tended to link to, and we take the images and see which ones tended to appear the most. Finally, scholars often want to visualize results: tables, network graphs, and so forth. The fruits of this exploration are displayed in the analysis that follows. Topics are arranged in a table, hyperlinks are laid out in a graph, and images are arranged in plots and tools. To make this all a bit more tangible, let us see it all in action.

Welcome to GeoCities, Population Seven Million

185

Distantly Reading Community in Web Archived Neighbourhoods Exploring the digital ruins of GeoCities today presents unique challenges for historians. How can you extract meaningful historical information from such a massive dataset? These strategies for doing this exercise range from counting words (i.e., how often does the word Clinton appears versus the word Bush), which can be useful about the relative frequency of a given word, but can occlude the context in which a word appears, to more sophisticated approaches such as topic modelling. We can also turn to the metadata analysis discussed in the previous chapter, exploring websites through hyperlinks. In this section, I begin with distant text analysis before moving into network analysis. One example I used in the previous chapter was topic modelling the pages that connected to the Liberal Party of Canada and the Conservative Party of Canada – taking the text of pages, finding words that occurred frequently in proximity to each other (either in the same sentence or nearby), and designating them as “topics.” When it came to studying the idea of community in GeoCities, I wondered if the most popular or trafficked topics within each neighbourhood matched the vision that GeoCities had for them. It turns out that they did. Topics appeared in the neighbourhoods that they mostly “should” be appearing in. Two examples from popular culture help us understand. In the EnchantedForest, the area generally designed to be websites made by children and their parents for children and other parents, we have a topic consisting in part of “pooh friends tigger winnie christopher color piglet” – the main characters of A.A. Milne’s Winnie-the-Pooh (Eeyore makes an appearance down the list as well, although Rabbit is absent). In Hollywood we find the characters from Friends: “joey rachel ross monica chandler.” Neighbourhoods were being used by people in accordance with GeoCities’ vision, which is itself meaningful. Exploring the topics, however, we can see that this was not always the case. A neighbourhood such as the EnchantedForest remained child-focused, in part the result of efforts of community leaders responding to fears about online child exploitation. Pentagon, on the other hand, expanded beyond its initial aim of being a way to connect widely deployed and constantly moving military members into a hub of military historical, activism, and political discussion. The fashion hub FashionAvenue became more than a place for discussion about style; it

186

History in the Age of Abundance?

was also a place for beauty pageants and a place to share images of favourite television and movie stars. Heartland advanced a particular vision of “family” focused on the Christian faith, internal issues, and genealogy. If we can distantly read textual content through topic modelling, we can also distantly read web content by looking at many images.35 Images can thus give us a sense of how the neighbourhoods worked. Drawing on the methodologies of Lev Manovich, I extracted every image from each neighbourhood and arranged them as montages.36 These arrangements allow us to see the overall contours of a community, although they need to be used with some caution.37 For example, by looking at all images from the child-focused EnchantedForest, we can quickly see that most of them are cartoon characters. At a glance, we get a sense of the feel of the neighbourhood without needing to visit page by page. We can also get feel for how people borrowed and adapted these images across the EnchantedForest. For example, we can begin to find how often the exact same image or animated GIF was distributed across the neighbourhood. From this, I saw that an animated GIF of Tigger, from the animated Winnie the Pooh series, was the eleventh most popular image in the EnchantedForest, appearing forty-eight times. This sense of borrowing and cohesiveness appears across the many GeoCities neighbourhoods. Popular culture communities contain grabs from popular television programs and movies; Athens, for example, contains a disproportionate number of black-and-white images – upon examination, historical figures, indicative of the educational underpinnings of the community. From both image analysis and topic modelling, surprises are few and far between. We generally find what we would expect to find. With so many pages, the trick for the historian is determining which ones to look at. A reasonable starting point is to find the pages that most users seemed to be looking at themselves, on the basis of their linking patterns. To do so in the EnchantedForest gives us a sense of overall community contours, as well as the major themes that lie in its rise and fall. Many GeoCities users wanted to be discovered: to have other users find their site and engage with content. Testament to this intent was the near-ubiquity of guest books and hit counters. In the late 1990s, inclusion in search engines was not automatic: many required that you fill out a form to ensure that you were properly discovered. To find content, then, many users relied upon links: from guest books, from awards that users gave to each other, or from webrings that they might have been members of.

Welcome to GeoCities, Population Seven Million Table 5.1

187

Link relationships in one GeoCities neighbourhood

Origin

Destination

Number of links

http://geocities.com/ EnchantedForest/Meadow/1134

http://www.geocities.com/ EnchantedForest/1004

83

http://geocities.com/Area51/ Stargate/1357

http://www.geocities.com/Area51/ EnchantedForest/4213

33

http://geocities.com/Eureka/1309

http://www.geocities.com/ EnchantedForest/Tower/7555

27

Accordingly, we then extracted all links from the EnchantedForest. Table 5.1 shows an example of the link relationships that we found. It demonstrates that on all pages in EnchantedForest/Meadow/1134 – the EnchantedForest Meadow Community Center (including numerous subpages such as the Meadow newsletter, the webring, the community leader register, etc.) – there were eighty-three links to the EnchantedForest/1004 site – the main community center. “Community centers” were major hubs of activity within GeoCities, as they sought to bring a neighbourhood together: a place to share information about sites, highlight particularly successful pages, promote “awards,” perhaps run a newspaper – in short, a place for users to gather within particular parts of the site. Meadow was a community case for the “Meadow” suburb: the few thousand sites that made up all the pages within that part of GeoCities. Table 5.1 shows how community centres were in communication with each other: the main one for the entire EnchantedForest heavily linking out to its suburban offshoots. Scale that up across all of GeoCities, and we can begin to see which EnchantedForest websites had the most links to them and the most links from them, and which sites would most likely have been stumbled upon during random adventures through the site. We cannot rely simply on the aggregate numbers of inbound links as a proxy of interest because we must remain aware of the problem of link farms (which we discussed earlier in this book) – early search engines often relied excessively on link patterns to privilege certain sites over others, meaning some sites gamed the system by linking to others hundreds of times. An alternative metric, PageRank, which lies at the heart of Google’s search engine today, can help us find useful sites. Briefly explained, the PageRank value is based upon links to other sites, each of which can be considered a “vote of confidence” in that site. However, these votes

188

History in the Age of Abundance?

5.2 Link structure of the GeoCities EnchantedForest.

are weighted according to the PageRank of the site that is issuing the link, helping us get around the issue of sites that give out lots of links but have no merit in and of themselves. The EnchantedForest was an interconnected community. While tables, like table 5.1, can help us make sense of individual sites, it can be useful to view hyperlink structures as networks as in figure 5.2. The figure shows all sites within the EnchantedForest and draws lines for each hyperlink between them. A “hairball,” like in figure 5.2, shows how interconnected the network is: sites had lots of links to each other, and several sites were connected to many other ones. Nodes – websites – are sized according to their PageRank, and above we can see a few sites with high PageRank scores. This helps us explore some of the most popular sites. They ran the gamut from those aimed at children, those that built community either through services or award provision, and – of course – those written by children themselves. What they generally shared was a tight connection into the GeoCities ecosystem: they had received numerous awards, shared avatars with others, became featured pages, and generally threw themselves into the vibrant community that was forming.

Welcome to GeoCities, Population Seven Million

189

5.3 EnchantedForest/Glade/3891: The highest ranked site.

Some of the most prominent sites played connective roles within GeoCities: lots of sites linked to them, and they in turn linked to many others. The site with the highest PageRank above, EnchantedForest/ Glade/3891 (figure 5.3), was the EnchantedForest Awards Page.38 It governed the main award distributed in the neighbourhood, the EnchantedForest Award of Excellence (AOE ). A semi-official website, operating with the imprint of community leaders, Glade/3891 ran an application process for those who wanted the award. In practice, these sites then reached out to successful sites and gave them an image, or badge, that they could host on their own pages. Each of these badges, of course, contained a link back to the awarder – creating lots of hyperlinks, which then turns up in our PageRank analysis! We return to some of these processes shortly. The combination of topic modelling, image analysis, and network analysis suggests a few tentative findings. First, the neighbourhood system roughly worked – the children’s neighbourhood did concern itself with children, Heartland concerned itself with family topics (albeit more narrowly defined) such as genealogy and Christian values, and the

190

History in the Age of Abundance?

Pentagon became dominated by military topics. Second, such things did not happen by accident. In the EnchantedForest, for example, the highest PageRank-ranked sites had to do with community: “community leaders,” awards given out. In short, community appears to have been deliberately and physically enacted. Let’s now explore how that came to be.

The Peer-Driven Community Glue of GeoCities: Leaders, Awards, Guest Books, and Rings This GeoCities community was made possible through several means, uncovered through the aforementioned techniques. These ranged from the volunteers who welcomed new visitors, to guest books and webrings that stitched sites together, resulting in a successful early online community for those who partook. The first method by which community was created was through the aptly named “community leaders.” Each was generally assigned a block of addresses to steward. Some neighbourhoods assigned leaders on the basis of their addresses. For example, if in March 1997 you resided in the 2650–2999 block of the Heartland neighbourhood, your leader would be “Alison (AKA Alaithea),” who was an expert in a host of things ranging from HTML to Microsoft’s Internet Explorer.39 Her own website provided information on “color, layout, navigation, graphics & more” for webpages, with sensible advice on how to create an attractive website (with still valid advice on the ideal size of text blocks and limiting length of pages). She also provided galleries of attractive backgrounds, even allowing dynamic previews for your own homepage.40 She was the model of a community leader: helpful, generous, accessible, and welcoming. Other neighbourhoods operated on an “at large” model: your street did not have dedicated people but was instead served by a general pool of leaders. Much of Athens, for example, operated on this model.41 As a group, these leaders offered help with basic HTML and design, and anyone with complaints about potentially objectionable content within a GeoCities could notify them before escalating it to the GeoCities administrative team. And there were a lot of them: using information from two Internet Archive scrapes in 1996 and 1997 and pulling out individual email addresses of each leader, we can see the growth of the program. On 19 December 1996 there were 414 community leaders throughout GeoCities; by 2 July 1997 there were 1,040.

Welcome to GeoCities, Population Seven Million

191

In the EnchantedForest we find that, for many users, the neighbourhood’s “community center” would be their first stop, and it is here that we can begin to see what made the community function effectively. The main community centre was located at EnchantedForest/1004 and provided an introduction to assuage parents worried about the wide-open cyber frontier that they were now facing:42 “Welcome to the EnchantedForest, home of the littlest GeoCitizens and some of the best homepages. This is a neighborhood for pages by kids, for kids – from animals to zephyrs, the Forest has everything that a young mind could need. This neighborhood is safe for anyone to wander through, and it’s safe because of the dedicated Community Leaders who so generously volunteer their time to keeping it safe.”43 They provided further instructions for parents – mostly cautionary, such as monitoring what their children put online, particularly too much personal information. Each suburb had their own community centre with similar content.44 Some became hubs for awards and interesting sites for users; others gave “critters” to adopt and emblazon on your page; many had help pages for basic HTML and beyond; and some even ran newsletters and hosted events. As GeoCities bridged the gap between the earlier model of BBS – where you could “yell for SysOp” and actually make the administrator’s computer beep to grab her or his attention for help – and the more open, impersonal world of the web, leaders formed a critical connective tissue. If we download all self-authored descriptions of these 1,040 community leaders in the entirety of GeoCities and pull out keywords, we can get a sense of how they saw their roles and what they offered. Community leaders were there to help users with their pages, offer information, answer questions, engage with people, offer free advice, and do it with a virtual smile (“love,”“willing,”“free,”“feel free,” and “likes,” were common phrases, for example). Beyond providing help, community leaders facilitated connection by playing an integral part in adorning websites with awards; Richard Rogers has termed these an “early form of website analysis,” a way to encourage filling the web’s void.45 Perhaps the best collection of these early awards can be found in GeoCities. Some had official imprints: formal community committees awarded some, such as the Heartland Award of Excellence, voted upon by the volunteer leaders. Criteria included whether or not they adhered to community standards, from

192

History in the Age of Abundance?

displaying multimedia to publishing clearly written text. Community awards also had the ulterior motive of ensuring that sites fitted into the neighbourhood, used efficient and well-written HTML , and merged content with use of JavaScript and multimedia pop-ups.46 Other awards were not official: users exchanged them to help link sites together. These awards were not hidden and were spread throughout GeoCities. You could click on an award to learn more details about it, you could see opportunities to submit or give awards, and in any case the official committees made it clear that potential awards were only a review away. Recipients usually received a graphical badge to adorn their page. These awards made community tangible, an everyday reminder of the webs that tied sites together. They were woven into GeoCities’ fabric, as community leaders were given explicit instructions by site administrators to find the “best sites” to showcase. If awards celebrated the “best” sites and were a way to exchange favours between users, guest books served as an equally important connective tissue between community members. A seemingly omnipresent sight throughout websites of the late 1990s and early 2000s, guest books were an important community-building element for users on the GeoCities platform. They were more than just a way to thank or compliment a particularly useful or enjoyable website: for that, there was email. If that mode of communication occupied the “private” side of the communication spectrum, guest books were not quite fully “public” either. Beyond the omnipresent webpage counter, a small set of digits on GeoCities sites that increased by one every time a visitor arrived, guest books were a prime means of evaluating a site’s reception. They took various shapes and sizes. At a minimum, they were short user-generated snippets: visitors could click on the guest book and be provided with a short form, providing their name, website, email address, physical location, and a few comments. Larger ones took the form of a large questionnaire: favourite animals, colours, relationship memories, and beyond – occasionally a dozen or more questions. Guest books played a critical role in building the GeoCities community. In her study of personal home pages, carried out in 1998, sociologist Katherine Walker placed them within the genre of web self-presentation. Seeing guest books as akin to the webpage site counter, Walker argued that they functioned “as a testament to popularity and a confirmation

Welcome to GeoCities, Population Seven Million

193

that others regard the created page and the identity it represents as worthy.”47 They also played a significant role for the person leaving the comment: “Leaving a message with an address might lead to response not only from the guest book’s owner, but also from others reading the guest book. As such, the audience may potentially receive a greater reward from filling in a guest book than from just sending a private email message. Guest books are a form of role support.”48 Guest book comments often included an invitation to visit one’s own webpage, discussed mutual interests, and provided a public email address to help them build up a network of contacts. Overall, they encouraged and facilitated public engagement. Comments were almost universally positive and personalized. When we run textual analysis on these corpuses, the overwhelmingly most common words were my, you, I, and your, among other such informal pronouns. Great, love, enjoyed, thanks, wonderful, and other hyperbole were common instances of gratitude and expression. People liked to thank each other for their content. In more developed form, some of these guest books resembled elaborate questionnaires. Drawing on selective keywordin-context explorations of guest books, my research found that questions included, in order of popularity: favourite music, favourite animal, favourite book, favourite website, favourite food, favourite singer, favourite TV show, and so forth. Within communities focused on a particular animal, singer, actor, or band, the questions became more focused: your favourite Shania Twain song, Keanu Reeves movie, or dog breed. If GeoCities homesteaders connected through guest books, community leaders, and awards, they also connected through the direct level of website-to-website links. Webrings were a frequent sight during the mid-to-late 1990s, and GeoCities was no exception. They arose as a solution to the problem of finding content on the web. At the time, search engines and directories operated in no small part on a submission basis, meaning that creators of websites were responsible for registering their sites with these services (other sites that were not submitted could be discovered, but common advice given to new site creators was to ensure they were listed!). The process of making sure that one’s website was sufficiently publicized was time-consuming: visiting each service, navigating web forms, providing information, and so forth; indeed, for-profit services emerged to handle the process of registering a website across so many

194

History in the Age of Abundance?

platforms. While web crawlers were in their incipient stages by the late 1990s, they often required – as the popular search engine AltaVista did – that users place specialized “metadata,” or information about their sites, into the HTML code of their webpage. Keywords were essential. This all meant that finding information on the internet circa 1996–99 through mainstream channels was a top-down experience: you navigated hierarchical directories of websites or engaged with search engines optimized for those who had the most technical know-how and access to their HTML headers.49 For somebody just stepping out onto the web, using a GeoCities PageBuilder, making sure that interested parties found your website was a challenge. An alternative model arrived on the scene in late 1994. Denis Howe developed the idea of EUROP a: the Expanding Unidirectional Ring of Pages.50 It was a simple idea: on your own website, you would create a EUROP a subpage. You would copy the HTML template from the EUROP a page, incorporating some of your own material. The person who had joined before you would put a link to your subpage, and after you created your own site, the next person who joined would send you a message with a link to put on your page. In this way, visitors to the EUROP a ring could click their way across the web, learning about the various webpages they were clicking through to consider visiting them more in the future. It laid the groundwork for a more focused and sophisticated concept: the webring.51 Sage Weil further developed Howe’s idea. Weil coded a web server as well as an HTML script that would let users place an “easy-to-see navigation bar, pointing the way to other sites within the cluster” onto their pages; it was released for free to all.52 Some at first saw them as being better than the keyword-based search engines, as they appeared to give users the “opportunity to explore single topic Web sites without having to deal with dead links and non-relevant Web sites that are swept into a keyword search.”53 Each webring had an administrator to doublecheck that a site belonged (i.e., if it was a genealogy-focused webring, one could not add a site about cats), and when approved the site had a section on the bottom encouraging users to link to either the “next” site or the “previous site,” or to an overall index of all the sites that belonged to the ring. If directories were top-down, this was bottom-up: anybody could start a webring, anybody could join one if there was a decent fit, and they formed a new way to connect the World Wide Web.

Welcome to GeoCities, Population Seven Million

195

They proliferated rapidly on GeoCities, with over tens of thousands of mentions on various topics. Visualizing the overlap between clusters is remarkably difficult, because they are ubiquitous: invitations to join, links, and pages that had memberships in upwards of three or four rings. They formed a connective tissue throughout the site, bringing people together within and also drawing outside visitors in and inside visitors out. Here again we see a reminder of the antecedents of Web 2.0: the harnessing of network effects, the primacy of user-generated context, and the rise of social networks. Once the webring was set up, users did not need to go through a central server: they just went from peer to peer. GeoCities homesteaders sought community, and webrings and links were a way to build it. In an era before the ubiquity of Google, webrings in particular were an early way to continue the grassroots promise. You could get a free homestead and join up with another community of homesteaders – connecting to each other freely without having to submit addresses or provide personal information to a large corporation. While far more removed than awards and community leaders, it was another way through which community could be achieved within the virtual web.

The Lives of Others: Ethical Challenges with Web Archives in General Working with content created by millions of private citizens requires care.54 In the last chapter, we talked about public organizations, such as political parties. As publicly facing and often (justifiably) scrutinized organizations, they do not expect privacy on their party’s homepage and do not post anything that they do not want the outside world to see. But what happens when we turn our attention to ordinary people who have posted content online? As with everything I discuss in this section, the line between a public and private individual is, of course, not clear either; is a blogger who writes for prominent sites but has less than 20,000 Twitter followers a public figure?55 As we discuss, context and empathy matter, as working with web archives requires constantly navigating a grey zone. That is not to say that there are no clear considerations at either end of the expectation of privacy spectrum, of course. Where politicians have no expectation of privacy, a fourteen-year-old who wrote her webpage in 1996 likely does. Even the current generation of young people who

196

History in the Age of Abundance?

have grown up in a media savvy and digitally sophisticated environment have a surprising degree of naïveté about how widely accessible their digitally shared material is.56 Addressing this disconnect cannot necessarily happen at the collecting level, because, as we have discussed, web archivists are faced by challenges – the Internet Archive cannot make a page-by-page ethical determination, nor can most collecting librarians. Ethics do need to be considered by researchers themselves, however. This might require keeping in mind the self-examination and practical methods of oral historians and others who have worked outside formal institutional contexts and with human subjects. At the very least, while there are no simple answers, historians working with web archives must consider the key issues of consent and harm. Traditional archives come with restrictions, documented and administered by professionals. Some are restrictions on access, either perpetual or time-delayed, and often require researchers to sign formal agreements as a condition of using a collection. In some cases, they must submit a final thesis or article for review to ensure the protection of personal information. At other times, restrictions are implemented to protect the privacy of third parties. The identities of those who wrote letters to government ministers, for example, are often censored in the Canadian province of Ontario, or the identities of police informants can be protected for perpetuity. While frustrating at times, they are transparent policies that strike a balance between privacy and access. Historians researching in traditional venues are circumscribed by the ethical and professional imperatives and guidance of the archive. As professionals, archivists are sensitive to these issues. Oral historians face a different set of issues, as they operate not in archives or libraries, but in the living rooms, retirement homes, workplaces, and other personal spaces where they meet their interviewees. Given their relationship with human subjects, oral historians have much of their work governed by university or other institutional review boards (IRB s), often influenced by regional or national policy. It is worth noting that as historians rely mostly on published documents, historians’ main interaction with an IRB is for oral histories – unlike in many of the social sciences, where IRB s are routine for most research. IRB s concern themselves with many issues, notably obtaining informed consent from research participants, reducing potential harms, and making clear that participants have the right to withdraw from research studies.

Welcome to GeoCities, Population Seven Million

197

None of these questions are straightforward, but the traditional IRB process does bring with it substantial support. In Canada, for example, oral historians must take a centralized federal granting council course and pass a series of online quizzes to obtain a certificate that can be presented to their home institution for clearance – in addition to IRB – before they set out to interview people. In the United States, oral histories were specifically excluded from “human subject regulation” by the US Department of Health and Human Services only in 2015, although researchers must still often go through their local institutional ethics boards.57 While none of this is simple or even without controversy, it is now well-travelled ground, and there are many websites and resources, and much expertise, to draw on. Accordingly, beginning oral historians will find much support and advice, including edited collections and books on the subject.58 So where do web archives fall between the territory of traditional archives, where researchers come to consult the collections, and the work of oral historians, who are themselves the collectors? The algorithmic nature of web archiving means that a historian cannot necessarily turn to the web archivist for guidance. Web archivists generally do not have direct communication with the donors of archived material, and ethics boards would likely frown on archivists or researchers contacting donors. Oral historians, on the other hand, often do have a direct relationship with their subjects and may ask what they wish to disclose and what they wish to remain private. At the same time, they have access only to what their subjects tell them and cannot “search” them like a collection. The lack of communication with the authors of these archived websites complicates ideas about informed consent. As we have seen, opt-in models have largely failed because they have very low response rates. This means that, for the most part, informed consent from those who wrote websites a decade or two ago is unlikely to be found, and the question about having their material withdrawn from the public web archives is essentially outside a researcher’s purview. Website authors might also Google themselves and find that their old website was used in a published academic paper, at a point in the scholarly cycle when it is too late to retract or remove. The major dimension with which a web archive researcher must be concerned, then, is the question of harm. How can we work with our sources to respect privacy and mitigate potential harms that could come from a historian reading, citing,

198

History in the Age of Abundance?

and writing about this material? As with so many things in historical research, context is king. Web archive researchers can sometimes feel on their own as they begin to navigate the personal information inevitably contained within web archives – absent the institutional supports of the archive or the professional networks of oral historical research. Luckily, they are not alone, as the digital humanities and computational social sciences more generally are encountering these new questions of ethics. While the ethical conversation has been dominated by IRB s, recent currents are emphasizing the use of principles and ethical frameworks rather than simple binary approaches of ethical versus non-ethical. Drawing off the four main principles identified in the 1979 Belmont Report – the framework for working with human subjects that has influenced the Common Rule regulations in the United States – Matthew J. Salganik has provided a useful approach for digital social scientists when working with their online research questions. The four principles, those of respect for persons (informed consent when possible), beneficence (maximize benefits and minimize harms), justice (distribute risk and benefits across social groups), and respect for the law and public interest (not only following rules but also being transparent to society) can help guide researchers.59 Salganik’s three main points of advice to researchers charting these waters are astute: to consider the IRB as a floor, meaning that while it needs to be involved when necessary, approval is a starting point, not the end of weighing ethics; researchers should put themselves in the shoes of their subjects and society; and research ethics is “continuous, not discrete.”60 This idea of a continuum influences my own thinking about GeoCities. Just as with computational social science, much of the ethical thinking about web archives is compounded by the rapidly evolving technological infrastructure that supports web archive research – as of writing in 2018 I might be able to provide a direct quotation to a webpage that could not be searched out by readers themselves, but down the road full-text search might make even more content accessible, for example. This is not a new problem – researchers always have to be aware of how datasets might be used by future actors – but worth keeping in mind. Questions begin to appear almost immediately when working with content like that of GeoCities. What if you come across pages

Welcome to GeoCities, Population Seven Million

199

where people are openly mulling suicide, or discussing attempts? Or are expressing very personal grief about the loss of a child, or writing about their self-doubts and struggles with becoming a parent? Or, on a less morose note, are celebrating their ability to reach out to a vibrant gay and lesbian community online while being closeted in their “real life” in a Southeast Asian country? As user perceptions of privacy have evolved over the last two decades, it is often easy to find information about individuals. Some of this is the result of search engines being dramatically improved over the last twenty years, making content far more discoverable. Some of this information, too, might have been shared in the context of an “intimate public” – parts of the web that operate in a bounded, small-scale nature – theoretically accessible to any with a web browser but being characterized by tight-knit emotional bonds.61 On this note, and on further examples I will provide when discussing the ethics of using GeoCities in chapter 6, we can see that the “public-private” divide needs to be complicated. I suspect that most readers would recognize that there are no straightforward answers to any of the above questions – all of which I or my students have encountered while using web archives. On the one hand, they are all part of fascinating stories that shed light on human culture and activity. On the other hand, these sources contain personal information that authors may not want associated with their names some twenty years after writing them. Given the importance of the histories of everyday people, of course, it is not ethical to not collect these stories; as I have argued throughout this book, they are important counterbalances to stories of the powerful and dominant. Recent work being done by Sarah McTavish, who was working on her PhD with me at the University of Waterloo, pushed my own understanding of the ethical conundrums we can face in web-based historical research. McTavish explores gay and lesbian identity online. In the WestHollywood neighbourhood of GeoCities, she found dozens of websites by LGBT individuals: complete in some cases with full names, personal information, contact information, and beyond. We were immediately confronted with very real questions. What should be shown in conference presentations, and what should be blurred out? What names should be provided, and what should not be? And in publications and dissertation work, what form should citations take? Is it “real” historical research if the citations do not point to the exact historical document?

200

History in the Age of Abundance?

Fortunately a community of scholars has confronted these questions and are attempting to work them out: scholars of the “live web” as well as computational social scientists more generally. Current practices of researchers in internet studies are defined by guidelines and considerations established by scholarly associations and leading researchers in the field, rather than prescriptive regulations. They generally operate on the assumption that openly accessible web-based materials are publications but do not believe this gives them carte blanche to use the material. Although it is legal to quote from tweets, blogs, or websites, that does not necessarily make it ethical. As Aaron Bady, a thoughtful blogger, journalist, academic, and commentator, has pointed out, “The act of linking or quoting someone who does not regard their twitter as public is only ethically fine if we regard the law as trumping the ethics of consent.”62 Technologist Anil Dash is also quick to clarify the ‘fairness’ of using these sources: “When did we agree to let media redefine everyone who uses social networks as fair game, with no recourse and no framework for consent?”63 From the field of information studies, Safiya Noble argues, “There are many troubling issues to contend with when our every action in the digital record is permanently retained or is retained for some duration so as to have a lasting impact on our personal lives.”64 While her particular emphasis is on Google’s practices, the point can equally apply to social media or web archive research. Recent currents in archival thinking, too, have helped challenge the implicit assumptions about “openness” that suffuse Western archives and web-based research more generally. In my own work, my own bias is towards openness wherever possible – a world view influenced by Western norms in scholarship and information. Yet Kimberly Christen, a professor at Washington State University, has noted the importance of groups – in her case, Indigenous communities around the world – to dictate their own access restrictions “based on cultural parameters.”65 Christen and collaborators have put theory into action with the Mukurtu Content Management System, which provides far more granular access restrictions on content. As she writes, “In the case of the Warumungu community’s Mukurtu Wumpurrarni-kari Archive, access to certain cultural materials (and the knowledge that animates these materials) is decided based on a dynamic system of accountability where one’s age, gender, ritual status, family, and place-based relationships all

Welcome to GeoCities, Population Seven Million

201

combine (and recombine as affiliations shift over a lifetime) to produce a continuum of access to materials within the community.”66 While this seems antithetical to the closed/open binary of traditional access methods, these approaches help to build friction and raise questions about the right approach to cultural heritage access. Cultural protocols should lie at the heart of access questions. Indeed, some high-profile controversies help bring this into relief and show why we need to be sensitive in our thinking about ethics. In 2014 scholars at the University of Southern California encountered criticism when their study of “black Twitter,” or how African Americans interact with that social media network, was publicized. Dorothy Kim argued that the Black Twitter lead researcher’s doctoral supervisor “did not have the contextual and historical background to understand the long history of minorities and especially black men and women being used as data and experimental bodies for research and scientific experiments.”67 Given the grey area in which such research falls – generally not needing Institutional Review Board approval – there will be more controversies to come. In 2016, for example, Danish researchers made public tens of thousands of profiles from the OkCupid dating site on the grounds that it was publicly accessible to a web scraper (they grabbed the text off the site, which consisted of profile preference fields such as “sexual habits, politics, fidelity, feelings on homosexuality”).68 Other datasets, like the information obtained through the 2015 hack of infidelity website Ashley Madison (which exposed thousands of people’s names, email addresses, and other personal information of people who had signed up for the socially taboo site), also raised serious ethical problems. Both collections – the OkCupid profiles and the Ashley Madison data – will end up in web archives. Should they be weeded out? Public information is not always ethical. Several archives are trying to navigate the relationship with researchers to ensure ethical use of their collections. Some national libraries, such as the Danish Royal Library, ask researchers when requesting access to their non-print legal deposit (but not their traditional book documents) collection to discuss if they will be researching issues involving “sensitive personal data.” This governs personal information that might reveal ethnic background, political opinions, religious beliefs, membership in a union, health or sexual information, criminal offences, social problems, or internal family relationships.69 While this

202

History in the Age of Abundance?

process relies upon self-declared research proposals, users sign agreements and are put in touch with the Danish Data Protection Agency if they will be using such personal material to receive guidance and advice on the project. It depends on a high level of trust, as logs do not seem to be regularly reviewed (if they are at all) but frames ethics as paramount from the moment the researcher begins to explore the collection – and retains the power to withhold access to collections if ethical concerns mount. Scale, however, is the complicating factor when considering institutional ethical practices and web archives. While the granular model of Mukurtu admirably contests open/closed binaries and challenges us to reconsider the primacy of Western access models, it does not scale to the billions of resources found in a web archive. Content creators and owners are difficult to identify and often impossible to contact, as we have seen. This puts collectors in a quandary, as they would be forced to make access decisions on behalf of third parties. Accordingly, with web archives, the ethical onus is often placed on the researchers themselves. This research responsibility is founded on the three main assumptions held by many collectors and researchers: that the web is a publishing medium, the “ephemeral nature [of the web] is a deficiency of the medium” that can be “fixed” by web archivists and trained archival personnel, and web archives are thus collecting “material that is freely available anyway.”70 As archivist Eira Tansey notes, the “idea that if self-published personal content is publicly findable on the web, its fair game for journalistic, academic, or archival re-use is so common that few question it.”71 As we use web archives, we need to ask questions about privacy and ethical issues, especially given the scale, scope, and invisibility of much web archiving efforts. Just as commentators were rightly concerned about the ethics of exploring Black Twitter as a primary source, these dimensions are at play when we begin to look back on websites made in the mid-1990s. The crucial question for researchers working with these archived websites is whether they were created with an expectation of privacy. This becomes especially difficult to answer when working with defunct sites like GeoCities. Can we, and should we, consider GeoCities sites to be proper publications – in the sense of something written and published for public access? Contemporary studies of web activity can help bring this into context.

Welcome to GeoCities, Population Seven Million

203

One social media scholar, danah boyd, has explored these questions from a contemporary perspective in It’s complicated. While boyd collected her data two decades after GeoCities’ inception, her research and interviews demonstrate a complicated and often conflicted approach to online privacy amongst teenagers. Her respondents are worth exploring at length: Although many adults believe that they have the right to consume any teen content that is functionally accessible, many teens disagree. For example, when I opened up the issue of teachers looking at students’ Facebook profiles with African American fifteen-year-old Chantelle, she responded dismissively: “why are they on my page? I wouldn’t go to my teacher’s page and look at their stuff, so why should they go on mine to look at my stuff?” She continued on to make it clear that she had nothing to hide while also reiterating the feeling that snooping teachers violated her sense of privacy. The issue for Chantelle – and many other teens – is more a matter of social norms and etiquette than technical access.72 Online communications are astutely compared by boyd with those taking place in a busy café. These are conversations taking place within a public place, but that does not mean that there is no expectation of privacy “due to social norms around politeness and civil inattention.”73 Other have advanced similar arguments.74 Returning to Aaron Bady, he has noted that while it is legal to quote strangers online, it is also legal to photograph people as well – most would agree it is not ethical if one is engaging in “creep shots,” or pictures of a sexual nature taken of people in public places without their consent.75 Ultimately, many internet users have an expectation of privacy by virtue of their obscurity: most tweets and blog posts are needles in a haystack, especially if shared with only a limited number of followers.76 As we will note, until web archives were subjected to text mining and data analysis, the Wayback Machine largely indirectly enabled a “privacy-by-obscurity” model. The Internet Archive does still offer a removal service – if an individual writes to them, content can be removed from the publicly facing index. The Association of Internet Researchers (AoIR), the professional scholarly association for those scholars, has put its stamp of approval on

204

History in the Age of Abundance?

ethical guidelines for researchers working with the web. In short: context is key, and the wide variety of approaches means that there cannot be a one-size-fits-all approach to ethics. The AoIR report on ethical guidelines is a “dialogic, case-based, inductive, and process approach to ethics” and worth reading for historians considering working with web-based sources. They outline a series of tensions around whether these sources should be considered akin to “human subjects,” whether the internet is a public or a private space, the difference between dealing with sources on an aggregate level versus that of the individual, and whether researchers’ approach to ethical issues should be top-down (regulatory) or bottom-up (context specific).77 As with other ethical approaches, the AoIR report stresses the latter approach: rather than prescriptive rules, important concepts to consider. Building on this, internet researcher Stine Lomborg identifies two factors that researchers may consider. The first is distance, insofar as the ethical issues of using a large dataset – and approaching it in aggregate – are different from focusing on individual users. Is the researcher distant from the subject, such that the subject is not being singled out or could be identified as a result of the data presented? The second revolves around original user perceptions of privacy. If users expected privacy when posting, or did not consider their behaviour for public consumption, should we respect that today? By thinking about ethics in a case-based system, there are no “final, all-encompassing standards” but rather guidelines that “serve to stimulate and inspire ethical reflection throughout a research project.”78 As historians too, of course, we must realize that the historical context of when sources were created informs how we can ethically use them today. My example from Sarah McTavish showed that users put vast amounts of personal information online in GeoCities, often things that they would not want people to know in the “real world” – their sexual orientation, addresses, and beyond. People also made jokes or statements online without realizing the ramifications of having a publicly accessible record being kept for years, as mores changes, personal celebrity waxed or waned, and beyond. As Megan McArdle noted in the Washington Post, “Emergent communications technologies are often first used informally, by early adopters who treat them as a sort of extended private space … we should show some mercy to those early adopters who didn’t realize

Welcome to GeoCities, Population Seven Million

205

that blogs or social media would end up being more like a newspaper than a barstool.”79 There are no straightforward answers or templates for historians to consider when using this material, but they all suggest that we owe our historical sources and individuals a duty of care. Historians need to understand context, cultural protocols, and expectations of privacy when using these resources. Scale can mitigate, but not eliminate, problems – so too can changing our citation patterns or anonymizing these print sources. The citations may not be there, but we can trust each other that we are doing good, responsible historical scholarship. A good way to bring some of this into relief, then, is to return to this chapter’s case study. Where does this all leave us with web archives like GeoCities?

Ethics in the Case of GeoCities If we accept that GeoCities was worthy of preservation and will form an essential part of the future historical record, we recognize the importance of collecting material before it disappears. This almost immediately shifts the onus of consent from collector to researcher. Time was of the essence when GeoCities was being closed, making an opt-in process an impossibility. In any case, the experience of opt-in versus opt-out mechanisms shows that the former has a low success rate – would you click on a strange email from a “web archive,” even if they could find your current email address?80 If Archive Team and GeoCities had not collected the imminently closing GeoCities domain, the information would simply have been gone – there is no Undo button. In some cases, web archivists may have time to provide notice to a website’s registrants, letting them know that their website is being archived. But in the case of a site like GeoCities, we seldom have someone’s real-world name: email addresses are mostly defunct after fifteen years of disuse, and online aliases are generally not people’s real-world names. While we can occasionally track down the identity of an author – pathways such as Googling a GeoCities handle and finding a LiveJournal account, and from there a Twitter account and a real name – that feels extremely invasive and would require further ethical clearance from institutional review boards. In any case, it does not scale to the millions of sources web archives contain. While as a historian trained to collect

206

History in the Age of Abundance?

oral histories, my first instinct was often to think in terms of informed consent, it is also important to remember that informed consent is a proxy for treating the people and communities we research ethically and with respect.81 Hence the AoIR approach of context and thinking deliberately may prove to be the best way forward. But, as Salganik cautions researchers, once informed consent is not possible, we need to recognize that we are very much in a “gray area.”82 Indeed, not being able to get informed consent does not mean that we can simply give up the project of doing web histories with everyday people. I feel similarly uncomfortable with leaving the voices of everyday people completely outside the historical record when there is ample opportunity to include them. Moving to a full opt-in process would likely lead to the historical record being dominated by corporations, celebrities and other powerful people, tech males, and those who wanted their public face and history to be seen a particular way. Given the short lifespan of the average website – the oft-bandied-about figure is around one hundred days – it is important to collect material before it disappears. Yet the invaluable opportunity presented by sources like GeoCities – some of the early everyday people who began to use the web, their thoughts, fears, loves, passions, and beyond – need to be tempered by a recognition that not all of these sources were created equally, nor should they all be treated the same. They exist in different ethical contexts. Some first-hand examples can help to concretize these more abstract ethical questions. My explorations into GeoCities have found several categories of websites that have no straightforward answer to how we should work with them. One example of ethically fraught sites are memorial sites. Let us look at one website in particular, a memorial site to a deceased dog called Missy, “the most loved and loving dog in the world.” There are pictures of Missy playing with toys, sleeping, hanging around an antique car, wearing costumes, and posed for pictures. Poems are written in her honour. It demonstrates not only how important she was to the lives of her family, but potentially part of a broader understanding of how significant pets are to the lives of many North Americans. It does not take too great a leap of imagination to consider how historians might use such a source. Indeed, this was not Missy’s first website: a crawler had captured her first webpage, “Missy’s World,” dedicated to her in March 2002, part of a much bigger community of mixed breed dog owners, connected to

Welcome to GeoCities, Population Seven Million

207

each other through hyperlinks and webrings (a group of likeminded users whose pages contained links to each other). Missy touched the lives of more than her owners, as her guest book contained the reflections of other people. “Your site brought tears to my eyes,” wrote Tracy, another user. It was a well-travelled corner of the web. From the guest books, we learn that the owner of the memorial site was also grieving for her late daughter who had dedicated her own life to animals before her tragic death in 1997. The guest books, comprising comments and conversations between other users who dropped by the site, speak to the power that this site had to other users: inspiring them to get through funerals, preparing eulogies, navigating their grief over lost children, parents, and even pets. On the one hand, this is very private, intimate material. Yet it was also published on a public website with a popular guest book and became a hub for many other GeoCities users who had lost loved ones or animals or their own, leaving messages, encouraging visitors, and having public conversations with each other. The site thus does not fit neatly into the public or private sphere. The site was public, discoverable via the GeoCities neighbourhood page that it was a part of, as well as webrings, but it was also part of a relatively closed social network. Most visitors on the guest book had tales of loss. And judging from the relatively low level of technical expertise visible in the website’s development, it would surprise me if the site owner or many of her commentators would have reasonably known that their words would be accessible to historians today. Yet here they are on my hard drive. If memorial pages are fraught, pages about suicide are even more so. It feels invasive to read through all the pages – obtained through having the raw contents of all of GeoCities from both Archive Team and the Internet Archive itself – as a historian two decades later. I found several sites written from the perspective of those who wanted to help people suffering from extreme depression or suicidal thoughts, including help lines and those writing advice columns, but also sites by those considering suicide. These are posts written on personal sites (often wide-ranging personal sites covering many topics) by individuals who bared their souls, explaining their contemplations and attempts, their coping mechanisms, and their support networks if they had them. One site in particular stood out: written by a girl, it is an early blog-type website that provides an account of reading her mother’s suicide note,

208

History in the Age of Abundance?

the girl’s broken family relationships, and her utter and complete sadness and loneliness. On one level, she gives consent that her story be shared with others, noting that she will tell her boyfriend and some friends about the site and will otherwise leave it to the “search engines” to bring people if they were interested. Yet I felt like an invasive presence on the site. I do not share its location here. Indeed, there is a chance that I was the first to read the page itself in years. In both examples above, I have declined to include information that would make the sites findable. As a historian, I would be on steady legal ground if I was to quote from them and link to the publicly accessible copy in the Internet Archive’s Wayback Machine. It might not be ethical to do so, however. Websites like those above are the sorts of materials that are not traditionally deposited in archives. In the past historians had to rely upon rare glimpses of everyday people: a letter to the editor, an interview by a community newspaper or magazine, a fleeting interview with a media outlet, or, in the rarest of all cases, a personal diary that was made available to a historian. In the last case, there may be a research agreement that needs to be signed by an archive, or in other cases, historians may need to consider the ethical obligations they have to print material as well. The big difference with the web from the above is the lack of archival and institutional support, and the sheer scale at work. There are now millions of diaries, on both the live and archived web, representing an impressive assemblage of everyday voices. But just because it is on the web does not mean that it waives the default expectation of human subjects and participants to privacy. Historians now have even more power, because we can access the blogs, ruminations, and personal moments of millions of people. This power needs to be used responsibly, likely with oversight from institutional review boards, such as those that regulate the research of oral historians. The precise role that IRB s should play in overseeing oral history projects is sometimes debated, but there is general agreement that external regulation is valuable. It makes us consider the potential harms, benefits, and other factors at play when researchers interact with the people they are writing about. Ethically using web archives like GeoCities and other large collections authored by everyday people will be a watershed, however, as it points towards a more inclusive and democratic historic record.

Welcome to GeoCities, Population Seven Million

209

Privacy in the Age of the WARC : Some Potential Approaches Despite the Wayback Machine having been widely available to the public for over twenty years, questions about the ethics of using it have emerged only relatively recently. Until now, most users have enjoyed privacy by obscurity – their presence on the internet was known mainly to those they shared it with, and they weren’t easy for others to find. Developments in data mining, text analysis, and accessible web archiving analytics suites have changed this limitation. The growing availability of downloadable WebARChive files (the “raw material” of the web archives discussed in the last chapter) and searchable databases, laudable for their access, threatens the status quo. When internet pages can be accessed only via the Wayback Machine, they are more difficult to find and thus enjoy privacy by obscurity. Unless researchers know the exact URL of the webpage they are seeking or can navigate one page at a time from an existing URL that they know, the page is relatively inaccessible. The Internet Archive does not provide full-text search into the holdings of a domain like GeoCities.com, which has had the effect of preserving user privacy over twenty years. This is rapidly beginning to change. The discovery tools used in this book, and in other projects, illustrate that this is coming to an end. Without search tools, which we have discussed throughout this book, serious research with web archives is impossible. But with them, the previous approach to privacy no longer works. What should we do with GeoCities, then, when working with the underlying WARC files that make up the collection, to avoid unethical privacy breaches? One issue is to begin to tease out just who is chosen to set the standards for what is ethical and what is unethical. While many university IRB s may consider publicly accessible websites as “publications,” this view may need to change. Yet in any large web archive, given the sheer amount of data that becomes accessible to researchers, much of the onus is will have to fall on the scholars themselves to navigate this material. In many web archive cases the data is accessible, either through the Internet Archive’s Wayback Machine, or in the reading room of a national library. So how can a researcher ethically use a large web archive, like GeoCities? First, we can use “distant reading” to zoom our gaze away from the individual websites and to look for larger patterns within an

210

History in the Age of Abundance?

archive. We have seen some of this in this chapter: looking at link patterns to ascertain PageRank, exploring topics, looking at thousands of images instead of tens. None of this by any means eliminates all ethical concerns about a collection – think of what the NSA has done with similar kinds of data – but it does mitigate them to some degree. Individuals have their privacy protected, and others cannot find their sites without having the same level of access to the archive. People are obscured, but they are still read into the historical record. They also speak to a valuable research method, as of course no matter how diligent or comprehensive researchers can be, they will not be able to read every GeoCities page. Yet historians still need to read pages, or at least make a clear case for why not if they ultimately obscure the source. They need to cite, quote, and footnote sources with links. Professional norms demand verifiable and somewhat reproducible research. But things are complicated in the case of web archives, where a document is a click away rather than an airplane ticket to an obscure archive. Ultimately and imperfectly, the onus will thus need to fall on individual researchers to carry out a risk assessment and to consider the context in which the material they are reading was published in. Did the author of the individual website they are citing or using as evidence have a reasonable expectation of privacy? If such a person was a GeoCities community leader, with inbound links from hundreds of other sites and featured on directories, probably not. If the author was posting a heartfelt story on a friend’s GeoCities guest book, part of a seemingly closed social network of a few high school chums, there probably was such an expectation. This means that when engaging with individual sites, the central metric should be “expectation of privacy.” A website linked to many other websites, with a proud prominent viewer counter in the margins demonstrating the thousands of visitors who came, signals a website where an owner wanted to be read and encountered. A smaller website, with no link counters, addressed to mainly friends, written by a teenager with revealing messages and pictures, presents a more ethically fraught source. At the very least, we need to pause and think about our obligations as historians. We must use sources like GeoCities with tempered enthusiasm. We have power because we can access the blogs, ruminations, and personal moments of literally millions of people that would never before have been accessed – but we need to use this power responsibly.

Welcome to GeoCities, Population Seven Million

211

5.4 Mentions of GeoCities in a LexisNexis Media Survey, 1995–2013.

Conclusion Between 1994 and 1999, GeoCities users carved out an active online community that has been preserved as remnants amongst web archives. Those who sought meaningful connections within GeoCities could find them: from the community leaders who welcomed them, to the awards they might receive and proudly adorn on their sites, to the guest books that they signed and invitations they issued, to the links and webrings that connected their pages. The 1999 acquisition of GeoCities by Yahoo! made dramatic changes to the site’s structure. In September of that year, pages became accessible through the old neighbourhood URL but also just via a user’s Yahoo! ID (such as http://www.geocities.com/username123). While neighbourhoods persisted as vestigial elements of the GeoCities experience until the wholesale destruction of the site in 2009, they cased to be a significant dimension in most users’ day-to-day adventures. GeoCities is a story of meteoric rise and fall. Consider the chart in figure 5.4. After the heyday of 1999 and its acquisition by Yahoo!, GeoCities began to recede from popular consciousness. Apart from a brief resurgence in 2008 when it became clear that overseas users were becoming key consumers of GeoCities – a brief scandal involved the Afghani Taliban availing themselves of sites – media attention returned to GeoCities only when it became clear that the site would close. By then, however, it was too late. As we’ve seen in previous chapters, it was

212

History in the Age of Abundance?

only the quick action of web archivists that makes it possible to explore GeoCities today. Yet through these web archives, limited as they are and circumscribed by a single scrape, we can learn a lot about these digital places. We can see the ruins of a robust web community that mattered to the lives of many people. In a context of diminishing social and community ties in the 1990s, GeoCities was a partially successful attempt to provide a cyber equivalent.83 We should not understand the decision to set up the website as a series of GeoAvenues, homesteads, and neighbourhoods as haphazard, but rather as one that spoke to a particular vision of what the web could be and what its users wanted. If we began with cautionary notes about ethics, this chapter has shown that we need to use care: but that when used with care, we can derive significant meaning from the cultural records of everyday people. Here, amongst the ruins of GeoCities, we can see how new web users teased out their relationship to the web. They were not alone but were part of a larger community. Web archives present an interesting opportunity to look back to the days between 1994 and 1999 and how – spread out across time and space – users figured out what the web would mean to them. GeoCities, a massive assemblage of non-commercialized public speech, presents an interesting way to begin thinking about the history of the early World Wide Web – and to the intriguing potential found within the WARC s.



THE (PRACTICAL) HISTORIAN IN THE AGE OF BIG DATA

Historians often take the infrastructure that supports their research for granted, from search engines to archives. This is not a dig at historians, but rather a reflection of just how smoothly libraries and archives function. Consider the route that historians take when using a “traditional” print archive: they might go to a website, download a finding aid, identify a series of boxes and files to have ready for consultation upon arrival, go to the physical archive, and then take notes or reproductions of these resources. A lot of this process happens in the background, and to a new archival user a lot of it can seem pretty serendipitous: amazing that all the documents pertaining to one strike can be quickly found in a few files, for example, or that related topics can be found close to each other. The simplicity of a finding aid – metadata! – belies all of the work that went into its creation. To an archivist or librarian, it can seem frustrating when a historian proclaims to “discover” something in a well-labelled file or box, because the historian does not realize all of the training and processes that combined to make the document “discoverable” in the first place. Because archives work so well from the perspective of a user, archival labour and practices can fade into the background. This is a useful segue into the world of Big Data and what historians and others need to know to leverage it. A surefire way for a historian to recognize the value of the archival or library profession is to suddenly be confronted with the vast data of a web archive. Many of the problems confronting a web archive researcher result from suddenly not having the professional framework and infrastructure from which historians studying earlier time periods benefitted. In this final chapter, I explore the current and future state of digital tool development. In chapter 4 I reflected briefly on the dangers of

214

History in the Age of Abundance?

providing too much technical information for fear that the book would date too quickly. Accordingly, this chapter focuses on trends, current attention to web archival and digital history research, with an applied emphasis on how methodological innovation is happening. The chapter begins by exploring one main trend in web archiving research right now, that of greater accessibility, before providing several tangible examples of how a historian could begin in web archiving analysis. It then concludes by discussing that as these tools evolve, so too do the questions that we can ask of them.

The Push towards Accessibility: Current Trends in Web Archiving Research A theme throughout this book is that the scale of web archives makes them difficult to work with. Web archives fill storage devices quickly, and working with all of their content – ranging from Word documents to HTML files to videos to images – is challenging. These forces also make it difficult to develop usable, efficient analysis tools. Since the late 1990s, researchers and other information professionals have been struggling with how to provide access to web archives. Accordingly, one major theme in web archiving tool development has been accessibility. We can see this in two major respects: making it easier to collect data, and making it easier to then analyze that data. Web content, as we have seen, is often difficult to capture. The dominant web crawler is Heritrix, a collaborative open-source project by the Internet Archive and several European national libraries dating to 2003.1 The name itself comes from “an archaic word for heiress … [s]ince our crawler seeks to collect and preserve the digital artifacts of our culture for the benefit of future researchers and generations, [the] name seemed apt.”2 Heritrix gives users the ability to carry out wide-ranging broad crawls of the web, following user-specified rules (for example, crawl only sites from a given domain like CNN.com). It forms the basis of the Internet Archive’s crawls through their Archive-It service as well as their global Wayback Machine. Many other national libraries use Heritrix as the de facto standard in creating web crawls, as do many other organizations that need to preserve web content for compliance or archival reasons. Heritrix has a catch, however. While free and open-source software, it is difficult to use. It requires that the user not only knows how to employ the command line to launch the program, but also has a conceptual

The (Practical) Historian in the Age of Big Data

215

Fig 6.1 Webrecorder capturing the University of Waterloo’s webpage.

understanding of processing chains, various file formats, and debugging other odd behaviour that Heritrix might encounter when downloading data. It is powerful but generally requires a user who has expertise as a developer – not ideal for a historian or most librarians and archivists.3 Accordingly, the Internet Archive’s Archive-It service, which we have seen throughout this book, is a service that provides the power of Heritrix with the usability of a modern front-end interface (as well as storage and customer support). Archive-It, however, is priced and targeted at institutional subscribers. Single researchers in the humanities are unlikely to have the necessary resources to use it as individuals, and they might not be able to leverage a university library subscription (especially if they are at a smaller institution or are an independent scholar). This is not meant as a criticism of Archive-It in many ways: it is not designed to be run by independent researchers, but is rather operating on a basis similar to that of many other subscription-based library research services. Other services are emerging that let users run their own web crawls without needing high-level technical skills or institutional support. Webrecorder is perhaps the best example of this current. To use it, users visit https://webrecorder.io, identify a website they want to capture, which then records all of the content that they are loading into their browser. If users want to collect a crawl of the University of Waterloo’s

216

History in the Age of Abundance?

webpage, for example, they would enter that URL and begin to click through to various pages and menu bars, watch embedded videos, and the like, all of it being recorded and stored to WARC files by Webrecorder. Figure 6.1 shows what this software looks like as it begins to capture the University of Waterloo’s site. Webrecorder has other features as well: you can try crawling sites using different browsers, or even enable “auto scroll” so that the browser continually scrolls down and downloads content. This is critical for harvesting a Twitter timeline, for example, as tweets appear only as the user mouses down the page. While as of writing the business model for Webrecorder is evolving, it is committed to keeping a level of free access for individuals. The Internet Archive, too, has long had accessibility baked into its essence. Whereas a scholar wanting to use web archives in 1998 would have needed to know how to access a server and run code to find content, since late 2001 users have been able to use the Wayback Machine to play back webpages and explore them through their web browsers. This emphasis on accessibility has carried over to the methods by which people can request sites to be saved and included in the Internet Archive. In 2011, for example, users who wanted their page to be archived had to have their site “listed in major directories … or encourage other, popular sites to link to you.”4 In short, webmasters had to get their sits noticed by the Internet Archive if they wanted to make sure they would be saved by it. In 2014 the Internet Archive launched their “Save Page Now” function. Anybody can go to archive.org/web and enter a URL , and that site will be quickly harvested and indexed by the Internet Archive. These simple little tools help bring the power of the web archive to a regular user. There are now extensions and plug-ins that bring that function into web browsers themselves, so uses can simply click an icon when they find something they want saved. In the wake of Donald J. Trump’s 2016 election as president of the United States, these community tools were put to good use as activists sought to document climate data and other sensitive government documents that were at risk of disappearing during the presidential transition. If you are browsing content and worry that it is at risk of being deleted, there is now quick and easy way to make sure that the content is saved: “Save Page Now.” Now that it is easier than ever to capture born-digital material, the next frontier is facilitating easy access to web archival content in forms

The (Practical) Historian in the Age of Big Data

217

other than using the Wayback Machine to play back pages one at a time. In the past two chapters, we have seen some of the tools that we used to extract text and hyperlinks from web archives. I gave only snippets of code, cognizant that they would date quickly, but the fact that I was providing code itself suggests the technical difficulty at play. Until recently, working with web archives at scale like this required a level of technical knowledge: how to use a command line at a minimum, knowledge of programming and a developer environment in many more cases, and in almost every case the ability to navigate an acronym soup of frameworks and computational words. In short, web archive analysis has long been a daunting task to undertake. The similar impetus towards accessibility that motivated Webrecorder and “Save Page Now” underlies a research project that several colleagues and I launched in 2017, the Archives Unleashed Cloud. Our goal is to make the analysis of web archives as easy as Archive-It or Webrecorder makes the process of crawling and archiving websites. In short, to hide the WARC files, the code, and the execution behind a graphical front end. In figure 6.2 you can see a “collection page.” This represents an Archive-It collection. It consists of an in-browser interactive hyperlink diagram, download options for graph files, text files, and domain counts, as well as some additional information about what has been collected. This project is still very much under development, and we are aiming to support the full filter-analyze-aggregate-visualize cycle discussed in the last chapter. What do these moves towards accessibility in the world of crawling and analysis mean? Perhaps most promisingly, they suggest that the web archiving ecosystem is expanding. Until recently, it has been dominated by large institutions that had the resources to crawl and store material themselves – the national libraries and the Internet Archive, for example – or those who had the money to pay others to do that for them. Archive Team, the distributed collective of guerrilla web archivists that I introduced in chapter 3, was more of an exception than the norm. With Webrecorder, “Save This Page Now,” and the growing analysis tools such as the Archives Unleashed Cloud, individuals or small groups now have the power – or at least the potential – to use this material. Exciting moves towards a distributed web archive, such as the “InterPlanetary Wayback” scheme, which would see web archive files distributed around a global Interplanetary File System (IPFS ), could further see the landscape decentralize from a few large players to many smaller ones.5


Fig 6.2 Archives Unleashed Cloud, in development ca late 2018.

All of this push towards accessibility suggests one promising new direction in web archiving tool development. Just as web crawlers are evolving from opaque developer tools to web browser–based user interfaces, the analysis ecosystem is similarly moving in that direction. I could have listed other examples, although some are developing too quickly for a print book to do them justice – the Web Science and Digital Libraries Research Group at Old Dominion University, for example, seems to release new tools and workbenches every month. Their blog, accessible at https://ws-dl.cs.odu.edu, is well worth a bookmark for a historian or other scholar interested in exploring web archives. These are exciting times for researchers who want to work with web archives. Right now, doing web archive analysis requires real commitment from a historian or other scholar – web archives demand an investment of time, money, and energy that is probably justified only if they can lie at the heart of an entire project. In the future, a user should be able to do analysis with a minimum expenditure of time and effort – meaning that web archive analysis can become part of a researcher's workflow, just as microfilm, or newspaper databases, or Google Books is part of it today.


How to Get Started in Web History: Basic Skills

Getting started in web archiving, and working with data at scale, is daunting for anybody, let alone a historian or other humanist or social scientist who does not come from a technical background. While the field of web history is growing, it is still new enough that scholars who want to use web archives cannot generally rely on graduate-level training. There are few digital history courses at the graduate level in North America, and amongst the topics covered in them, the use of born-digital resources is seldom explored for more than a week or two. In a practical sense, then, what do scholars need to know to get started? First, they need to be familiar with the landscape of web archiving toolkits and resources. The International Internet Preservation Consortium maintains an "awesome list" of resources for web archiving, covering topics such as available training, tools and software, community resources, and a list of past projects that are no longer maintained.6 Many of these tools are well discussed in the annotated list and provide comprehensive functionality for parts of the web archiving ecosystem, from collection to analysis. However, users will quickly realize that almost none of these tools are usable in the way that traditional software might be: there are rarely easy downloads, with icons to click on and a standard installer wizard. Rather, there will be instructions for "building" software, typing in commands to terminal or command line windows, and quite a bit of language around the GitHub coding platform. Accordingly, after gaining a basic understanding of the web archiving ecosystem, burgeoning web archive historians will need to develop specific technical skills to work in this field. Even while things are moving towards accessibility, as discussed in the last section, the more that historians become invested in web history, the more skills they will need. For digital historians more generally, these skills are increasingly necessary. The most important set of skills is general computational ones. While accessibility is a growing emphasis of collecting and analysis tools, the research landscape is still dominated by rapidly evolving and developing research projects rather than stable infrastructure. Once research questions expand beyond a few dozen websites, it is important to be able to leverage the power of a computer to automate research workflows. Whenever users find themselves repeating the same thing, there
is usually a way to get a computer to do it for them. In the examples above, for example, you might want to send several thousand URL s to the Internet Archive to be archived via the “Save Page Now” button. You could paste them in one at a time, or you could use a utility to do it for you. Or you might run some analyses using the Archives Unleashed Cloud, but then want to parse the ensuing large text files to see how often a given phrase or word appeared. A foundational skill for much of the above is the command line. Users who were introduced to computing in the 1970s, 1980s, or even the early 1990s will be familiar with this interface: the text-only screen, the blinking cursor, the commands consisting of relatively opaque phrases and words. For those who have worked with computers since the age of Windows or Mac OS , and who are not software developers, the command line is an unfamiliar way to control a computer. An example of the command line can be seen in figure 6.3, where we see a list of files in a directory from the command line (top) and the same view in the standard Mac OS view (bottom). The command line, however, is quite powerful and flexible. It is worth traversing its learning curve. Many research tools run only on the command line, as it is easier – and often more efficient and secure – to develop tools to run there rather than through a standalone program or the web browser. It also comes with a set of built-in tools that allow you to do things as varied as count words, find lines in files that contain certain words, sort information based on certain criteria, and interface with other programming languages to interact with files. This book is not the place to teach readers how to use the command line, but there are several online resources. They have the advantage of being updatable when there are minor changes to software packages, as well as providing connections to an even greater research software ecosystem. Of course they face the potential vulnerability of online sources, in that they might move (or be deleted). And physical items are not immune either. A useful starting place for learning these skills is the Programming Historian (https://programminghistorian.org). Founded in 2008, the site publishes “novice-friendly, peer-reviewed tutorials that help humanists learn a wide range of digital tools, techniques, and workflows to facilitate research and teaching.” Lessons often begin with setting up software on your own laptop, working with provided datasets, and then swapping them out. In particular, lessons on the Bash


Fig 6.3 The command line versus the graphical user interface. Both are showing similar information but note the “ls -lh” command required at top to view it.

command line, text processing, and topic modelling would all help begin to recreate some of the examples in the last two chapters.7 These basic lessons provide a basic computational literacy. Beyond that, the set of lessons on the Python programming language are an exceptionally useful starting course in thinking algorithmically about historical sources. They use a running example of the Old Bailey Online in order to give historians tangible workflows. Another excellent resource is Software Carpentry, which facilitates high-quality, low-cost instruction in basic skills including the command line, the GitHub version control system, and programming languages like Python and R.8 Beyond Programming Historian there are also dynamic educational sites online that train users in basic programming and development skills. While these sites are emerging from the venture capital hype cycle, they are useful ways to learn skills. The interactive nature of the
lessons helps to affirm skills and competencies in a way that a book does not, unless users can work alongside them. Codecademy, which was extremely prominent throughout 2014 and 2015 (partnering with the White House, for example), remains one of the most popular sites as of writing to learn basic skills (https://www.codecademy.com). For example, I used this site to learn the elements of Java and Ruby on Rails. While those lessons did not make me a full-stack developer, they did let me read code, as well as give me enough information to change scripts and programs to adapt them to research workflows. Web historians need to become fluent in these basic capabilities, or at the very least work on a team where others have these skills. Yet, even in an interdisciplinary environment, historians should not hide from technical skills when working with these sources. Basic computational literacy can help in two primary ways. First, as working with web archives necessarily requires the manipulation of algorithms and code as well as understanding the algorithms that produced the web archives, learning basic technical skills can help historians begin to think like a computer. Second, technical skills introduce flexibility. Research questions, especially at advanced levels, can become quite individual and boutique – requiring the adaptation and alteration of existing tools. All of this means that technical skills are worth the investment. As I note below, hopefully the historical profession will follow with graduate-level training in the near future.
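To make the payoff of these basic skills concrete, here is a small example of the kind of automation mentioned above: submitting a batch of URLs to the Internet Archive's "Save Page Now" service rather than pasting them in one at a time. It is a sketch only. It assumes the requests library, a plain text file of URLs (urls.txt is a hypothetical name), and the web.archive.org/save endpoint behaving as it does at the time of writing. Still, it shows how a dozen lines of Python can stand in for hours of clicking.

```python
# Submit a list of URLs to the Internet Archive's "Save Page Now" service.
# A sketch under stated assumptions: the requests library is installed
# (pip install requests) and urls.txt contains one URL per line.
import time

import requests

with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # A GET request to the /save/ endpoint asks the Internet Archive to
    # capture the page; the status code tells us whether it was accepted.
    response = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    print(url, response.status_code)
    time.sleep(10)  # be polite: pause between requests rather than hammering the service
```

The same pattern, a loop over a file of inputs plus a single web or command-line call, covers a surprising share of the repetitive work described in this section.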

How to Get Started in Web History: Finding WARC Files

With basic skills learned, web archive researchers then need to find web archival data to study. While the Wayback Machine runs on WARC and the earlier, similar ARC files, there is no way to download them out of the interface itself. This is one frustration that researchers often encounter: they want to do computational work on web archives, such as exploring hyperlink networks or doing some text mining, but have difficulty finding WARC files to use. Fear not: there are several ways to find WARC files, from contacting a librarian or archivist who has responsibility for their collection, to finding resources on the Internet Archive, to collecting the data yourself. This status quo is unlikely to change, as web archivists steward their collections and exercise control over their usage rights and access conditions.
Traditional archives steward collections and control access – researchers generally do not walk right into an archival stack to remove boxes on their own – and web archives are no different in this respect. Throughout this book we have seen Archive-It collections. The running example of the Canadian Political Parties and Political Interest Groups collection is one. The web archivist responsible for the curation of an Archive-It collection has access to the underlying WARC files and can often work with a researcher to provide access to them. Most Archive-It collection pages identify the organization that was responsible for the collection, such as a university or library. Working directly with the librarian or archivist responsible can also help resolve copyright or usage issues. There are also several web archival collections available for free download. At the Internet Archive you can download the raw WARC files for an entire crawl of the World Wide Web: the wide00002 collection, discussed in chapter 4.9 A subject query for "WARC" also brings up a wide array of collections at the Internet Archive itself, covering several thousand websites. Finally, users can create their own WARCs to run analysis on. Webrecorder exports files into the WARC format, allowing scholars to do research on the small collections they might assemble there. There is also a wide array of other tools available for creating web archive files, such as the "wget" command line tool or even running Heritrix yourself, although these require fairly extensive knowledge of the command line. Still, each of these is a feasible way to get a bit of data. Perhaps most promisingly, researchers might begin with some small WARCs created with Webrecorder, before reaching out to a collections librarian for access to a larger array of research files. Most importantly, though, researchers need to remember that there is lots of web archival data out there – they often just need to ask.
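For researchers who want to script that last option, the warcio library from the Webrecorder team can write WARC files directly from Python. The sketch below is a minimal illustration under stated assumptions: warcio and requests are installed, and the capture_http helper behaves as documented. It records a handful of page fetches into a single WARC, much as Webrecorder or wget's WARC options would, though without the browser-based fidelity of the former.

```python
# Record a few page fetches into a WARC file using warcio's capture_http.
# A minimal sketch: assumes warcio and requests are installed
# (pip install warcio requests). Real crawls need link-following,
# politeness delays, and error handling that are omitted here.
from warcio.capture_http import capture_http

import requests  # note: requests is imported after capture_http so captures work

seed_urls = [
    "https://example.com/",   # hypothetical seed URLs; substitute your own
    "https://example.org/",
]

# Everything fetched inside this block is written to the WARC file.
with capture_http("my_collection.warc.gz"):
    for url in seed_urls:
        requests.get(url, timeout=60)
```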

How to Get Started in Web History: Basic Web Archive Analysis

Finally, once historians have relevant technical skills and a few WARC files, they can turn to analysis. We have seen what you can do with web archive analysis tools in the previous chapters. Accordingly, this section explores three different pathways to analysis. First, how to carry out a basic overview analysis of what is in a collection. Second, how to use basic text analysis tools to find patterns in content. Third, how to explore the network diagrams that are created by mapping hyperlink patterns.


Table 6.1 Domain frequency in a web archive

Domain                         Frequency
www.equalvoice.ca                   4644
www.liberal.ca                      1968
greenparty.ca                        732
www.policyalternatives.ca            601
www.fairvote.ca                      465
www.ndp.ca                           417
www.davidsuzuki.org                  396
www.canadiancrc.com                   90
www.gca.ca                            40
communist-party.ca                    39

The FAAV cycle used in chapters 4 and 5 is a useful way to do research. This section briefly explores how you can actually implement those abstract principles. It will do so by introducing in some detail the tools that underlie much of the analysis research presented in this book. All three sets of data – the overview of a collection, text within it, and hyperlinks – can be extracted using the Archives Unleashed tools. The Archives Unleashed Toolkit can be found at https://archivesunleashed.org/aut/. Funded by the Andrew W. Mellon Foundation and the federal and Ontario governments, the project is committed to making this tool available through the 2020s. The Archives Unleashed Cloud, for Archive-It collections, is also accessible at https://cloud.archivesunleashed.org. Code changes, but the underlying goal of providing collection analytics, text, and network diagrams lies at the heart of the project. When getting a collection of WARCs, the first step is to figure out what is inside them. Crawlers, as we know, operate in complicated ways: sometimes they may be directed to follow all links within a page to grab content. At other times they may find embedded content within a page (for example, the Twitter widget on a page that is actually from Twitter.com). The Archives Unleashed Toolkit allows a user to see a list of URLs that have been crawled, such as the results in table 6.1. There we can see that the WARC file contains mostly sites from Equalvoice.ca, but that other political parties are represented. The table will influence the analysis in the next two steps. Over time, these figures can also inform what has been or what has not been collected: has the Communist Party of Canada, for example, disappeared from archival records from a certain date onwards, reflecting shifting collection criteria or disappearing sites?
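For readers who want to recreate this first step outside of those tools, a minimal sketch in Python appears below. It assumes the open-source warcio library and a local WARC file named crawl.warc.gz (a hypothetical filename); the Archives Unleashed Toolkit itself does this work at much larger scale on a different software stack, so treat this only as an illustration of the idea.

```python
# Count how often each domain appears in a WARC file, producing a small
# table in the spirit of table 6.1. Assumes warcio (pip install warcio)
# and a local file named crawl.warc.gz; adjust the path to your own data.
from collections import Counter
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

domain_counts = Counter()

with open("crawl.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # Only count archived responses, not requests or metadata records.
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        if url:
            domain_counts[urlparse(url).netloc] += 1

for domain, count in domain_counts.most_common(10):
    print(f"{domain}\t{count}")
```

Even a crude count like this can flag the collection questions raised above, such as a domain that drops out of the crawls after a certain date.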


Fig 6.4 Plain text extracted from a collection of websites about the 1917 Halifax Explosion.

Extracting all of the plain text from a web archive, or that of certain domains, is often the next step. Throughout this book, we have seen that while looking at pages one by one through the Wayback Machine can be very illuminating, exploring many pages at scale can let you find broader patterns within a large collection. Getting a computer to read sources for you entails a fairly steep learning curve, although thanks to developments in the digital humanities and computational social science it is no longer as daunting as it might have been only half a decade ago. A useful starting place for text analysis is the aforementioned Programming Historian, which has lessons on using programming languages or off-the-shelf tools to explore content. Text analysis courses are also a staple of digital humanities training institutes, such as ones held annually in Victoria, British Columbia; Oxford, England; Guelph, Ontario; and elsewhere. Finally, conceptual introductions can be found in several textbooks, including Exploring Big Historical Data and Humanities Data in R.10
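The extraction step itself can also be scripted. The sketch below is one hypothetical way to pull the visible text out of the HTML pages in a WARC file, again assuming the warcio library plus BeautifulSoup (pip install warcio beautifulsoup4); the Archives Unleashed tools produce a similar plain-text export, like the one in figure 6.4, without any coding.

```python
# Extract plain text from the HTML responses in a WARC file.
# A sketch under stated assumptions: warcio and beautifulsoup4 installed,
# and a local file named crawl.warc.gz.
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

with open("crawl.warc.gz", "rb") as stream, \
        open("plain_text.txt", "w", encoding="utf-8") as out:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        content_type = record.http_headers.get_header("Content-Type") or ""
        if "html" not in content_type:
            continue  # skip images, PDFs, stylesheets, and so on
        url = record.rec_headers.get_header("WARC-Target-URI")
        html = record.content_stream().read()
        text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
        out.write(f"{url}\t{text}\n")  # one line of URL and text per page
```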


Once text is extracted from a web archive, using a tool like the Archives Unleashed Toolkit or Cloud, it might look quite a bit like figure 6.4. Here I have used an Archive-It collection from Dalhousie University dealing with the memory of the 1917 Halifax Explosion. Meaningfully exploring this much text can require several different text analysis approaches. You could just begin by reading all the text, but the size of web archives often means that it could take days, weeks, or even months to do so. Computers can come to the rescue by reading the text for us and making sense of it for our interpretation. The first approaches often entail just doing keyword searches with a text editor. How often does a given word appear? Are you looking for a certain individual or location? In some cases, this might narrow the text file down sufficiently. If not, more work may be in order. The next step might be looking for concordance lines, or lines that show a given keyword in context (or KWIC ). For example, what words tend to appear to the left and right of a word – to describe a person, for example, or to help find recurring phrases? Alternatively, you might be interested in how terms evolve throughout a website – perhaps the word climate appears more in 2010 and less in 2014. One great introduction to text analysis is Voyant Tools, co-developed by McGill University’s Stéfan Sinclair and the University of Alberta’s Geoffrey Rockwell. Its motto is “Reveal Your Texts.” Available at http://voyant-tools.org, and also through an open-source download to run on your own computer, Voyant Tools works on small to medium-sized collections of text. Figure 6.5 shows Voyant Tools after pasting in the Halifax Explosion plain text from figure 6.4. Rather than reading the text file one page at a time, the visualization reveals patterns throughout the archive. From the Voyant interface, we can begin to learn things from the text: • Frequently occurring words: In the upper left-hand corner of the Voyant dashboard is a “word cloud.” The bigger the word, the more often it appears in the text. At a glance, we can see the rough topics that this web archive is about: not just the Halifax Explosion (which we knew from the collection’s title) but also that museums, libraries, archives, as well as stories about the explosion, are all represented.

Fig 6.5 Voyant tools "revealing" the Halifax Explosion Web Archive.


• Overall statistics on the collection: We can see the most frequent words at lower right, as well as the total length, the number of words, and unique words.
• Concordances: In the lower right-hand column we can see the keywords in context idea that I mentioned earlier. What are the five words that appear to the left of a given word and to the right of it? In this case, I have searched for "Halifax" – in what context does that word appear? As before, we can see that the word Explosion follows several mentions of the word Halifax – but so too do the names of a library and museums.
• Word trends: Voyant allows you to load corpuses of text and see how words evolve. Where in the web archive is the word museum more concentrated than other places? Or is it consistent, seeing use throughout the entire archive? In the upper right we can see the trend diagram of selected terms.
Voyant can be a useful starting place for text analysis. As needs begin to outstrip it, a user may need to turn to more sophisticated computational approaches. The resources discussed above are worth exploring in detail.
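When Voyant begins to strain under the size of a collection, even a few lines of code can reproduce the keyword-in-context idea. The sketch below assumes the plain text file produced earlier (plain_text.txt is a hypothetical name) and simply prints five words of context on either side of a search term.

```python
# A minimal keyword-in-context (KWIC) display over an extracted text file.
# Assumes a plain text export such as the one produced in the earlier sketch.
KEYWORD = "halifax"
WINDOW = 5  # words of context on either side

with open("plain_text.txt", encoding="utf-8") as f:
    words = f.read().split()

for i, word in enumerate(words):
    # Compare a lightly cleaned, lower-cased token against the keyword.
    if word.lower().strip('.,;:"()') == KEYWORD:
        left = " ".join(words[max(0, i - WINDOW):i])
        right = " ".join(words[i + 1:i + 1 + WINDOW])
        print(f"{left} [{word}] {right}")
```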

Fig 6.6 Halifax Explosion loaded into the Gephi Visualization Platform.


Finally, open-source tools interpret the hyperlink networks extracted from web archives. We have seen examples of the source and target tables in this book, such as table 3.1, where we see how many links from each domain point to another domain. Most visualizations used in this book are from Gephi, an open-source and free graph visualization platform that is available at https://gephi.org. A program that works well on Windows, Mac, and Linux computers alike, Gephi allows a user to load up files and work with them directly – to create not only visualizations, but to use the "data laboratory" functions of the software to explore the link graph statistically. The Archives Unleashed Toolkit exports to a Gephi file format (GraphML or GEXF) natively, and the Archives Unleashed Cloud does basic layouts on the networks to let a user import them and begin working with them right away. This is important, as by default there is no standard way to lay out a graph – they are just collections of origins and targets. Using networks requires some familiarity with network theory. The foundation of any network is nodes and edges. In the cases that we use in web archival research, nodes tend to be websites (in most of the cases throughout this book the nodes have been domains – i.e., all of the sites within Liberal.ca are clustered into one node) and edges are the hyperlinks between them. From this assemblage of nodes and edges one can compute many different things: the community of a website (what other websites does it tend to link to and be linked to by that are not shared by other sites); which websites are more central than others; and in other cases, the ability to zoom in and see the "ego" network of a given website, or whom it links to and is in turn linked from. There are several accessible scholarly introductions to network theory.11 An example may help illuminate. Using the Halifax Explosion collection example, if we used the Archives Unleashed Toolkit or Cloud to extract the links, Gephi would load it as we see in figure 6.6. Here we see the major site – 100years100stories.ca – as well as the sites that it links to: social media platforms, museums, archives. This is a very simple network, however, as it is just one major site that then links out to other sites.
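Before opening Gephi, it can help to poke at the raw link data directly. The sketch below assumes a simple two-column CSV of source and target domains, of the kind described above (links.csv is a hypothetical filename), and counts which domains receive the most inbound links.

```python
# Count inbound links per domain from a source,target edge list.
# Assumes a CSV with two columns (source, target), one hyperlink per row.
import csv
from collections import Counter

in_degree = Counter()

with open("links.csv", newline="", encoding="utf-8") as f:
    for source, target in csv.reader(f):
        if source != target:      # ignore self-links within a domain
            in_degree[target] += 1

# The most-linked-to domains are often the first places worth a closer
# look back in the Wayback Machine.
for domain, count in in_degree.most_common(10):
    print(f"{domain}\t{count}")
```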

Fig 6.7 A slightly more complicated Gephi Visualization of the Nova Scotia Artist-Run Centres Collection.


Figure 6.7 shows a more complicated network, in this case, an Archive-It collection of Artist-Run Centres in Halifax, Nova Scotia. Inside the Gephi interface we can see a few things as well. Beyond the network diagram at centre, which provides a sense of who links to whom (and what sites are shared between pages and which ones are different), there are many different options:
• Ability to change the appearance of nodes (domains) and edges (links between them): Should nodes (domains) be bigger or smaller on the basis of how many times they are linked to? How many times do they link out? Some combination of the two? What is their relative PageRank within the network? How should nodes be coloured?
• Context of the graph: How many nodes are there? How many edges?
• Statistics: Note all of the options at right. A user can compute what the "average" website is in terms of links in and out, which websites are the most central, or other measures such as PageRank and beyond.
• Layout of the graph: Finally, the actual physical placement of an edge or a node is arbitrary. Without a layout, however, a graph is just a jumble of random placements. Using Gephi's layout functions, various models can be used to move the nodes and edges around: simulating gravity, placing certain ones with more connections in the middle, or beyond. The actual layout selected often depends on what a user is trying to get out of the visualization.
None of this is meant to be a comprehensive introduction to Gephi, but rather a hint at what is possible. In many cases, network analysis is a useful way to find further sites and pages to view: to explore, for example, the most popular pages within a web archive, or the ones that link to certain things. Other tools can challenge the primacy of text-based research, such as with images. One might use the Archives Unleashed Toolkit to extract all images from a web archive, for example, and be confronted with thousands of image files. Image analysis is a rapidly evolving research area within the digital humanities, and a recent edited collection on the topic will hopefully help ignite conversations about the possibilities in computer vision.12 Unfortunately, the size of images and their relative novelty in analysis means that there is no simple solution for image analysis in the same vein as Voyant Tools or Gephi. ImageMagick is a fully featured suite of tools for working with images and is both free and open source. You can find it at http://imagemagick.org. It does, however, require the use of the command line. In the previous chapter, I used ImageMagick to work with the images from GeoCities in order to reflect on community. The infrastructure around analyzing web archives is rapidly expanding. Even as I write this, our research team is exploring the use of neural networks to identify images within web archives, as well as attempting to find better models to classify websites into sets of user-defined topics (i.e., all sites discussing "climate change" or "health"). During this rapid development, the right place to explain all of this is not in a book, but within blogs, social media, and rapidly evolving conference papers. Even technical journal articles can date too quickly. But the basics of knowing what is inside a collection, text analysis, and network diagrams strike me as core elements that will lie at the heart of web archival research for years to come.
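As one concrete example of that image-focused work, ImageMagick's montage tool can tile thousands of extracted images into a single composite for visual browsing. The call below, wrapped in Python to match the other sketches, is a plausible version of the GeoCities approach described in the previous chapter rather than the exact command used there; it assumes ImageMagick is installed and that the extracted images sit in an images/ folder.

```python
# Tile a folder of extracted images into one large composite with ImageMagick.
# Assumes ImageMagick's "montage" command is installed and on the PATH,
# and that the images extracted from the web archive are in ./images/.
import glob
import subprocess

image_files = sorted(glob.glob("images/*.gif") + glob.glob("images/*.jpg"))

subprocess.run(
    ["montage", *image_files,
     "-tile", "50x",        # 50 thumbnails per row, as many rows as needed
     "-geometry", "+1+1",   # a small gap between thumbnails
     "montage.png"],
    check=True,
)
```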


As Tools Expand, the Questions Do Too

All of this is expanding the questions that we can ask of historical sources in the web age. The relationship between tool and historical question often needs to be continually interrogated. If one can look only at lists of URLs, or of file types, or pages one-by-one in the Wayback Machine, questions will skew towards the kinds of answers that one can glean from them. Conversely, a scholar who can extract all or certain images or text can ask increasingly expansive questions about what might be found in these web archives. Finally, recent developments in artificial intelligence are beginning to suggest even more sophisticated approaches to machine learning that might continue to expand the research questions we can ask. The scope of early scholarship on web archives was limited by the restriction of available tools. Before the advent of relatively easy-to-use analytical tools, analysis was limited mostly to using the Wayback Machine, which may have inhibited the early development of web history. Early analysis tools were limited as well: they were too difficult for many to use, and many computing environments had relatively marginal power. One main current of this book has been an exploration of how the discipline of history will be affected by the changing historical record. The political historian can now begin to explore how everyday people engaged with political parties and the everyday process of policymaking, not just through letters to ministers and editors, but through their blogs and social media feeds, and a cultural historian can now understand, say, the Pokémon GO phenomenon not through New York Times reporters but from the personal accounts of players on the web and Twitter. Furthermore, as we saw in chapter 1, the expanding historical record means that a historian of youth and childhood no longer needs to rely on police reports and sociological studies to glean morsels of what young people might be thinking but can explore their (mediated) thoughts directly through web archives. This helps to expand the social history mission of hearing from not only more voices and people, but those who were not traditionally represented in the documentary record. Throughout the broader firmament of the digital humanities, too, we can see examples of what these expanding tools can mean for historical
questions. The digital humanities have been particularly involved in research questions surrounding literary analysis. The Journal of Cultural Analytics, launched in 2016, is "dedicated to the computational study of culture … [using] computational and quantitative methods to the study of cultural artifacts (text, image, sound) at significantly larger scales than traditional methods."13 Articles have ranged from explorations of performance styles and topic modelling of novels to data mining databases of what readers checked out of libraries to understand how reading habits have developed over time. Some of the questions explored with cultural analytics could have been asked before – albeit in many cases through anecdote or by doing close readings that could generalize to cover the whole – and others, such as those pertaining to corpora of thousands of novels, could simply not. The scope of historical research thus can expand, although the hard work of interpretation, crafting narrative, and beyond remains. A theme throughout the book is that web history is not a "scientific history" – we can write and develop algorithms to explore text at scale, but a human still needs to figure out what that means for understanding the human condition. Indeed, as the digital age and the archived web become more central to understanding human culture and activity, such questions will need to be explored in more depth by more researchers – complicating our current understanding of historiography and the philosophy of history.

Conclusions

Whereas historiographical currents evolve over years or even decades, thanks to the nature of ongoing conversations among thousands of historians as well as the pace of academic publishing, computational approaches to culture can seem to change in a matter of days or even hours. Sometimes this is due to sudden software shifts: a major code library that other software depends on changes, making analytical tools unreliable or revealing security issues. At other times, this is due to the rapid development of software: a new programming language or approaches that dramatically increase efficiency or usability, at the cost of making all screenshots and code examples immediately outdated. Yet as software matures, the potential for rapid, game-changing functionality diminishes. The Archives Unleashed Toolkit will probably be a bit
different by the time you are reading this book, but Heritrix will likely be very similar to what is described above. However, some basic building blocks of the web archiving collection-to-analysis lifecycle should remain relatively constant. Crawling is relatively mature, and even as Webrecorder rapidly evolves, the basic service and concept are so germane to citizen-led web archives that the concept of easy web archiving should be here to stay. For analysis tools, things will likely develop even more rapidly, but the basic elements of a scholar wanting to find out what is inside a collection, wanting to leverage the hyperlinks that are so intrinsic to the modern web to find content, and looking to explore text at scale should remain relatively consistent. Many of these tools will require algorithmic awareness and an understanding of programming, but with the advent of the resources discussed in this chapter, that need no longer be a daunting prospect for a humanist or a social scientist. Historians should not have to become coders to use web archives, but some basic skills will help them interact with born-digital sources on their own merits: mediated through a computer, as opposed to printing them off or viewing them in a rote, traditional way. It is an exciting time to be a web historian. Yet the stakes have never been higher – as we will see in the next and final section, the future of the historical profession is at risk.

CONCLUSION

Our society used to forget – put another way, we did not leave so many traces of ourselves behind. Now we can better remember, but on a scale that will decisively change how historians work. The historical record has traditionally skewed to those who, as a result of their positions of privilege and influence, have been able to imprint themselves on the historical record, as well as those who found themselves there for reasons of infamy. The web changes that. While it does not mean the web’s historical record will be democratically representative of society – there are still considerable barriers to web access and publishing along lines of race, ethnicity, class, and gender – the web gives the opportunity to more people than ever before to put themselves on the historical record. Things are now being recorded that would never have been before, by people who would never before have left a documented life for historians. This book has argued that this is a big shift that deserves our attention. Scholarship covering periods after 1996, the year widespread web archiving began at the San Francisco–based Internet Archive and several national libraries around the world, will be credible only if it incorporates this born-digital information. Imagine writing a history of Donald Trump’s presidency or the 11 September terrorist attacks without using archived websites. Or imagine approaching the subject of the Iraq war without considering the posts and thoughts of deployed soldiers as they played out across the web. The same goes for any number of social and cultural topics from celebrities like Michael Jackson to social phenomena like Tamagotchis. It would be intellectually dishonest to tackle those topics without turning to the web. And the 1990s are history. We are now at the same distance from the web’s birth as the first historians of the 1960s were from their subject.
Historians began to interpret the events of 1968 a mere twenty years after that pivotal year. In another ten years, a vibrant and divisive historiography had formed. As historians, we need to begin thinking about born-digital sources now and to lay the intellectual groundwork for this impending shift. As we reach the book’s conclusion, a few final reflections on the adaptations that the historical profession needs to achieve are in order.

The Historical Profession in the Age of Digital Abundance

Historians need to be ready to face the forces that I have outlined throughout this book. Until now, I have discussed historians mostly in the abstract – as members of a profession dedicated to researching and writing about the past. But historians also constitute a profession by virtue of their training and standards, their professional societies and networks, and their culture. Culture is perhaps the most significant obstacle to meaningful historical use of web archives. Historians can have hundreds of computing clusters waiting for inputs, web archives collected and assembled, and legal deposit legislation passed to enable this form of transformative scholarship – but historians need to be willing and able to rise to the occasion. To count on such change happening, there must be incentives. How can this happen? Things are slowly changing within the historical profession to better enable digital scholarship. Such transformation can be measured by the growing proportion of digital history positions advertised within the academic tenure-track market as well as the rising profile of digital offerings at flagship conferences such as the American Historical Association, as represented in books, well-attended workshops and conference panels, and conversations that take place within its professional bodies. Yet the profession remains devoted largely to traditional, textual scholarship. We can see this entrenchment in the professional reluctance to engage in two key dimensions of digital work: collaborative models of scholarship and quantitative analysis. Historian James Baker has noted that while an understanding of quantitative methodologies is necessary for digital history to thrive, recent standard historical methods textbooks have removed quantitative sections.1 Beyond textbooks,
too, lie struggles for recognition of digital scholarship within history departments. Digitally enabled scholars are not pariahs – they do get hired and conduct high-profile scholarship – but tenure and promotion standards at many research universities continue to insist on traditional monographs for career advancement. Just as importantly, too, the profession is still structured in a traditional way. Geographic areas are overwhelmingly privileged in job advertisements, as borne out by surveys of the labour market. In 2016–17, for example, there were 9 primarily “digital” jobs in the American Historical Association’s Career Center – as opposed to 44 Asianists, 55 Europeanists, 26 Latin Americanists, 138 Americanists, or 35 global scholars.2 This is not to disparage these other important specializations, many of which are critical for moving the profession away from a narrow focus on certain parts of the world, but simply to point out the privileging of geography in academic framing. In my own history department at the University of Waterloo, geography is seen as the primary conceptual organizing framework for much of our internal administrative work. This emphasis intrinsically privileges subject matter over method, as do many historical journal editors. Methods are often cut from journal manuscripts or buried deep in footnotes. A chief tension within digital history has been whether digital historians should adopt traditional methods of framing and argumentation or continue to stress methodological innovation, a tension (among several) that lies at the heart of the recent “Arguing with Digital History” white paper.3 Even more troubling, there are still few places in which to get a PhD in digital history, but that will hopefully change as more graduates enter the field. However, as historians are generally hired as geographic specialists with digital being a secondary preference, the teaching and research interests of these individuals are often torn between traditional and digital scholarship. Methods are difficult to professionally privilege. Under normal conditions, this position might be tenable, but with a paradigm shift underway to digital, we need more historians to begin to engage with and develop the theoretical and scholarly apparatus that can enable new forms of scholarship. In the short term, however, interdisciplinary scholarship is one way forward, a theme that has appeared throughout this book. Even that brings with it substantial obstacles, however. The best example of this
historical inattention to collaboration and resistance to numbers comes from the high-profile example of the Culturomics project at Harvard University, released in 2010 and continually revised afterward. This is almost certainly the most successful and high-profile digital history project ever. The team indexed word and phrase frequency across five million books (stretching back to the sixteenth century), subsequently expanded to over eight million.4 Researchers can visit http://books.google.com/ngrams/ and search their own queries, allowing them to see the rise and fall of cultural ideas and phenomena through targeted keyword and phrase searches. For example, you can search for and plot the relative rise and fall of the terms nationalization and privatization. We see the former, the process by which the government takes over sectors of the economy, rising with the Second World War, and peaking during the era of big government in the 1960s and 1970s, before being replaced with ideas of privatization (expressed by Thatcher and Reagan, for example) in the 1980s and 1990s. While not perfect – you cannot see the context in which the word appears, so if you searched for tax you would not know if people hated, loved, or were neutral to the idea – it is a fascinating way to track the expression of ideas through large sets of cultural data. Despite the central conceit of longitudinal analysis, however, and the fundamentally historical nature of the project, there were no historians involved in its conception. The Science journal article has a relatively lengthy author list, including thirteen people and the Google Books team, but includes no named historians. Other disciplines and fields were well represented, including psychology, biology, computer science, mathematics, and publishing. Given the already important role that digital sources play for historians, one that will accelerate with the rise of born-digital material such as web archives, this does not bode well. At the time, the president of the American Historical Association, Anthony Grafton, noted this "striking" absence. Arguing that "the lack of historical expertise occasionally showed" in the project, especially given the strengths of Harvard's own history department in this field, Grafton also reflected on the obstacles to collaboration in the field.5 The historical profession still does not properly reward multiple collaborators on projects, and often does not have physical spaces on campuses to facilitate such exchanges. Indeed, tenure standards – the definition of
how we expect junior researchers to behave – largely adhere to the sole authored monograph as the gold standard of scholarly achievement.6 Responding to Grafton, the n-gram project leaders, scientists Erez Lieberman Aiden and Jean-Baptiste Michel, shed light on the current state of the profession vis-à-vis collaborative digital projects. They did approach historians to participate, the pair claimed, and while some served as advisors, every person identified as an author on the Science article had “directly contributed to either the creation of the collection of written texts (the ‘corpus’), or to the design and execution of the specific analyses we performed. No academic historians met this bar.” Moreover, they wrote, while the “historians who came to the meeting were intelligent, kind, and encouraging” they didn’t seem to have a good sense of how to wield quantitative data to answer questions, didn’t have relevant computational skills, and didn’t seem to have the time to dedicate to a big multi-author collaboration. It’s not their fault: these things don’t appear to be taught or encouraged in history departments right now.7 While historians have been and are involved in many digital projects, some discussed in this book and others throughout the much broader field, this general indictment holds true.8 Historians need to learn how to work better as part of interdisciplinary teams and must gain greater technical ability and competence themselves. Recognizing this critical need, the Programming Historian, discussed in the last chapter, has been devoted to training historians in basic techniques. The effect is not to create junior information retrieval specialists, or computer scientists lite, but rather to equip historians with the fundamentals of using their computers to interact with sources and to think algorithmically. Even if they are working on a team, historians will still need computational knowledge. We need to prepare the next generation of historians to work with the deluge of born-digital sources that will fundamentally transform their profession, while equipping current historians who are interested in these sources with the tools – and the mindset – that they will need. Some of these changes can come through the growing number
of professional institutes and events being run and workshops being held at annual meetings such as the American Historical Association.9 This is all necessary because, ultimately, we will need to use web archives and other born-digital repositories to uncover the history of the 1990s and beyond – and such access requires computational methodologies. The historical profession and historians will need to prepare themselves for the coming shift in the way that history is researched and written (as well as taught, although that is somewhat secondary in this book). But how to actually do this? Addressing historians’ deficit in skills and knowledge will require attention on several fronts. First, historians will need to have a fundamental knowledge of the underpinnings of how digital primary sources are produced, preserved, and accessed – the “digital basics.” Many of these have been outlined in this book, particularly in the first two chapters and the last one. This will allow historians to understand their sources: why they have been preserved, how they were made, what components are at play, and crucially, how they can be scrutinized. Second, historians must learn how to digitally interact with the conventional sources that complement, enhance, and help contextualize the digital objects they are interrogating: citation management software, databases, optical character recognition for paper-based sources, and a comprehensive understanding of the advantages and disadvantages of repositories such as Google Books or the Internet Archive. Third, as argued in chapter 6, historians need access to and training in tools that process large amounts of data: to find and visualize patterns, recurring text, from topic modelling, network analysis, to an understanding of how metadata can help illuminate content. Finally, what underpins all of the above is the need for basic algorithmic fluency. This does not mean that all historians need to become programmers – that would be exclusionary and not speak to the core skills and abilities of what it means to be a historian. But it does mean that they need to think algorithmically: to think about how code might operate, how digital objects are created and stored, and to realize the human dimensions behind programming languages. Decisions are never neutral, and historians will need to be equipped to evaluate the tools and platforms that they use. As a shift in how we do history, this may require pedagogical changes. Compared to those who work in many other disciplines in
the humanities, historians traditionally receive minimal training in the production of their sources: in Canada at least, many graduate programs have no required methodology courses, and those that do skew heavily towards the arguments of historians rather than "how to do history." We learn on the job. Most PhD and some MA programs require second-language proficiency: not just so that particular sources can be read, but to instill new ways of thinking and understanding the world. Coding could be part of this. And, as "web history" begins to emerge as a discrete field of study, mounting courses (or weeks) on the topic will help as well. There is no magic bullet. This all leaves one last question, however: who can lead on this major paradigm shift in the historical profession? There are no simple answers – for example, the profession has been grappling with recognition of digital contributions in tenure and promotion for well over a decade. Leadership and inspiration can come from professional organizations like the American or Canadian Historical Association, and professional development workshops, publications, and other affiliates can help normalize digital research – making it appear to be the norm within the profession rather than an outlier. The American Historical Association's "Guidelines for the Professional Evaluation of Digital Scholarship by Historians," which argue against the centrality of the monograph and for the evaluation of digital scholarship in its original format and on its own merits, are an important first step.10 But given the importance of born-digital scholarship to the future study of the historical profession, even firmer action should be taken. Showcasing exemplary scholarship at annual conferences is one thing, as is providing recommendations for tenure and promotion. But moving to recognize interdisciplinarity at conferences – imagine plenary sessions featuring historians and librarians talking about these developments – as well as trying to lobby departmental chairs even more concretely is quite another. Some of this is undoubtedly already happening, thanks to the dedicated staff and leaders at the American Historical Association, but the existential threat presented by born-digital resources makes this all the more pressing. Ultimately, though, academics should not always look to others to implement such changes. Academic historians are in a privileged position: we govern our own norms, evaluate and review each other's
scholarship, edit our leading publications – in short, historians decide what historians do. Change cannot simply come from above but needs to come from individual history departments, units, and independent scholars. Through books like this, articles, white papers, and beyond, a slow professional shift may be underway.

The Coming Shift

We have reasons to be both optimistic and pessimistic as we begin to engage with web archives. Optimistic because of the opportunities Big Data presents us to incorporate voices that would never have been recorded in traditional archives, to pull our gaze back from individual stories and contextualize them, and to explore a more detailed and textured historical record than ever before. None of this represents a perfect record, as this book has reinforced in many respects. Ironically, despite the abundance of information, these modern archives are more fragile in some ways than ever before. Websites flicker into existence, but just as quickly disappear. Server fees go unpaid, corporations shut down, interests change, passwords are forgotten, and contributors die. The illustrative case of Occupy Wall Street, explored earlier in this book, where only 41 per cent of sites on that topic collected during the 2011 events remained as live websites by 2013, suggests just how transient our digital record can be. Paper, especially in its acid-free archival form, is very durable. Our digital ones and zeroes are not. Indeed, while many national libraries have begun web archiving and are beginning to gather large swaths of the web, they are relative latecomers to the game. Had it not been for the visionary activities of early pioneers such as the Internet Archive, some national libraries such as those of Sweden and Australia, and a few institutions like the University of North Texas in 1996, the "digital dark age" glumly prophesied by many in the early 1990s could have lasted much longer than the few years it did. Yet even sites that are preserved are vulnerable. In chapter 3 I compared a piece of paper to a website – the one piece of paper, with everything indelibly linked to it, versus the hundreds or even thousands of files that were necessary to reconstruct one relatively simple WordPress page. Of the content that is preserved, much of it will have
holes: missing comments plug-ins, missing images, even missing HTML elements. Those GeoCities pages that blinked so much in 1996 no longer blink today, given the differences in how browsers interpret pages. All of this may seem unduly negative. Herein lies the tension. On one hand, fragility, incompleteness, and erasure. Yet on the other, abundance: we have documented and preserved far more information than at any previous time in history, at a mind-boggling scale that requires specialized tools to even begin to quantify and make sense of. Neither narrative captures the full story. It is best to think of web archives as holding an unimaginably large amount of information, imperfectly preserved and necessarily partial as any corpus of historical evidence, but still representing a revolutionary change to how we preserve, access, and think about the past. This is happening in two respects: first, more information is being captured and saved than ever before; and second, and most importantly, that information is being preserved by, and tells us about, people who never before would have found themselves (much less put themselves) on the accessible historical record. This is new territory, and none of it is straightforward. The overwhelming majority of webpages collected were created by people who were unaware that their information would live on within the Internet Archive or other libraries. They did not consent to have their sites included in these repositories, nor did many have access to or knowledge of the robots.txt file that would exclude their sites from crawls such as those performed by the Internet Archive. There really is no better way to collect this material, however. Attempts to garner websites through an opt-in process have produced extremely low response rates – would you answer an unexpected email from a web archive that landed in your spam filter, requesting permission to store your webpage and perhaps other personal information? The obligatory opt-in is a method that would lead again to domination by powerful voices, corporations, and others. Proceeding instead by archiving as much of the web as possible, and handling the archived material ethically, is in everyone’s best interests. It is best for historians, who will have an extraordinary and unparalleled breadth of sources and voices to draw from. It is best for society at large, which benefits from a rich, complex, and multifocal historical record and resulting historiography. And it is arguably best for the individual creators of, and contributors to, web content. It is unlikely
that they can ensure long-term preservation of their contributions, and many of them will wish to. The web archive becomes their archive. Moreover, we live in a time when it is possible to not only suppress but falsify information – the era of "fake news." Independent and ethical web archiving, such as that pursued by the Internet Archive and national libraries, is a defensive mechanism against malicious interference in the documentation of the present (and past). And given the short lifespan of the average website, it is important to collect the material before it disappears. In some cases, people may be able to opt out of web archives – the Internet Archive, for example, will honour some requests to have content removed, giving some individual agency back to people. For this reason, the ethical onus needs to be shared between individual researchers and, perhaps, institutional review boards (IRBs) with expanded mandates. While some oral historians encounter IRBs with mild irritation, they do force historians to have a conversation: about the ethics of their research, the power dynamics between subject and interviewer, and the importance of securing informed consent, even though websites may fall into the broad category of "published material" and are thus exempt from many IRBs as currently constituted. Guidelines like those discussed above, and in other scholarly works on the ethical use of web archives and social media, could help individual researchers, librarians, and archivists identify, untangle, and resolve ethical dilemmas about privacy and the use of personal material without consent. Putting the onus on individual researchers is unfair to them, especially given the current paucity of training on digital resources that we just discussed. The road ahead will not be straightforward, but it is worth travelling. It will require historians to change – to adopt new standards and practices in our craft, to build our knowledge of computers and algorithms, and to work more collaboratively, building teams with colleagues in the digital humanities and computer science. It will require more ethical engagement, in thought and practice. It will involve mistakes along the way, as historians find their way in this new territory. This book is but a first salvo in what I hope will be a growing corpus of work in this field. Ultimately, it will be worth it. Web archives offer the prospect of incorporating more voices and more people. A more inclusive history is around the corner. We need to be ready.

NOTES

INTRODUCTION

1 Kimpton and Ubois, "Year-by-Year," 202.
2 For a wonderful, in-depth ethnographic observation of the Internet Archive, see Ogden, Halford, and Carr, "Observing Web Archives."
3 Niu, "Overview of Web Archiving."
4 The rapid pace of change means that these contemporary examples may seem hilariously outdated by the time you read this paragraph. Substitute Facebook, Tumblr, or Instagram with the social networks of the future.
5 Aiden and Michel, Uncharted, 195.
6 Scott, "Unpublished Article on Geocities."
7 See, for example, Romano, "'Controversy' over Journalist Sarah Jeong."
8 On the Media, "Calling for Back Up."
9 You can see it yourself at http://wayback.archive-it.org/4399/*/http://investigator.org.ua/.
10 Greenberg, This Machine Kills Secrets, 11–15.
11 Old Bailey, "Proceedings of the Old Bailey."
12 I did my MA on Canada's First World War, studying free-speech prosecutions. My PhD, published as a monograph, looked at the relationship between young workers, New Leftists, and youth more generally.
13 Gleick, Information, 396–7.
14 Carenini, Ng, and Murray, Methods for Mining and Summarizing Text Conversations, 1.
15 Noble, Algorithms of Oppression, 61.
16 Pew Research Center, "Internet/Broadband Fact Sheet."
17 Ibid.
18 World Bank, "Internet Users (per 100 People)."


19 A larger point evocatively made by both O’Neil, Weapons of Math Destruction; and Noble, Algorithms of Oppression. 20 Graham, Milligan, and Weingart, Exploring Big Historical Data, 3–4. 21 Anderson, “End of Theory.” 22 To name only a few examples: Isserman, If I Had a Hammer; Levitt, Children of Privilege; Kostash, Long Way from Home; and Gitlin, Sixties. 23 In Canada, for example, the still dominant interpretation of the period is provided by Owram, Born at the Right Time. 24 The World Wide Web foundation, for example, dates it to 12 March 1989, with the submission of the web’s proposal by Sir Tim Berners-Lee. 25 For example, Christy et al., Doing Recent History. 26 Vaidhyanathan, Googlization of Everything, 63. 27 Pariser, Filter Bubble, 103–4. 28 Rockwell and Sinclair, Hermeneutica, 14. 29 Gomes, Miranda, and Costa, “Survey on Web Archiving Initiatives”; Toyoda and Kitsuregawa, “History of Web Archiving”; National Digital Information Infrastructure and Preservation Program, “Preserving Our Digital Heritage.” 30 There are notable exceptions, of course. Very thoughtful reflection on these challenges can be found in Winters, “Coda.” See also Winters, “Breaking in to the Mainstream.” 31 Brügger, The Archived Web; Brügger, Web 25; Brügger and Schroeder, Web as History; Brügger, “The Archived Website and Website Philology”; Brügger, “Website History”; Brügger, “Web History and the Web as a Historical Source”; Brügger, Web History; and Brügger, “Digital Humanities in the 21st Century.” 32 Brügger and Milligan, SAGE Handbook of Web History. 33 Brügger, Archived Web, 7–8. 34 When historians have approached web archives, it is often in passing when considering digital sources or the field more broadly. For example, see Cohen and Rosenzweig, Digital History; Dougherty and Nawrotzki, Writing History in the Digital Age; Turkel, Kee, and Roberts, “Method for Navigating the Infinite Archive”; and Graham, Milligan, and Weingart, Exploring Big Historical Data.

CHAPTER ONE 1 Goldman, "All the President's Men." 2 Gleick, Information, 232. 3 Lin, "My Data Is Bigger Than Your Data." 4 Brügger, "Humanities, Digital Humanities, Media Studies, Internet Studies."


5 Doctorow, "Sen. Stevens' Hilariously Awful Explanation." 6 Schmidt and Cohen, New Digital Age, 1. 7 Taylor, People's Platform, 36. 8 Evocatively described in Blum, Tubes. 9 Schmidt and Cohen, New Digital Age, 89. 10 Schmidt and Cohen, New Digital Age, 82–3. 11 Futurism, "CERN Just Dropped 300 TB of Large Hadron Collider Data Free Online." 12 As I type this, for example, my propensity to continually "Ctrl + S" this file to Dropbox means that the letters I'm typing in Waterloo, Ontario, Canada, are continually being sent through to a data centre somewhere in the United States. 13 Hickman, "Mysterious Cable." 14 Lewis, Flash Boys, particularly 7–22. 15 ICANN, "Stewardship of IANA Functions Transitions to Global Internet Community." 16 See, for example, the American Registry for Internet Numbers, "ARIN Fee Schedule." 17 Thielman and Johnston, "Major Cyber Attack." 18 For a great overview of ARPA, see Weinberger, Imagineers of War. 19 Markoff, "Internet Pioneer." 20 An interesting alternative window onto the history of the Internet can be seen in Peters, How Not to Network a Nation. 21 Gillies and Cailliau, How the Web Was Born, 12–15. 22 Abbate, Inventing the Internet; Gillies and Cailliau, How the Web Was Born. 23 Baran, "On Distributed Communications Networks." 24 Licklider and Taylor, "The Computer as a Communication Device." 25 Ryan, History of the Internet and the Digital Future, 29–30. 26 Ryan, "Essence of the 'Net." 27 Dalal, Sunshine, and Cerf, "Specification of Internet Transmission Control Program." 28 Leiner et al., "Brief History of the Internet." 29 Testament to the power of Unix, in 2016 Microsoft Windows added a Unix subsystem to make developers happy. 30 Global adoption would come unevenly as the result of several factors, from political or economic to the very real limitations of character encoding formats, which made non-Latin alphabets difficult to use until the early 1990s. For more on this, see the expansive Goggin and McClelland, Routledge Companion to Global Internet Histories. 31 PR Newswire, "GeoCities Welcomes One Millionth 'Homesteader' Pioneer."


32 PR Newswire, “GeoCities Welcomes One Millionth ‘Homesteader’ Pioneer.” 33 The “killer app” comparison has been made by many, apparently including Sir Tim Berners-Lee. For more, see Deken, “Web’s First ‘Killer App.’” There were other early systems that aimed to do things similar to the Web, such as Minitel, but none ultimately had the same staying power or open standards. For more on Minitel, see Mailland and Driscoll, Minitel. 34 A central point in Barnet, Memory Machines. Barnet tells the story of hypertext without focusing on the Web, instead tracing the medium’s evolution between Vannevar Bush’s 1930s Memex and the 1980s. 35 Berners-Lee, Weaving the Web, 41. 36 Bush, “As We May Think.” 37 As quoted in Rosenzweig and Grafton, Clio Wired, 203. 38 Nelson, Literary Machines. 39 Engelbart, “Mother of All Demos.” 40 Berners-Lee, Weaving the Web, 14. 41 Berners-Lee, Weaving the Web, 9–11. 42 Berners-Lee, “Enquire Manual.” 43 Berners-Lee, “Information Management.” 44 Gillies and Cailliau, How the Web Was Born, 183. 45 Berners-Lee, “Information Management.” 46 Gillies and Cailliau, How the Web Was Born, 185–90. 47 Berners-Lee and Cailliau, “WorldWideWeb.” 48 Gillies and Cailliau, How the Web Was Born, 29–30. 49 Gillies and Cailliau, How the Web Was Born, 46; Bryant, “20 Years Ago Today.” 50 World Bank, “Internet Users (per 100 People).” 51 Perrin and Duggan, “Americans’ Internet Access.” 52 CBC News, “Canadians’ Internet Usage.” 53 Perrin, “One-Fifth of Americans Report Going Online ‘Almost Constantly.’” 54 See also Brügger, Archived Web, 24–6. 55 As quoted in Turkel, Kee, and Roberts, “Method for Navigating the Infinite Archive,” 62. 56 Milligan, Rebel Youth. 57 For example, Pew Research shows in their study of US users that it skews young, college-educated, and affluent (above $50,000 household income). See Duggan, “Demographics of Social Media Users.” 58 Summers, “Ferguson Twitter Archive”; Internet Archive Global Events, “Ferguson, MO – 2014”; Kaplan, “Masked Gunmen Seize Crimean Investigative Journalism Center.”


59 Library and Archives Canada, “Library and Archives Canada Acquisition Update.” 60 Center for History and New Media, “Occupy Archive”; Emory News Center, “Emory Digital Scholars Archive Occupy Wall Street Tweets.” 61 While access remains problematic, at least the information was collected between 2013 and 2018. See Luckerson, “What the Library of Congress Plans to Do with All Your Tweets.” For a more modern example of datasets collected by individual researchers and activists, you can visit DocNow. “Tweet ID Datasets.” 62 Blank and Dutton, “Next Generation Internet Users,” 40–2. 63 See, for example, Mosby, Food Will Win the War. 64 Schmidt and Cohen, New Digital Age, 4. 65 Schmidt and Cohen, New Digital Age, 32. 66 British Library, “Introduction to Legal Deposit.” 67 Scott, “Archiving Britain’s Web.” 68 Nelson, “Profiling Web Archives.” 69 Winter, “Roberto Busa, S.J.” 70 A powerful overview of history’s engagement with numbers can be seen in Gaffield, “Surprising Ascendance of Digital Humanities.” 71 Fogel and Engerman, Time on the Cross; Katz, The People of Hamilton, Canada West. 72 Anderson, “History and Computing.” 73 Robertson, “Differences between Digital History and Digital Humanities.” 74 Moretti, Graphs, Maps, Trees, 3–4. 75 I have noted this process in Milligan, “Mining the ‘Internet Graveyard.’” 76 See, for example, Guldi and Armitage, History Manifesto, 9–10. 77 Braudel, The Mediterranean and the Mediterranean World. 78 Graham, Milligan, and Weingart, Exploring Big Historical Data, 3. 79 Cohen et al., “Data Mining with Criminal Intent.” 80 Klingenstein, Hitchcock, and DeDeo, “Civilizing Process in London’s Old Bailey.” 81 Klein et al., “Trading Consequences.” 82 Scheidel, Meeks, and Weiland, “ORBIS”; Rosen, “Plan a Trip Through History with ORBIS.” 83 Evidence of the digital penetration into the research lives of historians can be systematically seen in Rutner and Schonfeld, “Supporting the Changing Research Practices of Historians.” 84 Milligan, “Illusionary Order”; Putnam, “The Transnational and the Text-Searchable.” 85 See for example Fogel and Elton, Which Road to the Past? 86 Turchin, “Arise ‘Cliodynamics.’”


87 Anderson, “End of Theory.” 88 Broader context over this debate can be found in Novick, Noble Dream. 89 Historical methods in historical articles and monographs are often hidden, tucked away in footnotes or appendices in many cases, or even nonexistent. 90 Milligan, “Illusionary Order.” 91 Gibbs and Owens, “Hermeneutics of Data and Historical Writing.”

CHAPTER TWO 1 Berners-Lee, Weaving the Web, 49. 2 See Cailliau, “Hypertext and WWW Information.” When providing URLs from the Wayback Machine, these citations provide the date of the scrape itself, not the date of document itself. 3 Economist, “Difference Engine.” 4 Palmer, “Earliest Web Screenshots.” 5 Tsukayama, “CERN Reposts the World’s First Web Page”; CBC News, “1st Website Ever Restored to Its 1992 Glory.” 6 Tsukayama, “CERN Reposts the World’s First Web Page.” 7 Suda, “CERN: Line Mode Browser.” 8 Suda, “Meyrin: CERN Terminal Font.” 9 Ridener and Cook, From Polders to Postmodernism, 7. 10 Pearce-Moses. “Glossary of Archival and Records Terminology.” 11 Theimer, “Archives in Context and as Context.” 12 Ridener and Cook, From Polders to Postmodernism, 2–5. 13 Owens, “What Do You Mean by Archive?” 14 Merity, “Navigating the WARC File Format”; ISO, “ISO 28500:2009.” 15 Our Marathon, “Our Stories, Our Strength, Our Marathon.” 16 Owens, “What Do You Mean by Archive?” 17 Owens, “What Do You Mean by Archive?” 18 Bailey, “Disrespect des Fonds.” 19 Brügger, “Website History and the Website as an Object of Study,” 150. 20 International Council on Archives, “ISAD(G).” 21 Canadian Committee on Archival Description, “Rules for Archival Description.” 22 Milligan, Ruest, and St.Onge, “Great WARC Adventure.” 23 O’Dell, “Describing Web Collections.” 24 Lesk, “Preserving Digital Objects.” 25 As quoted in Meloan, “No Way to Run a Culture.” 26 Gitelman, Paper Knowledge, 54. 27 A fascinating collection of primary documents can be found at “Robert C. Binkley.”


28 See the broader overview in Webster, “Users, Technologies, Organisations.” 29 Brown, Archiving Websites, 8–10. 30 Gillmor, “Future Historians Will Rely on Web.” 31 Minard, Internet Archive. 32 Internet Archive, “Internet Archive.” 33 National Library of Australia, “History and Achievements.” 34 National Library of Australia, “History and Achievements.” 35 Hoffman, “Development of the CyberCemetery (2011).” 36 Arvidson, Persson, and Mannerheim, “Kulturarw3 Project.” 37 Schwartz, “Page by Page History of the Web.” 38 As quoted in Taylor, “Average Lifespan of a Webpage.” 39 As quoted in Taylor, “Average Lifespan of a Webpage.” 40 Lawrence et al., “Persistence of Web References in Scientific Research.” 41 Sanderson, Phillips, and Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento.” 42 Zittrain, Albert, and Lessig, “Perma,” 167. 43 SalahEldeen and Nelson, “Losing My Revolution,” 1. 44 LaCalle and Reed, “Poster: The Occupy Web Archive.” 45 Berners-Lee, “Cool URIs Don’t Change.” 46 Alpert and Hajaj, “We Knew the Web Was Big …” 47 Ainsworth et al., “How Much of the Web Is Archived?” 48 Hale, Blank, and Alexander, “Live versus Archive,” 59. 49 Reed, “Introducing Archive-It 4.9 and Umbra.” 50 Ankerson, “Writing Web Histories with an Eye on the Analog Past.” 51 Dutton and Graham, “Introduction,” 10. 52 World Internet Project, “World Internet Project.” 53 The utopian perspective is provided in Schmidt and Cohen, New Digital Age. 54 Canadian Radio-television and Telecommunications Commission (CRTC), “New Media.” 55 Statistics Canada, “Household Internet Use Survey.” 56 Statistics Canada, “Canadian Internet Use Survey.” 57 Haight, Quan-Haase, and Corbett, “Revisiting the Digital Divide in Canada.” 58 Pew Research, “Internet/Broadband Fact Sheet.” 59 Whitacre, “Technology Is Improving.” 60 Masse, “Why Internet Is Expensive in Canada’s North.” 61 Hargittai, “Second-Level Digital Divide.” 62 Schradie, “Digital Production Gap.” 63 Hargittai and Walejko, “Participation Divide.” 64 Correa, “Participation Divide among ‘Online Experts.’”


65 Blank, “Who Creates Content?” 66 Leetaru, “How Much of the Internet Does the Wayback Machine Really Archive?” 67 Rosenthal, “You Get What You Get and You Don’t Get Upset.” 68 Maemura et al., “If These Crawls Could Talk: Studying and Documenting Web Archives Provenance.” 69 See https://web.archive.org/web/*/uwaterloo.ca. 70 Craggs, “Sorry, America.” 71 Carenini, Ng, and Murray, Methods for Mining and Summarizing Text Conversations, 1. 72 Niu, “Overview of Web Archiving.” 73 Google, “Inside Our Data Centers.” 74 Munroe, “Google’s Datacenters on Punch Cards.” It’s telling that the best glimpse into this is via a popular web comic rather than a rigorous source. 75 Vargas, “Link to Web Archives, Not Search Engine Caches.” 76 For more on Google, see Vaidhyanathan, Googlization of Everything; or Noble, Algorithms of Oppression. 77 Cadwalladr, “‘I Made Steve Bannon’s Psychological Warfare Tool.’” 78 Kosinski, Stillwell, and Graepel, “Private Traits and Attributes.” 79 Rossi, “Robots.txt Files and Archiving .gov and .mil Websites.” 80 Koster, “Important: Spiders, Robots and Web Wanderers.” 81 Webster, “When Using an Archive Could Put It in Danger.” 82 Rumsey, When We Are No More, 45. 83 For a snippet of the coverage, see Harding, “Timbuktu Mayor”; Khazan, “Here’s What Was in the Torched Timbuktu Library”; Zanganeh, “Has the Great Library of Timbuktu Been Lost?” 84 Comment posted at Milligan, “In a Rush to Modernize.” 85 As documented in Jason Scott’s tumblr at https://ourincrediblejourney. tumblr.com/. 86 ArchiveTeam.org, “Myspace.” 87 Bowman, “Myspace’s $20M Relaunch Deletes Its Remaining Users’ Blogs.” 88 Bowman, “Myspace’s $20M Relaunch Deletes Its Remaining Users’ Blogs.” 89 Hu, “AOL Home Page Glitches Irk Users.” 90 Wilson, “Attention AOL Hometown Users.” 91 It is worth noting that these sorts of solutions – in this case opening up the largely monochrome command prompt on your computer, writing some computer code, and systematically recovering your site – is outside of the technical expertise of all but a small handful of people.


92 Scott, "Eviction." 93 Scott, "Eviction." 94 Scott, "Datapocalypso." 95 Scott, "Datapocalypso." 96 Web.archive.org, "GeoCities Will Close Later This Year." 97 Buckler, "RIP GeoCities 1995–2009." 98 timothy, "Yahoo Pulls the Plug on GeoCities." 99 The same point as with AOL Hometown holds – this would have involved opening up a command prompt and running a wget command. If this does not make sense to you, you probably can sympathize with how this was not necessarily the best solution to preserve an individual's website! See Gasperson, "How to Move Your Site from Geocities to a New Host." 100 Scott, "Geocities: Lessons So Far." 101 kdawson, "Archive Team Is Busy Saving Geocities"; Modine, "Web 0.2 Archivists Save Geocities from Deletion"; Stimson, "Jason Scott Is In Your Geocities, Rescuing Your Sh*t." 102 Scott, "Geocities Torrent." 103 Cebula, "An Open Letter to the Historians of the 22nd Century." 104 Twitter, "How to Contact Twitter about a Deceased User," Twitter Support, 2018, https://help.twitter.com/en/rules-and-policies/contacttwitter-about-a-deceased-family-members-account. 105 Dropbox, "How to Access the Dropbox Account of Someone Who Has Passed Away?," https://www.dropbox.com/en/help/security/accessaccount-of-someone-who-passed-away?_locale_specific=en.
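To make the wget aside in note 99 concrete, a minimal sketch of the sort of command a departing homesteader would have needed to run might have looked like the one below; the output folder name is invented for illustration, and the address is simply a GeoCities neighbourhood URL of the kind cited elsewhere in these notes, standing in for a user's own site.

wget --mirror --convert-links --page-requisites --wait=1 -P geocities_backup http://www.geocities.com/EnchantedForest/1004/

The flags are standard wget options: recursively mirror the site, rewrite links so the local copy works offline, fetch embedded images and stylesheets, and pause a second between requests. That even a one-line command like this was, as note 91 observes, beyond the comfort of most users is precisely the point.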

CHAPTER THREE 1 American Historical Association, “Historians in Archives.” 2 Toyoda and Kitsuregawa, “History of Web Archiving”; Gomes, Miranda, and Costa, “Survey on Web Archiving Initiatives”; Kuny, “Digital Dark Ages?”; Song and JaJa, “Fast Browsing of Archived Web Contents”; Ball, “DCC State of the Art Report.” 3 The role of PDFs and similar document formats is discussed in Gitelman, Paper Knowledge. 4 Brügger, Archived Web, 26–8. 5 Baker, “Page, but Not as We Know It.” 6 A point also made by Brügger, Web History. 7 Brügger, Archiving Websites. 8 Ainsworth, Nelson, and Van de Sompel, “Only One Out of Five Archived Web Pages Existed As Presented.”


9 Ainsworth, Nelson, and Van de Sompel, “Only One Out of Five Archived Web Pages Existed As Presented.” 10 Koerbin, “Web Archiving - an Antidote to ‘Present Shock’?” 11 The first browser, Berner-Lee’s WorldWideWeb, was actually a graphical browser as well. But it was written for the NeXT platform, making it expensive and out of reach for the vast majority of computer users. There were other early browsers, but Mosaic was easy to install on a Windows machine. 12 Koerbin, “Web Archiving.” 13 The history of early browsers, and the attendant struggle between Microsoft and Netscape, has yet to be studied in a historical booklength study. 14 Windrum, “Back from the Brink.” 15 The tag lingered in Mozilla Firefox, a descendent of Netscape, until release 23 on 6 August 2013. It saw an unceremonious end in those release notes: “Dropped blink effect from text-decoration: blink; and completely removed element.” 16 Espenschied and Lialina, “Authenticity/Access.” 17 Espenschied and Lialina, “Authenticity/Access.” 18 I am borrowing the idea of a “noble dream” from Novick, That Noble Dream. 19 Ankerson, “Writing Web Histories with an Eye on the Analog Past,” 393. 20 Ankerson, “Writing Web Histories with an Eye on the Analog Past,” 359–97. 21 Ankerson, “Writing Web Histories with an Eye on the Analog Past,” 394. 22 Milligan, Ruest, and St.Onge, “The Great WARC Adventure.” 23 Kirschenbaum, Mechanisms, 3. He is drawing on Thibodeau, “What Does It Mean to Preserve Digital Objects?” 24 Owens, “Digital Preservation’s Place in the Future of the Digital Humanities.” 25 Owens, “Digital Preservation’s Place in the Future of the Digital Humanities.” 26 drinehart, “10,000,000,000,000,000 Bytes Archived.” 27 Rossi, “80 Terabytes of Archived Web Crawl Data Available for Research.” 28 Based on Amazon’s online S3 calculator, as well as our own IT quotes. 29 Petaboxes are discussed at Internet Archive, “Petaboxes.” 30 For more information on the CDX format, see Internet Archive, “CDX Format.” 31 Millward, “I Tried to Use the Internet to Do Historical Research.” 32 Millward, “I Tried to Use the Internet to Do Historical Research.” 33 Jackson, “Building a ‘Historical Search Engine’ Is No Easy Thing.”


34 Graham, Milligan, and Weingart, Exploring Big Historical Data. 35 Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books”; Old Bailey, “Proceedings of Old Bailey”; HathiTrust Research Center, “HathiTrust Digital Library.” 36 For an in-depth definition, aimed at an archival and librarian audience, see Corrado and Moulaison, Digital Preservation, 111–12. 37 Greenwald, No Place to Hide, 132. 38 Feinstein, “Sen. Dianne Feinstein.” As mentioned in Greenwald, No Place to Hide, 133. 39 White House, “Statement by the President.” 40 Blaze, “Phew, NSA Is Just Collecting Metadata.” 41 Greenwald, No Place to Hide, 133. 42 Rudder, Dataclysm, 46–8. 43 Rudder, Dataclysm, 37–8. 44 Rockwell and Sinclair, Hermeneutica, 118. 45 See information about at w3schools, “HTML Meta Tag.” For more information on WATs, see Stern, “Web Archive Transformation (WAT)”; Merity, “Navigating the WARC File Format.” 46 Weltevrede and Helmond, “Where Do Bloggers Blog?” 47 Weltevrede and Helmond, “Where Do Bloggers Blog?” 48 Rogers, Digital Methods, 81. 49 Rogers, Digital Methods, 44. 50 University of Toronto, “Canadian Political Parties and Political Interest Groups.” 51 A point made very effectively in Brügger, Web 25, 4–5. 52 Vaidhyanathan, Googlization of Everything, 63. 53 This example is drawn from Milligan, “Lost in the Infinite Archive.” 54 For more information, see Graham, Weingart, and Milligan, “Getting Started with Topic Modeling and MALLET”; Jockers, “LDA Buffet Is Now Open”; Blei, Ng, and Jordan, “Latent Dirichlet Allocation.”

CHAPTER FOUR 1 The myth that the Wayback Machine is the Internet Archive continues. See the Wikipedia entry for “Wayback Machine” at https://en.wikipedia. org/w/index.php?title=Wayback_Machine&oldid=680478188. 2 Internet Archive, “The Internet Archive”; Cullen, “Wayback Machine Restores Ye Old Web.” 3 As noted earlier, to learn more about the power of algorithms in our lives, see O’Neil, Weapons of Math Destruction; and Noble, Algorithms of Oppression. 4 See https://archive-it.org/.

258

Notes to pages 151–66

5 Indeed, the dominance of the Liberal party led Canadian political commentators such as Jeffrey Simpson to wonder if Canada was becoming a de facto one-party state. See Simpson, Friendly Dictatorship. 6 Chapnick, “‘Conservative’ National Story?” 7 McKay and Swift, Warrior Nation; Berlin, “Harper, Historians and the Myth of the Warrior Nation”; Frenette, “Conscripting Canada’s Past.” 8 Jackson et al., “Desiderata for Exploratory Search Interfaces.” 9 I have made a similar point about needing to understand the algorithms that run our research lives in Milligan, “Illusionary Order.” 10 Google Search, “Crawling & Indexing”; Barroso, Dean, and Holzle, “Web Search for a Planet.” 11 Apache refers to the Apache Software Foundation, an open-source community that develops and distributes free software. 12 A long list of collaborators has made the Archives Unleashed Toolkit possible, in addition to Jimmy Lin, including recent work from Nick Ruest, Alice Zhou, Jeremy Wiebe, Ryan Deschamps, Youngbin Kim, Boris Lin, Joseph Zhou, and Titus An. 13 Lin and Milligan, “Warcbase: Scaling ‘Out’ and ‘Down’ HBase.” 14 Lin et al., “Warcbase.” 15 Ashenfelder, “Digital Collections and Data Science.” 16 Stirling et al., “State of E-Legal Deposit in France.” 17 Stirling et al., “State of E-Legal Deposit in France.” 18 GOV.UK, “Guidance on the Legal Deposit Libraries (Non-Print Works) Regulations 2013.” 19 “Get a Reader Pass,” British Library, April 2014, http://www.bl.uk/ reshelp/inrrooms/stp/register/stpregister.html. 20 Rutner and Schonfeld, “Supporting the Changing Research Practices of Historians.” 21 Bateman et al., “Taking a Byte out of the Archives.” 22 “Self-Service Copying and Photography,” Text, British Library, 2015, https://web.archive.org/web/20150310145630/http://www.bl.uk/reshelp/ inrrooms/stp/copy/selfsrvcopy/selfservcopy.html. 23 Found when one tries to access material from home. See https:// beta.webarchive.org.uk/en/ukwa/noresults. 24 Bibliothèque nationale de France, “Internet Archives.” 25 Bibliothèque nationale de France, “Internet Archives.” 26 Laursen and Møldrup-Dalum, “Keynote on the History of the Danish Web Archive.” 27 Government of Denmark, “Act on Legal Deposit of Published Material.” 28 Laursen and Møldrup-Dalum, “Keynote on the History of the Danish Web Archive.”

Notes to pages 166–74

29 30 31 32 33 34 35 36 37 38 39 40

41 42 43

259

IIPC, “10 Years Anniversary of the Netarchive (Netarkivet).” Vefsafn.Is, “Icelandic Archive.” Bibliotheca Alexandrina, “Internet Archive.” Nicholson, “Legal Deposit in South Africa.” Lor and Britz, “A Moral Perspective on South–North Web Archiving.” Biblioteca Nacional Digital de Chile, “Archivo de la Web Chilena.” Pabón Cadavid, Basha, and Kaleeswaran, “Legal and Technical Difficulties.” WARP, “Let’s WARP to the Past of the Web.” Hockx, “Web Archives and Chinese Literature.” National Library of Korea, “Web Archiving System of the National Library of Korea.” Legislative Services Branch, “Consolidated Federal Laws of Canada, Library and Archives of Canada Act.” Stone, “Tweet Preservation”; Luckerson, “What the Library of Congress Plans to Do with All Your Tweets”; Osterberg, “Update on the Twitter Archive at the Library of Congress”; Scola, “Library of Congress’ Twitter Archive Is a Huge #FAIL.” An excellent overview can be found in Zimmer, “Twitter Archive.” Grotke, “Web Archiving at the Library of Congress.” Memento, “About the Time Travel Service. A vision articulated in Van de Sompel et al., “Memento: Time Travel for the Web.”

CHAPTER FIVE 1 Motavalli, Bamboozled at the Revolution, 191. 2 It is difficult to know with any certainty just how big GeoCities was, exactly. The company itself had a motivation to inflate numbers, and the web archives are not complete today. The seven million number can be found in multiple locations, however, such as Fletcher, “Internet Atrocity!” The number of “documents” comes from the number of HTML documents found in an analysis of the 2009 Internet Archive scrape. 3 Scott, “Unpublished Article on Geocities.” 4 For an early account, see Christensen and Suess, “Hobbyist Computerized Bulletin Board.” To learn more about BBSs, the best source is Jason Scott’s BBS: The Documentary. 5 Cunliffe, “Reaching Electronic Bulletin Boards Often a Headache for the Uninitiated.” 6 Ocamb, “David Bohnett.”

260

Notes to pages 174–89

7 Hansell, “Neighbourhood Business.” 8 Wired and Gingrich made odd bedfellows, as discussed in Turner, From Counterculture to Cyberculture, 8 and 215. 9 Business Wire, “Beverly Hills Internet.” 10 Motavalli, Bamboozled at the Revolution, 191. 11 Ridey, “Roger Widey Travels under the Volcano.” 12 Motavalli, Bamboozled at the Revolution, 194. 13 Lawson, “Berners-Lee on the Read/Write Web.” 14 GeoCities, “GeoCities FAQ Page.” 15 GeoCities, “The ‘Home Page’ Home Page.” 16 I dealt with this more extensively in Milligan, “Welcome to the Web.” 17 Rheingold, Virtual Community, xx. 18 Doheny-Farina, Wired Neighborhood, 37. 19 Moschovitis, History of the Internet. 20 Zacharek, “Addicted to eBay.” 21 Scott, “Please Be Patient: This Page Is under Construction!” 22 Figallo, Hosting Web Communities, 108–10. 23 Sawyer and Greely, Creating Geocities Websites, 8. 24 Think of the Electronic Frontier Foundation, for example, as noted in Turner, From Counterculture to Cyberculture, 172. 25 Internet Archive Wayback Machine, “GeoCities Homesteading Program Information.” 26 They mentioned this in the foreword to Sawyer and Greely, Creating Geocities Websites. 27 Lialina, “Some Remarks on #neocities @kyledrake.” 28 Karlins, Build Your Own Web Site, 60. 29 “GeoCities Neighborhood Watch Program,” 13 April 1997. 30 GeoCities, “Page Content Guidelines and Member Terms of Service.” 31 As noted http://www.ewebtribe.com/remember/GC_FAQ_old.html. 32 Geocities, “GeoCities FAQ Page: Home Page Information.” 33 Logie, “Homestead Acts,” 37. 34 Lin et al., “Warcbase.” 35 I have looked at the use of images in depth in Milligan, “Learning to See the Past at Scale.” 36 See the hands-on guide by Manovich, “Guide to Visualizing Video and Image Sequences.” 37 For example, images are arranged in a montage without relationship to others, and scholars have noted that we tend to privilege up–down relationships over left–right relationships, even if they are identical. See Montello et al., “Testing the First Law of Cognitive Geography.” 38 You can visit it yourself GeoCities, “EnchantedForest Awards Page.”

Notes to pages 190–8

39 40 41 42 43 44

45 46 47 48 49

50 51 52 53 54 55 56

57 58

59

60

261

GeoCities, “About the Heartland Community Leaders.” GeoCities, “Introduction: The Elements of Web Page Style.” GeoCities, “Athens’ Community Leaders.” An issue I have explored in greater depth in Milligan, “‘A Haven for Perverts, Criminals, and Goons.” GeoCities, “Welcome to the EnchantedForest.” For this, I consulted the community centres of Glade, Creek, Palace, Dell, Meadow, Pond, Cottage, Fountain, Mountain, and Tower (all ca 1999 or 2000). Rogers, Digital Methods, 61. See, for example, GeoCities, “Augusta Award Application”; GeoCities, “The Eureka Awards”; and GeoCities, “OuttaSite Awards Program.” Walker, “‘It’s Difficult to Hide It,’” 106. Walker, “‘It’s Difficult to Hide It,’” 106. There are, of course, ways to game Google and Bing today – as witnessed by the large, albeit occasionally spammy, Search Engine Optimization (SEO) field. There is, however, a large difference between today’s web search landscape and the late 1990s. Ryan, History of the Internet and the Digital Future, 118–19. Casey, “Creating and Managing Webrings: A Step-By-Step Guide.” Bournellis, “Round Numbers.” Casey, “Creating and Managing Webrings.” An extremely useful overview, focused on social media data ethics, can be found in Taylor and Pagliari, “Mining Social Media Data.” Dawson, “Dark Side of Going Viral.” A survey commissioned by the Atlantic and the Aspen Institute, for example, found that “younger Americans express a greater expectation that the personal information they use on sites such as Facebook and Twitter will remain private. Slightly more than half of 18-to-29year-olds said they held this expectation, whereas only 38 percent of Americans over 65 said the same.” See Rosen, “59% of Young People Say the Internet Is Shaping Who They Are.” Ritchie, “Should We ‘Consent’ to Oral History?” Notable examples include High, Oral History at the Crossroads; Llewellyn, Freund, and Reilly, Canadian Oral History Reader; Thomson, Oral History Reader. Salganik, Bit by Bit, 294–301. US Department of Health & Human Services, Belmont Report. There is also the Menlo Report, which focuses specifically on how the principles can be applied to digital research at Dittrich and Kenneally, Menlo Report. Salganik, Bit by Bit, 323–4.

262

Notes to pages 199–212

61 Morrison, “‘Suffused by Feeling and Affect.” My thanks to Jennifer Douglas for the reference in her thoughtful presentation as part of a “Social Media: New Challenges and Opportunities” panel at the Association of Canadian Archivists’ 2018 annual meeting. 62 Bady, “#NotAllPublic, Heartburn, Twitter.” 63 Dash, “What Is Public?” 64 Noble, Algorithms of Oppression, 129. 65 Christen, “Opening Archives,” 186. 66 Christen, “Opening Archives,” 189. 67 Kim, “Social Media and Academic Surveillance.” See also Callahan, “USC’s Black Twitter Study Draws Criticism.” 68 Resnick, “Researchers Just Released Profile Data on 70,000 OkCupid.” 69 Netarkivet, “Retningslinjer for Adgang Til Netarkivet.” 70 Rauber, Kaiser, and Wachter, “Ethical Issues in Web Archive Creation and Usage.” 71 Tansey, “My Talk at Personal Digital Archiving 2015.” 72 boyd, It’s Complicated, 58. 73 boyd, It’s Complicated, 61. 74 Barnes, “Privacy Paradox.” 75 Bady, “#NotAllPublic, Heartburn, Twitter.” In the case of “creepshots,” the taking of photographs in a public place is legal in most Western countries, although subsequent use of them could bring civil penalties (such as profiting off the image of a person). 76 For more on this, see Hartzog and Stutzman, “Case for Online Obscurity.” 77 See Ess and AoIR Ethics Working Committee, “Ethical DecisionMaking and Internet Research”; Markham and Buchanan, “Ethical Decision-Making and Internet Research. For more on the blurring of public and private spaces, see the editors’ introduction to Pereira, Ghezzi, and Vesnic-Alujevic, The Ethics of Memory in a Digital Age. 78 Lomborg, “Personal Internet Archives and Ethics.” 79 McArdle, “People Are Getting Fired for Old Bad Tweets.” 80 Scott, “Archiving Britain’s Web”; Davis, “Archiving the Web.” 81 This point was made several times at the Ethics and the Archived Web conference, held in March 2018 in New York City. For more see http://eaw.rhizome.org/. 82 Salganik, Bit by Bit, 307. 83 Diminishing social and community ties are discussed in Putnam, Bowling Alone.

Notes to pages 214–39

263

CHAPTER SIX 1 Mohr et al., “Introduction to Heritrix.” 2 The code for Heritrix can be found at https://github.com/ internetarchive/heritrix3. 3 For example, see the user guide at http://crawler.archive.org/user.html. It is straightforward and well written for developers, but the learning curve for a casual user is steep. 4 Wayback, “My Site’s Not Archived!.” 5 InterPlanetary Wayback and IPFS are both rapidly developing technologies. For more information, see the InterPlanetary Wayback code repository at https://github.com/oduwsdl/ipwb and IPFS’s information at https://ipfs.io. 6 This Awesome List can be found at https://github.com/iipc/ awesome-web-archiving. 7 See Milligan and Baker, “Introduction to the Bash Command Line”; and Graham, Weingart, and Milligan, “Getting Started with Topic Modeling and MALLET.” 8 The Software Carpentry website is at https://software-carpentry.org. 9 This dataset is available at https://archive.org/details/wide00002. 10 Graham, Milligan, and Weingart, Exploring Big Historical Data; Arnold and Tilton, Humanities Data in R. 11 Graham, Milligan, and Weingart, Exploring Big Historical Data, chap. 6, https://www.worldscientific.com/doi/suppl/10.1142/p981/suppl_file/ p981_chap06.pdf. 12 See my own contribution to that volume in Milligan, “Learning to See the Past at Scale.” 13 The description of the journal can be found at http://culturalanalytics. org/about/about-ca/.

CONCLUSION 1 Baker, “Digital History and the Death of Quant.” This is part of a trend also discussed in Jordanova, History in Practice. 2 Reudiger, “Another Tough Year.” 3 Arguing with Digital History working group, “Digital History and Argument.” 4 See the original paper at Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books.” Subsequent work on the new corpus, incorporating speech tagging, can be found at Lin et al., “Syntactic Annotations for the Google Books Ngram Corpus.”

264

Notes to pages 239–42

5 Grafton, “Loneliness and Freedom.” 6 History is considered a “book discipline,” meaning that the book is the gold standard for tenure and promotion – i.e., one full peer-reviewed book, preferably with a university press, to be promoted to associate professor; another one to be considered for the rank of full professor. See the series of essays at Denbo, “Forum: History as a Book Discipline.” 7 Comment by Jean-Baptiste Michel and Erez Lieberman Aiden on Grafton, “Loneliness and Freedom.” 8 See, for example, some of the Digging into Data projects, such as Cohen et al., “Data Mining with Criminal Intent”; Klein et al., “Trading Consequences.” As I have written about earlier in this book, at George Mason University and other places, historians have been engaged in the critical work of preserving and making accessible reams of born-digital content, from emails to oral testimonies on the web. 9 There is now an annual “Getting Started in Digital History” workshop at the AHA. For other annual institutes, see HILT, “Humanities Intensive Learning and Teaching Institute”; DHSI, “DHSI | Digital Humanities Summer Institute.” 10 Digital History Working Group, “Guidelines for the Professional Evaluation of Digital Scholarship.”

BIBLIOGRAPHY

Abbate, Janet. Inventing the Internet. Cambridge, MA: MIT Press, 2000. Aiden, Erez, and Jean-Baptiste Michel. Uncharted: Big Data as a Lens on Human Culture. New York: Riverhead Hardcover, 2013. Ainsworth, Scott G., Ahmed Alsum, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson. “How Much of the Web Is Archived?” In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, 133–6. New York: ACM, 2011. https://doi.org/10.1145/1998076.1998100. Ainsworth, Scott G., Michael L. Nelson, and Herbert Van de Sompel. “Only One Out of Five Archived Web Pages Existed as Presented.” In Proceedings of the 26th ACM Conference on Hypertext & Social Media, 257–66. New York: ACM, 2015. https://doi.org/10.1145/2700171.2791044. Alpert, Jesse, and Nissan Hajaj. “We Knew the Web Was Big …” Official Google Blog (blog), 25 July 2008. https://googleblog.blogspot.com/2008/07/ we-knew-web-was-big.html. American Historical Association. “Historians in Archives.” AHA: Careers for Students of History, 2018. https://www.historians.org/jobs-andprofessional-development/career-resources/careers-for-students-of-history/ historians-in-archives. American Registry for Internet Numbers. “ARIN Fee Schedule.” arin.net, 1 July 2016. https://www.arin.net/fees/fee_schedule.html. Anderson, Chris. “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” Wired, 23 June 2008. http://www.wired.com/2008/06/ pb-theory/. Anderson, Ian. “History and Computing.” Making History: The Changing Face of the Profession in Britain, 2008. http://www.history.ac.uk/ makinghistory/resources/articles/history_and_computing.html. Ankerson, Megan. “Writing Web Histories with an Eye on the Analog Past.” New Media & Society 14, no. 3 (1 May 2012): 384–400.

266

Bibliography

ArchiveTeam.org. “Myspace.” 2 July 2013. http://archiveteam.org/ index.php?title=Myspace. Arguing with Digital History working group. “Digital History and Argument.” White paper, Roy Rosenzweig Center for History and New Media, 13 November 2017. https://rrchnm.org/argument-white-paper/. Arvidson, Allan, Krister Persson, and Johan Mannerheim. “The Kulturarw3 Project – The Royal Swedish Web Archiw3e – An Example of ‘Complete’ Collection of Web Pages.” Jerusalem, 2000. http://archive.ifla.org/IV/ifla66/ papers/154-157e.htm. Ashenfelder, Mike. “Digital Collections and Data Science.” Signal: Digital Preservation Blog at Library of Congress (blog), 14 September 2016. http:// blogs.loc.gov/thesignal/2016/09/digital-collections-and-data-science/. Bady, Aaron. “#NotAllPublic, Heartburn, Twitter.” New Inquiry, 10 June 2014. https://thenewinquiry.com/blog/notallpublic-heartburn-twitter/. Bailey, Jefferson. “Disrespect des Fonds: Rethinking Arrangement and Description in Born-Digital Archives.” Archive Journal 3 (Summer 2013). http://www.archivejournal.net/issue/3/archives-remixed/disrespect-desfonds-rethinking-arrangement-and-description-in-born-digital-archives/. Baker, James. “Digital History and the Death of Quant.” British Library Digital Scholarship Blog, 5 April 2014. http://britishlibrary.typepad.co.uk/ digital-scholarship/2014/04/digital-history-and-the-death-of-quant.html. – “A Page, but Not as We Know It.” British Library Digital Scholarship Blog, 12 June 2013. http://britishlibrary.typepad.co.uk/digital-scholarship/ 2013/06/a-page-but-not-as-we-know-it.html. Ball, Alex. “DCC State of the Art Report: Web Archiving.” University of Edinburgh; UKOLN, University of Bath; HATII, University of Glasgow; Science and Technology Facilities Council, 8 January 2010. http:// www.era.lib.ed.ac.uk/handle/1842/3327. Baran, Paul. “On Distributed Communications Networks.” RAND Corporation, September 1962. http://pages.cs.wisc.edu/~akella/CS740/ F08/740-Papers/Bar64.pdf. Barnes, Susan B. “A Privacy Paradox: Social Networking in the United States.” First Monday 11, no. 9 (4 September 2006). http://firstmonday.org/ojs/ index.php/fm/article/view/1394. Barnet, Belinda. Memory Machines: The Evolution of Hypertext. London: Anthem, 2013. Barroso, L.A., J. Dean, and U. Holzle. “Web Search for a Planet: The Google Cluster Architecture.” IEEE Micro 23, no. 2 (March 2003): 22–8. https:// doi.org/10.1109/MM.2003.1196112. Bateman, Kirklin, Sheila Brennan, Douglas Mudd, and Paula Petrik. “Taking a Byte out of the Archives: Making Technology Work for You.” Perspectives

Bibliography

267

on History (American Historical Association), 1 January 2005. https:// www.historians.org/publications-and-directories/perspectives-on-history/ january-2005/taking-a-byte-out-of-the-archives-making-technology-workfor-you. Berlin, David. “Harper, Historians and the Myth of the Warrior Nation.” Globe and Mail, 20 August 2012. http://www.theglobeandmail.com/arts/ books-and-media/book-reviews/harper-historians-and-the-myth-of-thewarrior-nation/article4490206/. Berners-Lee, Tim. “Cool URIS Don’t Change.” W3C Style, 1998. http://www. w3.org/Provider/Style/URI.html?_ga=1.202456593.1433420173.1395769418. – “Enquire Manual – In HyperText.” World Wide Web Consortium (W3C), October 1980. http://www.w3.org/History/1980/Enquire/manual/. – “Information Management: A Proposal.” World Wide Web Consortium (W3C), March 1989. http://www.w3.org/History/1989/proposal.html. – Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. San Francisco: HarperBusiness, 2000. Berners-Lee, Tim, and Robert Cailliau. “WorldWideWeb: Proposal for a HyperText Project.” World Wide Web Consortium (W3C), 12 November 1990. http://www.w3.org/Proposal.html. Bibliotheca Alexandrina. “Internet Archive,” 2018, Bibliothèque nationale de France. “Internet Archives,” 27 March 2015. http://www.bnf.fr/en/ collections_and_services/book_press_media/a.internet_archives.html. Biblioteca Nacional Digital de Chile. “Archivo de la Web Chilena,” 2018. http://archivoweb.bibliotecanacionaldigital.cl/. Blank, Grant. “Who Creates Content?” Information, Communication & Society 16, no. 4 (1 May 2013): 590–612. https://doi.org/10.1080/136911 8X.2013.777758. Blank, Grant, and William H. Dutton. “Next Generation Internet Users: A New Digital Divide.” In Society and the Internet: How Networks of Information and Communication Are Changing Our Lives, ed Mark Graham and William H. Dutton, 36–52. New York: Oxford University Press, 2014. Blaze, Matt. “Phew, NSA Is Just Collecting Metadata. (You Should Still Worry).” WIRED , 19 June 2013. http://www.wired.com/2013/06/phew-itwas-just-metadata-not-think-again/. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (2003): 993–1022. Blum, Andrew. Tubes: A Journey to the Center of the Internet. Repr. New York: Ecco, 2013. Bournellis, Cynthia. “Round Numbers: WebRing, a Home-Grown Tool for Linking Web Sites, Tries to Survive the Transition from Freeware to Free Market.” CIO: Web Business, 1 June 1998.

268

Bibliography

Bowman, John. “Myspace’s $20M Relaunch Deletes Its Remaining Users’ Blogs.” CBC News. Your Community Blog, 13 June 2013. http://www.cbc.ca/ newsblogs/yourcommunity/2013/06/myspaces-20m-relaunch-deletes-itsremaining-users-blogs.html. boyd, danah. It’s Complicated: The Social Lives of Networked Teens. New Haven, CT: Yale University Press, 2014. Braudel, Fernand. The Mediterranean and the Mediterranean World in the Age of Philip II. New York: Harper & Row, 1972. British Library. “Introduction to Legal Deposit.” Legal Deposit: About Us, 2014. http://www.bl.uk/aboutus/legaldeposit/introduction/. Brown, Adrian. Archiving Websites: A Practical Guide for Information Management Professionals. London: Facet Publishing, 2006. Brügger, Niels. The Archived Web: Doing Web History in the Digital Age. Cambridge, MA: MIT Press, 2018. – “The Archived Website and Website Philology: A New Type of Historical Document.” Nordicom Review 29, no. 2 (2008): 155–75. – Archiving Websites: General Considerations and Strategies. Aarhus: Center for Internetforskning, 2005. – “Digital Humanities in the 21st Century: Digital Material as a Driving Force.” Digital Humanities Quarterly 10, no. 2 (2016). http:// www.digitalhumanities.org/dhq/vol/10/3/000256/000256.html. – “Humanities, Digital Humanities, Media Studies, Internet Studies: An Inaugural Lecture.” Papers from the Centre for Internet Studies. Aarhus: Center for Internetforskning, 2015. http://cfi.au.dk/fileadmin/www.cfi. au.dk/publikationer/cfis_skriftserie/016_Brugger.pdf. – Web History. Edited by Niels Brügger. New York: Peter Lang Publishing, 2010. – “Web History and the Web as a Historical Source.” Zeithistorische Forschungen/Studies in Contemporary History 9 (2012). http://www. zeithistorische-forschungen.de/site/40209295/default.aspx#person. – ed. Web 25: Histories from the First 25 Years of the World Wide Web. New York: Peter Lang Publishing, 2017. – “Website History and the Website as an Object of Study.” New Media & Society 11, no. 1–2 (1 February 2009): 115–32. https://doi. org/10.1177/1461444808099574. – ed. Web 25: Histories from the First 25 Years of the World Wide Web. New York: Peter Lang Publishing, 2017. Brügger, Niels, and Ian Milligan, eds. SAGE Handbook of Web History. London: SAGE, 2018. Brügger, Niels, and Ralph Schroeder, eds. The Web as History. London: UCL, 2017.

Bibliography

269

Bryant, Martin. “20 Years Ago Today, the World Wide Web Was Born.” Next Web, 6 August 2011. http://thenextweb.com/insider/2011/08/06/ 20-years-ago-today-the-world-wide-web-opened-to-the-public/. Buckler, Craig. “RIP GeoCities 1995–2009.” SitePoint (blog), 3 May 2009. http://www.sitepoint.com/rip-geocities/. Bush, Vannevar. “As We May Think.” Atlantic, July 1945. http:// www.theatlantic.com/magazine/archive/1945/07/as-we-maythink/303881/?single_page=true. Business Wire. “Beverly Hills Internet, Builder of Interactive Cyber Cities, Launches 4 More Virtual Communities Linked to Real Places,” 5 July 1995. http://www.thefreelibrary.com/Beverly+Hills+Internet,+builder+of+ interactive+cyber+cities,+launches...-a017190114. Cadwalladr, Carole. “‘I Made Steve Bannon’s Psychological Warfare Tool’: Meet the Data War Whistleblower.” Guardian, 18 March 2018. https:// www.theguardian.com/news/2018/mar/17/data-war-whistleblowerchristopher-wylie-faceook-nix-bannon-trump. Cailliau, Robert. “Hypertext and WWW Information.” Internet Archive Wayback Machine, 3 December 1998. https://web.archive.org/web/ 19981203062522/http://info.cern.ch/. Callahan, Yesha. “USC’s Black Twitter Study Draws Criticism.” Root, 3 September 2014. https://thegrapevine.theroot.com/usc-s-black-twitterstudy-draws-criticism-1790885678. Canadian Committee on Archival Description. “Rules for Archival Description.” Canadian Council of Archives, July 2008. http:// www.cdncouncilarchives.ca/RAD/RADComplete_July2008.pdf. Canadian Radio-television and Telecommunications Commission (CRTC). “New Media.” Regulatory policies, 17 May 1999. http://www.crtc.gc.ca/ eng/archive/1999/PB99-84.HTM. Carenini, Giuseppe, Raymond Ng, and Gabriel Murray. Methods for Mining and Summarizing Text Conversations. San Rafael, CA: Morgan & Claypool Publishers, 2011. Casey, Carol. “Creating and Managing Webrings: A Step-by-Step Guide.” Information Technology and Libraries, December 1999. Academic OneFile. CBC News. “Canadians’ Internet Usage via Desktops Highest in World, ComScore Says,” 27 March 2015. http://www.cbc.ca/news/business/desktopInternet-use-by-canadians-highest-in-world-comscore-says-1.3012666. – “1st Website Ever Restored to Its 1992 Glory,” 30 April 2013. http:// www.cbc.ca/1.1350832. Cebula, Larry. “An Open Letter to the Historians of the 22nd Century.” Slate, 22 July 2013. http://www.slate.com/articles/arts/culturebox/2013/07/how_will_ historians_of_the_future_sort_through_the_data_glut_of_the_present.html.

270

Bibliography

Center for History and New Media. “Occupy Archive,” 2011. http:// occupyarchive.org/. Center for History and New Media, and American Social History Project / Center for Media and Learning. “September 11 Digital Archive: Saving the Histories of September 11, 2001.” http://911digitalarchive.org/. Chapnick, Adam. “A ‘Conservative’ National Story? The Evolution of Citizenship and Immigration Canada’s Discover Canada.” American Review of Canadian Studies 41, no. 1 (23 February 2011): 20–36. https://doi.org/10. 1080/02722011.2010.544853. Christen, Kimberly. “Opening Archives: Respectful Repatriation.” American Archivist 74, no. 1 (Spring/Summer 2011): 185–210. Christensen, Ward, and Randy Suess. “Hobbyist Computerized Bulletin Board.” BYTE Magazine, November 1978. Christy, Alan, Alice Yang, David Greenberg, Eileen Boris, Gail Drakes, Jennifer Klein, Jeremy Saucier, et al. Doing Recent History: On Privacy, Copyright, Video Games, Institutional Review Boards, Activist Scholarship, and History That Talks Back. Edited by Claire Potter and Renee Romano. Athens: University of Georgia Press, 2012. Cohen, Dan, Frederick Gibbs, Tim Hitchcock, Geoffrey Rockwell, Jörg Sander, Robert Shoemaker, Stéfan Sinclair et al. “Data Mining with Criminal Intent: Final White Paper,” 31 August 2011. http://citeseerx.ist. psu.edu/viewdoc/download?doi=10.1.1.458.2152&rep=rep1&type=pdf. Cohen, Daniel J., and Roy Rosenzweig. Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web. Philadelphia: University of Pennsylvania Press, 2005. http://chnm.gmu.edu/digitalhistory/. Corrado, Edward M., and Heather Lea Moulaison. Digital Preservation for Libraries, Archives, and Museums. Lanham, MD: Rowman & Littlefield Publishers, 2014. Correa, Teresa. “The Participation Divide among ‘Online Experts’: Experience, Skills and Psychological Factors as Predictors of College Students’ Web Content Creation.” Journal of Computer-Mediated Communication 16, no. 1 (1 October 2010): 71–92. https://doi.org/10.1111/j.1083-6101.2010.01532.x. Craggs, Samantha. “Sorry, America: Study Shows Canadians Really Are More Polite.” CBC News, 8 January 2016. http://www.cbc.ca/news/canada/ hamilton/news/canadians-polite-twitter-1.3395242. Cullen, Drew. “Wayback Machine Restores Ye Old Web.” Register, 14 November 2001. http://web.archive.org/web/20011125112115/ http://www.theregister.co.uk/content/6/22834.html. Cunliffe, Alison. “Reaching Electronic Bulletin Boards Often a Headache for the Uninitiated.” Toronto Star, 10 August 1986. Dalal, Y., C. Sunshine, and V. Cerf. “Specification of Internet Transmission Control Program,” December 1974. https://tools.ietf.org/html/rfc675.

Bibliography

271

Dash, Anil. “What Is Public? It’s so Simple, Right?” Message, 24 July 2014. https://medium.com/message/what-is-public-f33b16d780f9. Davis, Corey. “Archiving the Web: A Case Study from the University of Victoria.” code{4}lib Journal 26 (21 October 2014). http://journal.code4lib. org/articles/10015. Dawson, Ella. “The Dark Side of Going Viral.” Vox, 10 July 2018. https:// www.vox.com/first-person/2018/7/10/17553796/plane-bae-viral-airplaneromance. Deken, Jean Marie. “The Web’s First ‘Killer App’: SLAC National Accelerator Laboratory’s World Wide Web Site, 1991–1993,” in Web 25: Histories from the First 25 Years of the World Wide Web, ed. Niels Brügger, 57–78 (New York: Peter Lang Publishing, 2017). Denbo, Seth. “Forum: History as a Book Discipline: An Introduction.” Perspectives on History: American Historical Association, April 2015. https:// www.historians.org/publications-and-directories/perspectives-on-history/ april-2015/an-introduction. DHSI. “Digital Humanities Summer Institute.” 2015. http://www.dhsi.org/. Digital History Working Group. “Guidelines for the Professional Evaluation of Digital Scholarship by Historians.” American Historical Association, June 2015. https://www.historians.org/teaching-and-learning/digitalhistory-resources/evaluation-of-digital-scholarship-in-history/guidelinesfor-the-professional-evaluation-of-digital-scholarship-by-historians. Dittrich, D., and E. Kenneally. The Menlo Report. Center for Applied Internet Data Analysis (CAIDA). https://www.caida.org/publications/papers/2012/ menlo_report_actual_formatted/. DocNow. “Tweet ID Datasets.” https://www.docnow.io/catalog/. Doctorow, Cory. “Sen. Stevens’ Hilariously Awful Explanation of the Internet.” Boing Boing (blog), 2 July 2006. http://boingboing. net/2006/07/02/sen-stevens-hilariou.html. Doheny-Farina, Stephen. The Wired Neighborhood. New Haven, CT: Yale University Press, 1996. Dougherty, Jack, and Kristen Nawrotzki. Writing History in the Digital Age. Ann Arbor: University of Michigan Press, 2013. drinehart. “10,000,000,000,000,000 Bytes Archived.” Internet Archive Blogs (blog), 26 October 2012. http://blog.archive. org/2012/10/26/10000000000000000-bytes-archived/. Dropbox. “How to Access the Dropbox Account of Someone Who Has Passed Away?” https://www.dropbox.com/en/help/security/access-accountof-someone-who-passed-away?_locale_specific=en. Duggan, Maeve. “The Demographics of Social Media Users.” Pew Research Center: Internet, Science & Tech (blog), 19 August 2015. http://www. pewInternet.org/2015/08/19/the-demographics-of-social-media-users/.

272

Bibliography

Dutton, William H., and Mark Graham. “Introduction.” In Society & the Internet: How Networks of Information and Communications Are Changing Our Lives, ed. Mark Graham and William H. Dutton, 1–22. London: Oxford University Press, 2014. Economist. “Difference Engine: Lost in Cyberspace.” 1 September 2012. http://www.economist.com/node/21560992. Emory News Center. “Emory Digital Scholars Archive Occupy Wall Street Tweets,” 21 September 2012. http://news.emory.edu/stories/2012/09/ er_occupy_wall_street_tweets_archive/campus.html. Espenschied, Dragan, and Olia Lialina. “Authenticity/Access.” One Terabyte of Kilobyte Age (blog), 9 April 2012. http://contemporary-home-computing. org/1tb/archives/3214. Ess, Charles, and AoIR Ethics Working Committee. “Ethical Decision-Making and Internet Research: Recommendations from the AOIREthics Working Committee,” 27 November 2002. http://aoir.org/reports/ethics.pdf. Feinstein, Dianne. “Sen. Dianne Feinstein: Continue NSA Call-Records Program.” USA TODAY , 20 October 2013. http://www.usatoday.com/ story/opinion/2013/10/20/nsa-call-records-program-sen-dianne-feinsteineditorials-debates/3112715/. Figallo, Cliff. Hosting Web Communities: Building Relationships, Increasing Customer Loyalty, and Maintaining a Competitive Edge. New York: John Wiley & Sons Canada, 1998. Fletcher, Dan. “Internet Atrocity! GeoCities’ Demise Erases Web History.” Time, 9 November 2009. http://content.time.com/time/business/ article/0,8599,1936645,00.html. Fogel, Robert, and Geoffrey Elton. Which Road to the Past?: Two Views of History. New ed. New Haven, CT: Yale University Press, 1984. Fogel, Robert William, and Stanley L. Engerman. Time on the Cross: The Economics of American Slavery. Reissue. New York: W.W. Norton, 1974. Frenette, Yves. “Conscripting Canada’s Past: The Harper Government and the Politics of Memory.” Canadian Journal of History 49, no. 1 (2014). https://doi.org/10.3138/cjh.49.1.49. Futurism, Todd Jaquith. “CERN Just Dropped 300 TB of Large Hadron Collider Data Free Online.” ScienceAlert, 26 April 2016. http://www.sciencealert. com/cern-just-dropped-300-tb-of-large-hadron-collider-data-online-for-free. Gaffield, Chad. “The Surprising Ascendance of Digital Humanities: And Some Suggestions for an Uncertain Future.” Digital Studies / Le champ numérique 9. http://doi.org/10.16995/dscn.2. Gasperson, Tina. “How to Move Your Site from Geocities to a New Host.” Tiplet: Expert Tips & Tech Support, 24 April 2009. http://tiplet.com/tip/ how-to-move-your-site-from-geocities-to-a-new-host/.

Bibliography

273

GeoCities. “About the Heartland Community Leaders.” Internet Archive Wayback Machine, 1 March 1997. http://web.archive.org/ web/19970301082611/http://www1.geocities.com/Heartland/7546/ hclabout.html. – “Athens’ Community Leaders.” Internet Archive Wayback Machine, 21 December 1996. http://web.archive.org/web/19961221091944/ http://www.geocities.com/Athens/9999. – “Augusta Award Application.” Oocities.org. http://www.oocities.org/ augusta/1020/birdform.htm. – “EnchantedForest Awards Page.” Internet Archive Wayback Machine, http://web.archive.org/web/20010721183753/http:/www.geocities.com/ EnchantedForest/Glade/3891/. – “The Eureka Awards.” Oocities.org, last updated 1999. http://www.oocities. org/eureka/4999/execclub/. – “GeoCities FAQ Page: General Information,” 21 December 1996. http://web. archive.org/web/19961221005714/http:/www.geocities.com/homestead/ FAQ/faqpage1.html. – “GeoCities FAQ Page: Home Page Information,” 21 December 1996. http://web.archive.org/web/19961221005714/http:/www.geocities.com/ homestead/FAQ/faqpage1.html. – “GeoCities Neighborhood Watch Program,” 13 April 1997. http://web. archive.org/web/19970413015812/http://www8.geocities.com/homestead/ neighbor_watch.html. – “The ‘Home Page’ Home Page,” 21 December 1996. http://web.archive.org/ web/19961221005656/http://www.geocities.com/Athens/2090/, accessed 12 August 2013. – “Introduction: The Elements of Web Page Style – Shady Oaks.” Internet Archive Wayback Machine, 1 March 1997. http://web.archive.org/ web/19970301083309/http://www1.geocities.com/Heartland/5419/ elements.htm. – “Page Content Guidelines and Member Terms of Service,” 21 December 1996. http://web.archive.org/web/19970413002711/http://www8.geocities. com:80/homestead/homeguide.html. – “Welcome to the EnchantedForest.” Internet Archive Wayback Machine. https://web.archive.org/web/20000619163933/http://www.geocities.com/ EnchantedForest/1004/welcome.html. Gibbs, Frederick, and Trevor Owens. “The Hermeneutics of Data and Historical Writing.” In Writing History in the Digital Age, ed. Kristen Nawrotzki and Jack Dougherty, 159–70. Ann Arbor: University of Michigan Press, 2013.

Gillies, James, and Robert Cailliau. How the Web Was Born: The Story of the World Wide Web. Oxford: Oxford University Press, 2000. Gillmor, Dan. “Future Historians Will Rely on Web.” Philadelphia Inquirer, 22 September 1996. Gitelman, Lisa. Paper Knowledge: Toward a Media History of Documents. Durham, NC: Duke University Press, 2014. Gitlin, Todd. The Sixties: Years of Hope, Days of Rage. New York: Bantam Books, 1987. Gleick, James. The Information: A History, a Theory, Flood. New York: Vintage, 2012. Goggin, Gerard, and Mark McLelland, eds. The Routledge Companion to Global Internet Histories. London: Routledge, 2017. Goldman, William. “All the President’s Men.” imsdb, 1975. https:// www.imsdb.com/scripts/All-the-President’s-Men.html. Gomes, Daniel, Joao Miranda, and Miguel Costa. “A Survey on Web Archiving Initiatives.” Proceedings of the 15th International Conference on Theory and Practice of Digital Libraries, 2011. http://sobre.arquivo.pt/ sobre-o-arquivo/a-survey-on-web-archiving-initiatives. Google Search. “Crawling and Indexing.” https://www.google.com/search/ howsearchworks/crawling-indexing/. GOV.UK. “Guidance on the Legal Deposit Libraries (Non-Print Works) Regulations 2013,” 5 April 2013. https://www.gov.uk/government/ publications/guidance-on-the-legal-deposit-libraries-non-print-worksregulations-2013. Government of Denmark. “Act on Legal Deposit of Published Material: The Royal Library, Translation of Act No. 1439,” 22 December 2004. http:// www.kb.dk/en/kb/service/pligtaflevering-ISSN/lov.html. Grafton, Anthony. “Loneliness and Freedom.” Perspectives on History: American Historical Association, March 2011. http://historians.org/ publications-and-directories/perspectives-on-history/march-2011/ loneliness-and-freedom. Graham, Shawn, Ian Milligan, and Scott Weingart. Exploring Big Historical Data: The Historian’s Macroscope. London: Imperial College Press, 2015. http://www.worldscientific.com/worldscibooks/10.1142/p981. Graham, Shawn, Scott Weingart, and Ian Milligan. “Getting Started with Topic Modeling and MALLET.” Programming Historian, 2 September 2012. http://programminghistorian.org/lessons/topic-modeling-and-mallet.html. Greenberg, Andy. This Machine Kills Secrets: Julian Assange, the Cypherpunks, and Their Fight to Empower Whistleblowers. Repr. New York: Plume, 2013. Greenwald, Glenn. No Place to Hide: Edward Snowden, the NSA, and the U.S. Surveillance State. New York: Metropolitan Books, 2014.

Grotke, Abbie. “Web Archiving at the Library of Congress.” Computers in Libraries, December 2011. http://www.infotoday.com/cilmag/dec11/ Grotke.shtml. Guldi, Jo, and David Armitage. The History Manifesto. Cambridge: Cambridge University Press, 2014. Haight, Michael, Anabel Quan-Haase, and Bradley A. Corbett. “Revisiting the Digital Divide in Canada: The Impact of Demographic Factors on Access to the Internet, Level of Online Activity, and Social Networking Site Usage.” Information, Communication & Society 17, no. 4 (21 April 2014): 503–19. https://doi.org/10.1080/1369118X.2014.891633. Hale, Scott A., Grant Blank, and Victoria D. Alexander. “Live versus Archive: Comparing a Web Archive to a Population of Web Pages.” In The Web as History, ed. Niels Brügger and Ralph Schroeder, 63–79. London: UCL, 2017. Hansell, Saul. “The Neighbourhood Business; GeoCities’ Cyberworld Is Vibrant, but Can It Make Money?” New York Times, 13 July 1998. Harding, Luke. “Timbuktu Mayor: Mali Rebels Torched Library of Historic Manuscripts.” Guardian, 28 January 2013. http://www.theguardian.com/ world/2013/jan/28/mali-timbuktu-library-ancient-manuscripts. Hargittai, Eszter. “Second-Level Digital Divide: Differences in People’s Online Skills.” First Monday 7, no. 4 (1 April 2002). http://firstmonday.org/ ojs/index.php/fm/article/view/942. Hargittai, Eszter, and Gina Walejko. “The Participation Divide: Content Creation and Sharing in the Digital Age.” Information, Communication & Society 11, no. 2 (1 March 2008): 239–56. https://doi. org/10.1080/13691180801946150. Hartzog, Woodrow, and Frederic D. Stutzman. “The Case for Online Obscurity.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, 23 February 2012. http://papers.ssrn.com/abstract=1597745. HathiTrust Research Center. “HathiTrust Digital Library | Millions of Books Online,” 2015. http://www.hathitrust.org/. Hickman, Leo. “The Mysterious Cable That Links the UK to the US.” Guardian, 23 October. http://www.theguardian.com/technology/2009/ oct/23/mysterious-cable-uk-us. High, Steven. Oral History at the Crossroads: Sharing Life Stories of Survival and Displacement. Vancouver: UBC Press, 2015. HILT. “Humanities Intensive Learning and Teaching Institute.” HILT homepage, 2015. http://www.dhtraining.org/hilt/. Hockx, Michael. “Web Archives and Chinese Literature.” UK Web Archive Blog (blog), 13 September 2012. http://britishlibrary.typepad.co.uk/ webarchive/2012/09/web-archives-and-chinese-literature.html.

Hoffman, Starr. “Development of the CyberCemetery (2011).” 25 September 2011. http://www.slideshare.net/geekyartistlibrarian/development-of-thecybercemetery-2011. Hu, Jim. “AOL Home Page Glitches Irk Users.” CNET News, 1 February 2002. http://news.cnet.com/2100-1023-827901.html. ICANN. “Stewardship of IANA Functions Transitions to Global Internet Community as Contract with U.S. Government Ends.” Icann.org, 1 October 2016. https://www.icann.org/news/announcement-2016-10-01-en. IIPC Programme and Communications Officers. “10 Years Anniversary of the Netarchive (Netarkivet), the Danish National Web Archive.” Netpreserveblog (blog), 8 July 2015. https://netpreserveblog.wordpress. com/2015/07/08/10-years-anniversary-of-the-netarchive-netarkivet-thedanish-national-web-archive/. International Council on Archives. “ISAD(G): General International Standard Archival Description.” International Council on Archives, 2000. https:// www.ica.org/sites/default/files/CBPS_2000_Guidelines_ISAD%28G%29_ Second-edition_EN.pdf. Internet Archive. “CDX Format.” https://archive.org/web/researcher/cdx_file_ format.php. – “The Internet Archive: Building an ‘Internet Library.’” Internet Archive, 20 May 2000. http://web.archive.org/web/20000520003204/http://www. archive.org/. – “Petabox.” https://archive.org/web/petabox.php. Internet Archive Global Events. “Ferguson, MO – 2014.” Archive-It Collections, August 2014. Accessed 12 September 2014. https://archive-it. org/collections/4783. Internet Archive Wayback Machine. “GeoCities Homesteading Program Information,” 22 February 1997. http://web.archive.org/ web/19970222174816/http://www1.geocities.com/homestead/. ISO. “ISO 28500:2009,” 2009. http://www.iso.org/iso/catalogue_detail. htm?csnumber=44717. Isserman, Maurice. If I Had a Hammer: The Death of the Old Left and the Birth of the New Left. New York: Basic Books, 1987. Jackson, Andrew. “Building a ‘Historical Search Engine’ Is No Easy Thing.” UK Web Archive Blog (blog), 19 February 2015. http://britishlibrary. typepad.co.uk/webarchive/2015/02/building-a-historical-search-engine-isno-easy-thing.html. – “Shine.” GitHub, 2 April 2014. https://github.com/ukwa/shine. Jackson, Andrew, Jimmy Lin, Ian Milligan, and Nick Ruest. “Desiderata for Exploratory Search Interfaces to Web Archives in Support of Scholarly Activities.” In Proceedings of the 16th ACM/IEEE-CS on Joint

Conference on Digital Libraries, 103–6. New York: ACM, 2016. https://doi. org/10.1145/2910896.2910912. Jockers, Matthew L. “The LDA Buffet Is Now Open; or, Latent Dirichlet Allocation for English Majors.” Matthew L. Jockers (blog), 29 September 2011. http://www.matthewjockers.net/2011/09/29/the-lda-buffet-is-nowopen-or-latent-dirichlet-allocation-for-english-majors/. Jordanova, Ludmilla. History in Practice. 2nd ed. London: Bloomsbury Academic, 2006. Kaplan, David E. “Masked Gunmen Seize Crimean Investigative Journalism Center.” Global Investigative Journalism Network, 2 March 2014. http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigativejournalism-center/. Karlins, David. Build Your Own Web Site. 1st edition. New York: McGraw-Hill Osborne Media, 2003. Katz, Michael B. The People of Hamilton, Canada West: Family and Class in a Mid-Nineteenth-Century City. Cambridge, MA: Harvard University Press, 1975. kdawson. “Archive Team Is Busy Saving Geocities.” Slashdot, 27 April 2009. http://tech.slashdot.org/story/09/04/27/2252227/archive-team-is-busysaving-geocities. Khazan, Olga. “Here’s What Was in the Torched Timbuktu Library.” Washington Post, 29 January 2013. http://www.washingtonpost.com/blogs/ worldviews/wp/2013/01/29/heres-what-was-in-the-torched-timbuktulibrary/. Kim, Dorothy. “Social Media and Academic Surveillance: The Ethics of Digital Bodies.” Model View Culture (blog), 7 October 2014. https:// modelviewculture.com/pieces/social-media-and-academic-surveillancethe-ethics-of-digital-bodies. Kimpton, Michele, and Jeff Ubois. “Year-by-Year: From an Archive of the Internet to an Archive on the Internet.” In Web Archiving, 201–12. Berlin: Springer, 2006. https://doi.org/10.1007/978-3-540-46332-0_9. Kirschenbaum, Matthew G. Mechanisms: New Media and the Forensic Imagination. Cambridge, MA: MIT Press, 2012. Klein, Ewan, Beatrice Alex, Claire Grover, Colin Coates, Aaron Quigley, Uta Hinrichs, James Reid et. al. “Trading Consequences: Final White Paper,” March 2014. http://tradingconsequences.blogs.edina.ac.uk/files/2014/03/ DiggingintoDataWhitePaper-final.pdf. Klingenstein, Sara, Tim Hitchcock, and Simon DeDeo. “The Civilizing Process in London’s Old Bailey.” Proceedings of the National Academy of Sciences 111, no. 26 (1 July 2014): 9419–24. https://doi.org/10.1073/ pnas.1405984111.

Koerbin, Paul. “Web Archiving: An Antidote to ‘Present Shock’?” National Library of Australia Blog, 18 March 2014. http://www.nla.gov.au/blogs/ web-archiving/2014/03/18/web-archiving-an-antidote-to-present-shock. Kosinski, Michal, David Stillwell, and Thore Graepel. “Private Traits and Attributes Are Predictable from Digital Records of Human Behavior.” Proceedings of the National Academy of Sciences, 11 March 2013. http://www. pnas.org/content/early/2013/03/06/1218772110. Kostash, Myrna. Long Way from Home: The Story of the Sixties Generation in Canada. Toronto: Lorimer, 1980. Koster, Martijn. “Important: Spiders, Robots and Web Wanderers.” WWW-Talk Listserv, 25 February 1994. http://1997.webhistory.org/www.lists/wwwtalk.1994q1/0717.html. Kuny, Terry. “A Digital Dark Ages? Challenges in the Preservation of Electronic Information.” 63rd IFLA Council and General Conference, 27 August 1997. http://archive.ifla.org/IV/ifla63/63kuny1.pdf. LaCalle, Maria, and Scott Reed. “Poster: The Occupy Web Archive: Is the Movement Still on the Live Web?” Washington, DC, 2014. Laursen, Ditte, and Per Møldrup-Dalum. “Keynote on the History of the Danish Web Archive.” Presented at the Web Archives as Scholarly Sources Conference, Aarhus University, Denmark), 9 June 2015. Lawrence, S., D.M. Pennock, G.W. Flake, R. Krovetz, F.M. Coetzee, E. Glover, F.A Nielsen et al. “Persistence of Web References in Scientific Research.” Computer 34, no. 2 (February 2001): 26–31. https://doi. org/10.1109/2.901164. Lawson, Mark. “Berners-Lee on the Read/Write Web (an Interview).” BBC, 9 August 2005. http://news.bbc.co.uk/2/hi/technology/4132752.stm. Leetaru, Kalev. “How Much of the Internet Does the Wayback Machine Really Archive?” Forbes, 16 November 2015. https://www.forbes.com/sites/ kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-waybackmachine-really-archive/#4a53bd809446. Leiner, Barry M., Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel et al. “A Brief History of the Internet.” SIGCOMM Computer Communication Review 39, no. 5 (October 2009): 22–31. https://doi.org/10.1145/1629607.1629613. Lesk, Michael. “Preserving Digital Objects: Recurrent Needs and Challenges.” lesk.com, 1995. http://www.lesk.com/mlesk/auspres/aus.html. Levitt, Cyril. Children of Privilege: Student Revolt in the Sixties: A Study of Student Movements in Canada, the United States, and West Germany. Toronto: University of Toronto Press, 1984. Lewis, Michael. Flash Boys: A Wall Street Revolt. New York: W.W. Norton, 2014. Lialina, Olia. “Some Remarks on #neocities @kyledrake.” One Terabyte of

Kilobyte Age (blog), 1 July 2013. http://blog.geocities.institute/archives/4012. Library and Archives Canada. “Library and Archives Canada Acquisition Update,” 3 May 2013. http://www.bac-lac.gc.ca/eng/news/news_releases/ Pages/2013/acquisitions-update.aspx. Library of Congress. “Library of Congress Collections Policy Statements Supplementary Guidelines: Web Archiving,” 2013. http://www.loc.gov/acq/ devpol/webarchive.pdf. Licklider, J.C.R., and Robert W. Taylor. “The Computer as a Communication Device.” Science and Technology, April 1968, 20–41. Lin, Jimmy. “My Data Is Bigger Than Your Data.” http://lintool.github.io/mydata-is-bigger-than-your-data/. Lin, Jimmy, and Ian Milligan. “Warcbase: Scaling ‘Out’ and ‘Down’ HBase for Web Archiving.” Presented at the HBaseCon 2015, San Francisco, 7 May 2015. http://www.slideshare.net/HBaseCon/ecosystem-session-5-49044346. Lin, Jimmy, Ian Milligan, Jeremy Wiebe, and Alice Zhou. “Warcbase: Scalable Analytics Infrastructure for Exploring Web Archives.” ACM Journal of Computing and Cultural Heritage 10, no. 4 (July 2017): 22:1–30. Lin, Yuri, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. “Syntactic Annotations for the Google Books Ngram Corpus.” In Proceedings of the ACL 2012 System Demonstrations, 169–74. Stroudsburg, PA: Association for Computational Linguistics, 2012. http://dl.acm.org/citation.cfm?id=2390470.2390499. Llewellyn, Kristina R., Alexander Freund, and Nolan Reilly. The Canadian Oral History Reader. Montreal and Kingston: McGill-Queen’s University Press, 2015. Logie, John. “Homestead Acts: Rhetoric and Property in the American West, and on the World Wide Web.” Rhetoric Society Quarterly 32, no. 3 (1 July 2002): 33–59. https://www.jstor.org/stable/3886008?seq=1#page_scan_tab_ contents. Lomborg, Stine. “Personal Internet Archives and Ethics.” Research Ethics 9, no. 1 (September 2012): 20–31. Lor, Peter, and Johannes J. Britz. “A Moral Perspective on South–North Web Archiving.” Journal of Information Science 30, no. 6 (1 December 2004): 540–49. https://doi.org/10.1177/0165551504047925. Luckerson, Victor. “What the Library of Congress Plans to Do with All Your Tweets.” Time, 25 February 2013. http://business.time.com/2013/02/25/ what-the-library-of-congress-plans-to-do-with-all-your-tweets/. Maemura, Emily, Nicholas Worby, Ian Milligan, and Christoph Becker. “If These Crawls Could Talk: Studying and Documenting Web Archives Provenance.” Journal of the Association for Information Science and Technology 69, no. 10 (October 2018): 1223–33.

Mailland, Julien, and Kevin Driscoll. Minitel: Welcome to the Internet. Cambridge: MIT Press, 2017. Manovich, Lev. “Guide to Visualizing Video and Image Sequences.” Google, 30 March 2012. https://docs.google.com/document/ d/1PqSZmKwQwSIFrbmVi-evbStTbt7PrtsxNgC3W1oY5C4/edit. Markham, Annette, and Elizabeth Buchanan. “Ethical Decision-Making and Internet Research: Recommendations from the AOIR Ethics Working Committee (Version 2.0).” AOIR, September 2012. http://aoir.org/reports/ ethics.pdf. Markoff, John. “An Internet Pioneer Ponders the Next Revolution.” New York Times, 20 December 1999. http://partners.nytimes.com/library/tech/99/12/ biztech/articles/122099outlook-bobb.html. Masse, Bryson. “Why Internet Is Expensive in Canada’s North.” VICE Money, 15 March 2017. https://news.vice.com/en_ca/article/j5d3jg/why-Internet-isexpensive-in-canadas-north. McArdle, Megan. “People Are Getting Fired for Old Bad Tweets. Here’s How to Fix It.” Washington Post, 24 July 2018. https://www.washingtonpost. com/opinions/we-need-a-statute-of-limitations-on-bad-tweets/2018/07/24/ a84e335c-8f7d-11e8-b769-e3fff17f0689_story.html?utm_term=. d551160ff253. McKay, Ian, and Jamie Swift. Warrior Nation: Rebranding Canada in an Age of Anxiety. Toronto: Between the Lines, 2012. Meloan, Steve. “No Way to Run a Culture.” Wired, 13 February 1998. http:// web.archive.org/web/20000619001705/http://www.wired.com/news/ culture/0,1284,10301,00.html. Memento. “About the Time Travel Service,” 27 April 2015. http://timetravel. mementoweb.org/about/. Merity, Stephen. “Navigating the WARC File Format.” Common Crawl (blog), 2 April 2014. http://blog.commoncrawl.org/2014/04/navigating-the-warcfile-format/. Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg et al. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331, no. 6014 (14 January 2011): 176–82. https://doi.org/10.1126/science.1199644. Milligan, Ian. “‘A Haven for Perverts, Criminals, and Goons’: Children and the Battle for and against Canadian Internet Regulation, 1991–1999.” Histoire Sociale / Social History 47, no. 96 (2015): 245–74. https://hssh. journals.yorku.ca/index.php/hssh/article/view/40403. – “Illusionary Order: Online Databases, Optical Character Recognition, and Canadian History, 1997–2010.” Canadian Historical Review 94, no. 4 (2013): 540–69.

– “In a Rush to Modernize, MySpace Destroyed More History.” ActiveHistory. ca, 17 June 2013. http://activehistory.ca/2013/06/myspace-is-cool-again-toobad-they-destroyed-history-along-the-way/. – “Learning to See the Past at Scale: Exploring Web Archives through Hundreds of Thousands of Images.” In Seeing the Past with Computers: Experiments with Augmented Reality and Computer Vision for History, ed. Kevin Kee and Timothy J. Compeau, 116–36. Ann Arbor: University of Michigan Press, 2019. – “Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives.” International Journal of Humanities and Arts Computing 10, no. 1–2 (2016): 87–94. – “Mining the ‘Internet Graveyard’: Rethinking the Historians’ Toolkit.” Journal of the Canadian Historical Association 23, no. 2 (2012): 21–64. https://doi.org/10.7202/1015788ar. – “Preserving History as It Happens: The Internet Archive and the Crimean Crisis.” ActiveHistory.ca, 25 March 2014. http://activehistory.ca/2014/03/ preserving-history-as-it-happens/. – Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada. Vancouver: UBC Press, 2014. – “Welcome to the Web: The Online Community of GeoCities and the Early Years of the World Wide Web.” In The Web as History, ed. Niels Brügger and Ralph Schroeder, 137–58. London: UCL, 2017. Milligan, Ian, and James Baker. “Introduction to the Bash Command Line.” Programming Historian, 20 September 2014. https:// programminghistorian.org/en/lessons/intro-to-bash. Milligan, Ian, Nick Ruest, and Anna St.Onge. “The Great WARC Adventure: Using SIPS, AIPS and DIPS to Document SLAPPs.” Digital Studies / Le Champ Numérique 6 (31 March 2016). http://www.digitalstudies.org/ojs/ index.php/digital_studies/article/view/325. Millward, Gareth. “I Tried to Use the Internet to Do Historical Research. It Was Nearly Impossible.” Washington Post, 17 February 2015. https:// www.washingtonpost.com/posteverything/wp/2015/02/17/i-tried-to-usethe-internet-to-do-historical-research-it-was-nearly-impossible/?utm_ term=.36cf55abad16. Minard, Jonathan. Internet Archive. Online Video, 2013. http://vimeo. com/59207751. Modine, Austin. “Web 0.2 Archivists Save Geocities from Deletion.” Register, 28 April 2009. http://www.theregister.co.uk/2009/04/28/geocities_ preservation/. Mohr, Gordon, Michael Stack, Igor Ranitovic, Dan Avery, and Michele Kimpton. “An Introduction to Heritrix: An Open Source Archival

Quality Web Crawler.” 4th International Web Archiving Workshop, 2004. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.6877& rep=rep1&type=pdf. Montello, Daniel R., Sara Irina Fabrikant, Marco Ruocco, and Richard S. Middleton. “Testing the First Law of Cognitive Geograpy on Point-Display Spatializations.” In Spatial Information Theory: Foundations of Geographic Information Science, ed. Walter Kuhn, Michael F. Worboys, and Sabine Timpf, 316–31. Lecture Notes in Computer Science 2825. Berlin: Springer, 2003. http://link.springer.com/chapter/10.1007/978-3-540-39923-0_21. Moretti, Franco. Graphs, Maps, Trees: Abstract Models for Literary History. London: Verso, 2007. Morrison, Aimée. “‘Suffused by Feeling and Affect’: The Intimate Public of Personal Mommy Blogging.” Biography 34, no. 1 (Winter 2011): 37–55. Mosby, Ian. Food Will Win the War: The Politics, Culture, and Science of Food on Canada’s Home Front. Vancouver: UBC Press, 2014. Moschovitis, Christos J.P. History of the Internet: A Chronology, 1843 to the Present. Santa Barbara, CA: ABC-CLIO, 1999. Motavalli, John. Bamboozled at the Revolution: How Big Media Lost Billions in the Battle for the Internet. New York: Penguin, 2004. Munroe, Randall. “Google’s Datacenters on Punch Cards.” xkcd.com, 2013. http://what-if.xkcd.com/63/. National Digital Information Infrastructure and Preservation Program. “Preserving Our Digital Heritage: The National Digital Information Infrastructure and Preservation Program 2010 Report: A Collaborative Initiative of the Library of Congress.” DigitalPreservation.gov, 2010. http:// www.digitalpreservation.gov/documents/NDIIPP2010Report_Post.pdf. National Library of Australia. “History and Achievements.” Pandora: Australia’s Web Archive, 18 February 2009. http://pandora.nla.gov.au/ historyachievements.html. National Library of Korea. “A Web Archiving System of the National Library of Korea: OASIS.” Conference of Directors of National Libraries in Asia and Oceania, March 2007. https://web.archive.org/web/20130206012147/ http://www.ndl.go.jp/en/cdnlao/newsletter/058/583.html. Nelson, Michael L., Ahmed AlSum, Michele C. Weigle, Herbert Van de Sompel, and David Rosenthal. “Profiling Web Archives.” Presented at the IIPC General Assembly, Paris, 21 May 2014. https://www.slideshare.net/ phonedude/profiling-web-archives. Nelson, Theodor Holm. Literary Machines. Swarthmore, PA: Mindful, 1980. Netarkivet. “Retningslinjer for Adgang Til Netarkivet,” 2015. http://netarkivet. dk/wp-content/uploads/Retningslinjer-for-adgang-til-Netarkivet.pdf.

Nicholson, Denise Rosemary. “Legal Deposit in South Africa: Transformation in a Digital World.” Cape Town, South Africa, 2015. http://library.ifla. org/1127/. Niu, Jinfang. “An Overview of Web Archiving.” D-Lib Magazine 18, no. 3/4 (March 2012). https://doi.org/10.1045/march2012-niu1. Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. New York: NYU Press, 2018. Novick, Peter. That Noble Dream: The “Objectivity Question” and the American Historical Profession. Cambridge: Cambridge University Press, 1988. Ocamb, Karen. “David Bohnett: Social Change through Community Commitment.” Frontiers, 16 October 2012. O’Dell, Allison Jai. “Describing Web Collections.” Medium, 17 February 2015. https://medium.com/@allisonjaiodell/describing-web-collectionse32b59893848. Ogden, Jessica, Susan Halford, and Leslie Carr. “Observing Web Archives: The Case for an Ethnographic Study of Web Archiving.” In Proceedings of the 2017 ACM on Web Science Conference, 299–308. New York: ACM, 2017. https://doi.org/10.1145/3091478.3091506. Old Bailey. “The Proceedings of Old Bailey, London’s Central Criminal Court, 1674–1913,” 2018. https://www.oldbaileyonline.org. O’Neil, Cathy. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. New York: Crown, 2016. On the Media. “Calling for Back Up,” 2014. http://www.onthemedia.org/ story/calling-back/?utm_source=sharedUrl&utm_media=metatag&utm_ campaign=sharedUrl. Osterberg, Gayle. Update on the Twitter Archive at the Library of Congress | Library of Congress Blog. 4 January 2013. http://blogs.loc.gov/loc/2013/01/ update-on-the-twitter-archive-at-the-library-of-congress/. Our Marathon. “Our Stories, Our Strength, Our Marathon,” 2018. https:// marathon.library.northeastern.edu. Owens, Trevor. “Digital Preservation’s Place in the Future of the Digital Humanities.” Trevor Owens: User Centered Digital History (blog), 18 March 2014. http://www.trevorowens.org/2014/03/digital-preservations-place-inthe-future-of-the-digital-humanities/. – “What Do You Mean by Archive? Genres of Usage for Digital Preservers | The Signal,” 27 February 2014. http://blogs.loc.gov/digitalpreservation/ 2014/02/what-do-you-mean-by-archive-genres-of-usage-for-digitalpreservers/. Owram, Doug. Born at the Right Time: A History of the Baby Boom Generation. Toronto: University of Toronto Press, 1997.

Pabón Cadavid, Jhonny Antonio, Johnkhan Sathik Basha, and Gandhimani Kaleeswaran. “Legal and Technical Difficulties of Web Archival in Singapore.” Singapore, 2013. http://library.ifla.org/217/. Palmer, Sean B. “Earliest Web Screenshots.” inamidst.com, December 2010. http://inamidst.com/stuff/web/screens. Pariser, Eli. The Filter Bubble: How the Personalized Web Is Changing What We Read and How We Think. New York: Penguin, 2011. Pearce-Moses, Richard. “A Glossary of Archival and Records Terminology.” Society of American Archivists, 2005. http://files.archivists.org/pubs/free/ SAA-Glossary-2005.pdf. Pereira, Ângela, Alessia Ghezzi, and Lucia Vesnic-Alujevic, eds. The Ethics of Memory in a Digital Age: Interrogating the Right to Be Forgotten. Houndmills, Basingstoke: Palgrave Macmillan, 2014. Perrin, Andrew. “One-Fifth of Americans Report Going Online ‘Almost Constantly.’” Pew Research Center (blog), 8 December 2015. http://www. pewresearch.org/fact-tank/2015/12/08/one-fifth-of-americans-report-goingonline-almost-constantly/. Perrin, Andrew, and Maeve Duggan. “Americans’ Internet Access: 2000–2015.” Pew Research Center: Internet, Science & Tech (blog), 26 June 2015. http:// www.pewInternet.org/2015/06/26/americans-Internet-access-2000-2015/. Peters, Benjamin. How Not to Network a Nation: The Uneasy History of the Soviet Internet. Cambridge, MA: MIT Press, 2017. Pew Research Center. “Internet/Broadband Fact Sheet,” 12 January 2017. http://www.pewinternet.org/fact-sheet/internet-broadband/. PR Newswire. “GeoCities Welcomes One Millionth ‘Homesteader’ Pioneer in Online Community Doubles Active Membership in Six Months.” Financial News, 9 October 1997. Accessed via Lexis|Nexis. Putnam, Lara. “The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast.” American Historical Review 121, no. 2 (1 April 2016): 377–402. https://doi.org/10.1093/ahr/121.2.377. Putnam, Robert. Bowling Alone: The Collapse and Revival of American Community. New York: Simon & Schuster, 2000. Rauber, Andreas, Max Kaiser, and Bernhard Wachter. “Ethical Issues in Web Archive Creation and Usage: Towards a Research Agenda.” CiteSeer, 2008. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.145.4339. Reed, Scott. “Introducing Archive-It 4.9 and Umbra.” Archive-It Blog (blog), 13 March 2014. https://archiveitblog.wordpress.com/2014/03/13/introducingarchive-it-4-9-and-umbra/.

Resnick, Brian. “Researchers Just Released Profile Data on 70,000 OkCupid Users without Permission.” Vox, 12 May 2016. http://www.vox. com/2016/5/12/11666116/70000-okcupid-users-data-release. Reudiger, Dylan. “Another Tough Year for the Academic Job Market in History.” Perspectives in History, 16 November 2017. https://www.historians. org/publications-and-directories/perspectives-on-history/november-2017/ another-tough-year-for-the-academic-job-market-in-history. Rheingold, Howard. The Virtual Community: Homesteading on the Electronic Frontier. Cambridge, MA: MIT Press, 2000. http://www.rheingold.com/vc/ book/intro.html. Ridener, John, and Terry Cook. From Polders to Postmodernism: A Concise History of Archival Theory. Duluth, MI: Litwin Books, 2009. Ridey, Roger. “Roger Widey Travels under the Volcano, and Also Discovers a Web Full of Creepy-Crawlies.” Independent, 12 February 1996. Ritchie, Don. “Should We ‘Consent’ to Oral History?” Oxford University Press Blog, 13 October 2015, https://blog.oup.com/2015/10/oral-history-federalregulation/. “Robert C. Binkley.” 2018. https://www.wallandbinkley.com/rcb/. Robertson, Stephen. “The Differences between Digital History and Digital Humanities,” 23 May 2014. http://drstephenrobertson.com/2014/05/23/ the-differences-between-digital-history-and-digital-humanities/. Rockwell, Geoffrey, and Stéfan Sinclair. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge, MA: MIT Press, 2016. Rogers, Richard. Digital Methods. Cambridge, MA: MIT Press, 2013. Romano, Aja. “The ‘Controversy’ over Journalist Sarah Jeong Joining the New York Times, Explained.” Vox, 3 August 2018. https://www.vox. com/2018/8/3/17644704/sarah-jeong-new-york-times-tweets-backlash-racism. Rosen, Rebecca J. “59% of Young People Say the Internet Is Shaping Who They Are.” Atlantic, 27 June 2012. https://www.theatlantic.com/technology/ archive/2012/06/59-of-young-people-say-the-internet-is-shaping-who-theyare/259022/. – “Plan a Trip through History with ORBIS, a Google Maps for Ancient Rome.” Atlantic, 23 May 2012. http://www.theatlantic.com/technology/ archive/2012/05/plan-a-trip-through-history-with-orbis-a-google-maps-forancient-rome/257554/. Rosenthal, David. “You Get What You Get and You Don’t Get Upset.” DSHR’s Blog, 19 November 2015. http://blog.dshr.org/2015/11/you-get-what-youget-and-you-dont-get.html. Rosenzweig, Roy, and Anthony Grafton. Clio Wired: The Future of the Past in the Digital Age. New York: Columbia University Press, 2011.

Rossi, Alexis. “80 Terabytes of Archived Web Crawl Data Available for Research.” Internet Archive Blogs, 26 October 2012. http://blog.archive. org/2012/10/26/80-terabytes-of-archived-web-crawl-data-available-forresearch/. – “Robots.txt Files and Archiving .gov and .mil Websites.” Internet Archive Blogs, 17 December 2016. https://blog.archive.org/2016/12/17/robots-txtgov-mil-websites/. Rudder, Christian. Dataclysm: Who We Are (When We Think No One’s Looking). Toronto: Random House Canada, 2014. Rumsey, Abby Smith. When We Are No More: How Digital Memory Is Shaping Our Future. London: Bloomsbury, 2016. Rutner, Jennifer, and Roger C. Schonfeld. “Supporting the Changing Research Practices of Historians.” Ithaka S+R, 10 December 2012. http:// www.sr.ithaka.org/wp-content/mig/reports/supporting-the-changingresearch-practices-of-historians.pdf. Ryan, Johnny. “The Essence of the ’Net: A History of the Protocols That Hold the Network Together.” Ars Technica, 8 March 2011. http://arstechnica. com/tech-policy/news/2011/03/the-essence-of-the-net.ars. – A History of the Internet and the Digital Future. London: Reaktion Books, 2011. SalahEldeen, Hany M., and Michael L. Nelson. “Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?” arXiv:1209.3026 [Cs], 13 September 2012. http://arxiv.org/abs/1209.3026. Salganik, Matthew J. Bit By Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press, 2018. Sanderson, Robert, Mark Phillips, and Herbert Van de Sompel. “Analyzing the Persistence of Referenced Web Resources with Memento.” arXiv:1105.3459 [Cs], 17 May 2011. http://arxiv.org/abs/1105.3459. Sawyer, Ben, and Dave Greely. Creating Geocities Websites. Cincinnati, OH: Music Sales, 1999. Scheidel, Walter, Elijah Meeks, and Jonathan Weiland. “ORBIS: The Stanford Geospatial Network Model of the Roman World,” 2012. http://orbis. stanford.edu/#. Schmidt, Eric, and Jared Cohen. The New Digital Age: Reshaping the Future of People, Nations and Business. New York: Knopf, 2013. Schradie, Jen. “The Digital Production Gap: The Digital Divide and Web 2.0 Collide.” Poetics 39, no. 2 (April 2011): 145–68. https://doi.org/10.1016/j. poetic.2011.02.003. Schwartz, John. “New Economy; A Library of Web Pages That Warms the Cockles of the Wired Heart and Beats the Library of Congress for Sheer Volume.” New York Times, 29 October 2001. http://www.nytimes. com/2001/10/29/technology/ebusiness/29NECO.html.

Scola, Nancy. “Library of Congress’ Twitter Archive Is a Huge #FAIL.” POLITICO, 11 July 2015. http://www.politico.com/story/2015/07/library-ofcongress-twitter-archive-119698.html. Scott, Jason. BBS: The Documentary. http://www.bbsdocumentary.com/. – “Datapocalypso.” ASCII (blog), 5 January 2009. http://ascii.textfiles.com/ archives/1649. – “Eviction, or the Coming Datapocalypse.” ASCII (blog), 21 December 2008. http://ascii.textfiles.com/archives/1617. – “Geocities: Lessons So Far.” ASCII (blog), 26 April 2009. http://ascii.textfiles. com/archives/1961. – “The Geocities Torrent: Patched and Posted.” ASCII (blog), 6 April 2011. http://ascii.textfiles.com/archives/3046. – “Please Be Patient: This Page Is under Construction!,” 2015. http://www. textfiles.com/underconstruction/. – “Unpublished Article on Geocities.” ASCII (blog), 1 December 2009. http://ascii.textfiles.com/archives/2402. Scott, Katie. “Archiving Britain’s Web: The Legal Nightmare Explored.” Wired. Co.Uk, 5 March 2010. https://web.archive.org/web/20160324160612/http:// www.wired.co.uk/news/archive/2010-03/05/archiving-britains-web-the-legalnightmare-explored. Simpson, Jeffrey. The Friendly Dictatorship. Toronto: McClelland & Stewart, 2001. Song, Sangchul, and Joseph JaJa. “Fast Browsing of Archived Web Contents.” In Proceedings of the 8th International Web Archiving Workshop, 2008. https://wiki.umiacs.umd.edu/adapt/images/4/47/Iwaw08_Song_JaJa_ Final.pdf. Statistics Canada. “Canadian Internet Use Survey,” 25 May 2011. http:// www.statcan.gc.ca/daily-quotidien/110525/dq110525b-eng.htm. – “Household Internet Use Survey,” 8 September 2003. http://www.statcan. gc.ca/daily-quotidien/030918/dq030918b-eng.htm. Stern, Hunter. “Web Archive Transformation (WAT) Specification, Utilities, and Usage Overview.” Internet Research, 13 June 2011. https://webarchive. jira.com/wiki/display/Iresearch/Web+Archive+Transformation +%28WAT%29+Specification,+Utilities,+and+Usage+Overview. Stimson, Hugh. “Jason Scott Is in Your Geocities, Rescuing Your Sh*t.” Hughstimson.org (blog), 27 April 2009. http://hughstimson.org/2009/04/27/ jason-scott-is-in-your-geocities-rescuing-your-sht/. Stirling, Peter, Gildas Illien, Pascal Sanz, and Sophie Sepetjan. “The State of E-Legal Deposit in France: Looking Back at Five Years of Putting New Legislation into Practice and Envisioning the Future.” San Juan, Puerto Rico, 2011. http://conference.ifla.org/past-wlic/2011/193-stirling-en.pdf.

Stone, Biz. “Tweet Preservation,” 14 April 2010. https://blog.twitter.com/2010/ tweet-preservation. Suda, Brian. “CERN: Line Mode Browser.” optional.is, 25 September 2013. http://optional.is/required/2013/09/25/cern-line-mode-browser/. – “Meyrin: CERN Terminal Font.” optional.is, 26 March 2014. http://optional. is/required/2014/03/26/meyrin-cern-terminal-font/. Summers, Ed. “A Ferguson Twitter Archive.” Inkdroid (blog), 30 August 2014. http://inkdroid.org/journal/2014/08/30/a-ferguson-twitter-archive/. Tansey, Eira. “My Talk at Personal Digital Archiving 2015.” Eiratansey.com (blog), 16 August 2015. http://eiratansey.com/2015/08/16/my-talk-atpersonal-digital-archiving-2015/. Taylor, Arnold, and Lauren Tilton. Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text. New York: Springer, 2015. Taylor, Astra. The People’s Platform: Taking Back Power and Culture in the Digital Age. New York: Metropolitan Books, 2014. Taylor, Joanna, and Claudia Pagliari. “Mining Social Media Data: How Are Research Sponsors and Researchers Addressing the Ethical Challenges?” Research Ethics, 26 October 2017. https://doi.org/10.1177/1747016117738559. Taylor, Nicholas. “The Average Lifespan of a Webpage.” Signal, 8 November 2011. http://blogs.loc.gov/digitalpreservation/2011/11/the-averagelifespan-of-a-webpage/. Theimer, Kate. “Archives in Context and as Context.” Journal of Digital Humanities, 26 June 2012. http://journalofdigitalhumanities.org/1-2/ archives-in-context-and-as-context-by-kate-theimer/. Thibodeau, Kenneth. “What Does It Mean to Preserve Digital Objects?” Washington, DC: Council on Library and Information Resources, 2002. http://www.clir.org/pubs/reports/pub107/thibodeau.html. Thielman, Sam, and Chris Johnston. “Major Cyber Attack Disrupts Internet Service across Europe and US.” Guardian, 21 October 2016. https://www. theguardian.com/technology/2016/oct/21/ddos-attack-dyn-internet-denialservice. Thomson, Alistair, ed. The Oral History Reader. 2nd ed. New York: Routledge, 2006. timothy. “Yahoo Pulls the Plug on GeoCities.” Slashdot.org, 23 April 2009. http://tech.slashdot.org/story/09/04/23/2339224/yahoo-pulls-the-plug-ongeocities. Toyoda, Masashi, and Masaru Kitsuregawa. “The History of Web Archiving.” Proceedings of the IEEE (Institute of Electrical and Electronics Engineers) 100 (13 May 2012): 1441–3. Tsukayama, Hayley. “CERN Reposts the World’s First Web Page.” Washington Post, 1 May 2013. http://www.washingtonpost.com/business/technology/

cern-reposts-the-worlds-first-web-page/2013/04/30/d8a70128-b1ac-11e2bbf2-a6f9e9d79e19_story.html. Turchin, Peter. “Arise ‘Cliodynamics.’” Nature 454, no. 7200 (3 July 2008): 34–5. https://doi.org/10.1038/454034a. Turkel, William J., Kevin Kee, and Spencer Roberts. “A Method for Navigating the Infinite Archive.” In History in the Digital Age, ed. Toni Weller, 61–75. New York: Routledge, 2013. Turner, Fred. From Counterculture to Cyberculture: Stewart Brand, the Whole Earth Network, and the Rise of Digital Utopianism. Chicago: University Of Chicago Press, 2008. Twitter. “How to Contact Twitter about a Deceased User.” Twitter Support, 2018. https://help.twitter.com/en/rules-and-policies/contact-twitter-abouta-deceased-family-members-account. University of Toronto. “Canadian Political Parties and Political Interest Groups,” 2015. https://archive-it.org/collections/227. US Department of Health & Human Services. The Belmont Report, 18 April 1979. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/ read-the-belmont-report/index.html. Vaidhyanathan, Siva. The Googlization of Everything. Berkeley: University of California Press, 2011. Van de Sompel, Herbert, Michael L. Nelson, Robert Sanderson, Lyudmila L. Balakireva, Scott Ainsworth, and Harihar Shankar. “Memento: Time Travel for the Web.” arXiv:0911.1112 [cs], 5 November 2009. http://arxiv. org/abs/0911.1112. Vargas, Plinio. “Link to Web Archives, Not Search Engine Caches.” Web Science and Digital Libraries Research Group (blog), 2 January 2018. http:// ws-dl.blogspot.com/2018/01/2018-01-02-link-to-web-archives-not.html. Vefsafn.is. “The Icelandic Archive.” https://vefsafn.is/index.php?page=english. Walker, Katherine. “‘It’s Difficult to Hide It’: The Presentation of Self on Internet Home Pages.” Qualitative Sociology 23, no. 1 (1 March 2000): 99–120. https://doi.org/10.1023/A:1005407717409. WARP. “Let’s WARP to the Past of the Web,” 2015. http://warp.da.ndl.go.jp/ info/WARP_en.html. Wayback. “My Site’s Not Archived! How Can I Add It?” 21 December 2010 (updated 7 February 2011). Wayback Blog. http://seattledo.com/index-25.php. Web.archive.org. “GeoCities Will Close Later This Year,” 26 April 2009. http:// web.archive.org/web/20090426180227/http:/help.yahoo.com/l/us/yahoo/ geocities/geocities-05.html Webster, Peter. “Users: Technologies, Organisations: Towards a Cultural History of World Web Archiving.” In Web 25: Histories from 25 Years of the World Wide Web, ed. Niels Brügger, 175–90. New York: Peter Lang Publishing, 2017.

– “When Using an Archive Could Put It in Danger.” Webstory: Peter Webster’s Blog (blog), 31 August 2015. https://peterwebster.me/2015/08/31/whenusing-an-archive-could-put-it-in-danger/. Weinberger, Sharon. The Imagineers of War: The Untold Story of DARPA , the Pentagon Agency That Changed the World. New York: Vintage, 2017. Weltevrede, Esther, and Anne Helmond. “Where Do Bloggers Blog? Platform Transitions within the Historical Dutch Blogosphere.” First Monday 17, no. 2 (2 February 2012). http://journals.uic.edu/ojs/index.php/fm/article/ view/3775. Whitacre, Brian. “Technology Is Improving: Why Is Rural Broadband Access Still a Problem?” Conversation, 8 June 2016. http://theconversation. com/technology-is-improving-why-is-rural-broadband-access-still-aproblem-60423. White House. “Statement by the President,” 7 June 2013. https:// obamawhitehouse.archives.gov/the-press-office/2013/06/07/statementpresident. Wilson, Kelly. “Attention AOL Hometown Users – United States.” Wayback Machine, 2 October 2008. http://web.archive.org/web/20081002141923/ http://www.peopleconnectionblog.com/2008/08/04/attention-aolhometown-users-united-states/. Windrum, Paul. “Back from the Brink: Microsoft and the Strategic Use of Standards in the Browser Wars.” Research memorandum. Maastricht University, Maastricht Economic Research Institute on Innovation and Technology, 2000. http://econpapers.repec.org/paper/ unmumamer/2000005.htm. Winter, Thomas Nelson. “Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance.” Classical Bulletin 75, no. 1 (1999): 3–20. Winters, Jane. “Breaking in to the Mainstream: Demonstrating the Value of Internet (and Web) Histories.” Internet Histories 1, no. 1–2 (2017): 173–9. – “Coda: Web Archives for Humanities Research: Some Reflections.” In The Web as History, ed. Niels Brügger and Ralph Schroeder, 238–48. London: UCL, 2017. World Bank. “Internet Users (per 100 People).” World Bank Data, 2016. https://data.worldbank.org/indicator/it.net.user.zs. World Internet Project. “World Internet Project,” 2017. https://www. worldinternetproject.com. w3schools. “HTML Tag,” 2015. http://www.w3schools.com/tags/ tag_meta.asp. Zacharek, Stephanie. “Addicted to eBay.” Salon.com, 30 December 1999. http://www.salon.com/1999/12/30/feature_237/.

Zanganeh, Lila Azam. “Has the Great Library of Timbuktu Been Lost?” New Yorker, 29 January 2013. http://www.newyorker.com/news/news-desk/ has-the-great-library-of-timbuktu-been-lost. Zimmer, Michael. “The Twitter Archive at the Library of Congress: Challenges for Information Practice and Information Policy.” First Monday 20, no. 7 (21 June 2015). http://firstmonday.org/ojs/index.php/fm/article/ view/5619. Zittrain, Jonathan, Kendra Albert, and Lawrence Lessig. “Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations.” SSRN Scholarly Paper. Rochester, NY: Social Science Research Network, 1 October 2013. http://papers.ssrn.com/abstract=2329161.

INDEX

11 September terrorist attacks, 14, 70, 77, 168, 236 100years100stories.ca, 229, 230 404 Not Found pages, 77, 78, 163 1960s, study of as history, 20, 236–7 1990s, as history, 20, 61, 236, 241 Abbasid Caliphate, 95–6 accessibility: vs authenticity, 117; of cluster computing, 154–5; of computing resources, 154; of content to content creators, 93–4; of digital collections, 149–50; of GeoCities, 172; of internet, 44–5, 52, 81–3, 87, 236; of Internet Archive, 76–7, 216; of legal deposit collections, 159–69, 201–2; push towards, 28, 209, 214–18; technical expertise and, 219; tools for, 153, 169–70, 214– 16; of traditional archives, 196, 222–3; of the Wayback Machine, 148, 216; of WebARChive (WARC) files, 222–3; of web archives, 6, 76–7, 147–8, 200–2; Western models of, 200, 202 acid-free paper, 73, 243 acquisition policies, 53, 57 Advanced Research Projects Agency

(ARPA), 33, 37, 41. See also ARPANET Afghani Taliban, 211 Africa: digital preservation in, 166–7 African Americans, 201 Ainsworth, Scott, 111 Alexa Internet, 74 Alexa ranking service, 17–18, 80, 84, 85 algorithms: changing nature of, 23; vs conscious acquisition policies, 57; gaming of ranking, 133–4, 187; in humanities research, 53–4; literacy in, 148, 222, 235, 241, 245; Python and, 221; role in web archiving, 53, 84; subjectivity of, 22, 59; transparent use of, 84, 86, 155. See also PageRank Alito, Justice Samuel, 158 All the President’s Men, 29, 49 AltaVista, 194 Amazon, 89, 150, 155 American Historical Association: on archives, 106; Career Center, 238; on Culturomics, 239–40; promotion of digital history by, 237, 241; promotion of interdisciplinarity by, 242



America Online. See AOL analog formats. See traditional records analytical tools, 214, 216–18, 223–32, 233, 235. See also Archives Unleashed Toolkit; digital tools Angelfire, 6, 174 Annales school, 55 AOL (America Online), 89, 99–100, 176 Apache software, 125, 154 Apple, 118 ARC files, 144 architectural preservation, 95 Archived Web, The (Niels Brügger), 25 Archive-It, 22, 148–51; accessing WARC files in collections by, 223; Archives Unleashed Cloud and, 224; Artist-Run Centres in Halifax, Nova Scotia collection, 230, 231; Canadian Government Information collection, 86–7; Halifax Explosion collection, 225, 226–7, 229, 230; limitations of, 23, 150, 152; Occupy Wall Street collection, 78–9; selection bias and, 85; storage capacity of, 215; as subscription-based service, 215; use of Heritrix by, 214–15; Wayback Machines for, 144. See also Canadian Political Parties and Interest Groups collection archives, 67–8; archival description, 71–2; categories of, 159; relationship of historians with, 24, 67, 106, 213; role of archivists, 7, 23–4; traditional vs web, 70–2, 88, 94, 200–2. See also national libraries; traditional archives; web archives

Archives Unleashed Cloud, 216, 224, 228, 230 Archives Unleashed Toolkit, 155–8; analytical tools of, 224; changes in, 234–5; hackathon, 183; hyperlink extraction using, 230; image extraction using, 232; network mapping using, 228; for text extraction, 226 Archive Team, 100–2, 171, 205, 217 “Arguing with Digital History,” 238 ARPANET, 33, 38, 39, 40, 41 artificial intelligence, 233 Artist-Run Centres in Halifax, Nova Scotia collection, 230, 231 Ashley Madison website, 201 Asia: national web archiving in, 167 Association of Internet Researchers (AoIR), 203–4, 206 Athabasca Oil Sands, 153 Athens neighbourhood (GeoCities), 181, 186, 190 Australia, digital preservation efforts by, 75–6, 243 Bady, Aaron, 200, 203 Bancroft Library (University of California, Berkeley), 77 bandwidth, 92 Baran, Paul, 38 Bash command line, 220–1 BBSs (bulletin board systems), 173–4, 191 Belmont Report, 198 Berners-Lee, Tim, 43, 61, 63, 79, 176 Bernstein, Carl, 29, 49 Beverly Hills Internet, 174 Bibliotheca Alexandrina, 166 Bibliothèque nationale de France, 164–5


Big Data: compared to web archives, 154; contextualization of, 56–7; definition of, 19; importance for historical study, 243; limitations of, 16; platforms, 143, 154–5; research methods for, 121–32; study of, 56; tools for leveraging, 213–18 Big History, 55–6 Big UK Domain Data for the Arts and Humanities (BUDDAH), 25, 150 Binkley, Robert C., 73 “black” Twitter, 201 Blank, Grant, 83–4 , 114, 244 blogs, 50, 131–2, 153, 172 Bohnett, David, 174, 181 Bolt, Beranek, and Newman (BBN), 39 boyd, danah, 203 Braudel, Fernand, 55–6 British Library, 159–64; access to, 169; digital searching at, 123–4, 150–1, 161; legal deposit in, 52; photography at, 162–3; traditional vs born-digital sources in, 160–1, 162–4; web crawls by, 25, 62 Brown, Michael, 51 browsers, 112–17; vs command lines, 220; emulation of early, 65–6, 116–17; first, 176; graphical, 112–13; line-mode, 44, 65, 66, 112; mobile, 118; role of in viewing archived websites, 27, 87, 115–16, 163; standardization of, 48, 107, 112; text-based, 112. See also browser wars browser wars, 27, 107, 113 Brügger, Niels, 25, 110–11

295

bulletin board systems (BBSs), 173–4, 191 Busa, Roberto, 54 Bush, Vannevar, 42–4 Cailliau, Robert, 44 Cambridge Analytica, 16, 91–2 Cambridge University, 91, 161 cameras. See photography Canada: access to computing resources in, 154; Department of Agriculture, 169; digital preservation by, 167–8; federal elections, 50–1, 138, 140; legal deposit models, 10; oral historian certification in, 197; shifts in political culture in, 151, 153, 168; web access and usage in, 17, 44, 45, 82, 83 Canadian Broadcasting Corporation (CBC), 97–8, 157 Canadian Government Information collection, 86–7 Canadian Historical Association, 242 Canadian Political Parties and Interest Groups collection: Archive-It and, 22–3, 223; metadata analysis of, 107, 133, 134–9; providing access to, 149–50; research methods applied to, 27, 133–8, 139–41, 155–7; searches in, 23, 125–6, 150; selection bias in, 85; Shine used with, 150–1; usage of, 150, 151–2 CapitolHill neighbourhood (GeoCities), 181 cascading style sheets (CSS), 48, 110 Casemajor, Nathalie, 157 cathode ray tube (CRT) monitors, 65–6, 117, 120



CBC. See Canadian Broadcasting Corporation CDX files, 121–2, 131 censuses, 54 Center for History and New Media (George Mason University), 51, 70 centralized networks, 38 Cerf, Vint, 40 CERN. See European Organization for Nuclear Research checksum, 34 childhood and youth, history of, 14, 233 Chile, National Library of, 167 China, lack of digital preservation by, 167 Christen, Kimberly, 200–1 Chrome browsers, 115–16 citation and quotation: ethics of, 199, 205, 208, 210; transparency in, 59, 87. See also reference rot Cliometrics and cliodynamics, 54, 58, 60 closed platforms. See walled gardens close reading, 55, 107, 146–7, 234 cluster computing, 143, 154–5 Codeacademy, 222 code and coding. See programming languages collaboration, in historical scholarship, 24, 27, 155, 238–40, 242 collective control, 68, 71 Columbia University Libraries, 148 command line, 220–1; ImageMagick and, 232; literacy in, 214, 216; secure, 76; training in, 219; wget and, 223 Common Rule regulations, 197, 198 computational history, 60–1

computational social science, 198, 200, 225 computer science, 27, 124, 127, 245 computing clusters, 143, 154–5 concordance, 226, 228 consent. See informed consent Conservative Party (United Kingdom), 94 Conservative Party of Canada, 136–9, 139–41, 151, 185 content creators, 83–4; access to robots.txt file by, 93–4; communication with, 197, 202, 205–6; as owners of content, 98; web archives and, 244–5 copyright, 73, 131, 159, 163, 223 corporations, 13, 14, 95 Creating GeoCities Websites, 181 Crimea, invasion of (2014), 10–12, 51, 149 Crimean Centre for Investigative Journalism, 10–11 criminal trials, study of, 15, 56, 98–9 CRT. See cathode ray tube CSS. See cascading style sheets cultural analytics, 234 cultural heritage, 3, 30, 72–4, 201 cultural history, 14, 98, 120, 233 cultural protocols, 200–1, 205 Culturomics, 239–40 CyberCemetery (University of North Texas), 76 Dalal, Yogen, 40 Dalhousie University, 225 data integrity, regulation of, 39–40 data mining, 107, 123–7, 203, 209 “Data Mining with Criminal Intent,” 56 data transmission, 34, 37–9, 39–40 dating sites, 128–9, 201


deceased users, 104 decentralized networks, 38 Defence Advanced Research Projects Agency (DARPA), 41 Deja Vu, 116 deletion of digital records, 95, 96–8, 99–104, 152–3, 216 Denmark, national library of, 165–6, 201–2 Department of Agriculture (Canada), 169 Department of Defence (US), 33 Det Kongelige Bibliotek, 165–6, 201–2 digital age: historians and, 3; scale of data in, 6 digital collections, 148–51; accessibility of, 149–50; accessing WARC files of, 222–3; curation of, 164–5; determining contents of, 223, 224–5, 232; digital humanities and, 70; elections covered by, 138, 140, 168; selection bias in, 85; as sites of memory, 70; thematic organization of, 68, 70. See also Archive-It; Chile, National Library of; web archives digital dark age, 73, 74, 105, 243. See also “digital gap” digital divide, 16–17, 81–2, 87 digital formats, loss of, 118–20 “digital gap,” 76 digital heritage, 30, 95, 170 digital history, 237–43; Culturomics, 239–40; professional hurdles in, 237–8, 242; public history and, 54; trends in, 214–18. See also methodologies digital humanities: collaboration between history and, 245;

297

definition of web archives by, 105; ethics of, 198; image analysis in, 232; literary analysis and, 234; methods of, 107; rise of, 24, 53–6, 170; textual analysis in, 225 Digital Methods Initiative, 132 digital preservation: in Africa, 166–7; apathy towards, 95, 96–8; approaches to, 119–20; challenges of, 26, 73, 79–81, 105; ethics of, 94, 105, 167, 171, 196, 200; by national libraries, 9–10, 75–6, 243; obsolete technology and, 118–20; pioneers of, 4, 75–6, 243; tools for ensuring, 216; urgency of, 4–5, 10–12, 206; walled gardens and, 16, 88–92. See also Internet Archive; legal deposit collections; web archiving digital records: anonymized, 90, 205; challenges of research using, 162–3, 241; fragility of, 27, 77–81, 243–4; fragmentation of, 169; increasing dominance of, 169; need for literacy in, 26; scale of, 6, 48–53, 244; of soldiers, 13, 97, 236; vs traditional records, 30, 141, 146–7, 160–4, 236 digital tools, 213–18; adaptation of, 222; for analysis, 217–18; for archiving, 214–17; for cluster computing, 154; for creating search indexes, 125, 154; for examining WARC files, 224–5; subscription-based, 215; for text, 226–8; training in, 219–22, 240–1; use of with traditional sources, 7–8, 241; for web crawling, 214– 16. See also open-source software; technology Disqus comment platform, 119



distant reading: of community in GeoCities, 185–90; as ethical approach to web archives, 204, 209–10; use of by historians, 56– 7; of web archives, 107, 129, 138 distributed networks, 37–9 DNS. See Domain Name System domain names, 36, 79 Domain Name System (DNS), 36 Dropbox, 104 dynamic content, 109, 118 EC2. See Elastic Compute Cloud e-commerce providers, 89 Egypt, national library of, 166–7 Elastic Compute Cloud, 155 Elastic Map Reduce (EMR), 155 elections, in digital collections, 138, 140, 168, 216 Ellsberg, Daniel, 12 email, 129, 192 EnchantedForest neighbourhood (GeoCities): analysis of, 185–9; award of excellence, 189; community centre, 187, 191; content standards, 181, 182; filtering URLs for analysis of, 184 Engelbart, Douglas, 43 English literature, 54–5 ENQUIRE, 43 Equalvoice.ca, 224 Espenschied, Dragan, 117 ethical considerations, 195–210; about fraught websites, 206–7; Cambridge Analytica and, 91–2; with citation and quotation, 199, 205, 208, 210; metadata and, 128, 141, 209–10; in oral history, 196– 7, 205–6; outlined by Association of Internet Researchers (AoIR), 203–4; of representativeness, 81;

role of external regulation in, 208; in web archival research, 10, 12, 170, 195–205, 244–5; in web archiving, 94, 105, 167, 171, 200. See also GeoCities Eureka neighbourhood (GeoCities), 181 European Organization for Nuclear Research (CERN), 33, 43, 44, 63–6, 117 everyday people: GeoCities and, 28, 172–3; in historical record, 208, 233, 244; importance of records of, 206, 212; protecting privacy of, 152, 199; in records of Old Bailey, 15, 98–9; study of lives of, 53, 121 Expanding Unidirectional Ring of Pages (EUROPa), 194 Exploring Big Historical Data: The Historian’s Macroscope, 56, 124, 225 Facebook: collection in Danish national library, 166; deceased users and, 104; as historical record, 50; Stephen Harper’s page on, 150; templates, 176; as walled garden, 16, 89–90, 91–2 “fake news,” 17, 245 FashionAvenue neighbourhood (GeoCities), 185–6 federal elections (Canada), 50–1, 138, 140 fibre optic cables, 33, 34 file formats, 119 finding aids, 23, 67, 121–2, 213 Flash, 80, 118, 157 fonds, 67 food history, 52 France, national library of, 164–5 FreeNets, 173


Friendster, 89 full-text searches, 166, 168, 198, 209 gays and lesbians, 199 gender history, 14 GeoCities, 27–8, 170–95, 205–12; Afghani Taliban and, 211; American western frontier compared to, 182–3; Athens, 181, 186, 190; award system, 186, 188, 189, 191–2, 211; blink and, 244; in books, 64, 78; CapitolHill, 181; as community, 172, 181, 185–93, 195, 212; community centres, 187, 191; community leaders, 190–2, 210, 211; content standards, 182, 190; deletion of, 100–2, 205, 211–12; downloading of, 100; ethics of studying, 16, 198–9, 202, 204, 205–10, 212; Eureka, 181; FashionAvenue, 185–6; founding of, 174; guest books, 186, 190, 192–3, 207, 211; Heartland, 181–2, 186, 189, 190; as historical source, 99, 172–3, 205, 206; hit counters, 186, 192; Hollywood, 185; homesteading in, 181–3; hyperlinks and, 129; image analysis of, 186, 232; in Japan, 101; neighbourhood structure of, 174–5, 178, 180, 181– 3; Pentagon, 185; popularity of, 175–6; preservation of, 99, 100–2, 212; purchase of by Yahoo!, 175– 6, 180, 181, 211; questionnaires, 192, 193; reach of, 9; rise and fall of, 3–4, 181, 211; scale of, 15, 171, 172; textual analysis of, 193; topic modelling of, 185–6; use of HTML in, 178, 180, 190; user-generated content as basis of, 176–8; via
Wayback Machine, 5, 117, 146; webrings and, 186, 187, 190, 195, 211; WestHollywood, 199. See also EnchantedForest neighbourhood (GeoCities) George Mason University, 51, 70 Gephi, 229–32 Gibbs, Fred, 60 GIFs, 186 Gilliat, Bruce, 74 GitHub, 219, 221 global trade, study of, 56 Gmail, 90, 129 Goldsmiths, 80 Google: cluster computing by, 154; ethics of, 200; Gmail metadata, 129; historical potential of, 90; n-grams, 239–40; record preservation and, 89; on size of web, 79; as walled garden, 16; web crawls, 62, 90. See also PageRank Google Books, 78 Google Books database, 90 Googlebots, 62, 90 Google Trends, 91 Government of Canada Web Archive, 167–8 government websites, 13, 86–7, 92, 158, 167–8 Grafton, Anthony, 239–40 graphical browsers, 112–13 Graphs, Maps, Trees (Franco Moretti), 54–5 Green Party of Canada, 153; website, 155–7 guest books, 186, 190, 192–3, 207, 211 hackathons, 157–8, 183 hacks and hackers, 36, 173, 201
Halifax Explosion collection, 225, 226–7, 229, 230 Hargittai, Eszter, 83 harm, minimization of, 196, 197–8 Harper, Stephen, 23, 150, 151 Harvard University, 239 Heartland neighbourhood (GeoCities), 181–2, 186, 189, 190, 191 Helmond, Anne, 132 Heritrix web crawler, 119, 214, 223, 235 high-frequency trading, 35 historical profession, 237–43; culture of, 237–8; importance of understanding digital sources, 241; resistance to collaboration in, 238–40; transformation of, 21, 24 historical record: expansion of, 3, 233, 236, 243, 244; impact of web archives on, 6–7; scale of digital, 9 hit counters, 186, 192, 210 Hollywood neighbourhood (GeoCities), 185 homesteading (GeoCities), 176–7, 181–3 Howe, Denis, 194 HTML. See HyperText Markup Language hub-and-spoke networks, 38 Humanities Data in R, 225 Human Rights Web Archive, 148 hyperlinks: analysis of, 131–2, 133–9, 186–7; communities of, 140; as conscious act, 132, 136; error messages for, 48; extraction of, 156, 184, 217, 230; global span of, 159; vs keyboard navigation, 65; mapping of, 224, 228–32; in network theory, 230; problem of invalid or broken, 46, 77–8; significance of, to historical
research, 129, 131–2, 138; use of in GeoCities, 180; in Wikipedia, 42; Xanadu library and, 42–3 hypertext, 41, 42–4, 45. See also hyperlinks HyperText Markup Language (HTML), 41–2, 45–8; access to at British Library, 163; browser wars and, 107, 113; early search engines and, 194; literacy in, 46–7; proprietary, 107, 114–15; in rendering historical webpages, 117; stripping text from, 125; transparent use of, 60; use of in GeoCities, 178, 180, 190; webpages and, 110 Iceland, legal deposit collections of, 166 ideas, history of, 239 Idle No More, 50, 51, 168 ImageMagick, 232 images, 184, 186, 210, 232 Immersion, 129 Index Thomisticus (Roberto Busa), 54 Indigenous peoples, 140–1, 151, 200–1. See also Idle No More infinite scroll, 80, 216 “Information Management: A Proposal” (Tim Berners-Lee), 43–4 information retrieval, 22, 27, 74, 124, 153. See also search engines informed consent: digital preservation and, 94, 105, 200; oral history and, 205–6; web archives and, 196–8, 244–5 Institutional Review Boards (IRBs): contacting content creators and, 205; dominance of in ethical conversation, 198; oral history and, 196–7, 208; web archives and, 201, 208, 209, 245
interdisciplinary scholarship, 24, 27, 155, 238–40, 242 interface message processors (IMPs), 39–40 International Internet Preservation Consortium, 25, 62, 166, 167, 219 internet: access to, 44–5, 52, 81–3, 87, 236; definition of, 32; history of, 31, 37–41; ignorance regarding workings of, 31; national controls over, 33; physical layer of, 32–3; rise of, 61; usage of, 16–17, 82–3 Internet Archaeology, 102 Internet Archive: accessibility of, 76–7, 216; accessing WARC files of, 222–3; beginnings of, 63–4; Bibliotheca Alexandrina partnership with, 166–7; founding of, 4, 74; full-text searches and, 209; Investigator website and, 11; Library of Congress’s partnership with, 168; metadata supplied by, 130–1; as pioneer of digital preservation, 4, 243; redundant storage infrastructure of, 121; retroactive removals from, 94, 203, 245; role in digital preservation, 4, 6; Save Page Now, 216, 217; scale of, 52, 75, 90, 120; scrolling and, 80; University of Waterloo homepage captured by, 87; use of Heritrix by, 214; web crawls by, 18, 62, 80, 92, 120–2. See also Archive-It; GeoCities; Wayback Machine Internet Assigned Numbers Authority (IANA), 36 Internet Corporation for Assigned Names and Numbers (ICANN), 36 Internet Explorer, 107, 112, 114–15 internet studies, 25, 200 Interplanetary File System (IPFS), 217
InterPlanetary Wayback, 217 Investigator, 10–11, 51 IP addresses, 33–6 Iraq war, the, 236 IRBs. See Institutional Review Boards Jackson, Andrew, 123–4 Japan, 101, 167 Journal of Cultural Analytics, 234 Just Labour website, 121–2 JyllandsPosten website, 110–11 Kahle, Brewster, 74, 77 Kahn, Robert, 40 keywords: in context, 193, 226, 228; limitations of, 44, 57; n-grams of, 239–40; searches using, 75, 125–7, 161, 165, 226; use of by early search engines, 194 Koerbin, Paul, 112 Kogan, Aleksandr, 91–2 Koster, Martijn, 92–3 Kulturarw3, 76 labour movement, topic modelling of, 140 Large Hadron Collider, 33 LCD monitors, 66, 117 legal deposit collections: access to, 159–69, 201–2; Canada, 10; Denmark, 165–6, 201–2; expansion of, 52–3, 77, 84–5; France, 164–5; Iceland, 166; legislation governing, 52, 159–60, 163, 165, 166; robots.txt protocol and, 93, 160; South Africa, 167; United States, 168; web archiving for, 159–61, 162–4. See also British Library Lesk, Michael, 72, 75 LGBT individuals, 199
Lialina, Olia, 117, 181 Liberal Party of Canada, 125–6, 136–41, 151, 185 libraries: acquisition policies of, 53; destruction of, 95–6, 97; digital collections in, 52; role of, for historians, 213; as stewards of cultural heritage, 30; use of Archive-It by, 148–9; web archiving by, 7, 170; Xanadu, 42–3. See also national libraries; individual libraries Library and Archives Canada, 10, 51, 167–8 Library of Alexandria, 30, 95 Library of Congress (US): national web archiving by, 168; size of holdings of, 29–30, 52; Supreme Court nominations collection, 158; Twitter collection, 51, 168 Licklider, J.C.R., 39 Lieberman Aiden, Erez, 6, 240 Lin, Jimmy, 27, 155, 157, 183 line-mode browsers, 44, 65, 66, 112 literary analysis, 54–5, 234 Literary Machines (Theodor Holm Nelson), 43 Logie, John, 182–3 Lomborg, Stine, 204 London Lives portal, 130 longue durée, 55–6 Los Alamos National Laboratory, 111, 168–9 mainframe computers, 37 Malian Civil War, 95–6 Manning, Chelsea, 12 Manovich, Lev, 186 Mapping the Internet Electronic Resources Virtual Archive (MINERVA), 168
markup languages, 42, 46–7. See also HyperText Markup Language (HTML) , 114 Massachusetts Institute of Technology (MIT), 129 McTavish, Sarah, 199, 204 Mediterranean and the Mediterranean World in the Age of Philip II, The, 55–6 Memento Time Travel service, 168–9 Memex, 42–4 memorial websites, 206–7 metadata, 127–32; analysis of, 107, 133, 134–9, 141, 185; Canadian Political Parties and Interest Groups collection and, 107, 133, 134–9; of Gmail, 129; in Internet Archive, 130–1; of Old Bailey online, 129–30; use of, by historians, 129–30, 141–2; use of, in surveillance, 128, 210; WARC files and, 69–70 methodologies: for Big Data, 27, 48–9, 121–32, 154–8; for determining content, 121–2, 223, 224–5, 232; FAAV process cycle, 183–4; for images, 184, 186, 210, 232; primacy in digital history, 238; traditional, 6, 7, 19; training for historians, 219–22, 225, 240, 242; used with Canadian Political Parties and Interest Groups collection, 27, 133–8, 139–41, 155–7; for use of web archives, 143–9, 169–70, 222–32; using metadata, 130 #MeToo, 13, 14 Michel, Jean-Baptiste, 6, 240 microfilm, 73 Microsoft, 107, 112, 113, 157
MIDI music files, 117 military history, 13 Millward, Gareth, 123 MIT Media Lab, 129 monitors, 65–6, 116, 117, 120 Moretti, Franco, 54–5 Mosaic browsers, 112–13 “Mother of All Demos,” 43 Mukurtu Wumpurrarni-kari Archive, 200–1, 202 MySpace, 96–8 named-entity extraction (NER), 126 national censuses, 54 National Center for Supercomputing Applications (NCSA), 112 nationalization, n-gram of, 239 national libraries, 158–68; digital preservation by, 9–10, 75–6, 243; remote access to, 166; research portals at, 161, 163, 165, 166, 168, 169–70; scrapes and crawls of national domains by, 25, 51, 76, 164, 166, 167; as stewards of national heritage, 30; use of Heritrix by, 214. See also legal deposit collections; individual libraries National Library of Australia, 75–6, 243 National Library of Chile, 167 National Physical Laboratory (UK), 39 National Security Agency (US), 27, 107, 128, 141, 210 national webs, 25, 51, 76, 164, 166, 167 nation-states, 33. See also national libraries natural language processing, 127 NDP. See New Democratic Party
Nelson, Michael, 111 Nelson, Theodor Holm, 42 Netflix, 33 Netscape, 107, 112–13, 114–15 network analysis, 131–2, 133–9, 186–7. See also hyperlinks network congestion, 39–40 network diagrams, 224, 228–32. See also hyperlinks networking paths, 34–5 network intercommunication. See TCP/IP networks: local, 32; models of, 37–9; neutral, 232; standardization of, 39, 40–1; theory, 228–30 New Democratic Party (NDP), 136–9, 151, 153 new media studies, 25 newspapers, 11–12, 32 NeXT computer, 44 n-grams, 239–40 Noble, Safiya, 16, 200 North America, national web archiving in, 167–8 Obama, Barack, 128 Occupy Wall Street collection, 51, 70, 78–9, 243 O’Dell, Allison Jai, 71–2 oil sands, 153 OkCupid, 128–9, 201 Old Bailey Online, 129–30, 221. See also Proceedings of the Old Bailey, 1674–1913 Old Dominion University, 79, 111, 168–9, 218 oldweb.today, 116, 145 Olympics collection, 168 online communities, 96–8, 99–104, 114, 178–80, 179. See also GeoCities
online dating sites, 128–9, 201 online writing, 15, 51–2 open-source software: for cluster computing, 154; Heritrix, 214–15; for hyperlink mapping, 228; for image analysis, 232; OpenWayback, 144; professionals in, 61; TCP/IP released as, 33; transparency of, 22; Voyant Tools, 226 OpenWayback, 144 operating systems, 41, 113 oral history, 196–7, 198, 205–6, 208, 245 ordinary people. See everyday people original order, 68, 71 “Our Marathon,” 70 Owens, Trevor, 60; on Pitfall, 119–20 Oxford Internet Institute, 80 packet switching, 37–40. See also data transmission PageRank: hyperlinks and, 90–1, 134, 187–9; limitations of, 21, 22; speed of, 154; use of in link analysis, 190, 210, 230 paper, 64, 73–4, 118, 243 pedagogical reforms, 143, 241–2 Pellow, Nicola, 44, 65 Pennsylvania Station (New York), 95 Pentagon neighbourhood (GeoCities), 185, 190 Pentagon Papers, 12 personal computing revolution, 60–1, 72–3 pets, websites dedicated to, 206–7 Pew Research, 17, 82–3 photography, 59–60, 162–3, 165, 203 Pinterest, 30 Pitfall, 119–20
pixel text, 117 place metaphors, 174–5 plain text, 124–7, 125, 184, 225–6. See also textual analysis Poe, Edgar Allan, 6 political history, 13, 22, 151–2, 233 political parties, 13, 136, 152–3, 195, 233. See also Canadian Political Parties and Interest Groups collection; individual parties privacy: legal deposit and, 159, 160; regulations governing, 196, 201–2; spectrum of, 195, 199; walled gardens and, 88–94. See also ethical considerations; GeoCities privatization, n-gram of, 239 Proceedings of the Old Bailey, 1674–1913, 15, 98–9. See also Old Bailey Online Programming Historian, 220–1, 225, 240 programming languages, 154, 221–2, 235, 241, 242 Progressive Conservative Party (Canada), 151 protocols: adoption of common, 37, 40–1; cultural, 200–1, 205; for digital preservation, 120; of the internet, 33–6, 40–1; robots.txt, 92–4, 106, 160, 168, 244; TimeMap, 146 provenance, 68, 71, 86–7 public history, 54 public transit keyword search, 125–6 publishing revolution, 55 pulp paper, 73 PySpark, 154 Python programming language, 154, 221
quantitative analysis, in historical scholarship, 54, 58, 237, 240 quotation and citation: ethics of, 199, 205, 208, 210; transparency in, 59, 87. See also reference rot race, history of, 14 RAND, 38 ranking mechanisms: Alexa, 17–18, 80, 84, 85; alternatives to, 21; gaming of, 133–4, 187; limitations of, 18, 80, 84, 123–4; need to understand, 148; opacity of, 21–3, 153; transparency and, 153–4 Rebel Youth: 1960s Labour Unrest, Young Workers, and New Leftists in English Canada, 49–50 Recherche en Informatique et en Automatique, 39–40 reference rot, 78 ReoCities, 102 representativeness, 81–5 research tools. See digital tools respect des fonds, 68, 71 Rezner, John, 181 Rheingold, Howard, 179 Ridener, John, 67–8 Roberts, Justice John, 158 robots.txt protocol, 92–4; consent and, 244; national libraries and, 93, 160, 168; retroactive removal and, 94, 106 Rockwell, Geoffrey, 226 Rogers, Richard, 191 Roman road networks database, 56 routers, 39 Royal Library of Sweden, 76, 243 Royal National Institute of Blind People (RNIB), 123 Rudder, Christian, 128–9
Ruest, Nick, 7, 27, 150, 152, 155 Rules for Archival Description (RAD), 71
SAGE Handbook of Web History, 25
Salganik, Matthew J., 198, 206 Save Page Now, 216, 217 Scott, Jason, 9, 99–101, 102, 172, 179 screen resolution, 116. See also monitors scrolling, 80, 116, 216 search engines: Amazon, 150; early, 186, 187, 193–4; Memento Time Travel, 168–9; at national libraries, 161, 165, 168, 169–70; opacity of, 21–3; privacy and, 199, 209; Shine, 123, 125, 150–1, 152–3; transparent use, 59, 60; use of with web archives, 27, 57–8, 122–4, 125, 209. See also PageRank; ranking mechanisms search indexes, 48, 125, 154 selection bias, 17–18, 84–8, 106 September 11 Digital Archive, 70. See also 11 September terrorist attacks Shine portal, 123, 125, 150–1 Sinclair, Stéfan, 226 Singapore, national domain crawls in, 167 Slashdot, 102 Smithsonian Institute, 76 Snapchat, 10 Snowden, Edward, 128 social history: census data and, 54; digital records and, 53; importance of digital heritage to, 3, 14, 98, 105, 233 social media: in Canadian Political Parties and Interest Groups
collection, 134; ethics and, 78, 200, 203; as historical source, 15, 50–1. See also social networks; individual platforms social movements, 13, 49–51, 78–9 social networks, 195, 200, 207. See also online communities; social media; individual platforms social sciences, 196. See also computational social science Society of American Archivists (SAA), 68 Software Carpentry, 221 soldiers, 13, 97, 236 South Africa, 17, 167 South America, web archiving in, 167 South Korea, web archiving by, 167 standardization, 39, 40–1, 48, 107, 112. See also protocols Stanford University, 40, 56 Stevens, Ted, 31 Stop Online Piracy Act, 36 St Thomas Aquinas, corpus of, 54 subscription-based digital services, 215 suicide, personal webpages about, 207–8 Sunshine, Carl, 40 Supreme Court nominations collection, 158 surveillance, 128, 210 Sweden, national library of, 76, 243 Taliban, 211 Tape Archive (TAR) files, 69 Taylor, Nicholas, 77–8 Taylor, Robert, 37, 39 TCP/IP (Transmission Control Protocol/Internet Protocol), 33–6, 40–1
technology: as driver of historical questions, 28, 60, 233; ethics and rate of change of, 198; impact of changes in, 118–20, 234 temporal incoherence and integrity, 18, 110–12, 125, 144–6 text-based browsers, 112 textual analysis, 124–7, 184; developments in, 209; of GeoCities, 193; plain text extraction for, 125, 184, 225–6; tools for, 226–8; training in, 225; word frequency in, 125–6, 185, 226–7; word trends, 226, 227, 228, 239. See also topic modelling Timbuktu libraries, 95–6, 97 TimeMap protocol, 146 time sharing, 37 topic modelling, 139–41, 185–6, 221 Toronto Star, 173 trade, study of, 56 traditional archives: access to, 196, 222–3; bias in, 81, 88; privacy and, 196; records of everyday people in, 208; vs web archives, 70–2, 88, 94, 200–2. See also archives; traditional records traditional records: digital interaction with, 7–8, 241; vs digital records, 30, 141, 146–7, 160–4, 236; durability of, 64, 118, 243; in historical profession, 237; in national libraries, 161; preservation of, 73; scarcity as issue with, 15, 49–50; scope of, 88; skewed nature of, 236. See also traditional archives training: for digital historians, 28; in digital tools, 219–22; in methodologies, 21, 240–1, 242; in programming languages, 235,
241, 242; in textual analysis, 225; in topic modelling, 221. See also methodologies Transmission Control Protocol/Internet Protocol. See TCP/IP transmission times, 35 transparency, 85–7; algorithm use and, 84, 86, 155; of Archives Unleashed Toolkit, 155; in methods, 59–60, 198; of open-source software, 22; search results ranking and, 153–4 TripAdvisor, 80 Tripod, 6, 174 Trump, Donald J., 13, 216, 236 Turchin, Peter, 58 Twitter: collections, 157–8, 168; deceased users and, 104; as historical record, 50; infinite scroll of, 80; study of African Americans’ uses of, 201; templates, 176; tools for crawling, 216 UCLA Center for Communications Policy, 81 UK Web Archive, 161–4; Shine search tool, 123 Umbra capture mechanism, 80 Uniform Resource Locators (URLs), 78–9, 121–2, 184, 209 United Kingdom: Conservative Party of, 94; legal deposit in, 150, 159–60; national domain crawl, 25; National Physical Laboratory of, 39. See also British Library United States: Department of Defence, 33; digital preservation efforts in, 76; history of in digital age, 13; legal deposit in, 168; regulations governing human subjects in, 197, 198; soldiers of,
97, 236; Supreme Court of, 78, 158; web access and usage in, 17, 44–5, 82–3 University of California, Berkeley, 77 University of California, Los Angeles (UCLA), 81 University of North Texas, 76, 243 University of Southern California, 201 University of Toronto, 22, 87, 157 University of Waterloo, 157, 238; website, 87, 116, 215–16 Unix operating system, 41 URLs. See Uniform Resource Locators Usenet, 8, 44, 176 user-generated content: bulletin board systems and, 173–4; comments, 153; deceased users and, 104; GeoCities and, 171, 172, 175, 176–8, 183; mass deletion of, 95, 96–8, 99–104; scale of, 48–9, 50–3, 75; transparent use of, 59–60. See also ethical considerations Vaidhyanathan, Siva, 134 Van de Sompel, Herbert, 111 virtual communities. See online communities virtual machines, 101–2 visualizations, 157, 184, 226–8, 229–32 Voyant Tools, 226–8 W3C. See World Wide Web Consortium Walker, Katherine, 192–3 walled gardens, 16, 88–92 Warcbase. See Archives Unleashed Toolkit
WARC files. See WebARChive (WARC) files Warhammer 40,000, 8, 10 Warumungu community, 200–1 Watergate, 29, 49 WAT files. See Web Archive Transformation files Wayback Machine, 143–7; accessibility of, 148, 216; development of web history and, 233; emulation of early browsers by, 116–17; ethics of, 208, 209; GeoCities via, 5, 146; Icelandic legal deposit collection and, 166; launch of, 77; limitations of, 122, 146–7; obscurity of pages in, 203, 209; temporality and, 111–12, 144–6; transparency and, 86–7; University of Waterloo webpage using, 116; use of at Library and Archives Canada, 168; Wikipedia via, 144–6. See also Internet Archive Weather Underground webpage, 111, 145 web. See World Wide Web WebARChive (WARC) files: access to, 222–3; availability of, 209; vs CDX files, 131; definition of, 57, 69–70; Disqus comments and, 119; ethics of working with, 209–10; significance to historians of, 212; tools for examination of, 224–5; user-created, 216, 223 web archives: access to, 6, 76–7, 147–8, 200–2; bias and subjectivity of, 17–18, 58–9, 77, 87–8, 106; citizen-led, 235; collaboration for use of, 155, 170; content creators and, 197, 244–5; definition of, 70–1, 105; vs digital collections,
68; distant reading of, 107; distributed, 217; ethics of using, 10, 170, 195–205, 198, 244–5; as facsimiles, 110, 146; homes of, 168; implications for study of history, 6–8, 13–14, 24, 233, 236; lack of professional framework for, 24; limitations of, 16, 17–19, 79–81; obscurity of, 20, 152, 203; portion of web covered by, 79–80; provenance and, 71, 86–7; rendering of content in, 48, 117, 163; resources and funding for, 84–5; retroactive removals from, 94, 106, 203, 245; scale of, 6, 154, 214; scholarship on, 25; security of, 96; temporality and, 110–12, 125; vs traditional archives, 70–2, 88, 94, 200–2; vs traditional records, 146–7; of Wikipedia, 46. See also archives; digital collections; GeoCities; Internet Archive; legal deposit collections; methodologies; web archiving WebArchives.ca, 152 Web Archive Transformation (WAT) files, 131 web archiving: in Asia, 167; beginnings of, 63–4; collaboration required for, 62–3, 67, 75–6; definition of, 4; file formats used in, 119; legal deposit and, 159–61, 162–4; limitations of, 45–6, 159; move towards accessibility in, 28, 214–18; opt-in model of, 197, 205, 206, 244; reasons for, 62; robots.txt protocol and, 92; tools for, 214–17; by users, 235; web history and, 26. See also digital preservation; web crawlers
Web Archiving Project (WARP), 167 web crawlers: development of user-friendly, 218; Flash sites and, 118; functioning of, 74–5, 224; Googlebots, 62, 90; Heritrix, 119, 214, 223, 235; maturity of, 235; robots.txt protocol and, 92–3, 244; scrolling and, 80, 216; Webrecorder, 215–16. See also web crawls web crawls: bias in, 80, 84–7; of Canadian Department of Agriculture, 169; of Canadian government websites, 167–8; of Global South by Global North, 12, 167; infinite nature of, 75, 79; by Internet Archive, 18, 62, 80, 92, 120–2; limitations of, 46; of national domains, 25, 51, 76, 164, 166, 167; targeted, 51, 164, 166, 167–8; temporality of, 18, 110–12, 125; tools for conducting, 214–16; use of algorithms for, 53, 84; of Wikipedia, 144. See also web archiving; web crawlers; web scrapes Weber, Matt, 157 web history: definition of, 26; digital dark age and, 26, 73, 74, 105, 243; “digital gap” in, 76; as field of study, 28, 242; impact of, 233–4; scholarship on, 25–6; skills needed for, 219–22, 235; subjectivity of, 234 webpages: composition of, 4, 19, 69, 107–10; the first, 63–6; reconstruction of, 103, 117; rendering of, 48, 117, 163 Webrecorder, 215–16, 223, 235 webrings, 180, 186, 190, 193–5, 211 Web Science and Digital Libraries
Research Group (Old Dominion University), 218 web scrapes, 51, 76, 120–2. See also web crawls websites: about dogs, 206–7; complexity of, 109–10; downloading of, 100; ethically fraught, 206–7, 210; the first, 63; fragility of, 64; government, 13, 86–7, 92, 158, 167–8; grassroots, 78–9; keyboard navigation of, 65; lifespan of, 77–8, 206, 245; as medium of communication, 12–13; models for classifying, 232; personal, 9, 171, 173–4, 179; reconstruction of, 64–6, 144 Webster, Peter, 94 Weil, Sage, 194 Weltevrede, Esther, 132 WestHollywood neighbourhood (GeoCities), 199 wget, 223 wide00002, 120–2, 223 Wide Area Information Servers (WAIS), 74 Wide Web Crawl (2011), 18 Wikipedia, 42, 144–6 Winnie-the-Pooh, 3, 185, 186 Wired, 19, 53, 58, 128, 174 #WomensMarch, 50 Woodward, Bob, 29, 49 Worby, Nicholas, 157 word clouds, 126, 226–7 word frequency charts, 125–6 Wordpress sites, 109 World Wide Web, 174; access to, 44–5, 52, 81–3, 87, 236; content creators, 83–4; definition of, 32, 41; frontier spirit of, 174, 181; GeoCities and, 172, 177, 178, 212; global nature of, 159; as
historical source, 3, 26; ownership of content on, 98; proportion of covered by web archives, 79–80; as read-write medium, 173, 175, 176; rise of, 61; study of structure of, 122; usage of, 16–17, 82–3; webrings and, 194 World Wide Web Consortium (W3C), 64 Xanadu library, 42–3 Yahoo!, 4, 89, 102, 103. See also GeoCities youth and childhood, history of, 14, 233 YouTube, 30, 136, 166