English Pages 296 [297] Year 2023
Understanding Search Engines
Dirk Lewandowski
Dirk Lewandowski Department of Information Hamburg University of Applied Sciences Hamburg, Germany
ISBN 978-3-031-22788-2
ISBN 978-3-031-22789-9 (eBook)
https://doi.org/10.1007/978-3-031-22789-9
Translation from the German language edition: “Suchmaschinen verstehen” by Dirk Lewandowski, © Springer 2021. Published by Springer Vieweg, Berlin, Heidelberg. All Rights Reserved.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This book aims to give a comprehensive introduction to search engines. The particularity of this book is that it looks at the subject from different angles. These are, in particular, technology, use, Internet-based research, economics, and societal significance. In this way, I want to reflect the complexity of search engines and Web search as a whole. I am convinced that only such a comprehensive view does justice to the topic and enables a real understanding.
A German-language version of this book has been available for several years. This English edition follows the third German edition of 2021. I am pleased that the publisher has made this international edition possible. In this translation, care has been taken to adapt to the international context where necessary. For many examples, however, it does not matter in which country a search was carried out or a screenshot was taken. The references cited in the text were adapted where English-language sources were available. The further reading sections at the end of the chapters have also been adapted.
I would like to thank all those who have asked and encouraged me over the years to produce an English edition. So, here it is, and I hope it will be as valuable to many readers as the German editions have been.
Hamburg, Germany
September 2022
Dirk Lewandowski
Contents
1 Introduction
  1.1 The Importance of Search Engines
  1.2 A Book About Google?
  1.3 Objective of This Book
  1.4 Talking About Search Engines
  1.5 Structure of This Book
  1.6 Structure of the Chapters and Markings in the Text
  1.7 Summary
  References
2 Ways of Searching the Web
  2.1 Searching for a Website vs. Searching for Information on a Topic
  2.2 What Is a Document?
  2.3 Where Do People Search?
  2.4 Different Pathways to Information on the World Wide Web
    2.4.1 Search Engines
    2.4.2 Vertical Search Engines
    2.4.3 Metasearch Engines
    2.4.4 Web Directories
    2.4.5 Social Bookmarking Sites
    2.4.6 Question-Answering Sites
    2.4.7 Social Networking Sites
  2.5 Summary
  References
3 How Search Engines Capture and Process Content from the Web
  3.1 The World Wide Web and How Search Engines Acquire Its Contents
  3.2 Content Acquisition
  3.3 Web Crawling: Finding Documents on the Web
    3.3.1 Guiding and Excluding Search Engines
    3.3.2 Content Exclusion by Search Engine Providers
    3.3.3 Building the Database and Crawling for Vertical Collections
  3.4 The Indexer: Preprocessing Documents for Searching
    3.4.1 Indexing Images, Audio, and Video Files
    3.4.2 The Representation of Web Documents in Search Engines
  3.5 The Searcher: Understanding Queries
  3.6 Summary
  References
4 User Interaction with Search Engines
  4.1 The Search Process
  4.2 Collecting Usage Data
  4.3 Query Types
  4.4 Sessions
  4.5 Queries
    4.5.1 Entering Queries
    4.5.2 Autocomplete Suggestions
    4.5.3 Query Formulation
    4.5.4 Query Length
    4.5.5 Distribution of Queries by Frequency
    4.5.6 Query Trends
    4.5.7 Using Operators and Commands for Specific Searches
  4.6 Search Topics
  4.7 Summary
  References
5 Ranking Search Results
  5.1 Groups of Ranking Factors
  5.2 Text Statistics
    5.2.1 Identifying Potentially Relevant Documents
    5.2.2 Calculating Frequencies
    5.2.3 Considering the Structural Elements of Documents
  5.3 Popularity
    5.3.1 Link-Based Rankings
    5.3.2 Usage Statistics
  5.4 Freshness
  5.5 Locality
  5.6 Personalization
  5.7 Technical Ranking Factors
  5.8 Ranking and Spam
  5.9 Summary
  References
6 Vertical Search
  6.1 Vertical Search Engines as the Basis of Universal Search
  6.2 Types of Vertical Search Engines
  6.3 Collections
    6.3.1 News
    6.3.2 Scholarly Content
    6.3.3 Images
    6.3.4 Videos
  6.4 Integrating Vertical Search Engines into Universal Search
  6.5 Summary
  References
7 Search Result Presentation
  7.1 The Influence of Device Types and Screen Resolutions
  7.2 The Structure of Search Engine Result Pages
  7.3 Elements on Search Engine Result Pages
    7.3.1 Organic Results
    7.3.2 Advertising
    7.3.3 Universal Search Results
    7.3.4 Knowledge Graph Results
    7.3.5 Direct Answers
    7.3.6 Integration of Transactions
    7.3.7 Navigation Elements
    7.3.8 Support for Query Modification
    7.3.9 Search Options on the Result Page
  7.4 The Structure of Snippets
  7.5 Options Related to Single Results
  7.6 Selection of Suitable Results
  7.7 Summary
  References
8 The Search Engine Market
  8.1 Search Engines’ Business Model
  8.2 The Importance of Search Engines for Online Advertising
  8.3 Search Engine Market Shares
  8.4 Important Search Engines
  8.5 Partnerships in the Search Engine Market
  8.6 Summary
  References
9 Search Engine Optimization (SEO)
  9.1 The Importance of Search Engine Optimization
  9.2 Fundamentals of Search Engine Optimization
    9.2.1 Content
    9.2.2 Architecture
    9.2.3 HTML
    9.2.4 Trust
    9.2.5 Links
    9.2.6 User-Related Factors
    9.2.7 Toxins
    9.2.8 Vertical Search Engines
  9.3 Search Engine Optimization and Spam
  9.4 The Role of Ranking Updates
  9.5 Search Engine Optimization for Special Collections
  9.6 The Position of Search Engine Providers
  9.7 Summary
  References
10 Search Engine Advertising (SEA)
  10.1 Specifics of Search Engine Advertising
  10.2 Functionality and Ranking
  10.3 Distinguishing Between Ads and Organic Results
  10.4 Advertising in Universal Search Results
  10.5 Summary
  References
11 Alternatives to Google
  11.1 Overlap Between Results from Different Search Engines
  11.2 Why Should One Use a Search Engine Other Than Google?
    11.2.1 Obtaining a “Second Opinion”
    11.2.2 More or Additional Results
    11.2.3 Different Results
    11.2.4 Better Results
    11.2.5 Different Result Presentation
    11.2.6 Different User Guidance
    11.2.7 Avoiding the Creation of User Profiles
    11.2.8 Alternative Search Options
  11.3 When Should One Use a Search Engine Other Than Google?
  11.4 Particularities of Google due to Its Market Dominance
  11.5 Summary
  References
12 Search Skills
  12.1 Source Selection
  12.2 Selecting the Right Keywords
  12.3 Boolean Operators
  12.4 Connecting Queries with Boolean Operators
  12.5 Advanced Search Forms
  12.6 Commands
  12.7 Complex Searches
  12.8 Summary
  References
13 Search Result Quality
  13.1 Criteria for Evaluating Texts on the Web
  13.2 Human vs. Machine Inspection of Quality
  13.3 Scientific Evaluation of Search Result Quality
    13.3.1 Standard Test Design of Retrieval Effectiveness Studies
    13.3.2 Measuring Retrieval Effectiveness Using Click-Through Data
    13.3.3 Evaluation in Interactive Information Retrieval
  13.4 Summary
  References
14 The Deep Web
  14.1 The Content of the Deep Web
  14.2 Sources vs. Content from Sources, Accessibility of Content via the Web
  14.3 The Size of the Deep Web
  14.4 Areas of the Deep Web
  14.5 Social Media as Deep Web Content
  14.6 What Role Does the Deep Web Play Today?
  14.7 Summary
  References
15 Search Engines Between Bias and Neutrality
  15.1 The Interests of Search Engine Providers
  15.2 Search Engine Bias
  15.3 The Effect of Search Engine Bias on Search Results
  15.4 Interest-Driven Presentation of Search Results
  15.5 What Would “Search Neutrality” Mean?
  15.6 Summary
  References
16 The Future of Search
  16.1 Search as a Basic Technology
  16.2 Changes in Queries and Documents
  16.3 Better Understanding of Documents and Queries
  16.4 The Economic Future of Search Engines
  16.5 The Societal Future of Search Engines
  16.6 Summary
  References
Glossary
Index
1 Introduction
This book is about better understanding the search tools we use daily. Only when we have a basic understanding of how search engines are constructed and how they work can we use them effectively in our research. However, not only the use of existing search engines is relevant here but also what we can learn from well-known search engines like Google when we want to build our own search systems. The starting point is that Web search engines are currently the leading systems in terms of technology, setting the standards in terms of both the search process and user behavior. Therefore, if we want to build our own search systems, we must comply with the habits shaped by Web search engines, whether we like it or not.
This book is an attempt to deal with the subject of search engines comprehensively in the sense of looking at it from different angles:
1. Technology: First of all, search engines are technical systems. This involves the gathering of the Web’s content as well as ranking and presenting the search results.
2. Use: Search engines are not only shaped by their developers but also by their users. Since the data generated during use is incorporated into the ranking of the search results and the design of the user interface, usage significantly influences how search engines are designed.
3. Web-based research: Although, in most cases, search engines are used in a relatively simple way – and often not much more is needed for a successful search – search engines are also tools for professional information research. The fact that search engines are easy to use for everyone does not mean that every search task can be easily solved using them.
4. Economy: Search engines are of great importance for content producers who want to get their content on the market. Because they are central nodes in the Web, they also play an important economic role. Here, the main focus is on search engine visibility, which can be achieved through various online marketing measures (such as search engine optimization and placing advertisements).
5. Society: Since search engines are the preferred means of searching for information and are used massively every day, they also have an enormous significance for knowledge acquisition in society. Among other things, this raises the question of whether search results are credible and whether search engines play a role in spreading misinformation and disinformation, often treated under the label of “fake news.”
My fundamental thesis is that one is impossible without the other: we cannot understand search engines as technical systems if we do not know their social significance. Nor can we understand their social impact if we do not know the underlying technology. Of course, one does not have to have the same detailed knowledge in all areas; but one should achieve a solid basis.
Of course, an introductory book cannot cover the topics mentioned comprehensively. My aim is instead to introduce the concepts central to discussing search engines and provide the basic knowledge that makes a well-founded discussion of search engines possible in the first place.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_1
1.1 The Importance of Search Engines
In this book, I argue that search engines have an enormous social significance. This can be explained, on the one hand, by their mass use and, on the other hand, by the ranking and presentation of search results. Search engines (like other services on the Internet) are used en masse. Their importance lies in the fact that we use them to search for information actively. Every time we enter a query, we reveal our interests. With every search engine result page (SERP) that a search engine returns to us, there is a (technically mediated) interpretation of both the query and the potentially relevant results. By performing these interpretations in a particular way, a search engine conveys a specific impression of the world of information found on the Web. For every query, there is a result page that displays the results in a specific order. Although, in theory, we can select from all these results, we rely heavily on the order given by the search engine. De facto, we do not select from the possibly millions of results found but only from the few displayed first. If we consider this, societal questions arise, such as how diverse the search engine market is: Is it okay to use only one search engine and have only one of many possible views of the information universe for each query? The importance of search engines has already been put into punchy titles such as “Search Engine Society” (Halavais, 2018), “Society of the Query” (the title of a conference series and a book; König & Rasch, 2014), and “The Googlization of Everything” (Vaidhyanathan, 2011). Perhaps it is not necessary to go so far as to proclaim Google, search engines, or queries as the determining factor of our society; however, the enormous importance of search engines for our knowledge acquisition can no longer be denied.
If we look at the hard numbers, we see that search engines are the most popular service on the Internet. We regard the Internet as a collection of protocols and services, including e-mail, chat, and the File Transfer Protocol (FTP). It may seem surprising that the use of search engines is at the top of the list when users are asked about their activities on the Internet. Search engines are even more popular than writing and reading e-mails. For instance, 76% of all Germans use a search engine at least once a week, but “only” 65% read or write at least one e-mail during this time. This data comes from the ARD/ZDF-Onlinestudie (Beisch & Schäfer, 2020), which surveys the use of the Internet among the German population every year.
Comparable studies confirm the high frequency of search engine use: the Eurobarometer study (European Commission, 2016) shows that 85% of all Internet users in Germany use a search engine at least once a week; the figure for daily use is still 48%. Germany is thus below the EU averages (88% and 57%, respectively).
Let’s look at the ARD/ZDF-Onlinestudie to see which other Internet services are used particularly often. We find that, in addition to e-mail and search engines, messengers (probably WhatsApp in particular) are the most popular. Social media services, on the other hand, only reach 36%.
A second way of looking at this is to consider the most popular websites (Alexa.com, 2021). Google holds first and third place (google.com and google.de), YouTube is second, and Amazon and eBay follow. It is striking not only that Google is in first place but also that Amazon and eBay rank so highly: these two major e-commerce companies not only offer numerous opportunities for browsing but also play a major role in (product) searches.
The fact that search engines are a mass phenomenon can also be seen in the number of daily queries.
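Yearly query totals are easier to grasp when converted into per-day and per-second rates. The following is a minimal sketch of that arithmetic, applied to the estimate of about 3.3 trillion Google queries in 2016 that is cited in this section. The function name `query_rates` is a hypothetical helper introduced only for illustration.

```python
import calendar


def query_rates(queries_per_year: float, year: int) -> tuple[float, float]:
    """Convert a yearly query total into (per-day, per-second) rates."""
    # 2016 is a leap year, so the number of days depends on the year.
    days = 366 if calendar.isleap(year) else 365
    per_day = queries_per_year / days
    per_second = per_day / 86_400  # 24 * 60 * 60 seconds per day
    return per_day, per_second


# Estimate cited in the text: about 3.3 trillion queries in 2016.
per_day, per_second = query_rates(3.3e12, 2016)
print(f"{per_day:,.0f} queries per day")
print(f"{per_second:,.0f} queries per second")
```

Under these assumptions, the estimate corresponds to roughly 9 billion queries per day, or about 104,000 queries per second.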
Market research companies estimate the number of queries sent to Google alone at around 3.3 trillion in 2016 (Internet Live Stats & Statistic Brain Research Institute, 2017); that is more than 100,000 queries per second.
An additional level of consideration arises when we look at how users access information on the World Wide Web. While there are theoretically many access points to information on the Web, search engines are the most prevalent. On the one hand, Web pages can, of course, be accessed directly by typing the address (Uniform Resource Locator; URL) into the browser bar. Then there are other services, such as social media services, which also lead us to websites. But none of these services has achieved a level of importance comparable to that of search engines for accessing information on the Web, nor is this situation likely to change in the foreseeable future.
Last but not least, search engines are also significant because of the online advertising market. The sale of ads in search engines (ads in response to a query) accounts for 40% of the market (Zenith, 2021); in Germany alone, search engine advertising generated sales of 4.1 billion euros in 2019 (Statista, 2021). This form of advertising is particularly attractive because, with each search query, users reveal what they want to find and thus also whether and what they might want to buy. This makes it easy for advertisers to decide when they want to offer their product to a user. Scatter losses, i.e., the proportion of users who see an
advertisement but have no interest in it at that moment, can be significantly reduced or even avoided altogether in this way. Search engine providers, like other companies, have to earn money. So far, the only model search engines have used to make money is the insertion of advertising in the form of text ads around the search results. Other revenue models have not caught on. This also means that search engine providers do not, as is often claimed, align their search engines solely with the demands and needs of users but also with their own profit intentions and those of their advertising customers. For companies, however, the importance of search engines is not only a result of being able to use search engines as an advertising platform but also because of being found by users in the organic search results. The procedures that serve to increase the probability of being found are subsumed under the title of search engine optimization. Already at this point, we see that if we consider search engines not only as technical systems but also as socially relevant, we are dealing with at least four stakeholder groups or actor groups (see Röhle, 2010, p. 14): 1. Search engine providers: On the one hand, search engine providers are interested in satisfying their users. This involves both the quality of the search results and the user experience. On the other hand, search engine providers’ second major (or even more significant?) interest is to offer their advertisers an attractive environment and earn as much money as possible from advertising. 2. Users: The users’ interest is to obtain satisfactory search results with little effort and not to be disturbed too much in their search process, for example, by intrusive advertising. 3. Content producers: Anyone who offers content on the Web also wants to be found by (potential) users. However, another interest of many content producers is to earn money with their content. 
This, in turn, means that it is not necessarily in their interest to make their content fully available to search engines. 4. Search engine optimizers: Search engine optimizers work on behalf of content producers to ensure that their offerings can be found on the Web, primarily in search engines. Their knowledge of the search engines’ ranking procedures and their exploitation of these procedures to place “their” websites influence the search engine providers, who attempt to protect themselves against manipulation. This brief explanation of the stakeholders already shows that this interplay can lead to conflicts. Search engine providers have to balance the interests of their users and their advertisers; search engine optimizers have to ensure the maximum visibility of their clients’ offerings but must not exploit their knowledge of how search engines work to such an extent that they are penalized by search engine providers for manipulation. Clearly, we are dealing with complex interactions in the search engine market. Only if we look at search engines from different perspectives are we able to classify these interactions and understand why search engines are designed the way they are.
Search engines have to meet the needs of different user groups; it is not enough for them to restrict their services to one of these groups. When we talk about search engines and their importance for information access, we usually only consider the content initially produced for the Web. However, search engines have been trying to include content from the “real,” i.e., the physical world, in their search systems for years. Vaidhyanathan (2011) distinguishes three types of content that search engines like Google capture:
1. Scan and link: External content is captured, aggregated, and made available for search (e.g., Web search).
2. Host and serve: Users’ content is collected and hosted on the search engine provider’s own platform (e.g., YouTube).
3. Scan and serve: Things from the real world are transferred into the digital world by the search engine provider (e.g., Google Books, Google Street View).
Vaidhyanathan (2011) summarizes this under “The Googlization of Everything” (which is also the title of his book) and thus illustrates not only that search engine content goes beyond the content of the Web (even if this continues to form the basis) but also that we are still at the beginning when it comes to the development of search engines: So far, only a small part of all the information that is of interest to search engines has been digitized and thus made available for search.
Furthermore, there is a second, largely taken-for-granted assumption, namely, that a search process must necessarily contain a query entered by the user. However, we see that search engines can increasingly generate queries by themselves by observing the behavior of a user and then offering information that is very likely to be useful to them. For example, suppose a user is walking through a city with their smartphone in their pocket. In that case, it is easy to predict their desire for a meal option at lunchtime and suggest a restaurant based on that user’s known past preferences and current location.
To do this, a query (made up of the above information) is required, but the user does not have to enter it themselves. We will return to this in Chap. 4.
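The idea of a query that the user never types can be made concrete with a small sketch. It is purely illustrative: the `Context` fields, the lunchtime window, and the way the query string is assembled are assumptions made for this example and do not describe how any actual search engine works.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Optional


@dataclass
class Context:
    """Signals observed without a typed query (hypothetical fields)."""
    local_time: time
    location: str
    past_preferences: List[str]


def build_implicit_query(ctx: Context) -> Optional[str]:
    """Assemble a query from context, or return None if no need is predicted."""
    # Assumption: "lunchtime" is simply the window from 11:30 to 14:00.
    if not (time(11, 30) <= ctx.local_time <= time(14, 0)):
        return None
    # Use the first known preference, if any, as the cuisine term.
    cuisine = ctx.past_preferences[0] if ctx.past_preferences else ""
    parts = [cuisine, "restaurant", "near", ctx.location]
    return " ".join(p for p in parts if p)


# The user enters nothing; the query is made up of time, place, and preferences.
print(build_implicit_query(Context(time(12, 15), "Hamburg Altona", ["vietnamese"])))
```

The point of the sketch is only that a query still exists in such a scenario; it is composed by the system rather than entered by the user.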
1.2 A Book About Google?
When we think of search engines, we primarily think of Google. We all use this search engine almost daily, usually for all kinds of search purposes. Here, again, the figures speak for themselves: in Germany, well over 90% of all queries to general search engines are directed to Google, while other search engines play only a minor role (Statcounter, 2021). Therefore, this book is based on everyday experience with Google and tries to explain the structure and use of search engines using this well-known example. Nevertheless, this book aims to go further: to show which alternatives to Google there are and when it is worthwhile to use them. But this book will not describe all possible search engines; it is rather about introducing other search engines, utilizing
examples, and thus, first of all, getting the reader to think about whether Google is the best search engine for precisely their research before carrying out more complex searches. To a certain extent, it can also be said that if you know one search engine, you will be better able to deal with all the others. We will learn about the basic structure of search engines and their most important functions by looking at Google, which we all already know, at least from the user side. The acquired knowledge can then easily be transferred to other search engines. Most of the search examples and screenshots also come from Google. In most cases, however, the examples can be transferred to other search engines. Where this is not the case, this is indicated. Regarding the similarity between the different search engines, we can generally say that Google’s competitors are in a dilemma: Even if they offer innovative functions and try to do things differently from Google, they are fundamentally oriented toward Google’s idea of how a search engine should look and work. This orientation toward Google cannot be blamed on the other search engine providers because, on the one hand, they can only win over users if those who are used to Google find their way around immediately; on the other hand, they have to distinguish themselves from Google to be able to offer any added value compared to this search engine.
1.3 Objective of This Book
By its very nature, this book is restricted in its function as an introductory book and is intended as a general overview. This also means that many topics cannot be dealt with in detail, but we must remain “on the surface” instead. However, this does not mean that the contents must be superficial. On the contrary, I have tried to present the contents as simply as possible but without sacrificing the necessary accuracy. Some topics are explored via a specific example (such as a vertical search engine), which is explained in more detail, so that this information can then be applied to other topics. This book is about transfer: what you learn from one or a few search engines should be transferable to others. Therefore, it does not matter that some of the contents in this book – especially when it comes to details of a particular search engine – may have already changed by the time this book is published. This is unavoidable, especially in rapidly evolving fields, but the goal is to convey basic knowledge about search engines that can then be applied to all search engines. This book is not a substitute for introductory works on, for example, information retrieval or searching the Web, even though topics from these areas are covered. The relevant introductory literature on the respective topics is mentioned in the respective chapters. This book aims to provide an overview and a consideration of different perspectives on search engines, not an all-encompassing presentation of the individual topics. Students in particular often fear that they will only be able to understand search engines if they delve into algorithms and technical details. In this book, the essential
procedures are described in a concise and understandable manner, but the main aim is to understand the ideas underlying the technical processes. This will enable us to assess why search engines work as well or as poorly as they do at present and what prospects there are for their further development. It is only natural that in any attempt to look at a topic from different perspectives, one gravitates toward one’s own subject and focuses on the interests of one’s own discipline. Thus, my interest and the focus of my consideration naturally follow the subject area and the methods of information science, which always (also) considers technical information systems from the perspective of humans. In addition, however, I have made an effort to also consider the perspective of other subjects such as computer science and media and communication studies (including their literature).
1.4 Talking About Search Engines
To talk about an object, you need a consistent vocabulary. You must know that when you use certain terms, you are talking about the same thing. In order to avoid talking past each other, it is therefore necessary to agree on terminology. Since there is currently no single, unified terminology in the field of search engines, and search engine optimizers, information scientists, and communication scientists, for example, each speak a language of their own, this book is also intended to contribute to mutual understanding. At the end of the book, there is a glossary that lists and explains all important terms in alphabetical order. I have made an effort to include synonyms and related terms so that readers who have already gained some knowledge from the literature can find “their” terms and quickly get used to the terminology I have used.
1.5 Structure of This Book
Of course, you can read this book from cover to cover, which was my primary intention when writing it. However, if you only want to read about a specific topic, the chapter structure allows you to do so. Following this introductory chapter, Chap. 2 covers different ways of searching the Web. Indeed, search engines like Google are not the only form of access to information on the Web, even if the form of the algorithmic universal search engine has become widely accepted. The various forms of search systems are briefly introduced, and their significance is discussed in the context of searching the Web. Chapter 3 then explains the basic technical structure of algorithmic search engines. It explains how search engines obtain content from the Web, how this content is prepared so that it can be searched efficiently, and how users’ queries can be interpreted and processed automatically. After these two technical chapters, we consider the user side in Chap. 4: what is actually searched for in search engines, how are queries formulated, and how do users select the most suitable results?
Closely related are the ranking procedures, i.e., the arrangement of search results. Chapter 5 describes the basic procedures and explains their significance. Although it is often claimed that the ranking of search results is the big secret of every search engine, knowledge of the most important ranking factors can at least fundamentally explain the arrangement of search results, even if the concrete ranking depends on a multitude of weightings that cannot be traced in detail. This understanding, in turn, can help us both in our searches and in preparing our own content for search engines or even in creating our own information systems. Chapter 6 then shows how search results from the general Web Index are extended by adding so-called vertical collections such as news, images, or videos. For this purpose, the well-known search engines have built and integrated numerous vertical search engines whose results are displayed on the search engine result pages. Chapter 7 is devoted to the presentation of search results. For some years now, the well-known search engines have deviated from the usual list form of search result presentation and have instead established new forms of compiling search results with concepts such as universal search and knowledge graph. This has made the result pages more attractive and increased the choices available on these pages. With this type of result presentation, the search engines also deliberately guide the users’ attention. This brings us to the economic realities related to search engines. In Chap. 8, we deal with the search engine market and thus, among other things, with the question of how Google has succeeded in almost completely dominating the search engine market (at least in Europe). Of course, the question of whether such a situation is desirable and how it could be changed is also raised here. 
Chapter 9 is devoted to the side of those who want to make their content best available via search engines and their helpers, the search engine optimizers. They use their knowledge of search engines’ indexing and ranking procedures to make content easier to find and to bring traffic to their customers. Their techniques range from simple text modification to complex procedures that consider the Web’s linking structure. Chapter 10 provides a detailed description of the advertising displayed in search engines. On the one hand, it deals with the ads shown on search result pages as a type of search result, and on the other hand, with the question to what extent users can distinguish these ads from the organic search results. Chapter 11 deals with alternatives to Google. First, it is important to answer the question of what makes a search engine an alternative search engine. Is it enough that it is simply a search engine other than Google? Then, based on fundamental considerations and concrete situations in the search process, we will explain in which cases it is worth switching to another search engine. In Chap. 12, we change the perspective again and consider search engines as tools for advanced Internet research. In the chapter on user behavior, it became clear that most users put little effort into formulating their queries and sifting through the results. Therefore, we want to see what strategies and commands we can use to get the most out of search engines.
Another topic related to searching, but also to the general evaluation of search engines, is the question of the quality of the search results, which we will address in Chap. 13. The quality of search results can be viewed from two perspectives: One is about the user’s result evaluation in the course of their search; the other is about scientific comparisons of the result quality of different search engines. Chapter 14 deals with the contents of the Web that are not accessible to general search engines, the so-called Deep Web. An enormous treasure trove of information cannot be found with Google and similar search engines, or can be found only to a limited extent. We will see why this content remains hidden from the search engines and with what methods we can nevertheless access it. While the previous topics dealt with aspects from the areas of technology, use, and Web-based research, Chap. 15 deals with the societal role of search engines. What role do search engines play in knowledge acquisition, and what role should they play? Finally, Chap. 16 focuses on the future of search. Of course, a book like this can only ever offer a snapshot, and 10 years ago, it would have provided a different picture than today. However, the “problem” of search has by no means been solved (and may never be solved), so it is worth looking at today’s search engines not only in their evolution toward the current state but also to venture a look into the (near) future.
1.6 Structure of the Chapters and Markings in the Text
By their very nature, chapters on different topics must be structured differently. Nevertheless, the chapters in this book have certain similarities: At the beginning of each chapter, there is an introduction that defines the topic and briefly describes its significance for the book. Detailed explanations follow this. At the end of each chapter, there is a summary that reviews the most important points. Each chapter also contains a bibliography and, in a separate box, a list of recommendations for further reading for those who wish to delve deeper into the topic. There are also examples in boxes throughout the text that illustrate and deepen what is said in the main text but are not essential for understanding the main text.
1.7 Summary
Search engines are essential tools for accessing the information on the World Wide Web. In this book, we will look at them from the perspectives of technology, usage, economic aspects, searching the Web, and society. The significance of search engines results from their mass use and the fact that they are by far the preferred means of accessing information on the World Wide Web. However, one search engine, in particular, Google, is used for most searches. We should not only consider search engines as technical systems. Due to the interactions of different stakeholders (search engine providers, users, content
producers, and search engine optimizers), numerous factors influence the search results, and these factors are not controlled exclusively by the search engine providers. In terms of content, search engines no longer only capture the Web content but also offer platforms on which users can create content themselves, which is then made searchable. Furthermore, search engine providers offer various vertical search engines in addition to Web search, whose results are included in the general search engine result pages. Finally, content from the physical world is transferred to the digital world and integrated into search.
References
Alexa.com. (2021). Top sites in Germany. https://www.alexa.com/topsites/countries/DE.
Beisch, N., & Schäfer, C. (2020). Ergebnisse der ARD/ZDF-Onlinestudie 2020: Internetnutzung mit großer Dynamik: Medien, Kommunikation, Social Media. Media Perspektiven, 51(9), 462–481.
European Commission. (2016). Special Eurobarometer 447 – Online Platforms. European Commission. https://doi.org/10.2759/937517
Halavais, A. (2018). Search engine society. Polity.
Internet Live Stats, & Statistic Brain Research Institute. (2017). Anzahl der Suchanfragen bei Google weltweit in den Jahren 2000 bis 2016 (in Milliarden). In Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/71769/umfrage/anzahl-der-googlesuchanfragen-pro-jahr/.
König, R., & Rasch, M. (Eds.). (2014). Society of the query reader: Reflections on web search. Institute of Network Cultures.
Röhle, T. (2010). Der Google-Komplex: Über Macht im Zeitalter des Internets. Transcript. https://doi.org/10.14361/transcript.9783839414781
Statcounter. (2021). Search engine market share. https://gs.statcounter.com/search-engine-market-share.
Statista. (2021). Prognose der Umsätze mit Suchmaschinenwerbung in Deutschland in den Jahren 2017 bis 2025 (in Millionen Euro). In Statista – Das Statistik-Portal. https://de.statista.com/prognosen/456188/umsaetze-mit-suchmaschinenwerbung-in-deutschland.
Vaidhyanathan, S. (2011). The Googlization of everything (and why we should worry). University of California Press. https://doi.org/10.1525/9780520948693
Zenith. (2021). Prognose zu den Investitionen in Internetwerbung weltweit in den Jahren 2018 bis 2022 nach Segmenten (in Milliarden US-Dollar). In Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/209291/umfrage/investitionen-in-internetwerbung-weltweitnach-segmenten/.
2 Ways of Searching the Web
At first glance, searching the Web may seem trivial: we enter a query and receive a search engine result page (SERP) on which we select a result. But this is only one of the many ways to access information on the Web. In this chapter, we introduce the different ways of accessing the information on the Web and explain why access via search engines has become dominant.
2.1 Searching for a Website vs. Searching for Information on a Topic
First, we need to ask what we want or can achieve with a search. For now, it is sufficient to distinguish three cases. We will explain these cases with the help of the Firefox browser starting page shown in Fig. 2.1:
1. A user wants to go to a specific website they already know. To do this, they enter the URL into their browser’s address bar (Fig. 2.1, top line). Then, on the website, they either read something directly, conduct a search, or click on further documents. This process has little to do with our intuitive understanding of searching, but it is a means of getting to the information we are looking for. For example, a user interested in news on a current topic can go directly to a news website and either read relevant articles directly on the front page, click on them there, or search for articles using the internal search function of the news website.
2. A user wants to go to a specific website they either already know or do not yet know about and searches for it via the address bar (combined URL and search bar) (Fig. 2.1, top line) or the search field (Fig. 2.1, field placed in the middle). The search is carried out in the previously set search engine (for settings, see Chap. 8). Such a search may be for a known website – in which case searching is merely a “shortcut” to direct entry in the address bar (e.g., entering “ny times” in the search field instead of “www.nytimes.com” in the address bar) – or it may help if the user can no longer remember the exact address of the website they are looking for (e.g., if they no longer know whether a website ends with .com or .org). When searching directly for a website that is not yet known, the user at least assumes that such a website exists and searches accordingly.
3. The third type is a user looking for information not yet known to them. This type differs fundamentally from the previous two as the user is not looking for a specific website but for information on a topic. Here, it is impossible to predict with certainty whether this information can be found on a particular website or whether the information from a single website is sufficient to satisfy the information need.
Fig. 2.1 Start page of the Firefox browser with the address bar and Google search as the default start page (August 26, 2022)
We will discuss the subdivision of search queries according to intentions or information needs in more detail in Sect. 4.3. For the time being, it is sufficient to distinguish between a search for known websites and a search for unknown information. To be able to assess different ways of accessing information on the Web, it is essential that we can already distinguish between these cases.
In our example, we have already seen that search queries can be entered in different places. We will return to the significance of the search engine preset in a browser’s search box or address bar and the preset start page in the chapter on the search engine market (Chap. 8).
Is Searching the Web Like Looking for a Needle in a Haystack?
Searching the Web is often compared to looking for a needle in a haystack. This picture is meant to illustrate that it is difficult to find the right thing (the needle) because of the vast amount of information available (the haystack).
But this image is skewed. In the case of the haystack, we know what the needle looks like, and there is only one needle, so we can tell when our search is finished. However, in the case of searching for previously unknown information, we do not always have such a clearly defined idea of what we want to find. There could be several needles that might also serve our purpose differently. And it could also be that we are only satisfied when we have found several needles that complement or confirm each other.
2.2 What Is a Document?
By its very design, the Web is multimedial and contains much more than just text. In this respect, search engines are not only there to find text on the Web but also other types of information – even if search today is (still) primarily text-based. But regardless of whether it is a text, an image, or a video, we will speak of a document or, alternatively, an information object. So what is a document? When we think of documents, we might first think of official documents, such as those issued by public authorities with a stamp and signature. In information science, however, the term is defined much more broadly: a document is a record of information, regardless of whether it is in written form (text document) or, for example, in pictorial form (image document). Concerning search engines, this means that every piece of content they display (text, images, videos, etc.) is a document. Therefore, we sometimes speak of information objects instead to make it clear that we are not only talking about textual documents.
2.3 Where Do People Search?
The times when search engines were used mainly in the same context, namely, on desktop computers or laptops, are long gone. People now search on a wide range of devices, ranging from smartphones and tablets to wearables and purely voice-activated devices. Especially in the case of tablets and smartphones, we must distinguish between Web search and search within specific applications: For searches within apps, only a limited amount of data has to be searched, whereas Web search is about “the whole thing,” i.e., the most complete representation of the Web possible. No matter which device we use for searching, (Web) search is a central part of our Internet use. However, as we will see, user behavior differs depending on the context (e.g., mobile vs. at home) and device (e.g., large screen on a laptop vs. small screen on a smartphone). Search engines are adapted to deliver adjusted results and result displays on different devices and in different contexts (see Sect. 7.1 for more details).
2.4 Different Pathways to Information on the World Wide Web
Search engines are by no means the only way to access the information on the Web. In the following, we will present the different types of access and related systems. We will then put them into relation to search engines, which will again be the exclusive focus of the subsequent chapters. We will start with Web search engines themselves, as they are our starting point, and we will then compare the advantages and disadvantages of the other systems with them. In general, a distinction can be made between search engines and other systems:
• Search engines include general search engines, vertical search engines, hybrid search engines, and metasearch engines.
• Other systems include Web directories, social bookmarking sites, question answering sites, and social networks.
To understand the idea of search engines, it is essential to realize that the different ways of accessing Web content also have different objectives. For example, it would be unfair to compare the scope of the databases of search engines and Web directories, as they have very different requirements regarding the comprehensiveness of their databases. It is also relevant whether a system aims to support ad hoc searches (i.e., searches based on the input of a search query) or whether the system is to support browsing of content or monitoring specific sources. For example, the latter is the case with social networks, where users “follow” people or accounts by subscribing to their new messages. This means that content from these accounts is displayed regularly without having to repeatedly conduct a new search.
2.4.1 Search Engines
When we speak of search engines, we usually mean Web search engines (also known as general, universal, or algorithmic search engines). These engines claim to cover the content of the Web as completely as possible and, if necessary, to enrich it with additional content (see Sect. 3.2). Figure 2.2 schematically shows which contents of the Web search engines cover. The cloud represents the universe of the Web. It contains a multitude of documents stored within websites (illustrated by the hierarchical structure of documents). The content that is captured by the search engine is highlighted. Although search engines aim to capture the total content of the Web, this objective is not achieved and cannot be reached either. We will look at the reasons for this in more detail in Chap. 3. Nevertheless, search engines achieve greater coverage of the Web than any other type of search system. This is due, on the one hand, to their universal claim and, on the other hand, to the fact that they capture the content automatically. This process is described in detail in Chap. 3; at this point, it should suffice to say that search engines can capture a huge number of documents on the Web and make them searchable.
Fig. 2.2 The contents of search engines (diagram: search engines cover as many documents as possible from the Web)
Fig. 2.3 Start page of the AltaVista search engine (1996); https://web.archive.org/web/19961023234631/http://altavista.digital.com/
The Concept of the Algorithmic Search Engine in the 1990s
The idea of the search engine as we know it already evolved in the early days of the Web. Early search engines such as Lycos and WebCrawler already worked on the same principle as Google and other search engines do today: They gather the pages available on the Web by following links and return ranked lists of results in response to search queries. This process is fully automatic.
Perhaps the best way to illustrate the similarity between earlier and today’s search engines is to look at the homepage of AltaVista, the leading search engine at the time, in 1996 (Fig. 2.3). Firstly, the similarity with today’s search engines like Google is striking: There is a centrally placed search field, next to which is a button that can be used to submit the search. In principle, users can enter whatever they want without having to learn a specific query language. Whether single words, whole sentences, or questions: it is up to the automatic processing of the search engine to deliver results that match the search queries. Secondly, the AltaVista homepage contains information about the size of its database. It states 30 million documents – a large number at the time, considering that the Web was still in its infancy. In the meantime, the Web has grown many times over, but the challenge of capturing its content in a complete and up-to-date manner and making it available for search has remained (see Chap. 3). Thirdly, it should be pointed out that AltaVista made its search results available via other portals, including Yahoo, as early as 1996. Even then, many providers did not build their own search engines but used the results of one of the big search engines in cooperation. We will return to such cooperation in Chap. 8 and see its influence on the current search engine market. However, the differences between then and now should not be concealed. Already above the search box, some things are different from today’s search engines: With AltaVista, you could choose between different search modes directly on the start page, in this case, between the preset simple search with only one search field and an advanced search. Already in the simple search, you could choose the collection to be searched (default: Web) and the format in which the search results would be displayed. 
In later chapters, we will get to know the advanced search and different result presentations. The texts on the AltaVista home page around the search box are also illuminating. On the one hand, a link to a mirror site is offered; such “mirrors” are nothing more than copies of websites available in another geographical location, in this case, Australia. Internet connections in 1996 were much less developed than they are today, and one often had to wait quite a long time for responses from remote Web servers. Mirrors were created to shorten these waiting times. Today, search engines have data centers spread around the world that distribute the search engine’s database and the processing of search queries. However, users no longer have to select one of these data centers explicitly, but both the index and the processing of search queries are distributed automatically.
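The principle that early and current search engines share, gathering pages by following links and returning a ranked list for a query, can be sketched in a few lines. The miniature "Web" below, its URLs, and the simple term-count ranking are invented for illustration; real search engines use far more elaborate crawling and ranking procedures (see Chaps. 3 and 5).

```python
from collections import deque

# A tiny, invented "Web": page URL -> (text, outgoing links).
WEB = {
    "a.example/": ("search engines crawl the web", ["a.example/x", "b.example/"]),
    "a.example/x": ("ranking of search results", ["a.example/"]),
    "b.example/": ("web directories and search", ["c.example/"]),
    "c.example/": ("cooking recipes", []),
    "d.example/": ("search tips nobody links to", []),
}


def crawl(seeds):
    """Collect pages breadth-first by following links from the seed pages."""
    index, queue, seen = {}, deque(seeds), set(seeds)
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        index[url] = text.split()  # store the page's words for later search
        for link in links:
            if link in WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return URLs ranked by how often the query terms occur (ties: by URL)."""
    terms = query.lower().split()
    scores = {url: sum(words.count(t) for t in terms) for url, words in index.items()}
    hits = [(score, url) for url, score in scores.items() if score > 0]
    return [url for score, url in sorted(hits, key=lambda h: (-h[0], h[1]))]


index = crawl(["a.example/"])
print(search(index, "search results"))  # ['a.example/x', 'a.example/', 'b.example/']
```

Note that `d.example/` never enters the index because no crawled page links to it. This is one simple reason why a link-following crawler cannot capture the Web completely.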
2.4.2 Vertical Search Engines
There is a distinction to be made between general search engines and vertical search engines. Vertical search engines aim to capture as many documents as possible from selected websites. The term “vertical search engines” is often used instead of “special search engines”; in this terminology, universal search engines are referred to as horizontal search engines. Vertical search engines are restricted to a specific topic and thus make a more targeted search possible. The ranking can be specially adapted to the documents they index, as can the subject indexing of the documents. Finally, there are also advantages in the presentation of the results, which can be adapted to the individual purpose of the vertical search engines and the proficiency of the target audience. The fact that vertical search engines cannot be replaced by universal search engines results from the problems of the latter (for a detailed explanation, see Sect. 6.1):
1. Universal search engines have technical restrictions and (despite the label universal) cannot cover the entire Web.
2. There are financial hurdles that restrict the collection of content and its indexing.
3. Universal search engines are geared toward the average user.
4. They have to provide consistent indexing of all content so that everything is searchable together.¹
¹ Of course, universal search engines can carry out individual indexing for certain types of content or certain databases. However, this cannot be done for the mass of offerings to be indexed.
Vertical search engines intentionally restrict themselves to a specific area of the Web (see Fig. 2.4). Usually, they are limited to certain sources, i.e., websites. These websites are typically selected by hand. For example, if one wants to build a vertical search engine for news, it makes sense first to compile the relevant news websites, which the search engine then continuously scans for new content (pages).
Fig. 2.4 Contents of vertical search engines (diagram: vertical search engines cover as many documents as possible from selected websites)
A website is a self-contained offering on the Web that can contain several Web pages. Differentiation is made via the domain (e.g., nytimes.com), subdomains, or directories (e.g., archive.nytimes.com or nytimes.com/section/world). A Web page, by contrast, is a single document usually composed of text and associated media elements (images, videos, etc.).
Examples of Vertical Search Engines
Vertical search engines can be restricted to very different topics. Examples include Google News (https://news.google.com/), which is restricted to news, and Swiggle (https://swiggle.org.uk/), which is restricted to content suitable for children.
Hybrid search engines are a particular type of vertical search engine. Like vertical search engines, they cover a selected part of the World Wide Web but add additional content from databases to the resulting inventory. This database content is not part of the WWW and, therefore, cannot be found through standard search engines (for technical details, see Chap. 14). Figure 2.5 illustrates the hybrid search engine model.
Fig. 2.5 Contents of hybrid search engines
2.4.3 Metasearch Engines
At first glance, metasearch engines look like other search engines. They also provide the user with the same service, namely, potential access to all World Wide Web content. However, they differ from the “real” search engines in that they do not have their own index. Instead, as soon as the user enters a query, they retrieve results from several other, “real” search engines, merge them, and display them in their own results display (see Fig. 2.6).
The idea behind metasearch engines is that no one search engine can cover the entire Web. Therefore, combining the results of several search engines that cover different areas of the Web would be worthwhile. A second advantage is supposed to lie in a better relevance ranking of the results, since each of the providing search engines already delivers its best results, from which a ranking of the best is then created.
However, there is considerable criticism of the concept of metasearch engines, which is mainly directed at the fact that the supposed advantages of metasearch are claimed but not empirically proven (Thomas, 2012). It can also be argued that metasearch is an outdated idea, as today’s search engines no longer have the coverage problems that search engines had in the 1990s, when the concept of metasearch engines was born. At the very least, the benefits of better coverage only play a role in a few cases today. Even the supposed advantage of ranking no longer exists today, at least not to the same extent as it did in the past: For one thing, the universal search engines have become far better in this respect, and for another, metasearch engines do not have access to all the documents of the providing search engines (see Fig. 2.6). Rather, they only receive the documents that are already at the top of the list from each of the search engines. Additionally, they do not receive the documents themselves, but only the titles that appear in the result list, the short descriptions, and the URLs. Therefore, it is at least questionable whether a better ranking can be achieved from this.
Fig. 2.6 Contents of metasearch engines (panel text: “Metasearch engines: the best results from several search engines”; search engines 1 to 3 each contribute their result positions 1 to 10)
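To make the merging step concrete, the following minimal sketch combines three top result lists using only the URLs and their positions, which is all a metasearch engine receives. It uses reciprocal rank fusion, one common rank-merging technique chosen here for illustration; it is not claimed to be the method of any particular metasearch engine. The function name, the constant k = 60, and the example URLs are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked URL lists (best result first) into one ranking."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            # A URL gains more score the higher it appears in a list.
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical top results returned by three providing search engines.
engine_1 = ["a.example/1", "b.example/2", "c.example/3"]
engine_2 = ["b.example/2", "d.example/4", "a.example/1"]
engine_3 = ["b.example/2", "a.example/1", "e.example/5"]

merged = reciprocal_rank_fusion([engine_1, engine_2, engine_3])
print(merged)
```

Note that the fused ranking rewards URLs that several engines agree on, but it cannot look inside the documents or below the positions each engine returned, which is exactly the limitation described above.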
2.4.4 Web Directories
Web directories aim to find the best websites in a given category and arrange them in a hierarchical directory. Thus, their approach differs fundamentally from that of search engines. While search engines automatically locate and capture content, Web directories are created by people. People select the websites to be included in the directory (see Fig. 2.7); they select the appropriate positions in the classification system and describe the websites.
There are also differences on the side of searching. Although it is generally possible to search directories using keywords, their strength lies precisely in their hierarchical arrangement and, thus, the possibility of navigating from the general to the specific. Once one finds a suitable category, an overview of selected websites on the topic is presented. One can then continue browsing there. In other words, Web directories do not lead to individual documents but to sources (websites). This can be both an advantage and a disadvantage.
Fig. 2.7 Contents of Web directories (panel text: “Web directories: the best websites, arranged hierarchically”)
Web directories no longer play a role in accessing the contents of the Web; search engines have superseded them. Nevertheless, they are discussed here because, on the one hand, they represent a fundamentally different approach to that of search engines and, on the other hand, Web directories were the second most important approach alongside search engines, especially in the early days of the Web (see Hamdorf, 2001). For example, Yahoo was founded as a Web directory and only later evolved into a Web portal, which today naturally includes a search engine. Both the Yahoo Web directory and the second major directory, the Open Directory Project (ODP; “DMOZ directory”), have since been discontinued. The end of the Open Directory Project in 2017 thus also meant the end of the “era of Web directories,” even if the volunteers of the ODP now continue to run their directory on their own at https://curlie.org/ and specialized directories still flourish in some niches.
2.4.5 Social Bookmarking Sites
In social bookmarking sites, too, people decide which documents are included in the database. Whereas in Web directories, people act in defined roles as voluntary editors, social bookmarking sites are based on the principle that all users can add documents to the database by storing them as bookmarks in the system. This differs from the bookmarks that a user stores in their browser in that a user’s bookmarks are also accessible to other users and that the bookmarks are indexed using tags. In this way, the stored pages can also be found under keywords not found on the page itself, as they are considered relevant only by the users. Another way of using tags for search is to use frequency metrics to determine which documents are considered particularly important by users. A set of tags is called a folksonomy (from folk and taxonomy; see Peters, 2011). The popularity of a particular URL can be determined by the number of users who have saved it. This, in turn, can be used to rank and evaluate the search results. Some major social bookmarking sites (such as del.icio.us) have ceased to exist. Similar to Web directories, social bookmarking sites are more niche today and have not been able to establish themselves as a viable alternative to search engines.
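The two ideas in this section, finding pages via user-assigned tags and ranking them by the number of users who saved them, can be sketched in a few lines. The data structures, user names, and URLs below are invented for illustration and do not describe how any real bookmarking service was implemented.

```python
from collections import defaultdict

# (user, url, tags) triples; all names and URLs are hypothetical.
bookmarks = [
    ("ann", "https://example.org/ir-intro", {"search", "tutorial"}),
    ("bob", "https://example.org/ir-intro", {"search", "retrieval"}),
    ("cat", "https://example.org/crawler", {"search", "crawling"}),
    ("ann", "https://example.org/crawler", {"crawling"}),
    ("dan", "https://example.org/ir-intro", {"tutorial"}),
]

tag_index = defaultdict(set)  # tag -> URLs that carry this tag
savers = defaultdict(set)     # URL -> users who bookmarked it

for user, url, tags in bookmarks:
    savers[url].add(user)
    for tag in tags:
        tag_index[tag].add(url)

def search_by_tag(tag):
    """Return URLs tagged with `tag`, most frequently saved first."""
    urls = tag_index.get(tag, set())
    return sorted(urls, key=lambda u: (-len(savers[u]), u))

print(search_by_tag("search"))
```

A page is found under a tag even if the word never occurs on the page itself, because the lookup uses only what the users entered.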
2.4.6 Question-Answering Sites
A (community-based) question-answering site allows its users to ask specific questions that volunteers can then answer. In principle, any (registered) user can answer questions. The advantage for the questioner is that, in contrast to the formulation of queries in general search engines, they can describe their information need in detail and, if necessary, enter into dialogue with the answerers to arrive at a detailed answer to their question.
Apart from the issue of the quality of the answers provided, one of the main problems with this approach is that it constitutes an asynchronous search. This means that the questioner does not get an answer directly, as with search engines, but has to wait until another user is ready (and able) to answer the question. One way out is to search the archive of questions that have already been asked (and answered). In addition, questions in question-answering sites are also often tagged, which can improve findability.
Thus, question-answering sites are used in a different way than search engines: They are particularly suitable when searching with a search engine has not yielded satisfactory results, or the question is so specific that it can best be answered by a human being who thinks their way into the searcher’s information problem. A well-known example of a question-answering site was Yahoo Answers, which was discontinued in 2021. There, one could find a great variety of questions and answers sorted by categories.
2.4.7 Social Networking Sites
Social networking sites have gained enormous importance in recent years and are among the most popular services on the Web, especially for younger users (Beisch & Schäfer, 2020). For many users, they are – along with search engines – the main route of access to the Web’s content. In social networking sites, users are mainly exposed to content via recommendations from contacts (or, more generally, people or services they follow); targeted searches by entering keywords play only a minor role. In this respect, they are often complementary to search engines: when it comes to “discovering” content, social networking sites are used; for targeted searches, search engines are used. On social networking sites, content can only be found or discovered if other users have linked to it or if users themselves have created it. In this respect, social networking sites simultaneously have more and less content to offer than search engines: On the one hand, the content generated by users can be found, which is usually not publicly visible and, therefore, not covered by search engines (see Chap. 14). But, on the other hand, they do not offer a complete picture of the Web’s content like the general search engines, as only what other users have previously created or linked to is shown.
2.5 Summary
Search engines provide access to information on the Web. While general search engines (Web or universal search engines) dominate in use, numerous types of search engines have emerged. In addition, there are other systems for accessing the information on the Web. However, effective integration of the different approaches has not yet been achieved. Web search engines claim to capture the entire Web content and make it searchable. On the other hand, vertical search engines deliberately restrict themselves to certain areas of the Web that they want to cover more completely. They also have the advantage of offering a range of results adapted to the respective topic and, possibly, more detailed search options.
Hybrid search engines combine content from the Web with content from databases that general search engines cannot access. They are a subgroup of vertical search engines.
Metasearch engines do not have their own database but draw on the results of several other search engines and combine them in a new ranking.
With Web directories, websites are classified and described by humans in a classification system. While this approach was on an equal footing with search engines in the early days of the Web, it no longer plays a role today.
Social bookmarking sites allow users to save and share bookmarks. Both one’s own bookmarks and those of other users can be searched.
Question-answering sites are community-based services that allow asynchronous searches: the searcher asks a question, which is then answered (with a time delay) by volunteers. Similar to conventional search engines, however, the archive of questions already submitted in the past can be searched.
Social networking sites provide access to content created and linked by other users. Since user-generated content is largely inaccessible to general search engines, it is an important supplement, especially when it comes to “discovering” content, as opposed to targeted searches.
Further Reading
Unfortunately, there are no books on the history of search engines. However, in Battelle’s (2005) book on Google, you can find some information about their early years. There is also no comprehensive work on the classification of tools for indexing the Web; here, one may have to resort to finding separate books.
References
Battelle, J. (2005). The search: How Google and its rivals rewrote the rules of business and transformed our culture. London: Portfolio. Brealey.
Beisch, N., & Schäfer, C. (2020). Ergebnisse der ARD/ZDF-Onlinestudie 2020: Internetnutzung mit großer Dynamik: Medien, Kommunikation, Social Media. Media Perspektiven, 9, 462–481.
Hamdorf, K. (2001). Wer katalogisiert das Web?: Dokumentarische Arbeit als Big Business und Freiwilligen-Projekt. Information Wissenschaft Und Praxis, 52(5), 263–270.
Peters, I. (2011). Folksonomies und Kollaborative Informationsdienste: Eine Alternative zur Websuche? In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 29–53). Akademische Verlagsgesellschaft AKA.
Thomas, P. (2012). To what problem is distributed information retrieval the solution? Journal of the American Society for Information Science & Technology, 63(7), 1471–1476. https://doi.org/10.1002/asi.22684
3 How Search Engines Capture and Process Content from the Web
This chapter describes the technical basis of search engines. This basis includes how the documents available on the Web are brought into the search engine and how they are made searchable, as well as how the link between a search query and the documents in the database is established. Details on the workings of the crawler, the indexer, and the searcher are given. While we have already named some characteristics of search engines in the previous chapter to distinguish them from other tools for accessing the contents of the Web, we now need a precise definition since we want to deal with search engines in detail: A search engine (also: Web search engine; universal search engine) is a computer system that captures distributed content from the World Wide Web via crawling and makes it searchable through a user interface, listing the results in a presentation ordered according to relevance assumed by the system.
To understand this definition in its details, we will explain the individual elements as follows:
1. Computer system: First of all, the definition specifies that a search engine is a computer system. The word “system” already indicates that it is typically more than one computer, namely, a multitude of computers linked together to perform different functions which jointly form the search engine.
2. Distributed content from the World Wide Web: Here, there is a restriction to specific content. The World Wide Web is a part of the Internet, and search engines are initially limited to this part (even if the Web content is sometimes supplemented by other content; see Sect. 3.2). If we were to expect search engines to search the entire Internet, this would also include all e-mails, for example, since e-mail is part of the Internet but not of the World Wide Web. This content is inherently distributed, i.e., there is no central repository where it is stored. Instead, documents are stored on Web servers that are initially independent of each other. Thus, it is only through the links between documents that a network is created which can be navigated by users (moving from document to document via the links) as well as by search engines.
3. Crawling: We only refer to search engines when the systems in question capture the contents of the Web by following links; this process is called crawling. It starts from known documents; the links contained in these documents are followed, and thus, new documents are discovered, whose links are followed once again. Eventually, this process will theoretically (according to our first assumption, which we will discuss in detail below) create a complete image of the World Wide Web, which will then finally be made searchable. However, the focus on crawling does not mean that content cannot also additionally be brought into the search engine by other methods. We will discuss this in more detail in Sect. 3.2.
4. User interface: To access the content provided by the search engine, a user interface is required in which the user can enter their search queries and view the results.
5. Relevance as assumed by the system: We usually speak of the results being ordered according to relevance. It is more accurate to use the phrase “according to relevance as assumed by the system” since there can be no single correct order of search results. Different documents can be relevant for different users. We will see that there are multiple models and ideas of relevance. It is often wrongly assumed that there are right and wrong ways of ordering search results (see Chap. 5).
6. Presentation ordered according to relevance: This (perhaps unwieldy) term expresses that the presentation of results by a search engine is not necessarily a simple list, but that complex forms of result presentation are possible. We will discuss the presentation of results in search engines in detail in Chap. 7.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_3
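The crawling step described in the list above can be illustrated with a minimal sketch. Instead of fetching real pages, it walks a small in-memory link graph; the page names and the `LINKS` dictionary are invented stand-ins for downloading a document and extracting its links. Real crawlers are considerably more complex.

```python
from collections import deque

# Toy link graph: page -> pages it links to (hypothetical data).
LINKS = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
    "orphan": ["a"],  # links into the graph, but no page links to it
}

def crawl(seeds, links):
    """Discover pages breadth-first by following links from known pages."""
    discovered = set(seeds)
    frontier = deque(seeds)
    order = []
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in discovered:
                discovered.add(target)
                frontier.append(target)
    return order

print(crawl(["seed"], LINKS))
```

The page called `orphan` is never discovered because no known page links to it, which already hints at why following links alone does not yield a complete image of the Web.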
What are the advantages and disadvantages of this definition that we use in this book? First of all, it is relatively open, which means that we will probably be able to classify future systems (whatever they may look like then) as search engines according to this definition. However, one limitation of the definition is that it explicitly refers to the World Wide Web and crawling. More precisely, we would have to speak of Web search engines instead of search engines, but this does not correspond to the general usage, which simply refers to search engines.
Is YouTube a Search Engine?
YouTube is accessible on the Web and allows users to search for videos, so one could assume that it is a search engine. However, if we check the six criteria from our definition, we see that YouTube is not a search engine in our sense:
While YouTube is clearly a computer system, it does not capture distributed content from the World Wide Web via crawling. Instead, users upload videos directly to the YouTube platform, even if they are then embedded in HTML documents on the YouTube website, thus allowing other search engines to crawl them. However, the other criteria from the definition are fulfilled. In contrast, Google Video (https://www.google.com/videohp) is a (vertical) search engine or hybrid search engine: Here, videos from different sources are retrieved from the Web through crawling and then made accessible. However, we can assume that Google does not crawl its own content from YouTube but instead captures it in a structured manner. Google Video offers a much more extensive selection of videos than YouTube; many videos that cannot be found on YouTube can be found on Google Video. Search engines have several components. Again, there are several variants of the basic structure, which are, however, substantially similar. We will use the division of components according to Risvik and Michelsen (2002); see Fig. 3.1. A search engine is divided into four components: the crawler, the search engine database (local store; ideally a complete and current copy of the Web), the indexer, and the searcher. The task these components fulfill is to mediate between the content of the World Wide Web and individual users. Before we describe the individual components in detail in the following sections, we will briefly give an overview of these components. In the figure, the World Wide Web is depicted as a cloud, which illustrates that it has no clearly defined boundaries and is constantly changing. We are dealing with a nebulous quantity that we can never quite grasp. Moreover, as we shall see, it is not possible for search engines to fully map the content of the Web. 
However, it must first be emphasized that when we search with a search engine, we are never searching the contents of the Web itself, but always just a copy of the World Wide Web prepared by the search engine. It is a characteristic of the quality of a search engine that this copy should always be as complete and current as possible.
To access the content from the Web in the first place, a crawler follows the links on each page as described above, moving from page to page. The pages found in this way are stored in memory before they can be accessed. This memory is known as the “local store” and is the raw material for the index.
Fig. 3.1 Basic structure of a search engine; components shown: WWW, crawler, local store (copy of the Web), indexer, searcher, end users (Adapted from Risvik and Michelsen, 2002, p. 290)
The indexer creates the index, processing the documents so that they can be searched efficiently. This means that instead of simply storing text, images, etc., a representation of each document is created, and indexes are created to enable these representations to be found quickly. Each document is represented by a file that contains all information relevant for its retrieval. In addition, the document representation includes further information that is used for ranking (see Sect. 3.4.2 for more details). Thus, the index can be compared to the index of a book, which refers readers to the sections of the book where the relevant information can be found.
Finally, the searcher mediates between users and content. Here, documents are sorted by their assumed relevance and displayed to the user.
Why Do Search Engines Only Search a Copy of the Web and Not the Web Itself?
If the Web were searched directly when a user entered a query, all content would always be up to date and complete. However, this is impossible since such a search engine would first need to send out the crawler whenever a search query is submitted. The crawler would then have to scan all pages of the Web and search for the desired term on each page, then compile the results found, and rank them. On the one hand, this would take far too long with the many billions of documents on the Web. On the other hand, the complex calculations that prepare documents for ranking, which are normally carried out during the indexing process, could not be performed at all. Such an approach would simply not be practical. We can easily see the negative effect of this approach when searching for a keyword in a long document in a word processing program: One after another, all occurrences are displayed; the occurrences are simply listed according to their position in the text.
Such a search takes a long time and does not include any evaluation of the relevance of the entries.
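The difference between scanning every document at query time and consulting a prepared index can be made concrete with a minimal sketch of an inverted index, a data structure commonly used for this purpose. The three example documents and the function names are invented for illustration, and the sketch deliberately leaves out ranking.

```python
from collections import defaultdict

# Hypothetical local store: document ID -> text.
docs = {
    1: "search engines crawl the web",
    2: "the indexer builds an index",
    3: "web directories are built by people",
}

def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every query term (Boolean AND)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_inverted_index(docs)  # done once, ahead of any query
print(search(index, "web"))
print(search(index, "the web"))
```

The expensive work (reading all documents) happens once during indexing; answering a query then only requires looking up a few term entries, much like consulting the index of a book.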
3.1 The World Wide Web and How Search Engines Acquire Its Contents
We already distinguished between the Internet and the World Wide Web, which is a part of the Internet. What are the characteristics of the Web? It consists of documents, mainly in HTML format, which have a unique address and can be connected to each other via links. HTML is short for Hypertext Markup Language and is a language that can be used to represent documents and their links. Each published document is given a URL (Uniform Resource Locator), which uniquely defines the document’s address. Other documents can be embedded into HTML. For example, the videos that can be found on YouTube are embedded in HTML pages. Of course, today’s Web browsers can display videos and other non-HTML documents directly, but we usually see this content within HTML pages.
The goal of a search engine is to create a complete representation of the World Wide Web. However, to determine how well this has been achieved, one must first know how large the Web is. Since the Web is dynamic (i.e., it is constantly changing) and there is no central “registry,” but anyone can simply add or delete their own documents, this question is tough to answer. Since no central authority manages all Web pages, the World Wide Web’s size is also not being measured centrally.
There are several methods to determine the size of the World Wide Web. All of them are based on extrapolations, so we can never expect exact numbers here. The most important calculation methods are:
1. Examining a representative sample of websites to see how many documents they contain. On this basis, we then extrapolate the total number of documents present on the Web. Steve Lawrence and C. Lee Giles (1999) took this approach in a well-known study, in which they also found that the search engines of the time covered only a small portion of the Web.
2. Using crawling to index Web documents and then determining the total number of documents found. The problem here is that this method can only determine the number of documents found in this single crawling process, but not how many documents have not been found. Therefore, it is not possible to determine the size of the Web in this way, even though search engine companies have repeatedly published figures on the size of the Web itself based on their own data collection (e.g., Alpert & Hajaj, 2008; see also Sullivan, 2005). A somewhat recent figure of this kind comes from Google; in 2016, they stated that the search engine knows 130 trillion URLs (Schwartz, 2016).
3. Measuring the overlaps between different search engines and extrapolating the number of documents available on the Web on this basis. Bharat and Broder (1998) and later Gulli and Signorini (2005) used this method. This method can determine the total number of documents covered by search engines; however, all documents not covered by any of the search engines examined remain excluded.
4. Extrapolation of the index sizes and the size of the Web based on the number of hits reported by the search engines for selected queries. Here, the size of the Web is determined as the size of the chosen search engines’ data sets, minus overlaps. Van den Bosch et al. (2016) use this method to consistently calculate the size of the data sets of the two major search engines, Google and Bing, and from this calculate the assumed size of the Web indexed by the search engines (https://www.worldwidewebsize.com). However, it must be kept in mind that the hit counts provided by the search engines are themselves only imprecise projections (Sánchez et al., 2018; Uyar, 2009) and are not ultimately verifiable. Together with the assumptions used in the overlap calculation, the result is, at best, a very rough projection, which is about 55 billion documents at the beginning of 2021.
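To illustrate how an overlap between two indexes can be turned into a size estimate at all, the following sketch uses the simple capture-recapture (Lincoln-Petersen) estimator. This is an illustrative simplification and not necessarily the exact procedure of the studies cited above; it assumes that both indexes are independent random samples of the same set of documents, which does not hold for real search engines. All numbers are invented.

```python
def estimate_total(size_a, size_b, overlap):
    """Lincoln-Petersen estimate of the total from two samples and their overlap."""
    if overlap <= 0:
        raise ValueError("the two indexes must share at least one document")
    return size_a * size_b / overlap

# Hypothetical index sizes (in documents) and the number found in both indexes.
size_a, size_b, overlap = 600, 400, 200

estimated_total = estimate_total(size_a, size_b, overlap)
covered_by_either = size_a + size_b - overlap

print(estimated_total)                      # estimated size of the whole collection
print(estimated_total - covered_by_either)  # estimated documents in neither index
```

The smaller the overlap relative to the index sizes, the larger the estimate becomes, so any error in measuring the overlap is strongly amplified. This is one reason why such figures are rough projections at best.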
So, what can we conclude from this discussion about the size of the World Wide Web? First of all, that we are dealing with a structure that we cannot grasp exactly. It is equally difficult for search engines to fully grasp the Web. This, in turn, means that not only do we not know how big the Web is but we also do not know exactly which part of the Web is covered by search engines. Second, we should note that the Web consists of many billions of documents and its coverage, therefore, requires extensive technical resources and a significant financial investment. We will discuss this in detail in Chap. 8. Third, we need to ask ourselves what a document is in the context of the Web. Today, many documents on the Web are created automatically from ever new combinations of information, making it questionable whether documents on the Web should be counted at all. Due to the sheer volume of documents and the associated impossibility of capturing everything, some experts consider search engines to have already failed in their claim to offer at least near-complete access to the content of the Web. Which Pages of a Blog Are Documents? When you write a text and publish it on a blog, it is definitely a new document. However, what about the overview page that is automatically generated by blog software? It lists the teasers to the current articles, and the primary function of this page is to direct readers to the individual articles. Should search engines treat and capture these overview pages as separate documents? Even if we still consider this reasonable for the overview pages – what about all the other overview pages that outline the articles by date, by the assigned tags, etc. and are automatically generated by the blog software? The example is intended to show how new “documents” are automatically compiled and published without any particular input. In principle, all content available on the Web can be combined in ever new forms. 
However, for search engines, which have a limited capacity, it is essential to identify those documents that contain content relevant to the users.
To illustrate the crawling problems that arise due to the structure of the Web, we will use the model from Broder et al. (2000). This model, also known as the bow-tie model (Fig. 3.2), is based on an extensive empirical data collection of the links on the Web and shows how documents are linked to each other and how they are not linked to each other. Only some of the documents on the Web are strongly connected; this is the core of the Web (strongly connected component; SCC). Every document in the core can be reached from every other document in the core by following links (Broder et al., 2000, p. 310). Links from the core lead to an area called OUT; search engines can easily follow these links. In contrast, documents in the area called IN, which link to the core but are not linked from it, are problematic for search engines because they cannot be reached via links from any page in the core. Connections between IN and OUT exist only occasionally (tubes). In addition to the connected areas, there are so-called tendrils, which are connected to one of the three large areas but are otherwise relatively isolated. In an empirical study of 200 million documents, similar sizes were found for the four areas: core, IN, OUT, and tendrils.
Fig. 3.2 The structure of the Web: “bow-tie model” with the areas IN, SCC, OUT, tendrils, tubes, and disconnected components (Broder et al., 2000, p. 318)
Of course, the numbers from the Broder et al. article are hopelessly outdated. The relative proportions of the areas of the Web may also have changed in the meantime. However, the model itself can show us that it is still difficult for search engines to cover the Web by simply following links. It clarifies that it is not sufficient to simply follow all links from a starting page and then repeat this in the documents found until no more new documents are found. Broder et al. were able to show that this ideal-type conception of the Web does not apply in reality and that more complex procedures must be applied to cover the Web as completely as possible.
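The consequence of the bow-tie structure for crawling can be reproduced on a toy graph. The sketch below classifies pages relative to a core page by following links forward (what a crawler starting in the core can find) and backward (which pages can reach the core). The five-page graph and all names are invented; the original study worked on a crawl of about 200 million documents.

```python
from collections import deque

def reachable(start, graph):
    """Return all nodes that can be reached from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Build the graph in which every link points in the opposite direction."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, [])
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

# Hypothetical miniature Web: page -> pages it links to.
GRAPH = {
    "in1": ["core1"],
    "core1": ["core2"],
    "core2": ["core1", "out1"],
    "out1": [],
    "island": [],
}

forward = reachable("core1", GRAPH)            # found by crawling from the core
backward = reachable("core1", reverse(GRAPH))  # pages that can reach the core

scc = forward & backward     # core: reachable in both directions
out_part = forward - scc     # OUT: reachable from the core only
in_part = backward - scc     # IN: can reach the core, but not vice versa
unreached = set(GRAPH) - forward

print(sorted(scc), sorted(out_part), sorted(in_part), sorted(unreached))
```

A crawler that starts in the core finds the core and OUT, but neither the IN page nor the disconnected page, which is why additional seed pages or other acquisition methods are needed.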
3.2 Content Acquisition
Search engines aggregate content from the Web in their databases; this is referred to as content acquisition. According to Vaidhyanathan (2011), there are three areas of content acquisition: 1. Scan and link: External content is collected, aggregated, and made available for search. This is the basis of search engines and thus the most important form of content acquisition. This section will distinguish between the two most important forms in this area: crawling and content delivery via feeds, i.e., the delivery of structured data by content producers.
Fig. 3.3 Structured presentation of product information vs. unstructured information in HTML documents
2. Host and serve: Content created by users is collected and hosted on a search engine provider’s platform. In this case, the data is stored under the control of a search engine provider. The content itself, however, comes from the users. The most prominent example is YouTube, a platform owned by Google, to which users can upload videos. Other examples are the descriptions of businesses and restaurants in Google Maps, which are integrated into regular Google searches, and users’ ratings of these places.
3. Scan and serve: Items from the real world are transferred to the digital world by the search engine provider. Google, for example, has scanned a large number of books from libraries and makes them available in its search engine.
It is important to note that search engine providers also build up their own databases, which are included in the search engine but are not created by the users. For example, it has long been standard practice for search engines to offer map services (such as Google Maps). To do this, search engine providers buy structured map data, enrich it with their own data, and make it searchable in various ways.
The two main methods of capturing external content available on the Web are feeds and crawling. Feeds are used to transmit structured information to the search engine. This is the case, for example, with product search engines, a form of vertical search engine (see Chap. 6): retailers make their product catalogs available to search engines in the form of XML files (see Fig. 3.3). There are specified fields in which, for example, the name of the product, its weight, and its price are entered. The advantage of this method is that detailed information about a product can be determined precisely. In this case, for example, it is easy to extract the price of a product. In addition, the information from the individual suppliers is always complete and up to date (if the supplier keeps their list up to date).
Thus, structured data can be used, for example, to carry out price comparison searches quite easily. The disadvantage of this method is that search engines are dependent on the information providers making the data available to them in the appropriate form.
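To illustrate why structured feeds make tasks such as price comparison easy, the following minimal sketch parses a small product feed with Python's standard XML library. The feed content and the element names (`product`, `name`, `weight`, `price`) are assumptions made up for this example; real feed formats are defined by the respective search engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical product feed; the element names are illustrative only
FEED = """
<products>
  <product>
    <name>Espresso machine A</name>
    <weight unit="kg">4.2</weight>
    <price currency="EUR">199.00</price>
  </product>
  <product>
    <name>Espresso machine B</name>
    <weight unit="kg">3.8</weight>
    <price currency="EUR">149.50</price>
  </product>
</products>
"""

def parse_feed(xml_text):
    """Turn each <product> element into a dictionary with typed fields."""
    products = []
    for item in ET.fromstring(xml_text.strip()).iter("product"):
        products.append({
            "name": item.findtext("name"),
            "weight_kg": float(item.findtext("weight")),
            "price": float(item.findtext("price")),
        })
    return products

products = parse_feed(FEED)
cheapest = min(products, key=lambda p: p["price"])
print(cheapest["name"], cheapest["price"])
```

Because the price sits in its own field, no guessing is needed to find it, which is exactly what is hard with unstructured HTML pages.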
The second method is crawling (see Baeza-Yates & Ribeiro-Neto, 2011, p. 515ff. for more details), i.e., capturing content on the Web by following links. In the simplest case, a single known URL is provided to the search engine. The crawler then visits this document, captures its content for later processing (see Sect. 3.3), extracts all the links contained in the document, and then processes the linked documents one after another in the same manner. Ideally, this would gradually create a network that completely maps the World Wide Web. However, as we have already seen from Broder et al.’s (2000) model described in Sect. 3.1, this does not work in practice. Nevertheless, crawling is still the most important method for search engines to discover content. It is an excellent way to find content without the content producers actively participating in the process. In addition, it is the only method that can effectively cover a large part of the World Wide Web. Some of the problems described can already be solved by choosing not only one starting point but many. In the next section, we will look at how crawling works and how to optimize it.
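The basic crawling loop described here can be sketched in a few lines. The sketch below does not access the real Web; it assumes a hypothetical in-memory link table `TOY_WEB` and a stand-in function `toy_fetch_links` in place of downloading and parsing documents. It is meant only to show the visit, extract, and follow cycle and the effect of using more than one starting point.

```python
from collections import deque

def crawl(start_urls, fetch_links):
    """Visit pages breadth-first, following every link that has not been seen yet."""
    frontier = deque(start_urls)   # URLs waiting to be visited
    seen = set(start_urls)         # URLs already queued or visited
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)        # a real crawler would store the content here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical toy Web: URL -> URLs linked from that page
TOY_WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": [],
    "c.example/": ["a.example/"],  # no other page links to this one
}

def toy_fetch_links(url):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_WEB.get(url, [])

one_start = crawl(["a.example/"], toy_fetch_links)
two_starts = crawl(["a.example/", "c.example/"], toy_fetch_links)
print(one_start)
print(two_starts)
```

With a single starting page, `c.example/` is never found because no known page links to it; adding it as a second starting point closes that gap.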
3.3 Web Crawling: Finding Documents on the Web
The task of the crawler (also: spider) is to find new documents by following links within already known documents. The crawling process is continuous, as not only are new documents constantly added to the Web but documents are also deleted or modified. In addition, documents on the Web often have short lifespans, and content found at the same URL changes frequently (Ntoulas et al., 2004).
An initial set of known Web pages (seed set) serves as the basis of crawling. The links contained in it are followed, the documents found in this way are indexed, and the links contained in them are followed in turn. In this way, all documents available on the Web should be found; to get as close to this as possible, the pages in the seed set are chosen to be distributed as broadly as possible. In the last section, we already discussed that completeness is not feasible for structural reasons. However, a well-composed seed set leads to far better Web coverage than a single starting page.
Long-established search engines have an advantage over new ones in capturing Web content: they can capture not only the URLs they find in the current crawl but also any document that can still be located at a URL that was once known to the search engine. This means that even documents that are no longer linked on the Web at all can be retrieved.
In addition to finding new documents, the crawler’s task is to check already known documents for updates and ensure that the documents have not been deleted in the meantime. If no such checks were made, deleted documents would still be included in the search engine’s result lists, but when the user clicked on such a result, they would be referred to a document that no longer existed, and an error message would be displayed. Crawling is thus not a finished process, even concerning the already known documents, but must take place continuously.
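The re-checking of already known documents can be sketched as follows. This is a simplified illustration, not a description of how any particular search engine works: it assumes that the crawler keeps a fingerprint (here a SHA-256 hash) of each known document and that a hypothetical `fetch` function returns the current text of a URL or `None` if the document no longer exists.

```python
import hashlib

def fingerprint(content):
    """Hash the document text so that changes can be detected cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def recheck(known, fetch):
    """known: dict URL -> stored fingerprint; fetch(url) -> text, or None if gone."""
    report = {"unchanged": [], "updated": [], "deleted": []}
    for url in list(known):
        content = fetch(url)
        if content is None:
            del known[url]                      # drop documents that no longer exist
            report["deleted"].append(url)
        elif fingerprint(content) != known[url]:
            known[url] = fingerprint(content)   # a real system would re-index here
            report["updated"].append(url)
        else:
            report["unchanged"].append(url)
    return report

# Hypothetical state of the search engine and of the live Web
known = {
    "a.example/": fingerprint("Welcome"),
    "a.example/news": fingerprint("Monday's news"),
    "a.example/old": fingerprint("Old page"),
}
live_web = {"a.example/": "Welcome", "a.example/news": "Tuesday's news"}

report = recheck(known, live_web.get)
print(report)
```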
Search engines prioritize which documents to visit (whose URLs are already known and are stored temporarily in the crawler queue) according to two main criteria:
• Popularity (measured by the linking structure and the number of times the documents are accessed)
• Update interval (frequently updated websites, such as news websites, are checked more regularly for updates and new documents)
Crawlers can only reach the content of the Web that is accessible via links. This network of linked documents is the so-called Surface Web, in contrast to the Deep Web (also called the Invisible Web), which is the content that search engine crawlers cannot access. The main reasons behind this are the lack of links, the locking of content behind password requests, and dynamic content generated from databases only when a query is executed. We will look at the Deep Web in detail in Chap. 14.
What Information Can a Crawler “See”?
We are used to Web pages containing not only text but also graphical elements. For example, they may have images, animations, and videos. However, a crawler only “sees” the document’s text and the structural information from the page’s source code. Pages that are heavily graphical and contain very little text can therefore be more difficult for search engines to capture adequately. In some cases, pages may not be captured at all. An excellent way to see how search engine crawlers see HTML pages is to use a so-called Lynx viewer. A Lynx viewer allows one to see only the text and structural information on the page, but not the images or other graphical elements. There are several Lynx viewers on the Web, for example, https://adresults.com/tools/text-browser-lynx-viewer/. When you enter hbo.com into this viewer, it provides a textual representation of the information that a search engine can read at that URL. This representation is mainly the short titles of the programs displayed with the corresponding links.
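What such a text-only view looks like can be approximated with Python's standard HTML parser. The sketch below is a rough illustration under simplifying assumptions: the page is a made-up HTML snippet, and the class `TextAndLinks` is a hypothetical helper that keeps visible text and link targets while ignoring images, scripts, and style sheets. Real crawlers are considerably more sophisticated.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect the text and link targets of a page, roughly as a text browser shows them."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())

# Hypothetical page: one image, one heading, one link, some script code
PAGE = """
<html><head><title>Shows</title><style>body {color: red}</style></head>
<body>
  <img src="hero.jpg">
  <h1>New this week</h1>
  <a href="/series/one">Series One</a>
  <script>trackVisit();</script>
</body></html>
"""

viewer = TextAndLinks()
viewer.feed(PAGE)
print(viewer.text)
print(viewer.links)
```

The image contributes nothing to the extracted text, which mirrors the point made above: a page consisting mostly of graphics leaves the crawler with very little to index.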
So far, we have treated the incomplete coverage of the Web in crawling as a theoretical problem and used the bow-tie model to explain why a complete coverage of the Web is not possible. However, empirical studies have also measured how fully particular search engines cover content from selected countries. For example, it was found that search engines indexed a higher proportion of pages from the United States than pages from China or Taiwan (Vaughan & Thelwall, 2004; Vaughan & Zhang, 2007). This is referred to as country bias. In addition, different search engines covered documents from other countries to varying degrees. While the specific coverage figures for individual countries may have changed – as is the case with many studies that were conducted at a particular
point in time and can therefore only provide a “snapshot” of the Web and the search engines – the conclusion remains that the search engines by no means cover all documents on the Web and that we are not dealing with a uniform (non-)coverage of content. A person searching for information can therefore never really know whether they are searching all available documents when using a search engine.
In addition to the fundamental problem of incomplete coverage of the content of the World Wide Web, search engine crawlers have to deal with several other issues (see also Baeza-Yates & Ribeiro-Neto, 2011, p. 522ff.):
• There is a lot of duplicate content on the Web, i.e., the same content under different URLs. These duplicates can occur both within the same website (e.g., the same text in the user view and a print view) and be scattered throughout the Web. For example, countless copies of Wikipedia articles (or even the entire Wikipedia website) can be found on various sites on the Web. The challenge for search engines is, on the one hand, to distinguish the original from its copies and, on the other hand, to avoid duplicates as far as possible during crawling. Each request of a document by the crawler costs time, computing power, and money (both on the side of the search engine and the side of the requested website); therefore, it makes sense to exclude duplicates already at this point.
• Spider traps are situations in which crawlers can become trapped (intentionally or unintentionally). Often, such traps are created by automatically generated content; a simple example is calendars on the Web. Often, a monthly view is provided with the ability to scroll forward or backward a month. Since these calendars are generated automatically, it is in principle possible to scroll on indefinitely. Now, if a crawler is set to simply follow the link to the next page repeatedly, it will continue crawling the calendar forever, getting stuck on this website.
• The term “spam” refers to documents that are not wanted by search engines, especially those that are created for the sole purpose of deceiving search engines and users about their real intention. These are primarily documents of an advertising nature, which do not contain any relevant content and were created for the sole purpose of luring users to a page on which advertising is then placed or a product not sought by the user is sold. Spam is a mass phenomenon and accounts for a large proportion of content on the Web. This is illustrated by Google’s statement that they found 25 billion spam documents every day in 2019 (Google Search Central, 2020). Therefore, it is crucial for search engines that spam content does not get into the index but is excluded in the crawling process itself. Once again, the reason lies in limited resources: every search engine index, no matter how large, has a finite size, and with every spam document that is included in the index, space is missing for a possibly relevant document.
As these three problem areas already make clear, it is perhaps not at all desirable for search engines to cover the entire Web. However, it is not only the content mentioned, for which we can probably agree that it should not be included in the
Fig. 3.4 Link to a cache copy on the search result page (Bing; September 21, 2022)
Fig. 3.5 Display of a cache copy (Bing; September 21, 2022)
index, that plays a role. There is also content for which inclusion in the search engine database is at least disputed. This includes content permitted in one country but prohibited in others (e.g., documents denying the Holocaust are legally prohibited in Germany but not in the United States). Another example is content that falls under the “right to be forgotten,” i.e., information about the past of a specific person who, by request or lawsuit, has succeeded in having this content not displayed in a search engine as the result of a search for that person’s name (see Sect. 3.3.2).
When Was the Last Time a Search Engine Visited a Document?
If you want to know when a search engine last visited a document or whether it is currently available in the search engine, you can use the “cached” feature. This feature can be found on Google and Bing directly on the result pages in the document description (see Fig. 3.4). If you now call up this link, you will receive the respective document in the version that was last visited by the search engine. In addition, there is an indication of when the search engine last visited the document (see Fig. 3.5). This makes it easy to see that many documents are not available in their current version in the search engines. In the example shown here, 4 days have passed since the last visit from the search engine; in the meantime, of course, the document may have changed.
For the users, this means that, on the one hand, a particular document may not be found because, although it would be relevant for the search query, it did not contain the relevant text at the time when the search engine retrieved the document. On the other hand, a document that is known to a search engine but has been updated since the last visit may contain a link to a new document not yet otherwise known to the search engine, which in turn cannot be found when searching.
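Two of the crawling problems listed earlier in this section, duplicate content and spider traps, can be illustrated with a small sketch. It rests on simplifying assumptions: duplicates are detected only if the text is identical after normalizing case and whitespace, and the spider trap is contained by a plain depth limit. The functions `content_key`, `guarded_crawl`, and `toy_fetch`, as well as the toy pages, are hypothetical; production crawlers use more elaborate near-duplicate detection and trap heuristics.

```python
import hashlib
from collections import deque

def content_key(text):
    """Fingerprint that ignores differences in case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def guarded_crawl(seed, fetch, max_depth=3):
    """fetch(url) -> (text, links). Links are not followed beyond max_depth."""
    frontier = deque([(seed, 0)])
    seen_urls = {seed}
    seen_content = set()
    stored = []
    while frontier:
        url, depth = frontier.popleft()
        text, links = fetch(url)
        key = content_key(text)
        if key in seen_content:
            continue               # duplicate: neither stored nor followed
        seen_content.add(key)
        stored.append(url)
        if depth == max_depth:
            continue               # depth limit keeps the crawler out of endless chains
        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append((link, depth + 1))
    return stored

def toy_fetch(url):
    """Hypothetical site with a print view and an endless, generated calendar."""
    if url.startswith("cal/"):
        month = int(url.split("/")[1])
        return f"Events for month {month}", [f"cal/{month + 1}"]
    pages = {
        "home": ("Welcome", ["article", "article?print=1", "cal/1"]),
        "article": ("Some   article text", []),
        "article?print=1": ("some article text", []),
    }
    return pages[url]

stored = guarded_crawl("home", toy_fetch, max_depth=3)
print(stored)
```

The print view is recognized as a copy of the article and skipped, and the calendar is abandoned after three levels instead of being followed month by month forever.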
3.3.1 Guiding and Excluding Search Engines
In principle, search engines capture everything they find in the crawl. However, if we look at the content of the Web not from the perspective of search engines but from the standpoint of website providers, we quickly see that search engines should not find every piece of content available to them. For example, there may be areas within a website that are not suitable for a user to access directly. Other areas should not be discoverable by search engines at all. It can also be important for very large websites to indicate to the search engines which content should be given priority.
The Robots Exclusion Standard is a set of commands accepted by common search engines to guide search engine crawlers and the indexing of websites by search engines. These commands are not binding; they are an agreement among the major search engine operators, which does not mean that all other search engines adhere to them. So, if one uploads content to a webspace, one should expect that it will be found by some search engines (and thus, with a certain probability, by some humans).
Search engines can be guided through a file called robots.txt, as well as through metadata. Furthermore, it is possible to create an XML sitemap which contains detailed information for the search engine crawlers.
Metadata is information that is added to a document but does not have to be directly visible when the document is accessed. For example, a short description of the document content can be included in a meta tag. These descriptions are not visible to a user who views the document, but are used, for example, by search engines to create the snippets on the result pages. Since the metadata is provided at the document level, its information applies only to the document and not the entire website. Therefore, it allows giving the search engine precise and sometimes different instructions for each document.
On the other hand, it is, of course, laborious to determine the information for each individual document, especially if many documents are to be given the same information. The robots.txt file is suitable for this purpose. It is a file stored on the top directory level of a website, and it contains information for search engine crawlers. Using these instructions, certain areas of the website can be excluded from indexing. In
Fig. 3.6 Excerpts from a robots.txt file (from Google.com; August 12, 2021)
Fig. 3.7 Selective exclusion of specific crawlers (excerpt from newyorktimes.com; August 12, 2021)
addition, instructions can be given either for the crawlers of specific search engines or for all crawlers. The robots.txt files are public. One can view them for any website by typing in the domain and adding /robots.txt. So, for example, if one wants to see Google’s crawler instructions, one can simply enter the following URL: https://www.google.com/robots.txt. Figure 3.6 shows several excerpts from this file. Using a robots.txt file, it is possible to exclude entire directories from indexing. For example, Fig. 3.6 shows that Google bans the indexing of the entire /search directory. This prohibits the content generated by Google in response to a search query (i.e., the search result pages) from being indexed. In addition to some other settings, it is also possible to exclude specific search engines or crawlers. For example, Fig. 3.7 shows for the newyorktimes.com website that certain crawlers should not capture any content of the website at all (using the command “disallow: /”). Why would one want to exclude specific crawlers completely? After all, it is in the interest of the website provider that their website receives visitors via search engines. However, search engines are not the only entities that operate crawlers but
so do numerous providers that, for example, collect content for analytics purposes or e-mail addresses for sending spam messages. Such requests primarily cause traffic for the website operator, which leads to costs without any benefit (e.g., visitors to the website).
For some years now, search engines have also supported XML sitemaps that enable targeted guidance of search engine crawlers. This is not only about excluding certain documents or directories but about detailed information with which the crawlers can be guided. Creating such a sitemap is particularly useful for larger websites. The sitemap contains a complete list of all pages of a website, which can prevent potential crawling problems and the incomplete coverage of individual websites as described above.
It should be emphasized once again that none of the described methods offers a guarantee that content will not be captured by some crawler nevertheless. Therefore, content that one does not want to appear on the Web should not be openly uploaded to a website in the first place. In addition to information being found that should not be found at all, there is also the opposite case: due to errors, content is blocked for search engines that adhere to the conventions of the Robots Exclusion Standard. Thus, content that is accessible on the Web can no longer be found through search engines.
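How a well-behaved crawler interprets such instructions can be tried out with the `urllib.robotparser` module from Python's standard library. The robots.txt content below is a made-up example modeled on the two cases discussed above (one directory excluded for all crawlers, one crawler excluded completely); the crawler names `ExampleSearchBot` and `ExampleHarvester` and the domain are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /search is closed to everyone,
# and one specific crawler is excluded from the whole site
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

User-agent: ExampleHarvester
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(agent, path):
    """Ask whether the given crawler is allowed to request the given path."""
    return parser.can_fetch(agent, "https://www.example.com" + path)

print(may_fetch("ExampleSearchBot", "/about"))         # allowed
print(may_fetch("ExampleSearchBot", "/search?q=web"))  # excluded directory
print(may_fetch("ExampleHarvester", "/about"))         # crawler excluded entirely
```

Note that the check happens on the crawler's side. Nothing in this mechanism prevents a crawler from ignoring the file, which is why the instructions offer guidance rather than protection.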
3.3.2 Content Exclusion by Search Engine Providers
Search engines exclude content not only because of technical restrictions or because certain content or sources have been classified (automatically or manually) as spam but also because certain content is deliberately excluded. By deliberate, we mean that search engine providers manually intervene and filter certain results in certain countries. Reasons for this filtering range from government censorship to protection of minors or complaints from copyright holders. The issue of censorship mainly affects countries such as China, which generally try to restrict access to Internet content. Here, search engine providers adapt in different ways. But even in democratic countries, certain content is excluded from the search engines’ databases. However, this is a democratically legitimized exclusion of content. In Germany, search engine providers thus exclude content banned in Germany (such as documents denying the Holocaust). However, this content is not generally deleted from the global index of the respective search engine but can still be found via versions of the same search engine intended for other countries. Furthermore, content is excluded based on youth protection regulations. For example, the German Federal Review Board for Media Harmful to Young People (BPjM) sends search engine providers lists of websites classified as unsuitable for young people. Content from these websites is then excluded from German search results by the search engines. Likewise, owners of copyrighted works have the option of reporting to the search engines if these works are undesirably accessible under certain URLs. After
Fig. 3.8 “Right to be forgotten” notice (Google, August 12, 2021). Note that these messages are only shown to users in European Union countries
checking, the search engine providers then exclude these URLs from the search results. In this way, the music and film industry, in particular, has ensured the exclusion of a large number of documents from search engines (Karaganis & Urban, 2015; Strzelecki, 2019).
Finally, search engine providers exclude content from search results based on the “right to be forgotten” (see Vavra, 2018). Here, anyone who feels that their rights have been violated by the representation of their person in documents provided by search engines has the option of submitting a request to a search engine not to display this document in the search results. In the event of a positive evaluation of the request, the relevant document will then be excluded from the search results. However, since the right to be forgotten is restricted to the European Union, the relevant documents may be displayed in non-EU versions of the same search engine.
Strictly speaking, the “deletions” described above do not remove the documents from the index; rather, the documents are left out of the result list depending on the location of the searching user. From the user’s perspective, this is the same as if this content were not included in the index, even if some notice is given that results have been excluded (where, of course, the specific documents are not mentioned). Figure 3.8 shows such a notice for an exclusion based on the right to be forgotten.
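The difference between deleting a document from the index and leaving it out of the result list can be shown with a deliberately simple sketch. Everything in it is assumed for illustration: the exclusion lists, the mapping from countries to regions, and the URLs are invented, and real exclusions (for instance under the right to be forgotten, which applies to searches for a person's name) depend on more than the user's location.

```python
# Hypothetical exclusion lists: region -> URLs not shown to users in that region
EXCLUDED = {
    "EU": {"https://example.org/old-story"},
    "DE": {"https://example.org/banned-in-germany"},
}

# Hypothetical mapping from the user's country to the regions that apply
REGIONS = {"DE": ["DE", "EU"], "FR": ["EU"], "US": []}

def visible_results(ranked_urls, user_country):
    """Filter a ranked result list at query time; the index itself stays unchanged."""
    blocked = set()
    for region in REGIONS.get(user_country, []):
        blocked |= EXCLUDED.get(region, set())
    shown = [url for url in ranked_urls if url not in blocked]
    notice_needed = len(shown) < len(ranked_urls)  # e.g., show a removal notice
    return shown, notice_needed

ranked = [
    "https://example.org/old-story",
    "https://example.org/banned-in-germany",
    "https://example.org/neutral",
]

for country in ("DE", "FR", "US"):
    print(country, visible_results(ranked, country))
```

The same ranked list yields one result for a user in Germany, two for a user in France, and all three for a user in the United States, while the underlying index is identical in all three cases.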
3.3.3 Building the Database and Crawling for Vertical Collections
Up to now, we have talked about a single database that is built up by a search engine, the so-called web index. This index contains all documents known to the search engine that were found through crawling the Web and forms the most important foundation of any search engine. In addition, search engines also build separate collections for specific kinds of content (e.g., images) or specific types of content (e.g., scientific texts). However, for the user, the search engine usually presents a uniform picture: Although they can select one of the collections specifically and then also receive a separate search form, the default case is that a search query is entered in the public search interface and then a blend of results from the web index, enriched with results from vertical collections, is returned. This section describes the content acquisition for the vertical collections; a more detailed description of the collections and their basic structure can be found in Chap. 6. Finally, the merging of
results from different collections in the context of the so-called universal search is described in more detail in Sect. 7.3.3.
In some cases, separate or additional crawlers are used to create vertical collections since the requirements for collecting different types of content vary. One could imagine that a search engine operates only one crawler, which can additionally build up specific collections in a targeted manner by prioritizing according to popularity or topicality. In that case, particular documents (or documents from specific sources) would be prioritized in the general crawl. In practice, however, search engines use several different crawlers that take care of particular content. Similar to what has been presented above, the crawler for the Web content takes care of the “main task,” whereas the specialized crawlers take care of smaller tasks with special requirements (crawling selected areas of the Web is called focused crawling). An overview of the crawlers that Google, for example, uses to capture Web content can be found at https://support.google.com/webmasters/answer/1061943?hl=en.
We already noted that no search engine can capture the content of the Web fully and in real time and that priorities must therefore be set in the crawling process. The same holds true for the special crawls. Furthermore, we can differentiate between these crawls according to the crawling frequency and the targeted content:
• There is content where freshness plays a significant role and which therefore needs to be crawled more frequently. This importance of freshness applies to news, for example. To always capture the latest news, a search engine first determines (using human assessors or automatically) a set of news sources (websites). These sources are then visited frequently by a special crawler and scanned for new documents. This process is only practical because the number of sources to be revisited frequently is small: Google News, for example, searches about 700 German news sources, according to Google’s most recently published data (Google, 2017), as opposed to the several hundred million websites on the Web. Hence, the crawler is building a particularly up-to-date collection of content that could theoretically also be captured by the general Web crawler. We will discuss news search in detail in Chap. 6.
• The situation is different for images, for example. These are usually embedded in HTML documents. Therefore, they can be recognized in general crawling but must be indexed differently since they do not contain any textual information that can be made searchable. In this case, the crawler must mainly capture so-called metadata, i.e., data that is not included in the image itself but has either been added to the image by the creator (such as the date the image was taken) or generated from the context. We will discuss the indexing of images using surrounding text in more detail in Sect. 3.4.1.
If a search engine has built up different collections of documents with the help of the various crawls, it is possible to operate vertical search engines. Furthermore, if the collections are recombined within the framework of a general Web search (see
Chap. 6), it is possible to offer users a more complete and comprehensive presentation of results (see Chap. 7 in detail).
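The idea that sources with fresh content are revisited more often than others can be sketched with a priority queue ordered by the time of the next due visit. The source names and revisit intervals below are invented for illustration, and the function `schedule` is a hypothetical helper; how real search engines determine revisit intervals is not specified here.

```python
import heapq

def schedule(sources, horizon):
    """sources: dict name -> revisit interval in minutes.
    Returns (time, name) for every visit due up to and including horizon."""
    queue = [(0, name) for name in sources]   # every source is due at time 0
    heapq.heapify(queue)
    visits = []
    while queue:
        due, name = heapq.heappop(queue)      # source with the earliest due time
        if due > horizon:
            break
        visits.append((due, name))
        heapq.heappush(queue, (due + sources[name], name))  # plan the next visit
    return visits

# Hypothetical sources: a news site is revisited every 15 minutes, a blog hourly
sources = {"news-site.example": 15, "blog.example": 60}

visits = schedule(sources, horizon=60)
for time, name in visits:
    print(f"minute {time:3d}: visit {name}")
```

Within one hour, the news site is visited five times and the blog twice. This only remains affordable because the set of frequently revisited sources is small compared with the Web as a whole.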
3.4 The Indexer: Preprocessing Documents for Searching
The indexer’s task is to decompose and prepare the documents delivered by the crawler so that they can be processed efficiently in the searching process. For the most part, indexing uses procedures that are not specific to search engines but are used in all information retrieval systems (see Croft et al., 2009; Manning et al., 2008). The parsing module decomposes retrieved documents into indexable units (individual words, word stems, or N-grams) and records their occurrences within the document. This procedure creates an inverted index, which for every indexed unit records the documents in which this unit occurs. Inverted means that one goes from individual words (index entries) to documents and not vice versa. When we navigate the Web, we usually move from document to document. We read (or skim) text; often, we “scan” the text for words relevant to what we are looking for. An index consists of entries for all words known to the search engine and links to the documents in which each word appears. We can compare the inverted index with the index of a book: If we want to know at which positions in a book a particular concept occurs, we do not have to read through the whole book first to be sure that we have not missed any occurrences of the concept, but we can look in the index to get from the concept we are looking for to the pages of the book where the concept is discussed (see Fig. 3.9). However, the difference between a book index and an inverted index is that in a book index, not all words occurring in the text are indexed, but only those concepts that are meaningful for the content. A concept does not have to correspond precisely to a word in the text
Fig. 3.9 Subject index as an example of an index (excerpt from Stock and Stock, 2013)
but can also be a preferred term (i.e., another word). This distinction becomes apparent in the example below. Inverted indexes allow fast access because not all documents have to be searched, but only the keywords have to be matched to find out in which documents they occur. We will encounter this issue again later when discussing ranking: since search engines have to be effective and fast, it makes sense to carry out as many (pre)processing steps as possible in advance so that they do not have to be performed the moment a search query is issued.
In the indexing process, an index is created that represents the documents. However, this also means that when we search using a search engine, we are not searching the documents themselves but their representations (i.e., the index). Each search engine creates its own representation of the documents. Different search engines capture different documents in their crawling and represent the same documents differently by indexing them differently. This, in turn, has an impact on searching: what is not captured in the representation cannot subsequently be searched. If, for example, a search engine extracts the names of the authors of documents during indexing (see Sect. 6.3.2), it can also search for these names later. If it does not, the names can still be searched for in the full text of the documents, but the search can no longer be restricted to the actual author’s name; it can only determine whether the name occurs as text somewhere in the document.
The construction of indexes poses technical problems, which result mainly from the mass of documents to be indexed. For this reason, search engines work with distributed indexes, i.e., there is not one index that is stored centrally, but rather a distributed system is created to accelerate access and continuous updating.
This is because, just like crawling, indexing must be performed continuously; otherwise, the index entries could refer to documents that no longer exist at all or have changed in such a way that the information recorded in the index no longer matches them.
In the following, the process of creating an index will be illustrated using a few examples. The linguistic preprocessing of the documents is excluded. The examples and their presentation are based on Croft et al. (2009). We consider as examples four “documents” D1–D4 on the topic of information science, each consisting of one sentence (taken from the Wikipedia article on information science):
D1: Information science (also known as information studies) is an academic field which is primarily concerned with analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
D2: Practitioners within and outside the field study the application and the usage of knowledge in organizations in addition to the interaction between people, organizations, and any existing information systems with the aim of creating, replacing, improving, or understanding information systems.
D3: Historically, information science is associated with computer science, data science, psychology, technology, and intelligence agencies.
D4: However, information science also incorporates aspects of diverse fields such as archival science, cognitive science, commerce, law, linguistics, museology, management, mathematics, philosophy, public policy, and social sciences.
Fig. 3.10 Simple inverted index
Figure 3.10 shows an inverted index generated from these documents. Instead of the individual sentences, the words contained in all the documents are now listed alphabetically; the boxes contain the numbers of the documents in which each word occurs. Thus, for example, if we want to identify all documents containing the word “information,” we can easily do so by looking at the list without going through the individual documents themselves. Combinations of words can also be easily identified in this way: For example, if we want to retrieve all documents in which “information” and “psychology” occur together, we look up both words individually in the index and then use the document numbers to check which documents contain both together. In this example, the result is document 3. Thus, such a simple index can quickly identify documents that contain specific words or combinations of words. However, querying such an index can only
Fig. 3.11 Inverted index with information on word frequencies
determine whether a document should be included in a result set or not. For ranking purposes, however, it is also important to know additional information, such as word frequencies or the position of words in the documents. More complex inverted indexes are used for this purpose. Figure 3.11 shows an index with information about the word frequencies in the documents. It is now easy to identify in which documents a word occurs particularly frequently, which allows conclusions to be drawn about the relevance of this document for the corresponding keyword (see Sect. 5.2). For instance, we can easily see from the index that the word “science” occurs three times each in documents 3 and 4 but only one time in document 1. However, this feature is not fully effective in our examples since only short sentences are involved. With longer documents, the distribution of word frequencies in the documents can be better exploited. Inverted indexes can also contain information about the position of words in the documents (Fig. 3.12). Each word is now assigned not only the document number but also its position within the document. We can now easily see, for example, that the word “information” occurs in the first position in the first document. In contrast, it is only mentioned in the 28th position in the second document. From this, we can
Fig. 3.12 Inverted index with word positions
draw conclusions for the ranking (see Sect. 5.2). Using the position information, it is also easy to determine the proximity of two or more words to each other. Let us assume that a user enters the search query “information science.” We presume that documents containing the two words as a phrase (i.e., right next to each other in the entered order) are preferred. Using the index with word positions, we can, for instance, quickly determine that “information” occurs in document 1 at position 1 and “science” also occurs in document 1 at position 2. The two words thus directly follow each other and therefore fulfill the search query for the corresponding phrase. The three indexes mentioned are only examples of indexing documents. There are many other ways to index document text, primarily to provide efficient access to documents (see Croft et al., 2009, Chaps. 4 and 5 for details). Inverted indexes are not only needed for faster access to documents but also for ranking. In the examples, we have already seen that we can infer the importance of a word in its context from the statistical properties of texts. Thus, we can assume that a
document that contains the searched word at the beginning is more relevant for the searcher than a document in which the search term occurs only at a later position. Based on this simple criterion, we could already create a ranking. In our example, the order of the documents according to the position of the word “field” would then be document 2 and then document 1. Of course, such a ranking is far from meeting the requirements of today’s search engines; however, the example is intended to illustrate how inverted indexes can be used to quickly create a ranking without having to access each document individually. So far, we have considered the documents to be indexed as plain continuous texts, i.e., every word potentially has the same importance. However, from our own reading behavior, we know that this is not the case: we are more likely to notice and attach greater importance to words in headings or particularly highlighted words than to words that are “just there” somewhere in the text. Search engines use this structural information by giving such words special weightings in the indexing. For example, one could simply decide that a word that occurs in the main heading of a text is twice as important as the same word occurring anywhere else in the text. However, to enable such calculations, the underlying information must already be captured in the index. Here, a distinction can be made between indexing highlighted words (e.g., bold or italics) and indexing information in dedicated fields. For example, fields can be defined for the title of documents and headings within the document. These fields can then be searched in a targeted manner (see Chap. 12) or used for ranking (see Chap. 5). Indexing also plays a role in the context of search engine optimization (see Chap. 9); there, however, the aim is to exploit knowledge about indexing and to adapt one’s own content to the indexing of the search engine.
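The three index variants discussed in this section can be reproduced with a few lines of code. The following is a minimal sketch, not the way a production search engine builds its index: it assumes a naive tokenizer (lowercasing, alphabetic words only, no linguistic preprocessing), and all function names are chosen for illustration. A single positional index is built from D1–D4; document sets and word frequencies are then derived from it.

```python
import re
from collections import defaultdict

# The four example sentences D1-D4 from the text.
DOCS = {
    1: "Information science (also known as information studies) is an academic "
       "field which is primarily concerned with analysis, collection, "
       "classification, manipulation, storage, retrieval, movement, "
       "dissemination, and protection of information.",
    2: "Practitioners within and outside the field study the application and "
       "the usage of knowledge in organizations in addition to the interaction "
       "between people, organizations, and any existing information systems "
       "with the aim of creating, replacing, improving, or understanding "
       "information systems.",
    3: "Historically, information science is associated with computer science, "
       "data science, psychology, technology, and intelligence agencies.",
    4: "However, information science also incorporates aspects of diverse "
       "fields such as archival science, cognitive science, commerce, law, "
       "linguistics, museology, management, mathematics, philosophy, public "
       "policy, and social sciences.",
}


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; positions are counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(tokenize(text), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return index


def docs_with_all(index, words):
    """Simple inverted index lookup: documents containing every given word."""
    doc_sets = [set(index.get(word, {})) for word in words]
    return set.intersection(*doc_sets) if doc_sets else set()


def frequency(index, word, doc_id):
    """Word frequency, derived from the number of recorded positions."""
    return len(index.get(word, {}).get(doc_id, []))


def phrase_match(index, first, second):
    """Documents in which `second` directly follows `first`."""
    hits = set()
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.add(doc_id)
    return hits


def rank_by_first_position(index, word):
    """Toy ranking: documents ordered by the earliest position of the word."""
    postings = index.get(word, {})
    return sorted(postings, key=lambda doc_id: min(postings[doc_id]))


index = build_positional_index(DOCS)
print(docs_with_all(index, ["information", "psychology"]))  # {3}
print(frequency(index, "science", 3))                        # 3
print(phrase_match(index, "information", "science"))         # documents 1, 3, 4
print(rank_by_first_position(index, "field"))                # [2, 1]
```

Because no stemming is applied, “fields” and “sciences” in D4 are treated as words of their own. This is why the ranking for “field” contains only documents 2 and 1, as in the example above.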
3.4.1 Indexing Images, Audio, and Video Files
In our discussion of indexing so far, we have assumed text documents. However, search engines also deal with other document types, most notably images and videos. There has been extensive work on indexing images based on their content, i.e., trying to automatically capture what is depicted in the images (Tyagi, 2017). However, the results of this form of indexing do not yet reach the level of quality that would allow these methods to be used for the Web as covered by search engines. Instead, search engines extract only basic information, such as the dominant colors in an image, directly from the images. Other information is extracted from the surrounding text. Thus, in this form of indexing, images are not considered in isolation but in the context of their occurrence on the Web. In most cases, images are embedded in text. Search engines take advantage of this fact and assume that text located around an image describes that image. This is illustrated in Fig. 3.13; the highlighted area represents the surrounding text identified by the search engine, which is used for indexing the image.
Fig. 3.13 Image with surrounding text
Surprisingly, this relatively simple method can achieve better results than the far more elaborate content-based indexing. Thus, while it can be assumed that content-based indexing will continue to make progress, indexing using surrounding text will continue to play a crucial role for search engines. The third element of image indexing is the metadata attached to images. On the one hand, this can be metadata automatically generated by a camera (such as the date the picture was taken, the camera, or the lens used). On the other hand, it can also be intellectually created metadata such as keywords or tags. Video indexing follows a similar approach to image indexing; again, it involves extracting basic information from the videos themselves (such as length and recording quality), combined with metadata and surrounding text.
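The three elements of image indexing described above (basic information from the image itself, surrounding text, and attached metadata) can be pictured as one record per image. The following sketch is purely illustrative: the class `ImageRecord`, its field names, and the example values are assumptions made here, not the data model of any actual search engine.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical index record for one image."""
    url: str
    dominant_color: str      # basic information taken from the image itself
    surrounding_text: str    # text located around the image on the page
    metadata: dict = field(default_factory=dict)  # camera data, keywords, tags


def index_terms(record):
    """Searchable terms: words of the surrounding text plus keyword metadata."""
    terms = set(re.findall(r"[a-z]+", record.surrounding_text.lower()))
    terms.update(keyword.lower() for keyword in record.metadata.get("keywords", []))
    return terms


# Invented example values, for illustration only.
record = ImageRecord(
    url="https://example.org/harbor.jpg",
    dominant_color="blue",
    surrounding_text="The port of Hamburg at sunset, seen from the Elbe.",
    metadata={"date_taken": "2020-06-01", "keywords": ["Harbor", "Ships"]},
)
terms = index_terms(record)
```

In this sketch, the image becomes findable for “hamburg” only because of the text around it, and for “harbor” only because of the intellectually assigned keyword. The dominant color is kept as a separate attribute and could serve as a filter.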
3.4.2 The Representation of Web Documents in Search Engines
Search engines are often regarded somewhat disparagingly by information professionals. This is based on the argument that because search engines only capture unstructured content from the World Wide Web in full text and make just these full texts searchable, they miss out on essential information that is important for indexing the documents, making it almost impossible to conduct targeted research with search engines. We will deal with the search options of search engines primarily in Chap. 12. At this point, we will deal with the representation of documents, i.e., the information
[Figure 3.14: a document representation in three sections. Anchor texts from referring documents; the document itself, structured by headlines (Headline 1, Headline 2); and metadata, grouped as attached to the document (title, description, keywords, author), derived from the document (length, date, update frequency), and from the Web graph analysis (PageRank).]
Fig. 3.14 Representation of Web documents in sections and fields (exemplary)
that search engines capture for a particular document (i.e., a single URL). We will see that this representation goes far beyond the full text of the document. Figure 3.14 shows the division of a document represented in the search engine index into three sections: anchor texts, document text, and metadata. Let’s consider the document text first. In the previous section, we already learned about its representation. On the one hand, it consists of the words of the text; among other things, word frequencies and the position of the words within the text are recorded. On the other hand, it also includes dividing the document into fields, for example, by assigning certain parts of the text to headings. However, the representation of the document also contains information that neither originates from the document itself nor can be derived directly from the document text. The first such source is the so-called anchor texts (shown in the figure on the left). Anchor texts are the linked texts that refer from one document to another; in other words, they are the clickable text of links. In the previous sections, we considered links in isolation as references from one document to another. Now we are concerned with the text through which the link is described and made clickable. Anchor texts can be displayed differently depending on the specific website and browser settings. Still, in the vast majority of cases, they are shown in a different color than the non-clickable body text. Thus, when we read a document on the Web, we can quickly see which text refers to other documents. But only by knowing the “entire” Web (through crawling) can search engines assign the anchor texts not only to the referencing documents but also to the documents that are referenced (see Fig. 3.14). The documents are thus enriched with information from external documents that link to them. It is assumed that
authors who refer to a particular page use meaningful text for these links. If many links result in many meaningful texts, these can be evaluated as a meaningful description of the target document (see example box). The third area of the document representations in search engines consists of metadata. This is data about data. It describes documents without the need for these descriptions to appear in the documents themselves. Let’s look at an example from everyday life: If you describe one of your CDs as “party music,” for example, then you have given it metadata. Neither the music itself nor the accompanying booklet is likely to classify the CD as “party music.” Nevertheless, this designation is suitable for classifying the CD or finding it again. Especially for documents that contain no or hardly any text at all, metadata is the most crucial element of document representation. In the case of text documents, metadata serves as supplementary information for search engines. It can be obtained from different sources:
• Metadata attached to the document: This is data that the author has added to the document and that can therefore be extracted directly from the document’s HTML source code. Within HTML documents, for example, there are metadata fields for the title of the document (not to be confused with the heading visible to the user; see Chap. 9), for keywords (which do not necessarily have to be included in the text itself), and for a short description of the document content.
• Metadata extracted from the document: Such metadata can be obtained directly from the document or the search engine’s prior knowledge of the document. For example, the length of a document can be extracted directly from the document by simply counting the number of words in the document. One piece of information that can only be obtained from the search engine’s knowledge of a document is, for example, the frequency of changes: only if we know at what points in time a document has been changed in the past can we determine a typical frequency, which can then be used for crawling and ranking.
• Metadata about the website in which the document is found: Similar to the metadata extracted from the document, such data can also be derived from the entire website of which a particular document is a part. Such information could be, for example, the size of the website or the average frequency of change of documents within that website. This information can also be used to control crawling as well as for ranking.
• Metadata from the Web: This is metadata that can only be calculated by knowing the “entire” Web. In this way, they are similar to anchor texts; however, in this case, they are not text assigned to a document but numerical values that classify the document concerning a particular property. The most prominent example of such a numerical value is certainly PageRank (detailed in section “PageRank” in Chap. 5), which is a single number for each document and can be used as a measure of the quality or popularity of the document.
The representation of the document text itself, enriched by anchor texts and metadata, creates a complex document representation that goes far beyond the text
alone. As we will see, this is crucial to achieving a search as we are accustomed to from search engines today.
How Can Information from the Anchor Texts Help with Searching?
Suppose a user from Germany wants to find the website of the US Patent and Trademark Office. They are very likely to enter a German-language query such as Patentamt USA, at least more likely than United States Patent and Trademark Office, which is the official name of the authority. However, all common search engines find the official site http://www.uspto.gov even for search queries like Patentamt USA and list it in the first position of the result list. This is only possible if a search engine also exploits the anchor texts. For example, although the home page of the US Patent Office does not contain the word “Patentamt” anywhere, there are many external (German-language) pages that refer to the Patent Office website with precisely these words in the anchor text. As a result, the appropriate document can be displayed for the corresponding query. The effect of anchor text analysis becomes even more evident when we take the example ad absurdum. For a long time, if you searched for here, Google displayed the official Adobe website in the first rank, where Adobe Reader can be downloaded. In this case, the word “here” is also not contained in the document itself, but in many texts that refer to the page (e.g., “Download Adobe Reader for free here”).
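The Patentamt example can be turned into a small sketch of a document representation that combines document text, anchor texts, and metadata. Everything here is an assumption for illustration: the class and function names are hypothetical, the body text is a stand-in for the real home page, and the two links are invented.

```python
import re
from dataclasses import dataclass, field


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


@dataclass
class DocumentRepresentation:
    """Hypothetical representation of one URL in the index."""
    url: str
    body_text: str
    anchor_texts: list = field(default_factory=list)  # from referring documents
    metadata: dict = field(default_factory=dict)      # e.g., length, PageRank

    def matches(self, query, use_anchor_texts=True):
        """True if every query word occurs in the body or in an anchor text."""
        words = set(tokenize(self.body_text))
        if use_anchor_texts:
            for anchor in self.anchor_texts:
                words.update(tokenize(anchor))
        return all(word in words for word in tokenize(query))


def attach_anchor_texts(documents, links):
    """Assign the anchor text of each link to the document it points to."""
    for target_url, anchor in links:
        if target_url in documents:
            documents[target_url].anchor_texts.append(anchor)


# Stand-in body text and invented links, for illustration only.
body = "United States Patent and Trademark Office"
uspto = DocumentRepresentation(
    url="http://www.uspto.gov",
    body_text=body,
    metadata={"length": len(tokenize(body))},  # metadata derived from the document
)
documents = {uspto.url: uspto}
links = [
    ("http://www.uspto.gov", "Patentamt USA"),
    ("http://www.uspto.gov", "Patentamt der USA"),
]
attach_anchor_texts(documents, links)

found_with_anchors = uspto.matches("Patentamt USA")
found_without_anchors = uspto.matches("Patentamt USA", use_anchor_texts=False)
```

With anchor texts, the query matches; without them, it does not, because neither query word occurs in the stand-in body text. Note that the anchor texts can only be attached once the referring documents have been crawled, which mirrors the point made above about knowing the “entire” Web.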
3.5 The Searcher: Understanding Queries
We have already seen that the task of a search engine is to mediate between a user and the content of the Web. This mediation can be understood in general terms, but in practice, it is often referred to as simply matching a query with the search engine’s database. We will deal with this matching and the information used for it in Chap. 5 in detail under the aspect of ranking search results. While this is also the searcher’s task, we will first discuss how the search engine can “understand” queries. Why do queries have to be understood in the first place? Is it not enough to simply use the text entered? In Chap. 4, we will take a detailed look at how search engines are used, and we will also see that the queries sent to search engines are generally short, and in many cases, it is not even possible to read from the query alone what a user meant by this query. Therefore, if one cannot determine what kind of documents might be relevant in the first place, it is also challenging to return relevant documents for such queries. But even long and sometimes precise queries are often difficult for machines to interpret, especially when nuances matter, sometimes expressed in different uses of just one word. Google has reported significant improvements for results to such queries in 2019 with its BERT algorithm (Devlin et al., 2019) for interpreting search queries based on neural networks (Nayak, 2019).
[Figure 3.15: a query on a timeline. Labels recoverable from the figure: Pre-query (user’s search history; prior queries in session; clicks; documents viewed (+ dwell time); navigation paths of users with similar sessions), Query, and Post-query (clicks; dwell time; query; relevance judgments).]
Fig. 3.15 Temporal position of a query (Adapted from Lewandowski, 2011, p. 63)
In general, so-called query understanding provides a solution for short queries. In this process, the query is enriched with contextual information so that it is possible to distinguish between relevant and non-relevant documents. One can use information about the current user, but also information about what other users have done in the past after entering the same query. Figure 3.15 illustrates the temporal position of a query. The left block contains the information that can be obtained from searches prior to the current query. The user’s search history simply means all the data that has been collected in previous searches by the same user, for example, which queries they entered, which documents they then clicked on, and how long they spent reading these documents. However, information can also be collected about which documents a user has viewed that were not found by a previous query. We will return to this comprehensive data collection about individual users in detail in Sects. 5.3.2 and 5.6. A session is a sequence of search queries and document views that a specific user has performed within a particular time period and that relate to a specific topic. If the user is currently in a session (i.e., they have already entered at least one query before the current query), it is possible to infer the user’s current interests from the queries they have already entered. For example, suppose a user searches for “Jaguar” and has already entered several queries during the current session that are related to animals. In that case, it does not make sense for search engines to provide the user with documents related to the car manufacturer of the same name. The advantage of collecting search data at the session level is that, if the session includes several queries, conclusions can be drawn about the user’s interests without having to store too much data. Furthermore, the data only needs to be kept for a short period (just the session duration).
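The “Jaguar” example can be sketched as a simple overlap count between the earlier queries of a session and topic vocabularies. This is only an illustration under strong assumptions: the vocabularies in `TOPIC_TERMS` are invented, the function name is hypothetical, and real query understanding relies on far richer signals than word overlap.

```python
# Invented topic vocabularies; a real system would derive such associations from data.
TOPIC_TERMS = {
    "animal": {"cat", "predator", "habitat", "rainforest", "zoo", "wildlife"},
    "car": {"dealer", "engine", "price", "lease", "sedan", "suv"},
}


def enrich_query(query, session_queries, topic_terms=TOPIC_TERMS):
    """Attach an assumed topic to the query, based on earlier session queries.

    The topic stays None if the session gives no hint or the hint is ambiguous.
    """
    seen = set()
    for earlier in session_queries:
        seen.update(earlier.lower().split())
    # Score each topic by how many of its terms occurred earlier in the session.
    scores = sorted(
        ((len(terms & seen), topic) for topic, terms in topic_terms.items()),
        reverse=True,
    )
    best_score, best_topic = scores[0]
    tie = len(scores) > 1 and scores[1][0] == best_score
    topic = None if best_score == 0 or tie else best_topic
    return {"query": query, "assumed_topic": topic}


in_session = enrich_query("jaguar", ["rainforest predator", "big cat habitat"])
no_session = enrich_query("jaguar", [])
```

Here `in_session` carries the assumed topic “animal”, whereas `no_session` remains uninterpreted. Only the queries of the current session are needed, which reflects the point that session-level data can be discarded once the session ends.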
In addition to the queries, the clicks in the result lists can also be analyzed. These provide information about which search result a user preferred and can be used to measure how long the user dwelled on this search result. If the dwell time is very short and the user returns to the search engine result page, we can conclude that the document clicked on was not relevant for this particular user. If, on the other hand,
the dwell time is appropriate to the document length, it can be assumed that the user found the document interesting and read it. If we look at the paths of other users who arrived at the same query as the current one, we can draw conclusions from these paths about the interest of the users who entered the query. One can derive probabilities with which the current user is interested in particular content/documents; the query can be enriched accordingly. The right-hand part of Fig. 3.15 shows the activities that can occur after a query has been entered. Here, we can only evaluate other users’ actions after entering the query in question. Although we cannot yet know what the current user will do next, we can make predictions about this if we know the queries and clicks of prior users as well as their dwell time on the documents. We can then determine the typical search trajectories and offer relevant results or assistance to the current user. So overall, query interpretation is a complex task. After all, it is about understanding the query. In this context, we also speak of semantics (the study of the meaning of signs). When people talk about the Semantic Web or Semantic Search, they usually mean understanding the documents and not the queries. However, to deliver optimal results, a search engine must be able to do both. Ultimately, it is impossible not to interpret search queries; the question here is not so much “whether,” but “how.” On the one hand, it is a question of how much data has to be stored about an individual user; on the other hand, it is also a question of the extent to which users are made aware that a query interpretation is taking place and how the query was interpreted. We can distinguish between implicit and explicit query interpretation. Google interprets our queries without making us aware of it; this is referred to as implicit query interpretation. For example, Fig. 3.16 shows a section of a search engine result page for the query “bakery.” Based on the user’s location (in this case Chicago), results for local bakeries are displayed at the top of the list. In contrast, the “fact search engine” WolframAlpha explicitly shows us how a query was interpreted. Figure 3.17 shows how the query “london” was interpreted. It was assumed (based on popularity) to mean the city of London (as opposed to the administrative district or name) and to mean London in the United Kingdom (as opposed to, say, the city of London in Canada). However, it is open to WolframAlpha users to select a different interpretation of the query and thus modify the search accordingly. These examples already show that any query interpretation is always only one of many possible interpretations. The role of query interpretation already hints at a topic that we will discuss in later chapters: There is no right and no wrong result set for most queries. Both the ranking of results and the processing of queries are based on assumptions made by humans. One could build search engines on entirely different assumptions than those of Google or Bing and still achieve relevant results.
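Returning to the click signals discussed earlier in this section, the dwell-time reasoning can be written down as a small heuristic. This is a sketch under stated assumptions, not a description of how any search engine actually evaluates clicks: the reading speed of 200 words per minute and the two thresholds are arbitrary values picked for illustration, and the function names are hypothetical.

```python
def expected_reading_seconds(word_count, words_per_minute=200):
    """Rough reading time; 200 words per minute is an assumed default."""
    return word_count / words_per_minute * 60


def judge_click(dwell_seconds, word_count, returned_to_serp,
                short_fraction=0.1, read_fraction=0.5):
    """Interpret one click from dwell time relative to document length.

    Both fractions are illustrative thresholds, not values from the text.
    """
    expected = expected_reading_seconds(word_count)
    # Very short dwell time followed by a return to the result page.
    if returned_to_serp and dwell_seconds < short_fraction * expected:
        return "probably not relevant"
    # Dwell time roughly appropriate to the length of the document.
    if dwell_seconds >= read_fraction * expected:
        return "probably relevant"
    return "unclear"


quick_return = judge_click(dwell_seconds=5, word_count=1000, returned_to_serp=True)
long_read = judge_click(dwell_seconds=200, word_count=1000, returned_to_serp=False)
```

A document of 1,000 words is expected to take about 300 seconds to read under these assumptions, so a 5-second visit followed by a return to the result page is judged “probably not relevant”, while a 200-second visit is judged “probably relevant”. Aggregated over many users who entered the same query, such judgments are the kind of post-query information shown in Fig. 3.15.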
Fig. 3.16 Implicit query interpretation (Google, detail; August 16, 2021)
Fig. 3.17 Explicit query interpretation (WolframAlpha; November 26, 2020)
3.6 Summary
Search engines are computer systems that capture distributed content from the World Wide Web by crawling and make it searchable through a user interface. The results are listed in order of relevance as assumed by the system.
The task of a search engine is to mediate between users and the content of the World Wide Web. Search engines create a copy of the Web (the database), which the indexer prepares so that queries can be matched with the data efficiently (resulting in the index). Search engines gather the content of the Web primarily by crawling. In this process, links in already known documents are followed, and thus new documents are found. In addition, data is sometimes added to the databases in a structured form through so-called feeds. Problems with crawling arise from the size, structure, and constantly changing nature of the Web. First of all, the size of the Web is uncertain, as is the proportion captured by search engines. Second, the inherent structure of the Web leads to uneven coverage, which, among other things, results in content from different countries being unevenly captured by search engines. Third, the constant updating and changing of content makes it difficult to keep search engine databases fresh. Since search engines cannot keep their copy of the Web complete and current at all times, they guide the crawling process based on the popularity and update frequency of documents or websites that they already know of. Website owners can partially guide search engine crawlers or exclude their content from indexing altogether. Search engine providers also exclude certain content themselves; primarily spam content is excluded, but also documents that are prohibited by law in certain countries, fall under provisions for the protection of minors, or have been reported as violating copyright. In addition to the web index, popular search engines also maintain vertical collections (e.g., for news and videos), some of which are built by dedicated crawlers. Search engines index documents through inverted indexes. In this process, the documents are broken down into words, and the words are referenced to the documents.
Here, data such as word frequencies, word positions, and information contained in specific fields of the document can be taken into account. When capturing non-textual information (e.g., images), search engines use metadata and surrounding text in addition to the information contained in the images themselves. For all types of content, search engines create complex document representations that go far beyond indexing the document content only. Among other things, anchor texts from referring documents and metadata from various sources are utilized. Since most queries are too short to output high-quality results by simply matching queries to documents, queries must be interpreted. This interpretation can be hidden from the user or explicitly shown.
Further Reading
Levene (2010) provides an excellent introduction to the technology behind search engines. General textbooks on information retrieval are also helpful: Manning et al. (2008) and, because of their particular focus on Web search, Croft et al. (2009) are recommended. A highly detailed work on all aspects of information retrieval is Baeza-Yates and Ribeiro-Neto (2011), which is also a good reference work. Finally, a comprehensive account of query understanding is provided by Chang and Deng (2020).
References
Alpert, J., & Hajaj, N. (2008). We knew the web was big... http://googleblog.blogspot.de/2008/07/we-knew-web-was-big.html
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search. Addison Wesley.
Bharat, K., & Broder, A. (1998). A technique for measuring the relative size and overlap of public Web search engines. Computer Networks and ISDN Systems, 30(1–7), 379–388.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., et al. (2000). Graph structure in the web. Computer Networks, 33(1–6), 309–320.
Chang, Y., & Deng, H. (Eds.). (2020). Query understanding for search engines. Springer.
Croft, W. B., Metzler, D., & Strohman, T. (2009). Search engines: Information retrieval in practice. Pearson.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019 – 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – Proceedings of the Conference (pp. 4171–4186).
Google. (2017). Alles über Google News. https://web.archive.org/web/20170321083046/https://www.google.de/intl/de_de/about_google_news.html
Google Search Central. (2020). How we fought Search spam on Google – Webspam Report 2019. https://developers.google.com/search/blog/2020/06/how-we-fought-search-spam-on-google
Gulli, A., & Signorini, A. (2005). The indexable web is more than 11.5 billion pages. In 14th International Conference on World Wide Web (pp. 902–903). ACM.
Karaganis, J., & Urban, J. (2015). The rise of the robo notice. Communications of the ACM, 58(9), 28–30. https://doi.org/10.1145/2804244
Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(8), 107–109.
Levene, M. (2010). An introduction to search engines and web navigation. Wiley.
Lewandowski, D. (2011). Query understanding. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 55–75). Akademische Verlagsgesellschaft AKA.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Nayak, P. (2019). Understanding searches better than ever before. https://www.blog.google/products/search/search-language-understanding-bert/
Ntoulas, A., Cho, J., & Olston, C. (2004). What’s new on the web?: The evolution of the web from a search engine perspective. In Proceedings of the 13th international conference on World Wide Web (pp. 1–12). ACM.
Risvik, K. M., & Michelsen, R. (2002). Search engines and web dynamics. Computer Networks, 39(3), 289–302.
Sánchez, D., Martínez-Sanahuja, L., & Batet, M. (2018). Survey and evaluation of web search engine hit counts as research tools in computational linguistics. Information Systems, 73, 50–60. https://doi.org/10.1016/j.is.2017.12.007
Schwartz, B. (2016). Google’s search knows about over 130 trillion pages. Retrieved March 2, 2017, from http://searchengineland.com/googles-search-indexes-hits-130-trillion-pages-documents-263378
Stock, W. G., & Stock, M. (2013). Handbook of information science. De Gruyter Saur.
Strzelecki, A. (2019). Website removal from search engines due to copyright violation. Aslib Journal of Information Management, 71(1), 54–71. https://doi.org/10.1108/AJIM-05-2018-0108
Sullivan, D. (2005). End of size wars? Google says most comprehensive but drops home page count. http://searchenginewatch.com/article/2066323/End-Of-Size-Wars-Google-Says-Most-Comprehensive-But-Drops-Home-Page-Count
Tyagi, V. (2017). Content-based image retrieval: Ideas, influences, and current trends. Springer.
Uyar, A. (2009). Investigation of the accuracy of search engine hit counts. Journal of Information Science, 35(4), 469–480. https://doi.org/10.1177/0165551509103598
Vaidhyanathan, S. (2011). The Googlization of everything (and why we should worry). University of California Press.
van den Bosch, A., Bogers, T., & de Kunder, M. (2016). Estimating search engine index size variability: A 9-year longitudinal study. Scientometrics, 107(2), 839–856. https://doi.org/10.1007/s11192-016-1863-z
Vaughan, L., & Thelwall, M. (2004). Search engine coverage bias: Evidence and possible causes. Information Processing & Management, 40, 693–707.
Vaughan, L., & Zhang, Y. (2007). Equal representation by search engines? A comparison of websites across countries and domains. Journal of Computer-Mediated Communication, 12, 888–909.
Vavra, A. N. (2018). The right to be forgotten: An archival perspective. The American Archivist, 81(1), 100–111. https://doi.org/10.17723/0360-9081-81.1.100
4
User Interaction with Search Engines
Having dealt with the technical structure of search engines in the previous chapter, we now turn to the user side and look at what and how people search in search engines. Knowing user behavior not only helps us to understand actual search engine users but also to transfer key points of this behavior to other information systems and, if necessary, take them into account when creating our own systems.
On the one hand, search engines have shaped the behavior of their users through the search functions they offer and the way they present results (see Chap. 7). On the other hand, we as users also shape the structure of search engines, which adapt to our behavior. To understand the current structure of search engine user interfaces, and why alternative search engines or approaches to search often struggle to gain acceptance, one must understand how users behave who have been accustomed to a specific “search style” for many years.
This chapter does not aim to introduce the general theories of information behavior that can also be applied to search engines. Introductions or overviews can be found in, among others, Case and Given (2016), Ford (2015), and Fisher et al. (2005); for a transfer to the context of Web search, see Burghardt et al. (2013).
4.1
The Search Process
The workflow of searching in a search engine can be depicted as a simple process (Fig. 4.1). It begins with the selection of a search engine. Then the query is entered. The next step is to view the search result page and, usually, select a result. However, the search process can also be completed without selecting a result if the information sought can already be found on the search engine result page (SERP; see Sect. 6.2). When a user selects a result and thus reaches a result document, they check it for its suitability for their information need. At this point, the search process may already be finished; optionally, the user navigates through the website structure on which the document suggested by the search engine is located or uses the website’s internal search function.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_4
Fig. 4.1 The search process in Web search (Adapted from Lewandowski, 2012, p. 104). Steps shown: (1) Selecting a search engine, (2) Enter query, (3) Select results on the SERP, (4) Evaluate results, (5) Navigate/search within a found website.
The dotted arrows in Fig. 4.1 show that a user can always jump back to one of the previous steps within the search process. In particular, users often bounce back and forth between result documents and the SERPs when they look at several documents until their information needs are eventually met.
The search process can also be considered a series of decisions a user must make. First, the user has to decide on a search engine. Then they must decide how to translate their information need into a query. This is followed by the selection of a result on the SERP, then the decision on whether to look more closely at the result document, and then whether it is worthwhile to continue navigating or searching on that website. Finally, the user has to decide whether they want to select further documents on the SERP, rephrase their query, or replace it with a new query.
We can see that the supposedly simple process of searching in a search engine comprises a series of decisions, each of which can also be influenced by the search engine. A simple example of this is the ranking of results: it is much more likely that a user will select the first result in a result list than one at a lower rank. The ranking of results can be thought of as a suggestion by the search engine that we may or may not follow.
In Chap. 15, we will discuss in detail the influence search engines have on their users; in this chapter, we will first describe user behavior without going further into the complex relationship between the interests of search engine providers, content producers, and users. The complex result presentation on the search engine result pages and the associated selection behavior will then be dealt with in a separate chapter (Chap. 7). In the following, we will focus on the first part of the search process, entering queries.
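The search process and its backward jumps can be sketched as a small state machine. The following Python sketch is only an illustration under stated assumptions: the step names follow Fig. 4.1, while `SearchStep`, `is_valid_transition`, and `is_valid_path` are hypothetical names introduced here. The rule encoded is that a user either moves on to the next step or jumps back to any earlier one.

```python
from enum import IntEnum


class SearchStep(IntEnum):
    """Steps of the search process as named in Fig. 4.1 (hypothetical enum)."""
    SELECT_ENGINE = 1
    ENTER_QUERY = 2
    SELECT_RESULT = 3      # view the SERP and select a result
    EVALUATE_RESULT = 4
    NAVIGATE_SITE = 5      # navigate/search within the found website


def is_valid_transition(current, nxt):
    # Forward: only to the immediately following step.
    # Backward: a jump to any earlier step is always allowed.
    return nxt == current + 1 or nxt < current


def is_valid_path(path):
    # Assumption: every search process starts with choosing a search engine.
    if not path or path[0] != SearchStep.SELECT_ENGINE:
        return False
    return all(is_valid_transition(a, b) for a, b in zip(path, path[1:]))


# A user bouncing between a result document and the SERP before
# finally navigating within a website.
bouncing = [
    SearchStep.SELECT_ENGINE,
    SearchStep.ENTER_QUERY,
    SearchStep.SELECT_RESULT,
    SearchStep.EVALUATE_RESULT,
    SearchStep.SELECT_RESULT,
    SearchStep.EVALUATE_RESULT,
    SearchStep.NAVIGATE_SITE,
]
print(is_valid_path(bouncing))  # True
```

A path may end at any step, which covers the case in which the information need is already met on the SERP.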
4.2
Collecting Usage Data
In this chapter, we will discuss the behavior of search engine users. But first, we must clarify how information about this behavior can be obtained. Essentially, we can distinguish between three methods: surveys, lab studies, and transaction log analyses. In surveys, users are asked about their behavior using questionnaires or similar. In addition to the actual questions on behavior, socio-demographic data (such as age and gender) can also be collected, which enables a detailed evaluation. The disadvantage of recording behavior via surveys, however, is that one has to rely on what
users say (or enter in a questionnaire) corresponding to what they actually do. Moreover, there can be misrepresentations, for example, if users do not explicitly remember or cannot sufficiently reflect on specific behaviors or if users do not want to disclose certain behaviors. For instance, obtaining valid data through a survey on whether and how often users search for pornography online will be impossible.
In lab studies, users are invited to a lab where they are observed and interviewed. The advantage of this method is that actual user behavior can be observed, including complex behavior and behavior that has not been anticipated. In addition, groups can be formed by a supplementary survey of socio-demographic characteristics so that a detailed evaluation is possible. A serious disadvantage of this method, however, is that it is restricted to a relatively small number of participants because of the high cost of recruiting participants and conducting the research. A second significant disadvantage is that although behavior can be observed in great detail in the laboratory setting, this behavior does not necessarily correspond to the participants’ actual behavior in real search situations.
In transaction log analyses, we look at search queries actually carried out by users, using the automatically generated logs of all interactions between searchers and the search engine. The two major advantages of this data collection method are that user behavior is observed without the users being aware of it (it is, therefore, a non-reactive method) and that all interactions in a specific time period can be recorded, thanks to automatic recording. This means that very large data sets can be analyzed, often containing many millions of interactions. One disadvantage of this method, however, is that no further information about the users is known since they do not explicitly provide any data.
In addition, one needs access to the data from a specific search engine to conduct transaction log analyses. While search engine providers used to make some data sets available for research purposes, this is rarely the case today, so such studies are now either carried out directly by the search engine providers (though then they are seldom published) or at least in cooperation with them. In log file studies, only data from a specific search engine can be queried; comparing behavior across different search engines is not feasible. At this point, it should be clear that the individual methods have their advantages and disadvantages and that we depend on results from studies using different methods. Consequently, the results presented in the following sections are summarized according to individual aspects of user behavior without separating them by method.
4.3
Query Types
One of the fundamental distinctions we can use to understand how users search using search engines is the distinction by query type. Here, we are not looking at the topic of a query but its objective. So, what does a user expect or want to achieve when they enter a query? The first distinction is between concrete and problem-oriented information needs (Frants et al., 1997, p. 38). Let us consider two sample questions:
Table 4.1 Distinction between concrete and problem-oriented information needs (Adapted from Frants et al., 1997, p. 38)
Concrete need for information | Problem-oriented need for information
The thematic boundaries are clearly defined | Thematic boundaries are not determined
The search query can be expressed by exact terms | The search query formulation allows for several terminological variants
Factual information (or a single good document) is usually sufficient to meet the need | As a rule, various documents have to be sifted through. Whether the need for information is finally covered remains open
The information problem is solved with the transmission of the factual information (or the document) | With the transmission of the information, the need for information may be modified, or a new need may arise
1. How many inhabitants does Hamburg have?
2. What influence has the writer Kyril Bonfiglioli had on English literature?
Even at first glance, a fundamental difference becomes apparent: In the first case, we expect a concrete number; in fact, there is only one correct solution. This is a concrete information need. In the second case, there is not one correct solution but many building blocks that lead us to a solution. And this solution can be quite different depending on the amount of information, interest, interpretation, and way of looking at things. This is a problem-oriented information need.
While in the case of concrete information needs, the search has a clearly defined end (namely, when the fact we are looking for has been found), with problem-oriented information needs, we never know when we are finished. For our sample question, we can look at just one document that gives us a first rough overview of the topic. But we can also decide to look at a few more documents to find even more aspects of the topic and go more in-depth. This can be continued almost indefinitely – whether we want to answer the question in one sentence or write a dissertation is up to us. Table 4.1 summarizes the most important distinguishing features of the two types of information needs.
The differentiation between concrete and problem-oriented information needs is not specific to searching on the Web but describes different types of information needs in general. However, the approach can be transferred to searching the Web with some adjustments to the specific circumstances.
Andrei Broder (2002) made the crucial distinction between informational, navigational, and transactional queries. With navigational queries, the aim is to (re)find a page that the user already knows or assumes exists. Examples are searches for company websites (Microsoft) or people (Heidi Klum). Such queries usually have one correct result. The information need is satisfied as soon as the desired page is found.
Navigational queries thus address a concrete information need, but with the characteristic that the user wants to go to a specific destination.
In the case of informational queries, the information need cannot usually be satisfied by a single document. Instead, the user wants information about a topic and reads several documents. Informational queries are aimed at static documents. After accessing the document, no further interaction on the website is necessary to obtain the desired information. Informational queries are based on problem-oriented information needs. However, it should be considered that search engines do not always answer factual questions with a concrete answer but often provide a list of documents in which the concrete answer must be sought. In cases where the search is indeed problem-oriented, the desired result set can range from just one document to a large number of documents.
Finally, with transactional queries, a website is searched for on which a transaction subsequently takes place, such as purchasing a product, downloading a file, or searching a database. In other words, the user searches for a source that is then interacted with. Here, one can again distinguish whether a specific source is being searched for (Hamburger Sparkasse online banking; there is only one website on which one can do online banking as a customer of Hamburger Sparkasse) or whether a specific type of transaction is being searched for without the specific website playing a role (buy Homeland season 2 DVD; the DVD is available on many different websites).
Table 4.2 Query intents in Web search (a)
Query type | Guiding question | Examples
Navigational | Where can I go? | Ebay – a user wants to navigate to the eBay website
 | | Pierce Brosnan – a user wants to navigate to the actor’s homepage
Informational | What can I learn? | Goethe Faust – a user wants information about the drama written by Goethe
 | | Pierce Brosnan – a user wants to learn about the actor, for example, which films he has appeared in and how critics rated them
Transactional | What can I do? | play bubble shooter – a user wants to go to a website where they can play the game (several websites offer the desired result)
 | | Hamburger Sparkasse Online Banking – a user wants to get to the Hamburger Sparkasse website where they want to do online banking (only one website offers the desired result)
(a) Model from Broder (2002); guiding questions are from Thurow and Musica (2009)
Table 4.2 shows the three query intents according to Broder with their guiding questions and some examples. The table shows that it is often impossible to tell from the query alone what a user wants to achieve. For example, the query Pierce Brosnan can be both navigational and informational; a user may want to get to the actor’s homepage, but they may also want to get detailed information about the actor from various documents. This is another reason why it is difficult to determine what proportion of the total number of queries is accounted for by the individual query intents.
Another problem is that studies dealing with this topic are not based on a common data set and differ in terms of the search engine used, the time period, and the classification method (see Lewandowski et al., 2012, p. 1775f.). However, the studies agree that all three query intents account for a substantial share of queries. Table 4.3 shows – admittedly rather roughly – the approximate proportions of the query intents. The high proportion of navigational and simple informational queries, in particular, suggests that user satisfaction with search engines can be explained to a large extent by successfully answering these queries (Lewandowski, 2014).
Table 4.3 Share of queries by query intent (from Lewandowski, 2014, p. 48)
Query type | Range (unambiguously assessable; not unambiguously assessable)
Navigational | 27–42%
Informational | 11–39%
Transactional | 22%
Of course, one can legitimately ask how good Broder’s classification is if many queries cannot be clearly assigned. After all, it should be the very essence of a classification system that all items can be unambiguously assigned to a class. In this respect, the Broder taxonomy is certainly not optimal. Some authors have tried to expand or specify the classification accordingly (e.g., Calderon-Benavides et al., 2010; Kang & Kim, 2003; Rose & Levinson, 2004). However, this has not been satisfactorily achieved. For example, according to Google, there are four search intents for mobile searches (Google, 2015): “I want to know,” “I want to go,” “I want to do,” and “I want to buy.” Nonetheless, the appeal of Broder’s classification lies precisely in its simplicity and understanding of Web search as having multiple possible intentions. And only search engines that can handle these different intentions can produce satisfactory results for users (see Chap. 13).
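The point that a query cannot always be assigned to exactly one class can be made concrete with a small data structure. The sketch below is an illustration only: it stores the example queries from Table 4.2 with the set of intents they may carry. `Intent`, `EXAMPLE_INTENTS`, `possible_intents`, and `is_ambiguous` are hypothetical names, and the hand-made lookup table is an assumption for demonstration, not a classification method used by any search engine.

```python
from enum import Enum


class Intent(Enum):
    """Broder's three query intents with their guiding questions."""
    NAVIGATIONAL = "Where can I go?"
    INFORMATIONAL = "What can I learn?"
    TRANSACTIONAL = "What can I do?"


# Hand-labeled examples taken from Table 4.2 (illustrative, not exhaustive).
EXAMPLE_INTENTS = {
    "ebay": {Intent.NAVIGATIONAL},
    "goethe faust": {Intent.INFORMATIONAL},
    "pierce brosnan": {Intent.NAVIGATIONAL, Intent.INFORMATIONAL},
    "play bubble shooter": {Intent.TRANSACTIONAL},
    "hamburger sparkasse online banking": {Intent.TRANSACTIONAL},
}


def possible_intents(query):
    # Unknown queries yield an empty set: the query text alone does not
    # reveal what the user wants to achieve.
    return EXAMPLE_INTENTS.get(query.strip().lower(), set())


def is_ambiguous(query):
    return len(possible_intents(query)) > 1


print(is_ambiguous("Pierce Brosnan"))  # True
print(is_ambiguous("Ebay"))            # False
```

Modeling the label as a set rather than a single value reflects the weakness of the taxonomy discussed above: a strict classification would require exactly one class per query.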
4.4
Sessions
A session is a sequence of queries and document views performed by a specific user on a specific topic within a particular period of time. The end of a session is defined either by completing the search, by the user terminating the search, or by the passing of a certain period of time. Therefore, a session differs from the search process described in Sect. 4.1. A session can contain one or more search processes or only parts of them. For example, if a user enters a query, views the search engine result page, but then abandons the search, only part of the search process has been completed, but it is a complete session. Conversely, a session can consist of many queries, each triggering a search process involving several documents being viewed. This is the case with a more in-depth search, in which new or modified queries are submitted until a user finally concludes the search because they think they have found enough information on their topic.
If a user resumes a search that was started earlier after some time, this is no longer the same session. Sessions are characterized by the fact that they are uninterrupted (or only relatively briefly interrupted) search processes. Query intents can imply certain session types, but we cannot tell from a query alone what kind of session it is. We can, however, distinguish sessions according to their length: Probably the shortest form of a session is a so-called lookup search, in which factual information is sought. This query is based on a concrete information need that is usually answered quickly. Other short sessions involve queries that are based on a problem-oriented information need, but where the searcher only wants to get a quick overview of a topic through information from a single document. Typical examples are searches that end with skimming or reading a Wikipedia article. Short sessions are also generated from navigational queries: here, as well, only the appropriate result is quickly selected on the result page, and the session is ended. Typical sessions that begin with informational queries vary considerably in length and can range from viewing one or a few documents to long sessions that involve entering several queries and viewing a large number of documents. Session lengths vary accordingly from well under a minute to several hours. The same applies to sessions that begin with a transactional query. However, it can be assumed that the range here is not quite as wide. Within sessions, different query intents are also frequently mixed. For example, we often observe that a user first searches in an informational way to get an overview of a topic, followed by navigational and/or transactional queries. The limit of looking at search behavior on a per-session basis becomes apparent with so-called exploratory searches (White & Roth, 2009). 
Here, the information need is often still unclear at the beginning of the process, and a combination of searching and browsing often occurs within such searches. These types of searches can also extend over several sessions – the work is taken up repeatedly, often over several days or weeks. A typical example is planning a holiday trip: At the beginning, a user may only want to get basic information about different travel destinations that seem interesting to them. In the following steps, their plans become more concrete, and their queries change accordingly to concrete searches for hotels, flights, and sights. Throughout this process, new sessions are started again and again – it cannot be assumed that a user planning their annual holiday will carry out all relevant searches and complete the task within just one session.
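The time-based part of the session definition can be sketched in a few lines. The following Python sketch splits the timestamped queries of a single user into sessions whenever the pause between two queries exceeds a timeout. The function name `split_into_sessions` and the 30-minute default are assumptions made for illustration; the text above does not prescribe a specific value, and the sketch deliberately ignores the topical criterion of the definition.

```python
from datetime import datetime, timedelta


def split_into_sessions(events, timeout=timedelta(minutes=30)):
    """Group (timestamp, query) tuples of ONE user into sessions.

    A new session starts when the gap to the previous query is longer
    than `timeout` (hypothetical default, not taken from the source).
    """
    sessions = []
    for timestamp, query in sorted(events):
        last_timestamp = sessions[-1][-1][0] if sessions else None
        if last_timestamp is not None and timestamp - last_timestamp <= timeout:
            sessions[-1].append((timestamp, query))
        else:
            sessions.append([(timestamp, query)])
    return sessions


# Holiday planning spread over several days: one exploratory search,
# but more than one session.
events = [
    (datetime(2022, 8, 1, 10, 0), "holiday destinations europe"),
    (datetime(2022, 8, 1, 10, 5), "lisbon sights"),
    (datetime(2022, 8, 3, 19, 0), "hotel lisbon"),
]
sessions = split_into_sessions(events)
print(len(sessions))  # 2
```

The example mirrors the holiday scenario: the queries belong to the same task, yet a purely time-based segmentation places them in separate sessions, which is exactly the limit of a per-session view.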
4.5
Queries
The most basic level of the search process is entering a query. If we look at queries in isolation, we lose all contextual information, so we do not know when a user searched, what they did before or after, and how the query was integrated into a search process. However, by looking at queries, we can learn a lot about how users formulate queries and how complex these queries are.
In characterizing search queries, we often forget that while our habitual query behavior may be trained and therefore seem natural to us, it is far from the only way queries can be entered. For example, it has become common practice to use search engines to simply string together meaning-bearing words (keywords); however, it is also possible, for example, to ask questions or formulate complex search arguments with the help of special commands.
4.5.1
Entering Queries
While Web search is still primarily understood as an input of textual queries and the output of a list of documents, both the input and output sides are in flux. Queries can be entered in very different ways, and the output can also be not only text but also speech, for example. In Fig. 4.2, the process of query processing is positioned in the context of different types of queries and different types of documents. Nevertheless, the basic process is always the same: A query is submitted and processed by the system, which finally outputs information objects. These information objects can be of very different types; in addition to the familiar lists of texts, images, or videos, they can also be, for example, factual information, direct answers, or summaries from several documents (see Sect. 7.3).
A query can be submitted in many different ways. At this point, the three most important forms of explicitly entering queries will be discussed:
• Text input
• Speech input
• Searching with a reference document
In today’s search engines, when a user enters a query, we usually deal with a string of search words. On the textual level, however, there are far more possibilities: for example, queries can consist of complex Boolean search arguments (see Sect. 12.3) or fully formulated question sentences.
Fig. 4.2 Processing of different types of queries and output of different types of documents (exemplary). Elements shown: Query (text, spoken language, images, videos, barcodes, …) and implicit queries; Processing; Information objects (documents, facts, answers, summaries, …); further labels: spoken language, images, videos, …
Entering queries using spoken language has become very popular in recent years. It is an alternative to textual input and is used primarily on mobile phones, where text input is cumbersome. This shows a high potential for voice input. More than half of all queries are now made from mobile devices (Sterling, 2016); the figures vary depending on the search engine (Statista, 2021; see also Sect. 8.3 on the different market shares of search engines for desktop versus mobile queries). In addition to using voice, which can simply be seen as a solution to an obvious problem, many systems have been developed that focus primarily or even exclusively on voice commands. In addition to smartwatches, systems such as Amazon Echo and Google Home are particularly worth mentioning here. These are not primarily designed as (Web) search systems, but in addition to options for controlling smart home systems and simple assistance functions (such as managing shopping lists and calendars), they offer search functions in various databases (from Amazon’s product database to Google’s Web search). With voice search, search engines become less dependent on specific use cases and can be used in a wide variety of contexts. For example, one could think of searching during a car journey, where the driver cannot type in queries via a keyboard (White, 2016, p. 183). A prerequisite for spoken language input is, of course, reliable speech recognition. Technical advances in this area have led to speech recognition now hardly being a problem. The system converts the spoken query into text, which is then used as a query. This illustrates that we are really “only” dealing with a modified form of input but that the actual processing of the query is not affected by this. Systems that allow a spoken dialogue between user and search engine go one step further (see White, 2016, p. 182ff.). 
In this context, a search session is understood as a dialogue in which reference can be made to queries already submitted and the answers that the search engine has provided. Simple forms of such conversational search have already been implemented. Sullivan (2013) describes an example where someone searches for the age of Barack Obama. After the answer is given, the user can ask, for example, how tall “he” is. The search engine understands who is meant by the previous query and can answer the second query accordingly. The dialogue can be continued; the search engine always uses the context generated within a session.
A third input form is searching with a reference document, i.e., one knows a relevant document on a topic and uses this document to find other documents that are similar to it. For example, many search engines support entering an image as a query. Other images are then retrieved that are similar to the image submitted. The similarity between the images is measured based on quite basic characteristics such as colors and similarities in shape; in addition, information from the text surrounding the respective image is added (see Sect. 3.4.1). This type of search works surprisingly well for recognizing familiar subjects. For example, suppose one wants to obtain information about a building. In that case, one can upload a photo of it to a search engine and receive further information via the image search results. The search engine can also use the user’s location data, which makes it much easier to match a picture to a building.
In addition to these different forms of explicit query input, there are also so-called implicit queries. The search engine automatically generates these based on a current usage scenario or a user profile. The user is not even aware that a search is being carried out in the background but only sees the result of this search, e.g., in the form of suggestions of news items that might interest them.
4.5.2
Autocomplete Suggestions
Even while users are entering their search query, search engines usually offer suggestions for formulating the query (Fig. 4.3; see also Lewandowski & Quirmbach, 2013), which may be selected optionally. These suggestions can help the user to make more precise search queries, as they will always extend the query that has been entered so far. Therefore, it can be assumed that the longer the query, the more specific it becomes.
Fig. 4.3 Autocomplete suggestions during input (Google; August 29, 2022)
The autocomplete suggestions are primarily based on queries that other users have entered in the past. Collecting these queries and ordering them according to their popularity make it possible to predict probable input from a user (assuming that the user is likely to search for something that has been searched for frequently in the past). However, information needs, and therefore queries, can change rapidly, so it does not make sense to rank past queries by popularity alone. Instead, the time period of the queries included in the ranking must be restricted. It can be assumed that the period can or must be chosen smaller or larger depending on how often a query is made.
Search suggestion output is not a completely automatic process based purely on the queries that users have entered, even though the search engine providers tend to suggest as much. For some keywords, search suggestions are simply not displayed (on the German version of Google, e.g., for the keyword Juden [Jews]) to prevent the search suggestions from leading to content considered undesirable by the search engine provider. Furthermore, over the years, the suggestions for more and more queries have been filtered, or specific words in the suggestions have been excluded so that today one can hardly find any extreme examples of offensive or derogatory suggestions (unlike a few years ago; see Lewandowski & Quirmbach, 2013).
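The three ingredients described above (popularity of past queries, a restricted time period, and filtering of certain keywords) can be combined in a short sketch. This is a minimal illustration under stated assumptions, not how any particular search engine computes its suggestions: the function `suggest`, its parameters, the 30-day window, and the word-level blocklist are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def suggest(prefix, log, now, window=timedelta(days=30), blocked=frozenset(), k=3):
    """Return up to k past queries that extend `prefix`, most popular first.

    log     -- iterable of (timestamp, query) tuples from all users
    window  -- only queries newer than now - window are counted (assumed value)
    blocked -- words for which no suggestion is shown (assumed mechanism)
    """
    prefix = prefix.lower()
    counts = Counter()
    for timestamp, query in log:
        query = query.lower()
        if now - timestamp > window:
            continue  # too old: information needs may have changed
        if not query.startswith(prefix):
            continue  # suggestions always extend the input so far
        if set(query.split()) & set(blocked):
            continue  # filtered keyword: suppress the suggestion
        counts[query] += 1
    return [query for query, _ in counts.most_common(k)]


now = datetime(2022, 8, 29)
log = [
    (now - timedelta(days=1), "search engine"),
    (now - timedelta(days=2), "search engine"),
    (now - timedelta(days=3), "search engine optimization"),
    (now - timedelta(days=400), "search party"),   # popular once, now outdated
    (now - timedelta(days=400), "search party"),
    (now - timedelta(days=400), "search party"),
    (now - timedelta(days=1), "search badword"),   # filtered by the blocklist
]
print(suggest("sea", log, now, blocked={"badword"}))
# ['search engine', 'search engine optimization']
```

A fixed window is the simplest choice; the remark that the period should depend on how often a query is made would call for a window that shrinks for frequent queries and grows for rare ones.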
4.5.3
Query Formulation
Essentially, the formulation of queries is about a user expressing their (subjective) information need in words so that these can be processed by the search engine to finally return a result that satisfies the information need. We will return to this topic in the chapter on search result relevance ranking; at this point, we will first discuss the “translation problem” in formulating queries. In a large-scale study, Machill et al. (2003) examined the behavior of German search engine users. They summarized their findings on query formulation as follows: “Most users are unwilling to expend too much cognitive and time energy in formulating their search target” (Machill et al., 2003, p. 169). First, it may seem lamentable that users do not seem to put too much effort into formulating queries. However, one must also ask whether, or in which cases, the effort would be worthwhile at all. For example, in the discussion of query intents, we saw that many queries could be answered unambiguously. In these cases, a search engine can ensure that the unambiguously correct result is listed first, primarily by evaluating click data from previous searches. But even in the case of queries that cannot be answered unambiguously, there is often enough to be found from the click data to produce results that are at least somewhat relevant. For the user, this raises the question of why they should put a lot of effort into formulating a query when they can also achieve their goal more straightforwardly. The same applies to the time spent on formulating a search query; here, the question arises as to why one should think long and hard when one can simply check out what results come up when one enters a query (and then, if necessary, modify it without much effort or replace it with another query). Checking for correct spelling is also part of formulating queries carefully. Unfortunately, users usually do not tend to check their queries before submitting them. 
Search engines have long reacted to this by showing spell corrections on the result pages or, if there is a high statistical probability of an incorrect entry, by automatically correcting the query with the option of switching to the results for the original query. Figure 4.4 shows the two forms of input correction.
Fig. 4.4 Suggested correction (top) and executed correction (bottom) on the search engine result page (examples from Google)
From what has been described, we see a shift compared to “classic” information retrieval systems. There, it was assumed that before a query was entered, the user’s information need was extensively and competently translated into the system’s language. Thus, optimal search results could be generated. The issue of ranking, therefore, did not play a particularly important role since it was assumed that the formulation of the query had already narrowed down the set of results to such an extent that it was possible to review them manually.
The statement by Machill et al. still holds today. It can be seen that little has changed on the user side since their study (which was conducted more than 15 years ago) (see, e.g., Stark et al., 2014), but much has changed on the side of the search engines: to achieve a better match between the information needs of the users and the documents indexed, search engines use a user’s contextual information (e.g., their current location or the queries they have entered in the past; see Chap. 5). This information is taken into account in the ranking of search results, as is the collected behavioral data (both queries and clicking behavior) of all users of the search engine. This has led to search engines being able to rank results much more “accurately” today, but at the cost that we, as users, have less influence on which results are displayed. However, this does not rule out the possibility of achieving better results by formulating our queries competently (see Chap. 12); it simply means that a targeted promotion of users’ information literacy is urgently needed.
4.5.4 Query Length
A simple way to compare the complexity of queries is to measure their length. While a long query need not be complex in every case, it is reasonable to assume that the longer the query, the more specific it tends to be. For example, in some cases, simple queries consisting of only one word may be based on a complex information need. Still, they show only a low degree of specificity and – unless they are about a concrete information need – often leave room for multiple interpretations. In contrast, it can be assumed that adding further words leads to a specification of the query. Unfortunately, large-scale studies describing search behavior at the query level have been few and far between in recent years. For one thing, to conduct such descriptive studies, one needs data from search engine providers. Also, it may seem unattractive for researchers to conduct replication studies, which are most likely to confirm existing studies. For these reasons, in recent years, studies are more likely to describe user behavior at the search query level in passing but, first and foremost, pursue a different line of inquiry. Therefore, in the following, we will deal with the “classic” studies on user behavior in Web search and add more recent data where these are available and where they complement the basic data in a meaningful way. To begin with, it should be noted that queries are, on average, very short. For example, studies that examined the query behavior of US users cite average values between 1.7 and 2.9 terms per query; the values for German search queries are between 1.6 and 1.8 terms (Höchstötter & Koch, 2009). Furthermore, a recent industry study in Germany indicates that 18.7% of queries consist of only one term, 49.0% of two terms, and 21.0% of three terms. Of particular interest here is the distribution of query length, which is, of course, also influenced by the autocomplete suggestions. 
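The two measures used in this section, the average number of terms per query and the share of queries by length, can be computed from any query log. The following is only a minimal sketch: the sample queries are invented for illustration, and terms are counted by splitting on whitespace, which is a simplifying assumption rather than the method of the studies cited.

```python
from collections import Counter


def query_length_stats(queries):
    """Return (mean terms per query, {length: share of queries in percent})."""
    # Count whitespace-separated terms. This is a simplification: a German
    # compound such as "Granularsynthese" counts as a single term.
    lengths = [len(q.split()) for q in queries if q.strip()]
    if not lengths:
        return 0.0, {}
    mean_length = sum(lengths) / len(lengths)
    counts = Counter(lengths)
    shares = {n: 100 * c / len(lengths) for n, c in sorted(counts.items())}
    return mean_length, shares


# Hypothetical sample log, invented for illustration only.
sample_queries = [
    "facebook",
    "ebay",
    "granular synthesis",
    "Granularsynthese",
    "weather hamburg tomorrow",
    "information retrieval",
]

mean_length, shares = query_length_stats(sample_queries)
print(f"mean terms per query: {mean_length:.2f}")
for n, share in shares.items():
    print(f"{n} term(s): {share:.1f}%")
```

Note that the English and the German variant of the same concept ("granular synthesis" and "Granularsynthese") are counted as two terms and one term, respectively.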
The difference between the length of English-language and German-language queries can be explained primarily by the use of compound terms in German: While in English multi-word terms are separated by spaces (“granular synthesis”), in German, on the one hand, these words are written together (“Granularsynthese”), and on the other, it is possible to form words of almost any length. However, it is not really important what the exact average query length is. More interesting is the consistent result that queries submitted to search engines are, on average, very short and that no development toward longer queries formulated directly by the users can be seen. Although the automatically generated autocomplete suggestions (see Sect. 4.5.2) have undoubtedly led to longer search queries (it is not just individual words that are suggested during input, but longer
Fig. 4.5 Typical distribution of queries by length in words (after Spink et al., 2001, p. 230). Horizontal axis: words per query (ticks at 1, 10, 100, 1000); vertical axis: share of queries (percent, 0 to 35)
queries), this primarily underlines the increasing guiding of users by the search engine, but not an “improved” query behavior (Lewandowski et al., 2014). The use of voice-based search (see Sect. 4.5.1) also leads to longer queries, as coherent sentences or at least expressions replace the string of individual search words familiar from text-based search. However, concerning the unambiguousness of queries, this is to be welcomed and makes it easier for search engines to understand the information needs of their users. However, it is not only the average length of queries that is crucial but also how they are distributed. Figure 4.5 shows an example of the distribution of queries by the number of words used. Here, it can be seen that about half of the queries consist of only a single word (see also Höchstötter & Koch, 2009, p. 55). This also indicates that search engines must demonstrate a high level of interpretation to satisfactorily answer their users’ queries. Longer queries arise, for instance, when users enter entire sentences. On the one hand, these can be quotations to recall specific texts. On the other hand, they can be question sentences. Schmidt-Mänz (2007, p. 141f.) examined German queries for interrogative sentences and found that only between 0.1% and 0.2% of queries are formulated as interrogative sentences. More recent industry studies based on data from the United States estimate a value of around 8% (Fishkin, 2017). It can be assumed that these much higher values result from the fact that questions can be asked without much effort through voice input and that search engines are now much better able to answer questions in a meaningful way (see Sect. 7.3.5).
4.5.5 Distribution of Queries by Frequency
We have seen that queries are posed in various ways and have a low average length. But what about the distribution of the queries, i.e., how individual are they? One might assume that, given the myriad of possible formulations, different queries occur with varying frequencies and that there is a relatively even frequency distribution of queries. However, all empirical studies on this topic show that the distribution of search query frequencies is highly skewed. This means that few queries are submitted very frequently and very many are submitted only rarely. Such distributions are also called informetric (Stock, 2007, p. 77). One also finds such distributions in many other contexts, for example, in the distribution of income in a population, but in information contexts, the distributions are usually much more extreme. The terms for informetric distributions vary; depending on the context, they are referred to as power laws (see Huberman, 2001) or the long tail (Anderson, 2006). In our context, informetric distribution also plays a role in the distribution of words within documents (see Sect. 5.2) and links to documents (see Sect. 5.3.1). Figure 4.6 shows the typical distribution of queries in a search engine. Few queries are entered very frequently; many queries are entered only rarely. The curve, therefore, slopes steeply at the beginning and runs flat and long at the end. The end of the distribution can no longer be shown in the figure because there are too many queries with only a small search volume. Such a distribution of queries has been demonstrated in numerous studies using data sets from various search engines (e.g., Jansen & Spink, 2006; Schmidt-Mänz, 2007; Lewandowski, 2015). 
We will use an actual distribution of queries that were used as the basis for a study (Lewandowski, 2015) as an example: A total of 30.46 million search queries were sorted according to query frequency, and then the number of different queries that had to be entered to obtain 10% of the search volume in each case was measured.
Fig. 4.6 Distribution of search query frequencies (exemplary)
Table 4.4 Queries by segment of 10% of queries (modified from Lewandowski, 2015)
Segment   Cumulated number of queries   Number of different queries in the segment
1         3,049,764                     16
2         6,099,528                     134
3         9,149,293                     687
4         12,199,057                    3,028
5         15,248,821                    10,989
6         18,298,585                    33,197
7         21,348,349                    85,311
8         24,398,114                    189,544
9         27,447,878                    365,608
10        30,497,642                    643,393
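The segment analysis described above (sorting queries by frequency and counting how many different queries make up each equal share of the search volume) can be sketched in a few lines. The function name, the sample counts, and the rule for queries that straddle a segment boundary are assumptions made for this illustration; they are not taken from the study.

```python
def distinct_queries_per_segment(query_counts, segments=10):
    """Count the distinct queries that fall into each equal share of the volume.

    query_counts maps each distinct query to the number of times it was
    submitted. The result lists, per segment, the number of distinct queries.
    """
    total = sum(query_counts.values())
    # Most frequent queries first.
    ranked = sorted(query_counts.values(), reverse=True)
    result = [0] * segments
    cumulated = 0
    for count in ranked:
        # Assumption: a query belongs to the segment in which its share of
        # the volume starts, even if it extends past the segment boundary.
        segment = min(cumulated * segments // total, segments - 1)
        result[segment] += 1
        cumulated += count
    return result


# Hypothetical query log with a highly skewed distribution (100 submissions).
sample_counts = {
    "facebook": 50, "ebay": 30, "query a": 5, "query b": 5, "query c": 4,
    "query d": 3, "query e": 1, "query f": 1, "query g": 1,
}

print(distinct_queries_per_segment(sample_counts, segments=2))
```

With two segments, the single most frequent query in this invented sample covers the first half of the volume, while the remaining eight queries share the second half. This mirrors, on a very small scale, the skew visible in Table 4.4.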
The result was that only 16 queries already accounted for 10% of the search volume. These 16 queries were all navigational queries (e.g., facebook and ebay). If, on the other hand, we look at the last segment, i.e., the infrequent queries, which again account for 10% of the search volume, we have 643,393 queries. The complete data, divided into ten segments, each with 10% of the search volume, are shown in Table 4.4. It is easy to see from this example that the search query volume is distributed very unevenly among the queries (for the consequences for content producers, see Chap. 9) and that if one appears at the top of the result list for a popular query, one can expect an enormous number of visitors to one’s website.

How Often Are Particular Queries Made?

At the end of the year, many search engine providers present lists of the most popular queries of that year. For example, Google’s list for 2019 can be found at https://trends.google.com/trends/yis/2019/GLOBAL/. However, these lists only offer a selection of queries, often sorted by topic. What is missing are the “usual” navigational queries, for which there is hardly any change from year to year. These queries may be too boring for the “annual highlights,” but they still make up a huge part of the total number of queries. If one wants a more accurate picture of search trends, a tool like Google Trends (https://trends.google.com/) is much more appropriate. Here, one can enter one or more queries and get the trend graphs as shown in Figs. 4.7–4.10. However, it is not possible to get exact figures.
4.5.6 Query Trends
If we look at queries over time, we can distinguish different query types based on their trending (Schmidt-Mänz & Koch, 2006): evergreens, impulses, and events.
Fig. 4.7 Recurring event search terms: “christmas” and “easter” (graph from Google Trends; August 29, 2022)
Fig. 4.8 Impulse keyword: “fukushima” (graph from Google Trends; December 3, 2020)
Fig. 4.9 Evergreen keyword: “sex” (graph from Google Trends)
Fig. 4.10 Keyword with two peaks in the search history per year: “basteln” (graph from Google Trends; January 29, 2021)
Recurring events are searched for regularly, but preferably at specific times. Figure 4.7 shows the search trends for the two queries, christmas and easter. It can be seen that the search volume of these queries changes at different times of the year. The peak of the trend curve coincides with the event itself. It should be noted that a higher swing for a search term does not necessarily mean that it is searched for more frequently and that the analyses in Google Trends are, in fact, only trends, not actual search volumes (Alby, 2020). Figure 4.8, on the other hand, shows a triggering event (impulse), in this case, the query fukushima. Before the nuclear disaster in Fukushima in 2011, there were
only a few searches for this word; with the reporting in the press, interest among searchers rose sharply and then dropped off again relatively quickly. Even after this decline, the keyword is still entered regularly, but it no longer shows a trend toward becoming more or less popular. Figure 4.9 shows the query sex as an example of an evergreen. This search term is searched for constantly. These trends in search behavior show not only that the interests of searchers change but also, firstly, that search engine providers, with the mass of queries they receive, have an enormous knowledge of the wishes and intentions of searchers (John Battelle refers to this as the “database of intentions”; Battelle, 2005, p. 1ff.) and, secondly, that search engines must react in their ranking to the current interest of users. Figure 4.10 shows the search history of the query basteln (“tinkering” or “crafting”). There are two peaks (of different sizes) every year: one before Easter and one before Christmas. A user entering this query will rightly expect fundamentally different results before each of the two festive seasons.
4.5.7 Using Operators and Commands for Specific Searches
We have already characterized average queries sent to search engines as relatively simple, using the length of the queries as our main clue. However, queries can also be qualified (i.e., “made better”) by using specific commands that give a search engine precise instructions for how to process them. So, it is not just a matter of finding the right words (keywords) in the right combination but also knowing how to formulate precise queries. We will consider this topic from the side of Internet-based research in Chap. 12; in this section, we will first look at the actual use of such functionality. Queries can be qualified using two types of modifiers: operators and commands. Operators are used to combine keywords; they can be used to guide the size of the result set. The “classic” operators are the Boolean operators AND, OR, and NOT (detailed in Sect. 11.3). They can be used to search for documents in which specific keywords occur together, in which at least one of several keywords occurs, and/or in which specific keywords do not occur. On the other hand, commands qualify a query by restricting the result set to a particular type of results. For example, the command filetype: serves to search only for documents in a specific format, such as PDF. The result set will thus be reduced compared to a search not restricted to any particular file format. Again, the log files of search engines are the most reliable source for determining the use of operators and commands (see Höchstötter & Koch, 2009). Although surveys also frequently ask respondents which operators they know and use (e.g., Machill et al., 2003, p. 167; Stark et al., 2014), respondents tend to put themselves in a better light and overstate their use of operators correspondingly – in many cases probably also unconsciously. For example, in the study by Stark et al. (2014; p. 56), the users state the use of operators as the second most frequent strategy after
browsing the first search result page. Apart from the fact that nothing is said about the frequency of this use, the statements appear unrealistic compared to the known data from log files. Unfortunately, there are no recent studies on the actual use of operators and commands. However, a clear picture emerges from older studies: the proportion of queries made with operators is in the low single-digit percentage range (Spink & Jansen, 2004, p. 85; Höchstötter & Koch, 2009). It is also significant that one of the early studies (Spink et al., 2001, p. 229f.) found that a large proportion of queries with operators contained errors. Phrase searching (i.e., searching for words in a given order, marked by inverted commas) is used only in less than 3% of queries (Höchstötter & Koch, 2009, p. 57). It cannot be assumed that search behavior in this respect has changed to more elaborate queries in the meantime. The data show that operators are of little importance to ordinary searchers. This is also understandable when one considers the types of queries made and the services that search engines provide in interpreting the queries. In the case of navigational and transactional queries, it can be assumed that operators and commands are only necessary in rare cases; they mainly help with informational queries. And even then, there are only some instances in which it is worthwhile to use them. So is it bad that operators and commands are rarely used? Not as long as one knows when it makes sense to use them. Even expert searchers do not use operators in many queries because they also know that the search engines are set up for simple search behavior and that, in many cases, good results can be achieved without much effort.
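How the Boolean operators and a command such as filetype: shape the result set can be illustrated with set operations on a small inverted index. This is a sketch under stated assumptions: the four documents are invented, the index is a plain dictionary, and the filetype: restriction is modeled as a simple filter on a format field. It does not describe how any particular search engine processes queries.

```python
# Hypothetical mini-collection: document id -> (text, file format).
documents = {
    1: ("granular synthesis tutorial", "html"),
    2: ("granular synthesis paper", "pdf"),
    3: ("subtractive synthesis tutorial", "html"),
    4: ("search engine ranking paper", "pdf"),
}


def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, (text, _fmt) in docs.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index


index = build_index(documents)

# AND: both keywords must occur (set intersection).
both = index["granular"] & index["synthesis"]
# OR: at least one of the keywords must occur (set union).
either = index["tutorial"] | index["paper"]
# NOT: documents containing the second keyword are excluded (set difference).
without = index["synthesis"] - index["granular"]
# filetype:-style restriction, modeled as a filter on the format field.
pdf_only = {doc_id for doc_id in both if documents[doc_id][1] == "pdf"}

print(sorted(both), sorted(either), sorted(without), sorted(pdf_only))
```

In this sketch, AND and NOT narrow the result set, OR widens it, and the format filter reduces the AND result further, which matches the roles of operators and commands described above.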
4.6 Search Topics
In the last sections, we considered the formal properties of queries. Query intents, sessions, and the wording of queries all say something about how searches are conducted, but not what is searched for. Few studies look at what specific topics searchers are looking for. A simple reason for this may be that people search for all topics and that Web searches, in principle, have no thematic limits. It has been shown, for example, that if queries are assigned to the top categories from a Web directory, every one of these categories is occupied to an appreciable extent, i.e., searches are conducted for all the topics listed there (Lewandowski, 2006; Spink et al., 2001; see Fig. 4.11). Otherwise, mainly the search behavior in Internet-based research within specific subject areas is examined. Differences can be found here, especially concerning the search behavior of laypersons and expert searchers in the respective subject area (the classic study on this is by Hölscher, 2002). However, a detailed presentation of search behavior in individual fields goes beyond the aim of this book; a comprehensive account can be found in Case and Given (2016).
Fig. 4.11 Topic distribution in queries (data from Lewandowski, 2006). Bar chart of the share of queries (percent) for 11 topic categories: 1 = Commerce, travel, employment or economy; 2 = People, places or things; 3 = Computers or Internet; 4 = Sex or pornography; 5 = Health or sciences; 6 = Entertainment or recreation; 7 = Education or humanities; 8 = Society, culture, ethnicity or religion; 9 = Government; 10 = Performing or fine arts; 11 = Unknown or other. Bar labels, from highest to lowest: 29.00%, 20.10%, 12.80%, 7.70%, 7.40%, 7.30%, 4.50%, 4.00%, 3.40%, 2.10%, 1.20%
4.7 Summary
Queries can be made in different ways: Alongside the standard text input, spoken language input is particularly worthy of mention. The latter plays a role especially on mobile devices and in other contexts where typing in queries is not convenient. Users, in general, hardly make use of the search engine’s search options; instead, most search behavior is characterized by simple queries. Search engines have reacted to this behavior by offering technical support in the search process, for example, through autocomplete suggestions, spelling corrections and suggestions, and, last but not least, through the interpretation and contextualization of the queries entered, as described in Chap. 3. The queries sent to search engines can be classified according to their intent into navigational, informational, and transactional queries. Different goals are pursued depending on the query intent, and the desired results must be arranged differently in each case. The length or complexity of the queries also differs depending on the query intent. Individual search queries are part of sessions, i.e., sequences of search queries and document views performed by a specific user within a specific time period and on a particular topic. Sessions can vary in length, ranging from a single query without viewing a document to a combination of multiple queries and document views. Individual queries form the minimum level in the search process. An analysis of the queries shows, first of all, that search engines are used to search for all kinds of topics.
On average, the queries are characterized by a not very complex formulation (stringing together a few keywords); this is also expressed in their average length. Queries are rarely qualified with operators and commands, although this could significantly increase the quality of the search results in many cases. However, many users are unaware of these possibilities, and they are rarely used. Queries are distributed informetrically in terms of their frequency, i.e., there are a few extremely popular queries and a vast amount of rarely made queries. This impacts both the influence that search engines have through their presentation of results and content producers’ search engine optimization. The frequency with which specific queries are made also depends on temporal factors. For example, some queries reach a relatively constant search volume (so-called evergreens), but many queries are only requested in response to an event or seasonally.

Further Reading

A comprehensive account of the state of research on information (seeking) behavior can be found in Case and Given (2016); a good textbook on the topic is Ford’s (2015). It explains the main theories and models of information behavior. Thurow and Musica (2009), among others, provide a good and detailed account of user behavior in search engines, which refers to Broder’s three query intents. Ryen White (2016) presents a comprehensive review of the literature on search behavior in information retrieval systems, focusing on more complex interactions that have been largely omitted in this chapter. A good and sometimes quite entertaining insight into what search query analysis can do and its commercial application can be found in Tancer (2008); Stephens-Davidowitz (2017) gives an entertaining description of what can be gleaned about people’s desires and intentions from the collected queries. Jansen et al. (2008) provide a sound basis for anyone who wants to study search behavior based on transaction log analysis. Finally, Rosenfeld (2011) offers a practice-oriented guide for using search query analysis on one’s own website.
References
Alby, T. (2020). 5 Gründe, warum Du Google Trends falsch verstehst. https://tom.alby.de/5gruende-warum-du-google-trends-falsch-verstehst/
Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.
Battelle, J. (2005). The search: How Google and its rivals rewrote the rules of business and transformed our culture. Portfolio.
Broder, A. (2002). A taxonomy of web search. ACM SIGIR Forum, 36(2), 3–10. https://doi.org/10.1145/792550.792552
Burghardt, M., Elsweiler, D., Meier, F., & Wolff, C. (2013). Modelle des Informationsverhaltens bei der Websuche. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 3: Suchmaschinen zwischen Technik und Gesellschaft (pp. 111–141). Akademische Verlagsgesellschaft AKA.
Calderon-Benavides, L., Gonzalez-Caro, C., & Baeza-Yates, R. (2010). Towards a deeper understanding of the user’s query intent. In SIGIR 2010 Workshop on Query Representation and Understanding (pp. 21–24). ACM.
Case, D. O., & Given, L. M. (2016). Looking for information: A survey of research on information seeking, needs, and behavior. Bingley.
Fisher, K. E., Erdelez, S., & McKechnie, L. (2005). Theories of information behavior. Information Today.
Fishkin, R. (2017). The state of searcher behavior revealed through 23 remarkable statistics. https://moz.com/blog/state-of-searcher-behavior-revealed
Ford, N. (2015). Introduction to information behavior. Facet Publishing. https://doi.org/10.29085/9781783301843
Frants, V. I., Shapiro, J., & Voiskunskii, V. G. (1997). Automated information retrieval: Theory and methods. Academic Press. https://doi.org/10.1108/s1876-0562(1997)97a
Google. (2015). Your guide to winning the shift to mobile. Google. https://www.thinkwithgoogle.com/marketing-strategies/micro-moments/micromoments-guide//micromoments-guide-pdfdownload/
Höchstötter, N., & Koch, M. (2009). Standard parameters for searching behavior in search engines and their empirical evaluation. Journal of Information Science, 35(1), 45–65. https://doi.org/10.1177/0165551508091311
Hölscher, C. (2002). Die Rolle des Wissens im Internet: Gezielt suchen und kompetent auswählen. Klett-Cotta.
Huberman, B. A. (2001). The laws of the web – Patterns in the ecology of information. MIT Press. https://doi.org/10.7551/mitpress/4150.001.0001
Jansen, B. J., & Spink, A. (2006). How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing & Management, 42(1), 248–263. https://doi.org/10.1016/j.ipm.2004.10.007
Jansen, B. J., Spink, A., & Taksa, I. (Eds.). (2008). Handbook of research on web log analysis. Information Science Reference. https://doi.org/10.4018/978-1-59904-974-8
Kang, I. H., & Kim, G. C. (2003). Query type classification for web document retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 64–71). ACM. https://doi.org/10.1145/860435.860449
Lewandowski, D. (2006). Query types and search topics of German Web search engine users. Information Services & Use, 26, 261–269. https://doi.org/10.3233/isu-2006-26401
Lewandowski, D. (2012). Informationskompetenz und das Potenzial der Internetsuchmaschinen. In W. Sühl-Strohmenger (Ed.), Handbuch Informationskompetenz (pp. 101–109). De Gruyter. https://doi.org/10.1515/9783110255188.101
Lewandowski, D. (2015). Evaluating the retrieval effectiveness of Web search engines using a representative query sample. Journal of the Association for Information Science & Technology, 66(9), 1763–1775. https://doi.org/10.1002/asi.23304
Lewandowski, D. (2014). Wie lässt sich die Zufriedenheit der Suchmaschinennutzer mit ihren Suchergebnissen erklären? In H. Krah & R. Müller-Terpitz (Eds.), Suchmaschinen (pp. 35–52). LIT.
Lewandowski, D., Drechsler, J., & Von Mach, S. (2012). Deriving query intents from web search engine queries. Journal of the American Society for Information Science and Technology, 63(9), 1773–1788. https://doi.org/10.1002/asi.22706
Lewandowski, D., & Quirmbach, S. (2013). Suchvorschläge während der Eingabe. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 3: Suchmaschinen zwischen Technik und Gesellschaft (pp. 273–298). Akademische Verlagsgesellschaft AKA.
Lewandowski, D., Kerkmann, F., & Sünkler, S. (2014). Wie Nutzer im Suchprozess gelenkt werden: Zwischen technischer Unterstützung und interessengeleiteter Darstellung. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche (pp. 75–97). De Gruyter. https://doi.org/10.1515/9783110338218.75
Machill, M., Neuberger, C., Schweiger, W., & Wirth, W. (2003). Wegweiser im Netz: Qualität und Nutzung von Suchmaschinen. In M. Machill & C. Welp (Eds.), Wegweiser im Netz (pp. 13–490). Bertelsmann Stiftung.
Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In S. I. Feldman, M. Uretsky, M. Najork, & C. E. Wills (Eds.), Proceedings of the 13th international conference on World Wide Web (pp. 13–19). ACM. https://doi.org/10.1145/988672.988675
Rosenfeld, L. (2011). Search analytics for your site: Conversations with your customers. Rosenfeld Media.
Schmidt-Mänz, N. (2007). Untersuchung des Suchverhaltens im Web: Interaktion von Internetnutzern mit Suchmaschinen. Verlag Dr. Kovac.
Schmidt-Maenz, N., & Koch, M. (2006). A general classification of (search) queries and terms. In Proceedings of the Third International Conference on Information Technology: New Generations (ITNG’06) (pp. 375–381).
Stephens-Davidowitz, S. (2017). Everybody lies: What the internet can tell us about who we really are. Bloomsbury Publishing.
Spink, A., & Jansen, B. J. (2004). Web search: Public searching on the Web. Kluwer Academic Publishers.
Spink, A., Wolfram, D., Jansen, B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234. https://doi.org/10.1002/1097-4571(2000)9999:99993.0.CO;2-R
Stark, B., Magin, M., & Jürgens, P. (2014). Navigieren im Netz – Befunde einer qualitativen und quantitativen Nutzerbefragung. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche – Suchmaschinen im Spannungsfeld zwischen Nutzung und Regulierung (pp. 20–74). De Gruyter. https://doi.org/10.1515/9783110338218.20
Statista. (2021). Mobile share of organic search engine visits in the United States from 4th quarter 2013 to 4th quarter 2019, by platform. https://www.statista.com/statistics/275814/mobileshare-of-organic-search-engine-visits/
Sterling, G. (2016). Report: Nearly 60 percent of searches now from mobile devices. https://searchengineland.com/report-nearly-60-percent-searches-now-mobile-devices-255025
Stock, W. G. (2007). Information Retrieval: Informationen suchen und finden. Oldenbourg.
Sullivan, D. (2013). Google’s impressive “conversational search” goes live on chrome. Search engine land. https://searchengineland.com/googles-impressive-conversational-search-goeslive-on-chrome-160445
Tancer, B. (2008). Click – What millions of people are doing online and why it matters. Hyperion.
Thurow, S., & Musica, N. (2009). When search meets web usability. New Riders.
White, R. (2016). Interactions with search systems. Cambridge University Press. https://doi.org/10.1017/CBO9781139525305
White, R. W., & Roth, R. A. (2009). Exploratory search: Beyond the query-response paradigm. Synthesis Lectures on Information Concepts Retrieval and Services, 1(1), 1–98. https://doi.org/10.2200/s00174ed1v01y200901icr003
5 Ranking Search Results
Ranking is about bringing the search results found into a meaningful order so that the most relevant results are displayed first and less relevant results further down toward the bottom. The results are sorted in descending order of relevance, i.e., the higher a result is in the list, the more relevant it is. Today’s search engines are still oriented toward a result list presentation, but they enrich it with many results from other collections (for more details, see Chap. 6). This makes ranking more complex, as there is now no longer just one list that has to be ranked, but a multitude of results from different sources that are compiled on the search engine result page. There is a long-standing discussion around the definition of relevance (see Mizzaro, 1997; Saracevic, 2016), which we will not delve into here. However, we will see in the following that there cannot be an objective determination of the relevance of a result, at least for informational queries. What is relevant for one user may be completely irrelevant for someone else. For example, consider a first-year student searching for information retrieval. They probably first want to see basic information that explains the term and introduces them to the subject. If, on the other hand, their professor enters the same query, they probably want to see completely different results, for example, current research articles on the topic. On the other hand, basic information that is highly relevant to the student is not of interest to them since they are already familiar with this content. This already makes it clear that there is no such thing as the correct ranking, but that ranking is always only one of many possible algorithmic views of the content of the Web. Even though computers carry out the ranking, one must realize that the basic assumptions on which search engine rankings are based are human assumptions. 
For example, it may be intuitive to rank documents in a way that gives preference to what other users have already approved of (e.g., by clicking on it). However, one could also create a ranking under entirely different assumptions without necessarily arriving at a worse result. Many of the factors we will discuss later are used precisely because they have proven effective in practice and not because they are superior on a theoretical level.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_5
Thus, the ordering of search engine results is fundamentally based on assumptions. It is a matter of assessing the relevance of the documents found by a machine. We will see that this can lead to a situation where what is relevant to the most users (popularity) is displayed first. On the other hand, personalization ensures that not every user gets the same results for a query but that these results are individually tailored. The exact procedures by which the results are arranged on the search engine result page are the great secret of the search engine providers and what makes any given search engine unique: every search engine is individual in this respect, and no two search engines deliver the same results. We also speak of search engines as “black boxes,” meaning that we use search engines but cannot see what happens inside the search engine when we search. There are an incalculable number of factors that ultimately determine the ranking. Search engine providers have provided numbers in the past; for example, as recently as 2017, Google claimed to use more than 200 factors for ranking, and Microsoft even cited 1000 factors that go into the ranking of its Bing search engine (Nadella, 2010). However, Sullivan (2010) quickly pointed out the insignificance of such counts, as it is ultimately arbitrary what is counted as a factor (see box). Ultimately, we cannot track the exact ranking of a particular search engine, i.e., we cannot predict how a result list will be ordered before we see the results. What we can do, however, is learn to understand the rankings in principle so that when we see the result list for a query, we can understand why these results are displayed in the given order.

What Should Be Counted as a Ranking Factor?

In this chapter, mainly the groups of ranking factors, i.e., the “big picture,” are dealt with. However, we will also explain individual factors by example, without aiming for completeness. What constitutes a countable ranking factor is a matter of definition. A search engine could, for example, rank a document in which the keyword is shown in bold higher than a document in which this is not the case. The same could be done with another form of markup (like italics or underlining). Now, we could consider each type of highlighting as a ranking factor or highlighting in general as a single factor. In one case, we would get three ranking factors, and in the other, only one. A distinction is often made between ranking factors and signals, in which signals form the smallest unit, while groups of related signals are referred to as factors. Thus, counting ranking factors makes little sense when comparing different search engines (see also Sullivan, 2010). We can assume that all search engines worth mentioning use many factors. However, a good ranking does not result from using as many factors as possible but from their skillful combination and weighting.
5.1 Groups of Ranking Factors
Even though search engine ranking procedures are not disclosed and, if each factor (or even each signal) is counted, involve an interplay of hundreds of criteria, six areas have emerged that are decisive for the ranking of results:
1. Text-specific factors are used to determine which documents contain the query words and should therefore be included in the result set. The occurrence of the search terms can also refer to variants of the search terms occurring in the document or to words from documents that refer to the target document (so-called anchor texts; see also Sect. 3.4.2). In addition to the occurrence of the keyword, an occurrence in a prominent position (e.g., heading, beginning of the document, highlighting) is weighted higher by applying text statistics.
2. The second crucial area in the ranking is the popularity of documents, which is measured primarily by the links between the documents (link-based methods) and by users’ click behavior. Popularity measurement is a key area of how search engines measure quality; Google’s famous PageRank algorithm (see section “PageRank”) also measures the popularity of documents.
3. The third ranking category is freshness. Depending on the purpose of the query, it may make sense to display either particularly up-to-date or static but popular documents. Since link-based algorithms tend to prefer older documents, freshness is also used as a balancing factor. In practice, one will usually find mixed result lists in which, if available, a few particularly current documents are interspersed.
4. Locality refers to the user’s location. As a simple example, documents from Germany are preferred if the user’s location is identified there. Usually, however, a far more precise identification of the user’s location is undertaken; this is especially the case concerning searches on mobile devices.
5. Personalization is defined as “the adjustment of information, products, offerings, or (parts of) web pages to the needs, preferences, and capabilities of an individual user” (Riemer & Brüggemann, 2006). Personalized ranking is about giving an individual user results tailored to them. This is done primarily by using the queries that this particular user has entered in the past.
6. By technical ranking factors, we refer to the essential technical characteristics of websites or servers used for the ranking. For example, the speed at which a page is loaded from the server can play a role. It is assumed that users who have to wait longer for a document are more likely to be dissatisfied and abandon their search. Furthermore, studies by the two major search engine providers have shown that the loading speed of search engine result pages considerably influences the number of subsequent clicks (Schurman & Brutlag, 2009). Therefore, documents that load quickly should also be favored in the ranking.
In the following sections, we will look at each area in detail and explain them using examples.
5.2 Text Statistics
Text statistics are used to compare queries and documents. We intuitively assume that the keyword we entered will also appear in the documents returned by the search engine. This is indeed the case in the vast majority of cases. However, there are exceptions, for example, if synonyms were detected in the preprocessing of the query or the spelling of the query was corrected (see Sects. 3.5 and 4.5.3). Text-statistical methods are used to analyze and evaluate documents based on their text. “Good” documents are, for example, those that contain the keyword entered frequently and in a prominent position (e.g., in the headline). Other ranking factors then build on text statistics and serve primarily to evaluate quality. For a query to be compared to the mass of documents, the documents must be prepared using inverted indexes (see Sect. 3.4). The “lookup” in the indexes enables a quick comparison between queries and documents without having to access the documents directly. For text statistics procedures, the documents are captured in different indexes; to be able to use text statistics procedures meaningfully, the documents must already be indexed in depth in the indexing phase. Text-statistical methods are the “classics” in information retrieval and were developed long before search engines. They work very well with databases that, on the one hand, contain a sufficient amount of text per document and, on the other hand, where there is quality control already at the point when the documents are added to the database. An example is the Frankfurter Allgemeine Zeitung’s press database (https://fazarchiv.faz.net): Each article contains a sufficient amount of text for statistical analysis, and only articles that have appeared in the newspaper are included in the database, i.e., quality control was carried out for each article in advance. In the case of search engines, matters are different: although spam is excluded from the index (see Sect. 
3.3), this is not comparable to actual quality control of the content. Moreover, anyone can post a document on the Web, even with the idea of manipulating search engine rankings (see Chap. 9). Finally, as we will see, text-statistical analyses are fairly easy to skew and, therefore, can only be used in search engines combined with other quality-determining procedures.
5.2.1 Identifying Potentially Relevant Documents
The first purpose of using text statistics is to find documents from the index potentially relevant to the user. Therefore, the first step is to identify documents that match the query. This narrows down the set of results to be ranked from the basis of all documents in the search engine’s index to a selection of those documents that are potentially relevant to the query. The problem is that the query usually consists of very little text, often only one to a few words (see Sect. 4.5.4). One remedy is to enrich the query (see Sect. 3.5). However, we will first describe the factors that can be considered for matching the query with the documents in the search engine’s index.
Fig. 5.1 Significance of text statistics for ranking documents (diagram labels: Query, Text statistics, Popularity, Freshness, Locality, Personalisation, Technical ranking factors, Ranking)
Fig. 5.2 Document that only contains a synonym of the search word (Google, query vacation; August 29, 2022)
The first – perhaps seemingly trivial – assumption in text statistic ranking is that the keyword entered occurs in the document. One assumes that the searcher wants to find the word entered in the document. With this first step, one excludes all other documents, i.e., a first narrowing down of the search result is carried out. While the entire index (with billions of documents) must first be searched, this establishes a base set, to which the further ranking factors are applied. All further operations are only carried out on this smaller set. In this respect, the textual matching of the query with the documents is, first of all, a restriction of the number of potential documents. In subsequent steps, it is “only” a matter of ranking this already determined set in a way that “the best” documents are at the top. In this process, on the one hand, the text-statistical factors come into play, and on the other hand, the other, mainly quality-determining factors. Text statistics thus form the basis on which the other factors are added (see Fig. 5.1). Now, in which cases may a search engine return documents that do not contain the keywords entered or that do not have all of the keywords entered? Here, we are mainly concerned with synonyms or quasi-synonyms. Two words are considered synonyms if they express the same meaning. For example, vacation and holiday mean the same thing, even if a different word is used in each case. For a user who enters vacation as a query, documents in which only the word holiday is used are also likely to be relevant. It is, therefore, perhaps not so important to the searcher that their keyword occurs in precisely the form they have entered. Figure 5.2 shows an example of a snippet from a Google result list where the document does not contain the word “vacation,” which the user searched for. Instead, the document contains the word “holiday,” which is considered a synonym.
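The narrowing step described above can be illustrated with a minimal sketch. It assumes a toy collection of three documents and a hand-written synonym table; the names `DOCS`, `SYNONYMS`, `build_inverted_index`, and `base_set` are hypothetical and do not describe how any particular search engine is implemented.

```python
from collections import defaultdict

# Toy document collection (hypothetical contents, for illustration only).
DOCS = {
    1: "cheap vacation deals for families",
    2: "plan your holiday in spain",
    3: "ranking search results explained",
}

# Hypothetical synonym table; how real engines obtain such relations
# is not covered here.
SYNONYMS = {"vacation": {"holiday"}, "holiday": {"vacation"}}


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def base_set(query, index, synonyms=None):
    """Return ids of documents matching every query word (or a synonym)."""
    result = None
    for word in query.lower().split():
        variants = {word} | (synonyms or {}).get(word, set())
        matches = set()
        for variant in variants:
            matches |= index.get(variant, set())
        # Every query word must be matched, so intersect the match sets.
        result = matches if result is None else result & matches
    return result or set()


index = build_inverted_index(DOCS)
print(base_set("vacation", index))            # exact matching only
print(base_set("vacation", index, SYNONYMS))  # with synonym expansion
```

With exact matching, only the document containing vacation enters the base set; with the synonym table, the holiday document is included as well. All further ranking factors would then be applied to this smaller set only.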
5.2.2 Calculating Frequencies
Once the potentially relevant documents have been determined by comparing the query and the index, the first step is to sort the documents according to their textual correspondence to the search query using text statistics. For this purpose, words in the documents are counted and statistically weighted. This section presents some of the most important text statistics factors as examples, albeit without any claim of exhaustiveness. First, one could assume that a document in which the keyword occurs particularly often is more relevant than one in which the keyword occurs fewer times. The document that contains the most instances of the keyword would then appear first in the list. However, it is easy to see that such a simple procedure does not produce good results: it does not take into account the length of the documents (long documents are more likely to contain a word multiple times), nor does it take into account the fact that on the Web documents do not have to go through quality control before they are published. Thus, authors could simply repeat a certain word over and over again in their texts to be listed in the top position for this particular keyword. However, such manipulations are easy to detect and no longer stand a chance with current search engines. On the one hand, a solution is to set a maximum keyword density, i.e., if a word is used too often in a document, the text is identified as “unnatural” and downgraded in the ranking. It should be noted, however, that different words occur with different frequencies within a language: Articles such as a and the and connecting words such as and and or are used very frequently, while other words occur very rarely. This is again based on the informetric distribution already described in Sect. 4.5.5. Figure 5.3 shows how word frequencies are typically distributed. As early as 1958, Hans Peter Luhn observed that the words with the most meaning for indexing are those with medium frequency. 
The words that occur too frequently say nothing about the meaning of a document, and the very rarely occurring words are not suitable either since they are not or hardly used by the persons searching. Now one could exclude all words that do not carry meaning already in the indexing. And this would indeed make sense if one could not or did not want to index the full texts but only wanted to use relevant keywords generated from the text. In the context of Web search, however, we are dealing with such powerful systems that all documents are indexed in full text and even the particularly common words, which are often treated as so-called stop words in other systems, can be searched for as well. Otherwise, a query such as to be or not to be would be impossible to process, as it consists exclusively of stop words. If we now wish to deal with the problem of simply counting words in documents, a normalization according to document length is an obvious solution. We had seen that with a simple word count, longer documents are more likely to be ranked highly than shorter ones. If we now relate the number of occurrences of a word to the length of the text, we obtain a normalized term frequency (TF) and can thus compare documents of different lengths in terms of their relative word frequencies.
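The effect of length normalization can be shown with a small sketch. It assumes the simplest possible tokenization (splitting on whitespace) and two made-up texts; the function names are hypothetical.

```python
def term_frequency(term, text):
    """Raw count of `term` among the whitespace-separated words of `text`."""
    return text.lower().split().count(term.lower())


def normalized_tf(term, text):
    """Occurrences of `term` divided by the document length in words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Made-up example texts: one short, one long.
short_doc = "search engines rank search results"
long_doc = ("search engines crawl the web and build an index " * 3).strip() + " search"

for name, doc in (("short", short_doc), ("long", long_doc)):
    print(name, term_frequency("search", doc), round(normalized_tf("search", doc), 3))
```

The long text contains the keyword more often in absolute terms, but the short text has the higher normalized term frequency, so a ranking by raw counts and a ranking by normalized TF order the two documents differently.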
Fig. 5.3 Word distributions in documents (Luhn, 1958, p. 161); axis labels: frequency, words; curve label: resolving power of significant words
In doing so, however, we have assumed that even when comparing documents and a query consisting of several keywords, every keyword is “worth the same.” However, this need not be the case. For example, let’s assume that a user enters a query consisting of two keywords, one of which is a word that occurs very frequently in the database, but the other is a word that occurs only rarely. Then, by taking into account the word frequencies within the individual documents and their frequency within the entire data set, one can give the rare keywords a higher weight, further refining the ranking of the search results. This weighting is called inverse document frequency. In practice, term frequency and inverse document frequency are combined; the common term for this is TF*IDF (term frequency * inverse document frequency; see Stock & Stock, 2013, p. 284).
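A minimal sketch of TF*IDF follows. It assumes one common variant, in which the IDF of a word is the logarithm of the number of documents divided by the number of documents containing the word; other variants exist, and the text does not say which one a given search engine uses. The collection and function names are hypothetical.

```python
import math

# Made-up collection: "the" occurs everywhere, "paris" in one document only.
DOCS = {
    "d1": "the hotel in paris",
    "d2": "the history of the search engine",
    "d3": "the hotel and the search for the best room",
}


def tokenize(text):
    return text.lower().split()


def tf(term, words):
    """Length-normalized term frequency."""
    return words.count(term) / len(words) if words else 0.0


def idf(term, docs):
    """log(N / n_t); assumed variant, 0.0 for words that occur nowhere."""
    n_containing = sum(1 for text in docs.values() if term in tokenize(text))
    if n_containing == 0:
        return 0.0
    return math.log(len(docs) / n_containing)


def tf_idf_score(query, doc_id, docs):
    """Sum of TF*IDF over all query words for one document."""
    words = tokenize(docs[doc_id])
    return sum(tf(term, words) * idf(term, docs) for term in tokenize(query))


ranking = sorted(DOCS, key=lambda d: tf_idf_score("the paris", d, DOCS), reverse=True)
print(ranking)
```

For the query the paris, the word the occurs in every document and therefore receives a weight of zero, so the ranking is decided entirely by the rare word paris.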
5.2.3 Considering the Structural Elements of Documents
An important criterion for text statistics is the occurrence of a keyword in the title and headings of a document. In the case of HTML documents, however, the title is not the visible main heading in the document, but the text specified in the <title> tag in the document source code. The headings (identified in the source code by the tags <h1>, <h2>, <h3>, etc.) appear in the document text itself and structure it. Since they are usually displayed more conspicuously than the main text, they
Fig. 5.4 First result for the query Paris Hilton (Google; August 29, 2022)
Fig. 5.5 First result for the query Hilton Paris (Google; August 29, 2022)
catch the user’s eye more quickly and facilitate quick orientation. Therefore, it seems plausible to assume that documents in which the keyword appears in one of the headings should be rated higher. Regardless of the headings, the keyword’s position within the documents also plays a role. Documents in which the keyword appears at the beginning are weighted higher than those in which it appears at a later position. Again, this is easily explained: Users usually want to know quickly whether a document is relevant to them or not. If the keyword appears at the beginning of the document, one can quickly see whether it is in a relevant context and whether one wants to read on. If several keywords are entered, the proximity of the keywords to each other also plays a role. Documents in which the keywords are close together are weighted higher. This is particularly clear in the case of names consisting of first and last names: Obviously, a user who enters such a combination wants to receive documents in which the first name and surname are as near to each other as possible or directly next to each other. Also, in other cases, the proximity of the keywords within the document suggests that these terms occur in the same semantic context. However, not only the proximity but also the order of the keywords within the documents plays a role. Documents in which the keywords occur in the order in which they were entered are weighted higher. A well-known example is the two queries Paris Hilton and Hilton Paris. A user in the first case is likely to search for the person Paris Hilton, while a user in the second case is more likely to want information about the Hilton hotel in Paris. Figures 5.4 and 5.5 illustrate this using the example of the first Google result for each query: although the two queries produce the same result set, the results are sorted differently. The factors described above are only some of the many text statistics factors. 
In the standard textbooks on information retrieval (e.g., Croft et al., 2010; Manning et al., 2008; Stock & Stock, 2013), there are many more factors that will not be discussed here. However, it has already become clear that text statistics alone can be used to put documents in a meaningful order. The prerequisite for this, however, is
that all documents potentially have the same quality level, i.e., that the ranking of the documents does not have to include whether these documents are generally suitable for the user. Text statistics methods can only distinguish which documents are likely to be particularly suitable for the query on the grounds of the texts as such. We can already see with text statistics that ranking is based on human assumptions. For example, it may seem plausible that documents in which the keywords are in the headings are particularly well suited to a query – but this has not really been proven. One could also imagine completely different assumptions that might also produce a good – albeit different – ranking. Given the large number of documents that make it into the shortlist in the first place, good rankings could certainly also be achieved through other factors. Text statistics can, of course, be applied not only to the text from the documents but also to the anchor texts describing the document (see Sect. 3.4.2) or to a combination of document text and anchor texts. Early search engines such as Excite, Lycos, and AltaVista were still strongly oriented toward conventional information retrieval systems and used the standard text statistics ranking. Assuming that all documents included in a database are potentially of equal quality, this kind of ranking (especially when combined with the assumption that users are willing and able to formulate accurate queries) can lead to good results. However, in the context of the Web, where documents are produced by a variety of authors with a variety of motives, the reliability of the documents must also be assessed. Furthermore, early search engines suffered from a great number of spam documents, i.e., non-relevant documents created to deceive search engines. For example, numerous search engines listed the website whitehouse.com, at that time a porn site, in the first position for the query white house. 
It was not until the introduction of link-based methods for assessing the relevance of a document in terms of its popularity (from about 1997 onward, especially Google) that this problem was solved.
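The structural factors from Sect. 5.2.3 (occurrence in the title, position in the text, proximity, and order of the keywords) can be combined in a small scoring sketch. The weights, the two made-up documents, and the function names are all assumptions chosen for illustration; they are not taken from any search engine.

```python
def positions(term, words):
    """All indexes at which `term` occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]


def structural_score(query, title, body):
    """Toy score from title match, position, proximity, and order."""
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    # Factor 1: keyword occurs in the title (hypothetical weight 2.0).
    score += 2.0 * sum(1 for term in terms if term in title_words)
    # Factor 2: the earlier the first occurrence in the body, the better.
    for term in terms:
        found = positions(term, body_words)
        if found:
            score += 1.0 / (1 + found[0])
    # Factors 3 and 4: proximity and order of neighboring query words.
    for first, second in zip(terms, terms[1:]):
        pos_a = positions(first, body_words)
        pos_b = positions(second, body_words)
        if pos_a and pos_b:
            gap = min(abs(i - j) for i in pos_a for j in pos_b)
            score += 1.0 / (1 + gap)
            # Bonus if the words appear adjacent in the order entered.
            if any(j - i == 1 for i in pos_a for j in pos_b):
                score += 1.0
    return score


# Made-up documents for the Paris Hilton / Hilton Paris example.
person = ("Paris Hilton", "paris hilton is an american media personality")
hotel = ("Hilton Paris Opera", "the hilton paris opera hotel is located in the city centre")

for query in ("paris hilton", "hilton paris"):
    print(query, structural_score(query, *person), structural_score(query, *hotel))
```

Both documents contain both keywords, so they belong to the same result set for either query. Only the order bonus and the position of the first occurrence differ, which is enough to place the page about the person first for paris hilton and the hotel page first for hilton paris.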
5.3 Popularity
A fundamental way search engines measure the quality of documents is by measuring their popularity. The basic assumption is that what other users have found to be good, at least implicitly, will also be helpful to a currently searching user. Popularity can be measured on different levels: On the one hand, a distinction can be made between popularity with all users, with certain groups of users, and with an individual user (then referred to as personalization; see Sect. 5.6). On the other hand, a distinction can also be made according to the type of data collected: 1. Data collection using links on the Web (link topology method): From the linking structure on the Web, we can determine which documents are particularly popular (they are frequently linked to from other documents). Hence, we measure what is particularly popular with the authors of other documents; only the opinion of those who create documents or set links is considered.
2. Recording user clicks (usage statistics): By measuring what users actually look at, it is possible to determine which documents are particularly popular. The difference to the link-based methods is that the users’ behavior is evaluated here, not only the ratings of those who also create content. In the following sections, we will explain both types of methods with their advantages and disadvantages in more detail. However, it should be emphasized already at this point that measuring popularity is not a direct representation of quality. Instead, we are working with the assumption that quality is expressed through popularity (see Lewandowski, 2012). And one has to admit that the success of link topological methods, in particular, lends credence to this assumption; for practical purposes, it is also not of great importance whether this approach to measuring quality is well founded on a theoretical level but that it achieves satisfactory results for the users. However, the approximation of quality through popularity can explain well why certain documents are preferred by search engines and others are more likely to be found at the bottom of the result lists. Foremost among these is Wikipedia: The success of documents from this website in search engines can be explained by ranking based on popularity. And in turn, Wikipedia’s good rankings, especially on Google, benefit it: when more users come to the website, more will participate in writing and improving articles. So, we are dealing with a self-amplifying effect. An obvious disadvantage of a – too strong – valuation of document popularity is that documents containing misinformation or disinformation can also be extremely popular. Especially for so-called fake news, it has been shown that these are shared more frequently by users on social networking sites than real news (Vosoughi et al., 2018). Thus, if popularity is rated highly, such documents may appear high in the result lists. 
Combined with the high credibility attributed to search engine results (see Chap. 15), this can lead to search engines promoting the spread and acceptance of false news.
5.3.1 Link-Based Rankings
Link-based rankings take advantage of the fact that documents on the Web do not stand alone but are linked. As we have already seen with crawling (Sect. 3.3), links can be exploited to capture content. Figure 5.6 shows the position of a document (D) within the Web. The document links to other documents (referred to as the out-links of document D); however, there are also other documents that link to document D (referred to as the in-links of document D). The out-links of D can be easily determined; one only has to extract the links from the document’s source code. Also, as users, we can easily see the out-links of a document, as they are highlighted and clickable in the document text. The situation is different with the in-links of a document: Only if we know all the documents on the Web can we safely identify how many and which links lead to a particular document. This is because links always point in one direction only (they
Fig. 5.6 In-links/out-links of a document D (Adapted from Lewandowski, 2005, p. 118)
are “directed”), and there is no way to go directly from a document to the documents that link to it. To carry out link analyses at all, search engines must first create an extensive database of the Web. This is done by crawling, and the information on the links that accumulate in the process can be seen as a by-product of crawling. The general assumption behind link-based rankings is that when an author of a document sets a link, they cast a vote for the linked document. One only links to what one at least finds interesting. However, in the following sections, we will see that not every vote counts equally. Instead, one of the strengths of link-based rankings is that they differentiate between the value of individual links. This, in turn, is necessary because, in principle, anyone can create documents and set links on the Web. Similar to the quality of documents, which varies considerably, the quality or significance of links also varies. Links can be generated automatically in large numbers, and such links should, of course, not be given the same weight as links set by people to point to interesting documents. In the chapter on search engine optimization (Chap. 9), we will also see that it can be so attractive for website operators to get valuable links that they even purchase them. Although search engine providers explicitly prohibit this practice in their terms of use, it illustrates the enormous role that link-based rankings play in search engine rankings (and thus also in the success of websites that are supposed to be findable via search engines).
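Why in-links are only available as a by-product of crawling can be seen in a short sketch. It assumes a tiny, made-up crawl result in which each document is mapped to the out-links extracted from its source code; the names `OUT_LINKS` and `invert_links` are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawl result: document id -> out-links found in its source code.
OUT_LINKS = {
    "A": ["D"],
    "B": ["D", "C"],
    "C": ["D"],
    "D": ["E", "F"],
}


def invert_links(out_links):
    """Derive the in-links of every document from the known out-links."""
    in_links = defaultdict(set)
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].add(source)
    return dict(in_links)


in_links = invert_links(OUT_LINKS)
print(sorted(in_links["D"]))
```

The out-links of D can be read directly from its own entry, whereas its in-links only emerge after every crawled document has been processed. A document that was never crawled cannot contribute an in-link, which is why the completeness of the link data depends on the coverage of the crawl.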
5.3.1.1 PageRank
Probably the best-known link-based ranking algorithm is Google’s PageRank. The method was originally introduced in 1998 (Brin & Page, 1998; Page et al., 1999) and implemented directly in the Google search engine. It thus gained an edge over other link-based rankings that were perhaps more mature on a theoretical level (most
notably Jon Kleinberg’s HITS; Kleinberg, 1999) but were not yet available in an operational search engine. It is often erroneously claimed that Google sorts documents “by PageRank.” This is wrong. Instead, PageRank is one of many methods that Google uses (or has used in the form described here) to sort documents according to their popularity on the Web. We shall see what is meant by popularity in the context of this procedure. In the following, PageRank will be presented because, on the one hand, it is the best-known link-based ranking algorithm. On the other hand, the essential advantages and disadvantages of link-based rankings can be explained well based on this algorithm. It is, therefore, not important whether and how Google still uses this method today, but rather to understand what central role link-based rankings play in search engine ranking. Even if the specific procedure is no longer used today and other search engines work differently anyway, they all use link-based techniques to evaluate the quality of documents. So how does PageRank work? First, we set the primary goal: For each document on the Web, it should be determined how likely it is that someone who randomly navigates from document to document via the links on the Web will come across precisely that document. The PageRank of a page is then simply the value that indicates this probability. A crucial advantage of such a procedure is that the PageRank value is assigned to every page before a query is executed. Each document is, therefore, already assigned a value as part of the indexing process. So even though the calculation of the values itself may be very complicated and require a lot of computation, at the moment of ranking immediately after a query is entered, a simple calculation is performed using only the PageRank values already known beforehand. This allows the ranking to be done quickly, and thus the queries can be processed promptly. 
If we look at the model in a more technical way, first of all, the Web is understood as a directed graph, i.e., the Web is a set of documents connected by links, each of which points in only one direction. Although links can also be reciprocal, i.e., two documents each contain a link to the other document, this is not required. This also means that if one wants to set a link, it is not necessary to ask anyone for permission. This is different, for example, from the standard linking model in social networking sites: If one wants to be “friends” with someone on Facebook, for example, both have to confirm with each other. So the connection (the “link”) is only established if both parties agree. The model of the random surfer on which PageRank is based states that a hypothetical user calls up a document randomly, follows one of the links provided there randomly, then follows one of the links in the next document, and so on. The only exception is again randomly determined: With a certain (low) probability, the user no longer feels like following further links and therefore starts following links again on a randomly selected page on the Web. Considering this model, a document receives a high PageRank if:
• Many other documents link to it
• Documents with a high PageRank link to it
Fig. 5.7 Representation of documents with weights in the link graph (Page et al., 1999, p. 4)
Now, why is the first condition not sufficient? First, it would seem obvious that a document with many links is considered more important than a document with only a few links pointing to it. But the random surfer model describes precisely that there can be nodes in the network that are reached more often than other documents when links are called up at random. Let’s look at one such node as an example: the start page of nytimes.com. Let’s assume that this start page already has a high PageRank, which ensures that many visitors come to it via the links alone. Suppose this page contains a link to another page. In that case, it is intuitively clear that this link should be weighted higher than a link from a private homepage, for example, which hardly receives any links and is, therefore, unlikely to be visited by users. How, then, is this difference in link importance calculated? Equation 5.1 describes how the value of a document A is calculated. Here, PR is the PageRank of a particular document, and C is the number of links originating from a document. Finally, d is a damping factor that describes the abovementioned process of links no longer being followed and restarting on a random page instead. In the equation, for each page T which links to document A, its PageRank is divided by the number of outgoing links. Next, the sum of the values calculated in this way is multiplied by the damping factor d (which lies between 0 and 1). Then the difference between 1 and the damping factor is added.

\[
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right) \tag{5.1}
\]
Let us now look at an example: Figure 5.7 shows a small section of the Web with linked documents. Each document already has a PageRank value, which is shown
within the document. The value each document “inherits” to the documents it links to (shown next to the arrows) has also already been calculated. The document on the top left has a PageRank value of 100 and two outgoing links. So, with each link, 100/2 = 50 points are passed on – both documents on the right benefit equally from this. Now let’s look at the document on the bottom left. It only has a PageRank of 9 and three outgoing links. Therefore, the document on the top right receives 9/3 = 3 additional points through this link. This shows that the links from the two different documents have very different value. So, if we look at the equation from the perspective of a content producer who wants to increase their visibility in search engines by collecting links, the conclusion is not necessarily to go for quantity but rather to try to get some links from valuable pages in terms of link value. The last question is how one is supposed to calculate the PageRank values when one needs the PageRank values of the other documents, which are not yet known. It seems to be a kind of chicken-and-egg problem. The solution lies in a so-called iterative procedure in which the same value is assigned to all documents in the first step. Then, all PageRank values are provisionally calculated on this basis. Once this has been done for all documents, a new round (iteration) takes place, in which new values are computed for all documents based on the values calculated in the first round. Then, in the third round, the PageRank values are recalculated based on the values from the second round, and so on. But when does the process end? This cannot be determined precisely. However, the values of a single document become more and more uniform between the iterations, i.e., the differences between values calculated for any given document become smaller after each iteration. 
Although there is no fixed point for stopping the calculation, the calculation can be stopped if the values are found to be hardly changing anymore. The PageRank algorithm was revolutionary in its time at the end of the 1990s, especially due to its implementation in the then-new search engine Google. With its help, Google could produce significantly better results than its competitors. Unfortunately, it can be said that the competition has not recovered from this blow to this day, even though PageRank was only one of the important features of Google’s success (see Chap. 8). With PageRank, a static value is assigned to each document. So it is not a website (a source) that is rated, but each document. The fact that PageRank is a static value is both an advantage and a disadvantage. The advantage is that it is not necessary to calculate the PageRank at the moment a query is submitted, which would not be possible due to the complexity of the calculation. The disadvantage, however, is that PageRank values are not yet known for documents newly found by a search engine, and therefore a compensation factor must be created. Finally, let us look again at the example of the query white house mentioned in the previous section on text statistics. Let’s assume for simplicity that a search engine only has to decide on the ranking of two pages, namely, between the home pages of whitehouse.com (the porn site) and whitehouse.gov (the government
institution). If the text on both pages is identical (which was the case for a while) and only the images differ, no ranking is possible through text statistics. Still, it can be done by employing a procedure such as PageRank: many more (and higher-quality) pages will link to the original White House page than to the porn site. Therefore, link-based rankings make it relatively easy to distinguish between an original and a fake. But they also help identify high-value documents within a set of potentially relevant documents. In this way, they complement text statistics procedures. By themselves, however, methods like PageRank are not useful. For example, if one were to arrange the documents according to PageRank alone, the result would be the same for every query, since a static value is assigned to every document regardless of its content and of the query submitted.
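How a static link value can complement a query-dependent text score may be sketched as follows. Everything in this example is an assumption made for illustration: the page texts, the link scores, the toy text statistic, and the additive weighting are invented and do not describe how any real search engine combines its factors.

```python
# Hypothetical documents: two pages with identical text but different
# static link scores, plus one unrelated page. All values are invented.
documents = {
    "whitehouse.gov": {"text": "welcome to the white house", "link_score": 9.2},
    "whitehouse.com": {"text": "welcome to the white house", "link_score": 1.3},
    "example.org/paint": {"text": "how to paint a fence", "link_score": 5.0},
}


def text_score(query, text):
    """Toy text statistic: fraction of query terms that occur in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def rank(query, docs, link_weight=0.1):
    """Order documents by text score plus a weighted static link score.

    Documents that do not match the query at all are dropped, so the
    link score only reorders potentially relevant documents.
    """
    scored = []
    for url, doc in docs.items():
        t = text_score(query, doc["text"])
        if t > 0:
            scored.append((t + link_weight * doc["link_score"], url))
    return [url for _, url in sorted(scored, reverse=True)]


print(rank("white house", documents))
print(rank("paint", documents))
```

For the query white house, the text score alone produces a tie, and the link score decides the order. For the query paint, the page with the high link score that does not match the query never appears, which is why a link score by itself cannot serve as a ranking.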
5.3.1.2 Development of Link-Based Rankings
As already mentioned, PageRank is neither the only link-based ranking nor is it still used by Google as an essential ranking factor in this form. Instead, it has become a kind of myth, which, on the one hand, is invoked as a justification for Google’s superior market dominance, but, on the other hand, is also used by search engine optimizers (see Chap. 9) to surround their services with an air of genius: they have figured out PageRank and can therefore bring their clients’ pages to the top of Google’s rankings. Even today, all search engines use link-based rankings, even though the importance of links for ranking is repeatedly called into doubt or it is postulated that their importance is declining. The latter may be accurate, but it cannot be denied that links continue to play a significant role, as they can easily be interpreted as an expression of quality. And ultimately, ranking is primarily about reflecting quality (as perceived by users).
5.3.2
Usage Statistics
Link-based rankings are oriented toward the content creators of the Web; only the links that creators set within their documents are included in the calculations. Obviously, such author-centered methods can capture only a part of the Web’s user community; all those users who primarily consume content on the Web are left out. In the context of social media, it is often said that users who were previously only consumers of information on the Web are now becoming authors at the same time. They would thus become “prosumers” (formed from producer and consumer). This may be true insofar as users can now easily create content themselves and publish it on the Web. However, by no means all users make use of this possibility, i.e., there is still a big difference between those who create content on the Web and those who merely consume it.
Usage statistics procedures evaluate the interaction behavior of users with a search engine and try to draw conclusions about the quality of the documents viewed
by the users, especially by counting clicks on search results and measuring the dwell time on these results. This information is then, in turn, incorporated into the rankings. A major advantage of these methods is the mass of data that can be analyzed as a “by-product” of the users’ search processes. However, their serious disadvantage from the user’s point of view is that, at least in the case of procedures that evaluate the data of individual users, extensive data records are created about the respective user.
One can differentiate between the usage statistics methods according to the level of data collection:
• Using the data of all users: Here, all interactions occurring in the search engine are recorded. Similar to link-based rankings, the results can be improved for all users; however, differentiation or adaptation to individual needs is impossible. A significant advantage from the user’s point of view is that the data can be collected mostly anonymously and does not have to be assigned to individual users.
• Use of data from specific user groups: An automatic segmentation of the interactions according to user groups allows a more precise adaptation of the search results to the needs of the respective group. The deeper the degree of differentiation, the easier it is to assign the collected data to individual users.
• Use of the data of each user: If the data of individual users is analyzed and this data remains assigned to the individual user (i.e., it can be traced which user made which query and when), we speak of personalization. We will discuss this topic in detail in Sect. 5.6; in the following, we will first look at the basic methods of usage statistics.
The methods of usage statistics used by search engines have in common that they primarily collect data implicitly. This means that conclusions about quality ratings are drawn from general user behavior.
So, for example, if a user clicks on a particular result on the search engine result page, this is seen as a positive evaluation of this result. In contrast, explicit ratings are mainly known from social media sites (e.g., “likes” on Facebook) or from e-commerce (e.g., product ratings on Amazon). Such ratings can, of course, also be used for ranking.
5.3.2.1 Analyzing Clicks on Search Engine Result Pages
Let us look at an example of the implicit collection of usage data and its exploitation for ranking. Figure 5.8 shows a hypothetical result list with four results, along with the collected clicks on the individual results. If the search engine were to create a ranking that corresponds exactly to user expectations, we would assume that the first result is clicked on most often, then the second, and so on. In the example, however, we see that although the first result was clicked on most often, the third result was clicked more often than the second. What does this mean? If we assume that users read the result list from top to bottom (see Sect. 7.6), the more frequent clicks on the third result mean that many users skipped the second result because it did not seem relevant to them. However, we must bear in mind that it is not the result itself that was evaluated but the snippet generated by the search engine on the search result page. Therefore, it could be that the result is relevant to the query, but the snippet does not seem relevant to the user (Lewandowski, 2008).
Fig. 5.8 Analysis of clicks on the search engine result page (fictitious example)
It should also be noted that users’ clicks are strongly influenced by the ranking itself (see Sect. 7.6). This means that the likelihood of a user clicking on the first result is already determined by its position and is by no means solely dependent on the content of the snippet. Consequently, the number of clicks must be put in relation to the number of clicks expected at the respective position. If, for example, the results at positions 3 and 4 received the same number of clicks, this would mean that users prefer result 4 on the basis of its description, since results at position 4 are generally clicked less often than those at position 3. For the ranking, this would mean that it would be appropriate to place result 4 higher up in the list. In our example, a clear preference of the users is expressed through the clicks, and a search engine could move the result currently in the third position up to the second position (see Joachims et al., 2005).
The advantage of this procedure is that the search engine can adapt its ranking very quickly to the current needs of its users. The disadvantage, however, is that only documents that can already be found at the top of the result lists are included in the analysis since users usually follow the predefined ranking very closely (see Sect. 7.6). Therefore, it is not to be expected that a relevant result at the bottom of the result list will be moved to the top by such a statistical procedure. Other methods are better suited in these cases, especially measuring freshness (see Sect. 5.4).
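The idea of relating observed clicks to the clicks expected at each position can be sketched in Python. The expected click shares per position and the click counts below are invented for illustration, and the simple ratio used here is only one conceivable way to make the comparison; it is not taken from Joachims et al. (2005) or from any search engine.

```python
# Assumed share of all clicks that each position attracts purely
# because of where it sits in the list (illustrative values only).
EXPECTED_SHARE = {1: 0.50, 2: 0.25, 3: 0.15, 4: 0.10}


def click_ratios(observed_clicks):
    """Relate observed clicks to the clicks expected at each position.

    observed_clicks maps position -> number of clicks. A ratio above 1
    means the result attracted more clicks than its position alone
    would explain; a ratio below 1 means it attracted fewer.
    """
    total = sum(observed_clicks.values())
    return {
        pos: clicks / (EXPECTED_SHARE[pos] * total)
        for pos, clicks in observed_clicks.items()
    }


def rerank(observed_clicks):
    """Return the current positions ordered by observed/expected ratio."""
    ratios = click_ratios(observed_clicks)
    return sorted(ratios, key=lambda pos: ratios[pos], reverse=True)


# Fictitious click counts: result 3 is clicked more often than result 2.
clicks = {1: 700, 2: 110, 3: 150, 4: 40}

for pos, ratio in click_ratios(clicks).items():
    print(f"position {pos}: observed/expected = {ratio:.2f}")
print("suggested order of current positions:", rerank(clicks))
```

With these numbers, the result at position 2 attracts far fewer clicks than its position would suggest, so the sketch moves the result from position 3 up to position 2, as in the example discussed above.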
Concerning freshness, it is also essential to ask which time period of the clicks should be included in the measurement. If one were to simply measure all clicks that have ever occurred, this would favor established documents in the ranking and “cement” the ranking in the long run. Therefore, shorter periods of time should be chosen to be able to map changes in user click behavior quickly.
In addition to clicks in the result lists, dwell time on the clicked documents is recorded and used for the ranking. It is used to distinguish between documents that were clicked on because of their snippet but were not considered relevant by the user and documents that were relevant. It is assumed that a user who clicks on a document in the result list but returns to the result page after a short time to select a new document was not satisfied with the document clicked on first. Therefore, their click is not counted as a vote for the document they clicked on but as a vote against it. However, if a user clicks on a document and does not return to the result page, this is considered a positive signal. The same applies if the user returns to the search engine result page after a reasonable reading time (dwell time); in that case, we can assume that the document read was relevant but that the user is still looking for further information.
In principle, what we said here for evaluating the interaction behavior of all users can also be applied to particular user groups. Here too, although an explicit classification of the user groups would indeed be possible (e.g., by users assigning themselves to specific thematic interests), in practice, such a classification happens indirectly based on search behavior. For example, Fig. 5.9 shows areas of interest automatically determined by Google for a user, which are then used to display relevant advertising.
Usage statistics are always used as a supplement to text statistics.
By themselves, they would be worthless, as they do not provide information about the content of documents but only about their popularity. A significant advantage of usage statistics is that large amounts of data accumulate quickly. In this way, search engines can react quickly to current events, for example, if the dominant information need for a query changes due to a recent event and other documents are preferred as a result (see Sect. 4.5.6). For example, the query Japan took on a new meaning with the Fukushima nuclear accident in March 2011: users were suddenly no longer primarily interested in basic information about the country but in information about the current event; corresponding documents were preferentially selected in the result lists. By evaluating the click data, search engines could quickly rank the appropriate documents at the top. When the users’ interests in the query changed again later, search engines were able to react accordingly just as quickly.
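The interpretation of dwell time described in this section can be written down as a small decision rule. The sketch below is hypothetical: the threshold of 30 seconds, the scoring of +1 and -1, and the click log are assumptions for illustration, not values used by any search engine.

```python
from collections import Counter

# Assumed threshold: returning to the result page faster than this is
# treated as dissatisfaction. The value is purely illustrative.
MIN_READING_SECONDS = 30


def interpret_click(dwell_seconds):
    """Turn one click into a vote of +1 or -1.

    dwell_seconds is the time until the user came back to the result
    page, or None if the user never returned.
    """
    if dwell_seconds is None:
        return 1   # no return: counted as a positive signal
    if dwell_seconds < MIN_READING_SECONDS:
        return -1  # quick return: counted as a vote against the document
    return 1       # return after a reasonable reading time: positive


def tally(click_log):
    """Sum the votes per document for (url, dwell_seconds) pairs."""
    votes = Counter()
    for url, dwell in click_log:
        votes[url] += interpret_click(dwell)
    return votes


# Fictitious click log.
log = [
    ("example.org/a", 5),
    ("example.org/a", 8),
    ("example.org/b", 240),
    ("example.org/b", None),
    ("example.org/a", 90),
]
print(tally(log))
```

Restricting the log to a recent time window before tallying would correspond to the point made above that shorter periods allow the ranking to follow changes in click behavior.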
Fig. 5.9 Excerpt from an automatically generated, user-specific topic profile at Google (August 29, 2022)
5.3.2.2 Collecting Data for Usage Statistics
Usage statistics are based, first of all, on the data that accrue anyway during query processing. Every Web server logs the requests made to it; among other things, the URL of the document retrieved, the IP address of the requesting computer, the time of the request, and any queries entered are recorded. These data are then analyzed as a whole for use in statistical procedures. However, numerous extensions allow far more comprehensive data collection. Here, too, we can distinguish between different levels of data collection (Fig. 5.10):
• The search queries are simply the queries entered by the users, which are not further assigned.
• The click-through data includes the clicks that occur after the query. One can now determine what was searched for and which results were selected (and possibly how long a particular result was viewed).
• The analysis of sessions (see Sect. 4.4) allows a more accurate picture of search behavior, as queries can now be analyzed in the context of a longer search process.
• Finally, user profiling refers to the search behavior of each individual user.
Fig. 5.10 Levels of search query analysis: Level 1, queries; Level 2, clickthrough data; Level 3, sessions; Level 4, user profile based on search queries (Adapted from Lewandowski, 2011, p. 62)
If the log files, i.e., the records of user interactions with the search engine servers, are evaluated, only those interactions that take place on the pages of the search engine can be recorded in addition to the queries. All further interactions that occur as soon as the user has left the search engine’s site can no longer be recorded. Figure 5.11 illustrates this: Interactions within the circle can be recorded; these include the queries entered, the search engine result pages, and which results are selected on the result page. Everything that happens after a result has been selected is no longer recorded in the log files. The user no longer interacts with the search engine but is redirected to another server. A new interaction can be recorded only when a user returns to the search engine’s result page. However, it is not possible to determine what the user has done in the meantime: Did they read the suggested document intensively? Or did they click on another document, return to the original document, and finally to the search engine result page?
Fig. 5.11 Search history data that can be collected by the search engine (labels in the figure: entering a query; selecting a result on the result page; viewing the results and further actions)
Search engine providers have quickly realized they miss out on relevant data when collecting usage data via the log files alone. This applies not only to the interactions resulting directly from the queries; rather, interactions that initially have nothing to do with queries can also be informative for search. For example, if one
can determine which documents on a website users prefer to navigate to, one can favor these pages in the ranking. There are three tools in particular that search engines use to collect complex interaction data:
• Browsers and apps offered by the search engine provider
• Personalization tools (“search history”)
• Analytics tools for website providers
Browsers and apps offered by the search engine provider (such as Google Chrome or the Google app) log all user interactions with Web content (stored in the search history), as does every other browser. However, the specific feature is that browsers such as Google Chrome also transmit this data to the search engine provider. There, it is added to the user’s profile and can be further analyzed. This means that the search engine can now evaluate not only the interactions with the search engine but all interactions of a user on the Web, provided they are carried out via that browser.
Personalization tools are meant to provide users with better results (see also Sect. 5.6). For this purpose, all interactions with the search engine are recorded for users who are logged in (e.g., in the Google profile; https://myactivity.google.com). Google provides the following information on data collection: “If Web & App Activity is turned on, your searches and activity from other Google services are saved in your Google Account” (Google, 2021a). This means that activities across services are logged and aggregated in the profile. However, precisely what data is collected is not described. Google’s privacy policy (Google, 2021b) also leaves much open in this regard. Figure 5.12 shows an excerpt from Google’s search history that the user can see. Here, you can see the queries made with the results clicked on. Again, all entries are provided with a time stamp.
Fig. 5.12 Excerpt from a user’s Google search history (November 30, 2020)
As soon as a user creates a Google account (which is necessary, e.g., to use Gmail or the Google Calendar), the search history is automatically activated (https://support.google.com/accounts/answer/54068?hl=en). Although it can be deactivated manually, it is not clear how many users are even aware that their data is being collected in this way.
Web analytics software, such as Google Analytics, allows website providers to easily analyze their users’ interactions. This includes, for instance, the pages retrieved, typical usage paths, and the origin of the users (e.g., direct request to the website or request via a search engine). The functionality of Google Analytics goes far beyond the standard analysis tools for websites, which are often bundled with domain hosting services. However, the price for the extensive analysis methods is that the website operator shares their data with the search engine: the search engine can then also track how users behave on the respective website and draw conclusions from this.
With the data generated during the search, the data of the logged-in users, the data from the tracking of advertising, and the data from Google Analytics, Google has a comprehensive picture of the interactions of its users, which goes far beyond the use of the search engine. Data collection is so extensive that it results, to a large extent, in “transparent users.” The data combined in this way can be used to improve search results but also, for example, to display advertising more accurately. For the user, it is possible to inspect at least part of this data and to switch off the logging, at least in part (https://www.google.com/settings/dashboard).
5.3.2.3 Appraisal of Usage Statistics
From a technical point of view, usage statistics are very well suited for improving search results. Their ability to react quickly to current changes in search query volume and click behavior, in particular, makes them an ideal complement to measuring popularity through link-based rankings. However, a fundamental assumption of usage statistics is that users are so similar that they prefer the same results as other users or a certain user group. The way out of this “homogenization” is personalization (see Sect. 5.6), which raises its own problems, especially concerning data protection. Usage statistics require comprehensive collection of data that can be traced back to individual users (data on the origin of a query is recorded automatically whenever the query is logged). However, usage statistics do not need to store these data in a way that allows them to be traced back to individual users (see Sect. 5.6).
5.4
Freshness
In the discussion of usage statistics, we described it as one of their essential properties that they can quickly provide popularity information even for new documents and thus compensate for the disadvantage of link-based rankings, which structurally favor older documents. However, the freshness of documents can also be used directly for ranking; then, the fundamental problem is to determine when a document was actually created or updated.
A vivid example of the fact that ranking based solely on text statistics and link-based rankings is not sufficient was provided by Google in 2001. After the terrorist attacks on the World Trade Center in New York on September 11, the demand for up-to-date information in search engines was enormous. However, search engines failed to display appropriately up-to-date documents. Figure 5.13 shows the Google homepage on the day of the attacks; manually inserted links to news sites can be seen below the search box. Two consequences were drawn from this “defeat” of search engines: Freshness was made an important ranking factor, and news search engines were increasingly developed (see Chap. 6).
So how can search engines use freshness information to their advantage? On the one hand, knowledge about the creation and update date, as well as the update frequency of a document, can be used to guide crawling efficiently (see Sect. 3.3). On the other hand, this information can be used for ranking. To do this, it must first
Fig. 5.13 The Google.com homepage on September 11, 2001 (http://blogoscoped.com/files/google-on-911-large.png; January 6, 2021)
be determined whether it makes sense to display up-to-date documents for a given query (the so-called need for freshness). If so, the freshness of the documents can be included in the ranking. But how can the freshness of a document be determined? First, HTML files do not necessarily contain a creation or update date. The simplest way, therefore, seems to be to use the update date of the respective file stored on the server. However, it is easy to see that this is inappropriate: Each time a file is updated, even if it is only an exchange of a single letter, the file is given a new date. Creation and update dates can, therefore, only be determined relatively. The creation date of a document is usually equated with the first time it is found by the search engine (which gives a certain advantage to search engines that have been on the market for longer). The update date of a document can be approximated in many ways or by a combination of different factors (Acharya et al., 2005): Again, the date the updated document was first found can be used (ideally, to get a reliable date, one would need to check all documents at the shortest possible intervals), but this is not sufficient. Other indicators are the content update or change of the document (i.e., significant changes in the document text, not only in surrounding texts such as advertisements and references to other articles), a change in the linking of the document (if many links are currently being set to a document, this indicates that it at least has a certain significance presently), as well as the traffic that a document receives. Even if it is difficult to determine the exact creation or update date of a document, this date – possibly supplemented by a factor for the update frequency of the document, which is calculated from past updates – is a useful and easy-to-use addition to the ranking factors described above. 
This is because the date information is also a static value, i.e., at the moment of ranking, only this value needs to be used without having to perform complex calculations. This, in turn, ensures fast processing and does not negatively affect the time required to create the result pages.
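A minimal sketch can show how a precomputed freshness value might enter the ranking only when a query has a need for freshness. The exponential decay with a half-life of 30 days, the weighting, the documents, and their relevance values are all assumptions chosen for illustration; the source does not specify how freshness is actually scored.

```python
from datetime import date


def freshness(last_update, today, half_life_days=30.0):
    """Static freshness value in (0, 1]: 1.0 for a document updated
    today, halving every half_life_days (an assumed parameter)."""
    age_days = max((today - last_update).days, 0)
    return 0.5 ** (age_days / half_life_days)


def final_score(relevance, fresh, needs_freshness, weight=0.5):
    """Mix the precomputed freshness value into the score only for
    queries judged to need up-to-date documents."""
    if not needs_freshness:
        return relevance
    return (1 - weight) * relevance + weight * fresh


today = date(2023, 1, 31)

# Hypothetical documents: (relevance from other factors, estimated update date).
docs = {
    "background-article": (0.9, date(2021, 6, 1)),
    "news-report": (0.7, date(2023, 1, 30)),
}

# Freshness is computed once, ahead of query time, and then only looked up.
static_freshness = {
    name: freshness(updated, today) for name, (_, updated) in docs.items()
}


def rank_documents(needs_freshness):
    """Order the documents for a query with or without need for freshness."""
    return sorted(
        docs,
        key=lambda name: final_score(
            docs[name][0], static_freshness[name], needs_freshness
        ),
        reverse=True,
    )


print("no need for freshness:", rank_documents(False))
print("need for freshness:   ", rank_documents(True))
```

Because the freshness value is stored per document, ranking at query time involves only a lookup and one weighted sum, which reflects the point that no complex calculation is needed at that moment.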
5.5
Locality
Locality is the adaptation of search results (or their ranking) to the user’s current location. The underlying assumption is that a user prefers documents located in their vicinity. A typical example is a search for a restaurant: Suppose a user enters only restaurant as a keyword. Which results should be shown preferentially? There is a strong case for displaying results close to the user’s current location. There is a certain probability that this user would like to visit a restaurant, and it would make little sense if the suggested restaurants were too far away to reach them in a reasonable amount of time. While search engines adapted their ranking procedures to the country interface used early on and displayed the results in a different order for the same query submitted from different countries, since around 2008, the results have been adapted to the user’s specific location.
Fig. 5.14 First results for the query bundesrat on Google Germany (Google.de; January 31, 2021)
Figures 5.14 and 5.15 show the first search results from Google Germany and Google Switzerland for the query bundesrat (a legislative body). Although the query produces the same set of results overall (not shown in the figure), the ranking differs significantly: While the results from Google Germany favor the German Bundesrat, Google Switzerland gives priority to information on the Swiss Bundesrat. This makes sense as it is likely that a user from the respective country will also search for “their” Federal Council first. As we have seen above, such an adjustment would also be possible by evaluating user behavior. A more straightforward way, however, is to give preference to documents that are written in the language of the user interface and that have their “location” in the country of the requesting user.
But what is the “location” of a document? What we are referring to here is not the physical location of the document, i.e., the location of the server on which the document is stored, but the location that is dealt with in the document. For example, a document may be stored on a server in the United States but explicitly refer to an exact location in Germany (e.g., the website of a neighborhood café). The challenge for search engines is now to determine the correct location. This can be achieved, for instance, through the information available on the respective website (e.g., the address in the website disclaimer), from purchased data (e.g., telephone book or
Fig. 5.15 First results for the query bundesrat on Google Switzerland (Google.ch; January 31, 2021)
business directory entries), or through usage statistics (which shows from where the document is accessed most often). Determining a user’s location is much easier than determining the “location” of a document. In the case of stationary computers, the respective IP address can be used to determine the user’s approximate location. In the case of mobile devices (smartphones, tablets, etc.), determining the location is even more straightforward and more precise: Here, the GPS location data provided by the device can also be used, which makes it possible to determine the location very precisely. For example, Figs. 5.16 and 5.17 show a section of a Google search engine result page on a desktop computer and a smartphone; the query in each case was restaurant. The search results are very much adapted to the user’s current location; in the case of the mobile phone, the recommended restaurants are in the surrounding streets. Again, this adaptation did not require additional location input from the user, but the query was expanded to include the location data. In addition to determining the location of documents, the challenge of local customization lies primarily in the question of when localized results should be displayed and, if so, at what distance they should be from the user. We have already seen that localized search results are displayed even without entering a so-called trigger word indicating the location. In the restaurant example, it was also evident that localized search results can be useful. However, this is not so
Fig. 5.16 Search engine result page (detail) in Google desktop search (query restaurant; December 3, 2021)
obvious for all queries. For instance, if a user searches for lawyer, this query seems to have little local relevance at first. However, if one takes a closer look at the information needs of the users, it turns out that legal advice is a service that is primarily searched for in the vicinity of the user, even if this vicinity is, of course, a wider one than that of the query restaurant entered by a user on the move. Jones et al. (2008) studied the distance within which localized search results should be located. They found that the desired radius differs significantly depending on the query and that it also depends on the user’s location. For example, users in a rural US state are more willing to travel greater distances than users in densely populated areas. Therefore, local search results need to be adjusted accordingly.
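The interplay between the user's location and a query-dependent radius can be illustrated with a short sketch. The radii per query, the coordinates, and the candidate names are invented for this example. The distance is computed with the haversine formula, which is a standard approximation and not necessarily what a search engine uses; a fuller version would also vary the radius with the user's region, as the findings of Jones et al. (2008) suggest.

```python
from math import asin, cos, radians, sin, sqrt


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * asin(sqrt(a))


# Assumed search radius per query, in kilometres (illustrative values).
RADIUS_KM = {"restaurant": 2.0, "lawyer": 25.0}


def local_results(query, user_pos, candidates):
    """Keep candidates within the query's radius, nearest first."""
    radius = RADIUS_KM[query]
    nearby = []
    for name, lat, lon in candidates:
        d = distance_km(user_pos[0], user_pos[1], lat, lon)
        if d <= radius:
            nearby.append((d, name))
    return [name for _, name in sorted(nearby)]


# Hypothetical user position and candidate locations (name, lat, lon).
user = (53.5511, 9.9937)
candidates = [
    ("nearby", 53.5560, 9.9900),
    ("across town", 53.5500, 9.9350),
    ("next city", 53.8655, 10.6866),
]

print(local_results("restaurant", user, candidates))
print(local_results("lawyer", user, candidates))
```

The same candidates lead to different result sets depending on the query: the narrow radius assumed for restaurant keeps only the closest entry, while the wider radius assumed for lawyer also admits the one across town.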
Fig. 5.17 Search engine result page (detail) in Google’s mobile search (query restaurant; December 3, 2020)
Taking the user’s location into account means that not all users see the same results, even though the overall result set may not have changed. As long as the customization was based on country interfaces alone, one could simply switch
between the different local views by selecting a different country interface of the search engine (e.g., Google.fr instead of Google.de). However, since customization is now based on the user’s actual location (sometimes combined with location data collected in the past), it is no longer easy to see whether location played a role in ranking results and, if so, what that role was. This makes the result ranking even less comprehensible for the users than it already was. Nevertheless, the advantages of local customization are evident: it not only ensures better matching between queries and documents but also saves users from having to enter the desired location explicitly. The benefits become even more apparent when considering that about 20% of queries are local (Kao, 2017).
5.6
Personalization
In general, personalization is understood as the adaptation of an object to the needs of a subject (Riemer & Totz, 2003, p. 153). Concerning search engines, different forms of personalization are possible; they range from individually adapting the user interface to personalizing the search results, which is the subject of discussion below. First of all, a distinction must be made between personalization and contextualization. White (2016, p. 267) defines personalization as processes that adapt results (or other elements of the search engine) to an individual user. They are based on long-term data collection, as adjustments to the individual user’s interests are only meaningfully possible through comprehensive knowledge of their interests. On the other hand, a short-term data collection would not produce enough data to reflect the user’s actual (and changing) interests. This is contrasted with contextualization, which, according to White (2016, p. 268), refers to adaptation to the current search situation. This includes, for example, adaptations to the user’s location, the time of day, and adaptations that draw on massive data from past search interactions from other users. Thus, both personalization and contextualization involve adaptation to the current search process, which results in different users in different situations receiving different results or results arranged differently. A problem with both contextualization and personalization arises from the fact that the well-known search engines allow users little or no control. For example, it is impossible to completely switch off the personalization of search results (for specific searches) on Google. Although the effect can be reduced by, for example, logging out of one’s Google account, this does not mean that one receives “objective” results (in the sense of the same results as every other user). The same applies to contextualization: adaptation to the user’s location (as described in Sect. 
5.5) cannot be eliminated. Personalization belongs to the field of usage statistics: personalization based on a single user’s behavior is carried out using implicit data from their surfing behavior; explicit data can also be obtained via ratings (either by the user or by their contacts in a social networking site).
The essence of personalized search results is that users are shown different results or a different ranking based on their profile. This is intended to improve the quality of the results, as the order of the results is changed so that those results that have been determined to be particularly relevant for the individual user are displayed preferentially. These can be results that this user has already viewed in the past (assuming that a result that was useful in the past will be useful again in the future) or results that fit well with a thematic search profile based on the user’s past behavior. For example, a user who has frequently searched for animals in the past might prefer top results for the query jaguar to be about the animal and not about the car brand of the same name.
Personalization methods are also particularly suitable for showing results or suggestions to users without them having entered any queries themselves (so-called implicit queries; see Sect. 4.5.1). Especially in the context of the shift in search away from the pure processing of explicit queries toward “personal assistants” (White, 2016, p. 61ff.), these methods play an essential role (see also Guha et al., 2015; Weare, 2009).
In Sect. 5.3.2, we have already seen how data can be obtained for usage statistics. The same data can, of course, be used for personalization. An early patent, filed in 1999 and granted in 2001 (Culliss, 2001), describes which explicit (i.e., self-reported by the user) and implicit data can be used for personalization.
Of course, being mentioned in a patent does not mean that all this data is actually collected and processed by an existing search engine; however, to illustrate the detailed profiles that search engines can create, the complete list is given here: Demographic data includes, but is not limited to, items such as age, gender, geographic location, country, city, state, zip code, income level, height, weight, race, creed, religion, sexual orientation, political orientation, country of origin, education level, criminal history, or health. Psychographic data is any data about attitudes, values, lifestyles, and opinions derived from demographic or other data about users. Personal interest data includes items such as interests, hobbies, sports, profession or employment, areas of skill, areas of expert opinion, areas of deficiency, political orientation, or habits. Personal activity data includes data about past actions of the user, such as reading habits, viewing habits, searching habits, previous articles displayed or selected, previous search requests entered, previous or current site visits, key terms utilized within previous search requests, and time or date of any previous activity. (Culliss, 2001, p. 3)
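The jaguar example above can be sketched in a few lines. This is a minimal illustration, not a description of how any particular search engine implements personalization: the `Result` fields, the topic labels, the `topic_affinity` profile, and the `weight` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    topic: str         # hypothetical topic label assigned at indexing time
    base_score: float  # query-dependent score before personalization


def personalize(results, topic_affinity, weight=0.5):
    """Reorder results by boosting topics the user has shown interest in.

    topic_affinity maps a topic label to a value in [0, 1] derived from the
    user's past behavior. No result is removed; only the order changes.
    """
    def score(result):
        boost = 1.0 + weight * topic_affinity.get(result.topic, 0.0)
        return result.base_score * boost

    return sorted(results, key=score, reverse=True)


# Toy result set for the query "jaguar" (URLs and scores are made up).
results = [
    Result("https://example.com/jaguar-cars", "cars", 0.90),
    Result("https://example.com/jaguar-animal", "animals", 0.80),
    Result("https://example.com/jaguar-os", "software", 0.40),
]
# Assumed profile of a user who has mostly searched for animals.
profile = {"animals": 0.9, "cars": 0.1}
ranked = personalize(results, profile)
```

With the assumed profile, the page about the animal moves ahead of the page about the car brand, while all three results stay in the list. Without a profile, the original order is kept.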
Personalization of search results is often seen critically. In addition to creating personality profiles, it would lead to search engine results primarily confirming one’s opinion and suppressing others. Moreover, discoveries that would have been possible in non-personalized search results can be made impossible or at least less likely. For example, Eli Pariser’s criticism in his book The Filter Bubble (Pariser, 2011) acknowledges that users have always selected media according to their tastes. Still, personalization in search engines (and social networking sites) arguably adds a new
dimension: The filter bubble is characterized firstly by the fact that it is individually adapted to each user, so that each user sees different results; secondly by the fact that it remains invisible to the user; and thirdly by the fact that users cannot decide for or against personalization of their results because the search engines apply these procedures without asking. Empirical evidence, however, tends to show that actual personalization takes place much less than assumed in theory (Stark et al., 2021; Krafft et al., 2019). These findings are at least partially limited in their significance, above all by the selection of the queries used and by a possible overlap of personalization effects with the effects of localization or contextualization. It is undisputed, though, that the personalization of search results can lead to a far better quality of results, since it is precisely the adaptation to the individual user that can filter out many results that may be relevant to the masses but not to the individual user. Individual customization of search results is also primarily considered positive by users, even if the collection of personal data required for this is seen critically (Jürgens et al., 2014, p. 107f.). In the discussion about personalization, there is often the misunderstanding that search engines no longer show certain results to certain users. In fact, the result set is only rearranged; potentially, all results remain visible. Of course, results that appear far back in the result list are hardly noticed by users, but this effect occurs with all forms of ranking and is not specific to personalization (see Chap. 15). With the discussion of the filter bubble, the debate has shifted from criticism of search engines for displaying superficial results geared to the masses to criticism of them being (overly) tailored to the individual user. 
Little consideration is given, however, to the fact that there could be other forms of personalization that are not necessarily based on collecting as much data as possible from each user but instead use only selected data for a narrowly defined period of time. Likewise, algorithms could be created that ensure that users are not always shown the same results but, on the contrary, new and inspiring ones.
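A profile that uses only selected data from a narrowly defined period could look roughly like the following sketch. It is an assumption-laden illustration of the idea, not an existing system: the event format, the seven-day window, and the function name `recent_profile` are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_profile(events, now, window=timedelta(days=7)):
    """Build topic affinities from interactions inside a short time window.

    events is a list of (timestamp, topic) pairs. Events older than the
    window are ignored, so the profile never becomes a long-term record.
    """
    counts = Counter(
        topic for ts, topic in events
        if timedelta(0) <= now - ts <= window
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {topic: n / total for topic, n in counts.items()}


now = datetime(2023, 1, 10)
events = [
    (datetime(2023, 1, 9), "animals"),
    (datetime(2023, 1, 8), "animals"),
    (datetime(2023, 1, 7), "cars"),
    (datetime(2022, 6, 1), "cars"),   # outside the window, ignored
]
profile = recent_profile(events, now)
```

In this toy case, the interaction from June 2022 no longer influences the profile; only the three events from the preceding week are counted.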
5.7 Technical Ranking Factors
In addition to the content-related factors described above, so-called technical factors also play a role in ranking search results. Since documents are spread across the Web and search engines link to third-party documents, search engines have no influence on how documents are designed or how quickly they can be retrieved. Therefore, such factors are measured and incorporated into the ranking. The most important technical factor is the page loading speed, that is, how fast the document loads when a user clicks on it in the result list. By now, we are used to pages loading within seconds, and a longer loading time often leads users to abort the search or return to the result list. Therefore, search engines prefer documents that load quickly. Another technical factor is whether the documents work well on mobile devices. After all, search engines are by no means only used on desktop computers or laptops,
and it is therefore important that the documents they convey are also easy to read on the device currently being used. This can also be measured and is included in the ranking, and the search engine result lists can therefore differ for the same query simply because of the device used. The secure transmission of content via HTTPS can also play a role in the ranking, even if this factor is weighted relatively low. Finally, the device used should be mentioned as an influence on the ranking: For example, it may make sense to show a Mac user different search results (or a different ranking) than a PC user, especially when it comes to specific queries that implicitly refer to the respective computer type. Improving websites in terms of technical ranking factors is also a topic in search engine optimization (see Chap. 9); there, they form the basis of “technical operability,” on which other measures can then be based. What is particularly interesting about the technical ranking factors is that, in contrast to the other ranking factors, they are not aimed at the quality of the document’s content but at the user experience. This makes it clear that under certain circumstances, documents are disadvantaged in the ranking that would be very appropriate on the content level but have deficiencies on the technical level.
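How technical factors might enter a ranking can be illustrated with a small sketch. The function below turns loading time, mobile usability, and HTTPS into a multiplier on a content-based score. All thresholds and weights are invented for the illustration; the text only states that loading speed is the most important technical factor and that HTTPS is weighted relatively low.

```python
def technical_adjustment(load_time_s, mobile_friendly, uses_https,
                         on_mobile_device):
    """Return a multiplier for a document's content-based score.

    All thresholds and weights are illustrative assumptions.
    """
    factor = 1.0
    # Slow pages are penalized: no penalty up to 2 s, then a linear decline
    # that bottoms out at a 30 % reduction for pages taking 8 s or more.
    if load_time_s > 2.0:
        factor *= 1.0 - 0.3 * min((load_time_s - 2.0) / 6.0, 1.0)
    # Mobile usability only matters when the query comes from a mobile
    # device, which is why result lists can differ by device.
    if on_mobile_device and not mobile_friendly:
        factor *= 0.8
    # HTTPS is given a deliberately small weight.
    if uses_https:
        factor *= 1.02
    return factor


# Two hypothetical documents, both requested from a mobile device.
strong_but_slow = 0.9 * technical_adjustment(9.0, False, True, True)
weaker_but_fast = 0.6 * technical_adjustment(0.8, True, True, True)
```

Under these assumed weights, the document with the better content score ends up behind the technically sound one, which is the disadvantage described in the paragraph above.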
5.8 Ranking and Spam
The term spam is mainly known in the context of e-mail, where it refers to unsolicited messages with advertising content, often from dubious merchants. Search engine spam refers to all attempts to place irrelevant content in the result lists in such a way that users actually select it. Ever since search engines have existed, attempts have been made to manipulate documents so that they are listed in the top positions. These attempts range from simply improving a document by using relevant keywords in the text, through technical measures that enable the document to be found in search engines (see Sect. 5.7 and Chap. 9), to preparing obviously irrelevant documents in order to deceive the search engines. While the first two cases are part of search engine optimization (see Chap. 9), in the latter case, we speak of spam. In the discussion of text statistics, we have already seen that they can easily be spammed by frequently repeating relevant keywords in the text or placing them in a prominent position. A standard method in the early days of search engines was so-called keyword stuffing, i.e., the excessive use of a particular word to achieve a high search engine ranking. Today, such simple methods no longer work, and one can also interpret the development of ranking factors described above as a reaction to spamming attempts. All the factors aim to improve the quality or trustworthiness of the search results, which should make it increasingly difficult to spam search engines. And indeed, the “golden age” of search engine spam may be over (see Brunton, 2013). Unfortunately, spammers’ methods have also become more elaborate, so mass spam attempts continue (see Google Search Central, 2020).
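Why pure text statistics are easy to spam, and why keyword stuffing is equally easy to spot, can be shown with a simple term density measure. The sketch below is only an illustration of the principle; the tokenization and the example texts are assumptions, and real spam detection relies on far more signals.

```python
import re


def term_density(text, term):
    """Share of all tokens in the text that are equal to the given term."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return tokens.count(term.lower()) / len(tokens)


honest = ("Cheap flights to Hamburg with daily departures "
          "from many European airports")
stuffed = ("cheap flights cheap flights cheap cheap tickets "
           "cheap cheap flights cheap")

honest_density = term_density(honest, "cheap")
stuffed_density = term_density(stuffed, "cheap")
```

A ranking that rewards raw term frequency would favor the stuffed text, whereas an implausibly high density can serve as a warning sign: in the stuffed example, more than half of all tokens are the same word.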
Fig. 5.18 Original (left) and copy of a Wikipedia article enriched with advertising (January 7, 2021)
Search engines already exclude documents classified as spam during crawling or indexing (see Sect. 3.3). In this case, the documents do not get into the searchable index, and if an entire website has been classified as spam, a large number of documents need not be crawled at all. This makes search engines more efficient and saves enormous costs in the crawling process. However, it also means that users cannot find documents erroneously classified as spam. It is difficult to determine from the outside which documents are wrongly excluded, as the boundaries between documents that are still at least potentially relevant and ones that are definitely spam are fluid, and the criteria applied are not made transparent by search engines. Even today, spam documents still find their way into search engines’ databases. A simple example can show this: There are numerous copies of Wikipedia that are of no additional use compared to the original, as they merely present the same text but enrich it with advertising. Nevertheless, these documents are often found in search engines, as Fig. 5.18 shows. One can easily find these documents by copying a long sentence from a Wikipedia article and using phrase search (see Sect. 12.5) in a search engine to display all the documents that contain that sentence. The exclusion of spam documents in the crawling or indexing phase contrasts with the handling of spam within the ranking. On the one hand, it cannot be ruled out that spam documents have entered the index after all, and on the other hand, documents suspected of being spam may, in some cases, be relevant search results nevertheless. For ranking purposes, this means that (due to the factors described above) documents of the highest possible quality are ranked higher, and only in cases where there are no high-quality documents (or the general quality level of a result set is low) are such documents displayed. 
Thus, there is no separate handling of spam documents in the ranking since it is first assumed that such documents have already been excluded in indexing. Once documents are included in the index, they are all treated according to the same quality factors. The fight against spam in search engines is carried out both by automated procedures and by human quality raters. It is unlikely that complete automation of the fight against spam can be achieved.
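The phrase search trick for finding copies, mentioned above, amounts to an exact match on a normalized word sequence. The following sketch applies it to a handful of toy texts. The document identifiers and texts are invented, and a real search engine would answer such a query from its index rather than by scanning every document.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to single-space-separated words."""
    return " ".join(re.findall(r"\w+", text.lower()))


def find_copies(sentence, documents):
    """Return the ids of documents that contain the exact word sequence."""
    # Padding with spaces ensures that only whole words can match.
    needle = f" {normalize(sentence)} "
    return [
        doc_id for doc_id, text in documents.items()
        if needle in f" {normalize(text)} "
    ]


documents = {
    "original": "The jaguar is a large cat species. It is native to the Americas.",
    "copy-with-ads": "BUY NOW! The jaguar is a large cat species. "
                     "It is native to the Americas. Click here!",
    "unrelated": "The Jaguar E-Type is a British sports car.",
}
copies = find_copies("The jaguar is a large cat species.", documents)
```

The match returns the original and the copy enriched with advertising but not the unrelated page. The longer the copied sentence, the less likely an accidental match becomes.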
5.9 Summary
Ranking is a central element of search engines and is carried out using various criteria that can be divided into six areas: text-specific factors, popularity, freshness, locality, personalization, and technical ranking factors. Search engines form their individual ranking by weighing the factors. Text statistics are primarily used to determine potentially relevant documents. Based on this, the other groups of ranking factors serve mainly to evaluate quality, which is necessary because documents on the Web differ considerably in terms of their quality.
Using text statistics, the texts themselves are assessed according to the extent to which they match a query. Link-based rankings, on the other hand, try to infer the quality of individual documents from the structure of the Web, which is expressed in the links between the documents. The most well-known method in this area is Google’s PageRank, which assigns a static value to each document. Like link-based rankings, usage-based rankings measure the popularity of documents, but instead of measuring the popularity among Web authors, they assess the popularity among users. A special form of usage statistics is personalization, in which the data of each individual user is collected and analyzed to improve search results.
Document freshness is a ranking factor in its own right. Since the contents of the Web and the interests of users change quickly, search engines must be able to react accordingly. Likewise, the location of a requesting user and the location of the documents (in terms of the location of the described content) play a role. Especially in the context of searching from mobile devices, locality has taken on an important role as a ranking factor. Finally, technical ranking factors consider whether documents can be retrieved quickly and without problems.
Further Reading
Detailed information on ranking procedures can be found in all well-known information retrieval textbooks, for example, Manning et al. (2008) and Croft et al. (2010). In addition, the widely known books on search engine optimization, e.g., Enge et al. (2015), are also helpful here. Although the paper in which the Google search engine was first introduced (Brin & Page, 1998) is somewhat outdated, it is still worth reading. For those who want to delve deeper into the basics of relevance, the article by Mizzaro (1997) and the book by Saracevic (2016) are recommended: Both make clear that relevance in the context of information retrieval can be understood in very different ways and that search engines are always based on a particular relevance model.
References
Acharya, A., Cutts, M., Dean, J., Haahr, P., Henzinger, M., Hoelzle, U., . . . Tong, S. (2005). Information retrieval based on historical data. US Patent No. US 7,346,839 B2.
Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117. https://doi.org/10.1016/s0169-7552(98)00110-x
Brunton, F. (2013). Spam: A shadow history of the internet. MIT Press.
Croft, W. B., Metzler, D., & Strohman, T. (2010). Search Engines: Information retrieval in practice. Pearson.
Culliss, G. (2001). Personalized search methods. US Patent 6,182,068 B1.
Enge, E., Spencer, S., & Stricchiola, J. (2015). The art of SEO: Mastering search engine optimization (3rd ed.). O’Reilly.
Google. (2021a). Find & control your Web & App Activity. https://support.google.com/websearch/answer/54068?hl=en
Google. (2021b). Privacy policy. https://policies.google.com/privacy?hl=en
Google Search Central. (2020). How we fought Search spam on Google – Webspam Report 2019. https://developers.google.com/search/blog/2020/06/how-we-fought-search-spam-on-google
Guha, R., Gupta, V., Raghunathan, V., & Srikant, R. (2015). User modeling for a personal assistant. In WSDM 2015: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 275–284). https://doi.org/10.1145/2684822.2685309
Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and Development in Information Retrieval (pp. 154–161). ACM. https://doi.org/10.1145/1076034.1076063
Jones, R., Zhang, W., Rey, B., Jhala, P., & Stipp, E. (2008). Geographic intention and modification in web search. International Journal of Geographical Information Science, 22(3), 229–246. https://doi.org/10.1080/13658810701626186
Jürgens, P., Stark, B., & Magin, M. (2014). Gefangen in der Filter Bubble? Search Engine Bias und Personalisierungsprozesse bei Suchmaschinen. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche (pp. 98–135). De Gruyter.
Kao, E. (2017). Making search results more local and relevant. https://www.blog.google/products/search/making-search-results-more-local-and-relevant/
Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.
Krafft, T. D., Gamer, M., & Zweig, K. A. (2019). What did you see? A study to measure personalization in Google’s search engine. EPJ Data Science, 8(1), 38. https://doi.org/10.1140/epjds/s13688-019-0217-5
Lewandowski, D. (2005). Web Information Retrieval: Technologien zur Informationssuche im Internet. Deutsche Gesellschaft f. Informationswissenschaft u. Informationspraxis.
Lewandowski, D. (2008). The retrieval effectiveness of web search engines: Considering results descriptions. Journal of Documentation, 64(6), 915–937. https://doi.org/10.1108/00220410810912451
Lewandowski, D. (2011). Query understanding. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 55–75). Akademische Verlagsgesellschaft AKA.
Lewandowski, D. (2012). Credibility in web search engines. In M. Folk & S. Apostel (Eds.), Online credibility and digital ethos: Evaluating computer-mediated communication (pp. 131–146). IGI Global.
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 159–165. https://doi.org/10.1147/rd.22.0159
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Mizzaro, S. (1997). Relevance: The whole history. Journal of the American Society for Information Science, 48(9), 810–832. https://doi.org/10.1002/(sici)1097-4571(199709)48:9<810::aid-asi6>3.0.co;2-u
Nadella, S. (2010). New signals in search: The bing social layer. http://blogs.bing.com/search/2010/10/13/new-signals-in-search-the-bing-social-layer/
Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab. http://ilpubs.stanford.edu:8090/422
Pariser, E. (2011). The filter bubble: What the internet is hiding from you. Viking.
Riemer, K., & Brüggemann, F. (2006). Personalisation of eSearch services – Concepts, techniques, and market overview. http://aisel.aisnet.org/bled2006/3
Riemer, K., & Totz, C. (2003). The many faces of personalization. In M. M. Tseng & F. T. Piller (Eds.), The customer centric enterprise: Advances in mass customization and personalization (pp. 35–50). Springer. https://doi.org/10.1007/978-3-642-55460-5_3
Saracevic, T. (2016). The notion of relevance in information science: Everybody knows what the relevance is. But what is it really? Morgan & Claypool Publishers. https://doi.org/10.2200/s00723ed1v01y201607icr050
Schurman, E., & Brutlag, J. (2009). Performance related changes and their user impact. In Velocity: Web Performance and Operations Conference.
Stark, B., Magin, M., & Jürgens, P. (2021). Maßlos überschätzt. Ein Überblick über theoretische Annahmen und empirische Befunde zu Filterblasen und Echokammern. In M. Eisenegger, R. Blum, P. Ettinger, & M. Prinzing (Eds.), Digitaler Strukturwandel der Öffentlichkeit: Historische Verortung, Modelle und Konsequenzen. Springer VS. (Preprint). https://www.researchgate.net/profile/Melanie_Magin/publication/337316528_Masslos_uberschatzt_Ein_Uberblick_uber_theoretische_Annahmen_und_empirische_Befunde_zu_Filterblasen_und_Echokammern/
Stock, W. G., & Stock, M. (2013). Handbook of information science. De Gruyter Saur.
Sullivan, D. (2010). Dear Bing, We have 10,000 ranking signals to your 1,000. Love, Google. Search Engine Land. http://searchengineland.com/bing-10000-ranking-signals-google-55473
Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359(6380), 1146–1151. https://doi.org/10.1126/science.aap9559
Weare, C. B. (2009). System and method for personalized search. US Patent 7,599,916 B2.
White, R. (2016). Interactions with search systems. Cambridge University Press. https://doi.org/10.1017/cbo9781139525305
6 Vertical Search
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_6
In the previous chapters, we assumed that a search engine builds up a single database. Although in Sect. 3.3.3, the assembly of a search engine database from different sources has already been described, the reasons for this procedure and the structure of the collections have not yet been explained. In this chapter, the problems of general search engines are discussed, and we explain why vertical search engines were developed in the first place to make certain types of content easier to find. These vertical search engines were then integrated back into the general search engine (creating a process called universal search). In this chapter, we will look at the structure and content of these vertical collections and their integration into general search engines. Then, Chap. 7 will go into more detail on the presentation of the results from the collections.
Universal search results include results generated from vertical collections. These collections are databases of special content that either cover a sub-area of the Web or are compiled separately. The collections are usually also searchable individually; in that case, we speak of vertical search engines. The term vertical search engine is also often used because, in contrast to general search engines, certain content is indexed more completely and in greater depth. The idea of universal search as a combination of results from different collections arose partly because users hardly notice the numerous vertical search engines offered by search engines; this is known as tab blindness (Sullivan, 2003).
One example of a collection consisting of a subsection of the Web is the news collection. Although news from the Web is available as standard HTML pages and can therefore be included in the regular Web Index without any problems, due to the frequent updates, it makes sense to build up a news collection of its own, which can be inspected for new content at very short intervals (see Sect. 3.3.3). To do this, one must first determine which websites are news sources. Only this restricted number of sources will be crawled particularly frequently; the news documents will be ranked using custom procedures (see Sect. 6.3.1). By limiting vertical searches to clearly defined collections, indexing intervals and depth can be adapted to the corresponding needs. For example, in the case of news, due to the high demand for freshness, more frequent indexing is necessary (and also possible due to the far smaller quantity of documents) than in the creation of the general Web Index (for more details on news search engines, see Sect. 6.3.1).
One example of a collection built from content beyond HTML pages from the Web is the local search results database. This is the “yellow pages” of the search engines; the entries are not generated from the Web alone but are based on structured data that is then enriched with data from the Web. On the search engine result pages (SERPs), these entries are then usually combined with a map showing the locations of the local results.
Figure 6.1 shows exemplarily how a search engine accesses different collections, even if the search engine presents itself as one unit to the users. The different proportions in the figure are intended to illustrate that the Web Index is by far the largest data set, supplemented by the other, much smaller collections. While the Web Index aims to cover the Web as fully as possible, the vertical search engines focus on specific topics and are compiled from limited sets of sources that have been selected in advance.
Fig. 6.1 Structure of a search engine: various collections (“vertical search engines”) – exemplary but not exhaustive [diagram: a search box above the collections News, Images, Videos, and Web]
Vertical collections show that not (only) the content generally accessible on the Web is used (or a selection of it; see Lewandowski, 2009, p. 57f.), as in Web search, but also content that is not (yet) available on the Web can be accessed. This is the case, for example, with Google Book Search, for which, among other things, printed books are scanned and thus only made available as Web content as part of this collection. We refer to the division of Google services by Vaidhyanathan (2011) mentioned in Sect. 3.2. He divides Google’s services into three areas: (1) scan and link, (2) host and serve, and (3) scan and serve. General search engines, which ostensibly serve only the first area, are expected to be active in all three areas. The advantage of hosting content (area 2) is that extensive
metadata (from author verification to usage data) accrues, and the search engine operating the service has exclusive access to this additional data. With the services mentioned under area 3, exclusive collections are built up by the search engine provider, which are not accessible to other search engines or only with restrictions. It is clear from the above that today a search engine is required to do much more than crawl documents from the Web and collect them in a uniform manner. Instead, it is a matter of the complex construction of document collections, which are then brought together again in search. However, as we will see in Chap. 7, this expansion of the data sets, which also has a considerable impact on the presentation of search results, is hardly noticed by the users in its effects.
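The idea of separate collections that are brought together again in search can be sketched as follows. This is a toy model under stated assumptions, not a description of a real universal search system: the collection names, the term-overlap scoring in `make_search`, and the `per_vertical` limit are all hypothetical. Real systems also decide per query whether a vertical is shown at all, which is not modeled here.

```python
def make_search(docs):
    """Build a toy search function over {url: text} using term overlap."""
    def search(query):
        terms = set(query.lower().split())
        hits = []
        for url, text in docs.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                hits.append((overlap, url))
        return hits
    return search


def universal_search(query, collections, per_vertical=2):
    """Query every collection and group the best hits by collection.

    The Web Index contributes its full result list; each vertical
    collection contributes only a small block of top results.
    """
    page = {}
    for name, search in collections.items():
        hits = sorted(search(query), reverse=True)
        limit = None if name == "web" else per_vertical
        page[name] = [url for _, url in hits[:limit]]
    return page


collections = {
    "web": make_search({"w1": "hamburg city guide",
                        "w2": "berlin city guide"}),
    "news": make_search({"n1": "hamburg port strike",
                         "n2": "hamburg election",
                         "n3": "hamburg weather"}),
}
page = universal_search("hamburg", collections)
```

From the user's point of view, `page` is a single result page, although it was assembled from two independently built and independently searched collections.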
6.1 Vertical Search Engines as the Basis of Universal Search
In the discussion of tools for finding information on the Web (Chap. 2), vertical search engines were distinguished from universal search engines by their voluntary restriction to specific content. Accordingly, the definition for vertical search engines is that they “restrict themselves thematically or based on formal document characteristics (e.g., file type)” (Lewandowski, 2009, p. 57). The advantage of such a restriction is that the selected content can be covered more completely and accessed in greater depth. In addition, the search interface and result ranking can be adapted to individual user groups. In this section, vertical search engines are initially dealt with independently of their possible integration into a general search engine. Even though this integration has now become widely accepted and the major search engines each integrate their own vertical search engines (e.g., Google shows results from its own vertical search engines Google News, Google Maps, and Google Video), vertical search engines can certainly be set up and used as independent search engines. In practice, the distinction between horizontal and vertical search engines is also often used, with horizontal search engines covering the Web in its breadth and vertical search engines going into depth in specific areas. However, this subdivision is not very suitable for classifying vertical search engines since, for example, an image search engine would be considered a horizontal search engine (it covers images from the entire Web) but is nevertheless a vertical search engine (it is intentionally restricted to specific content). Vertical search engines fulfill tasks that general search engines cannot accomplish for various reasons. There are problems in four areas (see Lewandowski, 2009, p. 54ff.):
• Technical problems
• Financial hurdles
• Orientation toward a single-user model
• Problems of indexing
Technical problems arise first of all from the total quantity of documents available on the Web (see Chap. 3), which cannot be fully captured by any search engine. Furthermore, in crawling, priorities are set based on a measurement of the “general importance” of documents and not on the needs of specific topics or the needs of specific user groups. As a result, universal search engines are often even less complete and up to date in specific areas than when averages are considered across the entire data set. Further technical problems arise from the Deep Web, whose content is not or cannot be captured by the general search engines (see Chap. 14). The Deep Web includes content that cannot be captured by the search engines (at least at present) for technical reasons and content that one must pay to use. On the one hand, the problem with this paid content is that search engines cannot retrieve it; on the other hand, integrating paid content into the universal search engines would mean a paradigm shift away from providing free content. Financial hurdles are closely linked to technical problems. Again, the issue is to capture content from the Web in a complete and up-to-date manner. While this is already technically impossible, financial restrictions usually limit coverage even further. Search engine providers must balance the need to capture as much content as possible with the efficient use of financial resources. This may mean that Web content is only indexed up to a certain depth, as the inclusion of additional documents would not be economically viable. The gain from the additional documents would be too small compared to the investment required. 
The orientation toward a single-user model means that search engines that address a general user community, i.e., potentially everyone who searches the Web, must always consider the demands and needs of the assumed “average user.” Even if one considers that universal search engines are increasingly personalizing their offers (see Sect. 5.6), this personalization refers exclusively to the query interpretation and the ranking of search results and not to the search process itself or the search options. Universal search engines, for example, do not offer any subject-specific restriction options in the advanced search forms or on the search engine result pages. Subject-specific restrictions are also not possible in query formulation. For specialized Internet-based research, the search results are often trivial or too general unless they have been personalized by a significant number of topic-relevant queries in the past because search engines (have to) align to a non-specific user model. Here, vertical search engines that pursue more specific user models can offer an alternative. Moreover, the search functions of vertical search engines, such as Google’s image search, already show that they can be searched far more precisely than in the general search (see Sect. 12.5). Finally, a crucial problem with indexing is that universal search engines treat all documents in their index similarly. Therefore, the document representations (see Sect. 3.4) are also structured similarly. Although additional information is sometimes extracted from the documents and displayed in the snippets on the search result pages (see Sect. 7.4), this extracted information either comes from the vertical search engines integrated into the universal search engine or is rather general.
Ultimately, the indexing of documents in universal search engines consists mainly of the full text, the anchor texts referring to the document, and some additional information (see Sect. 3.4.2). As a result, specific information, some of which may already be found in structured form in documents from specific sources, is lost in the indexing by universal search engines. For example, universal search engines capture documents from extensive portals with scientific literature, but the structured names of authors, journals, and books are often lost (see Sect. 6.3.2 for more details). Vertical search engines try to overcome these problems by restricting themselves to a specific subject area, which they can cover more completely. They can index the documents more deeply or in a different way than the universal search engines depending on their thematic focus; the search options, the search process, and the presentation of results can be geared to the individual user group.
6.2
Types of Vertical Search Engines
Vertical search engines include all search engines that intentionally restrict themselves to a specific area of the Web. This does not mean, however, that all vertical search engines function according to the same principles. Instead, several types of vertical search engines have emerged, described below. In focused crawling (see Baeza-Yates & Ribeiro-Neto, 2011, p. 513), selected areas of the Surface Web are captured. This involves content that is, in principle, also indexed by universal search engines. The sources to be searched by such a vertical search engine are usually selected manually; this type of list of verified sources is also known as an allowlist. The advantage of a vertical search engine based on focused crawling can be that it can cover the pre-selected sources or areas of the Web more completely and, if necessary, explore the content more deeply. Furthermore, the restriction to a specific set of sources or an area of the Web can also lead to better results, as sources (websites) that are not relevant to the focus of the vertical search engine are excluded from the outset. An example of a search engine based solely on focused crawling is the children’s search engine Swiggle, which only captures documents from websites classified in advance by editors as being appropriate for children. The second type of vertical search engine is restricted to a specific document type. A document type can be determined technically (e.g., based on the different file formats for images) but also based on a genre (such as news text in news search engines like Google News). However, this does not result in a clear distinction from focused crawling since, in the case of news search, for example, focused crawling is carried out with the restriction to specific sources. 
Limitation to certain file formats, on the other hand, is not focused crawling since all documents on the Web must potentially be found to then search for those that are available in the corresponding file formats.

The third type of vertical search engine is a hybrid search engine. This combines selected content from the Deep Web with selected content from the Surface Web (usually captured via focused crawling). The best-known example of a hybrid search engine is Google Scholar, which will be discussed in detail in Sect. 6.3.2.

One particular case of vertical search engines should be mentioned here for completeness: so-called archive search engines. Archive search engines not only regularly collect content from the Web, but in contrast to conventional search engines, they also make it permanently available in the respective version (see Finnemann, 2019). Therefore, with these search engines, it is possible to look into the Web’s past, as the documents are not overwritten when they are re-indexed. Instead, each version of the document is saved individually and made searchable. The most important archive search engine is the Wayback Machine (https://archive.org/web/), which has collected more than 525 billion versions of documents from the Web over the years. However, the search is not carried out using keywords as with a conventional search engine; instead, one has to enter the URL of the page in question and is then taken to the archived versions of the document via a calendar view (Fig. 6.2). Since problems similar to those of conventional Web crawling occur here, the Wayback Machine archive is also by no means complete. Nevertheless, it is an excellent tool, and one without alternatives, if one wants to look into the Web’s past.

Fig. 6.2 Overview of archived versions of a URL in the Wayback Machine (December 4, 2020)

In the following sections, four selected vertical search engines are described in more detail. They are some of the best-known vertical search engines and illustrate the different technical approaches mentioned above.
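To make the allowlist idea behind focused crawling more concrete, the following minimal sketch shows how a crawler might decide whether a newly discovered link belongs to one of the manually selected sources. It is an illustration under stated assumptions, not the method of any particular search engine: the host names, the function names, and the rule that subdomains of an allowlisted site are accepted are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of manually verified sources (illustrative hosts only).
ALLOWLIST = frozenset({"example-kids-news.org", "example-museum.org"})


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_allowed(url: str, allowlist: frozenset[str] = ALLOWLIST) -> bool:
    """Accept a URL only if its host is an allowlisted site or a subdomain of one."""
    host = host_of(url)
    return any(host == site or host.endswith("." + site) for site in allowlist)


def filter_frontier(discovered_urls: list[str]) -> list[str]:
    """Keep only those newly discovered links that a focused crawler would follow."""
    return [url for url in discovered_urls if is_allowed(url)]
```

In such a design, every link extracted from a crawled page would pass through `filter_frontier` before being queued, so that sources outside the selected set are excluded from the outset.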
Creating Your Own Vertical Search Engine with Google Custom Search

With custom search (https://cse.google.com/), Google offers a simple way to create your own vertical search engines. These are based on manually compiling websites that can then be searched collectively. To create such a search engine, one only has to generate the list of websites to be searched. Google then provides a search box with which these websites can be searched. In the case of custom search engines, however, no custom crawling takes place; instead, Google’s database is simply used, and the search is restricted to the specified sources. One can therefore think of a custom search engine as a search with a query that is limited to a certain selection of sources. If we resort to the vocabulary of search described in Chap. 12, such a query might read, for example:

keyword AND (site:website1.com OR site:website2.com OR site:website3.com OR site:website4.com OR site:website5.com)

Creating a vertical search engine as a custom search engine can be useful and provides an easy way to search a specific set of sources. However, the creator of the vertical search engine has no influence on the ranking of the search results; Google only offers the choice between ranking by relevance and ranking by date (which is not possible in regular Google search).
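The source-restricted query shown above can also be assembled programmatically. The following sketch merely builds the query string from a list of sites; the function name and the example site list are assumptions for illustration, and the code does not call any Google service.

```python
def build_site_restricted_query(keywords: str, sites: list[str]) -> str:
    """Combine a keyword query with an OR-list of site: restrictions."""
    if not sites:
        return keywords  # nothing to restrict to
    site_clause = " OR ".join(f"site:{site}" for site in sites)
    return f"{keywords} AND ({site_clause})"


sources = ["website1.com", "website2.com", "website3.com"]
print(build_site_restricted_query("keyword", sources))
# keyword AND (site:website1.com OR site:website2.com OR site:website3.com)
```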
6.3
Collections
In the previous sections, the importance and structure of vertical search engines were described; now we will deal with concrete vertical search engines integrated into the general Web search. In Table 6.1, some collections of the Google search engine are listed as examples and briefly explained. Of course, all of these collections can be accessed directly (as vertical search engines proper), but results from these collections are also integrated into the result lists in Web search. This results in a particular diversity that cannot be achieved using a single index. The subsequent sections describe in more detail the different ways collections can be compiled, using four important collections as examples.

Table 6.1 Google collections (selection; adapted from Lewandowski, 2013, p. 499)
Area | URL | Description
News | https://news.google.com/ | Searches the content of selected news sources. During index creation, particular emphasis on updating the index frequently
Images | https://www.google.com/imghp | Searches images found on the free Web
Maps | https://www.google.com/maps | Combines proprietary map material with content from the free Web and reviews from Google’s own Places service
Books | https://books.google.com | Searches scanned books; distinguishes between public domain books, adapted in full text, and protected works, of which selected pages are displayed
Shopping | https://shopping.google.com | Product search based primarily on structured data supplied by retailers
Videos | https://www.google.com/videohp | Searches videos that are freely available on the Web on various platforms
Scholarly publications | https://scholar.google.com | Searches scientific articles and books made available either in repositories, on scientists’ and universities’ websites, or through publishers. Linking free content/versions and paid works

6.3.1
News

News search engines are one of the best-known examples of vertical search engines. Relatively early in the development of search engines, news search engines were created as supplements to general search engines (in Germany, e.g., these were Paperball and Paperboy; see also Dominikowski, 2013). Such search engines were first created as stand-alone vertical search engines that were linked to from the search engine’s homepage and result pages. With the advent of the universal search paradigm, news search results were incorporated directly into Web search results.

Google News is a search engine that uses focused crawling to capture the content of selected news websites. This means, on the one hand, that the sources that are regularly captured in crawling have been selected beforehand and, on the other hand, that a section of the Surface Web is captured, i.e., it does not include sources that the general search engines could not capture due to technical constraints. The advantage of the vertical search engine lies in its restriction to selected sources and, thus, its more up-to-date coverage.

Google no longer provides information on the number of sources indexed in Google News. The last official information dates from 2017, when a figure of more than 700 German news sources was given for the German version of Google News (Google, 2017). In addition, there are versions of Google News for many countries, each containing an adapted selection of news sources.

In this context, news sources are websites that provide news free of charge on the Internet. In many cases, news websites are operated by publishers who sell printed newspapers or magazines and have transferred their established brands to the Internet (e.g., Spiegel.de and Stern.de). However, in most cases, the websites do not contain the content of the printed editions (or, more precisely, the editions for sale, regardless of whether they are read as printed editions or electronically, e.g., on a tablet). News search engines are thus restricted to the free content offered on news websites; however, they can also refer to the increasingly emerging “plus” offers where individual articles on the websites must be paid for. From a news search
engine perspective, however, linking to such content is not very attractive, as users who are not willing to pay for news content are likely to be dissatisfied with the performance of the news search engine in such cases. Furthermore, news can only be found via news search engines as long as they are available on the Web. For example, if news items are deleted or moved to an archive for which a fee is charged, they will no longer be indexed by news search engines. Therefore, news search engines also do not form archives, meaning that they provide incomplete results at best for research going back further than a few weeks. Crawling is carried out based on the selected news websites (listed in an allowlist). Only documents originating from a previously defined source make it into the news index. This has significant advantages in terms of quality control. Although the quality of individual documents from a particular source may differ, the differences in quality are far smaller than in an open (i.e., non-restricted) crawl of the Web. Freshness plays a decisive role for news search engines: new news must be captured as quickly as possible after publication. This is made possible by the restriction to relatively few sources, which can then be queried at reasonably short intervals. The groups of ranking factors identified in Chap. 5 can also be applied to news search, except that they are given different weightings here. The main issue is a reasonable balance between content fit (text statistics) and freshness (see Long & Chang, 2014, p. 7). It stands to reason that freshness plays a special role in news. Other methods are therefore less suitable: For news, counting and ranking links is not particularly helpful because while news can certainly generate many links quickly (e.g., in social networking sites), the strengths of link-based rankings arise precisely where links accumulate over an extended period of time. 
On the other hand, news search engines have to update extremely quickly; therefore, “waiting” until a meaningful number of links has been generated is not practical. News search engines not only index documents but also try to cluster documents on the same topic. After all, repeating the same or similar stories in the result lists would not make sense. However, all of these items should still be available, as users may want to read a news item from a particular source, and minor differences or additions may play a role in news items. In addition, because of the restriction to a predefined set of sources, there is no danger of content from one source being copied by another for the purpose of spamming. When summarizing thematically similar messages, known as topic detection and tracking, not only must a one-time or static summary be created, but for new incoming messages, it must be decided whether they should be assigned to one of the existing clusters, whether a new cluster should be formed, or whether the message should not (yet) be assigned to a cluster (Stock & Stock, 2013, p. 366ff.). Figure 6.3 shows an example of clustering in Google News. First, a “main news item” was identified, which is displayed prominently. Then, closely related news items are shown in the cluster with their headline. Figure 6.4 shows the Google News homepage. In addition to the automatically generated and grouped headlines in the main column, the upper navigation shows the
sections typical for news offerings. Google News automatically assigns each article to a category. The selection of a category restricts the search and allows browsing through the news much like within news portals. It is also possible to display personalized content (“For you”) based on the user profile stored in the Google account (see Sect. 5.6). Another interesting option is to select the different country versions of Google News to display a news overview from the respective country.

News search engines are beneficial tools for news research, with no claim to completeness. However, one should be aware that search engines such as Google News can replace neither research in (electronic) newspaper archives nor professional press monitoring and analysis.

Taking news search engines as an example, it is easy to see how a relatively simple restriction in focused crawling can significantly improve the quality of the results for the searcher: Only news items are displayed, the documents found can be more up to date and more precisely indexed, and a suitable result presentation (including thematically related news items) is chosen for the specific content.

Fig. 6.3 Compilation of news on a topic using topic detection and tracking (example from Google News; January 22, 2021)

Fig. 6.4 Google News homepage (August 30, 2022)
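The incremental decision described in this section for topic detection and tracking (assigning an incoming news item to an existing cluster or opening a new one) can be sketched as follows. This is a minimal illustration and not the method used by Google News: the bag-of-words cosine similarity, the threshold of 0.3, and all class and function names are assumptions made for the example, and the third option mentioned in the text (leaving an item unassigned for the time being) is omitted for brevity.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lower-case bag-of-words vector of a headline or article text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TopicClusters:
    """Single-pass clustering: each new item joins the best cluster or starts one."""

    def __init__(self, threshold: float = 0.3):  # threshold is an arbitrary assumption
        self.threshold = threshold
        self.clusters: list[dict] = []  # each: {"centroid": Counter, "items": [str]}

    def add(self, text: str) -> int:
        """Assign text to a cluster and return the index of that cluster."""
        vec = term_vector(text)
        best_idx, best_sim = -1, 0.0
        for idx, cluster in enumerate(self.clusters):
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            self.clusters[best_idx]["items"].append(text)
            self.clusters[best_idx]["centroid"].update(vec)  # summed term counts
            return best_idx
        self.clusters.append({"centroid": Counter(vec), "items": [text]})
        return len(self.clusters) - 1
```

A result presentation such as the one in Fig. 6.3 would additionally have to pick a main item per cluster; that selection step is not modeled here.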
6.3.2
Scholarly Content
When searching for scholarly content, there are also certain factors to consider that distinguish the acquisition, indexing, and ranking of this content from that in other areas. The presentation in this section refers to the scholarly search engine Google Scholar (https://scholar.google.com). There have been various attempts in the past to establish other science search engines (here in the sense of search engines that search “the scientific Web”). Some (such as Elsevier’s Scirus) have been discontinued; currently, the most important competitor to Google Scholar is Microsoft Academic (https://academic.microsoft.com).

Google Scholar is a hybrid search engine that combines content from the open web with content from the Deep Web and makes it searchable in a single interface (see Chap. 14). The size of Google Scholar is estimated at 389 million documents (Gusenbauer, 2019). However, the problem of calculating the index size based on the number of results given by the search engine for selected queries must be taken into account here as well (see Sect. 3.1).

Google Scholar captures selected content from the Surface Web. In doing so, the search engine automatically identifies scholarly content. This is done primarily via the origin of the documents on “scholarly servers” (e.g., those of universities), the characteristics of the documents (structure of the document and presence of a bibliography), and the reference structures (references from other scholarly documents). Google Scholar is restricted to scientific literature, i.e., it does not simply capture by focused crawling all content on offer from scientific institutions, but rather specifically content that has been identified as scientific publications. Google Scholar mainly captures articles in PDF format but also includes books (from the vertical search engine Google Books). Google Scholar can index selected content from the Deep Web through cooperation with various scientific publishers.
This allows the crawlers to capture content behind a paywall (see Sect. 14.4). However, when a user clicks on one of these articles in the Google Scholar result list, they will be taken to a registration or payment form before they are taken to the content. Google Scholar has built up a comprehensive collection of scholarly literature that can be searched using a simple or advanced search. In principle, the content covers all subjects and all sources, although there is no directory of indexed sources. The indexing is not based on sources but on individual articles: when an article is indexed, the sources in the article’s bibliography are included in the index as well. The way Google Scholar captures content has some advantages for Internet-based research: documents are captured very early in Google Scholar, i.e., a scholarly document can theoretically be found from the moment it is available in any form on the Web. For example, if a scientist posts their latest paper on their website as a
so-called preprint well in advance of publication, the search engine will be able to find it long before it is officially published in a journal or book. Therefore, Google Scholar is an excellent way to search for current literature. In addition, Google Scholar can, in principle, record the entire scientific output. This means that, in contrast to many major scientific literature databases, Google Scholar is restricted to neither one subject area nor one particular type of document (e.g., journal articles or books).

A serious disadvantage of Google Scholar, however, is the unsystematic structure of its database caused by crawling. This means that its completeness can never be guaranteed. While in literature databases, journals are indexed “cover to cover” (i.e., all articles from each covered journal are indexed), and the limiting factor is more likely to be the number of journals indexed, with Google Scholar, there is no guarantee, even for well-known journals, that their contents are indexed completely (Lewandowski, 2007; Mayr & Walter, 2006; Halevi et al., 2017). However, since the aim of Internet-based research, in particular, is often to be exhaustive, there are severe disadvantages to conducting research exclusively using Google Scholar. Therefore, the search engine should be considered a valuable supplement to Internet-based research.

A further problem arises from the sometimes severe errors in the indexing of documents (Jacsó, 2008, 2011, p. 158f.; Halevi et al., 2017, p. 826). The most common errors are incorrect attribution of authors, titles, journal titles, and years. Figures 6.5 and 6.6 show two examples of these misassignments.

Fig. 6.5 Example of a misattribution in Google Scholar: author attribution (August 31, 2022)

Fig. 6.6 Example of a misattribution in Google Scholar: title information (September 13, 2022)

Google Scholar distinguishes between different document types, which are displayed differently in the results lists (Fig. 6.7):

1. Direct link to the full text: This is a reference to a document that Google Scholar has indexed in full text. It can be the version published by a publisher but also a preprint version from a researcher’s personal website (on document versioning, see below). If the user clicks on the document’s title, they will be taken directly to the article.
2. Book: Here, Google Scholar refers to its Google Books offering (http://books.google.com), a vertical search engine for books.
3. Normal reference: This is a reference without the full text of the document. For example, the document was listed in a subject-specific literature database and found there by Google Scholar, but no matching full text could be found. Users who click on such a document in the result list will be taken to the detailed display in the literature database.
4. Citation: This is a document that is known to Google Scholar only from a bibliography from another document, but for which no entry in a literature database and no full text could be found.

Fig. 6.7 Result display in Google Scholar (labeled result types: link to full text, book, normal reference, citation)

Similar to Google News, Google Scholar also performs clustering, but, in this case, based on a single article. Scientific articles are often available in several versions, such as the manuscript the author posted on their website, a preprint in an open-access repository, and the final, formally published version. One can now consider the different versions as duplicate content. It would also make no sense to display all these versions one after the other in the result list, so the different versions are grouped in clusters. In the citation count, too, the citations of all versions will then count for a single article.

Google Scholar has some peculiarities in the indexing of documents that illustrate the usefulness of vertical search engines: Although the documents are indexed in full text, as in Google’s Web search (if available), additional information is also recorded. This includes the names of the authors, the source of the article (i.e., the name of the journal or book, volume, etc.) and year of publication, as well as the number of citations. Similar to the PageRank method (see section “PageRank” in Chap. 5), the references pointing to a document are counted. In this case, however,
they are not links on the Web but references found in other academic papers. In academia, these accumulated references (citations) are considered an indicator of the importance of a work: the more frequently a work or author is cited, the higher their impact is considered to be (see Havemann, 2013). Google Scholar allows a quick and (in contrast to other sources in this field) free insight into the citations of a work: the number is already indicated within the snippets on the search engine result pages. The citations can also be used for Internet-based research. On the one hand, they make it easy to identify “classics” on a topic that have been cited particularly often. On the other hand, one can find further works relevant to a topic based on an already known article by searching for the known article and then clicking on the citations. This will bring up a list of all articles that have cited the source article.
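The grouping of versions and the pooled citation count described in this section can be illustrated with a small sketch. It is not Google Scholar’s actual procedure: matching versions by a normalized title alone is a deliberately crude assumption (real matching must cope with typos, changed titles, and the attribution errors discussed above), and the `Version` record, the function names, and the example URLs are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One retrievable version of a paper (manuscript, preprint, published)."""
    url: str
    title: str
    cites: tuple[str, ...] = ()  # titles listed in this version's bibliography


def work_key(title: str) -> str:
    """Crude matching key (assumption): lower-cased title without punctuation."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def group_versions(versions: list[Version]) -> dict[str, list[Version]]:
    """Cluster all versions that share the same work key."""
    groups: dict[str, list[Version]] = defaultdict(list)
    for version in versions:
        groups[work_key(version.title)].append(version)
    return dict(groups)


def citing_works(versions: list[Version]) -> dict[str, set[str]]:
    """Map each cited work to the set of distinct works that cite it.

    Citations found in any version count for the work as a whole, and a
    work that cites the same target in several of its versions counts once.
    """
    cited_by: dict[str, set[str]] = defaultdict(set)
    for version in versions:
        source = work_key(version.title)
        for cited_title in version.cites:
            target = work_key(cited_title)
            if target != source:  # ignore self-references between versions
                cited_by[target].add(source)
    return dict(cited_by)


def citation_counts(versions: list[Version]) -> dict[str, int]:
    """Number of distinct citing works per cited work."""
    return {work: len(sources) for work, sources in citing_works(versions).items()}
```

The sets returned by `citing_works` correspond to the list one reaches by clicking on the citations of a result, while `citation_counts` corresponds to the number shown in the snippet.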
6.3.3
Images
Google’s image search is offered as an independent search engine (http://www.google.com/imghp; for the advanced search, see Sect. 12.5); as with the other collections, however, its integration into the general search is of primary importance. The need for a separate image search engine arises primarily from the particular nature of images: since they are non-textual documents, they have to be indexed differently from texts. Although the texts surrounding the images are evaluated to a large extent to index the images (see Sect. 3.4.1), information is also extracted from the images themselves, which can be used for presentation and targeted searches (see also Sect. 12.5).

A notable characteristic of crawling images is that it is not a separate crawl, as is the case with other vertical crawls, but rather the discovery of images is a by-product of the general Web crawl. If image files are found, they are indexed separately. In addition to the analysis of the surrounding text, metadata that is provided with the images (e.g., title, alternative text, creation date), as well as information from the images themselves (e.g., distribution of colors in the image), is used to create the representation of the images.

There are several ways to search in image search:

• Entering a textual query: As in regular search, a query can be entered as text; the result will be images deemed suitable based on the surrounding texts.
• Searching for an image URL: If the URL of an image is known, other versions of the image (e.g., in a higher resolution), documents containing the image, and visually similar images are displayed as a result. In addition, a textual query matching the image is formed from the documents found.
• Searching by reference document: Here, an image is uploaded; the result is displayed the same way as when searching for an image URL.
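As a rough illustration of how the distribution of colors could enter an image representation and support a search for visually similar images, the following sketch computes a coarse color histogram and compares two histograms. It is one simple textbook technique and not a description of Google’s image search; the pixel format (a list of RGB triples with values from 0 to 255), the bin count, and the function names are assumptions.

```python
from collections import Counter

Pixel = tuple[int, int, int]  # (red, green, blue), each value 0..255


def color_histogram(pixels: list[Pixel], bins_per_channel: int = 4) -> dict[tuple[int, int, int], float]:
    """Normalized histogram over quantized RGB colors.

    bins_per_channel is assumed to divide 256 evenly (e.g., 2, 4, 8).
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


def histogram_intersection(h1: dict, h2: dict) -> float:
    """Similarity in [0, 1]: the probability mass shared by two histograms."""
    return sum(min(share, h2.get(bucket, 0.0)) for bucket, share in h1.items())
```

Such a feature would complement, not replace, the surrounding text and the metadata mentioned above, since two images with similar colors can show entirely different things.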
6.3.4
Videos
Google runs YouTube, the world’s largest portal for videos. However, it is less well known that Google also operates its own video search engine (http://www.google.com/videohp), which searches much more than just YouTube. Similar to image search, video files found in the general Web crawl are also indexed separately. Analogous to the image search, the videos are indexed via surrounding texts and metadata. This metadata can, in turn, be exploited in search. Although video search does not have its own advanced search form, various restrictions can be made on the search engine result page after submitting a query. Among other things, it is possible to limit the search according to picture quality, freshness, source, and the presence of subtitles. This again shows that it is worthwhile not only to consider the special collections as supplements to general search but to regard them as search tools in their own right (see also Chap. 12).
6.4
Integrating Vertical Search Engines into Universal Search
In the previous sections, we showed what role vertical collections play for search engines, how they are constructed, and what content-specific collections bring to universal search. Now we will deal with the practical integration of this content.

Fig. 6.8 From query to a news item (diagram labels: Web search, News search, Search (navi), Search (info), Web SERP, Organic results, Universal Search container, News SERP, Browsing, News website (homepage), News website (article))

Figure 6.8 uses the example of a news search to show how a user can reach a news result. Again, it is assumed that the news search engine is integrated into a general search engine as part of universal search. If the user reaches the news item through the search at all (and does not directly access a news portal), there are two ways: either through the general Web search or by directly accessing a news search engine. The right-hand side of the figure shows
how a user who has first selected a news search engine can proceed. On the one hand, they can browse through the news items already (automatically) compiled by the search engine and select one that suits them. They are then taken directly to the article on an external news website. The other option is an informational search, which leads to a news search engine result page where the user selects a result and is taken to the article of interest. If, on the other hand, a user uses the general search engine (in the figure on the left), there are many more ways to get to a news article. On the one hand, the user can search in a navigational way (i.e., for a specific news site such as Spiegel.de) and then get directly to the desired website via the SERP, where they can again browse or search. If, on the other hand, a user searches in an informational way, they can find what they are looking for either on the SERP in the Web results (these can also contain news) or within a universal search container that presents results from the news search (see Sect. 7.3.3). In both cases, selecting a result leads directly to the corresponding news article. The headline of the universal search container is a special case. If one clicks on it, one is taken to a search engine result page on which further results on the topic are presented. The user thus switches from general search to news search. There again, a result can be selected. The process outlined makes it clear that even if a general search engine provides a vertical search engine as an additional service, users do not necessarily have to actively select this vertical search engine to access its content. Instead, there are various ways to get to the content via the universal search integration, and one can assume that far more users get to the corresponding content in this way. 
Reasons include a lack of awareness of the usefulness of selecting a specialized search engine before conducting Internet-based research, convenience (one gets to the content one is looking for), and a lack of awareness of the search engines’ options (so-called tab blindness; see Sullivan, 2003). By integrating special searches into general Web search, general search engines point users to their own offerings. These are thus given a preferred presentation compared to competing offers that cover the same subject area. The most prominent example is Google’s product search (Google Shopping): The preferential display of these search results earned Google an antitrust fine of 2.42 billion euros from the European Commission (European Commission, 2017). However, this is probably only the beginning of the discussion about the preferential display of certain search results (see Chap. 15).
6.5
Summary
Vertical search engines are search engines that are intentionally restricted to specific areas of the Web. Their strengths lie in greater completeness of the content covered within their area of specialization, in deeper indexing, in adapting the ranking to a specific user group, and in adapting the presentation of results and user guidance to
this target group. In this way, they counter some of the problems of general search engines, which cannot capture the Web content entirely and in real time and must adapt to the model of an average user.

Vertical search engines play a role in the context of Web search primarily through their integration into general search. For example, in what is known as universal search, the results from Web search are combined with those from various vertical search engines and displayed on a single search engine result page. The well-known search engines integrate many vertical search engines into their general Web search. Important collections are news, images, videos, and scholarly content. Vertical search engines are often hardly perceived as such by the users but instead used in the context of their integration into universal search. It turns out that this often creates many paths to a document, which in turn benefits the users.

Further Reading

Unfortunately, there are no books that offer a comprehensive and in-depth account of vertical search engines and their integration into universal search. However, the well-known research handbooks by Bradley (2017) and Hock (2013) contain extensive references to vertical search engines as tools for Internet-based research. A specialized work that deals with ranking different content in vertical search engines on a technical level is the book by Long and Chang (2014).
References
Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern information retrieval: The concepts and technology behind search. Addison Wesley.
Bradley, P. (2017). Expert internet searching (5th ed.). Facet Publishing. https://doi.org/10.29085/9781783302499
Dominikowski, T. (2013). Zur Geschichte der Websuchmaschinen in Deutschland. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 3: Suchmaschinen zwischen Technik und Gesellschaft (pp. 3–34). Akademische Verlagsgesellschaft AKA.
European Commission. (2017). Antitrust: Commission fines Google €2.42 billion for abusing dominance as search engine by giving illegal advantage to own comparison shopping service – Factsheet. http://europa.eu/rapid/press-release_MEMO-17-1785_en.htm
Finnemann, N. O. (2019). Web Archive. Knowledge Organization, 46(1), 47–70. https://doi.org/10.5771/0943-7444-2019-1-47
Google. (2017). Alles über Google News. https://web.archive.org/web/20170321083046/https://www.google.de/intl/de_de/about_google_news.html
Gusenbauer, M. (2019). Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics, 118(1), 177–214. https://doi.org/10.1007/s11192-018-2958-5
Halevi, G., Moed, H., & Bar-Ilan, J. (2017). Suitability of Google Scholar as a source of scientific information and as a source of data for scientific evaluation – Review of the Literature. Journal of Informetrics, 11(3), 823–834. https://doi.org/10.1016/j.joi.2017.06.005
Havemann, F. (2013). Methoden der Informetrie. In K. Umlauf, S. Fühles-Ubach, & M. Seadle (Eds.), Handbuch Methoden der Bibliotheks- und Informationswissenschaft: Bibliotheks-, Benutzerforschung, Informationsanalyse (pp. 338–367). De Gruyter.
Hock, R. (2013). The extreme searcher’s internet handbook: A guide for the serious searcher (3rd ed.). Information Today.
Jacsó, P. (2008). Google Scholar revisited. Online Information Review, 32(1), 102–114.
Jacsó, P. (2011). Google Scholar duped and deduped – the aura of “robometrics”. Online Information Review, 35(1), 154–160. https://doi.org/10.1108/14684521111113632
Lewandowski, D. (2007). Nachweis deutschsprachiger bibliotheks- und informationswissenschaftlicher Aufsätze in Google Scholar. Information Wissenschaft Und Praxis, 58, 165–168.
Lewandowski, D. (2009). Spezialsuchmaschinen. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen (pp. 53–69). AKA.
Lewandowski, D. (2013). Suchmaschinen. In R. Kuhlen, W. Semar, & D. Strauch (Eds.), Grundlagen der praktischen Information und Dokumentation (6th ed., pp. 495–508). De Gruyter.
Long, B., & Chang, Y. (Eds.). (2014). Relevance ranking for vertical search engines. Morgan Kaufmann.
Mayr, P., & Walter, A.-K. (2006). Abdeckung und Aktualität des Suchdienstes Google Scholar. Information Wissenschaft Und Praxis, 57(3), 133–140.
Stock, W. G., & Stock, M. (2013). Handbook of information science. De Gruyter Saur.
Sullivan, D. (2003). Searching with invisible tabs. Search Engine Watch. https://searchenginewatch.com/sew/news/2064036/searching-with-invisible-tabs
Vaidhyanathan, S. (2011). The googlization of everything (and why we should worry). University of California Press.
7 Search Result Presentation
When we enter a query in a search engine, we often get result lists that supposedly contain many thousands or even millions of results. We see this in the result display on the search engine result pages: “Results 1–10 of 1,900,000.” It is evident that one can do little with 1.9 million results. In the chapter on ranking, we saw how search engines try to bring the best results to the top of the often enormous result sets. This goes hand in hand with the fact that users usually only look at a few results and prefer the first ones in the list (see Sect. 7.6). However, it is not only the position in the result list that makes users select a particular result, but the selection is guided by the structure of the search engine result pages. A search engine result page (SERP) is the HTML page generated by a search engine on which the results for a query are presented. The SERP consists of all the results displayed and the elements surrounding them, such as the search box and navigation elements. The result list, on the other hand, is the list of so-called organic search results, which are automatically placed in a specific order by the search engine based on the ranking. The result list can go over several SERPs; if we scroll from SERP to SERP, we get the continuation of the result list in each case. In the previous chapter, we looked at result sets as ranked result lists. And indeed, search engine result pages used to be little more than a list of ranked search results, perhaps surrounded by advertising. But the result presentation has changed considerably over the years, and we are now dealing with a complex composition of different elements. The search results no longer come from a single index but are generated from multiple indexes (see Sect. 3.3.3). This chapter describes the typical structure of search engine result pages, using Google as an example. 
Again, Google and other well-known search engines are not very different: Certain standards for the presentation of search results have developed, so that the result pages of various search engines look pretty similar. Of course, the SERPs are constantly being modified. However, these changes are often only marginal; major changes to the SERPs rarely occur.
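The distinction between the result list and the individual SERPs can be illustrated with a minimal Python sketch. It assumes a ranked list of organic results and a fixed page size of ten; the function name `paginate` and the parameter `page_size` are illustrative choices, not taken from any search engine.

```python
from typing import List, Sequence


def paginate(result_list: Sequence[str], page: int, page_size: int = 10) -> List[str]:
    """Return the part of the ranked result list shown on SERP number `page` (1-based)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = (page - 1) * page_size
    return list(result_list[start:start + page_size])


# A hypothetical ranked list of 25 organic results.
ranked = [f"result-{i}" for i in range(1, 26)]

first_serp = paginate(ranked, page=1)  # results 1 to 10
third_serp = paginate(ranked, page=3)  # results 21 to 25 (last, shorter page)
```

The result list exists once; each SERP only shows a window onto it, which is why moving from SERP to SERP yields the continuation of the same list.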
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_7
7.1 The Influence of Device Types and Screen Resolutions
Search on mobile devices is currently still discussed as separate from search on regular computers (such as desktop PCs and notebooks). However, due to the diversity of devices, the term “mobile device” (and thus also “mobile search”) is losing its power to differentiate. The characteristics that have so far distinguished mobile devices from PCs are, in addition to mobility (which is also present in laptops, albeit to a lesser extent), the widely varying screen sizes, the different user interfaces, and the wealth of contextual information captured on mobile devices (see White, 2016, p. 15ff.). However, there are convergences between device classes: for example, many laptops can now be controlled via a touch interface, while tablets offer screen sizes previously reserved for (smaller) laptops. In terms of screen resolution, the boundaries become even more blurred. The distinction should therefore be made based on the device characteristics and the context of use. More recent developments such as smartwatches make this especially clear: The device characteristics limit what can be done with a device; they favor use in certain contexts and make it so inconvenient in others that it is unlikely to take place there. Figure 7.1 shows the area of the search engine result page for the query angela merkel that is immediately visible without scrolling on different versions (from left to right): the classic desktop version, the iPad version in landscape and portrait format, and the version on an Android smartphone in portrait format. It should be noted that, especially in the desktop version, the device settings (window size and screen resolution) affect how many results are displayed. Due to the multitude of devices and settings, one can hardly speak of standard sizes; the display of results and the number of results vary considerably.
The influence of screen resolution and device is significant because they determine what a user gets to see immediately. Of course, it is possible to view many more results by scrolling down further. However, user selection behavior focuses on the one hand on the top results in list displays and on the other hand on the results that are visible without scrolling. Thus, the number of results shown in this so-called visible area has a considerable influence on what users actually see and select. More generally, it can be said that a different number of search results presented on a search engine result page leads to different click behavior (Kelly & Azzopardi, 2015). Of course, it is impractical to present as many results on the small screen of a mobile phone as on a much larger desktop screen. However, this becomes problematic if, for example, the visible area only contains ads and users mistake these for organic results (see Chap. 10 for more details).
Fig. 7.1 Google search engine result page (visible area) in different versions (exemplary; from left to right: desktop, iPad landscape, iPad portrait, Android smartphone portrait; December 6, 2020)
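How the visible area depends on the device can be made concrete with a small Python sketch. All pixel heights, the layout, and the function name `visible_elements` are assumptions chosen for illustration; real result pages have elements of varying height.

```python
from typing import List, Tuple


def visible_elements(elements: List[Tuple[str, int]], viewport_height: int,
                     header_height: int = 0) -> List[str]:
    """Return the labels of SERP elements that fit completely into the visible area.

    `elements` is a top-to-bottom list of (label, height in pixels) pairs.
    """
    visible = []
    used = header_height  # space taken by the search box and options
    for label, height in elements:
        if used + height > viewport_height:
            break  # this element and everything below it requires scrolling
        visible.append(label)
        used += height
    return visible


# Hypothetical layout: two ads followed by three organic results, 120 px each.
serp = [("ad-1", 120), ("ad-2", 120), ("organic-1", 120),
        ("organic-2", 120), ("organic-3", 120)]

desktop = visible_elements(serp, viewport_height=900, header_height=150)
phone = visible_elements(serp, viewport_height=400, header_height=150)
```

With these assumed numbers, the desktop viewport shows all five elements, whereas the small viewport shows only the two ads, which is exactly the problematic case described above.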
7.2 The Structure of Search Engine Result Pages
Figure 7.2 shows the Google search engine result page for the query angela merkel; in this case, it is the complete page in the desktop version. In the upper section, we find the search box with our query and some options to select different vertical search engines or refine the query. Below this follow the results, which are displayed in different forms. We find the following typical types of results on this search engine result page:
• Organic results: These are the search results generated from the Web Index using algorithms and ranked by the ranking procedures, where the algorithms treat all documents in the index equally.
• Universal search results: These are results that do not come from the general Web Index but separately built collections (see Sect. 3.3.3). As a rule, these results are placed within the list of organic results; the blocks with such results interrupt the list of organic search results. The result presentation usually differs from the organic results and is adapted to the respective collection. For example, video results are presented with a thumbnail (instead of the usual textual description).
• Factual information: Search engines increasingly display direct answers to appropriate queries. These range from simply answering factual questions (How high is the Eiffel tower, Fig. 7.3) to including factual containers (weather paris, Fig. 7.4), to answering questions related to the query (Fig. 7.5), to compiling aggregated information on entities such as cities or people (see Fig. 7.2, right part). In the latter case, the term knowledge graph is used.
Context-based advertising is a crucial element found on many search engine result pages (see Chap. 10 for details). For example, Fig. 7.6 shows the beginning of a search engine result page on which advertising is shown in the form of text ads. These ads will be displayed according to the query and are similar to the organic results (title, description, URL; see Sect. 7.3.2).
The ads can, therefore, also be seen as a special form of search results, which is also expressed in the frequently used term “sponsored results.” Before we consider the individual result types in detail, the structure of typical search engine result pages should be explained. The result presentation in search engines has changed over the years. For example, Fig. 7.7a shows a search engine result page which, apart from the search box and the options at the top, contains only a list of organic results. This is the original
Fig. 7.2 Search engine result page (example from Google; December 6, 2020)
Fig. 7.3 Answering a factual question on the search engine result page (example from Google; September 29, 2022)
Fig. 7.4 Integration of a facts container on the search engine result page (example from Google; September 29, 2022)
SERP. Although more complex designs have increasingly replaced it, it can still be seen today, especially for less frequent queries. The second SERP (Fig. 7.7b) differs from the first by the addition of advertising. This is usually above and/or below or to the right of the list of organic results. A significant change in the search result presentation comes from the fact, on the one hand, that an additional column has been added (if advertising is displayed on the
Fig. 7.5 Display of related questions on a search engine result page (example, Google; January 8, 2021)
right) and, on the other hand, that the organic results no longer appear immediately after the search box (if advertising is displayed above the list of organic results), so that the search result presentation no longer (has to) begin with the organic results, but with the advertising. Advertising and organic results are generated independent of each other (see Chap. 10), meaning that we are now dealing with two independently ranked lists. The third search result page (Fig. 7.7c) still consists of advertising and organic results, but the list of organic results is now being interrupted by universal search results. Since the universal search results are also displayed differently from the organic results (e.g., supplemented by images), this is the first time that the search results have deviated from the pure list display. Finally, the fourth SERP (Fig. 7.7d) shows, in addition to all the elements already mentioned, the knowledge graph, which presents facts or direct answers. The presentation of the search results is further broken up and deviates from the usual form of the results, where the search engine refers to external documents. Here, information needs can be satisfied directly on the search engine result page; it is no longer necessary to leave the search engine. The layouts described show the types of result pages most commonly found today. This is not a complete representation; in practice, there are often result pages in which elements from the described types are recombined, whereby not all elements mentioned must necessarily appear. The “simple” forms (e.g., organic
Fig. 7.6 Search engine result page with advertising (example from Google; February 1, 2021)
results and advertising, Fig. 7.6) are also still found; not all elements are displayed for every query. However, the trend is that the simple list-based display is moving further and further into the background (Searchmetrics, 2021). On the one hand, the more complex result pages are more attractive and offer more choices. On the other hand, search engines are moving further away from equal treatment of all documents in the index: There is no longer a uniform ranking but rather a ranking within the individual collections (Web, news, videos, etc.) according to different procedures in each case (see Chaps. 5 and 6), and only then is the search engine result page compiled from the different containers. This last step can also be seen as a ranking, in which the different containers are placed in specific positions.
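The two-stage process described here, ranking within each collection first and compiling the page from containers afterward, can be sketched in Python. The scores, document identifiers, and both function names are hypothetical; in particular, the scores of different collections are assumed not to be comparable with each other, which is why the container position is passed in separately rather than derived from the scores.

```python
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (document id, collection-specific score)


def rank_collection(docs: List[Scored], limit: int) -> List[str]:
    """Stage 1: rank documents within one collection by that collection's own score."""
    ordered = sorted(docs, key=lambda doc: doc[1], reverse=True)
    return [doc_id for doc_id, _ in ordered[:limit]]


def compile_serp(organic: List[str],
                 containers: Dict[int, Tuple[str, List[str]]]) -> List[str]:
    """Stage 2: place containers at given positions within the organic list.

    `containers` maps a 0-based position in the organic list to
    (container name, results); the container is inserted before the
    organic result at that position.
    """
    page: List[str] = []
    for index, result in enumerate(organic):
        if index in containers:
            name, items = containers[index]
            page.append(f"[{name}: {', '.join(items)}]")
        page.append(result)
    return page


web = [("w1", 0.9), ("w2", 0.4), ("w3", 0.7)]
news = [("n1", 0.2), ("n2", 0.8)]

organic = rank_collection(web, limit=10)
news_top = rank_collection(news, limit=2)
serp = compile_serp(organic, {1: ("News", news_top)})
```

In this sketch, the news container interrupts the organic list after the first organic result, while the order inside the container comes solely from the news ranking.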
Fig. 7.7 Schematic representations of the most important layouts of search engine result pages: (a) search box and options with organic results only; (b) organic results with ads; (c) organic results with ads and universal search containers; (d) organic results with ads, universal search containers, and the knowledge graph
The evolution of search engine result pages shows that more and more results are displayed on a single result page. The list of organic results still generally consists of ten results per page (but tends to show fewer organic results; see Grundmann & Bench-Capon, 2018), with the other result types being added, rather than replacing organic results to any major extent. However, this does not mean that users will see all the results: It is mainly those results in the “visible area” that are seen. This refers to the area of the search engine result page that is immediately visible without scrolling. On the other hand, the “invisible area” can only be reached by scrolling. One often speaks of “above the fold” and “below the fold” in analogy to print newspapers. Since search engines are financed mainly by advertising (see Chap. 8), the presentation of results is designed so that advertisements continue to be displayed even with a relatively small screen size. At the same time, less space is given to the organic results. Studies have shown that many users do not scroll on the result pages and that most clicks on results are in the visible area (Höchstötter & Koch, 2009). This is not surprising if one observes where users’ eyes are directed on the search engine result pages. Eye tracking methods can be used to determine this. This involves using infrared cameras to observe the pupil movements of test subjects, while they use a search engine in a laboratory setting. One can then aggregate the data of all test persons and obtain a so-called heat map, which shows in which areas the users’ fixations accumulate. Figure 7.8 shows a heat map with the typical gaze accumulations in a list-based presentation: Users mainly look at the first results; the further down a result is, the less often it is looked at, and the less attention is paid to its snippet. Numerous studies have confirmed these findings, and one speaks of
Fig. 7.8 Typical viewing patterns in a list-based result presentation
the “golden triangle” because of this concentration of visual attention. However, this distribution only applies to uninterrupted lists. If, for example, advertising is displayed above the result list, people usually look twice: once at the advertising and then again at the organic results, which are then read as usual. This kind of viewing pattern is known as an F pattern. However, the integration of universal search results leads to entirely different gaze patterns and accumulations. For example, Fig. 7.9 shows how an embedded news container and an image container change viewing patterns. Although the first organic result still receives the most attention (represented by the red dot), the glances are distributed mainly within the image block and partly in the news block. The viewing behavior of search engine users has been investigated in numerous studies (for overviews, see Lewandowski & Kammerer, 2020; Strzelecki, 2020). It is particularly interesting how different elements on the search engine result pages direct the gaze and how user selection behavior changes accordingly: Only what has been seen is ultimately clicked on. One consequence of this is that it is no longer always worthwhile for content producers to be placed at the top of the organic result list if they are presented in a way that is hardly noticed anymore.
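The aggregation of eye-tracking data into a heat map mentioned above can be sketched as follows. This is a deliberately simplified illustration: it only counts fixations per grid cell, whereas real eye-tracking software typically also considers fixation durations and smooths the result. The coordinates, the cell size, and the function name `heat_map` are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Fixation = Tuple[int, int]  # (x, y) screen coordinates in pixels


def heat_map(sessions: Iterable[List[Fixation]],
             cell_size: int = 100) -> Dict[Tuple[int, int], int]:
    """Count the fixations of all test subjects per grid cell (column, row)."""
    counts: Counter = Counter()
    for fixations in sessions:
        for x, y in fixations:
            counts[(x // cell_size, y // cell_size)] += 1
    return dict(counts)


# Hypothetical fixations of two test subjects on the same result page.
subject_a = [(50, 120), (60, 130), (55, 250)]
subject_b = [(40, 110), (300, 140)]

grid = heat_map([subject_a, subject_b])
hottest_cell = max(grid, key=grid.get)
```

Cells with high counts correspond to the areas where fixations accumulate; in a list-based presentation, these would be the cells covering the first results.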
Fig. 7.9 Viewing patterns on a search engine result page with universal search results
7.3 Elements on Search Engine Result Pages
After considering the search result page as a whole, we will now look more closely at the individual elements presented on search engine result pages. These are mainly search results of different types, but also additional elements.
7.3.1 Organic Results
Organic results are search results automatically generated from the Web Index (for the procedure, see Chap. 5). Organic results are ranked on equal terms for all documents, meaning that every document included in the search engine’s Web Index potentially has the same chance of being displayed as a result for a query. It is important to distinguish between the different collections (indexes) of a search engine: only the results from the general Web Index (the search engine’s primary database; see Sect. 3.3.3) are called organic results; results from special collections such as news, videos, etc. are called vertical or universal search results (see Sect. 7.3.3).
Organic results are displayed for every query; they form the core of the search engine result pages. All other result types are optional, even though we are now very often dealing with mixed result pages, and the simple result pages consisting only of organic search results are becoming increasingly rare. But even though organic results are returned for every query, their importance diminishes considerably as other types of results are increasingly brought into the user’s field of vision (see Sect. 7.2). Although search engines have repeatedly been accused of influencing organic results in their own favor (e.g., in Edelman, 2010), so far, no such manipulation has been proven in any of the major search engines. Obtaining such proof would also be difficult, as it is hardly possible to isolate manipulation as a factor vis-à-vis other, genuine ranking factors and thus rule out the possibility that these can also explain the effect. However, this does not mean that organic results are free of manipulation. Yet these manipulations are not carried out by the search engine providers but rather from the outside by search engine optimizers (see Chap. 9 for more details). After all, the fundamental essence of search engine optimization is to create documents in a way that they achieve high visibility in the search engine result pages. One problem with search engine optimization is that it is impossible to see whether or to what extent documents have been optimized. Nowadays, however, it can be assumed that for all queries that indicate a commercial interest on the part of the searcher, it is primarily optimized results that are found at the top of the organic result list. For this reason, it may also be worthwhile to sift through more results or use advanced search methods (see Chap. 12). 
So, while content providers can manipulate their documents so that the search engines prefer them, it is not possible to buy directly into the organic result lists, i.e., to pay money to a search engine for a better placement.
7.3.2 Advertising
Search engines display advertising in the form of text ads. These are not colorful banners as known from many sites on the Web, but a rather unobtrusive type of advertising. Text ads are context-based, i.e., based on the user’s query. Thus, only ads matching the entered queries are displayed; similar to the organic results, a ranking takes place, which, however, is independent of that of the organic results. This way, two ranked result lists are created, both displayed on the search engine result page. The fact that the ads are displayed specifically to match queries makes them so successful: The moment a user enters a query, they reveal their interest. This can vary in intensity: For example, a user who enters television set will not yet have as concrete an intention to buy as a user who enters buy television set. Advertisers can specify very precisely in the text ad system which users they want to reach, since booking is based on concrete queries.
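The principle that ads are booked on concrete queries and ranked independently of the organic results can be sketched in Python. The bookings, advertiser names, and bids are invented, and ordering by bid alone is an assumption made only for illustration; how ads are actually ranked is the subject of Chap. 10.

```python
from typing import Dict, List, Tuple

# Hypothetical bookings: query -> list of (advertiser, bid in euros).
bookings: Dict[str, List[Tuple[str, float]]] = {
    "buy television set": [("shop-a", 0.80), ("shop-b", 1.20)],
    "television set": [("shop-c", 0.30)],
}


def ads_for(query: str, max_ads: int = 3) -> List[str]:
    """Return the ads booked for exactly this query, ordered by bid.

    This list is built without any reference to the organic ranking.
    """
    booked = bookings.get(query.strip().lower(), [])
    ordered = sorted(booked, key=lambda ad: ad[1], reverse=True)
    return [advertiser for advertiser, _ in ordered[:max_ads]]


purchase_ads = ads_for("Buy television set")
general_ads = ads_for("television set")
no_ads = ads_for("angela merkel")
```

Because the lookup is keyed on the query, a query without bookings simply yields no ads, while the organic result list is produced separately in every case.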
Advertising based on search queries can deliver relevant results for the user. This is another argument in favor of this form of advertising. However, it is important that users can clearly distinguish whether a result is displayed because it was determined by the search engines’ algorithms to be the best match for the query or whether this result matches the query but is only displayed because an advertiser paid for it. We will deal with this in detail in a separate chapter on search engine advertising (Chap. 10).
7.3.3 Universal Search Results
In Chap. 6, we described the creation of vertical collections and their integration into search engine result pages. In this section, we will now deal with the presentation and placement of these results. Presentation differs depending on the collection; it is evident that news items, for example, are presented differently from local search results, for which a map is also displayed. Figure 7.2 already illustrated the different presentations of news, images, and videos within the same SERP. Due to the combination of results from different collections, the original structure of the SERP increasingly dissolves, with universal search results interrupting the list of organic search results; as more and more containers with different vertical search results are included, especially in the visible area, one can hardly speak of a list presentation anymore. Depending on the query, a different, complex pattern of results is displayed instead. This becomes even clearer when one considers the number of collections that have been integrated by now. These include:
• News items
• Images
• Videos
• Local search results (maps)
• Scholarly content
• Shopping
• Books
Within each collection, there is a separate ranking; when organic results, advertising, and universal search results are combined on the search engine result page, there is no joint ranking; only the placement of the individual containers is determined. Universal search results can be placed above or among the organic results. The placement is done according to the algorithmically assessed importance of the respective result type. For example, Fig. 7.10 shows the placement of the news container for the query angriff hamburg synagoge (attack hamburg synagogue) on three consecutive days. The placement varies depending on the news volume (number of new documents) and user interest (number of queries; clicks on news items).
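How the placement of a container could depend on news volume and user interest is sketched below. The score formula, the weights, the thresholds, and the function name `container_position` are all invented for illustration; the source does not disclose how search engines actually compute this. The sketch covers both whether the container is shown at all and where it is placed.

```python
from typing import Optional


def container_position(news_volume: int, news_clicks: int, queries: int,
                       include_threshold: float = 0.2,
                       top_threshold: float = 0.6) -> Optional[int]:
    """Return a 0-based slot for the news container, or None if it is left out.

    The score and both thresholds are hypothetical.
    """
    click_share = news_clicks / queries if queries else 0.0
    volume_signal = min(news_volume / 50, 1.0)  # saturates at 50 new documents
    score = 0.5 * volume_signal + 0.5 * click_share
    if score < include_threshold:
        return None  # container is not considered relevant enough
    if score >= top_threshold:
        return 0     # place above the organic results
    return 3         # otherwise place further down within the organic list


breaking_news = container_position(news_volume=60, news_clicks=800, queries=1000)
fading_news = container_position(news_volume=10, news_clicks=300, queries=1000)
old_news = container_position(news_volume=2, news_clicks=10, queries=1000)
```

Under these assumptions, a container moves down the page and eventually disappears as fewer new documents are published and fewer users click on news items.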
Fig. 7.10 Placement of a news container based on news volume (excerpts from Google search engine result pages from January 9 to 11, 2021)
Search engines thus make two decisions when compiling search engine result pages: Firstly, it is decided whether or which universal search containers are relevant to the query and should therefore be included. Secondly, it must be decided where the containers are displayed. With the integration of universal search results, search engines have given up treating all documents equally. Whereas with organic results, all documents can be included in the result lists based on the same ranking factors, the criteria for universal search results are more complex. First of all, the issue is whether a document can get into a collection at all. This involves the requirements that it (or its source) must meet to be included in the collection. For example, in the case of news, it must be decided which sources are considered news sources. Only the content from these sources then feeds into the news results. A second change results from the design of the universal search results. While the snippets of the organic results are largely the same, universal search results are presented differently depending on the result type. For example, video results contain thumbnails that direct the eye of the user on the search engine result pages. Finally, a third change results from the fact that search engine providers have now also become content producers, whereas previously, they were only intermediaries between searchers and documents provided by content producers who were not affiliated with the search engine provider. Google, for example, offers content via its video platform YouTube and via its local search, among others. Naturally, there is competition for these offers, and a problem now arises because Google also points to its own offerings in its search engine. This quickly led to the accusation that Google favors its own offerings over those of its competitors, which ultimately led to antitrust proceedings by the European Commission (European Commission, 2017). 
The result of these proceedings, in addition to a fine of 2.42 billion euros, was that Google was obliged to give its competitors equal access to privileged display in the shopping results displayed in universal search boxes (for more details, see Sect. 15.4).
7.3.4 Knowledge Graph Results
Knowledge graph results are boxes displayed in a separate column on the search engine result pages that compile the most important information about a person, building, company, or other entity. This information is compiled automatically by the search engine from structured collections of facts; by now, knowledge graphs are a common approach used by all major search engines and e-commerce portals (Noy et al., 2019). Unlike other types of results, in the case of knowledge graph results, a user does not have to click on the result to be taken to a document on an external page but is shown the information directly on the search engine result page. Figure 7.11 shows two examples of knowledge graph result boxes from Google: on the left is the result for the query Frank Walter Steinmeier; on the right is the result for Rheinturm. In both cases, the box comprises the same elements: images, an excerpt from the Wikipedia article on the topic, some facts identified as particularly important (also from the Wikipedia article), and related queries gleaned from Google’s transaction logs. However, different information is displayed depending on the query or result. For people, this includes basic data such as date of birth, function, etc. and for buildings, the location, architect, etc. In some cases, ratings and typical visiting times are also displayed in the knowledge graph. A unique form of knowledge graph results was added to Google search engine result pages in the weeks leading up to the 2017 German federal election: When the name of a candidate was searched for, up to three principles and priorities of this person were displayed in each case in addition to the standard information. What was special here was that the candidates could provide these statements themselves, meaning that their own texts could be seen directly on the search engine result page (Hinz et al., 2020). 
Overall, knowledge graph results provide a first insight into what can be done with fact extraction and the display of facts instead of documents on search engine result pages. It is expected that search engines will continue to shift toward answering queries directly on the result pages (see Chap. 16).
7.3.5 Direct Answers
When we described the many possible interpretations of what is meant by a search query and by a document (Sect. 4.5.1), it already became clear that outputting a list of documents as the result of a search query is only one of several ways in which a search engine can answer a query. If we look back into the history of information retrieval (well before the launch of the first search engine), we find why the model of outputting a list of documents became the standard: Due to technical limitations, the complete documents could not be stored in the search system; there were only concise representations of the documents (in the form of bibliographic information, abstracts, and/or subject headings). The purpose of search was to find printed documents in physical libraries. This model was retained in later full-text databases: One wanted (and was able) to
Fig. 7.11 Knowledge graph results (example from Google; September 29, 2022)
eliminate the step from the electronic representation to the printed document, but the representations in the result lists remained. The documents were seen as just that and not as a basis for generating answers (which, of course, also had to do with the technical realities). And this model of document representations and documents eventually continued in Web search engines, even though it may not be ideal from the user’s perspective: In many cases, we do not want to search through documents to find answers but prefer to receive a suitable answer directly. For many factual questions, direct answers are already implemented in search engines. For example, Fig. 7.3 already showed a direct answer to the question about
152
7 Search Result Presentation
Fig. 7.12 Direct answer to a query (example from Google; September 29, 2022)
the height of the Zugspitze. While this answer shows the fact being searched for immediately, answers can also be excerpts from documents found in regular search. Figure 7.12 shows the answer to the query granular synthesis as an example, which is presented directly above the list of organic results. In this case, it is a definition from the website output.com; however, answers are generated from various sources. However, one problem with this – as with the ranking of Web documents in general – is assessing the trustworthiness of the sources or documents from which the answers are generated. Search Engine Land provides an exemplary list of incorrect and inappropriate answers found on Google’s search engine result pages (Gesenhues, 2017).
7.3.6 Integration of Transactions
The complexity of search engine result pages has increased further in recent years, mainly due to the increasing level of integration of universal search containers for transactional queries. With Google, for example, it is now possible to book flights or hotels directly in the universal search boxes for corresponding queries (see Fig. 7.13). The search engine is thus moving away from being an intermediary between users and content producers toward becoming an integrated system that enables tasks to be completed. However, this happens at the price that users are directed to the offerings of the search engine providers (or their partners). Although other offers are often still visible in the list of organic results (or are displayed as alternatives within the universal search boxes), they are neither displayed as prominently as the search engine’s own offerings nor is there a similarly convenient integration.
Fig. 7.13 Hotel booking integrated into the search engine result page (example from Google; September 29, 2022)
7.3.7 Navigation Elements
In order for a search engine result page to be functional, it must also contain navigation elements. Navigation elements are elements that lead to specific areas within the search engine, for example, directly to a vertical collection. They enable searchers who have not already decided to search in a specific database before starting their Internet-based research to select a suitable collection on the search engine result page (see also Sect. 6.4). Other navigation elements provide access to settings that can be adjusted by the user and links to static documents (e.g., help and disclaimer pages).
7.3.8 Support for Query Modification
Whereas search suggestions (Sect. 4.5.2) directly support users in formulating their queries, suggestions for modifying queries are provided on the search engine result page once the user has submitted a query and received the first list of results. These suggestions are usually placed at the end of the search engine result page, the assumption being that a user who did not find what they were looking for on the first result page can be helped by suggestions for specifying or restricting their query. Figure 7.14 shows suggestions for specifying the query classification. Just like autocomplete suggestions, these suggestions are generated from the queries of previous users. The suggestions consist of the initially entered query, expanded by one or two words.
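The idea of deriving such suggestions from earlier queries can be illustrated with a minimal sketch. It assumes a plain list of past queries as the log and simply counts how often users entered the original query extended by one or two words; the function name, the parameters, and the toy log are hypothetical, and real search engines will use far more elaborate signals.

```python
from collections import Counter


def refinement_suggestions(query, query_log, max_extra_words=2, top_n=5):
    """Return frequent past queries that extend `query` by a few words."""
    base = query.lower().split()
    counts = Counter()
    for past in query_log:
        words = past.lower().split()
        extra = len(words) - len(base)
        # Keep only queries that start with the original query
        # and add between 1 and max_extra_words words.
        if words[:len(base)] == base and 1 <= extra <= max_extra_words:
            counts[" ".join(words)] += 1
    return [q for q, _ in counts.most_common(top_n)]


# Hypothetical toy log; real logs contain many millions of queries.
log = [
    "classification",
    "classification essay",
    "classification essay",
    "classification system",
    "classification essay topics",
    "classification of living things",  # adds three words, so it is skipped
    "animal classification",            # does not start with the query
]
print(refinement_suggestions("classification", log))
```

Restricting the candidates to extensions of the original query mirrors the observation above that the suggestions consist of the entered query plus one or two words.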
Fig. 7.14 Suggestions for improving a query (example from Google; September 29, 2022)
Fig. 7.15 Suggestions for modifying the query in the knowledge graph (example from Google; September 20, 2022)
Fig. 7.16 Suggested follow-up questions (example from Google; September 29, 2022)
Fig. 7.17 Search options on the Google search engine result page (January 6, 2021)
A somewhat different case is generating related queries, which can be seen in Fig. 7.15 in the form of suggestions displayed as part of the knowledge graph results. Here, the original query is not specified, but thematically similar queries (in the example for the query hamburger michel, other popular sights in the vicinity) are listed based on past user behavior. Here, the benefit lies more in the inspiration for further queries than in the specification of a query that has already been entered. Also, when questions are entered, additional questions are often suggested that relate to the original question and are common follow-up questions. For example, Fig. 7.16 shows suggestions generated based on the question, “How old is Nana Mouskouri?”
7.3.9 Search Options on the Result Page
On search engine result pages, numerous options are offered for refining the query or for setting preferences in general. Figure 7.17 shows the search options pane above the search results on Google. The top line contains the choice of collections, which can vary depending on the query. Additional collections can be found under “more.” If one clicks on a link to a collection, the query is transferred, and the search is carried out in the corresponding collection. If one clicks on the “tools” field, a second line opens, offering options for refining the search results according to country, language, and time period. The search options on the search engine result pages already indicate that search engines can be used for much more than simple queries. In Chap. 12, we will detail the options for qualifying queries.
7.4 The Structure of Snippets
Having looked at the search engine result page and the individual types of results, we now want to look at the micro level, i.e., the presentation of the individual results. We have already mentioned that the snippets shown for the individual result types are similar but not identical. The term “snippet” refers to the information displayed for an individual result on the search engine result page. Snippets help users to decide whether or not to look at a result document or to get an idea of the document before looking at it. The snippet merely represents the document and is a further (algorithmic) interpretation on the part of the search engine. Consequently, we may consider a result relevant based on the snippet, but it turns out irrelevant on closer inspection (after the click). Conversely, it is also possible that a snippet indicates that the result is not relevant but that it would turn out to be relevant if we only looked at it. Ideally, however, the snippet and the corresponding document match, so that a relevant snippet is followed by a relevant document. However, while a non-relevant snippet and a non-relevant document also match, displaying such a snippet on the search engine result page can be considered unnecessary clutter (see Lewandowski, 2008). Figure 7.18 shows the typical structure of a snippet, including a clickable headline, the domain, and a short description. By now, habits-based standards have emerged for snippets, which extend to the color of the individual elements. For example, in Google and Bing, there is a small triangle next to the domain information, which opens a context menu with further options when clicked. The associated functions are described in Sect. 7.5. The basic information of the snippet (headline, description text, domain) is usually extracted from the documents to be described. 
The headline is simply taken from the <title> tag in the HTML code of the document and possibly shortened to the maximum length specified for the description in the search engine result page (see Fig. 7.20). The domain indicates the source of the document so that, if the source is already known, one can select it according to trustworthiness or other preferences. Some search engines display the whole URL (or a shortened version); however, Google now uses only the domain, supplemented by one or more elements of the navigation path if necessary. This provides information about the position of the document within the website in a similar fashion to so-called breadcrumb navigation.
Fig. 7.18 Snippet of an organic result (Google; December 6, 2020)
Fig. 7.19 Snippet with sitelinks (Google; December 6, 2020)
Fig. 7.20 Snippet containing freshness information (Google, December 6, 2020)
The snippets’ descriptive text can be generated from different sources:
• Meta description: This is a metadata tag within the HTML document in which the author can briefly describe the document content. However, the descriptions from the meta description tag are only used by search engines if the document (or the website to which it belongs) has previously been classified as trustworthy. The great advantage of meta descriptions is that they usually contain complete sentences in which the document’s content is summarized in a nutshell, which meets users’ needs (see Fig. 7.19).
• External sources: External sources that describe another document or website can also be used. In the past, these were mainly descriptions from Web directories such as the Open Directory, but these no longer play a role. In principle, document descriptions can be generated from any external page. These descriptions are often more accurate than self-descriptions.
• Document content: Most snippets are automatically generated from the respective page’s content. This is because only a relatively small proportion of all documents have one of the other forms of description. These automatically generated descriptions show keywords in context (KWIC), i.e., passages from the document text in which the keywords occur. These can be complete sentences (Fig. 7.20) but also other excerpts from the documents (Fig. 7.21). When generating the description in this way, the search engine cannot fall back on descriptions that have been produced by humans but must select an appropriate excerpt automatically.
Fig. 7.21 Snippet containing author information, citation count, and link to a machine translation (December 6, 2020)
Fig. 7.22 Snippet containing navigation path and user ratings (December 6, 2020)
Not only the search engine result pages in their entirety but also the snippets have changed considerably in recent years. Search engines have sometimes added additional information to the snippets and, in some cases, also changed their design by adding graphic elements. Especially for navigational queries, more complex representations are often found showing several results from one website. For the query spiegel, for example, Google shows not only the usual description for the first result (spiegel.de) but also popular pages of the website (Fig. 7.19). These popular pages can be identified from the links and, above all, from the clicks of those users who continued to click from the website’s homepage.
However, there are also many less conspicuous additions to the snippets. These will not be presented exhaustively here but only by way of example. For example, snippets are often provided with freshness or date information (Figs. 7.20 and 7.21). The date of the last update detected by the crawler is used for this purpose. Figure 7.21 shows a snippet of a scholarly article supplemented with several details, some of which are specific to scholarly documents. After the usual snippet, a line has been inserted that contains the author’s name automatically extracted from the document and the year of publication of the article. This line also contains information on how often this article has been cited in other scholarly works (see Sect. 6.3.2) and a reference to similar articles.
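The keywords-in-context (KWIC) approach mentioned above for descriptions generated from the document content can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular search engine: it picks the first occurrence of any query term and cuts a fixed character window around it. The function name and the parameters `window` and `max_len` are hypothetical.

```python
import re


def kwic_snippet(text, query, window=40, max_len=160):
    """Build a keyword-in-context excerpt around the first query term in text."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return text[:max_len]
    pattern = r"\b(" + "|".join(terms) + r")\b"
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        # No keyword occurs in the text: fall back to the document's beginning.
        return text[:max_len]
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    excerpt = text[start:end]
    # Mark places where the excerpt was cut out of a longer text.
    if start > 0:
        excerpt = "... " + excerpt
    if end < len(text):
        excerpt = excerpt + " ..."
    return excerpt


doc = (
    "Sound design has many techniques. Granular synthesis splits audio "
    "into tiny grains and recombines them. It is widely used."
)
print(kwic_snippet(doc, "granular synthesis", window=20))
```

A character window easily cuts words and sentences in half, which is one reason why automatically selected excerpts often read less smoothly than a description written by a human.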
The difference between the description of a scholarly article and the description of a product (Fig. 7.22) shows how different the additions to the snippets can be. The latter indicates the ratings and reviews the product has received. Extended snippets are also called rich snippets. The additional information can be extracted from the document (e.g., author names), from other documents or the context of the document in question (e.g., citation information; for this, the citing documents must be known), or from metadata attached to the documents. The major search engines have joined forces in the initiative Schema.org, which aims to create metadata schemes for various types of content and motivate content producers to index their content according to these standards. In addition to improving findability and providing useful information for ranking, this also enables the generation of more relevant snippets.
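To make the role of Schema.org metadata more concrete, the following sketch builds a small product description in JSON-LD, one of the formats in which Schema.org markup can be embedded in a page. The types `Product` and `AggregateRating` and the properties `ratingValue` and `reviewCount` are part of the Schema.org vocabulary; the helper function, the product name, and the numbers are hypothetical. Whether a search engine actually shows ratings in a snippet remains its own decision.

```python
import json


def product_jsonld(name, rating_value, review_count):
    """Return Schema.org product markup with an aggregate rating as JSON-LD."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,
            "reviewCount": review_count,
        },
    }
    return json.dumps(markup, indent=2)


# Hypothetical product; the output would be placed inside a
# <script type="application/ld+json"> element of the product page.
print(product_jsonld("Example Headphones", 4.4, 89))
```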
7.5 Options Related to Single Results
Several options can be selected using a context menu, which can be called up from within a snippet (Fig. 7.23). For Google, these are “in cache” and “similar pages”; other search engines offer similar functions. The function described in Sect. 3.3, which retrieves the last version of a document found by the search engine, is located under “in cache.” In this case, the search engine does not redirect to the current original version of a document (as is the case with a normal click on the result in the result list) but to a copy created during crawling.
The “similar pages” option retrieves websites similar to the website on which the document described in the result list is located. This function is currently only offered by Google and uses a link-based ranking for similarity detection: it determines which websites are frequently linked to together with the website listed in the result list. For example, Fig. 7.24 shows pages similar to the website heise.de: Other news websites on computer topics such as golem.de, chip.de, and netzwelt.de are displayed, as many documents link both to heise.de and one or more of the other websites. The great advantage of this kind of method is that it is independent of the wording used on the different websites and that no linking of the pages to each other is necessary. The severe disadvantage, however, is that the method used by Google only detects similarities between websites, not between individual documents.
Fig. 7.23 Snippet with expanded context menu (October 16, 2017)
Fig. 7.24 Presentation of similar websites based on an already known site (January 8, 2021)
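The link-based similarity detection described in this section, in which websites count as similar when many documents link to both of them, can be sketched as a simple co-link count. This is an illustration of the general idea only; how Google actually implements the function is not described in the text. The function name and the toy link data are hypothetical.

```python
from collections import Counter


def related_sites(target, outlinks, top_n=3):
    """Rank sites by the number of documents linking to both them and `target`."""
    counts = Counter()
    for linked_sites in outlinks.values():
        sites = set(linked_sites)  # count each site once per linking document
        if target in sites:
            counts.update(sites - {target})
    return counts.most_common(top_n)


# Hypothetical link data: linking document -> websites it links to.
outlinks = {
    "doc1": ["heise.de", "golem.de", "chip.de"],
    "doc2": ["heise.de", "golem.de"],
    "doc3": ["heise.de", "netzwelt.de", "golem.de"],
    "doc4": ["chip.de", "example.org"],
}
print(related_sites("heise.de", outlinks))
```

The sketch never looks at the text of the websites or at links between them, which corresponds to the advantage named above. It also works at the level of whole websites, so it shares the stated limitation of not comparing individual documents.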
7.6 Selection of Suitable Results
So how do users select from the elements on a search engine result page? The previous sections have shown the complexity of the current result presentation and the multitude of choices. Of course, all of the elements described are selected regularly. Nevertheless, certain factors can act in favor of a particular element on the search engine result page. In Sect. 7.2, we have already seen how the result presentation influences the user’s gaze, and it has been established that only what the user perceives is clicked on. However, it is wrong to assume that those areas not highlighted in the heat maps in the eye tracking are not perceived at all. Instead, the gaze is not concentrated there, but this does not mean that this area is not perceived by users. First, an element’s likelihood of being selected depends on its position on the search engine result page. Numerous studies have shown not only that the position has a significant effect in the context of listing particularly relevant results at the top of the result list but that the position alone achieves a significant effect. This has been shown in studies in which the order of search results was artificially altered, and the impact on users’ selection behavior was then measured (Keane et al., 2008; Bar-Ilan et al., 2009; Schultheiß et al., 2018). The positioning of an element in the visible versus the invisible area of the search engine result page further affects selection behavior (e.g., Cutrell & Guan, 2007; Granka et al., 2006). The results in the visible area are perceived far more frequently and are consequently selected far more often. This leads to a situation where search engines provide access to the diversity of the Web with its almost unmanageable
number of offerings, but the actual selection of search results is usually made from a rather limited number of sources. For example, based on 2.5 billion clicks on Yahoo’s search engine result pages, it was found that only 10,000 websites account for about 80% of all clicks (Goel et al., 2010). Of course, the visual presentation of certain results or types of results also significantly influences users’ selection behavior. For example, results enriched with an image are selected more often than results presented less attractively (see Sect. 15.4 for more details). Meaningful snippets and highlighted keywords within the snippets play a role as well. All in all, the result presentation considerably influences users’ selection behavior. Therefore, search engine providers can influence what users perceive and select by using or modifying search engine result page layouts. This then gives rise to possible influences that go beyond technical conditions and can also serve to specifically influence users in the interests of the search engine provider (see Chap. 15 and Lewandowski et al., 2014).
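The kind of concentration reported by Goel et al. (2010) can be expressed as the share of all clicks that goes to the k most-clicked websites. The following sketch computes this share for invented toy numbers; the function name and the data are hypothetical and do not reproduce the Yahoo data set.

```python
def top_k_click_share(clicks_per_site, k):
    """Return the fraction of all clicks received by the k most-clicked sites."""
    total = sum(clicks_per_site.values())
    if total == 0:
        return 0.0
    top = sorted(clicks_per_site.values(), reverse=True)[:k]
    return sum(top) / total


# Invented toy data: clicks per website.
clicks = {"a.example": 500, "b.example": 300, "c.example": 100,
          "d.example": 60, "e.example": 40}
print(top_k_click_share(clicks, 2))
```

In this toy example, two of the five sites receive 80% of the clicks; in the study cited above, the comparable group consisted of 10,000 websites.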
7.7 Summary
Search engines have built complex result presentations composed of various elements. The ranked result list continues to form the basis of the search engine result page (SERP), but it is increasingly enriched with results from other collections, advertising, and factual answers. The tendency is for organic results to play a less and less significant role. Organic results are generated from the Web Index and displayed in an ordered list. Mostly above (and sometimes also next to or below) the result list, text ads are displayed that are structured similarly to the snippets of the organic results and are contextual, meaning they are specific to the query entered. This means that ads can be considered a result type; they can even be relevant to the user. Universal search results are results from the search engine’s vertical collections (data sets) which are constructed in addition to the Web Index. Examples include the news index, the image index, and the video index. Universal search results are displayed above or within the list of organic results; they disrupt the regular display of results. In addition, the snippets of universal search results are adapted to the content of the respective collection and often contain images that change the gaze patterns on the search engine result pages compared to the traditional list display. In addition to different types of search results, search engine result pages also contain navigation elements, as well as search options related to single results. The snippets consist at least of a title, a URL, and a short description. In recent years, however, there have been developments toward more complex presentations. Result presentation considerably influences which results are seen and selected by users. The ranking of the results plays a significant role; however, by enriching the search engine result pages, especially with images and graphic elements, this influence is reduced, and users are directed to other elements.
Further Reading
An interesting eye-tracking study on Google search engine result pages and how they have changed over time can be downloaded from the market research company Mediative (at http://www.mediative.com/whitepaper-theevolution-of-googles-search-results-pages-effects-on-user-behaviour/). Some books deal fundamentally with the design of user interfaces in search applications; Hearst (2009) is particularly worth mentioning here. Works that deal with the user experience design in search also offer interesting insights, even if they do not primarily refer to Web search. The books by Russell-Rose and Tate (2013) and Nudelman (2011) can be recommended here.
References
Bar-Ilan, J., Keenoy, K., Levene, M., & Yaari, E. (2009). Presentation bias is significant in determining user preference for search results – A user study. Journal of the American Society for Information Science and Technology, 60(1), 135–149. https://doi.org/10.1002/asi.20941
Bundesverband Digitale Wirtschaft. (2009). Nutzerverhalten auf Google-Suchergebnisseiten: Eine Eyetracking-Studie im Auftrag des Arbeitskreises Suchmaschinen-Marketing des Bundesverbandes Digitale Wirtschaft (BVDW) e.V. http://docplayer.org/10390994Nutzerverhalten-auf-google-suchergebnisseiten.html
Cutrell, E., & Guan, Z. (2007). What are you looking for? An eye-tracking study of information usage in Web search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2007) (pp. 407–416). ACM. https://doi.org/10.1145/1240624.1240690
Edelman, B. (2010). Hard-coding bias in Google “algorithmic” search results. benedelman.org. http://www.benedelman.org/hardcoding/
European Commission. (2017). EU competition investigation AT.39740-Google. http://ec.europa.eu/competition/elojade/isef/case_details.cfm?proc_code=1_39740
Fallows, D. (2005). Search engine users: Internet searchers are confident, satisfied and trusting – but they are also unaware and naive. Pew Internet & American Life Project.
Gesenhues, A. (2017). When Google gets it wrong: Direct answers with debatable, incorrect & weird content. https://searchengineland.com/when-google-gets-it-wrong-direct-answers-with-debatable-incorrect-weird-content-223073
Goel, S., Broder, A., Gabrilovich, E., & Pang, B. (2010). Anatomy of the long tail: Ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (pp. 201–210). ACM. https://doi.org/10.1145/1718487.1718513
Granka, L., Hembrooke, H., & Gay, G. (2006). Location location location: Viewing patterns on WWW pages. In Proceedings of the 2006 Symposium on Eye Tracking Research & Applications (p. 43). ACM. https://doi.org/10.1145/1117309.1117328
Grundmann, J., & Bench-Capon, S. (2018). Universal search 2018: It’s a mobile world after all. Searchmetrics.
Hearst, M. A. (2009). Search user interfaces. Cambridge University Press. https://doi.org/10.1017/cbo9781139644082
Hinz, K., Sünkler, S., & Lewandowski, D. (2020). Selbstdarstellung und Positionierung von Kandidatinnen und Kandidaten zur Bundestagswahl 2017 in Google-Infoboxen. Medien & Kommunikationswissenschaft, 68(1–2), 94–112. https://doi.org/10.5771/1615-634X-2020-12-94
Höchstötter, N., & Koch, M. (2009). Standard parameters for searching behaviour in search engines and their empirical evaluation. Journal of Information Science, 35(1), 45–65. https://doi.org/10.1177/0165551508091311
Keane, M. T., O’Brien, M., & Smyth, B. (2008). Are people biased in their use of search engines? Communications of the ACM, 51(2), 49–52. https://doi.org/10.1145/1314215.1314224
Kelly, D., & Azzopardi, L. (2015). How many results per page? A study of SERP size, search behavior and user experience. In SIGIR’15, August 09-13, 2015, Santiago, Chile (pp. 183–192). ACM. https://doi.org/10.1145/2766462.2767732
Lewandowski, D. (2008). The retrieval effectiveness of web search engines: Considering results descriptions. Journal of Documentation, 64(6), 915–937. https://doi.org/10.1108/00220410810912451
Lewandowski, D., & Kammerer, Y. (2020). Factors influencing viewing behaviour on search engine results pages: A review of eye-tracking research. Behaviour & Information Technology, 1–31. https://doi.org/10.1080/0144929X.2020.1761450
Lewandowski, D., Kerkmann, F., & Sünkler, S. (2014). Wie Nutzer im Suchprozess gelenkt werden: Zwischen technischer Unterstützung und interessengeleiteter Darstellung. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche – Suchmaschinen im Spannungsfeld zwischen Nutzung und Regulierung. De Gruyter. https://doi.org/10.1515/9783110338218.75
Marable, L. (2003). False oracles: Consumer reaction to learning the truth about how search engines work. Research Report, 30(June). https://consumersunion.org/wp-content/uploads/2013/05/false-oracles.pdf
Noy, N., Gao, Y., Jain, A., Narayanan, A., Patterson, A., & Taylor, J. (2019). Industry-scale knowledge graphs. Communications of the ACM, 62(8), 36–43. https://doi.org/10.1145/3331166
Nudelman, G. (2011). Designing search: UX strategies for eCommerce success. Wiley.
Russell-Rose, T., & Tate, T. (2013). Designing the search experience: The information architecture of discovery. Morgan Kaufmann. https://doi.org/10.1016/C2011-0-07401-X
Schultheiß, S., Sünkler, S., & Lewandowski, D. (2018). We still trust in google, but less than 10 years ago: An eye-tracking study. Information Research, 23(3). http://www.informationr.net/ir/23-3/paper799.html
Searchmetrics. (2021). SERP features monitor – Tracking the state of the search results pages. https://www.searchmetrics.com/knowledge-hub/monitors/serp-features/
Strzelecki, A. (2020). Eye-tracking studies of web search engines: A systematic literature review. Information, 11(6). https://doi.org/10.3390/info11060300
Sullivan, D. (2003). Searching with invisible tabs. Search Engine Watch. https://searchenginewatch.com/sew/news/2064036/searching-with-invisible-tabs
White, R. (2016). Interactions with search systems. Cambridge University Press. https://doi.org/10.1017/CBO9781139525305
8 The Search Engine Market
Search engines are operated by commercial companies and therefore not only have to cover their costs but also have to make a profit. In this chapter, we explain which search engines are commercially important, how search engines make money, and what partnerships exist between companies in the search engine market. It is no secret that users mainly rely on one search engine, Google, for their Internet-based research. In this chapter, we will see what share other search engines achieve in the search engine market and what economic factors contribute to Google’s dominant position. To understand why people search primarily with Google and why there are virtually no alternatives that are actually used, one must know and understand the search engine market. We consider Google, Bing, and the other search engine providers as businesses here, and our interest is not in these companies (which offer much more than search engines) as a whole but solely in their role as search engine providers. Also excluded is the market for search applications aimed at companies, for example, to search their websites or internal applications.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_8
8.1 Search Engines’ Business Model
The dominant business model of search engines is contextual advertising, which has already been explained in Sect. 7.3.2 as a result type on the search engine result page. The success of this model is based on various factors (a detailed explanation can be found in Chap. 10):
1. Ads are displayed at the moment when a searcher has already “revealed” their interest through their query.
2. Charging by clicks (instead of impressions) means that the advertising’s success can be measured precisely by every advertiser.
3. The advertising is not very annoying, as it is text-based and, unlike banner advertising, for example, not very intrusive.
4. The auctioning of advertising space results in transparent click prices based on the actual competition for any given query.
5. Self-booking allows even the smallest businesses to advertise independently, even if only small budgets are available.
No search engine on the Web has yet managed to finance itself with any other business model. Therefore, any additional revenue the search engine providers generate is to be regarded as supplementary revenue. The following three models, in particular, have been tried in the past or are still being used as alternatives or supplements to financing through advertising:
• Selling search engine technology to companies that use it to make their own records searchable. A separate market has developed from this, largely independent of the providers in Web search (see White, 2015). Well-known search engine companies have also tried to offer enterprise solutions in the past but have largely failed. The most prominent example is the Google Search Appliance, which was discontinued in 2017.
• Paid inclusion: In this model, content producers pay for their websites to be indexed particularly deeply or frequently by a search engine. Earlier approaches, for example, by Yahoo, were abandoned for good; there was also considerable criticism of this model, as all documents are no longer treated equally by the search engine. Google was the last search engine provider to experiment with the paid inclusion model when it made inclusion in the shopping index subject to a fee (Sullivan, 2012). By now, shopping results are listed as ads, and the result list is marked accordingly.
• Services via application programming interfaces (APIs): Search engines can be queried partly automatically via so-called APIs, which enables developers of other applications to incorporate one of the search engine’s services, such as a particular set of search results, into their applications. Billing is usually per thousand queries; possible applications range from APIs directly related to search (e.g., Bing’s API for search results) to services related to search engines (e.g., Google Translate API). However, Google discontinued its Web Search API in 2010 and now only allows the creation of search applications for custom websites. In contrast, Bing offers a complete API package with which both general Web search and various vertical searches can be queried.
8.2 The Importance of Search Engines for Online Advertising
The fact that search-based advertising is attractive is evident not only from the factors already mentioned above but, above all, from the economic success of this form of advertising. For Germany alone, a turnover of more than 4.5 billion euros is expected from search engine advertising in 2021 (Statista, 2021). A comparison of various forms of online advertising also shows the great importance of search engine advertising: it accounts for 40% of the entire online advertising market (Zenith,
2021). It is a significant component of the online marketing mix. We are dealing with quite a high market volume and consistent growth in paid search advertising. This makes for an attractive market and shows that even with a market share that at first appears small, high revenues can be generated. Google’s parent company, Alphabet, generated a total of US$161.86 billion in 2019, according to its annual report (Alphabet, 2020), of which 83% was generated by advertising. This includes both text ads (Google Ads), which are displayed on Google’s search engine result pages and those of partners to match search queries (see Sect. 7.3.2 and Chap. 10), and ads that are displayed to match content on non-Google websites, such as journalistic offerings (AdSense). About 72% of the revenue in this area comes from ads shown in search, 11% from ads on YouTube, and 16% from ads displayed on affiliate sites. The number of queries submitted to search engines is steadily increasing. In the United States alone, more than 17 billion queries were entered into the two leading search engines, Google and Bing, in October 2020 (ComScore, 2020) – and this figure only includes queries sent from desktop computers! Today, however, more queries are already being sent via mobile devices, so the usual statistics (and the slowing growth shown there) give the false impression that the number of queries is decreasing. Instead, the opposite is true: over many years, the number of queries has continued to grow steadily. For every query made, advertising can at least potentially be displayed. The immense number of queries and its continued growth is another sign of an attractive market.
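The Alphabet figures quoted in this section can be turned into approximate absolute amounts with simple arithmetic. The sketch below assumes that the 72%, 11%, and 16% all refer to the advertising revenue (83% of the total), which is how the passage reads; the function and variable names are illustrative. The three shares add up to 99% rather than 100%, presumably because of rounding, so the results are rough estimates only.

```python
def ad_revenue_breakdown(total_revenue, ad_share, channel_shares):
    """Split total revenue into estimated advertising revenue per channel."""
    ad_revenue = total_revenue * ad_share
    return {channel: ad_revenue * share for channel, share in channel_shares.items()}


# Figures from the text (Alphabet, 2019); revenue in billions of US dollars.
shares = {"search": 0.72, "YouTube": 0.11, "affiliate sites": 0.16}
for channel, amount in ad_revenue_breakdown(161.86, 0.83, shares).items():
    print(f"{channel}: about US${amount:.1f} billion")
```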
8.3 Search Engine Market Shares
Only a few providers currently dominate the global market for Web search engines. The strongest competitor to Google is Microsoft, with its search engine Bing. However, this search engine is far from Google in terms of market share: even in the United States, Bing, together with its partner Yahoo (see Sect. 8.5), only achieves a market share of just over 9% (Statcounter, 2021; data for December 2020). Furthermore, while its share for desktop searches is around 16%, for mobile search, which now accounts for the majority of all queries, it is only slightly more than 3% (Statcounter, 2021). In this area, Google achieves a market share of about 95% (Statcounter, 2021). A major reason for this is probably that Google search is seamlessly integrated into the Android mobile operating system offered by Google’s parent company, Alphabet; Google is also the default search engine in the Safari browser preinstalled on the iPhone. In Germany, Google achieves a market share of 92.6% across all searches; Bing is far behind with 4.5% (Statcounter, 2021). In this case, too, the difference between desktop and mobile searches is significant: if we look only at desktop searches, Google achieves about 84.8% (Bing: 10.7%); for mobile searches, the figure is 97.7% (Bing: 0.6%; Statcounter, 2021).
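The German figures above allow a small consistency check. If the overall share is treated as a weighted average of the desktop and mobile shares, the weight of mobile search can be backed out. This is a sketch under a simplifying assumption that is not made in the text: only two device classes exist (tablets and other categories are ignored), so the result is an illustration, not a measured value.

```python
def implied_mobile_fraction(overall, desktop, mobile):
    """Solve overall = desktop * (1 - m) + mobile * m for the mobile weight m.

    All three arguments are market shares in percent for one search engine.
    """
    if mobile == desktop:
        raise ValueError("desktop and mobile shares are equal; m is undetermined")
    return (overall - desktop) / (mobile - desktop)


# Shares for Germany as quoted in the text (Statcounter, 2021).
google_m = implied_mobile_fraction(overall=92.6, desktop=84.8, mobile=97.7)
bing_m = implied_mobile_fraction(overall=4.5, desktop=10.7, mobile=0.6)

print(f"mobile weight implied by Google's shares: {google_m:.2f}")
print(f"mobile weight implied by Bing's shares:   {bing_m:.2f}")
```

Both estimates land at roughly 0.6, which is in line with the statement that mobile search now accounts for the majority of all queries.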
In Europe, the national search engine markets have been characterized by a quasi-monopoly of Google for many years (Maaß et al., 2009). This search engine achieves a share of more than 90% of queries in most countries. Other search engines hardly play a role in Germany, either. In Chap. 13, we will look at whether these market conditions express Google’s superior quality (as, e.g., Haucap, 2018, p. 24, argues) and to what extent they can be desirable for a pluralistic society (Chap. 15).
Outside of Europe, important national search engines have been able to establish themselves in some cases, for example, Baidu in China or Yandex in Russia. However, these providers have not yet internationalized their search engines, so they remain restricted to their respective countries of origin.
Search engines can achieve market share in different ways. First, one can assume that a search engine can simply convince new users of its services, who will then continue using the search engine in the future and thus almost automatically ensure market shares. However, this is a way of achieving small gains at most, and recent years have shown that new search engines cannot achieve significant market shares in this way. Other common methods to increase a search engine’s reach or market share are:
• The search engine is set as the default search engine in a browser. All users who send a query via the search box, the start page, and/or the URL bar are forwarded to this search engine. The most prominent example of such cooperation is that between the Mozilla Foundation (operator of the Firefox browser) and Google. The Mozilla Foundation’s partnerships with search engines and other information services earn it more than US$450 million a year, making up a large part of the organization’s total revenue (Mozilla Foundation, 2020).
• The search engine is preset as the default search engine in an operating system. The model works in the same way as the default setting in the browser; prominent examples are the integration of Bing search in the current versions of Windows and the integration of Google in the Android operating system.
• The search engine is integrated into other search-based services. A well-known example of this is the integration of Google into Apple’s voice-controlled virtual assistant Siri. The importance of existing Web search engines as a fundamental technology for such services will likely increase considerably in the future (see Chap. 16).
The situation in the search engine market described above naturally raises the question of whether changes in the market are to be expected in the short to medium term. Google’s dominance appears to have been consolidated for years to such an extent that more than 10 years ago, Maaß et al. (2009), based on their industrial economic analysis of the search engine market, came to the following conclusion:
[...] no change in this market situation is to be expected in the short and medium term. Indeed, high barriers to market entry prevent the entry of new algorithm-based search engines. These hurdles are related to the costs of building and maintaining one’s own search
index, which is accessed when a query is made. (Maaß et al., 2009, p. 15, translated from German)
This raises the question of whether the current market conditions are desirable or whether they should be changed by government intervention. We will deal with this in Chap. 15.
8.4 Important Search Engines
In this section, important search engines are considered to be, first of all, those that play a role in terms of market share, at least in a certain region or country. This does not mean, however, that other search engines cannot impress with their technology or special features (more on this in Chaps. 11 and 12). At this point, only those search engines will be dealt with that have either already achieved high market shares in at least one market or at least have the potential to expand their position in other countries.
First of all, the search engines currently significant for the German market should be mentioned: Google and Bing. Both search engines have an extensive, up-to-date, and international index, have well-functioning ranking systems, and are adapted to German conditions (concerning language and locality; see Sect. 5.5). Other search engines do not play a role in Germany, as they do not fulfil the criteria mentioned or, in most cases, display the search results from Google or Bing anyway (e.g., Yahoo; see Sect. 8.5).
In the United States, the situation is somewhat different. There are several smaller search engines there. What they have in common, however, is that they are insufficiently adapted for the German-speaking market, so their results for queries in German do not reach a quality comparable to those of Google or Bing.
The situation is similar with Exalead from France (http://www.exalead.com/search/web/). Although this search engine largely fulfils the criteria mentioned above, it also struggles to adapt to local markets. Furthermore, Exalead now focuses mainly on search solutions for companies and analysis tools based on its Web index. Therefore, it is not to be expected that the search engine, which by now should be seen more as a showcase for the capabilities of Exalead’s search technology, will be expanded further.
One search engine repeatedly said to have the potential to gain a foothold in Europe is Yandex from Russia.
This search engine achieves a share of about 44% of queries in the Russian market (Statcounter, 2021). Moreover, it has mature technology that could be made available in other countries after appropriate adaptations. Another leading search engine in its country is Baidu in China. However, this is a special case, as the Chinese state only allows search engines in heavily censored versions, and the major international search engines are systematically hindered in this market (see Jiang & Dinga, 2014). Some formerly big names such as AltaVista, Yahoo, and Ask.com are not included here. This is because they no longer exist as independent search engines:
AltaVista no longer exists; Yahoo and Ask.com exist only as search portals without their own index.
8.5 Partnerships in the Search Engine Market
An important factor in explaining why many search engine providers compete for users, but only a few of them have survived as independent providers, is the so-called partner index model. “Real” search engine providers such as Google and Bing operate their own search engines but also pass on their search results to partners. Yahoo, for example, gave up its own search engine several years ago and has been showing search results from Bing ever since. Superficially, Yahoo appears as a search engine (in its own layout and with a different presentation of the result pages compared to Bing), but the results come from Bing. Today, all major portals (such as Web.de and T-Online, for whom Web search is just one of many offerings) use this model. The partner index model is based on sharing the profits made by the clicks on text ads that come with the search results. The model is very attractive for both sides, as the search engine provider incurs only minor costs by delivering the search results to the partner; for the portal operator, the immense expense of operating their own search engine is eliminated. The main share of the costs for operating a search engine is incurred up to the point of providing the search engine (i.e., for development costs and for building and maintaining the index); the costs for processing a single query hardly play a role. The partner of a providing search engine only needs to ensure traffic on its portal; profit can be achieved with little effort in this model. It is, therefore, no wonder that hardly any alternative search engines exist or are used by portals. The partner index model is simply too lucrative for companies to be able to offer economically viable alternative solutions. On the other hand, the partner index model has ensured that the search engine landscape has been (further) thinned out (see Lewandowski, 2013). 
The lack of diversity in the search engine market can therefore be explained at least in part by the success of this model, especially since the more queries for which ads can be delivered, the higher the profits from the partner index model. Therefore, the model intrinsically favors large search engines with an extensive advertising network. Of course, this assumes that search results and ads are obtained from the same provider. And while this is the rule, it does not necessarily have to be the case (see Sect. 8.5). In any case, the search engine that can deliver the broadest range of ad results would offer the best monetization for the portal operator (assuming the proportionate profit for the portal operator would be the same).
Figure 8.1 shows a comparison of Yahoo (left) and Bing (right) search results. Since Yahoo gets its organic results from Bing, they are the same in both cases with the same ranking. However, there can be slight differences in the snippets; for example, a search portal might add an image to an organic result, display no or more ads, or use universal search results from different sources. This shows that a portal can act as a supposedly independent search engine and adjust the display of
Fig. 8.1 Different result presentation and enrichment based on the same organic results (Yahoo, Bing; January 15, 2021)
results without operating its own search engine. However, there are usually strict limits to what can be done to adapt the search results in the contracts between the search engines and their partners: for example, that the order of the results may not be changed, that no (universal search) results may be added to the result list, and so on. Like metasearch engines (see Sect. 2.4.3), the partner also receives hardly any information on individual results, but only descriptions and URLs of the search results. Let us take a closer look at which partnerships exist in the (German) search engine market. Figure 8.2, modelled on the search engine relationship chart (Clay, 2017) covering the US market, shows the linkages in the German market. Both organic results and text ads are considered. It shows that Google and Bing are the central suppliers in both areas. The chart again shows that there are superficially many search offerings, but only in very few cases are these independent search engines.
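The revenue sharing at the heart of the partner index model described in this section can be sketched in a few lines. The numbers below (click volume, average price per click, and the portal's share) are purely hypothetical; actual contract terms are not given in the text, and the function name is an assumption made for illustration.

```python
def split_ad_revenue(ad_clicks, avg_price_per_click, portal_share):
    """Split ad click revenue between a portal and its index provider.

    portal_share is the fraction (0..1) of revenue the portal keeps;
    the remainder goes to the search engine that supplies results and ads.
    """
    if not 0.0 <= portal_share <= 1.0:
        raise ValueError("portal_share must be between 0 and 1")
    total = ad_clicks * avg_price_per_click
    portal = total * portal_share
    return {"total": total, "portal": portal, "provider": total - portal}


# Hypothetical month: 10,000 ad clicks at 0.50 euros each, 60% for the portal.
result = split_ad_revenue(ad_clicks=10_000, avg_price_per_click=0.50, portal_share=0.6)
for party, euros in result.items():
    print(f"{party}: {euros:,.2f} EUR")
```

Because the provider's cost of answering one more query is small, both amounts are close to pure margin, which is why the model is attractive for both sides.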
8.6 Summary
Search engine providers earn their money almost exclusively from advertising. This is mainly displayed on the search engine result pages in the form of text ads; another form of advertising is context-based advertising on content pages of third-party providers; in this case, the context is not obtained from a query but from a full text. So-called paid search advertising is an important area of online advertising and accounts for about 40% of the online advertising volume in Germany. Based on the data from the United States, it can be assumed that the share of paid search advertising is also likely to increase considerably in Germany.
Fig. 8.2 The relationship network of search engines (Germany)
In Germany alone, more than 6 billion queries are submitted to general search engines per month. For every one of these queries, at least potentially relevant advertising can be displayed.
The search engine market (both in terms of use and ad sales) is dominated by Google. This search engine has consistently achieved a market share of more than 90% in Germany for years, with comparable results in other European countries. Besides Google, Bing also plays a role, although it lags far behind Google in market share; in Germany, its share is in the lower single-digit percentage range.
There is a supposed variety of different providers on the search engine market, but to a large extent, they display the results of one of the two big search engines, Google or Bing. In the so-called partner index model, a search engine with its own index delivers search results and text ads to a search portal; the revenue from the clicks on the ads is shared between the partners according to a predefined key. The model is very attractive for both sides and has led to a further thinning of the search engine market.
Further Reading
A look at the “Timeline of web search engines” (Wikipedia, 2021) is worthwhile for information on the development of the search engine market. The history of search engines, also concerning their commercialization and revenue models, is described in van Couvering (2007, 2008). Information on the evolution in the early years of the search engine market is also found in Battelle (2005).
References
Alphabet Inc. (2020). Annual Report 2019. https://abc.xyz/investor/static/pdf/2019_alphabet_annual_report.pdf?cache=c3a4858
Battelle, J. (2005). The search: How Google and its rivals rewrote the rules of business and transformed our culture. London: Portfolio. Brealey.
Clay, B. (2017). Search engine relationship chart. Bruce Clay. https://www.bruceclay.com/searchenginerelationshipchart
ComScore. (2020). Suchmaschinen-Ranking (Desktop) in den USA nach Anzahl der Anfragen in ausgewählten Monaten von Januar 2010 bis Oktober 2020. In Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/71500/umfrage/anzahl-der-suchanfragen-auf-suchmaschinen-in-den-usa-seit-2007/
Haucap, J. (2018). Macht, Markt und Wettbewerb: Was steuert die Datenökonomie? Nicolai Publishing & Intelligence.
Jiang, M., & Dinga, V. (2014). Search control in China. In R. König & M. Rasch (Eds.), Society of the query reader: Reflections on web search (pp. 140–146). Institute of Network Cultures.
Lewandowski, D. (2013). Suchmaschinenindices. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 3: Suchmaschinen zwischen Technik und Gesellschaft (pp. 143–161). Akademische Verlagsgesellschaft AKA.
Maaß, C., Skusa, A., Heß, A., & Pietsch, G. (2009). Der Markt für Internet-Suchmaschinen. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen (pp. 3–17). Akademische Verlagsgesellschaft AKA.
Mozilla Foundation. (2020). Independent Auditors’ Report and Consolidated Financial Statements 2019 and 2018. https://assets.mozilla.net/annualreport/2019/mozilla-fdn-2019-short-form-0926.pdf
Statcounter. (2021). Search engine market share. https://gs.statcounter.com/search-engine-market-share
Statista. (2021). Ausgaben für Suchmaschinenwerbung in Deutschland in den Jahren 2017 bis 2019 sowie eine Prognose bis 2025. In Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/456188/umfrage/umsaetze-mit-suchmaschinenwerbung-in-deutschland/
Sullivan, D. (2012). Once deemed evil, Google now embraces “paid inclusion.” Marketing Land. https://marketingland.com/once-deemed-evil-google-now-embraces-paid-inclusion-13138
Van Couvering, E. (2007). Is relevance relevant? Market, science, and war: Discourses of search engine quality. Journal of Computer-Mediated Communication, 12, 866–887. https://doi.org/10.1111/j.1083-6101.2007.00354.x
Van Couvering, E. (2008). The history of the internet search engine: Navigational media and traffic commodity. In A. Spink & M. Zimmer (Eds.), Web searching: Multidisciplinary perspectives (pp. 177–206). Springer. https://doi.org/10.1007/978-3-540-75829-7_11
White, M. (2015). Enterprise search. O’Reilly.
Wikipedia. (2021). Timeline of web search engines. https://en.wikipedia.org/wiki/Timeline_of_web_search_engines
Zenith. (2021). Prognose zu den Investitionen in Internetwerbung weltweit in den Jahren 2018 bis 2022 nach Segmenten (in Milliarden US-Dollar). In Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/209291/umfrage/investitionen-in-internetwerbung-weltweit-nach-segmenten/
9 Search Engine Optimization (SEO)
Search engine optimization (SEO) is understood to mean all measures suitable for improving the position of Web pages in search engine rankings. These measures range from simple technical steps that help make documents indexable for search engines to complex manipulations of the linking structure of pages that refer to the documents to be optimized. Search engine optimization is situated between helping content producers (and search engines) and manipulation.
Search engine optimization keeps us on the economic side of our topic but now shifts the perspective away from the search engine providers to those who want to profit from their content being found in search engines or to those who ensure that it is found.
Search engine optimization can also be seen as an applied discipline of search engine research. On the one hand, technical knowledge about search engines can be used to create and improve search engines or to develop other search solutions apart from Web search. On the other hand, this knowledge can be used to promote content in existing search engines.
It is not possible to clearly define where search engine optimization begins. If an author writes a text for the Web and takes into account, for example, that certain words, which they assume users will search for, appear several times in their text, this could already be called search engine optimization because every decision made when writing a text influences how it can be found using search engines.
However, optimization for search engines does not only include text optimization but takes place on three levels (see Richter, 2013):
1. Accessibility: This means ensuring optimal findability and indexability by search engines. The necessary measures are of a technical nature.
2. Relevance: The content is adapted and prepared for search engines. The necessary measures refer to formulating and structuring texts or entire websites.
3. Popularity: This includes all measures suitable for increasing the popularity of a document or a website. Examples include collecting relevant external links and promoting the documents via social media.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_9
Visibility in search engines can, of course, also be achieved by booking advertisements (see Chap. 10). However, each click on an ad must be paid for, whereas traffic via the organic results is free. On the other hand, there are costs for search engine optimization to increase visibility in the organic results. Optimization can be complex and expensive depending on the competition for a keyword or a combination of keywords. Search engine optimization is a component of search engine marketing (SEM; see Griesbaum, 2013; Schultz, 2009); however, both terms are often mistakenly equated. Search engine marketing refers to all marketing measures carried out using search engines. In addition to search engine optimization, this includes placing advertisements on search engine result pages (search engine advertising, SEA).
9.1 The Importance of Search Engine Optimization
One way to assess the importance of search engine optimization is to look at the sources of website traffic. Traffic refers simply to the number of visits; sources describe the origin of these visits. No source is as important for bringing in visitors as search engines. Users, first of all, turn to search engines when looking for something on the Internet. This statement may seem trivial, but next to the options of direct requests to a URL and referrals via links (including links in social media), the share of traffic that is mediated via search engines is astonishing. While websites do, in some cases, disclose their traffic figures (such as those monitored by the German IVW), they do not publish figures on the origins of this traffic or its distribution. However, it is no secret that many websites, whether commercial or not, receive a large part of their traffic via search engines and thus primarily via Google. According to research by Similarweb (www.similarweb.com), ikea.com, for example, gets 49% of its traffic via search engines, ibm.com 47%, and Wikipedia.org as much as 85%. These figures also include the traffic generated via search engine advertising (see Chap. 10), but this accounts for a very small portion in most cases. Another interesting fact is that most websites get only a small part of their traffic from social media. At ikea.com, as mentioned, 49% of visitors come via search engines (of which 85% come via organic results and 15% via ads), while only a little more than 3% come via social media services such as YouTube, Facebook, Reddit, or Pinterest. These figures give an impression of the importance of search engines in bringing visitors to websites. Unfortunately, there are hardly any aggregated figures across a large number of websites. One exception is SEMrush’s 2019 survey, which finds a 29% share for organic search, 0.5% for paid search, and 2.5% for social media across all websites (Statista, 2020). 
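The ikea.com figures can be converted into shares of total traffic, which makes the comparison with social media more direct. The sketch below uses only the percentages quoted above (Similarweb); the function name is an illustrative assumption, and the social media value is taken as roughly 3%, since the text only says "a little more than 3%".

```python
def split_search_traffic(search_share, organic_fraction):
    """Split a site's search traffic share into organic and paid parts.

    search_share:     fraction of all visits arriving via search engines
    organic_fraction: fraction of those search visits from organic results
    """
    organic = search_share * organic_fraction
    paid = search_share - organic
    return organic, paid


# ikea.com as quoted in the text: 49% via search, of which 85% organic.
organic, paid = split_search_traffic(search_share=0.49, organic_fraction=0.85)
social = 0.03  # "a little more than 3%" in the text; rounded down here

print(f"organic search: {organic:.1%} of all visits")
print(f"paid search:    {paid:.1%} of all visits")
print(f"organic search vs. social media: about {organic / social:.0f}x")
```

On these numbers, organic results alone bring in around two-fifths of all visits, more than ten times the social media share.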
If so many users reach websites via search engines, and if these visitors are also referred free of charge for the content producers, it is clear why search engine optimization is so attractive for them. Due to Google’s market dominance, search engine optimization is often explicitly focused on this search engine and geared toward exploiting its unique peculiarities and weaknesses. However, basic search engine optimization measures also help
increase visibility in all search engines, although there are peculiarities in each search engine. The success of search engine optimization can be explained similarly to that of text ads (see Chap. 10): Searchers have already revealed their interest by entering a query and can be “picked up” on the search engine result page if the query has commercial potential. It is not necessary to generate interest in buying; an existing interest demonstrated by a search query “only” needs to be satisfied. In addition, organic results enjoy a high level of trust among users. They are hardly suspected of being paid for, and it is usually assumed that they are in one of the top positions because they are simply “the best” results. In this context, search engine optimization has another significance: from the user’s point of view, there is a distortion of the results (often unnoticed by them). Assuming that without “optimizations,” the search engines would be able to evaluate the documents solely based on their quality, another influencing factor is the optimization of the document. In a way, search engines cannot ever be truly “neutral” but are influenced, on the one hand, by assumptions that go into the ranking procedures (see Chap. 5) and, on the other hand, by external factors (see also Röhle, 2010, p. 81 f.). The influence of search engine optimization is most evident in commercially viable queries for which there is high competition. Vivid examples are insurance comparisons and loans. For both, a new customer is worth a lot, and there is a market for intermediaries who receive commissions for referring new customers to providers. For queries in these areas, there are virtually no more non-optimized results at the top of the organic result lists. However, the influence of search engine optimization also extends to very specific queries; especially for niche topics, search engine optimization can be particularly worthwhile compared to other forms of customer acquisition. 
However, optimization is not only used for products and services but also for informative content. This can be seen most clearly in news websites, which, on the one hand, compete with each other, but, on the other hand, also compete with other potential sources of news like social media or blogs. Another important area is content that focuses on public relations interests of companies and organizations: There is also an interest in ensuring that information favorable to the commissioning entities is disseminated in the top positions of the organic result lists. The market for search engine optimization is now worth 80 billion dollars in the United States alone (McCue, 2018). One can assume that the influence of search engine optimization will continue to increase in the coming years. While there is hardly a gap to be found in the area of products and services in which optimization has not already been carried out, the optimization of informative content is not yet comparably advanced. In principle, one should assume that in Internet-based research, optimized documents are to be found within the organic results (especially the ones at the top). We will further discuss the consequences of optimizing informative content for the quality of search results and Internet-based research in Chaps. 13 and 15. In addition to these effects of search engine optimization, which must be critically evaluated, there are also several positive aspects. First of all, search engine
optimization can serve to make content indexable for the search engines at all. There are technical hurdles in crawling (see Sect. 3.3) that can be removed by content producers or the search engine optimizers they commission. In addition, search engine optimization also requires navigation structures within the website to be optimized, and search engine-optimized HTML source code usually also boosts accessibility (Moreno & Martinez, 2013). On the user’s side, there are, therefore, positive effects through findability, easier navigation, and possibly improvements in using assistive technologies (such as screen readers).
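One technical hurdle of the kind mentioned above can be checked programmatically. The example below assumes, for illustration only, that the hurdle is an overly broad rule in a site's robots.txt file; the chapter itself does not name specific hurdles here. It uses Python's standard `urllib.robotparser` module, and the rules, bot name, and URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file would be fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" and the example.com URLs are placeholders.
for url in ("https://example.com/products/chair.html",
            "https://example.com/private/price-list.html"):
    status = "crawlable" if can_crawl(ROBOTS_TXT, "ExampleBot", url) else "blocked"
    print(f"{url}: {status}")
```

A check like this only covers one possible obstacle; pages can also be hard to reach for crawlers for other reasons, such as missing internal links.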
9.2 Fundamentals of Search Engine Optimization
The procedures of search engine optimization use knowledge about search engine ranking factors (Chap. 5) to achieve higher visibility of specific documents or websites on search engine result pages. They are constantly adapted to the search engines’ current ranking factors and how the ranking factors are weighted. This is a much more small-scale perspective than the one in Chap. 5 because, in the end, everything that works plays a role in optimization (pragmatic approach to search engine optimization). It is true that it also makes sense for search engine optimizers to know the basic procedures and factor groups of ranking to create websites that are successful in the long term, regardless of short-term changes in search engines. However, they must also react to short-term changes in single ranking factors or how they are weighted (especially in the context of the frequent “Google updates”—for an overview, see Searchmetrics, 2021) by making adjustments. Some of the basic measures used in search engine optimization are described below. However, the presentation neither aims to be complete nor is it to be understood as a guide for doing search engine optimization. There is extensive literature on this subject directed explicitly at website owners (e.g., Enge, Spencer & Stricchiola 2015; Thurow, 2007). When we speak of the manipulation of search engines or their rankings in the following, this term should first be understood as neutral and should not imply that these are, per se, measures that lead to poorer results. We shall only evaluate this afterward, whereby we will see that it is often difficult to determine the boundary between actions that are beneficial for the users and manipulation in the negative sense. Search engine optimization has undergone considerable development over the years. 
Whereas in the beginning, it was sufficient to frequently place a keyword for which a document was to be found in the text of the document, and later it was mainly gathering external links that led to success, search engine optimization today is a complex process that involves the optimization of many different factors. The Search Engine Land website has published a compilation of the most critical factors essential for successful search engine optimization. This so-called Periodic Table of SEO Success Factors (Fig. 9.1) essentially reproduces the ranking factors described in Chap. 5, even if the subdivision differs. From the perspective of search engine optimization, the importance of the ranking factors lies in their susceptibility to manipulation, meaning that it is of particular importance to identify those factors
Fig. 9.1 The Periodic Table of SEO Success Factors (http://searchengineland.com/seotable; January 12, 2021)
that have a special significance for the ranking on the one hand and can be manipulated with reasonable effort on the other. In addition, it is of great importance for search engine optimization to know which behaviors or “tricks” are considered to be negative by the search engines. Precisely because search engine optimization has always been a matter of trial and error as to how one can cheat search engines, (often supposed) tricks are passed on, some of which, however, show no or even negative effects. Therefore, the Periodic Table of SEO Success Factors combines positive and negative factors, making it a valuable tool. If one looks at it differently, it is simply a guide on how to create an attractive website—in which case search engine optimization is more like a positive side effect. In search engine optimization, a distinction is made between on-page factors and off-page factors, whereby the former are all measures that can be implemented directly on one’s own website, while the latter are the external factors (e.g., links and likes) and thus initially seem more challenging to influence. The distinction is not directly visible in the table; however, the first three groups (Content, Architecture, HTML) refer to on-page factors and the second three groups (Trust, Links, User) primarily to off-page factors. In the following, the factors mentioned in the table are explained in separate sections based on their groups. The compilation and weighting of the factors (indicated by the numbers in the upper right corner of the boxes) are based on the experience of the authors of the chart as well as surveys and are not explained in detail here. One might also ask at which level search engine optimization should take place: at the website level or at the level of the single document? Both are correct and influence each other: optimizing one document, for example, can ensure that this
document is linked to from other documents and mentioned on social networking sites. In turn, the whole website can benefit from this, as part of the authority that the search engines attribute to the document is transferred to the website. Conversely, individual documents benefit from optimization at the website level.
9.2.1 Content
In the area of content, it is first of all important to ensure that the single documents have a certain quality of content and are well written, which means that they are easy to understand and do not contain any errors. Secondly, the content should match keywords that are actually searched for; even well-written texts with valuable content are useless in terms of search engine optimization if they are optimized for certain keywords, but there are no or hardly any searches for them. Optimizing text is always based on the interests of the searchers—it is, of course, possible to optimize specifically for rarely used queries; however, this should be done consciously. While search engines can be used for researching relevant keywords, specialized tools make this work much easier (e.g., Google Keyword Planner, https://ads.google.com/aw/keywordplanner/). A document can be optimized not only for a single keyword but also for phrases consisting of several words. Such relevant phrases should be identified to determine for which words and phrases the content should be found. Another factor already mentioned in Sect. 5.4 when explaining the ranking factors is the freshness of documents. Regarding search engine optimization, it now makes sense to keep the documents up to date and, if necessary, produce content on current topics. Using multimedia elements (videos, infographics, etc.) can significantly increase the attractiveness of documents. With their help, content can often be easily explained; therefore, they are popular with users. Another way to optimize content is to provide answers to typical user questions in the documents. This not only helps users find their way around the document but can also lead to the question and the corresponding answer being picked up by a search engine and presented on the search engine result page (see Sect. 7.3.5). 
Last but not least, document content should be “in-depth” and not just deal with the topic superficially but add value compared to other documents. Unfortunately, one major mistake is made again and again: content producers rely on low-quality texts and assume that these are good enough for search engines. Often this is done thinking that one can “outsmart” the search engines to rank one’s content high despite these measures. While such texts were successful for a long time, search engines are now better able to measure the quality of texts on the one hand, and on the other hand, they are increasingly evaluating usage data, which allows conclusions to be drawn about the (poor) quality of the texts.
9.2.2 Architecture
In the context of search engine optimization, architecture refers to the technical structure of a website. The point is to build the website technically so that search engines can index it without problems. First of all, this is about the “crawlability” of the website: all documents within the website must be easily accessible to search engines through links and technically created in a way that lets search engines capture them without problems.
Website architecture also includes considering requests from different devices. Here, preparing the content for mobile devices, which mainly comes down to being legible on considerably smaller screens, has turned out to be an essential factor. By now, both desktop and mobile use of the Web are standard. Therefore, websites should be presented well on any device. Furthermore, search engines evaluate “mobile readability”; in some cases, this goes so far that different documents are ranked preferentially for a search from a mobile device than for the same search from a desktop computer.
Ensuring that content does not appear more than once on a website (so-called duplicate content) is also important. Although search engines can usually recognize duplicate content and then decide based on other factors which version to prefer or which version(s) to exclude, this has two drawbacks. First, the decision no longer lies with the website provider. Second, authority values may be distributed over several versions of a document instead of being assigned to only one version, which weakens the document in the ranking.
The page speed of the documents is another important technical factor. Therefore, one should ensure that the documents can be loaded quickly from the server. Because of the high bounce rates for slow-loading documents (see Sect. 5.3.2), search engines pay attention to page speed and devalue pages that load slowly.
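The crawlability requirement can be illustrated with a small sketch: if the internal links of a website are treated as a graph, every document should be reachable from the start page by following links. The site structure, the URLs, and the function names below are hypothetical; real crawlers additionally deal with robots.txt, redirects, and rendering, none of which is modeled here.

```python
from collections import deque


def reachable_pages(links, start):
    """Return all pages reachable from `start` by following internal links (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def orphan_pages(links, start):
    """Return pages that exist on the site but cannot be reached from `start`."""
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return all_pages - reachable_pages(links, start)


# Hypothetical site: each key is a page, each value lists the pages it links to.
site = {
    "/": ["/products", "/about"],
    "/products": ["/products/shoes"],
    "/about": [],
    "/products/shoes": ["/"],
    "/old-landing-page": ["/"],  # links out, but nothing links to it
}

print(sorted(orphan_pages(site, "/")))
```

In this example, the only page reported is the one that no other page links to; a crawler that starts at the home page and only follows links would never discover it.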
Another technical criterion in website architecture is the use of encrypted data transmission between the user’s computer and the website using the HTTPS protocol. Pages that use HTTPS are preferred. The documents’ URLs should be kept relatively short and meaningful, and they should contain the relevant keyword(s) specified for the document. On the one hand, elements of the URLs are displayed in the snippets on the search engine result pages and thus serve as orientation for users; on the other hand, search engines also pay attention to the keywords in the URLs. Standard content management systems generate appropriate URLs automatically if the structure has been defined beforehand.
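How a content management system might derive a short, meaningful URL segment from a document title can be sketched as follows. This is only an illustration under simple assumptions: the function name `slugify` and the limit of six words are chosen here for the example and do not describe any particular system.

```python
import re
import unicodedata


def slugify(title, max_words=6):
    """Turn a document title into a short, lowercase, hyphen-separated URL segment."""
    # Decompose accented characters and drop everything that is not plain ASCII.
    ascii_text = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Keep only runs of letters and digits; punctuation acts as a separator.
    words = re.findall(r"[a-z0-9]+", ascii_text.lower())
    return "-".join(words[:max_words])


print(slugify("Understanding Search Engines: A Guide!"))
```

Limiting the number of words keeps the URL short, while using the title words keeps the relevant keywords visible in the snippet.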
9.2.3 HTML
In this area, we are concerned with how to best use the elements specified in HTML for the visibility of the document. First of all, a title must be defined with terms that are relevant to the page content. Although the title is not displayed directly in the document (usually only in the browser’s title bar), search engines use this information to generate the titles in the snippets (see Sect. 7.4). Similarly, texts placed in the meta description tag are not displayed directly with the document in the browser, but they can be used to generate snippets on the search engine result pages. Descriptions should therefore be meaningful and describe what the document offers.
Wherever possible, the search engines’ preferred format for describing structured data (Schema.org; see Sect. 7.4) should be used. Structured data is primarily used for enhanced snippets on the search engine result pages, but search engines can also use it for purposes where the actual source is no longer directly linked. Headings and subheadings should be distinguished hierarchically by heading tags (<h1> to <h6>) and contain relevant terms.
Another way to make content attractive to search engines through HTML is to use the AMP framework, which, among other things, ensures that documents are displayed more quickly on the user’s side.
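The HTML elements discussed in this section can be checked automatically. The following sketch uses Python’s standard `html.parser` module to collect the title, the meta description, and the sequence of heading tags from a document. The class name `HeadAudit` and the sample page are made up for illustration, and the sketch does not claim to reproduce how any search engine parses pages.

```python
from html.parser import HTMLParser


class HeadAudit(HTMLParser):
    """Collects the title text, the meta description, and the heading tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.headings = []  # heading tags in document order, e.g. ["h1", "h2"]
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            self.headings.append(tag)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical sample document.
page = """<html><head>
<title>Running shoes for beginners</title>
<meta name="description" content="How to choose your first pair of running shoes.">
</head><body>
<h1>Running shoes for beginners</h1>
<h2>Cushioning</h2>
<h2>Fit</h2>
</body></html>"""

audit = HeadAudit()
audit.feed(page)
print(audit.title)
print(audit.description)
print(audit.headings)
```

A missing title, a missing description, or a heading sequence that skips levels would show up directly in the three collected values.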
9.2.4 Trust
Several factors or measures can influence the level of trust that search engines assign to a website. One is the so-called authority, determined by a harmonious interaction of links, shares on social networking sites (see next section), and other factors. It is, therefore, a composite indicator that considers whether a document or website shows a similar structure (and, if applicable, choice of words) in terms of links and social media performance, for example. “Engagement” considers whether the documents are “accepted” by users who access them via a search engine. Here, the dwell time on the document is measured, whereby a short dwell time indicates that the document is not or only minimally relevant for the users. Over time, a website’s reputation is also built up in search engines. On the one hand, they examine whether the website has had the same or similar content over time. On the other hand, the age of the domain also plays a role; especially newly registered domains or domains taken over from others are critically evaluated by search engines due to suspicion of spam.
9.2.5 Links
Link building, i.e., collecting links from external sites, is the primary way to increase the popularity of a website in search engines. As already pointed out when explaining the PageRank method (Sect. 5.3.1), it is not only the number of links that influences the ranking but also their quality. What is calculated in PageRank and similar algorithms as the value of a linking page boils down to a measurement of trust or quality. Therefore, when systematically collecting links, care should be taken to prefer links from trustworthy and high-quality sites.
In Sect. 3.4.2, we already explained how search engines generate their document representations not only from the documents themselves but also, among other things, from the anchor texts of the links of documents that refer to the document to be indexed. This procedure can also be exploited in search engine optimization by collecting links that contain relevant anchor text with the desired keywords and by using meaningful anchor texts within one’s own website. This way, the document representation is enriched with the desired keywords, ensuring a better ranking.
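The idea that anchor texts enrich the representation of the linked document can be sketched in a few lines. The link data, the URLs, and the function name are hypothetical; the sketch only counts the words used in anchor texts per target document and says nothing about how a search engine actually weights them.

```python
from collections import Counter, defaultdict


def anchor_text_terms(links):
    """Count the anchor-text words of all links, grouped by the document they point to."""
    terms = defaultdict(Counter)
    for _source, target, anchor in links:
        terms[target].update(anchor.lower().split())
    return terms


# Hypothetical links: (linking page, linked page, anchor text).
links = [
    ("https://blog.example/a", "https://shop.example/shoes", "running shoes"),
    ("https://forum.example/b", "https://shop.example/shoes", "cheap running shoes"),
    ("https://news.example/c", "https://shop.example/shoes", "click here"),
]

terms = anchor_text_terms(links)
print(terms["https://shop.example/shoes"].most_common(2))
```

The third link shows why meaningful anchor texts matter: “click here” adds words to the representation that say nothing about the linked document.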
9.2.6 User-Related Factors
This section deals with factors that are directly related to a single user. This includes the location factors already discussed in Sect. 5.5; these range from the country where a user is located to more narrowly defined geographic regions such as a city or neighborhood. Such factors can be challenging to influence at first. Still, branches of a company, for example, can be presented separately and added to local search services to ensure that they can also be found in local contexts. On-page optimization also includes measures to improve the user experience in documents or within the website. For example, search engines measure whether users who access a document from a search engine result page stay there or quickly return to the result page (see Sect. 5.3.2). One aspect of search engine optimization is to increase the average dwell time or reduce the bounce rate by taking appropriate measures on the pages. In the area of user-related factors, search engines can also evaluate data collected over a longer period of time. For example, a document is displayed preferentially to an individual user if that user has requested that document several times in the past. This form of personalization naturally requires storing the respective user’s data (see Sect. 5.6); it is difficult for a website provider to influence this multiple viewing in any other way than by offering engaging content. The intent of the targeted query also plays a role in search engine optimization. Here, the question is why a certain person formulates exactly this query and how the information need behind the query can be satisfied.
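Average dwell time and bounce rate, the two measures mentioned above, can be computed from visit data as in the following sketch. The sample values and the threshold of ten seconds for counting a visit as a bounce are assumptions made for this example; search engines do not publish how they define a quick return to the result page.

```python
from statistics import mean


def engagement(dwell_times, bounce_threshold=10.0):
    """Summarize visits that arrived from a result page.

    dwell_times: seconds spent on the document per visit.
    bounce_threshold: visits shorter than this count as bounces (assumed value).
    """
    bounces = sum(1 for seconds in dwell_times if seconds < bounce_threshold)
    return {
        "avg_dwell": mean(dwell_times),
        "bounce_rate": bounces / len(dwell_times),
    }


# Hypothetical dwell times in seconds for four visits.
print(engagement([4.0, 95.0, 6.0, 135.0]))
```

Measures on the page that keep users reading raise the first value and lower the second.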
9.2.7 Toxins
Just as one can optimize one’s website for search engines, one can also degrade it. It is precisely supposed optimization measures that often contribute to a website being rated worse by search engines. It is, therefore, essential to know the typical mistakes made during optimization. Often, attempts are made to display different content to search engines and users (so-called cloaking). This is possible because Web servers can automatically identify search engines based on their user agent identifier. If a search engine makes a request, the server returns a special version of the document that has been prepared
especially for the search engine. This may seem attractive at first, but since it is ultimately a deception of the users, this method is penalized by search engines.
Other toxins include manipulating the factors considered particularly important by search engines: links and keywords. Just as there are particularly good links or sources of links (i.e., trustworthy and high-quality websites), there are particularly bad sources of links. These include all websites suspected of being spam and other low-quality offerings, for example, blogs that are only set up for search engine optimization, many open forums, and automatically generated content that contains links to external websites. Such links are not only not rated positively by search engines but may even incur penalties.
A second measure condemned by search engines is buying links. If this is discovered, the linked website is usually penalized. This does not mean that buying links no longer takes place in practice; purchased links are, however, increasingly placed more inconspicuously and better integrated into the content of the respective website.
Keyword stuffing is the constant repetition of relevant keywords to make the search engines think that the document is actually about that keyword. However, search engines have long since developed techniques that can recognize such clusters and devalue the documents accordingly. The same applies to hidden text, a formerly popular technique in which text that was produced only for search engine optimization is written either in the color of the background or in a color that contrasts only slightly with the background. This technique also leads to search engines devaluing the corresponding documents.
It should be a matter of course that a website offers its own content and does not copy content from external sites or use copyrighted content. Search engines also downgrade websites that have attracted attention in the past by copying external content. The website owner cannot directly influence this. However, one should pay attention to this if, for example, one buys or otherwise takes over a domain that was already active in the past.
Finally, it is also an issue whether the website or the document is overloaded with advertising or whether very annoying advertising (e.g., pop-ups and interstitials) is displayed. It is, of course, debatable when advertising is disturbing or when a page contains too much advertising. However, it makes sense for search engines to devalue the ranking of pages whose advertising distracts the user too much from the actual content.
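Why keyword stuffing is easy to recognize can be shown with a deliberately naive sketch that measures how large a share of a text a single word takes up. The function name, the limit of 20 percent, and the minimum word length are arbitrary assumptions for this example; the techniques search engines actually use are not public and are certainly more sophisticated.

```python
import re
from collections import Counter


def looks_stuffed(text, limit=0.2, min_length=4):
    """Flag a text in which one longer word accounts for more than `limit` of all words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    # Ignore short words such as "the" or "and", which are frequent in any text.
    counts = Counter(word for word in words if len(word) >= min_length)
    return any(count / len(words) > limit for count in counts.values())


stuffed = "cheap shoes buy cheap shoes best cheap shoes online cheap shoes"
normal = "Our guide explains how to pick running shoes that fit your stride and budget."

print(looks_stuffed(stuffed))
print(looks_stuffed(normal))
```

Even this simple share-of-words measure separates the two sample texts, which suggests why stuffed documents offer little protection against detection.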
9.2.8 Vertical Search Engines
In addition to optimizing content for general Web search, optimization for vertical search or its integration into universal search (see Chap. 6) plays an increasingly important role. In order to optimize for these verticals, such as news or image searches, one should also produce content that can be found via these collections. Search engine optimization does not only mean optimizing for the organic search results; it is about being optimally present on the search engine result pages, which
includes exploiting all types of results. This also includes direct answers on the result pages (see Sect. 7.3.5), which contain a link to the source of the answer. In the area of vertical searches, local search, image search, video search, and voice-based search are certainly the most important. For comprehensive optimization, these areas should also be considered.
9.3 Search Engine Optimization and Spam
Except for the techniques explicitly mentioned as undesirable tricks, the practices described in the previous sections are search engine optimization methods accepted by the search engine providers. Measures that follow the rules established by the search engine providers are called white-hat optimization. This contrasts with black-hat optimization, which is a matter of targeted manipulation that undermines the terms of use of the search engine providers. While black-hat methods such as keyword stuffing and cloaking used to lead to successes that often lasted into the medium term, today one can only succeed in the short term with such methods, if at all. This means that these methods are inherently unsuitable for optimizing the websites of reputable brands or service providers. However, they can still be worthwhile when it comes to quick sales in gray areas. In black-hat optimization, the terms of use and the tolerance limits of the search engines are deliberately exhausted or exceeded; punishments in the form of ranking losses and removal from the index are deliberately accepted.
The search engine providers explicitly prohibit one method of black-hat optimization, namely, the purchase or sale of links. Nevertheless, this practice has repeatedly been detected on reputable websites and, in some cases, penalized by search engines. There are also areas where one can assume that it is almost impossible to place documents in one of the top result positions without buying external links. This applies to highly contested keywords such as “health insurance comparison,” where brokers receive relatively high commissions when a contract is sold without having to invest much time and effort in providing advice themselves.
9.4 The Role of Ranking Updates
Search engines continuously improve their ranking algorithms and adapt them to changing circumstances. Reasons for this include new types of content emerging, but also spammers or search engine optimizers trying out new forms of tampering. A good indication of the problems search engines face in determining high-quality content or excluding low-quality content is provided by Google’s Quality Rating Guidelines (2020). Google not only rates (potential) search results automatically but also employs human evaluators. Search algorithms are also adjusted based on the issues found in these evaluations. Such adjustments are made continuously; however, they are mostly minor and do not significantly impact the overall search results (see Moz, 2020). The overall goal of these adjustments is to improve the quality of
search results, and the direction is clear: Google wants to show results that satisfy users. Any ranking updates are also to be evaluated in this respect: the aim is always to provide users with satisfactory results. With this goal in mind, the directive for content producers and search engine optimizers is obvious: create websites and documents that are appropriate, useful, and enjoyable for users. If one follows this directive (and the legitimate search engine optimization measures already described), one should not have to worry about ranking updates.
But why are Google’s ranking updates nevertheless assigned such excessive importance, especially in SEO blogs? The first reason is that the updates are primarily relevant for large websites that receive a lot of traffic via Google. The second reason is that the updates are relevant to all those who see search engine optimization as adapting content to the requirements of search engines and do not primarily focus on the people who want to use the content. In many cases, this results in tricks being used to improve Google rankings rather than measures being taken to improve the user experience. It is mainly such sites that are penalized in the ranking updates. Furthermore, there are, of course, a large number of websites that offer more or less the same thing and compete for the same users. From the user’s point of view, it is not important on which of these sites they buy a product or read a news item, for example. Ranking updates often cause such sites to lose traffic (to the benefit of their competitors).
9.5 Search Engine Optimization for Special Collections
Search engine optimization is often seen as an overall strategy by which all content offered by a content producer is prepared for search engines, especially Google, using the same method. However, with the by now great importance of universal search (see Chap. 6), i.e., the integration of vertical searches into the general search engine result pages, targeted optimization for these separate collections becomes necessary. Explaining the procedures for the individual collections in detail goes far beyond the focus of this book; here, vertical collections are treated more from the perspective of their integration into universal search (see Sect. 7.3.3) and as vertical search engines (see Chap. 6). At this point, we will focus on the importance of optimization for specific collections. Optimization for specific collections has been most apparent for years in the news sector: news providers compete not only for the top positions in the organic results on Google but also for the top positions in Google News and the associated top positions in the news boxes included in the universal search result pages. This naturally makes optimization more complex. Generally speaking, the principles described above also apply to search engine optimization for individual collections. However, some criteria are specific to the collections, and some known criteria must be adapted. Furthermore, search engines judge the content from the collections according to a different weighting of the
ranking factors (see Sect. 6.3). An introduction to optimization for individual collections can be found in Enge et al. (2015).
9.6 The Position of Search Engine Providers
Search engine providers and search engine optimizers have an ambivalent relationship (see van Couvering, 2007; on the perspective of search engine developers on the creation of search results, see also Mager, 2012): On the one hand, search engine providers have to protect themselves from excessive optimization to maintain a reasonable quality of search results. On the other hand, they benefit from search engine optimizers who do a lot to ensure that search engines can easily find documents. In addition, many search engine optimization measures also improve the quality of documents for users. For example, improved navigation on a website has a positive effect on detection and ranking by search engines. At the same time, the website becomes easier to use for users; they can find the content more easily. But also, on the copy level, human readers can benefit from search engine optimization measures if texts are better structured and formulated. However, there are also many texts where one can quickly notice that they were written primarily for search engines. Such “SEO texts” are characterized by frequent repetition of keywords, simple structures, and often a certain emptiness of content, which makes them less appealing to human readers. Search engine providers do not want to stop search engine optimization by any means; they just defend themselves against excessive optimization. They set the limits at their discretion. But, of course, these limits are constantly being exhausted or exceeded; new procedures are continually being developed and tested in attempts to bring one’s documents to the top. As a result, a kind of cat-and-mouse game has developed between optimizers and search engine providers. The optimizers find a loophole for placing content at the top of the search engine result pages, the search engines then close this loophole, and so on. 
If one does not want to take part in this game—and especially for smaller, more specialized websites, there is hardly any need to do so—it can only be recommended to build a website that appeals to users (and is not primarily designed for search engines) and then apply the optimization measures described above. Many other measures presented in search engine optimizers’ blogs as the latest trend do not play a role for standard websites. The much-invoked “Google updates,” in which the search engine changes its ranking algorithms, resulting in ranking gains and losses for certain websites, also speak a clear language: search engines are increasingly trying to adapt to the behavior of ordinary human users and simulate their quality assessments. On the subject of search engine optimization, search engine providers also offer valuable assistance in the form of:
• Introductory texts (e.g., Google’s “Introduction to search engine optimization”; Google, 2021a)
• Rules and regulations (e.g., Google’s “Guidelines for Webmasters”; Google, 2021b)
• Forums (e.g., Google’s “Webmaster Central Help Forum”; Google, 2021c)
• Tools that help optimize or provide data for this (e.g., Google Analytics and Google Search Console)
This underlines the fact that search results are not solely the result of the search engine providers’ specifications but are instead negotiated between search engine providers and search engine optimizers, at least in the cases that are of interest to search engine optimizers (see also Röhle, 2010, p. 163). Consequently, search engine providers regard optimizers as partners as long as they abide by the providers’ rules or the negotiated agreements.
9.7 Summary
The term search engine optimization refers to all measures that improve the visibility of certain documents or websites on the result pages of search engines. Search engine optimization thus oscillates between simple technical measures that enable search engines to index the content at all and complex attempts at manipulation. Search engine optimization has gained enormous importance in all areas where products and services are sold. Similar to context-based ads, it takes advantage of the fact that users reveal their current interest by entering a query. Documents are optimized for queries. Especially for frequently searched terms, which promise high profits if a purchase or a contract is concluded, intense competition has arisen for the best ranking positions. The importance of search engine optimization regarding content relevant to opinion formation is likely to increase further, as well. Search engine optimization techniques can be divided into on-page and off-page measures. On-page measures are concerned with optimizing the documents directly, for example, by including relevant keywords in the text and structuring the documents. On the other hand, off-page optimization deals with measures that increase the popularity of documents or websites. The most important area here is the creation of external links intended to suggest popularity to search engines. Search engine optimization measures within the framework of search engine providers’ terms of use are referred to as white-hat measures. On the other hand, black-hat optimization refers to measures that deliberately exceed the terms of use. Search engine providers do not work against search engine optimizers but see them as partners as long as they adhere to the terms of use or mutual agreements. 
As a result, search engine providers and search engine optimizers mutually benefit from each other in their shared goal of offering relevant content to users—even if their views of what is relevant to a query may well differ.
Further Reading
There is a plethora of literature on search engine optimization; the books range from concise introductions to comprehensive works. The book by Enge et al. (2015) provides a well-founded and comprehensive introduction. The book by Fox (2012) is an excellent introduction to the changes in marketing caused by search engines. If one wishes to keep up to date on SEO topics, Search Engine Land (http://searchengineland.com) is a must. This news site offers daily news from the world of search with a focus on search engine optimization and search engine marketing.
References
Enge, E., Spencer, S., & Stricchiola, J. (2015). The art of SEO: Mastering search engine optimization (3rd ed.). O’Reilly.
Fox, V. (2012). Marketing in the age of Google: Your online strategy IS your business strategy. Revised and updated. Wiley.
Google. (2020). Search quality rater guidelines. https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf
Google. (2021a). Search engine optimization (SEO) starter guide. https://developers.google.com/search/docs/fundamentals/seo-starter-guide?hl=en
Google. (2021b). Webmaster guidelines. https://developers.google.com/search/docs/advanced/guidelines/webmaster-guidelines?hl=en
Google. (2021c). Search console help. https://support.google.com/webmasters/?hl=en
Griesbaum, J. (2013). Online-Marketing. In R. Kuhlen, W. Semar, & D. Strauch (Eds.), Grundlagen der praktischen Information und Dokumentation (6th ed., pp. 411–423). De Gruyter. https://doi.org/10.1515/9783110258264.411
Mager, A. (2012). Algorithmic ideology: How capitalist society shapes search engines. Information, Communication & Society, 15(2), 769–787.
McCue, T. (2018). SEO industry approaching $80 billion but all you want is more web traffic. https://www.forbes.com/sites/tjmccue/2018/07/30/seo-industry-approaching-80-billion-but-all-you-want-is-more-web-traffic/
Moreno, L., & Martinez, P. (2013). Overlapping factors in search engine optimization and web accessibility. Online Information Review, 37(4), 564–580. https://doi.org/10.1108/OIR-04-2012-0063
Moz. (2020). Google algorithm change history. https://moz.com/google-algorithm-change
Richter, D. (2013). Suchbasiertes Onlinemarketing. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 3: Suchmaschinen zwischen Technik und Gesellschaft (pp. 163–194). Akademische Verlagsgesellschaft AKA.
Röhle, T. (2010). Der Google-Komplex: Über Macht im Zeitalter des Internets. Transcript.
Schultz, C. D. (2009). Suchmaschinenmarketing. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen (pp. 70–98). Akademische Verlagsgesellschaft AKA GmbH.
Searchmetrics. (2021). Google updates Überblick. https://www.searchmetrics.com/de/glossar/google-updates/
Statista. (2020). Distribution of worldwide website traffic in 2019, by source. https://www.statista.com/statistics/1110433/distribution-worldwide-website-traffic/
Thurow, S. (2007). Search engine visibility. New Riders.
Van Couvering, E. (2007). Is relevance relevant? Market, science, and war: Discourses of search engine quality. Journal of Computer-Mediated Communication, 12, 866–887. https://doi.org/10.1111/j.1083-6101.2007.00354.x
10 Search Engine Advertising (SEA)
Search engines earn their money primarily through advertising; other sources of income are of little importance (see Chap. 8). The significance of the text ads shown on the search engine result pages stems from the fact that they are the search engines’ source of revenue. In Germany, around 4.1 billion euros were generated with search engine advertising in 2019; approximately 4.5 billion are expected for 2021 (Statista, 2021). Two aspects are crucial for understanding search engine advertising: (1) search engine advertising is a form of search results, and (2) a distinction is made between organic results and ads on the search engine result pages. Advertisements in search engines are to be regarded as search results since they are generated in response to a query and are played out in relation to this query. The presentation of the ads is also based on the presentation of other search results. In many cases, the ads can be relevant to the search query by providing information that fits the query and helps users satisfy their information needs. The term sponsored link, while perhaps making it less clear that such links are advertisements, points out precisely that they are search results. The difference between search engine advertising and organic results does not lie in the fit to the user’s query or in the presentation, but in the difference in how the results are generated and in their position. Organic search results are automatically generated from the Web Index (see Sect. 7.3.1). Organic results are ranked on equal terms for all documents, meaning that every document included in the search engine’s Web Index potentially has the same chance of being displayed as a result for a query. In contrast, ads are only displayed if an advertiser pays for them. Figure 10.1 shows a search engine result page from Google on which four ad results are displayed before the organic results. 
This page thus shows two ranked lists: the list of ads (four results) and the list of organic results (all other results; in the figure, the first two organic results are shown). Similar forms of presentation are also found in other search engines. The two lists of results are created independent of each other, and advertisers cannot influence the organic results by booking ads. Therefore, it can also happen # The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_10
10
Search Engine Advertising (SEA)
Fig. 10.1 Search engine result page with organic results and ads (desktop; excerpt, December 4, 2020)
that the same website appears both in the list of ads and in the list of organic results. The difference then lies solely in the fact that in one case, it is paid advertising and in the other, it is a result that has reached its position due to the search engine’s relevance ranking. The fact that such a page appears twice on the search engine result page shows that the two lists are independent; there is no duplicate filtering. An example of such a case can be seen in Fig. 10.2; in each case, it is the first result from the list of ads and the first from the list of organic results. It is easy to see that the two snippets are very similar, even though some of the text elements come from different sources. In the case of the ad (top), the advertiser specified the main texts (title and short description) when booking the ad. In the case of the organic result, the title and brief description also come from the website provider. Still, they are to be understood as suggestions to the search engine, which can use these texts but shortens or modifies them if necessary (see Sect. 7.4). The figure also shows that the advertising result is marked with the word “Anzeige” (advertisement) in bold. In Sect. 10.3, we will look more closely at the extent to which users can distinguish the ads from the organic results on the search engine result pages. Figure 10.3 shows another search engine result page for the query private krankenversicherung vergleich (private health insurance comparison), but in this case, the mobile version. While the desktop version shows results from both the ad list and the organic list, we only see two advertising results in the visible area of the mobile search engine result page. To see the organic results, a user first has to scroll down. In the desktop presentation, too, the ads are always found before the organic results. The evident aim is to draw the user’s attention to the ads first.
Depending on the search engine, the number of ads displayed before the organic results is restricted; Google, for example, shows a maximum of four ads. In addition, the
Fig. 10.2 Snippet of an advertising result (top) and an organic result (bottom) for the same website for the same query (example from Google, January 6, 2021)
Fig. 10.3 Search engine result page with organic results and ads (mobile phone, December 4, 2020)
number of ads presented is not the same every time, but it is adjusted depending on the query, the amount of advertising available, and the individual user. Thus, more ads are found in highly competitive queries (such as our example private krankenversicherung vergleich) than in other, less competitive cases.
10.1
Specifics of Search Engine Advertising
The major distinguishing feature of search engine advertising compared to other forms of advertising (not only on the Web) is that it is displayed as a result of a query. This means that a user who enters a query reveals their interest in a product,
for example, and can be served advertising to match their query or the information need behind it. This context dependency of search engine advertising is a decisive strength of the model since, in the ideal case, users are shown advertising only if they are actually interested in it, and this advertising is then also relevant. In this way, scattering losses (advertising that reaches people outside the target group) are limited in search engine advertising. Furthermore, with search engine advertising, the advertiser pays per click rather than per impression; payment per impression is common in other forms of online advertising. This means that the advertiser only has to pay for the advertisement if a user is attracted by the ad, clicks on it, and thus goes to the advertiser’s website. An auction process determines the price per click. In principle, the highest bidder receives the best position on the search engine result page. However, so-called quality factors (see Sect. 10.2) are also included in the ad ranking, so that under certain circumstances an ad with a lower bid can reach the top position. The auction results in transparent click prices that are based on the actual competition for a query. These prices range from a few cents to substantial amounts depending on the competition. Booking ads is easily done online in a self-booking system. This makes it easy even for micro-enterprises with limited budgets to place ads. In addition, advertisers can easily design the ads themselves; different versions of an ad can be tested without any issues, and changes can be made at any time with ease. Users find search engine advertising less disruptive than other forms of advertising because it is text-based. The design is unobtrusive because it avoids graphic elements and the disruptive effects of banner advertising. However, this also means that the ads are very similar to organic search results, which creates a danger of their being mistaken for “real” results (see Sect. 10.3).
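The interplay of bids and quality factors can be illustrated with a small sketch. It assumes, purely for illustration, that the ranking score is the product of the bid and a single quality score between 0 and 1; the text only states that quality factors enter the ranking, not how they are combined. The names Ad and rank_ads and all sample figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ad:
    advertiser: str
    bid: float      # maximum price per click in euros (hypothetical values)
    quality: float  # assumed quality score between 0 and 1


def rank_ads(ads, slots=4):
    """Order ads by bid * quality (an assumed scoring rule) and keep the top slots."""
    ranked = sorted(ads, key=lambda ad: ad.bid * ad.quality, reverse=True)
    return ranked[:slots]


ads = [
    Ad("A", bid=2.00, quality=0.4),  # highest bid, but low quality score
    Ad("B", bid=1.50, quality=0.8),  # lower bid, but high quality score
    Ad("C", bid=1.00, quality=0.5),
]

for position, ad in enumerate(rank_ads(ads), start=1):
    print(position, ad.advertiser)
```

Under this assumed rule, advertiser B is placed ahead of advertiser A despite the lower bid, which corresponds to the case mentioned above in which an ad with a lower bid reaches the top position.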
10.2
Functionality and Ranking
As described above, as soon as a user enters a query, both the organic index and the ad index are queried. This ad index does not simply consist of all available ads, but changes dynamically based on the criteria advertisers can set for their ads. In addition to geographical constraints (e.g., that a particular ad should only be displayed in a specific city or region), these are mainly socio-demographic constraints (e.g., an ad should only be shown to women between 40 and 50 years of age) and time constraints (an ad should only be displayed at a particular time of day). In addition, the available (daily) budget plays a role; if this is used up, the ad is no longer displayed (for the rest of the day). The price for a text ad is based on the advertiser’s bid for the corresponding keyword. In the booking system, the advertiser can see immediately with which bid they can achieve which position in the ad list and adjust their bid accordingly. In addition to designing the ads and setting the price, settings can be made in the areas mentioned above to adapt the ad to specific target groups. In addition to booking ads individually, it is also possible to create large campaigns with a large number of keywords. This is particularly interesting for larger companies that want to address customers in various ways. In the context of
this book, campaign management will not be dealt with at length; the aim here is to make the ad booking process understandable in principle. Numerous books on the market deal with the topic in detail, primarily from a marketing perspective (e.g., Marshall et al. 2020). Advertisements are not only displayed on the search engine but also on search portals that display the results of the search engine (for more information on partnerships in the search engine market, see Sect. 8.5). For example, text ads booked with Google are also displayed with Web.de and T-Online. In addition to placing ads in search, Google also offers the option of placing ads with so-called content partners, i.e., on regular websites. These can be banner ads as well as text ads. In the case of text ads, the main difference is that the ads are automatically displayed to match the text available on the website and are not triggered based on a user query. At Google, this program is called AdSense. In addition to the Google advertising network, the Microsoft/Bing network plays an important role, through which Yahoo’s text ads can also be booked. As described above, the design of the ads is very much based on the design of the snippets of the organic results (see Fig. 10.2). When booking the ads, the advertiser can specify precisely which text is shown in the title and description of the ad. In addition, the URL to be displayed can be entered manually and does not necessarily have to be the “real” URL to which the customer is directed when clicking. This makes it possible to use short and “speaking” URLs that are easier to remember and, for example, do not contain parameters for tracking the user. Ad design is subject to numerous rules set by the advertising networks. In addition to content regulations, these are primarily design rules. 
For example, Google specifies that the title lines of the ads may contain a maximum of 30 characters each and the description a maximum of 90 characters (Google, 2021c). The ads’ ranking on the search engine result pages is based on an auction procedure, whereby the advertiser who bids the highest amount per click for a particular keyword achieves the best position. However, there are restrictions through so-called quality factors, which measure the attractiveness of each ad through usage analysis and affect the ad placement. The quality factors are not to be confused with the minimum requirements placed on the ads. These are, for example, a ban on repeated words and a ban on solely using capital letters (Google, 2021b). In addition, there are regulations on what content is accepted. These are based on legal regulations and search engine providers’ ideas about what is considered acceptable (see, e.g., Google, 2021e). An important issue with text ads is click fraud. This involves clicks made to harm the advertiser, who pays for each click on their ad. In the simplest case, someone clicks on their competitor’s ad to harm them financially. In addition, ads are usually given a daily budget, meaning that on a given day the ad is displayed only until a predefined sum has been spent. Fake clicks on an ad can therefore ensure that its daily budget is exhausted early and that the ad is then no longer displayed. The competitor can then achieve a top position in the ad list at a lower click price.
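The targeting constraints and the daily budget described in this section can be sketched as a simple eligibility check. This is a minimal illustration under stated assumptions, not a description of any real booking system: the class Campaign, its fields, and all values are hypothetical, and socio-demographic constraints are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    name: str
    regions: set          # regions in which the ad may be shown
    hours: range          # hours of the day in which the ad may be shown
    daily_budget: float   # euros available per day
    cost_per_click: float
    spent_today: float = 0.0

    def is_eligible(self, region, hour):
        """The ad is a candidate only if region, time, and budget constraints are met."""
        return (
            region in self.regions
            and hour in self.hours
            and self.spent_today + self.cost_per_click <= self.daily_budget
        )

    def register_click(self):
        """Every click, genuine or fraudulent, is charged against the daily budget."""
        self.spent_today += self.cost_per_click


campaign = Campaign("insurer", {"Hamburg"}, range(8, 20),
                    daily_budget=10.0, cost_per_click=2.5)

shown_before = campaign.is_eligible("Hamburg", hour=9)
wrong_region = campaign.is_eligible("Berlin", hour=9)

# Four clicks (for example, generated by a competitor) use up the budget.
for _ in range(4):
    campaign.register_click()

shown_after = campaign.is_eligible("Hamburg", hour=15)
print(shown_before, wrong_region, shown_after)
```

The sketch shows why fake clicks are attractive to a competitor: once the budget is used up, the ad drops out of the candidate set for the rest of the day, regardless of how well it matches the query.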
Of course, click fraud was automated early on. The ad networks claim to have taken numerous measures to check each click for legitimacy by checking indicators such as the temporal clustering of clicks and the location from which the clicks originate (Google, 2021d). However, all these measures have been unable to eradicate click fraud, and the issue remains relevant to search engine marketing.
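One of the indicators mentioned above, the temporal clustering of clicks, can be sketched as a sliding-window check. This is a simplified illustration and not the method used by any ad network; the function name, the window length, and the threshold are assumptions.

```python
from collections import deque


def flag_click_bursts(timestamps, window_seconds=60, max_clicks=3):
    """Return the click times at which more than max_clicks fall within the window.

    timestamps: click times in seconds, sorted ascending, for one ad and one source.
    """
    recent = deque()
    flagged = []
    for t in timestamps:
        recent.append(t)
        # Drop clicks that lie outside the window relative to the current click.
        while t - recent[0] >= window_seconds:
            recent.popleft()
        if len(recent) > max_clicks:
            flagged.append(t)
    return flagged


clicks = [0, 400, 900, 905, 910, 915, 920, 2000]
print(flag_click_bursts(clicks))
```

In practice, such a check would be only one signal among several (the text also mentions the location from which clicks originate), and a determined attacker can spread clicks over time to stay below any fixed threshold.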
10.3
Distinguishing Between Ads and Organic Results
The success of search engine advertising lies on the one hand in the possibility of very detailed targeting to the interests of the users (and thus also in its relevance to user queries) and on the other hand in its design as a form of search results. Since text ads in search engines are paid for based on clicks and not on the number of times they are displayed, search engine providers are interested in achieving as many clicks as possible. This can be accomplished, for instance, by giving the ads a particularly large amount of space in the users’ typical field of vision on the search engine result pages (see Sect. 7.2) or by designing the ads to be as similar as possible to the design of the organic results. Advertising is also subject to labelling requirements: users must be able to clearly distinguish between advertisements and “editorial content,” i.e., organic results in the case of search engines. In the major search engines, advertising is identified by one or more of the following measures:
1. Separation between organic results and advertising: As described above, the two types of results are not mixed. However, there is not always a clearly recognizable dividing line between the blocks, so that the user may get the impression that they are reading a continuous list when in fact, there are two lists.
2. Marking with the word “ad” or similar: The ads or the ad blocks are explicitly labelled as such.
3. Info button: Clicking on an info button provides explanations of how the ads came about.
However, it has become apparent that the labels used in practice are not sufficient. Users are not sufficiently able to distinguish text ads from organic results (Lewandowski et al., 2018). This is not surprising when one considers the very similar design of the two result types and the methods of separating ads and organic results.
A note on the similarity of ads and organic results can also be found in Google’s guidelines for the design of text ads: Ads “that do not conform to the unambiguous and informative presentation style of Google search results” are not allowed (Google, 2021b). Ad labelling has been changed repeatedly, with a tendency toward increasingly unobtrusive labelling (see Marvin, 2020) despite numerous complaints about mixing advertising and search results (including by the Federal Trade Commission (2013) in the United States; see Sullivan, 2013). Experiments have shown that the choice of words used in ad labelling (e.g., “sponsored link” versus “ad” or “paid
advertisement”) alone has a significant influence on click behavior (Edelman & Gilchrist, 2012). The practice of ad labelling is problematic for several reasons: First, search engines at least accept that users click on an ad result assuming that they are clicking on an organic result determined by the search engine according to fair conditions. Indeed, it has been shown that users who cannot distinguish the ads from the organic results look at the search engine result pages differently (Schultheiß & Lewandowski, 2020) and select ads more often (Lewandowski, 2017). Furthermore, advertisers are put in a situation where they are more or less forced to book additional ads for keywords for which they are ranked first in the organic results, as users view the ad list as part of a single search result list and go through it from top to bottom. Finally, the practice of ad labelling is problematic for users, as they are misled about the origin of the results and may trust a result because it was ranked particularly high by Google (i.e., from the user’s point of view, it deserves special trust), although the placement was achieved by payment.
10.4
Advertising in Universal Search Results
Advertisements can also be placed within some universal search collections. For example, advertisers can book ads in Google Maps, which, on the one hand, appear in the result list similar to organic results, but, on the other hand, can also be displayed directly with a separate marker on the map (Google, 2021a). With Google Shopping, there is even an entire universal search collection that is based solely on ads and does not contain any results that have not been paid for. The two examples mentioned show that it can be even more difficult for users to distinguish between paid and organic results on mixed result pages. This is confirmed by a large-scale study in which users had to mark ads or organic search results on result pages (Lewandowski et al., 2018). Ultimately, the current result presentations are a mixture of result types whose different origins are hardly understandable for users. Similar to the distinction between organic results and text ads, the problem with universal search results is that some placements are paid for, while others were generated “neutrally” from the respective vertical index of the search engine. In addition, some types of results that used to be generated organically are now ads. One must therefore distinguish between (1) organic universal search results and (2) paid universal search results (universal search ads). 1. Organic universal search results are results or blocks of results on the search engine result page generated from specialized search engine indexes (such as news, videos, and images), whereby all results in the index are treated equally by the search engine algorithms, and no direct payment for the rank position is possible. However, a problem arises when search engines themselves act as content producers. This is the case, for example, with Google, which operates the video portal YouTube, whose results are included in the general Google
search and in Google’s video search engine. Although the latter contains far more than just videos from YouTube, there is at least a conflict of interest if a search engine is to rank its own content. Often the exact same videos can be found on different platforms. From the search engine’s perspective, it would at least make sense to give preference to the version from its own platform since advertising placed around the video can, in turn, generate revenue. However, there are considerable difficulties in evaluating the relationship between Google and YouTube in particular: Of course, YouTube is the most popular video platform—however, the question arises whether it was able to achieve this status precisely because it was so frequently displayed in Google search results (whether organic or universal search results). 2. Paid universal search results are ads labelled similarly to text ads. However, they are given not only preferential placement but also preferential display, for example, by showing images of products. Shopping results, in particular, are results that used to be “organic” but were then switched to ads. It is more than questionable whether users have noticed this change, even if the results are now marked as ads.
10.5
Summary
Two points are crucial for understanding search engine advertising: (1) search engine ads are a form of search results, and (2) on search engine result pages, a distinction is made between organic results and text ads. If advertising is shown on a result page, the searcher will therefore see at least two ranked result lists. Search engine advertising is context-specific because ads are displayed as an outcome of a query. This means that the ads are displayed according to the specific interests expressed by searchers, which makes search engine advertising more efficient than other forms of advertising. In addition, payment for the ads is based on clicks, meaning that the advertiser only pays for users brought to their website, not for the display of the ads per se. The ranking of ads in the result list is primarily based on the advertiser’s bid (click price); so-called quality factors also play a role. Advertisers can easily create ads via a self-booking system and place them online. In addition, advertisers can further contextualize ads, for example, by region and socio-demographic criteria. A significant problem arises because users often cannot distinguish between advertising and organic results on search engine result pages. The complex result displays on universal search result pages exacerbate this problem.
Further Reading
Jansen (2011) provides a comprehensive overview of search engine advertising from an academic perspective. A more recent overview of relevant research areas can be found in Schultz (2016). It refers to consumer/user behavior, auction mechanisms for ad placement, the generation of relevant keywords, click fraud, and search engine advertising in the marketing mix. The book by Moran and Hunt (2015) is particularly interesting concerning the interplay of search engine optimization and ad booking.
References
Edelman, B., & Gilchrist, D. S. (2012). Advertising disclosures: Measuring labeling alternatives in internet search engines. Information Economics and Policy, 24(1), 75–89. https://doi.org/10.1016/j.infoecopol.2012.01.003
Federal Trade Commission. (2013). FTC consumer protection staff updates agency’s guidance to search engine industry on the need to distinguish between advertisements and search results. Federal Trade Commission. https://www.ftc.gov/news-events/press-releases/2013/06/ftc-consumer-protection-staff-updates-agencys-guidance-search
Google. (2021a). About local search ads. https://support.google.com/google-ads/answer/3246303?hl=en
Google. (2021b). Editorial. https://support.google.com/adwordspolicy/answer/6021546?visit_id=1-636455514169246896-4294143425&rd=1
Google. (2021c). About text ads. https://support.google.com/google-ads/answer/12437745?hl=en
Google. (2021d). About invalid traffic. https://support.google.com/adwords/answer/2549113?query=spam&topic=%200&type=f
Google. (2021e). Inappropriate content. https://support.google.com/adspolicy/answer/6015406?hl=en
Jansen, J. (2011). Understanding sponsored search: Core elements of keyword advertising. Cambridge University Press. https://doi.org/10.1017/CBO9780511997686
Lewandowski, D. (2017). Users’ understanding of search engine advertisements. Journal of Information Science Theory and Practice, 5(4), 6–25. https://doi.org/10.1633/JISTaP.2017.5.4.1
Lewandowski, D., Kerkmann, F., Rümmele, S., & Sünkler, S. (2018). An empirical investigation on search engine ad disclosure. Journal of the Association for Information Science and Technology, 69(3), 420–437. https://doi.org/10.1002/asi.23963
Marshall, P., Todd, B., & Rhodes, M. (2020). Ultimate guide to Google ads (6th ed.). Entrepreneur Press.
Marvin, G. (2020). A visual history of Google ad labeling in search results. https://searchengineland.com/search-ad-labeling-history-google-bing-254332
Moran, M., & Hunt, B. (2015). Search engine marketing, Inc.: Driving search traffic to your company’s website (3rd ed.). Upper Saddle River, NJ: IBM Press.
Schultheiß, S., & Lewandowski, D. (2020). How users’ knowledge of advertisements influences their viewing and selection behavior in search engines. Journal of the Association for Information Science and Technology, asi.24410. https://doi.org/10.1002/asi.24410
Schultz, C. D. (2016). An overview of search engine advertising research. In Lee (Ed.), Encyclopedia of E-commerce development, implementation, and management (pp. 310–328). IGI Global. https://doi.org/10.4018/978-1-4666-9787-4.ch024
Statista. (2021). Prognose der Umsätze mit Suchmaschinenwerbung in Deutschland in den Jahren 2017 bis 2025 (in Millionen Euro). Statista – Das Statistik-Portal. https://de.statista.com/prognosen/456188/umsaetze-mit-suchmaschinenwerbung-in-deutschland
Sullivan, D. (2013). FTC updates search engine ad disclosure guidelines after ‘decline in compliance’. http://searchengineland.com/ftc-search-engine-disclosure-164722
Alternatives to Google
11
In the previous chapters, we have considered search engines in general. Still, in accordance with Google’s overwhelming importance in the search engine market, we have focused on Google, particularly in the examples. The present chapter can also be read as a discussion of the necessity of alternatives to one’s favorite search engine (whatever that may be), despite the explicit reference to Google in the title. In most cases, we are talking about alternatives to Google simply due to market and user factors. In this chapter, the concrete providers that are relevant as alternatives are not described again. Only the providers mentioned in Sect. 8.4 will be discussed by example, or alternatives will be explicitly mentioned for particular cases. There are more theoretical reasons for using other search engines, such as ensuring diversity of opinion, also concerning the compilation of search results or the realization that it is not desirable for a single search engine to store all our queries and other usage data (see Sects. 5.3.2 and 5.6). However, even if users recognize these reasons as important, this hardly ever leads to using other search engines. On the other hand, there are concrete situations where using another search engine would be beneficial. The problem here is that we often do not recognize these situations and settle for worse results than we could achieve. Before one considers alternatives to Google, a supposedly simple question must be answered: are the market conditions perhaps simply reflecting the quality of the search engines, i.e., is Google simply so superior to the other search engines that using another search engine would not be worthwhile at all? This is certainly not the case. So far, no test has proven a qualitative advantage for Google that would be so great as to justify the current market situation (see Chap. 13). 
Although most studies show an advantage for Google (for an overview, see Lewandowski, 2015), this is not so significant that it would justify Google’s almost exclusive use. Two further findings from the studies mentioned are interesting: Firstly, in these tests, comparisons between search engines are primarily made based on average values for the result quality. In some cases, however, the results for individual queries examined are also given, and the clear result is that no search engine is
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_11
superior to the other search engines in all cases. Instead, a notable proportion of queries are best answered by one of the search engines that does not perform best on average across all queries (Griesbaum, 2004; Lewandowski, 2015). Secondly, it has been shown that the results of the various search engines, even if they come up with a similar proportion of relevant results, differ significantly: there are many relevant results for numerous queries, and on the user side, it does not necessarily matter whether specific results are displayed, as the number of relevant results far exceeds the amount that any user is willing to look at.
11.1
Overlap Between Results from Different Search Engines
The overlap between the (top) results of different search engines is surprisingly small. Differences result from indexing and ranking. This section will focus on the differences caused by the ranking. In a study by Spink et al. (2006), four search engines were compared concerning the overlap of the first page of results (consisting of the first ten organic results and the advertisements) based on a total of more than 20,000 queries. The results show that the search engines differ considerably. Only slightly more than 1% of the results were found by all four search engines; almost 85% were found by only one search engine. In contrast, Agrawal et al. (2016) find substantial overlaps in the top results of Google and Bing; however, these results are based on only 68 queries. This might indicate some changes in the two major search engines; however, since the results are an outlier compared to all older studies, further conclusions should only be drawn when a replication study with a considerably larger number of queries is available. However, it is not the exact values that are important here, but the intention is to show in principle that the results are different and that it can therefore be worthwhile to look at the results of another search engine—especially if one bears in mind that these search engines all show a considerable number of relevant results in the top positions.
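The kind of overlap figures reported in such studies can be computed with a few lines of code. The following is a minimal sketch and not the procedure of any of the cited studies; the function name, the engine labels, and the placeholder result identifiers are assumptions. For each distinct result, it counts how many engines returned it and reports the share of results found by exactly one engine, exactly two engines, and so on.

```python
from collections import Counter


def overlap_distribution(result_lists):
    """Map 'number of engines returning a result' to the share of distinct results."""
    counts = Counter()
    for urls in result_lists.values():
        counts.update(set(urls))  # count each result at most once per engine
    total = len(counts)
    by_engines = Counter(counts.values())
    return {n: by_engines[n] / total for n in sorted(by_engines)}


# Placeholder first-page results for one query from three hypothetical engines.
results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u1", "u4", "u5"],
    "engine_c": ["u1", "u2", "u6"],
}
print(overlap_distribution(results))
```

In this toy example, four of the six distinct results are returned by only one engine. A real comparison would additionally need to normalize URLs and aggregate over many queries before any conclusions are drawn.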
11.2
Why Should One Use a Search Engine Other Than Google?
There are many reasons for using a search engine other than Google. However, this is no recommendation to stop using Google altogether. The question of one’s favorite or default search engine is also one that everyone must decide for themselves. This section first discusses the reasons for using a different search engine, at least sometimes. The following section then describes the circumstances in which it is worth switching from one search engine to another.
11.2.1 Obtaining a “Second Opinion”
To begin with, the fundamental question is why Google alone is not enough. After all, users seem to have voted clearly through their usage on which search engine is not only the best but almost the only one worth using. And indeed, this argument, in the form that the competition is only a click away, is repeatedly put forward, especially by representatives of Google (see Kovacevich, 2009; Hart, 2011; Singhal, 2011). However, we have already seen that selecting a search engine at the moment of an emerging information need is not always a conscious decision but also depends, for example, on which default search engine is set in the browser (see Sect. 2.1; on the economic background of default search engines, see Sect. 8.3). Moreover, the issue is not only the actual choice but also the fundamental possibility of choosing between different algorithmic views of the (Web) world. Every ranking of search results is an interpretation and could also turn out differently; the one correct ranking does not exist. And if we really only had one form of ranking, we could compare this to a situation in which only one TV channel or newspaper exists. In these cases, it might be evident that this would be an undesirable situation: what if this broadcaster or newspaper only carried the news that was in its own interest? Of course, this comparison is flawed: search engines produce a new search engine result page for every query; they do not produce uniform programming like the mass media. Moreover, the individual search engine result pages are not compiled by humans, which is why the much-used analogy of search engines as “gatekeepers” (Machill & Beiler, 2002; Stark, 2014, among others) is also inappropriate (see Röhle, 2010, p. 30 f.). And ultimately, search engines produce individual result pages for individual users when they use personalization (see Sect. 5.6).
What remains, however, is that each search engine does this in its own way according to its criteria, which humans have previously determined, and thus produces a particular type of results—including certain tendencies in the search results. In addition, there are deliberate preferences, for example, preferring content produced by the search engine provider itself (see Sect. 11.4). This theoretical justification, which ultimately arises from the ideal of a diversity of opinion, will, however, only be briefly touched upon here; in Chap. 15, there will be a more in-depth discussion of the role of search engines as intermediaries of information. In the following, we will rather deal pragmatically with the benefits of switching from one search engine to another from the perspective of a user searching for information.
11.2.2 More or Additional Results
Two main purposes of switching to another search engine are to find more results or to get different results.
Let us first consider the first case: When would one even want to have more results? In most cases, one already gets far more results from a search engine than one can or wants to look at. But there are, of course, cases in which one finds only a little information on a topic. This may simply be because there is nothing more to be found on the Web on this topic, or because relevant results are excluded due to the keyword entered (for the choice of search terms, see Chap. 12). And finally, it may be because the search engine currently being used simply does not know the relevant documents on a topic. We have seen in Chap. 3 that search engines build up their databases mainly by Web crawling. Due to the structure of the Web, it cannot be guaranteed that the resulting database is complete and up to date. Moreover, the immense size of the Web makes it impossible for search engines to cover it completely. Consequently, we can find different documents with different search engines. However, this does not mean that there are no overlaps between them. Popular documents, in particular, are also found by most search engines and ranked high; if one is looking for an article from Wikipedia, it is hardly worth changing search engines. However, using a different search engine may be worthwhile, especially in the case of rather “rare” documents, such as those on small, not particularly popular websites. This is easiest to observe when a query returns only a few results: in that case, one can often find additional documents in another search engine.
11.2.3 Different Results
We have already mentioned that using an alternative search engine can also serve to get other search results. We have established that there is by no means “the right ranking” of search results. Moreover, for many queries, there are more relevant results than we are willing or able to look at. So, in which cases is it worth looking for other results? In answering this question, we are again helped by the distinction between informational, navigational, and transactional queries. In the case of navigational queries, it is not worthwhile to search for other results since there is simply a clearly defined correct result. Only if we do not find this result with our standard search engine is it worth switching to another search engine. In the case of informational queries, it may be worth switching if we are at all interested in more results. We assume here that a certain number of results have already been sifted through in the first search engine before a switch is considered. Ultimately, the decision is between viewing more results in the standard search engine versus switching to another search engine. One argument in favor of switching is hoping for fundamentally different search results that are a good complement. One argument against switching is that the two search engines have significant overlaps, leading to duplicates when examining the search engine result pages. However, that reason does not carry too much weight since it is only a matter of duplicates in the result lists and the documents themselves do not have to be
looked at again. In addition, the links to the documents already viewed are marked separately in color in most browsers (regardless of the current search engine). In the case of transactional queries, a distinction must be made as to whether they are intended to serve a targeted navigation with the transaction in sequence (i.e., to navigate to a specific website on which the transaction is to be carried out) or to select one of several possible relevant websites on which the desired transaction is to be carried out. In the latter case, a further distinction must be made as to whether it is a selection from several websites of equal value (e.g., several websites offering the same online game) or websites offering different variants (e.g., different versions of the same game).
11.2.4 Better Results
The issue of better results is closely related to the issue of different search results. In the introduction to this chapter, we already showed that search engines do not perform consistently in benchmark tests for single queries, meaning that it can very much depend on the query which search engine delivers the better result. This is especially true for informational queries. The problem now lies in the fact that, as a user, one cannot tell in advance for which queries a particular search engine performs best. In the studies mentioned, no connection could be made to the search topic, for example. Therefore, only trial and error can help.
11.2.5 Different Result Presentation
In describing the search engine market (Chap. 8), a distinction was made between search engines with their own index and search portals that obtain their results from another search engine. Concerning organic results, the latter were discarded as alternatives since they display the same results as the search engine that supplies them. However, the supply of search results often only concerns the organic results, not the results from the additional collections (i.e., the universal search results). The search engine result pages compiled from the different types of results can certainly vary, which can also make search portals without their own index viable alternatives, even if the number of cases is likely to be limited.
11.2.6 Different User Guidance
From the user’s point of view, a search engine should not only deliver relevant results but also be user-friendly. Although standards have emerged in the basic paradigm of making the contents of the Web searchable (see Chap. 2), in the representation of the search process (see Chap. 4), and the presentation of results (see Chap. 7), there are nevertheless differences between the search engines in user
guidance, without it being possible to say that one search engine per se has better user guidance than the others. Instead, individual preferences also play a role here: an experienced user who frequently carries out complex searches will expect different support from their favorite search engine than a user who uses search engines primarily for navigating to their favorite websites and researching general knowledge. User guidance can also be a reason to use a search engine that does not have its own index but displays results from another search engine.
11.2.7 Avoiding the Creation of User Profiles
Section 5.6 described how search engines personalize search results to improve search results and more accurately serve advertising. For this purpose, comprehensive profiles of each user are created and permanently stored. Many users do not want this; therefore, an alternative search engine can be attractive if it does not create such profiles. This is not to say that the quality of the results no longer plays a role, but one forgoes a certain level of convenience without the bulk of the results being negatively affected. It can also make sense to use search portals without their own index (Sect. 8.5). They show results from one of the major search engines, but without personalizing them. Examples of such search portals are Ecosia and Startpage; the metasearch engine MetaGer also does not collect user data.
11.2.8 Alternative Search Options
General search engines are excellent for navigational queries and not particularly complex informational and transactional queries. However, their search capabilities are not particularly well suited to complex information needs, for which it is necessary to formulate complex queries (see Chap. 12 for more details). In this case, switching to another search engine may make sense because it is simply impossible, or would be very complicated, to formulate the required query in one's usual search engine.
11.3
When Should One Use a Search Engine Other Than Google?
While in Chap. 8, a rather gloomy picture was drawn of the alternatives to Google— there considering other search engines that are suitable as substitutes for Google—a much more positive picture emerges from the use cases mentioned in the last section, since in these cases alternative search engines are considered as supplements to Google.
[Figure 11.1: Flowchart. From viewing results on the search engine result page (and interacting with results), dissatisfaction with the search results leads to one of five options: next search engine result page, modify query, change to a specific collection of the same search engine, switch to another search engine, or cancel search. Edge labels: same query; new/modified query.]
Fig. 11.1 Possible strategies when dissatisfied with search results
However, while the previous sections generally dealt with reasons for using another search engine, the focus will now be on the specific moments in the search process when it makes sense to use an alternative search engine. Apart from the not very frequent case in general Internet-based research where a user notices that the complex formulation of a certain query is not possible with their standard search engine, the dissatisfaction of a user with their current search results plays the most important role. What options are there when the search engine used does not show satisfactory results? And when does one conclude that switching to another search engine would be worthwhile in a specific case? When asking about the use of alternative or additional search engines, one can assume that a user has a favorite search engine that they use as a starting point for all their Internet-based research. On the one hand, it would not make sense to select a specific search tool before every query (a large part of the queries posed to search engines can indeed be answered unambiguously or satisfactorily; see Chap. 12); on the other hand, users usually do not think about the search tool to be used before their queries, but simply trust that their favorite search engine is the right choice for all purposes. One can therefore assume that the issue of choosing an alternative to one’s own standard search engine only comes into play when there is a certain dissatisfaction with the search results. This dissatisfaction usually does not refer to a single search result but the result set as a whole or rather to what the user perceives of the result set. This can be, for example, merely the descriptions of the first results or a mixture of a few results that were actually looked at but did not prove relevant and some snippets for which the result documents were not looked at. 
Figure 11.1 illustrates possible strategies of a user who is dissatisfied with the current search results after a more or less in-depth review of the search engine result page, possibly supplemented by interacting with one or more search results:
• First, it is possible to view further search results or snippets. This can go beyond the first search engine result page(s); in this case, the user clicks to proceed to the next search engine result page.
• A second option is to modify the query. This can be done by adding more keywords (thereby narrowing the result set), by removing keywords (thereby widening the result set), or by modifying keywords. For overviews of search tactics and query modification categories, see Anick (2003), Boldi et al. (2010), or Smith (2012).
• It is also possible to switch to results from a specific vertical collection within the search engine. This can be done by selecting the collection via a tab or clicking on a corresponding link within the universal search results display. In most cases, the query is automatically transferred when such a switch is made. However, if this is not the case, the user can either re-enter the query already used in the vertical search engine or use a modified or new query.
• Another option is to switch to another search engine. There, either the same search query is made, or the change is combined with a modification of the query.
• Finally, it is, of course, always possible to abandon the search.
This overview shows that switching to an alternative search engine is only one of several possible responses to unsatisfactory results. Choosing this option, in turn, depends on various factors, including awareness of other search engines.
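The effect of query modification described in the second option can be sketched in a few lines of Python. This is a deliberately simplified model and not how any real search engine works: it assumes a tiny invented document collection (`DOCUMENTS`) and strict AND matching, where a document is returned only if it contains every keyword. The function name `search` is a hypothetical helper.

```python
# Invented toy collection (assumption for illustration only).
DOCUMENTS = {
    "d1": "cheap flights to hamburg",
    "d2": "hamburg harbour history",
    "d3": "history of cheap flights",
}


def search(query):
    """Return the ids of documents that contain every keyword of the query."""
    keywords = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if all(keyword in text.split() for keyword in keywords)
    )


print(search("hamburg"))          # ['d1', 'd2']
print(search("hamburg history"))  # ['d2']        adding a keyword narrows the result set
print(search("history"))          # ['d2', 'd3']  a different keyword changes the result set
```

Under these assumptions, each added keyword can only keep the result set the same or shrink it, and each removed keyword can only keep it the same or enlarge it.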
11.4
Particularities of Google due to Its Market Dominance
While in the previous sections, we discussed some reasons for switching (at least temporarily) from one’s favorite search engine (whichever it is) to an alternative, the following section deals with reasons (again, in some instances) for not using Google. The focus on this search engine again results from its market dominance; it should not be ruled out that if another search engine had a similar position in the search engine market, similar problems would arise. The problems described in the following are not so much problems to be named theoretically (these were listed in Sect. 11.2) but issues that arise from the concrete practice of the search engine Google. Thus, it is not a matter of systematic distortions but of distortions that occur due to practical decisions made by a specific search engine. In general terms, Google is accused of placing its own services prominently on the search engine result pages, even if competitors’ services should be preferred according to objective criteria (i.e., given “equal opportunities for all” in the composition of the result pages). However, it is difficult to prove that other services would have been the better choice in a specific case or should have been listed higher than the respective Google service. It is precisely the aforementioned “objectivity” of the algorithms that is problematic here: equal treatment would also exist if, for example, a factor were given particular weight that would indirectly prefer Google’s own offerings. In addition, as we have seen in Chaps. 5 and 9, it is impossible to precisely trace a concrete result ranking back to single factors. This makes it difficult to prove
preferences empirically. In the following, the individual allegations of deliberate bias in Google’s search results are explained in detail:
1. Google prefers its own vertical search engines within the universal search result pages: Google operates numerous vertical search engines in addition to Web search, such as for images, news, and local business listings. There are competing offers not only to Google’s Web search but also to its vertical search engines. The accusation is that Google, through the placement of its vertical search engines within the universal search result pages, directs its users to its own offerings, even if these do not deliver better (or even deliver poorer) results than its competitors. The underlying question is whether Google is exploiting its dominant market position, which has led to antitrust proceedings against Google filed with the European Commission (European Commission, 2017). Studies of Google’s proposals in these proceedings have shown that the prominent presentation of universal search results plays a significant role in the perception of single results (Möller, 2013) and in the selection behavior of users on the search engine result pages (Lewandowski and Sünkler 2013). Accordingly, the European Commission decided that Google must enable its competitors in product search to enjoy the same conditions as its own shopping search engine in the context of inclusion in universal search. This decision also has implications for all other vertical search engines within the universal search result pages, as the same issues and problems exist there.
2. Google introduces vertical search engines and later turns them into advertising offers without users being sufficiently aware of this: Vertical search engines that have been made popular by being integrated into the universal search result pages also have a high potential as advertising offerings.
For example, for several years, Google offered a shopping search engine (Google Shopping) whose database consisted of product catalogues uploaded by retailers and aggregated by the search engine. In 2013, Google changed the system: Since then, the Google Shopping database has consisted entirely of advertisements, i.e., only those products or retailers willing to pay for ads (or the clicks on them) are included. On the search engine result pages, these ads are marked relatively inconspicuously; confusion with search results generated without payment from a general index is thus quite likely. Again, since not all retailers are willing to pay for inclusion in Google’s shopping index, other product search engines may even provide more complete results but be listed less prominently by Google.
3. Google prefers its own collections within the search results or the universal search result pages: This covers collections such as Google Books or Google Street View. These Google-owned collections are not available to other search engines for indexing and are prominently integrated into Google itself. In addition, offerings such as YouTube should also be mentioned here: YouTube is operated by Google, and YouTube results appear very frequently in the top result positions on Google. This only becomes a problem when one knows about the connection between YouTube and Google. However, the case is difficult to judge, as YouTube is indisputably the largest video platform on the Internet (BLM,
2016) and YouTube results are relevant in many cases; one can, however, also ask whether YouTube’s popularity is not at least to an appreciable extent due to frequent listings on Google search engine result pages.
4. Google prefers documents from its own social network: Google operated its own social network, Google+ (discontinued in 2019), which, however, lagged behind Facebook in terms of usage. Entries from Google+ were preferred in the regular Google result lists by showing the author’s profile picture in addition to the standard snippet, which attracts more attention.
5. Google prefers its own services in the organic results: While all other accusations are related to universal search elements or the presentation of search results, this one is about an actual manipulation of the organic search result list. In the past, Google was accused of preferring its own offerings through so-called hard coding. For example, for specific keywords (such as e-mail), Google’s own offering was listed in the first result position. In contrast, for queries manipulated with special characters (which are ignored during the processing of the queries), the result ranking was different (Edelman, 2010). However, this issue was not pursued further, and it is questionable whether a search engine is really well advised to manipulate result lists in such a rather clumsy way.
The problems mentioned result from the fact that Google is no longer just a search engine but is itself a content and platform provider. This creates confusion between the role of an information intermediary, whose task is to guide users to the best/most relevant results, and the role of an information provider, whose interest is to give its own content as much visibility as possible. We will return to the role of search engines as information providers in Chap. 15.
What do the problems mentioned above mean for Internet-based research?
First, the results from Google’s vertical search engines displayed on the universal search result pages need not be the best possible results from a vertical search engine in that subject area. There may be another vertical search engine on the same subject area that would deliver better results but is not displayed by Google or not in a prominent position and/or in a comparably prominent display.
11.5
Summary
The results of different search engines show only slight overlaps, i.e., the individual search engines deliver different results for the same queries, which can, however, be equally relevant. The differences between the search results are due partly to the search engines’ different databases but, above all, to different interpretations of the queries and the Web content. Theoretically, the need for alternative search engines can be derived from the idea of diversity of opinion. However, on a pragmatic level, there are different reasons for using another search engine to complement Google. These include more or additional results, better results, a different result presentation, different user guidance, and other or better search options.
The decision to use an alternative search engine often arises in the middle of the search process when a user realizes that the results displayed by the search engine they are currently using do not satisfy them. However, selecting another search engine is only one of several possible strategies. While the reasons mentioned above for switching search engines apply in principle to every search engine used as the default, some reasons speak specifically against using Google or against using Google alone. For example, Google sometimes prefers its own offerings on the search engine result pages, even if competing offerings may deliver equivalent or even better results.
Further Reading
For a discussion of alternative search engines from the perspective of Internet-based research, see the excellent handbooks by Hock (2013) and Bradley (2017). In addition, a discussion of alternative search engines from the perspective of the ideological assumptions underlying the respective search engines can be found in Mager (2014).
References
Agrawal, R., Golshan, B., & Papalexakis, E. (2016). Overlap in the web search results of Google and Bing. Journal of Web Science, 2(1), 17–30. https://doi.org/10.1561/106.00000005
Anick, P. (2003). Using terminological feedback for web search refinement: A log-based study. In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval (pp. 88–95). New York: ACM. https://doi.org/10.1145/860435.860453
BLM. (2016). Marktanteil von Video-Sharing-Plattformen in Deutschland im 1. Halbjahr 2016. Statista – Das Statistik-Portal. https://de.statista.com/statistik/daten/studie/209329/umfrage/fuehrende-videoportale-in-deutschland-nach-nutzeranteil/
Boldi, P., Bonchi, F., Castillo, C., & Vigna, S. (2010). Query reformulation mining: Models, patterns, and applications. Information Retrieval, 14(3), 257–289. https://doi.org/10.1007/s10791-010-9155-3
Bradley, P. (2017). Expert internet searching (5th ed.). Facet Publishing. https://doi.org/10.29085/9781783302499
Edelman, B. (2010). Hard-coding bias in Google “algorithmic” search results. http://www.benedelman.org/hardcoding/
European Commission. (2017). EU competition investigation AT. 39740-Google. http://ec.europa.eu/competition/elojade/isef/case_details.cfm?proc_code=1_39740
Griesbaum, J. (2004). Evaluation of three German search engines: Altavista.de, Google.de and Lycos.de. Information Research, 9(4), 1–35. http://informationr.net/ir/9-4/paper189.html
Hart, K. (2011). Mr. Schmidt goes to Washington. Politico. https://www.politico.com/story/2011/09/mr-schmidt-goes-to-washington-063989
Hock, R. (2013). The extreme searcher’s internet handbook: A guide for the serious searcher (3rd ed.). Information Today.
Kovacevich, A. (2009). Google’s approach to competition. Google Public Policy Blog. http://googlepublicpolicy.blogspot.com/2009/05/googles-approach-to-competition.html
Lewandowski, D. (2015). Evaluating the retrieval effectiveness of web search engines using a representative query sample. Journal of the Association for Information Science & Technology, 66(9), 1763–1775. https://doi.org/10.1002/asi.23304
Lewandowski, D., & Sünkler, S. (2013). Representative online study to evaluate the commitments proposed by Google as part of EU competition investigation AT. 39740-Google: Report for Germany. http://searchstudies.org/wp-content/uploads/2015/10/Google_Online_Survey_DE.pdf
Machill, M., & Beiler, M. (2002). Suchmaschinen als Vertrauensgüter. Internet-Gatekeeper für die Informationsgesellschaft? In D. Klumpp, H. Kubicek, A. Roßnagel, & W. Schulz (Eds.), Informationelles Vertrauen für die Informationsgesellschaft (pp. 159–172). Springer. https://doi.org/10.1007/978-3-540-77670-3_12
Mager, A. (2014). Is small really beautiful? Big search and its alternatives. In R. König & M. Rasch (Eds.), Society of the Query Reader (pp. 59–72). Institute of Network Cultures.
Möller, C. (2013). Attention and selection behavior on “universal search” result pages based on proposed Google commitments of Oct. 21, 2013: Report about an eye tracking pilot study commissioned by ICOMP Initiative for a Competitive Online Marketplace. Köln.
Röhle, T. (2010). Der Google-Komplex: Über Macht im Zeitalter des Internets. Transcript. https://doi.org/10.14361/transcript.9783839414781
Singhal, A. (2011). Supporting choice, ensuring economic opportunity. https://publicpolicy.googleblog.com/2011/06/supporting-choice-ensuring-economic.html
Smith, A. G. (2012). Internet search tactics. Online Information Review, 36(1), 7–20. https://doi.org/10.1108/14684521211219481
Spink, A., Jansen, B. J., Blakely, C., & Koshman, S. (2006). A study of results overlap and uniqueness among major web search engines. Information Processing & Management, 42(5), 1379–1391. https://doi.org/10.1016/j.ipm.2005.11.001
Stark, B. (2014). “Don’t be evil”: Die Macht von Google und die Ohnmacht der Nutzer und Regulierer. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche – Suchmaschinen im Spannungsfeld zwischen Nutzung und Regulierung (pp. 1–19). De Gruyter. https://doi.org/10.1515/9783110338218
Search Skills
12
In the last chapter, we have already seen that one cannot always rely on an “objective” search result ranking. Search engines use the search engine result pages to promote their own services. This behavior will not be evaluated at this point; we will return to it in Chap. 15. The main way to become independent of the standard composition of search results is simply to put more energy and care into formulating queries. However, this does not mean that one has to think long and hard before every query about how best to formulate it. Instead, it is a matter of recognizing when it is worthwhile to formulate one’s queries carefully and in which cases one can achieve better results. Formulating queries has two components: on the one hand, it is about choosing the appropriate keywords, and on the other hand, it is about qualifying the query using special commands and operators that allow controlling the result set. To be able to search in this way, one must, of course, learn the commands and operators. One objection often raised against learning operators and commands that make complex Internet-based research possible in the first place can be summed up in the sentence “I always find what I’m looking for.” But is this really the case? In explaining how users arrive at this opinion, we are again helped by Andrei Broder’s (2002) classification of search query types, supplemented by the question of how comprehensive the result a user wants should be (Table 12.1). The issue now is how well a user can actually judge whether the search result they find is relevant and/or complete. Here, we tie in with the discussion on changing the search engine due to the failure or success of the search (Sect. 11.2). 
Let us first consider the case of navigational queries: Since, in this case, only one document is relevant and it can be easily determined whether it was found or not (and whether it is in the first position of the result list), a user can determine whether the query was successful or not. In the case of transactional queries, a distinction must be made once again between the case in which the target website on which the transaction is to take place is already known (and the decision about the success of the search can be made unambiguously) or whether the transaction can be carried out on different websites
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_12
Table 12.1 Assessability of search success by query intent (adapted from Lewandowski, 2014, p. 46 f.)
Row categories: Unambiguously assessable; Not unambiguously assessable
Navigational: Search for an already known document
Informational: Search for a fact; search for trivia; informational search where information from a specific source is expected (e.g., Wikipedia); classic information search to gain a complete picture or a comprehensive overview
Transactional: Search for a known website on which a transaction is to be carried out; several variants of the transaction are possible
[Figure 12.1: Schematic with the axes recall and precision, each scaled to 100. It contrasts “average” information retrieval with the gain through elaborate search and marks the “Holy Grail” of information retrieval.]
Fig. 12.1 Improving the quality of results through elaborate search (translated from Stock, 2007, p. 64)
(in which case it is difficult to decide whether the search engine has actually found the best website for this purpose). In the case of informational queries, things are more complex. For example, suppose one searches only for a specific fact or item of trivia (e.g., the date of the first moon landing or a list of films in which a certain actor has played). In that case, it can be decided quickly and confidently whether the query has led to success. The situation is different, however, with informational queries that correspond to the demands of classical information research, where the aim is to obtain a complete or at least comprehensive picture of a topic. In these cases, one can never know whether the results found are actually the best possible ones or whether there are better results or simply more results that examine the topic in greater depth or from a different angle. Although this dilemma cannot be resolved, better results can be achieved through better Internet-based research. Figure 12.1 schematically shows the gain from an elaborate search with the goal of finding all relevant documents on the topic and, at the same time, only relevant documents (i.e., no irrelevant results). In Chap. 13, these two requirements will be formalized in the metrics precision and recall; for the current context, the
requirement to find all relevant results (recall) is sufficient. Stock (2007, p. 57; also see Stock & Stock, 2013, p. 114–5) calls this the “Holy Grail” of information retrieval since this goal cannot be achieved in practice. Phil Bradley’s quote from his book Expert Internet Searching should also be understood in this sense: “Effective internet searching is part science, part art, part skill and part luck” (Bradley, 2017, p. 18). What is meant by this is that, on the one hand, effective searching can be learned by learning commands and the functionality of search engines, combined with skills in research that can be acquired through practice. But, on the other hand, intuition and a dose of luck are also necessary. We will naturally focus more on the first two points in the following. And here, too, only a small section of the basic techniques of Internet-based research can be addressed; those who want to delve deeper into this topic are recommended to read the books listed at the end of the chapter.
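The two requirements, finding all relevant documents and only relevant documents, can be illustrated with a minimal Python sketch that anticipates the formal treatment in Chap. 13. It relies on an assumption that rarely holds in real Web search: that the complete set of relevant documents is known. The document ids and the function name `precision_recall` are invented for this illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result set, given the known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    # Precision: share of the retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four results found, five relevant documents exist.
found = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]

print(precision_recall(found, relevant))     # (0.5, 0.4)
print(precision_recall(relevant, relevant))  # (1.0, 1.0), the unattainable ideal
```

The second call corresponds to the ideal in which every relevant document is found and nothing irrelevant is returned. In practice, the relevant set is unknown to the searcher, which is why recall can only be estimated.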
12.1
Source Selection
If one wants to solve more complex research problems, it is worthwhile to first think about the choice of search tool. Of course, there is no harm in first entering one or a few queries into a search engine to try out what results are returned. However, this trial and error should not replace actual Internet-based research. Of course, search engines are the right tool for Internet-based research in many cases, but one should not forget that other tools may be more suitable in some instances. The following are the main alternatives:
• Vertical search engines: There are countless vertical search engines for particular subject areas, document types, and otherwise selected areas of the Web (see Sect. 6.1). These often allow more targeted Internet-based research than general search engines. For example, if one already knows that one is looking for news on a topic, it would be better to select a vertical search engine for news content directly. One should not rely on the fact that the general Web search will always display the appropriate results from the corresponding collections in universal search. If one has an information need that can be better satisfied by searching in a vertical search engine, it is sensible to search there directly.
• Deep Web databases: Numerous databases can be accessed via the Web whose content cannot be found with the general Web search engines (see Chap. 14 for more details). In this case, one should ask whether there could be a specialized database worth searching. Assuming such a database exists, one can search for it in a search engine. Often it is enough to enter the desired topic and the keyword “database” to find what you are looking for. Internet-based research is therefore carried out in two steps: first, a source is searched for (navigational or transactional query), and then a search is done in this source (informational queries).
• Searching outside the Web—In many cases, Internet-based research or compiling the information found scattered there means considerable effort. If one wants to inform oneself comprehensively and systematically about a topic, one should ask
218
12
Search Skills
whether there are more suitable sources available outside the Web, for example, books. Let’s take the content of this book as an example: Of course, in many cases, it would be possible to gather much of the information presented in this book from the Web and summarize it accordingly. However, the effort would be many times greater than searching for this book and then acquiring the knowledge presented in it. Even if we look at less extensive cases, it becomes clear that research on the Web is not always efficient. Although in many cases, we may achieve an equally good result with Web search (so the research is effective), it often takes much more time than searching for and reading a systematic source, which would make the research efficient. Of course, the aforementioned possibilities and sources can also be combined in the context of more extensive Internet-based research, and this is often done in practice. However, particularly in the case of more in-depth research, for example, for a student term paper, one will initially research on the Web but quickly realize that Web-based research alone is neither sufficiently effective nor particularly efficient. Better research, therefore, begins with selecting the appropriate source. In many cases, this will be a search engine; however, one should not forget about other possible sources.
12.2
Selecting the Right Keywords
Of course, the choice of keywords is crucial for the success of Internet-based research. However, one is often blinded by the fact that search engines find something for almost every keyword—just because something is found, one assumes that the best possible result has also been found. It is, therefore, worthwhile to think about alternatives to the keywords used. These can be synonyms (i.e., different words with the same meaning) but also related terms. Different keywords can be combined in refined queries later if necessary (see Sect. 12.4). Particularly if one continues to refine a query by adding further keywords, the problem can arise that documents which contain most but not all of the entered keywords are excluded, even though they may well be relevant. In the case of longer queries, it may be worth experimenting with different keyword combinations.
12.3
Boolean Operators
Boolean operators can be used to formulate queries more precisely. With their help, it is possible to limit or expand the result set to have suitable—and ideally only suitable—documents displayed on the search engine result pages.
Fig. 12.2 Set diagrams to illustrate the effect of the Boolean operators (panels: A OR B, A AND B, A AND NOT B)
The Boolean operators and the Boolean logic behind them form the basis for working with result sets, even if commands are used that are not part of the Boolean operators. Boolean operators are not specific to search engines. Since it is essential to understand Boolean logic to do targeted searches and explain how search engines process queries, they will be introduced in detail here. Queries with Boolean operators can be entered not only in search engines but also in most information retrieval systems, for example, in most databases of the Deep Web (see Chap. 14). There are only three Boolean operators, but they can be used to construct complex queries with the help of parentheses, since the operators can be combined freely. The Boolean operators are AND, OR, and NOT:
• AND: With AND, two keywords are connected so that only documents containing both are found. For example, the query coffee AND tea will only find documents containing both the words coffee and tea; the occurrence of only one of these two terms is not sufficient. Adding another keyword which must also be contained in the document reduces the result set. Fewer documents contain a specific keyword and another keyword together than documents that contain the first keyword (regardless of whether they also include the second keyword or not; see Figure 12.2, middle).
• OR: A combination of two keywords with the operator OR expresses that documents are to be found in which either one or the other keyword occurs or both. In this way, the Boolean OR differs from our everyday use of the word or. For example, if we ask “Would you like coffee or tea?”, our guest should choose one of the drinks, but not both. However, the Boolean OR (formulated as coffee OR tea) would include this case. By using OR, the result set is expanded. The set of documents containing one or the other term (or both) is larger than the set of documents containing only one of the terms.
• NOT: With NOT, a keyword is excluded, i.e., one term is searched for, but another is excluded. For example, coffee NOT tea would find documents containing the word coffee but exclude all documents containing the word tea. NOT reduces the result set. The set of documents that contain a particular word is larger than the set of documents that contain this word but not another.
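The effect of the three operators on a result set can be illustrated with ordinary set operations. The following is a minimal sketch in Python, not a description of how any search engine is implemented; the tiny document collection and the helper name containing are invented for illustration.

```python
# Toy collection (made-up data): document ID -> words the document contains.
docs = {
    1: {"coffee", "tea", "sugar"},
    2: {"coffee", "milk"},
    3: {"tea", "lemon"},
    4: {"water"},
}


def containing(word):
    """Return the IDs of all documents that contain the given word."""
    return {doc_id for doc_id, words in docs.items() if word in words}


coffee = containing("coffee")
tea = containing("tea")

and_result = coffee & tea   # coffee AND tea: both words must occur
or_result = coffee | tea    # coffee OR tea: one word, the other, or both
not_result = coffee - tea   # coffee NOT tea: coffee without tea

print("AND:", and_result)
print("OR: ", or_result)
print("NOT:", not_result)
```

In this toy collection, AND and NOT each return a subset of the documents containing coffee, while OR returns a superset, which mirrors the narrowing and expanding effects described above.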
Figure 12.2 illustrates the subsets achieved using the respective Boolean operators. The Boolean operators can be combined freely to formulate complex queries. On the one hand, a single Boolean operator can be used to string together multiple keywords, for example, coffee AND tea AND sugar AND milk. With each AND, the result set is further restricted; we are familiar with the case where we simply append one or more additional words to our previous query if we are unsatisfied with the search results. This is nothing more than the construction of an AND query: the search engine interprets the blanks used in the stringing together of keywords as a Boolean AND. On the other hand, the operators can also be combined. (coffee OR tea) NOT (coffee AND tea), for example, would be the expression of the exclusionary OR problem mentioned above: one would only find documents mentioning coffee or tea, but not both in one document. When constructing complex Boolean queries, one must be careful not to make them misleading. It helps to use parentheses that indicate the processing steps. The search argument just mentioned should serve as an example, which is shown here in different variants:
1. (coffee OR tea) NOT (coffee AND tea)
2. coffee OR (tea NOT coffee) AND tea
3. coffee OR tea NOT (coffee AND tea)
4. coffee OR tea NOT coffee AND tea
The first case describes the already mentioned excluding OR (“either-or”). In the second case, three subsets would be formed in the first step: all documents containing coffee, all documents containing tea but not coffee, and all documents containing tea. The combination of these three sets would not be unique or would depend on whether the AND or the OR would be processed first by the search engine. Therefore, an additional parenthesis would be necessary in this case to make the query precise. The third case again describes three subsets: all documents containing coffee, all documents containing tea, and all documents containing both coffee and tea. Here, too, the lack of parentheses gives rise to different possible interpretations. Finally, the last case shows the query without any parentheses. This results in several possible interpretations. Since we want to ensure that the query is unambiguous when entering Boolean queries, it is particularly important to set the parentheses correctly. Incorrectly entered Boolean queries often achieve the opposite of what is intended: For example, the result set is expanded instead of narrowed down. A simple rule is that parentheses must be set as soon as different Boolean operators are used in a query. So, while one does not need parentheses when the same operator is used several times (e.g., in the query carrot OR beet OR parsley OR radish), a query that contains different operators can only be processed meaningfully if parentheses are set.
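Why the missing parentheses matter can be made concrete with a small sketch. The example below uses Python sets on invented data and evaluates the fourth variant, coffee OR tea NOT coffee AND tea, under three possible groupings; the groupings are chosen for illustration and do not claim to reflect the processing order of any particular search engine.

```python
# Made-up document IDs: which documents contain which word.
coffee = {1, 2}   # documents containing "coffee"
tea = {1, 3}      # documents containing "tea"

# Grouping A: (coffee OR tea) NOT (coffee AND tea), the "either-or" reading
grouping_a = (coffee | tea) - (coffee & tea)

# Grouping B: coffee OR ((tea NOT coffee) AND tea)
grouping_b = coffee | ((tea - coffee) & tea)

# Grouping C: ((coffee OR tea) NOT coffee) AND tea, strictly left to right
grouping_c = ((coffee | tea) - coffee) & tea

print(grouping_a)
print(grouping_b)
print(grouping_c)
```

The three groupings return three different result sets for the same sequence of keywords and operators, which is why a query that mixes operators should always carry explicit parentheses.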
Furthermore, the Boolean operators must always be written in capital letters within queries; otherwise, they are interpreted as keywords. Theoretically, Boolean operators can be used to form queries of any length. With Google, however, this is not possible; here, one can only formulate Boolean queries of limited complexity, and even with these, there are often undesirable effects. Google also does not accept the NOT operator and instead requires a minus sign to be entered directly before the word to be excluded (i.e., without a space). Google is unsuitable for a targeted search with Boolean operators; switching to another search engine is advisable. While the Boolean operators form the basis for combining keywords, there are other operators that build on the Boolean operators, expand them, or differentiate them. While these are common in professional information research and the systems used there, they are usually not supported by search engines. However, some search engines have created their own search vocabulary, which is not standardized and often only adapted to special cases (see Sect. 12.6). Selected examples will be given in this chapter, but it is impossible to go into the search vocabulary of the individual search engines in full. The two major search engines, Google and Bing, each provide an overview of the operators and commands they support at https://support.google.com/websearch/answer/2466433?hl=en and https://msdn.microsoft.com/en-us/library/ff795667.aspx, respectively.

Examples of How to Use Boolean Operators

Searching with Synonymous Keywords
If one wants to search for two words with the same meaning, one combines them with OR. An example is the query aubergine OR eggplant. If one were to search for one of the two words, one might miss relevant documents. By using the operator, one avoids having to enter two queries in succession, which would partly produce the same results, as both words occur in many documents.
Searching with Synonyms and Other Keywords
In this case, we again search for two (or more) synonyms, but this time in combination with another keyword that should be contained in all documents found. The query (aubergine OR eggplant) AND recipe will return documents containing the word recipe in any case and at least one of the words aubergine or eggplant. Since two different operators are used, it is necessary to set parentheses. In this case, the parentheses express that from the set of all documents that contain aubergine or eggplant, those are to be found that also contain the word recipe. This query could also be split into two separate AND queries: aubergine AND recipe and eggplant AND recipe. As in the first example, however, it makes sense not to submit the queries one after the other but to combine them directly so that one does not have to look through two result lists containing some of the same documents. The parentheses in Boolean search queries can be resolved according to the usual rules: Thus, the search queries (aubergine OR eggplant) AND recipe and (aubergine AND recipe) OR (eggplant AND recipe) are equivalent and produce the same results.

Joining Two OR Arguments with AND
Of course, the OR connection is not only suitable for synonyms. For example, let’s look at the query (aubergine OR eggplant) AND (cooking OR frying). In the first part, we again have the two synonyms; in the second part, two words that are not synonyms are used. The combination with AND is similar to the previous example, but now the two OR links with two keywords each result in four combinations. In other words, one could also express this query in four single queries, namely:
aubergine AND cooking
aubergine AND frying
eggplant AND cooking
eggplant AND frying
Here, it becomes clear that entering these four queries and viewing the four different result lists would be much more tedious than combining them in one query and then viewing only one result list.
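The way two OR groups joined with AND unfold into single queries can be sketched in a few lines of Python. This is only an illustration: the helper name expand and the small document collection are assumptions made for the example, and the check at the end shows on toy data that the combined query and the four single queries select the same documents.

```python
from itertools import product


def expand(*or_groups):
    """Turn OR groups joined by AND into the equivalent list of AND queries."""
    return [" AND ".join(combo) for combo in product(*or_groups)]


single_queries = expand(["aubergine", "eggplant"], ["cooking", "frying"])
for query in single_queries:
    print(query)

# Toy collection (made-up data): document ID -> words the document contains.
docs = {
    1: {"aubergine", "cooking"},
    2: {"eggplant", "frying"},
    3: {"aubergine", "recipe"},
    4: {"eggplant", "cooking", "frying"},
}


def containing(word):
    """Return the IDs of all documents that contain the given word."""
    return {doc_id for doc_id, words in docs.items() if word in words}


# (aubergine OR eggplant) AND (cooking OR frying) as one combined query
combined = (containing("aubergine") | containing("eggplant")) & (
    containing("cooking") | containing("frying")
)

# The same information need expressed as four single AND queries, merged
merged = set()
for first, second in product(["aubergine", "eggplant"], ["cooking", "frying"]):
    merged |= containing(first) & containing(second)

print(combined == merged)
```

With three OR groups of two keywords each, the same helper would produce eight single queries, which shows how quickly the manual approach becomes tedious.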
12.4
Connecting Queries with Boolean Operators
In the previous section, we described ways in which Boolean operators can be used to connect keywords. This provides a powerful vocabulary that allows the construction of complex queries. But Boolean operators can also be used to connect queries. For example, let’s assume we are researching a topic and sending off different queries one after the other, some of which may have been constructed with the help of Boolean operators. Since we are constantly working in the same subject area, many results will undoubtedly appear again and again in the result lists. At first glance, this may not seem particularly bad, but if one does more in-depth research and poses many queries, looking through such result lists can take a lot of time. In this case, it is worthwhile to combine the queries with OR links after a cursory review of the respective result lists. Since the single queries are processed individually, and the results are then combined into a result list, parentheses must be used again. Connecting the queries is therefore done according to the form (query 1) OR (query 2) OR (query 3).
To at least come closer to the “Holy Grail” of a query named by Stock and Stock (2013, p. 114–5), which simultaneously yields all relevant results on a topic and only relevant results, we can thus cleverly combine several queries.
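The form (query 1) OR (query 2) OR (query 3) can be assembled mechanically. The sketch below, in Python, wraps each single query in parentheses and joins them with OR; the helper name combine_with_or, the example queries, and the result lists with their example.org addresses are all invented for illustration. The second half shows why the combined query saves time: the separately viewed lists contain more entries than the merged, duplicate-free list.

```python
def combine_with_or(queries):
    """Wrap each query in parentheses and connect them with OR."""
    return " OR ".join(f"({query})" for query in queries)


queries = [
    "aubergine AND recipe",
    "zucchini AND recipe",
    "ratatouille",
]
combined_query = combine_with_or(queries)
print(combined_query)

# Hypothetical result lists for the three single queries (made-up URLs).
result_lists = [
    ["example.org/a", "example.org/b"],
    ["example.org/b", "example.org/c"],
    ["example.org/a", "example.org/c", "example.org/d"],
]

# Number of entries one has to look at when viewing each list separately
viewed_separately = sum(len(results) for results in result_lists)

# One merged list without duplicates; dict.fromkeys keeps the first-seen order
merged = list(dict.fromkeys(url for results in result_lists for url in results))

print(viewed_separately, len(merged))
```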
12.5
Advanced Search Forms
A second option for making complex queries is the so-called advanced search form. In the standard search (“simple search”), only one search box is available, and if one wants to make a complex search query, one must know the operators or commands with which to qualify the keywords. Advanced search forms, on the other hand, are a tool for making better queries without knowing the commands and operators. Note that not all possible queries can be generated with this kind of tool, which usually only offers a compilation of the most important functions (which, however, are sufficient in many cases). Figure 12.3 shows Google’s advanced search form. In the upper part, there are input fields for the query which, among other things, represent links with the Boolean operators (for “all these words” it is the Boolean AND, for “any of these words” the Boolean OR, and for “none of these words” NOT). In addition, there is the phrase search (“this exact word or phrase”), with which two or more words can be searched for in a specific order. Finally, in the last line in the upper area, it is possible to search for documents that contain numbers in a given range.
Fig. 12.3 Form for the advanced search (example from Google; https://www.google.com/advanced_search; September 13, 2022)
The right-hand column contains explanations of each field, some of which indicate how a corresponding query can be formulated in the standard search box using operators. The advanced search form does not offer anything new in principle but is an alternative way of formulating complex queries. In the lower block of the form, there are mostly fields with which the search can be further restricted, for example, by country, language, or file type. These options are discussed in more detail in Sect. 12.6. In advanced search forms, one only has to fill in one field but can also fill in as many fields as desired. These are then linked with AND. For example, suppose one searches for Bayreuther Festspiele in the field “this exact word or phrase” and selects “German” as the language. In that case, only documents will be found that contain the two keywords next to each other in this exact order and that are written in German. Users rarely use advanced search forms (see Sect. 4.5.7). This has led to them being integrated less and less prominently into the user interfaces of the search engines or, in some cases, disappearing altogether. Bing, for example, has a powerful search language that leaves hardly anything to be desired, but offers no advanced search form with which complex queries could be formulated more simply. Advanced search forms offer many more possibilities than the options provided on the search engine result pages for restricting the result set (see Sect. 7.3.9). So, if one already knows that one wants to formulate a precise query, it is advisable to use the advanced search form (or commands) directly. It should not be forgotten that the numerous vertical searches offered by the general search engines also each have their own advanced search forms. These offer search options that are not available via the general search interface. 
One example is the option within the advanced search of Google’s image search engine (https://www.google.com/imghp) to restrict the search to a color that dominates the images. In this way, a result set can be significantly reduced, or the accuracy of the results can be increased dramatically by a simple setting (Fig. 12.4).
Fig. 12.4 Advanced image search on Google (excerpt; https://www.google.com/advanced_image_search; September 13, 2022)
12.6
Commands
When one submits a query via an advanced search form, one can sometimes see how the search engine automatically enriches the query with commands in the search box on the search result page. For example, if one specifies in Google’s advanced search form that only pages from the domain guardian.co.uk are to be found, the query is supplemented with site:guardian.co.uk. So, if one knows the command, one can enter the same query without using the advanced search form. However, not all possible queries can be achieved using the advanced search form. This applies to more complex queries with the Boolean operators but also to queries that contain commands. This section introduces the most important commands that can be used to qualify queries. Some commands are specific to only one search engine, but fortunately, certain conventions have evolved so that the most important commands work at least in the two major search engines. The difference between operators and commands is that operators are used to connect keywords, while commands are used to specify queries. Operators and commands can be combined within a query. All commands serve to limit the result set. A command adds an additional condition to the query that is not fulfilled by all documents that would have been found without the restriction by the command. In Table 12.2, some important search commands are described. In each case, the way they are to be entered in the two most significant search engines, Google and Bing, is indicated. Unfortunately, not all queries can be expressed using commands. For example, if one wants to restrict the search to an exact period of time, this is not possible with Google, neither via the advanced search form nor by using commands. The only way to do this is to submit a query and then limit it using the options provided on the search engine result page (see Sect. 12.5). 
One difficulty with using advanced search functions, in general, is that one has to gain experience of what is possible with which search engine and, if a certain type of query is possible, whether it has to be made via a command, via the advanced search form, or via the limiting options on the search result page. Unfortunately, there are no rules for this, and what is possible with a single search engine also changes frequently.
Table 12.2 Important search commands on Google and Bing (a)

| Function | Command on Google | Command at Bing | Explanation | Example |
| --- | --- | --- | --- | --- |
| Restrict file type | filetype: | filetype: | Restricts the search to a specific file type, for example, PDF or Word documents | search engine optimization filetype:pdf. Finds documents that contain the words search, engine, and optimization in PDF format |
| Phrase search | “keyword1 keyword2” | “keyword1 keyword2” | Restricts the search to documents in which the search terms within the quotation marks occur precisely in the order specified | “Peter Jenkowski”. Finds documents in which the name occurs in precisely this order (first name, last name) |
| Restriction to one domain | site: | site: | Restricts the search to documents from a specific domain. This can be a top-level domain (e.g., .de) or a particular domain (for instance, microsoft.de) | Viagra site:uni-hamburg.de. Finds documents that contain the word Viagra and are stored on a server of the University of Hamburg |
| Search in the title of the document | intitle: or allintitle: (if all the following terms are to appear in the title) | intitle: | Restricts the search to words that appear in the HTML title (<title>) of the document | intitle:search engine optimization. Finds documents that contain the words search, engine, and optimization in the title |
| Documents in a specific language | -- | language: | Restricts the search to documents in a specific language | granular synthesis language:en. Finds documents containing the words granular and synthesis and written in the English language |

(a) This overview is far from complete. A deeper insight into the commands supported by the various search engines can be found in the books by Hock (2013) and Bradley (2017)
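How commands qualify a query can be sketched by assembling the query string in code. The following Python example is a minimal illustration under stated assumptions: the function build_query and its parameters are hypothetical, it only covers the phrase search, the minus sign for exclusion, and the site: and filetype: commands described in this chapter, and it says nothing about how a search engine interprets the resulting string.

```python
def build_query(keywords, phrase=None, exclude=(), site=None, filetype=None):
    """Assemble a query string from keywords and optional restrictions.

    Hypothetical helper for illustration; each restriction adds a condition
    and therefore can only narrow the result set.
    """
    parts = list(keywords)
    if phrase:
        parts.append(f'"{phrase}"')                # phrase search
    parts.extend(f"-{word}" for word in exclude)   # minus sign, no space
    if site:
        parts.append(f"site:{site}")               # restrict to a domain
    if filetype:
        parts.append(f"filetype:{filetype}")       # restrict to a file type
    return " ".join(parts)


pdf_query = build_query(["search", "engine", "optimization"], filetype="pdf")
site_query = build_query(["Viagra"], site="uni-hamburg.de")
mixed_query = build_query(["coffee"], phrase="fair trade", exclude=["tea"])

print(pdf_query)
print(site_query)
print(mixed_query)
```

The first two calls reproduce example queries from Table 12.2; the third combines a keyword, a phrase, and an exclusion in a single query.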
When Should One Use the Advanced Search Form, and When Should One Qualify the Query with Commands or Operators?
The advanced search forms can be used to make a wide variety of queries, and they are relatively easy to use without having to remember operators or commands. However, searching with commands has serious advantages:
• If one knows the commands, complex queries can be formulated much faster.
• Many queries cannot even be made with the help of advanced search forms.
• And finally, knowledge of the Boolean operators is essential for all strategies for better searching. The advanced search forms cannot replace this knowledge.
Queries That Google Cannot Answer
There are a large number of queries which cannot be submitted using Google. Unfortunately, Google only supports a relatively limited search vocabulary; in addition, complex search arguments are often processed incorrectly. For this reason, it is advisable to use a different search engine, especially for such queries. In the following, this will be illustrated by two examples.

Combining Boolean Operators
With Google, Boolean operators cannot be combined at will. If, for example, one wants to combine several queries that themselves contain links with Boolean operators, Google will often produce incorrect results. For example, the query (aubergine AND recipe) OR (zucchini AND recipe) leads to incorrect results in Google due to this mishandling. In contrast, Bing and other search engines can process this query without problems.

Documents Containing a File in a Specific Format
The restriction by file type (filetype:) can be used to find files in a specific format. Often, however, one does not only want to find a file in a particular format but a (text) document that also contains, say, a media file. For example, you won’t get very far if you search Google for documents about the writer Ian McEwan that contain an audio file. If you restrict the search to filetype:mp3, you will get no results at all; the query “ian mcewan” mp3 (i.e., without explicit restriction to a file type, but with the designation mp3 as part of the query) mainly returns Web pages where you can download audiobooks for a fee. With Bing, on the other hand, it is possible to use the contains:mp3 command to restrict the search to documents that contain an audio file on the topic. The search query would therefore read:
“ian mcewan” contains:mp3
Any file type can be attached to contains: so that, for example, videos can be found quickly by slightly changing the query.
12.7
Complex Searches
With Boolean operators, advanced search forms, and commands, we have numerous ways to qualify our searches and make them more effective and efficient. Instead of having to look through endlessly long result lists in the hope that something useful might turn up among all the irrelevant results, we can now already control the size and quality of the result set by formulating the query. So far, we have mainly looked at searches carried out in one step. Complex searches, however, are usually carried out in several steps. This can be a combination of several search queries after a cursory review of the result lists for each query (see Sect. 12.4) or research in several steps in which one first searches for a suitable source (a vertical search engine or database) and then carries out the subject search there. However, not only a combination of queries or a combination of navigational and informational searches is common but also switching between different forms of search. This is particularly the case with extensive topics. It is often a question of first collecting a large number of documents to get a comprehensive picture of the topic. However, many information problems cannot be solved without combining different search tools, research strategies, and queries. For example, the journalist Albrecht Ude (2011, p. 185ff.) describes the case of a Chinese athlete who took part in the 2008 Olympic Games in Beijing. The question arose as to whether she had reached the minimum age of 16 required for participation. It was only through extensive Internet-based research using several sources and a combination of several queries that the question could be answered. It should not be denied that the options described in this chapter for qualifying and combining queries as well as for combining queries and selecting sources are only needed in relatively few cases. 
Most day-to-day queries are relatively simple, and it certainly does not take much effort to select the right vertical search engine and formulate the query appropriately. But even in many simple cases, qualifying the query can be useful and achieved relatively easily, provided one knows the relevant options. However, as soon as one carries out more extensive Internet-based research, using the search options described in this chapter becomes essential. Not only can one use them to find documents that one would not have found by simpler means, but one can also make one’s searches efficient, i.e., one can carry out successful searches in less time.
12.8
Summary
Search engines are useful for many forms of Internet-based research. However, especially before conducting more extensive research, one should ask which sources best suit a particular case. Search engines can be one of them, but they are often not the only relevant sources. Internet-based research is largely about selecting suitable keywords and then connecting the terms within the queries. Boolean operators provide a powerful
vocabulary for linking search terms, which can be used to control the result sets on the one hand and to combine several queries on the other. The latter avoids having to go through several result lists containing a large proportion of duplicate content. Search engines offer advanced search forms and various commands for qualifying queries. However, neither method can represent all possible queries, so one has to decide in favor of one or the other, depending on the specific case. In the context of complex research, it is often necessary to select different search tools (search engines, databases) and skillfully combine queries that have been qualified with commands and operators.

Further Reading
Two excellent introductions to professional Internet-based research are the books by Bradley (2017) and Hock (2013). The book The Joy of Search (Russell, 2020) uses numerous examples to show the complex research possible with the help of search engines. Here, the main focus is on the combination of the most diverse search options and vertical collections, allowing searches that initially seem almost hopeless. One limitation, however, is that the book covers only Google and treats the topic rather unsystematically.
References
Bradley, P. (2017). Expert internet searching (5th ed.). Facet Publishing. https://doi.org/10.29085/9781783302499
Broder, A. (2002). A taxonomy of web search. ACM Sigir Forum, 36(2), 3–10.
Hock, R. (2013). The extreme searcher’s internet handbook: A guide for the serious searcher (3rd ed.). Information Today.
Lewandowski, D. (2014). Wie lässt sich die Zufriedenheit der Suchmaschinennutzer mit ihren Suchergebnissen erklären? In H. Krah & R. Müller-Terpitz (Eds.), Suchmaschinen (pp. 35–52). LIT.
Russell, D. M. (2020). The joy of search: A Google Insider’s guide to going beyond the basics. MIT Press.
Stock, W. G. (2007). Information retrieval: Informationen suchen und finden. Oldenbourg.
Stock, W. G., & Stock, M. (2013). Handbook of information science. De Gruyter Saur.
Ude, A. (2011). Journalistische Recherche im Internet. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 179–199). Akademische Verlagsgesellschaft AKA.
13
Search Result Quality
In this chapter, we will examine the quality of search results from two different angles: On the one hand, it deals with how users can assess the quality of the individual results displayed by a search engine; on the other hand, it deals with research methods that can be used to systematically evaluate and compare the quality of the search results displayed by search engines. It will become clear that, on the one hand, documents are not of high quality simply because they are shown in a prominent position by a search engine. On the other hand, even the better performance of a particular search engine in a test does not inherently mean that this search engine also performs best for every query. Both points have implications for using alternative search engines (see Chap. 11) and for Internet-based research.
13.1
Criteria for Evaluating Texts on the Web
While in Chap. 12, we were mainly concerned with source selection and the formulation of queries to achieve better search results, we now turn our attention to the results of what we have found in our searches. As we have seen, although search engines try to reflect quality in their rankings by using different factors (Sect. 5.3), the ultimate quality assessment must be made by the searcher. The criteria discussed in this section for assessing the quality of texts on the Web are independent of using a search engine. It does not matter how one arrived at a document: should one have doubts about the quality or credibility of a document retrieved, it is worthwhile to check these criteria and thereby gain a more accurate picture. But, again, it is not necessary to check all criteria in every simple fact-finding search; instead, it is essential to recognize the cases in which it is worthwhile or necessary to check. One can divide checking the quality of Web documents into different areas, as is done, for example, in the quite well-known CARS checklist (Harris, 2020), which posits that a high-quality document should be credible, accurate, reasonable, and supported. Table 13.1 summarizes these attributes with the respective review criteria and the objectives of the review.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_13

Table 13.1 CARS checklist (Harris, 2020)

| Attribute | Criteria | Goal |
| --- | --- | --- |
| Credible | Trustworthy source, author’s credentials, evidence of quality control, known or respected authority, organizational support | An authoritative source, a source that supplies some good evidence that allows you to trust it |
| Accurate | Up to date, factual, detailed, exact, comprehensive, audience and purpose reflect intentions of completeness and accuracy | A source that is correct today (not yesterday), a source that gives the whole truth |
| Reasonable | Fair, balanced, objective, reasoned, no conflict of interest, absence of fallacies or slanted tone | A source that engages the subject thoughtfully and reasonably, concerned with the truth |
| Supported | Listed sources, contact information, available corroboration, claims supported, documentation supplied | A source that provides convincing evidence for the claims made, a source you can triangulate (find at least two other sources that support it) |

These checks can apply to different levels of content: sometimes, it is a matter of evaluating a single document, sometimes the source. Some of the criteria mentioned in Table 13.1 can be checked directly against the text, for example, whether the document is comprehensive and has been prepared with formal care. Others, such as the author’s qualifications, must be verified by further Internet-based research. But here, too, one ultimately evaluates an individual document. Another option is to check the source: Where was the document published? Is it a well-known website that one already trusts because one knows that quality control takes place there (e.g., the well-known journalistic brands)? Checking the domain via a so-called WHOIS service can also reveal information about the source. Checking documents on the Web has a lot to do with common sense. Of course, it is not practical to check every document one comes across on the Web. But especially when it comes to important research or sensitive topics, keeping the aforementioned review criteria in mind is worthwhile.
13.2 Human vs. Machine Inspection of Quality
Machine and human inspection of quality can lead to different results. In Chap. 5, ranking search results was described as a process in which, in addition to analyzing the texts of the potential results, especially factors that measure the popularity of documents are taken into account to draw conclusions about their quality. Examples can be used to show that equating popularity with quality sometimes leads to obviously bad results. A famous example of how even hate sites can appear in the top positions for harmless queries was the website martinlutherking.org, which was consistently
listed in the top positions on Google for the query martin luther king for many years. The website is no longer available; why it was shut down is unknown. Nevertheless, it is worth taking a look at this site and its position in search engines to illustrate a general problem of automated quality assessment. The website claimed to be an information site, but a glance at the host of the website (named in the last line on the homepage) revealed that it was, in fact, a racist website: it referred directly to stormfront.org, describing itself as a “white nationalist community,” which already includes the motto “White Pride World Wide” in its logo. Furthermore, the website stormfront.org itself has not been included in Google’s search results (at least in Germany) for a long time (McCullagh, 2002) because, among other things, it denies the Holocaust, which is legally prohibited in Germany. As early as 2000, Paul S. Piper described the website martinlutherking.org as:

one of the most odious sites on the Web. It disseminates hateful information about one of the greatest African-American leaders of our era while pretending to be, on the surface, an “official” Martin Luther King, Jr. site. (Piper, 2000)

This may be relatively easy for more experienced users to see, but less experienced users may be fooled:

Even the underlying pages, although obviously advocating white power (the recommended books include My Awakening by David Duke), can easily fool less sophisticated Web users because the information is presented in a “factual” manner, cites “government documents,” and offers a polished design apparently sympathetic to King. (Piper, 2000)
However, the fact that the website was nevertheless listed in one of the top positions in Google’s search results makes it clear that the automatic “quality assessment” of the search engines can only be an assessment based on formal criteria and is no guarantee that the content is also relevant or trustworthy. The website mentioned above may be an extreme example in this respect; however, it illustrates that the fact that a website or a document is listed highly by a search engine does not mean that the content can be trusted. Often the verification of documents is left to the search engine, i.e., one assumes that the information found must be correct and true simply because a search engine presents it in a prominent position. However, this is not the case, which makes verification necessary, especially in the case of controversial topics. It is not intended here to advocate always checking every search result against an extensive list of criteria. Similar to using operators and commands for qualifying queries, it is instead a matter of recognizing when such a check makes sense and is necessary. Much can be achieved with common sense alone: If something seems strange when you read it, this is already a good sign that you should check the source. One should also pay attention if one finds documents on a topic that exclusively or primarily represent or emphasize a certain position. A closer look at the authors is often enough to assess the information in its context.
Fig. 13.1 Wikipedia article with a warning that it needs improvement (detail; September 13, 2022)
Less extreme examples are provided by various articles from Wikipedia. Wikipedia articles very often appear in the top positions in the search results of virtually all search engines (Cozza et al., 2016; Höchstötter & Lewandowski, 2009; Lewandowski & Spree, 2011); Wikipedia gets more than 85% of its traffic from search engines (https://www.similarweb.com/website/wikipedia.org/). In most cases, the Wikipedia articles output by search engines are relevant documents that provide a good overview of the topic being searched. However, search engines rate Wikipedia as a source (i.e., largely independent of the single document) as being of such high quality that even inferior documents appear in the top positions for corresponding keywords. These articles do not meet the principles established by Wikipedia itself (Wikipedia: Five pillars, 2021) and are often marked by Wikipedia itself with corresponding warnings. This shows that even supposedly high-quality sources are no guarantee that the information contained on individual topics must be trustworthy, complete, or high quality in any other way. Here, too, a closer examination is therefore advisable (Fig. 13.1). These examples make clear that search engine results should not be blindly trusted. Although search engine providers have taken measures to reduce the frequency with which so-called fake news appears in prominent positions on search engine result pages, examples repeatedly show that this is not always successful. One should also consider that by their very functionality, search engines are not designed to display “the truth” but simply to display documents that contain the specified keywords and which are deemed relevant by users and content producers. 
Moreover, the quality of the documents displayed also has to do with the quality of the documents available in the first place: If a search engine only finds documents of low quality for a query, these will also be displayed—but users will not see that even the first result listed has only received a low-quality rating.
13.3 Scientific Evaluation of Search Result Quality
In addition to the individual examination of the quality of search results during Internet-based research, there are procedures that systematically evaluate the quality of search results. Such procedures are used not only with search engines but also with many other information retrieval systems. They can draw on a long tradition of evaluation and have been adapted for use with search engines. The results from such evaluation studies can serve several purposes:

1. Improving the systems at the providers’ side: All search engine providers evaluate their systems continuously and also use human evaluators for this purpose (see, among others, Google, 2020). The results of these evaluations can be used, among other things, to gain insights for further developing ranking procedures and for combating spam.
2. Comparing different search engines: Independent studies that compare search engines “from the outside” provide recommendations on which search engines are particularly suitable for successful Internet-based research. Furthermore, they can provide insights into how successful search engines are in helping users find relevant information and whether search engines are fulfilling their role as essential tools for knowledge acquisition within society.

Search result quality is first and foremost about measuring retrieval effectiveness, i.e., quality of results operationalized independently of a specific user. In the following, the procedures of information retrieval evaluation are presented in the way they have been adapted to the specifics of search engines. For this purpose, we first assume a typical presentation of results and then explain the standard test design in the following. Figure 13.2 shows an example of a so-called precision graph from a study on the retrieval effectiveness of the two search engines Google and Bing.
The precision of the search results is defined as the proportion of relevant results in the total number of results displayed by the respective search engine. Precision is, therefore, about how well a search engine can present only relevant results to the user. Ideally, all results displayed would be relevant, resulting in a precision value of 1 (the number of relevant results displayed divided by the total number of results displayed). The “counterpart” to precision is recall: this measures the proportion of relevant documents a system finds in relation to the total number of relevant documents in the index. To calculate recall, one must know all the documents in the database and evaluate their relevance, which is practically impossible in the vast majority of cases. In the diagram, the positions in the result list are shown on the x-axis, whereby the precision values (on the y-axis) are cumulated in each case. So, for example, if we look at the value 3 on the x-axis, the precision value of 0.83 for Google means that 83% of all results on positions 1 to 3 are relevant for this search engine. So, we are not looking at the individual positions but assume a user who works through the result list in sequence and has therefore looked at the first two documents once they get to position 3.
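The definitions of precision, cumulated precision, and recall can be made concrete with a small sketch. The function names and the example judgements below are hypothetical and chosen only for illustration. The sketch also assumes that the cumulated curve is obtained by pooling the judgements of all queries at each cut-off; the study behind the figure may have aggregated differently.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Share of relevant results among the first k positions of one result list."""
    if not 1 <= k <= len(relevant):
        raise ValueError("k must be between 1 and the length of the result list")
    return sum(relevant[:k]) / k


def cumulative_precision(judgements: Sequence[Sequence[bool]], cutoff: int) -> list[float]:
    """Precision cumulated over positions 1..k for k = 1..cutoff.

    Assumption: relevant and displayed results are pooled over all queries.
    """
    curve = []
    for k in range(1, cutoff + 1):
        relevant = sum(sum(j[:k]) for j in judgements)
        shown = sum(len(j[:k]) for j in judgements)
        curve.append(relevant / shown if shown else 0.0)
    return curve


def recall(relevant_found: int, relevant_in_index: int) -> float:
    """Share of all relevant documents in the index that the system returned."""
    if relevant_in_index <= 0:
        raise ValueError("the number of relevant documents in the index must be positive")
    return relevant_found / relevant_in_index


# Hypothetical relevance judgements (True = relevant) for two queries,
# three result positions each.
judgements = [
    [True, True, False],
    [True, False, True],
]
curve = cumulative_precision(judgements, cutoff=3)
```

Each value in `curve` answers the question the graph answers: of all results a user has seen after working through the list down to position k, what share was relevant? The `recall` function is included only to show the formula; as noted above, its denominator is rarely known for Web search engines.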
Fig. 13.2 Precision graph from a search engine evaluation (Lewandowski, 2015, p. 1771). The x-axis shows the cumulative result position (1 to 10), the y-axis the precision (0.0 to 1.0). Values plotted:
Position (cumulative) | Google | Bing
1 | 0.847660501 | 0.82970297
2 | 0.838274933 | 0.824649299
3 | 0.834545455 | 0.809183673
4 | 0.836950624 | 0.797775531
5 | 0.837527115 | 0.770020534
6 | 0.835532261 | 0.781059063
7 | 0.829204088 | 0.779608651
8 | 0.824770146 | 0.770103093
9 | 0.822718004 | 0.77445932
10 | 0.819535888 | 0.791534392
The two curves in the graph show the progressions for both search engines. It is easy to see that the differences, in this case, are not severe, i.e., it should not matter much to a user which of the two search engines they use. In addition, it should be noted that both search engines display relevant results; however, these are not necessarily the same results (see Sect. 11.1). In Fig. 13.2, only informational queries were considered. A different picture emerges when we look at the results for navigational queries (Fig. 13.3). In this case, the extent to which the two search engines are capable of displaying the correct result in the first position was measured. The results showed that Google managed to do this in 95.3% of cases, while Bing only managed to do so in 76.6% of cases. Looking at the results for the informational and navigational queries together, one can conclude that the differences between the two search engines are largely due to the results for the navigational queries. Of course, the results only refer to a specific point in time; if one were to conduct the same investigation today, different results might emerge. These explanations are also intended less to name concrete results than to explain the basic procedures and modes of presentation. Unfortunately, an important finding from studies on retrieval effectiveness is often lost when comparing search engines: Even if a search engine performs best on average in a test, this does not mean that it also delivers the best result for every query. On the contrary, research on the quality of search results consistently concludes that this is not the case but that depending on the query, sometimes one search engine delivers better results, sometimes the other (e.g., Griesbaum, 2004; Lewandowski, 2008).
Fig. 13.3 Comparison of two search engines for navigational queries (Lewandowski, 2015, p. 1769)
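The point that the best search engine on average is not the best for every query can be illustrated with a short sketch. All engine names and precision values below are invented for illustration and are not taken from any of the cited studies.

```python
from statistics import mean

# Hypothetical precision values per query and engine (invented numbers).
precision_by_query = {
    "query a": {"engine_1": 0.9, "engine_2": 0.6},
    "query b": {"engine_1": 0.8, "engine_2": 0.7},
    "query c": {"engine_1": 0.4, "engine_2": 0.7},
}


def mean_precision(scores: dict, engine: str) -> float:
    """Average precision value of one engine over all queries."""
    return mean(per_engine[engine] for per_engine in scores.values())


def wins_per_engine(scores: dict) -> dict:
    """Count, per engine, the queries on which it has the strictly highest value."""
    wins = {}
    for per_engine in scores.values():
        best = max(per_engine.values())
        leaders = [engine for engine, value in per_engine.items() if value == best]
        if len(leaders) == 1:  # ties are counted for neither engine
            wins[leaders[0]] = wins.get(leaders[0], 0) + 1
    return wins


averages = {e: mean_precision(precision_by_query, e) for e in ("engine_1", "engine_2")}
wins = wins_per_engine(precision_by_query)
```

In this made-up data, engine_1 has the higher average but still loses on one of the three queries, which is the pattern the studies cited above report for real search engines.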
13.3.1 Standard Test Design of Retrieval Effectiveness Studies

Retrieval effectiveness is the ability of a search engine to return relevant documents in response to a query. Popular search engines have been evaluated in numerous tests. Primarily English-language queries were used for these tests, but tests using other languages can also be found (e.g., Griesbaum, 2004; Lewandowski, 2008; Lewandowski, 2015). An overview of non-English-language search engine tests in recent years that goes beyond pure retrieval effectiveness tests can be found in Lewandowski (2015).

A standard set-up (sometimes slightly modified) is used for most tests, as known from the information retrieval literature and evaluation initiatives (especially TREC; see Harman & Voorhees, 2006; Voorhees, 2019). A set of queries is sent to different search engines; the returned results are first anonymized and randomized and then presented to jurors for evaluation. This is the essential element of the test set-up: people evaluate the results without knowing which search engine returned the individual results nor which result position they were in. For the evaluation, the results are reassigned to the examined systems and the result positions on which they were output. Finally, the search results’ precision (or other metrics) is measured. The tests are restricted to a certain number of result positions, as the number of results in most cases far exceeds the number that can be examined by a user, especially in the case of Web search engines.

The structure of retrieval tests is usually based on the steps established by Tague-Sutcliffe (1992). The particularities of Web search engines have been taken into account by Gordon and Pathak (1999) and (building on this) Hawking et al. (2001).
The five criteria for a valid search engine test as set forth by Hawking et al. refer to representing real information needs, communicating the information need (if an information broker is used), having a sufficiently large number of test queries, selecting the most important search engines, and carefully designing the study and conducting it properly. The typical test design for retrieval tests consists of the following steps:

1. Selecting queries/tasks
2. Sending the queries to the search engines
3. Collecting and storing the results
4. Randomizing the results; disguising their origin
5. Evaluation of the results by jurors
6. Pooling the ratings and reassigning them to the search engines
7. Analyzing the results
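Steps 3, 4, and 6 of the list above (storing, randomizing and disguising, and reassigning) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the cited studies: results are represented as plain URL strings, judgements are binary, and the names `build_judging_pool`, `reassign`, and the two example engines are hypothetical.

```python
import random
from collections import defaultdict


def build_judging_pool(results, seed=0):
    """Prepare stored results for blind evaluation.

    results: {engine: {query: [url at position 1, url at position 2, ...]}}
    Returns (pool, origins):
      pool    maps each query to a shuffled list of unique URLs (shown to jurors),
      origins maps (query, url) to the (engine, position) pairs that returned it
              (kept hidden from jurors).
    """
    rng = random.Random(seed)
    origins = defaultdict(list)
    pool = defaultdict(list)
    for engine, by_query in results.items():
        for query, urls in by_query.items():
            for position, url in enumerate(urls, start=1):
                if (query, url) not in origins:
                    pool[query].append(url)  # a duplicate result is judged only once
                origins[(query, url)].append((engine, position))
    for urls in pool.values():
        rng.shuffle(urls)  # hide the original ranking
    return dict(pool), dict(origins)


def reassign(judgements, origins):
    """Map juror verdicts back to engines and result positions.

    judgements: {(query, url): True/False}
    Returns {engine: {query: {position: True/False}}}.
    """
    per_engine = defaultdict(lambda: defaultdict(dict))
    for key, relevant in judgements.items():
        query, _url = key
        for engine, position in origins[key]:
            per_engine[engine][query][position] = relevant
    return per_engine


# Hypothetical stored results for one query from two engines.
stored = {
    "engine_1": {"q": ["a", "b"]},
    "engine_2": {"q": ["b", "c"]},
}
pool, origins = build_judging_pool(stored, seed=1)
verdicts = {("q", "a"): True, ("q", "b"): False, ("q", "c"): True}
per_engine = reassign(verdicts, origins)
```

Because the result "b" was returned by both engines, it appears once in the pool, and its single verdict is copied back to both engines at their respective positions.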
Of course, it is desirable to use as many queries as possible. In practice, however, finding enough jurors to evaluate the results for those queries is often difficult. In most tests, a minimum of 50 queries is used, but this number has simply become established based on experience. Of course, the number of queries to be used highly depends on the test’s purpose. If as many subject areas and different types of search behavior are to be covered as possible, the number of queries must be increased accordingly. When selecting queries, a distinction must be made between tests that attempt to make general statements about the result quality of search engines and those that are deliberately restricted to a specific topic or the queries of a particular user group (e.g., children). If the test is to make a general statement about the search engines’ result quality, the queries should be chosen as broadly as possible. General tests should cover both popular and infrequently asked queries, and the distribution of the length of the queries should also be considered. If the test aims to examine how suitable various search engines are for research in specific subject areas, the same prerequisites apply, except that the queries must, of course, relate to the subject area in question. Enriching the test queries with a description of the information need behind them is always advisable. This description is then displayed to the jurors to support their evaluation. To enable the jurors to evaluate the documents as accurately as possible, a distinction can also be made between the description of the query and explicit evaluation information naming what kinds of documents are to be evaluated as relevant. This procedure is particularly suitable if queries from real users are used, as they are usually best able to describe the intent behind their query and what the ideal documents for this query would look like. 
The following example illustrates the differences between query, description, and evaluation information:
Query: cost of living united states
Description: What is the cost of living in the United States? What proportion of salary should be spent on rent, what proportion on utilities, and what proportion on food?
Evaluation information: Relevant documents are those that give an overview of the cost of living in the United States and do not just deal with one of the aspects mentioned.

Provided the test administrator is clear about the characteristics of the documents to be evaluated as relevant, it makes sense to give the jurors appropriate guidance. If, however, queries of general interest are used, such precise instructions to the jurors are unnecessary, as they will be able to distinguish relevant from non-relevant documents on their own.

Since retrieval tests are intended to reflect the typical behavior of the user group in question, this behavior must also be considered when determining the number of results per query to be evaluated (see Chap. 4). Most studies tend to be restricted to the first ten results per query. When determining the number of results, it should also be noted that the effort of conducting a test increases with a higher number of results. Thus, there are hardly any tests in the literature considering more than the first 20 results.

The number of jurors needed for a test depends, of course, on the number of queries to be examined. Usually, each juror evaluates all the results returned by all search engines for one query. However, this is often not possible, especially with a large number of queries; in these cases, each juror can also evaluate the results for several queries. As a rule, each result is judged by only one person.
While there are cases in which different jurors will reach different verdicts (Schaer et al., 2010) and it would make sense to have the results evaluated by at least two jurors, there is no definitive evidence as to whether the reliability of the tests can be significantly increased by adding more jurors. Several factors can distort test results. In particular, the origin of the results (i.e., which search engine initially returned them) should be concealed, as otherwise, strong brand effects can be observed in the evaluation (Bailey et al., 2007; Jansen et al., 2007). Furthermore, the original ranking of the results should not be visible to the jurors in order to rule out learning effects (Bar-Ilan et al., 2009). The results of the different search engines should also be mixed in the evaluation. Duplicate content, i.e., results that are output by several search engines, should only be presented to the jurors once (per query) in the assessment to obtain uniform judgements. In the analysis of the test results, the relevance judgements are combined and analyzed as described in the previous section. In addition to precision, other indicators can also be used. In particular, metrics that better reflect typical user behavior are playing an increasingly important role (Carterette et al., 2012). Although certain adjustments have been achieved by modifying the tests, the standard procedures assume a user who goes through the results one after the other and who is not influenced by a prominent presentation of the results or changes their query based on the displayed result set (or the viewed results). And while the
inclusion of the snippets (Lewandowski, 2008), in particular, represents a significant improvement over evaluations that focus purely on the results themselves, it is still hardly possible to speak of a realistic representation of user behavior. Whether this can be achieved within the framework of practicable tests remains to be seen. The standard procedures described have two main limitations. The first is their inability to map the often interactive and multi-step process of Internet-based research (see Sect. 4.1). The second is the assumption that the quality of the results is essentially responsible for users preferring a particular information system over others. Although different systems can be compared using the procedures, a user’s possible willingness to switch cannot be deduced from the results. In addition, evaluating systems in order to select one for one’s own purposes, or to compare one’s own system with others, differs from evaluating Web search engines, where the comparison is between several third-party systems whose internal processes are not accessible. Although recommendations for or against the use of a particular search engine may be made in this context, in practice, other factors that cannot be directly attributed to the quality of the results are probably also or even primarily decisive for choosing a search engine (see Sects. 4.1 and 8.3). In the search engine sector, there is strong brand loyalty; studies have shown that users prefer their favorite search engine even if the results of another search engine are provided in the layout they are familiar with and with the brand name of their favorite search engine (Jansen et al., 2007). Similarly, the brand (especially Google) strongly influences the evaluation of the results returned by a search engine (Ataullah & Lank, 2010; Jansen et al., 2009).
13.3.2 Measuring Retrieval Effectiveness Using Click-Through Data

From the general procedure for conducting retrieval tests, it has already become clear that these are very labor-intensive test procedures for which a large number of jurors must be recruited, even though the tests cannot cover more than a relatively small number of queries, especially in comparison to the sheer mass of different queries that are submitted to search engines. In this respect, it is not surprising that methods were sought that are less labor-intensive, on the one hand, and, on the other hand, can accommodate vast numbers of queries and evaluators. One solution is to use interaction data from real search engine users. These methods are based on the data from the transaction log files of search engines. But, again, only a portion of the available data is used for measurement. Still, it is possible, for example, to use a month’s worth of data for analysis, which can be millions of queries. The advantages of such a method are apparent: real queries from real users and their behavior on the result pages are evaluated, and in addition, one can fully cover the queries of a specific period.

Such methods based on the users’ click-through data take into account the queries themselves, the results selected on the result page, and, if applicable, the time a user looks at a search result before returning to the result page. Click-based methods are
mainly used to improve rankings; they have already been discussed in this context in Sect. 5.3.2. Since click-based tests do not use jurors but record the data of real users of a search engine, it is also possible to collect many ratings on a single result (Joachims et al., 2005). However, it is essential to remember that infrequently asked queries do not necessarily have many clicks on result documents. Furthermore, such tests do not record explicit relevance judgements but implicit ones. It remains unclear whether users stayed on the best result or were simply satisfied with the result that was considered the best based on the search engine’s ranking. In such tests, there is no systematic evaluation of a previously determined result set since users usually only click on the first or particularly highlighted results, as shown in Chap. 4. While click-based tests have undeniable advantages, it is only possible to conduct such a test if one has access to the actual data generated by the search engine. This restricts the group of people conducting these tests to the search engine providers and institutions cooperating with them. In addition, comparing different systems is only theoretically possible with these methods since search engine providers are unlikely to make their data available for such purposes. In this respect, one can only recommend that the findings obtained from the corresponding tests be considered in combination with juror-based evaluations—which is also how search engine providers proceed in practice. In addition to the automatic quality assessment based on mass data generated with the clicks on the search engine result pages, juror-based evaluations are carried out as a supplement (see Google, 2020).
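How click-through data can be turned into implicit relevance signals may be sketched as follows. The log format, the function name `implicit_judgements`, and the 30-second dwell-time threshold are assumptions made for this illustration; real transaction logs and the thresholds used by search engine providers are not described in the text above.

```python
from collections import defaultdict


def implicit_judgements(log_rows, min_dwell=30.0):
    """Aggregate clicks per (query, url) and count 'satisfied' clicks.

    log_rows: iterable of (query, clicked_url, dwell_seconds), where
    dwell_seconds is None if the user did not return to the result page.
    Assumption: a click counts as satisfied if the user stayed at least
    min_dwell seconds or never came back to the result page.
    """
    stats = defaultdict(lambda: {"clicks": 0, "satisfied": 0})
    for query, url, dwell in log_rows:
        entry = stats[(query, url)]
        entry["clicks"] += 1
        if dwell is None or dwell >= min_dwell:
            entry["satisfied"] += 1
    return {
        key: {**entry, "satisfied_share": entry["satisfied"] / entry["clicks"]}
        for key, entry in stats.items()
    }


# Hypothetical log excerpt: two clicks on result "a", one on result "b".
rows = [
    ("q", "a", 5.0),   # quick return to the result page
    ("q", "a", 60.0),  # long visit
    ("q", "b", None),  # no return to the result page
]
signals = implicit_judgements(rows)
```

The sketch also makes the limitation discussed above visible: a result that nobody clicked never appears in `signals`, so nothing can be said about its relevance, and a high share of satisfied clicks does not show that the clicked result was the best one available.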
13.3.3 Evaluation in Interactive Information Retrieval

While the methods described so far measure the quality of search results either at the level of results or at the level of result lists, session-based evaluation is increasingly coming into focus (on sessions, see Sect. 4.4). On the one hand, results viewed during a session can be evaluated in the context of click tests based on the dwell time and the time of retrieval. However, on the other hand, session-based evaluations can also be specifically related to further user observation. In this case, the behavior of selected users is logged, whereby it is also possible to map their searches with different search tools (search engines, directories, Wikipedia, social networking sites, etc.). The great advantage of such a procedure is that users who have agreed to such a study can also be observed over more extended periods; that additional data such as age, gender, etc. of the users can be requested; and that the study can be supplemented with further surveys. Such studies can thus include both quantitative and qualitative data, for example, by asking for explicit relevance judgements about the viewed results or the motivation for the search behavior.

The disadvantage of evaluating such interactive scenarios is that, at least if the users are free to choose the search engine(s), only insufficient comparative data accrue, and result sets cannot be systematically evaluated since the users are not presented with a predefined set of results for evaluation. It is also difficult to exclude brand effects and other preferences. And last but not least, subjects for such tests are more difficult to recruit, as the test either has to be conducted in a laboratory or the user has to install a particular software for logging. Nevertheless, these tests are at least a useful supplement to conventional retrieval tests; for specific topics, they are also suitable on their own. Table 13.2 compares the retrieval effectiveness test with jurors with the other two test methods.

Table 13.2 Comparison of the test methods (Lewandowski, 2011, p. 224)
Test method | Use case | Evaluated documents | Appraisal
Retrieval effectiveness test | Evaluating one’s own system; comparing one’s own system with third-party systems; comparing third-party systems with each other | All results up to a certain cut-off value | Suitable if explicit ratings are to be analyzed for a previously determined number of hits; only option if complete result sets (up to a certain cut-off value) should be evaluated
Test using click-through data | Testing one’s own system | Documents clicked by real users | Well suited for analyzing mass data and making automatic improvements to the ranking of one’s own system
Logging studies | Testing one’s own system; comparing one’s own system with third-party systems; comparing third-party systems with each other | Documents clicked by users in the test | Suitable for modelling sessions or exploratory searches; suitable for observing selected users interacting with (and switching between) multiple systems

As important as the quality of search engine results is, it must be made clear that the quality of search engines cannot be reduced to one factor. Instead, the quality of search engines is formed by an interplay of different factors. Lewandowski and Höchstötter (2008) group these factors into four areas: quality of the index, quality of the search results, quality of the search functions, and user-friendliness. To evaluate search engines comprehensively, studies must be carried out in these different areas. Of course, different user groups play a role in weighing these criteria against one another: For example, the quality of the (advanced) search functions (see Chap. 12) probably plays only a minor role for average users since they hardly use them. For professional searchers, on the other hand, the use of commands and operators is essential; a search engine that does not support these sufficiently is hardly useful to them.
13.4 Summary
When evaluating the quality of search engine results, a distinction must be made between evaluating the quality of the results found in a specific search and systematically evaluating the retrieval effectiveness of search engines. Search engines assess the quality of documents using formal criteria. While, in many cases, this can indeed reflect quality, the high placement of a document in a result list does not necessarily mean that this document is of particularly high quality or even trustworthy. The search engine’s ranking, therefore, does not replace the user’s individual assessment. Such an assessment can be made based on the individual document and its source. For orientation, checklists exist that allow a simple and quite reliable evaluation.

The scientific evaluation of retrieval effectiveness raises the question of the search engines’ ability to produce relevant results to queries and sort them according to human-rated relevance. There are established standard procedures with which such studies are carried out. However, these procedures do not reflect real user behavior, which is why they are considered a form of system-oriented evaluation. Evaluation procedures that use click-through data can supplement juror-based procedures with interaction data from real users.

Further Reading

Instructions on how to conduct retrieval tests can be found in popular textbooks on information retrieval, such as Manning et al. (2008). A valuable overview of misinformation, public relations measures, and other activities that can hinder Internet-based research can be found in the book by Anne P. Mintz (2012).
References
Ataullah, A. A., & Lank, E. (2010). Googling bing: Reassessing the impact of brand on the perceived quality of two contemporary search engines. In Proceedings of the 2010 British Computer Society Conference on Human-Computer Interaction, BCS-HCI 2010 (pp. 337–345). https://doi.org/10.14236/ewic/hci2010.39
Bailey, P., Thomas, P., & Hawking, D. (2007). Does brandname influence perceived search result quality? Yahoo!, Google, and WebKumara. In Proceedings of the 12th Australasian Document Computing Symposium. Melbourne, Australia, December 10, 2007.
Bar-Ilan, J., Keenoy, K., Levene, M., & Yaari, E. (2009). Presentation bias is significant in determining user preference for search results – A user study. Journal of the American Society for Information Science and Technology, 60(1), 135–149. https://doi.org/10.1002/asi.20941
Carterette, B., Kanoulas, E., & Yilmaz, E. (2012). Evaluating web retrieval effectiveness. In D. Lewandowski (Ed.), Web search engine research. Bingley.
Cozza, V., Hoang, V. T., & Petrocchi, M. (2016). Google web searches and Wikipedia results: A measurement study. In 7th Italian information retrieval workshop, IIR 2016; Venezia; Italy; 30 May 2016 through 31 May 2016. Retrieved February 4, 2021, from https://www.semanticscholar.org/paper/Google-Web-Searches-and-Wikipedia-Results%3A-A-StudyCozza-Hoang/17c4630f132dc22a6a7fb642099beebbf9f528f7
Google (2020). Search quality rater guidelines. Retrieved January 21, 2021, from https://static.googleusercontent.com/media/guidelines.raterhub.com/en//searchqualityevaluatorguidelines.pdf
Gordon, M., & Pathak, P. (1999). Finding information on the world wide web: The retrieval effectiveness of search engines. Information Processing and Management, 35, 141–180. https://doi.org/10.1016/S0306-4573(98)00041-7
Griesbaum, J. (2004). Evaluation of three German search engines: Altavista.de, Google.de and Lycos.de. Information Research, 9(4), 1–35. Retrieved January 21, 2021, from http://informationr.net/ir/9-4/paper189.html
Harman, D. K., & Voorhees, E. M. (2006). TREC: An overview. Annual Review of Information Science and Technology, 40(2006), 113–155. https://doi.org/10.1002/aris.1440400111
Harris, R. (2020). Evaluating internet research sources. Retrieved January 21, 2021, from http://www.virtualsalt.com/evalu8it.htm
Hawking, D., Craswell, N., Bailey, P., & Griffiths, K. (2001). Measuring search engine quality. Information Retrieval, 4, 33–59. https://doi.org/10.1023/A:1011468107287
Höchstötter, N., & Lewandowski, D. (2009). What users see – Structures in search engine results pages. Information Sciences, 179(12), 1796–1812. https://doi.org/10.1016/j.ins.2009.01.028
Jansen, B. J., Zhang, M., & Schultz, C. D. (2009). Brand and its effect on user perception of search engine performance. Journal of the American Society for Information Science and Technology, 60(8), 1572–1595. https://doi.org/10.1002/asi.21081
Jansen, B. J., Zhang, M., & Zhang, Y. (2007). Brand awareness and the evaluation of search results. In Proceedings of the 16th international conference on World Wide Web (pp. 1139–1140), Banff, Alberta, Canada. New York: ACM. https://doi.org/10.1145/1242572.1242734
Joachims, T., Granka, L., Pan, B., Hembrooke, H., & Gay, G. (2005). Accurately interpreting clickthrough data as implicit feedback. In Conference on Research and Development in Information Retrieval (pp. 154–161), Salvador, Brazil, ACM. https://doi.org/10.1145/3130332.3130334
Lewandowski, D. (2008). The retrieval effectiveness of web search engines: Considering results descriptions. Journal of Documentation, 64(6), 915–937. https://doi.org/10.1108/00220410810912451
Lewandowski, D. (2011). Evaluierung von Suchmaschinen. Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in Der Web-Suche, 203–228.
Lewandowski, D. (2015). Evaluating the retrieval effectiveness of web search engines using a representative query sample. Journal of the Association for Information Science & Technology, 66(9), 1763–1775. https://doi.org/10.1002/asi.23304
Lewandowski, D., & Höchstötter, N. (2008). Web searching: A quality measurement perspective. In A. Spink & M. Zimmer (Eds.), Web search: Multidisciplinary perspectives (pp. 309–340). Springer.
Lewandowski, D., & Spree, U. (2011). Ranking of Wikipedia articles in search engines revisited: Fair ranking for reasonable quality? Journal of the American Society for Information Science and Technology, 62(1), 117–132. https://doi.org/10.1002/asi.21423
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
McCullagh, D. (2002). Google excluding controversial sites. CNET. Retrieved January 21, 2021, from http://news.cnet.com/2100-1023-963132.html
Mintz, A. P. (Ed.). (2012). Web of deceit: Misinformation and manipulation in the age of social media. Information Today, Inc.
Piper, P. S. (2000). Better read that again: Web hoaxes and misinformation. Search, 8(8), 40.
Schaer, P., Mayr, P., & Mutschke, P. (2010). Implications of inter-rater agreement on a student information retrieval evaluation, LWA 2010, Kassel.
Tague-Sutcliffe, J. (1992). The pragmatics of information retrieval experimentation, revisited. Information Processing & Management, 28, 467–490. https://doi.org/10.1016/0306-4573(92)90005-K
Voorhees, E. M. (2019). The evolution of Cranfield. In N. Ferro & C. Peters (Eds.), Information retrieval evaluation in a changing world (pp. 45–69). Springer. https://doi.org/10.1007/978-3-030-22948-1_2
Wikipedia: Five pillars (2021). https://en.wikipedia.org/wiki/Wikipedia:Five_pillars
The Deep Web
14
In previous chapters, we have already pointed out that no search engine can find all the content available on the Web, for both technical and financial reasons. This chapter also deals with such content, but focuses mainly on content that is accessible via the Web yet not in a form that general search engines can capture. This kind of information is located primarily in databases whose search forms we can access on the Web but whose content is not in the form of HTML pages that search engines can capture. The area of the Web that cannot be, or is not, captured by search engines is called the Deep Web.
A simple example of such a database is a telephone directory. We can search for names and get matching phone numbers. However, we would not find the telephone numbers if we entered a name in the search box of a search engine. Therefore, when we do such research via a search engine, we proceed in two steps: First, we look for a suitable source (by entering the navigational query telephone directory, for example), and then we search this source for the specific information we want.
A well-known illustration of the difference between the Surface Web and the Deep Web depicts a fishing boat that has cast its net and is catching the fish that are just below the surface of the water (Fig. 14.1). However, the deeper swimming fish cannot be caught in this way: for one thing, the net does not reach down to their depth, and for another, a net would not even be the appropriate tool, as Fig. 14.2 illustrates: rather, one would have to catch individual fish or small swarms of fish with particularly suitable fishing rods. In this analogy, search engines are like a net for capturing the content that lies on the surface of the Web, but not the deeper content, which requires special tools to capture.
There are two common definitions of the Deep Web or the Invisible Web. Both definitions date from 2001, i.e., from a time when the phenomenon of the Deep Web
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_14
Fig. 14.1 Schematic representation of the Surface Web (Bergman, 2001)
Fig. 14.2 Schematic representation of the Deep Web (Bergman, 2001)
had only just been discovered or described. This also explains the different choice of terms; by now, Deep Web and Invisible Web are mainly used synonymously.¹ Chris Sherman and Gary Price define the Invisible Web as:
Text pages, files, or other often high-quality authoritative information available via the Web that general-purpose search engines cannot, due to technical limitations, or will not, due to deliberate choice, add to their indices of Web pages. (Sherman & Price, 2001, p. 57)
In other words, we are basically talking about all information not found by general search engines. Michael Bergman’s definition of the Deep Web, on the other hand, is narrower:
¹ Michael Bergman points out that the term Deep Web is to be preferred because it has become established and is also more accurate on an epistemological level: after all, he says, there are no documents that are invisible.
The deep Web—those pages do not exist until they are created dynamically as the result of a specific search. (Bergman 2001)
This definition includes only those documents that are dynamically generated as a result of a query in a Deep Web source. The documents are only created in response to a query. For example, suppose one searches an electronic library catalogue for books on a particular topic. In that case, the documents (in this case, the title records of the books) are only generated and compiled at the moment the query is made. So, first, the database entries are not static HTML documents that the search engines’ crawlers can find via links. While we, as users, can make appropriate queries to access the content from the databases, this is not possible for search engines. This has been summed up in the catchy formula: “Search engines cannot fill out forms.” So as soon as a search engine comes across a search form in its crawl, it does not get any further and cannot capture the content “behind” the search form. Although there are approaches that attempt to do just this (Madhavan et al., 2008; an overview of the approaches is given by Hernández et al., 2019), there has been no breakthrough in the complete capture of Deep Web content, nor can such a breakthrough be expected in the foreseeable future (simply because of the volume of databases to be captured). Dynamic documents, i.e., the contents of databases, as named by Bergman, are included in Sherman and Price’s definition since the general search engines cannot find them. Database content is undoubtedly an essential part of the Deep Web. However, considering this part alone would not be sufficient to describe the problems caused by the fact that many contents cannot be found by search engines. Therefore, in the following, Sherman and Price’s definition will be used in a somewhat more precise form: The Deep Web includes all content accessible via the Web that is not (or cannot be) indexed by general search engines, especially the content of databases accessible via the Web. 
Therefore, in the following, we will use the terms Invisible Web and Deep Web interchangeably. The opposite of the Deep Web is the so-called Surface Web, which consists of all content that general search engines can find. The Deep Web should not be confused with the so-called Dark Web. The Deep Web is often seen as a kind of secret area of the Internet where conspiracy and crime flourish. Of course, such content can also be found on the Deep Web (as on the Surface Web, by the way), but it only makes up a small part of it. The term Dark Web (Chen, 2012) has become established for this part.
14.1
The Content of the Deep Web
Sherman and Price (2001, p. 62ff.) provide an overview of the types of Invisible Web content:
• Disconnected pages (i.e., non-linked pages): These pages are not linked from other documents and, therefore, cannot be found by crawling (see Sect. 3.3).
• Pages consisting primarily of images, audio files, or videos: Search engine indexing is based on text. If, as in the case of the file types mentioned, there is no (or hardly any) text, the content cannot be found. Search engines capture such files primarily through surrounding text (see Sect. 3.4.1).
• Files in specific formats that search engines cannot capture: This includes all documents that are in file formats that the search engines cannot capture. Although the documents can often be found with the help of search engines, the contents of these documents are not searchable. For a long time, the Flash format was relatively popular and was used on numerous websites. However, this content could not be read by search engines or could only be read inadequately. In their book, Sherman and Price also cite PDF files, which at that time (2001) could not yet be indexed by search engines. One can see from this example that the boundaries of the Deep Web are constantly shifting due to advances in the technology of search engines. However, we will see that there are also areas that even future search engines will not be able to penetrate.
• Content in relational databases: This is content that is “hidden” behind search forms.
• Real-time content: This is content that is constantly changing, such as stock market prices or weather data. It is available on Web pages, but search engines cannot always index it.
• Dynamically generated content: This refers to content that is adapted to an individual user based on a current query and is, therefore, irrelevant for indexing in search engines.
Compared to the state of content capture by search engines described by Sherman and Price in 2001, there have, of course, been significant changes.
Search engines are now much better at indexing content in various file formats. Disconnected pages are also likely to be much less of a problem now, partly because content producers have realized that they must take action to ensure their content is accessible. This also applies (at least partly) to making database content visible. Although there are still many databases whose content is not (or cannot be) indexed by search engines, many database providers have created an HTML equivalent for each document in the database, which can be indexed like other HTML pages. Even large databases can be prepared for search engines (Heinisch, 2003; Lewandowski, 2018). Although this method can be used to find single documents or pre-compiled document lists, it is not possible to map the database’s actual search options, which leads to individual result lists or compilations of documents. Less progress has been made in indexing multimedia content, provided one assumes actual content-based indexing. For example, within images or videos, only features that are quite easy to extract (colors, sometimes also superimposed texts) are recognized. At the same time, the main part of the searchable representation is created with the help of surrounding texts (see Sect. 3.4.1).
Real-time content can still hardly be found in its current version. Exceptions exist where search engines retrieve this content directly from a database when a query is made and include it in the search engine result page (see Sect. 7.2). However, this only applies to particularly popular content such as the weather and stock market prices; most real-time content must be retrieved directly from appropriate websites.
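The approach mentioned above, in which a database provider creates an HTML equivalent for each document, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of how any particular provider works: the record fields, the helper names `render_record` and `build_sitemap`, and the `example.org` address are hypothetical. The sitemap is an additional assumption; the text only mentions the HTML equivalents, and linking to them from an ordinary index page would serve the same purpose.

```python
from html import escape

# Hypothetical catalogue records; in practice these would come from the database.
RECORDS = [
    {"id": 1, "title": "Understanding Search Engines", "author": "D. Lewandowski"},
    {"id": 2, "title": "Web Search & Retrieval", "author": "A. Example"},
]

BASE_URL = "https://example.org/records"  # assumed public address of the pages


def render_record(record: dict) -> str:
    """Return a static HTML page for one database record."""
    title = escape(record["title"])
    author = escape(record["author"])
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1><p>Author: {author}</p></body></html>\n"
    )


def build_sitemap(records: list[dict]) -> str:
    """Return a sitemap with one URL per record so crawlers can reach every page."""
    urls = "\n".join(
        f"  <url><loc>{BASE_URL}/{r['id']}.html</loc></url>" for r in records
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{urls}\n</urlset>\n"
    )


# One static page per record, keyed by file name, plus the sitemap.
pages = {f"{r['id']}.html": render_record(r) for r in RECORDS}
sitemap = build_sitemap(RECORDS)
```

Pages produced in this way make individual records findable, but, as noted above, they do not reproduce the database's own search options.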
14.2
Sources vs. Content from Sources, Accessibility of Content via the Web
An important distinction that needs to be made (not only) concerning the Deep Web is between sources and content from sources. In Chap. 12, a difference was already made between searching for content (from different sources) and searching in two steps, where the first step is to search for an appropriate source and the actual thematic search is then carried out in this source in the second step. This two-step process is particularly significant in the case of Deep Web sources because while we can use general search engines to find content from sources we have not even thought of (e.g., from a website of a relevant journal on the topic), the content of Deep Web sources cannot be found via general search engines—but the sources themselves can. This is illustrated in Fig. 14.3: The left part shows a conventional website whose individual documents can be found by the search engines following links. The right part, on the other hand, shows a Deep Web database: The search form shown in the upper part is a conventional HTML page that search engines can capture. However, this page contains a search form used to query the database (shown in the lower part). Search engines cannot reach the contents of the database because they are not capable of filling in the search form in a meaningful way.
Fig. 14.3 Distinction between a website and a Deep Web source (left panel: Website; right panel: Invisible Web database)
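The distinction described above, a crawlable search page in front of an uncrawlable database, can be made concrete with a small sketch. It uses only Python's standard `html.parser` module; the `PageScanner` class and the sample page are hypothetical and greatly simplified compared with a real crawler.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collect link targets and note which forms appear on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.form_actions: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            # Crawlable: the crawler can simply follow this link.
            self.links.append(attributes["href"])
        elif tag == "form":
            # A link-following crawler has no query to submit here,
            # so whatever lies behind the form stays uncrawled.
            self.form_actions.append(attributes.get("action", ""))


# Hypothetical search page of a library catalogue.
SEARCH_PAGE = """
<html><body>
  <a href="/about.html">About the catalogue</a>
  <a href="/help.html">Help</a>
  <form action="/search" method="get">
    <input type="text" name="q">
    <input type="submit" value="Search">
  </form>
</body></html>
"""

scanner = PageScanner()
scanner.feed(SEARCH_PAGE)
print("Links the crawler can follow:", scanner.links)
print("Forms it cannot fill in:", scanner.form_actions)
```

The page itself and the two linked documents are reachable; the records returned by `/search` would only appear in response to a query that the crawler does not know how to formulate.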
Fig. 14.4 Taxonomy of online information (translated from Stock, 2003, p. 27)
Stock (2003) distinguishes online information according to its accessibility on the Web versus via the Web (Fig. 14.4). “On the Web” refers to all content that search engines can theoretically index; “via the Web” refers to the content of databases, which search engines cannot capture. In the left part of the diagram, the Web content is further differentiated: On the one hand, there is content that is exploited by search tools. These can be search engines but also Web directories (for differentiation, see Chap. 2). On the other hand, there is also content on the Web that is still not captured by search engines. This can be relevant information that a user misses during their Internet-based research because the search engine has not indexed it for technical or capacity reasons (see Sect. 3.1). Then, there is content that search engines deliberately do not include because it is content that is simply not considered desirable by the search engine in question (spam; see Sect. 5.8). The contents that are not exploited—irrespective of the reasons why they are not exploited—belong to the Deep Web. This shows that there is content that search engines could index without any technical problems but which is nevertheless excluded from the index.
On the right-hand side of the diagram, the content accessible via the Web (i.e., the databases in this illustration) is further differentiated as well. Firstly, there are singular databases whose search forms are available on the Web and which can often be accessed free of charge. Such databases are offered, for example, by public institutions whose task includes providing information on their respective topics. In addition, there are commercial information providers who sell their database content. Here, a distinction is made between those who market one or more databases they have produced on their website and aggregators who combine the content of many databases, especially those they have not created themselves, and make them searchable under a single interface. To illustrate the amount of information that can lie behind one single interface, consider the example of LexisNexis, a provider of press, business, and legal databases: This provider combines more than 60,000 different databases under a single interface (LexisNexis, 2021a). These databases contain more than 65 billion documents (LexisNexis, 2021b), most of which are not available on the Surface Web. Finally, hybrid search engines combine the best of both worlds: they use data from the Surface Web (usually a specifically selected subset) and combine it with content from the Deep Web. This way, structured information from databases can be combined with unstructured content from the Web.
[Figure labels, translated from German]
Online information: on the Web; via the Web
On the Web, exploited by search tools: search engines; Web directories
On the Web, not exploited by search tools: possibly relevant information; spam
Via the Web: (free) singular databases; commercial information providers (self-marketers; aggregators, including online hosts)
Hybrid systems
Fig. 14.5 Taxonomy of online information (translated from Stock, 2003, p. 27), supplemented by exemplary information services
Which Offering Belongs in Which Area?
Figure 14.5 shows Stock’s taxonomy of online information, but this time with examples from the respective areas. In the left part of the figure, we see examples of services we are already familiar with, which index content on the Web. Bing serves as the example of a search engine, and Curlie as the example of a Web directory (see Sect. 2.4.4). In the second column of the left-hand section, no examples are given since this is the area of the Web that search tools do not cover. In the right-hand area, we first look at a free singular database, namely, the eSearch plus database of the European Union Intellectual Property Office (EUIPO, formerly the Office for Harmonization in the Internal Market; https://euipo.europa.eu/eSearch/). This database contains information on trademarks, designs, and models, so its purpose is very specific. The use of eSearch plus is free of charge and free of advertising since the creation of the database is publicly funded. Stock distinguishes between self-marketers and aggregators on the side of commercial information providers. A self-marketer is, for example, MarketLine, a database with information about companies. The database is created by a commercial company; it is financed by selling access to the information. An example of an aggregator is LexisNexis. It is an online host under whose interface one can search in a multitude of databases, especially in the fields of press, law, and business. Finally, we give an example of a hybrid search engine, Google Scholar. This search engine collects academic literature from the Surface Web and combines it with content located on the Deep Web (see Sect. 6.3.2 for more details).
14.3
The Size of the Deep Web
There are no reliable figures on the size of the Deep Web. However, figures are repeatedly quoted in various publications, which is why this section will address how these numbers come about and why they are hardly meaningful. First of all, it is challenging to calculate the size of the Deep Web. Let us assume that we were dealing with search engines that could capture all the Surface Web content, and on the Deep Web side, only the content of the databases could not be captured. Even in this favorable case, to calculate the size of the Deep Web, we would have to know the number of documents in all these databases and the number of other documents in the Deep Web. Then we would also have to know how much overlap there is between the databases since we do not want to count documents twice.
Similar to determining the size of the Surface Web (Chap. 3), this poses significant problems. The size calculations available so far are therefore not based on complete surveys but attempt to extrapolate the total size of the Deep Web based on a selection of sources. And precisely here lies the crux, because the known calculations assume that their sample can be “simply” extrapolated, but this is not the case since the size distribution of the Deep Web databases is an informetric distribution (see Sect. 4.5.5). The most frequently cited figures are (still) those from Michael Bergman’s (2001) paper. Apart from the fact that these figures are hopelessly outdated by now, they are also probably much too high due to the faulty extrapolation mentioned above (see Lewandowski & Mayr, 2006). According to Bergman, in 2001, there were 550 billion documents on the Deep Web, which, based on the maximum of one billion documents indexed in the largest of the search engines on the Surface Web at the time, meant that the Deep Web was 550 times larger than the Surface Web. Bergman used the average size of the known Deep Web databases for his calculation and multiplied this by the estimated total number of databases. However, by calculating the average size (arithmetic mean), it was assumed that the sizes are evenly distributed around a mean value. If we compare the mean value of the number of documents in the 60 databases used by Bergman (5.43 million documents) with the median (4950 documents), we see that we are dealing with a highly skewed distribution and that the calculation of the arithmetic mean therefore produces misleading results. Based on this fact and estimates based on Bergman’s data, Lewandowski and Mayr (2006) conclude that the size of the Deep Web at that time was in the region of the size of the Surface Web and the size of the academically relevant area of the Invisible Web was between 20 and 100 billion documents (Lewandowski & Mayr, 2006, p. 536). 
This conclusion is, of course, unsatisfactory, as only a very rough estimate is given. Ultimately, the question arises whether it actually matters how large the Deep Web is precisely. Leaving aside the scientific interest in this question and looking at the matter from the perspective of a user searching for information, the main thing that matters is that the Deep Web exists on a significant scale and offers information that cannot be found on the Surface Web or can only be found with difficulty or in an unstructured way. The sources of the Deep Web are, therefore, essential, especially for more extensive Internet-based research.
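The extrapolation problem discussed in this section can be illustrated numerically. The sketch below uses synthetic, hypothetical database sizes and a hypothetical total number of databases (neither is taken from Bergman's data); it only shows how strongly a few very large databases pull the arithmetic mean away from the median when the distribution is highly skewed.

```python
from statistics import mean, median

# Synthetic, hypothetical database sizes (number of documents): 57 small
# databases and 3 very large ones, i.e., a highly skewed distribution.
sizes = [1_000 * (i + 1) for i in range(57)] + [50_000_000, 100_000_000, 150_000_000]

ASSUMED_TOTAL_DATABASES = 100_000  # hypothetical; not a figure from the text

mean_size = mean(sizes)
median_size = median(sizes)

# Extrapolating with the mean lets the three giants dominate the estimate.
estimate_from_mean = mean_size * ASSUMED_TOTAL_DATABASES
# The median describes the typical database; it is shown only for contrast,
# not as a correct estimator of the total.
estimate_from_median = median_size * ASSUMED_TOTAL_DATABASES

print(f"mean size:   {mean_size:,.0f}")
print(f"median size: {median_size:,.0f}")
print(f"ratio of the two extrapolations: {estimate_from_mean / estimate_from_median:,.0f}")
```

With the figures reported above for Bergman's 60 databases (a mean of 5.43 million documents against a median of 4950), the two values differ by a factor of roughly 1,100, which is why multiplying the mean by the estimated number of databases gives a misleading total.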
14.4
Areas of the Deep Web
In addition to the technical reasons already mentioned for documents belonging to the Deep Web, there are structural reasons based on which Sherman and Price (2001, p. 70ff.) divide the Invisible Web into four areas:
• Opaque Web: These are pages that could be captured by search engines but are not. Reasons for non-coverage include depth of crawl (search engines do not cover all documents on a website), freshness (search engines cannot keep their databases fully up to date), the maximum number of results displayed (users cannot access all theoretically available results for a query as search engines usually only display a maximum of 1000 results), and disconnected pages.
• Private Web: These are pages deliberately excluded by their authors from indexing by search engines, for example, by means of password queries or exclusion commands in the robots.txt file (see Sect. 3.3.1).
• Proprietary Web: This content can only be used after agreeing to specific terms of use. This may involve registration with personal data, for example, but also content for which a fee is charged and for which a contract has to be concluded beforehand. The presence of content from the proprietary Web becomes evident when the snippet on the search engine result page already indicates this (see Fig. 14.6). The proprietary Web also includes a large part of the content of social media services that can only be accessed after prior login.
• Truly Invisible Web: These are pages or websites that search engines cannot index for technical reasons. The truly Invisible Web does not have a clearly defined boundary because the technical capabilities of search engines are constantly changing. Therefore, content that is invisible today could be made visible tomorrow by employing new methods.
Fig. 14.6 A snippet of a proprietary Web source (example from Google; November 11, 2017)
Some of these areas can be made smaller or perhaps even entirely dissolved by more advanced search engine technology. This concerns the opaque Web and the truly Invisible Web. The content of the private Web and the proprietary Web, on the other hand, will probably never be included in the general search engines, even though parts of it are already being made accessible by specialized or hybrid search engines (see Sect. 6.1).
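How exclusion commands in a robots.txt file place pages in the private Web can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `example.org` URLs, and the crawler name `ExampleBot` are made up for this illustration; real crawlers may interpret such files with additional rules of their own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that keeps one directory out of search engines.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name used only for this illustration.
public_ok = parser.can_fetch("ExampleBot", "https://example.org/index.html")
private_ok = parser.can_fetch("ExampleBot", "https://example.org/private/report.html")

print("public page may be crawled: ", public_ok)
print("private page may be crawled:", private_ok)
```

A crawler that honors the file skips everything under `/private/`, so these pages remain outside the index even though they are technically reachable.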
14.5
Social Media as Deep Web Content
It has already been mentioned that much social media content belongs to the proprietary Invisible Web and is not accessible to general search engines. However, the vast popularity and massive data sets of social media services such as Facebook and Instagram show the significant gap in coverage of relevant content that search engines have. Social media is understood as a variety of offerings:
Social media is a collective term for Internet-based media offerings based on social interaction and the technical possibilities of the so-called Web 2.0. The focus is on communication and the exchange of user-generated content. (Sjurts, 2011; translated from German)
The emergence of social media services was partly associated with the expectation that they could develop a new way of accessing the information on the Web (Maaß & Lewandowski, 2009; Peters, 2011; see also Sect. 2.4.7). Although these services now reach a huge user group (see Chap. 9), they are hardly ever the focus of search and have by no means replaced search engines but at most supplemented them. From the perspective of Internet-based research, social media services, like other databases of the Deep Web, are specialized information services that are used in addition to other information services. However, at least some of the content of social media services can be found via general search engines, at least for services that are interested in gaining traffic via them. It is possible to classify social media services according to their (targeted) visibility in search engines:
1. Services that make their databases fully searchable: These services have built their own platform but rely (or are dependent) on getting traffic via search engines. The best-known example here is probably question-answering sites (see Sect. 2.4.6), which present the questions with the corresponding answers as publicly accessible HTML pages so that search engines can find them.
2. Services that make their databases only partially searchable: The primary focus of these services is their own platform; however, basic information is made publicly available to get traffic from search engines. The content from social networking sites that can be accessed on the Surface Web represents only a small section of the content available there. Users usually have to explicitly indicate if a particular piece of content is to be made public. By default, messages on social networking sites address restricted audiences, e.g., all “friends” of the author in the network. In other cases, basic information is made publicly available; further information is reserved for the users of the respective network.
An example is the social networking site LinkedIn, where the user profiles with basic information are publicly accessible in the default settings. Further information can only be accessed on the platform itself (after registration). However, this means that the data is not fully searchable using search engines, and many types of queries can only be made using the search function on the platform itself. The extent of the desired visibility, and thus of the publicly available data, differs considerably between different social media services. So why are social media service providers not interested in making their content fully accessible via general search engines? First of all, these providers have built up exclusive databases that are of particular value. Whereas single, restricted offerings have a strong interest in being found by search engines and thereby becoming known to users (see Chap. 9), providers such as Facebook have built up such large and exclusive databases that users access these offerings on their own and mediation by
search engines is mostly limited to navigational queries. However, it should not be underestimated that even these services still receive a significant portion of their traffic through search engines; for Facebook, for example, this is just under 11% (www.similarweb.com). For search engines, this means an enormous loss in terms of the completeness of their offerings. It is, therefore, not surprising that they have tried to gain access to social media service providers’ data by entering agreements with them. In the United States, for example, Facebook was integrated into the search engine result pages of Bing; Google had, for a time, integrated messages from Twitter into its search results. Furthermore, Google, in particular, tried to build up its own social network (Google+ is worth mentioning here) but failed to do so. Attempts to merge social media and search have not been successful so far, and by now, the separation of the two areas seems to be cemented. At least there have been no new attempts to integrate on a large scale in recent years.
14.6
What Role Does the Deep Web Play Today?
Knowledge of the Deep Web plays a role especially in more complex Internet-based research. As convenient as research using conventional search engines may seem at first, it can be inefficient, especially for more complex tasks. In many cases, much time is spent not only on finding information but also on compiling or summarizing it. Here, it is often worthwhile to look for suitable Deep Web databases (see Chap. 14). In many cases, the information there is more complete and better structured. The use of paid databases should also be considered against this background: Even supposedly free Internet-based research can be costly in terms of the time the researcher needs for their work. If one weighs these two types of costs against each other, there is often a cost advantage in searching paid sources or even in commissioning an information broker with more extensive Internet-based research. In this chapter, the Deep Web has been considered primarily from a technical perspective. However, some approaches attempt a “cognitive definition” of the Deep Web: Ford and Mansourian (2006), for example, assume that all those contents belong to the Deep Web that cannot be found by users due to lack of knowledge or lack of commitment. But, of course, this results in entirely different boundaries, which vary for each individual user.
14.7
Summary
The Deep Web (also known as the Invisible Web) includes all content that is accessible on the Web or via the Web but is not covered by general search engines. An essential component of the Deep Web is specialized databases whose content is output in a unique compilation after a query is submitted via a search form. Search engines are currently unable to retrieve this content by using meaningful queries. For
Internet-based research, this means that it makes sense, especially for more complex inquiries, not to search in search engines alone but also to identify suitable databases of the Deep Web and then conduct research there. It is difficult to determine the size of the Deep Web; there is considerable doubt about the figures that are repeatedly quoted. Furthermore, there are different areas of the Deep Web; in some, search engines are making progress through new technologies. However, the content of other areas, such as paid databases, will probably not be captured by search engines in the future.
Further Reading
Although the book by Sherman and Price (2001) is hopelessly outdated in its comprehensive listing of Invisible Web sources, the general section on the problems of the Invisible Web and its subsections are still worth reading. Some interesting aspects, especially for people who want to teach the concept of the Deep Web in the context of information literacy training or similar, can be found in Devine and Egger-Sider (2009) and Devine and Egger-Sider (2014).
References
Bergman, M. K. (2001). The deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1), 1–17. https://doi.org/10.3998/3336451.0007.104
Chen, H. (2012). Dark web: Exploring and data mining the dark side of the web. Springer.
Devine, J., & Egger-Sider, F. (2009). Going beyond Google: The invisible web in learning and teaching (pp. 111–127). Neal-Schuman Publishers.
Devine, J., & Egger-Sider, F. (2014). Going beyond Google again: Strategies for using and teaching the invisible web. Facet Publishing.
Ford, N., & Mansourian, Y. (2006). The invisible web: An empirical study of “cognitive invisibility.” Journal of Documentation, 62(5), 584–596. https://doi.org/10.1108/00220410610688732
Heinisch, C. (2003). Suchmaschinen des Surface Web als Promotoren für Inhalte des Deep Web – Wie Doorway-Pages als “Teaser” zu Datenbank-Inhalten in die Index-Files der Suchmaschinen gelangen. In R. Schmidt (Ed.), Competence in content, 25. Online-Tagung der DGI (pp. 13–24). DGI.
Hernández, I., Rivero, C. R., & Ruiz, D. (2019). Deep web crawling: A survey. World Wide Web, 22(4), 1577–1610. https://doi.org/10.1007/s11280-018-0602-1
Lewandowski, D. (2018). Zugänglichkeit von Information Services und ihren Inhalten über Suchmaschinen. In F. Schade & U. Georgy (Eds.), Praxishandbuch Informationsmarketing (pp. 358–369). De Gruyter. https://doi.org/10.1515/9783110539011-024
Lewandowski, D., & Mayr, P. (2006). Exploring the academic invisible web. Library Hi Tech, 24(4), 529–539. https://doi.org/10.1108/07378830610715392
LexisNexis. (2021a). Lexis premium content. https://www.lexisnexis.com/en-us/products/lexis/feature-premium-content.page
LexisNexis. (2021b). Company Snapshot. https://www.lexisnexis.com/en-us/about-us/companysnapshot.page
Maaß, C., & Lewandowski, D. (2009). Frage-Antwort-Dienste als alternativer Suchansatz? In R. Kuhlen (Ed.), Information: Droge, Ware oder Commons? Wertschöpfungs- und Transformationsprozesse auf den Informationsmärkten; Proceedings des 11. Internationalen Symposiums für Informationswissenschaft (ISI 2009); Konstanz, 1.–3. April 2009 (pp. 51–61). Verlag Werner Hülsbusch.
Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., & Halevy, A. (2008). Google’s deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241–1252. https://doi.org/10.14778/1454159.1454163
Peters, I. (2011). Folksonomies und Kollaborative Informationsdienste: Eine Alternative zur Websuche? In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 29–53). Akademische Verlagsanstalt AKA.
Sherman, C., & Price, G. (2001). The invisible web: Finding hidden internet resources search engines can’t see. Cyberage Books.
Sjurts, I. (2011). Soziale Medien. In Gabler Wirtschaftslexikon. Springer Gabler. http://wirtschaftslexikon.gabler.de/Archiv/569839/soziale-medien-v2.html
Stock, W. G. (2003). Weltregionen des Internet: Digitale Informationen im WWW und via WWW. Password, 18(2), 26–28.
15
Search Engines Between Bias and Neutrality
The importance of search engines was already explained in Sect. 1.1, and it largely comes down to their widespread use. Connected with this is the users’ apparently unshakeable belief in the quality and objectivity of search results, which was nicely summed up as early as 2003 with a quote from a user study: “Of course it’s true, I saw it on the Internet” (Graham & Metaxas, 2003). Tremel (2010, p. 249) summarizes his study on the credibility of search results as follows: “Users follow the machine and its relevance assessment, which is implicitly expressed in the ranking of the results, apparently largely without reflection” [translated from German]. Users thus assume that the quality implied by search engines corresponds to actual quality. But unfortunately, it is precisely this unreflective attitude that ensures that search engine providers can influence what users see and what they ultimately select far beyond technical realities (e.g., unavoidable ranking effects) (on selection behavior, see Sect. 7.6). Even if all documents in the ranking process are treated equally by a search engine, i.e., the ranking is done according to “objective” procedures, the ranking algorithms still contain assumptions made by humans that can be interpreted at least in part as judgements of taste. As early as 2009, Alexander Halavais pointed out: “It is all too easy to forget that the machines we are asking are constructed in ways that reflect human conceptions and values” (Halavais, 2009, p. 3; equivalent in content in the current 2018 edition). The discussion about the ranking or presentation of search results is primarily held concerning the placement of products, services, and companies in the dominant search engine Google. 
However, the impact of the power of search engines (or this one search engine) is likely to be much more significant in other areas: The dimensions of this problem become particularly clear when it comes to controversial political content: Whoever manages to get their content into the top positions here has taken a significant step towards establishing themselves as an authority on the topic of the query and shaping the users’ opinions accordingly. This problem is exacerbated by the increasing
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_15
use of search engines in education, research, and journalism, i.e., the central areas of social knowledge transfer. (translated from Röhle, 2010, p. 12)
The discussion in the following sections is oriented toward three central problems:
• Search engine algorithms contain implicit assumptions and value judgements so that no search engine can be neutral.
• Search engine providers have diversified through vertical search engines and become content providers; they prefer their own offerings that go beyond core search on their search engine result pages.
• Search engines can be manipulated through search engine optimization.
15.1
The Interests of Search Engine Providers
In 2010, the trade magazine Search Engine Land published an article by Ian Lurie naming “three lies” of the major search engines (Lurie, 2010). They concern how search engine providers act and the contradictions between their actions and their proclamations. The first point refers to the search engine providers’ claim that only the quality of a website’s content matters for ranking highly in the search results and thus for receiving visitors and, accordingly, being widely linked on the Web. However, methods of search engine optimization (i.e., adapting one’s own documents to the algorithms of search engines; see Chap. 9) are necessary to rank well. This may not apply to all areas, but the importance of search engine optimization has continued to grow. Today, one can at least assume that it is impossible to achieve good rankings without optimization in all areas where commercial potential can be found. In this context, it is not only areas directly concerned with selling products or services that need to be taken into account but also areas in which the primary focus is on conveying information. Thus, search results are not solely the result of a quality assessment carried out by the search engine but are the result of an interaction between different actors (cf. Röhle, 2010, p. 14): content providers influence the search engine result pages (primarily through their contracted search engine optimizers), and users also have an influence on the results, at least en masse, through their clicks (cf. Sect. 5.3.2). The second issue relates to the numerous non-search offerings that the major search engine providers now make and with which they are in direct competition with other providers specializing in these areas. 
The core of the problem lies in the fact that general search engines have increasingly diversified by offering and incorporating specialized services and that they have also become content providers, at least in part (e.g., Google with YouTube and Google My Business), whose success, in turn, depends on being found in search engines. A good example of this problem is route planners: All well-known search engines offer such services (Google Maps, Bing Maps) and prefer to list these offerings in their result lists, often including seamless integration of navigation functionality within the paradigm of universal search. In contrast to
these findings, the search engine providers suggest that they are not in competition with those companies presented in the search results. However, a corresponding positioning (and highlighting) in the search engine result lists also means an increased influx of visitors. This extreme fixation on result position and highlighting can result in less well-placed offerings being robbed of a large proportion of their visitors. For many Web offerings, it can be seen that they only exist “by Google’s grace,” as they depend on this dominant search engine feeding them users. The third criticism concerns the assumption that search engines are “good.” This idea goes back to Google’s corporate motto, “Don’t be evil” (Alphabet, 2020). Lurie simply states that although the quality of the search results would be an important criterion for users to use a search engine repeatedly and increased use would ultimately mean more opportunities for the search engines to display advertisements and thus more revenue, there could never be a complete orientation toward result quality. Instead, search engines are simply profit-oriented commercial enterprises that would be neither good nor evil in this sense. The search engine providers’ argument that they are only intermediaries of information without themselves influencing what users read by compiling the result lists can only be upheld if one assumes that it is indeed possible to produce neutral result lists and that the search engines do not pursue their own interests with the result lists, i.e., do not give preferential treatment to their own offerings (or offerings they like in any way). In this respect, current search engines are by no means neutral intermediaries solely interested in the best possible quality of the search results. On the contrary, many interests of both the search engine providers and external providers go into the search results. 
The fundamental question is thus no longer whether search engine providers behave in a “neutral” manner but what responsibilities arise from their role as dominant technical information brokers in society. Until now, search engine providers have argued on a purely technical level and rejected any further responsibility for their services (see Lewandowski, 2017).
15.2
Search Engine Bias
In the context of search results, bias is understood as the differences between an ideal set of results and ranking and the actual set of results and ranking. The ideal would be a set of results that includes all potentially relevant documents and a ranking of results based solely on objective criteria, thus producing a neutral ranking of results. In this context, the wish for “neutral search engines” is often expressed. However, such search engines do not and cannot exist, as every search engine is subject to distortions (see also Chap. 5). Weber (2011, p. 278) identifies three areas in which these biases arise:
• Implementation (in a broad sense) of the respective search engine
• Behavior of content providers on the Internet
• Behavior of search engine users
The implementation of the search engine refers to technical conditions, especially indexing (as described in Chap. 3), and ranking algorithms (Chap. 5) that are based on human assumptions about the rating of documents and judgements of taste. The behavior of content providers deals with the numerous factors that influence search results; these range from decisions when writing texts and structuring websites to comprehensive search engine optimization measures (Chap. 9). The factors influencing the result positions can be divided into three areas according to actors: First, of course, it is the search engine operators themselves who, through the algorithms they have developed, determine which results come out on top and which have to settle for lower rankings. Even if the search engines repeatedly emphasize that results are generated purely by machines based on algorithms, it is clear that these algorithms are also based on human assumptions about the suitability of documents for a query. Second, the mass of users has a considerable influence on the search results (and thus on what reaches the user). The quality of documents on the Web is measured primarily based on their popularity, be it popularity on the Web as a whole or popularity among a specific user group (see Sect. 5.3). Third, certain individuals also have a say in the results of search engines. Search engine optimization techniques can be used to modify documents (and their position on the Web as expressed by links) to meet the search engines’ criteria better than other documents. The aim is to achieve a high ranking in the result lists, which under certain circumstances can make or break a commercial Web offering (see Chap. 9). The third point Weber mentions concerns a distortion of the search results caused by the users themselves. 
Simply because of their poor research skills and lack of strategies, users deprive themselves of some of the relevant results, for example, if they only search in one language, although numerous other relevant documents could be found in other languages (Weber, 2011, p. 280 f.). From what has been said, it is clear that a bias-free search engine is not possible, and it is, therefore, less about such an ideal than about the question of which problems arise from bias and how to deal with them. Bias becomes a problem primarily through the combination of three factors (Lewandowski, 2014, p. 233):
1. The dominance of the algorithmic Web search engine model over other methods of finding information on the Web
2. The dominance of Google in this field
3. The behavior of search engine users (short queries, hardly any systematic viewing of results, little knowledge of search engines)
It is only because one search engine is consulted for the vast majority of information needs and the users are not very competent and reflective in their search behavior that the biases of this one search engine play such a large role. If users were to use a multitude of search engines, each search engine would still have specific biases, but these would be pushed into the background because they would balance
each other out to a certain extent. In addition, systematic biases would be easier to detect in a direct comparison of different search engines.
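The idea that systematic biases are easier to detect when several search engines are compared directly can be illustrated with a small sketch. It assumes, purely for illustration, that the top results of each engine for one query are available as lists of URLs. The engine names, the URLs, and the choice of Jaccard overlap as a measure are assumptions and are not taken from the text.

```python
from itertools import combinations


def jaccard_overlap(results_a, results_b, k=10):
    """Share of URLs that two top-k result lists have in common (0.0 to 1.0)."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty lists are trivially identical
    return len(top_a & top_b) / len(top_a | top_b)


def pairwise_overlap(results_by_engine, k=10):
    """Compute the overlap for every pair of engines."""
    return {
        (a, b): jaccard_overlap(results_by_engine[a], results_by_engine[b], k)
        for a, b in combinations(sorted(results_by_engine), 2)
    }


# Hypothetical top-3 results for one query from three hypothetical engines.
example_results = {
    "engine_a": ["site1.example", "site2.example", "site3.example"],
    "engine_b": ["site2.example", "site3.example", "site4.example"],
    "engine_c": ["site5.example", "site6.example", "site1.example"],
}

overlaps = pairwise_overlap(example_results, k=3)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A low overlap does not by itself show which engine is biased; it only indicates where the selections differ and a closer look may be worthwhile.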
15.3
The Effect of Search Engine Bias on Search Results
This section will describe some examples of search engine bias. It should be borne in mind that these are only snapshots and that no statements can be made as to how the search results should be. Furthermore, the deviations described here are not deviations from a standard but algorithmic interpretations of the search engines that turn out in a certain way. Figure 15.1 shows Google’s search engine result page for the query sportlerinnen (female athletes) for a user who is not logged in. All top results deal with “most beautiful female athletes”; female athletes are not treated in the actual context of their profession. Research on gender and racial stereotypes shows that these are at least reproduced, if not promoted, by search engines (Noble, 2018). The issue here is not whether it is legitimate for a search engine to focus on a single aspect or to prefer a particular type of document, but whether the results might not be different. This leads us directly to whether a single search engine should determine what (almost) all users see (see Chap. 11). It has already been discussed in Sect. 13.2 that a search engine’s automatic “quality assessment” does not necessarily have to go hand in hand with a human quality assessment. This effect can be seen perhaps even more drastically in an example that no longer occurs, namely, the query jew on Google (Bar-Ilan, 2006). For a while, Google showed an anti-Semitic propaganda page in the first position of the result list for this query, which led to numerous complaints. Google finally responded by displaying an ad above the search results with the following text: “We’re disturbed about these results as well. Please read our note here.” The text explained that the search results are generated automatically and that there is no direct human intervention. 
Although the example is outdated, Google also reacted in a similar way in more recent cases: It was noticed today hateful memes appear in image results for “jewish baby strollers.” We apologize. These don’t reflect our opinions. We try to show content matching all key terms searched for, as people normally want. But for “data voids” like this, it can be problematic. . . (Google Search Liaison, 2020)
On the one hand, this procedure again illustrates that automatic quality assessment does not have to conform to human judgements. But on the other hand, it shows a central problem of search engine providers: If they were to intervene manually in the search results, they would be inundated with further demands for manual intervention. Ultimately, they would lose sovereignty over their search results if they were to accept such requests. So, as undesirable as some results may seem to us, the search engine providers’ refusal to change search results in response to
Fig. 15.1 Google search engine result page for the query sportlerinnen (female athletes) (detail; January 6, 2021)
requests is understandable. However, this does not mean there is no manual intervention in the search results (see Sect. 3.3.2). Another problem with search results is confirmatory information. Ballatore (2015) demonstrates using popular conspiracy theories that users who search for them prefer to be shown pages that confirm these theories, which ultimately leads to those pages being ranked higher. The reason for this is not that critical or disconfirming pages are not available, but rather the problem is one of rankings being influenced by click behavior and the confirmation bias reflected therein. The difference between the content available on the Web and the content displayed in search engines becomes evident in a study by White and Horvitz (2009). They investigated what information users searching for relatively harmless
symptoms of illness are shown in search engines. By comparing the orientation of documents on the Web versus those preferred by search engines, they showed that dramatic interpretations are significantly more likely to be displayed in search engine results than on the Web in general. For example, a user searching for the harmless symptom headache is very likely to encounter a hint in the search results that the cause could be a brain tumor. The effect becomes even more apparent when documents from specialized medical sources on the Web are used for comparison. These examples show that search results can hardly be classified as right or wrong. Instead, they represent a selection from the information or documents available on the Web, chosen based on an algorithmic interpretation. Therefore, search engines cannot be blamed for producing results in this way, even though one would certainly wish for a different interpretation in many respects. However, the considerable influence of the result ranking on both the selection behavior and the evaluation by the users and the associated influence of the search engine providers must be emphasized once again. For example, Epstein and Robertson (2015) demonstrated in large-scale experiments how easily voters could be influenced in their voting decisions by manipulated result rankings. How should the issue of interpretation and the resulting consequences be dealt with then? Since it is impossible to design a search engine that does not interpret the Web’s content (and would thus be objective), the way out seems to lie in a diversity of search engines rather than in regulating the results of individual search engines (Lewandowski, 2016). Furthermore, the issue of search engine bias should be placed in the broader context of biases on the Web (Baeza-Yates, 2018). 
For example, the mere fact that certain content is produced in larger quantities (e.g., due to commercial motivations) also results in a high probability of bias in search engine databases. Kulshrestha et al. (2019) have therefore rightly pointed out that when dealing with search engine bias, “input bias” and “ranking bias” should be kept apart.
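The distinction between “input bias” and “ranking bias” can be made concrete with a small sketch. It is only loosely modeled on the idea in Kulshrestha et al. (2019) and is not their exact metric. It assumes that every document has a hypothetical bias score between -1 and 1, treats the mean score of all retrieved documents as input bias, a rank-weighted mean of the top results as output bias, and the difference between the two as ranking bias. The scores and the 1/rank weighting are assumptions chosen for illustration.

```python
def input_bias(scores):
    """Mean bias score of all documents the ranking could draw on."""
    if not scores:
        raise ValueError("at least one document score is required")
    return sum(scores) / len(scores)


def output_bias(ranked_scores, k):
    """Rank-weighted mean bias of the top-k results (weight 1/rank, an assumption)."""
    top = ranked_scores[:k]
    if not top:
        raise ValueError("at least one ranked result is required")
    weights = [1.0 / rank for rank in range(1, len(top) + 1)]
    return sum(w * s for w, s in zip(weights, top)) / sum(weights)


def ranking_bias(ranked_scores, k):
    """Bias added (or removed) by the ranking itself."""
    return output_bias(ranked_scores, k) - input_bias(ranked_scores)


# Hypothetical bias scores in ranked order:
# positive values lean one way, negative values the other.
ranked = [1.0, 1.0, 0.0, -1.0, -1.0, 0.0]

print("input bias:  ", input_bias(ranked))
print("output bias: ", round(output_bias(ranked, k=3), 3))
print("ranking bias:", round(ranking_bias(ranked, k=3), 3))
```

In this made-up example the retrieved set as a whole is balanced, so any lean in the top three positions is attributable to the ranking rather than to the underlying content.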
15.4
Interest-Driven Presentation of Search Results
In the previous sections, distortions were described that the search engine providers did not or do not have to intentionally “build” into their search engines. This section will now address the question of how a search engine can support or promote the interests of its operators through its result presentation and what effects this has already had in practice. With regard to the interest-driven presentation of search results, a distinction must be made between different types of results:
• Universal search results often come from vertical search engines offered by the search engine itself; here, the mere display of a universal search container with its own results instead of those of others is guided by the search engine provider’s own interests. No relevance evaluation comparing these results to those from competing offerings factors into this.
• Text ads are placed prominently on the search engine result pages by all search engines, as they are by far the most important source of income for search engine providers. Here, the search engine providers’ main interest is that the ads are considered relevant to the query and that users will click on them. In addition to the potential relevance of the ads as search results (see Chap. 10), the distinguishability of ads and organic results (see Sect. 10.3) also plays an important role. If ads are perceived as relevant search results (which they can certainly be), users are more likely to click on them.
• Organic results are something like the gold standard of neutrality. They are the foundation of search engines, and the enormous trust placed in search engine results is based on them. In this respect, it is in the interest of search engine providers to maintain this trust. There is no direct influence on the organic results, but their importance can be reduced through the prominent presentation of universal search results. However, it should be asked to what extent search engine providers might be tempted to intervene in the organic results in order to prefer their own offerings (e.g., results from YouTube in the Google search engine).
While users may fairly assume that all documents are treated equally in the ranking of organic results and are listed according to objective criteria, the problem is that they transfer their trust from this area to all results presented on the search engine result pages. It is only through this that enormous opportunities arise for search engine providers to influence users in the search process (Lewandowski et al., 2014). Ultimately, it is a question of what it means for knowledge acquisition when, in many cases, search engines provide us with information that is at least guided by interests. 
And finally, we have to ask what way out there is when it comes to finding out about different aspects of a topic or making information-based decisions (see Lewandowski, 2014). There is another reason why the importance of search engines for the fate of websites can hardly be overestimated: The proportion of visitors referred by search engines is probably much higher than the statistics (i.e., the measurement of users coming directly from the result pages of the search engines) suggest. This is because many users come across a website through search engines for the first time (and then often repeatedly), which they then go to directly (by entering the URL in the browser bar) in the future. Therefore, these uses can also be attributed partly to search engines. The journalist Danny Sullivan, who specializes in search engines, spoke of a so-called search gap as early as 2001 (Sullivan, 2001; see also Sullivan, 2010). Of course, by now, user behavior has changed to the extent that search engines are often used as navigational tools instead of directly entering the URL; however, the primary finding remains valid. The effects that an interest-driven presentation of universal search results can have are illustrated by the representative click distributions on exemplary search engine result pages shown in Figs. 15.2 and 15.3: Even minor changes in the layout can have an enormous influence on which results are clicked on. The examples each show the distribution of 1000 clicks; the only change between the 2 search engine
Fig. 15.2 Click distribution on a universal search engine result page (example 1 without underlay; Lewandowski & Sünkler, 2013)
result pages is the gray background of the container with the rival offers to Google’s shopping results. The fact remains that a ranking will never produce a neutral result list because the basic assumption of every ranking is that a set of results is ranked according to predefined criteria (Grimmelmann, 2010; Lewandowski, 2017). And no matter how often the myth that the ranking is only determined by machines (or algorithms) is propagated: Algorithms are still based on human valuations, which decide which documents will later appear at the top of the result lists and which will remain invisible to the user. In addition, search engine providers do not have complete control over their rankings due to the strong integration of popularity-inducing signals (see Sect. 5.3.2). A problem arises not least from the fact that for a large proportion of queries, more relevant documents are found than the user is willing or able to view. Some of these documents end up in lower result positions even though users would rate them as relevant. What exactly leads to a document being rated as “relevant” is largely unclear. Although there have been numerous attempts to define the relevance of documents in relation to a query (or an information need), no agreement, even approximate, has yet been reached on such a definition (see Chap. 5). Still, with their way of determining relevance, search engines largely match their users’ judgements. We know from studies, however, that users are easily satisfied, especially with the results of Web search engines. Even more differentiated
Fig. 15.3 Click distribution on a universal search engine result page (example 2 with a gray underlay of the middle container; Lewandowski & Sünkler, 2013)
evaluations (which go beyond the pure evaluation “is relevant” vs. “is not relevant”) do little to remedy this. It can be said that relevant documents are returned, which satisfies the users, but leaves us in the dark as to whether a more relevant document might appear in a later position on the result list (which is no longer looked at). The question remains how much the search engine’s relevance assessment resembles human relevance assessment. This, in turn, leads us to the question of which basic assumptions (determined by humans) are actually at work in the ranking algorithms. However, even if these assumptions are not explicit (i.e., the developers of the algorithms are not aware of the effects described in Sect. 15.3), they form the basis for evaluating what a relevant result is and how it differs from more or less relevant ones.
15.5
What Would “Search Neutrality” Mean?
In the debate about search engines, there is often a call for “search neutrality” analogous to the demand for net neutrality. Net neutrality is about ensuring that all data packets sent over the network are treated equally, i.e., transmitted at the same speed and without giving preference to data from certain providers. In this sense, search neutrality would mean that all documents are treated equally in crawling, indexing, and ranking, regardless of the provider of these documents.
As described in Chap. 3, in the crawling process, we are more likely to deal with biases that arise due to the structure of the Web. Intentional distortions are now being discussed primarily in universal search (here mainly concerning Google’s shopping results), even though the possible backgrounds of the high placement of YouTube results in Google, for example, are also being addressed occasionally. However, the more serious issue here is the question of inclusion in the index: In Sect. 3.1, it was already stated that no search engine would be able to include all documents on the Web and that there are also justified exclusions from the index. In this respect, decisions are already made at this stage that at least limit search neutrality. It is difficult to determine to what extent all documents are treated equally in the indexing process. However, especially with the collections created by the search engines themselves, metadata and structural information can be inferred much more easily and reliably than with documents from the Open Web. In this respect, these collections are likely to be fully represented in the respective search engines, whereas this is not the case for external content. In principle, equal treatment of all documents in the index is possible in the ranking and continues to be the ideal of search engines, provided one only considers the organic results from the Web Index. However, with the movement away from result lists toward complex presentations of results from different collections, search engines are turning away from this ideal. For the reasons mentioned, search neutrality is probably a fundamentally desirable state but hardly achievable in practice. Similar to what was shown in the discussion of search engine bias, the question here is, again, probably more about how to deal with a lack of search neutrality than about how one might achieve it (if this is possible at all). 
Likewise, the question of the search engine providers’ responsibility arises again. A significant problem arises from the fact that while search engines have come a long way from the original model of solely displaying organic results treated equally in ranked lists, users have obviously not taken this step with them. As a result, they continue to assume a “neutral” model of result presentation and trust search engines accordingly, when at least caution would be appropriate.
15.6
Summary
Search engines are usually seen as neutral intermediaries between searchers and the information on the Web. However, on the one hand, some biases are not based on decisions by search engine providers to directly prefer or disfavor documents or sources, for example, due to the structure of the Web and the basic assumptions that go into the ranking. Compiling search results cannot be unbiased in this sense, as any decision to rank search results already creates a bias. On the other hand, distortions arise not only from the design of search engines but also from user behavior and content providers’ influences. Search engine providers consciously influence certain search results because of their own interests as content providers or as vertical search engine providers. Since
these, in turn, depend on visibility in search engines, it is economically understandable that the operators prefer to display them on the search engine result pages regardless of actual relevance ratings. However, search engine providers have considerable influence on what users see and ultimately select. So far, at least, they have not lived up to the responsibility that arises from this.

Further Reading
Alexander Halavais (2018) provides a solid overview of the social issues related to search engines. There are several books on the influence of Google; Vaidhyanathan’s (2011) book is particularly recommended here.
References
Alphabet. (2020). Google code of conduct. https://abc.xyz/investor/other/google-code-of-conduct.html
Baeza-Yates, R. (2018). Bias on the web. Communications of the ACM, 61(6), 54–61. https://doi.org/10.1145/3209581
Ballatore, A. (2015). Google chemtrails: A methodology to analyze topic representation in search engine results. First Monday, 20(7). http://www.firstmonday.org/ojs/index.php/fm/article/view/5597/4652
Bar-Ilan, J. (2006). Web links and search engine ranking: The case of Google and the query ‘Jew’. Journal of the American Society for Information Science and Technology, 57(12), 1581–1589.
Epstein, R., & Robertson, R. E. (2015). The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33), E4512–E4521. https://doi.org/10.1073/pnas.1419828112
Google Search Liaison. (2020). [Hateful memes]. https://twitter.com/searchliaison/status/1309580075432435713
Graham, L., & Metaxas, P. T. (2003). “Of course it’s true; I saw it on the internet!”: Critical thinking in the internet era. Communications of the ACM, 46(5), 71–75. https://doi.org/10.1145/769800.769804
Grimmelmann, J. (2010). Some skepticism about search neutrality. The next Digital Decade: Essays on the Future of the Internet, 31, 435–460.
Halavais, A. (2009). Search engine society. Polity Press.
Halavais, A. (2018). Search engine society (2nd ed.). Polity.
Kulshrestha, J., Eslami, M., Messias, J., Zafar, M. B., Ghosh, S., Gummadi, K. P., & Karahalios, K. (2019). Search bias quantification: Investigating political bias in social media and web search. Information Retrieval Journal, 22(1–2), 188–227. https://doi.org/10.1007/s10791-018-9341-2
Lewandowski, D. (2014). Die Macht der Suchmaschinen und ihr Einfluss auf unsere Entscheidungen. Information – Wissenschaft & Praxis, 65(4–5), 231–238. https://doi.org/10.1515/iwp-2014-0050
Lewandowski, D. (2016). Perspektiven eines open web index. Information – Wissenschaft & Praxis, 67(1), 15–21. https://doi.org/10.1515/iwp-2016-0020
Lewandowski, D. (2017). Is Google responsible for providing fair and unbiased results? In M. Taddeo & L. Floridi (Eds.), The responsibilities of online service providers (pp. 61–77). Springer. https://doi.org/10.1007/978-3-319-47852-4_4
Lewandowski, D., Kerkmann, F., & Sünkler, S. (2014). Wie Nutzer im Suchprozess gelenkt werden: Zwischen technischer Unterstützung und interessengeleiteter Darstellung. In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche – Suchmaschinen im Spannungsfeld zwischen Nutzung und Regulierung. De Gruyter. https://doi.org/10.1515/9783110338218
Lewandowski, D., & Sünkler, S. (2013). Representative online study to evaluate the revised commitments proposed by Google on 21 October 2013 as part of EU competition investigation AT.39740-Google report for Germany, Hamburg.
Lurie, I. (2010). 3 lies the search engines will tell you. Search Engine Land. http://searchengineland.com/3-lies-the-search-engines-will-tell-you-45828
Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York University Press.
Röhle, T. (2010). Der Google-Komplex: Über Macht im Zeitalter des Internets. Transcript. https://doi.org/10.14361/transcript.9783839414781
Sullivan, D. (2001). Avoiding the search gap. Search Engine Watch. https://web.archive.org/web/20130521063800/http://searchenginewatch.com/article/2068087/Avoiding-The-Search-Gap
Sullivan, D. (2010). Stat rant: Does Facebook trump Google for News & Can’t we measure twitter correctly? Search Engine Land. http://searchengineland.com/stat-rant-google-facebook-twitter-38484
Tremel, A. (2010). Suchen, finden – glauben?: Die Rolle der Glaubwürdigkeit von Suchergebnissen bei der Nutzung von Suchmaschinen. Ludwig-Maximilians-Universität München. http://edoc.ub.uni-muenchen.de/12418/
Vaidhyanathan, S. (2011). The Googlization of everything (and why we should worry). University of California Press. https://doi.org/10.1525/9780520948693
Weber, K. (2011). Search engine bias. In D. Lewandowski (Ed.), Handbuch Internet-Suchmaschinen 2: Neue Entwicklungen in der Web-Suche (pp. 265–285). Akademische Verlagsanstalt AKA.
White, R. W., & Horvitz, E. (2009). Cyberchondria. ACM Transactions on Information Systems, 27(4), 23. https://doi.org/10.1145/1629096.1629101
The Future of Search
16
After all that has been said in this book about the current state (and partly also about the past) of search engines, it is, of course, interesting to conclude by looking into the future. This outlook is not meant to imagine what search engines (or, more generally, search processes) might look like in 50 years. After all, who could have imagined today’s search engines 50 years ago? Instead, the aim is to look into the foreseeable future and focus on developments that are already emerging but have not yet come to fruition. This is because we can already recognize much of what will be established as a standard in the next few years.

In his book, Ryen White (2016) describes the characteristics and tasks of so-called next-generation search systems, which he sees as personal assistants that can analyze a variety of signals from the environment and learn from them in order to support the searcher in their current situation (p. 61). As a result, a whole range of activities is gaining in importance (White, 2016, p. 61ff.):
• Natural interaction with the system; depending on the situation, by speech, touch, gestures, looks, or a combination of these.
• Reactive as well as proactive support, i.e., in addition to answering explicit queries, the system also anticipates information needs.
• Search over time, devices, and apps, including detecting a resumption of a previously started search.
• Keeping the individual user’s data in the cloud so that it can be accessed at any time to support the user.
• Support for different phases of tasks and the search process.
• Use of context, e.g., the situation, but also user-specific information such as their search expertise.
• Support in the further processing of information found (e.g., in collecting and storing, but also in summarizing different documents).
• Inclusion of the user’s social context, e.g., via social networking sites.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. D. Lewandowski, Understanding Search Engines, https://doi.org/10.1007/978-3-031-22789-9_16
From this list, we can already see that future search systems will still bear an unmistakable resemblance to today’s search engines but will include a much broader spectrum of offerings, especially in terms of application contexts. Some of the points mentioned have already been described in the previous chapters; the following sections will focus on the general transformation that makes the complex interactions described in the list feasible in the first place.
16.1
Search as a Basic Technology
Today, we still essentially understand search engines as systems into which we enter a query and then receive a result (usually in the form of a set of documents). In this respect, search engines are changing: search processes and the processing of data based on “queries” generated in a variety of ways (other than in the sense of a textual input by a user) will form the basis of the most diverse applications. In this context, Peter Morville speaks of ambient findability (which is also the title of his book; Morville, 2005).

Already today, many services are based on search as a basic technology but are not (or only in part) search engines themselves, for example:
• Google News: While Google News (see Sect. 6.3) is a search engine, it also automatically compiles overview pages without users entering queries. In these overview pages, newly indexed articles are classified by topic, checked for similarity with articles already in the index, and grouped into thematic clusters of articles determined to be important.
• Bing iPad App (US version): In its iPad app, Bing provides, among other things, a daily compilation of topics based on popular queries on that day. The graphical presentation implies an information portal; however, when clicking on one of the topics, a query is run in Bing, and the current results are retrieved.
• “Personal assistants”: Services such as Apple’s Siri, Microsoft’s Cortana, or Amazon’s Alexa let users control their personal information management through voice input, for example, by asking them to add an appointment to their calendar or to write a message to a recipient from their address book. However, these services can also be used to search in external information resources, for example, for simple searches for facts such as the current weather at the user’s current location. Web searches can also be carried out, although at present conventional result lists are still displayed, which are expanded by direct answers rather than the direct answers being the actual result of the search. Indeed, these personal voice assistants are still far from having reached their full potential. Nevertheless, they at least give an indication of how search can be seamlessly integrated into the user’s context.
• Proactive recognition and processing of information needs: One example of this is the Google app. It generates answers or search results (in the broadest sense) automatically at the appropriate time based on contextual information (which, in this case, is turned into the query). So, for example, a user can be reminded at the right time when to leave based on an appointment in their calendar because Google knows the location of the meeting from the calendar, the current location of the user from the phone’s GPS data, and the current traffic situation from recent traffic reports.

We can assume that in the future, we will perceive services based on search much less as search engines. This shows that search has become a core application of the Internet. It can already be said that search engines are being taken for granted to such an extent that they are hardly noticed and have become “invisible” (Haider & Sundin, 2019, p. 1).
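The departure reminder from the list above can be made concrete with a small example. The following is a minimal sketch, assuming the three signals named in the text (calendar entry, current location, and a travel-time estimate derived from traffic reports) are already available as plain values. The class `Context`, the functions `implicit_query` and `reminder_time`, the buffer of five minutes, and the sample places are all hypothetical; nothing here reflects how the Google app is actually implemented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Context:
    """Hypothetical bundle of context signals; field names are illustrative."""
    appointment_start: datetime   # from the user's calendar
    appointment_place: str        # from the calendar entry
    current_place: str            # from the phone's location sensor
    travel_minutes: int           # estimate derived from traffic reports


def implicit_query(ctx: Context) -> str:
    """Turn context into a query string without any user input."""
    return f"route from {ctx.current_place} to {ctx.appointment_place}"


def reminder_time(ctx: Context, buffer_minutes: int = 5) -> datetime:
    """Latest moment to remind the user to leave (travel time plus buffer)."""
    lead = timedelta(minutes=ctx.travel_minutes + buffer_minutes)
    return ctx.appointment_start - lead


ctx = Context(
    appointment_start=datetime(2023, 5, 4, 14, 0),
    appointment_place="Berliner Tor 5, Hamburg",
    current_place="Hamburg Hauptbahnhof",
    travel_minutes=20,
)
print(implicit_query(ctx))
print(reminder_time(ctx))  # 2023-05-04 13:35:00
```

The point of the sketch is only that the "query" is assembled from context rather than typed; a real system would re-evaluate the travel estimate as traffic changes.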
16.2
Changes in Queries and Documents
In Sect. 4.5.1, we have already seen that different variations are possible on the query and document sides. For example, queries can be entered as text or spoken language; documents can include, among other things, texts, HTML pages, and spoken answers. All possible combinations form the processing of a query. Here, the difference between reactive and proactive systems must be considered again: The Google app can perform searches completely without the direct input of queries. Instead, the queries are generated automatically from the user’s current context, similar to the familiar recommendation systems. The basis for such systems is the collection of data about the user and their aggregation. Sources include data from services used by the user (such as their calendar and address book), data collected by sensors (e.g., location data transmitted by the mobile phone), and usage data of the individual user (e.g., their queries used in the past and interaction data such as documents viewed). This clearly shows that the future of search will lie in context-oriented and data-intensive services. However, it must also be asked to what extent the collection of personal usage data by these search services is desired (see Sect. 5.3.2.2).

On the document side, too, there are increasing changes away from a document list output by the search engine (see also Chap. 7). On the textual level, in addition to document representations (i.e., the snippets) and the texts themselves, summaries of several texts or a specific answer generated from the result set can be displayed as a result. It almost goes without saying that images, videos, and audio can also be presented as results. However, search engines today still primarily stick to the original format of the files, i.e., a result is output in the format in which the search engine found it.
Through automatic transcriptions of audio and video files, however, spoken texts, for example, can be converted into written text and displayed (and processed) accordingly. Similarly, spoken answers, for example, can be generated from the result sets—the output format will, in the future, depend even more on the context in which a searcher is using the search engine. Automatic summarization techniques can also generate complex answers to informational queries. Whereas today, if one wants to find out, for example, how Goethe’s influence on Romanticism is to be assessed, one has to sift through a
multitude of documents by hand, in the future, search engines could extract the different opinions on this from the documents and output them as an independent result. This would eliminate the need for referring to external documents; the searcher would get the information directly on the search engine result page and only need to click on it if they wanted to look at a source in more detail. Technologies such as Google’s Knowledge Graph, which already displays facts instead of documents on the search engine result page, may be just the beginning of a move away from the query-document paradigm in Web search. However, this shift has significant economic consequences, which are discussed in Sect. 16.4.
16.3
Better Understanding of Documents and Queries
For many years, procedures intended to improve the understanding of documents and the exchange of information between machines have been discussed under the label Semantic Web (or Linked Open Data). In this context, single statements from documents (so-called triples) replace the whole document as the unit to be considered. Initial approaches by search engines to understand documents using semantics can be seen in initiatives such as schema.org. With the schemas provided in this joint initiative, information on the Web can be described in detail so that search engines can evaluate it. Once a search engine has indexed such information, conclusions can be drawn from the combination of known triples (so-called reasoning). This approach is particularly promising when it comes to combining information from multiple sources. Of course, there is not only the side of the documents (or, more generally, the information available on the Web) but also the side of the queries. Different chapters of this book have already dealt with the interpretation and enrichment of queries, which should ultimately lead to a better understanding of user intentions. In addition to enrichment based on past or similar behavior, information transmitted by sensors is becoming increasingly important (see White, 2016, Chap. 2): In addition to location data transmitted by mobile devices, sensors in a wide variety of devices transmit data that can also be used for context enrichment of queries. With more contextual information, the queries can be interpreted much more precisely (or generated automatically in the first place). The contextual information transmitted depends, in each case, on the devices used by the user.
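The idea of triples and reasoning described above can be illustrated in a few lines. The following is a minimal sketch, assuming statements are stored as plain (subject, predicate, object) tuples and that the only inference rule is transitivity of a single predicate. The predicate name `locatedIn`, the sample facts, and the function `infer_transitive` are hypothetical and are not taken from schema.org or from any search engine.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("Hamburg", "locatedIn", "Germany"),
    ("Germany", "locatedIn", "Europe"),
    ("HAW Hamburg", "locatedIn", "Hamburg"),
}


def infer_transitive(facts, predicate):
    """Add (a, p, c) whenever (a, p, b) and (b, p, c) are already known."""
    known = set(facts)
    changed = True
    while changed:  # repeat until no new triple can be derived
        changed = False
        for (a, p1, b) in list(known):
            for (b2, p2, c) in list(known):
                if p1 == p2 == predicate and b == b2:
                    new = (a, predicate, c)
                    if new not in known:
                        known.add(new)
                        changed = True
    return known


# Triples that were not stated explicitly but follow from the known ones.
inferred = infer_transitive(triples, "locatedIn") - triples
for triple in sorted(inferred):
    print(triple)
```

Here the three stated facts could come from three different sources; the derived statement that HAW Hamburg is located in Europe appears in none of them, which is the kind of combination across sources the text refers to.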
16.4
The Economic Future of Search Engines
In Chaps. 8 and 10, context-based advertising was presented as the dominant business model in the search engine market. On the side of content providers, advertising is also the dominant form of refinancing when it comes to content sites (as opposed to, e.g., retail sites). The business relationship between search engines and content providers is based on the fact that the content providers make their
content available to the search engines for free and, in return, receive traffic from the search engines, which they can then monetize through advertising. However, this model only works as long as search engines redirect users to documents. And this is undoubtedly one reason why search engines still primarily adhere to the query-document paradigm, even though much more would be technically possible, as described above: If they were to generate answers directly from the external documents and no longer forward users to the documents, they would deprive the content providers of their opportunities to earn money, and it would no longer be attractive for them to make their content available to search engines. They would then become nothing more than unpaid suppliers.

Developments such as the Knowledge Graph and direct responses on search engine result pages should also be considered in this context (cf. Chap. 7). These have so far fallen far short of what is technically feasible. The data used for the Knowledge Graph results come primarily from Wikipedia (whose data may be reused practically without restriction) as well as from data generated by Google itself (e.g., from the analysis of search query volume). Far more opportunities would arise from using the data of the Web at large as a basis. However, this would mean a departure from Google’s previous business model and its interaction with content providers, so no change is to be expected in the next few years. Thus, in many cases, the orientation toward result lists and the referral to documents is no longer a necessary consequence of technical restrictions but primarily based on economic interests, which in this case are opposed to technical innovation. If the model were to be dropped and search engines were actually to compile information largely from external documents themselves, this would only be attractive to content providers if they were also paid for the use of their content.
One option here would be a markup language for charging for information units at the smallest level, comparable to the charging of royalties on streaming portals like Spotify. The search engine providers themselves could continue to earn money through advertising on the search engine result pages; the content providers would be paid for their content and not, as is the case today, only benefit indirectly through traffic or the possibility of generating revenue through advertising.
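The charging model suggested above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a simple pro-rata split of a fixed revenue pool according to how many information units from each provider were used in generated answers. The text does not specify any accounting scheme, so the function `pro_rata_payouts`, the provider names, and the pool size are hypothetical, and the split is not a description of how any streaming portal actually pays royalties.

```python
from collections import Counter


def pro_rata_payouts(usage_log, revenue_pool):
    """Split a revenue pool by each provider's share of used information units."""
    counts = Counter(usage_log)          # units used per provider
    total = sum(counts.values())
    if total == 0:
        return {}                        # nothing was used, nothing to pay out
    return {provider: revenue_pool * n / total for provider, n in counts.items()}


# One log entry per information unit that was used in a generated answer.
usage_log = [
    "news-site.example",
    "blog.example",
    "news-site.example",
    "wiki.example",
]
payouts = pro_rata_payouts(usage_log, revenue_pool=100.0)
for provider, amount in sorted(payouts.items()):
    print(provider, amount)
```

Under this assumption, a provider whose content is used twice as often receives twice the payout, independent of whether users ever visit the provider's own site.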
16.5
The Societal Future of Search Engines
The fact that search engines are used as broadly as they are shows their essential role in knowledge acquisition. As described in Chap. 15, users typically regard search engines as neutral intermediaries between users and information or documents. However, they already cannot actually fulfil this role for theoretical reasons (every search engine delivers “distorted” results by definition); in addition, search engine providers, on the one hand, have an interest in aligning organic search results and advertisements at least to a certain degree. The current form of presentation will not be sustainable; it at least accepts that a significant proportion of users will be misled. On the other hand, search engines (especially Google) have become providers of content and services which can be found via the general search function. Thus, on a
pragmatic level, problems arise from the fact that search engines can prefer their own offerings (especially in universal search, but also in organic results and ad results). The theoretical argument against this is directed against the idea that there can be an objective ranking of results. For most (informational) queries, there are far more relevant results than a user can view or is willing to consider. However, this makes the result ranking arbitrary to a certain extent: for the user, it does not matter which of the many relevant results are shown. On the part of the search engine, it is sufficient that relevant results are displayed—it does not matter exactly which ones they are. The image of search engines as neutral information brokers will change. Search engines must (and will) be perceived as information brokers with their own interests. A clear sign of this was the European Commission’s decision in 2017 to oblige Google to give fair consideration to its competitors in shopping search. Further proceedings are pending before the EU; these are also likely to impact the perception and regulation of search engines significantly. The search engine providers’ self-interest will have consequences for how we deal with search engines as providers of information infrastructures on the Internet: The question will arise whether one wants to leave the provision of infrastructures to a few or a single commercial company or whether it is not rather sensible to provide such infrastructures as public services. Discussions in this context include the disclosure of ranking algorithms, a breaking-up of Google into a part that operates the search engines and other parts, the establishment of state or public search engines (Hege & Flecken, 2014), and building an Open Web Index that is open to all providers who want to create search engines based on Web data (Lewandowski, 2014, 2016, 2019; also see Open Search Foundation, 2019, for the idea for an open search infrastructure). 
A further area with enormous societal implications is the search engines’ collection of user data. Google, in particular, but also other services follow an approach of first collecting as much data as possible about their individual users to subsequently improve their services and their advertising displays. As in other areas, this involves confirming terms of use that are hardly transparent, which is why we can assume that the majority of users are not aware that or to what extent their data is collected and permanently stored (see Sect. 5.3.2). For example, when using the Google app for the first time, the user is asked whether they want to allow the service to access the device’s location data; however, very few users are aware that if they give their consent, all location data will be logged in minute detail, even if the service is not currently being actively used. By now, however, there is an increased interest in search engines that do not log or store user data, at least in niche areas. Search engines such as MetaGer, Startpage, and Ecosia distinguish themselves from Google in this way. However, it should be borne in mind that these search portals are not search engines with their own database (see Sect. 8.5). The question of extensive data collection is not only a problem of search engines but a general data protection problem. Although personal data is generally considered worthy of protection and the laws on data protection are based on the ideal of data minimization (i.e., data should only be collected to the extent necessary for the
functioning of a service), companies can make almost any arrangements they want, as long as they only have their users confirm them in the terms of use. Looking at the societal issues associated with search engines, it is astonishing how little secured knowledge there is about search engines or searching the Web, especially when one considers the mass use of search engines and their critical role in knowledge acquisition in private life, the economy, and society. If one looks at research on search engines, it is striking that most of it is concerned with technical advancement and much less with aspects of use or societal issues arising from mass search engine use. The approach pursued in this book of a comprehensive consideration of the topic from different perspectives leads to the demand for search engine research that, on the one hand, pursues further technical development but, on the other hand, does not see this in isolation from societal consequences. And finally, it should also be seen as evaluative research (also: algorithm auditing), which, based on solid empirical data, examines the extent to which search engines fulfil their role as intermediaries of information in all areas of life and how they can fulfil this role (even) better in the future. Furthermore, there is a need to educate users, who need to learn more about the functionality, financing, and search options in order to be able to use search engines in a competent manner (see Machill, 2009; Machill, Beiler & Gerstner, 2012). However, information literacy cannot be the sole solution, as it places the responsibility for accessing true and relevant information solely in the hands of the users. Search engine providers are also called upon here.
16.6
Summary
Search engines have changed continuously since their inception and will continue to do so. There will be profound changes that will alter our fundamental understanding of search. First of all, search will become an enabling technology used in a wide variety of applications. So, we will no longer necessarily go to a search engine to find information, but suitable information will be offered automatically through search technology. This will be accompanied by a change in queries: On the one hand, we will enter queries in a variety of forms (e.g., text, spoken language, images). On the other hand, more and more queries will be implicitly generated from contextual information, eliminating the need for user input. Moreover, search is no longer tied to specific devices. Instead, search can take place in all contexts in which it is needed. The forms of input are then also geared to these contexts. Search engines will better “understand” both queries and documents, enabling them to output not only documents but targeted answers that can also be based on analyzing a large number of documents. The fact that search engines will increasingly generate answers instead of links to documents will also change the economic relationship between search engines and content providers. If users are no longer directed to documents, content providers will lose the opportunity to make money from their free content through advertising. As a result, new models are needed to remunerate content providers.
Last but not least, search engines are of enormous importance for knowledge acquisition in society. However, their development is still primarily disconnected from societal discussions; in the future, these will take on a more important role. Therefore, there is also an urgent need to improve knowledge about search engines through research and foster information literacy on the part of users.
References
Haider, J., & Sundin, O. (2019). Invisible search and online search engines. Routledge.
Hege, H., & Flecken, E. (2014). Debattenbeitrag: Gibt es ein öffentliches Interesse an einer alternativen Suchmaschine? In B. Stark, D. Dörr, & S. Aufenanger (Eds.), Die Googleisierung der Informationssuche (pp. 224–244). De Gruyter. https://doi.org/10.1515/9783110338218
Lewandowski, D. (2014). Why we need an independent index of the web. In R. König & M. Rasch (Eds.), Society of the Query Reader: Reflections on web search (pp. 49–58). Institute of Network Culture.
Lewandowski, D. (2016). Perspektiven eines Open Web Index. Information – Wissenschaft & Praxis, 67(1), 15–21. https://doi.org/10.1515/iwp-2016-0020
Lewandowski, D. (2019). The web is missing an essential part of infrastructure: An open web index. Communications of the ACM, 62(4), 24–27. https://doi.org/10.1145/3312479
Machill, M. (2009). 12 goldene Suchmaschinen-Regeln. https://www.medienanstalt-nrw.de/fileadmin/lfm-nrw/Medienkompetenz/ratgeber-suchmaschinen-farbe.pdf
Machill, M., Beiler, M., & Gerstner, J. R. (2012). Der Info-Kompass: Orientierung für den kompetenten Umgang mit Informationen. https://www.klicksafe.de/fileadmin/media/documents/pdf/Broschren_Ratgeber/Info_Kompass.pdf
Morville, P. (2005). Ambient findability. O’Reilly.
Open Search Foundation. (2019). A free and open internet search infrastructure: For our society and a prospering digital economy in Europe about. https://opensearchfoundation.org/wp-content/uploads/2019/12/A-free-and-open-Internet-Search-Infrastructure-web.pdf
White, R. (2016). Interactions with search systems. Cambridge University Press. https://doi.org/10.1017/CBO9781139525305
Glossary
Above the fold Also “visible area”: the part of the search engine result page that can be seen immediately without needing to scroll. The size of this part depends on the screen or window size. The term is derived from a folded newspaper. Address bar Input line in the browser into which a URL is entered to call it up directly. AdSense Google’s advertising program that allows website operators to place ads on their websites. The ads are generated to match the text of the document. Advanced search Search forms with several input fields for specifying queries. Advertising In the context of search engines, especially text advertisements that match a query and are shown on the search engine result pages. Affiliate marketing Marketing via sales partners who direct users to the provider’s site and receive a commission in return. Aggregators Providers who bring together information from different sources. Algorithm A distinct sequence of instructions that are processed one after the other. In the context of search engines, this usually refers to the algorithm or algorithms used to rank the search results. Anchor text Text within a document that links to another document. Search engines use anchor texts to provide a supplementary description of the referenced document. App Application software; in particular, applications available for mobile devices are referred to as apps. Application programming interface (API) An interface that allows for automated access to software; in the case of search engines, the automatic retrieval of search results. Autocomplete Suggestions for amending a query that a search engine makes while a user types a query. Autocorrect Automatic correction of incorrect input (as assumed by the search engine), especially typos. Below the fold Also called “invisible area”: The area of a search engine result page that can only be reached by scrolling. 
Bias In the context of search engines, this is the deviation of the actual search results from an assumed ideal result. Bias is caused by technical problems in indexing, by ranking factors, by targeted preference for certain offerings by search engine providers, and by external manipulation of search results (see search engine optimization).
Black hat In the context of search engine optimization, methods that are banned by the search engines in their terms of use.
Blacklist List of documents or websites that are excluded from a search engine’s index.
Bookmark A link that is stored in the browser for quick access. Bookmarks allow pages to be found and requested quickly on the Web.
Boolean operators Used to express logical relations of elements in queries. The Boolean operators are AND, OR, and NOT; other operators based on the Boolean operators have been developed.
Boolean query Query formulated with the help of Boolean operators.
Boost Giving certain documents a boost means giving them an advantage in the result lists.
Bounce rate The proportion of users who quickly leave a document after having requested it (and, in the case of a user coming from a search engine, return to the search result page).
Bow-tie model Model of linking in the Web that distinguishes between a strongly linked core area and various outer areas. This structure causes problems in crawling Web documents completely and evenly.
Browser Software used to display documents on the Web.
Cache The cache of a search engine in which the last version of the document visited by the crawler is stored.
Click fraud Procedure in which clicks on ads are generated manually or automatically to damage the advertiser who pays per click.
Cloaking Displaying different content on the same URL depending on the requesting client. For example, different content can be displayed for human users and search engines to deceive search engines about the actual content of documents.
Cluster Automatically compiled set of objects recognized as similar.
Collection Data set built up by a search engine.
Command A command used to qualify a query.
Concrete information need Information need that is directed at a fact. The information need is satisfied when the concrete fact is displayed.
Content acquisition The inclusion of content in the database of a search engine. This is mainly done by crawling. However, content can also enter a search engine’s database in other ways, for example, by exploiting feeds or accessing structured data from databases.
Content management system Software for creating and managing content on websites. The system automatically inserts the content entered into the structure and layout of the website.
Country bias Bias in content coverage from different countries, i.e., content from certain countries is covered more completely in a search engine’s index than content from other countries.
Country interface Interface of a search engine adapted to a specific country. Searching via different country interfaces leads to a different ranking of the results. See also query interpretation.
Crawl frequency Frequency of repetition of the crawling of a document.
Crawler System for finding content on the Web by following links in documents.
Crawling The process of finding documents on the Web by following links on already known pages.
Creation date The date on which a document was created.
Dark Web An area of the Internet accessible only through specialized tools, known primarily for selling illegal products and distributing illegal and copyrighted content.
Data center Computing center of a search engine. Web search engines are distributed across different data centers to ensure efficient processing of crawling and queries.
Data collection In usage statistics, data on user behavior can be collected either through ratings explicitly made by the users or through the assessment of clicks (implicitly).
Dedicated searcher Model of search behavior used in the classic TREC studies. It assumes a user who, on the one hand, wants to receive all results relevant to their topic and, on the other hand, is willing to look through large result sets.
Deep Web All content accessible via the Web that is not (or cannot be) indexed by the general search engines, especially the content of databases that can be accessed via the Web.
Document type Type of document, for example, text, image, or video.
Document Unit of a carrier of documentary data. In the context of search engines, text documents, but also images and videos, among others, play a role.
Domain host A provider that operates websites on behalf of a customer in its own data center.
Duplicate content Duplicate copy; in the context of search, the same or very similar content in different documents.
Dwell time Time users spend viewing a document. It can be seen as an indicator of the quality of documents that are of a certain length.
Event A query that reaches a high search volume at regular times (e.g., Christmas, Oktoberfest).
Evergreen A query that is of permanent interest to users and whose search volume is therefore not subject to major fluctuations.
Explicit data collection See data collection.
Explicit query interpretation See query interpretation.
Eye tracking Procedure in which subjects’ pupil movements are measured utilizing infrared cameras. From this, typical gaze patterns can be measured.
F pattern Typical gaze pattern on search engine result pages consisting of a result list above which a block of text ads is displayed. The gaze starts at the ads and then jumps to the first organic results.
Factor See ranking factor.
Factual information In contrast to text documents, factual information can be output as direct responses, so that it is not necessary to redirect users to a document from the search engine result page.
Feeds Feeds allow a search engine to add data to its index in a structured form.
File Transfer Protocol Protocol for the transfer of files via IP networks.
Flash File format for multimedia and interactive content. Search engines cannot fully capture Flash content.
Focused crawling Crawling that is deliberately restricted to a specific area of the Web.
FTP See File Transfer Protocol.
Global Positioning System See GPS.
Golden triangle A typical eye-tracking pattern in which the first documents in a list are viewed most often and most intensively.
Google Ads Google’s ad program where website operators create text ads to be displayed on the search engine result pages. They are booked based on keywords; the price is determined in a bidding process between the advertisers.
Google Analytics Software from Google that allows website operators to analyze visitor flows.
Google Search Console Software provided by Google that supports search engine optimization (formerly Google Webmaster Tools).
GPS Global Positioning System: mobile devices can transmit their exact location, which allows search results to be precisely adapted to this location.
Heat map A form of presentation for results from eye-tracking studies.
HITS A link-topological ranking method developed by Jon Kleinberg that divides documents into hubs and authorities, among other things.
Horizontal search engine A general search engine that aims to cover the breadth of the Web’s content.
Hosting Technical provision of websites.
HTML (Hypertext Markup Language) A markup language for documents on the Web.
HTML page Synonym for Web page; also page, document.
HTML source code See source code.
Hybrid search engine A search engine that captures both parts of the Surface Web (see focused crawling) and content from Deep Web databases and makes them searchable together.
Implicit data collection See data collection.
Implicit query interpretation See query interpretation.
Impulse A query that reaches a high search volume for a short period following an external event, which then slowly subsides.
Index A search engine’s database that has already been prepared for searching.
Indexer Component of the search engine that is responsible for preparing documents.
Information literacy Ability to deal competently with information and information systems.
Information need (1) Type and amount of information needed to address a problem. (2) Expression of what a searcher wishes to know.
Information object Document, independent of a document type (such as text or image). See document.
Information retrieval Computer-assisted search or retrieval of information on a specific question.
Informetric distribution Highly uneven distribution in which a few objects account for a large part of the volume, while many objects have only a small volume. Examples are the distribution of words in documents and the distribution of links on the Web.
In-link A link pointing to a specific document is called an in-link of that document.
Inverse document frequency Calculation of the importance of words within document collections, allowing rarer words to be weighted more strongly.
Inverted index An inverted index inverts documents from texts containing words to lists of words referring to documents.
Invisible area Part of the search engine result page that only becomes visible by scrolling.
Invisible Web See Deep Web.
IP address Unique identifier of a computer that is needed to send data to it via the Internet.
Iterative method A procedure that calculates approximate values in several steps without already having known values at the beginning.
Juror A person who assesses the relevance of documents as part of a retrieval test.
Keyword stuffing Excessive inclusion of potential keywords within a document with the aim of deceiving search engines.
Keyword A single word that is part of a query.
Keywords in context (KWIC) Form of displaying excerpts from documents (e.g., in result descriptions), in which an excerpt containing the keywords entered is displayed.
Knowledge graph Compilation of facts in a box on a search engine result page.
KWIC See keywords in context.
Lab study Study (of search behavior) carried out in a laboratory with test persons.
Last update date Date of the last content update of a document.
Link buying Solicitation of external links in exchange for money to gain link popularity as part of search engine optimization.
Local store The copy of the content of the Web created by a search engine in the process of crawling. Ideally, this copy would form a complete and up-to-date representation of the Web. This is sometimes referred to as an index.
Locality Adaptation of search results (or their ranking) to the user’s current location.
Log file A file generated by a Web server or search engine in which the users’ interactions with the system are recorded.
Lookup searches Queries that serve to look up facts or satisfy simple problem-oriented information needs.
Mayfly A query that reaches a high search volume in a short period due to a sudden interest among users. Mayflies are mainly triggered by external impulses (e.g., reports in the media).
Metadata Data about data. Metadata is particularly relevant for search engines when indexing non-textual content and generating snippets.
Metasearch engine Search engine without its own index that combines the results of several external search engines in its ranking.
Metatag See metadata.
Microblogging Form of blogging in which the length of messages is (artificially) restricted.
Mirror site Copy of a website created for faster local access.
Need for freshness Decision as to whether current documents are of particular importance for a query.
N-gram Result of breaking down a text into units of N characters or words.
Non-intrusive method Research method for observing behavior without the subjects being aware of the current observation, e.g., in log file research.
Non-linked pages Pages that may be linked to each other but are not linked from documents already known to a search engine.
Online advertising Advertising placed as part of online offerings. Online advertising is divided into classic online advertising (e.g., banners), paid search advertising, and affiliate marketing.
Online host Provider of multiple databases under a common search interface.
Opaque Web The part of the Deep Web that could be covered by search engines but is not.
Operator A special command for combining keywords to control the number of results to be returned. See also Boolean operators.
Organic result Search result automatically generated from the Web Index. The ranking of organic results is done for all documents under the same conditions, i.e., every document that is included in the search engine’s Web Index has essentially the same chance of being displayed as a result for a query.
Out-link A link that points from a particular document to another document.
PageRank A specific procedure for assessing documents based on their linking on the Web. In this procedure, both the number of incoming links and their individual value are taken into account.
Paid inclusion A business model in which content providers pay to have their content included in a search engine’s index.
Paid search advertising Part of online advertising in which text advertisements for individual keywords are sold to advertisers.
Parsing module A module of a search engine for preprocessing the documents found by the crawler.
Partner index The index made available to partners by a search engine that has its own index.
Partner index model Model in which a search engine with its own index delivers search results and text ads to a partner that operates its own user interface but displays the other partner’s results. The revenue generated from the ad clicks is shared according to a pre-determined key.
Periodic Table of SEO Success Factors A compilation of the most important search engine ranking factors from the perspective of search engine optimization, compiled by the website Search Engine Land.
Power law See informetric distribution.
Precision The proportion of relevant results out of the total number of results displayed by a search engine.
Private Web Pages deliberately excluded by their authors from indexing by search engines, for example, through password prompts.
Problem-oriented information need Information need whose thematic boundaries cannot be precisely determined.
Proprietary Web Content on the Web that can only be used after agreeing to certain conditions of use.
Prosumer Combination of consumer and producer, especially used in the context of social media.
Quality-determining factors Ranking factors used to measure the quality of documents, e.g., by evaluating links and the number of user accesses.
Quasi-synonym Words that are treated as synonyms, although, strictly speaking, they are not. Search engines calculate statistical word similarities that can lead to words being treated as quasi-synonyms.
Query intent A distinction between queries according to the goal or the information need behind the query. The most important classification for Web search distinguishes between informational, navigational, and transactional queries.
Query interpretation Adding contextual information to a query, for example, the user’s location, to be able to answer queries in a more targeted way. There is a distinction between explicit and implicit query interpretation, depending on whether the added information is communicated to the user.
Query The input of a user which is processed by a search engine in one step. A query can consist of one or more words and also of operators and commands.
Ranking factor A criterion used to rank search results.
Ranking signal See signal.
Recall The proportion of relevant documents found by a system out of the total number of relevant documents in the database.
Relevance A criterion for determining whether a document is desired as a search result by a currently searching user in their context.
Result list List of documents found for a query. A search engine result page can contain several result lists as well as other forms of displaying results.
Retrieval effectiveness The ability of a search engine to return relevant documents in response to a query.
Rich snippet Extended snippet on a search engine result page.
Right to be forgotten A term used to describe a requirement under EU law that certain information about a person must be removed from the results of a search engine (or other service) at that person’s request.
Robots Exclusion Standard Commands accepted by general search engines to control search engine crawlers and the indexing of websites by search engines.
Robots.txt File in which instructions can be given to search engine crawlers to index or exclude content of a website.
RSS Format for the transmission of structured data.
Screen reader Software that reads aloud the content displayed on the screen.
Search Console See Google Search Console.
Search engine advertising (SEA) See advertising.
Search engine result page The HTML page generated by a search engine on which the results for a query are presented. It consists of all the results displayed and the elements surrounding them, such as the search box and navigation elements.
Search engine marketing Marketing measures carried out using search engines.
Search engine optimization (SEO) Measures intended to help certain documents or websites achieve higher visibility in search engines.
Search engine relationship chart Graphical representation of how search results and text ads are delivered between different search engines.
Search engine result page (SERP) Compilation of search results based on a query. Several search engine result pages can be displayed for a search query, and the user can navigate between them.
Search history The search query data collected by a search engine from a single, identifiable user.
Search suggestion Suggestions for extending or improving a query that are generated by a search engine as the query is being entered.
Search vocabulary Collection of commands and operators a search engine supports.
Seed set Collection of websites or pages used as the starting point for crawling.
SERP See search engine result page.
Server See Web server.
Session A sequence of queries and document views performed by a specific user within a specific time period on a specific topic.
Share (on social networking sites) Sharing content to make it visible to people connected to the sharing person on a social networking site.
Signal A feature used for ranking. A signal is the smallest unit, for example, determining whether a website uses HTTPS. This is an indicator of the security of this website, which in turn is a ranking factor, into which several signals can be incorporated.
Silo An information collection that is not or not completely accessible to external parties (e.g., search engines).
Simple search Standard search form that consists of only one search field.
Sitelinks Links within the snippets on search engine result pages that lead directly to specific sections of a website.
Sitemap Compilation of links to all pages of a website in one document; see also XML sitemap.
Snippet Short description of a document on the search engine result page.
Source code Text of an HTML document including the HTML commands (tags).
Spam Documents not wanted by the search engines, especially those created solely to deceive search engines and users about their actual intent.
Spider See crawler.
Spider trap A part of a website in which a crawler can become trapped, which can be created intentionally or unintentionally.
Sponsored link See text ad.
Standard search See simple search.
Stop word A word that is excluded from indexing.
Surface Web The part of the Web that general search engines can cover.
Surrounding text Text that is placed near the respective object in the case of multimedia content (images, videos) on HTML pages. Surrounding text can be used for indexing these objects.
Synonym Words that have the same or very similar meanings.
Tab blindness Term for the fact that users overlook the links to vertical collections offered by a search engine, in which specialized searches could be carried out.
Term frequency Frequency of occurrence of terms (words or word forms) in documents.
Text ad Advertisement on the result page of a search engine that is displayed as an answer to a search query and resembles organic results in its presentation.
Text Retrieval Conference See TREC.
Text statistics Procedures for analyzing texts to create a ranking based on word frequencies, word positions, and other characteristics.
Topic detection and tracking Clustering of thematically similar documents and continuous addition of new documents to appropriate clusters.
Traffic Visitors who access a document or website.
Transaction-log analysis Analysis of log files; used in search engines primarily to obtain information about the popularity of keywords and the interaction behavior of users.
TREC Text Retrieval Conference: Evaluation initiative for comparing information retrieval systems. Within the framework of TREC, for example, test collections are developed and made available.
Truly Invisible Web Pages or websites that search engines cannot index due to technical reasons.
Universal search Enrichment of search results from the Web Index with results from vertical collections and their joint display on a search result page.
URL Uniform Resource Locator: Unique “address” of a document on the Web.
User agent Software used to access services on the Internet, such as a Web browser. The user agent usually transmits its name when it makes a request to a service. Search engine crawlers are also user agents and identify themselves accordingly. They can, therefore, also be recognized by website operators.
User experience All aspects of the user’s experience with a search engine or other service.
Vertical search engine A search engine that restricts itself thematically or based on formal document characteristics (e.g., file type).
Visible area Area of the search engine result page that is visible without scrolling.
Web directory Human-created compilation of websites. The websites are sorted into a classification scheme and described with short texts.
Web page Also page: A single document that can be accessed on the Web via a unique URL.
Web server Computer on which a website is stored.
Web See World Wide Web.
Webmaster Tools See Google Search Console.
Website An offering on the Web that contains several documents (pages).
White hat In search engine optimization, measures that conform to the rules set by search engine operators.
Whitelist Compilation of human-selected websites that can be used, for example, for focused crawling or as a seed set for general crawling.
WHOIS service Directory of website owner data.
World Wide Web Service of the Internet consisting of interconnected hypertext documents. Each document has a unique address (Uniform Resource Locator; URL) with which it can be addressed from other documents. The connections between documents are called links.
WWW See World Wide Web.
XML sitemap A file that allows website operators to provide search engine crawlers with detailed recommendations for crawling large websites in particular.
XML Extensible Markup Language; text markup language that allows the description of hierarchically structured data.
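The glossary entries for the inverted index and for the weighting of rarer words by document frequency can be illustrated with a small sketch. The toy collection, the function names, and the formula log(N / document frequency) are assumptions made for illustration; the formula is a common textbook variant, and the glossary does not state which variant any particular search engine uses.

```python
import math
from collections import defaultdict

# Toy collection; document IDs and texts are invented for illustration.
docs = {
    1: "search engines crawl the web",
    2: "the web is large",
    3: "search engines rank results",
}


def build_inverted_index(collection):
    """Turn 'document -> words' into 'word -> sorted list of document IDs'."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def idf(word, index, n_docs):
    """Assumed textbook weighting: log(N / number of documents containing the word)."""
    return math.log(n_docs / len(index[word]))


index = build_inverted_index(docs)
print(index["search"])              # documents containing "search"
print(idf("crawl", index, len(docs)))  # rare word, higher weight
print(idf("web", index, len(docs)))    # more frequent word, lower weight
```

A query word can then be answered by a lookup in `index` rather than by scanning every document, and words that occur in fewer documents receive a larger weight.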
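The PageRank entry says that both the number of incoming links and their individual value count, and the entry for iterative methods describes approximating values in several steps without known starting values. The following is a minimal sketch of that idea, not the procedure of any actual search engine. The link graph, the damping value of 0.85, the fixed number of iterations, and the handling of pages without out-links are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to; every link
    target must also appear as a key (an assumption of this sketch).
    """
    pages = list(links)
    n = len(pages)
    # Start from equal values, since no ranks are known at the beginning.
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, out_links in links.items():
            if out_links:
                # A page passes its value on in equal shares to its out-links.
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Assumed convention: a page without out-links spreads its value evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: C has three in-links, D has none.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph, page A ends up ranked well above B although each has a single in-link, because A's in-link comes from the highly ranked page C.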
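The definitions of precision and recall differ only in the denominator, which a short worked example makes concrete. The document identifiers and relevance judgments below are invented; in a retrieval test they would come from jurors.

```python
def precision(retrieved, relevant):
    """Share of the displayed results that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Share of all relevant documents in the database that were found."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical result list and relevance judgments.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

print(precision(retrieved, relevant))  # 2 of 4 displayed results are relevant
print(recall(retrieved, relevant))     # 2 of 5 relevant documents were found
```

Recall requires knowing every relevant document in the database, which is rarely possible for Web search; precision only requires judging the results actually displayed.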
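The entries for robots.txt and user agents describe how website operators give instructions to crawlers that identify themselves by name. As a rough illustration, Python's standard library module `urllib.robotparser` can evaluate such a file. The rules, the crawler name `ExampleBot`, and the URLs below are invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: all crawlers are asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is an invented user agent name.
allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")

print(allowed)
print(blocked)
```

The file only expresses a request: whether a crawler honors it depends on the crawler's operator.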
Index
A Academic invisible web, 255 Accessibility, 175, 251–254 Accuracy, 6, 224, 232 Additional results, 205–206 Ad labelling, 197 AdSense, 196 Advanced search, 223, 224 Advertisement, 268 Advertising, 147, 148 Alexa, 276 AltaVista, 15, 16, 91, 169 Alternative search engine, 203 Amazon, 3, 67, 98 Anchor text, 49, 51 Application programming interfaces (APIs), 166 Ask.com, 169 Autocomplete, 68–69 B Baidu, 168, 169 BERT, 51 Bias, 263–267 Bing, 29, 84, 165–171, 196, 204, 221, 224–227, 235, 236, 254, 258, 262, 276 Bing Maps, 262 Blog, 30 Boolean operators, 218, 221, 222, 227 Bounce rate, 181, 183 C Cache, 159 CARS checklist, 231, 232 Click-through data, 98–100 Collection, 125–133 Command, 76–77, 225, 226
Complex search, 228 Content acquisition, 31–33 Cortana, 276 Crawler, 34, 35 Crawling, 26, 33–42, 127 Curlie, 254 D Data voids, 265 Deep Web, 122, 129, 217, 249–251, 255 areas, 255–256 content, 256–258 size, 254–255 Direct answer, 150, 152 Disconnected page, 250 DMOZ directory, 21 Document, 13, 157 Document representation, 48–51 Duplicate content, 35, 239 Dynamically generated content, 250 E Ecosia, 208, 280 eSearch plus, 254 European Commission, 134, 149, 211, 280 Evaluation, 231, 235–242 Exalead, 169 F Facebook, 94, 98, 176, 212, 256–258 Factual information, 62, 139 Filter bubble, 112 Firefox, 11, 168 F pattern, 145 Freshness, 85, 105–106, 127
G General search engine, 120, 208 Gmail, 103 Google, 2, 3, 5–6, 15, 16, 29, 32, 35, 38, 41, 51, 53, 64, 67, 69, 74, 75, 84, 85, 87, 90, 93, 96, 97, 100, 105, 111, 120, 125, 126, 129, 131–133, 137, 149, 150, 152, 159, 165–171, 176, 186, 193, 196–198, 203–212, 221, 223–227, 233, 235, 236, 240, 258, 261–265, 268, 269, 271, 276, 279, 280 Google account, 103, 111, 128 Google Analytics, 103, 188 Google App, 277 Google Books, 5, 129, 131, 211 Google Chrome, 103 Google Custom Search, 125 Google Images, 132 Google Keyword Planner, 180 Google Maps, 32, 121, 198, 262 Google My Business, 262 Google News, 18, 41, 121, 123, 126–128, 131, 186, 276 Google Scholar, 124, 129–132, 254 Google Search Appliance, 166 Google Search Console, 188 Google Shopping, 134, 198, 211 Google Street View, 5, 211 Google Translate, 166 Google Trends, 74, 75 Google updates, 178, 187 Google Video, 27, 121 Google+, 212, 258 GPS, 277 H HITS, 94 Host, 5 HTML, 28, 89, 181, 182 HTTPS, 114 Hybrid search engine, 18 I Image search, 132 Indexer, 42–51 Indexing, 47–48 Informational queries, 63 Instagram, 256 Interactive information retrieval, 241–242 Internet-based research, 217
Inverted index, 43, 45, 46 Invisible web, see Deep web K Keyword selection, 218 Keywords in context, 157 Keyword stuffing, 184 Knowledge graph, 150 L Link-based ranking, 92–97 Link building, 182 Linked Open Data, 278 Locality, 85, 106–111 Lycos, 15, 91 M Market dominance, 210–212 Market share, 167, 168 Metadata, 37, 50 Meta description, 157 Metager, 208, 280 Metasearch engine, 19–20 Microsoft, 62, 84, 129, 167, 196 Mozilla, 168 N Navigational queries, 62 Navigation elements, 153 Neutrality, 270–271 News search, 125–128 O Online advertising, 166, 167 Opaque web, 255 Open Directory, 21, 157 Operator, 76–77 Organic results, 139, 146, 147, 191, 268 Overlap, 204 P PageRank, 93–97 Paid inclusion, 166 Paperball, 125 Paperboy, 125 Personalization, 85, 103, 111–113
Phrase search, 77, 226 Popularity, 34, 85, 91–104, 175 Precision, 235 Private web, 256 Q Query, 2, 63, 65–77, 227, 239 Query formulation, 69–71 Query frequency, 73–74 Query intents, 63, 65 Query length, 71–72 Query modification, 154, 155 Query trends, 74–76 Query type, 61–64, 215 Question-answering site, 21–22 R Ranking, 83–116, 186, 195–197 Ranking factor, 84, 85 Ranking updates, 185, 186 Real-time content, 250 Relevance, 26, 175 Result presentation, 137–161, 207, 267–270 Result quality, 232–234 Result selection, 160, 161 Retrieval effectiveness, 237, 240–242 Robot exclusion, 39–40 Robots.txt, 37, 256 S SEA, see Search engine advertising (SEA) Search engine advertising (SEA), 191 Search engine market, 165–172, 176 Search engine optimization (SEO), 175, 176, 178, 184, 186, 226, 264 Searcher, 51–53 Search options, 155, 208 Search process, 59–60 Search result, 198 Search result quality, 235 Search skills, 215–229 Search suggestion, 69 Search topic, 77 Semantic Web, 53, 278 SEO, see Search engine optimization (SEO) Session, 64–65 Session length, 65
Siri, 168, 276 Snippet, 156–159 Social bookmarking site, 21 Social media, 256–258 Social networking site, 22 Source selection, 217 Spam, 35, 114–115, 185 Spider trap, 35 Spotify, 279 Startpage, 208, 280 T Technical ranking factors, 85, 113–114 Text retrieval conference, see TREC Text statistics, 86, 87, 91 TF*IDF, 89 T-Online, 170, 196 Toxin, 183, 184 Transaction, 152 TREC, 237 Truly invisible web, 256 Trust, 182 U Universal search, 119, 122, 139, 148, 149, 198, 199, 267 Usage data, 60–61 Usage statistics, 97–104 User guidance, 207–208 User interface, 26 User profiles, 208 V Vertical search, 119–135 Vertical search engine, 17–18, 121–123, 133–134, 184, 211, 217 Video search, 133 W Wayback Machine, 124 Webcrawler, 15 Web.de, 170, 196 Web directory, 20–21 Website architecture, 181 Web size, 29 WHOIS, 232
Wikipedia, 35, 43, 92, 115, 206, 216, 234, 241, 279 Wolfram Alpha, 53 X XML sitemap, 37, 39
Y Yahoo, 16, 21, 22, 161, 166, 167, 169, 170, 196 Yahoo Answers, 22 YouTube, 3, 5, 26–28, 32, 133, 149, 167, 176, 198, 211, 262, 268, 271